The Ops Layer That Keeps Your OpenClaw Agents Alive

Here’s what nobody tells you about running OpenClaw agents: the agents aren’t the hard part.

The gateway goes down. An update resets a config field you didn’t know existed. Your exec approval rule looks right but some named agent entry is silently shadowing the wildcard, so your agents are stuck waiting for /approve all night. You find out because the agents have gone quiet — not because anything announced itself.

I built openclaw-ops to handle this. It’s the operations layer I install in every OpenClaw environment I run. Health checks, auto-repair scripts, a watchdog that restarts the gateway if it dies at 3am, and update triage that tells you exactly what broke after a version bump.

Here’s what it actually solves.

The Four Things That Kill Your OpenClaw Agents (Usually Overnight)

1. The Gateway Goes Down

This one’s obvious once it happens — nothing works. But the why varies:

Port conflict blocking startup
auth: "none" was removed in v2026.1.29, so upgrading killed the gateway immediately with no useful error
Discord WebSocket disconnects that left a stuck typing indicator (v2026.2.24)

The watchdog script handles this. It pings the gateway every 5 minutes and restarts it if it’s down. If three restarts fail in 15 minutes, it fires a macOS notification and stops trying. You can’t auto-fix everything, but you can stop silent loops and know when to intervene.
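The "three failures in 15 minutes" rule can be sketched in a few lines of shell. This is a minimal illustration, not the code inside watchdog.sh: the function names, the state-file format (one epoch timestamp per line), and the two variables are all assumptions made for the example.

```bash
# Hypothetical failure-window tracker; none of these names come from openclaw-ops.
WINDOW_SECONDS=900   # 15 minutes
MAX_FAILURES=3

# Append one failure timestamp (epoch seconds) to the state file.
record_failure() {
  echo "$2" >> "$1"
}

# Print how many recorded failures fall inside the window ending at "now".
recent_failures() {
  [ -f "$1" ] || { echo 0; return; }
  awk -v now="$2" -v win="$WINDOW_SECONDS" \
    '$1 >= now - win { n++ } END { print n + 0 }' "$1"
}

# Exit 0 when it is time to stop restarting and notify a human instead.
should_escalate() {
  [ "$(recent_failures "$1" "$2")" -ge "$MAX_FAILURES" ]
}
```

A watchdog loop built this way would call record_failure with the current `date +%s` after each failed restart, then switch from restarting to notifying once should_escalate returns 0. Old failures age out of the window on their own, so a gateway that fails once a day never trips the limit.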

2. Exec Approvals Break After Updates

This is the most common post-update breakage, and it’s subtle.

You set a wildcard rule that should allow all exec commands. An update resets tools.exec.ask and tools.exec.security to defaults. Your agents start sending /approve requests for every command. But here’s the part that gets people: named agent entries with empty allowlists silently shadow the * wildcard. So even after you fix the global rule, specific agents keep getting blocked.

Both layers have to be correct. The heal script checks both and fixes them.
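To make the shadowing problem concrete, here is a sketch of the second-layer check. The JSON layout is invented for this example and is not the real OpenClaw approvals format, so treat the key names (agents, allowlist) as placeholders. It also assumes jq is installed.

```bash
# Hypothetical approvals layout, invented for illustration only:
#   { "agents": { "*": { "allowlist": ["*"] },
#                 "researcher": { "allowlist": [] } } }
#
# Prints the name of every non-wildcard agent whose allowlist is missing
# or empty, i.e. the entries that would shadow the "*" rule.
shadowing_agents() {
  jq -r '
    (.agents // {})
    | to_entries[]
    | select(.key != "*")
    | select(((.value.allowlist // []) | length) == 0)
    | .key
  ' "$1"
}
```

If a check like this prints anything, fixing the global wildcard alone will not unblock those agents. Each listed entry needs either a real allowlist or removal so the wildcard applies again.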

3. Cron Jobs Go Silent

Cron jobs auto-disable after consecutive errors. There’s no notification. You just notice your scheduled agents stopped doing their thing, and it might be days before you catch it.
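Since a disabled job announces nothing, one generic way to notice is to check whether the job has left any recent trace. The sketch below assumes each scheduled job writes to its own log file. That is an assumption for this example, not documented OpenClaw behavior, and is_stale is a hypothetical helper name.

```bash
# Exit 0 if the file is missing, or was last modified more than
# max_age_min minutes ago. Exit 1 if it has been touched recently.
is_stale() {
  local file="$1" max_age_min="$2"
  [ -e "$file" ] || return 0
  [ -n "$(find "$file" -mmin +"$max_age_min" -print 2>/dev/null)" ]
}
```

Run from your own cron entry, something like `is_stale /path/to/job.log 1440 && echo "job has been silent for over a day"` turns days of silence into a same-day alert. The log path here is a placeholder.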

4. Session Files Bloat Past 10MB

Agents in a rapid-fire loop can push session files past 10MB. At that point they appear to be running but produce nothing: 0 tokens, empty content, endless spinning. The heal script identifies and clears dead sessions.
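If you want to look before anything gets cleared, a read-only check is enough. This sketch only lists oversized files and deletes nothing. The function name is hypothetical, and the directory is passed in as an argument because this post does not say where session files live.

```bash
# List regular files larger than 10 MiB (10240 KiB) under the given directory.
large_session_files() {
  find "$1" -type f -size +10240k -print
}
```

Point it at wherever your install keeps sessions, review the output, then let heal.sh do the cleanup.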

What’s in the Repo

Scripts you run from your shell:

heal.sh — one-shot auto-fix for the most common gateway issues. Run it first whenever something feels wrong.
check-update.sh — detects version changes and explains what config broke and why. Run this after every OpenClaw update.
watchdog.sh — runs every 5 minutes, restarts gateway if down, escalates after 3 failures.
watchdog-install.sh — installs the watchdog as a macOS LaunchAgent so it survives reboots.
health-check.sh — declarative URL/process health checks for gateway-adjacent dependencies.
security-scan.sh — config hardening and credential exposure scan with redacted findings.
skill-audit.sh — static audit for third-party skills before you install them from ClawHub.

As a Claude skill:

Load /openclaw-ops and your AI does the triage: it checks gateway health, auth, exec approvals, cron jobs, channels, and sessions, then explains what’s broken and fixes it.

Security Stuff Worth Knowing

Running security-scan.sh scores your config hardening from 0 to 100 and lists specific fixes. It also checks for:

config.get leaking unredacted secrets via sourceConfig
Credential patterns leaked into files under ~/.openclaw/, or files there with the wrong permissions
Third-party ClawHub skills with hardcoded secrets, suspicious network calls, or prompt injection

skill-audit.sh catches the third-party skill issues. Run it before you pull anything from ClawHub.

On the update side: if you’re running a version of OpenClaw older than v2026.2.12, upgrade now. That release fixed CVE-2026-25253 (one-click RCE via token leakage) plus 40+ SSRF, path traversal, and prompt injection issues. Running check-update.sh with the --fix flag handles the config migration.
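Comparing these version strings as plain text goes wrong quickly, since "2.9" sorts after "2.12" alphabetically. Here is a small sketch that compares them numerically, assuming the vYYYY.M.P format used in this post. version_lt is a hypothetical helper, and how you obtain the installed version is left to you.

```bash
# Exit 0 if version $1 is strictly lower than version $2.
# Assumes three dot-separated numeric fields with an optional leading "v".
version_lt() {
  local a="${1#v}" b="${2#v}" x y i
  for i in 1 2 3; do
    x=$(printf '%s' "$a" | cut -d. -f"$i")
    y=$(printf '%s' "$b" | cut -d. -f"$i")
    x=${x:-0}; y=${y:-0}
    [ "$x" -lt "$y" ] && return 0
    [ "$x" -gt "$y" ] && return 1
  done
  return 1   # versions are equal
}
```

With the installed version in a variable, `version_lt "$installed" v2026.2.12 && echo "upgrade now"` gives you a one-line gate for scripts.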

Setup

Requires OpenClaw v2026.2.12 or later.

# Clone into your skills folder
git clone https://github.com/cathrynlavery/openclaw-ops.git ~/.openclaw/skills/openclaw-ops

cd ~/.openclaw/skills/openclaw-ops

# Fix whatever is currently broken
bash scripts/heal.sh

# Install the always-on watchdog (macOS)
bash scripts/watchdog-install.sh

# Check if a recent update broke your config
bash scripts/check-update.sh

On Linux, run the watchdog from cron instead of a LaunchAgent:

*/5 * * * * bash /path/to/scripts/watchdog.sh >> ~/.openclaw/logs/watchdog.log 2>&1

The Watchdog Escalation Tiers

I set it up in three layers:

Tier 1 — HTTP ping every 5 minutes via LaunchAgent
Tier 2 — Gateway restart + heal.sh if simple restart fails
Tier 3 — macOS notification after 3 failed attempts in 15 minutes; requires manual intervention

The goal is: the system handles what it can, tells you what it can’t, and doesn’t spam you with false alarms. Most overnight failures resolve at Tier 2.
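The ordering of the first two tiers can be sketched as a single function. This is an illustration of the flow described above, not the contents of watchdog.sh. The function name and result strings are hypothetical, and the three commands are passed in as placeholders because the real ping and restart commands are not shown in this post.

```bash
# Usage: watchdog_tick PING_CMD RESTART_CMD HEAL_CMD
# Each argument is a command string run via `sh -c`.
# Prints which step resolved the problem; exits 1 if none did.
watchdog_tick() {
  local ping_cmd="$1" restart_cmd="$2" heal_cmd="$3"

  # Tier 1: the gateway answers, nothing to do.
  if sh -c "$ping_cmd"; then echo "healthy"; return 0; fi

  # Tier 2a: simple restart, then re-check.
  if sh -c "$restart_cmd" && sh -c "$ping_cmd"; then
    echo "recovered:restart"; return 0
  fi

  # Tier 2b: the restart alone did not help, so run the heal step and re-check.
  if sh -c "$heal_cmd" && sh -c "$ping_cmd"; then
    echo "recovered:heal"; return 0
  fi

  # Neither worked: the caller counts this toward the Tier 3 limit.
  echo "needs-attention"; return 1
}
```

Keeping the commands as arguments makes the flow easy to dry-run with `true` and `false` before wiring in anything that touches a live gateway.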

One Note on the Health Checks

If you run health-check.sh right after an OpenClaw update or gateway restart, it can fail immediately — some process targets require minimum uptime (like 300 seconds) before reporting healthy. That’s expected. Lower the threshold during smoke tests, then restore it once you’re in steady-state.
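To see why a freshly restarted process fails a minimum-uptime check, it helps to look at the comparison itself. This is a sketch under stated assumptions: the function names and the MIN_UPTIME_SECONDS variable are hypothetical, and this post does not show how health-check.sh actually configures its threshold. The input format is the [[dd-]hh:]mm:ss string printed by `ps -o etime=`.

```bash
# Convert a ps-style elapsed time ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
  local t="${1##* }" days=0 h=0 m s   # ${1##* } drops leading spaces from ps
  case "$t" in *-*)   days=${t%%-*}; t=${t#*-} ;; esac
  case "$t" in *:*:*) h=${t%%:*};    t=${t#*:} ;; esac
  m=${t%%:*}
  s=${t#*:}
  # Drop one leading zero so "08" is not rejected as an invalid octal number.
  h=${h#0}; m=${m#0}; s=${s#0}
  echo $(( days * 86400 + ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
}

# Exit 0 once uptime reaches MIN_UPTIME_SECONDS (default 300).
uptime_ok() {
  [ "$(etime_to_seconds "$1")" -ge "${MIN_UPTIME_SECONDS:-300}" ]
}
```

A process that restarted 30 seconds ago fails `uptime_ok "$(ps -o etime= -p "$pid")"` with the default threshold and passes with MIN_UPTIME_SECONDS=10, which is exactly the smoke-test situation described above.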

This is the ops layer I wish someone had handed me when I started running OpenClaw agents. It doesn’t make your agents smarter. It keeps them running.

GitHub: cathrynlavery/openclaw-ops