Seed Watchdog & Self-Healing
This guide covers deploying the self-healing watchdog for Monolythium seed nodes. The watchdog automatically detects failures and recovers nodes via state sync, minimizing downtime.
Overview
The seed watchdog monitors for three failure conditions:
| Trigger | Detection Method | Threshold |
|---|---|---|
| AppHash Mismatch | Log pattern matching | Immediate |
| Stalled Height | Height unchanged over time | 5 minutes (configurable) |
| Behind Canonical | Lag from network tip | 500 blocks (configurable) |
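The stalled-height and lag triggers in the table come down to integer comparisons, and the AppHash trigger to a log search. The sketch below is illustrative only: the function names, argument order, and the log pattern are assumptions, not code from seed-watchdog.sh.

```shell
# Illustrative helpers; names and argument order are hypothetical.

# is_stalled LAST_HEIGHT LAST_CHANGE_EPOCH CUR_HEIGHT NOW_EPOCH STALL_MINUTES
# Succeeds when the height has not advanced for at least STALL_MINUTES.
is_stalled() {
  [ "$3" -le "$1" ] && [ $(( $4 - $2 )) -ge $(( $5 * 60 )) ]
}

# is_lagging LOCAL_HEIGHT CANONICAL_HEIGHT LAG_THRESHOLD
# Succeeds when the node is at least LAG_THRESHOLD blocks behind the tip.
is_lagging() {
  [ $(( $2 - $1 )) -ge "$3" ]
}

# has_apphash_mismatch LOG_FILE
# Succeeds when the log contains an AppHash error line. The pattern is an
# assumption based on the usual CometBFT error text.
has_apphash_mismatch() {
  grep -qF 'wrong Block.Header.AppHash' "$1"
}
```

For example, is_lagging 1000 1500 500 succeeds, while is_lagging 1000 1499 500 does not.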
When any trigger fires, the watchdog:
- Stops the node
- Wipes the data directory (preserves config, keys)
- Configures state sync with current trust height/hash
- Starts the node
- Verifies recovery succeeded
If state sync fails, it falls back to snapshot restore.
When to Use
Use the watchdog for:
- Seed nodes - High availability, rapid recovery
- Archive nodes - Only where state sync is acceptable (a state-synced node no longer holds full history)
- Development nodes - Quick recovery during testing
Do NOT use for:
- Validators - Cannot miss blocks during recovery
- Nodes with unique data - Data wipe would lose it
Quick Start (Systemd)
For systemd-managed seed nodes:
# Clone the watchdog
git clone https://github.com/monolythium/mono-core-peers.git
cd mono-core-peers/watchdog
# Install for your network
sudo ./install-watchdog.sh \
--network sprintnet \
--home /var/lib/monod \
--service sprintnet-seed
# Verify timer is running
sudo systemctl list-timers seed-watchdog.timer
The watchdog runs every 2 minutes via a systemd timer.
Quick Start (Docker)
For Docker-managed seed nodes:
# Clone the watchdog
git clone https://github.com/monolythium/mono-core-peers.git
cd mono-core-peers/watchdog
# Option 1: Install cron-based watchdog on host
sudo ./install-watchdog.sh \
--network sprintnet \
--home /opt/monod \
--docker sprintnet-seed
# Option 2: Run as Docker sidecar
# Edit docker-watchdog-sidecar.yml with your settings
docker-compose -f docker-watchdog-sidecar.yml up -d
Manual Installation
1. Copy Watchdog Script
# Create directory
sudo mkdir -p /opt/monolythium/watchdog
# Download script
sudo curl -fsSL https://raw.githubusercontent.com/monolythium/mono-core-peers/prod/watchdog/seed-watchdog.sh \
-o /opt/monolythium/watchdog/seed-watchdog.sh
sudo chmod +x /opt/monolythium/watchdog/seed-watchdog.sh
2. Create Systemd Service
cat << 'EOF' | sudo tee /etc/systemd/system/seed-watchdog.service
[Unit]
Description=Monolythium Seed Node Watchdog
After=network-online.target
[Service]
Type=oneshot
User=root
Environment="WATCHDOG_STALL_MINUTES=5"
Environment="WATCHDOG_LAG_THRESHOLD=500"
ExecStart=/opt/monolythium/watchdog/seed-watchdog.sh \
--network sprintnet \
--home /var/lib/monod \
--service sprintnet-seed \
--once
StandardOutput=journal
StandardError=journal
SyslogIdentifier=seed-watchdog
[Install]
WantedBy=multi-user.target
EOF
3. Create Systemd Timer
cat << 'EOF' | sudo tee /etc/systemd/system/seed-watchdog.timer
[Unit]
Description=Run seed watchdog health check
[Timer]
OnBootSec=60
OnUnitActiveSec=2min
RandomizedDelaySec=30
Persistent=true
[Install]
WantedBy=timers.target
EOF
4. Enable and Start
sudo systemctl daemon-reload
sudo systemctl enable seed-watchdog.timer
sudo systemctl start seed-watchdog.timer
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| WATCHDOG_STALL_MINUTES | 5 | Minutes without height change before triggering |
| WATCHDOG_LAG_THRESHOLD | 500 | Blocks behind canonical before triggering |
| WATCHDOG_CHECK_INTERVAL | 60 | Seconds between checks (daemon mode) |
| WATCHDOG_STATE_FILE | /tmp/seed-watchdog-state.json | State persistence file |
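A script can pick up these variables with shell default expansion, so that anything left unset falls back to the documented default. This is a minimal sketch of that pattern, not the script's actual code; stall_seconds is a hypothetical derived variable.

```shell
# Apply the documented defaults when a variable is unset or empty.
: "${WATCHDOG_STALL_MINUTES:=5}"
: "${WATCHDOG_LAG_THRESHOLD:=500}"
: "${WATCHDOG_CHECK_INTERVAL:=60}"
: "${WATCHDOG_STATE_FILE:=/tmp/seed-watchdog-state.json}"

# Hypothetical derived value for the stall comparison.
stall_seconds=$(( WATCHDOG_STALL_MINUTES * 60 ))
echo "stall after ${stall_seconds}s, lag threshold ${WATCHDOG_LAG_THRESHOLD} blocks"
```

Values set in the systemd unit with Environment= lines take precedence over these defaults because they are already present in the environment when the script starts.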
Command Line Options
seed-watchdog.sh --help
Required:
--network <name> Network name (sprintnet, testnet, mainnet)
--home <path> Node home directory
Deployment mode (choose one):
--service <name> Systemd service name
--docker <name> Docker container name
Optional:
--rpc-port <port> Local RPC port (default: 26657)
--dry-run Check only, do not trigger recovery
--once Run once and exit
--daemon Run continuously
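The rules above (two required options, exactly one deployment mode) can be checked with a small argument parser. The following is a hypothetical sketch, not the real parser in seed-watchdog.sh; the variable names and the choice of once as the default run mode are assumptions.

```shell
# Hypothetical parser for the documented options.
parse_args() {
  network="" home="" service="" docker="" rpc_port=26657 dry_run=0 mode=once
  while [ $# -gt 0 ]; do
    case $1 in
      --network|--home|--service|--docker|--rpc-port)
        # Options that take a value need a second argument.
        [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 2; }
        case $1 in
          --network)  network=$2 ;;
          --home)     home=$2 ;;
          --service)  service=$2 ;;
          --docker)   docker=$2 ;;
          --rpc-port) rpc_port=$2 ;;
        esac
        shift 2 ;;
      --dry-run) dry_run=1; shift ;;
      --once)    mode=once; shift ;;
      --daemon)  mode=daemon; shift ;;
      *) echo "unknown option: $1" >&2; return 2 ;;
    esac
  done
  # Required options.
  if [ -z "$network" ] || [ -z "$home" ]; then
    echo "--network and --home are required" >&2; return 2
  fi
  # Exactly one deployment mode.
  if [ -n "$service" ] && [ -n "$docker" ]; then
    echo "choose either --service or --docker, not both" >&2; return 2
  fi
  if [ -z "$service" ] && [ -z "$docker" ]; then
    echo "one of --service or --docker is required" >&2; return 2
  fi
}
```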
Monitoring
View Logs
# Systemd
journalctl -u seed-watchdog -f
# Docker (cron-based)
tail -f /var/log/seed-watchdog.log
Check Timer Status
sudo systemctl list-timers seed-watchdog.timer
Manual Health Check
Run a one-time check without triggering recovery:
sudo /opt/monolythium/watchdog/seed-watchdog.sh \
--network sprintnet \
--home /var/lib/monod \
--service sprintnet-seed \
--dry-run \
--once
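To see roughly what a dry run compares, the local and canonical heights can also be read by hand from the standard /status RPC route. This is a sketch assuming curl and jq are installed; the helper names are hypothetical and not part of the watchdog.

```shell
# is_uint VALUE: succeed only for a non-empty string of digits.
is_uint() { case $1 in ''|*[!0-9]*) return 1 ;; esac; }

# height_from_status: read a /status response on stdin, print the height.
height_from_status() {
  jq -r '.result.sync_info.latest_block_height'
}

# rpc_height BASE_URL: fetch /status from an RPC endpoint, print its height.
rpc_height() {
  curl -fsS --max-time 10 "$1/status" | height_from_status
}

# report_lag LOCAL_URL REMOTE_URL: print how far the local node is behind.
report_lag() {
  local_h=$(rpc_height "$1")
  remote_h=$(rpc_height "$2")
  if ! is_uint "$local_h" || ! is_uint "$remote_h"; then
    echo "could not read heights" >&2
    return 1
  fi
  echo "local=$local_h remote=$remote_h lag=$(( remote_h - local_h ))"
}
```

A typical call would be report_lag http://localhost:26657 https://rpc.sprintnet.mononodes.xyz, using the default local RPC port.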
Recovery Flow
When a trigger condition is detected:
[TRIGGER DETECTED]
│
▼
┌─────────────────┐
│ Stop Node │
└─────────────────┘
│
▼
┌─────────────────┐
│ Wipe Data/ │ ← Preserves config/, keys
└─────────────────┘
│
▼
┌─────────────────┐
│ Configure State │ ← Gets trust_height/hash from RPC
│ Sync │
└─────────────────┘
│
▼
┌─────────────────┐
│ Start Node │
└─────────────────┘
│
▼
┌─────────────────┐
│ Verify Recovery │ ← Waits up to 60s for RPC
└─────────────────┘
│
├── Success ──► Done
│
└── Failure ──► Retry (max 2 attempts)
│
└── Still fails ──► Snapshot Fallback
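The flow above can be sketched as a bounded retry loop followed by a fallback. This is an illustration under stated assumptions, not the watchdog's implementation: the step functions (stop_node, wipe_data, configure_state_sync, start_node, verify_recovery, snapshot_fallback) are placeholders the caller must define, and wait_for_rpc is a hypothetical helper for the 60-second RPC wait.

```shell
MAX_ATTEMPTS=2

# wait_for_rpc BASE_URL [TIMEOUT_SECONDS]: poll /status until it answers.
wait_for_rpc() {
  deadline=$(( $(date +%s) + ${2:-60} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    curl -fsS --max-time 5 "$1/status" >/dev/null 2>&1 && return 0
    sleep 2
  done
  return 1
}

# recover_node: try state sync up to MAX_ATTEMPTS times, then fall back.
recover_node() {
  attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    echo "state sync attempt $attempt/$MAX_ATTEMPTS"
    if stop_node && wipe_data && configure_state_sync &&
       start_node && verify_recovery; then
      echo "recovery succeeded"
      return 0
    fi
    attempt=$(( attempt + 1 ))
  done
  echo "state sync failed, falling back to snapshot"
  snapshot_fallback
}
```

Keeping each step as its own function makes it easy to replace the real actions with echo statements when testing the control flow.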
State Sync Requirements
For state sync to work, your network must have:
- Archive nodes with RPC accessible and snapshots enabled
- RPC endpoints configured in the network registry
- Recent snapshots available (within unbonding period)
The watchdog fetches trust height and hash from the network's RPC endpoints.
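As a rough illustration of that step, the trust height and hash can be read from the standard /block RPC route and written into the [statesync] section of config.toml. The helper names and the 2000-block offset below are assumptions, not values taken from the watchdog; curl and jq are assumed to be installed.

```shell
# Hypothetical helpers; names and the default offset are assumptions.
is_uint() { case $1 in ''|*[!0-9]*) return 1 ;; esac; }

# trust_height LATEST OFFSET: a height OFFSET blocks behind the tip (min 1).
trust_height() {
  h=$(( $1 - $2 ))
  [ "$h" -ge 1 ] || h=1
  echo "$h"
}

# fetch_trust_params RPC_URL [OFFSET]: print "HEIGHT HASH" for the trust block.
fetch_trust_params() {
  rpc=$1 offset=${2:-2000}
  latest=$(curl -fsS --max-time 10 "$rpc/block" |
    jq -r '.result.block.header.height')
  is_uint "$latest" || { echo "no height from $rpc" >&2; return 1; }
  height=$(trust_height "$latest" "$offset")
  hash=$(curl -fsS --max-time 10 "$rpc/block?height=$height" |
    jq -r '.result.block_id.hash')
  if [ -z "$hash" ] || [ "$hash" = "null" ]; then return 1; fi
  echo "$height $hash"
}

# write_statesync_config CONFIG_TOML RPC_SERVERS HEIGHT HASH
# Rewrites only keys inside the [statesync] section. RPC_SERVERS is a
# comma-separated list; CometBFT expects at least two entries.
write_statesync_config() {
  cfg=$1 servers=$2 height=$3 hash=$4
  range='/^\[statesync\]/,/^\[/'
  tmp=$(mktemp) || return 1
  if sed \
      -e "$range s|^enable *=.*|enable = true|" \
      -e "$range s|^rpc_servers *=.*|rpc_servers = \"$servers\"|" \
      -e "$range s|^trust_height *=.*|trust_height = $height|" \
      -e "$range s|^trust_hash *=.*|trust_hash = \"$hash\"|" \
      "$cfg" > "$tmp"; then
    cat "$tmp" > "$cfg"   # overwrite in place, keeping owner and mode
    rc=$?
  else
    rc=1
  fi
  rm -f "$tmp"
  return "$rc"
}
```

Restricting the sed range to the [statesync] section matters because other sections of config.toml may also contain a key named enable.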
Snapshot Fallback
If state sync fails after 2 attempts, the watchdog attempts to restore from a snapshot:
- Downloads the snapshot from snapshot_url in the network registry
- Extracts it to the node home
- Starts the node
Ensure your network registry has a valid snapshot_url:
{
"snapshot_url": "https://snapshots.sprintnet.mononodes.xyz/latest.tar.lz4"
}
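A manual equivalent of the download-and-extract step might look like the sketch below. It assumes curl, lz4, and tar are installed and that the archive unpacks relative to the node home; the function names are hypothetical and not taken from seed-watchdog.sh.

```shell
# is_valid_snapshot_url URL: accept only non-empty HTTPS URLs. jq -r prints
# the literal string "null" for a missing field, which is rejected here.
is_valid_snapshot_url() {
  case $1 in
    https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# restore_snapshot URL NODE_HOME: stream the archive through lz4 into tar.
# The subshell enables pipefail (bash) so a failed download is not masked
# by a successful exit status from tar.
restore_snapshot() {
  is_valid_snapshot_url "$1" || { echo "invalid snapshot URL: '$1'" >&2; return 1; }
  (
    set -o pipefail
    curl -fL --retry 3 "$1" | lz4 -dc | tar -xf - -C "$2"
  )
}
```

Streaming the download straight into tar avoids needing free disk space for both the compressed archive and the extracted data. The node should be stopped before restoring.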
Troubleshooting
Watchdog Not Triggering
- Check that the timer is active:
sudo systemctl status seed-watchdog.timer
- Check for errors in the service:
journalctl -u seed-watchdog --since "1 hour ago"
- Run manually with verbose output:
sudo bash -x /opt/monolythium/watchdog/seed-watchdog.sh \
--network sprintnet --home /var/lib/monod --service sprintnet-seed --once
State Sync Failing
- Verify that the RPC endpoints are accessible:
curl -s https://rpc.sprintnet.mononodes.xyz/status | jq .result.sync_info
- Check whether snapshots are enabled on the RPC nodes:
curl -s https://rpc.sprintnet.mononodes.xyz/abci_info | jq .
- Try the snapshot fallback manually:
SNAPSHOT_URL=$(curl -s https://raw.githubusercontent.com/monolythium/mono-core-peers/prod/networks/sprintnet/peers.json | jq -r .snapshot_url)
curl -sL "$SNAPSHOT_URL" -o /tmp/snapshot.tar.lz4
Recovery Loop
If the node keeps recovering:
- Check validator connectivity:
# Get persistent peers from registry
curl -s https://raw.githubusercontent.com/monolythium/mono-core-peers/prod/networks/sprintnet/peers.json | jq -r '.persistent_peers[]'
- Verify that the firewall allows P2P:
sudo ufw status | grep 26656
- Check that the node identity wasn't corrupted:
# View node ID
monod tendermint show-node-id --home /var/lib/monod
Security Considerations
- The watchdog runs as root to manage services and wipe data
- It only deletes the data/ directory, never config/ or keyring-*/
- RPC calls are made to configured endpoints only
- No external commands are executed from fetched data
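Because the wipe runs as root, it is worth guarding against a bad or empty home path. The following is a hypothetical sketch of such a guard, not the watchdog's actual code: it refuses anything that is not an absolute path containing a config/ directory, and removes only data/.

```shell
# wipe_data_dir NODE_HOME: delete only NODE_HOME/data and recreate it empty.
wipe_data_dir() {
  node_home=${1%/}   # drop one trailing slash; "/" becomes empty
  case $node_home in
    /?*) ;;          # absolute path with at least one component
    *) echo "refusing to wipe: invalid home '$1'" >&2; return 1 ;;
  esac
  # A node home is expected to contain config/; refuse anything else.
  if [ ! -d "$node_home/config" ]; then
    echo "refusing to wipe: $node_home/config not found" >&2
    return 1
  fi
  rm -rf -- "$node_home/data"
  mkdir -p "$node_home/data"
}
```

With these checks, an empty or unset variable passed as the home path cannot expand to a top-level /data directory.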
See Also
- Seed Node Deployment - Initial seed setup
- Seeds and Peers - Understanding peer types
- Monitoring - Setting up observability