
Seed Watchdog & Self-Healing

This guide covers deploying the self-healing watchdog for Monolythium seed nodes. The watchdog automatically detects failures and recovers nodes via state sync, minimizing downtime.

Overview

The seed watchdog monitors for three failure conditions:

| Trigger | Detection Method | Threshold |
| --- | --- | --- |
| AppHash Mismatch | Log pattern matching | Immediate |
| Stalled Height | Height unchanged over time | 5 minutes (configurable) |
| Behind Canonical | Lag from network tip | 500 blocks (configurable) |
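
The three triggers can be sketched as small shell predicates. This is a minimal illustration, not the watchdog's actual code: the function names are hypothetical, the `apphash` log pattern is an assumed example, whether each comparison is strict is a guess, and the thresholds mirror the documented defaults.

```shell
# Illustrative sketch only; function names and the log pattern are assumptions.

STALL_MINUTES="${WATCHDOG_STALL_MINUTES:-5}"    # documented default
LAG_THRESHOLD="${WATCHDOG_LAG_THRESHOLD:-500}"  # documented default

# Succeeds when the log text on stdin mentions an AppHash problem.
has_apphash_mismatch() {
  grep -qi 'apphash'
}

# is_stalled LAST_CHANGE_EPOCH NOW_EPOCH
# Succeeds when the height has not changed for at least STALL_MINUTES.
is_stalled() {
  [ $(( $2 - $1 )) -ge $(( STALL_MINUTES * 60 )) ]
}

# is_lagging LOCAL_HEIGHT CANONICAL_HEIGHT
# Succeeds when the node is more than LAG_THRESHOLD blocks behind the tip.
is_lagging() {
  [ $(( $2 - $1 )) -gt "$LAG_THRESHOLD" ]
}
```

For example, `is_lagging 1000 1501` succeeds with the default threshold, while `is_lagging 1000 1500` does not.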

When any trigger fires, the watchdog:

  1. Stops the node
  2. Wipes the data directory (preserves config, keys)
  3. Configures state sync with current trust height/hash
  4. Starts the node
  5. Verifies recovery succeeded

If state sync still fails after two attempts, the watchdog falls back to restoring from a snapshot.

When to Use

Use the watchdog for:

  • Seed nodes - High availability, rapid recovery
  • Archive nodes - Where state sync is acceptable
  • Development nodes - Quick recovery during testing

Do NOT use for:

  • Validators - Cannot miss blocks during recovery
  • Nodes with unique data - Data wipe would lose it

Quick Start (Systemd)

For systemd-managed seed nodes:

# Clone the watchdog
git clone https://github.com/monolythium/mono-core-peers.git
cd mono-core-peers/watchdog

# Install for your network
sudo ./install-watchdog.sh \
  --network sprintnet \
  --home /var/lib/monod \
  --service sprintnet-seed

# Verify timer is running
sudo systemctl list-timers seed-watchdog.timer

The watchdog runs every 2 minutes via systemd timer.

Quick Start (Docker)

For Docker-managed seed nodes:

# Clone the watchdog
git clone https://github.com/monolythium/mono-core-peers.git
cd mono-core-peers/watchdog

# Option 1: Install cron-based watchdog on host
sudo ./install-watchdog.sh \
  --network sprintnet \
  --home /opt/monod \
  --docker sprintnet-seed

# Option 2: Run as Docker sidecar
# Edit docker-watchdog-sidecar.yml with your settings
docker-compose -f docker-watchdog-sidecar.yml up -d

Manual Installation

1. Copy Watchdog Script

# Create directory
sudo mkdir -p /opt/monolythium/watchdog

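# Download script (the directory is root-owned, so write and chmod with sudo;
# -f makes curl fail on HTTP errors instead of saving an error page)
sudo curl -fsSL https://raw.githubusercontent.com/monolythium/mono-core-peers/prod/watchdog/seed-watchdog.sh \
  -o /opt/monolythium/watchdog/seed-watchdog.sh
sudo chmod +x /opt/monolythium/watchdog/seed-watchdog.sh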

2. Create Systemd Service

cat << 'EOF' | sudo tee /etc/systemd/system/seed-watchdog.service
[Unit]
Description=Monolythium Seed Node Watchdog
After=network-online.target

[Service]
Type=oneshot
User=root
Environment="WATCHDOG_STALL_MINUTES=5"
Environment="WATCHDOG_LAG_THRESHOLD=500"
ExecStart=/opt/monolythium/watchdog/seed-watchdog.sh \
--network sprintnet \
--home /var/lib/monod \
--service sprintnet-seed \
--once
StandardOutput=journal
StandardError=journal
SyslogIdentifier=seed-watchdog

[Install]
WantedBy=multi-user.target
EOF

3. Create Systemd Timer

cat << 'EOF' | sudo tee /etc/systemd/system/seed-watchdog.timer
[Unit]
Description=Run seed watchdog health check

[Timer]
OnBootSec=60
OnUnitActiveSec=2min
RandomizedDelaySec=30
Persistent=true

[Install]
WantedBy=timers.target
EOF

4. Enable and Start

sudo systemctl daemon-reload
sudo systemctl enable seed-watchdog.timer
sudo systemctl start seed-watchdog.timer

Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| WATCHDOG_STALL_MINUTES | 5 | Minutes without height change before triggering |
| WATCHDOG_LAG_THRESHOLD | 500 | Blocks behind canonical before triggering |
| WATCHDOG_CHECK_INTERVAL | 60 | Seconds between checks (daemon mode) |
| WATCHDOG_STATE_FILE | /tmp/seed-watchdog-state.json | State persistence file |
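
One way to change these values without editing the unit file is a systemd drop-in. The sketch below is an illustration under assumptions: `render_override` is a hypothetical helper, the values 10 and 1000 are arbitrary examples, and the drop-in path follows the standard systemd layout for a unit named `seed-watchdog.service`.

```shell
# render_override STALL_MINUTES LAG_THRESHOLD
# Prints a systemd drop-in that overrides the two threshold variables.
# (Hypothetical helper, not part of the watchdog.)
render_override() {
  printf '[Service]\n'
  printf 'Environment="WATCHDOG_STALL_MINUTES=%s"\n' "$1"
  printf 'Environment="WATCHDOG_LAG_THRESHOLD=%s"\n' "$2"
}

# Example usage (run as root):
#   mkdir -p /etc/systemd/system/seed-watchdog.service.d
#   render_override 10 1000 > /etc/systemd/system/seed-watchdog.service.d/override.conf
#   systemctl daemon-reload
```

Later `Environment=` assignments override earlier ones for the same variable, so the drop-in takes precedence over the values in the main unit file.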

Command Line Options

seed-watchdog.sh --help

Required:
  --network <name>    Network name (sprintnet, testnet, mainnet)
  --home <path>       Node home directory

Deployment mode (choose one):
  --service <name>    Systemd service name
  --docker <name>     Docker container name

Optional:
  --rpc-port <port>   Local RPC port (default: 26657)
  --dry-run           Check only, do not trigger recovery
  --once              Run once and exit
  --daemon            Run continuously

Monitoring

View Logs

# Systemd
journalctl -u seed-watchdog -f

# Docker (cron-based)
tail -f /var/log/seed-watchdog.log

Check Timer Status

sudo systemctl list-timers seed-watchdog.timer

Manual Health Check

Run a one-time check without triggering recovery:

sudo /opt/monolythium/watchdog/seed-watchdog.sh \
  --network sprintnet \
  --home /var/lib/monod \
  --service sprintnet-seed \
  --dry-run \
  --once

Recovery Flow

When a trigger condition is detected:

[TRIGGER DETECTED]
        │
        ▼
┌─────────────────┐
│    Stop Node    │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│   Wipe Data/    │ ← Preserves config/, keys
└─────────────────┘
        │
        ▼
┌─────────────────┐
│ Configure State │ ← Gets trust_height/hash from RPC
│      Sync       │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│   Start Node    │
└─────────────────┘
        │
        ▼
┌─────────────────┐
│ Verify Recovery │ ← Waits up to 60s for RPC
└─────────────────┘
        │
        ├── Success ──► Done
        │
        └── Failure ──► Retry (max 2 attempts)
                          │
                          └── Still fails ──► Snapshot Fallback
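
The retry-then-fallback control flow above can be sketched as follows. This is not the watchdog's source: `recover`, `try_state_sync`, and `restore_snapshot` are hypothetical names, and the two placeholder functions stand in for the real stop/wipe/configure/start/verify and snapshot steps.

```shell
MAX_ATTEMPTS=2   # matches the retry limit in the flow above

# Placeholders (assumptions): replace with the real recovery steps.
try_state_sync()   { return 1; }
restore_snapshot() { return 1; }

# recover: attempt state sync up to MAX_ATTEMPTS times, then fall back to a snapshot.
recover() {
  attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    if try_state_sync; then
      echo "recovered via state sync (attempt $attempt)"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  if restore_snapshot; then
    echo "recovered via snapshot fallback"
    return 0
  fi
  echo "recovery failed" >&2
  return 1
}
```

Keeping the retry limit in one variable makes it easy to see that the snapshot path is reached only after every state sync attempt has failed.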

State Sync Requirements

For state sync to work, your network must have:

  1. Archive nodes with RPC accessible and snapshots enabled
  2. RPC endpoints configured in the network registry
  3. Recent snapshots available (within unbonding period)

The watchdog fetches trust height and hash from the network's RPC endpoints.
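
As a rough sketch of how a trust height and hash can be obtained from a CometBFT-style RPC endpoint, the snippet below queries `/block` with `curl` and `jq`. The helper names and the `TRUST_OFFSET` value are assumptions for illustration, and the RPC URL is the one used in the troubleshooting examples; the watchdog's own selection logic may differ.

```shell
# Assumptions: TRUST_OFFSET is an illustrative value, not a documented setting.
# Requires curl and jq.
RPC="${RPC:-https://rpc.sprintnet.mononodes.xyz}"
TRUST_OFFSET="${TRUST_OFFSET:-2000}"

# pick_trust_height LATEST_HEIGHT
# Steps back TRUST_OFFSET blocks from the tip, never below height 1.
pick_trust_height() {
  height=$(( $1 - TRUST_OFFSET ))
  if [ "$height" -lt 1 ]; then height=1; fi
  echo "$height"
}

# fetch_trust_params: prints "HEIGHT HASH" for trust_height / trust_hash.
fetch_trust_params() {
  latest=$(curl -fsS "$RPC/block" | jq -r '.result.block.header.height')
  case "$latest" in
    ''|*[!0-9]*) echo "could not read latest height from $RPC" >&2; return 1 ;;
  esac
  trust_height=$(pick_trust_height "$latest")
  trust_hash=$(curl -fsS "$RPC/block?height=$trust_height" | jq -r '.result.block_id.hash')
  if [ -z "$trust_hash" ] || [ "$trust_hash" = "null" ]; then
    echo "could not read block hash at height $trust_height" >&2
    return 1
  fi
  echo "$trust_height $trust_hash"
}
```

Choosing a height some distance behind the tip, rather than the tip itself, gives the block time to be served consistently by every RPC endpoint the node consults.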

Snapshot Fallback

If state sync fails after 2 attempts, the watchdog attempts to restore from a snapshot:

  1. Downloads snapshot from snapshot_url in network registry
  2. Extracts to node home
  3. Starts node

Ensure your network registry has a valid snapshot_url:

{
  "snapshot_url": "https://snapshots.sprintnet.mononodes.xyz/latest.tar.lz4"
}
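
The download-and-extract steps might look like the sketch below. It rests on several assumptions: both function names are hypothetical, the archive is assumed to contain paths relative to the node home, and the decompressor is chosen from the file extension (the example URL above ends in `.tar.lz4`, which needs the `lz4` tool installed).

```shell
# decompressor_for FILE: prints the command that decompresses FILE to stdout.
decompressor_for() {
  case "$1" in
    *.tar.lz4)       echo "lz4 -dc" ;;
    *.tar.gz|*.tgz)  echo "gzip -dc" ;;
    *.tar.zst)       echo "zstd -dc" ;;
    *.tar)           echo "cat" ;;
    *)               return 1 ;;
  esac
}

# restore_snapshot URL NODE_HOME
# Downloads the archive to /tmp and unpacks it into NODE_HOME.
# Stop the node before running this and start it again afterwards.
restore_snapshot() {
  url=$1
  node_home=$2
  archive="/tmp/${url##*/}"
  decompress=$(decompressor_for "$archive") || {
    echo "unknown archive type: $archive" >&2
    return 1
  }
  curl -fL "$url" -o "$archive" || return 1
  # $decompress is intentionally unquoted so "lz4 -dc" splits into command and flag.
  $decompress "$archive" | tar -xf - -C "$node_home"
}
```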

Troubleshooting

Watchdog Not Triggering

  1. Check timer is active:

    sudo systemctl status seed-watchdog.timer
  2. Check for errors in service:

    journalctl -u seed-watchdog --since "1 hour ago"
  3. Run manually with verbose output:

    sudo bash -x /opt/monolythium/watchdog/seed-watchdog.sh \
    --network sprintnet --home /var/lib/monod --service sprintnet-seed --once

State Sync Failing

  1. Verify RPC endpoints are accessible:

    curl -s https://rpc.sprintnet.mononodes.xyz/status | jq .result.sync_info
  2. Check if snapshots are enabled on RPC nodes:

    curl -s https://rpc.sprintnet.mononodes.xyz/abci_info | jq .
  3. Try snapshot fallback manually:

    SNAPSHOT_URL="$(curl -fsSL https://raw.githubusercontent.com/monolythium/mono-core-peers/prod/networks/sprintnet/peers.json | jq -r .snapshot_url)"
    curl -fL "$SNAPSHOT_URL" -o /tmp/snapshot.tar.lz4

Recovery Loop

If the node keeps recovering:

  1. Check validator connectivity:

    # Get persistent peers from registry
    curl -s https://raw.githubusercontent.com/monolythium/mono-core-peers/prod/networks/sprintnet/peers.json | jq -r '.persistent_peers[]'
  2. Verify firewall allows P2P:

    sudo ufw status | grep 26656
  3. Check node identity wasn't corrupted:

    # View node ID
    monod tendermint show-node-id --home /var/lib/monod

Security Considerations

  • The watchdog runs as root to manage services and wipe data
  • It only deletes the data/ directory, never config/ or keyring-*/
  • RPC calls are made to configured endpoints only
  • No external commands are executed from fetched data
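
A guarded wipe that touches only `data/` could be sketched like this. `wipe_data` is a hypothetical helper, not the watchdog's implementation. It also keeps `data/priv_validator_state.json`, on the assumption that CometBFT-based nodes generally expect that file at startup; this guide does not say whether the watchdog does the same.

```shell
# wipe_data NODE_HOME
# Empties NODE_HOME/data while leaving config/ and keyring-*/ untouched.
wipe_data() {
  node_home=$1
  if [ -z "$node_home" ] || [ ! -d "$node_home/data" ]; then
    echo "refusing to wipe: no data/ directory under '$node_home'" >&2
    return 1
  fi
  state="$node_home/data/priv_validator_state.json"
  backup=""
  if [ -f "$state" ]; then
    # Set the file aside so it survives the wipe (assumption, see above).
    backup=$(mktemp) && cp -- "$state" "$backup"
  fi
  rm -rf -- "$node_home/data"
  mkdir -p -- "$node_home/data"
  if [ -n "$backup" ]; then
    mv -- "$backup" "$state"
  fi
}
```

Refusing to run when the argument is empty or has no `data/` subdirectory guards against the classic mistake of expanding an unset variable into `rm -rf /data`.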

See Also