Monitoring

This guide covers monitoring your Monolythium node for health and performance.

Key Metrics

Node Health

Metric | What to Monitor
Process running | Is the monod process alive?
Block height | Is it increasing?
Sync status | Is the node caught up?
Peer count | Are peers connected?
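
As a rough illustration of the "is block height increasing" check, the sketch below compares the current height with the value saved by the previous run. The function name `check_progress` and the state file location are hypothetical, chosen only for this example.

```bash
#!/bin/sh
# Compare a freshly observed height with the one saved by the previous run.
# Prints "first-run", "advancing", or "stalled", then saves the new height.
check_progress() {
    current="$1"
    state_file="$2"
    previous=""
    if [ -f "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    echo "$current" > "$state_file"

    if [ -z "$previous" ]; then
        echo "first-run"
    elif [ "$current" -gt "$previous" ]; then
        echo "advancing"
    else
        echo "stalled"
    fi
}
```

One way to feed it, assuming the RPC listens on localhost:26657 and jq is installed: `check_progress "$(curl -s localhost:26657/status | jq -r '.result.sync_info.latest_block_height')" /tmp/monod-height`. Run it on a schedule that is longer than the chain's block time, otherwise two consecutive runs can legitimately see the same height.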

System Resources

Metric | Alert Threshold
CPU usage | > 80% sustained
Memory usage | > 90%
Disk usage | > 80%
Disk I/O | High sustained
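
The disk threshold above can be checked with standard tools. This is a minimal sketch, assuming a POSIX `df`; the `DATA_DIR` variable and both function names are hypothetical, and `DATA_DIR` should point at the volume that holds your chain data.

```bash
#!/bin/sh
# Print the used-space percentage (integer, no % sign) for the filesystem
# that holds the given path. `df -P` keeps each filesystem on one line.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit status 0) when the value is strictly above the threshold.
exceeds() {
    [ "$1" -gt "$2" ]
}

# DATA_DIR is an assumption; defaults to the root filesystem.
DATA_DIR="${DATA_DIR:-/}"
USED=$(disk_usage_pct "$DATA_DIR")
if exceeds "$USED" 80; then
    echo "WARNING: disk usage at ${USED}% on $DATA_DIR"
else
    echo "OK: disk usage at ${USED}% on $DATA_DIR"
fi
```

The same `exceeds` helper works for any integer percentage, so CPU or memory readings from other tools can reuse it with the 80 and 90 thresholds from the table.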

Validator-Specific

Metric | What to Monitor
Missed blocks | Any misses
Signing status | Pre-commits signed
Jailed status | Not jailed

Prometheus Metrics

Monod exposes Prometheus metrics when enabled.

Enable Metrics

Edit config.toml:

[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"

Key Metrics

Metric | Description
tendermint_consensus_height | Current block height
tendermint_consensus_validators | Number of validators
tendermint_consensus_missing_validators | Validators not signing
tendermint_p2p_peers | Connected peer count
tendermint_mempool_size | Pending transactions
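
To confirm the endpoint is serving these metrics before Prometheus is installed, you can read a value straight from the text output. The sketch below is one possible helper; the function name `metric_value` is hypothetical, and it assumes samples carry no trailing timestamp. If the `namespace` setting under `[instrumentation]` differs on your node, the metric prefix will differ from `tendermint_`.

```bash
#!/bin/sh
# Sum all samples of one metric from Prometheus text-format input on stdin.
# Samples may carry labels, e.g. tendermint_p2p_peers{chain_id="x"} 12
# Exits non-zero when the metric is not present at all.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                    # skip HELP and TYPE comment lines
        {
            key = $1
            sub(/[{].*/, "", key)        # drop the label set, keep the name
            if (key == name) { sum += $NF; found = 1 }
        }
        END { if (found) print sum; else exit 1 }
    '
}
```

Example use, assuming the listen address configured above: `curl -s localhost:26660/metrics | metric_value tendermint_p2p_peers`.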

Prometheus Setup

Install Prometheus

# Download
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-*.tar.gz
cd prometheus-*

# Configure
cat > prometheus.yml <<EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'monod'
    static_configs:
      - targets: ['localhost:26660']
EOF

# Run
./prometheus

Access

Open http://localhost:9090 in a browser.

Grafana Dashboards

Install Grafana

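# Ubuntu (apt-key is deprecated, so the signing key goes into a dedicated keyring)
sudo apt install -y apt-transport-https software-properties-common wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana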

# Start
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Access

Open http://localhost:3000 in a browser. The default login is admin/admin; change the password when Grafana prompts you on first login.

Dashboard Setup

  1. Add Prometheus data source
  2. Import Cosmos/Tendermint dashboard
  3. Customize for your needs
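
Step 1 can also be done from a file instead of the UI, using Grafana's data source provisioning. The sketch below writes such a file; `PROVISIONING_DIR` is a hypothetical variable that defaults to a local directory so the example runs without root. On a package install, Grafana typically reads `/etc/grafana/provisioning/datasources`, and the Prometheus URL assumes both services run on the same host.

```bash
#!/bin/sh
# PROVISIONING_DIR is an assumption; set it to Grafana's real provisioning
# directory (writing there usually needs root).
PROVISIONING_DIR="${PROVISIONING_DIR:-./grafana-provisioning/datasources}"
mkdir -p "$PROVISIONING_DIR"

# Declare one Prometheus data source and make it the default.
cat > "$PROVISIONING_DIR/prometheus.yaml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF

echo "Wrote $PROVISIONING_DIR/prometheus.yaml"
```

Grafana loads provisioning files at startup, so restart it afterwards with `sudo systemctl restart grafana-server`.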

Alerting

Prometheus Alertmanager

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: 'operator@example.com'

Alert Rules

# alerts.yml
groups:
  - name: monod
    rules:
      - alert: NodeDown
        expr: up{job="monod"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Monod node is down"

      # This metric counts every validator in the set that did not sign,
      # so the alert fires on network-wide misses, not only your own.
      - alert: MissedBlocks
        expr: tendermint_consensus_missing_validators > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "One or more validators are missing blocks"

      - alert: LowPeers
        expr: tendermint_p2p_peers < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count"
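
Prometheus only evaluates these rules and forwards alerts once prometheus.yml points at both the rule file and Alertmanager. The sketch below regenerates the config with those two sections added. It assumes alerts.yml sits next to prometheus.yml and that Alertmanager runs on its default port 9093 on the same host; `CONFIG_FILE` is a hypothetical variable for this example.

```bash
#!/bin/sh
# CONFIG_FILE is an assumption; this overwrites the file it names.
CONFIG_FILE="${CONFIG_FILE:-prometheus.yml}"

cat > "$CONFIG_FILE" <<'EOF'
global:
  scrape_interval: 15s

# Load the alert rules defined in alerts.yml
rule_files:
  - alerts.yml

# Send firing alerts to a local Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'monod'
    static_configs:
      - targets: ['localhost:26660']
EOF

# promtool ships in the Prometheus tarball; validate only if it is on PATH.
if command -v promtool > /dev/null 2>&1; then
    promtool check config "$CONFIG_FILE" || echo "promtool reported problems in $CONFIG_FILE"
fi
```

Restart Prometheus after changing the file so the new rules are loaded.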

Command Line Checks

Node Status

# Check if synced
curl -s localhost:26657/status | jq '.result.sync_info'

# Check peers
curl -s localhost:26657/net_info | jq '.result.n_peers'

# Check health
curl -s localhost:26657/health

Validator Status

# Check signing info
monod query slashing signing-info $(monod tendermint show-validator)

# Check if jailed
monod query staking validator $(monod keys show validator --bech val -a) | grep jailed

Log Monitoring

View Logs

# Recent logs
sudo journalctl -u monod -n 100

# Follow logs
sudo journalctl -u monod -f

# Filter errors
sudo journalctl -u monod | grep -i error
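
Grepping the whole journal gets slow and noisy on a long-running node. A possible refinement is to count error lines in a recent window and compare against a limit. In this sketch the function names and the `ERROR_LIMIT` default of 10 are hypothetical; pick a limit that matches your node's normal log volume.

```bash
#!/bin/sh
# Count lines on stdin that mention "error" (case-insensitive).
# grep exits 1 when nothing matches, so "|| true" keeps the status clean.
count_errors() {
    grep -ci 'error' || true
}

# ERROR_LIMIT is an assumed threshold, not a recommended value.
ERROR_LIMIT="${ERROR_LIMIT:-10}"

# Read log lines on stdin and report whether the error count is acceptable.
report_errors() {
    n=$(count_errors)
    if [ "$n" -gt "$ERROR_LIMIT" ]; then
        echo "WARNING: $n error lines"
        return 1
    fi
    echo "OK: $n error lines"
}
```

Example use: `sudo journalctl -u monod --since "10 minutes ago" --no-pager | report_errors`.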

Log Aggregation

Consider tools like:

  • Loki + Grafana
  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Datadog
  • Splunk

Health Check Script

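#!/bin/bash
# health-check.sh
# Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL

RPC="localhost:26657"

# Check process
if ! pgrep -x "monod" > /dev/null; then
    echo "CRITICAL: monod not running"
    exit 2
fi

# Check sync status (an empty reply means the RPC did not answer)
CATCHING_UP=$(curl -s --max-time 5 "$RPC/status" | jq -r '.result.sync_info.catching_up')
if [ -z "$CATCHING_UP" ]; then
    echo "CRITICAL: RPC not responding"
    exit 2
fi
if [ "$CATCHING_UP" = "true" ]; then
    echo "WARNING: Node is catching up"
    exit 1
fi

# Check peers (treat a missing or non-numeric reply as 0 peers)
PEERS=$(curl -s --max-time 5 "$RPC/net_info" | jq -r '.result.n_peers')
case "$PEERS" in
    ''|*[!0-9]*) PEERS=0 ;;
esac
if [ "$PEERS" -lt 3 ]; then
    echo "WARNING: Low peer count: $PEERS"
    exit 1
fi

echo "OK: Node healthy"
exit 0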
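
The script's exit codes match the common Nagios-plugin convention (0 for OK, 1 for warning, 2 for critical), which makes it easy to wrap. The sketch below maps the exit status of any check command to a severity label; `severity_for` and `run_check` are hypothetical names used only here.

```bash
#!/bin/sh
# Translate a check's exit status into a severity label.
severity_for() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "warning" ;;
        2) echo "critical" ;;
        *) echo "unknown" ;;
    esac
}

# Run any command, discard its output, and print the resulting severity.
run_check() {
    status=0
    "$@" > /dev/null 2>&1 || status=$?
    severity_for "$status"
}
```

For example, `run_check ./health-check.sh` prints `critical` when monod is not running, and the label can then be passed to whatever notifier you already use.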

Uptime Monitoring

External services to monitor your node:

  • UptimeRobot
  • Pingdom
  • StatusCake
  • Custom scripts

Monitor RPC endpoint externally:

curl -s https://your-node:26657/health
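
A bare `curl -s` exits 0 even when the server answers with an HTTP error, and it can hang on an unresponsive host. A slightly more defensive probe might look like the sketch below; the function name `probe` and the 5-second default timeout are assumptions for this example.

```bash
#!/bin/sh
# Probe an HTTP(S) endpoint. Succeeds only when the server answers without
# an HTTP error status (400 or above) inside the timeout.
# $1 = URL, $2 = timeout in seconds (defaults to 5).
probe() {
    curl --silent --fail --max-time "${2:-5}" --output /dev/null "$1"
}
```

Example use: `probe https://your-node:26657/health 5 || echo "CRITICAL: RPC unreachable"`.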

Opt-In Node Monitoring

Monolythium provides an opt-in monitoring service that tracks your node's health and sends alerts via Telegram. This service is privacy-preserving: no IPs, hostnames, or RPC URLs are ever exposed publicly.

What Gets Monitored

Data Sent | Purpose
Node ID | Unique identifier (derived from node_key.json)
Block height | Sync status tracking
Version info | monod and monoctl versions
Chain ID | Network verification
Capabilities | Archive, state-sync, snapshots

Never sent: IP addresses, hostnames, RPC URLs, private keys.

Register Your Node

# Register with the monitoring service
monoctl monitor register --network Sprintnet --moniker "my-node" --role fullnode --home ~/.monod

# Roles: validator, seed, bootstrap, fullnode, gateway, snapshot

This generates an ed25519 keypair for signing heartbeats and returns a link token. Send the token to @MonolythiumBot on Telegram to complete registration and start receiving alerts.

Install Automatic Heartbeats

# Install systemd timer (sends heartbeat every 30 seconds)
sudo monoctl monitor install --network Sprintnet --user $(whoami) --home ~/.monod

# Verify timer is active
sudo systemctl status monoctl-monitor@Sprintnet.timer

Manual Heartbeat

# Send a single heartbeat (useful for testing)
monoctl monitor heartbeat --network Sprintnet --home ~/.monod

Control Visibility

By default, registered nodes are private and don't appear on the public explorer. To make your node visible:

# Make node public (visible on explorer)
monoctl monitor visibility --public --network Sprintnet --home ~/.monod

# Make node private again
monoctl monitor visibility --private --network Sprintnet --home ~/.monod

Public nodes appear at monoscan.xyz/sprintnet/nodes.

Uninstall

# Remove the systemd timer
sudo monoctl monitor uninstall --network Sprintnet

Telegram Bot Commands

Once registered, you can interact with @MonolythiumBot:

Command | Description
/start | Welcome message and setup
/help | List available commands
/status | View your registered nodes
/pause <moniker> <minutes> | Pause alerts (e.g., /pause my-node 60)
/resume <moniker> | Resume alerts for a node
/deregister <moniker> | Remove node from monitoring

Pause alerts when doing maintenance:

/pause my-node 60

This pauses alerts for 60 minutes (max 24 hours). Use /resume my-node to resume earlier.

Alert Types

The bot sends alerts when:

  • Node goes offline (no heartbeat for 2+ minutes)
  • Node starts lagging behind canonical height
  • Node becomes stalled (height not increasing)
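
To make the three conditions concrete, the sketch below classifies a node from a handful of numbers. It is not the monitoring service's actual logic. The 120-second offline cutoff comes from the list above, while the `MAX_LAG` threshold of 20 blocks and the function name `classify_node` are assumptions made for illustration.

```bash
#!/bin/sh
# MAX_LAG is an assumed tolerance, in blocks, before a node counts as lagging.
MAX_LAG="${MAX_LAG:-20}"

# Arguments: seconds since last heartbeat, current height,
#            previously reported height, canonical network height.
classify_node() {
    heartbeat_age="$1"
    height="$2"
    previous="$3"
    canonical="$4"

    if [ "$heartbeat_age" -ge 120 ]; then
        echo "offline"      # no heartbeat for 2+ minutes
    elif [ "$height" -le "$previous" ]; then
        echo "stalled"      # height has not increased since the last report
    elif [ $((canonical - height)) -gt "$MAX_LAG" ]; then
        echo "lagging"      # advancing, but too far behind the network
    else
        echo "healthy"
    fi
}
```

The order of the checks matters: an offline node has no fresh height to compare, so heartbeat age is tested first.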

Best Practices

  1. Monitor 24/7 - Use automated monitoring
  2. Set up alerts - Know about issues immediately
  3. Test alerts - Verify notifications work
  4. Monitor trends - Catch issues before they're critical
  5. Keep dashboards - Visualize health at a glance
  6. Log retention - Keep logs for troubleshooting

FAQ

What's the minimum monitoring setup?

At least: process monitoring, disk space alerts, and external ping.

Should I expose metrics publicly?

No. Keep Prometheus metrics internal or behind authentication.

How often should I check?

Automated: every 15-30 seconds. Manual: at least daily review of dashboards.