Enterprise-grade automatic service monitoring, recovery, and alerting system for Linux servers with advanced features like dependencies, health checks, and auto-recovery

Package Exports

  • @timemacro/service-guardian
  • @timemacro/service-guardian/src/index.js

This package does not declare an exports field, so the exports above have been automatically detected and optimized by JSPM instead. If any package subpath is missing, it is recommended to post an issue to the original package (@timemacro/service-guardian) to support the "exports" field. If that is not possible, create a JSPM override to customize the exports field for this package.

Readme

Service Guardian

Enterprise-grade automatic service monitoring, recovery, and alerting system for Linux servers

Service Guardian is a production-ready Node.js daemon that monitors your Linux services, automatically recovers from failures, and sends intelligent alerts. Built for system administrators and DevOps teams who need reliable service uptime without manual intervention.

The Problem It Solves

Ever had MySQL crash at 3 AM because the OOM killer stepped in? Or Apache go down during peak traffic? Service Guardian keeps your critical services running by:

  • Detecting failures quickly - Not just checking whether a process exists, but verifying that services actually work
  • Smart auto-recovery - Distinguishes between crashes and manual stops, and only restarts genuine failures
  • Intelligent alerting - Batched, actionable alerts with system context, not spam

Key Features

🛡️ Core Monitoring

  • Systemd Integration - Deep integration with systemd for accurate service state detection
  • Intelligent Failure Analysis - Differentiates between:
    • OOM (Out of Memory) kills
    • Service crashes
    • Manual stops (won't restart these)
    • Dependency failures
  • Parallel Monitoring - Efficiently monitors multiple services simultaneously
  • Resource-Aware - Monitors CPU, memory, disk usage before taking actions

🔄 Advanced Recovery

  • Smart Auto-Restart - With exponential backoff to prevent restart loops
  • Dependency Management - Handles service dependencies and detects circular dependencies
  • Recovery Actions - Beyond just restart:
    • Clear system cache
    • Kill memory-intensive processes
    • Reload configurations
    • Clean zombie processes
    • Repair databases
  • Maintenance Windows - Pause monitoring during planned maintenance

🏥 Health Checks

  • Beyond Process Monitoring - Tests if services actually work:
    • TCP port checks (is MySQL accepting connections?)
    • HTTP endpoint checks (is API returning 200?)
    • Custom script checks (complex business logic)
    • Command checks (simple shell commands)
  • Failure Thresholds - Only alerts after X consecutive failures, which reduces false alarms
  • User-Friendly Messages - Clear explanations of what's wrong and how to fix it
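
The failure threshold can be pictured as a per-service counter that resets on any successful check. The sketch below is illustrative only: the class name `FailureThreshold` and its `record` method are hypothetical and are not taken from Service Guardian's source.

```javascript
// Hypothetical sketch: alert only after N consecutive failed health checks.
class FailureThreshold {
  constructor(threshold) {
    this.threshold = threshold;
    this.consecutive = new Map(); // service name -> current failure streak
  }

  // Record one check result; returns true when the streak first reaches the threshold.
  record(service, ok) {
    if (ok) {
      this.consecutive.set(service, 0);
      return false;
    }
    const streak = (this.consecutive.get(service) || 0) + 1;
    this.consecutive.set(service, streak);
    return streak === this.threshold;
  }
}

const tracker = new FailureThreshold(3);
const results = [false, false, true, false, false, false].map((ok) =>
  tracker.record('mysql', ok)
);
console.log(results); // [ false, false, false, false, false, true ]
```

A single success in the middle resets the streak, so only the third failure in a row triggers an alert.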

📧 Intelligent Alerting

  • Beautiful HTML Emails - Professional, readable alert emails with system context
  • Alert Aggregation - Batches multiple alerts to reduce email spam
  • Rate Limiting - Prevents alert storms during major incidents
  • Cooldown Periods - Won't repeatedly alert for the same issue
  • Contextual Information - Includes failure analysis, resource usage, recent logs
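
Cooldown periods and rate limiting could be combined roughly as follows. This is a minimal sketch, not the package's implementation: `AlertGate` and `allow` are hypothetical names, and the defaults simply mirror the `ALERT_COOLDOWN` and `MAX_ALERTS_PER_HOUR` values listed under Configuration Options.

```javascript
// Hypothetical sketch of cooldown plus hourly rate limiting for alerts.
class AlertGate {
  constructor({ cooldownSec = 600, maxPerHour = 10 } = {}) {
    this.cooldownMs = cooldownSec * 1000;
    this.maxPerHour = maxPerHour;
    this.lastSent = new Map(); // alert key -> timestamp of last alert (ms)
    this.sentTimes = [];       // timestamps of alerts sent within the past hour
  }

  // Returns true if an alert for `key` may be sent at time `now` (ms).
  allow(key, now = Date.now()) {
    this.sentTimes = this.sentTimes.filter((t) => now - t < 3600 * 1000);
    if (this.sentTimes.length >= this.maxPerHour) return false; // rate limit hit
    const last = this.lastSent.get(key);
    if (last !== undefined && now - last < this.cooldownMs) return false; // in cooldown
    this.lastSent.set(key, now);
    this.sentTimes.push(now);
    return true;
  }
}

const gate = new AlertGate({ cooldownSec: 600, maxPerHour: 2 });
console.log(gate.allow('mysql:down', 0));       // true
console.log(gate.allow('mysql:down', 60_000));  // false, still in cooldown
console.log(gate.allow('nginx:down', 60_000));  // true
console.log(gate.allow('redis:down', 120_000)); // false, hourly cap reached
```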

📊 Metrics & Reporting

  • Service Metrics - Track uptime, restart counts, failure patterns
  • Resource Metrics - Monitor CPU, memory, disk usage over time
  • Daily Aggregation - Historical data for trend analysis
  • Health Reports - Summary of all monitored services

🔒 Security

  • Command Injection Protection - All inputs sanitized and validated
  • Whitelisted Commands - Only approved system commands can be executed
  • Path Traversal Prevention - Secure file operations
  • No Hardcoded Credentials - Everything configurable via environment variables
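
To illustrate what command whitelisting and injection protection can look like in Node.js, here is a minimal sketch. It assumes an allow-list of `systemctl` actions and a conservative service-name pattern; the names `ALLOWED_ACTIONS`, `buildSystemctlArgs`, and `runSystemctl` are hypothetical, and the actual rules used by Service Guardian may differ.

```javascript
// Hypothetical sketch: validate input and avoid the shell entirely.
const { execFile } = require('node:child_process');

const ALLOWED_ACTIONS = new Set(['status', 'is-active', 'restart', 'show']);
// Conservative pattern: must start with a letter or digit, so "--flags" are rejected too.
const SERVICE_NAME = /^[A-Za-z0-9][A-Za-z0-9@_.\-]*$/;

// Build an argument vector for systemctl, or throw on anything unexpected.
function buildSystemctlArgs(action, service) {
  if (!ALLOWED_ACTIONS.has(action)) {
    throw new Error(`Action not allowed: ${action}`);
  }
  if (!SERVICE_NAME.test(service)) {
    throw new Error(`Invalid service name: ${service}`);
  }
  return [action, service];
}

// execFile passes arguments directly to the program without spawning a shell,
// so metacharacters such as ";" or "$(...)" are never interpreted.
function runSystemctl(action, service, callback) {
  execFile('systemctl', buildSystemctlArgs(action, service), callback);
}

console.log(buildSystemctlArgs('restart', 'mysql')); // [ 'restart', 'mysql' ]
```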

Installation

Prerequisites

  • Node.js >= 16.0.0
  • Linux with systemd (Debian, Ubuntu, RHEL, etc.)
  • Root or sudo access (for systemctl commands)

Install via npm

# Install globally
npm install -g @timemacro/service-guardian

# Or with sudo if needed
sudo npm install -g @timemacro/service-guardian

Install from source

git clone https://github.com/derricksiawor/service-guardian.git
cd service-guardian
npm install
npm link

Quick Start

1. Install and Check Version

# Install globally
npm install -g @timemacro/service-guardian

# Verify installation
sg --version
sg --help                     # See all available commands

2. Configure Email Alerts

sg config email               # Interactive email setup

You'll be prompted for SMTP settings:

  • SMTP Host (e.g., smtp.gmail.com)
  • SMTP Port (e.g., 587)
  • Username
  • Password
  • From address
  • To address

3. Add Services to Monitor

# Add a service with auto-restart and alerts
sg add mysql --restart --alert

# Add multiple services
sg add nginx --restart --alert
sg add postgresql --restart --alert
sg add redis --restart

# Add with custom settings
sg add apache2 --restart --alert --max-restarts 10

# List all monitored services
sg list

4. Monitor Your Services

# The daemon auto-starts when you add services
sg status                     # Check daemon and all services status

# View logs
sg logs                       # Recent logs
sg logs --follow              # Live logs (like tail -f)
sg logs --tail 100            # Last 100 lines

# Manual operations
sg check mysql                # Check specific service
sg restart                    # Restart the daemon
sg test                       # Test all services

Usage

Command Reference

Service Guardian can be invoked using either service-guardian or sg (shorthand). We recommend using sg for convenience.

Quick Information Commands

# Get started quickly
sg                            # Show help and available commands
sg --help                     # Show detailed help
sg --version                  # Show version

# View current state
sg status                     # Show daemon status and all monitored services
sg list                       # List all monitored services
sg info                       # Show system information and configuration

Core Commands

# Daemon Control (auto-starts if not running)
sg start                      # Start monitoring daemon (auto-starts on first command)
sg stop                       # Stop monitoring daemon
sg restart                    # Restart daemon
sg status                     # Show daemon and services status

# Service Management
sg add <service> [options]    # Add service to monitoring
sg remove <service>           # Remove service from monitoring
sg list                       # List all monitored services
sg enable <service>           # Enable monitoring for service
sg disable <service>          # Disable monitoring for service

# Monitoring & Logs
sg logs                       # View recent daemon logs
sg logs --follow              # View logs in real-time (like tail -f)
sg logs --tail 50             # View last 50 log lines
sg check <service>            # Manually check service status
sg test                       # Test monitoring all services

Advanced Features

# Health Checks
sg health add <service> [options]     # Add health check
sg health list                         # List all health checks
sg health remove <service>             # Remove health check
sg health test <service>               # Test health check

# Dependencies
sg deps add <service> <deps...>       # Add service dependencies
sg deps remove <service> <deps...>    # Remove dependencies
sg deps list [service]                # List dependencies
sg deps check                          # Check for circular dependencies

# Maintenance Windows
sg maintenance add [options]          # Schedule maintenance
sg maintenance list                    # List maintenance windows
sg maintenance remove <name>           # Remove maintenance window

# Groups & Tags
sg group create <name>                 # Create service group
sg group add <group> <services...>    # Add services to group
sg group list                          # List all groups
sg tag add <service> <tags...>        # Add tags to service
sg tag list [service]                  # List tags

# Metrics & Reports
sg metrics [service] [options]         # View service metrics
sg report [options]                    # Generate health report

# Configuration
sg config email                        # Configure email settings
sg config show                         # Show configuration
sg config set <key> <value>           # Set config value
sg export [file]                       # Export configuration
sg import <file>                       # Import configuration

Configuration Options

Configuration is stored in /etc/service-guardian/config.json (or ~/.service-guardian/config.json for non-root users).

{
  // Monitoring
  "CHECK_INTERVAL": 30,              // Seconds between checks
  "HEALTH_CHECK_INTERVAL": 60,       // Seconds between health checks
  
  // Restart Settings
  "MAX_RESTARTS": 5,                 // Max restart attempts
  "RESTART_DELAY": 10,               // Initial delay (seconds)
  "RESTART_BACKOFF_MULTIPLIER": 2,   // Exponential backoff
  "MAX_RESTART_DELAY": 300,          // Max delay (seconds)
  
  // Alerts
  "ALERT_COOLDOWN": 600,             // Seconds between alerts
  "ALERT_BATCH_INTERVAL": 60,        // Batch window (seconds)
  "MAX_ALERTS_PER_HOUR": 10,         // Rate limiting
  
  // Email Settings (set via sg config email)
  "SMTP_HOST": "smtp.gmail.com",
  "SMTP_PORT": 587,
  "SMTP_USER": "your-email@gmail.com",
  "SMTP_PASS": "your-app-password",
  "EMAIL_FROM": "alerts@yourserver.com",
  "EMAIL_TO": "admin@yourcompany.com"
}
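
As a rough illustration of how the restart settings interact, the sketch below computes the delay before each restart attempt. The formula (initial delay times the multiplier raised to the attempt index, capped at the maximum) is the usual form of exponential backoff and is an assumption about how Service Guardian combines these values; `restartDelay` is a hypothetical helper.

```javascript
// Hypothetical sketch: delay in seconds before restart attempt `attempt` (1-based).
function restartDelay(attempt, config) {
  const { RESTART_DELAY, RESTART_BACKOFF_MULTIPLIER, MAX_RESTART_DELAY } = config;
  const delay = RESTART_DELAY * Math.pow(RESTART_BACKOFF_MULTIPLIER, attempt - 1);
  return Math.min(delay, MAX_RESTART_DELAY); // never wait longer than the cap
}

const config = {
  MAX_RESTARTS: 5,
  RESTART_DELAY: 10,
  RESTART_BACKOFF_MULTIPLIER: 2,
  MAX_RESTART_DELAY: 300,
};

const delays = [];
for (let attempt = 1; attempt <= config.MAX_RESTARTS; attempt++) {
  delays.push(restartDelay(attempt, config));
}
console.log(delays); // [ 10, 20, 40, 80, 160 ]
```

With the defaults, the 300-second cap would only apply from a sixth attempt onward, which `MAX_RESTARTS` already rules out.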

How It Works

1. Service Monitoring Flow

┌─────────────────┐
│ Cron Scheduler  │ Every 30 seconds
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Check Services  │ Parallel checks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Analyze Status  │ Is service healthy?
└────────┬────────┘
         │
    ┌────┴────┐
    │ Healthy │ Not Healthy
    └────┬────┘
         │
         ▼
┌─────────────────┐
│ Failure Analysis│ Why did it fail?
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Recovery Actions│ Try to fix
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Auto-Restart?   │ If enabled
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Send Alert?     │ If enabled & not in cooldown
└─────────────────┘

2. Failure Detection

Service Guardian performs intelligent failure analysis:

// Not just "is process running?"
if (!service.isActive) {
  // Analyze WHY it's not running
  const analysis = await analyzeFailure(service);
  
  if (analysis.type === 'MANUAL_STOP') {
    // User stopped it, don't restart
    return;
  }
  
  if (analysis.type === 'OOM_KILL') {
    // Killed by OOM, check memory before restart
    if (memory.usage > 90) {  // usage expressed as a percentage
      // Clean up memory first
      await clearSystemCache();
    }
  }
  
  // Smart restart with backoff
  await attemptRestart(service);
}
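
One plausible way to implement the analysis step is to read unit properties from `systemctl show <unit>`, which prints `Key=Value` lines. systemd exposes a `Result` property for service units, with values such as `success`, `exit-code`, `signal`, and `oom-kill`. The mapping below is an assumption for illustration: `parseShowOutput` and `classifyFailure` are hypothetical names, and the `CRASH` and `HEALTHY` labels are invented here to complement the `MANUAL_STOP` and `OOM_KILL` types shown above.

```javascript
// Hypothetical sketch: classify a failure from `systemctl show <unit>` output.

// Parse "Key=Value" lines into a plain object.
function parseShowOutput(text) {
  const props = {};
  for (const line of text.split('\n')) {
    const idx = line.indexOf('=');
    if (idx > 0) props[line.slice(0, idx)] = line.slice(idx + 1);
  }
  return props;
}

// Map systemd properties to a failure type.
function classifyFailure(props) {
  if (props.ActiveState === 'active') return 'HEALTHY';
  if (props.Result === 'oom-kill') return 'OOM_KILL';
  // Assumption: a clean stop (inactive with Result=success) is treated as a manual stop.
  if (props.ActiveState === 'inactive' && props.Result === 'success') return 'MANUAL_STOP';
  return 'CRASH';
}

const sample = 'ActiveState=failed\nSubState=failed\nResult=oom-kill';
console.log(classifyFailure(parseShowOutput(sample))); // OOM_KILL
```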

3. Health Checks

Beyond process monitoring, health checks verify services actually work:

// TCP Health Check Example
const mysql_health = {
  type: 'tcp',
  host: 'localhost',
  port: 3306,
  timeout: 10,
  interval: 60
};

// Results in user-friendly messages:
// ✅ "mysql is responding on localhost:3306"
// ❌ "mysql is not accepting connections on localhost:3306. 
//     The service may be down or not listening on this port.
//     Suggestion: Verify mysql is running with: systemctl status mysql"
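
A TCP check like the one configured above can be sketched with Node's built-in `net` module. The functions `checkTcp` and `describeTcpResult` are hypothetical and only approximate the behaviour described here; `timeout` is assumed to be in seconds, as in the example configuration.

```javascript
// Hypothetical sketch of a TCP health check using Node's built-in net module.
const net = require('node:net');

// Resolves to true if a TCP connection succeeds within `timeout` seconds.
function checkTcp({ host, port, timeout }) {
  return new Promise((resolve) => {
    const socket = net.createConnection({ host, port });
    const finish = (ok) => {
      socket.destroy(); // always release the socket
      resolve(ok);
    };
    socket.setTimeout(timeout * 1000);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Turn the raw result into a readable message.
function describeTcpResult(name, { host, port }, ok) {
  return ok
    ? `${name} is responding on ${host}:${port}`
    : `${name} is not accepting connections on ${host}:${port}. ` +
      `Suggestion: Verify ${name} is running with: systemctl status ${name}`;
}

// Usage (needs a reachable target, so it is left commented out):
// const target = { host: 'localhost', port: 3306, timeout: 10 };
// checkTcp(target).then((ok) => console.log(describeTcpResult('mysql', target, ok)));
```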

4. Alert Aggregation

Intelligent batching reduces email spam:

// Instead of 10 emails in 1 minute:
// "nginx failed"
// "mysql failed"
// "redis failed"
// ...

// You get 1 comprehensive email:
// "3 services need attention:
//  - nginx: Connection refused on port 80
//  - mysql: OOM killed (memory: 95%)
//  - redis: Dependency postgres is down"
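
The batching idea can be sketched as a small collector that is flushed once per batch window. This is an illustration under assumptions, not the package's code: `AlertBatcher` is a hypothetical class, and in a daemon its `flush` method would presumably be called on a timer every `ALERT_BATCH_INTERVAL` seconds.

```javascript
// Hypothetical sketch: collect alerts and format one summary per batch window.
class AlertBatcher {
  constructor() {
    this.pending = new Map(); // service -> latest reason (repeats overwrite)
  }

  add(service, reason) {
    this.pending.set(service, reason);
  }

  // Build one summary for everything collected, then reset. Returns null if empty.
  flush() {
    if (this.pending.size === 0) return null;
    const count = this.pending.size;
    const lines = [...this.pending].map(([service, reason]) => ` - ${service}: ${reason}`);
    const noun = count === 1 ? 'service needs' : 'services need';
    this.pending.clear();
    return `${count} ${noun} attention:\n${lines.join('\n')}`;
  }
}

const batcher = new AlertBatcher();
batcher.add('nginx', 'Connection refused on port 80');
batcher.add('mysql', 'OOM killed (memory: 95%)');
batcher.add('redis', 'Dependency postgres is down');
console.log(batcher.flush());
```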

Real-World Examples

Example 1: MySQL OOM Protection

# Add MySQL with OOM recovery
sg add mysql --restart --alert --max-restarts 5

# Add health check to verify it's accepting connections
sg health add mysql --type tcp --port 3306

# Add recovery action to clear cache when memory is high
sg recovery add mysql --type clear-cache --threshold 90

When MySQL gets OOM-killed:

  1. Service Guardian detects the OOM kill (not just "service down")
  2. Checks system memory usage
  3. If memory > 90%, clears system cache first
  4. Restarts MySQL with exponential backoff
  5. Verifies it's accepting connections
  6. Sends detailed alert with memory stats and suggestions

Example 2: Dependent Services

# Setup WordPress stack with dependencies
sg add nginx --restart --alert
sg add php-fpm --restart --alert
sg add mysql --restart --alert

# Define dependencies
sg deps add nginx php-fpm
sg deps add php-fpm mysql

# If MySQL fails, Service Guardian will:
# 1. Restart MySQL first
# 2. Then restart php-fpm (depends on MySQL)
# 3. Then restart nginx (depends on php-fpm)
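
The restart order above is a topological ordering of the dependency graph. The sketch below shows one standard way to compute it with a depth-first search that also reports circular dependencies; `restartOrder` is a hypothetical function and not necessarily how Service Guardian implements `sg deps`.

```javascript
// Hypothetical sketch: order restarts so dependencies come first, and detect cycles.
// `deps` maps each service to the list of services it depends on.
function restartOrder(deps) {
  const order = [];
  const state = new Map(); // service -> 'visiting' | 'done'

  function visit(service, path) {
    if (state.get(service) === 'done') return;
    if (state.get(service) === 'visiting') {
      throw new Error(`Circular dependency: ${[...path, service].join(' -> ')}`);
    }
    state.set(service, 'visiting');
    for (const dep of deps[service] || []) visit(dep, [...path, service]);
    state.set(service, 'done');
    order.push(service); // added only after all of its dependencies
  }

  for (const service of Object.keys(deps)) visit(service, []);
  return order;
}

const deps = { nginx: ['php-fpm'], 'php-fpm': ['mysql'], mysql: [] };
console.log(restartOrder(deps)); // [ 'mysql', 'php-fpm', 'nginx' ]
```

A graph such as `{ a: ['b'], b: ['a'] }` makes the function throw with the message `Circular dependency: a -> b -> a`.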

Example 3: Maintenance Windows

# Schedule maintenance window for updates
sg maintenance add "Weekly Updates" \
  --days sunday \
  --start 02:00 \
  --duration 2 \
  --services nginx,mysql,redis

# During maintenance:
# - No auto-restarts
# - No alerts
# - Services can be safely updated
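
Deciding whether the daemon is currently inside a maintenance window comes down to a date comparison. The sketch below assumes that `--duration` is given in hours, that windows last at most 24 hours, and that times are in the server's local time zone; none of this is confirmed by the README, and `inMaintenance` is a hypothetical helper.

```javascript
// Hypothetical sketch: is `now` inside a weekly maintenance window?
const DAYS = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];

function inMaintenance(window, now = new Date()) {
  const [hour, minute] = window.start.split(':').map(Number);
  const durationMs = window.durationHours * 3600 * 1000;
  // Check windows that opened today or yesterday, so spans past midnight are covered.
  for (const offset of [0, -1]) {
    const open = new Date(now.getFullYear(), now.getMonth(), now.getDate() + offset, hour, minute);
    if (!window.days.includes(DAYS[open.getDay()])) continue;
    const elapsed = now.getTime() - open.getTime();
    if (elapsed >= 0 && elapsed < durationMs) return true;
  }
  return false;
}

const weeklyUpdates = { days: ['sunday'], start: '02:00', durationHours: 2 };
console.log(inMaintenance(weeklyUpdates, new Date(2025, 0, 5, 3, 0)));  // true  (Sunday 03:00)
console.log(inMaintenance(weeklyUpdates, new Date(2025, 0, 5, 4, 30))); // false (Sunday 04:30)
```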

Example 4: Custom Health Checks

# Create custom health check script
cat > /etc/service-guardian/health-checks/api-check.sh << 'EOF'
#!/bin/bash
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/api/health)
if [ "$RESPONSE" = "200" ]; then
  echo "API is healthy"
  exit 0
else
  echo "API returned status code: $RESPONSE"
  exit 1
fi
EOF

chmod +x /etc/service-guardian/health-checks/api-check.sh

# Add the health check
sg health add api --type script --script api-check.sh
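
On the daemon side, a script check like this boils down to running the script and interpreting its exit code (0 for healthy, anything else for unhealthy, as in the script above). The following sketch uses Node's `child_process.spawnSync`; `runScriptCheck` is a hypothetical function, and the 10-second default timeout is an assumption.

```javascript
// Hypothetical sketch: run a health check script and interpret its exit code.
const { spawnSync } = require('node:child_process');

// Exit code 0 means healthy; anything else (including a timeout or a
// missing executable) means unhealthy.
function runScriptCheck(command, args = [], timeoutSec = 10) {
  const result = spawnSync(command, args, {
    encoding: 'utf8',
    timeout: timeoutSec * 1000,
  });
  if (result.error) {
    return { healthy: false, message: result.error.message };
  }
  return { healthy: result.status === 0, message: (result.stdout || '').trim() };
}

// Demonstration using the Node binary itself as a stand-in for api-check.sh.
const demo = runScriptCheck(process.execPath, ['-e', 'console.log("API is healthy")']);
console.log(demo); // { healthy: true, message: 'API is healthy' }
```

For the script from this example, the call would be `runScriptCheck('/etc/service-guardian/health-checks/api-check.sh')`.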

Architecture

Security Features

  1. Input Validation - All inputs validated with JSON schemas
  2. Command Whitelisting - Only approved system commands
  3. Shell Escape - Prevents command injection
  4. Path Validation - Prevents directory traversal
  5. Secure Execution - Isolated command execution

Performance

  • Parallel Monitoring - Check multiple services simultaneously
  • Efficient Resource Usage - Minimal CPU and memory footprint
  • Optimized Queries - Batch operations where possible
  • Caching - Reduces repeated system calls

Reliability

  • Crash Recovery - Daemon automatically recovers from crashes
  • Data Persistence - Configuration and metrics survive restarts
  • Atomic Operations - Prevents partial updates
  • Graceful Shutdown - Cleanly stops all operations

Troubleshooting

Service Guardian won't start

# Check if already running
sg status

# Check logs for errors
sg logs --tail 50

# Verify Node.js version
node --version  # Should be >= 16.0.0

# Check permissions
ls -la /etc/service-guardian/

Services not being monitored

# Verify service is added
sg list

# Check if service exists
systemctl status <service-name>

# Test monitoring manually
sg check <service-name>

# Check dependencies
sg deps check

Not receiving alerts

# Test email configuration
sg config email --test

# Check alert settings
sg config show | grep ALERT

# View recent alerts
sg logs | grep "Alert sent"

# Check cooldown status
sg status --verbose

High memory usage

# Check metrics history
sg metrics --days 7

# Clear old metrics
sg metrics --cleanup

# Reduce check frequency
sg config set CHECK_INTERVAL 60

Development

Running Tests

npm test                 # Run all tests
npm run test:watch      # Watch mode
npm run test:coverage   # Coverage report

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

MIT License - Copyright (c) 2025 Derrick S. K. Siawor

See LICENSE file for details.

Author

Derrick S. K. Siawor
Website: https://derricksiawor.com
GitHub: @derricksiawor


Stop losing sleep over crashed services. Let Service Guardian keep watch.