Building a Self-healing Infrastructure with Docker and Prometheus Alertmanager

Introduction

Modern infrastructure demands resilience beyond basic monitoring. A self-healing system automatically detects failures—from a crashed container to a misconfigured service—and executes corrective actions before users ever notice. By combining Docker's container orchestration with Prometheus Alertmanager, you can build a pipeline that scrapes metrics, evaluates alerting rules, and triggers recovery workflows without manual intervention. This article walks through the complete setup, from architecture decisions to production-ready automation scripts, using practical examples and proven patterns.

What is Self-Healing Infrastructure?

Self-healing infrastructure is a design pattern where your platform automatically identifies and recovers from known failure states. It reduces mean time to recovery (MTTR) because known failures no longer wait for an on-call engineer to restart a service or reroute traffic. The core loop involves:

Detection – Collecting telemetry (CPU, memory, health check status, latency) from every component.
Evaluation – Applying rules that define what constitutes a problem (e.g., HTTP 5xx rate exceeds 5% in five minutes).
Notification & Action – Alerting teams when necessary, and executing automated recovery (e.g., restart a container, scale up a service, recycle a node).

This approach is especially valuable in containerized environments where ephemeral workloads change constantly and manual recovery is both slow and error-prone.
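The evaluation step can be illustrated with a small Python sketch. It is not part of the stack built later in this article; the `Sample` type, the function name, and the default window and threshold are assumptions chosen to mirror the "5xx rate exceeds 5% in five minutes" example above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One request observation: Unix timestamp and HTTP status code."""
    timestamp: float
    status: int


def error_rate_exceeded(samples, now, window_s=300.0, threshold=0.05):
    """Return True if the share of 5xx responses in the window is above threshold."""
    recent = [s for s in samples if now - window_s <= s.timestamp <= now]
    if not recent:
        return False  # no traffic in the window: nothing to evaluate
    errors = sum(1 for s in recent if 500 <= s.status <= 599)
    return errors / len(recent) > threshold
```

In the real pipeline this comparison is expressed as a Prometheus alerting rule rather than application code; the sketch only shows what such a rule computes.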

Core Components for the Setup

To implement self-healing with Docker and Prometheus, you need these four pieces:

Docker & Docker Compose – To run your application containers and the monitoring stack itself in a reproducible, isolated way.
Prometheus – A time-series database and monitoring system that scrapes metrics from your containers via their endpoints.
Alertmanager – The component that receives alerts from Prometheus, deduplicates them, and routes them to receivers—including webhooks that trigger recovery actions.
Recovery Script / Service – A lightweight HTTP server (or shell script) that listens for webhook calls from Alertmanager and executes docker restart, docker-compose up -d, or other corrective commands.

All components will run as containers, making the entire system portable and easy to version-control.

Why Use Prometheus Instead of Docker’s Built-In Restart Policies?

Docker's --restart=always or --restart=unless-stopped restarts a container when its process exits, but it does nothing for a service that is still running yet no longer answers requests. A HEALTHCHECK can mark such a container as unhealthy, but outside Swarm mode Docker does not restart unhealthy containers on its own. Prometheus lets you detect application-level failures, such as a sustained rate of HTTP 503 responses, and act only when the service is unhealthy according to your own rules.
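To make the distinction concrete, here is a minimal sketch of an application-level probe in Python. A hung process still counts as "running" for a restart policy, but it fails this check because no response arrives within the timeout. The function name and the two-second default are assumptions for illustration.

```python
import urllib.error
import urllib.request


def is_responding(url, timeout_s=2.0):
    """Return True only if the service answers with a success status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses
        # (HTTPError is a subclass of URLError).
        return False
```

For example, `is_responding("http://localhost:8080/")` would check the sample web application once it is published on port 8080.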

Setting Up the Docker Environment

Create a directory structure for the project:

self-healing-infra/
├── docker-compose.yml
├── prometheus/
│   └── prometheus.yml
├── alertmanager/
│   └── alertmanager.yml
└── recovery-agent/
    ├── Dockerfile
    └── agent.py

Define the Docker Compose File

Start with a docker-compose.yml that includes a sample web application, Prometheus, Alertmanager, and the recovery agent. The top-level version key is only needed by the legacy docker-compose binary; the current docker compose plugin ignores it.

Tip: Always pin image versions (e.g., prom/prometheus:v2.54.0) to avoid breaking changes.

Example snippet:

version: '3.8'
services:
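  webapp:
    image: nginx:1.27-alpine   # pinned, per the tip above
    container_name: webapp     # fixed name so docker commands can target "webapp"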
    ports:
      - "8080:80"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
    labels:
      - "prometheus.job=webapp"
      - "prometheus.port=9113"   # if using nginx-exporter

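  prometheus:
    image: prom/prometheus:v2.54.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      # The rule file referenced by rule_files must be visible in the container
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      # Needed by docker_sd_configs; the Prometheus user must be able to read it
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "9090:9090"
    depends_on:
      - webapp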

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

  recovery-agent:
    build: ./recovery-agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "5001:5001"   # webhook listener
    depends_on:
      - alertmanager

Mounting the Docker socket inside the recovery agent is a common pattern, but it gives the container full control of the Docker daemon, which is equivalent to root on the host. In production, consider putting a restricted proxy in front of the socket, or exposing the Docker API over TLS with client certificates.

Configuring Prometheus

Prometheus needs to know where to scrape metrics and what rules to evaluate for alerting. Create prometheus/prometheus.yml:

Scrape Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

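scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']  # Prometheus itself

  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      # Scrape only containers that carry the prometheus.job label
      - source_labels: [__meta_docker_container_label_prometheus_job]
        regex: .+
        action: keep
      - source_labels: [__meta_docker_container_label_prometheus_job]
        target_label: job
      # Point the scrape at the port named in the prometheus.port label
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        target_label: __address__
        replacement: '$1:$2'
      # Expose the container name (without its leading slash) as a label
      - source_labels: [__meta_docker_container_name]
        regex: '/?(.+)'
        target_label: container

# Without this block, Prometheus evaluates rules but never sends alerts anywhere
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

This uses Docker service discovery so that Prometheus finds containers carrying the prometheus.job label and scrapes them on the port given in prometheus.port. The container must actually serve metrics on that port; plain nginx does not, so you need an exporter or an application that exposes /metrics. The alerting block tells Prometheus where Alertmanager is listening. If you prefer static targets, you can list them directly.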
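Two details of label handling in Docker service discovery are easy to get wrong, so here they are as a small Python sketch. The helper names are hypothetical, and the code only imitates what Prometheus does internally: container label keys such as prometheus.job show up as meta labels with characters that are invalid in label names replaced by underscores, and a common relabeling step swaps the discovered port for the one named in a label.

```python
import re


def meta_label_name(container_label):
    """Map a Docker label key to the meta label name exposed for it.

    Characters that are not valid in a Prometheus label name become underscores.
    """
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", container_label)
    return "__meta_docker_container_label_" + sanitized


def rewrite_address(address, port):
    """Replace the port of a discovered host:port address with the labelled port."""
    host = address.rsplit(":", 1)[0] if ":" in address else address
    return f"{host}:{port}"
```

For instance, `meta_label_name("prometheus.port")` gives the source label to use in relabel_configs, and `rewrite_address("172.18.0.3:80", "9113")` shows the effect of pointing the scrape at an exporter port (the IP address is made up).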

Alerting Rules

Create a file named prometheus/alerts.yml and include it in the main config:

rule_files:
  - 'alerts.yml'

Now define a rule that triggers when the webapp container is unreachable:

groups:
  - name: docker_alerts
    rules:
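      - alert: WebAppDown
        # absent() covers the case where service discovery drops the stopped
        # container and the up series disappears instead of reporting 0
        expr: up{job="webapp"} == 0 or absent(up{job="webapp"})
        for: 1m
        labels:
          severity: critical
          container: webapp   # name the recovery agent should restart
        annotations:
          summary: "WebApp container is down"
          description: "Job {{ $labels.job }} has been unreachable for more than 1 minute."
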
      - alert: HighHTTPLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.instance }}"

Each alert can have different severity, and Alertmanager will handle them accordingly.
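The for clause is what keeps a single failed scrape from triggering recovery. The following Python sketch is a simplified model of that behavior, not Prometheus code: an alert is pending from the first evaluation where the expression is true, and it fires once the expression has stayed true for the whole hold duration. The function name and the list-of-booleans input are assumptions.

```python
def alert_state(condition_history, eval_interval_s=15, for_s=60):
    """Return 'inactive', 'pending', or 'firing' after a series of evaluations.

    condition_history: one boolean per rule evaluation, oldest first.
    """
    active_for = None  # seconds the condition has been continuously true
    for is_true in condition_history:
        if not is_true:
            active_for = None          # any false evaluation resets the timer
        elif active_for is None:
            active_for = 0             # first true evaluation: alert is pending
        else:
            active_for += eval_interval_s
    if active_for is None:
        return "inactive"
    return "firing" if active_for >= for_s else "pending"
```

With a 15-second evaluation interval and for: 1m, the fifth consecutive true evaluation is the first one at which the alert fires.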

Setting Up Alertmanager

Alertmanager configuration (alertmanager/alertmanager.yml) determines how alerts are routed and what actions are taken. For self-healing, we route critical alerts to a webhook that triggers the recovery agent.

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'recovery-webhook'

receivers:
  - name: 'default'
    email_configs:
      - to: 'oncall@example.com'          # placeholder address
        from: 'alertmanager@example.com'  # placeholder address
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'secret'

  - name: 'recovery-webhook'
    webhook_configs:
      - url: 'http://recovery-agent:5001/webhook'
        send_resolved: true

Key details:

Critical alerts go to the webhook; less severe alerts notify via email.
send_resolved: true tells Alertmanager to notify the webhook when the alert clears, which allows the recovery agent to mark the incident as resolved.
The webhook URL uses Docker’s internal DNS (recovery-agent), which resolves to the container’s IP.
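It helps to know what the webhook receiver actually gets. The sketch below shows a trimmed example of the JSON body Alertmanager POSTs and a helper that picks out the firing alerts. The field names follow Alertmanager's webhook format; the values and the helper name `firing_alerts` are made up for illustration.

```python
import json

# Trimmed example of a webhook notification body; values are illustrative.
SAMPLE_PAYLOAD = """
{
  "version": "4",
  "status": "firing",
  "receiver": "recovery-webhook",
  "groupLabels": {"alertname": "WebAppDown"},
  "commonLabels": {"alertname": "WebAppDown", "severity": "critical"},
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "WebAppDown", "job": "webapp", "severity": "critical"},
      "annotations": {"summary": "WebApp container is down"},
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
"""


def firing_alerts(payload):
    """Return (alertname, labels) for every alert in the payload that is firing."""
    return [
        (a.get("labels", {}).get("alertname", ""), a.get("labels", {}))
        for a in payload.get("alerts", [])
        if a.get("status") == "firing"
    ]


if __name__ == "__main__":
    for name, labels in firing_alerts(json.loads(SAMPLE_PAYLOAD)):
        print(name, labels.get("job"))
```

Because send_resolved is true, the same endpoint also receives notifications whose alerts have status "resolved", which is why the helper filters on status rather than assuming every alert is firing.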

Testing the Alert Chain

Before automating recovery, verify that Prometheus and Alertmanager are communicating correctly. Use the Prometheus web UI (port 9090) to see active alerts, and temporarily stop the webapp container to confirm an alert fires. Check Alertmanager’s web UI (port 9093) to see that the alert reaches the webhook receiver.
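Stopping a container is one way to test the chain; another is to push a synthetic alert straight into Alertmanager, which exercises routing without touching the application. The sketch below uses Alertmanager's v2 HTTP API (POST /api/v2/alerts). The alert name PipelineTest, the function names, and the localhost URL are assumptions; the name is deliberately not WebAppDown so that a running recovery agent would ignore it.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname="PipelineTest", severity="critical", minutes=5):
    """Build the JSON-serialisable body for POST /api/v2/alerts."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity, "job": "webapp"},
        "annotations": {"summary": "Synthetic test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_test_alert(base_url="http://localhost:9093"):
    """POST the synthetic alert to Alertmanager and return the HTTP status."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(build_test_alert()).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

After calling `send_test_alert()`, the alert should appear in the Alertmanager UI under the recovery-webhook receiver and expire on its own once endsAt has passed.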

Building the Recovery Agent

The recovery agent is a simple HTTP server that listens for POST requests from Alertmanager and executes Docker commands. We’ll use Python for readability, but you can use Go, Node.js, or a shell script with netcat.

Create `recovery-agent/Dockerfile`

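FROM python:3.12-slim
# agent.py shells out to the docker CLI, which the Python image does not ship.
# Copy the client binary from the official CLI image (tag shown is an example).
COPY --from=docker:27-cli /usr/local/bin/docker /usr/local/bin/docker
RUN pip install --no-cache-dir flask
COPY agent.py /agent.py
CMD ["python", "/agent.py"]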

Create `recovery-agent/agent.py`

import logging
import subprocess

from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("recovery-agent")


@app.route('/webhook', methods=['POST'])
def webhook():
    # silent=True returns None instead of raising on a missing or malformed body
    alert_data = request.get_json(silent=True)
    if not alert_data:
        return 'No data', 400

    for alert in alert_data.get('alerts', []):
        labels = alert.get('labels', {})
        status = alert.get('status', '')
        alertname = labels.get('alertname', '')
        instance = labels.get('instance', '')

        # Only act on firing WebAppDown alerts; resolved notifications are ignored
        if status == 'firing' and alertname == 'WebAppDown':
            # Prefer an explicit container label; otherwise fall back to the host
            # part of the instance label (e.g., "webapp:8080" -> "webapp")
            container = labels.get('container') or instance.split(':')[0]
            if not container:
                log.warning("Alert %s names no container; skipping", alertname)
                continue
            try:
                # Argument list (no shell), so a label value cannot inject commands
                result = subprocess.run(
                    ['docker', 'restart', container],
                    capture_output=True,
                    text=True,
                    timeout=30
                )
                if result.returncode == 0:
                    log.info("Restarted container %s successfully", container)
                else:
                    log.error("Failed to restart %s: %s", container, result.stderr)
            except (OSError, subprocess.SubprocessError) as e:
                # OSError: docker CLI missing; SubprocessError: includes timeouts
                log.error("Error restarting container: %s", e)
    return 'OK', 200


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)

Important considerations:

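The agent runs with access to the Docker socket, which is effectively root on the host. Dropping capabilities with --cap-drop or running as non-root does not limit what the agent can ask the Docker daemon to do, so in production place a proxy in front of the socket that allows only the API calls the agent needs, and keep the webhook port off public networks.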
You may want to add rate limiting to prevent restart loops. For example, only allow one restart per container per 5 minutes.
Log all actions to a file or stdout for debugging.
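The rate-limiting point can be sketched as a small in-memory cooldown tracker. This is one possible approach under simple assumptions (a single agent process, state lost on restart); the class name and the injectable clock are illustrative and not part of the agent shown above.

```python
import time


class RestartCooldown:
    """Allow at most one restart per container within cooldown_s seconds."""

    def __init__(self, cooldown_s=300.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock               # injectable so tests can fake time
        self._last_restart = {}          # container name -> time of last allowed restart

    def allow(self, container):
        """Return True and record the attempt if the container is outside its cooldown."""
        now = self.clock()
        last = self._last_restart.get(container)
        if last is not None and now - last < self.cooldown_s:
            return False
        self._last_restart[container] = now
        return True
```

In the webhook handler, the restart would then be wrapped in `if cooldown.allow(container):`, with a log line (and ideally a notification to a human) when a restart is suppressed.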

Handling Multiple Services

If you have many services, extend the agent to look up container names from a configuration file or use Docker labels. Example: match alert labels to container labels like com.docker.compose.service.
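As a sketch of the label-based lookup, the helper below asks the Docker CLI for all containers (running or stopped) that belong to a Compose service, using the com.docker.compose.service label that Compose sets on the containers it creates. The function names and the injectable `run` parameter are assumptions made for testability.

```python
import subprocess


def containers_for_service_cmd(service):
    """Build the docker CLI call that lists containers of a Compose service."""
    return [
        "docker", "ps", "--all",   # --all includes stopped containers
        "--filter", f"label=com.docker.compose.service={service}",
        "--format", "{{.Names}}",
    ]


def containers_for_service(service, run=subprocess.run):
    """Return container names for the service; `run` is injectable for testing."""
    result = run(containers_for_service_cmd(service),
                 capture_output=True, text=True, timeout=10)
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

The agent could then map an alert's job label to a service name and restart every container the lookup returns, instead of deriving a container name from the instance label.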

Automating Recovery Actions Beyond Restart

Restarting a container is the simplest self-healing action, but you can do much more:

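Scale up/down – Use docker compose up -d --scale webapp=3, or call a Docker Swarm / Kubernetes API, to increase replicas when latency spikes. Scaling a Compose service only works if it does not pin a container name or a fixed host port.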
Rollback – If a new deployment causes failures, trigger a rollback to a previous image tag.
Cleanup – Remove stuck containers or unused volumes.
Notify – Send a Slack or PagerDuty message as a fallback if the automated recovery fails.

Each action can be its own webhook endpoint or a parameter in the alert data.
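One way to organise several actions is a dispatch table keyed by alert name. The sketch below only builds the command to run, which keeps the decision logic easy to test. The mapping itself is hypothetical: in the configuration shown earlier, HighHTTPLatency is a warning and would reach the agent only if its route were changed to the webhook receiver.

```python
def restart_cmd(target):
    """Command that restarts a single container."""
    return ["docker", "restart", target]


def scale_cmd(target, replicas=3):
    """Command that scales a Compose service (run from the project directory)."""
    return ["docker", "compose", "up", "-d", "--scale", f"{target}={replicas}"]


# Hypothetical mapping from alert name to the command builder to use.
ACTIONS = {
    "WebAppDown": restart_cmd,
    "HighHTTPLatency": scale_cmd,
}


def plan_action(alertname, target):
    """Return the command for an alert, or None if no automated action is defined."""
    builder = ACTIONS.get(alertname)
    return builder(target) if builder else None
```

Returning None for unknown alerts makes "do nothing and leave it to a human" the default, which is the safer failure mode for automation with this much access.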

Testing Your Self-Healing Pipeline

Before going live, test each scenario:

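Manual kill: Run docker kill webapp (or docker compose kill webapp if the container has a generated name) and observe that Prometheus detects the loss, Alertmanager sends a webhook, and the recovery agent restarts the container. Expect this to take somewhat longer than one minute: the for: 1m hold comes on top of the scrape and evaluation intervals (15s each) and Alertmanager's group_wait (10s).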
Simulated high latency: Use a tool like tc (traffic control) or a proxy to introduce delay. Verify that the high-latency alert fires but does not trigger a restart (since you defined it as severity warning, which routes to email, not the webhook).
Flapping prevention: Rapidly stop and start the container. Ensure the agent doesn’t get stuck in an infinite restart loop—implement a cooldown.

Monitoring the Monitor

Your self-healing system itself must be monitored. Use Prometheus to scrape the recovery agent’s metrics (add a /metrics endpoint that exposes restart counts and errors). Also set up an alert if the recovery agent itself goes down.
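A /metrics endpoint does not need much. The sketch below renders two counters in the Prometheus text exposition format using only the standard library; the metric names and the module-level counters are assumptions, and a real agent might prefer the official prometheus_client package instead.

```python
from collections import Counter

restarts_total = Counter()   # container name -> successful restarts
errors_total = Counter()     # container name -> failed restart attempts


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP recovery_agent_restarts_total Successful container restarts.",
        "# TYPE recovery_agent_restarts_total counter",
    ]
    lines += [f'recovery_agent_restarts_total{{container="{c}"}} {n}'
              for c, n in sorted(restarts_total.items())]
    lines += [
        "# HELP recovery_agent_errors_total Failed restart attempts.",
        "# TYPE recovery_agent_errors_total counter",
    ]
    lines += [f'recovery_agent_errors_total{{container="{c}"}} {n}'
              for c, n in sorted(errors_total.items())]
    return "\n".join(lines) + "\n"
```

In a Flask agent, a route for /metrics could return this string with a text/plain content type, and the restart code would increment `restarts_total[container]` or `errors_total[container]` after each attempt.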

Advanced Patterns and Best Practices

Decouple Alert Logic from Recovery Logic

Keep your alert rules focused on detection and severity. Recovery actions should be handled externally by the agent, not embedded in Alertmanager. This separation makes it easier to change recovery strategies without touching monitoring configuration.

Use Labels for Flexible Targeting

Label your containers with metadata that the recovery agent can read:

com.example.recovery.action=restart
com.example.recovery.cooldown=300

The agent can then read these labels from the Docker API and adjust its behavior accordingly.
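A sketch of how the agent might turn those labels into a policy is shown below. The label keys are the ones from the example above; the `RecoveryPolicy` type, the set of allowed actions, and the defaults are assumptions. The label dictionary itself could come from docker inspect or the Docker API.

```python
from dataclasses import dataclass

ALLOWED_ACTIONS = {"restart", "none"}


@dataclass(frozen=True)
class RecoveryPolicy:
    action: str = "none"     # default: do nothing unless the container opts in
    cooldown_s: int = 300


def policy_from_labels(labels):
    """Build a RecoveryPolicy from a container's label dictionary."""
    action = labels.get("com.example.recovery.action", "none")
    if action not in ALLOWED_ACTIONS:
        action = "none"      # ignore unknown actions rather than guessing
    try:
        cooldown = int(labels.get("com.example.recovery.cooldown", 300))
    except ValueError:
        cooldown = 300       # fall back to the default on a malformed value
    return RecoveryPolicy(action=action, cooldown_s=max(cooldown, 0))
```

Making "none" the default means a container is only ever restarted automatically if someone labelled it explicitly.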

Integrate with Orchestrators

If you use Docker Swarm, the recovery agent can call docker service update --force to recreate a service's tasks, or docker stack deploy to re-apply the stack configuration. With Docker Compose, docker compose up -d --force-recreate has a similar effect, but use it carefully because it recreates containers even when their configuration has not changed.

Handle Persistent State

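Be careful when restarting containers that hold data. If your application uses volumes, ensure that a restart cannot corrupt them. Docker has no pre-stop hook, so the application should handle SIGTERM by flushing caches or writing state to a durable store, and the stop timeout (10 seconds by default; docker restart -t overrides it) should be long enough for that to finish.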
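When Docker stops or restarts a container it sends SIGTERM to the main process and follows up with SIGKILL once the stop timeout expires. A minimal Python sketch of an application that uses that window to persist state is shown below; `work_once` and `flush_state` are hypothetical callables supplied by the application.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the request; the main loop flushes state when it sees the flag."""
    shutdown_requested.set()


def install_handler():
    """Register the handler; call this once from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run(work_once, flush_state):
    """Call work_once until SIGTERM arrives, then flush state exactly once."""
    while not shutdown_requested.is_set():
        work_once()
    flush_state()
```

The handler only sets a flag, so the actual flush happens in normal program flow rather than inside the signal handler, which keeps it safe to do I/O there.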

External Resources

To deepen your understanding, explore:

Prometheus Alertmanager Documentation – Official guide for configuration and routing.
Docker Restart Policies – Understand the built-in levels of self-healing.
Prometheus Exporters – A list of exporters to monitor various services.

Conclusion

Combining Docker with Prometheus Alertmanager creates a robust foundation for self-healing infrastructure. By configuring Prometheus to scrape metrics and define meaningful alerts, and by building a lightweight recovery agent that translates webhook notifications into Docker commands, you can dramatically reduce downtime without human effort. Start small—automate restarts for one critical service—then expand to more sophisticated actions like scaling, rollback, or cleanup. The key is to iterate: monitor the recovery actions themselves, refine alert thresholds, and always keep a human in the loop for scenarios that automation cannot safely handle. With this pipeline in place, your infrastructure becomes not just monitored, but reactive and resilient.