Building a Self-healing Infrastructure with Docker and Prometheus Alertmanager
Introduction
Modern infrastructure demands resilience beyond basic monitoring. A self-healing system automatically detects failures—from a crashed container to a misconfigured service—and executes corrective actions before users ever notice. By combining Docker's container orchestration with Prometheus Alertmanager, you can build a pipeline that scrapes metrics, evaluates alerting rules, and triggers recovery workflows without manual intervention. This article walks through the complete setup, from architecture decisions to production-ready automation scripts, using practical examples and proven patterns.
What is Self-Healing Infrastructure?
Self-healing infrastructure is a design pattern where your platform automatically identifies and recovers from known failure states. It reduces mean time to recovery (MTTR) by eliminating the need for an on-call engineer to restart a service or reroute traffic. The core loop involves:
- Detection – Collecting telemetry (CPU, memory, health check status, latency) from every component.
- Evaluation – Applying rules that define what constitutes a problem (e.g., HTTP 5xx rate exceeds 5% in five minutes).
- Notification & Action – Alerting teams when necessary, and executing automated recovery (e.g., restart a container, scale up a service, recycle a node).
This approach is especially valuable in containerized environments where ephemeral workloads change constantly and manual recovery is both slow and error-prone.
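The detect, evaluate, act loop above can be sketched in a few lines. This is only an illustration: the Sample type, the function names, and the "restart" action string are hypothetical, and the 5% threshold is taken from the example rule in the list.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Telemetry collected for one evaluation window (detection step)."""
    total_requests: int
    server_errors: int  # HTTP 5xx responses seen in the window


def error_rate(sample: Sample) -> float:
    # Avoid dividing by zero when the service received no traffic.
    if sample.total_requests == 0:
        return 0.0
    return sample.server_errors / sample.total_requests


def evaluate(sample: Sample, threshold: float = 0.05) -> bool:
    """Evaluation step: True when the 5xx rate exceeds the threshold."""
    return error_rate(sample) > threshold


def act(unhealthy: bool) -> str:
    """Action step: name the recovery action to run (hypothetical names)."""
    return "restart" if unhealthy else "none"


def run_once(sample: Sample) -> str:
    return act(evaluate(sample))
```

In the rest of the setup, Prometheus plays the role of evaluate and the recovery agent plays the role of act.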
Core Components for the Setup
To implement self-healing with Docker and Prometheus, you need these four pieces:
- Docker & Docker Compose – To run your application containers and the monitoring stack itself in a reproducible, isolated way.
- Prometheus – A time-series database and monitoring system that scrapes metrics from your containers via their endpoints.
- Alertmanager – The component that receives alerts from Prometheus, deduplicates them, and routes them to receivers—including webhooks that trigger recovery actions.
- Recovery Script / Service – A lightweight HTTP server (or shell script) that listens for webhook calls from Alertmanager and executes docker restart, docker-compose up -d, or other corrective commands.
All components will run as containers, making the entire system portable and easy to version-control.
Why Use Prometheus Instead of Docker’s Built-In Restart Policies?
Docker’s --restart=always or --restart=unless-stopped can restart a container if it exits, but it cannot detect a hanging service that is alive but not responding to requests. Prometheus gives you the ability to detect application-level failures—like an HTTP 503 error budget being exceeded—and act only when the service is truly unhealthy according to your own logic.
Setting Up the Docker Environment
Create a directory structure for the project:
self-healing-infra/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
└── recovery-agent/
    ├── Dockerfile
    └── agent.py
Define the Docker Compose File
Start with a docker-compose.yml that includes a sample web application, Prometheus, Alertmanager, and the recovery agent. The top-level version key (3.8 here) matters only to older docker-compose releases; Docker Compose V2 treats it as obsolete and ignores it.
Tip: Always pin image versions (e.g., prom/prometheus:v2.54.0) to avoid breaking changes.
Example snippet:
version: '3.8'
services:
  webapp:
    image: nginx:alpine
    ports:
      - "8080:80"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost"]
      interval: 30s
      timeout: 10s
      retries: 3
    labels:
      - "prometheus.job=webapp"
      - "prometheus.port=9113" # if using nginx-exporter
  prometheus:
    image: prom/prometheus:v2.54.0
    # Assumption: the image's default unprivileged user usually cannot read
    # the Docker socket, so run as root (or grant group access instead).
    user: root
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      # Needed by docker_sd_configs in prometheus.yml
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "9090:9090"
    depends_on:
      - webapp
  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  recovery-agent:
    build: ./recovery-agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "5001:5001" # webhook listener
    depends_on:
      - alertmanager
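Mounting the Docker socket inside the recovery agent is a common pattern, but be aware of the security implications: anything that can reach the socket effectively has root access to the host. In production, consider placing a filtering proxy in front of the socket that exposes only the API calls the agent needs, or connecting to Docker’s remote API over TLS.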
Configuring Prometheus
Prometheus needs to know where to scrape metrics and what rules to evaluate for alerting. Create prometheus/prometheus.yml:
Scrape Configuration
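global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Forward firing alerts to the Alertmanager container (Compose service name)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090'] # Prometheus itself
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      # Scrape only containers that carry the prometheus.job label
      - source_labels: [__meta_docker_container_label_prometheus_job]
        regex: (.+)
        action: keep
      - source_labels: [__meta_docker_container_label_prometheus_job]
        regex: (.+)
        target_label: job
        replacement: $1
      # Point the scrape address at the port named in the prometheus.port label
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        target_label: __address__
        replacement: '$1:$2'
      # Expose the container name (without its leading slash) as a label
      - source_labels: [__meta_docker_container_name]
        regex: '/(.+)'
        target_label: container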
This uses Docker service discovery so Prometheus automatically finds containers with the label prometheus.job and scrapes their metrics. For discovery to work, the Prometheus container itself needs read access to the Docker socket. If you prefer static targets, you can list them directly.
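To see how a relabel rule can turn a prometheus.port label into the scrape port, the sketch below mimics the mechanics in Python: Prometheus joins the source label values with ";" and requires the regex to match the whole string. The specific regex and the rewrite_address helper are assumptions for illustration, not something Prometheus ships.

```python
import re

# Joined input looks like "<address>;<port label>", e.g. "172.18.0.3:80;9113".
# Group 1 is the host, the optional ":<digits>" is the discovered port,
# group 2 is the port taken from the container label.
ADDRESS_PORT = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_label: str) -> str:
    """Return the scrape address after applying the relabel rule."""
    joined = f"{address};{port_label}"
    match = ADDRESS_PORT.fullmatch(joined)  # relabel regexes are fully anchored
    if match is None:
        # The rule does not apply, so the target keeps its original address.
        return address
    return f"{match.group(1)}:{match.group(2)}"
```

A container discovered at 172.18.0.3:80 with prometheus.port=9113 would therefore be scraped at 172.18.0.3:9113, while a container without the label keeps its discovered address.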
Alerting Rules
Create a file named prometheus/alerts.yml and include it in the main config:
rule_files:
  - 'alerts.yml'
Now define a rule that triggers when the webapp container is unreachable:
groups:
  - name: docker_alerts
    rules:
      - alert: WebAppDown
        # up == 0 needs a failed scrape. If the target drops out of service
        # discovery entirely, absent(up{job="webapp"}) covers that case.
        expr: up{job="webapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "WebApp container is down"
          description: "Container {{ $labels.instance }} has been unreachable for more than 1 minute."
      - alert: HighHTTPLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.instance }}"
Each alert can have different severity, and Alertmanager will handle them accordingly.
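The for: clause is easy to misread, so here is a simplified model of it. It assumes evaluations at a fixed interval and ignores everything else Prometheus does; the function name and parameters are hypothetical. An alert becomes pending on the first evaluation where the expression is true, and fires once it has stayed true for the full duration.

```python
from typing import Iterable, List, Optional


def alert_states(results: Iterable[bool],
                 interval_s: int = 15,
                 for_s: int = 60) -> List[str]:
    """Map per-evaluation expression results to alert states."""
    states: List[str] = []
    active_since: Optional[int] = None
    for i, is_true in enumerate(results):
        now = i * interval_s
        if not is_true:
            # A single false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        elapsed = now - active_since
        states.append("firing" if elapsed >= for_s else "pending")
    return states
```

With a 15-second evaluation interval and for: 1m, the alert fires on the fifth consecutive true evaluation, and a brief recovery in between starts the count again.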
Setting Up Alertmanager
Alertmanager configuration (alertmanager/alertmanager.yml) determines how alerts are routed and what actions are taken. For self-healing, we route critical alerts to a webhook that triggers the recovery agent.
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default'
  routes:
    # matchers replaces the deprecated match key
    - matchers:
        - severity="critical"
      receiver: 'recovery-webhook'
receivers:
  - name: 'default'
    email_configs:
      - to: 'oncall@example.com' # placeholder address
        from: 'alertmanager@example.com' # placeholder address
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'secret'
  - name: 'recovery-webhook'
    webhook_configs:
      - url: 'http://recovery-agent:5001/webhook'
        send_resolved: true
Key details:
- Critical alerts go to the webhook; less severe alerts notify via email.
- send_resolved: true tells Alertmanager to notify the webhook when the alert clears, which allows the recovery agent to mark the incident as resolved.
- The webhook URL uses Docker’s internal DNS (recovery-agent), which resolves to the container’s IP.
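It helps to know what the webhook receiver actually gets. The sketch below shows the general shape of an Alertmanager webhook body and a small helper that picks out the firing alerts. The concrete values (timestamps, URLs, group key) are made up for illustration, and firing_alerts is a hypothetical helper.

```python
import json

# Illustrative payload; field names follow the Alertmanager webhook format,
# the values are invented for this example.
SAMPLE_PAYLOAD = json.loads("""
{
  "version": "4",
  "groupKey": "{}/{severity=\\"critical\\"}:{alertname=\\"WebAppDown\\"}",
  "status": "firing",
  "receiver": "recovery-webhook",
  "groupLabels": {"alertname": "WebAppDown"},
  "commonLabels": {"alertname": "WebAppDown", "severity": "critical"},
  "commonAnnotations": {"summary": "WebApp container is down"},
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "WebAppDown", "instance": "webapp:8080",
                 "job": "webapp", "severity": "critical"},
      "annotations": {"summary": "WebApp container is down"},
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus:9090/graph"
    }
  ]
}
""")


def firing_alerts(payload: dict) -> list:
    """Return (alertname, instance) for every alert that is still firing."""
    return [
        (a.get("labels", {}).get("alertname", ""),
         a.get("labels", {}).get("instance", ""))
        for a in payload.get("alerts", [])
        if a.get("status") == "firing"
    ]
```

Because send_resolved is enabled, the same endpoint also receives bodies whose alerts have status "resolved", so the receiver has to check the status of each alert rather than assume every call means a failure.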
Testing the Alert Chain
Before automating recovery, verify that Prometheus and Alertmanager are communicating correctly. Use the Prometheus web UI (port 9090) to see active alerts, and temporarily stop the webapp container to confirm an alert fires. Check Alertmanager’s web UI (port 9093) to see that the alert reaches the webhook receiver.
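If you would rather not stop a container to exercise the routing, you can push a synthetic alert straight into Alertmanager through its v2 HTTP API. The following is a minimal sketch: the helper names and default label values are assumptions, and the base URL assumes the 9093 port mapping from the Compose file.

```python
import json
import urllib.request


def build_test_alert(alertname: str = "WebAppDown",
                     instance: str = "webapp:8080",
                     severity: str = "critical") -> list:
    """Build the JSON body (a list of alerts) for POST /api/v2/alerts."""
    return [{
        "labels": {
            "alertname": alertname,
            "instance": instance,
            "severity": severity,
        },
        "annotations": {"summary": "Synthetic test alert"},
    }]


def post_alerts(alerts: list, base_url: str = "http://localhost:9093") -> int:
    """Send the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling post_alerts(build_test_alert()) should make the alert appear in the Alertmanager UI. Keep in mind that a critical test alert is routed to the webhook like a real one, so use a distinct alertname if the recovery agent is already running.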
Building the Recovery Agent
The recovery agent is a simple HTTP server that listens for POST requests from Alertmanager and executes Docker commands. We’ll use Python for readability, but you can use Go, Node.js, or a shell script with netcat.
Create recovery-agent/Dockerfile
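FROM python:3.12-slim
# agent.py shells out to the docker CLI, so copy the static client binary
# from the official CLI image (the tag is an assumption; pin one you have tested)
COPY --from=docker:27-cli /usr/local/bin/docker /usr/local/bin/docker
RUN pip install --no-cache-dir flask
COPY agent.py /agent.py
CMD ["python", "/agent.py"]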
Create recovery-agent/agent.py
import subprocess

from flask import Flask, request

app = Flask(__name__)


@app.route('/webhook', methods=['POST'])
def webhook():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    alert_data = request.get_json(silent=True)
    if not alert_data:
        return 'No data', 400
    for alert in alert_data.get('alerts', []):
        labels = alert.get('labels', {})
        status = alert.get('status', '')
        alertname = labels.get('alertname', '')
        instance = labels.get('instance', '')
        # Only act on firing critical alerts
        if status == 'firing' and alertname == 'WebAppDown':
            # Prefer an explicit container label if the alert carries one;
            # otherwise use the host part of instance (e.g., "webapp:8080")
            container = labels.get('container') or instance.split(':')[0]
            if not container:
                print("Alert has no usable container name; skipping", flush=True)
                continue
            try:
                result = subprocess.run(
                    ['docker', 'restart', container],
                    capture_output=True,
                    text=True,
                    timeout=30
                )
                if result.returncode == 0:
                    print(f"Restarted container {container} successfully", flush=True)
                else:
                    print(f"Failed to restart {container}: {result.stderr}", flush=True)
            except (OSError, subprocess.SubprocessError) as e:
                # Covers a missing docker binary and the 30 s timeout
                print(f"Error restarting container: {e}", flush=True)
    return 'OK', 200


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001)
Important considerations:
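- The agent runs with access to the Docker socket, which is effectively root access to the host. In production, restrict what it can do: run the agent process as a non-root user, drop unneeded capabilities with --cap-drop, and consider placing a filtering proxy in front of the socket so that only the API calls the agent needs are allowed.
- You may want to add rate limiting to prevent restart loops. For example, only allow one restart per container per 5 minutes.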
- Log all actions to a file or stdout for debugging.
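The rate limit mentioned above can be as small as a per-container timestamp table. This is a sketch under assumptions: the RestartCooldown class is hypothetical, it keeps state in memory only (so it resets when the agent restarts), and the clock is injectable purely to make it testable.

```python
import time
from typing import Callable, Dict


class RestartCooldown:
    """Allow at most one restart per container per cooldown window."""

    def __init__(self, cooldown_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_restart: Dict[str, float] = {}

    def allow(self, container: str) -> bool:
        """Return True and record the attempt if a restart is permitted now."""
        now = self._clock()
        last = self._last_restart.get(container)
        if last is not None and now - last < self._cooldown_s:
            return False  # still cooling down; skip this restart
        self._last_restart[container] = now
        return True
```

In the webhook handler you would call allow(container) just before running docker restart and log a message when it returns False, so that repeated notifications for the same alert do not turn into a restart loop.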
Handling Multiple Services
If you have many services, extend the agent to look up container names from a configuration file or use Docker labels. Example: match alert labels to container labels like com.docker.compose.service.
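One way to do the label lookup is to ask Docker for the running containers that carry a given com.docker.compose.service label. The sketch below assumes the alert gives you a service name; the function names are hypothetical, and the run parameter exists only so the lookup can be exercised without a Docker daemon.

```python
import subprocess
from typing import Callable, List


def ps_command(service: str) -> List[str]:
    """Build a docker ps call that lists container names for one service."""
    return [
        "docker", "ps",
        "--filter", f"label=com.docker.compose.service={service}",
        "--format", "{{.Names}}",
    ]


def containers_for_service(service: str,
                           run: Callable = subprocess.run) -> List[str]:
    """Return the names of running containers that belong to the service."""
    result = run(ps_command(service), capture_output=True, text=True, timeout=10)
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Note that docker ps lists only running containers by default, so to find a stopped container you would add the -a flag to the command.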
Automating Recovery Actions Beyond Restart
Restarting a container is the simplest self-healing action, but you can do much more:
- Scale up/down – Use docker-compose up --scale webapp=3 or call a Docker Swarm / Kubernetes API to increase replicas when latency spikes.
- Rollback – If a new deployment causes failures, trigger a rollback to a previous image tag.
- Cleanup – Remove stuck containers or unused volumes.
- Notify – Send a Slack or PagerDuty message as a fallback if the automated recovery fails.
Each action can be its own webhook endpoint or a parameter in the alert data.
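If you go the "parameter in the alert data" route, a dispatch table keeps the agent readable. The sketch below assumes a hypothetical recovery_action alert label and a fixed replica count of 3 for scaling; it only plans the command and leaves execution to the caller.

```python
from typing import Callable, Dict, List, Optional

# Maps an action name to a function that builds the command for a target.
# The action names and the recovery_action label are assumptions.
ACTIONS: Dict[str, Callable[[str], List[str]]] = {
    "restart": lambda target: ["docker", "restart", target],
    "scale": lambda target: ["docker-compose", "up", "-d",
                             "--scale", f"{target}=3"],
    "cleanup": lambda target: ["docker", "rm", "-f", target],
}


def plan_command(alert: dict, default: str = "restart") -> Optional[List[str]]:
    """Choose the command for one alert, or None if nothing should run."""
    labels = alert.get("labels", {})
    action = labels.get("recovery_action", default)
    target = labels.get("container") or labels.get("instance", "").split(":")[0]
    builder = ACTIONS.get(action)
    if builder is None or not target:
        return None  # unknown action or no target: do nothing
    return builder(target)
```

Returning None for unknown actions is deliberate: an agent with access to the Docker socket should ignore anything it does not explicitly recognize.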
Testing Your Self-Healing Pipeline
Before going live, test each scenario:
- Manual kill: Run docker-compose kill webapp and observe that Prometheus detects the loss, Alertmanager sends a webhook, and the recovery agent restarts the container. Expect this to take somewhat longer than one minute: the for: 1m window comes on top of the scrape and evaluation intervals and Alertmanager’s group_wait.
- Simulated high latency: Use a tool like tc (traffic control) or a proxy to introduce delay. Verify that the high-latency alert fires but does not trigger a restart (since you defined it as severity warning, which routes to email, not the webhook).
- Flapping prevention: Rapidly stop and start the container. Ensure the agent doesn’t get stuck in an infinite restart loop; implement a cooldown.
Monitoring the Monitor
Your self-healing system itself must be monitored. Use Prometheus to scrape the recovery agent’s metrics (add a /metrics endpoint that exposes restart counts and errors). Also set up an alert if the recovery agent itself goes down.
Advanced Patterns and Best Practices
Decouple Alert Logic from Recovery Logic
Keep your alert rules focused on detection and severity. Recovery actions should be handled externally by the agent, not embedded in Alertmanager. This separation makes it easier to change recovery strategies without touching monitoring configuration.
Use Labels for Flexible Targeting
Label your containers with metadata that the recovery agent can read:
- com.example.recovery.action=restart
- com.example.recovery.cooldown=300
The agent can then read these labels from the Docker API and adjust its behavior accordingly.
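Reading those labels can be sketched as a small parsing step. The labels mapping is assumed to come from the Docker API, for example from the JSON printed by docker inspect --format '{{json .Config.Labels}}' for the container; RecoveryPolicy and policy_from_labels are hypothetical names, and the defaults are assumptions.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RecoveryPolicy:
    action: str = "restart"
    cooldown_s: int = 300


def policy_from_labels(labels: Mapping[str, str]) -> RecoveryPolicy:
    """Translate container labels into a recovery policy with safe defaults."""
    action = labels.get("com.example.recovery.action", "restart")
    raw_cooldown = labels.get("com.example.recovery.cooldown", "300")
    try:
        cooldown_s = int(raw_cooldown)
    except ValueError:
        cooldown_s = 300  # fall back to the default on a malformed label
    return RecoveryPolicy(action=action, cooldown_s=max(cooldown_s, 0))
```

Falling back to defaults keeps a typo in a label from disabling recovery for that container altogether.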
Integrate with Orchestrators
If you use Docker Swarm in production, the recovery agent can call docker service update --force to recreate a service’s tasks, or docker stack deploy to re-apply the stack configuration. With Docker Compose, docker-compose up -d --force-recreate recreates containers even when their configuration has not changed, so use it carefully.
Handle Persistent State
Be careful when restarting containers that hold data. If your application uses volumes, ensure that restarting doesn’t corrupt data. Consider a pre-stop hook that flushes caches or writes state to a durable store.
External Resources
To deepen your understanding, explore:
- Prometheus Alertmanager Documentation – Official guide for configuration and routing.
- Docker Restart Policies – Understand the built-in levels of self-healing.
- Prometheus Exporters – A list of exporters to monitor various services.
Conclusion
Combining Docker with Prometheus Alertmanager creates a robust foundation for self-healing infrastructure. By configuring Prometheus to scrape metrics and define meaningful alerts, and by building a lightweight recovery agent that translates webhook notifications into Docker commands, you can dramatically reduce downtime without human effort. Start small—automate restarts for one critical service—then expand to more sophisticated actions like scaling, rollback, or cleanup. The key is to iterate: monitor the recovery actions themselves, refine alert thresholds, and always keep a human in the loop for scenarios that automation cannot safely handle. With this pipeline in place, your infrastructure becomes not just monitored, but reactive and resilient.