Implementing Docker Container Health Monitoring with Prometheus and Alertmanager
Modern applications increasingly rely on Docker containers to deliver consistent, scalable deployments. However, as the number of running containers grows, so does the difficulty of keeping each one healthy and performing well. Proactive monitoring is no longer optional; it is a critical part of any production environment. This guide walks through implementing Docker container health monitoring with Prometheus and Alertmanager. You will learn how to collect metrics, set up meaningful alerts, and respond to incidents before they impact users.
By the end of this article, you will have a production-ready monitoring stack that scrapes container-level metrics, exposes them to Prometheus, and sends notifications via email, Slack, or other channels when problems arise. We assume you have basic familiarity with Docker and Docker Compose, but we will explain each step in detail.
Understanding the Core Components
Before diving into the setup, it helps to understand how Prometheus and Alertmanager work together and what additional tools facilitate Docker monitoring.
Prometheus
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It operates by pulling (scraping) metrics from configured endpoints at specified intervals and storing them in a time-series database. The powerful PromQL query language allows you to slice, aggregate, and transform the collected data. Prometheus excels at collecting numeric time-series data such as CPU usage, memory consumption, request rates, and container health states.
Prometheus can scrape metrics from various exporters. For Docker containers, the most common exporters are cAdvisor (for container-level resource usage) and the Node Exporter (for host-level metrics). Prometheus also supports service discovery, which can automatically detect new containers when integrated with orchestrators like Kubernetes.
Learn more on the official project page: Prometheus Overview
Alertmanager
Alertmanager is the component that handles alerts fired by Prometheus. It deduplicates, groups, and routes alerts to the appropriate receivers. Alertmanager supports multiple notification channels including email, Slack, PagerDuty, OpsGenie, and custom webhooks. Its flexible routing tree allows you to send high-severity alerts to email and pagers while low-severity alerts go to a Slack channel.
Alertmanager also provides silencing and inhibition features, giving operators time to investigate without being flooded by repetitive notifications. Deployment is usually side-by-side with Prometheus, often via a shared Docker Compose file.
Official documentation: Alertmanager Configuration
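To make the deduplicate, group, and route steps described above more concrete, here is a toy Python model. It is not how Alertmanager is implemented; the receiver names and the severity-to-receiver table are assumptions chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical routing table: severity label -> receivers (illustration only).
ROUTES = {
    "critical": ["email-admin", "pager"],
    "warning": ["slack-team"],
}
DEFAULT_RECEIVERS = ["slack-team"]


def route(alert):
    """Pick receivers from the alert's severity label."""
    return ROUTES.get(alert["labels"].get("severity"), DEFAULT_RECEIVERS)


def group_and_dedupe(alerts, group_by=("alertname",)):
    """Drop alerts with identical label sets, then bucket by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        identity = frozenset(alert["labels"].items())
        if identity in seen:  # same label set means the same alert: keep one copy
            continue
        seen.add(identity)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighMemoryUsage", "name": "web", "severity": "critical"}},
    {"labels": {"alertname": "HighMemoryUsage", "name": "web", "severity": "critical"}},
    {"labels": {"alertname": "HighMemoryUsage", "name": "db", "severity": "critical"}},
    {"labels": {"alertname": "HighCPUUsage", "name": "web", "severity": "warning"}},
]

grouped = group_and_dedupe(alerts)
for key, members in grouped.items():
    print(key, [a["labels"]["name"] for a in members], route(members[0]))
```

The duplicate "web" memory alert collapses into one entry, the two memory alerts end up in a single group (one notification instead of two), and the warning goes to a different receiver than the critical alerts.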
cAdvisor (Container Advisor)
cAdvisor (Container Advisor) is an open-source agent that collects, aggregates, and exports resource usage and performance characteristics of running containers. It exposes a rich set of metrics including CPU, memory, network, and filesystem usage. Prometheus can scrape cAdvisor’s /metrics endpoint to obtain per-container data. cAdvisor runs as a Docker container and requires access to the Docker socket and the host’s /sys and /var/lib/docker directories. It works out-of-the-box with minimal configuration.
cAdvisor GitHub: cAdvisor Repository
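The /metrics endpoint mentioned above returns plain text in the Prometheus exposition format. The following Python sketch parses a simplified subset of that format to show what a scrape actually reads. The sample text is hand-written for illustration, not captured cAdvisor output, and the parser skips details such as escaped label values being unescaped.

```python
import re

# Matches lines such as: metric_name{label="value",...} 123.4 [timestamp]
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric, labels, value) for each sample line; skip comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE metadata
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


# Hand-written sample in exposition format (values are made up).
SAMPLE_TEXT = """
# HELP container_memory_usage_bytes Current memory usage in bytes.
# TYPE container_memory_usage_bytes gauge
container_memory_usage_bytes{id="/docker/abc",name="web"} 5.24288e+07 1700000000000
container_last_seen{id="/docker/abc",name="web"} 1.7e+09
"""

samples = list(parse_samples(SAMPLE_TEXT))
for metric, labels, value in samples:
    print(metric, labels.get("name"), value)
```

Each sample is a metric name, a set of labels (here the container name), and a number, which is exactly the shape PromQL selectors such as `{name!=""}` filter on.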
Setting Up the Monitoring Stack with Docker Compose
The easiest way to deploy Prometheus, cAdvisor, and Alertmanager is using Docker Compose. Create a project directory and navigate into it. Then create a file named docker-compose.yml with the following content:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
    ports:
      - "9090:9090"
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    restart: unless-stopped
This Compose file defines three services:
- prometheus: mounts a configuration file and a rules file, exposes port 9090, and will store data inside the container (you may add a persistent volume for production).
- alertmanager: mounts its own configuration file, exposes port 9093.
- cadvisor: mounts the volumes it needs to read host system data, reads the Docker socket to list containers, and exposes port 8080.
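Once the stack is started, a quick way to confirm that all three services answer is to probe their health endpoints. The sketch below assumes the host port mappings from the Compose file above and uses the `/-/healthy` endpoints of Prometheus and Alertmanager and the `/healthz` endpoint of cAdvisor; the function names are illustrative.

```python
import http.client
import urllib.error
import urllib.request

# Health endpoints, assuming the host port mappings from the Compose file.
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "cadvisor": "http://localhost:8080/healthz",
}


def is_healthy(url, timeout=3.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def check_stack(urls=HEALTH_URLS, probe=is_healthy):
    """Probe every service and return {service_name: bool}."""
    return {name: probe(url) for name, url in urls.items()}
```

After `docker compose up -d`, calling `check_stack()` should return `True` for every service once the containers have finished starting. The `probe` parameter exists so the logic can be exercised without a running stack.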
Configuring Prometheus to Scrape Docker Metrics
Create a file named prometheus.yml in the same directory. This is the main configuration file for Prometheus. The minimal version to scrape cAdvisor looks like this:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "alert.rules.yml"
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    metrics_path: /metrics
    # Add additional relabel configs if needed
Key points:
- scrape_interval: how often Prometheus scrapes targets (15 seconds is fine for container monitoring).
- alerting: tells Prometheus where Alertmanager is running (here we use the Docker Compose service name).
- rule_files: location of the alert rules file.
- scrape_configs: defines the cAdvisor target. Because Prometheus and cAdvisor share the Compose network, the hostname cadvisor is resolved by Docker's built-in DNS.
You can add more scrape jobs, for example to monitor the Prometheus server itself (job_name: 'prometheus' with target localhost:9090) or a Node Exporter on the host.
For advanced setups, you may use Docker service discovery to automatically scrape all containers running on a host or across a swarm. Refer to Prometheus’ scrape configuration documentation.
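To confirm that the scrape job defined above is working, you can ask Prometheus for the built-in `up` metric through its HTTP API (`/api/v1/query`). The Python sketch below builds the query URL and reduces a vector response to a dictionary. The sample response is hand-written in the shape the API returns, and the base URL assumes the port mapping from the Compose file.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def vector_to_dict(payload, label):
    """Map the chosen label of each series in a vector result to its float value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    return {s["metric"].get(label, ""): float(s["value"][1]) for s in data["result"]}


def query(base_url, promql, timeout=5.0):
    """Run an instant query against a live Prometheus (needs the stack running)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)


# Hand-written sample shaped like an /api/v1/query response, for offline illustration.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "cadvisor", "instance": "cadvisor:8080"},
             "value": [1700000000.0, "1"]},
        ],
    },
}

print(build_query_url("http://localhost:9090", 'up{job="cadvisor"}'))
print(vector_to_dict(sample, "job"))
```

With the stack running, `vector_to_dict(query("http://localhost:9090", "up"), "job")` should report 1.0 for each job whose last scrape succeeded and 0.0 for any that failed.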
Configuring Alertmanager
Create a file named alertmanager.yml. The following example sets up a global timeout, a default route, and an email receiver. For production, replace with your SMTP credentials.
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'your_username'
  smtp_auth_password: 'your_password'
route:
  receiver: 'email-admin'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: 'email-admin'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
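Alternatively, to send alerts to Slack, replace the email receiver with the following and change route.receiver to 'slack-team' so that the route points at a receiver that exists: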
receivers:
  - name: 'slack-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
Alertmanager supports many integrations. See the Alertmanager configuration reference for details.
Defining Alert Rules for Container Health
Now create a file named alert.rules.yml. This file contains Prometheus alerting rules that evaluate expressions against metrics and fire alerts. Below are several rules specific to Docker containers.
groups:
  - name: container_rules
    interval: 30s
    rules:
      - alert: ContainerNotReady
        expr: |
          time() - container_last_seen{name!=""} > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} not ready"
          description: "Container {{ $labels.name }} has not been seen for 5 minutes."
      - alert: HighMemoryUsage
        # Compare as a ratio (0-1) because humanizePercentage expects a ratio.
        # The "> 0" filter skips containers that have no memory limit.
        expr: |
          container_memory_usage_bytes{name!=""}
            / (container_spec_memory_limit_bytes{name!=""} > 0) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} memory usage > 90%"
          description: "Container {{ $labels.name }} is using {{ $value | humanizePercentage }} of its memory limit."
      - alert: HighCPUUsage
        # Summed per container; 0.8 means 80% of one CPU core.
        expr: |
          sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} CPU usage > 80% of one core"
          description: "Container {{ $labels.name }} CPU usage is {{ $value | humanizePercentage }} of one core over 5 minutes."
      - alert: ContainerDown
        # absent() matches only when no named container is reported at all;
        # that result carries no name label, so the name in the annotations is empty.
        expr: |
          absent(container_last_seen{name!=""}) or (time() - container_last_seen{name!=""} > 120)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "Container {{ $labels.name }} has not reported metrics for 2 minutes."
Explanation of key metrics:
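- container_last_seen: a timestamp that cAdvisor updates while a container is active.
- container_memory_usage_bytes and container_spec_memory_limit_bytes: compare actual usage to the configured limit.
- container_cpu_usage_seconds_total: cumulative CPU time; use rate() to get a per-second average.
- absent(): returns a result only when no series match its selector, so it catches the case where no named container reports at all (for example, cAdvisor is down); a single stopped container is caught by the container_last_seen comparison instead.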
You can customize thresholds and intervals based on your workload. Containers without a memory limit report a limit of 0, so a usage-to-limit ratio is meaningless for them; exclude them from the expression or alert on host-level memory metrics instead.
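The arithmetic behind the CPU and memory rules can be sketched in a few lines of Python. This is a simplified illustration with made-up numbers: unlike PromQL's rate(), `simple_rate` ignores counter resets and does no extrapolation, and treating a limit of 0 as "no limit" is an assumption about how cAdvisor reports unlimited containers.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplified: ignores counter resets and does not extrapolate to the
    window boundaries the way PromQL's rate() does.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def memory_ratio(usage_bytes, limit_bytes):
    """Usage as a fraction of the limit, or None when no limit is set (limit 0)."""
    if limit_bytes <= 0:
        return None
    return usage_bytes / limit_bytes


# Hypothetical counter samples: 5 minutes apart, 240 CPU-seconds consumed.
cpu_samples = [(0.0, 1000.0), (300.0, 1240.0)]
print(simple_rate(cpu_samples))                 # 0.8, i.e. 80% of one core
print(memory_ratio(471_859_200, 524_288_000))   # 0.9 (450 MiB of 500 MiB)
print(memory_ratio(471_859_200, 0))             # None: no limit configured
```

A rate of 0.8 CPU-seconds per second means the container kept 80% of one core busy over the window, which is why multiplying or comparing against a per-core threshold only makes sense once you know how many cores the container may use.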
Visualizing Metrics with Grafana
While Prometheus provides a basic expression browser, Grafana offers a much richer dashboarding experience. To deploy Grafana alongside your stack, add this service to your docker-compose.yml:
  # Add under the existing services: key
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_admin_password
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped
# Add at the top level of the file
volumes:
  grafana_data:
After starting the stack, log into Grafana at http://localhost:3000 with admin / your_admin_password. Add a Prometheus data source pointing to http://prometheus:9090. Then import a community dashboard for Docker monitoring—for example, Dashboard 893 (cAdvisor exporter) or build your own panels using cAdvisor metrics.
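If you prefer to script the data source step instead of clicking through the UI, Grafana's HTTP API accepts a POST to `/api/datasources`. The Python sketch below only prepares the request; the credentials are the placeholders from the Compose snippet, and the helper names are illustrative.

```python
import base64
import json
import urllib.request


def datasource_payload(name="Prometheus", url="http://prometheus:9090"):
    """JSON body for Grafana's POST /api/datasources."""
    return {"name": name, "type": "prometheus", "url": url,
            "access": "proxy", "isDefault": True}


def build_request(grafana_url, user, password, payload):
    """Prepare (but do not send) the authenticated HTTP request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )


request = build_request("http://localhost:3000", "admin", "your_admin_password",
                        datasource_payload())
print(request.get_method(), request.full_url)
# To actually create the data source once Grafana is up:
# urllib.request.urlopen(request, timeout=5)
```

The data source URL uses the service name `prometheus` rather than `localhost` because Grafana reaches Prometheus over the Compose network, not through the host port mapping.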
Best Practices for Production Monitoring
- Persist Prometheus data: Add a Docker volume mount for /prometheus in the Compose file to avoid losing metrics on container restarts.
- Use separate alert rules for critical vs. warning: Route critical alerts to pageable channels (email + SMS) and warnings to chat.
- Enable service discovery: If you run containers across multiple hosts, use Consul or the Docker SD provider to automatically update scrape targets.
- Set up a central Prometheus: For multi-node environments, consider a federated architecture where a global Prometheus collects from local instances.
- Monitor the monitors: Add a blackbox_exporter or self-monitoring alerts to know if Prometheus or Alertmanager become unreachable.
- Tune alert thresholds: Avoid alert fatigue by adjusting for durations and thresholds based on historical data.
- Secure endpoints: Use basic auth or OAuth2 for Prometheus and Alertmanager web UIs if exposed beyond localhost.
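When tuning `for` durations, it helps to see how the setting filters out short spikes. The following Python sketch is a simplified model of the pending and firing states of an alerting rule; it assumes evenly spaced evaluations and ignores everything else Prometheus does during rule evaluation.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: one boolean per evaluation, True when the expr matches.
    """
    states, active_since = [], None
    for step, matched in enumerate(condition_by_eval):
        now = step * eval_interval_s
        if not matched:
            active_since = None  # the condition cleared, so the timer resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# A 60-second spike evaluated every 30s never outlasts a 2-minute `for`.
spike = alert_states([True, True, True, False, False], 30, 120)
# A sustained breach fires on the evaluation 120s after it first matched.
sustained = alert_states([True] * 6, 30, 120)
print(spike)
print(sustained)
```

The spike stays in the pending state and never notifies anyone, while the sustained breach moves to firing on the fifth evaluation. A longer `for` suppresses more noise at the cost of a later notification.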
Troubleshooting Common Issues
cAdvisor not exposing metrics
Verify cAdvisor is running: curl http://localhost:8080/metrics. If empty, check the container logs: docker logs cadvisor. Common misconfigurations include missing volume mounts for /var/run/docker.sock or /sys.
Prometheus cannot scrape cAdvisor
Ensure the prometheus.yml target hostname matches the service name (cadvisor). If running on separate hosts, replace with the actual IP address. Check Prometheus targets at http://localhost:9090/targets.
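The same information shown on the targets page is available as JSON from `/api/v1/targets`, which is handy for scripted checks. The Python sketch below filters such a response for targets that are not up. The sample payload is hand-written in the shape of the API response, and the error text in it is a placeholder, not real Prometheus output.

```python
def unhealthy_targets(payload):
    """List (job, scrape_url, last_error) for every active target that is not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems.append((target["labels"].get("job", ""),
                             target.get("scrapeUrl", ""),
                             target.get("lastError", "")))
    return problems


# Hand-written sample shaped like an /api/v1/targets response.
sample_targets = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"},
             "scrapeUrl": "http://localhost:9090/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "cadvisor", "instance": "cadvisor:8080"},
             "scrapeUrl": "http://cadvisor:8080/metrics",
             "health": "down", "lastError": "placeholder error text"},
        ],
    },
}

for job, url, error in unhealthy_targets(sample_targets):
    print(f"{job}: {url} -> {error}")
```

To run it against a live server, fetch `http://localhost:9090/api/v1/targets`, decode the JSON, and pass the result to `unhealthy_targets`; the `lastError` field usually names the cause, such as a hostname that does not resolve.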
Alerts not firing
Verify the rule file is correctly formatted YAML and loaded. Prometheus logs will show errors during startup. Check the Alerts page at http://localhost:9090/alerts to see the state of each rule. Also confirm Alertmanager is reachable (the “alerting” section in prometheus.yml).
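Rule state can also be read from the `/api/v1/rules` endpoint, which reports for each rule whether it loaded cleanly and whether it is inactive, pending, or firing. The Python sketch below flattens such a response into rows; the sample payload is hand-written in the shape of the API response, and only the fields used here are included.

```python
def alert_rule_report(payload):
    """Return (group, rule, state, health, last_error) for each alerting rule."""
    rows = []
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") != "alerting":
                continue  # skip recording rules
            rows.append((group["name"], rule["name"], rule.get("state", ""),
                         rule.get("health", ""), rule.get("lastError", "")))
    return rows


# Hand-written sample shaped like an /api/v1/rules response.
sample_rules = {
    "status": "success",
    "data": {
        "groups": [
            {"name": "container_rules",
             "rules": [
                 {"name": "HighMemoryUsage", "type": "alerting",
                  "state": "pending", "health": "ok"},
                 {"name": "ContainerDown", "type": "alerting",
                  "state": "inactive", "health": "ok"},
             ]},
        ],
    },
}

for row in alert_rule_report(sample_rules):
    print(row)
```

An empty report means the rule file was not loaded at all, which points at the `rule_files` entry or the volume mount. A rule stuck in the pending state usually just has not satisfied its `for` duration yet.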
Email notifications not sent
Test Alertmanager’s configuration by sending a test alert. Use amtool check-config alertmanager.yml to validate syntax. Check Alertmanager logs for SMTP connection errors.
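One way to send a test alert is to post it directly to Alertmanager's `/api/v2/alerts` endpoint, bypassing Prometheus. The Python sketch below builds such a request without sending it. The alert name, labels, and helper names are made up for illustration, and the URL assumes the port mapping from the Compose file.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def make_test_alert(name="TestAlert", severity="warning", minutes=5, now=None):
    """Build a one-element alert list that expires after the given minutes."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "Manual test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def build_request(alertmanager_url, alerts):
    """Prepare (but do not send) the POST request carrying the alerts."""
    return urllib.request.Request(
        f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
alerts = make_test_alert(now=fixed_now)
request = build_request("http://localhost:9093", alerts)
print(request.get_method(), request.full_url)
print(json.dumps(alerts, indent=2))
# To send it once Alertmanager is running:
# urllib.request.urlopen(request, timeout=5)
```

If the alert shows up in the Alertmanager UI but no email arrives, the problem lies in the receiver configuration or the SMTP connection rather than in Prometheus or the rule file.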
Conclusion
Setting up Docker container health monitoring with Prometheus and Alertmanager provides a robust, scalable foundation for maintaining application reliability. By deploying cAdvisor to expose container metrics, scraping them with Prometheus, and routing alerts via Alertmanager, you gain real-time visibility into the health and performance of your containerized services. The stack is open-source, well-documented, and can be extended to cover host-level metrics, application-specific metrics, and multi-cluster environments.
Start with the Docker Compose template provided, customize the alert rules to your thresholds, and integrate Grafana for beautiful dashboards. With proper configuration, you will detect containers that are unresponsive, running out of memory, or consuming excessive CPU before they cause downtime. Proactive monitoring is an investment that pays off in reduced incident resolution time and improved system stability.