Implementing Docker Container Health Monitoring with Prometheus and Alertmanager

Modern applications increasingly rely on Docker containers to deliver consistent, scalable deployments. However, as the number of running containers grows, so does the complexity of ensuring each container remains healthy and performs optimally. Proactive monitoring is no longer optional—it is a critical component of any production environment. This guide provides a comprehensive walkthrough for implementing Docker container health monitoring using Prometheus and Alertmanager. You will learn how to collect metrics, set up meaningful alerts, and respond to incidents before they impact users.

By the end of this article, you will have a production-ready monitoring stack that scrapes container-level metrics, exposes them to Prometheus, and sends notifications via email, Slack, or other channels when problems arise. We assume you have basic familiarity with Docker and Docker Compose, but we will explain each step in detail.

Understanding the Core Components

Before diving into the setup, it helps to understand how Prometheus and Alertmanager work together and what additional tools facilitate Docker monitoring.

Prometheus

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It operates by pulling (scraping) metrics from configured endpoints at specified intervals and storing them in a time-series database. The powerful PromQL query language allows you to slice, aggregate, and transform the collected data. Prometheus excels at collecting numeric time-series data such as CPU usage, memory consumption, request rates, and container health states.

Prometheus can scrape metrics from various exporters. For Docker containers, the most common exporters are cAdvisor (for container-level resource usage) and the Node Exporter (for host-level metrics). Prometheus also supports service discovery, which can automatically detect new containers when integrated with orchestrators like Kubernetes.

Learn more on the official project page: Prometheus Overview

Alertmanager

Alertmanager is the component that handles alerts fired by Prometheus. It deduplicates, groups, and routes alerts to the appropriate receivers. Alertmanager supports multiple notification channels including email, Slack, PagerDuty, OpsGenie, and custom webhooks. Its flexible routing tree allows you to send high-severity alerts to email and pagers while low-severity alerts go to a Slack channel.

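Alertmanager also provides silences, which mute alerts matching a set of label matchers for a fixed period, and inhibition, which suppresses lower-priority alerts while a related higher-priority alert is firing. Both give operators time to investigate without being flooded by repetitive notifications. Alertmanager is usually deployed alongside Prometheus, often in the same Docker Compose file.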
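As a minimal sketch of inhibition, the fragment below could be added to alertmanager.yml. It assumes alerts carry a severity label with the values critical and warning, plus a name label identifying the container; adjust the matchers if your labels differ.

```yaml
# alertmanager.yml (fragment): illustrative inhibition rule.
# Assumes alerts are labelled severity=critical|warning and share a "name" label.
inhibit_rules:
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    # Only inhibit when both alerts refer to the same container.
    equal: ['name']
```

With a rule like this, a warning for a container is muted while a critical alert for the same container is firing.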

Official documentation: Alertmanager Configuration

cAdvisor (Container Advisor)

cAdvisor (Container Advisor) is an open-source agent that collects, aggregates, and exports resource usage and performance characteristics of running containers. It exposes a rich set of metrics including CPU, memory, network, and filesystem usage. Prometheus can scrape cAdvisor’s /metrics endpoint to obtain per-container data. cAdvisor runs as a Docker container and needs read-only access to the Docker socket and the host’s /sys and /var/lib/docker directories; the project’s own run instructions also mount the host root filesystem read-only at /rootfs. Beyond these mounts it needs very little configuration.

cAdvisor GitHub: cAdvisor Repository

Setting Up the Monitoring Stack with Docker Compose

The easiest way to deploy Prometheus, cAdvisor, and Alertmanager is using Docker Compose. Create a project directory and navigate into it. Then create a file named docker-compose.yml with the following content:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
    ports:
      - "9090:9090"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
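    volumes:
      # Read-only view of the host root filesystem, as in cAdvisor's documented run command
      - /:/rootfs:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro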
    restart: unless-stopped

This Compose file defines three services:

  • prometheus: mounts a configuration file and a rules file and exposes port 9090. Metrics are stored inside the container, so they are lost when the container is removed (add a persistent volume for production).
  • alertmanager: mounts its own configuration file and exposes port 9093.
  • cadvisor: mounts the host paths it needs to read system data and reads the Docker socket to list containers; its web UI and /metrics endpoint are served on port 8080.
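The persistent volume mentioned for the prometheus service could look like the sketch below. The volume name prometheus_data and the 15-day retention are illustrative choices, not values from this guide. Because overriding command replaces the image's default arguments, the sketch restates the config file and storage path flags.

```yaml
# docker-compose.yml (fragment): Prometheus with persistent storage.
# "prometheus_data" and the 15d retention are illustrative assumptions.
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus_data:/prometheus   # time-series data lives here
    command:
      # Overriding "command" replaces the image defaults, so restate them.
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d

volumes:
  prometheus_data:
```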

Configuring Prometheus to Scrape Docker Metrics

Create a file named prometheus.yml in the same directory. This is the main configuration file for Prometheus. The minimal version to scrape cAdvisor looks like this:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    metrics_path: /metrics
    # Add additional relabel configs if needed

Key points:

  • scrape_interval: how often Prometheus scrapes targets (15 seconds is fine for container monitoring).
  • alerting: tells Prometheus where Alertmanager is running (here we use the Docker Compose service name).
  • rule_files: location of the alert rules file.
  • scrape_configs: defines the cAdvisor target. Compose attaches all services in the file to the same default network, so the hostname cadvisor resolves through Docker’s built-in DNS.

You can add more scrape jobs, for example to monitor the Prometheus server itself (job_name: 'prometheus' with target localhost:9090) or a Node Exporter on the host.
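A sketch of those extra jobs follows. The node-exporter hostname is a hypothetical Compose service that this guide does not define; 9100 is Node Exporter's default port. The alertmanager job is an additional suggestion, relying on Alertmanager serving its own metrics on port 9093.

```yaml
# prometheus.yml (fragment): extra jobs listed under scrape_configs.
# "node-exporter" is a hypothetical service name, not part of the Compose file above.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']      # Prometheus scraping itself

  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']  # host-level metrics
```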

For advanced setups, you may use Docker service discovery to automatically scrape all containers running on a host or across a swarm. Refer to Prometheus’ scrape configuration documentation.
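As a rough sketch of Docker service discovery, the job below asks the local Docker daemon for running containers and keeps only those that opt in through a container label. It assumes the Docker socket is mounted into the Prometheus container and that the Prometheus process is allowed to read it. The prometheus-job label is a naming convention chosen for this example, not something Docker or Prometheus requires.

```yaml
# prometheus.yml (fragment): discover containers through the Docker daemon.
# Assumes /var/run/docker.sock is mounted into the Prometheus container.
scrape_configs:
  - job_name: 'docker-containers'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Keep only containers that carry a "prometheus-job" label.
      - source_labels: [__meta_docker_container_label_prometheus_job]
        regex: .+
        action: keep
      # Use the label's value as the job name.
      - source_labels: [__meta_docker_container_label_prometheus_job]
        target_label: job
```

A container started with the label prometheus-job=my-app would then be scraped under the job my-app, provided it serves metrics on one of its exposed ports.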

Configuring Alertmanager

Create a file named alertmanager.yml. The following example sets a global resolve timeout, a default route, and an email receiver. For production, replace the placeholder SMTP host, addresses, and credentials with your own.

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'your_username'
  smtp_auth_password: 'your_password'

route:
  receiver: 'email-admin'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'email-admin'
    email_configs:
      - to: 'admin@example.com'
        send_resolved: true

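Alternatively, to send alerts to Slack, replace the receivers section with the following and change route.receiver to 'slack-team' so the route points at a receiver that exists: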

receivers:
  - name: 'slack-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true

Alertmanager supports many integrations. See the Alertmanager configuration reference for details.
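To combine both receivers, a routing tree can split alerts on their severity label. The sketch below assumes alerts are labelled severity: critical or severity: warning and that the global SMTP settings shown earlier are still present. The oncall@example.com address and the shorter repeat interval for critical alerts are placeholders.

```yaml
# alertmanager.yml (fragment): route alerts by the "severity" label.
route:
  receiver: 'slack-team'            # default for anything not matched below
  group_by: ['alertname', 'name']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: 'email-admin'
      repeat_interval: 1h           # illustrative: re-notify sooner for critical alerts

receivers:
  - name: 'email-admin'
    email_configs:
      - to: 'oncall@example.com'    # placeholder address
        send_resolved: true
  - name: 'slack-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
```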

Defining Alert Rules for Container Health

Now create a file named alert.rules.yml. This file contains Prometheus alerting rules that evaluate expressions against metrics and fire alerts. Below are several rules specific to Docker containers.

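groups:
- name: container_rules
  interval: 30s
  rules:
  # A named container's last-seen timestamp is more than 60s old.
  # Keep "for" short: after the container disappears, Prometheus returns its
  # last sample for at most about 5 minutes (the default lookback window).
  - alert: ContainerNotReady
    expr: |
      time() - container_last_seen{name!=""} > 60
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.name }} not ready"
      description: "Container {{ $labels.name }} has not been seen by cAdvisor for more than a minute."

  # "> 0" drops containers without a memory limit (reported as 0), which
  # would otherwise divide by zero and fire permanently.
  - alert: HighMemoryUsage
    expr: |
      (container_memory_usage_bytes{name!=""}
        / (container_spec_memory_limit_bytes{name!=""} > 0)) * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} memory usage > 90%"
      description: 'Container {{ $labels.name }} is using {{ printf "%.1f" $value }}% of its memory limit.'

  # Sum the per-CPU series so the value covers the whole container;
  # 80 means 80% of one CPU core.
  - alert: HighCPUUsage
    expr: |
      sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.name }} CPU usage > 80%"
      description: 'Container {{ $labels.name }} has averaged {{ printf "%.1f" $value }}% of one CPU core over 5 minutes.'

  # absent() yields one series without a "name" label, and only when no named
  # container is reported at all (for example, when cAdvisor itself is down).
  - alert: ContainerDown
    expr: |
      absent(container_last_seen{name!=""}) or (time() - container_last_seen{name!=""} > 120)
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{ if $labels.name }}Container {{ $labels.name }} is down{{ else }}No container metrics reported{{ end }}"
      description: "{{ if $labels.name }}Container {{ $labels.name }} has not been seen for more than 2 minutes.{{ else }}cAdvisor has reported no named containers for 2 minutes.{{ end }}"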

Explanation of key metrics:

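  • container_last_seen: the Unix timestamp at which cAdvisor last saw the container; subtracting it from time() gives the number of seconds since the container was last observed.
  • container_memory_usage_bytes and container_spec_memory_limit_bytes: compare actual usage to the configured limit. cAdvisor reports a limit of 0 for containers that have none.
  • container_cpu_usage_seconds_total: cumulative CPU time in seconds; rate() turns it into the average number of CPU cores in use over the window.
  • absent(): returns a result only when no series matches the selector at all, so it detects that every container (or cAdvisor itself) has disappeared, not that one particular container stopped.

You can customize thresholds and intervals based on your workload. For containers with no memory limit, the ratio against the limit is meaningless (it is a division by zero), so exclude them from the expression or compare usage against host-level memory instead.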
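One way to cover containers without a memory limit is to compare each container's working set with the host's total memory. The sketch below assumes cAdvisor's machine_memory_bytes metric is being scraped. The alert name and the 50% threshold are arbitrary illustrations.

```yaml
# alert.rules.yml (fragment): add under an existing group's "rules:" list.
# Assumes cAdvisor exports machine_memory_bytes (total host memory).
- alert: ContainerHighHostMemoryShare
  expr: |
    sum by (instance, name) (container_memory_working_set_bytes{name!=""})
      / on (instance) group_left ()
    max by (instance) (machine_memory_bytes)
      * 100 > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.name }} uses a large share of host memory"
    description: 'Container {{ $labels.name }} is using {{ printf "%.1f" $value }}% of host memory.'
```

The on (instance) group_left () modifier matches the many container series from one cAdvisor instance against that instance's single host-memory series.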

Visualizing Metrics with Grafana

While Prometheus provides a basic expression browser, Grafana offers a much richer dashboarding experience. To deploy Grafana alongside your stack, add this service to your docker-compose.yml:

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_admin_password
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

volumes:
  grafana_data:

After starting the stack, log in to Grafana at http://localhost:3000 with admin / your_admin_password. Add a Prometheus data source pointing to http://prometheus:9090 (the Compose service name, because Grafana reaches Prometheus over the Docker network rather than through localhost). Then import a community dashboard for Docker monitoring, for example dashboard 893 (Docker and system monitoring) from grafana.com, or build your own panels from the cAdvisor metrics.
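Instead of adding the data source by hand, Grafana can load it from a provisioning file at startup. The sketch below assumes you save it as ./grafana/provisioning/datasources/prometheus.yml (a path chosen for this example) and mount that directory into the container at /etc/grafana/provisioning/datasources.

```yaml
# grafana/provisioning/datasources/prometheus.yml
# Mount into the grafana service with, for example:
#   - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources:ro
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend queries Prometheus
    url: http://prometheus:9090   # Compose service name, not localhost
    isDefault: true
```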

Best Practices for Production Monitoring

  1. Persist Prometheus data: Mount a named Docker volume at /prometheus in the Compose file so metrics survive the container being removed and re-created (for example, after docker compose down or an image upgrade).
  2. Use separate alert rules for critical vs. warning: Route critical alerts to pageable channels (email + SMS) and warnings to chat.
  3. Enable service discovery: If you run containers across multiple hosts, use Consul or the Docker SD provider to automatically update scrape targets.
  4. Set up a central Prometheus: For multi-node environments, consider a federated architecture where a global Prometheus collects from local instances.
  5. Monitor the monitors: Add a blackbox_exporter or self-monitoring alerts to know if Prometheus or Alertmanager becomes unreachable. Prometheus cannot alert on its own outage, so at least one check should run outside it.
  6. Tune alert thresholds: Avoid alert fatigue by adjusting "for" durations and thresholds based on historical data.
  7. Secure endpoints: If the Prometheus or Alertmanager web UIs are exposed beyond localhost, protect them. Both support TLS and basic authentication through a web configuration file; for OAuth2 or other schemes, put an authenticating reverse proxy in front.
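For the "monitor the monitors" item, a simple starting point is a rule on the built-in up metric, which Prometheus sets to 0 whenever a scrape fails. This is a sketch: it only covers targets that Prometheus is configured to scrape, and it cannot fire if Prometheus itself is down.

```yaml
# alert.rules.yml (fragment): add under an existing group's "rules:" list.
- alert: TargetDown
  expr: up == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Scrape target {{ $labels.job }} is down"
    description: "Prometheus has failed to scrape {{ $labels.instance }} (job {{ $labels.job }}) for 2 minutes."
```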

Troubleshooting Common Issues

cAdvisor not exposing metrics

Verify cAdvisor is running: curl http://localhost:8080/metrics. If empty, check the container logs: docker logs cadvisor. Common misconfigurations include missing volume mounts for /var/run/docker.sock or /sys.

Prometheus cannot scrape cAdvisor

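Ensure the prometheus.yml target hostname matches the Compose service name (cadvisor). If cAdvisor runs on a separate host, replace the hostname with that host's address and make sure port 8080 is reachable from Prometheus. The Targets page at http://localhost:9090/targets shows each target's state and its last scrape error.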

Alerts not firing

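Verify the rule file is valid and loaded: promtool check rules alert.rules.yml validates the syntax, and Prometheus logs any rule-loading errors at startup. Check the Alerts page at http://localhost:9090/alerts to see the state of each rule, keeping in mind that an alert stays pending until its condition has held for the full "for" duration. Also confirm Alertmanager is reachable at the address given in the “alerting” section of prometheus.yml.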

Email notifications not sent

Test Alertmanager’s configuration by sending a test alert. Use amtool check-config alertmanager.yml to validate syntax. Check Alertmanager logs for SMTP connection errors.

Conclusion

Setting up Docker container health monitoring with Prometheus and Alertmanager provides a robust, scalable foundation for maintaining application reliability. By deploying cAdvisor to expose container metrics, scraping them with Prometheus, and routing alerts via Alertmanager, you gain real-time visibility into the health and performance of your containerized services. The stack is open-source, well-documented, and can be extended to cover host-level metrics, application-specific metrics, and multi-cluster environments.

Start with the Docker Compose template provided, customize the alert rules to your thresholds, and integrate Grafana for beautiful dashboards. With proper configuration, you will detect containers that are unresponsive, running out of memory, or consuming excessive CPU before they cause downtime. Proactive monitoring is an investment that pays off in reduced incident resolution time and improved system stability.