Building a Production-Ready Monitoring Stack with Prometheus and Grafana

David Childs

Set up comprehensive monitoring for your infrastructure with Prometheus and Grafana, including alerts, dashboards, and best practices for production environments.

Visibility is everything in DevOps. After implementing Prometheus and Grafana across multiple production environments, I've developed a battle-tested approach to building monitoring stacks that actually help you sleep at night.

Architecture Overview

A robust monitoring stack consists of:

  • Prometheus: Time-series database and monitoring system
  • Grafana: Visualization and dashboarding
  • Alertmanager: Alert routing and management
  • Exporters: Metric collectors for various services
  • Pushgateway: For short-lived jobs
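
If you want to stand this stack up quickly for evaluation, a minimal Docker Compose sketch is below. The image names and default ports are the upstream ones; the volume paths and file layout are assumptions to adapt to your environment.

# docker-compose.yml (sketch)
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"

  node-exporter:
    image: prom/node-exporter:latest
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'
    ports:
      - "9100:9100"

  pushgateway:
    image: prom/pushgateway:latest
    ports:
      - "9091:9091"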

Setting Up Prometheus

Basic Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production-cluster'
    environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
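
The kubernetes-pods job above only keeps pods that opt in through the conventional prometheus.io/scrape annotation. Here is a sketch of what that looks like on a Deployment's pod template; the app name, image, and port are placeholders, and the port/path annotations only take effect if your relabel rules read them.

# Pod template annotations matched by the relabel rule above (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"      # only used if __address__ is relabeled from it
        prometheus.io/path: "/metrics"  # only used if __metrics_path__ is relabeled from it
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          ports:
            - containerPort: 8080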

Service Discovery

Dynamic service discovery for cloud environments:

# AWS EC2 Discovery
- job_name: 'ec2-instances'
  ec2_sd_configs:
    - region: us-east-1
      # Prefer an IAM instance profile or environment credentials;
      # access_key/secret_key can then be omitted entirely.
      access_key: YOUR_ACCESS_KEY
      secret_key: YOUR_SECRET_KEY
      port: 9100  # port to scrape on each discovered instance (defaults to 80)
      filters:
        - name: tag:Environment
          values: [production]
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance_name

Essential Metrics to Collect

System Metrics

# Node Exporter provides these
- CPU usage and load averages
- Memory utilization
- Disk I/O and space
- Network traffic
- File descriptor usage
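
To make those categories concrete, here is a sketch of recording rules built on standard node_exporter series. The instance: rule names follow the usual naming convention, and the filesystem matcher is an assumption to adjust for your mounts.

# system_rules.yml (sketch)
groups:
  - name: system_overview
    rules:
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_usage:percent
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
      - record: instance:node_filesystem_usage:percent
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100
      - record: instance:node_network_receive_bytes:rate5m
        expr: rate(node_network_receive_bytes_total[5m])
      - record: instance:node_filefd_usage:percent
        expr: (node_filefd_allocated / node_filefd_maximum) * 100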

Application Metrics

# Python example with prometheus_client
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Request counter
request_count = Counter('app_requests_total',
                        'Total requests',
                        ['method', 'endpoint', 'status'])

# Request duration
request_duration = Histogram('app_request_duration_seconds',
                             'Request duration',
                             ['method', 'endpoint'])

# Active connections
active_connections = Gauge('app_active_connections',
                           'Active connections')

# Usage in application
@request_duration.labels(method='GET', endpoint='/api/users').time()
def get_users():
    request_count.labels(method='GET', endpoint='/api/users', status=200).inc()
    # Your logic here

# Expose the metrics endpoint for Prometheus to scrape (http://localhost:8000/metrics)
start_http_server(8000)
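
Assuming the application exposes its metrics over HTTP (for example on port 8000 via prometheus_client's start_http_server, as sketched above), all that remains is a scrape job pointing at it; the hostnames below are placeholders.

# Addition to scrape_configs in prometheus.yml
- job_name: 'my-app'
  static_configs:
    - targets: ['app1:8000', 'app2:8000']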

Creating Effective Alerts

Alert Rules

# alerts/node_alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 10% disk space remaining"

      - alert: MemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: "Available memory is less than 10%"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        team: infrastructure
      receiver: 'infrastructure-team'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'infrastructure-team'
    email_configs:
      - to: 'infrastructure@company.com'
        from: 'alerts@company.com'

Grafana Dashboard Best Practices
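
Before any dashboard renders, Grafana needs Prometheus registered as a data source. Provisioning it from a file keeps that step reproducible; this is a minimal sketch, assuming Grafana can reach Prometheus at http://prometheus:9090.

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true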

Essential Dashboards

  1. System Overview Dashboard
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
          }
        ]
      }
    ]
  }
}
  2. Application Performance Dashboard
  • Request rate and errors
  • Response time percentiles (p50, p95, p99)
  • Active connections
  • Database query performance
  3. Business Metrics Dashboard
  • User signups
  • Transaction volume
  • Revenue metrics
  • Conversion rates

Dashboard Design Principles

  1. Use consistent color schemes

    • Green: Good/Normal
    • Yellow: Warning
    • Red: Critical
    • Blue: Informational
  2. Organize by importance

    • Critical metrics at the top
    • Detailed breakdowns below
    • Historical comparisons at the bottom
  3. Include context

    • Add threshold lines
    • Show week-over-week comparisons
    • Include annotations for deployments

Advanced Monitoring Patterns

SLI/SLO Monitoring

# SLO: 99.9% availability
- alert: SLOViolation
  expr: |
    1 - (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) < 0.999
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "SLO violation: Availability below 99.9%"

Multi-Window Multi-Burn-Rate Alerts

# Fires only when both the fast (5m) and slow (1h) error ratios breach their thresholds
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.05
      and
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h])) > 0.01
    )
  for: 2m
  labels:
    severity: critical

Scaling Considerations

Federation for Multi-Cluster Monitoring

# Global Prometheus configuration
- job_name: 'federate'
  scrape_interval: 15s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"kubernetes-.*"}'
      - 'up{job=~"kubernetes-.*"}'
  static_configs:
    - targets:
      - 'prometheus-cluster1:9090'
      - 'prometheus-cluster2:9090'

Long-Term Storage with Thanos

# Object storage configuration consumed by the Thanos sidecar
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-storage
data:
  object-store.yaml: |
    type: S3
    config:
      bucket: prometheus-long-term
      endpoint: s3.amazonaws.com
      region: us-east-1
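
The ConfigMap above only defines the object store; the sidecar itself runs as an extra container in the Prometheus pod and is pointed at that file. A trimmed sketch using the standard Thanos sidecar flags; the volume names assume a shared Prometheus data volume and the mounted ConfigMap.

# Extra container in the Prometheus pod (sketch)
- name: thanos-sidecar
  image: quay.io/thanos/thanos:latest
  args:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://localhost:9090
    - --objstore.config-file=/etc/thanos/object-store.yaml
  volumeMounts:
    - name: prometheus-data
      mountPath: /prometheus
    - name: thanos-storage
      mountPath: /etc/thanos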

Optimization Tips

Query Performance

  1. Use recording rules for expensive queries
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: instance:node_cpu:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  2. Optimize cardinality
    • Limit label combinations
    • Drop unnecessary metrics
    • Use metric relabeling (sketched below)
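
Metric relabeling happens at scrape time, before samples reach the TSDB, which makes it the right place to drop noisy series. A sketch; the regexes are illustrative, not recommendations.

# Inside a scrape job in prometheus.yml
metric_relabel_configs:
  # Drop an entire metric family you never query
  - source_labels: [__name__]
    regex: 'go_gc_duration_seconds.*'
    action: drop
  # Remove a label that explodes cardinality
  - regex: 'pod_template_hash'
    action: labeldrop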

Resource Management

# Prometheus resource limits
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"

Troubleshooting Common Issues

High Memory Usage

  • Check cardinality: prometheus_tsdb_head_series (active series count) and prometheus_tsdb_symbol_table_size_bytes
  • Review retention policies
  • Implement downsampling for historical data

Missing Metrics

  • Verify network connectivity
  • Check target health: /targets endpoint
  • Review scrape configuration
  • Validate service discovery

Alert Fatigue

  • Implement alert grouping
  • Use inhibition rules (sketched below)
  • Set appropriate thresholds
  • Create runbooks for each alert
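
Inhibition lives in alertmanager.yml. A minimal sketch, using the same match style as the routing configuration earlier, that mutes the warning-level copy of an alert while its critical-level counterpart for the same instance is firing:

# alertmanager.yml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']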

Security Considerations

Enable Authentication

# Prometheus web.yml (enabled with --web.config.file=web.yml)
basic_auth_users:
  admin: $2y$10$V2RmZ2VuZXJhdGVkQnlCQ3J5cHQ...

TLS Configuration

# Also in web.yml, alongside basic_auth_users
tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key

Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-network-policy
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana

Conclusion

A well-designed Prometheus and Grafana stack provides the observability needed for modern DevOps. Start with basic system metrics, gradually add application-specific monitoring, and refine your alerts based on actual incidents.

Remember: the goal isn't to monitor everything, but to monitor what matters. Focus on metrics that directly impact your users and business objectives, and build your monitoring stack to surface problems before they become crises.
