Set up comprehensive monitoring for your infrastructure with Prometheus and Grafana, including alerts, dashboards, and best practices for production environments.
Visibility is everything in DevOps. After implementing Prometheus and Grafana across multiple production environments, I've developed a battle-tested approach to building monitoring stacks that actually help you sleep at night.
Architecture Overview
A robust monitoring stack consists of:
- Prometheus: Time-series database and monitoring system
- Grafana: Visualization and dashboarding
- Alertmanager: Alert routing and management
- Exporters: Metric collectors for various services
- Pushgateway: For short-lived jobs
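If you want to stand the whole stack up quickly for evaluation, Docker Compose is the fastest route. The sketch below is a minimal, unofficial starting point: the image tags, port mappings, and volume paths are my own assumptions and should be pinned and adjusted for your environment.

```yaml
# docker-compose.yml -- minimal evaluation stack (image tags and paths are assumptions)
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
  pushgateway:
    image: prom/pushgateway:latest
    ports:
      - "9091:9091"
```

In production you would run these through your orchestrator of choice, but the Compose file makes the wiring between the five components easy to see at a glance.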
Setting Up Prometheus
Basic Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production-cluster'
    environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```
Service Discovery
Dynamic service discovery for cloud environments:
```yaml
# AWS EC2 Discovery
- job_name: 'ec2-instances'
  ec2_sd_configs:
    - region: us-east-1
      access_key: YOUR_ACCESS_KEY
      secret_key: YOUR_SECRET_KEY
      filters:
        - name: tag:Environment
          values: [production]
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance_name
```
Essential Metrics to Collect
System Metrics
Node Exporter provides these out of the box:
- CPU usage and load averages
- Memory utilization
- Disk I/O and space
- Network traffic
- File descriptor usage
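To make those categories concrete, here is a sketch of the node_exporter series they typically map to, expressed as recording rules. The group and record names are my own and purely illustrative; adjust the filters (mountpoint, device) to your hosts.

```yaml
# Illustrative recording rules -- record names are invented for this example
groups:
  - name: system_overview
    rules:
      - record: instance:cpu_busy:percent
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:memory_used:percent
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
      - record: instance:root_disk_free:percent
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
      - record: instance:network_receive:bytes_per_second
        expr: rate(node_network_receive_bytes_total{device!="lo"}[5m])
      - record: instance:open_fds:ratio
        expr: node_filefd_allocated / node_filefd_maximum
```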
Application Metrics
```python
# Python example with prometheus_client
from prometheus_client import Counter, Histogram, Gauge

# Request counter
request_count = Counter('app_requests_total',
                        'Total requests',
                        ['method', 'endpoint', 'status'])

# Request duration
request_duration = Histogram('app_request_duration_seconds',
                             'Request duration',
                             ['method', 'endpoint'])

# Active connections
active_connections = Gauge('app_active_connections',
                           'Active connections')

# Usage in application
@request_duration.labels(method='GET', endpoint='/api/users').time()
def get_users():
    request_count.labels(method='GET', endpoint='/api/users', status=200).inc()
    # Your logic here
```
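Defining the metrics is only half the job; Prometheus still needs an endpoint to scrape. Assuming the application serves its metrics on port 8000 (for example via prometheus_client's `start_http_server(8000)`), a scrape job like the one below would pick them up. The job name and hostnames are placeholders.

```yaml
# Hypothetical scrape job for the application above -- hostnames and port are assumptions
- job_name: 'my-python-app'
  static_configs:
    - targets: ['app1:8000', 'app2:8000']
```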
Creating Effective Alerts
Alert Rules
```yaml
# alerts/node_alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 10% disk space remaining"

      - alert: MemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: "Available memory is less than 10%"
```
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        team: infrastructure
      receiver: 'infrastructure-team'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'infrastructure-team'
    email_configs:
      - to: 'infrastructure@company.com'
        from: 'alerts@company.com'
```
Grafana Dashboard Best Practices
Essential Dashboards
- System Overview Dashboard
```json
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
          }
        ]
      }
    ]
  }
}
```
- Application Performance Dashboard
  - Request rate and errors
  - Response time percentiles (p50, p95, p99) (see the recording-rule sketch after this list)
  - Active connections
  - Database query performance
- Business Metrics Dashboard
  - User signups
  - Transaction volume
  - Revenue metrics
  - Conversion rates
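For the response-time percentiles, the histogram defined earlier (app_request_duration_seconds) already exposes the _bucket series that histogram_quantile() needs. Below is a sketch of p95/p99 recording rules that a dashboard panel could then query directly; the record names are my own.

```yaml
# Illustrative percentile recording rules built on the histogram from the Python example
groups:
  - name: latency_percentiles
    rules:
      - record: endpoint:app_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(app_request_duration_seconds_bucket[5m])))
      - record: endpoint:app_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le, endpoint) (rate(app_request_duration_seconds_bucket[5m])))
```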
Dashboard Design Principles
- Use consistent color schemes
  - Green: Good/Normal
  - Yellow: Warning
  - Red: Critical
  - Blue: Informational
- Organize by importance
  - Critical metrics at the top
  - Detailed breakdowns below
  - Historical comparisons at the bottom
- Include context
  - Add threshold lines
  - Show week-over-week comparisons
  - Include annotations for deployments
Advanced Monitoring Patterns
SLI/SLO Monitoring
```yaml
# SLO: 99.9% availability
- alert: SLOViolation
  expr: |
    (
      1 - (
        sum(rate(http_requests_total{status=~"5.."}[1h]))
        /
        sum(rate(http_requests_total[1h]))
      )
    ) < 0.999
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "SLO violation: Availability below 99.9%"
```
Multi-Window Multi-Burn-Rate Alerts
Pairing a short and a long window means you only page when the error rate is both spiking right now and has stayed elevated over the past hour, which filters out brief blips:

```yaml
- alert: HighErrorRate
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[5m]) > 0.05
      and
      rate(http_requests_total{status=~"5.."}[1h]) > 0.01
    )
  for: 2m
  labels:
    severity: critical
```
Scaling Considerations
Federation for Multi-Cluster Monitoring
```yaml
# Global Prometheus configuration
- job_name: 'federate'
  scrape_interval: 15s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"kubernetes-.*"}'
      - 'up{job=~"kubernetes-.*"}'
  static_configs:
    - targets:
        - 'prometheus-cluster1:9090'
        - 'prometheus-cluster2:9090'
```
Long-Term Storage with Thanos
```yaml
# Thanos sidecar configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-storage
data:
  object-store.yaml: |
    type: S3
    config:
      bucket: prometheus-long-term
      endpoint: s3.amazonaws.com
      region: us-east-1
```
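The ConfigMap only describes where blocks are uploaded; the sidecar itself runs alongside Prometheus and consumes it. The container snippet below is a rough sketch: the image tag, mount paths, and volume names are assumptions, while the sidecar flags (--tsdb.path, --prometheus.url, --objstore.config-file) are the standard ones.

```yaml
# Sketch of the sidecar container in the Prometheus pod (tag, paths, volume names assumed)
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.34.0
  args:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://localhost:9090
    - --objstore.config-file=/etc/thanos/object-store.yaml
  volumeMounts:
    - name: prometheus-data
      mountPath: /prometheus
    - name: thanos-objstore
      mountPath: /etc/thanos
```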
Optimization Tips
Query Performance
- Use recording rules for expensive queries:

```yaml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: instance:node_cpu:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

- Optimize cardinality (see the relabeling sketch after this list):
  - Limit label combinations
  - Drop unnecessary metrics
  - Use metric relabeling
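For the relabeling point, dropping series at scrape time is usually done with metric_relabel_configs on the job. A minimal sketch, assuming a hypothetical high-cardinality metric family you have already identified:

```yaml
# Drop a noisy metric family at scrape time -- the regex is an illustrative placeholder
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'http_request_duration_seconds_bucket_by_path.*'
    action: drop
```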
Resource Management
```yaml
# Prometheus resource limits
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"
```
Troubleshooting Common Issues
High Memory Usage
- Check cardinality: `prometheus_tsdb_symbol_table_size_bytes`
- Review retention policies (see the flags sketch after this list)
- Implement downsampling for historical data
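On the retention point, the knobs live on the Prometheus command line rather than in prometheus.yml. The flags below are real; the values are examples to adapt to your disk budget.

```yaml
# Container args illustrating retention limits (values are examples)
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.retention.time=15d
  - --storage.tsdb.retention.size=50GB
```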
Missing Metrics
- Verify network connectivity
- Check target health via the `/targets` endpoint
- Review scrape configuration
- Validate service discovery
Alert Fatigue
- Implement alert grouping
- Use inhibition rules (see the sketch after this list)
- Set appropriate thresholds
- Create runbooks for each alert
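For inhibition, a small Alertmanager block like the following suppresses warning-level noise on an instance while a matching critical alert is already firing. The matcher syntax shown is the current source_matchers/target_matchers form; the label choices are assumptions to adapt to your labeling scheme.

```yaml
# Suppress warnings while a related critical alert is firing
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'instance']
```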
Security Considerations
Enable Authentication
```yaml
# Prometheus web.yml
basic_auth_users:
  admin: $2y$10$V2RmZ2VuZXJhdGVkQnlCQ3J5cHQ...
```
TLS Configuration
```yaml
tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key
```
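Both the basic-auth and TLS blocks live in the same web config file, and Prometheus only reads it if you point the server at it. The --web.config.file flag is real; the paths here are assumptions.

```yaml
# Startup args wiring in the web config (paths are assumptions)
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --web.config.file=/etc/prometheus/web.yml
```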
Network Policies
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-network-policy
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana
```
Conclusion
A well-designed Prometheus and Grafana stack provides the observability needed for modern DevOps. Start with basic system metrics, gradually add application-specific monitoring, and refine your alerts based on actual incidents.
Remember: the goal isn't to monitor everything, but to monitor what matters. Focus on metrics that directly impact your users and business objectives, and build your monitoring stack to surface problems before they become crises.
David Childs
Consulting Systems Engineer with over 10 years of experience building scalable infrastructure and helping organizations optimize their technology stack.