Set up comprehensive monitoring for your infrastructure with Prometheus and Grafana, including alerts, dashboards, and best practices for production environments.
Visibility is everything in DevOps. After implementing Prometheus and Grafana across multiple production environments, I've developed a battle-tested approach to building monitoring stacks that actually help you sleep at night.
Architecture Overview
A robust monitoring stack consists of:
- Prometheus: Time-series database and monitoring system
- Grafana: Visualization and dashboarding
- Alertmanager: Alert routing and management
- Exporters: Metric collectors for various services
- Pushgateway: For short-lived jobs
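If you want to stand the whole stack up quickly for evaluation, Docker Compose is the fastest route. The sketch below is a minimal, unofficial starting point: the image tags, port mappings, and volume paths are my own assumptions and should be pinned and adjusted for your environment.

```yaml
# docker-compose.yml -- minimal evaluation stack (image tags and paths are assumptions)
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts:/etc/prometheus/alerts
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
  pushgateway:
    image: prom/pushgateway:latest
    ports:
      - "9091:9091"
```

In production you would run these through your orchestrator of choice, but the Compose file makes the wiring between the five components easy to see at a glance.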
Setting Up Prometheus
Basic Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'production-cluster'
    environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```
Service Discovery
Dynamic service discovery for cloud environments:
```yaml
# AWS EC2 Discovery
- job_name: 'ec2-instances'
  ec2_sd_configs:
    - region: us-east-1
      access_key: YOUR_ACCESS_KEY
      secret_key: YOUR_SECRET_KEY
      filters:
        - name: tag:Environment
          values: [production]
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance_name
```
Essential Metrics to Collect
System Metrics
Node Exporter provides these out of the box:
- CPU usage and load averages
- Memory utilization
- Disk I/O and space
- Network traffic
- File descriptor usage
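To make those categories concrete, here is a sketch of the node_exporter series they typically map to, expressed as recording rules. The group and record names are my own and purely illustrative; adjust the filters (mountpoint, device) to your hosts.

```yaml
# Illustrative recording rules -- record names are invented for this example
groups:
  - name: system_overview
    rules:
      - record: instance:cpu_busy:percent
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:memory_used:percent
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
      - record: instance:root_disk_free:percent
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
      - record: instance:network_receive:bytes_per_second
        expr: rate(node_network_receive_bytes_total{device!="lo"}[5m])
      - record: instance:open_fds:ratio
        expr: node_filefd_allocated / node_filefd_maximum
```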
Application Metrics
```python
# Python example with prometheus_client
from prometheus_client import Counter, Histogram, Gauge

# Request counter
request_count = Counter('app_requests_total',
                        'Total requests',
                        ['method', 'endpoint', 'status'])

# Request duration
request_duration = Histogram('app_request_duration_seconds',
                             'Request duration',
                             ['method', 'endpoint'])

# Active connections
active_connections = Gauge('app_active_connections',
                           'Active connections')

# Usage in application
@request_duration.labels(method='GET', endpoint='/api/users').time()
def get_users():
    request_count.labels(method='GET', endpoint='/api/users', status=200).inc()
    # Your logic here
```
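Defining the metrics is only half the job; Prometheus still needs an endpoint to scrape. Assuming the application serves its metrics on port 8000 (for example via prometheus_client's `start_http_server(8000)`), a scrape job like the one below would pick them up. The job name and hostnames are placeholders.

```yaml
# Hypothetical scrape job for the application above -- hostnames and port are assumptions
- job_name: 'my-python-app'
  static_configs:
    - targets: ['app1:8000', 'app2:8000']
```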
Creating Effective Alerts
Alert Rules
```yaml
# alerts/node_alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 10% disk space remaining"

      - alert: MemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: "Available memory is less than 10%"
```
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        team: infrastructure
      receiver: 'infrastructure-team'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        description: '{{ .GroupLabels.alertname }}'

  - name: 'infrastructure-team'
    email_configs:
      - to: 'infrastructure@company.com'
        from: 'alerts@company.com'
```
Grafana Dashboard Best Practices
Essential Dashboards
- System Overview Dashboard
```json
{
  "dashboard": {
    "title": "System Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
          }
        ]
      }
    ]
  }
}
```
- Application Performance Dashboard
  - Request rate and errors
  - Response time percentiles (p50, p95, p99) (see the recording-rule sketch after this list)
  - Active connections
  - Database query performance
- Business Metrics Dashboard
  - User signups
  - Transaction volume
  - Revenue metrics
  - Conversion rates
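For the response-time percentiles, the histogram defined earlier (app_request_duration_seconds) already exposes the _bucket series that histogram_quantile() needs. Below is a sketch of p95/p99 recording rules that a dashboard panel could then query directly; the record names are my own.

```yaml
# Illustrative percentile recording rules built on the histogram from the Python example
groups:
  - name: latency_percentiles
    rules:
      - record: endpoint:app_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (le, endpoint) (rate(app_request_duration_seconds_bucket[5m])))
      - record: endpoint:app_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (le, endpoint) (rate(app_request_duration_seconds_bucket[5m])))
```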
Dashboard Design Principles
- Use consistent color schemes
  - Green: Good/Normal
  - Yellow: Warning
  - Red: Critical
  - Blue: Informational
- Organize by importance
  - Critical metrics at the top
  - Detailed breakdowns below
  - Historical comparisons at the bottom
- Include context
  - Add threshold lines
  - Show week-over-week comparisons
  - Include annotations for deployments
Advanced Monitoring Patterns
SLI/SLO Monitoring
```yaml
# SLO: 99.9% availability
- alert: SLOViolation
  expr: |
    (
      1 - (
        sum(rate(http_requests_total{status=~"5.."}[1h]))
        /
        sum(rate(http_requests_total[1h]))
      )
    ) < 0.999
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "SLO violation: Availability below 99.9%"
```
Multi-Window Multi-Burn-Rate Alerts
Pairing a short and a long window means you only page when the error rate is both spiking right now and has stayed elevated over the past hour, which filters out brief blips:

```yaml
- alert: HighErrorRate
  expr: |
    (
      rate(http_requests_total{status=~"5.."}[5m]) > 0.05
      and
      rate(http_requests_total{status=~"5.."}[1h]) > 0.01
    )
  for: 2m
  labels:
    severity: critical
```
Scaling Considerations
Federation for Multi-Cluster Monitoring
```yaml
# Global Prometheus configuration
- job_name: 'federate'
  scrape_interval: 15s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"kubernetes-.*"}'
      - 'up{job=~"kubernetes-.*"}'
  static_configs:
    - targets:
        - 'prometheus-cluster1:9090'
        - 'prometheus-cluster2:9090'
```
Long-Term Storage with Thanos
```yaml
# Thanos sidecar configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-storage
data:
  object-store.yaml: |
    type: S3
    config:
      bucket: prometheus-long-term
      endpoint: s3.amazonaws.com
      region: us-east-1
```
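The ConfigMap only describes where blocks are uploaded; the sidecar itself runs alongside Prometheus and consumes it. The container snippet below is a rough sketch: the image tag, mount paths, and volume names are assumptions, while the sidecar flags (--tsdb.path, --prometheus.url, --objstore.config-file) are the standard ones.

```yaml
# Sketch of the sidecar container in the Prometheus pod (tag, paths, volume names assumed)
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.34.0
  args:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://localhost:9090
    - --objstore.config-file=/etc/thanos/object-store.yaml
  volumeMounts:
    - name: prometheus-data
      mountPath: /prometheus
    - name: thanos-objstore
      mountPath: /etc/thanos
```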
Optimization Tips
Query Performance
- Use recording rules for expensive queries:

```yaml
groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: instance:node_cpu:rate5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

- Optimize cardinality (see the relabeling sketch after this list):
  - Limit label combinations
  - Drop unnecessary metrics
  - Use metric relabeling
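For the relabeling point, dropping series at scrape time is usually done with metric_relabel_configs on the job. A minimal sketch, assuming a hypothetical high-cardinality metric family you have already identified:

```yaml
# Drop a noisy metric family at scrape time -- the regex is an illustrative placeholder
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'http_request_duration_seconds_bucket_by_path.*'
    action: drop
```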
Resource Management
```yaml
# Prometheus resource limits
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"
```
Troubleshooting Common Issues
High Memory Usage
- Check cardinality: `prometheus_tsdb_symbol_table_size_bytes`
- Review retention policies (see the flags sketch after this list)
- Implement downsampling for historical data
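On the retention point, the knobs live on the Prometheus command line rather than in prometheus.yml. The flags below are real; the values are examples to adapt to your disk budget.

```yaml
# Container args illustrating retention limits (values are examples)
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.retention.time=15d
  - --storage.tsdb.retention.size=50GB
```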
Missing Metrics
- Verify network connectivity
- Check target health via the `/targets` endpoint
- Review scrape configuration
- Validate service discovery
Alert Fatigue
- Implement alert grouping
- Use inhibition rules (see the sketch after this list)
- Set appropriate thresholds
- Create runbooks for each alert
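For inhibition, a small Alertmanager block like the following suppresses warning-level noise on an instance while a matching critical alert is already firing. The matcher syntax shown is the current source_matchers/target_matchers form; the label choices are assumptions to adapt to your labeling scheme.

```yaml
# Suppress warnings while a related critical alert is firing
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'instance']
```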
Security Considerations
Enable Authentication
```yaml
# Prometheus web.yml
basic_auth_users:
  admin: $2y$10$V2RmZ2VuZXJhdGVkQnlCQ3J5cHQ...
```
TLS Configuration
```yaml
tls_server_config:
  cert_file: /etc/prometheus/prometheus.crt
  key_file: /etc/prometheus/prometheus.key
```
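Both the basic-auth and TLS blocks live in the same web config file, and Prometheus only reads it if you point the server at it. The --web.config.file flag is real; the paths here are assumptions.

```yaml
# Startup args wiring in the web config (paths are assumptions)
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --web.config.file=/etc/prometheus/web.yml
```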
Network Policies
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-network-policy
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana
```
Conclusion
A well-designed Prometheus and Grafana stack provides the observability needed for modern DevOps. Start with basic system metrics, gradually add application-specific monitoring, and refine your alerts based on actual incidents.
Remember: the goal isn't to monitor everything, but to monitor what matters. Focus on metrics that directly impact your users and business objectives, and build your monitoring stack to surface problems before they become crises.
David Childs
Consulting Systems Engineer with over 10 years of experience building scalable infrastructure and helping organizations optimize their technology stack.