Implement chaos engineering practices to proactively discover weaknesses and build resilient systems that survive real-world failures.
"Everything fails all the time" - Werner Vogels, Amazon CTO. After our payment system crashed during Black Friday, costing us $2M in lost sales, we implemented chaos engineering. Now we break things on purpose so they don't break when it matters. Here's how to build antifragile systems.
Why Chaos Engineering?
Traditional testing verifies known scenarios. Chaos engineering discovers unknown unknowns:
- Network partitions that split your cluster
- Resource exhaustion from memory leaks
- Cascading failures from service dependencies
- Data corruption from partial writes
- Performance degradation under load
Getting Started with Chaos Monkey
Basic Setup
# chaos-monkey-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaosmonkey-config
  namespace: chaos-engineering
data:
  config.yaml: |
    chaosmonkey:
      enabled: true
      schedule:
        type: "cron"
        path: "0 0 9-17 * * MON-FRI"  # Business hours only
      accounts:
        - name: production
          region: us-east-1
          stack: "*"
          detail: "*"
          owner: platform-team@company.com
      spinnaker:
        enabled: false
      kill:
        enabled: true
        probability: 0.1  # 10% chance
        meanTimeBetweenKillsInWorkDays: 2
Kubernetes Chaos
# chaos-mesh-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure
  namespace: production
spec:
  mode: random-max-percent
  value: "30"
  action: pod-kill
  duration: "60s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  scheduler:
    cron: "@every 2h"
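If you want to trigger the same experiment from a pipeline or script instead of `kubectl apply`, here is a minimal sketch using the official `kubernetes` Python client. The file name and namespace are assumptions, and Chaos Mesh must already be installed in the cluster:
# apply_chaos.py (hypothetical helper)
import yaml
from kubernetes import client, config


def apply_pod_chaos(manifest_path="chaos-mesh-experiment.yaml", namespace="production"):
    """Create a Chaos Mesh PodChaos object from a YAML manifest."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    with open(manifest_path) as f:
        manifest = yaml.safe_load(f)
    return client.CustomObjectsApi().create_namespaced_custom_object(
        group="chaos-mesh.org",
        version="v1alpha1",
        namespace=namespace,
        plural="podchaos",
        body=manifest,
    )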
Building a Chaos Engineering Platform
Experiment Framework
# chaos_framework.py
import time
import logging
from dataclasses import dataclass
from typing import Callable

import boto3


@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    blast_radius: str
    rollback_plan: Callable
    steady_state_check: Callable
    chaos_injection: Callable


class ChaosOrchestrator:
    def __init__(self):
        self.experiments = []
        self.metrics_client = boto3.client('cloudwatch')

    def run_experiment(self, experiment: ChaosExperiment):
        logging.info(f"Starting experiment: {experiment.name}")
        logging.info(f"Hypothesis: {experiment.hypothesis}")
        try:
            # Verify steady state before injecting anything
            if not experiment.steady_state_check():
                logging.error("System not in steady state, aborting")
                return False

            # Record baseline metrics
            baseline_metrics = self.capture_metrics()

            # Inject chaos
            logging.info(f"Injecting chaos: {experiment.name}")
            experiment.chaos_injection()

            # Monitor impact
            time.sleep(60)  # Observation period

            # Check whether the steady state was maintained
            if not experiment.steady_state_check():
                logging.warning("Steady state lost, rolling back")
                experiment.rollback_plan()
                return False

            # Compare metrics before and after the injection
            post_chaos_metrics = self.capture_metrics()
            self.analyze_impact(baseline_metrics, post_chaos_metrics)
            return True
        except Exception as e:
            logging.error(f"Experiment failed: {e}")
            experiment.rollback_plan()
            return False

    def capture_metrics(self):
        # Capture system metrics (get_metric and analyze_impact are omitted here for brevity)
        return {
            'error_rate': self.get_metric('ErrorRate'),
            'latency_p99': self.get_metric('LatencyP99'),
            'success_rate': self.get_metric('SuccessRate'),
            'throughput': self.get_metric('Throughput')
        }
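To make the framework concrete, here is a hedged usage sketch. The health endpoint, pod labels, and SLO wording are assumptions, and the rollback is a no-op because Kubernetes reschedules the killed pod on its own:
# run_pod_kill.py (hypothetical usage of the framework above)
import subprocess

import requests

from chaos_framework import ChaosExperiment, ChaosOrchestrator


def payment_service_healthy():
    # Assumed health endpoint; substitute your own steady-state signal
    try:
        return requests.get("http://payment-service/health", timeout=5).status_code == 200
    except requests.RequestException:
        return False


def kill_one_payment_pod():
    # Pick the first pod matching the label and delete it
    pod = subprocess.run(
        ["kubectl", "get", "pod", "-l", "app=payment-service",
         "-o", "jsonpath={.items[0].metadata.name}"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    subprocess.run(["kubectl", "delete", "pod", pod, "--wait=false"], check=True)


experiment = ChaosExperiment(
    name="payment-pod-kill",
    hypothesis="Killing one payment pod does not violate our latency or error-rate SLOs",
    blast_radius="one pod in the payment-service deployment",
    rollback_plan=lambda: None,  # Kubernetes reschedules the pod automatically
    steady_state_check=payment_service_healthy,
    chaos_injection=kill_one_payment_pod,
)

ChaosOrchestrator().run_experiment(experiment)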
Network Chaos
Latency Injection
# network_chaos.py
import subprocess
from typing import List


class NetworkChaos:
    def __init__(self, interface='eth0'):
        self.interface = interface

    def add_latency(self, delay_ms=100, variation=20, correlation=0.5):
        """Add network latency with jitter (netem's third argument is the jitter correlation)"""
        cmd = f"tc qdisc add dev {self.interface} root netem delay {delay_ms}ms {variation}ms {correlation * 100:.0f}%"
        subprocess.run(cmd.split(), check=True)

    def add_packet_loss(self, loss_percent=5):
        """Simulate packet loss"""
        cmd = f"tc qdisc add dev {self.interface} root netem loss {loss_percent}%"
        subprocess.run(cmd.split(), check=True)

    def add_bandwidth_limit(self, rate_kbps=1000):
        """Limit bandwidth with a token bucket filter"""
        cmd = f"tc qdisc add dev {self.interface} root tbf rate {rate_kbps}kbit burst 32kbit latency 400ms"
        subprocess.run(cmd.split(), check=True)

    def partition_network(self, blocked_ips: List[str]):
        """Create a network partition by dropping traffic to and from the given hosts"""
        for ip in blocked_ips:
            subprocess.run(f"iptables -A INPUT -s {ip} -j DROP".split(), check=True)
            subprocess.run(f"iptables -A OUTPUT -d {ip} -j DROP".split(), check=True)

    def clear_all(self):
        """Remove all network chaos"""
        subprocess.run(f"tc qdisc del dev {self.interface} root".split(), check=False)
        subprocess.run(["iptables", "-F"], check=False)
Service Mesh Chaos
# istio-fault-injection.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service-chaos
spec:
  hosts:
    - payment-service
  http:
    - fault:
        delay:
          percentage:
            value: 10.0
          fixedDelay: 5s
        abort:
          percentage:
            value: 5.0
          httpStatus: 503
      route:
        - destination:
            host: payment-service
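It's worth confirming the injected faults are actually visible to clients. A quick, hedged client-side check; the URL, sample count, and thresholds are assumptions:
# verify_fault_injection.py (hypothetical client-side check)
import time

import requests


def measure(url="http://payment-service/health", samples=100):
    aborts, delayed = 0, 0
    for _ in range(samples):
        start = time.time()
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 503:
                aborts += 1
        except requests.RequestException:
            aborts += 1
        if time.time() - start >= 4.5:  # the injected delay above is 5s
            delayed += 1
    print(f"503 aborts: {aborts}/{samples} (expect ~5%), delayed calls: {delayed}/{samples} (expect ~10%)")


if __name__ == "__main__":
    measure()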
Application-Level Chaos
Database Chaos
# database_chaos.py
import random
import time

import psycopg2


class DatabaseChaos:
    def __init__(self, connection_string):
        self.conn_string = connection_string

    def slow_queries(self, duration_seconds=30):
        """Inject slow queries"""
        conn = psycopg2.connect(self.conn_string)
        cursor = conn.cursor()
        end_time = time.time() + duration_seconds
        while time.time() < end_time:
            # Run an expensive query (table names are illustrative)
            cursor.execute("""
                SELECT pg_sleep(5);
                SELECT * FROM large_table
                CROSS JOIN another_large_table
                LIMIT 1000;
            """)
            time.sleep(random.uniform(1, 5))
        cursor.close()
        conn.close()

    def connection_pool_exhaustion(self, num_connections=100):
        """Exhaust the connection pool"""
        connections = []
        for _ in range(num_connections):
            conn = psycopg2.connect(self.conn_string)
            connections.append(conn)
            time.sleep(0.1)
        # Hold the connections for 60 seconds
        time.sleep(60)
        # Clean up
        for conn in connections:
            conn.close()

    def lock_contention(self, table_name='critical_table'):
        """Create lock contention"""
        conn = psycopg2.connect(self.conn_string)
        cursor = conn.cursor()
        # Acquire an exclusive lock inside the implicit transaction
        cursor.execute(f"LOCK TABLE {table_name} IN EXCLUSIVE MODE")
        # Hold the lock for 30 seconds, then commit to release it
        time.sleep(30)
        conn.commit()
        cursor.close()
        conn.close()
Memory Chaos
# memory_chaos.py
import os
import time


class MemoryChaos:
    def __init__(self):
        self.allocated_memory = []
        self.running = False

    def memory_leak(self, rate_mb_per_second=10, duration_seconds=60):
        """Simulate a memory leak by steadily allocating and retaining memory"""
        self.running = True
        end_time = time.time() + duration_seconds
        while time.time() < end_time and self.running:
            # Allocate memory and keep a reference so it is never freed
            data = bytearray(rate_mb_per_second * 1024 * 1024)
            self.allocated_memory.append(data)
            time.sleep(1)
        # Clean up
        self.allocated_memory.clear()

    def memory_spike(self, size_mb=1000):
        """Create a sudden memory spike"""
        data = bytearray(size_mb * 1024 * 1024)
        time.sleep(10)
        del data

    def cache_invalidation(self):
        """Drop the kernel page cache, dentries and inodes (requires root)"""
        os.system("sync && echo 3 > /proc/sys/vm/drop_caches")
Chaos Testing in CI/CD
Pipeline Integration
# .github/workflows/chaos-tests.yml
name: Chaos Engineering Tests

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Deploy to Test Environment
        run: |
          kubectl apply -f k8s/test-environment/
          kubectl wait --for=condition=ready pod -l app=test-app --timeout=300s

      - name: Run Baseline Tests
        run: |
          npm run test:performance
          npm run test:integration

      - name: Inject Pod Failures
        run: |
          kubectl apply -f chaos/pod-failure.yaml
          sleep 60

      - name: Verify Resilience
        run: |
          npm run test:resilience

      - name: Inject Network Latency
        run: |
          kubectl apply -f chaos/network-latency.yaml
          sleep 60

      - name: Check SLOs
        run: |
          python scripts/check_slos.py --error-rate 0.01 --latency-p99 500

      - name: Clean Up Chaos
        if: always()
        run: |
          kubectl delete -f chaos/
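The "Check SLOs" step calls a script the repository has to provide. Here is a minimal sketch of what `scripts/check_slos.py` might look like, assuming a reachable Prometheus and the common `http_requests_total` / `http_request_duration_seconds` metric names; adjust both for your stack:
# scripts/check_slos.py (a minimal sketch -- Prometheus URL and metric names are assumptions)
import argparse
import sys

import requests

PROM_URL = "http://prometheus:9090/api/v1/query"  # adjust for your cluster


def query(promql):
    """Run an instant PromQL query and return the first value (0.0 if no data)."""
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--error-rate", type=float, required=True, help="max allowed error ratio, e.g. 0.01")
    parser.add_argument("--latency-p99", type=float, required=True, help="max allowed p99 latency in ms")
    args = parser.parse_args()

    error_rate = query(
        "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
    )
    latency_p99_ms = query(
        "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    ) * 1000

    ok = error_rate <= args.error_rate and latency_p99_ms <= args.latency_p99
    print(f"error_rate={error_rate:.4f} latency_p99={latency_p99_ms:.0f}ms -> {'PASS' if ok else 'FAIL'}")
    sys.exit(0 if ok else 1)


if __name__ == "__main__":
    main()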
Game Day Exercises
Planning Template
# Game Day: Database Failure Simulation
## Objective
Test system resilience to primary database failure
## Hypothesis
System will maintain 99% availability during database failover
## Participants
- Incident Commander: John Doe
- Technical Lead: Jane Smith
- Observer: Bob Johnson
## Timeline
- 10:00 - Pre-checks and baseline metrics
- 10:15 - Inject failure
- 10:30 - Monitor recovery
- 11:00 - Post-mortem discussion
## Success Criteria
- Automatic failover completes within 2 minutes
- No data loss
- Error rate stays below 1%
- All alerts fire correctly
## Rollback Plan
1. Restore database from backup
2. Clear connection pools
3. Restart application pods
Execution Script
# gameday_executor.py
import time


class GameDayExecutor:
    def __init__(self):
        self.start_time = None
        self.metrics = []

    def run_scenario(self, scenario_name):
        print(f"Starting Game Day: {scenario_name}")
        self.start_time = time.time()

        # Phase 1: Steady State
        print("Phase 1: Verifying steady state...")
        if not self.verify_steady_state():
            print("System not ready, aborting")
            return

        # Phase 2: Inject Failure
        print("Phase 2: Injecting failure...")
        self.inject_failure(scenario_name)

        # Phase 3: Observe
        print("Phase 3: Observing system behavior...")
        self.observe_and_record(duration=300)

        # Phase 4: Recovery
        print("Phase 4: Verifying recovery...")
        recovery_time = self.measure_recovery_time()

        # Phase 5: Analysis
        print("Phase 5: Analyzing results...")
        self.generate_report(recovery_time)

    def verify_steady_state(self):
        # Check system health (the helper methods are not shown here)
        health_checks = [
            self.check_service_health('api'),
            self.check_service_health('database'),
            self.check_service_health('cache'),
            self.check_error_rate() < 0.001,
            self.check_latency_p99() < 100
        ]
        return all(health_checks)
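The executor relies on helpers such as `check_service_health`, `check_error_rate`, and `check_latency_p99` that aren't shown. One plausible shape for the health check, assuming every service exposes an internal `/healthz` endpoint (attach it to the class or call it as a free function):
# Hypothetical helper for GameDayExecutor; the /healthz endpoint naming is an assumption
import requests


def check_service_health(service, timeout=5):
    """Treat an HTTP 200 from the service's health endpoint as healthy."""
    try:
        response = requests.get(f"http://{service}.internal/healthz", timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False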
Observability for Chaos
Metrics Collection
# chaos_metrics.py
class ChaosMetrics:
    def __init__(self):
        # PrometheusClient is assumed to be a thin wrapper around your metrics backend
        self.prometheus = PrometheusClient()

    def record_experiment(self, experiment_name, outcome):
        self.prometheus.gauge(
            'chaos_experiment_status',
            1 if outcome == 'success' else 0,
            labels={'experiment': experiment_name}
        )

    def record_blast_radius(self, affected_services):
        self.prometheus.gauge(
            'chaos_blast_radius',
            len(affected_services),
            labels={'services': ','.join(affected_services)}
        )

    def record_recovery_time(self, seconds):
        self.prometheus.histogram(
            'chaos_recovery_time_seconds',
            seconds
        )
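If you don't already have a wrapper, the same three metrics can be exported with the official `prometheus_client` library; this is a sketch, and the port and label values are assumptions:
# chaos_metrics_prometheus.py (sketch using the official prometheus_client library;
# metric names match the dashboard below)
from prometheus_client import Gauge, Histogram, start_http_server

EXPERIMENT_STATUS = Gauge(
    "chaos_experiment_status", "1 if the last run of an experiment succeeded, else 0", ["experiment"]
)
BLAST_RADIUS = Gauge("chaos_blast_radius", "Number of services affected by the last experiment")
RECOVERY_TIME = Histogram("chaos_recovery_time_seconds", "Time to return to steady state")

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    EXPERIMENT_STATUS.labels(experiment="payment-pod-kill").set(1)
    BLAST_RADIUS.set(3)
    RECOVERY_TIME.observe(42.0)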
Dashboards
{
  "dashboard": {
    "title": "Chaos Engineering Dashboard",
    "panels": [
      {
        "title": "Experiment Success Rate",
        "query": "avg(chaos_experiment_status) * 100"
      },
      {
        "title": "Mean Time to Recovery",
        "query": "histogram_quantile(0.5, sum(rate(chaos_recovery_time_seconds_bucket[1h])) by (le))"
      },
      {
        "title": "Blast Radius Trend",
        "query": "avg_over_time(chaos_blast_radius[7d])"
      },
      {
        "title": "System Resilience Score",
        "query": "(1 - (rate(error_total[5m]) / rate(request_total[5m]))) * 100"
      }
    ]
  }
}
Safety Mechanisms
Automatic Abort
# safety_mechanisms.py
import subprocess
import time


class ChaosSafety:
    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.abort_flag = False

    def monitor_safety(self):
        while not self.abort_flag:
            if self.check_thresholds_exceeded():
                self.emergency_stop()
                break
            time.sleep(1)

    def check_thresholds_exceeded(self):
        checks = [
            self.get_error_rate() > self.thresholds['error_rate'],
            self.get_latency_p99() > self.thresholds['latency_p99'],
            self.get_availability() < self.thresholds['availability']
        ]
        return any(checks)

    def emergency_stop(self):
        print("EMERGENCY: Safety thresholds exceeded, aborting chaos!")
        # Roll back all chaos
        subprocess.run(['kubectl', 'delete', '-f', 'chaos/'], check=False)
        # Alert on-call (the get_* and send_alert helpers are not shown here)
        self.send_alert("Chaos experiment aborted due to safety threshold breach")
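A short sketch of wiring the monitor into an experiment, assuming the `get_*` helpers above are implemented; `run_experiment()` stands in for whatever injection you are running:
# Hypothetical wiring: monitor safety in the background while the experiment runs
import threading

safety = ChaosSafety(thresholds={"error_rate": 0.05, "latency_p99": 1000, "availability": 0.99})
monitor = threading.Thread(target=safety.monitor_safety, daemon=True)
monitor.start()

run_experiment()          # placeholder for the actual chaos injection
safety.abort_flag = True  # stop the monitor once the experiment finishes cleanly
monitor.join(timeout=5)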
Chaos Engineering Maturity Model
Level 1: Ad-hoc
- Manual failure injection
- No formal process
- Limited scope
Level 2: Defined
- Documented experiments
- Regular game days
- Basic automation
Level 3: Managed
- Automated chaos tests
- Continuous validation
- Metrics tracking
Level 4: Optimized
- Chaos in production
- Self-healing systems
- Predictive failures
Common Failure Scenarios
The Top 10 to Test
- Service dependency failure
- Database connection pool exhaustion
- Memory leaks and OOM kills
- Network partitions
- Disk full scenarios (a minimal sketch follows this list)
- Certificate expiration
- DNS failures
- Rate limiting and throttling
- Cascading failures
- Time drift issues
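As a concrete example of the disk-full scenario above, here is a hedged sketch that fills a target filesystem while keeping a safety margin; the path and margin are assumptions, and it should only be run on disposable hosts:
# disk_fill.py (hypothetical sketch for the disk-full scenario)
import os
import shutil


def fill_disk(path="/var/chaos-fill", leave_free_mb=200, chunk_mb=100):
    """Fill the filesystem that `path` lives on, leaving a small safety margin."""
    chunk = b"\0" * chunk_mb * 1024 * 1024
    with open(path, "wb") as f:
        while shutil.disk_usage(os.path.dirname(path)).free > leave_free_mb * 1024 * 1024:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())


def cleanup(path="/var/chaos-fill"):
    """Remove the filler file to restore free space."""
    if os.path.exists(path):
        os.remove(path)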
Lessons Learned
After 2 years of chaos engineering:
- Start small: Begin with read-only services
- Communicate clearly: Everyone should know about experiments
- Automate rollbacks: Manual rollbacks are too slow
- Monitor everything: You can't improve what you don't measure
- Learn from failures: Every experiment teaches something
- Build confidence gradually: Start in dev, move to production
- Make it routine: Regular chaos prevents surprise failures
Conclusion
Chaos engineering isn't about breaking things—it's about building confidence. By proactively discovering weaknesses, we build systems that gracefully handle failure. Start small, measure everything, and gradually increase complexity. Remember: it's better to fail during a controlled experiment than during Black Friday.
The goal isn't to avoid failure entirely—that's impossible. The goal is to minimize the blast radius and recovery time when failures inevitably occur. Embrace chaos, and build antifragile systems that get stronger under stress.