Chaos Engineering for System Resilience

David Childs

Implement chaos engineering practices to proactively discover weaknesses and build resilient systems that survive real-world failures.

"Everything fails all the time" - Werner Vogels, Amazon CTO. After our payment system crashed during Black Friday, costing us $2M in lost sales, we implemented chaos engineering. Now we break things on purpose so they don't break when it matters. Here's how to build antifragile systems.

Why Chaos Engineering?

Traditional testing verifies known scenarios. Chaos engineering discovers unknown unknowns:

  • Network partitions that split your cluster
  • Resource exhaustion from memory leaks
  • Cascading failures from service dependencies
  • Data corruption from partial writes
  • Performance degradation under load

Getting Started with Chaos Monkey

Basic Setup

# chaos-monkey-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaosmonkey-config
  namespace: chaos-engineering
data:
  config.yaml: |
    chaosmonkey:
      enabled: true
      schedule:
        type: "cron"
        path: "0 0 9-17 * * MON-FRI"  # Business hours only
      
      accounts:
        - name: production
          region: us-east-1
          stack: "*"
          detail: "*"
          owner: platform-team@company.com
      
      spinnaker:
        enabled: false
      
      kill:
        enabled: true
        probability: 0.1  # 10% chance
        meanTimeBetweenKillsInWorkDays: 2

Kubernetes Chaos

# chaos-mesh-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-kill
  namespace: production
spec:
  mode: random-max-percent
  value: "30"
  action: pod-kill
  duration: "60s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  scheduler:
    cron: "@every 2h"

Building a Chaos Engineering Platform

Experiment Framework

# chaos_framework.py
import logging
import time
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

import boto3

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    blast_radius: str
    rollback_plan: Callable
    steady_state_check: Callable
    chaos_injection: Callable
    
class ChaosOrchestrator:
    def __init__(self):
        self.experiments = []
        self.metrics_client = boto3.client('cloudwatch')
        
    def run_experiment(self, experiment: ChaosExperiment):
        logging.info(f"Starting experiment: {experiment.name}")
        logging.info(f"Hypothesis: {experiment.hypothesis}")
        
        try:
            # Verify steady state
            if not experiment.steady_state_check():
                logging.error("System not in steady state, aborting")
                return False
            
            # Record baseline metrics
            baseline_metrics = self.capture_metrics()
            
            # Inject chaos
            logging.info(f"Injecting chaos: {experiment.name}")
            experiment.chaos_injection()
            
            # Monitor impact
            time.sleep(60)  # Observation period
            
            # Check if steady state maintained
            if not experiment.steady_state_check():
                logging.warning("Steady state lost, rolling back")
                experiment.rollback_plan()
                return False
            
            # Compare metrics
            post_chaos_metrics = self.capture_metrics()
            self.analyze_impact(baseline_metrics, post_chaos_metrics)
            
            return True
            
        except Exception as e:
            logging.error(f"Experiment failed: {e}")
            experiment.rollback_plan()
            return False
        
    def capture_metrics(self):
        # Snapshot the key health signals before and after injection
        return {
            'error_rate': self.get_metric('ErrorRate'),
            'latency_p99': self.get_metric('LatencyP99'),
            'success_rate': self.get_metric('SuccessRate'),
            'throughput': self.get_metric('Throughput')
        }

    def get_metric(self, name, namespace='ChaosExperiments'):
        # Average over the last five minutes from CloudWatch; the namespace
        # is an assumption -- point it at wherever your services publish
        response = self.metrics_client.get_metric_statistics(
            Namespace=namespace,
            MetricName=name,
            StartTime=datetime.utcnow() - timedelta(minutes=5),
            EndTime=datetime.utcnow(),
            Period=300,
            Statistics=['Average'],
        )
        datapoints = response.get('Datapoints', [])
        return datapoints[-1]['Average'] if datapoints else 0.0

    def analyze_impact(self, baseline, post_chaos):
        # Log the delta for each signal so every experiment leaves a record
        for key, before in baseline.items():
            after = post_chaos.get(key)
            logging.info(f"{key}: baseline={before} post-chaos={after}")
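
A hypothetical experiment wired into this framework might look like the following; the health endpoint, label selector, and rollback command are illustrative stand-ins for your own environment:

# run_example.py
import subprocess

import requests

from chaos_framework import ChaosExperiment, ChaosOrchestrator

def payment_healthy():
    # Steady state: the (hypothetical) health endpoint answers quickly
    resp = requests.get("http://payment-service/healthz", timeout=2)
    return resp.status_code == 200

def kill_payment_pods():
    subprocess.run(
        ["kubectl", "delete", "pod", "-n", "production",
         "-l", "app=payment-service", "--wait=false"],
        check=True)

def restart_payment():
    subprocess.run(
        ["kubectl", "rollout", "restart", "deployment/payment-service",
         "-n", "production"],
        check=False)

experiment = ChaosExperiment(
    name="payment-pod-kill",
    hypothesis="Killing payment pods keeps the error rate under 1%",
    blast_radius="payment-service pods in production",
    rollback_plan=restart_payment,
    steady_state_check=payment_healthy,
    chaos_injection=kill_payment_pods,
)
ChaosOrchestrator().run_experiment(experiment)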

Network Chaos

Latency Injection

# network_chaos.py
# Note: tc and iptables require root privileges
import subprocess
from typing import List

class NetworkChaos:
    def __init__(self, interface='eth0'):
        self.interface = interface
    
    def add_latency(self, delay_ms=100, variation=20, correlation=0.5):
        """Add latency with jitter; netem's third field is a correlation
        between consecutive delays, not a drop probability"""
        cmd = (f"tc qdisc add dev {self.interface} root netem "
               f"delay {delay_ms}ms {variation}ms {correlation * 100:.0f}%")
        subprocess.run(cmd.split(), check=True)
        
    def add_packet_loss(self, loss_percent=5):
        """Simulate packet loss"""
        cmd = f"tc qdisc add dev {self.interface} root netem loss {loss_percent}%"
        subprocess.run(cmd.split(), check=True)
        
    def add_bandwidth_limit(self, rate_kbps=1000):
        """Limit bandwidth"""
        cmd = f"tc qdisc add dev {self.interface} root tbf rate {rate_kbps}kbit burst 32kbit latency 400ms"
        subprocess.run(cmd.split(), check=True)
        
    def partition_network(self, blocked_ips: List[str]):
        """Create network partition"""
        for ip in blocked_ips:
            cmd = f"iptables -A INPUT -s {ip} -j DROP"
            subprocess.run(cmd.split(), check=True)
            cmd = f"iptables -A OUTPUT -d {ip} -j DROP"
            subprocess.run(cmd.split(), check=True)
    
    def clear_all(self):
        """Remove all network chaos (note: iptables -F flushes ALL filter rules)"""
        subprocess.run(f"tc qdisc del dev {self.interface} root".split(), check=False)
        subprocess.run(["iptables", "-F"], check=False)

Service Mesh Chaos

# istio-fault-injection.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service-chaos
spec:
  hosts:
  - payment-service
  http:
  - fault:
      delay:
        percentage:
          value: 10.0
        fixedDelay: 5s
      abort:
        percentage:
          value: 5.0
        httpStatus: 503
    route:
    - destination:
        host: payment-service
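
One way to confirm the rules are live is to hammer the route and count aborts and slow responses. A rough probe, assuming the service is reachable at the hostname above from inside the mesh:

# fault_probe.py
# Rough check that the Istio fault rules above are taking effect
import time

import requests

slow, aborted, total = 0, 0, 200
for _ in range(total):
    start = time.time()
    try:
        resp = requests.get("http://payment-service/", timeout=10)
        if resp.status_code == 503:
            aborted += 1
    except requests.exceptions.RequestException:
        aborted += 1
    if time.time() - start >= 4.5:  # delayed requests take ~5s
        slow += 1

# Expect roughly 10% delayed and 5% aborted, per the VirtualService
print(f"slow: {slow / total:.1%}, aborted: {aborted / total:.1%}")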

Application-Level Chaos

Database Chaos

# database_chaos.py
import psycopg2
import random
import time

class DatabaseChaos:
    def __init__(self, connection_string):
        self.conn_string = connection_string
        
    def slow_queries(self, duration_seconds=30):
        """Inject slow queries"""
        conn = psycopg2.connect(self.conn_string)
        cursor = conn.cursor()
        
        end_time = time.time() + duration_seconds
        while time.time() < end_time:
            # Run expensive query
            cursor.execute("""
                SELECT pg_sleep(5);
                SELECT * FROM large_table 
                CROSS JOIN another_large_table 
                LIMIT 1000;
            """)
            time.sleep(random.uniform(1, 5))
        
        cursor.close()
        conn.close()
    
    def connection_pool_exhaustion(self, num_connections=100):
        """Exhaust connection pool"""
        connections = []
        for _ in range(num_connections):
            conn = psycopg2.connect(self.conn_string)
            connections.append(conn)
            time.sleep(0.1)
        
        # Hold connections for 60 seconds
        time.sleep(60)
        
        # Clean up
        for conn in connections:
            conn.close()
    
    def lock_contention(self, table_name='critical_table'):
        """Create lock contention"""
        conn = psycopg2.connect(self.conn_string)
        cursor = conn.cursor()
        
        # Acquire exclusive lock
        cursor.execute(f"LOCK TABLE {table_name} IN EXCLUSIVE MODE")
        
        # Hold lock for 30 seconds
        time.sleep(30)
        
        conn.commit()
        cursor.close()
        conn.close()
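
Lock contention is most informative when measured from the application's side. A sketch that holds the lock in a background thread while timing a competing write (the DSN and table are placeholders):

# Hypothetical usage: hold a lock while timing a competing statement
import threading
import time

import psycopg2

chaos = DatabaseChaos("dbname=app user=app host=db.internal")  # placeholder DSN
locker = threading.Thread(target=chaos.lock_contention, args=("critical_table",))
locker.start()
time.sleep(1)  # give the locker a head start

conn = psycopg2.connect(chaos.conn_string)
cursor = conn.cursor()
start = time.time()
cursor.execute("UPDATE critical_table SET updated_at = now() WHERE id = 1")
conn.commit()
print(f"blocked for {time.time() - start:.1f}s")  # ~30s while the lock is held

cursor.close()
conn.close()
locker.join()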

Memory Chaos

# memory_chaos.py
import os
import time

class MemoryChaos:
    def __init__(self):
        self.allocated_memory = []
        self.running = False
        
    def memory_leak(self, rate_mb_per_second=10, duration_seconds=60):
        """Simulate memory leak"""
        self.running = True
        end_time = time.time() + duration_seconds
        
        while time.time() < end_time and self.running:
            # Allocate memory
            data = bytearray(rate_mb_per_second * 1024 * 1024)
            self.allocated_memory.append(data)
            time.sleep(1)
        
        # Clean up
        self.allocated_memory.clear()
    
    def memory_spike(self, size_mb=1000):
        """Create sudden memory spike"""
        data = bytearray(size_mb * 1024 * 1024)
        time.sleep(10)
        del data
    
    def cache_invalidation(self):
        """Drop the kernel page cache, dentries, and inodes (requires root)"""
        os.system("sync && echo 3 > /proc/sys/vm/drop_caches")

Chaos Testing in CI/CD

Pipeline Integration

# .github/workflows/chaos-tests.yml
name: Chaos Engineering Tests

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Deploy to Test Environment
        run: |
          kubectl apply -f k8s/test-environment/
          kubectl wait --for=condition=ready pod -l app=test-app --timeout=300s
      
      - name: Run Baseline Tests
        run: |
          npm run test:performance
          npm run test:integration
      
      - name: Inject Pod Failures
        run: |
          kubectl apply -f chaos/pod-failure.yaml
          sleep 60
      
      - name: Verify Resilience
        run: |
          npm run test:resilience
          
      - name: Inject Network Latency
        run: |
          kubectl apply -f chaos/network-latency.yaml
          sleep 60
      
      - name: Check SLOs
        run: |
          python scripts/check_slos.py --error-rate 0.01 --latency-p99 500
      
      - name: Clean Up Chaos
        if: always()
        run: |
          kubectl delete -f chaos/
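
The workflow's final gate is scripts/check_slos.py. A minimal sketch, assuming SLO metrics are queryable from a Prometheus endpoint (the URL and metric names are placeholders):

# scripts/check_slos.py (sketch)
import argparse
import sys

import requests

PROMETHEUS_URL = "http://prometheus.test.svc:9090"  # placeholder

def query(promql):
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--error-rate", type=float, required=True)
    parser.add_argument("--latency-p99", type=float, required=True)
    args = parser.parse_args()

    error_rate = query("rate(error_total[5m]) / rate(request_total[5m])")
    latency_p99 = query(
        "histogram_quantile(0.99, "
        "sum(rate(request_duration_ms_bucket[5m])) by (le))")

    if error_rate > args.error_rate or latency_p99 > args.latency_p99:
        print(f"SLO breach: error_rate={error_rate:.4f}, p99={latency_p99:.0f}ms")
        sys.exit(1)
    print("SLOs within thresholds")

if __name__ == "__main__":
    main()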

Game Day Exercises

Planning Template

# Game Day: Database Failure Simulation

## Objective
Test system resilience to primary database failure

## Hypothesis
System will maintain 99% availability during database failover

## Participants
- Incident Commander: John Doe
- Technical Lead: Jane Smith
- Observer: Bob Johnson

## Timeline
- 10:00 - Pre-checks and baseline metrics
- 10:15 - Inject failure
- 10:30 - Monitor recovery
- 11:00 - Post-mortem discussion

## Success Criteria
- Automatic failover completes within 2 minutes
- No data loss
- Error rate stays below 1%
- All alerts fire correctly

## Rollback Plan
1. Restore database from backup
2. Clear connection pools
3. Restart application pods

Execution Script

# gameday_executor.py
import time

class GameDayExecutor:
    """Drives a Game Day scenario end to end. The inject_failure,
    observe_and_record, measure_recovery_time, generate_report, and
    check_* helpers are integration points for your own tooling."""

    def __init__(self):
        self.start_time = None
        self.metrics = []
        
    def run_scenario(self, scenario_name):
        print(f"Starting Game Day: {scenario_name}")
        self.start_time = time.time()
        
        # Phase 1: Steady State
        print("Phase 1: Verifying steady state...")
        if not self.verify_steady_state():
            print("System not ready, aborting")
            return
        
        # Phase 2: Inject Failure
        print("Phase 2: Injecting failure...")
        self.inject_failure(scenario_name)
        
        # Phase 3: Observe
        print("Phase 3: Observing system behavior...")
        self.observe_and_record(duration=300)
        
        # Phase 4: Recovery
        print("Phase 4: Verifying recovery...")
        recovery_time = self.measure_recovery_time()
        
        # Phase 5: Analysis
        print("Phase 5: Analyzing results...")
        self.generate_report(recovery_time)
    
    def verify_steady_state(self):
        # Check system health
        health_checks = [
            self.check_service_health('api'),
            self.check_service_health('database'),
            self.check_service_health('cache'),
            self.check_error_rate() < 0.001,
            self.check_latency_p99() < 100
        ]
        return all(health_checks)
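
Of the stubbed helpers, measure_recovery_time is the one most worth pinning down early. One reasonable definition polls the steady-state check until it passes again:

    # One possible measure_recovery_time: poll until steady state returns
    def measure_recovery_time(self, poll_interval=5, timeout=600):
        start = time.time()
        while time.time() - start < timeout:
            if self.verify_steady_state():
                return time.time() - start
            time.sleep(poll_interval)
        return None  # never recovered within the timeout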

Observability for Chaos

Metrics Collection

# chaos_metrics.py
from prometheus_client import Gauge, Histogram

# prometheus_client metric objects are module-level singletons
EXPERIMENT_STATUS = Gauge(
    'chaos_experiment_status',
    'Last outcome of a chaos experiment (1 = success, 0 = failure)',
    ['experiment'])
BLAST_RADIUS = Gauge(
    'chaos_blast_radius',
    'Number of services affected by the last experiment')
RECOVERY_TIME = Histogram(
    'chaos_recovery_time_seconds',
    'Time for the system to return to steady state')

class ChaosMetrics:
    def record_experiment(self, experiment_name, outcome):
        EXPERIMENT_STATUS.labels(experiment=experiment_name).set(
            1 if outcome == 'success' else 0)

    def record_blast_radius(self, affected_services):
        # Record only the count; a label holding the joined service list
        # would create unbounded cardinality
        BLAST_RADIUS.set(len(affected_services))

    def record_recovery_time(self, seconds):
        RECOVERY_TIME.observe(seconds)
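
These metrics still need an endpoint for Prometheus to scrape; with prometheus_client that is a single call at process start:

# Expose /metrics on port 8000 for Prometheus to scrape
from prometheus_client import start_http_server

start_http_server(8000)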

Dashboards

{
  "dashboard": {
    "title": "Chaos Engineering Dashboard",
    "panels": [
      {
        "title": "Experiment Success Rate",
        "query": "avg(chaos_experiment_status) * 100"
      },
      {
        "title": "Mean Time to Recovery",
        "query": "histogram_quantile(0.5, chaos_recovery_time_seconds)"
      },
      {
        "title": "Blast Radius Trend",
        "query": "avg_over_time(chaos_blast_radius[7d])"
      },
      {
        "title": "System Resilience Score",
        "query": "(1 - (rate(error_total[5m]) / rate(request_total[5m]))) * 100"
      }
    ]
  }
}

Safety Mechanisms

Automatic Abort

# safety_mechanisms.py
import subprocess
import time

class ChaosSafety:
    """Watchdog that aborts all chaos when SLO thresholds are breached.
    The get_* and send_alert helpers are integration points for your
    monitoring and alerting stack."""

    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.abort_flag = False
        
    def monitor_safety(self):
        while not self.abort_flag:
            if self.check_thresholds_exceeded():
                self.emergency_stop()
                break
            time.sleep(1)
    
    def check_thresholds_exceeded(self):
        checks = [
            self.get_error_rate() > self.thresholds['error_rate'],
            self.get_latency_p99() > self.thresholds['latency_p99'],
            self.get_availability() < self.thresholds['availability']
        ]
        return any(checks)
    
    def emergency_stop(self):
        print("EMERGENCY: Safety thresholds exceeded, aborting chaos!")
        # Roll back all chaos
        subprocess.run(['kubectl', 'delete', '-f', 'chaos/'], check=False)
        # Alert on-call
        self.send_alert("Chaos experiment aborted due to safety threshold breach")
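
The watchdog should run for the duration of the experiment, not after it. A sketch that starts it in a daemon thread before injection, assuming the get_* helpers have been wired to real monitoring (thresholds are illustrative):

# Hypothetical usage: watchdog runs alongside the experiment
import threading
import time

safety = ChaosSafety(thresholds={
    'error_rate': 0.01,    # abort above 1% errors
    'latency_p99': 500,    # abort above 500 ms p99
    'availability': 0.99,  # abort below 99% availability
})
watchdog = threading.Thread(target=safety.monitor_safety, daemon=True)
watchdog.start()
try:
    time.sleep(300)  # placeholder: the chaos injection runs here
finally:
    safety.abort_flag = True  # stop the watchdog when the experiment ends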

Chaos Engineering Maturity Model

Level 1: Ad-hoc

  • Manual failure injection
  • No formal process
  • Limited scope

Level 2: Defined

  • Documented experiments
  • Regular game days
  • Basic automation

Level 3: Managed

  • Automated chaos tests
  • Continuous validation
  • Metrics tracking

Level 4: Optimized

  • Chaos in production
  • Self-healing systems
  • Predictive failure analysis

Common Failure Scenarios

The Top 10 to Test

  1. Service dependency failure
  2. Database connection pool exhaustion
  3. Memory leaks and OOM kills
  4. Network partitions
  5. Disk full scenarios (see the sketch after this list)
  6. Certificate expiration
  7. DNS failures
  8. Rate limiting and throttling
  9. Cascading failures
  10. Time drift issues
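
Most of these can be scripted with the tools above; disk exhaustion (scenario 5) is an easy first target. A minimal sketch, assuming a throwaway path on the target filesystem and a safety margin so the host stays usable:

# disk_chaos.py (sketch)
import os
import time

def fill_disk(path="/var/tmp/chaos.fill", leave_free_mb=100, hold_seconds=60):
    """Fill the filesystem containing `path`, leaving a small safety margin"""
    stat = os.statvfs(os.path.dirname(path))
    free_bytes = stat.f_bavail * stat.f_frsize
    target = max(0, free_bytes - leave_free_mb * 1024 * 1024)
    chunk = b"\0" * (8 * 1024 * 1024)
    written = 0
    try:
        with open(path, "wb") as f:
            while written < target:
                f.write(chunk)
                written += len(chunk)
            f.flush()
            os.fsync(f.fileno())
        time.sleep(hold_seconds)  # observe how services handle ENOSPC
    finally:
        os.remove(path)  # always release the space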

Lessons Learned

After 2 years of chaos engineering:

  1. Start small: Begin with read-only services
  2. Communicate clearly: Everyone should know about experiments
  3. Automate rollbacks: Manual rollbacks are too slow
  4. Monitor everything: You can't improve what you don't measure
  5. Learn from failures: Every experiment teaches something
  6. Build confidence gradually: Start in dev, move to production
  7. Make it routine: Regular chaos prevents surprise failures

Conclusion

Chaos engineering isn't about breaking things—it's about building confidence. By proactively discovering weaknesses, we build systems that gracefully handle failure. Start small, measure everything, and gradually increase complexity. Remember: it's better to fail during a controlled experiment than during Black Friday.

The goal isn't to avoid failure entirely—that's impossible. The goal is to minimize the blast radius and recovery time when failures inevitably occur. Embrace chaos, and build antifragile systems that get stronger under stress.
