Chaos Engineering for System Resilience

David Childs

Implement chaos engineering practices to proactively discover weaknesses and build resilient systems that survive real-world failures.

"Everything fails all the time" - Werner Vogels, Amazon CTO. After our payment system crashed during Black Friday, costing us $2M in lost sales, we implemented chaos engineering. Now we break things on purpose so they don't break when it matters. Here's how to build antifragile systems.

Why Chaos Engineering?

Traditional testing verifies known scenarios. Chaos engineering discovers unknown unknowns:

  • Network partitions that split your cluster
  • Resource exhaustion from memory leaks
  • Cascading failures from service dependencies
  • Data corruption from partial writes
  • Performance degradation under load

Getting Started with Chaos Monkey

Basic Setup

# chaos-monkey-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chaosmonkey-config
  namespace: chaos-engineering
data:
  config.yaml: |
    chaosmonkey:
      enabled: true
      schedule:
        type: "cron"
        path: "0 0 9-17 * * MON-FRI"  # Business hours only
      
      accounts:
        - name: production
          region: us-east-1
          stack: "*"
          detail: "*"
          owner: platform-team@company.com
      
      spinnaker:
        enabled: false
      
      kill:
        enabled: true
        probability: 0.1  # 10% chance
        meanTimeBetweenKillsInWorkDays: 2

Kubernetes Chaos

# chaos-mesh-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-kill
  namespace: production
spec:
  mode: random-max-percent
  value: "30"
  action: pod-kill
  duration: "60s"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  scheduler:
    cron: "@every 2h"

Building a Chaos Engineering Platform

Experiment Framework

# chaos_framework.py
import logging
import time
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

import boto3

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    blast_radius: str
    rollback_plan: Callable
    steady_state_check: Callable
    chaos_injection: Callable
    
class ChaosOrchestrator:
    def __init__(self):
        self.experiments = []
        self.metrics_client = boto3.client('cloudwatch')
        
    def run_experiment(self, experiment: ChaosExperiment):
        logging.info(f"Starting experiment: {experiment.name}")
        logging.info(f"Hypothesis: {experiment.hypothesis}")
        
        try:
            # Verify steady state
            if not experiment.steady_state_check():
                logging.error("System not in steady state, aborting")
                return False
            
            # Record baseline metrics
            baseline_metrics = self.capture_metrics()
            
            # Inject chaos
            logging.info(f"Injecting chaos: {experiment.name}")
            experiment.chaos_injection()
            
            # Monitor impact
            time.sleep(60)  # Observation period
            
            # Check if steady state maintained
            if not experiment.steady_state_check():
                logging.warning("Steady state lost, rolling back")
                experiment.rollback_plan()
                return False
            
            # Compare metrics
            post_chaos_metrics = self.capture_metrics()
            self.analyze_impact(baseline_metrics, post_chaos_metrics)
            
            return True
            
        except Exception as e:
            logging.error(f"Experiment failed: {e}")
            experiment.rollback_plan()
            return False
        
    def capture_metrics(self):
        # Snapshot the key health signals before and after injection
        return {
            'error_rate': self.get_metric('ErrorRate'),
            'latency_p99': self.get_metric('LatencyP99'),
            'success_rate': self.get_metric('SuccessRate'),
            'throughput': self.get_metric('Throughput')
        }

    def get_metric(self, name, namespace='ChaosExperiments'):
        # Average over the last five minutes from CloudWatch; the namespace
        # is an assumption -- point it at wherever your services publish
        response = self.metrics_client.get_metric_statistics(
            Namespace=namespace,
            MetricName=name,
            StartTime=datetime.utcnow() - timedelta(minutes=5),
            EndTime=datetime.utcnow(),
            Period=300,
            Statistics=['Average'],
        )
        datapoints = response.get('Datapoints', [])
        return datapoints[-1]['Average'] if datapoints else 0.0

    def analyze_impact(self, baseline, post_chaos):
        # Log the delta for each signal so every experiment leaves a record
        for key, before in baseline.items():
            after = post_chaos.get(key)
            logging.info(f"{key}: baseline={before} post-chaos={after}")
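
A hypothetical experiment wired into this framework might look like the following; the health endpoint, label selector, and rollback command are illustrative stand-ins for your own environment:

# run_example.py
import subprocess

import requests

from chaos_framework import ChaosExperiment, ChaosOrchestrator

def payment_healthy():
    # Steady state: the (hypothetical) health endpoint answers quickly
    resp = requests.get("http://payment-service/healthz", timeout=2)
    return resp.status_code == 200

def kill_payment_pods():
    subprocess.run(
        ["kubectl", "delete", "pod", "-n", "production",
         "-l", "app=payment-service", "--wait=false"],
        check=True)

def restart_payment():
    subprocess.run(
        ["kubectl", "rollout", "restart", "deployment/payment-service",
         "-n", "production"],
        check=False)

experiment = ChaosExperiment(
    name="payment-pod-kill",
    hypothesis="Killing payment pods keeps the error rate under 1%",
    blast_radius="payment-service pods in production",
    rollback_plan=restart_payment,
    steady_state_check=payment_healthy,
    chaos_injection=kill_payment_pods,
)
ChaosOrchestrator().run_experiment(experiment)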

Network Chaos

Latency Injection

# network_chaos.py
# Note: tc and iptables require root privileges
import subprocess
from typing import List

class NetworkChaos:
    def __init__(self, interface='eth0'):
        self.interface = interface
    
    def add_latency(self, delay_ms=100, variation=20, correlation=0.5):
        """Add latency with jitter; netem's third field is a correlation
        between consecutive delays, not a drop probability"""
        cmd = (f"tc qdisc add dev {self.interface} root netem "
               f"delay {delay_ms}ms {variation}ms {correlation * 100:.0f}%")
        subprocess.run(cmd.split(), check=True)
        
    def add_packet_loss(self, loss_percent=5):
        """Simulate packet loss"""
        cmd = f"tc qdisc add dev {self.interface} root netem loss {loss_percent}%"
        subprocess.run(cmd.split(), check=True)
        
    def add_bandwidth_limit(self, rate_kbps=1000):
        """Limit bandwidth"""
        cmd = f"tc qdisc add dev {self.interface} root tbf rate {rate_kbps}kbit burst 32kbit latency 400ms"
        subprocess.run(cmd.split(), check=True)
        
    def partition_network(self, blocked_ips: List[str]):
        """Create network partition"""
        for ip in blocked_ips:
            cmd = f"iptables -A INPUT -s {ip} -j DROP"
            subprocess.run(cmd.split(), check=True)
            cmd = f"iptables -A OUTPUT -d {ip} -j DROP"
            subprocess.run(cmd.split(), check=True)
    
    def clear_all(self):
        """Remove all network chaos (note: iptables -F flushes ALL filter rules)"""
        subprocess.run(f"tc qdisc del dev {self.interface} root".split(), check=False)
        subprocess.run(["iptables", "-F"], check=False)

Service Mesh Chaos

# istio-fault-injection.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service-chaos
spec:
  hosts:
  - payment-service
  http:
  - fault:
      delay:
        percentage:
          value: 10.0
        fixedDelay: 5s
      abort:
        percentage:
          value: 5.0
        httpStatus: 503
    route:
    - destination:
        host: payment-service
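
One way to confirm the rules are live is to hammer the route and count aborts and slow responses. A rough probe, assuming the service is reachable at the hostname above from inside the mesh:

# fault_probe.py
# Rough check that the Istio fault rules above are taking effect
import time

import requests

slow, aborted, total = 0, 0, 200
for _ in range(total):
    start = time.time()
    try:
        resp = requests.get("http://payment-service/", timeout=10)
        if resp.status_code == 503:
            aborted += 1
    except requests.exceptions.RequestException:
        aborted += 1
    if time.time() - start >= 4.5:  # delayed requests take ~5s
        slow += 1

# Expect roughly 10% delayed and 5% aborted, per the VirtualService
print(f"slow: {slow / total:.1%}, aborted: {aborted / total:.1%}")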

Application-Level Chaos

Database Chaos

# database_chaos.py
import psycopg2
import random
import time

class DatabaseChaos:
    def __init__(self, connection_string):
        self.conn_string = connection_string
        
    def slow_queries(self, duration_seconds=30):
        """Inject slow queries"""
        conn = psycopg2.connect(self.conn_string)
        cursor = conn.cursor()
        
        end_time = time.time() + duration_seconds
        while time.time() < end_time:
            # Run expensive query
            cursor.execute("""
                SELECT pg_sleep(5);
                SELECT * FROM large_table 
                CROSS JOIN another_large_table 
                LIMIT 1000;
            """)
            time.sleep(random.uniform(1, 5))
        
        cursor.close()
        conn.close()
    
    def connection_pool_exhaustion(self, num_connections=100):
        """Exhaust connection pool"""
        connections = []
        for _ in range(num_connections):
            conn = psycopg2.connect(self.conn_string)
            connections.append(conn)
            time.sleep(0.1)
        
        # Hold connections for 60 seconds
        time.sleep(60)
        
        # Clean up
        for conn in connections:
            conn.close()
    
    def lock_contention(self, table_name='critical_table'):
        """Create lock contention"""
        conn = psycopg2.connect(self.conn_string)
        cursor = conn.cursor()
        
        # Acquire exclusive lock
        cursor.execute(f"LOCK TABLE {table_name} IN EXCLUSIVE MODE")
        
        # Hold lock for 30 seconds
        time.sleep(30)
        
        conn.commit()
        cursor.close()
        conn.close()
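
Lock contention is most informative when measured from the application's side. A sketch that holds the lock in a background thread while timing a competing write (the DSN and table are placeholders):

# Hypothetical usage: hold a lock while timing a competing statement
import threading
import time

import psycopg2

chaos = DatabaseChaos("dbname=app user=app host=db.internal")  # placeholder DSN
locker = threading.Thread(target=chaos.lock_contention, args=("critical_table",))
locker.start()
time.sleep(1)  # give the locker a head start

conn = psycopg2.connect(chaos.conn_string)
cursor = conn.cursor()
start = time.time()
cursor.execute("UPDATE critical_table SET updated_at = now() WHERE id = 1")
conn.commit()
print(f"blocked for {time.time() - start:.1f}s")  # ~30s while the lock is held

cursor.close()
conn.close()
locker.join()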

Memory Chaos

# memory_chaos.py
import os
import time

class MemoryChaos:
    def __init__(self):
        self.allocated_memory = []
        self.running = False
        
    def memory_leak(self, rate_mb_per_second=10, duration_seconds=60):
        """Simulate memory leak"""
        self.running = True
        end_time = time.time() + duration_seconds
        
        while time.time() < end_time and self.running:
            # Allocate memory
            data = bytearray(rate_mb_per_second * 1024 * 1024)
            self.allocated_memory.append(data)
            time.sleep(1)
        
        # Clean up
        self.allocated_memory.clear()
    
    def memory_spike(self, size_mb=1000):
        """Create sudden memory spike"""
        data = bytearray(size_mb * 1024 * 1024)
        time.sleep(10)
        del data
    
    def cache_invalidation(self):
        """Drop the kernel page cache, dentries, and inodes (requires root)"""
        os.system("sync && echo 3 > /proc/sys/vm/drop_caches")

Chaos Testing in CI/CD

Pipeline Integration

# .github/workflows/chaos-tests.yml
name: Chaos Engineering Tests

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Deploy to Test Environment
        run: |
          kubectl apply -f k8s/test-environment/
          kubectl wait --for=condition=ready pod -l app=test-app --timeout=300s
      
      - name: Run Baseline Tests
        run: |
          npm run test:performance
          npm run test:integration
      
      - name: Inject Pod Failures
        run: |
          kubectl apply -f chaos/pod-failure.yaml
          sleep 60
      
      - name: Verify Resilience
        run: |
          npm run test:resilience
          
      - name: Inject Network Latency
        run: |
          kubectl apply -f chaos/network-latency.yaml
          sleep 60
      
      - name: Check SLOs
        run: |
          python scripts/check_slos.py --error-rate 0.01 --latency-p99 500
      
      - name: Clean Up Chaos
        if: always()
        run: |
          kubectl delete -f chaos/
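
The workflow's final gate is scripts/check_slos.py. A minimal sketch, assuming SLO metrics are queryable from a Prometheus endpoint (the URL and metric names are placeholders):

# scripts/check_slos.py (sketch)
import argparse
import sys

import requests

PROMETHEUS_URL = "http://prometheus.test.svc:9090"  # placeholder

def query(promql):
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--error-rate", type=float, required=True)
    parser.add_argument("--latency-p99", type=float, required=True)
    args = parser.parse_args()

    error_rate = query("rate(error_total[5m]) / rate(request_total[5m])")
    latency_p99 = query(
        "histogram_quantile(0.99, "
        "sum(rate(request_duration_ms_bucket[5m])) by (le))")

    if error_rate > args.error_rate or latency_p99 > args.latency_p99:
        print(f"SLO breach: error_rate={error_rate:.4f}, p99={latency_p99:.0f}ms")
        sys.exit(1)
    print("SLOs within thresholds")

if __name__ == "__main__":
    main()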

Game Day Exercises

Planning Template

# Game Day: Database Failure Simulation

## Objective
Test system resilience to primary database failure

## Hypothesis
System will maintain 99% availability during database failover

## Participants
- Incident Commander: John Doe
- Technical Lead: Jane Smith
- Observer: Bob Johnson

## Timeline
- 10:00 - Pre-checks and baseline metrics
- 10:15 - Inject failure
- 10:30 - Monitor recovery
- 11:00 - Post-mortem discussion

## Success Criteria
- Automatic failover completes within 2 minutes
- No data loss
- Error rate stays below 1%
- All alerts fire correctly

## Rollback Plan
1. Restore database from backup
2. Clear connection pools
3. Restart application pods

Execution Script

# gameday_executor.py
import time

class GameDayExecutor:
    """Drives a Game Day scenario end to end. The inject_failure,
    observe_and_record, measure_recovery_time, generate_report, and
    check_* helpers are integration points for your own tooling."""

    def __init__(self):
        self.start_time = None
        self.metrics = []
        
    def run_scenario(self, scenario_name):
        print(f"Starting Game Day: {scenario_name}")
        self.start_time = time.time()
        
        # Phase 1: Steady State
        print("Phase 1: Verifying steady state...")
        if not self.verify_steady_state():
            print("System not ready, aborting")
            return
        
        # Phase 2: Inject Failure
        print("Phase 2: Injecting failure...")
        self.inject_failure(scenario_name)
        
        # Phase 3: Observe
        print("Phase 3: Observing system behavior...")
        self.observe_and_record(duration=300)
        
        # Phase 4: Recovery
        print("Phase 4: Verifying recovery...")
        recovery_time = self.measure_recovery_time()
        
        # Phase 5: Analysis
        print("Phase 5: Analyzing results...")
        self.generate_report(recovery_time)
    
    def verify_steady_state(self):
        # Check system health
        health_checks = [
            self.check_service_health('api'),
            self.check_service_health('database'),
            self.check_service_health('cache'),
            self.check_error_rate() < 0.001,
            self.check_latency_p99() < 100
        ]
        return all(health_checks)
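
Of the stubbed helpers, measure_recovery_time is the one most worth pinning down early. One reasonable definition polls the steady-state check until it passes again:

    # One possible measure_recovery_time: poll until steady state returns
    def measure_recovery_time(self, poll_interval=5, timeout=600):
        start = time.time()
        while time.time() - start < timeout:
            if self.verify_steady_state():
                return time.time() - start
            time.sleep(poll_interval)
        return None  # never recovered within the timeout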

Observability for Chaos

Metrics Collection

# chaos_metrics.py
from prometheus_client import Gauge, Histogram

# prometheus_client metric objects are module-level singletons
EXPERIMENT_STATUS = Gauge(
    'chaos_experiment_status',
    'Last outcome of a chaos experiment (1 = success, 0 = failure)',
    ['experiment'])
BLAST_RADIUS = Gauge(
    'chaos_blast_radius',
    'Number of services affected by the last experiment')
RECOVERY_TIME = Histogram(
    'chaos_recovery_time_seconds',
    'Time for the system to return to steady state')

class ChaosMetrics:
    def record_experiment(self, experiment_name, outcome):
        EXPERIMENT_STATUS.labels(experiment=experiment_name).set(
            1 if outcome == 'success' else 0)

    def record_blast_radius(self, affected_services):
        # Record only the count; a label holding the joined service list
        # would create unbounded cardinality
        BLAST_RADIUS.set(len(affected_services))

    def record_recovery_time(self, seconds):
        RECOVERY_TIME.observe(seconds)
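
These metrics still need an endpoint for Prometheus to scrape; with prometheus_client that is a single call at process start:

# Expose /metrics on port 8000 for Prometheus to scrape
from prometheus_client import start_http_server

start_http_server(8000)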

Dashboards

{
  "dashboard": {
    "title": "Chaos Engineering Dashboard",
    "panels": [
      {
        "title": "Experiment Success Rate",
        "query": "avg(chaos_experiment_status) * 100"
      },
      {
        "title": "Mean Time to Recovery",
        "query": "histogram_quantile(0.5, chaos_recovery_time_seconds)"
      },
      {
        "title": "Blast Radius Trend",
        "query": "avg_over_time(chaos_blast_radius[7d])"
      },
      {
        "title": "System Resilience Score",
        "query": "(1 - (rate(error_total[5m]) / rate(request_total[5m]))) * 100"
      }
    ]
  }
}

Safety Mechanisms

Automatic Abort

# safety_mechanisms.py
import subprocess
import time

class ChaosSafety:
    """Watchdog that aborts all chaos when SLO thresholds are breached.
    The get_* and send_alert helpers are integration points for your
    monitoring and alerting stack."""

    def __init__(self, thresholds):
        self.thresholds = thresholds
        self.abort_flag = False
        
    def monitor_safety(self):
        while not self.abort_flag:
            if self.check_thresholds_exceeded():
                self.emergency_stop()
                break
            time.sleep(1)
    
    def check_thresholds_exceeded(self):
        checks = [
            self.get_error_rate() > self.thresholds['error_rate'],
            self.get_latency_p99() > self.thresholds['latency_p99'],
            self.get_availability() < self.thresholds['availability']
        ]
        return any(checks)
    
    def emergency_stop(self):
        print("EMERGENCY: Safety thresholds exceeded, aborting chaos!")
        # Roll back all chaos
        subprocess.run(['kubectl', 'delete', '-f', 'chaos/'], check=False)
        # Alert on-call
        self.send_alert("Chaos experiment aborted due to safety threshold breach")
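
The watchdog should run for the duration of the experiment, not after it. A sketch that starts it in a daemon thread before injection, assuming the get_* helpers have been wired to real monitoring (thresholds are illustrative):

# Hypothetical usage: watchdog runs alongside the experiment
import threading
import time

safety = ChaosSafety(thresholds={
    'error_rate': 0.01,    # abort above 1% errors
    'latency_p99': 500,    # abort above 500 ms p99
    'availability': 0.99,  # abort below 99% availability
})
watchdog = threading.Thread(target=safety.monitor_safety, daemon=True)
watchdog.start()
try:
    time.sleep(300)  # placeholder: the chaos injection runs here
finally:
    safety.abort_flag = True  # stop the watchdog when the experiment ends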

Chaos Engineering Maturity Model

Level 1: Ad-hoc

  • Manual failure injection
  • No formal process
  • Limited scope

Level 2: Defined

  • Documented experiments
  • Regular game days
  • Basic automation

Level 3: Managed

  • Automated chaos tests
  • Continuous validation
  • Metrics tracking

Level 4: Optimized

  • Chaos in production
  • Self-healing systems
  • Predictive failure analysis

Common Failure Scenarios

The Top 10 to Test

  1. Service dependency failure
  2. Database connection pool exhaustion
  3. Memory leaks and OOM kills
  4. Network partitions
  5. Disk full scenarios (see the sketch after this list)
  6. Certificate expiration
  7. DNS failures
  8. Rate limiting and throttling
  9. Cascading failures
  10. Time drift issues
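
Most of these can be scripted with the tools above; disk exhaustion (scenario 5) is an easy first target. A minimal sketch, assuming a throwaway path on the target filesystem and a safety margin so the host stays usable:

# disk_chaos.py (sketch)
import os
import time

def fill_disk(path="/var/tmp/chaos.fill", leave_free_mb=100, hold_seconds=60):
    """Fill the filesystem containing `path`, leaving a small safety margin"""
    stat = os.statvfs(os.path.dirname(path))
    free_bytes = stat.f_bavail * stat.f_frsize
    target = max(0, free_bytes - leave_free_mb * 1024 * 1024)
    chunk = b"\0" * (8 * 1024 * 1024)
    written = 0
    try:
        with open(path, "wb") as f:
            while written < target:
                f.write(chunk)
                written += len(chunk)
            f.flush()
            os.fsync(f.fileno())
        time.sleep(hold_seconds)  # observe how services handle ENOSPC
    finally:
        os.remove(path)  # always release the space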

Lessons Learned

After 2 years of chaos engineering:

  1. Start small: Begin with read-only services
  2. Communicate clearly: Everyone should know about experiments
  3. Automate rollbacks: Manual rollbacks are too slow
  4. Monitor everything: You can't improve what you don't measure
  5. Learn from failures: Every experiment teaches something
  6. Build confidence gradually: Start in dev, move to production
  7. Make it routine: Regular chaos prevents surprise failures

Conclusion

Chaos engineering isn't about breaking things—it's about building confidence. By proactively discovering weaknesses, we build systems that gracefully handle failure. Start small, measure everything, and gradually increase complexity. Remember: it's better to fail during a controlled experiment than during Black Friday.

The goal isn't to avoid failure entirely—that's impossible. The goal is to minimize the blast radius and recovery time when failures inevitably occur. Embrace chaos, and build antifragile systems that get stronger under stress.
