chaos-engineering
Test system resilience through controlled failures. Use when validating fault tolerance, disaster recovery, or system reliability. Covers chaos experiments.
allowed_tools: Read, Write, Bash, Glob, Grep
$ Installer
git clone https://github.com/dralgorhythm/claude-agentic-framework /tmp/claude-agentic-framework && cp -r /tmp/claude-agentic-framework/.claude/skills/operations/chaos-engineering ~/.claude/skills/claude-agentic-framework
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: chaos-engineering
description: Test system resilience through controlled failures. Use when validating fault tolerance, disaster recovery, or system reliability. Covers chaos experiments.
allowed-tools: Read, Write, Bash, Glob, Grep
Chaos Engineering
Principles
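- Build a Hypothesis: State the expected steady-state behavior before injecting any failure
- Minimize Blast Radius: Start with the smallest scope that can test the hypothesis (one host, a small share of traffic) and widen gradually
- Run in Production: Real traffic and real conditions matter; build up to production after experiments pass in non-production
- Automate: Script experiments so they are repeatable
- Minimize Impact: Define abort conditions before starting and stop as soon as one is met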
Experiment Process
- Steady State: Define normal metrics
- Hypothesis: "System will maintain X under condition Y"
- Introduce Variables: Inject failure
- Observe: Compare to steady state
- Analyze: Confirm or disprove hypothesis
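The Observe and Analyze steps can be sketched as a small shell helper that compares an observed metric against its steady-state baseline. This is only an illustration: the function names `within_tolerance` and `evaluate_hypothesis`, and the idea of expressing the hypothesis as a percentage tolerance, are assumptions and not part of the skill itself.

```shell
#!/bin/sh
# Sketch: decide whether an observed metric still matches the steady state.
# Function names and the percentage-tolerance model are illustrative assumptions.

# within_tolerance BASELINE OBSERVED TOLERANCE_PCT
# Exit 0 when OBSERVED deviates from BASELINE by at most TOLERANCE_PCT percent,
# exit 1 when it deviates more, exit 2 when BASELINE is zero.
within_tolerance() {
    awk -v b="$1" -v o="$2" -v t="$3" 'BEGIN {
        if (b == 0) exit 2              # relative deviation is undefined for a zero baseline
        d = (o - b) / b * 100           # signed deviation in percent
        if (d < 0) d = -d               # absolute value
        exit ((d <= t) ? 0 : 1)
    }'
}

# evaluate_hypothesis BASELINE OBSERVED TOLERANCE_PCT
# Prints a one-line verdict suitable for the experiment log.
evaluate_hypothesis() {
    if within_tolerance "$1" "$2" "$3"; then
        echo "hypothesis confirmed: observed=$2 baseline=$1 tolerance=$3%"
    else
        echo "hypothesis disproved: observed=$2 baseline=$1 tolerance=$3%"
    fi
}
```

For example, with a baseline p95 latency of 200 ms and a tolerance of 10%, `evaluate_hypothesis 200 210 10` reports the hypothesis as confirmed, while `evaluate_hypothesis 200 300 10` reports it as disproved.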
Common Experiments
Network Failures
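# Add 100 ms of latency to outbound traffic on eth0 (requires root)
tc qdisc add dev eth0 root netem delay 100ms
# Switch to 10% packet loss; "replace" works whether or not a root qdisc exists,
# whereas a second "add" fails while the delay rule above is still in place
tc qdisc replace dev eth0 root netem loss 10%
# Remove the netem qdisc and restore normal traffic
tc qdisc del dev eth0 root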
Resource Exhaustion
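# CPU stress: 4 workers for 60 seconds
stress --cpu 4 --timeout 60s
# Memory stress: 2 workers allocating 1G each for 60 seconds
stress --vm 2 --vm-bytes 1G --timeout 60s
# Disk fill: write a 1 GiB file of zeros to /tmp
dd if=/dev/zero of=/tmp/fill bs=1M count=1024
# Clean up the fill file once the experiment is over
rm -f /tmp/fill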
Service Failures
- Kill processes
- Restart containers
- Terminate instances
- Block dependencies
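As one possible way to make the four failure types above concrete, the sketch below maps each to a command and prints it for review instead of running it. The tool choices are assumptions, not something the skill prescribes: `pkill` for processes, Docker for containers, the AWS CLI for instances, and `iptables` for blocking a dependency. The function name `chaos_command` and the type labels are hypothetical.

```shell
#!/bin/sh
# Sketch: map each service-failure type to one possible command.
# Assumed tooling (not from the original text): pkill, Docker, AWS CLI, iptables.

# chaos_command TYPE TARGET
# Prints the command for review rather than executing it (dry run by design).
chaos_command() {
    case "$1" in
        kill-process)       echo "pkill -TERM -f $2" ;;
        restart-container)  echo "docker restart $2" ;;
        terminate-instance) echo "aws ec2 terminate-instances --instance-ids $2" ;;
        block-dependency)   echo "iptables -A OUTPUT -d $2 -j DROP" ;;
        *)
            echo "unknown failure type: $1" >&2
            return 1
            ;;
    esac
}
```

Printing instead of executing keeps the blast radius at zero until someone has reviewed the exact command and target.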
Chaos Tools
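- Chaos Monkey: Random instance termination (Netflix, open source)
- Gremlin: Commercial platform covering host, container, and network faults
- Litmus: Kubernetes chaos engineering (open source)
- Chaos Mesh: Kubernetes-native fault injection (open source)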
Experiment Template
## Experiment: [Name]
### Hypothesis
If [condition], then [expected behavior].
### Steady State
- Metric A: [baseline value]
- Metric B: [baseline value]
### Method
1. [Step 1]
2. [Step 2]
3. [Step 3]
### Abort Conditions
- If [condition], stop immediately
### Results
[What happened]
### Findings
[What we learned]
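The Abort Conditions section of the template can be backed by a simple watchdog. The following is a minimal sketch, assuming the abort condition can be expressed as a health-check command that exits non-zero when the experiment should stop; the function name `watch_abort` and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch: poll a health check during an experiment and stop at the first failure.
# The name watch_abort and its argument order are illustrative assumptions.

# watch_abort CHECK_CMD INTERVAL_SECONDS MAX_CHECKS
# Runs CHECK_CMD up to MAX_CHECKS times, sleeping INTERVAL_SECONDS between runs.
# Exit 1 as soon as CHECK_CMD fails (abort), exit 0 if every check passes.
# The body runs in a subshell so its variables do not leak into the caller.
watch_abort() (
    check=$1; interval=$2; remaining=$3
    while [ "$remaining" -gt 0 ]; do
        if ! sh -c "$check"; then
            echo "abort condition met: '$check' failed" >&2
            exit 1
        fi
        remaining=$((remaining - 1))
        sleep "$interval"
    done
    exit 0
)
```

A call such as `watch_abort "curl -fsS http://localhost:8080/health >/dev/null" 5 12` would check a (hypothetical) health endpoint every 5 seconds for one minute and return non-zero at the first failed check, at which point the experiment should be rolled back.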
Safety Rules
- Start in non-production
- Have rollback ready
- Monitor continuously
- Communicate with team
- Document everything
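"Have rollback ready" can be enforced in the script itself. The sketch below, under the assumption that both the fault and its rollback are single shell commands, registers the rollback with `trap` before injecting the fault, so it runs on normal exit, on injection failure, and on Ctrl-C. The function name `run_with_rollback` is hypothetical.

```shell
#!/bin/sh
# Sketch: inject a fault for a fixed duration with a guaranteed rollback.
# The name run_with_rollback and its argument order are illustrative assumptions.

# run_with_rollback INJECT_CMD ROLLBACK_CMD DURATION_SECONDS
# The body runs in a subshell, so the EXIT trap fires when the function ends
# and does not replace any trap set by the caller.
run_with_rollback() (
    inject=$1; rollback=$2; duration=$3
    trap 'sh -c "$rollback"' EXIT   # rollback runs however the subshell exits
    trap 'exit 130' INT TERM        # turn signals into a normal exit so EXIT fires
    sh -c "$inject" || exit 1       # a failed injection still triggers rollback
    sleep "$duration"               # hold the fault in place
)
```

With the network commands from this document, a call could look like `run_with_rollback "tc qdisc add dev eth0 root netem delay 100ms" "tc qdisc del dev eth0 root" 60`, which requires root.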
Repository
dralgorhythm/claude-agentic-framework/.claude/skills/operations/chaos-engineering
Author: dralgorhythm