incident-response
Respond to production incidents systematically with triage, investigation, resolution, and post-mortem analysis to minimize downtime and prevent recurrence. Use when handling production outages, triaging incidents, investigating critical bugs, coordinating incident response, implementing hotfixes, conducting post-mortems, or establishing incident response procedures.
$ Install
git clone https://github.com/korallis/Droidz /tmp/Droidz && cp -r /tmp/Droidz/droidz_installer/payloads/droid_cli/default/skills/incident-response ~/.claude/skills/Droidz
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: incident-response
description: Respond to production incidents systematically with triage, investigation, resolution, and post-mortem analysis to minimize downtime and prevent recurrence. Use when handling production outages, triaging incidents, investigating critical bugs, coordinating incident response, implementing hotfixes, conducting post-mortems, or establishing incident response procedures.
Incident Response - Production Issue Management
When to use this skill
- Responding to production outages
- Triaging critical incidents
- Investigating high-severity bugs
- Coordinating incident response teams
- Implementing emergency hotfixes
- Conducting post-mortem analyses
- Establishing incident response procedures
- Communicating status during incidents
- Creating runbooks for common issues
- Implementing rollback strategies
- Documenting incident timelines
- Preventing incident recurrence
Incident Response Process
1. Detect
- Monitoring alerts
- User reports
- Automated checks
2. Triage
- Assess severity (P0-P4)
- Page on-call engineer
- Create incident channel
3. Mitigate
- Rollback to last known good
- Scale resources
- Apply hotfix
- Communicate status
4. Resolve
- Verify fix
- Monitor metrics
- Update status page
- Close incident
5. Postmortem
- Timeline of events
- Root cause analysis
- Action items
- Follow-up tasks
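The five phases above can be sketched as a small state machine that also records the timeline a postmortem needs. This is a minimal illustration, not part of the skill itself: the `Phase` and `Incident` names, the `log` and `advance` methods, and the strictly linear ordering are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Phase(IntEnum):
    """The five phases of the process, in order."""
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5


@dataclass
class Incident:
    """Hypothetical incident record; not tied to any real tracker."""
    title: str
    phase: Phase = Phase.DETECT
    timeline: list = field(default_factory=list)

    def log(self, note, at=None):
        """Append a timestamped note tagged with the current phase."""
        at = at or datetime.now(timezone.utc)
        self.timeline.append((at, self.phase.name, note))

    def advance(self, note, at=None):
        """Move to the next phase and log why; refuse to go past POSTMORTEM."""
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident is already in postmortem")
        self.phase = Phase(self.phase + 1)
        self.log(note, at)


# Example walk through the first two phases.
incident = Incident("Service X degraded")
incident.log("Monitoring alert fired")
incident.advance("Severity assessed, on-call paged")
```

Logging every transition as it happens means the "Timeline of events" for the postmortem is already assembled by the time the incident closes.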
Severity Levels
- P0 (Critical): Complete outage, data loss
- P1 (High): Major feature broken, revenue impact
- P2 (Medium): Degraded performance, workaround exists
- P3 (Low): Minor bug, cosmetic issue
- P4 (Informational): Enhancement request
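One way to make these levels harder to misapply during triage is to encode them as a lookup. The sketch below assumes a triager answers a handful of yes/no questions; the `classify` function and its flag names are hypothetical and simply mirror the descriptions in the list above.

```python
from enum import Enum


class Severity(Enum):
    P0 = "Critical: complete outage, data loss"
    P1 = "High: major feature broken, revenue impact"
    P2 = "Medium: degraded performance, workaround exists"
    P3 = "Low: minor bug, cosmetic issue"
    P4 = "Informational: enhancement request"


def classify(*, complete_outage=False, data_loss=False,
             major_feature_broken=False, revenue_impact=False,
             degraded=False, minor_bug=False):
    """Return the most severe level whose condition matches.

    The flags are hypothetical triage inputs, checked from most to
    least severe so that the worst matching condition wins.
    """
    if complete_outage or data_loss:
        return Severity.P0
    if major_feature_broken or revenue_impact:
        return Severity.P1
    if degraded:
        return Severity.P2
    if minor_bug:
        return Severity.P3
    return Severity.P4


# A degraded service that also costs revenue is treated as P1, not P2.
level = classify(degraded=True, revenue_impact=True)
print(level.name, "-", level.value)
```

Checking the conditions from most to least severe is a deliberate choice: when an incident matches more than one description, it is escalated rather than downgraded.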
Example Runbook
```markdown
High CPU Usage Runbook
Symptoms
- Server CPU > 90%
- Slow response times
- Request timeouts
Investigation
- Check top processes: `top`
- Check memory: `free -h`
- Check logs: `tail -f app.log`
Mitigation
- Scale horizontally: Add servers
- Restart service: `systemctl restart app`
- Rate limit: Enable aggressive rate limiting
Resolution
- Identify root cause (N+1 query, memory leak, etc.)
- Deploy fix
- Monitor for 1 hour
```
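The "Server CPU > 90%" symptom in the runbook can also be checked programmatically. The following is a rough sketch that assumes a Linux host, where the first line of `/proc/stat` holds cumulative CPU tick counters with at least the user, nice, system, idle, and iowait columns. The function names and the `CPU_ALERT_THRESHOLD` constant are made up for this example.

```python
CPU_ALERT_THRESHOLD = 90.0  # taken from the runbook's "Server CPU > 90%" symptom


def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into (idle, total) ticks."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Columns: user nice system idle iowait irq softirq steal.
    # Guest columns are skipped; they are generally already counted in user/nice.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]  # idle + iowait
    return idle, sum(values)


def cpu_percent(before, after):
    """Busy percentage between two samples of the aggregate 'cpu' line."""
    idle0, total0 = cpu_times(before)
    idle1, total1 = cpu_times(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        return 0.0  # no ticks elapsed between samples
    return 100.0 * (1 - (idle1 - idle0) / total_delta)


def cpu_symptom(before, after):
    """True when usage between the two samples exceeds the runbook threshold."""
    return cpu_percent(before, after) > CPU_ALERT_THRESHOLD


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (Linux only)."""
    with open(path) as f:
        return f.readline()
```

Calling `read_cpu_line()` twice, a second or so apart, and passing both lines to `cpu_symptom` gives a yes/no answer for the first symptom. The counters are cumulative since boot, which is why two samples are needed rather than one.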
Communication Template
```
[INCIDENT] Service X degraded

Status: Investigating
Impact: 20% of users seeing slow load times
ETA: 30 minutes

Updates:
- 10:00 AM: Issue detected
- 10:05 AM: On-call paged, investigation started
- 10:15 AM: Root cause identified (database bottleneck)
- 10:30 AM: Fix deployed, monitoring

Next update: 11:00 AM
```
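If status updates are posted repeatedly during an incident, generating them from structured data keeps the format consistent. This is only a sketch of that idea: the `StatusUpdate` class and its field names are hypothetical, chosen to match the fields in the template above.

```python
from dataclasses import dataclass, field


@dataclass
class StatusUpdate:
    """Hypothetical container for the fields used in the template."""
    service: str
    state: str        # e.g. "degraded"
    status: str       # e.g. "Investigating"
    impact: str
    eta: str
    next_update: str
    updates: list = field(default_factory=list)  # (time, note) pairs

    def render(self):
        """Return the update as plain text in the template's layout."""
        lines = [
            f"[INCIDENT] {self.service} {self.state}",
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"ETA: {self.eta}",
            "Updates:",
        ]
        lines += [f"- {when}: {note}" for when, note in self.updates]
        lines.append(f"Next update: {self.next_update}")
        return "\n".join(lines)


update = StatusUpdate(
    service="Service X",
    state="degraded",
    status="Investigating",
    impact="20% of users seeing slow load times",
    eta="30 minutes",
    next_update="11:00 AM",
    updates=[
        ("10:00 AM", "Issue detected"),
        ("10:05 AM", "On-call paged, investigation started"),
    ],
)
print(update.render())
```

Appending to `updates` and re-rendering produces each follow-up message, so earlier entries are never accidentally dropped or reworded.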
Resources
Repository: korallis/Droidz/droidz_installer/payloads/droid_cli/default/skills/incident-response
Author: korallis