Unnamed Skill
Use when designing deployment pipelines, CI/CD, terraform, or infrastructure automation. Enforces rollback checkpoint then TodoWrite with 19+ items. Triggers: "deploy", "CI/CD", "kubernetes", "terraform". If thinking "rollback later" - use this first.
$ Installer
git clone https://github.com/pvillega/claude-templates /tmp/claude-templates && cp -r /tmp/claude-templates/.claude/skills/deployment-automation-enforcer ~/.claude/skills/claude-templates// tip: Run this command in your terminal to install the skill
name: deployment-automation-enforcer description: Use when designing deployment pipelines, CI/CD, terraform, or infrastructure automation. Enforces rollback checkpoint then TodoWrite with 19+ items. Triggers: "deploy", "CI/CD", "kubernetes", "terraform". If thinking "rollback later" - use this first.
Deployment Automation Enforcer
ROLLBACK CHECKPOINT (COMPLETE FIRST)
MANDATORY before creating TodoWrite:
- Rollback script exists? [Path: _______ | or "NEW - will create first"]
- Tested in staging? [Date: _______ | or "must test before production"]
- Duration measured? [_____ minutes]
- Triggers defined? [List: _______]
Why first: 27% skip rollback when checkpoint appears later.
TodoWrite Requirements
CREATE TodoWrite with 4 sections (19+ items total):
| Section | Min Items | Order |
|---|---|---|
| Automation | 5+ | 1st |
| Observability | 5+ | 2nd (BEFORE Failure Recovery) |
| Failure Recovery | 5+ | 3rd (requires Observability) |
| Verification | 4+ | 4th |
Section order matters: You cannot define failure recovery without observability to detect failures.
Verification Checkpoint
After creating TodoWrite, verify 3 random items:
Each item must have ALL THREE:
- ✓ Concrete numbers/thresholds ("error rate > 5%", "15 min timeout")
- ✓ Specific tools ("GitHub Actions", "CloudWatch", "PagerDuty")
- ✓ Measurable outcome ("rollback tested on [date]", "alert fires within 5min")
| ❌ FAILS | ✅ PASSES |
|---|---|
| "Add monitoring" | "CloudWatch: deployment.duration_seconds, Grafana dashboard at /dashboards/deployments, PagerDuty alert if error rate > 5% for 3min" |
| "Implement rollback" | "Rollback .github/workflows/rollback.yml reverts to previous Docker tag from S3 deployment-history/latest-stable.txt. Triggers: manual OR error rate > 5% for 3min. Target: < 5 minutes. Test staging on [date]" |
Section Requirements
Automation (5+ items)
- Identify manual steps in current deployment
- Replace with automated scripts/workflows (GitHub Actions, GitLab CI)
- Idempotency checks for safe re-runs
- Rollback automation for this change
- Document exceptions for remaining manual steps
Observability (5+ items) - BEFORE Failure Recovery
- Deployment logging (structured: deployment-id, timestamps, steps)
- Failure alerts (PagerDuty/SNS on failure, error rate spike)
- Metrics (duration, success rate in CloudWatch/Datadog)
- Health endpoint (
/healthreturns 200 + dependency status) - Log/metric locations documented
Failure Recovery (5+ items) - AFTER Observability
- Failure scenarios defined (won't start, migration fails, health check fails)
- Automated rollback triggers (error rate > X%, failed health checks Y minutes)
- Health checks post-deployment
- Rollback tested in staging (date, duration, success)
- Manual recovery documentation as last resort
Verification (4+ items)
- Pre-deployment tests automated (unit, integration, lint)
- Smoke tests post-deployment (critical flows, key endpoints)
- Monitoring/alerts verified working (trigger test alert)
- Rollback procedure accessible (script in repo, documented)
Red Flags - STOP When You Think:
| Thought | Reality | Data |
|---|---|---|
| "Manual deploy is broken, need automation fast" | Automating without rollback creates WORSE problems | 27% skip rollback |
| "We'll add monitoring/rollback after" | Can't detect/recover from failures without them | 80% never add "later" |
| "Rollback is overkill" | Manual recovery ALWAYS takes 10x longer | 30+ min manual vs 2 min automated |
| "We can manually revert" | Detect (no monitoring) + find version (no automation) + apply (error-prone) | 30+ min |
Response Templates
"We'll Add Rollback Later"
❌ BLOCKED: Cannot deploy without rollback capability.
- 27% skip rollback when not required upfront
- 80% of "add later" items never get added
- Manual recovery takes 30+ minutes vs 2 minutes automated
- Production incidents without rollback = extended downtime + customer impact
Required to override:
- Specific retrofit date (not "later")
- Budget allocated (engineer-weeks + incident risk cost)
- Risk acceptance signed by decision maker
- Interim mitigation plan (24/7 on-call? manual monitoring?)
Override Requirements
To skip ANY requirement, provide ALL 4:
- Specific retrofit date (not "later")
- Budget allocated (engineer-weeks + incident risk)
- Risk acceptance signed by decision maker
- Interim mitigation plan (24/7 on-call? manual monitoring?)
Self-Grading Before Complete
[ ] 19+ items across 4 sections
[ ] 80%+ items have concrete numbers
[ ] 80%+ items name specific tools
[ ] 100% items have measurable outcomes
[ ] 3 random items pass specificity test
[ ] Rollback checkpoint completed
[ ] Observability BEFORE Failure Recovery (correct order)
Grade 7+/8: Ready to proceed
Grade <7: Revise TodoWrite
Evidence Collection
Before marking complete:
- Automation code link (workflow file URL)
- Staging deployment log (screenshot/excerpt)
- Monitoring dashboard screenshot
- Rollback test evidence (log with timestamp, duration)
- Alert test confirmation
Repository
