ops-disaster-recovery

Structured workflow for disaster recovery planning, implementation, and testing including RTO/RPO definition, DR strategy selection, and failover procedures.

$ Install

git clone https://github.com/LerianStudio/ring /tmp/ring && cp -r /tmp/ring/ops-team/skills/ops-disaster-recovery ~/.claude/skills/ring


name: ops-disaster-recovery
description: |
  Structured workflow for disaster recovery planning, implementation, and testing including RTO/RPO definition, DR strategy selection, and failover procedures.

trigger: |

  • DR strategy development
  • DR plan review/update
  • DR testing/drills
  • Post-incident DR improvement

skip_when: |

  • Day-to-day backup operations -> standard procedures
  • Application-level redundancy -> use ring-dev-team specialists
  • Single-instance failure recovery -> standard runbooks

related:
  similar: [ops-capacity-planning]
  uses: [infrastructure-architect]

Disaster Recovery Workflow

This skill defines the structured process for disaster recovery planning and testing. Use it for comprehensive DR strategy development and validation.


DR Planning Phases

| Phase | Focus | Output |
|-------|-------|--------|
| 1. Business Impact | Define criticality and requirements | BIA document |
| 2. Strategy Selection | Choose appropriate DR strategy | DR strategy |
| 3. Architecture Design | Design DR infrastructure | DR architecture |
| 4. Runbook Development | Document failover procedures | DR runbooks |
| 5. Testing | Validate DR capabilities | Test report |
| 6. Maintenance | Keep DR current | Update schedule |

Phase 1: Business Impact Analysis

Service Classification

Classify services by business criticality:

| Tier | Definition | RTO | RPO | Example Services |
|------|------------|-----|-----|------------------|
| Tier 1 | Critical - business cannot operate | <15 min | <1 min | Payment processing |
| Tier 2 | Important - significant impact | <1 hour | <15 min | Customer portal |
| Tier 3 | Standard - moderate impact | <4 hours | <1 hour | Internal tools |
| Tier 4 | Low - minimal impact | <24 hours | <24 hours | Dev environments |
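
The tier table above can double as a validation rule for BIA entries. The following is a minimal sketch, not part of the skill itself: the function names `tier_targets` and `meets_tier` are hypothetical, all values are in minutes, and the ceilings are assumed to be inclusive so that a Tier 1 service listed with an RTO of 15 min still counts as Tier 1.

```bash
# Print the RTO and RPO ceilings (in minutes) for a tier, taken from the table above.
tier_targets() {
  case "$1" in
    1) echo "15 1" ;;
    2) echo "60 15" ;;
    3) echo "240 60" ;;
    4) echo "1440 1440" ;;
    *) echo "unknown tier: $1" >&2; return 2 ;;
  esac
}

# Succeed when a service's proposed RTO and RPO (minutes) fit within its tier.
# Assumption: ceilings are inclusive.
meets_tier() {
  local tier="$1" rto_min="$2" rpo_min="$3" targets max_rto max_rpo
  targets=$(tier_targets "$tier") || return 2
  max_rto=${targets% *}   # text before the space
  max_rpo=${targets#* }   # text after the space
  [ "$rto_min" -le "$max_rto" ] && [ "$rpo_min" -le "$max_rpo" ]
}

# Example: check two hypothetical BIA rows.
meets_tier 1 15 1 && echo "payment-api fits Tier 1"
meets_tier 2 90 15 || echo "an RTO of 90 min exceeds Tier 2"
```

A check like this could run over every row of the BIA table to catch services whose stated RTO/RPO do not match their assigned tier.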

BIA Template

## Business Impact Analysis

**Assessment Date:** YYYY-MM-DD
**Assessed By:** [name]

### Service Classification

| Service | Business Function | Revenue Impact | Tier | RTO | RPO |
|---------|------------------|----------------|------|-----|-----|
| payment-api | Process transactions | $X,XXX/hour | 1 | 15 min | 1 min |
| customer-portal | Customer access | $XXX/hour | 2 | 1 hour | 15 min |
| admin-tools | Internal operations | $0/hour | 3 | 4 hours | 1 hour |

### Data Classification

| Data Type | Classification | Backup Frequency | Retention |
|-----------|---------------|------------------|-----------|
| Transaction data | Critical | Continuous | 7 years |
| Customer data | Important | Hourly | 3 years |
| Application logs | Standard | Daily | 90 days |

### Dependencies

| Service | Dependencies | DR Impact |
|---------|--------------|-----------|
| payment-api | Database, payment-gateway | All must fail over together |
| customer-portal | Database, auth-service | Sequential failover possible |

Phase 2: Strategy Selection

DR Strategy Comparison

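| Strategy | RTO | RPO | Cost | Complexity | Best For |
|----------|-----|-----|------|------------|----------|
| Backup & Restore | Hours | Hours | $ | Low | Tier 4 services |
| Pilot Light | 30-60 min | Minutes | $$ | Medium | Tier 3 services |
| Warm Standby | 10-30 min | Seconds-Minutes | $$$ | Medium-High | Tier 2 services |
| Hot Standby | <10 min | Seconds | $$$$ | High | Tier 1 services |
| Multi-Active | Near-zero | Near-zero | $$$$$ | Very High | Ultra-critical |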
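
As a starting point for the selection, the "Best For" column can be read as a lookup from tier to strategy. The sketch below is illustrative only: `tier_for_rto` and `strategy_for_tier` are hypothetical names, the tier ceilings (15, 60, 240, 1440 minutes) come from the Phase 1 classification, and treating any requirement tighter than Tier 1 as "Ultra-critical" is an assumption.

```bash
# Map a required RTO (minutes) to the least strict tier that still guarantees it.
# Example: a 30 min requirement is not covered by Tier 2 (<1 hour), so it lands in Tier 1.
tier_for_rto() {
  local rto_min="$1"
  if   [ "$rto_min" -ge 1440 ]; then echo 4
  elif [ "$rto_min" -ge 240 ];  then echo 3
  elif [ "$rto_min" -ge 60 ];   then echo 2
  elif [ "$rto_min" -ge 15 ];   then echo 1
  else echo ultra   # assumption: tighter than Tier 1 is treated as "Ultra-critical"
  fi
}

# Look up the strategy listed under "Best For" for a tier.
strategy_for_tier() {
  case "$1" in
    4) echo "Backup & Restore" ;;
    3) echo "Pilot Light" ;;
    2) echo "Warm Standby" ;;
    1) echo "Hot Standby" ;;
    ultra) echo "Multi-Active" ;;
    *) echo "unknown tier: $1" >&2; return 2 ;;
  esac
}

# Example: a service that must be back within one hour.
strategy_for_tier "$(tier_for_rto 60)"   # prints "Warm Standby"
```

The result is only a first candidate; budget, RPO, and compliance requirements can still move the choice up or down the table.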

Strategy Selection Matrix

## DR Strategy Selection

### Requirements Summary

| Requirement | Value |
|-------------|-------|
| Target RTO | [X minutes/hours] |
| Target RPO | [X minutes/hours] |
| Budget | $[X,XXX]/month for DR |
| Compliance | [frameworks] |

### Strategy Decision

**Selected Strategy:** [Pilot Light / Warm Standby / Hot Standby]

**Rationale:**
1. RTO requirement of [X] achieved by [strategy]
2. RPO requirement of [X] achieved with [replication method]
3. Budget of $[X]/month supports [strategy] (~XX% of production cost)
4. Compliance requirement for [X] met with [features]

### Trade-offs Accepted

| Trade-off | Impact | Mitigation |
|-----------|--------|------------|
| Higher DR cost | +$X/month | Justified by RTO requirement |
| Manual failover steps | 5-10 min added | Automation planned Q2 |

Phase 3: Architecture Design

DR Architecture Components

| Component | Primary | DR | Replication |
|-----------|---------|----|-------------|
| DNS | Route53 | Route53 | Global service |
| Load Balancer | ALB (us-east-1) | ALB (us-west-2) | Configuration sync |
| Compute | EKS (us-east-1) | EKS (us-west-2) | GitOps deployment |
| Database | Aurora (us-east-1) | Aurora Global (us-west-2) | Async replication |
| Storage | S3 (us-east-1) | S3 (us-west-2) | Cross-region replication |
| Secrets | Secrets Manager | Secrets Manager | Manual sync |

Architecture Diagram Template

Primary Region (us-east-1)          DR Region (us-west-2)
┌─────────────────────────┐         ┌─────────────────────────┐
│                         │         │                         │
│  ┌─────────────────┐    │         │  ┌─────────────────┐    │
│  │     ALB         │    │         │  │     ALB         │    │
│  └────────┬────────┘    │         │  └────────┬────────┘    │
│           │             │         │           │ (standby)   │
│  ┌────────┴────────┐    │         │  ┌────────┴────────┐    │
│  │  EKS Cluster    │    │         │  │  EKS Cluster    │    │
│  │  (Active)       │    │         │  │  (Standby)      │    │
│  └────────┬────────┘    │         │  └────────┬────────┘    │
│           │             │         │           │             │
│  ┌────────┴────────┐    │  async  │  ┌────────┴────────┐    │
│  │  Aurora         │────┼────────►│  │  Aurora         │    │
│  │  (Primary)      │    │         │  │  (Replica)      │    │
│  └─────────────────┘    │         │  └─────────────────┘    │
│                         │         │                         │
└─────────────────────────┘         └─────────────────────────┘
              │                               │
              └───────────┬───────────────────┘
                          │
                   ┌──────┴──────┐
                   │   Route53   │
                   │   (Global)  │
                   └─────────────┘

Phase 4: Runbook Development

Failover Runbook Structure

## Failover Runbook: [Service Name]

**Version:** 1.0
**Last Updated:** YYYY-MM-DD
**Owner:** [team]

### Pre-Conditions

- [ ] DR region healthy (check dashboard)
- [ ] Replication lag <[X seconds/minutes]
- [ ] On-call personnel available
- [ ] Communication channels ready

### Failover Decision Criteria

| Criteria | Automatic | Manual |
|----------|-----------|--------|
| Primary region unavailable >5 min | Yes | - |
| Replication lag >15 min | - | Yes |
| Data corruption detected | - | Yes |
| Planned maintenance | - | Yes |

### Failover Steps

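1. **Verify DR Readiness** (2 min)
   ```bash
   # Check DR database status
   aws rds describe-db-clusters --region us-west-2

   # Check EKS cluster status
   kubectl --context=dr get nodes
   ```

2. **Stop Writes to Primary** (1 min)
   ```bash
   # Scale down primary services
   kubectl --context=primary scale deployment/api --replicas=0
   ```

3. **Promote DR Database** (5 min)
   ```bash
   # Promote the Aurora secondary cluster (pass its ARN as the target).
   # If the primary region is unreachable, add --allow-data-loss for an unplanned failover.
   aws rds failover-global-cluster \
     --global-cluster-identifier my-global-cluster \
     --target-db-cluster-identifier dr-cluster
   ```

4. **Activate DR Services** (2 min)
   ```bash
   # Scale up DR services
   kubectl --context=dr scale deployment/api --replicas=10
   ```

5. **Update DNS** (1-5 min propagation)
   ```bash
   # Force the primary health check to report unhealthy so Route53 fails over.
   # A disabled check alone is treated as healthy; disabling and inverting it
   # makes Route53 treat it as unhealthy.
   aws route53 update-health-check \
     --health-check-id xxx \
     --disabled \
     --inverted
   ```

6. **Verify Service** (5 min)
   ```bash
   # Health check (fail on HTTP errors instead of exiting 0)
   curl -fsS https://api.example.com/health

   # Synthetic transaction
   ./scripts/synthetic-test.sh
   ```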

### Rollback Steps

[If failover causes issues, steps to return to primary]

### Communication Template

**Internal:**

DR failover initiated for [service] at [time UTC]. Estimated completion: [X minutes]. IC: [name]

**External (if customer-facing):**

We are currently experiencing issues with [service]. Our team is working to restore service. Status page: [url]
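
The failover decision criteria in the runbook template can also be expressed as a small helper for monitoring or on-call tooling. This is a sketch under stated assumptions: `failover_decision` is a hypothetical function, its inputs (minutes of primary unavailability, replication lag in minutes, and 0/1 flags for data corruption and planned maintenance) would have to come from your own monitoring, and the order in which the criteria are evaluated when several apply is an assumption the table does not specify.

```bash
# Print "automatic", "manual", or "none" following the decision criteria table.
# Arguments: <primary_down_min> <replication_lag_min> [corruption 0|1] [planned 0|1]
failover_decision() {
  local down_min="$1" lag_min="$2" corruption="${3:-0}" planned="${4:-0}"
  if [ "$down_min" -gt 5 ]; then
    # Primary region unavailable for more than 5 minutes
    echo automatic
  elif [ "$lag_min" -gt 15 ] || [ "$corruption" -eq 1 ] || [ "$planned" -eq 1 ]; then
    # Lag above 15 minutes, data corruption, or planned maintenance need a human decision
    echo manual
  else
    echo none
  fi
}

# Example: primary has been down for 6 minutes, no replication lag.
failover_decision 6 0   # prints "automatic"
```

Keeping the thresholds in one place makes it easier to keep alerting rules and the written runbook in agreement.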


---

## Phase 5: Testing

### DR Test Types

| Test Type | Frequency | Scope | Impact |
|-----------|-----------|-------|--------|
| **Tabletop** | Quarterly | Full scenario walkthrough | None |
| **Component** | Monthly | Individual component failover | Minimal |
| **Partial** | Quarterly | Non-production failover | Low |
| **Full** | Annually | Production failover | Moderate |
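
For Partial and Full tests, the measured recovery time can be derived from the test timeline and compared with the target. The helpers below are a minimal sketch: the names `to_minutes`, `elapsed_min`, and `check_target` are hypothetical, timestamps are assumed to be same-day `HH:MM` values, and target and actual must use the same unit.

```bash
# Convert an HH:MM timestamp to minutes since midnight.
to_minutes() {
  local h="${1%:*}" m="${1#*:}"
  h="${h#0}"; m="${m#0}"   # drop one leading zero so 08 and 09 are not read as octal
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# Minutes elapsed between two same-day timestamps: <start> <end>.
elapsed_min() {
  echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

# Print PASS when the measured value is within the target, FAIL otherwise.
check_target() {
  local target="$1" actual="$2"
  if [ "$actual" -le "$target" ]; then echo PASS; else echo FAIL; fi
}

# Example: primary shutdown simulated at 10:02, service verified at 10:15, RTO target 15 min.
rto=$(elapsed_min 10:02 10:15)
echo "RTO: ${rto} min, $(check_target 15 "$rto")"   # prints "RTO: 13 min, PASS"
```

Recording the start and end events explicitly in the test report avoids ambiguity about when the recovery clock starts.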

### DR Test Template

```markdown
## DR Test Report

**Test Date:** YYYY-MM-DD
**Test Type:** [Tabletop/Component/Partial/Full]
**Scope:** [services tested]

### Test Objectives

1. Validate RTO of <[X minutes]
2. Validate RPO of <[X minutes]
3. Verify runbook accuracy
4. Identify gaps in DR readiness

### Test Results

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | 15 min | 13 min | PASS |
| RPO | 1 min | 45 sec | PASS |
| Data integrity | 100% | 100% | PASS |
| Runbook accuracy | 100% | 85% | PARTIAL |

### Timeline

| Time | Action | Status |
|------|--------|--------|
| 10:00 | Test initiated | OK |
| 10:02 | Primary shutdown simulated | OK |
| 10:08 | DR database promoted | OK |
| 10:12 | DR services activated | OK |
| 10:15 | Service verified | OK |

### Issues Found

| Issue | Severity | Action Required |
|-------|----------|-----------------|
| Step 4 command incorrect | Medium | Update runbook |
| DNS propagation slower | Low | Reduce TTL |

### Lessons Learned

1. [Lesson 1]
2. [Lesson 2]

### Action Items

| Item | Owner | Due Date |
|------|-------|----------|
| Update runbook step 4 | @ops | YYYY-MM-DD |
| Reduce DNS TTL | @platform | YYYY-MM-DD |
```

Phase 6: Maintenance

DR Maintenance Schedule

| Activity | Frequency | Owner |
|----------|-----------|-------|
| Runbook review | Quarterly | Platform team |
| DR test | Per test schedule | SRE team |
| Replication monitoring | Daily (automated) | Monitoring |
| Cost review | Monthly | FinOps |
| Architecture review | Annually | Architecture team |

Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "DR can be added later" | DR added later is rarely tested | DR is day-1 requirement |
| "Backups are good enough" | Backups != DR. RTO is hours vs minutes. | Design proper DR strategy |
| "Too expensive for DR" | DR cost << outage cost | Calculate business impact |
| "We'll figure it out during incident" | Panic != good decisions | Document runbooks NOW |
| "Tested last year, still good" | Systems change constantly | Test regularly |
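
To put numbers behind "Calculate business impact", one rough approach is to ask how many hours of outage per year would cost as much as the DR budget. The sketch below is a simplification, not a full cost model: `breakeven_outage_hours` is a hypothetical helper, it uses whole-dollar integer arithmetic, and it ignores costs other than the hourly revenue impact recorded in the BIA.

```bash
# Print the number of outage hours per year (rounded up) whose revenue impact
# equals one year of DR spend.
# Arguments: <dr_cost_per_month> <revenue_impact_per_hour>
breakeven_outage_hours() {
  local dr_per_month="$1" revenue_per_hour="$2" yearly_dr
  if [ "$revenue_per_hour" -le 0 ]; then
    echo "revenue impact per hour must be positive" >&2
    return 2
  fi
  yearly_dr=$(( dr_per_month * 12 ))
  # Integer ceiling division
  echo $(( (yearly_dr + revenue_per_hour - 1) / revenue_per_hour ))
}

# Example with made-up figures: $2,000/month for DR, $5,000/hour revenue impact.
breakeven_outage_hours 2000 5000   # prints 5
```

If the expected outage time without DR exceeds the break-even figure, the DR spend pays for itself on revenue impact alone.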

Dispatch Specialist

For DR planning tasks, dispatch:

Task tool:
  subagent_type: "infrastructure-architect"
  model: "opus"
  prompt: |
    DR PLANNING REQUEST
    Services: [services requiring DR]
    RTO Requirement: [target]
    RPO Requirement: [target]
    Current State: [existing DR if any]
    REQUEST: [design/review/test planning]