
ops-capacity-planning

Structured workflow for infrastructure capacity planning including growth forecasting, scaling strategy, and resource provisioning decisions.

$ Install

git clone https://github.com/LerianStudio/ring /tmp/ring && cp -r /tmp/ring/ops-team/skills/ops-capacity-planning ~/.claude/skills/ring

// tip: Run this command in your terminal to install the skill


name: ops-capacity-planning
description: |
  Structured workflow for infrastructure capacity planning including growth forecasting, scaling strategy, and resource provisioning decisions.

trigger: |

  • Quarterly capacity reviews
  • Pre-launch capacity assessment
  • Performance degradation investigation
  • Budget planning for infrastructure

skip_when: |

  • Application performance optimization -> use ring-dev-team specialists
  • Cost-only analysis -> use ops-cost-optimization skill
  • One-time resource adjustment -> standard change management

related:
  similar: [ops-cost-optimization]
  uses: [infrastructure-architect, cloud-cost-optimizer]

Capacity Planning Workflow

This skill defines the structured process for infrastructure capacity planning. Use it for proactive capacity management and growth forecasting.


Capacity Planning Phases

| Phase | Focus | Output |
|-------|-------|--------|
| 1. Current State | Document existing capacity | Capacity baseline |
| 2. Usage Analysis | Analyze utilization patterns | Utilization report |
| 3. Growth Forecast | Project future requirements | Growth model |
| 4. Gap Analysis | Identify capacity gaps | Gap report |
| 5. Recommendations | Scaling strategy | Capacity plan |
| 6. Implementation | Execute capacity changes | Updated infrastructure |

Phase 1: Current State Assessment

Data Collection

Gather the following for each service tier:

| Metric | Compute | Database | Storage | Network |
|--------|---------|----------|---------|---------|
| Provisioned | Instance count/size | Instance class | Total GB | Bandwidth |
| Peak utilization | CPU/Memory % | Connections/IOPS | Usage % | Throughput |
| Average utilization | CPU/Memory % | Connections/IOPS | Growth rate | Latency |
| Cost | Monthly $ | Monthly $ | Monthly $ | Monthly $ |

Data Sources

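# AWS CLI examples (set INSTANCE_ID and adjust the example dates before running)
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value="$INSTANCE_ID" --statistics Average Maximum \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-31T00:00:00Z --period 3600
aws rds describe-db-instances
aws s3api list-buckets
aws ce get-cost-and-usage --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY --metrics UnblendedCost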

Current State Template

## Current Capacity Baseline

**Assessment Date:** YYYY-MM-DD
**Scope:** [production/staging/all]

### Compute Resources

| Service | Instance Type | Count | Avg CPU | Avg Memory | Cost/Month |
|---------|--------------|-------|---------|------------|------------|
| api | m5.xlarge | 10 | 45% | 60% | $2,400 |
| worker | c5.2xlarge | 5 | 70% | 40% | $1,800 |

### Database Resources

| Database | Instance Class | Storage | Avg Connections | Avg IOPS | Cost/Month |
|----------|---------------|---------|-----------------|----------|------------|
| primary | db.r5.2xlarge | 500GB | 150 | 5000 | $1,800 |

### Storage Resources

| Bucket/Volume | Type | Size | Growth Rate | Cost/Month |
|---------------|------|------|-------------|------------|
| logs | S3 Standard | 2TB | 100GB/month | $46 |

Phase 2: Usage Analysis

Utilization Patterns

Identify patterns in resource usage:

| Pattern | Description | Scaling Strategy |
|---------|-------------|------------------|
| Steady | Consistent load | Reserved capacity |
| Cyclical | Predictable peaks | Scheduled scaling |
| Spiky | Unpredictable bursts | Auto-scaling |
| Growing | Steady increase | Proactive provisioning |

Analysis Questions

  1. What is peak vs average utilization?
  2. When do peaks occur? (time of day, day of week)
  3. What triggers traffic spikes? (campaigns, events)
  4. What is the headroom at peak? (safety margin)
  5. Are there correlated resources? (if A scales, B must scale)
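The first and fourth questions above reduce to simple arithmetic over utilization samples. The following is a minimal sketch, assuming samples are percentages (0-100) already exported from monitoring; the function name `summarize_utilization` and the sample readings are hypothetical.

```python
from statistics import mean


def summarize_utilization(samples: list[float]) -> dict[str, float]:
    """Summarize utilization samples given as percentages (0-100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    peak = max(samples)
    average = mean(samples)
    return {
        "average": average,
        "peak": peak,
        # Ratio of peak to average; a high ratio suggests spiky or cyclical load.
        "peak_to_average": peak / average if average else float("inf"),
        # Safety margin left at the busiest observed moment.
        "headroom_at_peak": 100.0 - peak,
    }


# Hypothetical CPU readings for one service (assumption, not real data)
cpu_samples = [40.0, 45.0, 50.0, 80.0, 35.0]
summary = summarize_utilization(cpu_samples)
print(summary)
```

A large peak-to-average ratio with little headroom is a signal to look at the peak timing and trigger questions before choosing a scaling strategy.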

Utilization Thresholds

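| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| CPU | <70% | 70-85% | >85% |
| Memory | <75% | 75-90% | >90% |
| Storage | <70% | 70-85% | >85% |
| DB Connections | <70% | 70-85% | >85% |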
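The thresholds above can be applied mechanically when reviewing a baseline. This is a sketch only: the band values are copied from the table, while the metric keys and the `classify` helper are hypothetical names chosen for illustration.

```python
# Warning band (lower, upper) in percent, copied from the thresholds table.
THRESHOLDS = {
    "cpu": (70.0, 85.0),
    "memory": (75.0, 90.0),
    "storage": (70.0, 85.0),
    "db_connections": (70.0, 85.0),
}


def classify(metric: str, utilization_pct: float) -> str:
    """Return 'healthy', 'warning', or 'critical' for a utilization percentage."""
    lower, upper = THRESHOLDS[metric]
    if utilization_pct < lower:
        return "healthy"
    if utilization_pct <= upper:
        return "warning"
    return "critical"


# Hypothetical readings (assumption, not real data)
for metric, value in [("cpu", 45.0), ("memory", 82.0), ("storage", 91.0)]:
    print(metric, value, classify(metric, value))
```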

Phase 3: Growth Forecasting

Forecasting Methods

| Method | Best For | Accuracy |
|--------|----------|----------|
| Linear extrapolation | Steady growth | Moderate |
| Seasonal decomposition | Cyclical patterns | High |
| Business-driven | New product launches | Varies |
| Historical comparison | Similar past events | Moderate |
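As an illustration of the first method, linear extrapolation can be sketched as an ordinary least-squares fit over evenly spaced observations. The function name `linear_forecast` and the history values are hypothetical; real forecasts would usually use more data points and should state a confidence level.

```python
def linear_forecast(history: list[float], periods_ahead: int) -> float:
    """Fit y = a + b*t by least squares over t = 0..n-1, then extrapolate."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    intercept = y_mean - slope * t_mean
    # The last observation sits at t = n - 1; project periods_ahead beyond it.
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical monthly peak requests/sec for the last four months (assumption)
history = [850.0, 900.0, 950.0, 1000.0]
forecast = linear_forecast(history, periods_ahead=3)
print(f"Projected peak in 3 months: {forecast:.0f} req/s")
```

This only suits the steady-growth case; cyclical or launch-driven load needs one of the other methods in the table.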

Growth Forecast Template

## Growth Forecast

**Forecast Period:** [Q1 2024 / 6 months / etc.]
**Methodology:** [method used]
**Confidence:** [High/Medium/Low]

### Traffic Projections

| Metric | Current | +3 Months | +6 Months | +12 Months |
|--------|---------|-----------|-----------|------------|
| Requests/sec | 1,000 | 1,200 | 1,500 | 2,000 |
| DAU | 50,000 | 60,000 | 75,000 | 100,000 |
| Data volume | 500GB | 600GB | 750GB | 1TB |

### Key Assumptions

1. [Assumption 1 - e.g., no major product launches]
2. [Assumption 2 - e.g., roughly 20% quarterly growth continues]
3. [Assumption 3 - e.g., no seasonal events]

### Risk Factors

| Factor | Impact | Likelihood | Mitigation |
|--------|--------|------------|------------|
| Viral growth | +200% traffic | Low | Auto-scaling limits |
| Marketing campaign | +50% traffic | Medium | Pre-scale before launch |

Phase 4: Gap Analysis

Capacity Gap Identification

Compare current capacity against forecast requirements:

## Gap Analysis

### Compute Gaps

| Service | Current Capacity | Needed (+6mo) | Gap | Severity |
|---------|------------------|---------------|-----|----------|
| api | 10 x m5.xlarge | 15 x m5.xlarge | +5 | Medium |
| worker | 5 x c5.2xlarge | 8 x c5.2xlarge | +3 | High |

### Database Gaps

| Database | Current | Needed | Gap | Notes |
|----------|---------|--------|-----|-------|
| primary | db.r5.2xlarge | db.r5.4xlarge | Upgrade | Vertical scale |
| replica | 1 replica | 2 replicas | +1 | Read scaling |

### Storage Gaps

| Storage | Current | Needed (+6mo) | Gap |
|---------|---------|---------------|-----|
| logs | 2TB | 2.6TB | +0.6TB |
| backups | 1TB | 1.5TB | +0.5TB |
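One way to derive the "Needed" column for stateless compute is to scale current peak load by the forecast growth factor and divide by a target utilization. This is a sketch under stated assumptions: load spreads evenly across instances and scales linearly with traffic, the 70% target is the healthy CPU threshold from Phase 2, and the 70% current peak is a hypothetical input.

```python
import math


def instances_needed(current_count: int, peak_util_pct: float,
                     growth_factor: float, target_util_pct: float = 70.0) -> int:
    """Instances required to hold projected peak utilization at or below target."""
    if target_util_pct <= 0:
        raise ValueError("target_util_pct must be positive")
    # Total load expressed in "instance-percent" units after growth.
    projected_load = current_count * peak_util_pct * growth_factor
    return math.ceil(projected_load / target_util_pct)


def capacity_gap(current_count: int, peak_util_pct: float,
                 growth_factor: float, target_util_pct: float = 70.0) -> int:
    """Additional instances to add (0 if current capacity already suffices)."""
    needed = instances_needed(current_count, peak_util_pct,
                              growth_factor, target_util_pct)
    return max(0, needed - current_count)


# Hypothetical: 10 api instances peaking at 70% CPU, traffic growing 1.5x in 6 months
api_gap = capacity_gap(current_count=10, peak_util_pct=70.0, growth_factor=1.5)
print(f"api gap: +{api_gap} instances")
```

Databases and other stateful tiers rarely scale this linearly, so treat the result as a starting point for the compute rows only.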

Gap Severity Matrix

| Severity | Criteria | Action Timeline |
|----------|----------|-----------------|
| Critical | <2 weeks to capacity | Immediate |
| High | 2-4 weeks to capacity | This sprint |
| Medium | 1-3 months to capacity | This quarter |
| Low | >3 months to capacity | Next quarter |
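Time to capacity can be estimated from remaining headroom and growth rate, then mapped onto the severity matrix. The sketch below assumes linear growth, treats 3 months as 13 weeks, and classifies the unlisted range between 4 weeks and 1 month as Medium; the function names and the example volume are hypothetical.

```python
def weeks_to_capacity(used: float, capacity: float, growth_per_month: float) -> float:
    """Weeks until `used` reaches `capacity` at a constant monthly growth."""
    if growth_per_month <= 0:
        return float("inf")  # flat or shrinking usage never hits the limit
    remaining = max(0.0, capacity - used)
    months = remaining / growth_per_month
    return months * 52 / 12


def severity(weeks: float) -> str:
    """Map weeks-to-capacity onto the severity matrix."""
    if weeks < 2:
        return "Critical"
    if weeks <= 4:
        return "High"
    if weeks <= 13:  # assumption: 3 months is about 13 weeks
        return "Medium"
    return "Low"


# Hypothetical volume: 2,000 GB used of 2,200 GB provisioned, growing 100 GB/month
weeks = weeks_to_capacity(used=2000, capacity=2200, growth_per_month=100)
print(f"{weeks:.1f} weeks to capacity -> {severity(weeks)}")
```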

Phase 5: Recommendations

Scaling Strategy Options

| Strategy | Best For | Lead Time | Cost Impact |
|----------|----------|-----------|-------------|
| Vertical | DB, stateful | Hours-days | Immediate increase |
| Horizontal | Stateless compute | Minutes | Linear increase |
| Reserved | Predictable load | Immediate | 30-70% savings |
| Spot | Batch workloads | Variable | 60-90% savings |
| Auto-scaling | Variable load | Real-time | Pay for use |

Recommendation Template

## Capacity Recommendations

### Immediate Actions (This Sprint)

| Resource | Action | Effort | Cost Impact |
|----------|--------|--------|-------------|
| api ASG | Increase max from 10 to 15 | Low | +$600/mo max |
| worker ASG | Add 3 instances | Low | +$1,080/mo |

### Short-term Actions (This Quarter)

| Resource | Action | Effort | Cost Impact |
|----------|--------|--------|-------------|
| primary DB | Upgrade to db.r5.4xlarge | Medium | +$900/mo |
| read replica | Provision in us-east-1b | Medium | +$900/mo |

### Long-term Considerations (Next Quarter)

| Consideration | Rationale | Next Step |
|---------------|-----------|-----------|
| Sharding strategy | Single DB approaching limits | Architecture review |
| Multi-region | DR + latency benefits | Infrastructure-architect review |

### Cost Summary

| Timeframe | Current | Recommended | Delta |
|-----------|---------|-------------|-------|
| Monthly | $8,000 | $11,480 | +$3,480 |
| Annual | $96,000 | $137,760 | +$41,760 |
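The cost summary should follow directly from the line items in the action tables. A minimal sketch of that roll-up, summing the four example monthly deltas from the template (the api figure is an upper bound, since it only applies if the ASG scales to its new maximum); the `cost_summary` helper is a hypothetical name.

```python
def cost_summary(current_monthly: float, monthly_deltas: list[float]) -> dict[str, float]:
    """Roll individual monthly cost deltas up into monthly and annual totals."""
    delta = sum(monthly_deltas)
    recommended = current_monthly + delta
    return {
        "monthly_current": current_monthly,
        "monthly_recommended": recommended,
        "monthly_delta": delta,
        "annual_current": current_monthly * 12,
        "annual_recommended": recommended * 12,
        "annual_delta": delta * 12,
    }


# Example deltas from the template: api ASG (max), worker ASG, DB upgrade, read replica
summary = cost_summary(8000, [600, 1080, 900, 900])
for key, value in summary.items():
    print(f"{key}: ${value:,.0f}")
```

Recomputing the totals this way whenever an action row changes keeps the summary from drifting out of sync with the recommendations.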

Phase 6: Implementation

Implementation Checklist

  • Recommendations approved by stakeholders
  • Change requests created
  • Implementation scheduled (avoid peak hours)
  • Rollback plan documented
  • Monitoring dashboards ready
  • Alert thresholds updated

Post-Implementation Verification

  • New capacity provisioned successfully
  • Performance metrics improved/stable
  • No unexpected errors
  • Cost tracking updated
  • Documentation updated

Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "We'll scale when we need to" | Reactive scaling causes outages | Proactive capacity planning |
| "Auto-scaling handles everything" | Auto-scaling has limits and lag | Set appropriate limits |
| "Current capacity is fine" | Fine today ≠ fine tomorrow | Forecast growth |
| "Too expensive to over-provision" | Outage cost > over-provisioning cost | Maintain safety margin |

Dispatch Specialists

For capacity planning tasks, dispatch:

Task tool:
  subagent_type: "infrastructure-architect"
  model: "opus"
  prompt: |
    CAPACITY PLANNING: [scope]
    CURRENT STATE: [baseline]
    GROWTH FORECAST: [projection]
    REQUEST: [specific analysis needed]

For cost analysis of capacity options:

Task tool:
  subagent_type: "cloud-cost-optimizer"
  model: "opus"
  prompt: |
    CAPACITY OPTIONS: [options to evaluate]
    CONSTRAINTS: [budget, performance requirements]
    REQUEST: Cost-benefit analysis