ops-capacity-planning
Structured workflow for infrastructure capacity planning including growth forecasting, scaling strategy, and resource provisioning decisions.
$ インストール
git clone https://github.com/LerianStudio/ring /tmp/ring && cp -r /tmp/ring/ops-team/skills/ops-capacity-planning ~/.claude/skills/ring// tip: Run this command in your terminal to install the skill
name: ops-capacity-planning description: | Structured workflow for infrastructure capacity planning including growth forecasting, scaling strategy, and resource provisioning decisions.
trigger: |
- Quarterly capacity reviews
- Pre-launch capacity assessment
- Performance degradation investigation
- Budget planning for infrastructure
skip_when: |
- Application performance optimization -> use ring-dev-team specialists
- Cost-only analysis -> use ops-cost-optimization skill
- One-time resource adjustment -> standard change management
related: similar: [ops-cost-optimization] uses: [infrastructure-architect, cloud-cost-optimizer]
Capacity Planning Workflow
This skill defines the structured process for infrastructure capacity planning. Use it for proactive capacity management and growth forecasting.
Capacity Planning Phases
| Phase | Focus | Output |
|---|---|---|
| 1. Current State | Document existing capacity | Capacity baseline |
| 2. Usage Analysis | Analyze utilization patterns | Utilization report |
| 3. Growth Forecast | Project future requirements | Growth model |
| 4. Gap Analysis | Identify capacity gaps | Gap report |
| 5. Recommendations | Scaling strategy | Capacity plan |
| 6. Implementation | Execute capacity changes | Updated infrastructure |
Phase 1: Current State Assessment
Data Collection
Gather the following for each service tier:
| Metric | Compute | Database | Storage | Network |
|---|---|---|---|---|
| Provisioned | Instance count/size | Instance class | Total GB | Bandwidth |
| Peak utilization | CPU/Memory % | Connections/IOPS | Usage % | Throughput |
| Average utilization | CPU/Memory % | Connections/IOPS | Growth rate | Latency |
| Cost | Monthly $ | Monthly $ | Monthly $ | Monthly $ |
Data Sources
# AWS CLI examples
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization
aws rds describe-db-instances
aws s3api list-buckets
aws ce get-cost-and-usage
Current State Template
## Current Capacity Baseline
**Assessment Date:** YYYY-MM-DD
**Scope:** [production/staging/all]
### Compute Resources
| Service | Instance Type | Count | Avg CPU | Avg Memory | Cost/Month |
|---------|--------------|-------|---------|------------|------------|
| api | m5.xlarge | 10 | 45% | 60% | $2,400 |
| worker | c5.2xlarge | 5 | 70% | 40% | $1,800 |
### Database Resources
| Database | Instance Class | Storage | Avg Connections | Avg IOPS | Cost/Month |
|----------|---------------|---------|-----------------|----------|------------|
| primary | db.r5.2xlarge | 500GB | 150 | 5000 | $1,800 |
### Storage Resources
| Bucket/Volume | Type | Size | Growth Rate | Cost/Month |
|---------------|------|------|-------------|------------|
| logs | S3 Standard | 2TB | 100GB/month | $46 |
Phase 2: Usage Analysis
Utilization Patterns
Identify patterns in resource usage:
| Pattern | Description | Scaling Strategy |
|---|---|---|
| Steady | Consistent load | Reserved capacity |
| Cyclical | Predictable peaks | Scheduled scaling |
| Spiky | Unpredictable bursts | Auto-scaling |
| Growing | Steady increase | Proactive provisioning |
Analysis Questions
- What is peak vs average utilization?
- When do peaks occur? (time of day, day of week)
- What triggers traffic spikes? (campaigns, events)
- What is the headroom at peak? (safety margin)
- Are there correlated resources? (if A scales, B must scale)
Utilization Thresholds
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| CPU | <70% | 70-85% | >85% |
| Memory | <75% | 75-90% | >90% |
| Storage | <70% | 70-85% | >85% |
| DB Connections | <70% | 70-85% | >85% |
Phase 3: Growth Forecasting
Forecasting Methods
| Method | Best For | Accuracy |
|---|---|---|
| Linear extrapolation | Steady growth | Moderate |
| Seasonal decomposition | Cyclical patterns | High |
| Business-driven | New product launches | Varies |
| Historical comparison | Similar past events | Moderate |
Growth Forecast Template
## Growth Forecast
**Forecast Period:** [Q1 2024 / 6 months / etc.]
**Methodology:** [method used]
**Confidence:** [High/Medium/Low]
### Traffic Projections
| Metric | Current | +3 Months | +6 Months | +12 Months |
|--------|---------|-----------|-----------|------------|
| Requests/sec | 1,000 | 1,200 | 1,500 | 2,000 |
| DAU | 50,000 | 60,000 | 75,000 | 100,000 |
| Data volume | 500GB | 600GB | 750GB | 1TB |
### Key Assumptions
1. [Assumption 1 - e.g., no major product launches]
2. [Assumption 2 - e.g., 20% YoY growth continues]
3. [Assumption 3 - e.g., no seasonal events]
### Risk Factors
| Factor | Impact | Likelihood | Mitigation |
|--------|--------|------------|------------|
| Viral growth | +200% traffic | Low | Auto-scaling limits |
| Marketing campaign | +50% traffic | Medium | Pre-scale before launch |
Phase 4: Gap Analysis
Capacity Gap Identification
Compare current capacity against forecast requirements:
## Gap Analysis
### Compute Gaps
| Service | Current Capacity | Needed (+6mo) | Gap | Severity |
|---------|------------------|---------------|-----|----------|
| api | 10 x m5.xlarge | 15 x m5.xlarge | +5 | Medium |
| worker | 5 x c5.2xlarge | 8 x c5.2xlarge | +3 | High |
### Database Gaps
| Database | Current | Needed | Gap | Notes |
|----------|---------|--------|-----|-------|
| primary | db.r5.2xlarge | db.r5.4xlarge | Upgrade | Vertical scale |
| replica | 1 replica | 2 replicas | +1 | Read scaling |
### Storage Gaps
| Storage | Current | Needed (+6mo) | Gap |
|---------|---------|---------------|-----|
| logs | 2TB | 3.6TB | +1.6TB |
| backups | 1TB | 1.5TB | +0.5TB |
Gap Severity Matrix
| Severity | Criteria | Action Timeline |
|---|---|---|
| Critical | <2 weeks to capacity | Immediate |
| High | 2-4 weeks to capacity | This sprint |
| Medium | 1-3 months to capacity | This quarter |
| Low | >3 months to capacity | Next quarter |
Phase 5: Recommendations
Scaling Strategy Options
| Strategy | Best For | Lead Time | Cost Impact |
|---|---|---|---|
| Vertical | DB, stateful | Hours-days | Immediate increase |
| Horizontal | Stateless compute | Minutes | Linear increase |
| Reserved | Predictable load | Immediate | 30-70% savings |
| Spot | Batch workloads | Variable | 60-90% savings |
| Auto-scaling | Variable load | Real-time | Pay for use |
Recommendation Template
## Capacity Recommendations
### Immediate Actions (This Sprint)
| Resource | Action | Effort | Cost Impact |
|----------|--------|--------|-------------|
| api ASG | Increase max from 10 to 15 | Low | +$600/mo max |
| worker ASG | Add 3 instances | Low | +$1,080/mo |
### Short-term Actions (This Quarter)
| Resource | Action | Effort | Cost Impact |
|----------|--------|--------|-------------|
| primary DB | Upgrade to r5.4xlarge | Medium | +$900/mo |
| Add read replica | Provision in us-east-1b | Medium | +$900/mo |
### Long-term Considerations (Next Quarter)
| Consideration | Rationale | Next Step |
|---------------|-----------|-----------|
| Sharding strategy | Single DB approaching limits | Architecture review |
| Multi-region | DR + latency benefits | Infrastructure-architect review |
### Cost Summary
| Timeframe | Current | Recommended | Delta |
|-----------|---------|-------------|-------|
| Monthly | $8,000 | $10,980 | +$2,980 |
| Annual | $96,000 | $131,760 | +$35,760 |
Phase 6: Implementation
Implementation Checklist
- Recommendations approved by stakeholders
- Change requests created
- Implementation scheduled (avoid peak hours)
- Rollback plan documented
- Monitoring dashboards ready
- Alert thresholds updated
Post-Implementation Verification
- New capacity provisioned successfully
- Performance metrics improved/stable
- No unexpected errors
- Cost tracking updated
- Documentation updated
Anti-Rationalization Table
| Rationalization | Why It's WRONG | Required Action |
|---|---|---|
| "We'll scale when we need to" | Reactive scaling causes outages | Proactive capacity planning |
| "Auto-scaling handles everything" | Auto-scaling has limits and lag | Set appropriate limits |
| "Current capacity is fine" | Fine today ≠ fine tomorrow | Forecast growth |
| "Too expensive to over-provision" | Outage cost > over-provisioning cost | Maintain safety margin |
Dispatch Specialists
For capacity planning tasks, dispatch:
Task tool:
subagent_type: "infrastructure-architect"
model: "opus"
prompt: |
CAPACITY PLANNING: [scope]
CURRENT STATE: [baseline]
GROWTH FORECAST: [projection]
REQUEST: [specific analysis needed]
For cost analysis of capacity options:
Task tool:
subagent_type: "cloud-cost-optimizer"
model: "opus"
prompt: |
CAPACITY OPTIONS: [options to evaluate]
CONSTRAINTS: [budget, performance requirements]
REQUEST: Cost-benefit analysis
Repository
