slo-sli-design
Design Service Level Objectives, Indicators, and error budgets
allowed_tools: Read, Glob, Grep, Write, Edit
$ Install
git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/observability-planning/skills/slo-sli-design ~/.claude/skills/claude-code-plugins
// tip: Run this command in your terminal to install the skill
SKILL.md
name: slo-sli-design
description: Design Service Level Objectives, Indicators, and error budgets
allowed-tools: Read, Glob, Grep, Write, Edit
SLO/SLI Design Skill
When to Use This Skill
Use this skill when:
- SLO/SLI design tasks: you are designing service level objectives, indicators, or error budgets
- Planning or design: you need guidance on SLO/SLI design approaches
- Best practices: you want to follow established patterns and standards
Overview
Design Service Level Objectives, Indicators, and error budget policies.
MANDATORY: Documentation-First Approach
Before designing SLOs:
- Invoke the `docs-management` skill for SLO/SLI patterns
- Verify SRE practices via MCP servers (perplexity)
- Base guidance on Google SRE and industry best practices
SLO/SLI/SLA Hierarchy
SLO/SLI/SLA RELATIONSHIP:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  SLA (Service Level Agreement)                                  │
│  ├── External promise to customers                              │
│  ├── Legal/contractual implications                             │
│  └── Example: "99.9% monthly uptime"                            │
│                                                                 │
│          ▲                                                      │
│          │ Buffer (SLO should be tighter)                       │
│          │                                                      │
│  SLO (Service Level Objective)                                  │
│  ├── Internal reliability target                                │
│  ├── Tighter than SLA (headroom)                                │
│  └── Example: "99.95% monthly availability"                     │
│                                                                 │
│          ▲                                                      │
│          │ Measured by                                          │
│          │                                                      │
│  SLI (Service Level Indicator)                                  │
│  ├── Actual measurement                                         │
│  ├── Quantitative metric                                        │
│  └── Example: "successful_requests / total_requests"            │
│                                                                 │
│          ▲                                                      │
│          │ Derived from                                         │
│          │                                                      │
│  Error Budget                                                   │
│  ├── Allowable unreliability: 100% - SLO                        │
│  ├── Example: 0.05% = 21.6 minutes per 30 days                  │
│  └── Spent on: releases, incidents, maintenance                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Common SLI Types
SLI CATEGORIES:
AVAILABILITY SLI:
"The proportion of requests that are served successfully"
Formula: successful_requests / total_requests × 100%
Good Events: HTTP 2xx, 3xx, 4xx (client errors)
Bad Events: HTTP 5xx, timeouts, connection failures
Example Prometheus query:
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
─────────────────────────────────────────────────────────────
LATENCY SLI:
"The proportion of requests that are served within threshold"
Formula: requests_below_threshold / total_requests × 100%
Thresholds (example):
- P50: 100ms (median experience)
- P95: 500ms (95th percentile)
- P99: 1000ms (tail latency)
Example Prometheus query:
sum(rate(http_request_duration_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_count[5m]))
─────────────────────────────────────────────────────────────
QUALITY/CORRECTNESS SLI:
"The proportion of requests that return correct results"
Formula: correct_responses / total_responses × 100%
Good Events: Valid data, expected format
Bad Events: Data corruption, stale data, wrong results
─────────────────────────────────────────────────────────────
FRESHNESS SLI:
"The proportion of data that is updated within threshold"
Formula: fresh_records / total_records × 100%
Example: "95% of records updated within 5 minutes"
─────────────────────────────────────────────────────────────
THROUGHPUT SLI:
"The proportion of time the system handles the expected load"
Formula: time_at_capacity / total_time × 100%
Example: "System handles 1000 req/s 99% of the time"
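The SLIs above share one pattern: good events divided by total events. As a rough illustration, the following Python sketch computes an availability SLI and a latency SLI from a list of request records. The `Request` type, its field names, and the sample data are hypothetical and exist only for this example; a missing status code is assumed to stand for a timeout or connection failure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; status=None models a timeout or connection failure."""
    status: Optional[int]
    duration_ms: float


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Fraction of requests that are good events (any response that is not 5xx)."""
    if not requests:
        return None  # no traffic: the SLI is undefined rather than 0 or 1
    good = sum(1 for r in requests if r.status is not None and r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> Optional[float]:
    """Fraction of requests served within threshold_ms."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.duration_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(404, 80.0),     # client error: still a good event for availability
    Request(503, 40.0),     # server error: bad event
    Request(None, 5000.0),  # timeout: bad event
]
print(availability_sli(sample))    # 0.5
print(latency_sli(sample, 500.0))  # 0.75
```

Returning `None` for an empty window is a design choice in this sketch: with no traffic there is nothing to judge, so the window is neither good nor bad.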
Error Budget Calculation
ERROR BUDGET MATH:
Monthly Error Budget (average month = 365.25 / 12 ≈ 30.44 days):
SLO Target │ Error Budget │ Allowed Downtime
───────────┼──────────────┼─────────────────
99%        │ 1%           │ 7h 18m
99.5%      │ 0.5%         │ 3h 39m
99.9%      │ 0.1%         │ 43m 50s
99.95%     │ 0.05%        │ 21m 55s
99.99%     │ 0.01%        │ 4m 23s
99.999%    │ 0.001%       │ 26s
Error Budget Consumption:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  Monthly Budget: 21m 55s (99.95% SLO)                           │
│                                                                 │
│  ███████████░░░░░░░░░░░░░░░░░░░  Used: 8m (36%)                 │
│                                                                 │
│  Incidents:                                                     │
│  - Jan 5: Database failover - 5m                                │
│  - Jan 12: Deployment rollback - 3m                             │
│                                                                 │
│  Remaining: 13m 55s (64%)                                       │
│                                                                 │
│  Status: ✓ HEALTHY                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
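The budget figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the table uses an average month of 365.25 / 12 days; the function names are illustrative and not part of the skill.

```python
from datetime import timedelta

# Average month length (365.25 days / 12), which reproduces the table above.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def error_budget(slo_percent: float, window: timedelta = AVERAGE_MONTH) -> timedelta:
    """Allowed downtime in the window: (100% - SLO) of its duration."""
    return window * ((100.0 - slo_percent) / 100.0)


def budget_remaining(budget: timedelta, used: timedelta) -> float:
    """Fraction of the budget still unspent (negative once overspent)."""
    return 1.0 - used / budget


budget = error_budget(99.95)
used = timedelta(minutes=5) + timedelta(minutes=3)  # the two incidents above
print(budget)                          # about 21m 55s
print(budget_remaining(budget, used))  # fraction of budget left
```

With these inputs roughly 63.5% of the budget remains, which the consumption example rounds to 64%. Passing `timedelta(days=30)` as the window gives the slightly smaller budget of a strict 30-day rolling window.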
SLO Design Process
SLO DESIGN WORKFLOW:
Step 1: IDENTIFY USER JOURNEYS
┌─────────────────────────────────────────────────────────────────┐
│  What do users care about?                                      │
│                                                                 │
│  Critical User Journeys (CUJs):                                 │
│  - Login and authentication                                     │
│  - Search and browse products                                   │
│  - Add to cart and checkout                                     │
│  - View order status                                            │
│                                                                 │
│  For each journey:                                              │
│  - What constitutes success?                                    │
│  - What latency is acceptable?                                  │
│  - What's the business impact of failure?                       │
└─────────────────────────────────────────────────────────────────┘
Step 2: DEFINE SLIs
┌─────────────────────────────────────────────────────────────────┐
│  What can we measure that represents user happiness?            │
│                                                                 │
│  For "Checkout" journey:                                        │
│  - Availability: checkout completes without error               │
│  - Latency: checkout completes within 3 seconds                 │
│  - Correctness: order total matches cart                        │
│                                                                 │
│  SLI Specification:                                             │
│  - What events are we measuring?                                │
│  - What's a "good" event vs "bad" event?                        │
│  - Where do we measure? (server, client, synthetic)             │
└─────────────────────────────────────────────────────────────────┘
Step 3: SET SLO TARGETS
┌─────────────────────────────────────────────────────────────────┐
│  What reliability level should we target?                       │
│                                                                 │
│  Consider:                                                      │
│  - Current baseline (what are we achieving now?)                │
│  - User expectations (what do users tolerate?)                  │
│  - Business requirements (any SLAs?)                            │
│  - Cost vs reliability trade-off                                │
│                                                                 │
│  Start achievable, improve iteratively                          │
│  SLO = Current baseline - small margin                          │
└─────────────────────────────────────────────────────────────────┘
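The rule "SLO = Current baseline - small margin" can be sketched as follows. This is only one possible reading: the baseline is taken as the mean of observed SLI values, and the helper name, the margin, and the sample measurements are all hypothetical. The SLA comparison reflects the earlier point that an SLO should be tighter than any SLA.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def propose_slo_target(
    observed_sli: Sequence[float],
    margin: float,
    sla: Optional[float] = None,
) -> Tuple[float, bool]:
    """Return (target, has_sla_headroom) using target = baseline - margin."""
    if not observed_sli:
        raise ValueError("need at least one baseline measurement")
    baseline = mean(observed_sli)  # assumption: baseline is the plain average
    target = baseline - margin
    # An SLO only leaves headroom if it is strictly tighter than the SLA.
    has_sla_headroom = sla is None or target > sla
    return target, has_sla_headroom


weekly_availability = [0.9996, 0.9998, 0.9997]  # hypothetical baseline data
target, ok = propose_slo_target(weekly_availability, margin=0.0002, sla=0.999)
print(f"{target:.4%}", ok)
```

A more conservative variant could use the worst observed period instead of the mean; which one fits depends on how noisy the baseline is.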
Step 4: DEFINE ERROR BUDGET POLICY
┌─────────────────────────────────────────────────────────────────┐
│  What happens when budget is exhausted?                         │
│                                                                 │
│  Error Budget Policy:                                           │
│  - Budget > 50%: Normal operations                              │
│  - Budget 25-50%: Slow down risky changes                       │
│  - Budget < 25%: Focus on reliability                           │
│  - Budget = 0%: Feature freeze, reliability only                │
│                                                                 │
│  Escalation:                                                    │
│  - Who gets notified at each threshold?                         │
│  - What actions are required?                                   │
└─────────────────────────────────────────────────────────────────┘
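The policy tiers in Step 4 map directly onto a small lookup. The sketch below assumes the budget is passed as the fraction remaining (0.64 for 64%) and that an overspent budget is treated the same as an exhausted one; the function name is illustrative.

```python
def policy_action(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to the policy tier from Step 4."""
    if budget_remaining <= 0.0:
        return "Feature freeze, reliability only"
    if budget_remaining < 0.25:
        return "Focus on reliability"
    if budget_remaining <= 0.50:
        return "Slow down risky changes"
    return "Normal operations"


for remaining in (0.64, 0.40, 0.10, 0.0):
    print(f"{remaining:.0%}: {policy_action(remaining)}")
```

Checking the tiers from most to least severe keeps each boundary in exactly one branch, so a value such as exactly 25% lands in "Slow down risky changes" rather than in two tiers at once.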
SLO Document Template
# SLO: {Service Name} - {Journey/Feature}
## Service Overview
| Attribute | Value |
|-----------|-------|
| Service | [Service name] |
| Owner | [Team name] |
| Criticality | [Critical/High/Medium/Low] |
| User Journey | [Journey name] |
## SLI Specification
### Availability SLI
**Definition:** The proportion of [event type] that [success criteria].
**Good Event:** [What counts as success]
**Bad Event:** [What counts as failure]
**Measurement:**
- Source: [Prometheus/Azure Monitor/etc.]
- Query:
```promql
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```
### Latency SLI
**Definition:** The proportion of requests served within [threshold].
**Thresholds:**
| Percentile | Threshold |
|---|---|
| P50 | [X]ms |
| P95 | [X]ms |
| P99 | [X]ms |
**Measurement:**
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_bucket[5m])))
## SLO Targets
| SLI | Target | Window |
|---|---|---|
| Availability | [99.9%] | 30 days rolling |
| Latency | [99%] of requests below 500ms | 30 days rolling |
## Error Budget
| SLO | Error Budget | Allowed Downtime (30d) |
|---|---|---|
| 99.9% availability | 0.1% | 43m 12s |
| 99% latency | 1% | 7h 12m |
## Error Budget Policy
### Budget Thresholds
| Budget Remaining | Status | Actions |
|---|---|---|
| > 50% | 🟢 Healthy | Normal operations |
| 25-50% | 🟡 Caution | Review recent changes |
| 10-25% | 🟠 Warning | Slow deployments, reliability focus |
| < 10% | 🔴 Critical | Feature freeze |
| Exhausted | ⛔ Frozen | Reliability-only work |
### Escalation
| Threshold | Notify | Action Required |
|---|---|---|
| < 50% | Team lead | Awareness |
| < 25% | Engineering manager | Review deployment pace |
| < 10% | Director | Feature freeze decision |
| Exhausted | VP Engineering | Incident response mode |
## Alerting
### SLO Burn Rate Alerts
| Severity | Burn Rate | Time Window | Example |
|---|---|---|---|
| Critical | 14.4x | 1h | Budget exhausted in ~2 days |
| Warning | 6x | 6h | Budget exhausted in ~5 days |
| Info | 1x | 3d | Budget on track to exhaust |
### Alert Configuration
- alert: SLOHighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
    description: "Error budget burning at 14.4x rate"
## Review Schedule
- Weekly: SLO dashboard review
- Monthly: Error budget retrospective
- Quarterly: SLO target review
## Appendix: Historical Performance
[Include baseline measurements and trends]
.NET SLO Implementation
// SLO metric implementation in .NET
// Infrastructure/Telemetry/SloMetrics.cs
using System.Diagnostics;          // TagList, Stopwatch
using System.Diagnostics.Metrics;  // Counter, Histogram, IMeterFactory
using Microsoft.AspNetCore.Http;   // HttpContext, RequestDelegate, StatusCodes

public class SloMetrics
{
    private readonly Counter<long> _totalRequests;
    private readonly Counter<long> _successfulRequests;
    private readonly Counter<long> _failedRequests;
    private readonly Histogram<double> _requestDuration;

    public SloMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("OrdersApi.SLO");

        _totalRequests = meter.CreateCounter<long>(
            "slo.requests.total",
            "{request}",
            "Total requests for SLO calculation");

        _successfulRequests = meter.CreateCounter<long>(
            "slo.requests.successful",
            "{request}",
            "Successful requests (good events)");

        _failedRequests = meter.CreateCounter<long>(
            "slo.requests.failed",
            "{request}",
            "Failed requests (bad events)");

        _requestDuration = meter.CreateHistogram<double>(
            "slo.request.duration",
            "ms",
            "Request duration for latency SLI");
    }

    public void RecordRequest(
        string endpoint,
        int statusCode,
        double durationMs)
    {
        var tags = new TagList
        {
            { "endpoint", endpoint },
            { "status_code", statusCode.ToString() }
        };

        _totalRequests.Add(1, tags);

        // Availability SLI: 5xx = bad, everything else = good
        if (statusCode >= 500)
        {
            _failedRequests.Add(1, tags);
        }
        else
        {
            _successfulRequests.Add(1, tags);
        }

        // Latency SLI
        _requestDuration.Record(durationMs, tags);
    }
}

// Middleware to capture SLO metrics
public class SloMetricsMiddleware
{
    private readonly RequestDelegate _next;
    private readonly SloMetrics _sloMetrics;

    public SloMetricsMiddleware(RequestDelegate next, SloMetrics sloMetrics)
    {
        _next = next;
        _sloMetrics = sloMetrics;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var stopwatch = Stopwatch.StartNew();

        // Assume failure until the pipeline completes. If an exception escapes
        // _next, Response.StatusCode may still hold its default of 200, which
        // would otherwise be recorded as a good event.
        var statusCode = StatusCodes.Status500InternalServerError;
        try
        {
            await _next(context);
            statusCode = context.Response.StatusCode;
        }
        finally
        {
            stopwatch.Stop();
            var endpoint = context.GetEndpoint()?.DisplayName ?? "unknown";
            var durationMs = stopwatch.Elapsed.TotalMilliseconds;
            _sloMetrics.RecordRequest(endpoint, statusCode, durationMs);
        }
    }
}
Error Budget Dashboard Queries
# Availability SLI (30-day rolling)
1 - (
sum(increase(slo_requests_failed_total[30d]))
/
sum(increase(slo_requests_total[30d]))
)
# Latency SLI (proportion of requests served within 500ms, 30-day)
sum(increase(slo_request_duration_bucket{le="500"}[30d]))
/
sum(increase(slo_request_duration_count[30d]))
# Error Budget Remaining (availability), as a fraction of the 30-day budget
1 - (
  (
    sum(increase(slo_requests_failed_total[30d]))
    /
    sum(increase(slo_requests_total[30d]))
  )
  / (1 - 0.999) # Error budget for the 99.9% SLO target
)
# Error Budget Burn Rate (1h)
(
sum(rate(slo_requests_failed_total[1h]))
/
sum(rate(slo_requests_total[1h]))
) / (1 - 0.999) # Divide by error budget (0.1%)
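To make the burn rate arithmetic concrete, here is a small Python sketch of the same calculation, plus the time-to-exhaustion figures used in the burn rate alert table. The `should_page` helper illustrates a multi-window check in which a long and a short window must both exceed the threshold; that pairing and the sample ratios are assumptions of this example, not something the queries above define.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    return window_days / rate


def should_page(long_ratio: float, short_ratio: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: both the long and the short window must exceed the threshold."""
    return (burn_rate(long_ratio, slo) > threshold
            and burn_rate(short_ratio, slo) > threshold)


print(round(burn_rate(0.0144, 0.999), 1))    # 14.4
print(round(days_to_exhaustion(14.4), 1))    # 2.1
print(should_page(0.02, 0.03, slo=0.999))    # True
print(should_page(0.02, 0.0005, slo=0.999))  # False
```

Requiring the short window as well means an alert stops firing soon after the errors stop, instead of staying active until the long window drains.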
Workflow
When designing SLOs:
- Identify User Journeys: What do users care about?
- Define SLIs: What can we measure?
- Measure Baseline: What are we achieving now?
- Set SLO Targets: Achievable but aspirational
- Define Error Budget Policy: What happens when budget is low?
- Implement Alerting: Multi-window burn rate alerts
- Create Dashboards: Visibility into SLO status
- Review Regularly: Adjust based on learning
Last Updated: 2025-12-26
Repository: melodic-software/claude-code-plugins/plugins/observability-planning/skills/slo-sli-design
Author: melodic-software