slo-sli-design

Design Service Level Objectives, Indicators, and error budgets

allowed_tools: Read, Glob, Grep, Write, Edit

Install

```shell
git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/observability-planning/skills/slo-sli-design ~/.claude/skills/claude-code-plugins
```

Tip: run this command in your terminal to install the skill.


---
name: slo-sli-design
description: Design Service Level Objectives, Indicators, and error budgets
allowed-tools: Read, Glob, Grep, Write, Edit
---

SLO/SLI Design Skill

When to Use This Skill

Use this skill when:

  • SLO/SLI design tasks - defining Service Level Objectives, Indicators, and error budgets
  • Planning or design - you need guidance on SLO/SLI design approaches
  • Best practices - you want to follow established reliability patterns and standards

Overview

Design Service Level Objectives, Indicators, and error budget policies.

MANDATORY: Documentation-First Approach

Before designing SLOs:

  1. Invoke docs-management skill for SLO/SLI patterns
  2. Verify SRE practices via MCP servers (e.g., Perplexity)
  3. Base guidance on Google SRE and industry best practices

SLO/SLI/SLA Hierarchy

SLO/SLI/SLA RELATIONSHIP:

┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  SLA (Service Level Agreement)                                   │
│  ├── External promise to customers                               │
│  ├── Legal/contractual implications                              │
│  └── Example: "99.9% monthly uptime"                             │
│                                                                  │
│       ▲                                                          │
│       │ Buffer (SLO should be tighter)                           │
│       │                                                          │
│  SLO (Service Level Objective)                                   │
│  ├── Internal reliability target                                 │
│  ├── Tighter than SLA (headroom)                                 │
│  └── Example: "99.95% monthly availability"                      │
│                                                                  │
│       ▲                                                          │
│       │ Measured by                                              │
│       │                                                          │
│  SLI (Service Level Indicator)                                   │
│  ├── Actual measurement                                          │
│  ├── Quantitative metric                                         │
│  └── Example: "successful_requests / total_requests"             │
│                                                                  │
│       ▲                                                          │
│       │ Derived from                                             │
│       │                                                          │
│  Error Budget                                                    │
│  ├── Allowable unreliability: 100% - SLO                         │
│  ├── Example: 0.05% ≈ 21.9 minutes/month                         │
│  └── Spent on: releases, incidents, maintenance                  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Common SLI Types

SLI CATEGORIES:

AVAILABILITY SLI:
"The proportion of requests that are served successfully"

Formula: successful_requests / total_requests × 100%

Good Events: HTTP 2xx, 3xx, and 4xx (client errors do not count against the service)
Bad Events: HTTP 5xx, timeouts, connection failures

Example Prometheus query:
  sum(rate(http_requests_total{status!~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
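The same ratio is easy to compute offline when validating an SLI definition; a minimal Python sketch (the status codes are made-up sample data):

```python
# Availability SLI from status codes: 5xx are bad events, everything else is good.
status_codes = [200, 201, 404, 500, 200, 302, 503, 200, 200, 429]

good = sum(1 for code in status_codes if code < 500)
sli = good / len(status_codes)
print(f"Availability SLI: {sli:.1%}")  # 80.0%
```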

─────────────────────────────────────────────────────────────────

LATENCY SLI:
"The proportion of requests that are served within threshold"

Formula: requests_below_threshold / total_requests × 100%

Thresholds (example):
- P50: 100ms (median experience)
- P95: 500ms (95th percentile)
- P99: 1000ms (tail latency)

Example Prometheus query:
  sum(rate(http_request_duration_bucket{le="0.5"}[5m]))
  /
  sum(rate(http_request_duration_count[5m]))
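For a quick offline check of a latency SLI, a sketch with made-up durations and the 500ms threshold from the query above:

```python
# Latency SLI: share of requests served within the 500ms threshold.
durations_ms = [42, 87, 120, 95, 430, 61, 980, 73, 250, 110]

threshold_ms = 500
good = sum(1 for d in durations_ms if d <= threshold_ms)
sli = good / len(durations_ms)
print(f"Latency SLI (<= {threshold_ms}ms): {sli:.1%}")  # 90.0%
```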

─────────────────────────────────────────────────────────────────

QUALITY/CORRECTNESS SLI:
"The proportion of requests that return correct results"

Formula: correct_responses / total_responses × 100%

Good Events: Valid data, expected format
Bad Events: Data corruption, stale data, wrong results

─────────────────────────────────────────────────────────────────

FRESHNESS SLI:
"The proportion of data that is updated within threshold"

Formula: fresh_records / total_records × 100%

Example: "95% of records updated within 5 minutes"

─────────────────────────────────────────────────────────────────

THROUGHPUT SLI:
"The proportion of time the system handles expected load"

Formula: time_at_capacity / total_time × 100%

Example: "System handles 1000 req/s 99% of the time"

Error Budget Calculation

ERROR BUDGET MATH:

Monthly Error Budget (30.44-day average month):

SLO Target  │ Error Budget │ Allowed Downtime
────────────┼──────────────┼──────────────────
99%         │ 1%           │ 7h 18m
99.5%       │ 0.5%         │ 3h 39m
99.9%       │ 0.1%         │ 43m 50s
99.95%      │ 0.05%        │ 21m 55s
99.99%      │ 0.01%        │ 4m 23s
99.999%     │ 0.001%       │ 26s
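The downtime column follows directly from the budget fraction; a small sketch, using the 30.44-day average month the table assumes:

```python
# Allowed full-downtime minutes for a given SLO target and window.
def allowed_downtime_minutes(slo: float, days: float = 30.44) -> float:
    """Error budget (1 - slo), expressed as minutes of downtime per window."""
    return (1 - slo) * days * 24 * 60

for target in (0.99, 0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"SLO {target}: {minutes:.1f} min of budget per month")
```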

Error Budget Consumption:
┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  Monthly Budget: 21m 55s (99.95% SLO)                            │
│                                                                  │
│  ████████████░░░░░░░░░░░░░░░░░░░░░  Used: 8m (36%)               │
│                                                                  │
│  Incidents:                                                      │
│  - Jan 5: Database failover - 5m                                 │
│  - Jan 12: Deployment rollback - 3m                              │
│                                                                  │
│  Remaining: 13m 55s (64%)                                        │
│                                                                  │
│  Status: ✓ HEALTHY                                               │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
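The consumption math above can be reproduced in a few lines; a sketch using the example month's incident durations:

```python
# Budget consumption for the example month (99.95% SLO ≈ 21.92 min of budget).
budget_min = 21.92
incidents = {"Jan 5 database failover": 5, "Jan 12 deployment rollback": 3}

used = sum(incidents.values())
remaining = budget_min - used
print(f"Used: {used}m ({used / budget_min:.0%}), remaining: {remaining:.1f}m")
# Used: 8m (36%), remaining: 13.9m
```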

SLO Design Process

SLO DESIGN WORKFLOW:

Step 1: IDENTIFY USER JOURNEYS
┌──────────────────────────────────────────────────────────────────┐
│ What do users care about?                                        │
│                                                                  │
│ Critical User Journeys (CUJs):                                   │
│ - Login and authentication                                       │
│ - Search and browse products                                     │
│ - Add to cart and checkout                                       │
│ - View order status                                              │
│                                                                  │
│ For each journey:                                                │
│ - What constitutes success?                                      │
│ - What latency is acceptable?                                    │
│ - What's the business impact of failure?                         │
└──────────────────────────────────────────────────────────────────┘

Step 2: DEFINE SLIs
┌──────────────────────────────────────────────────────────────────┐
│ What can we measure that represents user happiness?              │
│                                                                  │
│ For "Checkout" journey:                                          │
│ - Availability: checkout completes without error                 │
│ - Latency: checkout completes within 3 seconds                   │
│ - Correctness: order total matches cart                          │
│                                                                  │
│ SLI Specification:                                               │
│ - What events are we measuring?                                  │
│ - What's a "good" event vs "bad" event?                          │
│ - Where do we measure? (server, client, synthetic)               │
└──────────────────────────────────────────────────────────────────┘

Step 3: SET SLO TARGETS
┌──────────────────────────────────────────────────────────────────┐
│ What reliability level should we target?                         │
│                                                                  │
│ Consider:                                                        │
│ - Current baseline (what are we achieving now?)                  │
│ - User expectations (what do users tolerate?)                    │
│ - Business requirements (any SLAs?)                              │
│ - Cost vs reliability trade-off                                  │
│                                                                  │
│ Start achievable, improve iteratively                            │
│ SLO = Current baseline - small margin                            │
└──────────────────────────────────────────────────────────────────┘

Step 4: DEFINE ERROR BUDGET POLICY
┌──────────────────────────────────────────────────────────────────┐
│ What happens when budget is exhausted?                           │
│                                                                  │
│ Error Budget Policy:                                             │
│ - Budget > 50%: Normal operations                                │
│ - Budget 25-50%: Slow down risky changes                         │
│ - Budget < 25%: Focus on reliability                             │
│ - Budget = 0%: Feature freeze, reliability only                  │
│                                                                  │
│ Escalation:                                                      │
│ - Who gets notified at each threshold?                           │
│ - What actions are required?                                     │
└──────────────────────────────────────────────────────────────────┘
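The Step 4 thresholds can be encoded as a simple lookup; a sketch (the returned strings are illustrative, not prescribed by the policy):

```python
# Map remaining error budget (as a fraction) to the policy stance above.
def budget_policy(remaining_fraction: float) -> str:
    if remaining_fraction <= 0:
        return "Feature freeze, reliability only"
    if remaining_fraction < 0.25:
        return "Focus on reliability"
    if remaining_fraction < 0.50:
        return "Slow down risky changes"
    return "Normal operations"

print(budget_policy(0.64))  # Normal operations
```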

SLO Document Template

# SLO: {Service Name} - {Journey/Feature}

## Service Overview

| Attribute | Value |
|-----------|-------|
| Service | [Service name] |
| Owner | [Team name] |
| Criticality | [Critical/High/Medium/Low] |
| User Journey | [Journey name] |

## SLI Specification

### Availability SLI

**Definition:** The proportion of [event type] that [success criteria].

**Good Event:** [What counts as success]
**Bad Event:** [What counts as failure]

**Measurement:**
- Source: [Prometheus/Azure Monitor/etc.]
- Query:
  ```promql
  sum(rate(http_requests_total{status!~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  ```

### Latency SLI

**Definition:** The proportion of requests served within [threshold].

**Thresholds:**

| Percentile | Threshold |
|------------|-----------|
| P50 | [X]ms |
| P95 | [X]ms |
| P99 | [X]ms |

**Measurement:**

```promql
histogram_quantile(0.95,
  rate(http_request_duration_bucket[5m]))
```

## SLO Targets

| SLI | Target | Window |
|-----|--------|--------|
| Availability | [99.9%] | 30 days rolling |
| Latency (P95) | [99%] below 500ms | 30 days rolling |

## Error Budget

| SLO | Error Budget | Allowed Downtime (30d) |
|-----|--------------|------------------------|
| 99.9% availability | 0.1% | 43m 50s |
| 99% latency | 1% | 7h 18m |

## Error Budget Policy

### Budget Thresholds

| Budget Remaining | Status | Actions |
|------------------|--------|---------|
| > 50% | 🟢 Healthy | Normal operations |
| 25-50% | 🟡 Caution | Review recent changes |
| 10-25% | 🟠 Warning | Slow deployments, reliability focus |
| < 10% | 🔴 Critical | Feature freeze |
| Exhausted | ⛔ Frozen | Reliability-only work |

### Escalation

| Threshold | Notify | Action Required |
|-----------|--------|-----------------|
| < 50% | Team lead | Awareness |
| < 25% | Engineering manager | Review deployment pace |
| < 10% | Director | Feature freeze decision |
| Exhausted | VP Engineering | Incident response mode |

## Alerting

### SLO Burn Rate Alerts

| Severity | Burn Rate | Time Window | Example |
|----------|-----------|-------------|---------|
| Critical | 14.4x | 1h | Budget exhausted in ~2 days |
| Warning | 6x | 6h | Budget exhausted in ~5 days |
| Info | 1x | 3d | Budget on track to exhaust |
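Burn rate is the observed error rate divided by the budgeted error rate; at burn rate B, a 30-day budget is exhausted in 30/B days. A sketch:

```python
# Burn rate: how fast the error budget is being consumed (1.0 = exactly on budget).
def burn_rate(error_rate: float, slo: float) -> float:
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    return window_days / rate

b = burn_rate(error_rate=0.0144, slo=0.999)  # 1.44% errors vs a 0.1% budget
print(f"burn rate {b:.1f}x -> budget gone in ~{days_to_exhaustion(b):.1f} days")
```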

### Alert Configuration

```yaml
- alert: SLOHighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
    description: "Error budget burning at 14.4x rate"
```

## Review Schedule

  • Weekly: SLO dashboard review
  • Monthly: Error budget retrospective
  • Quarterly: SLO target review

## Appendix: Historical Performance

[Include baseline measurements and trends]

.NET SLO Implementation

```csharp
// SLO metric implementation in .NET
// Infrastructure/Telemetry/SloMetrics.cs

using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.AspNetCore.Http;

public class SloMetrics
{
    private readonly Counter<long> _totalRequests;
    private readonly Counter<long> _successfulRequests;
    private readonly Counter<long> _failedRequests;
    private readonly Histogram<double> _requestDuration;

    public SloMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("OrdersApi.SLO");

        _totalRequests = meter.CreateCounter<long>(
            "slo.requests.total",
            "{request}",
            "Total requests for SLO calculation");

        _successfulRequests = meter.CreateCounter<long>(
            "slo.requests.successful",
            "{request}",
            "Successful requests (good events)");

        _failedRequests = meter.CreateCounter<long>(
            "slo.requests.failed",
            "{request}",
            "Failed requests (bad events)");

        _requestDuration = meter.CreateHistogram<double>(
            "slo.request.duration",
            "ms",
            "Request duration for latency SLI");
    }

    public void RecordRequest(
        string endpoint,
        int statusCode,
        double durationMs)
    {
        var tags = new TagList
        {
            { "endpoint", endpoint },
            { "status_code", statusCode.ToString() }
        };

        _totalRequests.Add(1, tags);

        // Availability SLI: 5xx = bad, everything else = good
        if (statusCode >= 500)
        {
            _failedRequests.Add(1, tags);
        }
        else
        {
            _successfulRequests.Add(1, tags);
        }

        // Latency SLI
        _requestDuration.Record(durationMs, tags);
    }
}

// Middleware to capture SLO metrics
public class SloMetricsMiddleware
{
    private readonly RequestDelegate _next;
    private readonly SloMetrics _sloMetrics;

    public SloMetricsMiddleware(RequestDelegate next, SloMetrics sloMetrics)
    {
        _next = next;
        _sloMetrics = sloMetrics;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var stopwatch = Stopwatch.StartNew();

        try
        {
            await _next(context);
        }
        finally
        {
            stopwatch.Stop();

            var endpoint = context.GetEndpoint()?.DisplayName ?? "unknown";
            var statusCode = context.Response.StatusCode;
            var durationMs = stopwatch.Elapsed.TotalMilliseconds;

            _sloMetrics.RecordRequest(endpoint, statusCode, durationMs);
        }
    }
}
```

Error Budget Dashboard Queries

```promql
# Availability SLI (30-day rolling)
1 - (
  sum(increase(slo_requests_failed_total[30d]))
  /
  sum(increase(slo_requests_total[30d]))
)

# Latency SLI (requests under 500ms, 30-day)
sum(increase(slo_request_duration_bucket{le="500"}[30d]))
/
sum(increase(slo_request_duration_count[30d]))

# Error Budget Remaining (availability)
# remaining = 1 - (observed error rate / error budget)
1 - (
  (
    sum(increase(slo_requests_failed_total[30d]))
    /
    sum(increase(slo_requests_total[30d]))
  )
  /
  (1 - 0.999)  # error budget for a 99.9% SLO
)

# Error Budget Burn Rate (1h)
(
  sum(rate(slo_requests_failed_total[1h]))
  /
  sum(rate(slo_requests_total[1h]))
) / (1 - 0.999)  # Divide by error budget (0.1%)
```
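The remaining-budget arithmetic is easy to sanity-check offline; a sketch with hypothetical request counts:

```python
# remaining budget = 1 - (observed error rate / error budget)
failed, total, slo = 120, 1_000_000, 0.999

error_rate = failed / total   # 0.012% observed
budget = 1 - slo              # 0.1% allowed
remaining = 1 - error_rate / budget
print(f"error budget remaining: {remaining:.0%}")  # 88%
```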

Workflow

When designing SLOs:

  1. Identify User Journeys: What do users care about?
  2. Define SLIs: What can we measure?
  3. Measure Baseline: What are we achieving now?
  4. Set SLO Targets: Achievable but aspirational
  5. Define Error Budget Policy: What happens when budget is low?
  6. Implement Alerting: Multi-window burn rate alerts
  7. Create Dashboards: Visibility into SLO status
  8. Review Regularly: Adjust based on learning

References

For detailed guidance:

  • Google SRE Book - "Service Level Objectives" chapter
  • Google SRE Workbook - "Implementing SLOs" and "Alerting on SLOs" chapters
Last Updated: 2025-12-26