alert-design

Design actionable alerts with appropriate severity and noise reduction

allowed_tools: Read, Glob, Grep, Write, Edit

$ Install

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/observability-planning/skills/alert-design ~/.claude/skills/claude-code-plugins

// tip: Run this command in your terminal to install the skill


name: alert-design
description: Design actionable alerts with appropriate severity and noise reduction
allowed-tools: Read, Glob, Grep, Write, Edit

Alert Design Skill

When to Use This Skill

Use this skill when:

  • Alert Design tasks - Designing actionable alerts with appropriate severity and noise reduction
  • Planning or design - Need guidance on Alert Design approaches
  • Best practices - Want to follow established patterns and standards

Overview

Design effective, actionable alerts that minimize noise and maximize signal.

MANDATORY: Documentation-First Approach

Before designing alerts:

  1. Invoke docs-management skill for alerting patterns
  2. Verify alerting best practices via MCP servers (perplexity)
  3. Base guidance on SRE and on-call best practices

Alert Design Principles

ALERT PRINCIPLES:

┌──────────────────────────────────────────────────────────────────┐
│                    GOOD ALERTS ARE                               │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ACTIONABLE                                                      │
│  ├── Clear remediation steps exist                               │
│  ├── Responder knows what to do                                  │
│  └── Not just "something is wrong"                               │
│                                                                  │
│  RELEVANT                                                        │
│  ├── Tied to user impact or SLO                                  │
│  ├── Not monitoring internal metrics only                        │
│  └── Business-meaningful                                         │
│                                                                  │
│  TIMELY                                                          │
│  ├── Right urgency level                                         │
│  ├── Not too early (premature)                                   │
│  └── Not too late (missed opportunity)                           │
│                                                                  │
│  UNIQUE                                                          │
│  ├── No duplicate alerts for same issue                          │
│  ├── Aggregated where appropriate                                │
│  └── Clear ownership                                             │
│                                                                  │
│  DIAGNOSED                                                       │
│  ├── Include context and links                                   │
│  ├── Recent changes, related metrics                             │
│  └── Runbook reference                                           │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                    BAD ALERTS ARE                                │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ✗ NOISY          → Too many, get ignored                        │
│  ✗ VAGUE          → "Something is wrong" with no detail          │
│  ✗ UNACTIONABLE   → No clear next step                           │
│  ✗ STALE          → Outdated thresholds                          │
│  ✗ DUPLICATED     → Same issue, many alerts                      │
│  ✗ FALSE POSITIVE → Fires when nothing is wrong                  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Alert Severity Levels

SEVERITY CLASSIFICATION:

┌─────────┬────────────────┬─────────────────────────────────────┐
│ Level   │ Response Time  │ Criteria                            │
├─────────┼────────────────┼─────────────────────────────────────┤
│ P1/SEV1 │ Immediate      │ - Revenue/users affected NOW        │
│ CRITICAL│ Page on-call   │ - Data loss risk                    │
│         │ < 5 min        │ - Security breach                   │
│         │                │ - SLO at risk of breach             │
├─────────┼────────────────┼─────────────────────────────────────┤
│ P2/SEV2 │ Within 1 hour  │ - Degraded experience               │
│ HIGH    │ Page if OOH    │ - Partial functionality loss        │
│         │                │ - High error rate                   │
│         │                │ - Error budget burning fast         │
├─────────┼────────────────┼─────────────────────────────────────┤
│ P3/SEV3 │ Business hours │ - Minor degradation                 │
│ MEDIUM  │ Next day OK    │ - Non-critical component            │
│         │                │ - Warning thresholds                │
│         │                │ - Capacity concerns                 │
├─────────┼────────────────┼─────────────────────────────────────┤
│ P4/SEV4 │ Best effort    │ - Informational                     │
│ LOW/INFO│ Track in ticket│ - Anomalies to investigate          │
│         │                │ - Optimization opportunities        │
└─────────┴────────────────┴─────────────────────────────────────┘

ROUTING BY SEVERITY:
┌─────────┬──────────────────────────────────────────────────────┐
│ Level   │ Notification Channels                                │
├─────────┼──────────────────────────────────────────────────────┤
│ P1      │ PagerDuty (page), Slack #incidents, Phone call       │
│ P2      │ PagerDuty (high), Slack #alerts, Email               │
│ P3      │ Slack #alerts, Email, Ticket auto-created            │
│ P4      │ Slack #monitoring, Dashboard only                    │
└─────────┴──────────────────────────────────────────────────────┘
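
The routing table above can be read as a simple lookup from severity to channels. A minimal sketch, assuming hypothetical channel identifiers (the strings are illustrative labels, not real integration keys) and assuming an unknown severity should fail loudly instead of being routed silently:

```python
# Hypothetical channel labels mirroring the routing table above.
ROUTES = {
    "P1": ["pagerduty:page", "slack:#incidents", "phone"],
    "P2": ["pagerduty:high", "slack:#alerts", "email"],
    "P3": ["slack:#alerts", "email", "ticket"],
    "P4": ["slack:#monitoring", "dashboard"],
}


def channels_for(severity: str) -> list[str]:
    """Return the notification channels for a severity such as "P1" or "p2/sev2"."""
    level = severity.strip().upper().split("/")[0]
    if level not in ROUTES:
        # Assumption: a mislabeled alert is a configuration bug worth surfacing.
        raise ValueError(f"unknown severity: {severity!r}")
    return ROUTES[level]


def pages_on_call(severity: str) -> bool:
    """True when the route includes a PagerDuty channel (P1 and P2 in this table)."""
    return any(ch.startswith("pagerduty:") for ch in channels_for(severity))
```

Keeping the mapping in one place makes it easy to review which severities are allowed to interrupt a person.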

Alert Structure

ALERT ANATOMY:

┌──────────────────────────────────────────────────────────────────┐
│ ALERT: OrdersApi High Error Rate                                 │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│ SEVERITY: P2/High                                                │
│ SERVICE: orders-api                                              │
│ ENVIRONMENT: production                                          │
│                                                                  │
│ SUMMARY:                                                         │
│ Error rate is 5.2% (threshold: 1%). Affecting checkout flow.     │
│                                                                  │
│ IMPACT:                                                          │
│ ~500 users/minute seeing checkout failures.                      │
│ Revenue impact: ~$2,000/minute at risk.                          │
│                                                                  │
│ DETAILS:                                                         │
│ - Current error rate: 5.2%                                       │
│ - Normal rate: 0.3%                                              │
│ - Started: 14:32 UTC (12 minutes ago)                            │
│ - Error type: 503 Service Unavailable                            │
│                                                                  │
│ POSSIBLE CAUSES:                                                 │
│ - Recent deployment (14:28 UTC - payment-service v2.3.1)         │
│ - Database connection pool exhaustion                            │
│ - Downstream dependency failure                                  │
│                                                                  │
│ QUICK ACTIONS:                                                   │
│ 1. Check recent deployments: [Deploy Dashboard]                  │
│ 2. Check dependencies: [Dependency Status]                       │
│ 3. View error logs: [Kibana Query]                               │
│                                                                  │
│ RUNBOOK: https://wiki.example.com/runbooks/orders-high-error     │
│                                                                  │
│ RELATED:                                                         │
│ - Grafana Dashboard: [Link]                                      │
│ - Service Map: [Link]                                            │
│ - Recent Incidents: [Link]                                       │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
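
The anatomy above implies a checklist: an alert that lacks impact, quick actions, or a runbook leaves the responder guessing. A minimal sketch of such a check, assuming a hypothetical `Alert` dataclass whose field names are chosen here for illustration and do not come from any alerting tool:

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Illustrative container; fields follow the anatomy shown above."""
    title: str
    severity: str
    service: str
    environment: str
    summary: str
    impact: str = ""
    runbook_url: str = ""
    quick_actions: list[str] = field(default_factory=list)


def missing_context(alert: Alert) -> list[str]:
    """Name the context fields a responder would need but the alert lacks."""
    gaps = []
    if not alert.impact:
        gaps.append("impact")
    if not alert.runbook_url:
        gaps.append("runbook_url")
    if not alert.quick_actions:
        gaps.append("quick_actions")
    return gaps
```

A check like this could run in review or CI so that alerts without a runbook never reach on-call.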

Alert Types

ALERT CATEGORIES:

SYMPTOM-BASED (Preferred):
┌──────────────────────────────────────────────────────────────────┐
│ Alert on what users experience, not internal metrics             │
│                                                                  │
│ ✓ "Error rate > 1% for 5 minutes"                                │
│ ✓ "P95 latency > 2s for 10 minutes"                              │
│ ✓ "Availability < 99.9% (SLO breach risk)"                       │
│ ✓ "Order completion rate dropped 20%"                            │
│                                                                  │
│ ✗ "CPU > 80%" (cause, not symptom)                               │
│ ✗ "Memory > 90%" (cause, not symptom)                            │
│ ✗ "Pod restarts" (might not affect users)                        │
└──────────────────────────────────────────────────────────────────┘

CAUSE-BASED (Supporting):
┌──────────────────────────────────────────────────────────────────┐
│ Use for capacity planning and early warning                      │
│                                                                  │
│ ✓ "Disk 80% full" (warning, lower severity)                      │
│ ✓ "Certificate expires in 7 days"                                │
│ ✓ "Connection pool 90% utilized"                                 │
│                                                                  │
│ These should be WARNING/INFO, not PAGE-worthy                    │
└──────────────────────────────────────────────────────────────────┘

SLO-BASED (Burn Rate):
┌──────────────────────────────────────────────────────────────────┐
│ Alert when error budget is burning too fast                      │
│                                                                  │
│ Multi-window burn rate alerts:                                   │
│                                                                  │
│ Critical: 14.4x burn rate over 1h AND 6x over 5m                 │
│ → Budget exhausted in ~2 days if this continues                  │
│                                                                  │
│ Warning: 6x burn rate over 6h AND 3x over 30m                    │
│ → Budget exhausted in ~5 days if this continues                  │
│                                                                  │
│ Info: 1x burn rate over 3d                                       │
│ → On track to exhaust budget                                     │
└──────────────────────────────────────────────────────────────────┘
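
The exhaustion times quoted above come from dividing the SLO window by the burn rate. A short sketch of that arithmetic, assuming a 30-day SLO window and a 99.9% availability target; neither is stated in the text above, but they are the values consistent with the ~2 day and ~5 day figures:

```python
SLO_WINDOW_DAYS = 30   # assumed SLO window
SLO_TARGET = 0.999     # assumed availability target


def days_to_exhaustion(burn_rate: float, window_days: float = SLO_WINDOW_DAYS) -> float:
    """Days until the error budget is spent if the burn rate stays constant."""
    return window_days / burn_rate


def error_ratio_threshold(burn_rate: float, slo_target: float = SLO_TARGET) -> float:
    """Error ratio that corresponds to burning the budget at the given rate."""
    return burn_rate * (1 - slo_target)


for name, rate in [("critical", 14.4), ("warning", 6.0), ("info", 1.0)]:
    print(f"{name}: {days_to_exhaustion(rate):.1f} days to exhaustion, "
          f"error ratio above {error_ratio_threshold(rate):.4f}")
```

A burn rate of 1x spends exactly the whole budget over the window, which is why it is only informational.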

Alert Configuration Examples

# Prometheus alerting rules (evaluated by Prometheus; Alertmanager handles routing and notification)
# alerting-rules.yaml

groups:
  - name: orders-api-slo
    rules:
      # SLO Burn Rate - Critical
      - alert: OrdersApiHighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{service="orders-api",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="orders-api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="orders-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="orders-api"}[5m]))
          ) > (6 * 0.001)
        for: 2m
        labels:
          severity: critical
          service: orders-api
          slo: availability
        annotations:
          summary: "OrdersApi error budget burning rapidly"
          description: |
            Error budget is burning at 14.4x rate.
            If this continues, budget will be exhausted in ~2 days.
            Current error rate: {{ $value | humanizePercentage }}
          runbook_url: "https://wiki.example.com/runbooks/slo-burn-rate"
          dashboard_url: "https://grafana.example.com/d/orders-slo"

      # Latency SLO
      - alert: OrdersApiHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="orders-api"}[5m]))
            by (le)
          ) > 2
        for: 5m
        labels:
          severity: high
          service: orders-api
          slo: latency
        annotations:
          summary: "OrdersApi P95 latency exceeds 2s"
          description: |
            P95 latency is {{ $value | humanizeDuration }}.
            Threshold: 2 seconds.
            At least 5% of requests are taking longer than 2 seconds.
          runbook_url: "https://wiki.example.com/runbooks/high-latency"

  - name: orders-api-capacity
    rules:
      # Capacity warning (lower severity)
      - alert: OrdersApiHighCPU
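        expr: |
          # rate() yields CPU cores used, so divide by the CPU limit to get a
          # utilization ratio (assumes the container has a CPU limit set)
          sum by (pod) (rate(container_cpu_usage_seconds_total{container="orders-api"}[5m]))
          /
          sum by (pod) (container_spec_cpu_quota{container="orders-api"} / container_spec_cpu_period{container="orders-api"})
          > 0.8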
        for: 15m
        labels:
          severity: warning
          service: orders-api
          category: capacity
        annotations:
          summary: "OrdersApi CPU usage high"
          description: |
            CPU usage is {{ $value | humanizePercentage }} for 15+ minutes.
            Consider scaling or optimization.
          runbook_url: "https://wiki.example.com/runbooks/capacity-cpu"

      # Disk space warning
      - alert: OrdersApiDiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/data"}
          /
          node_filesystem_size_bytes{mountpoint="/data"}) < 0.2
        for: 5m
        labels:
          severity: warning
          category: capacity
        annotations:
          summary: "Disk space below 20%"
          description: "{{ $value | humanizePercentage }} disk space remaining"
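
Each rule above carries a `for:` clause: the condition has to hold continuously for that long before the alert moves from pending to firing. A simplified sketch of that behavior, assuming time-ordered boolean samples; the function name and sample format are hypothetical, and this is not how Prometheus is implemented internally:

```python
def first_firing_time(samples, hold_seconds):
    """Return the timestamp at which an alert would start firing, or None.

    samples: iterable of (timestamp_seconds, condition_is_true) in time order.
    hold_seconds: how long the condition must hold without interruption.
    """
    pending_since = None
    for ts, breached in samples:
        if not breached:
            pending_since = None   # any recovery resets the pending timer
            continue
        if pending_since is None:
            pending_since = ts     # condition just became true: pending
        if ts - pending_since >= hold_seconds:
            return ts              # held long enough: firing
    return None
```

With a 5-minute hold, a spike that clears after two minutes never fires, while the same spike with a hold of zero fires on the first sample.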

Noise Reduction Strategies

NOISE REDUCTION TECHNIQUES:

1. APPROPRIATE THRESHOLDS
   ┌────────────────────────────────────────────────────────────┐
   │ Don't alert on every anomaly                               │
   │                                                            │
   │ Bad:  CPU > 70% for 1 minute                               │
   │ Good: CPU > 90% for 15 minutes                             │
   │                                                            │
   │ Bad:  Any 5xx error                                        │
   │ Good: Error rate > 1% for 5 minutes                        │
   └────────────────────────────────────────────────────────────┘

2. PROPER "for" DURATION
   ┌────────────────────────────────────────────────────────────┐
   │ Wait long enough to confirm it's real                      │
   │                                                            │
   │ Transient spike: for: 0m    → Too noisy                    │
   │ Brief issue:     for: 2m    → May still be transient       │
   │ Confirmed issue: for: 5m    → Likely real                  │
   │ Capacity issue:  for: 15m   → Sustained trend              │
   └────────────────────────────────────────────────────────────┘

3. ALERT AGGREGATION
   ┌────────────────────────────────────────────────────────────┐
   │ Group related alerts into one notification                 │
   │                                                            │
   │ Instead of:                                                │
   │ - Pod A unhealthy                                          │
   │ - Pod B unhealthy                                          │
   │ - Pod C unhealthy                                          │
   │                                                            │
   │ Send:                                                      │
   │ - 3 pods unhealthy in orders-api deployment                │
   └────────────────────────────────────────────────────────────┘

4. INHIBITION RULES
   ┌────────────────────────────────────────────────────────────┐
   │ Suppress child alerts when parent fires                    │
   │                                                            │
   │ If "Database down" fires:                                  │
   │   Suppress all "DB connection error" alerts                │
   │                                                            │
   │ If "Kubernetes node down" fires:                           │
   │   Suppress all pod alerts on that node                     │
   └────────────────────────────────────────────────────────────┘

5. MAINTENANCE WINDOWS
   ┌────────────────────────────────────────────────────────────┐
   │ Silence alerts during planned maintenance                  │
   │                                                            │
   │ Schedule silences before:                                  │
   │ - Deployments                                              │
   │ - Database migrations                                      │
   │ - Infrastructure changes                                   │
   └────────────────────────────────────────────────────────────┘
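
Techniques 3 and 4 above can be sketched in a few lines. This assumes alerts are plain dictionaries of labels; the label names (`alertname`, `service`, `node`) and the alert names used with them are hypothetical. Alertmanager provides grouping and inhibition natively through its routing `group_by` setting and `inhibit_rules`, so the code only illustrates the logic:

```python
from collections import defaultdict


def group_alerts(alerts, keys=("alertname", "service")):
    """Bucket alerts that share the given label values into one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(k) for k in keys)].append(alert)
    return dict(groups)


def inhibit(alerts, source_name, target_name, equal=("node",)):
    """Drop target alerts when a source alert with the same `equal` labels exists."""
    sources = {
        tuple(a.get(k) for k in equal)
        for a in alerts
        if a["alertname"] == source_name
    }
    return [
        a for a in alerts
        if not (a["alertname"] == target_name
                and tuple(a.get(k) for k in equal) in sources)
    ]
```

Matching on a shared label such as `node` keeps the suppression narrow: pod alerts on healthy nodes still get through.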

Alert Template

# Alert Specification: {Alert Name}

## Overview

| Attribute | Value |
|-----------|-------|
| Alert Name | [Name] |
| Service | [Service name] |
| Severity | [P1/P2/P3/P4] |
| Category | [SLO/Capacity/Security/Business] |
| Owner | [Team] |

## Condition

**Expression:**
```promql
[PromQL or query expression]
```

**Threshold:** [Value]
**Duration (for):** [Duration]

## Meaning

**What it means:** [Explain what this alert indicates]

**User impact:** [How users are affected]

**Business impact:** [Revenue, reputation, compliance implications]

## Triage Steps

1. **Verify the alert**
   - Check [dashboard link]
   - Confirm [metric] is actually [condition]
2. **Assess impact**
   - How many users are affected?
   - Which user journeys are impacted?
3. **Identify cause**
   - Check recent deployments
   - Check dependencies
   - Review error logs

## Remediation

**Immediate actions:**

1. [First action]
2. [Second action]
3. [Third action]

**Escalation:**

- If not resolved in [X] minutes, escalate to [team/person]

## Runbook

**Link:** [Runbook URL]

## Alert Configuration

- alert: {AlertName}
  expr: |
    [expression]
  for: [duration]
  labels:
    severity: [severity]
    service: [service]
  annotations:
    summary: "[Summary]"
    description: "[Description]"
    runbook_url: "[URL]"

## Review History

| Date | Change | Reason |
|------|--------|--------|
| [Date] | [Change] | [Why] |

Workflow

When designing alerts:

  1. Start with SLOs: Alert on SLO burn rate first
  2. Focus on Symptoms: Alert on user impact, not internal metrics
  3. Set Appropriate Severity: Not everything is P1
  4. Include Context: Dashboards, runbooks, recent changes
  5. Define Escalation: Who gets notified, when to escalate
  6. Reduce Noise: Proper thresholds, aggregation, inhibition
  7. Review Regularly: Tune based on false positives/negatives

Last Updated: 2025-12-26

Repository

melodic-software/claude-code-plugins/plugins/observability-planning/skills/alert-design
Author: melodic-software