alert-design
Design actionable alerts with appropriate severity and noise reduction
allowed_tools: Read, Glob, Grep, Write, Edit
Install
git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/observability-planning/skills/alert-design ~/.claude/skills/claude-code-plugins/
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: alert-design
description: Design actionable alerts with appropriate severity and noise reduction
allowed-tools: Read, Glob, Grep, Write, Edit
Alert Design Skill
When to Use This Skill
Use this skill when:
- Alert Design tasks - Designing actionable alerts with appropriate severity and noise reduction
- Planning or design - Need guidance on Alert Design approaches
- Best practices - Want to follow established patterns and standards
Overview
Design effective, actionable alerts that minimize noise and maximize signal.
MANDATORY: Documentation-First Approach
Before designing alerts:
- Invoke the docs-management skill for alerting patterns
- Verify alerting best practices via MCP servers (perplexity)
- Base guidance on SRE and on-call best practices
Alert Design Principles
ALERT PRINCIPLES:
┌─────────────────────────────────────────────────────────────────┐
│ GOOD ALERTS ARE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ACTIONABLE │
│ ├── Clear remediation steps exist │
│ ├── Responder knows what to do │
│ └── Not just "something is wrong" │
│ │
│ RELEVANT │
│ ├── Tied to user impact or SLO │
│ ├── Not monitoring internal metrics only │
│ └── Business-meaningful │
│ │
│ TIMELY │
│ ├── Right urgency level │
│ ├── Not too early (premature) │
│ └── Not too late (missed opportunity) │
│ │
│ UNIQUE │
│ ├── No duplicate alerts for same issue │
│ ├── Aggregated where appropriate │
│ └── Clear ownership │
│ │
│ DIAGNOSED │
│ ├── Include context and links │
│ ├── Recent changes, related metrics │
│ └── Runbook reference │
│ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ BAD ALERTS ARE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ✗ NOISY → Too many, get ignored │
│ ✗ VAGUE → "Something is wrong" with no detail │
│ ✗ UNACTIONABLE → No clear next step │
│ ✗ STALE → Outdated thresholds │
│ ✗ DUPLICATED → Same issue, many alerts │
│ ✗ FALSE POSITIVE → Fires when nothing is wrong │
│ │
└─────────────────────────────────────────────────────────────────┘
Alert Severity Levels
SEVERITY CLASSIFICATION:
┌─────────┬────────────────┬─────────────────────────────────────┐
│ Level │ Response Time │ Criteria │
├─────────┼────────────────┼─────────────────────────────────────┤
│ P1/SEV1 │ Immediate │ - Revenue/users affected NOW │
│ CRITICAL│ Page on-call │ - Data loss risk │
│ │ < 5 min │ - Security breach │
│ │ │ - SLO at risk of breach │
├─────────┼────────────────┼─────────────────────────────────────┤
│ P2/SEV2 │ Within 1 hour │ - Degraded experience │
│ HIGH │ Page if OOH │ - Partial functionality loss │
│ │ │ - High error rate │
│ │ │ - Error budget burning fast │
├─────────┼────────────────┼─────────────────────────────────────┤
│ P3/SEV3 │ Business hours │ - Minor degradation │
│ MEDIUM │ Next day OK │ - Non-critical component │
│ │ │ - Warning thresholds │
│ │ │ - Capacity concerns │
├─────────┼────────────────┼─────────────────────────────────────┤
│ P4/SEV4 │ Best effort │ - Informational │
│ LOW/INFO│ Track in ticket│ - Anomalies to investigate │
│ │ │ - Optimization opportunities │
└─────────┴────────────────┴─────────────────────────────────────┘
ROUTING BY SEVERITY:
┌─────────┬─────────────────────────────────────────────────────┐
│ Level │ Notification Channels │
├─────────┼─────────────────────────────────────────────────────┤
│ P1 │ PagerDuty (page), Slack #incidents, Phone call │
│ P2 │ PagerDuty (high), Slack #alerts, Email │
│ P3 │ Slack #alerts, Email, Ticket auto-created │
│ P4 │ Slack #monitoring, Dashboard only │
└─────────┴─────────────────────────────────────────────────────┘
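The routing table above can be expressed as an Alertmanager route tree. The sketch below is illustrative only: the receiver names, severity label values, and integration keys are assumptions to adapt to your own PagerDuty and Slack setup (the slack_configs also assume a global slack_api_url is configured).

```yaml
# Sketch: alertmanager.yml routing by the `severity` label.
route:
  receiver: slack-monitoring            # default catch-all (P4/info)
  group_by: [alertname, service]
  routes:
    - matchers: ["severity = critical"] # P1: page on-call immediately
      receiver: pagerduty-critical
      continue: true                    # also notify the incidents channel
    - matchers: ["severity = critical"]
      receiver: slack-incidents
    - matchers: ["severity = high"]     # P2: high-urgency page
      receiver: pagerduty-high
    - matchers: ["severity = warning"]  # P3: notify, never page
      receiver: slack-alerts

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <pagerduty-critical-integration-key>
  - name: pagerduty-high
    pagerduty_configs:
      - routing_key: <pagerduty-high-integration-key>
  - name: slack-incidents
    slack_configs:
      - channel: "#incidents"
  - name: slack-alerts
    slack_configs:
      - channel: "#alerts"
  - name: slack-monitoring
    slack_configs:
      - channel: "#monitoring"
```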
Alert Structure
ALERT ANATOMY:
┌─────────────────────────────────────────────────────────────────┐
│ ALERT: OrdersApi High Error Rate │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SEVERITY: P2/High │
│ SERVICE: orders-api │
│ ENVIRONMENT: production │
│ │
│ SUMMARY: │
│ Error rate is 5.2% (threshold: 1%). Affecting checkout flow. │
│ │
│ IMPACT: │
│ ~500 users/minute seeing checkout failures. │
│ Revenue impact: ~$2,000/minute at risk. │
│ │
│ DETAILS: │
│ - Current error rate: 5.2% │
│ - Normal rate: 0.3% │
│ - Started: 14:32 UTC (12 minutes ago) │
│ - Error type: 503 Service Unavailable │
│ │
│ POSSIBLE CAUSES: │
│ - Recent deployment (14:28 UTC - payment-service v2.3.1) │
│ - Database connection pool exhaustion │
│ - Downstream dependency failure │
│ │
│ QUICK ACTIONS: │
│ 1. Check recent deployments: [Deploy Dashboard] │
│ 2. Check dependencies: [Dependency Status] │
│ 3. View error logs: [Kibana Query] │
│ │
│ RUNBOOK: https://wiki.example.com/runbooks/orders-high-error │
│ │
│ RELATED: │
│ - Grafana Dashboard: [Link] │
│ - Service Map: [Link] │
│ - Recent Incidents: [Link] │
│ │
└─────────────────────────────────────────────────────────────────┘
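As a sketch of how the anatomy above might be carried on a Prometheus alert, the fields can be encoded as rule annotations for the notification template to render. The extra annotation keys (impact, quick_actions) and the dashboard URL are conventions assumed here, not standard keys.

```yaml
# Sketch only: anatomy fields expressed as annotations on an alerting rule.
- alert: OrdersApiHighErrorRate
  expr: |
    sum(rate(http_requests_total{service="orders-api",status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="orders-api"}[5m])) > 0.01
  for: 5m
  labels:
    severity: high
    service: orders-api
    environment: production
  annotations:
    summary: "OrdersApi error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
    impact: "Checkout flow affected; see dashboard for current user/revenue impact"
    quick_actions: "Check recent deployments, dependency status, and error logs"
    runbook_url: "https://wiki.example.com/runbooks/orders-high-error"
    dashboard_url: "https://grafana.example.com/d/orders-api"
```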
Alert Types
ALERT CATEGORIES:
SYMPTOM-BASED (Preferred):
┌─────────────────────────────────────────────────────────────────┐
│ Alert on what users experience, not internal metrics │
│ │
│ ✓ "Error rate > 1% for 5 minutes" │
│ ✓ "P95 latency > 2s for 10 minutes" │
│ ✓ "Availability < 99.9% (SLO breach risk)" │
│ ✓ "Order completion rate dropped 20%" │
│ │
│ ✗ "CPU > 80%" (cause, not symptom) │
│ ✗ "Memory > 90%" (cause, not symptom) │
│ ✗ "Pod restarts" (might not affect users) │
└─────────────────────────────────────────────────────────────────┘
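As a hedged sketch of the last example above, a drop in order completion rate can be detected by comparing the current rate against the same window one day earlier. The metric name orders_completed_total and the runbook URL are placeholders for illustration.

```yaml
# Sketch: symptom-based business alert - completion rate down >20%
# versus the same 30-minute window yesterday. Metric name is illustrative.
- alert: OrderCompletionRateDropped
  expr: |
    sum(rate(orders_completed_total[30m]))
    /
    sum(rate(orders_completed_total[30m] offset 1d)) < 0.8
  for: 10m
  labels:
    severity: high
    service: orders-api
  annotations:
    summary: "Order completion rate dropped more than 20% versus yesterday"
    runbook_url: "https://wiki.example.com/runbooks/order-completion-drop"
```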
CAUSE-BASED (Supporting):
┌─────────────────────────────────────────────────────────────────┐
│ Use for capacity planning and early warning │
│ │
│ ✓ "Disk 80% full" (warning, lower severity) │
│ ✓ "Certificate expires in 7 days" │
│ ✓ "Connection pool 90% utilized" │
│ │
│ These should be WARNING/INFO, not PAGE-worthy │
└─────────────────────────────────────────────────────────────────┘
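For instance, the certificate-expiry warning could look like the sketch below, assuming TLS endpoints are probed with the Prometheus blackbox_exporter (which exposes probe_ssl_earliest_cert_expiry); the runbook URL is a placeholder. Note the warning severity so it notifies rather than pages.

```yaml
# Sketch: cause-based early warning, kept at warning severity (no page).
- alert: CertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "TLS certificate for {{ $labels.instance }} expires in under 7 days"
    runbook_url: "https://wiki.example.com/runbooks/certificate-renewal"
```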
SLO-BASED (Burn Rate):
┌─────────────────────────────────────────────────────────────────┐
│ Alert when error budget is burning too fast │
│ │
│ Multi-window burn rate alerts: │
│ │
│ Critical: 14.4x burn rate over 1h AND 6x over 5m │
│ → Budget exhausted in ~2 days if continues │
│ │
│ Warning: 6x burn rate over 6h AND 3x over 30m │
│ → Budget exhausted in ~5 days if continues │
│ │
│ Info: 1x burn rate over 3d │
│ → On track to exhaust budget │
└─────────────────────────────────────────────────────────────────┘
Alert Configuration Examples
# Prometheus alerting rules
# alerting-rules.yaml
groups:
- name: orders-api-slo
rules:
# SLO Burn Rate - Critical
- alert: OrdersApiHighErrorBudgetBurn
expr: |
(
sum(rate(http_requests_total{service="orders-api",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="orders-api"}[1h]))
) > (14.4 * 0.001)
and
(
sum(rate(http_requests_total{service="orders-api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="orders-api"}[5m]))
) > (6 * 0.001)
for: 2m
labels:
severity: critical
service: orders-api
slo: availability
annotations:
summary: "OrdersApi error budget burning rapidly"
description: |
Error budget is burning at 14.4x rate.
If this continues, budget will be exhausted in ~2 days.
Current error rate: {{ $value | humanizePercentage }}
runbook_url: "https://wiki.example.com/runbooks/slo-burn-rate"
dashboard_url: "https://grafana.example.com/d/orders-slo"
# Latency SLO
- alert: OrdersApiHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{service="orders-api"}[5m]))
by (le)
) > 2
for: 5m
labels:
severity: high
service: orders-api
slo: latency
annotations:
summary: "OrdersApi P95 latency exceeds 2s"
description: |
P95 latency is {{ $value | humanizeDuration }}.
Threshold: 2 seconds.
This affects user experience for 5% of requests.
runbook_url: "https://wiki.example.com/runbooks/high-latency"
- name: orders-api-capacity
rules:
# Capacity warning (lower severity)
- alert: OrdersApiHighCPU
expr: |
avg(rate(container_cpu_usage_seconds_total{container="orders-api"}[5m]))
by (pod) > 0.8
for: 15m
labels:
severity: warning
service: orders-api
category: capacity
annotations:
summary: "OrdersApi CPU usage high"
description: |
CPU usage is {{ $value | humanizePercentage }} for 15+ minutes.
Consider scaling or optimization.
runbook_url: "https://wiki.example.com/runbooks/capacity-cpu"
# Disk space warning
- alert: OrdersApiDiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/data"}
/
node_filesystem_size_bytes{mountpoint="/data"}) < 0.2
for: 5m
labels:
severity: warning
category: capacity
annotations:
summary: "Disk space below 20%"
description: "{{ $value | humanizePercentage }} disk space remaining"
Noise Reduction Strategies
NOISE REDUCTION TECHNIQUES:
1. APPROPRIATE THRESHOLDS
┌────────────────────────────────────────────────────────────┐
│ Don't alert on every anomaly │
│ │
│ Bad: CPU > 70% for 1 minute │
│ Good: CPU > 90% for 15 minutes │
│ │
│ Bad: Any 5xx error │
│ Good: Error rate > 1% for 5 minutes │
└────────────────────────────────────────────────────────────┘
2. PROPER FOR DURATION
┌────────────────────────────────────────────────────────────┐
│ Wait long enough to confirm it's real │
│ │
│ Transient spike: for: 0m → Too noisy │
│ Brief issue: for: 2m → May still be transient │
│ Confirmed issue: for: 5m → Likely real │
│ Capacity issue: for: 15m → Sustained trend │
└────────────────────────────────────────────────────────────┘
3. ALERT AGGREGATION
┌────────────────────────────────────────────────────────────┐
│ Group related alerts into one notification │
│ │
│ Instead of: │
│ - Pod A unhealthy │
│ - Pod B unhealthy │
│ - Pod C unhealthy │
│ │
│ Send: │
│ - 3 pods unhealthy in orders-api deployment │
└────────────────────────────────────────────────────────────┘
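In Alertmanager, this kind of aggregation is driven by grouping on the route. A minimal sketch (the environment label and the timing values are assumptions to tune for your setup):

```yaml
# Sketch: group per-pod alerts into a single notification per service.
route:
  group_by: [alertname, service, environment]
  group_wait: 30s        # batch alerts that start firing around the same time
  group_interval: 5m     # minimum gap between updates for an existing group
  repeat_interval: 4h    # re-notify only if the group is still firing
```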
4. INHIBITION RULES
┌────────────────────────────────────────────────────────────┐
│ Suppress child alerts when parent fires │
│ │
│ If "Database down" fires: │
│ Suppress all "DB connection error" alerts │
│ │
│ If "Kubernetes node down" fires: │
│ Suppress all pod alerts on that node │
└────────────────────────────────────────────────────────────┘
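A minimal sketch of the corresponding Alertmanager inhibit_rules; the alert names and the equal labels are illustrative and must match whatever labels your rules actually set:

```yaml
# Sketch: suppress dependent alerts while the parent alert is firing.
inhibit_rules:
  - source_matchers: ["alertname = DatabaseDown"]
    target_matchers: ["alertname = DatabaseConnectionErrors"]
    equal: [environment]        # only inhibit within the same environment
  - source_matchers: ["alertname = KubeNodeDown"]
    target_matchers: ["severity = warning"]
    equal: [node]               # suppress pod-level warnings on that node
```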
5. MAINTENANCE WINDOWS
┌────────────────────────────────────────────────────────────┐
│ Silence alerts during planned maintenance │
│ │
│ Schedule silences before: │
│ - Deployments │
│ - Database migrations │
│ - Infrastructure changes │
└────────────────────────────────────────────────────────────┘
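Recurring windows can be expressed with Alertmanager time intervals (0.24+), as sketched below; one-off silences before a deployment are usually created with `amtool silence add` or the Alertmanager UI instead. The interval name, schedule, and matcher here are assumptions.

```yaml
# Sketch: mute a route during a recurring maintenance window.
time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: [saturday]
        times:
          - start_time: "02:00"
            end_time: "04:00"

route:
  routes:
    - matchers: ["service = orders-api"]
      receiver: slack-alerts
      mute_time_intervals: [weekly-maintenance]
```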
Alert Template
# Alert Specification: {Alert Name}
## Overview
| Attribute | Value |
|-----------|-------|
| Alert Name | [Name] |
| Service | [Service name] |
| Severity | [P1/P2/P3/P4] |
| Category | [SLO/Capacity/Security/Business] |
| Owner | [Team] |
## Condition
**Expression:**
```promql
[PromQL or query expression]
```

**Threshold:** [Value]
**Duration (for):** [Duration]
## Meaning

**What it means:** [Explain what this alert indicates]
**User impact:** [How users are affected]
**Business impact:** [Revenue, reputation, compliance implications]
## Triage Steps

- **Verify the alert**
  - Check [dashboard link]
  - Confirm [metric] is actually [condition]
- **Assess impact**
  - How many users affected?
  - Which user journeys impacted?
- **Identify cause**
  - Check recent deployments
  - Check dependencies
  - Review error logs
## Remediation

**Immediate actions:**
- [First action]
- [Second action]
- [Third action]

**Escalation:**
- If not resolved in [X] minutes, escalate to [team/person]
## Runbook

Link: [Runbook URL]
## Alert Configuration

```yaml
- alert: {AlertName}
  expr: |
    [expression]
  for: [duration]
  labels:
    severity: [severity]
    service: [service]
  annotations:
    summary: "[Summary]"
    description: "[Description]"
    runbook_url: "[URL]"
```
## Review History
| Date | Change | Reason |
|---|---|---|
| [Date] | [Change] | [Why] |
Workflow
When designing alerts:
- Start with SLOs: Alert on SLO burn rate first
- Focus on Symptoms: Alert on user impact, not internal metrics
- Set Appropriate Severity: Not everything is P1
- Include Context: Dashboards, runbooks, recent changes
- Define Escalation: Who gets notified, when to escalate
- Reduce Noise: Proper thresholds, aggregation, inhibition
- Review Regularly: Tune based on false positives/negatives
Last Updated: 2025-12-26