# alert-design

Design actionable alerts with appropriate severity and noise reduction

Allowed tools: Read, Glob, Grep, Write, Edit
## Install

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/observability-planning/skills/alert-design ~/.claude/skills/claude-code-plugins/

Tip: run this command in your terminal to install the skill.
## SKILL.md

name: alert-design
description: Design actionable alerts with appropriate severity and noise reduction
allowed-tools: Read, Glob, Grep, Write, Edit
# Alert Design Skill

## When to Use This Skill

Use this skill when:

- Alert design tasks: designing actionable alerts with appropriate severity and noise reduction
- Planning or design: you need guidance on alert design approaches
- Best practices: you want to follow established patterns and standards
## Overview
Design effective, actionable alerts that minimize noise and maximize signal.
## MANDATORY: Documentation-First Approach

Before designing alerts:

- Invoke the `docs-management` skill for alerting patterns
- Verify alerting best practices via MCP servers (Perplexity)
- Base guidance on SRE and on-call best practices
## Alert Design Principles

**Good alerts are:**

- **Actionable**
  - Clear remediation steps exist
  - The responder knows what to do
  - Not just "something is wrong"
- **Relevant**
  - Tied to user impact or an SLO
  - Not just monitoring internal metrics
  - Business-meaningful
- **Timely**
  - Right urgency level
  - Not too early (premature)
  - Not too late (missed opportunity)
- **Unique**
  - No duplicate alerts for the same issue
  - Aggregated where appropriate
  - Clear ownership
- **Diagnosed**
  - Include context and links
  - Recent changes and related metrics
  - Runbook reference

**Bad alerts are:**

- **Noisy**: too many alerts, so they get ignored
- **Vague**: "something is wrong" with no detail
- **Unactionable**: no clear next step
- **Stale**: outdated thresholds
- **Duplicated**: same issue, many alerts
- **False positives**: fire when nothing is wrong
## Alert Severity Levels

| Level | Response Time | Criteria |
|---|---|---|
| P1/SEV1 (Critical) | Immediate; page on-call; respond in < 5 min | Revenue/users affected now; data loss risk; security breach; SLO at risk of breach |
| P2/SEV2 (High) | Within 1 hour; page if out of hours | Degraded experience; partial functionality loss; high error rate; error budget burning fast |
| P3/SEV3 (Medium) | Business hours; next day OK | Minor degradation; non-critical component; warning thresholds; capacity concerns |
| P4/SEV4 (Low/Info) | Best effort; track in a ticket | Informational; anomalies to investigate; optimization opportunities |

Routing by severity (an Alertmanager sketch follows the table):

| Level | Notification Channels |
|---|---|
| P1 | PagerDuty (page), Slack #incidents, phone call |
| P2 | PagerDuty (high urgency), Slack #alerts, email |
| P3 | Slack #alerts, email, ticket auto-created |
| P4 | Slack #monitoring, dashboard only |
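As a sketch of that routing (not part of the original skill), an Alertmanager route tree keyed on the `severity` label might look like the following; the receiver names, keys, and channel assignments are placeholder assumptions:

```yaml
# Sketch only: route alerts to notification channels by severity label.
# Receiver names and integration settings are placeholders, not a prescribed setup.
route:
  receiver: slack-monitoring            # default catch-all (P4 / info)
  group_by: ["alertname", "service"]
  routes:
    - matchers: ['severity="critical"'] # P1: page immediately...
      receiver: pagerduty-critical
      continue: true                    # ...and also notify Slack #incidents
    - matchers: ['severity="critical"']
      receiver: slack-incidents
    - matchers: ['severity="high"']     # P2: high-urgency page
      receiver: pagerduty-high
    - matchers: ['severity="warning"']  # P3: Slack plus ticket automation
      receiver: slack-alerts

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "<pagerduty-events-v2-key>"
  - name: pagerduty-high
    pagerduty_configs:
      - routing_key: "<pagerduty-events-v2-key>"
  - name: slack-incidents
    slack_configs:
      - channel: "#incidents"
  - name: slack-alerts
    slack_configs:
      - channel: "#alerts"
  - name: slack-monitoring
    slack_configs:
      - channel: "#monitoring"
```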
## Alert Structure

Example alert anatomy:

**ALERT: OrdersApi High Error Rate**

- **Severity:** P2/High
- **Service:** orders-api
- **Environment:** production

**Summary:** Error rate is 5.2% (threshold: 1%). Affecting checkout flow.

**Impact:** ~500 users/minute seeing checkout failures. Revenue impact: ~$2,000/minute at risk.

**Details:**

- Current error rate: 5.2%
- Normal rate: 0.3%
- Started: 14:32 UTC (12 minutes ago)
- Error type: 503 Service Unavailable

**Possible causes:**

- Recent deployment (14:28 UTC - payment-service v2.3.1)
- Database connection pool exhaustion
- Downstream dependency failure

**Quick actions:**

1. Check recent deployments: [Deploy Dashboard]
2. Check dependencies: [Dependency Status]
3. View error logs: [Kibana Query]

**Runbook:** https://wiki.example.com/runbooks/orders-high-error

**Related:**

- Grafana Dashboard: [Link]
- Service Map: [Link]
- Recent Incidents: [Link]
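To get that context in front of responders, the rule annotations (summary, description, runbook_url, dashboard_url) can be rendered into the notification itself. A minimal, assumed Alertmanager Slack receiver sketch:

```yaml
# Sketch: surface alert annotations in the Slack message so responders
# see impact, runbook, and dashboard links without leaving the channel.
receivers:
  - name: slack-alerts
    slack_configs:
      - channel: "#alerts"
        send_resolved: true
        title: '[{{ .CommonLabels.severity | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: >-
          {{ .CommonAnnotations.description }}
          Runbook: {{ .CommonAnnotations.runbook_url }}
          Dashboard: {{ .CommonAnnotations.dashboard_url }}
```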
## Alert Types

**Symptom-based (preferred)**: alert on what users experience, not on internal metrics.

- ✓ "Error rate > 1% for 5 minutes"
- ✓ "P95 latency > 2s for 10 minutes"
- ✓ "Availability < 99.9% (SLO breach risk)"
- ✓ "Order completion rate dropped 20%"
- ✗ "CPU > 80%" (cause, not symptom)
- ✗ "Memory > 90%" (cause, not symptom)
- ✗ "Pod restarts" (might not affect users)
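The first three checks map directly onto the Prometheus rules later in this document. For the business-level example ("order completion rate dropped 20%"), a hedged sketch could compare the current completion ratio to the same ratio one hour earlier; the `orders_*` metric names are assumptions:

```yaml
# Sketch: symptom-based alert on a business metric (hypothetical metric names).
- alert: OrderCompletionRateDrop
  expr: |
    (
      sum(rate(orders_completed_total[10m]))
      /
      sum(rate(orders_started_total[10m]))
    )
    < 0.8 * (
      sum(rate(orders_completed_total[10m] offset 1h))
      /
      sum(rate(orders_started_total[10m] offset 1h))
    )
  for: 10m
  labels:
    severity: high
    service: orders-api
  annotations:
    summary: "Order completion rate dropped more than 20% vs. one hour ago"
    runbook_url: "https://wiki.example.com/runbooks/order-completion-drop"
```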
**Cause-based (supporting)**: use for capacity planning and early warning.

- ✓ "Disk 80% full" (warning, lower severity)
- ✓ "Certificate expires in 7 days"
- ✓ "Connection pool 90% utilized"

These should be WARNING/INFO severity, not page-worthy.
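A sketch of one such early warning, using the blackbox exporter's `probe_ssl_earliest_cert_expiry` metric for the certificate case (this assumes endpoints are already being probed; keep it at warning severity so it never pages):

```yaml
# Sketch: cause-based early warning, deliberately non-paging.
- alert: TLSCertificateExpiringSoon
  expr: |
    probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "TLS certificate expires in less than 7 days"
    description: "Certificate for {{ $labels.instance }} expires soon; renew before it lapses."
```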
**SLO-based (burn rate)**: alert when the error budget is burning too fast. Use multi-window burn-rate alerts:

- Critical: 14.4x burn rate over 1h AND 6x over 5m (budget exhausted in ~2 days if it continues)
- Warning: 6x burn rate over 6h AND 3x over 30m (budget exhausted in ~5 days if it continues)
- Info: 1x burn rate over 3d (on track to exhaust the budget)
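For intuition on those figures, assuming a 30-day SLO window: a constant 14.4x burn consumes the entire error budget in 30 / 14.4 ≈ 2.1 days, and a 6x burn in 30 / 6 = 5 days, which is where the "~2 days" and "~5 days" estimates come from. Adjust the arithmetic for your own SLO period.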
## Alert Configuration Examples

# Prometheus alerting rules (routed via Alertmanager)
# alerting-rules.yaml
groups:
  - name: orders-api-slo
    rules:
      # SLO Burn Rate - Critical
      - alert: OrdersApiHighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{service="orders-api",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="orders-api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{service="orders-api",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="orders-api"}[5m]))
          ) > (6 * 0.001)
        for: 2m
        labels:
          severity: critical
          service: orders-api
          slo: availability
        annotations:
          summary: "OrdersApi error budget burning rapidly"
          description: |
            Error budget is burning at 14.4x rate.
            If this continues, budget will be exhausted in ~2 days.
            Current error rate: {{ $value | humanizePercentage }}
          runbook_url: "https://wiki.example.com/runbooks/slo-burn-rate"
          dashboard_url: "https://grafana.example.com/d/orders-slo"

      # Latency SLO
      - alert: OrdersApiHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="orders-api"}[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: high
          service: orders-api
          slo: latency
        annotations:
          summary: "OrdersApi P95 latency exceeds 2s"
          description: |
            P95 latency is {{ $value | humanizeDuration }}.
            Threshold: 2 seconds.
            This affects user experience for 5% of requests.
          runbook_url: "https://wiki.example.com/runbooks/high-latency"

  - name: orders-api-capacity
    rules:
      # Capacity warning (lower severity)
      - alert: OrdersApiHighCPU
        expr: |
          avg(rate(container_cpu_usage_seconds_total{container="orders-api"}[5m])) by (pod) > 0.8
        for: 15m
        labels:
          severity: warning
          service: orders-api
          category: capacity
        annotations:
          summary: "OrdersApi CPU usage high"
          description: |
            CPU usage is {{ $value | humanizePercentage }} for 15+ minutes.
            Consider scaling or optimization.
          runbook_url: "https://wiki.example.com/runbooks/capacity-cpu"

      # Disk space warning
      - alert: OrdersApiDiskSpaceLow
        expr: |
          (
            node_filesystem_avail_bytes{mountpoint="/data"}
            /
            node_filesystem_size_bytes{mountpoint="/data"}
          ) < 0.2
        for: 5m
        labels:
          severity: warning
          category: capacity
        annotations:
          summary: "Disk space below 20%"
          description: "{{ $value | humanizePercentage }} disk space remaining"
## Noise Reduction Strategies

1. **Appropriate thresholds**: don't alert on every anomaly.
   - Bad: CPU > 70% for 1 minute. Good: CPU > 90% for 15 minutes.
   - Bad: any 5xx error. Good: error rate > 1% for 5 minutes.
2. **Proper `for` duration**: wait long enough to confirm the issue is real.
   - `for: 0m`: too noisy (fires on transient spikes)
   - `for: 2m`: may still be transient
   - `for: 5m`: likely a real issue
   - `for: 15m`: sustained trend (capacity issues)
3. **Alert aggregation**: group related alerts into one notification. Instead of "Pod A unhealthy", "Pod B unhealthy", and "Pod C unhealthy", send "3 pods unhealthy in orders-api deployment" (see the Alertmanager sketch after this list).
4. **Inhibition rules**: suppress child alerts when a parent alert fires. If "Database down" fires, suppress all "DB connection error" alerts; if "Kubernetes node down" fires, suppress all pod alerts on that node.
5. **Maintenance windows**: silence alerts during planned maintenance. Schedule silences before deployments, database migrations, and infrastructure changes.
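A minimal Alertmanager sketch covering techniques 3 and 4 above: grouping turns the three pod alerts into a single notification, and an inhibition rule suppresses dependent alerts while the parent is firing. The alert names and labels here are illustrative assumptions.

```yaml
# Sketch: aggregation via grouping, plus one inhibition rule.
route:
  receiver: slack-alerts
  group_by: ["alertname", "service"]  # "3 pods unhealthy" arrives as one message
  group_wait: 30s                     # brief wait so related alerts batch together
  group_interval: 5m

inhibit_rules:
  # While the database itself is down, suppress per-service connection errors.
  - source_matchers: ['alertname="DatabaseDown"']
    target_matchers: ['alertname="DBConnectionErrors"']
    equal: ["environment"]
```

Maintenance windows (technique 5) are usually handled as silences rather than config changes, for example via the Alertmanager UI or `amtool silence add`, scheduled before the planned work starts.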
## Alert Template

# Alert Specification: {Alert Name}

## Overview

| Attribute | Value |
|-----------|-------|
| Alert Name | [Name] |
| Service | [Service name] |
| Severity | [P1/P2/P3/P4] |
| Category | [SLO/Capacity/Security/Business] |
| Owner | [Team] |

## Condition

**Expression:**

```promql
[PromQL or query expression]
```

**Threshold:** [Value]
**Duration (for):** [Duration]

## Meaning

**What it means:** [Explain what this alert indicates]

**User impact:** [How users are affected]

**Business impact:** [Revenue, reputation, compliance implications]

## Triage Steps

1. **Verify the alert**
   - Check [dashboard link]
   - Confirm [metric] is actually [condition]
2. **Assess impact**
   - How many users are affected?
   - Which user journeys are impacted?
3. **Identify cause**
   - Check recent deployments
   - Check dependencies
   - Review error logs

## Remediation

**Immediate actions:**

- [First action]
- [Second action]
- [Third action]

**Escalation:**

- If not resolved in [X] minutes, escalate to [team/person]

## Runbook

Link: [Runbook URL]

## Alert Configuration

- alert: {AlertName}
  expr: |
    [expression]
  for: [duration]
  labels:
    severity: [severity]
    service: [service]
  annotations:
    summary: "[Summary]"
    description: "[Description]"
    runbook_url: "[URL]"

## Review History

| Date | Change | Reason |
|---|---|---|
| [Date] | [Change] | [Why] |
## Workflow

When designing alerts:

1. **Start with SLOs**: alert on SLO burn rate first
2. **Focus on symptoms**: alert on user impact, not internal metrics
3. **Set appropriate severity**: not everything is P1
4. **Include context**: dashboards, runbooks, recent changes
5. **Define escalation**: who gets notified, and when to escalate
6. **Reduce noise**: proper thresholds, aggregation, inhibition
7. **Review regularly**: tune based on false positives and false negatives
## References
For detailed guidance:
Last Updated: 2025-12-26