Prometheus Analysis
This skill should be used when the user asks to "query Prometheus", "analyze Prometheus metrics", "check Prometheus alerts", "write PromQL", "interpret Prometheus data", "fetch metrics", or mentions Prometheus querying, alerting, or metrics analysis. Provides guidance for querying and interpreting Prometheus metrics for root cause analysis.
Install
$ git clone https://github.com/evangelosmeklis/thufir /tmp/thufir && cp -r /tmp/thufir/skills/prometheus-analysis ~/.claude/skills/thufir/
Tip: Run this command in your terminal to install the skill.
name: Prometheus Analysis
description: This skill should be used when the user asks to "query Prometheus", "analyze Prometheus metrics", "check Prometheus alerts", "write PromQL", "interpret Prometheus data", "fetch metrics", or mentions Prometheus querying, alerting, or metrics analysis. Provides guidance for querying and interpreting Prometheus metrics for root cause analysis.
version: 0.1.0
Prometheus Analysis
Overview
Prometheus is a time-series metrics collection and alerting system widely used for monitoring production systems. This skill provides guidance for querying Prometheus metrics, interpreting alert data, and using metrics for root cause analysis.
When to Use This Skill
Apply this skill when:
- Analyzing Prometheus alerts that have fired
- Querying metrics to understand system behavior
- Investigating metric anomalies or spikes
- Correlating metrics with incidents
- Writing PromQL queries for specific metrics
- Interpreting time-series data patterns
Prometheus Fundamentals
Metric Types
Counter: Cumulative value that only increases (e.g., total requests, error count)
- Use rate() or increase() to get per-second rate or total increase
- Example: http_requests_total
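To see why raw counter values are rarely useful on their own, the short sketch below mimics roughly what rate() computes from two samples of a counter, including the counter-reset case. The sample numbers are invented for illustration.

```python
# Rough illustration of what rate() computes for a counter.
# The samples are invented; Prometheus actually uses every sample in the
# range window and extrapolates to the window boundaries.

def per_second_rate(first: float, last: float, elapsed_seconds: float) -> float:
    """Approximate per-second rate between two counter samples."""
    increase = last - first
    if increase < 0:
        # Counters only go up, so a drop means the process restarted and
        # the counter reset to zero; count only the post-reset portion.
        increase = last
    return increase / elapsed_seconds

# http_requests_total sampled 5 minutes (300s) apart: 10000 -> 10900
print(per_second_rate(10000, 10900, 300))  # -> 3.0 requests per second
```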
Gauge: Value that can go up or down (e.g., CPU usage, memory usage, queue depth)
- Query directly or use functions like avg_over_time()
- Example: node_memory_usage_bytes
Histogram: Distribution of values in buckets (e.g., request durations)
- Provides _sum, _count, and _bucket metrics
- Use for percentile calculations
- Example: http_request_duration_seconds
Summary: Similar to histogram but with pre-calculated quantiles
- Example: http_request_duration_seconds{quantile="0.95"}
Time Series Format
Metrics have the format: metric_name{label1="value1", label2="value2"}
Example: http_requests_total{method="POST", status="500", service="api"}
Querying Prometheus
Basic Queries
Instant query (current value):
http_requests_total
Range query (over time):
http_requests_total[5m]
Filter by labels:
http_requests_total{status="500", service="api"}
Rate of increase (per-second rate):
rate(http_requests_total[5m])
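The queries above can also be run programmatically through the Prometheus HTTP API. Below is a minimal sketch assuming a Prometheus server at http://localhost:9090 (adjust for your environment) and the third-party requests package; the query string is one of the examples from this section.

```python
# Minimal sketch: run an instant PromQL query via the HTTP API and print
# each returned series with its current value.
import requests

PROM_URL = "http://localhost:9090"  # assumed address; change as needed

def instant_query(promql: str):
    """Return (labels, value) pairs from GET /api/v1/query."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    # Each vector sample looks like {"metric": {...}, "value": [ts, "val"]}
    return [(s["metric"], float(s["value"][1])) for s in body["data"]["result"]]

for labels, value in instant_query('rate(http_requests_total{status="500", service="api"}[5m])'):
    print(labels, value)
```

Range queries work the same way against /api/v1/query_range with additional start, end, and step parameters (see the correlation sketch later in this document).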
Common Aggregation Functions
Sum across dimensions:
sum(rate(http_requests_total[5m])) by (status)
Average:
avg(node_memory_usage_bytes) by (instance)
Max/Min:
max(http_request_duration_seconds) by (endpoint)
Count:
count(up == 0) # Count instances that are down
Useful RCA Queries
Error rate percentage:
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
95th percentile latency:
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
Request rate by endpoint:
sum(rate(http_requests_total[5m])) by (endpoint)
Memory usage percentage:
(node_memory_usage_bytes / node_memory_total_bytes) * 100
Database connection pool usage:
sum(db_connection_pool_active) / sum(db_connection_pool_max) * 100
Working with Alerts
Alert Structure
Prometheus alerts contain:
- Alert name: Identifier for the alert rule
- Labels: Dimensions and metadata (service, severity, etc.)
- Annotations: Human-readable descriptions
- State: inactive, pending, or firing
- Active since: When alert started firing
- Value: Current metric value that triggered alert
Fetching Alert Details
Use Prometheus API to fetch alerts:
List active alerts:
GET /api/v1/alerts
Query alert rule:
GET /api/v1/rules
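A minimal sketch of fetching active alerts over that API, assuming Prometheus at http://localhost:9090 and the requests package:

```python
# List currently firing alerts from GET /api/v1/alerts.
import requests

PROM_URL = "http://localhost:9090"  # assumed address; change as needed

resp = requests.get(f"{PROM_URL}/api/v1/alerts")
resp.raise_for_status()

for alert in resp.json()["data"]["alerts"]:
    if alert["state"] != "firing":
        continue  # skip pending alerts
    print(
        alert["labels"].get("alertname"),
        alert["labels"].get("severity", "unknown"),
        "active since", alert.get("activeAt"),
        "value", alert.get("value"),
        "-", alert["annotations"].get("summary", ""),
    )
```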
Analyzing Alert Context
When investigating an alert:
- Check alert expression: What PromQL query triggered the alert?
- Review threshold: What value caused the alert to fire?
- Check alert duration: How long has condition been true?
- Review labels: What services/instances are affected?
- Query related metrics: Get broader context around the alert
Example Alert Investigation
Alert: HighErrorRate
Expression: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
Investigation queries:
Query error rate breakdown:
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
Query total request rate:
rate(http_requests_total[5m])
Query error rate percentage:
(sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))) * 100
Check for correlated latency increase:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
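The same investigation can be scripted as a batch of queries. The sketch below reuses the hypothetical metric names from this example and assumes Prometheus at http://localhost:9090 with the requests package installed.

```python
# Run the HighErrorRate investigation queries in sequence and print the
# raw results so the breakdown by endpoint is visible at a glance.
import requests

PROM_URL = "http://localhost:9090"  # assumed address; change as needed

def instant_query(promql: str):
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

queries = {
    "error rate by endpoint": 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)',
    "total request rate": 'sum(rate(http_requests_total[5m]))',
    "error rate %": '(sum(rate(http_requests_total{status=~"5.."}[5m])) / '
                    'sum(rate(http_requests_total[5m]))) * 100',
    "p95 latency (s)": 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))',
}

for name, promql in queries.items():
    print(f"== {name} ==")
    for sample in instant_query(promql):
        print(" ", sample["metric"], sample["value"][1])
```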
Metrics Patterns for RCA
Pattern 1: Sudden Spike
Signature: Metric jumps sharply at specific time
Possible causes:
- Code deployment
- Configuration change
- Traffic surge
- Dependency failure
Investigation:
- Correlate spike time with deployments
- Check for sudden traffic increase
- Query dependency health metrics
- Review recent configuration changes
Pattern 2: Gradual Increase
Signature: Metric grows steadily over hours/days
Possible causes:
- Memory leak
- Resource exhaustion
- Unbounded data growth
- Missing cleanup job
Investigation:
- Check memory/disk usage trends
- Review data volume growth
- Query for resource leaks
- Check scheduled job execution
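For the resource-exhaustion case, predict_linear() turns a gradual trend into a forecast. The sketch below extrapolates free disk space 24 hours ahead using node_exporter's node_filesystem_avail_bytes metric; the server address and label filter are assumptions, so adjust them for your setup.

```python
# Forecast disk exhaustion: extrapolate the last 6h of free space 24h
# ahead and report filesystems predicted to reach zero.
import requests

PROM_URL = "http://localhost:9090"  # assumed address; change as needed

promql = "predict_linear(node_filesystem_avail_bytes{fstype!~'tmpfs'}[6h], 24 * 3600) < 0"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    labels = sample["metric"]
    print("likely to fill within 24h:", labels.get("instance"), labels.get("mountpoint"))
```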
Pattern 3: Periodic Pattern
Signature: Metric spikes at regular intervals
Possible causes:
- Scheduled job or cron
- Batch processing
- Cache expiration
- Garbage collection
Investigation:
- Identify period (hourly, daily, etc.)
- Check for scheduled tasks at that interval
- Review batch job schedules
- Query job execution metrics
Pattern 4: Drop to Zero
Signature: Metric suddenly drops to zero or very low value
Possible causes:
- Service crash
- Instance termination
- Network partition
- Monitoring failure
Investigation:
- Check service health (up metric; see the sketch after this list)
- Review instance count
- Query service availability
- Check for infrastructure changes
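To check the up metric mentioned in the first investigation step, a sketch like the following lists down targets and counts them per job (assuming Prometheus at http://localhost:9090 and the requests package):

```python
# Identify scrape targets that are currently down using the up metric.
import requests

PROM_URL = "http://localhost:9090"  # assumed address; change as needed

def instant_query(promql: str):
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Individual instances failing their scrapes
for sample in instant_query("up == 0"):
    print("DOWN:", sample["metric"].get("job"), sample["metric"].get("instance"))

# Down instances per job; an empty result means everything is up
for sample in instant_query("count(up == 0) by (job)"):
    print(sample["metric"].get("job"), "down instances:", sample["value"][1])
```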
Pattern 5: High Variability
Signature: Metric fluctuates wildly
Possible causes:
- Intermittent errors
- Race condition
- Resource contention
- Unhealthy load balancing
Investigation:
- Check error logs for patterns
- Review load distribution across instances
- Query resource utilization
- Check for concurrency issues
Time Range Selection
Choose appropriate time ranges for investigation:
Incident detection (5-15 minutes):
rate(metric[5m])
Trend analysis (1-6 hours):
rate(metric[1h])
Long-term patterns (1-7 days):
avg_over_time(metric[1d])
Comparison with past:
# Current value
metric
# Value 1 week ago
metric offset 1w
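The offset comparison can also be collapsed into a single query. The sketch below computes the week-over-week change in request rate entirely in PromQL, using the hypothetical http_requests_total metric and assuming Prometheus at http://localhost:9090 with the requests package.

```python
# Week-over-week change in request rate, computed inside PromQL via offset.
import requests

PROM_URL = "http://localhost:9090"  # assumed address; change as needed

promql = (
    "(sum(rate(http_requests_total[5m]))"
    " - sum(rate(http_requests_total[5m] offset 1w)))"
    " / sum(rate(http_requests_total[5m] offset 1w)) * 100"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(f"week-over-week change: {float(sample['value'][1]):+.1f}%")
```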
Correlating Metrics
Multiple Metric Analysis
Query related metrics together to understand full context:
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# Latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Request rate
rate(http_requests_total[5m])
# CPU usage
rate(process_cpu_seconds_total[5m])
# Memory usage
process_resident_memory_bytes
Identifying Correlations
Look for metrics that change together:
- Error rate ↑ + Latency ↑ → Performance degradation
- Error rate ↑ + CPU ↑ → Resource exhaustion
- Error rate ↑ + Request rate ↑ → Traffic surge
- Error rate ↑ + Dependency metric ↓ → Dependency failure
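One way to test a suspected correlation is to fetch both series over the same window with a range query and compute a correlation coefficient. The sketch below does this for error rate and p95 latency; it assumes Prometheus at http://localhost:9090, the requests package, and the hypothetical metric names used throughout this skill. Correlation does not prove causation, but a strong coefficient helps prioritize hypotheses.

```python
# Fetch two metrics over the same time range and compute a Pearson
# correlation to see whether they move together.
import time
import requests

PROM_URL = "http://localhost:9090"  # assumed address; change as needed

def range_query(promql: str, start: float, end: float, step: str = "60s"):
    """Return the values of the first series from GET /api/v1/query_range."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": promql, "start": start, "end": end, "step": step},
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [float(v) for _, v in result[0]["values"]] if result else []

def pearson(xs, ys):
    n = min(len(xs), len(ys))
    if n == 0:
        return 0.0
    xs, ys = xs[:n], ys[:n]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

end = time.time()
start = end - 3600  # last hour
errors = range_query('sum(rate(http_requests_total{status=~"5.."}[5m]))', start, end)
latency = range_query(
    "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
    start, end,
)
print("error-rate vs p95-latency correlation:", round(pearson(errors, latency), 2))
```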
Best Practices
Query Writing
- Use appropriate time ranges (5m for recent, 1h for trends)
- Filter by relevant labels to reduce cardinality
- Use rate() for counters, not raw values
- Aggregate when dealing with multiple instances
- Use recording rules for expensive queries
Alert Investigation
- Start with alert expression to understand trigger
- Query wider time range to see pattern before/after
- Break down aggregated metrics to find specific instances/endpoints
- Check for correlated metric changes
- Compare current values with historical baseline
Metric Interpretation
- Consider metric type (counter, gauge, histogram)
- Look for patterns over time, not just current values
- Compare across instances to find outliers
- Correlate with other metrics for complete picture
- Validate hypotheses with multiple metrics
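For comparing across instances, topk() is a quick way to surface outliers. The sketch below lists the three instances with the highest 5xx rate and flags any that sit well above the fleet average; metric names and the server address follow the assumptions used in earlier sketches.

```python
# Surface outlier instances: highest error rates vs. the fleet average.
import requests

PROM_URL = "http://localhost:9090"  # assumed address; change as needed

def instant_query(promql: str):
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

per_instance = 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (instance)'
worst = f"topk(3, {per_instance})"
fleet_avg = f"avg({per_instance})"

avg_result = instant_query(fleet_avg)
avg_value = float(avg_result[0]["value"][1]) if avg_result else 0.0

for sample in instant_query(worst):
    value = float(sample["value"][1])
    flag = "outlier?" if avg_value and value > 2 * avg_value else ""
    print(sample["metric"].get("instance"), round(value, 3), flag)
```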
Integration with Thufir
This skill works with:
- Prometheus MCP server: Fetches alerts and queries metrics via API
- root-cause-analysis skill: Metrics provide evidence for RCA
- RCA agent: Agent queries Prometheus to gather metric data
Using Prometheus MCP
The Thufir plugin includes Prometheus MCP server for querying:
Query instant value:
Use MCP tool: prometheus_query
Query: rate(http_requests_total[5m])
Query time range:
Use MCP tool: prometheus_query_range
Query: rate(http_requests_total[5m])
Start: 2025-12-19T14:00:00Z
End: 2025-12-19T15:00:00Z
Step: 15s
Fetch active alerts:
Use MCP tool: prometheus_alerts
Additional Resources
Reference Files
For detailed PromQL patterns and advanced queries:
- references/promql-cookbook.md - Common PromQL queries for RCA scenarios
Quick Reference
Error rate: rate(http_requests_total{status=~"5.."}[5m])
Latency p95: histogram_quantile(0.95, rate(duration_bucket[5m]))
CPU usage: rate(process_cpu_seconds_total[5m])
Memory: process_resident_memory_bytes
Request rate: rate(http_requests_total[5m])
Time ranges: 5m (instant), 1h (trend), 1d (baseline)
Aggregations: sum, avg, max, min, count
Filters: {label="value"}, {label=~"regex"}
Use Prometheus metrics to provide objective, time-series evidence for root cause analysis. Correlate metrics with code changes and system events to identify precise incident causes.