Monitoring
153 skills in DevOps > Monitoring
observability
Telemetry, metrics, tracing, and observability for Elixir/BEAM applications
grafana-http-api
Comprehensive skill for interacting with Grafana's HTTP API to manage dashboards, data sources, folders, alerting, annotations, users, teams, and organizations. Use when Claude needs to (1) Create, read, update, or delete Grafana dashboards, (2) Manage data sources and connections, (3) Configure alerting rules, contact points, and notification policies, (4) Work with folders and permissions, (5) Manage users, teams, and service accounts, (6) Create or query annotations, (7) Execute queries against data sources, or any other Grafana automation task via API.
py-observability
Observability patterns for Python backends. Use when adding logging, metrics, tracing, or debugging production issues.
observability-skill
Manage full-stack observability using Logfire (logging/tracing) and OpenObserve (storage/visualization).
observabilidade-spring
Evaluate and improve the observability of the Spring service (Micrometer, Prometheus, tracing), ensuring that metrics, structured logs, and dashboards are aligned with the ecosystem.
observability-patterns
Comprehensive observability setup patterns for Google ADK agents including logging configuration, Cloud Trace integration, BigQuery Agent Analytics, and third-party observability tools (AgentOps, Phoenix, Weave). Use when implementing monitoring, debugging agent behavior, analyzing agent performance, setting up tracing, or when user mentions observability, logging, tracing, BigQuery analytics, AgentOps, Phoenix, Arize, or Weave.
observability-monitor
Comprehensive observability and monitoring workflow that orchestrates metrics collection, logging, distributed tracing, and alerting systems. Handles everything from monitoring architecture design and implementation to APM integration, anomaly detection, and incident response automation.
slo-implementation
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
clojure-telemere
Structured telemetry library for Clojure/Script. Use when working with logging, tracing, structured logging, events, signal handling, observability, or replacing Timbre/tools.logging.
auto-rollback-triggers
Error rate monitoring, SLO detection, and notification webhooks for automated rollback triggers. Use when setting up automated deployment rollback, monitoring error rates, configuring SLO thresholds, implementing deployment safety nets, setting up alerting webhooks, or when user mentions automated rollback, error rate monitoring, SLO violations, deployment safety, or rollback automation.
monitoring
Master Kubernetes observability, monitoring with Prometheus, logging, metrics, and distributed tracing. Learn to implement comprehensive monitoring strategies.
create-observability-config
Set up observability platform configuration (Datadog, Prometheus, Splunk) with REQ-* dashboards and alerts. Creates monitors for each requirement with SLA tracking. Use when deploying to production or setting up monitoring.
defensive-bash
Production-grade defensive Bash scripting for server automation, monitoring, and DevOps tasks. Emphasizes safety, error handling, idempotency, and logging.
otel-monitoring-setup
Use PROACTIVELY when setting up OpenTelemetry monitoring for Claude Code usage tracking, cost analysis, or productivity metrics. Provides local PoC mode (full Docker stack with Grafana) and enterprise mode (centralized infrastructure). Configures telemetry collection, imports dashboards, and verifies data flow. Not for non-Claude telemetry or custom metric definitions.
observability-monitoring
Structured logging, metrics, distributed tracing, and alerting strategies
rust-tracing
Instrument code with tracing spans and structured logging. Use for observability and performance analysis.
langfuse-observability
LLM observability with self-hosted Langfuse 3.x - tracing, evaluation, monitoring, prompt management, and cost tracking
site-reliability-engineer
Production monitoring, observability, SLO/SLI management, and incident response. Trigger terms: monitoring, observability, SRE, site reliability, alerting, incident response, SLO, SLI, error budget, Prometheus, Grafana, Datadog, New Relic, ELK stack, logs, metrics, traces, on-call, production monitoring, health checks, uptime, availability, dashboards, post-mortem, incident management, runbook. Completes SDD Stage 8 (Monitoring) with comprehensive production observability: SLI/SLO definitions and tracking; monitoring stack setup (Prometheus, Grafana, ELK, Datadog, etc.); alert rules and notification channels; incident response runbooks; observability dashboards (logs, metrics, traces); post-mortem templates and analysis; health check endpoints; error budget tracking. Use when: user needs production monitoring, observability platform, alerting, SLOs, incident response, or post-deployment health tracking.
observability-patterns
Observability patterns for metrics, logs, and traces. Use when implementing monitoring, setting up Prometheus/Grafana, configuring logging pipelines, implementing distributed tracing, or designing alerting systems.
py-server-logs
View Flask server logs from a local or remote server. Shows real-time or recent log entries for debugging. Use when monitoring server activity, debugging issues, or checking server status.