grafana
Observability visualization with Grafana and the LGTM stack (Loki, Grafana, Tempo, Mimir). Use when implementing dashboards, log aggregation, distributed tracing, or metrics visualization. Triggers: grafana, loki, tempo, mimir, dashboard, logql, traceql, observability stack, LGTM.
$ Install
git clone https://github.com/cosmix/claude-code-setup /tmp/claude-code-setup && cp -r /tmp/claude-code-setup/skills/grafana ~/.claude/skills/
Tip: run this command in your terminal to install the skill.
name: grafana
description: Observability visualization with Grafana and the LGTM stack (Loki, Grafana, Tempo, Mimir). Use when implementing dashboards, log aggregation, distributed tracing, or metrics visualization. Triggers: grafana, loki, tempo, mimir, dashboard, logql, traceql, observability stack, LGTM.
allowed-tools: Read, Grep, Glob, Edit, Write, Bash
Grafana and LGTM Stack Skill
Overview
The LGTM stack provides a complete observability solution:
- Loki: Log aggregation and querying (LogQL)
- Grafana: Visualization and dashboarding
- Tempo: Distributed tracing (TraceQL)
- Mimir: Long-term metrics storage (Prometheus-compatible)
This skill covers setup, configuration, dashboard creation, querying, and alerting for production observability.
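For local experimentation, the whole stack can run as single-node containers. Below is a minimal Docker Compose sketch; the image tags, mounted config filenames, and the Mimir port are assumptions chosen to match the endpoints used later in this document.
# docker-compose.yaml (sketch, not a production deployment)
services:
  loki:
    image: grafana/loki:2.9.0
    command: ["-config.file=/etc/loki/loki.yaml"]
    volumes:
      - ./loki.yaml:/etc/loki/loki.yaml
    ports:
      - "3100:3100"   # HTTP API used by Grafana and log shippers
  tempo:
    image: grafana/tempo:2.3.0
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo/tempo.yaml
    ports:
      - "3200:3200"   # Tempo query API
      - "4317:4317"   # OTLP gRPC ingest
  mimir:
    image: grafana/mimir:2.10.0
    command: ["-config.file=/etc/mimir/mimir.yaml"]
    volumes:
      - ./mimir.yaml:/etc/mimir/mimir.yaml
    ports:
      - "9009:9009"   # Prometheus-compatible API / remote_write (assumed port)
  grafana:
    image: grafana/grafana:10.2.0
    volumes:
      - ./provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"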
LGTM Stack Components
Loki - Log Aggregation
Architecture - Loki
Horizontally scalable log aggregation inspired by Prometheus
- Indexes only metadata (labels), not log content
- Cost-effective storage with object stores (S3, GCS, etc.)
- LogQL query language similar to PromQL
Key Concepts - Loki
- Labels for indexing (low cardinality); see the Promtail sketch after this list
- Log streams identified by unique label sets
- Parsers: logfmt, JSON, regex, pattern
- Line filters and label filters
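A minimal Promtail scrape config illustrates these concepts in practice (paths and label values are assumptions): the static labels define the indexed stream, while log content stays unindexed.
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: api
    static_configs:
      - targets: [localhost]
        labels:
          app: api                        # low cardinality: safe to index
          env: production                 # low cardinality: safe to index
          __path__: /var/log/api/*.log    # file glob to tail; not an index label
# Avoid labels such as user_id, request_id, or trace_id: each unique value
# creates a new stream and explodes cardinality.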
Grafana - Visualization
Features
- Multi-datasource dashboarding
- Templating and variables
- Alerting (unified alerting)
- Dashboard provisioning
- Role-based access control (RBAC)
- Explore mode for ad-hoc queries
Tempo - Distributed Tracing
Architecture - Tempo
Scalable distributed tracing backend
- Cost-effective trace storage
- TraceQL for trace querying
- Integration with logs and metrics (trace-to-logs, trace-to-metrics)
- OpenTelemetry compatible (see the collector sketch after this list)
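As a sketch of the OpenTelemetry integration, an OTel Collector pipeline can forward traces to Tempo's OTLP receiver (the endpoint and TLS settings below are assumptions).
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  otlp:
    endpoint: tempo:4317   # Tempo OTLP gRPC receiver
    tls:
      insecure: true       # assume plaintext inside the cluster
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]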
Mimir - Metrics Storage
Architecture - Mimir
Horizontally scalable long-term Prometheus storage
- Multi-tenancy support
- Query federation
- High availability
- Prometheus remote_write compatible (example after this list)
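Because Mimir accepts Prometheus remote_write, pointing an existing Prometheus at it is usually a one-block change. A hedged example (the URL matches the metrics_generator target used later in this document; the tenant header only applies when multi-tenancy is enabled):
remote_write:
  - url: http://mimir:9009/api/v1/push
    headers:
      X-Scope-OrgID: tenant-1        # only needed with multi-tenancy enabled
    queue_config:
      max_samples_per_send: 2000     # throughput tuning; value is an assumption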
Setup and Configuration
Loki Configuration
Production Loki Config
File: loki.yaml
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
log_level: info
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: index_
period: 24h
storage_config:
aws:
s3: s3://us-east-1/my-loki-bucket
s3forcepathstyle: true
tsdb_shipper:
active_index_directory: /loki/tsdb-index
cache_location: /loki/tsdb-cache
shared_store: s3
limits_config:
retention_period: 744h # 31 days
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
max_query_series: 500
max_query_lookback: 30d
max_streams_per_user: 0
max_global_streams_per_user: 5000
reject_old_samples: true
reject_old_samples_max_age: 168h
compactor:
working_directory: /loki/compactor
shared_store: s3
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
querier:
max_concurrent: 4
query_range:
align_queries_with_step: true
cache_results: true
results_cache:
cache:
embedded_cache:
enabled: true
max_size_mb: 100
Multi-tenant Loki Config
With authentication
auth_enabled: true
server:
http_listen_port: 3100
common:
storage:
s3:
endpoint: s3.amazonaws.com
bucketnames: loki-chunks
region: us-east-1
access_key_id: ${AWS_ACCESS_KEY_ID}
secret_access_key: ${AWS_SECRET_ACCESS_KEY}
distributor:
ring:
kvstore:
store: memberlist
ingester:
lifecycler:
ring:
kvstore:
store: memberlist
replication_factor: 3
chunk_idle_period: 30m
chunk_block_size: 262144
chunk_retain_period: 1m
max_transfer_retries: 0
memberlist:
join_members:
- loki-gossip-ring.loki.svc.cluster.local:7946
Tempo Configuration
Production Tempo Config
File: tempo.yaml
server:
http_listen_port: 3200
grpc_listen_port: 9096
distributor:
receivers:
otlp:
protocols:
http:
grpc:
jaeger:
protocols:
thrift_http:
grpc:
ingester:
max_block_duration: 5m
compactor:
compaction:
block_retention: 720h # 30 days
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
region: us-east-1
pool:
max_workers: 100
queue_depth: 10000
wal:
path: /var/tempo/wal
query_frontend:
search:
duration_slo: 5s
throughput_bytes_slo: 1.073741824e+09
trace_by_id:
duration_slo: 5s
metrics_generator:
registry:
external_labels:
source: tempo
cluster: primary
storage:
path: /var/tempo/generator/wal
remote_write:
- url: http://mimir:9009/api/v1/push
send_exemplars: true
overrides:
defaults:
metrics_generator:
processors: [service-graphs, span-metrics]
Grafana Data Source Configuration
Loki Data Source
File: loki-datasource.yaml
apiVersion: 1
datasources:
- name: Loki
type: loki
access: proxy
url: http://loki:3100
jsonData:
maxLines: 1000
derivedFields:
- datasourceUid: tempo_uid
matcherRegex: "trace_id=(\\w+)"
name: TraceID
url: "$${__value.raw}"
editable: false
Tempo Data Source
File: tempo-datasource.yaml
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
uid: tempo_uid
jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: loki_uid
tags: ["job", "instance", "pod", "namespace"]
mappedTags: [{ key: "service.name", value: "service" }]
mapTagNamesEnabled: false
spanStartTimeShift: "1h"
spanEndTimeShift: "1h"
filterByTraceID: false
filterBySpanID: false
tracesToMetrics:
datasourceUid: prometheus_uid
tags: [{ key: "service.name", value: "service" }]
queries:
- name: "Sample query"
query: "sum(rate(tempo_spanmetrics_latency_bucket{$__tags}[5m]))"
serviceMap:
datasourceUid: prometheus_uid
search:
hide: false
nodeGraph:
enabled: true
lokiSearch:
datasourceUid: loki_uid
editable: false
Mimir/Prometheus Data Source
File: mimir-datasource.yaml
apiVersion: 1
datasources:
- name: Mimir
type: prometheus
access: proxy
url: http://mimir:8080/prometheus
uid: prometheus_uid
jsonData:
httpMethod: POST
exemplarTraceIdDestinations:
- datasourceUid: tempo_uid
name: trace_id
prometheusType: Mimir
prometheusVersion: 2.40.0
cacheLevel: "High"
incrementalQuerying: true
incrementalQueryOverlapWindow: 10m
editable: false
LogQL Query Patterns
Basic Log Queries
Stream Selection
# Simple label matching
{namespace="production", app="api"}
# Regex matching
{app=~"api|web|worker"}
# Not equal
{env!="staging"}
# Multiple conditions
{namespace="production", app="api", level!="debug"}
Line Filters
# Contains
{app="api"} |= "error"
# Does not contain
{app="api"} != "debug"
# Regex match
{app="api"} |~ "error|exception|fatal"
# Case insensitive
{app="api"} |~ "(?i)error"
# Chaining filters
{app="api"} |= "error" != "timeout"
Parsing and Extraction
JSON Parsing
# Parse JSON logs
{app="api"} | json
# Extract specific fields
{app="api"} | json message="msg", level="severity"
# Filter on extracted field
{app="api"} | json | level="error"
# Nested JSON (the json parser flattens nested keys with underscores)
{app="api"} | json | line_format "{{.response_status}}"
Logfmt Parsing
# Parse logfmt (key=value)
{app="api"} | logfmt
# Extract specific fields
{app="api"} | logfmt level, caller, msg
# Filter parsed fields
{app="api"} | logfmt | level="error"
Pattern Parsing
# Extract with pattern
{app="nginx"} | pattern `<ip> - - <_> "<method> <uri> <_>" <status> <_>`
# Filter on extracted values
{app="nginx"} | pattern `<_> <status> <_>` | status >= 400
# Complex pattern
{app="api"} | pattern `level=<level> msg="<msg>" duration=<duration>ms`
Regex Parsing
# Named capture groups
{app="api"} | regexp `(?P<method>[A-Z]+) (?P<uri>/[^ ]*) (?P<duration>[0-9.]+)ms`
# Multiple regexes
{app="api"}
| regexp `level=(?P<level>[a-z]+)`
| regexp `duration=(?P<duration>[0-9.]+)ms`
Label Formatting and Manipulation
Label Format
# Add labels from parsed fields
{app="api"} | json | label_format level=severity
# Rename labels
{app="api"} | label_format environment=env
# Template labels
{app="api"} | json | label_format full_path=`{{.namespace}}/{{.pod}}`
# Convert to lowercase
{app="api"} | label_format app=`{{ ToLower .app }}`
Line Format
# Custom output format
{app="api"} | json | line_format "{{.timestamp}} [{{.level}}] {{.message}}"
# Only specific fields
{app="api"} | logfmt | line_format "{{.msg}}"
# Conditional formatting
{app="api"} | json
| line_format `{{if eq .level "error"}}ERROR: {{end}}{{.msg}}`
Aggregations and Metrics
Count Queries
# Count log lines over time
count_over_time({app="api"}[5m])
# Rate of logs
rate({app="api"}[5m])
# Errors per second
sum(rate({app="api"} |= "error" [5m])) by (namespace)
# Error ratio
sum(rate({app="api"} |= "error" [5m]))
/
sum(rate({app="api"}[5m]))
Extracted Metrics
# Parse and aggregate numeric values
sum(rate({app="api"} | json | __error__="" [5m])) by (status_code)
# Average duration
avg_over_time({app="api"}
| logfmt
| unwrap duration [5m]) by (endpoint)
# P95 latency
quantile_over_time(0.95, {app="api"}
| regexp `duration=(?P<duration>[0-9.]+)ms`
| unwrap duration [5m]) by (method)
# Bytes processed
sum(bytes_over_time({app="api"}[5m]))
Advanced Aggregations
# Top 10 error messages
topk(10,
sum by (msg) (
count_over_time({app="api"}
| json
| level="error" [1h]
)
)
)
# Request rate by endpoint
sum by (endpoint) (
rate({app="api"}
| json
| __error__="" [5m]
)
)
# 4xx/5xx rate
sum(
rate({app="api"}
| pattern `<_> <status> <_>`
| status >= 400 [5m]
)
) by (status)
Production Query Examples
Error Investigation
# Find errors with context
{namespace="production", app="api"}
| json
| level="error"
| line_format "{{.timestamp}} [{{.trace_id}}] {{.message}}"
# Errors with stack traces
{app="api"}
|= "error"
|= "stack"
| json
| line_format "{{.message}}\n{{.stack}}"
# Group errors by type
sum by (error_type) (
count_over_time({app="api"}
| json
| level="error" [1h]
)
)
Performance Analysis
# Slow requests (>1s)
{app="api"}
| logfmt
| duration > 1000
| line_format "{{.method}} {{.uri}} took {{.duration}}ms"
# P99 latency by endpoint
quantile_over_time(0.99, {app="api"}
| json
| unwrap response_time [5m]) by (endpoint)
# Request throughput
sum(rate({app="api"} | json | __error__="" [1m])) by (method, endpoint)
TraceQL Query Patterns
Basic Trace Queries
Span Search
# Find traces by service
{ .service.name = "api" }
# Find by span name
{ name = "GET /users" }
# HTTP status codes
{ .http.status_code = 500 }
# Combine conditions
{ .service.name = "api" && .http.status_code >= 400 }
# Duration filter (accepts duration units such as ms and s)
{ duration > 1s }
Resource and Span Attributes
# Resource attributes (service-level)
{ resource.service.name = "checkout" }
{ resource.deployment.environment = "production" }
# Span attributes
{ span.http.method = "POST" }
{ span.db.system = "postgresql" }
# Intrinsic fields
{ name = "database-query" && duration > 100ms }
{ kind = "server" }
{ status = "error" }
Advanced TraceQL
Aggregations
# Count spans
{ .service.name = "api" } | count() > 10
# Min/Max/Avg duration
{ .service.name = "api" } | max(duration) > 5s
# Structural operators
{ .service.name = "api" }
>> { .service.name = "database" && duration > 100ms }
Span Sets and Relationships
# Direct parent-child relationship
{ .service.name = "frontend" }
> { .service.name = "backend" && .http.status_code = 500 }
# Spans from both services anywhere in the same trace
{ .service.name = "api" }
&& { .service.name = "cache" }
# Descendant spans
{ .service.name = "api" }
>> { .db.system = "postgresql" && duration > 1s }
Production TraceQL Examples
# Failed database queries
{ .service.name = "api" }
>> { .db.system = "postgresql" && status = "error" }
# Slow checkout flows
{ name = "checkout" && duration > 5s }
# Cache misses leading to slow DB calls
{ .cache.hit = false }
>> { .db.system = "postgresql" && duration > 500ms }
# Error traces with specific tag
{ status = "error" && .tenant.id = "acme-corp" }
# High cardinality debugging
{ .service.name = "api" && .user.id = "user123" }
Dashboard Design Best Practices
Dashboard Organization
- Hierarchy: Overview -> Service -> Component -> Deep Dive
- Golden Signals: Latency, Traffic, Errors, Saturation (example queries after this list)
- Variable-driven: Use templates for flexibility
- Consistent Layouts: Grid alignment, logical flow
- Performance: Limit queries, use query caching
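Hedged example queries for the four golden signals, reusing the metric names assumed elsewhere in this document (http_requests_total, http_request_duration_seconds_bucket) and cAdvisor CPU metrics for saturation:
# Latency: P95 request duration
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Traffic: requests per second
sum(rate(http_requests_total[5m]))
# Errors: ratio of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Saturation: CPU usage per pod (assumes cAdvisor metrics are scraped)
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)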
Panel Best Practices
- Titles: Clear, descriptive (include units)
- Legends: Show only when needed; use the {{label}} format
- Axes: Label with units, appropriate ranges
- Thresholds: Use for visual cues (green/yellow/red)
- Links: Deep-link to related dashboards/logs/traces
Variable Patterns
Common Variables
{
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus"
},
{
"name": "namespace",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info, namespace)",
"multi": true,
"includeAll": true
},
{
"name": "pod",
"type": "query",
"datasource": "${datasource}",
"query": "label_values(kube_pod_info{namespace=~\"$namespace\"}, pod)",
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"auto": true,
"auto_count": 30,
"auto_min": "10s",
"options": ["1m", "5m", "15m", "30m", "1h", "6h", "12h", "1d"]
}
]
}
}
Complete Dashboard Examples
Application Observability Dashboard
{
"dashboard": {
"title": "Application Observability - ${app}",
"tags": ["observability", "application"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-1h",
"to": "now"
},
"templating": {
"list": [
{
"name": "app",
"type": "query",
"datasource": "Mimir",
"query": "label_values(up, app)",
"current": {
"selected": false,
"text": "api",
"value": "api"
}
},
{
"name": "namespace",
"type": "query",
"datasource": "Mimir",
"query": "label_values(up{app=\"$app\"}, namespace)",
"multi": true,
"includeAll": true
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (method, status)",
"legendFormat": "{{method}} - {{status}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"yaxes": [
{
"format": "reqps",
"label": "Requests/sec"
}
]
},
{
"id": 2,
"title": "P95 Latency",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])) by (le, endpoint))",
"legendFormat": "{{endpoint}}"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"yaxes": [
{
"format": "s",
"label": "Duration"
}
],
"thresholds": [
{
"value": 1,
"colorMode": "critical",
"fill": true,
"line": true,
"op": "gt"
}
]
},
{
"id": 3,
"title": "Error Rate",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\", status=~\"5..\"}[$__rate_interval])) / sum(rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval]))",
"legendFormat": "Error %"
}
],
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"yaxes": [
{
"format": "percentunit",
"max": 1,
"min": 0
}
],
"alert": {
"conditions": [
{
"evaluator": {
"params": [0.01],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"type": "avg"
},
"type": "query"
}
],
"frequency": "1m",
"handler": 1,
"name": "Error Rate Alert",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 4,
"title": "Recent Error Logs",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{app=\"$app\", namespace=~\"$namespace\"} | json | level=\"error\"",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": false,
"showCommonLabels": false,
"wrapLogMessage": true,
"dedupStrategy": "none",
"enableLogDetails": true
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
}
},
{
"id": 5,
"title": "Trace Error Rate",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(traces_spanmetrics_calls_total{service=\"$app\", status_code=\"STATUS_CODE_ERROR\"}[$__rate_interval])) / sum(rate(traces_spanmetrics_calls_total{service=\"$app\"}[$__rate_interval]))",
"legendFormat": "Trace Error %"
}
],
"gridPos": {
"h": 6,
"w": 8,
"x": 0,
"y": 16
}
},
{
"id": 6,
"title": "Active Traces",
"type": "stat",
"datasource": "Tempo",
"targets": [
{
"queryType": "traceql",
"query": "{ resource.service.name = \"$app\" }",
"refId": "A"
}
],
"options": {
"graphMode": "area",
"colorMode": "value",
"justifyMode": "auto",
"textMode": "auto"
},
"gridPos": {
"h": 6,
"w": 8,
"x": 8,
"y": 16
}
},
{
"id": 7,
"title": "Top Endpoints by Request Count",
"type": "table",
"datasource": "Mimir",
"targets": [
{
"expr": "topk(10, sum by (endpoint) (rate(http_requests_total{app=\"$app\", namespace=~\"$namespace\"}[$__rate_interval])))",
"format": "table",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {
"Time": true
},
"renameByName": {
"endpoint": "Endpoint",
"Value": "Req/s"
}
}
}
],
"gridPos": {
"h": 6,
"w": 8,
"x": 16,
"y": 16
}
}
],
"links": [
{
"title": "Explore Logs",
"url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"expr\":\"{app=\\\"$app\\\",namespace=~\\\"$namespace\\\"}\"}]}",
"type": "link",
"icon": "doc"
},
{
"title": "Explore Traces",
"url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"{resource.service.name=\\\"$app\\\"}\",\"queryType\":\"traceql\"}]}",
"type": "link",
"icon": "gf-traces"
}
]
}
}
Log Analysis Dashboard
{
"dashboard": {
"title": "Log Analysis - ${namespace}",
"tags": ["logs", "loki"],
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": "Loki",
"query": "label_values(namespace)",
"multi": false
},
{
"name": "app",
"type": "query",
"datasource": "Loki",
"query": "label_values({namespace=\"$namespace\"}, app)",
"multi": true,
"includeAll": true
},
{
"name": "level",
"type": "custom",
"options": [
{ "text": "All", "value": ".*" },
{ "text": "Error", "value": "error" },
{ "text": "Warning", "value": "warn|warning" },
{ "text": "Info", "value": "info" },
{ "text": "Debug", "value": "debug" }
],
"current": {
"text": "All",
"value": ".*"
}
}
]
},
"panels": [
{
"id": 1,
"title": "Log Volume",
"type": "graph",
"datasource": "Loki",
"targets": [
{
"expr": "sum by (app) (rate({namespace=\"$namespace\", app=~\"$app\"} | json | level=~\"$level\" [$__interval]))",
"legendFormat": "{{app}}"
}
],
"gridPos": { "h": 8, "w": 24, "x": 0, "y": 0 }
},
{
"id": 2,
"title": "Log Level Distribution",
"type": "piechart",
"datasource": "Loki",
"targets": [
{
"expr": "sum by (level) (count_over_time({namespace=\"$namespace\", app=~\"$app\"} | json | level=~\"$level\" [$__range]))",
"legendFormat": "{{level}}"
}
],
"options": {
"legend": {
"displayMode": "table",
"placement": "right",
"values": ["value", "percent"]
}
},
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
},
{
"id": 3,
"title": "Top Error Messages",
"type": "bargauge",
"datasource": "Loki",
"targets": [
{
"expr": "topk(10, sum by (msg) (count_over_time({namespace=\"$namespace\", app=~\"$app\"} | json | level=\"error\" [$__range])))",
"legendFormat": "{{msg}}"
}
],
"options": {
"orientation": "horizontal",
"displayMode": "gradient"
},
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
},
{
"id": 4,
"title": "Log Stream",
"type": "logs",
"datasource": "Loki",
"targets": [
{
"expr": "{namespace=\"$namespace\", app=~\"$app\"} | json | level=~\"$level\"",
"refId": "A"
}
],
"options": {
"showTime": true,
"showLabels": true,
"wrapLogMessage": true,
"dedupStrategy": "none",
"enableLogDetails": true
},
"gridPos": { "h": 12, "w": 24, "x": 0, "y": 16 }
}
]
}
}
Distributed Tracing Dashboard
{
"dashboard": {
"title": "Distributed Tracing - ${service}",
"tags": ["tracing", "tempo"],
"templating": {
"list": [
{
"name": "service",
"type": "query",
"datasource": "Tempo",
"query": "service.name",
"multi": false
},
{
"name": "min_duration",
"type": "custom",
"options": [
{ "text": "Any", "value": "0" },
{ "text": ">100ms", "value": "100ms" },
{ "text": ">500ms", "value": "500ms" },
{ "text": ">1s", "value": "1s" },
{ "text": ">5s", "value": "5s" }
]
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate (from span metrics)",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(traces_spanmetrics_calls_total{service=\"$service\"}[$__rate_interval])) by (span_name)",
"legendFormat": "{{span_name}}"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
},
{
"id": 2,
"title": "P95 Latency (from span metrics)",
"type": "graph",
"datasource": "Mimir",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket{service=\"$service\"}[$__rate_interval])) by (le, span_name))",
"legendFormat": "{{span_name}}"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
},
{
"id": 3,
"title": "Error Rate",
"type": "stat",
"datasource": "Mimir",
"targets": [
{
"expr": "sum(rate(traces_spanmetrics_calls_total{service=\"$service\", status_code=\"STATUS_CODE_ERROR\"}[$__rate_interval])) / sum(rate(traces_spanmetrics_calls_total{service=\"$service\"}[$__rate_interval]))"
}
],
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
},
"graphMode": "area",
"colorMode": "background",
"textMode": "value_and_name"
},
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": 0, "color": "green" },
{ "value": 0.01, "color": "yellow" },
{ "value": 0.05, "color": "red" }
]
}
}
},
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 8 }
},
{
"id": 4,
"title": "Traces with Errors",
"type": "table",
"datasource": "Tempo",
"targets": [
{
"queryType": "traceql",
"query": "{ resource.service.name = \"$service\" && status = error }",
"refId": "A",
"limit": 20
}
],
"transformations": [
{
"id": "organize",
"options": {
"renameByName": {
"traceID": "Trace ID",
"startTime": "Start Time",
"duration": "Duration"
}
}
}
],
"gridPos": { "h": 8, "w": 18, "x": 6, "y": 8 }
},
{
"id": 5,
"title": "Service Dependency Graph",
"type": "nodeGraph",
"datasource": "Mimir",
"targets": [
{
"expr": "traces_service_graph_request_total",
"format": "table"
}
],
"gridPos": { "h": 12, "w": 24, "x": 0, "y": 16 }
}
]
}
}
Grafana Alerting
Alert Rule Configuration
Prometheus/Mimir Alert Rule
apiVersion: 1
groups:
- name: application_alerts
interval: 1m
rules:
- uid: error_rate_high
title: High Error Rate
condition: A
data:
- refId: A
queryType: ""
relativeTimeRange:
from: 300
to: 0
datasourceUid: prometheus_uid
model:
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
intervalMs: 1000
maxDataPoints: 43200
noDataState: NoData
execErrState: Error
for: 5m
annotations:
description: 'Error rate is {{ printf "%.2f" $values.A.Value }}% (threshold: 5%)'
summary: Application error rate is above threshold
labels:
severity: critical
team: platform
isPaused: false
- uid: high_latency
title: High P95 Latency
condition: A
data:
- refId: A
datasourceUid: prometheus_uid
model:
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
) > 2
for: 10m
annotations:
description: "P95 latency is {{ $values.A.Value }}s on endpoint {{ $labels.endpoint }}"
runbook_url: https://wiki.company.com/runbooks/high-latency
labels:
severity: warning
- uid: pod_restart_high
title: High Pod Restart Rate
condition: A
data:
- refId: A
datasourceUid: prometheus_uid
model:
expr: |
rate(kube_pod_container_status_restarts_total[15m]) > 0.1
for: 5m
annotations:
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"
labels:
severity: warning
team: sre
Loki Alert Rule
apiVersion: 1
groups:
- name: log_based_alerts
interval: 1m
rules:
- uid: error_spike
title: Error Log Spike
condition: A
data:
- refId: A
queryType: ""
datasourceUid: loki_uid
model:
expr: |
sum(rate({app="api"} | json | level="error" [5m]))
> 10
for: 2m
annotations:
description: "Error log rate is {{ $values.A.Value }} logs/sec"
summary: Spike in error logs detected
labels:
severity: warning
- uid: critical_error_pattern
title: Critical Error Pattern Detected
condition: A
data:
- refId: A
datasourceUid: loki_uid
model:
expr: |
sum(count_over_time({app="api"}
|~ "OutOfMemoryError|StackOverflowError|FatalException" [5m]
)) > 0
for: 0m
annotations:
description: "Critical error pattern found in logs"
labels:
severity: critical
page: true
- uid: database_connection_errors
title: Database Connection Errors
condition: A
data:
- refId: A
datasourceUid: loki_uid
model:
expr: |
sum by (app) (
rate({namespace="production"}
| json
| message =~ ".*database connection.*failed.*" [5m]
)
) > 1
for: 3m
annotations:
description: "App {{ $labels.app }} experiencing DB connection issues"
labels:
severity: critical
Contact Points and Notification Policies
Contact Point Configuration
apiVersion: 1
contactPoints:
- orgId: 1
name: slack-critical
receivers:
- uid: slack_critical
type: slack
settings:
url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
title: "{{ .GroupLabels.alertname }}"
text: |
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Summary:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
{{ end }}
disableResolveMessage: false
- orgId: 1
name: pagerduty-oncall
receivers:
- uid: pagerduty_oncall
type: pagerduty
settings:
integrationKey: YOUR_INTEGRATION_KEY
severity: critical
class: infrastructure
component: kubernetes
- orgId: 1
name: email-team
receivers:
- uid: email_team
type: email
settings:
addresses: team@company.com
singleEmail: true
notificationPolicies:
- orgId: 1
receiver: slack-critical
group_by: ["alertname", "namespace"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- receiver: pagerduty-oncall
matchers:
- severity = critical
- page = true
group_wait: 10s
repeat_interval: 1h
continue: true
- receiver: email-team
matchers:
- severity = warning
- team = platform
group_interval: 10m
repeat_interval: 12h
Dashboard Provisioning
ConfigMap for Dashboard Provisioning
Kubernetes ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
application-dashboard.json: |
{
"dashboard": {
"title": "Application Metrics",
"uid": "app-metrics",
"tags": ["application"],
"timezone": "browser",
"panels": []
}
}
log-analysis.json: |
{
"dashboard": {
"title": "Log Analysis",
"uid": "log-analysis",
"tags": ["logs"],
"timezone": "browser",
"panels": []
}
}
Dashboard Provider Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-providers
namespace: monitoring
data:
dashboards.yaml: |
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: 'General'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /etc/grafana/provisioning/dashboards
- name: 'application'
orgId: 1
folder: 'Applications'
type: file
disableDeletion: true
editable: false
options:
path: /var/lib/grafana/dashboards/application
- name: 'infrastructure'
orgId: 1
folder: 'Infrastructure'
type: file
options:
path: /var/lib/grafana/dashboards/infrastructure
Grafana Deployment with Provisioning
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:10.2.0
ports:
- containerPort: 3000
env:
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-admin
key: password
- name: GF_INSTALL_PLUGINS
value: "grafana-piechart-panel,grafana-worldmap-panel"
volumeMounts:
- name: grafana-storage
mountPath: /var/lib/grafana
- name: grafana-datasources
mountPath: /etc/grafana/provisioning/datasources
- name: grafana-dashboards
mountPath: /etc/grafana/provisioning/dashboards
- name: dashboard-files
mountPath: /var/lib/grafana/dashboards
volumes:
- name: grafana-storage
persistentVolumeClaim:
claimName: grafana-pvc
- name: grafana-datasources
configMap:
name: grafana-datasources
- name: grafana-dashboards
configMap:
name: grafana-dashboard-providers
- name: dashboard-files
configMap:
name: grafana-dashboards
GrafanaDashboard CRD (Grafana Operator)
Using Grafana Operator
GrafanaDashboard Resource
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: application-observability
namespace: monitoring
labels:
app: grafana
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
# Inline dashboard JSON
json: |
{
"dashboard": {
"title": "Application Observability",
"tags": ["generated", "application"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (method)",
"legendFormat": "{{method}}"
}
]
}
]
}
}
GrafanaDashboard from ConfigMap
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
name: logs-dashboard
namespace: monitoring
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
configMapRef:
name: log-dashboard-config
key: dashboard.json
Grafana Instance with Operator
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
name: grafana
namespace: monitoring
labels:
dashboards: "grafana"
spec:
config:
log:
mode: "console"
auth:
disable_login_form: false
security:
admin_user: admin
admin_password: ${ADMIN_PASSWORD}
deployment:
spec:
replicas: 2
template:
spec:
containers:
- name: grafana
image: grafana/grafana:10.2.0
service:
spec:
type: LoadBalancer
ports:
- name: http
port: 3000
targetPort: 3000
ingress:
spec:
ingressClassName: nginx
rules:
- host: grafana.company.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: grafana-service
port:
number: 3000
tls:
- hosts:
- grafana.company.com
secretName: grafana-tls
GrafanaDataSource CRD
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDataSource
metadata:
name: loki-datasource
namespace: monitoring
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
datasource:
name: Loki
type: loki
access: proxy
url: http://loki-gateway.monitoring.svc.cluster.local:3100
isDefault: false
editable: false
jsonData:
maxLines: 1000
derivedFields:
- datasourceUid: tempo
matcherRegex: "trace_id=(\\w+)"
name: TraceID
url: "$${__value.raw}"
---
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDataSource
metadata:
name: tempo-datasource
namespace: monitoring
spec:
instanceSelector:
matchLabels:
dashboards: "grafana"
datasource:
name: Tempo
type: tempo
access: proxy
url: http://tempo-query-frontend.monitoring.svc.cluster.local:3200
uid: tempo
jsonData:
httpMethod: GET
tracesToLogs:
datasourceUid: loki
tags: ["job", "instance", "pod", "namespace"]
mappedTags: [{ key: "service.name", value: "service" }]
nodeGraph:
enabled: true
Production Tips
Performance Optimization
Query Optimization
- Narrow streams with label matchers in the selector before applying line filters and parsers (see the comparison after this list)
- Limit time ranges for expensive queries
- Use unwrap instead of parsing when possible
- Cache query results with the query frontend
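A hedged illustration of query ordering (label values are assumptions):
# Slower: the parser runs on every line in the namespace before any filtering
{namespace="production"} | json | level="error" |= "timeout"
# Faster: a tighter stream selector plus a line filter discard most lines
# before the parser ever runs
{namespace="production", app="api"} |= "timeout" | json | level="error"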
Dashboard Performance
- Limit number of panels (< 15 per dashboard)
- Use appropriate time intervals
- Avoid high-cardinality grouping
- Use $__interval for adaptive sampling
Storage Optimization
- Configure retention policies (sketch after this list)
- Use compaction for Loki and Tempo
- Implement tiered storage (hot/warm/cold)
- Monitor storage growth
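A sketch of Loki retention with a per-stream override; this relies on the compactor with retention_enabled: true (configured earlier), and the selector and periods below are assumptions:
limits_config:
  retention_period: 744h            # default retention: 31 days
  retention_stream:
    - selector: '{namespace="dev"}'
      priority: 1
      period: 24h                   # keep dev logs for only one day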
Security Best Practices
Authentication
- Enable auth (auth_enabled: true in Loki/Tempo)
- Use OAuth/LDAP for Grafana
- Implement multi-tenancy with org isolation (example data source after this list)
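When auth_enabled: true, every request to Loki/Tempo/Mimir must carry a tenant ID. A hedged example of a Grafana data source that sends the X-Scope-OrgID header (the tenant name is an assumption):
apiVersion: 1
datasources:
  - name: Loki-TenantA
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: tenant-a    # tenant whose streams this data source can read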
Authorization
- Configure RBAC in Grafana
- Limit datasource access by team
- Use folder permissions for dashboards
Network Security
- TLS for all components
- Network policies in Kubernetes (example after this list)
- Rate limiting at ingress
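A minimal NetworkPolicy sketch restricting Loki ingress to Grafana and Promtail pods (namespace and labels are assumptions):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: loki
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana
        - podSelector:
            matchLabels:
              app: promtail
      ports:
        - protocol: TCP
          port: 3100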
Monitoring the Monitoring Stack
# Monitor Loki itself
sum(rate(loki_request_duration_seconds_count[5m])) by (route)
# Tempo ingestion rate
rate(tempo_distributor_spans_received_total[5m])
# Grafana active users
grafana_stat_active_users
# Query performance
histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le))
Troubleshooting
Common Issues
- High Cardinality: Too many unique label combinations
  - Solution: Reduce label dimensions, use log parsing instead (see the logcli sketch after this list)
- Query Timeouts: Complex queries on large datasets
  - Solution: Reduce time range, use aggregations, add query limits
- Storage Growth: Unbounded retention
  - Solution: Configure retention policies, enable compaction
- Missing Traces: Incomplete trace data
  - Solution: Check sampling rates, verify instrumentation, check network connectivity
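To confirm a cardinality problem, logcli can summarize the label space (a sketch; assumes logcli is installed and LOKI_ADDR points at your Loki instance):
# Summarize how many unique values each label carries for matching streams
logcli series '{app="api"}' --analyze-labels
# List label names, then the values of a specific label
logcli labels
logcli labels app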
Debugging Queries
# Check query performance
{app="api"} | json | __error__="" # Filter parsing errors
# Verify labels exist
{namespace="production"} | logfmt | line_format "{{.level}}"
# Count parsing failures
sum(rate({app="api"} | json | __error__!="" [5m]))
