---
name: observability
description: Use when diagnosing operation failures, stuck or slow operations, querying Jaeger traces, working with Grafana dashboards, debugging distributed system issues, or investigating worker selection and service communication problems.
---
Observability & Debugging
Load this skill when:
- Diagnosing operation failures, stuck operations, or slow operations
- Working with Jaeger traces or Grafana dashboards
- Debugging distributed system issues
- Investigating worker selection or service communication problems
First Rule: Check Observability Before Logs
When users report issues with operations, use Jaeger first, not logs. KTRDR has comprehensive OpenTelemetry instrumentation that provides complete visibility into distributed operations.
This enables first-response diagnosis instead of iterative detective work.
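A quick sanity check before querying traces is to confirm Jaeger is reachable and see which services are reporting spans (a minimal sketch, assuming the Jaeger query API is exposed on localhost:16686 as in the examples below):
# Confirm Jaeger is up and see which services have reported spans
curl -s "http://localhost:16686/api/services" | jq '.data'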
When to Query Jaeger
Query Jaeger when user reports:
| Symptom | What Jaeger Shows |
|---|---|
| "Operation stuck" | Which phase is stuck and why |
| "Operation failed" | Exact error with full context |
| "Operation slow" | Bottleneck span immediately |
| "No workers selected" | Worker selection decision |
| "Missing data" | Data flow from IB to cache |
| "Service not responding" | HTTP call attempt and result |
Quick Start Workflow
Step 1: Get operation ID
From CLI output or API response (e.g., op_training_20251113_123456_abc123)
Step 2: Query Jaeger API
OPERATION_ID="op_training_20251113_123456_abc123"
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID&limit=1" | jq
Step 3: Analyze trace structure
# Get span summary with durations
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
.data[0].spans[] |
{
span: .operationName,
service: .process.serviceName,
duration_ms: (.duration / 1000),
error: ([.tags[] | select(.key == "error" and .value == "true")] | length > 0)
}' | jq -s 'sort_by(.duration_ms) | reverse'
Step 4: Extract relevant attributes
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OPERATION_ID" | jq '
.data[0].spans[] |
{
span: .operationName,
attributes: (.tags | map({key: .key, value: .value}) | from_entries)
}'
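To avoid retyping the same curl call in Steps 2-4, a small shell wrapper helps; this is a convenience sketch (the `ktrdr_trace` helper is not part of KTRDR):
# Hypothetical helper: fetch the full trace JSON for an operation ID
ktrdr_trace() {
  curl -s "http://localhost:16686/api/traces?tag=operation.id:$1&limit=1"
}

# Example: reuse it for the span summary from Step 3
ktrdr_trace "$OPERATION_ID" | jq '
.data[0].spans[] |
{span: .operationName, duration_ms: (.duration / 1000)}' | jq -s 'sort_by(.duration_ms) | reverse'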
Common Diagnostic Patterns
Pattern 1: Operation Stuck
# Check for worker selection and dispatch
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
select(.operationName == "worker_registry.select_worker") |
.tags[] |
select(.key | startswith("worker_registry.")) |
{key: .key, value: .value}'
Look for:
- `worker_registry.total_workers: 0` → No workers started
- `worker_registry.capable_workers: 0` → No capable workers
- `worker_registry.selection_status: NO_WORKERS_AVAILABLE` → All workers busy
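It also helps to confirm the worker service appears in the trace at all; if only the backend contributed spans, the operation was never dispatched. This sketch reads the `processes` map in Jaeger's trace JSON:
# List which services contributed spans to this trace
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '[.data[0].processes[].serviceName] | unique'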
Pattern 2: Operation Failed
# Extract error details
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
select(.tags[] | select(.key == "error" and .value == "true")) |
{
span: .operationName,
service: .process.serviceName,
exception_type: (.tags[] | select(.key == "exception.type") | .value),
exception_message: (.tags[] | select(.key == "exception.message") | .value)
}'
Common errors:
- `ConnectionRefusedError` → Service not running (check `http.url`)
- `ValueError` → Invalid input parameters
- `DataNotFoundError` → Data not loaded (check `data.symbol`, `data.timeframe`)
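When the exception type alone is not enough, the failing span usually carries the full `exception.stacktrace` attribute (see the reference below). A sketch for printing it:
# Print the recorded stack trace from spans marked as errors
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq -r '
.data[0].spans[] |
select(.tags[] | select(.key == "error" and .value == "true")) |
.tags[] | select(.key == "exception.stacktrace") | .value'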
Pattern 3: Operation Slow
# Find bottleneck span (longest duration)
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
{
span: .operationName,
duration_ms: (.duration / 1000)
}' | jq -s 'sort_by(.duration_ms) | reverse | .[0]'
Common bottlenecks:
- `training.training_loop` → Check `training.device` (GPU vs CPU)
- `data.fetch` → Check `ib.latency_ms`
- `ib.fetch_historical` → Check `data.bars_requested`
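To see where time goes per service rather than per span, durations can be aggregated; this sketch follows the same `.process.serviceName` convention as the commands above:
# Total time per service in ms, largest first
# Note: parent spans include their children's time, so treat totals as an upper bound
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
[.data[0].spans[] | {service: .process.serviceName, duration_ms: (.duration / 1000)}] |
group_by(.service) |
map({service: .[0].service, total_ms: (map(.duration_ms) | add)}) |
sort_by(.total_ms) | reverse'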
Pattern 4: Service Communication Failure
# Check HTTP calls between services
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq '
.data[0].spans[] |
select(.operationName | startswith("POST") or startswith("GET")) |
{
http_call: .operationName,
url: (.tags[] | select(.key == "http.url") | .value),
status: (.tags[] | select(.key == "http.status_code") | .value),
error: (.tags[] | select(.key == "error.type") | .value)
}'
Look for:
- `http.status_code: null` → Connection failed
- `error.type: ConnectionRefusedError` → Target service not running
- `http.url` → Shows which service was being called
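Once `http.url` identifies the target service, probe it directly. The URL and `/health` path below are assumptions; substitute the base of the `http.url` reported in the trace:
# TARGET_URL and the /health path are assumptions -- use the base of http.url from the trace
TARGET_URL="http://localhost:5002"
curl -s -o /dev/null -w "%{http_code}\n" "$TARGET_URL/health"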
Key Span Attributes Reference
Operation Attributes
- `operation.id` → Operation identifier
- `operation.type` → TRAINING, BACKTESTING, DATA_DOWNLOAD
- `operation.status` → PENDING, RUNNING, COMPLETED, FAILED
Worker Selection
- `worker_registry.total_workers` → Total registered workers
- `worker_registry.available_workers` → Available workers
- `worker_registry.capable_workers` → Capable workers for this operation
- `worker_registry.selected_worker_id` → Which worker was chosen
- `worker_registry.selection_status` → SUCCESS, NO_WORKERS_AVAILABLE, NO_CAPABLE_WORKERS
Progress Tracking
- `progress.percentage` → Current progress (0-100)
- `progress.phase` → Current execution phase
- `operations_service.instance_id` → OperationsService instance (check for mismatches)
Error Context
- `exception.type` → Python exception class
- `exception.message` → Error message
- `exception.stacktrace` → Full stack trace
- `error.symbol`, `error.strategy` → Business context
Performance
- `http.status_code` → HTTP response status
- `http.url` → Target URL for HTTP calls
- `ib.latency_ms` → IB Gateway latency
- `training.device` → cuda:0 or cpu
- `gpu.utilization_percent` → GPU usage
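To pull one family of attributes (for example everything under `worker_registry.*`) across all spans, a generic prefix filter works with any of the keys above; a sketch:
# Dump every attribute whose key starts with a given prefix, grouped by span
PREFIX="worker_registry."
curl -s "http://localhost:16686/api/traces?tag=operation.id:$OP_ID" | jq --arg p "$PREFIX" '
.data[0].spans[] |
{
  span: .operationName,
  attrs: ([.tags[] | select(.key | startswith($p)) | {key: .key, value: .value}] | from_entries)
} |
select(.attrs != {})'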
Response Template
When diagnosing with observability, use this structure:
🔍 **Trace Analysis for operation_id: {operation_id}**
**Trace Summary**:
- Trace ID: {trace_id}
- Total Duration: {duration_ms}ms
- Services: {list of services}
- Status: {OK/ERROR}
**Execution Flow**:
1. {span_name} ({service}) - {duration_ms}ms
2. {span_name} ({service}) - {duration_ms}ms
...
**Diagnosis**:
{identified_issue_with_evidence_from_spans}
**Root Cause**:
{root_cause_explanation_with_span_attributes}
**Solution**:
{recommended_fix_with_commands}
Grafana Dashboards
Check Grafana for quick diagnostics before diving into traces.
| Dashboard | Path | Use Case |
|---|---|---|
| System Overview | /d/ktrdr-system-overview | Service health, error rates, latency |
| Worker Status | /d/ktrdr-worker-status | Worker capacity, resource usage |
| Operations | /d/ktrdr-operations | Operation counts, success rates |
Quick Workflows
- "Is it working?" â System Overview: Healthy Services count
- "Why is it slow?" â System Overview: P95 Latency panel
- "Workers missing?" â Worker Status: Healthy Workers and Health Matrix
- "Operations failing?" â Operations: Success Rate and Status Distribution
Benefits of Observability-First Debugging
- Diagnosis in FIRST response (not 10+ messages later)
- Complete context (all services, all phases, all attributes)
- Objective evidence (no guessing or assumptions)
- Distributed visibility (Backend → Worker → Host Service)
- Performance insights (identify bottlenecks immediately)
- Root cause analysis (trace an error from where it surfaced back to its origin)
Full Documentation
For comprehensive workflows and scenarios: docs/debugging/observability-debugging-workflows.md