qa-debugging

Systematic debugging methodologies, troubleshooting workflows, logging strategies, error tracking, performance profiling, stack trace analysis, and debugging tools across languages and environments. Covers local debugging, distributed systems, production issues, and root cause analysis.

$ Install

git clone https://github.com/vasilyu1983/AI-Agents-public /tmp/AI-Agents-public && cp -r /tmp/AI-Agents-public/frameworks/claude-code-kit/framework/skills/qa-debugging ~/.claude/skills/AI-Agents-public

// tip: Run this command in your terminal to install the skill



Debugging & Troubleshooting — Quick Reference

This skill provides execution-ready debugging strategies, troubleshooting workflows, and root cause analysis techniques. Claude should apply these patterns when users encounter bugs, errors, performance issues, or production incidents.

Modern Best Practices (2025): Structured logging (Pino/Winston), distributed tracing (OpenTelemetry), error tracking (Sentry/Rollbar), observability-first debugging, time-travel debugging, AI-assisted error analysis, and proactive monitoring.
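As an illustration of the structured logging item above, here is a minimal sketch that emits one JSON object per log line using only Python's standard `logging` module. It stands in for Pino/Winston, which are Node.js libraries. The logger name `checkout` and the `request_id` field are assumptions made for the example, not part of this skill.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # request_id is a hypothetical field supplied through `extra=`
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment authorized", extra={"request_id": "req-123"})
try:
    1 / 0
except ZeroDivisionError:
    # log.exception attaches the active traceback to the record
    log.exception("payment failed", extra={"request_id": "req-123"})

# Every line parses as JSON, so logs can be filtered by field.
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
failed = [r for r in records if r["level"] == "ERROR"]
print(failed[0]["request_id"], failed[0]["msg"])
```

Because each line is machine-parseable, filtering by `request_id` or `level` becomes a field lookup rather than a regular expression over free text.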


Quick Reference

| Symptom | Tool/Technique | Command/Approach | When to Use |
|---|---|---|---|
| Application crashes | Stack trace analysis | Check error logs, identify first line in your code | Unhandled exceptions |
| Slow performance | Profiling (CPU/memory) | node --prof, Chrome DevTools, cProfile | High CPU, latency issues |
| Memory leak | Heap snapshots | node --inspect, compare snapshots over time | Memory usage grows |
| Database slow | Query profiling | EXPLAIN ANALYZE, slow query log | Slow queries, high DB CPU |
| Production-only bug | Log analysis + feature flags | grep "ERROR", enable verbose logging for user | Can't reproduce locally |
| Distributed system issue | Distributed tracing | OpenTelemetry, Jaeger, trace request ID | Microservices, async workflows |
| Intermittent failures | Logging + monitoring | Add detailed logs, monitor metrics | Race conditions, timeouts |
| Network timeout | Network debugging | curl, Postman, check firewall/DNS | External API failures |
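For the "Slow performance" row, the sketch below shows one way to use `cProfile` (named in the table) together with `pstats` to rank functions by cumulative time. The functions `slow_sum`, `fast_path`, and `handle_request` are hypothetical stand-ins for application code.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately wasteful: rebuilds a list on every iteration."""
    total = 0
    for i in range(n):
        total += sum(list(range(i % 100)))
    return total


def fast_path(n: int) -> int:
    return n * 2


def handle_request() -> int:
    # Hypothetical entry point that mixes cheap and expensive work.
    return slow_sum(20_000) + fast_path(10)


profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
profile_text = report.getvalue()
print(profile_text)
```

In the printed report, the `cumtime` column should point at `slow_sum` as the place to look first; `fast_path` contributes almost nothing.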

Decision Tree: Debugging Strategy

Issue type: [Problem Scenario]
    ├─ Application Behavior?
    │   ├─ Crashes immediately? → Check stack trace, error logs
    │   ├─ Slow/hanging? → CPU/memory profiling
    │   ├─ Intermittent failures? → Add logging, reproduce consistently
    │   └─ Unexpected output? → Binary search (add logs to narrow down)
    │
    ├─ Performance Issues?
    │   ├─ High CPU? → CPU profiler to find hot functions
    │   ├─ Memory leak? → Heap snapshots, track over time
    │   ├─ Slow database? → EXPLAIN ANALYZE, check indexes
    │   ├─ Network latency? → Trace external API calls
    │   └─ Frontend slow? → Lighthouse, Web Vitals profiling
    │
    ├─ Production-Only?
    │   ├─ Can't reproduce? → Analyze logs for patterns
    │   ├─ Environment difference? → Compare configs, data volume
    │   ├─ Need safe debugging? → Feature flags for verbose logging
    │   └─ Recent deployment? → Git bisect to find regression
    │
    ├─ Distributed Systems?
    │   ├─ Multiple services involved? → Distributed tracing (Jaeger)
    │   ├─ Request lost? → Search logs by request ID
    │   ├─ Service dependency? → Check health checks, circuit breakers
    │   └─ Async workflow? → Trace message queue, event logs
    │
    └─ Error Type?
        ├─ TypeError/NullPointer? → Check object existence, defensive coding
        ├─ Network timeout? → Check external service health, retry logic
        ├─ Database error? → Check connection pool, query syntax
        └─ Unknown error? → Systematic debugging workflow (observe, hypothesize, test)

When to Use This Skill

Claude should invoke this skill when a user reports:

  • Application crashes or errors
  • Unexpected behavior or bugs
  • Performance issues (slow queries, memory leaks, high CPU)
  • Production incidents requiring root cause analysis
  • Stack trace or error message interpretation
  • Debugging strategies for specific scenarios
  • Log analysis and pattern detection
  • Distributed system debugging (microservices, async workflows)
  • Memory leaks and resource exhaustion
  • Race conditions and concurrency issues
  • Network connectivity problems
  • Database query optimization
  • Third-party API integration issues

Operational Deep Dives

See resources/operational-patterns.md for systematic debugging workflows, logging strategy details, stack trace and performance profiling guides, and language-specific tooling checklists.


Templates (Copy-Paste Ready)

Production templates organized by workflow type:


Resources (Deep-Dive Guides)

Operational best practices by domain:



External Resources

See data/sources.json for:

  • Debugging tool documentation
  • Error tracking platforms (Sentry, Rollbar, Bugsnag)
  • Observability platforms (Datadog, New Relic, Honeycomb)
  • Profiling tutorials and guides
  • Production debugging best practices

Quick Decision Matrix

| Symptom | Likely Cause | First Action |
|---|---|---|
| Application crashes | Unhandled exception | Check error logs and stack trace |
| Slow performance | Database/network/CPU bottleneck | Profile with performance tools |
| Memory usage grows | Memory leak | Take heap snapshots over time |
| Intermittent failures | Race condition, network timeout | Add detailed logging around failure |
| Production-only bug | Environment difference, data volume | Compare prod vs dev config/data |
| High CPU usage | Infinite loop, inefficient algorithm | CPU profiler to find hot functions |
| Database slow | Missing index, N+1 queries | Run EXPLAIN ANALYZE on slow queries |
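To make the "Missing index" row concrete, the following sketch compares a query plan before and after adding an index. It uses SQLite's `EXPLAIN QUERY PLAN` through the standard `sqlite3` module as a self-contained substitute for `EXPLAIN ANALYZE`, which is PostgreSQL syntax; the `users` table and index name are made up for the example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable steps of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the plan step description.
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

sql = "SELECT * FROM users WHERE email = ?"
params = ("user500@example.com",)

# Without an index the planner has to scan the whole table.
plan_before = query_plan(conn, sql, params)

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index the planner can search instead of scan.
plan_after = query_plan(conn, sql, params)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should mention a scan of `users` and the second should name `idx_users_email`.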

Anti-Patterns to Avoid

  • Random changes - Making changes without a hypothesis
  • Inadequate logging - Can't debug what you can't see
  • Debugging in production - Always reproduce locally when possible
  • Ignoring stack traces - The stack trace shows where the error was raised; start at the first frame in your own code
  • Not writing tests - Fix today, break tomorrow
  • Symptom fixing - Treating symptoms instead of root cause
  • No monitoring - Flying blind in production
  • Skipping postmortems - Not learning from incidents
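On the stack trace point above, here is a rough sketch of finding the first line in your own code programmatically. It assumes that "your code" means any frame outside the standard library and installed-package directories reported by `sysconfig`; `load_config` is a hypothetical application function.

```python
import json
import os
import sysconfig
import traceback

# Directories that hold the standard library and installed packages,
# treated here as "not your code" (an assumption of this sketch).
_LIBRARY_DIRS = tuple(
    os.path.realpath(path)
    for key, path in sysconfig.get_paths().items()
    if key in ("stdlib", "platstdlib", "purelib", "platlib")
)


def is_own_code(filename: str) -> bool:
    return not os.path.realpath(filename).startswith(_LIBRARY_DIRS)


def deepest_own_frame(exc: BaseException):
    """Return the innermost traceback frame that is not library code."""
    frames = traceback.extract_tb(exc.__traceback__)
    own = [f for f in frames if is_own_code(f.filename)]
    return own[-1] if own else None


def load_config(raw: str) -> dict:
    # The exception is raised inside the json package, not on this line.
    return json.loads(raw)


try:
    load_config("{not valid json")
except json.JSONDecodeError as exc:
    frame = deepest_own_frame(exc)
    print(f"{type(exc).__name__}: {exc}")
    print(f"first frame in own code: {frame.name} (line {frame.lineno})")
```

The exception originates inside the `json` package, but the frame worth inspecting is `load_config`, since that is where application code handed over the bad input.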

Related Skills

This skill works with other skills in the framework:

Development & Operations:

  • git-workflow - Git bisect for finding regressions, version control workflows
  • dev-api-design - API debugging, error handling, REST patterns, status codes

Infrastructure & Platform:

  • ops-devops-platform - CI/CD pipelines, monitoring, incident response, SRE practices, Kubernetes ops
  • data-sql-optimization - Database query optimization, EXPLAIN ANALYZE, index tuning, slow query debugging

AI/ML Operations:

  • ai-mlops - ML model debugging, drift detection, API monitoring, batch pipeline troubleshooting
  • ai-mlops - Security debugging, jailbreak detection, privacy issues, threat modeling

Success Criteria: Issues are diagnosed systematically, root causes are identified accurately, fixes include regression tests, and debugging knowledge is documented for future reference.
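As a small illustration of "fixes include regression tests", the sketch below pairs a fix with a test that pins the failing inputs. The function `parse_timeout`, its default of 30 seconds, and the incident it refers to are all hypothetical.

```python
DEFAULT_TIMEOUT_SECONDS = 30  # assumed default for this example


def parse_timeout(value) -> int:
    """Parse a timeout setting in seconds.

    Hypothetical bug being fixed: an earlier version called int(value)
    directly and crashed when the setting was missing (None) or empty ("").
    """
    text = (value or "").strip()
    if not text:
        return DEFAULT_TIMEOUT_SECONDS
    return int(text)


def test_parse_timeout_regression() -> None:
    # Inputs from the (hypothetical) incident, pinned so the crash cannot return.
    assert parse_timeout(None) == DEFAULT_TIMEOUT_SECONDS
    assert parse_timeout("") == DEFAULT_TIMEOUT_SECONDS
    # Existing behavior that the fix must not change.
    assert parse_timeout("5") == 5
    assert parse_timeout(" 12 ") == 12


test_parse_timeout_regression()
print("regression test passed")
```

Naming the test after the failure and keeping the original failing inputs in it makes the link between the incident and the fix visible to whoever touches the function next.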