troubleshooting

Systematic debugging and problem diagnosis using structured troubleshooting methodologies applicable to any domain - technical issues, process failures, system problems, or general obstacles. Use when users report errors, describe malfunctions, encounter unexpected behavior, or need help diagnosing root causes. Triggers include 'debug this,' 'troubleshoot,' 'why isn't this working,' 'getting an error,' 'something's wrong with,' 'how do I fix,' or any problem description.

allowed_tools: *

Install

git clone https://github.com/jkitchin/skillz /tmp/skillz && cp -r /tmp/skillz/skills/technical/troubleshooting ~/.claude/skills/skillz

// tip: Run this command in your terminal to install the skill


name: troubleshooting
description: |
  Systematic debugging and problem diagnosis using structured troubleshooting methodologies applicable to any domain - technical issues, process failures, system problems, or general obstacles. Use when users report errors, describe malfunctions, encounter unexpected behavior, or need help diagnosing root causes. Triggers include 'debug this,' 'troubleshoot,' 'why isn't this working,' 'getting an error,' 'something's wrong with,' 'how do I fix,' or any problem description.
allowed-tools: "*"

Troubleshooting & Debugging Skill

Guide users through systematic problem diagnosis and resolution using structured methodologies applicable to any domain.

Quick Start Workflow

When a troubleshooting request arrives, follow this systematic approach:

  1. Understand - What's expected vs actual, when it started, what changed
  2. Gather - Collect diagnostic information (logs, errors, environment, steps to reproduce)
  3. Hypothesize - List possible causes ranked by probability
  4. Design - Create tests that isolate variables
  5. Test - Execute systematically, document results
  6. Identify - Distinguish symptoms from root cause
  7. Verify - Confirm fix resolves the issue
  8. Document - Record solution for future reference

When to Use This Skill

Activate for requests involving:

  • "Debug this..." / "Troubleshoot..."
  • "Why isn't this working..." / "Getting an error..."
  • "Something's wrong with..." / "How do I fix..."
  • Any problem description or malfunction report
  • Unexpected behavior or failures
  • Performance issues or degradation
  • Intermittent problems

Problem Understanding Phase

Essential First Questions

Before diving into solutions, gather critical context:

Problem Description:

  • What are you trying to do? (expected behavior)
  • What's actually happening? (actual behavior)
  • What error messages or symptoms do you see?
  • Can you reproduce it consistently?

Timeline:

  • When did this start?
  • Did it ever work correctly?
  • What changed recently? (code, config, environment, data, dependencies)

Scope:

  • Does this happen every time or intermittently?
  • Does it happen for everyone or just some users?
  • Does it happen in all environments or just specific ones?

Environment:

  • What system/platform/version?
  • What's the configuration?
  • What are the dependencies and their versions?

Reproduction:

  • What are the exact steps to reproduce?
  • Can you provide a minimal example that demonstrates the issue?
  • What's the simplest case that fails?

Problem Type Classification

Categorize to guide troubleshooting approach:

By Consistency:

  • Consistent - Happens every time (easier to debug)
  • Intermittent - Happens sometimes (harder to debug; often timing or race conditions)
  • Progressive - Gets worse over time (resource leaks, degradation)

By History:

  • Regression - Worked before, broke recently (what changed?)
  • Never worked - New feature/setup (design or setup issue)

By Scope:

  • Universal - Happens everywhere (likely code/logic issue)
  • Environmental - Happens in specific environment (config, dependencies, resources)
  • User-specific - Happens for specific users (permissions, data, state)

By Domain:

  • Code/Logic - Software bugs, algorithmic errors
  • Configuration - Settings, parameters, environment variables
  • Data - Input validation, corruption, format issues
  • Infrastructure - Network, storage, resources, services
  • Integration - API failures, service dependencies
  • Human process - Workflow, communication, missing steps

Classification helps prioritize hypotheses and diagnostic approaches.

Diagnostic Information Gathering

Log and Error Analysis

Error Messages:

  • Copy exact error text (don't paraphrase)
  • Note error code/type if provided
  • Identify stack trace or error location
  • Look for nested/chained errors (root cause often deeper)

Log Examination:

  • Check timestamps (when did issue occur)
  • Look for ERROR, WARN, EXCEPTION entries near failure time
  • Trace request/operation ID through logs
  • Check for patterns (repeating errors, sequences)
  • Review logs before failure (what was system doing)
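
As a quick first pass, a few lines of Python can surface error-level entries and their line numbers before you read the surrounding context manually. This is a generic sketch; the file name app.log is a placeholder for whatever log you are examining.

    import re
    from pathlib import Path

    LOG_FILE = Path("app.log")  # placeholder: point this at the log under investigation
    pattern = re.compile(r"\b(ERROR|WARN|WARNING|EXCEPTION)\b")

    for lineno, line in enumerate(LOG_FILE.read_text(errors="replace").splitlines(), start=1):
        if pattern.search(line):
            # Print the line number so you can jump back and read what preceded the failure
            print(f"{lineno}: {line}")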

System State:

  • Resource usage (CPU, memory, disk, network)
  • Process/service status
  • Recent restarts or crashes
  • System event logs
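
Part of this snapshot can be captured from Python's standard library (CPU count and disk usage shown below); memory and network usage usually need platform tools such as top or free, or a third-party library. A minimal sketch:

    import os
    import shutil

    total, used, free = shutil.disk_usage("/")  # use the mount point relevant to the failure
    print("CPU cores:", os.cpu_count())
    print(f"Disk: {used / total:.0%} used, {free // 2**30} GiB free")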

Environment Documentation

Capture complete environment picture:

Versions:

  • Application/service version
  • Runtime/interpreter version (Node, Python, Java, etc.)
  • OS version
  • Dependency versions (libraries, packages)
  • Database/cache versions
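
If the failing system runs Python, a small script can capture most of these versions in one shot for the troubleshooting log. This sketch uses only the standard library and assumes pip is installed in the environment being diagnosed:

    import platform
    import subprocess
    import sys

    # Record OS, interpreter, and installed package versions for the log
    print("OS:", platform.platform())
    print("Python:", sys.version.split()[0])
    freeze = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                            capture_output=True, text=True)
    print(freeze.stdout)  # dependency versions, one 'name==version' line each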

Configuration:

  • Config files and their values
  • Environment variables
  • Feature flags or toggles
  • Settings that differ from defaults

Infrastructure:

  • Network topology
  • Service endpoints and connectivity
  • Firewall/security rules
  • Load balancers, proxies
  • DNS configuration

Reproduction Case Development

Goal: Find minimal reproducible example (MRE)

Process:

  1. Document exact steps from clean state
  2. Strip away non-essential steps
  3. Minimize data/input to simplest case
  4. Remove dependencies where possible
  5. Verify it still reproduces

Good reproduction case:

  • Anyone can follow steps and see same issue
  • Minimal (fewest steps, smallest data)
  • Self-contained (no external dependencies if possible)
  • Deterministic (reproduces consistently)
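
As a hypothetical illustration, a minimal reproduction case for a serialization bug can be a few self-contained lines that anyone can run (standard library only, with the failing input reduced to a single field):

    import json
    from datetime import datetime

    # Smallest failing input: one field holding a datetime value
    payload = {"created_at": datetime(2024, 1, 1)}

    # Expected: a JSON string
    # Actual: TypeError: Object of type datetime is not JSON serializable
    print(json.dumps(payload))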

See references/information-gathering.md for detailed techniques.

Hypothesis Generation

Forming Testable Hypotheses

Generate possibilities:

  • Brainstorm what could cause observed symptoms
  • Consider each problem type (code, config, data, infrastructure)
  • Review known failure patterns for this domain
  • Don't self-censor - list everything plausible

Rank by probability:

  • Most likely causes first
  • Consider base rates (common vs rare failures)
  • Factor in recent changes
  • Weight by evidence strength

Make hypotheses testable:

  • "If X is the cause, then Y should be true"
  • Design specific test that would confirm/refute
  • Ensure test is practical to execute

Example hypotheses (ranked):

  1. High probability: Recent config change introduced typo → Check config against previous version
  2. Medium probability: Dependency version mismatch → Verify all dependencies match working environment
  3. Low probability: Hardware failure → Run diagnostics, check system logs for hardware errors
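
If you prefer to track hypotheses in code rather than prose, a small record per hypothesis is enough. The structure below is a hypothetical sketch, not the format used by scripts/hypothesis_tracker.py:

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        cause: str         # suspected root cause
        prediction: str    # "if this is the cause, then ... should be true"
        test: str          # concrete check that confirms or refutes it
        probability: str   # high / medium / low
        result: str = "untested"

    backlog = [
        Hypothesis("Recent config change introduced a typo",
                   "Diff against the previous config shows an unexpected change",
                   "Compare config with the last-known-good version", "high"),
        Hypothesis("Dependency version mismatch",
                   "Installed versions differ from the working environment",
                   "Compare dependency lists between environments", "medium"),
    ]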

Common Failure Patterns

Code/Logic: Off-by-one, null/undefined, type mismatch, logic errors, unhandled edge cases, race conditions

Configuration: Typos, wrong environment variables, permissions, path issues, missing settings

Data: Invalid format, null/empty values, corruption, encoding issues, schema mismatch

Infrastructure: Network connectivity, port conflicts, insufficient resources, permissions, services, firewall

Integration: API changes, expired credentials, rate limiting, timeouts, version incompatibility

See references/hypothesis-generation.md and references/domain-specific-patterns.md for comprehensive patterns.

Systematic Testing

Scientific Method Principles

Change one variable at a time:

  • Modify single factor per test
  • Know exactly what you changed
  • Document change and result
  • Reset to baseline between tests

Verify assumptions:

  • Don't assume anything works
  • Test each layer (can you reach service? is it responding? is auth working?)
  • Question "obvious" things
  • Build confidence from known-good baseline
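
For example, "can you reach the service?" is something to test, not assume. A few lines verify the lowest layer before you debug anything above it (host and port are placeholders):

    import socket

    def can_connect(host, port, timeout=3):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print(f"Cannot reach {host}:{port}: {exc}")
            return False

    print(can_connect("example.com", 443))  # placeholder service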

Isolate variables:

  • Remove unnecessary components
  • Test in controlled environment
  • Disable features to narrow scope
  • Use minimal configuration

Document everything:

  • What you tested
  • What you changed
  • What happened (exact result)
  • Conclusions drawn
  • Next steps

Use scripts/hypothesis_tracker.py to maintain an investigation log.

Binary Search Troubleshooting

Concept: Divide problem space in half, determine which half contains issue, repeat.

Applications:

  • Version bisect: Which commit introduced bug? (git bisect)
  • Config narrowing: Which of 50 config settings causes issue?
  • Data range: Which records in dataset cause failure?
  • Code sections: Which function/module has the bug?
  • Time range: When did problem start occurring?

Process:

  1. Identify search space boundaries (working ↔ broken)
  2. Test midpoint
  3. Determine which half contains issue
  4. Repeat with smaller range
  5. Continue until isolated to specific element

Example - Config debugging:

  • Start: All 50 config settings (broken)
  • Disable half (25 settings) → Still broken
  • Issue is in disabled half
  • Disable half of those (12-13) → Works!
  • Issue is in recently disabled set
  • Continue narrowing...
  • Result: Isolated to single problematic setting
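
The same narrowing loop can be written in a few lines of Python. This is a generic sketch, not the implementation in scripts/binary_search_helper.py; it assumes a single culprit that triggers the failure whenever it is enabled, and a still_fails callback that re-runs the failing scenario with only the given items active:

    def bisect_suspects(suspects, still_fails):
        """Narrow a list of suspect items (settings, commits, records) to one culprit."""
        remaining = list(suspects)
        while len(remaining) > 1:
            mid = len(remaining) // 2
            enabled = remaining[:mid]
            # Re-run the failing scenario with only `enabled` active
            if still_fails(enabled):
                remaining = enabled           # the culprit is among the enabled items
            else:
                remaining = remaining[mid:]   # the culprit is among the disabled items
        return remaining[0] if remaining else None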

Use scripts/binary_search_helper.py for interactive guidance.

Differential Diagnosis

Compare working vs non-working cases to identify differences:

Compare:

  • Configurations
  • Environments
  • Data inputs
  • Dependencies
  • Timing/sequence
  • System state

Look for:

  • What's present in failing case but not working case
  • What's different between the two
  • What versions differ
  • What settings differ

Technique:

  • Start with known-good case
  • Progressively make it more like failing case
  • Stop when you introduce the failure
  • That change is the cause
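
A quick way to surface the differences is to capture both cases as key/value pairs (config values, dependency versions, environment variables) and diff them. A minimal sketch with made-up values:

    def diff_cases(working, failing):
        """Return every key whose value differs between the working and failing case."""
        keys = set(working) | set(failing)
        return {k: (working.get(k, "<missing>"), failing.get(k, "<missing>"))
                for k in sorted(keys)
                if working.get(k) != failing.get(k)}

    good = {"python": "3.11.4", "requests": "2.31.0", "TZ": "UTC"}  # made-up values
    bad = {"python": "3.11.4", "requests": "2.32.0"}
    print(diff_cases(good, bad))
    # {'TZ': ('UTC', '<missing>'), 'requests': ('2.31.0', '2.32.0')}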

Root Cause Analysis

5 Whys Technique

Dig deeper by asking "why" repeatedly (typically 5 times):

Example:

  1. Problem: Website is down
  2. Why? Database connection failed
  3. Why? Connection pool exhausted
  4. Why? Connections not being released
  5. Why? Code missing connection.close() in error path
  6. Why? Developer didn't know about finally blocks
  7. Root cause: Missing error-handling training, missing code review checklist

Guidelines:

  • Each "why" should be factual (not speculative)
  • Stop when you reach actionable root cause
  • May need more or fewer than 5 whys
  • Focus on process/system causes, not blaming people

Distinguishing Symptoms from Root Cause

Symptom: Observable manifestation of problem
Root Cause: Underlying reason problem exists

Example:

  • Symptom: Application crashes
  • Symptom: Out of memory error
  • Symptom: Memory leak in caching module
  • Root Cause: Cache eviction policy not implemented

Test: Would fixing this prevent all instances of the problem?

  • If no → It's a symptom, keep digging
  • If yes → Likely root cause

Verify:

  • Fix the root cause
  • Test that all symptoms disappear
  • Verify issue doesn't recur

See references/diagnostic-frameworks.md for Fishbone diagrams and FMEA.

Troubleshooting Principles

Core Principles:

  1. Reproduce before fixing - If you can't reproduce it, you can't verify the fix

  2. Change one thing at a time - Otherwise you won't know what fixed it

  3. Verify assumptions - "It should work" ≠ "It does work"

  4. Start with simple explanations - Occam's Razor (common problems are common)

  5. Follow the data - Believe logs/evidence over intuition

  6. Work from known-good baseline - Establish what works, then build up

  7. Document as you go - Your future self will thank you

  8. Know when to stop - Time-box investigation, escalate if stuck

  9. Avoid confirmation bias - Look for evidence that disproves your hypothesis too

  10. Treat the disease, not symptoms - Fix root cause, not just visible symptoms

Domain-Specific Guidance

Software/Code: Syntax errors, logic bugs, runtime errors, concurrency issues, memory problems

  • Use debugger, add logging, rubber duck debugging, review changes, check typos
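
For instance, "add logging" can be as simple as instrumenting the suspect function with the standard logging module so inputs and the branch taken show up in the log. A generic sketch with a made-up function:

    import logging

    logging.basicConfig(level=logging.DEBUG,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger(__name__)

    def apply_discount(price, rate):
        # Log inputs and the branch taken so the failing case is visible in the log
        log.debug("apply_discount(price=%r, rate=%r)", price, rate)
        if rate < 0 or rate > 1:
            log.warning("rate out of range: %r", rate)
            return price
        return price * (1 - rate)

    apply_discount(100, 1.5)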

System/Infrastructure: Network, permissions, services, ports, resources, configuration

  • Check service status, test connectivity, review logs, verify config, check resources

Process/Workflow: Missing steps, unmet dependencies, communication gaps, timing issues

  • Map actual vs expected process, identify handoffs, check prerequisites

See references/domain-specific-patterns.md for comprehensive diagnostic approaches by domain.

Common Anti-Patterns to Avoid

  ❌ Change multiple things at once - You won't know what fixed it
  ❌ Assume without verifying - "It worked last week" doesn't mean it works now
  ❌ Skip reproduction - Can't verify fix without reproducing first
  ❌ Pursue only favorite hypothesis - Confirmation bias blinds you
  ❌ Give up on intermittent issues - Often most critical
  ❌ Forget to document - Others will duplicate your work
  ❌ Treat symptoms not root cause - Suppressing errors doesn't fix problems
  ❌ Blame the user - Stops investigation prematurely
  ❌ Rush to solution - Premature fixes waste time

Solution Verification

After implementing fix:

  1. Test reproduction case - Does it now work?

  2. Test edge cases - Does fix work in all scenarios?

  3. Regression check - Did fix break anything else?

  4. Performance check - Did fix introduce performance issues?

  5. Monitor in production - Watch for recurrence or side effects

  6. Verify with users - Confirm issue is resolved from their perspective
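
Steps 1 and 3 are easiest when the minimal reproduction case becomes a permanent test. Continuing the hypothetical serialization example from earlier, the fixed code and its regression test might look like this (runnable with pytest, or by calling the test function directly):

    import json
    from datetime import datetime

    def serialize(payload):
        # The fix: tell json.dumps how to render datetime values
        return json.dumps(payload, default=str)

    def test_serialize_handles_datetime():
        # The original minimal reproduction case, kept as a regression test
        out = serialize({"created_at": datetime(2024, 1, 1)})
        assert "2024-01-01" in out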

Documentation

Document for future:

Troubleshooting Log:

  • Problem description
  • Investigation steps
  • Hypotheses tested
  • Results of each test
  • Root cause identified
  • Solution implemented

Use assets/templates/troubleshooting_log_template.md

Root Cause Analysis Report:

  • Executive summary
  • Problem statement
  • Timeline of events
  • Root cause analysis (5 Whys, Fishbone)
  • Corrective actions
  • Preventive measures

Use assets/templates/rca_report_template.md

Reproduction Steps:

  • Minimal reproduction example
  • Prerequisites
  • Step-by-step instructions
  • Expected vs actual results

Use assets/templates/reproduction_steps_template.md

When to Escalate

Know when to ask for help:

  • You've exhausted reasonable hypotheses
  • Issue is outside your domain expertise
  • Problem is time-critical and investigation is taking too long
  • You need access/permissions you don't have
  • Problem appears to be in external system
  • You've been going in circles (revisiting same hypotheses)

Before escalating:

  • Document what you've tried
  • Share reproduction case
  • Provide all diagnostic information
  • Explain your hypothesis and reasoning
  • Suggest what help you need

Using Supporting Resources

Additional resources in this skill:

  • references/diagnostic-frameworks.md: 5 Whys, Fishbone, FMEA, Binary Search, Comparative Analysis
  • references/information-gathering.md: Log analysis, error interpretation, environment documentation
  • references/hypothesis-generation.md: Failure patterns, hypothesis ranking, test design
  • references/domain-specific-patterns.md: Common issues by domain (software, systems, hardware, process, data)
  • scripts/hypothesis_tracker.py: Track hypotheses and test results
  • scripts/binary_search_helper.py: Interactive binary search guidance
  • assets/templates/: Documentation templates for investigation logs, RCA reports, reproduction cases

Reference these for deeper guidance on specific techniques.


Remember: The best debuggers are methodical, patient, and systematic. Resist the urge to jump to solutions. Follow the data. Document your process. The answer will reveal itself through systematic elimination.