ORCHESTRATION_DEBUGGING
Troubleshoot agent & tool failures in scheduling orchestration. Use when MCP tools fail, agent communication breaks, constraint engines error, or database operations timeout. Provides systematic incident response and root cause analysis.
$ Install
git clone https://github.com/Euda1mon1a/Autonomous-Assignment-Program-Manager /tmp/Autonomous-Assignment-Program-Manager && cp -r /tmp/Autonomous-Assignment-Program-Manager/.claude/skills/ORCHESTRATION_DEBUGGING ~/.claude/skills/
Tip: Run this command in your terminal to install the skill.
A comprehensive debugging skill for diagnosing and resolving failures in the AI-orchestrated scheduling system, including MCP tool integration, agent workflows, constraint engine, and database operations.
When This Skill Activates
- MCP Tool Failures: Timeout, connection errors, or incorrect responses
- Agent Communication Issues: Multi-agent workflows failing to coordinate
- Constraint Engine Errors: OR-Tools solver failures, constraint conflicts
- Database Operation Failures: Deadlocks, connection pool exhaustion, slow queries
- Schedule Generation Failures: Validation errors, compliance violations, infeasible schedules
- Background Task Issues: Celery worker crashes, task timeouts, queue backlogs
- API Integration Failures: Backend API errors, authentication issues, rate limiting
Overview
This skill provides structured workflows for:
- Incident Review: Post-mortem analysis with root cause identification
- Log Analysis: Systematic log parsing across services (backend, MCP, Celery, database)
- Root Cause Analysis: 5-whys investigation methodology
- Common Failure Patterns: Catalog of known issues with solutions
- Debugging Checklist: Step-by-step troubleshooting for each component
Architecture Context
System Components
Claude Agent
↓ (MCP Protocol)
MCP Server (29+ tools)
↓ (HTTP API)
FastAPI Backend
↓ (SQLAlchemy)
PostgreSQL Database
↓ (Async Tasks)
Celery + Redis
Common Failure Points
| Layer | Component | Failure Mode |
|---|---|---|
| Agent | Claude Code | Token limits, context overflow, skill conflicts |
| MCP | Tool invocation | Timeout, serialization errors, auth failures |
| API | FastAPI routes | Validation errors, database session issues |
| Service | Business logic | Constraint violations, ACGME compliance failures |
| Solver | OR-Tools engine | Infeasible constraints, timeout, memory exhaustion |
| Database | PostgreSQL | Deadlocks, connection pool exhaustion, slow queries |
| Tasks | Celery workers | Task timeout, serialization errors, queue backlog |
Core Debugging Phases
Phase 1: DETECTION
Goal: Identify what failed and where
1. Check error visibility
- User-facing error message
- API response logs
- Backend service logs
- Database query logs
- MCP server logs
2. Establish failure scope
- Single request or systemic?
- Reproducible or intermittent?
- User-specific or system-wide?
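To establish scope quickly, probe each service's health endpoint directly. A minimal sketch: the backend's /health endpoint also appears in the MCP connectivity check later in this skill, but the localhost ports and the MCP server's health URL are assumptions to adjust for your deployment.
# Scope-probe sketch. URLs/ports are assumptions - adjust to your stack.
import requests

SERVICES = {
    "backend": "http://localhost:8000/health",
    "mcp-server": "http://localhost:3000/health",  # hypothetical endpoint
}

for name, url in SERVICES.items():
    try:
        resp = requests.get(url, timeout=5)
        latency_ms = resp.elapsed.total_seconds() * 1000
        print(f"{name}: HTTP {resp.status_code} ({latency_ms:.0f}ms)")
    except requests.RequestException as exc:
        print(f"{name}: UNREACHABLE ({exc})")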
Phase 2: DIAGNOSIS
Goal: Understand why it failed
1. Trace request path
- Agent → MCP → API → Service → Database
- Identify where the chain breaks
2. Collect evidence
- Error stack traces
- Recent code changes (git log)
- Database state (queries, locks)
- System resources (CPU, memory, connections)
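A sketch of the evidence-collection step, pulling recent commits and error lines per service. It assumes git and docker-compose are on PATH and that it runs from the repository root; the grep keywords mirror the log commands later in this skill.
# Evidence-gathering sketch: recent code changes plus error lines per service.
import subprocess

def recent_commits(n: int = 10) -> str:
    # Last n commits - candidate causes for a fresh regression
    return subprocess.run(
        ["git", "log", "--oneline", f"-{n}"],
        capture_output=True, text=True,
    ).stdout

def recent_errors(service: str, tail: int = 200) -> list[str]:
    logs = subprocess.run(
        ["docker-compose", "logs", service, f"--tail={tail}"],
        capture_output=True, text=True,
    ).stdout
    keywords = ("error", "exception", "failed")
    return [line for line in logs.splitlines()
            if any(k in line.lower() for k in keywords)]

print(recent_commits())
for svc in ("backend", "mcp-server", "celery-worker"):
    for line in recent_errors(svc):
        print(f"[{svc}] {line}")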
Phase 3: RESOLUTION
Goal: Fix the issue
1. Implement fix
- Code changes
- Configuration updates
- Database repairs
2. Verify fix
- Reproduce original failure
- Confirm fix resolves it
- Check for regressions
Phase 4: PREVENTION
Goal: Prevent recurrence
1. Document incident
- Root cause
- Fix applied
- Lessons learned
2. Implement safeguards
- Add tests
- Add monitoring
- Update documentation
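For the "add tests" safeguard, pin the failure mode with a regression test. A self-contained sketch using OR-Tools CP-SAT (the solver named above), asserting that an over-constrained model is reported as infeasible rather than crashing; the toy constraint stands in for a real incident input.
# Regression-test sketch: assert the solver surfaces infeasibility cleanly.
from ortools.sat.python import cp_model

def test_infeasible_model_is_reported_not_crashed():
    model = cp_model.CpModel()
    x = model.NewIntVar(0, 5, "x")
    model.Add(x > 10)  # impossible given x's domain
    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    assert status == cp_model.INFEASIBLE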
Workflow Files
Workflows/incident-review.md
Post-mortem template for systematic incident analysis:
- Timeline reconstruction
- Impact assessment
- Root cause identification (5-whys)
- Remediation actions
- Prevention measures
Use when: After resolving a major incident or when debugging a complex failure
Workflows/log-analysis.md
Log parsing and correlation across services:
- Log location discovery
- Error pattern extraction
- Cross-service correlation
- Timeline reconstruction
- Anomaly detection
Use when: Error is unclear or spans multiple services
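A minimal sketch of the cross-service correlation step: merge timestamped lines from captured log files into one ordered timeline. It assumes ISO-8601 timestamps at the start of each line; adjust the regex to your log format.
# Timeline-merge sketch for cross-service correlation. File names are
# placeholders for logs captured with the docker-compose commands below.
import re
from pathlib import Path

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2})")

def merged_timeline(files: dict[str, str]) -> list[tuple[str, str, str]]:
    events = []
    for service, path in files.items():
        for line in Path(path).read_text().splitlines():
            match = TIMESTAMP.match(line)
            if match:
                events.append((match.group(1), service, line))
    return sorted(events)  # lexicographic sort works for ISO-8601 timestamps

for ts, service, line in merged_timeline({"backend": "backend.log", "celery": "celery.log"}):
    print(f"[{service}] {line}")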
Workflows/root-cause-analysis.md
5-whys investigation methodology:
- Problem statement definition
- Iterative questioning
- Evidence gathering
- Root cause identification
Use when: Surface-level fix is clear but underlying cause is not
Reference Files
Reference/common-failure-patterns.md
Catalog of known issues with symptoms and fixes:
- Database connection failures
- MCP tool timeouts
- Constraint engine errors
- Agent communication failures
- Each with: Symptoms → Diagnosis → Fix
Use when: Encountering a familiar-looking error
Reference/debugging-checklist.md
Step-by-step troubleshooting guide:
- Service health checks
- Log verification
- Database inspection
- MCP tool status
- Agent state verification
Use when: Starting investigation with no clear direction
Key Files to Inspect
Backend Logs
# Application logs
docker-compose logs backend --tail=200 --follow
# Uvicorn access logs
docker-compose logs backend | grep "POST\|GET\|PUT\|DELETE"
# Error-specific logs
docker-compose logs backend 2>&1 | grep -i "error\|exception\|failed"
MCP Server Logs
# MCP server output
docker-compose logs mcp-server --tail=100 --follow
# Tool invocation logs
docker-compose logs mcp-server | grep "tool_call\|error"
# API connectivity
docker-compose exec mcp-server curl -s http://backend:8000/health
Database Logs
# Connect to database
docker-compose exec db psql -U scheduler -d residency_scheduler
-- Check active queries (run inside psql)
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
-- Check locks
SELECT * FROM pg_locks WHERE NOT granted;
Celery Logs
# Worker logs
docker-compose logs celery-worker --tail=100 --follow
# Beat scheduler logs
docker-compose logs celery-beat --tail=50 --follow
# Check queue status
docker-compose exec redis redis-cli LLEN celery
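The same queue check can run programmatically with redis-py, which is handy inside automated health reports; the host, port, and default "celery" queue name are assumptions.
# Programmatic queue-depth check with redis-py (assumed connection settings).
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
depth = r.llen("celery")  # default Celery queue key
print(f"celery queue depth: {depth}")
if depth > 100:  # illustrative threshold, tune for your workload
    print("WARNING: queue backlog - inspect worker logs")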
Output Format
Quick Status Check
SYSTEM HEALTH: [GREEN|YELLOW|ORANGE|RED]
Backend API: ✓ Responding (200ms avg)
MCP Server: ✓ Connected (29 tools available)
Database: ✓ 8/20 connections used
Celery: ✗ 3 failed tasks in queue
Redis: ✓ Connected
ISSUES DETECTED:
1. Celery worker timeout on schedule generation task
2. 2 database deadlocks in last hour
RECOMMENDED ACTION: Review celery worker logs and database lock contention
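A sketch of how per-service probe results might roll up into the overall verdict; the mapping from failure counts to colors is illustrative, not a fixed policy.
# Roll-up sketch: map per-service check results to an overall verdict.
def summarize(checks: dict[str, bool]) -> str:
    failed = [name for name, ok in checks.items() if not ok]
    if not failed:
        return "SYSTEM HEALTH: GREEN"
    if len(failed) == 1:
        return f"SYSTEM HEALTH: YELLOW (degraded: {failed[0]})"
    return f"SYSTEM HEALTH: RED (failed: {', '.join(failed)})"

print(summarize({"backend": True, "mcp": True, "database": True, "celery": False}))
# -> SYSTEM HEALTH: YELLOW (degraded: celery)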
Full Incident Report
## INCIDENT REPORT: [Title]
**Date**: 2025-12-26 14:32 UTC
**Severity**: [LOW|MEDIUM|HIGH|CRITICAL]
**Status**: [INVESTIGATING|RESOLVED|MONITORING]
**Reporter**: [Agent/User/Automated]
### Summary
One-sentence description of what failed
### Timeline
- 14:30 - First error detected
- 14:31 - Service degraded
- 14:35 - Fix implemented
- 14:40 - Service restored
### Impact
- Users affected: [number or "all"]
- Data integrity: [preserved/compromised]
- ACGME compliance: [maintained/violated]
- Downtime: [duration]
### Root Cause
Detailed explanation using 5-whys methodology
### Resolution
What was done to fix the issue
### Prevention
How to prevent this in the future
### Action Items
- [ ] Add monitoring for [metric]
- [ ] Create test case for [scenario]
- [ ] Update documentation for [component]
Error Handling Best Practices
1. Preserve Context
# Bad - loses context
try:
    result = await some_operation()
except Exception:
    raise HTTPException(status_code=500, detail="Operation failed")

# Good - preserves stack trace
try:
    result = await some_operation()
except Exception as e:
    logger.error(f"Operation failed: {e}", exc_info=True)
    raise HTTPException(
        status_code=500,
        detail="Operation failed - check logs for details",
    )
2. Log Diagnostic Information
logger.info(f"Starting operation with params: {params}")
logger.debug(f"Intermediate state: {state}")
logger.error(f"Operation failed at step {step}", exc_info=True)
3. Add Request IDs
# For tracing requests across services
import uuid

request_id = str(uuid.uuid4())
logger.info(f"[{request_id}] Processing schedule generation")
Integration with Other Skills
With systematic-debugger
For code-level debugging:
- ORCHESTRATION_DEBUGGING identifies which component failed
- systematic-debugger investigates the code
With production-incident-responder
For production emergencies:
- production-incident-responder handles immediate crisis
- ORCHESTRATION_DEBUGGING performs post-mortem
With automated-code-fixer
For automated fixes:
- ORCHESTRATION_DEBUGGING identifies root cause
- automated-code-fixer applies tested solution
Escalation Criteria
ALWAYS escalate to a human when:
- Data corruption detected
- Security vulnerability discovered
- ACGME compliance violated
- Multi-hour outage
- Root cause unclear after investigation
- Fix requires database migration or schema change
Can handle automatically:
- Configuration issues
- Known failure patterns with documented fixes
- Resource exhaustion (restart services)
- Transient network errors
- Log analysis and report generation
Monitoring Recommendations
After resolving incidents, add monitoring for:
- Error rate by endpoint
- Request latency (p50, p95, p99)
- Database connection pool usage
- Celery queue depth
- MCP tool success rate
- Schedule generation success rate
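A sketch of what instrumenting two of these metrics could look like with prometheus_client; the metric names and scrape port are illustrative, not the project's existing ones.
# Metrics sketch with prometheus_client; names and port are illustrative.
from prometheus_client import Counter, Histogram, start_http_server

SCHEDULE_RUNS = Counter(
    "schedule_generation_total", "Schedule generation attempts", ["outcome"]
)
REQUEST_LATENCY = Histogram(
    "api_request_latency_seconds", "API request latency", ["endpoint"]
)

def record_run(success: bool) -> None:
    SCHEDULE_RUNS.labels(outcome="success" if success else "failure").inc()

start_http_server(9100)  # exposes /metrics for Prometheus to scrape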
References
- /docs/development/DEBUGGING_WORKFLOW.md - Overall debugging methodology
- /docs/development/CI_CD_TROUBLESHOOTING.md - CI/CD-specific patterns
- /mcp-server/RESILIENCE_MCP_INTEGRATION.md - MCP tool documentation
- /backend/app/core/logging.py - Logging configuration
- Workflows/ - Detailed workflow templates
- Reference/ - Common patterns and checklists