ml-training-debugger
Diagnose and stabilize ML training runs, recover from failures, and deliver validated fixes.
Install
git clone https://github.com/DNYoussef/context-cascade /tmp/context-cascade && cp -r /tmp/context-cascade/skills/specialists/ml-training-debugger ~/.claude/skills/context-cascade/
# Tip: run this command in your terminal to install the skill.
name: ml-training-debugger
description: Diagnose and stabilize ML training runs, recover from failures, and deliver validated fixes.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Task, TodoWrite
model: sonnet
x-category: specialists
x-version: 1.1.0
x-vcl-compliance: v3.1.1
x-cognitive-frames:
- HON
- MOR
- COM
- CLS
- EVD
- ASP
- SPC
STANDARD OPERATING PROCEDURE
Purpose
Rapidly triage ML training incidents (instability, divergence, degraded metrics) and deliver validated remediations with traceable evidence.
Triggers
- Positive: Failing/unstable training runs, unexplained metric drops, NaNs/exploding gradients, data/label issues, reproducibility gaps.
- Negative: New model development without incident (route to ml-expert) or lightweight prototyping (route to ml).
Guardrails
- Structure-first: ensure SKILL.md, README, examples/, tests/, resources/, and agents/ exist; create missing docs before work.
- Constraint extraction: clarify environment (hardware/framework), data provenance, metric targets, and incident timeline.
- Validation discipline: reproduce issue, isolate variables (data/model/optim), run minimal change tests; adversarially probe for leakage and nondeterminism.
- Confidence ceiling enforced (inference/report 0.70; research 0.85; observation/definition 0.95) with evidence per finding.
- Safety: preserve checkpoints/logs; avoid destructive changes; keep rollback path ready.
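The safety guardrail implies archiving run artifacts before touching anything. A minimal sketch in Python; `preserve_artifacts`, its paths, and the archive layout are illustrative assumptions, not part of the skill's fixed interface:

```python
# Safety sketch: archive checkpoints/logs before any remediation so a
# rollback path always exists. Paths are illustrative assumptions.
import shutil
from pathlib import Path

def preserve_artifacts(run_dir: str, archive_root: str = "incident_archive") -> Path:
    """Copy the run directory aside before any fix is attempted."""
    src = Path(run_dir)
    dst = Path(archive_root) / src.name
    dst.parent.mkdir(parents=True, exist_ok=True)
    # copytree refuses to overwrite an existing destination, so reruns
    # cannot clobber earlier evidence.
    shutil.copytree(src, dst)
    return dst
```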
Execution Phases
- Intake & Evidence Gathering
- Collect logs, metrics, configs, seeds, hardware info, and recent code changes.
- Confirm baseline vs expected behavior and incident start time.
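To make intake concrete, here is a minimal evidence-capture sketch, assuming a PyTorch environment; `capture_environment` and the JSON layout are illustrative:

```python
# Evidence-capture sketch, assuming PyTorch. Snapshots framework/hardware
# facts, the seed, and the run config into one JSON artifact so the
# incident report has a reproducible record.
import json
import platform

import torch

def capture_environment(config: dict, seed: int, out_path: str = "incident_env.json") -> dict:
    """Snapshot library versions, hardware, seed, and run config."""
    cuda = torch.cuda.is_available()
    snapshot = {
        "python": platform.python_version(),
        "torch": torch.__version__,
        "cuda_available": cuda,
        "cuda_device": torch.cuda.get_device_name(0) if cuda else None,
        "cudnn": torch.backends.cudnn.version() if cuda else None,
        "seed": seed,
        "config": config,
    }
    with open(out_path, "w") as f:
        json.dump(snapshot, f, indent=2, default=str)
    return snapshot
```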
- Reproduction & Isolation
- Reproduce on smallest dataset slice; fix seeds; disable randomness.
- Binary-search variables: data batches, preprocessing, model changes, optimizer settings.
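A determinism sketch for the reproduction step, assuming PyTorch and NumPy; `make_deterministic` is an illustrative helper covering the common randomness sources:

```python
# Determinism sketch: fix every common randomness source so the failure
# replays identically across runs.
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic kernels; raises if an op has no deterministic variant.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    # Required by some CUDA ops (e.g. cuBLAS) for deterministic behavior.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```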
- Hypothesis & Experiment Plan
- Form hypotheses (data corruption, label leakage, optimizer instability, precision issues).
- Plan targeted experiments with success/fail criteria.
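One hypothesis above, train/validation leakage, can be probed directly. A sketch assuming datasets are iterables of (input array, label) pairs; `sample_hash` and `leakage_overlap` are illustrative names:

```python
# Leakage-probe sketch for the data-leakage hypothesis. Hashes raw input
# bytes and counts validation samples that also appear in training data.
import hashlib

import numpy as np

def sample_hash(x: np.ndarray) -> str:
    return hashlib.sha256(np.ascontiguousarray(x).tobytes()).hexdigest()

def leakage_overlap(train_ds, val_ds) -> int:
    """Count validation inputs whose bytes also appear in the training set."""
    train_hashes = {sample_hash(np.asarray(x)) for x, _ in train_ds}
    return sum(sample_hash(np.asarray(x)) in train_hashes for x, _ in val_ds)
```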
- Fix & Validation
- Implement minimal fixes; run controlled tests (train/val curves, gradient norms, loss stats).
- Validate against performance/latency targets; ensure no regression on baseline metrics.
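A sketch of the instrumentation this phase relies on, assuming a standard PyTorch training loop; `instrumented_step` and its signature are illustrative:

```python
# Validation-instrumentation sketch: log loss and the global gradient norm
# each step so before/after-fix runs can be compared on the same curves.
import math

import torch

def instrumented_step(model, optimizer, loss_fn, batch, targets, max_grad_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()
    # clip_grad_norm_ returns the pre-clip global norm -- a cheap stability signal.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    if not math.isfinite(loss.item()):
        raise RuntimeError(f"non-finite loss {loss.item()} (grad norm {float(grad_norm):.3g})")
    optimizer.step()
    return loss.item(), float(grad_norm)
```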
- Handoff & Prevention
- Document root cause, applied fixes, and remaining risks.
- Add monitors/tests to prevent recurrence; package rollback instructions.
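For the prevention step, a pytest-style smoke test can guard against recurrence. A sketch reusing `instrumented_step` from the previous phase; `build_model` and `tiny_batches` are hypothetical project helpers:

```python
# Prevention sketch: a regression guard that fails CI if training goes
# non-finite or the loss stops improving on a fixed data slice.
import math

def test_training_remains_stable():
    model, optimizer, loss_fn = build_model(seed=0)    # hypothetical factory
    losses = []
    for batch, targets in tiny_batches(n=20):          # hypothetical fixed slice
        loss, _ = instrumented_step(model, optimizer, loss_fn, batch, targets)
        assert math.isfinite(loss), "loss went non-finite"
        losses.append(loss)
    assert losses[-1] < losses[0], "loss failed to decrease on the smoke slice"
```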
Output Format
- Incident summary and constraints.
- Reproduction steps, hypotheses, and experiments run.
- Fixes applied with before/after metrics.
- Prevention plan and owners.
- Confidence statement with ceiling.
Validation Checklist
- Issue reproduced with fixed seeds and minimal data slice.
- Hypotheses tested; experiments documented.
- Metrics reported per split; gradients/loss inspected where relevant.
- Regression checks executed; rollback path documented.
- Confidence ceiling stated.
VCL COMPLIANCE APPENDIX (Internal)
[[HON:teineigo]] [[MOR:root:H-T-A]] [[COM:Hata+Teshis+Analiz]] [[CLS:ge_skill]] [[EVD:-DI]] [[ASP:nesov.]] [[SPC:path:/skills/specialists/ml-training-debugger]]
[[HON:teineigo]] [[MOR:root:E-P-S]] [[COM:Epistemik+Tavan]] [[CLS:ge_rule]] [[EVD:-DI]] [[ASP:nesov.]] [[SPC:coord:EVD-CONF]]
Confidence: 0.70 (ceiling: inference 0.70) - SOP rebuilt with prompt-architect constraint discipline and skill-forge structure/validation rules.
Repository
https://github.com/DNYoussef/context-cascade