prove-it
Gauntlet for absolute claims (always/never/guaranteed/optimal); pressure-test, then refine with explicit boundaries.
Installer
git clone https://github.com/tkersey/dotfiles /tmp/dotfiles && cp -r /tmp/dotfiles/codex/skills/prove-it ~/.claude/skills/
Tip: run this command in your terminal to install the skill.
SKILL.md
name: prove-it
description: Gauntlet for absolute claims (always/never/guaranteed/optimal); pressure-test, then refine with explicit boundaries.
Prove It
When to use
- The user asserts certainty: “always”, “never”, “guaranteed”, “optimal”, “cannot fail”, “no downside”, “100%”.
- The user asks for a devil’s advocate or proof.
- The claim feels too clean for the domain.
Round cadence (mandatory)
- Run exactly one gauntlet round per assistant turn.
- After each round, publish:
- Round Ledger
- Knowledge Delta
- Only batch rounds if the user explicitly requests “fast mode”.
Quick start
- Restate the claim and its scope.
- Ask whether to use fast mode (default: one round per turn).
- Run round 1 and publish the Round Ledger + Knowledge Delta.
- Continue round-by-round until Oracle synthesis.
Ten-round gauntlet
- Counterexamples: smallest concrete break.
- Logic traps: missing quantifiers/premises.
- Boundary cases: zero/one/max/empty/extreme scale.
- Adversarial inputs: worst-case distributions/abuse.
- Alternative paradigms: a different model flips the conclusion.
- Operational constraints: latency/cost/compliance/availability.
- Probabilistic uncertainty: variance, tail risk, sampling bias.
- Comparative baselines: “better than what?”, on which metric?
- Meta-test: fastest disproof experiment (sketched in code after this list).
- Oracle synthesis: tightest surviving claim with boundaries.
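Rounds 1 and 9 can be made executable. A minimal sketch, assuming a sample claim (“floating-point addition is always associative”) and an illustrative candidate grid; neither is part of the skill itself:

```python
# Rounds 1 and 9 made executable for a sample claim:
# "floating-point addition is always associative".
# The claim, the candidate grid, and the magnitude-based "smallest" rule
# are illustrative assumptions, not part of the skill.
import itertools

def associative(x: float, y: float, z: float) -> bool:
    return (x + y) + z == x + (y + z)

# Meta-test: the fastest disproof here is a brute-force sweep of a tiny grid.
candidates = [0.1, 0.2, 0.3, 1.0, 1e16, -1e16]
breaks = [t for t in itertools.product(candidates, repeat=3)
          if not associative(*t)]

if breaks:
    # Counterexamples round: report the smallest concrete break.
    smallest = min(breaks, key=lambda t: sum(abs(v) for v in t))
    print("Smallest counterexample found:", smallest)
else:
    print("Survived this round; widen the grid before trusting the claim.")
```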
Round question bank (pick 1–2)
- Counterexamples: What is the smallest input that breaks this?
- Logic traps: What unstated assumption must hold?
- Boundary cases: Which boundary is most likely in real use?
- Adversarial: What does worst-case input look like?
- Alternative paradigm: What objective makes the opposite true?
- Operational: Which dependency/policy is a hard stop?
- Uncertainty: What distribution shift flips the result?
- Baseline: Better than what, on which metric?
- Meta-test: What experiment would change your mind fastest?
- Oracle: What explicit boundaries keep this honest?
Core artifacts
Argument map
Claim:
Premises:
- P1:
- P2:
Hidden assumptions:
- A1:
Weak links:
- W1:
Disproof tests:
- T1:
Refined claim:
Round Ledger (update every turn)
Round: <1-10>
Focus:
Claim scope:
New evidence:
New counterexample:
Knowledge Delta:
Remaining gaps:
Next round:
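The ledger is easier to keep honest as a structure. A minimal sketch whose field names mirror the template above; the dataclass and rendering format are illustrative assumptions:

```python
# A per-turn ledger as a structure that renders the published block.
# Field names mirror the template above; the class and output format
# are illustrative assumptions.
from dataclasses import dataclass, fields

@dataclass
class RoundLedger:
    round: int              # 1-10
    focus: str
    claim_scope: str
    new_evidence: str
    new_counterexample: str
    knowledge_delta: str
    remaining_gaps: str
    next_round: str

    def render(self) -> str:
        labels = ["Round", "Focus", "Claim scope", "New evidence",
                  "New counterexample", "Knowledge Delta",
                  "Remaining gaps", "Next round"]
        values = [getattr(self, f.name) for f in fields(self)]
        return "\n".join(f"{label}: {value}"
                         for label, value in zip(labels, values))

print(RoundLedger(
    round=1, focus="Counterexamples", claim_scope="all float inputs",
    new_evidence="brute-force sweep ran",
    new_counterexample="(0.1, 0.2, 0.3)",
    knowledge_delta="claim fails under rounding",
    remaining_gaps="integer inputs untested",
    next_round="Logic traps",
).render())
```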
Claim boundary table
| Boundary type | Valid when | Invalid when | Assumptions | Stressors |
|---------------|-----------|--------------|-------------|-----------|
| Scale | | | | |
| Data quality | | | | |
| Environment | | | | |
| Adversary | | | | |
Next-tests plan
| Test | Data needed | Success threshold | Stop condition |
|------|-------------|-------------------|----------------|
Domain packs
Performance
Use when the claim is about speed, latency, throughput, or resources.
- Clarify: median vs tail latency vs throughput.
- Identify workload shape (spiky vs steady) and the bottleneck resource (percentile sketch below).
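Why the median-vs-tail split matters, as a runnable sketch; the latency samples are fabricated for illustration:

```python
# Median vs tail: the same samples can satisfy a "fast" claim at p50
# and break it at p99. The sample data is fabricated for illustration.
import math

latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 250, 12]  # one tail spike

def percentile(samples: list[float], p: float) -> float:
    xs = sorted(samples)
    rank = math.ceil(p / 100 * len(xs))   # nearest-rank method
    return xs[rank - 1]

print(f"p50={percentile(latencies_ms, 50)}ms "
      f"p99={percentile(latencies_ms, 99)}ms")
# "Always under 20 ms" survives at the median and dies in the tail.
```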
Product
Use when the claim is about user impact, adoption, or behavior.
- Clarify user segment and success metric.
- State the baseline/counterfactual (sketched after this list).
- Name the likely unintended behavior/tradeoff.
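A baseline makes “better” falsifiable. A minimal sketch; the segment numbers are placeholders, not data:

```python
# "Better than what?" for a product claim: compare the segment against
# its own baseline rate, not against zero. All numbers are placeholders.
baseline_conversion = 0.040   # segment rate before the change
observed_conversion = 0.046   # segment rate after the change

lift = (observed_conversion - baseline_conversion) / baseline_conversion
print(f"Relative lift vs baseline: {lift:.1%}")  # 15.0%
```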
Oracle synthesis template (final)
Original claim:
Refined claim:
Boundaries:
- Valid when:
- Invalid when:
Confidence trail:
- Evidence:
- Gaps:
Next tests:
- ...
Deliverable format (per turn)
- Round number + focus.
- Round Ledger + Knowledge Delta.
- At most one question for the user (if needed).
Activation cues
- "always" / "never" / "guaranteed" / "optimal" / "cannot fail" / "no downside" / "100%"
- "prove it" / "devil's advocate" / "stress test" / "rigor"
