# eval-review

Review LLM-generated diff classifications for the diffview eval system. Use when the user invokes this skill or pastes yanked case data from evalreview (containing the "# Diff Classification Review" header). Evaluates whether classifications accurately help code reviewers and provides a pass/fail verdict with a detailed critique.

## Installation

```bash
git clone https://github.com/fwojciec/diffstory /tmp/diffstory && cp -r /tmp/diffstory/.claude/skills/eval-review ~/.claude/skills/eval-review
```

Run this command in your terminal to install the skill.



Review diff classification cases to determine whether the LLM correctly categorized hunks, identified the change type, and told a coherent story.

## Why Classifications Matter

The classification drives a story-guided review UI where:

1. **PRs become "pages"**: each story section is a page showing a subset of hunks with narration
2. **Categories control visibility**:
   - `core` → shown prominently. The reviewer must see this.
   - `supporting` → shown de-emphasized. Worth a glance.
   - `noise` → suppressed. Don't waste reviewer attention.
3. **Narration guides understanding**: section explanations tell the reviewer what they're looking at and why it matters

**The stakes:** a miscategorized hunk either hides something important (`core` → `noise`) or wastes attention on trivia (`noise` → `core`). The classifier's job is to help reviewers focus.

## Workflow

1. **Receive case data**: the user pastes yanked content from evalreview (`y` key)
2. **Gather context**: fetch additional context to verify claims
3. **Evaluate**: apply the criteria tables below
4. **Verdict**: provide pass/fail with actionable critique

## Context Discovery

Before evaluating, gather the following context (extract the repo, branch, and commit hashes from the input):

### 1. Beads Issue (intent)

```bash
bd show <branch>   # e.g., bd show diffview-7yu
```

**Gives:** why the change was made and what problem it solves.

### 2. Full Files at Commit (surrounding code)

```bash
git show <hash>:<filepath>   # e.g., git show abc123:pkg/auth/login.go
```

**Gives:** the functions around the changed hunks, so you can check whether a "systematic" hunk is actually changing behavior.

### 3. PR Description (if one exists)

```bash
gh pr list --head <branch> --json number,title,body --jq '.[0]'
```

**Gives:** the author's description of the change, which helps verify summary accuracy.

### 4. Recent History of Changed Files

```bash
git log --oneline -3 <filepath>
```

**Gives:** whether this is a refactor of recent work or new functionality.

What each context verifies (all four lookups can be combined into one pass; see the sketch below):

- Beads issue → `change_type`, narrative pattern
- Full files → hunk categories (`core` vs `systematic`)
- PR description → summary quality
- File history → feature vs refactor distinction
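
A minimal sketch of that pass, assuming the repo is checked out locally. The values below are the placeholder examples from the steps above, not real case data:

```bash
# Placeholder values; substitute the branch, hash, and paths
# extracted from the pasted case data.
BRANCH=diffview-7yu
HASH=abc123
FILE=pkg/auth/login.go

bd show "$BRANCH"              # intent: why the change was made
git show "$HASH:$FILE"         # full file at the commit
gh pr list --head "$BRANCH" --json number,title,body --jq '.[0]'   # author's description
git log --oneline -3 "$FILE"   # recent history of the file
```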

## Evaluation Criteria

### Change Types

| Type | Correct when... | Misapplied when... |
|------|-----------------|--------------------|
| `bugfix` | Removes incorrect behavior or adds missing behavior that should have existed | Adds a genuinely new capability (→ `feature`) |
| `feature` | Adds a new capability that didn't exist before | Restructures an existing capability (→ `refactor`) |
| `refactor` | Same behavior, different structure | Behavior actually changes |
| `chore` | Maintenance: deps, CI, build scripts. No functional changes | Includes functional changes |
| `docs` | Only docs/comments change | Code changes accompany doc changes |

### Narratives

| Pattern | Correct when... | Misapplied when... |
|---------|-----------------|--------------------|
| `cause-effect` | Clear problem → solution structure | No identifiable "problem" |
| `core-periphery` | One central change causes ripples | Changes are independent |
| `before-after` | Transformation from an old to a new pattern | No clear "before" state |
| `rule-instances` | Pattern defined once, applied N times | Each change is unique |
| `entry-implementation` | Public API defined, then implementation | No clear API boundary |

### Hunk Categories

| Category | Correct when... | Dangerous if wrong |
|----------|-----------------|--------------------|
| `core` | Changes program behavior. Must review | Marked `systematic` → hides bugs |
| `systematic` | Mechanical: renames, params, boilerplate | Marked `core` → wastes attention |
| `refactoring` | Code moves without behavior change | Behavior actually changes |
| `noise` | Formatting, whitespace only | Real changes hidden |

**Key test:** if skipping a hunk could cause a reviewer to miss a bug, it's `core`. One way to apply the test is sketched below.
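
A minimal sketch of that check, reusing the hypothetical hash and path from the context-gathering examples (not part of the eval system itself):

```bash
# Diff the whole file across the commit. If anything beyond the claimed
# mechanical change (rename, parameter shuffle) differs, the hunk is core.
git show abc123^:pkg/auth/login.go > /tmp/before.go   # parent commit's version
git show abc123:pkg/auth/login.go  > /tmp/after.go
diff -u /tmp/before.go /tmp/after.go
```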

### Beads/Issue Tracker Changes

Beads files (`.beads/`) require special handling based on what they affect:

| Change Scope | Category | UI Treatment | Rationale |
|--------------|----------|--------------|-----------|
| Current issue status/closed | `noise` | Suppress | Reviewer is already reviewing this work |
| Current issue notes updated | `noise` | Suppress | Mechanical bookkeeping |
| Dependency edges added/removed | `noise` | Suppress | Graph maintenance |
| Wiring notes to other issues | `supporting` | Show (de-emphasized) | Documents decisions for future work |
| New issues created | `supporting` | Show (de-emphasized) | Discovered scope worth noting |

**The rule:** changes scoped to `<current-branch-id>` → `noise`. Changes propagating context to other issue IDs → `supporting`.

**Mixed hunks:** if a beads hunk contains both current-issue bookkeeping and downstream wiring notes, categorize by the dominant content. Pure status updates = `noise`; substantive wiring notes = `supporting`. A rough heuristic for spotting cross-issue references is sketched below.
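
The heuristic, assuming issue IDs follow the `diffview-` prefix seen in the examples above (the prefix, diff range, and path are assumptions, not part of the eval system):

```bash
# List issue IDs touched by the .beads changes, excluding the current one.
# Assumes the change under review is the most recent commit.
CURRENT=diffview-7yu
git diff HEAD~1 -- .beads/ | grep '^[+-]' \
  | grep -oE 'diffview-[a-z0-9]+' | sort -u | grep -v "^${CURRENT}$"
# Any remaining ID suggests cross-issue wiring: supporting, not noise.
```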

### Section Roles

| Role | Contains... |
|------|-------------|
| `problem` | What's broken/missing |
| `fix` | The solution |
| `test` | Test additions/modifications |
| `core` | Essential logic changes |
| `supporting` | Enables but isn't the focus (including exceptions to patterns) |
| `pattern` | Repeatable approach being applied |
| `interface` | API/interface definitions |
| `cleanup` | Removing old code |

**Grouping test:** hunks in a section should be semantically related, not just in the same file.

## Output Format

Use this structured format for consistent, scannable reviews:

```markdown
## Verdict: PASS | FAIL

[One sentence summary of the classification quality]

## Evaluation

### Change Type: [type] — ✓ Correct | ✗ Wrong
[1-2 sentences explaining why this type fits or doesn't fit]

### Narrative: [pattern] — ✓ Correct | ✗ Wrong
[1-2 sentences explaining why this pattern fits or doesn't fit]

### Hunk Assignments
| Hunk | Assigned | Correct? | Issue |
|------|----------|----------|-------|
| file.go:H0 | core | ✓ | — |
| test.go:H0 | test | ✓ | — |
| .beads:H0 | supporting | ✓ | Could argue noise |

### Section Structure
[Are sections well-organized? Do explanations help reviewers?]

## Context Gathered
- `bd show <id>` — [what it revealed]
- `git log <file>` — [what it revealed]
```

Formatting rules:

- Use ✓ for correct, ✗ for incorrect
- Keep each evaluation to 1-2 sentences max
- The hunk table should list every hunk with a quick assessment
- If the verdict is FAIL, the "Issue" column should state what the correct assignment is