llm-testing
Testing patterns for LLM-based applications. Use when testing AI/ML integrations, mocking LLM responses, testing async timeouts, or validating structured outputs from LLMs.
Install

```bash
git clone https://github.com/yonatangross/skillforge-claude-plugin /tmp/skillforge-claude-plugin && cp -r /tmp/skillforge-claude-plugin/.claude/skills/llm-testing ~/.claude/skills/skillforge-claude-plugin/
```

Tip: run this command in your terminal to install the skill.
SKILL.md
```yaml
---
name: llm-testing
description: Testing patterns for LLM-based applications. Use when testing AI/ML integrations, mocking LLM responses, testing async timeouts, or validating structured outputs from LLMs.
context: fork
agent: test-generator
version: 2.0.0
tags: [testing, llm, ai, deepeval, ragas, 2026]
hooks:
  PostToolUse:
    - matcher: "Write|Edit"
      command: "$CLAUDE_PROJECT_DIR/.claude/hooks/skill/test-runner.sh"
  Stop:
    - command: "$CLAUDE_PROJECT_DIR/.claude/hooks/skill/eval-metrics-collector.sh"
---
```
LLM Testing Patterns
Test AI applications with deterministic patterns using DeepEval and RAGAS.
When to Use
- LLM integration testing
- Async timeout validation
- Structured output testing
- Quality gate testing
- RAG pipeline evaluation
Quick Reference
Mock LLM Responses
```python
from unittest.mock import AsyncMock, patch

import pytest

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```
DeepEval Quality Testing
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

assert_test(test_case, metrics)
```
Timeout Testing
```python
import asyncio

import pytest

@pytest.mark.asyncio
async def test_respects_timeout():
    # asyncio.timeout requires Python 3.11+; use asyncio.wait_for on older versions
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):
            await slow_llm_call()
```
Quality Metrics (2026)
| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | ≥ 0.7 | Response addresses question |
| Faithfulness | ≥ 0.8 | Output matches context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts relevant |
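The thresholds above map onto DeepEval metric constructors. A minimal sketch for the two metrics not shown in the Quick Reference; the extra `expected_output` and `context` fields are required by these metrics and are assumptions added here, not part of the earlier example:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualPrecisionMetric, HallucinationMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris.",                               # required by ContextualPrecisionMetric
    context=["Paris is the capital of France."],            # required by HallucinationMetric
    retrieval_context=["Paris is the capital of France."],  # required by ContextualPrecisionMetric
)

assert_test(test_case, [
    HallucinationMetric(threshold=0.3),        # lower-is-better: fails if hallucination score > 0.3
    ContextualPrecisionMetric(threshold=0.7),  # higher-is-better: fails if precision < 0.7
])
```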
Anti-Patterns (FORBIDDEN)
```python
# ❌ NEVER test against live LLM APIs in CI
response = await openai.chat.completions.create(...)

# ❌ NEVER use random seeds (non-deterministic)
model.generate(seed=random.randint(0, 100))

# ❌ NEVER skip timeout handling
await llm_call()  # No timeout!

# ✅ ALWAYS mock LLM in unit tests
with patch("app.llm", mock_llm):
    result = await function_under_test()

# ✅ ALWAYS use VCR.py for integration tests
@pytest.mark.vcr()
async def test_llm_integration():
    ...
```
Key Decisions
| Decision | Recommendation |
|---|---|
| Mock vs VCR | VCR for integration, mock for unit |
| Timeout | Always test with < 1s timeout |
| Schema validation | Test both valid and invalid |
| Edge cases | Test all null/empty paths |
| Quality metrics | Use multiple dimensions (3-5) |
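For the edge-case row, one hedged way to cover null/empty paths is a parametrized test; `synthesize_findings` and its empty-input behaviour are assumptions carried over from the mocking example above:

```python
from unittest.mock import patch

import pytest

@pytest.mark.asyncio
@pytest.mark.parametrize("findings", [None, [], [{}]])
async def test_handles_null_and_empty_paths(findings, mock_llm):
    # Assumed contract: null/empty input yields an empty summary rather than raising
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(findings)
    assert result["summary"] in (None, "")
```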
Detailed Documentation
| Resource | Description |
|---|---|
| references/deepeval-ragas-api.md | DeepEval & RAGAS API reference |
| examples/test-patterns.md | Complete test examples |
| checklists/llm-test-checklist.md | Setup and review checklists |
| templates/llm-test-template.py | Starter test template |
Related Skills
- vcr-http-recording - Record LLM responses
- llm-evaluation - Quality assessment
- unit-testing - Test fundamentals
Capability Details
llm-response-mocking
Keywords: mock LLM, fake response, stub LLM, mock AI
Solves:
- Mock LLM responses in tests
- Create deterministic AI test fixtures
- Avoid live API calls in CI
async-timeout-testing
Keywords: timeout, async test, wait for, polling
Solves:
- Test async LLM operations
- Handle timeout scenarios
- Implement polling assertions (see the sketch below)
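A minimal polling-assertion sketch; `start_synthesis_job` and the `job_store` fixture are hypothetical names, and `asyncio.timeout` needs Python 3.11+ (use `asyncio.wait_for` on older versions):

```python
import asyncio

import pytest

async def wait_for_condition(predicate, timeout=2.0, interval=0.05):
    """Poll a synchronous predicate until it returns True or the timeout expires."""
    async with asyncio.timeout(timeout):
        while not predicate():
            await asyncio.sleep(interval)

@pytest.mark.asyncio
async def test_background_synthesis_completes(job_store):
    job_id = await start_synthesis_job()  # hypothetical async job API
    await wait_for_condition(lambda: job_store.status(job_id) == "done")
    assert job_store.result(job_id)["summary"]
```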
structured-output-validation
Keywords: structured output, JSON validation, schema validation, output format
Solves:
- Validate structured LLM output
- Test JSON schema compliance
- Assert output structure (see the sketch below)
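A hedged sketch of schema validation with Pydantic v2; the `Finding` model is hypothetical, so swap in your own response schema:

```python
import pytest
from pydantic import BaseModel, Field, ValidationError

class Finding(BaseModel):
    summary: str
    confidence: float = Field(ge=0.0, le=1.0)

def test_valid_llm_output_parses():
    finding = Finding.model_validate({"summary": "ok", "confidence": 0.85})
    assert finding.confidence == 0.85

def test_invalid_llm_output_rejected():
    with pytest.raises(ValidationError):
        Finding.model_validate({"summary": "ok", "confidence": 1.7})  # out of range
```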
deepeval-assertions
Keywords: DeepEval, assert_test, LLMTestCase, metric assertion
Solves:
- Use DeepEval for LLM assertions
- Implement metric-based tests
- Configure quality thresholds
golden-dataset-testing
Keywords: golden dataset, golden test, reference output, expected output
Solves:
- Test against golden datasets
- Compare with reference outputs
- Implement regression testing (see the sketch below)
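A minimal golden-dataset regression sketch; the file path, its record fields, and `synthesize_findings` are assumptions for illustration:

```python
import json
from pathlib import Path
from unittest.mock import patch

import pytest

# Hypothetical golden file: a list of {"id", "input", "llm_response", "expected_summary"} records
GOLDEN = json.loads(Path("tests/golden/synthesis_cases.json").read_text())

@pytest.mark.asyncio
@pytest.mark.parametrize("case", GOLDEN, ids=[c["id"] for c in GOLDEN])
async def test_matches_golden_output(case, mock_llm):
    mock_llm.return_value = case["llm_response"]  # replay the recorded model answer deterministically
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(case["input"])
    assert result["summary"] == case["expected_summary"]
```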
vcr-recording
Keywords: VCR, cassette, record, replay, HTTP recording
Solves:
- Record LLM API responses
- Replay recordings in tests
- Create deterministic test suites (see the config sketch below)
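If you use pytest-recording (which provides `@pytest.mark.vcr`), cassette behaviour is configured through a `vcr_config` fixture; a hedged conftest.py sketch:

```python
# conftest.py
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    # pytest-recording reads this fixture for every @pytest.mark.vcr test
    return {
        "filter_headers": ["authorization", "x-api-key"],  # keep API keys out of cassettes
        "record_mode": "once",                             # record on the first run, replay afterwards
    }
```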
Repository
yonatangross/skillforge-claude-plugin/.claude/skills/llm-testing