llm-testing
Testing patterns for LLM-based applications. Use when testing AI/ML integrations, mocking LLM responses, testing async timeouts, or validating structured outputs from LLMs.
Install

```bash
git clone https://github.com/yonatangross/skillforge-claude-plugin /tmp/skillforge-claude-plugin && cp -r /tmp/skillforge-claude-plugin/.claude/skills/llm-testing ~/.claude/skills/skillforge-claude-plugin/
```

Tip: run this command in your terminal to install the skill.
SKILL.md
```yaml
---
name: llm-testing
description: Testing patterns for LLM-based applications. Use when testing AI/ML integrations, mocking LLM responses, testing async timeouts, or validating structured outputs from LLMs.
context: fork
agent: test-generator
version: 2.0.0
tags: [testing, llm, ai, deepeval, ragas, 2026]
hooks:
  PostToolUse:
    - matcher: "Write|Edit"
      command: "$CLAUDE_PROJECT_DIR/.claude/hooks/skill/test-runner.sh"
  Stop:
    - command: "$CLAUDE_PROJECT_DIR/.claude/hooks/skill/eval-metrics-collector.sh"
---
```
LLM Testing Patterns
Test AI applications with deterministic patterns using DeepEval and RAGAS.
When to Use
- LLM integration testing
- Async timeout validation
- Structured output testing
- Quality gate testing
- RAG pipeline evaluation
Quick Reference
Mock LLM Responses
```python
from unittest.mock import AsyncMock, patch

import pytest

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```
DeepEval Quality Testing
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

assert_test(test_case, metrics)
```
Timeout Testing
```python
import asyncio

import pytest

@pytest.mark.asyncio
async def test_respects_timeout():
    # asyncio.timeout requires Python 3.11+; use asyncio.wait_for on older versions
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):
            await slow_llm_call()
```
Quality Metrics (2026)
| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | ≥ 0.7 | Response addresses question |
| Faithfulness | ≥ 0.8 | Output matches context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts relevant |
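The thresholds above map onto DeepEval metric constructors. A minimal sketch for the two metrics not shown in the Quick Reference; the extra `expected_output` and `context` fields are required by these metrics and are assumptions added here, not part of the earlier example:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualPrecisionMetric, HallucinationMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris.",                               # required by ContextualPrecisionMetric
    context=["Paris is the capital of France."],            # required by HallucinationMetric
    retrieval_context=["Paris is the capital of France."],  # required by ContextualPrecisionMetric
)

assert_test(test_case, [
    HallucinationMetric(threshold=0.3),        # lower-is-better: fails if hallucination score > 0.3
    ContextualPrecisionMetric(threshold=0.7),  # higher-is-better: fails if precision < 0.7
])
```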
Anti-Patterns (FORBIDDEN)
```python
# ❌ NEVER test against live LLM APIs in CI
response = await openai.chat.completions.create(...)

# ❌ NEVER use random seeds (non-deterministic)
model.generate(seed=random.randint(0, 100))

# ❌ NEVER skip timeout handling
await llm_call()  # No timeout!

# ✅ ALWAYS mock LLM in unit tests
with patch("app.llm", mock_llm):
    result = await function_under_test()

# ✅ ALWAYS use VCR.py for integration tests
@pytest.mark.vcr()
async def test_llm_integration():
    ...
```
Key Decisions
| Decision | Recommendation |
|---|---|
| Mock vs VCR | VCR for integration, mock for unit |
| Timeout | Always test with < 1s timeout |
| Schema validation | Test both valid and invalid |
| Edge cases | Test all null/empty paths |
| Quality metrics | Use multiple dimensions (3-5) |
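For the edge-case row, one hedged way to cover null/empty paths is a parametrized test; `synthesize_findings` and its empty-input behaviour are assumptions carried over from the mocking example above:

```python
from unittest.mock import patch

import pytest

@pytest.mark.asyncio
@pytest.mark.parametrize("findings", [None, [], [{}]])
async def test_handles_null_and_empty_paths(findings, mock_llm):
    # Assumed contract: null/empty input yields an empty summary rather than raising
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(findings)
    assert result["summary"] in (None, "")
```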
Detailed Documentation
| Resource | Description |
|---|---|
| references/deepeval-ragas-api.md | DeepEval & RAGAS API reference |
| examples/test-patterns.md | Complete test examples |
| checklists/llm-test-checklist.md | Setup and review checklists |
| templates/llm-test-template.py | Starter test template |
Related Skills
- vcr-http-recording - Record LLM responses
- llm-evaluation - Quality assessment
- unit-testing - Test fundamentals
Capability Details
llm-response-mocking
Keywords: mock LLM, fake response, stub LLM, mock AI
Solves:
- Mock LLM responses in tests
- Create deterministic AI test fixtures
- Avoid live API calls in CI
async-timeout-testing
Keywords: timeout, async test, wait for, polling
Solves:
- Test async LLM operations
- Handle timeout scenarios
- Implement polling assertions (see the sketch below)
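A minimal polling-assertion sketch; `start_synthesis_job` and the `job_store` fixture are hypothetical names, and `asyncio.timeout` needs Python 3.11+ (use `asyncio.wait_for` on older versions):

```python
import asyncio

import pytest

async def wait_for_condition(predicate, timeout=2.0, interval=0.05):
    """Poll a synchronous predicate until it returns True or the timeout expires."""
    async with asyncio.timeout(timeout):
        while not predicate():
            await asyncio.sleep(interval)

@pytest.mark.asyncio
async def test_background_synthesis_completes(job_store):
    job_id = await start_synthesis_job()  # hypothetical async job API
    await wait_for_condition(lambda: job_store.status(job_id) == "done")
    assert job_store.result(job_id)["summary"]
```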
structured-output-validation
Keywords: structured output, JSON validation, schema validation, output format
Solves:
- Validate structured LLM output
- Test JSON schema compliance
- Assert output structure (see the sketch below)
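A hedged sketch of schema validation with Pydantic v2; the `Finding` model is hypothetical, so swap in your own response schema:

```python
import pytest
from pydantic import BaseModel, Field, ValidationError

class Finding(BaseModel):
    summary: str
    confidence: float = Field(ge=0.0, le=1.0)

def test_valid_llm_output_parses():
    finding = Finding.model_validate({"summary": "ok", "confidence": 0.85})
    assert finding.confidence == 0.85

def test_invalid_llm_output_rejected():
    with pytest.raises(ValidationError):
        Finding.model_validate({"summary": "ok", "confidence": 1.7})  # out of range
```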
deepeval-assertions
Keywords: DeepEval, assert_test, LLMTestCase, metric assertion
Solves:
- Use DeepEval for LLM assertions
- Implement metric-based tests
- Configure quality thresholds
golden-dataset-testing
Keywords: golden dataset, golden test, reference output, expected output
Solves:
- Test against golden datasets
- Compare with reference outputs
- Implement regression testing (see the sketch below)
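A minimal golden-dataset regression sketch; the file path, its record fields, and `synthesize_findings` are assumptions for illustration:

```python
import json
from pathlib import Path
from unittest.mock import patch

import pytest

# Hypothetical golden file: a list of {"id", "input", "llm_response", "expected_summary"} records
GOLDEN = json.loads(Path("tests/golden/synthesis_cases.json").read_text())

@pytest.mark.asyncio
@pytest.mark.parametrize("case", GOLDEN, ids=[c["id"] for c in GOLDEN])
async def test_matches_golden_output(case, mock_llm):
    mock_llm.return_value = case["llm_response"]  # replay the recorded model answer deterministically
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(case["input"])
    assert result["summary"] == case["expected_summary"]
```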
vcr-recording
Keywords: VCR, cassette, record, replay, HTTP recording
Solves:
- Record LLM API responses
- Replay recordings in tests
- Create deterministic test suites (see the config sketch below)
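If you use pytest-recording (which provides `@pytest.mark.vcr`), cassette behaviour is configured through a `vcr_config` fixture; a hedged conftest.py sketch:

```python
# conftest.py
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    # pytest-recording reads this fixture for every @pytest.mark.vcr test
    return {
        "filter_headers": ["authorization", "x-api-key"],  # keep API keys out of cassettes
        "record_mode": "once",                             # record on the first run, replay afterwards
    }
```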
Repository
yonatangross/skillforge-claude-plugin/.claude/skills/llm-testing