test-writer

MANDATORY - INVOKE BEFORE writing ANY test code (def test_*, class Test*). Prevents brittle tests. Read this skill first, then write tests.

$ Installer

git clone https://github.com/cncorp/arsenal /tmp/arsenal && cp -r /tmp/arsenal/dot-claude/skills/test-writer ~/.claude/skills/test-writer

// tip: Run this command in your terminal to install the skill


---
name: test-writer
description: "MANDATORY - INVOKE BEFORE writing ANY test code (def test_*, class Test*). Prevents brittle tests. Read this skill first, then write tests."
---

test-writer Skill

🚨 CRITICAL: MANDATORY FOR ALL TEST WRITING AND UPDATING

YOU CANNOT WRITE OR UPDATE TESTS WITHOUT THIS SKILL.

If you write or update tests without following this skill, you will:

  • Write brittle tests with hardcoded library outputs
  • Create self-evident tests that provide zero value
  • Use fixtures incorrectly (overuse for simple cases, underuse for complex)
  • Test Python/library behavior instead of YOUR code's contracts

This skill is your checklist. Follow it step-by-step. No shortcuts.


🚨 CRITICAL FOR TEST WRITING

  • BEFORE writing tests → Use test-writer skill (MANDATORY - analyzes code type, dependencies, contract)
  • AFTER writing tests → Invoke pytest-test-reviewer agent (validates patterns)
  • YOU CANNOT WRITE TESTS WITHOUT test-writer SKILL - No exceptions, no shortcuts, every test, every time

When to Use This Skill

Use this skill when:

  • ✅ User asks "write tests for X"
  • ✅ You're creating a new test file (test_*.py)
  • ✅ You're adding tests to an existing test file
  • ✅ User says "test this" or "add test coverage"
  • ✅ You've just written code and need to test it
  • ✅ You're updating/modifying existing tests (e.g., when test-fixer needs to update test expectations)
  • ✅ Tests are failing and need to be fixed (use this skill to understand what to change)

DO NOT write or update tests without using this skill. PERIOD.


🔄 How This Skill Interacts With Other Skills

  1. Called by test-fixer when modifying test files - determines if code or contract is wrong
  2. Can call sql-reader to query production data model and design realistic fixtures
  3. MUST call semantic-search before writing tests to find existing test patterns and fixtures:
    • docker exec arsenal-semantic-search-cli code-search find "test <feature>"
    • Check for existing fixtures, test utilities, and similar test patterns
  4. Works autonomously but flags UX contract changes: "⚠️ UX contract change: [explain]"

🚨 CRITICAL: Don't Encode Broken Behavior

When updating tests, ask:

  1. Is the CODE wrong? → Fix code, keep test
  2. Is the TEST wrong? → Update test (legitimate contract change)
  3. Is this encoding BROKEN behavior? → Flag to user and continue

Red flags:

  • "Code changed so I'll update the test" โ† DANGER
  • Test passed โ†’ code changed โ†’ test fails โ†’ changing test instead of code โ† DANGER

Safe updates:

  • Intentional contract change (documented in spec)
  • Refactoring (same behavior, different implementation)
  • Fixing brittle tests (testing implementation not contract)

When in doubt: Flag it and continue autonomously: "⚠️ This may encode broken behavior: [explain]"


Step 1: Analyze the Code Being Tested

Before writing A SINGLE LINE of test code, answer these questions:

Question 1: What type of code is this?

  • Pure function (no side effects, no state, deterministic)

    • Example: def calculate_total(items: list[Item]) -> float
    • Example: def infer_timezone_from_phone(phone: str) -> str | None
  • Database model/ORM (models with relationships, DB operations)

    • Example: create_intervention(message: Message, user: User) -> Intervention
    • Example: get_conversation_messages(conversation_id: int) -> list[Message]
  • API endpoint (FastAPI routes, HTTP handlers)

    • Example: POST /webhook/sendblue
    • Example: GET /conversations/{id}/messages
  • External service integration (calls to OpenAI, Langfuse, SendBlue, etc.)

    • Example: send_intervention_via_sendblue(message: str, phone: str)
    • Example: fetch_langfuse_prompt(prompt_name: str)
  • Business logic with state (complex rules, workflows, state machines)

    • Example: should_send_daily_reminder(user: User, last_intervention: datetime)
    • Example: calculate_conflict_score(message: Message, conversation: Conversation)

Write your answer:

Type: [YOUR ANSWER HERE]
Reasoning: [WHY you chose this type]

Question 2: What are the dependencies?

Check all that apply:

  • External library (phonenumbers, pytz, croniter, etc.)
  • Database (PostgreSQL via SQLAlchemy)
  • External API (OpenAI, Langfuse, SendBlue, etc.)
  • File system
  • Redis/Queue
  • None (pure function with no external deps)

Write your answer:

Dependencies: [LIST THEM]
Which are external (library/API): [WHICH ONES]
Which need mocking: [WHICH ONES]

Question 3: What's YOUR code's contract?

NOT what libraries return. What does YOUR code GUARANTEE?

Think about:

  • What does this function promise to do?
  • What are valid inputs?
  • What are valid outputs?
  • What errors should it raise?
  • What invariants must hold?

Write your answer:

Contract:
- Input guarantees: [e.g., "accepts valid US phone numbers"]
- Output guarantees: [e.g., "returns valid pytz timezone or None"]
- Error handling: [e.g., "returns None for invalid input, doesn't raise"]
- Invariants: [e.g., "US numbers always return America/* timezones"]

Question 4: What are the edge cases?

  • None/empty input?
  • Invalid input?
  • Boundary values (min, max)?
  • Error conditions?
  • Race conditions or timing issues?

Write your answer:

Edge cases to test:
1. [EDGE CASE 1]
2. [EDGE CASE 2]
3. [EDGE CASE 3]
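
Once listed, edge cases usually collapse into a single parametrized test rather than one test per case. A minimal sketch, reusing the infer_timezone_from_phone wrapper from the examples later in this skill (the import path is hypothetical):

import pytest

# Hypothetical import path - point this at wherever your wrapper actually lives
from utils.timezone import infer_timezone_from_phone

@pytest.mark.parametrize("phone", [None, "", "not a phone", "123"])
def test_invalid_phone_returns_none(phone):
    """Invalid input must return None so callers can fall back instead of crashing."""
    assert infer_timezone_from_phone(phone) is None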

Step 2: Choose the Right Test Type

Based on your analysis, determine which test type(s) to use:

Unit Tests (tests/unit/)

When: Complex business logic in isolation
Database: SQLite in-memory
Redis: FakeRedis
APIs: All mocked
Speed: <5s total

Use for:

  • Pure functions with complex logic
  • Business rule combinations
  • Edge cases and boundaries
  • Data transformations

Integration Tests (tests/integration/)

When: Component interactions
Database: SQLite in-memory
Redis: FakeRedis
APIs: All mocked
Speed: <5s total

Use for:

  • Service interactions
  • Database operations
  • API endpoint contracts
  • FastAPI TestClient validation (see the sketch below)
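
A minimal shape for the TestClient case above - a sketch only, since the app import path is an assumption, not this repo's actual module:

from fastapi.testclient import TestClient

from main import app  # assumed import path - use your actual FastAPI app module

client = TestClient(app)

def test_get_conversation_messages_returns_list():
    """Endpoint contract: 200 plus a JSON list - the response shape clients depend on."""
    response = client.get("/conversations/1/messages")
    assert response.status_code == 200
    assert isinstance(response.json(), list)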

E2E Mocked Tests (tests/e2e_mocked/)

When: Critical workflows
Database: Docker PostgreSQL (SHARED - use UUIDs!)
Redis: FakeRedis
APIs: All mocked
Speed: <20s total

Use for:

  • Complete workflows (webhook → queue → worker)
  • Full pipeline testing
  • Integration of multiple components

โš ๏ธ CRITICAL: Use UUID-based unique identifiers for parallel execution:

import uuid

unique_id = str(uuid.uuid4())[:8]
user_name = f"TestUser_{unique_id}"

E2E Live Tests (tests/e2e_live/) 💰

When: Validate prompts with REAL LLMs
Database: SQLite in-memory
Redis: FakeRedis
APIs: REAL (costs money!)
Speed: <60s total

โš ๏ธ COSTS REAL MONEY! Use gpt-4.1-nano for efficiency.

Use for:

  • Prompt validation with real LLMs
  • Langfuse prompt deployment verification
  • Critical AI behavior validation

Smoke Tests (tests/smoke_tests/)

When: Production health validation
Database: Real PostgreSQL (via API)
Redis: Real Redis (via API)
Speed: <60s total

Use for:

  • Deployment validation
  • API availability checks (see the sketch below)
  • Production monitoring
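
A smoke test is typically just a live HTTP check against the deployed service, as sketched below. Both the /health endpoint and the API_BASE_URL environment variable are assumptions - substitute your real deployment details:

import os

import requests

API_BASE_URL = os.environ["API_BASE_URL"]  # e.g. the production deployment URL

def test_api_health_endpoint_responds():
    """Deployment validation: the deployed API is reachable and answering."""
    response = requests.get(f"{API_BASE_URL}/health", timeout=10)
    assert response.status_code == 200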

Write your decision:

Test type: [UNIT | INTEGRATION | E2E_MOCKED | E2E_LIVE | SMOKE]
Reasoning: [WHY this type is appropriate]

Step 3: Decide Fixture Strategy

DO Use Fixtures For:

✅ Database models with relationships:

def test_message_processing(mock_couple_conversation, mock_message):
    # Fixtures handle complex DB setup
    conversation, participants = mock_couple_conversation
    result = process_message(conversation, mock_message)

✅ Complex objects with many fields:

@pytest.fixture
def oauth_client():
    return OAuthClient(
        client_id="...",
        client_secret="...",
        redirect_uri="...",
        # 10+ more required fields
    )

✅ Stateful components:

@pytest.fixture
def redis_connection():
    conn = Redis(...)
    yield conn
    conn.close()

DON'T Use Fixtures For:

โŒ Pure functions with simple inputs:

# โŒ OVERKILL
@pytest.fixture
def phone_numbers():
    return ["+14155551234", "+12125551234"]

def test_timezone(phone_numbers):
    result = infer_timezone(phone_numbers[0])

# โœ… SIMPLE
def test_timezone():
    result = infer_timezone("+14155551234")
    assert result.startswith("America/")

โŒ Simple strings/primitives (< 5 fields):

# โŒ Unnecessary fixture
@pytest.fixture
def sample_json():
    return '{"key": "value"}'

# โœ… Inline it
def test_parsing():
    data = '{"key": "value"}'
    assert parse_json(data)["key"] == "value"

Rule of thumb: If your "fixture" is just returning a hardcoded string/dict with <5 fields, inline it.

Write your decision:

Fixtures needed: [YES/NO]
Which fixtures: [LIST THEM OR "NONE"]
Why: [REASONING]

Step 4: The 5 Critical Questions

Before writing ANY assert statement, ask:

1. Am I testing MY code or someone else's?

โŒ Testing library behavior:

# BAD: Testing that phonenumbers library works
def test_phonenumbers_library():
    assert phonenumbers.parse("+14155551234").country_code == 1  # phonenumbers' job!

โœ… Testing MY wrapper's contract:

# GOOD: Testing what MY function guarantees
def test_us_phone_returns_us_timezone():
    result = infer_timezone_from_phone("+14155551234")
    assert result is not None           # MY guarantee: non-None for valid input
    assert result.startswith("America/") # MY guarantee: US number → US timezone
    assert pytz.timezone(result)        # MY guarantee: valid pytz timezone

2. What can change without touching my code?

โŒ Hardcoding external library outputs:

# BAD: Brittle - breaks if phonenumbers updates timezone mappings
def test_timezone_inference():
    assert infer_timezone("+14155551234") == "America/Los_Angeles"
    # phonenumbers controls this exact value, not YOUR code!

โœ… Testing contracts:

# GOOD: Tests behavior, not exact library output
def test_timezone_inference():
    result = infer_timezone("+14155551234")
    assert result.startswith("America/")  # Contract: US timezone
    # Robust to library changing "Los_Angeles" to "Los_Angeles/Pacific"

3. Is this self-evident?

โŒ Self-evident tests:

# BAD: Testing that setting a value works
def test_setting_state():
    participant.state = ConversationState.ACTIVE
    assert participant.state == ConversationState.ACTIVE  # Duh!

# BAD: Testing pass-through logic
def test_returns_input_unchanged():
    result = resolve_timezone("Europe/London", phone=None)
    assert result == "Europe/London"  # Just testing: if x: return x

# BAD: Testing mocks
def test_mock_returns_value():
    mock.get_value.return_value = 42
    assert mock.get_value() == 42  # Of course it does!

✅ Testing business logic:

# GOOD: Tests decision logic (priority order)
def test_timezone_resolution_priority():
    # When both configured AND phone available, configured wins
    result = resolve_timezone("Europe/London", "+14155551234")
    assert result == "Europe/London"  # Tests priority, not pass-through

4. Am I testing "WHAT" or "HOW"?

โŒ Testing implementation (HOW):

# BAD: Exact values from library
assert infer_timezone("+1415...") == "America/Los_Angeles"

✅ Testing contract (WHAT):

# GOOD: Behavior and guarantees
result = infer_timezone("+1415...")
assert result.startswith("America/")  # What: returns US timezone

5. Do I need fixtures/factories?

  • Complex DB setup with relationships → ✅ YES
  • Pure function with primitives → ❌ NO
  • Stateful components → ✅ YES
  • Simple strings/dicts (<5 fields) → ❌ NO

Write your answers:

Q1 (My code or library): [ANSWER]
Q2 (What can change): [ANSWER]
Q3 (Self-evident): [YES/NO + reasoning]
Q4 (What or how): [ANSWER]
Q5 (Need fixtures): [YES/NO + which ones]

Step 5: Anti-Pattern Check

Before writing code, verify you will NOT:

โŒ ANTI-PATTERNS TO AVOID:

1. Hardcoded library outputs:

# โŒ NO
assert infer_timezone("+14155551234") == "America/Los_Angeles"

# โœ… YES
assert infer_timezone("+14155551234").startswith("America/")

2. Self-evident assertions:

# โŒ NO
user.name = "Alice"
assert user.name == "Alice"

# โœ… YES - test business rules
assert can_send_intervention(user) == (user.has_consented and not user.is_banned)

3. Testing library/Python behavior:

# โŒ NO
result = {**dict1, **dict2}
assert len(result) == len(dict1) + len(dict2)  # Testing Python!

# โœ… YES - test YOUR logic
merged = merge_conversation_contexts(conv1, conv2)
assert merged.participant_count == conv1.participant_count + conv2.participant_count

4. Fixtures for primitives:

# โŒ NO
@pytest.fixture
def phone_numbers():
    return ["+14155551234"]

# โœ… YES - inline it
def test_phone():
    result = process_phone("+14155551234")

5. Mock chains:

# โŒ NO
mock.query.return_value.filter.return_value.first.return_value = user

# โœ… YES - specific mock
with patch("data.models.User.get_by_id", return_value=user):

6. Multiple fixture variants:

# โŒ NO
@pytest.fixture
def full_payload(): ...

@pytest.fixture
def partial_payload(): ...

@pytest.fixture
def minimal_payload(): ...

# โœ… YES - one factory with overrides
@pytest.fixture
def payload_factory():
    def _create(**overrides):
        defaults = {"name": "Alice", "consent": True}
        return {**defaults, **overrides}
    return _create

7. Wrong mocking for test type:

# โŒ NO - in E2E_live test
with patch('openai.ChatCompletion.create'):  # Don't mock in live tests!

# โœ… YES - in unit/integration test
with patch('openai.ChatCompletion.create', return_value=mock_response):

Checklist:

  • No hardcoded library outputs?
  • No self-evident assertions?
  • Not testing library/Python behavior?
  • Fixtures used appropriately?
  • No mock chains?
  • Factory fixtures with overrides (not multiple variants)?
  • Correct mocking for test type?

Step 5.5: Pattern Reference - DO THIS, NOT THAT

Before writing code, review these concrete examples of good vs bad test patterns.

Pattern 1: Test Setup

โŒ DON'T create test data inline:

def test_message_processing():
    # 20+ lines of manual setup
    person1 = Persons(name="Alice")
    person2 = Persons(name="Bob")
    conversation = Conversations()
    # ... more boilerplate

✅ DO use shared fixtures:

def test_message_processing(mock_couple_conversation, mock_message):
    # Clean test focused on logic
    conversation, participants = mock_couple_conversation
    result = process_message(conversation, mock_message)

Pattern 2: Test Mocking

โŒ DON'T mock everything or use mock chains:

# Over-mocking with chains
mock.query.return_value.filter.return_value.first.return_value = user

# Wrong mocking for test type - In E2E_live test:
with patch('openai.ChatCompletion.create'):  # NEVER mock live services in e2e_live!

✅ DO use targeted mocking appropriate to test type:

# Unit/Integration: Mock external services
with patch('data.models.message.Message.get_latest', return_value=[]):
    # Test specific integration point

# E2E_live: NEVER mock - use real APIs
response = generate_intervention(message)  # Real OpenAI call
assert "coach" in response.lower()  # Not "therapist"

Pattern 3: Test Assertions - Self-Evident Truths

โŒ DON'T test obvious Python behavior:

# Testing that Python works
user.name = "Alice"
assert user.name == "Alice"  # Self-evident!

# Testing framework features
assert session.commit() is None  # SQLAlchemy always returns None

# Testing that setting a value works
participant.state = ConversationState.ACTIVE
assert participant.state == ConversationState.ACTIVE  # Of course!

# Testing that mocks return what you told them
mock.get_value.return_value = 42
assert mock.get_value() == 42  # Duh!

# Testing Python built-ins
result = {**dict1, **dict2}
assert len(result) == len(dict1) + len(dict2)  # Testing Python!

✅ DO test business logic:

# Tests business rule
def test_consent_required_before_coaching():
    """Ensures coaching only starts after explicit consent."""
    user = create_user(has_consented=False)
    assert not can_send_intervention(user)

# Tests complex logic
def test_conflict_detection():
    message = "You never listen to me!"
    assert detect_conflict_level(message) == "high"

Pattern 4: Test Assertions - Hardcoded vs Computed

โŒ DON'T use hardcoded expected values from formatters:

# BAD: Hardcoded string breaks when format changes
def test_form_to_message():
    message = create_message_from_form({"relationship_type": "romantic"})
    assert "romantic relationship" in message.lower()  # Brittle!

✅ DO compute expected values using actual formatting methods:

# GOOD: Uses the same formatting logic being tested
def test_form_to_message():
    message = create_message_from_form({"relationship_type": "romantic"})
    expected = RELATIONSHIP_TYPE_FIELD.to_message("romantic")
    assert expected and expected.lower() in message.lower()

Pattern 5: Test Organization - Fixtures

โŒ DON'T create multiple fixture variants:

# BAD - creates maintenance burden, violates DRY
@pytest.fixture
def full_payload_data():
    return {"user_name": "Alice", "consent": True, ...}

@pytest.fixture
def partial_payload_data():
    return {"user_name": "Alice", "consent": True, "communication_goals": None}

@pytest.fixture
def minimal_payload_data():
    return {"user_name": "Alice"}

# Now you have 3 fixtures to maintain when schema changes!

✅ DO create one factory fixture with configurable overrides:

@pytest.fixture
def payload_factory() -> Callable:
    """Factory for test payloads with sane defaults and overrides."""
    def _create_payload(user_name: str = "Alice", **overrides):
        defaults = {
            "user_name": user_name,
            "consent": True,
            "relationship_type": "romantic",
            "communication_goals": "better listening",
        }
        defaults.update(overrides)
        return defaults
    return _create_payload

# Usage - customize only what varies per test
def test_full_data(payload_factory):
    payload = payload_factory()  # Uses all defaults

def test_partial_data(payload_factory):
    payload = payload_factory(communication_goals=None)

def test_custom_data(payload_factory):
    payload = payload_factory(user_name="Bob", relationship_type="co-parenting")

Pattern 6: Test Organization - Parallel Execution

โŒ DON'T use hardcoded values in E2E tests:

# BAD: Hardcoded values cause conflicts in parallel execution
def test_workflow():
    user_name = "TestUser"  # Will conflict when tests run in parallel!

✅ DO use UUID-based unique identifiers:

# GOOD: Each test run gets unique data
def test_workflow():
    unique_id = str(uuid.uuid4())[:8]
    user_name = f"TestUser_{unique_id}"  # Parallel-safe

Pattern 7: Test Documentation

โŒ DON'T write technical descriptions:

def test_webhook():
    """Tests POST /webhook returns 200."""

โœ… DO explain business value:

def test_webhook_queues_messages():
    """
    Ensures incoming messages are reliably queued for async processing,
    preventing message loss during high load or worker downtime.
    """

Pattern 8: Test Parametrization

โŒ DON'T write separate tests for each variant:

# BAD - repetitive, hard to maintain
def test_romantic_relationship_creates_fact():
    assert "romantic" in facts

def test_coparenting_relationship_creates_fact():
    assert "co-parenting" in facts

def test_friendship_relationship_creates_fact():
    assert "friendship" in facts

✅ DO use parametrize for common patterns:

# GOOD - single parametrized test
@pytest.mark.parametrize("relationship_type", ["romantic", "co-parenting", "friendship"])
def test_relationship_type_creates_fact(relationship_type):
    assert relationship_type in facts

# GOOD - test business rule combinations
@pytest.mark.parametrize(
    "sender_interventions,recipient_interventions,expected_should_send",
    [
        (False, False, True),   # No recent interventions → send reminder
        (True, False, False),   # Sender has interventions → don't spam
        (False, True, False),   # Recipient has interventions → don't spam
    ],
)
def test_daily_reminder_logic(sender_interventions, recipient_interventions, expected_should_send):
    """Tests reminder logic respects intervention cooldown periods."""
    # Single test implementation covering 3 business rule combinations

Pattern 9: Contract Testing (Library Wrappers)

โŒ DON'T hardcode library outputs:

# BAD: Brittle - breaks if phonenumbers updates mappings
def test_timezone_inference():
    assert infer_timezone_from_phone("+14155551234") == "America/Los_Angeles"

✅ DO test YOUR contract, not library internals:

# GOOD: Contract test
def test_us_phone_returns_us_timezone():
    """
    Valid US phone numbers should return a US timezone.

    Contract test: validates that US numbers map to America/* timezones
    without depending on exact phonenumbers library output that could change.
    """
    result = infer_timezone_from_phone("+14155551234")

    # Test YOUR contract, not library internals
    assert result is not None
    assert result.startswith("America/")  # Contract: US → America/*
    assert pytz.timezone(result)  # Contract: valid timezone

Pattern 10: Wrong Test Type / Fixtures

โŒ DON'T mix test types or use wrong fixtures:

# Wrong fixture for test type
# In unit test:
def test_logic(real_database):  # Should use SQLite/mocks!

# In E2E_mocked:
user_name = "TestUser"  # Hardcoded = parallel test failures

✅ DO use correct test type and fixtures:

# Unit test: SQLite + FakeRedis + Mocks
def test_complex_logic(mock_session, mock_message):
    # Test algorithm only

# E2E_mocked: Docker PostgreSQL + unique data
def test_workflow():
    unique_id = str(uuid.uuid4())[:8]
    user_name = f"TestUser_{unique_id}"  # Parallel-safe

# E2E_live: Real APIs (costs money!)
@pytest.fixture(scope="module")  # Cache expensive calls across the module
def gpt_response():
    client = openai.OpenAI()  # assumes the openai>=1.0 client style
    return client.chat.completions.create(
        model="gpt-4.1-nano",  # Cheapest model
        messages=[{"role": "user", "content": "..."}],
    )

Step 6: Write Test Structure

Now you can write the test. Follow this template:

For Pure Functions:

class TestFunctionName:
    """Test [function_name] [what it does]."""

    def test_[descriptive_name](self):
        """
        [Business value explanation - WHY this test matters]

        [What contract/guarantee this verifies]
        """
        # Arrange: Set up inputs
        input_value = "test_input"

        # Act: Call the function
        result = function_name(input_value)

        # Assert: Verify contract (not exact values!)
        assert result is not None
        assert isinstance(result, ExpectedType)
        assert result.meets_contract()  # Whatever YOUR guarantee is

For Database/Stateful Code:

class TestFeatureName:
    """Test [feature] [what it does]."""

    def test_[descriptive_name](
        self,
        test_db_session: Session,
        mock_fixture_1,
        mock_fixture_2,
    ):
        """
        [Business value explanation - WHY this test matters]

        [What business rule this verifies]
        """
        # Arrange: Use fixtures
        entity = mock_fixture_1()

        # Act: Execute business logic
        result = business_function(entity)

        # Assert: Verify business rules
        test_db_session.refresh(result)
        assert result.state == ExpectedState.CORRECT
        assert result.relationship_set_correctly

For Parametrized Tests:

@pytest.mark.parametrize(
    "input_value,expected_behavior",
    [
        ("value1", "behavior1"),  # Comment explaining this case
        ("value2", "behavior2"),  # Comment explaining this case
        ("edge_case", "edge_behavior"),  # Edge case
    ],
)
def test_[descriptive_name](input_value, expected_behavior):
    """
    [Business value explanation]

    Tests that [function] handles [variety] of inputs correctly.
    """
    result = function_name(input_value)
    assert result.matches_expected(expected_behavior)

For Contract Testing (Library Wrappers):

def test_wrapper_contract(self):
    """
    [What your wrapper guarantees]

    Contract test: validates [YOUR guarantees] without depending on
    exact library outputs that could change.
    """
    result = your_wrapper_function(input)

    # Test YOUR contract, not library internals
    assert result is not None                    # Guarantee: non-None for valid input
    assert result.matches_expected_pattern()     # Guarantee: correct format
    assert result.passes_validation()            # Guarantee: valid output
    # NOT: assert result == "exact_library_value"  # ❌ Brittle!

Step 7: Write Business-Focused Docstrings

Every test MUST have a docstring that explains:

  1. Business value - WHY this test matters
  2. What guarantee/contract it verifies

โŒ BAD - Technical description:

def test_webhook():
    """Tests POST /webhook returns 200."""

✅ GOOD - Business value:

def test_webhook_queues_messages():
    """
    Ensures incoming messages are reliably queued for async processing,
    preventing message loss during high load or worker downtime.
    """

โŒ BAD - Obvious:

def test_timezone_inference():
    """Tests that timezone is inferred from phone."""

✅ GOOD - Contract and value:

def test_us_phone_returns_us_timezone():
    """
    Valid US phone numbers should return a US timezone.

    Contract test: validates that US numbers map to America/* timezones
    without depending on exact phonenumbers library output that could change.
    Ensures scheduling happens in user's local timezone.
    """

Template:

def test_[descriptive_name]():
    """
    [One sentence: business value - what breaks if this fails]

    [Optional: Additional context about contract, edge case, or business rule]
    [Optional: Why this matters for users/product]
    """

Step 8: Golden Rule Check

Before finalizing, ask yourself:

"If this test fails, what business requirement did we break?"

If you can't answer that question clearly, the test shouldn't exist.

Examples:

  • โœ… "We broke the guarantee that US phone numbers return US timezones"
  • โœ… "We broke the rule that interventions require user consent"
  • โœ… "We broke the priority order for timezone resolution"
  • โŒ "We broke... um... setting a value returns that value?" (self-evident)
  • โŒ "We broke... the phonenumbers library?" (not your code)

Write your answer:

If this test fails, we broke: [SPECIFIC BUSINESS REQUIREMENT]

Step 9: Decision Tree Summary

Final check:

  1. Am I testing a business decision or rule? → Write the test
  2. Am I testing that Python/framework features work? → Don't write it
  3. Am I testing what I just set/mocked? → Don't write it
  4. Would this test catch a real bug? → Write the test
  5. Would this test help someone understand the system? → Write the test
  6. Is this test just for coverage percentage? → Don't write it

Step 10: Present Analysis to User

Before writing code, present your analysis:

## Test Writing Analysis

### Code Type
[Pure function | Database model | API endpoint | etc.]
Reasoning: [WHY]

### Dependencies
- [Dependency 1]: [Mock it | Use real | etc.]
- [Dependency 2]: [Mock it | Use real | etc.]

### Contract
YOUR code guarantees:
- [Guarantee 1]
- [Guarantee 2]
- [Guarantee 3]

### Test Type
[UNIT | INTEGRATION | E2E_MOCKED | E2E_LIVE | SMOKE]
Reasoning: [WHY this type]

### Fixture Strategy
[YES: Use fixtures for X, Y, Z | NO: Pure function, inline data]

### Edge Cases
1. [Edge case 1]
2. [Edge case 2]
3. [Edge case 3]

### Anti-Pattern Check
✅ No hardcoded library outputs
✅ No self-evident assertions
✅ Testing MY code's contract
✅ Appropriate fixture usage
✅ Business-focused docstrings

### Golden Rule
If these tests fail, we broke: [SPECIFIC BUSINESS REQUIREMENT]

### Proposed Test Structure
```python
[SHOW TEST TEMPLATE]
```

Does this approach look correct?

Get user confirmation before proceeding.

Step 11: Write the Tests

Only after Steps 1-10, write the actual test code.

Use the structure from Step 6.
Use the docstrings from Step 7.
Follow the anti-patterns from Step 5.


Step 12: Invoke pytest-test-reviewer

After writing tests, ALWAYS invoke the pytest-test-reviewer agent to validate:

  • Patterns followed correctly
  • No anti-patterns introduced
  • Business value clear
  • Contracts tested (not implementation)


Examples

Example 1: Pure Function (Timezone Util)

User: "Write tests for infer_timezone_from_phone"

Step 1-3: Analysis

Code type: Pure function wrapping phonenumbers library
Dependencies: phonenumbers (external), pytz (validation)
Contract:

  • Input: phone number string (various formats)
  • Output: valid pytz timezone string OR None
  • Guarantee: US numbers → America/* timezones
  • Guarantee: Invalid input → None (no exceptions)

Step 4: Test Type

UNIT test - pure function, no DB/state


Step 5: Fixtures

NO fixtures needed - simple string inputs


Step 6-7: Code

class TestInferTimezoneFromPhone:
    """Test timezone inference from phone numbers."""

    def test_valid_us_phone_returns_us_timezone(self):
        """
        Valid US phone numbers should return a US timezone.

        Contract test: validates that US numbers map to America/* timezones
        without depending on exact phonenumbers library output that could change.
        Ensures cronjobs run in user's local timezone.
        """
        # Test various US formats
        test_numbers = [
            "+14155551234",  # With country code
            "4155551234",     # Without country code
            "415-555-1234",   # With dashes
        ]

        for phone in test_numbers:
            result = infer_timezone_from_phone(phone)

            # Test OUR contract, not library internals
            assert result is not None, f"Should infer timezone for {phone}"
            assert result.startswith("America/"), f"US number {phone} should return America/* timezone"
            assert pytz.timezone(result) is not None  # Valid timezone

    def test_different_us_regions_return_different_timezones(self):
        """
        Different US regions should map to different timezones.

        Validates that the wrapper preserves geographic precision for
        accurate scheduling across time zones.
        """
        california = infer_timezone_from_phone("+14155551234")
        new_york = infer_timezone_from_phone("+12125551234")

        assert california is not None
        assert new_york is not None
        assert california != new_york, "Different regions should have different timezones"

    def test_invalid_phone_numbers_return_none(self):
        """
        Invalid phone numbers should return None.

        Critical for fallback logic - we need to know when inference
        failed so we can use the fallback timezone instead of crashing.
        """
        invalid_numbers = [None, "", "not a phone", "123"]

        for phone in invalid_numbers:
            result = infer_timezone_from_phone(phone)
            assert result is None, f"Invalid number {phone} should return None"

Golden Rule: If these tests fail, we broke:

  • The guarantee that US phone numbers return US timezones
  • The guarantee that invalid input doesn't crash (returns None)
  • The preservation of geographic precision (different regions)

Example 2: Database Logic (Intervention Creation)

User: "Write tests for create_intervention"

Step 1-3: Analysis

Code type: Business logic with database models
Dependencies: Database (SQLAlchemy), Message model, User model
Contract:
  - Creates Intervention in DB with correct relationships
  - Sets state to PENDING
  - Links to message and user correctly
  - Returns created intervention

Step 4: Test Type

INTEGRATION test - tests DB operations and model interactions

Step 5: Fixtures

YES - need mock_message, mock_user, test_db_session
Complex DB setup with relationships

Step 6-7: Code

class TestCreateIntervention:
    """Test intervention creation business logic."""

    def test_create_intervention_sets_correct_relationships(
        self,
        test_db_session: Session,
        mock_message,
        mock_user,
    ):
        """
        Creating an intervention should link it to the message and user.

        Ensures data integrity and enables querying interventions by
        user or message for analytics and debugging.
        """
        # Arrange: Use fixtures for complex DB setup
        message = mock_message()
        user = mock_user()

        # Act: Execute business logic
        intervention = create_intervention(message, user)

        # Assert: Verify business rules
        test_db_session.refresh(intervention)
        assert intervention.message_id == message.id
        assert intervention.user_id == user.id
        assert intervention.state == InterventionState.PENDING

    def test_create_intervention_fails_without_consent(
        self,
        test_db_session: Session,
        mock_message,
        mock_user,
    ):
        """
        Interventions should not be created for users without consent.

        Enforces ethical boundary - ensures we only coach users who
        explicitly opted in, maintaining trust and legal compliance.
        """
        # Arrange
        message = mock_message()
        user = mock_user(has_consented=False)

        # Act & Assert: Should raise
        with pytest.raises(ValueError, match="User has not consented"):
            create_intervention(message, user)

Golden Rule: If these tests fail, we broke:

  • Data integrity (relationships not set correctly)
  • Ethical boundaries (sending to non-consented users)
  • State machine correctness (interventions start in wrong state)

Success Criteria

Tests are ready when ALL of these are true:

  • Contracts tested, not implementation details
  • No hardcoded external library outputs
  • Fixtures used appropriately (complex setup only)
  • Business value explained in docstrings
  • Robust to library updates and minor changes
  • Can answer "If this fails, what business requirement broke?"
  • Anti-patterns avoided (checked against Step 5 list)
  • Appropriate test type chosen (unit/integration/e2e/etc.)
  • 5 Critical Questions answered correctly
  • pytest-test-reviewer agent invoked for validation

Common Mistakes to Avoid

  1. Starting to code before analysis - STOP. Do Steps 1-5 first.
  2. Skipping the Golden Rule check - If you can't articulate what breaks, delete the test.
  3. Using fixtures for simple strings - Inline them!
  4. Hardcoding library outputs - Test contracts instead.
  5. Writing self-evident tests - Ask "Am I testing Python or MY code?"
  6. Testing library behavior - Test YOUR wrapper, not wrapped library.
  7. Forgetting pytest-test-reviewer - ALWAYS invoke after writing tests.

After Test Writing

MANDATORY: Invoke pytest-test-reviewer agent to validate:

# Agent will check:
# - Patterns followed?
# - Anti-patterns avoided?
# - Business value clear?
# - Contracts tested?

Remember

YOU CANNOT WRITE TESTS WITHOUT THIS SKILL.

This skill is your safeguard against:

  • Brittle tests that break with library updates
  • Self-evident tests that waste time
  • Wrong fixture usage
  • Testing library behavior instead of YOUR code

Follow every step. No shortcuts. Every test. Every time.