Marketplace

dev-sre

Gate 2 of the development cycle. VALIDATES that observability was correctly implemented by developers. Does not implement observability code - only validates it.

$ Instalar

git clone https://github.com/LerianStudio/ring /tmp/ring && cp -r /tmp/ring/dev-team/skills/dev-sre ~/.claude/skills/ring

// tip: Run this command in your terminal to install the skill


name: dev-sre description: | Gate 2 of the development cycle. VALIDATES that observability was correctly implemented by developers. Does not implement observability code - only validates it.

trigger: |

  • Gate 2 of development cycle
  • Gate 0 (Implementation) complete with observability code
  • Gate 1 (DevOps) setup complete
  • Service needs observability validation (logging, tracing)

NOT_skip_when: |

  • "Task says observability not required" → AI cannot self-exempt. all services need observability.
  • "Pure frontend" → If it calls any API, backend needs observability. Frontend-only = static HTML.
  • "MVP doesn't need observability" → MVP without observability = blind MVP. No exceptions.

sequence: after: [dev-devops] before: [dev-testing]

related: complementary: [dev-cycle, dev-devops, dev-testing]

input_schema: required: - name: unit_id type: string description: "Task or subtask identifier being validated" - name: language type: string enum: [go, typescript, python] description: "Programming language of the implementation" - name: service_type type: string enum: [api, worker, batch, cli, library] description: "Type of service being validated" - name: implementation_agent type: string description: "Agent that performed Gate 0 (e.g., backend-engineer-golang)" - name: implementation_files type: array items: string description: "List of files created/modified in Gate 0" optional: - name: external_dependencies type: array items: string description: "External services called (HTTP, gRPC, queues)" - name: gate0_handoff type: object description: "Summary from Gate 0 implementation" - name: gate1_handoff type: object description: "Summary from Gate 1 DevOps setup"

output_schema: format: markdown required_sections: - name: "Validation Result" pattern: "^## Validation Result" required: true - name: "Instrumentation Coverage" pattern: "^## Instrumentation Coverage" required: true - name: "Issues Found" pattern: "^## Issues Found" required: true - name: "Handoff to Next Gate" pattern: "^## Handoff to Next Gate" required: true metrics: - name: result type: enum values: [PASS, FAIL, NEEDS_FIXES] - name: instrumentation_coverage_percent type: float - name: iterations type: integer - name: issues_critical type: integer - name: issues_high type: integer - name: issues_medium type: integer - name: issues_low type: integer

verification: automated: - command: "docker-compose logs app 2>&1 | head -5 | jq -e '.level'" description: "Logs are JSON structured" success_pattern: "info|debug|warn|error" manual: - "Verify logs include trace_id when tracing is enabled"

examples:

  • name: "API service observability validation" input: unit_id: "task-001" language: "go" service_type: "api" implementation_agent: "backend-engineer-golang" implementation_files: ["internal/handler/user.go", "internal/service/user.go"] expected_output: |

    Validation Result

    Status: PASS Iterations: 1

    Instrumentation Coverage

    LayerInstrumentedTotalCoverage
    Handlers55100%
    Services88100%
    Repositories44100%
    TOTAL1717100%

    Issues Found

    None

    Handoff to Next Gate

    • Ready for Gate 3: YES

SRE Validation (Gate 2)

Overview

This skill VALIDATES that observability was correctly implemented by developers:

  • Structured logging with trace correlation
  • OpenTelemetry tracing instrumentation
  • Code instrumentation coverage (90%+ required)
  • Context propagation for distributed tracing

CRITICAL: Role Clarification

Developers IMPLEMENT observability. SRE VALIDATES it.

WhoResponsibility
Developers (Gate 0)IMPLEMENT observability following Ring Standards
SRE Agent (Gate 2)VALIDATE that observability is correctly implemented
Implementation AgentFIX issues found by SRE (if any)

If observability is missing or incorrect:

  1. SRE reports issues with severity levels
  2. This skill dispatches fixes to the implementation agent
  3. SRE re-validates after fixes
  4. Max 3 iterations, then escalate to user

Step 1: Validate Input

<verify_before_proceed>

  • unit_id exists
  • language is valid (go|typescript|python)
  • service_type is valid (api|worker|batch|cli|library)
  • implementation_agent exists
  • implementation_files is not empty </verify_before_proceed>
REQUIRED INPUT (from dev-cycle orchestrator):
- unit_id: [task/subtask being validated]
- language: [go|typescript|python]
- service_type: [api|worker|batch|cli|library]
- implementation_agent: [agent that did Gate 0]
- implementation_files: [list of files from Gate 0]

OPTIONAL INPUT:
- external_dependencies: [HTTP clients, gRPC clients, queues]
- gate0_handoff: [summary from Gate 0]
- gate1_handoff: [summary from Gate 1]

if any REQUIRED input is missing:
  → STOP and report: "Missing required input: [field]"
  → Return to orchestrator with error

Step 2: Initialize Validation State

validation_state = {
  iteration: 1,
  max_iterations: 3,
  sre_result: null,
  issues: [],
  instrumentation_coverage: null
}

Step 3: Dispatch SRE Agent for Validation

<dispatch_required agent="sre" model="opus"> Validate observability implementation for unit_id. </dispatch_required>

Task:
  subagent_type: "ring-dev-team:sre"
  model: "opus"
  description: "Validate observability for [unit_id]"
  prompt: |
    ⛔ VALIDATE Observability Implementation

    ## Input Context
    - **Unit ID:** [unit_id]
    - **Language:** [language]
    - **Service Type:** [service_type]
    - **Implementation Agent:** [implementation_agent]
    - **Files to Validate:** [implementation_files]
    - **External Dependencies:** [external_dependencies or "None"]

    ## Standards Reference
    WebFetch: https://raw.githubusercontent.com/LerianStudio/ring/main/dev-team/docs/standards/sre.md

    ## Your Role
    - VALIDATE that observability is implemented correctly
    - Do not implement - only verify and report
    - Check structured JSON logging
    - Check OpenTelemetry instrumentation coverage
    - Check context propagation for external calls

    ## Validation Checklist

    ### 0. FORBIDDEN Logging Patterns (CRITICAL - Check FIRST)

    Any occurrence = CRITICAL severity, automatic FAIL verdict.

    <forbidden>
    - fmt.Println() in Go code
    - fmt.Printf() in Go code
    - log.Println() in Go code
    - log.Printf() in Go code
    - log.Fatal() in Go code
    - println() in Go code
    - console.log() in TypeScript
    - console.error() in TypeScript
    - console.warn() in TypeScript
    </forbidden>
    
    **MUST search for and report all occurrences of FORBIDDEN patterns:**
    
    | Language | FORBIDDEN Pattern | Search For |
    |----------|-------------------|------------|
    | Go | `fmt.Println()` | `fmt.Println` in *.go files |
    | Go | `fmt.Printf()` | `fmt.Printf` in *.go files |
    | Go | `log.Println()` | `log.Println` in *.go files |
    | Go | `log.Printf()` | `log.Printf` in *.go files |
    | Go | `log.Fatal()` | `log.Fatal` in *.go files |
    | Go | `println()` | `println(` in *.go files |
    | TypeScript | `console.log()` | `console.log` in *.ts files |
    | TypeScript | `console.error()` | `console.error` in *.ts files |
    | TypeScript | `console.warn()` | `console.warn` in *.ts files |
    
    **If any FORBIDDEN pattern found:**
    - Severity: **CRITICAL**
    - Verdict: **FAIL** (automatic, no exceptions)
    - Each occurrence MUST be listed with file:line
    
    ### 1. Structured Logging (lib-commons)
    - [ ] Uses `libCommons.NewTrackingFromContext(ctx)` for logger (Go)
    - [ ] Uses `initializeLogger()` from lib-common-js (TypeScript)
    - [ ] JSON format with timestamp, level, message, service
    - [ ] trace_id correlation in logs
    - [ ] **no FORBIDDEN patterns** (see check 0 above)

    ### 2. Instrumentation Coverage (90%+ required)
    For [language], check these patterns:

    **Go (lib-commons):**
    ```go
    logger, tracer, _, _ := libCommons.NewTrackingFromContext(ctx)
    ctx, span := tracer.Start(ctx, "layer.operation")
    defer span.End()
    ```

    **TypeScript:**
    ```typescript
    const span = tracer.startSpan('layer.operation');
    try { /* work */ } finally { span.end(); }
    ```

    Count spans in:
    - Handlers: grep "tracer.Start" in *handler*.go or *controller*.ts
    - Services: grep "tracer.Start" in *service*.go or *service*.ts
    - Repositories: grep "tracer.Start" in *repo*.go or *repository*.ts

    ### 3. Context Propagation
    For external calls, verify:
    - HTTP: InjectHTTPContext (Go) or equivalent
    - gRPC: InjectGRPCContext (Go) or equivalent
    - Queues: PrepareQueueHeaders (Go) or equivalent

    ## Required Output Format

    ### Validation Summary
    | Check | Status | Evidence |
    |-------|--------|----------|
    | Structured Logging | ✅/❌ | [file:line or "not FOUND"] |
    | Tracing Enabled | ✅/❌ | [file:line or "not FOUND"] |
    | Instrumentation ≥90% | ✅/❌ | [X%] |
    | Context Propagation | ✅/❌/N/A | [file:line or "N/A"] |

    ### Instrumentation Coverage Table
    | Layer | Instrumented | Total | Coverage |
    |-------|--------------|-------|----------|
    | Handlers | X | Y | Z% |
    | Services | X | Y | Z% |
    | Repositories | X | Y | Z% |
    | HTTP Clients | X | Y | Z% |
    | gRPC Clients | X | Y | Z% |
    | **TOTAL** | X | Y | **Z%** |

    ### Issues Found (if any)
    For each issue:
    - **Severity:** CRITICAL/HIGH/MEDIUM/LOW
    - **Category:** [Logging|Tracing|Instrumentation|Propagation]
    - **Description:** [what's wrong]
    - **File:** [path:line]
    - **Expected:** [what should exist]
    - **Fix Required By:** [implementation_agent]

    ### Verdict
    - **all CHECKS PASSED:**  YES /  no
    - **Instrumentation Coverage:** [X%]
    - **If no, blocking issues:** [list]

Step 4: Parse SRE Agent Output

Parse agent output:

1. Extract Validation Summary table
2. Extract Instrumentation Coverage table
3. Extract Issues Found list
4. Extract Verdict

validation_state.sre_result = {
  logging_ok: [true/false],
  tracing_ok: [true/false],
  instrumentation_coverage: [percentage],
  context_propagation_ok: [true/false/na],
  issues: [list of issues],
  verdict: [PASS/FAIL]
}

Step 5: Handle Validation Result

if validation_state.sre_result.verdict == "PASS" 
   and validation_state.sre_result.instrumentation_coverage >= 90:
  → Go to Step 8 (Success)

if validation_state.sre_result.verdict == "FAIL"
   or validation_state.sre_result.instrumentation_coverage < 90:
  → Go to Step 6 (Dispatch Fix)

if validation_state.iteration >= validation_state.max_iterations:
  → Go to Step 9 (Escalate)

Step 6: Dispatch Fix to Implementation Agent

Task:
  subagent_type: "[implementation_agent from input]"  # e.g., "backend-engineer-golang"
  model: "opus"
  description: "Fix observability issues for [unit_id]"
  prompt: |
    ⛔ FIX REQUIRED - Observability Issues Found

    ## Context
    - **Unit ID:** [unit_id]
    - **Iteration:** [validation_state.iteration] of [validation_state.max_iterations]
    - **Your Previous Implementation:** [implementation_files]

    ## Issues to Fix (from SRE Validation)
    [paste issues from validation_state.sre_result.issues]

    ## Current Instrumentation Coverage
    [paste Instrumentation Coverage table from SRE output]
    **Required:** ≥90%
    **Current:** [validation_state.sre_result.instrumentation_coverage]%

    ## Standards Reference
    For Go: https://raw.githubusercontent.com/LerianStudio/ring/main/dev-team/docs/standards/golang.md
    For TS: https://raw.githubusercontent.com/LerianStudio/ring/main/dev-team/docs/standards/typescript.md

    Focus on: Telemetry & Observability section

    ## Required Fixes

    ### If Logging Issues:
    - Replace fmt.Println/console.log with structured logger
    - Add trace_id to log context
    - Use JSON format

    ### If Instrumentation Coverage < 90%:
    - Add spans to all handlers: `tracer.Start(ctx, "handler.name")`
    - Add spans to all services: `tracer.Start(ctx, "service.domain.operation")`
    - Add spans to all repositories: `tracer.Start(ctx, "db.operation")`
    - Add `defer span.End()` after each span creation

    ### If Context Propagation Issues:
    - Add InjectHTTPContext for outgoing HTTP calls
    - Add InjectGRPCContext for outgoing gRPC calls
    - Add PrepareQueueHeaders for queue publishing

    ## Required Output
    - Files modified with fixes
    - New Instrumentation Coverage calculation
    - Confirmation all issues addressed

Step 7: Re-Validate After Fix

validation_state.iteration += 1

if validation_state.iteration > validation_state.max_iterations:
  → Go to Step 9 (Escalate)

→ Go back to Step 3 (Dispatch SRE Agent)

Step 8: Success - Prepare Output

Generate skill output:

## Validation Result
**Status:** PASS
**Iterations:** [validation_state.iteration]
**Instrumentation Coverage:** [validation_state.sre_result.instrumentation_coverage]%

## Instrumentation Coverage
[paste final Instrumentation Coverage table]

## Issues Found
None (all resolved)

## Handoff to Next Gate
- SRE validation: COMPLETE
- Logging: ✅ Structured JSON with trace_id
- Tracing: ✅ OpenTelemetry instrumented
- Instrumentation: ✅ [X]% coverage
- Ready for Gate 3 (Testing): YES

Step 9: Escalate - Max Iterations Reached

Generate skill output:

## Validation Result
**Status:** FAIL
**Iterations:** [validation_state.iteration] (MAX REACHED)
**Instrumentation Coverage:** [validation_state.sre_result.instrumentation_coverage]%

## Instrumentation Coverage
[paste final Instrumentation Coverage table]

## Issues Found
[list remaining unresolved issues]

## Handoff to Next Gate
- SRE validation: FAILED
- Remaining issues: [count]
- Ready for Gate 3 (Testing): no
- **Action Required:** User must manually resolve remaining issues

⛔ ESCALATION: Max iterations (3) reached. User intervention required.

Severity Calibration

SeverityScenarioGate 2 StatusAction
CRITICALMissing all observability (no structured logs)FAIL❌ Return to Gate 0
CRITICALfmt.Println/echo instead of JSON logsFAIL❌ Return to Gate 0
CRITICALInstrumentation coverage < 50%FAIL❌ Return to Gate 0
CRITICAL"DEFERRED" appears in validation outputFAIL❌ Return to Gate 0
HIGHInstrumentation coverage 50-89%NEEDS_FIXES⚠️ Fix and re-validate
MEDIUMMissing context propagationNEEDS_FIXES⚠️ Fix and re-validate
LOWMinor logging improvementsPASS✅ Note for future

Blocker Criteria - STOP and Report

<block_condition> If any condition is true, STOP and dispatch fix or escalate to user.

  • Service lacks JSON-structured logs
  • Instrumentation coverage < 50%
  • Max iterations (3) reached </block_condition>
Decision TypeExamplesAction
HARD BLOCKService lacks JSON structured logsSTOP - Dispatch fix to implementation agent
HARD BLOCKInstrumentation coverage < 50%STOP - Dispatch fix to implementation agent
HARD BLOCKMax iterations reachedSTOP - Escalate to user

Cannot Be Overridden

<cannot_skip>

  • Gate 2 execution (no MVP exemptions)
  • 90% instrumentation coverage minimum
  • JSON structured logs requirement </cannot_skip>
RequirementCannot Be Waived ByRationale
Gate 2 executionCTO, PM, "MVP" argumentsObservability prevents production blindness
90% instrumentation coverage"We'll add spans later"Later = never. Instrument during implementation.
JSON structured logs"Plain text is enough"Plain text is unsearchable in production

Pressure Resistance

See shared-patterns/shared-pressure-resistance.md for universal pressure scenarios.

User SaysYour Response
"Skip SRE validation""Observability is MANDATORY. Dispatching SRE agent now."
"90% coverage is too high""90% is the Ring Standard minimum. Cannot lower."
"Will add instrumentation later""Instrumentation is part of implementation. Fix now."

Anti-Rationalization Table

See shared-patterns/shared-anti-rationalization.md for universal anti-rationalizations.

Gate 2-Specific Anti-Rationalizations

RationalizationWhy It's WRONGRequired Action
"OpenTelemetry library is installed"Installation ≠ InstrumentationVerify spans exist in code
"Middleware handles tracing"Middleware = root span onlyAdd child spans in all layers
"Small function doesn't need span"Size is irrelevantAdd span to every function
"Only external calls need tracing"Internal ops need tracing tooInstrument all layers
"Feature complete, observability later"Observability IS completionFix NOW before Gate 3

Component Type Requirements

TypeJSON LogsTracingInstrumentation
API ServiceREQUIREDREQUIRED90%+
Background WorkerREQUIREDREQUIRED90%+
CLI ToolREQUIREDN/AN/A
LibraryN/AN/AN/A

Execution Report Format

## Validation Result
**Status:** [PASS|FAIL|NEEDS_FIXES]
**Iterations:** [N]
**Duration:** [Xm Ys]

## Instrumentation Coverage
| Layer | Instrumented | Total | Coverage |
|-------|--------------|-------|----------|
| Handlers | X | Y | Z% |
| Services | X | Y | Z% |
| Repositories | X | Y | Z% |
| HTTP Clients | X | Y | Z% |
| gRPC Clients | X | Y | Z% |
| **TOTAL** | X | Y | **Z%** |

**Coverage Status:** [PASS (≥90%) | NEEDS_FIXES (50-89%) | FAIL (<50%)]

## Issues Found
- [List by severity or "None"]

## Handoff to Next Gate
- SRE validation status: [complete|needs_fixes|failed]
- Instrumentation coverage: [X%]
- Ready for testing: [YES|no]