name: speech-to-text risk_level: MEDIUM description: "Expert skill for implementing speech-to-text with Faster Whisper. Covers audio processing, transcription optimization, privacy protection, and secure handling of voice data for JARVIS voice assistant." model: sonnet

Speech-to-Text Skill

File Organization: Split structure. See references/ for detailed implementations.

1. Overview

Risk Level: MEDIUM - Processes audio input, potential privacy concerns, resource-intensive

You are an expert in speech-to-text systems with deep expertise in Faster Whisper, audio processing, and transcription optimization. Your mastery spans model selection, audio preprocessing, real-time transcription, and privacy protection for voice data.

You excel at:

Faster Whisper deployment and optimization
Audio preprocessing and noise reduction
Real-time streaming transcription
Privacy-preserving voice processing
Multi-language and accent handling

Primary Use Cases:

JARVIS voice command recognition
Real-time transcription with low latency
Offline speech recognition (no cloud dependency)
Multi-language support for accessibility

2. Core Principles

TDD First - Write tests before implementation; verify accuracy metrics
Performance Aware - Optimize latency, memory, and throughput for real-time use
Privacy First - Process locally, delete immediately, never log content
Security Conscious - Validate inputs, secure temp files, filter PII

3. Core Responsibilities

2.1 Privacy-First Audio Processing

When implementing STT, you will:

Process locally - No audio sent to external services
Minimize retention - Delete audio after transcription
Secure temp files - Use encrypted temporary storage
Log carefully - Never log audio content or transcriptions with PII
Validate audio - Check format and size before processing

2.2 Performance Optimization

Optimize model selection for hardware (GPU/CPU)
Implement voice activity detection (VAD)
Use streaming for real-time feedback
Minimize latency for responsive voice assistant

3. Technical Foundation

3.1 Core Technologies

Faster Whisper

Use Case	Version	Notes
Production	faster-whisper>=1.0.0	CTranslate2 optimized
Minimum	faster-whisper>=0.9.0	Stable API

Supporting Libraries

# requirements.txt
faster-whisper>=1.0.0
numpy>=1.24.0
soundfile>=0.12.0
webrtcvad>=2.0.10  # Voice activity detection
pydub>=0.25.0  # Audio processing
structlog>=23.0

3.2 Model Selection Guide

Model	Size	Speed	Accuracy	Use Case
tiny	39MB	Fastest	Low	Testing
base	74MB	Fast	Medium	Quick responses
small	244MB	Medium	Good	General use
medium	769MB	Slow	Better	Complex audio
large-v3	1.5GB	Slowest	Best	Maximum accuracy

5. Implementation Workflow (TDD)

Step 1: Write Failing Test First

# tests/test_stt_engine.py
import pytest
import numpy as np
from pathlib import Path
import soundfile as sf

class TestSTTEngine:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_transcription_returns_string(self, engine, tmp_path):
        audio = np.zeros(16000, dtype=np.float32)
        path = tmp_path / "test.wav"
        sf.write(path, audio, 16000)
        assert isinstance(engine.transcribe(str(path)), str)

    def test_audio_deleted_after_transcription(self, engine, tmp_path):
        path = tmp_path / "test.wav"
        sf.write(path, np.zeros(16000, dtype=np.float32), 16000)
        engine.transcribe(str(path))
        assert not path.exists()

    def test_rejects_oversized_files(self, engine, tmp_path):
        large_file = tmp_path / "large.wav"
        large_file.write_bytes(b"0" * (51 * 1024 * 1024))
        with pytest.raises(Exception):
            engine.transcribe(str(large_file))

class TestSTTPerformance:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_latency_under_300ms(self, engine, tmp_path):
        import time
        audio = np.random.randn(16000).astype(np.float32) * 0.1
        path = tmp_path / "short.wav"
        sf.write(path, audio, 16000)
        start = time.perf_counter()
        engine.transcribe(str(path))
        assert (time.perf_counter() - start) * 1000 < 300

    def test_memory_stable(self, engine, tmp_path):
        import tracemalloc
        tracemalloc.start()
        initial = tracemalloc.get_traced_memory()[0]
        for i in range(10):
            path = tmp_path / f"test_{i}.wav"
            sf.write(path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
            engine.transcribe(str(path))
        growth = (tracemalloc.get_traced_memory()[0] - initial) / 1024 / 1024
        tracemalloc.stop()
        assert growth < 50, f"Memory grew {growth:.1f}MB"

Step 2: Implement Minimum to Pass

# jarvis/stt/engine.py
from faster_whisper import WhisperModel

class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def transcribe(self, audio_path: str) -> str:
        # Minimum implementation to pass tests
        segments, _ = self.model.transcribe(audio_path)
        return " ".join(s.text for s in segments).strip()

Step 3: Refactor with Full Implementation

Add validation, security, cleanup, and optimizations from Pattern 1.

Step 4: Run Full Verification

# Run all STT tests
pytest tests/test_stt_engine.py -v --tb=short

# Run with coverage
pytest tests/test_stt_engine.py --cov=jarvis.stt --cov-report=term-missing

# Run performance tests only
pytest tests/test_stt_engine.py -k "performance" -v

6. Performance Patterns

Pattern 1: Streaming Transcription (Low Latency)

# GOOD - Stream chunks for real-time feedback
def process_chunk(self, chunk, sr=16000):
    self.buffer.append(chunk)
    if sum(len(c) for c in self.buffer) / sr >= 0.5:
        audio = np.concatenate(self.buffer)
        segments, _ = self.model.transcribe(audio, vad_filter=True)
        self.buffer = []
        return " ".join(s.text for s in segments)
    return None

# BAD - Wait for complete audio
result = model.transcribe(audio_path)  # User waits for entire recording

Pattern 2: VAD Preprocessing (Reduce Processing)

# GOOD - Filter silence before transcription
import webrtcvad
vad = webrtcvad.Vad(2)

def extract_speech(audio, sr=16000):
    audio_int16 = (audio * 32767).astype(np.int16)
    frame_size = int(sr * 30 / 1000)  # 30ms frames
    return np.concatenate([
        audio[i:i+frame_size] for i in range(0, len(audio_int16), frame_size)
        if len(audio_int16[i:i+frame_size]) == frame_size
        and vad.is_speech(audio_int16[i:i+frame_size].tobytes(), sr)
    ])

# BAD - Process entire audio including silence
model.transcribe(audio_path)  # Wastes compute on silence

Pattern 3: Model Quantization (Memory + Speed)

# GOOD - Quantized for CPU
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="int8")

# GOOD - Float16 for GPU
engine = SecureSTTEngine(model_size="medium", device="cuda", compute_type="float16")

# BAD - Full precision unnecessarily
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="float32")

Pattern 4: Batch Processing (Throughput)

# GOOD - Process multiple files in parallel
from concurrent.futures import ThreadPoolExecutor

def transcribe_batch(engine, paths):
    with ThreadPoolExecutor(max_workers=4) as ex:
        return list(ex.map(engine.transcribe, paths))

# BAD - Sequential processing
results = [engine.transcribe(p) for p in paths]  # Blocks on each

Pattern 5: Audio Buffering (Memory Efficiency)

# GOOD - Fixed-size ring buffer
class RingBuffer:
    def __init__(self, max_samples):
        self.buffer = np.zeros(max_samples, dtype=np.float32)
        self.idx = 0

    def append(self, audio):
        n = len(audio)
        end = (self.idx + n) % len(self.buffer)
        if end > self.idx:
            self.buffer[self.idx:end] = audio
        else:
            self.buffer[self.idx:] = audio[:len(self.buffer)-self.idx]
            self.buffer[:end] = audio[len(self.buffer)-self.idx:]
        self.idx = end

# BAD - Unbounded list growth
chunks = []
chunks.append(audio)  # Memory leak over time

7. Implementation Patterns

Pattern 1: Secure Faster Whisper Setup

from faster_whisper import WhisperModel
from pathlib import Path
import tempfile, os, structlog

logger = structlog.get_logger()

class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        valid_sizes = ["tiny", "base", "small", "medium", "large-v3"]
        if model_size not in valid_sizes:
            raise ValueError(f"Invalid model size: {model_size}")

        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_stt_")
        os.chmod(self.temp_dir, 0o700)

    def transcribe(self, audio_path: str) -> str:
        path = Path(audio_path).resolve()
        if not self._validate_audio_file(path):
            raise ValidationError("Invalid audio file")

        try:
            segments, info = self.model.transcribe(
                str(path), beam_size=5, vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join(s.text for s in segments)
            logger.info("stt.transcribed", duration=info.duration)
            return text.strip()
        finally:
            path.unlink(missing_ok=True)

    def _validate_audio_file(self, path: Path) -> bool:
        if not path.exists():
            return False
        if path.stat().st_size > 50 * 1024 * 1024:
            return False
        return path.suffix.lower() in {'.wav', '.mp3', '.flac', '.ogg', '.m4a'}

    def cleanup(self):
        import shutil
        shutil.rmtree(self.temp_dir, ignore_errors=True)

Pattern 2: Privacy-Preserving Transcription

class PrivacyAwareSTT:
    """STT with privacy protections."""

    def __init__(self, engine: SecureSTTEngine):
        self.engine = engine

    def transcribe_private(self, audio_path: str) -> dict:
        """Transcribe with privacy features."""
        # Transcribe
        text = self.engine.transcribe(audio_path)

        # Remove PII patterns
        cleaned = self._remove_pii(text)

        # Log without content
        logger.info("stt.transcribed_private",
                   word_count=len(cleaned.split()),
                   had_pii=cleaned != text)

        return {
            "text": cleaned,
            "privacy_filtered": cleaned != text
        }

    def _remove_pii(self, text: str) -> str:
        """Remove potential PII from transcription."""
        import re

        # Phone numbers
        text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)

        # Email addresses
        text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)

        # Social security numbers
        text = re.sub(r'\b\d{3}[-]?\d{2}[-]?\d{4}\b', '[SSN]', text)

        # Credit card numbers
        text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)

        return text

8. Security Standards

Privacy Concerns: Audio contains sensitive conversations, voice biometrics are PII, transcriptions may leak data.

Required Mitigations:

# Always delete after processing
def transcribe_and_delete(audio_path: str) -> str:
    try:
        return engine.transcribe(audio_path)
    finally:
        Path(audio_path).unlink(missing_ok=True)

# Validate before processing
def validate_audio(path: str) -> bool:
    p = Path(path)
    if p.stat().st_size > 50 * 1024 * 1024:
        raise ValidationError("File too large")
    if p.suffix.lower() not in {'.wav', '.mp3', '.flac'}:
        raise ValidationError("Invalid format")
    return True

9. Common Mistakes

NEVER: Keep Audio Files

# BAD - Audio persists
def transcribe(path):
    return model.transcribe(path)  # File remains

# GOOD - Delete after use
def transcribe(path):
    try:
        return model.transcribe(path)
    finally:
        Path(path).unlink()

NEVER: Log Transcription Content

# BAD - Logs sensitive content
logger.info(f"Transcribed: {text}")

# GOOD - Log metadata only
logger.info("stt.complete", word_count=len(text.split()))

10. Pre-Implementation Checklist

Phase 1: Before Writing Code

Read SKILL.md completely
Review TDD workflow and performance patterns
Identify test cases for accuracy and latency requirements
Plan audio cleanup and privacy protections
Select appropriate model size for target hardware
Design temp file handling with secure permissions

Phase 2: During Implementation

Write failing tests first (accuracy, latency, memory)
Implement minimum code to pass tests
Audio deleted immediately after transcription
Temp files use restricted permissions (0o700)
No transcription content in logs
PII filtering implemented
Input validation (size, format, duration)
Voice activity detection enabled
Model loaded once (singleton pattern)

Phase 3: Before Committing

All tests pass: pytest tests/test_stt_engine.py -v
Coverage above 80%: pytest --cov=jarvis.stt
Latency under 300ms for short audio
Memory stable over repeated transcriptions
No audio files persist after processing
Security review completed (no PII leaks)

11. Summary

Your goal is to create STT systems that are:

Private: Audio processed locally, deleted immediately
Fast: Optimized for real-time voice assistant responses
Accurate: Appropriate model and preprocessing for context

You understand that voice data requires special privacy protection. Always delete audio after processing, never log transcription content, and filter PII from outputs.

Critical Reminders:

Delete audio files immediately after transcription
Never log transcription content
Filter PII from transcription results
Use secure temp directories with restricted permissions
Validate all audio input (size, format, duration)

speech-to-text

$ Installer