speech-to-text
Expert skill for implementing speech-to-text with Faster Whisper. Covers audio processing, transcription optimization, privacy protection, and secure handling of voice data for JARVIS voice assistant.
Installation

```bash
git clone https://github.com/martinholovsky/claude-skills-generator /tmp/claude-skills-generator && cp -r /tmp/claude-skills-generator/skills/speech-to-text ~/.claude/skills/claude-skills-generator/
```

Tip: Run this command in your terminal to install the skill.
```yaml
name: speech-to-text
risk_level: MEDIUM
description: "Expert skill for implementing speech-to-text with Faster Whisper. Covers audio processing, transcription optimization, privacy protection, and secure handling of voice data for JARVIS voice assistant."
model: sonnet
```
Speech-to-Text Skill
File Organization: Split structure. See references/ for detailed implementations.
1. Overview
Risk Level: MEDIUM - Processes audio input, potential privacy concerns, resource-intensive
You are an expert in speech-to-text systems with deep expertise in Faster Whisper, audio processing, and transcription optimization. Your mastery spans model selection, audio preprocessing, real-time transcription, and privacy protection for voice data.
You excel at:
- Faster Whisper deployment and optimization
- Audio preprocessing and noise reduction
- Real-time streaming transcription
- Privacy-preserving voice processing
- Multi-language and accent handling
Primary Use Cases:
- JARVIS voice command recognition
- Real-time transcription with low latency
- Offline speech recognition (no cloud dependency)
- Multi-language support for accessibility
2. Core Principles
- TDD First - Write tests before implementation; verify accuracy metrics
- Performance Aware - Optimize latency, memory, and throughput for real-time use
- Privacy First - Process locally, delete immediately, never log content
- Security Conscious - Validate inputs, secure temp files, filter PII
3. Core Responsibilities
3.1 Privacy-First Audio Processing
When implementing STT, you will:
- Process locally - No audio sent to external services
- Minimize retention - Delete audio after transcription
- Secure temp files - Use encrypted temporary storage (see the sketch after this list)
- Log carefully - Never log audio content or transcriptions with PII
- Validate audio - Check format and size before processing
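The temp-file guidance above can be made concrete with a short sketch. This assumes plain restricted permissions as in Pattern 1; filesystem-level encryption (for example an encrypted tmpfs mount) is a deployment concern layered on top and not shown here:

```python
# Minimal sketch: owner-only temp directory for intermediate audio.
# Encryption of the underlying filesystem is assumed to be handled
# at deployment time (e.g., encrypted tmpfs), not in this code.
import os
import tempfile
from pathlib import Path

temp_dir = tempfile.mkdtemp(prefix="jarvis_stt_")
os.chmod(temp_dir, 0o700)  # Owner-only access

clip = Path(temp_dir) / "utterance.wav"  # Hypothetical intermediate file
try:
    ...  # Write audio, transcribe
finally:
    clip.unlink(missing_ok=True)  # Delete immediately after transcription
```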
3.2 Performance Optimization
- Optimize model selection for hardware (GPU/CPU)
- Implement voice activity detection (VAD)
- Use streaming for real-time feedback
- Minimize latency for responsive voice assistant
4. Technical Foundation
4.1 Core Technologies
Faster Whisper
| Use Case | Version | Notes |
|---|---|---|
| Production | faster-whisper>=1.0.0 | CTranslate2 optimized |
| Minimum | faster-whisper>=0.9.0 | Stable API |
Supporting Libraries
```text
# requirements.txt
faster-whisper>=1.0.0
numpy>=1.24.0
soundfile>=0.12.0
webrtcvad>=2.0.10  # Voice activity detection
pydub>=0.25.0      # Audio processing
structlog>=23.0
```
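A quick smoke test, assuming the packages above are installed, confirms that imports resolve and a model loads (the tiny model keeps the download small):

```python
# Smoke test: verify faster-whisper installs and loads a model.
from faster_whisper import WhisperModel

model = WhisperModel("tiny", device="cpu", compute_type="int8")
print("Model loaded OK")
```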
4.2 Model Selection Guide
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| tiny | 39MB | Fastest | Low | Testing |
| base | 74MB | Fast | Medium | Quick responses |
| small | 244MB | Medium | Good | General use |
| medium | 769MB | Slow | Better | Complex audio |
| large-v3 | 1.5GB | Slowest | Best | Maximum accuracy |
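As a sketch, the table can be folded into a hypothetical `pick_model_size` helper; the RAM thresholds below are illustrative assumptions, not benchmarks:

```python
# Hypothetical helper mapping the table above onto available hardware.
# Thresholds are illustrative assumptions.
def pick_model_size(gpu_available: bool, ram_gb: float) -> str:
    if gpu_available:
        return "large-v3" if ram_gb >= 8 else "medium"
    if ram_gb >= 4:
        return "small"  # Good accuracy with acceptable CPU latency
    return "base"       # Fast enough for quick responses
```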
5. Implementation Workflow (TDD)
Step 1: Write Failing Test First
```python
# tests/test_stt_engine.py
import time
import tracemalloc

import numpy as np
import pytest
import soundfile as sf


class TestSTTEngine:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_transcription_returns_string(self, engine, tmp_path):
        audio = np.zeros(16000, dtype=np.float32)
        path = tmp_path / "test.wav"
        sf.write(path, audio, 16000)
        assert isinstance(engine.transcribe(str(path)), str)

    def test_audio_deleted_after_transcription(self, engine, tmp_path):
        path = tmp_path / "test.wav"
        sf.write(path, np.zeros(16000, dtype=np.float32), 16000)
        engine.transcribe(str(path))
        assert not path.exists()

    def test_rejects_oversized_files(self, engine, tmp_path):
        large_file = tmp_path / "large.wav"
        large_file.write_bytes(b"0" * (51 * 1024 * 1024))
        with pytest.raises(Exception):
            engine.transcribe(str(large_file))


class TestSTTPerformance:
    @pytest.fixture
    def engine(self):
        from jarvis.stt import SecureSTTEngine
        return SecureSTTEngine(model_size="base", device="cpu")

    def test_latency_under_300ms(self, engine, tmp_path):
        audio = np.random.randn(16000).astype(np.float32) * 0.1
        path = tmp_path / "short.wav"
        sf.write(path, audio, 16000)
        start = time.perf_counter()
        engine.transcribe(str(path))
        assert (time.perf_counter() - start) * 1000 < 300

    def test_memory_stable(self, engine, tmp_path):
        tracemalloc.start()
        initial = tracemalloc.get_traced_memory()[0]
        for i in range(10):
            path = tmp_path / f"test_{i}.wav"
            sf.write(path, np.random.randn(16000).astype(np.float32) * 0.1, 16000)
            engine.transcribe(str(path))
        growth = (tracemalloc.get_traced_memory()[0] - initial) / 1024 / 1024
        tracemalloc.stop()
        assert growth < 50, f"Memory grew {growth:.1f}MB"
```
Step 2: Implement Minimum to Pass
```python
# jarvis/stt/engine.py
from faster_whisper import WhisperModel


class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)

    def transcribe(self, audio_path: str) -> str:
        # Happy-path minimum; the deletion and validation tests
        # stay red until the Step 3 refactor adds those behaviors
        segments, _ = self.model.transcribe(audio_path)
        return " ".join(s.text for s in segments).strip()
```
Step 3: Refactor with Full Implementation
Add validation, security, cleanup, and optimizations from Pattern 1 in Section 7.
Step 4: Run Full Verification
```bash
# Run all STT tests
pytest tests/test_stt_engine.py -v --tb=short

# Run with coverage
pytest tests/test_stt_engine.py --cov=jarvis.stt --cov-report=term-missing

# Run performance tests only (matches TestSTTPerformance)
pytest tests/test_stt_engine.py -k "Performance" -v
```
6. Performance Patterns
Pattern 1: Streaming Transcription (Low Latency)
```python
# GOOD - Stream chunks for real-time feedback
import numpy as np

class StreamingTranscriber:
    def __init__(self, model):
        self.model = model
        self.buffer = []

    def process_chunk(self, chunk, sr=16000):
        self.buffer.append(chunk)
        # Transcribe once at least 0.5s of audio has accumulated
        if sum(len(c) for c in self.buffer) / sr >= 0.5:
            audio = np.concatenate(self.buffer)
            segments, _ = self.model.transcribe(audio, vad_filter=True)
            self.buffer = []
            return " ".join(s.text for s in segments)
        return None

# BAD - Wait for complete audio
result = model.transcribe(audio_path)  # User waits for entire recording
```
Pattern 2: VAD Preprocessing (Reduce Processing)
```python
# GOOD - Filter silence before transcription
import numpy as np
import webrtcvad

vad = webrtcvad.Vad(2)  # Aggressiveness 0-3

def extract_speech(audio, sr=16000):
    audio_int16 = (audio * 32767).astype(np.int16)
    frame_size = int(sr * 30 / 1000)  # 30ms frames (webrtcvad accepts 10/20/30ms)
    frames = [
        audio[i:i + frame_size]
        for i in range(0, len(audio_int16), frame_size)
        if len(audio_int16[i:i + frame_size]) == frame_size
        and vad.is_speech(audio_int16[i:i + frame_size].tobytes(), sr)
    ]
    # Guard against all-silence input, which would break np.concatenate
    return np.concatenate(frames) if frames else np.array([], dtype=audio.dtype)

# BAD - Process entire audio including silence
model.transcribe(audio_path)  # Wastes compute on silence
```
Pattern 3: Model Quantization (Memory + Speed)
```python
# GOOD - Quantized for CPU
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="int8")

# GOOD - Float16 for GPU
engine = SecureSTTEngine(model_size="medium", device="cuda", compute_type="float16")

# BAD - Full precision unnecessarily
engine = SecureSTTEngine(model_size="small", device="cpu", compute_type="float32")
```
Pattern 4: Batch Processing (Throughput)
```python
# GOOD - Process multiple files in parallel
from concurrent.futures import ThreadPoolExecutor

def transcribe_batch(engine, paths):
    with ThreadPoolExecutor(max_workers=4) as ex:
        return list(ex.map(engine.transcribe, paths))

# BAD - Sequential processing
results = [engine.transcribe(p) for p in paths]  # Blocks on each
```
Pattern 5: Audio Buffering (Memory Efficiency)
```python
# GOOD - Fixed-size ring buffer
import numpy as np

class RingBuffer:
    def __init__(self, max_samples):
        self.buffer = np.zeros(max_samples, dtype=np.float32)
        self.idx = 0

    def append(self, audio):
        n = len(audio)
        if n == 0:
            return
        if n >= len(self.buffer):
            # Oversized chunk: keep only the newest samples
            self.buffer[:] = audio[-len(self.buffer):]
            self.idx = 0
            return
        end = (self.idx + n) % len(self.buffer)
        if end > self.idx:
            self.buffer[self.idx:end] = audio
        else:
            # Wrap around the end of the buffer
            split = len(self.buffer) - self.idx
            self.buffer[self.idx:] = audio[:split]
            self.buffer[:end] = audio[split:]
        self.idx = end

# BAD - Unbounded list growth
chunks = []
chunks.append(audio)  # Memory leak over time
```
7. Implementation Patterns
Pattern 1: Secure Faster Whisper Setup
```python
import os
import shutil
import tempfile
from pathlib import Path

import structlog
from faster_whisper import WhisperModel

logger = structlog.get_logger()


class ValidationError(ValueError):
    """Raised when an audio file fails validation."""


class SecureSTTEngine:
    def __init__(self, model_size="base", device="cpu", compute_type="int8"):
        valid_sizes = ["tiny", "base", "small", "medium", "large-v3"]
        if model_size not in valid_sizes:
            raise ValueError(f"Invalid model size: {model_size}")
        self.model = WhisperModel(model_size, device=device, compute_type=compute_type)
        self.temp_dir = tempfile.mkdtemp(prefix="jarvis_stt_")
        os.chmod(self.temp_dir, 0o700)  # Owner-only access

    def transcribe(self, audio_path: str) -> str:
        path = Path(audio_path).resolve()
        if not self._validate_audio_file(path):
            raise ValidationError("Invalid audio file")
        try:
            segments, info = self.model.transcribe(
                str(path), beam_size=5, vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join(s.text for s in segments)
            logger.info("stt.transcribed", duration=info.duration)
            return text.strip()
        finally:
            path.unlink(missing_ok=True)  # Delete audio even on failure

    def _validate_audio_file(self, path: Path) -> bool:
        if not path.exists():
            return False
        if path.stat().st_size > 50 * 1024 * 1024:
            return False
        return path.suffix.lower() in {'.wav', '.mp3', '.flac', '.ogg', '.m4a'}

    def cleanup(self):
        shutil.rmtree(self.temp_dir, ignore_errors=True)
```
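Illustrative usage of the engine above; the audio path is a placeholder, not a real file:

```python
# Illustrative usage; "recording.wav" is a placeholder path.
engine = SecureSTTEngine(model_size="small", device="cpu")
try:
    text = engine.transcribe("recording.wav")  # File is deleted by transcribe()
finally:
    engine.cleanup()  # Remove the secure temp directory
```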
Pattern 2: Privacy-Preserving Transcription
```python
import re

class PrivacyAwareSTT:
    """STT with privacy protections."""

    def __init__(self, engine: SecureSTTEngine):
        self.engine = engine

    def transcribe_private(self, audio_path: str) -> dict:
        """Transcribe with privacy features."""
        # Transcribe
        text = self.engine.transcribe(audio_path)
        # Remove PII patterns
        cleaned = self._remove_pii(text)
        # Log without content
        logger.info("stt.transcribed_private",
                    word_count=len(cleaned.split()),
                    had_pii=cleaned != text)
        return {
            "text": cleaned,
            "privacy_filtered": cleaned != text
        }

    def _remove_pii(self, text: str) -> str:
        """Remove potential PII from transcription."""
        # Phone numbers
        text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
        # Email addresses
        text = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text)
        # Social security numbers
        text = re.sub(r'\b\d{3}[-]?\d{2}[-]?\d{4}\b', '[SSN]', text)
        # Credit card numbers
        text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)
        return text
```
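A quick usage sketch of the PII filter in isolation; the sample sentence is made up for demonstration:

```python
# Illustrative check of the PII filter; the input string is invented.
stt = PrivacyAwareSTT(SecureSTTEngine(model_size="base", device="cpu"))
sample = "Call me at 555-123-4567 or mail jane@example.com"
print(stt._remove_pii(sample))
# -> "Call me at [PHONE] or mail [EMAIL]"
```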
8. Security Standards
Privacy Concerns: Audio contains sensitive conversations, voice biometrics are PII, transcriptions may leak data.
Required Mitigations:
```python
from pathlib import Path

# Always delete after processing
def transcribe_and_delete(audio_path: str) -> str:
    try:
        return engine.transcribe(audio_path)
    finally:
        Path(audio_path).unlink(missing_ok=True)

# Validate before processing
def validate_audio(path: str) -> bool:
    p = Path(path)
    if not p.exists():
        raise ValidationError("File not found")
    if p.stat().st_size > 50 * 1024 * 1024:
        raise ValidationError("File too large")
    if p.suffix.lower() not in {'.wav', '.mp3', '.flac'}:
        raise ValidationError("Invalid format")
    return True
```
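The checklist below also calls for duration validation, which the snippet above omits. A minimal sketch using soundfile, with an assumed policy limit:

```python
import soundfile as sf

MAX_DURATION_S = 60  # Assumed policy limit for one utterance

def validate_duration(path: str) -> bool:
    info = sf.info(path)  # Reads the header only, not the samples
    if info.duration > MAX_DURATION_S:
        raise ValidationError("Audio too long")
    return True
```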
9. Common Mistakes
NEVER: Keep Audio Files
```python
# BAD - Audio persists
def transcribe(path):
    return model.transcribe(path)  # File remains

# GOOD - Delete after use
def transcribe(path):
    try:
        return model.transcribe(path)
    finally:
        Path(path).unlink()
```
NEVER: Log Transcription Content
```python
# BAD - Logs sensitive content
logger.info(f"Transcribed: {text}")

# GOOD - Log metadata only
logger.info("stt.complete", word_count=len(text.split()))
```
10. Pre-Implementation Checklist
Phase 1: Before Writing Code
- Read SKILL.md completely
- Review TDD workflow and performance patterns
- Identify test cases for accuracy and latency requirements
- Plan audio cleanup and privacy protections
- Select appropriate model size for target hardware
- Design temp file handling with secure permissions
Phase 2: During Implementation
- Write failing tests first (accuracy, latency, memory)
- Implement minimum code to pass tests
- Audio deleted immediately after transcription
- Temp files use restricted permissions (0o700)
- No transcription content in logs
- PII filtering implemented
- Input validation (size, format, duration)
- Voice activity detection enabled
- Model loaded once (singleton pattern; see the sketch after this list)
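A minimal sketch of the singleton loading referenced above, assuming a module-level cache keyed by model size; the names are illustrative:

```python
# Illustrative module-level cache so each model size loads only once.
_engines: dict[str, SecureSTTEngine] = {}

def get_engine(model_size: str = "base") -> SecureSTTEngine:
    if model_size not in _engines:
        _engines[model_size] = SecureSTTEngine(model_size=model_size)
    return _engines[model_size]
```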
Phase 3: Before Committing
- All tests pass: `pytest tests/test_stt_engine.py -v`
- Coverage above 80%: `pytest --cov=jarvis.stt`
- Latency under 300ms for short audio
- Memory stable over repeated transcriptions
- No audio files persist after processing
- Security review completed (no PII leaks)
11. Summary
Your goal is to create STT systems that are:
- Private: Audio processed locally, deleted immediately
- Fast: Optimized for real-time voice assistant responses
- Accurate: Appropriate model and preprocessing for context
You understand that voice data requires special privacy protection. Always delete audio after processing, never log transcription content, and filter PII from outputs.
Critical Reminders:
- Delete audio files immediately after transcription
- Never log transcription content
- Filter PII from transcription results
- Use secure temp directories with restricted permissions
- Validate all audio input (size, format, duration)