assemblyai-streaming
This skill should be used when working with AssemblyAI’s Speech-to-Text and LLM Gateway APIs, especially for streaming/live transcription, meeting notetakers, and voice agents that need low-latency transcripts and audio analysis.
Installation
git clone https://github.com/ratacat/claude-skills /tmp/claude-skills && cp -r /tmp/claude-skills/skills/assembly-ai-streaming ~/.claude/skills/claude-skills/
Tip: run this command in your terminal to install the skill.
name: assemblyai-streaming
description: This skill should be used when working with AssemblyAI’s Speech-to-Text and LLM Gateway APIs, especially for streaming/live transcription, meeting notetakers, and voice agents that need low-latency transcripts and audio analysis.
license: MIT
allowed-tools:
- Read
- Write
- Edit
- Grep
- Glob
- Bash
- Python
metadata:
  skill-version: "1.0.0"
  upstream-docs: "https://www.assemblyai.com/docs"
  focus: "streaming-stt, meeting-notetaker, voice-agent, llm-gateway"
AssemblyAI Streaming & Live Transcription Skill
Overview
Use this skill to build and maintain code that talks to AssemblyAI’s:
- Streaming Speech-to-Text (STT) via WebSockets (wss://streaming.assemblyai.com/v3/ws)
- Async / pre-recorded STT via REST (https://api.assemblyai.com/v2/transcript)
- LLM Gateway for applying Claude/GPT/Gemini-style models to transcripts (https://llm-gateway.assemblyai.com)
The emphasis is on streaming/live transcription, meeting notetakers, and voice agents, while still covering async workflows and post-processing.
This skill assumes a Claude Code environment with access to Python (preferred) and Bash.
When to Use
Use this skill when:
- Implementing real-time transcription from a microphone, telephony stream, or audio file.
- Building a live meeting notetaker (Zoom/Teams/Meet), especially with summaries, action items, and highlights.
- Implementing a voice agent where latency and natural turn-taking matter.
- Migrating from other STT providers (OpenAI/Deepgram/Google/AWS/etc.) to AssemblyAI.
- Applying LLMs to audio via LLM Gateway for summaries, Q&A, topic tagging, or custom prompts.
Do not use this skill when:
- The task is generic HTTP client usage with no AssemblyAI-specific logic.
- The request clearly targets a different STT vendor.
- The environment cannot safely store or use an API key.
AssemblyAI Mental Model
1. Products to care about
- Pre-recorded Speech-to-Text (Async)
  - REST API: POST /v2/transcript → GET /v2/transcript/{id}
  - Designed for files from URLs, uploads, S3, etc.
  - Supports extra models: summarization, topic detection, sentiment, PII redaction, chapters, etc.
- Streaming Speech-to-Text
  - WebSocket: wss://streaming.assemblyai.com/v3/ws
  - Low-latency, immutable transcripts (~300ms).
  - Turn detection built in; fits voice agents and live captioning.
- LLM Gateway
  - REST API: POST /v1/chat/completions at https://llm-gateway.assemblyai.com
  - Unified access to multiple LLMs (Claude, GPT, Gemini, etc.).
  - Designed for “LLM over transcripts” workflows.
2. Key model knobs (Async)
- speech_models: ["slam-1", "universal"], etc.
  - Slam-1: best English accuracy + keyterms_prompt; good for medical/technical conversations.
  - Universal: multilingual coverage; good default if the language is unknown.
- language_code vs language_detection:
  - Use language_code when the language is known.
  - Use language_detection: true when unknown; optionally set language_confidence_threshold.
- keyterms_prompt: domain words/phrases to boost (medical terms, product names, etc.).
- Extra intelligence: summarization, iab_categories, content_safety, entity_detection, auto_chapters, sentiment_analysis, speaker_labels, auto_highlights, redact_pii, etc.
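A minimal request sketch tying these knobs together with plain requests against the v2 endpoint. The audio URL is a placeholder, and exact field shapes (e.g. speech_models as a list vs a singular speech_model) should be confirmed against the upstream docs before relying on them.

```python
import os
import requests

API_KEY = os.environ["ASSEMBLYAI_API_KEY"]

# Illustrative payload only; optional fields mirror the knobs listed above.
payload = {
    "audio_url": "https://example.com/meeting.mp3",   # placeholder
    "speech_models": ["slam-1", "universal"],          # verify exact field shape against the docs
    "keyterms_prompt": ["AssemblyAI", "Slam-1"],
    "language_code": "en",                             # or language_detection: True when unknown
    "summarization": True,
    "speaker_labels": True,
}

resp = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers={"authorization": API_KEY},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print("Transcript ID:", resp.json()["id"])
```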
3. Key model knobs (Streaming)
Connection URL:
- US: wss://streaming.assemblyai.com/v3/ws
- EU: wss://streaming.eu.assemblyai.com/v3/ws

Important query parameters:
- sample_rate (required): e.g. 16000.
- format_turns (bool): return formatted final transcripts; avoid for low-latency voice agents.
- speech_model: universal-streaming-english (default) or universal-streaming-multi.
- keyterms_prompt: JSON-encoded list of terms, e.g. ["AssemblyAI", "Slam-1", "Keanu Reeves"].
- Turn detection:
  - end_of_turn_confidence_threshold (0.0–1.0, default ~0.4)
  - min_end_of_turn_silence_when_confident (ms, default ~400)
  - max_turn_silence (ms, default ~1280)

Headers:
- Use either Authorization: <API_KEY> or a short-lived token query parameter issued by your backend.

Messages:
- Client sends:
  - Binary audio chunks (50–1000ms each).
  - Optional JSON messages: {"type": "UpdateConfig", ...}, {"type": "Terminate"}, {"type": "ForceEndpoint"}.
- Server sends:
  - Begin event with id, expires_at.
  - Turn events with: transcript (immutable partials/finals), utterance (complete semantic chunk), end_of_turn (bool), turn_is_formatted (bool), words array with timestamps/confidences.
  - Termination event with summary stats.
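For environments without the SDK, a bare-bones sketch of this protocol with websocket-client is shown below. The event type and field names follow the message shapes above; anything beyond them (such as the exact auth-header handling) is an assumption to verify against the docs.

```python
import json
from urllib.parse import urlencode

import websocket  # pip install websocket-client

API_KEY = "<YOUR_API_KEY>"

params = {
    "sample_rate": 16000,
    "format_turns": "false",                       # keep latency low for voice agents
    "end_of_turn_confidence_threshold": 0.4,
    "keyterms_prompt": json.dumps(["AssemblyAI", "Slam-1"]),
}
url = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(params)}"

def on_message(ws, message):
    event = json.loads(message)
    if event.get("type") == "Begin":
        print("Session:", event.get("id"))
    elif event.get("type") == "Turn":
        marker = "FINAL" if event.get("end_of_turn") else "PARTIAL"
        print(f"[{marker}] {event.get('transcript', '')}")
    elif event.get("type") == "Termination":
        print("Termination:", event)

ws = websocket.WebSocketApp(url, header={"Authorization": API_KEY}, on_message=on_message)

# A real client would run this in a thread and feed 50–1000ms binary PCM chunks:
#   ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
# then close the session with:
#   ws.send(json.dumps({"type": "Terminate"}))
ws.run_forever()
```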
4. Regions and data residency
- Async:
  - US: https://api.assemblyai.com
  - EU: https://api.eu.assemblyai.com
- Streaming:
  - US: wss://streaming.assemblyai.com/v3/ws
  - EU: wss://streaming.eu.assemblyai.com/v3/ws
Always keep base URLs consistent per project; don’t mix US/EU endpoints for the same data.
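One lightweight way to enforce this is a single project-level region constant that every base URL is derived from. This is an illustrative pattern, not an SDK feature.

```python
# Hypothetical project config: pick one region and derive all endpoints from it.
REGION = "us"  # or "eu"

ASYNC_BASE_URL = {
    "us": "https://api.assemblyai.com",
    "eu": "https://api.eu.assemblyai.com",
}[REGION]

STREAMING_WS_URL = {
    "us": "wss://streaming.assemblyai.com/v3/ws",
    "eu": "wss://streaming.eu.assemblyai.com/v3/ws",
}[REGION]
```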
Security & API Keys
- Always require an AssemblyAI API key and keep it out of source in Claude Code output:
  - Use environment variables: ASSEMBLYAI_API_KEY.
  - Or placeholders ("<YOUR_API_KEY>") in snippets.
- For browser/client code:
  - Do not embed the API key.
  - Instruct the user to generate temporary streaming tokens on their backend and pass only the token into the WebSocket connection (see the backend sketch below).
- Never print real keys in logs or comments.
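A minimal backend sketch for minting short-lived streaming tokens. The token endpoint path and the expires_in_seconds parameter are assumptions about the v3 streaming token API and should be checked against the upstream docs.

```python
import os
import requests

def get_streaming_token() -> str:
    """Mint a short-lived token server-side so the browser never sees the real key."""
    resp = requests.get(
        "https://streaming.assemblyai.com/v3/token",   # assumed token endpoint
        params={"expires_in_seconds": 60},             # assumed parameter name
        headers={"Authorization": os.environ["ASSEMBLYAI_API_KEY"]},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["token"]

# The browser/client then connects with ?token=<value> instead of sending the API key.
```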
High-Level Workflow Patterns
Decision tree
- Is the audio live?
  - Yes → Use Streaming STT.
  - No → Use Async STT.
- Is latency critical (<1s) for responses?
  - Yes → Streaming with format_turns=false and careful turn detection.
  - No → Async, then summarization/chapters/etc.
- Do transcripts leave the backend?
  - Yes → Consider redact_pii (and optionally redact_pii_audio) before sharing.
  - No → Use raw transcripts as needed.
- Need LLM-based processing (Q&A, structured summaries)?
  - Yes → Pipe transcripts into LLM Gateway via chat/completions (see the sketch below).
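A hedged sketch of that LLM Gateway call. The model identifier, the Bearer auth scheme, and the OpenAI-style request/response shape are assumptions to confirm against the LLM Gateway docs.

```python
import os
import requests

transcript_text = "..."  # transcript produced by async or streaming STT

resp = requests.post(
    "https://llm-gateway.assemblyai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"},  # auth scheme assumed
    json={
        "model": "claude-sonnet-4-5",  # hypothetical model identifier
        "messages": [
            {
                "role": "user",
                "content": f"Summarize this meeting transcript with action items:\n\n{transcript_text}",
            }
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```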
How Claude Should Work with This Skill
General principles
- Prefer official AssemblyAI SDKs (Python/JS) when available; fall back to requests/websocket-client only if the SDK cannot be installed.
- Always:
  - Validate HTTP responses and WebSocket status.
  - Surface useful error messages (status, error fields in transcript JSON).
  - Respect documented min/max chunk sizes (50–1000ms of audio per binary message).
- For voice-agent code, optimize for:
  - Immutable partials (transcript) and the utterance field.
  - Minimal latency; avoid extra formatting passes.
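As an example of the "validate and surface errors" rule, a small polling helper for async transcripts, relying on the documented status and error fields in the transcript JSON:

```python
import time
import requests

def wait_for_transcript(transcript_id: str, api_key: str,
                        base_url: str = "https://api.assemblyai.com") -> dict:
    """Poll GET /v2/transcript/{id} until it completes, surfacing API errors clearly."""
    while True:
        resp = requests.get(
            f"{base_url}/v2/transcript/{transcript_id}",
            headers={"authorization": api_key},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        if data["status"] == "completed":
            return data
        if data["status"] == "error":
            raise RuntimeError(f"AssemblyAI transcription failed: {data.get('error')}")
        time.sleep(3)
```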
Recipe 1 – Minimal Streaming from Microphone (Python SDK)
Goal: Stream mic audio to AssemblyAI and print transcripts in real time.
Use this when the environment has Python and assemblyai + pyaudio installed, and the user wants a quick streaming demo.
```python
import assemblyai as aai
from assemblyai.streaming import v3 as aai_stream
import pyaudio

API_KEY = "<YOUR_API_KEY>"
aai.settings.api_key = API_KEY

SAMPLE_RATE = 16000
CHUNK_MS = 50
FRAMES_PER_BUFFER = int(SAMPLE_RATE * (CHUNK_MS / 1000.0))


def main():
    client = aai_stream.StreamingClient(
        aai_stream.StreamingClientOptions(
            api_key=API_KEY,
            api_host="streaming.assemblyai.com",  # or "streaming.eu.assemblyai.com"
        )
    )

    def on_begin(_client, event: aai_stream.BeginEvent):
        print(f"Session started: {event.id}, expires at {event.expires_at}")

    def on_turn(_client, event: aai_stream.TurnEvent):
        # Use the immutable transcript text
        text = (event.transcript or "").strip()
        if not text:
            return
        # With format_turns=False, finals are never formatted, so rely on end_of_turn.
        # Keep unformatted text for LLMs; use formatting only when displaying to users.
        if event.end_of_turn:
            print(f"[FINAL] {text}")
        else:
            print(f"[PARTIAL] {text}", end="\r")

    def on_terminated(_client, event: aai_stream.TerminationEvent):
        print(f"\nTerminated. Audio duration={event.audio_duration_seconds}s")

    def on_error(_client, error: aai_stream.StreamingError):
        print(f"\nStreaming error: {error}")

    client.on(aai_stream.StreamingEvents.Begin, on_begin)
    client.on(aai_stream.StreamingEvents.Turn, on_turn)
    client.on(aai_stream.StreamingEvents.Termination, on_terminated)
    client.on(aai_stream.StreamingEvents.Error, on_error)

    client.connect(
        aai_stream.StreamingParameters(
            sample_rate=SAMPLE_RATE,
            format_turns=False,  # better latency for voice agents
        )
    )

    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=SAMPLE_RATE,
        input=True,
        frames_per_buffer=FRAMES_PER_BUFFER,
    )

    try:
        print("Speak into your microphone (Ctrl+C to stop)...")

        def audio_gen():
            while True:
                yield stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)

        client.stream(audio_gen())
    except KeyboardInterrupt:
        pass
    finally:
        client.disconnect(terminate=True)
        stream.stop_stream()
        stream.close()
        pa.terminate()


if __name__ == "__main__":
    main()
```
Repository
https://github.com/ratacat/claude-skills