音訊處理
357 skills in 內容與媒體 > 音訊處理
gencast
Auto-invoke gencast to generate podcasts from documents when user mentions podcast, audio, or dialogue generation
openai-whisper-api
Transcribe audio via OpenAI Audio Transcriptions API (Whisper).
unsloth-tts
Fine-tuning Text-to-Speech (TTS) models with Unsloth for voice cloning and synthetic speech (triggers: TTS, text-to-speech, voice cloning, Orpheus-TTS, audio fine-tuning, speech synthesis).
brand-guidelines
Create and maintain brand guidelines including visual identity, voice and tone, and usage rules. Use for establishing brand standards, style guides, and ensuring brand consistency across materials.
yarb-branding
Apply yarb brand voice and tone when writing UI copy, documentation, error messages,or any user-facing text. Use this skill when asked to "write copy", "update messaging","make it sound like yarb", "apply branding", or when creating new UI components.
whisper-lolo-audio-ingest
Build or modify the browser-side recording and upload pipeline for whisper-lolo. Use when implementing MediaRecorder + IndexedDB chunking, assembling audio blobs, or configuring Vercel Blob client uploads with progress and callbacks.
replicate
Runs open-source ML models via Replicate API for image generation, LLMs, and audio. Use when calling Stable Diffusion, Llama, Whisper, or other models without infrastructure management.
glsl-shader
Create audio-reactive GLSL visualizers for Bice-Box. Provides templates, audio uniforms (iRMSOutput, iRMSInput, iAudioTexture), coordinate patterns, and common shader functions.
voiceover
Generates audio narration from a text file using Chatterbox TTS. Use when the user wants to generate voiceover/audio from ANY text file.
parakeet
Convert audio files to text using parakeet-mlx, NVIDIA's Parakeet automatic speech recognition model optimized for Apple's MLX framework. Run via uvx for on-device speech-to-text processing with high-quality timestamped transcriptions. Ideal for podcasts, interviews, meetings, and other audio content. This skill is triggered when the user says things like "transcribe this audio", "convert audio to text", "transcribe this podcast", "get text from this recording", "speech to text", or "transcribe this wav/mp3/m4a file".
tts
Implement text-to-speech (TTS) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to convert text into natural-sounding speech, create audio content, build voice-enabled applications, or generate spoken audio files. Supports multiple voices, adjustable speed, and various audio formats.
audio-injection-testing
Test Bob The Skull with virtual audio injection instead of speaking. Use when testing wake word detection, STT accuracy, full conversation pipeline, or automated testing. Covers setup, configuration, injection methods, and troubleshooting.
wavecap-export
Export WaveCap transcriptions and pager data. Use when the user wants to export transcriptions as JSON, download reviewed transcriptions with audio, or export pager feed data.
gemini-audio
Guide for implementing Google Gemini API audio capabilities - analyze audio with transcription, summarization, and understanding (up to 9.5 hours), plus generate speech with controllable TTS. Use when processing audio files, creating transcripts, analyzing speech/music/sounds, or generating natural speech from text.
whisper-transcribe
Transcribes audio and video files to text using OpenAI's Whisper CLI with contextual grounding.Converts audio/video to text, transcribes recordings, and creates transcripts from media files.Use when asked to "whisper transcribe", "transcribe audio", "convert recording to text", or"speech to text". Uses markdown files in the same directory as context to improve transcriptionaccuracy for technical terms, proper nouns, and domain-specific vocabulary.
rtsafetyauditor
Analyze C++ code for real-time safety violations including heap allocations, locks, blocking calls, and unbounded operations in audio threads.
ai-multimodal
Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image with Imagen 4, editing, composition, refinement), generate videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models with context windows up to 2M tokens.
piper
Convert text to speech using Piper, a fast, local, neural text-to-speech system with natural sounding voices. This skill is triggered when the user says things like "convert text to speech", "text to audio", "read this aloud", "create audio from text", "generate speech from text", "make an audio file from this text", or "use piper TTS".
realitykit-ar-companion
Comprehensive RealityKit skill optimized for building AR companion experiences on iOS and visionOS, with character animation, body/hand tracking, AI integration patterns, spatial audio, and entity lifecycle management
wavecap-evaluate
Evaluate WaveCap audio analysis and transcription accuracy. Use when the user wants to run regression tests, compare transcriptions against ground truth, calculate WER/CER metrics, or assess overall system quality.