unsloth-stt
Fine-tuning Speech-to-Text models like Whisper using Unsloth's optimized LoRA pipeline. Triggers: stt, whisper, transcription, audio fine-tuning, speech-to-text, audio normalization.
$ Install
git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && cp -r /tmp/claude-skill-registry/skills/development/unsloth-stt ~/.claude/skills/claude-skill-registry/
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: unsloth-stt
description: Fine-tuning Speech-to-Text models like Whisper using Unsloth's optimized LoRA pipeline. Triggers: stt, whisper, transcription, audio fine-tuning, speech-to-text, audio normalization.
Overview
Unsloth supports fine-tuning Speech-to-Text (STT) models like OpenAI Whisper. By applying its optimized LoRA pipeline to the Whisper architecture, Unsloth achieves 1.5x faster training with 50% less memory usage compared to standard methods.
When to Use
- When you need to capture specialized terminology (medical, legal) that base Whisper misses.
- When adapting models to specific accents or dialects.
- When fine-tuning large models (like whisper-large-v3) on limited consumer hardware.
Decision Tree
- Is transcription accuracy low on zero-shot inference?
- Yes: Proceed with STT fine-tuning.
- Is audio recorded at a non-standard sample rate?
- Yes: Resample to 16kHz before training.
- Using large-v3?
- Yes: Load in 4-bit and apply LoRA to cross-attention layers.
Workflows
Whisper STT Data Preprocessing
- Load audio files using the `datasets.Audio` feature to handle on-the-fly decoding.
- Resample all training audio to 16kHz to avoid sample-rate mismatch.
- Normalize transcripts to remove characters not present in Whisper's vocabulary.
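The transcript-normalization step above can be sketched with the standard library alone. This is a hedged example, not Whisper's official normalizer (`transformers` ships one in its Whisper utilities); the kept character set here is an assumption and should be adjusted to your language and tokenizer:

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Illustrative transcript cleanup before Whisper fine-tuning."""
    text = unicodedata.normalize("NFKC", text)   # canonical Unicode form
    text = text.lower()                          # match lowercased training text
    text = re.sub(r"[^\w\s'\-]", " ", text)      # drop symbols outside the vocabulary
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text

normalize_transcript("  Héllo,  WORLD!! ")   # -> "héllo world"
```

Applying this consistently to both training and evaluation transcripts matters more than the exact rule set, since WER is computed against the normalized reference.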
Fine-tuning Whisper with Unsloth
- Load `whisper-large-v3` using `FastLanguageModel.from_pretrained` with `load_in_4bit=True`.
- Apply LoRA weights to the cross-attention and self-attention layers of the encoder and decoder.
- Train using `Seq2SeqTrainer`, or a standard `Trainer` if the audio is flattened to tokens.
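The steps above can be sketched as follows. This is an untested sketch assuming the Unsloth API named in this skill and a GPU with 4-bit (bitsandbytes) support; the LoRA rank, hyperparameters, and `train_ds` dataset are placeholder assumptions, and the `target_modules` names follow the Hugging Face Whisper attention projections:

```python
from unsloth import FastLanguageModel
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Load in 4-bit so whisper-large-v3 fits on consumer GPUs.
model, tokenizer = FastLanguageModel.from_pretrained(
    "openai/whisper-large-v3",
    load_in_4bit=True,
)

# Target the self- and cross-attention projections in encoder and decoder.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="whisper-lora",
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        num_train_epochs=3,
    ),
    train_dataset=train_ds,  # the 16kHz, normalized dataset from the workflow above
)
trainer.train()
```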
Non-Obvious Insights
- Whisper fine-tuning in Unsloth uses the same PEFT/LoRA pipeline as LLMs, which is why it can handle large models on consumer GPUs.
- Capturing vocal nuances like accents is significantly more effective through fine-tuning than zero-shot, as the model learns to map specific acoustic features to text tokens.
- Proper audio normalization (16kHz resampling) is the single most important preprocessing step; Whisper's architecture is hard-coded for this sample rate.
Evidence
- "Unsloth supports any transformers compatible TTS/STT model... 1.5x faster with 50% less memory." Source
- "Fine-tuning delivers far more accurate and realistic voice replication than zero-shot." Source
Scripts
- scripts/unsloth-stt_tool.py: Script for resampling and preprocessing audio datasets for Whisper.
- scripts/unsloth-stt_tool.js: Utility for cleaning transcription strings.
Dependencies
- unsloth
- datasets
- librosa
- transformers
References
- [[references/README.md]]