unsloth-stt

Fine-tuning Speech-to-Text models like Whisper using Unsloth's optimized LoRA pipeline. Triggers: stt, whisper, transcription, audio fine-tuning, speech-to-text, audio normalization.

$ Install

git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && cp -r /tmp/claude-skill-registry/skills/development/unsloth-stt ~/.claude/skills/claude-skill-registry

// tip: Run this command in your terminal to install the skill


name: unsloth-stt
description: Fine-tuning Speech-to-Text models like Whisper using Unsloth's optimized LoRA pipeline. Triggers: stt, whisper, transcription, audio fine-tuning, speech-to-text, audio normalization.

Overview

Unsloth supports fine-tuning of Speech-to-Text (STT) models such as OpenAI Whisper. By applying its optimized LoRA pipeline to the Whisper architecture, Unsloth achieves 1.5x faster training with 50% less memory usage compared to standard methods.

When to Use

  • When you need to capture specialized terminology (medical, legal) that base Whisper misses.
  • When adapting models to specific accents or dialects.
  • When fine-tuning large models (like whisper-large-v3) on limited consumer hardware.

Decision Tree

  1. Is transcription accuracy low on zero-shot inference?
    • Yes: Proceed with STT fine-tuning.
  2. Is audio recorded at a non-standard sample rate?
    • Yes: Resample to 16kHz before training (see the sketch after this list).
  3. Using large-v3?
    • Yes: Load in 4-bit and apply LoRA to cross-attention layers.
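A minimal sketch for step 2 of the decision tree, using librosa (already listed under Dependencies). The file path is a placeholder; how you load audio depends on how your corpus is stored.

  import librosa

  # Whisper's feature extractor expects 16kHz input, so detect and fix any
  # other sample rate before building the training set.
  audio, sr = librosa.load("clip.wav", sr=None)  # sr=None keeps the native rate
  if sr != 16000:
      audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)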

Workflows

Whisper STT Data Preprocessing

  1. Load audio files using datasets.Audio feature to handle on-the-fly decoding.
  2. Resample all training audio to 16kHz to avoid sample rate mismatch.
  3. Normalize transcripts to remove characters not present in Whisper's vocabulary (see the sketch below).
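A sketch of this preprocessing workflow with the datasets library. The dataset name and the "audio"/"sentence" column names are illustrative assumptions (Common Voice uses them); adapt the normalization rules to your domain.

  import re
  from datasets import load_dataset, Audio

  # Illustrative dataset; substitute your own corpus with audio + transcript columns.
  ds = load_dataset("mozilla-foundation/common_voice_16_1", "en", split="train")

  # Steps 1-2: decode audio files on the fly and resample everything to 16kHz.
  ds = ds.cast_column("audio", Audio(sampling_rate=16000))

  def normalize_transcript(example):
      # Step 3: strip characters you do not want the model to learn to emit;
      # adjust for your domain (e.g. keep casing for medical abbreviations).
      text = example["sentence"].lower()
      text = re.sub(r"[^\w\s']", "", text)
      example["sentence"] = text.strip()
      return example

  ds = ds.map(normalize_transcript)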

Fine-tuning Whisper with Unsloth

  1. Load whisper-large-v3 using FastLanguageModel.from_pretrained with load_in_4bit=True.
  2. Apply LoRA weights to cross-attention and self-attention layers of the encoder and decoder.
  3. Train using Seq2SeqTrainer (or the standard Trainer if audio is flattened to tokens), as sketched below.
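A hedged sketch of the three steps above. The workflow names FastLanguageModel, while recent Unsloth releases expose audio models through FastModel, so the class name, model id, LoRA target modules, and hyperparameters below are assumptions to verify against the Unsloth Whisper notebook for your installed version. The data collator (padding input features and label ids) is assumed to be defined separately.

  from unsloth import FastModel  # older Unsloth versions expose FastLanguageModel instead
  from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

  # Step 1: load the base model with 4-bit quantized weights.
  model, processor = FastModel.from_pretrained(
      "openai/whisper-large-v3",
      load_in_4bit=True,
  )

  # Step 2: attach LoRA adapters to the attention projections of encoder and decoder.
  model = FastModel.get_peft_model(
      model,
      r=16,
      lora_alpha=32,
      target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
  )

  # Step 3: train with Seq2SeqTrainer; train_ds is the dataset prepared above,
  # data_collator pads input_features and label ids (defined elsewhere).
  trainer = Seq2SeqTrainer(
      model=model,
      args=Seq2SeqTrainingArguments(
          output_dir="whisper-lora",
          per_device_train_batch_size=8,
          learning_rate=1e-4,
          num_train_epochs=3,
          fp16=True,
      ),
      train_dataset=train_ds,
      data_collator=data_collator,
  )
  trainer.train()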

Non-Obvious Insights

  • Whisper fine-tuning in Unsloth uses the same PEFT/LoRA pipeline as LLMs, which is why it can handle large models on consumer GPUs.
  • Capturing vocal nuances like accents is significantly more effective through fine-tuning than zero-shot, as the model learns to map specific acoustic features to text tokens.
  • Proper audio normalization (16kHz resampling) is the single most important preprocessing step; Whisper's architecture is hard-coded for this sample rate (see the check below).
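A quick way to confirm that the 16kHz expectation is baked into the model's preprocessing configuration (a sanity check, not a training step):

  from transformers import WhisperFeatureExtractor

  # The sampling rate the feature extractor was built around.
  fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
  print(fe.sampling_rate)  # 16000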

Evidence

  • "Unsloth supports any transformers compatible TTS/STT model... 1.5x faster with 50% less memory." Source
  • "Fine-tuning delivers far more accurate and realistic voice replication than zero-shot." Source

Scripts

  • scripts/unsloth-stt_tool.py: Script for resampling and preprocessing audio datasets for Whisper.
  • scripts/unsloth-stt_tool.js: Utility for cleaning transcription strings.

Dependencies

  • unsloth
  • datasets
  • librosa
  • transformers

References

  • [[references/README.md]]