Unnamed Skill

Fine-tuning Text-to-Speech (TTS) models with Unsloth for voice cloning and synthetic speech (triggers: TTS, text-to-speech, voice cloning, Orpheus-TTS, audio fine-tuning, speech synthesis).

$ Installer

git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && cp -r /tmp/claude-skill-registry/skills/development/unsloth-tts ~/.claude/skills/claude-skill-registry

// tip: Run this command in your terminal to install the skill

name: unsloth-tts
description: Fine-tuning Text-to-Speech (TTS) models with Unsloth for voice cloning and synthetic speech (triggers: TTS, text-to-speech, voice cloning, Orpheus-TTS, audio fine-tuning, speech synthesis).

Overview

unsloth-tts applies Unsloth's performance optimizations to speech synthesis models. It targets the Llama-based Orpheus-TTS architecture in particular, and Unsloth reports training that is 1.5x faster with 50% less memory than other implementations, which makes high-quality voice cloning practical on smaller GPUs.

When to Use

When creating custom voice clones that require realistic phrasing and emotional quirks.
When fine-tuning Orpheus-TTS for specialized synthesis (e.g., laughter, sighs).
When deploying TTS models via llama.cpp or GGUF.

Decision Tree

Is high audio quality the priority over VRAM?
- Yes: Load with load_in_4bit = False.
Does the target voice have specific quirks (e.g., laughter)?
- Yes: Annotate transcripts with emotion tags like <laughs>.
Is this a quick test or full replication?
- Quick test: Set max_steps = 60.
- Full replication: Train for 1 full epoch with 1-3 hours of data.
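
The decision tree can be read as a mapping from three yes/no answers to loading and training settings. The sketch below is one way to write that down. `RunPlan` and `plan_run` are hypothetical names, and falling back to 4-bit loading on the "No" branch is an assumption, since the tree only states the "Yes" case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunPlan:
    """Hypothetical container for the choices made in the decision tree."""
    load_in_4bit: bool
    max_steps: Optional[int]         # set only for a quick test
    num_train_epochs: Optional[int]  # set only for a full replication
    annotate_emotion_tags: bool


def plan_run(quality_over_vram: bool, voice_has_quirks: bool, quick_test: bool) -> RunPlan:
    """Translate the three decision-tree answers into settings."""
    return RunPlan(
        # Tree: quality over VRAM -> load_in_4bit = False.
        # Assumption: otherwise load in 4-bit to save VRAM.
        load_in_4bit=not quality_over_vram,
        # Tree: quick test -> max_steps = 60.
        max_steps=60 if quick_test else None,
        # Tree: full replication -> 1 full epoch.
        num_train_epochs=None if quick_test else 1,
        # Tree: quirks such as laughter -> annotate transcripts with tags.
        annotate_emotion_tags=voice_has_quirks,
    )
```

For example, `plan_run(True, True, False)` describes a full-quality run over one epoch with tagged transcripts.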

Workflows

Voice Cloning with LoRA: Load orpheus-3b-0.1-pretrained, prepare 1-3 hours of annotated audio/text data, and apply LoRA adapters.
Audio Data Preprocessing: Cast audio columns to 24,000Hz (for Orpheus) and tokenize text while preserving special tags like <sigh>.
Training and Evaluation: Train with low batch size (1) and high learning rate (2e-4), saving only the LoRA adapters for portability.
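
The three workflows above can be sketched roughly as follows. This is an outline under stated assumptions, not the skill's actual script: the function names are hypothetical, the `unsloth/` hub prefix, `max_seq_length`, LoRA rank and target modules are illustrative guesses, and the step that encodes audio into the model's audio tokens is omitted. Heavy imports are deferred into the functions so the module can be imported without a GPU environment.

```python
from typing import Any, Dict

ORPHEUS_SAMPLE_RATE = 24_000  # Hz, the rate the preprocessing workflow names for Orpheus


def training_hyperparams(quick_test: bool = True) -> Dict[str, Any]:
    """Batch size and learning rate from the workflow; run length from the decision tree."""
    params: Dict[str, Any] = {
        "per_device_train_batch_size": 1,
        "learning_rate": 2e-4,
    }
    if quick_test:
        params["max_steps"] = 60
    else:
        params["num_train_epochs"] = 1
    return params


def load_with_lora(model_name: str = "unsloth/orpheus-3b-0.1-pretrained"):
    """Load the base model and attach LoRA adapters (requires unsloth and a GPU)."""
    from unsloth import FastLanguageModel  # deferred: heavy, GPU-dependent import

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,   # assumption: hub id with the "unsloth/" prefix
        max_seq_length=2048,     # assumption: not specified in the workflow
        load_in_4bit=False,      # favors audio quality over VRAM
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=64,                    # assumption: illustrative LoRA rank
        lora_alpha=64,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )
    return model, tokenizer


def resample_audio(dataset, column: str = "audio"):
    """Cast a Hugging Face dataset's audio column to 24,000 Hz."""
    from datasets import Audio  # deferred import

    return dataset.cast_column(column, Audio(sampling_rate=ORPHEUS_SAMPLE_RATE))


def save_adapters(model, tokenizer, out_dir: str = "lora_model") -> None:
    """Save only the LoRA adapter weights plus the tokenizer, not the full model."""
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
```

The dictionary from `training_hyperparams` is meant to be unpacked into whatever trainer configuration is in use, for example `TrainingArguments(output_dir="outputs", **training_hyperparams())`.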

Non-Obvious Insights

Zero-shot voice cloning often fails to capture subtle phrasing; LoRA fine-tuning is necessary for true personality replication.
Orpheus-TTS is beginner-friendly because its Llama-based architecture works with existing LLM tooling, such as llama.cpp and the GGUF format.
Unlike some TTS systems, models like Orpheus can decode audio output tokens directly into waveforms without needing a separate vocoder.

Evidence

"Unsloth supports any transformers compatible TTS model... 1.5x faster with 50% less memory than other implementations." Source
"Orpheus supports tags like <laugh>, <chuckle>, ... These tags are enclosed in angle brackets and will be treated as special tokens." Source

Scripts

scripts/unsloth-tts_tool.py: Audio preprocessing and Orpheus loading.
scripts/unsloth-tts_tool.js: Utility for managing transcript tagging.
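
The scripts themselves are not reproduced here. As a rough illustration of what transcript tag management can involve, the Python sketch below extracts angle-bracket tags from a transcript and flags any that are not in an allowed set. The helper names, the regular expression, and the example tag set are all assumptions, not the contents of the scripts listed above.

```python
import re
from typing import Iterable, List

# Assumption: tags are lowercase words (underscores allowed) in angle brackets.
TAG_PATTERN = re.compile(r"<([a-z_]+)>")


def find_tags(transcript: str) -> List[str]:
    """Return tag names in order of appearance, without the brackets."""
    return TAG_PATTERN.findall(transcript)


def unknown_tags(transcript: str, allowed: Iterable[str]) -> List[str]:
    """Return tags in the transcript that are not in the allowed set."""
    allowed_set = set(allowed)
    return [tag for tag in find_tags(transcript) if tag not in allowed_set]


def strip_tags(transcript: str) -> str:
    """Remove tags and collapse the leftover whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", transcript).split())
```

A check such as `unknown_tags(text, {"laughs", "sigh"})` can be run over a dataset before training to catch misspelled tags that the model would otherwise see as ordinary text.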

Dependencies

unsloth
librosa / soundfile
datasets

References

references/README.md