unsloth-cpt
Strategies for continued pretraining and domain adaptation in Unsloth (triggers: continued pretraining, CPT, domain adaptation, lm_head, embed_tokens, rsLoRA, embedding_learning_rate).
Installation
```bash
git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && \
  cp -r /tmp/claude-skill-registry/skills/data/unsloth-cpt ~/.claude/skills/claude-skill-registry/
```
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: unsloth-cpt
description: Strategies for continued pretraining and domain adaptation in Unsloth (triggers: continued pretraining, CPT, domain adaptation, lm_head, embed_tokens, rsLoRA, embedding_learning_rate).
Overview
Unsloth-cpt provides specific optimizations for Continued Pretraining (CPT) and domain adaptation. It addresses the critical need for training embedding layers and language modeling heads while stabilizing the training process using Rank Stabilized LoRA (rsLoRA) and differentiated learning rates.
When to Use
- When teaching a model a new language or highly specialized domain (e.g., legal, medical).
- When updating the `embed_tokens` or `lm_head` layers.
- When using high LoRA ranks (e.g., r = 256), which can become unstable without rsLoRA.
Decision Tree
- Are you training on a new domain with unique vocabulary?
  - Yes: Include `lm_head` and `embed_tokens` in `target_modules`.
- Are you using a LoRA rank > 64?
  - Yes: Set `use_rslora = True`.
- Are you training embeddings?
  - Yes: Set `embedding_learning_rate` to 1/10th of the standard learning rate.
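A minimal configuration sketch combining the decisions above, assuming a base model loaded with Unsloth's `FastLanguageModel`; the model name, rank, and alpha values are illustrative choices, not values prescribed by this skill:

```python
from unsloth import FastLanguageModel

# Load a 4-bit base model for continued pretraining (model name is illustrative).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-v0.3-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# CPT-oriented LoRA: target embed_tokens and lm_head so new-domain vocabulary
# is actually learned, keep gate_proj in the list, and enable rsLoRA for the high rank.
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,                                    # high rank for broad domain adaptation
    lora_alpha = 32,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",    # gate_proj helps limit forgetting
        "embed_tokens", "lm_head",              # train embeddings and the LM head
    ],
    use_rslora = True,                          # rank-stabilized scaling: alpha / sqrt(r)
)
```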
Workflows
- New Language Adaptation: Load the base model and configure `get_peft_model` to target `lm_head`, `embed_tokens`, and `gate_proj` with `use_rslora = True`.
- Stabilizing Embedding Updates: Use `UnslothTrainer` and set `learning_rate = 5e-5` with a significantly lower `embedding_learning_rate` (e.g., 5e-6).
- Continued Finetuning from Adapters: Load existing adapters using `from_pretrained` and resume training on refined domain data.
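A sketch of the "Stabilizing Embedding Updates" workflow, assuming `model` and `tokenizer` from the sketch above and a raw-text `dataset` with a `"text"` column; the batch sizes and epoch count are illustrative:

```python
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,              # assumed: a dataset with a "text" column
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        num_train_epochs = 1,
        learning_rate = 5e-5,             # standard LR for the LoRA matrices
        embedding_learning_rate = 5e-6,   # ~10x lower LR for embed_tokens / lm_head
        output_dir = "outputs",
    ),
)
trainer.train()
```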
Non-Obvious Insights
- Training on `lm_head` and `embed_tokens` with the standard learning rate often degrades performance; a 2-10x smaller learning rate is required for stability.
- Including the `gate_proj` matrix in the LoRA modules is essential for CPT; omitting it leads to significantly faster catastrophic forgetting.
- Rank Stabilized LoRA (rsLoRA) is mathematically necessary to maintain scaling stability when using very high ranks (e.g., r = 256) for broad domain adaptation.
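For intuition on the rsLoRA point: standard LoRA scales the adapter update by α/r, which shrinks the effective update as the rank grows, whereas rsLoRA scales by α/√r so the adapter's output magnitude stays roughly rank-independent:

$$
\Delta W_{\text{LoRA}} = \frac{\alpha}{r}\,BA
\qquad\text{vs.}\qquad
\Delta W_{\text{rsLoRA}} = \frac{\alpha}{\sqrt{r}}\,BA
$$

For example, at r = 256 with α = 32 the standard factor is 32/256 = 0.125, while the rsLoRA factor is 32/√256 = 2, so high-rank adapters are not silently scaled toward zero.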
Evidence
- "Blindly training on the lm_head and embed_tokens does even worse! We must use a smaller learning rate for the lm_head and embed_tokens." Source
- "The paper showed how Llama-2 performed well on maths, but not code - this is because the lm_head & embed_tokens weren't trained." Source
Scripts
- `scripts/unsloth-cpt_tool.py`: Configuration for rsLoRA and embedding learning rates.
- `scripts/unsloth-cpt_tool.js`: Helper for calculating rank-scaled learning rates.
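A hypothetical sketch of what such a helper computes; the function name and default values are illustrative and not taken from the repository's scripts:

```python
import math

def cpt_learning_rates(base_lr: float = 5e-5, rank: int = 128, alpha: int = 32) -> dict:
    """Derive differentiated learning rates and the rsLoRA scaling factor for CPT."""
    return {
        "learning_rate": base_lr,                   # LoRA A/B matrices
        "embedding_learning_rate": base_lr / 10,    # embed_tokens / lm_head (2-10x lower)
        "rslora_scaling": alpha / math.sqrt(rank),  # alpha / sqrt(r) instead of alpha / r
    }

print(cpt_learning_rates())
# e.g. {'learning_rate': 5e-05, 'embedding_learning_rate': 5e-06, 'rslora_scaling': 2.8284...}
```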
Dependencies
- unsloth
- torch
- peft
References
- Repository: majiayu000/claude-skill-registry (skills/data/unsloth-cpt)
- Author: majiayu000