unsloth-vision
Fine-tuning multimodal vision-language models (Llama 3.2 Vision, Qwen2.5 VL) using optimized vision layers (triggers: vision models, multimodal, Llama 3.2 Vision, Qwen2.5 VL, UnslothVisionDataCollator, finetune_vision_layers).
Installer
$ git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && cp -r /tmp/claude-skill-registry/skills/data/unsloth-vision ~/.claude/skills/claude-skill-registry/
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: unsloth-vision
description: Fine-tuning multimodal vision-language models (Llama 3.2 Vision, Qwen2.5 VL) using optimized vision layers (triggers: vision models, multimodal, Llama 3.2 Vision, Qwen2.5 VL, UnslothVisionDataCollator, finetune_vision_layers).
Overview
Unsloth-vision provides optimized support for fine-tuning multimodal models like Llama 3.2 Vision and Qwen2.5 VL. It allows granular control over which layers (vision, language, or both) are updated and includes specialized data collators to handle image padding.
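A minimal loading sketch follows; the checkpoint name and 4-bit setting are illustrative assumptions for consumer-GPU use, not requirements of this skill.

```python
# Sketch: load a vision-language model with Unsloth.
# The checkpoint and quantization choices below are assumptions.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",  # assumed checkpoint
    load_in_4bit=True,                        # fit on consumer hardware
    use_gradient_checkpointing="unsloth",     # reduce VRAM during training
)
```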
When to Use
- When adapting models for specialized visual tasks like medical imaging or OCR-to-code.
- When fine-tuning Llama 3.2 Vision models on consumer hardware.
- When needing to train specifically on assistant responses in a multimodal context.
Decision Tree
- Do you need to update visual feature extraction?
  - Yes: Set finetune_vision_layers = True in get_peft_model (see the PEFT sketch after this list).
- Are your images varying in size?
  - Yes: Standardize to 300-1000px and use UnslothVisionDataCollator.
- Should the model ignore system/user prompts during loss calculation?
  - Yes: Use train_on_responses_only = True in the collator.
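A sketch of the layer-selection flags in get_peft_model; the LoRA rank, alpha, and random_state values are illustrative assumptions, not values prescribed by this skill.

```python
# Sketch: selective PEFT over vision and/or language layers.
# r, lora_alpha, lora_dropout, and random_state are illustrative values.
from unsloth import FastVisionModel

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,      # update the visual feature extractor
    finetune_language_layers=True,    # update the language backbone
    finetune_attention_modules=True,  # include attention projections
    finetune_mlp_modules=True,        # include MLP blocks
    target_modules="all-linear",      # target all linear modules
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=3407,
)
```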
Workflows
- Vision Model Setup: Load models via FastVisionModel.from_pretrained and enable PEFT targeting all-linear modules.
- Multimodal Dataset Preparation: Format data as 'user'/'assistant' conversations with {'type': 'image'} content and standardized dimensions (300-1000px).
- Training for Response Accuracy: Initialize UnslothVisionDataCollator with train_on_responses_only = True and the chat template headers for your model (see the sketch after this list).
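A sketch of the conversation format and the response-only collator. The dataset field names and the header strings are assumptions: the headers shown match the Llama 3.2 chat template, and the instruction_part/response_part keyword names follow Unsloth's response-only training helper; swap in your own model's template headers.

```python
# Sketch: format one example as a multimodal conversation and build a
# collator that computes loss only on assistant responses.
from unsloth.trainer import UnslothVisionDataCollator

def to_conversation(sample, instruction):
    # sample["image"] / sample["answer"] are assumed field names.
    return {"messages": [
        {"role": "user", "content": [
            {"type": "text",  "text": instruction},
            {"type": "image", "image": sample["image"]},  # PIL image, ~300-1000px
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["answer"]},
        ]},
    ]}

collator = UnslothVisionDataCollator(
    model,
    tokenizer,
    train_on_responses_only=True,  # mask system/user tokens out of the loss
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```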
Non-Obvious Insights
- Unsloth allows selective fine-tuning of just the vision layers, just the language layers, or specific components like attention/MLP layers.
- The UnslothVisionDataCollator automatically masks out padding vision tokens, which is essential for stabilizing loss during training.
- Standardizing image resolution to the 300-1000px range strikes a good balance between detail preservation and VRAM efficiency (a resizing sketch follows this list).
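A minimal resizing sketch using Pillow; the 1000px cap mirrors the recommended range above, and the helper name is hypothetical.

```python
# Sketch: cap the longest image side at 1000px so every image falls in
# the recommended 300-1000px range. Helper name is hypothetical.
from PIL import Image

def standardize_image(img: Image.Image, max_side: int = 1000) -> Image.Image:
    w, h = img.size
    scale = max_side / max(w, h)
    if scale < 1.0:  # only downscale; smaller images are left unchanged
        img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    return img
```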
Evidence
- "To finetune vision models, we now allow you to select which parts of the mode to finetune... You can select to only finetune the vision layers, or the language layers..." Source
- "It is best to ensure your dataset has images of all the same size/dimensions. Use dimensions of 300-1000px..." Source
Scripts
- scripts/unsloth-vision_tool.py: Loading and configuring FastVisionModel.
- scripts/unsloth-vision_tool.js: Dataset formatter for multimodal conversations.
Dependencies
- unsloth
- pillow (for image processing)
- torch