video-storytelling
Install
Run this command in your terminal to install the skill:
git clone https://github.com/jkitchin/skillz /tmp/skillz && cp -r /tmp/skillz/skills/creative/video-storytelling ~/.claude/skills/skillz
name: video-storytelling
description: |
  Create coherent video story sequences with AI-generated images and narrated audio. Combines image-generation and elevenlabs skills to produce complete video stories with visual and narrative consistency across all scenes. Maintains character appearance, style, lighting, and voice consistency throughout the story.

  Use this skill when the user requests:
  - Video stories with narration
  - Animated story sequences
  - Educational video content
  - Character-driven narratives with visuals
  - Multi-scene story videos
  - Narrated image sequences

  Features: Visual consistency locks, character persistence, multi-turn image generation, character voice narration, automatic video assembly

  Default: 1 title scene + 5 story scenes
  Dependencies: image-generation skill, elevenlabs skill, ffmpeg
allowed-tools: ["*"]
Video Storytelling
Purpose
This skill creates coherent video story sequences by combining AI-generated images with narrated audio. It acts as a story director and visual coordinator, keeping characters, visual style, lighting, and narrative tone consistent across all scenes, and produces complete MP4 videos with synchronized images and character voiceovers.
When to Use
This skill should be invoked when the user asks to:
- Create a video story or animated sequence
- Generate a narrated story with visuals
- Produce educational video content with characters
- Make a visual storybook with voiceover
- Create character-driven video narratives
- Generate multi-scene story videos
- Produce children's stories with pictures and narration
Core Capabilities
Visual Consistency System
Global Style Lock:
- Fixed aspect ratio, camera settings, lighting
- Consistent color palette across all scenes
- Uniform visual style and post-processing
- Prevents visual discontinuities
Character Lock:
- Maintains character appearance across scenes
- Same outfit, colors, facial features throughout
- Consistent accessories and distinctive traits
- Visual identity preservation
Multi-Turn Image Generation:
- Each scene references previous scene's image
- Builds visual continuity through the sequence
- Prevents character/style drift
Narrative System
Character Voices:
- Maps characters to ElevenLabs voices
- Maintains voice consistency per character
- Supports multiple characters in dialogue
Emotion Tags:
- Expressive narration with emotion markers
- Sound effects and pauses
- Natural pacing and delivery
Narrative Structure:
- 50-80 words per scene (15-20 seconds)
- Neutral narrator for transitions
- Character-specific dialogue
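As a rough check on the word and duration targets above, speaking time can be estimated from word count. The sketch below assumes a constant speaking rate, which is a simplification and not a measured property of any ElevenLabs voice; `estimate_seconds` and the rates shown are hypothetical. Fitting 50 to 80 words into 15 to 20 seconds implies roughly 200 to 240 words per minute, so a slower delivery will run longer.

```python
def estimate_seconds(word_count: int, words_per_minute: float) -> float:
    """Estimate narration length assuming a constant speaking rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count * 60.0 / words_per_minute


# Rates are assumptions chosen for illustration, not measured voice speeds.
for wpm in (150, 200, 240):
    low = estimate_seconds(50, wpm)
    high = estimate_seconds(80, wpm)
    print(f"{wpm} wpm: 50 words ~ {low:.1f}s, 80 words ~ {high:.1f}s")
```

If the generated audio comes out longer than planned, trimming the narrative is usually simpler than changing the scene count.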
Video Assembly
Automated Pipeline:
- Generates all images in sequence
- Creates character voice narration
- Combines into synchronized MP4 video
- Equal time per scene based on total audio length
Default Configuration
Scene Structure
- Default: 1 title scene (scene 0) + 5 story scenes (scenes 1-5)
- Total: 6 scenes
- Customizable: User can specify different scene counts
Default Style Lock
STYLE_LOCK:
- Aspect ratio: 1080×1080 (square)
- Camera: 50mm lens, eye-level perspective
- Lighting: soft three-point lighting, warm key light (4500K)
- Color palette: #0B5FFF, #FFB703, #FB8500, #023047, #8ECAE6
- Materials: matte finish, no film grain or heavy bloom
- Background: subtle gradient, clean composition
- Style: semi-realistic cartoon with clear lines and gentle shading
- Post: crisp focus, no vignette or text artifacts
NEGATIVE_LOCK:
no text errors, no misspellings, no watermarks, no stickers,
no extra characters, no visual noise, no drastic lighting changes
Customization: Users can override with custom style locks, but defaults ensure consistency.
Default Voice Mapping
From ElevenLabs Voices:
Narrators:
- Neutral Narrator (male): George (JBFqnCBsd6RMkjVDRZzb)
- Neutral Narrator (female): Rachel (21m00Tcm4TlvDq8ikWAM)
Character Voices:
- Young Male (energetic): Josh (TxGEqnHWrfWFTfGW9XjX)
- Young Female (calm): Rachel (21m00Tcm4TlvDq8ikWAM)
- Young Female (expressive): Bella (EXAVITQu4vr4xnSDxMaL)
- Male (authoritative): Adam (pNInz6obpgDQGcFmaJgB)
- Female (warm): Matilda (XrExE9yKIg1WjnnlVkGX)
- Young Male (friendly): Antoni (ErXwobaYiN019PkySvjV)
Assignment Logic:
- If the character's gender and age are specified, match them to an appropriate voice
- If age is unspecified, use Josh for male characters and Rachel for female characters
- The narrator defaults to George (male) or Rachel (female)
Instructions
Step 1: Gather Story Information
Collect necessary information from the user:
Required:
- Story concept: What is the story about?
- Tone/Genre: Educational, adventure, comedy, drama, etc.
Optional (prompt if missing):
- Number of scenes: Default is 6 (1 title + 5 story), but user can specify
- Character descriptions: Names, appearance, personality
- Custom style locks: Override defaults if user has specific requirements
Example Prompts:
"What's your story about?"
"How many scenes would you like? (Default: 1 title + 5 story scenes)"
"Describe your main character(s): name, appearance, personality"
"Any specific visual style preferences? (Default: semi-realistic cartoon)"
Step 2: Define Characters
For each character in the story, create a character profile:
Character Profile Template:
character = {
    "name": "Character Name",
    "species": "human/animal/creature",
    "description": "brief description",
    "colors": {
        "primary": "#HEX",
        "secondary": "#HEX"
    },
    "outfit": "clothing description",
    "features": ["distinctive trait 1", "trait 2", "trait 3"],
    "personality": "personality description",
    "voice_id": "elevenlabs-voice-id",
    "voice_name": "ElevenLabs voice name"
}
Example:
pyter_python = {
    "name": "Pyter Python",
    "species": "friendly snake mascot",
    "description": "A cheerful coding mentor snake",
    "colors": {
        "primary": "#0B5FFF",    # Blue body
        "secondary": "#FFB703"   # Yellow belly
    },
    "outfit": "tiny white lab coat with circular π logo",
    "features": ["large brown eyes", "rounded head", "cheerful smile"],
    "personality": "enthusiastic, helpful, curious",
    "voice_id": "TxGEqnHWrfWFTfGW9XjX",
    "voice_name": "Josh"
}
Voice Assignment:
- Ask user for voice preference or auto-assign based on character
- Use default mapping for common types
- Allow custom voice selection from ElevenLabs library
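Because the lock builder in Step 4 reads `colors["primary"]` and `colors["secondary"]`, a profile that uses other key names fails only once image generation starts. A minimal sketch of an early check follows; `missing_profile_fields` and the list of required keys are assumptions for illustration, not part of the skill.

```python
# Keys the later steps read from a profile (an assumed, illustrative list).
REQUIRED_KEYS = ("name", "species", "outfit", "features", "voice_id")
REQUIRED_COLORS = ("primary", "secondary")


def missing_profile_fields(profile: dict) -> list:
    """Return the names of required fields that are absent from a profile."""
    missing = [key for key in REQUIRED_KEYS if key not in profile]
    colors = profile.get("colors", {})
    missing += [f"colors.{key}" for key in REQUIRED_COLORS if key not in colors]
    return missing


profile = {
    "name": "Pyter Python",
    "species": "friendly snake mascot",
    "colors": {"body": "#0B5FFF", "belly": "#FFB703"},  # wrong key names
    "outfit": "tiny white lab coat with circular π logo",
    "features": ["large brown eyes", "rounded head", "cheerful smile"],
    "voice_id": "TxGEqnHWrfWFTfGW9XjX",
}
print(missing_profile_fields(profile))  # ['colors.primary', 'colors.secondary']
```

Running a check like this on every profile before Step 5 avoids spending image-generation calls on a story that cannot finish.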
Step 3: Plan Story Sequence
Create scene-by-scene outline:
Scene 0 (Title Scene):
- Visual: Title card with main character(s)
- Audio: Story introduction (narrator or main character)
- Duration: ~15-20 seconds
Scenes 1-N (Story Scenes):
- Visual: Sequential story moments
- Audio: Narrative with character dialogue
- Duration: ~15-20 seconds each
Example Scene Plan:
scene_plan = [
    {
        "number": 0,
        "type": "title",
        "visual_description": "Pyter Python with laptop, 'Pyter's Coding Adventure' text overlay",
        "characters": ["Pyter Python"],
        "narrative": "[cheerful] Join Pyter Python on an exciting coding adventure!",
        "speaker": "Narrator",
        "voice_id": "JBFqnCBsd6RMkjVDRZzb"
    },
    {
        "number": 1,
        "type": "story",
        "visual_description": "Pyter at desk looking at computer screen showing error message, confused expression",
        "characters": ["Pyter Python"],
        "narrative": "[confused] Hmm... what does this error message mean? [pause] I thought my code was perfect!",
        "speaker": "Pyter Python",
        "voice_id": "TxGEqnHWrfWFTfGW9XjX"
    },
    # ... more scenes
]
Step 4: Build Style and Character Locks
Prepare Global Style Lock:
STYLE_LOCK = """
Aspect ratio: 1080×1080 (square)
Camera: 50mm lens, eye-level perspective
Lighting: soft three-point lighting, warm key light (4500K)
Color palette: #0B5FFF, #FFB703, #FB8500, #023047, #8ECAE6
Materials: matte finish, no film grain or heavy bloom
Background: subtle gradient, clean composition
Style: semi-realistic cartoon with clear lines and gentle shading
Post: crisp focus, no vignette or text artifacts
"""
NEGATIVE_LOCK = """
no text errors, no misspellings, no watermarks, no stickers,
no extra characters, no visual noise, no drastic lighting changes
"""
Build Character Lock for Each Scene:
def build_character_lock(characters_in_scene):
    """Return the character-lock text for every character in a scene."""
    lock = ""
    for character in characters_in_scene:
        # Profiles must use the "primary"/"secondary" color keys from the template
        lock += f"""
Character: {character['name']}
Species: {character['species']}
Colors: body {character['colors']['primary']}, secondary {character['colors']['secondary']}
Outfit: {character['outfit']}
Key features: {', '.join(character['features'])}
"""
    return lock
Step 5: Generate Image Sequence
Generate images using multi-turn generation for consistency:
Implementation:
# Assumes char_profiles maps character names to the profiles from Step 2, and
# that generate_image / save_image are provided by the image-generation skill.

# Initialize tracking
previous_image_id = None
image_files = []

# Generate each scene
for scene in scene_plan:
    print(f"Generating Scene {scene['number']}: {scene['visual_description']}")

    # Build character lock for this scene
    character_lock = build_character_lock(
        [char_profiles[name] for name in scene['characters']]
    )

    # Build complete image prompt
    image_prompt = f"""
{STYLE_LOCK}
{character_lock}
Scene Description:
{scene['visual_description']}
{NEGATIVE_LOCK}
"""

    # Add reference to previous scene if not first scene
    if previous_image_id:
        image_prompt += f"\nReference previous scene for consistency: {previous_image_id}"

    # Generate image using image-generation skill
    # (This would invoke the image-generation skill)
    # For implementation, use appropriate model (DALL-E 3 or Gemini Pro)
    result = generate_image(
        prompt=image_prompt,
        model="dall-e-3",  # or gemini-3-pro-image-preview
        size="1024x1024",
        reference_image=previous_image_id
    )

    # Save image
    filename = f"scene-{scene['number']:02d}.png"
    save_image(result, filename)
    image_files.append(filename)

    # Track for next scene reference
    previous_image_id = result['image_id']
    print(f"  ✓ Saved: {filename}")
Key Points:
- Scene 0 generates base image
- Scenes 1+ reference previous scene for consistency
- Apply STYLE_LOCK and CHARACTER_LOCK to every prompt
- Save with sequential numbering
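The prompt assembly described in these key points can be isolated into a small pure function, which makes it easy to inspect a prompt before spending an image-generation call. This is a sketch only: `build_image_prompt` is a hypothetical helper, and the wording of the continuity line is an assumption based on the phrases recommended under Best Practices.

```python
from typing import Optional


def build_image_prompt(
    style_lock: str,
    character_lock: str,
    scene_description: str,
    negative_lock: str,
    previous_image_id: Optional[str] = None,
) -> str:
    """Assemble one scene prompt: locks first, then the scene, then negatives."""
    parts = [
        style_lock.strip(),
        character_lock.strip(),
        "Scene Description:\n" + scene_description.strip(),
        negative_lock.strip(),
    ]
    if previous_image_id:
        # Scenes 1+ carry an explicit continuity instruction.
        parts.append(
            "Maintain exact character appearance, lighting, and style "
            f"from the previous scene: {previous_image_id}"
        )
    # Skip empty sections (for example a scene with no characters).
    return "\n\n".join(part for part in parts if part)


first = build_image_prompt("Style: cartoon", "Character: Pyter", "Pyter at desk", "no watermarks")
later = build_image_prompt("Style: cartoon", "Character: Pyter", "Pyter reading", "no watermarks", "img-001")
print(later)
```

Keeping the locks in a fixed order means two prompts for different scenes differ only in the scene description and the reference line, which makes drift easier to diagnose.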
Step 6: Generate Narrative Audio
Create voice narration for each scene:
Implementation:
import os
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ['ELEVENLABS_API_KEY'])
audio_files = []

for scene in scene_plan:
    print(f"Generating audio for Scene {scene['number']}")

    # Prepare dialogue input
    dialogue_input = {
        "text": scene['narrative'],
        "name": scene['speaker'],
        "voice_id": scene['voice_id']
    }

    # Generate audio using text_to_dialogue
    audio = client.text_to_dialogue.convert(
        inputs=[dialogue_input]
    )

    # Save audio file (the response is streamed in chunks)
    filename = f"scene-{scene['number']:02d}.mp3"
    with open(filename, 'wb') as f:
        for chunk in audio:
            f.write(chunk)
    audio_files.append(filename)
    print(f"  ✓ Saved: {filename}")
Narrative Guidelines:
- 50-80 words per scene
- Use emotion tags: [excited], [thoughtful], [confused], [pause]
- Include sound effects when appropriate: [sound effect: door creaking]
- Vary pacing with pauses
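These guidelines can be checked mechanically before any audio is generated. The sketch below counts spoken words and lists bracketed tags; `analyze_narrative` is a hypothetical helper, and treating every `[...]` span as a tag (including sound effects) is an assumption about the tag syntax.

```python
import re

# Any bracketed span is treated as a tag: [pause], [sound effect: ...], etc.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def analyze_narrative(text: str) -> dict:
    """Count spoken words and list bracketed tags in a narrative string."""
    tags = TAG_PATTERN.findall(text)
    spoken = TAG_PATTERN.sub(" ", text)  # remove tags before counting words
    return {"word_count": len(spoken.split()), "tags": tags}


line = "[confused] Hmm... what does this error message mean? [pause] I thought my code was perfect!"
report = analyze_narrative(line)
print(report)  # {'word_count': 13, 'tags': ['confused', 'pause']}
```

A scene whose word count falls far outside the target range, or that carries many more tags than intended, can then be revised before calling the ElevenLabs API.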
Step 7: Concatenate Audio
Combine all scene audio into single track:
Implementation:
import subprocess

# Build the ffmpeg concat filter, labeling each input's audio stream explicitly
stream_labels = "".join(f"[{i}:a]" for i in range(len(audio_files)))
concat_filter = f"{stream_labels}concat=n={len(audio_files)}:v=0:a=1[out]"

inputs = []
for audio_file in audio_files:
    inputs.extend(['-i', audio_file])

cmd = ['ffmpeg', '-y'] + inputs + [
    '-filter_complex', concat_filter,
    '-map', '[out]',
    'full_audio.mp3'
]
subprocess.run(cmd, check=True)
print("✓ Audio concatenated: full_audio.mp3")
Step 8: Assemble Final Video
Use the included assemble_video.sh script:
Implementation:
import subprocess
from pathlib import Path
# Prepare command
script_path = Path(__file__).parent / "scripts" / "assemble_video.sh"
cmd = [str(script_path), "full_audio.mp3"] + image_files
# Run assembly
subprocess.run(cmd, check=True)
# Output will be full_audio.mp4
print("✓ Video created: full_audio.mp4")
Script Details:
- Calculates equal time per image based on total audio length
- Creates video segment for each image
- Ensures all images are exactly 1080×1080 (pads if needed)
- Concatenates segments
- Muxes with audio track
- Outputs high-quality MP4 with H.264
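One consequence of giving every image an equal share of the total audio length is that image changes only line up with the narration when all scenes are about the same length. The arithmetic below is a sketch of that reading of the script's behavior (the script itself is not shown here), using the per-scene durations from the example output in Step 9.

```python
from itertools import accumulate

# Per-scene narration lengths in seconds (example durations from Step 9).
audio_seconds = [20, 18, 17, 19, 16, 15]

total = sum(audio_seconds)               # length of the concatenated track
per_image = total / len(audio_seconds)   # equal share given to each image

# When each scene's narration starts vs. when its image appears.
audio_starts = [0] + list(accumulate(audio_seconds))[:-1]
image_starts = [i * per_image for i in range(len(audio_seconds))]
drift = [img - aud for img, aud in zip(image_starts, audio_starts)]

print(f"total={total}s, per_image={per_image}s")
for number, offset in enumerate(drift):
    print(f"scene {number}: image appears {offset:+.1f}s relative to its audio")
```

With these durations each image is shown for 17.5 seconds and later images appear up to 4 seconds before their narration starts, which is one reason to keep scene lengths close to each other.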
Step 9: Deliver Results
Provide user with:
- Final video file: <story-name>.mp4
- Scene breakdown: Summary of each scene
- Individual assets: Images and audio files (if requested)
- Story metadata: Character profiles, scene plan (if requested)
Example Output:
✓ Video Story Created: pyter-coding-adventure.mp4
Scenes:
0. Title: "Pyter's Coding Adventure" (20s)
1. Pyter encounters an error (18s)
2. Pyter realizes the mistake (17s)
3. Pyter fixes the code (19s)
4. Code runs successfully (16s)
5. Pyter celebrates (15s)
Total Duration: 1:45
Resolution: 1080×1080
Characters: Pyter Python (voiced by Josh)
Files generated:
- pyter-coding-adventure.mp4 (final video)
- scene-00.png through scene-05.png (images)
- scene-00.mp3 through scene-05.mp3 (audio)
- full_audio.mp3 (concatenated audio)
Character Voice Reference
ElevenLabs Voice IDs
Narrators:
- George (male, middle-aged, narrative): JBFqnCBsd6RMkjVDRZzb
- Rachel (female, young, calm): 21m00Tcm4TlvDq8ikWAM
Young Characters:
- Josh (male, energetic): TxGEqnHWrfWFTfGW9XjX
- Bella (female, expressive): EXAVITQu4vr4xnSDxMaL
- Antoni (male, friendly): ErXwobaYiN019PkySvjV
- Elli (female, emotional): MF3mGyEYCl7XYWbV9V6O
Adult Characters:
- Adam (male, authoritative): pNInz6obpgDQGcFmaJgB
- Domi (female, confident): AZnzlk1XvdvUeBnXmlld
- Matilda (female, warm): XrExE9yKIg1WjnnlVkGX
Assignment Strategy:
def assign_voice(character):
    """Auto-assign voice based on character attributes"""
    # Check for explicit assignment
    # (get_voice_id is not defined here; it is assumed to map a voice name to its ID)
    if 'voice_preference' in character:
        return get_voice_id(character['voice_preference'])

    # Auto-assign based on attributes, defaulting to a young male voice
    age = character.get('age', 'young')
    gender = character.get('gender', 'male')

    if age == 'young':
        if gender == 'male':
            return 'TxGEqnHWrfWFTfGW9XjX'  # Josh
        else:
            return '21m00Tcm4TlvDq8ikWAM'  # Rachel
    else:  # adult
        if gender == 'male':
            return 'pNInz6obpgDQGcFmaJgB'  # Adam
        else:
            return 'XrExE9yKIg1WjnnlVkGX'  # Matilda
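The `get_voice_id` helper called for an explicit voice preference is not defined in this document. One possible shape for it, assuming a preference is either a voice name or a raw voice ID, is a lookup over the reference list above; the function body and the error handling are illustrative assumptions.

```python
# Voice names and IDs copied from the reference list in this section.
VOICE_IDS = {
    "george": "JBFqnCBsd6RMkjVDRZzb",
    "rachel": "21m00Tcm4TlvDq8ikWAM",
    "josh": "TxGEqnHWrfWFTfGW9XjX",
    "bella": "EXAVITQu4vr4xnSDxMaL",
    "antoni": "ErXwobaYiN019PkySvjV",
    "elli": "MF3mGyEYCl7XYWbV9V6O",
    "adam": "pNInz6obpgDQGcFmaJgB",
    "domi": "AZnzlk1XvdvUeBnXmlld",
    "matilda": "XrExE9yKIg1WjnnlVkGX",
}


def get_voice_id(voice_preference: str) -> str:
    """Resolve a voice name (case-insensitive) or pass through a known voice ID."""
    key = voice_preference.strip().lower()
    if key in VOICE_IDS:
        return VOICE_IDS[key]
    if voice_preference in VOICE_IDS.values():
        return voice_preference
    raise KeyError(f"Unknown voice: {voice_preference!r}")


print(get_voice_id("Bella"))  # EXAVITQu4vr4xnSDxMaL
```

Raising on an unknown name, rather than silently falling back to a default, keeps a typo in a profile from producing a story narrated in the wrong voice.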
Example Story Generation
Complete Example: "Pyter's First Bug"
User Request: "Create a short story about a coding snake fixing his first bug"
Step 1: Character Definition
pyter = {
    "name": "Pyter Python",
    "species": "friendly snake",
    "colors": {"primary": "#0B5FFF", "secondary": "#FFB703"},  # blue body, yellow belly
    "outfit": "white lab coat with π logo",
    "features": ["large brown eyes", "rounded head", "cheerful smile"],
    "personality": "enthusiastic learner",
    "voice_id": "TxGEqnHWrfWFTfGW9XjX"  # Josh
}
Step 2: Scene Plan
scenes = [
    {
        "number": 0,
        "visual_description": "Pyter with laptop, title 'Pyter's First Bug'",
        "characters": ["Pyter Python"],
        "narrative": "[cheerful] Today, Pyter Python will fix his very first coding bug!",
        "speaker": "Narrator",
        "voice_id": "JBFqnCBsd6RMkjVDRZzb"
    },
    {
        "number": 1,
        "visual_description": "Pyter staring at screen with red error message",
        "characters": ["Pyter Python"],
        "narrative": "[confused] Wait... why isn't my code working? [pause] The computer says there's a syntax error!",
        "speaker": "Pyter Python",
        "voice_id": "TxGEqnHWrfWFTfGW9XjX"
    },
    {
        "number": 2,
        "visual_description": "Pyter reading a Python book, thoughtful",
        "characters": ["Pyter Python"],
        "narrative": "[thoughtful] Let me check the Python book... [pause] Oh! I need to look at line 5 carefully.",
        "speaker": "Pyter Python",
        "voice_id": "TxGEqnHWrfWFTfGW9XjX"
    },
    {
        "number": 3,
        "visual_description": "Close-up of Pyter pointing at screen, realization",
        "characters": ["Pyter Python"],
        "narrative": "[excited] I found it! I forgot to close the parentheses! [pause] That's the bug!",
        "speaker": "Pyter Python",
        "voice_id": "TxGEqnHWrfWFTfGW9XjX"
    },
    {
        "number": 4,
        "visual_description": "Screen showing 'Success!' with green checkmark",
        "characters": ["Pyter Python"],
        "narrative": "[proud] I fixed it! My code is running perfectly now!",
        "speaker": "Pyter Python",
        "voice_id": "TxGEqnHWrfWFTfGW9XjX"
    },
    {
        "number": 5,
        "visual_description": "Pyter celebrating, confetti in background",
        "characters": ["Pyter Python"],
        "narrative": "[warm] And that's how Pyter learned that every programmer makes mistakes... and that's okay!",
        "speaker": "Narrator",
        "voice_id": "JBFqnCBsd6RMkjVDRZzb"
    }
]
Step 3: Generate (using process described above)
Output: pyters-first-bug.mp4 with 6 scenes, ~90 seconds total
Requirements
Skills:
- image-generation - For creating consistent visual scenes
- elevenlabs - For character voice narration
Python Packages:
pip install elevenlabs pillow
System:
- Python 3.8+
- ffmpeg (for video assembly)
- Bash shell (for assemble_video.sh script)
- 2GB+ free disk space (for temporary files)
API Keys:
- OpenAI or Google (for image generation)
- ElevenLabs (for voice narration)
File Permissions:
- Execute permission for assemble_video.sh
Best Practices
Story Planning
Keep it Simple:
- Start with 6 scenes (1 title + 5 story)
- Clear beginning, middle, end
- Single main character for first stories
Character Consistency:
- Define characters completely before starting
- Use distinctive visual features
- Maintain outfit/colors throughout
Pacing:
- 15-20 seconds per scene ideal
- Use pauses for dramatic effect
- Vary emotion tags for expressiveness
Visual Consistency
Use Style Locks:
- Apply to every scene without exception
- Don't modify mid-story
- Custom locks should be complete, not partial
Character Locks:
- Specify colors with hex codes
- List 3-5 distinctive features
- Include outfit details
Multi-Turn References:
- Always reference previous scene
- Mention "maintain character appearance"
- Note "same lighting and style"
Audio Quality
Narrative Guidelines:
- Write naturally for speech
- Use emotion tags sparingly (1-2 per scene)
- Include pauses for pacing
Voice Selection:
- Match voice to character age/personality
- Keep narrator voice neutral
- Maintain voice consistency per character
Audio Testing:
- Generate one scene first to test
- Verify voice/emotion match intent
- Adjust before generating all scenes
Video Assembly
File Organization:
- Use consistent naming (scene-XX.png/mp3)
- Keep in flat directory structure
- Clean up temp files after assembly
Quality Settings:
- Default 1080×1080 ensures quality
- H.264 baseline profile for compatibility
- AAC audio at 192kbps
Testing:
- Verify all images are same size
- Check audio files are valid
- Test script with 2-3 scenes first
Troubleshooting
Visual Inconsistencies
Problem: Character looks different across scenes
Solutions:
- Ensure character lock is applied to every prompt
- Verify previous image is referenced
- Add "maintain exact character appearance from previous scene"
- Use more specific color hex codes
Audio Issues
Problem: Voice doesn't match character
Solutions:
- Verify voice_id is correct
- Test voice with sample text first
- Check character voice assignment logic
Problem: Concatenated audio has gaps
Solutions:
- Ensure all audio files are valid MP3
- Check ffmpeg concat filter syntax
- Verify no missing scene audio files
Video Assembly Errors
Problem: Script fails with "file not found"
Solutions:
- Verify all image files exist
- Check audio file path
- Ensure script has execute permissions
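The first two checks can be run before calling the assembly script. The sketch below looks for the scene-XX.png and scene-XX.mp3 files that the naming convention implies; `find_missing_assets` is a hypothetical helper, and it assumes scenes are numbered from 0 with two-digit padding as in Steps 5 and 6.

```python
from pathlib import Path
import tempfile


def find_missing_assets(directory: Path, scene_count: int) -> list:
    """List expected scene-XX.png / scene-XX.mp3 files that do not exist."""
    missing = []
    for number in range(scene_count):
        for suffix in (".png", ".mp3"):
            path = directory / f"scene-{number:02d}{suffix}"
            if not path.is_file():
                missing.append(path.name)
    return missing


# Demonstration in a throwaway directory with one audio file left out.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("scene-00.png", "scene-00.mp3", "scene-01.png"):
        (folder / name).touch()
    result = find_missing_assets(folder, scene_count=2)
    print(result)  # ['scene-01.mp3']
```

An empty list means every expected asset is present; anything else names the files to regenerate before assembly.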
Problem: Images different sizes in video
Solutions:
- Verify all images are 1080×1080
- Check image generation settings
- Script auto-pads, but prefer exact size
Limitations
Scene Count:
- Practical limit: 10-12 scenes (video length ~3 minutes)
- More scenes = longer generation time
- Audio/video file size considerations
Character Complexity:
- 1-3 main characters recommended
- Too many characters = harder consistency
- Background characters okay if not detailed
Visual Changes:
- Can't change style mid-story
- Character outfit changes require new character lock
- Major scene changes (day/night) may reduce consistency
Audio Length:
- Each scene 15-20 seconds ideal
- Very short scenes (<10s) feel rushed
- Very long scenes (>30s) slow pacing
Processing Time:
- Image generation: 30-60s per scene
- Audio generation: 10-20s per scene
- Video assembly: 30-60s total
- Total: ~10-15 minutes for 6-scene story
Related Skills
- image-generation - Required for visual generation
- elevenlabs - Required for voice narration
- python-plotting - For visualizing story analytics
- scientific-writing - For writing narrative scripts
Additional Resources
- Image Generation Skill: See image-generation/SKILL.md
- ElevenLabs Skill: See elevenlabs/SKILL.md
- Style Lock Reference: See references/style-locks.md
- Narrative Design: See references/narrative-design.md
- Video Assembly: See references/video-assembly.md
- Example Stories: See examples/example-stories.md