Unnamed Skill
Claude Code skill for image generation using Gemini 3 Pro Image API
$ Install
git clone https://github.com/ferdousbhai/media-generation ~/.claude/skills/media-generation
Tip: run this command in your terminal to install the skill.
SKILL.md
name: media-generation
description: Generate images, videos, and audio using Google's Gemini APIs. Use for image generation/editing (Gemini 3 Pro Image), video generation (Veo 3), and speech (TBD). Trigger words - images: generate, create, draw, design, make, edit, modify image/picture. Video: generate video, create video, animate, make a video. Supports text-to-image, image-to-image editing, text-to-video, and image-to-video.
Media Generation
Image Generation
uv run ~/.claude/skills/media-generation/scripts/generate_image.py \
--prompt "description or editing instructions" \
--filename "output.png" \
[--input-image "source.png"] \
[--resolution 1K|2K|4K]
Resolution
- 1K (default): also for "low res", "1080p"
- 2K: also for "medium", "2048"
- 4K: also for "high res", "hi-res", "ultra"
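The alias list above can be applied in code before building the command line. This is a minimal sketch, assuming a hypothetical helper named `normalize_resolution` (it is not part of the skill's scripts) and simple case-insensitive substring matching:

```python
# Hypothetical helper: maps a free-form user phrase to a --resolution value.
# The alias table mirrors the list above; the matching strategy is an assumption.
ALIASES = {
    "4K": ("4k", "high res", "hi-res", "ultra"),
    "2K": ("2k", "medium", "2048"),
    "1K": ("1k", "low res", "1080p"),
}


def normalize_resolution(phrase: str) -> str:
    """Return "1K", "2K", or "4K"; fall back to the 1K default."""
    text = phrase.lower()
    for resolution, aliases in ALIASES.items():
        if any(alias in text for alias in aliases):
            return resolution
    return "1K"
```

The returned string can be passed directly as the `--resolution` argument.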
Video Generation
uv run ~/.claude/skills/media-generation/scripts/generate_video.py \
--prompt "video description" \
--filename "output.mp4" \
[--model veo-3.0-generate-preview] \
[--negative "things to avoid"] \
[--input-image "first-frame.png"]
Models
- veo-3.0-generate-001 (default): stable, video only
- veo-3.0-fast-generate-001: faster, lower cost
- veo-3.1-generate-preview: supports video extend, audio sync
- veo-3.1-fast-generate-preview: fast, with extend support
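Choosing among these models comes down to two questions: whether the clip must be extendable, and whether speed and cost matter more than quality. A rough sketch of that decision, assuming a hypothetical helper `pick_model` that is not part of the skill's scripts:

```python
def pick_model(needs_extend: bool = False, fast: bool = False) -> str:
    """Pick a Veo model ID from the list above (hypothetical helper)."""
    if needs_extend:
        # Only the 3.1 previews are listed as supporting video extend.
        return "veo-3.1-fast-generate-preview" if fast else "veo-3.1-generate-preview"
    return "veo-3.0-fast-generate-001" if fast else "veo-3.0-generate-001"
```

The result can be passed as the `--model` argument of `generate_video.py`.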
Prompting Tips
- Specify camera movements: "slow zoom in", "pan left", "close-up"
- Add "no talking, no dialogue" if the character shouldn't speak
- Describe atmosphere: "rain outside", "purple mystical energy"
Note: Veo requires a paid tier. Pricing is roughly $0.40/sec for standard models and $0.15/sec for fast models.
Music Video from Image + Audio
Overview
- Start with character image + audio track (e.g., from Suno)
- Transcribe audio to get timestamps
- Generate clip 1 from image (veo-3.1)
- Extend each subsequent clip from previous (maintains continuity)
- Stitch clips + overlay audio with ffmpeg
Step 1: Transcribe audio for timing
whisper-ctranslate2 "song.mp3" --model large-v3 --output_dir /tmp --output_format srt
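The SRT file produced above carries the timestamps needed to decide which lyric lines fall into which clip. The sketch below parses standard SRT timing lines and groups cues into fixed-length windows. The function names are hypothetical, and the 8-second window is taken from the cost estimate later in this document, not from any API requirement.

```python
import re

# Matches SRT timing lines such as "00:00:01,500 --> 00:00:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt(text: str) -> list[tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, lyric) for each subtitle cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TIMING.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break
    return cues


def lyrics_per_clip(cues, clip_seconds: float = 8.0) -> dict[int, list[str]]:
    """Group lyric lines by the index of the clip in which they start."""
    clips: dict[int, list[str]] = {}
    for start, _end, lyric in cues:
        clips.setdefault(int(start // clip_seconds), []).append(lyric)
    return clips
```

The per-clip lyrics can then inform each clip's prompt, for example by describing the mood or action that matches the line being sung.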
Step 2: Generate first clip from image
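# Use veo-3.1 (required for extend feature)
import time
from google import genai
from google.genai import types
client = genai.Client()  # reads the API key from the environment
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    image=types.Image(image_bytes=img_data, mime_type="image/jpeg"),  # img_data: JPEG bytes
    prompt="character description, scene action, no talking",
)
while not operation.done:  # generation is a long-running operation; poll until done
    time.sleep(10)
    operation = client.operations.get(operation)
video1 = operation.result.generated_videos[0]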
Step 3: Extend from previous clip
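import time
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    video=previous_video.video,  # Pass previous video object
    prompt="next scene description, continuous action, no talking",
)
while not operation.done:  # wait for the long-running operation to finish
    time.sleep(10)
    operation = client.operations.get(operation)
previous_video = operation.result.generated_videos[0]  # input for the next extend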
Step 4: Stitch clips + add audio
# Create concat list
printf "file 'clip_01.mp4'\nfile 'clip_02.mp4'\n..." > concat.txt
# Stitch video clips
ffmpeg -f concat -safe 0 -i concat.txt -c copy combined.mp4
# Add audio track
ffmpeg -i combined.mp4 -i song.mp3 -c:v copy -c:a aac -map 0:v -map 1:a final.mp4
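With around 30 clips, writing the concat list by hand is tedious. The sketch below builds the list text and the two ffmpeg invocations shown above as argument lists. The function names are hypothetical, and it assumes clip file names contain no single quotes, since those would need escaping in the concat list.

```python
def concat_list_text(clips: list[str]) -> str:
    """Return the contents of an ffmpeg concat-demuxer list file."""
    return "".join(f"file '{clip}'\n" for clip in clips)


def stitch_commands(list_path: str, audio: str, output: str,
                    combined: str = "combined.mp4") -> list[list[str]]:
    """Build the stitch and audio-overlay commands as argument lists."""
    return [
        # Concatenate the clips without re-encoding.
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path,
         "-c", "copy", combined],
        # Keep the stitched video stream and take audio from the song.
        ["ffmpeg", "-i", combined, "-i", audio, "-c:v", "copy", "-c:a", "aac",
         "-map", "0:v", "-map", "1:a", output],
    ]
```

Write the text to `concat.txt`, then pass each command to `subprocess.run(cmd, check=True)` in order.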
Cost estimate
- ~8 sec per clip × $0.40/sec = $3.20/clip
- 4-min song ≈ 30 clips ≈ $96
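The arithmetic above generalizes to any song length and price tier. A small sketch, assuming fixed-length clips and the approximate per-second prices quoted in this document (the function name is hypothetical):

```python
import math


def estimate_cost(song_seconds: float, clip_seconds: float = 8.0,
                  price_per_second: float = 0.40) -> tuple[int, float]:
    """Return (clip_count, total_cost_usd) for covering a song with fixed-length clips."""
    # Round up: a partial final clip is still generated and billed as a full clip here.
    clips = math.ceil(song_seconds / clip_seconds)
    return clips, round(clips * clip_seconds * price_per_second, 2)
```

For the 4-minute example, `estimate_cost(240)` gives 30 clips and $96.00. Passing `price_per_second=0.15` for a fast model gives $36.00.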
Audio Generation
- Music: Use Suno (external service)
- Speech: Gemini 2.5 TTS (Flash or Pro); script TBD
API Key
Uses GEMINI_API_KEY env var, or pass --api-key KEY.
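A script that supports both options might resolve the key as sketched below. This is an illustration, not the skill's actual implementation: the helper name is hypothetical, and giving the `--api-key` flag precedence over the environment variable is an assumption.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def resolve_api_key(argv: Optional[Sequence[str]] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer --api-key when given, otherwise fall back to GEMINI_API_KEY."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--api-key", default=None)
    # Ignore the script's other flags (--prompt, --filename, ...).
    args, _unknown = parser.parse_known_args(argv)
    env = os.environ if env is None else env
    key = args.api_key or env.get("GEMINI_API_KEY")
    if not key:
        raise SystemExit("Set GEMINI_API_KEY or pass --api-key KEY")
    return key
```

Calling `resolve_api_key()` with no arguments reads `sys.argv` and the process environment.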
