
Unnamed Skill

Claude Code skill for image generation using Gemini 3 Pro Image API

$ Install

git clone https://github.com/ferdousbhai/media-generation ~/.claude/skills/media-generation

// tip: Run this command in your terminal to install the skill


name: media-generation
description: Generate images, videos, and audio using Google's Gemini APIs. Use for image generation/editing (Gemini 3 Pro Image), video generation (Veo 3), and speech (TBD). Trigger words - images: generate, create, draw, design, make, edit, modify image/picture. Video: generate video, create video, animate, make a video. Supports text-to-image, image-to-image editing, text-to-video, and image-to-video.

Media Generation

Image Generation

uv run ~/.claude/skills/media-generation/scripts/generate_image.py \
  --prompt "description or editing instructions" \
  --filename "output.png" \
  [--input-image "source.png"] \
  [--resolution 1K|2K|4K]

Resolution

  • 1K (default) — also for: "low res", "1080p"
  • 2K — also for: "medium", "2048"
  • 4K — also for: "high res", "hi-res", "ultra"
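The alias list above can be read as a lookup table. The following is a minimal sketch of that mapping, not the skill's own code: `RESOLUTION_ALIASES` and `normalize_resolution` are hypothetical names, and falling back to 1K for unknown phrases is an assumption based on 1K being the documented default.

```python
from typing import Optional

# Hypothetical alias table mirroring the list above (not taken from the skill's scripts).
RESOLUTION_ALIASES = {
    "1k": "1K", "low res": "1K", "1080p": "1K",
    "2k": "2K", "medium": "2K", "2048": "2K",
    "4k": "4K", "high res": "4K", "hi-res": "4K", "ultra": "4K",
}


def normalize_resolution(phrase: Optional[str]) -> str:
    """Return a --resolution value for a user phrase, defaulting to 1K."""
    if not phrase:
        return "1K"
    # Assumption: unrecognized phrases fall back to the default resolution.
    return RESOLUTION_ALIASES.get(phrase.strip().lower(), "1K")
```

For example, `normalize_resolution("Hi-Res")` yields `"4K"`, which would then be passed as `--resolution 4K`.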

Video Generation

uv run ~/.claude/skills/media-generation/scripts/generate_video.py \
  --prompt "video description" \
  --filename "output.mp4" \
  [--model veo-3.0-generate-001] \
  [--negative "things to avoid"] \
  [--input-image "first-frame.png"]

Models

  • veo-3.0-generate-001 (default) — stable, video only
  • veo-3.0-fast-generate-001 — faster, lower cost
  • veo-3.1-generate-preview — supports video extend, audio sync
  • veo-3.1-fast-generate-preview — fast with extend support
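One way to choose among the four models listed above is by two questions: is video extend needed, and is lower cost preferred? The sketch below encodes only what the list states. `pick_model` is a hypothetical helper, not part of generate_video.py.

```python
def pick_model(extend: bool = False, fast: bool = False) -> str:
    """Return a Veo model ID from the list above for the given requirements."""
    if extend:
        # Per the list above, only the 3.1 preview models support video extend.
        return "veo-3.1-fast-generate-preview" if fast else "veo-3.1-generate-preview"
    return "veo-3.0-fast-generate-001" if fast else "veo-3.0-generate-001"
```

The returned string can be passed to the `--model` flag.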

Prompting Tips

  • Specify camera movements: "slow zoom in", "pan left", "close-up"
  • Add "no talking, no dialogue" if the character shouldn't speak
  • Describe atmosphere: "rain outside", "purple mystical energy"

Note: Veo requires a paid tier. Pricing is roughly $0.40/sec for standard models and $0.15/sec for fast models.

Music Video from Image + Audio

Overview

  1. Start with character image + audio track (e.g., from Suno)
  2. Transcribe audio to get timestamps
  3. Generate clip 1 from image (veo-3.1)
  4. Extend each subsequent clip from previous (maintains continuity)
  5. Stitch clips + overlay audio with ffmpeg

Step 1: Transcribe audio for timing

whisper-ctranslate2 "song.mp3" --model large-v3 --output_dir /tmp --output_format srt
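The SRT file from the transcription step can be turned into per-clip timing data. The following is a rough sketch that assumes standard SRT cue blocks and fixed 8-second clips (the clip length used in the cost estimate). `parse_srt` and `cues_for_clip` are hypothetical helpers, not part of the skill.

```python
import re

# Matches an SRT timing line such as "00:00:12,500 --> 00:00:16,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    """Convert SRT time fields (strings) to seconds as a float."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Return a list of (start_sec, end_sec, caption) tuples from SRT text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                caption = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((_seconds(*g[:4]), _seconds(*g[4:]), caption))
                break
    return cues


def cues_for_clip(cues, clip_index, clip_seconds=8.0):
    """Return captions whose start time falls inside the given clip window."""
    start = clip_index * clip_seconds
    return [c for s, _, c in cues if start <= s < start + clip_seconds]
```

Calling `cues_for_clip(parse_srt(srt_text), 0)` returns the lyrics sung during the first clip, which can inform that clip's prompt. The SRT is expected in the `--output_dir` used above; its exact file name is not stated here.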

Step 2: Generate first clip from image

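# Use veo-3.1 (required for extend feature)
import os
import time
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
with open("character.jpg", "rb") as f:  # placeholder path for the character image
    img_data = f.read()

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    image=types.Image(image_bytes=img_data, mime_type="image/jpeg"),
    prompt="character description, scene action, no talking",
)
# generate_videos returns a long-running operation; poll until it finishes
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)
video1 = operation.result.generated_videos[0]

Step 3: Extend from previous clip

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    video=video1.video,  # Pass the previous clip's video object
    prompt="next scene description, continuous action, no talking",
)
while not operation.done:  # same polling loop as in Step 2
    time.sleep(10)
    operation = client.operations.get(operation)
video2 = operation.result.generated_videos[0]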

Step 4: Stitch clips + add audio

# Create concat list
printf "file 'clip_01.mp4'\nfile 'clip_02.mp4'\n..." > concat.txt

# Stitch video clips
ffmpeg -f concat -safe 0 -i concat.txt -c copy combined.mp4

# Add audio track
ffmpeg -i combined.mp4 -i song.mp3 -c:v copy -c:a aac -map 0:v -map 1:a final.mp4
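With around 30 clips, writing the concat list by hand gets tedious. The sketch below builds the list and the same two ffmpeg invocations shown above from Python. It assumes zero-padded clip names (so sorting gives playback order) and paths without single quotes. `concat_list`, `stitch_commands`, and `stitch` are hypothetical helpers, not part of the skill.

```python
import subprocess
from pathlib import Path


def concat_list(clips):
    """Return the text of an ffmpeg concat-demuxer list for the given clip paths."""
    return "".join(f"file '{clip}'\n" for clip in clips)


def stitch_commands(list_path, audio, combined="combined.mp4", final="final.mp4"):
    """Return the two ffmpeg argument lists used above: concat, then add audio."""
    return [
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(list_path),
         "-c", "copy", combined],
        ["ffmpeg", "-i", combined, "-i", audio, "-c:v", "copy", "-c:a", "aac",
         "-map", "0:v", "-map", "1:a", final],
    ]


def stitch(clip_dir=".", audio="song.mp3", list_path="concat.txt"):
    """Write the concat list for clip_*.mp4 in clip_dir and run both ffmpeg steps."""
    clips = sorted(p.name for p in Path(clip_dir).glob("clip_*.mp4"))
    Path(clip_dir, list_path).write_text(concat_list(clips))
    for cmd in stitch_commands(list_path, audio):
        # check=True stops at the first ffmpeg failure instead of continuing.
        subprocess.run(cmd, check=True, cwd=clip_dir)
```

Calling `stitch()` in the directory that holds the clips and `song.mp3` would produce `final.mp4`, provided ffmpeg is on the PATH.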

Cost estimate

  • ~8 sec per clip × $0.40/sec = $3.20/clip
  • 4-min song ≈ 30 clips ≈ $96
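The arithmetic above generalizes to other song lengths and price tiers. This is a small sketch using the approximate prices quoted in this document. `estimate_cost` is a hypothetical helper, and rounding the clip count up is an assumption for songs that do not divide evenly into clips.

```python
import math


def estimate_cost(song_seconds, clip_seconds=8, price_per_second=0.40):
    """Return (clip_count, total_cost_usd) for covering a song with fixed-length clips."""
    # Assumption: a partial final clip is billed as a full clip.
    clips = math.ceil(song_seconds / clip_seconds)
    return clips, round(clips * clip_seconds * price_per_second, 2)
```

`estimate_cost(240)` reproduces the 30-clip, $96 figure. Passing `price_per_second=0.15` gives the estimate for the fast models.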

Audio Generation

  • Music: Use Suno (external service)
  • Speech: Gemini 2.5 TTS (Flash or Pro); script TBD

API Key

Uses GEMINI_API_KEY env var, or pass --api-key KEY.
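A script could resolve the key along the following lines. This is a sketch, not the skill's actual implementation: `resolve_api_key` is a hypothetical name, and letting `--api-key` take precedence over the environment variable is an assumption.

```python
import argparse
import os


def resolve_api_key(argv=None, env=None):
    """Return the key from --api-key if given, else from GEMINI_API_KEY."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key", default=None)
    # parse_known_args ignores the script's other flags (--prompt, --filename, ...).
    args, _ = parser.parse_known_args(argv)
    key = args.api_key or env.get("GEMINI_API_KEY")
    if not key:
        raise SystemExit("Set GEMINI_API_KEY or pass --api-key KEY")
    return key
```

Exporting `GEMINI_API_KEY` once in the shell profile avoids putting the key on the command line, where it can end up in shell history.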