Marketplace

cloudflare-workers-ai

Run LLMs and AI models on Cloudflare's global GPU network with Workers AI. Includes Llama 4, Gemma 3, Mistral 3.1, Flux image generation, BGE embeddings (2x faster, 2025), streaming support, and AI Gateway for cost tracking. Use when: implementing LLM inference, generating images, building RAG with embeddings, streaming AI responses, using AI Gateway, troubleshooting max_tokens defaults (breaking change 2025), BGE pooling parameter (not backwards compatible), or handling AI_ERROR, rate limits, model deprecations, token limits. Keywords: workers ai, cloudflare ai, ai bindings, llm workers, @cf/meta/llama-4-scout, @cf/google/gemma-3-12b-it, @cf/mistralai/mistral-small-3.1-24b-instruct, @cf/openai/gpt-oss-120b, workers ai models, ai inference, cloudflare llm, ai streaming, text generation ai, ai embeddings, bge pooling cls mean, image generation ai, workers ai rag, ai gateway, llama workers, flux image generation, deepgram aura, leonardo image generation, vision models ai, ai chat completion, AI_ERROR, rate limit ai, model not found, max_tokens breaking change, bge pooling backwards compatibility, model deprecations october 2025, token limit exceeded, neurons exceeded, workers ai hono, ai gateway workers, vercel ai sdk workers, openai compatible workers, workers ai vectorize, workers-ai-provider v2, ai sdk v5, lora adapters rank 32

$ Install

git clone https://github.com/jezweb/claude-skills /tmp/claude-skills && cp -r /tmp/claude-skills/skills/cloudflare-workers-ai ~/.claude/skills/cloudflare-workers-ai

// tip: Run this command in your terminal to install the skill


Cloudflare Workers AI

Status: Production Ready ✅
Last Updated: 2025-11-25
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.50.0, @cloudflare/workers-types@4.20251125.0

Recent Updates (2025):

  • April 2025 - Performance: Llama 3.3 70B 2-4x faster (speculative decoding, prefix caching), BGE embeddings 2x faster
  • April 2025 - Breaking Changes: max_tokens is now enforced with a default of 256 (previously the parameter was silently ignored); BGE models gained a pooling parameter (cls output is NOT backwards compatible with mean)
  • 2025 - New Models (14): Mistral 3.1 24B (vision+tools), Gemma 3 12B (128K context), EmbeddingGemma 300M, Llama 4 Scout, GPT-OSS 120B/20B, Qwen models (QwQ 32B, Coder 32B), Leonardo image gen, Deepgram Aura 2, Whisper v3 Turbo, IBM Granite, Nova 3
  • 2025 - Platform: Context windows API change (tokens not chars), unit-based pricing with per-model granularity, workers-ai-provider v2.0.0 (AI SDK v5), LoRA rank up to 32 (was 8), 100 adapters per account
  • October 2025: Model deprecations (use Llama 4, GPT-OSS instead)

Quick Start (5 Minutes)

// 1. Add AI binding to wrangler.jsonc
{ "ai": { "binding": "AI" } }

// 2. Run model with streaming (recommended)
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: 'Tell me a story' }],
      stream: true, // Always stream for text generation!
    });

    return new Response(stream, {
      headers: { 'content-type': 'text/event-stream' },
    });
  },
};

Why streaming? Prevents buffering in memory, faster time-to-first-token, avoids Worker timeout issues.
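On the consuming side, each streamed chunk arrives as server-sent-event lines. A minimal parsing sketch, assuming the `data: {"response":"..."}` event shape with a final `data: [DONE]` sentinel (verify against the model you deploy):

```typescript
// Sketch: extract generated tokens from Workers AI SSE lines.
// Assumes the event shape `data: {"response":"..."}` with a final
// `data: [DONE]` sentinel.
function parseSSELine(line: string): string | null {
  if (!line.startsWith('data: ')) return null; // skip comments/keep-alives
  const payload = line.slice('data: '.length).trim();
  if (payload === '[DONE]') return null; // end-of-stream sentinel
  try {
    const event = JSON.parse(payload) as { response?: string };
    return event.response ?? null;
  } catch {
    return null; // partial or malformed line; wait for more bytes
  }
}
```

Split the decoded stream on newlines and feed each line through this helper to rebuild the full response incrementally.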


API Reference

env.AI.run(
  model: string,
  inputs: ModelInputs,
  options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>

Model Selection Guide (Updated 2025)

Text Generation (LLMs)

| Model | Best For | Rate Limit | Size | Notes |
| --- | --- | --- | --- | --- |
| 2025 Models | | | | |
| @cf/meta/llama-4-scout-17b-16e-instruct | Latest Llama, general purpose | 300/min | 17B | NEW 2025 |
| @cf/openai/gpt-oss-120b | Largest open-source GPT | 300/min | 120B | NEW 2025 |
| @cf/openai/gpt-oss-20b | Smaller open-source GPT | 300/min | 20B | NEW 2025 |
| @cf/google/gemma-3-12b-it | 128K context, 140+ languages | 300/min | 12B | NEW 2025, vision |
| @cf/mistralai/mistral-small-3.1-24b-instruct | Vision + tool calling | 300/min | 24B | NEW 2025 |
| @cf/qwen/qwq-32b | Reasoning, complex tasks | 300/min | 32B | NEW 2025 |
| @cf/qwen/qwen2.5-coder-32b-instruct | Coding specialist | 300/min | 32B | NEW 2025 |
| @cf/qwen/qwen3-30b-a3b-fp8 | Fast quantized | 300/min | 30B | NEW 2025 |
| @cf/ibm-granite/granite-4.0-h-micro | Small, efficient | 300/min | Micro | NEW 2025 |
| Performance (2025) | | | | |
| @cf/meta/llama-3.3-70b-instruct-fp8-fast | 2-4x faster (2025 update) | 300/min | 70B | Speculative decoding |
| @cf/meta/llama-3.1-8b-instruct-fp8-fast | Fast 8B variant | 300/min | 8B | - |
| Standard Models | | | | |
| @cf/meta/llama-3.1-8b-instruct | General purpose | 300/min | 8B | - |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B | - |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical | 300/min | 32B | - |

Text Embeddings (2x Faster - 2025)

| Model | Dimensions | Best For | Rate Limit | Notes |
| --- | --- | --- | --- | --- |
| @cf/google/embeddinggemma-300m | 768 | Best-in-class RAG | 3000/min | NEW 2025 |
| @cf/baai/bge-base-en-v1.5 | 768 | General RAG (2x faster) | 3000/min | pooling: "cls" recommended |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy (2x faster) | 1500/min | pooling: "cls" recommended |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage (2x faster) | 3000/min | pooling: "cls" recommended |
| @cf/qwen/qwen3-embedding-0.6b | 768 | Qwen embeddings | 3000/min | NEW 2025 |

CRITICAL (2025): BGE models now accept a pooling: "cls" parameter (recommended), but cls vectors are NOT backwards compatible with the default pooling: "mean". Never mix the two in one index; re-embed existing vectors if you switch.
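A sketch of pinning the pooling mode (the `embed` helper and its loose `ai` typing are illustrative, not the official binding types), plus the cosine-similarity math that silently degrades when cls and mean vectors are compared:

```typescript
// Illustrative response shape for BGE embedding calls.
interface EmbeddingResponse { shape: number[]; data: number[][] }

// Always pass pooling explicitly so every vector in an index is produced
// the same way (sketch; `ai` stands in for the AI binding).
async function embed(
  ai: { run(model: string, inputs: unknown): Promise<unknown> },
  texts: string[],
): Promise<EmbeddingResponse> {
  return (await ai.run('@cf/baai/bge-base-en-v1.5', {
    text: texts,
    pooling: 'cls', // pin it: cls vectors are not comparable with mean vectors
  })) as EmbeddingResponse;
}

// Cosine similarity between two vectors. Scoring a cls vector against a
// mean vector with this yields meaningless numbers, hence "pin the pooling".
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```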

Image Generation

| Model | Best For | Rate Limit | Notes |
| --- | --- | --- | --- |
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | - |
| @cf/leonardo/lucid-origin | Leonardo AI style | 720/min | NEW 2025 |
| @cf/leonardo/phoenix-1.0 | Leonardo AI variant | 720/min | NEW 2025 |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | - |

Vision Models

| Model | Best For | Rate Limit | Notes |
| --- | --- | --- | --- |
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min | - |
| @cf/google/gemma-3-12b-it | Vision + text (128K context) | 300/min | NEW 2025 |

Audio Models (2025)

| Model | Type | Rate Limit | Notes |
| --- | --- | --- | --- |
| @cf/deepgram/aura-2-en | Text-to-speech (English) | 720/min | NEW 2025 |
| @cf/deepgram/aura-2-es | Text-to-speech (Spanish) | 720/min | NEW 2025 |
| @cf/deepgram/nova-3 | Speech-to-text (+ WebSocket) | 720/min | NEW 2025 |
| @cf/openai/whisper-large-v3-turbo | Speech-to-text (faster) | 720/min | NEW 2025 |

Common Patterns

RAG (Retrieval Augmented Generation)

// 1. Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });

// 2. Search Vectorize
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3, returnMetadata: 'all' });
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');

// 3. Generate with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: `Answer using this context:\n${context}` },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});
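Step 1 assumes your documents were already chunked, embedded, and stored in Vectorize. A naive fixed-size chunker with overlap for that ingestion side (sizes are illustrative; BGE-base truncates inputs around 512 tokens, so tune per model):

```typescript
// Naive fixed-size character chunker with overlap for RAG ingestion.
// Sizes are illustrative defaults, not recommendations from the docs.
function chunkText(text: string, size = 400, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Feed the resulting chunks through the embeddings call from step 1 in batches, storing each vector with its chunk text as metadata so step 2 can retrieve it.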

Structured Output with Zod

import { z } from 'zod';

const Schema = z.object({ name: z.string(), items: z.array(z.string()) });

const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{
    role: 'user',
    content: 'Generate a JSON object with "name" (string) and "items" (array of strings). Respond with only the JSON.'
  }],
});

const validated = Schema.parse(JSON.parse(response.response));
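In practice response.response often arrives wrapped in prose or code fences rather than as bare JSON, so the JSON.parse above can throw. A tolerant extractor to run first (illustrative helper; prefer a JSON-mode or response_format option where the model supports one):

```typescript
// Pull the first {...} object out of free-form model output before parsing.
// Illustrative: handles prose and ``` fences around the JSON, not nested
// pathological cases.
function extractJSON(text: string): unknown {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end === -1 || end < start) {
    throw new Error('no JSON object found in model output');
  }
  return JSON.parse(text.slice(start, end + 1));
}
```

Then validate with `Schema.parse(extractJSON(response.response))` and retry the model call on failure.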

AI Gateway Integration

Provides caching, logging, cost tracking, and analytics for AI requests.

const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { prompt: 'Hello' },
  { gateway: { id: 'my-gateway', skipCache: false } }
);

// Access logs and send feedback
const gateway = env.AI.gateway('my-gateway');
await gateway.patchLog(env.AI.aiGatewayLogId, {
  feedback: { rating: 1, comment: 'Great response' },
});

Benefits: Cost tracking, caching (reduces duplicate inference), logging, rate limiting, analytics.


Rate Limits & Pricing (Updated 2025)

Rate Limits (per minute)

| Task Type | Default Limit | Notes |
| --- | --- | --- |
| Text Generation | 300/min | Some fast models: 400-1500/min |
| Text Embeddings | 3000/min | BGE-large: 1500/min |
| Image Generation | 720/min | All image models |
| Vision Models | 720/min | Image understanding |
| Audio (TTS/STT) | 720/min | Deepgram, Whisper |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |

Pricing (Unit-Based, Billed in Neurons - 2025)

Free Tier:

  • 10,000 neurons per day
  • Resets daily at 00:00 UTC

Paid Tier ($0.011 per 1,000 neurons):

  • 10,000 neurons/day included
  • Unlimited usage above free allocation
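The free-allocation and overage arithmetic above can be sketched as (helper name is illustrative):

```typescript
// Estimated daily bill under 2025 unit pricing: the first 10,000 neurons
// per day are included; overage is billed at $0.011 per 1,000 neurons.
function dailyCostUSD(neuronsUsed: number): number {
  const included = 10_000;
  const overage = Math.max(0, neuronsUsed - included);
  return (overage / 1_000) * 0.011;
}
```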

2025 Model Costs (per 1M tokens unless noted):

| Model | Input | Output | Notes |
| --- | --- | --- | --- |
| 2025 Models | | | |
| Llama 4 Scout 17B | $0.270 | $0.850 | NEW 2025 |
| GPT-OSS 120B | $0.350 | $0.750 | NEW 2025 |
| GPT-OSS 20B | $0.200 | $0.300 | NEW 2025 |
| Gemma 3 12B | $0.345 | $0.556 | NEW 2025 |
| Mistral 3.1 24B | $0.351 | $0.555 | NEW 2025 |
| Qwen QwQ 32B | $0.660 | $1.000 | NEW 2025 |
| Qwen Coder 32B | $0.660 | $1.000 | NEW 2025 |
| IBM Granite Micro | $0.017 | $0.112 | NEW 2025 |
| EmbeddingGemma 300M | $0.012 | N/A | NEW 2025 |
| Qwen3 Embedding 0.6B | $0.012 | N/A | NEW 2025 |
| Performance (2025) | | | |
| Llama 3.3 70B Fast | $0.293 | $2.253 | 2-4x faster |
| Llama 3.1 8B FP8 Fast | $0.045 | $0.384 | Fast variant |
| Standard Models | | | |
| Llama 3.2 1B | $0.027 | $0.201 | - |
| Llama 3.1 8B | $0.282 | $0.827 | - |
| Deepseek R1 32B | $0.497 | $4.881 | - |
| BGE-base (2x faster) | $0.067 | N/A | 2025 speedup |
| BGE-large (2x faster) | $0.204 | N/A | 2025 speedup |
| Image Models (2025) | | | |
| Flux 1 Schnell | $0.0000528 per 512x512 tile | N/A | - |
| Leonardo Lucid | $0.006996 per 512x512 tile | N/A | NEW 2025 |
| Leonardo Phoenix | $0.005830 per 512x512 tile | N/A | NEW 2025 |
| Audio Models (2025) | | | |
| Deepgram Aura 2 | $0.030 per 1k chars | N/A | NEW 2025 |
| Deepgram Nova 3 | $0.0052 per audio min | N/A | NEW 2025 |
| Whisper v3 Turbo | $0.0005 per audio min | N/A | NEW 2025 |
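To turn the per-1M-token rates above into a per-request estimate (helper name is illustrative; the rates shown are the Llama 3.1 8B row):

```typescript
// Convert per-1M-token rates from the pricing table into a request cost.
function tokenCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPerMillion: number,
  outputPerMillion: number,
): number {
  return (inputTokens / 1e6) * inputPerMillion + (outputTokens / 1e6) * outputPerMillion;
}

// A 1,000-token prompt with a 500-token reply on Llama 3.1 8B
// ($0.282 in / $0.827 out) costs well under a tenth of a cent.
const estimate = tokenCostUSD(1_000, 500, 0.282, 0.827);
```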

Error Handling with Retry

async function runAIWithRetry(
  env: Env,
  model: string,
  inputs: any,
  maxRetries = 3
): Promise<any> {
  let lastError: Error | undefined;

  for (let i = 0; i < maxRetries; i++) {
    try {
      return await env.AI.run(model, inputs);
    } catch (error) {
      lastError = error as Error;

      // Rate limit - retry with exponential backoff
      if (lastError.message.toLowerCase().includes('rate limit')) {
        await new Promise((resolve) => setTimeout(resolve, Math.pow(2, i) * 1000));
        continue;
      }

      throw error; // Other errors - fail immediately
    }
  }

  throw lastError!;
}

OpenAI Compatibility

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/ai/v1`,
});

// Chat completions
await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

Endpoints: /v1/chat/completions, /v1/embeddings
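The base URL is account-scoped; a small sketch for building it, plus an embeddings call against the /v1/embeddings endpoint (the `client` parameter stands in for an OpenAI SDK instance configured with that base URL):

```typescript
// Build the account-scoped OpenAI-compatible base URL shown above.
function workersAIBaseURL(accountId: string): string {
  return `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1`;
}

// Embeddings via the /v1/embeddings endpoint (sketch; `client` is a
// configured OpenAI SDK instance, typed loosely to stay self-contained).
async function embedViaOpenAI(client: {
  embeddings: { create(body: object): Promise<unknown> };
}): Promise<unknown> {
  return client.embeddings.create({
    model: '@cf/baai/bge-base-en-v1.5',
    input: 'Hello world',
  });
}
```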


Vercel AI SDK Integration (workers-ai-provider v2.0.0)

import { createWorkersAI } from 'workers-ai-provider'; // v2.0.0 with AI SDK v5
import { generateText, streamText } from 'ai';

const workersai = createWorkersAI({ binding: env.AI });

// Generate or stream
await generateText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Write a poem',
});

References