Cloudflare Workers AI
Cloudflare Workers AI for serverless GPU inference. Use for LLMs, text/image generation, embeddings, or encountering AI_ERROR, rate limits, token exceeded errors.
Installer
git clone https://github.com/secondsky/claude-skills /tmp/claude-skills && cp -r /tmp/claude-skills/plugins/cloudflare-workers-ai/skills/cloudflare-workers-ai ~/.claude/skills/claude-skills/
Tip: Run this command in your terminal to install the skill.
name: cloudflare-workers-ai
description: Cloudflare Workers AI for serverless GPU inference. Use for LLMs, text/image generation, embeddings, or encountering AI_ERROR, rate limits, token exceeded errors.
Keywords: workers ai, cloudflare ai, ai bindings, llm workers, @cf/meta/llama, workers ai models, ai inference, cloudflare llm, ai streaming, text generation ai, ai embeddings, image generation ai, workers ai rag, ai gateway, llama workers, flux image generation, stable diffusion workers, vision models ai, ai chat completion, AI_ERROR, rate limit ai, model not found, token limit exceeded, neurons exceeded, ai quota exceeded, streaming failed, model unavailable, workers ai hono, ai gateway workers, vercel ai sdk workers, openai compatible workers, workers ai vectorize
license: MIT
Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
Status: Production Ready ✅
Last Updated: 2025-11-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0
Table of Contents
- Quick Start (5 minutes)
- Workers AI API Reference
- Model Selection Guide
- Common Patterns
- AI Gateway Integration
- Rate Limits & Pricing
- Production Checklist
Quick Start (5 minutes)
1. Add AI Binding
wrangler.jsonc:
{
"ai": {
"binding": "AI"
}
}
2. Run Your First Model
export interface Env {
AI: Ai;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
prompt: 'What is Cloudflare?',
});
return Response.json(response);
},
};
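You can test this locally with npx wrangler dev, but note that Workers AI requests made during local development still execute against Cloudflare's hosted models and count toward your usage.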
3. Add Streaming (Recommended)
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true, // Always use streaming for text generation!
});
return new Response(stream, {
headers: { 'content-type': 'text/event-stream' },
});
Why streaming?
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
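On the client, the SSE stream can be read incrementally as it arrives. A minimal browser-side sketch, assuming the Worker above is deployed (the URL is a placeholder):
// Client-side consumption sketch; the URL below is a placeholder for your deployed Worker.
const res = await fetch('https://my-worker.example.workers.dev');
const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Workers AI emits SSE lines such as: data: {"response":"..."}
  console.log(decoder.decode(value, { stream: true }));
}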
Workers AI API Reference
Core API: env.AI.run()
const response = await env.AI.run(model, inputs, options?);
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID (e.g., @cf/meta/llama-3.1-8b-instruct) |
| inputs | object | Model-specific inputs (see model type below) |
| options.gateway.id | string | AI Gateway ID for caching/logging |
| options.gateway.skipCache | boolean | Skip the AI Gateway cache |
Returns: Promise<ModelOutput> (non-streaming) or ReadableStream (streaming)
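The stream flag determines which of the two return shapes you get. A brief sketch, assuming the text-generation output format from the table below:
// Non-streaming: resolves to a model-specific result object
const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hi' });
// result.response holds the generated text for text-generation models

// Streaming: resolves to a ReadableStream you can return directly
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hi', stream: true });
return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });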
Input Types by Model Category
| Category | Key Inputs | Output |
|---|---|---|
| Text Generation | messages[], stream, max_tokens, temperature | { response: string } |
| Embeddings | text: string \| string[] | { data: number[][], shape: number[] } |
| Image Generation | prompt, num_steps, guidance | Binary PNG |
| Vision | messages[].content[].image_url | { response: string } |
📄 Full model details: Load references/models-catalog.md for complete model list, parameters, and rate limits.
Model Selection Guide
Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | General purpose, fast | 300/min | 8B |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B |
| @cf/qwen/qwen1.5-14b-chat-awq | High quality, complex reasoning | 150/min | 14B |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical content | 300/min | 32B |
| @hf/thebloke/mistral-7b-instruct-v0.1-awq | Fast, efficient | 400/min | 7B |
Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|---|---|---|---|
| @cf/baai/bge-base-en-v1.5 | 768 | General purpose RAG | 3000/min |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy search | 1500/min |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage | 3000/min |
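A quick sanity check of the embedding output, following the { data, shape } format from the API reference above:
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Cloudflare Workers run at the edge.'],
});
// data is number[][]; bge-base-en-v1.5 produces 768-dimensional vectors
console.log(embeddings.shape); // e.g. [1, 768]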
Image Generation
| Model | Best For | Rate Limit | Speed |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | Fast |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Medium |
| @cf/lykon/dreamshaper-8-lcm | Artistic, stylized | 720/min | Fast |
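Image models return binary image bytes rather than JSON (see Binary PNG in the API table above). A minimal sketch; note that some models instead return base64 JSON, so confirm the output format per model in references/models-catalog.md:
const image = await env.AI.run('@cf/stabilityai/stable-diffusion-xl-base-1.0', {
  prompt: 'A lighthouse at sunset, photorealistic',
  num_steps: 20,
});
// Serve the raw bytes with an image content type
return new Response(image, { headers: { 'content-type': 'image/png' } });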
Vision Models
| Model | Best For | Rate Limit |
|---|---|---|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min |
| @cf/unum/uform-gen2-qwen-500m | Fast image captioning | 720/min |
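A hedged sketch of a vision call, following the messages[].content[].image_url input shape from the API table above; the exact content schema can vary by model, and the image URL here is a placeholder:
const caption = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image in one sentence.' },
        { type: 'image_url', image_url: { url: 'https://example.com/photo.png' } },
      ],
    },
  ],
});
return Response.json(caption); // { response: string }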
Common Patterns
Pattern 1: Chat with Streaming
app.post('/chat', async (c) => {
const { messages } = await c.req.json<{ messages: Array<{ role: string; content: string }> }>();
const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages, stream: true });
return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
});
Pattern 2: RAG (Retrieval Augmented Generation)
// 1. Generate embedding for query
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });
// 2. Search Vectorize
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3 });
// 3. Build context
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');
// 4. Generate with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: `Answer using this context:\n${context}` },
{ role: 'user', content: userQuery },
],
stream: true,
});
return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
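This pattern assumes your documents are already indexed. A minimal indexing sketch, assuming a VECTORIZE binding whose index dimension matches the embedding model (768 for bge-base-en-v1.5); the ids and metadata are illustrative:
// Embed documents and upsert them into Vectorize for later retrieval
const docs = ['Workers AI runs models on Cloudflare-managed GPUs.'];
const embedded = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: docs });
await env.VECTORIZE.upsert(
  docs.map((text, i) => ({ id: `doc-${i}`, values: embedded.data[i], metadata: { text } }))
);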
📄 More patterns: Load references/best-practices.md for structured output, image generation, multi-model consensus, and production patterns.
AI Gateway Integration
Enable caching, logging, and cost tracking with AI Gateway:
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' }, {
gateway: { id: 'my-gateway', skipCache: false },
});
Benefits: Cost tracking, response caching (50-90% savings on repeated queries), request logging, rate limiting, analytics.
Rate Limits & Pricing
Information last verified: 2025-01-14
Rate limits and pricing vary significantly by model. Always check the official documentation for the most current information:
- Rate Limits: https://developers.cloudflare.com/workers-ai/platform/limits/
- Pricing: https://developers.cloudflare.com/workers-ai/platform/pricing/
Free Tier: 10,000 neurons/day
Paid Tier: $0.011 per 1,000 neurons
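As a rough worked example: a day that consumes 110,000 neurons would use the 10,000 free neurons first, leaving 100,000 billed neurons, or 100 × $0.011 = $1.10 (illustrative arithmetic; confirm current rates at the pricing link above).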
📄 Per-model details: See references/models-catalog.md for specific rate limits and pricing for each model.
Production Checklist
Essential before deploying:
- Enable AI Gateway for cost tracking
- Implement streaming for text generation
- Add rate limit retry with exponential backoff (see the sketch below)
- Validate input length (prevent token limit errors)
- Add input sanitization (prevent prompt injection)
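A minimal backoff sketch for the retry item above; runWithRetry is a hypothetical helper, not part of the Workers AI API:
async function runWithRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let delay = 500;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // In production, inspect the error and retry only on rate-limit failures
      if (attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay *= 2; // exponential backoff: 500 ms, 1 s, 2 s, ...
    }
  }
}

// Usage:
const response = await runWithRetry(() =>
  env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' })
);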
📄 Full checklist: Load references/best-practices.md for complete production checklist, error handling patterns, monitoring, and cost optimization.
External SDK Integrations
Workers AI supports OpenAI SDK compatibility and Vercel AI SDK:
import OpenAI from 'openai';
import { createWorkersAI } from 'workers-ai-provider';

// OpenAI SDK - point the client at the Workers AI OpenAI-compatible endpoint
const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

// Vercel AI SDK - native integration via the AI binding
const workersai = createWorkersAI({ binding: env.AI });
📄 Full integration guide: Load references/integrations.md for OpenAI SDK, Vercel AI SDK, and REST API examples.
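For the Vercel AI SDK path, a hedged end-to-end sketch using streamText (assuming the ai and workers-ai-provider packages; exact API shapes differ between AI SDK versions):
import { streamText } from 'ai';
import { createWorkersAI } from 'workers-ai-provider';

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const workersai = createWorkersAI({ binding: env.AI });
    const result = streamText({
      model: workersai('@cf/meta/llama-3.1-8b-instruct'),
      prompt: 'Write a haiku about edge computing.',
    });
    // Stream the generated text back to the caller
    return result.toTextStreamResponse();
  },
};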
Limits Summary
| Feature | Limit |
|---|---|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers CPU limits apply (30s default, configurable up to 5 min) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
When to Load References
| Reference File | Load When... |
|---|---|
| references/models-catalog.md | Choosing a model, checking rate limits, comparing model capabilities |
| references/best-practices.md | Production deployment, error handling, cost optimization, security |
| references/integrations.md | Using OpenAI SDK, Vercel AI SDK, or REST API instead of native binding |