Cloudflare Workers AI
Cloudflare Workers AI for serverless GPU inference. Use for LLMs, text/image generation, embeddings, or encountering AI_ERROR, rate limits, token exceeded errors.
Installer
git clone https://github.com/secondsky/claude-skills /tmp/claude-skills && cp -r /tmp/claude-skills/plugins/cloudflare-workers-ai/skills/cloudflare-workers-ai ~/.claude/skills/claude-skills/
Tip: Run this command in your terminal to install the skill.
name: cloudflare-workers-ai
description: Cloudflare Workers AI for serverless GPU inference. Use for LLMs, text/image generation, embeddings, or encountering AI_ERROR, rate limits, token exceeded errors.
Keywords: workers ai, cloudflare ai, ai bindings, llm workers, @cf/meta/llama, workers ai models, ai inference, cloudflare llm, ai streaming, text generation ai, ai embeddings, image generation ai, workers ai rag, ai gateway, llama workers, flux image generation, stable diffusion workers, vision models ai, ai chat completion, AI_ERROR, rate limit ai, model not found, token limit exceeded, neurons exceeded, ai quota exceeded, streaming failed, model unavailable, workers ai hono, ai gateway workers, vercel ai sdk workers, openai compatible workers, workers ai vectorize
license: MIT
Cloudflare Workers AI - Complete Reference
Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.
Status: Production Ready ✅
Last Updated: 2025-11-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0
Table of Contents
- Quick Start (5 minutes)
- Workers AI API Reference
- Model Selection Guide
- Common Patterns
- AI Gateway Integration
- Rate Limits & Pricing
- Production Checklist
Quick Start (5 minutes)
1. Add AI Binding
wrangler.jsonc:
{
"ai": {
"binding": "AI"
}
}
2. Run Your First Model
export interface Env {
AI: Ai;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
prompt: 'What is Cloudflare?',
});
return Response.json(response);
},
};
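You can test this locally with npx wrangler dev, but note that Workers AI requests made during local development still execute against Cloudflare's hosted models and count toward your usage.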
3. Add Streaming (Recommended)
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [{ role: 'user', content: 'Tell me a story' }],
stream: true, // Always use streaming for text generation!
});
return new Response(stream, {
headers: { 'content-type': 'text/event-stream' },
});
Why streaming?
- Prevents buffering large responses in memory
- Faster time-to-first-token
- Better user experience for long-form content
- Avoids Worker timeout issues
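On the client, the SSE stream can be read incrementally as it arrives. A minimal browser-side sketch, assuming the Worker above is deployed (the URL is a placeholder):
// Client-side consumption sketch; the URL below is a placeholder for your deployed Worker.
const res = await fetch('https://my-worker.example.workers.dev');
const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Workers AI emits SSE lines such as: data: {"response":"..."}
  console.log(decoder.decode(value, { stream: true }));
}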
Workers AI API Reference
Core API: env.AI.run()
const response = await env.AI.run(model, inputs, options?);
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID (e.g., @cf/meta/llama-3.1-8b-instruct) |
| inputs | object | Model-specific inputs (see model type below) |
| options.gateway.id | string | AI Gateway ID for caching/logging |
| options.gateway.skipCache | boolean | Skip the AI Gateway cache |
Returns: Promise<ModelOutput> (non-streaming) or ReadableStream (streaming)
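The stream flag determines which of the two return shapes you get. A brief sketch, assuming the text-generation output format from the table below:
// Non-streaming: resolves to a model-specific result object
const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hi' });
// result.response holds the generated text for text-generation models

// Streaming: resolves to a ReadableStream you can return directly
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hi', stream: true });
return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });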
Input Types by Model Category
| Category | Key Inputs | Output |
|---|---|---|
| Text Generation | messages[], stream, max_tokens, temperature | { response: string } |
| Embeddings | text: string \| string[] | { data: number[][], shape: number[] } |
| Image Generation | prompt, num_steps, guidance | Binary PNG |
| Vision | messages[].content[].image_url | { response: string } |
📄 Full model details: Load references/models-catalog.md for complete model list, parameters, and rate limits.
Model Selection Guide
Text Generation (LLMs)
| Model | Best For | Rate Limit | Size |
|---|---|---|---|
| @cf/meta/llama-3.1-8b-instruct | General purpose, fast | 300/min | 8B |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B |
| @cf/qwen/qwen1.5-14b-chat-awq | High quality, complex reasoning | 150/min | 14B |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical content | 300/min | 32B |
| @hf/thebloke/mistral-7b-instruct-v0.1-awq | Fast, efficient | 400/min | 7B |
Text Embeddings
| Model | Dimensions | Best For | Rate Limit |
|---|---|---|---|
| @cf/baai/bge-base-en-v1.5 | 768 | General purpose RAG | 3000/min |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy search | 1500/min |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage | 3000/min |
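A quick sanity check of the embedding output, following the { data, shape } format from the API reference above:
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Cloudflare Workers run at the edge.'],
});
// data is number[][]; bge-base-en-v1.5 produces 768-dimensional vectors
console.log(embeddings.shape); // e.g. [1, 768]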
Image Generation
| Model | Best For | Rate Limit | Speed |
|---|---|---|---|
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | Fast |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Medium |
| @cf/lykon/dreamshaper-8-lcm | Artistic, stylized | 720/min | Fast |
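Image models return binary image bytes rather than JSON (see Binary PNG in the API table above). A minimal sketch; note that some models instead return base64 JSON, so confirm the output format per model in references/models-catalog.md:
const image = await env.AI.run('@cf/stabilityai/stable-diffusion-xl-base-1.0', {
  prompt: 'A lighthouse at sunset, photorealistic',
  num_steps: 20,
});
// Serve the raw bytes with an image content type
return new Response(image, { headers: { 'content-type': 'image/png' } });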
Vision Models
| Model | Best For | Rate Limit |
|---|---|---|
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min |
| @cf/unum/uform-gen2-qwen-500m | Fast image captioning | 720/min |
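A hedged sketch of a vision call, following the messages[].content[].image_url input shape from the API table above; the exact content schema can vary by model, and the image URL here is a placeholder:
const caption = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image in one sentence.' },
        { type: 'image_url', image_url: { url: 'https://example.com/photo.png' } },
      ],
    },
  ],
});
return Response.json(caption); // { response: string }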
Common Patterns
Pattern 1: Chat with Streaming
app.post('/chat', async (c) => {
const { messages } = await c.req.json<{ messages: Array<{ role: string; content: string }> }>();
const stream = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages, stream: true });
return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
});
Pattern 2: RAG (Retrieval Augmented Generation)
// 1. Generate embedding for query
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: [userQuery] });
// 2. Search Vectorize
const matches = await env.VECTORIZE.query(embeddings.data[0], { topK: 3 });
// 3. Build context
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');
// 4. Generate with context
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
messages: [
{ role: 'system', content: `Answer using this context:\n${context}` },
{ role: 'user', content: userQuery },
],
stream: true,
});
return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
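This pattern assumes your documents are already indexed. A minimal indexing sketch, assuming a VECTORIZE binding whose index dimension matches the embedding model (768 for bge-base-en-v1.5); the ids and metadata are illustrative:
// Embed documents and upsert them into Vectorize for later retrieval
const docs = ['Workers AI runs models on Cloudflare-managed GPUs.'];
const embedded = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: docs });
await env.VECTORIZE.upsert(
  docs.map((text, i) => ({ id: `doc-${i}`, values: embedded.data[i], metadata: { text } }))
);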
📄 More patterns: Load references/best-practices.md for structured output, image generation, multi-model consensus, and production patterns.
AI Gateway Integration
Enable caching, logging, and cost tracking with AI Gateway:
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' }, {
gateway: { id: 'my-gateway', skipCache: false },
});
Benefits: Cost tracking, response caching (50-90% savings on repeated queries), request logging, rate limiting, analytics.
Rate Limits & Pricing
Information last verified: 2025-01-14
Rate limits and pricing vary significantly by model. Always check the official documentation for the most current information:
- Rate Limits: https://developers.cloudflare.com/workers-ai/platform/limits/
- Pricing: https://developers.cloudflare.com/workers-ai/platform/pricing/
Free Tier: 10,000 neurons/day
Paid Tier: $0.011 per 1,000 neurons
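As a rough worked example: a day that consumes 110,000 neurons would use the 10,000 free neurons first, leaving 100,000 billed neurons, or 100 × $0.011 = $1.10 (illustrative arithmetic; confirm current rates at the pricing link above).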
📄 Per-model details: See references/models-catalog.md for specific rate limits and pricing for each model.
Production Checklist
Essential before deploying:
- Enable AI Gateway for cost tracking
- Implement streaming for text generation
- Add rate limit retry with exponential backoff (see the sketch below)
- Validate input length (prevent token limit errors)
- Add input sanitization (prevent prompt injection)
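A minimal backoff sketch for the retry item above; runWithRetry is a hypothetical helper, not part of the Workers AI API:
async function runWithRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let delay = 500;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // In production, inspect the error and retry only on rate-limit failures
      if (attempt >= maxRetries) throw err;
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay *= 2; // exponential backoff: 500 ms, 1 s, 2 s, ...
    }
  }
}

// Usage:
const response = await runWithRetry(() =>
  env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt: 'Hello' })
);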
📄 Full checklist: Load references/best-practices.md for complete production checklist, error handling patterns, monitoring, and cost optimization.
External SDK Integrations
Workers AI supports OpenAI SDK compatibility and Vercel AI SDK:
import OpenAI from 'openai';
import { createWorkersAI } from 'workers-ai-provider';

// OpenAI SDK - point the client at the Workers AI OpenAI-compatible endpoint
const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

// Vercel AI SDK - native integration via the AI binding
const workersai = createWorkersAI({ binding: env.AI });
📄 Full integration guide: Load references/integrations.md for OpenAI SDK, Vercel AI SDK, and REST API examples.
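For the Vercel AI SDK path, a hedged end-to-end sketch using streamText (assuming the ai and workers-ai-provider packages; exact API shapes differ between AI SDK versions):
import { streamText } from 'ai';
import { createWorkersAI } from 'workers-ai-provider';

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const workersai = createWorkersAI({ binding: env.AI });
    const result = streamText({
      model: workersai('@cf/meta/llama-3.1-8b-instruct'),
      prompt: 'Write a haiku about edge computing.',
    });
    // Stream the generated text back to the caller
    return result.toTextStreamResponse();
  },
};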
Limits Summary
| Feature | Limit |
|---|---|
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers CPU limits apply (30s default, configurable up to 5 min) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
When to Load References
| Reference File | Load When... |
|---|---|
| references/models-catalog.md | Choosing a model, checking rate limits, comparing model capabilities |
| references/best-practices.md | Production deployment, error handling, cost optimization, security |
| references/integrations.md | Using OpenAI SDK, Vercel AI SDK, or REST API instead of native binding |