Cloudflare Workers AI and Agents
Run AI models and build AI-powered agents on Cloudflare's global network with serverless GPU-powered inference, AI Gateway, and the Agents SDK.
Workers AI
Run machine learning models, powered by serverless GPUs, on Cloudflare's global network.
Getting Started
- Dashboard: Create and deploy a Workers AI application using the Cloudflare dashboard
- REST API: Use the Cloudflare Workers AI REST API to deploy a large language model (LLM)
- Workers Bindings: Deploy your first Cloudflare Workers AI project using the CLI
Models
Browse the catalog of machine learning models available on Workers AI:
- Text generation models
- Embedding models
- Image generation models
- Audio transcription models
- Vision models
Configuration
- Vercel AI SDK: Use Workers AI with the Vercel AI SDK for streaming text generation, tool calls, and structured output
- Workers Bindings: Create an AI binding to connect your Cloudflare Worker to Workers AI
- Hugging Face Chat UI: Connect Workers AI models to Hugging Face's open-source Chat UI interface
- OpenAI Compatible API Endpoints: Use the OpenAI SDK to call Workers AI models through compatible API endpoints
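As a sketch of the OpenAI-compatible endpoint, the key detail is the base URL: point any OpenAI-style client at the account-scoped `/ai/v1` path. The account ID and token below are placeholders:

```typescript
// Placeholder credentials for illustration only.
const accountId = "ACCOUNT_ID";

// OpenAI-compatible base URL for Workers AI.
const config = {
  apiKey: "CLOUDFLARE_API_TOKEN",
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1`,
};

// With the OpenAI SDK you would then do something like:
// const client = new OpenAI(config);
// await client.chat.completions.create({
//   model: "@cf/meta/llama-3.1-8b-instruct",
//   messages: [{ role: "user", content: "Hello" }],
// });
```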
Features
Asynchronous Batch API
Queue large inference workloads for asynchronous processing with the Workers AI Batch API:
- REST API: Send and retrieve batch inference requests
- Workers Binding: Send and retrieve batch inference requests using a Workers AI binding
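A minimal sketch of the binding flow, assuming the `queueRequest: true` option documented for the Batch API (model name illustrative; `ai` stands in for the `env.AI` binding):

```typescript
// Loose stand-in type for the Workers AI binding (env.AI).
type Ai = { run: (model: string, input: unknown, options?: unknown) => Promise<unknown> };

async function enqueueBatch(ai: Ai) {
  // queueRequest: true asks Workers AI to queue the inference for
  // asynchronous processing instead of running it synchronously.
  const queued = await ai.run(
    "@cf/baai/bge-m3", // illustrative model
    { requests: [{ query: "first input" }, { query: "second input" }] },
    { queueRequest: true },
  );
  // Expected to resolve to a receipt such as { status: "queued", request_id };
  // poll later by passing the request_id back to ai.run.
  return queued;
}
```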
Fine-tunes
Run fine-tuned inference on Workers AI using LoRA adapters:
- Upload and use LoRA adapters for fine-tuned inference
- Public LoRA adapters available for immediate use
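A sketch of fine-tuned inference: pass a LoRA adapter name alongside the prompt. The adapter and model names below are illustrative examples of the public adapters the docs describe:

```typescript
// Input for a LoRA-compatible model; names are illustrative.
const loraInput = {
  prompt: "Summarize the following article: ...",
  raw: true, // skip the default chat template so the adapter's own template applies
  lora: "cf-public-cnn-summarization", // a public adapter name (assumption)
};

// Inside a Worker:
// const out = await env.AI.run("@cf/meta-llama/llama-2-7b-chat-hf-lora", loraInput);
```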
Function Calling
Enable Workers AI models to execute functions and interact with external APIs:
- Embedded: Execute function code alongside inference calls
- API Reference: runWithTools and autoTrimTools methods
- Examples: fetch() handler, KV API, OpenAPI Spec
- Traditional: Define tools and schemas for industry-standard function calling
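As a sketch, embedded function calling pairs a JSON Schema tool definition with a handler the runtime invokes on the model's behalf; the tool below is hypothetical, in the shape `runWithTools` (from `@cloudflare/ai-utils`) expects:

```typescript
// A hypothetical tool: schema tells the model what arguments to produce,
// `function` is executed when the model requests the tool.
const tools = [
  {
    name: "getWeather",
    description: "Return the weather for a given city",
    parameters: {
      type: "object",
      properties: { city: { type: "string", description: "City name" } },
      required: ["city"],
    },
    function: async ({ city }: { city: string }) => `Sunny in ${city}`,
  },
];

// Inside a Worker:
// import { runWithTools } from "@cloudflare/ai-utils";
// const answer = await runWithTools(env.AI, "@hf/nousresearch/hermes-2-pro-mistral-7b",
//   { messages: [{ role: "user", content: "Weather in Austin?" }], tools });
```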
JSON Mode
Force Workers AI text generation models to return valid JSON output using response_format or JSON schemas.
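A minimal sketch of JSON Mode: `response_format` carries a JSON schema that constrains the model's output (model name and schema are illustrative):

```typescript
// response_format with a JSON schema, per the JSON Mode feature.
const responseFormat = {
  type: "json_schema",
  json_schema: {
    type: "object",
    properties: {
      name: { type: "string" },
      age: { type: "number" },
    },
    required: ["name", "age"],
  },
};

// Inside a Worker (model name illustrative):
// const out = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
//   messages: [{ role: "user", content: "Describe a fictional person as JSON." }],
//   response_format: responseFormat,
// });
```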
Markdown Conversion
Convert documents in multiple formats to Markdown:
- Conversion Options: Per-format options for HTML and image settings
- How it Works: Pre-processing and conversion pipeline
- Supported Formats: List of supported file formats
- Usage: Workers Binding and REST API
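A sketch of the Workers Binding usage: pass named blobs to the conversion method and get Markdown back (file name and content are illustrative):

```typescript
// Documents to convert; Blob is available in modern runtimes.
const documents = [
  { name: "hello.html", blob: new Blob(["<h1>Hello</h1>"], { type: "text/html" }) },
];

// Inside a Worker:
// const results = await env.AI.toMarkdown(documents);
// Each result carries fields such as name, mimeType, tokens, and the
// converted Markdown data.
```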
Prompt Caching
Use prefix caching and the x-session-affinity header to reduce latency and inference costs.
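As a sketch, prefix caching benefits from routing requests that share a long prompt prefix to the same host, which is what the `x-session-affinity` header is for; the token and helper below are illustrative:

```typescript
// Build fetch options for a Workers AI REST request that pins a session
// to the same cache-warm host via x-session-affinity (values illustrative).
function cachedRequestInit(sessionId: string, body: unknown) {
  return {
    method: "POST",
    headers: {
      Authorization: "Bearer CLOUDFLARE_API_TOKEN", // placeholder token
      "Content-Type": "application/json",
      "x-session-affinity": sessionId, // any stable ID per conversation
    },
    body: JSON.stringify(body),
  };
}
```

Reusing the same `sessionId` across turns of one conversation keeps the shared prefix cached, reducing latency and cost.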
Prompting
Structure prompts for Workers AI text generation models using system, user, and assistant message roles.
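As a sketch, a chat prompt is an ordered list of role-tagged messages: the system message sets behavior, user and assistant messages carry the conversation (model name in the comment is illustrative):

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// System message first, then the conversation turns in order.
const messages: ChatMessage[] = [
  { role: "system", content: "You are a concise technical assistant." },
  { role: "user", content: "What does an AI gateway do?" },
];

// Inside a Worker:
// const answer = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages });
```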
Guides and Tutorials
- Build a RAG AI: Build your first AI app with Cloudflare AI using Workers AI, Vectorize, D1, and Cloudflare Workers
- Whisper-large-v3-turbo: Transcribe large audio files using Workers AI with chunking
- Code Generation: Explore code generation using DeepSeek Coder models
- Fine Tune Models: Fine-tuning AI models with LoRA adapters on Workers AI
- Image Generation Playground: Build an AI Image Generator using Flux models
- Llama Vision: Use the Llama 3.2 11B Vision Instruct model
- BigQuery Integration: Ingest data from BigQuery as input to Workers AI models
Platform
- AI Gateway: Manage, monitor, and cache your Workers AI requests
- Data Usage: How Cloudflare handles your data, inputs, and outputs
- Errors: Reference table of Workers AI error codes
- Event Subscriptions: Subscribe to Workers AI events using Cloudflare Queues
- Glossary: Definitions of key terms
- Limits: Rate limits for inference requests by task type and model
- Pricing: Based on Neurons, with a free daily allocation
AI Gateway
Observe and control your AI applications with analytics, caching, rate limiting, and model fallback.
Getting Started
Set up AI Gateway and send your first request to observe and control AI API traffic.
Using AI Gateway
Connect your AI applications using:
- Unified API (OpenAI-compatible): Send requests to multiple AI providers through a single endpoint
- Provider Integrations:
  - Anthropic, Azure OpenAI, Amazon Bedrock
  - OpenAI, Google AI Studio, DeepSeek, Groq
  - HuggingFace, Cohere, Cerebras, Mistral
  - Replicate, Perplexity, Vertex AI
  - Cartesia, Deepgram, ElevenLabs, Fal AI
  - Ideogram, xAI, Parallel, OpenRouter
  - Workers AI
- Universal Endpoint: Route requests to any AI provider with fallbacks and retries
- WebSockets API: Persistent connections for real-time and non-real-time AI interactions
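A sketch of a Universal Endpoint request body: an ordered array of provider steps, where the gateway falls back to the next step if the previous one fails (IDs, keys, and model names are placeholders):

```typescript
// Ordered fallback chain: try Workers AI first, then OpenAI.
const steps = [
  {
    provider: "workers-ai",
    endpoint: "@cf/meta/llama-3.1-8b-instruct",
    headers: { Authorization: "Bearer CF_TOKEN", "Content-Type": "application/json" },
    query: { messages: [{ role: "user", content: "Hello" }] },
  },
  {
    provider: "openai",
    endpoint: "chat/completions",
    headers: { Authorization: "Bearer OPENAI_KEY", "Content-Type": "application/json" },
    query: { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] },
  },
];

// POST the array to the gateway (accountId/gatewayId are placeholders):
// await fetch(`https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}`, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(steps),
// });
```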
Features
- Caching: Override caching settings on a per-request basis
- Data Loss Prevention (DLP): Protect sensitive data in prompts and responses
- Dynamic Routing: Route requests based on conditions, quotas, and fallbacks
- JSON Configuration: Define routing flows using REST API and JSON
- Guardrails: Evaluate prompts and responses for harmful content
  - Supported model types for text generation and embeddings
- Rate Limiting: Control traffic with fixed or sliding rate limits
- Unified Billing: Pay for inference requests through Cloudflare
Integrations
- Workers Bindings: Reference for the AI binding with AI Gateway
- Vercel AI SDK: Route Vercel AI SDK requests through AI Gateway
Agents
Build AI-powered agents to perform tasks, persist state, browse the web, and communicate in real-time.
Getting Started
- Add to Existing Project: Add the Agents SDK to an existing Cloudflare Workers project
- Build a Chat Agent: Build a streaming AI chat agent with tools using Workers AI
- Prompt an AI Model: Use the Workers "mega prompt" for building Agents
- Quick Start: Build your first agent in 10 minutes, a counter with persistent state
- Testing: Write and run tests using Vitest and the Workers test pool
Patterns
Implement common AI agent patterns:
- Prompt chaining
- Routing
- Parallelization
- Orchestrator-workers
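The first pattern above, prompt chaining, can be sketched as two sequential model calls where the first call's output becomes the second call's input; `ai` stands in for the `env.AI` binding and the model name is illustrative:

```typescript
// Loose stand-in type for the Workers AI binding (env.AI).
type Ai = { run: (model: string, input: unknown) => Promise<{ response: string }> };

// Prompt chaining: outline first, then expand the outline into a draft.
async function chain(ai: Ai, topic: string): Promise<string> {
  const model = "@cf/meta/llama-3.1-8b-instruct"; // illustrative
  const outline = await ai.run(model, {
    messages: [{ role: "user", content: `Write a 3-point outline about ${topic}.` }],
  });
  const draft = await ai.run(model, {
    messages: [
      { role: "system", content: "Expand the outline into a short article." },
      { role: "user", content: outline.response },
    ],
  });
  return draft.response;
}
```

Routing and parallelization follow the same shape: a classifier call that picks a downstream prompt, or several independent calls awaited together.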
Model Context Protocol (MCP)
Build and deploy remote MCP servers on Cloudflare:
- Authorization: OAuth 2.1 authorization using Cloudflare Access
- MCP Governance: Control which MCP servers your organization uses
- MCP Server Portals: Centralize multiple MCP servers onto a single endpoint
- Cloudflare's MCP Servers: Connect to managed remote MCP servers
- Tools: Define, register, and manage MCP tools
- Transport: Configure Streamable HTTP transport
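A minimal sketch of a remote MCP server, following the `McpAgent` pattern with one registered tool (server name and tool are illustrative; requires the `agents` and MCP SDK packages):

```typescript
import { McpAgent } from "agents/mcp";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// A stateful MCP server deployed as a Cloudflare Agent.
export class CalculatorMCP extends McpAgent {
  server = new McpServer({ name: "calculator", version: "1.0.0" });

  async init() {
    // Register a tool: name, input schema, and a handler that
    // returns MCP-shaped content.
    this.server.tool("add", { a: z.number(), b: z.number() }, async ({ a, b }) => ({
      content: [{ type: "text", text: String(a + b) }],
    }));
  }
}
```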
Agentic Payments
Let AI agents pay for services programmatically:
- MPP (Machine Payments Protocol): Accept and make payments using MPP
- x402: Accept machine-to-machine payments using the x402 HTTP payment protocol
  - Charge for HTTP content
  - Charge for MCP tools
  - Pay from Agents SDK
  - Pay from coding tools
API Reference
- Agents API: Agent base class, lifecycle hooks, SQL storage, error handling
- Browse the Web: Chrome DevTools Protocol access for scraping and screenshots
- Callable Methods: Expose Agent methods to external clients over WebSocket RPC
- Chat Agents: Build AI chat interfaces with AIChatAgent and useAgentChat
- Client SDK: Connect from browsers or server runtimes
- Codemode: Let LLMs write and execute JavaScript in a secure sandbox
- Configuration: Wrangler bindings, environment variables, type generation
- Durable Execution: Run work that survives Durable Object eviction
- Email: Send and receive email from Cloudflare Agents
- getCurrentAgent(): Access the current Agent context
- HTTP and SSE: Handle HTTP requests and stream responses
- McpAgent: Build stateful MCP servers on Cloudflare
- McpClient: Connect to external MCP servers
- Observability: Subscribe to structured Agent events
- Protocol Messages: Control identity, state, and MCP protocol messages
- Queue Tasks: Add background tasks to a built-in FIFO queue
- RAG: Add RAG using Vectorize and embedded SQL database
- Readonly Connections: Restrict WebSocket clients to view-only access
- Retries: Retry failed operations with exponential backoff
- Routing: Route HTTP and WebSocket requests to Agents instances
- Run Workflows: Integrate Cloudflare Workflows
- Schedule Tasks: Delayed, date-based, cron, and interval tasks
- Sessions: Persistent conversation storage with tree-structured messages
- Store and Sync State: Persist and sync Agent state across clients
- Sub-agents: Spawn child agents with isolated storage
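Several of the APIs above (state sync, scheduled tasks) can be sketched in one small agent; this assumes the `Agent` base class from the `agents` package with `initialState`, `setState`, and `schedule` as described above:

```typescript
import { Agent } from "agents";

type CounterState = { count: number };

// A counter agent: state persists across restarts and syncs to clients.
export class Counter extends Agent<unknown, CounterState> {
  initialState: CounterState = { count: 0 };

  // Bump the persisted counter.
  async increment() {
    this.setState({ count: this.state.count + 1 });
  }

  async onStart() {
    // Schedule increment() to run 60 seconds from now.
    await this.schedule(60, "increment");
  }
}
```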
Architecture
Workers AI Architecture
- Serverless GPU infrastructure at edge locations
- Global network with 300+ cities
- Automatic model scaling
- No cold starts
AI Gateway Architecture
- Single endpoint for multiple providers
- Built-in caching at edge
- Rate limiting and quota management
- Observability and analytics
Agents SDK Architecture
- WebSocket-based real-time communication
- SQLite-backed persistent storage
- Durable execution surviving eviction
- Type-safe TypeScript API
Example: Building a Chat Agent
// Minimal sketch using the Agents SDK Agent class; the model name is
// illustrative and this.env.AI assumes an AI binding is configured.
import { Agent } from 'agents';

export class ChatAgent extends Agent {
  async onRequest(request) {
    const { prompt } = await request.json();
    const answer = await this.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: prompt }],
    });
    return Response.json(answer);
  }
}
Example: Setting Up AI Gateway
// Workers AI binding with AI Gateway
const response = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
messages: [{ role: 'user', content: 'Hello' }],
}, {
gateway: {
id: 'my-gateway', // AI Gateway ID
skipCache: false,
}
});