Cloudflare Workers AI and Agents
Run AI models and build AI-powered agents on Cloudflare's global network with serverless GPU-powered inference, AI Gateway, and the Agents SDK.
Workers AI
Run machine learning models, powered by serverless GPUs, on Cloudflare's global network.
Getting Started
- Dashboard: Create and deploy a Workers AI application using the Cloudflare dashboard
- REST API: Use the Cloudflare Workers AI REST API to deploy a large language model (LLM)
- Workers Bindings: Deploy your first Cloudflare Workers AI project using the CLI
Models
Browse the catalog of machine learning models available on Workers AI:
- Text generation models
- Embedding models
- Image generation models
- Audio transcription models
- Vision models
Configuration
- Vercel AI SDK: Use Workers AI with the Vercel AI SDK for streaming text generation, tool calls, and structured output
- Workers Bindings: Create an AI binding to connect your Cloudflare Worker to Workers AI
- Hugging Face Chat UI: Connect Workers AI models to Hugging Face's open-source Chat UI interface
- OpenAI Compatible API Endpoints: Use the OpenAI SDK to call Workers AI models through compatible API endpoints
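As a sketch of the OpenAI-compatible endpoint, the key detail is the base URL: point any OpenAI-style client at the account-scoped `/ai/v1` path. The account ID and token below are placeholders:

```typescript
// Placeholder credentials for illustration only.
const accountId = "ACCOUNT_ID";

// OpenAI-compatible base URL for Workers AI.
const config = {
  apiKey: "CLOUDFLARE_API_TOKEN",
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1`,
};

// With the OpenAI SDK you would then do something like:
// const client = new OpenAI(config);
// await client.chat.completions.create({
//   model: "@cf/meta/llama-3.1-8b-instruct",
//   messages: [{ role: "user", content: "Hello" }],
// });
```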
Features
Asynchronous Batch API
Queue large inference workloads for asynchronous processing with the Workers AI Batch API:
- REST API: Send and retrieve batch inference requests
- Workers Binding: Send and retrieve batch inference requests using a Workers AI binding
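A minimal sketch of the binding flow, assuming the `queueRequest: true` option documented for the Batch API (model name illustrative; `ai` stands in for the `env.AI` binding):

```typescript
// Loose stand-in type for the Workers AI binding (env.AI).
type Ai = { run: (model: string, input: unknown, options?: unknown) => Promise<unknown> };

async function enqueueBatch(ai: Ai) {
  // queueRequest: true asks Workers AI to queue the inference for
  // asynchronous processing instead of running it synchronously.
  const queued = await ai.run(
    "@cf/baai/bge-m3", // illustrative model
    { requests: [{ query: "first input" }, { query: "second input" }] },
    { queueRequest: true },
  );
  // Expected to resolve to a receipt such as { status: "queued", request_id };
  // poll later by passing the request_id back to ai.run.
  return queued;
}
```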
Fine-tunes
Run fine-tuned inference on Workers AI using LoRA adapters:
- Upload and use LoRA adapters for fine-tuned inference
- Public LoRA adapters available for immediate use
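A sketch of fine-tuned inference: pass a LoRA adapter name alongside the prompt. The adapter and model names below are illustrative examples of the public adapters the docs describe:

```typescript
// Input for a LoRA-compatible model; names are illustrative.
const loraInput = {
  prompt: "Summarize the following article: ...",
  raw: true, // skip the default chat template so the adapter's own template applies
  lora: "cf-public-cnn-summarization", // a public adapter name (assumption)
};

// Inside a Worker:
// const out = await env.AI.run("@cf/meta-llama/llama-2-7b-chat-hf-lora", loraInput);
```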
Function Calling
Enable Workers AI models to execute functions and interact with external APIs:
- Embedded: Execute function code alongside inference calls
- API Reference: runWithTools and autoTrimTools methods
- Examples: fetch() handler, KV API, OpenAPI Spec
- Traditional: Define tools and schemas for industry-standard function calling
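As a sketch, embedded function calling pairs a JSON Schema tool definition with a handler the runtime invokes on the model's behalf; the tool below is hypothetical, in the shape `runWithTools` (from `@cloudflare/ai-utils`) expects:

```typescript
// A hypothetical tool: schema tells the model what arguments to produce,
// `function` is executed when the model requests the tool.
const tools = [
  {
    name: "getWeather",
    description: "Return the weather for a given city",
    parameters: {
      type: "object",
      properties: { city: { type: "string", description: "City name" } },
      required: ["city"],
    },
    function: async ({ city }: { city: string }) => `Sunny in ${city}`,
  },
];

// Inside a Worker:
// import { runWithTools } from "@cloudflare/ai-utils";
// const answer = await runWithTools(env.AI, "@hf/nousresearch/hermes-2-pro-mistral-7b",
//   { messages: [{ role: "user", content: "Weather in Austin?" }], tools });
```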
JSON Mode
Force Workers AI text generation models to return valid JSON output using response_format or JSON schemas.
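A minimal sketch of JSON Mode: `response_format` carries a JSON schema that constrains the model's output (model name and schema are illustrative):

```typescript
// response_format with a JSON schema, per the JSON Mode feature.
const responseFormat = {
  type: "json_schema",
  json_schema: {
    type: "object",
    properties: {
      name: { type: "string" },
      age: { type: "number" },
    },
    required: ["name", "age"],
  },
};

// Inside a Worker (model name illustrative):
// const out = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
//   messages: [{ role: "user", content: "Describe a fictional person as JSON." }],
//   response_format: responseFormat,
// });
```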
Markdown Conversion
Convert documents in multiple formats to Markdown:
- Conversion Options: Per-format options for HTML and image settings
- How it Works: Pre-processing and conversion pipeline
- Supported Formats: List of supported file formats
- Usage: Workers Binding and REST API
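A sketch of the Workers Binding usage: pass named blobs to the conversion method and get Markdown back (file name and content are illustrative):

```typescript
// Documents to convert; Blob is available in modern runtimes.
const documents = [
  { name: "hello.html", blob: new Blob(["<h1>Hello</h1>"], { type: "text/html" }) },
];

// Inside a Worker:
// const results = await env.AI.toMarkdown(documents);
// Each result carries fields such as name, mimeType, tokens, and the
// converted Markdown data.
```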
Prompt Caching
Use prefix caching and the x-session-affinity header to reduce latency and inference costs.
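As a sketch, prefix caching benefits from routing requests that share a long prompt prefix to the same host, which is what the `x-session-affinity` header is for; the token and helper below are illustrative:

```typescript
// Build fetch options for a Workers AI REST request that pins a session
// to the same cache-warm host via x-session-affinity (values illustrative).
function cachedRequestInit(sessionId: string, body: unknown) {
  return {
    method: "POST",
    headers: {
      Authorization: "Bearer CLOUDFLARE_API_TOKEN", // placeholder token
      "Content-Type": "application/json",
      "x-session-affinity": sessionId, // any stable ID per conversation
    },
    body: JSON.stringify(body),
  };
}
```

Reusing the same `sessionId` across turns of one conversation keeps the shared prefix cached, reducing latency and cost.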
Prompting
Structure prompts for Workers AI text generation models using system, user, and assistant message roles.
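As a sketch, a chat prompt is an ordered list of role-tagged messages: the system message sets behavior, user and assistant messages carry the conversation (model name in the comment is illustrative):

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// System message first, then the conversation turns in order.
const messages: ChatMessage[] = [
  { role: "system", content: "You are a concise technical assistant." },
  { role: "user", content: "What does an AI gateway do?" },
];

// Inside a Worker:
// const answer = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { messages });
```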
Guides and Tutorials
- Build a RAG AI: Build your first AI app with Cloudflare AI using Workers AI, Vectorize, D1, and Cloudflare Workers
- Whisper-large-v3-turbo: Transcribe large audio files using Workers AI with chunking
- Code Generation: Explore code generation using DeepSeek Coder models
- Fine Tune Models: Fine-tuning AI models with LoRA adapters on Workers AI
- Image Generation Playground: Build an AI Image Generator using Flux models
- Llama Vision: Use the Llama 3.2 11B Vision Instruct model
- BigQuery Integration: Ingest data from BigQuery as input to Workers AI models
Platform
- AI Gateway: Manage, monitor, and cache your Workers AI requests
- Data Usage: How Cloudflare handles your data, inputs, and outputs
- Errors: Reference table of Workers AI error codes
- Event Subscriptions: Subscribe to Workers AI events using Cloudflare Queues
- Glossary: Definitions of key terms
- Limits: Rate limits for inference requests by task type and model
- Pricing: Based on Neurons, with a free daily allocation
AI Gateway
Observe and control your AI applications with analytics, caching, rate limiting, and model fallback.
Getting Started
Set up AI Gateway and send your first request to observe and control AI API traffic.
Using AI Gateway
Connect your AI applications using:
- Unified API (OpenAI-compatible): Send requests to multiple AI providers through a single endpoint
- Provider Integrations:
  - Anthropic, Azure OpenAI, Amazon Bedrock
  - OpenAI, Google AI Studio, DeepSeek, Groq
  - HuggingFace, Cohere, Cerebras, Mistral
  - Replicate, Perplexity, Vertex AI
  - Cartesia, Deepgram, ElevenLabs, Fal AI
  - Ideogram, xAI, Parallel, OpenRouter
  - Workers AI
- Universal Endpoint: Route requests to any AI provider with fallbacks and retries
- WebSockets API: Persistent connections for real-time and non-real-time AI interactions
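A sketch of a Universal Endpoint request body: an ordered array of provider steps, where the gateway falls back to the next step if the previous one fails (IDs, keys, and model names are placeholders):

```typescript
// Ordered fallback chain: try Workers AI first, then OpenAI.
const steps = [
  {
    provider: "workers-ai",
    endpoint: "@cf/meta/llama-3.1-8b-instruct",
    headers: { Authorization: "Bearer CF_TOKEN", "Content-Type": "application/json" },
    query: { messages: [{ role: "user", content: "Hello" }] },
  },
  {
    provider: "openai",
    endpoint: "chat/completions",
    headers: { Authorization: "Bearer OPENAI_KEY", "Content-Type": "application/json" },
    query: { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] },
  },
];

// POST the array to the gateway (accountId/gatewayId are placeholders):
// await fetch(`https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}`, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(steps),
// });
```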
Features
- Caching: Override caching settings on a per-request basis
- Data Loss Prevention (DLP): Protect sensitive data in prompts and responses
- Dynamic Routing: Route requests based on conditions, quotas, and fallbacks
- JSON Configuration: Define routing flows using REST API and JSON
- Guardrails: Evaluate prompts and responses for harmful content
  - Supported model types for text generation and embeddings
- Rate Limiting: Control traffic with fixed or sliding rate limits
- Unified Billing: Pay for inference requests through Cloudflare
Integrations
- Workers Bindings: Reference for the AI binding with AI Gateway
- Vercel AI SDK: Route Vercel AI SDK requests through AI Gateway
Agents
Build AI-powered agents to perform tasks, persist state, browse the web, and communicate in real-time.
Getting Started
- Add to Existing Project: Add the Agents SDK to an existing Cloudflare Workers project
- Build a Chat Agent: Build a streaming AI chat agent with tools using Workers AI
- Prompt an AI Model: Use the Workers "mega prompt" for building Agents
- Quick Start: Build your first agent in 10 minutes, a counter with persistent state
- Testing: Write and run tests using Vitest and the Workers test pool
Patterns
Implement common AI agent patterns:
- Prompt chaining
- Routing
- Parallelization
- Orchestrator-workers
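The first pattern above, prompt chaining, can be sketched as two sequential model calls where the first call's output becomes the second call's input; `ai` stands in for the `env.AI` binding and the model name is illustrative:

```typescript
// Loose stand-in type for the Workers AI binding (env.AI).
type Ai = { run: (model: string, input: unknown) => Promise<{ response: string }> };

// Prompt chaining: outline first, then expand the outline into a draft.
async function chain(ai: Ai, topic: string): Promise<string> {
  const model = "@cf/meta/llama-3.1-8b-instruct"; // illustrative
  const outline = await ai.run(model, {
    messages: [{ role: "user", content: `Write a 3-point outline about ${topic}.` }],
  });
  const draft = await ai.run(model, {
    messages: [
      { role: "system", content: "Expand the outline into a short article." },
      { role: "user", content: outline.response },
    ],
  });
  return draft.response;
}
```

Routing and parallelization follow the same shape: a classifier call that picks a downstream prompt, or several independent calls awaited together.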
Model Context Protocol (MCP)
Build and deploy remote MCP servers on Cloudflare:
- Authorization: OAuth 2.1 authorization using Cloudflare Access
- MCP Governance: Control which MCP servers your organization uses
- MCP Server Portals: Centralize multiple MCP servers onto a single endpoint
- Cloudflare's MCP Servers: Connect to managed remote MCP servers
- Tools: Define, register, and manage MCP tools
- Transport: Configure Streamable HTTP transport
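A minimal sketch of a remote MCP server, following the `McpAgent` pattern with one registered tool (server name and tool are illustrative; requires the `agents` and MCP SDK packages):

```typescript
import { McpAgent } from "agents/mcp";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

// A stateful MCP server deployed as a Cloudflare Agent.
export class CalculatorMCP extends McpAgent {
  server = new McpServer({ name: "calculator", version: "1.0.0" });

  async init() {
    // Register a tool: name, input schema, and a handler that
    // returns MCP-shaped content.
    this.server.tool("add", { a: z.number(), b: z.number() }, async ({ a, b }) => ({
      content: [{ type: "text", text: String(a + b) }],
    }));
  }
}
```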
Agentic Payments
Let AI agents pay for services programmatically:
- MPP (Machine Payments Protocol): Accept and make payments using MPP
- x402: Accept machine-to-machine payments using the x402 HTTP payment protocol
  - Charge for HTTP content
  - Charge for MCP tools
  - Pay from Agents SDK
  - Pay from coding tools
API Reference
- Agents API: Agent base class, lifecycle hooks, SQL storage, error handling
- Browse the Web: Chrome DevTools Protocol access for scraping and screenshots
- Callable Methods: Expose Agent methods to external clients over WebSocket RPC
- Chat Agents: Build AI chat interfaces with AIChatAgent and useAgentChat
- Client SDK: Connect from browsers or server runtimes
- Codemode: Let LLMs write and execute JavaScript in a secure sandbox
- Configuration: Wrangler bindings, environment variables, type generation
- Durable Execution: Run work that survives Durable Object eviction
- Email: Send and receive email from Cloudflare Agents
- getCurrentAgent(): Access the current Agent context
- HTTP and SSE: Handle HTTP requests and stream responses
- McpAgent: Build stateful MCP servers on Cloudflare
- McpClient: Connect to external MCP servers
- Observability: Subscribe to structured Agent events
- Protocol Messages: Control identity, state, and MCP protocol messages
- Queue Tasks: Add background tasks to a built-in FIFO queue
- RAG: Add RAG using Vectorize and embedded SQL database
- Readonly Connections: Restrict WebSocket clients to view-only access
- Retries: Retry failed operations with exponential backoff
- Routing: Route HTTP and WebSocket requests to Agents instances
- Run Workflows: Integrate Cloudflare Workflows
- Schedule Tasks: Delayed, date-based, cron, and interval tasks
- Sessions: Persistent conversation storage with tree-structured messages
- Store and Sync State: Persist and sync Agent state across clients
- Sub-agents: Spawn child agents with isolated storage
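Several of the APIs above (state sync, scheduled tasks) can be sketched in one small agent; this assumes the `Agent` base class from the `agents` package with `initialState`, `setState`, and `schedule` as described above:

```typescript
import { Agent } from "agents";

type CounterState = { count: number };

// A counter agent: state persists across restarts and syncs to clients.
export class Counter extends Agent<unknown, CounterState> {
  initialState: CounterState = { count: 0 };

  // Bump the persisted counter.
  async increment() {
    this.setState({ count: this.state.count + 1 });
  }

  async onStart() {
    // Schedule increment() to run 60 seconds from now.
    await this.schedule(60, "increment");
  }
}
```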
Architecture
Workers AI Architecture
- Serverless GPU infrastructure at edge locations
- Global network with 300+ cities
- Automatic model scaling
- No cold starts
AI Gateway Architecture
- Single endpoint for multiple providers
- Built-in caching at edge
- Rate limiting and quota management
- Observability and analytics
Agents SDK Architecture
- WebSocket-based real-time communication
- SQLite-backed persistent storage
- Durable execution surviving eviction
- Type-safe TypeScript API
Example: Building a Chat Agent
// Minimal sketch using the Agents SDK Agent class; the model name is
// illustrative and this.env.AI assumes an AI binding is configured.
import { Agent } from 'agents';

export class ChatAgent extends Agent {
  async onRequest(request) {
    const { prompt } = await request.json();
    const answer = await this.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: prompt }],
    });
    return Response.json(answer);
  }
}
Example: Setting Up AI Gateway
// Workers AI binding with AI Gateway
const response = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
messages: [{ role: 'user', content: 'Hello' }],
}, {
gateway: {
id: 'my-gateway', // AI Gateway ID
skipCache: false,
}
});