What makes vinext-agents-example so fast?

Julian Benegas
posted

how's this so fast?

Claude Opus 4.6
commented

Now I have a comprehensive picture. Here's what makes this app feel so fast: multiple architectural speed wins stacking on top of each other.


## 🏎️ Why it's so fast: a layer-by-layer breakdown

### 1. Vite instead of Next.js's compiler (dev speed)

The core of the speed story starts at vite.config.ts. Instead of running Next.js's default build toolchain (Turbopack/webpack), this project uses vinext, a reimplementation of the Next.js API surface on top of Vite. Vite's native ESM dev server is significantly faster for cold starts and HMR:

```ts
plugins: [
  vinext(),
  cloudflare({ viteEnvironment: { name: "rsc", childEnvironments: ["ssr"] } }),
]
```

According to vinext's own benchmarks, dev server cold starts are ~60% faster than Next.js 16 with Turbopack, and production builds are ~50% faster, and that's before Vite 8/Rolldown, which is faster still.

### 2. Everything runs in one Worker: zero network hops

Look at worker/index.ts: the entire application (pages, API routes, agent WebSocket handling, image optimization) runs in a single Cloudflare Worker:

```ts
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Image optimization? Same worker.
    // Agent WebSocket? Same worker.
    const agentResponse = await routeAgentRequest(request, env);
    if (agentResponse) return agentResponse;
    // Next.js pages? Same worker.
    return handler.fetch(request);
  },
};
```

There are no separate API servers, no microservices, no cold-starting Lambda functions. The agent, the AI model invocation, and the page rendering are all in-process on the same edge node closest to the user.

### 3. Cloudflare Workers = zero cold starts at the edge

Unlike serverless functions on other platforms that spin up containers, Cloudflare Workers use V8 isolates that start in under 5ms. Combined with global edge deployment (300+ locations), there's practically no cold start latency. The wrangler.jsonc config shows this is a pure Workers deployment:

"compatibility_date": "2026-02-12",
"compatibility_flags": ["nodejs_compat"],

### 4. WebSocket connection for chat: no HTTP polling

The chat UI in app/chat/Chat.tsx uses useAgent and useAgentChat which establish a persistent WebSocket to the Durable Object:

```ts
const agent = useAgent({
  agent: "ChatAgent",
  name: roomId,
  onOpen: useCallback(() => setConnected(true), []),
  // ...
});
```

Messages stream in real-time over the WebSocket instead of HTTP request/response cycles. No polling, no SSE reconnection overhead.
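Conceptually, the client just folds each incoming WebSocket frame into its message list. Here's a minimal sketch of that fold; the frame shape is hypothetical, not the agents SDK's actual wire format:

```typescript
// Hypothetical frame: either a new message or a token appended to one.
type Frame =
  | { type: "message"; id: string; text: string }
  | { type: "delta"; id: string; token: string };

type Message = { id: string; text: string };

// Pure fold: apply one streamed frame to the current message list.
function applyFrame(messages: Message[], frame: Frame): Message[] {
  if (frame.type === "message") {
    return [...messages, { id: frame.id, text: frame.text }];
  }
  return messages.map((m) =>
    m.id === frame.id ? { ...m, text: m.text + frame.token } : m
  );
}

// Tokens arriving over the socket build up the reply in place:
let msgs: Message[] = [];
msgs = applyFrame(msgs, { type: "message", id: "a1", text: "" });
msgs = applyFrame(msgs, { type: "delta", id: "a1", token: "Hel" });
msgs = applyFrame(msgs, { type: "delta", id: "a1", token: "lo" });
// msgs[0].text is now "Hello"
```

Because the socket stays open, each frame costs only the transit time; there's no per-message request setup to fold in.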

### 5. Durable Object with in-memory + SQLite state

The ChatAgent extends AIChatAgent, which is a Durable Object. It has co-located compute and storage: chat history is persisted in Durable Object SQLite (configured in wrangler.jsonc), meaning reads and writes to conversation state are local, not network calls to a remote database:

"migrations": [{ "new_sqlite_classes": ["ChatAgent"], "tag": "v1" }]

### 6. Workers AI: on-network model inference

The AI model is called via the AI binding, which means the inference runs within Cloudflare's network, not across the internet to a third-party API:

```ts
const workersai = createWorkersAI({ binding: this.env.AI });
const result = streamText({
  model: workersai("@cf/zai-org/glm-4.7-flash"),
  // ...
});
```

The model used is GLM-4.7-flash, a fast, lightweight model chosen specifically for speed. And streamText means tokens arrive as they're generated, not after the whole response is complete.

### 7. Streaming all the way down

The response is returned as a streaming UI message:

```ts
return result.toUIMessageStreamResponse();
```

Combined with the WebSocket transport and the Streamdown markdown renderer on the client, there's streaming from model → Durable Object → WebSocket → browser rendering. No buffering at any layer.
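To illustrate what "no buffering at any layer" means, here's a minimal sketch using plain Web Streams (not the actual agents SDK internals): a token source piped through a pass-through hop, with each chunk reaching the reader as it's produced rather than after the stream closes.

```typescript
// A source that emits tokens one at a time (stand-in for the model).
function tokenSource(tokens: string[]): ReadableStream<string> {
  let i = 0;
  return new ReadableStream<string>({
    pull(controller) {
      if (i < tokens.length) controller.enqueue(tokens[i++]);
      else controller.close();
    },
  });
}

// A pass-through hop (stand-in for the Durable Object / WebSocket relay):
// chunks are forwarded unchanged and immediately, never accumulated.
const relay = new TransformStream<string, string>();

async function run(): Promise<string[]> {
  const seen: string[] = [];
  const reader = tokenSource(["Hel", "lo", "!"]).pipeThrough(relay).getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    seen.push(value); // each token observed as it passes through
  }
  return seen;
}
```

A buffering layer would collapse those three reads into one big chunk at the end; a chain of pass-through stages preserves the token-by-token cadence all the way to the renderer.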

### 8. Client-side: dynamic import, no SSR for the chat

The homepage in app/page.tsx uses dynamic(..., { ssr: false }) so the chat component never blocks the initial page render:

```ts
const Chat = dynamic(() => import("./chat/Chat"), { ssr: false });
```

The shell loads instantly, and the chat component hydrates client-side.


### TL;DR

It's fast because almost every layer is optimized to minimize latency:

| Layer | Technique | Impact |
|---|---|---|
| Build | Vite (not webpack/Turbopack) | ~50-60% faster dev/build |
| Deployment | Cloudflare Workers (V8 isolates) | Zero cold starts, edge-close to user |
| Architecture | Single Worker for everything | No inter-service network hops |
| Transport | WebSocket via Durable Objects | No HTTP polling overhead |
| State | Durable Object SQLite | Co-located storage, no DB roundtrips |
| AI | Workers AI (AI binding) + flash model | On-network inference, fast model |
| Rendering | Streaming (model → WS → browser) | First token appears immediately |
| Client | Dynamic import, no SSR for chat | Instant shell load |

It's the full stack working together: Vite's speed for dev, Cloudflare's edge for deployment, WebSockets for transport, Durable Objects for state, and Workers AI for on-network inference, all streaming end-to-end.

Julian Benegas
commented

websockets are the key here, huh. is there a way to achieve this kind of speed using something like https://useworkflow.dev/ for streaming?

Claude Opus 4.6
commented

Now I have a clear picture of both architectures. Here's the comparison:


## Workflow DevKit streaming vs. WebSockets here: can you get the same speed?

Short answer: yes, you can get very close in perceived speed for AI streaming. But the two systems are solving different problems and make different tradeoffs.

### How this repo streams (WebSocket via Durable Objects)

Looking at app/chat/Chat.tsx:

```ts
// Persistent WebSocket to a Durable Object
const agent = useAgent({ agent: "ChatAgent", name: roomId });

// Chat messages flow over that same WebSocket
const { messages, sendMessage } = useAgentChat({ agent });
```

And in worker/chat-agent.ts:

```ts
const result = streamText({
  model: workersai("@cf/zai-org/glm-4.7-flash"),
  // ...
});
return result.toUIMessageStreamResponse();
```

The flow is Browser ↔ WebSocket ↔ Durable Object ↔ Workers AI, all within Cloudflare's network. The WebSocket is always open, so there's zero connection overhead per message.

### How Workflow DevKit streams (HTTP streaming with resumability)

Workflow DevKit uses standard HTTP streaming (via ReadableStream responses), not WebSockets. From their docs:

```ts
// API route returns a streaming HTTP response
const run = await start(chatWorkflow, [modelMessages]);
return createUIMessageStreamResponse({
  stream: run.readable, // ReadableStream, not WebSocket
});
```

On the client side, it uses AI SDK's useChat with a WorkflowChatTransport:

```ts
const { messages } = useChat({
  transport: new WorkflowChatTransport({ api: "/api/chat" }),
});
```

Each user message triggers a new HTTP POST; the server starts a workflow run and returns a streaming response. It's request/response, not a persistent connection.

### The speed comparison

| Factor | This repo (WebSocket + DO) | Workflow DevKit (HTTP streaming) |
|---|---|---|
| Connection setup per message | None; WebSocket is already open | New HTTP request each time (~1 RTT) |
| Time to first token | Minimal; just the LLM latency | HTTP request + workflow start + LLM latency |
| Token streaming speed | Same (both stream token-by-token) | Same (both stream token-by-token) |
| Server push (notifications) | ✅ DO can broadcast() anytime | ❌ Server can't push; client must poll or reconnect |
| Resumability | Manual (reconnect WebSocket) | Built-in (WorkflowChatTransport auto-resumes with startIndex) |
| Durability | DO SQLite persists chat history | Workflow event log persists all steps + streams |
| Where it runs | Cloudflare edge only | Vercel, self-hosted (Postgres world), local |

### Where you'd actually notice a difference

  1. Time-to-first-token: The WebSocket approach here saves ~1 round trip per message because the connection is already established. With Workflow DevKit, each message opens a new HTTP stream. On a fast connection that's maybe a 50-100ms difference: noticeable but not dramatic.

  2. Server-initiated events: This repo uses this.broadcast() in the Durable Object for scheduled task notifications (chat-agent.ts:149). Workflow DevKit can't do this; it's pull-based. You'd need a separate mechanism (polling, SSE, or a WebSocket layer on top).

  3. Once streaming starts, speed is identical: Both approaches stream tokens as they arrive from the LLM. The streaming itself isn't what makes this fast; it's the infrastructure around it (edge deployment, on-network AI, persistent connection).

### Where Workflow DevKit wins instead

  - Resumable streams out of the box: If the user refreshes mid-stream, Workflow DevKit picks up exactly where it left off via startIndex. This repo would lose the in-flight stream.
  - Durability of the workflow itself: Each tool call is a retriable "step." If searchFlights fails, it auto-retries. This repo's tools either succeed or fail in-place.
  - Portability: Workflow DevKit works on Vercel, self-hosted Postgres, local dev. This repo is locked to Cloudflare's Durable Objects.
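The resume mechanism can be pictured like this (a simplified sketch, not Workflow DevKit's actual API): the server keeps the ordered chunk log for a run, and a reconnecting client asks for everything from the index it last saw.

```typescript
// Server side: the durable, ordered chunk log for one run (simplified).
const chunkLog: string[] = ["Hel", "lo", " wor", "ld"];

// A resuming client sends the index of the next chunk it needs;
// the server replays from there instead of restarting the stream.
function resumeFrom(startIndex: number): string[] {
  return chunkLog.slice(startIndex);
}

// A client that saw 2 chunks before a refresh resumes at index 2:
// resumeFrom(2) returns [" wor", "ld"]
```

Because the log is persisted alongside the workflow's event history, resuming works even if the original server instance is gone, which a purely in-flight WebSocket stream can't offer.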

### Bottom line

The perceived speed of AI token streaming would be nearly identical with Workflow DevKit. What makes this repo feel so snappy isn't really the WebSocket-vs-HTTP-streaming choice for the AI part; it's the full stack working together: zero-cold-start Workers + on-network AI inference + a persistent WebSocket eliminating per-message overhead + co-located Durable Object state.

If you used Workflow DevKit on Vercel with a fast LLM provider, the streaming experience itself would feel comparable. You'd lose the persistent WebSocket (slightly higher latency per message, no server push), but you'd gain automatic resumability and step-level durability, which matter more for production reliability than raw speed.

