Utah vs Vercel AI SDK: what are the pros and cons?


tell me pros and cons of this project, vs deploying something with the ai sdk in vercel?

Claude Opus 4.6 commented:

Now I have a thorough understanding of the project. Here's a comprehensive comparison:


## Utah (Inngest) vs. Vercel AI SDK — Pros & Cons

### What Utah is

Utah is a durable, event-driven AI agent harness — a think/act/observe loop where every LLM call and tool execution is an Inngest step, giving you automatic retries, persistence, and observability. It connects to Inngest Cloud via WebSocket (connect()), processes messages from channels like Telegram/Slack, and runs locally with full filesystem access. It uses pi-ai (not the Vercel AI SDK) as its LLM abstraction.

### What the Vercel AI SDK is

The Vercel AI SDK (ai package) is a TypeScript library for building AI-powered UIs and server routes. It provides streaming helpers, tool calling, multi-step agents (generateText with maxSteps), and React hooks (useChat). It's designed for web apps deployed on Vercel (or similar), typically behind HTTP request/response cycles.


## Pros of Utah over Vercel AI SDK

### 1. Durability is built-in, not bolted on

Every LLM call and tool execution in Utah is an Inngest step (src/agent-loop.ts). If Claude times out on iteration 7 of 20, Inngest retries that step — the previous 6 iterations aren't re-run. With the Vercel AI SDK, if your generateText call fails mid-loop on a serverless function, you lose everything and start over (or you have to build your own checkpointing).

### 2. No request timeout constraints

Utah's agent loop can run for 20 iterations (maxIterations: 20), each with multiple tool calls. Vercel serverless functions have a 60-second default (up to 5 min on Pro). Utah connects via a persistent WebSocket — there's no HTTP timeout wall. Long-running agents with many tool calls (file reads, bash commands, web fetches) just work.

### 3. Singleton concurrency + cancel-on-new-message

Utah gets singleton + cancel mode for free from Inngest:

```ts
singleton: { key: "event.data.sessionKey", mode: "cancel" }
```

If the user sends a new message while the agent is mid-loop, the old run is cancelled and a new one starts. With the AI SDK, you'd need to implement this yourself (abort controllers, state management, race condition handling).
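To make the comparison concrete, here is a minimal sketch of what "singleton + cancel" looks like when you build it yourself around the AI SDK, using an `AbortController` per session key. Everything here (`startRun`, `activeRuns`) is hypothetical illustration, not Utah's or the AI SDK's actual API; the worker is expected to observe the signal and stop.

```typescript
// Minimal sketch of "singleton + cancel" keyed by session — the behavior
// Inngest provides declaratively. All names here are hypothetical.
const activeRuns = new Map<string, AbortController>();

function startRun<T>(
  sessionKey: string,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  // Cancel any in-flight run for the same session before starting a new one.
  activeRuns.get(sessionKey)?.abort();
  const controller = new AbortController();
  activeRuns.set(sessionKey, controller);
  return work(controller.signal).finally(() => {
    // Only clean up if a newer run hasn't already replaced this one.
    if (activeRuns.get(sessionKey) === controller) activeRuns.delete(sessionKey);
  });
}
```

Even this sketch omits the hard parts Inngest handles (cross-process cancellation, races between the abort and a reply already in flight), which is the point of the comparison.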

### 4. Persistent memory system

Utah has a two-tier memory system — daily logs + LLM-distilled long-term memory in MEMORY.md, maintained by a cron heartbeat. The AI SDK has no built-in persistence layer; you'd need to build session storage, memory distillation, and maintenance yourself.

### 5. Conversation compaction

Utah automatically summarizes long conversations when they approach token limits (~80% of 150K), preserving recent messages verbatim while compacting older ones. This is a durable step too. The AI SDK gives you maxSteps but no built-in compaction.

### 6. Multi-channel architecture out of the box

Utah has a clean channel abstraction (ChannelHandler interface) with Telegram and Slack implementations. Events are normalized to agent.message.received — the agent loop is channel-agnostic. The AI SDK is designed for web UIs; adapting it to Telegram/Slack/Discord requires significant plumbing.
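The `ChannelHandler` interface and the `agent.message.received` event name come from the description above; the field names in this sketch are assumptions made for illustration, not Utah's actual types.

```typescript
// Sketch of channel normalization: each channel adapter converts its native
// payload into one canonical event, so the agent loop stays channel-agnostic.
// Field and type names below are illustrative, not Utah's real definitions.
interface NormalizedEvent {
  name: "agent.message.received";
  data: { sessionKey: string; text: string; channel: string };
}

interface ChannelHandler {
  channel: string;
  normalize(raw: unknown): NormalizedEvent;
}

const telegramHandler: ChannelHandler = {
  channel: "telegram",
  normalize(raw) {
    const msg = raw as { chat: { id: number }; text: string };
    return {
      name: "agent.message.received",
      data: {
        sessionKey: `telegram:${msg.chat.id}`,
        text: msg.text,
        channel: "telegram",
      },
    };
  },
};
```

Because every handler emits the same event shape, adding a new channel means writing one adapter, not touching the agent loop.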

### 7. Full local filesystem access + coding tools

Utah gives the agent real tools — read, write, edit, bash, grep, find, ls powered by pi-coding-agent. It's a coding agent that can modify files on your machine. The AI SDK's tool system is more geared toward API calls and structured data extraction, not local system access.

### 8. Context pruning

Utah implements two-tier pruning of old tool results (soft trim + hard clear) to prevent context bloat. This is the kind of production detail the AI SDK leaves to you.

### 9. Event-driven composition

One message triggers 5 independent functions: handle-message, acknowledge, send-reply, heartbeat, failure-handler. Each has its own retry/concurrency config. This is Inngest's model — the AI SDK doesn't have this event composition pattern.


## Cons of Utah vs. Vercel AI SDK

### 1. No streaming

Utah's agent loop collects the full LLM response per step, then emits a reply event. The AI SDK has first-class streaming: streamText(), useChat(), StreamingTextResponse. If you're building a web UI where tokens should appear as they're generated, the AI SDK is miles ahead. Utah sends a complete reply after the whole loop finishes (or at incremental reply points).

### 2. No built-in UI layer

The AI SDK provides React hooks (useChat, useCompletion, useAssistant) that make building chat UIs trivial. Utah has no UI — it's a backend worker that talks to Telegram/Slack. Building a custom web frontend would require significant additional work.

### 3. Heavier infrastructure dependency

Utah requires Inngest Cloud (or the Inngest dev server) as a hard dependency. That's another service to manage, pay for, and depend on. The AI SDK is just a library — you can deploy it anywhere (Vercel, AWS, Cloudflare, self-hosted) with zero external orchestration services.

### 4. Alpha-stage dependencies

Utah uses inngest@4.0.0-alpha.6 and relies on @mariozechner/pi-ai@latest. The Vercel AI SDK is stable, widely adopted, and well-documented. Utah is explicitly an example/template.

### 5. Smaller ecosystem and community

The AI SDK has a large community, extensive docs, provider integrations (OpenAI, Anthropic, Google, Mistral, Cohere, etc.), and is battle-tested in production. Utah is a single-repo example. If something breaks, you're reading source code (which is small and clear, but still).

### 6. No structured output / object generation

The AI SDK provides generateObject() and streamObject() for generating typed, validated JSON from LLMs. Utah is focused on the agent loop pattern — tool calling in/out — and doesn't have a parallel for structured data extraction use cases.

### 7. Not designed for request/response web apps

If your use case is "user asks a question on a webpage, gets a streamed answer," the AI SDK is purpose-built for that. Utah is designed for asynchronous, event-driven agents — the architecture (WebSocket worker → Inngest events → channel APIs) is overkill for a simple chatbot endpoint.

### 8. Local-first development model

Utah runs on your machine with full filesystem access. That's powerful for a personal coding agent but means deployment to production requires thinking about where the workspace lives, file system persistence, etc. The AI SDK is stateless by design and deploys trivially to serverless.

### 9. Provider abstraction is thinner

Utah uses pi-ai which supports Anthropic/OpenAI/Google. The AI SDK supports 20+ providers through a standardized interface with community-maintained adapters.


## TL;DR — When to use which

| Use Case | Better Pick |
| --- | --- |
| Long-running agent with many tool calls | Utah — durability, no timeouts |
| Chat UI on a website | AI SDK — streaming, React hooks |
| Multi-channel bot (Slack, Telegram) | Utah — channel abstraction built-in |
| Personal coding agent on your machine | Utah — filesystem tools, local worker |
| Serverless API endpoint with AI | AI SDK — simpler, no infra deps |
| Agent that needs retries + exactly-once | Utah — Inngest gives this for free |
| Structured data extraction | AI SDK — generateObject() |
| Production app with streaming UX | AI SDK — first-class streaming |

They're fundamentally different tools for different problems. Utah is an agent runtime (durable, event-driven, multi-channel). The AI SDK is an AI integration library (streaming, UI-first, serverless-friendly). You could even combine them — use the AI SDK for the web UI layer and Inngest for durable background agent work.

Julian Benegas commented:

tell me about context pruning and about the memory system in utah

Claude Opus 4.6 commented:

Here's the full picture on both systems.


## Context Pruning

Context pruning in Utah operates at two distinct levels that solve different problems at different time scales.

### Level 1: Intra-loop Tool Result Pruning

This happens within a single agent run, inside the while loop in src/agent-loop.ts. The problem it solves: tool results (file reads, bash output, grep results) can be enormous and accumulate fast over 20 iterations.

The config (lines 56–66):

```ts
const PRUNING = {
  keepLastAssistantTurns: 3,         // protect the 3 most recent turns
  softTrim: {
    maxChars: 4000,                  // trigger soft trim above this
    headChars: 1500,                 // keep first 1,500 chars
    tailChars: 1500,                 // keep last 1,500 chars
  },
  hardClear: {
    threshold: 50_000,              // if total old tool chars > 50K, nuke everything
    placeholder: "[Tool result cleared — old context]",
  },
};
```

The function pruneOldToolResults() (lines 68–98) runs every iteration (after the first 3):

  1. Determine the protection zone: The last keepLastAssistantTurns * 2 messages (= 6 messages) are never touched. This preserves the most recent tool call → result pairs that the LLM is actively reasoning about.

  2. Measure total old tool content: It walks all toolResult messages before the protection zone and sums up the character count of their text content blocks.

  3. Choose a strategy:

    • If total old tool chars exceed 50,000 → hard clear: every old tool result is replaced with "[Tool result cleared — old context]". This is the nuclear option — the LLM loses all detail from early tool calls but the context stays manageable.
    • Otherwise, soft trim: any individual tool result over 4,000 chars gets head+tail trimmed — keep the first 1,500 and last 1,500 chars, with a note like ... [12000 chars trimmed] ... in the middle.

The pruning mutates the messages array in-place — it's not creating copies. This is important: it means the LLM on the next iteration sees the trimmed versions, not the originals.
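The two strategies can be sketched over plain strings, with thresholds copied from the PRUNING config above. Note the simplification: the real pruneOldToolResults() mutates typed message objects in place, while this sketch returns a new array for easy inspection.

```typescript
// Simplified sketch of soft trim vs. hard clear over plain tool-result
// strings. Thresholds mirror the PRUNING config quoted above.
const SOFT_MAX = 4000;          // soft-trim trigger per result
const HEAD = 1500;              // chars kept from the start
const TAIL = 1500;              // chars kept from the end
const HARD_THRESHOLD = 50_000;  // total old-result chars before hard clear
const PLACEHOLDER = "[Tool result cleared — old context]";

function pruneOldResults(oldResults: string[]): string[] {
  const total = oldResults.reduce((n, r) => n + r.length, 0);
  if (total > HARD_THRESHOLD) {
    // Nuclear option: replace every old result with a placeholder.
    return oldResults.map(() => PLACEHOLDER);
  }
  // Soft trim: keep head + tail of any individually oversized result.
  return oldResults.map((r) =>
    r.length > SOFT_MAX
      ? `${r.slice(0, HEAD)}\n... [${r.length - HEAD - TAIL} chars trimmed] ...\n${r.slice(-TAIL)}`
      : r,
  );
}
```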

It's triggered at line 222:

```ts
if (iterations > PRUNING.keepLastAssistantTurns) {
  pruneOldToolResults(messages);
}
```

### Level 2: Context Overflow Recovery

There's a second pruning mechanism that kicks in as a safety net if the LLM itself reports a context overflow error (lines 257–289):

```ts
if (isOverflow && !hasCompactedThisRun) {
  // Keep only the last 6 messages
  const keepCount = Math.min(6, messages.length);
  const toSummarize = messages.slice(0, messages.length - keepCount);
  const toKeep = messages.slice(-keepCount);
  // ... crude inline summarization (first 200 chars per message)
  // Retry the iteration (iterations--)
}
```

This is a last resort — it does a rough in-line summary (just truncating each old message to 200 chars, no LLM call) and retries the current iteration. It only fires once per run (hasCompactedThisRun flag).
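The recovery step can be sketched end to end. This is a simplification over plain strings (the real code operates on typed messages and then retries the iteration); the summary header text is invented for illustration.

```typescript
// Sketch of the emergency recovery: keep the last 6 messages verbatim and
// collapse everything older into a crude summary — the first 200 chars of
// each old message, with no LLM call. Strings stand in for typed messages.
function emergencyCompact(messages: string[], keep = 6): string[] {
  const keepCount = Math.min(keep, messages.length);
  const toSummarize = messages.slice(0, messages.length - keepCount);
  const toKeep = messages.slice(-keepCount);
  if (toSummarize.length === 0) return toKeep;
  const summary =
    "[Earlier conversation, truncated]\n" +
    toSummarize.map((m) => m.slice(0, 200)).join("\n");
  return [summary, ...toKeep];
}
```

The trade-off is deliberate: a 200-char truncation loses far more information than the LLM-driven compaction described later, but it costs nothing and unblocks the current run immediately.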

### Level 3: Budget Warnings

Not exactly pruning, but related: when iterations are running low, the system injects warning messages (lines 225–233):

  • At 10 iterations remaining: "Start wrapping up — respond with text soon."
  • At 3 iterations remaining: "You MUST respond with your final answer NOW. Do not call any more tools."

These are appended as user messages to messagesForLLM (a copy — they don't pollute the actual messages array).
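A helper mapping "iterations remaining" to an injected warning might look like the following. The wording is quoted from the post above; whether the triggers fire at exactly 10/3 remaining or at-or-below those counts is an assumption of this sketch.

```typescript
// Sketch of the iteration-budget warnings. Assumes at-or-below triggers;
// the real code may fire only at exactly 10 and 3 iterations remaining.
function budgetWarning(remaining: number): string | undefined {
  if (remaining <= 3) {
    return "You MUST respond with your final answer NOW. Do not call any more tools.";
  }
  if (remaining <= 10) {
    return "Start wrapping up — respond with text soon.";
  }
  return undefined; // plenty of budget left, no warning needed
}
```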


## The Memory System

Utah's memory is a two-tier, file-based system that gives the agent persistent knowledge across conversations. It's spread across three files:

### Tier 1: Daily Logs (short-term, append-only)

Location: workspace/memory/YYYY-MM-DD.md

Written by: The remember tool (src/lib/tools.ts line 52):

```ts
const rememberTool: Tool = {
  name: "remember",
  description:
    "Save a note to today's daily log. Use for things you want to remember " +
    "across conversations — decisions, facts, user preferences, task outcomes.",
  parameters: Type.Object({
    note: Type.String({ description: "The note to save" }),
  }),
};
```

When the agent calls remember, it hits executeTool → appendDailyLog() in src/lib/memory.ts (line 98):

```ts
export async function appendDailyLog(entry: string): Promise<void> {
  const d = todayString();                       // "2025-07-14"
  const path = getWorkspacePath(config.workspace.memoryDir, `${d}.md`);
  const existing = await readDailyLog(d);
  const timestamp = new Date().toLocaleTimeString("en-US", { hour12: false });
  const content = existing
    ? `${existing}\n\n### ${timestamp}\n${entry}`   // append
    : `# ${d}\n\n### ${timestamp}\n${entry}`;       // create
  await writeFile(path, content, "utf-8");
}
```

Each entry gets a wall-clock timestamp header (### 14:32:07). The file grows throughout the day — it's never truncated by the agent, only by the heartbeat's pruning step.

### Tier 2: Long-Term Memory (curated, LLM-distilled)

Location: workspace/MEMORY.md

Written by: The heartbeat function (src/functions/heartbeat.ts).

This is an Inngest cron function that runs every 30 minutes (configurable via HEARTBEAT_CRON env var). It's designed to be cheap on quiet days — most runs cost zero LLM calls:

Step 1 — Check if needed (no LLM, just file reads):

```ts
const shouldDistill = logSize > LOG_SIZE_THRESHOLD    // 4096 bytes
                   || hoursSinceLast > MAX_HOURS_BETWEEN; // 8 hours
```

It reads MEMORY.md, parses a hidden HTML comment timestamp (<!-- last_heartbeat: 2025-07-14T12:00:00Z -->), and checks today's log size. If neither threshold is met, it returns immediately.
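Putting the check together — the thresholds and the hidden-comment format come from the post; the parsing details (regex, Infinity fallback) are assumptions of this sketch.

```typescript
// Sketch of the heartbeat's cheap "should we distill?" check: parse the
// hidden timestamp comment out of MEMORY.md, compute hours elapsed, and
// compare both signals against their thresholds. No LLM call is needed.
const LOG_SIZE_THRESHOLD = 4096; // bytes of today's log
const MAX_HOURS_BETWEEN = 8;     // force a distill at least this often

function parseLastHeartbeat(memoryMd: string): Date | undefined {
  const m = memoryMd.match(/<!--\s*last_heartbeat:\s*(\S+)\s*-->/);
  return m ? new Date(m[1]) : undefined;
}

function shouldDistill(memoryMd: string, logSizeBytes: number, now: Date): boolean {
  const last = parseLastHeartbeat(memoryMd);
  const hoursSinceLast = last
    ? (now.getTime() - last.getTime()) / 3_600_000
    : Infinity; // never distilled before -> treat as overdue
  return logSizeBytes > LOG_SIZE_THRESHOLD || hoursSinceLast > MAX_HOURS_BETWEEN;
}
```

This is what makes quiet days cheap: most heartbeat runs read two small files, see neither threshold crossed, and return without touching the LLM.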

Step 2 — Load context: Reads MEMORY.md + the last 7 days of daily logs.

Step 3 — LLM distillation: Sends the current MEMORY.md plus all recent daily logs to the LLM with a prompt that instructs it to:

  1. Add new facts, decisions, preferences
  2. Update existing entries if superseded
  3. Remove outdated info
  4. Keep it concise — curated notes, not raw logs

Step 4 — Write: Saves the updated MEMORY.md with a fresh heartbeat timestamp.

Step 5 — Prune old logs: Deletes daily log files older than 30 days (configurable via MEMORY_RETENTION_DAYS).

### How Memory Gets Into the Agent

At the start of every agent run, buildSystemPrompt() calls buildMemoryContext() (line 134):

```ts
export async function buildMemoryContext(): Promise<string> {
  const memory = stripTimestamp(await readMemory()).trim();  // MEMORY.md
  const today = await readDailyLog();                        // today's log
  const yesterday = await readDailyLog(/* yesterday */);     // yesterday's log

  // Assemble in priority order (lowest first — truncation cuts from top)
  const parts: string[] = [];
  parts.push(`### Long-Term Memory (MEMORY.md)\n${memory}`);
  parts.push(`### Yesterday's Log\n${yesterday}`);
  parts.push(`### Today's Log\n${today}`);

  const full = parts.join("\n\n");
  return truncateToFit(full, MAX_MEMORY_CHARS);  // 10,000 chars cap
}
```

The priority order is deliberate: when the 10K char budget forces truncation, it cuts from the top (long-term memory first), preserving today's notes. This means:

  • Today's log is always fully visible
  • Yesterday's log is usually visible
  • Long-term memory gets truncated first if space is tight

The result is injected into the system prompt under a ## Memory section.
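A cut-from-the-top truncation like the one described might look like this. The 10K cap and the "lowest priority first" ordering come from the post; the function body and the truncation marker are assumptions of this sketch.

```typescript
// Sketch of truncate-from-the-top: parts are assembled lowest-priority
// first, so when the budget is exceeded we drop characters from the start
// and the most recent sections (appended last) survive intact.
function truncateToFit(full: string, maxChars: number): string {
  if (full.length <= maxChars) return full;
  const marker = "[...earlier memory truncated...]\n";
  return marker + full.slice(full.length - (maxChars - marker.length));
}
```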

### Conversation Compaction (across-session memory)

Separate from the memory system but related: src/lib/compaction.ts handles conversation-level memory. When a session's JSONL file gets too long (estimated tokens > 80% of 150K = 120K tokens), the system:

  1. Finds a cut point, keeping ~20K tokens of recent messages verbatim
  2. Sends older messages to the LLM for structured summarization (Goal → Progress → Decisions → Next Steps)
  3. Replaces the session file with [summary message] + [recent messages]

This runs as a durable Inngest step (step.run("compact")) at the top of the agent loop, before the while loop starts.
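Step 1's cut-point search can be sketched as follows. The ~20K verbatim budget comes from the post; the 4-chars-per-token estimate is a common heuristic and an assumption here, not Utah's actual estimator.

```typescript
// Sketch of the compaction cut point: estimate tokens (~4 chars/token, an
// assumed heuristic) and walk backwards keeping roughly 20K tokens of
// recent messages verbatim; everything earlier gets summarized.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function findCutPoint(messages: string[], keepTokens = 20_000): number {
  let kept = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    kept += estimateTokens(messages[i]);
    if (kept > keepTokens) return i + 1; // messages[i + 1 ..] stay verbatim
  }
  return 0; // whole history fits within the verbatim budget: nothing to cut
}
```

Everything before the returned index is what gets sent to the LLM for the structured Goal → Progress → Decisions → Next Steps summary.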

### The Full Picture

```
                              ┌─────────────────────────────┐
                              │     System Prompt            │
                              │                              │
                              │  SOUL.md (personality)       │
                              │  USER.md (user info)         │
                              │  MEMORY.md (long-term)  ◄────┼──── Heartbeat distills
                              │  Today's log            ◄────┼──── "remember" tool appends
                              │  Yesterday's log             │
                              │  [capped at 10K chars]       │
                              └──────────────┬──────────────┘
                                             │
                              ┌──────────────▼──────────────┐
                              │     Conversation Messages    │
                              │                              │
                              │  [compaction summary]   ◄────┼──── If session > 120K tokens
                              │  recent session history      │
                              │  current user message        │
                              └──────────────┬──────────────┘
                                             │
                              ┌──────────────▼──────────────┐
                              │     Agent Loop (while)       │
                              │                              │
                              │  think → act → observe       │
                              │  pruneOldToolResults() ◄─────┼──── Every iteration after 3rd
                              │  budget warnings        ◄────┼──── Last 10 / last 3 iterations
                              │  overflow recovery      ◄────┼──── If LLM says "too large"
                              └─────────────────────────────┘
```

So the agent has context management at four levels:

  1. System prompt budget — memory capped at 10K chars, prioritizing recency
  2. Session compaction — LLM summarization when conversation exceeds ~120K tokens
  3. Tool result pruning — soft trim / hard clear of old tool outputs every iteration
  4. Overflow recovery — emergency inline truncation if the LLM chokes
