Is utah built with the AI SDK and how does it handle multi-modality?


is this built with the ai sdk? and if not, how does it handle multi-modality?

Claude Opus 4.6
commented

Now I have the full picture. Here's the answer:


No, this is not built with the Vercel AI SDK. It uses @mariozechner/pi-ai (from the Pi monorepo by Mario Zechner) as its unified LLM abstraction layer, along with @mariozechner/pi-agent-core and @mariozechner/pi-coding-agent for tools.

### How pi-ai works as the LLM layer

The core LLM interaction happens in src/lib/llm.ts:

```typescript
import { getModel, complete, validateToolArguments } from "@mariozechner/pi-ai";
```
  • getModel(provider, model) — instantiates a model for any supported provider (Anthropic, OpenAI, Google, etc.)
  • complete(model, { systemPrompt, messages, tools }) — unified completion call across providers
  • Provider and model are configured via env vars (LLM_PROVIDER, AGENT_MODEL) in src/config.ts

pi-ai provides a unified type system across providers — TextContent, ToolCall, ToolResultMessage, etc. — so the agent loop in src/agent-loop.ts is provider-agnostic.
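As an illustration of why this matters, here is a minimal self-contained sketch of provider-agnostic handling over unified content blocks. The types are local stand-ins that mirror pi-ai's shapes (TextContent, ToolCall); they are assumptions for illustration, not the library's actual exports.

```typescript
// Local stand-ins for pi-ai's unified content types (assumed shapes,
// not the actual library exports).
type TextContent = { type: "text"; text: string };
type ToolCall = {
  type: "toolCall";
  id: string;
  name: string;
  arguments: Record<string, unknown>;
};
type AssistantContent = TextContent | ToolCall;

// Because every provider's response is normalized into the same content
// blocks, downstream code needs no provider-specific branches.
function renderAssistantText(content: AssistantContent[]): string {
  return content
    .filter((c): c is TextContent => c.type === "text")
    .map((c) => c.text)
    .join("\n");
}

// The same call works whether the blocks came from Anthropic, OpenAI, or Google.
const blocks: AssistantContent[] = [
  { type: "text", text: "Reading the file now." },
  { type: "toolCall", id: "t1", name: "read", arguments: { path: "notes.md" } },
];
console.log(renderAssistantText(blocks)); // prints "Reading the file now."
```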

### How multi-modality is (and isn't) handled

At the pi-ai type level, multi-modality is fully supported. Looking at the pi-ai types:

```typescript
export interface ImageContent {
  type: "image";
  data: string;       // base64 encoded image data
  mimeType: string;   // e.g., "image/jpeg", "image/png"
}

export interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];  // ← supports images!
  timestamp: number;
}

export interface ToolResultMessage {
  // ...
  content: (TextContent | ImageContent)[];  // ← tool results can include images too
}
```

So pi-ai natively supports passing images to the LLM as part of user messages and tool results.
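Concretely, a multimodal user message under these interfaces would look like the sketch below (interfaces copied from the quoted types; the base64 payload is a truncated placeholder):

```typescript
// Interfaces copied from the pi-ai types quoted above.
interface TextContent { type: "text"; text: string }
interface ImageContent { type: "image"; data: string; mimeType: string }
interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];
  timestamp: number;
}

// A user message carrying text plus an image. The data field is a
// truncated placeholder, not real image bytes.
const msg: UserMessage = {
  role: "user",
  content: [
    { type: "text", text: "What is in this screenshot?" },
    { type: "image", data: "iVBORw0KGgo...", mimeType: "image/png" },
  ],
  timestamp: Date.now(),
};
```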

However, Utah itself currently only processes text. Both channel transforms explicitly filter out non-text messages.

The normalized event type (src/channels/types.ts) also only has a message: string field — no attachments array:

```typescript
export type AgentMessageData = {
  message: string;        // The text content — no image/file support
  sessionKey: string;
  channel: string;
  sender: Sender;
  destination: Destination;
  channelMeta: Record<string, unknown>;
};
```

And when building the conversation history in the agent loop (src/agent-loop.ts:204), user messages are always constructed as plain strings:

```typescript
{ role: "user" as const, content: userMessage, timestamp: Date.now() }
```

The one place images could flow through today is pi-coding-agent's read tool: the comment in src/lib/tools.ts:5 mentions "image support" for the read tool, so if the agent reads an image file from the workspace, pi-coding-agent could return it as an ImageContent block in the tool result. But executeTool() currently strips everything down to text:

```typescript
const text = result.content
  .filter((c): c is TextContent => c.type === "text")
  .map((c) => c.text)
  .join("\n");
```
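To make the behavior concrete, here is a self-contained sketch: extractText mirrors the filtering logic quoted above (images are silently dropped), while keepMultimodal is a hypothetical alternative that would let an ImageContent block from the read tool survive into the tool result. Type names mirror pi-ai's; the function names are illustrative, not from the repo.

```typescript
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string };
type Content = TextContent | ImageContent;

// Mirrors the text-only filtering quoted above: ImageContent blocks
// are silently discarded.
function extractText(content: Content[]): string {
  return content
    .filter((c): c is TextContent => c.type === "text")
    .map((c) => c.text)
    .join("\n");
}

// Hypothetical alternative: pass all blocks through unchanged so images
// in tool results reach the model.
function keepMultimodal(content: Content[]): Content[] {
  return content;
}

// Placeholder tool result containing both text and an image.
const result: Content[] = [
  { type: "text", text: "read 1 file" },
  { type: "image", data: "AAAA", mimeType: "image/png" },
];
console.log(extractText(result));           // prints "read 1 file" (image lost)
console.log(keepMultimodal(result).length); // prints 2 (both blocks kept)
```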

### TL;DR

The plumbing supports multi-modality (pi-ai has ImageContent in user messages and tool results), but Utah doesn't use it yet — channel transforms drop non-text messages, the normalized event type has no attachment field, and tool results are filtered down to text-only. Adding image support would require changes at the channel transform layer, the AgentMessageData type, and the message construction in the agent loop.
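To make that list of changes concrete, here is a speculative, type-level sketch of one way the gap could be bridged: an optional attachments field on the normalized event type, plus a helper that upgrades the user message to a content array only when attachments are present. Everything here (the extended type, the buildUserContent helper) is a hypothetical illustration, not code from the repo.

```typescript
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string };

// Hypothetical extension of AgentMessageData; remaining fields
// (sender, destination, channelMeta) omitted for brevity.
type AgentMessageDataWithAttachments = {
  message: string;
  sessionKey: string;
  channel: string;
  attachments?: ImageContent[]; // new, optional: populated by channel transforms
};

// Hypothetical replacement for the plain-string construction in the
// agent loop: fall back to a string when there are no attachments.
function buildUserContent(
  data: AgentMessageDataWithAttachments,
): string | (TextContent | ImageContent)[] {
  if (!data.attachments?.length) return data.message;
  return [{ type: "text", text: data.message }, ...data.attachments];
}
```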


