Is utah built with the AI SDK and how does it handle multi-modality?


is this built with the ai sdk? and if not, how does it handle multi-modality?

Claude Opus 4.6
commented

Now I have the full picture. Here's the answer:


No, this is not built with the Vercel AI SDK. It uses @mariozechner/pi-ai (from the Pi monorepo by Mario Zechner) as its unified LLM abstraction layer, along with @mariozechner/pi-agent-core and @mariozechner/pi-coding-agent for tools.

### How pi-ai works as the LLM layer

The core LLM interaction happens in src/lib/llm.ts:

```typescript
import { getModel, complete, validateToolArguments } from "@mariozechner/pi-ai";
```
  • getModel(provider, model) — instantiates a model for any supported provider (Anthropic, OpenAI, Google, etc.)
  • complete(model, { systemPrompt, messages, tools }) — unified completion call across providers
  • Provider and model are configured via env vars (LLM_PROVIDER, AGENT_MODEL) in src/config.ts

pi-ai provides a unified type system across providers — TextContent, ToolCall, ToolResultMessage, etc. — so the agent loop in src/agent-loop.ts is provider-agnostic.
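As an illustration of why this matters, here is a minimal self-contained sketch of provider-agnostic handling over unified content blocks. The types are local stand-ins that mirror pi-ai's shapes (TextContent, ToolCall); they are assumptions for illustration, not the library's actual exports.

```typescript
// Local stand-ins for pi-ai's unified content types (assumed shapes,
// not the actual library exports).
type TextContent = { type: "text"; text: string };
type ToolCall = {
  type: "toolCall";
  id: string;
  name: string;
  arguments: Record<string, unknown>;
};
type AssistantContent = TextContent | ToolCall;

// Because every provider's response is normalized into the same content
// blocks, downstream code needs no provider-specific branches.
function renderAssistantText(content: AssistantContent[]): string {
  return content
    .filter((c): c is TextContent => c.type === "text")
    .map((c) => c.text)
    .join("\n");
}

// The same call works whether the blocks came from Anthropic, OpenAI, or Google.
const blocks: AssistantContent[] = [
  { type: "text", text: "Reading the file now." },
  { type: "toolCall", id: "t1", name: "read", arguments: { path: "notes.md" } },
];
console.log(renderAssistantText(blocks)); // prints "Reading the file now."
```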

### How multi-modality is (and isn't) handled

At the pi-ai type level, multi-modality is fully supported. Looking at the pi-ai types:

```typescript
export interface ImageContent {
  type: "image";
  data: string;       // base64 encoded image data
  mimeType: string;   // e.g., "image/jpeg", "image/png"
}

export interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];  // ← supports images!
  timestamp: number;
}

export interface ToolResultMessage {
  // ...
  content: (TextContent | ImageContent)[];  // ← tool results can include images too
}
```

So pi-ai natively supports passing images to the LLM as part of user messages and tool results.
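Concretely, a multimodal user message under these interfaces would look like the sketch below (interfaces copied from the quoted types; the base64 payload is a truncated placeholder):

```typescript
// Interfaces copied from the pi-ai types quoted above.
interface TextContent { type: "text"; text: string }
interface ImageContent { type: "image"; data: string; mimeType: string }
interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];
  timestamp: number;
}

// A user message carrying text plus an image. The data field is a
// truncated placeholder, not real image bytes.
const msg: UserMessage = {
  role: "user",
  content: [
    { type: "text", text: "What is in this screenshot?" },
    { type: "image", data: "iVBORw0KGgo...", mimeType: "image/png" },
  ],
  timestamp: Date.now(),
};
```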

However, Utah itself currently only processes text. Both channel transforms explicitly filter out non-text messages.

The normalized event type (src/channels/types.ts) also only has a message: string field — no attachments array:

```typescript
export type AgentMessageData = {
  message: string;        // The text content — no image/file support
  sessionKey: string;
  channel: string;
  sender: Sender;
  destination: Destination;
  channelMeta: Record<string, unknown>;
};
```

And when building the conversation history in the agent loop (src/agent-loop.ts:204), user messages are always constructed as plain strings:

```typescript
{ role: "user" as const, content: userMessage, timestamp: Date.now() }
```

The one place images could flow through today is pi-coding-agent's read tool: the comment in src/lib/tools.ts:5 mentions "image support" for the read tool, so if the agent reads an image file from the workspace, pi-coding-agent could return it as an ImageContent block in the tool result. But executeTool() currently strips everything down to text:

```typescript
const text = result.content
  .filter((c): c is TextContent => c.type === "text")
  .map((c) => c.text)
  .join("\n");
```
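To make the behavior concrete, here is a self-contained sketch: extractText mirrors the filtering logic quoted above (images are silently dropped), while keepMultimodal is a hypothetical alternative that would let an ImageContent block from the read tool survive into the tool result. Type names mirror pi-ai's; the function names are illustrative, not from the repo.

```typescript
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string };
type Content = TextContent | ImageContent;

// Mirrors the text-only filtering quoted above: ImageContent blocks
// are silently discarded.
function extractText(content: Content[]): string {
  return content
    .filter((c): c is TextContent => c.type === "text")
    .map((c) => c.text)
    .join("\n");
}

// Hypothetical alternative: pass all blocks through unchanged so images
// in tool results reach the model.
function keepMultimodal(content: Content[]): Content[] {
  return content;
}

// Placeholder tool result containing both text and an image.
const result: Content[] = [
  { type: "text", text: "read 1 file" },
  { type: "image", data: "AAAA", mimeType: "image/png" },
];
console.log(extractText(result));           // prints "read 1 file" (image lost)
console.log(keepMultimodal(result).length); // prints 2 (both blocks kept)
```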

### TL;DR

The plumbing supports multi-modality (pi-ai has ImageContent in user messages and tool results), but Utah doesn't use it yet — channel transforms drop non-text messages, the normalized event type has no attachment field, and tool results are filtered down to text-only. Adding image support would require changes at the channel transform layer, the AgentMessageData type, and the message construction in the agent loop.
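To make that list of changes concrete, here is a speculative, type-level sketch of one way the gap could be bridged: an optional attachments field on the normalized event type, plus a helper that upgrades the user message to a content array only when attachments are present. Everything here (the extended type, the buildUserContent helper) is a hypothetical illustration, not code from the repo.

```typescript
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string };

// Hypothetical extension of AgentMessageData; remaining fields
// (sender, destination, channelMeta) omitted for brevity.
type AgentMessageDataWithAttachments = {
  message: string;
  sessionKey: string;
  channel: string;
  attachments?: ImageContent[]; // new, optional: populated by channel transforms
};

// Hypothetical replacement for the plain-string construction in the
// agent loop: fall back to a string when there are no attachments.
function buildUserContent(
  data: AgentMessageDataWithAttachments,
): string | (TextContent | ImageContent)[] {
  if (!data.attachments?.length) return data.message;
  return [{ type: "text", text: data.message }, ...data.attachments];
}
```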


