Is Utah built with the AI SDK, and how does it handle multi-modality?
Now I have the full picture. Here's the answer:
No, this is not built with the Vercel AI SDK. It uses `@mariozechner/pi-ai` (from the Pi monorepo by Mario Zechner) as its unified LLM abstraction layer, along with `@mariozechner/pi-agent-core` and `@mariozechner/pi-coding-agent` for tools.
### How pi-ai works as the LLM layer
The core LLM interaction happens in `src/lib/llm.ts`:

```ts
import { getModel, complete, validateToolArguments } from "@mariozechner/pi-ai";
```
- `getModel(provider, model)` — instantiates a model for any supported provider (Anthropic, OpenAI, Google, etc.)
- `complete(model, { systemPrompt, messages, tools })` — unified completion call across providers
- Provider and model are configured via env vars (`LLM_PROVIDER`, `AGENT_MODEL`) in `src/config.ts`
pi-ai provides a unified type system across providers — `TextContent`, `ToolCall`, `ToolResultMessage`, etc. — so the agent loop in `src/agent-loop.ts` is provider-agnostic.
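To illustrate what "provider-agnostic" buys you, here is a self-contained sketch. The content types are redeclared locally and simplified — the exact field names on pi-ai's `ToolCall` (e.g. `id`, `name`, `arguments`) are assumptions here, not the library's actual definitions:

```typescript
// Simplified local stand-ins for pi-ai's unified content types.
type TextContent = { type: "text"; text: string };
type ToolCall = {
  type: "toolCall";
  id: string;
  name: string;
  arguments: Record<string, unknown>;
};
type AssistantContent = TextContent | ToolCall;

// An agent loop can branch on the unified content kinds without
// knowing which provider (Anthropic, OpenAI, Google, ...) produced them.
function splitResponse(content: AssistantContent[]): {
  text: string;
  toolCalls: ToolCall[];
} {
  const text = content
    .filter((c): c is TextContent => c.type === "text")
    .map((c) => c.text)
    .join("\n");
  const toolCalls = content.filter((c): c is ToolCall => c.type === "toolCall");
  return { text, toolCalls };
}

const { text, toolCalls } = splitResponse([
  { type: "text", text: "Running the tool now." },
  { type: "toolCall", id: "1", name: "read", arguments: { path: "a.txt" } },
]);
```

The same branching logic runs unchanged no matter which backend produced the response, which is the whole point of the abstraction layer.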
### How multi-modality is (and isn't) handled
At the pi-ai type level, multi-modality is fully supported. Looking at the pi-ai types:

```ts
export interface ImageContent {
  type: "image";
  data: string; // base64 encoded image data
  mimeType: string; // e.g., "image/jpeg", "image/png"
}

export interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[]; // ← supports images!
  timestamp: number;
}

export interface ToolResultMessage {
  // ...
  content: (TextContent | ImageContent)[]; // ← tool results can include images too
}
```
So pi-ai natively supports passing images to the LLM as part of user messages and tool results.
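For example, a multimodal user turn built against those interfaces would look like this (interfaces redeclared locally so the snippet stands alone; the base64 payload is truncated placeholder data):

```typescript
interface TextContent { type: "text"; text: string }
interface ImageContent { type: "image"; data: string; mimeType: string }
interface UserMessage {
  role: "user";
  content: string | (TextContent | ImageContent)[];
  timestamp: number;
}

// A user turn mixing a question with an attached screenshot.
const msg: UserMessage = {
  role: "user",
  content: [
    { type: "text", text: "What does this error dialog say?" },
    { type: "image", data: "iVBORw0KGgo...", mimeType: "image/png" }, // truncated base64
  ],
  timestamp: Date.now(),
};
```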
However, Utah itself currently only processes text. Both channel transforms explicitly filter out non-text messages:

- **Telegram** (`src/channels/telegram/transform.ts:15`): `if (!evt.message || !evt.message.text)` → returns `"telegram/message.unsupported"` — photos, documents, voice, etc. are all dropped.
- **Slack** (`src/channels/slack/transform.ts`): only processes `message` and `app_mention` events with `text` — file uploads are skipped.
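A heavily simplified sketch of that filtering behavior (hypothetical code, not the actual transform; the event shape is reduced to just the fields relevant here):

```typescript
// Minimal stand-in for a Telegram update: only the fields that matter.
type TelegramEvent = { message?: { text?: string; photo?: unknown[] } };

type TransformResult =
  | { kind: "agent/message"; text: string }
  | { kind: "telegram/message.unsupported" };

// Anything without a text field (photos, documents, voice notes)
// falls through to the unsupported branch and never reaches the agent.
function transform(evt: TelegramEvent): TransformResult {
  if (!evt.message || !evt.message.text) {
    return { kind: "telegram/message.unsupported" };
  }
  return { kind: "agent/message", text: evt.message.text };
}
```

Note that a photo-only message has no `text`, so it takes the unsupported branch even though the image data was right there on the event.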
The normalized event type (`src/channels/types.ts`) also only has a `message: string` field — no attachments array:

```ts
export type AgentMessageData = {
  message: string; // The text content — no image/file support
  sessionKey: string;
  channel: string;
  sender: Sender;
  destination: Destination;
  channelMeta: Record<string, unknown>;
};
```
And when building the conversation history in the agent loop (`src/agent-loop.ts:204`), user messages are always constructed as plain strings:

```ts
{ role: "user" as const, content: userMessage, timestamp: Date.now() }
```
The one place images could flow through is via `pi-coding-agent`'s `read` tool — the comment in `src/lib/tools.ts:5` mentions "image support" for the `read` tool — meaning if the agent reads an image file from the workspace, `pi-coding-agent` could return it as an `ImageContent` block in the tool result. But currently `executeTool()` strips everything down to text:

```ts
const text = result.content
  .filter((c): c is TextContent => c.type === "text")
  .map((c) => c.text)
  .join("\n");
```
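Running that same filter over a mixed tool result makes the loss concrete (types redeclared locally so the snippet is self-contained):

```typescript
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string };

// A hypothetical tool result: one text block plus one image block.
const resultContent: (TextContent | ImageContent)[] = [
  { type: "text", text: "Read 1 file." },
  { type: "image", data: "iVBORw0KGgo", mimeType: "image/png" },
];

// The same filter as in executeTool(): the image block is silently discarded,
// so the model only ever sees the text portion of the tool result.
const text = resultContent
  .filter((c): c is TextContent => c.type === "text")
  .map((c) => c.text)
  .join("\n");
```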
### TL;DR
The plumbing supports multi-modality (pi-ai has `ImageContent` in user messages and tool results), but Utah doesn't use it yet — channel transforms drop non-text messages, the normalized event type has no attachment field, and tool results are filtered down to text-only. Adding image support would require changes at the channel transform layer, the `AgentMessageData` type, and the message construction in the agent loop.
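As a rough sketch of what that change might look like — entirely hypothetical, `attachments` and `toUserContent` are illustrative names, not anything in the codebase — the normalized event type could carry optional image blocks, and message construction could branch on them:

```typescript
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string };

// Hypothetical extension of AgentMessageData (trimmed to relevant fields):
// an optional attachments array alongside the existing text message.
type AgentMessageData = {
  message: string;
  attachments?: ImageContent[]; // ← hypothetical new field
  sessionKey: string;
  channel: string;
};

// Build the user-message content: keep the plain string when text-only
// (the current behavior), switch to a content array when images exist.
function toUserContent(
  data: AgentMessageData,
): string | (TextContent | ImageContent)[] {
  if (!data.attachments || data.attachments.length === 0) {
    return data.message;
  }
  return [{ type: "text", text: data.message }, ...data.attachments];
}
```

Since pi-ai's `UserMessage.content` already accepts `string | (TextContent | ImageContent)[]`, the agent loop would need only this kind of branching; the heavier lifting is in the channel transforms that currently discard the media.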