How to implement prompt caching in forums?
How do you recommend we implement prompt caching in forums?
## Recommendations for Implementing Prompt Caching in Forums
Based on your codebase architecture and current usage of the Vercel AI SDK + Anthropic Claude, here's how I'd recommend implementing prompt caching:
### Current State

- Using `streamText()` from the `ai` package
- Static system prompt that's the same for every request (the "You're assisting users..." boilerplate)
- Multi-turn conversations with accumulated context (tool output, prior comments)
- Tools that execute file reads/searches and produce large context
### Recommended Approach: Message-Level Cache Control

Add Anthropic's `cache_control` directive to the final message in your conversation. Here's the pattern.

In `agent/response-agent.ts`, modify `streamTextStep()`:
```ts
async function streamTextStep({
  owner,
  repo,
  gitRef,
  model,
  writable,
  sandboxId,
  initialMessages,
  newMessages,
}: {
  owner: string
  repo: string
  gitRef: string
  model: string
  writable: WritableStream
  sandboxId: string
  initialMessages: AgentUIMessage[]
  newMessages: UIMessage[]
}): Promise<{...}> {
  "use step"

  const workspace = await getWorkspace({
    sandboxId,
    gitContext: { owner, repo, ref: gitRef },
  })

  const allMessages = [...initialMessages, ...newMessages] as AgentUIMessage[]

  // Add cache control to the last message (the current user question).
  // Note: the AI SDK expects provider options scoped under the provider
  // key, i.e. `anthropic.cacheControl`, not a top-level `cacheControl`.
  const messagesWithCache = allMessages.map((msg, idx) => ({
    ...msg,
    ...(idx === allMessages.length - 1 && model.includes("claude")
      ? {
          providerOptions: {
            anthropic: {
              cacheControl: { type: "ephemeral" },
            },
          },
        }
      : {}),
  }))

  const result = streamText({
    messages: await convertToModelMessages(messagesWithCache),
    tools: getTools({ workspace }),
    system: `You're assisting users in a forum about the GitHub repository \`${owner}/${repo}\`...`,
    model,
  })

  // ... rest of function
}
```
### Key Benefits for Forums

- Caches the conversation prefix: Anthropic caches everything up to the breakpoint (tools, system prompt, and prior messages), so each follow-up within a thread reuses the accumulated context
- System prompt cached: your static system instructions (the repo context, tool docs) are part of that cached prefix
- Cost savings: cached input tokens are billed at 10% of the base input price, though cache writes cost 25% more than base
- Lower latency: up to ~85% latency reduction on cache hits for long prompts
### Cost Breakeven Math

- Cache write: 1.25× the base input cost
- Cache read: 0.1× the base input cost
- Breakeven: a single cache read already recovers the write premium (0.25× extra on the write vs 0.9× saved per read), provided it lands within the 5-minute TTL (see the worked example below)
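A quick worked sketch of the thread-level economics. This is a sketch only: the $3 per million input tokens is an assumed example rate, and the prefix size is held fixed for simplicity (in reality it grows each turn):

```ts
// Back-of-envelope cost for one thread, assuming a fixed cached prefix.
// Rates are illustrative; substitute your model's actual pricing.
const INPUT_PRICE_PER_MTOK = 3.0 // assumed base input price, $/MTok
const PREFIX_TOKENS = 10_000 // system prompt + tools + prior messages

const uncachedCost = (turns: number) =>
  (turns * PREFIX_TOKENS * INPUT_PRICE_PER_MTOK) / 1e6

const cachedCost = (turns: number) =>
  // one cache write at 1.25x, then (turns - 1) reads at 0.1x
  ((1.25 + (turns - 1) * 0.1) * PREFIX_TOKENS * INPUT_PRICE_PER_MTOK) / 1e6

console.log(uncachedCost(3)) // 0.09   -> $0.090 for a 3-turn thread
console.log(cachedCost(3))   // 0.0435 -> roughly half the cost
```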
For forums this is excellent because:
- Users typically ask 2-3 follow-up questions per thread
- The 5-minute cache TTL aligns with natural conversation pace
- Initial expensive write on first message is recovered within the thread
### Where This Works Best

- Response agent (main use case): each user follow-up question reuses prior conversation context
- Category agent: lighter use case, but the system prompt is stable across posts (see the sketch after this list)
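For the category agent, a minimal sketch of caching just the static system prompt, assuming the AI SDK's Anthropic provider. The model id and prompt text are placeholders, not your actual code:

```ts
import { anthropic } from "@ai-sdk/anthropic"
import { generateText } from "ai"

// Hypothetical category-agent call. The system prompt is identical across
// posts, so mark it with a cache breakpoint: it is written to the cache
// once, then read (at 0.1x price) on every categorization within the TTL.
// Note: it must exceed the minimum cacheable length to actually be cached.
const { text: category } = await generateText({
  model: anthropic("claude-sonnet-4-5"), // placeholder model id
  messages: [
    {
      role: "system",
      content: "You categorize forum posts about `owner/repo`...", // placeholder
      providerOptions: {
        anthropic: { cacheControl: { type: "ephemeral" } },
      },
    },
    { role: "user", content: "Categorize this post: ..." },
  ],
})
```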
### Implementation Details

- Works with Anthropic Claude models, subject to a minimum cacheable prompt length (1,024 tokens on Sonnet/Opus models, 2,048 on Haiku), so short prompts simply won't be cached
- Requires no extra dependencies (you're already on `ai` v6+)
- The SDK automatically translates the message-level option into Anthropic's block-level `cache_control`
- Monitor cached-token counts in responses to verify it's working (see the sketch below)
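To verify from code rather than the dashboard, here is a sketch that extends the `streamText` call in `streamTextStep` above to read Anthropic's cache counters from the AI SDK's provider metadata. The field names are from the `@ai-sdk/anthropic` docs; confirm them against the `ai` version you're on:

```ts
const result = streamText({
  messages: await convertToModelMessages(messagesWithCache),
  tools: getTools({ workspace }),
  system: `You're assisting users in a forum about the GitHub repository \`${owner}/${repo}\`...`,
  model,
  onFinish: ({ providerMetadata, usage }) => {
    // Anthropic reports cache activity in provider metadata:
    //   cacheCreationInputTokens: tokens written to the cache (1.25x price)
    //   cacheReadInputTokens:     tokens served from the cache (0.1x price)
    const meta = providerMetadata?.anthropic
    console.log("cache write tokens:", meta?.cacheCreationInputTokens)
    console.log("cache read tokens:", meta?.cacheReadInputTokens)
    console.log("usage:", usage)
  },
})
```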
### Alternative: Response-Level Caching (if you want more control)

Use language model middleware for exact-match response caching across different threads (caching an identical system prompt plus identical recent comments), which is useful if many threads ask about the same code sections. This is more complex, but valuable if you want cross-thread caching; a sketch follows below.
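If you go that route, here is a minimal sketch using the AI SDK's `wrapLanguageModel` with an in-memory, exact-match cache. The `Map`, key scheme, and model id are illustrative assumptions; a real deployment would want a shared store (e.g. Redis) and a TTL:

```ts
import { anthropic } from "@ai-sdk/anthropic"
import { wrapLanguageModel } from "ai"

// Exact-match response cache: only byte-identical requests hit.
// This complements (does not replace) Anthropic's prefix caching.
const responseCache = new Map<string, unknown>()

const cachedModel = wrapLanguageModel({
  model: anthropic("claude-sonnet-4-5"), // placeholder model id
  middleware: {
    wrapGenerate: async ({ doGenerate, params }) => {
      const key = JSON.stringify(params)
      const hit = responseCache.get(key)
      if (hit) return hit as Awaited<ReturnType<typeof doGenerate>>
      const result = await doGenerate()
      responseCache.set(key, result)
      return result
    },
  },
})

// Use `cachedModel` anywhere you currently pass the model to streamText().
```

Note that this only pays off when two requests are byte-identical; for forum threads, the message-level cache control above will hit far more often.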
### Next Steps

- Add cache control to the last message in `streamTextStep`
- Monitor usage metrics in the Anthropic dashboard (group by token type)
- Test with a few user conversations to validate the cost reduction
- Consider expanding to the category agent if results are positive
The message-level approach is the simplest win—minimal code changes, maximum benefit for your multi-turn conversation pattern.