How to implement prompt caching in forums?


how do u recommend we implement prompt caching in forums?

Claude Haiku 4.5 commented:

I'll explore the codebase to understand the current architecture and then provide recommendations for implementing prompt caching in forums.

Let me explore the agent and LLM infrastructure:

Now let me search for how the current system messages and context are structured:

Perfect! Now let me check the Vercel AI SDK and Anthropic documentation for prompt caching capabilities:

Let me check the Vercel AI SDK documentation more directly:

Perfect! Now I have comprehensive information. Let me provide a detailed recommendation based on the codebase and the latest documentation.

## Recommendations for Implementing Prompt Caching in Forums

Based on your codebase architecture and current usage of the Vercel AI SDK + Anthropic Claude, here's how I'd recommend implementing prompt caching:

### Current State

  • Using `streamText()` from the `ai` package
  • A static system prompt that's the same for every request (the "You're assisting users..." boilerplate)
  • Multi-turn conversations with accumulated context (tool output, prior comments)
  • Tools that run file reads and searches, producing large context

### Recommended Approach: Message-Level Cache Control

Add Anthropic's `cache_control` directive to the final message in your conversation. Here's the pattern:

In `agent/response-agent.ts`, modify `streamTextStep()`:

```ts
async function streamTextStep({
  owner,
  repo,
  gitRef,
  model,
  writable,
  sandboxId,
  initialMessages,
  newMessages,
}: {
  owner: string
  repo: string
  gitRef: string
  model: string
  writable: WritableStream
  sandboxId: string
  initialMessages: AgentUIMessage[]
  newMessages: UIMessage[]
}): Promise<{...}> {
  "use step"

  const workspace = await getWorkspace({
    sandboxId,
    gitContext: { owner, repo, ref: gitRef },
  })
  const allMessages = [...initialMessages, ...newMessages] as AgentUIMessage[]

  // Convert to model messages first, then mark the last message as a cache
  // breakpoint. Anthropic caches the whole prefix (tools, system prompt, and
  // all prior messages) up to and including that message.
  const modelMessages = await convertToModelMessages(allMessages)
  const messagesWithCache = modelMessages.map((msg, idx) =>
    idx === modelMessages.length - 1 && model.includes("claude")
      ? {
          ...msg,
          providerOptions: {
            anthropic: { cacheControl: { type: "ephemeral" } },
          },
        }
      : msg
  )

  const result = streamText({
    messages: messagesWithCache,
    tools: getTools({ workspace }),
    system: `You're assisting users in a forum about the GitHub repository \`${owner}/${repo}\`...`,
    model,
  })
  
  // ... rest of function
}
```

### Key Benefits for Forums

  1. Caches the conversation prefix: Each multi-turn conversation within a thread will reuse prior messages
  2. System prompt cached: Your static system instructions (the repo context, tool docs) are included in the cache
  3. Cost savings: cache reads are billed at 10% of the normal input-token price (cache writes cost 25% more than normal input)
  4. Lower latency: up to ~85% latency reduction on cache hits for long prompts

### Cost Break-even Math

  • Cache write: 1.25× input cost
  • Cache read: 0.1× input cost
  • Break-even: the 0.25× write premium is recovered by the first cache read within the 5-minute window, since each read saves 0.9× (see the back-of-the-envelope sketch below)
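
As a back-of-the-envelope check (illustrative numbers only, assuming a thread where three requests share the same cached prefix):

```ts
// Relative input-token cost of the shared prefix, in units of the normal
// (uncached) input price, for a hypothetical 3-message thread.
const uncached = 3 * 1.0      // 3.00: every request pays full price
const cached = 1.25 + 2 * 0.1 // 1.45: one cache write, then two cache reads
// cached / uncached is roughly 0.48, i.e. about half the input cost on the
// shared prefix; the saving grows with longer threads and larger prefixes.
```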

For forums this is excellent because:

  • Users typically ask 2-3 follow-up questions per thread
  • The 5-minute cache TTL (refreshed each time the cached prefix is reused) aligns with natural conversation pace
  • Initial expensive write on first message is recovered within the thread

### Where This Works Best

  1. Response agent (main use case): Each user follow-up question reuses prior conversation context
  2. Category agent: Lighter use case, but system prompt is stable across posts

### Implementation Details

  • Anthropic-specific: supported on current Claude models, but the minimum cacheable prefix is 1,024 tokens for Sonnet/Opus models and 2,048 for Haiku models, so very short prompts won't be cached
  • Requires no extra dependencies (the project already uses ai v6+)
  • The AI SDK translates the message-level `cacheControl` option into Anthropic's block-level `cache_control` breakpoints
  • Monitor the Anthropic provider metadata in responses (`cacheCreationInputTokens` / `cacheReadInputTokens`) to verify it's working; see the sketch below
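
For example, here's a rough sketch of the verification step; the metadata field names come from the `@ai-sdk/anthropic` provider, but treat the exact shape as something to confirm against the SDK version you're on:

```ts
// Same streamText call as in streamTextStep above, with an onFinish hook that
// logs Anthropic's cache accounting (reuses model, messagesWithCache, getTools,
// and workspace from that snippet).
const result = streamText({
  model,
  messages: messagesWithCache,
  tools: getTools({ workspace }),
  system: `You're assisting users in a forum about the GitHub repository \`${owner}/${repo}\`...`,
  onFinish: ({ usage, providerMetadata }) => {
    // cacheCreationInputTokens > 0  => this request wrote the cache
    // cacheReadInputTokens > 0      => this request read from the cache
    console.log("usage", usage)
    console.log("anthropic cache", providerMetadata?.anthropic)
  },
})
```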

### Alternative: Response-Level Caching (if you want more control)

Use language model middleware for session-based caching across different threads:

```ts
// Cache identical system prompt + identical recent comments
// Useful if many threads ask about the same code sections
```

This is more complex but valuable if you want cross-thread caching.
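
Here's a rough sketch of what that could look like with the AI SDK's `wrapLanguageModel` middleware; the in-memory Map, the cache key, and the model id are placeholders, and a real implementation would bound the cache and handle streaming via `wrapStream` as well:

```ts
import { wrapLanguageModel } from "ai"
import { anthropic } from "@ai-sdk/anthropic"

// Naive in-memory cache of full model responses, keyed by the serialized
// call params (prompt, tools, settings). Placeholder only: unbounded, and
// it only covers generate calls, not streaming.
const responseCache = new Map<string, any>()

export const cachedClaude = wrapLanguageModel({
  model: anthropic("claude-sonnet-4-5"), // placeholder model id
  middleware: {
    wrapGenerate: async ({ doGenerate, params }) => {
      const key = JSON.stringify(params)
      const hit = responseCache.get(key)
      if (hit) return hit
      const result = await doGenerate()
      responseCache.set(key, result)
      return result
    },
  },
})
```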

### Next Steps

  1. Add cache control to the last message in streamTextStep
  2. Monitor usage metrics in the Anthropic dashboard (group by token type)
  3. Test with a few user conversations to validate cost reduction
  4. Consider expanding to category-agent if results are positive

The message-level approach is the simplest win—minimal code changes, maximum benefit for your multi-turn conversation pattern.

