How to implement prompt caching in forums?


how do u recommend we implement prompt caching in forums?

Claude Haiku 4.5
commented

Perfect! Now I have comprehensive information. Let me provide a detailed recommendation based on the codebase and the latest documentation.

## Recommendations for Implementing Prompt Caching in Forums

Based on your codebase architecture and current usage of the Vercel AI SDK + Anthropic Claude, here's how I'd recommend implementing prompt caching:

### Current State

  • Using `streamText()` from the `ai` package
  • A static system prompt that's identical for every request (the "You're assisting users..." boilerplate)
  • Multi-turn conversations with accumulated context (tool output, prior comments)
  • Tools that execute file reads/searches, producing large context

### Recommended Approach: Message-Level Cache Control

Add Anthropic's `cache_control` directive to the final message in your conversation. Here's the pattern:

In `agent/response-agent.ts`, modify `streamTextStep()`:

```ts
// alongside the file's existing imports
import { convertToModelMessages, streamText, type UIMessage } from "ai"

async function streamTextStep({
  owner,
  repo,
  gitRef,
  model,
  writable,
  sandboxId,
  initialMessages,
  newMessages,
}: {
  owner: string
  repo: string
  gitRef: string
  model: string
  writable: WritableStream
  sandboxId: string
  initialMessages: AgentUIMessage[]
  newMessages: UIMessage[]
}): Promise<{...}> {
  "use step"

  const workspace = await getWorkspace({
    sandboxId,
    gitContext: { owner, repo, ref: gitRef },
  })
  const allMessages = [...initialMessages, ...newMessages] as AgentUIMessage[]

  // Convert to model messages first, then mark the last one (the current
  // user question) as a cache breakpoint. Note the `anthropic` namespace:
  // the AI SDK scopes providerOptions by provider.
  const modelMessages = await convertToModelMessages(allMessages)
  const messagesWithCache = modelMessages.map((msg, idx) =>
    idx === modelMessages.length - 1 && model.includes("claude")
      ? {
          ...msg,
          providerOptions: {
            anthropic: { cacheControl: { type: "ephemeral" } },
          },
        }
      : msg,
  )

  const result = streamText({
    messages: messagesWithCache,
    tools: getTools({ workspace }),
    system: `You're assisting users in a forum about the GitHub repository \`${owner}/${repo}\`...`,
    model,
  })

  // ... rest of function
}
```
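With a single `cache_control` breakpoint on the final message, Anthropic caches the entire prompt prefix up to that point (tool definitions, system prompt, and all prior messages), so each follow-up turn in the thread pays the cheap cache-read rate for everything except the newly added tokens.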

### Key Benefits for Forums

  1. Caches the conversation prefix: each follow-up within a thread reuses the prior messages
  2. System prompt cached: your static system instructions (the repo context, tool docs) are included in the cached prefix
  3. Cost savings: cache reads cost 10% of the base input-token price (cache writes cost 25% more than base)
  4. Lower latency: up to 85% latency reduction on cache hits for long prompts

### Cost Breakeven Math

  • Cache write: 1.25× base input cost
  • Cache read: 0.1× base input cost
  • Breakeven: a single cache read within the 5-minute TTL already recoups the write premium (you pay 0.25× extra once and save 0.9× on every read)

For forums this is excellent because:

  • Users typically ask 2-3 follow-up questions per thread
  • The 5-minute cache TTL aligns with natural conversation pace
  • The more expensive write on the first message is recovered within the thread (rough numbers sketched below)
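To make that concrete, here's an illustrative 3-turn thread with an assumed 10k-token prefix (the token count is made up; the multipliers are Anthropic's published cache pricing):

```ts
// Illustrative only: prefixTokens is an assumption, not a measured value.
const prefixTokens = 10_000 // system prompt + tools + first message

// Without caching, all 3 turns re-send the full prefix at base price
// (ignoring the comparatively small new tokens per turn):
const uncached = 3 * prefixTokens * 1.0 // 30,000 token-equivalents

// With caching: one write (1.25x), then two reads (0.1x each).
// In reality the prefix grows each turn, which favors caching even more.
const cached = prefixTokens * 1.25 + 2 * prefixTokens * 0.1
// = 12,500 + 2,000 = 14,500 token-equivalents, roughly half the cost

console.log({ uncached, cached })
```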

### Where This Works Best

  1. Response agent (main use case): each user follow-up question reuses the prior conversation context
  2. Category agent: lighter use case, but the system prompt is stable across posts, so caching just the system message (sketched below) can still pay off
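For the category agent, a minimal sketch, assuming it also calls `streamText` (the `CATEGORY_SYSTEM_PROMPT` and `postBody` names are placeholders, and the model id is just an example): the AI SDK accepts the cache breakpoint on a system message, so every categorization reuses the cached instructions regardless of thread.

```ts
import { streamText, type ModelMessage } from "ai"

// Placeholders: substitute your real category-agent prompt and input.
declare const CATEGORY_SYSTEM_PROMPT: string
declare const postBody: string

const messages: ModelMessage[] = [
  {
    role: "system",
    content: CATEGORY_SYSTEM_PROMPT,
    // Cache breakpoint on the stable system prompt only.
    providerOptions: {
      anthropic: { cacheControl: { type: "ephemeral" } },
    },
  },
  { role: "user", content: postBody },
]

// Example model id; recall Haiku-class models need a 2048-token
// minimum prefix before anything is cached.
const result = streamText({ model: "anthropic/claude-haiku-4.5", messages })
```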

### Implementation Details

  • Anthropic-only, and prompts below the minimum cacheable length won't cache (1024 tokens for Sonnet/Opus-class models, 2048 for Haiku-class models such as Haiku 4.5)
  • Requires no extra dependencies (already using ai v6+)
  • The AI SDK translates message-level `providerOptions` into Anthropic's block-level `cache_control` automatically
  • Monitor cached-token counts in the response usage to verify it's working (see the sketch below)
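A quick way to verify, assuming the AI SDK v5+ usage shape (`cachedInputTokens` on the usage object) and the Anthropic provider metadata field names; confirm the exact fields against your installed version. This reuses `model` and `messagesWithCache` from the snippet above:

```ts
const result = streamText({
  model,
  messages: messagesWithCache,
  // ...same options as above
})

// The stream result exposes promises for the final totals.
const usage = await result.usage
const metadata = await result.providerMetadata

console.log("cached input tokens:", usage.cachedInputTokens ?? 0)
// Anthropic-specific counter for cache writes:
console.log("cache writes:", metadata?.anthropic?.cacheCreationInputTokens)
```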

### Alternative: Response-Level Caching (if you want more control)

Use language-model middleware for response-level caching across threads:

```ts
// Cache identical system prompt + identical recent comments
// Useful if many threads ask about the same code sections
```

This is more complex but valuable if you want cross-thread caching; a sketch follows.
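A minimal sketch using the AI SDK's `wrapLanguageModel` with a `wrapGenerate` middleware, keyed on the full call params; the in-memory `Map` is a stand-in for a real store, and the model id is just an example:

```ts
import { wrapLanguageModel } from "ai"
import { anthropic } from "@ai-sdk/anthropic"

// In-memory cache; swap for Redis or similar in production. This covers
// generate-style calls; streaming would need wrapStream plus stored chunks.
const cache = new Map<string, unknown>()

const cachedModel = wrapLanguageModel({
  model: anthropic("claude-sonnet-4-5"), // example model id
  middleware: {
    wrapGenerate: async ({ doGenerate, params }) => {
      // Identical system prompt + messages produce identical keys,
      // so repeated questions across threads hit the cache.
      const key = JSON.stringify(params)
      if (cache.has(key)) {
        return cache.get(key) as Awaited<ReturnType<typeof doGenerate>>
      }
      const result = await doGenerate()
      cache.set(key, result)
      return result
    },
  },
})
```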

### Next Steps

  1. Add cache control to the last message in `streamTextStep`
  2. Monitor usage metrics in the Anthropic Console (group by token type)
  3. Test with a few user conversations to validate the cost reduction
  4. Consider expanding to the category agent if results are positive

The message-level approach is the simplest win: minimal code changes, maximum benefit for your multi-turn conversation pattern.

