How does response caching work?
## How Response Caching Works
Response caching in the Vercel AI SDK works differently depending on the provider you're using: the SDK surfaces the prompt/token caching features that each AI model provider offers. Here's how it works:
### 1. Token-Level Caching (Usage Tracking)
The SDK tracks cache usage with detailed token information in the `LanguageModelUsage` type:

```ts
inputTokenDetails: {
  noCacheTokens: number | undefined;    // Non-cached input tokens
  cacheReadTokens: number | undefined;  // Cached tokens read from the cache
  cacheWriteTokens: number | undefined; // Tokens written to the cache
}
```
This appears in the response usage:

```ts
result.usage.inputTokenDetails.cacheReadTokens // How many cached tokens were used
result.usage.cachedInputTokens                 // Deprecated alias for cache read tokens
```
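For example, after a `generateText` call you can inspect these fields directly. A minimal sketch; the model and prompt are illustrative, and each field stays `undefined` when the provider does not report it:

```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

const result = await generateText({
  model: openai('gpt-4o'),
  prompt: 'Summarize the key points of prompt caching.',
});

// Cache-related token counts reported by the provider (if any).
console.log('cache reads :', result.usage.inputTokenDetails?.cacheReadTokens);
console.log('cache writes:', result.usage.inputTokenDetails?.cacheWriteTokens);
console.log('uncached    :', result.usage.inputTokenDetails?.noCacheTokens);
```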
### 2. Provider-Specific Cache Control
Providers support caching in different ways:
#### OpenAI (Prompt Caching)
Configure via `promptCacheKey` and `promptCacheRetention`:

```ts
providerOptions: {
  openai: {
    promptCacheKey: 'my-cache-key',
    promptCacheRetention: '24h', // or 'in_memory'; how long to keep the cache
  },
},
```
The cache is created automatically by OpenAI based on your prompt patterns.
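A minimal sketch of a complete call with these options; the cache key, the `longSystemPrompt` variable, and the model id are illustrative:

```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Stands in for a large, stable instruction block; long shared prefixes
// benefit most from prompt caching.
const longSystemPrompt = '...';

const result = await generateText({
  model: openai('gpt-4o'),
  messages: [
    { role: 'system', content: longSystemPrompt },
    { role: 'user', content: 'Summarize the latest support ticket.' },
  ],
  providerOptions: {
    openai: {
      promptCacheKey: 'support-bot-v1', // groups requests that share this prefix
    },
  },
});

console.log('Cached input tokens:', result.usage.cachedInputTokens);
```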
#### Anthropic (Cache Control)
Explicitly mark content for caching:
```ts
messages: [{
  role: 'user',
  content: [
    {
      type: 'text',
      text: 'System instructions or reference material',
      providerOptions: {
        anthropic: {
          cacheControl: { type: 'ephemeral', ttl: '1h' }, // 5m or 1h TTL
        },
      },
    },
  ],
}],
```
Then check the results:

```ts
result.usage.cachedInputTokens                               // Cache read tokens
result.providerMetadata?.anthropic?.cacheCreationInputTokens // Cache write tokens
```
#### Amazon Bedrock (Cache Points)
Use cache points to mark where caching should occur:
```ts
providerOptions: {
  bedrock: {
    cachePoint: { type: 'default' },
  },
},
```
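In practice the cache point is attached to a message via `providerOptions`. A sketch, assuming the `@ai-sdk/amazon-bedrock` provider; the model id and prompt text are illustrative:

```ts
import { generateText } from 'ai';
import { bedrock } from '@ai-sdk/amazon-bedrock';

const result = await generateText({
  model: bedrock('anthropic.claude-3-5-sonnet-20241022-v2:0'),
  messages: [
    {
      role: 'system',
      content: 'Long, reusable system instructions and reference material...',
      // Everything up to and including this message becomes cacheable.
      providerOptions: { bedrock: { cachePoint: { type: 'default' } } },
    },
    { role: 'user', content: 'Answer the new question using those instructions.' },
  ],
});

console.log('Cache reads:', result.usage.inputTokenDetails?.cacheReadTokens);
```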
#### Google Generative AI (Cached Content)
Reference previously cached content:
```ts
providerOptions: {
  google: {
    cachedContent: 'cachedContents/{cachedContent}', // reference to previously cached content
  },
},
```
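The cached content itself is created ahead of time through Google's caching API; the SDK call then only references it by name. A sketch, assuming the cache already exists; the cache name and model id are illustrative:

```ts
import { generateText } from 'ai';
import { google } from '@ai-sdk/google';

// Name returned by Google's caching API when the content was cached
// (illustrative value).
const cacheName = 'cachedContents/abc-123';

const result = await generateText({
  model: google('gemini-1.5-flash-001'),
  prompt: 'Answer questions against the cached reference material.',
  providerOptions: {
    google: { cachedContent: cacheName },
  },
});
```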
### 3. How Caching Works at the API Level
From the source code, caching follows this pattern (see the sketch after this list):

- First Request (Cache Write): When you send a request with cacheable content, the provider creates a cache entry and reports `cacheWriteTokens` in the usage.
- Subsequent Requests (Cache Read): When later requests share the same prompt prefix, the provider reuses the cached content and reports `cacheReadTokens`, which are usually billed at a lower rate than regular input tokens.
- Cost Savings: Cached tokens typically cost much less than regular tokens (e.g., Anthropic cache reads are roughly 90% cheaper than uncached input tokens).
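A sketch of that two-request pattern using Anthropic's explicit cache control; the model id, reference text, and questions are illustrative:

```ts
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const referenceDoc = '...'; // large, stable text worth caching

const buildMessages = (question: string) => [
  {
    role: 'user' as const,
    content: [
      {
        type: 'text' as const,
        text: referenceDoc,
        providerOptions: { anthropic: { cacheControl: { type: 'ephemeral' } } },
      },
      { type: 'text' as const, text: question },
    ],
  },
];

// First request: the provider writes the cache entry.
const first = await generateText({
  model: anthropic('claude-3-5-sonnet-latest'),
  messages: buildMessages('What does section 1 cover?'),
});
console.log('cache writes:', first.usage.inputTokenDetails?.cacheWriteTokens);

// Second request with the same cached prefix: the provider reads from the cache.
const second = await generateText({
  model: anthropic('claude-3-5-sonnet-latest'),
  messages: buildMessages('What does section 2 cover?'),
});
console.log('cache reads :', second.usage.inputTokenDetails?.cacheReadTokens);
```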
### 4. Real-World Example
From the example code:
```ts
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const errorMessage = '...'; // large static reference text, e.g. a long stack trace

const result = await generateText({
  model: anthropic('claude-3-5-sonnet-latest'),
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'You are a JavaScript expert.' },
      {
        type: 'text',
        text: `Error: ${errorMessage}`, // large static reference
        providerOptions: {
          anthropic: {
            cacheControl: { type: 'ephemeral', ttl: '1h' },
          },
        },
      },
      { type: 'text', text: 'Explain the error.' },
    ],
  }],
});

// After the first request, subsequent similar requests reuse the cache.
console.log('Cache read tokens:', result.usage.cachedInputTokens);
```
### Key Points
- Automatic vs Manual: OpenAI handles caching automatically; Anthropic requires explicit `cacheControl` markup.
- Cost Reduction: Cached tokens cost significantly less (cache reads are typically around 10% of the regular input-token cost with Anthropic).
- TTL (Time-to-Live): Some providers (e.g., Anthropic) let you control how long cached content persists (5 minutes or 1 hour).
- Provider Differences: Each provider implements caching with a different API and different limitations; the SDK normalizes usage reporting while exposing the provider-specific settings through `providerOptions`.