How does response caching work?

Claude Haiku 4.5 commented:

I searched the caching-related source files and examples in the Vercel AI repository; here is a comprehensive answer based on the source code:

## How Response Caching Works

Response caching in the Vercel AI SDK depends on the provider you're using: the SDK exposes each provider's prompt/token caching features and normalizes how cache usage is reported. Here's how it works:

### 1. Token-Level Caching (Usage Tracking)

The SDK tracks cache usage with detailed token information in the LanguageModelUsage type:

```ts
inputTokenDetails: {
  noCacheTokens: number | undefined;      // Non-cached input tokens
  cacheReadTokens: number | undefined;    // Cached tokens read from cache
  cacheWriteTokens: number | undefined;   // Tokens written to cache
}
```

This appears in the response usage:

```ts
result.usage.inputTokenDetails.cacheReadTokens   // How many cached tokens were used
result.usage.cachedInputTokens                   // Deprecated alias for cache read tokens
```
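
As a rough sketch, assuming the inputTokenDetails shape described above and an OpenAI model configured via @ai-sdk/openai, you can inspect these fields directly on the result:

```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

const result = await generateText({
  model: openai('gpt-4o'),
  prompt: 'Summarize how prompt caching reduces cost.',
});

// Cache-related fields; undefined when the provider reports no cache activity.
console.log('non-cached input tokens:', result.usage.inputTokenDetails?.noCacheTokens);
console.log('cache read tokens:', result.usage.inputTokenDetails?.cacheReadTokens);
console.log('cache write tokens:', result.usage.inputTokenDetails?.cacheWriteTokens);
```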

### 2. Provider-Specific Cache Control

Different providers support caching differently:

#### OpenAI (Prompt Caching)

Configure via promptCacheKey and promptCacheRetention:

```ts
providerOptions: {
  openai: {
    promptCacheKey: 'my-cache-key',
    promptCacheRetention: '24h',   // 'in_memory' or '24h': how long to keep the cache
  },
},
```

OpenAI creates and reuses the cache automatically based on matching prompt prefixes; the promptCacheKey helps route related requests to the same cache.
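
A minimal usage sketch, assuming an OpenAI model via @ai-sdk/openai (the cache key and documentation string are placeholders):

```ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Hypothetical large, static prefix; long repeated prefixes benefit most from caching.
const referenceDocs = '...several thousand tokens of product documentation...';

const result = await generateText({
  model: openai('gpt-4o'),
  prompt: `${referenceDocs}\n\nQuestion: How do I rotate my API key?`,
  providerOptions: {
    openai: {
      promptCacheKey: 'support-bot-v1',  // stable key so related requests share a cache
      promptCacheRetention: '24h',
    },
  },
});

// On repeat requests, the cached prefix is reported as cache read tokens.
console.log('cache read tokens:', result.usage.cachedInputTokens);
```

Note that OpenAI only caches prompts beyond a minimum length (roughly 1,024 tokens), so short prompts will not show cache reads.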

#### Anthropic (Cache Control)

Explicitly mark content for caching:

```ts
messages: [{
  role: 'user',
  content: [
    {
      type: 'text',
      text: 'System instructions or reference material',
      providerOptions: {
        anthropic: {
          cacheControl: { type: 'ephemeral', ttl: '1h' }  // 5m or 1h TTL
        }
      }
    }
  ]
}]
```

Then check results:

```ts
result.usage.cachedInputTokens  // Cache read tokens
result.providerMetadata?.anthropic?.cacheCreationInputTokens  // Cache write tokens
```

#### Amazon Bedrock (Cache Points)

Use cache points to mark where caching should occur:

```ts
providerOptions: {
  bedrock: {
    cachePoint: { type: 'default' }
  }
}
```
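
A sketch of where a cache point typically sits, assuming the @ai-sdk/amazon-bedrock provider and a Claude model on Bedrock (the model ID, message contents, and placement on the message are assumptions):

```ts
import { generateText } from 'ai';
import { bedrock } from '@ai-sdk/amazon-bedrock';

const result = await generateText({
  model: bedrock('anthropic.claude-3-5-sonnet-20241022-v2:0'),
  messages: [
    {
      role: 'user',
      content: 'Several thousand tokens of reference material go here...',
      // Everything up to this cache point becomes eligible for reuse.
      providerOptions: { bedrock: { cachePoint: { type: 'default' } } },
    },
    { role: 'user', content: 'Summarize the reference material above.' },
  ],
});

// Cache activity shows up in usage and/or Bedrock provider metadata.
console.log(result.usage);
console.log(result.providerMetadata?.bedrock);
```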

#### Google Generative AI (Cached Content)

Reference previously cached content:

```ts
providerOptions: {
  google: {
    cachedContent: 'cachedContents/{cachedContent}'  // Reference to cached content
  }
}
```
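
A sketch of passing a previously created cache to a request, assuming the @ai-sdk/google provider (the cache must already exist, e.g. created through Google's caching API; the cache name and model are placeholders):

```ts
import { generateText } from 'ai';
import { google } from '@ai-sdk/google';

const result = await generateText({
  model: google('gemini-1.5-pro'),
  providerOptions: {
    google: {
      // Name returned when the cached content was created (placeholder value).
      cachedContent: 'cachedContents/abc123',
    },
  },
  prompt: 'Answer using the cached reference material.',
});

console.log(result.usage);
```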

### 3. How Caching Works at the API Level

From the source code, caching follows this pattern:

  1. First Request (Cache Write): When you make a request with cacheable content, the provider creates a cache entry and reports cacheWriteTokens in usage
  2. Subsequent Requests (Cache Read): On subsequent requests that share the same prompt prefix, the provider reuses the cached content and reports cacheReadTokens (see the sketch after this list)
  3. Cost Savings: Cached tokens cost less than regular input tokens; for example, Anthropic bills cache reads at roughly 10% of the base input price, and OpenAI applies a model-dependent discount to cached input tokens
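
A sketch of that write-then-read pattern using Anthropic's explicit cache control (the model alias and reference text are placeholders; the cached prefix must exceed the provider's minimum cacheable size to register):

```ts
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const messages = [
  {
    role: 'user' as const,
    content: [
      {
        type: 'text' as const,
        text: '...several thousand tokens of reference material...',
        providerOptions: {
          anthropic: { cacheControl: { type: 'ephemeral' } },
        },
      },
      { type: 'text' as const, text: 'Summarize the reference material.' },
    ],
  },
];

// First request: the provider writes the cache entry.
const first = await generateText({
  model: anthropic('claude-3-5-sonnet-latest'),
  messages,
});
console.log('cache write tokens:', first.providerMetadata?.anthropic?.cacheCreationInputTokens);

// Second request with the identical prefix: the provider reads from the cache.
const second = await generateText({
  model: anthropic('claude-3-5-sonnet-latest'),
  messages,
});
console.log('cache read tokens:', second.usage.cachedInputTokens);
```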

### 4. Real-World Example

From the example code, lightly adapted to be self-contained (imports and a placeholder errorMessage added):

```ts
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

// Placeholder for the large static reference used in the example.
const errorMessage = '...a long error report or stack trace...';

const result = await generateText({
  model: anthropic('claude-3-5-sonnet-latest'),
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'You are a JavaScript expert.' },
      {
        type: 'text',
        text: `Error: ${errorMessage}`,  // Large static reference
        providerOptions: {
          anthropic: {
            cacheControl: { type: 'ephemeral', ttl: '1h' }
          }
        }
      },
      { type: 'text', text: 'Explain the error.' }
    ]
  }]
});

// After the first request, subsequent requests with the same prefix reuse the cache.
console.log('Cache read tokens:', result.usage.cachedInputTokens);
```

### Key Points

  • Automatic vs Manual: OpenAI handles caching automatically; Anthropic requires explicit cacheControl markup
  • Cost Reduction: Cached tokens cost significantly less than regular input tokens (Anthropic, for example, bills cache reads at roughly 10% of the base input price)
  • TTL (Time-to-Live): Some providers (Anthropic) allow you to control how long cached content persists (5 minutes or 1 hour)
  • Provider Differences: Each provider implements caching with different APIs and limitations - the SDK normalizes these into a unified interface
