How does context compaction work?
Now I have the full picture. Let me provide a comprehensive answer.
## How Context Compaction Works in OpenCode
Context compaction is a mechanism to handle LLM context limits by summarizing conversation history when tokens exceed limits. Here's how it works:
### 1. Triggering Compaction

Compaction can be triggered in three ways:

- Automatically when context overflows, checked via `SessionCompaction.isOverflow()`:

  ```ts
  export async function isOverflow(input: { tokens; model }) {
    const config = await Config.get()
    if (config.compaction?.auto === false) return false
    const context = input.model.limit.context
    // Reserve room for the next response (output limit, capped at 32k)
    const output = Math.min(input.model.limit.output, 32_000) || 32_000
    // Check if total tokens exceed the usable input limit
    const count = input.tokens.input + input.tokens.cache.read + input.tokens.output
    const usable = input.model.limit.input || context - output
    return count > usable
  }
  ```

- Manually via the `/compact` command (alias `/summarize`)
- Via keybind (`<leader>c` by default)
### 2. The Compaction Process (compaction.ts:93-191)

When compaction runs:

- Creates a special assistant message with `mode: "compaction"` and `summary: true`
- Uses a dedicated "compaction" agent with its own prompt (compaction.txt)
- Sends the conversation history to the LLM with a prompt asking for a summary:
  "Provide a detailed prompt for continuing our conversation above. Focus on information that would be helpful for continuing the conversation..."
- Plugins can customize via the `experimental.session.compacting` hook to inject context or replace the prompt entirely
### 3. Filtering Compacted Messages (message-v2.ts:598-612)

After compaction, the `filterCompacted()` function filters messages so that only messages after a completed compaction summary are included:
```ts
export async function filterCompacted(stream) {
  const result = []
  const completed = new Set()
  for await (const msg of stream) {
    result.push(msg)
    // Track summary responses that finished (the stream yields newest first)
    if (msg.info.role === "assistant" && msg.info.summary && msg.info.finish)
      completed.add(msg.info.parentID)
    // Stop once we hit the user message whose compaction summary completed
    if (msg.parts.some((part) => part.type === "compaction") && completed.has(msg.info.id))
      break
  }
  result.reverse()
  return result
}
```
### 4. Pruning Old Tool Outputs (compaction.ts:51-86)

Additionally, there's a pruning mechanism that clears old tool outputs:

- Goes backward through messages
- Protects the most recent ~40,000 tokens of tool outputs
- Marks older tool outputs as `compacted`, replacing their content with `"[Old tool result content cleared]"` in future prompts
- Certain tools (like `skill`) are protected from pruning
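The pruning pass above can be sketched as follows. This is a rough standalone model, not OpenCode's actual code: the types, the 4-characters-per-token estimate, and the protected-tool set are assumptions for illustration.

```ts
// Rough model of the pruning pass: walk tool parts newest-to-oldest,
// protect a budget of recent output tokens, clear everything older.
interface ToolPart {
  type: string        // "tool" for tool parts; other part types exist in reality
  tool: string        // tool name, e.g. "grep", "skill"
  output: string
  compacted?: boolean
}

const PRUNE_PROTECT_TOKENS = 40_000
const PROTECTED_TOOLS = new Set(["skill"]) // assumed protected-tool list

// Crude token estimate: ~4 characters per token
const estimateTokens = (text: string) => Math.ceil(text.length / 4)

function prune(parts: ToolPart[]): ToolPart[] {
  let budget = PRUNE_PROTECT_TOKENS
  // Walk backward so the most recent outputs consume the budget first
  for (let i = parts.length - 1; i >= 0; i--) {
    const part = parts[i]
    if (part.type !== "tool" || PROTECTED_TOOLS.has(part.tool)) continue
    budget -= estimateTokens(part.output)
    if (budget < 0) {
      // Older outputs fall outside the protected window: mark them so
      // future prompts see a placeholder instead of the full content
      part.compacted = true
      part.output = "[Old tool result content cleared]"
    }
  }
  return parts
}
```

The backward walk is the key design choice: recent tool results are usually the ones the model still needs, so the budget is spent on them first.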
### 5. Configuration Options

In opencode.json:

```jsonc
{
  "compaction": {
    "auto": true,  // Enable/disable automatic compaction (default: true)
    "prune": true  // Enable/disable pruning of old tool outputs (default: true)
  }
}
```

Or via environment variable: `OPENCODE_DISABLE_AUTOCOMPACT=true` disables automatic compaction.
### Summary Flow

```text
User sends message
        ↓
LLM processes and returns response
        ↓
Check: tokens > context limit?
        ↓ yes
Create compaction message part (type: "compaction")
        ↓
Next loop iteration detects pending compaction
        ↓
SessionCompaction.process() runs:
  - Sends full history + summary prompt to LLM
  - LLM generates summary as assistant message (marked summary: true)
        ↓
filterCompacted() now truncates history at summary
        ↓
Prune old tool outputs to save more tokens
        ↓
Continue with fresh context (summary + recent messages)
```
what model is used for the summary generation?
The model selection logic is in lines 100-103:

```ts
const agent = await Agent.get("compaction")
const model = agent.model
  ? await Provider.getModel(agent.model.providerID, agent.model.modelID)
  : await Provider.getModel(userMessage.model.providerID, userMessage.model.modelID)
```
The model used for summary generation is:

- If the `compaction` agent has a model configured → uses that specific model
- Otherwise → uses the same model as the user's last message (i.e., whatever model you're currently chatting with)

By default, the compaction agent doesn't have a model set, so it inherits the current conversation model. You can override this in your config:
```json
{
  "agents": {
    "compaction": {
      "model": {
        "providerID": "anthropic",
        "modelID": "claude-3-5-haiku-latest"
      }
    }
  }
}
```
This lets you use a cheaper/faster model for summaries if desired.
how is it used with the AI sdk? generateText? or streamText?
It uses `streamText` from the AI SDK.

The flow is:

- `SessionCompaction.process()` creates a `SessionProcessor`
- Calls `processor.process()`, which calls `LLM.stream()`
- `LLM.stream()` uses `streamText()` from the Vercel AI SDK (line 165)

So even for compaction/summary generation, it streams the response rather than using `generateText()`. This allows the summary to be displayed progressively in the UI as it's being generated.
nice. any reason why they don't let the stream fail and THEN compact, vs preemptively doing it?
You're onto something. The current approach is preemptive - it checks if tokens exceed the usable limit after each successful response. Here are likely reasons why they don't wait for failure:
### Possible reasons for preemptive compaction:

- Better UX - A context overflow API error mid-stream is jarring. The user sees a failure, then has to wait for compaction, then retry. Preemptive compaction is seamless.
- Wasted tokens/cost - If you let it fail, you've already sent a large prompt (input tokens billed) and possibly received partial output before the error, and you still need to compact and retry.
- Inconsistent error messages - Different providers return different errors for context overflow. Some might not even give a clear "context exceeded" error, making it hard to reliably detect and recover.
- The check is cheap - They already get token counts from the usage stats of each response, so checking `count > usable` costs nothing.
### But you raise a valid point:
The preemptive approach has a gap - the check uses the last response's token count, but the next request might have:
- Additional user input
- New tool results
- System prompt changes
So in theory, you could pass the preemptive check but still overflow on the next request. A reactive approach (catch error → compact → retry) would be more robust as a fallback.
It seems like they're betting on the token count being a good enough predictor, and avoiding the complexity of error-based recovery. But adding a fallback for context overflow errors would make it more resilient.
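A reactive fallback like the one described could look roughly like this. This is a sketch of the idea, not anything in the OpenCode codebase; the `compact` callback and the error-detection predicate are hypothetical.

```ts
// Hypothetical fallback: try the request, and if the provider rejects it
// with a context-overflow error, compact once and retry.
async function requestWithCompactionFallback<T>(
  send: () => Promise<T>,
  compact: () => Promise<void>,
  isContextOverflow: (err: unknown) => boolean,
): Promise<T> {
  try {
    return await send()
  } catch (err) {
    if (!isContextOverflow(err)) throw err
    await compact() // summarize history, then retry exactly once
    return await send()
  }
}
```

The detection predicate is the weak point, as noted above: providers do not agree on how they report overflow, so `isContextOverflow` would need per-provider heuristics.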
are u sure they have a gap here? maybe they're handling it in another way?
You're right - I was missing the full picture! There's no gap. Here's how it actually works:
### The Two-Stage Check

1. Pre-request check (prompt.ts:495-507): Before sending a new user message, check if `lastFinished` already overflowed → trigger compaction before even trying
2. Post-response check (processor.ts:274): After a successful response completes, check if this response pushed us over → return `"compact"` → triggers compaction before the next iteration
### Why this is gap-free:

The key is that the check happens after each successful response in the processor. So even if:

1. User sends a big message
2. Response succeeds but now tokens are over the limit
3. `processor.process()` returns `"compact"`
4. The loop in prompt.ts creates a compaction task
5. Next iteration processes the compaction before any new LLM call
The token count `input + cache.read + output` from the response tells you exactly how many tokens the next request would start with (since the output becomes part of the history). So checking after each response, before continuing the loop, ensures you never send a request that would overflow.
Good catch - they are handling it properly with the post-response check in the processor!
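The decision point can be reduced to a one-liner. A toy model of the post-response check, with names standing in for the real processor code:

```ts
// Toy model of the post-response decision: after each successful
// response, either queue compaction or carry on with the loop.
type StepResult = "compact" | "continue"

function afterResponse(tokensInHistory: number, usable: number): StepResult {
  // tokensInHistory = input + cache.read + output from the last response
  return tokensInHistory > usable ? "compact" : "continue"
}
```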
do they compact with some threshold? meaning idk, at 90% full?
No threshold/buffer. The check is:

```ts
const count = input.tokens.input + input.tokens.cache.read + input.tokens.output
const output = Math.min(model.limit.output, 32_000) || 32_000
const usable = model.limit.input || context - output
return count > usable
```
So it compacts when `count > usable`, where:

- `usable` = the model's input limit, or `context - output` (reserving space for the next response)
- No 90% threshold, no buffer
The only "buffer" is the reserved output tokens (32K or the model's output limit). This ensures there's room for the next response, but doesn't leave extra headroom.
This means they're pushing it pretty close to the limit. If a user sends a large message or a tool returns a huge output right after a response that was just under the limit, it could theoretically fail. Though in practice, the reserved 32K for output provides some implicit buffer since most responses don't max that out.
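Plugging illustrative numbers into that formula (the limits here are made up, not any specific model's):

```ts
// Worked example of the usable-limit computation quoted above.
interface Limits {
  context: number
  input?: number
  output: number
}

function usableTokens(limit: Limits): number {
  const output = Math.min(limit.output, 32_000) || 32_000
  return limit.input || limit.context - output
}

// A model with a 200k context and 64k max output:
//   reserved output = min(64_000, 32_000) = 32_000
//   usable          = 200_000 - 32_000    = 168_000
// So compaction triggers only once history itself passes 168k tokens.
```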
what gets presented to the compaction agent? the full conversation in its context? or a path for it to read the convo in the filesystem?
The full conversation is passed directly in context. Look at lines 151-163:

```ts
messages: [
  ...MessageV2.toModelMessage(input.messages), // <-- full conversation history
  {
    role: "user",
    content: [
      {
        type: "text",
        text: promptText, // "Provide a detailed prompt for continuing..."
      },
    ],
  },
],
```
So:

- `input.messages` is the full conversation (already filtered by `filterCompacted`, so only messages since the last compaction)
- These get converted to model format via `MessageV2.toModelMessage()`
- A final user message is appended asking for the summary

Also note `tools: {}` - the compaction agent has no tools; it just reads the conversation in context and generates a summary. No filesystem access is needed since the conversation is passed directly.
This does raise an interesting point though - if the conversation is already at context limit, passing the full conversation to the compaction agent could itself overflow. But presumably the compaction model has the same or larger context, so it should fit.
cool. do they include recent messages after compaction? or just the summary and then straight to work?
Now I understand. The stream function yields messages in reverse order (newest first, see line 567: `for (let i = list.length - 1; i >= 0; i--)`).
So `filterCompacted` works like this:

1. Iterates messages from newest to oldest
2. Keeps pushing messages into `result`
3. Tracks completed summaries via `completed.add(msg.info.parentID)`
4. Stops when it hits a user message that has a completed summary response AND has a `compaction` part
5. Reverses at the end to get chronological order
The answer: They include messages AFTER the compaction, not just the summary.
The flow is:

```text
[old messages] → [user msg with compaction part] → [summary response] → [newer messages if any]
                 ↑ stops here and includes everything after
```
So if after compaction the user continued chatting, the context would include:

- The user message that triggered compaction
- The summary (assistant response marked `summary: true`)
- Any subsequent messages
This makes sense - the summary replaces everything before it, but anything after the summary is still included verbatim.
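To make that concrete, here is a tiny standalone model of the filter run over a fabricated five-message history (the message shape is heavily simplified compared to the real `MessageV2` types):

```ts
// Simplified model of filterCompacted: iterate newest-first, stop at the
// user message whose compaction summary completed, keep everything after.
interface Msg {
  id: string
  parentID?: string
  role: "user" | "assistant"
  summary?: boolean
  finish?: boolean
  compaction?: boolean // user message carrying a compaction part
}

function filterCompactedModel(newestFirst: Msg[]): Msg[] {
  const result: Msg[] = []
  const completed = new Set<string>()
  for (const msg of newestFirst) {
    result.push(msg)
    if (msg.role === "assistant" && msg.summary && msg.finish) completed.add(msg.parentID!)
    if (msg.compaction && completed.has(msg.id)) break
  }
  return result.reverse() // back to chronological order
}

// Newest-first history: m1 is old work, m2 triggered compaction,
// m3 is the summary, m4/m5 came after.
const history: Msg[] = [
  { id: "m5", role: "assistant" },
  { id: "m4", role: "user" },
  { id: "m3", role: "assistant", parentID: "m2", summary: true, finish: true },
  { id: "m2", role: "user", compaction: true },
  { id: "m1", role: "assistant" },
]
// filterCompactedModel(history) keeps m2, m3, m4, m5 and drops m1.
```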
doesn't the compaction agent get confused if they pass the full conversation as if they have been part of it? then ask it to summarize?
You're right - it's a bit odd! The compaction agent receives:

- System prompt: "You are a helpful AI assistant tasked with summarizing conversations"
- Messages: The full conversation history (user/assistant turns from the original agent)
- Final user message: "Provide a detailed prompt for continuing our conversation above..."
So the compaction agent sees a conversation where "it" (as an assistant) was doing coding work, making tool calls, etc. - but that was actually a different agent (like coder). Then suddenly it's asked to summarize "our conversation."
It does seem like it could be confusing, but in practice it probably works because:
- LLMs are pretty good at understanding the meta-task ("summarize this")
- The system prompt primes it for summarization, not continuation
- The final user message is explicit about what's needed
A cleaner approach might be to present the conversation as a document to summarize rather than as if the compaction agent was part of it - something like:
```ts
messages: [
  {
    role: "user",
    content: `Here is a conversation to summarize:\n\n${formattedConversation}\n\nPlease summarize...`
  }
]
```
But that would lose the structured message format and potentially use more tokens. The current approach is a bit of a hack but probably works well enough in practice.
sounds fine. so they ignore the "system" role message? doesn't that kill prompt caching?
So even with `system: []`, it still builds a system prompt:

```ts
const system = SystemPrompt.header(input.model.providerID)
system.push(
  [
    ...(input.agent.prompt ? [input.agent.prompt] : ...), // compaction agent's prompt
    ...input.system, // empty []
    ...(input.user.system ? [input.user.system] : []),
  ]
)
```

So the compaction agent's prompt from compaction.txt does get included via `input.agent.prompt`.
But you're right about prompt caching - this is a one-shot summarization call. The system prompt + conversation is completely different from normal chat flow:
- Different system prompt (compaction agent vs coder agent)
- Conversation ends with "summarize this" instead of continuing naturally
- Only happens once per compaction
So there's no cache reuse here - the compaction call can't benefit from any previous cache, and its cache won't benefit future calls. It's essentially a cold call every time.
That's probably fine since compaction is relatively infrequent, but it does mean the full conversation gets re-tokenized at full input price (no cache discount).
i meant the original system prompt, not the compaction agent's system prompt
You're absolutely right. `toModelMessage` only extracts user/assistant messages - no system prompt.
So the compaction agent sees the conversation without the original system prompt that guided the coder agent's behavior. This means:
- The coder agent was operating under detailed instructions (tool usage, coding style, etc.)
- The compaction agent sees the conversation but has no idea what those instructions were
- Context about why the assistant behaved certain ways is lost
And yes - this kills any potential for prompt caching on the conversation history, because:

- Normal flow: `[system prompt A] + [messages]` → cached
- Compaction flow: `[system prompt B] + [same messages]` → completely different prefix, no cache hit
If they included the original system prompt (even just prepended as a user message like "The assistant was operating under these instructions: ..."), it could potentially help the compaction agent understand the context better AND wouldn't break caching for the message portion.
Though to be fair, compaction is a one-off operation so cache savings might be minimal anyway.
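The suggested tweak could look something like this. A sketch only, not OpenCode code; the message shape and the wording of the prefix are assumptions:

```ts
// Sketch: surface the original agent's instructions to the compaction
// model as a leading user message, so the summarizer knows what rules
// the assistant in the transcript was operating under.
interface ModelMessage {
  role: "user" | "assistant"
  content: string
}

function withOriginalInstructions(
  originalSystemPrompt: string,
  conversation: ModelMessage[],
): ModelMessage[] {
  return [
    {
      role: "user",
      content:
        "The assistant in the conversation below was operating under these instructions:\n\n" +
        originalSystemPrompt,
    },
    ...conversation,
  ]
}
```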
gotcha, makes sense. do the compaction messages include all the tool calls and results? doesn't that make it super expensive?
So it does include all tool calls and results. The only optimization is that if a tool part was previously marked `compacted`, its output becomes `"[Old tool result content cleared]"`.
Yes, this is expensive. The compaction call includes:
- All user messages
- All assistant text responses
- All tool call inputs
- All tool call outputs (full content, unless previously pruned)
And since we established there's no prompt caching benefit (different system prompt), you're paying full input token price for what could be a massive conversation right at the context limit.
This is a known trade-off: you pay once for a big compaction call to avoid paying repeatedly for an ever-growing context in future calls. But if the conversation is heavy on tool outputs (like file reads, grep results, etc.), that compaction call could be quite costly.
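Back-of-the-envelope arithmetic makes the trade-off tangible (the price here is illustrative, not any provider's actual rate):

```ts
// Rough cost of one uncached compaction call.
function compactionCostUSD(inputTokens: number, pricePerMillionTokensUSD: number): number {
  return (inputTokens / 1_000_000) * pricePerMillionTokensUSD
}

// A conversation sitting near a 168k usable limit, at an assumed
// $3 per million input tokens with no cache discount:
//   168_000 / 1_000_000 * 3 ≈ $0.50 per compaction
```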