How does context compaction work?

This post might have stale content, as dev is 1702 commits ahead.

how does context compaction work?

Claude Opus 4.5
commented

Now I have the full picture. Let me provide a comprehensive answer.


## How Context Compaction Works in OpenCode

Context compaction is a mechanism to handle LLM context limits by summarizing conversation history when tokens exceed limits. Here's how it works:

### 1. Triggering Compaction

Compaction can be triggered in three ways:

  1. Automatically when context overflows - checked via SessionCompaction.isOverflow():

    export async function isOverflow(input: { tokens; model }) {
      const config = await Config.get()
      if (config.compaction?.auto === false) return false
      const context = input.model.limit.context
      // Total tokens the next request would start with
      const count = input.tokens.input + input.tokens.cache.read + input.tokens.output
      // Reserve room for the next response (at most 32K)
      const output = Math.min(input.model.limit.output, 32_000) || 32_000
      const usable = input.model.limit.input || context - output
      return count > usable
    }
    
  2. Manually via the /compact command (alias /summarize)

  3. Via keybind (<leader>c by default)

### 2. The Compaction Process (compaction.ts:93-191)

When compaction runs:

  1. Creates a special assistant message with mode: "compaction" and summary: true

  2. Uses a dedicated "compaction" agent with its own prompt (compaction.txt)

  3. Sends the conversation history to the LLM with a prompt asking for a summary:

    "Provide a detailed prompt for continuing our conversation above. Focus on information that would be helpful for continuing the conversation..."

  4. Plugins can customize via the experimental.session.compacting hook to inject context or replace the prompt entirely
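The four steps above can be condensed into a small request builder. This is an illustrative sketch, not opencode's real internals: `Msg` and `buildCompactionRequest` are made-up names; only the summary prompt text comes from the source.

```typescript
// Illustrative sketch: replay the filtered history, then append the summary
// request as a final user turn. Msg/buildCompactionRequest are made-up names.
type Msg = { role: "user" | "assistant"; content: string }

const SUMMARY_PROMPT =
  "Provide a detailed prompt for continuing our conversation above. " +
  "Focus on information that would be helpful for continuing the conversation..."

function buildCompactionRequest(history: Msg[], hookPrompt?: string): Msg[] {
  // A plugin hook may replace the prompt entirely
  const prompt = hookPrompt ?? SUMMARY_PROMPT
  return [...history, { role: "user", content: prompt }]
}
```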

### 3. Filtering Compacted Messages (message-v2.ts:598-612)

After compaction, the filterCompacted() function filters messages so that only messages after a completed compaction summary are included:

export async function filterCompacted(stream) {
  const result = []
  const completed = new Set()
  // stream yields messages newest-first
  for await (const msg of stream) {
    result.push(msg)
    // Track summary responses that finished successfully
    if (msg.info.role === "assistant" && msg.info.summary && msg.info.finish)
      completed.add(msg.info.parentID)
    // Stop at the user message whose compaction has a completed summary
    if (msg.parts.some((part) => part.type === "compaction") && completed.has(msg.info.id))
      break
  }
  // Restore chronological order
  result.reverse()
  return result
}

### 4. Pruning Old Tool Outputs (compaction.ts:51-86)

Additionally, there's a pruning mechanism that clears old tool outputs:

  • Goes backward through messages
  • Protects the most recent ~40,000 tokens of tool outputs
  • Marks older tool outputs as compacted, replacing their content with "[Old tool result content cleared]" in future prompts
  • Certain tools (like skill) are protected from pruning
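A minimal sketch of that backward walk, assuming a flat list of tool parts with precomputed token counts. The budget, placeholder text, and protected `skill` tool follow the description above; `ToolPart` and `prune` are illustrative names.

```typescript
// Walk tool outputs newest-first; keep roughly the most recent 40K tokens of
// output intact and clear everything older. ToolPart/prune are made-up names.
type ToolPart = { tool: string; output: string; tokens: number; compacted?: boolean }

const PROTECT_BUDGET = 40_000
const PROTECTED_TOOLS = new Set(["skill"]) // never pruned

function prune(parts: ToolPart[]): void {
  let recent = 0
  for (let i = parts.length - 1; i >= 0; i--) {
    const part = parts[i]
    if (PROTECTED_TOOLS.has(part.tool)) continue
    recent += part.tokens
    // Once the recent budget is exhausted, older outputs get cleared
    if (recent > PROTECT_BUDGET) {
      part.compacted = true
      part.output = "[Old tool result content cleared]"
    }
  }
}
```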

### 5. Configuration Options

In opencode.json:

{
  "compaction": {
    "auto": true,   // Enable/disable automatic compaction (default: true)
    "prune": true   // Enable/disable pruning of old tool outputs (default: true)
  }
}

Or via environment variable:

  • OPENCODE_DISABLE_AUTOCOMPACT=true - disables automatic compaction

### Summary Flow

User sends message
    ↓
LLM processes and returns response
    ↓
Check: tokens > context limit?
    ↓ yes
Create compaction message part (type: "compaction")
    ↓
Next loop iteration detects pending compaction
    ↓
SessionCompaction.process() runs:
  - Sends full history + summary prompt to LLM
  - LLM generates summary as assistant message (marked summary: true)
    ↓
filterCompacted() now truncates history at summary
    ↓
Prune old tool outputs to save more tokens
    ↓
Continue with fresh context (summary + recent messages)
Julian Benegas
commented

what model is used for the summary generation?

Claude Opus 4.5
commented

The model selection logic is in lines 100-103:

const agent = await Agent.get("compaction")
const model = agent.model
  ? await Provider.getModel(agent.model.providerID, agent.model.modelID)
  : await Provider.getModel(userMessage.model.providerID, userMessage.model.modelID)

The model used for summary generation is:

  1. If the compaction agent has a model configured → uses that specific model
  2. Otherwise → uses the same model as the user's last message (i.e., whatever model you're currently chatting with)

By default, the compaction agent doesn't have a model set, so it inherits the current conversation model. You can override this in your config:

{
  "agents": {
    "compaction": {
      "model": {
        "providerID": "anthropic",
        "modelID": "claude-3-5-haiku-latest"
      }
    }
  }
}

This lets you use a cheaper/faster model for summaries if desired.

Julian Benegas
commented

how is it used with the AI sdk? generateText? or streamText?

Claude Opus 4.5
commented

It uses streamText from the AI SDK.

The flow is:

  1. SessionCompaction.process() creates a SessionProcessor
  2. Calls processor.process() which calls LLM.stream()
  3. LLM.stream() uses streamText() from the Vercel AI SDK (line 165)

So even for compaction/summary generation, it streams the response rather than using generateText(). This allows the summary to be displayed progressively in the UI as it's being generated.
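The benefit is easy to see with a tiny stand-in for the stream. This is not the AI SDK itself, just an async-iterator sketch of progressive rendering; `fakeStream` and `collectSummary` are made-up names.

```typescript
// A fake chunk stream standing in for streamText().textStream, plus a
// consumer that renders progressively while accumulating the final summary.
async function* fakeStream(chunks: string[]): AsyncGenerator<string> {
  for (const chunk of chunks) yield chunk
}

async function collectSummary(
  stream: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
): Promise<string> {
  let text = ""
  for await (const chunk of stream) {
    onChunk(chunk) // the UI can display each chunk as it arrives
    text += chunk
  }
  return text // the completed summary, stored once the stream finishes
}
```

With `generateText()` the UI would sit idle until the whole summary finished; streaming lets it show partial output immediately.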

Julian Benegas
commented

nice. any reason why they don't let the stream fail and THEN compact, vs preemptively doing it?

Claude Opus 4.5
commented

You're onto something. The current approach is preemptive - it checks if tokens exceed the usable limit after each successful response. Here are likely reasons why they don't wait for failure:

### Possible reasons for preemptive compaction:

  1. Better UX - A context overflow API error mid-stream is jarring. The user sees a failure, then has to wait for compaction, then retry. Preemptive compaction is seamless.

  2. Wasted tokens/cost - If you let it fail, you've already:

    • Sent a large prompt (input tokens billed)
    • Possibly received partial output before the error
    • Then you still need to compact and retry
  3. Inconsistent error messages - Different providers return different errors for context overflow. Some might not even give a clear "context exceeded" error, making it hard to reliably detect and recover.

  4. The check is cheap - They already get token counts from the usage stats of each response, so checking count > usable costs nothing.

### But you raise a valid point:

The preemptive approach has a gap - the check uses the last response's token count, but the next request might have:

  • Additional user input
  • New tool results
  • System prompt changes

So in theory, you could pass the preemptive check but still overflow on the next request. A reactive approach (catch error → compact → retry) would be more robust as a fallback.

It seems like they're betting on the token count being a good enough predictor, and avoiding the complexity of error-based recovery. But adding a fallback for context overflow errors would make it more resilient.

Julian Benegas
commented

are u sure they have a gap here? maybe they're handling it in another way?

Claude Opus 4.5
commented

You're right - I was missing the full picture! There's no gap. Here's how it actually works:

### The Two-Stage Check

  1. Pre-request check (prompt.ts:495-507): Before sending a new user message, check if lastFinished already overflowed → trigger compaction before even trying

  2. Post-response check (processor.ts:274): After a successful response completes, check if this response pushed us over → return "compact" → triggers compaction before the next iteration

### Why this is gap-free:

The key is that the check happens after each successful response in the processor. So even if:

  • User sends a big message
  • Response succeeds but now tokens are over the limit
  • processor.process() returns "compact"
  • The loop in prompt.ts creates a compaction task
  • Next iteration processes the compaction before any new LLM call

The token count input + cache.read + output from the response tells you exactly how many tokens the next request would start with (since output becomes part of history). So checking after each response and before continuing the loop ensures you never send a request that would overflow.

Good catch - they are handling it properly with the post-response check in the processor!
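The two checks can be condensed into one predicate consulted at both points in the loop. The names below are stand-ins, not the real processor API:

```typescript
// One overflow predicate, used both before sending a new request and after
// each successful response. Tokens/nextAction are illustrative names.
type Tokens = { input: number; cacheRead: number; output: number }

function isOverflow(t: Tokens, usable: number): boolean {
  // input + cache.read + output is exactly what the next request starts with
  return t.input + t.cacheRead + t.output > usable
}

function nextAction(lastFinished: Tokens, usable: number): "compact" | "continue" {
  return isOverflow(lastFinished, usable) ? "compact" : "continue"
}
```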

Julian Benegas
commented

do they compact with some threshold? meaning idk, at 90% full?

Claude Opus 4.5
commented

No threshold/buffer. The check is:

const count = input.tokens.input + input.tokens.cache.read + input.tokens.output
const output = Math.min(model.limit.output, 32_000) || 32_000
const usable = model.limit.input || context - output
return count > usable

So it compacts when count > usable, where:

  • usable = the model's input limit, OR context - output (reserving space for the next response)
  • No 90% threshold, no buffer

The only "buffer" is the reserved output tokens (32K or the model's output limit). This ensures there's room for the next response, but doesn't leave extra headroom.

This means they're pushing it pretty close to the limit. If a user sends a large message or a tool returns a huge output right after a response that was just under the limit, it could theoretically fail. Though in practice, the reserved 32K for output provides some implicit buffer since most responses don't max that out.
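A worked example with made-up limits (a hypothetical 200K-context model that declares a 64K output limit and no separate input limit):

```typescript
// Hypothetical model limits, just to trace the arithmetic
const context = 200_000
const outputLimit = 64_000
const inputLimit = 0 // no separate input limit declared

const output = Math.min(outputLimit, 32_000) || 32_000 // reserve 32K
const usable = inputLimit || context - output // 200_000 - 32_000 = 168_000

// 170K tokens of history would trigger compaction; 160K would not
const compactAt170k = 170_000 > usable // true
const compactAt160k = 160_000 > usable // false
```

So for this model, compaction fires once history crosses 168K of the 200K window, about 84% full, but that falls out of the output reservation rather than an explicit percentage threshold.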

Julian Benegas
commented

what gets presented to the compaction agent? the full conversation in its context? or a path for it to read the convo in the filesystem?

Claude Opus 4.5
commented

The full conversation is passed directly in context. Look at lines 151-163:

messages: [
  ...MessageV2.toModelMessage(input.messages),  // <-- full conversation history
  {
    role: "user",
    content: [
      {
        type: "text",
        text: promptText,  // "Provide a detailed prompt for continuing..."
      },
    ],
  },
],

So:

  1. input.messages is the full conversation (already filtered by filterCompacted - so only messages since the last compaction)
  2. These get converted to model format via MessageV2.toModelMessage()
  3. A final user message is appended asking for the summary

Also note tools: {} - the compaction agent has no tools, it just reads the conversation in context and generates a summary. No filesystem access needed since the conversation is passed directly.

This does raise an interesting point though - if the conversation is already at context limit, passing the full conversation to the compaction agent could itself overflow. But presumably the compaction model has the same or larger context, so it should fit.

Julian Benegas
commented

cool. do they include recent messages after compaction? or just the summary and then straight to work?

Claude Opus 4.5
commented

Now I understand. The stream function yields messages in reverse order (newest first, see line 567: for (let i = list.length - 1; i >= 0; i--)).

So filterCompacted works like this:

  1. Iterates messages from newest to oldest
  2. Keeps pushing messages into result
  3. Tracks completed summaries via completed.add(msg.info.parentID)
  4. Stops when it hits a user message that has a completed summary response AND has a compaction part
  5. Reverses at the end to get chronological order

The answer: They include messages AFTER the compaction, not just the summary.

The flow is:

[old messages] → [user msg with compaction part] → [summary response] → [newer messages if any]
                  ↑ stops here and includes everything after

So if after compaction the user continued chatting, the context would include:

  1. The user message that triggered compaction
  2. The summary (assistant response marked summary: true)
  3. Any subsequent messages

This makes sense - the summary replaces everything before it, but anything after the summary is still included verbatim.

Julian Benegas
commented

doesn't the compaction agent get confused if they pass the full conversation as if they have been part of it? then ask it to summarize?

Claude Opus 4.5
commented

You're right - it's a bit odd! The compaction agent receives:

  1. System prompt: "You are a helpful AI assistant tasked with summarizing conversations"
  2. Messages: The full conversation history (user/assistant turns from the original agent)
  3. Final user message: "Provide a detailed prompt for continuing our conversation above..."

So the compaction agent sees a conversation where "it" (as an assistant) was doing coding work, making tool calls, etc. - but that was actually a different agent (like coder). Then suddenly it's asked to summarize "our conversation."

It does seem like it could be confusing, but in practice it probably works because:

  1. LLMs are pretty good at understanding the meta-task ("summarize this")
  2. The system prompt primes it for summarization, not continuation
  3. The final user message is explicit about what's needed

A cleaner approach might be to present the conversation as a document to summarize rather than as if the compaction agent was part of it - something like:

messages: [
  {
    role: "user",
    content: `Here is a conversation to summarize:\n\n${formattedConversation}\n\nPlease summarize...`
  }
]

But that would lose the structured message format and potentially use more tokens. The current approach is a bit of a hack but probably works well enough in practice.

Julian Benegas
commented

sounds fine. so they ignore the "system" role message? doesn't that kill prompt caching?

Claude Opus 4.5
commented

So even with system: [], it still builds a system prompt:

const system = SystemPrompt.header(input.model.providerID)
system.push(
  [
    ...(input.agent.prompt ? [input.agent.prompt] : ...),  // compaction agent's prompt
    ...input.system,  // empty []
    ...(input.user.system ? [input.user.system] : []),
  ]
)

So the compaction agent's prompt from compaction.txt does get included via input.agent.prompt.

But you're right about prompt caching - this is a one-shot summarization call. The system prompt + conversation is completely different from normal chat flow:

  1. Different system prompt (compaction agent vs coder agent)
  2. Conversation ends with "summarize this" instead of continuing naturally
  3. Only happens once per compaction

So there's no cache reuse here - the compaction call can't benefit from any previous cache, and its cache won't benefit future calls. It's essentially a cold call every time.

That's probably fine since compaction is relatively infrequent, but it does mean the full conversation gets re-tokenized at full input price (no cache discount).

Julian Benegas
commented

i meant the original system prompt, not the compaction agent's system prompt

Claude Opus 4.5
commented

You're absolutely right. toModelMessage only extracts user/assistant messages - no system prompt.

So the compaction agent sees the conversation without the original system prompt that guided the coder agent's behavior. This means:

  1. The coder agent was operating under detailed instructions (tool usage, coding style, etc.)
  2. The compaction agent sees the conversation but has no idea what those instructions were
  3. Context about why the assistant behaved certain ways is lost

And yes - this kills any potential for prompt caching on the conversation history, because:

  • Normal flow: [system prompt A] + [messages] → cached
  • Compaction flow: [system prompt B] + [same messages] → completely different prefix, no cache hit

If they included the original system prompt (even just prepended as a user message like "The assistant was operating under these instructions: ..."), it could potentially help the compaction agent understand the context better AND wouldn't break caching for the message portion.

Though to be fair, compaction is a one-off operation so cache savings might be minimal anyway.
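The prepend idea could look like this. It is a sketch of the suggestion, not anything opencode does today; `withOriginalInstructions` is a hypothetical helper.

```typescript
// Hypothetical helper: surface the original agent's system prompt to the
// compaction agent as a leading user message instead of dropping it.
type HistoryMsg = { role: "user" | "assistant"; content: string }

function withOriginalInstructions(originalSystem: string, history: HistoryMsg[]): HistoryMsg[] {
  const preface: HistoryMsg = {
    role: "user",
    content: `The assistant was operating under these instructions:\n\n${originalSystem}`,
  }
  return [preface, ...history]
}
```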

Julian Benegas
commented

gotcha, makes sense. do the compaction messages include all the tool calls and results? doesn't that make it super expensive?

Claude Opus 4.5
commented

So it does include all tool calls and results. The only optimization is if a tool part was previously marked compacted, its output becomes "[Old tool result content cleared]".

Yes, this is expensive. The compaction call includes:

  • All user messages
  • All assistant text responses
  • All tool call inputs
  • All tool call outputs (full content, unless previously pruned)

And since we established there's no prompt caching benefit (different system prompt), you're paying full input token price for what could be a massive conversation right at the context limit.

This is a known trade-off: you pay once for a big compaction call to avoid paying repeatedly for an ever-growing context in future calls. But if the conversation is heavy on tool outputs (like file reads, grep results, etc.), that compaction call could be quite costly.
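As a rough illustration of the cost shape (all numbers are invented; real prices vary by provider and model):

```typescript
// Back-of-the-envelope: one uncached compaction call near the limit versus
// the cached rate the same tokens would get in the normal chat flow.
const promptTokens = 180_000 // conversation close to the context limit
const freshPricePerMTok = 3.0 // $ per million uncached input tokens (made up)
const cachedPricePerMTok = 0.3 // cached reads are often ~10x cheaper (made up)

const compactionCost = (promptTokens / 1_000_000) * freshPricePerMTok // ~$0.54
const cachedEquivalent = (promptTokens / 1_000_000) * cachedPricePerMTok // ~$0.054
```

Under these assumed numbers, the one-off compaction call costs roughly 10x what the same tokens would cost as cached reads, which is the price paid to stop the history from growing further.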



How does context compaction work? — anomalyco/opencode