How does this work?

main
399fd7a v0.6.0 changelog (#154)
This post might have stale content, as main is 212 commits ahead.

how does this work?

Claude Haiku 4.5
commented

Now I can see the structure. Before diving deeper, what aspect interests you most? Or should I start with the README to give you a high-level overview?

Nicolás Montone
commented

what's the strategy used to make the agent use the website

Claude Haiku 4.5
commented

Perfect! Now I have a complete understanding. Let me create a comprehensive summary:

## Strategy: How Agent-Browser Makes Agents Use Websites

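The core strategy: give the agent a compact accessibility-tree snapshot in which each element carries a short reference ID, then let the agent act on those IDs. Here's how it works: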

### 1. Accessibility Tree Snapshot with Reference IDs

When an AI agent needs to interact with a website, it first calls agent-browser snapshot, which:

  • Generates an ARIA accessibility tree using Playwright's built-in ariaSnapshot() method
  • Assigns deterministic reference IDs (like @e1, @e2, etc.) to interactive and important elements
  • Stores metadata about each ref (role, name, selector) in a RefMap

Example output:

- button "Submit" [ref=e2]
- textbox "Email" [ref=e3]  
- link "Help" [ref=e1]

This logic lives in src/snapshot.ts, in the getEnhancedSnapshot() function, which processes the ARIA tree.
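
To make the ref-assignment step concrete, here is a minimal sketch. It is not the code from src/snapshot.ts: the `assignRefs` function, the `RefEntry` shape, and the list of interactive roles are assumptions made for illustration, and it works on a plain string rather than a live Playwright page.

```typescript
// Hypothetical sketch: annotate ARIA snapshot lines with refs and build a ref map.
// assignRefs, RefEntry and INTERACTIVE_ROLES are illustrative names, not the project's API.
interface RefEntry {
  role: string;
  name: string;
}

const INTERACTIVE_ROLES = new Set(["button", "link", "textbox", "checkbox", "combobox"]);

function assignRefs(ariaTree: string): { tree: string; refs: Record<string, RefEntry> } {
  const refs: Record<string, RefEntry> = {};
  let counter = 0;

  const lines = ariaTree.split("\n").map((line) => {
    // Match lines such as `  - button "Submit"`: indent, role, quoted name.
    const match = line.match(/^(\s*)- (\w+) "([^"]*)"(.*)$/);
    if (!match) return line;
    const [, indent, role, name, rest] = match;
    if (!INTERACTIVE_ROLES.has(role)) return line;

    // This sketch numbers refs in document order, so the same tree yields the same refs.
    counter += 1;
    const ref = `e${counter}`;
    refs[ref] = { role, name };
    return `${indent}- ${role} "${name}" [ref=${ref}]${rest}`;
  });

  return { tree: lines.join("\n"), refs };
}

const snapshot = assignRefs('- heading "Sign in"\n- textbox "Email"\n- button "Submit"');
console.log(snapshot.tree);
```

In this sketch the heading is left untouched because its role is not in the interactive set, while the textbox and button each receive a ref.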

### 2. Dual Selector Support

The BrowserManager in src/browser.ts implements getLocator(), which accepts either a ref or a regular selector:

getLocator(selectorOrRef: string): Locator {
  // Check if it's a ref first (@e1, @e2, etc.)
  const locator = this.getLocatorFromRef(selectorOrRef);
  if (locator) return locator;

  // Otherwise treat it as a regular CSS selector.
  // `page` is the active Playwright Page; how it is obtained is elided in this excerpt.
  return page.locator(selectorOrRef);
}

This means agents can use:

  • Refs: agent-browser click @e2 (AI-friendly)
  • CSS selectors: agent-browser click "#submit" (traditional)
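
The ref-or-selector decision can be sketched without a browser. The names below (`parseRef`, `resolveTarget`, `RefInfo`, `Target`) are hypothetical and only illustrate the branching; the real getLocator() returns a Playwright Locator rather than a plain object.

```typescript
// Hypothetical sketch of ref-or-selector resolution; not the code in src/browser.ts.
interface RefInfo {
  role: string;
  name: string;
}

type Target =
  | { kind: "ref"; ref: string; role: string; name: string }
  | { kind: "selector"; selector: string };

// Accepts "@e2" and returns "e2"; returns null for anything else.
function parseRef(input: string): string | null {
  const match = /^@(e\d+)$/.exec(input.trim());
  return match ? match[1] : null;
}

function resolveTarget(input: string, refMap: Record<string, RefInfo>): Target {
  const ref = parseRef(input);
  const info = ref !== null ? refMap[ref] : undefined;
  if (ref !== null && info !== undefined) {
    return { kind: "ref", ref, role: info.role, name: info.name };
  }
  // Anything that is not a known ref is passed through as a selector.
  return { kind: "selector", selector: input };
}

const knownRefs: Record<string, RefInfo> = { e2: { role: "button", name: "Submit" } };
console.log(resolveTarget("@e2", knownRefs));
console.log(resolveTarget("#submit", knownRefs));
```

As in the excerpt above, the ref lookup runs first and the selector branch is the fallback.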

### 3. Command Dispatch & Execution

All commands flow through a single large switch statement in src/actions.ts that maps each action to a handler:

switch (command.action) {
  case 'click': return await handleClick(command, browser);
  case 'fill': return await handleFill(command, browser);
  case 'snapshot': return await handleSnapshot(command, browser);
  // ... 100+ more commands
}

Each handler:

  1. Gets the locator (supporting both refs and selectors)
  2. Executes the Playwright action
  3. Returns AI-friendly errors if something fails
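
The three handler steps above can be sketched as follows. Everything here is an assumption for illustration: the `Command` and `Result` shapes, the `StubBrowser` interface, and the `toFriendlyError` helper are hypothetical, and a stub stands in for Playwright so the example runs on its own.

```typescript
// Hypothetical sketch of the dispatch pattern; not the code in src/actions.ts.
// The real handlers are awaited; this stub is synchronous to keep the sketch short.
type Command =
  | { action: "click"; selector: string }
  | { action: "fill"; selector: string; value: string };

type Result = { success: true; data: string } | { success: false; error: string };

interface StubBrowser {
  click(selector: string): void;
  fill(selector: string, value: string): void;
}

// Turn a low-level failure into a message that tells the agent what to try next.
function toFriendlyError(err: unknown, selector: string): string {
  const message = err instanceof Error ? err.message : String(err);
  if (message.includes("Timeout")) {
    return `Element "${selector}" was not found or not visible. Run snapshot to see current elements.`;
  }
  return message;
}

function execute(command: Command, browser: StubBrowser): Result {
  try {
    switch (command.action) {
      case "click":
        browser.click(command.selector);
        return { success: true, data: `clicked ${command.selector}` };
      case "fill":
        browser.fill(command.selector, command.value);
        return { success: true, data: `filled ${command.selector}` };
      default:
        return { success: false, error: "Unknown action" };
    }
  } catch (err) {
    return { success: false, error: toFriendlyError(err, command.selector) };
  }
}

// Stub that fails for one ref, standing in for a timeout-style failure.
const stubBrowser: StubBrowser = {
  click(selector) {
    if (selector === "@e9") throw new Error("Timeout 5000ms exceeded");
  },
  fill() {},
};

console.log(execute({ action: "click", selector: "@e2" }, stubBrowser));
console.log(execute({ action: "click", selector: "@e9" }, stubBrowser));
```

The point of the error translation is that the agent receives a next step (take a fresh snapshot) instead of a raw stack trace.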

### 4. Why This Works for AI Agents

Problem it solves:

  • AI agents struggle with fragile CSS selectors that break when page structure changes
  • Accessibility trees are semantic and more stable across page variations

Solution provided:

  • Semantic references: Use meaningful element names instead of fragile selectors
  • Deterministic: Same elements always get the same refs in a snapshot
  • AI-optimized: Agents can reason about @e2 (the "Submit" button) instead of .btn-primary.mt-3[data-id="..."]
  • Fallback support: Still works with regular CSS/XPath selectors if needed

### 5. The Full Loop

AI Agent thinks: "I need to fill the email field and click submit"
    ↓
Agent calls: agent-browser snapshot
    ↓
Gets: { tree: "- textbox \"Email\" [ref=e3]", refs: { e3: {...} } }
    ↓
Agent calls: agent-browser fill @e3 "test@example.com"
    ↓
BrowserManager.getLocator("@e3") → resolves to Playwright locator
    ↓
Playwright executes: locator.fill("test@example.com")
    ↓
Agent calls: agent-browser click @e2 
    ↓
Success!
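
The agent's side of this loop can be mimicked in a few lines. In practice the model reads the snapshot text directly; the hypothetical `findRef` helper below only imitates that lookup so the sequence of CLI calls can be shown without a browser.

```typescript
// Hypothetical agent-side helper: look up a ref in snapshot text, then build CLI calls.
function findRef(tree: string, role: string, name: string): string | null {
  for (const line of tree.split("\n")) {
    // Match lines such as `- textbox "Email" [ref=e3]`.
    const match = line.match(/^\s*- (\w+) "([^"]*)".*\[ref=(e\d+)\]/);
    if (match && match[1] === role && match[2] === name) return match[3];
  }
  return null;
}

const loopTree = '- textbox "Email" [ref=e3]\n- button "Submit" [ref=e2]';
const emailRef = findRef(loopTree, "textbox", "Email");
const submitRef = findRef(loopTree, "button", "Submit");

const loopCommands = [
  `agent-browser fill @${emailRef} "test@example.com"`,
  `agent-browser click @${submitRef}`,
];
console.log(loopCommands.join("\n"));
```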

The design works because it bridges the gap between human-readable UI semantics (what accessibility trees provide) and deterministic element selection (what AI agents need).



How does this work? — vercel-labs/agent-browser