vercel-labs/agent-browser

How does this work?

399fd7a: v0.6.0 changelog (#154)

This post might have stale content, as main is 475 commits ahead.

Nicolás Montone posted

how does this work?

Claude Haiku 4.5 commented

Now I can see the structure. Before diving deeper, what aspect interests you most? Or should I start with the README to give you a high-level overview?

Nicolás Montone commented

what's the strategy used to make the agent use the website

Claude Haiku 4.5 commented

Perfect! Now I have a complete understanding. Let me create a comprehensive summary:

## Strategy: How Agent-Browser Makes Agents Use Websites

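The core strategy is simple: give the agent a text snapshot of the page's accessibility tree in which each actionable element carries a short reference ID, then let the agent act on those IDs. Here's how it works: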

### 1. Accessibility Tree Snapshot with Reference IDs

When an AI agent needs to interact with a website, it first calls `agent-browser snapshot`, which:

- Generates an ARIA accessibility tree using Playwright's built-in `ariaSnapshot()` method
- Assigns deterministic reference IDs (like `@e1`, `@e2`, etc.) to interactive and important elements
- Stores metadata about each ref (role, name, selector) in a `RefMap`

Example output:

- button "Submit" [ref=e2]
- textbox "Email" [ref=e3]  
- link "Help" [ref=e1]

This logic lives in `src/snapshot.ts`, in the `getEnhancedSnapshot()` function, which processes the ARIA tree.
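To make the ref-assignment step concrete, here is a minimal sketch of the idea. It is not the code from `src/snapshot.ts`: the `assignRefs` function, the `RefEntry` shape, and the list of interactive roles are all assumptions made for illustration, and it works on a plain string rather than a live Playwright page.

```typescript
// Hypothetical sketch: names and the role list are illustrative, not taken from src/snapshot.ts.
interface RefEntry {
  role: string;
  name: string;
}

type RefMap = Record<string, RefEntry>;

// Roles that receive a ref in this sketch (an assumed, non-exhaustive list).
const INTERACTIVE_ROLES = new Set(["button", "link", "textbox", "checkbox", "combobox"]);

// Matches one ARIA-snapshot line such as `  - button "Submit"`.
const LINE_PATTERN = /^(\s*-\s*)(\w+)(?:\s+"([^"]*)")?(.*)$/;

function assignRefs(ariaTree: string): { tree: string; refs: RefMap } {
  const refs: RefMap = {};
  let counter = 0;

  const lines = ariaTree.split("\n").map((line) => {
    const match = LINE_PATTERN.exec(line);
    if (!match) return line;

    const prefix = match[1] ?? "";
    const role = match[2] ?? "";
    const name = match[3] ?? "";
    const rest = match[4] ?? "";
    if (!INTERACTIVE_ROLES.has(role)) return line;

    // Refs are numbered in document order, so the same tree always yields the same IDs.
    counter += 1;
    const ref = `e${counter}`;
    refs[ref] = { role, name };
    const label = name ? ` "${name}"` : "";
    return `${prefix}${role}${label} [ref=${ref}]${rest}`;
  });

  return { tree: lines.join("\n"), refs };
}

const snapshot = assignRefs(
  ['- heading "Sign in"', '- textbox "Email"', '- button "Submit"'].join("\n"),
);
console.log(snapshot.tree);
```

In this sketch the heading is left without a ref because it is not in the assumed interactive-role list, while the textbox and button are numbered in the order they appear.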

### 2. Dual Selector Support

The `BrowserManager` in `src/browser.ts` implements `getLocator()`, which accepts either a ref or a regular selector:

getLocator(selectorOrRef: string): Locator {
  // Check if it's a ref first (@e1, @e2, etc)
  const locator = this.getLocatorFromRef(selectorOrRef);
  if (locator) return locator;
  
  // Otherwise treat as regular CSS selector
  return page.locator(selectorOrRef);
}

This means agents can use:

- Refs: `agent-browser click @e2` (AI-friendly)
- CSS selectors: `agent-browser click "#submit"` (traditional)
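The ref-or-selector decision can be sketched without a browser. The following is a hedged illustration, not the project's implementation: `parseRef`, `resolveTarget`, and the `Target` type are hypothetical names, and the sketch returns a plain descriptor where the real code would return a Playwright `Locator`. One deliberate difference: an unknown ref throws here instead of falling through to selector handling.

```typescript
// Hypothetical sketch: types and helper names are illustrative, not the project's API.
interface RefEntry {
  role: string;
  name: string;
}

type Target =
  | { kind: "role"; role: string; name: string } // would become a role-based Playwright locator
  | { kind: "selector"; selector: string }; // would be passed to page.locator()

// Returns "e2" for "@e2", or null when the input is not ref-shaped.
function parseRef(input: string): string | null {
  const match = /^@(e\d+)$/.exec(input);
  return match?.[1] ?? null;
}

function resolveTarget(input: string, refs: Record<string, RefEntry>): Target {
  const ref = parseRef(input);
  if (ref !== null) {
    const entry = refs[ref];
    if (!entry) {
      // Choice made for this sketch: report stale or unknown refs explicitly.
      throw new Error(`Unknown ref "${input}". Take a new snapshot to get current refs.`);
    }
    return { kind: "role", role: entry.role, name: entry.name };
  }
  // Anything that is not a ref is treated as a regular selector.
  return { kind: "selector", selector: input };
}

const refs = { e2: { role: "button", name: "Submit" } };
console.log(resolveTarget("@e2", refs));
console.log(resolveTarget("#submit", refs));
```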

### 3. Command Dispatch & Execution

All commands flow through a massive switch statement in src/actions.ts that maps each action to a handler:

switch (command.action) {
  case 'click': return await handleClick(command, browser);
  case 'fill': return await handleFill(command, browser);
  case 'snapshot': return await handleSnapshot(command, browser);
  // ... 100+ more commands
}

Each handler:

1. Gets the locator (supporting both refs and selectors)
2. Executes the Playwright action
3. Returns AI-friendly errors if something fails
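The "AI-friendly errors" step is the least obvious of the three, so here is a rough sketch of what such a translation layer could look like. The names `friendlyError` and `withFriendlyErrors` and the hint wording are assumptions for illustration, not the handlers in `src/actions.ts`; the two message patterns matched are ones Playwright is known to emit for ambiguous locators and timeouts.

```typescript
// Hypothetical sketch: function names and hint wording are illustrative.
function friendlyError(error: unknown, target: string): Error {
  const message = error instanceof Error ? error.message : String(error);

  // Playwright reports a locator that matches several elements as a "strict mode violation".
  if (message.includes("strict mode violation")) {
    return new Error(
      `"${target}" matched more than one element. Run 'snapshot' and use a specific ref.`,
    );
  }

  // Action timeouts usually mean the element never became visible or actionable.
  if (/Timeout \d+ms exceeded/.test(message)) {
    return new Error(
      `"${target}" was not found or not actionable in time. Run 'snapshot' to see current elements.`,
    );
  }

  // Anything unrecognized is passed through unchanged.
  return error instanceof Error ? error : new Error(message);
}

// Wraps any handler body so that failures reach the agent in translated form.
async function withFriendlyErrors<T>(target: string, action: () => Promise<T>): Promise<T> {
  try {
    return await action();
  } catch (error) {
    throw friendlyError(error, target);
  }
}
```

A handler would then wrap its Playwright call, for example `withFriendlyErrors("@e2", () => locator.click())`, so the agent receives a next step to try instead of a raw stack trace.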

### 4. Why This Works for AI Agents

Problem it solves:

- AI agents struggle with fragile CSS selectors that break when the page structure changes
- Accessibility trees are semantic, so they stay more stable across page variations

Solution provided:

- ✅ Semantic references: elements are identified by role and accessible name instead of fragile selectors
- ✅ Deterministic: within a given snapshot, each ref points to exactly one element; refs are only meaningful relative to the snapshot that produced them
- ✅ AI-optimized: agents can reason about `@e2` (the "Submit" button) instead of `.btn-primary.mt-3[data-id="..."]`
- ✅ Fallback support: regular CSS/XPath selectors still work if needed

### 5. The Full Loop

AI Agent thinks: "I need to fill the email field and click submit"
    ↓
Agent calls: agent-browser snapshot
    ↓
Gets: { tree: "- textbox \"Email\" [ref=e3]", refs: { e3: {...} } }
    ↓
Agent calls: agent-browser fill @e3 "test@example.com"
    ↓
BrowserManager.getLocator("@e3") → resolves to Playwright locator
    ↓
Playwright executes: locator.fill("test@example.com")
    ↓
Agent calls: agent-browser click @e2 
    ↓
Success!
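From the agent's side, the loop above amounts to reading refs out of the snapshot text and turning them into CLI calls. The sketch below illustrates that under stated assumptions: `findRef` and `planLogin` are hypothetical helpers, and the tree is the example output shown earlier, not live output from the tool.

```typescript
// Hypothetical sketch of the agent side; helper names are illustrative.

// Finds the ref for an element with the given role and accessible name.
function findRef(tree: string, role: string, name: string): string | null {
  for (const line of tree.split("\n")) {
    const match = /^\s*-\s*(\w+)\s+"([^"]*)"\s+\[ref=(e\d+)\]/.exec(line);
    if (match && match[1] === role && match[2] === name) {
      return `@${match[3]}`;
    }
  }
  return null;
}

// Builds the commands for "fill the email field and click submit".
function planLogin(tree: string, email: string): string[] {
  const emailRef = findRef(tree, "textbox", "Email");
  const submitRef = findRef(tree, "button", "Submit");
  if (!emailRef || !submitRef) {
    throw new Error("Expected elements not found; take a new snapshot.");
  }
  return [
    `agent-browser fill ${emailRef} "${email}"`,
    `agent-browser click ${submitRef}`,
  ];
}

const tree = [
  '- button "Submit" [ref=e2]',
  '- textbox "Email" [ref=e3]',
  '- link "Help" [ref=e1]',
].join("\n");

console.log(planLogin(tree, "test@example.com").join("\n"));
// agent-browser fill @e3 "test@example.com"
// agent-browser click @e2
```

In practice an LLM does this matching by reading the tree directly; the point of the sketch is that the snapshot text alone carries enough information to choose the next command.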

The design works because it bridges the gap between human-readable UI semantics (what accessibility trees provide) and deterministic element selection (what AI agents need).
