agent-browser: The Rust-Powered Browser CLI That Saves 93% of Your AI Agent's Context
When AI agents interact with the web, they have a token problem.
Tools like Playwright MCP dump raw HTML or verbose JSON into the context window. A typical page snapshot can consume 10,000–50,000 tokens before the agent even starts reasoning about what to do. On complex SPAs, that's your entire context window gone on one page load.
agent-browser by @TheSethRose takes a fundamentally different approach: instead of giving the agent a wall of HTML, it outputs a compact accessibility tree — a semantic representation of interactive elements, each identified by a stable short reference like @e1, @e2. The result, according to benchmarks from Vercel Labs (which maintains the underlying CLI), is a 93% reduction in context window usage compared to Playwright MCP.
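To make the headline figure concrete, here is the arithmetic applied to the snapshot sizes quoted above. This is only a back-of-the-envelope sketch: it assumes the 93% reduction applies uniformly across the 10,000–50,000 token range, which the benchmark may not claim, and `tokens_after_reduction` is a hypothetical helper name.

```python
def tokens_after_reduction(tokens_before: int, reduction_percent: int = 93) -> int:
    """Tokens left after cutting `reduction_percent` percent (integer math)."""
    return tokens_before * (100 - reduction_percent) // 100


for before in (10_000, 50_000):
    after = tokens_after_reduction(before)
    print(f"{before} tokens -> {after} tokens")
```

Under that assumption, a 10,000-token snapshot shrinks to about 700 tokens and a 50,000-token one to about 3,500.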
With 74,000+ downloads on ClawHub and 14,000 GitHub stars, it's become the de facto browser automation choice for production AI agent workflows.
Architecture: Why Rust?
Most browser automation tools (Playwright, Puppeteer, Selenium) are driven from a Node.js or Python runtime that talks to the browser over the Chrome DevTools Protocol (CDP) or, in Selenium's case, WebDriver. This works, but it comes with overhead: startup time, memory footprint, and processing latency on every command.
agent-browser uses a two-tier architecture:
┌──────────────────────┐
│   Native Rust CLI    │  ← sub-millisecond command parsing
│   (agent-browser)    │
└──────────┬───────────┘
           │ CDP
┌──────────▼───────────┐
│    Node.js daemon    │  ← persistent browser instance (Playwright)
│   (manages Chrome)   │
└──────────────────────┘
The Rust CLI handles command parsing at near-instant speed and routes to a persistent Node.js daemon that maintains a running Playwright browser instance. The daemon stays alive between commands — no cold-start cost per action.
When the native binary isn't available (non-x86_64 Linux, some ARM environments), the system transparently falls back to a pure Node.js path. The API is identical; you just lose the sub-millisecond command parsing.
The Core Innovation: Accessibility Tree + Stable Refs
The critical insight in agent-browser's design is this: AI agents don't need raw HTML. They need a structured description of what's interactive and what it means.
Running agent-browser snapshot doesn't return a DOM dump. It returns an accessibility tree — the same representation that screen readers use — annotated with short stable refs:
Document
main
heading "Sign in to GitHub" @e1
group "Sign in with a passkey"
button "Sign in with a passkey" @e2
group
labelText "Username or email address" @e3
textbox "Username or email address" @e4
labelText "Password" @e5
textbox "Password" @e6
link "Forgot password?" @e7
button "Sign in" @e8
group "New to GitHub?"
link "Create an account" @e9
This is dramatically more compact than HTML. The agent can now reason about what's on screen — headings, buttons, inputs, links — and act on elements by ref:
agent-browser fill @e4 "myusername"
agent-browser fill @e6 "mypassword"
agent-browser click @e8
No fragile CSS selectors. No XPath. No "find the third div with class btn-primary." Refs are stable across re-renders because they're derived from the accessibility structure, not the DOM position.
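If you are wiring this into your own agent code, the snapshot text can be turned into a lookup table before you issue commands. The sketch below assumes each line has the shape `role "name" @ref`, as in the sample snapshot above; the real output may carry extra attributes, and `parse_refs` and `find_ref` are hypothetical helper names, not part of agent-browser.

```python
import re

# Assumed line shape, taken from the sample snapshot: <role> "<name>" @<ref>
LINE_RE = re.compile(r'^\s*(?P<role>\S+)\s+"(?P<name>[^"]*)"\s+@(?P<ref>e\d+)\s*$')


def parse_refs(snapshot: str) -> dict:
    """Map each ref (for example "e6") to its (role, name) pair."""
    refs = {}
    for line in snapshot.splitlines():
        match = LINE_RE.match(line)
        if match:  # lines without a ref, such as plain groups, are skipped
            refs[match["ref"]] = (match["role"], match["name"])
    return refs


def find_ref(refs: dict, role: str, name: str):
    """Return "@eN" for the first element with this role and name, else None."""
    for ref, (ref_role, ref_name) in refs.items():
        if ref_role == role and ref_name == name:
            return "@" + ref
    return None


sample = """\
heading "Sign in to GitHub" @e1
group "Sign in with a passkey"
textbox "Username or email address" @e4
textbox "Password" @e6
button "Sign in" @e8
"""
refs = parse_refs(sample)
print(find_ref(refs, "button", "Sign in"))  # prints @e8
```

Looking refs up by role and name keeps the agent from hard-coding `@e8`, which matters if a later snapshot numbers elements differently.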
Full Command Reference
Installation
# Global install (recommended for production)
npm install -g agent-browser
agent-browser install
# Linux: install system dependencies
agent-browser install --with-deps
# macOS via Homebrew
brew install agent-browser
# Quick test without installing
npx agent-browser open example.com
Note: npx routing is "noticeably slower" than a global install due to Node.js intermediary overhead. For agents running many commands, install globally.
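A wrapper script can apply that advice automatically by preferring the global binary and using npx only as a fallback. This is a minimal sketch, assuming the global install puts `agent-browser` on `PATH`; `base_command` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil
from typing import Callable, List, Optional


def base_command(which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Prefer a globally installed binary; otherwise fall back to npx."""
    if which("agent-browser"):
        return ["agent-browser"]
    return ["npx", "agent-browser"]


# Build (but do not run) the quick-test command from the install section.
print(base_command() + ["open", "example.com"])
```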
Navigation
agent-browser open example.com # open in default browser
agent-browser goto https://github.com # navigate current tab
agent-browser navigate back # browser back
agent-browser navigate forward
agent-browser navigate reload
Snapshots & Screenshots
agent-browser snapshot # accessibility tree (AI-optimized, compact)
agent-browser snapshot --interactive # show only interactive elements
agent-browser screenshot # PNG screenshot
agent-browser screenshot --annotate # screenshot with numbered element overlays
agent-browser screenshot --full # full-page screenshot
agent-browser pdf output.pdf # export page as PDF
Element Interaction (by ref)
agent-browser click @e2 # click element from snapshot
agent-browser dblclick @e5 # double click
agent-browser fill @e4 "value" # clear + type into input
agent-browser type @e4 "value" # type without clearing
agent-browser press @e4 Enter # key press on element
agent-browser hover @e3 # hover (for dropdowns, tooltips)
agent-browser select @e7 "option" # select dropdown option
agent-browser check @e9 # check checkbox
agent-browser uncheck @e9 # uncheck checkbox
Semantic Finders (no ref needed)
# Find by ARIA role
agent-browser find role button --name "Submit" click
agent-browser find role textbox --name "Email" fill "test@example.com"
# Find by visible text
agent-browser find text "Sign in" click
# Find by label
agent-browser find label "Password" fill "secret"
# Find by placeholder
agent-browser find placeholder "Search..." type "query"
State & Auth Persistence
# Save browser auth state (cookies, localStorage)
agent-browser state save ./auth.json
# Load saved state (persist login across agent sessions)
agent-browser state load ./auth.json
JavaScript Execution
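The --base64 form of eval listed in this section avoids shell-quoting problems when a script contains quotes or newlines. The payload used in that example decodes to a one-line script that counts the links on the page; the sketch below shows how such a payload could be produced, assuming the flag expects standard Base64 of the UTF-8 script text.

```python
import base64

# The script that the article's example payload decodes to.
script = "return document.querySelectorAll('a').length"

# Standard Base64 of the UTF-8 bytes (assumed to be what --base64 expects).
encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")

# Argument list ready for subprocess.run; nothing is executed here.
argv = ["agent-browser", "eval", "--base64", encoded]
print(encoded)
```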
agent-browser eval "document.title"
agent-browser eval "window.scrollY"
# Base64 for complex scripts
agent-browser eval --base64 "cmV0dXJuIGRvY3VtZW50LnF1ZXJ5U2VsZWN0b3JBbGwoJ2EnKS5sZW5ndGg="
Network Interception
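When the mocked response for network route comes from program data rather than a hand-written string, it is safer to serialize it than to concatenate JSON by hand. A minimal sketch, using only the `network route` and `--body` forms shown in this section; `route_command` is a hypothetical helper name.

```python
import json
import shlex
from typing import List


def route_command(pattern: str, payload: dict) -> List[str]:
    """Build the argument list for mocking `pattern` with a JSON body."""
    body = json.dumps(payload, separators=(",", ":"))  # compact, no spaces
    return ["agent-browser", "network", "route", pattern, "--body", body]


argv = route_command("*/api/user", {"name": "test"})
print(shlex.join(argv))  # shell-quoted form, useful for logging
```

Passing the list to `subprocess.run` sidesteps shell quoting entirely; `shlex.join` is only there to show the equivalent command line.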
# Mock API responses for testing
agent-browser network route "*/api/user" --body '{"name":"test"}'
agent-browser network unroute "*/api/user"
Viewport & Device Emulation
agent-browser set viewport 375 812 # iPhone dimensions
agent-browser set device "iPhone 14 Pro"
agent-browser set geo 37.7749 -122.4194 # San Francisco
Visual Diffs & Debugging
agent-browser snapshot --diff ./prev.json # compare against saved snapshot
agent-browser screenshot --diff ./prev.png # pixel diff
agent-browser connect 9222 # attach to existing Chrome via CDP port
Comparing agent-browser to Playwright MCP and Puppeteer
| | agent-browser | Playwright MCP | Puppeteer |
|--|--|--|--|
| Primary language | Rust CLI + Node.js daemon | Node.js | Node.js |
| Context window usage | ~93% less than Playwright MCP | High (HTML dumps) | High |
| Output format | Compact accessibility tree | Raw HTML / JSON | Raw HTML |
| Element selection | Stable refs (@e1) + semantic finders | CSS selectors / XPath | CSS selectors |
| Startup speed | Sub-millisecond command parsing (Rust CLI); persistent daemon avoids per-command cold starts | ~1-2s | ~1-2s |
| Auth persistence | Built-in (state save/load) | Manual | Manual |
| AI agent design | Native — built for agents | Adaptation | Not designed for agents |
| Browser support | Chromium (via Playwright backend) | Chromium, Firefox, WebKit | Chromium, Firefox |
The key trade-off: agent-browser wins decisively on context efficiency and agent ergonomics; Playwright wins if you need Firefox/WebKit or the full Playwright API surface.
Real Use Cases
1. Web Data Extraction Without Fragile Selectors
Before agent-browser, extracting structured data from a complex SPA meant either writing bespoke Playwright scripts (brittle, break on UI updates) or using a browser extension. With agent-browser:
agent-browser open https://app.example.com/dashboard
agent-browser state load ./auth.json # already logged in
agent-browser snapshot # agent reads the accessible structure
# agent identifies @e12 as the revenue table
agent-browser get text @e12 # extract text content
No CSS selector archaeology. When the UI updates, the accessibility structure stays semantically consistent.
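Driven from code, the same four steps might look like the sketch below. It assumes agent-browser writes command results to stdout, and it leaves the choice of ref to a caller-supplied `pick_ref` function standing in for the agent's reasoning; `run_cli` and `extract_text` are hypothetical names.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_cli(args: Sequence[str]) -> str:
    """Run one agent-browser command and return its stdout (assumed output channel)."""
    result = subprocess.run(
        ["agent-browser", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def extract_text(
    url: str,
    state_file: str,
    pick_ref: Callable[[str], str],
    run: Runner = run_cli,
) -> str:
    """Open a page, restore auth, snapshot it, and read one element's text."""
    run(["open", url])
    run(["state", "load", state_file])
    snapshot = run(["snapshot"])
    ref = pick_ref(snapshot)  # the agent decides which ref holds the data
    return run(["get", "text", ref])
```

A call such as `extract_text("https://app.example.com/dashboard", "./auth.json", pick_ref=my_agent_choice)` would run the commands in the order shown above; the `run` parameter lets you substitute a fake runner when testing without a browser.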
2. Form Automation Across Sessions
# Session 1: authenticate and save state
agent-browser open https://portal.example.com
agent-browser fill @e3 "user@example.com"
agent-browser fill @e5 "password"
agent-browser click @e7
agent-browser state save ./portal-auth.json
# All future sessions: skip login
agent-browser state load ./portal-auth.json
agent-browser goto https://portal.example.com/submit
3. Visual Regression Monitoring
# Baseline
agent-browser open https://myapp.com
agent-browser screenshot baseline.png
# After deployment
agent-browser open https://myapp.com
agent-browser screenshot current.png --diff baseline.png
# agent-browser reports pixel diff percentage
4. AI Agent Web Research Pipelines
The accessibility tree format is designed to be processed by LLMs directly. An agent can:
- snapshot a page to understand its structure
- Extract relevant @ref identifiers
- Issue targeted click or get text commands
- Follow links and repeat, all within a single context window, not multiple full HTML dumps
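The loop described in this list can be sketched as a bounded crawl. Everything here is illustrative: `crawl` and `choose_link` are hypothetical names, `choose_link` stands in for the LLM's decision about which ref to follow, and `run` is whatever function you use to execute an agent-browser command and return its output. The demo uses a stub runner so it executes without a browser.

```python
from typing import Callable, List, Optional, Sequence

Runner = Callable[[Sequence[str]], str]


def crawl(
    start_url: str,
    run: Runner,
    choose_link: Callable[[str], Optional[str]],
    max_pages: int = 3,
) -> List[str]:
    """Snapshot a page, follow one chosen link, and repeat up to max_pages."""
    run(["open", start_url])
    snapshots: List[str] = []
    for _ in range(max_pages):
        snapshot = run(["snapshot"])
        snapshots.append(snapshot)
        next_ref = choose_link(snapshot)
        if next_ref is None:  # nothing worth following: stop early
            break
        run(["click", next_ref])
    return snapshots


# Demo with canned snapshots instead of a real browser.
canned = iter(['link "Docs" @e2', 'heading "Docs" @e1'])
stub_run = lambda args: next(canned) if list(args) == ["snapshot"] else ""
pages = crawl("example.com", stub_run, lambda s: "@e2" if "link" in s else None)
print(len(pages))  # prints 2
```

The `max_pages` cap is a design choice, not something agent-browser enforces: it keeps the number of snapshots accumulated in context predictable.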
Installation via ClawHub
clawhub install agent-browser
The skill wraps the agent-browser CLI with OpenClaw-specific conventions: the allowed-tools config pre-authorizes Bash(agent-browser:*) so the agent can issue browser commands without per-command approval.
Verify the skill is active:
agent-browser --version
agent-browser install # installs the browser binary if not present
Before installing, check the skill's current safety status at clawhub.ai/skills.
Frequently Asked Questions
Does it support Firefox or Safari?
Currently Chromium only (via the Playwright backend). Firefox support is on the roadmap. For cross-browser testing, Playwright MCP remains the better choice.
Does the npx path work for quick testing?
Yes, but it's "noticeably slower" per the official docs. For one-off experiments it's fine; for production agent workflows that issue many browser commands, install globally.
How does auth state persistence work across agent sessions?
agent-browser state save ./auth.json captures cookies, localStorage, and session storage. agent-browser state load restores it. This means an agent can log in once and reuse the session across many runs — useful for agents that need authenticated access to web apps.
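One way to script the "log in once, reuse afterwards" pattern is to branch on whether the state file already exists. The sketch below only builds the command lists; `session_commands` is a hypothetical helper, and the refs in `login` are the ones from the form-automation example, which will differ on a real page.

```python
from pathlib import Path
from typing import List


def session_commands(state_file: str, login_steps: List[List[str]]) -> List[List[str]]:
    """Reuse saved state when the file exists; otherwise log in and save it."""
    if Path(state_file).is_file():
        return [["state", "load", state_file]]
    return [*login_steps, ["state", "save", state_file]]


login = [
    ["fill", "@e3", "user@example.com"],
    ["fill", "@e5", "password"],
    ["click", "@e7"],
]
for args in session_commands("./portal-auth.json", login):
    print("agent-browser", *args)
```

This does not detect an expired session; a saved file that no longer authenticates would still be loaded, so a real workflow would need its own check after loading.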
Can it handle SPAs that load content asynchronously?
Yes. The underlying Playwright daemon waits for network idle and DOM stability before returning snapshot results. You can also use agent-browser wait commands for explicit waits.
What's the --annotate flag on screenshots?
It overlays numbered labels on every interactive element, matching the @e1, @e2 refs from the snapshot. Useful for debugging — you can visually verify which ref corresponds to which element.