Browser-Use Is Solving the Wrong Half of the Problem

May 19 • 5 min read

— Originally published at renezander.com

TL;DR — why browserground, not the other 2B grounding models

You already know the hybrid-AI argument: don't pay frontier-vision rates for "where is the button?" There are three good 2B specialists for that job — UI-TARS, ShowUI, browserground. Here's the case for picking this one.

	browserground v0.3	UI-TARS-2B-SFT	ShowUI-2B
ScreenSpot-v2 (overall)	60.0%	89.5%	75.5%
Output format	✅ strict JSON `{"bbox_2d": [...]}`, 100% parseable	❌ coord strings inside prose — needs regex	❌ varies by prompt
Apple Silicon native	✅ MLX 4-bit, Ollama, GGUF	❌ server-class	❌ server-class
Distribution	✅ npm + pip + Ollama + HF, one install per stack	HF only	HF only
Daemon / HTTP REST	✅ `serve --http :8401`, Ollama-shape API	❌	❌
Batch + confidence + eval CLIs	✅ built-in	❌	❌
Adapters	✅ `browser-use` Controller + Skyvern `ground_with_fallback`	❌ DIY	❌ DIY
Base model	Qwen3-VL-2025	Qwen2-VL-2024	Qwen2-VL-2024
Training compute	$2.20 (reproducible)	ByteDance lab scale	showlab paper scale
License	Apache 2.0	Apache 2.0	Apache 2.0

The honest take on accuracy. Yes, UI-TARS scores 89.5% to our 60.0% on ScreenSpot-v2 overall. That gap is a training-data-and-compute gap, not an architecture gap. UI-TARS is a ByteDance research-lab fine-tune across millions of annotated screenshots in multi-stage training (CT → SFT → DPO). browserground is a comparable 2B vision-language base fine-tuned on a $5 budget with 26k examples and 1 epoch. Reaching ~89% should be achievable with the same recipe at ~$200–500 of compute and 250k records.

Why ship at 60% anyway? Because you don't use a 2B local model as a standalone cloud replacement. You use it as a router-stage primitive — --confidence returns the sequence log-prob, and the Skyvern adapter ships ground_with_fallback(threshold=0.55, cloud_fallback=...) out of the box. On representative agent workloads, ~70–80% of grounding calls clear the threshold and stay local at $0. The remaining 20–30% — sub-50px icons, ambiguous targets — escalate to cloud. Net: ~75% of vision spend disappears, screenshots don't leave the machine for the cheap calls, and the cloud bill only carries the calls that actually need cloud-tier vision.

That's the product. UI-TARS is the "I want one model for everything" answer; browserground is the "I want a fast, structured, MLX-native router primitive that plugs into the npm CLI / pip / Ollama" answer.
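The local-first routing described above can be sketched in a few lines. This is a minimal illustration, not the Skyvern adapter's actual code: `Grounding`, `ground_local`, and `ground_cloud` are hypothetical stand-ins, and the sketch assumes the confidence has already been mapped to a score in [0, 1] where higher means more certain. Only the 0.55 threshold is taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x1, y1, x2, y2) in pixels


@dataclass
class Grounding:
    bbox: BBox
    confidence: float  # assumption: a score in [0, 1], higher is more certain
    source: str        # "local" or "cloud"


def route(
    screenshot: bytes,
    target: str,
    ground_local: Callable[[bytes, str], Grounding],  # hypothetical local 2B model
    ground_cloud: Callable[[bytes, str], Grounding],  # hypothetical cloud fallback
    threshold: float = 0.55,
) -> Grounding:
    """Try the local model first and escalate only when it is unsure."""
    local = ground_local(screenshot, target)
    if local.confidence >= threshold:
        return local  # stays on the machine, no API call
    return ground_cloud(screenshot, target)


def local_share(results: List[Grounding]) -> float:
    """Fraction of grounding calls that never left the machine."""
    if not results:
        return 0.0
    return sum(r.source == "local" for r in results) / len(results)
```

Logging `local_share` over a real workload is one way to check whether the 70–80% figure holds for your own pages before trusting the savings estimate.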

On per-split numbers: mobile-app buttons 78%, text-labelled targets ~74%, icon-only ~41%. If your agent mostly clicks labelled buttons (the common case), real-world accuracy is closer to the high end. Icons get fixed in v0.4.

And if you're new to the hybrid pattern — why this exists at all

Everyone's posting browser-agent demos this week. Click here, scroll there, fill that form. Most break by click seven.

Mine broke too, on the submit button of a checkout form that the frontier vision model literally couldn't see. At $0.01-0.05 per call and 20-50 calls per agent run, the model was burning reasoning capacity on parsing pixel coordinates. A 2B specialist I trained for $5 is 3.3x more accurate at that kind of click on ScreenSpot-v2 (60.0% vs GPT-4o's 18.3%).

The architecture is the bug, not the model.

Two Jobs, One Forward Pass

Browser-agent stacks send a screenshot to a frontier vision model and ask it for both the next decision and the click coordinates in one call. Splitting that into two calls, a local 2B grounding model that emits JSON followed by a frontier model that reasons over the JSON, drops vision token spend and raises click accuracy.

browser-use (94k stars), Skyvern (22k stars), Claude Computer Use, OpenAI Operator. Same pattern. Same compound question every step:

Given this page, what should the agent do next, and where exactly does it click?

Two jobs welded together. Reasoning ("what next") is a probabilistic problem worth a frontier model. Grounding ("where exactly") is a structured-output problem with a tight schema: clickable elements, bounding boxes, accessible labels.

You're paying frontier-tier rates for the second job. Per screenshot. Every step of the loop.
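To make the split concrete, here is a minimal sketch of one agent step as two calls. `ground` and `decide` are hypothetical callables standing in for the local grounding model and the frontier reasoning model; neither is a real browserground, browser-use, or Skyvern API. The `elements` shape is an assumption that follows the JSON example used later in this post.

```python
import json
from typing import Any, Callable, Dict, List

Element = Dict[str, Any]


def agent_step(
    screenshot: bytes,
    goal: str,
    ground: Callable[[bytes], str],               # hypothetical: local model, returns JSON text
    decide: Callable[[str, List[Element]], str],  # hypothetical: frontier model, returns an element id
) -> Element:
    """Run one step of the loop as two separate jobs."""
    # Call 1, grounding: pixels go in, structured JSON comes out.
    elements = json.loads(ground(screenshot))["elements"]

    # Call 2, reasoning: the frontier model sees only text, never the screenshot.
    chosen_id = decide(goal, elements)

    by_id = {el["id"]: el for el in elements}
    if chosen_id not in by_id:
        # The reasoning step named something the grounding step never reported.
        raise ValueError(f"reasoning step picked unknown element {chosen_id!r}")
    return by_id[chosen_id]
```

Because the two calls meet at a plain JSON boundary, either side can be swapped or logged on its own.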

Grounding Is a Parser Problem

Once you name it as a parser problem, the right tool changes. You don't need 200 billion parameters to emit a JSON list of clickable elements. You need a model that:

Has seen enough UI screenshots to recognize buttons, inputs, links with sub-50-pixel precision
Outputs strict JSON without hallucinating bounding boxes
Runs locally so the per-step cost is zero

A 2B specialist trained on screen-parsing data. Not a frontier model.

I trained one. Total cost: ~$5 of RunPod compute on a single A6000 GPU. The result, browserground, hits 60.0% on ScreenSpot-v2 vs GPT-4o's 18.3%, a 3.3x improvement at the click-grounding job. More telling: it beats SeeClick (9.6B params, 55.1%) while being 4.8x smaller. It is a drop-in for any agent loop currently handing screenshots to a frontier API. Today the CLI runs via transformers on Apple Silicon (~14 s/call); an MLX-native build is coming for the ~1.5 s path.
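Treating grounding as a parser problem also means the caller can validate output like parser output. The sketch below is one way to do that, not code from the browserground package. It assumes the model returns an object with the single key `bbox_2d` holding four numbers in `[x1, y1, x2, y2]` order; the key name comes from the comparison table, while the four-number layout is my assumption.

```python
import json
from typing import Tuple, Union

Number = Union[int, float]


def parse_bbox(raw: str) -> Tuple[Number, Number, Number, Number]:
    """Parse model output, rejecting anything that is not exactly the expected shape."""
    # json.JSONDecodeError is a ValueError, so prose or malformed output fails here.
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"bbox_2d"}:
        raise ValueError("expected a JSON object with the single key 'bbox_2d'")

    box = data["bbox_2d"]
    if not isinstance(box, list) or len(box) != 4:
        raise ValueError("bbox_2d must be a list of four numbers")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box):
        raise ValueError("bbox_2d entries must be numbers")

    x1, y1, x2, y2 = box
    if x1 >= x2 or y1 >= y2:
        raise ValueError("bbox_2d must have positive width and height")
    return (x1, y1, x2, y2)
```

A failed parse is a natural trigger for a retry or a cloud fallback, since the caller never has to guess at coordinates buried in prose.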

The Reasoning Model Gets Its Reasoning Capacity Back

When you split the call, the frontier model stops seeing pixels. It sees:

{
  "elements": [
    {"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]},
    {"id": "e8", "label": "Edit cart",    "type": "link",   "bbox": [...]}
  ]
}

Now the frontier model does the job it's good at: deciding e7 vs e8 given the agent's goal. A reasoning question over structured input. Cheap. Reliable. Auditable.
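As a small illustration of what happens on either side of that decision, the sketch below turns the element list into the text a reasoning model might see, then converts the chosen id back into a click point. The helper names are hypothetical, the `bbox` is assumed to be `[x1, y1, x2, y2]` in pixels, and only `e7` is included because its coordinates are the only ones given in the example.

```python
import json

GROUNDING_JSON = """
{
  "elements": [
    {"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]}
  ]
}
"""


def describe(elements):
    """Render elements as plain text lines for the reasoning prompt (no pixels)."""
    return "\n".join(f'{el["id"]}: {el["type"]} "{el["label"]}"' for el in elements)


def click_point(elements, element_id):
    """Return the integer center of the chosen element's bounding box."""
    for el in elements:
        if el["id"] == element_id:
            x1, y1, x2, y2 = el["bbox"]
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    raise KeyError(element_id)


elements = json.loads(GROUNDING_JSON)["elements"]
print(describe(elements))            # e7: button "Submit order"
print(click_point(elements, "e7"))   # (411, 635)
```

The reasoning model only ever returns an id such as `e7`; the coordinates never pass through it, so it cannot hallucinate them.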

Three things change at once. Per-step token spend on vision collapses, because the grounding step runs locally. JSON validity hits 100% (the specialist learned the output convention with 35M LoRA parameters on a Qwen3-VL-2B base). Agent traces become debuggable. You read the structured grounding output before the reasoning step ever runs.

What Anthropic and OpenAI Ship Next

The frontier providers will absorb grounding into their own small models. Within twelve months, "fast vision" or "tool vision" tiers will appear in both Anthropic and OpenAI billing at a fraction of frontier rates. The economics demand it. Nobody can justify charging GPT-5 prices for a parser, and Hugging Face downloads already prove the demand: SeeClick, UI-TARS, and ShowUI pull a combined ~300k downloads a month.

When that ships, stack owners who already split grounding from reasoning have three things the wait-and-see crowd doesn't. A local fallback if the provider has an outage. An auditable structured-grounding trace in every log line. An exit option to a different reasoning provider without re-validating click behavior, because the grounding step belongs to them.

Stack owners who didn't split will find their grounding step has quietly become someone else's API. Same vendor billing the reasoning calls. Same vendor setting the price. Same vendor's deprecation calendar.

The Diagnostic

Pull up your last failed agent trace. Three numbers:

Total tokens spent on vision calls per agent step.
Fraction of those tokens spent on grounding (parsing pixel coordinates) vs reasoning (deciding actions).
Per-run vision cost at your current API rates.

If grounding dominates the first two numbers, and in most stacks it does, your stack has the split wrong.
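The three numbers can be computed from a trace with a few lines of code. This is a sketch under stated assumptions: the `VisionCall` record is hypothetical, and in an unsplit stack the grounding/reasoning token split inside a single call is your own estimate, since the provider bills them together.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VisionCall:
    grounding_tokens: int  # estimated tokens spent locating elements (image input, coordinates)
    reasoning_tokens: int  # estimated tokens spent deciding what to do
    cost_usd: float        # what the provider billed for this call


def diagnose(calls: List[VisionCall], steps: int) -> Dict[str, float]:
    """Compute the three diagnostic numbers for one agent run."""
    grounding = sum(c.grounding_tokens for c in calls)
    reasoning = sum(c.reasoning_tokens for c in calls)
    total = grounding + reasoning
    return {
        "vision_tokens_per_step": total / steps if steps else 0.0,
        "grounding_fraction": grounding / total if total else 0.0,
        "vision_cost_per_run_usd": sum(c.cost_usd for c in calls),
    }


# Illustrative numbers only, not measurements.
run = [VisionCall(800, 200, 0.02), VisionCall(900, 100, 0.03)]
print(diagnose(run, steps=2))
```

A `grounding_fraction` well above 0.5 is the signal the section describes: most of the vision bill is going to locating elements rather than deciding what to do.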

Grounding is plumbing. Reasoning is cognition. Stop paying cognition rates for plumbing.

I build the split layer. browserground is the open-source reference for the local grounding half. v0.3 ships three packagings so it drops into any stack:

npm CLI (daemon, HTTP REST server, batch, confidence, eval): npm install -g browserground → browserground parse <img> --target "..."
PyPI (no Node required, MLX or transformers): pip install "browserground[mlx]" → from browserground import click_xy
Ollama (cross-platform, GGUF Q4_K_M + f16 mmproj): ollama run renezander030/browserground

Adapters land in the repo for browser-use (drop-in Controller action) and Skyvern (ground_with_fallback for local-first + cloud-fallback). Model: huggingface.co/renezander030/browserground. MLX 4-bit: browserground-mlx. GGUF: browserground-gguf. Source: github.com/renezander030/browserground. Apache-2.0. v0.2 LoRA trained on 26k mixed-domain examples (macOS + Android + UIBert + web). PRs welcome, especially eval cases where it fails.

I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at renezander.com.

2 Comments

romainl · Answer 1 · 2026-05-21T05:45:01+0000

romainl • May 21

Interesting point. Feels like the real bottleneck is still understanding intent, not just controlling the browser.

René Zander • May 21

@[romainl] Exactly. Executing the actions is a solved engineering problem. The real frontier is building models that can handle ambiguity until Web MCP or something like it is widely adopted.
