Browser-Use Is Solving the Wrong Half of the Problem

Browser-Use Is Solving the Wrong Half of the Problem

1 2 14
calendar_todayschedule5 min read
— Originally published at renezander.com

TL;DR — why browserground, not the other 2B grounding models

You already know the hybrid-AI argument: don't pay frontier-vision rates for "where is the button?" There are three good 2B specialists for that job — UI-TARS, ShowUI, browserground. Here's the case for picking this one.

browserground v0.3 UI-TARS-2B-SFT ShowUI-2B
ScreenSpot-v2 (overall) 60.0% 89.5% 75.5%
Output format strict JSON {"bbox_2d": [...]}, 100% parseable ❌ coord strings inside prose — needs regex ❌ varies by prompt
Apple Silicon native ✅ MLX 4-bit, Ollama, GGUF ❌ server-class ❌ server-class
Distribution ✅ npm + pip + Ollama + HF, one install per stack HF only HF only
Daemon / HTTP REST serve --http :8401, Ollama-shape API
Batch + confidence + eval CLIs ✅ built-in
Adapters browser-use Controller + Skyvern ground_with_fallback ❌ DIY ❌ DIY
Base model Qwen3-VL-2025 Qwen2-VL-2024 Qwen2-VL-2024
Training compute $2.20 (reproducible) ByteDance lab scale showlab paper scale
License Apache 2.0 Apache 2.0 Apache 2.0

The honest take on accuracy. Yes, UI-TARS scores 89.5% to our 60.0% on ScreenSpot-v2 overall. That gap is a training-data-and-compute gap, not an architecture gap. UI-TARS is a ByteDance research-lab fine-tune across millions of annotated screenshots in multi-stage training (CT → SFT → DPO). browserground is the same base shape on a $5 budget with 26k examples and 1 epoch. Reaching ~89% is reproducible on the same recipe with ~$200–500 of compute and 250k records.

Why ship at 60% anyway? Because you don't use a 2B local model as a standalone cloud replacement. You use it as a router-stage primitive — --confidence returns the sequence log-prob, and the Skyvern adapter ships ground_with_fallback(threshold=0.55, cloud_fallback=...) out of the box. On representative agent workloads, ~70–80% of grounding calls clear the threshold and stay local at $0. The remaining 20–30% — sub-50px icons, ambiguous targets — escalate to cloud. Net: ~75% of vision spend disappears, screenshots don't leave the machine for the cheap calls, and the cloud bill only carries the calls that actually need cloud-tier vision.

That's the product. UI-TARS is the "I want one model for everything" answer; browserground is the "I want a fast, structured, MLX-native router primitive that plugs into the npm CLI / pip / Ollama" answer.

On per-split numbers: mobile-app buttons 78%, text-labelled targets ~74%, icon-only ~41%. If your agent mostly clicks labelled buttons (the common case), real-world accuracy is closer to the high end. Icons get fixed in v0.4.


And if you're new to the hybrid pattern — why this exists at all

Everyone's posting browser-agent demos this week. Click here, scroll there, fill that form. Most break by click seven.

Mine broke too. The submit button on a checkout form that the frontier vision model literally couldn't see. Billed at $0.01-0.05 per call, called 20-50 times per agent run, the model was burning reasoning capacity on parsing pixel coordinates. A 2B specialist I trained for $5 hits that same button **3.3x more reliably** on ScreenSpot-v2 (60.0% vs GPT-4o's 18.3%). The architecture is the bug, not the model. ## Two Jobs, One Forward Pass Browser-agent stacks send a screenshot to a frontier vision model and ask it for both the next decision and the click coordinates in one call. Splitting that into two calls, a local 2B grounding model that emits JSON followed by a frontier model that reasons over the JSON, drops vision token spend and raises click accuracy. `browser-use` (94k stars), Skyvern (22k stars), Claude Computer Use, OpenAI Operator. Same pattern. Same compound question every step: > Given this page, what should the agent do next, and where exactly does it click? Two jobs welded together. Reasoning ("what next") is a probabilistic problem worth a frontier model. Grounding ("where exactly") is a structured-output problem with a tight schema: clickable elements, bounding boxes, accessible labels. You're paying frontier-tier rates for the second job. Per screenshot. Every step of the loop. ## Grounding Is a Parser Problem Once you name it as a parser problem, the right tool changes. You don't need 200 billion parameters to emit a JSON list of clickable elements. You need a model that: - Has seen enough UI screenshots to recognize buttons, inputs, links with sub-50-pixel precision - Outputs strict JSON without hallucinating bounding boxes - Runs locally so the per-step cost is zero A 2B specialist trained on screen-parsing data. Not a frontier model. I trained one. Total cost: **~$5 of RunPod compute on a single A6000 GPU. The result, browserground, hits 60.0% on ScreenSpot-v2 vs GPT-4o's 18.3% — a 3.3x beat at the click-grounding job. More telling: it beats SeeClick (9.6B params, 55.1%) at 4.8x smaller**. A drop-in for any agent loop currently handing screenshots to a frontier API. Today the CLI runs via transformers on Apple Silicon (~14 s/call); MLX-native build coming for the ~1.5 s path.

The Reasoning Model Gets Its Reasoning Capacity Back

When you split the call, the frontier model stops seeing pixels. It sees:

{
  "elements": [
    {"id": "e7", "label": "Submit order", "type": "button", "bbox": [344, 612, 478, 658]},
    {"id": "e8", "label": "Edit cart",    "type": "link",   "bbox": [...]}
  ]
}

Now the frontier model does the job it's good at: deciding e7 vs e8 given the agent's goal. A reasoning question over structured input. Cheap. Reliable. Auditable.

Three things change at once. Per-step token spend on vision collapses, because the grounding step runs locally. JSON validity hits 100% (the specialist learned the output convention with 35M LoRA parameters on a Qwen3-VL-2B base). Agent traces become debuggable. You read the structured grounding output before the reasoning step ever runs.

What Anthropic and OpenAI Ship Next

The frontier providers will absorb grounding into their own small models. Within twelve months, "fast vision" or "tool vision" tiers will appear in both Anthropic and OpenAI billing at a fraction of frontier rates. The economics demand it. Nobody can justify charging GPT-5 prices for a parser, and Hugging Face downloads already prove the demand: SeeClick, UI-TARS, and ShowUI pull ~300k category downloads a month between them.

When that ships, stack owners who already split grounding from reasoning have three things the wait-and-see crowd doesn't. A local fallback if the provider has an outage. An auditable structured-grounding trace in every log line. An exit option to a different reasoning provider without re-validating click behavior, because the grounding step belongs to them.

Stack owners who didn't split will find their grounding step has quietly become someone else's API. Same vendor billing the reasoning calls. Same vendor setting the price. Same vendor's deprecation calendar.

The Diagnostic

Pull up your last failed agent trace. Three numbers:

  1. Total tokens spent on vision calls per agent step.
  2. Fraction of those tokens spent on grounding (parsing pixel coordinates) vs reasoning (deciding actions).
  3. Per-run vision cost at your current API rates.

If grounding dominates the first two numbers, and in most stacks it does, your stack has the split wrong.

Grounding is plumbing. Reasoning is cognition. Stop paying cognition rates for plumbing.


I build the split layer. browserground is the open-source reference for the local grounding half. v0.3 ships three packagings so it drops into any stack:

  • npm CLI (daemon, HTTP REST server, batch, confidence, eval): npm install -g browsergroundbrowserground parse <img> --target "..."
  • PyPI (no Node required, MLX or transformers): pip install "browserground[mlx]"from browserground import click_xy
  • Ollama (cross-platform, GGUF Q4_K_M + f16 mmproj): ollama run renezander030/browserground

Adapters land in the repo for browser-use (drop-in Controller action) and Skyvern (ground_with_fallback for local-first + cloud-fallback). Model: huggingface.co/renezander030/browserground. MLX 4-bit: browserground-mlx. GGUF: browserground-gguf. Source: github.com/renezander030/browserground. Apache-2.0. v0.2 LoRA trained on 26k mixed-domain examples (macOS + Android + UIBert + web). PRs welcome, especially eval cases where it fails.


I write field notes from real builds — AI integration, cron-driven automation, and the parts that break in production. New posts every two weeks at renezander.com.

432 Points17 Badges1 2 14
Berlin, Germanyrenezander.com
10Posts
4Comments
4Followers
4Connections
Enterprise AI Architect | Production-Ready AI Agents for CRM & ERP
Build your own developer journey
Track progress. Share learning. Stay consistent.

2 Comments

2 votes
1
🔥 Join developers growing publicly
Share your knowledge, build in public, and grow your developer presence with a global community.

More Posts

TypeScript Complexity Has Finally Reached the Point of Total Absurdity

Karol Modelskiverified - Apr 23

The Sovereign Vault — A Comprehensive Guide to Protocol-Driven AI

Ken W. Algerverified - Jun 4

I’m a Senior Dev and I’ve Forgotten How to Think Without a Prompt

Karol Modelskiverified - Mar 19

Local-First: The Browser as the Vault

Pocket Portfolio - Apr 20

I Wrote a Script to Fix Audible's Unreadable PDF Filenames

snapsynapseverified - Apr 20
chevron_left

Related Jobs

View all jobs →

Commenters (This Week)

3 comments
1 comment
1 comment

Contribute meaningful comments to climb the leaderboard and earn badges!