I built a web scraper in Rust that bypasses Cloudflare without a browser

May 14 • 3 min read

— Originally published at dev.to

Every AI agent has the same problem. You ask it to read a webpage and it comes back with a 403, or worse, 5000 tokens of navigation bars and cookie banners.

I spent the last few months building webclaw to fix this.

The problem

Try fetching any real website with a standard HTTP client. Most of them will block you. Cloudflare, Akamai, and DataDome all look at your TLS fingerprint before the request even reaches the server.

The usual fix is to spin up headless Chrome. That works, but now you need 500 MB of browser, each page takes 2-3 seconds, and you still get all the HTML noise.

What webclaw does differently

Instead of launching a browser, webclaw impersonates one at the TLS level. The TLS handshake, cipher suites, and extensions all look like Chrome 142. Most anti-bot systems pass the request through because the fingerprint already matches a real browser.

Then the extraction engine scores every DOM node by text density, semantic tags, and link ratio. Navigation, ads, footers, cookie banners get stripped. What comes out is clean markdown.
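
To make two of those signals concrete, here is a minimal sketch of how link ratio and text density could be measured for a single node. The `NodeStats` struct and its fields are hypothetical, not webclaw's actual types, and the exact definition of text density used here (visible text characters per character of markup) is an assumption.

```rust
/// Hypothetical per-node counts gathered while walking the DOM.
/// These names are illustrative, not webclaw's real API.
#[derive(Debug, Clone, Copy)]
pub struct NodeStats {
    /// Characters of visible text inside the node.
    pub text_chars: usize,
    /// Characters of visible text that sit inside <a> elements.
    pub link_chars: usize,
    /// Total characters of the node's serialized HTML (tags included).
    pub html_chars: usize,
}

impl NodeStats {
    /// Fraction of the node's text that is link text (0.0 to 1.0).
    /// Navigation menus tend toward 1.0, article paragraphs toward 0.0.
    pub fn link_ratio(&self) -> f64 {
        if self.text_chars == 0 {
            return 0.0;
        }
        self.link_chars as f64 / self.text_chars as f64
    }

    /// Fraction of the node's markup that is visible text (0.0 to 1.0).
    pub fn text_density(&self) -> f64 {
        if self.html_chars == 0 {
            return 0.0;
        }
        self.text_chars as f64 / self.html_chars as f64
    }
}
```

A menu with 40 characters of text, 36 of them inside links, gets a link ratio of 0.9, which is the kind of node a scorer would want to drop.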

A real example: a news article that is 4,820 tokens as raw HTML becomes 1,590 tokens after webclaw processes it. Same content, 67% fewer tokens.

Architecture

webclaw is a Rust workspace with 6 crates:

webclaw-core    pure extraction, zero network deps, WASM-safe
webclaw-fetch   HTTP + TLS fingerprinting via primp
webclaw-llm     LLM provider chain (Ollama > OpenAI > Anthropic)
webclaw-pdf     PDF text extraction
webclaw-cli     CLI binary
webclaw-mcp     MCP server for AI agents
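
If the "Ollama > OpenAI > Anthropic" ordering in webclaw-llm means "try each provider in turn and fall back on failure" (an assumption, the post does not spell it out), the pattern could be sketched like this. The `Provider` trait, `complete_with_fallback`, and `Stub` are all hypothetical names for illustration and are not webclaw-llm's real API.

```rust
/// Hypothetical provider interface; not webclaw-llm's real trait.
pub trait Provider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Tries each provider in order and returns the first success,
/// together with the name of the provider that answered.
/// If every provider fails, the last error is returned.
pub fn complete_with_fallback(
    chain: &[Box<dyn Provider>],
    prompt: &str,
) -> Result<(String, String), String> {
    let mut last_err = String::from("provider chain is empty");
    for provider in chain {
        match provider.complete(prompt) {
            Ok(text) => return Ok((provider.name().to_string(), text)),
            Err(err) => last_err = format!("{}: {}", provider.name(), err),
        }
    }
    Err(last_err)
}

/// Stub used to exercise the chain without any network access.
pub struct Stub {
    pub name: &'static str,
    pub reply: Option<&'static str>,
}

impl Provider for Stub {
    fn name(&self) -> &str {
        self.name
    }

    fn complete(&self, _prompt: &str) -> Result<String, String> {
        self.reply
            .map(str::to_string)
            .ok_or_else(|| "unavailable".to_string())
    }
}
```

Putting the local provider first means a machine running Ollama never has to send page content to a hosted API, while machines without it still get an answer from the next provider in line.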

The split between core and fetch was intentional. webclaw-core takes a &str of HTML and returns structured output. No I/O, no network calls, no allocator tricks. It should compile to WASM without changes.
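
The shape of that boundary can be sketched as a single pure function. The `Extracted` struct and `extract` function below are hypothetical stand-ins, not webclaw-core's real types, and the title lookup is a deliberately naive placeholder for real HTML parsing. The point is only the signature: borrowed HTML in, owned data out, standard library only.

```rust
/// Hypothetical output type; webclaw-core's real structs may differ.
#[derive(Debug, PartialEq)]
pub struct Extracted {
    pub title: Option<String>,
    pub html_bytes: usize,
}

/// Pure function: borrows the HTML, returns owned data, performs no I/O.
/// The title lookup is a naive stand-in for real parsing: it only matches
/// a lowercase, attribute-free <title> tag.
pub fn extract(html: &str) -> Extracted {
    let title = html.find("<title>").and_then(|start| {
        let rest = &html[start + "<title>".len()..];
        rest.find("</title>")
            .map(|end| rest[..end].trim().to_string())
    });
    Extracted {
        title,
        html_bytes: html.len(),
    }
}
```

Because nothing in a function like this touches sockets, files, or threads, the same code can be called from a CLI, a server, or a WASM module, and it can be tested against saved HTML fixtures without a network.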

Extraction speed on the core alone (no network):

Page size	Time
10 KB	0.8ms
100 KB	3.2ms
500 KB	12.1ms

How to use it

CLI

# basic extraction
webclaw https://example.com

# different output formats
webclaw https://example.com -f json
webclaw https://example.com -f llm

# crawl a docs site
webclaw https://docs.example.com --crawl --depth 2

# extract structured data with LLM
webclaw https://example.com --extract-prompt "get all pricing tiers"

# track page changes
webclaw https://example.com -f json > snapshot.json
webclaw https://example.com --diff-with snapshot.json

MCP server (for Claude, Cursor, Windsurf, Codex)

npx create-webclaw

One command. It detects what AI tools you have installed and writes the config for each one. After restart you get 10 tools: scrape, crawl, search, extract, summarize, brand, diff, map, batch, research.

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

128 MB image. Works on any machine that can run Docker.

Benchmarks

Tested on 50 real pages across news sites, documentation, e-commerce, SPAs, and blogs.

Metric	webclaw	readability	trafilatura	newspaper3k
Extraction accuracy	95.1%	83%	80%	66%
Noise removal	96.1%	79%	73%	61%

The biggest wins are on JavaScript-heavy sites. When the visible DOM is empty because the content is in embedded JSON (Next.js, React SSR payloads), webclaw has a data island extractor that pulls content from __NEXT_DATA__, window.__data, and similar patterns. Most other tools return nothing.
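
As a rough illustration of the data island idea, here is a minimal sketch that locates the `__NEXT_DATA__` script tag and returns its raw JSON text. The function name `next_data_json` is hypothetical and this is not webclaw's implementation. It uses plain string scanning and assumes the `id` attribute is written with double quotes and that the payload contains no literal `</script>`.

```rust
/// Returns the raw JSON text inside <script id="__NEXT_DATA__" ...>, if any.
/// Naive string scanning for illustration only; a real extractor would use
/// an HTML parser and then walk the parsed JSON for content fields.
pub fn next_data_json(html: &str) -> Option<&str> {
    let marker = "id=\"__NEXT_DATA__\"";
    let attr_pos = html.find(marker)?;
    // Skip to the end of the opening <script ...> tag.
    let open_end = attr_pos + html[attr_pos..].find('>')? + 1;
    // The payload runs until the matching closing tag.
    let close = open_end + html[open_end..].find("</script>")?;
    Some(html[open_end..close].trim())
}
```

The returned slice would still need to be parsed with a JSON library and searched for the fields that hold the article body, which vary from site to site.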

What I learned building this

TLS fingerprinting is fragile. Chrome updates its cipher suites every few versions and you have to keep up. I am using primp, which maintains patched forks of rustls, hyper, and h2. It works well but it is a maintenance burden. If Chrome ships a new TLS extension tomorrow, requests start getting blocked until the forks are updated.

The extraction scoring took the most iteration. Early versions were too aggressive and would strip content that looked like navigation (short paragraphs with links). The fix was a semantic bonus system: nodes inside <article> or <main> tags get a score boost, nodes with content-related class names get another boost. Combined with link density penalties, it handles most layouts without site-specific rules.
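
A bonus-and-penalty score of this kind could be combined roughly as follows. This is a sketch under stated assumptions, not webclaw's scoring code: the `Candidate` struct, the weights, and the list of class-name hints are all made up for illustration, since the post does not give the real values.

```rust
/// Hypothetical summary of one DOM node; not webclaw's real type.
pub struct Candidate<'a> {
    pub text_chars: usize,
    pub link_chars: usize,
    pub inside_article_or_main: bool,
    pub class_attr: &'a str,
}

// Illustrative weights and hints; the real values are not given in the post.
const SEMANTIC_BONUS: f64 = 25.0;
const CLASS_BONUS: f64 = 10.0;
const CONTENT_HINTS: [&str; 3] = ["content", "article", "post"];

pub fn score(node: &Candidate<'_>) -> f64 {
    // Base score grows with the amount of text in the node.
    let mut score = node.text_chars as f64 / 100.0;

    // Semantic bonus: the node lives inside <article> or <main>.
    if node.inside_article_or_main {
        score += SEMANTIC_BONUS;
    }

    // Class-name bonus: the class attribute hints at main content.
    let class = node.class_attr.to_ascii_lowercase();
    if CONTENT_HINTS.iter().any(|hint| class.contains(hint)) {
        score += CLASS_BONUS;
    }

    // Link density penalty: scale the score down by the share of link text.
    let link_density = if node.text_chars == 0 {
        0.0
    } else {
        node.link_chars as f64 / node.text_chars as f64
    };
    score * (1.0 - link_density)
}
```

With these numbers, a short paragraph inside `<article>` where half the text is links still scores far above a navigation bar of the same length, which is the failure mode the bonus system was meant to fix.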

Try it

MIT licensed, fully open source.

GitHub: https://github.com/0xMassi/webclaw
Website: https://webclaw.io
Discord: https://discord.gg/KDfd48EpnW

If you run into a site that webclaw fails on, open an issue. Every edge case makes the extraction better.

1 Comment

romainl · Answer 1 · 2026-05-15T17:08:48+0000

This is actually impressive. Curious though how often does it break against stricter Cloudflare configs?
