I built a web scraper in Rust that bypasses Cloudflare without a browser

— Originally published at dev.to

Every AI agent has the same problem. You ask it to read a webpage and it comes back with a 403, or worse, 5000 tokens of navigation bars and cookie banners.

I spent the last few months building webclaw to fix this.

The problem

Try fetching any real website with a standard HTTP client. Most of them will block you. Cloudflare, Akamai, and DataDome all look at your TLS fingerprint before the request even reaches the server.

The usual fix is spinning up headless Chrome. That works, but now you need 500 MB of browser, it takes 2-3 seconds per page, and you still get all the HTML noise.

What webclaw does differently

Instead of launching a browser, webclaw impersonates one at the TLS level. The TLS handshake, cipher suites, and extensions all look like Chrome 142. Most anti-bot systems pass the request through because the fingerprint matches a real browser.

Then the extraction engine scores every DOM node by text density, semantic tags, and link ratio. Navigation, ads, footers, cookie banners get stripped. What comes out is clean markdown.

A real example: a news article that is 4,820 tokens as raw HTML becomes 1,590 tokens after webclaw processes it. Same content, 67% fewer tokens.
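To check the arithmetic behind that figure, here is a small sketch. The function name is made up for illustration and is not part of webclaw.

```rust
/// Percentage of tokens removed when going from `before` to `after`.
/// Hypothetical helper, used here only to check the numbers quoted above.
fn token_reduction_percent(before: u32, after: u32) -> f64 {
    if before == 0 {
        return 0.0; // avoid dividing by zero on an empty page
    }
    (before as f64 - after as f64) / before as f64 * 100.0
}

fn main() {
    // 4,820 raw HTML tokens down to 1,590 tokens of markdown.
    let pct = token_reduction_percent(4_820, 1_590);
    println!("{pct:.1}% fewer tokens");
}
```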

Architecture

webclaw is a Rust workspace with 6 crates:

webclaw-core    pure extraction, zero network deps, WASM-safe
webclaw-fetch   HTTP + TLS fingerprinting via primp
webclaw-llm     LLM provider chain (Ollama > OpenAI > Anthropic)
webclaw-pdf     PDF text extraction
webclaw-cli     CLI binary
webclaw-mcp     MCP server for AI agents

The split between core and fetch was intentional. webclaw-core takes a &str of HTML and returns structured output. No I/O, no network calls, no allocator tricks. It should compile to WASM without changes.
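A rough sketch of what "takes a &str, returns structured output" can look like. The type and function names are assumptions for illustration, and the naive tag stripping stands in for the real extraction; this is not webclaw-core's actual API.

```rust
/// Hypothetical output type; the real webclaw-core types may differ.
#[derive(Debug, PartialEq)]
struct Extracted {
    title: Option<String>,
    text: String,
}

/// Pure function: HTML in, structured data out. No I/O, no network.
fn extract(html: &str) -> Extracted {
    Extracted {
        title: between(html, "<title>", "</title>").map(|t| t.trim().to_string()),
        text: strip_tags(between(html, "<body>", "</body>").unwrap_or(html)),
    }
}

/// Returns the slice between the first `open` and the next `close`.
fn between<'a>(s: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = s.find(open)? + open.len();
    let end = start + s[start..].find(close)?;
    Some(&s[start..end])
}

/// Drops everything inside `<...>` and collapses runs of whitespace.
fn strip_tags(s: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in s.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' '); // keep words from adjacent elements apart
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let html = "<html><head><title> Hi </title></head>\
                <body><p>Hello <b>world</b></p></body></html>";
    println!("{:?}", extract(html));
}
```

Because the function only depends on its input string, the same code can be called from a CLI, a server, or a WASM module without touching the fetch layer.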

Extraction speed on the core alone (no network):

Page size   Time
10 KB       0.8 ms
100 KB      3.2 ms
500 KB      12.1 ms

How to use it

CLI

# basic extraction
webclaw https://example.com

# different output formats
webclaw https://example.com -f json
webclaw https://example.com -f llm

# crawl a docs site
webclaw https://docs.example.com --crawl --depth 2

# extract structured data with LLM
webclaw https://example.com --extract-prompt "get all pricing tiers"

# track page changes
webclaw https://example.com -f json > snapshot.json
webclaw https://example.com --diff-with snapshot.json

MCP server (for Claude, Cursor, Windsurf, Codex)

npx create-webclaw

One command. It detects what AI tools you have installed and writes the config for each one. After restart you get 10 tools: scrape, crawl, search, extract, summarize, brand, diff, map, batch, research.

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

128 MB image. Works on any machine.

Benchmarks

Tested on 50 real pages across news sites, documentation, e-commerce, SPAs, and blogs.

Metric                webclaw   readability   trafilatura   newspaper3k
Extraction accuracy   95.1%     83%           80%           66%
Noise removal         96.1%     79%           73%           61%

The biggest wins are on JavaScript-heavy sites. When the visible DOM is empty because content is in embedded JSON (Next.js, React SSR payloads), webclaw has a data island extractor that pulls content from __NEXT_DATA__, window.__data, and similar patterns. Most other tools return nothing.
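A minimal sketch of the data island idea for the __NEXT_DATA__ case, assuming the payload sits in a script tag with id="__NEXT_DATA__" as Next.js emits it. The function name is hypothetical, and it only locates the raw JSON string; parsing it and walking the tree for content is left out.

```rust
/// Returns the raw JSON inside `<script id="__NEXT_DATA__" ...>`, if present.
/// Hypothetical helper; plain string search instead of a real HTML parser.
fn next_data_payload(html: &str) -> Option<&str> {
    let marker = "id=\"__NEXT_DATA__\"";
    let tag_start = html.find(marker)?;
    // Skip to the end of the opening <script ...> tag.
    let body_start = tag_start + html[tag_start..].find('>')? + 1;
    // The payload runs until the closing tag.
    let body_end = body_start + html[body_start..].find("</script>")?;
    Some(html[body_start..body_end].trim())
}

fn main() {
    let html = r#"<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">{"props":{"title":"Hello"}}</script>"#;

    match next_data_payload(html) {
        Some(json) => println!("found data island: {json}"),
        None => println!("no data island"),
    }
}
```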

What I learned building this

TLS fingerprinting is fragile. Chrome updates its cipher suites every few versions and you have to keep up. I am using primp, which maintains patched forks of rustls, hyper, and h2. It works well but it is a maintenance burden. If Chrome ships a new TLS extension tomorrow, requests start getting blocked until the forks are updated.

The extraction scoring took the most iteration. Early versions were too aggressive and would strip content that looked like navigation (short paragraphs with links). The fix was a semantic bonus system: nodes inside <article> or <main> tags get a score boost, nodes with content-related class names get another boost. Combined with link density penalties, it handles most layouts without site-specific rules.
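Here is a simplified sketch of that kind of scoring. The struct, the class-name hints, and the weights (1.5 and 1.25) are all assumptions picked for illustration, not webclaw's real values.

```rust
/// Hypothetical per-node features; field names are illustrative only.
struct Node<'a> {
    class: &'a str,
    text_len: usize,      // characters of text in the node
    link_text_len: usize, // characters of that text inside <a> tags
    inside_article_or_main: bool,
}

/// Assumed list of content-related class name fragments.
const CONTENT_HINTS: [&str; 3] = ["content", "article", "post"];

/// Text length, penalized by link density, boosted by semantic hints.
fn score(node: &Node) -> f64 {
    if node.text_len == 0 {
        return 0.0;
    }
    let link_density = (node.link_text_len as f64 / node.text_len as f64).min(1.0);
    let mut score = node.text_len as f64 * (1.0 - link_density);
    if node.inside_article_or_main {
        score *= 1.5; // assumed weight
    }
    if CONTENT_HINTS.iter().any(|h| node.class.contains(*h)) {
        score *= 1.25; // assumed weight
    }
    score
}

fn main() {
    // A short paragraph with one link, inside <article>.
    let teaser = Node { class: "post-content", text_len: 40, link_text_len: 10, inside_article_or_main: true };
    // A nav block of the same length where all of the text is links.
    let nav = Node { class: "nav", text_len: 40, link_text_len: 40, inside_article_or_main: false };

    println!("teaser = {}, nav = {}", score(&teaser), score(&nav));
}
```

The short linked paragraph survives because the bonuses outweigh its moderate link density, while the pure-link navigation block scores zero.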

Try it

MIT licensed, fully open source.

GitHub: https://github.com/0xMassi/webclaw
Website: https://webclaw.io
Discord: https://discord.gg/KDfd48EpnW

If you run into a site that webclaw fails on, open an issue. Every edge case makes the extraction better.
