How We Built a Free SEO Audit Tool with Puppeteer and Chromium

cgutierrez1145 posted 5 days ago · 6 min read

Real browser rendering, 60+ modular checks, and the engineering lessons we learned building a headless Chromium audit pipeline

When we started building our SEO auditing pipeline, the first architectural decision we made was also the most important one:

We would never parse raw HTML.

Most SEO audit tools still rely on that approach because it’s lightweight and efficient for static websites. But frontend frameworks like React, Next.js, and Vue changed the landscape completely.

We kept running into the same issue:

Traditional parsers were auditing code that users and Google never actually saw.

So we took a different route:
real browser rendering with Puppeteer and headless Chromium.

Here’s how we built the system and what we learned along the way.

The Core Problem With HTML Parsers

Most SEO auditors work something like this:

const response = await fetch(url);
const html = await response.text();

// parse html and inspect tags

For static sites, that works fine.

But JavaScript applications often return almost empty HTML responses initially:

<html>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.chunk.js"></script>
  </body>
</html>

The actual SEO-relevant content, including:

H1 tags
Meta descriptions
Structured data
Canonical tags
Images
Internal links

doesn’t exist yet.

It gets generated after JavaScript executes in the browser.

An HTML parser never sees that content.

Googlebot renders JavaScript. Your audit tool should, too.
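To make the gap concrete, here is a minimal sketch that runs a few crude pattern checks against the empty shell shown above. The `findSeoSignals` helper and its regular expressions are hypothetical and only illustrate the point; they are not how a production check should inspect a page.

```javascript
// The near-empty shell a client-rendered app might return before JavaScript runs.
const rawHtml = `<html>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.chunk.js"></script>
  </body>
</html>`;

// Hypothetical helper: crude pattern checks, for illustration only.
function findSeoSignals(html) {
  return {
    title: /<title[\s>]/i.test(html),
    h1: /<h1[\s>]/i.test(html),
    metaDescription: /<meta[^>]+name=["']description["']/i.test(html),
    canonical: /<link[^>]+rel=["']canonical["']/i.test(html),
    structuredData: /<script[^>]+type=["']application\/ld\+json["']/i.test(html),
  };
}

// Collect every signal the raw response fails to provide.
const signals = findSeoSignals(rawHtml);
const missing = Object.keys(signals).filter((key) => !signals[key]);

console.log('Missing before rendering:', missing);
```

Every signal is reported as missing for this response, even though the same page may expose all of them once its JavaScript has run.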

One of the most surprising things we discovered during testing was how little meaningful content some frameworks returned before hydration. In several React-based applications, the initial response contained almost none of the content users eventually saw in the browser.

In our internal testing, more than half of the React-based sites we audited returned incomplete metadata before rendering. Some pages were missing titles, canonical tags, structured data, or even visible heading content entirely until JavaScript finished executing.

That gap became impossible to ignore.

Why We Chose Puppeteer

We evaluated several options before deciding on our stack:

Playwright: excellent tooling, but heavier than we needed for a Chromium-only workflow
Selenium: powerful, but designed more for browser testing than rendering audits
Cheerio + axios: extremely fast, but limited to static HTML parsing
Puppeteer: lightweight Chromium automation with a straightforward API and strong ecosystem support

Puppeteer ultimately made the most sense for our use case.

We didn’t need multi-browser automation.

We needed rendering accuracy.

That narrowed the field quickly.

The Rendering Pipeline

Here’s a simplified version of the core audit flow:

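const puppeteer = require('puppeteer');

async function auditPage(url) {
  const browser = await puppeteer.launch({
    headless: true,
    // Disabling the sandbox is common in containers, but it weakens isolation.
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  try {
    const page = await browser.newPage();

    await page.setUserAgent(
      'Mozilla/5.0 (compatible; DeepAuditBot/1.0; +https://axiondeepdigital.com)'
    );

    // Collect every network request the page makes while rendering.
    const resources = [];
    page.on('request', (req) => resources.push(req));

    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: 30000,
    });

    // Scroll through the page to trigger lazy-loaded content (helper shown later).
    await autoScroll(page);

    // Serialize the rendered DOM, not the original HTML response.
    const dom = await page.evaluate(() => document.documentElement.outerHTML);

    return { dom, resources };
  } finally {
    // Always close the browser, even if navigation throws or times out.
    await browser.close();
  }
}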

The key detail here is:

waitUntil: 'networkidle2'

This tells Puppeteer to consider navigation finished once there have been no more than two active network connections for at least 500 ms.

Without this step, audits frequently captured incomplete pages before JavaScript finished rendering critical content.

This became especially important for:

Hydration-heavy React apps
Lazy-loaded images
Dynamically injected metadata
Client-side routing frameworks

Waiting for the network to stabilize before auditing eliminated many of the incomplete renders we encountered early in development.
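The rule behind networkidle2 can be modeled in a few lines. This sketch is only an illustration of the "no more than two connections for 500 ms" condition applied to a recorded timeline of request events; it is not Puppeteer's internal implementation, and the `firstIdleTime` function and its event shape are hypothetical.

```javascript
// Given request start/end events (times in ms since navigation began), return
// the time at which the page would first count as "network idle", or null if
// it never does. Illustrative model only.
function firstIdleTime(events, { maxInflight = 2, idleMs = 500 } = {}) {
  let inflight = 0;
  let quietSince = 0; // start of the current quiet window, or null while busy

  const ordered = [...events].sort((a, b) => a.t - b.t);
  for (const { t, type } of ordered) {
    // The quiet window lasted long enough before this event arrived.
    if (quietSince !== null && t - quietSince >= idleMs) {
      return quietSince + idleMs;
    }

    inflight += type === 'start' ? 1 : -1;

    if (inflight > maxInflight) {
      quietSince = null; // too many connections: the window resets
    } else if (quietSince === null) {
      quietSince = t; // dropped back to the threshold: a new window begins
    }
  }

  return quietSince === null ? null : quietSince + idleMs;
}

const timeline = [
  { t: 0, type: 'start' },
  { t: 10, type: 'start' },
  { t: 20, type: 'start' }, // three connections: busy
  { t: 300, type: 'end' },  // back to two: quiet window opens
  { t: 600, type: 'start' }, // busy again after only 300 ms
  { t: 700, type: 'end' },  // quiet window reopens and is never interrupted
  { t: 900, type: 'end' },
  { t: 1000, type: 'end' },
];

console.log(firstIdleTime(timeline));
```

In this timeline the first quiet window is interrupted after 300 ms, so the idle condition is only met 500 ms after the second window opens at 700 ms. A late burst of requests has the same effect on a real page: it pushes back the moment the audit starts.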

Handling Lazy-Loaded Content

Another challenge we encountered was lazy loading.

Many sites only load images and components once the user scrolls down the page. A simple page load misses large portions of the content entirely.

To solve this, we implemented an incremental scrolling helper:

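async function autoScroll(page) {
  // Runs inside the page context, so window and document refer to the audited page.
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 200; // pixels scrolled per step

      // Scroll one step every 100 ms so lazy-load handlers have time to fire.
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;

        // Stop once we have scrolled past the current page height.
        // Note: scrollHeight can keep growing on infinite-scroll pages.
        if (totalHeight >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}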

This scrolling behavior triggers:

Intersection observers
Lazy load listeners
Deferred image requests

in much the same way as real user interaction would.

Without scrolling, some audits completely missed below-the-fold image sections, deferred components, and dynamically injected content.

Building the Audit Engine

Once we had a fully rendered page, we built the audit engine itself as a collection of independent modules.

The structure looked roughly like this:

checks/
  meta/
    title.js
    description.js
    og-tags.js
    canonical.js
  headings/
    h1-presence.js
    heading-hierarchy.js
  images/
    alt-text.js
    lazy-load-detection.js
    oversized-images.js
  performance/
    render-blocking.js
    resource-hints.js
    font-loading.js
  structured-data/
    json-ld-validation.js
    schema-types.js
  links/
    internal-links.js
    broken-links.js
    anchor-text.js

Each check receives:

The rendered DOM
Network resource data
Page metrics

and returns a standardized result object:

{
  check: 'h1-presence',
  status: 'pass',
  message: 'H1 tag found: "Your Page Title"',
  impact: 'high'
}

Some checks were intentionally simple:

if (!document.querySelector('h1')) {
  return fail('Missing H1 tag');
}

Others required additional context, especially performance analysis and structured data validation.

This modular structure ended up saving us repeatedly as the platform expanded. Once the number of checks started growing, isolating each audit into independent modules made debugging, maintenance, and feature development far easier.

It also allowed us to:

Disable problematic checks quickly
Add new audit rules independently
Prioritize issues by impact
Generate cleaner reporting output

As the project evolved, modularity became one of the best architectural decisions we made.
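The pattern can be sketched as a small runner. This is a simplified illustration rather than the production engine: the `runChecks` function, the `disabled` option, the `error` status, and the pre-extracted `headings` and `images` fields on the context are all assumptions made for the example.

```javascript
const IMPACT_ORDER = { high: 0, medium: 1, low: 2 };

const pass = (check, message, impact) => ({ check, status: 'pass', message, impact });
const fail = (check, message, impact) => ({ check, status: 'fail', message, impact });

// Each check is an independent module with a name, an impact, and a run function.
const h1Presence = {
  name: 'h1-presence',
  impact: 'high',
  run({ headings }) {
    const h1 = headings.find((h) => h.level === 1);
    return h1
      ? pass(this.name, `H1 tag found: "${h1.text}"`, this.impact)
      : fail(this.name, 'Missing H1 tag', this.impact);
  },
};

const altText = {
  name: 'alt-text',
  impact: 'medium',
  run({ images }) {
    const missing = images.filter((img) => !img.alt);
    return missing.length === 0
      ? pass(this.name, 'All images have alt text', this.impact)
      : fail(this.name, `${missing.length} image(s) missing alt text`, this.impact);
  },
};

// Run every enabled check in isolation so one broken check cannot sink the audit.
function runChecks(checks, context, { disabled = [] } = {}) {
  const results = [];
  for (const check of checks) {
    if (disabled.includes(check.name)) continue;
    try {
      results.push(check.run(context));
    } catch (err) {
      results.push({ check: check.name, status: 'error', message: err.message, impact: check.impact });
    }
  }
  // Report problems first, ordered by impact.
  const rank = (r) => (r.status === 'pass' ? 1 : 0);
  return results.sort(
    (a, b) => rank(a) - rank(b) || IMPACT_ORDER[a.impact] - IMPACT_ORDER[b.impact]
  );
}

const context = {
  headings: [{ level: 2, text: 'Pricing' }],
  images: [{ src: '/hero.png', alt: '' }],
};

console.log(runChecks([altText, h1Presence], context));
```

Because each check only sees the shared context and returns the same result shape, disabling one is a matter of adding its name to `disabled`, and a thrown exception turns into a single `error` entry instead of a failed audit.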

Challenges We Didn’t Anticipate

1. Timeout Handling

Some pages are genuinely slow.

Large JavaScript bundles, third-party scripts, tracking pixels, and API delays can dramatically increase render time.

Originally, slow pages caused full audit failures.

We eventually redesigned the pipeline so incomplete audits could still return partial results instead of failing entirely.

That change made the platform far more resilient in production environments.
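One way to sketch the partial-results idea is to treat a navigation timeout as a warning rather than a fatal error, then snapshot whatever has rendered so far. This is a simplified illustration, not the production pipeline: the `auditWithPartialResults` name and the `complete` and `warnings` fields are hypothetical. It assumes `page` is a Puppeteer page, using only `page.goto` and `page.content`.

```javascript
// Navigate with a timeout, but keep going if the page never settles.
async function auditWithPartialResults(page, url, { timeout = 30000 } = {}) {
  const warnings = [];
  let complete = true;

  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout });
  } catch (err) {
    // A slow page is downgraded to an incomplete audit instead of a failure.
    complete = false;
    warnings.push(`Navigation did not settle: ${err.message}`);
  }

  let dom = null;
  try {
    // Capture whatever markup exists at this point, rendered or not.
    dom = await page.content();
  } catch (err) {
    complete = false;
    warnings.push(`Could not capture DOM: ${err.message}`);
  }

  return { url, complete, dom, warnings };
}
```

Downstream checks can then run against `dom` when it is present and flag their findings as provisional whenever `complete` is false.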

2. Bot Detection

Some sites actively detect headless browsers and serve different content.

In a few cases, pages rendered perfectly in a normal browser but returned stripped-down responses when rendered in headless Chromium.

We mitigated part of the problem using:

Realistic user agents
Browser fingerprint adjustments
Standard viewport sizes

but avoiding bot detection remains an ongoing challenge across the industry.

3. Single-Page App Routing

Single-page applications introduced another issue:
Deep routes sometimes triggered unexpected re-rendering behavior during navigation.

We initially experimented with broader crawling behavior, but dynamic client-side routing made the process unreliable very quickly.

In several cases, navigating between routes caused the application state to reset entirely, producing inconsistent audit results between runs.

We eventually simplified the pipeline and audited only the exact URL requested.

That decision made results far more predictable and reduced unnecessary complexity.

4. Memory Management

Chromium gets expensive fast under concurrency.

Early versions of the system launched a fresh browser instance for every audit request. During our first large-scale load tests, memory usage escalated far faster than we expected.

The rendering itself worked well.

The infrastructure did not.

Under concurrency, even small memory leaks became amplified rapidly because every Chromium instance carried the overhead of a full browser environment.

At one point, a single improperly terminated Chromium process accumulated enough memory to destabilize an entire worker node under concurrent load.

That was the moment we realized browser lifecycle management mattered just as much as the audit logic itself. A single failed browser cleanup can go unnoticed until the worker becomes unstable under load.

What We’d Do Differently

If we were starting over, we would implement a reusable browser pool from day one.

Launching a fresh Chromium instance for every audit works initially, but it becomes inefficient very quickly at scale.

Reusing browser instances with isolated pages is far more resource-efficient and improves throughput under concurrency.

In later testing, browser reuse reduced memory overhead noticeably compared to isolated browser launches per request.
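A minimal sketch of what such a pool could look like, under a few assumptions: the `BrowserPool` class, its `maxPages` option, and the `withPage` method are hypothetical names, and `launch` is any function that resolves to a browser object exposing `newPage()` and `close()`, for example `() => puppeteer.launch({ headless: true })`.

```javascript
class BrowserPool {
  constructor(launch, { maxPages = 4 } = {}) {
    this.launch = launch;
    this.maxPages = maxPages;
    this.active = 0;
    this.waiters = [];
    this.browserPromise = null;
  }

  // Wait until fewer than maxPages pages are in use.
  async acquireSlot() {
    if (this.active < this.maxPages) {
      this.active += 1;
      return;
    }
    await new Promise((resolve) => this.waiters.push(resolve));
  }

  releaseSlot() {
    const next = this.waiters.shift();
    if (next) next(); // hand the slot straight to the next waiter
    else this.active -= 1;
  }

  // Launch the shared browser lazily, and allow a retry if the launch fails.
  getBrowser() {
    this.browserPromise ??= this.launch().catch((err) => {
      this.browserPromise = null;
      throw err;
    });
    return this.browserPromise;
  }

  // Run a task with its own page; the page is closed even if the task throws.
  async withPage(task) {
    await this.acquireSlot();
    let page;
    try {
      const browser = await this.getBrowser();
      page = await browser.newPage();
      return await task(page);
    } finally {
      if (page) await page.close().catch(() => {});
      this.releaseSlot();
    }
  }

  async close() {
    if (!this.browserPromise) return;
    const browser = await this.browserPromise;
    this.browserPromise = null;
    await browser.close();
  }
}
```

An audit would then call something like `pool.withPage((page) => runAudit(page, url))`, where `runAudit` stands in for the rendering steps. Pages opened this way share the browser's default context (cookies and cache), so stricter isolation would need a separate browser context per audit.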

We also would have invested earlier in DOM snapshot caching.

Rendering is by far the most expensive part of the pipeline, especially for repeat audits against the same URL.

Caching rendered snapshots would have reduced both rendering overhead and infrastructure costs.
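A snapshot cache for this purpose can be as simple as an in-memory map with a time-to-live. The sketch below is an assumption about how it might look, not something we shipped: the `SnapshotCache` class, the `getSnapshot` helper, and the 15-minute default TTL are all illustrative.

```javascript
class SnapshotCache {
  constructor({ ttlMs = 15 * 60 * 1000, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, mainly to make expiry testable
    this.entries = new Map();
  }

  // Normalize the URL so equivalent spellings share one cache entry.
  static keyFor(url) {
    return new URL(url).href;
  }

  get(url) {
    const key = SnapshotCache.keyFor(url);
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(key); // expired: force a fresh render
      return undefined;
    }
    return entry.snapshot;
  }

  set(url, snapshot) {
    this.entries.set(SnapshotCache.keyFor(url), { snapshot, storedAt: this.now() });
  }
}

// Return a cached snapshot when one is fresh, otherwise render and store it.
async function getSnapshot(cache, url, render) {
  const cached = cache.get(url);
  if (cached !== undefined) return cached;
  const snapshot = await render(url);
  cache.set(url, snapshot);
  return snapshot;
}
```

Here `render` stands in for the expensive rendering step. The TTL is the main trade-off: a longer value saves more renders but risks auditing a stale version of a page that has since changed.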

Final Thoughts

Building a browser-rendered SEO auditing system proved far more demanding than parsing static HTML, but it also exposed how incomplete traditional auditing approaches had become for JavaScript-driven applications.

As we expanded the platform, we found ourselves solving problems that had less to do with SEO itself and more to do with rendering stability, browser orchestration, memory management, and infrastructure scaling.

What started as a rendering experiment eventually forced us to rethink nearly every assumption traditional SEO tools make about how websites should be analyzed in modern frontend environments.

The result became the foundation for our public SEO auditing platform, allowing developers and site owners to analyze pages using the same Chromium-based rendering pipeline we built internally.

axiondeepdigital.com/free-seo-audit

Crystal A. Gutierrez is Chairperson & Infrastructure Lead at Axion Deep Digital, a web development and SEO agency based in Las Cruces, NM.

2 Comments

Aulia Ika Savitri · Answer 1 · 2026-05-24T05:11:45+0000

Aulia Ika Savitri • 4 days

Nice breakdown. Building an SEO audit tool from scratch sounds way harder than most people think. Curious why you chose Chromium over lighter alternatives though?

cgutierrez1145 • 2 days

@Aulia Ika Savitri Appreciate that. We looked at a few different approaches before settling on Puppeteer + Chromium.

A big part of the decision came down to rendering accuracy and consistency. Since the audit tool evaluates live page behavior, metadata, structured content, resource loading, and other frontend signals, we needed something that mimics a real browser environment as closely as possible.

We considered lighter approaches like Cheerio + axios, but they struggled with JavaScript-rendered content and dynamic pages. We also looked at Playwright, which is excellent, but for our use case, we didn’t really need multi-browser automation. Most of the audit logic centered on Chromium rendering, so Puppeteer ended up being the cleaner fit.

The ecosystem and documentation around Puppeteer also made iteration much faster while building the rendering pipeline.

Definitely a tradeoff, though. Chromium is heavier, but for the type of audits we wanted to run, the rendering reliability was worth the overhead.
