How We Built a Free SEO Audit Tool with Puppeteer and Chromium

Real browser rendering, 60+ modular checks, and the engineering lessons we learned building a headless Chromium audit pipeline

When we started building our SEO auditing pipeline, the first architectural decision we made was also the most important one:

We would never parse raw HTML.

Most SEO audit tools still rely on that approach because it’s lightweight and efficient for static websites. But frontend frameworks like React, Next.js, and Vue changed the landscape completely.

We kept running into the same issue:

Traditional parsers were auditing code that users and Google never actually saw.

So we took a different route:
real browser rendering with Puppeteer and headless Chromium.

Here’s how we built the system and what we learned along the way.

The Core Problem With HTML Parsers

Most SEO auditors work something like this:

const response = await fetch(url);
const html = await response.text();

// parse html and inspect tags

For static sites, that works fine.

But JavaScript applications often return almost empty HTML responses initially:

<html>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.chunk.js"></script>
  </body>
</html>

The actual SEO-relevant content, including:

  • H1 tags
  • Meta descriptions
  • Structured data
  • Canonical tags
  • Images
  • Internal links

doesn’t exist yet.

It gets generated after JavaScript executes in the browser.

An HTML parser never sees that content.

Googlebot renders JavaScript. Your audit tool should, too.
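
To make the gap concrete, here is a minimal sketch that runs the same naive H1 check against the empty app shell shown above and against a hypothetical post-render DOM. The `hasH1` helper and the `renderedHtml` string are illustrative assumptions, and the regex only stands in for a real parser to keep the example dependency-free.

```javascript
// Naive presence check: does the markup contain an opening <h1> tag?
// A regex is a stand-in for a real HTML parser; illustration only.
function hasH1(html) {
  return /<h1[\s>]/i.test(html);
}

// What a fetch()-based auditor receives from a client-rendered app.
const initialHtml = `
<html>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.chunk.js"></script>
  </body>
</html>`;

// Hypothetical DOM after JavaScript has run in a browser.
const renderedHtml = `
<html>
  <body>
    <div id="root"><h1>Your Page Title</h1></div>
  </body>
</html>`;

console.log(hasH1(initialHtml));  // false
console.log(hasH1(renderedHtml)); // true
```

The check itself is identical in both cases. Only the input changes, which is why the rendering step matters more than the parsing step.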

One of the most surprising things we discovered during testing was how little meaningful content some frameworks returned before hydration. In several React-based applications, the initial response contained almost none of the content users eventually saw in the browser.

In our internal testing, more than half of the React-based sites we audited returned incomplete metadata before rendering. Some pages were missing titles, canonical tags, structured data, or even visible heading content entirely until JavaScript finished executing.

That gap became impossible to ignore.

Why We Chose Puppeteer

We evaluated several options before deciding on our stack:

  • Playwright: excellent tooling, but heavier than we needed for a Chromium-only workflow

  • Selenium: powerful, but designed more for browser testing than rendering audits

  • Cheerio + axios: extremely fast, but limited to static HTML parsing

  • Puppeteer: lightweight Chromium automation with a straightforward API and strong ecosystem support

Puppeteer ultimately made the most sense for our use case.

We didn’t need multi-browser automation.

We needed rendering accuracy.

That narrowed the field quickly.

The Rendering Pipeline

Here’s a simplified version of the core audit flow:

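const puppeteer = require('puppeteer');

async function auditPage(url) {
  // --no-sandbox disables Chromium's sandbox; only use it inside an isolated container.
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  try {
    const page = await browser.newPage();

    await page.setUserAgent(
      'Mozilla/5.0 (compatible; DeepAuditBot/1.0; +https://axiondeepdigital.com)'
    );

    // Collect every network request the page makes.
    const resources = [];
    page.on('request', (req) => resources.push(req));

    // Wait until the network is nearly idle so client-side rendering can finish.
    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: 30000,
    });

    // Scroll through the page to trigger lazy-loaded content (defined below).
    await autoScroll(page);

    const dom = await page.evaluate(() => document.documentElement.outerHTML);

    return { dom, resources };
  } finally {
    // Always close the browser, even if navigation or scrolling throws.
    await browser.close();
  }
}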

The key detail here is:

waitUntil: 'networkidle2'

This tells Puppeteer to wait until there are no more than two active network requests for at least 500ms.

Without this step, audits frequently captured incomplete pages before JavaScript finished rendering critical content.

This became especially important for:

  • Hydration-heavy React apps

  • Lazy-loaded images

  • Dynamically injected metadata

  • Client-side routing frameworks

Waiting for the network to stabilize before auditing eliminated many of the incomplete renders we encountered early in development.
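
The "no more than two requests for 500ms" rule can be pictured with a small simulation. This is a conceptual model only, not how Puppeteer or Chromium implement it: `findIdleTime`, the event format, and the assumption that the quiet window can start at time 0 are all invented for illustration.

```javascript
// Conceptual model only: this is NOT Puppeteer's implementation.
// events: [{ t: <ms since navigation>, type: 'start' | 'end' }], sorted by time.
// Returns the time at which the page would count as idle, or null if it never does.
function findIdleTime(events, maxInflight = 2, quietMs = 500) {
  let inflight = 0;
  let quietSince = 0; // start of the current "quiet enough" window, or null

  for (const { t, type } of events) {
    // If the quiet window fully elapsed before this event, idle was reached.
    if (quietSince !== null && t - quietSince >= quietMs) {
      return quietSince + quietMs;
    }
    inflight += type === 'start' ? 1 : -1;
    if (inflight > maxInflight) {
      quietSince = null; // too busy, no window open
    } else if (quietSince === null) {
      quietSince = t; // dropped back to the limit, open a new window
    }
  }
  // No more events: an open window eventually completes.
  return quietSince === null ? null : quietSince + quietMs;
}

// Three requests start in a burst, then finish one by one.
const burst = [
  { t: 0, type: 'start' },
  { t: 10, type: 'start' },
  { t: 20, type: 'start' },
  { t: 100, type: 'end' },
  { t: 300, type: 'end' },
  { t: 400, type: 'end' },
];

console.log(findIdleTime(burst)); // 600
```

In this model the window opens at 100ms, when the third request finishes and only two remain in flight. That tolerance for a couple of lingering connections is what separates `networkidle2` from the stricter `networkidle0`.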

Handling Lazy-Loaded Content

Another challenge we encountered was lazy loading.

Many sites only load images and components once the user scrolls down the page. A simple page load misses large portions of the content entirely.

To solve this, we implemented an incremental scrolling helper:

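async function autoScroll(page) {
  // The callback runs inside the page, not in Node.
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 200; // pixels per scroll step

      // Scroll one step every 100ms so lazy loaders have time to fire.
      const timer = setInterval(() => {
        window.scrollBy(0, distance);
        totalHeight += distance;

        // Stop once we have scrolled past the current document height.
        // Note: on pages that keep growing (infinite scroll), this may never be true.
        if (totalHeight >= document.body.scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}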

This scrolling behavior triggers:

  • Intersection observers

  • Lazy load listeners

  • Deferred image requests

in much the same way as real user interaction would.

Without scrolling, some audits completely missed below-the-fold image sections, deferred components, and dynamically injected content.
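
The helper above stops when the scrolled distance passes the document height, so a page that keeps growing as you scroll could keep it running indefinitely. Below is a hedged sketch of a bounded variant. The `scrollWithLimits` name, the `env` abstraction, and the `maxSteps` default are assumptions made for illustration, chosen so the loop can be exercised outside a browser.

```javascript
// Hypothetical variant of the scroll loop with a step cap.
// `env` abstracts the page so the logic can run outside a browser:
//   env.scrollBy(dy)     scrolls down by dy pixels
//   env.scrollHeight()   returns the current document height in pixels
async function scrollWithLimits(
  env,
  { distance = 200, delayMs = 100, maxSteps = 150 } = {}
) {
  let totalHeight = 0;
  let steps = 0;

  while (totalHeight < env.scrollHeight() && steps < maxSteps) {
    env.scrollBy(distance);
    totalHeight += distance;
    steps += 1;
    // Give lazy loaders time to react before the next step.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }

  // reachedBottom is false when the step cap stopped the loop early.
  return { steps, reachedBottom: totalHeight >= env.scrollHeight() };
}
```

Inside `page.evaluate`, `env` would wrap `window.scrollBy(0, dy)` and `document.body.scrollHeight`. Because Puppeteer serializes the callback it sends to the page, the function would need to be defined inside that callback rather than referenced from Node.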

Building the Audit Engine

Once we had a fully rendered page, we built the audit engine itself as a collection of independent modules.

The structure looked roughly like this:

checks/
  meta/
    title.js
    description.js
    og-tags.js
    canonical.js
  headings/
    h1-presence.js
    heading-hierarchy.js
  images/
    alt-text.js
    lazy-load-detection.js
    oversized-images.js
  performance/
    render-blocking.js
    resource-hints.js
    font-loading.js
  structured-data/
    json-ld-validation.js
    schema-types.js
  links/
    internal-links.js
    broken-links.js
    anchor-text.js

Each check receives:

  • The rendered DOM

  • Network resource data

  • Page metrics

and returns a standardized result object:

{
  check: 'h1-presence',
  status: 'pass',
  message: 'H1 tag found: "Your Page Title"',
  impact: 'high'
}

Some checks were intentionally simple:

if (!document.querySelector('h1')) {
  return fail('Missing H1 tag');
}
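
The snippet above relies on a `fail` helper that is not shown. As a rough sketch of how a check module and its result builder could fit together, the example below uses a hypothetical `makeResult` helper and assumes heading data has already been extracted from the rendered DOM into `context.headings`. Neither name comes from our codebase.

```javascript
// Hypothetical helper that builds the standardized result object.
function makeResult(check, status, message, impact) {
  return { check, status, message, impact };
}

// One check module, roughly what checks/headings/h1-presence.js could export.
// `context.headings` is an assumed shape: heading data pulled from the
// rendered DOM beforehand, e.g. [{ level: 1, text: 'Your Page Title' }].
const h1Presence = {
  id: 'h1-presence',
  impact: 'high',
  run(context) {
    const h1 = context.headings.find((h) => h.level === 1);
    if (!h1) {
      return makeResult(this.id, 'fail', 'Missing H1 tag', this.impact);
    }
    return makeResult(this.id, 'pass', `H1 tag found: "${h1.text}"`, this.impact);
  },
};

console.log(h1Presence.run({ headings: [{ level: 1, text: 'Your Page Title' }] }));
console.log(h1Presence.run({ headings: [{ level: 2, text: 'Subsection' }] }).status); // fail
```

Keeping the check free of direct `document` access is a design choice in this sketch, not a requirement. It makes the module easy to test with plain objects.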

Others required additional context, especially performance analysis and structured data validation.

This modular structure ended up saving us repeatedly as the platform expanded. Once the number of checks started growing, isolating each audit into independent modules made debugging, maintenance, and feature development far easier.

It also allowed us to:

  • Disable problematic checks quickly

  • Add new audit rules independently

  • Prioritize issues by impact

  • Generate cleaner reporting output

As the project evolved, modularity became one of the best architectural decisions we made.
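
A minimal sketch of a runner that supports those four properties might look like the following. The `runChecks` function, the `disabled` option, the `'error'` status, and the `medium`/`low` impact levels are assumptions for illustration. Only `'high'` appears in the result example earlier.

```javascript
// Assumed ordering for sorting results; only 'high' appears in the article.
const IMPACT_ORDER = { high: 0, medium: 1, low: 2 };

// Runs every enabled check in isolation and returns results sorted by impact.
// `checks` is an array of { id, impact, run(context) } objects (assumed shape).
function runChecks(checks, context, { disabled = [] } = {}) {
  const skip = new Set(disabled);
  const results = [];

  for (const check of checks) {
    if (skip.has(check.id)) continue; // switched off without touching other checks
    try {
      results.push(check.run(context));
    } catch (err) {
      // A crashing check becomes an 'error' result instead of failing the audit.
      results.push({
        check: check.id,
        status: 'error',
        message: err.message,
        impact: check.impact,
      });
    }
  }

  return results.sort(
    (a, b) => (IMPACT_ORDER[a.impact] ?? 99) - (IMPACT_ORDER[b.impact] ?? 99)
  );
}

// Usage with three stand-in checks, one of which throws.
const sampleChecks = [
  { id: 'alt-text', impact: 'medium', run: () => ({ check: 'alt-text', status: 'pass', message: 'ok', impact: 'medium' }) },
  { id: 'h1-presence', impact: 'high', run: () => { throw new Error('boom'); } },
  { id: 'font-loading', impact: 'low', run: () => ({ check: 'font-loading', status: 'pass', message: 'ok', impact: 'low' }) },
];

const sampleResults = runChecks(sampleChecks, {}, { disabled: ['font-loading'] });
// h1-presence (status 'error') sorts first, followed by alt-text.
console.log(sampleResults.map((r) => `${r.check}:${r.status}`));
```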

Challenges We Didn’t Anticipate

1. Timeout Handling

Some pages are genuinely slow.

Large JavaScript bundles, third-party scripts, tracking pixels, and API delays can dramatically increase render time.

Originally, slow pages caused full audit failures.

We eventually redesigned the pipeline so incomplete audits could still return partial results instead of failing entirely.

That change made the platform far more resilient in production environments.
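
One way to express "partial results instead of total failure" is to give each stage its own time budget and keep whatever finished. The sketch below illustrates that idea under assumptions: `withTimeout`, `runStages`, and the shape of the report object are hypothetical names, and the stages are plain async functions rather than real audit steps.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Runs named async stages and keeps whatever finished in time.
// `stages` maps a stage name to a function returning a promise (assumed shape).
async function runStages(stages, ms) {
  const names = Object.keys(stages);
  const settled = await Promise.allSettled(
    names.map((name) => withTimeout(stages[name](), ms, name))
  );

  const report = { complete: true, results: {}, errors: {} };
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      report.results[names[i]] = outcome.value;
    } else {
      report.complete = false; // flag the audit as partial
      report.errors[names[i]] = outcome.reason.message;
    }
  });
  return report;
}
```

With this shape, a caller can still show the checks that completed and mark the report as partial when `complete` is `false`.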

2. Bot Detection

Some sites actively detect headless browsers and serve different content.

In a few cases, pages rendered perfectly in a normal browser but returned stripped-down responses when rendered in headless Chromium.

We mitigated part of the problem using:

  • Realistic user agents

  • Browser fingerprint adjustments

  • Standard viewport sizes

but avoiding bot detection remains an ongoing challenge across the industry.

3. Single Page App Routing

Single-page applications introduced another issue: deep routes sometimes triggered unexpected re-rendering behavior during navigation.

We initially experimented with broader crawling behavior, but dynamic client-side routing made the process unreliable very quickly.

In several cases, navigating between routes caused the application state to reset entirely, producing inconsistent audit results between runs.

We eventually simplified the pipeline and audited only the exact URL requested.

That decision made results far more predictable and reduced unnecessary complexity.

4. Memory Management

Chromium gets expensive fast under concurrency.

Early versions of the system launched a fresh browser instance for every audit request. During our first large-scale load tests, memory usage escalated far faster than we expected.

The rendering itself worked well.

The infrastructure did not.

Under concurrency, even small memory leaks became amplified rapidly because every Chromium instance carried the overhead of a full browser environment.

At one point, a single improperly terminated Chromium process quietly accumulated enough memory to destabilize an entire worker node under concurrent load.

That was the moment we realized browser lifecycle management mattered just as much as the audit logic itself.
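
As a hedged sketch of what a defensive cleanup path can look like, the helper below tries a graceful close first and falls back to killing the browser process if that hangs or throws. `closeBrowserSafely` and the 5 second grace period are assumptions. The only Puppeteer calls it expects on the `browser` object are `close()` and `process()`.

```javascript
// Closes a browser, falling back to killing its process if close() hangs.
// `browser` is expected to look like a Puppeteer Browser: close() returns a
// promise and process() returns the spawned child process, or null when there
// is none (for example, when connected to a remote browser).
async function closeBrowserSafely(browser, graceMs = 5000) {
  let timer;
  const grace = new Promise((resolve) => {
    timer = setTimeout(() => resolve('timeout'), graceMs);
  });

  try {
    const outcome = await Promise.race([
      browser.close().then(() => 'closed'),
      grace,
    ]);
    if (outcome === 'closed') return 'closed';
  } catch (err) {
    // close() itself failed; fall through to the hard kill below.
  } finally {
    clearTimeout(timer);
  }

  const child = browser.process();
  if (!child) return 'no-process';
  child.kill('SIGKILL'); // last resort: make sure the process is gone
  return 'killed';
}
```

Calling a helper like this from a `finally` block means a failed audit cannot leave a Chromium process behind.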

What We’d Do Differently

If we were starting over, we would implement a reusable browser pool from day one.

Launching a fresh Chromium instance for every audit works initially, but it becomes inefficient very quickly at scale.

Reusing browser instances with isolated pages is far more resource-efficient and improves throughput under concurrency.

In later testing, browser reuse reduced memory overhead noticeably compared to isolated browser launches per request.
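
To illustrate the idea, here is a minimal pool sketch. It is not our production implementation: `createBrowserPool`, the `size` option, and the round-robin strategy are assumptions. The `launch` function is injected, so with Puppeteer it could be `() => puppeteer.launch({ headless: true })`.

```javascript
// Minimal browser pool: launches up to `size` browsers lazily, then reuses them.
// `launch` is an injected async factory that returns a browser-like object
// exposing newPage() and close().
function createBrowserPool({ launch, size = 2 }) {
  const browsers = []; // promises for launched browsers
  let next = 0;

  function acquire() {
    // Launch lazily until the pool is full, then reuse in round-robin order.
    if (browsers.length < size) {
      const launched = launch();
      browsers.push(launched);
      return launched;
    }
    const browser = browsers[next];
    next = (next + 1) % browsers.length;
    return browser;
  }

  return {
    // Runs `task(page)` on a fresh page and always closes that page afterwards.
    async withPage(task) {
      const browser = await acquire();
      const page = await browser.newPage();
      try {
        return await task(page);
      } finally {
        await page.close();
      }
    },
    // Shuts down every browser the pool launched.
    async closeAll() {
      const launched = await Promise.all(browsers.splice(0));
      await Promise.all(launched.map((b) => b.close()));
    },
  };
}
```

Pages opened with `newPage()` in the same browser share cookies and cache by default. If audits need stronger isolation, a separate browser context per audit is the usual option. A production pool would also need to replace browsers that crash, which this sketch leaves out.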

We also would have invested earlier in DOM snapshot caching.

Rendering is by far the most expensive part of the pipeline, especially for repeat audits against the same URL.

Caching rendered snapshots would have reduced both rendering overhead and infrastructure costs.
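
A snapshot cache can start out very small. The sketch below is an in-memory, time-limited cache keyed by URL. `createSnapshotCache`, the 15 minute default TTL, and the injectable `now` clock are assumptions for illustration, and `render` stands for any async function that produces a snapshot, such as `auditPage`.

```javascript
// Tiny in-memory TTL cache keyed by URL.
// `render` is any async function that returns a snapshot for a URL.
function createSnapshotCache({ render, ttlMs = 15 * 60 * 1000, now = Date.now }) {
  const entries = new Map(); // url -> { expires, snapshot: Promise }

  return async function getSnapshot(url) {
    const hit = entries.get(url);
    if (hit && hit.expires > now()) return hit.snapshot;

    // Store the promise so concurrent requests for one URL share a render.
    const snapshot = render(url);
    entries.set(url, { expires: now() + ttlMs, snapshot });
    try {
      return await snapshot;
    } catch (err) {
      // Do not keep failed renders; only remove the entry this call created.
      if (entries.get(url)?.snapshot === snapshot) entries.delete(url);
      throw err;
    }
  };
}
```

This version never evicts old entries, so a real deployment would want a size bound or a shared store. The right TTL depends on how often the audited pages change.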

Final Thoughts

Building a browser-rendered SEO auditing system proved far more demanding than parsing static HTML, but it also exposed how incomplete traditional auditing approaches had become for JavaScript-driven applications.

As we expanded the platform, we found ourselves solving problems that had less to do with SEO itself and more to do with rendering stability, browser orchestration, memory management, and infrastructure scaling.

What started as a rendering experiment eventually forced us to rethink nearly every assumption traditional SEO tools make about how websites should be analyzed in modern frontend environments.

The result became the foundation for our public SEO auditing platform, allowing developers and site owners to analyze pages using the same Chromium-based rendering pipeline we built internally.

axiondeepdigital.com/free-seo-audit

Crystal A. Gutierrez is Chairperson & Infrastructure Lead at Axion Deep Digital, a web development and SEO agency based in Las Cruces, NM.
