Configuring My Site for AI Discoverability

Originally published at morello.dev

A growing share of web traffic doesn't come from people anymore. It comes from models reading on their behalf. ChatGPT, Claude, Perplexity, Copilot. They fetch a handful of pages, summarize, and ship the answer back. If your site isn't readable by those agents, you don't exist to them.

People are calling this GEO, short for Generative Engine Optimization. It overlaps with SEO but the priorities are different. Agents don't care about your layout. They care about your prose, your metadata, and how many tokens it costs them to read you.

This post covers how I configured this site for GEO. The first half is framework-agnostic. The second half is specific to my setup on Cloudflare, and includes a deliberate choice that fails a popular GEO audit. I'll explain why.

Part 1: general GEO techniques

Serve raw Markdown alongside HTML

The single biggest GEO win is giving agents a version of each page without the navigation, styling, and scripts. HTML is designed for browsers. Markdown is designed for readers, human or otherwise. Agents spend their context window on your prose, not your DOM.

Every blog post on this site has a mirror URL with a .md suffix:

  • /blog/my-post is the full HTML page for humans
  • /blog/my-post.md is the raw Markdown, served as text/markdown

In Astro, this is a short static endpoint at src/pages/blog/[slug].md.ts:

// Static builds need every slug enumerated up front.
// getAllPosts is assumed here alongside the site's own
// getPostById/formatPostMarkdown helpers.
export const getStaticPaths = async () =>
  (await getAllPosts()).map((post) => ({ params: { slug: post.slug } }));

export const GET = async ({ params }) => {
  const post = await getPostById(params.slug);
  return new Response(formatPostMarkdown(post), {
    headers: { "Content-Type": "text/markdown; charset=utf-8" },
  });
};

Both variants are pre-generated at build time. Same content, roughly half the tokens for an agent to consume.

Agents landing on the HTML need to know the Markdown exists. A single <link> in the head does it:

<link rel="alternate" type="text/markdown" href="/blog/my-post.md" />

Browsers ignore this tag. Agents that parse the head follow it.

Publish an llms.txt index

llms.txt is a convention for a Markdown file at the root of your site listing your content with short descriptions and links. Think of it as a sitemap an LLM can actually read.

I ship two variants:

  • /llms.txt is the index. Title, description, one line per post with a link to its .md version.
  • /llms-full.txt is the full corpus. Every post body concatenated into a single response.

Why both? An agent researching a specific topic can fetch llms.txt, pick the relevant links, and pull them. An agent doing deep research on the site as a whole fetches llms-full.txt once and has everything it needs in one request. Either way there's no crawling.
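For concreteness, here is a minimal index in the conventional llms.txt shape: an H1 for the site, a blockquote summary, then link lists. The slug and descriptions below are hypothetical, not this site's actual file:

```markdown
# morello.dev

> Posts about web development, served as Markdown for agents.

## Posts

- [Configuring My Site for AI Discoverability](https://morello.dev/blog/ai-discoverability.md): Serving Markdown mirrors, llms.txt, and Content-Signal on Cloudflare.
```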

Declare your AI stance in robots.txt

robots.txt can now carry a Content-Signal directive declaring how AI systems may use your content. Mine reads:

User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /
Sitemap: https://morello.dev/sitemap-index.xml

Three independent knobs:

  • search=yes lets search engines index
  • ai-train=no says my content is not for training data
  • ai-input=yes says my content can be retrieved and used as input for AI answers

This is the stance I'm comfortable with. I want to show up when someone asks Claude about something I've written; I just don't want my posts absorbed into the next base model.

Whether any given operator actually honors this is another question. The signal's there regardless, and I'd rather be on record than silent about it.

Add structured data that actually describes the content

Most blogs ship JSON-LD schema by reflex. Few of them include the fields that help a generative engine decide whether your article is worth fetching.

On each post I emit a BlogPosting graph with:

  • wordCount and timeRequired (ISO 8601 duration), so an agent can estimate how much context it'll spend before fetching
  • articleBody, the full text in machine-readable form, with no HTML parsing required
  • author linked to a Person node with knowsAbout so the entity is grounded in real topics
  • BreadcrumbList for site hierarchy

All of it goes into a single @graph per page rather than scattered <script> tags, which makes it cheaper for an engine to walk from post to author to site without cross-referencing.
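Trimmed to the essentials, the shape is a single @graph whose nodes reference each other by @id. The values and @id URLs below are placeholders, not the site's actual markup:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "BlogPosting",
      "@id": "https://morello.dev/blog/my-post#article",
      "headline": "…",
      "wordCount": 1800,
      "timeRequired": "PT8M",
      "articleBody": "…",
      "author": { "@id": "https://morello.dev/#person" }
    },
    {
      "@type": "Person",
      "@id": "https://morello.dev/#person",
      "name": "…",
      "knowsAbout": ["web development", "Cloudflare Workers"]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Blog", "item": "https://morello.dev/blog" }
      ]
    }
  ]
}
```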

A sitemap that actually tracks freshness

If you regenerate your sitemap once and never look at it again, you're wasting a signal. Every URL in mine carries a lastmod timestamp pulled from the post's updatedDate frontmatter, falling back to pubDate. When I edit an old post, its lastmod moves forward and crawlers reprioritize it.
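The fallback itself is one line; sketched here against a hypothetical frontmatter type (with @astrojs/sitemap, a value like this can be attached per URL through its serialize hook):

```typescript
// Hypothetical frontmatter shape; updatedDate is optional.
type Frontmatter = { pubDate: Date; updatedDate?: Date };

// lastmod prefers the edit date and falls back to the publish date.
function lastmod(fm: Frontmatter): string {
  return (fm.updatedDate ?? fm.pubDate).toISOString();
}
```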

Validate with real tools

Two tools I found useful while iterating on all of the above:

  • isitagentready.com audits across five categories: discoverability, content accessibility, bot access control, protocol discovery, and commerce. The bot access control checks (Content-Signal, Web Bot Auth, AI bot rules) are the part that actually influences how agents treat your content.
  • acceptmarkdown.com has a narrower focus. It checks whether your site responds to Accept: text/markdown with a Markdown body, includes Vary: Accept, returns 406 for unsupported types, and parses q-values correctly.
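The q-value handling is the part most implementations get wrong. A minimal sketch of what a compliant server has to do, assuming hypothetical helpers (not code from this site):

```typescript
type MediaRange = { type: string; q: number };

// Parse an Accept header into media ranges with quality values,
// highest q first. Unspecified q defaults to 1; q is clamped to [0, 1].
function parseAccept(header: string): MediaRange[] {
  return header
    .split(",")
    .map((part) => {
      const [type, ...params] = part.trim().split(";").map((s) => s.trim());
      const qParam = params.find((p) => p.startsWith("q="));
      const q = qParam ? Math.min(1, Math.max(0, parseFloat(qParam.slice(2)) || 0)) : 1;
      return { type: type.toLowerCase(), q };
    })
    .sort((a, b) => b.q - a.q);
}

// Pick the best supported type; null means no acceptable match (→ 406).
function negotiate(header: string, supported: string[]): string | null {
  for (const range of parseAccept(header)) {
    if (range.q === 0) continue; // q=0 explicitly refuses a type
    const match = supported.find(
      (t) =>
        range.type === "*/*" ||
        range.type === t ||
        (range.type.endsWith("/*") && t.startsWith(range.type.slice(0, -1)))
    );
    if (match) return match;
  }
  return null;
}
```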

I'll come back to the second one at the end of the post, because my site deliberately fails it.

Part 2: the Cloudflare-specific setup

General GEO gets you most of the way there. The rest is delivery. How fast you respond, whether the edge caches correctly, and how you advertise your agent-facing resources without waiting for someone to parse your HTML.

Static assets, zero Worker invocations

My wrangler.jsonc points a ./dist directory at Cloudflare's assets deployment, with no main entry:

{
  "name": "morellodev",
  "compatibility_date": "2026-04-18",
  "assets": {
    "directory": "./dist",
    "html_handling": "drop-trailing-slash",
    "not_found_handling": "404-page",
  },
}

Every request is served straight from the edge asset cache. HTML, Markdown, llms.txt, sitemap, RSS. Same path for all of them, and no Worker ever runs. On the Workers Free tier this matters. A crawler sweep that would otherwise eat into the 100k daily invocations now costs me nothing. Agents, for better or worse, don't crawl politely.

Cloudflare's _headers file lets you ship response headers without any server code. I use it to tell every response, not just HTML ones, where the agent-facing files live:

/*
  Link: </sitemap-index.xml>; rel="sitemap",
        </rss.xml>; rel="alternate"; type="application/rss+xml"; title="RSS",
        </llms.txt>; rel="describedby"; type="text/plain",
        </llms-full.txt>; rel="describedby"; type="text/plain"

A crawler doing a HEAD against any URL on the site sees all four links before it parses a single byte of HTML. One round-trip, no body, full discovery.
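To see what that buys an agent, here is a sketch of splitting such a combined Link header into its entries (a hypothetical helper, not part of the site):

```typescript
type LinkEntry = { href: string; params: Record<string, string> };

// Split a combined Link header on commas that start a new <target>,
// then peel off the href and the rel/type/title parameters.
function parseLinkHeader(header: string): LinkEntry[] {
  return header.split(/,\s*(?=<)/).map((entry) => {
    const [target, ...parts] = entry.split(";").map((s) => s.trim());
    const href = target.slice(1, -1); // strip the surrounding < >
    const params: Record<string, string> = {};
    for (const part of parts) {
      const eq = part.indexOf("=");
      if (eq > 0) {
        params[part.slice(0, eq)] = part.slice(eq + 1).replace(/^"|"$/g, "");
      }
    }
    return { href, params };
  });
}
```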

Long-lived cache for hashed assets

Astro emits fingerprinted filenames under /_astro/, so those can sit in cache for a year:

/_astro/*
  Cache-Control: public, max-age=31536000, immutable

Faster first paint for humans, cheaper crawls for agents. Same lever.

Why I skipped Accept: text/markdown content negotiation

acceptmarkdown.com will tell you this site doesn't do content negotiation. No Vary: Accept, no 406, no Markdown from the canonical URL. That's not an oversight. I tried it, shipped it briefly, and rolled it back.

The reason is Cloudflare's free plan. Custom cache keys are Enterprise-only, and their docs are explicit that Vary: Accept is ignored for caching decisions. The edge collapses every variant of /blog/my-post into one cache entry, so the first requester's format poisons the cache for everyone else until TTL expires.

The workaround is a Worker that bypasses the edge cache. But now every /blog/* request burns a Worker invocation, humans included, and the Workers Free plan gives you 100k per day and 10ms of CPU each. That's a real budget to share across humans and bots, for no functional gain over a static .md URL.
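For reference, the routing decision that Worker made reduces to a pure mapping like this (a hypothetical reconstruction; the actual Worker is gone):

```typescript
// Map a canonical blog path plus Accept header to the asset to serve.
// Any request listing text/markdown for a /blog/ page gets the .md mirror.
function assetPathFor(pathname: string, accept: string | null): string {
  const wantsMarkdown = (accept ?? "")
    .split(",")
    .some((part) => part.trim().toLowerCase().startsWith("text/markdown"));
  if (wantsMarkdown && pathname.startsWith("/blog/") && !pathname.endsWith(".md")) {
    return pathname + ".md";
  }
  return pathname;
}
```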

So I deleted the Worker. The only thing I lost is curl -H "Accept: text/markdown" …/blog/my-post returning Markdown. Between llms.txt, <link rel="alternate">, and the /blog/[slug].md convention, no mainstream agent I've seen actually needs Accept: negotiation. It's the more elegant protocol; alternate URLs are the more robust one on a free-tier CDN. On a paid plan I'd probably do both.

Where this leaves things

Every page exists in two forms, both served from the edge. Agent-facing resources are advertised in response headers on every request, before any HTML gets parsed. Structured data tells engines what the article is and how much context it takes to read. robots.txt says what I'll allow and what I won't.

GEO is still very new. The standards are half-drafted, the tools disagree with each other, and half the signals I described above didn't exist two years ago. I fully expect to be rewriting parts of this post within six months, probably with a different opinion about Accept-based negotiation, once I've either moved off the free plan or found a workaround that doesn't involve a Worker. But for now: serve agents a version they can cheaply consume, be explicit about what you'll allow, and accept that the defaults aren't on your side.

If you're reading this via a summary from some assistant, hi. Thanks for the traffic.
