Configuring My Site for AI Discoverability

Originally published at morello.dev

A growing share of web traffic doesn't come from people anymore. It comes from models reading on their behalf. ChatGPT, Claude, Perplexity, Copilot. They fetch a handful of pages, summarize, and ship the answer back. If your site isn't readable by those agents, you don't exist to them.

People are calling this GEO, short for Generative Engine Optimization. It overlaps with SEO but the priorities are different. Agents don't care about your layout. They care about your prose, your metadata, and how many tokens it costs them to read you.

This post covers how I configured this site for GEO. The first half is framework-agnostic. The second half is specific to my setup on Cloudflare, and includes a deliberate choice that fails a popular GEO audit. I'll explain why.

Part 1: general GEO techniques

Serve raw Markdown alongside HTML

The single biggest GEO win is giving agents a version of each page without the navigation, styling, and scripts. HTML is designed for browsers. Markdown is designed for readers, human or otherwise. Agents spend their context window on your prose, not your DOM.

Every blog post on this site has a mirror URL with a .md suffix:

  • /blog/my-post is the full HTML page for humans
  • /blog/my-post.md is the raw Markdown, served as text/markdown

In Astro, this is a short static endpoint at src/pages/blog/[slug].md.ts:

// Static builds need every slug enumerated up front.
// getAllPosts is assumed here alongside the site's own
// getPostById/formatPostMarkdown helpers.
export const getStaticPaths = async () =>
  (await getAllPosts()).map((post) => ({ params: { slug: post.slug } }));

export const GET = async ({ params }) => {
  const post = await getPostById(params.slug);
  return new Response(formatPostMarkdown(post), {
    headers: { "Content-Type": "text/markdown; charset=utf-8" },
  });
};

Both variants are pre-generated at build time. Same content, roughly half the tokens for an agent to consume.

Agents landing on the HTML need to know the Markdown exists. A single <link> in the head does it:

<link rel="alternate" type="text/markdown" href="/blog/my-post.md" />

Browsers ignore this tag. Agents that parse the head follow it.

Publish an llms.txt index

llms.txt is a convention for a Markdown file at the root of your site listing your content with short descriptions and links. Think of it as a sitemap an LLM can actually read.

I ship two variants:

  • /llms.txt is the index. Title, description, one line per post with a link to its .md version.
  • /llms-full.txt is the full corpus. Every post body concatenated into a single response.

Why both? An agent researching a specific topic can fetch llms.txt, pick the relevant links, and pull them. An agent doing deep research on the site as a whole fetches llms-full.txt once and has everything it needs in one request. Either way there's no crawling.
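For concreteness, here is a minimal index in the conventional llms.txt shape: an H1 for the site, a blockquote summary, then link lists. The slug and descriptions below are hypothetical, not this site's actual file:

```markdown
# morello.dev

> Posts about web development, served as Markdown for agents.

## Posts

- [Configuring My Site for AI Discoverability](https://morello.dev/blog/ai-discoverability.md): Serving Markdown mirrors, llms.txt, and Content-Signal on Cloudflare.
```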

Declare your AI stance in robots.txt

robots.txt can now carry a Content-Signal directive declaring how AI systems may use your content. Mine reads:

User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /
Sitemap: https://morello.dev/sitemap-index.xml

Three independent knobs:

  • search=yes lets search engines index
  • ai-train=no says my content is not for training data
  • ai-input=yes says my content can be retrieved and used as input for AI answers

This is the stance I'm comfortable with. I want to show up when someone asks Claude about something I've written; I just don't want my posts absorbed into the next base model.

Whether any given operator actually honors this is another question. The signal's there regardless, and I'd rather be on record than silent about it.

Add structured data that actually describes the content

Most blogs ship JSON-LD schema by reflex. Few of them include the fields that help a generative engine decide whether your article is worth fetching.

On each post I emit a BlogPosting graph with:

  • wordCount and timeRequired (ISO 8601 duration), so an agent can estimate how much context it'll spend before fetching
  • articleBody, the full text in machine-readable form, with no HTML parsing required
  • author linked to a Person node with knowsAbout so the entity is grounded in real topics
  • BreadcrumbList for site hierarchy

All of it goes into a single @graph per page rather than scattered <script> tags, which makes it cheaper for an engine to walk from post to author to site without cross-referencing.
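Trimmed to the essentials, the shape is a single @graph whose nodes reference each other by @id. The values and @id URLs below are placeholders, not the site's actual markup:

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "BlogPosting",
      "@id": "https://morello.dev/blog/my-post#article",
      "headline": "…",
      "wordCount": 1800,
      "timeRequired": "PT8M",
      "articleBody": "…",
      "author": { "@id": "https://morello.dev/#person" }
    },
    {
      "@type": "Person",
      "@id": "https://morello.dev/#person",
      "name": "…",
      "knowsAbout": ["web development", "Cloudflare Workers"]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Blog", "item": "https://morello.dev/blog" }
      ]
    }
  ]
}
```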

A sitemap that actually tracks freshness

If you regenerate your sitemap once and never look at it again, you're wasting a signal. Every URL in mine carries a lastmod timestamp pulled from the post's updatedDate frontmatter, falling back to pubDate. When I edit an old post, its lastmod moves forward and crawlers reprioritize it.
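The fallback itself is one line; sketched here against a hypothetical frontmatter type (with @astrojs/sitemap, a value like this can be attached per URL through its serialize hook):

```typescript
// Hypothetical frontmatter shape; updatedDate is optional.
type Frontmatter = { pubDate: Date; updatedDate?: Date };

// lastmod prefers the edit date and falls back to the publish date.
function lastmod(fm: Frontmatter): string {
  return (fm.updatedDate ?? fm.pubDate).toISOString();
}
```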

Validate with real tools

Two tools I found useful while iterating on all of the above:

  • isitagentready.com audits across five categories: discoverability, content accessibility, bot access control, protocol discovery, and commerce. The bot access control checks (Content-Signal, Web Bot Auth, AI bot rules) are the part that actually influences how agents treat your content.
  • acceptmarkdown.com has a narrower focus. It checks whether your site responds to Accept: text/markdown with a Markdown body, includes Vary: Accept, returns 406 for unsupported types, and parses q-values correctly.
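The q-value handling is the part most implementations get wrong. A minimal sketch of what a compliant server has to do, assuming hypothetical helpers (not code from this site):

```typescript
type MediaRange = { type: string; q: number };

// Parse an Accept header into media ranges with quality values,
// highest q first. Unspecified q defaults to 1; q is clamped to [0, 1].
function parseAccept(header: string): MediaRange[] {
  return header
    .split(",")
    .map((part) => {
      const [type, ...params] = part.trim().split(";").map((s) => s.trim());
      const qParam = params.find((p) => p.startsWith("q="));
      const q = qParam ? Math.min(1, Math.max(0, parseFloat(qParam.slice(2)) || 0)) : 1;
      return { type: type.toLowerCase(), q };
    })
    .sort((a, b) => b.q - a.q);
}

// Pick the best supported type; null means no acceptable match (→ 406).
function negotiate(header: string, supported: string[]): string | null {
  for (const range of parseAccept(header)) {
    if (range.q === 0) continue; // q=0 explicitly refuses a type
    const match = supported.find(
      (t) =>
        range.type === "*/*" ||
        range.type === t ||
        (range.type.endsWith("/*") && t.startsWith(range.type.slice(0, -1)))
    );
    if (match) return match;
  }
  return null;
}
```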

I'll come back to the second one at the end of the post, because my site deliberately fails it.

Part 2: the Cloudflare-specific setup

General GEO gets you most of the way there. The rest is delivery. How fast you respond, whether the edge caches correctly, and how you advertise your agent-facing resources without waiting for someone to parse your HTML.

Static assets, zero Worker invocations

My wrangler.jsonc points a ./dist directory at Cloudflare's assets deployment, with no main entry:

{
  "name": "morellodev",
  "compatibility_date": "2026-04-18",
  "assets": {
    "directory": "./dist",
    "html_handling": "drop-trailing-slash",
    "not_found_handling": "404-page",
  },
}

Every request is served straight from the edge asset cache. HTML, Markdown, llms.txt, sitemap, RSS. Same path for all of them, and no Worker ever runs. On the Workers Free tier this matters. A crawler sweep that would otherwise eat into the 100k daily invocations now costs me nothing. Agents, for better or worse, don't crawl politely.

Cloudflare's _headers file lets you ship response headers without any server code. I use it to tell every response, not just HTML ones, where the agent-facing files live:

/*
  Link: </sitemap-index.xml>; rel="sitemap",
        </rss.xml>; rel="alternate"; type="application/rss+xml"; title="RSS",
        </llms.txt>; rel="describedby"; type="text/plain",
        </llms-full.txt>; rel="describedby"; type="text/plain"

A crawler doing a HEAD against any URL on the site sees all four links before it parses a single byte of HTML. One round-trip, no body, full discovery.
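To see what that buys an agent, here is a sketch of splitting such a combined Link header into its entries (a hypothetical helper, not part of the site):

```typescript
type LinkEntry = { href: string; params: Record<string, string> };

// Split a combined Link header on commas that start a new <target>,
// then peel off the href and the rel/type/title parameters.
function parseLinkHeader(header: string): LinkEntry[] {
  return header.split(/,\s*(?=<)/).map((entry) => {
    const [target, ...parts] = entry.split(";").map((s) => s.trim());
    const href = target.slice(1, -1); // strip the surrounding < >
    const params: Record<string, string> = {};
    for (const part of parts) {
      const eq = part.indexOf("=");
      if (eq > 0) {
        params[part.slice(0, eq)] = part.slice(eq + 1).replace(/^"|"$/g, "");
      }
    }
    return { href, params };
  });
}
```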

Long-lived cache for hashed assets

Astro emits fingerprinted filenames under /_astro/, so those can sit in cache for a year:

/_astro/*
  Cache-Control: public, max-age=31536000, immutable

Faster first paint for humans, cheaper crawls for agents. Same lever.

Why I skipped Accept: text/markdown content negotiation

acceptmarkdown.com will tell you this site doesn't do content negotiation. No Vary: Accept, no 406, no Markdown from the canonical URL. That's not an oversight. I tried it, shipped it briefly, and rolled it back.

The reason is Cloudflare's free plan. Custom cache keys are Enterprise-only, and their docs are explicit that Vary: Accept is ignored for caching decisions. The edge collapses every variant of /blog/my-post into one cache entry, so the first requester's format poisons the cache for everyone else until TTL expires.

The workaround is a Worker that bypasses the edge cache. But now every /blog/* request burns a Worker invocation, humans included, and the Workers Free plan gives you 100k per day and 10ms of CPU each. That's a real budget to share across humans and bots, for no functional gain over a static .md URL.
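For reference, the routing decision that Worker made reduces to a pure mapping like this (a hypothetical reconstruction; the actual Worker is gone):

```typescript
// Map a canonical blog path plus Accept header to the asset to serve.
// Any request listing text/markdown for a /blog/ page gets the .md mirror.
function assetPathFor(pathname: string, accept: string | null): string {
  const wantsMarkdown = (accept ?? "")
    .split(",")
    .some((part) => part.trim().toLowerCase().startsWith("text/markdown"));
  if (wantsMarkdown && pathname.startsWith("/blog/") && !pathname.endsWith(".md")) {
    return pathname + ".md";
  }
  return pathname;
}
```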

So I deleted the Worker. The only thing I lost is curl -H "Accept: text/markdown" …/blog/my-post returning Markdown. Between llms.txt, <link rel="alternate">, and the /blog/[slug].md convention, no mainstream agent I've seen actually needs Accept: negotiation. It's the more elegant protocol; alternate URLs are the more robust one on a free-tier CDN. On a paid plan I'd probably do both.

Where this leaves things

Every page exists in two forms, both served from the edge. Agent-facing resources are advertised in response headers on every request, before any HTML gets parsed. Structured data tells engines what the article is and how much context it takes to read. robots.txt says what I'll allow and what I won't.

GEO is still very new. The standards are half-drafted, the tools disagree with each other, and half the signals I described above didn't exist two years ago. I fully expect to be rewriting parts of this post within six months, probably with a different opinion about Accept-based negotiation, once I've either moved off the free plan or found a workaround that doesn't involve a Worker. But for now: serve agents a version they can cheaply consume, be explicit about what you'll allow, and accept that the defaults aren't on your side.

If you're reading this via a summary from some assistant, hi. Thanks for the traffic.
