DevLog 20250523: Sitemap and `robots.txt`

Search engine optimization (SEO) is not just about keywords and HTML metadata. Those are the most basic steps (and they can noticeably improve site visibility), but there are other techniques that go a bit deeper and are more technical than what ordinary readers see.

I found that webmasters can submit sitemaps through Google Search Console and Bing Webmaster Tools. This continues our previous discussion on Search Engine Architecture.

Sitemap

A sitemap is simply a file (or sometimes a web page) that tells search engines about the pages on our site.

  • Better discovery: Search engines won’t have to “guess” which pages exist.
  • Faster indexing: New or updated content gets found more quickly when we update the sitemap.
  • Structured hints: Metadata in an XML sitemap gives crawlers extra clues about how often and how important different pages are.

There are two main flavors:

  1. XML Sitemap (for search engines)

    • It’s an XML-formatted file (usually named sitemap.xml) that lives at the website’s root.
    • Inside, it lists all of the website’s important URLs, plus optional metadata like:
      • <lastmod> (when the page was last changed)
      • <changefreq> (how often it tends to be updated)
      • <priority> (a hint about which pages we consider most important)
    • By submitting this file to Google Search Console or Bing Webmaster Tools, we help crawlers discover and index our pages more efficiently—especially useful if we have a very large site, pages that aren’t well linked internally, or lots of media content. (A small generation sketch in Python follows this list.)
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2025-05-20</lastmod>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>https://www.example.com/blog/post-1</loc>
        <lastmod>2025-05-18</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
      <!-- more URLs here -->
    </urlset>
    
  2. HTML Sitemap (for people)

    • It’s just a regular web page on the site that lists links to all pages in a human-readable format.
    • It’s primarily a usability feature—helping visitors (and indirectly search engines) navigate large or complex sites.
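
For small sites, the XML flavor can be generated from a plain page list with no extra tooling. Here is a minimal Python sketch using only the standard library; the URLs, dates, and output filename are placeholders borrowed from the example above, not a prescribed workflow.

    import xml.etree.ElementTree as ET

    # Hypothetical page list: (URL, last modified, change frequency, priority).
    pages = [
        ("https://www.example.com/", "2025-05-20", "daily", "1.0"),
        ("https://www.example.com/blog/post-1", "2025-05-18", "weekly", "0.8"),
    ]

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        for tag, value in (("loc", loc), ("lastmod", lastmod),
                           ("changefreq", changefreq), ("priority", priority)):
            ET.SubElement(url, tag).text = value

    # Writes the XML declaration plus the <urlset> tree to sitemap.xml at the site root.
    ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8", xml_declaration=True)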

`robots.txt`

The robots.txt file is another “cheat sheet”. It sits at the very root of the website (e.g. https://www.example.com/robots.txt) and tells well-behaved web crawlers which parts of the site they’re welcome to explore and which parts we’d rather keep off-limits.

  • Privacy & security: Keep staging directories, admin panels, or confidential files out of search results.
  • Crawl-budget control: On large sites, we can steer crawlers away from low-value pages (like faceted filters), so they focus on important content.
  • Performance: Reduce server load by preventing bots from hammering resource-heavy sections.

Some notes:

  1. Location matters
    • Must live at https://domain.com/robots.txt (exactly).
    • Crawlers automatically look here first before they begin crawling website pages.
  2. Basic syntax

    • It’s plain text, with directives grouped by User-agent (the crawler’s name).
    • Common directives:

      • Disallow: — path (or file) we don’t want crawled
      • Allow: — exception to a Disallow: (supported by Google, Bing, etc.)
      • Sitemap: — URL of the XML sitemap
    # Block all crawlers from /private/
    User-agent: *
    Disallow: /private/

    # For Googlebot, allow one page inside /private/ but keep the rest blocked.
    # A crawler follows only its most specific matching group, so the
    # Disallow must be repeated here rather than inherited from *.
    User-agent: Googlebot
    Allow: /private/public-info.html
    Disallow: /private/

    # Let everyone know where the sitemap lives
    Sitemap: https://www.example.com/sitemap.xml
    
  3. User-agents
    • * is the wildcard: applies to every crawler.
    • We can target specific bots (e.g., User-agent: Googlebot, User-agent: Bingbot) if we need different rules.
  4. Disallow vs. Allow
    • Disallow: / — don’t crawl anything on the site.
    • Disallow: (empty) — allow everything.
    • Allow: /path/to/page.html — lets a crawler fetch a page that would otherwise be blocked by a broader Disallow: (see the sketch after this list for a quick way to verify such rules).
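
To check that the rules do what we expect, Python’s standard-library urllib.robotparser can replay a robots.txt against specific user agents and URLs. A minimal sketch, assuming the example file above is actually served at the placeholder domain:

    from urllib import robotparser

    # Fetch and parse the (placeholder) site's robots.txt.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()

    # Generic crawlers fall under the "*" group: /private/ is blocked.
    print(rp.can_fetch("*", "https://www.example.com/private/"))  # False
    # Googlebot matches its own group: the single allowed page is reachable.
    print(rp.can_fetch("Googlebot",
                       "https://www.example.com/private/public-info.html"))  # True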

A few best practices

  • robots.txt is publicly visible; don’t use it to hide truly sensitive info (use authentication!).
  • Test the file in Google Search Console or Bing Webmaster Tools before relying on it.
  • Combine with sitemaps: always include a Sitemap: line so crawlers can discover all valid URLs easily.
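
Since robots.txt advertises the sitemap, the two files can be tied together programmatically: discover the Sitemap: URL first, then walk its <loc> entries. A minimal sketch, assuming Python 3.8+ (for site_maps()) and the same placeholder domain:

    from urllib import request, robotparser
    import xml.etree.ElementTree as ET

    # Read robots.txt from the (placeholder) site.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()

    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    # site_maps() returns the URLs from Sitemap: lines, or None if there are none.
    for sitemap_url in rp.site_maps() or []:
        with request.urlopen(sitemap_url) as resp:
            root = ET.fromstring(resp.read())
        # Print every <loc> entry declared in the sitemap.
        for loc in root.iter(NS + "loc"):
            print(loc.text)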

Here is an example from Google: https://www.google.com/robots.txt
