
Reading robots.txt Like a Real Crawler — Four Traps I Keep Seeing

After a year of running Krawly against 10,000+ public sites, four robots.txt mistakes keep costing site owners crawl budget and indexed pages. Real examples, copy-paste fixes.

By Enis Getmez, Founder & Lead Engineer

Why this matters more than people think

Most site owners write a robots.txt once, ten years ago, and never look at it again. Most of those files are wrong — not catastrophically, but enough to suppress indexing of pages they meant to publish, waste crawl budget on pages they meant to hide, and contradict the sitemap they painstakingly built.

After a year of running Krawly against 10,000+ public sites, four mistakes keep recurring. Each one is a one-line fix. This article walks through them with examples drawn straight from real sites I've audited, then shows the workflow I use to verify a robots.txt is actually doing what its owner intended.

A note on bias: I run Krawly. Two of the tools used in this article are mine. They are linked at the bottom; you can substitute Google's own Robots Testing Tool or Bing's equivalent if you prefer.

How a real crawler reads robots.txt

Before the traps, the model. Every well-behaved crawler — Googlebot, Bingbot, our own Krawly user agent — does roughly this when it visits a domain:

1. Fetch `/robots.txt` from the root. A 4xx response (file missing) means the whole site is treated as crawlable; a 5xx (server error) means the site is treated as temporarily off-limits and the fetch is retried later.

2. Parse the file line by line, grouping rules into one block per `User-agent`. The most specific matching `User-agent` block wins.

3. Apply the directives — `Allow`, `Disallow`, `Crawl-delay` — from the winning block (`Sitemap` lines apply file-wide, regardless of block).

4. For every URL the crawler considers fetching, check against the rules for *its* user agent (or `*` if no specific block exists).

Two important details people miss:

  • Within a `User-agent` block, the most specific (longest) matching rule wins — modern crawlers following RFC 9309 ignore the order in which rules appear.
  • `Allow` is the longer-match override. `Allow: /blog` beats `Disallow: /` because the path is more specific.

With that model, here are the four traps.
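For concreteness, here is a minimal sketch of that resolution logic in Python. It handles plain path prefixes only, ignoring the `*` and `$` wildcards real crawlers also support, and `is_allowed` plus the rule-tuple format are my own illustration, not any crawler's actual API:

```python
# Longest-match rule resolution, the way Googlebot (and RFC 9309)
# resolves conflicting Allow/Disallow rules. Simplified: plain path
# prefixes only, no * or $ wildcards.

def is_allowed(rules, path):
    """rules: list of ("allow" | "disallow", path_prefix) tuples."""
    best_len = -1
    best_verdict = True  # no matching rule => the URL is crawlable
    for verdict, prefix in rules:
        if not path.startswith(prefix):
            continue
        if len(prefix) > best_len:
            best_len, best_verdict = len(prefix), (verdict == "allow")
        elif len(prefix) == best_len and verdict == "allow":
            best_verdict = True  # ties break in favour of Allow
    return best_verdict

rules = [("disallow", "/"), ("allow", "/blog")]
print(is_allowed(rules, "/blog/post"))  # True: Allow: /blog is more specific
print(is_allowed(rules, "/private"))    # False: only Disallow: / matches
```

Note how order never enters into it — only match length and the Allow tie-break.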

Krawly robots.txt Analyzer — parses a file the way Googlebot does

    Trap 1: `Disallow: /admin` accidentally blocking `/admin-guides`

    The directive `Disallow: /admin` means "block any URL whose path starts with `/admin`". That includes `/admin`, `/admin/login`, and `/admin-guides` — because robots.txt is prefix-matched, not folder-matched.

    If your real admin lives at `/admin/` and your public marketing pages live at `/admin-guides/`, the marketing pages are now invisible to search.

This trap shows up on about 1 in 30 sites I scan. The fix is one character:

```
# Wrong:
Disallow: /admin

# Right:
Disallow: /admin/
```

The trailing slash anchors the match to the folder: `/admin-guides` is now allowed, while `/admin/login` stays blocked. (The bare `/admin` URL is also allowed now — usually harmless, since it typically just redirects to `/admin/`.)
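You can see the prefix behaviour in two lines of Python — `startswith` is exactly the match a crawler performs on a plain, wildcard-free rule:

```python
# robots.txt rules are path prefixes: "Disallow: /admin" matches any
# path that begins with the string "/admin", folder boundary or not.

def blocked_by(rule_path, url_path):
    return url_path.startswith(rule_path)

print(blocked_by("/admin", "/admin-guides/seo"))   # True  -- the trap
print(blocked_by("/admin/", "/admin-guides/seo"))  # False -- the fix
print(blocked_by("/admin/", "/admin/login"))       # True  -- still blocked
```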

    Trap 2: Blocking `/wp-admin` while forgetting `/wp-admin/admin-ajax.php`

    WordPress sites overwhelmingly ship with:

```
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

    The Allow line is there for a reason: `admin-ajax.php` is the AJAX endpoint your theme uses for filtering, infinite scroll, faceted search. If you block it, your filtered category pages never get crawled in their post-AJAX form. Google sees the empty pre-JS shell, decides the page has no content, and quietly stops indexing the category tree.

    About 1 in 12 WordPress sites I scan have the Disallow without the Allow. Easy fix; high-impact result for category-page SEO.
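If you manage several WordPress sites, the pattern is easy to lint for. A rough sketch — the function name and the exact-string matching are my simplifications; real files may spell the paths with wildcards or different casing:

```python
# Flag robots.txt bodies that disallow /wp-admin/ without the
# compensating Allow for admin-ajax.php.

def missing_ajax_allow(robots_txt):
    has_disallow = has_allow = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "disallow" and value == "/wp-admin/":
            has_disallow = True
        if field == "allow" and value == "/wp-admin/admin-ajax.php":
            has_allow = True
    return has_disallow and not has_allow

print(missing_ajax_allow("User-agent: *\nDisallow: /wp-admin/"))  # True -> fix needed
```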

    Trap 3: Sitemap path that doesn't match what's actually served

    Sitemap declarations in robots.txt look like:

```
Sitemap: https://example.com/sitemap.xml
```

    Three common ways sites get this wrong in 2026:

    1. Wrong host. `Sitemap: https://www.example.com/sitemap.xml` on a site that canonicalises to `example.com` (without www). Google follows the cross-host sitemap reluctantly and de-prioritises everything in it.

    2. Stale path. You moved the sitemap to `/sitemap_index.xml` (most CMSes default here) but never updated robots.txt. The line points at a 404. Google logs a warning and doesn't backfill the discovery.

    3. HTTP, not HTTPS. Sites that migrated to HTTPS years ago but still have `Sitemap: http://example.com/sitemap.xml`. Modern crawlers tolerate this; some legacy ones don't, and the line just looks careless.
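All three failure modes are scriptable. A sketch using only Python's stdlib — `sitemap_urls` and `sitemap_ok` are my own helper names: pull every `Sitemap:` line out of the file, then fetch each URL and make sure it serves something XML-shaped:

```python
from urllib.request import Request, urlopen

def sitemap_urls(robots_txt):
    """Return every URL declared on a Sitemap: line."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls

def sitemap_ok(url):
    """Crude liveness check: 200 status and an XML-looking body."""
    req = Request(url, headers={"User-Agent": "robots-audit/1.0"})
    with urlopen(req, timeout=10) as resp:
        return resp.status == 200 and resp.read(64).lstrip().startswith(b"<")

robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml"
print(sitemap_urls(robots))  # ['https://example.com/sitemap.xml']
```

Comparing the extracted host against your canonical host catches the www/non-www and HTTP/HTTPS variants of the same mistake.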

    I use Krawly's Sitemap Extractor to verify the sitemap declared in robots.txt actually returns valid XML at the URL given:

Krawly Sitemap Extractor — confirms sitemap declared in robots.txt resolves

    Paste the sitemap URL exactly as it appears in robots.txt. If you see "404" or "not found", fix the path before doing anything else for SEO.

    Trap 4: Crawl-delay that strangles Googlebot

The `Crawl-delay: N` directive asks crawlers to wait N seconds between requests. Site owners add it when they get hit by a bad scraper and panic.

    Two problems:

  • Googlebot ignores `Crawl-delay` — Google's documentation says so explicitly. The directive does nothing for Google traffic; Googlebot tunes its own rate from your server's responses (the old Search Console crawl-rate limiter was retired in early 2024).
  • Bingbot honours it literally. `Crawl-delay: 30` on a site with 10,000 URLs means Bing needs over 83 hours for a single full crawl. By the time the second pass finishes, your new content is weeks old.

I see `Crawl-delay: 10` or `Crawl-delay: 30` on about 1 in 20 small-business sites I scan, almost always added in a panic that nobody remembers. Remove it unless you have a current capacity problem — and if you do, fix the capacity problem rather than strangling search crawlers.
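The Bing arithmetic is worth making concrete:

```python
# One full crawl pass under Crawl-delay, assuming the crawler fetches
# strictly one URL per delay interval.

def full_crawl_hours(url_count, crawl_delay_seconds):
    return url_count * crawl_delay_seconds / 3600

print(full_crawl_hours(10_000, 30))  # 83.33... hours, i.e. ~3.5 days
print(full_crawl_hours(10_000, 1))   # ~2.8 hours with Crawl-delay: 1
```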

    The verification workflow I actually use

    For every site I audit:

    1. robots.txt Analyzer — paste the site URL, see the parsed rules per user agent. Catches typos, conflicting directives, and the "Disallow without trailing slash" trap.

    2. Sitemap Extractor — confirm the sitemap declared in robots.txt actually serves valid XML.

    3. Google Search Console → Crawl stats — verify Google sees what you expect. If your Disallow rules are blocking things you didn't intend, Search Console reports the suppressed URLs.

    4. Curl one of your blocked paths with `User-Agent: Googlebot` — sometimes sites cloak crawlers and serve different robots.txt to bots than to humans. `curl -A "Googlebot" https://example.com/robots.txt` shows you the truth.

    Half a day, every six months. Most sites I run this on find at least one quiet regression — a recently-added `Disallow` from a developer who didn't realise the implications, or a sitemap that quietly died after a CMS migration.
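Step 4 can be scripted too. A stdlib-only sketch of the cloaking check — `fetch_robots` and `normalize` are my own helper names, and the comparison ignores comments and blank lines so cosmetic differences don't raise false alarms:

```python
from urllib.request import Request, urlopen

def fetch_robots(origin, user_agent):
    """Fetch /robots.txt from origin, presenting the given User-Agent."""
    req = Request(origin.rstrip("/") + "/robots.txt",
                  headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def normalize(text):
    """Drop comments and blank lines so only directives are compared."""
    lines = (line.split("#", 1)[0].strip() for line in text.splitlines())
    return [line for line in lines if line]

# Usage (hits the network):
#   human = fetch_robots("https://example.com", "Mozilla/5.0")
#   bot   = fetch_robots("https://example.com", "Googlebot")
#   print("cloaked" if normalize(human) != normalize(bot) else "consistent")
```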

    What robots.txt **cannot** do

    A surprising number of site owners think robots.txt is a security measure. It is not. It is a polite signal to well-behaved crawlers. Five things it does not do:

  • It does not hide pages from determined visitors. robots.txt is a public file. Anyone who fetches `/robots.txt` sees every "secret" path you tried to hide.
  • It does not deindex pages already in Google. A `Disallow` on an already-indexed URL keeps Google from re-crawling it, but the URL stays in the index (often with a snippet that reads "No information is available for this page"). To actually deindex, use `<meta name="robots" content="noindex">` on the page, then leave robots.txt allowing the crawl so Google can see the noindex directive.
  • It does not stop scrapers that ignore it. Scrapers like `curl`, `wget`, hand-rolled `requests` scripts, and most adversarial crawling ignore robots.txt entirely. If you need to actually block scrapers, you need server-side rate limiting, IP-based blocking, or a bot-management product.
  • It does not affect how Google ranks your site. Whether you have a robots.txt at all is not a ranking signal. The directives in it can affect ranking by changing what Google can crawl, but the file's existence doesn't matter to Google's quality scoring.
  • It does not validate against a schema. robots.txt has no required structure beyond "User-agent" and "Disallow" lines. Syntax errors just get silently ignored by most crawlers; you'll never get a "parse error" warning unless you actively test for it.
A real-world example: my own krawly.io robots.txt

    For full transparency, here is the actual file we serve:

```
User-agent: *
Allow: /
Disallow: /api/
Disallow: /dashboard/
Sitemap: https://krawly.io/sitemap.xml
Content-Signal: ai-train=no, search=yes, ai-input=no
```

    What's happening:

  • `User-agent: *` — apply to every crawler that doesn't have a more specific block (none below, so this applies to all of them).
  • `Allow: /` — explicitly allow the root. Redundant in theory but signals intent.
  • `Disallow: /api/` — our REST API endpoints. Search engines have no reason to crawl them; users get the same data through the tool pages.
  • `Disallow: /dashboard/` — accounts area; not useful for search.
  • `Sitemap` — points at the canonical sitemap URL on the same host.
  • `Content-Signal` — a 2025-era extension some AI scrapers respect. We tell training-data collectors we don't consent; we keep search engines and on-demand AI tools allowed.

The whole file is six lines. It says exactly what we mean. It would be longer if we wanted to block AI scrapers per user-agent, but the Content-Signal approach catches the major ones generically.

    What to do after reading this

    Run robots.txt Analyzer on your own site right now. If it surfaces any of the four traps above, fix the line and redeploy. Then run Sitemap Extractor on the sitemap URL your robots.txt declares — if it doesn't return valid XML, fix that too.

    15 minutes of work, sometimes weeks of indexing benefit.

    Corrections, additions

    If you maintain a site where robots.txt does something unusual and you think it's correct, email info@krawly.io with the URL and the reasoning. I'm collecting edge cases for a follow-up post.
