
Reading robots.txt Like a Real Crawler — Four Traps I Keep Seeing

After a year of running Krawly against 10,000+ public sites, four robots.txt mistakes keep costing site owners crawl budget and indexed pages. Real examples, copy-paste fixes.

By Enis Getmez, Founder & Lead Engineer

Why this matters more than people think

Most site owners write a robots.txt once, ten years ago, and never look at it again. Most of those files are wrong — not catastrophically, but enough to suppress indexing of pages they meant to publish, waste crawl budget on pages they meant to hide, and contradict the sitemap they painstakingly built.

After a year of running Krawly against 10,000+ public sites, four mistakes keep recurring. Each one is a one-line fix. This article walks through them with examples drawn straight from real sites I've audited, then shows the workflow I use to verify a robots.txt is actually doing what its owner intended.

A note on bias: I run Krawly. Two of the tools used in this article are mine. They are linked at the bottom; you can substitute Google's own Robots Testing Tool or Bing's equivalent if you prefer.

How a real crawler reads robots.txt

Before the traps, the model. Every well-behaved crawler — Googlebot, Bingbot, our own Krawly user agent — does roughly this when it visits a domain:

1. Fetch `/robots.txt` from the root. A 4xx response (file missing) means the whole site is treated as crawlable; a 5xx (server error) means the site is treated as temporarily off-limits and the fetch is retried later.

2. Parse the file line by line, grouping rules into one block per `User-agent`. The most specific matching `User-agent` block wins.

3. Apply the directives — `Allow`, `Disallow`, `Crawl-delay` — from the winning block (`Sitemap` lines apply file-wide, regardless of block).

4. For every URL the crawler considers fetching, check against the rules for *its* user agent (or `*` if no specific block exists).

Two important details people miss:

  • Within a `User-agent` block, the most specific (longest) matching rule wins — modern crawlers following RFC 9309 ignore the order in which rules appear.
  • `Allow` is the longer-match override. `Allow: /blog` beats `Disallow: /` because the path is more specific.

With that model, here are the four traps.
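For concreteness, here is a minimal sketch of that resolution logic in Python. It handles plain path prefixes only, ignoring the `*` and `$` wildcards real crawlers also support, and `is_allowed` plus the rule-tuple format are my own illustration, not any crawler's actual API:

```python
# Longest-match rule resolution, the way Googlebot (and RFC 9309)
# resolves conflicting Allow/Disallow rules. Simplified: plain path
# prefixes only, no * or $ wildcards.

def is_allowed(rules, path):
    """rules: list of ("allow" | "disallow", path_prefix) tuples."""
    best_len = -1
    best_verdict = True  # no matching rule => the URL is crawlable
    for verdict, prefix in rules:
        if not path.startswith(prefix):
            continue
        if len(prefix) > best_len:
            best_len, best_verdict = len(prefix), (verdict == "allow")
        elif len(prefix) == best_len and verdict == "allow":
            best_verdict = True  # ties break in favour of Allow
    return best_verdict

rules = [("disallow", "/"), ("allow", "/blog")]
print(is_allowed(rules, "/blog/post"))  # True: Allow: /blog is more specific
print(is_allowed(rules, "/private"))    # False: only Disallow: / matches
```

Note how order never enters into it — only match length and the Allow tie-break.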

Krawly robots.txt Analyzer — parses a file the way Googlebot does

    Trap 1: `Disallow: /admin` accidentally blocking `/admin-guides`

    The directive `Disallow: /admin` means "block any URL whose path starts with `/admin`". That includes `/admin`, `/admin/login`, and `/admin-guides` — because robots.txt is prefix-matched, not folder-matched.

    If your real admin lives at `/admin/` and your public marketing pages live at `/admin-guides/`, the marketing pages are now invisible to search.

This trap shows up on about 1 in 30 sites I scan. The fix is one character:

```
# Wrong:
Disallow: /admin

# Right:
Disallow: /admin/
```

The trailing slash anchors the match to the folder: `/admin-guides` is now allowed, while `/admin/login` stays blocked. (The bare `/admin` URL is also allowed now — usually harmless, since it typically just redirects to `/admin/`.)
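You can see the prefix behaviour in two lines of Python — `startswith` is exactly the match a crawler performs on a plain, wildcard-free rule:

```python
# robots.txt rules are path prefixes: "Disallow: /admin" matches any
# path that begins with the string "/admin", folder boundary or not.

def blocked_by(rule_path, url_path):
    return url_path.startswith(rule_path)

print(blocked_by("/admin", "/admin-guides/seo"))   # True  -- the trap
print(blocked_by("/admin/", "/admin-guides/seo"))  # False -- the fix
print(blocked_by("/admin/", "/admin/login"))       # True  -- still blocked
```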

    Trap 2: Blocking `/wp-admin` while forgetting `/wp-admin/admin-ajax.php`

    WordPress sites overwhelmingly ship with:

```
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

    The Allow line is there for a reason: `admin-ajax.php` is the AJAX endpoint your theme uses for filtering, infinite scroll, faceted search. If you block it, your filtered category pages never get crawled in their post-AJAX form. Google sees the empty pre-JS shell, decides the page has no content, and quietly stops indexing the category tree.

    About 1 in 12 WordPress sites I scan have the Disallow without the Allow. Easy fix; high-impact result for category-page SEO.
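If you manage several WordPress sites, the pattern is easy to lint for. A rough sketch — the function name and the exact-string matching are my simplifications; real files may spell the paths with wildcards or different casing:

```python
# Flag robots.txt bodies that disallow /wp-admin/ without the
# compensating Allow for admin-ajax.php.

def missing_ajax_allow(robots_txt):
    has_disallow = has_allow = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "disallow" and value == "/wp-admin/":
            has_disallow = True
        if field == "allow" and value == "/wp-admin/admin-ajax.php":
            has_allow = True
    return has_disallow and not has_allow

print(missing_ajax_allow("User-agent: *\nDisallow: /wp-admin/"))  # True -> fix needed
```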

    Trap 3: Sitemap path that doesn't match what's actually served

    Sitemap declarations in robots.txt look like:

```
Sitemap: https://example.com/sitemap.xml
```

    Three common ways sites get this wrong in 2026:

    1. Wrong host. `Sitemap: https://www.example.com/sitemap.xml` on a site that canonicalises to `example.com` (without www). Google follows the cross-host sitemap reluctantly and de-prioritises everything in it.

    2. Stale path. You moved the sitemap to `/sitemap_index.xml` (most CMSes default here) but never updated robots.txt. The line points at a 404. Google logs a warning and doesn't backfill the discovery.

    3. HTTP, not HTTPS. Sites that migrated to HTTPS years ago but still have `Sitemap: http://example.com/sitemap.xml`. Modern crawlers tolerate this; some legacy ones don't, and the line just looks careless.
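All three failure modes are scriptable. A sketch using only Python's stdlib — `sitemap_urls` and `sitemap_ok` are my own helper names: pull every `Sitemap:` line out of the file, then fetch each URL and make sure it serves something XML-shaped:

```python
from urllib.request import Request, urlopen

def sitemap_urls(robots_txt):
    """Return every URL declared on a Sitemap: line."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls

def sitemap_ok(url):
    """Crude liveness check: 200 status and an XML-looking body."""
    req = Request(url, headers={"User-Agent": "robots-audit/1.0"})
    with urlopen(req, timeout=10) as resp:
        return resp.status == 200 and resp.read(64).lstrip().startswith(b"<")

robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml"
print(sitemap_urls(robots))  # ['https://example.com/sitemap.xml']
```

Comparing the extracted host against your canonical host catches the www/non-www and HTTP/HTTPS variants of the same mistake.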

    I use Krawly's Sitemap Extractor to verify the sitemap declared in robots.txt actually returns valid XML at the URL given:

Krawly Sitemap Extractor — confirms sitemap declared in robots.txt resolves

    Paste the sitemap URL exactly as it appears in robots.txt. If you see "404" or "not found", fix the path before doing anything else for SEO.

    Trap 4: Crawl-delay that strangles Googlebot

The `Crawl-delay: N` directive asks crawlers to wait N seconds between requests. Site owners add it when they get hit by a bad scraper and panic.

    Two problems:

  • Googlebot ignores `Crawl-delay` — Google's documentation says so explicitly. The directive does nothing for Google traffic; Googlebot tunes its own rate from your server's responses (the old Search Console crawl-rate limiter was retired in early 2024).
  • Bingbot honours it literally. `Crawl-delay: 30` on a site with 10,000 URLs means Bing needs over 83 hours for a single full crawl. By the time the second pass finishes, your new content is weeks old.

I see `Crawl-delay: 10` or `Crawl-delay: 30` on about 1 in 20 small-business sites I scan, almost always added in a panic that nobody remembers. Remove it unless you have a current capacity problem — and if you do, fix the capacity problem rather than strangling search crawlers.
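The Bing arithmetic is worth making concrete:

```python
# One full crawl pass under Crawl-delay, assuming the crawler fetches
# strictly one URL per delay interval.

def full_crawl_hours(url_count, crawl_delay_seconds):
    return url_count * crawl_delay_seconds / 3600

print(full_crawl_hours(10_000, 30))  # 83.33... hours, i.e. ~3.5 days
print(full_crawl_hours(10_000, 1))   # ~2.8 hours with Crawl-delay: 1
```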

    The verification workflow I actually use

    For every site I audit:

    1. robots.txt Analyzer — paste the site URL, see the parsed rules per user agent. Catches typos, conflicting directives, and the "Disallow without trailing slash" trap.

    2. Sitemap Extractor — confirm the sitemap declared in robots.txt actually serves valid XML.

    3. Google Search Console → Crawl stats — verify Google sees what you expect. If your Disallow rules are blocking things you didn't intend, Search Console reports the suppressed URLs.

    4. Curl one of your blocked paths with `User-Agent: Googlebot` — sometimes sites cloak crawlers and serve different robots.txt to bots than to humans. `curl -A "Googlebot" https://example.com/robots.txt` shows you the truth.

    Half a day, every six months. Most sites I run this on find at least one quiet regression — a recently-added `Disallow` from a developer who didn't realise the implications, or a sitemap that quietly died after a CMS migration.
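Step 4 can be scripted too. A stdlib-only sketch of the cloaking check — `fetch_robots` and `normalize` are my own helper names, and the comparison ignores comments and blank lines so cosmetic differences don't raise false alarms:

```python
from urllib.request import Request, urlopen

def fetch_robots(origin, user_agent):
    """Fetch /robots.txt from origin, presenting the given User-Agent."""
    req = Request(origin.rstrip("/") + "/robots.txt",
                  headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def normalize(text):
    """Drop comments and blank lines so only directives are compared."""
    lines = (line.split("#", 1)[0].strip() for line in text.splitlines())
    return [line for line in lines if line]

# Usage (hits the network):
#   human = fetch_robots("https://example.com", "Mozilla/5.0")
#   bot   = fetch_robots("https://example.com", "Googlebot")
#   print("cloaked" if normalize(human) != normalize(bot) else "consistent")
```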

    What robots.txt **cannot** do

    A surprising number of site owners think robots.txt is a security measure. It is not. It is a polite signal to well-behaved crawlers. Five things it does not do:

  • It does not hide pages from determined visitors. robots.txt is a public file. Anyone who fetches `/robots.txt` sees every "secret" path you tried to hide.
  • It does not deindex pages already in Google. A `Disallow` on an already-indexed URL keeps Google from re-crawling it, but the URL stays in the index (often with a snippet that reads "No information is available for this page"). To actually deindex, use `<meta name="robots" content="noindex">` on the page, then leave robots.txt allowing the crawl so Google can see the noindex directive.
  • It does not stop scrapers that ignore it. Scrapers like `curl`, `wget`, hand-rolled `requests` scripts, and most adversarial crawling ignore robots.txt entirely. If you need to actually block scrapers, you need server-side rate limiting, IP-based blocking, or a bot-management product.
  • It does not affect how Google ranks your site. Whether you have a robots.txt at all is not a ranking signal. The directives in it can affect ranking by changing what Google can crawl, but the file's existence doesn't matter to Google's quality scoring.
  • It does not validate against a schema. robots.txt has no required structure beyond "User-agent" and "Disallow" lines. Syntax errors just get silently ignored by most crawlers; you'll never get a "parse error" warning unless you actively test for it.
A real-world example: my own krawly.io robots.txt

    For full transparency, here is the actual file we serve:

```
User-agent: *
Allow: /
Disallow: /api/
Disallow: /dashboard/
Sitemap: https://krawly.io/sitemap.xml
Content-Signal: ai-train=no, search=yes, ai-input=no
```

    What's happening:

  • `User-agent: *` — apply to every crawler that doesn't have a more specific block (none below, so this applies to all of them).
  • `Allow: /` — explicitly allow the root. Redundant in theory but signals intent.
  • `Disallow: /api/` — our REST API endpoints. Search engines have no reason to crawl them; users get the same data through the tool pages.
  • `Disallow: /dashboard/` — accounts area; not useful for search.
  • `Sitemap` — points at the canonical sitemap URL on the same host.
  • `Content-Signal` — a 2025-era extension some AI scrapers respect. We tell training-data collectors we don't consent; we keep search engines and on-demand AI tools allowed.

The whole file is six lines. It says exactly what we mean. It would be longer if we wanted to block AI scrapers per user-agent, but the Content-Signal approach catches the major ones generically.

    What to do after reading this

    Run robots.txt Analyzer on your own site right now. If it surfaces any of the four traps above, fix the line and redeploy. Then run Sitemap Extractor on the sitemap URL your robots.txt declares — if it doesn't return valid XML, fix that too.

    15 minutes of work, sometimes weeks of indexing benefit.

    Corrections, additions

    If you maintain a site where robots.txt does something unusual and you think it's correct, email info@krawly.io with the URL and the reasoning. I'm collecting edge cases for a follow-up post.
