Crawl-control guide

How to Test robots.txt Without Blocking Important URLs

Test a robots.txt draft against one crawler and one URL path before publishing, then recheck the live root file without confusing crawl blocking with index removal.


Draft first

Paste before publish

Check the exact draft rules before they can affect the live site.

One path

Inspect the matched rule

A useful test names the crawler, target URL, final verdict, and rule that caused it.

Crawl, not index

Do not use Disallow as removal

Robots.txt blocks crawling. It does not reliably remove URLs that search engines already know about.

When to use a robots.txt tester instead of editing blind

Use a robots.txt tester any time a rule change could affect public templates, product pages, docs, localized pages, or search-result pages. The test turns a risky text edit into a specific question: can this crawler fetch this URL path under these rules?

This matters because robots.txt controls crawler access, not indexation. Google Search Central's robots documentation separates crawl blocking from search-result removal, and it requires the file to sit at the site root. If the published file is missing, unreachable, or placed in a subfolder, a result from the tester can differ from the rules the crawler actually applies.
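As a minimal sketch of the root-location rule, the Python helper below derives the robots.txt URL that applies to any page URL. The function name robots_url_for is hypothetical, and the sketch assumes it receives an absolute http or https URL.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the root robots.txt URL for the origin that serves page_url.

    Hypothetical helper: assumes page_url is absolute (scheme and host present).
    """
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected an absolute URL, got {page_url!r}")
    # Keep scheme, host, and port; drop the path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://saas.example.com/admin/help?tab=faq"))
```

Because the path is always replaced with /robots.txt, a file uploaded to a subfolder such as /docs/robots.txt never becomes the file this helper points to, which mirrors where crawlers look.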

References: Google Search Central robots.txt introduction and Google robots.txt creation guide.

Draft mode vs live mode: what each test proves

Draft mode proves how a pasted file would behave if published. Use it in a pull request, CMS preview, migration plan, or incident fix before the root file changes. It is the safest place to catch a broad Disallow rule that would block product, help, or blog URLs by accident.

Live mode proves what is currently published at the tested origin's root robots.txt file. It does not prove that Google has recrawled the file yet, and it does not inspect sitemaps or downstream pages. It answers whether the live file now allows or blocks a specific URL path for a specific crawler.

How to test one URL path against one crawler

Work from a concrete example. A SaaS site needs to block private admin screens and internal search result pages, keep the public blog crawlable, and allow Googlebot to crawl the public help page under /admin/help.

User-agent: *
Disallow: /admin/
Disallow: /search/

User-agent: Googlebot
Allow: /admin/help
Disallow: /admin/
Disallow: /search/

Sitemap: https://saas.example.com/sitemap.xml

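Paste that file into the tester, set the crawler to Googlebot, and run these three URLs separately. Googlebot obeys only the most specific group that names it and ignores the * group, which is why the Googlebot group repeats both Disallow lines.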


URL tested	Result	Matched rule and reason
https://saas.example.com/admin/billing	Blocked for Googlebot	The path matches Disallow: /admin/. That is expected for private admin screens and should stay out of the public crawl path.
https://saas.example.com/admin/help	Allowed for Googlebot	The Googlebot-specific Allow rule is a longer match than /admin/, so the help center path can still be crawled.
https://saas.example.com/blog/robots-testing-checklist	Allowed for Googlebot	The blog URL does not match the blocked /admin/ or /search/ folders, so content pages remain crawlable.
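The three checks above can be approximated with Python's standard urllib.robotparser module. This is a sketch, not the tester's implementation: the names DRAFT, TEST_URLS, and draft_verdicts are illustrative, and the standard-library parser applies the first matching rule in file order rather than the longest match. Its verdicts agree with the table here only because Allow: /admin/help is listed before Disallow: /admin/ in the Googlebot group.

```python
from urllib import robotparser

# The draft file from the example above, pasted as text.
DRAFT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/

User-agent: Googlebot
Allow: /admin/help
Disallow: /admin/
Disallow: /search/

Sitemap: https://saas.example.com/sitemap.xml
"""

TEST_URLS = [
    "https://saas.example.com/admin/billing",
    "https://saas.example.com/admin/help",
    "https://saas.example.com/blog/robots-testing-checklist",
]


def draft_verdicts(robots_text, crawler, urls):
    """Map each URL to True (allowed) or False (blocked) for one crawler."""
    parser = robotparser.RobotFileParser()
    # parse() reads the pasted lines; nothing is fetched from the network.
    parser.parse(robots_text.splitlines())
    return {url: parser.can_fetch(crawler, url) for url in urls}


for url, allowed in draft_verdicts(DRAFT, "Googlebot", TEST_URLS).items():
    print("Allowed" if allowed else "Blocked", url)
```

Running the same function with a crawler that has no group of its own falls back to the * group, so /admin/help is blocked for that crawler.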

If those three results match the intent, publish the file to https://saas.example.com/robots.txt. Then switch to live mode and re-run the same three URLs. The live recheck catches deployment errors such as a wrong root path, a stale CDN cache, an unexpected redirect, or a file generated from a different rule set.
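The deployment errors listed above can be screened with a short script before re-running the per-URL checks. The sketch below is one possible approach, not the tester's own logic: fetch_live and deployment_problems are hypothetical names, and the comparison assumes the published file should match the tested draft line for line once blank lines and surrounding whitespace are ignored.

```python
import urllib.error
import urllib.request


def fetch_live(robots_url, timeout=10.0):
    """Fetch the published file and return (status, final_url, body)."""
    try:
        with urllib.request.urlopen(robots_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            # resp.url is the address after any redirects were followed.
            return resp.status, resp.url, body
    except urllib.error.HTTPError as err:
        return err.code, robots_url, ""


def deployment_problems(status, final_url, body, expected_url, draft):
    """List the ways the live response differs from what was tested."""
    def rule_lines(text):
        return [line.strip() for line in text.splitlines() if line.strip()]

    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    if final_url != expected_url:
        problems.append(f"redirected to {final_url}")
    if status == 200 and rule_lines(body) != rule_lines(draft):
        problems.append("published rules differ from the tested draft")
    return problems


# Example call (needs network access, so it is left commented out):
# url = "https://saas.example.com/robots.txt"
# print(deployment_problems(*fetch_live(url), expected_url=url, draft=open("robots.txt").read()))
```

An empty list means the file at the root matches the draft that passed. It still says nothing about whether any search engine has refetched the file.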

How to read blocked, allowed, wildcard, and longest-match results

A blocked result means the most specific matching directive prevents the crawler from fetching that path. An allowed result means no matching block applies, or an Allow rule wins for that crawler and path. Do not stop at the verdict; read the matched directive.

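Wildcards and longest-match behavior are where mistakes hide. In Google's implementation, * matches any sequence of characters and $ anchors the end of the URL, so one pattern can reach more paths than a quick read suggests. When several rules match, the longest one wins: a broad Disallow: /admin/ can be narrowed by a longer Allow: /admin/help for a specific crawler. The safer review question is not "does the file look right?" It is "which single rule wins for this exact URL?"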
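To make the "which single rule wins" question concrete, here is a simplified Python sketch of longest-match resolution as Google documents it: the longest matching pattern wins, and an Allow beats a Disallow of equal length. The function names are hypothetical, the /*.pdf$ rule is an illustrative addition that is not part of the example file, and the sketch skips group selection and percent-encoding normalization.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def winning_rule(rules, path):
    """Return the (directive, pattern) that decides path, or None.

    rules holds the lines of ONE crawler group as (directive, pattern)
    pairs. None means no rule matched, so the path is allowed by default.
    """
    best_key, best_rule = None, None
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            # Longer pattern wins; on a tie, allow (True) outranks disallow.
            key = (len(pattern), directive == "allow")
            if best_key is None or key > best_key:
                best_key, best_rule = key, (directive, pattern)
    return best_rule


googlebot_rules = [
    ("allow", "/admin/help"),
    ("disallow", "/admin/"),
    ("disallow", "/search/"),
    ("disallow", "/*.pdf$"),  # hypothetical wildcard rule for illustration
]

for path in ["/admin/help", "/admin/billing", "/files/guide.pdf", "/blog/post"]:
    print(path, "->", winning_rule(googlebot_rules, path))
```

Returning the winning rule rather than a bare verdict keeps the review focused on why a path is blocked or allowed. The path passed in should include any query string, since an end anchor such as $ no longer matches once ?download=1 is appended.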

How to fix blocked-by-robots mistakes without confusing crawl with indexation

If the tester says a public URL is blocked, revise the draft in the Robots.txt Generator, then test the exact path again before publishing. Avoid changing many folders at once; narrow the rule until the intended public URL becomes allowed and the private URL remains blocked.

If Search Console or a search result still shows a blocked URL after the crawl rule is correct, treat that as an indexation problem. A crawler blocked by robots.txt cannot fetch the page, so it never sees a page-level noindex directive.

For removal, make the page crawlable long enough for noindex to be seen, return an appropriate unavailable status such as 404 or 410, or use the search engine's removal tooling. More robots.txt edits alone do not remove a known URL.

A safe publish checklist after the test passes

Finish by keeping the crawl workflow connected: generate or update your XML sitemap, test robots rules for important templates, and make sure the sitemap lists only URLs that should be discovered.

Test the draft in paste mode before the file is deployed.
Run one crawler and one URL path at a time so the matched rule is inspectable.
Confirm the published file is reachable at the site root: https://saas.example.com/robots.txt.
Recheck the same URLs in live mode after publish.
Use page-level noindex or a removal workflow for already indexed URLs that should disappear from results.

Primary workflow

Robots.txt Tester

Test a pasted draft or live root file against one crawler and one URL path.

Draft companion

Robots.txt Generator

Create the allow, disallow, and sitemap lines before validating the result.

Related crawl file

XML Sitemap Generator

Generate the discoverable URL inventory that should align with crawl rules.

FAQ

How do I know if robots.txt is blocking a page?

Test the exact URL path against the crawler user-agent you care about. A robots tester should show the matched Allow or Disallow rule and whether that rule permits crawling.

Does robots.txt remove a page from Google?

No. Robots.txt controls crawling. A URL that Google already discovered can still appear in results, especially when other pages link to it.

Can I test robots.txt before I publish it?

Yes. Paste the draft file into a robots.txt tester, enter one target URL and one crawler, and inspect the matched rule before deployment.

Does robots.txt have to live at the site root?

Yes. Crawlers look for robots.txt at the root of the origin, such as https://example.com/robots.txt.

Why does Google still show a blocked URL?

The URL may have been discovered from links, sitemaps, or previous crawls. If it must leave search results, use a crawlable noindex page or the appropriate removal process.