Crawl-control guide
How to Test robots.txt Without Blocking Important URLs
Test a robots.txt draft against one crawler and one URL path before publishing, then recheck the live root file without confusing crawl blocking with index removal.
Paste before publish
Check the exact draft rules before they can affect the live site.
Inspect the matched rule
A useful test names the crawler, target URL, final verdict, and rule that caused it.
Do not use Disallow as removal
Robots.txt blocks crawling. It does not reliably remove URLs already known to search engines.
When to use a robots.txt tester instead of editing blind
Use a robots.txt tester any time a rule change could affect public templates, product pages, docs, localized pages, or search-result pages. The test turns a risky text edit into a specific question: can this crawler fetch this URL path under these rules?
This matters because robots.txt controls crawler access, not indexation. Google Search Central's robots documentation separates crawl blocking from search-result removal, and it also expects the file at the site root. If the file is missing, unreachable, or placed under a subfolder, the test result can be different from what the crawler sees.
References: Google Search Central robots.txt introduction and Google robots.txt creation guide.
Draft mode vs live mode: what each test proves
Draft mode proves how a pasted file would behave if published. Use it in a pull request, CMS preview, migration plan, or incident fix before the root file changes. It is the safest place to catch a broad Disallow rule that would block product, help, or blog URLs by accident.
Live mode proves what is currently published at the tested origin's root robots.txt file. It does not prove that Google has
recrawled the file yet, and it does not inspect sitemaps or downstream pages. It answers
whether the live file now allows or blocks a specific URL path for a specific crawler.
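Live mode always reads one fixed location per origin. As a minimal sketch, assuming the helper name `robots_url_for` (hypothetical, not part of any tool named here), the root file URL can be derived from any tested page URL like this:

```python
from urllib.parse import urlsplit


def robots_url_for(page_url: str) -> str:
    """Return the root robots.txt URL for the origin that serves page_url."""
    parts = urlsplit(page_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {page_url!r}")
    # Scheme, host, and port identify the origin; path, query, and fragment are dropped.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url_for("https://saas.example.com/admin/help?tab=faq"))
```

A file published anywhere else, such as under /blog/robots.txt, is not the file this lookup targets.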
How to test one URL path against one crawler
Work from a concrete example. A SaaS site needs to block private admin screens and internal
search result pages, keep the public blog crawlable, and allow Googlebot to crawl the public
help page under /admin/help.
User-agent: *
Disallow: /admin/
Disallow: /search/

User-agent: Googlebot
Allow: /admin/help
Disallow: /admin/
Disallow: /search/

Sitemap: https://saas.example.com/sitemap.xml

Paste that file into the tester, set the crawler to Googlebot, and run these three URLs separately.
| URL | Verdict | Matched rule and reason |
| --- | --- | --- |
| https://saas.example.com/admin/billing | Blocked for Googlebot | The path matches Disallow: /admin/. That is expected for private admin screens and should stay out of the public crawl path. |
| https://saas.example.com/admin/help | Allowed for Googlebot | The Googlebot-specific Allow rule is a longer match than /admin/, so the help center path can still be crawled. |
| https://saas.example.com/blog/robots-testing-checklist | Allowed for Googlebot | The blog URL does not match the blocked /admin/ or /search/ folders, so content pages remain crawlable. |
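The same three checks can be approximated offline with Python's standard-library `urllib.robotparser`. This is a rough sketch, not how any particular tester works: the function name `check_draft` is made up for illustration, and Python's parser applies the first matching rule in file order rather than the longest match. The two approaches agree on this file only because the Allow line is listed before the broader Disallow line. The parser also returns a verdict without naming the matched rule.

```python
from urllib.robotparser import RobotFileParser

DRAFT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/

User-agent: Googlebot
Allow: /admin/help
Disallow: /admin/
Disallow: /search/

Sitemap: https://saas.example.com/sitemap.xml
"""

URLS = [
    "https://saas.example.com/admin/billing",
    "https://saas.example.com/admin/help",
    "https://saas.example.com/blog/robots-testing-checklist",
]


def check_draft(draft: str, crawler: str, urls: list[str]) -> dict[str, bool]:
    """Evaluate pasted draft text for one crawler; nothing is fetched from the network."""
    parser = RobotFileParser()
    parser.parse(draft.splitlines())
    return {url: parser.can_fetch(crawler, url) for url in urls}


RESULTS = check_draft(DRAFT, "Googlebot", URLS)
for url, allowed in RESULTS.items():
    print("allowed" if allowed else "blocked", url)
```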
If those three results match the intent, publish the file to https://saas.example.com/robots.txt. Then switch to
live mode and re-run the same three URLs. The live recheck catches deployment errors such as
the wrong root path, old CDN cache, unexpected redirects, or a file generated from a different
rule set.
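One way to spot a stale cache or a file generated from a different rule set is to compare the directives in the draft with the text actually served at the root. The sketch below assumes the live text has already been downloaded into a string; the function names `directive_lines` and `rule_drift` are illustrative only.

```python
import difflib


def directive_lines(text: str) -> list[str]:
    """Drop comments, blank lines, and surrounding whitespace so only directives are compared."""
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            lines.append(line)
    return lines


def rule_drift(draft: str, live: str) -> list[str]:
    """Return unified-diff lines; an empty list means the directives match."""
    return list(
        difflib.unified_diff(
            directive_lines(draft),
            directive_lines(live),
            fromfile="draft",
            tofile="live",
            lineterm="",
        )
    )


draft = "User-agent: *\nDisallow: /admin/\nDisallow: /search/\n"
stale_live = "User-agent: *\nDisallow: /admin/\n"  # e.g. an old cached copy

for line in rule_drift(draft, stale_live):
    print(line)
```

An empty diff only shows that the published text matches the draft. It says nothing about whether a search engine has recrawled the file.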
How to read blocked, allowed, wildcard, and longest-match results
A blocked result means the most specific matching directive prevents the crawler from fetching that path. An allowed result means no matching block applies, or an Allow rule wins for that crawler and path. Do not stop at the verdict; read the matched directive.
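Wildcards and longest-match behavior are where mistakes hide. In Google's matching, * stands for any sequence of characters and a trailing $ anchors the end of the URL path, so a single wildcard rule can cover far more URLs than it appears to. A broad Disallow: /admin/ can be narrowed by a longer Allow: /admin/help for a specific crawler. The safer review question is not "does the file look right?" It is "which single rule wins for this exact URL?"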
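To make "which single rule wins" concrete, here is a small sketch of longest-match evaluation for one crawler group. It is a simplified reading of Google's documented behavior, not a reference implementation: `*` matches any run of characters, a trailing `$` anchors the end of the path, the longest matching pattern wins, and Allow wins a tie. The function names and the `/*.pdf$` rule are hypothetical and are not part of the example file.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start of the path."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def winning_rule(rules, path):
    """rules: (directive, pattern) pairs for ONE crawler group. Returns (verdict, matched rule or None)."""
    best = None
    for directive, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            key = (len(pattern), is_allow)  # longer pattern wins; Allow wins ties
            if best is None or key > best[0]:
                best = (key, directive, pattern)
    if best is None:
        return "allowed", None  # no rule matched, so crawling is permitted
    _, directive, pattern = best
    verdict = "allowed" if directive.lower() == "allow" else "blocked"
    return verdict, f"{directive}: {pattern}"


GOOGLEBOT_RULES = [
    ("Allow", "/admin/help"),
    ("Disallow", "/admin/"),
    ("Disallow", "/search/"),
]

for path in ("/admin/billing", "/admin/help", "/blog/robots-testing-checklist"):
    print(path, winning_rule(GOOGLEBOT_RULES, path))

# Hypothetical wildcard rule: block only paths that end in .pdf
print(winning_rule([("Disallow", "/*.pdf$")], "/docs/guide.pdf"))
```

Returning the matched rule alongside the verdict is the point of the design: a reviewer can see that /admin/help is allowed because of the longer Allow line, not because nothing matched.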
How to fix blocked-by-robots mistakes without confusing crawl with indexation
If the tester says a public URL is blocked, revise the draft in the Robots.txt Generator, then test the exact path again before publishing. Avoid changing many folders at once; narrow the rule until the intended public URL becomes allowed and the private URL remains blocked.
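For removal, keep the page crawlable long enough for the crawler to see a noindex directive, return a 404 or 410 status for pages that are gone, or use the search engine's removal tooling. A robots.txt block stops the crawler from fetching the page, so it never sees the noindex. More robots.txt edits alone do not remove a known URL.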
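The dependency between crawling and noindex can be expressed as a small check. This sketch assumes the page HTML and response headers are already in hand, looks only for a plain `noindex` token in a robots meta tag or an `X-Robots-Tag` header, and ignores crawler-prefixed header forms. All names are illustrative.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            self.directives += [d.strip().lower() for d in attr.get("content", "").split(",")]


def has_noindex(html, headers=None):
    """True if a noindex directive appears in the markup or in an X-Robots-Tag header."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    by_name = {k.lower(): v for k, v in (headers or {}).items()}
    header_directives = [d.strip().lower() for d in by_name.get("x-robots-tag", "").split(",")]
    return "noindex" in finder.directives or "noindex" in header_directives


def noindex_can_be_seen(crawl_allowed, html, headers=None):
    """A noindex only takes effect if robots.txt lets the crawler fetch the page."""
    return crawl_allowed and has_noindex(html, headers)


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(noindex_can_be_seen(True, page))   # page is crawlable and carries noindex
print(noindex_can_be_seen(False, page))  # page is blocked, so the tag is never read
```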
A safe publish checklist after the test passes
Finish by keeping the crawl workflow connected: generate or update your XML sitemap, test robots rules for important templates, and make sure the sitemap lists only URLs that should be discovered.
- Test the draft in paste mode before the file is deployed.
- Run one crawler and one URL path at a time so the matched rule is inspectable.
- Confirm the published file is reachable at the site root: https://saas.example.com/robots.txt.
- Recheck the same URLs in live mode after publish.
- Use page-level noindex or a removal workflow for already indexed URLs that should disappear from results.
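To check that the sitemap and the crawl rules agree, the listed URLs can be run through the same rules. The sketch below assumes both files are available as strings and uses a made-up function name, `blocked_sitemap_urls`. It relies on Python's `urllib.robotparser`, which applies the first matching rule in file order rather than the longest match, so treat its output as a quick screen and confirm any surprises in the tester.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(sitemap_xml: str, robots_txt: str, crawler: str) -> list[str]:
    """Return sitemap URLs that the given crawler is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS) if el.text]
    return [url for url in locs if not parser.can_fetch(crawler, url)]


ROBOTS = "User-agent: *\nDisallow: /admin/\nDisallow: /search/\n"
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://saas.example.com/blog/robots-testing-checklist</loc></url>
  <url><loc>https://saas.example.com/admin/billing</loc></url>
</urlset>
"""

for url in blocked_sitemap_urls(SITEMAP, ROBOTS, "Googlebot"):
    print("listed in sitemap but blocked:", url)
```

Any URL this reports is sending mixed signals: the sitemap asks for discovery while robots.txt refuses the crawl.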
Robots.txt Tester
Test a pasted draft or live root file against one crawler and one URL path.
Robots.txt Generator
Create the allow, disallow, and sitemap lines before validating the result.
XML Sitemap Generator
Generate the discoverable URL inventory that should align with crawl rules.
FAQ
How do I know if robots.txt is blocking a page?
Test the exact URL path against the crawler user-agent you care about. A robots tester should show the matched Allow or Disallow rule and whether that rule permits crawling.
Does robots.txt remove a page from Google?
No. Robots.txt controls crawling. A URL that Google already discovered can still appear in results, especially when other pages link to it.
Can I test robots.txt before I publish it?
Yes. Paste the draft file into a robots.txt tester, enter one target URL and one crawler, and inspect the matched rule before deployment.
Does robots.txt have to live at the site root?
Yes. Crawlers look for robots.txt at the root of the origin, such as https://example.com/robots.txt.
Why does Google still show a blocked URL?
The URL may have been discovered from links, sitemaps, or previous crawls. If it must leave search results, use a crawlable noindex page or the appropriate removal process.