robots.txt and sitemaps

How the robots.txt file steers crawlers and advertises where your sitemap lives.

robots.txt is a small plain-text file served from the root of your domain (for example, https://example.com/robots.txt) that tells crawlers which parts of your site they may visit. It is also the conventional place to advertise where your sitemap lives.

A minimal example

User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml

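The User-agent line names the crawler the rules apply to (* means all crawlers). Disallow blocks every URL whose path starts with the given prefix; Allow carves out an exception to a broader Disallow, so the Allow: / in this example only makes the default explicit. The Sitemap line is independent of any User-agent group, so it can appear anywhere in the file, and it points crawlers at your sitemap with an absolute URL.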
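To see how a crawler might read the file above, here is a minimal sketch using Python's standard urllib.robotparser module. The example.com URLs and the variable names are illustrative only. Note that this parser applies rules in file order (first match wins), which gives the same answers as longest-match crawlers for this particular file but may differ for others.

```python
from urllib.robotparser import RobotFileParser

# The minimal robots.txt from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a generic crawler ("*") may fetch two illustrative URLs.
can_fetch_blog = parser.can_fetch("*", "https://example.com/blog/")
can_fetch_admin = parser.can_fetch("*", "https://example.com/admin/users")

# site_maps() (Python 3.8+) returns the Sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()

print(can_fetch_blog)   # True: only the Allow: / rule matches
print(can_fetch_admin)  # False: the Disallow: /admin/ rule matches first
print(sitemaps)
```

In a real crawler you would typically call set_url() and read() to fetch the live file instead of parsing a string.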

Crawling is not indexing

A common trap: Disallow stops a crawler from fetching a page, but it does not guarantee the page stays out of the index. If other sites link to it, the URL can still appear in results without a description. To keep a page out of search entirely, let it be crawled and add a noindex directive instead (a robots meta tag or an X-Robots-Tag response header). A crawler that is blocked from the page never gets to see that directive.
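A noindex directive is usually delivered either as a robots meta tag in the HTML or as an X-Robots-Tag response header. The following is a rough sketch, not a complete implementation, of how a crawler could detect either form once it has fetched the page. The function and class names are hypothetical, headers is assumed to be a plain dict keyed exactly as "X-Robots-Tag", and crawler-specific variants (such as a meta tag named after one bot) are ignored.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(headers, html):
    """Return True if the response header or a robots meta tag says noindex."""
    header_value = headers.get("X-Robots-Tag", "")
    header_tokens = {t.strip().lower() for t in header_value.split(",")}

    meta_parser = RobotsMetaParser()
    meta_parser.feed(html)

    return "noindex" in header_tokens or "noindex" in meta_parser.directives


# Illustrative page that opts out of indexing through a meta tag.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex({}, page))
```

Both checks only work on a page the crawler was allowed to fetch, which is why combining Disallow with noindex on the same URL defeats the purpose.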

A good crawler respects robots.txt

This tool reads robots.txt before crawling and skips any path you have disallowed, so your generated sitemap only lists pages you actually want crawled.
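The idea of skipping disallowed paths before writing a sitemap can be sketched as follows. This is an illustration under stated assumptions, not this tool's actual implementation: build_sitemap, its parameters, and the sample URLs are hypothetical, and the rule matching is whatever Python's urllib.robotparser provides.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(robots_lines, candidate_urls, user_agent="*"):
    """Return sitemap XML listing only the URLs that robots.txt allows."""
    parser = RobotFileParser()
    parser.parse(robots_lines)

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in candidate_urls:
        # Skip anything the given user agent is not allowed to fetch.
        if parser.can_fetch(user_agent, url):
            url_element = ET.SubElement(urlset, "url")
            ET.SubElement(url_element, "loc").text = url

    return ET.tostring(urlset, encoding="unicode")


rules = ["User-agent: *", "Disallow: /admin/"]
pages = ["https://example.com/", "https://example.com/admin/login"]
sitemap_xml = build_sitemap(rules, pages)
print(sitemap_xml)
```

With these sample rules, only the home page ends up in the output; the /admin/login URL is filtered out before the sitemap is written.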