robots.txt is a small text file at the root of your domain that tells crawlers which parts of your site they may visit. It is also the conventional place to advertise where your sitemap lives.
A minimal example
User-agent: *
Disallow: /admin/
Allow: /
Sitemap: https://example.com/sitemap.xml
The User-agent line names the crawler the rules apply to (* means all). Disallow blocks every URL whose path starts with the given prefix; Allow makes an exception. The Sitemap line, which can appear anywhere in the file, points crawlers at your sitemap with an absolute URL.
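To see how a crawler might interpret these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The example.com URLs are placeholders, and site_maps() assumes Python 3.8 or later. Note that Python's parser applies rules in file order, while some crawlers prefer the longest matching rule; for this file both approaches give the same answer.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the minimal example (example.com is a placeholder domain).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL.
admin_allowed = parser.can_fetch("*", "https://example.com/admin/settings")
about_allowed = parser.can_fetch("*", "https://example.com/about")

# site_maps() returns the URLs from Sitemap lines, or None if there are none.
sitemaps = parser.site_maps()

print("admin allowed:", admin_allowed)
print("about allowed:", about_allowed)
print("sitemaps:", sitemaps)
```

Against a live site you would typically call set_url() and read() instead of parse(), which fetches the file over the network.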
Crawling is not indexing
A common trap: Disallow stops a crawler from fetching a page, but it does not guarantee the page stays out of the index. If other sites link to it, the URL can still appear in results without a description. To keep a page out of search entirely, let it be crawled and add a noindex directive (a robots meta tag or an X-Robots-Tag response header) instead. A crawler that is blocked from the page never gets to see that directive.
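As a rough illustration of where a noindex directive can live, the sketch below checks a fetched page for it in either the robots meta tag or the X-Robots-Tag header. The names RobotsMetaParser and is_noindex are hypothetical helpers written for this example, and the header handling is simplified: it ignores the optional per-crawler prefix that X-Robots-Tag allows.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper: collects directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def is_noindex(html, headers=None):
    """Return True if the page asks not to be indexed (simplified check)."""
    header_directives = set()
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            # Simplification: treats the value as a plain comma-separated list.
            header_directives.update(d.strip().lower() for d in value.split(","))

    meta_parser = RobotsMetaParser()
    meta_parser.feed(html)
    return "noindex" in meta_parser.directives or "noindex" in header_directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(is_noindex(page))
```

Either signal is only visible to a crawler that is allowed to fetch the page, which is why the page must not also be disallowed in robots.txt.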
A good crawler respects it
This tool reads robots.txt before crawling and skips any path you have disallowed, so your generated sitemap only lists pages you actually want crawled.
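The general idea can be sketched as follows. This is not the tool's actual implementation, only an illustration under stated assumptions: build_sitemap is a hypothetical function, the candidate URLs are made up, and the output is a bare-bones sitemap with only loc entries.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(robots_lines, candidate_urls, user_agent="*"):
    """Hypothetical helper: keep only URLs robots.txt allows, then emit sitemap XML."""
    parser = RobotFileParser()
    parser.parse(robots_lines)

    # Drop every URL the given user agent is not allowed to fetch.
    allowed = [url for url in candidate_urls if parser.can_fetch(user_agent, url)]

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in allowed:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return allowed, ET.tostring(urlset, encoding="unicode")


robots_lines = ["User-agent: *", "Disallow: /admin/", "Allow: /"]
candidates = [
    "https://example.com/",
    "https://example.com/admin/users",
    "https://example.com/blog/post-1",
]

allowed_urls, sitemap_xml = build_sitemap(robots_lines, candidates)
print(allowed_urls)
print(sitemap_xml)
```

Filtering before the sitemap is written, rather than after, means a disallowed URL never reaches the output even if it was discovered through a link.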