robots.txt and sitemaps

robots.txt is a small text file at the root of your domain that tells crawlers which parts of your site they may visit. It is also the conventional place to advertise where your sitemap lives.

A minimal example

User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml

The User-agent line names the crawler the rules apply to (* means all). Disallow blocks every URL whose path starts with the given prefix; Allow carves out an exception. The Sitemap line is independent of the User-agent groups, so it can appear anywhere in the file, and it points crawlers at your sitemap with an absolute URL.
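To see how a crawler might read the example above, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name ExampleBot and the test URLs are made up for illustration. Note that this parser applies rules in file order (first match wins), whereas major search engines pick the most specific matching rule; for this file both approaches give the same answers.

```python
from urllib.robotparser import RobotFileParser

# The minimal robots.txt from above, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under the * group.
admin_allowed = parser.can_fetch("ExampleBot", "https://example.com/admin/settings")
about_allowed = parser.can_fetch("ExampleBot", "https://example.com/about")

# site_maps() (Python 3.8+) returns the Sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()

print(admin_allowed, about_allowed, sitemaps)
```

Against a live site you would normally call set_url() and read() on the parser instead of passing the text in by hand.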

Crawling is not indexing

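A common trap: Disallow stops a crawler from fetching a page, but it does not guarantee the page stays out of the index. If other sites link to it, the URL can still appear in results without a description. To keep a page out of search entirely, let it be crawled and add a noindex directive instead, either as a robots meta tag in the HTML or as an X-Robots-Tag response header. The two do not combine: a crawler that is blocked from fetching the page never sees the noindex.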
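As a rough illustration of what a noindex check could look like, the sketch below inspects a page's HTML and response headers using only the Python standard library. The function name has_noindex and the class name are hypothetical, and the logic is simplified: it assumes headers arrive as a plain dict with the exact key X-Robots-Tag, and it ignores crawler-specific forms of the directive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html, headers=None):
    """Return True if the header or a robots meta tag contains noindex."""
    headers = headers or {}
    # Simplification: exact-case key lookup on a plain dict.
    header_value = headers.get("X-Robots-Tag", "")
    if "noindex" in [d.strip().lower() for d in header_value.split(",")]:
        return True
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in meta.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))
```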

A good crawler respects it

This tool reads robots.txt before crawling and skips any path you have disallowed, so your generated sitemap lists only the pages you actually want crawled.
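The general idea of filtering a URL list through robots.txt before building a sitemap can be sketched in a few lines of Python. This is not this tool's actual implementation, only an illustration of the approach; the function name allowed_urls and the user agent ExampleSitemapBot are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="ExampleSitemapBot"):
    """Return only the URLs that robots_txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots_txt = "User-agent: *\nDisallow: /admin/\nAllow: /\n"
candidates = [
    "https://example.com/",
    "https://example.com/admin/login",
    "https://example.com/blog/post-1",
]

# Only the URLs that survive the filter would go into the sitemap.
sitemap_urls = allowed_urls(robots_txt, candidates)
print(sitemap_urls)
```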

Keep reading

Submitting your sitemap to search engines
Where to put your sitemap and how to hand it to Google and Bing so your pages get discovered faster.

What is an XML sitemap?
How the XML sitemap protocol works, what each tag means, and why search engines rely on it to crawl your site.

Broken links and crawl errors
What 404s and dead links cost you, how crawlers find them, and how to keep your sitemap clean.

HTML sitemaps explained
The human-facing cousin of the XML sitemap: a single page that links to everything, helping both visitors and crawlers.