Robots.txt Generator - Create Your Robots Text File


Global Settings

Set your sitemap URL and default access level for all bots.

XML Sitemap URL - Optional but strongly recommended

Your sitemap helps search engines discover all your pages. Most platforms auto-generate one (e.g., /sitemap.xml or /sitemap_index.xml).

Default Access - Baseline rule before specific overrides below

"Allow All" is the standard choice for public websites. "Block All" is useful for staging sites or private environments you never want indexed.

Ruleset Builder

Create specific Allow/Disallow rules for individual bots. Add as many User-Agents as needed.

robots.txt

Quick Reference

Key terms explained for beginners.

User-Agent

The identity of the bot you are writing rules for. Use * to target all bots at once.

Allow

Explicitly grants a bot permission to crawl a specific path, even if a broader Disallow rule exists above it.

Disallow

Instructs a bot to skip a path. Use / to block your entire site. Note: this controls crawling, not indexing.

Crawl-delay

Asks a bot to wait N seconds between requests. Protects server resources. Not supported by Googlebot.

Crawl Budget

The number of pages Google will crawl on your site in a given timeframe. Blocking junk paths (like /tag/ or /?s=) conserves this budget for pages that matter.

The Complete Technical Guide to Robots.txt and Crawl Optimization

Everything a webmaster needs to know, from first principles to advanced strategy.

robots.txt

A plain text file at your domain root (e.g., yourdomain.com/robots.txt) that instructs bots which paths to crawl or skip.

User-Agent

A bot's identity string. Googlebot crawls for Google Search. GPTBot is OpenAI's training crawler.

Crawl Budget

The finite number of pages a search engine will crawl on your site. Wasting it on low-value URLs hurts your SEO.

AI Scrapers

Bots like GPTBot and CCBot that harvest content to train large language models, not to index your site. The related ChatGPT-User agent fetches pages in real time when ChatGPT browses the web for a user.

XML Sitemap

A structured map of all your important URLs, submitted to search engines to ensure full crawl coverage.

Directives

The individual instructions inside robots.txt - specifically Allow and Disallow lines that follow a User-Agent declaration.

A robots.txt file is a plain text file that follows the Robots Exclusion Protocol (RFC 9309). It must be placed at the root of your domain, accessible at the exact URL https://yourdomain.com/robots.txt. There can only be one robots.txt per hostname - you cannot have separate files for subfolders, and each subdomain (e.g., blog.yourdomain.com) needs its own file.

When a well-behaved crawler (like Googlebot or Bingbot) visits your site for the first time, the very first thing it does is fetch this file. It reads the instructions and decides which areas of your site it is permitted to access before it crawls a single page. If no robots.txt file is found, the bot assumes it is allowed to crawl the entire site.

The file uses a simple field: value syntax. Each block starts with a User-agent: line identifying which bot the rules apply to, followed by one or more Disallow: or Allow: lines. Blocks are separated by blank lines. The entire file should be encoded in UTF-8 and served with a text/plain content type. On many platforms the file can be managed from the admin interface without touching the server (for example, WordPress with an SEO plugin, or Shopify through its robots.txt.liquid theme template), although some hosted site builders expose only limited crawl settings rather than the file itself.
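The block structure described above can be sketched in a few lines of Python. This is only an illustration of the syntax, not how this generator is implemented; the function name render_robots_txt and the shape of its groups argument are assumptions made for the example.

```python
def render_robots_txt(groups, sitemap_url=None):
    """Render rule groups into robots.txt text.

    groups is a list of (user_agent, rules) pairs, where rules is a list
    of (directive, path) pairs. Both names are hypothetical, chosen for
    this sketch only.
    """
    lines = []
    for user_agent, rules in groups:
        # Every block opens with the bot it applies to.
        lines.append(f"User-agent: {user_agent}")
        for directive, path in rules:
            if directive not in ("Allow", "Disallow"):
                raise ValueError(f"unsupported directive: {directive}")
            # An empty path yields a bare "Disallow:", which allows everything.
            lines.append(f"{directive}: {path}".rstrip())
        # A blank line separates one block from the next.
        lines.append("")
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip("\n") + "\n"


example = render_robots_txt(
    [
        ("*", [("Disallow", "")]),
        ("GPTBot", [("Disallow", "/")]),
    ],
    sitemap_url="https://yourdomain.com/sitemap.xml",
)
print(example)
```

Save the resulting text as UTF-8 and serve it as text/plain from the domain root.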

The difference between Disallow and noindex is one of the most misunderstood distinctions in technical SEO, and confusing the two can have serious consequences for your rankings. They operate at completely different levels of the crawl-index process.

Disallow (in robots.txt) controls crawling. It tells the bot: "Do not visit this URL at all." If a page is disallowed, the crawler will not fetch it, which means Google cannot read its content or discover new links on that page. However, Google can still know the page exists if another page links to it, and it may still appear in search results as an incomplete listing (URL only, no description). Disallowing a page does not guarantee it will not appear in Google's index.

Noindex (a robots meta tag or an X-Robots-Tag HTTP header) controls indexing. It tells Google: "You can visit this page, but do not include it in your search results." For Google to read a noindex directive, it must be able to crawl the page. This is the critical paradox: if you block a page with Disallow: AND add a noindex tag, Google cannot crawl the page to find the noindex tag, so the tag never takes effect and the URL can stay in the index. The correct approach is to use noindex alone for pages you want removed from search results, and to use Disallow only for pages where the goal is saving crawl budget - like admin dashboards or internal search result pages.
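The conflict described above can be detected mechanically. The sketch below uses Python's standard urllib.robotparser module to test whether a crawler may fetch a URL, then flags the case where a page carries noindex but is also disallowed. The function name check_noindex_conflict and its return strings are assumptions for illustration, and you would supply has_noindex yourself after inspecting the page.

```python
from urllib.robotparser import RobotFileParser


def check_noindex_conflict(robots_txt, url, has_noindex, user_agent="Googlebot"):
    """Report whether a noindex directive on `url` can actually be read.

    robots_txt is the file content as a string; has_noindex says whether
    the page carries a noindex meta tag or header (hypothetical input).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)

    if has_noindex and not crawlable:
        # The crawler never fetches the page, so it never sees noindex.
        return "conflict"
    if has_noindex:
        return "noindex readable"
    return "no noindex"


robots = "User-agent: *\nDisallow: /private/\n"
print(check_noindex_conflict(robots, "https://yourdomain.com/private/page.html", True))
print(check_noindex_conflict(robots, "https://yourdomain.com/blog/post", True))
```

A "conflict" result means one of the two signals should be dropped: remove the Disallow rule if the goal is to get the page out of search results.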

A new class of crawler has emerged that is fundamentally different from search engine bots. AI scrapers like OpenAI's GPTBot, Anthropic's ClaudeBot, and Common Crawl's CCBot do not index your site for search results - they harvest your content to train large language models (LLMs). A related agent, ChatGPT-User, fetches pages in real time when ChatGPT browses the web for a user. Your writing, product descriptions, and expertise become training data for AI systems that may then reproduce similar content in response to other users' queries, potentially in direct competition with you.

Unlike search engine traffic, allowing AI training scrapers brings no SEO benefit. They consume your server bandwidth and resources without sending you any visitors or rankings in return. Blocking them is a reasonable decision for most content publishers, especially writers, journalists, educators, and businesses whose primary competitive advantage is original content.

The key AI scraper User-Agent strings to block are: GPTBot (OpenAI's primary training crawler), ChatGPT-User (used when ChatGPT browses the web in real-time), CCBot (Common Crawl, a major data source for many LLMs), Google-Extended (Google's opt-out token for AI training use, covering Gemini, formerly Bard, and Vertex AI; it is a control token rather than a separate crawler), anthropic-ai and ClaudeBot (Anthropic), and cohere-ai. Note that compliance with robots.txt is voluntary - these companies have publicly committed to honoring it, but bad actors may not.
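As a rough sketch, the blocks for these agents can be generated from a list instead of typed by hand. The token list below is copied from the paragraph above; the names AI_USER_AGENTS and ai_block_rules are made up for this example.

```python
# User-agent tokens taken from the list above; extend or trim as needed.
AI_USER_AGENTS = [
    "GPTBot",
    "ChatGPT-User",
    "CCBot",
    "Google-Extended",
    "anthropic-ai",
    "ClaudeBot",
    "cohere-ai",
]


def ai_block_rules(user_agents=AI_USER_AGENTS):
    """Return robots.txt text with one full-site Disallow block per agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    # Blank lines between blocks keep each agent's rules separate.
    return "\n\n".join(blocks) + "\n"


print(ai_block_rules())
```

Append the output to an existing robots.txt, leaving a blank line before the first generated block.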

Crawl Budget is the number of URLs Googlebot will crawl on your site within a given time period. It is determined by two factors: crawl rate limit (how fast Googlebot can crawl without overloading your server) and crawl demand (how popular and fresh your content is). For small sites under a few hundred pages, crawl budget is rarely a concern. For large sites with thousands of pages, it becomes a critical optimization lever.

When Googlebot has a limited budget, you want every crawled URL to count. URLs that waste budget include: faceted navigation URLs (e.g., /products?color=red&size=M generating millions of combinations), internal search result pages (e.g., /?s=keyword), duplicate content via URL parameters (e.g., /page/?ref=email), session ID URLs, tag and archive pages on large blogs, and pagination beyond a reasonable depth.

Use Disallow: in robots.txt to block these patterns. For WordPress sites, common disallow patterns include /wp-admin/, /?s=, /tag/, /author/, and /feed/. For e-commerce sites on platforms like Magento or WooCommerce, blocking filter/sort parameter pages is essential. You can verify your crawl budget usage in Google Search Console under Settings - Crawl Stats.
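Before writing Disallow rules, it can help to measure how many crawled URLs fall into these low-value categories. The sketch below classifies URLs (for example, taken from a server log or crawl export) by path prefix and query parameter. The prefixes and parameter names come from the examples in this section; classify_url, summarize, and the two constants are hypothetical names, and a real site would need its own lists.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Path prefixes and query parameters drawn from the examples in this guide.
LOW_VALUE_PREFIXES = ("/wp-admin/", "/tag/", "/author/", "/feed/")
LOW_VALUE_PARAMS = {"s", "ref", "color", "size"}


def classify_url(url):
    """Label a URL as a low-value path, a low-value parameter, or 'keep'."""
    parts = urlsplit(url)
    for prefix in LOW_VALUE_PREFIXES:
        if parts.path.startswith(prefix):
            return f"path:{prefix}"
    params = parse_qs(parts.query, keep_blank_values=True)
    # Sorting makes the label deterministic when several parameters match.
    for name in sorted(params):
        if name in LOW_VALUE_PARAMS:
            return f"param:{name}"
    return "keep"


def summarize(urls):
    """Count how many URLs fall into each category."""
    return Counter(classify_url(url) for url in urls)


crawled = [
    "https://yourdomain.com/blog/hello-world/",
    "https://yourdomain.com/?s=keyword",
    "https://yourdomain.com/tag/python/",
    "https://yourdomain.com/products?color=red&size=M",
]
for label, count in summarize(crawled).items():
    print(label, count)
```

Categories with large counts are the first candidates for a Disallow rule.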

There are several methods to validate your robots.txt file, ranging from Google's official tools to manual inspection. The primary Google tool is the robots.txt report in Google Search Console, found under Settings; it replaced the older standalone robots.txt Tester. The report shows the version of your file that Google last fetched, along with any warnings or errors, and lets you request a fresh fetch after recent changes. To check whether a specific URL is allowed or blocked, use the URL Inspection tool.

For a quick manual check, simply navigate to https://yourdomain.com/robots.txt in your browser. You should see plain text - if you get a 404 error, the file is missing. If you get a 403 or redirect, your server configuration needs attention. The file should load instantly; slow load times can prevent bots from reading it.

For advanced testing, use tools like Screaming Frog SEO Spider (it has a built-in robots.txt checker) or Python's built-in urllib.robotparser module. After making changes, submit your sitemap URL through Google Search Console to prompt a fresh crawl. Watch the Page indexing report (formerly Coverage) over the following days to confirm that newly blocked paths are no longer being crawled, and that no important pages have been accidentally blocked.
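A minimal check with Python's standard urllib.robotparser module might look like the sketch below. The sample rules and the helper name is_allowed are assumptions for illustration. One caveat: as far as I understand, this module applies the first matching rule in file order, while RFC 9309 and Google use the most specific (longest) match, so results can differ when Allow and Disallow rules overlap.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for the demonstration; not a recommended configuration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /tag/

User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


checks = [
    ("Googlebot", "https://yourdomain.com/blog/post/"),
    ("Googlebot", "https://yourdomain.com/tag/news/"),
    ("GPTBot", "https://yourdomain.com/blog/post/"),
]
for agent, url in checks:
    print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

To test a live file instead of a string, call set_url() with the robots.txt address and then read() on the parser before can_fetch().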

Example - Optimized robots.txt for a WordPress Blog or E-commerce Site

# =============================================
# robots.txt - Example optimized configuration
# Place this file at: yourdomain.com/robots.txt
# =============================================

# --- Standard Search Engine Bots ---
User-agent: *
# Allow the entire site by default (empty Disallow = allow all)
Disallow:

User-agent: Googlebot
# Block backend, search, and low-value paths
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /tag/
Disallow: /author/
Disallow: /feed/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/

# --- Block AI Training Scrapers ---
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# --- Sitemap Location ---
Sitemap: https://yourdomain.com/sitemap_index.xml

Disclaimer: This tool generates standard directives. Always test your final robots.txt file using Google Search Console to ensure you are not accidentally blocking critical pages from being crawled.