AI Robots.txt Guide: Managing All AI Crawlers
Learn about the different AI crawlers, their user agents, and how to configure your robots.txt file to control access for each company.
What is robots.txt?
A robots.txt file is a standard used by websites to communicate with web crawlers and other automated clients. It specifies which parts of the site should not be processed or scanned.
With the rise of AI systems and large language models (LLMs), the robots.txt file has gained new importance. It's now a key mechanism for controlling how AI systems interact with your website content.
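At its simplest, a robots.txt file pairs a User-agent line (which crawler the rules apply to) with Allow and Disallow lines (which paths it may fetch). A minimal sketch, using placeholder paths, looks like this:

# Rules for every crawler
User-agent: *
Disallow: /private/

# Rules for one specific crawler
User-agent: GPTBot
Disallow: /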
Important: Our tool always shows AI chatbots as allowed, regardless of robots.txt directives, to help you understand how AI systems might interact with your content.
What AI Crawlers Exist by Company?
OpenAI Crawlers
OpenAI uses multiple crawlers for different purposes, from training its AI models to providing search functionality within ChatGPT.
GPTBot
Used for training GPT models on web content.
User-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot
ChatGPT-User
Used when ChatGPT users browse the web during conversations.
User-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
OAI-SearchBot
Used to surface and link to websites in ChatGPT's search results.
User-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
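For instance, if you want your site to remain visible in ChatGPT's search features while opting out of model training, a sketch based on the tokens above would be:

# Allow ChatGPT search and user browsing, block model training
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /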
Anthropic Crawlers
Anthropic uses crawlers to support Claude AI, its conversational AI assistant.
Anthropic AI
Used for training and improving Claude AI models.
User-agent: Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html)
ClaudeBot
Used for Claude's web browsing capabilities.
User-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
Claude Web
Used for Claude's web interface interactions.
User-agent: Mozilla/5.0 (compatible; claude-web/1.0; +http://www.anthropic.com/bot.html)
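Following the descriptions above, a comparable sketch for Anthropic that keeps Claude's browsing working while opting out of training might be:

# Block training-related access, allow Claude's browsing
User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /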
Google AI Crawlers
Google manages AI-related use of crawled content separately from its traditional search engine crawlers.
Google Extended
A robots.txt control token rather than a separate crawler: it governs whether your content may be used to train and ground Google's Gemini AI models, while the crawling itself is done by Google's existing crawlers.
User-agent token: Google-Extended
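Because Google-Extended only governs AI use of your content, disallowing it does not remove your pages from Google Search. Opting out of Gemini training is a single rule group:

# Opt out of Gemini AI training and grounding; Google Search indexing is unaffected
User-agent: Google-Extended
Disallow: /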
Perplexity Crawlers
Perplexity AI uses crawlers to power its AI search capabilities.
PerplexityBot
Used for Perplexity's AI search capabilities.
User-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
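If you would rather not appear as a source in Perplexity's AI search results, a minimal rule is:

# Block Perplexity's search crawler
User-agent: PerplexityBot
Disallow: /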
xAI Crawlers
xAI, Elon Musk's AI company, uses crawlers to support its Grok AI assistant.
GrokBot
Used for training Grok AI.
User-agent: GrokBot/1.0 (+https://x.ai)
Grok Search
Used for Grok's search capabilities.
User-agent: xAI-Grok/1.0 (+https://grok.com)
Grok DeepSearch
Used for Grok's advanced search capabilities.
User-agent: Grok-DeepSearch/1.0 (+https://x.ai)
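Assuming the tokens listed above are the ones these crawlers honor in robots.txt, blocking all three at once looks like:

# Block xAI's Grok crawlers
User-agent: GrokBot
Disallow: /

User-agent: xAI-Grok
Disallow: /

User-agent: Grok-DeepSearch
Disallow: /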
Other Major AI Crawlers
Many other AI companies and search engines have their own crawlers for AI-related functionality.
Apple (Siri & Apple Intelligence)
User-agent: Mozilla/5.0 (compatible; Applebot/1.0; +http://www.apple.com/bot.html)
User-agent: Mozilla/5.0 (compatible; Applebot-Extended/1.0; +http://www.apple.com/bot.html)
Meta (Facebook)
User-agent: Mozilla/5.0 (compatible; meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler))
Cohere
User-agent: Mozilla/5.0 (compatible; cohere-ai/1.0; +http://www.cohere.ai/bot.html)
You.com
User-agent: Mozilla/5.0 (compatible; YouBot (+http://www.you.com))
DuckDuckGo
User-agent: Mozilla/5.0 (compatible; DuckAssistBot/1.0; +http://www.duckduckgo.com/bot.html)
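To opt out of all of these at once while leaving ordinary search crawlers untouched, you could list each token with its own Disallow rule:

# Block the other AI crawlers listed above
User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: DuckAssistBot
Disallow: /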
How to Configure Your robots.txt File
Your robots.txt file should be placed at the root of your website (e.g., https://example.com/robots.txt). Here are some examples of how to configure your robots.txt file for AI crawlers:
Example 1: Allow all AI crawlers
# Allow everything for every crawler
User-agent: *
Allow: /

# This explicitly allows all AI crawlers, though it's redundant with the wildcard above
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

# ... and so on for other crawlers
Example 2: Block all AI crawlers
# Allow regular crawlers
User-agent: *
Allow: /

# Block specific AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: GrokBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# ... and so on for other AI crawlers
Example 3: Selective access for different AI companies
# Allow search bots but block training bots
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /
Example 4: Allow access to only specific directories
# Only allow AI crawlers to access the public and blog sections
User-agent: GPTBot
Allow: /public/
Allow: /blog/
Disallow: /

User-agent: ClaudeBot
Allow: /public/
Allow: /blog/
Disallow: /
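Example 5: Add a sitemap reference for allowed crawlers
Any of the patterns above can be combined with the standard Sitemap directive so the crawlers you do allow can discover your content efficiently (the sitemap URL below is a placeholder):

User-agent: GPTBot
Allow: /

User-agent: anthropic-ai
Disallow: /

Sitemap: https://example.com/sitemap.xml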
What is the Impact of Allowing or Blocking AI Crawlers?
Benefits of Allowing
- Improved visibility in AI-powered search results
- Potential for your content to be used as a source in AI responses
- Contribution to AI training may lead to better models
- Stay relevant in an increasingly AI-driven web landscape
Reasons to Block
- Protect proprietary or sensitive information
- Prevent AI systems from learning from or reproducing your unique content
- Reduce server load from crawler traffic
- Exercise control over how your content is used in AI systems
What are the Best Practices for Managing AI Crawlers?
- Regularly review and update your robots.txt file as new AI crawlers emerge
- Be specific about which parts of your site should be accessible to which crawlers
- Consider having different policies for AI training bots versus search bots
- Monitor your server logs to see which AI crawlers are visiting your site
- Use our robots checker tool to verify your robots.txt configuration
- Stay informed about changes to AI crawler policies
Check Your robots.txt Configuration
Want to see how your website's robots.txt file interacts with AI crawlers? Use our Robots Checker tool to:
- Analyze your robots.txt file for AI crawler configurations
- See which AI crawlers are allowed or blocked on your site
- Get recommendations for improving your robots.txt configuration
- Monitor changes in AI crawler policies and their impact on your site
FAQ
How do I block AI crawlers like GPTBot and Claude in robots.txt?
To block AI crawlers, add a user-agent group for each bot in your robots.txt file, for example User-agent: GPTBot followed by Disallow: / for OpenAI, and User-agent: Claude-Web followed by Disallow: / for Anthropic. This asks these AI systems not to crawl your content or use it for training.
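Written out as a file, those directives look like this:

User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /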
Does robots.txt completely prevent AI models from accessing my content?
No, robots.txt is a voluntary protocol that ethical AI companies respect, but it cannot guarantee complete protection. Some AI systems might access your content through other means. Additional measures like authentication and terms of service are recommended for sensitive content.
What are the main AI crawler user agents I should know about?
The major AI crawler user agents include GPTBot (OpenAI), ClaudeBot and Claude-Web (Anthropic), CCBot (Common Crawl), PerplexityBot (Perplexity), and the Google-Extended token (Google). Each represents a different company collecting web data for training models or powering AI features. Identifying these agents is crucial for controlling AI access.
