The Complete List of AI Crawlers in 2026

Why you need this list

AI crawlers are web bots operated by AI platforms to collect content for training data, live search retrieval, and grounding responses. If your robots.txt does not explicitly address them, their access depends entirely on your wildcard rules. For most sites that means they are either blocked by accident or allowed without any deliberate decision.
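
The wildcard fallback can be checked with a few lines of Python. This is a minimal sketch using the standard library's urllib.robotparser; the robots.txt content and the example.com URLs are made-up placeholders, not rules from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with only a wildcard group and no AI crawler rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot has no group of its own, so it inherits the wildcard rules.
gptbot_blog = parser.can_fetch("GPTBot", "https://example.com/blog/post")
gptbot_private = parser.can_fetch("GPTBot", "https://example.com/private/report")

print(gptbot_blog)     # allowed, because * does not disallow /blog/
print(gptbot_private)  # blocked, because * disallows /private/
```

Nothing in this file mentions GPTBot, yet its access is fully determined by the wildcard group. That is the "no deliberate decision" case described above.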

This page lists the AI crawler user-agent tokens that are publicly documented and in common use. Use it as a reference when auditing or writing your robots.txt.

OpenAI crawlers

GPTBot: OpenAI's training crawler. Collects content that may be used to train OpenAI's models. Live browsing in ChatGPT uses ChatGPT-User instead.
OAI-SearchBot: used by OpenAI for building its search index. Separate from GPTBot and active alongside it.
ChatGPT-User: used when a ChatGPT user triggers a live web browse during a conversation.

Anthropic crawlers

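ClaudeBot: primary crawler for Claude (Anthropic). Collects public web content that may be used as training data.
Claude-Web: older Anthropic token that still appears in many robots.txt files. Anthropic's current documentation instead lists Claude-User, for fetches triggered by a user during a conversation, and Claude-SearchBot, for search indexing.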

Google AI crawlers

Google-Extended: a robots.txt control token rather than a separate crawler. It governs whether content fetched by Google's existing crawlers may be used for Gemini training and grounding. Blocking it does not affect organic Google search.
Googlebot: primary search crawler. Also used to power Google AI Overviews. Blocking it removes you from Google Search entirely.
Gemini-Extended: appears in some third-party lists but is not among the user agents Google documents. Google-Extended is the documented token for controlling Gemini use.

Google-Extended and Googlebot are separate robots.txt tokens. You can block Google-Extended (to opt out of Gemini training and grounding) while keeping Googlebot allowed (to stay in organic search results).
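
As an illustration, the split between the two tokens can be sketched and checked in Python. The rules and the example.com URL below are hypothetical, and urllib.robotparser only evaluates what the file says for each token; it does not model how Google applies Google-Extended internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: opt out of Gemini use, stay in organic search.
GOOGLE_RULES = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Disallow:
"""

google_parser = RobotFileParser()
google_parser.parse(GOOGLE_RULES.splitlines())

page = "https://example.com/guide"
extended_allowed = google_parser.can_fetch("Google-Extended", page)
googlebot_allowed = google_parser.can_fetch("Googlebot", page)
print(extended_allowed, googlebot_allowed)
```

The two groups are evaluated independently, so the Disallow: / under Google-Extended has no effect on the result for Googlebot.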

Perplexity crawlers

PerplexityBot: primary crawler for Perplexity AI. Used for live search results and answer generation.

Meta and Apple crawlers

meta-externalagent: used by Meta AI for web access. Powers AI features across Facebook, Instagram, and the standalone Meta AI product.
Applebot: primary Apple crawler. Used for Siri, Spotlight, and Safari suggestions. Predates the current AI wave but now feeds Apple Intelligence.
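Applebot-Extended: a control token, not a separate crawler. It tells Apple whether content already crawled by Applebot may be used to train Apple's AI models. Blocking it does not stop Applebot from crawling.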

Other active AI crawlers

Bytespider: operated by ByteDance. Used for AI features across TikTok, Doubao, and other ByteDance products.
cohere-ai: used by Cohere for training data and grounding responses.
YouBot: used by You.com, an AI-native search engine.
Diffbot: commercial web data extraction platform used by various AI applications.
CCBot: Common Crawl bot. A non-profit web archive heavily used as training data by many AI models.
peer39_crawler: used by Peer39 for AI-based contextual advertising.
omgili / omgilibot: used by Webz.io for AI data collection products.
Amazonbot: used by Amazon for Alexa and other AI services.
Timpibot: used by Timpi, a distributed AI search index.
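
The tokens listed above can be checked against a robots.txt file programmatically. The following is a rough sketch, not a complete auditing tool: the function names named_agents and audit, the sample file, and the probe URL are all illustrative assumptions, and only a handful of tokens from the list are included.

```python
from urllib.robotparser import RobotFileParser

# A few tokens taken from the list above; extend as needed.
AI_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]


def named_agents(robots_txt):
    """Return the lowercased User-agent values that appear in the file."""
    names = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            names.add(value.strip().lower())
    return names


def audit(robots_txt, tokens, probe_url="https://example.com/"):
    """Classify each token as allowed, blocked, or undefined for probe_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    named = named_agents(robots_txt)
    report = {}
    for token in tokens:
        if token.lower() not in named:
            report[token] = "undefined"  # only wildcard rules (if any) apply
        elif parser.can_fetch(token, probe_url):
            report[token] = "allowed"
        else:
            report[token] = "blocked"
    return report


# Hypothetical robots.txt used only to exercise the function.
SAMPLE = """\
User-agent: GPTBot
Disallow:

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /admin/
"""

audit_report = audit(SAMPLE, AI_TOKENS)
for token, status in audit_report.items():
    print(f"{token}: {status}")
```

Here "undefined" means the file never names the token, so the crawler is governed by whatever the wildcard group says. The check probes a single URL, so a site with path-specific rules would need more than one probe.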

How to use this list in your robots.txt

For each crawler you want to allow, add a User-agent entry with an empty Disallow. For each one you want to block, use Disallow: /. You do not need to address every crawler on this list. Start with the platforms relevant to your content strategy and audience.

# Allow OpenAI crawlers
User-agent: GPTBot
Disallow:

User-agent: OAI-SearchBot
Disallow:

# Allow Claude
User-agent: ClaudeBot
Disallow:

# Allow Perplexity
User-agent: PerplexityBot
Disallow:

# Allow Google AI (separate from Googlebot)
User-agent: Google-Extended
Disallow:

# Block ByteDance
User-agent: Bytespider
Disallow: /
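
If you maintain the allow and block lists elsewhere, rules in the same shape as the example above can be generated rather than typed by hand. This is a small sketch under that assumption; build_rules is a hypothetical helper, and the final lines parse the output back with Python's urllib.robotparser as a sanity check.

```python
from urllib.robotparser import RobotFileParser


def build_rules(allow, block):
    """Render one robots.txt group per token: empty Disallow allows, '/' blocks."""
    groups = [f"User-agent: {token}\nDisallow:" for token in allow]
    groups += [f"User-agent: {token}\nDisallow: /" for token in block]
    return "\n\n".join(groups) + "\n"


rules_text = build_rules(
    allow=["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"],
    block=["Bytespider"],
)
print(rules_text)

# Sanity check: parse the generated text back and confirm the intent.
check = RobotFileParser()
check.parse(rules_text.splitlines())
claudebot_ok = check.can_fetch("ClaudeBot", "https://example.com/")
bytespider_ok = check.can_fetch("Bytespider", "https://example.com/")
```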

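Placing explicit AI crawler rules before any wildcard User-agent: * rule keeps the file easy to audit, but order does not decide the outcome. Under RFC 9309, a crawler obeys the group whose User-agent line names it and falls back to the User-agent: * group only when no such group exists, wherever the groups sit in the file. Within a group, the most specific (longest) matching path rule applies.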
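
How a parser resolves a crawler-specific group against the wildcard group can be tested directly. This is a minimal sketch, assuming Python's urllib.robotparser as the parser; it looks for a group naming the crawler before falling back to *, even when * comes first in the file. Other parsers may behave differently, so treat this as one data point. SomeOtherBot is a made-up name standing in for any crawler without its own group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with the wildcard group placed FIRST.
WILDCARD_FIRST = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow:
"""

order_parser = RobotFileParser()
order_parser.parse(WILDCARD_FIRST.splitlines())

url = "https://example.com/article"
gptbot_result = order_parser.can_fetch("GPTBot", url)       # uses the GPTBot group
other_result = order_parser.can_fetch("SomeOtherBot", url)  # falls back to *
print(gptbot_result, other_result)
```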

The SEOFliq AEO and GEO Suite extension checks your robots.txt against all known AI crawlers automatically. It shows a full access report for each crawler: allowed, blocked, or undefined. Open it on any page of your site. No account or login required.