robots.txt for blocking known AI bots/crawlers
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: cohere-ai
Disallow: /
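As a sanity check that these rules actually disallow the listed crawlers, Python's standard-library urllib.robotparser can parse the file and answer can-fetch queries. A minimal sketch, using one record from the file above and a placeholder example.com URL:

from urllib.robotparser import RobotFileParser

# parse() takes the file's lines directly, so no network fetch is needed.
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/any/page"))     # False: blocked
print(rp.can_fetch("Googlebot", "https://example.com/any/page"))  # True: no rule names it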
There are limitations to this. The most obvious is that many bots simply won't respect your robots.txt; the ones from major companies reportedly do. Think of it as a polite request rather than a true block (see the server-side sketch below).
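For bots that ignore robots.txt but still send an honest User-Agent header, the block can be enforced at the server instead. Below is a minimal sketch of that idea as Python WSGI middleware; the BLOCKED_AGENTS tuple is an illustrative subset of the list above, and in practice this is usually done in the web server or CDN configuration rather than in application code:

# Hypothetical deny-list: substring-matched, case-insensitively,
# against the incoming User-Agent header.
BLOCKED_AGENTS = ("gptbot", "ccbot", "claudebot", "bytespider", "perplexitybot")

def block_ai_bots(app):
    """WSGI middleware returning 403 Forbidden for deny-listed user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot in ua for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)  # pass clean requests through
    return middleware

Like robots.txt itself, this only catches bots that identify themselves; a scraper spoofing a browser User-Agent will pass straight through.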
Also, Microsoft and Google have been shown to effectively bypass the need for direct crawling by their AI bots by training their models on data already collected and cached from your site by the Bing and Google search engines. This loophole unfortunately makes it impossible to opt out of those AIs without also removing your site from the major search engines.
Cloudflare has tools for blocking AI scrapers, though you'll need to subscribe to their services.
Anubis is a more lightweight option, used by sites like the Archlinux Wiki and kernel.org.