robots.txt for blocking known AI bots/crawlers
Gist by edmondburnett, last active July 13, 2024
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: cohere-ai
Disallow: /
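To sanity-check that the rules behave as intended, you can parse them locally with Python's standard-library `urllib.robotparser` (no network access needed). This is a minimal sketch: it inlines two of the entries above as a string and confirms that a listed bot is blocked from every path while an unlisted one is not.

```python
import urllib.robotparser

# A subset of the rules above, inlined for a local check.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Listed crawlers are disallowed from the entire site.
print(parser.can_fetch("GPTBot", "https://example.com/"))     # False
print(parser.can_fetch("ClaudeBot", "https://example.com/any/path"))  # False

# A crawler with no matching entry (and no "User-agent: *" fallback)
# is allowed by default.
print(parser.can_fetch("Googlebot", "https://example.com/"))  # True
```

Note that because there is no `User-agent: *` group, any crawler not explicitly named remains allowed; add one with its own `Disallow` rules if you want a default policy as well.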
There are limitations to this approach. The most obvious is that many bots will simply not respect or honor your robots.txt.
Also, Microsoft and Google have been shown to essentially bypass the need for direct crawling by their AI bots, training their models instead on data already collected and cached from your site by the Bing and Google search engines. This loophole unfortunately makes it impossible to opt out of AI training without also removing your site from the major search engines.