Anyone got a link to a robots.txt that “blocks” all the “AI” stuff?
… or maybe I should do this based on allowlisting rather than blocklisting. 🤔 Only allow a couple of bots that I think are fine …
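Something like this, maybe (just a sketch; Googlebot and Bingbot are stand-ins for whatever bots I’d actually allow):

User-Agent: Googlebot
Disallow:

User-Agent: Bingbot
Disallow:

User-Agent: *
Disallow: /

(An empty Disallow means “allow everything”, so only the named bots get in and everyone else is blocked.)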
@movq@www.uninformativ.de Only found 3 results for “robots.txt” and OpenAI 😢 I seem to recall an effort (which I can’t find anymore) to build a standard for AI crawlers similar to robots.txt
@prologic@twtxt.net Ahhh, right, now I remember. That ai.txt
boils down to this, I guess:
User-Agent: *
Disallow: /
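Or, if you only want to shut out the AI crawlers rather than everything (GPTBot is the only one confirmed in this thread; the other names are just commonly cited examples):

User-Agent: GPTBot
Disallow: /

User-Agent: CCBot
Disallow: /

User-Agent: ClaudeBot
Disallow: /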
@movq@www.uninformativ.de I have this one as per some article I read a while ago… But just like robots.txt, I don’t think you have any guarantee that it will be honored; you might even have a better chance hunting down and blocking user agents.
@aelaraji@aelaraji.com Yeah, there is no guarantee with any of these things; it can all be faked or ignored. 🫤 I’m still going to do it in the hopes that some of those bots respect it.
@movq@www.uninformativ.de It looks like this one actually reads the robots.txt … it has done so a couple of times over the past few weeks.
“GET /robots.txt HTTP/1.1” 304 0 “-” “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)”
Hey @movq@www.uninformativ.de !! Here’s an article you might find interesting: Blocking Bots with Nginx … this person is actually blocking AI bots based on a list of user agents in an interesting way. 👍
@aelaraji@aelaraji.com Hmmm, looks like the core idea is to intercept requests, inspect the User-Agent header, and respond accordingly.
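Something like this, I suppose (a minimal sketch of the idea, not the article’s actual config; GPTBot is the one from the logs, the other names are assumed examples):

# In the http{} block: flag any request whose User-Agent matches a known AI crawler.
map $http_user_agent $is_ai_bot {
    default      0;
    ~*GPTBot     1;   # seen in the access log above
    ~*CCBot      1;   # assumed example; extend with whatever list you trust
    ~*ClaudeBot  1;
}

server {
    # ... usual listen/server_name/root ...

    # Refuse flagged bots before serving anything.
    if ($is_ai_bot) {
        return 403;
    }
}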
Can we trust the bots not to fake their identity? 🤔
@aelaraji@aelaraji.com @prologic@twtxt.net Hmm, yeah, looks a bit better than ai.txt / robots.txt, but I wouldn’t trust that they don’t spoof their user agent. 🤔
@movq@www.uninformativ.de me neither 🤦‍♂️