Yarn

Recent twts in reply to #fucv4ya

Those of you who have your own sites might want to give this a quick look: https://spawning.ai/ai-txt

It’s just a text file, similar to robots.txt, but for AI crawlers rather than search engine ones. Probably not very effective as of now, but at least it’s a way to make it clear you don’t consent to your site being used for AI training, without making it suck for human users in the process.

@mckinley@twtxt.net @prologic@twtxt.net Yes, I agree the website itself sucks and the company behind it is incompetent at best - even more so, with their other websites.

Their first site (haveibeentrained.com) offered a way to search through all the training datasets, without realizing they were full of illegal porn, so it was quickly shut down.

Now their main gimmick is a browser extension that lets you see which data on any site you visit was used for AI training and what has already been marked as “opted out”, and gives you a way to add your stuff to that list.

I don’t like that idea either. Adding URLs to a list should not require questionable browser extensions, and in general, opting out all the places that might have your images doesn’t seem worth the time if the companies don’t even have to respect the request.

If you just want the txt file without the additional nonsense, feel free to take the default one that I use here: https://thecanine.ueuo.com/ai.txt and use or edit it to match your needs.

@thecanine@twtxt.net @prologic@twtxt.net @eldersnake@we.loveprivacy.club @mckinley@twtxt.net This page is just a terrible joke. Great writeup, mckinley! Exactly my thoughts, but you forgot to mention that you see zero contents unless you scroll a full page down. Boy do I hate this. Luckily, I did not watch this stupid video.

Why does this generator add tons of *.ext rules when it also has a simple * to catch them all? I’m not a robots.txt expert, but that feels redundant. If I do not have an ai.txt, is their crawler consulting my robots.txt? I could not find an answer to that – in my opinion – obvious question. I don’t want any bots on my site.
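
To illustrate the redundancy, here is a minimal sketch. It assumes the rules behave like shell-style wildcards, which is a guess and not something the ai.txt page documents; the rule list and paths are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical rule list in the style the generator emits:
# several per-extension patterns plus a bare catch-all.
rules = ["*.jpg", "*.png", "*.txt", "*"]

def matching_rules(path, rules):
    """Return every rule whose pattern matches the given path."""
    return [rule for rule in rules if fnmatchcase(path, rule)]

for path in ["/img/cat.jpg", "/notes.txt", "/index.html"]:
    print(path, "->", matching_rules(path, rules))
```

Under that assumption, every path that hits an extension rule also hits *, so the extension rules never change the outcome.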

@mckinley@twtxt.net Haha, right. They might have figured that everybody is just using * anyway. :-D Evidence from logs suggests “Spawning-AI”.

Yup, @thecanine@twtxt.net, I thought so, too. Reminds me a bit of Google using the least restrictive robots.txt rule when in doubt (at least you could argue for improved searchability; but it smells a bit fishy).

In the logs I see these three 404s in a row from someone claiming to be their bot:

  • /.well-known/tdmrep.json
  • /ai.txt?t=1704481081.54321
  • /.well-known/ai.txt?t=1704481081.54321
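
If you want to check what your own site answers for these, a small sketch like this builds the same three URLs. The order is only what shows up in my logs, not documented crawler behaviour, and I left out the ?t=... query string since I can only guess that it is a timestamp; https://example.org is a placeholder.

```python
from urllib.parse import urljoin

# Paths in the order they appeared in the logs (observed, not documented).
PROBE_PATHS = [
    "/.well-known/tdmrep.json",
    "/ai.txt",
    "/.well-known/ai.txt",
]

def probe_urls(base_url):
    """Build the full URLs a site owner could request to test their own site."""
    return [urljoin(base_url, path) for path in PROBE_PATHS]

for url in probe_urls("https://example.org"):
    print(url)
```

Fetching each of these with curl or a browser shows whether your server returns a file or a 404 like mine did.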

I never heard of TDM Reservation Protocol before:

This specification defines a simple and practical Web protocol, capable of
expressing the reservation of rights relative to text & data mining (TDM)
applied to lawfully accessible Web content, and to ease the discovery of TDM
licensing policies associated with such content.

This initiative is a technical answer to the constraints set by the Article 4
of the new European Directive on copyright and related rights in the Digital
Single Market.
