huh.. so not even trying to be compatible with existing hashes?
@movq@www.uninformativ.de Has that hashing change even been accepted? :-?
I have zero mental energy for programming at the moment. 🫤
I'll try to implement the new hashing stuff in jenny before the "deadline". But I don't think you'll see any texudus development from me in the near future. ☹️
@ About the URL, since it is no longer used for hashing there might be no need to change it. I agree that we keep all the parts that are already out there, for the most part. Instead of a contact field you could also just use links like: link = Email mailto:user@example.dk
or link = Signal https://signal.me/sthF4raI5Lg_ybpJwB1sOptDla4oU7p[...]
@lyse@lyse.isobeef.org Yeah to avoid cutting off bits at the end making hashes end in either q or a 🤣
The reason I think this can work so well, and I'm in full support of it, is that it's the least disruptive way to resolve the issue of: where did this hash come from?
@prologic@twtxt.net Not sure I'd attach any if clauses to this. My point is: Every time I see a hash, I'd like to have a hint as to where to find the corresponding twt.
@movq@www.uninformativ.de If we're focusing on solving the "missing roots" problem, I would start to think about "client recommendations". The first recommendation would be:
- Replying to a Twt that has no initial Subject must itself have a Subject of the form (hash; url).
This way it's a hint to fetching clients that follow B, but not A (in the case of no mentions), that the Subject/Root might (very likely) be in the feed url.
If we must stick to hashes for threading, can we maybe make it mandatory to always include a reference to the original twt URL when writing replies?
Instead of
(#123467) hello foo bar
you would have
(#123467 http://foo.com/tw.txt) hello foo bar
or maybe even:
(#123467 2025-04-30T12:30:31Z http://foo.com/tw.txt) hello foo bar
This would greatly help in reconstructing broken threads, since hashes are obviously, unfortunately, one-way tickets. The URL/timestamp would not be used for threading, just for discovery of feeds that you don't already follow.
I don't insist on including the timestamp, but having some idea which feed we're talking about would help a lot.
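(Illustration, not part of any spec: a quick Python sketch of how a client might pick apart such an extended subject. The regex, group names and the exact timestamp pattern are my own guesses based on the examples above.)

```python
import re

# Accepts "(#abc1234)", "(#abc1234 http://foo.com/tw.txt)" and
# "(#abc1234 2025-04-30T12:30:31Z http://foo.com/tw.txt)".
SUBJECT_RE = re.compile(
    r"^\(#(?P<hash>[0-9a-z]+)"                              # twt hash
    r"(?:\s+(?P<timestamp>\d{4}-\d{2}-\d{2}T[0-9:+.Z-]+))?"  # optional timestamp
    r"(?:\s+(?P<url>\S+?))?"                                 # optional feed URL hint
    r"\)"
)

def parse_subject(twt_text: str):
    """Return (hash, timestamp, url) from an extended subject, or None."""
    m = SUBJECT_RE.match(twt_text.strip())
    if not m:
        return None
    return m.group("hash"), m.group("timestamp"), m.group("url")

print(parse_subject("(#123467 http://foo.com/tw.txt) hello foo bar"))
print(parse_subject("(#123467 2025-04-30T12:30:31Z http://foo.com/tw.txt) hi"))
```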
July 1st. 63 days from now to implement a backward-incompatible change, apparently not open to other ideas like replacing blake with SHA, or discussing implementation challenges for other languages and platforms.
Finally, just closing #18, #19 and #20 without starting a proper discussion and ignoring a "micro consensus" feels… not right.
I don't know what to think other than letting it rest (May will be busy here) and focusing on other stuff in the future.
I will be adding the code in for yarnd very soon™ for this change, with an "if the date is >= 2025-07-01 then compute_new_hashes else compute_old_hashes".
Finally I propose that we increase the Twt Hash length from 7 to 12 and use the first 12 characters of the base32 encoded blake2b hash. This will solve two problems: the fact that all hashes today either end in q or a (oops 😅), and increasing the Twt Hash size will ensure that we never run into the chance of collision for eons to come. Chances of a 50% collision with 64 bits / 12 characters is roughly ~12.44B Twts. That ought to be enough! -- I also propose that we modify all our clients and make this change from the 1st July 2025, which will be Yarn.social's 5th birthday and 5 years since I started this whole project and endeavour! 🥳 #Twtxt #Update
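(For illustration only: a Python sketch of what the proposed change could look like, assuming the hash input stays as in the current Twt Hash extension (feed URL, timestamp and twt text joined by line feeds, blake2b-256, base32 without padding, lowercased), and that the cut-over is keyed off the twt's creation date as in the "if the date is >= 2025-07-01" idea above. Function names are made up.)

```python
import base64
import hashlib
from datetime import date, datetime

def _digest(feed_url: str, created: str, content: str) -> str:
    # Payload as described in the Twt Hash extension: URL, timestamp and
    # text joined by line feeds, hashed with blake2b-256, then base32 encoded.
    payload = "\n".join([feed_url, created, content]).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()

def old_hash(feed_url: str, created: str, content: str) -> str:
    # Current behaviour: last 7 characters, which is why they all end in
    # 'a' or 'q' (only one significant bit is left in the final character).
    return _digest(feed_url, created, content)[-7:]

def new_hash(feed_url: str, created: str, content: str) -> str:
    # Proposed behaviour: first 12 characters of the same encoding.
    return _digest(feed_url, created, content)[:12]

CUTOVER = date(2025, 7, 1)

def twt_hash(feed_url: str, created: str, content: str) -> str:
    # Pick the algorithm based on the twt's creation date (one reading of
    # the proposed cut-over; it could equally be based on "today").
    created_date = datetime.fromisoformat(created.replace("Z", "+00:00")).date()
    if created_date >= CUTOVER:
        return new_hash(feed_url, created, content)
    return old_hash(feed_url, created, content)

print(twt_hash("https://example.com/twtxt.txt", "2025-07-01T00:00:00Z", "Hello!"))
```

The `[-7:]` versus `[:12]` slice is the whole difference; everything before it would stay as it is today.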
I had Chick-fil-A breakfast today (sausage, egg, and cheese biscuit, hash browns, coffee, and orange juice). Then at lunch my work place offered hot dogs. I had two (kosher, if that matters), plus a coke, a macadamia nuts cookie, and a small chocolate brownie.
So, here I am, at home, feeling hungry but guilty and refusing to eat anything else for the rest of the day. To top it off, I have only clocked 4,000 steps today (and I don't feel like walking). I am going to hell, am I?
dm-only.txt feeds. 🤔
by commenting out DMs are you giving up on simplicity? See the Metadata extension, holding the data inside comments, as the client doesn't need to show it inside the timeline.
I don't think that commenting out DMs as we are doing for metadata is giving up on simplicity (it's a feature already), and it helps to hide unwanted DMs from clients that will take months to add support for something named… an extension.
For some other extensions in https://twtxt.dev/extensions.html (for example the reply-to hash #abcdfeg or the mention @<example http://example.org/twtxt.txt>) it's not a big deal. The twt is still understandable in plain text.
For DMs, it's only interesting for you if you are the recipient, otherwise you see a scrambled message like 1234567890abcdef=. Even if you see it, you'll need some decryption to read it. I've said before that DMs shouldn't be in the same section as the timeline, as it's confusing.
So my point stands, and as I've said before, we are discussing it as a community, so let's see what other maintainers add to the convo.
dm-only.txt feeds. 🤔
After reading you, @eapl.me@eapl.me, I'll tell you my point of view.
In my opinion, a feed does not have to be equivalent to a timeline. A timeline is a representation of the feed adapted to a user. You may not be interested in seeing other people's threads or DMs. But perhaps they are interested in seeing mentions or DMs directed at them. It is important not to fall into the trap. With that clarification…
I insist, this is my point of view, it is not an absolute truth: I don't think extensions should be respectful of clients that are no longer maintained.
We cannot have a system that is simple, backwards compatible and extensible all at the same time. We have to give up some of the 3 points. I would not like to give up simplicity because it will then make it harder to maintain the clients that do stay. Therefore, I think it is better to give up backwards compatibility and play with new formulas in the extensions. I don't think it's a good idea to make a hash carry so much load: a hashtag, a thread and also a DM.
MaxAgeDays configuration at the pod level, that now some profiles are rather empty. This is only because, well, they're a bit "inactive" so to speak 🗣️ Not sure what to do about this at the moment... Open to ideas? 💡
yes it used to be http:// only and to keep hashes from breaking i added # url = http://... and now we are stuck with it due to the current specs.
Hmmm there's a bug somewhere in the way I'm ingesting archived feeds 🤔
sqlite> select * from twts where content like 'The web is such garbage these days%';
hash = 37sjhla
feed_url = https://twtxt.net/user/prologic/twtxt.txt/1
content = The web is such garbage these days 😔 Or is it the garbage search engines? 🤔
created = 2024-11-14T01:53:46Z
created_dt = 2024-11-14 01:53:46
subject = #37sjhla
mentions = []
tags = []
links = []
sqlite>
Some A hole has been trying to pull every single Twtxt feed that existed/still exists since forever. How do I know? Welp… They've been querying my Timeline™ instance for all of it, every single twtxt file and twt Hash they can find. 😑🤦 It must have been going on for days and I have just noticed… + it's all coming from the same ASN AS136907 HWCLOUDS-AS-AP HUAWEI CLOUDS
Thank you Huawei for the DDoS you sons of Glitches!!!
@quark@ferengi.one No editing old Twts that are the root of a thread with replies in the ecosystem. Just results in a fork. Unless the client has an implementation that does not store Twts keyed by Hash.
just a note that we are doing that on PHP: https://github.com/eapl-gemugami/twtxt-php/blob/master/docs/03-hash-extension.md#php-72
That PHP snippet could be merged into https://twtxt.dev/exts/twt-hash.html
@david@collantes.us @andros@twtxt.andros.dev The correct hash would be si4er3q. See https://twtxt.dev/exts/twt-hash.html: a timezone offset of +00:00 or -00:00 must be replaced by Z.
(That said, there's a bug in jenny as well. It only replaces +00:00, not -00:00. 🤡)
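(A minimal sketch of that normalization step in Python; it only handles the ±00:00 case mentioned here, not general offset conversion.)

```python
import re

def normalize_offset(timestamp: str) -> str:
    # Replace a trailing +00:00 or -00:00 offset with "Z" before hashing,
    # as required by the Twt Hash extension.
    return re.sub(r"[+-]00:00$", "Z", timestamp)

assert normalize_offset("2025-04-07T19:59:51+00:00") == "2025-04-07T19:59:51Z"
assert normalize_offset("2025-04-07T19:59:51-00:00") == "2025-04-07T19:59:51Z"
assert normalize_offset("2025-04-07T19:59:51+02:00") == "2025-04-07T19:59:51+02:00"
```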
@prologic@twtxt.net @bender@twtxt.net What is the hash of the last message from https://aelaraji.com/twtxt.txt?
dm-only.txt feeds. 🤔
@bender@twtxt.net @aelaraji@aelaraji.com The client should ignore twts if it's not compatible or they're not addressed to me. It's a simple regex to add! It's similar to the Twt Hash Extension. Should they be in another file? They are child messages, not flat twts. Of course not!
@prologic@twtxt.net interesting. What would happen on a hash collision? 🤔
@bender@twtxt.net It's a bug in the UI for sure. The hash is the primary key.
@david@collantes.us Yeah, we've been debugging that a bit yesterday. Looks like the wrong input (sometimes) gets fed to the hash function → broken threads.
@movq@www.uninformativ.de @kat@yarn.girlonthemoon.xyz Heck yeah, that's crazy! :-) Fingers crossed! (tt also agrees with the right™ hash)
./yarnc debug <your feed url>:
The actual hash is fs7673q.
@prologic@twtxt.net that's not what I see. The hash znf6csa cannot be found.
@prologic@twtxt.net There was no edit according to my Git history. 🤔 On my end, the hash is fs7673q and that's also what kat used to reply.
Doesn't look like it Hmmm
sqlite> select * from twts where content LIKE '%Linux installation%';
hash = znf6csa
feed_url = https://www.uninformativ.de/twtxt.txt
content = I wonder if my current Linux installation will actually make it to 20 years:
$ head -n 1 /var/log/pacman.log
[2011-07-07 11:19] installed filesystem (2011.04-1)
It's not toooo far into the future.
It would be crazy … 20 years without reinstalling once … phew. 🥴
created = 2025-04-07T19:59:51Z
subject = (#znf6csa)
mentions = []
tags = []
links = []
@movq@www.uninformativ.de Apparently you wrote it :D The hash doesn't lie? 🤣 https://twtxt.net/twt/znf6csa
@prologic@twtxt.net What happened here: did I edit my twt or is this hash wrong? 🥴
@prologic@twtxt.net Spring cleanup! That's one way to encourage people to self-host their feeds. :-D
Since I'm only interested in the url metadata field for hashing, I do not keep any comments or metadata for that matter, just the messages themselves. The last time I fetched was probably some time yesterday evening (UTC+2). I cannot tell exactly, because the recorded last fetch timestamp has been overridden with today's by now.
I dumped my new SQLite cache into: https://lyse.isobeef.org/tmp/backup.tar.gz This time maybe even correctly, if you're lucky. I'm not entirely sure. It took me a few attempts (date and time were separated by space instead of T at first, I normalized offsets +00:00 to Z as yarnd does and converted newlines back to U+2028). At least now the simple cross check with the Twtxt Feed Validator does not yield any problems.
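(My guess at the kind of round trip described above, sketched in Python: newlines folded to U+2028 as in the Multiline extension, date and time joined by T, and ±00:00 offsets written as Z. This is not lyse's actual export code.)

```python
LINE_SEP = "\u2028"  # Unicode LINE SEPARATOR used by the Multiline extension

def to_twtxt_line(created: str, text: str) -> str:
    # Date and time joined with "T", offsets +00:00/-00:00 written as "Z",
    # embedded newlines folded into U+2028 so the twt stays on one line.
    created = created.replace(" ", "T").replace("+00:00", "Z").replace("-00:00", "Z")
    return f"{created}\t{text.replace(chr(10), LINE_SEP)}"

def from_twtxt_line(line: str):
    created, text = line.split("\t", 1)
    return created, text.replace(LINE_SEP, "\n")

line = to_twtxt_line("2025-04-07 19:59:51+00:00", "first line\nsecond line")
print(line)
print(from_twtxt_line(line))
```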
@andros@twtxt.andros.dev sha256 hash of twt in json. Look at converter script
Amazing! It is a good tool for reading feeds. What did you use to calculate the hash?
Hello, I want to present my new revolutionary twtxt v3 format - twjson
Here's why you should use it:
- It's easy to parse
- It's easy to read (in formatted mode :D)
- It actually uses \n for newlines, you don't need unprintable symbols
- Forget about hash collisions because it uses the full hash
Here is my twjson feed: https://doesnm.p.psf.lt/twjson.json
And twtxt2json converter: https://doesnm.p.psf.lt/twjson.js
@eapl.me@eapl.me Interesting! Two points stood right out to me:
Why the hell are e-mail newsletters considered a valid option in the first place? Just offer an Atom feed and be done with it! Especially for a blog of this very type. This doesn't even involve a third party service. Although, in addition he also links to Feedburner, what the fuck!? No e-mail address or the like is needed and subject to being disclosed.
When these spam mailers want to prevent resubscribing, then for fuck's sake, why don't they use a hash of the e-mail address (I saw that in yarnd) for that purpose? Storing the e-mail address in clear text after unsubscribing is illegal in my book.
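(A sketch of the suggestion above: keep only a salted hash of the normalized address on the suppression list instead of the clear-text e-mail. The salt and the normalization rules here are my own assumptions, not how yarnd actually does it.)

```python
import hashlib

SALT = b"unsubscribe-suppression-v1"  # illustrative, not from any real system

def suppression_key(email: str) -> str:
    # Store only this digest after an unsubscribe; the clear-text address can
    # be discarded while re-subscription attempts can still be matched.
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(SALT + normalized).hexdigest()

suppressed = {suppression_key("User@Example.org")}
print(suppression_key("user@example.org") in suppressed)  # True
```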
There are 82.108 read statuses, but only 24.421 messages in the cache. In contrast to the cache with the messages, the read statuses are never cleaned up when a feed was unsubscribed from. And the read statuses also contain old style hashes, from before we settled on what we have today. Still a huge difference. Hmm.
tt reimplementation that I already followed with the old Python tt. Previously, I just had a few feeds for testing purposes in my new config. While transferring, I "dropped" heaps of feeds that appeared to be inactive.
Thanks, @movq@www.uninformativ.de!
My backing SQLite database with indices is 8.7 MiB in size right now.
The twtxt cache is 7.6 MiB, it uses Python's pickle module. And next to it there is a 16.0 MiB second database with all the read statuses for the old tt. Wow, super inefficient, it shouldn't contain anything else, it's a giant, pickled {"$hash": {"read": True/False}, …}. What the heck, why is it so big?! O_o
(Back in tt.) Well, it kinda worked. At least appending to the file. But my cache database got screwed up. I do not yet support replies, so the subject and root hash columns have not been set at all, resulting in a message that is just not shown at all. I gotta do something about that next. The good thing is, though, after simply fixing the two columns the message appeared on screen.
@bender@twtxt.net Yeah, as you mentioned in the other thread, @andros@twtxt.andros.dev's hashes appear to be not quite right. 🤔
@movq@www.uninformativ.de I have no doubt that you're not seeing the images correctly 😅. It's just that it's broken when viewing them, in my case, and analyzing the URLs, I've seen everything I mentioned.
Regarding the hash, you're right. I'll have to investigate what's going on. I'm having a hard time getting the hash generation to work properly.
@andros@twtxt.andros.dev Hm, looks correct to me. The image to be displayed is a thumbnail and this links to the full-sized image. The thumbnail (JPG) is auto-generated from the full image (PNG), hence the two extensions.
What does look strange, though, is that your client came up with the hash pqsmcka, while it should have been te5quba. 🤔
Why not just use a registry? It can be personal or hosted by someone like registry.twtxt.org. It would just need to be adapted to support hashes.
@prologic@twtxt.net We can't agree on this idea because that makes things even more complicated than it already is today. The beauty of twtxt is, you put one file on your server, done. One. Not five million. Granted, there might be archive feeds, so it might be already a bit more, but still faaaaaaar less than one file per message.
Also, you would need to host not only your own hash files, but everybody else's you follow as well. Otherwise, what is that supposed to achieve? If people are already following my feed, they know what hashes I have, so this is of no use to them (unless they want to look up a message from an archive feed and don't process them). But the far more common scenario is that an unknown hash originates from a feed that they have not subscribed to.
Additionally, yarnd's URL schema would then also break, because https://twtxt.net/twt/<hash> now becomes https://twtxt.net/user/prologic/<hash>, https://twtxt.net/user/bender/<hash> and so on. To me, that looks like you would only get hashes if they belonged to this particular user. Of course, you could define rules that if there is a /user/ part in the path, then use a different URL, but this complicates things even more.
Sorry, I don't like that idea.
a few async ideas for later
The editing process needs a lot of consideration and compromises.
From one side, editing and deleting are necessary IMO. People will do it anyway, and personally I like to edit my texts, so I'd put some effort into making it work.
Should we keep a history of edits? Should we hash every edit to avoid abuse? Should we mark a twt internally as deleted, but keep the replies?
I think that's part of a more complete "thread" extension, although I'd say it's worth agreeing on something reflecting the real usage in the wild, along with what people usually do on other platforms.
looks good to me!
About alice's hash, using SHA256, I get 96473b4f or 96473B4F for the last 8 characters. I'll add it as an implementation example.
The idea of including it besides the follow URL is to avoid calculating it every time we load the file (assuming the client did that correctly), and helps to track replies across the file with a simple search.
Also, looking at your example I'm thinking now that instead of {url=96473B4F,id=1}, which is ambiguous about which URL we are referring to, it could be something like:
{reply_to=[URL_HASH]_[TWT_ID]} / {reply_to=96473B4F_1}
That way, the "full twt ID" could be 96473B4F_1.
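(How I read the example above, as a Python sketch: the URL hash is the last 8 hex characters of SHA-256 over the feed URL, and the full twt ID appends the per-feed twt number. How the URL should be normalized before hashing is left open here.)

```python
import hashlib

def url_hash(feed_url: str) -> str:
    # Last 8 hex characters of SHA-256 over the raw feed URL
    # (upper-cased here; the example above shows both cases).
    return hashlib.sha256(feed_url.encode("utf-8")).hexdigest()[-8:].upper()

def full_twt_id(feed_url: str, twt_number: int) -> str:
    return f"{url_hash(feed_url)}_{twt_number}"

def reply_to_subject(feed_url: str, twt_number: int) -> str:
    return f"{{reply_to={full_twt_id(feed_url, twt_number)}}}"

print(reply_to_subject("https://example.com/twtxt.txt", 1))
```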
True. Though if the idea turns out to be better, then the community will adopt it.
if you look at the subject for that twt you will see that it uses the extended hash format to include a URL address.
@bmallred@staystrong.run I forgot one more effect of edits. If clients remember the read status of messages by hash, an edit will mark the updated message as unread again. To some degree that is even the right behavior, because the message was updated, so the user might want to have a look at the updated version. On the other hand, if it's just a small typo fix, it's maybe not worth telling the user about. But the client doesn't know, at least not without additional logic.
Having said that, it appears that this only affects me personally, noone else. I don't know of any other client that saves read statuses. But don't worry about me, all good. Just keep doing what you've done so far. I wanted to mention that only for the sake of completeness. :-)
I like this syntax, you have my vote, although I'd change it a bit, like:
#<Alice https://example.com/twtxt.com#2024-12-18T14:18:26+01:00>
Hashes are not a problem on PHP, I don't know why it's slow to calculate them on your side, but I agree with your points.
BTW, did you have the chance to read my proposal on twtxt 2.0? I shared a few ideas about possible improvements to discuss:
https://text.eapl.mx/a-few-ideas-for-a-next-twtxt-version
https://text.eapl.mx/reply-to-lyse-about-twtxt
@bmallred@staystrong.run Any edit automatically changes the twt hash, because the hash is built over the hashing URL, message timestamp and message text. https://twtxt.dev/exts/twt-hash.html So, it is only a problem if somebody replied to your original message with the old hash. The original message suddenly doesn't exist anymore and the reply becomes detached, orphaned, whatever you wanna call it. Threading doesn't break, though, if nobody replied to your message.
@andros@twtxt.andros.dev I believe you have just reproduced the bug… it looks like you've replied to a twt but the hash is wrong. I can see the hash here from Jenny, but it doesn't look like it corresponds to any{twt,thing}. If you check it out on any yarn instance it won't look like a reply.
My hypothesis about that thing breaking my twts is that it might have something to do with the parentheses surrounding the root twt hash in the reply twt-A when I reply to it with fork-twt-B; I imagine elisp interpreting those as an s-expression, thus breaking the generation process of hash (#twt-A) before prepending it to fork-twt-B… but then I'm too ignorant to figure out how to test my theory (heck, I couldn't even recalculate the hashes myself correctly in bash xD). I'll keep trying tho.
@andros@twtxt.andros.dev yes, that usually happens when twts get edited and we just made a gentlemen's agreement to avoid edits as much as possible (at least for the time being). But the thing is, that is not what's happening with my broken twts' hashes, since I've been mostly replying to my own twts as a test and I know for sure that I haven't edited any. (I usually fork-reply instead of editing a twt when needed)
@aelaraji@aelaraji.com Can you give me examples of hashes that you have detected wrong between Emacs client and twtxt.net?
Perhaps there is some character, some space, that is creating the discrepancy.
@prologic@twtxt.net Agreed! But clients can hallucinate and generate wrong hashes aka Lies 🤣 Also, if you check your own twt on twtxt.net, it looks like a root twt instead of a reply. The hash in that test reply should have been s243lua instead 🤷
@prologic@twtxt.net Are you sure? xD … it was supposed to be a reply to another twt, but the twt hash is wrong (I think).
@andros@twtxt.andros.dev is it me or does twtxt-el generate a wrong twt hash when I use the [ ⏳ Reply to twt ] button?
@prologic@twtxt.net Of course you don't notice it when yarnd only shows at most the last n messages of a feed. As an example, check out mckinley's message from 2023-01-09T22:42:37Z. It has "[Scheduled][Scheduled][Scheduled]"… in it. This text in square brackets is repeated numerous times. If you search his feed for closing square bracket followed by an opening square bracket (][) you will find a bunch more of these. It goes without question he never typed that in his feed. My client saves each twt hash I've explicitly marked read. A few days ago, I got plenty of apparently years old, yet suddenly unread messages. Each and every single one of them containing this repeated bracketed text thing. The only conclusion is that something messed up the feed again.
I'm realizing that my performance bottleneck is @prologic@twtxt.net! It is actually calculating the hashes to build the replies, and specifically users with very long feeds 😅. I'm seriously thinking about enabling replies via configuration.
Well, that's another bug: The search https://twtxt.net/search?q=%22LOOOOL%2C+great+programming+tutorial+music%22 yields the wrong hash. It should have been poyndha instead.
@lyse@lyse.isobeef.org Thanks! Yes, there are still countless knobs to tweak on that thing. I will incorporate your remarks. A mobile view would also be nice. Right now it is still a pretty tight fit on a smartphone.
Regarding conversations: yesterday's one about encrypted messages was the decisive factor for implementing this. The approaches in "Timeline" and "Yarn" have not convinced me so far. But we can all learn something from each other.
another one would be to allow changing public keys over time (as it may be a good practice [0]). A syntax like the following could help to know what public key you used to encrypt the message, and which private key the client should use to decrypt it:
!<nick url> <encrypted_message> <public_key_hash_7_chars>
Also I'd remove support for storing the message as hex, only allowing base64 (more compact, aiming for a minimalistic spec, etc.)
I'm still making progress with the Emacs client. I'm proud to say that the code that is responsible for reading the feeds is almost finished, including: Twt Hash Extension, Twt Subject Extension, Multiline Extension and Metadata Extension. I'm fine-tuning some tests and will soon do the first buffer that displays the twts.
Added TwtHash hashes to every message on my personal Twtxt HTML renderer. Code is not yet ready for prime-time. Need to work out some kinks still.
@bender@twtxt.net So turns out something is setting my HashingURI to the value {{ .Profile.URI }} and that is making my hashes wrong so it cannot delete or edit twts.
@movq@www.uninformativ.de I'm sorry if I sound too contrarian. I'm not a fan of using an obscure hash as well. The problem is that of future and backward compatibility. If we change to sha256 or another we don't just need to support sha256. But need to now support both sha256 AND blake2b. Or we divide the community. Users of some clients will still use the old algorithm and get left behind.
Really we should all think hard about how changes will break things and if those breakages are acceptable.
I swear I did write up an algorithm for it at some point. I think it is lost in a git comment someplace. I'll put together pseudo/Go code this week.
Super simple:
Making a reply:
- If yarn has one use that. (Maybe do collision check?)
- Make hash of twt raw, no truncation.
- Check local cache for shortest without collision - in SQL: select len(subject) where head_full_hash like subject || '%'
Threading:
- Get full hash of head twt
- Search for twts - in SQL: head_full_hash like subject || '%' and created_on > head_timestamp
The assumption being replies will be for the most recent head. If replying to an older one it will use a longer hash.
These collisions aren't important unless someone tries to fork. So.. for the vast majority it's not a big deal. Using the grow hash algorithm could inform the client to add another char when they fork.
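(A rough Python rendering of the idea above; this is my paraphrase, not xuu's code: keep the full untruncated hash, and when writing a reply pick the shortest prefix that is unique in the local cache, growing it by one character whenever it would collide.)

```python
MIN_LEN = 7  # starting stem length; adjustable, chosen to match today's hashes

def shortest_unique_prefix(full_hash: str, known_full_hashes: set) -> str:
    """Pick the shortest prefix of full_hash that matches no other known hash."""
    for length in range(MIN_LEN, len(full_hash) + 1):
        prefix = full_hash[:length]
        collisions = [h for h in known_full_hashes
                      if h.startswith(prefix) and h != full_hash]
        if not collisions:
            return prefix
    return full_hash  # pathological case: another hash shares every prefix

cache = {"abcdef0123456789", "abcdef0999999999"}
print(shortest_unique_prefix("abcdef0123456789", cache))  # "abcdef01"
```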
Thanks @david@collantes.us, good to know, but we need to agree on what character we use, otherwise the hashes will not be the same:)
Ok, I know how the command works (not sure), but it seems it only grabs from cache. Maybe fetch from twtxt.net if the hash is not found?
@prologic@twtxt.net Regarding the new way of generating twt-hashes, to me it makes more sense to use tabs as separator instead of spaces, since then you can just copy/paste a line directly from a twtxt-file that already has a tab between timestamp and message. But tabs might be hard to "type" when you are in a terminal, since it will activate autocomplete…🤔
Another thing, it seems that you suggest we only use the domain in the hash-creation and not the full path to the twtxt.txt
$ echo -e "https://example.com 2024-09-29T13:30:00Z Hello World!" | sha256sum - | awk '{ print $1 }' | base64 | head -c 12
twet displays twts in raw format with some formatting (sadly no newlines). And for reply messages I just see (#hash). But which text is hidden behind the hash? Currently I open twtxt.net/twt/hash to see it.
More thoughts about changes to twtxt (as if we haven't had enough thoughts):
- There are lots of great ideas here! Is there a benefit to putting them all into one document? Seems to me this could more easily be a bunch of separate efforts that can progress at their own pace:
1a. Better and longer hashes.
1b. New possibly-controversial ideas like edit: and delete: and location-based references as an alternative to hashes.
1c. Best practices, e.g. Content-Type: text/plain; charset=utf-8
1d. Stuff already described at dev.twtxt.net that doesn't need any changes.
We won't know what will and won't work until we try them. So I'm inclined to think of this as a bunch of draft ideas. Maybe later when we've seen it play out it could make sense to define a group of recommended twtxt extensions and give them a name.
Another reason for 1 (above) is: I like the current situation where all you need to get started is these two short and simple documents:
https://twtxt.readthedocs.io/en/latest/user/twtxtfile.html
https://twtxt.readthedocs.io/en/latest/user/discoverability.html
and everything else is an extension for anyone interested. (Deprecating non-UTC times seems reasonable to me, though.) Having a big long "twtxt v2" document seems less inviting to people looking for something simple. (@prologic@twtxt.net you mentioned an anonymous comment "you've ruined twtxt" and while I don't completely agree with that commenter's sentiment, I would feel like twtxt had lost something if it moved away from having a super-simple core.)
All that being said, these are just my opinions, and I'm not doing the work of writing software or drafting proposals. Maybe I will at some point, but until then, if you're actually implementing things, you're in charge of what you decide to make, and I'm grateful for the work.
@prologic@twtxt.net Done. Also, I went ahead and made two changes: changed hexadecimal to base64 for hashes (wasn't sure if anyone objected), and changed "MUST follow the chain" to "SHOULD follow the chain".
@prologic@twtxt.net a wise plan! Who knows, ideas change, and often plans do not hash, right? Mature, mature! :-)
(#2024-09-24T12:44:35Z) There is an increase in space/memory for sure. But calculating the hashes also takes up CPU. I'm not good with that kind of math, but it's a tradeoff either way.
@falsifian@www.falsifian.org I believe the "preserve" means to include the original subject hash at the start of the twt, such as (#somehash)
Sorry, you're right, I should have used numbers!
I don't understand what "preserve the original hash" could mean other than "make sure there's still a twt in the feed with that hash". Maybe the text could be clarified somehow.
I'm also not sure what you mean by markdown already being part of it. Of course people can already use Markdown, just like presumably nothing stopped people from using (twt subjects) before they were formally described. But it's not universal; e.g. as a jenny user I just see the plain text.
@prologic@twtxt.net Thanks for writing that up!
I hope it can remain a living document (or sequence of draft revisions) for a good long time while we figure out how this stuff works in practice.
I am not sure how I feel about all this being done at once, vs. letting conventions arise.
For example, even today I could reply to twt abc1234 with "(#abc1234) Edit: …" and I think all you humans would understand it as an edit to (#abc1234). Maybe eventually it would become a common enough convention that clients would start to support it explicitly.
Similarly we could just start using 11-digit hashes. We should iron out whether it's sha256 or whatever, but there's no need to get all the other stuff right at the same time.
I have similar thoughts about how some users could try out location-based replies in a backward-compatible way (append the replyto: stuff after the legacy (#hash) style).
However I recognize that I'm not the one implementing this stuff, and it's less work to just have everything determined up front.
Misc comments (I haven't read the whole thing):
Did you mean to make hashes hexadecimal? You lose 11 bits that way compared to base32. I'd suggest gaining 11 bits with base64 instead.
"Clients MUST preserve the original hash": do you mean they MUST preserve the original twt?
Thanks for phrasing the bit about deletions so neutrally.
I don't like the MUST in "Clients MUST follow the chain of reply-to references…". If someone writes a client as a 40-line shell script that requires the user to piece together the threading themselves, IMO we shouldn't declare the client non-conforming just because they didn't get to all the bells and whistles.
Similarly I don't like the MUST for user agents. For one thing, you might want to fetch a feed without revealing your identity. Also, it raises the bar for a minimal implementation (I'm thinking again of the 40-line shell script).
For "who follows" lists: why must the long, random tokens be only valid for a limited time? Do you have a scenario in mind where they could leak?
Why can't feeds be served over HTTP/1.0? Again, thinking about simple software. I recently tried implementing HTTP/1.1 and it wasn't too bad, but 1.0 would have been slightly simpler.
Why get into the nitty-gritty about caching headers? This seems like generic advice for HTTP servers and clients.
I'm a little sad about other protocols not being recommended.
I don't know how I feel about including markdown. I don't mind too much that yarn users emit twts full of markdown, but I'm more of a plain text kind of person. Also it adds to the length. I wonder if putting it in a separate document would make more sense; that would also help with the length.
I wrote some code to try out non-hash reply subjects formatted as (replyto <url> <timestamp>), while keeping the ability to use the existing hash style.
I don't think we need to decide all at once. If clients add support for a new method then people can use it if they like. The downside of course is that this costs developer time, so I decided to invest a few hours of my own time into a proof of concept.
With apologies to @movq@www.uninformativ.de for corrupting jenny's beautiful code. I don't write this expecting you to incorporate the patch, because it does complicate things and might not be a direction you want to go in. But if you like any part of this approach feel free to use bits of it; I release the patch under jenny's current LICENCE.
Supporting both kinds of reply in jenny was complicated because each email can only have one Message-Id, and because it's possible the target twt will not be seen until after the twt referencing it. The following patch uses an sqlite database to keep track of known (url, timestamp) pairs, as well as a separate table of (url, timestamp) pairs that haven't been seen yet but are wanted. When one of those "wanted" twts is finally seen, the mail file gets rewritten to include the appropriate In-Reply-To header.
Patch based on jenny commit 73a5ea81.
https://www.falsifian.org/a/oDtr/patch0.txt
Not implemented:
- Composing twts using the (replyto …) format.
- Probably other important things I'm forgetting.
@prologic@twtxt.net I read it. I understand it. Hopefully a solution can be agreed upon that solves the editing issue, whilst maintaining the cryptographic hash.
@prologic@twtxt.net I know the role of the current hash is to allow referencing (replies and, thus, threads), and it also represents a "unique" way to verify a twtxt hasn't been tampered with. Is that second one so important, if we are trying to allow edits? I know it feels good to be able to verify, but in reality, how often does one do it?
@prologic@twtxt.net how about hashing a combination of nick/timestamp, or url/timestamp only, and not the twtxt content? On edit those will not change, so no breaking of threads. I know, I know, just adding noise here. :-P
@quark@ferengi.one It does not. That is why I'm advocating for not using hashes for threads, but a simpler link-back scheme.
the stem matching is the same as how GIT does its branch hashes. i think you can stem it down to 2 or 3 sha bytes.
if a client sees someone in a yarn using a byte longer hash it can lengthen to match, since it can assume that maybe the other client has a collision that it doesn't know about.
@prologic@twtxt.net the basic idea was to stem the hash.. so you have a hash abcdef0123456789... any substring of that hash after the first 6 will match. so abcdef, abcdef012, abcdef0123456 all match the same. on the case of a collision i think we decided on matching the newest since we archive off older threads anyway. the third rule was about growing the minimum hash size after some threshold of collisions were detected.
@prologic@twtxt.net Wikipedia claims sha1 is vulnerable to a "chosen-prefix attack", which I gather means I can write any two twts I like, and then cause them to have the exact same sha1 hash by appending something. I guess a twt ending in random junk might look suspicious, but perhaps the junk could be worked into an image URL. If that's not possible now, maybe it will be later.
git only uses sha1 because they're stuck with it: migrating is very hard. There was an effort to move git to sha256 but I don't know its status. I think there is progress being made with Game Of Trees, a git clone that uses the same on-disk format.
I can't imagine any benefit to using sha1, except that maybe some very old software might support sha1 but not sha256.
@movq@www.uninformativ.de going a little sideways on this, "*If twtxt/Yarn was to grow bigger, then this would become a concern again. But even Mastodon allows editing, so how much of a problem can it really be? 🤔*", wouldn't preparing for a potential (even if very, very, veeeeery remote) growth be a good thing? Mastodon signs all messages, keeps a history of edits, and it doesn't break threads. It isn't a problem there. 😉 It is here.
I think keeping hashes is a must. If anything, for that "feels good" feeling.
@movq@www.uninformativ.de Agreed that hashes have a benefit. I came up with a similar example when I twted about an 11-character hash collision. Perhaps hashes could be made optional somehow. Like, you could use the "replyto" idea and then additionally put a hash somewhere if you want to lock in which version of the twt you are replying to.
There is nothing wrong with how we currently run a diff to see what has been removed. if i build a merkle tree off all the twt hashes in a feed i can use that to verify a twt should be in a feed or not. and gossip that to my peers.
isn't the benefit of blake2b that it is a more efficient algo than sha1 and has the same or similar entropy to sha3? i thought we had partially solved this with some type of expanding hash size? additionally we could increase bit density by using base36 or base64/url-safe…
There's a simple reason all the current hashes end in a or q: the hash is 256 bits, the base32 encoding chops that into groups of 5 bits, and 256 isn't divisible by 5. The last character of the base32 encoding just has that left-over single bit (256 mod 5 = 1).
So I agree with #3 below, but do you have a source for #1, #2 or #4? I would expect any lack of variability in any part of a hash function's output would make it more vulnerable to attacks, so designers of hash functions would want to make the whole output vary as much as possible.
Other than the divisible-by-5 thing, my current intuition is it doesn't matter what part you take.
1. Hash Structure: Hashes are typically designed so that their outputs have specific statistical properties. The first few characters often have more entropy or variability, meaning they are less likely to have patterns. The last characters may not maintain this randomness, especially if the encoding method has a tendency to produce less varied endings.
2. Collision Resistance: When using hashes, the goal is to minimize the risk of collisions (different inputs producing the same output). By using the first few characters, you leverage the full distribution of the hash. The last characters may not distribute in the same way, potentially increasing the likelihood of collisions.
3. Encoding Characteristics: Base32 encoding has a specific structure and padding that might influence the last characters more than the first. If the data being hashed is similar, the last characters may be more similar across different hashes.
4. Use Cases: In many applications (like generating unique identifiers), the beginning of the hash is often the most informative and varied. Relying on the end might reduce the uniqueness of generated identifiers, especially if a prefix has a specific context or meaning.
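(falsifian's divisible-by-5 point is easy to check empirically; this little script, mine and purely illustrative, base32-encodes random 256-bit values and shows the final character is always a or q.)

```python
import base64
import os

# 256 bits = 51 full groups of 5 bits plus 1 leftover bit; that last bit,
# padded with zeros, can only encode 0b00000 ('a') or 0b10000 ('q').
last_chars = set()
for _ in range(1000):
    digest = os.urandom(32)  # any 256-bit value, e.g. a blake2b digest
    encoded = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    last_chars.add(encoded[-1])

print(last_chars)  # only ever contains 'a' and 'q'
```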
@prologic@twtxt.net I saw those, yes. I tried using yarnc, and it would work for a simple twtxt. Now, for a more convoluted one it truly becomes a nightmare using that tool for the job. I know there are talks about changing this hash, so this might be a moot point right now, but it would be nice to have a tool that:
- Would calculate the hash of a twtxt in a file.
- Would calculate all hashes on a twtxt.txt (local and remote).
Again, something lovely to have after any looming changes occur.
Could someone knowledgeable reply with the steps a grandpa would take to calculate the hash of a twtxt from the CLI, using out-of-the-box tools? I swear I read about it somewhere, but can't find it.
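(Not quite out-of-the-box shell, but close: a small Python script, assuming the algorithm documented in the Twt Hash extension as I read it: feed URL, timestamp with +00:00/-00:00 written as Z, and the twt text, joined by line feeds; blake2b-256; base32 without padding, lowercased; last 7 characters. Treat it as a sketch, not a reference implementation.)

```python
#!/usr/bin/env python3
# Usage: python3 twthash.py <feed-url> "<timestamp><TAB><twt text>"
# (pass one raw line from the twtxt file as the second argument)
import base64
import hashlib
import sys

feed_url = sys.argv[1]
timestamp, text = sys.argv[2].split("\t", 1)
timestamp = timestamp.replace("+00:00", "Z").replace("-00:00", "Z")

payload = "\n".join([feed_url, timestamp, text]).encode("utf-8")
digest = hashlib.blake2b(payload, digest_size=32).digest()
print(base64.b32encode(digest).decode("ascii").rstrip("=").lower()[-7:])
```

Invoked, for example, as python3 twthash.py https://example.com/twtxt.txt "$(sed -n 5p twtxt.txt)"; the double quotes keep the tab between timestamp and text intact.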
@falsifian@www.falsifian.org based on the Twt Subject Extension, your subject is invalid. You can have custom subjects, that is, not a valid hash, but you simply can't put anything and expect it to be treated as a TwtSubject, me thinks.
yarnd just doesn't render the subject. Fair enough. It's (replyto http://darch.dk/twtxt.txt 2024-09-15T12:50:17Z), and if you don't want to go on a hunt, the twt hash is weadxga: https://twtxt.net/twt/weadxga
@sorenpeter@darch.dk I like this idea. Just for fun, I'm using a variant in this twt. (Also because I'm curious how non-hash subjects appear in jenny and yarn.)
URLs can contain commas so I suggest a different character to separate the url from the date. In this twt I've used a space (also after "replyto", for symmetry).
I think this solves:
- Changing feed identities: although @mckinley@twtxt.net points out URLs can change, I think this syntax should be okay as long as the feed at that URL can be fetched, and as long as the current canonical URL for the feed lists this one as an alternate.
- editing, if you don't care about message integrity
- finding the root of a thread, if you're not following the author
An optional hash could be added if message integrity is desired. (E.g. if you don't trust the feed author not to make a misleading edit.) Other recent suggestions about how to deal with edits and hashes might be applicable then.
People publishing multiple twts per second should include sub-second precision in their timestamps. As you suggested, the timestamp could just be copied verbatim.
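(For completeness, a parser sketch in Python that accepts both the legacy hash subject and the location-based form used in this thread; spacing and field order follow the examples above, everything else is my own guess.)

```python
import re
from typing import NamedTuple, Optional

class Subject(NamedTuple):
    kind: str                       # "hash" or "replyto"
    hash: Optional[str] = None
    url: Optional[str] = None
    timestamp: Optional[str] = None

HASH_RE = re.compile(r"^\(#([0-9a-z]+)\)")
REPLYTO_RE = re.compile(r"^\(replyto ([^ )]+) ([^ )]+)\)")

def parse_subject(text: str) -> Optional[Subject]:
    text = text.lstrip()
    if m := HASH_RE.match(text):
        return Subject("hash", hash=m.group(1))
    if m := REPLYTO_RE.match(text):
        return Subject("replyto", url=m.group(1), timestamp=m.group(2))
    return None

print(parse_subject("(#weadxga) yarnd just doesn't render the subject."))
print(parse_subject("(replyto http://darch.dk/twtxt.txt 2024-09-15T12:50:17Z) I like this idea."))
```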