One of the biggest gripes of the community with the way the threading model currently works with Twtxt v1.2 (https://twtxt.dev) is this notion of:
What is this hash?
What does it refer to?
Idea: Why can’t we all agree to implement a simple URI scheme where we host our Twtxt feeds?
That is, if you host your feed at https://example.com/twtxt.txt – Why can’t or could you not also host various JSON files (let’s agree on the spec of course) at https://example.com/twt/<hash> ? 🤔
That way we solve this problem in a truly decentralised way, rather than every relying on yarnd pods alone.
Go-redis:執行 Lua 腳本
go-redis (github.com/redis/go-redis) 支持 Lua 腳本 redis.Script,本文在這裏簡單展示其在秒殺場景中使用的代碼片段。秒殺場景在秒殺場景中,一個商品的庫存對應了兩個信息,分別是總庫存量和已秒殺量。可以使用一個 Hash 類型的鍵值對來保存庫存的這兩個信息,如下所示:key: productid value: {total: N, ordered: ⌘ Read more
a few async ideas for later
The editing process needs a lot of consideration and compromises.
From one side, editing and deleting it’s necessary IMO. People will do it anyway, and personally I like to edit my texts, so I’d put some effort on make it work.
Should we keep a history of edits? Should we hash every edit to avoid abuse? Should we mark internally a twt as deleted, but keeping the replies?
I think that’s part of a more complete ‘thread’ extension, although I’d say it’s worth to agree on something reflecting the real usage in the wild, along with what people usually do on other platforms.
looks good to me!
About alice’s hash, using SHA256, I get 96473b4f or 96473B4F for the last 8 characters. I’ll add it as an implementation example.
The idea of including it besides the follow URL is to avoid calculating it every time we load the file (assuming the client did that correctly), and helps to track replies across the file with a simple search.
Also, watching your example I’m thinking now that instead of {url=96473B4F,id=1} which is ambiguous of which URL we are referring to, it could be something like:
{reply_to=[URL_HASH]_[TWT_ID]} / {reply_to=96473B4F_1}
That way, the ‘full twt ID’ could be 96473B4F_1.
True. Though if the idea turns out to be better.. then community will adopt it.
if you look at the subject for that twt you will see that it uses the extended hash format to include a URL address.
True. Though if the idea turns out to be better.. then community will adopt it.
if you look at the subject for that twt you will see that it uses the extended hash format to include a URL address.
@bmallred@staystrong.run I forgot one more effect of edits. If clients remember the read status of massages by hash, an edit will mark the updated message as unread again. To some degree that is even the right behavior, because the message was updated, so the user might want to have a look at the updated version. On the other hand, if it’s just a small typo fix, it’s maybe not worth to tell the user about. But the client doesn’t know, at least not with additional logic.
Having said that, it appears that this only affects me personally, noone else. I don’t know of any other client that saves read statuses. But don’t worry about me, all good. Just keep doing what you’ve done so far. I wanted to mention that only for the sake of completeness. :-)
I like this syntax, you have my vote, although I’d change it a bit like
#<Alice https://example.com/twtxt.com#2024-12-18T14:18:26+01:00>
Hashes are not a problem on PHP, I dont know why it’s slow to calculate them from your side, but I agree with your points.
BTW, did you have the chance to read my proposal on twtxt 2.0? I shared a few ideas about possible improvements to discuss:
https://text.eapl.mx/a-few-ideas-for-a-next-twtxt-version
https://text.eapl.mx/reply-to-lyse-about-twtxt
@bmallred@staystrong.run Any edit automatically changes the twt hash, because the hash is built over the hash URL, message timestamp and message text. https://twtxt.dev/exts/twt-hash.html So, it is only a problem, if somebody replied to your original message with the old hash. The original message suddenly doesn’t exist anymore and the reply becomes detached, orphaned, whatever you wanna call it. Threading doesn’t break, though, if nobody replied to your message.
@andros@twtxt.andros.dev I believe you have just reproduced the bug… it looks like you’ve replayed to a twt but the hash is wrong. I can see the hash here from Jenny, but it doesn’t look like it corresponds to any{twt,thing}. if you check it out on any yarn instance it won’t look like a replay.
My hypothesis about that thing breaking my twts is that it might have something to do with the parenthesis surrounding the root twt hash in the replay twt-A when I replay to it with fork-twt-B; I imagine elisp interpreting those as a s-expression thus breaking the generation precess of hash (#twt-A) before prepending it to for-twt-B … but then I’m too ignorant to figure out how to test my theory (heck I couldn’t even recalculate the hashes myself correctly in bash xD). I’ll keep trying tho.
@andros@twtxt.andros.dev yes, that usually happens when twts get edited and we just made a gentlemen agreement to avoid edits as much as possible (at least for the time being). But the thing is, That is not what’s happening with my broken twts’ hashes. Since I’ve bee mostly replaying to my own twts as a test and I know for sure that I haven’t edited any. (I usually fork-replay instead of edit a twt when needed)
@aelaraji@aelaraji.com Can you give me examples of hashes that you have detected wrong between Emacs client and twtxt.net?
Perhaps there is some character, some space, that is creating the discrepancy.
@prologic@twtxt.net Agreed! But clients can hallucinate and generate wrong hashes aka Lies 🤣 Also, If you chheck your own twt on twtxt.net, it looks like a root twt instead of a replay.
the hash in that test replay should have been s243lua instead 🤷
@prologic@twtxt.net Are you sure? xD … it was supposed to be a replay to another twt, but the twt hash is wrong (I think).
@andros@twtxt.andros.dev is it me or twtxt-el generates a wrong twt hash when I use the [ ↳ Reply to twt ] button?
分佈式基石算法 一致性 hash
什麼是 一致性 hash 算法—————首先摘抄一段維基百科的定義 一致哈希 是一種特殊的哈希算法。在使用一致哈希算法後,哈希表槽位數(大小)的改變平均只需要對𝐾/𝑛 個關鍵字重新映射,其中 𝐾 是關鍵字的數量,𝑛是槽位數量。然而在傳統的哈希表中,添加或刪除一個槽位的幾乎需要對所有關鍵字進行重新映射。 — wikipedia分佈式系統中, 一致性 hash 無處不在,C ⌘ Read more
@prologic@twtxt.net Of course you don’t notice it when yarnd only shows at most the last n messages of a feed. As an example, check out mckinley’s message from 2023-01-09T22:42:37Z. It has “[Scheduled][Scheduled][Scheduled]“… in it. This text in square brackets is repeated numerous times. If you search his feed for closing square bracket followed by an opening square bracket (][) you will find a bunch more of these. It goes without question he never typed that in his feed. My client saves each twt hash I’ve explicitly marked read. A few days ago, I got plenty of apparently years old, yet suddenly unread messages. Each and every single one of them containing this repeated bracketed text thing. The only conclusion is that something messed up the feed again.
Definitely something going on with replies. This one was replying to the wrong twt and even when I got clever and pasted the right hash it didn’t work.
I’m realizing that my performance bottleneck is @prologic@twtxt.net ! It is actually calculating the hash to make the replicas, and specifically users with very long feeds 😂 . I’m seriously thinking about enabling replies via configuration.
Well, that’s another bug: The search https://twtxt.net/search?q=%22LOOOOL%2C+great+programming+tutorial+music%22 yields the wrong hash. It should have been poyndha instead.
@lyse@lyse.isobeef.org Danke! Ja, es gibt noch unzählige Stellschrauben an dem Ding. Deine Anmerkungen werde ich einarbeiten. Eine mobile Ansicht wär auch noch schön. Derzeit sitzt es auf dem Smartphone doch noch recht stramm.
@Unterhaltungen: Die von gestern zu verschlüsselten Nachrichten war ausschlaggebend für die Umsetzung. In “Timeline” und “Yarn” haben mich die Lösungsansätze bisher nicht überzeugt. Aber wir können ja alle etwas von einander lernen.
another one would be to allow changing public keys over time (as it may be a good practice [0]). A syntax like the following could help to know what public key you used to encrypt the message, and which private key the client should use to decrypt it:
!<nick url> <encrypted_message> <public_key_hash_7_chars>
Also I’d remove support for storing the message as hex, only allowing base64 (more compact, aiming for a minimalistic spec, etc.)
Or using the same twt hash method, but only for the URL, to generate the nick, if it doesn’t exist, like so, @5vxo4ia@twtxt.net
@doesnmppsflt@doesnm.p.psf.lt Not sure which bug you’re referring to. 🤔 (Did I forget?)
Those long IDs like (#113797927355322708) are simply part of that feed. Looks like the author just dumps ActivityPub IDs into twtxt. I think this used to work in the past, but the corresponding spec (https://twtxt.dev/exts/hash-tag.html) has been deprecated and jenny doesn’t support – actually, jenny never supported that.
jenny can only group threads by exactly one criterium (because it writes a Message-ID into the mail file) and that’s the regular twt hash. So, anything else, like people doing “#CoolTopic”, isn’t possible.
I’m still making progress with the Emacs client. I’m proud to say that the code that is responsible for reading the feeds is almost finished, including: Twt Hash Extension, Twt Subject Extension, Multiline Extension and Metadata Extension. I’m fine-tuning some tests and will soon do the first buffer that displays the twts.
Added TwtHash hashes to every message on my personal Twtxt HTML renderer. Code is not yet ready for prime-time. Need to work out some kinks still.
rehrar releases Stack Wallet v2.1.9, Stack Duo v1.2.4
rehrar1 has released Stack Wallet2 version 2.1.93 and Stack Duo version 1.2.44 with various fixes and improvements.
Stack Wallet:
* Show Monero/Wownero tx private key option
* Fix Frost error
* Stack Wallet Backup fixes
Stack Duo:
* Added Monero churn options
* Paynym temporarily disabled while kinks worked out [..]
The release notes, binaries, and the SHA256 hashes can … ⌘ Read more
一文搞懂如何在 Go 包中支持 Hash-Based Bisect 調試
bisect 是一個英文動詞,意爲 “二分” 或“分成兩部分”。在數學和計算機科學中,通常指將一個區間或一個集合分成兩個相等的部分。對於程序員來說,最熟悉的 bisect 應用莫過於下面兩個:算法中的二分查找 (binary search) 二分查找是一個經典且高效的查找算法,任何一本介紹數據結構或計算機算法的書都會包含對二分查找的系統說明。所謂二分查找就是通過不斷將搜索區間一分爲二來找到目 ⌘ Read more
Go 中祕而不宣的數據結構 maphash:性能之王
哈希算法 (也稱散列算法) 是一種將任意長度的輸入數據轉換成固定長度輸出的數學算法。Hash 算法的應用場景Hash 算法常常用在以下的場景中:數據完整性驗證 可以爲文件生成唯一的哈希值,用於檢測文件是否被篡改 常見場景如軟件下載時的 MD5 校驗、區塊鏈中的數據驗證 密碼存儲安全 不直接存儲用戶密碼原文, 而是存儲密碼的哈希值 即使數據庫被攻破, 黑客也無法直接獲取 ⌘ Read more
@bender@twtxt.net So turns out something is setting my HashingURI to the value {{ .Profile.URI }} and that is making my hashes wrong so it cannot delete or edit twts.
@bender@twtxt.net So turns out something is setting my HashingURI to the value {{ .Profile.URI }} and that is making my hashes wrong so it cannot delete or edit twts.
Righto, @eapl.me@eapl.me, ta for the writeup. Here we go. :-)
Metadata on individual twts are too much for me. I do like the simplicity of the current spec. But I understand where you’re coming from.
Numbering twts in a feed is basically the attempt of generating message IDs. It’s an interesting idea, but I reckon it is not even needed. I’d simply use location based addressing (feed URL + ‘#’ + timestamp) instead of content addressing. If one really wanted to, one could hash the feed URL and timestamp, but the raw form would actually improve disoverability and would not even require a richer client. But the majority of twtxt users in the last poll wanted to stick with content addressing.
yarnd actually sends If-Modified-Since request headers. Not only can I observe heaps of 304 responses for yarnds in my access log, but in Cache.FetchFeeds(…) we can actually see If-Modified-Since being deployed when the feed has been retrieved with a Last-Modified response header before: https://git.mills.io/yarnsocial/yarn/src/commit/98eee5124ae425deb825fb5f8788a0773ec5bdd0/internal/cache.go#L1278
Turns out etags with If-None-Match are only supported when yarnd serves avatars (https://git.mills.io/yarnsocial/yarn/src/commit/98eee5124ae425deb825fb5f8788a0773ec5bdd0/internal/handlers.go#L158) and media uploads (https://git.mills.io/yarnsocial/yarn/src/commit/98eee5124ae425deb825fb5f8788a0773ec5bdd0/internal/media_handlers.go#L71). However, it ignores possible etags when fetching feeds.
I don’t understand how the discovery URLs should work to replace the User-Agent header in HTTP(S) requests. Do you mind to elaborate?
Different protocols are basically just a client thing.
I reckon it’s best to just avoid mixing several languages in one feed in the first place. Personally, I find it okay to occasionally write messages in other languages, but if that happens on a more regularly basis, I’d definitely create a different feed for other languages.
Isn’t the emoji thing “just” a client feature? So, feed do not even have to state any emojis. As a user I’d configure my client to use a certain symbol for feed ABC. Currently, I can do a similar thing in tt where I assign colors to feeds. On the other hand, what if a user wants to control what symbol should be displayed, similar to the feed’s nick? Hmm. But still, my terminal font doesn’t even render most of emojis. So, Unicode boxes everywhere. This makes me think it should actually be a only client feature.
Monerujo adds support for Exolix exchange in v4.1.2
m2049r1 has released Monerujo2 version 4.1.23 with support for crypto exchange platform Exolix 4:
Upgrading to Monero Core to v0.18.3.4
Add support for Exolix exchange
Minor bugfixes
Some refactoring
Before usage it is recommended to back up your seed and verify that you have downloaded the correct file using the SHA256 hash listed on Github3.
If you need help, consult the … ⌘ Read more
@movq@www.uninformativ.de i’m sorry if I sound too contrarian. I’m not a fan of using an obscure hash as well. The problem is that of future and backward compatibility. If we change to sha256 or another we don’t just need to support sha256. But need to now support both sha256 AND blake2b. Or we devide the community. Users of some clients will still use the old algorithm and get left behind.
Really we should all think hard about how changes will break things and if those breakages are acceptable.
@movq@www.uninformativ.de i’m sorry if I sound too contrarian. I’m not a fan of using an obscure hash as well. The problem is that of future and backward compatibility. If we change to sha256 or another we don’t just need to support sha256. But need to now support both sha256 AND blake2b. Or we devide the community. Users of some clients will still use the old algorithm and get left behind.
Really we should all think hard about how changes will break things and if those breakages are acceptable.
I share I did write up an algorithm for it at some point I think it is lost in a git comment someplace. I’ll put together a pseudo/go code this week.
Super simple:
Making a reply:
- If yarn has one use that. (Maybe do collision check?)
- Make hash of twt raw no truncation.
- Check local cache for shortest without collision
- in SQL:
select len(subject) where head_full_hash like subject || '%'
- in SQL:
Threading:
- Get full hash of head twt
- Search for twts
- in SQL:
head_full_hash like subject || '%' and created_on > head_timestamp
- in SQL:
The assumption being replies will be for the most recent head. If replying to an older one it will use a longer hash.
I share I did write up an algorithm for it at some point I think it is lost in a git comment someplace. I’ll put together a pseudo/go code this week.
Super simple:
Making a reply:
- If yarn has one use that. (Maybe do collision check?)
- Make hash of twt raw no truncation.
- Check local cache for shortest without collision
- in SQL:
select len(subject) where head_full_hash like subject || '%'
- in SQL:
Threading:
- Get full hash of head twt
- Search for twts
- in SQL:
head_full_hash like subject || '%' and created_on > head_timestamp
- in SQL:
The assumption being replies will be for the most recent head. If replying to an older one it will use a longer hash.
If we stuck with Blake2b for Twt Hash(es); what do we think we need to reasonably go to in bit length/size?
=> https://gist.mills.io/prologic/194993e7db04498fa0e8d00a528f7be6
e.g: (turns out @xuu@txt.sour.is is right about Blak2b being easy/simple too!):
$ printf "%s\t%s\t%s" "https://example.com/twtxt.txt" "2024-09-29T13:30:00Z" "Hello World!" | b2sum -l 32 -t | awk '{ print $1 }'
7b8b79dd
These collisions aren’t important unless someone tries to fork. So.. for the vast majority its not a big deal. Using the grow hash algorithm could inform the client to add another char when they fork.
These collisions aren’t important unless someone tries to fork. So.. for the vast majority its not a big deal. Using the grow hash algorithm could inform the client to add another char when they fork.
Thanks @david@collantes.us, good to know, but we need to agree on what character we use, otherwise the hashes will not be the same:)
Ok, i know how to command working (not sure), but seems it only grab from cache. Maybe make fetch from twtxt.net if hash not found?
@prologic@twtxt.net Regarding the new way of generating twt-hashes, to me it makes more sense to use tabs as separator instead of spaces, since the you can just copy/past a line directly from a twtxt-file that already go a tab between timestamp and message. But tabs might be hard to “type” when you are in a terminal, since it will activate autocomplete…🤔
Another thing, it seems that you sugget we only use the domain in the hash-creation and not the full path to the twtxt.txt
$ echo -e "https://example.com 2024-09-29T13:30:00Z Hello World!" | sha256sum - | awk '{ print $1 }' | base64 | head -c 12
@doesnm@doesnm.p.psf.lt twt probably isn’t the best client I’m afraid. It doesn’t really cache twts by their key (hash) to display threads properly. Jenny however does 👌
twet display twts in raw format with some formatting (sadly no newlines). And for reply messages i just seen (#hash). But which text hidden on hash? currenly im open twtxt.net/twt/hash to see this
👋 Thanks for joining us on our Sept monthly Yarn.social meetup today y’all 🙇♂️ We had @david@collantes.us @sorenpeter@darch.dk @doesnm@doesnm.p.psf.lt @falsifian@www.falsifian.org and @xuu@txt.sour.is 💪 Nice turn out! (not all at once of course, as we normally run this over 4 hours as we span many time zones!)
Things we talked about:
- Decentralised vs. Distributed
- Use of SHA256 for Twt Hash(es)
- We solved Edits! 🥳
- UUID(s) probably won’t work! (susceptible to sppofing)
- Helped @sorenpeter@darch.dk write some PHP to process/parse
User-Agentand service his feed via a custom PHP script 😅
- @falsifian@www.falsifian.org introduced himself 👌
- Talked about Merkle Trees 🌳
Did I miss anything? 🤔
More thoughts about changes to twtxt (as if we haven’t had enough thoughts):
- There are lots of great ideas here! Is there a benefit to putting them all into one document? Seems to me this could more easily be a bunch of separate efforts that can progress at their own pace:
1a. Better and longer hashes.
1b. New possibly-controversial ideas like edit: and delete: and location-based references as an alternative to hashes.
1c. Best practices, e.g. Content-Type: text/plain; charset=utf-8
1d. Stuff already described at dev.twtxt.net that doesn’t need any changes.
We won’t know what will and won’t work until we try them. So I’m inclined to think of this as a bunch of draft ideas. Maybe later when we’ve seen it play out it could make sense to define a group of recommended twtxt extensions and give them a name.
Another reason for 1 (above) is: I like the current situation where all you need to get started is these two short and simple documents:
https://twtxt.readthedocs.io/en/latest/user/twtxtfile.html
https://twtxt.readthedocs.io/en/latest/user/discoverability.html
and everything else is an extension for anyone interested. (Deprecating non-UTC times seems reasonable to me, though.) Having a big long “twtxt v2” document seems less inviting to people looking for something simple. (@prologic@twtxt.net you mentioned an anonymous comment “you’ve ruined twtxt” and while I don’t completely agree with that commenter’s sentiment, I would feel like twtxt had lost something if it moved away from having a super-simple core.)All that being said, these are just my opinions, and I’m not doing the work of writing software or drafting proposals. Maybe I will at some point, but until then, if you’re actually implementing things, you’re in charge of what you decide to make, and I’m grateful for the work.
@prologic@twtxt.net Done. Also, I went ahead and made two changes: changed hexadecimal to base64 for hashes (wasn’t sure if anyone objected), and changed “MUST follow the chain” to “SHOULD follow the chain.
We:
- Drop
# url=from the spec.
- We don’t adopt
# uuid =– Something @anth@a.9srv.net also mentioned (see below)
We instead use the @nick@domain to identify your feed in the first place and use that as the identify when calculating Twt hashes <id> + <timestamp> + <content>. Now in an ideal world I also agree, use WebFinger for this and expect that for the most part you’ll be doing a WebFinger lookup of @user@domain to fetch someone’s feed in the first place.
The only problem with WebFinger is should this be mandated or a recommendation?
@prologic@twtxt.net a wise plan! Who knows, ideas change, and often plans do not hash, right? Mature, mature! :-)
Good writeup, @anth@a.9srv.net! I agree to most of your points.
3.2 Timestamps: I feel no need to mandate UTC. Timezones are fine with me. But I could also live with this new restriction. I fail to see, though, how this change would make things any easier compared to the original format.
3.4 Multi-Line Twts: What exactly do you think are bad things with multi-lines?
4.1 Hash Generation: I do like the idea with with a new uuid metadata field! Any thoughts on two feeds selecting the same UUID for whatever reason? Well, the same could happen today with url.
5.1 Reply to last & 5.2 More work to backtrack: I do not understand anything you’re saying. Can you rephrase that?
8.1 Metadata should be collected up front: I generally agree, but if the uuid metadata field were a feed URL and no real UUID, there should be probably an exception to change the feed URL mid-file after relocation.
yarnd does for example) and equally a 5x increase in on-disk storage as well. This is based on the Twt Hash going from a 13 bytes (content-addressing) to 63 bytes (on average for location-based addressing). There is roughly a ~20-150% increase in the size of individual feeds as well that needs to be taken into consideration (on the average case).
(#2024-09-24T12:44:35Z) There is a increase in space/memory for sure. But calculating the hashes also takes up CPU. I’m not good with that kind of math, but it’s a tradeoff either way.
There is also a ~5x increase cost in memory utilization for any implementations or implementors that use or wish to use in-memory storage (yarnd does for example) and equally a 5x increase in on-disk storage as well. This is based on the Twt Hash going from a 13 bytes (content-addressing) to 63 bytes (on average for location-based addressing). There is roughly a ~20-150% increase in the size of individual feeds as well that needs to be taken into consideration (on the average case).
So really your argument is just that switching to a location-based addressing “just makes sense”. Why? Without concrete pros/cons of each approach this isn’t really a strong argument I’m afraid. In fact I probably need to just sit down and detail the properties of both approaches and the pros/cons of both.
I also don’t really buy the argument of simplicity either personally, because I don’t technically see it much more difficult to take a echo -e "<url>\t<timestamp>\t<content>" | sha256sum | base64 as the Twt Subject or concatenating the <url> <timestamp> – The “effort” is the same. If we’re going to argue that SHA256 or cryptographic hashes are “too complicated” then I’m not really sure how to support that argument.
@falsifian@www.falsifian.org I believe the preserve means to include the original subject hash in the start of the twt such as (#somehash)
@falsifian@www.falsifian.org I believe the preserve means to include the original subject hash in the start of the twt such as (#somehash)
Sorry, you’re right, I should have used numbers!
I’m don’t understand what “preserve the original hash” could mean other than “make sure there’s still a twt in the feed with that hash”. Maybe the text could be clarified somehow.
I’m also not sure what you mean by markdown already being part of it. Of course people can already use Markdown, just like presumably nothing stopped people from using (twt subjects) before they were formally described. But it’s not universal; e.g. as a jenny user I just see the plain text.
@prologic@twtxt.net Thanks for writing that up!
I hope it can remain a living document (or sequence of draft revisions) for a good long time while we figure out how this stuff works in practice.
I am not sure how I feel about all this being done at once, vs. letting conventions arise.
For example, even today I could reply to twt abc1234 with “(#abc1234) Edit: …” and I think all you humans would understand it as an edit to (#abc1234). Maybe eventually it would become a common enough convention that clients would start to support it explicitly.
Similarly we could just start using 11-digit hashes. We should iron out whether it’s sha256 or whatever but there’s no need get all the other stuff right at the same time.
I have similar thoughts about how some users could try out location-based replies in a backward-compatible way (append the replyto: stuff after the legacy (#hash) style).
However I recognize that I’m not the one implementing this stuff, and it’s less work to just have everything determined up front.
Misc comments (I haven’t read the whole thing):
Did you mean to make hashes hexadecimal? You lose 11 bits that way compared to base32. I’d suggest gaining 11 bits with base64 instead.
“Clients MUST preserve the original hash” — do you mean they MUST preserve the original twt?
Thanks for phrasing the bit about deletions so neutrally.
I don’t like the MUST in “Clients MUST follow the chain of reply-to references…”. If someone writes a client as a 40-line shell script that requires the user to piece together the threading themselves, IMO we shouldn’t declare the client non-conforming just because they didn’t get to all the bells and whistles.
Similarly I don’t like the MUST for user agents. For one thing, you might want to fetch a feed without revealing your identty. Also, it raises the bar for a minimal implementation (I’m again thinking again of the 40-line shell script).
For “who follows” lists: why must the long, random tokens be only valid for a limited time? Do you have a scenario in mind where they could leak?
Why can’t feeds be served over HTTP/1.0? Again, thinking about simple software. I recently tried implementing HTTP/1.1 and it wasn’t too bad, but 1.0 would have been slightly simpler.
Why get into the nitty-gritty about caching headers? This seems like generic advice for HTTP servers and clients.
I’m a little sad about other protocols being not recommended.
I don’t know how I feel about including markdown. I don’t mind too much that yarn users emit twts full of markdown, but I’m more of a plain text kind of person. Also it adds to the length. I wonder if putting a separate document would make more sense; that would also help with the length.
Had to build a list of all feeds (that I follow) and all twts in them and there are two collisions already:
$ ./stats
Saw 58263 hashes
7fqcxaa
https://twtxt.net/user/justamoment/twtxt.txt
https://twtxt.net/user/prologic/twtxt.txt
ntnakqa
https://twtxt.net/user/prologic/twtxt.txt
https://twtxt.net/user/thecanine/twtxt.txt
Namely:
$ jenny -D https://twtxt.net/user/justamoment/twtxt.txt | grep 7fqcxaa
[7fqcxaa] [2022-12-28 04:53:30+00:00] [(#pmuqoca) @prologic@twtxt.net I checked the GitHub discussion, it became a request to join forces.
Do you plan on having them join?
Also for the name, how about:
- “progit” or “prologit” (prologic official hard fork)
- “git-stance” (git instance)
- “GitTree” (Gitea inspired, maybe to related)
- “Gitomata” (git automata)
- “Git.Source”
- “Forgor” (forgit is taken so I forgor) 🤣
- “SweetGit” (as salty chat)
- “Pepper Git” (other ingredients) 😉
- “GitHeart” (core of git with a GitHub sounding name)
- “GitTaka” (With music in mind)
Ok, enough fun… Hope this helps sprout some ideas from others if nothing is to your taste.]
$ jenny -D https://twtxt.net/user/prologic/twtxt.txt/5 | grep 7fqcxaa
[7fqcxaa] [2022-02-25 21:14:45+00:00] [(#bqq6fxq) It’s handled by blue Monday]
And:
$ jenny -D https://twtxt.net/user/thecanine/twtxt.txt | grep ntnakqa
[ntnakqa] [2022-01-23 10:24:09+00:00] [(#2wh7r4q) <a href="https://txt.sour.is/external?uri=https://twtxt.net/user/prologic/twtxt.txt">@prologic<em>@twtxt.net</em></a> I know, I was just hoping it might have also gotten fixed by that change, by some kind of backend miracles. 😂]
$ jenny -D https://twtxt.net/user/prologic/twtxt.txt/1 | grep ntnakqa
[ntnakqa] [2024-02-27 05:51:50+00:00] [(#otuupfq) <a href="https://txt.sour.is/external?uri=https://twtxt.net/user/shreyan/twtxt.txt">@shreyan<em>@twtxt.net</em></a> Ahh 👌]
I wrote some code to try out non-hash reply subjects formatted as (replyto ), while keeping the ability to use the existing hash style.
I don’t think we need to decide all at once. If clients add support for a new method then people can use it if they like. The downside of course is that this costs developer time, so I decided to invest a few hours of my own time into a proof of concept.
With apologies to @movq@www.uninformativ.de for corrupting jenny’s beautiful code. I don’t write this expecting you to incorporate the patch, because it does complicate things and might not be a direction you want to go in. But if you like any part of this approach feel free to use bits of it; I release the patch under jenny’s current LICENCE.
Supporting both kinds of reply in jenny was complicated because each email can only have one Message-Id, and because it’s possible the target twt will not be seen until after the twt referencing it. The following patch uses an sqlite database to keep track of known (url, timestamp) pairs, as well as a separate table of (url, timestamp) pairs that haven’t been seen yet but are wanted. When one of those “wanted” twts is finally seen, the mail file gets rewritten to include the appropriate In-Reply-To header.
Patch based on jenny commit 73a5ea81.
https://www.falsifian.org/a/oDtr/patch0.txt
Not implemented:
- Composing twts using the (replyto …) format.
- Probably other important things I’m forgetting.
@prologic@twtxt.net I read it. I understand it. Hopefully a solution can be agreed upon that solves the editing issue, whilst maintaining the cryptographic hash.
@prologic@twtxt.net I know the role of the current hash is to allow referencing (replies and, thus, threads), and it also represents a “unique” way to verify a twtxt hasn’t been tampered with. Is that second so important, if we are trying to allow edits? I know if feels good to be able to verify, but in reality, how often one does it?
@prologic@twtxt.net how about hashing a combination of nick/timestamp, or url/timestamp only, and not the twtxt content? On edit those will not change, so no breaking of threads. I know, I know, just adding noise here. :-P
@quark@ferengi.one It does not. That is why I’m advocating for not using hashes for treads, but a simpler link-back scheme.
the stem matching is the same as how GIT does its branch hashes. i think you can stem it down to 2 or 3 sha bytes.
if a client sees someone in a yarn using a byte longer hash it can lengthen to match since it can assume that maybe the other client has a collision that it doesnt know about.
the stem matching is the same as how GIT does its branch hashes. i think you can stem it down to 2 or 3 sha bytes.
if a client sees someone in a yarn using a byte longer hash it can lengthen to match since it can assume that maybe the other client has a collision that it doesnt know about.
@prologic@twtxt.net the basic idea was to stem the hash.. so you have a hash abcdef0123456789... any sub string of that hash after the first 6 will match. so abcdef, abcdef012, abcdef0123456 all match the same. on the case of a collision i think we decided on matching the newest since we archive off older threads anyway. the third rule was about growing the minimum hash size after some threshold of collisions were detected.
@prologic@twtxt.net the basic idea was to stem the hash.. so you have a hash abcdef0123456789... any sub string of that hash after the first 6 will match. so abcdef, abcdef012, abcdef0123456 all match the same. on the case of a collision i think we decided on matching the newest since we archive off older threads anyway. the third rule was about growing the minimum hash size after some threshold of collisions were detected.
@prologic@twtxt.net Wikipedia claims sha1 is vulnerable to a “chosen-prefix attack”, which I gather means I can write any two twts I like, and then cause them to have the exact same sha1 hash by appending something. I guess a twt ending in random junk might look suspcious, but perhaps the junk could be worked into an image URL like
. If that’s not possible now maybe it will be later.git only uses sha1 because they’re stuck with it: migrating is very hard. There was an effort to move git to sha256 but I don’t know its status. I think there is progress being made with Game Of Trees, a git clone that uses the same on-disk format.
I can’t imagine any benefit to using sha1, except that maybe some very old software might support sha1 but not sha256.
@movq@www.uninformativ.de going a little sideways on this, “*If twtxt/Yarn was to grow bigger, then this would become a concern again. But even Mastodon allows editing, so how much of a problem can it really be? 😅*”, wouldn’t it preparing for a potential (even if very, very, veeeeery remote) growth be a good thing? Mastodon signs all messages, keeps a history of edits, and it doesn’t break threads. It isn’t a problem there.😉 It is here.
I think keeping hashes is a must. If anything for that “feels good” feeling.
@movq@www.uninformativ.de Agreed that hashes have a benefit. I came up with a similar example where when I twted about an 11-character hash collision. Perhaps hashes could be made optional somehow. Like, you could use the “replyto” idea and then additionally put a hash somewhere if you want to lock in which version of the twt you are replying to.
There is nothing wrong with how we currently run a diff to see what has been removed. if i build a merkle tree off all the twt hashes in a feed i can use that to verify a twt should be in a feed or not. and gossip that to my peers.
There is nothing wrong with how we currently run a diff to see what has been removed. if i build a merkle tree off all the twt hashes in a feed i can use that to verify a twt should be in a feed or not. and gossip that to my peers.
isn’t the benefit of blake2b that it is a more efficient algo than sha1 and has the same or similar entropy to sha3? i thought we had partially solved this with some type of expanding hash size? additionally we could increase bit density by using base36 or base64/url-safe…
isn’t the benefit of blake2b that it is a more efficient algo than sha1 and has the same or similar entropy to sha3? i thought we had partially solved this with some type of expanding hash size? additionally we could increase bit density by using base36 or base64/url-safe…
I’m not advocating in either direction, btw. I haven’t made up my mind yet. 😅 Just braindumping here.
The (replyto:…) proposal is definitely more in the spirit of twtxt, I’d say. It’s much simpler, anyone can use it even with the simplest tools, no need for any client code. That is certainly a great property, if you ask me, and it’s things like that that brought me to twtxt in the first place.
I’d also say that in our tiny little community, message integrity simply doesn’t matter. Signed feeds don’t matter. I signed my feed for a while using GPG, someone else did the same, but in the end, nobody cares. The community is so tiny, there’s enough “implicit trust” or whatever you want to call it.
If twtxt/Yarn was to grow bigger, then this would become a concern again. But even Mastodon allows editing, so how much of a problem can it really be? 😅
I do have to “admit”, though, that hashes feel better. It feels good to know that we can clearly identify a certain twt. It feels more correct and stable.
Hm.
I suspect that the (replyto:…) proposal would work just as well in practice.
Regarding jenny development: There have been enough changes in the last few weeks, imo. I want to let things settle for a while (potential bugfixes aside) and then I’m going to cut a new release.
And I guess the release after that is going to include all the threading/hashing stuff – if we can decide on one of the proposals. 😂
There’s a simple reason all the current hashes end in a or q: the hash is 256 bits, the base32 encoding chops that into groups of 5 bits, and 256 isn’t divisible by 5. The last character of the base32 encoding just has that left-over single bit (256 mod 5 = 1).
So I agree with #3 below, but do you have a source for #1, #2 or #4? I would expect any lack of variability in any part of a hash function’s output would make it more vulnerable to attacks, so designers of hash functions would want to make the whole output vary as much as possible.
Other than the divisible-by-5 thing, my current intuition is it doesn’t matter what part you take.
Hash Structure: Hashes are typically designed so that their outputs have specific statistical properties. The first few characters often have more entropy or variability, meaning they are less likely to have patterns. The last characters may not maintain this randomness, especially if the encoding method has a tendency to produce less varied endings.
Collision Resistance: When using hashes, the goal is to minimize the risk of collisions (different inputs producing the same output). By using the first few characters, you leverage the full distribution of the hash. The last characters may not distribute in the same way, potentially increasing the likelihood of collisions.
Encoding Characteristics: Base32 encoding has a specific structure and padding that might influence the last characters more than the first. If the data being hashed is similar, the last characters may be more similar across different hashes.
Use Cases: In many applications (like generating unique identifiers), the beginning of the hash is often the most informative and varied. Relying on the end might reduce the uniqueness of generated identifiers, especially if a prefix has a specific context or meaning.
An alternate idea for supporting (properly) Twt Edits is to denoate as such and extend the meaning of a Twt Subject (which would need to be called something better?); For example, let’s say I produced the following Twt:
2024-09-18T23:08:00+10:00 Hllo World
And my feed’s URI is https://example.com/twtxt.txt. The hash for this Twt is therefore 229d24612a2:
$ echo -n "https://example.com/twtxt.txt\n2024-09-18T23:08:00+10:00\nHllo World" | sha1sum | head -c 11
229d24612a2
You wish to correct your mistake, so you make an amendment to that Twt like so:
2024-09-18T23:10:43+10:00 (edit:#229d24612a2) Hello World
Which would then have a new Twt hash value of 026d77e03fa:
$ echo -n "https://example.com/twtxt.txt\n2024-09-18T23:10:43+10:00\nHello World" | sha1sum | head -c 11
026d77e03fa
Clients would then take this edit:#229d24612a2 to mean, this Twt is an edit of 229d24612a2 and should be replaced in the client’s cache, or indicated as such to the user that this is the intended content.
@quark@ferengi.one My money is on a SHA1SUM hash encoding to keep things much simpler:
$ echo -n "https://twtxt.net/user/prologic/twtxt.txt\n2020-07-18T12:39:52Z\nHello World! 😊" | sha1sum | head -c 11
87fd9b0ae4e
Taking the last n characters of a base32 encoded hash instead of the first n can be problematic for several reasons:
Hash Structure: Hashes are typically designed so that their outputs have specific statistical properties. The first few characters often have more entropy or variability, meaning they are less likely to have patterns. The last characters may not maintain this randomness, especially if the encoding method has a tendency to produce less varied endings.
Collision Resistance: When using hashes, the goal is to minimize the risk of collisions (different inputs producing the same output). By using the first few characters, you leverage the full distribution of the hash. The last characters may not distribute in the same way, potentially increasing the likelihood of collisions.
Encoding Characteristics: Base32 encoding has a specific structure and padding that might influence the last characters more than the first. If the data being hashed is similar, the last characters may be more similar across different hashes.
Use Cases: In many applications (like generating unique identifiers), the beginning of the hash is often the most informative and varied. Relying on the end might reduce the uniqueness of generated identifiers, especially if a prefix has a specific context or meaning.
In summary, using the first n characters generally preserves the intended randomness and collision resistance of the hash, making it a safer choice in most cases.
@prologic@twtxt.net I saw those, yes. I tried using yarnc, and it would work for a simple twtxt. Now, for a more convoluted one it truly becomes a nightmare using that tool for the job. I know there are talks about changing this hash, so this might be a moot point right now, but it would be nice to have a tool that:
- Would calculate the hash of a twtxt in a file.
- Would calculate all hashes on a
twtxt.txt(local and remote).
Again, something lovely to have after any looming changes occur.
Could someone knowledgable reply with the steps a grandpa will take to calculate the hash of a twtxt from the CLI, using out-of-the-box tools? I swear I read about it somewhere, but can’t find it.
@falsifian@www.falsifian.org based on Twt Subject Extension, your subject is invalid. You can have custom subjects, that is, not a valid hash, but you simply can’t put anything, and expect it to be treated as a TwtSubject, me thinks.
yarnd just doesn’t render the subject. Fair enough. It’s (replyto http://darch.dk/twtxt.txt 2024-09-15T12:50:17Z), and if you don’t want to go on a hunt, the twt hash is weadxga: https://twtxt.net/twt/weadxga
@sorenpeter@darch.dk I like this idea. Just for fun, I’m using a variant in this twt. (Also because I’m curious how it non-hash subjects appear in jenny and yarn.)
URLs can contain commas so I suggest a different character to separate the url from the date. Is this twt I’ve used space (also after “replyto”, for symmetry).
I think this solves:
- Changing feed identities: although @mckinley@twtxt.net points out URLs can change, I think this syntax should be okay as long as the feed at that URL can be fetched, and as long as the current canonical URL for the feed lists this one as an alternate.
- editing, if you don’t care about message integrity
- finding the root of a thread, if you’re not following the author
An optional hash could be added if message integrity is desired. (E.g. if you don’t trust the feed author not to make a misleading edit.) Other recent suggestions about how to deal with edits and hashes might be applicable then.
People publishing multiple twts per second should include sub-second precision in their timestamps. As you suggested, the timestamp could just be copied verbatim.
@prologic@twtxt.net I have some ideas:
- Add smartypants rendering, just like Yarn has.
- Add the ability to create individual twtxts, each named after their hash.
- Fix the formatting of the help. :-P
(#hash;#originalHash) would also work.
Maybe I’m being a bit too purist/minimalistic here. As I said before (in one of the 1372739 posts on this topic – or maybe I didn’t even send that twt, I don’t remember 😅), I never really liked hashes to begin with. They aren’t super hard to implement but they are kind of against the beauty of the original twtxt – because you need special client support for them. It’s not something that you could write manually in your
twtxt.txtfile. With @sorenpeter@darch.dk’s proposal, though, that would be possible.
Tangentially related, I was a bit disappointed to learn that the twt subject extension is now never used except with hashes. Manually-written subjects sounded so beautifully ad-hoc and organic as a way to disambiguate replies. Maybe I’ll try it some time just for fun.
() @falsifian@www.falsifian.org You mean the idea of being able to inline
# url =changes in your feed?
Yes, that one. But @lyse@lyse.isobeef.org pointed out suffers a compatibility issue, since currently the first listed url is used for hashing, not the last. Unless your feed is in reverse chronological order. Heh, I guess another metadata field could indicate which version to use.
Or maybe url changes could somehow be combined with the archive feeds extension? Could the url metadata field be local to each archive file, so that to switch to a new url all you need to do is archive everything you’ve got and start a new file at the new url?
I don’t think it’s that likely my feed url will change.
@prologic@twtxt.net Yeah, that thing with (#hash;#originalHash) would also work.
Maybe I’m being a bit too purist/minimalistic here. As I said before (in one of the 1372739 posts on this topic – or maybe I didn’t even send that twt, I don’t remember 😅), I never really liked hashes to begin with. They aren’t super hard to implement but they are kind of against the beauty of the original twtxt – because you need special client support for them. It’s not something that you could write manually in your twtxt.txt file. With @sorenpeter@darch.dk’s proposal, though, that would be possible.
I don’t know … maybe it’s just me. 🥴
I’m also being a bit selfish, to be honest: Implementing (#hash;#originalHash) in jenny for editing your own feed would not be a no-brainer. (Editing is already kind of unsupported, actually.) It wouldn’t be a problem to implement it for fetching other people’s feeds, though.
@movq@www.uninformativ.de I figured it will be something like this, yet, you were able to reply just fine, and I wasn’t. Looking at your twtxt.txt I see this line:
2024-09-16T17:37:14+00:00 (#o6dsrga) @<prologic https://twtxt.net/user/prologic/twtxt.txt>
@<quark https://ferengi.one/twtxt.txt> This is what I get. 🤔
Which is using the right hash. Mine, on the other hand, when I replied to the original, old style message (Message-Id: <o6dsrga>), looks like this:
2024-09-16T16:42:27+00:00 (#o) @<prologic https://twtxt.net/user/prologic/twtxt.txt> this was your first twtxt. Cool! :-P
What did you do to make yours work? I simply went to the oldest @prologic@twtxt.net’s entry on my Maildir, and replied to it (jenny set the reply-to hash to #o, even though the Message-Id is o6dsrga). Since jenny can’t fetch archived twtxts, how could I go to re-fetch everything? And, most importantly, would re-fetching fix the Message-Id:?
Hmm… I replied to this message:
From: prologic <prologic>
Subject: Hello World! 😊
Date: Sat, 18 Jul 2020 08:39:52 -0400
Message-Id: <o6dsrga>
X-twtxt-feed-url: https://twtxt.net/user/prologic/twtxt.txt
Hello World! 😊
And see how the hash shows… Is it because that hash isn’t longer used?
(replyto:http://darch.dk/twtxt.txt,2024-09-15T12:06:27Z)
I think I like this a lot. 🤔
The problem with using hashes always was that they’re “one-directional”: You can construct a hash from URL + timestamp + twt, but you cannot do the inverse. When I see “, I have no idea what that could possibly refer to.
But of course something like (replyto:http://darch.dk/twtxt.txt,2024-09-15T12:06:27Z) has all the information you need. This could simplify twt/feed discovery quite a bit, couldn’t it? 🤔 That thing that I just implemented – jenny asking some Yarn pod for some twt hash – would not be necessary anymore. Clients could easily and automatically fetch complete threads instead of requiring the user to follow all relevant feeds.
Only using the timestamp to identify a twt also solves the edit problem.
It even is better for non-Yarn clients, because you now don’t have to read, understand, and implement a “twt hash specification” before you can reply to someone.
The only problem, really, is that (replyto:http://darch.dk/twtxt.txt,2024-09-15T12:06:27Z) is so long. Clients would have to try harder to hide this. 😅
@aelaraji@aelaraji.com no, it is not just you. Do fetch the parent with jenny, and you will see there are two messages with different hash. Soren did something funky, for sure.
Alright, I saw enough broken threads lately to be motivated enough to extend the --fetch-context thingy: It can now ask Yarn pods for twt hashes.
https://www.uninformativ.de/git/jenny/commit/eefd3fa09083e2206ed0d71887d2ef2884684a71.html
This is only done as a last resort if there’s no other way to find the missing twt. Like, when there’s a twt that begins with just a hash and no user mention, there’s no way for jenny to know on which feed that twt can be found, so it’ll ask some Yarn pod in that case.
@prologic@twtxt.net Brute force. I just hashed a bunch of versions of both tweets until I found a collision.
I mostly just wanted an excuse to write the program. I don’t know how I feel about actually using super-long hashes; could make the twts annoying to read if you prefer to view them untransformed.
@prologic@twtxt.net earlier you suggested extending hashes to 11 characters, but here’s an argument that they should be even longer than that.
Imagine I found this twt one day at https://example.com/twtxt.txt :
2024-09-14T22:00Z Useful backup command: rsync -a “$HOME” /mnt/backup
and I responded with “(#5dgoirqemeq) Thanks for the tip!”. Then I’ve endorsed the twt, but it could latter get changed to
2024-09-14T22:00Z Useful backup command: rm -rf /some_important_directory
which also has an 11-character base32 hash of 5dgoirqemeq. (I’m using the existing hashing method with https://example.com/twtxt.txt as the feed url, but I’m taking 11 characters instead of 7 from the end of the base32 encoding.)
That’s what I meant by “spoofing” in an earlier twt.
I don’t know if preventing this sort of attack should be a goal, but if it is, the number of bits in the hash should be at least two times log2(number of attempts we want to defend against), where the “two times” is because of the birthday paradox.
Side note: current hashes always end with “a” or “q”, which is a bit wasteful. Maybe we should take the first N characters of the base32 encoding instead of the last N.
Code I used for the above example: https://fossil.falsifian.org/misc/file?name=src/twt_collision/find_collision.c
I only needed to compute 43394987 hashes to find it.