Don't forget about the Yarn.social online meetup coming up this Saturday! See #jjbnvgq for details! Hope to see y'all there!
Don't forget to take the Twtxt v2 poll if you haven't done so already (sorry about the confusing question at the end!)
@doesnm@doesnm.p.psf.lt I don't even advocate for reading Twtxt in its raw form in the first place, which is why I'm in favor of continuing to use content-based addressing (hashes) and incrementally improving what we already have. IMO the only reasons to read a Twtxt file in its raw form are if you're a) a developer, b) a new feed author, or c) debugging a client issue.
And finally, the legibility of feeds when viewed in their raw form worsens as you go from a Twt Subject of (#abcdefg12345) to something like (https://twtxt.net/user/prologic/twtxt.txt 2024-09-22T07:51:16Z).
There is also a ~5x increase in memory utilization for any implementation that uses or wishes to use in-memory storage (yarnd does, for example) and equally a ~5x increase in on-disk storage. This is based on the Twt Hash going from 13 bytes (content-based addressing) to 63 bytes (on average, for location-based addressing). There is also roughly a ~20-150% increase in the size of individual feeds that needs to be taken into consideration (in the average case).
With location-based addressing there is no way to verify that a single Twt actually came from that feed without fetching the feed and checking. That means always having to rely on fetching the feed and storing a copy of the feeds you fetch (which is okay), but you're forced to do this. You cannot really share individual Twts anymore like yarnd does (as peering) because there is no "integrity" to a Twt identified by its <url> <timestamp>. The identity is meaningless and is only valid as long as you can trust the location and that the location hasn't changed its content since.
Location-based addressing is vulnerable to the content changing. If the content changes, the "location" is no longer valid. This is a problem if you build systems that rely on this.
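To make the integrity point concrete, here is a minimal sketch of checking a Twt against a content-based id. The helper names `twt_id` and `verify_twt` are hypothetical, the example URL is made up, and `sha256sum` truncated to 11 characters is only a stand-in for whatever hash the spec actually mandates.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a short content-based id from url, timestamp and content.
# sha256sum (first 11 hex chars) is a stand-in, not the real Twt Hash algorithm.
twt_id() {
  printf '%s\t%s\t%s' "$1" "$2" "$3" | sha256sum | cut -c1-11
}

# Succeeds only if the claimed id ($1) matches the twt we were handed ($2 url, $3 timestamp, $4 content).
verify_twt() {
  [ "$(twt_id "$2" "$3" "$4")" = "$1" ]
}

url="https://example.com/twtxt.txt"   # made-up feed URL
ts="2024-09-22T07:51:16Z"
id=$(twt_id "$url" "$ts" "Hello world")

verify_twt "$id" "$url" "$ts" "Hello world" && echo "original: verified without fetching the feed"
verify_twt "$id" "$url" "$ts" "Hello there" || echo "edited: id no longer matches"
```

A `<url> <timestamp>` identifier would be identical for both versions of the content, so a check like this is only possible when the id is derived from the content itself.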
So really your argument is just that switching to location-based addressing "just makes sense". Why? Without concrete pros/cons of each approach this isn't really a strong argument, I'm afraid. In fact I probably need to just sit down and detail the properties of both approaches and the pros/cons of each.
I also don't really buy the argument of simplicity either, personally, because I don't see it as technically much more difficult to take echo -e "<url>\t<timestamp>\t<content>" | sha256sum | base64 as the Twt Subject than to concatenate the <url> <timestamp>. The "effort" is the same. If we're going to argue that SHA256 or cryptographic hashes are "too complicated" then I'm not really sure how to support that argument.
@sorenpeter@darch.dk Points 2 & 3 aren't really applicable to the discussion of the threading model, I'm afraid. WebMentions is completely orthogonal to the discussion. Further, no-one that uses Twtxt really uses WebMentions; whilst yarnd supports WebMentions, they're very rarely used in practice (if ever). In fact I should just drop the feature entirely.
The use of WebSub OTOH is far more useful and is used by every single yarnd pod everywhere (not that there are that many around these days) to subscribe to feed updates in near real-time without having to poll constantly.
@doesnm@doesnm.p.psf.lt Welcome back
@eapl.me@eapl.me Sad to see you go, disappointed in your choice of X, but I respect your decision and choice. I will never cave in myself, even if it means my "circle of friends" remains small. I guess we call 'em internet friends, right?
@lyse@lyse.isobeef.org How violent is the thunderstorm?
@aelaraji@aelaraji.com LOL
A new thing LLM(s) can't do well: write patches 🤣
@lyse@lyse.isobeef.org Yeah, I think it's one of the reasons why yarnd's cache became so complicated really. I mean, it's a bunch of maps and lists that are recalculated every ~5m. I don't know of any better way to do this right now, but maybe one day I'll figure out a better way to represent the same information that is displayed today that works reasonably well.
My point is, this is not a small trade-off to make for the sake of simplicity
@movq@www.uninformativ.de Maybe I misspoke. It's a factor of 5 in the size of the keyspace required. The impact is significantly less for on-disk storage of raw feeds and such, around ~1-1.5x depending on how many replies there are, I suppose.
I wasn't very clear; my apologies. If we update the current hash truncation length from 7 to 11, but then still decide to go down this location-based twt identity and threading model anyway, then yes, we're talking about twt subjects having a ~5x increase in size on average: going from 14 characters (11 for the hash, 2 for the parens, 1 for the #) to ~63 bytes (the average I've worked out for the length of URL + timestamp) + 3 bytes of overhead for the parens and space.
Don't forget about the Yarn.social meetup coming up this Saturday! See #jjbnvgq for details! Hope to see some/all of y'all there
@lyse@lyse.isobeef.org And your query to construct a tree? Can you share the full query (screenshot looks scary 🤣)? On another note, SQL and relational databases aren't really that conducive to tree-like structures, are they? 🤣
In fact it depends on how many Twts form part of a thread. If you take a much larger sample of my own feed, for example, it starts to approximate a ~1.5x increase in size:
$ ./compare.sh https://twtxt.net/user/prologic/twtxt.txt 500
Original file size: 126842 bytes
Modified file size: 317029 bytes
Percentage increase in file size: 149.94%
...
In fact @falsifian@www.falsifian.org you had quite a lot of good feedback, do you mind collecting it in a task list on the doc somewhere so I can get to 'em?
Can someone make the edit?
@movq@www.uninformativ.de This was just a representative sample. The real concrete cost here is a ~5x increase in memory consumption for yarnd and/or a ~5x increase in disk storage.
@lyse@lyse.isobeef.org Mind sharing your schema?
@lyse@lyse.isobeef.org Not sure, I'll check
@lyse@lyse.isobeef.org My proposal is three steps:
- increase the hash length from 7 to 11
Then:
- Add support for changing your feed's location without breaking threads
Then much later:
- Add formal support for edits
@lyse@lyse.isobeef.org No I don't either, just sayin'
@movq@www.uninformativ.de That's what I want to know 🤣
So just to be clear, it's not as bad as the OP in this thread; that is just a worst-case scenario. With some additional analysis I did today, it's closer to around ~5x the memory requirements of my pod, which would roughly go from ~22MB to ~120MB or so, probably a bit more in practice. But this is still a significant increase in memory. The on-disk requirements would also increase by around ~5x on average, going from ~12GB to about ~60GB at current archive size.
Just out of curiosity, I inspected the yarns database (the search engine/crawler) to find the average length of a Twtxt URI:
$ inspect-db yarns.db | jq -r '.Value.URL' | awk '{ total += length; count++ } END { if (count > 0) print total / count }'
40.3387
Given that an RFC3339 UTC timestamp has a length of 20 characters with seconds precision, we're talking about a Twt Subject taking up ~63 characters/bytes on average.
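As a rough sketch of that arithmetic, the two subject forms can be measured directly. The function names are hypothetical; the URL, timestamp and 11-character hash are the examples used elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Length of a location-based subject "(<url> <timestamp>)":
# url + timestamp + 3 bytes of overhead (two parentheses and one space).
location_subject_len() {
  local url=$1 ts=$2
  echo $(( ${#url} + ${#ts} + 3 ))
}

# Length of a content-based subject "(#<hash>)":
# hash + 3 bytes of overhead (two parentheses and the '#').
hash_subject_len() {
  local hash=$1
  echo $(( ${#hash} + 3 ))
}

location_subject_len "https://twtxt.net/user/prologic/twtxt.txt" "2024-09-22T07:51:16Z"  # prints 64
hash_subject_len "b2c938f9838"                                                            # prints 14
```

That particular feed URL is 41 characters, close to the measured average of ~40, so it lands right around the ~63 byte figure.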
Comparing a few feeds:
- @xuu@txt.sour.is would see an increase of ~20%
- @falsifian@www.falsifian.org would see an increase of ~8%
- @bender@twtxt.net would see an increase of ~20%
- @lyse@lyse.isobeef.org would see an increase of ~15%
- @aelaraji@aelaraji.com would see an increase of ~13%
- @sorenpeter@darch.dk would see an increase of ~8%
- @movq@www.uninformativ.de would see an increase of ~9%
Just from a scalability standpoint alone, I'm not seeing a switch to location-based Twt ids to support threading as a good idea here. This is what I meant when I said to @david@collantes.us in a recent call that we open up a new can of worms (or new set of problems) by drastically changing the approach, rather than incrementally improving the existing approach we have today (_which has served us well for the past 4 years already_).
Reminder to take the Twtxt (anonymous) Poll: http://polljunkie.com/poll/xdgjib/twtxt-v2
Apologies, I can't edit the poll once it's live, so the suggestion on feedback for supporting Markdown will have to be discussed at another time.
@xuu@txt.sour.is 🤣🤣🤣
So I whipped up a quick shell script to demonstrate what I mean by the increase in feed size on average as well as the expected increase in storage and retrieval requirements.
$ ./compare.sh
Original file size: 28145 bytes
Modified file size: 70672 bytes
Percentage increase in file size: 151.10%
...
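The percentage figure reported by the script can be reproduced from the two file sizes alone. This is a small sketch with a hypothetical `pct_increase` helper; it is not the actual compare.sh, whose contents are not shown here.

```shell
#!/usr/bin/env bash
# Percentage increase from an original size ($1) to a modified size ($2), two decimals.
pct_increase() {
  awk -v o="$1" -v m="$2" 'BEGIN { printf "%.2f\n", (m - o) / o * 100 }'
}

pct_increase 28145 70672     # prints 151.10
pct_increase 126842 317029   # prints 149.94
```

Note that an increase of ~150% means the modified file is roughly 2.5 times the size of the original.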
Thank goodness we relaxed that limit and I've stopped being so Puritan about it, but my overall point is we would be significantly increasing the human size as well as the machine size of the identity of threads as well as twts.
With the original specification's recommendation of a 140-character Twt length, that only leaves you with about 78 characters' worth of anything remotely useful to say in response.
Let's say the overhead is always three bytes: two parentheses and a space.
So for example, if we use @movq@www.uninformativ.de's feed as an example thread ID here, his feed with a particular timestamp, we're already looking at a subject length of 59 bytes, +/- a couple of bytes to denote the subject in the Twt itself.
One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short so that they were still readable if you viewed them manually by hand.
With the proposal to switch to location-based addressing using a pointer to a feed and a timestamp in that feed, you're looking at subjects up to roughly 2025 characters long, because the HTTP, HTML and even URI specifications do not specify a maximum length for URI(s) AFAIK, only recommendations.
@bender@twtxt.net I can't see myself personally increasing the infrastructure and costs to run this pod to support this as we potentially switch over and as things continue to grow in scale. You would never get your infinite search and infinite timeline features that you've always wanted, for example, and I would have to drastically reduce what is visible or even searchable at any given point in time to much less than what it is today.
Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB, an increase of roughly 7.06 GB!
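The worst-case numbers follow from simple multiplication. A quick sketch, counting key bytes only (no index or value overhead) and using decimal MB/GB; `key_bytes` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Total key storage in bytes for a number of entries ($1) and a key size in bytes ($2).
key_bytes() {
  echo $(( $1 * $2 ))
}

entries=3500000
small=$(key_bytes "$entries" 7)      # 7-byte hash keys
large=$(key_bytes "$entries" 2025)   # worst-case 2025-character keys

awk -v s="$small" -v l="$large" 'BEGIN {
  printf "hash keys:     %.1f MB\n", s / 1e6
  printf "location keys: %.2f GB\n", l / 1e9
  printf "increase:      %.2f GB\n", (l - s) / 1e9
}'
```

This prints 24.5 MB, 7.09 GB and 7.06 GB respectively, matching the figures quoted.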
@falsifian@www.falsifian.org No worries! Feel free to contribute to the doc directly if you wish
@falsifian@www.falsifian.org Hmmm, not sure, sorry
@xuu@txt.sour.is Good to know! So as long as we remain decentralized and non-commercial (I assume non-profit works too?) we're good?
@lyse@lyse.isobeef.org Nice!
@doesnm@doesnm.p.psf.lt Hello!
@lyse@lyse.isobeef.org Yes, let's make UTF-8 mandatory
@lyse@lyse.isobeef.org Agreed
Let's try this poll for Twtxt v2 (no account required)
@lyse@lyse.isobeef.org I'm a bit indifferent whether it's at the beginning or end tbh.
This is still a draft! Feel free to edit it
@movq@www.uninformativ.de That's what I was afraid of 🤣
@movq@www.uninformativ.de Makes sense. I think it's fair to implement any spec changes incrementally, for sure
And yea, since yarnd has a store it's a bit easier to support edit / delete actions
So in a location-based system, how exactly do I reply to one of these two Twts from @Yarns@search.twtxt.net?
2024-09-07T12:55:56Z 🥳 NEW FEED: @<twtxt http://edsu.github.io/twtxt/twtxt.txt>
2024-09-07T12:55:56Z 🥳 NEW FEED: @<kdy https://twtxt.kdy.ch/twtxt.txt>
@lyse@lyse.isobeef.org Yup, this is why you started seeing if you could improve the "trust" of peers, right?
@movq@www.uninformativ.de Yeah, I think what I'm proposing here is a more pragmatic approach to improvements that will last much longer than our first iteration (~4 years and going strong, but running into minor issues with edits/identity and some collisions). This scope of changes is much easier to implement for yarnd, and I suspect jenny too, and as indicated here it's quite easy to have a reference implementation written in Bash with standard UNIX tools.
It's even sorta/somewhat compatible with our existing feeds (kind of) 🤣 I'm a bit too stupid to figure out how to write enough correct Bash to make threads display inline nicely in an indented/tree-like fashion, but oh well
Example:
$ ./twtxt-v2.sh reply 242561ce02d "Cool!"
Posted twt with hash: b2c938f9838
...
$ ./twtxt-v2.sh timeline
...
prologic@twtxt.net [2024-09-22T07:26:37Z] <242561ce02d> Okay folks, I've spent all day on this today, and I _think_ its in "good enough"™ shape to share:
**Twtxt v2**:
- Specification: https://docs.mills.io/uJXuisaYTRWYDrl8A2jADg?both
- implementation: https://gist.mills.io/prologic/afdec15443da4d7aa898f383f171ec1b

prologic@localhost [2024-09-22T07:51:16Z] <b2c938f9838> Cool! (reply-to:242561ce02d)
Okay folks, I've spent all day on this today, and I think it's in "good enough"™ shape to share:
Twtxt v2:
- Specification: https://docs.mills.io/uJXuisaYTRWYDrl8A2jADg?both
- implementation: https://gist.mills.io/prologic/afdec15443da4d7aa898f383f171ec1b
@aelaraji@aelaraji.com No, that is absolutely correct. Without cryptographic identities and signatures there is no way to verify authenticity. And I don't think we necessarily need to. What I was showing and proving was that I didn't write that spoofed Twt in the first place, which was only provable at the time of @lyse@lyse.isobeef.org's short-lived attack 🤣 He essentially forked yarnd, hosted it temporarily (I think locally) and used it to poison the caches of a few production pods.
Thankfully the gossip protocol used by yarnd as part of its "peering" between pods isn't fully trusted; twts are not archived into permanent storage, for example. So the moment my pod re-fetched my own feed, the spoofed Twt was obliterated.
Eventual consistency 🤣
LOL. Not only have I tried to write up a full Twtxt v2 specification, I've also written a Bash shell script that implements the new spec
@movq@www.uninformativ.de Haha, nice one! And yes I'm also aware of some collisions too!
@aelaraji@aelaraji.com I like ntfy. I've wanted to replace my use of the Pushover service with this for a while now
@bender@twtxt.net
Reminder folks of the upcoming Yarn.social monthly online meetup:
I hope to see @david@collantes.us @movq@www.uninformativ.de @lyse@lyse.isobeef.org @xuu@txt.sour.is @sorenpeter@darch.dk and hopefully others too @aelaraji@aelaraji.com @falsifian@www.falsifian.org and anyone else that sees this! We're hopefully going to primarily discuss the future of Twtxt and the last few weeks of discussions 🤣
- Event: Yarn.social Online Meetup
- When: 28th September 2024 at 12:00pm UTC (midday)
- Where: Mills Meet : Yarn.social
- Cadence: 4th Saturday of every Month
Agenda:
- Let's talk about the upcoming changes to the Twtxt spec(s)
- See #xgghhnq
My Position on the last few weeks of Twtxt spec discussions:
- We increase the Hash length from 7 to 11.
- We formalise the Update Commands extension.
- We amend the Twt Hash and Metadata extension to state:
Feed authors that wish to change the location of their feed (once Twts have been published) must append a new # url = comment to their feed to indicate the new location and thus change the "Hashing URI" used for Twts from that point onward.
This has implications for the "order" of a feed, and we should do one of two things, either:
- Mandate that feeds are append-only.
- Or amend the Metadata spec with a new field that denotes the order of the feed so clients can make sense of "inline" comments in the feed. This would also imply that the default order is (of course) append-only. Suggestion:
# direction = [append|prepend]
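A rough sketch of how a client might apply the inline `# url =` rule while reading a feed top to bottom. It assumes an append-only feed and the exact `# url = <uri>` spelling; `hashing_uris` is a hypothetical helper, and the example feed is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a feed on stdin and print "<hashing-uri><TAB><twt line>"
# for every twt, switching the hashing URI whenever a "# url = ..." comment appears.
hashing_uris() {
  awk '
    /^# url = / { url = $4; next }       # new feed location applies from here on
    /^#/ || /^[[:space:]]*$/ { next }    # skip other comments and blank lines
    { print url "\t" $0 }
  '
}

# Made-up feed that moved from old.example.com to new.example.com.
printf '# url = https://old.example.com/twtxt.txt\n2024-09-01T10:00:00Z\tFirst twt\n# url = https://new.example.com/twtxt.txt\n2024-09-22T07:51:16Z\tSecond twt\n' \
  | hashing_uris
```

The first twt keeps hashing against the old URI and the second uses the new one, which is why the order of lines in the feed has to be well defined.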
I finally decided to do a few experiments with yarnd to see how many things would break and how many assumptions there are around the idea of "Content Addressing"; here's where I'm at so far:
Basically I'm at a point where spending time on this is going to provide very little value. There are assumptions made in the lextwt parser, assumptions made in yarnd, assumptions in the way storage is done and in the way threading works and things are looked up. There are far-reaching implications to changing the way Twts are identified to be "location addressed", and I'm quite worried about the amount of effort that would be required to change yarnd here.
@mckinley@twtxt.net Yes I have, however I'm not counting that because even using the "Cloud" is not labor-free.
@aelaraji@aelaraji.com We figured it out 🤣 @lyse@lyse.isobeef.org's little hack was good but only temporary 🤣
(replyto:…). It's easier to implement and the whole edits-breaking-threads thing resolves itself in a "natural" way without the need to add stuff to the protocol.
@sorenpeter@darch.dk Kind of agree with dealing with this kind of social nonsense, which we've all done in the past 🤣
@movq@www.uninformativ.de I think your scenario doesn't account for clients and their storage. The scenario described only really affects clients that come along later. Even then they would also be able to re-fetch missing Twts from peers or even a search engine to fill in the gaps.
@movq@www.uninformativ.de That's kind of a problem though, right?
@david@collantes.us 🤣🤣🤣
I just realized the other big property you lose is:
What if someone completely changes the content of the root of the thread?
Does the Subject reference the feed and timestamp only or the intent too?
@bender@twtxt.net Yeah I'll be honest here; I'm not going to be very happy if we go down this "location addressing" route:
- Twt Subjects lose their meaning.
- Twt Subjects cannot be verified without looking up the feed.
- Which may or may not exist anymore or may change.
- Two persons cannot reply to a Twt independently of each other anymore.
and probably some other properties we'd stand to lose that I'm forgetting about…
@movq@www.uninformativ.de One of the biggest reasons I don't like the (replyto:…) proposal (location addressing vs. content addressing) is that you just introduce a similar problem down the track, albeit a rarer one: if a feed changes its location, your thread's "identifiers" are no longer valid, unless those feed authors maintain strict URL redirects, etc. This potentially has the long-term effect of being rather fragile, as opposed to what we have now, where an Edit really just causes a natural fork in the thread, which is how "forking" works in the first place.
I realise this is a bit pret here, and it probably doesn't matter a whole lot at our size. But I'm trying to think way ahead, to a point where Twtxt as a "thing" can continue to work and function decades from now, even with the extensions we've built. We've already proven, for example, that Twts and threads from ~4 years ago still work and are easily looked up haha
I just read the primary spec; I'm strongly in support of it and it's pretty rock solid for me
Do you recall what it was? I blame my maintenance window
@bender@twtxt.net Hmm what you replied to appears to be non-existent: https://twtxt.net/twt/pqst4ea
@movq@www.uninformativ.de I just saw these come through! Thank you very much, I'll definitely have a read tomorrow!
@bender@twtxt.net Which reply was that?
@bender@twtxt.net Bahahahahaha 🤣
Ever wondered what it would cost to self-host vs. use the cloud? Well, I often doubt myself every time I look at hardware prices, and I know I have to do some hardware refresh soon™ for the Mills DC (something I don't have a regular plan or budget for), so here's a rough ballpark:
The Mills DC has cost me around ~$15k to build and maintain over the last ~10 years or so, roughly speaking. I've never actually taken a Bill of Materials or anything, but I could if anyone is interested in more specifics.
The equivalent resources, if run in the "Cloud", would cost around:
- ~$1,000 for virtual machines
- ~$12000 for storage
So around ~$2,000/month to run.
Keep this in mind anytime anyone ever tries to con you into believing "Cloud is cheaper". It's not.
@aelaraji@aelaraji.com This is one of the reasons why yarnd has a couple of settings with some sensible/sane defaults:
I could already imagine a couple of extreme cases where, somewhere in this peaceful world, one's exercise of freedom of speech could get them in real trouble (if not danger) if found out; it wouldn't necessarily have to involve something to do with law or legal authorities. So, if someone asks, maybe fearing for… let's just say "their well-being", would it hurt if a pod just purged their content if it's serving it publicly (maybe relaying the info to other pods) and called it a day? It doesn't have to be about some law/convention somewhere… 🤷 I know! Too extreme, but I've seen news of people who'd gone to jail or got their lives ruined for as little as a silly joke. And it doesn't even have to be about any of this.
There are two settings:
$ ./yarnd --help 2>&1 | grep max-cache
--max-cache-fetchers int set maximum numnber of fetchers to use for feed cache updates (default 10)
-I, --max-cache-items int maximum cache items (per feed source) of cached twts in memory (default 150)
-C, --max-cache-ttl duration maximum cache ttl (time-to-live) of cached twts in memory (default 336h0m0s)
So yarnd pods by default are designed to only keep Twts publicly visible on the anonymous Frontpage, the Discover View, your Timeline or the feed's Timeline for up to 2 weeks or a maximum of 150 items, whichever is exceeded first. Any Twts beyond this are considered "old" and drop off the active cache.
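The "whichever is exceeded first" rule can be sketched as a simple check. This is an illustration of the policy using the defaults from the --help output above, not yarnd's actual implementation; `in_active_cache` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Defaults taken from the --help output above.
MAX_ITEMS=150        # --max-cache-items (per feed source)
MAX_TTL_HOURS=336    # --max-cache-ttl (336h = 14 days)

# Hypothetical check: is a twt still in the active cache?
# $1 = 1-based position in the feed, newest first; $2 = age in hours.
in_active_cache() {
  [ "$1" -le "$MAX_ITEMS" ] && [ "$2" -le "$MAX_TTL_HOURS" ]
}

in_active_cache 10 24  && echo "recent twt: cached"
in_active_cache 151 24 || echo "item 151: dropped (over item limit)"
in_active_cache 10 400 || echo "400h old: dropped (over TTL)"
```

A twt has to satisfy both limits to stay visible, so a busy feed ages out by item count and a quiet one by TTL.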
It's a feature that my old man @off_grid_living@twtxt.net was very strongly in support of, as was I back in the day of yarnd's design (nothing particularly to do with Twtxt per se), and that I've stuck by to this day, even though there are some that have different views on this 🤣
@aelaraji@aelaraji.com Thanks for this!
Bahahahaha very clever @lyse@lyse.isobeef.org I look forward to reading your report! 🤣 However…
$ yarnc debug https://twtxt.net/user/prologic/twtxt.txt | grep -E '^pqst4ea' | tee | wc -l
0
I very quickly proved that Twt was never from me 🤣
@yarn_police@twtxt.net Cool cool
@yarn_police@twtxt.net What's going on?
@movq@www.uninformativ.de Yes that's true, they are only integrity checks. But beyond a malicious pod (ignore yarnd's gossiping protocol for now) how does what @lyse@lyse.isobeef.org presented work exactly?
But this is no different to how jenny does things with storing every Twt in a Maildir, I suppose?
This has specifically come up before in the form of "informal complaints" against yarnd because of the way it permanently stores and archives Twts, so even if you changed your mind, or deleted that line from your feed, if my pod or @xuu@txt.sour.is or @abucci@anthony.buc.ci or @eldersnake@we.loveprivacy.club (or any other handful of pods still around?) saw the Twt, it'd be permanently archived.
Yeah I'm curious to find out too, beyond just hearsay. But regardless of whether we should or shouldn't care about this, or should or shouldn't comply, we should IMO. I'd hate to build something that horrendously violates someone's rights in another country.
@movq@www.uninformativ.de Care to explain how this exploit/attack works for me? 🤣
Well that was bloody awful. This PR broke my pod for some strange reason; I can't figure out why or how. The process just kept getting terminated by something, somewhere (no panic). Weird. I've reverted this PR for now @xuu@txt.sour.is
Really though, I only managed to save a few GB, but it's enough for now.
@bender@twtxt.net Haha. Faster? Maybe. But yeah, it's good to have backups! (that work)
I've also put up this PR, Add compatible methods for Index to behave as the Archiver (transition) #1177, that will act as a transition from the old naive archiver to the new bluge-based search/index. I will switch my pod over to this soon to test it before anyone else does.