@aelaraji@aelaraji.com LOL 😂
Okay, I figured out the cause of the broken output. I had also replaced the first subject = '' for the existing conversation roots with subject > ''. Somehow, my brain must have read subject <> ''. That equality check should not have been touched at all. I just updated the archive for anyone who is interested to follow along: https://lyse.isobeef.org/tmp/tt2cache.tar.bz2 (151.1 KiB)
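To spell it out, the two predicates look almost the same but do very different things. A minimal sketch against the messages table from the indices elsewhere in the thread, not the full cache query:
-- Conversation roots are the twts with an empty subject, so this must stay an equality check:
SELECT hash FROM messages WHERE subject = '';
-- Only the non-empty check may be rewritten, since any non-empty text value sorts after '':
SELECT hash FROM messages WHERE subject > '';  -- same rows as subject <> '' for text values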
LMAO 🤣 … I’ve been scrolling through the mutt(1) man page and found this:
BUGS
None. Mutts have fleas, not bugs.
A new thing LLM(s) can’t do well: write patches 🤣
@lyse@lyse.isobeef.org Yeah, I think it’s one of the reasons why yarnd’s cache became so complicated really. I mean, it’s a bunch of maps and lists that are recalculated every ~5m. I don’t know of any better way to do this right now, but maybe one day I’ll figure out a better way to represent the same information that is displayed today, one that works reasonably well.
@prologic@twtxt.net Yeah, relational databases are definitely not the perfect fit for trees, but I want to give it a shot anyway. :-)
Using EXPLAIN QUERY PLAN, I was able to create two indices to avoid some table scans:
CREATE INDEX parent ON messages (hash, subject);
CREATE INDEX subject_created_at ON messages (subject, created_at);
Also, since strings are sortable, instead of str_col <> '' I now use str_col > '' to allow the use of an index.
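For comparison, a quick sanity check with EXPLAIN QUERY PLAN could look roughly like this; just the shape of it, not my actual query, and the exact plan output depends on the SQLite version, the other indices and the statistics:
EXPLAIN QUERY PLAN SELECT subject, created_at FROM messages WHERE subject <> '';
-- the <> version can at best SCAN (read everything and filter)
EXPLAIN QUERY PLAN SELECT subject, created_at FROM messages WHERE subject > '';
-- the > version can SEARCH the subject_created_at index and skip the empty subjects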
But I just noticed that my output somehow seems to be broken at the end. :-? Hmm.
The read status still gives me a headache. I think I either have to filter in the application or create more metadata structures in the database.
I’m wondering if anyone here has already used particular storage systems for tree data.
My point is, this is not a small trade-off to make for the sake of simplicity 😅
@movq@www.uninformativ.de Maybe I misspoke. It’s a factor of 5 in the size of the keyspace required. The impact is significantly less for on-disk storage of raw feeds and such, around ~1-1.5x depending on how many replies there are, I suppose.
I wasn’t very clear; my apologies. If we update the current hash truncation length from 7 to 11, but then still decide anyway to go down this location-based twt identity and threading model, then yes, we’re talking about twt subjects seeing a ~5x increase in size on average: going from 14 characters (11 for the hash, 2 for the parens, 1 for the #) to ~63 bytes (the average length of URL + timestamp I’ve worked out) + a 3-byte overhead for the parens and space.
@prologic@twtxt.net A factor of 5 is hard to believe, to be honest. Especially disk usage. I know nothing about the internals of yarnd, but still.
If this constitutes a hard “no” to the proposal, then I think we don’t need to discuss it further.
@lyse@lyse.isobeef.org Yes I think so.
@prologic@twtxt.net I see. I reckon it makes sense to combine 1 and 2, because if we change the hashing anyway, we don’t break things twice.
Don’t forget about the Yarn.social meetup coming up this Saturday! See #jjbnvgq for details! Hope to see some/all of y’all there 💪
@lyse@lyse.isobeef.org And your query to construct a tree? Can you share the full query (the screenshot looks scary 🤣)? On another note, SQL and relational databases aren’t really that conducive to tree-like structures, are they? 🤣
This organigram example got me started: https://www.sqlite.org/lang_with.html#controlling_depth_first_versus_breadth_first_search_of_a_tree_using_order_by
But I feel execution times get worse rather quickly the more data I add. Also, caching helps tremendously: executing it for the first time took over 600ms, from then on I’m down to 40ms.
I think it’s particularly bad that parents might be missing. Thus, I cannot use an index, because there is no parent to reference. But my database knowledge is fairly limited, so I have to read up on that.
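For reference, a stripped-down sketch of how that organigram pattern maps onto a messages table; this assumes replies carry their parent’s hash in subject and roots have an empty subject, and it is not my actual query:
WITH RECURSIVE thread(hash, created_at, level) AS (
    -- start at the conversation roots (empty subject)
    SELECT hash, created_at, 0 FROM messages WHERE subject = ''
    UNION ALL
    -- pull in replies whose subject references the parent's hash
    SELECT m.hash, m.created_at, t.level + 1
    FROM messages m JOIN thread t ON m.subject = t.hash
    -- depth-first, siblings by time (see the SQLite docs linked above)
    ORDER BY 3 DESC, 2
)
SELECT hash, created_at, level FROM thread;
Orphaned replies whose parent is not in the table never join the recursion, which is exactly the missing-parents problem.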
In fact it depends on how many Twts there are that form part of a thread. If you take a much larger sample of my own feed, for example, it starts to approach a ~1.5x increase in size:
$ ./compare.sh https://twtxt.net/user/prologic/twtxt.txt 500
Original file size: 126842 bytes
Modified file size: 317029 bytes
Percentage increase in file size: 149.94%
...
In fact @falsifian@www.falsifian.org you had quite a lot of good feedback; do you mind collecting it in a task list on the doc somewhere so I can get to it? 🤔
Can someone make the edit?
There you go, @prologic@twtxt.net, the SQLite database (with a bit more data now) and the sqlitebrowser project file containing the query: https://lyse.isobeef.org/tmp/tt2cache.tar.bz2 (133.9 KiB)
@movq@www.uninformativ.de This was just a representative sample. The real concrete cost here is a ~5x increase in memory consumption for yarnd and/or a ~5x increase in disk storage.
@lyse@lyse.isobeef.org Mind sharing your schema?
@lyse@lyse.isobeef.org Not sure, I’ll check
@lyse@lyse.isobeef.org My proposal has three steps:
- Increase the hash length from 7 to 11
Then:
- Add support for changing your feed’s location without breaking threads
Then much later:
- Add formal support for edits
@lyse@lyse.isobeef.org No, I don’t either, just say’n 😅
@falsifian@www.falsifian.org I agree. It’s an optional header.
@movq@www.uninformativ.de That’s what I want to know 🤣
@prologic@twtxt.net What’s that in absolute numbers? My ~/Mail/twt is currently 26 MB in size. Increase that by 20% and we get 31.2 MB.
I don’t buy the argument with 2025 bytes. This worst case scenario is not relevant in practice.
@movq@www.uninformativ.de Oha! @bender@twtxt.net Happy cooling off!
@prologic@twtxt.net Well, mentions are also quite lengthy as they always include the feed URL. I know, that’s not a good argument.
I just got a very, very wild idea that I have not put any brain power into, so it might be totally stupid: Since many replies also mention the original feed, maybe a mention and thread identifier could be combined, something like: @<nick url timestamp>. But then we would also need another style if one does not want to mention the original author.
So, scratch that. But I put it out there anyway. Maybe this inspires someone else to come up with something neat.
It’s a different story when you just publish a twtxt file, I think. The question here is: When you publish a twt and don’t like it anymore and want to delete it, do you have the right to force others to delete it? (Not in a technical manner, but by suing them.) What does the GDPR have to say about that? Not a clue. 😂
@xuu I think it is more tricky than that.
“A company or entity …”
Also, as I understand it, “personal or household activity” (as you called it) is rather strict: An example could be you uploading photos to a webspace behind HTTP basic auth and sending that link to a friend. So, yes, a webserver is involved and you process your friend’s data (e.g., when did he access your files), but it’s just between you and him. But if you were to publish these photos publicly on a webserver that anyone can access, then it’s a different story – even though you could say that “this is just my personal hobby, not related to any job or money”.
If you operate a public Yarn pod and if you accept registrations from other users, then I’m pretty sure the GDPR applies. 🤔 You process personal data and you don’t really know these people. It’s not a personal/private thing anymore.
@prologic@twtxt.net Not sure how many actually care about a 140 character limit. I don’t. Not at all.
@prologic@twtxt.net I’m wondering what exactly you mean by incremental changes. What are the individual ones? What do you have in mind?
@prologic@twtxt.net I find it quite hard to rank the facets. Some go hand in hand or depend on the protocol that a feed is offered over. I feel some are only relevant to specific clients. I’m sure people interpret some of them differently.
I’m curious, is it possible to see each individual poll submission?
I’m experimenting with SQLite and trees. It’s going well so far with only my own 439-message main feed from a few days ago in the cache. Fetching these 632 rows took 20ms:
Now comes the real tricky part: how do I exclude completely read threads?
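No idea yet if this is sensible, but one option might be to do it in SQL with a per-message read flag; is_read and thread_root here are hypothetical columns, not something that exists in the schema yet:
SELECT thread_root
FROM messages
GROUP BY thread_root
HAVING MIN(is_read) = 0;  -- 0 = unread, so fully-read threads drop out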
So just to be clear, it’s not as bad as the OP in this thread; this is just a worst case scenario. With some additional analysis I did today, it’s closer to ~5x the memory requirements of my pod, which would roughly go from ~22MB to ~120MB or so, probably a bit more in practice. But this is still a significant increase in memory. The on-disk requirements would also increase by around ~5x on average, going from ~12GB to about ~60GB at the current archive size.
Just out of curiosity, I inspected the yarns database (the search engine/crawler) to find the average length of a Twtxt URI:
$ inspect-db yarns.db | jq -r '.Value.URL' | awk '{ total += length; count++ } END { if (count > 0) print total / count }'
40.3387
Given that an RFC3339 UTC timestamp has a length of 20 characters with seconds precision, we’re talking about Twt Subjects taking up ~63 characters/bytes on average (roughly 40 for the URL, 20 for the timestamp, plus 3 for the parens and space).
Comparing a few feeds:
- @xuu would see an increase of ~20%
- @falsifian@www.falsifian.org would see an increase of ~8%
- @bender@twtxt.net would see an increase of ~20%
- @lyse@lyse.isobeef.org would see an increase of ~15%
- @aelaraji@aelaraji.com would see an increase of ~13%
- @sorenpeter@darch.dk would see an increase of ~8%
- @movq@www.uninformativ.de would see an increase of ~9%
Just from a scalability standpoint alone, I’m not seeing a switch to location-based Twt IDs to support threading as a good idea here. This is what I meant when I said to @david@collantes.us in a recent call that we open up a new can of worms (or a new set of problems) by drastically changing the approach, rather than incrementally improving the existing approach we have today (_which has served us well for the past 4 years already_).
Reminder to take the Twtxt (anonymous) Poll: http://polljunkie.com/poll/xdgjib/twtxt-v2
Apologies, I can’t edit the poll once it’s live, so the suggestion on feedback for supporting Markdown will have to be discussed at another time.
@xuu 🤣🤣🤣
I demand full 9-digit nanosecond timestamps and the full TZ identifier as documented in the tz 2024b database! I need to know if there was a change in daylight saving time for the locality in question as of the provided date.
@falsifian@www.falsifian.org I believe “preserve” means to include the original subject hash at the start of the twt, such as (#somehash).
So I whipped up a quick shell script to demonstrate what I mean by the increase in feed size on average as well as the expected increase in storage and retrieval requirements.
$ ./compare.sh
Original file size: 28145 bytes
Modified file size: 70672 bytes
Percentage increase in file size: 151.10%
...
Thank goodness we relaxed that limit and I’ve stopped being so puritan about it, but my overall point is that we would be significantly increasing both the human size and the machine size of the identity of threads as well as twts.
With the original specification’s recommendation of a 140-character Twt length, that only leaves you with about 78 characters’ worth of anything remotely useful to say in response.
Let’s say the overhead is always three bytes: two parentheses and a space.
So for example, if we were to use @movq@www.uninformativ.de’s feed as an example thread ID here (his feed URL plus a particular timestamp), we’re already looking at a subject length of 59 bytes, +/- a couple of bytes to denote the subject in the Twt itself.
One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them by hand.
With the proposal to switch to location-based addressing, using a pointer to a feed and a timestamp in that feed, you’re looking at subjects up to roughly 2025 characters long in the worst case, because the HTTP, HTML and even URI specifications do not specify a maximum length for URI(s) AFAIK, only recommendations.
@bender@twtxt.net I can’t see myself personally increasing the infrastructure and costs to run this pod to support this if we potentially switch over and things continue to grow in scale. You would never get the infinite search and infinite timeline features that you’ve always wanted, for example, and I would have to drastically reduce what is visible or even searchable at any given point in time to much less than what it is today.
Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB—an increase of roughly 7.06 GB!
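For anyone who wants to check the arithmetic, it’s just key size times number of entries (decimal MB/GB), e.g. in a sqlite3 shell:
SELECT 3500000 *    7 / 1e6 AS mb_with_7_byte_keys,    -- 24.5
       3500000 * 2025 / 1e9 AS gb_with_2025_char_keys; -- ~7.09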