In-reply-to » I'm experimenting with SQLite and trees. It's going well so far with only my own 439-message-long main feed from a few days ago in the cache. Fetching these 632 rows took 20ms:

@prologic@twtxt.net Yeah, relational databases are definitely not the perfect fit for trees, but I want to give it a shot anyway. :-)

Using EXPLAIN QUERY PLAN, I was able to create two indices to avoid some table scans:

CREATE INDEX parent ON messages (hash, subject);
CREATE INDEX subject_created_at ON messages (subject, created_at);

Also, since strings are sortable, instead of str_col <> '' I now use str_col > '' to allow the use of an index.
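To illustrate that, here is a small sketch against the messages table implied by the indices above (the exact schema and column usage are my assumption, not shown in the original posts):

EXPLAIN QUERY PLAN
SELECT hash
  FROM messages
 WHERE subject > ''                -- range constraint, eligible for the subject_created_at index
 ORDER BY subject, created_at;
-- With "WHERE subject <> ''" instead, the planner could not use the index here, hence the switch.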

But I just noticed that my output seems to be broken at the end for some reason. :-? Hmm.

The read status still gives me a headache. I think I either have to filter in the application or create more metadata structures in the database.

I’m wondering if anyone here has already used particular storage approaches for tree data.

In-reply-to » Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB—an increase of roughly 7.06 GB!

My point is, this is not a small trade-off to make for the sake of simplicity 😅

In-reply-to » Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB—an increase of roughly 7.06 GB!

@movq@www.uninformativ.de Maybe I misspoke. It’s a factor of 5 in the size of the keyspace required. The impact is significantly less for on-disk storage of raw feeds and such, around 1-1.5x depending on how many replies there are, I suppose.

I wasn’t very clear; my apologies. If we update the current hash truncation length from 7 to 11, but then still decide to go down this location-based twt identity and threading model anyway, then yes, we’re talking about twt subjects having a ~5x increase in size on average: going from 14 characters (11 for the hash, 2 for the parens, 1 for the #) to ~63 bytes (the average length of URL + timestamp I’ve worked out) + 3 bytes of overhead for the parens and space.
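That ~5x follows directly from those numbers:

$ echo "scale=2; (63 + 3) / 14" | bc
4.71

So roughly a factor of five for the subject alone.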

In-reply-to » Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB—an increase of roughly 7.06 GB!

@prologic@twtxt.net A factor of 5 is hard to believe, to be honest. Especially disk usage. I know nothing about the internals of yarnd, but still.

If this constitutes a hard “no” to the proposal, then I think we don’t need to discuss it further.

In-reply-to » I'm experimenting with SQLite and trees. It's going well so far with only my own 439-message-long main feed from a few days ago in the cache. Fetching these 632 rows took 20ms:

@lyse@lyse.isobeef.org And your query to construct a tree? Can you share the full query (the screenshot looks scary 🤣)? On another note, SQL and relational databases aren’t really that conducive to tree-like structures, are they? 🤣

In-reply-to » I'm experimenting with SQLite and trees. It's going well so far with only my own 439-message-long main feed from a few days ago in the cache. Fetching these 632 rows took 20ms:

This organigram example got me started: https://www.sqlite.org/lang_with.html#controlling_depth_first_versus_breadth_first_search_of_a_tree_using_order_by

But I feel execution times get worse rather quickly as I add more data. Also, caching helps tremendously: executing it for the first time took over 600ms, but from then on I’m down to 40ms.

I think it’s particularly bad that parents might be missing. Thus, I cannot use an index, because there is no parent to reference. But my database knowledge is fairly limited, so I have to read up on that.
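For reference, a minimal sketch of such a query, modelled on that organigram example and assuming a messages(hash, subject, created_at) table where subject holds the parent twt's hash and is empty for roots (the schema is my guess, not taken from the posts above):

WITH RECURSIVE thread(hash, created_at, level) AS (
    SELECT hash, created_at, 0
      FROM messages
     WHERE subject = ''                 -- roots: no parent
    UNION ALL
    SELECT m.hash, m.created_at, t.level + 1
      FROM messages m
      JOIN thread t ON m.subject = t.hash
     ORDER BY 3 DESC                    -- depth-first, as in the SQLite docs
)
SELECT substr('..........', 1, level * 2) || hash AS tree
  FROM thread;

The missing-parent case mentioned above would need a broader root condition, e.g. also treating twts whose subject doesn't resolve to any known hash as roots.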

In-reply-to » One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them manually by hand.

In fact, it depends on how many Twts form part of a thread. If you take a much larger sample of my own feed, for example, it starts to approximate a ~1.5x increase in size:

$ ./compare.sh https://twtxt.net/user/prologic/twtxt.txt 500
Original file size: 126842 bytes
Modified file size: 317029 bytes
Percentage increase in file size: 149.94%
...
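For what it's worth, that percentage is just the relative size difference, i.e. the modified feed is ~2.5x the size of the original, which is the ~1.5x increase mentioned above:

$ echo "scale=2; (317029 - 126842) * 100 / 126842" | bc
149.94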

In-reply-to » @lyse I'd suggest making the whole content-type thing a SHOULD, to accommodate people just using some hosting service they don't have much control over. (The same situation could make detecting followers hard, but IMO "please email me if you follow me" is still legit twtxt, even if inconvenient.)

Can someone make the edit?

In-reply-to » One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them manually by hand.

@movq@www.uninformativ.de This was just a representative sample. The real concrete cost here is a ~5x increase in memory consumption for yarnd and/or a ~5x increase in disk storage.

In-reply-to » One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them manually by hand.

@prologic@twtxt.net What’s that in absolute numbers? My ~/Mail/twt is currently 26 MB in size. Increase that by 20% and we get 31.2 MB.

I don’t buy the argument with 2025 bytes. This worst case scenario is not relevant in practice.

In-reply-to » So for example, if we were to use @movq's feed as an example thread ID here (his feed plus a particular timestamp), we're already looking at a subject length of 59 bytes +/- a couple of bytes to denote the subject in the Twt itself.

@prologic@twtxt.net Well, mentions are also quite lengthy as they always include the feed URL. I know, that’s not a good argument.

I just got a very, very wild idea that I have not put any brain power into, so it might be totally stupid: Since many replies also mention the original feed, maybe a mention and thread identifier could be combined, something like: @<nick url timestamp>. But then we would also need another style if one does not want to mention the original author.

So, scratch that. But I put it out there anyway. Maybe this inspires someone else to come up with something neat.

In-reply-to » @xuu I think it is more tricky than that.

It’s a different story when you just publish a twtxt file, I think. The question here is: When you publish a twt and don’t like it anymore and want to delete it, do you have the right to force others to delete it? (Not in a technical manner, but by suing them.) What does the GDPR have to say about that? Not a clue. 😂

In-reply-to » @prologic Do you have a link to some past discussion?

@xuu I think it is more tricky than that.

https://commission.europa.eu/law/law-topic/data-protection/reform/rules-business-and-organisations/application-regulation/who-does-data-protection-law-apply_en

“A company or entity …”

Also, as I understand it, “personal or household activity” (as you called it) is rather strict: An example could be you uploading photos to a webspace behind HTTP basic auth and sending that link to a friend. So, yes, a webserver is involved and you process your friend’s data (e.g., when did he access your files), but it’s just between you and him. But if you were to publish these photos publicly on a webserver that anyone can access, then it’s a different story – even though you could say that “this is just my personal hobby, not related to any job or money”.

If you operate a public Yarn pod and if you accept registrations from other users, then I’m pretty sure the GDPR applies. 🤔 You process personal data and you don’t really know these people. It’s not a personal/private thing anymore.

In-reply-to » Reminder to take the Twtxt (anonymous) Poll: http://polljunkie.com/poll/xdgjib/twtxt-v2

@prologic@twtxt.net I find it quite hard to rank the facets. Some go hand in hand or depend on the protocol over which a feed is offered. I feel some are only relevant to specific clients. I’m sure people interpret some of them differently.

I’m curious, is it possible to see each individual poll submission?

In-reply-to » Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB—an increase of roughly 7.06 GB!

So just to be clear, it’s not as bad as the OP in this thread; that was just a worst-case scenario. With some additional analysis I did today, it’s closer to around ~5x the memory requirements of my pod, which would roughly go from ~22MB to ~120MB or so, probably a bit more in practice. But this is still a significant increase in memory. The on-disk requirements would also increase by around ~5x on average, going from ~12GB to about ~60GB at the current archive size.


Just out of curiosity, I inspected the yarns database (the search engine/crawler) to find the average length of a Twtxt URI:

$ inspect-db yarns.db | jq -r '.Value.URL' | awk '{ total += length; count++ } END { if (count > 0) print total / count }'
40.3387

Given that an RFC3339 UTC timestamp has a length of 20 characters with seconds precision, we’re talking about a Twt Subject taking up ~63 characters/bytes on average.
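That ~63 breaks down roughly as follows, assuming a subject of the form (<url> <timestamp>):

$ echo $(( 40 + 1 + 20 + 2 ))   # avg URL + space + timestamp + two parens
63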

In-reply-to » So I whipped up a quick shell script to demonstrate what I mean by the increase in feed size on average as well as the expected increase in storage and retrieval requirements.

Comparing a few feeds:

Just from a scalability standpoint alone, I’m not seeing a switch to location-based Twt ids to support threading as a good idea here. This is what I meant when I said to @david@collantes.us in a recent call that we’d open up a new can of worms (or a new set of problems) by drastically changing the approach, rather than incrementally improving the existing approach we have today (which has served us well for the past 4 years already).

In-reply-to » So in a location-based system, how exactly do I reply to one of these two Twts from @Yarns? 🤔

I demand full 9-digit nanosecond timestamps and the full TZ identifier as documented in the tz 2024b database! I need to know if there was a change in daylight saving time for the locality in question as of the provided date.


So I whipped up a quick shell script to demonstrate what I mean by the increase in feed size on average as well as the expected increase in storage and retrieval requirements.

$ ./compare.sh
Original file size: 28145 bytes
Modified file size: 70672 bytes
Percentage increase in file size: 151.10%
...

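The script itself isn't included here, so purely as an illustration, here is a minimal sketch of what such a comparison might look like. It approximates the location-based overhead by appending "(<feed-url> <timestamp>)" to every twt; that approach, the default feed URL and the assumed line format are my guesses, not the actual compare.sh:

#!/bin/sh
# Hypothetical sketch, not the original compare.sh.
URL="${1:-https://twtxt.net/user/prologic/twtxt.txt}"   # default feed is a guess
N="${2:-300}"

orig=$(mktemp)
mod=$(mktemp)
curl -s "$URL" | grep -v '^#' | tail -n "$N" > "$orig"

# Assume a twt line is "<RFC3339 timestamp><TAB><text>"; reuse the timestamp
# to build an approximate location-based subject for every twt.
awk -F'\t' -v url="$URL" '{ print $0 " (" url " " $1 ")" }' "$orig" > "$mod"

o=$(wc -c < "$orig")
m=$(wc -c < "$mod")
echo "Original file size: $o bytes"
echo "Modified file size: $m bytes"
awk -v o="$o" -v m="$m" 'BEGIN { printf "Percentage increase in file size: %.2f%%\n", (m - o) * 100 / o }'

rm -f "$orig" "$mod"

Appending rather than replacing the existing (#hash) subjects slightly overstates the increase, but it lands in the same ballpark as the numbers above.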

In-reply-to » One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them manually by hand.

Thank goodness we relaxed that limit and I’ve stopped being so puritan about it, but my overall point is that we would be significantly increasing both the human size and the machine size of the identity of threads as well as twts.

In-reply-to » One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them manually by hand.

With the original specification’s recommendation of a 140-character Twt length, that only leaves you with about 78 characters’ worth of anything remotely useful to say in response.

In-reply-to » One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them manually by hand.

Let’s say the overhead is always three bytes: two parentheses and a space.

In-reply-to » One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them manually by hand.

So for example, if we were to use @movq@www.uninformativ.de’s feed as an example thread ID here (his feed plus a particular timestamp), we’re already looking at a subject length of 59 bytes +/- a couple of bytes to denote the subject in the Twt itself.
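As a rough sanity check of that figure (the feed path and the timestamp below are illustrative assumptions, not taken from the original):

$ echo -n 'https://www.uninformativ.de/twtxt.txt 2024-09-17T17:05:01Z' | wc -c
58

Add the two parentheses and you land right around the quoted 59 bytes, give or take a couple.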


One of the reasons we originally wanted to use content-based addressing and short hashes as our threading model was to keep individual Twts short, so that they were still readable if you viewed them manually by hand.

With the proposal to switch to location-based addressing, using a pointer to a feed and a timestamp in that feed, you’re looking at something roughly 2025 characters long, because neither the HTTP nor the HTML nor even the URI specification defines a maximum length for URIs, AFAIK, only recommendations.

In-reply-to » Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB—an increase of roughly 7.06 GB!

@bender@twtxt.net I can’t see myself personally increasing the infrastructure and costs to run this pod to support this if we potentially switch over, especially as things continue to grow in scale. You would never get your infinite search and infinite timeline features that you’ve always wanted, for example, and I would have to drastically reduce what is visible or even searchable at any given point in time to much less than what it is today.


Another interesting side effect of changing from content-based addressing to location-based addressing is that switching from 7-byte keys to 2025-character keys for 3.5 million entries would expand the database size from 24.5 MB to about 7.09 GB—an increase of roughly 7.06 GB!

In-reply-to » @prologic Thanks for writing that up!

@bender@twtxt.net

Sorry, you’re right, I should have used numbers!

I don’t understand what “preserve the original hash” could mean other than “make sure there’s still a twt in the feed with that hash”. Maybe the text could be clarified somehow.

I’m also not sure what you mean by markdown already being part of it. Of course people can already use Markdown, just like presumably nothing stopped people from using (twt subjects) before they were formally described. But it’s not universal; e.g. as a jenny user I just see the plain text.
