Hmmmm, I somehow run into an encoding problem where my inserted data end up mangled in the database. But, both SQLite and Go use UTF-8. What’s happening here? :-?
@lyse@lyse.isobeef.org One of them is lying. >:-)
Mangled in what way? How does it look like? 🤔
@movq@www.uninformativ.de Non-ASCII characters were broken. Like U+2028, degrees (°), etc.
Turns out I used a silly library to detect the encoding and transform to UTF-8 if needed. When there is no Content-Type header, like for local files, it looks at the first 1024 bytes. Since it only saw ASCII in that region, the damn thing assumed the data to be in Windows-1252 (which for web pages kinda makes sense):
// TODO: change default depending on user's locale?
return charmap.Windows1252, "windows-1252", false
https://cs.opensource.google/go/x/net/+/master:html/charset/charset.go;l=102
This default is hardcoded and cannot be changed.
Trying to be smart and adding automatic support for other encodings turned out to be a bad move on my end. At least I can reduce my dependency list again. :-)
I now just reject everything that explicitly specifies something different than text/plain
and an optional charset other than utf-8
(ignoring casing). Otherwise I assume it’s in UTF-8 (just like the twtxt file format specification mandates) and hope for the best.
@lyse@lyse.isobeef.org Ouch. 🥴 Well, jenny always decodes as UTF-8 (because the spec says so) and this hasn’t caused any issues – yet.