I played around with parsers. This time I experimented with parser combinators for twt message text tokenization. Basically, extract mentions, subjects, URLs, media and regular text. It’s kinda nice, although my solution is not completely elegant, I have to say. Especially my communication protocol between different steps for intermediate results is really ugly. Not sure about performance, I reckon a hand-written state machine parser would be quite a bit faster. I need to write a second parser and then benchmark them.

lexer.go and newparser.go resemble the parser combinators: https://git.isobeef.org/lyse/tt2/-/commit/4d481acad0213771fe5804917576388f51c340c0 It’s far from finished yet.

The first attempt in parser.go doesn’t work as my backtracking is not accounted for, I noticed only later, that I have to do that. With twt message texts there is no real error in parsing. Just regular text as a “fallback”. So it works a bit differently than parsing a real language. No error reporting required, except maybe for debugging. My goal was to port my Python code as closely as possible. But then the runes in the string gave me a bit of a headache, so I thought I just build myself a nice reader abstraction. When I noticed the missing backtracking, I then decided to give parser combinators a try instead of improving on my look ahead reader. It only later occurred to me, that I could have just used a rune slice instead of a string. With that, porting the Python code should have been straightforward.

Yeah, all this doesn’t probably make sense, unless you look at the code. And even then, you have to learn the ropes a bit. Sorry for the noise. :-)

⤋ Read More