Mighty Parser

Submitted by Arancaytar on Mon, 11/20/2006 - 08:47

Mighty Parser and Print Parser are two new features used on the backend of this site. As you probably know, the PPP archives consist entirely of flat html files in ordered directories. The pages are now being parsed and the results saved in the database, though, so we're moving away from the files.

PPP3 is easy to process. I saved it entirely with a program I called Magic Flute, and the resulting html files are uniform in format.
The problem is that most of the pages of PPP1 and PPP2 are pretty old and were saved by people. Different people, with different browsers, with more or (more frequently, alas) less of a clue about how to make sure the saved pages could be easily indexed.

IE did it again. More than a hundred topics were entirely unreadable because IE ran its own html "uglifier" over the source, inserting linebreaks where none belong, capitalizing tags and so forth. It's like seeing HTMLTidy in action, only in reverse.

So I made a parser that first runs the page through a filter to sanitize the HTML (not HTMLTidy, but a minimal filter I made myself). The filtered output was still not readable by the old parser, but it was readable by a different one, which I wrote last Friday. The result is "Mighty Parser", and it hasn't yet failed to read a topic.
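
To give an idea of what such a minimal filter might do, here is a rough sketch in Python (not the code actually running on the site). It assumes the damage is limited to capitalized tag names and stray linebreaks; the function name normalize_html is made up for illustration.

```python
import re

def normalize_html(source: str) -> str:
    """Hypothetical minimal filter: lowercase tag names, collapse stray linebreaks."""
    # Lowercase the name of every opening or closing tag.
    # Attribute names and values are deliberately left untouched.
    lowered = re.sub(
        r"<(/?)([A-Za-z][A-Za-z0-9]*)",
        lambda m: "<" + m.group(1) + m.group(2).lower(),
        source,
    )
    # Replace every run of whitespace that contains a linebreak with one space.
    return re.sub(r"\s*[\r\n]+\s*", " ", lowered).strip()


print(normalize_html('<TABLE>\r\n<TR><TD\n CLASS="post">Hi</TD></TR>\r\n</TABLE>'))
```

Note that a filter this crude would also flatten linebreaks inside preformatted text, so it only makes sense as a step before parsing, not as a general-purpose cleaner.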

----

Another thing has changed about the way posts are parsed.

I no longer merely read the html block that constitutes the post body and cache that, but I also reverse-engineer the html to get a BBCode text. This text gets stored in the database and then parsed back to HTML when it is displayed. In this way, I try to keep the backend data as flexible as possible.
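
The round trip can be sketched roughly like this, again in Python and only as an illustration. The tag table and both function names are assumptions, and only a handful of simple tags are covered; real post bodies contain quotes, links, images and so on.

```python
import re

# Hypothetical mapping from simple HTML tags to BBCode tags.
TAG_PAIRS = {"b": "b", "strong": "b", "i": "i", "em": "i", "u": "u"}

def html_to_bbcode(html: str) -> str:
    """Reverse-engineer a small subset of HTML into BBCode."""
    # Line breaks become plain newlines in the stored text.
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    for html_tag, bb_tag in TAG_PAIRS.items():
        text = re.sub(rf"<{html_tag}>", f"[{bb_tag}]", text, flags=re.IGNORECASE)
        text = re.sub(rf"</{html_tag}>", f"[/{bb_tag}]", text, flags=re.IGNORECASE)
    return text

def bbcode_to_html(bbcode: str) -> str:
    """Render the stored BBCode back to HTML at display time."""
    html = bbcode
    for bb_tag in ("b", "i", "u"):
        html = html.replace(f"[{bb_tag}]", f"<{bb_tag}>")
        html = html.replace(f"[/{bb_tag}]", f"</{bb_tag}>")
    return html.replace("\n", "<br />")


stored = html_to_bbcode("<B>Hello</B><br>world")
print(repr(stored))
print(bbcode_to_html(stored))
```

The point of storing the intermediate form is that the display side can change (different markup, different styling) without touching the saved posts. This sketch does no escaping of user text, which a real renderer would need.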

This also allows me to filter out high-bit characters in the texts, which otherwise can't be displayed. Don't ask me why - the "utf-8" character set that these pages use should be able to contain anything, but apparently PHP outputs high-bit characters in a weird way and they end up as squares or question marks. To avoid this, I encode all these characters in html entities. This isn't easy - PHP has an htmlentities() function, but it leaves a lot of characters uncovered, which need to be replaced explicitly.
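
The general idea of replacing every high-bit character with a numeric entity looks something like the following Python sketch. It is an illustration of the technique rather than the PHP code used here, and the function name encode_high_bit is made up.

```python
def encode_high_bit(text: str) -> str:
    """Replace every non-ASCII character with a numeric HTML entity."""
    # Characters below 128 are plain ASCII and pass through unchanged;
    # everything else becomes &#NNN; with its decimal code point.
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


# U+00E9 is an accented e, U+20AC is the euro sign.
print(encode_high_bit("caf\u00e9 \u20ac5"))
```

Numeric entities have the advantage that no lookup table of named entities is needed, so no character can be left uncovered.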

I will speak of Print Parser in the next post.
