Problems

Submitted by Arancaytar on Fri, 11/24/2006 - 23:34

As you can see from the parser status, the post parser is very close to being done now. The heavy post-crunching (triggered the first time somebody reads a topic) was done not by human visitors, but by a quite indiscriminate wget unleashed on the site (from its own server, naturally). Note that the figure is somewhat inaccurate: I've found that a few of the topic lengths don't reflect the number of posts that actually exist.

Unfortunately, as the parser cache fills, new problems become apparent every day.

1. For topics with polls, only the first page got saved. I solved this one, although polls are still displayed as a messy jumble of JavaScript. TODO: Replace the poll code with some text message. Eventually, add support for viewing poll results - parse those results pages that got saved and display them inline. I'm not sure if there are enough archived result pages for this to make sense.

2. Several special cases resulting from special users using special browsers coded by special programmers. I now wish to beat the staff of Infopop and Microsoft over the head with an unclosed HTML tag, and I never want to look at a regular expression again. These are all solved - or at least all the ones I'm aware of. I have seen fewer than half of the posts parsed so far.

TODO: Add the option of easily flagging a topic as incorrectly parsed. Expected errors include weird formatting of a post's content, missing posts, missing pages, and posts getting attributed to the wrong member. It's happened, unfortunately.

3. Emoticons are universally broken. Some of them still include the file paths of the cached images folder. (On a related note: Kudos to whoever came up with the idea of naming topic files by their title rather than their uniquely identifying freaking number. Your amazing genius has saved me from months of boredom and will do so again when I get to PPP2.) Fortunately, this is easy to fix. Regex for /\[img\](.*\/)?(smile|tongue|biggrin|angry|embarassed|eek|rolleyes|cool)\.gif\[\/img\]/ and replace with [img]/images/$2.gif[/img]. Might have left one out there. I care. rolleyes.gif
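As a rough sketch of that emoticon fix, here it is in Python (not necessarily what the site itself runs on). The function name `fix_emoticon_paths` and the `cache/ubb/` path in the usage note are made up for illustration; the emoticon names are the ones listed above, and the list may well be incomplete.

```python
import re

# Emoticon file names as listed in the post; possibly incomplete.
EMOTICONS = ("smile", "tongue", "biggrin", "angry",
             "embarassed", "eek", "rolleyes", "cool")

# [img]<optional path ending in />name.gif[/img]
# The brackets are escaped so they match literally instead of
# forming a character class.
EMOTICON_RE = re.compile(
    r"\[img\](?:[^\[\]]*/)?(" + "|".join(EMOTICONS) + r")\.gif\[/img\]"
)

def fix_emoticon_paths(bbcode: str) -> str:
    """Point every known emoticon at /images/<name>.gif (hypothetical helper)."""
    return EMOTICON_RE.sub(r"[img]/images/\1.gif[/img]", bbcode)
```

Calling `fix_emoticon_paths("I care. [img]cache/ubb/rolleyes.gif[/img]")` rewrites the tag to `[img]/images/rolleyes.gif[/img]`, while images whose file name is not in the list are left alone.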

4. Several special cases resulting from special users being lazy as well as special. To date, I've found about a dozen pages missing from the PPP1 archive. I will put up the list eventually on Spiderweb. (TODO).

5. This time through my own fault (gasp), quotes and bold tags are displayed very weirdly. They're correct in the stored BBCode (the reverse parsing worked), but the forward parsing into HTML causes bad formatting - nested b tags break it all, and nested quote tags are worse.

In the end, I might have to ditch the BBCode module entirely and write my own parser. I'm looking SO forward to that.

6. Quotes aren't always recognized as such. In some of the threads, you can see "quote:Originally written by:" without a quote mark. This is the result of the reverse BBCode parser not understanding the formatting of this quote tag and just stripping the HTML from it.

7. No wait, that's about all there is. Yay4PPP.
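On the nested bold and quote tags mentioned above: one common way to make nesting behave is to track open tags on a stack instead of matching pairs with regexes. The following is only a sketch in Python, not the BBCode module's code and not a finished parser. The name `bbcode_to_html` and the choice of `<strong>` and `<blockquote>` as output elements are assumptions, and it handles nothing beyond plain `[b]` and `[quote]`.

```python
import html
import re

# Assumed mapping from BBCode tag to HTML element; the real markup may differ.
TAGS = {"b": "strong", "quote": "blockquote"}
TOKEN_RE = re.compile(r"\[(/?)(b|quote)\]", re.IGNORECASE)

def bbcode_to_html(text: str) -> str:
    out = []
    stack = []  # names of currently open tags, innermost last
    pos = 0
    for m in TOKEN_RE.finditer(text):
        # Emit the plain text before this tag, escaped for HTML.
        out.append(html.escape(text[pos:m.start()]))
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not closing:
            stack.append(name)
            out.append(f"<{TAGS[name]}>")
        elif name in stack:
            # Close everything opened after the matching tag, then the tag itself.
            while True:
                top = stack.pop()
                out.append(f"</{TAGS[top]}>")
                if top == name:
                    break
        else:
            # Stray closing tag with nothing to close: keep it as literal text.
            out.append(html.escape(m.group(0)))
    out.append(html.escape(text[pos:]))
    # Close anything the post left open.
    while stack:
        out.append(f"</{TAGS[stack.pop()]}>")
    return "".join(out)
```

The point of the stack is that the output is always well nested: a closing tag closes whatever was opened inside it first, and tags that a post never closes get closed at the end rather than leaking into the rest of the page.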

TODO List will follow in the next post. There's a dern lot of stuff otherwise.
