Print Parser

Print Parser is a small step on the road to achieving the total completeness of these archives.

The unfortunate truth is that some topics didn't get saved in time.

Not in the announced purges, of course. Jeff always gives us plenty of time to save topics before they get pruned. I am referring to small in-between deletions - for example, Valley of Thunder and the PPP2 Announcement topic fell victim to one only a week before I finished Magic Flute in January. Stareye was apologetic, but I never found all of PPP2.

Until Saturday evening, that is, when I was searching Google for a different Spiderweb thread and came across a cached version of the PPP2 topic. There were only two files: the first page of the topic, and the printer-friendly view of the entire topic, which Google loves to index and cache (probably due to its minimal styling and formatting).

Technically, I had all the content (minus some details like signatures and post icons). In practice, I couldn't parse it with my normal tool, and would have needed to enter the posts by hand.

Instead, I wrote a third parsing tool - Print Parser. Now, when Drupal cannot find the flat file for a certain page, it falls back to loading the printer-friendly view, which I save as print.html in the same directory. It then parses the posts in that file and caches all the pages of the topic at once. After that, the posts are in the database and Drupal does not need to concern itself with where they came from.
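For the curious, the fallback described above can be sketched roughly as follows. This is not the actual Print Parser code, only a minimal illustration: the function names, the pageN.html naming scheme for flat files, and the figure of 20 posts per page are all assumptions made for the example. Only the print.html file name comes from the description above.

```php
<?php
// Hypothetical sketch; none of these function names come from the real module.

/**
 * Decide which file to parse for a topic page: the flat file if it exists,
 * otherwise the printer-friendly view saved as print.html.
 * Returns null when neither file is present.
 */
function pick_topic_source(string $topicDir, int $page): ?array {
  // Assumed naming scheme for flat files; the real layout may differ.
  $flat = $topicDir . '/page' . $page . '.html';
  if (is_file($flat)) {
    return ['type' => 'flat', 'path' => $flat];
  }
  // Fallback: the printer-friendly view holds the entire topic in one file.
  $print = $topicDir . '/print.html';
  if (is_file($print)) {
    return ['type' => 'print', 'path' => $print];
  }
  return null;
}

/**
 * Split the full list of parsed posts into pages so that every page
 * of the topic can be cached in one go.
 */
function split_into_pages(array $posts, int $perPage = 20): array {
  // 20 posts per page is an assumption, not a value taken from the board.
  return array_chunk($posts, $perPage);
}
```

Because the printer-friendly view contains the whole topic, a single parse yields every page, which is why the pages can all be cached at once instead of one request at a time.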

Post icons, as said, are lost; every post gets the default icon (1). The persistent post IDs are lost as well (the printer-friendly view has no named anchors). Instead, I use an auto-incremented value to give the posts new IDs. That means that if any posts got deleted, the gaps that should appear in the IDs will not be present. If you ever have the misfortune of getting to maintain the code, you will find this bug commented with


<?php /* post ids not carried over. whoop de fucking do. */ ?>
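To make the consequence concrete, here is a small sketch of how auto-incremented IDs swallow the gaps. Again, this is an illustration and not the real code: the function name, the array keys and the DEFAULT_ICON constant are made up for the example, and the sample topic is invented.

```php
<?php
// Hypothetical sketch; field names and the constant are assumptions.
const DEFAULT_ICON = 1;

/**
 * Give each parsed post a fresh sequential ID and the default icon,
 * since the printer-friendly view carries neither.
 */
function assign_post_ids(array $posts, int $firstId = 1): array {
  $id = $firstId;
  $out = [];
  foreach ($posts as $post) {
    $post['id'] = $id++;          // auto-incremented, not the original ID
    $post['icon'] = DEFAULT_ICON; // original icon is unrecoverable
    $out[] = $post;
  }
  return $out;
}

// Invented example: the original topic had IDs 1, 2 and 4 (post 3 was deleted).
$parsed = [['body' => 'first'], ['body' => 'second'], ['body' => 'fourth']];
$posts = assign_post_ids($parsed);
// The new IDs are 1, 2, 3: the gap left by the deleted post is gone.
```

Any old link that pointed at a post after a deleted one would therefore land on the wrong post, which is the bug the comment shrugs off.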


Other than that, it works like a charm. And the PPP2 Announcement topic is now once again available.