Warrick and the Scraping of the Wayback Machine

Back in the mid-00s I ran a tech blog called unwirednews.net. At some point in 2007 I suffered a localized hard drive crash followed by a web server crash, and I lost my entire website. I decided to take this first assignment as an opportunity to attempt to recover the site from the archive.org Wayback Machine archives.

This posed many issues: because of irregular URL formation, wget was unable to accurately scrape the data and left me with empty folders. After doing more research I decided to use Warrick, a similar tool written as a Perl script. There is not a lot of documentation on using Warrick outside of pure ’nix environments, but I was able to get it up and running (tutorial to come soon).

By setting a date in the command, I told Warrick to search archive.org for all archived instances of my website. After close to three hours I had recovered most of my website in hundreds of HTML documents. The result can be found at www.ajlevine.com/unwired. I had to alter all my link references to get the images and some other things working, and I still have some more work to do to get it 100% back to where it was.
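
Under the hood, this kind of recovery amounts to enumerating every capture the Wayback Machine holds for a domain and fetching each one. As a rough illustration of what Warrick automates, the Python sketch below lists captures through the Wayback Machine's public CDX API; the endpoint and parameters are the documented ones, but the helper itself is just an illustration, not Warrick's actual code.

    import json
    import urllib.request

    def list_snapshots(domain, date_from, date_to):
        """List Wayback Machine captures for a domain in a date window.

        A sketch of the enumeration Warrick automates, built on the
        public CDX API; not Warrick's own implementation.
        """
        query = (
            "https://web.archive.org/cdx/search/cdx"
            f"?url={domain}&matchType=prefix"
            f"&from={date_from}&to={date_to}"
            "&filter=statuscode:200&output=json"
        )
        with urllib.request.urlopen(query) as resp:
            rows = json.load(resp)
        if not rows:
            return []
        header, captures = rows[0], rows[1:]
        ts, orig = header.index("timestamp"), header.index("original")
        return [(row[ts], row[orig]) for row in captures]

    # Every successfully archived page under unwirednews.net in Jan 2006.
    for timestamp, original in list_snapshots("unwirednews.net",
                                              "20060101", "20060201"):
        # The capture itself is served from
        # https://web.archive.org/web/<timestamp>/<original>
        print(timestamp, original)

Each (timestamp, original) pair maps to an archived copy of one page, which is what a tool like Warrick then downloads and saves locally.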


COMMAND USED: perl warrick.pl -dr 2006-01-14 http://www.unwirednews.net

This automatically scraped archive.org for the closest known state of unwirednews.net to January 14, 2006. It managed to download 125 articles with all images intact.
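
Getting the recovered pages to render locally was a matter of rewriting the link references. Here is a minimal sketch of that cleanup, assuming the recovered HTML sits in an unwired/ folder and still contains Wayback-prefixed or absolute unwirednews.net URLs; the folder name and the patterns are assumptions for illustration, not something Warrick produces for you.

    import pathlib
    import re

    # Unwrap Wayback prefixes (https://web.archive.org/web/<timestamp>/)
    # and turn absolute unwirednews.net links into root-relative paths so
    # images and internal links resolve from the local copy.
    WAYBACK = re.compile(r"https?://web\.archive\.org/web/\d+[a-z_]*/")
    ABSOLUTE = re.compile(r"https?://(?:www\.)?unwirednews\.net")

    for page in pathlib.Path("unwired").rglob("*.html"):
        html = page.read_text(errors="replace")
        html = WAYBACK.sub("", html)    # unwrap archived URLs
        html = ABSOLUTE.sub("", html)   # make internal links root-relative
        page.write_text(html)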

2 Responses to Warrick and the Scraping of the Wayback Machine

  1. hontes June 8, 2015 at 1:10 pm #

    Did you ever get around to doing a Warrick tutorial?

    • AJ June 17, 2015 at 12:03 am #

      Hi, I have not done this yet. Would it still be helpful for you?
