Warrick and the Scraping of the Wayback Machine

Back in the mid-00s I ran a tech blog called unwirednews.net. At some point in 2007 I suffered a localized hard drive crash followed by a web server crash, and I lost my entire website. I decided to take this first assignment as an opportunity to attempt to recover the site from the archive.org Wayback Machine archives.

This posed many issues: because of irregular URL formation, wget was unable to accurately scrape the data and left me with empty folders. After doing more research I decided to use Warrick, a similar tool written as a Perl script. There is not a lot of documentation on using Warrick outside of pure ’nix environments, but I was able to get it up and running (tutorial to come soon).

By setting a date in the command, I told Warrick to search archive.org for all archived instances of my website. After close to three hours I had recovered most of my website in hundreds of HTML documents. The result can be found at www.ajlevine.com/unwired. I had to alter all my link references to get the images and some other things working, and I still have some more work to do to get it 100% back to where it was.
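
Under the hood, this kind of recovery amounts to enumerating every capture the Wayback Machine holds for a domain and fetching each one. As a rough illustration of what Warrick automates, the Python sketch below lists captures through the Wayback Machine's public CDX API; the endpoint and parameters are the documented ones, but the helper itself is just an illustration, not Warrick's actual code.

    import json
    import urllib.request

    def list_snapshots(domain, date_from, date_to):
        """List Wayback Machine captures for a domain in a date window.

        A sketch of the enumeration Warrick automates, built on the
        public CDX API; not Warrick's own implementation.
        """
        query = (
            "https://web.archive.org/cdx/search/cdx"
            f"?url={domain}&matchType=prefix"
            f"&from={date_from}&to={date_to}"
            "&filter=statuscode:200&output=json"
        )
        with urllib.request.urlopen(query) as resp:
            rows = json.load(resp)
        if not rows:
            return []
        header, captures = rows[0], rows[1:]
        ts, orig = header.index("timestamp"), header.index("original")
        return [(row[ts], row[orig]) for row in captures]

    # Every successfully archived page under unwirednews.net in Jan 2006.
    for timestamp, original in list_snapshots("unwirednews.net",
                                              "20060101", "20060201"):
        # The capture itself is served from
        # https://web.archive.org/web/<timestamp>/<original>
        print(timestamp, original)

Each (timestamp, original) pair maps to an archived copy of one page, which is what a tool like Warrick then downloads and saves locally.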


COMMAND USED: perl warrick.pl -dr 2006-01-14 http://www.unwirednews.net

This automatically scraped archive.org for the closest known state of unwirednews.net to January 14, 2006. It managed to download 125 articles with all images intact.
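
Getting the recovered pages to render locally was a matter of rewriting the link references. Here is a minimal sketch of that cleanup, assuming the recovered HTML sits in an unwired/ folder and still contains Wayback-prefixed or absolute unwirednews.net URLs; the folder name and the patterns are assumptions for illustration, not something Warrick produces for you.

    import pathlib
    import re

    # Unwrap Wayback prefixes (https://web.archive.org/web/<timestamp>/)
    # and turn absolute unwirednews.net links into root-relative paths so
    # images and internal links resolve from the local copy.
    WAYBACK = re.compile(r"https?://web\.archive\.org/web/\d+[a-z_]*/")
    ABSOLUTE = re.compile(r"https?://(?:www\.)?unwirednews\.net")

    for page in pathlib.Path("unwired").rglob("*.html"):
        html = page.read_text(errors="replace")
        html = WAYBACK.sub("", html)    # unwrap archived URLs
        html = ABSOLUTE.sub("", html)   # make internal links root-relative
        page.write_text(html)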

2 Responses to Warrick and the Scraping of the Wayback Machine

  1. hontes June 8, 2015 at 1:10 pm #

    Did you ever get around to doing a Warrick tutorial?

    • AJ June 17, 2015 at 12:03 am #

      Hi, I have not done this yet. Would it still be helpful for you?
