Storage Wars & Data Dumps

Studying archival techniques in the digital age

Warrick and the Scraping of the Wayback Machine


Back in the mid-2000s I ran a tech blog. At some point in 2007 I suffered a localized hard drive crash followed by a webserver crash, and I lost my entire website. I decided to take this first assignment as an opportunity to attempt to recover the site from the Wayback Machine archives. This posed several issues: because of the Wayback Machine's irregular URL formation, wget is unable to scrape the data accurately, and it left me with empty folders. After doing more research I decided to use Warrick, a similar tool written as a Perl script. There is not a lot of documentation on using Warrick outside of pure 'nix environments, but I was able to get it up and running (tutorial to come soon). Warrick let me specify a date in the command, telling it to search for all archived instances of my website. After close to three hours I had recovered most of my website as hundreds of HTML documents. The result is back online, though I had to alter all my link references to get the images and some other things working, and I still have more work to do to get it 100% back to where it was.
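The link alteration mentioned above is mostly mechanical: pages recovered from the Wayback Machine have their URLs rewritten to the archive's `http://web.archive.org/web/<timestamp>/<original URL>` form, and stripping that prefix restores the original links. A minimal sketch of that cleanup (the file layout and `.html` extension are assumptions, not Warrick's actual output format):

```python
import re
from pathlib import Path

# The Wayback Machine rewrites links as
# http://web.archive.org/web/20060114000000/http://example.com/page.html
# (14-digit timestamp, sometimes followed by a modifier like "im_").
# Stripping that prefix leaves the original URL behind.
WAYBACK_PREFIX = re.compile(r"https?://web\.archive\.org/web/\d{14}[a-z_]*/")

def strip_wayback_prefixes(html: str) -> str:
    """Remove archive.org URL prefixes, restoring the original links."""
    return WAYBACK_PREFIX.sub("", html)

def fix_directory(root: str) -> None:
    """Rewrite every recovered .html file under root in place."""
    for page in Path(root).rglob("*.html"):
        page.write_text(strip_wayback_prefixes(page.read_text()))
```

Making the restored links relative (so they resolve against the local copies) is a second pass on top of this, but removing the archive prefix is the bulk of the work.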


COMMAND USED: perl warrick.pl -dr 2006-01-14

This automatically scraped for the closest known state of the site to the date Jan 14, 2006. It managed to download 125 articles with all images intact.
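"Closest known state" comes down to picking, from the snapshot timestamps the archive holds for a page, the one nearest the requested date. Warrick handles this internally; a minimal sketch of the idea, assuming the standard Wayback 14-digit `YYYYMMDDhhmmss` timestamp format:

```python
from datetime import datetime

def closest_snapshot(timestamps: list[str], target: str) -> str:
    """Given Wayback-style 14-digit snapshot timestamps (YYYYMMDDhhmmss),
    return the one nearest to the target date (YYYY-MM-DD)."""
    goal = datetime.strptime(target, "%Y-%m-%d")

    def distance(ts: str):
        # abs() so snapshots on either side of the date compete equally
        return abs(datetime.strptime(ts, "%Y%m%d%H%M%S") - goal)

    return min(timestamps, key=distance)
```

So a snapshot from the evening of Jan 13, 2006 would beat one from March 2006 for a Jan 14 target, which matches the behavior described above.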

