[Haskell-cafe] ANN: archiver 0.1 and 0.2

Gwern Branwen gwern0 at gmail.com
Fri Dec 10 23:35:34 CET 2010


I'd like to announce archiver, a small utility and library which builds
on my WebArchive plugin for gitit:
http://hackage.haskell.org/package/archiver
Source is available via
`darcs get http://community.haskell.org/~gwern/archiver/`.

The library half is a simple wrapper around the appropriate HTTP
requests; the executable half reads a text file of URLs and loops,
(slowly) firing off an archiving request for each one and deleting it
from the file as it goes.
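
For the library half, use is roughly as follows. This is only a minimal
sketch: the module and function names here are illustrative and may not
match the released API exactly.

    -- Sketch only: assumes the library exports something like
    -- checkArchive :: String -> String -> IO ()  (email address, then URL).
    import Network.URL.Archiver (checkArchive)

    main :: IO ()
    main = checkArchive "you@example.com" "http://www.gwern.net"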

Put another way, 'archiver' is a daemon which will process a specified
text file, each line of which is a URL, and will one by one request
that the URLs be archived or spidered by http://www.webcitation.org *
and http://www.archive.org ** for future reference. WebCite and the IA
will then store a copy of the HTML and, hopefully, all the non-dynamic
resources the web pages need. (An example would be
http://bits.blogs.nytimes.com/2010/12/07/palm-is-far-from-game-over-says-former-chief/
and http://webcitation.org/5ur7ifr12)
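
Concretely, the executable's loop amounts to something like the
following. This is a simplified sketch, not the shipped code, and it
assumes the same hypothetical `checkArchive`-style entry point as above.

    import Control.Concurrent (threadDelay)
    import Network.URL.Archiver (checkArchive) -- assumed export, as above

    -- Simplified sketch of the daemon loop: take the first URL in the
    -- queue file, request archiving, write the remainder back, repeat.
    archiveLoop :: FilePath -> String -> IO ()
    archiveLoop file email = do
        contents <- readFile file
        length contents `seq` return () -- force the lazy read so the file can be rewritten
        case lines contents of
          []       -> threadDelay (60 * 1000 * 1000) -- empty queue: wait a minute
          (u:rest) -> checkArchive email u >> writeFile file (unlines rest)
        archiveLoop file email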

Usage of archiver might look like `while true; do archiver ~/.urls.txt
gwern0@gmail.com; done`***.

There are a number of ways to populate the source text file. For
example, I have a script `firefox-urls` which is called in my crontab
every hour, and which looks like this:

    #!/bin/sh
    set -e
    # work on a copy of the Firefox history DB, not the live (possibly locked) one
    cp `find ~/.mozilla/ -name "places.sqlite"` ~/
    # append every URL visited in the last day to the archiver queue
    sqlite3 ~/places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                             WHERE moz_places.id = moz_historyvisits.place_id \
                             AND visit_date > strftime('%s','now','-1 day')*1000000 \
                             ORDER BY visit_date;" >> ~/.urls.txt
    rm ~/places.sqlite

This gets all the URLs visited in the last day and appends them to the
file for archiver to process. Hence, everything I browse is backed up.

More useful perhaps is a script to extract external links from
Markdown files and print them to stdout:

    import System.Environment (getArgs)
    import Text.Pandoc (defaultParserState, processWithM, readMarkdown,
                        Inline(Link), Pandoc)

    main :: IO ()
    main = getArgs >>= mapM readFile >>= mapM_ analyzePage

    -- Parse a page as Markdown and walk the AST, printing every link target.
    analyzePage :: String -> IO Pandoc
    analyzePage x = processWithM printLinks (readMarkdown defaultParserState x)

    printLinks :: Inline -> IO Inline
    printLinks l@(Link _ (x, _)) = putStrLn x >> return l
    printLinks x                 = return x

So now I can take `find . -name "*.page"`, pass the 100 or so Markdown
files in my wiki as arguments, and add the thousand or so external
links to the archiver queue (e.g. `find . -name "*.page" | xargs
runhaskell link-extractor.hs >> ~/.urls.txt`). They will eventually be
archived/backed up, and combined with a tool like link-checker****,
this means there need never be any broken links, since one can always
either find a live link or fall back to the archived version.

General comments: I've used archiver for a number of weeks now. It has
never caught up with my Firefox-generated backlog, since WebCite seems
to do IP-based throttling: in my experiments, you can't request more
often than about once per 20 seconds. Because of that, I removed the
hinotify 'watch file' functionality; it may be that I was too hasty in
removing it.
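
(For the curious, respecting that limit is just a matter of sleeping
between submissions; a sketch of the sort of pacing involved:

    import Control.Concurrent (threadDelay)

    -- Pause roughly 20 seconds between WebCite submissions
    -- (threadDelay takes microseconds).
    webcitePause :: IO ()
    webcitePause = threadDelay (20 * 1000 * 1000)

which is about the rate the daemon has to live with.)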

* http://en.wikipedia.org/wiki/WebCite
** http://en.wikipedia.org/wiki/Internet_Archive
*** There are sporadic exceptions from somewhere in the network or
HTTP libraries, I think
**** http://linkchecker.sourceforge.net/

-- 
gwern
http://www.gwern.net


