see:
http://scraperwiki.org
http://scrapy.org/

Once you have the raw data in a central location, it becomes much easier
for someone who specializes in data processing to convert it to a usable
form, even if it is hard to parse.  It does help to keep the metadata, though.
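
As a rough sketch of what "keeping the metadata" could look like: store
each downloaded file under its content hash and write a small JSON
sidecar next to it. The function name store_raw, the directory layout,
and the metadata fields are all assumptions made for illustration; they
are not taken from ScraperWiki or Scrapy.

```python
import hashlib
import json
import time
from pathlib import Path


def store_raw(data, source_url, dest_dir, content_type=None):
    """Save raw bytes plus a JSON metadata sidecar (hypothetical layout)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the file by its SHA-256 so identical downloads are stored once.
    digest = hashlib.sha256(data).hexdigest()
    raw_path = dest / digest
    raw_path.write_bytes(data)

    # Record where and when the bytes came from; these fields are only
    # a suggested minimum, not an established schema.
    metadata = {
        "source_url": source_url,
        "fetched_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sha256": digest,
        "size_bytes": len(data),
        "content_type": content_type,
    }
    meta_path = dest / (digest + ".json")
    meta_path.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return raw_path, meta_path
```

The bytes can come from anywhere (curl, urllib, a Scrapy pipeline); the
point is only that the URL and fetch time are written down at download
time, when they are still known.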

> The unspoken truth of programmerhood is that many of us write spiders
> and scrapers. But nobody talks about it. I have done some
> introspection on why these initiatives fail in academic contexts, and
> I think a big reason is because of biting off more than one can chew.
> The other reason is that there are no best practices being passed
> around, and no reusable software distributed (for the most part).
> Maybe instead of never communicating about these ideas, it would be
> better to write them down for ourselves. I suspect that there are many
> individuals that are highly motivated this week to start writing out
> silly curl scripts. A pile of pdfs is fairly useless to the broader
> community (especially without metadata, since OCR so rarely works on
> \tau\epsilon\tex).
> I'm dropping this here because for whatever reason many of the people
> in the hackerspace community have approached me separately over the
> past few days about starting projects like these. Maybe instead of
> duplicating effort we could figure out ways to suck less?
