Lokkju Brennr lokkju at gmail.com
Mon Jan 14 18:51:35 CET 2013

see:
http://scraperwiki.org
http://scrapy.org/

Once you have the raw data in a central location, it becomes much easier
for someone specialized in data processing to convert it to a usable form,
even if it is hard to parse.  It does help to keep the metadata, though...
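As a minimal sketch of the "keep the metadata" point (hypothetical helper,
not from any particular project): store each fetched document under its
checksum, with a small JSON sidecar recording where it came from, so a later
processing pass can still make sense of the pile.

```python
import json
import hashlib
from pathlib import Path

def store_with_metadata(data: bytes, source_url: str, outdir: str = "archive") -> str:
    """Store raw bytes under their SHA-256 name, plus a metadata sidecar.

    Keeping the source URL, checksum, and size next to the raw file means
    a later data-processing pass can still tell where each document came
    from, even if parsing it is deferred indefinitely.
    """
    out = Path(outdir)
    out.mkdir(exist_ok=True)
    digest = hashlib.sha256(data).hexdigest()
    # Raw bytes, named by content hash (deduplicates identical fetches).
    (out / digest).write_bytes(data)
    # Provenance sidecar next to the data.
    meta = {"url": source_url, "sha256": digest, "bytes": len(data)}
    (out / (digest + ".json")).write_text(json.dumps(meta, indent=2))
    return digest
```

The fetch itself (curl, urllib, Scrapy) is a separate concern; the point is
only that the raw data and its provenance travel together.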

Loki

On Mon, Jan 14, 2013 at 12:27 PM, Bryan Bishop <kanzure at gmail.com> wrote:

> Hey all,
>
> The unspoken truth of programmerhood is that many of us write spiders
> and scrapers. But nobody talks about it. I have done some
> introspection on why these initiatives fail in academic contexts, and
> I think a big reason is because of biting off more than one can chew.
> The other reason is that there are no best practices being passed
> around, and no reusable software being distributed (for the most part).
>
>
> Maybe instead of never communicating about these ideas, it would be
> better to write them down for ourselves. I suspect that there are many
> individuals that are highly motivated this week to start writing out
> silly curl scripts. A pile of PDFs is fairly useless to the broader
> community (especially without metadata, since OCR so rarely works on
> TeX).
>
> I'm dropping this here because for whatever reason many of the people
> in the hackerspace community have approached me separately over the
> past few days about starting projects like these. Maybe instead of
> duplicating effort we could figure out ways to suck less?
>
> - Bryan
> http://heybryan.org/
> 1 512 203 0507
> _______________________________________________
> Discuss mailing list
> Discuss at lists.hackerspaces.org
> http://lists.hackerspaces.org/mailman/listinfo/discuss
>