[hackerspaces] Academic scraping

Bryan Bishop kanzure at gmail.com
Mon Jan 14 18:59:38 CET 2013


On Mon, Jan 14, 2013 at 11:51 AM, Lokkju Brennr wrote:
> see:
> http://scraperwiki.org
> http://scrapy.org/
>
> Once you have the raw data in a central location, it becomes much easier for
> someone specialized in data processing to convert it to usable form - even
> if it is hard to parse.  It does help to keep the metadata though...

One of my favorite scraping methods at the moment is phantomjs, a
scriptable, headless WebKit browser.

http://phantomjs.org/
https://github.com/ariya/phantomjs
https://github.com/kanzure/pyphantomjs

But for academic projects, I highly recommend Zotero's translators.

https://github.com/zotero/translators

Here's why: there's already a huge base of Zotero users actively
maintaining these scrapers, so when a translator breaks, someone
fixes it quickly. They're all written in JavaScript, and they extract
not only the link to the PDF but also as much metadata as the source
page exposes. With the help of the zotero/translation-server project,
they can be used headlessly.

https://github.com/zotero/translation-server

I have a demo of this working in irc.freenode.net ##hplusroadmap
(paperbot): it grabs links from our conversation and posts the
PDFs so that we don't have to ask each other for copies.

- Bryan
http://heybryan.org/
1 512 203 0507

