[hackerspaces] Academic scraping

Bryan Bishop kanzure at gmail.com
Wed Jan 30 22:09:25 CET 2013


On Wed, Jan 30, 2013 at 2:23 PM, Piotr Migdal <pmigdal at gmail.com> wrote:
> I typically use Requests (for downloading pages) + BeautifulSoup (for
> extracting data from HTML files).
>
> Links:
> http://docs.python-requests.org/en/latest/
> http://www.crummy.com/software/BeautifulSoup/

Many years ago, someone compared lxml with BeautifulSoup and found
that, while BeautifulSoup has a non-sucky API, it tends to be slower
than lxml. I am not sure whether this is still the case, because even
2 years ago is ancient legend by now.
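
Since the speed claim may be out of date, it is easy enough to measure
on your own machine. Below is a minimal timing sketch, not a rigorous
benchmark. The synthetic document and the names LinkCounter,
available_counters and benchmark are made up for illustration. It
assumes BeautifulSoup is installed as bs4 and lxml as lxml, and simply
skips whichever one is missing; the standard library parser serves as
a baseline.

```python
import timeit
from html.parser import HTMLParser

# Synthetic, well-formed test document (an assumption for illustration).
# Real pages are messier, so absolute numbers will differ.
HTML = "<html><body>" + "".join(
    f'<p class="row"><a href="/paper/{i}">Paper {i}</a></p>'
    for i in range(2000)
) + "</body></html>"


class LinkCounter(HTMLParser):
    """Counts <a> start tags using only the standard library."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_stdlib(html):
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.count


def available_counters():
    """Map a label to a link-counting function for each installed parser."""
    counters = {"html.parser (stdlib)": count_stdlib}
    try:
        from bs4 import BeautifulSoup
        counters["BeautifulSoup + html.parser"] = (
            lambda html: len(BeautifulSoup(html, "html.parser").find_all("a"))
        )
    except ImportError:
        pass  # BeautifulSoup not installed; skip it
    try:
        import lxml.html
        counters["lxml.html"] = (
            lambda html: len(lxml.html.fromstring(html).xpath("//a"))
        )
    except ImportError:
        pass  # lxml not installed; skip it
    return counters


def benchmark(html=HTML, repeat=3):
    """Return the best wall-clock time in seconds for each parser."""
    results = {}
    for name, func in available_counters().items():
        timings = timeit.repeat(lambda: func(html), number=1, repeat=repeat)
        results[name] = min(timings)
    return results


if __name__ == "__main__":
    for name, seconds in sorted(benchmark().items(), key=lambda kv: kv[1]):
        print(f"{name:32s} {seconds * 1000:8.1f} ms")
```

BeautifulSoup can also use lxml as its underlying parser
(BeautifulSoup(html, "lxml")), so the two are not strictly either/or.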

I enjoy python-requests as much as everyone else. However, I find that
some servers implement non-standard HTTP. Sometimes this shows up as
the server rejecting otherwise standard headers... so my solution was
to write this to patch requests:

https://github.com/kanzure/careful-requests

(because kennethreitz rejected related changes). So, this might be
helpful for scraping delicate servers. For unit testing a scraper, I
like to use:

https://github.com/gabrielfalcao/HTTPretty
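
On the header problem mentioned above: here is a rough sketch of how
to see which default headers Requests would set, and how to drop some
of them for a picky server. This uses plain Requests only; it is not
the careful-requests API. The helper names build_session and
preview_headers and the URL are made up for illustration. Nothing is
sent over the network, since the request is only prepared.

```python
import requests


def build_session(drop=("Accept-Encoding",), extra=None):
    """Return a Session whose default headers are trimmed (hypothetical helper)."""
    session = requests.Session()
    for name in drop:
        # session.headers is case-insensitive; pop() tolerates missing names.
        session.headers.pop(name, None)
    session.headers.update(extra or {})
    return session


def preview_headers(session, url):
    """Prepare a GET without sending it and return the headers Requests set.

    Caveat: layers below Requests may still add headers of their own
    (Host, and possibly a default Accept-Encoding), so this shows what
    Requests asks for, not necessarily the exact bytes on the wire.
    """
    prepared = session.prepare_request(requests.Request("GET", url))
    return dict(prepared.headers)


if __name__ == "__main__":
    url = "http://example.com/paper.pdf"  # placeholder URL
    print("default:", preview_headers(requests.Session(), url))
    print("trimmed:", preview_headers(build_session(), url))
```

If a server still chokes after this, comparing against what actually
goes out on the wire (for example with a local proxy) is probably the
next step.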

- Bryan
http://heybryan.org/
1 512 203 0507

