After a few days of work, I finally reached a preliminary release of the little scraping project I mentioned before. I wanted to test this version on my production server, so I deployed it, and immediately started seeing some weird parsing issues with Beautiful Soup.
My local work environment and my production server are set up identically, so deployments usually go smoothly, but this time the same code, running on the same version of Python and the same Ubuntu distribution, was giving different results. After a bit of digging around, I decided that it had to be related to memory, since my virtual hosting environment is pretty limited in terms of RAM.
I was running a very simple Beautiful Soup parser before:
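from BeautifulSoup import BeautifulStoneSoup
from urllib import urlopen
# url is the address of the page being scraped (defined elsewhere)
response = urlopen(url)
# Parse the whole document, converting HTML entities to Unicode characters
soup = BeautifulStoneSoup(response.read(), convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
# Collect every <li> element in the parsed tree
lines = soup.findAll("li")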
Run locally, this was giving me all li items on the page. Run on the server, it was giving me only a subset, usually only three or four. It was annoyingly random behaviour at best. After reading Beautiful Soup’s documentation on how to improve performance, I decided to use a custom SoupStrainer so that only the relevant parts of the document are parsed:
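from BeautifulSoup import BeautifulStoneSoup, SoupStrainer
from urllib import urlopen
# Strainer that matches only <li> tags; everything else is skipped while parsing
LINES_STRAINER = SoupStrainer("li")
# url is the address of the page being scraped (defined elsewhere)
response = urlopen(url)
# parseOnlyThese restricts the parse tree to the elements the strainer matches
soup = BeautifulStoneSoup(response.read(), convertEntities=BeautifulStoneSoup.HTML_ENTITIES, parseOnlyThese=LINES_STRAINER)
lines = soup.findAll("li")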
Interestingly, this worked perfectly. Since it’s so easy to implement, and since scraping usually only needs certain parts of a page anyway, I would say this should be the default way to use Beautiful Soup. I didn’t run a speed test, but it feels like it runs much faster as well.
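To illustrate why restricting the parse to a single tag keeps things light, here is a rough sketch of the same idea using only the standard library's html.parser module under Python 3. This is not how SoupStrainer is implemented, just a stand-in for the "keep only what you need" approach. The names ListItemCollector, extract_list_items and SAMPLE_HTML are made up for this example, and it assumes every li element is explicitly closed, which should hold for reasonably clean markup.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Keep the text of <li> elements and discard everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.items = []    # text of each finished top-level <li>
        self._depth = 0    # number of <li> elements currently open
        self._chunks = []  # text fragments of the <li> being read

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self._depth == 0:
                self._chunks = []  # starting a new top-level item
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.items.append("".join(self._chunks).strip())

    def handle_data(self, data):
        # Text outside of any <li> is never stored.
        if self._depth > 0:
            self._chunks.append(data)


def extract_list_items(html):
    """Return the text content of every top-level <li> in the given markup."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page, standing in for the scraped document.
SAMPLE_HTML = """
<html><body>
  <h1>Ignored heading</h1>
  <ul>
    <li>First &amp; foremost</li>
    <li>Second <b>item</b></li>
  </ul>
  <p>Ignored paragraph</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_list_items(SAMPLE_HTML))
```

Nothing outside the li elements is ever kept in memory here, which is the same trade-off the strainer makes: less of the document is available afterwards, but far less work is done up front. Nested lists end up merged into the text of their outermost item in this sketch.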
Are you parsing HTML or XML? I notice you’re using BeautifulStoneSoup, which the Beautiful Soup docs say is for parsing XML. BeautifulSoup is for parsing HTML.
@Alberto - Thanks for the suggestion, I will try lxml out!
@J. Heasly - I know, but even though I am parsing HTML, BeautifulStoneSoup is enough for my purposes since my source is not badly marked up. BeautifulSoup is just a subclass of BeautifulStoneSoup that is aware of anomalies in HTML pages.
Beautiful Soup is one of the best libraries, but I always need to deal with some ASCII encoding problems lol..
© Copyright 2001-2010 Taylan Pince. All rights reserved.
If you are working in a CPython environment, I suggest using lxml (http://codespeak.net/lxml/) to parse the markup: it has a small footprint and is much faster (on a big 1 MB HTML file it’s 100 to 300x faster). Its CSSSelector support is also very useful.