Beautiful Soup and Memory Issues

After a few days of work, I finally reached a preliminary release of the little scraping project I mentioned before. I wanted to test this version on my production server, so I deployed it, and immediately started seeing some weird parsing issues with Beautiful Soup.

My local work environment and my production server are set up the same way, so deployments usually go smoothly. This time, though, the same code, running on the same version of Python and the same Ubuntu distribution, was giving different results. After a little bit of digging around, I decided that it had to be related to memory, since my virtual hosting environment is pretty limited in terms of RAM.
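A quick way to test a hunch like this is to log the process's peak memory around the parsing step on both machines. The following is only a minimal sketch, assuming a Unix host where Python's standard resource module is available; peak_memory_kb is a hypothetical helper and the list comprehension is a stand-in for the real parsing work, neither comes from the original script.

```python
import resource
import sys


def peak_memory_kb():
    """Return this process's peak resident set size in kilobytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        peak //= 1024
    return peak


before = peak_memory_kb()
# Stand-in for the real work (fetching and parsing the page).
data = ["<li>item %d</li>" % i for i in range(200000)]
after = peak_memory_kb()

print("peak RSS before: %d KB, after: %d KB" % (before, after))
```

Comparing these numbers with the RAM available on the server would show whether the parser is really running into the limit.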

I was running a very simple Beautiful Soup parser before:

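from BeautifulSoup import BeautifulStoneSoup
from urllib import urlopen

# url holds the address of the page being scraped (defined elsewhere)
response = urlopen(url)

# Parse the whole document, converting HTML entities to Unicode characters
soup = BeautifulStoneSoup(response.read(), convertEntities=BeautifulStoneSoup.HTML_ENTITIES)

# Collect every <li> element on the page
lines = soup.findAll("li")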

Run locally, this gave me every li item on the page. Run on the server, it gave me only a subset, usually just three or four. The behaviour was annoyingly random. After reading Beautiful Soup’s documentation on how to improve performance, I decided to use a custom SoupStrainer to parse only the parts of the document I needed:

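from BeautifulSoup import BeautifulStoneSoup, SoupStrainer
from urllib import urlopen

# Only <li> elements are kept while parsing; everything else is discarded
LINES_STRAINER = SoupStrainer("li")

# url holds the address of the page being scraped (defined elsewhere)
response = urlopen(url)

soup = BeautifulStoneSoup(response.read(), convertEntities=BeautifulStoneSoup.HTML_ENTITIES, parseOnlyThese=LINES_STRAINER)

# The soup now contains nothing but the <li> elements
lines = soup.findAll("li")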

Interestingly, this worked perfectly. It is easy to implement, and when scraping you most likely need only certain parts of the page anyway, so I would say this should be the default way to use Beautiful Soup. I didn’t do a speed test, but it feels like it runs much faster as well.
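For anyone on a newer setup, the same idea carries over to Beautiful Soup 4. The following is a minimal sketch, assuming the bs4 package on Python 3, where the parseOnlyThese argument is called parse_only and entities are converted to Unicode by default. The inline HTML string is made up for illustration so the snippet runs without a network call.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample document standing in for the downloaded page.
html = """
<html><body>
  <h1>Shopping</h1>
  <ul><li>apples</li><li>pears &amp; plums</li></ul>
  <p>Not needed.</p>
  <ol><li>bread</li></ol>
</body></html>
"""

# Keep only <li> elements while parsing; the rest never enters the tree.
only_li = SoupStrainer("li")
soup = BeautifulSoup(html, "html.parser", parse_only=only_li)

lines = [li.get_text() for li in soup.find_all("li")]
print(lines)
```

Because the heading and paragraph are dropped during parsing, the resulting tree holds only the list items, which is where any memory and speed savings would come from.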
