Beautiful Soup & HTMLParser Issues

I started working on a little project for myself today, and since it involved parsing content from a site that doesn’t offer an API, I turned to Beautiful Soup for help. After a couple of hours of coding, I was already scraping nicely formatted data when I got a parser error!

Intrigued and slightly worried, since I had never seen Beautiful Soup fail on any kind of bad HTML before, I looked around a bit. Apparently, since version 3.1.0, Beautiful Soup has used HTMLParser instead of SGMLParser, which makes it much less tolerant of badly formatted documents. The good news is that the maintainers are aware of these issues and already have a detailed page of instructions up.

I opted to downgrade to 3.0.7a since it was the easiest solution, but a pluggable parser architecture for Beautiful Soup, as mentioned in this post, would definitely be the best-case scenario.
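
As a rough illustration of what "pluggable" could mean here, the sketch below tries a strict parser backend first and falls back to a forgiving one when the strict one gives up. This is not Beautiful Soup's API: ParseFailure, LinkCollector, strict_backend, lenient_backend and extract_links are all names made up for the example, and the toy "strict" check (every end tag must match the most recently opened tag) is only an assumption standing in for a real strict parser. It uses the Python 3 standard library only.

```python
from html.parser import HTMLParser


class ParseFailure(Exception):
    """Raised by a backend that gives up on the markup (hypothetical)."""


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags.

    In strict mode, an end tag that does not match the most recently
    opened tag is treated as a fatal error. This toy check does not
    know about void elements such as <br>.
    """

    def __init__(self, strict):
        super().__init__()
        self.strict = strict
        self.open_tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        elif self.strict:
            raise ParseFailure("unexpected </%s>" % tag)
        # In lenient mode a mismatched end tag is simply ignored.


def strict_backend(markup):
    parser = LinkCollector(strict=True)
    parser.feed(markup)
    parser.close()
    return parser.links


def lenient_backend(markup):
    parser = LinkCollector(strict=False)
    parser.feed(markup)
    parser.close()
    return parser.links


def extract_links(markup, backends):
    """Try each backend in order; return (backend name, links) from the first that succeeds."""
    errors = []
    for backend in backends:
        try:
            return backend.__name__, backend(markup)
        except ParseFailure as exc:
            errors.append("%s: %s" % (backend.__name__, exc))
    raise ParseFailure("; ".join(errors))


# The </p> arrives while <a> is still open, so the strict backend refuses it.
BROKEN_MARKUP = '<p><a href="/a">one</p><a href="/b">two</a>'

used_backend, links = extract_links(BROKEN_MARKUP, [strict_backend, lenient_backend])
print(used_backend, links)
```

The point of the design is that the scraping code only ever calls extract_links, so swapping or reordering parsers is a change to one list rather than a library downgrade.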