Kent's Korner
by Kent S Johnson

2007-10-20 21:01:26

A Brief Introduction to Beautiful Soup

It is common to need to extract information from the HTML source of a web page. There are many approaches to doing this, and many problems.

For simple requirements or web pages that are very consistent, regular expressions may be used to extract text from a web page. Regular expressions may have difficulty coping with nested or repeating structures, line breaks and comments.
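To make the line-break and comment problems concrete, here is a small sketch using only the standard re module. The HTML fragment and both patterns are made up for illustration. The naive pattern misses a link whose tag is split across two lines, and both patterns match a link that sits inside a comment.

```python
import re

# Hypothetical page fragment, invented for this illustration
html = '''<p><a href="one.html">One</a></p>
<!-- <a href="old.html">Old</a> -->
<p><a
   href="two.html">Two</a></p>'''

# Naive pattern: assumes exactly one space between "a" and "href"
naive = re.compile(r'<a href="([^"]*)"')
naive_links = naive.findall(html)

# More tolerant pattern: \s+ also matches the line break inside the tag
tolerant = re.compile(r'<a\s+href="([^"]*)"')
tolerant_links = tolerant.findall(html)

print(naive_links)     # misses two.html
print(tolerant_links)  # finds two.html, but still includes the commented-out link
```

Neither pattern knows that old.html is commented out; that kind of context is what a real parser provides.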

You can also write a parser based on sgmllib. For a good example of this approach, see Chapter 8 of Dive Into Python.
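For a feel of what a hand-written parser looks like, here is a minimal sketch. It is not the sgmllib approach itself: it assumes Python 3, where sgmllib is no longer available, and uses the standard html.parser module in its place. The class name LinkCollector and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag that has one."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<p><a href="one.html">One</a> <a name="top">Top</a></p>')
parser.close()
print(parser.links)
```

You supply a handler for each kind of event you care about and keep track of any state yourself, which is the main cost of this approach.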

Real-world HTML is full of challenges such as missing and mismatched tags and a variety of character encodings and entity escapes. BeautifulSoup is a library that parses real-world data fairly gracefully, yielding a data structure with many options for searching and navigation.

BeautifulSoup provides a rich interface to the parsed data. I will just touch on a few of the methods I have found most useful. The BeautifulSoup Documentation is quite good; refer to it for many more capabilities than I discuss here.

Getting the Data

BeautifulSoup does not fetch the web page for you; you have to do that yourself. This usually involves some incantation with urllib2. For simple cases, you can use:

import urllib2
f = urllib2.urlopen('http://coolsite.example.com')
data = f.read()
f.close()

If the page you want to fetch requires authentication, form submission or other complications, you will need more than this. The urllib2 documentation and examples may help. In any event, urllib2 is a topic for another day, so I will assume that you have some data to parse.

Parsing the Data

Parsing the data is as simple as:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(data)

Searching the Parse Tree

Now that you have the soup, stir it until the pieces of interest come to the surface. :-) Here are some helpful bits:

Individual tags can be accessed as attributes of a parent tag. For example, soup.head is the <head> tag and soup.head.title is the <title> tag. Under the hood, attribute access is actually a search for the first tag of the correct type, so soup.title will also access the first <title> tag.
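Attribute access can be tried on a tiny document. This sketch assumes Python 3 and the bs4 package (Beautiful Soup 4, the successor to the version imported elsewhere in this essay); the sample HTML is made up.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

data = '<html><head><title>Soup</title></head><body><p>Hi</p></body></html>'
soup = BeautifulSoup(data, 'html.parser')

print(soup.head.title)                # the <title> tag, reached via <head>
print(soup.title)                     # the same tag, found by searching from the top
print(soup.title is soup.head.title)  # both lookups reach the same object
```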

The find() and findAll() methods provide more options. The first parameter is the tag name; attribute values are given with keyword parameters. You can specify the tag name and attribute values using text (exact match), regular expressions (regex match), booleans (existence) and arbitrary callables (whatever you want). For example, to find all <a> tags that have an href attribute (skipping named anchors such as <a name="top">, which have none), use soup.findAll('a', href=True).
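Here is a sketch of each kind of matcher. It assumes Python 3 and the bs4 package (Beautiful Soup 4), where findAll() is spelled find_all(); the sample HTML is made up.

```python
import re
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

data = '''<body>
<a name="top">Top</a>
<a href="http://example.com/a.html">A</a>
<a href="b.pdf">B</a>
<img src="pic.png">
</body>'''
soup = BeautifulSoup(data, 'html.parser')

# Boolean: any <a> that has an href attribute at all
with_href = soup.find_all('a', href=True)

# Text: the attribute value must match exactly
exact = soup.find_all('a', href='b.pdf')

# Regular expression applied to the attribute value
absolute = soup.find_all('a', href=re.compile(r'^http://'))

# Callable: called with the attribute value, or None if the attribute is absent
def is_pdf(value):
    return value is not None and value.endswith('.pdf')

pdfs = soup.find_all('a', href=is_pdf)

# The tag name can be a regular expression too
a_or_img = soup.find_all(re.compile(r'^(a|img)$'))

print([a['href'] for a in with_href])
print(len(exact), len(absolute), len(pdfs), len(a_or_img))
```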

BeautifulSoup also provides direct navigation to the parent, children or siblings of a tag. Some of these methods can be filtered; for example, you can find the nearest preceding sibling <div> of a tag with tag.findPreviousSibling('div'). See the documentation for details.
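A short navigation sketch, assuming Python 3 and the bs4 package (Beautiful Soup 4), where previousSibling and findPreviousSibling() are spelled previous_sibling and find_previous_sibling(). The sample HTML is made up.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

data = '<body><div id="first">one</div><p>two</p><div id="last">three</div></body>'
soup = BeautifulSoup(data, 'html.parser')

last = soup.find('div', id='last')

parent_name = last.parent.name                        # the enclosing tag
neighbor_name = last.previous_sibling.name            # whatever comes just before
earlier_div = last.find_previous_sibling('div')['id'] # filtered: nearest earlier <div>

print(parent_name, neighbor_name, earlier_div)
```

The unfiltered previous_sibling stops at the <p>; the filtered search skips over it to the earlier <div>.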

Access to Text

Text is represented in the parse tree as instances of NavigableString, a subclass of unicode. This can be confusing when stepping through a parse tree, e.g. using .next. You have to allow for the NavigableString elements. As a shortcut, if a tag contains only text (i.e. it has a single child which is a NavigableString), the text is available in the string attribute of the tag.
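The mix of tags and strings is easy to see on a small fragment. This sketch assumes Python 3 and the bs4 package (Beautiful Soup 4), where NavigableString is a subclass of str rather than unicode; the sample HTML is made up.

```python
from bs4 import BeautifulSoup, NavigableString  # assumption: Beautiful Soup 4 is installed

data = '<p>Hello <b>big</b> world</p>'
soup = BeautifulSoup(data, 'html.parser')
p = soup.p

# The children of <p> are a string, a tag, and another string
kinds = [type(child).__name__ for child in p.contents]
print(kinds)

# Keep only the text children
texts = [child for child in p.contents if isinstance(child, NavigableString)]
print(texts)

print(p.b.string)  # <b> has a single text child, so .string gives that text
print(p.string)    # <p> has several children, so .string is None
```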

Attribute text is accessed with indexing notation. For example, to print all the hrefs in a soup:

for anchor in soup.findAll('a', href=True):
  print anchor['href']

A Simple Example

Here is a complete program to fetch the PySIG event history for 2007 and print the entries in the "How Many" column:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch and parse the data
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)

table = soup.table                         # The first table
for tr in table.findAll('tr')[1:]:         # All rows except the header
    count = tr.contents[-1].string.strip() # The count is in the last column
    print count
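To experiment with the same extraction without network access, here is a variation that parses an inline table. It assumes Python 3 and the bs4 package (Beautiful Soup 4), where findAll() is spelled find_all(). The table contents are invented stand-ins, not data from the PySIG page.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Made-up stand-in for the fetched page
data = '''<table>
<tr><th>Date</th><th>Topic</th><th>How Many</th></tr>
<tr><td>Jan</td><td>Topic A</td><td> 12 </td></tr>
<tr><td>Feb</td><td>Topic B</td><td> 9 </td></tr>
</table>'''
soup = BeautifulSoup(data, 'html.parser')

counts = []
for tr in soup.table.find_all('tr')[1:]:     # All rows except the header
    cells = tr.find_all('td')                # Only the <td> tags in this row
    counts.append(cells[-1].string.strip())  # The count is in the last column
print(counts)
```

Looking up the <td> cells explicitly, rather than taking tr.contents[-1], avoids picking up a whitespace string when the markup has a line break before the closing </tr>.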
 
© Kent S Johnson Creative Commons License
