Kent's Korner
by Kent S Johnson

2008-03-22 21:21:27

urllib2 Cookbook

urllib2 is a versatile module that can read data from URLs using HTTP, HTTPS, FTP, Gopher and file protocols. For HTTP and HTTPS requests it can handle redirects, basic and digest authentication and proxy servers. This tutorial focuses on using urllib2 to fetch web pages over HTTP; the same techniques apply to HTTPS.

GET requests

The simplest use of urllib2 is to fetch the contents of an unauthenticated web page into a string using a GET request. This is done with urllib2.urlopen():

import urllib2
f = urllib2.urlopen('http://coolsite.example.com')
data = f.read()
f.close()

The response code is available as f.code and the response headers as f.info(). The actual URL fetched is available as f.geturl(); in the case of a redirect this will differ from the URL you provide.

The object returned by urlopen() is a file-like object with read() and readlines() methods. In this case it is actually a wrapper around a socket. Best practice is to explicitly close() the wrapper, but as with a file, for quick-and-dirty code you can skip the close() call.

Redirects

urlopen() and the other open() methods shown below will transparently handle redirects. If the original URL returns a status code 301 or 302, the new address will be fetched within the same urlopen call.

You can detect a redirect by comparing the URL actually returned, f.geturl(), with the URL provided to urlopen(). The status code returned in f.code is the status from the second request. If you need the code from the original request, see the note about Universal Feed Parser in the Other resources section.

POST requests

To POST to a form, provide the POST data as a URL-encoded string of key=value pairs. The function urllib.urlencode() creates such a string from a dictionary or from a list of (key, value) tuples. For example:

import urllib, urllib2
params = urllib.urlencode(dict(username='joe', password='secret'))
f = urllib2.urlopen('http://coolsite.example.com/login/', params)
data = f.read()
f.close()

Note that it is the presence of the data argument that signals a POST request. To make a GET request with query parameters, add the parameters to the URL you supply.

Authentication

A common requirement is to authenticate at a web site using either Basic Authentication or form-based authentication and cookies.

Basic Authentication is part of the HTTP protocol. When you browse to a web site that uses basic auth, the browser pops up a dialog asking for your username and password. Once you supply them, the browser includes your credentials with every subsequent request to the same web site.

Form-based authentication uses an HTML form to request your credentials. When you are authenticated, the web server sends a cookie to the browser that represents your login session. This cookie must be included with subsequent requests.

For either of these methods you must configure an OpenerDirector with a handler for basic auth or cookies and use the configured OpenerDirector instance to fetch pages. This is done by calling urllib2.build_opener() with the correct handlers and using the returned opener to access web pages.

Basic Authentication

To authenticate using basic authentication you must configure an opener with an HTTPBasicAuthHandler. The handler itself must be configured with the authentication realm, server address, username and password. Here is an example:

# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password('realm', 'example.com', 'username', 'password')
opener = urllib2.build_opener(auth_handler)

The open() method of the returned opener can be used to read pages that require authentication:

data = opener.open('http://example.com/my/protected/page.html').read()

You can also install the opener globally so it will be used with urlopen():

urllib2.install_opener(opener)
data = urllib2.urlopen('http://example.com/my/protected/page.html').read()

Note: the realm is defined by the web server that requests authentication. When you use your browser to log on with basic auth, the browser dialog will say something like "Enter user name and password for 'realm' at http://example.com." The realm displayed in the browser is the value you must pass to add_password() when configuring the HTTPBasicAuthHandler.

Form-based authentication

Form-based login is a little more complicated than basic auth; the opener must be configured to support cookies, then the username and password must be submitted to the login form.

First configure an opener that will handle cookies. When you add an HTTPCookieProcessor to the opener, it will remember cookies that it receives and include them with subsequent requests:

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)

Next use the opener to POST to the login form and the protected page. The cookie returned by the server will be captured by the HTTPCookieProcessor:

params = urllib.urlencode(dict(username='joe', password='secret'))
f = opener.open('http://coolsite.example.com/login/', params)
data = f.read()
f.close()
f = opener.open('http://coolsite.example.com/my/protected/page.html')
data = f.read()
f.close()

Request headers

You can specify the values for request headers by providing an explicit Request object to urlopen() or open() instead of a URL. For example to provide User-Agent and If-Modified-Since headers:

opener = urllib2.build_opener()
request = urllib2.Request('http://coolsite.example.com/')
request.add_header('If-Modified-Since', 'Sun, 02 Mar 2008 04:00:08 GMT')
opener.addheaders = [('User-Agent', 'My cool client')]
data = opener.open(request).read()

Most headers can be specified with the add_header() method. For the User-Agent header you have to override the default User-Agent by assigning to opener.addheaders.

Debugging

urllib2 uses httplib for the actual HTTP transfers. httplib has fairly extensive debugging output available. You can enable debug output from urllib2 by explicitly creating an HTTPHandler and passing it a debug flag:

opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
data = opener.open('http://coolsite.example.com').read()

Socket timeout

The default socket timeout is None, meaning connections never time out. If you want your connections to time out, for example when a server cannot be reached, call socket.setdefaulttimeout(secs) with the desired timeout in seconds. Note that this is a global setting that affects every subsequent socket connection.
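For example, to give every subsequent connection a ten-second timeout:

```python
import socket

# Global setting: affects every socket opened after this call,
# including those created by urllib2.
socket.setdefaulttimeout(10)

# A urllib2 request that times out surfaces as a urllib2.URLError
# wrapping the underlying socket.timeout.
```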

Problems

There is a long-standing problem in httplib where it fails to handle a particular HTTP 1.1 protocol error. The symptom is a transfer that fails with the error

ValueError: invalid literal for int()

The problem is discussed here: http://bugs.python.org/issue1486335. Apparently it is fixed for Python 2.6: http://bugs.python.org/issue900744. Meanwhile a workaround is to use urllib.urlopen() which does not have the same problem.

Other resources

urllib2 documentation

Mark Pilgrim's Universal Feed Parser includes a replacement for the standard redirect handlers that remembers the redirect code (look for _open_resource() and _FeedURLHandler). With the standard handlers there is no way to retrieve the redirect status code from the original fetch.

Fuzzyman has lengthier coverage of this material: urllib2 - The Missing Manual, Basic authentication and cookielib and ClientCookie.

RFC 2616 defines the HTTP protocol.

 
© Kent S Johnson Creative Commons License
