************************************************
HOWTO Fetch Internet Resources Using urllib2
************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.

Introduction
============

.. sidebar:: Related Articles

    You may also find the following article on fetching web resources
    with Python useful:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

      A tutorial on *Basic Authentication*, with examples in Python.

**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. It is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib2` docs, but is supplementary to them.

Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib2 mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can, for example, call ``.read()`` on the
response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the
web. Not all POSTs have to come from forms: you can use a POST to transmit
arbitrary data to your own application. In the common case of HTML forms, the
data needs to be encoded in a standard way, and then passed to the Request
object as the ``data`` argument. The encoding is done using a function from the
``urllib`` library, *not* from ``urllib2``. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that GET
requests are *never* supposed to cause side-effects (while POSTs may), nothing
prevents a GET request from having side-effects, nor a POST request from having
none. Data can also be passed in an HTTP GET request by encoding it in the URL
itself.

This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> response = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
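
To make the mechanics of that encoding step concrete, here is a minimal sketch
(``toy_urlencode`` is an invented helper, not part of ``urllib``); it handles
only simple values with spaces, whereas the real function also percent-escapes
reserved characters such as ``&``, ``=`` and non-ASCII bytes:

```python
# Hypothetical toy version of urllib.urlencode, for illustration
# only: join key=value pairs with '&', replacing spaces with '+'.
def toy_urlencode(params):
    pairs = []
    for key in sorted(params):  # sorted for a predictable order
        value = str(params[key]).replace(' ', '+')
        pairs.append('%s=%s' % (key, value))
    return '&'.join(pairs)

query = toy_urlencode({'name': 'Somebody Here', 'language': 'Python'})
full_url = 'http://www.example.com/example.cgi' + '?' + query
```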

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib2 identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release, e.g. ``Python-urllib/2.5``), which may confuse
the site, or just plain not work. The way a browser identifies itself is through
the ``User-Agent`` header [#]_. When you create a Request object you can pass a
dictionary of headers in. The following example makes the same request as above,
but identifies itself as a version of Internet Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_,
which comes after we take a look at what happens when things go wrong.

Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, builtin exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case
of HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except URLError, e:
    ...     print e.reason
    ...
    (4, 'getaddrinfo failed')

HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib2 will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute,
which corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.

``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
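
As a sketch of how such a table can be used (the ``describe`` helper and the
two-entry excerpt below are our own, for illustration only):

```python
# A two-entry excerpt of the table above, plus a hypothetical
# helper that turns a numeric status code into a readable message.
responses = {
    404: ('Not Found', 'Nothing matches the given URI'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
}

def describe(code):
    # Fall back to a generic message for codes not in the excerpt.
    short, long_msg = responses.get(
        code, ('Unknown', 'Unrecognised status code'))
    return '%d %s: %s' % (code, short, long_msg)
```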

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response
object for the page returned. This means that as well as the code attribute, it
also has ``read``, ``geturl``, and ``info`` methods. ::

    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except URLError, e:
    ...     print e.code
    ...     print e.read()
    ...
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
        type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there
are two basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::


    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        pass


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine
        pass
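
The ``hasattr`` technique from the second approach can also be factored into a
small function; this sketch (the name ``describe_error`` is ours, not part of
urllib2) returns a message instead of printing one, and works on any object
carrying a ``reason`` or ``code`` attribute:

```python
# Hypothetical helper based on approach number 2: inspect the
# attributes of the error object to decide what went wrong.
# Check 'reason' first, then fall back to the HTTP 'code'.
def describe_error(err):
    if hasattr(err, 'reason'):
        return 'We failed to reach a server. Reason: %s' % (err.reason,)
    elif hasattr(err, 'code'):
        return ("The server couldn't fulfill the request. "
                'Error code: %s' % (err.code,))
    return 'Unknown error: %r' % (err,)

# Stand-in error objects for demonstration (no network needed).
class FakeURLError(Exception):
    reason = (4, 'getaddrinfo failed')

class FakeHTTPError(Exception):
    code = 404
```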

info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
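
To get a feel for that dictionary-like interface without fetching anything,
here is a rough stand-in: the stdlib ``email`` parser builds a similar
case-insensitive header mapping from a raw header block (the header values
below are made-up example data, not from a real response):

```python
import email

# Made-up example headers, in the raw wire format.
raw_headers = (
    'Content-Type: text/html; charset=utf-8\r\n'
    'Content-Length: 1234\r\n'
    '\r\n'
)

# Parse into a dictionary-like message object; lookups are
# case-insensitive, much like the object returned by info().
headers = email.message_from_string(raw_headers)
content_type = headers['Content-Type']
content_length = headers['content-length']
```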

Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication, and
other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
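
The opener/handler division is essentially a dispatch pattern. This toy sketch
(all class names invented for illustration; these are *not* the urllib2
classes) shows an opener asking each registered handler in turn whether it
claims the URL's scheme:

```python
# Toy illustration of the opener/handler split: the opener does no
# fetching itself, it only dispatches to the first handler that
# claims the URL's scheme.
class ToyHTTPHandler(object):
    scheme = 'http'
    def open(self, url):
        return 'fetched %s via %s' % (url, self.scheme)

class ToyFTPHandler(object):
    scheme = 'ftp'
    def open(self, url):
        return 'fetched %s via %s' % (url, self.scheme)

class ToyOpener(object):
    def __init__(self):
        self.handlers = []
    def add_handler(self, handler):
        self.handlers.append(handler)
    def open(self, url):
        scheme = url.split(':', 1)[0]
        for handler in self.handlers:
            if handler.scheme == scheme:
                return handler.open(url)
        raise ValueError('no handler for scheme %r' % scheme)

opener = ToyOpener()
opener.add_handler(ToyHTTPHandler())
opener.add_handler(ToyFTPHandler())
```

The real ``build_opener`` plays the role of the ``add_handler`` calls above,
pre-installing a sensible default set of handlers.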

Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject -
including an explanation of how Basic Authentication works - see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g. ::

    WWW-Authenticate: Basic realm="cPanel Users"

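
A client needs to pull the scheme and realm out of that header before it can
reply. This sketch (``parse_auth_header`` is a hypothetical helper, handling
only the simple quoted form shown above) does just that:

```python
import re

# Hypothetical parser for the simple form of the header shown
# above: extract the authentication SCHEME and the REALM name.
def parse_auth_header(value):
    match = re.match(r'\s*(\w+)\s+realm="([^"]*)"', value)
    if match is None:
        raise ValueError('unparseable WWW-Authenticate header: %r' % value)
    return match.group(1), match.group(2)

scheme, realm = parse_auth_header('Basic realm="cPanel Users"')
```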

The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL, which will be supplied
unless you provide an alternative combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match. ::

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    - ``ProxyHandler``, ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number),
e.g. "http://example.com/", *or* an "authority" (i.e. the hostname, optionally
including the port number), e.g. "example.com" or "example.com:8080". The
authority, if present, must NOT contain the "userinfo" component - for example
"joe:password@example.com" is not correct.


Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain. Normally that's
a good thing, but there are occasions when it may not be helpful [#]_. One way
to disable automatic proxy handling is to set up our own ``ProxyHandler``, with
no proxies defined. This is done using similar steps to setting up a `Basic
Authentication`_ handler: ::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)

.. note::

    Currently ``urllib2`` *does not* support fetching of ``https`` locations
    through a proxy. However, this can be enabled by extending urllib2 as
    shown in the recipe [#]_.

Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib2 uses
the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the httplib or urllib2 levels. However,
you can set the default timeout globally for all sockets using ::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)

-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib2 picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib2 from using
       the proxy.
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.