************************************************
  HOWTO Fetch Internet Resources Using urllib2
************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.
       
Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.
       
**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in the URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib2` docs, but is supplementary to them.

       
Fetching URLs
=============

The simplest way to use urllib2 is as follows::

    import urllib2
    response = urllib2.urlopen('http://python.org/')
    html = response.read()

Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.).  However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.
       
HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib2 mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib2

    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle all URL
schemes.  For example, you can make an FTP request like so::

    req = urllib2.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server.  Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers".  Let's look at each of
these in turn.
       
Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the
web. Not all POSTs have to come from forms: you can use a POST to transmit
arbitrary data to your own application. In the common case of HTML forms, the
data needs to be encoded in a standard way, and then passed to the Request
object as the ``data`` argument. The encoding is done using a function from the
``urllib`` library, *not* from ``urllib2``. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).
       
If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door).  Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> response = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
       
Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib2 identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.

       
Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as usual
with Python APIs, builtin exceptions such as
:exc:`ValueError`, :exc:`TypeError` etc. may also
be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist.  In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except URLError, e:
    ...     print e.reason
    ...
    (4, 'getaddrinfo failed')
       
HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib2 will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.

``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::
       
    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
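
Rather than copying this table into your own code, you can look a status code
up in the dictionary directly. A quick interactive sketch::

    >>> import BaseHTTPServer
    >>> BaseHTTPServer.BaseHTTPRequestHandler.responses[404]
    ('Not Found', 'Nothing matches the given URI')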
       
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response
object for the page returned. This means that as well as the code attribute, it
also has read, geturl, and info methods. ::

    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except URLError, e:
    ...     print e.code
    ...     print e.read()
    ...
    404
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <?xml-stylesheet href="./css/ht2html.css"
        type="text/css"?>
    <html><head><title>Error 404: File Not Found</title>
    ...... etc...
       
Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        the_page = response.read()


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.
       
Number 2
~~~~~~~~

::

    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine
        the_page = response.read()
       
info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two useful
methods :meth:`info` and :meth:`geturl`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
``httplib.HTTPMessage`` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
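
A quick sketch of both methods (the exact headers you see will of course depend
on the server)::

    import urllib2

    response = urllib2.urlopen('http://www.python.org/')

    # the URL actually fetched, after any redirects have been followed
    print response.geturl()

    # an httplib.HTTPMessage instance with dictionary-like access to headers
    headers = response.info()
    print headers['Content-Type']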
       
Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call.  ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.
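
As a minimal sketch, using ``HTTPCookieProcessor`` (the cookie-handling handler
mentioned above) as the extra handler::

    import urllib2

    # build_opener installs the default handlers for us,
    # plus any extras we pass in - here, cookie handling
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    response = opener.open('http://www.example.com/')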
       
Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
urls in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
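
For example (a minimal sketch; ``some_handler`` and ``a_url`` are placeholders
for a handler instance and a URL of your choosing)::

    opener = urllib2.build_opener(some_handler)

    # call the opener directly ...
    response = opener.open(a_url)

    # ... or install it, so that urlopen uses it from now on
    urllib2.install_opener(opener)
    response = urllib2.urlopen(a_url)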
       
Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication.  This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g. ::

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use
an ``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
in the absence of you providing an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match. ::
       
    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)
       
.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler``, ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``HTTPErrorProcessor``.
       
``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or "example.com:8080"
(the latter example includes a port number).  The authority, if present, must
NOT contain the "userinfo" component - for example "joe:password@example.com" is
not correct.
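
In other words, with the password manager from the example above, either of
these forms would be acceptable::

    # a full URL, scheme and all ...
    password_mgr.add_password(None, "http://example.com/foo/", username, password)

    # ... or just an authority (hostname, with an optional port)
    password_mgr.add_password(None, "example.com:8080", username, password)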
       
Proxies
=======

**urllib2** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler`` which is part of the normal handler chain. Normally that's
a good thing, but there are occasions when it may not be helpful [#]_. One way
to disable automatic proxy handling is to set up our own ``ProxyHandler``, with
no proxies defined. This is done using similar steps to setting up a `Basic
Authentication`_ handler::

    >>> proxy_support = urllib2.ProxyHandler({})
    >>> opener = urllib2.build_opener(proxy_support)
    >>> urllib2.install_opener(opener)
       
.. note::

    Currently ``urllib2`` *does not* support fetching of ``https`` locations
    through a proxy.  However, this can be enabled by extending urllib2 as
    shown in the recipe [#]_.
       
Sockets and Layers
==================

The Python support for fetching resources from the web is layered. urllib2 uses
the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the httplib or urllib2 levels.  However,
you can set the default timeout globally for all sockets using ::

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
       
-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] For an introduction to the CGI protocol see
       `Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use Google from a program
       is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
       `Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
       for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib2 picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib2 from using
       the proxy.
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.