
:mod:`robotparser` --- Parser for robots.txt
=============================================

.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>


.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

.. note::
   The :mod:`robotparser` module has been renamed :mod:`urllib.robotparser` in
   Python 3.0.
   The :term:`2to3` tool will automatically adapt imports when converting
   your sources to 3.0.

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser()

   This class provides a set of methods to read, parse and answer questions
   about a single :file:`robots.txt` file.


   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.


   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.


   .. method:: parse(lines)

      Parses *lines*, a sequence of lines from a :file:`robots.txt` file, and
      stores the resulting rules for later :meth:`can_fetch` queries.


   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.


   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.


   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

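
As a rough illustration of how a long-running spider might combine
:meth:`mtime` and :meth:`modified`, the sketch below decides when a cached set
of rules should be fetched again.  The ``needs_refresh`` helper, the
``MAX_AGE_SECONDS`` constant and the inline rules are assumptions made for this
example and are not part of the module.  It also assumes that :meth:`mtime`
reports a ``time.time()``-style timestamp, and it falls back to the Python 3
module name when :mod:`robotparser` cannot be imported.

```python
import time

try:
    import robotparser  # module name documented on this page
except ImportError:
    import urllib.robotparser as robotparser  # renamed module in Python 3

# Hypothetical refresh interval chosen for this sketch (one day).
MAX_AGE_SECONDS = 24 * 60 * 60


def needs_refresh(parser, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the parser's rules are older than max_age seconds."""
    if now is None:
        now = time.time()
    return now - parser.mtime() > max_age


rp = robotparser.RobotFileParser()
# Made-up rules, parsed locally so the sketch needs no network access.
rp.parse(["User-agent: *", "Disallow: /private/"])
rp.modified()  # record the current time as the time of the last fetch

fetched_at = rp.mtime()
# One minute after the fetch the rules still count as fresh.
fresh = needs_refresh(rp, now=fetched_at + 60)
# Just over a day later they count as stale.
stale = needs_refresh(rp, now=fetched_at + MAX_AGE_SECONDS + 1)
```

A real spider would call :meth:`read` followed by :meth:`modified` whenever
``needs_refresh`` reports stale rules.
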
The following example demonstrates basic use of the :class:`RobotFileParser`
class. ::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True

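
When the contents of a :file:`robots.txt` file are already available, for
example from a cache, :meth:`parse` can be used instead of :meth:`set_url` and
:meth:`read`.  The following is a minimal sketch under that assumption.  The
rules in ``ROBOTS_TXT`` and the ``www.example.com`` URLs are invented for
illustration, and the import falls back to the Python 3 module name.

```python
try:
    import robotparser  # module name documented on this page
except ImportError:
    import urllib.robotparser as robotparser  # renamed module in Python 3

# Invented rules for illustration; a real file would come from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() expects a sequence of lines
rp.modified()  # mark the rules as fetched now

# Paths under /cgi-bin/ are disallowed for every user agent.
blocked = rp.can_fetch(
    "*", "http://www.example.com/cgi-bin/search?city=San+Francisco")
# Any other path is allowed.
allowed = rp.can_fetch("*", "http://www.example.com/")
```

With these rules, ``blocked`` is false and ``allowed`` is true.
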