symbian-qemu-0.9.1-12/python-2.6.1/Doc/library/re.rst
changeset 1 2fb8b9db1c86
0:ffa851df0825 1:2fb8b9db1c86
       
     1 
       
     2 :mod:`re` --- Regular expression operations
       
     3 ===========================================
       
     4 
       
     5 .. module:: re
       
     6    :synopsis: Regular expression operations.
       
     7 .. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
       
     8 .. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
       
     9 
       
    10 
       
    11 
       
    12 
       
    13 This module provides regular expression matching operations similar to
       
    14 those found in Perl. Both patterns and strings to be searched can be
       
    15 Unicode strings as well as 8-bit strings.  The :mod:`re` module is
       
    16 always available.
       
    17 
       
    18 Regular expressions use the backslash character (``'\'``) to indicate
       
    19 special forms or to allow special characters to be used without invoking
       
    20 their special meaning.  This collides with Python's usage of the same
       
    21 character for the same purpose in string literals; for example, to match
       
    22 a literal backslash, one might have to write ``'\\\\'`` as the pattern
       
    23 string, because the regular expression must be ``\\``, and each
       
    24 backslash must be expressed as ``\\`` inside a regular Python string
       
    25 literal.
       
    26 
       
    27 The solution is to use Python's raw string notation for regular expression
       
    28 patterns; backslashes are not handled in any special way in a string literal
       
    29 prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
       
    30 ``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
       
    31 newline.  Usually patterns will be expressed in Python code using this raw
       
    32 string notation.
       
    33 
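As a brief illustration of the two paragraphs above, the following sketch compares the ordinary and raw spellings of a pattern that matches one literal backslash.  The variable names and the sample text are chosen for illustration only.

```python
import re

# Both literals denote the same two-character pattern string: \\
plain_pattern = '\\\\'   # ordinary literal: every backslash doubled again
raw_pattern = r'\\'      # raw literal: written exactly as the regex reads

sample = 'C:\\temp'      # the seven-character string C:\temp
found = re.search(raw_pattern, sample)

# r"\n" holds a backslash and an 'n'; "\n" holds a single newline.
raw_length = len(r"\n")
cooked_length = len("\n")
```

Here ``found.group(0)`` is the single backslash located at index 2 of ``sample``.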
       
    34 It is important to note that most regular expression operations are available as
       
    35 module-level functions and :class:`RegexObject` methods.  The functions are
       
    36 shortcuts that don't require you to compile a regex object first, but miss some
       
    37 fine-tuning parameters.
       
    38 
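A minimal sketch of the difference, assuming an illustrative sample string: the module-level function needs no compile step, while the ``search`` method of a compiled object additionally accepts a starting position.

```python
import re

text = 'spam eggs spam'

# Module-level shortcut: no explicit compile step is needed.
first = re.search(r'spam', text)

# Compiled object: search() also takes a start position, which the
# module-level function does not offer.
pattern = re.compile(r'spam')
second = pattern.search(text, 1)

first_span = first.span()
second_span = second.span()
```

Starting the scan at index 1 skips the first occurrence, so ``second`` refers to the later ``'spam'``.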
       
    39 .. seealso::
       
    40 
       
    41    Mastering Regular Expressions
       
    42       Book on regular expressions by Jeffrey Friedl, published by O'Reilly.  The
       
    43       second edition of the book no longer covers Python at all, but the first
       
    44       edition covered writing good regular expression patterns in great detail.
       
    45 
       
    46    `Kodos <http://kodos.sf.net/>`_
       
    47       is a graphical regular expression debugger written in Python.
       
    48 
       
    49 
       
    50 .. _re-syntax:
       
    51 
       
    52 Regular Expression Syntax
       
    53 -------------------------
       
    54 
       
    55 A regular expression (or RE) specifies a set of strings that matches it; the
       
    56 functions in this module let you check if a particular string matches a given
       
    57 regular expression (or if a given regular expression matches a particular
       
    58 string, which comes down to the same thing).
       
    59 
       
    60 Regular expressions can be concatenated to form new regular expressions; if *A*
       
    61 and *B* are both regular expressions, then *AB* is also a regular expression.
       
    62 In general, if a string *p* matches *A* and another string *q* matches *B*, the
       
    63 string *pq* will match *AB*.  This holds unless *A* or *B* contain low precedence
       
    64 operations; boundary conditions between *A* and *B*; or have numbered group
       
    65 references.  Thus, complex expressions can easily be constructed from simpler
       
    66 primitive expressions like the ones described here.  For details of the theory
       
    67 and implementation of regular expressions, consult the Friedl book referenced
       
    68 above, or almost any textbook about compiler construction.
       
    69 
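The low-precedence caveat above can be seen in a short sketch.  The names ``A`` and ``B`` mirror the text; wrapping each part in a non-capturing group is one possible way to make concatenation safe, not the only one.

```python
import re

A = r'a|b'   # contains the low-precedence '|' operator
B = r'c'

# Plain concatenation yields 'a|bc', which parses as "a" or "bc".
naive = re.match(A + B, 'ac')

# Wrapping each part in a non-capturing group keeps the parts intact.
grouped = re.match('(?:' + A + ')(?:' + B + ')', 'ac')

naive_text = naive.group(0)
grouped_text = grouped.group(0)
```

The naive pattern stops after ``'a'``, while the grouped pattern matches the whole string ``'ac'``.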
       
    70 A brief explanation of the format of regular expressions follows.  For further
       
    71 information and a gentler presentation, consult the :ref:`regex-howto`.
       
    72 
       
    73 Regular expressions can contain both special and ordinary characters. Most
       
    74 ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
       
    75 expressions; they simply match themselves.  You can concatenate ordinary
       
    76 characters, so ``last`` matches the string ``'last'``.  (In the rest of this
       
    77 section, we'll write REs in ``this special style``, usually without quotes, and
       
    78 strings to be matched ``'in single quotes'``.)
       
    79 
       
    80 Some characters, like ``'|'`` or ``'('``, are special. Special
       
    81 characters either stand for classes of ordinary characters, or affect
       
    82 how the regular expressions around them are interpreted. Regular
       
    83 expression pattern strings may not contain null bytes, but can specify
       
    84 the null byte using the ``\number`` notation, e.g., ``'\x00'``.
       
    85 
       
    86 
       
    87 The special characters are:
       
    88 
       
    89 ``'.'``
       
    90    (Dot.)  In the default mode, this matches any character except a newline.  If
       
    91    the :const:`DOTALL` flag has been specified, this matches any character
       
    92    including a newline.
       
    93 
       
    94 ``'^'``
       
    95    (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
       
    96    matches immediately after each newline.
       
    97 
       
    98 ``'$'``
       
    99    Matches the end of the string or just before the newline at the end of the
       
   100    string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
       
   101    matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
       
   102    only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
       
   103    matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
       
   104    a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
       
   105    the newline, and one at the end of the string.
       
   106 
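The ``foo.$`` case described for ``'$'`` can be checked directly; this sketch only restates the example from the text in runnable form.

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the end of the string or just
# before the final newline.
default_match = re.search(r'foo.$', text).group(0)

# MULTILINE mode: '$' also matches before every newline.
multiline_match = re.search(r'foo.$', text, re.MULTILINE).group(0)
```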
       
   107 ``'*'``
       
   108    Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
       
   109    many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
       
   110    by any number of 'b's.
       
   111 
       
   112 ``'+'``
       
   113    Causes the resulting RE to match 1 or more repetitions of the preceding RE.
       
   114    ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
       
   115    match just 'a'.
       
   116 
       
   117 ``'?'``
       
   118    Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
       
   119    ``ab?`` will match either 'a' or 'ab'.
       
   120 
       
   121 ``*?``, ``+?``, ``??``
       
   122    The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
       
   123    as much text as possible.  Sometimes this behaviour isn't desired; if the RE
       
   124    ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
       
   125    string, and not just ``'<H1>'``.  Adding ``'?'`` after the qualifier makes it
       
   126    perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
       
   127    characters as possible will be matched.  Using ``.*?`` in the previous
       
   128    expression will match only ``'<H1>'``.
       
   129 
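The greedy and non-greedy behaviour described above, sketched with the same sample string used in the text:

```python
import re

tagged = '<H1>title</H1>'

# Greedy: '.*' consumes as much as possible, so the match runs to the
# last '>' in the string.
greedy = re.match(r'<.*>', tagged).group(0)

# Non-greedy: '.*?' consumes as little as possible, so the match stops
# at the first '>'.
minimal = re.match(r'<.*?>', tagged).group(0)
```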
       
   130 ``{m}``
       
   131    Specifies that exactly *m* copies of the previous RE should be matched; fewer
       
   132    matches cause the entire RE not to match.  For example, ``a{6}`` will match
       
   133    exactly six ``'a'`` characters, but not five.
       
   134 
       
   135 ``{m,n}``
       
   136    Causes the resulting RE to match from *m* to *n* repetitions of the preceding
       
   137    RE, attempting to match as many repetitions as possible.  For example,
       
   138    ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
       
   139    lower bound of zero, and omitting *n* specifies an infinite upper bound.  As an
       
   140    example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
       
   141    followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
       
   142    modifier would be confused with the previously described form.
       
   143 
       
   144 ``{m,n}?``
       
   145    Causes the resulting RE to match from *m* to *n* repetitions of the preceding
       
   146    RE, attempting to match as *few* repetitions as possible.  This is the
       
   147    non-greedy version of the previous qualifier.  For example, on the
       
   148    6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
       
   149    while ``a{3,5}?`` will only match 3 characters.
       
   150 
       
   151 ``'\'``
       
   152    Either escapes special characters (permitting you to match characters like
       
   153    ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
       
   154    sequences are discussed below.
       
   155 
       
   156    If you're not using a raw string to express the pattern, remember that Python
       
   157    also uses the backslash as an escape sequence in string literals; if the escape
       
   158    sequence isn't recognized by Python's parser, the backslash and subsequent
       
   159    character are included in the resulting string.  However, if Python would
       
   160    recognize the resulting sequence, the backslash should be doubled.  This
       
   161    is complicated and hard to understand, so it's highly recommended that you use
       
   162    raw strings for all but the simplest expressions.
       
   163 
       
   164 ``[]``
       
   165    Used to indicate a set of characters.  Characters can be listed individually, or
       
   166    a range of characters can be indicated by giving two characters and separating
       
   167    them by a ``'-'``.  Special characters are not active inside sets.  For example,
       
   168    ``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
       
   169    ``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
       
   170    ``[a-zA-Z0-9]`` matches any letter or digit.  Character classes such
       
   171    as ``\w`` or ``\S`` (defined below) are also acceptable inside a
       
   172    range, although the characters they match depend on whether :const:`LOCALE`
       
   173    or :const:`UNICODE` mode is in force.  If you want to include a
       
   174    ``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
       
   175    place it as the first character.  The pattern ``[]]`` will match
       
   176    ``']'``, for example.
       
   177 
       
   178    You can match the characters not within a range by :dfn:`complementing` the set.
       
   179    This is indicated by including a ``'^'`` as the first character of the set;
       
   180    ``'^'`` elsewhere will simply match the ``'^'`` character.  For example,
       
   181    ``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
       
   182    character except ``'^'``.
       
   183 
       
   184    Note that inside ``[]`` the special forms and special characters lose
       
   185    their meanings and only the syntaxes described here are valid. For
       
   186    example, ``+``, ``*``, ``(``, ``)``, and so on are treated as
       
   187    literals inside ``[]``, and backreferences cannot be used inside
       
   188    ``[]``.
       
   189 
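A few of the set forms described for ``[]`` are sketched below; the sample strings are illustrative assumptions.

```python
import re

# Characters listed individually; '$' is an ordinary character here.
listed = re.findall(r'[akm$]', 'make $5')

# A leading '^' complements the set.
complemented = re.findall(r'[^5]', '1525')

# A ']' placed first in the set is taken literally.
bracket = re.match(r'[]]', ']')

# '+', '*', '(' and ')' lose their special meaning inside a set.
operators = re.findall(r'[+*()]', '(1+2)*3')
```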
       
   190 ``'|'``
       
   191    ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
       
   192    will match either A or B.  An arbitrary number of REs can be separated by the
       
   193    ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
       
   194    the target string is scanned, REs separated by ``'|'`` are tried from left to
       
   195    right. When one pattern completely matches, that branch is accepted. This means
       
   196    that once ``A`` matches, ``B`` will not be tested further, even if it would
       
   197    produce a longer overall match.  In other words, the ``'|'`` operator is never
       
   198    greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
       
   199    character class, as in ``[|]``.
       
   200 
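The left-to-right behaviour of ``'|'`` means that the order of the alternatives matters.  A minimal sketch with illustrative patterns:

```python
import re

# The first alternative that matches is accepted, even though the
# second one would produce a longer match.
first_wins = re.match(r'ab|abc', 'abcd').group(0)

# Listing the longer alternative first changes the result.
reordered = re.match(r'abc|ab', 'abcd').group(0)

# A literal '|' must be escaped or placed inside a character class.
escaped = re.search(r'a\|b', 'a|b')
in_class = re.search(r'a[|]b', 'a|b')
```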
       
   201 ``(...)``
       
   202    Matches whatever regular expression is inside the parentheses, and indicates the
       
   203    start and end of a group; the contents of a group can be retrieved after a match
       
   204    has been performed, and can be matched later in the string with the ``\number``
       
   205    special sequence, described below.  To match the literals ``'('`` or ``')'``,
       
   206    use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
       
   207 
       
   208 ``(?...)``
       
   209    This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
       
   210    otherwise).  The first character after the ``'?'`` determines what the meaning
       
   211    and further syntax of the construct is. Extensions usually do not create a new
       
   212    group; ``(?P<name>...)`` is the only exception to this rule. Following are the
       
   213    currently supported extensions.
       
   214 
       
   215 ``(?iLmsux)``
       
   216    (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
       
   217    ``'u'``, ``'x'``.)  The group matches the empty string; the letters
       
   218    set the corresponding flags: :const:`re.I` (ignore case),
       
   219    :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
       
   220    :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
       
   221    and :const:`re.X` (verbose), for the entire regular expression. (The
       
   222    flags are described in :ref:`contents-of-module-re`.) This
       
   223    is useful if you wish to include the flags as part of the regular
       
   224    expression, instead of passing a *flag* argument to the
       
   225    :func:`compile` function.
       
   226 
       
   227    Note that the ``(?x)`` flag changes how the expression is parsed. It should be
       
   228    used first in the expression string, or after one or more whitespace characters.
       
   229    If there are non-whitespace characters before the flag, the results are
       
   230    undefined.
       
   231 
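As a small sketch of the inline flag notation, the following two calls are intended to behave the same way; the flag group is placed first in the pattern, as recommended above.

```python
import re

# Flag embedded in the pattern itself.
inline = re.match(r'(?i)spam', 'SPAM')

# The same flag passed to compile() instead.
explicit = re.compile(r'spam', re.I).match('SPAM')

inline_text = inline.group(0)
explicit_text = explicit.group(0)
```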
       
   232 ``(?:...)``
       
   233    A non-grouping version of regular parentheses. Matches whatever regular
       
   234    expression is inside the parentheses, but the substring matched by the group
       
   235    *cannot* be retrieved after performing a match or referenced later in the
       
   236    pattern.
       
   237 
       
   238 ``(?P<name>...)``
       
   239    Similar to regular parentheses, but the substring matched by the group is
       
   240    accessible via the symbolic group name *name*.  Group names must be valid Python
       
   241    identifiers, and each group name must be defined only once within a regular
       
   242    expression.  A symbolic group is also a numbered group, just as if the group
       
   243    were not named.  So the group named 'id' in the example below can also be
       
   244    referenced as the numbered group 1.
       
   245 
       
   246    For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
       
   247    referenced by its name in arguments to methods of match objects, such as
       
   248    ``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
       
   249    example, ``(?P=id)``) and replacement text (such as ``\g<id>``).
       
   250 
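The different ways of referring to the named group can be sketched as follows.  The sample strings and variable names are illustrative; the pattern is the one given in the text.

```python
import re

identifier = re.compile(r'(?P<id>[a-zA-Z_]\w*)')
m = identifier.match('spam_1 = 2')

by_name = m.group('id')    # access through the symbolic name
by_number = m.group(1)     # the same group is also group 1
end_of_id = m.end('id')    # index just past the matched identifier

# In replacement text, \g<id> inserts whatever the group matched.
wrapped = identifier.sub(r'<\g<id>>', 'a b')

# In pattern text, (?P=id) matches the same text as the earlier group.
repeated = re.match(r'(?P<id>\w+) (?P=id)', 'the the')
```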
       
   251 ``(?P=name)``
       
   252    Matches whatever text was matched by the earlier group named *name*.
       
   253 
       
   254 ``(?#...)``
       
   255    A comment; the contents of the parentheses are simply ignored.
       
   256 
       
   257 ``(?=...)``
       
   258    Matches if ``...`` matches next, but doesn't consume any of the string.  This is
       
   259    called a lookahead assertion.  For example, ``Isaac (?=Asimov)`` will match
       
   260    ``'Isaac '`` only if it's followed by ``'Asimov'``.
       
   261 
       
   262 ``(?!...)``
       
   263    Matches if ``...`` doesn't match next.  This is a negative lookahead assertion.
       
   264    For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
       
   265    followed by ``'Asimov'``.
       
   266 
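Both lookahead assertions test the text that follows without consuming it.  A short sketch using the examples from the two entries above:

```python
import re

# Positive lookahead: succeeds only when 'Asimov' follows.
followed = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
not_followed = re.match(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: succeeds only when 'Asimov' does not follow.
other_name = re.match(r'Isaac (?!Asimov)', 'Isaac Newton')

# The assertion itself is not part of the matched text.
consumed = followed.group(0)
```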
       
   267 ``(?<=...)``
       
   268    Matches if the current position in the string is preceded by a match for ``...``
       
   269    that ends at the current position.  This is called a :dfn:`positive lookbehind
       
   270    assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
       
   271    lookbehind will back up 3 characters and check if the contained pattern matches.
       
   272    The contained pattern must only match strings of some fixed length, meaning that
       
   273    ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
       
   274    patterns which start with positive lookbehind assertions will never match at the
       
   275    beginning of the string being searched; you will most likely want to use the
       
   276    :func:`search` function rather than the :func:`match` function:
       
   277 
       
   278       >>> import re
       
   279       >>> m = re.search('(?<=abc)def', 'abcdef')
       
   280       >>> m.group(0)
       
   281       'def'
       
   282 
       
   283    This example looks for a word following a hyphen:
       
   284 
       
   285       >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
       
   286       >>> m.group(0)
       
   287       'egg'
       
   288 
       
   289 ``(?<!...)``
       
   290    Matches if the current position in the string is not preceded by a match for
       
   291    ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
       
   292    positive lookbehind assertions, the contained pattern must only match strings of
       
   293    some fixed length.  Patterns which start with negative lookbehind assertions may
       
   294    match at the beginning of the string being searched.
       
   295 
       
   296 ``(?(id/name)yes-pattern|no-pattern)``
       
   297    Will try to match with ``yes-pattern`` if the group with given *id* or *name*
       
   298    exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
       
   299    can be omitted. For example,  ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
       
   300    matching pattern, which will match with ``'<user@host.com>'`` as well as
       
   301    ``'user@host.com'``, but not with ``'<user@host.com'``.
       
   302 
       
   303    .. versionadded:: 2.4
       
   304 
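The conditional pattern from the last entry can be exercised with :func:`match`, which anchors the attempt at the start of the string.  This is only a sketch of the example given above, not a recommended way to validate addresses.

```python
import re

address = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

bracketed = address.match('<user@host.com>')   # '<' seen, so '>' required
bare = address.match('user@host.com')          # no '<', so no '>' required
unbalanced = address.match('<user@host.com')   # '<' seen but '>' missing
```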
       
   305 The special sequences consist of ``'\'`` and a character from the list below.
       
   306 If the ordinary character is not on the list, then the resulting RE will match
       
   307 the second character.  For example, ``\$`` matches the character ``'$'``.
       
   308 
       
   309 ``\number``
       
   310    Matches the contents of the group of the same number.  Groups are numbered
       
   311    starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
       
   312    but not ``'the end'`` (note the space after the group).  This special sequence
       
   313    can only be used to match one of the first 99 groups.  If the first digit of
       
   314    *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
       
   315    a group match, but as the character with octal value *number*. Inside the
       
   316    ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
       
   317    characters.
       
   318 
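A minimal sketch of a numbered backreference, using the pattern and strings from the ``\number`` entry:

```python
import re

doubled = re.compile(r'(.+) \1')

# \1 must match exactly the text captured by group 1.
words = doubled.match('the the')
digits = doubled.match('55 55')
different = doubled.match('the end')
```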
       
   319 ``\A``
       
   320    Matches only at the start of the string.
       
   321 
       
   322 ``\b``
       
   323    Matches the empty string, but only at the beginning or end of a word.  A word is
       
   324    defined as a sequence of alphanumeric or underscore characters, so the end of a
       
   325    word is indicated by whitespace or a non-alphanumeric, non-underscore character.
       
   326    Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
       
   327    precise set of characters deemed to be alphanumeric depends on the values of the
       
   328    ``UNICODE`` and ``LOCALE`` flags.  Inside a character range, ``\b`` represents
       
   329    the backspace character, for compatibility with Python's string literals.
       
   330 
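The two meanings of ``\b`` can be sketched as follows; the sample strings are illustrative.

```python
import re

# Outside a character class, \b is a zero-width word boundary.
whole_word = re.search(r'\bfoo\b', 'a foo b')
embedded = re.search(r'\bfoo\b', 'foobar')

# Inside a character class, \b stands for the backspace character.
backspace = re.search(r'[\b]', 'a\bc')
```

Note that the pattern is written as a raw string, so the regular expression parser, not Python's string literal parser, interprets ``\b``.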
       
   331 ``\B``
       
   332    Matches the empty string, but only when it is *not* at the beginning or end of a
       
   333    word.  This is just the opposite of ``\b``, so is also subject to the settings
       
   334    of ``LOCALE`` and ``UNICODE``.
       
   335 
       
   336 ``\d``
       
   337    When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
       
   338    is equivalent to the set ``[0-9]``.  With :const:`UNICODE`, it will match
       
   339    whatever is classified as a digit in the Unicode character properties database.
       
   340 
       
   341 ``\D``
       
   342    When the :const:`UNICODE` flag is not specified, matches any non-digit
       
   343    character; this is equivalent to the set ``[^0-9]``.  With :const:`UNICODE`, it
       
   344    will match anything other than characters marked as digits in the Unicode
       
   345    character properties database.
       
   346 
       
   347 ``\s``
       
   348    When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
       
   349    any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
       
   350    :const:`LOCALE`, it will match this set plus whatever characters are defined as
       
   351    space for the current locale. If :const:`UNICODE` is set, this will match the
       
   352    characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
       
   353    character properties database.
       
   354 
       
   355 ``\S``
       
   356    When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
       
   357    any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
       
   358    With :const:`LOCALE`, it will match any character not in this set, and not
       
   359    defined as space in the current locale. If :const:`UNICODE` is set, this will
       
   360    match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
       
   361    the Unicode character properties database.
       
   362 
       
   363 ``\w``
       
   364    When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
       
   365    any alphanumeric character and the underscore; this is equivalent to the set
       
   366    ``[a-zA-Z0-9_]``.  With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
       
   367    whatever characters are defined as alphanumeric for the current locale.  If
       
   368    :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
       
   369    is classified as alphanumeric in the Unicode character properties database.
       
   370 
       
   371 ``\W``
       
   372    When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
       
   373    any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
       
   374    With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
       
   375    not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
       
   376    this will match anything other than ``[0-9_]`` and characters marked as
       
   377    alphanumeric in the Unicode character properties database.
       
   378 
       
   379 ``\Z``
       
   380    Matches only at the end of the string.
       
   381 
       
   382 Most of the standard escapes supported by Python string literals are also
       
   383 accepted by the regular expression parser::
       
   384 
       
   385    \a      \b      \f      \n
       
   386    \r      \t      \v      \x
       
   387    \\
       
   388 
       
   389 Octal escapes are included in a limited form: If the first digit is a 0, or if
       
   390 there are three octal digits, it is considered an octal escape. Otherwise, it is
       
   391 a group reference.  As for string literals, octal escapes are always at most
       
   392 three digits in length.
       
   393 
       
   394 
       
   395 .. _matching-searching:
       
   396 
       
   397 Matching vs Searching
       
   398 ---------------------
       
   399 
       
   400 .. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
       
   401 
       
   402 
       
   403 Python offers two different primitive operations based on regular expressions:
       
   404 **match** checks for a match only at the beginning of the string, while
       
   405 **search** checks for a match anywhere in the string (this is what Perl does
       
   406 by default).
       
   407 
       
   408 Note that match may differ from search even when using a regular expression
       
   409 beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
       
   410 :const:`MULTILINE` mode also immediately following a newline.  The "match"
       
   411 operation succeeds only if the pattern matches at the start of the string
       
   412 regardless of mode, or at the starting position given by the optional *pos*
       
   413 argument regardless of whether a newline precedes it.
       
   414 
       
   415    >>> re.match("c", "abcdef")  # No match
       
   416    >>> re.search("c", "abcdef") # Match
       
   417    <_sre.SRE_Match object at ...>
       
   418 
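The interaction with ``'^'`` and :const:`MULTILINE` described above can be sketched as follows.  The sample text is an illustrative assumption, and the last call uses the optional position argument of a compiled object's ``match`` method.

```python
import re

text = 'first\nsecond'

# match() is anchored at the very start of the string, so MULTILINE
# does not let it reach the second line.
anchored = re.match(r'^second', text, re.MULTILINE)

# search() may begin at any position, including just after the newline.
scanned = re.search(r'^second', text, re.MULTILINE)

# With a compiled object, match() can be anchored at an explicit position.
offset = len('first\n')
positioned = re.compile(r'second').match(text, offset)
```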
       
   419 
       
   420 .. _contents-of-module-re:
       
   421 
       
   422 Module Contents
       
   423 ---------------
       
   424 
       
   425 The module defines several functions, constants, and an exception. Some of the
       
   426 functions are simplified versions of the full featured methods for compiled
       
   427 regular expressions.  Most non-trivial applications use the compiled
       
   428 form.
       
   429 
       
   430 
       
   431 .. function:: compile(pattern[, flags])
       
   432 
       
   433    Compile a regular expression pattern into a regular expression object, which
       
   434    can be used for matching using its :meth:`match` and :meth:`search` methods,
       
   435    described below.
       
   436 
       
   437    The expression's behaviour can be modified by specifying a *flags* value.
       
   438    Values can be any of the following variables, combined using bitwise OR (the
       
   439    ``|`` operator).
       
   440 
       
   441    The sequence ::
       
   442 
       
   443       prog = re.compile(pat)
       
   444       result = prog.match(str)
       
   445 
       
   446    is equivalent to ::
       
   447 
       
   448       result = re.match(pat, str)
       
   449 
       
   450    but the version using :func:`compile` is more efficient when the expression
       
   451    will be used several times in a single program.
       
   452 
       
   453    .. (The compiled version of the last pattern passed to :func:`re.match` or
       
   454       :func:`re.search` is cached, so programs that use only a single regular
       
   455       expression at a time needn't worry about compiling regular expressions.)
       
   456 
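A short sketch of combining flags with bitwise OR and reusing one compiled object for several strings.  The pattern and the sample list are illustrative assumptions.

```python
import re

# IGNORECASE and DOTALL combined with the '|' operator.
greeting = re.compile(r'hello.world', re.IGNORECASE | re.DOTALL)

lines = ['HELLO\nWorld', 'hello world', 'goodbye world']

# The compiled object is reused for every string in the list.
hits = [line for line in lines if greeting.match(line)]
```

The first entry is accepted only because both flags are in force: the letters differ in case and ``'.'`` has to match a newline.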
       
   457 
       
   458 .. data:: I
       
   459           IGNORECASE
       
   460 
       
   461    Perform case-insensitive matching; expressions like ``[A-Z]`` will match
       
   462    lowercase letters, too.  This is not affected by the current locale.
       
   463 
       
   464 
       
   465 .. data:: L
       
   466           LOCALE
       
   467 
       
   468    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
       
   469    current locale.
       
   470 
       
   471 
       
   472 .. data:: M
       
   473           MULTILINE
       
   474 
       
   475    When specified, the pattern character ``'^'`` matches at the beginning of the
       
   476    string and at the beginning of each line (immediately following each newline);
       
   477    and the pattern character ``'$'`` matches at the end of the string and at the
       
   478    end of each line (immediately preceding each newline).  By default, ``'^'``
       
   479    matches only at the beginning of the string, and ``'$'`` only at the end of the
       
   480    string and immediately before the newline (if any) at the end of the string.
       
   481 
       
   482 
       
   483 .. data:: S
       
   484           DOTALL
       
   485 
       
   486    Make the ``'.'`` special character match any character at all, including a
       
   487    newline; without this flag, ``'.'`` will match anything *except* a newline.
       
   488 
       
   489 
       
   490 .. data:: U
       
   491           UNICODE
       
   492 
       
   493    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
       
   494    on the Unicode character properties database.
       
   495 
       
   496    .. versionadded:: 2.0
       
   497 
       
   498 
       
   499 .. data:: X
       
   500           VERBOSE
       
   501 
       
   502    This flag allows you to write regular expressions that look nicer. Whitespace
       
   503    within the pattern is ignored, except when in a character class or preceded by
       
   504    an unescaped backslash, and, when a line contains a ``'#'`` neither in a
       
   505    character class nor preceded by an unescaped backslash, all characters from the
       
   506    leftmost such ``'#'`` through the end of the line are ignored.
       
   507 
       
   508    That means that the two following regular expression objects that match a
       
   509    decimal number are functionally equal::
       
   510 
       
   511       a = re.compile(r"""\d +  # the integral part
       
   512                          \.    # the decimal point
       
   513                          \d *  # some fractional digits""", re.X)
       
   514       b = re.compile(r"\d+\.\d*")
       
   515 
       
   516 
       
   517 .. function:: search(pattern, string[, flags])
       
   518 
       
   519    Scan through *string* looking for a location where the regular expression
       
   520    *pattern* produces a match, and return a corresponding :class:`MatchObject`
       
   521    instance. Return ``None`` if no position in the string matches the pattern; note
       
   522    that this is different from finding a zero-length match at some point in the
       
   523    string.
       
   524 
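The distinction between "no match" and "a zero-length match" can be seen in a short sketch; the patterns and input here are chosen only for illustration.

```python
import re

# 'x*' can match zero characters, so it succeeds at position 0
# with an empty match, even though the string contains no 'x'.
empty = re.search(r"x*", "abc")

# 'x+' needs at least one 'x', so there is no match at any position.
missing = re.search(r"x+", "abc")

print(empty.span())   # a real match object with an empty span
print(missing)        # None
```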
       
   525 
       
   526 .. function:: match(pattern, string[, flags])
       
   527 
       
   528    If zero or more characters at the beginning of *string* match the regular
       
   529    expression *pattern*, return a corresponding :class:`MatchObject` instance.
       
   530    Return ``None`` if the string does not match the pattern; note that this is
       
   531    different from a zero-length match.
       
   532 
       
   533    .. note::
       
   534 
       
   535       If you want to locate a match anywhere in *string*, use :func:`search`
       
   536       instead.
       
   537 
       
   538 
       
   539 .. function:: split(pattern, string[, maxsplit=0])
       
   540 
       
   541    Split *string* by the occurrences of *pattern*.  If capturing parentheses are
       
   542    used in *pattern*, then the text of all groups in the pattern is also returned
       
   543    as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
       
   544    splits occur, and the remainder of the string is returned as the final element
       
   545    of the list.  (Incompatibility note: in the original Python 1.5 release,
       
   546    *maxsplit* was ignored.  This has been fixed in later releases.)
       
   547 
       
   548       >>> re.split('\W+', 'Words, words, words.')
       
   549       ['Words', 'words', 'words', '']
       
   550       >>> re.split('(\W+)', 'Words, words, words.')
       
   551       ['Words', ', ', 'words', ', ', 'words', '.', '']
       
   552       >>> re.split('\W+', 'Words, words, words.', 1)
       
   553       ['Words', 'words, words.']
       
   554 
       
   555    If there are capturing groups in the separator and it matches at the start of
       
   556    the string, the result will start with an empty string.  The same holds for
       
   557    the end of the string:
       
   558 
       
   559       >>> re.split('(\W+)', '...words, words...')
       
   560       ['', '...', 'words', ', ', 'words', '...', '']
       
   561 
       
   562    That way, separator components are always found at the same relative
       
   563    indices within the result list (e.g., if there's one capturing group
       
   564    in the separator, the 1st, the 3rd and so forth).
       
   565 
       
   566    Note that *split* will never split a string on an empty pattern match.
       
   567    For example:
       
   568 
       
   569       >>> re.split('x*', 'foo')
       
   570       ['foo']
       
   571       >>> re.split("(?m)^$", "foo\n\nbar\n")
       
   572       ['foo\n\nbar\n']
       
   573 
       
   574 
       
   575 .. function:: findall(pattern, string[, flags])
       
   576 
       
   577    Return all non-overlapping matches of *pattern* in *string*, as a list of
       
   578    strings.  The *string* is scanned left-to-right, and matches are returned in
       
   579    the order found.  If one or more groups are present in the pattern, return a
       
   580    list of groups; this will be a list of tuples if the pattern has more than
       
   581    one group.  Empty matches are included in the result unless they touch the
       
   582    beginning of another match.
       
   583 
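How the shape of the result depends on the number of groups can be sketched as follows; the input string is a made-up example.

```python
import re

text = "width=10, height=20"  # hypothetical input

# No groups: a list of the full matches.
no_groups = re.findall(r"\d+", text)

# One group: a list of that group's text for each match.
one_group = re.findall(r"\w+=(\d+)", text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_groups)
print(one_group)
print(two_groups)
```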
       
   584    .. versionadded:: 1.5.2
       
   585 
       
   586    .. versionchanged:: 2.4
       
   587       Added the optional flags argument.
       
   588 
       
   589 
       
   590 .. function:: finditer(pattern, string[, flags])
       
   591 
       
   592    Return an :term:`iterator` yielding :class:`MatchObject` instances over all
       
   593    non-overlapping matches for the RE *pattern* in *string*.  The *string* is
       
   594    scanned left-to-right, and matches are returned in the order found.  Empty
       
   595    matches are included in the result unless they touch the beginning of another
       
   596    match.
       
   597 
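Because each item is a match object rather than a string, positions are available as well as text. A minimal sketch, with an input string invented for the purpose:

```python
import re

# Collect the text and span of every run of digits, in the order found.
found = []
for m in re.finditer(r"\d+", "a1b22c333"):
    found.append((m.group(), m.span()))

print(found)
```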
       
   598    .. versionadded:: 2.2
       
   599 
       
   600    .. versionchanged:: 2.4
       
   601       Added the optional flags argument.
       
   602 
       
   603 
       
   604 .. function:: sub(pattern, repl, string[, count])
       
   605 
       
   606    Return the string obtained by replacing the leftmost non-overlapping occurrences
       
   607    of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
       
   608    *string* is returned unchanged.  *repl* can be a string or a function; if it is
       
   609    a string, any backslash escapes in it are processed.  That is, ``\n`` is
       
   610    converted to a single newline character, ``\r`` is converted to a carriage return, and
       
   611    so forth.  Unknown escapes such as ``\j`` are left alone.  Backreferences, such
       
   612    as ``\6``, are replaced with the substring matched by group 6 in the pattern.
       
   613    For example:
       
   614 
       
   615       >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
       
   616       ...        r'static PyObject*\npy_\1(void)\n{',
       
   617       ...        'def myfunc():')
       
   618       'static PyObject*\npy_myfunc(void)\n{'
       
   619 
       
   620    If *repl* is a function, it is called for every non-overlapping occurrence of
       
   621    *pattern*.  The function takes a single match object argument, and returns the
       
   622    replacement string.  For example:
       
   623 
       
   624       >>> def dashrepl(matchobj):
       
   625       ...     if matchobj.group(0) == '-': return ' '
       
   626       ...     else: return '-'
       
   627       >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
       
   628       'pro--gram files'
       
   629 
       
   630    The pattern may be a string or an RE object; if you need to specify regular
       
   631    expression flags, you must use a RE object, or use embedded modifiers in a
       
   632    pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
       
   633 
       
   634    The optional argument *count* is the maximum number of pattern occurrences to be
       
   635    replaced; *count* must be a non-negative integer.  If omitted or zero, all
       
   636    occurrences will be replaced. Empty matches for the pattern are replaced only
       
   637    when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
       
   638    ``'-a-b-c-'``.
       
   639 
       
   640    In addition to character escapes and backreferences as described above,
       
   641    ``\g<name>`` will use the substring matched by the group named ``name``, as
       
   642    defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
       
   643    group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
       
   644    in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
       
   645    reference to group 20, not a reference to group 2 followed by the literal
       
   646    character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
       
   647    substring matched by the RE.
       
   648 
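The ``\g<...>`` forms described above can be tried in a short sketch. The group names ``year`` and ``month`` and the sample strings are assumptions made for this example only.

```python
import re

# Named groups referenced by name in the replacement.
swapped = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
                 r"\g<month>/\g<year>",
                 "2008-12")

# \g<2>0 means "group 2, then a literal 0"; writing \20 here would
# instead be read as a reference to group 20.
padded = re.sub(r"(\d)(\d)", r"\g<2>0", "12")

# \g<0> stands for the whole match.
bracketed = re.sub(r"\w+", r"[\g<0>]", "a bc")

print(swapped)
print(padded)
print(bracketed)
```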
       
   649 
       
   650 .. function:: subn(pattern, repl, string[, count])
       
   651 
       
   652    Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
       
   653    number_of_subs_made)``.
       
   654 
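A brief sketch of the returned tuple, using an invented input in which runs of whitespace are collapsed:

```python
import re

# Two runs of whitespace are replaced, so the count is 2.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "a  b   c")

print(new_string)
print(number_of_subs_made)
```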
       
   655 
       
   656 .. function:: escape(string)
       
   657 
       
   658    Return *string* with all non-alphanumerics backslashed; this is useful if you
       
   659    want to match an arbitrary literal string that may have regular expression
       
   660    metacharacters in it.
       
   661 
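The following sketch shows why escaping matters when the text to find contains metacharacters. The literal and the haystack are hypothetical, and the exact escaped form is not shown because it may differ between Python versions.

```python
import re

literal = "1+1=2 (approx.)"            # contains '+', '(', ')' and '.'
haystack = "we know 1+1=2 (approx.) holds"

# Escaped: every metacharacter is matched literally.
escaped_hit = re.search(re.escape(literal), haystack)

# Unescaped: '1+1' means "one or more '1' followed by '1'",
# which does not occur in the haystack.
raw_hit = re.search(literal, haystack)

print(escaped_hit.group() == literal)
print(raw_hit)
```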
       
   662 
       
   663 .. exception:: error
       
   664 
       
   665    Exception raised when a string passed to one of the functions here is not a
       
   666    valid regular expression (for example, it might contain unmatched parentheses)
       
   667    or when some other error occurs during compilation or matching.  It is never an
       
   668    error if a string contains no match for a pattern.
       
   669 
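One possible way to use this exception is to validate a pattern before relying on it. The helper name ``is_valid_pattern`` is hypothetical and not part of the module.

```python
import re

def is_valid_pattern(pattern):
    """Return True if *pattern* compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(is_valid_pattern(r"\d+"))        # a well-formed pattern
print(is_valid_pattern("(unclosed"))   # unmatched parenthesis

# A valid pattern that simply finds nothing returns None; no exception.
print(re.search("x", "abc"))
```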
       
   670 
       
   671 .. _re-objects:
       
   672 
       
   673 Regular Expression Objects
       
   674 --------------------------
       
   675 
       
   676 Compiled regular expression objects support the following methods and
       
   677 attributes:
       
   678 
       
   679 
       
   680 .. method:: RegexObject.match(string[, pos[, endpos]])
       
   681 
       
   682    If zero or more characters at the beginning of *string* match this regular
       
   683    expression, return a corresponding :class:`MatchObject` instance.  Return
       
   684    ``None`` if the string does not match the pattern; note that this is different
       
   685    from a zero-length match.
       
   686 
       
   687    .. note::
       
   688 
       
   689       If you want to locate a match anywhere in *string*, use :meth:`search`
       
   690       instead.
       
   691 
       
   692    The optional second parameter *pos* gives an index in the string where the
       
   693    search is to start; it defaults to ``0``.  This is not completely equivalent to
       
   694    slicing the string; the ``'^'`` pattern character matches at the real beginning
       
   695    of the string and at positions just after a newline, but not necessarily at the
       
   696    index where the search is to start.
       
   697 
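The difference between passing *pos* and slicing the string can be sketched with an anchored pattern; the pattern and input are chosen only for illustration.

```python
import re

rx = re.compile(r"^b")

# Starting the search at index 1 does not make index 1 the "beginning":
# '^' still refers to the real start of the string, so nothing matches.
with_pos = rx.search("ab", 1)

# Slicing creates a new string whose real beginning is the 'b'.
sliced = rx.search("ab"[1:])

print(with_pos)
print(sliced.group())
```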
       
   698    The optional parameter *endpos* limits how far the string will be searched; it
       
   699    will be as if the string is *endpos* characters long, so only the characters
       
   700    from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
       
   701    than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
       
   702    expression object, ``rx.match(string, 0, 50)`` is equivalent to
       
   703    ``rx.match(string[:50], 0)``.
       
   704 
       
   705       >>> pattern = re.compile("o")
       
   706       >>> pattern.match("dog")      # No match as "o" is not at the start of "dog."
       
   707       >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
       
   708       <_sre.SRE_Match object at ...>
       
   709 
       
   710 
       
   711 .. method:: RegexObject.search(string[, pos[, endpos]])
       
   712 
       
   713    Scan through *string* looking for a location where this regular expression
       
   714    produces a match, and return a corresponding :class:`MatchObject` instance.
       
   715    Return ``None`` if no position in the string matches the pattern; note that this
       
   716    is different from finding a zero-length match at some point in the string.
       
   717 
       
   718    The optional *pos* and *endpos* parameters have the same meaning as for the
       
   719    :meth:`match` method.
       
   720 
       
   721 
       
   722 .. method:: RegexObject.split(string[, maxsplit=0])
       
   723 
       
   724    Identical to the :func:`split` function, using the compiled pattern.
       
   725 
       
   726 
       
   727 .. method:: RegexObject.findall(string[, pos[, endpos]])
       
   728 
       
   729    Identical to the :func:`findall` function, using the compiled pattern.
       
   730 
       
   731 
       
   732 .. method:: RegexObject.finditer(string[, pos[, endpos]])
       
   733 
       
   734    Identical to the :func:`finditer` function, using the compiled pattern.
       
   735 
       
   736 
       
   737 .. method:: RegexObject.sub(repl, string[, count=0])
       
   738 
       
   739    Identical to the :func:`sub` function, using the compiled pattern.
       
   740 
       
   741 
       
   742 .. method:: RegexObject.subn(repl, string[, count=0])
       
   743 
       
   744    Identical to the :func:`subn` function, using the compiled pattern.
       
   745 
       
   746 
       
   747 .. attribute:: RegexObject.flags
       
   748 
       
   749    The flags argument used when the RE object was compiled, or ``0`` if no flags
       
   750    were provided.
       
   751 
       
   752 
       
   753 .. attribute:: RegexObject.groupindex
       
   754 
       
   755    A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
       
   756    numbers.  The dictionary is empty if no symbolic groups were used in the
       
   757    pattern.
       
   758 
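A small sketch of the mapping, using made-up group names. The value is copied into a plain ``dict`` here because the concrete mapping type may vary between Python versions.

```python
import re

rx = re.compile(r"(?P<first>\w+) (?P<last>\w+)")
names = dict(rx.groupindex)

# A pattern without named groups yields an empty mapping.
no_names = dict(re.compile(r"(\w+) (\w+)").groupindex)

print(names)
print(no_names)
```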
       
   759 
       
   760 .. attribute:: RegexObject.pattern
       
   761 
       
   762    The pattern string from which the RE object was compiled.
       
   763 
       
   764 
       
   765 .. _match-objects:
       
   766 
       
   767 Match Objects
       
   768 -------------
       
   769 
       
   770 Match objects always have a boolean value of :const:`True`, so that you can test
       
   771 whether e.g. :func:`match` resulted in a match with a simple if statement.  They
       
   772 support the following methods and attributes:
       
   773 
       
   774 
       
   775 .. method:: MatchObject.expand(template)
       
   776 
       
   777    Return the string obtained by doing backslash substitution on the template
       
   778    string *template*, as done by the :meth:`sub` method. Escapes such as ``\n`` are
       
   779    converted to the appropriate characters, and numeric backreferences (``\1``,
       
   780    ``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
       
   781    contents of the corresponding group.
       
   782 
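As a brief sketch, a template can mix numeric and named backreferences. The group names and the input string are assumptions for this example.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# \g<last> refers to the named group, \1 to the first group by number.
reordered = m.expand(r"\g<last>, \1")

print(reordered)
```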
       
   783 
       
   784 .. method:: MatchObject.group([group1, ...])
       
   785 
       
   786    Returns one or more subgroups of the match.  If there is a single argument, the
       
   787    result is a single string; if there are multiple arguments, the result is a
       
   788    tuple with one item per argument. Without arguments, *group1* defaults to zero
       
   789    (the whole match is returned). If a *groupN* argument is zero, the corresponding
       
   790    return value is the entire matching string; if it is in the inclusive range
       
   791    [1..99], it is the string matching the corresponding parenthesized group.  If a
       
   792    group number is negative or larger than the number of groups defined in the
       
   793    pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
       
   794    part of the pattern that did not match, the corresponding result is ``None``.
       
   795    If a group is contained in a part of the pattern that matched multiple times,
       
   796    the last match is returned.
       
   797 
       
   798       >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
       
   799       >>> m.group(0)       # The entire match
       
   800       'Isaac Newton'
       
   801       >>> m.group(1)       # The first parenthesized subgroup.
       
   802       'Isaac'
       
   803       >>> m.group(2)       # The second parenthesized subgroup.
       
   804       'Newton'
       
   805       >>> m.group(1, 2)    # Multiple arguments give us a tuple.
       
   806       ('Isaac', 'Newton')
       
   807 
       
   808    If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
       
   809    arguments may also be strings identifying groups by their group name.  If a
       
   810    string argument is not used as a group name in the pattern, an :exc:`IndexError`
       
   811    exception is raised.
       
   812 
       
   813    A moderately complicated example:
       
   814 
       
   815       >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
       
   816       >>> m.group('first_name')
       
   817       'Malcom'
       
   818       >>> m.group('last_name')
       
   819       'Reynolds'
       
   820 
       
   821    Named groups can also be referred to by their index:
       
   822 
       
   823       >>> m.group(1)
       
   824       'Malcom'
       
   825       >>> m.group(2)
       
   826       'Reynolds'
       
   827 
       
   828    If a group matches multiple times, only the last match is accessible:
       
   829 
       
   830       >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
       
   831       >>> m.group(1)                        # Returns only the last match.
       
   832       'c3'
       
   833 
       
   834 
       
   835 .. method:: MatchObject.groups([default])
       
   836 
       
   837    Return a tuple containing all the subgroups of the match, from 1 up to however
       
   838    many groups are in the pattern.  The *default* argument is used for groups that
       
   839    did not participate in the match; it defaults to ``None``.  (Incompatibility
       
   840    note: in the original Python 1.5 release, if the tuple was one element long, a
       
   841    string would be returned instead.  In later versions (from 1.5.1 on), a
       
   842    singleton tuple is returned in such cases.)
       
   843 
       
   844    For example:
       
   845 
       
   846       >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
       
   847       >>> m.groups()
       
   848       ('24', '1632')
       
   849 
       
   850    If we make the decimal place and everything after it optional, not all groups
       
   851    might participate in the match.  These groups will default to ``None`` unless
       
   852    the *default* argument is given:
       
   853 
       
   854       >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
       
   855       >>> m.groups()      # Second group defaults to None.
       
   856       ('24', None)
       
   857       >>> m.groups('0')   # Now, the second group defaults to '0'.
       
   858       ('24', '0')
       
   859 
       
   860 
       
   861 .. method:: MatchObject.groupdict([default])
       
   862 
       
   863    Return a dictionary containing all the *named* subgroups of the match, keyed by
       
   864    the subgroup name.  The *default* argument is used for groups that did not
       
   865    participate in the match; it defaults to ``None``.  For example:
       
   866 
       
   867       >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
       
   868       >>> m.groupdict()
       
   869       {'first_name': 'Malcom', 'last_name': 'Reynolds'}
       
   870 
       
   871 
       
   872 .. method:: MatchObject.start([group])
       
   873             MatchObject.end([group])
       
   874 
       
   875    Return the indices of the start and end of the substring matched by *group*;
       
   876    *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
       
   877    *group* exists but did not contribute to the match.  For a match object *m*, and
       
   878    a group *g* that did contribute to the match, the substring matched by group *g*
       
   879    (equivalent to ``m.group(g)``) is ::
       
   880 
       
   881       m.string[m.start(g):m.end(g)]
       
   882 
       
   883    Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
       
   884    null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
       
   885    ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
       
   886    2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
       
   887 
       
   888    An example that will remove *remove_this* from email addresses:
       
   889 
       
   890       >>> email = "tony@tiremove_thisger.net"
       
   891       >>> m = re.search("remove_this", email)
       
   892       >>> email[:m.start()] + email[m.end():]
       
   893       'tony@tiger.net'
       
   894 
       
   895 
       
   896 .. method:: MatchObject.span([group])
       
   897 
       
   898    For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
       
   899    m.end(group))``. Note that if *group* did not contribute to the match, this is
       
   900    ``(-1, -1)``.  *group* defaults to zero, the entire match.
       
   901 
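The ``(-1, -1)`` case arises when a group sits in an alternative that was not taken. A minimal sketch with an invented pattern:

```python
import re

m = re.match(r"(a)|(b)", "a")

whole = m.span()     # the entire match
taken = m.span(1)    # group 1 matched 'a'
skipped = m.span(2)  # group 2 did not contribute to the match

print(whole)
print(taken)
print(skipped)
```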
       
   902 
       
   903 .. attribute:: MatchObject.pos
       
   904 
       
   905    The value of *pos* which was passed to the :func:`search` or :func:`match`
       
   906    method of the :class:`RegexObject`.  This is the index into the string at which
       
   907    the RE engine started looking for a match.
       
   908 
       
   909 
       
   910 .. attribute:: MatchObject.endpos
       
   911 
       
   912    The value of *endpos* which was passed to the :func:`search` or :func:`match`
       
   913    method of the :class:`RegexObject`.  This is the index into the string beyond
       
   914    which the RE engine will not go.
       
   915 
       
   916 
       
   917 .. attribute:: MatchObject.lastindex
       
   918 
       
   919    The integer index of the last matched capturing group, or ``None`` if no group
       
   920    was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
       
   921    ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
       
   922    the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
       
   923    string.
       
   924 
       
   925 
       
   926 .. attribute:: MatchObject.lastgroup
       
   927 
       
   928    The name of the last matched capturing group, or ``None`` if the group didn't
       
   929    have a name, or if no group was matched at all.
       
   930 
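The two attributes can be compared side by side in a short sketch. The group name ``outer`` is made up for this example.

```python
import re

# Two sibling groups: the last one to match is group 2, which is unnamed.
siblings = re.match(r"(a)(b)", "ab")

# Nested groups: the outer group closes last, so it is the last matched.
nested = re.match(r"(?P<outer>(a)(b))", "ab")

# No groups at all.
plain = re.match(r"ab", "ab")

print(siblings.lastindex)
print(siblings.lastgroup)
print(nested.lastindex)
print(nested.lastgroup)
print(plain.lastindex)
```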
       
   931 
       
   932 .. attribute:: MatchObject.re
       
   933 
       
   934    The regular expression object whose :meth:`match` or :meth:`search` method
       
   935    produced this :class:`MatchObject` instance.
       
   936 
       
   937 
       
   938 .. attribute:: MatchObject.string
       
   939 
       
   940    The string passed to :func:`match` or :func:`search`.
       
   941 
       
   942 
       
   943 Examples
       
   944 --------
       
   945 
       
   946 
       
   947 Checking For a Pair
       
   948 ^^^^^^^^^^^^^^^^^^^
       
   949 
       
   950 In this example, we'll use the following helper function to display match
       
   951 objects a little more gracefully:
       
   952 
       
   953 .. testcode::
       
   954 
       
   955    def displaymatch(match):
       
   956        if match is None:
       
   957            return None
       
   958        return '<Match: %r, groups=%r>' % (match.group(), match.groups())
       
   959 
       
   960 Suppose you are writing a poker program where a player's hand is represented as
       
   961 a 5-character string with each character representing a card, "a" for ace, "k"
       
   962 for king, "q" for queen, "j" for jack, "0" for 10, and "1" through "9"
       
   963 representing the card with that value.
       
   964 
       
   965 To see if a given string is a valid hand, one could do the following:
       
   966 
       
   967    >>> valid = re.compile(r"[0-9akqj]{5}$")
       
   968    >>> displaymatch(valid.match("ak05q"))  # Valid.
       
   969    "<Match: 'ak05q', groups=()>"
       
   970    >>> displaymatch(valid.match("ak05e"))  # Invalid.
       
   971    >>> displaymatch(valid.match("ak0"))    # Invalid.
       
   972    >>> displaymatch(valid.match("727ak"))  # Valid.
       
   973    "<Match: '727ak', groups=()>"
       
   974 
       
   975 That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
       
   976 To match this with a regular expression, one could use backreferences as such:
       
   977 
       
   978    >>> pair = re.compile(r".*(.).*\1")
       
   979    >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
       
   980    "<Match: '717', groups=('7',)>"
       
   981    >>> displaymatch(pair.match("718ak"))     # No pairs.
       
   982    >>> displaymatch(pair.match("354aa"))     # Pair of aces.
       
   983    "<Match: '354aa', groups=('a',)>"
       
   984 
       
   985 To find out what card the pair consists of, one could use the :meth:`group`
       
   986 method of :class:`MatchObject` in the following manner:
       
   987 
       
   988 .. doctest::
       
   989 
       
   990    >>> pair.match("717ak").group(1)
       
   991    '7'
       
   992    
       
   993    # Error because pair.match() returns None, which doesn't have a group() method:
       
   994    >>> pair.match("718ak").group(1)
       
   995    Traceback (most recent call last):
       
   996      File "<pyshell#23>", line 1, in <module>
       
   997        pair.match("718ak").group(1)
       
   998    AttributeError: 'NoneType' object has no attribute 'group'
       
   999    
       
  1000    >>> pair.match("354aa").group(1)
       
  1001    'a'
       
  1002 
       
  1003 
       
  1004 Simulating scanf()
       
  1005 ^^^^^^^^^^^^^^^^^^
       
  1006 
       
  1007 .. index:: single: scanf()
       
  1008 
       
  1009 Python does not currently have an equivalent to :cfunc:`scanf`.  Regular
       
  1010 expressions are generally more powerful, though also more verbose, than
       
  1011 :cfunc:`scanf` format strings.  The table below offers some more-or-less
       
  1012 equivalent mappings between :cfunc:`scanf` format tokens and regular
       
  1013 expressions.
       
  1014 
       
  1015 +--------------------------------+---------------------------------------------+
       
  1016 | :cfunc:`scanf` Token           | Regular Expression                          |
       
  1017 +================================+=============================================+
       
  1018 | ``%c``                         | ``.``                                       |
       
  1019 +--------------------------------+---------------------------------------------+
       
  1020 | ``%5c``                        | ``.{5}``                                    |
       
  1021 +--------------------------------+---------------------------------------------+
       
  1022 | ``%d``                         | ``[-+]?\d+``                                |
       
  1023 +--------------------------------+---------------------------------------------+
       
  1024 | ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
       
  1025 +--------------------------------+---------------------------------------------+
       
  1026 | ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
       
  1027 +--------------------------------+---------------------------------------------+
       
  1028 | ``%o``                         | ``0[0-7]*``                                 |
       
  1029 +--------------------------------+---------------------------------------------+
       
  1030 | ``%s``                         | ``\S+``                                     |
       
  1031 +--------------------------------+---------------------------------------------+
       
  1032 | ``%u``                         | ``\d+``                                     |
       
  1033 +--------------------------------+---------------------------------------------+
       
  1034 | ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
       
  1035 +--------------------------------+---------------------------------------------+
       
  1036 
       
  1037 To extract the filename and numbers from a string like ::
       
  1038 
       
  1039    /usr/sbin/sendmail - 0 errors, 4 warnings
       
  1040 
       
  1041 you would use a :cfunc:`scanf` format like ::
       
  1042 
       
  1043    %s - %d errors, %d warnings
       
  1044 
       
  1045 The equivalent regular expression would be ::
       
  1046 
       
  1047    (\S+) - (\d+) errors, (\d+) warnings
       
  1048 
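Applied to the sample line above, the expression could be used as in the following sketch. The extracted values are strings, and converting the counts to integers is left to the caller, as shown.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename, errors, warnings = m.groups()

# Unlike scanf's %d, the groups are text and must be converted explicitly.
errors = int(errors)
warnings = int(warnings)

print(filename)
print(errors)
print(warnings)
```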
       
  1049 
       
  1050 Avoiding recursion
       
  1051 ^^^^^^^^^^^^^^^^^^
       
  1052 
       
  1053 If you create regular expressions that require the engine to perform a lot of
       
  1054 recursion, you may encounter a :exc:`RuntimeError` exception with the message
       
  1055 ``maximum recursion limit exceeded``. For example, ::
       
  1056 
       
  1057    >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
       
  1058    >>> re.match('Begin (\w| )*? end', s).end()
       
  1059    Traceback (most recent call last):
       
  1060      File "<stdin>", line 1, in ?
       
  1061      File "/usr/local/lib/python2.5/re.py", line 132, in match
       
  1062        return _compile(pattern, flags).match(string)
       
  1063    RuntimeError: maximum recursion limit exceeded
       
  1064 
       
  1065 You can often restructure your regular expression to avoid recursion.
       
  1066 
       
  1067 Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
       
  1068 avoid recursion.  Thus, the above regular expression can avoid recursion by
       
  1069 being recast as ``Begin [a-zA-Z0-9_ ]*?end``.  As a further benefit, such
       
  1070 regular expressions will run faster than their recursive equivalents.
       
  1071 
       
  1072 
       
  1073 search() vs. match()
       
  1074 ^^^^^^^^^^^^^^^^^^^^
       
  1075 
       
  1076 In a nutshell, :func:`match` only attempts to match a pattern at the beginning
       
  1077 of a string, whereas :func:`search` will match a pattern anywhere in a string.
       
  1078 For example:
       
  1079 
       
  1080    >>> re.match("o", "dog")  # No match as "o" is not the first letter of "dog".
       
  1081    >>> re.search("o", "dog") # Match as search() looks everywhere in the string.
       
  1082    <_sre.SRE_Match object at ...>
       
  1083 
       
  1084 .. note::
       
  1085 
       
  1086    The following applies only to regular expression objects like those created
       
  1087    with ``re.compile("pattern")``, not the primitives ``re.match(pattern,
       
  1088    string)`` or ``re.search(pattern, string)``.
       
  1089 
       
  1090 :func:`match` has an optional second parameter that gives an index in the string
       
  1091 where the search is to start:
       
  1092 
       
  1093    >>> pattern = re.compile("o")
       
  1094    >>> pattern.match("dog")      # No match as "o" is not at the start of "dog."
       
  1095 
       
  1096    # Equivalent to the above expression as 0 is the default starting index:
       
  1097    >>> pattern.match("dog", 0)
       
  1098 
       
  1099    # Match as "o" is the 2nd character of "dog" (index 0 is the first):
       
  1100    >>> pattern.match("dog", 1)
       
  1101    <_sre.SRE_Match object at ...>
       
  1102    >>> pattern.match("dog", 2)   # No match as "o" is not the 3rd character of "dog."
       
  1103 
       
  1104 
       
  1105 Making a Phonebook
       
  1106 ^^^^^^^^^^^^^^^^^^
       
  1107 
       
  1108 :func:`split` splits a string into a list delimited by the passed pattern.  The 
       
  1109 method is invaluable for converting textual data into data structures that can be
       
  1110 easily read and modified by Python as demonstrated in the following example that
       
  1111 creates a phonebook.
       
  1112 
       
  1113 First, here is the input.  Normally it may come from a file; here we are using
       
  1114 triple-quoted string syntax:
       
  1115 
       
  1116    >>> input = """Ross McFluff: 834.345.1254 155 Elm Street
       
  1117    ... 
       
  1118    ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
       
  1119    ... Frank Burger: 925.541.7625 662 South Dogwood Way
       
  1120    ...
       
  1121    ...
       
  1122    ... Heather Albrecht: 548.326.4584 919 Park Place"""
       
  1123 
       
  1124 The entries are separated by one or more newlines. Now we convert the string
       
  1125 into a list with each nonempty line having its own entry:
       
  1126 
       
  1127 .. doctest::
       
  1128    :options: +NORMALIZE_WHITESPACE
       
  1129 
       
  1130    >>> entries = re.split("\n+", input)
       
  1131    >>> entries
       
  1132    ['Ross McFluff: 834.345.1254 155 Elm Street',
       
  1133    'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
       
  1134    'Frank Burger: 925.541.7625 662 South Dogwood Way',
       
  1135    'Heather Albrecht: 548.326.4584 919 Park Place']
       
  1136 
       
  1137 Finally, split each entry into a list with first name, last name, telephone
       
  1138 number, and address.  We use the ``maxsplit`` parameter of :func:`split`
       
  1139 because the address has spaces, our splitting pattern, in it:
       
  1140 
       
  1141 .. doctest::
       
  1142    :options: +NORMALIZE_WHITESPACE
       
  1143 
       
  1144    >>> [re.split(":? ", entry, 3) for entry in entries]
       
  1145    [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
       
  1146    ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
       
  1147    ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
       
  1148    ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
       
  1149 
       
  1150 The ``:?`` pattern matches the colon after the last name, so that it does not
       
  1151 occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
       
  1152 house number from the street name:
       
  1153 
       
  1154 .. doctest::
       
  1155    :options: +NORMALIZE_WHITESPACE
       
  1156 
       
  1157    >>> [re.split(":? ", entry, 4) for entry in entries]
       
  1158    [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
       
  1159    ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
       
  1160    ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
       
  1161    ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
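
The two splitting steps above can be combined to build an actual lookup
structure.  The following is a minimal sketch, assuming a dictionary keyed by
``(first name, last name)`` is wanted; the names ``text`` and ``phonebook`` are
chosen for this illustration only.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
# Split on runs of newlines so that blank lines produce no empty entries.
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which itself contains spaces, in one piece.
    first, last, phone, address = re.split(":? ", entry, maxsplit=3)
    phonebook[(first, last)] = (phone, address)
```

A lookup such as ``phonebook[("Ross", "McFluff")]`` then returns the telephone
number and address as a tuple.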
       
  1162 
       
  1163 
       
  1164 Text Munging
       
  1165 ^^^^^^^^^^^^
       
  1166 
       
  1167 :func:`sub` replaces every occurrence of a pattern with a string or the
       
  1168 result of a function.  This example demonstrates using :func:`sub` with
       
  1169 a function to "munge" text, or randomize the order of all the characters
       
  1170 in each word of a sentence except for the first and last characters::
       
  1171 
       
          >>> import random
       
  1172    >>> def repl(m):
       
  1173    ...   inner_word = list(m.group(2))
       
  1174    ...   random.shuffle(inner_word)
       
  1175    ...   return m.group(1) + "".join(inner_word) + m.group(3)
       
  1176    >>> text = "Professor Abdolmalek, please report your absences promptly."
       
  1177    >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
       
  1178    'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
       
  1179    >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
       
  1180    'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
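
Because the shuffling is random, the output differs from run to run.  The
following sketch shows one possible way to make the munging repeatable by
passing a seed; the wrapper ``munge`` and its ``seed`` parameter are assumptions
made for this illustration.

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word (hypothetical wrapper)."""
    rng = random.Random(seed)   # private generator, so the seed affects only this call

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        # The first and last characters of the word stay in place.
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

sentence = "Professor Abdolmalek, please report your absences promptly."
munged = munge(sentence, seed=0)
```

Calling ``munge`` twice with the same seed yields the same result, which is
convenient when the output has to be compared in a test.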
       
  1181 
       
  1182 
       
  1183 Finding all Adverbs
       
  1184 ^^^^^^^^^^^^^^^^^^^
       
  1185 
       
  1186 :func:`findall` matches *all* occurrences of a pattern, not just the first
       
  1187 one as :func:`search` does.  For example, a writer who wants to find all of
       
  1188 the adverbs in some text might use :func:`findall` in the following
       
  1189 manner:
       
  1190 
       
  1191    >>> text = "He was carefully disguised but captured quickly by police."
       
  1192    >>> re.findall(r"\w+ly", text)
       
  1193    ['carefully', 'quickly']
       
  1194 
       
  1195 
       
  1196 Finding all Adverbs and their Positions
       
  1197 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       
  1198 
       
  1199 If one wants more information about the matches of a pattern than just the
       
  1200 matched text, :func:`finditer` is useful as it provides instances of
       
  1201 :class:`MatchObject` instead of strings.  Continuing with the previous example,
       
  1202 a writer who wants to find all of the adverbs *and their positions* in some
       
  1203 text would use :func:`finditer` in the following manner:
       
  1204 
       
  1205    >>> text = "He was carefully disguised but captured quickly by police."
       
  1206    >>> for m in re.finditer(r"\w+ly", text):
       
  1207    ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
       
  1208    07-16: carefully
       
  1209    40-47: quickly
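
When the positions are needed for further processing rather than for printing,
the match objects can be collected into a list.  The following is a small
sketch along those lines; the name ``spans`` is chosen for this illustration.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Each tuple holds the start index, the end index, and the matched text.
spans = [(m.start(), m.end(), m.group(0)) for m in re.finditer(r"\w+ly", text)]

# Slicing the original string with a recorded span gives back the matched text.
first_start, first_end, first_word = spans[0]
recovered = text[first_start:first_end]
```

Here ``spans`` is ``[(7, 16, 'carefully'), (40, 47, 'quickly')]``, matching the
positions printed above.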
       
  1210 
       
  1211 
       
  1212 Raw String Notation
       
  1213 ^^^^^^^^^^^^^^^^^^^
       
  1214 
       
  1215 Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
       
  1216 every backslash (``'\'``) in a regular expression would have to be prefixed with
       
  1217 another one to escape it.  For example, the following two lines of code are
       
  1218 functionally identical:
       
  1219 
       
  1220    >>> re.match(r"\W(.)\1\W", " ff ")
       
  1221    <_sre.SRE_Match object at ...>
       
  1222    >>> re.match("\\W(.)\\1\\W", " ff ")
       
  1223    <_sre.SRE_Match object at ...>
       
  1224 
       
  1225 When one wants to match a literal backslash, it must be escaped in the regular
       
  1226 expression.  With raw string notation, this means ``r"\\"``.  Without raw string
       
  1227 notation, one must use ``"\\\\"``, making the following lines of code
       
  1228 functionally identical:
       
  1229 
       
  1230    >>> re.match(r"\\", r"\\")
       
  1231    <_sre.SRE_Match object at ...>
       
  1232    >>> re.match("\\\\", r"\\")
       
  1233    <_sre.SRE_Match object at ...>