|
1 .. _regex-howto: |
|
2 |
|
3 **************************** |
|
4 Regular Expression HOWTO |
|
5 **************************** |
|
6 |
|
7 :Author: A.M. Kuchling |
|
8 :Release: 0.05 |
|
9 |
|
10 .. TODO: |
|
11 Document lookbehind assertions |
|
12 Better way of displaying a RE, a string, and what it matches |
|
13 Mention optional argument to match.groups() |
|
14 Unicode (at least a reference) |
|
15 |
|
16 |
|
17 .. topic:: Abstract |
|
18 |
|
19 This document is an introductory tutorial to using regular expressions in Python |
|
20 with the :mod:`re` module. It provides a gentler introduction than the |
|
21 corresponding section in the Library Reference. |
|
22 |
|
23 |
|
24 Introduction |
|
25 ============ |
|
26 |
|
27 The :mod:`re` module was added in Python 1.5, and provides Perl-style regular |
|
28 expression patterns. Earlier versions of Python came with the :mod:`regex` |
|
29 module, which provided Emacs-style patterns. The :mod:`regex` module was |
|
30 removed completely in Python 2.5. |
|
31 |
|
32 Regular expressions (called REs, or regexes, or regex patterns) are essentially |
|
33 a tiny, highly specialized programming language embedded inside Python and made |
|
34 available through the :mod:`re` module. Using this little language, you specify |
|
35 the rules for the set of possible strings that you want to match; this set might |
|
36 contain English sentences, or e-mail addresses, or TeX commands, or anything you |
|
37 like. You can then ask questions such as "Does this string match the pattern?", |
|
38 or "Is there a match for the pattern anywhere in this string?". You can also |
|
39 use REs to modify a string or to split it apart in various ways. |
|
40 |
|
41 Regular expression patterns are compiled into a series of bytecodes which are |
|
42 then executed by a matching engine written in C. For advanced use, it may be |
|
43 necessary to pay careful attention to how the engine will execute a given RE, |
|
44 and write the RE in a certain way in order to produce bytecode that runs faster. |
|
45 Optimization isn't covered in this document, because it requires that you have a |
|
46 good understanding of the matching engine's internals. |
|
47 |
|
48 The regular expression language is relatively small and restricted, so not all |
|
49 possible string processing tasks can be done using regular expressions. There |
|
50 are also tasks that *can* be done with regular expressions, but the expressions |
|
51 turn out to be very complicated. In these cases, you may be better off writing |
|
52 Python code to do the processing; while Python code will be slower than an |
|
53 elaborate regular expression, it will also probably be more understandable. |
|
54 |
|
55 |
|
56 Simple Patterns |
|
57 =============== |
|
58 |
|
59 We'll start by learning about the simplest possible regular expressions. Since |
|
60 regular expressions are used to operate on strings, we'll begin with the most |
|
61 common task: matching characters. |
|
62 |
|
63 For a detailed explanation of the computer science underlying regular |
|
64 expressions (deterministic and non-deterministic finite automata), you can refer |
|
65 to almost any textbook on writing compilers. |
|
66 |
|
67 |
|
68 Matching Characters |
|
69 ------------------- |
|
70 |
|
71 Most letters and characters will simply match themselves. For example, the |
|
72 regular expression ``test`` will match the string ``test`` exactly. (You can |
|
73 enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST`` |
|
74 as well; more about this later.) |
|
75 |
|
76 There are exceptions to this rule; some characters are special |
|
77 :dfn:`metacharacters`, and don't match themselves. Instead, they signal that |
|
78 some out-of-the-ordinary thing should be matched, or they affect other portions |
|
79 of the RE by repeating them or changing their meaning. Much of this document is |
|
80 devoted to discussing various metacharacters and what they do. |
|
81 |
|
82 Here's a complete list of the metacharacters; their meanings will be discussed |
|
83 in the rest of this HOWTO. :: |
|
84 |
|
   . ^ $ * + ? { } [ ] \ | ( )
|
86 |
|
87 The first metacharacters we'll look at are ``[`` and ``]``. They're used for |
|
88 specifying a character class, which is a set of characters that you wish to |
|
89 match. Characters can be listed individually, or a range of characters can be |
|
90 indicated by giving two characters and separating them by a ``'-'``. For |
|
91 example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this |
|
92 is the same as ``[a-c]``, which uses a range to express the same set of |
|
93 characters. If you wanted to match only lowercase letters, your RE would be |
|
94 ``[a-z]``. |
|
95 |
|
Metacharacters (except ``\``) are not active inside classes. For example,
``[akm$]`` will match any of the characters ``'a'``, ``'k'``, ``'m'``, or
``'$'``; ``'$'`` is usually a metacharacter, but inside a character class it's
stripped of its special nature.
|
100 |
|
You can match the characters not listed within the class by :dfn:`complementing`
the set. This is indicated by including a ``'^'`` as the first character of the
class; a ``'^'`` that appears anywhere else in the class simply matches the
``'^'`` character. For example, ``[^5]`` will match any character except ``'5'``.
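The character-class rules above can be checked with a short sketch. It uses
:func:`re.findall`, a module-level function introduced later in this HOWTO, and
the sample strings are made up for illustration::

   import re

   # [abc] and [a-c] describe the same set of characters.
   listed = re.findall(r'[abc]', 'aardvark bacon')
   ranged = re.findall(r'[a-c]', 'aardvark bacon')
   print(listed == ranged)    # True

   # '$' is an ordinary character inside a class.
   dollars = re.findall(r'[akm$]', 'make $5')
   print(dollars)             # ['m', 'a', 'k', '$']

   # A leading '^' complements the class: everything except '5'.
   not_five = re.findall(r'[^5]', '1525')
   print(not_five)            # ['1', '2']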
|
105 |
|
106 Perhaps the most important metacharacter is the backslash, ``\``. As in Python |
|
107 string literals, the backslash can be followed by various characters to signal |
|
108 various special sequences. It's also used to escape all the metacharacters so |
|
109 you can still match them in patterns; for example, if you need to match a ``[`` |
|
110 or ``\``, you can precede them with a backslash to remove their special |
|
111 meaning: ``\[`` or ``\\``. |
|
112 |
|
113 Some of the special sequences beginning with ``'\'`` represent predefined sets |
|
114 of characters that are often useful, such as the set of digits, the set of |
|
115 letters, or the set of anything that isn't whitespace. The following predefined |
|
116 special sequences are available: |
|
117 |
|
118 ``\d`` |
|
119 Matches any decimal digit; this is equivalent to the class ``[0-9]``. |
|
120 |
|
121 ``\D`` |
|
122 Matches any non-digit character; this is equivalent to the class ``[^0-9]``. |
|
123 |
|
124 ``\s`` |
|
125 Matches any whitespace character; this is equivalent to the class ``[ |
|
126 \t\n\r\f\v]``. |
|
127 |
|
128 ``\S`` |
|
129 Matches any non-whitespace character; this is equivalent to the class ``[^ |
|
130 \t\n\r\f\v]``. |
|
131 |
|
132 ``\w`` |
|
133 Matches any alphanumeric character; this is equivalent to the class |
|
134 ``[a-zA-Z0-9_]``. |
|
135 |
|
136 ``\W`` |
|
137 Matches any non-alphanumeric character; this is equivalent to the class |
|
138 ``[^a-zA-Z0-9_]``. |
|
139 |
|
140 These sequences can be included inside a character class. For example, |
|
141 ``[\s,.]`` is a character class that will match any whitespace character, or |
|
142 ``','`` or ``'.'``. |
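As a rough illustration of the predefined sequences, the following sketch runs a
few of them over a made-up ASCII string. It relies on :func:`re.findall`, which
is covered later in this HOWTO::

   import re

   text = 'Room 42, floor 3.'

   digits = re.findall(r'\d', text)
   print(digits)        # ['4', '2', '3']

   words = re.findall(r'\w+', text)
   print(words)         # ['Room', '42', 'floor', '3']

   # A predefined sequence used inside a character class.
   separators = re.findall(r'[\s,.]', text)
   print(separators)    # [' ', ',', ' ', ' ', '.']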
|
143 |
|
144 The final metacharacter in this section is ``.``. It matches anything except a |
|
145 newline character, and there's an alternate mode (``re.DOTALL``) where it will |
|
146 match even a newline. ``'.'`` is often used where you want to match "any |
|
147 character". |
|
148 |
|
149 |
|
150 Repeating Things |
|
151 ---------------- |
|
152 |
|
153 Being able to match varying sets of characters is the first thing regular |
|
154 expressions can do that isn't already possible with the methods available on |
|
155 strings. However, if that was the only additional capability of regexes, they |
|
156 wouldn't be much of an advance. Another capability is that you can specify that |
|
157 portions of the RE must be repeated a certain number of times. |
|
158 |
|
159 The first metacharacter for repeating things that we'll look at is ``*``. ``*`` |
|
160 doesn't match the literal character ``*``; instead, it specifies that the |
|
161 previous character can be matched zero or more times, instead of exactly once. |
|
162 |
|
163 For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``), |
|
164 ``caaat`` (3 ``a`` characters), and so forth. The RE engine has various |
|
165 internal limitations stemming from the size of C's ``int`` type that will |
|
166 prevent it from matching over 2 billion ``a`` characters; you probably don't |
|
167 have enough memory to construct a string that large, so you shouldn't run into |
|
168 that limit. |
|
169 |
|
Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
engine will try to repeat it as many times as possible. If later portions of the
pattern don't match, the matching engine will then back up and try again with
fewer repetitions.
|
174 |
|
175 A step-by-step example will make this more obvious. Let's consider the |
|
176 expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters |
|
177 from the class ``[bcd]``, and finally ends with a ``'b'``. Now imagine matching |
|
178 this RE against the string ``abcbd``. |
|
179 |
|
+------+-----------+---------------------------------+
| Step | Matched   | Explanation                     |
+======+===========+=================================+
| 1    | ``a``     | The ``a`` in the RE matches.    |
+------+-----------+---------------------------------+
| 2    | ``abcbd`` | The engine matches ``[bcd]*``,  |
|      |           | going as far as it can, which   |
|      |           | is to the end of the string.    |
+------+-----------+---------------------------------+
| 3    | *Failure* | The engine tries to match       |
|      |           | ``b``, but the current position |
|      |           | is at the end of the string, so |
|      |           | it fails.                       |
+------+-----------+---------------------------------+
| 4    | ``abcb``  | Back up, so that ``[bcd]*``     |
|      |           | matches one less character.     |
+------+-----------+---------------------------------+
| 5    | *Failure* | Try ``b`` again, but the        |
|      |           | current position is at the last |
|      |           | character, which is a ``'d'``.  |
+------+-----------+---------------------------------+
| 6    | ``abc``   | Back up again, so that          |
|      |           | ``[bcd]*`` is only matching     |
|      |           | ``bc``.                         |
+------+-----------+---------------------------------+
| 7    | ``abcb``  | Try ``b`` again. This time      |
|      |           | the character at the            |
|      |           | current position is ``'b'``, so |
|      |           | it succeeds.                    |
+------+-----------+---------------------------------+
|
210 |
|
211 The end of the RE has now been reached, and it has matched ``abcb``. This |
|
212 demonstrates how the matching engine goes as far as it can at first, and if no |
|
213 match is found it will then progressively back up and retry the rest of the RE |
|
214 again and again. It will back up until it has tried zero matches for |
|
215 ``[bcd]*``, and if that subsequently fails, the engine will conclude that the |
|
216 string doesn't match the RE at all. |
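The outcome of this walk-through can be confirmed from Python. This is only a
sketch of the end result, not of the engine's internal steps; :func:`re.match`
and the match object's methods are explained later in this HOWTO, and the second
test string is made up::

   import re

   m = re.match(r'a[bcd]*b', 'abcbd')
   print(m.group())    # abcb
   print(m.span())     # (0, 4)

   # If no amount of backing up lets the final 'b' match, the match fails.
   no_match = re.match(r'a[bcd]*b', 'acd')
   print(no_match)     # None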
|
217 |
|
218 Another repeating metacharacter is ``+``, which matches one or more times. Pay |
|
219 careful attention to the difference between ``*`` and ``+``; ``*`` matches |
|
220 *zero* or more times, so whatever's being repeated may not be present at all, |
|
221 while ``+`` requires at least *one* occurrence. To use a similar example, |
|
222 ``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match |
|
223 ``ct``. |
|
224 |
|
There are two more repeating quantifiers. The question mark character, ``?``,
matches either once or zero times; you can think of it as marking something as
being optional. For example, ``home-?brew`` matches either ``homebrew`` or
``home-brew``.

The most complicated repeating quantifier is ``{m,n}``, where *m* and *n* are
decimal integers. This quantifier means there must be at least *m* repetitions,
and at most *n*. For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and
``a///b``. It won't match ``ab``, which has no slashes, or ``a////b``, which
has four.

You can omit either *m* or *n*; in that case, a reasonable value is assumed for
the missing value. Omitting *m* is interpreted as a lower limit of 0, while
omitting *n* results in an upper bound of infinity; in practice, the upper bound
is the 2-billion limit mentioned earlier, but that might as well be infinity.

Readers of a reductionist bent may notice that the three other quantifiers can
all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}``
is equivalent to ``+``, and ``{0,1}`` is the same as ``?``. It's better to use
``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier
to read.
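The repetition examples from this section can be collected into one small
check. The ``matches`` helper below is hypothetical (it is not part of the
:mod:`re` module); it anchors the pattern with ``\Z`` so that the whole string
has to match::

   import re

   def matches(pattern, text):
       """Return True if pattern matches all of text (illustrative helper)."""
       return re.match('(?:' + pattern + r')\Z', text) is not None

   checks = [
       ('ca*t', 'ct', True),          # zero repetitions are allowed
       ('ca*t', 'caaat', True),
       ('ca+t', 'cat', True),
       ('ca+t', 'ct', False),         # + needs at least one 'a'
       ('home-?brew', 'homebrew', True),
       ('home-?brew', 'home-brew', True),
       ('a/{1,3}b', 'a//b', True),
       ('a/{1,3}b', 'ab', False),     # no slashes
       ('a/{1,3}b', 'a////b', False), # four slashes
   ]
   results = [matches(p, t) == expected for p, t, expected in checks]
   print(all(results))    # True

   # {0,} behaves like *, {1,} like +, and {0,1} like ?.
   same = all(matches('ca{0,}t', t) == matches('ca*t', t)
              for t in ['ct', 'cat', 'caaat', 'cbt'])
   print(same)            # True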
|
246 |
|
247 |
|
248 Using Regular Expressions |
|
249 ========================= |
|
250 |
|
251 Now that we've looked at some simple regular expressions, how do we actually use |
|
252 them in Python? The :mod:`re` module provides an interface to the regular |
|
253 expression engine, allowing you to compile REs into objects and then perform |
|
254 matches with them. |
|
255 |
|
256 |
|
257 Compiling Regular Expressions |
|
258 ----------------------------- |
|
259 |
|
260 Regular expressions are compiled into :class:`RegexObject` instances, which have |
|
261 methods for various operations such as searching for pattern matches or |
|
262 performing string substitutions. :: |
|
263 |
|
264 >>> import re |
|
265 >>> p = re.compile('ab*') |
|
266 >>> print p |
|
267 <re.RegexObject instance at 80b4150> |
|
268 |
|
269 :func:`re.compile` also accepts an optional *flags* argument, used to enable |
|
270 various special features and syntax variations. We'll go over the available |
|
271 settings later, but for now a single example will do:: |
|
272 |
|
273 >>> p = re.compile('ab*', re.IGNORECASE) |
|
274 |
|
275 The RE is passed to :func:`re.compile` as a string. REs are handled as strings |
|
276 because regular expressions aren't part of the core Python language, and no |
|
277 special syntax was created for expressing them. (There are applications that |
|
278 don't need REs at all, so there's no need to bloat the language specification by |
|
279 including them.) Instead, the :mod:`re` module is simply a C extension module |
|
280 included with Python, just like the :mod:`socket` or :mod:`zlib` modules. |
|
281 |
|
282 Putting REs in strings keeps the Python language simpler, but has one |
|
283 disadvantage which is the topic of the next section. |
|
284 |
|
285 |
|
286 The Backslash Plague |
|
287 -------------------- |
|
288 |
|
289 As stated earlier, regular expressions use the backslash character (``'\'``) to |
|
290 indicate special forms or to allow special characters to be used without |
|
291 invoking their special meaning. This conflicts with Python's usage of the same |
|
292 character for the same purpose in string literals. |
|
293 |
|
Let's say you want to write a RE that matches the string ``\section``, which
might be found in a LaTeX file. To figure out what to write in the program
code, start with the desired string to be matched. Next, you must escape any
backslashes and other metacharacters by preceding them with a backslash,
resulting in the string ``\\section``. This is the string that must be passed
to :func:`re.compile`. However, to express it as a Python string literal, both
backslashes must be escaped *again*.
|
301 |
|
302 +-------------------+------------------------------------------+ |
|
303 | Characters | Stage | |
|
304 +===================+==========================================+ |
|
305 | ``\section`` | Text string to be matched | |
|
306 +-------------------+------------------------------------------+ |
|
307 | ``\\section`` | Escaped backslash for :func:`re.compile` | |
|
308 +-------------------+------------------------------------------+ |
|
309 | ``"\\\\section"`` | Escaped backslashes for a string literal | |
|
310 +-------------------+------------------------------------------+ |
|
311 |
|
312 In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE |
|
313 string, because the regular expression must be ``\\``, and each backslash must |
|
314 be expressed as ``\\`` inside a regular Python string literal. In REs that |
|
315 feature backslashes repeatedly, this leads to lots of repeated backslashes and |
|
316 makes the resulting strings difficult to understand. |
|
317 |
|
318 The solution is to use Python's raw string notation for regular expressions; |
|
319 backslashes are not handled in any special way in a string literal prefixed with |
|
320 ``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``, |
|
321 while ``"\n"`` is a one-character string containing a newline. Regular |
|
322 expressions will often be written in Python code using this raw string notation. |
|
323 |
|
324 +-------------------+------------------+ |
|
325 | Regular String | Raw string | |
|
326 +===================+==================+ |
|
327 | ``"ab*"`` | ``r"ab*"`` | |
|
328 +-------------------+------------------+ |
|
329 | ``"\\\\section"`` | ``r"\\section"`` | |
|
330 +-------------------+------------------+ |
|
331 | ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"`` | |
|
332 +-------------------+------------------+ |
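A quick way to convince yourself that the two notations produce the same pattern
is to compare the strings directly. The sample LaTeX line below is made up for
illustration::

   import re

   regular = "\\\\section"
   raw = r"\\section"
   print(regular == raw)    # True
   print(len(raw))          # 9: two backslashes plus 'section'

   # The pattern matches a single literal backslash followed by 'section'.
   m = re.match(raw, r"\section{Intro}")
   print(m.group() == "\\section")    # True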
|
333 |
|
334 |
|
335 Performing Matches |
|
336 ------------------ |
|
337 |
|
338 Once you have an object representing a compiled regular expression, what do you |
|
339 do with it? :class:`RegexObject` instances have several methods and attributes. |
|
340 Only the most significant ones will be covered here; consult the :mod:`re` docs |
|
341 for a complete listing. |
|
342 |
|
+------------------+-----------------------------------------------+
| Method/Attribute | Purpose                                       |
+==================+===============================================+
| ``match()``      | Determine if the RE matches at the beginning  |
|                  | of the string.                                |
+------------------+-----------------------------------------------+
| ``search()``     | Scan through a string, looking for any        |
|                  | location where this RE matches.               |
+------------------+-----------------------------------------------+
| ``findall()``    | Find all substrings where the RE matches, and |
|                  | return them as a list.                        |
+------------------+-----------------------------------------------+
| ``finditer()``   | Find all substrings where the RE matches, and |
|                  | return them as an :term:`iterator`.           |
+------------------+-----------------------------------------------+
|
358 |
|
359 :meth:`match` and :meth:`search` return ``None`` if no match can be found. If |
|
360 they're successful, a ``MatchObject`` instance is returned, containing |
|
361 information about the match: where it starts and ends, the substring it matched, |
|
362 and more. |
|
363 |
|
364 You can learn about this by interactively experimenting with the :mod:`re` |
|
365 module. If you have Tkinter available, you may also want to look at |
|
366 :file:`Tools/scripts/redemo.py`, a demonstration program included with the |
|
367 Python distribution. It allows you to enter REs and strings, and displays |
|
368 whether the RE matches or fails. :file:`redemo.py` can be quite useful when |
|
369 trying to debug a complicated RE. Phil Schwartz's `Kodos |
|
370 <http://kodos.sourceforge.net/>`_ is also an interactive tool for developing and |
|
371 testing RE patterns. |
|
372 |
|
373 This HOWTO uses the standard Python interpreter for its examples. First, run the |
|
374 Python interpreter, import the :mod:`re` module, and compile a RE:: |
|
375 |
|
376 Python 2.2.2 (#1, Feb 10 2003, 12:57:01) |
|
377 >>> import re |
|
378 >>> p = re.compile('[a-z]+') |
|
379 >>> p |
|
380 <_sre.SRE_Pattern object at 80c3c28> |
|
381 |
|
382 Now, you can try matching various strings against the RE ``[a-z]+``. An empty |
|
383 string shouldn't match at all, since ``+`` means 'one or more repetitions'. |
|
384 :meth:`match` should return ``None`` in this case, which will cause the |
|
385 interpreter to print no output. You can explicitly print the result of |
|
386 :meth:`match` to make this clear. :: |
|
387 |
|
388 >>> p.match("") |
|
389 >>> print p.match("") |
|
390 None |
|
391 |
|
392 Now, let's try it on a string that it should match, such as ``tempo``. In this |
|
393 case, :meth:`match` will return a :class:`MatchObject`, so you should store the |
|
394 result in a variable for later use. :: |
|
395 |
|
396 >>> m = p.match('tempo') |
|
397 >>> print m |
|
398 <_sre.SRE_Match object at 80c4f68> |
|
399 |
|
400 Now you can query the :class:`MatchObject` for information about the matching |
|
401 string. :class:`MatchObject` instances also have several methods and |
|
402 attributes; the most important ones are: |
|
403 |
|
404 +------------------+--------------------------------------------+ |
|
405 | Method/Attribute | Purpose | |
|
406 +==================+============================================+ |
|
407 | ``group()`` | Return the string matched by the RE | |
|
408 +------------------+--------------------------------------------+ |
|
409 | ``start()`` | Return the starting position of the match | |
|
410 +------------------+--------------------------------------------+ |
|
411 | ``end()`` | Return the ending position of the match | |
|
412 +------------------+--------------------------------------------+ |
|
413 | ``span()`` | Return a tuple containing the (start, end) | |
|
414 | | positions of the match | |
|
415 +------------------+--------------------------------------------+ |
|
416 |
|
417 Trying these methods will soon clarify their meaning:: |
|
418 |
|
419 >>> m.group() |
|
420 'tempo' |
|
421 >>> m.start(), m.end() |
|
422 (0, 5) |
|
423 >>> m.span() |
|
424 (0, 5) |
|
425 |
|
426 :meth:`group` returns the substring that was matched by the RE. :meth:`start` |
|
427 and :meth:`end` return the starting and ending index of the match. :meth:`span` |
|
428 returns both start and end indexes in a single tuple. Since the :meth:`match` |
|
429 method only checks if the RE matches at the start of a string, :meth:`start` |
|
430 will always be zero. However, the :meth:`search` method of :class:`RegexObject` |
|
431 instances scans through the string, so the match may not start at zero in that |
|
432 case. :: |
|
433 |
|
434 >>> print p.match('::: message') |
|
435 None |
|
436 >>> m = p.search('::: message') ; print m |
|
437 <re.MatchObject instance at 80c9650> |
|
438 >>> m.group() |
|
439 'message' |
|
440 >>> m.span() |
|
441 (4, 11) |
|
442 |
|
443 In actual programs, the most common style is to store the :class:`MatchObject` |
|
444 in a variable, and then check if it was ``None``. This usually looks like:: |
|
445 |
|
   p = re.compile( ... )
   m = p.match( 'string goes here' )
   if m:
       print 'Match found: ', m.group()
   else:
       print 'No match'
|
452 |
|
453 Two :class:`RegexObject` methods return all of the matches for a pattern. |
|
454 :meth:`findall` returns a list of matching strings:: |
|
455 |
|
   >>> p = re.compile(r'\d+')
   >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
   ['12', '11', '10']
|
459 |
|
460 :meth:`findall` has to create the entire list before it can be returned as the |
|
461 result. The :meth:`finditer` method returns a sequence of :class:`MatchObject` |
|
462 instances as an :term:`iterator`. [#]_ :: |
|
463 |
|
   >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
   >>> iterator
   <callable-iterator object at 0x401833ac>
   >>> for match in iterator:
   ...     print match.span()
   ...
   (0, 2)
   (22, 24)
   (29, 31)
|
473 |
|
474 |
|
475 Module-Level Functions |
|
476 ---------------------- |
|
477 |
|
478 You don't have to create a :class:`RegexObject` and call its methods; the |
|
479 :mod:`re` module also provides top-level functions called :func:`match`, |
|
480 :func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions |
|
481 take the same arguments as the corresponding :class:`RegexObject` method, with |
|
482 the RE string added as the first argument, and still return either ``None`` or a |
|
483 :class:`MatchObject` instance. :: |
|
484 |
|
485 >>> print re.match(r'From\s+', 'Fromage amk') |
|
486 None |
|
487 >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') |
|
488 <re.MatchObject instance at 80c5978> |
|
489 |
|
490 Under the hood, these functions simply produce a :class:`RegexObject` for you |
|
491 and call the appropriate method on it. They also store the compiled object in a |
|
492 cache, so future calls using the same RE are faster. |
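The two styles give the same result, as this small sketch suggests. It reuses
the ``From`` example above; the variable names are arbitrary::

   import re

   line = 'From amk Thu May 14 19:12:10 1998'

   # Module-level function: the pattern string is the first argument.
   via_module = re.match(r'From\s+', line)

   # Compiled object: the pattern is compiled once and then reused.
   p = re.compile(r'From\s+')
   via_object = p.match(line)

   print(via_module.group() == via_object.group())    # True
   print(via_object.span())                           # (0, 5)
   print(p.match('Fromage amk'))                      # None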
|
493 |
|
494 Should you use these module-level functions, or should you get the |
|
495 :class:`RegexObject` and call its methods yourself? That choice depends on how |
|
496 frequently the RE will be used, and on your personal coding style. If the RE is |
|
497 being used at only one point in the code, then the module functions are probably |
|
498 more convenient. If a program contains a lot of regular expressions, or re-uses |
|
499 the same ones in several locations, then it might be worthwhile to collect all |
|
500 the definitions in one place, in a section of code that compiles all the REs |
|
501 ahead of time. To take an example from the standard library, here's an extract |
|
502 from :file:`xmllib.py`:: |
|
503 |
|
504 ref = re.compile( ... ) |
|
505 entityref = re.compile( ... ) |
|
506 charref = re.compile( ... ) |
|
507 starttagopen = re.compile( ... ) |
|
508 |
|
509 I generally prefer to work with the compiled object, even for one-time uses, but |
|
510 few people will be as much of a purist about this as I am. |
|
511 |
|
512 |
|
513 Compilation Flags |
|
514 ----------------- |
|
515 |
|
516 Compilation flags let you modify some aspects of how regular expressions work. |
|
517 Flags are available in the :mod:`re` module under two names, a long name such as |
|
518 :const:`IGNORECASE` and a short, one-letter form such as :const:`I`. (If you're |
|
519 familiar with Perl's pattern modifiers, the one-letter forms use the same |
|
520 letters; the short form of :const:`re.VERBOSE` is :const:`re.X`, for example.) |
|
521 Multiple flags can be specified by bitwise OR-ing them; ``re.I | re.M`` sets |
|
522 both the :const:`I` and :const:`M` flags, for example. |
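As a small sketch of combining flags, the following compiles a pattern with both
:const:`re.I` and :const:`re.M`; the individual flags are described below, and
the sample text is made up::

   import re

   # The long and short names refer to the same flag.
   same_flag = (re.I == re.IGNORECASE)
   print(same_flag)    # True

   # Ignore case, and let ^ match at the start of every line.
   p = re.compile(r'^from', re.I | re.M)
   found = p.findall('From: a\nfrom: b\nTo: c')
   print(found)        # ['From', 'from']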
|
523 |
|
524 Here's a table of the available flags, followed by a more detailed explanation |
|
525 of each one. |
|
526 |
|
527 +---------------------------------+--------------------------------------------+ |
|
528 | Flag | Meaning | |
|
529 +=================================+============================================+ |
|
530 | :const:`DOTALL`, :const:`S` | Make ``.`` match any character, including | |
|
531 | | newlines | |
|
532 +---------------------------------+--------------------------------------------+ |
|
533 | :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches | |
|
534 +---------------------------------+--------------------------------------------+ |
|
535 | :const:`LOCALE`, :const:`L` | Do a locale-aware match | |
|
536 +---------------------------------+--------------------------------------------+ |
|
537 | :const:`MULTILINE`, :const:`M` | Multi-line matching, affecting ``^`` and | |
|
538 | | ``$`` | |
|
539 +---------------------------------+--------------------------------------------+ |
|
540 | :const:`VERBOSE`, :const:`X` | Enable verbose REs, which can be organized | |
|
541 | | more cleanly and understandably. | |
|
542 +---------------------------------+--------------------------------------------+ |
|
543 |
|
544 |
|
545 .. data:: I |
|
546 IGNORECASE |
|
547 :noindex: |
|
548 |
|
   Perform case-insensitive matching; character classes and literal strings will
   match letters by ignoring case. For example, ``[A-Z]`` will match lowercase
   letters, too, and ``Spam`` will match ``Spam``, ``spam``, or ``spAM``. This
   lowercasing doesn't take the current locale into account; it will if you also
   set the :const:`LOCALE` flag.
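A minimal sketch of case-insensitive matching, using made-up sample strings::

   import re

   p = re.compile(r'spam', re.IGNORECASE)
   found = p.findall('Spam, spam, spAM and eggs')
   print(found)        # ['Spam', 'spam', 'spAM']

   # With the flag set, a class of uppercase letters also matches lowercase.
   letters = re.compile(r'[A-Z]+', re.I)
   lowered = letters.match('abc').group()
   print(lowered)      # abc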
|
554 |
|
555 |
|
556 .. data:: L |
|
557 LOCALE |
|
558 :noindex: |
|
559 |
|
   Make ``\w``, ``\W``, ``\b``, and ``\B`` dependent on the current locale.
|
561 |
|
   Locales are a feature of the C library intended to help in writing programs that
   take account of language differences. For example, if you're processing French
   text, you'd want to be able to write ``\w+`` to match words, but ``\w`` only
   matches the character class ``[A-Za-z0-9_]``; it won't match ``'é'`` or ``'ç'``. If
   your system is configured properly and a French locale is selected, certain C
   functions will tell the program that ``'é'`` should also be considered a letter.
   Setting the :const:`LOCALE` flag when compiling a regular expression will cause
   the resulting compiled object to use these C functions for ``\w``; this is
   slower, but also enables ``\w+`` to match French words as you'd expect.
|
571 |
|
572 |
|
573 .. data:: M |
|
574 MULTILINE |
|
575 :noindex: |
|
576 |
|
577 (``^`` and ``$`` haven't been explained yet; they'll be introduced in section |
|
578 :ref:`more-metacharacters`.) |
|
579 |
|
   Usually ``^`` matches only at the beginning of the string, and ``$`` matches
   only at the end of the string and immediately before the newline (if any) at the
   end of the string. When this flag is specified, ``^`` matches at the beginning
   of the string and at the beginning of each line within the string, immediately
   following each newline. Similarly, the ``$`` metacharacter matches both at
   the end of the string and at the end of each line (immediately preceding each
   newline).
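The effect of :const:`MULTILINE` on ``^`` and ``$`` can be seen in a short
sketch; the two-line sample string is made up::

   import re

   text = 'first line\nsecond line'

   # Without the flag, ^ and $ only anchor to the whole string.
   start_default = re.findall(r'^\w+', text)
   end_default = re.findall(r'\w+$', text)
   print(start_default)    # ['first']
   print(end_default)      # ['line']

   # With the flag, they also anchor to each line within the string.
   start_multi = re.findall(r'^\w+', text, re.MULTILINE)
   end_multi = re.findall(r'\w+$', text, re.MULTILINE)
   print(start_multi)      # ['first', 'second']
   print(end_multi)        # ['line', 'line']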
|
587 |
|
588 |
|
589 .. data:: S |
|
590 DOTALL |
|
591 :noindex: |
|
592 |
|
593 Makes the ``'.'`` special character match any character at all, including a |
|
594 newline; without this flag, ``'.'`` will match anything *except* a newline. |
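A minimal sketch of the difference, using a made-up two-line string::

   import re

   text = 'one\ntwo'

   # By default '.' stops at the newline.
   default = re.match(r'.+', text).group()
   print(default == 'one')          # True

   # With DOTALL, '.' matches the newline as well.
   dotall = re.match(r'.+', text, re.DOTALL).group()
   print(dotall == 'one\ntwo')      # True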
|
595 |
|
596 |
|
597 .. data:: X |
|
598 VERBOSE |
|
599 :noindex: |
|
600 |
|
   This flag allows you to write regular expressions that are more readable by
   granting you more flexibility in how you can format them. When this flag has
   been specified, whitespace within the RE string is ignored, except when the
   whitespace is in a character class or preceded by an unescaped backslash; this
   lets you organize and indent the RE more clearly. This flag also lets you put
   comments within a RE that will be ignored by the engine; comments are marked by
   a ``'#'`` that's neither in a character class nor preceded by an unescaped
   backslash.
|
609 |
|
610 For example, here's a RE that uses :const:`re.VERBOSE`; see how much easier it |
|
611 is to read? :: |
|
612 |
|
      charref = re.compile(r"""
       &[#]                # Start of a numeric entity reference
       (
           0[0-7]+         # Octal form
         | [0-9]+          # Decimal form
         | x[0-9a-fA-F]+   # Hexadecimal form
       )
       ;                   # Trailing semicolon
      """, re.VERBOSE)
|
622 |
|
623 Without the verbose setting, the RE would look like this:: |
|
624 |
|
      charref = re.compile("&#(0[0-7]+"
                           "|[0-9]+"
                           "|x[0-9a-fA-F]+);")
|
628 |
|
629 In the above example, Python's automatic concatenation of string literals has |
|
630 been used to break up the RE into smaller pieces, but it's still more difficult |
|
631 to understand than the version using :const:`re.VERBOSE`. |
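To check that the two spellings really describe the same pattern, one can run
both over a few inputs. This is only a sketch; the names ``verbose`` and
``compact`` and the sample strings are assumptions made for illustration::

   import re

   verbose = re.compile(r"""
    &[#]                # Start of a numeric entity reference
    (
        0[0-7]+         # Octal form
      | [0-9]+          # Decimal form
      | x[0-9a-fA-F]+   # Hexadecimal form
    )
    ;                   # Trailing semicolon
   """, re.VERBOSE)

   compact = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

   samples = ['&#65;', '&#x41;', '&#0101;', '&amp;', '&#;']
   verbose_hits = [verbose.match(s) is not None for s in samples]
   compact_hits = [compact.match(s) is not None for s in samples]
   print(verbose_hits == compact_hits)    # True
   print(verbose_hits)    # [True, True, True, False, False]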
|
632 |
|
633 |
|
634 More Pattern Power |
|
635 ================== |
|
636 |
|
637 So far we've only covered a part of the features of regular expressions. In |
|
638 this section, we'll cover some new metacharacters, and how to use groups to |
|
639 retrieve portions of the text that was matched. |
|
640 |
|
641 |
|
642 .. _more-metacharacters: |
|
643 |
|
644 More Metacharacters |
|
645 ------------------- |
|
646 |
|
647 There are some metacharacters that we haven't covered yet. Most of them will be |
|
648 covered in this section. |
|
649 |
|
650 Some of the remaining metacharacters to be discussed are :dfn:`zero-width |
|
651 assertions`. They don't cause the engine to advance through the string; |
|
652 instead, they consume no characters at all, and simply succeed or fail. For |
|
653 example, ``\b`` is an assertion that the current position is located at a word |
|
654 boundary; the position isn't changed by the ``\b`` at all. This means that |
|
655 zero-width assertions should never be repeated, because if they match once at a |
|
656 given location, they can obviously be matched an infinite number of times. |
|
657 |
|
658 ``|`` |
|
659 Alternation, or the "or" operator. If A and B are regular expressions, |
|
660 ``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very |
|
661 low precedence in order to make it work reasonably when you're alternating |
|
662 multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``, |
|
663 not ``Cro``, a ``'w'`` or an ``'S'``, and ``ervo``. |
|
664 |
|
665 To match a literal ``'|'``, use ``\|``, or enclose it inside a character class, |
|
666 as in ``[|]``. |
|
667 |
|
668 ``^`` |
|
669 Matches at the beginning of lines. Unless the :const:`MULTILINE` flag has been |
|
670 set, this will only match at the beginning of the string. In :const:`MULTILINE` |
|
671 mode, this also matches immediately after each newline within the string. |
|
672 |
|
673 For example, if you wish to match the word ``From`` only at the beginning of a |
|
674 line, the RE to use is ``^From``. :: |
|
675 |
|
676 >>> print re.search('^From', 'From Here to Eternity') |
|
677 <re.MatchObject instance at 80c1520> |
|
678 >>> print re.search('^From', 'Reciting From Memory') |
|
679 None |
|
680 |
|
681 To match a literal ``'^'``, use ``\^`` or enclose it inside a character |
|
682 class, as in ``[\^]``. |
|
683 |
|
684 ``$`` |
|
685 Matches at the end of a line, which is defined as either the end of the string, |
|
686 or any location followed by a newline character. :: |
|
687 |
|
688 >>> print re.search('}$', '{block}') |
|
689 <re.MatchObject instance at 80adfa8> |
|
690 >>> print re.search('}$', '{block} ') |
|
691 None |
|
692 >>> print re.search('}$', '{block}\n') |
|
693 <re.MatchObject instance at 80adfa8> |
|
694 |
|
695 To match a literal ``'$'``, use ``\$`` or enclose it inside a character class, |
|
696 as in ``[$]``. |
|
697 |
|
698 ``\A`` |
|
699 Matches only at the start of the string. When not in :const:`MULTILINE` mode, |
|
700 ``\A`` and ``^`` are effectively the same. In :const:`MULTILINE` mode, they're |
|
701 different: ``\A`` still matches only at the beginning of the string, but ``^`` |
|
702 may match at any location inside the string that follows a newline character. |
|
703 |
|
704 ``\Z`` |
|
705 Matches only at the end of the string. |
|
706 |
|
707 ``\b`` |
|
708 Word boundary. This is a zero-width assertion that matches only at the |
|
709 beginning or end of a word. A word is defined as a sequence of alphanumeric |
|
710 characters, so the end of a word is indicated by whitespace or a |
|
711 non-alphanumeric character. |
|
712 |
|
713 The following example matches ``class`` only when it's a complete word; it won't |
|
714 match when it's contained inside another word. :: |
|
715 |
|
716 >>> p = re.compile(r'\bclass\b') |
|
717 >>> print p.search('no class at all') |
|
718 <re.MatchObject instance at 80c8f28> |
|
719 >>> print p.search('the declassified algorithm') |
|
720 None |
|
721 >>> print p.search('one subclass is') |
|
722 None |
|
723 |
|
724 There are two subtleties you should remember when using this special sequence. |
|
725 First, this is the worst collision between Python's string literals and regular |
|
726 expression sequences. In Python's string literals, ``\b`` is the backspace |
|
727 character, ASCII value 8. If you're not using raw strings, then Python will |
|
728 convert the ``\b`` to a backspace, and your RE won't match as you expect it to. |
|
729 The following example looks the same as our previous RE, but omits the ``'r'`` |
|
730 in front of the RE string. :: |
|
731 |
|
732 >>> p = re.compile('\bclass\b') |
|
733 >>> print p.search('no class at all') |
|
734 None |
|
735 >>> print p.search('\b' + 'class' + '\b') |
|
736 <re.MatchObject instance at 80c3ee0> |
|
737 |
|
738 Second, inside a character class, where there's no use for this assertion, |
|
739 ``\b`` represents the backspace character, for compatibility with Python's |
|
740 string literals. |
|
741 |
|
742 ``\B`` |
|
743 Another zero-width assertion, this is the opposite of ``\b``, only matching when |
|
744 the current position is not at a word boundary. |
|
745 |
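The precedence of ``|`` described in the list above can be checked with a
short sketch. The string ``'CroServo'`` and the variable names are made up
for illustration.

```python
import re

# '|' binds more loosely than concatenation: this is "Crow" OR "Servo".
p = re.compile(r'Crow|Servo')
crow = p.match('Crow').group()
servo = p.match('Servo').group()

# It does not mean "Cro", then 'w' or 'S', then "ervo".
mixed = p.match('CroServo')

# Parentheses limit the scope of '|' when that reading is wanted.
scoped = re.compile(r'Cro(w|S)ervo')
scoped_match = scoped.match('CroServo').group()

# A literal '|' can be written inside a character class.
pipe_parts = re.split(r'[|]', 'a|b|c')
```

Here ``mixed`` is ``None``, while the parenthesised pattern matches the
whole of ``'CroServo'``.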
|
746 |
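The anchors above are easiest to compare side by side. The following is a
small sketch using a made-up two-line string; the variable names are
assumptions chosen for this example.

```python
import re

text = 'first line\nsecond line\n'

# Without MULTILINE, ^ matches only at the very start of the string.
starts_default = re.findall(r'^\w+', text)
# With MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)
# \A ignores MULTILINE and still matches only at the start.
starts_a = re.findall(r'\A\w+', text, re.MULTILINE)

# $ can match just before the final newline; \Z needs the true end.
dollar = re.search(r'line$', text)
slash_z = re.search(r'line\Z', text)

# \B is the opposite of \b: "class" must sit inside a longer word.
inside_word = re.search(r'\Bclass\B', 'the declassified algorithm')
whole_word = re.search(r'\Bclass\B', 'no class at all')
```

With this input, ``starts_multiline`` picks up both ``'first'`` and
``'second'``, ``slash_z`` is ``None`` because the string ends in a newline,
and ``whole_word`` is ``None`` because ``class`` stands alone there.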
|
747 Grouping |
|
748 -------- |
|
749 |
|
750 Frequently you need to obtain more information than just whether the RE matched |
|
751 or not. Regular expressions are often used to dissect strings by writing a RE |
|
752 divided into several subgroups which match different components of interest. |
|
753 For example, an RFC-822 header line is divided into a header name and a value, |
|
754 separated by a ``':'``, like this:: |
|
755 |
|
756 From: author@example.com |
|
757 User-Agent: Thunderbird 1.5.0.9 (X11/20061227) |
|
758 MIME-Version: 1.0 |
|
759 To: editor@example.com |
|
760 |
|
761 This can be handled by writing a regular expression which matches an entire |
|
762 header line, and has one group which matches the header name, and another group |
|
763 which matches the header's value. |
|
764 |
|
765 Groups are marked by the ``'('``, ``')'`` metacharacters. ``'('`` and ``')'`` |
|
766 have much the same meaning as they do in mathematical expressions; they group |
|
767 together the expressions contained inside them, and you can repeat the contents |
|
768 of a group with a repeating qualifier, such as ``*``, ``+``, ``?``, or |
|
769 ``{m,n}``. For example, ``(ab)*`` will match zero or more repetitions of |
|
770 ``ab``. :: |
|
771 |
|
772 >>> p = re.compile('(ab)*') |
|
773 >>> print p.match('ababababab').span() |
|
774 (0, 10) |
|
775 |
|
776 Groups indicated with ``'('``, ``')'`` also capture the starting and ending |
|
777 index of the text that they match; this can be retrieved by passing an argument |
|
778 to :meth:`group`, :meth:`start`, :meth:`end`, and :meth:`span`. Groups are |
|
779 numbered starting with 0. Group 0 is always present; it's the whole RE, so |
|
780 :class:`MatchObject` methods all have group 0 as their default argument. Later |
|
781 we'll see how to express groups that don't capture the span of text that they |
|
782 match. :: |
|
783 |
|
784 >>> p = re.compile('(a)b') |
|
785 >>> m = p.match('ab') |
|
786 >>> m.group() |
|
787 'ab' |
|
788 >>> m.group(0) |
|
789 'ab' |
|
790 |
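The same match can be probed per group by passing a group number to these
methods. This sketch reuses the pattern from the example above; the
variable names are illustrative.

```python
import re

p = re.compile('(a)b')
m = p.match('ab')

# With no argument, the methods describe group 0, the whole match.
whole_span = m.span()

# With an argument, they describe that capturing group instead.
group1_text = m.group(1)
group1_span = m.span(1)
group1_start = m.start(1)
group1_end = m.end(1)
```

Group 1 covers only the ``a``, so its span is shorter than the span of the
whole match.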
|
791 Subgroups are numbered from left to right, from 1 upward. Groups can be nested; |
|
792 to determine the number, just count the opening parenthesis characters, going |
|
793 from left to right. :: |
|
794 |
|
795 >>> p = re.compile('(a(b)c)d') |
|
796 >>> m = p.match('abcd') |
|
797 >>> m.group(0) |
|
798 'abcd' |
|
799 >>> m.group(1) |
|
800 'abc' |
|
801 >>> m.group(2) |
|
802 'b' |
|
803 |
|
804 :meth:`group` can be passed multiple group numbers at a time, in which case it |
|
805 will return a tuple containing the corresponding values for those groups. :: |
|
806 |
|
807 >>> m.group(2,1,2) |
|
808 ('b', 'abc', 'b') |
|
809 |
|
810 The :meth:`groups` method returns a tuple containing the strings for all the |
|
811 subgroups, from 1 up to however many there are. :: |
|
812 |
|
813 >>> m.groups() |
|
814 ('abc', 'b') |
|
815 |
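Returning to the RFC-822 header lines shown at the start of this section,
one possible pattern is sketched below. The pattern itself is an assumption
made for illustration (real header parsing has more cases, such as
continuation lines); group 1 is meant to hold the header name and group 2
its value.

```python
import re

# Hypothetical pattern: name up to the colon, then the rest of the line.
header = re.compile(r'([^:\s]+):\s*(.*)')

lines = [
    'From: author@example.com',
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
    'MIME-Version: 1.0',
    'To: editor@example.com',
]

parsed = []
for line in lines:
    m = header.match(line)
    if m is not None:
        # groups() returns (name, value) for this two-group pattern.
        parsed.append(m.groups())
```

Each entry of ``parsed`` is a ``(name, value)`` tuple, for example
``('From', 'author@example.com')``.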
|
816 Backreferences in a pattern allow you to specify that the contents of an earlier |
|
817 capturing group must also be found at the current location in the string. For |
|
818 example, ``\1`` will succeed if the exact contents of group 1 can be found at |
|
819 the current position, and fails otherwise. Remember that Python's string |
|
820 literals also use a backslash followed by numbers to allow including arbitrary |
|
821 characters in a string, so be sure to use a raw string when incorporating |
|
822 backreferences in a RE. |
|
823 |
|
824 For example, the following RE detects doubled words in a string. :: |
|
825 |
|
826 >>> p = re.compile(r'(\b\w+)\s+\1') |
|
827 >>> p.search('Paris in the the spring').group() |
|
828 'the the' |
|
829 |
|
830 Backreferences like this aren't often useful for just searching through a string |
|
831 --- there are few text formats which repeat data in this way --- but you'll soon |
|
832 find out that they're *very* useful when performing string substitutions. |
|
833 |
|
834 |
|
835 Non-capturing and Named Groups |
|
836 ------------------------------ |
|
837 |
|
838 Elaborate REs may use many groups, both to capture substrings of interest, and |
|
839 to group and structure the RE itself. In complex REs, it becomes difficult to |
|
840 keep track of the group numbers. There are two features which help with this |
|
841 problem. Both of them use a common syntax for regular expression extensions, so |
|
842 we'll look at that first. |
|
843 |
|
844 Perl 5 added several additional features to standard regular expressions, and |
|
845 the Python :mod:`re` module supports most of them. It would have been |
|
846 difficult to choose new single-keystroke metacharacters or new special sequences |
|
847 beginning with ``\`` to represent the new features without making Perl's regular |
|
848 expressions confusingly different from standard REs. If you chose ``&`` as a |
|
849 new metacharacter, for example, old expressions would be assuming that ``&`` was |
|
850 a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``. |
|
851 |
|
852 The solution chosen by the Perl developers was to use ``(?...)`` as the |
|
853 extension syntax. ``?`` immediately after a parenthesis was a syntax error |
|
854 because the ``?`` would have nothing to repeat, so this didn't introduce any |
|
855 compatibility problems. The characters immediately after the ``?`` indicate |
|
856 what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead |
|
857 assertion) and ``(?:foo)`` is something else (a non-capturing group containing |
|
858 the subexpression ``foo``). |
|
859 |
|
860 Python adds an extension syntax to Perl's extension syntax. If the first |
|
861 character after the question mark is a ``P``, you know that it's an extension |
|
862 that's specific to Python. Currently there are two such extensions: |
|
863 ``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to |
|
864 a named group. If future versions of Perl 5 add similar features using a |
|
865 different syntax, the :mod:`re` module will be changed to support the new |
|
866 syntax, while preserving the Python-specific syntax for compatibility's sake. |
|
867 |
|
868 Now that we've looked at the general extension syntax, we can return to the |
|
869 features that simplify working with groups in complex REs. Since groups are |
|
870 numbered from left to right and a complex expression may use many groups, it can |
|
871 become difficult to keep track of the correct numbering. Modifying such a |
|
872 complex RE is annoying, too: insert a new group near the beginning and you |
|
873 change the numbers of everything that follows it. |
|
874 |
|
875 Sometimes you'll want to use a group to collect a part of a regular expression, |
|
876 but aren't interested in retrieving the group's contents. You can make this fact |
|
877 explicit by using a non-capturing group: ``(?:...)``, where you can replace the |
|
878 ``...`` with any other regular expression. :: |
|
879 |
|
880 >>> m = re.match("([abc])+", "abc") |
|
881 >>> m.groups() |
|
882 ('c',) |
|
883 >>> m = re.match("(?:[abc])+", "abc") |
|
884 >>> m.groups() |
|
885 () |
|
886 |
|
887 Except for the fact that you can't retrieve the contents of what the group |
|
888 matched, a non-capturing group behaves exactly the same as a capturing group; |
|
889 you can put anything inside it, repeat it with a repetition metacharacter such |
|
890 as ``*``, and nest it within other groups (capturing or non-capturing). |
|
891 ``(?:...)`` is particularly useful when modifying an existing pattern, since you |
|
892 can add new groups without changing how all the other groups are numbered. It |
|
893 should be mentioned that there's no performance difference in searching between |
|
894 capturing and non-capturing groups; neither form is any faster than the other. |
|
895 |
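To see why this matters when modifying a pattern, consider the following
sketch. The ``key=value`` pattern and the optional ``set`` prefix are
invented for this example.

```python
import re

# Original pattern: group 1 is the key, group 2 is the value.
original = re.compile(r'(\w+)=(\w+)')

# A non-capturing group adds an optional prefix without renumbering.
extended = re.compile(r'(?:set\s+)?(\w+)=(\w+)')

# A capturing group in the same place shifts the existing groups by one.
shifted = re.compile(r'(set\s+)?(\w+)=(\w+)')

a = original.match('colour=red').groups()
b = extended.match('set colour=red').groups()
c = shifted.match('set colour=red').groups()
```

Code that reads ``group(1)`` and ``group(2)`` keeps working with
``extended``, but would need updating for ``shifted``.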
|
896 A more significant feature is named groups: instead of referring to them by |
|
897 numbers, groups can be referenced by a name. |
|
898 |
|
899 The syntax for a named group is one of the Python-specific extensions: |
|
900 ``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups |
|
901 also behave exactly like capturing groups, and additionally associate a name |
|
902 with a group. The :class:`MatchObject` methods that deal with capturing groups |
|
903 all accept either integers that refer to the group by number or strings that |
|
904 contain the desired group's name. Named groups are still given numbers, so you |
|
905 can retrieve information about a group in two ways:: |
|
906 |
|
907 >>> p = re.compile(r'(?P<word>\b\w+\b)') |
|
908 >>> m = p.search( '(((( Lots of punctuation )))' ) |
|
909 >>> m.group('word') |
|
910 'Lots' |
|
911 >>> m.group(1) |
|
912 'Lots' |
|
913 |
|
914 Named groups are handy because they let you use easily-remembered names, instead |
|
915 of having to remember numbers. Here's an example RE from the :mod:`imaplib` |
|
916 module:: |
|
917 |
|
918 InternalDate = re.compile(r'INTERNALDATE "' |
|
919 r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' |
|
920 r'(?P<year>[0-9][0-9][0-9][0-9])' |
|
921 r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' |
|
922 r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' |
|
923 r'"') |
|
924 |
|
925 It's obviously much easier to retrieve ``m.group('zonem')``, instead of having |
|
926 to remember to retrieve group 9. |
|
927 |
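As a rough illustration, the pattern above can be applied to a sample
string. The sample text below is made up for this sketch and is not taken
from :mod:`imaplib`.

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
    r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
    r'(?P<year>[0-9][0-9][0-9][0-9])'
    r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
    r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
    r'"')

# Hypothetical input, used only to exercise the pattern.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.match(sample)

# The same group is reachable by name or by position.
by_name = m.group('zonem')
by_number = m.group(9)

# Several names can be requested at once, just like numbers.
date_parts = m.group('day', 'mon', 'year')
```

Both ``by_name`` and ``by_number`` hold the zone minutes, but only the
first spelling says so.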
|
928 The syntax for backreferences in an expression such as ``(...)\1`` refers to the |
|
929 number of the group. There's naturally a variant that uses the group name |
|
930 instead of the number. This is another Python extension: ``(?P=name)`` indicates |
|
931 that the contents of the group called *name* should again be matched at the |
|
932 current point. The regular expression for finding doubled words, |
|
933 ``(\b\w+)\s+\1`` can also be written as ``(?P<word>\b\w+)\s+(?P=word)``:: |
|
934 |
|
935 >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') |
|
936 >>> p.search('Paris in the the spring').group() |
|
937 'the the' |
|
938 |
|
939 |
|
940 Lookahead Assertions |
|
941 -------------------- |
|
942 |
|
943 Another zero-width assertion is the lookahead assertion. Lookahead assertions |
|
944 are available in both positive and negative form, and look like this: |
|
945 |
|
946 ``(?=...)`` |
|
947 Positive lookahead assertion. This succeeds if the contained regular |
|
948 expression, represented here by ``...``, successfully matches at the current |
|
949 location, and fails otherwise. But, once the contained expression has been |
|
950 tried, the matching engine doesn't advance at all; the rest of the pattern is |
|
951 tried right where the assertion started. |
|
952 |
|
953 ``(?!...)`` |
|
954 Negative lookahead assertion. This is the opposite of the positive assertion; |
|
955 it succeeds if the contained expression *doesn't* match at the current position |
|
956 in the string. |
|
957 |
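A tiny sketch makes the "doesn't advance" behaviour visible. The strings
``foobar`` and ``foobaz`` are arbitrary examples.

```python
import re

# (?=...) checks what follows without consuming it.
m = re.match(r'foo(?=bar)', 'foobar')
matched_text = m.group()
end_position = m.end()

# (?!...) succeeds only when the contained pattern does not match here.
neg_ok = re.match(r'foo(?!bar)', 'foobaz')
neg_fail = re.match(r'foo(?!bar)', 'foobar')
```

The positive lookahead requires ``bar`` to follow, yet the match itself
stops after ``foo``. The negative form accepts ``foobaz`` and rejects
``foobar``.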
|
958 To make this concrete, let's look at a case where a lookahead is useful. |
|
959 Consider a simple pattern to match a filename and split it apart into a base |
|
960 name and an extension, separated by a ``.``. For example, in ``news.rc``, |
|
961 ``news`` is the base name, and ``rc`` is the filename's extension. |
|
962 |
|
963 The pattern to match this is quite simple: |
|
964 |
|
965 ``.*[.].*$`` |
|
966 |
|
967 Notice that the ``.`` needs to be treated specially because it's a |
|
968 metacharacter; I've put it inside a character class. Also notice the trailing |
|
969 ``$``; this is added to ensure that all the rest of the string must be included |
|
970 in the extension. This regular expression matches ``foo.bar`` and |
|
971 ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``. |
|
972 |
|
973 Now, consider complicating the problem a bit; what if you want to match |
|
974 filenames where the extension is not ``bat``? Some incorrect attempts: |
|
975 |
|
976 ``.*[.][^b].*$`` is a first attempt; it tries to exclude ``bat`` by requiring |
|
977 that the first character of the extension is not a ``b``. This is wrong, |
|
978 because the pattern also doesn't match ``foo.bar``. |
|
979 |
|
980 ``.*[.]([^b]..|.[^a].|..[^t])$`` |
|
981 |
|
982 The expression gets messier when you try to patch up the first solution by |
|
983 requiring one of the following cases to match: the first character of the |
|
984 extension isn't ``b``; the second character isn't ``a``; or the third character |
|
985 isn't ``t``. This accepts ``foo.bar`` and rejects ``autoexec.bat``, but it |
|
986 requires a three-letter extension and won't accept a filename with a two-letter |
|
987 extension such as ``sendmail.cf``. We'll complicate the pattern again in an |
|
988 effort to fix it. |
|
989 |
|
990 ``.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$`` |
|
991 |
|
992 In the third attempt, the second and third letters are both made optional in |
|
993 order to allow matching extensions shorter than three characters, such as |
|
994 ``sendmail.cf``. |
|
995 |
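The behaviour of the three attempts can be compared with a short sketch.
The list of filenames is taken from the discussion above; the variable
names are illustrative.

```python
import re

attempts = [
    r'.*[.][^b].*$',
    r'.*[.]([^b]..|.[^a].|..[^t])$',
    r'.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$',
]
names = ['foo.bar', 'autoexec.bat', 'sendmail.cf']

# For each attempt, collect the filenames it accepts.
accepted = []
for pattern in attempts:
    accepted.append([n for n in names if re.match(pattern, n)])
```

Only the third attempt accepts both ``foo.bar`` and ``sendmail.cf`` while
still rejecting ``autoexec.bat``.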
|
996 The pattern's getting really complicated now, which makes it hard to read and |
|
997 understand. Worse, if the problem changes and you want to exclude both ``bat`` |
|
998 and ``exe`` as extensions, the pattern would get even more complicated and |
|
999 confusing. |
|
1000 |
|
1001 A negative lookahead cuts through all this confusion: |
|
1002 |
|
1003 ``.*[.](?!bat$).*$`` The negative lookahead means: if the expression ``bat$`` |
|
1004 doesn't match at this point, try the rest of the pattern; if ``bat$`` does |
|
1005 match, the whole pattern will fail. The trailing ``$`` is required to ensure |
|
1006 that something like ``sample.batch``, where the extension only starts with |
|
1007 ``bat``, will be allowed. |
|
1008 |
|
1009 Excluding another filename extension is now easy; simply add it as an |
|
1010 alternative inside the assertion. The following pattern excludes filenames that |
|
1011 end in either ``bat`` or ``exe``: |
|
1012 |
|
1013 ``.*[.](?!bat$|exe$).*$`` |
|
1014 |
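Both lookahead patterns can be tried against a handful of filenames. The
name ``setup.exe`` is an extra sample added for this sketch.

```python
import re

not_bat = re.compile(r'.*[.](?!bat$).*$')
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'sample.batch', 'setup.exe']

# Keep only the filenames each pattern accepts.
allowed_1 = [n for n in names if not_bat.match(n)]
allowed_2 = [n for n in names if not_bat_or_exe.match(n)]
```

``sample.batch`` passes both filters because the ``$`` inside the assertion
requires the extension to be exactly ``bat``.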
|
1015 |
|
1016 Modifying Strings |
|
1017 ================= |
|
1018 |
|
1019 Up to this point, we've simply performed searches against a static string. |
|
1020 Regular expressions are also commonly used to modify strings in various ways, |
|
1021 using the following :class:`RegexObject` methods: |
|
1022 |
|
1023 +------------------+-----------------------------------------------+ |
|
1024 | Method/Attribute | Purpose                                       | |
|
1025 +==================+===============================================+ |
|
1026 | ``split()``      | Split the string into a list, splitting it    | |
|
1027 |                  | wherever the RE matches                       | |
|
1028 +------------------+-----------------------------------------------+ |
|
1029 | ``sub()``        | Find all substrings where the RE matches, and | |
|
1030 |                  | replace them with a different string          | |
|
1031 +------------------+-----------------------------------------------+ |
|
1032 | ``subn()``       | Does the same thing as :meth:`sub`, but       | |
|
1033 |                  | returns the new string and the number of      | |
|
1034 |                  | replacements                                  | |
|
1035 +------------------+-----------------------------------------------+ |
|
1036 |
|
1037 |
|
1038 Splitting Strings |
|
1039 ----------------- |
|
1040 |
|
1041 The :meth:`split` method of a :class:`RegexObject` splits a string apart |
|
1042 wherever the RE matches, returning a list of the pieces. It's similar to the |
|
1043 :meth:`split` method of strings but provides much more generality in the |
|
1044 delimiters that you can split by; string :meth:`split` only supports splitting by |
|
1045 whitespace or by a fixed string. As you'd expect, there's a module-level |
|
1046 :func:`re.split` function, too. |
|
1047 |
|
1048 |
|
1049 .. method:: .split(string [, maxsplit=0]) |
|
1050 :noindex: |
|
1051 |
|
1052 Split *string* by the matches of the regular expression. If capturing |
|
1053 parentheses are used in the RE, then their contents will also be returned as |
|
1054 part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit* splits |
|
1055 are performed. |
|
1056 |
|
1057 You can limit the number of splits made by passing a value for *maxsplit*. |
|
1058 When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the |
|
1059 remainder of the string is returned as the final element of the list. In the |
|
1060 following example, the delimiter is any sequence of non-alphanumeric characters. |
|
1061 :: |
|
1062 |
|
1063 >>> p = re.compile(r'\W+') |
|
1064 >>> p.split('This is a test, short and sweet, of split().') |
|
1065 ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] |
|
1066 >>> p.split('This is a test, short and sweet, of split().', 3) |
|
1067 ['This', 'is', 'a', 'test, short and sweet, of split().'] |
|
1068 |
|
1069 Sometimes you're not only interested in what the text between delimiters is, but |
|
1070 also need to know what the delimiter was. If capturing parentheses are used in |
|
1071 the RE, then their values are also returned as part of the list. Compare the |
|
1072 following calls:: |
|
1073 |
|
1074 >>> p = re.compile(r'\W+') |
|
1075 >>> p2 = re.compile(r'(\W+)') |
|
1076 >>> p.split('This... is a test.') |
|
1077 ['This', 'is', 'a', 'test', ''] |
|
1078 >>> p2.split('This... is a test.') |
|
1079 ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] |
|
1080 |
|
1081 The module-level function :func:`re.split` adds the RE to be used as the first |
|
1082 argument, but is otherwise the same. :: |
|
1083 |
|
1084 >>> re.split(r'[\W]+', 'Words, words, words.') |
|
1085 ['Words', 'words', 'words', ''] |
|
1086 >>> re.split(r'([\W]+)', 'Words, words, words.') |
|
1087 ['Words', ', ', 'words', ', ', 'words', '.', ''] |
|
1088 >>> re.split(r'[\W]+', 'Words, words, words.', 1) |
|
1089 ['Words', 'words, words.'] |
|
1090 |
|
1091 |
|
1092 Search and Replace |
|
1093 ------------------ |
|
1094 |
|
1095 Another common task is to find all the matches for a pattern, and replace them |
|
1096 with a different string. The :meth:`sub` method takes a replacement value, |
|
1097 which can be either a string or a function, and the string to be processed. |
|
1098 |
|
1099 |
|
1100 .. method:: .sub(replacement, string[, count=0]) |
|
1101 :noindex: |
|
1102 |
|
1103 Returns the string obtained by replacing the leftmost non-overlapping |
|
1104 occurrences of the RE in *string* by the replacement *replacement*. If the |
|
1105 pattern isn't found, *string* is returned unchanged. |
|
1106 |
|
1107 The optional argument *count* is the maximum number of pattern occurrences to be |
|
1108 replaced; *count* must be a non-negative integer. The default value of 0 means |
|
1109 to replace all occurrences. |
|
1110 |
|
1111 Here's a simple example of using the :meth:`sub` method. It replaces colour |
|
1112 names with the word ``colour``:: |
|
1113 |
|
1114 >>> p = re.compile( '(blue|white|red)') |
|
1115 >>> p.sub( 'colour', 'blue socks and red shoes') |
|
1116 'colour socks and colour shoes' |
|
1117 >>> p.sub( 'colour', 'blue socks and red shoes', count=1) |
|
1118 'colour socks and red shoes' |
|
1119 |
|
1120 The :meth:`subn` method does the same work, but returns a 2-tuple containing the |
|
1121 new string value and the number of replacements that were performed:: |
|
1122 |
|
1123 >>> p = re.compile( '(blue|white|red)') |
|
1124 >>> p.subn( 'colour', 'blue socks and red shoes') |
|
1125 ('colour socks and colour shoes', 2) |
|
1126 >>> p.subn( 'colour', 'no colours at all') |
|
1127 ('no colours at all', 0) |
|
1128 |
|
1129 Empty matches are replaced only when they're not adjacent to a previous match. |
|
1130 :: |
|
1131 |
|
1132 >>> p = re.compile('x*') |
|
1133 >>> p.sub('-', 'abxd') |
|
1134 '-a-b-d-' |
|
1135 |
|
1136 If *replacement* is a string, any backslash escapes in it are processed. That |
|
1137 is, ``\n`` is converted to a single newline character, ``\r`` is converted to a |
|
1138 carriage return, and so forth. Unknown escapes such as ``\j`` are left alone. |
|
1139 Backreferences, such as ``\6``, are replaced with the substring matched by the |
|
1140 corresponding group in the RE. This lets you incorporate portions of the |
|
1141 original text in the resulting replacement string. |
|
1142 |
|
1143 This example matches the word ``section`` followed by a string enclosed in |
|
1144 ``{``, ``}``, and changes ``section`` to ``subsection``:: |
|
1145 |
|
1146 >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) |
|
1147 >>> p.sub(r'subsection{\1}','section{First} section{second}') |
|
1148 'subsection{First} subsection{second}' |
|
1149 |
|
1150 There's also a syntax for referring to named groups as defined by the |
|
1151 ``(?P<name>...)`` syntax. ``\g<name>`` will use the substring matched by the |
|
1152 group named ``name``, and ``\g<number>`` uses the corresponding group number. |
|
1153 ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous in a |
|
1154 replacement string such as ``\g<2>0``. (``\20`` would be interpreted as a |
|
1155 reference to group 20, not a reference to group 2 followed by the literal |
|
1156 character ``'0'``.) The following substitutions are all equivalent, but use all |
|
1157 three variations of the replacement string. :: |
|
1158 |
|
1159 >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) |
|
1160 >>> p.sub(r'subsection{\1}','section{First}') |
|
1161 'subsection{First}' |
|
1162 >>> p.sub(r'subsection{\g<1>}','section{First}') |
|
1163 'subsection{First}' |
|
1164 >>> p.sub(r'subsection{\g<name>}','section{First}') |
|
1165 'subsection{First}' |
|
1166 |
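The ambiguity that ``\g<number>`` avoids shows up as soon as a digit has to
follow a group reference. In this sketch the single-digit pattern and the
input string are made up for illustration.

```python
import re

p = re.compile(r'(\d)')

# \g<1>0 means "group 1, then a literal 0".
result = p.sub(r'\g<1>0', 'a1 b2')

# \10 is read as a reference to group 10, which this pattern lacks.
try:
    p.sub(r'\10', 'a1 b2')
    plain_form_worked = True
except re.error:
    plain_form_worked = False
```

The ``\g<1>0`` form appends a zero to each digit, while the ``\10`` form is
rejected because the pattern has only one group.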
|
1167 *replacement* can also be a function, which gives you even more control. If |
|
1168 *replacement* is a function, the function is called for every non-overlapping |
|
1169 occurrence of *pattern*. On each call, the function is passed a |
|
1170 :class:`MatchObject` argument for the match and can use this information to |
|
1171 compute the desired replacement string and return it. |
|
1172 |
|
1173 In the following example, the replacement function translates decimals into |
|
1174 hexadecimal:: |
|
1175 |
|
1176 >>> def hexrepl(match): |
|
1177 ...     "Return the hex string for a decimal number" |
|
1178 ...     value = int(match.group()) |
|
1179 ...     return hex(value) |
|
1180 ... |
|
1181 >>> p = re.compile(r'\d+') |
|
1182 >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') |
|
1183 'Call 0xffd2 for printing, 0xc000 for user code.' |
|
1184 |
|
1185 When using the module-level :func:`re.sub` function, the pattern is passed as |
|
1186 the first argument. The pattern may be a string or a :class:`RegexObject`; if |
|
1187 you need to specify regular expression flags, you must either use a |
|
1188 :class:`RegexObject` as the first parameter, or use embedded modifiers in the |
|
1189 pattern, e.g. ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``. |
|
1190 |
|
1191 |
|
1192 Common Problems |
|
1193 =============== |
|
1194 |
|
1195 Regular expressions are a powerful tool for some applications, but in some ways |
|
1196 their behaviour isn't intuitive and at times they don't behave the way you may |
|
1197 expect them to. This section will point out some of the most common pitfalls. |
|
1198 |
|
1199 |
|
1200 Use String Methods |
|
1201 ------------------ |
|
1202 |
|
1203 Sometimes using the :mod:`re` module is a mistake. If you're matching a fixed |
|
1204 string, or a single character class, and you're not using any :mod:`re` features |
|
1205 such as the :const:`IGNORECASE` flag, then the full power of regular expressions |
|
1206 may not be required. Strings have several methods for performing operations with |
|
1207 fixed strings and they're usually much faster, because the implementation is a |
|
1208 single small C loop that's been optimized for the purpose, instead of the large, |
|
1209 more generalized regular expression engine. |
|
1210 |
|
1211 One example might be replacing a single fixed string with another one; for |
|
1212 example, you might replace ``word`` with ``deed``. ``re.sub()`` seems like the |
|
1213 function to use for this, but consider the :meth:`replace` method. Note that |
|
1214 :meth:`replace` will also replace ``word`` inside words, turning ``swordfish`` |
|
1215 into ``sdeedfish``, but the naive RE ``word`` would have done that, too. (To |
|
1216 avoid performing the substitution on parts of words, the pattern would have to |
|
1217 be ``\bword\b``, in order to require that ``word`` have a word boundary on |
|
1218 either side. This takes the job beyond :meth:`replace`'s abilities.) |
|
1219 |
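The trade-off can be seen in a few lines. The sample text is an assumption
chosen so that ``word`` appears both alone and inside another word.

```python
import re

text = 'word swordfish'

# The string method and the naive RE both replace every occurrence.
via_replace = text.replace('word', 'deed')
via_naive_re = re.sub('word', 'deed', text)

# Only the RE with word boundaries leaves "swordfish" alone.
via_boundaries = re.sub(r'\bword\b', 'deed', text)
```

When the first two results are what you want, the string method is the
simpler choice; the third needs the :mod:`re` module.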
|
1220 Another common task is deleting every occurrence of a single character from a |
|
1221 string or replacing it with another single character. You might do this with |
|
1222 something like ``re.sub('\n', ' ', S)``, but :meth:`translate` is capable of |
|
1223 doing both tasks and will be faster than any regular expression operation can |
|
1224 be. |
|
1225 |
|
1226 In short, before turning to the :mod:`re` module, consider whether your problem |
|
1227 can be solved with a faster and simpler string method. |
|
1228 |
|
1229 |
|
1230 match() versus search() |
|
1231 ----------------------- |
|
1232 |
|
1233 The :func:`match` function only checks if the RE matches at the beginning of the |
|
1234 string while :func:`search` will scan forward through the string for a match. |
|
1235 It's important to keep this distinction in mind. Remember, :func:`match` will |
|
1236 only report a successful match that starts at position 0; if the match wouldn't |
|
1237 start at zero, :func:`match` will *not* report it. :: |
|
1238 |
|
1239 >>> print re.match('super', 'superstition').span() |
|
1240 (0, 5) |
|
1241 >>> print re.match('super', 'insuperable') |
|
1242 None |
|
1243 |
|
1244 On the other hand, :func:`search` will scan forward through the string, |
|
1245 reporting the first match it finds. :: |
|
1246 |
|
1247 >>> print re.search('super', 'superstition').span() |
|
1248 (0, 5) |
|
1249 >>> print re.search('super', 'insuperable').span() |
|
1250 (2, 7) |
|
1251 |
|
1252 Sometimes you'll be tempted to keep using :func:`re.match`, and just add ``.*`` |
|
1253 to the front of your RE. Resist this temptation and use :func:`re.search` |
|
1254 instead. The regular expression compiler does some analysis of REs in order to |
|
1255 speed up the process of looking for a match. One such analysis figures out what |
|
1256 the first character of a match must be; for example, a pattern starting with |
|
1257 ``Crow`` must match starting with a ``'C'``. The analysis lets the engine |
|
1258 quickly scan through the string looking for the starting character, only trying |
|
1259 the full match if a ``'C'`` is found. |
|
1260 |
|
1261 Adding ``.*`` defeats this optimization, requiring scanning to the end of the |
|
1262 string and then backtracking to find a match for the rest of the RE. Use |
|
1263 :func:`re.search` instead. |
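To make the difference concrete, the short sketch below contrasts a ``.*``-prefixed :func:`re.match` with a plain :func:`re.search`, reusing the ``'insuperable'`` string from the earlier examples. It is written for Python 3 and the variable names are only illustrative. Besides defeating the optimization, the ``.*`` prefix changes what the match object reports, because the match then starts at position 0.

```python
import re

text = 'insuperable'

# Prefixing '.*' makes match() succeed, but the reported span now starts at 0
# because the '.*' itself consumed the leading 'in'.
prefixed = re.match('.*super', text)
print(prefixed.span())    # (0, 7)

# search() finds the same text and reports where 'super' actually begins.
found = re.search('super', text)
print(found.span())       # (2, 7)
```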


Greedy versus Non-Greedy
------------------------

When repeating a regular expression, as in ``a*``, the resulting action is to
consume as much of the string as possible. This fact often bites you when
you're trying to match a pair of balanced delimiters, such as the angle brackets
surrounding an HTML tag. The naive pattern for matching a single HTML tag
doesn't work because of the greedy nature of ``.*``. ::

   >>> s = '<html><head><title>Title</title>'
   >>> len(s)
   32
   >>> print(re.match('<.*>', s).span())
   (0, 32)
   >>> print(re.match('<.*>', s).group())
   <html><head><title>Title</title>

The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of
the string. There's still more left in the RE, though, and the ``>`` can't
match at the end of the string, so the regular expression engine has to
backtrack character by character until it finds a match for the ``>``. The
final match extends from the ``'<'`` in ``<html>`` to the ``'>'`` in
``</title>``, which isn't what you want.

In this case, the solution is to use the non-greedy quantifiers ``*?``, ``+?``,
``??``, or ``{m,n}?``, which match as *little* text as possible. In the above
example, the ``'>'`` is tried immediately after the first ``'<'`` matches, and
when it fails, the engine advances a character at a time, retrying the ``'>'``
at every step. This produces just the right result::

   >>> print(re.match('<.*?>', s).group())
   <html>
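The same contrast shows up when collecting every match instead of only the first one. The sketch below is written for Python 3 and uses :func:`re.findall` on the same string. The negated character class at the end is not part of the discussion above; it is shown only as a commonly used alternative that gives the same result on this input.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: '.*' runs to the last '>' in the string, so there is one big match.
greedy = re.findall('<.*>', s)
print(greedy)       # ['<html><head><title>Title</title>']

# Non-greedy: '.*?' stops at the first '>' it can, giving one match per tag.
lazy = re.findall('<.*?>', s)
print(lazy)         # ['<html>', '<head>', '<title>', '</title>']

# A negated character class gets the same result here, because '[^>]*'
# cannot run past a '>' in the first place.
negated = re.findall('<[^>]*>', s)
print(negated == lazy)    # True
```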

(Note that parsing HTML or XML with regular expressions is painful.
Quick-and-dirty patterns will handle common cases, but HTML and XML have special
cases that will break the obvious regular expression; by the time you've written
a regular expression that handles all of the possible cases, the pattern will
be *very* complicated. Use an HTML or XML parser module for such tasks.)


Not Using re.VERBOSE
--------------------

By now you've probably noticed that regular expressions are a very compact
notation, but they're not terribly readable. REs of moderate complexity can
become lengthy collections of backslashes, parentheses, and metacharacters,
making them difficult to read and understand.

For such REs, specifying the ``re.VERBOSE`` flag when compiling the regular
expression can be helpful, because it allows you to format the regular
expression more clearly.

The ``re.VERBOSE`` flag has several effects. Whitespace in the regular
expression that *isn't* inside a character class is ignored. This means that an
expression such as ``dog | cat`` is equivalent to the less readable ``dog|cat``,
but ``[a b]`` will still match the characters ``'a'``, ``'b'``, or a space. In
addition, you can also put comments inside a RE; comments extend from a ``#``
character to the next newline. When used with triple-quoted strings, this
enables REs to be formatted more neatly::

   pat = re.compile(r"""
    \s*                 # Skip leading whitespace
    (?P<header>[^:]+)   # Header name
    \s* :               # Whitespace, and a colon
    (?P<value>.*?)      # The header's value -- *? used to
                        # lose the following trailing whitespace
    \s*$                # Trailing whitespace to end-of-line
   """, re.VERBOSE)

This is far more readable than::

   pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
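As a quick check that the two spellings describe the same pattern, the sketch below compiles both and applies them to one line. It is written for Python 3; the names ``verbose_pat`` and ``compact_pat`` and the sample header line are hypothetical, chosen only for illustration.

```python
import re

verbose_pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# Hypothetical header line with trailing whitespace.
line = 'Subject:Regular expressions   '

for pat in (verbose_pat, compact_pat):
    m = pat.match(line)
    # Both patterns print: Subject -> 'Regular expressions'
    print(m.group('header'), '->', repr(m.group('value')))
```

The trailing spaces are left out of the ``value`` group because the non-greedy ``.*?`` gives them up to the final ``\s*``.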


Feedback
========

Regular expressions are a complicated topic. Did this document help you
understand them? Were there parts that were unclear, or problems you
encountered that weren't covered here? If so, please send suggestions for
improvements to the author.

The most complete book on regular expressions is almost certainly Jeffrey
Friedl's *Mastering Regular Expressions*, published by O'Reilly. Unfortunately,
it exclusively concentrates on Perl and Java's flavours of regular expressions,
and doesn't contain any Python material at all, so it won't be useful as a
reference for programming in Python. (The first edition covered Python's
now-removed :mod:`regex` module, which won't help you much.) Consider checking
it out from your library.


.. rubric:: Footnotes

.. [#] Introduced in Python 2.2.2.