|
1 <html> |
|
2 <head> |
|
3 <title>pcrematching specification</title> |
|
4 </head> |
|
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> |
|
6 <h1>pcrematching man page</h1> |
|
7 <p> |
|
8 Return to the <a href="index.html">PCRE index page</a>. |
|
9 </p> |
|
10 <p> |
|
11 This page is part of the PCRE HTML documentation. It was generated automatically |
|
12 from the original man page. If there is any nonsense in it, please consult the |
|
13 man page, in case the conversion went wrong. |
|
14 <br> |
|
15 <ul> |
|
16 <li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a> |
|
17 <li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a> |
|
18 <li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a> |
|
19 <li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a> |
|
20 <li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> |
|
21 <li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> |
|
22 <li><a name="TOC7" href="#SEC7">AUTHOR</a> |
|
23 <li><a name="TOC8" href="#SEC8">REVISION</a> |
|
24 </ul> |
|
25 <br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br> |
|
26 <P> |
|
27 This document describes the two different algorithms that are available in PCRE |
|
28 for matching a compiled regular expression against a given subject string. The |
|
29 "standard" algorithm is the one provided by the <b>pcre_exec()</b> function. |
|
30 This works in the same was as Perl's matching function, and provides a |
|
31 Perl-compatible matching operation. |
|
32 </P> |
|
33 <P> |
|
34 An alternative algorithm is provided by the <b>pcre_dfa_exec()</b> function; |
|
35 this operates in a different way, and is not Perl-compatible. It has advantages |
|
36 and disadvantages compared with the standard algorithm, and these are described |
|
37 below. |
|
38 </P> |
|
39 <P> |
|
40 When there is only one possible way in which a given subject string can match a |
|
41 pattern, the two algorithms give the same answer. A difference arises, however, |
|
42 when there are multiple possibilities. For example, if the pattern |
|
43 <pre> |
|
44 ^<.*> |
|
45 </pre> |
|
46 is matched against the string |
|
47 <pre> |
|
48 <something> <something else> <something further> |
|
49 </pre> |
|
50 there are three possible answers. The standard algorithm finds only one of |
|
51 them, whereas the alternative algorithm finds all three. |
|
52 </P> |
|
53 <br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br> |
|
54 <P> |
|
55 The set of strings that are matched by a regular expression can be represented |
|
56 as a tree structure. An unlimited repetition in the pattern makes the tree of |
|
57 infinite size, but it is still a tree. Matching the pattern to a given subject |
|
58 string (from a given starting point) can be thought of as a search of the tree. |
|
59 There are two ways to search a tree: depth-first and breadth-first, and these |
|
60 correspond to the two matching algorithms provided by PCRE. |
|
61 </P> |
|
62 <br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br> |
|
63 <P> |
|
64 In the terminology of Jeffrey Friedl's book "Mastering Regular |
|
65 Expressions", the standard algorithm is an "NFA algorithm". It conducts a |
|
66 depth-first search of the pattern tree. That is, it proceeds along a single |
|
67 path through the tree, checking that the subject matches what is required. When |
|
68 there is a mismatch, the algorithm tries any alternatives at the current point, |
|
69 and if they all fail, it backs up to the previous branch point in the tree, and |
|
70 tries the next alternative branch at that level. This often involves backing up |
|
71 (moving to the left) in the subject string as well. The order in which |
|
72 repetition branches are tried is controlled by the greedy or ungreedy nature of |
|
73 the quantifier. |
|
74 </P> |
|
75 <P> |
|
76 If a leaf node is reached, a matching string has been found, and at that point |
|
77 the algorithm stops. Thus, if there is more than one possible match, this |
|
78 algorithm returns the first one that it finds. Whether this is the shortest, |
|
79 the longest, or some intermediate length depends on the way the greedy and |
|
80 ungreedy repetition quantifiers are specified in the pattern. |
|
81 </P> |
|
82 <P> |
|
83 Because it ends up with a single path through the tree, it is relatively |
|
84 straightforward for this algorithm to keep track of the substrings that are |
|
85 matched by portions of the pattern in parentheses. This provides support for |
|
86 capturing parentheses and back references. |
|
87 </P> |
|
88 <br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br> |
|
89 <P> |
|
90 This algorithm conducts a breadth-first search of the tree. Starting from the |
|
91 first matching point in the subject, it scans the subject string from left to |
|
92 right, once, character by character, and as it does this, it remembers all the |
|
93 paths through the tree that represent valid matches. In Friedl's terminology, |
|
94 this is a kind of "DFA algorithm", though it is not implemented as a |
|
95 traditional finite state machine (it keeps multiple states active |
|
96 simultaneously). |
|
97 </P> |
|
98 <P> |
|
99 The scan continues until either the end of the subject is reached, or there are |
|
100 no more unterminated paths. At this point, terminated paths represent the |
|
101 different matching possibilities (if there are none, the match has failed). |
|
102 Thus, if there is more than one possible match, this algorithm finds all of |
|
103 them, and in particular, it finds the longest. In PCRE, there is an option to |
|
104 stop the algorithm after the first match (which is necessarily the shortest) |
|
105 has been found. |
|
106 </P> |
|
107 <P> |
|
108 Note that all the matches that are found start at the same point in the |
|
109 subject. If the pattern |
|
110 <pre> |
|
111 cat(er(pillar)?) |
|
112 </pre> |
|
113 is matched against the string "the caterpillar catchment", the result will be |
|
114 the three strings "cat", "cater", and "caterpillar" that start at the fourth |
|
115 character of the subject. The algorithm does not automatically move on to find |
|
116 matches that start at later positions. |
|
117 </P> |
|
118 <P> |
|
119 There are a number of features of PCRE regular expressions that are not |
|
120 supported by the alternative matching algorithm. They are as follows: |
|
121 </P> |
|
122 <P> |
|
123 1. Because the algorithm finds all possible matches, the greedy or ungreedy |
|
124 nature of repetition quantifiers is not relevant. Greedy and ungreedy |
|
125 quantifiers are treated in exactly the same way. However, possessive |
|
126 quantifiers can make a difference when what follows could also match what is |
|
127 quantified, for example in a pattern like this: |
|
128 <pre> |
|
129 ^a++\w! |
|
130 </pre> |
|
131 This pattern matches "aaab!" but not "aaa!", which would be matched by a |
|
132 non-possessive quantifier. Similarly, if an atomic group is present, it is |
|
133 matched as if it were a standalone pattern at the current point, and the |
|
134 longest match is then "locked in" for the rest of the overall pattern. |
|
135 </P> |
|
136 <P> |
|
137 2. When dealing with multiple paths through the tree simultaneously, it is not |
|
138 straightforward to keep track of captured substrings for the different matching |
|
139 possibilities, and PCRE's implementation of this algorithm does not attempt to |
|
140 do this. This means that no captured substrings are available. |
|
141 </P> |
|
142 <P> |
|
143 3. Because no substrings are captured, back references within the pattern are |
|
144 not supported, and cause errors if encountered. |
|
145 </P> |
|
146 <P> |
|
147 4. For the same reason, conditional expressions that use a backreference as the |
|
148 condition or test for a specific group recursion are not supported. |
|
149 </P> |
|
150 <P> |
|
151 5. Because many paths through the tree may be active, the \K escape sequence, |
|
152 which resets the start of the match when encountered (but may be on some paths |
|
153 and not on others), is not supported. It causes an error if encountered. |
|
154 </P> |
|
155 <P> |
|
156 6. Callouts are supported, but the value of the <i>capture_top</i> field is |
|
157 always 1, and the value of the <i>capture_last</i> field is always -1. |
|
158 </P> |
|
159 <P> |
|
160 7. The \C escape sequence, which (in the standard algorithm) matches a single |
|
161 byte, even in UTF-8 mode, is not supported because the alternative algorithm |
|
162 moves through the subject string one character at a time, for all active paths |
|
163 through the tree. |
|
164 </P> |
|
165 <P> |
|
166 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not |
|
167 supported. (*FAIL) is supported, and behaves like a failing negative assertion. |
|
168 </P> |
|
169 <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> |
|
170 <P> |
|
171 Using the alternative matching algorithm provides the following advantages: |
|
172 </P> |
|
173 <P> |
|
174 1. All possible matches (at a single point in the subject) are automatically |
|
175 found, and in particular, the longest match is found. To find more than one |
|
176 match using the standard algorithm, you have to do kludgy things with |
|
177 callouts. |
|
178 </P> |
|
179 <P> |
|
180 2. There is much better support for partial matching. The restrictions on the |
|
181 content of the pattern that apply when using the standard algorithm for partial |
|
182 matching do not apply to the alternative algorithm. For non-anchored patterns, |
|
183 the starting position of a partial match is available. |
|
184 </P> |
|
185 <P> |
|
186 3. Because the alternative algorithm scans the subject string just once, and |
|
187 never needs to backtrack, it is possible to pass very long subject strings to |
|
188 the matching function in several pieces, checking for partial matching each |
|
189 time. |
|
190 </P> |
|
191 <br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> |
|
192 <P> |
|
193 The alternative algorithm suffers from a number of disadvantages: |
|
194 </P> |
|
195 <P> |
|
196 1. It is substantially slower than the standard algorithm. This is partly |
|
197 because it has to search for all possible matches, but is also because it is |
|
198 less susceptible to optimization. |
|
199 </P> |
|
200 <P> |
|
201 2. Capturing parentheses and back references are not supported. |
|
202 </P> |
|
203 <P> |
|
204 3. Although atomic groups are supported, their use does not provide the |
|
205 performance advantage that it does for the standard algorithm. |
|
206 </P> |
|
207 <br><a name="SEC7" href="#TOC1">AUTHOR</a><br> |
|
208 <P> |
|
209 Philip Hazel |
|
210 <br> |
|
211 University Computing Service |
|
212 <br> |
|
213 Cambridge CB2 3QH, England. |
|
214 <br> |
|
215 </P> |
|
216 <br><a name="SEC8" href="#TOC1">REVISION</a><br> |
|
217 <P> |
|
218 Last updated: 19 April 2008 |
|
219 <br> |
|
220 Copyright © 1997-2008 University of Cambridge. |
|
221 <br> |
|
222 <p> |
|
223 Return to the <a href="index.html">PCRE index page</a>. |
|
224 </p> |