.TH PCREPARTIAL 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE"
.rs
.sp
In normal use of PCRE, if the subject string that is passed to
\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
.P
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form \fIddmmmyy\fP, defined by this pattern:
.sp
  ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
.sp
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, possibly beeping and not reflecting the
character that has been typed. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered.
.P
PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
option, which can be set when calling \fBpcre_exec()\fP or
\fBpcre_dfa_exec()\fP. When this flag is set for \fBpcre_exec()\fP, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
during the matching process the last part of the subject string matched part of
the pattern. Unfortunately, for non-anchored matching, it is not possible to
obtain the position of the start of the partial match. No captured data is set
when PCRE_ERROR_PARTIAL is returned.
.P
When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that provided the
partial match is set as the first matching string.
.P
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
last literal byte in a pattern, and abandons matching immediately if such a
byte is not present in the subject string. This optimization cannot be used
for a subject string that might match only partially.
.
.
.SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"
.rs
.sp
Because of the way certain internal optimizations are implemented in the
\fBpcre_exec()\fP function, the PCRE_PARTIAL option cannot be used with all
patterns. These restrictions do not apply when \fBpcre_dfa_exec()\fP is used.
For \fBpcre_exec()\fP, repeated single characters such as
.sp
  a{2,4}
.sp
and repeated single metasequences such as
.sp
  \ed+
.sp
are not permitted if the maximum number of occurrences is greater than one.
Optional items such as \ed? (where the maximum is one) are permitted.
Quantifiers with any values are permitted after parentheses, so the invalid
examples above can be coded thus:
.sp
  (a){2,4}
  (\ed)+
.sp
These constructions run more slowly, but for the kinds of application that are
envisaged for this facility, this is not felt to be a major restriction.
.P
If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
\fBpcre_exec()\fP returns the error code PCRE_ERROR_BADPARTIAL (-13).
You can call \fBpcre_fullinfo()\fP with the PCRE_INFO_OKPARTIAL option to find
out whether a compiled pattern can be used for partial matching.
.
.
.SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"
.rs
.sp
If the escape sequence \eP is present in a \fBpcretest\fP data line, the
PCRE_PARTIAL flag is used for the match. Here is a run of \fBpcretest\fP that
uses the date example quoted above:
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 25jun04\eP
   0: 25jun04
   1: jun
  data> 25dec3\eP
  Partial match
  data> 3ju\eP
  Partial match
  data> 3juj\eP
  No match
  data> j\eP
  No match
.sp
The first data string is matched completely, so \fBpcretest\fP shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. A similar test, using
\fBpcre_dfa_exec()\fP matching (by means of the \eD escape sequence), produces
the following output:
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 25jun04\eP\eD
   0: 25jun04
  data> 23dec3\eP\eD
  Partial match: 23dec3
  data> 3ju\eP\eD
  Partial match: 3ju
  data> 3juj\eP\eD
  No match
  data> j\eP\eD
  No match
.sp
Notice that in this case the portion of the string that was matched is made
available.
.
.
.SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
.rs
.sp
When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
to continue the match by providing additional subject data and calling
\fBpcre_dfa_exec()\fP again with the same compiled regular expression, this
time setting the PCRE_DFA_RESTART option. You must also pass the same working
space as before, because this is where details of the previous partial match
are stored. Here is a \fBpcretest\fP example that uses the \eR escape
sequence to set the PCRE_DFA_RESTART option (\eP and \eD are as above):
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 23ja\eP\eD
  Partial match: 23ja
  data> n05\eR\eD
   0: n05
.sp
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
.P
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
over multiple segments. This facility can be used to pass very long subject
strings to \fBpcre_dfa_exec()\fP. However, some care is needed for certain
types of pattern.
.P
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
.P
2. If the pattern contains backward assertions (including \eb or \eB), you need
to arrange for some overlap in the subject strings to allow for this. For
example, you could pass the subject in chunks that are 500 bytes long, but in
a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
bytes at the start of the buffer.
.P
3. Matching a subject string that is split into multiple segments does not
always produce exactly the same result as matching over one single long string.
The difference arises when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches in a
call to \fBpcre_dfa_exec()\fP. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
Consider this \fBpcretest\fP example:
.sp
    re> /dog(sbody)?/
  data> do\eP\eD
  Partial match: do
  data> gsb\eR\eP\eD
   0: g
  data> dogsbody\eD
   0: dogsbody
   1: dog
.sp
The pattern matches the words "dog" or "dogsbody". When the subject is
presented in several parts ("do" and "gsb" being the first two) the match stops
when "dog" has been found, and it is not possible to continue. On the other
hand, if "dogsbody" is presented as a single string, both matches are found.
.P
Because of this phenomenon, it does not usually make sense to end a pattern
that is going to be matched in this way with a variable repeat.
.P
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected. For example,
consider this pattern:
.sp
  1234|3789
.sp
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "789" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
.sp
  1234|ABCD
.sp
where no string can be a partial match for both alternatives.
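.P
The overlap arrangement described in point 2 above can be sketched in plain C.
Everything in this block (struct seg_buf and the seg_buf_* functions) is
hypothetical and not part of PCRE; it only shows one way of keeping the last
200 bytes of earlier data in front of each new 500-byte chunk. The three
accessor results are what a caller would pass as the subject, length, and
starting offset arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SEG_OVERLAP = 200, SEG_CHUNK = 500 };

/* Hypothetical buffer: earlier context sits immediately before the current
   chunk, and the chunk always starts at data[SEG_OVERLAP]. A zero-initialized
   struct is a valid empty buffer. */
struct seg_buf {
    char   data[SEG_OVERLAP + SEG_CHUNK];
    size_t ctx;   /* bytes of earlier context kept before the chunk */
    size_t len;   /* bytes in the current chunk */
};

/* Keep up to SEG_OVERLAP trailing bytes of what is held, then append the
   new chunk after them. */
void seg_buf_load(struct seg_buf *sb, const char *chunk, size_t n)
{
    assert(n <= (size_t)SEG_CHUNK);
    size_t held = sb->ctx + sb->len;
    size_t keep = held < (size_t)SEG_OVERLAP ? held : (size_t)SEG_OVERLAP;
    const char *held_end = sb->data + SEG_OVERLAP + sb->len;

    /* Source and destination may overlap, so memmove() is required. */
    memmove(sb->data + SEG_OVERLAP - keep, held_end - keep, keep);
    memcpy(sb->data + SEG_OVERLAP, chunk, n);
    sb->ctx = keep;
    sb->len = n;
}

/* Subject pointer: start of the retained context. */
const char *seg_buf_subject(const struct seg_buf *sb)
{
    return sb->data + SEG_OVERLAP - sb->ctx;
}

/* Subject length: retained context plus current chunk. */
int seg_buf_length(const struct seg_buf *sb)
{
    return (int)(sb->ctx + sb->len);
}

/* Starting offset: matching resumes where the new chunk begins. */
int seg_buf_start_offset(const struct seg_buf *sb)
{
    return (int)sb->ctx;
}
```

Once at least 200 bytes have been seen, the starting offset settles at 200 and
the subject length at 700 for full chunks, as in the description in point 2.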
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 04 June 2007
Copyright (c) 1997-2007 University of Cambridge.
.fi