<html>
<head>
<title>pcrecpp specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecpp man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
<li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
<li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
<li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
<li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
<li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
<li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
<li><a name="TOC11" href="#SEC11">AUTHOR</a>
<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br>
<P>
<b>#include &lt;pcrecpp.h&gt;</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the <i>pcrecpp.h</i> file, which should be consulted for
further details.
</P>
<br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br>
<P>
The "FullMatch" operation checks that the supplied text matches a supplied
pattern exactly. If pointer arguments are supplied, it copies the sub-strings
matched by the corresponding sub-patterns into them.
<pre>
  Example: successful match
    pcrecpp::RE re("h.*o");
    re.FullMatch("hello");

  Example: unsuccessful match (requires full match):
    pcrecpp::RE re("e");
    !re.FullMatch("hello");

  Example: creating a temporary RE object:
    pcrecpp::RE("h.*o").FullMatch("hello");
</pre>
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. As the examples above show, you can either store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
</P>
<P>
You must supply extra pointer arguments to extract matched subpieces.
<pre>
  Example: extracts "ruby" into "s" and 1234 into "i"
    int i;
    string s;
    pcrecpp::RE re("(\\w+):(\\d+)");
    re.FullMatch("ruby:1234", &s, &i);

  Example: does not try to extract any extra sub-patterns
    re.FullMatch("ruby:1234", &s);

  Example: does not try to extract into NULL
    re.FullMatch("ruby:1234", NULL, &i);

  Example: integer overflow causes failure
    !re.FullMatch("ruby:1234567891234", NULL, &i);

  Example: fails because there aren't enough sub-patterns:
    !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

  Example: fails because string cannot be stored in integer
    !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
</pre>
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
<pre>
  string        (matched piece is copied to string)
  StringPiece   (StringPiece is mutated to point to matched piece)
  T             (where "bool T::ParseFrom(const char*, int)" exists)
  NULL          (the corresponding matched sub-pattern is not copied)
</pre>
The function returns true iff all of the following conditions are satisfied:
<pre>
  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is >= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
</pre>
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
<pre>
  int number;
  pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
</pre>
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
<b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature of
<b>DoMatch</b>.
</P>
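<P>
The overflow and empty-string failures described above come down to how a
captured piece of text is converted to a number. The following is a minimal
sketch of such a conversion using only the standard library.
<i>parse_int</i> is a hypothetical helper written for this illustration. It is
not part of pcrecpp, whose own conversion code may differ in detail.
</P>

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of pcrecpp): convert a captured piece of
// text to int, rejecting empty input, non-numeric text, and overflow.
bool parse_int(const std::string& text, int* out) {
  if (text.empty()) return false;     // e.g. an optional group that did not match
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (*end != '\0') return false;     // e.g. "ruby": not all of the text is a number
  if (errno == ERANGE) return false;  // does not even fit in a long
  if (value < INT_MIN || value > INT_MAX) return false;  // does not fit in an int
  if (out != nullptr) *out = static_cast<int>(value);    // NULL means "do not store"
  return true;
}
```

<P>
Under these assumptions, "1234" converts successfully, while
"1234567891234", "ruby", and the empty string are all rejected, which mirrors
the failing examples above.
</P>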
<br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br>
<P>
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
<pre>
  Example:
    string quoted = RE::QuoteMeta(unquoted);
</pre>
Note that it's legal to escape a character even if it has no special meaning in
a regular expression -- so this function does that. (This also makes it
identical to the Perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
</P>
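<P>
The quoting rule can be sketched in a few lines. The function below is a
hypothetical stand-in, not the code from pcrecpp. It assumes the rule is
"put a backslash before every ASCII character that is not a letter, a digit,
or an underscore", which is consistent with the "1.5-2.0?" example above, and
it assumes that bytes with the high bit set are copied unchanged.
</P>

```cpp
#include <string>

// Hypothetical stand-in for the quoting rule described above; this is an
// illustration and not pcrecpp's implementation.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string result;
  for (unsigned char c : unquoted) {
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    // Assumption: bytes >= 128 are left alone so that multi-byte
    // characters are not broken apart by inserted backslashes.
    if (!is_word && c < 128) result += '\\';
    result += static_cast<char>(c);
  }
  return result;
}
```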
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br>
<P>
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
<pre>
  Example: simple search for a string:
    pcrecpp::RE("ell").PartialMatch("hello");

  Example: find first number in a string:
    int number;
    pcrecpp::RE re("(\\d+)");
    re.PartialMatch("x*100 + 20", &number);
    assert(number == 100);
</pre>
</P>
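<P>
For readers who know the C++ standard regex library, the difference between
"FullMatch" and "PartialMatch" is the same as the difference between
std::regex_match and std::regex_search. The sketch below uses std::regex as a
stand-in for pcrecpp (the pattern syntaxes are not identical, but they agree
for these simple patterns). The function names are hypothetical.
</P>

```cpp
#include <regex>
#include <string>

// Whole-text match: the analogue of FullMatch.
bool full_match(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}

// Substring match: the analogue of PartialMatch.
bool partial_match(const std::string& text, const std::string& pattern) {
  return std::regex_search(text, std::regex(pattern));
}

// Find the first run of digits in `text` and store it in `*number`.
// Note: std::stoi throws on overflow; a real caller may want to catch that.
bool first_number(const std::string& text, int* number) {
  static const std::regex digits("(\\d+)");
  std::smatch m;
  if (!std::regex_search(text, m, digits)) return false;
  *number = std::stoi(m[1].str());
  return true;
}
```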
<br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br>
<P>
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match all the bytes of a multi-byte character.
<pre>
  Example:
    pcrecpp::RE_Options options;
    options.set_utf8(true);
    pcrecpp::RE re(utf8_pattern, options);
    re.FullMatch(utf8_string);

  Example: using the convenience function UTF8():
    pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
    re.FullMatch(utf8_string);
</pre>
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
<b>--enable-utf8</b> flag.
</P>
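<P>
To make the byte/character distinction concrete, the sketch below shows how
the first byte of a UTF-8 sequence determines how many bytes one character
occupies. This is standard UTF-8 arithmetic, not pcrecpp code, and both
function names are hypothetical.
</P>

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by `lead`. Returns 0 for
// bytes that fit none of the four lead-byte patterns (for example
// continuation bytes). Overlong and out-of-range encodings are not checked.
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Count characters rather than bytes in UTF-8 encoded text.
std::size_t utf8_char_count(const std::string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size();) {
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(text[i]));
    if (n == 0) n = 1;  // step over a stray byte instead of looping forever
    i += n;
    ++count;
  }
  return count;
}
```

<P>
A string such as "h\xC3\xA9llo" is six bytes long but five characters long,
which is why a pattern like "h.llo" can only match it in full when the
text is treated as UTF-8.
</P>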
<br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br>
<P>
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
<pre>
   modifier              description                 Perl equivalent

   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore whitespace           /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
</pre>
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself, e.g. (?:ab|cd) does not
capture, while (ab|cd) does.
</P>
<P>
For a full account of how each modifier works, please check the
PCRE API reference page.
</P>
<P>
For each modifier, there are two member functions whose names are formed from
the modifier name in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
<pre>
  bool caseless()
</pre>
which returns true if the modifier is set, and
<pre>
  RE_Options & set_caseless(bool)
</pre>
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member
functions. Setting <i>match_limit</i> to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables
match limiting. Alternatively, you can call <b>set_match_limit_recursion()</b>,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. The match limit caps the number of internal matching steps PCRE
performs; the recursion limit caps the depth of internal recursion, and
therefore the amount of stack that is used.
</P>
<P>
Normally, to pass one or more modifiers to a RE class, you declare
a <i>RE_Options</i> object, set the appropriate options, and pass this
object to a RE constructor. Example:
<pre>
  RE_Options opt;
  opt.set_caseless(true);
  if (RE("HELLO", opt).PartialMatch("hello world")) ...
</pre>
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are off by default. The other constructor takes an
<i>option_flags</i> parameter, which is there to ease the transfer of legacy
code from C programs. This lets you do
<pre>
  RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
</pre>
However, new code is better off doing
<pre>
  RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
</pre>
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options object with the
appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>,
<b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>.
</P>
<P>
If you need to set several options at once, and you don't want to go through
the pains of declaring a RE_Options object and setting several options, there
is a parallel method that gives you such ability on the fly. You can chain
several <b>set_xxxxx()</b> member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
<pre>
  RE(" ^ xyz \\s+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);
</pre>
</P>
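<P>
Chaining works because each setter returns a reference to the object it was
called on. The miniature class below illustrates the idea. It is a
hypothetical sketch: the class name, the flag names, and the bit values are
invented for this example and are not taken from pcrecpp.h or pcre.h.
</P>

```cpp
// Hypothetical miniature of an options class. Names and bit values are
// illustrative only and are not those used by pcrecpp or PCRE.
class OptionsSketch {
 public:
  enum Flag { kCaseless = 1, kMultiline = 2, kExtended = 4 };

  OptionsSketch() : flags_(0) {}                       // all flags off
  explicit OptionsSketch(int flags) : flags_(flags) {} // C-style bit mask

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

  int all_options() const { return flags_; }

 private:
  OptionsSketch& set(Flag f, bool on) {
    if (on) {
      flags_ |= f;
    } else {
      flags_ &= ~f;
    }
    return *this;  // returning *this is what makes chaining possible
  }

  int flags_;
};
```

<P>
With this sketch, OptionsSketch().set_caseless(true).set_multiline(true)
ends up with the same flags as
OptionsSketch(OptionsSketch::kCaseless | OptionsSketch::kMultiline), which is
the equivalence between the two styles shown above.
</P>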
<br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br>
<P>
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
<pre>
  Example: read lines of the form "var = value" from a string.
    string contents = ...;                 // Fill string somehow
    pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

    string var;
    int value;
    pcrecpp::RE re("(\\w+) = (\\d+)\n");
    while (re.Consume(&input, &var, &value)) {
      ...;
    }
</pre>
Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.
</P>
<P>
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
<pre>
  pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
</pre>
</P>
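<P>
The same scanning loops can be sketched with the C++ standard regex library,
which may help if pcrecpp is not at hand. Here std::regex is a stand-in for
pcrecpp: an iterator into the string plays the role of the StringPiece, and
the match_continuous flag provides the anchoring that "Consume" performs. The
function names are hypothetical.
</P>

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Consume-style loop: read "var = value" lines from the front of `contents`
// and stop at the first position where the pattern no longer matches.
std::vector<std::pair<std::string, int>> read_assignments(
    const std::string& contents) {
  static const std::regex line("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int>> result;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  // match_continuous requires the match to start exactly at `pos`.
  while (std::regex_search(pos, contents.end(), m, line,
                           std::regex_constants::match_continuous)) {
    result.emplace_back(m[1].str(), std::stoi(m[2].str()));
    pos = m[0].second;  // advance past the matched text
  }
  return result;
}

// FindAndConsume-style loop: the match may start anywhere at or after `pos`.
std::vector<std::string> all_words(const std::string& text) {
  static const std::regex word("(\\w+)");
  std::vector<std::string> words;
  std::string::const_iterator pos = text.begin();
  std::smatch m;
  while (std::regex_search(pos, text.end(), m, word)) {
    words.push_back(m[1].str());
    pos = m[0].second;
  }
  return words;
}
```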
<br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br>
<P>
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
<pre>
  Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
</pre>
will leave 64 in a, b, c, and d.
</P>
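<P>
The three radix modes correspond closely to the base argument of the C
library function strtol: base 8, base 16, and base 0, where base 0 means
"honour a leading 0 or 0x, otherwise assume decimal". The sketch below checks
the arithmetic of the example above with strtol alone. <i>parse_radix</i> is
a hypothetical helper, and whether pcrecpp itself is implemented this way is
not something this page states.
</P>

```cpp
#include <cstdlib>

// Hypothetical helper: parse all of `text` in the given base. Base 0 asks
// strtol to honour C-style "0" and "0x" prefixes and otherwise use base 10.
bool parse_radix(const char* text, int base, long* out) {
  char* end = nullptr;
  const long value = std::strtol(text, &end, base);
  if (end == text || *end != '\0') return false;  // nothing parsed, or junk left
  *out = value;
  return true;
}
```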
<br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br>
<P>
You can use the "Replace" operation to replace the first match of "pattern"
in "str" with "rewrite". Within "rewrite", backslash-escaped digits (\1 to \9)
can be used to insert the text matching the corresponding parenthesized group
from the pattern. \0 in "rewrite" refers to the entire matching
text. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
</pre>
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
</P>
<P>
<b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
</pre>
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
</P>
<P>
<b>Extract</b> is like <b>Replace</b>, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. It returns true iff a match
occurred and the extraction happened successfully; if no match occurs,
"out" is left unaffected.
</P>
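<P>
The three operations can be mirrored with the C++ standard regex library,
used here as a stand-in for pcrecpp. Two differences are worth noting: the
standard library writes group references in the rewrite as $1 rather than \1,
and these sketches return a new string instead of modifying one in place. The
function names are hypothetical.
</P>

```cpp
#include <regex>
#include <string>

// Replace-style: rewrite only the first match.
std::string replace_first(const std::string& s, const std::string& pattern,
                          const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// GlobalReplace-style: rewrite every non-overlapping match.
std::string replace_all(const std::string& s, const std::string& pattern,
                        const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}

// Extract-style: on a match, store only the expanded rewrite in `*out`;
// on failure, leave `*out` untouched.
bool extract(const std::string& text, const std::string& pattern,
             const std::string& rewrite, std::string* out) {
  const std::regex re(pattern);
  std::smatch m;
  if (!std::regex_search(text, m, re)) return false;
  *out = m.format(rewrite);
  return true;
}
```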
<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
<P>
The C++ wrapper was contributed by Google Inc.
<br>
Copyright © 2007 Google Inc.
<br>
</P>
<br><a name="SEC12" href="#TOC1">REVISION</a><br>
<P>
Last updated: 12 November 2007
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>