<html>
<head>
<title>pcrecompat specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecompat man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<br><b>
DIFFERENCES BETWEEN PCRE AND PERL
</b><br>
<P>
This document describes the differences in the ways that PCRE and Perl handle
regular expressions. The differences described here are mainly with respect to
Perl 5.8, though PCRE versions 7.0 and later contain some features that are
expected to be in the forthcoming Perl 5.10.
</P>
<P>
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details of what
it does have are given in the
<a href="pcre.html#utf8support">section on UTF-8 support</a>
in the main
<a href="pcre.html"><b>pcre</b></a>
page.
</P>
<P>
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits
them, but they do not mean what you might think. For example, (?!a){3} does
not assert that the next three characters are not "a". It just asserts, three
times over, that the next character is not "a".
</P>
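<P>
The distinction in item 2 can be modelled in plain C without any regex library.
The following is only a sketch: the function names are hypothetical and are not
part of PCRE or Perl, and the caller is assumed to pass a position no greater
than the length of the subject. It contrasts a zero-width check applied three
times at one position with a check of three consecutive characters.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the zero-width assertion (?!a): it succeeds when the
   character at pos is not 'a' (including at the end of the string) and it
   consumes nothing. */
static int next_is_not_a(const char *s, size_t pos)
{
    assert(s != NULL);
    return s[pos] != 'a';
}

/* Model of (?!a){3} as Perl interprets it: the same assertion is applied
   three times at the same position, because it never advances. */
static int repeated_lookahead(const char *s, size_t pos)
{
    int i;
    for (i = 0; i < 3; i++) {
        if (!next_is_not_a(s, pos))
            return 0;
    }
    return 1;
}

/* What a reader might wrongly expect: none of the next three characters
   (or fewer, if the subject ends first) is 'a'. */
static int next_three_are_not_a(const char *s, size_t pos)
{
    size_t len, i;
    assert(s != NULL);
    len = strlen(s);
    for (i = pos; i < pos + 3 && i < len; i++) {
        if (s[i] == 'a')
            return 0;
    }
    return 1;
}
```

<P>
For the subject "xab" at position 0, repeated_lookahead() succeeds because the
single character it inspects is "x", whereas next_three_are_not_a() fails on
the "a" in second place.
</P>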
<P>
3. Capturing subpatterns that occur inside negative lookahead assertions are
counted, but their entries in the offsets vector are never set. Perl sets its
numerical variables from any such patterns that are matched before the
assertion fails to match something (thereby succeeding), but only if the
negative lookahead assertion contains just one branch.
</P>
<P>
4. Though binary zero characters are supported in the subject string, they are
not allowed in a pattern string because it is passed as a normal C string,
terminated by zero. The escape sequence \0 can be used in the pattern to
represent a binary zero.
</P>
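<P>
The restriction in item 4 follows from ordinary C string handling and can be
seen without calling PCRE at all. In this minimal sketch,
visible_pattern_length() is a hypothetical helper standing in for any code that
receives the pattern as a zero-terminated string; the two pattern arrays are
illustrative values only.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the number of pattern bytes visible to any function
   that receives the pattern as a zero-terminated C string. */
static size_t visible_pattern_length(const char *pattern)
{
    assert(pattern != NULL);
    return strlen(pattern);
}

/* A literal binary zero embedded in the pattern. The array holds the bytes
   'a' 'b' 0 'c' 'd' 0, but as a C string it ends at the first zero byte. */
static const char raw_zero_pattern[] = "ab\0cd";

/* The same pattern written with the two-character escape sequence, a
   backslash followed by '0', which survives as ordinary pattern text. */
static const char escaped_zero_pattern[] = "ab\\0cd";
```

<P>
The first array occupies six bytes, yet only "ab" is visible as a string; the
second is seen in full, leaving the escape sequence to be interpreted when the
pattern is compiled.
</P>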
<P>
5. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N. In fact these are implemented by Perl's general string-handling
and are not part of its pattern matching engine. If any of these are
encountered by PCRE, an error is generated.
</P>
<P>
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE is
built with Unicode character property support. The properties that can be
tested with \p and \P are limited to the general category properties such as
Lu and Nd, script names such as Greek or Han, and the derived properties Any
and L&amp;.
</P>
<P>
7. PCRE does support the \Q...\E escape for quoting substrings. Characters in
between are treated as literals. This is slightly different from Perl in that $
and @ are also handled as literals inside the quotes. In Perl, they cause
variable interpolation (but of course PCRE does not have variables). Note the
following examples:
<pre>
  Pattern            PCRE matches      Perl matches

  \Qabc$xyz\E        abc$xyz           abc followed by the contents of $xyz
  \Qabc\$xyz\E       abc\$xyz          abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
</P>
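<P>
The PCRE column of the table in item 7 can be reproduced with a small textual
model. This is a sketch under stated assumptions, not PCRE's implementation:
expand_quoted() is a hypothetical function, it assumes ASCII patterns, and it
copies a stray \E outside quotes through unchanged. It rewrites a pattern so
that each quoted run becomes backslash-escaped literal text, relying on the
rule that a backslash before a non-alphanumeric character makes that character
literal.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of \Q...\E: every character between \Q and \E is taken
   literally, including $ and @. Writes an equivalent pattern without
   \Q...\E into out. Returns 0 on success, -1 if out is too small. */
static int expand_quoted(const char *pattern, char *out, size_t outsize)
{
    size_t n = 0;
    int quoting = 0;
    const char *p;

    assert(pattern != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    for (p = pattern; *p != '\0'; p++) {
        if (!quoting && p[0] == '\\' && p[1] == 'Q') {
            quoting = 1;              /* start of a quoted run */
            p++;
            continue;
        }
        if (quoting && p[0] == '\\' && p[1] == 'E') {
            quoting = 0;              /* end of the quoted run */
            p++;
            continue;
        }
        if (!quoting && p[0] == '\\' && p[1] != '\0') {
            /* An ordinary escape outside quotes: copy both characters. */
            if (n + 2 >= outsize)
                return -1;
            out[n++] = *p++;
            out[n++] = *p;
            continue;
        }
        if (quoting && !isalnum((unsigned char)*p)) {
            if (n + 1 >= outsize)
                return -1;
            out[n++] = '\\';          /* make the quoted character literal */
        }
        if (n + 1 >= outsize)
            return -1;
        out[n++] = *p;
    }
    out[n] = '\0';
    return 0;
}
```

<P>
Under this model \Qabc$xyz\E and \Qabc\E\$\Qxyz\E both become abc\$xyz, which
matches the text abc$xyz, while \Qabc\$xyz\E becomes abc\\\$xyz, which matches
the text abc\$xyz.
</P>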
<P>
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This is not
available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE "callout"
feature allows an external function to be called during pattern matching. See
the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation for details.
</P>
<P>
9. Subpatterns that are called recursively or as "subroutines" are always
treated as atomic groups in PCRE. This is like Python, but unlike Perl.
</P>
<P>
10. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
</P>
<P>
11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), (*FAIL), (*F),
(*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in the forms without an
argument. PCRE does not support (*MARK). If (*ACCEPT) is within capturing
parentheses, PCRE does not set that capture group; this is different to Perl.
</P>
<P>
12. PCRE provides some extensions to the Perl regular expression facilities.
Perl 5.10 will include new features that are not in earlier versions, some of
which (such as named parentheses) have been in PCRE for some time. This list is
with respect to Perl 5.10:
<br>
<br>
(a) Although lookbehind assertions must match fixed length strings, each
alternative branch of a lookbehind assertion can match a different length of
string. Perl requires them all to have the same length.
<br>
<br>
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
<br>
<br>
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no special
meaning is faulted. Otherwise, like Perl, the backslash is quietly ignored.
(Perl can be made to issue a warning.)
<br>
<br>
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
<br>
<br>
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
<br>
<br>
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for
<b>pcre_exec()</b>, and the PCRE_NO_AUTO_CAPTURE option for
<b>pcre_compile()</b>, have no Perl equivalents.
<br>
<br>
(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE_BSR_ANYCRLF option.
<br>
<br>
(h) The callout facility is PCRE-specific.
<br>
<br>
(i) The partial matching facility is PCRE-specific.
<br>
<br>
(j) Patterns compiled by PCRE can be saved and re-used at a later time, even on
different hosts that have the other endianness.
<br>
<br>
(k) The alternative matching function (<b>pcre_dfa_exec()</b>) matches in a
different way and is not Perl-compatible.
<br>
<br>
(l) PCRE recognizes some special sequences such as (*CR) at the start of
a pattern that set overall options that cannot be changed within the pattern.
</P>
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 11 September 2007
<br>
Copyright © 1997-2007 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>