|
1 <html> |
|
2 <head> |
|
3 <title>pcreposix specification</title> |
|
4 </head> |
|
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> |
|
6 <h1>pcreposix man page</h1> |
|
7 <p> |
|
8 Return to the <a href="index.html">PCRE index page</a>. |
|
9 </p> |
|
10 <p> |
|
11 This page is part of the PCRE HTML documentation. It was generated automatically |
|
12 from the original man page. If there is any nonsense in it, please consult the |
|
13 man page, in case the conversion went wrong. |
|
14 <br> |
|
15 <ul> |
|
16 <li><a name="TOC1" href="#SEC1">SYNOPSIS OF POSIX API</a> |
|
17 <li><a name="TOC2" href="#SEC2">DESCRIPTION</a> |
|
18 <li><a name="TOC3" href="#SEC3">COMPILING A PATTERN</a> |
|
19 <li><a name="TOC4" href="#SEC4">MATCHING NEWLINE CHARACTERS</a> |
|
20 <li><a name="TOC5" href="#SEC5">MATCHING A PATTERN</a> |
|
21 <li><a name="TOC6" href="#SEC6">ERROR MESSAGES</a> |
|
22 <li><a name="TOC7" href="#SEC7">MEMORY USAGE</a> |
|
23 <li><a name="TOC8" href="#SEC8">AUTHOR</a> |
|
24 <li><a name="TOC9" href="#SEC9">REVISION</a> |
|
25 </ul> |
|
26 <br><a name="SEC1" href="#TOC1">SYNOPSIS OF POSIX API</a><br> |
|
27 <P> |
|
28 <b>#include <pcreposix.h></b> |
|
29 </P> |
|
30 <P> |
|
31 <b>int regcomp(regex_t *<i>preg</i>, const char *<i>pattern</i>,</b> |
|
32 <b>int <i>cflags</i>);</b> |
|
33 </P> |
|
34 <P> |
|
35 <b>int regexec(regex_t *<i>preg</i>, const char *<i>string</i>,</b> |
|
36 <b>size_t <i>nmatch</i>, regmatch_t <i>pmatch</i>[], int <i>eflags</i>);</b> |
|
37 </P> |
|
38 <P> |
|
39 <b>size_t regerror(int <i>errcode</i>, const regex_t *<i>preg</i>,</b> |
|
40 <b>char *<i>errbuf</i>, size_t <i>errbuf_size</i>);</b> |
|
41 </P> |
|
42 <P> |
|
43 <b>void regfree(regex_t *<i>preg</i>);</b> |
|
44 </P> |
|
45 <br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br> |
|
46 <P> |
|
47 This set of functions provides a POSIX-style API to the PCRE regular expression |
|
48 package. See the |
|
49 <a href="pcreapi.html"><b>pcreapi</b></a> |
|
50 documentation for a description of PCRE's native API, which contains much |
|
51 additional functionality. |
|
52 </P> |
|
53 <P> |
|
54 The functions described here are just wrapper functions that ultimately call |
|
55 the PCRE native API. Their prototypes are defined in the <b>pcreposix.h</b> |
|
56 header file, and on Unix systems the library itself is called |
|
57 <b>libpcreposix.a</b>, so it can be accessed by adding <b>-lpcreposix</b> to the |
|
58 command for linking an application that uses them. Because the POSIX functions |
|
59 call the native ones, it is also necessary to add <b>-lpcre</b>. |
|
60 </P> |
|
61 <P> |
|
62 I have implemented only those option bits that can be reasonably mapped to PCRE |
|
63 native options. In addition, the option REG_EXTENDED is defined with the value |
|
64 zero. This has no effect, but since programs that are written to the POSIX |
|
65 interface often use it, this makes it easier to slot in PCRE as a replacement |
|
66 library. Other POSIX options are not even defined. |
|
67 </P> |
|
68 <P> |
|
69 When PCRE is called via these functions, it is only the API that is POSIX-like |
|
70 in style. The syntax and semantics of the regular expressions themselves are |
|
71 still those of Perl, subject to the setting of various PCRE options, as |
|
72 described below. "POSIX-like in style" means that the API approximates to the |
|
73 POSIX definition; it is not fully POSIX-compatible, and in multi-byte encoding |
|
74 domains it is probably even less compatible. |
|
75 </P> |
|
76 <P> |
|
77 The header for these functions is supplied as <b>pcreposix.h</b> to avoid any |
|
78 potential clash with other POSIX libraries. It can, of course, be renamed or |
|
79 aliased as <b>regex.h</b>, which is the "correct" name. It provides two |
|
80 structure types, <i>regex_t</i> for compiled internal forms, and |
|
81 <i>regmatch_t</i> for returning captured substrings. It also defines some |
|
82 constants whose names start with "REG_"; these are used for setting options and |
|
83 identifying error codes. |
|
84 </P> |
|
|
87 <br><a name="SEC3" href="#TOC1">COMPILING A PATTERN</a><br> |
|
88 <P> |
|
89 The function <b>regcomp()</b> is called to compile a pattern into an |
|
90 internal form. The pattern is a C string terminated by a binary zero, and |
|
91 is passed in the argument <i>pattern</i>. The <i>preg</i> argument is a pointer |
|
92 to a <b>regex_t</b> structure that is used as a base for storing information |
|
93 about the compiled regular expression. |
|
94 </P> |
|
95 <P> |
|
96 The argument <i>cflags</i> is either zero, or contains one or more of the bits |
|
97 defined by the following macros: |
|
98 <pre> |
|
99 REG_DOTALL |
|
100 </pre> |
|
101 The PCRE_DOTALL option is set when the regular expression is passed for |
|
102 compilation to the native function. Note that REG_DOTALL is not part of the |
|
103 POSIX standard. |
|
104 <pre> |
|
105 REG_ICASE |
|
106 </pre> |
|
107 The PCRE_CASELESS option is set when the regular expression is passed for |
|
108 compilation to the native function. |
|
109 <pre> |
|
110 REG_NEWLINE |
|
111 </pre> |
|
112 The PCRE_MULTILINE option is set when the regular expression is passed for |
|
113 compilation to the native function. Note that this does <i>not</i> mimic the |
|
114 defined POSIX behaviour for REG_NEWLINE (see the following section). |
|
115 <pre> |
|
116 REG_NOSUB |
|
117 </pre> |
|
118 The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is passed |
|
119 for compilation to the native function. In addition, when a pattern that is |
|
120 compiled with this flag is passed to <b>regexec()</b> for matching, the |
|
121 <i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings |
|
122 are returned. |
|
123 <pre> |
|
124 REG_UTF8 |
|
125 </pre> |
|
126 The PCRE_UTF8 option is set when the regular expression is passed for |
|
127 compilation to the native function. This causes the pattern itself and all data |
|
128 strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF8 |
|
129 is not part of the POSIX standard. |
|
130 </P> |
|
131 <P> |
|
132 In the absence of these flags, no options are passed to the native function. |
|
133 This means that the regex is compiled with PCRE default semantics. In |
|
134 particular, the way it handles newline characters in the subject string is the |
|
135 Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only |
|
136 <i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way |
|
137 newlines are matched by . (they aren't) or by a negative class such as [^a] |
|
138 (they are). |
|
139 </P> |
|
140 <P> |
|
141 The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The |
|
142 <i>preg</i> structure is filled in on success, and one member of the structure |
|
143 is public: <i>re_nsub</i> contains the number of capturing subpatterns in |
|
144 the regular expression. Various error codes are defined in the header file. |
|
145 </P> |
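<P>
The compile step described above can be sketched as follows. This is a minimal illustration, not part of PCRE: the helper name <b>count_subpatterns()</b> and the build switch USE_PCREPOSIX are hypothetical. When USE_PCREPOSIX is not defined, the sketch falls back to the system <b>regex.h</b> so that it builds without PCRE installed. REG_EXTENDED is passed because the wrapper defines it as zero, while a system POSIX library needs it for this pattern syntax.
</P>

```c
#include <stddef.h>

/* USE_PCREPOSIX is a hypothetical build switch, not something PCRE defines. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX regex: same function names */
#endif

/* Compile `pattern` and return its number of capturing subpatterns,
   or -1 if regcomp() reports an error. */
long count_subpatterns(const char *pattern, int cflags)
{
    regex_t preg;
    long nsub;

    if (regcomp(&preg, pattern, cflags) != 0)
        return -1;               /* non-zero yield: compilation failed */

    nsub = (long)preg.re_nsub;   /* the one public member of regex_t */
    regfree(&preg);              /* release the memory tied to preg */
    return nsub;
}
```

<P>
For example, <b>count_subpatterns("(a)(b(c))", REG_EXTENDED)</b> should yield 3, and an unbalanced pattern such as "(abc" should yield -1.
</P>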
|
146 <br><a name="SEC4" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br> |
|
147 <P> |
|
148 This area is not simple, because POSIX and Perl take different views of things. |
|
149 It is not possible to get PCRE to obey POSIX semantics, but then PCRE was never |
|
150 intended to be a POSIX engine. The following table lists the different |
|
151 possibilities for matching newline characters in PCRE: |
|
152 <pre> |
|
153                           Default   Change with |
|
154 |
|
155   . matches newline          no     PCRE_DOTALL |
|
156   newline matches [^a]       yes    not changeable |
|
157   $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY |
|
158   $ matches \n in middle     no     PCRE_MULTILINE |
|
159   ^ matches \n in middle     no     PCRE_MULTILINE |
|
160 </pre> |
|
161 This is the equivalent table for POSIX: |
|
162 <pre> |
|
163                           Default   Change with |
|
164 |
|
165   . matches newline          yes    REG_NEWLINE |
|
166   newline matches [^a]       yes    REG_NEWLINE |
|
167   $ matches \n at end        no     REG_NEWLINE |
|
168   $ matches \n in middle     no     REG_NEWLINE |
|
169   ^ matches \n in middle     no     REG_NEWLINE |
|
170 </pre> |
|
171 PCRE's behaviour is the same as Perl's, except that there is no equivalent for |
|
172 PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is no way to stop |
|
173 newline from matching [^a]. |
|
174 </P> |
|
175 <P> |
|
176 The default POSIX newline handling can be obtained by setting PCRE_DOTALL and |
|
177 PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE behave exactly as for the |
|
178 REG_NEWLINE action. |
|
179 </P> |
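<P>
The rows on which the two tables agree can be probed with a small helper. This is a sketch only: the helper name <b>matches()</b> and the build switch USE_PCREPOSIX are hypothetical, and without that switch the code uses the system <b>regex.h</b> instead of the PCRE wrapper. The row for "." with default settings is where the tables differ (no for PCRE, yes for POSIX), so the result of <b>matches("a.b", "a\nb", REG_EXTENDED)</b> depends on which library is linked.
</P>

```c
#include <stddef.h>

/* USE_PCREPOSIX is a hypothetical build switch, not something PCRE defines. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX regex: same function names */
#endif

/* Return 1 if `pattern` matches somewhere in `subject`, 0 if it does not,
   and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *subject, int cflags)
{
    regex_t preg;
    int rc;

    /* REG_NOSUB: only the yes/no answer is wanted, no captured strings. */
    if (regcomp(&preg, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&preg, subject, 0, NULL, 0);
    regfree(&preg);
    return rc == 0;
}
```

<P>
With either library, <b>matches("^b", "a\nb", REG_EXTENDED)</b> should be 0 and the same call with REG_NEWLINE added should be 1, while <b>matches("a[^x]b", "a\nb", REG_EXTENDED)</b> should be 1 because a negative class matches a newline by default.
</P>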
|
180 <br><a name="SEC5" href="#TOC1">MATCHING A PATTERN</a><br> |
|
181 <P> |
|
182 The function <b>regexec()</b> is called to match a compiled pattern <i>preg</i> |
|
183 against a given <i>string</i>, which is by default terminated by a zero byte |
|
184 (but see REG_STARTEND below), subject to the options in <i>eflags</i>. These can |
|
185 be: |
|
186 <pre> |
|
187 REG_NOTBOL |
|
188 </pre> |
|
189 The PCRE_NOTBOL option is set when calling the underlying PCRE matching |
|
190 function. |
|
191 <pre> |
|
192 REG_NOTEOL |
|
193 </pre> |
|
194 The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
|
195 function. |
|
196 <pre> |
|
197 REG_STARTEND |
|
198 </pre> |
|
199 The string is considered to start at <i>string</i> + <i>pmatch[0].rm_so</i> and |
|
200 to have a terminating NUL located at <i>string</i> + <i>pmatch[0].rm_eo</i> |
|
201 (there need not actually be a NUL at that location), regardless of the value of |
|
202 <i>nmatch</i>. This is a BSD extension, compatible with but not specified by |
|
203 IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software |
|
204 intended to be portable to other systems. Note that a non-zero <i>rm_so</i> does |
|
205 not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not |
|
206 how it is matched. |
|
207 </P> |
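<P>
A possible use of REG_STARTEND is to match against a slice of a larger buffer without copying it. The sketch below assumes the header in use defines REG_STARTEND (it is a BSD extension); the helper name <b>matches_in_range()</b> and the build switch USE_PCREPOSIX are hypothetical. Only the match/no-match result is examined here, not the returned offsets.
</P>

```c
/* USE_PCREPOSIX is a hypothetical build switch, not something PCRE defines. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX regex: same function names */
#endif

/* Return 1 if `pattern` matches within subject[start..end), 0 if not,
   and -1 if the pattern fails to compile. */
int matches_in_range(const char *pattern, const char *subject,
                     int start, int end)
{
    regex_t preg;
    regmatch_t pmatch[1];
    int rc;

    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    /* With REG_STARTEND, pmatch[0] is an input: it delimits the subject. */
    pmatch[0].rm_so = start;
    pmatch[0].rm_eo = end;   /* no NUL is required at this position */

    rc = regexec(&preg, subject, 1, pmatch, REG_STARTEND);
    regfree(&preg);
    return rc == 0;
}
```

<P>
For the subject "concatenate", the pattern "cat" should match in the range 3 to 6 but not in the range 0 to 5, which covers only "conca".
</P>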
|
208 <P> |
|
209 If the pattern was compiled with the REG_NOSUB flag, no data about any matched |
|
210 strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of |
|
211 <b>regexec()</b> are ignored. |
|
212 </P> |
|
213 <P> |
|
214 Otherwise, the portion of the string that was matched, and also any captured |
|
215 substrings, are returned via the <i>pmatch</i> argument, which points to an |
|
216 array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the |
|
217 members <i>rm_so</i> and <i>rm_eo</i>. These contain the offset to the first |
|
218 character of each substring and the offset to the first character after the end |
|
219 of each substring, respectively. The 0th element of the vector relates to the |
|
220 entire portion of <i>string</i> that was matched; subsequent elements relate to |
|
221 the capturing subpatterns of the regular expression. Unused entries in the |
|
222 array have both structure members set to -1. |
|
223 </P> |
|
224 <P> |
|
225 A successful match yields a zero return; various error codes are defined in the |
|
226 header file, of which REG_NOMATCH is the "expected" failure code. |
|
227 </P> |
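<P>
The use of <i>pmatch</i> described above can be illustrated with a helper that copies one captured substring out of the subject. This is a sketch under assumptions: the helper name <b>capture()</b>, the limit MAX_GROUPS, and the build switch USE_PCREPOSIX are all hypothetical, and without that switch the code uses the system <b>regex.h</b>.
</P>

```c
#include <stddef.h>
#include <string.h>

/* USE_PCREPOSIX is a hypothetical build switch, not something PCRE defines. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX regex: same function names */
#endif

#define MAX_GROUPS 10    /* arbitrary limit chosen for this sketch */

/* Copy capture `group` (0 = whole match) of the first match into `out`.
   Returns the number of characters copied, or -1 if there is no match,
   the group did not take part in the match, or an argument is unusable. */
int capture(const char *pattern, const char *subject, int group,
            char *out, size_t out_size)
{
    regex_t preg;
    regmatch_t pmatch[MAX_GROUPS];
    int len = -1;

    if (group < 0 || group >= MAX_GROUPS || out_size == 0)
        return -1;
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Unused entries have rm_so and rm_eo set to -1. */
    if (regexec(&preg, subject, MAX_GROUPS, pmatch, 0) == 0 &&
        pmatch[group].rm_so != -1) {
        len = (int)(pmatch[group].rm_eo - pmatch[group].rm_so);
        if ((size_t)len >= out_size)
            len = (int)out_size - 1;          /* truncate to fit */
        memcpy(out, subject + pmatch[group].rm_so, (size_t)len);
        out[len] = '\0';
    }

    regfree(&preg);
    return len;
}
```

<P>
With the pattern "([0-9]+)-([0-9]+)" and the subject "id 12-345 x", group 0 should give "12-345", group 1 "12", and group 2 "345".
</P>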
|
228 <br><a name="SEC6" href="#TOC1">ERROR MESSAGES</a><br> |
|
229 <P> |
|
230 The <b>regerror()</b> function maps a non-zero error code from either |
|
231 <b>regcomp()</b> or <b>regexec()</b> to a printable message. If <i>preg</i> is not |
|
232 NULL, the error should have arisen from the use of that structure. A message |
|
233 terminated by a binary zero is placed in <i>errbuf</i>. The length of the |
|
234 message, including the zero, is limited to <i>errbuf_size</i>. The yield of the |
|
235 function is the size of buffer needed to hold the whole message. |
|
236 </P> |
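<P>
Because <b>regerror()</b> yields the buffer size needed for the whole message, one way to use it is to call it twice: once with a zero <i>errbuf_size</i> to learn the size, then again with a buffer of that size. The sketch below takes that approach; the helper name <b>compile_error_message()</b> and the build switch USE_PCREPOSIX are hypothetical, and the exact message text depends on the library in use.
</P>

```c
#include <stdlib.h>

/* USE_PCREPOSIX is a hypothetical build switch, not something PCRE defines. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX regex: same function names */
#endif

/* Try to compile `pattern`. On success return NULL. On failure return a
   malloc'd message describing the error (the caller frees it); NULL is
   also returned if the allocation itself fails. */
char *compile_error_message(const char *pattern, int cflags)
{
    regex_t preg;
    size_t needed;
    char *msg;
    int rc = regcomp(&preg, pattern, cflags);

    if (rc == 0) {
        regfree(&preg);
        return NULL;
    }

    /* First call: errbuf_size is 0, so only the required size is reported. */
    needed = regerror(rc, &preg, NULL, 0);

    msg = malloc(needed);
    if (msg != NULL)
        regerror(rc, &preg, msg, needed);   /* second call fills the buffer */
    return msg;
}
```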
|
237 <br><a name="SEC7" href="#TOC1">MEMORY USAGE</a><br> |
|
238 <P> |
|
239 Compiling a regular expression causes memory to be allocated and associated |
|
240 with the <i>preg</i> structure. The function <b>regfree()</b> frees all such |
|
241 memory, after which <i>preg</i> may no longer be used as a compiled expression. |
|
242 </P> |
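<P>
A common way to keep this memory under control is to compile once, match many times, and call <b>regfree()</b> when done. The following sketch assumes that pattern of use; the helper name <b>count_matching()</b> and the build switch USE_PCREPOSIX are hypothetical. REG_NOSUB is set because only a yes/no answer is needed for each subject.
</P>

```c
#include <stddef.h>

/* USE_PCREPOSIX is a hypothetical build switch, not something PCRE defines. */
#ifdef USE_PCREPOSIX
#include <pcreposix.h>   /* link with -lpcreposix -lpcre */
#else
#include <regex.h>       /* system POSIX regex: same function names */
#endif

/* Count how many of the `n` subjects match `pattern`, ignoring case.
   Returns -1 if the pattern fails to compile. */
int count_matching(const char *pattern, const char *const subjects[], size_t n)
{
    regex_t preg;
    size_t i;
    int count = 0;

    /* One compilation serves every regexec() call below. */
    if (regcomp(&preg, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    for (i = 0; i < n; i++)
        if (regexec(&preg, subjects[i], 0, NULL, 0) == 0)
            count++;

    regfree(&preg);   /* preg must not be used for matching after this */
    return count;
}
```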
|
243 <br><a name="SEC8" href="#TOC1">AUTHOR</a><br> |
|
244 <P> |
|
245 Philip Hazel |
|
246 <br> |
|
247 University Computing Service |
|
248 <br> |
|
249 Cambridge CB2 3QH, England. |
|
250 <br> |
|
251 </P> |
|
252 <br><a name="SEC9" href="#TOC1">REVISION</a><br> |
|
253 <P> |
|
254 Last updated: 05 April 2008 |
|
255 <br> |
|
256 Copyright © 1997-2008 University of Cambridge. |
|
257 <br> |
|
258 <p> |
|
259 Return to the <a href="index.html">PCRE index page</a>. |
|
260 </p> |