ChangeLog for PCRE
------------------

Version 7.8 05-Sep-08
---------------------

1.  Replaced UCP searching code with optimized version as implemented for Ad
    Muncher (http://www.admuncher.com/) by Peter Kankowski. This uses a
    two-stage table and inline lookup instead of a function, giving speed ups
    of 2 to 5 times on some simple patterns that I tested. Permission was
    given to distribute the MultiStage2.py script that generates the tables
    (it's not in the tarball, but is in the Subversion repository).
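
To illustrate the two-stage table lookup described in item 1, here is a
minimal sketch in C. The block size, the table contents, and the names stage1,
stage2, and lookup_property are assumptions made up for this example; they are
not PCRE's real Unicode tables.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy data: 16 code points per block, 4 blocks (code points 0..63). */
#define BLOCK_SIZE 16

enum { PROP_OTHER = 0, PROP_DIGIT = 1 };

/* Stage 2: the distinct blocks of per-character properties, stored once each. */
static const uint8_t stage2[][BLOCK_SIZE] = {
  { 0 },                                      /* block 0: everything is "other" */
  { 1,1,1,1,1,1,1,1,1,1, 0,0,0,0,0,0 }        /* block 1: ten digits, then "other" */
};

/* Stage 1: for each run of BLOCK_SIZE code points, which stage2 block to use.
   Code points 48..63 ('0'..'9' and a few punctuation marks) use block 1. */
static const uint8_t stage1[] = { 0, 0, 0, 1 };

#define TABLE_LIMIT (sizeof(stage1) * BLOCK_SIZE)

/* Inline lookup: two array indexings, no search and no out-of-line call. */
static inline int lookup_property(unsigned int c)
{
  assert(c < TABLE_LIMIT);
  return stage2[stage1[c / BLOCK_SIZE]][c % BLOCK_SIZE];
}
```

Because identical blocks are shared through stage1, large uniform ranges cost
only one stage2 block, while each lookup stays a constant-time pair of array
accesses.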

2.  Updated the Unicode datatables to Unicode 5.1.0. This adds yet more
    scripts.

3.  Change 12 for 7.7 introduced a bug in pcre_study() when a pattern contained
    a group with a zero quantifier. The result of the study could be incorrect,
    or the function might crash, depending on the pattern.

4.  Caseless matching was not working for non-ASCII characters in back
    references. For example, /(\x{de})\1/8i was not matching \x{de}\x{fe}.
    It now works when Unicode Property Support is available.

5.  In pcretest, an escape such as \x{de} in the data was always generating
    a UTF-8 string, even in non-UTF-8 mode. Now it generates a single byte in
    non-UTF-8 mode. If the value is greater than 255, it gives a warning about
    truncation.

6.  Minor bugfix in pcrecpp.cc (change "" == ... to NULL == ...).

7.  Added two (int) casts to pcregrep when printing the difference of two
    pointers, in case they are 64-bit values.

8.  Added comments about Mac OS X stack usage to the pcrestack man page and to
    test 2 if it fails.

9.  Added PCRE_CALL_CONVENTION just before the names of all exported functions,
    and a #define of that name to empty if it is not externally set. This is to
    allow users of MSVC to set it if necessary.

10. The PCRE_EXP_DEFN macro which precedes exported functions was missing from
    the convenience functions in the pcre_get.c source file.

11. An option change at the start of a pattern that had top-level alternatives
    could cause overwriting and/or a crash. This command provoked a crash in
    some environments:

      printf "/(?i)[\xc3\xa9\xc3\xbd]|[\xc3\xa9\xc3\xbdA]/8\n" | pcretest

    This potential security problem was recorded as CVE-2008-2371.

12. For a pattern where the match had to start at the beginning or immediately
    after a newline (e.g. /.*anything/ without the DOTALL flag), pcre_exec() and
    pcre_dfa_exec() could read past the end of the passed subject if there was
    no match. To help with detecting such bugs (e.g. with valgrind), I modified
    pcretest so that it places the subject at the end of its malloc-ed buffer.

13. The change to pcretest in 12 above threw up a couple more cases when
    pcre_exec() might read past the end of the data buffer in UTF-8 mode.

14. A similar bug to 7.3/2 existed when the PCRE_FIRSTLINE option was set and
    the data contained the byte 0x85 as part of a UTF-8 character within its
    first line. This applied both to normal and DFA matching.

15. Lazy quantifiers were not working in some cases in UTF-8 mode. For example,
    /^[^d]*?$/8 failed to match "abc".

16. Added a missing copyright notice to pcrecpp_internal.h.

17. Make it more clear in the documentation that values returned from
    pcre_exec() in ovector are byte offsets, not character counts.
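
The distinction in item 17 matters in UTF-8 mode, where one character may
occupy several bytes. The following sketch converts a byte offset into a
character count. The helper utf8_chars_before is hypothetical and not part of
the PCRE API; it assumes the subject is valid UTF-8 and that the offset falls
on a character boundary, as offsets from a successful match would.

```c
#include <assert.h>
#include <stddef.h>

/* Count the UTF-8 characters contained in the first byte_offset bytes of
   subject. Hypothetical helper for illustration only. */
static size_t utf8_chars_before(const char *subject, size_t byte_offset)
{
  size_t chars = 0;
  size_t i;
  assert(subject != NULL);
  for (i = 0; i < byte_offset; i++) {
    /* Continuation bytes look like 10xxxxxx; any other byte starts a character. */
    if (((unsigned char)subject[i] & 0xC0) != 0x80) chars++;
  }
  return chars;
}
```

For the three-byte subject "\xc3\xa9x" (e-acute followed by "x"), a byte
offset of 2 corresponds to a character count of 1.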

18. Tidied a few places to stop certain compilers from issuing warnings.

19. Updated the Virtual Pascal + BCC files to compile the latest v7.7, as
    supplied by Stefan Weber. I made a further small update for 7.8 because
    there is a change of source arrangements: the pcre_searchfuncs.c module is
    replaced by pcre_ucd.c.


Version 7.7 07-May-08
---------------------

1.  Applied Craig's patch to sort out a long long problem: "If we can't convert
    a string to a long long, pretend we don't even have a long long." This is
    done by checking for the strtoq, strtoll, and _strtoi64 functions.

2.  Applied Craig's patch to pcrecpp.cc to restore ABI compatibility with
    pre-7.6 versions, which defined a global no_arg variable instead of putting
    it in the RE class. (See also #8 below.)

3.  Remove a line of dead code, identified by Coverity and reported by Nuno
    Lopes.

4.  Fixed two related pcregrep bugs involving -r with --include or --exclude:

    (1) The include/exclude patterns were being applied to the whole pathnames
        of files, instead of just to the final components.

    (2) If there was more than one level of directory, the subdirectories were
        skipped unless they satisfied the include/exclude conditions. This is
        inconsistent with GNU grep (and could even be seen as contrary to the
        pcregrep specification - which I improved to make it absolutely clear).
        The action now is always to scan all levels of directory, and just
        apply the include/exclude patterns to regular files.
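
The fix in 4(1) amounts to testing the include/exclude patterns against the
last path component only. A minimal sketch of extracting that component
follows. The name final_component is hypothetical and is not taken from
pcregrep.c, and '/' is assumed to be the only directory separator.

```c
#include <assert.h>
#include <string.h>

/* Return a pointer to the final component of a pathname, that is, the text
   after the last '/'. Hypothetical helper for illustration only. */
static const char *final_component(const char *path)
{
  const char *slash;
  assert(path != NULL);
  slash = strrchr(path, '/');
  return (slash == NULL) ? path : slash + 1;
}
```

With this, a pattern such as one intended to select "pcre.c" is compared with
"pcre.c" rather than with "src/lib/pcre.c".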

5.  Added the --include_dir and --exclude_dir patterns to pcregrep, and used
    --exclude_dir in the tests to avoid scanning .svn directories.

6.  Applied Craig's patch to the QuoteMeta function so that it escapes the
    NUL character as backslash + 0 rather than backslash + NUL, because PCRE
    doesn't support NULs in patterns.

7.  Added some missing "const"s to declarations of static tables in
    pcre_compile.c and pcre_dfa_exec.c.

8.  Applied Craig's patch to pcrecpp.cc to fix a problem in OS X that was
    caused by fix #2 above. (Subsequently also a second patch to fix the
    first patch. And a third patch - this was a messy problem.)

9.  Applied Craig's patch to remove the use of push_back().

10. Applied Alan Lehotsky's patch to add REG_STARTEND support to the POSIX
    matching function regexec().

11. Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n',
    which, however, unlike Perl's \g{...}, are subroutine calls, not back
    references. PCRE supports relative numbers with this syntax (I don't think
    Oniguruma does).

12. Previously, a group with a zero repeat such as (...){0} was completely
    omitted from the compiled regex. However, this means that if the group
    was called as a subroutine from elsewhere in the pattern, things went wrong
    (an internal error was given). Such groups are now left in the compiled
    pattern, with a new opcode that causes them to be skipped at execution
    time.

13. Added the PCRE_JAVASCRIPT_COMPAT option. This makes the following changes
    to the way PCRE behaves:

    (a) A lone ] character is disallowed (Perl treats it as data).

    (b) A back reference to an unmatched subpattern matches an empty string
        (Perl fails the current match path).

    (c) A data ] in a character class must be notated as \] because if the
        first data character in a class is ], it defines an empty class. (In
        Perl it is not possible to have an empty class.) The empty class []
        never matches; it forces failure and is equivalent to (*FAIL) or (?!).
        The negative empty class [^] matches any one character, independently
        of the DOTALL setting.

14. A pattern such as /(?2)[]a()b](abc)/ which had a forward reference to a
    non-existent subpattern following a character class starting with ']' and
    containing () gave an internal compiling error instead of "reference to
    non-existent subpattern". Fortunately, when the subpattern did exist, the
    compiled code was correct. (When scanning forwards to check for the
    existence of the subpattern, it was treating the data ']' as terminating
    the class, so got the count wrong. When actually compiling, the reference
    was subsequently set up correctly.)

15. The "always fail" assertion (?!) is optimized to (*FAIL) by pcre_compile;
    it was being rejected as not supported by pcre_dfa_exec(), even though
    other assertions are supported. I have made pcre_dfa_exec() support
    (*FAIL).

16. The implementation of 13c above involved the invention of a new opcode,
    OP_ALLANY, which is like OP_ANY but doesn't check the /s flag. Since /s
    cannot be changed at match time, I realized I could make a small
    improvement to matching performance by compiling OP_ALLANY instead of
    OP_ANY for "." when DOTALL was set, and then removing the runtime tests
    on the OP_ANY path.

17. Compiling pcretest on Windows with readline support failed without the
    following two fixes: (1) Make the unistd.h include conditional on
    HAVE_UNISTD_H; (2) #define isatty and fileno as _isatty and _fileno.

18. Changed CMakeLists.txt and cmake/FindReadline.cmake to arrange for the
    ncurses library to be included for pcretest when ReadLine support is
    requested, but also to allow for it to be overridden. This patch came from
    Daniel Bergström.

19. There was a typo in the file ucpinternal.h where f0_rangeflag was defined
    as 0x00f00000 instead of 0x00800000. Luckily, this would not have caused
    any errors with the current Unicode tables. Thanks to Peter Kankowski for
    spotting this.


Version 7.6 28-Jan-08
---------------------

1.  A character class containing a very large number of characters with
    codepoints greater than 255 (in UTF-8 mode, of course) caused a buffer
    overflow.

2.  Patch to cut out the "long long" test in pcrecpp_unittest when
    HAVE_LONG_LONG is not defined.

3.  Applied Christian Ehrlicher's patch to update the CMake build files to
    bring them up to date and include new features. This patch includes:

    - Fixed PH's badly added libz and libbz2 support.
    - Fixed a problem with static linking.
    - Added pcredemo. [But later removed - see 7 below.]
    - Fixed dftables problem and added an option.
    - Added a number of HAVE_XXX tests, including HAVE_WINDOWS_H and
      HAVE_LONG_LONG.
    - Added readline support for pcretest.
    - Added a listing of the option settings after cmake has run.

4.  A user submitted a patch to Makefile that makes it easy to create
    "pcre.dll" under mingw when using Configure/Make. I added stuff to
    Makefile.am that causes it to include this special target, without
    affecting anything else. Note that the same mingw target plus all
    the other distribution libraries and programs are now supported
    when configuring with CMake (see 6 below) instead of with
    Configure/Make.

5.  Applied Craig's patch that moves no_arg into the RE class in the C++ code.
    This is an attempt to solve the reported problem "pcrecpp::no_arg is not
    exported in the Windows port". It has not yet been confirmed that the patch
    solves the problem, but it does no harm.

6.  Applied Sheri's patch to CMakeLists.txt to add NON_STANDARD_LIB_PREFIX and
    NON_STANDARD_LIB_SUFFIX for dll names built with mingw when configured
    with CMake, and also correct the comment about stack recursion.

7.  Remove the automatic building of pcredemo from the ./configure system and
    from CMakeLists.txt. The whole idea of pcredemo.c is that it is an example
    of a program that users should build themselves after PCRE is installed, so
    building it automatically is not really right. What is more, it gave
    trouble in some build environments.

8.  Further tidies to CMakeLists.txt from Sheri and Christian.


Version 7.5 10-Jan-08
---------------------

1.  Applied a patch from Craig: "This patch makes it possible to 'ignore'
    values in parens when parsing an RE using the C++ wrapper."

2.  Negative specials like \S did not work in character classes in UTF-8 mode.
    Characters greater than 255 were excluded from the class instead of being
    included.

3.  The same bug as (2) above applied to negated POSIX classes such as
    [:^space:].

4.  PCRECPP_STATIC was referenced in pcrecpp_internal.h, but nowhere was it
    defined or documented. It seems to have been a typo for PCRE_STATIC, so
    I have changed it.

5.  The construct (?&) was not diagnosed as a syntax error (it referenced the
    first named subpattern) and a construct such as (?&a) would reference the
    first named subpattern whose name started with "a" (in other words, the
    length check was missing). Both these problems are fixed. "Subpattern name
    expected" is now given for (?&) (a zero-length name), and this patch also
    makes it give the same error for \k'' (previously it complained that that
    was a reference to a non-existent subpattern).
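
The missing length check in item 5 is a classic prefix-comparison mistake. The
sketch below shows the kind of comparison that avoids it. The function
name_matches and its signature are assumptions for illustration; this is not
the code in pcre_compile.c.

```c
#include <assert.h>
#include <string.h>

/* Compare a reference name, given as pointer plus length because it is not
   NUL-terminated inside the pattern, with a NUL-terminated table entry.
   Returns 1 on an exact match, 0 otherwise. Hypothetical helper. */
static int name_matches(const char *ref, size_t ref_len, const char *entry)
{
  assert(ref != NULL && entry != NULL);
  /* Comparing only ref_len bytes would let "a" match "abc" and would let a
     zero-length name match every entry, so both lengths are checked too. */
  return ref_len > 0 &&
         strlen(entry) == ref_len &&
         strncmp(ref, entry, ref_len) == 0;
}
```

A caller seeing ref_len equal to zero can report "Subpattern name expected"
instead of searching the table at all.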

6.  The erroneous patterns (?+-a) and (?-+a) give different error messages;
    this is right because (?- can be followed by option settings as well as by
    digits. I have, however, made the messages clearer.

7.  Patterns such as (?(1)a|b) (a pattern that contains fewer subpatterns
    than the number used in the conditional) now cause a compile-time error.
    This is actually not compatible with Perl, which accepts such patterns, but
    treats the conditional as always being FALSE (as PCRE used to), but it
    seems to me that giving a diagnostic is better.

8.  Change "alphameric" to the more common word "alphanumeric" in comments
    and messages.

9.  Fix two occurrences of "backslash" in comments that should have been
    "backspace".

10. Remove two redundant lines of code that can never be obeyed (their function
    was moved elsewhere).

11. The program that makes PCRE's Unicode character property table had a bug
    which caused it to generate incorrect table entries for sequences of
    characters that have the same character type, but are in different scripts.
    It amalgamated them into a single range, with the script of the first of
    them. In other words, some characters were in the wrong script. There were
    thirteen such cases, affecting characters in the following ranges:

      U+002b0 - U+002c1
      U+0060c - U+0060d
      U+0061e - U+00612
      U+0064b - U+0065e
      U+0074d - U+0076d
      U+01800 - U+01805
      U+01d00 - U+01d77
      U+01d9b - U+01dbf
      U+0200b - U+0200f
      U+030fc - U+030fe
      U+03260 - U+0327f
      U+0fb46 - U+0fbb1
      U+10450 - U+1049d

12. The -o option (show only the matching part of a line) for pcregrep was not
    compatible with GNU grep in that, if there was more than one match in a
    line, it showed only the first of them. It now behaves in the same way as
    GNU grep.

13. If the -o and -v options were combined for pcregrep, it printed a blank
    line for every non-matching line. GNU grep prints nothing, and pcregrep now
    does the same. The return code can be used to tell if there were any
    non-matching lines.

14. Added --file-offsets and --line-offsets to pcregrep.

15. The pattern (?=something)(?R) was not being diagnosed as a potentially
    infinitely looping recursion. The bug was that positive lookaheads were not
    being skipped when checking for a possible empty match (negative lookaheads
    and both kinds of lookbehind were skipped).

16. Fixed two typos in the Windows-only code in pcregrep.c, and moved the
    inclusion of <windows.h> to before rather than after the definition of
    INVALID_FILE_ATTRIBUTES (patch from David Byron).

17. Specifying a possessive quantifier with a specific limit for a Unicode
    character property caused pcre_compile() to compile bad code, which led at
    runtime to PCRE_ERROR_INTERNAL (-14). Examples of patterns that caused this
    are: /\p{Zl}{2,3}+/8 and /\p{Cc}{2}+/8. It was the possessive "+" that
    caused the error; without that there was no problem.

18. Added --enable-pcregrep-libz and --enable-pcregrep-libbz2.

19. Added --enable-pcretest-libreadline.

20. In pcrecpp.cc, the variable 'count' was incremented twice in
    RE::GlobalReplace(). As a result, the number of replacements returned was
    double what it should be. I removed one of the increments, but Craig sent a
    later patch that removed the other one (the right fix) and added unit tests
    that check the return values (which was not done before).

21. Several CMake things:

    (1) Arranged that, when cmake is used on Unix, the libraries end up with
        the names libpcre and libpcreposix, not just pcre and pcreposix.

    (2) The above change means that pcretest and pcregrep are now correctly
        linked with the newly-built libraries, not previously installed ones.

    (3) Added PCRE_SUPPORT_LIBREADLINE, PCRE_SUPPORT_LIBZ, PCRE_SUPPORT_LIBBZ2.

22. In UTF-8 mode, with newline set to "any", a pattern such as .*a.*=.b.*
    crashed when matching a string such as a\x{2029}b (note that \x{2029} is a
    UTF-8 newline character). The key issue is that the pattern starts .*;
    this means that the match must be either at the beginning, or after a
    newline. The bug was in the code for advancing after a failed match and
    checking that the new position followed a newline. It was not taking
    account of UTF-8 characters correctly.

23. PCRE was behaving differently from Perl in the way it recognized POSIX
    character classes. PCRE was not treating the sequence [:...:] as a
    character class unless the ... were all letters. Perl, however, seems to
    allow any characters between [: and :], though of course it rejects as
    unknown any "names" that contain non-letters, because all the known class
    names consist only of letters. Thus, Perl gives an error for [[:1234:]],
    for example, whereas PCRE did not - it did not recognize a POSIX character
    class. This seemed a bit dangerous, so the code has been changed to be
    closer to Perl. The behaviour is not identical to Perl, because PCRE will
    diagnose an unknown class for, for example, [[:l\ower:]] where Perl will
    treat it as [[:lower:]]. However, PCRE does now give "unknown" errors where
    Perl does, and where it didn't before.

24. Rewrite so as to remove the single use of %n from pcregrep because in some
    Windows environments %n is disabled by default.


Version 7.4 21-Sep-07
---------------------

1.  Change 7.3/28 was implemented for classes by looking at the bitmap. This
    means that a class such as [\s] counted as "explicit reference to CR or
    LF". That isn't really right - the whole point of the change was to try to
    help when there was an actual mention of one of the two characters. So now
    the change happens only if \r or \n (or a literal CR or LF) character is
    encountered.

2.  The 32-bit options word was also used for 6 internal flags, but the numbers
    of both had grown to the point where there were only 3 bits left.
    Fortunately, there was spare space in the data structure, and so I have
    moved the internal flags into a new 16-bit field to free up more option
    bits.

3.  The appearance of (?J) at the start of a pattern set the DUPNAMES option,
    but did not set the internal JCHANGED flag - either of these is enough to
    control the way the "get" function works - but the PCRE_INFO_JCHANGED
    facility is supposed to tell if (?J) was ever used, so now (?J) at the
    start sets both bits.

4.  Added options (at build time, compile time, exec time) to change \R from
    matching any Unicode line ending sequence to just matching CR, LF, or CRLF.

5.  doc/pcresyntax.html was missing from the distribution.

6.  Put back the definition of PCRE_ERROR_NULLWSLIMIT, for backward
    compatibility, even though it is no longer used.

7.  Added macro for snprintf to pcrecpp_unittest.cc and also for strtoll and
    strtoull to pcrecpp.cc to select the available functions in WIN32 when the
    windows.h file is present (where different names are used). [This was
    reversed later after testing - see 16 below.]

8.  Changed all #include <config.h> to #include "config.h". There were also
    some further <pcre.h> cases that I changed to "pcre.h".

9.  When pcregrep was used with the --colour option, it missed the line ending
    sequence off the lines that it output.

10. It was pointed out to me that arrays of string pointers cause lots of
    relocations when a shared library is dynamically loaded. A technique of
    using a single long string with a table of offsets can drastically reduce
    these. I have refactored PCRE in four places to do this. The result is
    dramatic:

      Originally:                          290
      After changing UCP table:            187
      After changing error message table:   43
      After changing table of "verbs"       36
      After changing table of Posix names   22

    Thanks to the folks working on Gregex for glib for this insight.
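
The technique in item 10 can be sketched as follows. The message texts, the
offsets, and the names error_texts, error_offsets, and error_text are made up
for this example and are not PCRE's actual tables.

```c
#include <assert.h>
#include <string.h>

/* Instead of "static const char *errors[] = { ... }", which needs one load-time
   relocation per pointer in a shared library, keep all the texts in a single
   string, separated by NUL bytes. */
static const char error_texts[] =
  "no error\0"
  "missing )\0"
  "nothing to repeat\0";

/* Byte offset of the start of each message within error_texts. These are plain
   integers, so the table needs no relocation. */
static const unsigned short error_offsets[] = { 0, 9, 19 };

/* Return the n-th message (hypothetical accessor). */
static const char *error_text(unsigned int n)
{
  assert(n < sizeof(error_offsets) / sizeof(error_offsets[0]));
  return error_texts + error_offsets[n];
}
```

The cost is that the offsets must be kept in step with the texts by hand or by
a generator script. An alternative that avoids the offset table is to walk the
string, skipping n terminators, at the price of a linear scan per lookup.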

11. --disable-stack-for-recursion caused compiling to fail unless
    --enable-unicode-properties was also set.

12. Updated the tests so that they work when \R is defaulted to ANYCRLF.

13. Added checks for ANY and ANYCRLF to pcrecpp.cc where it previously
    checked only for CRLF.

14. Added casts to pcretest.c to avoid compiler warnings.

15. Added Craig's patch to various pcrecpp modules to avoid compiler warnings.

16. Added Craig's patch to remove the WINDOWS_H tests, that were not working,
    and instead check for _strtoi64 explicitly, and avoid the use of snprintf()
    entirely. This removes changes made in 7 above.

17. The CMake files have been updated, and there is now more information about
    building with CMake in the NON-UNIX-USE document.


Version 7.3 28-Aug-07
---------------------

1.  In the rejigging of the build system that eventually resulted in 7.1, the
    line "#include <pcre.h>" was included in pcre_internal.h. The use of angle
    brackets there is not right, since it causes compilers to look for an
    installed pcre.h, not the version that is in the source that is being
    compiled (which of course may be different). I have changed it back to:

      #include "pcre.h"

    I have a vague recollection that the change was concerned with compiling in
    different directories, but in the new build system, that is taken care of
    by the VPATH setting in the Makefile.

2.  The pattern .*$ when run in not-DOTALL UTF-8 mode with newline=any failed
    when the subject happened to end in the byte 0x85 (e.g. if the last
    character was \x{1ec5}). *Character* 0x85 is one of the "any" newline
    characters but of course it shouldn't be taken as a newline when it is part
    of another character. The bug was that, for an unlimited repeat of . in
    not-DOTALL UTF-8 mode, PCRE was advancing by bytes rather than by
    characters when looking for a newline.
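
The difference between advancing by bytes and advancing by characters in item
2 can be shown with a small scanner for U+0085, which is encoded in UTF-8 as
the bytes C2 85. This is an illustrative sketch only: utf8_len and find_nel are
hypothetical names, the input is assumed to be valid UTF-8, and the real
matcher handles all the "any" newline sequences, not just this one.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in a UTF-8 character, judged from its first byte.
   Assumes valid UTF-8 of at most 4 bytes per character. */
static size_t utf8_len(unsigned char lead)
{
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  return 4;
}

/* Return the byte offset of the first U+0085 (bytes C2 85) in s, or len if
   there is none. The scan moves one whole character at a time, so a byte 0x85
   that is the tail of another character is never examined on its own. */
static size_t find_nel(const unsigned char *s, size_t len)
{
  size_t i = 0;
  assert(s != NULL);
  while (i < len) {
    size_t n = utf8_len(s[i]);
    if (n == 2 && i + 1 < len && s[i] == 0xC2 && s[i + 1] == 0x85) return i;
    i += n;
  }
  return len;
}
```

The character \x{1ec5} is encoded as E1 BB 85. A byte-by-byte search for 0x85
would stop at its last byte, whereas the scan above steps over all three bytes
at once and reports no newline.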

3.  A small performance improvement in the DOTALL UTF-8 mode .* case.

4.  Debugging: adjusted the names of opcodes for different kinds of parentheses
    in debug output.

5.  Arrange to use "%I64d" instead of "%lld" and "%I64u" instead of "%llu" for
    long printing in the pcrecpp unittest when running under MinGW.

6.  ESC_K was left out of the EBCDIC table.

7.  Change 7.0/38 introduced a new limit on the number of nested non-capturing
    parentheses; I made it 1000, which seemed large enough. Unfortunately, the
    limit also applies to "virtual nesting" when a pattern is recursive, and in
    this case 1000 isn't so big. I have been able to remove this limit at the
    expense of backing off one optimization in certain circumstances. Normally,
    when pcre_exec() would call its internal match() function recursively and
    immediately return the result unconditionally, it uses a "tail recursion"
    feature to save stack. However, when a subpattern that can match an empty
    string has an unlimited repetition quantifier, it no longer makes this
    optimization. That gives it a stack frame in which to save the data for
    checking that an empty string has been matched. Previously this was taken
    from the 1000-entry workspace that had been reserved. So now there is no
    explicit limit, but more stack is used.

8.  Applied Daniel's patches to solve problems with the import/export magic
    syntax that is required for Windows, and which was going wrong for the
    pcreposix and pcrecpp parts of the library. These were overlooked when this
    problem was solved for the main library.

9.  There were some crude static tests to avoid integer overflow when computing
    the size of patterns that contain repeated groups with explicit upper
    limits. As the maximum quantifier is 65535, the maximum group length was
    set at 30,000 so that the product of these two numbers did not overflow a
    32-bit integer. However, it turns out that people want to use groups that
    are longer than 30,000 bytes (though not repeat them that many times).
    Change 7.0/17 (the refactoring of the way the pattern size is computed) has
    made it possible to implement the integer overflow checks in a much more
    dynamic way, which I have now done. The artificial limitation on group
    length has been removed - we now have only the limit on the total length of
    the compiled pattern, which depends on the LINK_SIZE setting.
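
A dynamic check of the kind item 9 describes tests the actual operands rather
than capping each one in advance. The sketch below shows one common way to do
that. It is an assumption about the general approach, not the code in
pcre_compile.c, and repeated_length is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>

/* Return group_len * repeats, or -1 if the product would not fit in an int.
   Both arguments must be non-negative. Hypothetical helper. */
static int repeated_length(int group_len, int repeats)
{
  assert(group_len >= 0 && repeats >= 0);
  /* Divide instead of multiplying, so the test itself cannot overflow. */
  if (repeats != 0 && group_len > INT_MAX / repeats) return -1;
  return group_len * repeats;
}
```

With a 32-bit int, a 40,000-byte group repeated 10 times is accepted, while the
same group repeated 65535 times is rejected; a fixed 30,000-byte cap would have
refused both.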

10. Fixed a bug in the documentation for get/copy named substring when
    duplicate names are permitted. If none of the named substrings are set, the
    functions return PCRE_ERROR_NOSUBSTRING (-7); the doc said they returned an
    empty string.

11. Because Perl interprets \Q...\E at a high level, and ignores orphan \E
    instances, patterns such as [\Q\E] or [\E] or even [^\E] cause an error,
    because the ] is interpreted as the first data character and the
    terminating ] is not found. PCRE has been made compatible with Perl in this
    regard. Previously, it interpreted [\Q\E] as an empty class, and [\E] could
    cause memory overwriting.

10. Like Perl, PCRE automatically breaks an unlimited repeat after an empty
    string has been matched (to stop an infinite loop). It was not recognizing
    a conditional subpattern that could match an empty string if that
    subpattern was within another subpattern. For example, it looped when
    trying to match (((?(1)X|))*) but it was OK with ((?(1)X|)*) where the
    condition was not nested. This bug has been fixed.

12. A pattern like \X?\d or \P{L}?\d in non-UTF-8 mode could cause a backtrack
    past the start of the subject in the presence of bytes with the top bit
    set, for example "\x8aBCD".

13. Added Perl 5.10 experimental backtracking controls (*FAIL), (*F), (*PRUNE),
    (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT).

14. Optimized (?!) to (*FAIL).

15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629.
    This restricts code points to be within the range 0 to 0x10FFFF, excluding
    the "surrogate" range 0xD800 to 0xDFFF. Previously, PCRE allowed the
    full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still
    does: it's just the validity check that is more restrictive.

16. Inserted checks for integer overflows during escape sequence (backslash)
    processing, and also fixed erroneous offset values for syntax errors during
    backslash processing.

17. Fixed another case of looking too far back in non-UTF-8 mode (cf 12 above)
    for patterns like [\PPP\x8a]{1,}\x80 with the subject "A\x80".

18. An unterminated class in a pattern like (?1)\c[ with a "forward reference"
    caused an overrun.

19. A pattern like (?:[\PPa*]*){8,} which had an "extended class" (one with
    something other than just ASCII characters) inside a group that had an
    unlimited repeat caused a loop at compile time (while checking to see
    whether the group could match an empty string).

20. Debugging a pattern containing \p or \P could cause a crash. For example,
    [\P{Any}] did so. (Error in the code for printing property names.)

21. An orphan \E inside a character class could cause a crash.

22. A repeated capturing bracket such as (A)? could cause a wild memory
    reference during compilation.

23. There are several functions in pcre_compile() that scan along a compiled
    expression for various reasons (e.g. to see if it's fixed length for look
    behind). There were bugs in these functions when a repeated \p or \P was
    present in the pattern. These operators have additional parameters compared
    with \d, etc, and these were not being taken into account when moving along
    the compiled data. Specifically:

    (a) An item such as \p{Yi}{3} in a lookbehind was not treated as fixed
        length.

    (b) An item such as \pL+ within a repeated group could cause crashes or
        loops.

    (c) A pattern such as \p{Yi}+(\P{Yi}+)(?1) could give an incorrect
        "reference to non-existent subpattern" error.

    (d) A pattern like (\P{Yi}{2}\277)? could loop at compile time.

24. A repeated \S or \W in UTF-8 mode could give wrong answers when multibyte
    characters were involved (for example /\S{2}/8g with "A\x{a3}BC").

25. Using pcregrep in multiline, inverted mode (-Mv) caused it to loop.

26. Patterns such as [\P{Yi}A] which include \p or \P and just one other
    character were causing crashes (broken optimization).

27. Patterns such as (\P{Yi}*\277)* (group with possible zero repeat containing
    \p or \P) caused a compile-time loop.

28. More problems have arisen in unanchored patterns when CRLF is a valid line
    break. For example, the unstudied pattern [\r\n]A does not match the string
    "\r\nA" because change 7.0/46 below moves the current point on by two
    characters after failing to match at the start. However, the pattern \nA
    *does* match, because it doesn't start till \n, and if [\r\n]A is studied,
    the same is true. There doesn't seem any very clean way out of this, but
    what I have chosen to do makes the common cases work: PCRE now takes note
    of whether there can be an explicit match for \r or \n anywhere in the
    pattern, and if so, 7.0/46 no longer applies. As part of this change,
    there's a new PCRE_INFO_HASCRORLF option for finding out whether a compiled
    pattern has explicit CR or LF references.

29. Added (*CR) etc for changing newline setting at start of pattern.
|
613 |
|
614 |
|
615 Version 7.2 19-Jun-07 |
|
616 --------------------- |
|
617 |
|
618 1. If the fr_FR locale cannot be found for test 3, try the "french" locale, |
|
619 which is apparently normally available under Windows. |
|
620 |
|
621 2. Re-jig the pcregrep tests with different newline settings in an attempt |
|
622 to make them independent of the local environment's newline setting. |
|
623 |
|
624 3. Add code to configure.ac to remove -g from the CFLAGS default settings. |
|
625 |
|
626 4. Some of the "internals" tests were previously cut out when the link size |
|
627 was not 2, because the output contained actual offsets. The recent new |
|
628 "Z" feature of pcretest means that these can be cut out, making the tests |
|
629 usable with all link sizes. |
|
630 |
|
631 5. Implemented Stan Switzer's goto replacement for longjmp() when not using |
|
632 stack recursion. This gives a massive performance boost under BSD, but just |
|
633 a small improvement under Linux. However, it saves one field in the frame |
|
634 in all cases. |
|
635 |
|
636 6. Added more features from the forthcoming Perl 5.10: |
|
637 |
|
638 (a) (?-n) (where n is a string of digits) is a relative subroutine or |
|
639 recursion call. It refers to the nth most recently opened parentheses. |
|
640 |
|
641 (b) (?+n) is also a relative subroutine call; it refers to the nth next |
|
642 to be opened parentheses. |
|
643 |
|
644 (c) Conditions that refer to capturing parentheses can be specified |
|
645 relatively, for example, (?(-2)... or (?(+3)... |
|
646 |
|
647 (d) \K resets the start of the current match so that everything before |
|
648 is not part of it. |
|
649 |
|
650 (e) \k{name} is synonymous with \k<name> and \k'name' (.NET compatible). |
|
651 |
|
652 (f) \g{name} is another synonym - part of Perl 5.10's unification of |
|
653 reference syntax. |
|
654 |
|
655 (g) (?| introduces a group in which the numbering of parentheses in each |
|
656 alternative starts with the same number. |
|
657 |
|
658 (h) \h, \H, \v, and \V match horizontal and vertical whitespace. |
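
The relative numbering in (a) to (c) above can be illustrated with a small sketch. This is not PCRE's own code: the function name and its arguments are assumptions made for the example. It assumes that "opened" counts the capturing parentheses opened before the reference, so that in (a)(b)(?-1)(c) the reference (?-1) means group 2 and (?+1) would mean group 3.

```c
#include <assert.h>

/* Hypothetical helper, not part of PCRE. opened is the number of capturing
   parentheses opened before the reference; n is the value of the digit
   string (n >= 1); sign is '-' for (?-n) or '+' for (?+n). Returns the
   absolute group number, or 0 if (?-n) reaches back before group 1. */
static int resolve_relative(int opened, int n, char sign)
{
  assert(opened >= 0 && n >= 1 && (sign == '-' || sign == '+'));
  if (sign == '-')
    return (n > opened) ? 0 : opened - n + 1;
  return opened + n;   /* The nth group still to be opened. */
}
```

A caller would treat a zero result as a reference to a non-existent group.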
|
659 |
|
660 7. Added two new calls to pcre_fullinfo(): PCRE_INFO_OKPARTIAL and |
|
661 PCRE_INFO_JCHANGED. |
|
662 |
|
663 8. A pattern such as (.*(.)?)* caused pcre_exec() to fail by either not |
|
664 terminating or by crashing. Diagnosed by Viktor Griph; it was in the code |
|
665 for detecting groups that can match an empty string. |
|
666 |
|
667 9. A pattern with a very large number of alternatives (more than several |
|
668 hundred) was running out of internal workspace during the pre-compile |
|
669 phase, where pcre_compile() figures out how much memory will be needed. A |
|
670 bit of new cunning has reduced the workspace needed for groups with |
|
671 alternatives. The 1000-alternative test pattern now uses 12 bytes of |
|
672 workspace instead of running out of the 4096 that are available. |
|
673 |
|
674 10. Inserted some missing (unsigned int) casts to get rid of compiler warnings. |
|
675 |
|
676 11. Applied patch from Google to remove an optimization that didn't quite work. |
|
677 The report of the bug said: |
|
678 |
|
679 pcrecpp::RE("a*").FullMatch("aaa") matches, while |
|
680 pcrecpp::RE("a*?").FullMatch("aaa") does not, and |
|
681 pcrecpp::RE("a*?\\z").FullMatch("aaa") does again. |
|
682 |
|
683 12. If \p or \P was used in non-UTF-8 mode on a character greater than 127 |
|
684 it matched the wrong number of bytes. |
|
685 |
|
686 |
|
687 Version 7.1 24-Apr-07 |
|
688 --------------------- |
|
689 |
|
690 1. Applied Bob Rossi and Daniel G's patches to convert the build system to one |
|
691 that is more "standard", making use of automake and other Autotools. There |
|
692 is some re-arrangement of the files and adjustment of comments consequent |
|
693 on this. |
|
694 |
|
695 2. Part of the patch fixed a problem with the pcregrep tests. The test of -r |
|
696 for recursive directory scanning broke on some systems because the files |
|
697 are not scanned in any specific order and on different systems the order |
|
698 was different. A call to "sort" has been inserted into RunGrepTest for the |
|
699 appropriate test as a short-term fix. In the longer term there may be an |
|
700 alternative. |
|
701 |
|
702 3. I had an email from Eric Raymond about problems translating some of PCRE's |
|
703 man pages to HTML (despite the fact that I distribute HTML pages, some |
|
704 people do their own conversions for various reasons). The problems |
|
705 concerned the use of low-level troff macros .br and .in. I have therefore |
|
706 removed all such uses from the man pages (some were redundant, some could |
|
707 be replaced by .nf/.fi pairs). The 132html script that I use to generate |
|
708 HTML has been updated to handle .nf/.fi and to complain if it encounters |
|
709 .br or .in. |
|
710 |
|
711 4. Updated comments in configure.ac that get placed in config.h.in and also |
|
712 arranged for config.h to be included in the distribution, with the name |
|
713 config.h.generic, for the benefit of those who have to compile without |
|
714 Autotools (compare pcre.h, which is now distributed as pcre.h.generic). |
|
715 |
|
716 5. Updated the support (such as it is) for Virtual Pascal, thanks to Stefan |
|
717 Weber: (1) pcre_internal.h was missing some function renames; (2) updated |
|
718 makevp.bat for the current PCRE, using the additional files |
|
719 makevp_c.txt, makevp_l.txt, and pcregexp.pas. |
|
720 |
|
721 6. A Windows user reported a minor discrepancy with test 2, which turned out |
|
722 to be caused by a trailing space on an input line that had got lost in his |
|
723 copy. The trailing space was an accident, so I've just removed it. |
|
724 |
|
725 7. Add -Wl,-R... flags in pcre-config.in for *BSD* systems, as I'm told |
|
726 that is needed. |
|
727 |
|
728 8. Mark ucp_table (in ucptable.h) and ucp_gentype (in pcre_ucp_searchfuncs.c) |
|
729 as "const" (a) because they are and (b) because it helps the PHP |
|
730 maintainers who have recently made a script to detect big data structures |
|
731 in the php code that should be moved to the .rodata section. I remembered |
|
732 to update Builducptable as well, so it won't revert if ucptable.h is ever |
|
733 re-created. |
|
734 |
|
735 9. Added some extra #ifdef SUPPORT_UTF8 conditionals into pcretest.c, |
|
736 pcre_printint.src, pcre_compile.c, pcre_study.c, and pcre_tables.c, in |
|
737 order to be able to cut out the UTF-8 tables in the latter when UTF-8 |
|
738 support is not required. This saves 1.5-2K of code, which is important in |
|
739 some applications. |
|
740 |
|
741 Later: more #ifdefs are needed in pcre_ord2utf8.c and pcre_valid_utf8.c |
|
742 so as not to refer to the tables, even though these functions will never be |
|
743 called when UTF-8 support is disabled. Otherwise there are problems with a |
|
744 shared library. |
|
745 |
|
746 10. Fixed two bugs in the emulated memmove() function in pcre_internal.h: |
|
747 |
|
748 (a) It was defining its arguments as char * instead of void *. |
|
749 |
|
750 (b) It was assuming that all moves were upwards in memory; this was true |
|
751 a long time ago when I wrote it, but is no longer the case. |
|
752 |
|
753 The emulated memmove() is provided for those environments that have neither |
|
754 memmove() nor bcopy(). I didn't think anyone used it these days, but that |
|
755 is clearly not the case, as these two bugs were recently reported. |
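
The two fixes can be shown in a minimal sketch of a memmove() replacement. It is an illustration under assumptions, not the code in pcre_internal.h: the name sketch_memmove is made up, and it relies on comparing the two pointers, which is only strictly defined in C when both point into the same object. It takes void * arguments and picks the copy direction so that overlapping moves in either direction are safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical replacement for memmove(); not PCRE's actual function. */
static void *sketch_memmove(void *dest, const void *src, size_t n)
{
  unsigned char *d = (unsigned char *)dest;
  const unsigned char *s = (const unsigned char *)src;

  assert(n == 0 || (dest != NULL && src != NULL));
  if (d == s || n == 0) return dest;

  if (d < s)
    {
    /* Destination is lower in memory: copy forwards. */
    while (n-- > 0) *d++ = *s++;
    }
  else
    {
    /* Destination is higher in memory: copy backwards from the end. */
    d += n;
    s += n;
    while (n-- > 0) *--d = *--s;
    }
  return dest;
}
```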
|
756 |
|
757 11. The script PrepareRelease is now distributed: it calls 132html, CleanTxt, |
|
758 and Detrail to create the HTML documentation, the .txt form of the man |
|
759 pages, and it removes trailing spaces from listed files. It also creates |
|
760 pcre.h.generic and config.h.generic from pcre.h and config.h. In the latter |
|
761 case, it wraps all the #defines with #ifndefs. This script should be run |
|
762 before "make dist". |
|
763 |
|
764 12. Fixed two fairly obscure bugs concerned with quantified caseless matching |
|
765 with Unicode property support. |
|
766 |
|
767 (a) For a maximizing quantifier, if the two different cases of the |
|
768 character were of different lengths in their UTF-8 codings (there are |
|
769 some cases like this - I found 11), and the matching function had to |
|
770 back up over a mixture of the two cases, it incorrectly assumed they |
|
771 were both the same length. |
|
772 |
|
773 (b) When PCRE was configured to use the heap rather than the stack for |
|
774 recursion during matching, it was not correctly preserving the data for |
|
775 the other case of a UTF-8 character when checking ahead for a match |
|
776 while processing a minimizing repeat. If the check also involved |
|
777 matching a wide character, but failed, corruption could cause an |
|
778 erroneous result when trying to check for a repeat of the original |
|
779 character. |
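
To see why the length assumption in (a) fails, it helps to compute UTF-8 lengths directly. The helper below is a sketch with a made-up name, not PCRE code. One pair from the Unicode case data with unequal lengths is KELVIN SIGN (U+212A, 3 bytes) and its lower case "k" (U+006B, 1 byte); whether that pair was among the 11 mentioned above is not stated here.

```c
#include <assert.h>

/* Hypothetical helper: number of bytes in the UTF-8 encoding of a code
   point, for code points up to U+10FFFF. */
static int utf8_length(unsigned long cp)
{
  assert(cp <= 0x10FFFFul);
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000ul) return 3;
  return 4;
}
```

When backing up over a run that mixes both cases, the step size has to be taken from each character rather than from a single fixed length.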
|
780 |
|
781 13. Some tidying changes to the testing mechanism: |
|
782 |
|
783 (a) The RunTest script now detects the internal link size and whether there |
|
784 is UTF-8 and UCP support by running ./pcretest -C instead of relying on |
|
785 values substituted by "configure". (The RunGrepTest script already did |
|
786 this for UTF-8.) The configure.ac script no longer substitutes the |
|
787 relevant variables. |
|
788 |
|
789 (b) The debugging options /B and /D in pcretest show the compiled bytecode |
|
790 with length and offset values. This means that the output is different |
|
791 for different internal link sizes. Test 2 is skipped for link sizes |
|
792 other than 2 because of this, bypassing the problem. Unfortunately, |
|
793 there was also a test in test 3 (the locale tests) that used /B and |
|
794 failed for link sizes other than 2. Rather than cut the whole test out, |
|
795 I have added a new /Z option to pcretest that replaces the length and |
|
796 offset values with spaces. This is now used to make test 3 independent |
|
797 of link size. (Test 2 will be tidied up later.) |
|
798 |
|
799 14. If erroroffset was passed as NULL to pcre_compile, it provoked a |
|
800 segmentation fault instead of returning the appropriate error message. |
|
801 |
|
802 15. In multiline mode when the newline sequence was set to "any", the pattern |
|
803 ^$ would give a match between the \r and \n of a subject such as "A\r\nB". |
|
804 This doesn't seem right; it now treats the CRLF combination as the line |
|
805 ending, and so does not match in that case. It's only a pattern such as ^$ |
|
806 that would hit this one: something like ^ABC$ would have failed after \r |
|
807 and then tried again after \r\n. |
|
808 |
|
809 16. Changed the comparison command for RunGrepTest from "diff -u" to "diff -ub" |
|
810 in an attempt to make files that differ only in their line terminators |
|
811 compare equal. This works on Linux. |
|
812 |
|
813 17. Under certain error circumstances pcregrep might try to free random memory |
|
814 as it exited. This is now fixed, thanks to valgrind. |
|
815 |
|
816 19. In pcretest, if the pattern /(?m)^$/g<any> was matched against the string |
|
817 "abc\r\n\r\n", it found an unwanted second match after the second \r. This |
|
818 was because its rules for how to advance for /g after matching an empty |
|
819 string at the end of a line did not allow for this case. They now check for |
|
820 it specially. |
|
821 |
|
822 20. pcretest is supposed to handle patterns and data of any length, by |
|
823 extending its buffers when necessary. It was getting this wrong when the |
|
824 buffer for a data line had to be extended. |
|
825 |
|
826 21. Added PCRE_NEWLINE_ANYCRLF which is like ANY, but matches only CR, LF, or |
|
827 CRLF as a newline sequence. |
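
As a rough sketch of the ANYCRLF rule (not the code in PCRE itself; the helper name and arguments are assumptions), a newline test at a given position could look like this. It ignores UTF-8 and works on plain bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the length (0, 1, or 2) of an ANYCRLF
   newline starting at subject[pos]; 0 means there is no newline there. */
static size_t anycrlf_length(const char *subject, size_t length, size_t pos)
{
  assert(subject != NULL);
  if (pos >= length) return 0;
  if (subject[pos] == '\n') return 1;
  if (subject[pos] == '\r')
    return (pos + 1 < length && subject[pos + 1] == '\n') ? 2 : 1;
  return 0;   /* Other Unicode newlines such as FF do not count here. */
}
```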
|
828 |
|
829 22. Code for handling Unicode properties in pcre_dfa_exec() wasn't being cut |
|
830 out by #ifdef SUPPORT_UCP. This did no harm, as it could never be used, but |
|
831 I have nevertheless tidied it up. |
|
832 |
|
833 23. Added some casts to kill warnings from HP-UX ia64 compiler. |
|
834 |
|
835 24. Added a man page for pcre-config. |
|
836 |
|
837 |
|
838 Version 7.0 19-Dec-06 |
|
839 --------------------- |
|
840 |
|
841 1. Fixed a signed/unsigned compiler warning in pcre_compile.c, shown up by |
|
842 moving to gcc 4.1.1. |
|
843 |
|
844 2. The -S option for pcretest uses setrlimit(); I had omitted to #include |
|
845 sys/time.h, which is documented as needed for this function. It doesn't |
|
846 seem to matter on Linux, but it showed up on some releases of OS X. |
|
847 |
|
848 3. It seems that there are systems where bytes whose values are greater than |
|
849 127 match isprint() in the "C" locale. The "C" locale should be the |
|
850 default when a C program starts up. In most systems, only ASCII printing |
|
851 characters match isprint(). This difference caused the output from pcretest |
|
852 to vary, making some of the tests fail. I have changed pcretest so that: |
|
853 |
|
854 (a) When it is outputting text in the compiled version of a pattern, bytes |
|
855 other than 32-126 are always shown as hex escapes. |
|
856 |
|
857 (b) When it is outputting text that is a matched part of a subject string, |
|
858 it does the same, unless a different locale has been set for the match |
|
859 (using the /L modifier). In this case, it uses isprint() to decide. |
|
860 |
|
861 4. Fixed a major bug that caused incorrect computation of the amount of memory |
|
862 required for a compiled pattern when options that changed within the |
|
863 pattern affected the logic of the preliminary scan that determines the |
|
864 length. The relevant options are -x, and -i in UTF-8 mode. The result was |
|
865 that the computed length was too small. The symptoms of this bug were |
|
866 either the PCRE error "internal error: code overflow" from pcre_compile(), |
|
867 or a glibc crash with a message such as "pcretest: free(): invalid next |
|
868 size (fast)". Examples of patterns that provoked this bug (shown in |
|
869 pcretest format) are: |
|
870 |
|
871 /(?-x: )/x |
|
872 /(?x)(?-x: \s*#\s*)/ |
|
873 /((?i)[\x{c0}])/8 |
|
874 /(?i:[\x{c0}])/8 |
|
875 |
|
876 HOWEVER: Change 17 below makes this fix obsolete as the memory computation |
|
877 is now done differently. |
|
878 |
|
879 5. Applied patches from Google to: (a) add a QuoteMeta function to the C++ |
|
880 wrapper classes; (b) implement a new function in the C++ scanner that is |
|
881 more efficient than the old way of doing things because it avoids levels of |
|
882 recursion in the regex matching; (c) add a paragraph to the documentation |
|
883 for the FullMatch() function. |
|
884 |
|
885 6. The escape sequence \n was being treated as whatever was defined as |
|
886 "newline". Not only was this contrary to the documentation, which states |
|
887 that \n is character 10 (hex 0A), but it also went horribly wrong when |
|
888 "newline" was defined as CRLF. This has been fixed. |
|
889 |
|
890 7. In pcre_dfa_exec.c the value of an unsigned integer (the variable called c) |
|
891 was being set to -1 for the "end of line" case (supposedly a value that no |
|
892 character can have). Though this value is never used (the check for end of |
|
893 line is "zero bytes in current character"), it caused compiler complaints. |
|
894 I've changed it to 0xffffffff. |
|
895 |
|
896 8. In pcre_version.c, the version string was being built by a sequence of |
|
897 C macros that, in the event of PCRE_PRERELEASE being defined as an empty |
|
898 string (as it is for production releases) called a macro with an empty |
|
899 argument. The C standard says the result of this is undefined. The gcc |
|
900 compiler treats it as an empty string (which was what was wanted) but it is |
|
901 reported that Visual C gives an error. The source has been hacked around to |
|
902 avoid this problem. |
|
903 |
|
904 9. On the advice of a Windows user, included <io.h> and <fcntl.h> in Windows |
|
905 builds of pcretest, and changed the call to _setmode() to use _O_BINARY |
|
906 instead of 0x8000. Made all the #ifdefs test both _WIN32 and WIN32 (not all |
|
907 of them did). |
|
908 |
|
909 10. Originally, pcretest opened its input and output without "b"; then I was |
|
910 told that "b" was needed in some environments, so it was added for release |
|
911 5.0 to both the input and output. (It makes no difference on Unix-like |
|
912 systems.) Later I was told that it is wrong for the input on Windows. I've |
|
913 now abstracted the modes into two macros, to make it easier to fiddle with |
|
914 them, and removed "b" from the input mode under Windows. |
|
915 |
|
916 11. Added pkgconfig support for the C++ wrapper library, libpcrecpp. |
|
917 |
|
918 12. Added -help and --help to pcretest as an official way of being reminded |
|
919 of the options. |
|
920 |
|
921 13. Removed some redundant semicolons after macro calls in pcrecpparg.h.in |
|
922 and pcrecpp.cc because they annoy compilers at high warning levels. |
|
923 |
|
924 14. A bit of tidying/refactoring in pcre_exec.c in the main bumpalong loop. |
|
925 |
|
926 15. Fixed an occurrence of == in configure.ac that should have been = (shell |
|
927 scripts are not C programs :-) and which was not noticed because it works |
|
928 on Linux. |
|
929 |
|
930 16. pcretest is supposed to handle any length of pattern and data line (as one |
|
931 line or as a continued sequence of lines) by extending its input buffer if |
|
932 necessary. This feature was broken for very long pattern lines, leading to |
|
933 a string of junk being passed to pcre_compile() if the pattern was longer |
|
934 than about 50K. |
|
935 |
|
936 17. I have done a major re-factoring of the way pcre_compile() computes the |
|
937 amount of memory needed for a compiled pattern. Previously, there was code |
|
938 that made a preliminary scan of the pattern in order to do this. That was |
|
939 OK when PCRE was new, but as the facilities have expanded, it has become |
|
940 harder and harder to keep it in step with the real compile phase, and there |
|
941 have been a number of bugs (see for example, 4 above). I have now found a |
|
942 cunning way of running the real compile function in a "fake" mode that |
|
943 enables it to compute how much memory it would need, while actually only |
|
944 ever using a few hundred bytes of working memory and without too many |
|
945 tests of the mode. This should make future maintenance and development |
|
946 easier. A side effect of this work is that the limit of 200 on the nesting |
|
947 depth of parentheses has been removed (though this was never a serious |
|
948 limitation, I suspect). However, there is a downside: pcre_compile() now |
|
949 runs more slowly than before (30% or more, depending on the pattern). I |
|
950 hope this isn't a big issue. There is no effect on runtime performance. |
|
951 |
|
952 18. Fixed a minor bug in pcretest: if a pattern line was not terminated by a |
|
953 newline (only possible for the last line of a file) and it was a |
|
954 pattern that set a locale (followed by /Lsomething), pcretest crashed. |
|
955 |
|
956 19. Added additional timing features to pcretest. (1) The -tm option now times |
|
957 matching only, not compiling. (2) Both -t and -tm can be followed, as a |
|
958 separate command line item, by a number that specifies the number of |
|
959 repeats to use when timing. The default is 50000; this gives better |
|
960 precision, but takes uncomfortably long for very large patterns. |
|
961 |
|
962 20. Extended pcre_study() to be more clever in cases where a branch of a |
|
963 subpattern has no definite first character. For example, (a*|b*)[cd] would |
|
964 previously give no result from pcre_study(). Now it recognizes that the |
|
965 first character must be a, b, c, or d. |
|
966 |
|
967 21. There was an incorrect error "recursive call could loop indefinitely" if |
|
968 a subpattern (or the entire pattern) that was being tested for matching an |
|
969 empty string contained only one non-empty item after a nested subpattern. |
|
970 For example, the pattern (?>\x{100}*)\d(?R) provoked this error |
|
971 incorrectly, because the \d was being skipped in the check. |
|
972 |
|
973 22. The pcretest program now has a new pattern option /B and a command line |
|
974 option -b, which is equivalent to adding /B to every pattern. This causes |
|
975 it to show the compiled bytecode, without the additional information that |
|
976 -d shows. The effect of -d is now the same as -b with -i (and similarly, /D |
|
977 is the same as /B/I). |
|
978 |
|
979 23. A new optimization is now able automatically to treat some sequences such |
|
980 as a*b as a*+b. More specifically, if something simple (such as a character |
|
981 or a simple class like \d) has an unlimited quantifier, and is followed by |
|
982 something that cannot possibly match the quantified thing, the quantifier |
|
983 is automatically "possessified". |
|
984 |
|
985 24. A recursive reference to a subpattern whose number was greater than 39 |
|
986 went wrong under certain circumstances in UTF-8 mode. This bug could also |
|
987 have affected the operation of pcre_study(). |
|
988 |
|
989 25. Realized that a little bit of performance could be had by replacing |
|
990 (c & 0xc0) == 0xc0 with c >= 0xc0 when processing UTF-8 characters. |
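
The two tests are interchangeable only when c is an unsigned 8-bit value, because the bytes with both top bits set are exactly 0xc0 to 0xff. The sketch below, with made-up function names, checks that over every byte value; it is an illustration, not code from PCRE.

```c
#include <assert.h>

/* Hypothetical names for the two forms of the test in the entry above. */
static int lead_by_mask(unsigned char c)    { return (c & 0xc0) == 0xc0; }
static int lead_by_compare(unsigned char c) { return c >= 0xc0; }

/* Asserts that the two tests agree for all 256 byte values. */
static void check_equivalence(void)
{
  unsigned int i;
  for (i = 0; i < 256; i++)
    assert(lead_by_mask((unsigned char)i) ==
           lead_by_compare((unsigned char)i));
}
```

With a signed char holding a negative value, the comparison form would give a different answer, so the variable's type matters.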
|
991 |
|
992 26. Timing data from pcretest is now shown to 4 decimal places instead of 3. |
|
993 |
|
994 27. Possessive quantifiers such as a++ were previously implemented by turning |
|
995 them into atomic groups such as (?>a+). Now they have their own opcodes, |
|
996 which improves performance. This includes the automatically created ones |
|
997 from 23 above. |
|
998 |
|
999 28. A pattern such as (?=(\w+))\1: which simulates an atomic group using a |
|
1000 lookahead was broken if it was not anchored. PCRE was mistakenly expecting |
|
1001 the first matched character to be a colon. This applied both to named and |
|
1002 numbered groups. |
|
1003 |
|
1004 29. The ucpinternal.h header file was missing its idempotency #ifdef. |
|
1005 |
|
1006 30. I was sent a "project" file called libpcre.a.dev which I understand makes |
|
1007 building PCRE on Windows easier, so I have included it in the distribution. |
|
1008 |
|
1009 31. There is now a check in pcretest against a ridiculously large number being |
|
1010 returned by pcre_exec() or pcre_dfa_exec(). If this happens in a /g or /G |
|
1011 loop, the loop is abandoned. |
|
1012 |
|
1013 32. Forward references to subpatterns in conditions such as (?(2)...) where |
|
1014 subpattern 2 is defined later cause pcre_compile() to search forwards in |
|
1015 the pattern for the relevant set of parentheses. This search went wrong |
|
1016 when there were unescaped parentheses in a character class, parentheses |
|
1017 escaped with \Q...\E, or parentheses in a #-comment in /x mode. |
|
1018 |
|
1019 33. "Subroutine" calls and backreferences were previously restricted to |
|
1020 referencing subpatterns earlier in the regex. This restriction has now |
|
1021 been removed. |
|
1022 |
|
1023 34. Added a number of extra features that are going to be in Perl 5.10. On the |
|
1024 whole, these are just syntactic alternatives for features that PCRE had |
|
1025 previously implemented using the Python syntax or my own invention. The |
|
1026 other formats are all retained for compatibility. |
|
1027 |
|
1028 (a) Named groups can now be defined as (?<name>...) or (?'name'...) as well |
|
1029 as (?P<name>...). The new forms, as well as being in Perl 5.10, are |
|
1030 also .NET compatible. |
|
1031 |
|
1032 (b) A recursion or subroutine call to a named group can now be defined as |
|
1033 (?&name) as well as (?P>name). |
|
1034 |
|
1035 (c) A backreference to a named group can now be defined as \k<name> or |
|
1036 \k'name' as well as (?P=name). The new forms, as well as being in Perl |
|
1037 5.10, are also .NET compatible. |
|
1038 |
|
1039 (d) A conditional reference to a named group can now use the syntax |
|
1040 (?(<name>) or (?('name') as well as (?(name). |
|
1041 |
|
1042 (e) A "conditional group" of the form (?(DEFINE)...) can be used to define |
|
1043 groups (named and numbered) that are never evaluated inline, but can be |
|
1044 called as "subroutines" from elsewhere. In effect, the DEFINE condition |
|
1045 is always false. There may be only one alternative in such a group. |
|
1046 |
|
1047 (f) A test for recursion can be given as (?(R1).. or (?(R&name)... as well |
|
1048 as the simple (?(R). The condition is true only if the most recent |
|
1049 recursion is that of the given number or name. It does not search out |
|
1050 through the entire recursion stack. |
|
1051 |
|
1052 (g) The escape \gN or \g{N} has been added, where N is a positive or |
|
1053 negative number, specifying an absolute or relative reference. |
|
1054 |
|
1055 35. Tidied to get rid of some further signed/unsigned compiler warnings and |
|
1056 some "unreachable code" warnings. |
|
1057 |
|
1058 36. Updated the Unicode property tables to Unicode version 5.0.0. Amongst other |
|
1059 things, this adds five new scripts. |
|
1060 |
|
1061 37. Perl ignores orphaned \E escapes completely. PCRE now does the same. |
|
1062 There were also incompatibilities regarding the handling of \Q..\E inside |
|
1063 character classes, for example with patterns like [\Qa\E-\Qz\E] where the |
|
1064 hyphen was adjacent to \Q or \E. I hope I've cleared all this up now. |
|
1065 |
|
1066 38. Like Perl, PCRE detects when an indefinitely repeated parenthesized group |
|
1067 matches an empty string, and forcibly breaks the loop. There were bugs in |
|
1068 this code in non-simple cases. For a pattern such as ^(a()*)* matched |
|
1069 against aaaa the result was just "a" rather than "aaaa", for example. Two |
|
1070 separate and independent bugs (that affected different cases) have been |
|
1071 fixed. |
|
1072 |
|
1073 39. Refactored the code to abolish the use of different opcodes for small |
|
1074 capturing bracket numbers. This is a tidy that I avoided doing when I |
|
1075 removed the limit on the number of capturing brackets for 3.5 back in 2001. |
|
1076 The new approach is not only tidier, it makes it possible to reduce the |
|
1077 memory needed to fix the previous bug (38). |
|
1078 |
|
1079 40. Implemented PCRE_NEWLINE_ANY to recognize any of the Unicode newline |
|
1080 sequences (http://unicode.org/unicode/reports/tr18/) as "newline" when |
|
1081 processing dot, circumflex, or dollar metacharacters, or #-comments in /x |
|
1082 mode. |
|
1083 |
|
1084 41. Add \R to match any Unicode newline sequence, as suggested in the Unicode |
|
1085 report. |
|
1086 |
|
1087 42. Applied patch, originally from Ari Pollak, modified by Google, to allow |
|
1088 copy construction and assignment in the C++ wrapper. |
|
1089 |
|
1090 43. Updated pcregrep to support "--newline=any". In the process, I fixed a |
|
1091 couple of bugs that could have given wrong results in the "--newline=crlf" |
|
1092 case. |
|
1093 |
|
1094 44. Added a number of casts and did some reorganization of signed/unsigned int |
|
1095 variables following suggestions from Dair Grant. Also renamed the variable |
|
1096 "this" as "item" because it is a C++ keyword. |
|
1097 |
|
1098 45. Arranged for dftables to add |
|
1099 |
|
1100 #include "pcre_internal.h" |
|
1101 |
|
1102 to pcre_chartables.c because without it, gcc 4.x may remove the array |
|
1103 definition from the final binary if PCRE is built into a static library and |
|
1104 dead code stripping is activated. |
|
1105 |
|
1106 46. For an unanchored pattern, if a match attempt fails at the start of a |
|
1107 newline sequence, and the newline setting is CRLF or ANY, and the next two |
|
1108 characters are CRLF, advance by two characters instead of one. |
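
The advance rule can be sketched as follows. This is an illustration only: the enum, the function name, and the byte-wise handling (no UTF-8) are assumptions, and PCRE's real newline options are bit flags rather than this enum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical newline settings for the sketch. */
enum newline_kind { NL_CR, NL_LF, NL_CRLF, NL_ANY };

/* After a failed match attempt at subject[pos], return the next position
   at which to try a match. */
static size_t next_start(const char *subject, size_t length, size_t pos,
  enum newline_kind nl)
{
  assert(subject != NULL && pos < length);
  if ((nl == NL_CRLF || nl == NL_ANY) &&
      pos + 1 < length &&
      subject[pos] == '\r' && subject[pos + 1] == '\n')
    return pos + 2;   /* Skip the whole CRLF pair. */
  return pos + 1;     /* Normal single-character advance. */
}
```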
|
1109 |
|
1110 |
|
1111 Version 6.7 04-Jul-06 |
|
1112 --------------------- |
|
1113 |
|
1114 1. In order to handle tests when input lines are enormously long, pcretest has |
|
1115 been re-factored so that it automatically extends its buffers when |
|
1116 necessary. The code is crude, but this _is_ just a test program. The |
|
1117 default size has been increased from 32K to 50K. |
|
1118 |
|
1119 2. The code in pcre_study() was using the value of the re argument before |
|
1120 testing it for NULL. (Of course, in any sensible call of the function, it |
|
1121 won't be NULL.) |
|
1122 |
|
1123 3. The memmove() emulation function in pcre_internal.h, which is used on |
|
1124 systems that lack both memmove() and bcopy() - that is, hardly ever - |
|
1125 was missing a "static" storage class specifier. |
|
1126 |
|
1127 4. When UTF-8 mode was not set, PCRE looped when compiling certain patterns |
|
1128 containing an extended class (one that cannot be represented by a bitmap |
|
1129 because it contains high-valued characters or Unicode property items, e.g. |
|
1130 [\pZ]). Almost always one would set UTF-8 mode when processing such a |
|
1131 pattern, but PCRE should not loop if you do not (it no longer does). |
|
1132 [Detail: two cases were found: (a) a repeated subpattern containing an |
|
1133 extended class; (b) a recursive reference to a subpattern that followed a |
|
1134 previous extended class. It wasn't skipping over the extended class |
|
1135 correctly when UTF-8 mode was not set.] |
|
1136 |
|
1137 5. A negated single-character class was not being recognized as fixed-length |
|
1138 in lookbehind assertions such as (?<=[^f]), leading to an incorrect |
|
1139 compile error "lookbehind assertion is not fixed length". |
|
1140 |
|
1141 6. The RunPerlTest auxiliary script was showing an unexpected difference |
|
1142 between PCRE and Perl for UTF-8 tests. It turns out that it is hard to |
|
1143 write a Perl script that can interpret lines of an input file either as |
|
1144 byte characters or as UTF-8, which is what "perltest" was being required to |
|
1145 do for the non-UTF-8 and UTF-8 tests, respectively. Essentially what you |
|
1146 can't do is switch easily at run time between having the "use utf8;" pragma |
|
1147 or not. In the end, I fudged it by using the RunPerlTest script to insert |
|
1148 "use utf8;" explicitly for the UTF-8 tests. |
|
1149 |
|
1150 7. In multiline (/m) mode, PCRE was matching ^ after a terminating newline at |
|
1151 the end of the subject string, contrary to the documentation and to what |
|
1152 Perl does. This was true of both matching functions. Now it matches only at |
|
1153 the start of the subject and immediately after *internal* newlines. |
|
1154 |
|
1155 8. A call of pcre_fullinfo() from pcretest to get the option bits was passing |
|
1156 a pointer to an int instead of a pointer to an unsigned long int. This |
|
1157 caused problems on 64-bit systems. |
|
1158 |
|
1159 9. Applied a patch from the folks at Google to pcrecpp.cc, to fix "another |
|
1160 instance of the 'standard' template library not being so standard". |
|
1161 |
|
1162 10. There was no check on the number of named subpatterns nor the maximum |
|
1163 length of a subpattern name. The product of these values is used to compute |
|
1164 the size of the memory block for a compiled pattern. By supplying a very |
|
1165 long subpattern name and a large number of named subpatterns, the size |
|
1166 computation could be caused to overflow. This is now prevented by limiting |
|
1167 the length of names to 32 characters, and the number of named subpatterns |
|
1168 to 10,000. |
|
1169 |
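A minimal sketch of the kind of check described in item 10. The function and macro names are hypothetical, and the per-entry overhead of 3 bytes is an assumption made for illustration; only the two limits (32 and 10,000) come from the text above.

```c
#include <stddef.h>

#define SKETCH_MAX_NAME_SIZE  32      /* limit on name length, from item 10 */
#define SKETCH_MAX_NAME_COUNT 10000   /* limit on named subpatterns, from item 10 */

/* Hypothetical helper: compute the size of a name table, refusing inputs
   that exceed the limits so that the product cannot overflow.
   Returns 0 on success and -1 if a limit is exceeded. */
int sketch_name_table_size(size_t name_count, size_t max_name_len, size_t *size)
{
  if (name_count > SKETCH_MAX_NAME_COUNT) return -1;
  if (max_name_len > SKETCH_MAX_NAME_SIZE) return -1;
  /* Assumed entry layout: name plus 3 bytes of overhead. */
  *size = name_count * (max_name_len + 3);
  return 0;
}
```

With both factors bounded, the product is at most 10,000 * 35 = 350,000, which is safe for any realistic size type.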
|
1170 11. Subpatterns that are repeated with specific counts have to be replicated in |
|
1171 the compiled pattern. The size of memory for this was computed from the |
|
1172 length of the subpattern and the repeat count. The latter is limited to |
|
1173 65535, but there was no limit on the former, meaning that integer overflow |
|
1174 could in principle occur. The compiled length of a repeated subpattern is |
|
1175 now limited to 30,000 bytes in order to prevent this. |
|
1176 |
|
1177 12. Added the optional facility to have named substrings with the same name. |
|
1178 |
|
1179 13. Added the ability to use a named substring as a condition, using the |
|
1180 Python syntax: (?(name)yes|no). This overloads (?(R)... and names that |
|
1181 are numbers (not recommended). Forward references are permitted. |
|
1182 |
|
1183 14. Added forward references in named backreferences (if you see what I mean). |
|
1184 |
|
1185 15. In UTF-8 mode, with the PCRE_DOTALL option set, a quantified dot in the |
|
1186 pattern could run off the end of the subject. For example, the pattern |
|
1187 "(?s)(.{1,5})"8 did this with the subject "ab". |
|
1188 |
|
1189 16. If PCRE_DOTALL or PCRE_MULTILINE were set, pcre_dfa_exec() behaved as if |
|
1190 PCRE_CASELESS was set when matching characters that were quantified with ? |
|
1191 or *. |
|
1192 |
|
1193 17. A character class other than a single negated character that had a minimum |
|
1194 but no maximum quantifier - for example [ab]{6,} - was not handled |
|
1195 correctly by pcre_dfa_exec(). It would match only one character. |
|
1196 |
|
1197 18. A valid (though odd) pattern that looked like a POSIX character |
|
1198 class but used an invalid character after [ (for example [[,abc,]]) caused |
|
1199 pcre_compile() to give the error "Failed: internal error: code overflow" or |
|
1200 in some cases to crash with a glibc free() error. This could even happen if |
|
1201 the pattern terminated after [[ but there just happened to be a sequence of |
|
1202 letters, a binary zero, and a closing ] in the memory that followed. |
|
1203 |
|
1204 19. Perl's treatment of octal escapes in the range \400 to \777 has changed |
|
1205 over the years. Originally (before any Unicode support), just the bottom 8 |
|
1206 bits were taken. Thus, for example, \500 really meant \100. Nowadays the |
|
1207 output from "man perlunicode" includes this: |
|
1208 |
|
1209 The regular expression compiler produces polymorphic opcodes. That |
|
1210 is, the pattern adapts to the data and automatically switches to |
|
1211 the Unicode character scheme when presented with Unicode data--or |
|
1212 instead uses a traditional byte scheme when presented with byte |
|
1213 data. |
|
1214 |
|
1215 Sadly, a wide octal escape does not cause a switch, and in a string with |
|
1216 no other multibyte characters, these octal escapes are treated as before. |
|
1217 Thus, in Perl, the pattern /\500/ actually matches \100 but the pattern |
|
1218 /\500|\x{1ff}/ matches \500 or \777 because the whole thing is treated as a |
|
1219 Unicode string. |
|
1220 |
|
1221 I have not perpetrated such confusion in PCRE. Up till now, it took just |
|
1222 the bottom 8 bits, as in old Perl. I have now made octal escapes with |
|
1223 values greater than \377 illegal in non-UTF-8 mode. In UTF-8 mode they |
|
1224 translate to the appropriate multibyte character. |
|
1225 |
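The new behaviour for wide octal escapes in item 19 might be sketched as follows. The function name is hypothetical, and the sketch ignores the ambiguity between octal escapes and back references that a real pattern compiler has to resolve.

```c
/* Hypothetical helper: read up to three octal digits that follow a
   backslash. Returns the value, or -1 if the value is greater than \377
   and UTF-8 mode is not set. */
int sketch_read_octal_escape(const char *digits, int utf8_mode)
{
  int value = 0;
  for (int i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
    value = value * 8 + (digits[i] - '0');
  if (!utf8_mode && value > 0377) return -1;  /* illegal in non-UTF-8 mode */
  return value;  /* in UTF-8 mode the caller would encode this code point */
}
```

Under this sketch, \500 is rejected in non-UTF-8 mode rather than being silently truncated to \100.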
|
1226 20. Applied some refactoring to reduce the number of warnings from Microsoft |
|
1227 and Borland compilers. This has included removing the fudge introduced |
|
1228 seven years ago for the OS/2 compiler (see 2.02/2 below) because it caused |
|
1229 a warning about an unused variable. |
|
1230 |
|
1231 21. PCRE has not included VT (character 0x0b) in the set of whitespace |
|
1232 characters since release 4.0, because Perl (from release 5.004) does not. |
|
1233 [Or at least, is documented not to: some releases seem to be in conflict |
|
1234 with the documentation.] However, when a pattern was studied with |
|
1235 pcre_study() and all its branches started with \s, PCRE still included VT |
|
1236 as a possible starting character. Of course, this did no harm; it just |
|
1237 caused an unnecessary match attempt. |
|
1238 |
|
1239 22. Removed a now-redundant internal flag bit that recorded the fact that case |
|
1240 dependency changed within the pattern. This was once needed for "required |
|
1241 byte" processing, but is no longer used. This recovers a now-scarce options |
|
1242 bit. Also moved the least significant internal flag bit to the most- |
|
1243 significant bit of the word, which was not previously used (hangover from |
|
1244 the days when it was an int rather than a uint) to free up another bit for |
|
1245 the future. |
|
1246 |
|
1247 23. Added support for CRLF line endings as well as CR and LF. As well as the |
|
1248 default being selectable at build time, it can now be changed at runtime |
|
1249 via the PCRE_NEWLINE_xxx flags. There are now options for pcregrep to |
|
1250 specify that it is scanning data with non-default line endings. |
|
1251 |
|
1252 24. Changed the definition of CXXLINK to make it agree with the definition of |
|
1253 LINK in the Makefile, by replacing LDFLAGS with CXXFLAGS. |
|
1254 |
|
1255 25. Applied Ian Taylor's patches to avoid using another stack frame for tail |
|
1256 recursions. This makes a big difference to stack usage for some patterns. |
|
1257 |
|
1258 26. If a subpattern containing a named recursion or subroutine reference such |
|
1259 as (?P>B) was quantified, for example (xxx(?P>B)){3}, the calculation of |
|
1260 the space required for the compiled pattern went wrong and gave too small a |
|
1261 value. Depending on the environment, this could lead to "Failed: internal |
|
1262 error: code overflow at offset 49" or "glibc detected double free or |
|
1263 corruption" errors. |
|
1264 |
|
1265 27. Applied patches from Google (a) to support the new newline modes and (b) to |
|
1266 advance over multibyte UTF-8 characters in GlobalReplace. |
|
1267 |
|
1268 28. Change free() to pcre_free() in pcredemo.c. Apparently this makes a |
|
1269 difference for some implementation of PCRE in some Windows version. |
|
1270 |
|
1271 29. Added some extra testing facilities to pcretest: |
|
1272 |
|
1273 \q<number> in a data line sets the "match limit" value |
|
1274 \Q<number> in a data line sets the "match recursion limit" value |
|
1275 -S <number> sets the stack size, where <number> is in megabytes |
|
1276 |
|
1277 The -S option isn't available for Windows. |
|
1278 |
|
1279 |
|
1280 Version 6.6 06-Feb-06 |
|
1281 --------------------- |
|
1282 |
|
1283 1. Change 16(a) for 6.5 broke things, because PCRE_DATA_SCOPE was not defined |
|
1284 in pcreposix.h. I have copied the definition from pcre.h. |
|
1285 |
|
1286 2. Change 25 for 6.5 broke compilation in a build directory out-of-tree |
|
1287 because pcre.h is no longer a built file. |
|
1288 |
|
1289 3. Added Jeff Friedl's additional debugging patches to pcregrep. These are |
|
1290 not normally included in the compiled code. |
|
1291 |
|
1292 |
|
1293 Version 6.5 01-Feb-06 |
|
1294 --------------------- |
|
1295 |
|
1296 1. When using the partial match feature with pcre_dfa_exec(), it was not |
|
1297 anchoring the second and subsequent partial matches at the new starting |
|
1298 point. This could lead to incorrect results. For example, with the pattern |
|
1299 /1234/, partially matching against "123" and then "a4" gave a match. |
|
1300 |
|
1301 2. Changes to pcregrep: |
|
1302 |
|
1303 (a) All non-match returns from pcre_exec() were being treated as failures |
|
1304 to match the line. Now, unless the error is PCRE_ERROR_NOMATCH, an |
|
1305 error message is output. Some extra information is given for the |
|
1306 PCRE_ERROR_MATCHLIMIT and PCRE_ERROR_RECURSIONLIMIT errors, which are |
|
1307 probably the only errors that are likely to be caused by users (by |
|
1308 specifying a regex that has nested indefinite repeats, for instance). |
|
1309 If there are more than 20 of these errors, pcregrep is abandoned. |
|
1310 |
|
1311 (b) A binary zero was treated as data while matching, but terminated the |
|
1312 output line if it was written out. This has been fixed: binary zeroes |
|
1313 are now no different to any other data bytes. |
|
1314 |
|
1315 (c) Whichever of the LC_ALL or LC_CTYPE environment variables is set is |
|
1316 used to set a locale for matching. The --locale=xxxx long option has |
|
1317 been added (no short equivalent) to specify a locale explicitly on the |
|
1318 pcregrep command, overriding the environment variables. |
|
1319 |
|
1320 (d) When -B was used with -n, some line numbers in the output were one less |
|
1321 than they should have been. |
|
1322 |
|
1323 (e) Added the -o (--only-matching) option. |
|
1324 |
|
1325 (f) If -A or -C was used with -c (count only), some lines of context were |
|
1326 accidentally printed for the final match. |
|
1327 |
|
1328 (g) Added the -H (--with-filename) option. |
|
1329 |
|
1330 (h) The combination of options -rh failed to suppress file names for files |
|
1331 that were found from directory arguments. |
|
1332 |
|
1333 (i) Added the -D (--devices) and -d (--directories) options. |
|
1334 |
|
1335 (j) Added the -F (--fixed-strings) option. |
|
1336 |
|
1337 (k) Allow "-" to be used as a file name for -f as well as for a data file. |
|
1338 |
|
1339 (l) Added the --colo(u)r option. |
|
1340 |
|
1341 (m) Added Jeffrey Friedl's -S testing option, but within #ifdefs so that it |
|
1342 is not present by default. |
|
1343 |
|
1344 3. A nasty bug was discovered in the handling of recursive patterns, that is, |
|
1345 items such as (?R) or (?1), when the recursion could match a number of |
|
1346 alternatives. If it matched one of the alternatives, but subsequently, |
|
1347 outside the recursion, there was a failure, the code tried to back up into |
|
1348 the recursion. However, because of the way PCRE is implemented, this is not |
|
1349 possible, and the result was an incorrect result from the match. |
|
1350 |
|
1351 In order to prevent this happening, the specification of recursion has |
|
1352 been changed so that all such subpatterns are automatically treated as |
|
1353 atomic groups. Thus, for example, (?R) is treated as if it were (?>(?R)). |
|
1354 |
|
1355 4. I had overlooked the fact that, in some locales, there are characters for |
|
1356 which isalpha() is true but neither isupper() nor islower() is true. In |
|
1357 the fr_FR locale, for instance, the \xAA and \xBA characters (ordfeminine |
|
1358 and ordmasculine) are like this. This affected the treatment of \w and \W |
|
1359 when they appeared in character classes, but not when they appeared outside |
|
1360 a character class. The bit map for "word" characters is now created |
|
1361 separately from the results of isalnum() instead of just taking it from the |
|
1362 upper, lower, and digit maps. (Plus the underscore character, of course.) |
|
1363 |
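The approach in item 4 could look roughly like this. It is a sketch, not the table-building code itself: the function names are hypothetical, and the 32-byte bit map layout (one bit per character value) is an assumption.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: build a 256-bit "word" map directly from isalnum()
   in the current locale, plus underscore, instead of combining the
   upper, lower, and digit maps. */
void sketch_build_word_bitmap(unsigned char map[32])
{
  memset(map, 0, 32);
  for (int c = 0; c < 256; c++)
    if (isalnum(c) || c == '_')
      map[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Test one character value (0..255) against a bit map. */
int sketch_in_bitmap(const unsigned char map[32], int c)
{
  return (map[c / 8] >> (c % 8)) & 1;
}
```

Because the map comes from isalnum(), a character that is alphabetic but neither upper nor lower case in the current locale is still included.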
|
1364 5. The above bug also affected the handling of POSIX character classes such as |
|
1365 [[:alpha:]] and [[:alnum:]]. These do not have their own bit maps in PCRE's |
|
1366 permanent tables. Instead, the bit maps for such a class were previously |
|
1367 created as the appropriate unions of the upper, lower, and digit bitmaps. |
|
1368 Now they are created by subtraction from the [[:word:]] class, which has |
|
1369 its own bitmap. |
|
1370 |
|
1371 6. The [[:blank:]] character class matches horizontal, but not vertical space. |
|
1372 It is created by subtracting the vertical space characters (\x09, \x0a, |
|
1373 \x0b, \x0c) from the [[:space:]] bitmap. Previously, however, the |
|
1374 subtraction was done in the overall bitmap for a character class, meaning |
|
1375 that a class such as [\x0c[:blank:]] was incorrect because \x0c would not |
|
1376 be recognized. This bug has been fixed. |
|
1377 |
|
1378 7. Patches from the folks at Google: |
|
1379 |
|
1380 (a) pcrecpp.cc: "to handle a corner case that may or may not happen in |
|
1381 real life, but is still worth protecting against". |
|
1382 |
|
1383 (b) pcrecpp.cc: "corrects a bug when negative radixes are used with |
|
1384 regular expressions". |
|
1385 |
|
1386 (c) pcre_scanner.cc: avoid use of std::count() because not all systems |
|
1387 have it. |
|
1388 |
|
1389 (d) Split off pcrecpparg.h from pcrecpp.h and had the former built by |
|
1390 "configure" and the latter not, in order to fix a problem somebody had |
|
1391 with compiling the Arg class on HP-UX. |
|
1392 |
|
1393 (e) Improve the error-handling of the C++ wrapper a little bit. |
|
1394 |
|
1395 (f) New tests for checking recursion limiting. |
|
1396 |
|
1397 8. The pcre_memmove() function, which is used only if the environment does not |
|
1398 have a standard memmove() function (and is therefore rarely compiled), |
|
1399 contained two bugs: (a) use of int instead of size_t, and (b) it was not |
|
1400 returning a result (though PCRE never actually uses the result). |
|
1401 |
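A fallback of the kind item 8 describes, with both fixes applied (size_t length, result returned), might look like the sketch below. The function name is hypothetical and this is not the library's source. Comparing the two pointers is only well defined when both point into the same object, which is the overlapping case that matters here.

```c
#include <stddef.h>

/* Hypothetical memmove() replacement: copies n bytes, handles overlap,
   and returns the destination pointer as the standard function does. */
void *sketch_memmove(void *d, const void *s, size_t n)
{
  unsigned char *dest = d;
  const unsigned char *src = s;
  if (dest > src) {
    /* Destination is higher: copy backwards so overlap is safe. */
    dest += n;
    src += n;
    while (n-- > 0) *(--dest) = *(--src);
  } else {
    /* Destination is lower (or equal): copy forwards. */
    while (n-- > 0) *dest++ = *src++;
  }
  return d;
}
```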
|
1402 9. In the POSIX regexec() interface, if nmatch is specified as a ridiculously |
|
1403 large number - greater than INT_MAX/(3*sizeof(int)) - REG_ESPACE is |
|
1404 returned instead of calling malloc() with an overflowing number that would |
|
1405 most likely cause subsequent chaos. |
|
1406 |
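The guard in item 9 can be sketched as below. The helper name is hypothetical; in the real wrapper the failure case would be reported to the caller as REG_ESPACE.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate three ints per requested match slot,
   refusing sizes whose byte count would exceed INT_MAX.
   Returns NULL when nmatch is too large (or when malloc() fails). */
int *sketch_alloc_ovector(size_t nmatch)
{
  if (nmatch > INT_MAX / (3 * sizeof(int))) return NULL;
  return malloc(nmatch * 3 * sizeof(int));
}
```

Checking against the quotient before multiplying avoids ever forming the overflowing product.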
|
1407 10. The debugging option of pcretest was not showing the NO_AUTO_CAPTURE flag. |
|
1408 |
|
1409 11. The POSIX flag REG_NOSUB is now supported. When a pattern that was compiled |
|
1410 with this option is matched, the nmatch and pmatch options of regexec() are |
|
1411 ignored. |
|
1412 |
|
1413 12. Added REG_UTF8 to the POSIX interface. This is not defined by POSIX, but is |
|
1414 provided in case anyone wants to use the POSIX interface with UTF-8 |
|
1415 strings. |
|
1416 |
|
1417 13. Added CXXLDFLAGS to the Makefile parameters to provide settings only on the |
|
1418 C++ linking (needed for some HP-UX environments). |
|
1419 |
|
1420 14. Avoid compiler warnings in get_ucpname() when compiled without UCP support |
|
1421 (unused parameter) and in the pcre_printint() function (omitted "default" |
|
1422 switch label when the default is to do nothing). |
|
1423 |
|
1424 15. Added some code to make it possible, when PCRE is compiled as a C++ |
|
1425 library, to replace subject pointers for pcre_exec() with a smart pointer |
|
1426 class, thus making it possible to process discontinuous strings. |
|
1427 |
|
1428 16. The two macros PCRE_EXPORT and PCRE_DATA_SCOPE are confusing, and perform |
|
1429 much the same function. They were added by different people who were trying |
|
1430 to make PCRE easy to compile on non-Unix systems. It has been suggested |
|
1431 that PCRE_EXPORT be abolished now that there is more automatic apparatus |
|
1432 for compiling on Windows systems. I have therefore replaced it with |
|
1433 PCRE_DATA_SCOPE. This is set automatically for Windows; if not set it |
|
1434 defaults to "extern" for C or 'extern "C"' for C++, which works fine on |
|
1435 Unix-like systems. It is now possible to override the value of PCRE_DATA_ |
|
1436 SCOPE with something explicit in config.h. In addition: |
|
1437 |
|
1438 (a) pcreposix.h still had just "extern" instead of either of these macros; |
|
1439 I have replaced it with PCRE_DATA_SCOPE. |
|
1440 |
|
1441 (b) Functions such as _pcre_xclass(), which are internal to the library, |
|
1442 but external in the C sense, all had PCRE_EXPORT in their definitions. |
|
1443 This is apparently wrong for the Windows case, so I have removed it. |
|
1444 (It makes no difference on Unix-like systems.) |
|
1445 |
|
1446 17. Added a new limit, MATCH_LIMIT_RECURSION, which limits the depth of nesting |
|
1447 of recursive calls to match(). This is different to MATCH_LIMIT because |
|
1448 that limits the total number of calls to match(), not all of which increase |
|
1449 the depth of recursion. Limiting the recursion depth limits the amount of |
|
1450 stack (or heap if NO_RECURSE is set) that is used. The default can be set |
|
1451 when PCRE is compiled, and changed at run time. A patch from Google adds |
|
1452 this functionality to the C++ interface. |
|
1453 |
|
1454 18. Changes to the handling of Unicode character properties: |
|
1455 |
|
1456 (a) Updated the table to Unicode 4.1.0. |
|
1457 |
|
1458 (b) Recognize characters that are not in the table as "Cn" (undefined). |
|
1459 |
|
1460 (c) I revised the way the table is implemented to a much improved format |
|
1461 which includes recognition of ranges. It now supports the ranges that |
|
1462 are defined in UnicodeData.txt, and it also amalgamates other |
|
1463 characters into ranges. This has reduced the number of entries in the |
|
1464 table from around 16,000 to around 3,000, thus reducing its size |
|
1465 considerably. I realized I did not need to use a tree structure after |
|
1466 all - a binary chop search is just as efficient. Having reduced the |
|
1467 number of entries, I extended their size from 6 bytes to 8 bytes to |
|
1468 allow for more data. |
|
1469 |
|
1470 (d) Added support for Unicode script names via properties such as \p{Han}. |
|
1471 |
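A binary chop over a table of ranges, as described in 18(c), with unlisted characters reported as "Cn" as in 18(b), might look like this. The entry layout, category codes, and table contents are all illustrative assumptions, not the real table format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category codes; CAT_CN stands for "Cn" (undefined). */
enum { CAT_CN = 0, CAT_LU, CAT_LL, CAT_ND };

typedef struct {
  uint32_t first;     /* first code point of the range */
  uint32_t last;      /* last code point of the range (inclusive) */
  int category;
} sketch_range;

/* Tiny illustrative table; it must be sorted by code point. */
static const sketch_range sketch_table[] = {
  { 0x0030, 0x0039, CAT_ND },   /* 0-9 */
  { 0x0041, 0x005A, CAT_LU },   /* A-Z */
  { 0x0061, 0x007A, CAT_LL },   /* a-z */
};

/* Binary chop: halve the search interval until the range containing c
   is found, or the interval is empty. */
int sketch_find_category(uint32_t c)
{
  size_t lo = 0;
  size_t hi = sizeof(sketch_table) / sizeof(sketch_table[0]);
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (c < sketch_table[mid].first) hi = mid;
    else if (c > sketch_table[mid].last) lo = mid + 1;
    else return sketch_table[mid].category;
  }
  return CAT_CN;   /* not in the table */
}
```

Storing ranges rather than single characters is what lets the table shrink: one entry covers every character between first and last.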
|
1472 19. In UTF-8 mode, a backslash followed by a non-ASCII character was not |
|
1473 matching that character. |
|
1474 |
|
1475 20. When matching a repeated Unicode property with a minimum greater than zero, |
|
1476 (for example \pL{2,}), PCRE could look past the end of the subject if it |
|
1477 reached it while seeking the minimum number of characters. This could |
|
1478 happen only if some of the characters were more than one byte long, because |
|
1479 there is a check for at least the minimum number of bytes. |
|
1480 |
|
1481 21. Refactored the implementation of \p and \P so as to be more general, to |
|
1482 allow for more different types of property in future. This has changed the |
|
1483 compiled form incompatibly. Anybody with saved compiled patterns that use |
|
1484 \p or \P will have to recompile them. |
|
1485 |
|
1486 22. Added "Any" and "L&" to the supported property types. |
|
1487 |
|
1488 23. Recognize \x{...} as a code point specifier, even when not in UTF-8 mode, |
|
1489 but give a compile time error if the value is greater than 0xff. |
|
1490 |
|
1491 24. The man pages for pcrepartial, pcreprecompile, and pcre_compile2 were |
|
1492 accidentally not being installed or uninstalled. |
|
1493 |
|
1494 25. The pcre.h file was built from pcre.h.in, but the only changes that were |
|
1495 made were to insert the current release number. This seemed silly, because |
|
1496 it made things harder for people building PCRE on systems that don't run |
|
1497 "configure". I have turned pcre.h into a distributed file, no longer built |
|
1498 by "configure", with the version identification directly included. There is |
|
1499 no longer a pcre.h.in file. |
|
1500 |
|
1501 However, this change necessitated a change to the pcre-config script as |
|
1502 well. It is built from pcre-config.in, and one of the substitutions was the |
|
1503 release number. I have updated configure.ac so that ./configure now finds |
|
1504 the release number by grepping pcre.h. |
|
1505 |
|
1506 26. Added the ability to run the tests under valgrind. |
|
1507 |
|
1508 |
|
1509 Version 6.4 05-Sep-05 |
|
1510 --------------------- |
|
1511 |
|
1512 1. Change 6.0/10/(l) to pcregrep introduced a bug that caused separator lines |
|
1513 "--" to be printed when multiple files were scanned, even when none of the |
|
1514 -A, -B, or -C options were used. This is not compatible with GNU grep, so I |
|
1515 consider it to be a bug, and have restored the previous behaviour. |
|
1516 |
|
1517 2. A couple of code tidies to get rid of compiler warnings. |
|
1518 |
|
1519 3. The pcretest program used to cheat by referring to symbols in the library |
|
1520 whose names begin with _pcre_. These are internal symbols that are not |
|
1521 really supposed to be visible externally, and in some environments it is |
|
1522 possible to suppress them. The cheating is now confined to including |
|
1523 certain files from the library's source, which is a bit cleaner. |
|
1524 |
|
1525 4. Renamed pcre.in as pcre.h.in to go with pcrecpp.h.in; it also makes the |
|
1526 file's purpose clearer. |
|
1527 |
|
1528 5. Reorganized pcre_ucp_findchar(). |
|
1529 |
|
1530 |
|
1531 Version 6.3 15-Aug-05 |
|
1532 --------------------- |
|
1533 |
|
1534 1. The file libpcre.pc.in did not have general read permission in the tarball. |
|
1535 |
|
1536 2. There were some problems when building without C++ support: |
|
1537 |
|
1538 (a) If C++ support was not built, "make install" and "make test" still |
|
1539 tried to test it. |
|
1540 |
|
1541 (b) There were problems when the value of CXX was explicitly set. Some |
|
1542 changes have been made to try to fix these, and ... |
|
1543 |
|
1544 (c) --disable-cpp can now be used to explicitly disable C++ support. |
|
1545 |
|
1546 (d) The use of @CPP_OBJ@ directly caused a blank line preceded by a |
|
1547 backslash in a target when C++ was disabled. This confuses some |
|
1548 versions of "make", apparently. Using an intermediate variable solves |
|
1549 this. (Same for CPP_LOBJ.) |
|
1550 |
|
1551 3. $(LINK_FOR_BUILD) now includes $(CFLAGS_FOR_BUILD) and $(LINK) |
|
1552 (non-Windows) now includes $(CFLAGS) because these flags are sometimes |
|
1553 necessary on certain architectures. |
|
1554 |
|
1555 4. Added a setting of -export-symbols-regex to the link command to remove |
|
1556 those symbols that are exported in the C sense, but actually are local |
|
1557 within the library, and not documented. Their names all begin with |
|
1558 "_pcre_". This is not a perfect job, because (a) we have to except some |
|
1559 symbols that pcretest ("illegally") uses, and (b) the facility isn't always |
|
1560 available (and never for static libraries). I have made a note to try to |
|
1561 find a way round (a) in the future. |
|
1562 |
|
1563 |
|
1564 Version 6.2 01-Aug-05 |
|
1565 --------------------- |
|
1566 |
|
1567 1. There was no test for integer overflow of quantifier values. A construction |
|
1568 such as {1111111111111111} would give undefined results. What is worse, if |
|
1569 a minimum quantifier for a parenthesized subpattern overflowed and became |
|
1570 negative, the calculation of the memory size went wrong. This could have |
|
1571 led to memory overwriting. |
|
1572 |
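An overflow test for quantifier values of the kind item 1 introduces could be sketched as follows. The function name is hypothetical; the 65535 ceiling is the repeat-count limit mentioned elsewhere in this file.

```c
#include <ctype.h>

/* Hypothetical helper: read a decimal repeat count starting at p.
   Returns the value and sets *end to the first non-digit, or returns -1
   as soon as the value exceeds 65535, so the accumulator never overflows. */
int sketch_read_repeat_count(const char *p, const char **end)
{
  int n = 0;
  while (isdigit((unsigned char)*p)) {
    n = n * 10 + (*p++ - '0');
    if (n > 65535) return -1;   /* too big: reject before it can wrap */
  }
  *end = p;
  return n;
}
```

Because the check runs after every digit, the accumulator never exceeds 655,359, so a construction such as {1111111111111111} is rejected cleanly.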
|
1573 2. Building PCRE using VPATH was broken. Hopefully it is now fixed. |
|
1574 |
|
1575 3. Added "b" to the 2nd argument of fopen() in dftables.c, for non-Unix-like |
|
1576 operating environments where this matters. |
|
1577 |
|
1578 4. Applied Giuseppe Maxia's patch to add additional features for controlling |
|
1579 PCRE options from within the C++ wrapper. |
|
1580 |
|
1581 5. Named capturing subpatterns were not being correctly counted when a pattern |
|
1582 was compiled. This caused two problems: (a) If there were more than 100 |
|
1583 such subpatterns, the calculation of the memory needed for the whole |
|
1584 compiled pattern went wrong, leading to an overflow error. (b) Numerical |
|
1585 back references of the form \12, where the number was greater than 9, were |
|
1586 not recognized as back references, even though there were sufficient |
|
1587 previous subpatterns. |
|
1588 |
|
1589 6. Two minor patches to pcrecpp.cc in order to allow it to compile on older |
|
1590 versions of gcc, e.g. 2.95.4. |
|
1591 |
|
1592 |
|
1593 Version 6.1 21-Jun-05 |
|
1594 --------------------- |
|
1595 |
|
1596 1. There was one reference to the variable "posix" in pcretest.c that was not |
|
1597 surrounded by "#if !defined NOPOSIX". |
|
1598 |
|
1599 2. Make it possible to compile pcretest without DFA support, UTF8 support, or |
|
1600 the cross-check on the old pcre_info() function, for the benefit of the |
|
1601 cut-down version of PCRE that is currently imported into Exim. |
|
1602 |
|
1603 3. A (silly) pattern starting with (?i)(?-i) caused an internal space |
|
1604 allocation error. I've done the easy fix, which wastes 2 bytes for sensible |
|
1605 patterns that start (?i) but I don't think that matters. The use of (?i) is |
|
1606 just an example; this all applies to the other options as well. |
|
1607 |
|
1608 4. Since libtool seems to echo the compile commands it is issuing, the output |
|
1609 from "make" can be reduced a bit by putting "@" in front of each libtool |
|
1610 compile command. |
|
1611 |
|
1612 5. Patch from the folks at Google for configure.in to be a bit more thorough |
|
1613 in checking for a suitable C++ installation before trying to compile the |
|
1614 C++ stuff. This should fix a reported problem when a compiler was present, |
|
1615 but no suitable headers. |
|
1616 |
|
1617 6. The man pages all had just "PCRE" as their title. I have changed them to |
|
1618 be the relevant file name. I have also arranged that these names are |
|
1619 retained in the file doc/pcre.txt, which is a concatenation in text format |
|
1620 of all the man pages except the little individual ones for each function. |
|
1621 |
|
1622 7. The NON-UNIX-USE file had not been updated for the different set of source |
|
1623 files that come with release 6. I also added a few comments about the C++ |
|
1624 wrapper. |
|
1625 |
|
1626 |
|
1627 Version 6.0 07-Jun-05 |
|
1628 --------------------- |
|
1629 |
|
1630 1. Some minor internal re-organization to help with my DFA experiments. |
|
1631 |
|
1632 2. Some missing #ifdef SUPPORT_UCP conditionals in pcretest and printint that |
|
1633 didn't matter for the library itself when fully configured, but did matter |
|
1634 when compiling without UCP support, or within Exim, where the ucp files are |
|
1635 not imported. |
|
1636 |
|
1637 3. Refactoring of the library code to split up the various functions into |
|
1638 different source modules. The addition of the new DFA matching code (see |
|
1639 below) to a single monolithic source would have made it really too |
|
1640 unwieldy, quite apart from causing all the code to be included in a |
|
1641 statically linked application, when only some functions are used. This is |
|
1642 relevant even without the DFA addition now that patterns can be compiled in |
|
1643 one application and matched in another. |
|
1644 |
|
1645 The downside of splitting up is that there have to be some external |
|
1646 functions and data tables that are used internally in different modules of |
|
1647 the library but which are not part of the API. These have all had their |
|
1648 names changed to start with "_pcre_" so that they are unlikely to clash |
|
1649 with other external names. |
|
1650 |
|
1651 4. Added an alternate matching function, pcre_dfa_exec(), which matches using |
|
1652 a different (DFA) algorithm. Although it is slower than the original |
|
1653 function, it does have some advantages for certain types of matching |
|
1654 problem. |
|
1655 |
|
1656 5. Upgrades to pcretest in order to test the features of pcre_dfa_exec(), |
|
1657 including restarting after a partial match. |
|
1658 |
|
1659 6. A patch for pcregrep that defines INVALID_FILE_ATTRIBUTES if it is not |
|
1660 defined when compiling for Windows was sent to me. I have put it into the |
|
1661 code, though I have no means of testing or verifying it. |
|
1662 |
|
1663 7. Added the pcre_refcount() auxiliary function. |
|
1664 |
|
1665 8. Added the PCRE_FIRSTLINE option. This constrains an unanchored pattern to |
|
1666 match before or at the first newline in the subject string. In pcretest, |
|
1667 the /f option on a pattern can be used to set this. |
|
1668 |
|
1669 9. A repeated \w when used in UTF-8 mode with characters greater than 256 |
|
1670 would behave wrongly. This has been present in PCRE since release 4.0. |
|
1671 |
|
1672 10. A number of changes to the pcregrep command: |
|
1673 |
|
1674 (a) Refactored how -x works; insert ^(...)$ instead of setting |
|
1675 PCRE_ANCHORED and checking the length, in preparation for adding |
|
1676 something similar for -w. |
|
1677 |
|
1678 (b) Added the -w (match as a word) option. |
|
1679 |
|
1680 (c) Refactored the way lines are read and buffered so as to have more |
|
1681 than one at a time available. |
|
1682 |
|
1683 (d) Implemented a pcregrep test script. |
|
1684 |
|
1685 (e) Added the -M (multiline match) option. This allows patterns to match |
|
1686 over several lines of the subject. The buffering ensures that at least |
|
1687 8K, or the rest of the document (whichever is the shorter) is available |
|
1688 for matching (and similarly the previous 8K for lookbehind assertions). |
|
1689 |
|
1690 (f) Changed the --help output so that it now says |
|
1691 |
|
1692 -w, --word-regex(p) |
|
1693 |
|
1694 instead of two lines, one with "regex" and the other with "regexp" |
|
1695 because that confused at least one person since the short forms are the |
|
1696 same. (This required a bit of code, as the output is generated |
|
1697 automatically from a table. It wasn't just a text change.) |
|
1698 |
|
1699 (g) -- can be used to terminate pcregrep options if the next thing isn't an |
|
1700 option but starts with a hyphen. Could be a pattern or a path name |
|
1701 starting with a hyphen, for instance. |
|
1702 |
|
1703 (h) "-" can be given as a file name to represent stdin. |
|
1704 |
|
1705 (i) When file names are being printed, "(standard input)" is used for |
|
1706 the standard input, for compatibility with GNU grep. Previously |
|
1707 "<stdin>" was used. |
|
1708 |
|
1709 (j) The option --label=xxx can be used to supply a name to be used for |
|
1710 stdin when file names are being printed. There is no short form. |
|
1711 |
|
1712 (k) Re-factored the options decoding logic because we are going to add |
|
1713 two more options that take data. Such options can now be given in four |
|
1714 different ways, e.g. "-fname", "-f name", "--file=name", "--file name". |
|
1715 |
|
1716 (l) Added the -A, -B, and -C options for requesting that lines of context |
|
1717 around matches be printed. |
|
1718 |
|
1719 (m) Added the -L option to print the names of files that do not contain |
|
1720 any matching lines, that is, the complement of -l. |
|
1721 |
|
1722 (n) The return code is 2 if any file cannot be opened, but pcregrep does |
|
1723 continue to scan other files. |
|
1724 |
|
1725 (o) The -s option was incorrectly implemented. For compatibility with other |
|
1726 greps, it now suppresses the error message for a non-existent or non- |
|
1727 accessible file (but not the return code). There is a new option called |
|
1728 -q that suppresses the output of matching lines, which was what -s was |
|
1729 previously doing. |
|
1730 |
|
1731 (p) Added --include and --exclude options to specify files for inclusion |
|
1732 and exclusion when recursing. |
|
1733 |
|
1734 11. The Makefile was not using the Autoconf-supported LDFLAGS macro properly. |
|
1735 Hopefully, it now does. |
|
1736 |
|
1737 12. Missing cast in pcre_study(). |
|
1738 |
|
1739 13. Added an "uninstall" target to the makefile. |
|
1740 |
|
1741 14. Replaced "extern" in the function prototypes in Makefile.in with |
|
1742 "PCRE_DATA_SCOPE", which defaults to 'extern' or 'extern "C"' in the Unix |
|
1743 world, but is set differently for Windows. |
|
1744 |
|
1745 15. Added a second compiling function called pcre_compile2(). The only |
|
1746 difference is that it has an extra argument, which is a pointer to an |
|
1747 integer error code. When there is a compile-time failure, this is set |
|
1748 non-zero, in addition to the error text pointer being set to point to an |
|
1749 error message. The new argument may be NULL if no error number is required |
|
1750 (but then you may as well call pcre_compile(), which is now just a |
|
1751 wrapper). This facility is provided because some applications need a |
|
1752 numeric error indication, but it has also enabled me to tidy up the way |
|
1753 compile-time errors are handled in the POSIX wrapper. |
|
1754 |
|
1755 16. Added VPATH=.libs to the makefile; this should help when building with one |
|
1756 prefix path and installing with another. (Or so I'm told by someone who |
|
1757 knows more about this stuff than I do.) |
|
1758 |
|
1759 17. Added a new option, REG_DOTALL, to the POSIX function regcomp(). This |
|
1760 passes PCRE_DOTALL to the pcre_compile() function, making the "." character |
|
1761 match everything, including newlines. This is not POSIX-compatible, but |
|
1762 somebody wanted the feature. From pcretest it can be activated by using |
|
1763 both the P and the s flags. |
|
1764 |
|
1765 18. AC_PROG_LIBTOOL appeared twice in Makefile.in. Removed one. |
|
1766 |
|
1767 19. libpcre.pc was being incorrectly installed as executable. |
|
1768 |
|
1769 20. A couple of places in pcretest check for end-of-line by looking for '\n'; |
|
1770 it now also looks for '\r' so that it will work unmodified on Windows. |
|
1771 |
|
1772 21. Added Google's contributed C++ wrapper to the distribution. |
|
1773 |
|
1774 22. Added some untidy missing memory free() calls in pcretest, to keep |
|
1775 Electric Fence happy when testing. |
|
1776 |
|
1777 |
|
1778 |
|
1779 Version 5.0 13-Sep-04 |
|
1780 --------------------- |
|
1781 |
|
1782 1. Internal change: literal characters are no longer packed up into items |
|
1783 containing multiple characters in a single byte-string. Each character |
|
1784 is now matched using a separate opcode. However, there may be more than one |
|
1785 byte in the character in UTF-8 mode. |
|
1786 |
|
1787 2. The pcre_callout_block structure has two new fields: pattern_position and |
|
1788 next_item_length. These contain the offset in the pattern to the next match |
|
1789 item, and its length, respectively. |
|
1790 |
|
1791 3. The PCRE_AUTO_CALLOUT option for pcre_compile() requests the automatic |
|
1792 insertion of callouts before each pattern item. Added the /C option to |
|
1793 pcretest to make use of this. |
|
1794 |
|
1795 4. On the advice of a Windows user, the lines |
|
1796 |
|
1797 #if defined(_WIN32) || defined(WIN32) |
|
1798 _setmode( _fileno( stdout ), 0x8000 ); |
|
1799 #endif /* defined(_WIN32) || defined(WIN32) */ |
|
1800 |
|
1801 have been added to the source of pcretest. This apparently does useful |
|
1802 magic in relation to line terminators. |
|
1803 |
|
1804 5. Changed "r" and "w" in the calls to fopen() in pcretest to "rb" and "wb" |
|
1805 for the benefit of those environments where the "b" makes a difference. |
|
1806 |
|
1807 6. The icc compiler has the same options as gcc, but "configure" doesn't seem |
|
1808 to know about it. I have put a hack into configure.in that adds in code |
|
1809 to set GCC=yes if CC=icc. This seems to end up at a point in the |
|
1810 generated configure script that is early enough to affect the setting of |
|
1811 compiler options, which is what is needed, but I have no means of testing |
|
1812 whether it really works. (The user who reported this had patched the |
|
1813 generated configure script, which of course I cannot do.) |
|
1814 |
|
1815 LATER: After change 23 below (new libtool files), the configure script |
|
1816 seems to know about icc (and also ecc). Therefore, I have commented out |
|
1817 this hack in configure.in. |
|
1818 |
|
1819 7. Added support for pkg-config (2 patches were sent in). |
|
1820 |
|
1821 8. Negated POSIX character classes that used a combination of internal tables |
|
1822 were completely broken. These were [[:^alpha:]], [[:^alnum:]], and |
|
1823 [[:^ascii:]]. Typically, they would match almost any characters. The other |
|
1824 POSIX classes were not broken in this way. |
|
1825 |
|
1826 9. Matching the pattern "\b.*?" against "ab cd", starting at offset 1, failed |
|
1827 to find the match, as PCRE was deluded into thinking that the match had to |
|
1828 start at the start point or following a newline. The same bug applied to |
|
1829 patterns with negative forward assertions or any backward assertions |
|
1830 preceding ".*" at the start, unless the pattern required a fixed first |
|
1831 character. This was a failing pattern: "(?!.bcd).*". The bug is now fixed. |
|
1832 |
|
1833 10. In UTF-8 mode, when moving forwards in the subject after a failed match |
|
1834 starting at the last subject character, bytes beyond the end of the subject |
|
1835 string were read. |
|
1836 |
|
1837 11. Renamed the variable "class" as "classbits" to make life easier for C++ |
|
1838 users. (Previously there was a macro definition, but it apparently wasn't |
|
1839 enough.) |
|
1840 |
|
1841 12. Added the new field "tables" to the extra data so that tables can be passed |
|
1842 in at exec time, or the internal tables can be re-selected. This allows |
|
1843 a compiled regex to be saved and re-used at a later time by a different |
|
1844 program that might have everything at different addresses. |
|
1845 |
|
1846 13. Modified the pcre-config script so that, when run on Solaris, it shows a |
|
1847 -R library as well as a -L library. |
|
1848 |
|
1849 14. The debugging options of pcretest (-d on the command line or D on a |
|
1850 pattern) showed incorrect output for anything following an extended class |
|
1851 that contained multibyte characters and which was followed by a quantifier. |
|
1852 |
|
1853 15. Added optional support for general category Unicode character properties |
|
1854 via the \p, \P, and \X escapes. Unicode property support implies UTF-8 |
|
1855 support. It adds about 90K to the size of the library. The meanings of the |
|
1856 inbuilt class escapes such as \d and \s have NOT been changed. |
|
1857 |
|
1858 16. Updated pcredemo.c to include calls to free() to release the memory for the |
|
1859 compiled pattern. |
|
1860 |
|
1861 17. The generated file chartables.c was being created in the source directory |
|
1862 instead of in the building directory. This caused the build to fail if the |
|
1863 source directory was different from the building directory, and was |
|
1864 read-only. |
|
1865 |
|
1866 18. Added some sample Win commands from Mark Tetrode into the NON-UNIX-USE |
|
1867 file. No doubt somebody will tell me if they don't make sense... Also added |
|
1868 Dan Mooney's comments about building on OpenVMS. |
|
1869 |
|
1870 19. Added support for partial matching via the PCRE_PARTIAL option for |
|
1871 pcre_exec() and the \P data escape in pcretest. |
|
1872 |
|
1873 20. Extended pcretest with 3 new pattern features: |
|
1874 |
|
1875 (i) A pattern option of the form ">rest-of-line" causes pcretest to |
|
1876 write the compiled pattern to the file whose name is "rest-of-line". |
|
1877 This is a straight binary dump of the data, with the saved pointer to |
|
1878 the character tables forced to be NULL. The study data, if any, is |
|
1879 written too. After writing, pcretest reads a new pattern. |
|
1880 |
|
1881 (ii) If, instead of a pattern, "<rest-of-line" is given, pcretest reads a |
|
1882 compiled pattern from the given file. There must not be any |
|
1883 occurrences of "<" in the file name (pretty unlikely); if there are, |
|
1884 pcretest will instead treat the initial "<" as a pattern delimiter. |
|
1885 After reading in the pattern, pcretest goes on to read data lines as |
|
1886 usual. |
|
1887 |
|
1888 (iii) The F pattern option causes pcretest to flip the bytes in the 32-bit |
|
1889 and 16-bit fields in a compiled pattern, to simulate a pattern that |
|
1890 was compiled on a host of opposite endianness. |
|
1891 |
|
1892 21. The pcre_exec() function can now cope with patterns that were compiled on |
|
1893 hosts of opposite endianness, with this restriction: |
|
1894 |
|
1895 As for any compiled expression that is saved and used later, the tables |
|
1896 pointer field cannot be preserved; the extra_data field in the arguments |
|
1897 to pcre_exec() should be used to pass in a tables address if a value |
|
1898 other than the default internal tables was used at compile time. |
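The byte flipping mentioned in items 20(iii) and 21 above amounts to reversing the byte order of each 32-bit and 16-bit field. A minimal sketch follows; the function names are hypothetical and this is not PCRE's internal code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
uint32_t flip32(uint32_t v)
{
  return ((v & 0x000000ffu) << 24) |
         ((v & 0x0000ff00u) <<  8) |
         ((v & 0x00ff0000u) >>  8) |
         ((v & 0xff000000u) >> 24);
}

/* Reverse the byte order of a 16-bit value. */
uint16_t flip16(uint16_t v)
{
  return (uint16_t)(((v & 0x00ffu) << 8) | ((v & 0xff00u) >> 8));
}

/* Read a 32-bit field stored at p, flipping it when the data is known to
   come from a host of the opposite endianness. */
uint32_t read_field32(const void *p, int flipped)
{
  uint32_t v;
  assert(p != NULL);
  memcpy(&v, p, sizeof v);      /* avoids alignment assumptions */
  return flipped ? flip32(v) : v;
}
```

Flipping twice gives back the original value, so the same routine serves for both directions.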
|
1899 |
|
1900 22. Calling pcre_exec() with a negative value of the "ovecsize" parameter is |
|
1901 now diagnosed as an error. Previously, most of the time, a negative number |
|
1902 would have been treated as zero, but if in addition "ovector" was passed as |
|
1903 NULL, a crash could occur. |
|
1904 |
|
1905 23. Updated the files ltmain.sh, config.sub, config.guess, and aclocal.m4 with |
|
1906 new versions from the libtool 1.5 distribution (the last one is a copy of |
|
1907 a file called libtool.m4). This seems to have fixed the need to patch |
|
1908 "configure" to support Darwin 1.3 (which I used to do). However, I still |
|
1909 had to patch ltmain.sh to ensure that ${SED} is set (it isn't on my |
|
1910 workstation). |
|
1911 |
|
1912 24. Changed the PCRE licence to be the more standard "BSD" licence. |
|
1913 |
|
1914 |
|
1915 Version 4.5 01-Dec-03 |
|
1916 --------------------- |
|
1917 |
|
1918 1. There has been some re-arrangement of the code for the match() function so |
|
1919 that it can be compiled in a version that does not call itself recursively. |
|
1920 Instead, it keeps those local variables that need separate instances for |
|
1921 each "recursion" in a frame on the heap, and gets/frees frames whenever it |
|
1922 needs to "recurse". Keeping track of where control must go is done by means |
|
1923 of setjmp/longjmp. The whole thing is implemented by a set of macros that |
|
1924 hide most of the details from the main code, and operates only if |
|
1925 NO_RECURSE is defined while compiling pcre.c. If PCRE is built using the |
|
1926 "configure" mechanism, "--disable-stack-for-recursion" turns on this way of |
|
1927 operating. |
|
1928 |
|
1929 To make it easier for callers to provide specially tailored get/free |
|
1930 functions for this usage, two new functions, pcre_stack_malloc, and |
|
1931 pcre_stack_free, are used. They are always called in strict stacking order, |
|
1932 and the size of block requested is always the same. |
|
1933 |
|
1934 The PCRE_CONFIG_STACKRECURSE info parameter can be used to find out whether |
|
1935 PCRE has been compiled to use the stack or the heap for recursion. The |
|
1936 -C option of pcretest uses this to show which version is compiled. |
|
1937 |
|
1938 A new data escape, \S, is added to pcretest; it causes the amounts of store |
|
1939 obtained and freed by both kinds of malloc/free at match time to be added |
|
1940 to the output. |
|
1941 |
|
1942 2. Changed the locale test to use "fr_FR" instead of "fr" because that's |
|
1943 what's available on my current Linux desktop machine. |
|
1944 |
|
1945 3. When matching a UTF-8 string, the test for a valid string at the start has |
|
1946 been extended. If start_offset is not zero, PCRE now checks that it points |
|
1947 to a byte that is the start of a UTF-8 character. If not, it returns |
|
1948 PCRE_ERROR_BADUTF8_OFFSET (-11). Note: the whole string is still checked; |
|
1949 this is necessary because there may be backward assertions in the pattern. |
|
1950 When matching the same subject several times, it may save resources to use |
|
1951 PCRE_NO_UTF8_CHECK on all but the first call if the string is long. |
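The start_offset test in item 3 above comes down to checking that the byte at the offset is not a UTF-8 continuation byte (bit pattern 10xxxxxx). The sketch below illustrates that test only; the function name is hypothetical, and accepting an offset equal to the subject length is an assumption of this sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if start_offset points at a byte that can begin a UTF-8
   character, or at the end of the subject; returns 0 otherwise. */
int utf8_offset_ok(const unsigned char *subject, size_t length,
                   size_t start_offset)
{
  assert(subject != NULL);
  if (start_offset > length) return 0;
  if (start_offset == length) return 1;
  return (subject[start_offset] & 0xc0) != 0x80;
}
```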
|
1952 |
|
1953 4. The code for checking the validity of UTF-8 strings has been tightened so |
|
1954 that it rejects (a) strings containing 0xfe or 0xff bytes and (b) strings |
|
1955 containing "overlong sequences". |
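The two extra checks in item 4 above can be pictured as follows. This is a sketch under stated assumptions, not the library's checking code: it assumes the older 1- to 6-byte UTF-8 scheme, omits surrogate and range checks, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the buffer passes the checks described above, else 0. */
int utf8_valid(const unsigned char *s, size_t length)
{
  /* Smallest value that genuinely needs 1..5 continuation bytes. */
  static const unsigned long min_value[] =
    { 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
  size_t i = 0;

  assert(s != NULL || length == 0);
  while (i < length)
    {
    unsigned char c = s[i++];
    unsigned long value;
    int extra, k;

    if (c < 0x80) continue;               /* single-byte character */
    if (c < 0xc0) return 0;               /* stray continuation byte */
    if (c >= 0xfe) return 0;              /* (a) 0xfe and 0xff are invalid */

    if (c < 0xe0)      { extra = 1; value = c & 0x1f; }
    else if (c < 0xf0) { extra = 2; value = c & 0x0f; }
    else if (c < 0xf8) { extra = 3; value = c & 0x07; }
    else if (c < 0xfc) { extra = 4; value = c & 0x03; }
    else               { extra = 5; value = c & 0x01; }

    if (length - i < (size_t)extra) return 0;    /* truncated sequence */
    for (k = 0; k < extra; k++)
      {
      unsigned char d = s[i++];
      if ((d & 0xc0) != 0x80) return 0;          /* not a continuation byte */
      value = (value << 6) | (d & 0x3f);
      }

    if (value < min_value[extra]) return 0;      /* (b) overlong sequence */
    }
  return 1;
}
```

An overlong sequence is one that encodes a value in more bytes than necessary, such as 0xc0 0x80 for the value zero.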
|
1956 |
|
1957 5. Fixed a bug (appearing twice) that I could not find any way of exploiting! |
|
1958 I had written "if ((digitab[*p++] && chtab_digit) == 0)" where the "&&" |
|
1959 should have been "&", but it just so happened that all the cases this let |
|
1960 through by mistake were picked up later in the function. |
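To see why the "&&" in item 5 above was wrong, compare the two operators on a table entry. The flag values and function names below are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical flag bits in a per-character table entry. */
const unsigned char chtab_digit = 0x04;
const unsigned char chtab_other = 0x08;

/* As originally written: "&&" is a logical AND, so the result is "not a
   digit" only when the whole table entry is zero. */
int not_digit_buggy(unsigned char entry)
{
  return (entry && chtab_digit) == 0;
}

/* As intended: "&" isolates the digit bit. */
int not_digit_fixed(unsigned char entry)
{
  return (entry & chtab_digit) == 0;
}

/* The two versions disagree only for non-zero entries that lack the digit
   bit; those are the cases the buggy test let through. */
void show_difference(void)
{
  assert(not_digit_buggy(chtab_digit) == not_digit_fixed(chtab_digit));
  assert(not_digit_buggy(0) == not_digit_fixed(0));
  assert(not_digit_buggy(chtab_other) != not_digit_fixed(chtab_other));
}
```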
|
1961 |
|
1962 6. I had used a variable called "isblank" - this is a C99 function, causing |
|
1963 some compilers to warn. To avoid this, I renamed it (as "blankclass"). |
|
1964 |
|
1965 7. Cosmetic: (a) only output another newline at the end of pcretest if it is |
|
1966 prompting; (b) run "./pcretest /dev/null" at the start of the test script |
|
1967 so the version is shown; (c) stop "make test" echoing "./RunTest". |
|
1968 |
|
1969 8. Added patches from David Burgess to enable PCRE to run on EBCDIC systems. |
|
1970 |
|
1971 9. The prototype for memmove() for systems that don't have it was using |
|
1972 size_t, but the inclusion of the header that defines size_t was later. I've |
|
1973 moved the #includes for the C headers earlier to avoid this. |
|
1974 |
|
1975 10. Added some adjustments to the code to make it easier to compile on certain |
|
1976 special systems: |
|
1977 |
|
1978 (a) Some "const" qualifiers were missing. |
|
1979 (b) Added the macro EXPORT before all exported functions; by default this |
|
1980 is defined to be empty. |
|
1981 (c) Changed the dftables auxiliary program (that builds chartables.c) so |
|
1982 that it reads its output file name as an argument instead of writing |
|
1983 to the standard output and assuming this can be redirected. |
|
1984 |
|
1985 11. In UTF-8 mode, if a recursive reference (e.g. (?1)) followed a character |
|
1986 class containing characters with values greater than 255, PCRE compilation |
|
1987 went into a loop. |
|
1988 |
|
1989 12. A recursive reference to a subpattern that was within another subpattern |
|
1990 that had a minimum quantifier of zero caused PCRE to crash. For example, |
|
1991 (x(y(?2))z)? provoked this bug with a subject that got as far as the |
|
1992 recursion. If the recursively-called subpattern itself had a zero repeat, |
|
1993 that was OK. |
|
1994 |
|
1995 13. In pcretest, the buffer for reading a data line was set at 30K, but the |
|
1996 buffer into which it was copied (for escape processing) was still set at |
|
1997 1024, so long lines caused crashes. |
|
1998 |
|
1999 14. A pattern such as /[ab]{1,3}+/ failed to compile, giving the error |
|
2000 "internal error: code overflow...". This applied to any character class |
|
2001 that was followed by a possessive quantifier. |
|
2002 |
|
2003 15. Modified the Makefile to add libpcre.la as a prerequisite for |
|
2004 libpcreposix.la because I was told this is needed for a parallel build to |
|
2005 work. |
|
2006 |
|
2007 16. If a pattern that contained .* following optional items at the start was |
|
2008 studied, the wrong optimizing data was generated, leading to matching |
|
2009 errors. For example, studying /[ab]*.*c/ concluded, erroneously, that any |
|
2010 matching string must start with a or b or c. The correct conclusion for |
|
2011 this pattern is that a match can start with any character. |
|
2012 |
|
2013 |
|
2014 Version 4.4 13-Aug-03 |
|
2015 --------------------- |
|
2016 |
|
2017 1. In UTF-8 mode, a character class containing characters with values between |
|
2018 127 and 255 was not handled correctly if the compiled pattern was studied. |
|
2019 In fixing this, I have also improved the studying algorithm for such |
|
2020 classes (slightly). |
|
2021 |
|
2022 2. Three internal functions had redundant arguments passed to them. Removal |
|
2023 might give a very teeny performance improvement. |
|
2024 |
|
2025 3. Documentation bug: the value of the capture_top field in a callout is *one |
|
2026 more than* the number of the highest numbered captured substring. |
|
2027 |
|
2028 4. The Makefile linked pcretest and pcregrep with -lpcre, which could result |
|
2029 in incorrectly linking with a previously installed version. They now link |
|
2030 explicitly with libpcre.la. |
|
2031 |
|
2032 5. configure.in no longer needs to recognize Cygwin specially. |
|
2033 |
|
2034 6. A problem in pcre.in for Windows platforms is fixed. |
|
2035 |
|
2036 7. If a pattern was successfully studied, and the -d (or /D) flag was given to |
|
2037 pcretest, it used to include the size of the study block as part of its |
|
2038 output. Unfortunately, the structure contains a field that has a different |
|
2039 size on different hardware architectures. This meant that the tests that |
|
2040 showed this size failed. As the block is currently always of a fixed size, |
|
2041 this information isn't actually particularly useful in pcretest output, so |
|
2042 I have just removed it. |
|
2043 |
|
2044 8. Three pre-processor statements accidentally did not start in column 1. |
|
2045 Sadly, there are *still* compilers around that complain, even though |
|
2046 standard C has not required this for well over a decade. Sigh. |
|
2047 |
|
2048 9. In pcretest, the code for checking callouts passed small integers in the |
|
2049 callout_data field, which is a void * field. However, some picky compilers |
|
2050 complained about the casts involved for this on 64-bit systems. Now |
|
2051 pcretest passes the address of the small integer instead, which should get |
|
2052 rid of the warnings. |
|
2053 |
|
2054 10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at |
|
2055 both compile and run time, and gives an error if an invalid UTF-8 sequence |
|
2056 is found. There is an option for disabling this check in cases where the |
|
2057 string is known to be correct and/or the maximum performance is wanted. |
|
2058 |
|
2059 11. In response to a bug report, I changed one line in Makefile.in from |
|
2060 |
|
2061 -Wl,--out-implib,.libs/lib@WIN_PREFIX@pcreposix.dll.a \ |
|
2062 to |
|
2063 -Wl,--out-implib,.libs/@WIN_PREFIX@libpcreposix.dll.a \ |
|
2064 |
|
2065 to look similar to other lines, but I have no way of telling whether this |
|
2066 is the right thing to do, as I do not use Windows. No doubt I'll get told |
|
2067 if it's wrong... |
|
2068 |
|
2069 |
|
2070 Version 4.3 21-May-03 |
|
2071 --------------------- |
|
2072 |
|
2073 1. Two instances of @WIN_PREFIX@ omitted from the Windows targets in the |
|
2074 Makefile. |
|
2075 |
|
2076 2. Some refactoring to improve the quality of the code: |
|
2077 |
|
2078 (i) The utf8_table... variables are now declared "const". |
|
2079 |
|
2080 (ii) The code for \cx, which used the "case flipping" table to upper case |
|
2081 lower case letters, now just subtracts 32. This is ASCII-specific, |
|
2082 but the whole concept of \cx is ASCII-specific, so it seems |
|
2083 reasonable. |
|
2084 |
|
2085 (iii) PCRE was using its character types table to recognize decimal and |
|
2086 hexadecimal digits in the pattern. This is silly, because it handles |
|
2087 only 0-9, a-f, and A-F, but the character types table is locale- |
|
2088 specific, which means strange things might happen. A private |
|
2089 table is now used for this - though it costs 256 bytes, a table is |
|
2090 much faster than multiple explicit tests. Of course, the standard |
|
2091 character types table is still used for matching digits in subject |
|
2092 strings against \d. |
|
2093 |
|
2094 (iv) Strictly, the identifier ESC_t is reserved by POSIX (all identifiers |
|
2095 ending in _t are). So I've renamed it as ESC_tee. |
|
2096 |
|
2097 3. The first argument for regexec() in the POSIX wrapper should have been |
|
2098 defined as "const". |
|
2099 |
|
2100 4. Changed pcretest to use malloc() for its buffers so that they can be |
|
2101 Electric Fenced for debugging. |
|
2102 |
|
2103 5. There were several places in the code where, in UTF-8 mode, PCRE would try |
|
2104 to read one or more bytes before the start of the subject string. Often this |
|
2105 had no effect on PCRE's behaviour, but in some circumstances it could |
|
2106 provoke a segmentation fault. |
|
2107 |
|
2108 6. A lookbehind at the start of a pattern in UTF-8 mode could also cause PCRE |
|
2109 to try to read one or more bytes before the start of the subject string. |
|
2110 |
|
2111 7. A lookbehind in a pattern matched in non-UTF-8 mode on a PCRE compiled with |
|
2112 UTF-8 support could misbehave in various ways if the subject string |
|
2113 contained bytes with the 0x80 bit set and the 0x40 bit unset in a lookbehind |
|
2114 area. (PCRE was not checking for the UTF-8 mode flag, and trying to move |
|
2115 back over UTF-8 characters.) |
|
2116 |
|
2117 |
|
2118 Version 4.2 14-Apr-03 |
|
2119 --------------------- |
|
2120 |
|
2121 1. Typo "#if SUPPORT_UTF8" instead of "#ifdef SUPPORT_UTF8" fixed. |
|
2122 |
|
2123 2. Changes to the building process, supplied by Ronald Landheer-Cieslak |
|
2124 [ON_WINDOWS]: new variable, "#" on non-Windows platforms |
|
2125 [NOT_ON_WINDOWS]: new variable, "#" on Windows platforms |
|
2126 [WIN_PREFIX]: new variable, "cyg" for Cygwin |
|
2127 * Makefile.in: use autoconf substitution for OBJEXT, EXEEXT, BUILD_OBJEXT |
|
2128 and BUILD_EXEEXT |
|
2129 Note: automatic setting of the BUILD variables is not yet working |
|
2130 set CPPFLAGS and BUILD_CPPFLAGS (but don't use yet) - should be used at |
|
2131 compile-time but not at link-time |
|
2132 [LINK]: use for linking executables only |
|
2133 make different versions for Windows and non-Windows |
|
2134 [LINKLIB]: new variable, copy of UNIX-style LINK, used for linking |
|
2135 libraries |
|
2136 [LINK_FOR_BUILD]: new variable |
|
2137 [OBJEXT]: use throughout |
|
2138 [EXEEXT]: use throughout |
|
2139 <winshared>: new target |
|
2140 <wininstall>: new target |
|
2141 <dftables.o>: use native compiler |
|
2142 <dftables>: use native linker |
|
2143 <install>: handle Windows platform correctly |
|
2144 <clean>: ditto |
|
2145 <check>: ditto |
|
2146 copy DLL to top builddir before testing |
|
2147 |
|
2148 As part of these changes, -no-undefined was removed again. This was reported |
|
2149 to give trouble on HP-UX 11.0, so getting rid of it seems like a good idea |
|
2150 in any case. |
|
2151 |
|
2152 3. Some tidies to get rid of compiler warnings: |
|
2153 |
|
2154 . In the match_data structure, match_limit was an unsigned long int, whereas |
|
2155 match_call_count was an int. I've made them both unsigned long ints. |
|
2156 |
|
2157 . In pcretest the fact that a const uschar * doesn't automatically cast to |
|
2158 a void * provoked a warning. |
|
2159 |
|
2160 . Turning on some more compiler warnings threw up some "shadow" variables |
|
2161 and a few more missing casts. |
|
2162 |
|
2163 4. If PCRE was compiled with UTF-8 support, but called without the PCRE_UTF8 |
|
2164 option, a class that contained a single character with a value between 128 |
|
2165 and 255 (e.g. /[\xFF]/) caused PCRE to crash. |
|
2166 |
|
2167 5. If PCRE was compiled with UTF-8 support, but called without the PCRE_UTF8 |
|
2168 option, a class that contained several characters, but with at least one |
|
2169 whose value was between 128 and 255 caused PCRE to crash. |
|
2170 |
|
2171 |
|
2172 Version 4.1 12-Mar-03 |
|
2173 --------------------- |
|
2174 |
|
2175 1. Compiling with gcc -pedantic found a couple of places where casts were |
|
2176 needed, and a string in dftables.c that was longer than standard compilers are |
|
2177 required to support. |
|
2178 |
|
2179 2. Compiling with Sun's compiler found a few more places where the code could |
|
2180 be tidied up in order to avoid warnings. |
|
2181 |
|
2182 3. The variables for cross-compiling were called HOST_CC and HOST_CFLAGS; the |
|
2183 first of these names is deprecated in the latest Autoconf in favour of the name |
|
2184 CC_FOR_BUILD, because "host" is typically used to mean the system on which the |
|
2185 compiled code will be run. I can't find a reference for HOST_CFLAGS, but by |
|
2186 analogy I have changed it to CFLAGS_FOR_BUILD. |
|
2187 |
|
2188 4. Added -no-undefined to the linking command in the Makefile, because this is |
|
2189 apparently helpful for Windows. To make it work, also added "-L. -lpcre" to the |
|
2190 linking step for the pcreposix library. |
|
2191 |
|
2192 5. PCRE was failing to diagnose the case of two named groups with the same |
|
2193 name. |
|
2194 |
|
2195 6. A problem with one of PCRE's optimizations was discovered. PCRE remembers a |
|
2196 literal character that is needed in the subject for a match, and scans along to |
|
2197 ensure that it is present before embarking on the full matching process. This |
|
2198 saves time in cases of nested unlimited repeats that are never going to match. |
|
2199 Problem: the scan can take a lot of time if the subject is very long (e.g. |
|
2200 megabytes), thus penalizing straightforward matches. It is now done only if the |
|
2201 amount of subject to be scanned is less than 1000 bytes. |
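The revised behaviour in item 6 above can be sketched as a bounded pre-scan. This is an illustration only: the function name and the macro are hypothetical, and the 1000-byte threshold is simply the figure quoted in the entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REQ_SCAN_LIMIT 1000   /* threshold quoted in the entry above */

/* Returns 0 only when the required byte is definitely absent, so that the
   full match can be skipped. Returns 1 if the byte is present, or if the
   subject is too long for the scan to be worthwhile. */
int may_match(const unsigned char *subject, size_t length, unsigned char req)
{
  assert(subject != NULL || length == 0);
  if (length == 0) return 0;                  /* nothing to find */
  if (length >= REQ_SCAN_LIMIT) return 1;     /* too long: do not scan */
  return memchr(subject, req, length) != NULL;
}
```

For long subjects the sketch gives up on the shortcut and leaves the decision to the full matching process.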
|
2202 |
|
2203 7. A lesser problem with the same optimization is that it was recording the |
|
2204 first character of an anchored pattern as "needed", thus provoking a search |
|
2205 right along the subject, even when the first match of the pattern was going to |
|
2206 fail. The "needed" character is now not set for anchored patterns, unless it |
|
2207 follows something in the pattern that is of non-fixed length. Thus, it still |
|
2208 fulfils its original purpose of finding quick non-matches in cases of nested |
|
2209 unlimited repeats, but isn't used for simple anchored patterns such as /^abc/. |
|
2210 |
|
2211 |
|
2212 Version 4.0 17-Feb-03 |
|
2213 --------------------- |
|
2214 |
|
2215 1. If a comment in an extended regex that started immediately after a meta-item |
|
2216 extended to the end of string, PCRE compiled incorrect data. This could lead to |
|
2217 all kinds of weird effects. Example: /#/ was bad; /()#/ was bad; /a#/ was not. |
|
2218 |
|
2219 2. Moved to autoconf 2.53 and libtool 1.4.2. |
|
2220 |
|
2221 3. Perl 5.8 no longer needs "use utf8" for doing UTF-8 things. Consequently, |
|
2222 the special perltest8 script is no longer needed - all the tests can be run |
|
2223 from a single perltest script. |
|
2224 |
|
2225 4. From 5.004, Perl has not included the VT character (0x0b) in the set defined |
|
2226 by \s. It has now been removed in PCRE. This means it isn't recognized as |
|
2227 whitespace in /x regexes too, which is the same as Perl. Note that the POSIX |
|
2228 class [:space:] *does* include VT, thereby creating a mess. |
|
2229 |
|
2230 5. Added the class [:blank:] (a GNU extension from Perl 5.8) to match only |
|
2231 space and tab. |
|
2232 |
|
2233 6. Perl 5.005 was a long time ago. It's time to amalgamate the tests that use |
|
2234 its new features into the main test script, reducing the number of scripts. |
|
2235 |
|
2236 7. Perl 5.8 has changed the meaning of patterns like /a(?i)b/. Earlier versions |
|
2237 were backward compatible, and made the (?i) apply to the whole pattern, as if |
|
2238 /i were given. Now it behaves more logically, and applies the option setting |
|
2239 only to what follows. PCRE has been changed to follow suit. However, if it |
|
2240 finds options settings right at the start of the pattern, it extracts them into |
|
2241 the global options, as before. Thus, they show up in the info data. |
|
2242 |
|
2243 8. Added support for the \Q...\E escape sequence. Characters in between are |
|
2244 treated as literals. This is slightly different from Perl in that $ and @ are |
|
2245 also handled as literals inside the quotes. In Perl, they will cause variable |
|
2246 interpolation. Note the following examples: |
|
2247 |
|
2248 Pattern PCRE matches Perl matches |
|
2249 |
|
2250 \Qabc$xyz\E abc$xyz abc followed by the contents of $xyz |
|
2251 \Qabc\$xyz\E abc\$xyz abc\$xyz |
|
2252 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
|
2253 |
|
2254 For compatibility with Perl, \Q...\E sequences are recognized inside character |
|
2255 classes as well as outside them. |
|
2256 |
|
2257 9. Re-organized 3 code statements in pcretest to avoid "overflow in |
|
2258 floating-point constant arithmetic" warnings from a Microsoft compiler. Added a |
|
2259 (size_t) cast to one statement in pcretest and one in pcreposix to avoid |
|
2260 signed/unsigned warnings. |
|
2261 |
|
2262 10. SunOS4 doesn't have strtoul(). This was used only for unpicking the -o |
|
2263 option for pcretest, so I've replaced it by a simple function that does just |
|
2264 that job. |
|
2265 |
|
2266 11. pcregrep was ending with code 0 instead of 2 for the commands "pcregrep" or |
|
2267 "pcregrep -". |
|
2268 |
|
2269 12. Added "possessive quantifiers" ?+, *+, ++, and {,}+ which come from Sun's |
|
2270 Java package. This provides some syntactic sugar for simple cases of what my |
|
2271 documentation calls "once-only subpatterns". A pattern such as x*+ is the same |
|
2272 as (?>x*). In other words, if what is inside (?>...) is just a single repeated |
|
2273 item, you can use this simplified notation. Note that this only makes sense with |
|
2274 greedy quantifiers. Consequently, the use of the possessive quantifier forces |
|
2275 greediness, whatever the setting of the PCRE_UNGREEDY option. |
|
2276 |
|
2277 13. A change of greediness default within a pattern was not taking effect at |
|
2278 the current level for patterns like /(b+(?U)a+)/. It did apply to parenthesized |
|
2279 subpatterns that followed. Patterns like /b+(?U)a+/ worked because the option |
|
2280 was abstracted outside. |
|
2281 |
|
2282 14. PCRE now supports the \G assertion. It is true when the current matching |
|
2283 position is at the start point of the match. This differs from \A when the |
|
2284 starting offset is non-zero. Used with the /g option of pcretest (or similar |
|
2285 code), it works in the same way as it does for Perl's /g option. If all |
|
2286 alternatives of a regex begin with \G, the expression is anchored to the start |
|
2287 match position, and the "anchored" flag is set in the compiled expression. |
|
2288 |
|
2289 15. Some bugs concerning the handling of certain option changes within patterns |
|
2290 have been fixed. These applied to options other than (?ims). For example, |
|
2291 "a(?x: b c )d" did not match "XabcdY" but did match "Xa b c dY". It should have |
|
2292 been the other way round. Some of this was related to change 7 above. |
|
2293 |
|
2294 16. PCRE now gives errors for /[.x.]/ and /[=x=]/ as unsupported POSIX |
|
2295 features, as Perl does. Previously, PCRE gave the warnings only for /[[.x.]]/ |
|
2296 and /[[=x=]]/. PCRE now also gives an error for /[:name:]/ because it supports |
|
2297 POSIX classes only within a class (e.g. /[[:alpha:]]/). |
|
2298 |
|
2299 17. Added support for Perl's \C escape. This matches one byte, even in UTF8 |
|
2300 mode. Unlike ".", it always matches newline, whatever the setting of |
|
2301 PCRE_DOTALL. However, PCRE does not permit \C to appear in lookbehind |
|
2302 assertions. Perl allows it, but it doesn't (in general) work because it can't |
|
2303 calculate the length of the lookbehind. At least, that's the case for Perl |
|
2304 5.8.0 - I've been told they are going to document that it doesn't work in |
|
2305 future. |
|
2306 |
|
2307 18. Added an error diagnosis for escapes that PCRE does not support: these are |
|
2308 \L, \l, \N, \P, \p, \U, \u, and \X. |
|
2309 |
|
2310 19. Although correctly diagnosing a missing ']' in a character class, PCRE was |
|
2311 reading past the end of the pattern in cases such as /[abcd/. |
|
2312 |
|
2313 20. PCRE was getting more memory than necessary for patterns with classes that |
|
2314 contained both POSIX named classes and other characters, e.g. /[[:space:]abc]/. |
|
2315 |
|
2316 21. Added some code, conditional on #ifdef VPCOMPAT, to make life easier for |
|
2317 compiling PCRE for use with Virtual Pascal. |
|
2318 |
|
2319 22. Small fix to the Makefile to make it work properly if the build is done |
|
2320 outside the source tree. |
|
2321 |
|
2322 23. Added a new extension: a condition to go with recursion. If a conditional |
|
2323 subpattern starts with (?(R) the "true" branch is used if recursion has |
|
2324 happened, whereas the "false" branch is used only at the top level. |
|
2325 |
|
2326 24. When there was a very long string of literal characters (over 255 bytes |
|
2327 without UTF support, over 250 bytes with UTF support), the computation of how |
|
2328 much memory was required could be incorrect, leading to segfaults or other |
|
2329 strange effects. |
|
2330 |
|
2331 25. PCRE was incorrectly assuming anchoring (either to start of subject or to |
|
2332 start of line for a non-DOTALL pattern) when a pattern started with (.*) and |
|
2333 there was a subsequent back reference to those brackets. This meant that, for |
|
2334 example, /(.*)\d+\1/ failed to match "abc123bc". Unfortunately, it isn't |
|
2335 possible to check for precisely this case. All we can do is abandon the |
|
2336 optimization if .* occurs inside capturing brackets when there are any back |
|
2337 references whatsoever. (See below for a better fix that came later.) |
|
2338 |
|
2339 26. The optimizations for finding the first character of a |
|
2340 non-anchored pattern, and for finding a character that is required later in the |
|
2341 match, were failing in some cases. This didn't break the matching; it just |
|
2342 failed to optimize when it could. The way this is done has been re-implemented. |
|
2343 |
|
2344 27. Fixed typo in error message for invalid (?R item (it said "(?p"). |
|
2345 |
|
2346 28. Added a new feature that provides some of the functionality that Perl |
|
2347 provides with (?{...}). The facility is termed a "callout". The way it is done |
|
2348 in PCRE is for the caller to provide an optional function, by setting |
|
2349 pcre_callout to its entry point. Like pcre_malloc and pcre_free, this is a |
|
2350 global variable. By default it is unset, which disables all calling out. To get |
|
2351 the function called, the regex must include (?C) at appropriate points. This |
|
2352 is, in fact, equivalent to (?C0), and any number <= 255 may be given with (?C). |
|
2353 This provides a means of identifying different callout points. When PCRE |
|
2354 reaches such a point in the regex, if pcre_callout has been set, the external |
|
2355 function is called. It is provided with data in a structure called |
|
2356 pcre_callout_block, which is defined in pcre.h. If the function returns 0, |
|
2357 matching continues; if it returns a non-zero value, the match at the current |
|
2358 point fails. However, backtracking will occur if possible. [This was changed |
|
2359 later and other features added - see item 49 below.] |
|
2360 |
|
2361 29. pcretest is upgraded to test the callout functionality. It provides a |
|
2362 callout function that displays information. By default, it shows the start of |
|
2363 the match and the current position in the text. There are some new data escapes |
|
2364 to vary what happens: |
|
2365 |
|
2366 \C+ in addition, show current contents of captured substrings |
|
2367 \C- do not supply a callout function |
|
2368 \C!n return 1 when callout number n is reached |
|
2369 \C!n!m return 1 when callout number n is reached for the mth time |
|
2370 |
|
2371 30. If pcregrep was called with the -l option and just a single file name, it |
|
2372 output "<stdin>" if a match was found, instead of the file name. |
|
2373 |
|
2374 31. Improve the efficiency of the POSIX API to PCRE. If the number of capturing |
|
2375 slots is less than POSIX_MALLOC_THRESHOLD, use a block on the stack to pass to |
|
2376 pcre_exec(). This saves a malloc/free per call. The default value of |
|
2377 POSIX_MALLOC_THRESHOLD is 10; it can be changed by --with-posix-malloc-threshold |
|
2378 when configuring. |
|
2379 |
|
2380 32. The default maximum size of a compiled pattern is 64K. There have been a |
|
2381 few cases of people hitting this limit. The code now uses macros to handle the |
|
2382 storing of links as offsets within the compiled pattern. It defaults to 2-byte |
|
2383 links, but this can be changed to 3 or 4 bytes by --with-link-size when |
|
2384 configuring. Tests 2 and 5 work only with 2-byte links because they output |
|
2385 debugging information about compiled patterns. |
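As an illustration of item 32, storing links as fixed-width offsets might be sketched as below. The names put_link(), get_link() and LINK_SIZE are hypothetical, and the most-significant-byte-first order is an assumption of this sketch, not something the entry states. With 2-byte links the largest offset is 65535, which is where the 64K limit comes from.

```c
#include <stddef.h>

/* Hypothetical link width in bytes; item 32 says 2 is the default and
   that 3 or 4 can be selected when configuring. */
#ifndef LINK_SIZE
#define LINK_SIZE 2
#endif

/* Store an offset at code[pos], most significant byte first
   (byte order is an assumption of this sketch). */
static void put_link(unsigned char *code, size_t pos, unsigned long value)
{
    int i;
    for (i = LINK_SIZE - 1; i >= 0; i--) {
        code[pos + (size_t)i] = (unsigned char)(value & 0xffu);
        value >>= 8;
    }
}

/* Read back an offset stored by put_link(). */
static unsigned long get_link(const unsigned char *code, size_t pos)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < LINK_SIZE; i++)
        value = (value << 8) | code[pos + (size_t)i];
    return value;
}
```

Keeping all reads and writes behind two helpers is what lets the width change without touching the rest of the code.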
|
2386 |
|
2387 33. Internal code re-arrangements: |
|
2388 |
|
2389 (a) Moved the debugging function for printing out a compiled regex into |
|
2390 its own source file (printint.c) and used #include to pull it into |
|
2391 pcretest.c and, when DEBUG is defined, into pcre.c, instead of having two |
|
2392 separate copies. |
|
2393 |
|
2394 (b) Defined the list of op-code names for debugging as a macro in |
|
2395 internal.h so that it is next to the definition of the opcodes. |
|
2396 |
|
2397 (c) Defined a table of op-code lengths for simpler skipping along compiled |
|
2398 code. This is again a macro in internal.h so that it is next to the |
|
2399 definition of the opcodes. |
|
2400 |
|
2401 34. Added support for recursive calls to individual subpatterns, along the |
|
2402 lines of Robin Houston's patch (but implemented somewhat differently). |
|
2403 |
|
2404 35. Further mods to the Makefile to help Win32. Also, added code to pcregrep to |
|
2405 allow it to read and process whole directories in Win32. This code was |
|
2406 contributed by Lionel Fourquaux; it has not been tested by me. |
|
2407 |
|
2408 36. Added support for named subpatterns. The Python syntax (?P<name>...) is |
|
2409 used to name a group. Names consist of alphanumerics and underscores, and must |
|
2410 be unique. Back references use the syntax (?P=name) and recursive calls use |
|
2411 (?P>name) which is a PCRE extension to the Python extension. Groups still have |
|
2412 numbers. The function pcre_fullinfo() can be used after compilation to extract |
|
2413 a name/number map. There are three relevant calls: |
|
2414 |
|
2415 PCRE_INFO_NAMEENTRYSIZE yields the size of each entry in the map |
|
2416 PCRE_INFO_NAMECOUNT yields the number of entries |
|
2417 PCRE_INFO_NAMETABLE yields a pointer to the map. |
|
2418 |
|
2419 The map is a vector of fixed-size entries. The size of each entry depends on |
|
2420 the length of the longest name used. The first two bytes of each entry are the |
|
2421 group number, most significant byte first. There follows the corresponding |
|
2422 name, zero terminated. The names are in alphabetical order. |
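The table layout described in item 36 can be walked as in the following sketch. The function find_group_number() is a hypothetical helper, not part of PCRE; in a real program the three arguments would come from the pcre_fullinfo() requests listed above.

```c
#include <string.h>

/* Look up a group number by name in a table laid out as item 36
   describes: fixed-size entries, two bytes of group number (most
   significant byte first), then the zero-terminated name.
   Returns -1 if the name is not present. */
static int find_group_number(const unsigned char *table, int count,
                             int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

Because the names are in alphabetical order, a binary search would also work; the linear scan is used here only to keep the sketch short.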
|
2423 |
|
2424 37. Make the maximum literal string in the compiled code 250 for the non-UTF-8 |
|
2425 case instead of 255. Making it the same both with and without UTF-8 support |
|
2426 means that the same test output works with both. |
|
2427 |
|
2428 38. There was a case of malloc(0) in the POSIX testing code in pcretest. Avoid |
|
2429 calling malloc() with a zero argument. |
|
2430 |
|
2431 39. Change 25 above had to resort to a heavy-handed test for the .* anchoring |
|
2432 optimization. I've improved things by keeping a bitmap of backreferences with |
|
2433 numbers 1-31 so that if .* occurs inside capturing brackets that are not in |
|
2434 fact referenced, the optimization can be applied. It is unlikely that a |
|
2435 relevant occurrence of .* (i.e. one which might indicate anchoring or forcing |
|
2436 the match to follow \n) will appear inside brackets with a number greater than |
|
2437 31, but if it does, any back reference > 31 suppresses the optimization. |
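The bookkeeping in item 39 might look roughly like this. Both function names are hypothetical, and using bit 0 to stand for "some reference above 31" is an assumption of the sketch; the entry does not say how that case is recorded.

```c
/* Bit n (1..31) records a back reference to group n. Bit 0 is used
   here (an assumption of this sketch) for any reference above 31. */
static unsigned long note_backref(unsigned long map, int group)
{
    return map | (group < 32 ? (1UL << group) : 1UL);
}

/* May the .* anchoring optimization be applied when .* sits inside
   the capturing group with this number (group >= 1)? */
static int can_assume_anchor(unsigned long map, int group)
{
    unsigned long bit = (group < 32) ? (1UL << group) : 1UL;
    return (map & bit) == 0;
}
```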
|
2438 |
|
2439 40. Added a new compile-time option PCRE_NO_AUTO_CAPTURE. This has the effect |
|
2440 of disabling numbered capturing parentheses. Any opening parenthesis that is |
|
2441 not followed by ? behaves as if it were followed by ?: but named parentheses |
|
2442 can still be used for capturing (and they will acquire numbers in the usual |
|
2443 way). |
|
2444 |
|
2445 41. Redesigned the return codes from the match() function into yes/no/error so |
|
2446 that errors can be passed back from deep inside the nested calls. A malloc |
|
2447 failure while inside a recursive subpattern call now causes the |
|
2448 PCRE_ERROR_NOMEMORY return instead of quietly going wrong. |
|
2449 |
|
2450 42. It is now possible to set a limit on the number of times the match() |
|
2451 function is called in a call to pcre_exec(). This facility makes it possible to |
|
2452 limit the amount of recursion and backtracking, though not in a directly |
|
2453 obvious way, because the match() function is used in a number of different |
|
2454 circumstances. The count starts from zero for each position in the subject |
|
2455 string (for non-anchored patterns). The default limit is, for compatibility, a |
|
2456 large number, namely 10 000 000. You can change this in two ways: |
|
2457 |
|
2458 (a) When configuring PCRE before making, you can use --with-match-limit=n |
|
2459 to set a default value for the compiled library. |
|
2460 |
|
2461 (b) For each call to pcre_exec(), you can pass a pcre_extra block in which |
|
2462 a different value is set. See 45 below. |
|
2463 |
|
2464 If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
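The idea of counting match() calls and passing an error back through the nested calls can be sketched with a toy matcher. This is not PCRE's code: the matcher below handles only literal characters and x*, and struct match_state, MATCH_LIMIT_ERROR and the other names are hypothetical stand-ins.

```c
struct match_state {
    unsigned long calls;   /* match() calls made so far */
    unsigned long limit;   /* maximum number allowed */
};

#define MATCH_YES 1
#define MATCH_NO 0
#define MATCH_LIMIT_ERROR (-1)   /* stand-in for PCRE_ERROR_MATCHLIMIT */

/* Anchored match of the whole of text against pat, where pat may
   contain literal characters and x* (zero or more of x). */
static int match(const char *pat, const char *text, struct match_state *st)
{
    int rc;
    if (++st->calls > st->limit) return MATCH_LIMIT_ERROR;
    if (*pat == '\0') return *text == '\0' ? MATCH_YES : MATCH_NO;
    if (pat[1] == '*') {
        /* Try zero repetitions first, then one more character each time. */
        for (;;) {
            rc = match(pat + 2, text, st);
            if (rc != MATCH_NO) return rc;   /* success or error: pass it up */
            if (*text == '\0' || *text != *pat) return MATCH_NO;
            text++;
        }
    }
    if (*text != '\0' && *text == *pat) return match(pat + 1, text + 1, st);
    return MATCH_NO;
}
```

Any result other than "no" stops the backtracking loop, so a limit error raised deep in the recursion reaches the original caller unchanged.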
|
2465 |
|
2466 43. Added a new function pcre_config(int, void *) to enable run-time extraction |
|
2467 of things that can be changed at compile time. The first argument specifies |
|
2468 what is wanted and the second points to where the information is to be placed. |
|
2469 The current list of available information is: |
|
2470 |
|
2471 PCRE_CONFIG_UTF8 |
|
2472 |
|
2473 The output is an integer that is set to one if UTF-8 support is available; |
|
2474 otherwise it is set to zero. |
|
2475 |
|
2476 PCRE_CONFIG_NEWLINE |
|
2477 |
|
2478 The output is an integer that is set to the value of the code that is used for |
|
2479 newline. It is either LF (10) or CR (13). |
|
2480 |
|
2481 PCRE_CONFIG_LINK_SIZE |
|
2482 |
|
2483 The output is an integer that contains the number of bytes used for internal |
|
2484 linkage in compiled expressions. The value is 2, 3, or 4. See item 32 above. |
|
2485 |
|
2486 PCRE_CONFIG_POSIX_MALLOC_THRESHOLD |
|
2487 |
|
2488 The output is an integer that contains the threshold above which the POSIX |
|
2489 interface uses malloc() for output vectors. See item 31 above. |
|
2490 |
|
2491 PCRE_CONFIG_MATCH_LIMIT |
|
2492 |
|
2493 The output is an unsigned integer that contains the default limit of the number |
|
2494 of match() calls in a pcre_exec() execution. See 42 above. |
|
2495 |
|
2496 44. pcretest has been upgraded by the addition of the -C option. This causes it |
|
2497 to extract all the available output from the new pcre_config() function, and to |
|
2498 output it. The program then exits immediately. |
|
2499 |
|
2500 45. A need has arisen to pass over additional data with calls to pcre_exec() in |
|
2501 order to support additional features. One way would have been to define |
|
2502 pcre_exec2() (for example) with extra arguments, but this would not have been |
|
2503 extensible, and would also have required all calls to the original function to |
|
2504 be mapped to the new one. Instead, I have chosen to extend the mechanism that |
|
2505 is used for passing in "extra" data from pcre_study(). |
|
2506 |
|
2507 The pcre_extra structure is now exposed and defined in pcre.h. It currently |
|
2508 contains the following fields: |
|
2509 |
|
2510 flags a bitmap indicating which of the following fields are set |
|
2511 study_data opaque data from pcre_study() |
|
2512 match_limit a way of specifying a limit on match() calls for a specific |
|
2513 call to pcre_exec() |
|
2514 callout_data data for callouts (see 49 below) |
|
2515 |
|
2516 The flag bits are also defined in pcre.h, and are |
|
2517 |
|
2518 PCRE_EXTRA_STUDY_DATA |
|
2519 PCRE_EXTRA_MATCH_LIMIT |
|
2520 PCRE_EXTRA_CALLOUT_DATA |
|
2521 |
|
2522 The pcre_study() function now returns one of these new pcre_extra blocks, with |
|
2523 the actual study data pointed to by the study_data field, and the |
|
2524 PCRE_EXTRA_STUDY_DATA flag set. This can be passed directly to pcre_exec() as |
|
2525 before. That is, this change is entirely upwards-compatible and requires no |
|
2526 change to existing code. |
|
2527 |
|
2528 If you want to pass in additional data to pcre_exec(), you can either place it |
|
2529 in a pcre_extra block provided by pcre_study(), or create your own pcre_extra |
|
2530 block. |
|
2531 |
|
2532 46. pcretest has been extended to test the PCRE_EXTRA_MATCH_LIMIT feature. If a |
|
2533 data string contains the escape sequence \M, pcretest calls pcre_exec() several |
|
2534 times with different match limits, until it finds the minimum value needed for |
|
2535 pcre_exec() to complete. The value is then output. This can be instructive; for |
|
2536 most simple matches the number is quite small, but for pathological cases it |
|
2537 gets very large very quickly. |
|
2538 |
|
2539 47. There's a new option for pcre_fullinfo() called PCRE_INFO_STUDYSIZE. It |
|
2540 returns the size of the data block pointed to by the study_data field in a |
|
2541 pcre_extra block, that is, the value that was passed as the argument to |
|
2542 pcre_malloc() when PCRE was getting memory in which to place the information |
|
2543 created by pcre_study(). The fourth argument should point to a size_t variable. |
|
2544 pcretest has been extended so that this information is shown after a successful |
|
2545 pcre_study() call when information about the compiled regex is being displayed. |
|
2546 |
|
2547 48. Cosmetic change to Makefile: there's no need to have / after $(DESTDIR) |
|
2548 because what follows is always an absolute path. (Later: it turns out that this |
|
2549 is more than cosmetic for MinGW, because it doesn't like empty path |
|
2550 components.) |
|
2551 |
|
2552 49. Some changes have been made to the callout feature (see 28 above): |
|
2553 |
|
2554 (i) A callout function now has three choices for what it returns: |
|
2555 |
|
2556 0 => success, carry on matching |
|
2557 > 0 => failure at this point, but backtrack if possible |
|
2558 < 0 => serious error, return this value from pcre_exec() |
|
2559 |
|
2560 Negative values should normally be chosen from the set of PCRE_ERROR_xxx |
|
2561 values. In particular, returning PCRE_ERROR_NOMATCH forces a standard |
|
2562 "match failed" error. The error number PCRE_ERROR_CALLOUT is reserved for |
|
2563 use by callout functions. It will never be used by PCRE itself. |
|
2564 |
|
2565 (ii) The pcre_extra structure (see 45 above) has a void * field called |
|
2566 callout_data, with corresponding flag bit PCRE_EXTRA_CALLOUT_DATA. The |
|
2567 pcre_callout_block structure has a field of the same name. The contents of |
|
2568 the field passed in the pcre_extra structure are passed to the callout |
|
2569 function in the corresponding field in the callout block. This makes it |
|
2570 easier to use the same callout-containing regex from multiple threads. For |
|
2571 testing, the pcretest program has a new data escape |
|
2572 |
|
2573 \C*n pass the number n (may be negative) as callout_data |
|
2574 |
|
2575 If the callout function in pcretest receives a non-zero value as |
|
2576 callout_data, it returns that value. |
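The three-way return convention in 49(i) amounts to the following classification. The enum and function names are hypothetical and exist only for this sketch.

```c
/* What the matcher does next after a callout function returns. */
enum callout_action { CALLOUT_CONTINUE, CALLOUT_BACKTRACK, CALLOUT_ABORT };

static enum callout_action classify_callout_result(int rc)
{
    if (rc == 0) return CALLOUT_CONTINUE;    /* carry on matching */
    if (rc > 0) return CALLOUT_BACKTRACK;    /* fail here, try alternatives */
    return CALLOUT_ABORT;                    /* rc is returned by pcre_exec() */
}
```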
|
2577 |
|
2578 50. Makefile wasn't handling CFLAGS properly when compiling dftables. Also, |
|
2579 there were some redundant $(CFLAGS) in commands that are now specified as |
|
2580 $(LINK), which already includes $(CFLAGS). |
|
2581 |
|
2582 51. Extensions to UTF-8 support are listed below. These all apply when (a) PCRE |
|
2583 has been compiled with UTF-8 support *and* (b) pcre_compile() has been called |
|
2584 with the PCRE_UTF8 flag. Patterns that are compiled without that flag assume |
|
2585 one-byte characters throughout. Note that case-insensitive matching applies |
|
2586 only to characters whose values are less than 256. PCRE doesn't support the |
|
2587 notion of cases for higher-valued characters. |
|
2588 |
|
2589 (i) A character class whose characters are all within 0-255 is handled as |
|
2590 a bit map, and the map is inverted for negative classes. Previously, a |
|
2591 character > 255 always failed to match such a class; however it should |
|
2592 match if the class was a negative one (e.g. [^ab]). This has been fixed. |
|
2593 |
|
2594 (ii) A negated character class with a single character < 255 is coded as |
|
2595 "not this character" (OP_NOT). This wasn't working properly when the test |
|
2596 character was multibyte, either singly or repeated. |
|
2597 |
|
2598 (iii) Repeats of multibyte characters are now handled correctly in UTF-8 |
|
2599 mode, for example: \x{100}{2,3}. |
|
2600 |
|
2601 (iv) The character escapes \b, \B, \d, \D, \s, \S, \w, and \W (either |
|
2602 singly or repeated) now correctly test multibyte characters. However, |
|
2603 PCRE doesn't recognize any characters with values greater than 255 as |
|
2604 digits, spaces, or word characters. Such characters always match \D, \S, |
|
2605 and \W, and never match \d, \s, or \w. |
|
2606 |
|
2607 (v) Classes may now contain characters and character ranges with values |
|
2608 greater than 255. For example: [ab\x{100}-\x{400}]. |
|
2609 |
|
2610 (vi) pcregrep now has a --utf-8 option (synonym -u) which makes it call |
|
2611 PCRE in UTF-8 mode. |
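The fix in 51(i) can be pictured with a small sketch: a class limited to values 0-255 is a 256-bit map, and a character above 255 matches only if the class was negated. struct byte_class and both functions are hypothetical; PCRE's internal representation is not shown in this entry.

```c
#include <string.h>

/* A class restricted to values 0-255: a 256-bit map (already inverted
   if the class is negative) plus a flag recording the negation. */
struct byte_class {
    unsigned char map[32];
    int negated;
};

static void build_class(struct byte_class *cls, const char *members,
                        int negated)
{
    int i;
    memset(cls->map, 0, sizeof cls->map);
    for (; *members != '\0'; members++) {
        unsigned char c = (unsigned char)*members;
        cls->map[c / 8] |= (unsigned char)(1u << (c % 8));
    }
    if (negated)
        for (i = 0; i < 32; i++)
            cls->map[i] = (unsigned char)~cls->map[i];
    cls->negated = negated;
}

static int class_matches(const struct byte_class *cls, unsigned long c)
{
    if (c > 255) return cls->negated;   /* outside the map: only [^...] matches */
    return (cls->map[c / 8] >> (c % 8)) & 1;
}
```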
|
2612 |
|
2613 52. The info request value PCRE_INFO_FIRSTCHAR has been renamed |
|
2614 PCRE_INFO_FIRSTBYTE because it is a byte value. However, the old name is |
|
2615 retained for backwards compatibility. (Note that LASTLITERAL is also a byte |
|
2616 value.) |
|
2617 |
|
2618 53. The single man page has become too large. I have therefore split it up into |
|
2619 a number of separate man pages. These also give rise to individual HTML pages; |
|
2620 these are now put in a separate directory, and there is an index.html page that |
|
2621 lists them all. Some hyperlinking between the pages has been installed. |
|
2622 |
|
2623 54. Added convenience functions for handling named capturing parentheses. |
|
2624 |
|
2625 55. Unknown escapes inside character classes (e.g. [\M]) and escapes that |
|
2626 aren't interpreted therein (e.g. [\C]) are literals in Perl. This is now also |
|
2627 true in PCRE, except when the PCRE_EXTENDED option is set, in which case they |
|
2628 are faulted. |
|
2629 |
|
2630 56. Introduced HOST_CC and HOST_CFLAGS which can be set in the environment when |
|
2631 calling configure. These values are used when compiling the dftables.c program |
|
2632 which is run to generate the source of the default character tables. They |
|
2633 default to the values of CC and CFLAGS. If you are cross-compiling PCRE, |
|
2634 you will need to set these values. |
|
2635 |
|
2636 57. Updated the building process for Windows DLL, as provided by Fred Cox. |
|
2637 |
|
2638 |
|
2639 Version 3.9 02-Jan-02 |
|
2640 --------------------- |
|
2641 |
|
2642 1. A bit of extraneous text had somehow crept into the pcregrep documentation. |
|
2643 |
|
2644 2. If --disable-static was given, the building process failed when trying to |
|
2645 build pcretest and pcregrep. (For some reason it was using libtool to compile |
|
2646 them, which is not right, as they aren't part of the library.) |
|
2647 |
|
2648 |
|
2649 Version 3.8 18-Dec-01 |
|
2650 --------------------- |
|
2651 |
|
2652 1. The experimental UTF-8 code was completely screwed up. It was packing the |
|
2653 bytes in the wrong order. How dumb can you get? |
|
2654 |
|
2655 |
|
2656 Version 3.7 29-Oct-01 |
|
2657 --------------------- |
|
2658 |
|
2659 1. In updating pcretest to check change 1 of version 3.6, I screwed up. |
|
2660 This caused pcretest, when used on the test data, to segfault. Unfortunately, |
|
2661 this didn't happen under Solaris 8, where I normally test things. |
|
2662 |
|
2663 2. The Makefile had to be changed to make it work on BSD systems, where 'make' |
|
2664 doesn't seem to recognize that ./xxx and xxx are the same file. (This entry |
|
2665 isn't in ChangeLog distributed with 3.7 because I forgot when I hastily made |
|
2666 this fix an hour or so after the initial 3.7 release.) |
|
2667 |
|
2668 |
|
2669 Version 3.6 23-Oct-01 |
|
2670 --------------------- |
|
2671 |
|
2672 1. Crashed with /(sens|respons)e and \1ibility/ and "sense and sensibility" if |
|
2673 offsets passed as NULL with zero offset count. |
|
2674 |
|
2675 2. The config.guess and config.sub files had not been updated when I moved to |
|
2676 the latest autoconf. |
|
2677 |
|
2678 |
|
2679 Version 3.5 15-Aug-01 |
|
2680 --------------------- |
|
2681 |
|
2682 1. Added some missing #if !defined NOPOSIX conditionals in pcretest.c that |
|
2683 had been forgotten. |
|
2684 |
|
2685 2. By using declared but undefined structures, we can avoid using "void" |
|
2686 definitions in pcre.h while keeping the internal definitions of the structures |
|
2687 private. |
|
2688 |
|
2689 3. The distribution is now built using autoconf 2.50 and libtool 1.4. From a |
|
2690 user point of view, this means that both static and shared libraries are built |
|
2691 by default, but this can be individually controlled. More of the work of |
|
2692 handling the static/shared cases is now inside libtool instead of PCRE's make |
|
2693 file. |
|
2694 |
|
2695 4. The pcretest utility is now installed along with pcregrep because it is |
|
2696 useful for users (to test regexs) and by doing this, it automatically gets |
|
2697 relinked by libtool. The documentation has been turned into a man page, so |
|
2698 there are now .1, .txt, and .html versions in /doc. |
|
2699 |
|
2700 5. Upgrades to pcregrep: |
|
2701 (i) Added long-form option names like gnu grep. |
|
2702 (ii) Added --help to list all options with an explanatory phrase. |
|
2703 (iii) Added -r, --recursive to recurse into sub-directories. |
|
2704 (iv) Added -f, --file to read patterns from a file. |
|
2705 |
|
2706 6. pcre_exec() was referring to its "code" argument before testing that |
|
2707 argument for NULL (and giving an error if it was NULL). |
|
2708 |
|
2709 7. Upgraded Makefile.in to allow for compiling in a different directory from |
|
2710 the source directory. |
|
2711 |
|
2712 8. Tiny buglet in pcretest: when pcre_fullinfo() was called to retrieve the |
|
2713 options bits, the pointer it was passed was to an int instead of to an unsigned |
|
2714 long int. This mattered only on 64-bit systems. |
|
2715 |
|
2716 9. Fixed typo (3.4/1) in pcre.h again. Sigh. I had changed pcre.h (which is |
|
2717 generated) instead of pcre.in, which is its source. Also made the same change |
|
2718 in several of the .c files. |
|
2719 |
|
2720 10. A new release of gcc defines printf() as a macro, which broke pcretest |
|
2721 because it had an ifdef in the middle of a string argument for printf(). Fixed |
|
2722 by using separate calls to printf(). |
|
2723 |
|
2724 11. Added --enable-newline-is-cr and --enable-newline-is-lf to the configure |
|
2725 script, to force use of CR or LF instead of \n in the source. On non-Unix |
|
2726 systems, the value can be set in config.h. |
|
2727 |
|
2728 12. The limit of 200 on non-capturing parentheses is a _nesting_ limit, not an |
|
2729 absolute limit. Changed the text of the error message to make this clear, and |
|
2730 likewise updated the man page. |
|
2731 |
|
2732 13. The limit of 99 on the number of capturing subpatterns has been removed. |
|
2733 The new limit is 65535, which I hope will not be a "real" limit. |
|
2734 |
|
2735 |
|
2736 Version 3.4 22-Aug-00 |
|
2737 --------------------- |
|
2738 |
|
2739 1. Fixed typo in pcre.h: unsigned const char * changed to const unsigned char *. |
|
2740 |
|
2741 2. Diagnose condition (?(0) as an error instead of crashing on matching. |
|
2742 |
|
2743 |
|
2744 Version 3.3 01-Aug-00 |
|
2745 --------------------- |
|
2746 |
|
2747 1. If an octal character was given, but the value was greater than \377, it |
|
2748 was not getting masked to the least significant bits, as documented. This could |
|
2749 lead to crashes in some systems. |
|
2750 |
|
2751 2. Perl 5.6 (if not earlier versions) accepts classes like [a-\d] and treats |
|
2752 the hyphen as a literal. PCRE used to give an error; it now behaves like Perl. |
|
2753 |
|
2754 3. Added the functions pcre_free_substring() and pcre_free_substring_list(). |
|
2755 These just pass their arguments on to (pcre_free)(), but they are provided |
|
2756 because some uses of PCRE bind it to non-C systems that can call its functions, |
|
2757 but cannot call free() or pcre_free() directly. |
|
2758 |
|
2759 4. Add "make test" as a synonym for "make check". Corrected some comments in |
|
2760 the Makefile. |
|
2761 |
|
2762 5. Add $(DESTDIR)/ in front of all the paths in the "install" target in the |
|
2763 Makefile. |
|
2764 |
|
2765 6. Changed the name of pgrep to pcregrep, because Solaris has introduced a |
|
2766 command called pgrep for grepping around the active processes. |
|
2767 |
|
2768 7. Added the beginnings of support for UTF-8 character strings. |
|
2769 |
|
2770 8. Arranged for the Makefile to pass over the settings of CC, CFLAGS, and |
|
2771 RANLIB to ./ltconfig so that they are used by libtool. I think these are all |
|
2772 the relevant ones. (AR is not passed because ./ltconfig does its own figuring |
|
2773 out for the ar command.) |
|
2774 |
|
2775 |
|
2776 Version 3.2 12-May-00 |
|
2777 --------------------- |
|
2778 |
|
2779 This is purely a bug fixing release. |
|
2780 |
|
2781 1. If the pattern /((Z)+|A)*/ was matched against ZABCDEFG it matched Z instead |
|
2782 of ZA. This was just one example of several cases that could provoke this bug, |
|
2783 which was introduced by change 9 of version 2.00. The code for breaking |
|
2784 infinite loops after an iteration that matches an empty string wasn't working |
|
2785 correctly. |
|
2786 |
|
2787 2. The pcretest program was not imitating Perl correctly for the pattern /a*/g |
|
2788 when matched against abbab (for example). After matching an empty string, it |
|
2789 wasn't forcing anchoring when setting PCRE_NOTEMPTY for the next attempt; this |
|
2790 caused it to match further down the string than it should. |
|
2791 |
|
2792 3. The code contained an inclusion of sys/types.h. It isn't clear why this |
|
2793 was there because it doesn't seem to be needed, and it causes trouble on some |
|
2794 systems, as it is not a Standard C header. It has been removed. |
|
2795 |
|
2796 4. Made 4 silly changes to the source to avoid stupid compiler warnings that |
|
2797 were reported on the Macintosh. The changes were from |
|
2798 |
|
2799 while ((c = *(++ptr)) != 0 && c != '\n'); |
|
2800 to |
|
2801 while ((c = *(++ptr)) != 0 && c != '\n') ; |
|
2802 |
|
2803 Totally extraordinary, but if that's what it takes... |
|
2804 |
|
2805 5. PCRE is being used in one environment where neither memmove() nor bcopy() is |
|
2806 available. Added HAVE_BCOPY and an autoconf test for it; if neither |
|
2807 HAVE_MEMMOVE nor HAVE_BCOPY is set, use a built-in emulation function which |
|
2808 assumes the way PCRE uses memmove() (always moving upwards). |
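An emulation that relies on "always moving upwards" could look like the sketch below. The name move_upwards() is hypothetical, and this is not claimed to be the function in the PCRE source. It copies from the end of the block backwards, which is safe for overlapping blocks only when the destination is above the source.

```c
#include <stddef.h>

/* Copy n bytes from src to dest, starting at the end of the block.
   Safe for overlapping blocks only when dest is above src. */
static void *move_upwards(void *dest, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dest + n;
    const unsigned char *s = (const unsigned char *)src + n;
    while (n-- > 0) *(--d) = *(--s);
    return dest;
}
```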
|
2809 |
|
2810 6. PCRE is being used in one environment where strchr() is not available. There |
|
2811 was only one use in pcre.c, and writing it out to avoid strchr() probably gives |
|
2812 faster code anyway. |
|
2813 |
|
2814 |
|
2815 Version 3.1 09-Feb-00 |
|
2816 --------------------- |
|
2817 |
|
2818 The only change in this release is the fixing of some bugs in Makefile.in for |
|
2819 the "install" target: |
|
2820 |
|
2821 (1) It was failing to install pcreposix.h. |
|
2822 |
|
2823 (2) It was overwriting the pcre.3 man page with the pcreposix.3 man page. |
|
2824 |
|
2825 |
|
2826 Version 3.0 01-Feb-00 |
|
2827 --------------------- |
|
2828 |
|
2829 1. Add support for the /+ modifier to perltest (to output $` like it does in |
|
2830 pcretest). |
|
2831 |
|
2832 2. Add support for the /g modifier to perltest. |
|
2833 |
|
2834 3. Fix pcretest so that it behaves even more like Perl for /g when the pattern |
|
2835 matches null strings. |
|
2836 |
|
2837 4. Fix perltest so that it doesn't do unwanted things when fed an empty |
|
2838 pattern. Perl treats empty patterns specially - it reuses the most recent |
|
2839 pattern, which is not what we want. Replace // by /(?#)/ in order to avoid this |
|
2840 effect. |
|
2841 |
|
2842 5. The POSIX interface was broken in that it was just handing over the POSIX |
|
2843 captured string vector to pcre_exec(), but (since release 2.00) PCRE has |
|
2844 required a bigger vector, with some working space on the end. This means that |
|
2845 the POSIX wrapper now has to get and free some memory, and copy the results. |
|
2846 |
|
2847 6. Added some simple autoconf support, placing the test data and the |
|
2848 documentation in separate directories, re-organizing some of the |
|
2849 information files, and making it build pcre-config (a GNU standard). Also added |
|
2850 libtool support for building PCRE as a shared library, which is now the |
|
2851 default. |
|
2852 |
|
2853 7. Got rid of the leading zero in the definition of PCRE_MINOR because 08 and |
|
2854 09 are not valid octal constants. Single digits will be used for minor values |
|
2855 less than 10. |
|
2856 |
|
2857 8. Defined REG_EXTENDED and REG_NOSUB as zero in the POSIX header, so that |
|
2858 existing programs that set these in the POSIX interface can use PCRE without |
|
2859 modification. |
|
2860 |
|
2861 9. Added a new function, pcre_fullinfo() with an extensible interface. It can |
|
2862 return all that pcre_info() returns, plus additional data. The pcre_info() |
|
2863 function is retained for compatibility, but is considered to be obsolete. |
|
2864 |
|
2865 10. Added experimental recursion feature (?R) to handle one common case that |
|
2866 Perl 5.6 will be able to do with (?p{...}). |
|
2867 |
|
2868 11. Added support for POSIX character classes like [:alpha:], which Perl is |
|
2869 adopting. |
|
2870 |
|
2871 |
|
2872 Version 2.08 31-Aug-99 |
|
2873 ---------------------- |
|
2874 |
|
2875 1. When startoffset was not zero and the pattern began with ".*", PCRE was not |
|
2876 trying to match at the startoffset position, but instead was moving forward to |
|
2877 the next newline as if a previous match had failed. |
|
2878 |
|
2879 2. pcretest was not making use of PCRE_NOTEMPTY when repeating for /g and /G, |
|
2880 and could get into a loop if a null string was matched other than at the start |
|
2881 of the subject. |
|
2882 |
|
2883 3. Added definitions of PCRE_MAJOR and PCRE_MINOR to pcre.h so the version can |
|
2884 be distinguished at compile time, and for completeness also added PCRE_DATE. |
|
2885 |
|
2886 5. Added Paul Sokolovsky's minor changes to make it easy to compile a Win32 DLL |
|
2887 in GnuWin32 environments. |
|
2888 |
|
2889 |
|
2890 Version 2.07 29-Jul-99 |
|
2891 ---------------------- |
|
2892 |
|
2893 1. The documentation is now supplied in plain text form and HTML as well as in |
|
2894 the form of man page sources. |
|
2895 |
|
2896 2. C++ compilers don't like assigning (void *) values to other pointer types. |
|
2897 In particular this affects malloc(). Although there is no problem in Standard |
|
2898 C, I've put in casts to keep C++ compilers happy. |
|
2899 |
|
2900 3. Typo in pcretest.c; a cast of (unsigned char *) in the POSIX regexec() call |
|
2901 should be (const char *). |
|
2902 |
|
2903 4. If NOPOSIX is defined, pcretest.c compiles without POSIX support. This may |
|
2904 be useful for non-Unix systems that don't want to bother with the POSIX stuff. |
|
2905 However, I haven't made this a standard facility. The documentation doesn't |
|
2906 mention it, and the Makefile doesn't support it. |
|
2907 |
|
2908 5. The Makefile now contains an "install" target, with editable destinations at |
|
2909 the top of the file. The pcretest program is not installed. |
|
2910 |
|
2911 6. pgrep -V now gives the PCRE version number and date. |
|
2912 |
|
2913 7. Fixed bug: a zero repetition after a literal string (e.g. /abcde{0}/) was |
|
2914 causing the entire string to be ignored, instead of just the last character. |
|
2915 |
|
2916 8. If a pattern like /"([^\\"]+|\\.)*"/ is applied in the normal way to a |
|
2917 non-matching string, it can take a very, very long time, even for strings of |
|
2918 quite modest length, because of the nested recursion. PCRE now does better in |
|
2919 some of these cases. It does this by remembering the last required literal |
|
2920 character in the pattern, and pre-searching the subject to ensure it is present |
|
2921 before running the real match. In other words, it applies a heuristic to detect |
|
2922 some types of certain failure quickly, and in the above example, if presented |
|
2923 with a string that has no trailing " it gives "no match" very quickly. |
|
2924 |
|
2925 9. A new runtime option PCRE_NOTEMPTY causes null string matches to be ignored; |
|
2926 other alternatives are tried instead. |
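
Note on change 8 above: the pre-search can be pictured with the following minimal sketch. It is not PCRE's actual code. The function name and parameters are hypothetical, and it assumes the pattern's first character has already been found at offset "start", so the required literal must occur somewhere after it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the required literal occurs after the
   candidate start position, 0 if a full match attempt is certain to fail.
   For /"([^\\"]+|\\.)*"/ the required literal would be '"'. */
static int required_char_follows(const char *subject, size_t length,
                                 size_t start, char required)
{
    assert(subject != NULL && start <= length);
    if (length - start < 2)
        return 0;   /* no room for any character after the start position */
    return memchr(subject + start + 1, required, length - start - 1) != NULL;
}
```

With a subject that opens with " but never closes it, the check fails after one linear scan, so the expensive nested backtracking is never started.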
|
2927 |
|
2928 |
|
2929 Version 2.06 09-Jun-99 |
|
2930 ---------------------- |
|
2931 |
|
2932 1. Change pcretest's output for amount of store used to show just the code |
|
2933 space, because the remainder (the data block) varies in size between 32-bit and |
|
2934 64-bit systems. |
|
2935 |
|
2936 2. Added an extra argument to pcre_exec() to supply an offset in the subject to |
|
2937 start matching at. This allows lookbehinds to work when searching for multiple |
|
2938 occurrences in a string. |
|
2939 |
|
2940 3. Added additional options to pcretest for testing multiple occurrences: |
|
2941 |
|
2942 /+ outputs the rest of the string that follows a match |
|
2943 /g loops for multiple occurrences, using the new startoffset argument |
|
2944 /G loops for multiple occurrences by passing an incremented pointer |
|
2945 |
|
2946 4. PCRE wasn't doing the "first character" optimization for patterns starting |
|
2947 with \b or \B, though it was doing it for other lookbehind assertions. That is, |
|
2948 it wasn't noticing that a match for a pattern such as /\bxyz/ has to start with |
|
2949 the letter 'x'. On long subject strings, this gives a significant speed-up. |
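
Note on changes 2 and 3 above: the difference between a start offset (/g) and an incremented pointer (/G) matters as soon as a pattern looks behind. The toy matcher below is a sketch only; it hard-codes the single pattern (?<=a)b and does not use the PCRE API. Its name and signature are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy matcher for the one pattern (?<=a)b. Returns the offset of the first
   match at or after startoffset, or -1 if there is none. */
static long find_b_after_a(const char *subject, size_t length,
                           size_t startoffset)
{
    assert(subject != NULL && startoffset <= length);
    for (size_t i = startoffset; i < length; i++) {
        /* The lookbehind may inspect subject[i - 1] even when
           i == startoffset, because the whole subject is still visible. */
        if (subject[i] == 'b' && i > 0 && subject[i - 1] == 'a')
            return (long)i;
    }
    return -1;
}
```

Calling find_b_after_a("ab", 2, 1) finds the 'b' at offset 1, whereas passing the incremented pointer "ab" + 1 with length 1 and offset 0 finds nothing, because the preceding 'a' is no longer part of the subject.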
|
2950 |
|
2951 |
|
2952 Version 2.05 21-Apr-99 |
|
2953 ---------------------- |
|
2954 |
|
2955 1. Changed the type of magic_number from int to long int so that it works |
|
2956 properly on 16-bit systems. |
|
2957 |
|
2958 2. Fixed a bug which caused patterns starting with .* not to work correctly |
|
2959 when the subject string contained newline characters. PCRE was assuming |
|
2960 anchoring for such patterns in all cases, which is not correct because .* will |
|
2961 not pass a newline unless PCRE_DOTALL is set. It now assumes anchoring only if |
|
2962 DOTALL is set at top level; otherwise it knows that patterns starting with .* |
|
2963 must be retried after every newline in the subject. |
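
Note on change 2 above: without DOTALL, a leading .* stops at a newline, so after a failure the next useful starting point is just past the next newline. A minimal sketch of that step follows; the helper name is hypothetical and this is not the code PCRE uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Finds the start of the line after position pos. Returns 1 and stores the
   offset in *next if a newline exists at or after pos, otherwise returns 0. */
static int next_line_start(const char *subject, size_t length, size_t pos,
                           size_t *next)
{
    assert(subject != NULL && next != NULL && pos <= length);
    const char *nl = (const char *)memchr(subject + pos, '\n', length - pos);
    if (nl == NULL)
        return 0;   /* no further line to retry on */
    *next = (size_t)(nl - subject) + 1;
    return 1;
}
```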
|
2964 |
|
2965 |
|
2966 Version 2.04 18-Feb-99 |
|
2967 ---------------------- |
|
2968 |
|
2969 1. For parenthesized subpatterns with repeats whose minimum was zero, the |
|
2970 computation of the store needed to hold the pattern was incorrect (too large). |
|
2971 If such patterns were nested a few deep, this could multiply and become a real |
|
2972 problem. |
|
2973 |
|
2974 2. Added /M option to pcretest to show the memory requirement of a specific |
|
2975 pattern. Made -m a synonym of -s (which does this globally) for compatibility. |
|
2976 |
|
2977 3. Subpatterns of the form (regex){n,m} (i.e. limited maximum) were being |
|
2978 compiled in such a way that the backtracking after subsequent failure was |
|
2979 pessimal. Something like (a){0,3} was compiled as (a)?(a)?(a)? instead of |
|
2980 ((a)((a)(a)?)?)? with disastrous performance if the maximum was of any size. |
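
Note on change 3 above: the nested form can be generated mechanically. The sketch below builds the text of the nested expansion for item{0,max}; it only illustrates the shape of the rewrite and says nothing about how PCRE emits compiled code. Both function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Appends src to the string in dst (capacity cap, terminator included).
   Returns 0 on success, -1 if the result would not fit. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst);
    size_t need = strlen(src);
    if (used + need + 1 > cap)
        return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Writes the nested expansion of item{0,max} into out, for max >= 1. */
static int expand_nested(const char *item, unsigned max, char *out, size_t cap)
{
    assert(item != NULL && out != NULL);
    if (cap == 0 || max == 0)
        return -1;
    out[0] = '\0';
    /* Open one enclosing group for every repetition except the innermost. */
    for (unsigned i = 1; i < max; i++) {
        if (append(out, cap, "(") != 0 || append(out, cap, item) != 0)
            return -1;
    }
    /* The innermost repetition is simply optional. */
    if (append(out, cap, item) != 0 || append(out, cap, "?") != 0)
        return -1;
    /* Close each enclosing group and make it optional too. */
    for (unsigned i = 1; i < max; i++) {
        if (append(out, cap, ")?") != 0)
            return -1;
    }
    return 0;
}
```

In the nested form, once one repetition fails to match, everything inside it is skipped in a single step, whereas the flat form (a)?(a)?(a)? has to try each remaining optional group in turn.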
|
2981 |
|
2982 |
|
2983 Version 2.03 02-Feb-99 |
|
2984 ---------------------- |
|
2985 |
|
2986 1. Fixed typo and small mistake in man page. |
|
2987 |
|
2988 2. Added 4th condition (GPL supersedes if conflict) and created separate |
|
2989 LICENCE file containing the conditions. |
|
2990 |
|
2991 3. Updated pcretest so that patterns such as /abc\/def/ work like they do in |
|
2992 Perl, that is the internal \ allows the delimiter to be included in the |
|
2993 pattern. Locked out the use of \ as a delimiter. If \ immediately follows |
|
2994 the final delimiter, add \ to the end of the pattern (to test the error). |
|
2995 |
|
2996 4. Added the convenience functions for extracting substrings after a successful |
|
2997 match. Updated pcretest to make it able to test these functions. |
|
2998 |
|
2999 |
|
3000 Version 2.02 14-Jan-99 |
|
3001 ---------------------- |
|
3002 |
|
3003 1. Initialized the working variables associated with each extraction so that |
|
3004 their saving and restoring doesn't refer to uninitialized store. |
|
3005 |
|
3006 2. Put dummy code into study.c in order to trick the optimizer of the IBM C |
|
3007 compiler for OS/2 into generating correct code. Apparently IBM isn't going to |
|
3008 fix the problem. |
|
3009 |
|
3010 3. Pcretest: the timing code wasn't using LOOPREPEAT for timing execution |
|
3011 calls, and wasn't printing the correct value for compiling calls. Increased the |
|
3012 default value of LOOPREPEAT, and the number of significant figures in the |
|
3013 times. |
|
3014 |
|
3015 4. Changed "/bin/rm" in the Makefile to "-rm" so it works on Windows NT. |
|
3016 |
|
3017 5. Renamed "deftables" as "dftables" to get it down to 8 characters, to avoid |
|
3018 a building problem on Windows NT with a FAT file system. |
|
3019 |
|
3020 |
|
3021 Version 2.01 21-Oct-98 |
|
3022 ---------------------- |
|
3023 |
|
3024 1. Changed the API for pcre_compile() to allow for the provision of a pointer |
|
3025 to character tables built by pcre_maketables() in the current locale. If NULL |
|
3026 is passed, the default tables are used. |
|
3027 |
|
3028 |
|
3029 Version 2.00 24-Sep-98 |
|
3030 ---------------------- |
|
3031 |
|
3032 1. Since the (?>) facility is in Perl 5.005, don't require PCRE_EXTRA to enable |
|
3033 it any more. |
|
3034 |
|
3035 2. Allow quantification of (?>) groups, and make it work correctly. |
|
3036 |
|
3037 3. The first character computation wasn't working for (?>) groups. |
|
3038 |
|
3039 4. Correct the implementation of \Z (it is permitted to match on the \n at the |
|
3040 end of the subject) and add 5.005's \z, which really does match only at the |
|
3041 very end of the subject. |
|
3042 |
|
3043 5. Remove the \X "cut" facility; Perl doesn't have it, and (?> is neater. |
|
3044 |
|
3045 6. Remove the ability to specify CASELESS, MULTILINE, DOTALL, and |
|
3046 DOLLAR_END_ONLY at runtime, to make it possible to implement the Perl 5.005 |
|
3047 localized options. All options to pcre_study() were also removed. |
|
3048 |
|
3049 7. Add other new features from 5.005: |
|
3050 |
|
3051 (?<=            positive lookbehind |
|
3052 (?<!            negative lookbehind |
|
3053 (?imsx-imsx)    added the unsetting capability |
|
3054                 such a setting is global if at outer level; local otherwise |
|
3055 (?imsx-imsx:)   non-capturing groups with option setting |
|
3056 (?(cond)re|re)  conditional pattern matching |
|
3057 |
|
3058 A backreference to itself in a repeated group matches the previous |
|
3059 captured string. |
|
3060 |
|
3061 8. General tidying up of studying (both automatic and via "study") |
|
3062 consequential on the addition of new assertions. |
|
3063 |
|
3064 9. As in 5.005, unlimited repeated groups that could match an empty substring |
|
3065 are no longer faulted at compile time. Instead, the loop is forcibly broken at |
|
3066 runtime if any iteration does actually match an empty substring. |
|
3067 |
|
3068 10. Include the RunTest script in the distribution. |
|
3069 |
|
3070 11. Added tests from the Perl 5.005_02 distribution. This showed up a few |
|
3071 discrepancies, some of which were old and were also with respect to 5.004. They |
|
3072 have now been fixed. |
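
Note on change 4 above: the two assertions differ only in how they treat a final newline. The following sketch expresses each as a position test on the subject. The function names are hypothetical and the code is an illustration, not an extract from PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* \z: true only at the very end of the subject. */
static int at_very_end(size_t length, size_t pos)
{
    return pos == length;
}

/* \Z: true at the very end, or just before a newline that is the last
   character of the subject. */
static int at_end_or_before_final_newline(const char *subject, size_t length,
                                          size_t pos)
{
    assert(subject != NULL && pos <= length);
    if (pos == length)
        return 1;
    return pos + 1 == length && subject[pos] == '\n';
}
```

For the subject "abc\n", position 3 satisfies the \Z test but not the \z test, and position 4 satisfies both.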
|
3073 |
|
3074 |
|
3075 Version 1.09 28-Apr-98 |
|
3076 ---------------------- |
|
3077 |
|
3078 1. A negated single character class followed by a quantifier with a minimum |
|
3079 value of one (e.g. [^x]{1,6} ) was not compiled correctly. This could lead to |
|
3080 program crashes, or just wrong answers. This did not apply to negated classes |
|
3081 containing more than one character, or to minima other than one. |
|
3082 |
|
3083 |
|
3084 Version 1.08 27-Mar-98 |
|
3085 ---------------------- |
|
3086 |
|
3087 1. Add PCRE_UNGREEDY to invert the greediness of quantifiers. |
|
3088 |
|
3089 2. Add (?U) and (?X) to set PCRE_UNGREEDY and PCRE_EXTRA respectively. The |
|
3090 latter must appear before anything that relies on it in the pattern. |
|
3091 |
|
3092 |
|
3093 Version 1.07 16-Feb-98 |
|
3094 ---------------------- |
|
3095 |
|
3096 1. A pattern such as /((a)*)*/ was not being diagnosed as in error (unlimited |
|
3097 repeat of a potentially empty string). |
|
3098 |
|
3099 |
|
3100 Version 1.06 23-Jan-98 |
|
3101 ---------------------- |
|
3102 |
|
3103 1. Added Markus Oberhumer's little patches for C++. |
|
3104 |
|
3105 2. Literal strings longer than 255 characters were broken. |
|
3106 |
|
3107 |
|
3108 Version 1.05 23-Dec-97 |
|
3109 ---------------------- |
|
3110 |
|
3111 1. Negated character classes containing more than one character were failing if |
|
3112 PCRE_CASELESS was set at run time. |
|
3113 |
|
3114 |
|
3115 Version 1.04 19-Dec-97 |
|
3116 ---------------------- |
|
3117 |
|
3118 1. Corrected the man page, where some "const" qualifiers had been omitted. |
|
3119 |
|
3120 2. Made debugging output print "{0,xxx}" instead of just "{,xxx}" to agree with |
|
3121 input syntax. |
|
3122 |
|
3123 3. Fixed memory leak which occurred when a regex with back references was |
|
3124 matched with an offsets vector that wasn't big enough. The temporary memory |
|
3125 that is used in this case wasn't being freed if the match failed. |
|
3126 |
|
3127 4. Tidied pcretest to ensure it frees memory that it gets. |
|
3128 |
|
3129 5. Temporary memory was being obtained in the case where the passed offsets |
|
3130 vector was exactly big enough. |
|
3131 |
|
3132 6. Corrected definition of offsetof() from change 5 below. |
|
3133 |
|
3134 7. I had screwed up change 6 below and broken the rules for the use of |
|
3135 setjmp(). Now fixed. |
|
3136 |
|
3137 |
|
3138 Version 1.03 18-Dec-97 |
|
3139 ---------------------- |
|
3140 |
|
3141 1. An erroneous regex with a missing opening parenthesis was correctly |
|
3142 diagnosed, but PCRE attempted to access brastack[-1], which could cause crashes |
|
3143 on some systems. |
|
3144 |
|
3145 2. Replaced offsetof(real_pcre, code) by offsetof(real_pcre, code[0]) because |
|
3146 it was reported that one broken compiler failed on the former because "code" is |
|
3147 also an independent variable. |
|
3148 |
|
3149 3. The erroneous regex a[]b caused an array overrun reference. |
|
3150 |
|
3151 4. A regex ending with a one-character negative class (e.g. /[^k]$/) did not |
|
3152 fail on data ending with that character. (It was going on too far, and checking |
|
3153 the next character, typically a binary zero.) This was specific to the |
|
3154 optimized code for single-character negative classes. |
|
3155 |
|
3156 5. Added a contributed patch from the TIN world which does the following: |
|
3157 |
|
3158 + Add an undef for memmove, in case the system defines a macro for it. |
|
3159 |
|
3160 + Add a definition of offsetof(), in case there isn't one. (I don't know |
|
3161 the reason behind this - offsetof() is part of the ANSI standard - but |
|
3162 it does no harm). |
|
3163 |
|
3164 + Reduce the ifdef's in pcre.c using macro DPRINTF, thereby eliminating |
|
3165 most of the places where whitespace preceded '#'. I have given up and |
|
3166 allowed the remaining 2 cases to be at the margin. |
|
3167 |
|
3168 + Rename some variables in pcre to eliminate shadowing. This seems very |
|
3169 pedantic, but does no harm, of course. |
|
3170 |
|
3171 6. Moved the call to setjmp() into its own function, to get rid of warnings |
|
3172 from gcc -Wall, and avoided calling it at all unless PCRE_EXTRA is used. |
|
3173 |
|
3174 7. Constructs such as \d{8,} were compiling into the equivalent of |
|
3175 \d{8}\d{0,65527} instead of \d{8}\d* which didn't make much difference to the |
|
3176 outcome, but in this particular case used more store than had been allocated, |
|
3177 which caused the bug to be discovered because it threw up an internal error. |
|
3178 |
|
3179 8. The debugging code in both pcre and pcretest for outputting the compiled |
|
3180 form of a regex was going wrong in the case of back references followed by |
|
3181 curly-bracketed repeats. |
|
3182 |
|
3183 |
|
3184 Version 1.02 12-Dec-97 |
|
3185 ---------------------- |
|
3186 |
|
3187 1. Typos in pcre.3 and comments in the source fixed. |
|
3188 |
|
3189 2. Applied a contributed patch to get rid of places where it used to remove |
|
3190 'const' from variables, and fixed some signed/unsigned and uninitialized |
|
3191 variable warnings. |
|
3192 |
|
3193 3. Added the "runtest" target to Makefile. |
|
3194 |
|
3195 4. Set default compiler flag to -O2 rather than just -O. |
|
3196 |
|
3197 |
|
3198 Version 1.01 19-Nov-97 |
|
3199 ---------------------- |
|
3200 |
|
3201 1. PCRE was failing to diagnose unlimited repeat of empty string for patterns |
|
3202 like /([ab]*)*/, that is, for classes with more than one character in them. |
|
3203 |
|
3204 2. Likewise, it wasn't diagnosing patterns with "once-only" subpatterns, such |
|
3205 as /((?>a*))*/ (a PCRE_EXTRA facility). |
|
3206 |
|
3207 |
|
3208 Version 1.00 18-Nov-97 |
|
3209 ---------------------- |
|
3210 |
|
3211 1. Added compile-time macros to support systems such as SunOS4 which don't have |
|
3212 memmove() or strerror() but have other things that can be used instead. |
|
3213 |
|
3214 2. Arranged that "make clean" removes the executables. |
|
3215 |
|
3216 |
|
3217 Version 0.99 27-Oct-97 |
|
3218 ---------------------- |
|
3219 |
|
3220 1. Fixed bug in code for optimizing classes with only one character. It was |
|
3221 initializing a 32-byte map regardless, which could cause it to run off the end |
|
3222 of the memory it had got. |
|
3223 |
|
3224 2. Added, conditional on PCRE_EXTRA, the proposed (?>REGEX) construction. |
|
3225 |
|
3226 |
|
3227 Version 0.98 22-Oct-97 |
|
3228 ---------------------- |
|
3229 |
|
3230 1. Fixed bug in code for handling temporary memory usage when there are more |
|
3231 back references than supplied space in the ovector. This could cause segfaults. |
|
3232 |
|
3233 |
|
3234 Version 0.97 21-Oct-97 |
|
3235 ---------------------- |
|
3236 |
|
3237 1. Added the \X "cut" facility, conditional on PCRE_EXTRA. |
|
3238 |
|
3239 2. Optimized negated single characters not to use a bit map. |
|
3240 |
|
3241 3. Brought error texts together as macro definitions; clarified some of them; |
|
3242 fixed one that was wrong - it said "range out of order" when it meant "invalid |
|
3243 escape sequence". |
|
3244 |
|
3245 4. Changed some char * arguments to const char *. |
|
3246 |
|
3247 5. Added PCRE_NOTBOL and PCRE_NOTEOL (from POSIX). |
|
3248 |
|
3249 6. Added the POSIX-style API wrapper in pcreposix.a and testing facilities in |
|
3250 pcretest. |
|
3251 |
|
3252 |
|
3253 Version 0.96 16-Oct-97 |
|
3254 ---------------------- |
|
3255 |
|
3256 1. Added a simple "pgrep" utility to the distribution. |
|
3257 |
|
3258 2. Fixed an incompatibility with Perl: "{" is now treated as a normal character |
|
3259 unless it appears in one of the precise forms "{ddd}", "{ddd,}", or "{ddd,ddd}" |
|
3260 where "ddd" means "one or more decimal digits". |
|
3261 |
|
3262 3. Fixed serious bug. If a pattern had a back reference, but the call to |
|
3263 pcre_exec() didn't supply a large enough ovector to record the related |
|
3264 identifying subpattern, the match always failed. PCRE now remembers the number |
|
3265 of the largest back reference, and gets some temporary memory in which to save |
|
3266 the offsets during matching if necessary, in order to ensure that |
|
3267 backreferences always work. |
|
3268 |
|
3269 4. Increased the compatibility with Perl in a number of ways: |
|
3270 |
|
3271 (a) . no longer matches \n by default; an option PCRE_DOTALL is provided |
|
3272 to request this handling. The option can be set at compile or exec time. |
|
3273 |
|
3274 (b) $ matches before a terminating newline by default; an option |
|
3275 PCRE_DOLLAR_ENDONLY is provided to override this (but not in multiline |
|
3276 mode). The option can be set at compile or exec time. |
|
3277 |
|
3278 (c) The handling of \ followed by a digit other than 0 is now supposed to be |
|
3279 the same as Perl's. If the decimal number it represents is less than 10 |
|
3280 or there are at least that many previous capturing left parentheses, it |
|
3281 is a back reference; otherwise an octal escape is read. Inside a |
|
3282 character class, it's always an octal escape, even if it is a single digit. |
|
3283 |
|
3284 (d) An escaped but undefined alphabetic character is taken as a literal, |
|
3285 unless PCRE_EXTRA is set. Currently this just reserves the remaining |
|
3286 escapes. |
|
3287 |
|
3288 (e) {0} is now permitted. (The previous item is removed from the compiled |
|
3289 pattern). |
|
3290 |
|
3291 5. Changed all the names of code files so that the basic parts are no longer |
|
3292 than 10 characters, and abolished the teeny "globals.c" file. |
|
3293 |
|
3294 6. Changed the handling of character classes; they are now done with a 32-byte |
|
3295 bit map always. |
|
3296 |
|
3297 7. Added the -d and /D options to pcretest to make it possible to look at the |
|
3298 internals of compilation without having to recompile pcre. |
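
Note on change 2 above: the three accepted forms can be checked with a small scanner. This is a sketch under the rule as stated in that item; the function names are invented, and it uses explicit ASCII digit tests so that the result does not depend on the locale.

```c
#include <assert.h>
#include <stddef.h>

/* Locale-independent test for a decimal digit. */
static int is_dec_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Returns 1 if the text at p has one of the forms {ddd}, {ddd,} or
   {ddd,ddd}, where ddd is one or more decimal digits. In any other case
   the '{' would be treated as an ordinary character. */
static int is_counted_repeat(const char *p)
{
    assert(p != NULL);
    if (*p++ != '{')
        return 0;
    if (!is_dec_digit(*p))
        return 0;               /* at least one digit is required */
    while (is_dec_digit(*p))
        p++;
    if (*p == '}')
        return 1;               /* {ddd} */
    if (*p++ != ',')
        return 0;
    if (*p == '}')
        return 1;               /* {ddd,} */
    if (!is_dec_digit(*p))
        return 0;
    while (is_dec_digit(*p))
        p++;
    return *p == '}';           /* {ddd,ddd} */
}
```

Under this rule "{3,5}" is a quantifier, while "{,5}", "{}" and "{ 3}" are all literal text.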
|
3299 |
|
3300 |
|
3301 Version 0.95 23-Sep-97 |
|
3302 ---------------------- |
|
3303 |
|
3304 1. Fixed bug in pre-pass concerning escaped "normal" characters such as \x5c or |
|
3305 \x20 at the start of a run of normal characters. These were being treated as |
|
3306 real characters, instead of the source characters being re-checked. |
|
3307 |
|
3308 |
|
3309 Version 0.94 18-Sep-97 |
|
3310 ---------------------- |
|
3311 |
|
3312 1. The functions are now thread-safe, with the caveat that the global variables |
|
3313 containing pointers to malloc() and free() or alternative functions are the |
|
3314 same for all threads. |
|
3315 |
|
3316 2. Get pcre_study() to generate a bitmap of initial characters for non- |
|
3317 anchored patterns when this is possible, and use it if passed to pcre_exec(). |
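
Note on change 2 above: a bitmap of initial characters needs one bit per possible byte value, which is 256 bits or 32 bytes. The sketch below shows how such a map could be filled and then used to skip subject positions that cannot start a match. The type and function names are hypothetical; the real study data is not laid out by this code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One bit per byte value: 256 bits in 32 bytes. */
typedef struct { unsigned char bits[32]; } start_bitmap;

static void bitmap_clear(start_bitmap *m)
{
    memset(m->bits, 0, sizeof m->bits);
}

static void bitmap_set(start_bitmap *m, unsigned char c)
{
    m->bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

static int bitmap_test(const start_bitmap *m, unsigned char c)
{
    return (m->bits[c / 8] >> (c % 8)) & 1;
}

/* Returns the first offset at or after start whose byte could begin a
   match, or length if there is no such byte. */
static size_t skip_to_candidate(const start_bitmap *m, const char *subject,
                                size_t length, size_t start)
{
    assert(m != NULL && subject != NULL && start <= length);
    while (start < length && !bitmap_test(m, (unsigned char)subject[start]))
        start++;
    return start;
}
```

For a pattern such as /b|:+/ the map would have only the bits for 'b' and ':' set, so a matcher could jump straight over every other byte of the subject.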
|
3318 |
|
3319 |
|
3320 Version 0.93 15-Sep-97 |
|
3321 ---------------------- |
|
3322 |
|
3323 1. /(b)|(:+)/ was computing an incorrect first character. |
|
3324 |
|
3325 2. Add pcre_study() to the API and the passing of pcre_extra to pcre_exec(), |
|
3326 but not actually doing anything yet. |
|
3327 |
|
3328 3. Treat "-" characters in classes that cannot be part of ranges as literals, |
|
3329 as Perl does (e.g. [-az] or [az-]). |
|
3330 |
|
3331 4. Set the anchored flag if a branch starts with .* or .*? because that tests |
|
3332 all possible positions. |
|
3333 |
|
3334 5. Split up into different modules to avoid including unneeded functions in a |
|
3335 compiled binary. However, compile and exec are still in one module. The "study" |
|
3336 function is split off. |
|
3337 |
|
3338 6. The character tables are now in a separate module whose source is generated |
|
3339 by an auxiliary program - but can then be edited by hand if required. There are |
|
3340 now no calls to isalnum(), isspace(), isdigit(), isxdigit(), tolower() or |
|
3341 toupper() in the code. |
|
3342 |
|
3343 7. Turn the malloc/free function variables into pcre_malloc and pcre_free and |
|
3344 make them global. Abolish the function for setting them, as the caller can now |
|
3345 set them directly. |
|
3346 |
|
3347 |
|
3348 Version 0.92 11-Sep-97 |
|
3349 ---------------------- |
|
3350 |
|
3351 1. A repeat with a fixed maximum and a minimum of 1 for an ordinary character |
|
3352 (e.g. /a{1,3}/) was broken (I mis-optimized it). |
|
3353 |
|
3354 2. Caseless matching was not working in character classes if the characters in |
|
3355 the pattern were in upper case. |
|
3356 |
|
3357 3. Make ranges like [W-c] work in the same way as Perl for caseless matching. |
|
3358 |
|
3359 4. Make PCRE_ANCHORED public and accept as a compile option. |
|
3360 |
|
3361 5. Add an options word to pcre_exec() and accept PCRE_ANCHORED and |
|
3362 PCRE_CASELESS at run time. Add escapes \A and \I to pcretest to cause it to |
|
3363 pass them. |
|
3364 |
|
3365 6. Give an error if bad option bits passed at compile or run time. |
|
3366 |
|
3367 7. Add PCRE_MULTILINE at compile and exec time, and (?m) as well. Add \M to |
|
3368 pcretest to cause it to pass that flag. |
|
3369 |
|
3370 8. Add pcre_info(), to get the number of identifying subpatterns, the stored |
|
3371 options, and the first character, if set. |
|
3372 |
|
3373 9. Recognize C+ or C{n,m} where n >= 1 as providing a fixed starting character. |
|
3374 |
|
3375 |
|
3376 Version 0.91 10-Sep-97 |
|
3377 ---------------------- |
|
3378 |
|
3379 1. PCRE was failing to diagnose unlimited repeats of subpatterns that could |
|
3380 match the empty string as in /(a*)*/. It was looping and ultimately crashing. |
|
3381 |
|
3382 2. PCRE was looping on encountering an indefinitely repeated back reference to |
|
3383 a subpattern that had matched an empty string, e.g. /(a|)\1*/. It now does what |
|
3384 Perl does - treats the match as successful. |
|
3385 |
|
3386 **** |