libraries/spcre/libpcre/pcre/doc/html/pcresyntax.html
changeset 0 7f656887cf89
       
     1 <html>
       
     2 <head>
       
     3 <title>pcresyntax specification</title>
       
     4 </head>
       
     5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
       
     6 <h1>pcresyntax man page</h1>
       
     7 <p>
       
     8 Return to the <a href="index.html">PCRE index page</a>.
       
     9 </p>
       
    10 <p>
       
    11 This page is part of the PCRE HTML documentation. It was generated automatically
       
    12 from the original man page. If there is any nonsense in it, please consult the
       
    13 man page, in case the conversion went wrong.
       
    14 <br>
       
    15 <ul>
       
    16 <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
       
    17 <li><a name="TOC2" href="#SEC2">QUOTING</a>
       
    18 <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
       
    19 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
       
    20 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTY CODES FOR \p and \P</a>
       
    21 <li><a name="TOC6" href="#SEC6">SCRIPT NAMES FOR \p AND \P</a>
       
    22 <li><a name="TOC7" href="#SEC7">CHARACTER CLASSES</a>
       
    23 <li><a name="TOC8" href="#SEC8">QUANTIFIERS</a>
       
    24 <li><a name="TOC9" href="#SEC9">ANCHORS AND SIMPLE ASSERTIONS</a>
       
    25 <li><a name="TOC10" href="#SEC10">MATCH POINT RESET</a>
       
    26 <li><a name="TOC11" href="#SEC11">ALTERNATION</a>
       
    27 <li><a name="TOC12" href="#SEC12">CAPTURING</a>
       
    28 <li><a name="TOC13" href="#SEC13">ATOMIC GROUPS</a>
       
    29 <li><a name="TOC14" href="#SEC14">COMMENT</a>
       
    30 <li><a name="TOC15" href="#SEC15">OPTION SETTING</a>
       
    31 <li><a name="TOC16" href="#SEC16">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
       
    32 <li><a name="TOC17" href="#SEC17">BACKREFERENCES</a>
       
    33 <li><a name="TOC18" href="#SEC18">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
       
    34 <li><a name="TOC19" href="#SEC19">CONDITIONAL PATTERNS</a>
       
    35 <li><a name="TOC20" href="#SEC20">BACKTRACKING CONTROL</a>
       
    36 <li><a name="TOC21" href="#SEC21">NEWLINE CONVENTIONS</a>
       
    37 <li><a name="TOC22" href="#SEC22">WHAT \R MATCHES</a>
       
    38 <li><a name="TOC23" href="#SEC23">CALLOUTS</a>
       
    39 <li><a name="TOC24" href="#SEC24">SEE ALSO</a>
       
    40 <li><a name="TOC25" href="#SEC25">AUTHOR</a>
       
    41 <li><a name="TOC26" href="#SEC26">REVISION</a>
       
    42 </ul>
       
    43 <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
       
    44 <P>
       
    45 The full syntax and semantics of the regular expressions that are supported by
       
    46 PCRE are described in the
       
    47 <a href="pcrepattern.html"><b>pcrepattern</b></a>
       
    48 documentation. This document contains just a quick-reference summary of the
       
    49 syntax.
       
    50 </P>
       
    51 <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
       
    52 <P>
       
    53 <pre>
       
    54   \x         where x is non-alphanumeric is a literal x
       
    55   \Q...\E    treat enclosed characters as literal
       
    56 </PRE>
       
    57 </P>
       
    58 <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
       
    59 <P>
       
    60 <pre>
       
    61   \a         alarm, that is, the BEL character (hex 07)
       
    62   \cx        "control-x", where x is any character
       
    63   \e         escape (hex 1B)
       
    64   \f         formfeed (hex 0C)
       
    65   \n         newline (hex 0A)
       
    66   \r         carriage return (hex 0D)
       
    67   \t         tab (hex 09)
       
    68   \ddd       character with octal code ddd, or backreference
       
    69   \xhh       character with hex code hh
       
    70   \x{hhh..}  character with hex code hhh..
       
    71 </PRE>
       
    72 </P>
       
    73 <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
       
    74 <P>
       
    75 <pre>
       
    76   .          any character except newline;
       
    77                in dotall mode, any character whatsoever
       
    78   \C         one byte, even in UTF-8 mode (best avoided)
       
    79   \d         a decimal digit
       
    80   \D         a character that is not a decimal digit
       
    81   \h         a horizontal whitespace character
       
    82   \H         a character that is not a horizontal whitespace character
       
    83   \p{<i>xx</i>}     a character with the <i>xx</i> property
       
    84   \P{<i>xx</i>}     a character without the <i>xx</i> property
       
    85   \R         a newline sequence
       
    86   \s         a whitespace character
       
    87   \S         a character that is not a whitespace character
       
    88   \v         a vertical whitespace character
       
    89   \V         a character that is not a vertical whitespace character
       
    90   \w         a "word" character
       
    91   \W         a "non-word" character
       
    92   \X         an extended Unicode sequence
       
    93 </pre>
       
    94 In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
       
    95 </P>
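<P>
The ASCII-only note above can be illustrated with a minimal sketch. It assumes
Python's standard <b>re</b> module as a stand-in rather than PCRE itself; the
subject string is made up for the example. Python treats \d as Unicode-aware by
default, and its re.ASCII flag gives the ASCII-only behaviour described here.
</P>

```python
import re

# "42" in ASCII digits, then the same number in Arabic-Indic digits
# (U+0664, U+0662). The subject is a hypothetical example.
subject = "42 \u0664\u0662"

# Python 3 str patterns are Unicode-aware by default, so \d also matches
# the Arabic-Indic digits.
unicode_digits = re.findall(r"\d+", subject)

# re.ASCII restricts \d, \s and \w to ASCII characters, which is the
# behaviour this summary describes for PCRE.
ascii_digits = re.findall(r"\d+", subject, re.ASCII)

print(unicode_digits)
print(ascii_digits)
```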
       
    96 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTY CODES FOR \p and \P</a><br>
       
    97 <P>
       
    98 <pre>
       
    99   C          Other
       
   100   Cc         Control
       
   101   Cf         Format
       
   102   Cn         Unassigned
       
   103   Co         Private use
       
   104   Cs         Surrogate
       
   105 
       
   106   L          Letter
       
   107   Ll         Lower case letter
       
   108   Lm         Modifier letter
       
   109   Lo         Other letter
       
   110   Lt         Title case letter
       
   111   Lu         Upper case letter
       
   112   L&         Ll, Lu, or Lt
       
   113 
       
   114   M          Mark
       
   115   Mc         Spacing mark
       
   116   Me         Enclosing mark
       
   117   Mn         Non-spacing mark
       
   118 
       
   119   N          Number
       
   120   Nd         Decimal number
       
   121   Nl         Letter number
       
   122   No         Other number
       
   123 
       
   124   P          Punctuation
       
   125   Pc         Connector punctuation
       
   126   Pd         Dash punctuation
       
   127   Pe         Close punctuation
       
   128   Pf         Final punctuation
       
   129   Pi         Initial punctuation
       
   130   Po         Other punctuation
       
   131   Ps         Open punctuation
       
   132 
       
   133   S          Symbol
       
   134   Sc         Currency symbol
       
   135   Sk         Modifier symbol
       
   136   Sm         Mathematical symbol
       
   137   So         Other symbol
       
   138 
       
   139   Z          Separator
       
   140   Zl         Line separator
       
   141   Zp         Paragraph separator
       
   142   Zs         Space separator
       
   143 </PRE>
       
   144 </P>
       
   145 <br><a name="SEC6" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
       
   146 <P>
       
   147 Arabic,
       
   148 Armenian,
       
   149 Balinese,
       
   150 Bengali,
       
   151 Bopomofo,
       
   152 Braille,
       
   153 Buginese,
       
   154 Buhid,
       
   155 Canadian_Aboriginal,
       
   156 Cherokee,
       
   157 Common,
       
   158 Coptic,
       
   159 Cuneiform,
       
   160 Cypriot,
       
   161 Cyrillic,
       
   162 Deseret,
       
   163 Devanagari,
       
   164 Ethiopic,
       
   165 Georgian,
       
   166 Glagolitic,
       
   167 Gothic,
       
   168 Greek,
       
   169 Gujarati,
       
   170 Gurmukhi,
       
   171 Han,
       
   172 Hangul,
       
   173 Hanunoo,
       
   174 Hebrew,
       
   175 Hiragana,
       
   176 Inherited,
       
   177 Kannada,
       
   178 Katakana,
       
   179 Kharoshthi,
       
   180 Khmer,
       
   181 Lao,
       
   182 Latin,
       
   183 Limbu,
       
   184 Linear_B,
       
   185 Malayalam,
       
   186 Mongolian,
       
   187 Myanmar,
       
   188 New_Tai_Lue,
       
   189 Nko,
       
   190 Ogham,
       
   191 Old_Italic,
       
   192 Old_Persian,
       
   193 Oriya,
       
   194 Osmanya,
       
   195 Phags_Pa,
       
   196 Phoenician,
       
   197 Runic,
       
   198 Shavian,
       
   199 Sinhala,
       
   200 Syloti_Nagri,
       
   201 Syriac,
       
   202 Tagalog,
       
   203 Tagbanwa,
       
   204 Tai_Le,
       
   205 Tamil,
       
   206 Telugu,
       
   207 Thaana,
       
   208 Thai,
       
   209 Tibetan,
       
   210 Tifinagh,
       
   211 Ugaritic,
       
   212 Yi.
       
   213 </P>
       
   214 <br><a name="SEC7" href="#TOC1">CHARACTER CLASSES</a><br>
       
   215 <P>
       
   216 <pre>
       
   217   [...]       positive character class
       
   218   [^...]      negative character class
       
   219   [x-y]       range (can be used for hex characters)
       
   220   [[:xxx:]]   positive POSIX named set
       
   221   [[:^xxx:]]  negative POSIX named set
       
   222 
       
   223   alnum       alphanumeric
       
   224   alpha       alphabetic
       
   225   ascii       0-127
       
   226   blank       space or tab
       
   227   cntrl       control character
       
   228   digit       decimal digit
       
   229   graph       printing, excluding space
       
   230   lower       lower case letter
       
   231   print       printing, including space
       
   232   punct       printing, excluding alphanumeric
       
   233   space       whitespace
       
   234   upper       upper case letter
       
   235   word        same as \w
       
   236   xdigit      hexadecimal digit
       
   237 </pre>
       
   238 In PCRE, POSIX character set names recognize only ASCII characters. You can use
       
   239 \Q...\E inside a character class.
       
   240 </P>
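<P>
As a rough illustration of positive classes, negative classes and ranges, here
is a small sketch using Python's standard <b>re</b> module as a stand-in for
PCRE. The POSIX named sets are left out because Python's <b>re</b> does not
support them; the subject string is hypothetical.
</P>

```python
import re

subject = "ff 1G 0x1A"

# [x-y] ranges combined in one positive class: runs of hexadecimal digits.
hex_runs = re.findall(r"[0-9A-Fa-f]+", subject)

# [^...] negative class: every character that is neither a hexadecimal
# digit nor a space.
non_hex = re.findall(r"[^0-9A-Fa-f ]", subject)

print(hex_runs)
print(non_hex)
```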
       
   241 <br><a name="SEC8" href="#TOC1">QUANTIFIERS</a><br>
       
   242 <P>
       
   243 <pre>
       
   244   ?           0 or 1, greedy
       
   245   ?+          0 or 1, possessive
       
   246   ??          0 or 1, lazy
       
   247   *           0 or more, greedy
       
   248   *+          0 or more, possessive
       
   249   *?          0 or more, lazy
       
   250   +           1 or more, greedy
       
   251   ++          1 or more, possessive
       
   252   +?          1 or more, lazy
       
   253   {n}         exactly n
       
   254   {n,m}       at least n, no more than m, greedy
       
   255   {n,m}+      at least n, no more than m, possessive
       
   256   {n,m}?      at least n, no more than m, lazy
       
   257   {n,}        n or more, greedy
       
   258   {n,}+       n or more, possessive
       
   259   {n,}?       n or more, lazy
       
   260 </PRE>
       
   261 </P>
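<P>
The difference between greedy and lazy quantifiers is easiest to see on a
concrete subject. The sketch below assumes Python's standard <b>re</b> module
as a stand-in for PCRE and uses only greedy, lazy and bounded forms;
possessive quantifiers are omitted because older Python versions do not
accept them.
</P>

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", subject).group(0)

# {n,m}: at least 2 and no more than 3 digits per match.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")

print(greedy)
print(lazy)
print(bounded)
```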
       
   262 <br><a name="SEC9" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
       
   263 <P>
       
   264 <pre>
       
   265   \b          word boundary
       
   266   \B          not a word boundary
       
   267   ^           start of subject
       
   268                also after internal newline in multiline mode
       
   269   \A          start of subject
       
   270   $           end of subject
       
   271                also before newline at end of subject
       
   272                also before internal newline in multiline mode
       
   273   \Z          end of subject
       
   274                also before newline at end of subject
       
   275   \z          end of subject
       
   276   \G          first matching position in subject
       
   277 </PRE>
       
   278 </P>
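<P>
A short sketch of ^, $ and \b, assuming Python's standard <b>re</b> module as
a stand-in for PCRE. Only anchors that behave the same way in both are shown;
\Z, \z and \G are left out because Python either defines them differently or
lacks them. The subject strings are made up for the example.
</P>

```python
import re

subject = "first line\nsecond line\n"

# Without multiline mode, ^ matches only at the start of the subject.
start_only = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
each_line = re.findall(r"^\w+", subject, re.MULTILINE)

# Without multiline mode, $ matches at the very end and also before a
# newline at the end of the subject, so this finds the second "line".
end_span = re.search(r"line$", subject).span()

# \b asserts a word boundary: "concat" and "cats" are not matched.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

print(start_only, each_line, end_span, whole_words)
```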
       
   279 <br><a name="SEC10" href="#TOC1">MATCH POINT RESET</a><br>
       
   280 <P>
       
   281 <pre>
       
   282   \K          reset start of match
       
   283 </PRE>
       
   284 </P>
       
   285 <br><a name="SEC11" href="#TOC1">ALTERNATION</a><br>
       
   286 <P>
       
   287 <pre>
       
   288   expr|expr|expr...
       
   289 </PRE>
       
   290 </P>
       
   291 <br><a name="SEC12" href="#TOC1">CAPTURING</a><br>
       
   292 <P>
       
   293 <pre>
       
   294   (...)          capturing group
       
   295   (?&#60;name&#62;...)   named capturing group (Perl)
       
   296   (?'name'...)   named capturing group (Perl)
       
   297   (?P&#60;name&#62;...)  named capturing group (Python)
       
   298   (?:...)        non-capturing group
       
   299   (?|...)        non-capturing group; reset group numbers for
       
   300                   capturing groups in each alternative
       
   301 </PRE>
       
   302 </P>
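<P>
The following sketch shows how numbered, named and non-capturing groups
interact. It assumes Python's standard <b>re</b> module as a stand-in for
PCRE, so only the Python-style (?P&#60;name&#62;...) form and (?:...) are
used; the date string is a hypothetical subject.
</P>

```python
import re

# Two named groups, then a non-capturing group that wraps an optional
# third (numbered) capturing group.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(\d{2}))?")

full = pattern.match("2008-04-09")
partial = pattern.match("2008-04")

# Named groups are also numbered; (?:...) does not consume a group number,
# so the day is group 3.
print(full.group("year"), full.group("month"), full.group(3))

# A group that took no part in the match is reported as None.
print(partial.groups())
```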
       
   303 <br><a name="SEC13" href="#TOC1">ATOMIC GROUPS</a><br>
       
   304 <P>
       
   305 <pre>
       
   306   (?&#62;...)        atomic, non-capturing group
       
   307 </PRE>
       
   308 </P>
       
   309 <br><a name="SEC14" href="#TOC1">COMMENT</a><br>
       
   310 <P>
       
   311 <pre>
       
   312   (?#....)       comment (not nestable)
       
   313 </PRE>
       
   314 </P>
       
   315 <br><a name="SEC15" href="#TOC1">OPTION SETTING</a><br>
       
   316 <P>
       
   317 <pre>
       
   318   (?i)           caseless
       
   319   (?J)           allow duplicate names
       
   320   (?m)           multiline
       
   321   (?s)           single line (dotall)
       
   322   (?U)           default ungreedy (lazy)
       
   323   (?x)           extended (ignore white space)
       
   324   (?-...)        unset option(s)
       
   325 </PRE>
       
   326 </P>
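<P>
Inline option settings can be tried out with a small sketch. It assumes
Python's standard <b>re</b> module as a stand-in for PCRE and covers only
(?i), (?m), (?s) and (?x), each placed at the start of the pattern; (?J),
(?U) and the bare (?-...) form are not shown because Python's <b>re</b> does
not accept them in this form.
</P>

```python
import re

# (?i) caseless matching.
caseless = re.search(r"(?i)pcre", "The PCRE library").group(0)

# (?m) multiline: ^ also matches after an internal newline.
line_starts = re.findall(r"(?m)^\w+", "one\ntwo")

# (?s) dotall: without it, "." does not match a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# (?x) extended: white space and #-comments in the pattern are ignored.
extended = re.search(r"""(?x)
    \d+   # one or more digits
    \s*   # optional white space
    kg    # literal unit
""", "12 kg").group(0)

print(caseless, line_starts, dot_default, dot_all is not None, extended)
```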
       
   327 <br><a name="SEC16" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
       
   328 <P>
       
   329 <pre>
       
   330   (?=...)        positive look ahead
       
   331   (?!...)        negative look ahead
       
   332   (?&#60;=...)       positive look behind
       
   333   (?&#60;!...)       negative look behind
       
   334 </pre>
       
   335 Each top-level branch of a look behind must be of a fixed length.
       
   336 </P>
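<P>
The four assertions can be compared side by side in a minimal sketch, again
assuming Python's standard <b>re</b> module as a stand-in for PCRE. The
lookbehinds used here are all one character long, which satisfies the
fixed-length rule above; the subject strings are hypothetical.
</P>

```python
import re

prices = "10 USD, 20 EUR, 30 USD"
order = "cost $15, qty 3"

# (?=...) positive look ahead: numbers followed by " USD".
usd_amounts = re.findall(r"\d+(?= USD)", prices)

# (?!...) negative look ahead: whole numbers not followed by " USD".
other_amounts = re.findall(r"\b\d+\b(?! USD)", prices)

# (?<=...) positive look behind: digits preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", order)

# (?<!...) negative look behind: digit runs not preceded by "$" or a digit.
not_after_dollar = re.findall(r"(?<![$\d])\d+", order)

print(usd_amounts, other_amounts, after_dollar, not_after_dollar)
```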
       
   337 <br><a name="SEC17" href="#TOC1">BACKREFERENCES</a><br>
       
   338 <P>
       
   339 <pre>
       
   340   \n             reference by number (can be ambiguous)
       
   341   \gn            reference by number
       
   342   \g{n}          reference by number
       
   343   \g{-n}         relative reference by number
       
   344   \k&#60;name&#62;       reference by name (Perl)
       
   345   \k'name'       reference by name (Perl)
       
   346   \g{name}       reference by name (Perl)
       
   347   \k{name}       reference by name (.NET)
       
   348   (?P=name)      reference by name (Python)
       
   349 </PRE>
       
   350 </P>
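<P>
A backreference matches the same text that the referenced group matched. The
sketch below assumes Python's standard <b>re</b> module as a stand-in for
PCRE, so it uses only the numeric form and the Python-style (?P=name) form;
the \g and \k forms listed above are not available there as backreferences.
</P>

```python
import re

# \1 refers back to whatever group 1 matched: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# (?P=q) refers back by name: the closing quote must equal the opening one.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group(0)

# A mismatched closing quote does not satisfy the backreference.
mismatched = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi\" now")

print(doubled, quoted, mismatched)
```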
       
   351 <br><a name="SEC18" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
       
   352 <P>
       
   353 <pre>
       
   354   (?R)           recurse whole pattern
       
   355   (?n)           call subpattern by absolute number
       
   356   (?+n)          call subpattern by relative number
       
   357   (?-n)          call subpattern by relative number
       
   358   (?&name)       call subpattern by name (Perl)
       
   359   (?P&#62;name)      call subpattern by name (Python)
       
   360   \g&#60;name&#62;       call subpattern by name (Oniguruma)
       
   361   \g'name'       call subpattern by name (Oniguruma)
       
   362   \g&#60;n&#62;          call subpattern by absolute number (Oniguruma)
       
   363   \g'n'          call subpattern by absolute number (Oniguruma)
       
   364   \g&#60;+n&#62;         call subpattern by relative number (PCRE extension)
       
   365   \g'+n'         call subpattern by relative number (PCRE extension)
       
   366   \g&#60;-n&#62;         call subpattern by relative number (PCRE extension)
       
   367   \g'-n'         call subpattern by relative number (PCRE extension)
       
   368 </PRE>
       
   369 </P>
       
   370 <br><a name="SEC19" href="#TOC1">CONDITIONAL PATTERNS</a><br>
       
   371 <P>
       
   372 <pre>
       
   373   (?(condition)yes-pattern)
       
   374   (?(condition)yes-pattern|no-pattern)
       
   375 
       
   376   (?(n)...       absolute reference condition
       
   377   (?(+n)...      relative reference condition
       
   378   (?(-n)...      relative reference condition
       
   379   (?(&#60;name&#62;)...  named reference condition (Perl)
       
   380   (?('name')...  named reference condition (Perl)
       
   381   (?(name)...    named reference condition (PCRE)
       
   382   (?(R)...       overall recursion condition
       
   383   (?(Rn)...      specific group recursion condition
       
   384   (?(R&name)...  specific recursion condition
       
   385   (?(DEFINE)...  define subpattern for reference
       
   386   (?(assert)...  assertion condition
       
   387 </PRE>
       
   388 </P>
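<P>
An absolute reference condition can be sketched as follows, assuming Python's
standard <b>re</b> module as a stand-in for PCRE. Only the (?(n)...) form is
shown; the recursion, DEFINE and assertion conditions are not supported by
Python's <b>re</b>. The pattern and subjects are made up for the example.
</P>

```python
import re

# Group 1 optionally matches "(". The condition (?(1)\)) then requires a
# closing ")" only if group 1 took part in the match.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

results = {s: balanced.match(s) is not None
           for s in ["42", "(42)", "(42", "42)"]}

print(results)
```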
       
   389 <br><a name="SEC20" href="#TOC1">BACKTRACKING CONTROL</a><br>
       
   390 <P>
       
    391 The following act as soon as they are reached:
       
   392 <pre>
       
   393   (*ACCEPT)      force successful match
       
   394   (*FAIL)        force backtrack; synonym (*F)
       
   395 </pre>
       
   396 The following act only when a subsequent match failure causes a backtrack to
       
   397 reach them. They all force a match failure, but they differ in what happens
       
   398 afterwards. Those that advance the start-of-match point do so only if the
       
   399 pattern is not anchored.
       
   400 <pre>
       
   401   (*COMMIT)      overall failure, no advance of starting point
       
   402   (*PRUNE)       advance to next starting character
       
   403   (*SKIP)        advance start to current matching position
       
   404   (*THEN)        local failure, backtrack to next alternation
       
   405 </PRE>
       
   406 </P>
       
   407 <br><a name="SEC21" href="#TOC1">NEWLINE CONVENTIONS</a><br>
       
   408 <P>
       
   409 These are recognized only at the very start of the pattern or after a
       
   410 (*BSR_...) option.
       
   411 <pre>
       
   412   (*CR)
       
   413   (*LF)
       
   414   (*CRLF)
       
   415   (*ANYCRLF)
       
   416   (*ANY)
       
   417 </PRE>
       
   418 </P>
       
   419 <br><a name="SEC22" href="#TOC1">WHAT \R MATCHES</a><br>
       
   420 <P>
       
   421 These are recognized only at the very start of the pattern or after a
       
   422 (*...) option that sets the newline convention.
       
   423 <pre>
       
   424   (*BSR_ANYCRLF)
       
   425   (*BSR_UNICODE)
       
   426 </PRE>
       
   427 </P>
       
   428 <br><a name="SEC23" href="#TOC1">CALLOUTS</a><br>
       
   429 <P>
       
   430 <pre>
       
   431   (?C)      callout
       
   432   (?Cn)     callout with data n
       
   433 </PRE>
       
   434 </P>
       
   435 <br><a name="SEC24" href="#TOC1">SEE ALSO</a><br>
       
   436 <P>
       
   437 <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
       
   438 <b>pcrematching</b>(3), <b>pcre</b>(3).
       
   439 </P>
       
   440 <br><a name="SEC25" href="#TOC1">AUTHOR</a><br>
       
   441 <P>
       
   442 Philip Hazel
       
   443 <br>
       
   444 University Computing Service
       
   445 <br>
       
   446 Cambridge CB2 3QH, England.
       
   447 <br>
       
   448 </P>
       
   449 <br><a name="SEC26" href="#TOC1">REVISION</a><br>
       
   450 <P>
       
   451 Last updated: 09 April 2008
       
   452 <br>
       
   453 Copyright &copy; 1997-2008 University of Cambridge.
       
   454 <br>
       
   455 <p>
       
   456 Return to the <a href="index.html">PCRE index page</a>.
       
   457 </p>