Sortix cross-nightly manual
This manual documents Sortix cross-nightly.
PCREPATTERN(3) | Library Functions Manual | PCREPATTERN(3) |
NAME
PCRE - Perl-compatible regular expressions
PCRE REGULAR EXPRESSION DETAILS
The syntax and semantics of the regular expressions that are supported by PCRE are described in detail below. There is a quick-reference syntax summary in the pcresyntax page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE also supports some alternative regular expression syntax (which does not conflict with the Perl syntax) in order to provide some compatibility with regular expressions in Python, .NET, and Oniguruma.
SPECIAL START-OF-PATTERN ITEMS
A number of options that can be passed to pcre_compile() can also be set by special items at the start of a pattern. These are not Perl-compatible, but are provided to make these options accessible to pattern writers who are not able to change the program that processes the pattern. Any number of these items may appear, but they must all be together right at the start of the pattern string, and the letters must be in upper case.
UTF support
The original operation of PCRE was on strings of one-byte characters. However, there is now also support for UTF-8 strings in the original library, an extra library that supports 16-bit and UTF-16 character strings, and a third library that supports 32-bit and UTF-32 character strings. To use these features, PCRE must be built to include appropriate support. When using UTF strings you must either call the compiling function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of these special sequences:
(*UTF8)
(*UTF16)
(*UTF32)
(*UTF)
Unicode property support
Another special sequence that may appear at the start of a pattern is (*UCP). This has the same effect as setting the PCRE_UCP option: it causes sequences such as \d and \w to use Unicode properties to determine character types, instead of recognizing only characters with codes less than 128 via a lookup table.
Disabling auto-possessification
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making quantifiers possessive when what follows cannot match the repeated item. For example, by default a+b is treated as a++b. For more details, see the pcreapi documentation.
Disabling start-up optimizations
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables several optimizations for quickly reaching "no match" results. For more details, see the pcreapi documentation.
Newline conventions
PCRE supports five different conventions for indicating line breaks in strings: a single CR (carriage return) character, a single LF (linefeed) character, the two-character sequence CRLF, any of the three preceding, or any Unicode newline sequence. The pcreapi page has further discussion about newlines, and shows how to set the newline convention in the options arguments for the compiling and matching functions.
(*CR) carriage return
(*LF) linefeed
(*CRLF) carriage return, followed by linefeed
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
(*CR)a.b
Setting match and recursion limits
The caller of pcre_exec() can set a limit on the number of times the internal match() function is called and on the maximum depth of recursive calls. These facilities are provided to catch runaway matches that are provoked by patterns with huge matching trees (a typical example is a pattern with nested unlimited repeats) and to avoid running out of system stack by too much recursion. When one of these limits is reached, pcre_exec() gives an error return. The limits can also be set by items at the start of the pattern of the form
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
EBCDIC CHARACTER CODES
PCRE can be compiled to run in an environment that uses EBCDIC as its character code rather than ASCII or Unicode (typically a mainframe system). In the sections below, character code values are ASCII or Unicode; in an EBCDIC environment these characters may have different code values, and there are no code points greater than 255.
CHARACTERS AND METACHARACTERS
A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern
The quick brown fox
\ general escape character with several uses
^ assert start of string (or line, in multiline mode)
$ assert end of string (or line, in multiline mode)
. match any character except newline (by default)
[ start character class definition
| start of alternative branch
( start subpattern
) end subpattern
? extends the meaning of (
also 0 or 1 quantifier
also quantifier minimizer
* 0 or more quantifier
+ 1 or more quantifier
also "possessive quantifier"
{ start min/max quantifier
Inside a character class (within square brackets), the only metacharacters are:
\ general escape character
^ negate the class, but only if the first character
- indicates character range
[ POSIX character class (only if followed by POSIX syntax)
] terminates the character class
BACKSLASH
The backslash character has several uses. Firstly, if it is followed by a character that is not a number or a letter, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.
Pattern            PCRE matches   Perl matches
\Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
\Qabc\$xyz\E       abc\$xyz       abc\$xyz
\Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
Non-printing characters
A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is often easier to use one of the following escape sequences than the binary character it represents. In an ASCII or Unicode environment, these escapes are as follows:
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character
\e escape (hex 1B)
\f form feed (hex 0C)
\n linefeed (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or back reference
\o{ddd..} character with octal code ddd..
\xhh character with hex code hh
\x{hhh..} character with hex code hhh.. (non-JavaScript mode)
\uhhhh character with hex code hhhh (JavaScript mode only)
\040 is another way of writing an ASCII space
\40 is the same, provided there are fewer than 40
previous capturing subpatterns
\7 is always a back reference
\11 might be a back reference, or another way of
writing a tab
\011 is always a tab
\0113 is a tab followed by the character "3"
\113 might be a back reference, otherwise the
character with octal code 113
\377 might be a back reference, otherwise
the value 255 (decimal)
\81 is either a back reference, or the two
characters "8" and "1"
Constraints on character values
Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:
8-bit non-UTF mode less than 0x100
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
16-bit non-UTF mode less than 0x10000
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
32-bit non-UTF mode less than 0x100000000
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
Escape sequences in character classes
All the sequences that define a single character value can be used both inside and outside character classes. In addition, inside a character class, \b is interpreted as the backspace character (hex 08).
Unsupported escape sequences
In Perl, the sequences \l, \L, \u, and \U are recognized by its string handler and used to modify the case of following characters. By default, PCRE does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and \u can be used to define a character by code point, as described in the previous section.
Absolute and relative back references
The sequence \g followed by an unsigned or a negative number, optionally enclosed in braces, is an absolute or relative back reference. A named back reference can be coded as \g{name}. Back references are discussed later, following the discussion of parenthesized subpatterns.
Absolute and relative subroutine calls
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative syntax for referencing a subpattern as a "subroutine". Details are discussed later. Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not synonymous. The former is a back reference; the latter is a subroutine call.
Generic character types
Another use of backslash is for specifying generic character types:
\d any decimal digit
\D any character that is not a decimal digit
\h any horizontal white space character
\H any character that is not a horizontal white space character
\s any white space character
\S any character that is not a white space character
\v any vertical white space character
\V any character that is not a vertical white space character
\w any "word" character
\W any "non-word" character
\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore
U+0009 Horizontal tab (HT)
U+0020 Space
U+00A0 Non-break space
U+1680 Ogham space mark
U+180E Mongolian vowel separator
U+2000 En quad
U+2001 Em quad
U+2002 En space
U+2003 Em space
U+2004 Three-per-em space
U+2005 Four-per-em space
U+2006 Six-per-em space
U+2007 Figure space
U+2008 Punctuation space
U+2009 Thin space
U+200A Hair space
U+202F Narrow no-break space
U+205F Medium mathematical space
U+3000 Ideographic space
U+000A Linefeed (LF)
U+000B Vertical tab (VT)
U+000C Form feed (FF)
U+000D Carriage return (CR)
U+0085 Next line (NEL)
U+2028 Line separator
U+2029 Paragraph separator
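The two lists of code points above can be turned into explicit character classes. The following is a minimal sketch in Python, whose standard re module is not PCRE and has no \h escape; the names H_SPACE, V_SPACE and classify are illustrative only and do not come from PCRE.

```python
import re

# Explicit character classes built from the code points listed above.
# H_SPACE and V_SPACE are illustrative names, not PCRE identifiers.
H_SPACE = re.compile(
    "[\u0009\u0020\u00a0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]"
)
V_SPACE = re.compile("[\u000a-\u000d\u0085\u2028\u2029]")


def classify(ch):
    """Return 'horizontal', 'vertical', or 'other' for a single character."""
    if H_SPACE.fullmatch(ch):
        return "horizontal"
    if V_SPACE.fullmatch(ch):
        return "vertical"
    return "other"
```

For instance, classify("\u2003") reports the em space as horizontal, while classify("\x85") reports NEL as vertical.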
Newline sequences
Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
(*ANY)(*BSR_ANYCRLF)
Unicode character properties
When PCRE is built with Unicode character property support, three additional escape sequences that match characters with specific properties are available. When in 8-bit non-UTF-8 mode, these sequences are of course limited to testing characters whose codepoints are less than 256, but they do work in this mode. The extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X a Unicode extended grapheme cluster
\p{Greek}
\P{Han}
\p{L}
\pL
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
Extended grapheme clusters
The \X escape matches any number of Unicode characters that form an "extended grapheme cluster", and treats the sequence as an atomic group (see below). Up to and including release 8.31, PCRE matched an earlier, simpler definition that was equivalent to
(?>\PM\pM*)
PCRE's additional properties
As well as the standard Unicode properties described above, PCRE supports four more that make it possible to convert traditional escape sequences such as \w and \s to use Unicode properties. PCRE uses these non-standard, non-Perl properties internally when PCRE_UCP is set. However, they may also be used explicitly. These properties are:
Xan Any alphanumeric character
Xps Any POSIX space character
Xsp Any Perl space character
Xwd Any Perl "word" character
Resetting the match start
The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:
foo\Kbar
(foo)\Kbar
Simple assertions
The final use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are:
\b matches at a word boundary
\B matches when not at a word boundary
\A matches at the start of the subject
\Z matches at the end of the subject
also matches before a newline at the end of the subject
\z matches only at the end of the subject
\G matches at the first matching position in the subject
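As an illustration of the zero-width nature of \b and \B, here is a small sketch using Python's re module as a stand-in for PCRE. Only these two assertions are shown, because Python's \Z matches only at the very end of the subject and so does not behave like the \Z listed above. The subject string is made up for the example.

```python
import re

subject = "cat concat cats cat."

# \b requires a word boundary on each side, so only the standalone
# occurrences of "cat" are found.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \B requires the absence of a boundary: "cat" must directly follow
# another word character.
embedded = [m.start() for m in re.finditer(r"\Bcat", subject)]

print(whole_words, embedded)
```

Neither assertion consumes a character: each reported offset is where "cat" itself begins.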
CIRCUMFLEX AND DOLLAR
The circumflex and dollar metacharacters are zero-width assertions. That is, they test for a particular condition being true without consuming any characters from the subject string.
FULL STOP (PERIOD, DOT) AND \N
Outside a character class, a dot in the pattern matches any one character in the subject string except (by default) a character that signifies the end of a line.
MATCHING A SINGLE DATA UNIT
Outside a character class, the escape sequence \C matches any one data unit, whether or not a UTF mode is set. In the 8-bit library, one data unit is one byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches line-ending characters. The feature is provided in Perl in order to match individual bytes in UTF-8 mode, but it is unclear how it can usefully be used. Because \C breaks up characters into individual data units, matching one unit with \C in a UTF mode means that the rest of the string may start with a malformed UTF character. This has undefined results, because PCRE assumes that it is dealing with valid UTF strings (and by default it checks this at the start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option is used).
(?| (?=[\x00-\x7f])(\C) |
(?=[\x80-\x{7ff}])(\C)(\C) |
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
SQUARE BRACKETS AND CHARACTER CLASSES
An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square bracket causes a compile-time error. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.
POSIX CHARACTER CLASSES
Perl supports the POSIX notation for character classes. This uses names enclosed by [: and :] within the enclosing square brackets. PCRE also supports this notation. For example,
[01[:alpha:]%]
alnum letters and digits
alpha letters
ascii character codes 0 - 127
blank space or tab only
cntrl control characters
digit decimal digits (same as \d)
graph printing characters, excluding space
lower lower case letters
print printing characters, including space
punct printing characters, excluding letters and digits and space
space white space (the same as \s from PCRE 8.34)
upper upper case letters
word "word" characters (same as \w)
xdigit hexadecimal digits
[12[:^digit:]]
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
[:graph:]
This matches characters that have glyphs that mark the page when printed. In Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf properties, except for:
U+061C Arabic Letter Mark
U+180E Mongolian Vowel Separator
U+2066 - U+2069 Various "isolate"s
[:print:]
This matches the same characters as [:graph:] plus space characters that are not controls, that is, characters with the Zs property.
[:punct:]
This matches all characters that have the Unicode P (punctuation) property, plus those characters whose code points are less than 128 that have the S (Symbol) property.
COMPATIBILITY FEATURE FOR WORD BOUNDARIES
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of word". PCRE treats these items as follows:
[[:<:]] is converted to \b(?=\w)
[[:>:]] is converted to \b(?<=\w)
VERTICAL BAR
Vertical bar characters are used to separate alternative patterns. For example, the pattern
gilbert|sullivan
INTERNAL OPTION SETTING
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED options (which are Perl-compatible) can be changed from within the pattern by a sequence of Perl option letters enclosed between "(?" and ")". The option letters are
i for PCRE_CASELESS
m for PCRE_MULTILINE
s for PCRE_DOTALL
x for PCRE_EXTENDED
(a(?i)b)c
(a(?i)b|c)
SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets), which can be nested. Turning part of a pattern into a subpattern does two things:
cat(aract|erpillar|)
the ((red|white) (king|queen))
the ((?:red|white) (king|queen))
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
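The effect of (?: on group numbering, and the scope of an option letter placed inside a group, can be checked with a short sketch. It uses Python's re module as a stand-in, on the assumption that it behaves like PCRE for these particular constructs; the trailing " morning" in the last pattern is added purely for illustration.

```python
import re

# Capturing groups are numbered by their opening parenthesis, left to right.
capturing = re.match(r"the ((red|white) (king|queen))", "the white queen")
print(capturing.groups())      # three captured substrings

# (?: ... ) groups without capturing, so later groups move up by one.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())  # two captured substrings

# An option letter written as (?i: ... ) applies only inside that group.
scoped = re.compile(r"(?i:saturday|sunday) morning")
print(bool(scoped.match("SUNDAY morning")), bool(scoped.match("sunday MORNING")))
```

In the last pattern, the day name is matched caselessly but the text after the group is not.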
DUPLICATE SUBPATTERN NUMBERS
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses the same numbers for its capturing parentheses. Such a subpattern starts with (?| and is itself a non-capturing subpattern. For example, consider this pattern:
(?|(Sat)ur|(Sun))day
# before  ---------------branch-reset----------- after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
/(?|(abc)|(def))\1/
/(?|(abc)|(def))(?1)/
NAMED SUBPATTERNS
Identifying capturing parentheses by number is simple, but it can be very hard to keep track of the numbers in complicated regular expressions. Furthermore, if an expression is modified, the numbers may change. To help with this difficulty, PCRE supports the naming of subpatterns. This feature was not added to Perl until release 5.10. Python had the feature earlier, and PCRE introduced it at release 4.0, using the Python syntax. PCRE now supports both the Perl and the Python syntax. Perl allows identically numbered subpatterns to have different names, but PCRE does not.
(?<DN>Mon|Fri|Sun)(?:day)?|
(?<DN>Tue)(?:sday)?|
(?<DN>Wed)(?:nesday)?|
(?<DN>Thu)(?:rsday)?|
(?<DN>Sat)(?:urday)?
(?:(?<n>foo)|(?<n>bar))\k<n>
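Since the text notes that PCRE accepts the Python syntax for named subpatterns, a Python sketch can show how a name and a number refer to the same group. It takes one alternative from the example above and rewrites (?<DN>...) in the Python spelling (?P<DN>...). Python's re module is used here only as a stand-in and rejects duplicate names, so the full multi-branch pattern is not reproduced.

```python
import re

# (?P<DN>...) is the Python spelling of a named capturing group.
day = re.compile(r"(?P<DN>Mon|Fri|Sun)(?:day)?")

m = day.fullmatch("Sunday")
abbreviation = m.group("DN")   # lookup by name
same_by_number = m.group(1)    # the group also keeps its number

print(abbreviation, same_by_number)
```

The name is an additional handle on the group and does not replace its number.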
REPETITION
Repetition is specified by quantifiers, which can follow any of the following items:
a literal data character
the dot metacharacter
the \C escape sequence
the \X escape sequence
the \R escape sequence
an escape such as \d or \pL that matches a single character
a character class
a back reference (see next section)
a parenthesized subpattern (including assertions)
a subroutine call to a subpattern (recursive or otherwise)
z{2,4}
[aeiou]{3,}
\d{8}
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
(a?)*
/\*.*\*/
/* first comment */ not comment /* second comment */
/\*.*?\*/
\d??\d
(.*)abc\1
(?>.*?a)b
(tweedle[dume]{3}\s*)+
/(a|(b))+/
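The difference between the greedy pattern /\*.*\*/ and its minimizing form /\*.*?\*/ can be seen by running both against the comment example shown above. This is a sketch in Python's re module, assumed to agree with PCRE on these two quantifiers.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the end and backs off only as far as the last "*/".
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

print(len(greedy), len(lazy))
```

The greedy form returns the whole subject as one match, while the minimizing form returns the two comments separately.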
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") repetition, failure of what follows normally causes the repeated item to be re-evaluated to see if a different number of repeats allows the rest of the pattern to match. Sometimes it is useful to prevent this, either to change the nature of the match, or to cause it to fail earlier than it otherwise might, when the author of the pattern knows there is no point in carrying on.
123456bar
(?>\d+)foo
\d++foo
(abc|xyz){2,3}+
(\D+|<\d+>)*[!?]
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
((?>\D+)|<\d+>)*[!?]
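What "no backtracking into the group" means can be shown in Python, with one caveat: older versions of Python's re module do not accept (?>...) or ++. The sketch below therefore does not use PCRE's syntax but emulates an atomic group with a lookahead followed by a back reference, a substitute technique chosen only for portability. The subject and the trailing digit are made up for the example.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ first takes all five digits, then gives one
# back so that the trailing literal 5 can match.
backtracking = re.search(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the longest digit run and
# \1 consumes exactly that text, so there is nothing left to give back.
atomic_like = re.search(r"(?=(\d+))\1[5]", subject)

print(backtracking.group(), atomic_like)
```

The first search succeeds on the whole subject. The second finds no match, because the digit run is never shortened to leave a 5 for the rest of the pattern.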
BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier (that is, to its left) in the pattern, provided there have been that many previous capturing left parentheses.
(ring), \1
(ring), \g1
(ring), \g{1}
(abc(def)ghi)\g{-1}
(sens|respons)e and \1ibility
((?i)rah)\s+\1
(?<p1>(?i)rah)\s+\k<p1>
(?'p1'(?i)rah)\s+\k{p1}
(?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1}
(a|(bc))\2
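A back reference matches the text that the group actually captured, not whatever the subpattern could match. The sketch below checks this for (sens|respons)e and \1ibility using Python's re module as a stand-in for PCRE. The named variant uses the Python-style (?P=p1) form from the list above, without the (?i) option, which Python does not accept in that position.

```python
import re

# \1 must repeat the text that group 1 matched in this particular match.
pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
results = [bool(pattern.fullmatch(s)) for s in subjects]

# The same idea with a named group and a named back reference.
named = re.compile(r"(?P<p1>rah)\s+(?P=p1)")

print(results, bool(named.fullmatch("rah rah")))
```

The third subject fails because group 1 captured "sens", so \1 cannot match "respons".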
Recursive back references
A back reference that occurs inside the parentheses to which it refers fails when the subpattern is first used, so, for example, (a\1) never matches. However, such references can be useful inside repeated subpatterns. For example, the pattern
(a|b\1)+
ASSERTIONS
An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described above.
Lookahead assertions
Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example,
\w+(?=;)
foo(?!bar)
(?!foo)bar
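The first two lookahead patterns above can be tried out with a short sketch in Python's re module, assumed here to treat lookahead the same way as PCRE. The subject strings are made up for the example.

```python
import re

# \w+(?=;) matches a word only when a semicolon follows; the semicolon
# itself is not part of the match.
words_before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# foo(?!bar) matches "foo" unless it is immediately followed by "bar".
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

print(words_before_semicolon, starts)
```

The matched text never includes the characters examined by the assertion.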
Lookbehind assertions
Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example,
(?<!foo)bar
(?<=bullock|donkey)
(?<!dogs?|cats?)
(?<=ab(c|de))
(?<=abc|abde)
abcd$
^.*abcd$
^.*+(?<=abcd)
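A negative lookbehind can be exercised with the (?<!foo)bar pattern from the start of this section. The sketch uses Python's re module as a stand-in. Only this fixed-length example is shown, because Python is stricter than PCRE and rejects lookbehinds such as (?<=bullock|donkey) whose alternatives differ in length. The subject string is made up for the example.

```python
import re

subject = "foobar bar xbar"

# (?<!foo)bar finds "bar" only where the three preceding characters
# are not "foo".
starts = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]

print(starts)
```

The "bar" at offset 3 is skipped because it is preceded by "foo"; the other two occurrences are reported.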
Using multiple assertions
Several assertions (of any sort) may occur in succession. For example,
(?<=\d{3})(?<!999)foo
(?<=\d{3}...)(?<!999)foo
(?<=(?<!foo)bar)baz
(?<=\d{3}(?!999)...)foo
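When several assertions occur in succession, each one is tested independently at the same point in the subject. The sketch below checks the first two patterns of this section in Python's re module, assumed to agree with PCRE for fixed-length lookbehinds. The variable names and subjects are illustrative.

```python
import re

# Both lookbehinds are tested just before "foo": the three preceding
# characters must be digits, and they must not be "999".
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the digits are checked six characters back, while the "not 999"
# test still applies to the three characters right before "foo".
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

print(bool(three_digits_not_999.search("123foo")),
      bool(six_back.search("123abcfoo")))
```

With the second pattern, "123abcfoo" matches but "123999foo" does not, since the three characters immediately before "foo" are "999".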
CONDITIONAL SUBPATTERNS
It is possible to cause the matching process to obey a subpattern conditionally or to choose between two alternative subpatterns, depending on the result of an assertion, or whether a specific capturing subpattern has already been matched. The two possible forms of conditional subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
Checking for a used subpattern by number
If the text between the parentheses consists of a sequence of digits, the condition is true if a capturing subpattern of that number has previously matched. If there is more than one capturing subpattern with the same number (see the earlier section about duplicate subpattern numbers), the condition is true if any of them have matched. An alternative notation is to precede the digits with a plus or minus sign. In this case, the subpattern number is relative rather than absolute. The most recently opened parentheses can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside loops it can also make sense to refer to subsequent groups. The next parentheses to be opened can be referenced as (?(+1), and so on. (The value zero in any of these forms is not used; it provokes a compile-time error.)
( \( )? [^()]+ (?(1) \) )
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
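The numbered condition in ( \( )? [^()]+ (?(1) \) ) can be exercised directly. The sketch below uses Python's re module with re.VERBOSE so that the white space in the pattern is ignored; it assumes Python and PCRE agree on the absolute (?(1)...) form. The relative (?(-1)...) form is not shown, since Python does not accept it.

```python
import re

# Group 1 optionally matches "(", and the condition (?(1) ... ) demands
# a closing ")" only if group 1 took part in the match.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]
results = [bool(optional_parens.fullmatch(s)) for s in subjects]

print(results)
```

The pattern accepts text with both parentheses or with neither, and rejects a lone opening or closing parenthesis when the whole subject must match.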
Checking for a used subpattern by name
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used subpattern by name. For compatibility with earlier versions of PCRE, which had this facility before Perl, the syntax (?(name)...) is also recognized.
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
Checking for pattern recursion
If the condition is the string (R), and there is no subpattern with the name R, the condition is true if a recursive call to the whole pattern or any subpattern has been made. If digits or a name preceded by ampersand follow the letter R, for example:
(?(R3)...) or (?(R&name)...)
Defining subpatterns for use by reference only
If the condition is the string (DEFINE), and there is no subpattern with the name DEFINE, the condition is always false. In this case, there may be only one alternative in the subpattern. It is always skipped if control reaches this point in the pattern; the idea of DEFINE is that it can be used to define subroutines that can be referenced from elsewhere. (The use of subroutines is described below.) For example, a pattern to match an IPv4 address such as "192.168.23.245" could be written like this (ignore white space and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
Assertion conditions
If the condition is not in any of the above formats, it must be an assertion. This may be a positive or negative lookahead or lookbehind assertion. Consider this pattern, again containing non-significant white space, and with the two alternatives on the second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
COMMENTS
There are two ways of including comments in patterns that are processed by PCRE. In both cases, the start of the comment must not be in a character class, nor in the middle of any other sequence of related characters such as (?: or a subpattern name or number. The characters that make up a comment play no part in the pattern matching.
abc #comment \n still comment
RECURSIVE PATTERNS
Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can be done is to use a pattern that matches up to some fixed depth of nesting. It is not possible to handle an arbitrary nesting depth.
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
\( ( [^()]++ | (?R) )* \)
( \( ( [^()]++ | (?1) )* \) )
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
(ab(cd)ef)
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
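Python's re module has no (?R) or (?1), so the recursion cannot be shown there as a regular expression. As a substitute, the following sketch is a small hand-written function (match_group, a hypothetical name) that is intended to accept the same strings as \( ( [^()]++ | (?R) )* \): an opening parenthesis, then any mix of non-parenthesis characters and nested groups, then a closing parenthesis.

```python
def match_group(s, i=0):
    """Return the index just past a balanced group starting at s[i], or None.

    Mirrors the shape of \\( ( [^()]++ | (?R) )* \\): each nested "(" is
    handled by a recursive call, as (?R) re-enters the whole pattern.
    """
    if i >= len(s) or s[i] != "(":
        return None
    i += 1
    while i < len(s):
        if s[i] == ")":
            return i + 1            # closing parenthesis of this level
        if s[i] == "(":
            j = match_group(s, i)   # nested group, the (?R) branch
            if j is None:
                return None
            i = j
        else:
            i += 1                  # ordinary character, the [^()] branch
    return None                     # ran out of input before ")"


print(match_group("(ab(cd)ef)"), match_group("(ab(cd)ef"))
```

On "(ab(cd)ef)" the function consumes all ten characters; with the final parenthesis removed it reports failure.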
Differences in recursion processing between PCRE and Perl
Recursion processing in PCRE differs from Perl in two important ways. In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure. This can be illustrated by the following pattern, which purports to match a palindromic string that contains an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$
^((.)(?1)\2|.)$
^((.)(?1)\2|.?)$
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
^(.)(\1|a(?2))
SUBPATTERNS AS SUBROUTINES
If the syntax for a recursive subpattern call (either by number or by name) is used outside the parentheses to which it refers, it operates like a subroutine in a programming language. The called subpattern may be defined before or after the reference. A numbered reference can be absolute or relative, as in these examples:
(...(absolute)...)...(?2)...
(...(relative)...)...(?-1)...
(...(?+1)...(relative)...
(sens|respons)e and \1ibility
(sens|respons)e and (?1)ibility
(abc)(?i:(?-1))
ONIGURUMA SUBROUTINE SYNTAX
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative syntax for referencing a subpattern as a subroutine, possibly recursively. Here are two of the examples used above, rewritten using this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
(abc)(?i:\g<-1>)
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl code to be obeyed in the middle of matching a regular expression. This makes it possible, amongst other things, to extract different substrings that match the same pair of parentheses when there is a repetition.
(?C1)abc(?C2)def
(?(?C9)(?=a)abc|def)
BACKTRACKING CONTROL
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which are still described in the Perl documentation as "experimental and subject to change or removal in a future version of Perl". It goes on to say: "Their usage in production code should be noted to avoid problems during upgrades." The same remarks apply to the PCRE features described in this section.
Optimizations that affect backtracking verbs
PCRE contains some optimizations that are used to speed up matching by running some checks at the start of each match attempt. For example, it may know the minimum length of matching subject, or that a particular character must be present. When one of these optimizations bypasses the running of a match, any included backtracking verbs will not, of course, be processed. You can suppress the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_compile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT). There is more discussion of this option in the section entitled "Option bits for pcre_exec()" in the pcreapi documentation.
Verbs that act immediately
The following verbs act as soon as they are encountered. They may not be followed by a name.
(*ACCEPT)
A((?:A|B(*ACCEPT)|C)D)
(*FAIL) or (*F)
a+(?C)(*FAIL)
Recording which path was taken
There is one verb whose main purpose is to track how a match was arrived at, though it also has a secondary use in conjunction with advancing the match starting point (see (*SKIP) below).
(*MARK:NAME) or (*:NAME)
re> /X(*MARK:A)Y|X(*MARK:B)Z/K
data> XY
0: XY
MK: A
XZ
0: XZ
MK: B
re> /X(*MARK:A)Y|X(*MARK:B)Z/K
data> XP
No match, mark = B
Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching continues with what follows, but if there is no subsequent match, causing a backtrack to the verb, a failure is forced. That is, backtracking cannot pass to the left of the verb. However, when one of these verbs appears inside an atomic group or an assertion that is true, its effect is confined to that group, because once the group has been matched, there is never any backtracking into it. In this situation, backtracking can "jump back" to the left of the entire atomic group or assertion. (Remember also, as stated above, that this localization also applies in subroutine calls.)
(*COMMIT)
a+(*COMMIT)b
re> /(*COMMIT)abc/
data> xyzabc
0: abc
data> xyzabc\Y
No match
(*PRUNE) or (*PRUNE:NAME)
(*SKIP)
a+(*SKIP)b
(*SKIP:NAME)
(*THEN) or (*THEN:NAME)
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
A (B(*THEN)C) | D
A (B(*THEN)C | (*FAIL)) | D
^.*? (?(?=a) a | b(*THEN)c )
More than one backtracking verb
If more than one backtracking verb is present in a pattern, the one that is backtracked onto first acts. For example, consider this pattern, where A, B, etc. are complex pattern fragments:
(A(*COMMIT)B(*THEN)C|ABD)
...(*COMMIT)(*PRUNE)...
Backtracking verbs in repeated groups
PCRE differs from Perl in its handling of backtracking verbs in repeated groups. For example, consider:
/(a(*COMMIT)b)+ac/
Backtracking verbs in assertions
(*FAIL) in an assertion has its normal effect: it forces an immediate backtrack.
Backtracking verbs in subroutines
These behaviours occur whether or not the subpattern is called recursively. Perl's treatment of subroutines is different in some cases.
SEE ALSO
pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3), pcre16(3), pcre32(3).
23 October 2016 | PCRE 8.40 |