PCREAPI(3) | Library Functions Manual | PCREAPI(3) |
NAME
PCRE - Perl-compatible regular expressions
PCRE NATIVE API BASIC FUNCTIONS
pcre *pcre_compile(const char *pattern, int options, const char **errptr, int *erroffset, const unsigned char *tableptr);
PCRE NATIVE API STRING EXTRACTION FUNCTIONS
int pcre_copy_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, char *buffer, int buffersize);
PCRE NATIVE API AUXILIARY FUNCTIONS
int pcre_jit_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize, pcre_jit_stack *jstack);
PCRE NATIVE API INDIRECTED FUNCTIONS
void *(*pcre_malloc)(size_t);
PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
As well as support for 8-bit character strings, PCRE also supports 16-bit strings (from release 8.30) and 32-bit strings (from release 8.32), by means of two additional libraries. They can be built as well as, or instead of, the 8-bit library. To avoid too much complication, this document describes the 8-bit versions of the functions, with only occasional references to the 16-bit and 32-bit libraries.
PCRE API OVERVIEW
PCRE has its own native API, which is described in this document. There are also some wrapper functions (for the 8-bit library only) that correspond to the POSIX regular expression API, but they do not give access to all the functionality. They are described in the pcreposix documentation. Both of these APIs define a set of C function calls. A C++ wrapper (again for the 8-bit library only) is also distributed with PCRE. It is documented in the pcrecpp page.
pcre_copy_substring()
pcre_copy_named_substring()
pcre_get_substring()
pcre_get_named_substring()
pcre_get_substring_list()
pcre_get_stringnumber()
pcre_get_stringtable_entries()
NEWLINES
PCRE supports five different conventions for indicating line breaks in strings: a single CR (carriage return) character, a single LF (linefeed) character, the two-character sequence CRLF, any of the three preceding, or any Unicode newline sequence. The Unicode newline sequences are the three just mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029).
MULTITHREADING
The PCRE functions can be used in multi-threading applications, with the proviso that the memory management functions pointed to by pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the callout and stack-checking functions pointed to by pcre_callout and pcre_stack_guard, are shared by all threads.
SAVING PRECOMPILED PATTERNS FOR LATER USE
The compiled form of a regular expression can be saved and re-used at a later time, possibly by a different program, and even on a host other than the one on which it was compiled. Details are given in the pcreprecompile documentation, which includes a description of the pcre_pattern_to_host_byte_order() function. However, compiling a regular expression with one version of PCRE for use with a different version is not guaranteed to work and may cause crashes.
CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
PCRE_CONFIG_UTF8
PCRE_CONFIG_UTF16
PCRE_CONFIG_UTF32
PCRE_CONFIG_UNICODE_PROPERTIES
PCRE_CONFIG_JIT
PCRE_CONFIG_JITTARGET
PCRE_CONFIG_NEWLINE
PCRE_CONFIG_BSR
PCRE_CONFIG_LINK_SIZE
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
PCRE_CONFIG_PARENS_LIMIT
PCRE_CONFIG_MATCH_LIMIT
PCRE_CONFIG_MATCH_LIMIT_RECURSION
PCRE_CONFIG_STACKRECURSE
COMPILING A PATTERN
pcre *pcre_compile(const char *pattern, int options, const char **errptr, int *erroffset, const unsigned char *tableptr);
pcre *re;
const char *error;
int erroffset;
re = pcre_compile(
"^A.*Z", /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
PCRE_ANCHORED
PCRE_AUTO_CALLOUT
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
PCRE_CASELESS
PCRE_DOLLAR_ENDONLY
PCRE_DOTALL
PCRE_DUPNAMES
PCRE_EXTENDED
PCRE_EXTRA
PCRE_FIRSTLINE
PCRE_JAVASCRIPT_COMPAT
PCRE_MULTILINE
PCRE_NEVER_UTF
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
PCRE_NO_AUTO_CAPTURE
PCRE_NO_AUTO_POSSESS
PCRE_NO_START_OPTIMIZE
PCRE_UCP
PCRE_UNGREEDY
PCRE_UTF8
PCRE_NO_UTF8_CHECK
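The PCRE_NEWLINE_xxx options in this list select among the five line-break conventions described under NEWLINES. To make the differences concrete, here is a minimal sketch in plain C that reports how long a newline sequence is at a given position under each convention. It is not PCRE's own code: the enum and the function name are hypothetical, and the Unicode cases assume an 8-bit subject encoded as UTF-8.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; these are not PCRE identifiers. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the length in bytes of the newline sequence that starts at p
   under convention c, or 0 if none starts there. avail is the number of
   readable bytes at p. The NL_ANY case assumes UTF-8 encoding. */
size_t newline_length(const unsigned char *p, size_t avail,
                      enum nl_convention c)
{
    if (avail == 0)
        return 0;

    int crlf = avail >= 2 && p[0] == '\r' && p[1] == '\n';

    switch (c) {
    case NL_CR:
        return p[0] == '\r' ? 1 : 0;
    case NL_LF:
        return p[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf)
            return 2;
        return (p[0] == '\r' || p[0] == '\n') ? 1 : 0;
    case NL_ANY:
        if (crlf)
            return 2;
        /* CR, LF, VT (U+000B) and FF (U+000C) are single bytes. */
        if (p[0] == '\r' || p[0] == '\n' || p[0] == 0x0B || p[0] == 0x0C)
            return 1;
        /* NEL (U+0085) is the two bytes C2 85 in UTF-8. */
        if (avail >= 2 && p[0] == 0xC2 && p[1] == 0x85)
            return 2;
        /* LS (U+2028) and PS (U+2029) are E2 80 A8 and E2 80 A9. */
        if (avail >= 3 && p[0] == 0xE2 && p[1] == 0x80 &&
            (p[2] == 0xA8 || p[2] == 0xA9))
            return 3;
        return 0;
    }
    return 0;
}
```

Note that a CR followed by LF counts as a one-byte newline under the CR convention but as a single two-byte newline under CRLF, ANYCRLF, and ANY.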
COMPILATION ERROR CODES
The following table lists the error codes that may be returned by pcre_compile2(), along with the error messages that may be returned by both compiling functions. Note that error messages are always 8-bit ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed, some error codes have fallen out of use. To avoid confusion, they have not been re-used.
0 no error
1 \ at end of pattern
2 \c at end of pattern
3 unrecognized character follows \
4 numbers out of order in {} quantifier
5 number too big in {} quantifier
6 missing terminating ] for character class
7 invalid escape sequence in character class
8 range out of order in character class
9 nothing to repeat
10 [this code is not in use]
11 internal error: unexpected repeat
12 unrecognized character after (? or (?-
13 POSIX named classes are supported only within a class
14 missing )
15 reference to non-existent subpattern
16 erroffset passed as NULL
17 unknown option bit(s) set
18 missing ) after comment
19 [this code is not in use]
20 regular expression is too large
21 failed to get memory
22 unmatched parentheses
23 internal error: code overflow
24 unrecognized character after (?<
25 lookbehind assertion is not fixed length
26 malformed number or name after (?(
27 conditional group contains more than two branches
28 assertion expected after (?(
29 (?R or (?[+-]digits must be followed by )
30 unknown POSIX class name
31 POSIX collating elements are not supported
32 this version of PCRE is compiled without UTF support
33 [this code is not in use]
34 character value in \x{} or \o{} is too large
35 invalid condition (?(0)
36 \C not allowed in lookbehind assertion
37 PCRE does not support \L, \l, \N{name}, \U, or \u
38 number after (?C is > 255
39 closing ) for (?C expected
40 recursive call could loop indefinitely
41 unrecognized character after (?P
42 syntax error in subpattern name (missing terminator)
43 two named subpatterns have the same name
44 invalid UTF-8 string (specifically UTF-8)
45 support for \P, \p, and \X has not been compiled
46 malformed \P or \p sequence
47 unknown property name after \P or \p
48 subpattern name is too long (maximum 32 characters)
49 too many named subpatterns (maximum 10000)
50 [this code is not in use]
51 octal value is greater than \377 in 8-bit non-UTF-8 mode
52 internal error: overran compiling workspace
53 internal error: previously-checked referenced subpattern not found
54 DEFINE group contains more than one branch
55 repeating a DEFINE group is not allowed
56 inconsistent NEWLINE options
57 \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number
58 a numbered reference must not be zero
59 an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
60 (*VERB) not recognized or malformed
61 number is too big
62 subpattern name expected
63 digit expected after (?+
64 ] is an invalid data character in JavaScript compatibility mode
65 different names for subpatterns of the same number are not allowed
66 (*MARK) must have an argument
67 this version of PCRE is not compiled with Unicode property support
68 \c must be followed by an ASCII character
69 \k is not followed by a braced, angle-bracketed, or quoted name
70 internal error: unknown opcode in find_fixedlength()
71 \N is not supported in a class
72 too many forward references
73 disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
74 invalid UTF-16 string (specifically UTF-16)
75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
76 character value in \u.... sequence is too large
77 invalid UTF-32 string (specifically UTF-32)
78 setting UTF is disabled by the application
79 non-hex character in \x{} (closing brace missing?)
80 non-octal character in \o{} (closing brace missing?)
81 missing opening brace after \o
82 parentheses are too deeply nested
83 invalid range in character class
84 group name must start with a non-digit
85 parentheses are too deeply nested (stack check)
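The messages in this table reach the caller through errptr, together with a byte offset into the pattern through erroffset. The following sketch shows one way to present the two values to a user, with a caret under the reported position. The function name is hypothetical, and the message and offset are taken as parameters so that the sketch does not depend on the library.

```c
#include <stdio.h>
#include <string.h>

/* Format "<message> at offset <n>", the pattern, and a caret line into
   buf. message and erroffset stand for the values that pcre_compile()
   stores through its errptr and erroffset arguments. The offset is
   clamped to the pattern length. Returns the result of snprintf(). */
int format_compile_error(char *buf, size_t size, const char *pattern,
                         const char *message, int erroffset)
{
    int len = (int)strlen(pattern);

    if (erroffset < 0)
        erroffset = 0;
    if (erroffset > len)
        erroffset = len;

    /* "%*s" with an empty string prints erroffset spaces. */
    return snprintf(buf, size, "%s at offset %d\n%s\n%*s^\n",
                    message, erroffset, pattern, erroffset, "");
}
```

Because the offset counts bytes in the 8-bit library, the caret may not line up visually when the pattern contains multibyte UTF-8 characters.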
STUDYING A PATTERN
pcre_extra *pcre_study(const pcre *code, int options, const char **errptr);
PCRE_STUDY_JIT_COMPILE
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
int rc;
int erroroffset;
int ovector[30];
const char *error;
pcre *re;
pcre_extra *sd;
re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
sd = pcre_study(
re, /* result of pcre_compile() */
0, /* no options */
&error); /* set to NULL or points to a message */
rc = pcre_exec( /* see below for details of pcre_exec() options */
re, sd, "subject", 7, 0, 0, ovector, 30);
...
pcre_free_study(sd);
pcre_free(re);
LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character code point. When running in UTF-8 mode, or in the 16- or 32-bit libraries, this applies only to characters with code points less than 256. By default, higher-valued code points never match escapes such as \w or \d. However, if PCRE is built with Unicode property support, all characters can be tested with \p and \P, or, alternatively, the PCRE_UCP option can be set when a pattern is compiled; this causes \w and friends to use Unicode property support instead of the built-in tables.
setlocale(LC_CTYPE, "fr_FR");
tables = pcre_maketables();
re = pcre_compile(..., tables);
INFORMATION ABOUT A PATTERN
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, int what, void *where);
PCRE_ERROR_NULL           the argument code was NULL
                          the argument where was NULL
PCRE_ERROR_BADMAGIC       the "magic number" was not found
PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
                          endianness
PCRE_ERROR_BADOPTION      the value of what was invalid
PCRE_ERROR_UNSET          the requested field is not set
int rc;
size_t length;
rc = pcre_fullinfo(
re, /* result of pcre_compile() */
sd, /* result of pcre_study(), or NULL */
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
PCRE_INFO_BACKREFMAX
PCRE_INFO_CAPTURECOUNT
PCRE_INFO_DEFAULT_TABLES
PCRE_INFO_FIRSTBYTE (deprecated)
PCRE_INFO_FIRSTCHARACTER
PCRE_INFO_FIRSTCHARACTERFLAGS
PCRE_INFO_FIRSTTABLE
PCRE_INFO_HASCRORLF
PCRE_INFO_JCHANGED
PCRE_INFO_JIT
PCRE_INFO_JITSIZE
PCRE_INFO_LASTLITERAL
PCRE_INFO_MATCH_EMPTY
PCRE_INFO_MATCHLIMIT
PCRE_INFO_MAXLOOKBEHIND
PCRE_INFO_MINLENGTH
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
(?<date> (?<year>(\d\d)?\d\d) -
(?<month>\d\d) - (?<day>\d\d) )
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
PCRE_INFO_OKPARTIAL
PCRE_INFO_OPTIONS
^ unless PCRE_MULTILINE is set
\A always
\G always
.* if PCRE_DOTALL is set and there are no back
references to the subpattern in which .* appears
PCRE_INFO_RECURSIONLIMIT
PCRE_INFO_SIZE
PCRE_INFO_STUDYSIZE
PCRE_INFO_REQUIREDCHARFLAGS
PCRE_INFO_REQUIREDCHAR
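The name table shown under PCRE_INFO_NAMETABLE can be walked with nothing more than the entry count (PCRE_INFO_NAMECOUNT) and the entry size (PCRE_INFO_NAMEENTRYSIZE). The sketch below assumes the 8-bit layout from the example: each entry is a two-byte group number, most significant byte first, followed by a zero-terminated name. The function name and the sample array are hypothetical. The array copies the four entries from the example, with the undefined padding bytes (shown as ??) set to zero.

```c
#include <stddef.h>
#include <string.h>

/* The table from the example above: four entries of eight bytes each. */
static const unsigned char sample_table[] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};

/* Find name in a table of count entries, each entry_size bytes long.
   Returns the group number stored in the first two bytes of the
   matching entry, or -1 if the name is not present. */
int nametable_lookup(const unsigned char *table, int count, int entry_size,
                     const char *name)
{
    for (int i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

With the sample table, looking up "day" yields 5 and "year" yields 2. A linear scan keeps the sketch short; in real code, pcre_get_stringnumber() performs this lookup.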
REFERENCE COUNTS
int pcre_refcount(pcre *code, int adjust);
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize);
int rc;
int ovector[30];
rc = pcre_exec(
re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector of integers for substring information */
30); /* number of elements (NOT size in bytes) */
Extra data for pcre_exec()
If the extra argument is not NULL, it must point to a pcre_extra data block. The pcre_study() function returns such a block (when it doesn't return NULL), but you can also create one for yourself, and pass additional information in it. The pcre_extra block contains the following fields (not necessarily in this order):
unsigned long int flags;
void *study_data;
void *executable_jit;
unsigned long int match_limit;
unsigned long int match_limit_recursion;
void *callout_data;
const unsigned char *tables;
unsigned char **mark;
PCRE_EXTRA_CALLOUT_DATA
PCRE_EXTRA_EXECUTABLE_JIT
PCRE_EXTRA_MARK
PCRE_EXTRA_MATCH_LIMIT
PCRE_EXTRA_MATCH_LIMIT_RECURSION
PCRE_EXTRA_STUDY_DATA
PCRE_EXTRA_TABLES
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
Option bits for pcre_exec()
The unused bits of the options argument for pcre_exec() must be zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT.
PCRE_ANCHORED
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
PCRE_NOTBOL
PCRE_NOTEOL
PCRE_NOTEMPTY
a?b?
PCRE_NOTEMPTY_ATSTART
PCRE_NO_START_OPTIMIZE
(*COMMIT)ABC
(*MARK:A)(X|Y)
PCRE_NO_UTF8_CHECK
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
The string to be matched by pcre_exec()
The subject string is passed to pcre_exec() as a pointer in subject, a length in length, and a starting offset in startoffset. The units for length and startoffset are bytes for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit data items for the 32-bit library.
\Biss\B
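Because startoffset counts bytes in the 8-bit library, code that moves it forward by one character must step over whole UTF-8 characters, and over a CRLF pair when CRLF is a valid newline. The following sketch of that step uses a hypothetical function name. The caller is assumed to know whether the pattern was compiled in UTF-8 mode and whether CRLF counts as a newline.

```c
/* Return the offset one character after offset in subject, which is
   length bytes long. A CR LF pair is stepped over as a unit when
   crlf_is_newline is nonzero. When utf8 is nonzero, continuation bytes
   (bit pattern 10xxxxxx) are skipped so that the result is the start
   of a character. Never returns more than length. */
int next_start_offset(const char *subject, int length, int offset,
                      int utf8, int crlf_is_newline)
{
    if (offset >= length)
        return length;

    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;
    if (utf8) {
        while (offset < length &&
               ((unsigned char)subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

The result can be passed as startoffset in a further pcre_exec() call on the same subject, which keeps assertions such as \B working against the text before the new starting point.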
How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from the subject may be picked out by parts of the pattern. Following the usage in Jeffrey Friedl's book, this is called "capturing" in what follows, and the phrase "capturing subpattern" is used for a fragment of a pattern that picks out a substring. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured.
(a)(?:(b)c|bd)
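Captured substrings come back as pairs of offsets in ovector: element 2n is the start of capture n and element 2n+1 is the offset just past its end, with pair 0 describing the whole match. A group that took no part in the match has both offsets set to -1. The sketch below reads one pair safely; the function name is hypothetical.

```c
/* Fetch the span of capture n from an ovector filled by a pcre_exec()
   call that returned rc > 0. On success, stores the start offset and
   length and returns 1. Returns 0 if n is out of range or if the group
   is unset (its offsets are -1). */
int capture_span(const int *ovector, int rc, int n, int *start, int *len)
{
    if (n < 0 || n >= rc)
        return 0;                 /* beyond the highest numbered set pair */
    if (ovector[2 * n] < 0)
        return 0;                 /* group did not take part in the match */

    *start = ovector[2 * n];
    *len = ovector[2 * n + 1] - ovector[2 * n];
    return 1;
}
```

As an illustration that is not taken from the text above, matching the pattern (a|(z))(bc) against the subject abc returns 4 and fills ovector with 0, 3, 0, 1, -1, -1, 1, 3. Groups 1 and 3 are set, group 2 is not, and capture_span() reports it as absent.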
Error return values from pcre_exec()
If pcre_exec() fails, it returns a negative number. The following are defined in the header file:
PCRE_ERROR_NOMATCH (-1)
PCRE_ERROR_NULL (-2)
PCRE_ERROR_BADOPTION (-3)
PCRE_ERROR_BADMAGIC (-4)
PCRE_ERROR_UNKNOWN_OPCODE (-5)
PCRE_ERROR_NOMEMORY (-6)
PCRE_ERROR_NOSUBSTRING (-7)
PCRE_ERROR_MATCHLIMIT (-8)
PCRE_ERROR_CALLOUT (-9)
PCRE_ERROR_BADUTF8 (-10)
PCRE_ERROR_BADUTF8_OFFSET (-11)
PCRE_ERROR_PARTIAL (-12)
PCRE_ERROR_BADPARTIAL (-13)
PCRE_ERROR_INTERNAL (-14)
PCRE_ERROR_BADCOUNT (-15)
PCRE_ERROR_RECURSIONLIMIT (-21)
PCRE_ERROR_BADNEWLINE (-23)
PCRE_ERROR_BADOFFSET (-24)
PCRE_ERROR_SHORTUTF8 (-25)
PCRE_ERROR_RECURSELOOP (-26)
PCRE_ERROR_JIT_STACKLIMIT (-27)
PCRE_ERROR_BADMODE (-28)
PCRE_ERROR_BADENDIANNESS (-29)
PCRE_ERROR_JIT_BADOPTION (-31)
PCRE_ERROR_BADLENGTH (-32)
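Not every non-positive return value is a failure in the usual sense: PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL are normal outcomes, and a return of zero means the match succeeded but ovector was too small for all the captured substrings. A caller might sort return values into those groups as in the sketch below. The enum and the function are hypothetical, and the two constants are defined only if pcre.h has not already provided them.

```c
/* Values as listed above. The guards let this sketch compile with or
   without <pcre.h>. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL (-12)
#endif

/* Hypothetical classification; these names are not part of PCRE. */
enum exec_outcome {
    EXEC_MATCHED,        /* rc > 0: one more than the highest set pair   */
    EXEC_OVECTOR_FULL,   /* rc == 0: matched, but ovector was too small  */
    EXEC_NOMATCH,        /* the subject did not match the pattern        */
    EXEC_PARTIAL,        /* partial match (PCRE_PARTIAL_xxx was set)     */
    EXEC_ERROR           /* any other negative value                     */
};

enum exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)
        return EXEC_MATCHED;
    if (rc == 0)
        return EXEC_OVECTOR_FULL;

    switch (rc) {
    case PCRE_ERROR_NOMATCH:
        return EXEC_NOMATCH;
    case PCRE_ERROR_PARTIAL:
        return EXEC_PARTIAL;
    default:
        return EXEC_ERROR;
    }
}
```

Keeping the "no match" and "partial" cases apart from real errors lets the calling loop continue normally on the former and report or abort on the latter.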
Reason codes for invalid UTF-8 strings
This section applies only to the 8-bit library. The corresponding information for the 16-bit and 32-bit libraries is given in the pcre16 and pcre32 pages.
PCRE_UTF8_ERR1
PCRE_UTF8_ERR2
PCRE_UTF8_ERR3
PCRE_UTF8_ERR4
PCRE_UTF8_ERR5
PCRE_UTF8_ERR6
PCRE_UTF8_ERR7
PCRE_UTF8_ERR8
PCRE_UTF8_ERR9
PCRE_UTF8_ERR10
PCRE_UTF8_ERR11
PCRE_UTF8_ERR12
PCRE_UTF8_ERR13
PCRE_UTF8_ERR14
PCRE_UTF8_ERR15
PCRE_UTF8_ERR16
PCRE_UTF8_ERR17
PCRE_UTF8_ERR18
PCRE_UTF8_ERR19
PCRE_UTF8_ERR20
PCRE_UTF8_ERR21
PCRE_UTF8_ERR22
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre_copy_substring(const char *subject, int *ovector, int stringcount, int stringnumber, char *buffer, int buffersize);
PCRE_ERROR_NOMEMORY (-6)
PCRE_ERROR_NOSUBSTRING (-7)
PCRE_ERROR_NOMEMORY (-6)
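To show what the two error codes mean for pcre_copy_substring(), here is a plain C sketch that mirrors its contract as understood from the prototype above. It returns the length of the copied substring, PCRE_ERROR_NOSUBSTRING when stringnumber is not below stringcount, and PCRE_ERROR_NOMEMORY when the buffer cannot hold the substring plus a terminating zero. It is an illustration under those assumptions, not the library's implementation, and the function name is hypothetical.

```c
#include <string.h>

/* Values as listed above. The guards let this sketch compile with or
   without <pcre.h>. */
#ifndef PCRE_ERROR_NOMEMORY
#define PCRE_ERROR_NOMEMORY (-6)
#endif
#ifndef PCRE_ERROR_NOSUBSTRING
#define PCRE_ERROR_NOSUBSTRING (-7)
#endif

/* Copy capture stringnumber out of subject into buffer as a
   zero-terminated string. stringcount is the positive value returned
   by pcre_exec(). An unset group (offsets -1) is copied as an empty
   string in this sketch. */
int sketch_copy_substring(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    if (stringnumber < 0 || stringnumber >= stringcount)
        return PCRE_ERROR_NOSUBSTRING;

    int start = ovector[2 * stringnumber];
    int end = ovector[2 * stringnumber + 1];
    int len = (start < 0) ? 0 : end - start;

    if (buffersize < len + 1)
        return PCRE_ERROR_NOMEMORY;      /* no room for text plus NUL */

    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

For example, with the subject "some string" and hypothetical offsets 5, 11, 5, 8 (a whole match of "string" whose first group captured "str"), requesting substring 1 into a 16-byte buffer copies "str" and returns 3, while a 3-byte buffer yields PCRE_ERROR_NOMEMORY.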
EXTRACTING CAPTURED SUBSTRINGS BY NAME
int pcre_get_stringnumber(const pcre *code, const char *name);
(a+)b(?<xxx>\d+)...
DUPLICATE SUBPATTERN NAMES
int pcre_get_stringtable_entries(const pcre *code, const char *name, char **first, char **last);
FINDING ALL POSSIBLE MATCHES
The traditional matching function uses a similar algorithm to Perl, which stops when it finds the first match, starting at a given point in the subject. If you want to find all possible matches, or the longest possible match, consider using the alternative matching function (see below) instead. If you cannot use the alternative function, but still need to find all possible matches, you can kludge it up by making use of the callout facility, which is described in the pcrecallout documentation.
OBTAINING AN ESTIMATE OF STACK USAGE
Matching certain patterns using pcre_exec() can use a lot of process stack, which in certain environments can be rather limited in size. Some users find it helpful to have an estimate of the amount of stack that is used by pcre_exec(), to help them set recursion limits, as described in the pcrestack documentation. The estimate that is output by pcretest when called with the -m and -C options is obtained by calling pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its first five arguments.
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize, int *workspace, int wscount);
int rc;
int ovector[10];
int wspace[20];
rc = pcre_dfa_exec(
re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector of integers for substring information */
10, /* number of elements (NOT size in bytes) */
wspace, /* working space vector */
20); /* number of elements (NOT size in bytes) */
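After a call like the one above returns a positive rc, ovector holds rc pairs of offsets, one pair per match, with the longest match stored first. The sketch below turns those pairs into text, one match per line. The function name is hypothetical, and the offsets in the usage note were worked out by hand for illustration rather than produced by running the library.

```c
#include <stdio.h>

/* Write one line per match into out, in the order the pairs appear in
   ovector (longest match first for pcre_dfa_exec()). Each line has the
   form "<index>: <text>". Returns rc on success, or -1 if out is too
   small to hold every line. */
int format_dfa_matches(char *out, size_t outsize, const char *subject,
                       const int *ovector, int rc)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    for (int i = 0; i < rc; i++) {
        int start = ovector[2 * i];
        int len = ovector[2 * i + 1] - start;
        int n = snprintf(out + used, outsize - used, "%d: %.*s\n",
                         i, len, subject + start);

        if (n < 0 || (size_t)n >= outsize - used)
            return -1;            /* output would have been truncated */
        used += (size_t)n;
    }
    return rc;
}
```

For the subject "This is <something> <something else> <something further> no more" and the pattern <.*>, the three matches all start at offset 8 and end at offsets 56, 36, and 19, so an ovector beginning 8, 56, 8, 36, 8, 19 with rc equal to 3 produces three lines, longest first.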
Option bits for pcre_dfa_exec()
The unused bits of the options argument for pcre_dfa_exec() must be zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last four of these are exactly the same as for pcre_exec(), so their description is not repeated here.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
PCRE_DFA_SHORTEST
PCRE_DFA_RESTART
Successful returns from pcre_dfa_exec()
When pcre_dfa_exec() succeeds, it may have matched more than one substring in the subject. Note, however, that all the matches from one run of the function start at the same point in the subject. The shorter matches are all initial substrings of the longer matches. For example, if the pattern
<.*>
is matched against the string
This is <something> <something else> <something further> no more
the three matched strings are
<something>
<something> <something else>
<something> <something else> <something further>
Error returns from pcre_dfa_exec()
The pcre_dfa_exec() function returns a negative number when it fails. Many of the errors are the same as for pcre_exec(), and these are described above. There are in addition the following errors that are specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16)
PCRE_ERROR_DFA_UCOND (-17)
PCRE_ERROR_DFA_UMLIMIT (-18)
PCRE_ERROR_DFA_WSSIZE (-19)
PCRE_ERROR_DFA_RECURSE (-20)
PCRE_ERROR_DFA_BADRESTART (-30)
SEE ALSO
pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
18 December 2015 | PCRE 8.39 |