-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcredemo program. There are separate text files for the pcregrep and
pcretest commands.
-----------------------------------------------------------------------------
PCRE(3) Library Functions Manual PCRE(3)
NAME
PCRE - Perl-compatible regular expressions (original API)
PLEASE TAKE NOTE
This document relates to PCRE releases that use the original API, with
library names libpcre, libpcre16, and libpcre32. January 2015 saw the
first release of a new API, known as PCRE2, with release numbers start-
ing at 10.00 and library names libpcre2-8, libpcre2-16, and
libpcre2-32. The old libraries (now called PCRE1) are still being main-
tained for bug fixes, but there will be no new development. New
projects are advised to use the new PCRE2 libraries.
INTRODUCTION
The PCRE library is a set of functions that implement regular expres-
sion pattern matching using the same syntax and semantics as Perl, with
just a few differences. Some features that appeared in Python and PCRE
before they appeared in Perl are also available using the Python syn-
tax, there is some support for one or two .NET and Oniguruma syntax
items, and there is an option for requesting some minor changes that
give better JavaScript compatibility.
Starting with release 8.30, it is possible to compile two separate PCRE
libraries: the original, which supports 8-bit character strings
(including UTF-8 strings), and a second library that supports 16-bit
character strings (including UTF-16 strings). The build process allows
either one or both to be built. The majority of the work to make this
possible was done by Zoltan Herczeg.
Starting with release 8.32 it is possible to compile a third separate
PCRE library that supports 32-bit character strings (including UTF-32
strings). The build process allows any combination of the 8-, 16- and
32-bit libraries. The work to make this possible was done by Christian
Persch.
The three libraries contain identical sets of functions, except that
the names in the 16-bit library start with pcre16_ instead of pcre_,
and the names in the 32-bit library start with pcre32_ instead of
pcre_. To avoid over-complication and reduce the documentation mainte-
nance load, most of the documentation describes the 8-bit library, with
the differences for the 16-bit and 32-bit libraries described sepa-
rately in the pcre16 and pcre32 pages. References to functions or
structures of the form pcre[16|32]_xxx should be read as meaning
"pcre_xxx when using the 8-bit library, pcre16_xxx when using the
16-bit library, or pcre32_xxx when using the 32-bit library".
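The naming convention above lends itself to a small preprocessor layer in
client code. The following is a minimal sketch, not part of PCRE: the
SKETCH_ macros and the SKETCH_PCRE_WIDTH switch are hypothetical names
chosen for illustration. It only builds and stringifies the function name,
so it does not need pcre.h.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical build-time switch: 8, 16, or 32. Not defined by PCRE. */
#ifndef SKETCH_PCRE_WIDTH
#define SKETCH_PCRE_WIDTH 16
#endif

/* Two-level pasting so that macro arguments are expanded before ## applies. */
#define SKETCH_PASTE2(a, b) a##b
#define SKETCH_PASTE(a, b)  SKETCH_PASTE2(a, b)

#if SKETCH_PCRE_WIDTH == 8
#define SKETCH_PREFIX pcre_
#elif SKETCH_PCRE_WIDTH == 16
#define SKETCH_PREFIX pcre16_
#elif SKETCH_PCRE_WIDTH == 32
#define SKETCH_PREFIX pcre32_
#else
#error "SKETCH_PCRE_WIDTH must be 8, 16, or 32"
#endif

/* SKETCH_FN(exec) expands to pcre_exec, pcre16_exec, or pcre32_exec. */
#define SKETCH_FN(name) SKETCH_PASTE(SKETCH_PREFIX, name)

/* Stringify the expanded name so the mapping can be inspected at run time. */
#define SKETCH_STR2(x) #x
#define SKETCH_STR(x)  SKETCH_STR2(x)

/* Name of the matching function selected by SKETCH_PCRE_WIDTH. */
static const char *sketch_exec_name(void)
{
    return SKETCH_STR(SKETCH_FN(exec));
}

/* Returns 1 if the selected name starts with the given prefix, else 0. */
static int sketch_exec_has_prefix(const char *prefix)
{
    assert(prefix != NULL);
    return strncmp(sketch_exec_name(), prefix, strlen(prefix)) == 0;
}
```

In real client code, SKETCH_FN(exec)(...) would be used as the call itself,
with pcre.h included and the matching library linked.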
The current implementation of PCRE corresponds approximately with Perl
5.12, including support for UTF-8/16/32 encoded strings and Unicode
general category properties. However, UTF-8/16/32 and Unicode support
has to be explicitly enabled; it is not the default. The Unicode tables
correspond to Unicode release 6.3.0.
In addition to the Perl-compatible matching function, PCRE contains an
alternative function that matches the same compiled patterns in a dif-
ferent way. In certain circumstances, the alternative function has some
advantages. For a discussion of the two matching algorithms, see the
pcrematching page.
PCRE is written in C and released as a C library. A number of people
have written wrappers and interfaces of various kinds. In particular,
Google Inc. have provided a comprehensive C++ wrapper for the 8-bit
library. This is now included as part of the PCRE distribution. The
pcrecpp page has details of this interface. Other people's contribu-
tions can be found in the Contrib directory at the primary FTP site,
which is:
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
Details of exactly which Perl regular expression features are and are
not supported by PCRE are given in separate documents. See the pcrepat-
tern and pcrecompat pages. There is a syntax summary in the pcresyntax
page.
Some features of PCRE can be included, excluded, or changed when the
library is built. The pcre_config() function makes it possible for a
client to discover which features are available. The features them-
selves are described in the pcrebuild page. Documentation about build-
ing PCRE for various operating systems can be found in the README and
NON-AUTOTOOLS_BUILD files in the source distribution.
The libraries contain a number of undocumented internal functions and
data tables that are used by more than one of the exported external
functions, but which are not intended for use by external callers.
Their names all begin with "_pcre_" or "_pcre16_" or "_pcre32_", which
hopefully will not provoke any name clashes. In some environments, it
is possible to control which external symbols are exported when a
shared library is built, and in these cases the undocumented symbols
are not exported.
SECURITY CONSIDERATIONS
If you are using PCRE in a non-UTF application that permits users to
supply arbitrary patterns for compilation, you should be aware of a
feature that allows users to turn on UTF support from within a pattern,
provided that PCRE was built with UTF support. For example, an 8-bit
pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
which interprets patterns and subjects as strings of UTF-8 characters
instead of individual 8-bit characters. This causes both the pattern
and any data against which it is matched to be checked for UTF-8 valid-
ity. If the data string is very long, such a check might use suffi-
ciently many resources as to cause your application to lose perfor-
mance.
One way of guarding against this possibility is to use the
pcre_fullinfo() function to check the compiled pattern's options for
UTF. Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF
option at compile time. This causes a compile-time error if a pattern
contains a UTF-setting sequence.
If your application is one that supports UTF, be aware that validity
checking can take time. If the same data string is to be matched many
times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
and subsequent matches to save redundant checks.
Another way that performance can be hit is by running a pattern that
has a very large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE pro-
vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
ture in the pcreapi page.
USER DOCUMENTATION
The user documentation for PCRE comprises a number of different sec-
tions. In the "man" format, each of these is a separate "man page". In
the HTML format, each is a separate page, linked from the index page.
In the plain text format, the descriptions of the pcregrep and pcretest
programs are in files called pcregrep.txt and pcretest.txt, respec-
tively. The remaining sections, except for the pcredemo section (which
is a program listing), are concatenated in pcre.txt, for ease of
searching. The sections are as follows:
  pcre              this document
  pcre-config       show PCRE installation configuration information
  pcre16            details of the 16-bit library
  pcre32            details of the 32-bit library
  pcreapi           details of PCRE's native C API
  pcrebuild         building PCRE
  pcrecallout       details of the callout feature
  pcrecompat        discussion of Perl compatibility
  pcrecpp           details of the C++ wrapper for the 8-bit library
  pcredemo          a demonstration C program that uses PCRE
  pcregrep          description of the pcregrep command (8-bit only)
  pcrejit           discussion of the just-in-time optimization support
  pcrelimits        details of size and other limits
  pcrematching      discussion of the two matching algorithms
  pcrepartial       details of the partial matching facility
  pcrepattern       syntax and semantics of supported regular expressions
  pcreperform       discussion of performance issues
  pcreposix         the POSIX-compatible C API for the 8-bit library
  pcreprecompile    details of saving and re-using precompiled patterns
  pcresample        discussion of the pcredemo program
  pcrestack         discussion of stack usage
  pcresyntax        quick syntax reference
  pcretest          description of the pcretest testing command
  pcreunicode       discussion of Unicode and UTF-8/16/32 support
In the "man" and HTML formats, there is also a short page for each C
library function, listing its arguments and results.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
Putting an actual email address here seems to have been a spam magnet,
so I've taken it away. If you want to email me, use my two initials,
followed by the two digits 10, at the domain cam.ac.uk.
REVISION
Last updated: 10 February 2015
Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------
PCRE(3) Library Functions Manual PCRE(3)
NAME
PCRE - Perl-compatible regular expressions
#include <pcre.h>
PCRE 16-BIT API BASIC FUNCTIONS
pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
int *errorcodeptr,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre16_extra *pcre16_study(const pcre16 *code, int options,
const char **errptr);
void pcre16_free_study(pcre16_extra *extra);
int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
PCRE_SPTR16 subject, int length, int startoffset,
int options, int *ovector, int ovecsize);
int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
PCRE_SPTR16 subject, int length, int startoffset,
int options, int *ovector, int ovecsize,
int *workspace, int wscount);
PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
int pcre16_copy_named_substring(const pcre16 *code,
PCRE_SPTR16 subject, int *ovector,
int stringcount, PCRE_SPTR16 stringname,
PCRE_UCHAR16 *buffer, int buffersize);
int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
int buffersize);
int pcre16_get_named_substring(const pcre16 *code,
PCRE_SPTR16 subject, int *ovector,
int stringcount, PCRE_SPTR16 stringname,
PCRE_SPTR16 *stringptr);
int pcre16_get_stringnumber(const pcre16 *code,
PCRE_SPTR16 name);
int pcre16_get_stringtable_entries(const pcre16 *code,
PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
int stringcount, int stringnumber,
PCRE_SPTR16 *stringptr);
int pcre16_get_substring_list(PCRE_SPTR16 subject,
int *ovector, int stringcount, PCRE_SPTR16 **listptr);
void pcre16_free_substring(PCRE_SPTR16 stringptr);
void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
PCRE 16-BIT API AUXILIARY FUNCTIONS
pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
void pcre16_jit_stack_free(pcre16_jit_stack *stack);
void pcre16_assign_jit_stack(pcre16_extra *extra,
pcre16_jit_callback callback, void *data);
const unsigned char *pcre16_maketables(void);
int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
int what, void *where);
int pcre16_refcount(pcre16 *code, int adjust);
int pcre16_config(int what, void *where);
const char *pcre16_version(void);
int pcre16_pattern_to_host_byte_order(pcre16 *code,
pcre16_extra *extra, const unsigned char *tables);
PCRE 16-BIT API INDIRECTED FUNCTIONS
void *(*pcre16_malloc)(size_t);
void (*pcre16_free)(void *);
void *(*pcre16_stack_malloc)(size_t);
void (*pcre16_stack_free)(void *);
int (*pcre16_callout)(pcre16_callout_block *);
PCRE 16-BIT API 16-BIT-ONLY FUNCTION
int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
PCRE_SPTR16 input, int length, int *byte_order,
int keep_boms);
THE PCRE 16-BIT LIBRARY
Starting with release 8.30, it is possible to compile a PCRE library
that supports 16-bit character strings, including UTF-16 strings, as
well as or instead of the original 8-bit library. The majority of the
work to make this possible was done by Zoltan Herczeg. The two
libraries contain identical sets of functions, used in exactly the same
way. Only the names of the functions and the data types of their argu-
ments and results are different. To avoid over-complication and reduce
the documentation maintenance load, most of the PCRE documentation
describes the 8-bit library, with only occasional references to the
16-bit library. This page describes what is different when you use the
16-bit library.
WARNING: A single application can be linked with both libraries, but
you must take care when processing any particular pattern to use func-
tions from just one library. For example, if you want to study a pat-
tern that was compiled with pcre16_compile(), you must do so with
pcre16_study(), not pcre_study(), and you must free the study data with
pcre16_free_study().
THE HEADER FILE
There is only one header file, pcre.h. It contains prototypes for all
the functions in all libraries, as well as definitions of flags, struc-
tures, error codes, etc.
THE LIBRARY NAME
In Unix-like systems, the 16-bit library is called libpcre16, and can
normally be accessed by adding -lpcre16 to the command for linking an
application that uses PCRE.
STRING TYPES
In the 8-bit library, strings are passed to PCRE library functions as
vectors of bytes with the C type "char *". In the 16-bit library,
strings are passed as vectors of unsigned 16-bit quantities. The macro
PCRE_UCHAR16 specifies an appropriate data type, and PCRE_SPTR16 is
defined as "const PCRE_UCHAR16 *". In very many environments, "short
int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
as "unsigned short int", but checks that it really is a 16-bit data
type. If it is not, the build fails with an error message telling the
maintainer to modify the definition appropriately.
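The width check described above can be pictured as follows. This is only a
sketch of the idea, assuming a C11 compiler; sketch_uchar16 is a
hypothetical stand-in for PCRE_UCHAR16, and the actual check in the PCRE
build may be implemented differently.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for PCRE_UCHAR16; the real macro comes from pcre.h. */
typedef unsigned short sketch_uchar16;

/* Compile-time check (C11): refuse to build if the type is not 16 bits wide. */
_Static_assert(sizeof(sketch_uchar16) * CHAR_BIT == 16,
               "sketch_uchar16 must be exactly 16 bits; adjust the typedef");

/* Run-time view of the same fact, for example for a configuration report. */
static int sketch_uchar16_bits(void)
{
    return (int)(sizeof(sketch_uchar16) * CHAR_BIT);
}
```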
STRUCTURE TYPES
The types of the opaque structures that are used for compiled 16-bit
patterns and JIT stacks are pcre16 and pcre16_jit_stack respectively.
The type of the user-accessible structure that is returned by
pcre16_study() is pcre16_extra, and the type of the structure that is
used for passing data to a callout function is pcre16_callout_block.
These structures contain the same fields, with the same names, as their
8-bit counterparts. The only difference is that pointers to character
strings are 16-bit instead of 8-bit types.
16-BIT FUNCTIONS
For every function in the 8-bit library there is a corresponding func-
tion in the 16-bit library with a name that starts with pcre16_ instead
of pcre_. The prototypes are listed above. In addition, there is one
extra function, pcre16_utf16_to_host_byte_order(). This is a utility
function that converts a UTF-16 character string to host byte order if
necessary. The other 16-bit functions expect the strings they are
passed to be in host byte order.
The input and output arguments of pcre16_utf16_to_host_byte_order() may
point to the same address, that is, conversion in place is supported.
The output buffer must be at least as long as the input.
The length argument specifies the number of 16-bit data units in the
input string; a negative value specifies a zero-terminated string.
If byte_order is NULL, it is assumed that the string starts off in host
byte order. This may be changed by byte-order marks (BOMs) anywhere in
the string (commonly as the first character).
If byte_order is not NULL, a non-zero value of the integer to which it
points means that the input starts off in host byte order, otherwise
the opposite order is assumed. Again, BOMs in the string can change
this. The final byte order is passed back at the end of processing.
If keep_boms is not zero, byte-order mark characters (0xfeff) are
copied into the output string. Otherwise they are discarded.
The result of the function is the number of 16-bit units placed into
the output buffer, including the zero terminator if the string was
zero-terminated.
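The rules above can be summarized in code. The following is a sketch of
the documented behaviour only, written without reference to pcre.h; it is
not the library's implementation, and sketch_utf16_to_host is a
hypothetical name. It assumes that a BOM read as 0xfeff is in host order
and that one read as 0xfffe is byte-swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a UTF-16 string to host byte order, following the rules in the
   text. Returns the number of 16-bit units written to output. */
static int sketch_utf16_to_host(uint16_t *output, const uint16_t *input,
                                int length, int *byte_order, int keep_boms)
{
    /* A NULL byte_order means "assume the input starts in host order". */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    int written = 0;
    int i;

    assert(output != NULL && input != NULL);

    /* A negative length means zero-terminated: count units plus terminator. */
    if (length < 0) {
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        uint16_t c = input[i];
        if (c == 0xfeff || c == 0xfffe) {
            /* A BOM anywhere in the string resets the current byte order. */
            host_order = (c == 0xfeff);
            if (keep_boms != 0) output[written++] = 0xfeff;
        } else {
            /* Swap the two bytes when the current order is not host order. */
            output[written++] =
                host_order ? c : (uint16_t)((c >> 8) | (c << 8));
        }
    }

    /* Pass the final byte order back to the caller, if requested. */
    if (byte_order != NULL) *byte_order = host_order;
    return written;
}
```

Because each unit is read before anything is written at or beyond its
position, the sketch also works when output and input point to the same
buffer.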
SUBJECT STRING OFFSETS
The lengths and starting offsets of subject strings must be specified
in 16-bit data units, and the offsets within subject strings that are
returned by the matching functions are also in 16-bit units rather than
bytes.
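As an illustration of counting in 16-bit units, the helpers below measure
a zero-terminated 16-bit string and convert a unit offset into a byte
offset. Both function names are hypothetical and are not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of a zero-terminated 16-bit string in units, excluding the
   terminator. This is the kind of value to pass as a subject length. */
static int sketch_strlen16(const uint16_t *s)
{
    int n = 0;
    assert(s != NULL);
    while (s[n] != 0) n++;
    return n;
}

/* Convert an offset counted in 16-bit units into a byte offset. */
static size_t sketch_units_to_bytes(int unit_offset)
{
    assert(unit_offset >= 0);
    return (size_t)unit_offset * sizeof(uint16_t);
}
```

Note that a character outside the Basic Multilingual Plane occupies two
units, so unit counts and character counts can differ in UTF-16 mode.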
NAMED SUBPATTERNS
The name-to-number translation table that is maintained for named sub-
patterns uses 16-bit characters. The pcre16_get_stringtable_entries()
function returns the length of each entry in the table as the number of
16-bit data units.
OPTION NAMES
There are two new general option names, PCRE_UTF16 and
PCRE_NO_UTF16_CHECK, which correspond to PCRE_UTF8 and
PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
define the same bits in the options word. There is a discussion about
the validity of UTF-16 strings in the pcreunicode page.
For the pcre16_config() function there is an option PCRE_CONFIG_UTF16
that returns 1 if UTF-16 support is configured, otherwise 0. If this
option is given to pcre_config() or pcre32_config(), or if the
PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF32 option is given to pcre16_con-
fig(), the result is the PCRE_ERROR_BADOPTION error.
CHARACTER CODES
In 16-bit mode, when PCRE_UTF16 is not set, character values are
treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
that they can range from 0 to 0xffff instead of 0 to 0xff. Character
types for characters less than 0xff can therefore be influenced by the
locale in the same way as before. Characters greater than 0xff have
only one case, and no "type" (such as letter or digit).
In UTF-16 mode, the character code is Unicode, in the range 0 to
0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
because those are "surrogate" values that are used in pairs to encode
values greater than 0xffff.
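For reference, the way a surrogate pair encodes a value greater than
0xffff can be sketched as below. The arithmetic is the standard UTF-16
rule; the function name is hypothetical and the code is independent of
PCRE.

```c
#include <assert.h>
#include <stdint.h>

/* Combine a high surrogate (0xd800-0xdbff) and a low surrogate
   (0xdc00-0xdfff) into the code point they encode. */
static uint32_t sketch_combine_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xd800 && high <= 0xdbff);
    assert(low >= 0xdc00 && low <= 0xdfff);
    /* Each surrogate carries 10 bits; the pair is offset by 0x10000. */
    return 0x10000u + (((uint32_t)(high - 0xd800) << 10) |
                       (uint32_t)(low - 0xdc00));
}
```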
A UTF-16 string can indicate its endianness by a special code known as a
byte-order mark (BOM). The PCRE functions do not handle this, expecting
strings to be in host byte order. A utility function called
pcre16_utf16_to_host_byte_order() is provided to help with this (see
above).
ERROR NAMES
The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
spond to their 8-bit counterparts. The error PCRE_ERROR_BADMODE is
given when a compiled pattern is passed to a function that processes
patterns in the other mode, for example, if a pattern compiled with
pcre_compile() is passed to pcre16_exec().
There are new error codes whose names begin with PCRE_UTF16_ERR for
invalid UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for
UTF-8 strings that are described in the section entitled "Reason codes
for invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
are:
  PCRE_UTF16_ERR1  Missing low surrogate at end of string
  PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
  PCRE_UTF16_ERR3  Isolated low surrogate
  PCRE_UTF16_ERR4  Non-character
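The first three conditions can be illustrated with a small checker. This
is a sketch under stated assumptions, not PCRE's validity check: the
enumeration and function names are hypothetical, the numeric values are
not claimed to match the PCRE_UTF16_ERR codes, and the non-character case
is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes mirroring the first three table entries. */
enum sketch_utf16_err {
    SKETCH_UTF16_OK = 0,
    SKETCH_UTF16_MISSING_LOW,   /* cf. PCRE_UTF16_ERR1 */
    SKETCH_UTF16_INVALID_LOW,   /* cf. PCRE_UTF16_ERR2 */
    SKETCH_UTF16_ISOLATED_LOW   /* cf. PCRE_UTF16_ERR3 */
};

/* Check surrogate pairing in a string of 'length' 16-bit units. On
   failure, the offset of the offending unit is stored in *erroffset
   when erroffset is not NULL. */
static enum sketch_utf16_err sketch_check_utf16(const uint16_t *s,
                                                int length, int *erroffset)
{
    int i = 0;
    enum sketch_utf16_err err = SKETCH_UTF16_OK;

    assert(s != NULL && length >= 0);

    while (i < length) {
        uint16_t c = s[i];
        if (c >= 0xd800 && c <= 0xdbff) {            /* high surrogate */
            if (i + 1 >= length) {
                err = SKETCH_UTF16_MISSING_LOW;
                break;
            }
            if (s[i + 1] < 0xdc00 || s[i + 1] > 0xdfff) {
                err = SKETCH_UTF16_INVALID_LOW;
                break;
            }
            i += 2;                                  /* skip the whole pair */
        } else if (c >= 0xdc00 && c <= 0xdfff) {     /* low surrogate alone */
            err = SKETCH_UTF16_ISOLATED_LOW;
            break;
        } else {
            i++;
        }
    }

    if (err != SKETCH_UTF16_OK && erroffset != NULL) *erroffset = i;
    return err;
}
```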
ERROR TEXTS
If there is an error while compiling a pattern, the error text that is
passed back by pcre16_compile() or pcre16_compile2() is still an 8-bit
character string, zero-terminated.
CALLOUTS
The subject and mark fields in the callout block that is passed to a
callout function point to 16-bit vectors.
TESTING
The pcretest program continues to operate with 8-bit input and output
files, but it can be used for testing the 16-bit library. If it is run
with the command line option -16, patterns and subject strings are con-
verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
library functions are used instead of the 8-bit ones. Returned 16-bit
strings are converted to 8-bit for output. If neither the 8-bit nor the
32-bit library was compiled, pcretest defaults to 16-bit and the
-16 option is ignored.
When PCRE is being built, the RunTest script that is called by "make
check" uses the pcretest -C option to discover which of the 8-bit,
16-bit and 32-bit libraries has been built, and runs the tests appro-
priately.
NOT SUPPORTED IN 16-BIT MODE
Not all the features of the 8-bit library are available with the 16-bit
library. The C++ and POSIX wrapper functions support only the 8-bit
library, and the pcregrep program is at present 8-bit only.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 12 May 2013
Copyright (c) 1997-2013 University of Cambridge.
------------------------------------------------------------------------------
PCRE(3) Library Functions Manual PCRE(3)
NAME
PCRE - Perl-compatible regular expressions
#include <pcre.h>
PCRE 32-BIT API BASIC FUNCTIONS
pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
int *errorcodeptr,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre32_extra *pcre32_study(const pcre32 *code, int options,
const char **errptr);
void pcre32_free_study(pcre32_extra *extra);
int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
PCRE_SPTR32 subject, int length, int startoffset,
int options, int *ovector, int ovecsize);
int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
PCRE_SPTR32 subject, int length, int startoffset,
int options, int *ovector, int ovecsize,
int *workspace, int wscount);
PCRE 32-BIT API STRING EXTRACTION FUNCTIONS
int pcre32_copy_named_substring(const pcre32 *code,
PCRE_SPTR32 subject, int *ovector,
int stringcount, PCRE_SPTR32 stringname,
PCRE_UCHAR32 *buffer, int buffersize);
int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
int buffersize);
int pcre32_get_named_substring(const pcre32 *code,
PCRE_SPTR32 subject, int *ovector,
int stringcount, PCRE_SPTR32 stringname,
PCRE_SPTR32 *stringptr);
int pcre32_get_stringnumber(const pcre32 *code,
PCRE_SPTR32 name);
int pcre32_get_stringtable_entries(const pcre32 *code,
PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
int stringcount, int stringnumber,
PCRE_SPTR32 *stringptr);
int pcre32_get_substring_list(PCRE_SPTR32 subject,
int *ovector, int stringcount, PCRE_SPTR32 **listptr);
void pcre32_free_substring(PCRE_SPTR32 stringptr);
void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);
PCRE 32-BIT API AUXILIARY FUNCTIONS
pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);
void pcre32_jit_stack_free(pcre32_jit_stack *stack);
void pcre32_assign_jit_stack(pcre32_extra *extra,
pcre32_jit_callback callback, void *data);
const unsigned char *pcre32_maketables(void);
int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
int what, void *where);
int pcre32_refcount(pcre32 *code, int adjust);
int pcre32_config(int what, void *where);
const char *pcre32_version(void);
int pcre32_pattern_to_host_byte_order(pcre32 *code,
pcre32_extra *extra, const unsigned char *tables);
PCRE 32-BIT API INDIRECTED FUNCTIONS
void *(*pcre32_malloc)(size_t);
void (*pcre32_free)(void *);
void *(*pcre32_stack_malloc)(size_t);
void (*pcre32_stack_free)(void *);
int (*pcre32_callout)(pcre32_callout_block *);
PCRE 32-BIT API 32-BIT-ONLY FUNCTION
int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
PCRE_SPTR32 input, int length, int *byte_order,
int keep_boms);
THE PCRE 32-BIT LIBRARY
Starting with release 8.32, it is possible to compile a PCRE library
that supports 32-bit character strings, including UTF-32 strings, as
well as or instead of the original 8-bit library. This work was done by
Christian Persch, based on the work done by Zoltan Herczeg for the
16-bit library. All three libraries contain identical sets of func-
tions, used in exactly the same way. Only the names of the functions
and the data types of their arguments and results are different. To
avoid over-complication and reduce the documentation maintenance load,
most of the PCRE documentation describes the 8-bit library, with only
occasional references to the 16-bit and 32-bit libraries. This page
describes what is different when you use the 32-bit library.
WARNING: A single application can be linked with all or any of the
three libraries, but you must take care when processing any particular
pattern to use functions from just one library. For example, if you
want to study a pattern that was compiled with pcre32_compile(), you
must do so with pcre32_study(), not pcre_study(), and you must free the
study data with pcre32_free_study().
THE HEADER FILE
There is only one header file, pcre.h. It contains prototypes for all
the functions in all libraries, as well as definitions of flags, struc-
tures, error codes, etc.
THE LIBRARY NAME
In Unix-like systems, the 32-bit library is called libpcre32, and can
normally be accessed by adding -lpcre32 to the command for linking an
application that uses PCRE.
STRING TYPES
In the 8-bit library, strings are passed to PCRE library functions as
vectors of bytes with the C type "char *". In the 32-bit library,
strings are passed as vectors of unsigned 32-bit quantities. The macro
PCRE_UCHAR32 specifies an appropriate data type, and PCRE_SPTR32 is
defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
as "unsigned int", but checks that it really is a 32-bit data type. If
it is not, the build fails with an error message telling the maintainer
to modify the definition appropriately.
STRUCTURE TYPES
The types of the opaque structures that are used for compiled 32-bit
patterns and JIT stacks are pcre32 and pcre32_jit_stack respectively.
The type of the user-accessible structure that is returned by
pcre32_study() is pcre32_extra, and the type of the structure that is
used for passing data to a callout function is pcre32_callout_block.
These structures contain the same fields, with the same names, as their
8-bit counterparts. The only difference is that pointers to character
strings are 32-bit instead of 8-bit types.
32-BIT FUNCTIONS
For every function in the 8-bit library there is a corresponding func-
tion in the 32-bit library with a name that starts with pcre32_ instead
of pcre_. The prototypes are listed above. In addition, there is one
extra function, pcre32_utf32_to_host_byte_order(). This is a utility
function that converts a UTF-32 character string to host byte order if
necessary. The other 32-bit functions expect the strings they are
passed to be in host byte order.
The input and output arguments of pcre32_utf32_to_host_byte_order() may
point to the same address, that is, conversion in place is supported.
The output buffer must be at least as long as the input.
The length argument specifies the number of 32-bit data units in the
input string; a negative value specifies a zero-terminated string.
If byte_order is NULL, it is assumed that the string starts off in host
byte order. This may be changed by byte-order marks (BOMs) anywhere in
the string (commonly as the first character).
If byte_order is not NULL, a non-zero value of the integer to which it
points means that the input starts off in host byte order, otherwise
the opposite order is assumed. Again, BOMs in the string can change
this. The final byte order is passed back at the end of processing.
If keep_boms is not zero, byte-order mark characters (0xfeff) are
copied into the output string. Otherwise they are discarded.
The result of the function is the number of 32-bit units placed into
the output buffer, including the zero terminator if the string was
zero-terminated.
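For 32-bit units, the byte swap and BOM recognition involved can be
sketched as follows. These helpers are hypothetical and are not the
library's code; they assume that a BOM in the opposite byte order reads as
0xfffe0000 on the host.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the four bytes of a 32-bit unit. */
static uint32_t sketch_swap32(uint32_t c)
{
    return (c >> 24) | ((c >> 8) & 0x0000ff00u) |
           ((c << 8) & 0x00ff0000u) | (c << 24);
}

/* Classify a unit as read from memory: 1 means a BOM in host order,
   0 means a BOM in the opposite order, -1 means not a BOM. */
static int sketch_bom32_order(uint32_t unit)
{
    if (unit == 0x0000feffu) return 1;
    if (sketch_swap32(unit) == 0x0000feffu) return 0;
    return -1;
}
```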
SUBJECT STRING OFFSETS
The lengths and starting offsets of subject strings must be specified
in 32-bit data units, and the offsets within subject strings that are
returned by the matching functions are also in 32-bit units rather than
bytes.
NAMED SUBPATTERNS
The name-to-number translation table that is maintained for named sub-
patterns uses 32-bit characters. The pcre32_get_stringtable_entries()
function returns the length of each entry in the table as the number of
32-bit data units.
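The following sketch shows one way to search such a table. It assumes a layout that is not spelled out in this section: each entry is a fixed number of 32-bit units, the first unit holds the subpattern number, and the remaining units hold the zero-terminated name. The function names are hypothetical, and the name is compared with an ASCII string purely for convenience.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare a zero-terminated 32-bit name with an ASCII C string. */
static int name_equals(const uint32_t *name, const char *ascii)
{
    while (*name != 0 && *ascii != '\0') {
        if (*name != (uint32_t)(unsigned char)*ascii) return 0;
        name++;
        ascii++;
    }
    return *name == 0 && *ascii == '\0';
}

/* Hypothetical lookup over `count` entries of `entry_size` 32-bit units.
   Assumed layout: unit 0 is the group number, then the name, then a zero.
   Returns the group number, or -1 if the name is not in the table. */
static int sketch_lookup_name(const uint32_t *table, int count,
                              int entry_size, const char *ascii)
{
    int i;

    for (i = 0; i < count; i++) {
        /* Entries are stepped in units, not bytes. */
        const uint32_t *entry = table + (size_t)i * (size_t)entry_size;
        if (name_equals(entry + 1, ascii)) return (int)entry[0];
    }
    return -1;
}
```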
OPTION NAMES
There are two new general option names, PCRE_UTF32 and
PCRE_NO_UTF32_CHECK, which correspond to PCRE_UTF8 and
PCRE_NO_UTF8_CHECK in the 8-bit library. In fact, these new options
define the same bits in the options word. There is a discussion about
the validity of UTF-32 strings in the pcreunicode page.
For the pcre32_config() function there is an option PCRE_CONFIG_UTF32
that returns 1 if UTF-32 support is configured, otherwise 0. If this
option is given to pcre_config() or pcre16_config(), or if the
PCRE_CONFIG_UTF8 or PCRE_CONFIG_UTF16 option is given to pcre32_con-
fig(), the result is the PCRE_ERROR_BADOPTION error.
CHARACTER CODES
In 32-bit mode, when PCRE_UTF32 is not set, character values are
treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
that they can range from 0 to 0x7fffffff instead of 0 to 0xff. Charac-
ter types for characters up to 0xff can therefore be influenced by
the locale in the same way as before. Characters greater than 0xff
have only one case, and no "type" (such as letter or digit).
In UTF-32 mode, the character code is Unicode, in the range 0 to
0x10ffff, with the exception of values in the range 0xd800 to 0xdfff
because those are "surrogate" values that are ill-formed in UTF-32.
A UTF-32 string can indicate its endianness by a special code known as a
byte-order mark (BOM). The PCRE functions do not handle this, expecting
strings to be in host byte order. A utility function called
pcre32_utf32_to_host_byte_order() is provided to help with this (see
above).
ERROR NAMES
The error PCRE_ERROR_BADUTF32 corresponds to its 8-bit counterpart.
The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
to a function that processes patterns in the other mode, for example,
if a pattern compiled with pcre_compile() is passed to pcre32_exec().
There are new error codes whose names begin with PCRE_UTF32_ERR for
invalid UTF-32 strings, corresponding to the PCRE_UTF8_ERR codes for
UTF-8 strings that are described in the section entitled "Reason codes
for invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
are:
  PCRE_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
  PCRE_UTF32_ERR2  Non-character
  PCRE_UTF32_ERR3  Character > 0x10ffff
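The three conditions can be expressed as a simple per-character test. The sketch below is illustrative only: the SK_ names are hypothetical stand-ins numbered like the list above (the real constants are defined in pcre.h), and "non-character" is taken to mean the Unicode noncharacters, that is, U+FDD0 to U+FDEF and any code point whose low 16 bits are 0xfffe or 0xffff. Whether the library applies exactly these tests is not confirmed here.

```c
#include <stdint.h>

/* Hypothetical reason codes, numbered like the list above. */
enum { SK_OK = 0, SK_ERR1 = 1, SK_ERR2 = 2, SK_ERR3 = 3 };

/* Classify a single 32-bit unit. */
static int sketch_classify_utf32(uint32_t c)
{
    if (c > 0x10ffffu) return SK_ERR3;                  /* out of range */
    if (c >= 0xd800u && c <= 0xdfffu) return SK_ERR1;   /* surrogate */
    if ((c >= 0xfdd0u && c <= 0xfdefu) ||
        (c & 0xfffeu) == 0xfffeu)                       /* ends in fffe/ffff */
        return SK_ERR2;                                 /* non-character */
    return SK_OK;
}

/* Return the offset, in 32-bit units, of the first unit of `s` that fails
   the test, or -1 if all `length` units pass. The reason is stored in
   `*reason` when a failure is found. */
static int sketch_first_invalid(const uint32_t *s, int length, int *reason)
{
    int i;

    for (i = 0; i < length; i++) {
        int r = sketch_classify_utf32(s[i]);
        if (r != SK_OK) {
            *reason = r;
            return i;
        }
    }
    return -1;
}
```

The range check comes first so that a value such as 0x11fffe is reported as too large rather than as a non-character.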
ERROR TEXTS
If there is an error while compiling a pattern, the error text that is
passed back by pcre32_compile() or pcre32_compile2() is still an 8-bit
character string, zero-terminated.
CALLOUTS
The subject and mark fields in the callout block that is passed to a
callout function point to 32-bit vectors.
TESTING
The pcretest program continues to operate with 8-bit input and output
files, but it can be used for testing the 32-bit library. If it is run
with the command line option -32, patterns and subject strings are con-
verted from 8-bit to 32-bit before being passed to PCRE, and the 32-bit
library functions are used instead of the 8-bit ones. Returned 32-bit
strings are converted to 8-bit for output. If neither the 8-bit nor the
16-bit library was compiled, pcretest defaults to 32-bit and the
-32 option is ignored.
When PCRE is being built, the RunTest script that is called by "make
check" uses the pcretest -C option to discover which of the 8-bit,
16-bit and 32-bit libraries has been built, and runs the tests appro-
priately.
NOT SUPPORTED IN 32-BIT MODE
Not all the features of the 8-bit library are available with the 32-bit
library. The C++ and POSIX wrapper functions support only the 8-bit
library, and the pcregrep program is at present 8-bit only.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 12 May 2013
Copyright (c) 1997-2013 University of Cambridge.
------------------------------------------------------------------------------
PCREBUILD(3) Library Functions Manual PCREBUILD(3)
NAME
PCRE - Perl-compatible regular expressions
BUILDING PCRE
PCRE is distributed with a configure script that can be used to build
the library in Unix-like environments using the applications known as
Autotools. Also in the distribution are files to support building
using CMake instead of configure. The text file README contains general
information about building with Autotools (some of which is repeated
below), and also has some comments about building on various operating
systems. There is a lot more information about building PCRE without
using Autotools (including information about using CMake and building
"by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
consult this file as well as the README file if you are building in a
non-Unix-like environment.
PCRE BUILD-TIME OPTIONS
The rest of this document describes the optional features of PCRE that
can be selected when the library is compiled. It assumes use of the
configure script, where the optional features are selected or dese-
lected by providing options to configure before running the make com-
mand. However, the same options can be selected in both Unix-like and
non-Unix-like environments using the GUI facility of cmake-gui if you
are using CMake instead of configure to build PCRE.
If you are not using Autotools or CMake, option selection can be done
by editing the config.h file, or by passing parameter settings to the
compiler, as described in NON-AUTOTOOLS-BUILD.
The complete list of options for configure (which includes the standard
ones such as the selection of the installation directory) can be
obtained by running
./configure --help
The following sections include descriptions of options whose names
begin with --enable or --disable. These settings specify changes to the
defaults for the configure command. Because of the way that configure
works, --enable and --disable always come in pairs, so the complemen-
tary option always exists as well, but as it specifies the default, it
is not described.
BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
By default, a library called libpcre is built, containing functions
that take string arguments contained in vectors of bytes, either as
single-byte characters, or interpreted as UTF-8 strings. You can also
build a separate library, called libpcre16, in which strings are con-
tained in vectors of 16-bit data units and interpreted either as sin-
gle-unit characters or UTF-16 strings, by adding
--enable-pcre16
to the configure command. You can also build yet another separate
library, called libpcre32, in which strings are contained in vectors of
32-bit data units and interpreted either as single-unit characters or
UTF-32 strings, by adding
--enable-pcre32
to the configure command. If you do not want the 8-bit library, add
--disable-pcre8
as well. At least one of the three libraries must be built. Note that
the C++ and POSIX wrappers are for the 8-bit library only, and that
pcregrep is an 8-bit program. None of these are built if you select
only the 16-bit or 32-bit libraries.
BUILDING SHARED AND STATIC LIBRARIES
The Autotools PCRE building process uses libtool to build both shared
and static libraries by default. You can suppress one of these by
adding one of
--disable-shared
--disable-static
to the configure command, as required.
C++ SUPPORT
By default, if the 8-bit library is being built, the configure script
will search for a C++ compiler and C++ header files. If it finds them,
it automatically builds the C++ wrapper library (which supports only
8-bit strings). You can disable this by adding
--disable-cpp
to the configure command.
UTF-8, UTF-16 AND UTF-32 SUPPORT
To build PCRE with support for UTF Unicode character strings, add
--enable-utf
to the configure command. This setting applies to all three libraries,
adding support for UTF-8 to the 8-bit library, support for UTF-16 to
the 16-bit library, and support for UTF-32 to the 32-bit
library. There are no separate options for enabling UTF-8, UTF-16 and
UTF-32 independently because that would allow ridiculous settings such
as requesting UTF-16 support while building only the 8-bit library. It
is not possible to build one library with UTF support and another with-
out in the same configuration. (For backwards compatibility, --enable-
utf8 is a synonym of --enable-utf.)
Of itself, this setting does not make PCRE treat strings as UTF-8,
UTF-16 or UTF-32. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8, PCRE_UTF16 or PCRE_UTF32 option (as
appropriate) when you call one of the pattern compiling functions.
If you set --enable-utf when compiling in an EBCDIC environment, PCRE
expects its input to be either ASCII or UTF-8 (depending on the run-
time option). It is not possible to support both EBCDIC and UTF-8 codes
in the same version of the library. Consequently, --enable-utf and
--enable-ebcdic are mutually exclusive.
UNICODE CHARACTER PROPERTY SUPPORT
UTF support allows the libraries to process character codepoints up to
0x10ffff in the strings that they handle. On its own, however, it does
not provide any facilities for accessing the properties of such charac-
ters. If you want to be able to use the pattern escapes \P, \p, and \X,
which refer to Unicode character properties, you must add
--enable-unicode-properties
to the configure command. This implies UTF support, even if you have
not explicitly requested it.