Menu

[r760]: / trunk / lispbuilder-regex / index.html  Maximize  Restore  History

Download this file

227 lines (157 with data), 9.0 kB

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>REGEX Package</title>
</head>
<body>
<p><em>Text copied from
<a href="http://www.geocities.com/mparker762/clawk">http://www.geocities.com/mparker762/clawk</a></em></p>
<h2>REGEX package</h2>
<p>The regex engine is a pretty full-featured matcher, and
thus is useful by itself. It was originally written as a
prototype for a C++ matcher, though it has since diverged
greatly.</p>
<p>The regex compiler supports the following pattern syntax:</p>
<ul>
<li>^ matches the start of a string.</li>
<li>$ matches the end of a string.</li>
<li>[...] denotes a character class.</li>
<li>[^...] denotes a negated character class.</li>
<li>[:...:] denotes a special character class.
<ul>
<li>[:alpha:] == [A-Za-z]</li>
<li>[:upper:] == [A-Z]</li>
<li>[:lower:] == [a-z]</li>
<li>[:digit:] == [0-9]</li>
<li>[:alnum:] == [A-Za-z0-9]</li>
<li>[:xdigit:] == [A-Fa-f0-9]</li>
<li>[:space:] == whitespace</li>
<li>[:punct:] == punctuation marks</li>
<li>[:graph:] == printable characters other than space</li>
<li>[:cntrl:] == control characters</li>
<li>[:word:] == wordlike characters</li>
<li>[^:...:] denotes a negated special character class.</li>
</ul>
</li>
<li>. matches any character.</li>
<li>(...) delimits a regex subexpression. Also denotes a
register pattern.</li>
<li>(?...) denotes a regex subexpression that <em>will
not be captured in a register</em>.</li>
<li>(?=...) denotes a regex
subexpression that will be used as a forward lookahead.
If the subexpression matches, then the rest of the match
will continue as if the lookahead match had not occurred
(i.e. it does not consume the candidate string). It will
not be captured in a register, though it can contain
subexpressions that may be captured.</li>
<li>(?!...) denotes a regex subexpression that will be used
as a negative forward lookahead (the match will continue only
if the lookahead <em>failed</em> to match). It will not be
captured in a register, though it can contain
subexpressions that may be captured.</li>
<li>* denotes the kleene closure of the previous regex
subexpression.</li>
<li>+ denotes the positive closure of the previous regex
subexpression.</li>
<li>*? denotes the non-greedy kleene closure of the
previous regex subexpression.</li>
<li>+? denotes the non-greedy positive closure of the
previous regex subexpression.</li>
<li>? denotes the greedy match of 0 or 1 occurrences of
the previous regex subexpression.</li>
<li>?? denotes the non-greedy match of 0 or 1 occurrences
of the previous regex subexpression.</li>
<li>\nn denotes a back-match against the contents of a
previously-matched register.</li>
<li>{nn,mm} denotes a bounded repetition.</li>
<li>{nn,mm}? denotes a non-greedy bounded repetition.</li>
<li>\n, \t, \r have their normal meanings.</li>
<li>\d matches any decimal character,
\D matches any nondecimal character.</li>
<li>\w matches any wordlike
character, \W matches any nonwordlike character.</li>
<li>\s matches any whitespace
character, \S matches any nonspace character.</li>
<li>\&lt; matches at the start of a
word. \&gt; matches at the end of a word.</li>
<li>\&lt;char&gt; that character (escapes an otherwise
special meaning).</li>
<li>Special characters lose their specialness when
escaped. There is a flag to control this.</li>
<li>All other characters are matched literally.</li>
</ul>
<p>There are a variety of functions in the REGEX package
that allow the programmer to adjust the allowable regular
expression syntax:</p>
<ul>
<li>The function ESCAPE-SPECIAL-CHARS allows you to change
whether the meta-characters have their magic meaning when
escaped or unescaped. The default behavior (per AWK
syntax) is that special chars are unescaped.</li>
<li>The function ALLOW-BACKMATCH allows you to change
whether or not the \nn syntax is allowed. By default it
is allowed.</li>
<li>The function ALLOW-RANGEMATCH allows you to change
whether or not the the {nn,mm} bounded repetition syntax
is allowed. By default it is allowed.</li>
<li>The function ALLOW-NONGREEDY-QUANTIFIERS allows you to
change whether or not the *?, +?, ??, and {nn,mm}?
quantifiers are recognized. By default they are
allowed.</li>
<li>The function ALLOW-NONREGISTER-GROUPS allows you to
change whether or not the (?...) syntax is recognized. By
default it is allowed.</li>
<li>The function DOT-MATCHES-NEWLINE allows you to change
whether '.' in a pattern matches the newline character.
This is false by default.</li>
</ul>
<p>Parenthesized expressions within the pattern are
considered a register pattern, and will be recorded for use
after the match. There is an implicit set of parentheses
around the entire expression, so the bounds of the matched
text itself will always occupy register 0.</p>
<p>Extensions that will be coming soon include:</p>
<ol>
<li>I am working on a second backend for the regex
compiler that generates an even faster matcher (~4-20x
faster on Symbolics, ~ 2x faster on LWW). The compilation
process itself is substantially slower. I've got some
more work to do to get the speed up even further on
Lispworks, although the current system is already much,
much faster than GNU Regex.</li>
<li>Optionally allowing a negated regex pattern using the
&lt;pattern&gt; '^' syntax. This also subsumes the
negated character class in that <code>[^...]</code> ===
<code>[...]^</code>.</li>
<li>Faster scans by using a possible-prefix set. This
isn't real high priority at the moment since matching is
plenty fast already :-)</li>
<li>Prefix and postfix context patterns ala LEX.</li>
</ol>
<p>Regex has been recently enhanced. Everything from the
parser back has been completely rewritten. The regex system
now includes a bunch of functions for manipulating regex
parse trees directly, a multipass optimizer and code
generator, and a new matching engine.</p>
<p>The new regex system does a better job of optimizing a
wider range of patterns. It also supports an extension that
allows you to provide an "accept" function to the match-str
function. This acceptfn takes the start and end position as
parameters, and can find the string itself in the special
variable *STR* and the registers in the special variable
*regs*. It returns either nil to force the matcher to
backtrack, or a non-nil value which will be returned as
the success code for the match.</p>
<p>An additional change is that register patterns within
quantified patterns now return the leftmost occurrence in
the source string. There is a flag to force the more usual
rightmost match, but this will reduce the applicability of
many critical optimizations.</p>
<p>The latest version of regex supports the Perl \d, \D, \w,
\W, \s, and \S metasequences, as well as the egrep \&lt;
start-of-word and \&gt; end-of-word metasequences.</p>
<hr>
<address><a href="mailto:mparker762@hotmail.com">Michael Parker</a></address>
</body>
</html>
Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.