<!DOCTYPE html>
<html>
<head>
<title>Pattern-matching regular expressions in Scheme using derivatives</title>
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://matt.might.net/articles/feed.rss" />
<link rel="stylesheet" href="../../css/raised-paper-2.css" />
<meta name="viewport" content="width=480, initial-scale=1" />
<link rel="stylesheet" media="screen and (max-device-width: 480px)" href="../../css/raised-paper-2-handheld.css" />
<script type="text/javascript" src="../../matt.might.js"></script>
<script type="text/javascript">
var ArticleVersion = 2 ;
</script>
<script>
<!--
include("article-style.js");
//-->
</script>
<script type="text/javascript" src="../manifest.js"></script>
<script type="text/javascript" src="../index-manifest.js"></script>
<script type="text/javascript">
<!--
// var Key = "[an error occurred while processing the directive]";
var Pathname = location.pathname ;
var PathParts = Pathname.split(/\//) ;
var Key = PathParts[PathParts.length-1] ;
if (Key == "")
Key = PathParts[PathParts.length-2] ;
//-->
</script>
</head>
<body>
<div id="body">
<div id="abstract-container" class="module">
<div id="abstract-content" class="fat-content">
<h1>A regular expression matcher in Scheme using derivatives</h1>
<div>
[<a href="../">article index</a>]
[<script>
var emailMatt = '<a href="mai'+'lto:matt-blog'+'@'+'migh'+'t.net">email me</a>'
document.write(emailMatt);
//-->
</script>]
[<a href="http://twitter.com/mattmight">@mattmight</a>]
[<a href="http://gplus.to/mattmight">+mattmight</a>]
[<a href="../feed.rss">rss</a>]
</div>
<p>
Many assume that matching a regular expression requires converting it
to an NFA and then either running a backtracking search or converting
further to a DFA.
Using the derivative of regular expressions, it's possible to
write a simple pattern-matcher with neither NFA conversion nor backtracking.
</p>
<p> The derivative of a regular expression is an algebraic
manipulation of the regular expression that calculates its "partially
matched" tail with respect to a particular character. </p>
<p>
Read on to see how it's done, with examples in Scheme.
</p>
</div> <!-- /#content -->
</div> <!-- /#content-container -->
<div class="module fat-container">
<div class="fat-content">
<center>
<script type="text/javascript"><!--
google_ad_client = "pub-4400645483943138";
/* Header ad unit */
google_ad_slot = "8276008011";
google_ad_width = 468;
google_ad_height = 60;
//-->
</script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
</center>
</div>
</div>
<div id="content-container" class="module">
<div id="article-content">
<h2>More resources</h2>
<p>
Related blog articles:
</p>
<ul>
<li>
<a href="../implementation-of-nfas-and-regular-expressions-in-java/">NFA-based regex matching in Java</a>.
</li>
<li>
<a href="../nonblocking-lexing-toolkit-based-on-regex-derivatives/">A lexing toolkit based on derivatives in Scala</a>.
</li>
</ul>
<p>Papers:</p>
<ul>
<li>Brzozowski's original paper, <a href="http://portal.acm.org/citation.cfm?id=321249">"Derivatives of Regular Expressions."</a></li>
<li>Owens, Reppy and Turon's <a href="http://www.ccs.neu.edu/home/turon/re-deriv.pdf">"Regular-expression derivatives re-examined."</a></li>
</ul>
<p>Books:</p>
<ul>
<li>
MIT prof Michael Sipser's <a href="http://www.amazon.com/gp/product/0534950973?ie=UTF8&tag=ucmbread-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=0534950973">Theory of Computation</a><img src="http://www.assoc-amazon.com/e/ir?t=ucmbread-20&l=as2&o=1&a=0534950973" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />
covers regular languages and automata in exhaustive detail in the very first chapter.
</li>
</ul>
<h2>Derivatives of regular expressions</h2>
<p> The derivative of a regular expression with respect to a character
is a new regular expression that matches the remainder of every string
the original expression matches that begins with that character.
</p>
<p>
For example, the derivative of the expression <code>(foo|frak)*</code>
with respect to the character <code>f</code> is the expression
<code>(oo|rak)(foo|frak)*</code>.
</p>
<p> On the other hand, the derivative of the expression
<code>(foo|frak)*</code> with respect to the character <code>c</code>
is the null pattern ∅. The null pattern is not allowed in most
regular expression implementations, but it is necessary in order to
make the derivative a total function. By definition, no string can
match the null pattern.</p>
<p>
The matching strategy with derivatives is straightforward:
</p>
<ol>
<li>If the string to match is empty, the match succeeds if the current
pattern accepts the empty string and fails otherwise. </li>
<li> If the string to match is non-empty, the new pattern is the
derivative of the current pattern with respect to the first
character of the current string, the new string to match is the
remainder of the current string, and matching repeats from step 1. </li>
</ol>
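<p>
As a rough sketch, the loop itself fits in a few lines of Scheme.
Everything named here is hypothetical and exists only for this illustration:
<code>sketch-match</code> takes the derivative and the accepts-empty test
as arguments, so it can be tried out before the real derivative is defined.
The toy instance treats a pattern as a plain list of symbols that must appear
in order, with <code>#f</code> standing in for the null pattern.
</p>
<div class="klipse-scheme">
;; Generic matching loop. derive and nullable? are supplied by the caller.
(define (sketch-match derive nullable? pattern input)
  (if (null? input)
      ;; Step 1: input exhausted, so succeed only if the pattern accepts empty.
      (nullable? pattern)
      ;; Step 2: take the derivative and continue with the rest of the input.
      (sketch-match derive nullable?
                    (derive pattern (car input))
                    (cdr input))))

;; Toy instance: a pattern is a list of symbols to match in order.
(define (toy-derive pattern c)
  (if (and (pair? pattern) (eq? (car pattern) c))
      (cdr pattern)   ; first symbol matched; the rest remains
      #f))            ; mismatch: the null pattern

(define (toy-nullable? pattern)
  (null? pattern))    ; only the empty list accepts the empty input

(list (sketch-match toy-derive toy-nullable? '(f o o) '(f o o))
      (sketch-match toy-derive toy-nullable? '(f o o) '(f o x)))
; evaluates to (#t #f)
</div>
<p>
Note that once the pattern becomes <code>#f</code> it stays <code>#f</code>,
so a failed match simply runs to the end of the input and reports failure there.
</p>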
<p> To define the derivative, we first need a function <i>δ</i> that
returns the empty-string pattern ε if its argument accepts the empty
string, and the null pattern ∅ otherwise:
</p>
<ul>
<li>
<i>δ</i>(∅) = ∅
</li>
<li>
<i>δ</i>(ε) = ε
</li>
<li>
<i>δ</i>(<i>c</i>) = ∅
</li>
<li>
<i>δ</i>(<i>re</i><sub>1</sub> <i>re</i><sub>2</sub>) =
<i>δ</i>(<i>re</i><sub>1</sub>) <i>δ</i>(<i>re</i><sub>2</sub>)
</li>
<li>
<i>δ</i>(<i>re</i><sub>1</sub> | <i>re</i><sub>2</sub>) =
<i>δ</i>(<i>re</i><sub>1</sub>) | <i>δ</i>(<i>re</i><sub>2</sub>)
</li>
<li>
<i>δ</i>(<i>re</i><sup>*</sup>) = ε
</li>
</ul>
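<p>
As a minimal sketch, <i>δ</i> transcribes almost line for line into Scheme.
The encoding is an assumption made for this illustration, not something the
definition fixes: <code>#f</code> stands for the null pattern ∅,
<code>#t</code> for the empty-string pattern ε, a bare symbol for a single
character, and lists tagged <code>seq</code>, <code>alt</code> and
<code>rep</code> for concatenation, alternation and repetition.
The name <code>sketch-delta</code> is hypothetical.
</p>
<div class="klipse-scheme">
;; sketch-delta: returns #t (the empty-string pattern) if re accepts the
;; empty string, and #f (the null pattern) otherwise.
(define (sketch-delta re)
  (cond ((eq? re #f) #f)                 ; delta(null) = null
        ((eq? re #t) #t)                 ; delta(empty) = empty
        ((symbol? re) #f)                ; delta(c) = null
        ((not (pair? re)) #f)            ; anything else unrecognized: null
        ((eq? (car re) 'seq)             ; both parts must accept empty
         (and (sketch-delta (cadr re)) (sketch-delta (caddr re))))
        ((eq? (car re) 'alt)             ; either part may accept empty
         (or (sketch-delta (cadr re)) (sketch-delta (caddr re))))
        ((eq? (car re) 'rep) #t)         ; delta(re*) = empty
        (else #f)))

(list (sketch-delta '(rep foo))
      (sketch-delta '(seq foo (rep bar)))
      (sketch-delta '(alt foo (rep bar))))
; evaluates to (#t #f #t)
</div>
<p>
With booleans standing in for ε and ∅, concatenating two <i>δ</i> results
behaves like <code>and</code>, and alternating them behaves like <code>or</code>.
</p>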
<p>
Let <i>D<sub>c</sub></i>(<i>re</i>) denote the derivative of the
regular expression <i>re</i> with respect to the character <i>c</i>;
then the derivative can be defined recursively:
</p>
<ul>
<li>
<i>D<sub>c</sub></i>(∅) = ∅
</li>
<li>
<i>D<sub>c</sub></i>(ε) = ∅
</li>
<li>
<i>D<sub>c</sub></i>(<i>c</i>) = ε
</li>
<li>
<i>D<sub>c</sub></i>(<i>c'</i>) = ∅ if <i>c</i> ≠ <i>c'</i>.
</li>
<li><i>D<sub>c</sub></i>(<i>re</i><sub>1</sub> <i>re</i><sub>2</sub>)
=
<i>δ</i>(<i>re</i><sub>1</sub>) <i>D<sub>c</sub></i>(<i>re</i><sub>2</sub>)
| <i>D<sub>c</sub></i>(<i>re</i><sub>1</sub>) <i>re</i><sub>2</sub>
</li>
<li><i>D<sub>c</sub></i>(<i>re</i><sub>1</sub> | <i>re</i><sub>2</sub>)
=
<i>D<sub>c</sub></i>(<i>re</i><sub>1</sub>)
| <i>D<sub>c</sub></i>(<i>re</i><sub>2</sub>)
</li>
<li>
<i>D<sub>c</sub></i>(<i>re</i><sup>*</sup>) =
<i>D<sub>c</sub></i>(<i>re</i>) <i>re</i><sup>*</sup>
</li>
</ul>
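<p>
The rules above can be sketched in Scheme as follows. This is one possible
transcription under stated assumptions, not necessarily how the code in the
next section is written: <code>#f</code> stands for ∅, <code>#t</code> for ε,
a symbol for a character, and <code>seq</code>, <code>alt</code> and
<code>rep</code> lists for the compound forms. All <code>sketch-</code> names
are hypothetical. The two constructors simplify away ∅ and ε as they build,
which keeps the results readable.
</p>
<div class="klipse-scheme">
(define (sketch-tagged? re tag)
  (and (pair? re) (eq? (car re) tag)))

;; Concatenation: null absorbs everything, the empty string disappears.
(define (sketch-seq p q)
  (cond ((eq? p #f) #f)
        ((eq? q #f) #f)
        ((eq? p #t) q)
        ((eq? q #t) p)
        (else (list 'seq p q))))

;; Alternation: a null branch is dropped.
(define (sketch-alt p q)
  (cond ((eq? p #f) q)
        ((eq? q #f) p)
        (else (list 'alt p q))))

;; delta: #t if re accepts the empty string, #f otherwise.
(define (sketch-delta re)
  (cond ((eq? re #t) #t)
        ((sketch-tagged? re 'rep) #t)
        ((sketch-tagged? re 'seq)
         (and (sketch-delta (cadr re)) (sketch-delta (caddr re))))
        ((sketch-tagged? re 'alt)
         (or (sketch-delta (cadr re)) (sketch-delta (caddr re))))
        (else #f)))

;; Derivative of re with respect to the character c.
(define (sketch-derive re c)
  (cond ((eq? re #f) #f)                         ; D(null) = null
        ((eq? re #t) #f)                         ; D(empty) = null
        ((symbol? re) (eq? re c))                ; empty if same char, else null
        ((sketch-tagged? re 'seq)
         (sketch-alt
          (sketch-seq (sketch-delta (cadr re))
                      (sketch-derive (caddr re) c))
          (sketch-seq (sketch-derive (cadr re) c)
                      (caddr re))))
        ((sketch-tagged? re 'alt)
         (sketch-alt (sketch-derive (cadr re) c)
                     (sketch-derive (caddr re) c)))
        ((sketch-tagged? re 'rep)
         (sketch-seq (sketch-derive (cadr re) c) re))
        (else #f)))

(list (sketch-derive 'baz 'f)
      (sketch-derive '(seq foo barn) 'foo)
      (sketch-derive '(alt (seq foo bar) (seq foo (rep baz))) 'foo))
; evaluates to (#f barn (alt bar (rep baz)))
</div>
<p>
Without the simplifying constructors the results would denote the same
languages, but they would be cluttered with redundant ∅ and ε terms that grow
with every character consumed.
</p>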
<h2>Code</h2>
<p>
I've implemented the derivative in Scheme as a quick demonstration.
Without much code, you can actually create a functioning regular-expression matcher:
</p>
<div class="klipse-scheme" data-gist-id="viebel/df7a6f5ad50793252917c593da934e39">
</div>
<p>And it works!</p>
<div class="klipse-scheme">
(d/dc 'baz 'f)
</div>
<div class="klipse-scheme">
(d/dc '(seq foo barn) 'foo)
</div>
<div class="klipse-scheme">
(d/dc '(alt (seq foo bar) (seq foo (rep baz))) 'foo)
</div>
<div class="klipse-scheme">
(regex-match '(seq foo (rep bar))
'(foo bar bar bar))
</div>
<div class="klipse-scheme">
(regex-match '(seq foo (rep bar))
'(foo bar baz bar bar))
</div>
</div>
<hr />
<div id="footer-links">
[<a href="../">article index</a>]
[<script>
var emailMatt = '<a href="mai'+'lto:matt-blog'+'@'+'migh'+'t.net">email me</a>'
document.write(emailMatt);
//-->
</script>]
[<a href="http://twitter.com/mattmight">@mattmight</a>]
[<a href="http://gplus.to/mattmight">+mattmight</a>]
[<a href="../feed.rss">rss</a>]
</div>
</div> <!-- /#content -->
</div> <!-- /#content-container -->
<link rel="stylesheet" type="text/css" href="https://storage.googleapis.com/app.klipse.tech/css/codemirror.css">
<script>
window.klipse_settings = {
selector_eval_scheme: '.klipse-scheme'
};
</script>
<script src="http://www.biwascheme.org/release/biwascheme-0.6.6-min.js"></script>
<script src="https://storage.googleapis.com/app.klipse.tech/plugin/js/klipse_plugin.js"></script>
<div id="footer-ad" class="module fat-container">
<div class="fat-content">
<center>
<script type="text/javascript"><!--
google_ad_client = "pub-4400645483943138";
/* Article footer banner */
google_ad_slot = "3531754286";
google_ad_width = 468;
google_ad_height = 60;
//-->
</script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
</center>
</div> <!-- /footer-ad -->
</div> <!-- /footer-ad-container -->
<div id="footer-linode" class="module fat-container">
<div class="fat-content">
<center>
matt.might.net is powered by <b><a href="http://www.linode.com/?r=bf5d4e7c8a1af61855b5227279a6744c3bde8a8a">linode</a></b> |
<a href="../legal/">legal information</a>
</center>
</div>
</div>
</div> <!-- /#body -->
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-3661244-1");
pageTracker._trackPageview();
</script>
</body>
</html>