Lightweight Structure in Text
Robert C. Miller
Robert C. Miller. Lightweight Structure in Text. PhD thesis, Computer
Science Department, School of Computer Science, Carnegie Mellon University,
May 2002. Published as CMU Computer Science technical report CMU-CS-02-134
and CMU Human-Computer Interaction Institute technical report CMU-HCII-02-103.
Abstract
Pattern matching is heavily used for searching, filtering, and transforming
text, but existing pattern languages offer few opportunities for reuse. Lightweight
structure is a new approach that solves the reuse problem. Lightweight structure
has three parts: a model of text structure as contiguous segments of text,
or regions; an extensible library of structure abstractions (e.g., HTML elements,
Java expressions, or English sentences) that can be implemented by any kind
of pattern or parser; and a region algebra for composing and reusing structure
abstractions. Lightweight structure does for text pattern matching what procedure
abstraction does for programming, enabling construction of a reusable library.
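The three parts described above can be illustrated with a minimal Python sketch. It is not the LAPIS implementation and does not reproduce the thesis's operator definitions. The names `Region`, `matches`, `inside`, and `containing` are hypothetical, and regular expressions stand in for the patterns and parsers that could back a structure abstraction.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Region:
    """A contiguous segment of text, as character offsets [start, end)."""
    start: int
    end: int


def matches(pattern, text):
    """A structure abstraction backed by a regex: returns a set of regions."""
    return {Region(m.start(), m.end()) for m in re.finditer(pattern, text)}


def inside(a, b):
    """Regions of a that lie within some region of b (illustrative operator)."""
    return {r for r in a if any(s.start <= r.start and r.end <= s.end for s in b)}


def containing(a, b):
    """Regions of a that contain some region of b (illustrative operator)."""
    return {r for r in a if any(r.start <= s.start and s.end <= r.end for s in b)}


text = "Call 555-1234 today. Email us instead. Fax 555-9876 now."

# Two abstractions, defined once and reusable in any later expression.
sentences = matches(r"[^.]+\.", text)
numbers = matches(r"\d{3}-\d{4}", text)

# Composition: sentences that contain a phone-like number.
with_number = containing(sentences, numbers)
result = sorted(text[r.start:r.end].strip() for r in with_number)
print(result)
```

Because every abstraction returns the same kind of value (a set of regions), an operator does not need to know whether its inputs came from a regular expression, a grammar-based parser, or another operator. That uniformity is what makes reuse possible in this sketch.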
Lightweight structure has been implemented in LAPIS, a web browser/text
editor that demonstrates several novel techniques:
- Text constraints is a new pattern language for composing structure abstractions,
based on the region algebra. Text constraint patterns are simple and high-level,
and user studies have shown that users can generate and comprehend them.
- Simultaneous editing uses multiple selections for repetitive text editing.
Multiple selections are inferred from examples given by the user, drawing
on the lightweight structure library to make fast, accurate, domain-specific
inferences from very few examples. In user studies, simultaneous editing
required only 1.26 examples per selection, approaching the 1-example ideal.
- Outlier finding draws the user's attention to inconsistent selections
or pattern matches, both possible false positives and possible false negatives.
When integrated into simultaneous editing and tested in a user study, outlier
finding reduced user errors.
- Unix tools for structured text extend tools like grep and sort with
lightweight structure, and the browser shell integrates a Unix command prompt
into a web browser, offering new ways to build pipelines and automate web
browsing.
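The outlier-finding idea in the list above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the algorithm from the thesis. The feature set in `features` and the majority-vote scoring in `rank_outliers` are both hypothetical choices made for this example.

```python
def features(s):
    """Hypothetical binary features describing one selected string."""
    return (
        s[:1].isdigit(),              # starts with a digit
        s[:1].isupper(),              # starts with an uppercase letter
        any(c.isdigit() for c in s),  # contains a digit anywhere
        " " in s,                     # contains a space
    )


def rank_outliers(selections):
    """Score each selection by how many features disagree with the majority.

    Returns (score, selection) pairs, most unusual first.
    """
    vectors = [features(s) for s in selections]
    n = len(vectors)
    # Majority value of each feature across all selections.
    majority = [sum(column) * 2 >= n for column in zip(*vectors)]
    scored = [
        (sum(value != common for value, common in zip(vector, majority)), s)
        for vector, s in zip(vectors, selections)
    ]
    return sorted(scored, key=lambda pair: -pair[0])


# A set of selections that is mostly first names, with one likely mistake.
selections = ["Alice", "Bob", "Carol", "42 Main St", "Dave"]
ranked = rank_outliers(selections)
print(ranked[0])
```

A tool built along these lines would show the highest-scoring items to the user for confirmation rather than removing them automatically, since an unusual selection may still be correct.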
Theoretical contributions include a formal definition of the region algebra,
data structures and algorithms for efficient implementation, and a characterization
of the classes of languages recognized by algebra expressions.
Lightweight structure enables efficient composition and reuse of structure
abstractions defined by various kinds of patterns and parsers, bringing improvements
to pattern matching, text processing, web automation, repetitive text editing,
inference of patterns from examples, and error detection.
Full Text
- Thesis summary (PDF, 13 pages, 220 KB)
- Entire dissertation (PDF, 341 pages, 3.2 MB)
- One chapter at a time in PDF format:
  - Front matter (22 pages, 70 KB)
  - Chapter 1: Introduction (12 pages, 210 KB)
  - Chapter 2: Related Work (10 pages, 60 KB)
  - Chapter 3: Region Algebra (28 pages, 390 KB)
  - Chapter 4: Region Algebra Implementation (46 pages, 500 KB)
  - Chapter 5: Language Theory (24 pages, 210 KB)
  - Chapter 6: Text Constraints (38 pages, 220 KB)
  - Chapter 7: Multiple Selections (28 pages, 290 KB)
  - Chapter 8: Commands and Scripting (34 pages, 730 KB)
  - Chapter 9: Selection Inference (32 pages, 530 KB)
  - Chapter 10: Outlier Finding (14 pages, 130 KB)
  - Chapter 11: Conclusion (12 pages, 70 KB)
  - Appendices and Bibliography (41 pages, 100 KB)