07 - String Processing
CS4323 / 0708-2
YFA
http://www.ittelkom.ac.id/staf/yanuar
The Common Query Language
The Common Query Language: Examples
Simple queries
dinosaur
comp.sources.misc
"complete dinosaur"
"the complete dinosaur"
"ext->u.generic"
"and"
Booleans
dinosaur or bird
dinosaur and bird or dinobird
(bird or dinosaur) and (feathers or scales)
"feathered dinosaur" and (yixian or jehol)
(((a and b) or (c not d) not (e or f and g)) and h not i) or j
The Common Query Language: Examples
Proximity
The prox operator:
prox/relation/distance/unit/ordering
Examples:
complete prox dinosaur [adjacent]
(caudal or dorsal) prox vertebra
ribs prox//5 chevrons [near 5]
ribs prox//0/sentence chevrons [same sentence]
ribs prox/>/0/paragraph chevrons [not adjacent]
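As a rough illustration of the word-distance examples above, the sketch below checks whether two words occur within a given number of word positions of each other. It assumes the reading "unordered, unit = word, relation = at most N" for prox//N; the class ProxSketch and its method near are hypothetical names, not part of any CQL implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: illustrates word-distance proximity only.
class ProxSketch {
    // True if words a and b occur at most maxDistance word positions apart,
    // in either order (assumed reading of "a prox//maxDistance b").
    static boolean near(String text, String a, String b, int maxDistance) {
        String[] words = text.toLowerCase(Locale.ROOT).split("\\W+");
        List<Integer> posA = new ArrayList<>();
        List<Integer> posB = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(a)) posA.add(i);
            if (words[i].equals(b)) posB.add(i);
        }
        for (int i : posA) {
            for (int j : posB) {
                if (i != j && Math.abs(i - j) <= maxDistance) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The ribs lie well ahead of the chevrons";
        System.out.println(near(text, "ribs", "chevrons", 5));
        System.out.println(near(text, "ribs", "chevrons", 6));
    }
}
```

In this sketch adjacent words are at distance 1, so the "adjacent" example corresponds to near(text, "complete", "dinosaur", 1).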
The Common Query Language: Examples
Relations
year > 1998
title all "complete dinosaur"
title any "dinosaur bird reptile"
title exact "the complete dinosaur"
publicationYear < 1980
numberOfWheels <= 3
numberOfPlates = 18
lengthOfFemur > 2.4
bioMass >= 100
numberOfToes <> 3
The Common Query Language: Examples
Relation Modifiers
title all/stem "complete dinosaur"
title any/relevant "dinosaur bird reptile"
title exact/fuzzy "the complete dinosaur"
author=/fuzzy tailor
The implementations of relevant and fuzzy are not
defined by the query language.
The Common Query Language: Examples
Pattern Matching
dinosaur* [zero or more characters]
*sauria
man?raptor [exactly one character]
man?raptor*
"the comp*saur"
char\* [literal "*"]
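One way to experiment with the masking characters above is to translate a term into a java.util.regex pattern. The sketch below is only an illustration under that assumption: the class CqlWildcard and its methods are hypothetical, and real CQL servers are free to implement pattern matching differently.

```java
import java.util.regex.Pattern;

// Hypothetical translator from the wildcard syntax above to a Java regex.
class CqlWildcard {
    static Pattern toRegex(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '\\' && i + 1 < term.length()) {
                // Backslash escapes the next character, e.g. \* is a literal "*".
                sb.append(Pattern.quote(String.valueOf(term.charAt(++i))));
            } else if (c == '*') {
                sb.append(".*");   // zero or more characters
            } else if (c == '?') {
                sb.append('.');    // exactly one character
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True if the whole word matches the wildcard term.
    static boolean matches(String term, String word) {
        return toRegex(term).matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("man?raptor", "maniraptor"));
        System.out.println(matches("char\\*", "char*"));
    }
}
```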
Word Anchoring
title="^the complete dinosaur" [beginning of field]
author="bakker^" [end of field]
author all "^kernighan ritchie"
author any "^kernighan ^ritchie ^thompson"
The Common Query Language: Examples
A complete example
dc.author=(kern* or ritchie) and
(bath.title exact "the c programming language" or
dc.title=elements prox///4 dc.title=programming) and
subject any/relevant "style design analysis"
Find records whose author (in the Dublin Core sense) includes either
a word beginning kern or the word ritchie, and which have either the
exact title (in the sense of the Bath profile) the c programming
language or a title containing the words elements and programming
not more than four words apart, and whose subject is relevant to one
or more of the words style, design or analysis.
Regular Expressions in Java
Package java.util.regex
Classes for matching character sequences against patterns
specified by regular expressions.
An instance of the Pattern class represents a regular expression
that is specified in string form in a syntax similar to that used by
Perl.
Instances of the Matcher class are used to match character
sequences against a given pattern.
Input is provided to matchers via the CharSequence interface in
order to support matching against characters from a wide variety
of input sources.
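A minimal sketch of how the two classes fit together: a Pattern is compiled once, a Matcher is created for a CharSequence, and find() steps through the matches. The class name RegexDemo, the method findAll and the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDemo {
    // Collects every substring of input that matches the regular expression.
    static List<String> findAll(String regex, CharSequence input) {
        Pattern pattern = Pattern.compile(regex);   // regex given in string form
        Matcher matcher = pattern.matcher(input);   // any CharSequence is accepted
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // A StringBuilder is used here to show that the input need not be a String.
        StringBuilder text = new StringBuilder("kernighan and ritchie wrote about kernels");
        System.out.println(findAll("\\bkern\\w*", text)); // prints [kernighan, kernels]
    }
}
```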
String Searching:
Naive Algorithm
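The naive algorithm is usually described as trying every start position j in the text t and comparing the pattern p character by character. The sketch below follows that textbook description; the class name NaiveSearch is hypothetical, and positions are 0-based.

```java
class NaiveSearch {
    // Returns the 0-based index of the first occurrence of p in t, or -1 if there is none.
    static int indexOf(String t, String p) {
        for (int j = 0; j + p.length() <= t.length(); j++) {
            int i = 0;
            // Compare p against t starting at position j.
            while (i < p.length() && t.charAt(j + i) == p.charAt(i)) {
                i++;
            }
            if (i == p.length()) {
                return j;          // all of p matched
            }
            // On a mismatch, j always advances by exactly 1.
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("the uniform commercial code", "uniform"));
    }
}
```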
String Searching:
Knuth-Morris-Pratt Algorithm
Concept: The naive algorithm is modified, so that whenever a partial
match is found, it may be possible to advance the character index, j,
by more than 1.
Example:
p = "university"
t = "the uniform commercial code ..."
j=5 at the start of the partial match "uni"; after the partial match, continue at the "f" of "uniform"
To indicate how far to advance the character pointer, p is preprocessed
to create a table, which lists how far to advance against a given length
of partial match.
In the example, j is advanced by the length of the partial match, 3.
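The preprocessing step can be sketched with the standard failure-function formulation of Knuth-Morris-Pratt, which may be laid out differently from the table on the slide. The class KmpSearch and the helper shift are hypothetical names; shift reports how far the pattern start advances after a partial match of a given length.

```java
class KmpSearch {
    // fail[i] = length of the longest proper prefix of p[0..i] that is also a suffix of p[0..i].
    static int[] failureTable(String p) {
        int[] fail = new int[p.length()];
        int k = 0;
        for (int i = 1; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) k = fail[k - 1];
            if (p.charAt(i) == p.charAt(k)) k++;
            fail[i] = k;
        }
        return fail;
    }

    // How far the pattern start advances after a partial match of `matched` characters (matched >= 1).
    static int shift(String p, int matched) {
        return matched - failureTable(p)[matched - 1];
    }

    // Returns the 0-based index of the first occurrence of p in t, or -1.
    static int indexOf(String t, String p) {
        if (p.isEmpty()) return 0;
        int[] fail = failureTable(p);
        int k = 0; // number of pattern characters currently matched
        for (int i = 0; i < t.length(); i++) {
            while (k > 0 && t.charAt(i) != p.charAt(k)) k = fail[k - 1];
            if (t.charAt(i) == p.charAt(k)) k++;
            if (k == p.length()) return i - k + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Partial match "uni" (length 3) against "university": advance by 3, as in the example.
        System.out.println(shift("university", 3));
    }
}
```

The text index i never moves backwards in this sketch; only the count of matched pattern characters is reset from the table.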
Signature Files: Sequential Search
without Inverted File
Signature Files
Example
Word              Signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011
F = 12 bits in a signature
m = 4 bits per word
D = 2 words per block
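The block signature in the example is the bitwise OR of its word signatures, and a query word can only be in a block if all of its bits are set there. The sketch below reuses the two word signatures from the example; the class SignatureSketch, its method names, and the third signature used in main are hypothetical and serve only to show a false drop.

```java
class SignatureSketch {
    // Word signatures copied from the example (F = 12 bits, m = 4 bits set per word).
    static final int FREE = 0b001_000_110_010;
    static final int TEXT = 0b000_010_101_001;

    // Superimposed coding: OR the word signatures of the block together.
    static int blockSignature(int... wordSignatures) {
        int block = 0;
        for (int w : wordSignatures) block |= w;
        return block;
    }

    // The block may contain the word only if every bit of the word signature is set.
    // A true result can still be a false drop and must be checked against the text.
    static boolean mayContain(int block, int word) {
        return (block & word) == word;
    }

    // Formats a signature as 12 binary digits.
    static String bits(int signature) {
        return String.format("%12s", Integer.toBinaryString(signature)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int block = blockSignature(FREE, TEXT);
        System.out.println(bits(block));
        // Hypothetical signature of a word that is NOT in the block but whose bits are all covered.
        int other = 0b001_010_100_001;
        System.out.println(mayContain(block, other));
    }
}
```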
String Matching
Find File: Find all files whose name includes the string q.
Simple algorithm: Build an inverted index of all substrings of the
file names of the form *f, that is, every suffix of each file name.
Example: if the file name is foo.txt, search terms are:
foo.txt
oo.txt
o.txt
.txt
txt
xt
t
Lexicographic processing allows searching by any q.
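A sketch of the idea, assuming a sorted map as the index: every suffix of a file name is a key, and because q occurs in a name exactly when q is a prefix of one of its suffixes, all hits lie in one contiguous range of the sorted keys. The class SuffixIndex and the sample file names other than foo.txt are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SuffixIndex {
    // Sorted map from suffix to the file names that have that suffix.
    private final TreeMap<String, Set<String>> index = new TreeMap<>();

    // Indexes every suffix of the file name, e.g. foo.txt, oo.txt, o.txt, ..., t.
    void add(String fileName) {
        for (int i = 0; i < fileName.length(); i++) {
            index.computeIfAbsent(fileName.substring(i), k -> new TreeSet<>()).add(fileName);
        }
    }

    // Finds all indexed file names that contain q as a substring.
    Set<String> find(String q) {
        Set<String> hits = new TreeSet<>();
        // Keys starting with q form a contiguous run beginning at the first key >= q.
        for (Map.Entry<String, Set<String>> e : index.tailMap(q, true).entrySet()) {
            if (!e.getKey().startsWith(q)) break;
            hits.addAll(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        SuffixIndex idx = new SuffixIndex();
        idx.add("foo.txt");
        idx.add("bar.txt");
        System.out.println(idx.find("oo"));
    }
}
```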
Tries: Search for Substring
Basic concept
The text is divided into unique semi-infinite strings, or
sistrings. Each sistring has a starting position in the text, and
continues to the right until it is unique.
The sistrings are stored in (the leaves of) a tree, the suffix
tree. Common parts are stored only once.
Each sistring can be associated with a location within a
document where the sistring occurs. Subtrees below a certain
node represent all occurrences of the substring represented by
that node.
Suffix trees have a size of the same order of magnitude as the
input documents.
Tries: Suffix Tree
[Figure: suffix tree. Surviving edge labels: gin, tween, d, k, null, ning]
Tries: Sistrings
A binary example
String: 01 100 100 010 111
Sistrings: 1 01 100 100 010 111
2 11 001 000 101 11
3 10 010 001 011 1
4 00 100 010 111
5 01 000 101 11
6 10 001 011 1
7 00 010 111
8 00 101 11
Tries: Lexical Ordering
7 00 010 111
4 00 100 010 111
8 00 101 11
5 01 000 101 11
1 01 100 100 010 111
6 10 001 011 1
3 10 010 001 011 1
2 11 001 000 101 11
Unique string indicated in blue
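The ordering above can be reproduced with a few lines of code. The sketch below builds the first eight sistrings of the example string, sorts their start positions lexically, and computes for each the shortest prefix that no other of the eight shares. The class Sistrings and its method names are hypothetical; the spaces in the example string are treated as grouping only and left out.

```java
import java.util.Arrays;

class Sistrings {
    // Sistring k (1-based) is the part of the text starting at position k.
    static String sistring(String text, int k) {
        return text.substring(k - 1);
    }

    // Start positions 1..count, sorted by the lexical order of their sistrings.
    static Integer[] lexicalOrder(String text, int count) {
        Integer[] order = new Integer[count];
        for (int k = 1; k <= count; k++) order[k - 1] = k;
        Arrays.sort(order, (a, b) -> sistring(text, a).compareTo(sistring(text, b)));
        return order;
    }

    // Shortest prefix of sistring k that is not shared with any other of the first `count` sistrings.
    static String uniquePrefix(String text, int k, int count) {
        String s = sistring(text, k);
        int longestShared = 0;
        for (int other = 1; other <= count; other++) {
            if (other == k) continue;
            String o = sistring(text, other);
            int n = 0;
            while (n < s.length() && n < o.length() && s.charAt(n) == o.charAt(n)) n++;
            longestShared = Math.max(longestShared, n);
        }
        return s.substring(0, Math.min(s.length(), longestShared + 1));
    }

    public static void main(String[] args) {
        String text = "01100100010111";
        for (int k : lexicalOrder(text, 8)) {
            System.out.println(k + " " + uniquePrefix(text, k, 8));
        }
    }
}
```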
Trie: Basic Concept
[Figure: binary trie of the eight sistrings. Edges are labeled 0 or 1; the leaves hold the sistring numbers 1 to 8]
Patricia Tree
[Figure: Patricia tree for the same sistrings. Edges are labeled 0 or 1]
YFA
April 2008
Adapted from cs.cornell.edu