07 - String Processing

The Common Query Language is a formal language for queries to information retrieval systems. It aims to be human-readable and writable while maintaining expressiveness. The language supports boolean operators, proximity searches, field searches, pattern matching, and more. Regular expressions in Java allow matching character sequences against patterns specified in regular expressions.

Uploaded by

Adelino Thesaria
Copyright
© Attribution Non-Commercial (BY-NC)

STRING PROCESSING

CS4323 / 0708-2
YFA

Available online at http://www.ittelkom.ac.id/staf/yanuar

Institut Teknologi Telkom


Query Languages:
the Common Query Language

The Common Query Language: a formal language for queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information.
Objective: human readable and human writable; intuitive while maintaining the expressiveness of more complex languages.
Traditionally, query languages have fallen into two camps:
(a) Powerful and expressive languages which are not easily readable or writable by non-experts (e.g. SQL and XQuery).
(b) Simple and intuitive languages not powerful enough to express complex concepts (e.g. CCL or Google's query language).

The Common Query Language

The Common Query Language is maintained by the Z39.50 International Maintenance Agency at the Library of Congress.
http://www.loc.gov/z3950/agency/zing/cql/
The following examples are taken from the CQL Tutorial, A Gentle Introduction to CQL.

The Common Query Language: Examples

Simple queries
dinosaur
comp.sources.misc
"complete dinosaur"
"the complete dinosaur"
"ext->u.generic"
"and"
Booleans
dinosaur or bird
dinosaur and bird or dinobird
(bird or dinosaur) and (feathers or scales)
"feathered dinosaur" and (yixian or jehol)
(((a and b) or (c not d) not (e or f and g)) and h not i) or j

The Common Query Language: Examples

Indexes [fielded searching]


title = dinosaur
title = ((dinosaur and bird) or dinobird)
dc.title = saurischia
bath.title="the complete dinosaur"
srw.serverChoice=foo
srw.resultSet=bar
Index-set mapping [definition of fields]
>dc="http://www.loc.gov/srw/index-sets/dc"
dc.title=dinosaur and dc.author=farlow

The Common Query Language: Examples

Proximity
The prox operator:
prox/relation/distance/unit/ordering
Examples:
complete prox dinosaur [adjacent]
(caudal or dorsal) prox vertebra
ribs prox//5 chevrons [near 5]
ribs prox//0/sentence chevrons [same sentence]
ribs prox/>/0/paragraph chevrons [not in the same paragraph]

The Common Query Language: Examples

Relations
year > 1998
title all "complete dinosaur"
title any "dinosaur bird reptile"
title exact "the complete dinosaur"
publicationYear < 1980
numberOfWheels <= 3
numberOfPlates = 18
lengthOfFemur > 2.4
bioMass >= 100
numberOfToes <> 3

The Common Query Language: Examples

Relation Modifiers
title all/stem "complete dinosaur"
title any/relevant "dinosaur bird reptile"
title exact/fuzzy "the complete dinosaur"
author = /fuzzy tailor
The implementations of relevant and fuzzy are not defined by the query language.

The Common Query Language: Examples

Pattern Matching
dinosaur* [zero or more characters]
*sauria
man?raptor [exactly one character]
man?raptor*
"the comp*saur"
char\* [literal "*"]
Word Anchoring
title="^the complete dinosaur" [beginning of field]
author="bakker^" [end of field]
author all "^kernighan ritchie"
author any "^kernighan ^ritchie ^thompson"
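The masking characters under Pattern Matching above can be illustrated by translating a CQL term into a java.util.regex pattern. This is only a sketch: the class CqlWildcard and its methods are hypothetical helpers written for this example, not part of CQL or of any CQL toolkit, and word anchoring with ^ is not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only: maps the CQL masking
// characters onto an equivalent java.util.regex pattern.
final class CqlWildcard {

    static Pattern toRegex(String term) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '\\' && i + 1 < term.length()) {
                // A backslash escapes the next character: char\* is a literal "*".
                regex.append(Pattern.quote(String.valueOf(term.charAt(++i))));
            } else if (c == '*') {
                regex.append(".*");   // zero or more characters
            } else if (c == '?') {
                regex.append('.');    // exactly one character
            } else {
                // Every other character must match literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole word matches the CQL term.
    static boolean matches(String term, String word) {
        return toRegex(term).matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("dinosaur*", "dinosaurs"));   // true
        System.out.println(matches("man?raptor", "maniraptor")); // true
        System.out.println(matches("man?raptor", "manraptor"));  // false
        System.out.println(matches("char\\*", "char*"));         // true
    }
}
```

Quoting each literal character separately keeps the translation simple; a real search engine would more likely evaluate such terms against its index than run a regular expression over every word.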

The Common Query Language: Examples

A complete example
dc.author=(kern* or ritchie) and
(bath.title exact "the c programming language" or
dc.title=elements prox///4 dc.title=programming) and
subject any/relevant "style design analysis"

Find records whose author (in the Dublin Core sense) includes either a word beginning kern or the word ritchie, and which have either the exact title (in the sense of the Bath profile) the c programming language or a title containing the words elements and programming not more than four words apart, and whose subject is relevant to one or more of the words style, design or analysis.

Regular Expressions in Java

Package java.util.regex
Classes for matching character sequences against patterns specified by regular expressions.
An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.
Instances of the Matcher class are used to match character sequences against a given pattern.
Input is provided to matchers via the CharSequence interface in order to support matching against characters from a wide variety of input sources.
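A minimal sketch of how Pattern, Matcher and CharSequence fit together. The class RegexDemo, the method findAll and the sample regular expression are made up for this illustration; only the java.util.regex calls come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexDemo {

    // Returns every substring of input that matches regex, in order of appearance.
    static List<String> findAll(String regex, CharSequence input) {
        Pattern pattern = Pattern.compile(regex);   // compiled form of the expression
        Matcher matcher = pattern.matcher(input);   // matcher bound to one input sequence
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {                    // advance to the next match, if any
            hits.add(matcher.group());              // text of the current match
        }
        return hits;
    }

    public static void main(String[] args) {
        // Any CharSequence works as input: here a StringBuilder instead of a String.
        CharSequence input = new StringBuilder("the dinosaur and the dinobird");
        System.out.println(findAll("\\bdino\\w*", input)); // prints [dinosaur, dinobird]
    }
}
```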

String Searching:
Naive Algorithm

Objective: Given a pattern, find any substring of a given text that matches the pattern.
p   pattern to be matched
m   length of pattern p (characters)
t   the text to be searched
n   length of t (characters)
The naive algorithm examines the characters of t in sequence.
for j from 1 to n-m+1
    if character j of t matches the first character of p
        (compare following characters of t and p until a complete match or a difference is found)
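The pseudocode above might be written in Java roughly as follows. The class and method names are chosen for this sketch, and positions are 0-based here, whereas the slide counts j from 1.

```java
final class NaiveSearch {

    // Returns the 0-based index of the first occurrence of p in t, or -1 if there is none.
    static int indexOf(String t, String p) {
        int n = t.length();
        int m = p.length();
        // Try every starting position j at which p could still fit inside t.
        for (int j = 0; j <= n - m; j++) {
            int k = 0;
            // Compare following characters until a complete match or a difference is found.
            while (k < m && t.charAt(j + k) == p.charAt(k)) {
                k++;
            }
            if (k == m) {
                return j;   // all m characters matched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("the uniform commercial code", "uniform"));    // 4
        System.out.println(indexOf("the uniform commercial code", "university")); // -1
    }
}
```

In the worst case the inner loop runs close to m times for each of the n-m+1 starting positions, which is the cost the Knuth-Morris-Pratt algorithm reduces.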

String Searching:
Knuth-Morris-Pratt Algorithm

Concept: The naive algorithm is modified, so that whenever a partial match is found, it may be possible to advance the character index, j, by more than 1.
Example:
p = "university"
t = "the uniform commercial code ..."
A partial match ("uni") is found at j = 5; after the partial match, the search continues at j = 8.
To indicate how far to advance the character pointer, p is preprocessed to create a table, which lists how far to advance against a given length of partial match.
In the example, j is advanced by the length of the partial match, 3.
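A sketch of the idea in Java, under one assumption worth stating: instead of a table of advances, it uses the usual textbook failure table, where fail[k] is the length of the longest proper prefix of p[0..k] that is also a suffix of it. The advance after a partial match of length q is then q - fail[q-1]. Class and method names are chosen for this example.

```java
final class KmpSearch {

    // fail[k] = length of the longest proper prefix of p[0..k] that is also a suffix of p[0..k].
    static int[] failureTable(String p) {
        int[] fail = new int[p.length()];
        int len = 0;   // length of the prefix currently being extended
        for (int k = 1; k < p.length(); k++) {
            while (len > 0 && p.charAt(k) != p.charAt(len)) {
                len = fail[len - 1];   // fall back to a shorter prefix
            }
            if (p.charAt(k) == p.charAt(len)) {
                len++;
            }
            fail[k] = len;
        }
        return fail;
    }

    // Returns the 0-based index of the first occurrence of p in t, or -1 if there is none.
    static int indexOf(String t, String p) {
        if (p.isEmpty()) {
            return 0;
        }
        int[] fail = failureTable(p);
        int matched = 0;   // length of the current partial match
        for (int i = 0; i < t.length(); i++) {
            while (matched > 0 && t.charAt(i) != p.charAt(matched)) {
                matched = fail[matched - 1];   // reuse the part that still matches
            }
            if (t.charAt(i) == p.charAt(matched)) {
                matched++;
            }
            if (matched == p.length()) {
                return i - matched + 1;
            }
        }
        return -1;
    }
}
```

For p = "university" no proper prefix reappears as a suffix, so every table entry is 0 and a partial match of length 3 leads to an advance of 3 - 0 = 3, in line with the example above. The text index i never moves backwards, which is what makes the search linear in the length of t.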

Signature Files: Sequential Search
without Inverted File

Inexact filter: A quick test which discards many of the non-qualifying items.
Advantages
• Much faster than full text scanning -- 1 or 2 orders of magnitude
• Modest space overhead -- 10% to 15% of file
• Insertion is straightforward
Disadvantages
• Sequential searching is no good for very large files
• Some hits are false hits

Signature Files

Signature size. Number of bits in a signature, F.
Word signature. A bit pattern of size F with m bits set to 1 and the others 0.
The word signature is calculated by a hash function.
Block. A sequence of text that contains D distinct words.
Block signature. The logical or of all the word signatures in a block of text.

Signature Files

Example
Word Signature
free 001 000 110 010
text 000 010 101 001
block signature 001 010 111 011

F = 12 bits in a signature
m = 4 bits per word
D = 2 words per block
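The example can be reproduced with a bitwise OR. In this sketch the word signatures are copied from the table rather than computed, because the slides do not specify the hash function; the class and method names are chosen for illustration.

```java
final class BlockSignature {

    // Parses a signature written as groups of bits, e.g. "001 000 110 010".
    static int parse(String bits) {
        return Integer.parseInt(bits.replace(" ", ""), 2);
    }

    // Formats the lowest f bits of a signature in groups of three, as in the table.
    static String format(int signature, int f) {
        StringBuilder sb = new StringBuilder();
        for (int i = f - 1; i >= 0; i--) {
            sb.append((signature >> i) & 1);
            if (i % 3 == 0 && i > 0) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    // The block signature is the logical OR of all word signatures in the block.
    static int blockSignature(int... wordSignatures) {
        int block = 0;
        for (int w : wordSignatures) {
            block |= w;
        }
        return block;
    }

    public static void main(String[] args) {
        int free = parse("001 000 110 010");
        int text = parse("000 010 101 001");
        System.out.println(Integer.bitCount(free));                  // 4, i.e. m bits set
        System.out.println(format(blockSignature(free, text), 12));  // 001 010 111 011
    }
}
```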

Signature Files

A query term is processed by matching its signature against the block signature.
(a) If the term is in the block, its word signature will always match the block signature.
(b) A word signature may match the block signature, but the word is not in the block. This is a false hit.
The design challenge is to minimize the false drop probability, Fd.
Frakes, Section 4.2, page 47 discusses how to minimize Fd.
The rest of this chapter discusses enhancements to the basic algorithm.
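Cases (a) and (b) can be seen in a few lines of Java. A word signature matches when every bit it sets is also set in the block signature. The block and word signatures for free and text are taken from the earlier example; the third signature is a made-up value for a word that is not in the block, chosen to show a false hit.

```java
final class SignatureMatch {

    // True if every bit set in wordSignature is also set in blockSignature,
    // i.e. the word may be in the block. A true result can be a false hit.
    static boolean mayContain(int blockSignature, int wordSignature) {
        return (blockSignature & wordSignature) == wordSignature;
    }

    public static void main(String[] args) {
        int block = 0b001_010_111_011;   // "free" OR "text", from the example
        int free  = 0b001_000_110_010;   // word in the block
        int other = 0b001_010_001_001;   // hypothetical word NOT in the block
        int third = 0b100_000_110_010;   // hypothetical word NOT in the block

        System.out.println(mayContain(block, free));   // true: case (a)
        System.out.println(mayContain(block, other));  // true: case (b), a false hit
        System.out.println(mayContain(block, third));  // false: block is discarded
    }
}
```

Because a true result can be a false hit, blocks that pass the filter still have to be scanned to confirm the match; only a false result is conclusive.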

String Matching

Find File: Find all files whose name includes the string q.
Simple algorithm: Build an inverted index of all substrings of the file names of the form *f, that is, every suffix of each file name.
Example: if the file name is foo.txt, search terms are:
foo.txt
oo.txt
o.txt
.txt
txt
xt
t
Lexicographic processing allows searching by any q.
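One way to sketch this in Java is a sorted map from each suffix to the file names that contain it. A string q occurs in a name exactly when q is a prefix of one of that name's suffixes, and in a sorted map all keys sharing a prefix are adjacent. The class SuffixIndex and the sample file names other than foo.txt are invented for this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SuffixIndex {

    // Sorted map: suffix -> names of the files that have this suffix.
    private final TreeMap<String, Set<String>> index = new TreeMap<>();

    // Indexes every suffix of the file name (foo.txt, oo.txt, o.txt, ...).
    void add(String fileName) {
        for (int i = 0; i < fileName.length(); i++) {
            index.computeIfAbsent(fileName.substring(i), k -> new TreeSet<>()).add(fileName);
        }
    }

    // Returns all indexed file names that include q as a substring.
    Set<String> find(String q) {
        Set<String> result = new TreeSet<>();
        // Keys that start with q form one contiguous run beginning at the first key >= q.
        for (Map.Entry<String, Set<String>> e : index.tailMap(q, true).entrySet()) {
            if (!e.getKey().startsWith(q)) {
                break;   // left the run of suffixes beginning with q
            }
            result.addAll(e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        SuffixIndex idx = new SuffixIndex();
        idx.add("foo.txt");
        idx.add("bar.txt");
        idx.add("food.pdf");
        System.out.println(idx.find("oo"));    // [foo.txt, food.pdf]
        System.out.println(idx.find(".txt"));  // [bar.txt, foo.txt]
    }
}
```

Storing every suffix as a separate key costs space quadratic in the length of each name, which is acceptable for short file names but is one reason tries and suffix trees are used for longer texts.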

Search for Substring

In some information retrieval applications, any substring can be a search term.
Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.

Tries: Search for Substring

Basic concept
The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.
The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.
Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.
Suffix trees have a size of the same order of magnitude as the input documents.
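The statement that the subtree below a node represents all occurrences of a substring can be tried out with a deliberately simple, uncompressed suffix trie. This is a teaching sketch and not the compact structure described above: it stores one node per character, so its size grows quadratically with the text, and it records start positions at every node rather than only at the leaves. All names are chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SuffixTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // Start positions of every suffix whose path passes through this node.
        final List<Integer> starts = new ArrayList<>();
    }

    private final Node root = new Node();

    // Inserts every suffix of the text, character by character.
    SuffixTrie(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
                node.starts.add(start);
            }
        }
    }

    // Returns the 0-based positions at which a non-empty q occurs, in increasing order.
    List<Integer> occurrences(String q) {
        Node node = root;
        for (int i = 0; i < q.length(); i++) {
            node = node.children.get(q.charAt(i));
            if (node == null) {
                return List.of();   // q does not occur in the text
            }
        }
        return node.starts;
    }

    public static void main(String[] args) {
        SuffixTrie trie = new SuffixTrie("banana");
        System.out.println(trie.occurrences("ana")); // [1, 3]
        System.out.println(trie.occurrences("nab")); // []
    }
}
```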

Tries: Suffix Tree

Example: suffix tree for the following words:
begin
beginning
between
bread
break

b
    e
        gin
            null
            ning
        tween
    rea
        d
        k

Tries: Sistrings

A binary example
String: 01 100 100 010 111
Sistrings: 1 01 100 100 010 111
2 11 001 000 101 11
3 10 010 001 011 1
4 00 100 010 111
5 01 000 101 11
6 10 001 011 1
7 00 010 111
8 00 101 11

http://www.ittelkom.ac.id/staf/yanuar
Tries: Lexical Ordering

7 00 010 111
4 00 100 010 111
8 00 101 11
5 01 000 101 11
1 01 100 100 010 111
6 10 001 011 1
3 10 010 001 011 1
2 11 001 000 101 11
On the original slide, the unique part of each sistring is indicated in blue.
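The lexical ordering above can be reproduced by sorting the starting positions by the sistring that begins at each of them. The class and method names are chosen for this sketch, and the spaces in the bit string are omitted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class Sistrings {

    // Returns the 1-based starting positions of the first `count` sistrings
    // of text, ordered lexicographically by the sistring itself.
    static List<Integer> sortedStarts(String text, int count) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            starts.add(i);
        }
        // Sistring i is the text from position i (1-based) to the end.
        starts.sort(Comparator.comparing((Integer i) -> text.substring(i - 1)));
        return starts;
    }

    public static void main(String[] args) {
        // The binary string from the example: 01 100 100 010 111
        System.out.println(sortedStarts("01100100010111", 8)); // [7, 4, 8, 5, 1, 6, 3, 2]
    }
}
```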

Trie: Basic Concept

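[Figure: binary trie over the eight sistrings. Each level tests one bit; leaves are sistring numbers.]
root
    0
        0
            0: 7
            1
                0
                    0: 4
                    1: 8
        1
            0: 5
            1: 1
    1
        0
            0
                0: 6
                1: 3
        1: 2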

Patricia Tree

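[Figure: Patricia tree for the same eight sistrings. Internal nodes are labelled with the number of the bit they test; leaves are sistring numbers.]
bit 1
    0: bit 2
        0: bit 3
            0: 7
            1 (edge 10): bit 5
                0: 4
                1: 8
        1: bit 3
            0: 5
            1: 1
    1: bit 2
        0 (edge 00): bit 4
            0: 6
            1: 3
        1: 2

Single-descendant nodes are eliminated.
Nodes have bit number.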

YFA
April 2008
Adapted from cs.cornell.edu
