
Tries: Symbol Table Review

The document reviews symbol tables and introduces tries, data structures for string keys that store characters of the keys in internal nodes and use them to guide the search. Most of the document covers existence tries (R-way tries and ternary search tries), which check whether a string key is in the table, can be faster than hashing, and are more flexible than balanced search trees.


Tries

R-way tries
Ternary search tries

Reference: Chapter 12, Algorithms in Java, 3rd Edition, Robert Sedgewick.

Princeton University • COS 226 • Algorithms and Data Structures • Spring 2004 • Kevin Wayne • http://www.Princeton.EDU/~cos226

Symbol Table Review

Symbol table: key-value pair abstraction.
- Insert a value with specified key.
- Search for value given key.
- Delete value with given key.
- Balanced trees use log N key comparisons.
- Hashing uses O(1) probes, but each probe takes time proportional to key length.

Are key comparisons necessary? No.
Is time proportional to key length required? No.
Best possible. Examine lg N bits.

This lecture: specialized symbol table for string keys.
- Faster than hashing.
- More flexible than BST.

Tries

Tries.
- Store characters in internal nodes, not keys.
- Store records in external nodes.
- Use the characters of the key to guide the search.
- NB: from retrieval, but pronounced "try."
- You can get at anything if it's organized properly in 40 or 100 bits!

Example: sells sea shells by the sea shore

[Figure: trie whose external nodes hold by, sea, sells, shells, shore, the]

Applications

Applications.
- Spell checkers.
- Data compression. (stay tuned)
- Princeton U-CALL.
- Computational biology.
- Routing tables for IP addresses.
- Storing and querying XML documents.
- Associative arrays, associative indexing.

Modern application: inverted index of Web.
- Insert each word of every web page into trie, storing URL list in leaves.
- Find query keywords in trie, and take intersection of URL lists.
- Use Pagerank algorithm to rank resulting web pages.
Existence Symbol Table: Operations

Existence symbol table: set of keys (say, strings over ASCII alphabet).

Operations.
- st.add(key) inserts a key.
- st.contains(key) checks if the key is in the symbol table.

    ExistenceTable st = new ExistenceTable();
    while (!StdIn.isEmpty()) {
        String key = StdIn.readString();
        if (!st.contains(key)) {
            st.add(key);
            System.out.println(key);
        }
    }

Removes duplicates from input stream

Keys

Key = sequence of "digits."
- DNA: sequence of a, c, g, t.
- Protein: sequence of 20 amino acids A, C, ..., Y.
- IPv6 address: sequence of 128 bits.
- English words: sequence of lowercase letters.
- International words: sequence of UNICODE characters.
- Credit card number: sequence of 16 decimal digits.
- Library call numbers: sequence of letters, numbers, periods.

This lecture: key = string.
- We assume keys are over the ASCII alphabet.
- We also assume that character '\0' never appears.

Existence Symbol Table: Implementations Cost Summary

                         Typical Case                  Dedup
    Implementation   Search hit   Insert    Space   Moby    Actors
    Input *          L            L         L       0.26    15.1
    Red-Black        L + log N    log N     C       1.40    97.4
    Hashing          L            L         C       0.76    40.6

    * only reads in data

N = number of strings
L = size of string
C = number of characters in input
R = radix

Actor: 82MB, 11.4M words, 900K distinct.
Moby: 1.2MB, 210K words, 32K distinct.

Challenge: As fast as hashing, as flexible as BST.

R-Way Existence Trie: Example

Assumption: no string is a prefix of another string.

Ex: sells sea shells by the sea shore

[Figure: R-way trie for the example keys, R = 26]
R-Way Existence Trie: Java Implementation

R-way existence trie: a node.

Node: reference to R nodes.

    private static class Node {
        Node[] next = new Node[R];
    }

[Figure: a node labeled root with R = 8 links, including links for a, f, and h]

R-Way Existence Trie: Implementation

Code is short and sweet.

    public class RwayExistenceTable {
        private static final int R = 128;         // ASCII
        private static final char END = '\0';     // sentinel
        private Node root;

        private static class Node {
            Node[] next = new Node[R];
        }

        public boolean contains(String s) {
            // appending END ensures no string is a prefix of another
            return contains(root, s + END, 0);
        }

        private boolean contains(Node x, String s, int i) {
            char d = s.charAt(i);
            if (x == null) return false;
            if (d == END) return (x.next[END] != null);
            return contains(x.next[d], s, i+1);
        }

        public void add(String s) {
            // appending END ensures no string is a prefix of another
            root = add(root, s + END, 0);
        }

        private Node add(Node x, String s, int i) {
            char d = s.charAt(i);
            if (x == null) x = new Node();
            if (d == END && x.next[END] == null)
                x.next[END] = new Node();
            if (d == END) return x;
            x.next[d] = add(x.next[d], s, i+1);
            return x;
        }
    }

Existence Symbol Table: Implementations Cost Summary

                         Typical Case                  Dedup
    Implementation   Search hit   Insert    Space   Moby        Actors
    Input            L            L         L       0.26        15.1
    Red-Black        L + log N    log N     C       1.40        97.4
    Hashing          L            L         C       0.76        40.6
    R-Way Trie       L            L         RN+C    1.12        Memory
                                                    (R = 128)   (R = 256)

R-way trie: Faster than hashing for small R, but slow and wastes memory if R is large.

Goal: Use less space.
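The following is a minimal, self-contained sketch of the R-way existence trie from this section, run on the lecture's example sentence. The class name RwayDemo, the dedup helper, and the nodeCount counter are hypothetical additions for illustration, and StdIn is replaced by a fixed word array. It assumes ASCII keys that never contain '\0'; other characters would index past the end of the link array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the lecture code.
class RwayDemo {
    private static final int R = 128;       // ASCII alphabet (lecture assumption)
    private static final char END = '\0';   // sentinel, assumed absent from keys

    private static class Node {
        Node[] next = new Node[R];
    }

    private Node root;
    private int nodes;                      // illustrative counter of allocated nodes

    int nodeCount() { return nodes; }

    private Node newNode() { nodes++; return new Node(); }

    boolean contains(String s) { return contains(root, s + END, 0); }

    private boolean contains(Node x, String s, int i) {
        if (x == null) return false;
        char d = s.charAt(i);
        if (d == END) return x.next[END] != null;
        return contains(x.next[d], s, i + 1);
    }

    void add(String s) { root = add(root, s + END, 0); }

    private Node add(Node x, String s, int i) {
        if (x == null) x = newNode();
        char d = s.charAt(i);
        if (d == END) {
            if (x.next[END] == null) x.next[END] = newNode();
            return x;
        }
        x.next[d] = add(x.next[d], s, i + 1);
        return x;
    }

    // Returns the distinct words in order of first appearance.
    static List<String> dedup(String[] words) {
        RwayDemo st = new RwayDemo();
        List<String> out = new ArrayList<>();
        for (String w : words) {
            if (!st.contains(w)) {
                st.add(w);
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = "sells sea shells by the sea shore".split(" ");
        System.out.println(dedup(words));
    }
}
```

For the six distinct example words this sketch allocates 26 nodes, so with R = 128 it holds 26 × 128 = 3328 links of which only 25 are non-null, which illustrates the wasted space that motivates the goal of using less of it.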
Existence TST

Ternary search trie. (Bentley-Sedgewick)
- Each node has 3 children:
- Left (smaller), middle (equal), right (larger).

Ex: sells sea shells by the sea shore

Observation: Few wasted links!

Existence TST: Implementation

Existence TST: a node.

Node: four fields:
- Character d.
- Reference to left TST. (smaller)
- Reference to middle TST. (equal)
- Reference to right TST. (larger)

    private class Node {
        char d;
        Node l, m, r;
    }

[Figure: example TST rooted at h, with child characters a and i and '\0' end markers; the keys shown include ha and hi]

Existence TST: Java Implementation

    private boolean contains(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) return false;
        if (d == END && x.d == END) return true;
        if      (d <  x.d) return contains(x.l, s, i);
        else if (d == x.d) return contains(x.m, s, i+1);
        else               return contains(x.r, s, i);
    }

    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) {
            x = new Node();
            x.d = d;
        }
        if (d == END && x.d == END) return x;
        if      (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i+1);
        else               x.r = add(x.r, s, i);
        return x;
    }

Note: no arithmetic.

Existence Symbol Table: Implementations Cost Summary

                         Typical Case                     Dedup
    Implementation   Search hit   Insert      Space   Moby    Actors
    Input            L            L           L       0.26    15.1
    Red-Black        L + log N    log N       C       1.40    97.4
    Hashing          L            L           C       0.76    40.6
    R-Way Trie       L            L           RN+C    1.12    Memory
    TST              L + log N    L + log N   C       0.72    38.7
Existence TST With R^2 Branching At Root

Hybrid of R-way and TST.
- Do R-way or R^2-way branching at root.
- Each of R^2 root nodes points to a TST.

[Figure: array of R^2 roots indexed aa, ab, ac, ..., zy, zz, each pointing to a TST]

Q. What about one letter words?

Existence Symbol Table: Implementations Cost Summary

                         Typical Case                     Dedup
    Implementation   Search hit   Insert      Space   Moby    Actors
    Input            L            L           L       0.26    15.1
    Red-Black        L + log N    log N       C       1.40    97.4
    Hashing          L            L           C       0.76    40.6
    R-Way Trie       L            L           RN+C    1.12    Memory
    TST              L + log N    L + log N   C       0.72    38.7
    TST with R^2     L + log N    L + log N   C       0.51    32.7
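The hybrid can be sketched as an array of R^2 TST roots indexed by the first two characters of the key. This is only one possible arrangement, not the lecture's code: the class name TstR2 is hypothetical, and handling one-letter keys (and the empty key) with separate flags is an assumed answer to the question above. Keys are assumed to be ASCII and to contain no '\0'.

```java
// Hypothetical sketch of a TST with R^2-way branching at the root.
class TstR2 {
    private static final int R = 128;       // ASCII alphabet (lecture assumption)
    private static final char END = '\0';   // sentinel, assumed absent from keys

    private static class Node {
        char d;
        Node l, m, r;
    }

    // One TST per two-character prefix; it stores the rest of the key plus END.
    private final Node[] roots = new Node[R * R];
    // Assumed handling of keys too short to have a two-character prefix.
    private final boolean[] oneLetter = new boolean[R];
    private boolean emptyKey;

    void add(String s) {
        if (s.isEmpty()) { emptyKey = true; return; }
        if (s.length() == 1) { oneLetter[s.charAt(0)] = true; return; }
        int k = s.charAt(0) * R + s.charAt(1);
        roots[k] = add(roots[k], s + END, 2);
    }

    boolean contains(String s) {
        if (s.isEmpty()) return emptyKey;
        if (s.length() == 1) return oneLetter[s.charAt(0)];
        int k = s.charAt(0) * R + s.charAt(1);
        return contains(roots[k], s + END, 2);
    }

    // Same logic as the plain existence TST, started at character index 2.
    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) {
            x = new Node();
            x.d = d;
        }
        if (d == END && x.d == END) return x;
        if      (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i + 1);
        else               x.r = add(x.r, s, i);
        return x;
    }

    private boolean contains(Node x, String s, int i) {
        if (x == null) return false;
        char d = s.charAt(i);
        if (d == END && x.d == END) return true;
        if      (d <  x.d) return contains(x.l, s, i);
        else if (d == x.d) return contains(x.m, s, i + 1);
        else               return contains(x.r, s, i);
    }
}
```

The root array costs R^2 references up front (16,384 for R = 128) regardless of how many keys are stored, in exchange for skipping the first two levels of character comparisons.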

Existence TST Summary

Advantages.
- Very fast search hits.
- Search misses even faster. (examine only a few digits of the key!)
- Linear space.
- Adapts gracefully to irregularities in keys.
- Supports even more general symbol table ops.

Bottom line: more flexible than BST and can be faster than hashing (especially if lots of search misses).

Existence TST: Other Operations

Conventional BST ops.
- Delete. Delete key from the symbol table.
- Sort. Examine the keys in ascending order.
- Find ith. Find the ith largest key.
- Range search. Find all elements between k1 and k2.

Additional ops.

Partial match search.
- Use . to match any character.
- co....er .c...c.

Near neighbor search.
- Find all strings in ST that differ in ≤ P characters from query.
- Application: spell checking for OCR.

Longest prefix match.
- Find string in ST with longest prefix match to query.
- Application: search IP database for longest prefix matching destination IP, and route packets accordingly.
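Longest prefix match can be sketched on the sentinel-terminated existence TST used in this lecture. The sketch below assumes one reading of the operation: return the longest stored key that is a prefix of the query, as in the IP routing application. The class name TstPrefix and the methods longestPrefixOf and hasEnd are hypothetical; keys and queries are assumed to contain no '\0'.

```java
// Hypothetical sketch: longest stored key that is a prefix of the query.
class TstPrefix {
    private static final char END = '\0';   // sentinel, assumed absent from keys

    private static class Node {
        char d;
        Node l, m, r;
    }

    private Node root;

    void add(String s) { root = add(root, s + END, 0); }

    private Node add(Node x, String s, int i) {
        char d = s.charAt(i);
        if (x == null) {
            x = new Node();
            x.d = d;
        }
        if (d == END && x.d == END) return x;
        if      (d <  x.d) x.l = add(x.l, s, i);
        else if (d == x.d) x.m = add(x.m, s, i + 1);
        else               x.r = add(x.r, s, i);
        return x;
    }

    // Returns the longest key that is a prefix of query, or null if none is.
    String longestPrefixOf(String query) {
        int best = -1;
        Node x = root;                       // root of the search tree for position i
        int i = 0;
        while (x != null) {
            if (hasEnd(x)) best = i;         // some key ends after i characters
            if (i == query.length()) break;
            char d = query.charAt(i);
            Node y = x;                      // locate d among the siblings at this level
            while (y != null && y.d != d) y = (d < y.d) ? y.l : y.r;
            if (y == null) break;
            x = y.m;
            i++;
        }
        return best < 0 ? null : query.substring(0, best);
    }

    // END is the smallest character, so it can only lie along left links.
    private boolean hasEnd(Node x) {
        while (x != null && x.d != END) x = x.l;
        return x != null;
    }
}
```

For example, with keys she, shells, and by stored, the query shellsort yields shells and the query shelter yields she.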
TST: Partial Matches

Partial match in a TST.
- Search as usual if query character is not a period.
- Go down all three branches if query character is a period.

    // prefix accumulates the matched characters, for printing out matches
    private void match(Node x, String s, int i, String prefix) {
        char d = s.charAt(i);
        if (x == null) return;
        if (d == END && x.d == END) {
            System.out.println(prefix);
            return;
        }
        // when d == END but x.d != END, only the left branch below is taken,
        // since END is the smallest character
        if (d == '.' || d <  x.d) match(x.l, s, i, prefix);
        if (d == '.' || d == x.d) match(x.m, s, i+1, prefix + x.d);   // or use explicit char array for efficiency
        if (d == '.' || d >  x.d) match(x.r, s, i, prefix);
    }

    public void match(String s) {
        match(root, s + END, 0, "");
    }

TST Symbol Table

TST implementation of symbol table ADT.
- Store key-value pairs in leaves of trie.
- Search hit ends at leaf with key-value pair; search miss ends at null or leaf with different key.
- Internal node stores char; external node stores key-value pair.
  – use separate internal and external nodes?
  – collapse (and split) 1-way branches at bottom?

[Figure: TST with internal character nodes s, h, e, l and leaves by, sea, sells, shells, the]

TST Symbol Table

[Figure: the same TST after inserting shore, with leaves by, sea, sells, shells, shore, the]

Existence Symbol Table: Implementations Cost Summary

                                       Typical Case
    Implementation          Search hit   Insert      Space
    Input                   L            L           L
    Red-Black               L + log N    log N       C
    Hashing                 L            L           C
    R-Way Trie              L            L           RN + C
    TST                     L + log N    L + log N   C
    TST with R^2            L + log N    L + log N   C
    R-way collapse 1-way    log_R N      log_R N     RN + C
    TST collapse 1-way      log N        log N       C

Search, insert time is independent of key length!
- Consequence: can use with very long keys.
PATRICIA Tries

Patricia tries. Practical Algorithm to Retrieve Information Coded in Alphanumeric.
- Collapse one-way branches in binary trie.
- Thread trie to eliminate multiple node types.

Applications.
- Database search.
- P2P network search.
- IP routing tables: find longest prefix match.
- Compressed quad-tree for N-body simulation.
- Efficiently storing and querying XML documents.

Suffix Tree

Suffix tree: PATRICIA trie of suffixes of a string.

Applications.
- Longest common substring.
- Longest repeated substring.
- Longest palindromic substring.
- Longest common prefix of two substrings.
- Computational biology databases (BLAST, FASTA).
- Search for music by melody.
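The longest repeated substring application can be illustrated without building a suffix tree. The sketch below swaps in a simpler technique, sorting all suffixes and comparing neighbors, which relies on the same idea that a repeated substring is a common prefix of two suffixes. It is not a suffix tree and uses quadratic space for the suffix copies; the class name Lrs and its method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: longest repeated substring via sorted suffixes
// (a stand-in for the suffix tree, not an implementation of one).
class Lrs {
    static String longestRepeatedSubstring(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) suffixes[i] = s.substring(i);
        Arrays.sort(suffixes);               // suffixes sharing a long prefix become neighbors

        String best = "";
        for (int i = 1; i < n; i++) {
            int k = lcp(suffixes[i - 1], suffixes[i]);
            if (k > best.length()) best = suffixes[i].substring(0, k);
        }
        return best;
    }

    // Length of the longest common prefix of a and b.
    static int lcp(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int k = 0;
        while (k < n && a.charAt(k) == b.charAt(k)) k++;
        return k;
    }

    public static void main(String[] args) {
        System.out.println(longestRepeatedSubstring("banana"));
    }
}
```

For the input banana the sorted suffixes are a, ana, anana, banana, na, nana, and the neighbors ana and anana share the prefix ana, which is the result.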

Associative Arrays

Associative array.
- In Java, C, C++, arrays indexed by integers.
- In Perl, csh, PHP, Python: president["Princeton"] = "Tilghman"

    # collect data
    foreach student ($argv)
       foreach input (input100.txt input1000.txt input10000.txt)
          foreach program (worstfit bestfit)
             t[$student][$input][$program] = `time java $program < $input`
          end
       end
    end
    ...
    # compute statistics
    ...

Idealized excerpt from COS 226 timing script

Associative Indexing

Associative index.
- Given list of N strings, associate index 0 to N-1 with each string.
- Recall union-find where we assumed objects were labeled 0 to N-1.

Why useful?
- Using algorithm with strings is more useful.
- Running algorithm with indices (instead of ST lookup) is faster.

With integer labels:

    while (true) {
        int p = StdIn.readInt();
        int q = StdIn.readInt();
        ...
        uf.unite(p, q);
        ...
    }

With string labels:

    while (true) {
        String s = StdIn.readString();
        String t = StdIn.readString();
        int p = st.index(s);
        int q = st.index(t);
        ...
        uf.unite(p, q);
        ...
    }
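As a rough sketch of associative indexing feeding union-find: each new string is assigned the next index 0, 1, 2, ..., and the connectivity algorithm then works on those integers. A java.util.HashMap stands in for the symbol table here (a TST could be substituted), and the class name NameConnectivity, the method connected, and the quick-union representation are assumptions for illustration; only index and unite echo names used in the lecture.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: string labels mapped to indices, then plain quick-union.
class NameConnectivity {
    private final Map<String, Integer> st = new HashMap<>();  // stand-in symbol table
    private final List<Integer> parent = new ArrayList<>();   // quick-union forest

    // Returns the index of s, assigning the next free index on first sight.
    int index(String s) {
        Integer i = st.get(s);
        if (i == null) {
            i = parent.size();
            st.put(s, i);
            parent.add(i);                   // a new object starts as its own root
        }
        return i;
    }

    private int root(int p) {
        while (parent.get(p) != p) p = parent.get(p);
        return p;
    }

    // Union: add a connection between a and b.
    void unite(String a, String b) {
        int p = root(index(a));
        int q = root(index(b));
        parent.set(p, q);
    }

    // Find: is there a chain of connections between a and b?
    boolean connected(String a, String b) {
        return root(index(a)) == root(index(b));
    }

    public static void main(String[] args) {
        NameConnectivity nc = new NameConnectivity();
        nc.unite("www.cs.princeton.edu", "www.harvard.edu");
        System.out.println(nc.connected("www.harvard.edu", "www.cs.princeton.edu"));
    }
}
```

The string lookup happens once per input token; everything inside the union-find structure uses integer indices only.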
Associative Indexing: Application

Connectivity problem.
- N objects: 0 to N-1
- Find: is there a connection between A and B?
- Union: add a connection between A and B.

Fun version.
- N objects: "Kevin Bacon", "Kate Hudson", . . .
- Find: is there a chain of movies connecting Kevin to Kate?
- Union: Kevin and Kate appeared in "How To Lose a Guy in 10 Days" together, add connection

Real version.
- N objects: "www.cs.princeton.edu", "www.harvard.edu"
- Any graph processing application.

Symbol Table Summary

Hash tables: separate chaining, linear probing.

Binary search trees: randomized, splay, red-black.

Tries: R-way, TST.

Determine the needed ST ops for your application, and choose the best data structure.
