UNIT 5.3 (String Matching)


Data Structures

Prepared by
S Durga Devi
Assistant Professor
Department of Computer Science and Engineering
Chaitanya Bharathi Institute of Technology

S. Durga Devi ,CSE,CBIT


String Algorithms:
• Introduction
String Matching Algorithm
• Brute Force Method
• Rabin-Karp String Matching Algorithm


String algorithms

- When you type a URL in a web browser, a list of possible matching URLs is displayed. The browser does some internal processing to produce this list. This technique is called auto-completion.
- Similarly, when you type a partial directory name at the command-line prompt and press the Tab key, you get a list of all matching directory names. This also uses auto-completion.
- To perform such operations efficiently, special data structures are used to store the string data. String algorithms are implemented on top of data structures designed for this purpose.


Text processing
Text processing is one of the main applications of computers.
• Text processing is the process of analyzing and manipulating electronic text.
• It includes editing text, searching text, compressing text to send over the internet, displaying documents on the computer screen, web searching, etc.
• Text documents are used to communicate and publish information.

A string is a sequence (array) of characters.

A document is a collection of character strings.

String operations: several string operations are used to process text.
1. Substring: a substring is a contiguous part of a string, obtained by breaking the string into smaller strings.
2. Pattern matching problem: we are given a text string T of length n and a pattern string P of length m, and want to find whether P is a substring of T. The notion of a “match” is that there is a substring of T starting at some index i that matches P, character by character, so that T[i] = P[0], T[i + 1] = P[1], ..., T[i + m − 1] = P[m − 1].
Example
Suppose we are given the text string T = "abacaabaccabacabaabb"
and the pattern string
P = "abacab".
Then P is a substring of T. Namely
P = T[10..15].
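
The example can be checked with a short Python sketch. It relies on Python's built-in `str.find`, which returns the lowest index at which the pattern occurs, or -1 if it does not occur. The variable names are chosen here for illustration only.

```python
# Text and pattern from the example.
T = "abacaabaccabacabaabb"
P = "abacab"

# str.find returns the first index i with T[i:i+len(P)] == P, or -1.
i = T.find(P)
print(i)                      # 10

# The Python slice T[10:16] corresponds to T[10..15] (end index inclusive).
print(T[i:i + len(P)] == P)   # True
```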

Applications of pattern matching

1. Text editors use it for find and replace operations.
2. Search engines use pattern matching algorithms to match the query submitted by the user.
3. Biological research uses it, for example to search for a pattern in a DNA sequence.


Pattern matching algorithms
1. Brute-Force Pattern Matching Algorithm
2. Rabin-Karp String Matching Algorithm
3. Boyer-Moore Algorithm
4. Knuth-Morris-Pratt Algorithm
5. String Matching with Finite Automata
6. Suffix Trees


1. Brute-Force Pattern Matching Algorithm

Brute force is a general approach to solving a problem: try out all the possible solutions and pick the best one.
Example:
There are two boys and one girl (B1, B2, G1) to be seated on three chairs. In how many ways can they be arranged?

There are 3! = 6 possible arrangements.
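
As a small illustration of trying out every possibility, the sketch below enumerates all seatings with `itertools.permutations` from the Python standard library. The list name `people` is an assumption made for this example.

```python
from itertools import permutations

# Two boys and one girl, as in the example.
people = ["B1", "B2", "G1"]

# Brute force: generate every possible ordering of the three people.
arrangements = list(permutations(people))
for arrangement in arrangements:
    print(arrangement)

print(len(arrangements))  # 6, which equals 3!
```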

The brute-force pattern matching algorithm compares the pattern P with the text T at each starting index i, where i ranges from 0 to n − m, n is the length of the text T, and m is the length of the pattern P. The search continues until
1. a match is found, or
2. all candidate indices have been tried and no match has been found.
In the brute-force pattern matching algorithm the scan is made from left to right.


Algorithm BruteForceMatch(T, P):
Input: Strings T (text) with n characters and P (pattern) with m characters
Output: Starting index of the first substring of T matching P, or an indication
that P is not a substring of T
for i ← 0 to n − m do // for each candidate index in T
    j ← 0
    while (j < m and T[i + j] = P[j]) do
        j ← j + 1
    if j = m then
        return i // matched at index i
return “There is no substring of T matching P.”

Drawback: in the worst case, up to m characters of P must be compared at every candidate index of T before P is located.
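
The algorithm can be sketched in Python as follows. This is a minimal version under the assumption that "no match" is reported as -1 instead of a message; the function name `brute_force_match` is chosen here for illustration.

```python
def brute_force_match(text, pattern):
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):           # each candidate start index in text
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                       # all m characters matched
            return i
    return -1                            # assumption: -1 signals "no match"


print(brute_force_match("abacaabaccabacabaabb", "abacab"))  # 10
print(brute_force_match("abc", "d"))                        # -1
```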



Example of brute-force pattern matching

A total of 27 comparisons took place to find the matching pattern.
Complexity of brute-force pattern matching
• The worst-case running time of brute-force pattern matching is not good because, for each candidate index in T, we can perform up to m character comparisons to discover that P does not match T at the current index.
• In the algorithm, the outer for-loop is executed at most n − m + 1 times, and the inner loop is executed at most m times. Thus, the running time of the brute-force method is O((n − m + 1)m), which is O(nm).
• When m = n/2, this algorithm has quadratic running time O(n^2).
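
The (n − m + 1)m bound can be observed on a worst-case input. The sketch below counts character comparisons; the function name `count_comparisons` and the sample input (a text of ten "a" characters and the pattern "aab") are assumptions made for this illustration.

```python
def count_comparisons(text, pattern):
    """Brute-force search that also counts character comparisons.

    Returns (index, comparisons); index is -1 when there is no match.
    """
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


# Worst case: every alignment matches until the last pattern character.
text = "a" * 10      # n = 10
pattern = "aab"      # m = 3
print(count_comparisons(text, pattern))  # (-1, 24): (n - m + 1) * m = 8 * 3
```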



2. Boyer-Moore Algorithm
Scans the characters of the search pattern from right to left.
If a mismatch is found, the pattern is shifted by some number of characters. The right-to-left scan is called the “looking-glass heuristic”.

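Algorithm
Boyer-Moore(T[0..n−1], P[0..m−1])
// last[c] is the index of the last occurrence of character c in P, or −1 if c is not in P
// set i and j to the last index of P
i = m−1
j = m−1
// loop to the end of the text string
while i < n
    if P[j] = T[i] then
        if j = 0 then
            return i // match starts at index i
        else
            // go to the previous character
            i = i−1
            j = j−1
    else
        // skip over the whole pattern or shift to the last occurrence
        i = i + m − min(j, 1 + last[T[i]])
        j = m−1
return −1 // no match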
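
The pseudocode translates to Python roughly as follows. This is a sketch of the simplified variant that uses only the last-occurrence table (full Boyer-Moore implementations usually add a second shift rule that is not covered here). The function name and the use of a dictionary for `last` are assumptions of this example.

```python
def boyer_moore_match(text, pattern):
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # last[c] = index of the last occurrence of c in pattern.
    last = {c: k for k, c in enumerate(pattern)}
    i = m - 1            # index into text
    j = m - 1            # index into pattern
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i             # whole pattern matched, starting at i
            i -= 1                   # keep scanning right to left
            j -= 1
        else:
            l = last.get(text[i], -1)    # -1 if the character is not in pattern
            i += m - min(j, 1 + l)       # character jump
            j = m - 1                    # restart at the end of the pattern
    return -1


print(boyer_moore_match("abacaabaccabacabaabb", "abacab"))  # 10
```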



Upon a mismatch, we shift the pattern until either
1) the mismatch becomes a match, or
2) the pattern P moves past the mismatched character.


[Figures: Case-1 and Case-2]


Case-1: the mismatch becomes a match. We get a mismatch at position 3, and the mismatched text character is ‘A’. We search for the last occurrence of A in pattern P and find it at position 1. We then shift the pattern by 2 positions so that the A in the pattern is aligned with the A in the text.

Case-2: the pattern moves past the mismatched character.

We look up the position of the last occurrence of the mismatched character in the pattern; if the character does not occur in the pattern, we shift the pattern past the mismatched character.
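
The two cases can be expressed as a small shift computation. In the sketch below, the pattern "CABD" is a hypothetical example chosen so that ‘A’ last occurs at position 1; the function names are likewise assumptions.

```python
def last_occurrence(pattern):
    """Map each character to the index of its last occurrence in pattern."""
    return {c: k for k, c in enumerate(pattern)}


def bad_character_shift(pattern, j, c):
    """Shift after a mismatch at pattern index j against text character c.

    Case 1: c occurs in the pattern to the left of j -> align that occurrence.
    Case 2: c does not occur in the pattern          -> move past c entirely.
    If the last occurrence lies to the right of j, shift by one position.
    """
    l = last_occurrence(pattern).get(c, -1)
    return max(1, j - l)


pattern = "CABD"                              # hypothetical pattern
print(bad_character_shift(pattern, 3, "A"))   # 2  (Case 1: 3 - 1)
print(bad_character_shift(pattern, 3, "X"))   # 4  (Case 2: 3 - (-1))
```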



Knuth-Morris-Pratt algorithm
The KMP algorithm is another pattern matching algorithm used to find a pattern in a given text.
This algorithm compares character by character from left to right. When a mismatch occurs, it uses the LPS table (longest proper prefix that is also a suffix) to skip character comparisons.

The LPS table is used to decide how many characters can be skipped when a mismatch occurs.
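
A minimal Python sketch of the LPS table and the search is given below. The function names `build_lps` and `kmp_match` are assumptions of this example, and the sample pattern is the one used in the earlier pattern matching example.

```python
def build_lps(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    lps = [0] * len(pattern)
    length = 0                       # length of the current matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]     # fall back to a shorter prefix
        else:
            lps[i] = 0
            i += 1
    return lps


def kmp_match(text, pattern):
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    lps = build_lps(pattern)
    i = j = 0                        # i indexes text, j indexes pattern
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                return i - m
        elif j > 0:
            j = lps[j - 1]           # skip comparisons using the LPS table
        else:
            i += 1
    return -1


print(build_lps("abacab"))                           # [0, 0, 1, 0, 1, 2]
print(kmp_match("abacaabaccabacabaabb", "abacab"))   # 10
```

Note that the text index i never moves backwards, which is why the search is linear in the length of the text.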

The time complexity of KMP is O(m + n), where n is the length of the text and m is the length of the pattern.
Tries
• A trie (pronounced “try”) is a tree-based data structure for storing strings in order to support fast pattern matching. It is also called a prefix tree.
• The main application for tries is in information retrieval.
• The name “trie” comes from the word “retrieval.”


Standard tries
The idea is that all strings sharing a common prefix descend from a common node. Tries are used in spell-checking programs.

Standard Trie
• The standard trie for a set of strings S is an ordered tree such that
1. Each node except the root is labelled with a character of the alphabet.
2. The children of a node are alphabetically ordered.
3. The path from the root to an external node produces a string from set S.
4. No string in S is a prefix of another string.
The worst-case time to search for a string of length n is O(n), for a fixed-size alphabet.
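
A standard trie can be sketched in Python as follows. Two simplifications are assumed here and are not part of the definition: children are kept in a dictionary instead of an alphabetically ordered list, and an `is_end` flag marks the end of a stored string, so a string may be a prefix of another. The class names and the sample words are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.is_end = False     # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root carries no character

    def insert(self, word):
        node = self.root
        for ch in word:
            # Reuse the existing child for ch, or create a new one.
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def contains(self, word):
        """Follow one edge per character: O(len(word)) steps."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end


trie = Trie()
for w in ["bear", "bell", "bid", "stop"]:   # sample strings (assumed)
    trie.insert(w)

print(trie.contains("bell"))   # True
print(trie.contains("be"))     # False: "be" is only a prefix
```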



Compressed Tries
• In a standard trie, a separate node is defined for each character of a string.
The height of the standard trie is the length of the longest string.
To improve search efficiency, the height of the standard trie should be reduced.
To do this, the standard trie is converted into a compressed trie by merging each chain of single-child nodes into one node.

Compressed Trie
1. Each internal node has at least two children.
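
The conversion from a standard trie to a compressed trie can be sketched with nested dictionaries. This is only an illustration under stated assumptions: each string is terminated with a hypothetical end marker "$" (so that no string is a prefix of another), and node labels are stored as dictionary keys.

```python
END = "$"   # hypothetical end-of-string marker, assumed absent from the words


def build_trie(words):
    """Standard trie as nested dicts: one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one labelled edge."""
    compressed = {}
    for label, child in node.items():
        # Absorb the child into the label while it has exactly one child.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


standard = build_trie(["bear", "bell", "bid", "stop"])
print(compress(standard))
# {'b': {'e': {'ar$': {}, 'll$': {}}, 'id$': {}}, 'stop$': {}}
```

In the result, every internal node has at least two children, and the height drops from 5 levels (for "stop$") to 3.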



How to represent compressed trie in memory



Uses of Tries
- Auto-complete
- Text search
- IP routing in computer networks (using radix trees, a kind of trie)
- Spell checking
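
As an illustration of the auto-complete use, the sketch below stores words in a nested-dictionary trie and lists every stored word that starts with a given prefix. The function names, the "$" end-of-word marker, and the sample words are assumptions of this example.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def autocomplete(root, prefix):
    """Return all stored words starting with prefix, in alphabetical order."""
    node = root
    for ch in prefix:               # walk down to the node for the prefix
        if ch not in node:
            return []
        node = node[ch]

    results = []

    def collect(current, suffix):
        for ch in sorted(current):  # visit children alphabetically
            if ch == "$":
                results.append(prefix + suffix)
            else:
                collect(current[ch], suffix + ch)

    collect(node, "")
    return results


trie = build_trie(["stock", "stop", "sell", "bell"])
print(autocomplete(trie, "st"))   # ['stock', 'stop']
print(autocomplete(trie, "x"))    # []
```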
