Boyer-Moore Algorithm for Pattern Searching in C++

Last Updated : 05 Aug, 2024

The Boyer-Moore algorithm is an efficient string searching algorithm used to find occurrences of a pattern within a text. It preprocesses the pattern and uses that information to skip sections of the text, which often makes it much faster in practice than simpler approaches such as the naive algorithm.

In this article, we will learn the Boyer-Moore algorithm and its implementation in C++.

Example

Input: 
txt[] = “THIS IS A TEST TEXT”
pat[] = “TEST”

Output: 
Pattern found at index 10

What is the Boyer-Moore Algorithm?

The Boyer-Moore algorithm is a pattern matching algorithm that uses two heuristics to improve its performance: the bad character heuristic and the good suffix heuristic. These heuristics allow the algorithm to skip over sections of the text that cannot contain the pattern, thus reducing the number of comparisons needed.

How does the Boyer-Moore Algorithm work in C++?

The algorithm works by aligning the pattern against the text and then attempting to match it from right to left. If a mismatch is found, the algorithm uses the bad character and good suffix heuristics to determine how far to shift the pattern to the right before attempting the next match.
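
As a minimal sketch of the right-to-left comparison at a single alignment, the function below scans backwards from the last pattern character. The name `mismatchIndex` and the parameter `s` (the current shift) are illustrative assumptions and do not appear in the program later in this article.

```cpp
#include <string>

// Hypothetical helper: compares pat against txt at shift s, scanning
// from the last pattern character towards the first.
// Returns the pattern index of the first mismatch found, or -1 if
// every character matched. Assumes s + pat.size() <= txt.size().
int mismatchIndex(const std::string &txt, const std::string &pat, int s)
{
    int j = static_cast<int>(pat.size()) - 1;
    while (j >= 0 && pat[j] == txt[s + j])
        j--;
    return j;
}
```

For txt = "THIS IS A TEST TEXT" and pat = "TEST", `mismatchIndex(txt, pat, 10)` returns -1 (a full match), while shift 0 fails at pattern index 3 because 'T' is compared with 'S'.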

Case 1: Mismatch Becomes Match

If a mismatch occurs at a position, look up the last occurrence in the pattern of the text character that caused the mismatch. Shift the pattern so that this occurrence aligns with the mismatched character in the text. This skips alignments that cannot produce a match.

Example:

Mismatch at position 3, where the text character is 'A'.
Last occurrence of 'A' in the pattern is at position 1.
Shift the pattern right by 3 - 1 = 2 positions to align 'A' in the pattern with 'A' in the text.
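
The arithmetic in this example can be sketched as a small function. This is only an illustration: the name `caseOneShift` is hypothetical, and since the example does not state the pattern itself, the usage note below assumes the pattern "BACD" (which has 'A' at position 1).

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: shift for Case 1, where the mismatched text
// character c occurs somewhere in pat and the mismatch happened at
// pattern index j. Returns 0 if c does not occur in pat at all
// (Case 1 does not apply).
int caseOneShift(const std::string &pat, int j, char c)
{
    // rfind returns the index of the last occurrence of c in pat.
    std::string::size_type last = pat.rfind(c);
    if (last == std::string::npos)
        return 0;

    // If the last occurrence lies to the right of j, j - last is
    // negative, so fall back to a shift of 1.
    return std::max(1, j - static_cast<int>(last));
}
```

With the assumed pattern, `caseOneShift("BACD", 3, 'A')` gives 3 - 1 = 2, matching the shift in the example.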

Case 2: Pattern Moves Past the Mismatch Character

If the mismatched character does not exist in the pattern, shift the pattern completely past the mismatched character. Any alignment that overlaps this character is bound to fail, so none of them needs to be tried.

Example:

Mismatch at position 7 with character 'C'.
'C' does not exist in the pattern before position 7.
Shift the pattern right past position 7, resulting in a perfect match of the pattern in the text.
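
A minimal sketch of this case follows, assuming the common convention of recording the last occurrence of an absent character as -1. The name `caseTwoShift` and the pattern "ABAB" in the usage note are hypothetical.

```cpp
#include <string>

// Hypothetical helper: shift for Case 2, where the mismatched text
// character c does not occur in pat and the mismatch happened at
// pattern index j. Returns 0 if c does occur in pat (Case 2 does
// not apply).
int caseTwoShift(const std::string &pat, int j, char c)
{
    if (pat.find(c) != std::string::npos)
        return 0;

    const int last = -1; // sentinel meaning "no occurrence in pat"
    return j - last;     // equals j + 1
}
```

If the assumed pattern "ABAB" sits at shift 4 and mismatches at pattern index 3 (text position 7) against 'C', the shift is 4, so the next alignment starts at text position 8, just past the mismatched character.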

Steps of Boyer-Moore Algorithm to Implement in C++

To implement the Boyer-Moore algorithm in C++, follow these steps:

1. Preprocess the pattern to create the bad character table. This table stores the last occurrence of each character in the pattern. If a character is not present in the pattern, its value is set to -1.
2. Preprocess the pattern to create the good suffix table. This table helps determine how far to shift the pattern when a mismatch occurs after a partial match.
3. Initialize the shift of the pattern to the beginning of the text.
4. Compare the pattern with the text from right to left. Start with the last character of the pattern and move towards the first character.
5. If a mismatch is found, use the bad character table and good suffix table to calculate the shift distance. Shift the pattern to the right by the maximum value suggested by either heuristic.
6. If a complete match is found, print the starting index of the match in the text and shift the pattern to the right to continue searching.
7. Repeat steps 4 to 6 until the pattern has been aligned with the end of the text.
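
The steps above can be sketched as follows, combining both heuristics. This is one possible arrangement under stated assumptions, not a definitive implementation: the function names are hypothetical, characters are treated as 8-bit values, the good suffix table uses the widely used border-based (strong good suffix) construction, and matches are returned in a vector rather than printed.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Step 1 (hypothetical helper): last index of every byte value in pat,
// or -1 if the byte does not occur.
std::array<int, 256> buildBadCharTable(const std::string &pat)
{
    std::array<int, 256> last;
    last.fill(-1);
    for (int i = 0; i < static_cast<int>(pat.size()); i++)
        last[static_cast<unsigned char>(pat[i])] = i;
    return last;
}

// Step 2 (hypothetical helper): good suffix shifts. shift[j + 1] is the
// shift after a mismatch at pattern index j; shift[0] is the shift
// after a full match.
std::vector<int> buildGoodSuffixTable(const std::string &pat)
{
    const int m = static_cast<int>(pat.size());
    std::vector<int> shift(m + 1, 0), border(m + 1, 0);

    // border[i] = start of the widest border of the suffix pat[i..m-1].
    int i = m, j = m + 1;
    border[i] = j;
    while (i > 0)
    {
        // The border cannot be extended to the left: record the shift
        // and fall back to the next shorter border.
        while (j <= m && pat[i - 1] != pat[j - 1])
        {
            if (shift[j] == 0)
                shift[j] = j - i;
            j = border[j];
        }
        i--;
        j--;
        border[i] = j;
    }

    // Remaining entries: fall back to the widest border of the whole
    // pattern that still fits inside the matched suffix.
    j = border[0];
    for (i = 0; i <= m; i++)
    {
        if (shift[i] == 0)
            shift[i] = j;
        if (i == j)
            j = border[j];
    }
    return shift;
}

// Steps 3 to 7: returns the starting index of every match of pat in txt.
std::vector<int> boyerMooreSearch(const std::string &txt, const std::string &pat)
{
    std::vector<int> matches;
    const int n = static_cast<int>(txt.size());
    const int m = static_cast<int>(pat.size());
    if (m == 0 || m > n)
        return matches;

    const std::array<int, 256> last = buildBadCharTable(pat);
    const std::vector<int> shift = buildGoodSuffixTable(pat);

    int s = 0; // step 3: start at the beginning of the text
    while (s <= n - m)
    {
        // Step 4: compare from right to left.
        int j = m - 1;
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        if (j < 0)
        {
            // Step 6: record the match and shift by the good suffix rule.
            matches.push_back(s);
            s += shift[0];
        }
        else
        {
            // Step 5: take the larger of the two suggested shifts.
            int badChar = j - last[static_cast<unsigned char>(txt[s + j])];
            s += std::max(shift[j + 1], badChar);
        }
    }
    return matches;
}
```

Calling `boyerMooreSearch("THIS IS A TEST TEXT", "TEST")` returns a vector containing only 10, which agrees with the example at the start of the article.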

C++ Program to Implement Boyer-Moore Algorithm

Below is a C++ program that implements the bad character heuristic of the Boyer-Moore algorithm for pattern searching. The good suffix heuristic is not included in this program.

C++

// C++ Program for Bad Character Heuristic of Boyer Moore String Matching Algorithm

#include <algorithm>
#include <iostream>
#include <string>

#define NO_OF_CHARS 256

using namespace std;

// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic(const string &str, int size, int badchar[NO_OF_CHARS])
{
    // Initialize all occurrences as -1
    for (int i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of last occurrence
    // of a character
    for (int i = 0; i < size; i++)
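        // Cast through unsigned char so that characters with negative
        // values (where char is signed) cannot index out of bounds
        badchar[(unsigned char)str[i]] = i;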
}

/* A pattern searching function that uses Bad
Character Heuristic of Boyer Moore Algorithm */
void search(const string &txt, const string &pat)
{
    int m = pat.size();
    int n = txt.size();

    int badchar[NO_OF_CHARS];

    /* Fill the bad character array by calling
    the preprocessing function badCharHeuristic()
    for the given pattern */
    badCharHeuristic(pat, m, badchar);

    int s = 0; // s is the shift of the pattern with
               // respect to text
    while (s <= (n - m))
    {
        int j = m - 1;

        /* Keep reducing index j of the pattern while
        characters of the pattern and text are
        matching at this shift s */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        /* If the pattern is present at the current
        shift, then index j will become -1 after
        the above loop */
        if (j < 0)
        {
            cout << "Pattern occurs at shift = " << s << endl;

            /* Shift the pattern so that the next
            character in the text aligns with the last
            occurrence of it in the pattern.
            The condition s+m < n is necessary for
            the case when the pattern occurs at the end
            of the text */
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        }
        else
        {
            /* Shift the pattern so that the bad character
            in the text aligns with the last occurrence of
            it in the pattern. The max function is used to
            make sure that we get a positive shift.
            We may get a negative shift if the last
            occurrence of the bad character in the pattern
            is on the right side of the current
            character. */
            s += max(1, j - badchar[(unsigned char)txt[s + j]]);
        }
    }
}

/* Driver code */
int main()
{
    string txt = "ABAAABCD";
    string pat = "ABC";
    search(txt, pat);
    return 0;
}

Output

Pattern occurs at shift = 4

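Time Complexity: O(m*n) in the worst case
Auxiliary Space: O(1), since the bad character table always has NO_OF_CHARS (256) entries regardless of the input size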

The Bad Character Heuristic may take O(m*n) time in the worst case. The worst case occurs when all characters of the text and pattern are the same, for example txt[] = “AAAAAAAAAAAAAAAAAA” and pat[] = “AAAAA”. In the best case it takes O(n/m) time. The best case occurs when none of the characters in the text appear in the pattern, so every alignment fails on its first comparison and shifts the pattern by its full length m.
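
To make the two extremes concrete, the sketch below repeats the bad character search but counts character comparisons instead of printing matches. The name `countComparisons` is hypothetical and the counter is added purely for illustration.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Hypothetical helper: runs the bad character search and returns the
// number of character comparisons it performed.
long countComparisons(const std::string &txt, const std::string &pat)
{
    const int n = static_cast<int>(txt.size());
    const int m = static_cast<int>(pat.size());
    if (m == 0 || m > n)
        return 0;

    // Last occurrence of every byte value in pat, or -1 if absent.
    std::array<int, 256> last;
    last.fill(-1);
    for (int i = 0; i < m; i++)
        last[static_cast<unsigned char>(pat[i])] = i;

    long comparisons = 0;
    int s = 0;
    while (s <= n - m)
    {
        int j = m - 1;
        while (j >= 0)
        {
            comparisons++;
            if (pat[j] != txt[s + j])
                break;
            j--;
        }

        if (j < 0)
            s += (s + m < n) ? m - last[static_cast<unsigned char>(txt[s + m])] : 1;
        else
            s += std::max(1, j - last[static_cast<unsigned char>(txt[s + j])]);
    }
    return comparisons;
}
```

For a text of 18 'A' characters and pat = "AAAAA", this counts 70 comparisons (14 alignments of 5 comparisons each). For a text of 18 'B' characters and the same pattern, it counts only 3, because each mismatch shifts the pattern by 5.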

Boyer-Moore vs Traditional Pattern Searching Algorithms

Compared to traditional algorithms like the Naive algorithm, which checks each position in the text one by one, Boyer-Moore often skips large sections of the text, resulting in faster performance. The Knuth-Morris-Pratt (KMP) algorithm improves on the Naive algorithm by preprocessing the pattern into a partial match table that lets it avoid re-comparing characters, but it still examines every character of the text. Boyer-Moore generally outperforms KMP in practice, particularly for longer patterns and larger alphabets, because its heuristics let it skip text characters entirely.
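
For practical use, C++17 provides `std::boyer_moore_searcher` in `<functional>`, which can be passed to `std::search`. The sketch below is a thin wrapper around it. The wrapper name `findAllWithStdSearcher` is hypothetical, and a C++17 compiler is assumed.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical wrapper around the C++17 standard library searcher.
// Returns the starting index of every match of pat in txt.
std::vector<int> findAllWithStdSearcher(const std::string &txt, const std::string &pat)
{
    std::vector<int> matches;
    if (pat.empty())
        return matches;

    // The searcher preprocesses pat once and is reused for every call.
    const std::boyer_moore_searcher searcher(pat.begin(), pat.end());

    auto it = std::search(txt.begin(), txt.end(), searcher);
    while (it != txt.end())
    {
        matches.push_back(static_cast<int>(it - txt.begin()));
        // Resume one position later so overlapping matches are found.
        it = std::search(it + 1, txt.end(), searcher);
    }
    return matches;
}
```

Building the searcher once and reusing it keeps the preprocessing cost out of the loop, which matters when the same pattern is searched for repeatedly.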

Applications of Boyer-Moore Algorithm

Used in text editors and search engines to find occurrences of a word or phrase.
Used in DNA sequence analysis to find patterns within genetic data.
Used in algorithms for data compression to find repeating patterns.
Used in intrusion detection systems to find patterns of malicious activity within network traffic.


dasrudra0710
