Boyer-Moore Algorithm for Pattern Searching in C

Last Updated : 15 Jul, 2024

In this article, we will learn the Boyer-Moore algorithm, an efficient technique for finding a pattern in a string, and implement it in the C programming language.

What is the Boyer-Moore Algorithm?

The Boyer-Moore algorithm is a pattern searching algorithm that efficiently finds occurrences of a pattern within a text. It slides the pattern over the text from left to right, but at each alignment it compares characters starting from the right end of the pattern and moving left. On a mismatch, this order often lets the algorithm skip several text positions at once, which makes it particularly efficient for large texts.

How Does the Boyer-Moore Algorithm Work?

The algorithm works using two main heuristics:

  1. Bad Character Heuristic: The text character that causes a mismatch is called the bad character. When a mismatch occurs at position j in the pattern, the algorithm shifts the pattern right until the last occurrence of the bad character in the pattern aligns with it in the text. If the bad character does not occur in the pattern at all, the pattern is shifted completely past it.
  2. Good Suffix Heuristic: When a mismatch occurs after some suffix of the pattern has already matched, this heuristic shifts the pattern so that another occurrence of that suffix in the pattern (or, failing that, a prefix of the pattern that matches the end of the suffix) aligns with the matched text. The shift table is computed from the pattern alone, and it guarantees that no occurrence of the pattern is skipped.
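
The good suffix heuristic is easier to follow with its preprocessing written out. The following is a minimal sketch of one common formulation (the "strong" good suffix rule), not a definitive implementation. The names build_good_suffix, good_suffix_search, shift and bpos are illustrative choices and do not appear elsewhere in this article.

```c
#include <stdlib.h>
#include <string.h>

/* Build the good-suffix shift table. shift[j + 1] is the distance to move
   the pattern after a mismatch at pat[j]; shift[0] is used after a full
   match. bpos[i] is the start of the widest border of the suffix pat[i..]. */
static void build_good_suffix(const char *pat, int m, int *shift, int *bpos)
{
    int i = m, j = m + 1;

    for (int k = 0; k <= m; k++)
        shift[k] = 0;

    /* Case 1: the matched suffix occurs elsewhere in the pattern,
       preceded by a different character. */
    bpos[i] = j;
    while (i > 0) {
        while (j <= m && pat[i - 1] != pat[j - 1]) {
            if (shift[j] == 0)
                shift[j] = j - i;
            j = bpos[j];
        }
        i--;
        j--;
        bpos[i] = j;
    }

    /* Case 2: only a prefix of the pattern matches the end of the
       matched suffix. */
    j = bpos[0];
    for (i = 0; i <= m; i++) {
        if (shift[i] == 0)
            shift[i] = j;
        if (i == j)
            j = bpos[j];
    }
}

/* Store up to max_out match positions in out and return how many were
   stored. Hypothetical helper: empty patterns are treated as "no match". */
int good_suffix_search(const char *txt, const char *pat, int *out, int max_out)
{
    int n = (int)strlen(txt), m = (int)strlen(pat), count = 0;
    if (m == 0 || m > n)
        return 0;

    int *shift = malloc((size_t)(m + 1) * sizeof *shift);
    int *bpos = malloc((size_t)(m + 1) * sizeof *bpos);
    if (shift == NULL || bpos == NULL) {
        free(shift);
        free(bpos);
        return 0;
    }
    build_good_suffix(pat, m, shift, bpos);

    int s = 0;
    while (s <= n - m && count < max_out) {
        int j = m - 1;
        while (j >= 0 && pat[j] == txt[s + j])
            j--;
        if (j < 0) {
            out[count++] = s;
            s += shift[0];
        } else {
            s += shift[j + 1];
        }
    }
    free(shift);
    free(bpos);
    return count;
}
```

For the pattern "AABA" this sketch builds the shift table {3, 3, 3, 2, 1}, and searching "AABAACAADAABAABA" stores the positions 0, 9 and 12. A full Boyer-Moore search would take the larger of this shift and the bad character shift at each step.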

Steps to Implement the Boyer-Moore Algorithm in C

  • Preprocessing Phase:
    • Initialize Bad Character Array: Create an array badchar of size NO_OF_CHARS (typically 256 for ASCII characters) to store the rightmost occurrence of each character in the pattern. Initialize all values to -1.
    • Fill Bad Character Array: Traverse the pattern from left to right. For each character pat[i], update badchar[(int)pat[i]] to i, which stores the last occurrence index of pat[i].
  • Searching Phase:
    • Initialize Shift: Start s (shift) at 0, which represents the position of the pattern relative to the text.
    • Search Loop: Continue searching while s <= (n - m), where n is the length of txt and m is the length of pat.
      • Initialize j to m - 1, representing the last character index of the pattern.
      • Pattern Matching: Compare characters of the pattern and text from right to left while j >= 0 and pat[j] == txt[s + j].
        • If all characters match (j becomes -1), print or record the occurrence of the pattern at s.
      • Update Shift:
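        • Bad Character Shift: Calculate badcharShift as max(1, j - badchar[txt[s + j]]). If txt[s + j] does not occur in the pattern, badchar[txt[s + j]] is -1 and the shift is j + 1.
        • Good Suffix Shift: The full Boyer-Moore algorithm also calculates goodSuffixShift from the suffix of the pattern that has already matched the text, and shifts by the larger of the two values. The program below uses only the bad character shift.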
  • Output: Print or store all occurrences of the pattern in the text.
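
The shift arithmetic in the steps above can be tried on a small case. This is a minimal sketch, assuming the same 256-entry table as the steps describe; the helper names build_badchar and badchar_shift are illustrative and are not part of the program that follows.

```c
#include <string.h>

#define NO_OF_CHARS 256

/* Record the last index of every character of pat; -1 means absent. */
void build_badchar(const char *pat, int badchar[NO_OF_CHARS])
{
    int m = (int)strlen(pat);
    for (int i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
    for (int i = 0; i < m; i++)
        badchar[(unsigned char)pat[i]] = i;
}

/* Shift after a mismatch at pattern index j against text character c.
   The result is clamped to 1 so the pattern always moves forward. */
int badchar_shift(const int badchar[NO_OF_CHARS], int j, char c)
{
    int shift = j - badchar[(unsigned char)c];
    return shift > 1 ? shift : 1;
}
```

For the pattern "AABA" the table holds 3 for 'A', 2 for 'B' and -1 for everything else. A mismatch at j = 2 against 'C' gives a shift of 2 - (-1) = 3, which is j + 1. A mismatch at j = 2 against 'A' gives 2 - 3 = -1, which is clamped to 1.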

C Program to Implement Boyer-Moore Algorithm

The program below demonstrates the bad character heuristic of the Boyer-Moore algorithm in C.

C
// C Program for Bad Character Heuristic of Boyer Moore
// String Matching Algorithm
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define NO_OF_CHARS 256

// A utility function to get maximum of two integers
int max(int a, int b) { return (a > b) ? a : b; }

// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic(char* str, int size,
                      int badchar[NO_OF_CHARS])
{
    int i;

    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of last occurrence
    // of a character. The cast to unsigned char keeps
    // the index in 0..255 even where char is signed.
    for (i = 0; i < size; i++)
        badchar[(unsigned char)str[i]] = i;
}

/* A pattern searching function that uses Bad
   Character Heuristic of Boyer Moore Algorithm */
void search(char* txt, char* pat)
{
    int m = strlen(pat);
    int n = strlen(txt);

    int badchar[NO_OF_CHARS];

    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for given pattern */
    badCharHeuristic(pat, m, badchar);

    int s = 0; // s is shift of the pattern with
               // respect to text
    while (s <= (n - m)) {
        int j = m - 1;

        /* Keep reducing index j of pattern while
           characters of pattern and text are
           matching at this shift s */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        /* If the pattern is present at current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0) {
            printf("\n pattern occurs at shift = %d", s);

            /* Shift the pattern so that the next
               character in text aligns with the last
               occurrence of it in pattern.
               The condition s+m < n is necessary for
               the case when pattern occurs at the end
               of text */
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        }

        else
            /* Shift the pattern so that the bad character
               in text aligns with the last occurrence of
               it in pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence  of bad character in pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[(unsigned char)txt[s + j]]);
    }
}

/* Driver program to test above function */
int main()
{
    char txt[] = "AABAACAADAABAABA";
    char pat[] = "AABA";
    search(txt, pat);
    return 0;
}

Output
 pattern occurs at shift = 0
 pattern occurs at shift = 9
 pattern occurs at shift = 12

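Time Complexity: O(m*n) in the worst case
Auxiliary Space: O(1), since the bad character table always has NO_OF_CHARS entries regardless of n and m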

The bad character heuristic may take O(m*n) time in the worst case, which occurs when all characters of the text and the pattern are the same, for example txt[] = “AAAAAAAAAAAAAAAAAA” and pat[] = “AAAAA”. Here every alignment is a full match of m comparisons followed by a shift of 1. In the best case it takes O(n/m) time. This occurs when no character of the text appears in the pattern, so every alignment fails on its first comparison and the pattern shifts by m.
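
These two bounds can be checked by counting comparisons. The sketch below repeats the bad character search with a counter added; count_comparisons is a hypothetical instrumentation helper, and returning 0 for an empty or overlong pattern is an assumption made for the example.

```c
#include <string.h>

#define NO_OF_CHARS 256

/* Run the bad-character search and return the number of character
   comparisons it performs. */
long count_comparisons(const char *txt, const char *pat)
{
    int n = (int)strlen(txt), m = (int)strlen(pat);
    int badchar[NO_OF_CHARS];
    long comparisons = 0;

    if (m == 0 || m > n)
        return 0;

    for (int i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
    for (int i = 0; i < m; i++)
        badchar[(unsigned char)pat[i]] = i;

    int s = 0;
    while (s <= n - m) {
        int j = m - 1;
        /* Compare right to left, counting every comparison. */
        while (j >= 0) {
            comparisons++;
            if (pat[j] != txt[s + j])
                break;
            j--;
        }
        if (j < 0) {
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        } else {
            int shift = j - badchar[(unsigned char)txt[s + j]];
            s += shift > 1 ? shift : 1;
        }
    }
    return comparisons;
}
```

With a text of ten 'A's and the pattern "AAAAA", the function returns 30, which is m * (n - m + 1). With a text of ten 'B's and the same pattern, it returns 2, which is n / m.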

Boyer-Moore vs Traditional Pattern Searching Algorithms

The Naive algorithm checks every position in the text one by one, whereas Boyer-Moore can often skip large sections of the text, which usually makes it faster. The Knuth-Morris-Pratt (KMP) algorithm improves on the Naive algorithm by preprocessing the pattern into a partial match table, which lets it avoid re-examining text characters and guarantees O(n + m) time. KMP still inspects every character of the text, however, so Boyer-Moore is often faster in practice, especially with longer patterns and larger alphabets, because its heuristics let it skip characters entirely.
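
To make the comparison with the Naive algorithm concrete, the sketch below counts the character comparisons a naive search makes. The function name naive_comparisons is illustrative, and returning 0 for an empty or overlong pattern is an assumption made for the example.

```c
#include <string.h>

/* Count character comparisons made by the naive search, which tries
   every alignment in turn and compares left to right. */
long naive_comparisons(const char *txt, const char *pat)
{
    int n = (int)strlen(txt), m = (int)strlen(pat);
    long comparisons = 0;

    if (m == 0 || m > n)
        return 0;

    for (int s = 0; s <= n - m; s++) {
        for (int j = 0; j < m; j++) {
            comparisons++;
            if (txt[s + j] != pat[j])
                break;
        }
    }
    return comparisons;
}
```

On a text of ten 'B's with the pattern "AAAAA", the naive search makes 6 comparisons, one at each of its n - m + 1 alignments. The bad character heuristic needs only n / m = 2 alignments on the same input, because each mismatch shifts the pattern by its full length.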

