Boyer-Moore-Algorithmus zur Mustersuche

Die Mustersuche ist ein wichtiges Problem in der Informatik. Wenn wir in einer Notepad-/Word-Datei, einem Browser oder einer Datenbank nach einer Zeichenfolge suchen, werden Mustersuchalgorithmen verwendet, um die Suchergebnisse anzuzeigen. Eine typische Problemstellung wäre:
Gegeben sei ein Text txt[0..n-1] und ein Muster pat[0..m-1], wobei n die Länge des Textes und m die Länge des Musters ist, schreibe a Funktion search(char pat[], char txt[]), die alle Vorkommen von pat[] in txt[] ausgibt. Sie können annehmen, dass n > m.
Beispiele:

Input:  txt[] = "THIS IS A TEST TEXT"
        pat[] = "TEST"
Output: Pattern found at index 10

Input:  txt[] =  "AABAACAADAABAABA"
        pat[] =  "AABA"
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12

In diesem Beitrag werden wir den Mustersuchalgorithmus von Boyer Moore diskutieren. Wie KMP- und Finite-Automata -Algorithmen verarbeitet auch der Boyer-Moore-Algorithmus das Muster vor.
Boyer Moore ist eine Kombination der folgenden beiden Ansätze.
1) Bad-Character-Heuristik
2) Good-Suffix-Heuristik
Beide der obigen Heuristiken können auch unabhängig verwendet werden, um ein Muster in einem Text zu suchen. Lassen Sie uns zunächst verstehen, wie zwei unabhängige Ansätze im Boyer-Moore-Algorithmus zusammenarbeiten. Wenn wir uns den Naive-Algorithmus ansehen , schiebt er das Muster eins nach dem anderen über den Text. KMP-Algorithmusführt eine Vorverarbeitung des Musters durch, sodass das Muster um mehr als eins verschoben werden kann. Der Boyer-Moore-Algorithmus führt aus dem gleichen Grund eine Vorverarbeitung durch. Es verarbeitet das Muster und erstellt verschiedene Arrays für jede der beiden Heuristiken. Bei jedem Schritt verschiebt es das Muster um das Maximum der von jeder der beiden Heuristiken vorgeschlagenen Verschiebungen. Daher wird bei jedem Schritt der größte Offset verwendet, der von den beiden Heuristiken vorgeschlagen wird .
Im Gegensatz zu den vorherigen Mustersuchalgorithmen beginnt der Boyer-Moore-Algorithmus mit dem Abgleich vom letzten Zeichen des Musters.
In diesem Beitrag werden wir die Heuristik für schlechte Zeichen und im nächsten Beitrag die Heuristik für gute Suffixe diskutieren.

Bad-Character-Heuristik

Die Idee der Bad-Character-Heuristik ist einfach. Das Zeichen des Textes, das nicht mit dem aktuellen Zeichen des Musters übereinstimmt, wird als Bad Character bezeichnet . Bei Nichtübereinstimmung verschieben wir das Muster, bis –
1) die Nichtübereinstimmung zu einer Übereinstimmung wird
2) Muster P sich über das nichtübereinstimmende Zeichen hinausbewegt.
Fall 1 – Nichtübereinstimmung wird zu Übereinstimmung
Wir suchen die Position des letzten Vorkommens des nichtübereinstimmenden Zeichens im Muster, und wenn das nichtübereinstimmende Zeichen im Muster vorhanden ist, verschieben wir das Muster so, dass es an dem nichtübereinstimmenden Zeichen ausgerichtet wird der Text t.

Fall 1

Erläuterung: Im obigen Beispiel haben wir an Position 3 eine Nichtübereinstimmung erhalten. Hier ist unser Nichtübereinstimmungszeichen „A“. Jetzt suchen wir nach dem letzten Vorkommen von „A“ im Muster. Wir haben „A“ an Position 1 im Muster (angezeigt in Blau) und dies ist das letzte Vorkommen davon. Jetzt verschieben wir das Muster zweimal, sodass „A“ im Muster mit „A“ im Text ausgerichtet wird.
Fall 2 – Muster bewegt sich über das Nichtübereinstimmungszeichen hinaus
Wir werden die Position des letzten Vorkommens des Nichtübereinstimmungszeichens im Muster nachschlagen, und wenn das Zeichen nicht existiert, werden wir das Muster über das Nichtübereinstimmungszeichen hinaus verschieben.

Fall2

Erläuterung: Hier haben wir eine Nichtübereinstimmung an Position 7. Das nicht übereinstimmende Zeichen „C“ existiert nicht im Muster vor Position 7, also verschieben wir das Muster an Position 7 vorbei und schließlich haben wir im obigen Beispiel eine perfekte Übereinstimmung von Muster ( grün angezeigt). Wir tun dies, weil „C“ im Muster nicht existiert, sodass wir bei jeder Verschiebung vor Position 7 eine Nichtübereinstimmung erhalten und unsere Suche erfolglos sein wird.
In der folgenden Implementierung verarbeiten wir das Muster vor und speichern das letzte Vorkommen jedes möglichen Zeichens in einem Array, dessen Größe gleich der Alphabetgröße ist. Wenn das Zeichen überhaupt nicht vorhanden ist, kann es zu einer Verschiebung um m (Länge des Musters) kommen. Daher braucht die Bad-Character-Heuristik $O(n/m)$ im besten Fall Zeit.

C++

/* C++ Program for Bad Character Heuristic of Boyer
Moore String Matching Algorithm */
#include <bits/stdc++.h>
using namespace std;
# define NO_OF_CHARS 256
 
// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic( string str, int size,
                        int badchar[NO_OF_CHARS])
{
    int i;
 
    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
 
    // Fill the actual value of last occurrence
    // of a character
    for (i = 0; i < size; i++)
        badchar[(int) str[i]] = i;
}
 
/* A pattern searching function that uses Bad
Character Heuristic of Boyer Moore Algorithm */
void search( string txt, string pat)
{
    int m = pat.size();
    int n = txt.size();
 
    int badchar[NO_OF_CHARS];
 
    /* Fill the bad character array by calling
    the preprocessing function badCharHeuristic()
    for given pattern */
    badCharHeuristic(pat, m, badchar);
 
    int s = 0; // s is shift of the pattern with
                // respect to text
    while(s <= (n - m))
    {
        int j = m - 1;
 
        /* Keep reducing index j of pattern while
        characters of pattern and text are
        matching at this shift s */
        while(j >= 0 && pat[j] == txt[s + j])
            j--;
 
        /* If the pattern is present at current
        shift, then index j will become -1 after
        the above loop */
        if (j < 0)
        {
            cout << "pattern occurs at shift = " <<  s << endl;
 
            /* Shift the pattern so that the next
            character in text aligns with the last
            occurrence of it in pattern.
            The condition s+m < n is necessary for
            the case when pattern occurs at the end
            of text */
            s += (s + m < n)? m-badchar[txt[s + m]] : 1;
 
        }
 
        else
            /* Shift the pattern so that the bad character
            in text aligns with the last occurrence of
            it in pattern. The max function is used to
            make sure that we get a positive shift.
            We may get a negative shift if the last
            occurrence of bad character in pattern
            is on the right side of the current
            character. */
            s += max(1, j - badchar[txt[s + j]]);
    }
}
 
/* Driver code */
int main()
{
    string txt= "ABAAABCD";
    string pat = "ABC";
    search(txt, pat);
    return 0;
}
  
 // This code is contributed by rathbhupendra

C

/* C Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
# include <limits.h>
# include <string.h>
# include <stdio.h>
 
# define NO_OF_CHARS 256
 
// A utility function to get maximum of two integers
int max (int a, int b) { return (a > b)? a: b; }
 
// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic( char *str, int size,
                        int badchar[NO_OF_CHARS])
{
    int i;
 
    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
         badchar[i] = -1;
 
    // Fill the actual value of last occurrence
    // of a character
    for (i = 0; i < size; i++)
         badchar[(int) str[i]] = i;
}
 
/* A pattern searching function that uses Bad
   Character Heuristic of Boyer Moore Algorithm */
void search( char *txt,  char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);
 
    int badchar[NO_OF_CHARS];
 
    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for given pattern */
    badCharHeuristic(pat, m, badchar);
 
    int s = 0;  // s is shift of the pattern with
                // respect to text
    while(s <= (n - m))
    {
        int j = m-1;
 
        /* Keep reducing index j of pattern while
           characters of pattern and text are
           matching at this shift s */
        while(j >= 0 && pat[j] == txt[s+j])
            j--;
 
        /* If the pattern is present at current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0)
        {
            printf("\n pattern occurs at shift = %d", s);
 
            /* Shift the pattern so that the next
               character in text aligns with the last
               occurrence of it in pattern.
               The condition s+m < n is necessary for
               the case when pattern occurs at the end
               of text */
            s += (s+m < n)? m-badchar[txt[s+m]] : 1;
 
        }
 
        else
            /* Shift the pattern so that the bad character
               in text aligns with the last occurrence of
               it in pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence  of bad character in pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[txt[s+j]]);
    }
}
 
/* Driver program to test above function */
int main()
{
    char txt[] = "ABAAABCD";
    char pat[] = "ABC";
    search(txt, pat);
    return 0;
}

Java

/* Java Program for Bad Character Heuristic of Boyer
Moore String Matching Algorithm */
 
 
class AWQ{
     
     static int NO_OF_CHARS = 256;
      
    //A utility function to get maximum of two integers
     static int max (int a, int b) { return (a > b)? a: b; }
 
     //The preprocessing function for Boyer Moore's
     //bad character heuristic
     static void badCharHeuristic( char []str, int size,int badchar[])
     {
 
      // Initialize all occurrences as -1
      for (int i = 0; i < NO_OF_CHARS; i++)
           badchar[i] = -1;
 
      // Fill the actual value of last occurrence
      // of a character (indices of table are ascii and values are index of occurence)
      for (int i = 0; i < size; i++)
           badchar[(int) str[i]] = i;
     }
 
     /* A pattern searching function that uses Bad
     Character Heuristic of Boyer Moore Algorithm */
     static void search( char txt[],  char pat[])
     {
      int m = pat.length;
      int n = txt.length;
 
      int badchar[] = new int[NO_OF_CHARS];
 
      /* Fill the bad character array by calling
         the preprocessing function badCharHeuristic()
         for given pattern */
      badCharHeuristic(pat, m, badchar);
 
      int s = 0;  // s is shift of the pattern with
                  // respect to text
       //there are n-m+1 potential allignments
      while(s <= (n - m))
      {
          int j = m-1;
 
          /* Keep reducing index j of pattern while
             characters of pattern and text are
             matching at this shift s */
          while(j >= 0 && pat[j] == txt[s+j])
              j--;
 
          /* If the pattern is present at current
             shift, then index j will become -1 after
             the above loop */
          if (j < 0)
          {
              System.out.println("Patterns occur at shift = " + s);
 
              /* Shift the pattern so that the next
                 character in text aligns with the last
                 occurrence of it in pattern.
                 The condition s+m < n is necessary for
                 the case when pattern occurs at the end
                 of text */
              //txt[s+m] is character after the pattern in text
              s += (s+m < n)? m-badchar[txt[s+m]] : 1;
 
          }
 
          else
              /* Shift the pattern so that the bad character
                 in text aligns with the last occurrence of
                 it in pattern. The max function is used to
                 make sure that we get a positive shift.
                 We may get a negative shift if the last
                 occurrence  of bad character in pattern
                 is on the right side of the current
                 character. */
              s += max(1, j - badchar[txt[s+j]]);
      }
     }
 
     /* Driver program to test above function */
    public static void main(String []args) {
         
         char txt[] = "ABAAABCD".toCharArray();
         char pat[] = "ABC".toCharArray();
         search(txt, pat);
    }
}

Python

# Python3 Program for Bad Character Heuristic
# of Boyer Moore String Matching Algorithm
 
NO_OF_CHARS = 256
 
def badCharHeuristic(string, size):
    '''
    The preprocessing function for
    Boyer Moore's bad character heuristic
    '''
 
    # Initialize all occurrence as -1
    badChar = [-1]*NO_OF_CHARS
 
    # Fill the actual value of last occurrence
    for i in range(size):
        badChar[ord(string[i])] = i;
 
    # retun initialized list
    return badChar
 
def search(txt, pat):
    '''
    A pattern searching function that uses Bad Character
    Heuristic of Boyer Moore Algorithm
    '''
    m = len(pat)
    n = len(txt)
 
    # create the bad character list by calling
    # the preprocessing function badCharHeuristic()
    # for given pattern
    badChar = badCharHeuristic(pat, m)
 
    # s is shift of the pattern with respect to text
    s = 0
    while(s <= n-m):
        j = m-1
 
        # Keep reducing index j of pattern while
        # characters of pattern and text are matching
        # at this shift s
        while j>=0 and pat[j] == txt[s+j]:
            j -= 1
 
        # If the pattern is present at current shift,
        # then index j will become -1 after the above loop
        if j<0:
            print("Pattern occur at shift = {}".format(s))
 
            '''   
                Shift the pattern so that the next character in text
                      aligns with the last occurrence of it in pattern.
                The condition s+m < n is necessary for the case when
                   pattern occurs at the end of text
               '''
            s += (m-badChar[ord(txt[s+m])] if s+m<n else 1)
        else:
            '''
               Shift the pattern so that the bad character in text
               aligns with the last occurrence of it in pattern. The
               max function is used to make sure that we get a positive
               shift. We may get a negative shift if the last occurrence
               of bad character in pattern is on the right side of the
               current character.
            '''
            s += max(1, j-badChar[ord(txt[s+j])])
 
 
# Driver program to test above function
def main():
    txt = "ABAAABCD"
    pat = "ABC"
    search(txt, pat)
 
if __name__ == '__main__':
    main()
 
# This code is contributed by Atul Kumar
# (www.facebook.com/atul.kr.007)

C#

/* C# Program for Bad Character Heuristic of Boyer
Moore String Matching Algorithm */
 
using System;
public class AWQ{
     
    static int NO_OF_CHARS = 256;
     
    //A utility function to get maximum of two integers
    static int max (int a, int b) { return (a > b)? a: b; }
 
    //The preprocessing function for Boyer Moore's
    //bad character heuristic
    static void badCharHeuristic( char []str, int size,int []badchar)
    {
    int i;
 
    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;
 
    // Fill the actual value of last occurrence
    // of a character
    for (i = 0; i < size; i++)
        badchar[(int) str[i]] = i;
    }
 
    /* A pattern searching function that uses Bad
    Character Heuristic of Boyer Moore Algorithm */
    static void search( char []txt, char []pat)
    {
    int m = pat.Length;
    int n = txt.Length;
 
    int []badchar = new int[NO_OF_CHARS];
 
    /* Fill the bad character array by calling
        the preprocessing function badCharHeuristic()
        for given pattern */
    badCharHeuristic(pat, m, badchar);
 
    int s = 0; // s is shift of the pattern with
                // respect to text
    while(s <= (n - m))
    {
        int j = m-1;
 
        /* Keep reducing index j of pattern while
            characters of pattern and text are
            matching at this shift s */
        while(j >= 0 && pat[j] == txt[s+j])
            j--;
 
        /* If the pattern is present at current
            shift, then index j will become -1 after
            the above loop */
        if (j < 0)
        {
            Console.WriteLine("Patterns occur at shift = " + s);
 
            /* Shift the pattern so that the next
                character in text aligns with the last
                occurrence of it in pattern.
                The condition s+m < n is necessary for
                the case when pattern occurs at the end
                of text */
            s += (s+m < n)? m-badchar[txt[s+m]] : 1;
 
        }
 
        else
            /* Shift the pattern so that the bad character
                in text aligns with the last occurrence of
                it in pattern. The max function is used to
                make sure that we get a positive shift.
                We may get a negative shift if the last
                occurrence of bad character in pattern
                is on the right side of the current
                character. */
            s += max(1, j - badchar[txt[s+j]]);
    }
    }
 
    /* Driver program to test above function */
    public static void Main() {
         
        char []txt = "ABAAABCD".ToCharArray();
        char []pat = "ABC".ToCharArray();
        search(txt, pat);
    }
}
 
// This code is contributed by PrinciRaj19992

Javascript

<script>
/* Javascript Program for Bad Character Heuristic of Boyer
Moore String Matching Algorithm */
let NO_OF_CHARS = 256;
 
// A utility function to get maximum of two integers
function max (a,b)
{
    return (a > b)? a: b;
}
 
// The preprocessing function for Boyer Moore's
// bad character heuristic
function badCharHeuristic(str,size,badchar)
{
    // Initialize all occurrences as -1
      for (let i = 0; i < NO_OF_CHARS; i++)
           badchar[i] = -1;
  
      // Fill the actual value of last occurrence
      // of a character (indices of table are ascii and values are index of occurence)
      for (i = 0; i < size; i++)
           badchar[ str[i].charCodeAt(0)] = i;
}
 
/* A pattern searching function that uses Bad
     Character Heuristic of Boyer Moore Algorithm */
function search(txt,pat)
{
    let m = pat.length;
      let n = txt.length;
  
      let badchar = new Array(NO_OF_CHARS);
  
      /* Fill the bad character array by calling
         the preprocessing function badCharHeuristic()
         for given pattern */
      badCharHeuristic(pat, m, badchar);
  
      let s = 0;  // s is shift of the pattern with
                  // respect to text
       // there are n-m+1 potential allignments
      while(s <= (n - m))
      {
          let j = m-1;
  
          /* Keep reducing index j of pattern while
             characters of pattern and text are
             matching at this shift s */
          while(j >= 0 && pat[j] == txt[s+j])
              j--;
  
          /* If the pattern is present at current
             shift, then index j will become -1 after
             the above loop */
          if (j < 0)
          {
              document.write("Patterns occur at shift = " + s);
  
              /* Shift the pattern so that the next
                 character in text aligns with the last
                 occurrence of it in pattern.
                 The condition s+m < n is necessary for
                 the case when pattern occurs at the end
                 of text */
              //txt[s+m] is character after the pattern in text
              s += (s+m < n)? m-badchar[txt[s+m].charCodeAt(0)] : 1;
  
          }
  
          else
              /* Shift the pattern so that the bad character
                 in text aligns with the last occurrence of
                 it in pattern. The max function is used to
                 make sure that we get a positive shift.
                 We may get a negative shift if the last
                 occurrence  of bad character in pattern
                 is on the right side of the current
                 character. */
              s += max(1, j - badchar[txt[s+j].charCodeAt(0)]);
      }
}
 
/* Driver program to test above function */
let txt="ABAAABCD".split("");
let pat = "ABC".split("");
search(txt, pat);
 
// This code is contributed by unknown2108
</script>

Ausgabe:

 pattern occurs at shift = 4

Die Bad-Character-Heuristik kann $O(mn)$ im schlimmsten Fall einige Zeit in Anspruch nehmen. Der schlimmste Fall tritt auf, wenn alle Zeichen des Textes und des Musters gleich sind. Beispiel: txt[] = „AAAAAAAAAAAAAAAAAA“ und pat[] = „AAAAAA“. Die Bad-Character-Heuristik kann im besten Fall O(n/m) annehmen. Der beste Fall tritt auf, wenn alle Zeichen des Textes und des Musters unterschiedlich sind.

Boyer-Moore-Algorithmus | Gute Suffix-Heuristik
Dieser Artikel wurde von Atul Kumar mitverfasst. Bitte schreiben Sie Kommentare, wenn Sie etwas Falsches finden oder weitere Informationen zu dem oben diskutierten Thema teilen möchten.

Falls Sie an Live-Kursen mit Experten teilnehmen möchten , beziehen Sie sich bitte auf DSA Live-Kurse für Berufstätige und Competitive Programming Live for Students .

Acervo Lima

Die größte Sammlung von Tutorials und Referenzen

Boyer-Moore-Algorithmus zur Mustersuche

C++

C

Java

Python

C#

Javascript

Latest posts

C++

C

Java

Python

C#

Javascript

Latest posts

Most popular posts