Preptide

Pattern Matching

Advanced String Matching Algorithms: KMP, Rabin-Karp, Z-Algorithm, Aho-Corasick

Fundamental Concepts

Problem Statement

Given a Text TT of length nn and a Pattern PP of length mm, find all occurrences of PP in TT.

Naive Approach

Check every position ii in TT. For each, compare PP with T[i..i+m1]T[i..i+m-1]. Complexity: O(nm)O(nm). Drawback: Re-examines characters already matched.

Advanced Algorithms

1. Knuth-Morris-Pratt (KMP) Algorithm

Idea: Use degeneracies in the pattern (prefix/suffix symmetries) to avoid backtracking in TT. LPS Array: Longest Prefix which is also Suffix. lps[i] = length of longest proper prefix of P[0..i] that is also a suffix of P[0..i]. Complexity: Time O(n+m)O(n+m), Space O(m)O(m).

def compute_lps(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0 # Length of previous longest prefix suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    lps = compute_lps(pattern)
    i = j = 0 # i for text, j for pattern
    matches = []
    
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        
        if j == m:
            matches.append(i - j)
            j = lps[j-1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j-1]
            else:
                i += 1
    return matches

2. Rabin-Karp Algorithm

Idea: Rolling Hash. Compare hash of window in TT with hash of PP. If match, do full check (to handle collisions). Hash Function: Polynomial rolling hash. H=(c0ak1+c1ak2+...+ck1a0)(modq)H = (c_0 a^{k-1} + c_1 a^{k-2} + ... + c_{k-1} a^0) \pmod q. Update: Remove leading char, shift, add trailing char. O(1)O(1). Complexity: Avg O(n+m)O(n+m), Worst O(nm)O(nm) (many collisions).

3. Z-Algorithm

Z-Array: Z[i] is the length of the longest substring starting at S[i] that is also a prefix of S. Pattern Matching: Construct string S = P + \ + T.ComputeZarray.IfZ[i]==len(P),matchfound.Complexity:. Compute Z-array. If `Z[i] == len(P)`, match found. **Complexity**: O(n+m).Logic:Usesa"Zbox". **Logic**: Uses a "Z-box" [L, R]$ which is the interval matching the prefix. Use previous values inside box to skip comparisons.

4. Aho-Corasick Algorithm

Problem: Find multiple patterns P1,P2,...P_1, P_2, ... in TT. Idea: Generalization of KMP with Trie.

  1. Build Trie of patterns.
  2. Add Failure Links (like KMP LPS) pointing to longest proper suffix which is also a prefix of some pattern.
  3. Traverse TT through the Automaton. Complexity: O(n+mi+output)O(n + \sum m_i + \text{output}). Use: Virus scanning, grep.

5. Boyer-Moore Algorithm

Idea: Scan pattern right-to-left. Skip characters based on Bad Character Rule and Good Suffix Rule. Complexity: Best case O(n/m)O(n/m) (sublinear!). Worst O(nm)O(nm). Use: Standard ctrl+f search.

Advanced Tricks

1. Rolling Hash for Palindromes

Check if substring is palindrome by comparing Forward Hash and Backward Hash in O(1)O(1).

2. String Hashing

Double Hash (use two moduli 109+710^9+7 and 109+910^9+9) to reduce collision probability to near zero. Allows O(1)O(1) string equality checks.

3. Manacher's Algorithm

Finds longest palindromic substring in linear time O(n)O(n). Avoids O(n2)O(n^2) expansion from center.

Interview Problem Types

GivenFindApproach
Text and PatternFirst indexPython find / KMP / Rabin-Karp.
Repeated substring patternIs repeated?KMP lps: if n % (n - lps[n-1]) == 0.

Type 2: Multiple Patterns

GivenFindApproach
List of words, TextAny matching wordsAho-Corasick or Trie search.

Type 3: Palindromes

GivenFindApproach
StringLongest PalindromeManacher's or Expand Center (O(n2)O(n^2)).
StringCount Palindromic SubstringsExpand Center (O(n2)O(n^2)).

Type 4: Substring Properties

GivenFindApproach
StringLongest Duplicate SubstringBinary Search on Length + Rolling Hash (Rabin-Karp).

Common Pitfalls

Pitfall 1: Rolling Hash Collision

Wrong: Assuming hash equality means string equality. Correct: Check string or use Double Hashing.

Pitfall 2: Off-by-One in KMP

Wrong: lps index logic. Correct: j becomes lps[j-1] on mismatch. If j=0, i increments.

Pitfall 3: String Immutability (Python)

Wrong: String concatenation s += c in loop (O(n2)O(n^2)). Correct: Use list append and ''.join().

Quick Reference

  • KMP: Prefix function. No backtracking text.
  • Rabin-Karp: Rolling Hash. Good for fixed length multiple patterns.
  • Trie: Prefix tree. Good for set of strings.
  • Aho-Corasick: Trie + Fail links. Multiple patterns.
  • Z-Algo: Longest prefix substring.

Practice Problem Categories

  • KMP: Shortest Palindrome (using s + # + rev(s)), Repeated Substring Pattern.
  • Rabin-Karp: Longest Duplicate Substring, Repeated DNA Sequences.
  • Trie: Implement Trie, Stream of Characters.
  • Manacher: Longest Palindromic Substring.