Pattern Matching
Advanced String Matching Algorithms: KMP, Rabin-Karp, Z-Algorithm, Aho-Corasick
Fundamental Concepts
Problem Statement
Given a Text of length and a Pattern of length , find all occurrences of in .
Naive Approach
Check every position in . For each, compare with . Complexity: . Drawback: Re-examines characters already matched.
Advanced Algorithms
1. Knuth-Morris-Pratt (KMP) Algorithm
Idea: Use degeneracies in the pattern (prefix/suffix symmetries) to avoid backtracking in .
LPS Array: Longest Prefix which is also Suffix. lps[i] = length of longest proper prefix of P[0..i] that is also a suffix of P[0..i].
Complexity: Time , Space .
def compute_lps(pattern):
m = len(pattern)
lps = [0] * m
length = 0 # Length of previous longest prefix suffix
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1]
else:
lps[i] = 0
i += 1
return lps
def kmp_search(text, pattern):
n, m = len(text), len(pattern)
lps = compute_lps(pattern)
i = j = 0 # i for text, j for pattern
matches = []
while i < n:
if pattern[j] == text[i]:
i += 1
j += 1
if j == m:
matches.append(i - j)
j = lps[j-1]
elif i < n and pattern[j] != text[i]:
if j != 0:
j = lps[j-1]
else:
i += 1
return matches2. Rabin-Karp Algorithm
Idea: Rolling Hash. Compare hash of window in with hash of . If match, do full check (to handle collisions). Hash Function: Polynomial rolling hash. . Update: Remove leading char, shift, add trailing char. . Complexity: Avg , Worst (many collisions).
3. Z-Algorithm
Z-Array: Z[i] is the length of the longest substring starting at S[i] that is also a prefix of S.
Pattern Matching: Construct string S = P + \ + TO(n+m)[L, R]$ which is the interval matching the prefix. Use previous values inside box to skip comparisons.
4. Aho-Corasick Algorithm
Problem: Find multiple patterns in . Idea: Generalization of KMP with Trie.
- Build Trie of patterns.
- Add Failure Links (like KMP LPS) pointing to longest proper suffix which is also a prefix of some pattern.
- Traverse through the Automaton. Complexity: . Use: Virus scanning, grep.
5. Boyer-Moore Algorithm
Idea: Scan pattern right-to-left. Skip characters based on Bad Character Rule and Good Suffix Rule.
Complexity: Best case (sublinear!). Worst .
Use: Standard ctrl+f search.
Advanced Tricks
1. Rolling Hash for Palindromes
Check if substring is palindrome by comparing Forward Hash and Backward Hash in .
2. String Hashing
Double Hash (use two moduli and ) to reduce collision probability to near zero. Allows string equality checks.
3. Manacher's Algorithm
Finds longest palindromic substring in linear time . Avoids expansion from center.
Interview Problem Types
Type 1: Single Pattern Search
| Given | Find | Approach |
|---|---|---|
| Text and Pattern | First index | Python find / KMP / Rabin-Karp. |
| Repeated substring pattern | Is repeated? | KMP lps: if n % (n - lps[n-1]) == 0. |
Type 2: Multiple Patterns
| Given | Find | Approach |
|---|---|---|
| List of words, Text | Any matching words | Aho-Corasick or Trie search. |
Type 3: Palindromes
| Given | Find | Approach |
|---|---|---|
| String | Longest Palindrome | Manacher's or Expand Center (). |
| String | Count Palindromic Substrings | Expand Center (). |
Type 4: Substring Properties
| Given | Find | Approach |
|---|---|---|
| String | Longest Duplicate Substring | Binary Search on Length + Rolling Hash (Rabin-Karp). |
Common Pitfalls
Pitfall 1: Rolling Hash Collision
Wrong: Assuming hash equality means string equality. Correct: Check string or use Double Hashing.
Pitfall 2: Off-by-One in KMP
Wrong: lps index logic.
Correct: j becomes lps[j-1] on mismatch. If j=0, i increments.
Pitfall 3: String Immutability (Python)
Wrong: String concatenation s += c in loop ().
Correct: Use list append and ''.join().
Quick Reference
- KMP: Prefix function. No backtracking text.
- Rabin-Karp: Rolling Hash. Good for fixed length multiple patterns.
- Trie: Prefix tree. Good for set of strings.
- Aho-Corasick: Trie + Fail links. Multiple patterns.
- Z-Algo: Longest prefix substring.
Practice Problem Categories
- KMP: Shortest Palindrome (using
s + # + rev(s)), Repeated Substring Pattern. - Rabin-Karp: Longest Duplicate Substring, Repeated DNA Sequences.
- Trie: Implement Trie, Stream of Characters.
- Manacher: Longest Palindromic Substring.