Lecture 10

Strings and pattern matching

Strings

Pattern Matching Problem

Brute force pattern matching

    Algorithm BruteForceMatch(T, P)
        Input: text T of size n and pattern P of size m
        Output: starting index of a substring of T equal to P, or -1 if no such substring exists
        for i <- 0 to n - m do
            { test shift i of the pattern }
            j <- 0
            while j < m AND T[i + j] = P[j] do
                j <- j + 1
            if j = m then
                return i { match at i }
            { otherwise mismatch: continue with the next shift }
        return -1 { no match anywhere }
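
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation; the function name `brute_force_match` is an assumption made for this example.

```python
def brute_force_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):           # try every shift i = 0 .. n - m
        j = 0
        # advance j while the pattern is not exhausted and characters agree
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                       # the whole pattern matched at shift i
            return i
    return -1                            # no shift produced a full match


print(brute_force_match("abacaabaccabacabaabb", "abacab"))
```

Each shift can cost up to m comparisons and there are n - m + 1 shifts, which is where the O(nm) worst case comes from.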

Can we do better?

Boyer-Moore: Looking-Glass Heuristic

Boyer-Moore: Character-Jump Heuristic

Example

Terminology (used further)

symbol  definition
Σ       alphabet
P       pattern
T       full string (to pattern match)
m       |P|, the size of the pattern
s       |Σ|, the size of the alphabet

Last-Occurrence Function

Then:

c     a  b  c  d
L(c)  4  5  3  -1
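
A short sketch of how such a table could be computed. The pattern `"abacab"` and alphabet `"abcd"` are assumptions chosen because they reproduce the values in the table; the notes do not state which pattern was used. The function name `last_occurrence` is also hypothetical.

```python
def last_occurrence(pattern, alphabet):
    """Map each character c to the largest index i with pattern[i] == c, or -1."""
    last = {c: -1 for c in alphabet}     # characters absent from the pattern stay at -1
    for i, c in enumerate(pattern):
        last[c] = i                      # later positions overwrite earlier ones
    return last


# Assumed example: pattern "abacab" over the alphabet {a, b, c, d}
print(last_occurrence("abacab", "abcd"))
```

A dictionary is used here for brevity; an array indexed by character code works the same way and is built in O(m + s) time.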

Last Occurrence Function

Can be represented by an array indexed by the numeric codes of the characters

    Algorithm BoyerMooreMatch(T, P, S)
        L <- lastOccurrenceFunction(P, S)
        i <- m - 1 { m is size of P }
        j <- m - 1
        repeat
            if T[i] = P[j] then
                if j = 0 then
                    return i { match at i }
                else
                    i <- i - 1
                    j <- j - 1
            else
                { character-jump }
                l <- L[T[i]]
                i <- i + m - min(j, 1 + l)
                j <- m - 1
        until i > n - 1
        return -1 { no match }
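
A possible Python rendering of the pseudocode, offered as a sketch. It builds the last-occurrence table as a dictionary over the pattern's characters only (text characters not in the pattern default to -1), and it uses a `while` loop rather than `repeat ... until` so that a pattern longer than the text is handled without indexing errors. Both choices, and the name `boyer_moore_match`, are assumptions of this example.

```python
def boyer_moore_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # last-occurrence table: rightmost index of each pattern character
    last = {c: k for k, c in enumerate(pattern)}
    i = j = m - 1                        # compare right to left (looking-glass)
    while i <= n - 1:
        if text[i] == pattern[j]:
            if j == 0:
                return i                 # match starts at i
            i -= 1
            j -= 1
        else:
            l = last.get(text[i], -1)    # character-jump heuristic
            i += m - min(j, 1 + l)
            j = m - 1                    # restart at the end of the pattern
    return -1


print(boyer_moore_match("abacaabaccabacabaabb", "abacab"))
```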

Performance analysis

Worst case example

Knuth-Morris-Pratt (KMP) Algorithm

KMP failure function

j     0  1  2  3  4  5
P[j]  a  b  a  a  b  a
F(j)  0  0  1  1  2  3
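
The table can be reproduced with a short sketch like the one below, where F(j) is taken to be the length of the longest proper prefix of P[0..j] that is also a suffix of P[0..j]. The function name `failure_function` mirrors the `failureFunction` call in the lecture pseudocode, but the body is an assumed implementation.

```python
def failure_function(pattern):
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of pattern[:j+1]."""
    m = len(pattern)
    fail = [0] * m
    i, j = 1, 0                          # i scans the pattern, j is the current prefix length
    while i < m:
        if pattern[i] == pattern[j]:
            fail[i] = j + 1              # the matched prefix grows by one
            i += 1
            j += 1
        elif j > 0:
            j = fail[j - 1]              # fall back to a shorter prefix
        else:
            fail[i] = 0                  # no prefix matches here
            i += 1
    return fail


print(failure_function("abaaba"))
```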

KMP Algorithm

    Algorithm KMPMatch(T, P)
        F <- failureFunction(P)
        i <- 0
        j <- 0
        while i < length(T)
            if T[i] = P[j] then
                if j = length(P) - 1 then
                    return i - j { match }
                else
                    i <- i + 1
                    j <- j + 1
            else if j > 0 then
                j <- F[j - 1]
            else
                i <- i + 1
        return -1 { no match }
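
A self-contained Python sketch of the matcher, including an assumed implementation of the failure function so the block runs on its own. The names `failure_function` and `kmp_match` are chosen for this example.

```python
def failure_function(pattern):
    """fail[j] = longest proper prefix of pattern[:j+1] that is also its suffix."""
    fail = [0] * len(pattern)
    i, j = 1, 0
    while i < len(pattern):
        if pattern[i] == pattern[j]:
            fail[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = fail[j - 1]
        else:
            i += 1
    return fail


def kmp_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    if not pattern:
        return 0
    fail = failure_function(pattern)
    i = j = 0                            # i indexes text, j indexes pattern
    while i < len(text):
        if text[i] == pattern[j]:
            if j == len(pattern) - 1:
                return i - j             # match
            i += 1
            j += 1
        elif j > 0:
            j = fail[j - 1]              # reuse the prefix already matched
        else:
            i += 1
    return -1


print(kmp_match("abacaabaccabacabaabb", "abacab"))
```

Because `i` never moves backwards, the matcher makes at most 2n iterations, giving O(m + n) time overall once the failure function is included.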

Example

Tries - Retrieval Tries

Preprocessing steps

Standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }
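
To make the structure concrete, here is a minimal sketch that builds a standard trie for this set of strings and looks up whole words. The class `TrieNode`, its `is_word` flag, and the helpers `build_trie` and `contains` are assumptions of this example; a flag is used instead of a terminator character to mark where a stored word ends.

```python
class TrieNode:
    def __init__(self):
        self.children = {}               # maps a character to the child node
        self.is_word = False             # True if a stored string ends here


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True
    return root


def contains(root, word):
    """Follow the path spelled by word; succeed only if it ends on a stored string."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
root = build_trie(S)
print(contains(root, "bell"), contains(root, "be"))
```

A lookup visits one node per character of the query, so its cost depends on the word length and the cost of finding a child, not on how many strings are stored.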

Analysis of standard tries

Word matching with a trie

Compressed Tries

Compact Representation (NOT ON EXAM)

Want to create a compact representation of a compressed trie

Compact representation of a compressed trie for an array of strings

Tries - outside of patterns

Input: Query Logs

Building the Trie

Querying the Trie

Given a prefix, return a list of k completions
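
One way the build and query steps might look in Python. This sketch assumes the query log is a plain list of query strings and that completions are ranked by how often they appear in the log, with ties broken alphabetically; the notes do not specify the ranking, so that part is an assumption, as are the names `Node`, `build_query_trie` and `complete`.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0                   # how many logged queries end at this node


def build_query_trie(query_log):
    root = Node()
    for query in query_log:
        node = root
        for ch in query:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.count += 1
    return root


def complete(root, prefix, k):
    """Return up to k logged queries starting with prefix, most frequent first."""
    node = root
    for ch in prefix:                    # walk down to the node for the prefix
        node = node.children.get(ch)
        if node is None:
            return []
    found = []
    stack = [(node, prefix)]
    while stack:                         # collect every query in the subtree
        cur, text = stack.pop()
        if cur.count > 0:
            found.append((cur.count, text))
        for ch, child in cur.children.items():
            stack.append((child, text + ch))
    found.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, text in found[:k]]


log = ["stock", "stop", "stock", "sell", "stop", "stock", "store"]
trie = build_query_trie(log)
print(complete(trie, "sto", 2))
```

This version scans the whole subtree below the prefix on every query; caching the top k completions at each node is a common way to avoid that, at the cost of extra memory.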

Suffix arrays

Suffix Tree (Suffix Trie)

Suffix Tree: Compact Rep.

Suffix Tree Pattern Matching

Other query types

Given two strings A and B, what is their longest common substring?
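
As a rough illustration of this query, the sketch below swaps the suffix tree for a simpler stand-in: it sorts all suffixes of A + separator + B and compares neighbouring suffixes that start in different strings. It is not the suffix-tree algorithm itself, and the naive sort is far slower than a real suffix tree construction. It assumes the separator character (here `"\x00"`) occurs in neither string; the function name is hypothetical.

```python
def longest_common_substring(a, b, sep="\x00"):
    """Longest string occurring in both a and b (assumes sep is in neither)."""
    text = a + sep + b
    # naive suffix sorting; fine for short inputs, slow for long ones
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for x, y in zip(order, order[1:]):
        # only neighbours that start in different strings can witness a common substring
        if (x < len(a)) == (y < len(a)):
            continue
        k = 0                            # length of the common prefix of the two suffixes
        while (x + k < len(text) and y + k < len(text)
               and text[x + k] == text[y + k] and text[x + k] != sep):
            k += 1
        if k > len(best):
            best = text[x:x + k]
    return best


print(longest_common_substring("banana", "ananas"))
```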

Analysis and performance

Suffix Arrays

Does the string “bana” occur in T? Binary search the suffix array (SA)!
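
A small sketch of that lookup. The text `"banana"` is an assumption for this example (the notes do not state T here), and the suffix array is built by naive sorting purely for illustration; the names `build_suffix_array` and `occurs` are hypothetical.

```python
def build_suffix_array(text):
    """Start indices of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurs(text, sa, pattern):
    """Binary search the suffix array for a suffix that starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # compare only the first m characters of the suffix with the pattern
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first suffix whose m-character prefix is >= pattern
    return lo < len(sa) and text.startswith(pattern, sa[lo])


T = "banana"
SA = build_suffix_array(T)
print(SA, occurs(T, SA, "bana"))
```

The search makes O(log n) probes, and each probe compares at most m characters, so a query costs O(m log n).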

Suffix Array Analysis