6.897: Advanced Data Structures Spring 2005
Lecture 19 — April 14, 2005
Prof. Erik Demaine Scribe: Vincent Yeung
1 Over view
In this lecture, we discuss the problem of approximate string matching. In particular, we outline
solutions for solving the exact matching problem for patterns with “don’t care” symbols, denoted
by ?. In the process, we will be using solutions for the level ancestor problem, which we also discuss.
2 Approximate String Matching
The approximate string matching problem is defined as follows. Given an error tolerance k and
a text T , construct a data structure which can answer the following query: find occurrences of a
pattern P in T within error k. There are different ways to measure error, such as:
1. Hamming distance: the number of character mismatches
2. edit distance: the number of edits (insertions, deletions, substitutions) needed to produce an
exact match.
The best currently-known bounds, given by [4], are:
• space and preprocessing: O(|T |
(c lg |T |)
k
k!
)
• query: O(|P | +
(c lg |T |)
k
k!
lg lg |T |) + 3
k
· (# occurrences)
We will not actually cover this data structure, but we concentrate on a simpler problem where the
same techniques are used. These bounds are only interesting for small k (such as a constant). For
larger k, there are relaxations of the problem which can be solved more efficiently. This will be the
topic of next lecture.
3 Searching with Wildcards
We will fo cus on a subproblem of the above. Again we are given T and k for preprocessing. But
now, the query consists of a pattern P that contains at most k “don’t care” characters (the ?
wildcards). We are to find “exact” matches of P , where wildcards match any character. The best
known solution [4] solves the problem in O(|T |lg
k
|T |) space and O(2
k
lg lg |T |+|P |+# occurrences)
query time.
1
评论0