6.897: Advanced Data Structures Spring 2005

Lecture 19 — April 14, 2005

Prof. Erik Demaine Scribe: Vincent Yeung

1 Overview

In this lecture, we discuss the problem of approximate string matching. In particular, we outline solutions to the exact matching problem for patterns with “don’t care” symbols, denoted by ?. In the process, we use solutions to the level ancestor problem, which we also discuss.

2 Approximate String Matching

The approximate string matching problem is defined as follows. Given an error tolerance k and a text T, construct a data structure that can answer the following query: find the occurrences of a pattern P in T within error k. There are different ways to measure error, such as:

1. Hamming distance: the number of character mismatches.
2. Edit distance: the number of edits (insertions, deletions, substitutions) needed to produce an exact match.
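
To make the two error measures concrete, here is a minimal Python sketch. The function names `hamming_distance` and `edit_distance` are illustrative choices rather than anything from the lecture, and the edit distance uses the standard dynamic program over prefixes, which is unrelated to the indexing data structure of [4].

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or match for free
        prev = curr
    return prev[-1]
```

Hamming distance only compares aligned positions, so it is defined here for equal-length strings, whereas edit distance also allows the alignment to shift.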

The best currently-known bounds, given by [4], are:

• space and preprocessing: $O\left(|T| \cdot \frac{(c \lg |T|)^k}{k!}\right)$
• query: $O\left(|P| + \frac{(c \lg |T|)^k}{k!} \cdot \lg \lg |T|\right) + 3^k \cdot (\#\text{ occurrences})$

We will not actually cover this data structure; instead, we concentrate on a simpler problem that uses the same techniques. These bounds are only interesting for small k (such as a constant). For larger k, there are relaxations of the problem that can be solved more efficiently. This will be the topic of the next lecture.

3 Searching with Wildcards

We will focus on a subproblem of the above. Again we are given T and k for preprocessing, but now the query consists of a pattern P that contains at most k “don’t care” characters (the ? wildcards). We are to find “exact” matches of P, where a wildcard matches any character. The best known solution [4] solves the problem in $O(|T| \lg^k |T|)$ space and $O(2^k \lg \lg |T| + |P| + \#\text{ occurrences})$ query time.

All the solutions we will discuss use suffix trees. An obvious simple solution is to walk down the suffix tree while matching P and branch |Σ| ways every time a ? is encountered in P. Thus, queries take at most $O(|\Sigma|^k \cdot |P|)$ time. Compared to the best solution mentioned above, the simple solution is lacking in two respects: it depends on the alphabet size (which may be very large), and its dependence on the pattern length is multiplicative instead of additive.
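
The simple branching strategy can be sketched as follows. As a simplifying assumption, this sketch uses an uncompressed suffix trie (quadratic space) instead of a true suffix tree, and the names `Node`, `build_suffix_trie`, and `find_with_wildcards` are hypothetical. The point is only to show the |Σ|-way branching at each ?.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.starts = []    # start positions of all suffixes passing through this node


def build_suffix_trie(text):
    """Insert every suffix of text, character by character (not compressed)."""
    root = Node()
    for start in range(len(text)):
        node = root
        node.starts.append(start)
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root


def find_with_wildcards(root, pattern, wildcard="?"):
    """Sorted start positions where pattern matches; wildcard matches any one character."""
    frontier = [root]
    for ch in pattern:
        next_frontier = []
        for node in frontier:
            if ch == wildcard:
                # Branch into every child: up to |Sigma| ways per wildcard.
                next_frontier.extend(node.children.values())
            elif ch in node.children:
                next_frontier.append(node.children[ch])
        frontier = next_frontier
    return sorted(s for node in frontier for s in node.starts)
```

For example, with `root = build_suffix_trie("banana")`, the call `find_with_wildcards(root, "a?a")` reports the two occurrences of "ana". Each ? can multiply the size of the frontier by the alphabet size, which is where the $|\Sigma|^k$ factor comes from.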

We now describe how to improve the $|\Sigma|^k$ factor to $2^k$. To do so, we perform a heavy-light decomposition of the suffix tree. Recall that an edge to a child is light if the subtree rooted at that child contains at most half the nodes of the parent’s subtree. The intuition is that, whenever we encounter a ?, we now only differentiate between the light edges and possibly a single heavy edge. Because light subtrees are small, we group them together in one big chunk. Specifically, for each node in the primary suffix tree, we store a secondary suffix tree on the union of the light subtrees of that node, omitting the first character of each subtree.
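
The light/heavy classification itself is easy to sketch. The code below assumes a generic rooted tree given as a dictionary from each node to its list of children (a hypothetical representation, not the suffix tree of the lecture) and computes, for every node, the number of light edges on its path from the root.

```python
def subtree_sizes(children, root):
    """Number of nodes in each subtree, computed without recursion."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    size = {}
    for v in reversed(order):  # every node is visited after all of its descendants
        size[v] = 1 + sum(size[c] for c in children.get(v, []))
    return size


def light_depths(children, root):
    """For each node, the number of light edges on its root-to-node path."""
    size = subtree_sizes(children, root)
    depth = {root: 0}
    stack = [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, []):
            # Light edge: the child's subtree holds at most half of the parent's nodes.
            is_light = 2 * size[c] <= size[v]
            depth[c] = depth[v] + (1 if is_light else 0)
            stack.append(c)
    return depth
```

Every light edge at least halves the subtree size, so a tree with n nodes has light depth at most lg n. This is the property that keeps the number of secondary trees containing any one leaf small.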

If k > 1, we recurse k times so that there are k + 1 “levels” of secondary trees. Since the light depth is O(lg |T|) in a heavy-light decomposition, each leaf appears in $O(\lg^k |T|)$ trees. Thus, the solution takes $O(|T| \lg^k |T|)$ space and preprocessing, and $O(2^k \cdot |P|)$ query time.

As mentioned above, there is a way to make the |P| factor additive in the query time. The idea is to quickly (in lg lg |T| time) determine whether we should take the light or the heavy branch. We will not delve into the specifics of this solution, but mention that, using the suffix tree from above, least common ancestor queries, and level ancestor queries, we can detect whether one of the $2^k$ branches is “good”.

4 The Level Ancestor Problem

For the rest of this lecture, we shift our attention to the level ancestor problem. We are given a static rooted tree, which can be preprocessed. Then, a level ancestor query is given a node v and a number l, and must find the l-th ancestor of v. This is equivalent to finding the depth-d ancestor of v, where d + l = depth(v).

Various solutions to this problem have been proposed [3, 5, 1, 2]. We will discuss the solution in [2], by Bender and Farach-Colton. We present gradual steps leading to a solution that combines the different improvements and ends up taking linear space and preprocessing time, with constant query time. First observe that an immediate solution is to store a lookup table for each node. This gives total space $O(n^2)$ and constant query time.

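Jump pointers. Think of skip lists. With jump pointers, each node stores its 1st, 2nd, 4th, . . ., $2^{\lfloor \lg n \rfloor}$-th ancestors. This takes O(n lg n) space. To perform a query, recursively go up $\lfloor\lfloor l \rfloor\rfloor = 2^{\lfloor \lg l \rfloor}$ levels, the largest power of two not exceeding l. We know that $l/2 < \lfloor\lfloor l \rfloor\rfloor \le l$, so each jump covers more than half of the remaining distance and queries take O(lg n) time.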
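
A minimal sketch of jump pointers, assuming the tree is given as a parent array with -1 marking the root (the representation and the names `build_jump_pointers` and `level_ancestor_jump` are illustrative assumptions, not part of the lecture):

```python
def build_jump_pointers(parent):
    """parent[v] is the parent of node v, or -1 for the root.

    Returns a table up where up[i][v] is the 2**i-th ancestor of v, or -1 if none.
    """
    n = len(parent)
    up = [list(parent)]  # level 0: the 1st ancestor
    i = 1
    while (1 << i) < n:  # depths are at most n - 1, so larger jumps never succeed
        prev = up[-1]
        # The 2**i-th ancestor is the 2**(i-1)-th ancestor of the 2**(i-1)-th ancestor.
        up.append([prev[prev[v]] if prev[v] != -1 else -1 for v in range(n)])
        i += 1
    return up


def level_ancestor_jump(up, v, l):
    """Return the l-th ancestor of v, or -1 if v has fewer than l ancestors."""
    while l > 0 and v != -1:
        i = l.bit_length() - 1  # floor(lg l): 2**i is the largest power of two <= l
        if i >= len(up):
            return -1           # l exceeds the maximum possible depth
        v = up[i][v]
        l -= 1 << i
    return v
```

The table has O(lg n) levels of n entries each, matching the O(n lg n) space bound, and each loop iteration more than halves l, so a query makes O(lg n) jumps.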

Long path decomposition. We preprocess the tree as follows:

1. Take a longest root-to-leaf path and recurse on the remaining connected components.
2. Store each path as an array ordered by depth (so that nodes in the path may be randomly accessed), and store a pointer to its parent path.
3. For each node, store the path to which it belongs and its index in the array for the path.
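
The three preprocessing steps above can be sketched as follows, again assuming a parent-array tree with -1 for the root. Following the tallest child at each node is one way to extract a longest root-to-leaf path. The names `long_path_decomposition` and `level_ancestor_paths` are hypothetical, and the query shown is the plain path-by-path walk, without any further speedups.

```python
def long_path_decomposition(parent):
    """Returns (paths, path_parent, path_of, index_in_path) for a parent-array tree."""
    n = len(parent)
    children = [[] for _ in range(n)]
    root = -1
    for v, p in enumerate(parent):
        if p == -1:
            root = v
        else:
            children[p].append(v)

    # Heights: reversed preorder visits every node after all of its descendants.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    height = [0] * n
    for v in reversed(order):
        for c in children[v]:
            height[v] = max(height[v], height[c] + 1)

    paths, path_parent = [], []  # path_parent[p]: parent node of the top of path p
    path_of, index_in_path = [0] * n, [0] * n
    tops = [root]                # nodes at which a new path starts
    while tops:
        v = tops.pop()
        path_parent.append(parent[v])
        path = []
        while True:
            path_of[v], index_in_path[v] = len(paths), len(path)
            path.append(v)
            if not children[v]:
                break
            tallest = max(children[v], key=lambda c: height[c])
            tops.extend(c for c in children[v] if c != tallest)
            v = tallest          # extend the path through the tallest child
        paths.append(path)
    return paths, path_parent, path_of, index_in_path


def level_ancestor_paths(decomp, v, l):
    """Return the l-th ancestor of v, or -1 if it does not exist."""
    paths, path_parent, path_of, index_in_path = decomp
    while v != -1:
        p, i = path_of[v], index_in_path[v]
        if l <= i:
            return paths[p][i - l]  # random access within the path array
        l -= i + 1                  # climb past the top of this path
        v = path_parent[p]
    return -1
```

Within a path, the answer is a single array lookup; the cost of a query is therefore governed by how many paths it has to climb through.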
