Programming Glenn Manacher
Techniques Editor
Efficient
String Matching:
An Aid to
Bibliographic Search
Alfred V. Aho and Margaret J. Corasick
Bell Laboratories
This paper describes a simple, efficient algorithm to
locate all occurrences of any of a finite number of key-
words in a string of text. The algorithm consists of con-
structing a finite state pattern matching machine from the
keywords and then using the pattern matching machine
to process the text string in a single pass. Construction
of the pattern matching machine takes time proportional
to the sum of the lengths of the keywords. The number
of state transitions made by the pattern matching
machine in processing the text string is independent of
the number of keywords. The algorithm has been used to
improve the speed of a library bibliographic search pro-
gram by a factor of 5 to 10.
Keywords and Phrases: keywords and phrases, string
pattern matching, bibliographic search, information re-
trieval, text-editing, finite state machines, computational
complexity.
CR Categories: 3.74, 3.71, 5.22, 5.25
Copyright © 1975, Association for Computing Machinery, Inc.
General permission to republish, but not for profit, all or part of
this material is granted, provided that ACM's copyright notice is
given and that reference is made to this publication, to its date of
issue, and to the fact that reprinting privileges were granted by
permission of the Association for Computing Machinery.
Authors' present addresses: A. V. Aho, Bell Laboratories,
Murray Hill, N.J. 07974. M. J. Corasick, The MITRE Corporation,
Bedford, Mass. 01730.
333
1. Introduction
In many information retrieval and text-editing appli-
cations it is necessary to be able to locate quickly some or
all occurrences of user-specified patterns of words and
phrases in text. This paper describes a simple, efficient
algorithm to locate all occurrences of any of a finite
number of keywords and phrases in an arbitrary text
string.
The approach should be familiar to those acquainted
with finite automata. The algorithm consists of two parts.
In the first part we construct from the set of keywords a
finite state pattern matching machine; in the second part
we apply the text string as input to the pattern matching
machine. The machine signals whenever it has found a
match for a keyword.
Using finite state machines in pattern matching appli-
cations is not new [4, 8, 17], but their use seems to be
frequently shunned by programmers. Part of the reason
for this reluctance on the part of programmers may be
due to the complexity of programming the conventional
algorithms for constructing finite automata from regular
expressions [3, 10, 15], particularly if state minimization
techniques are needed [2, 14]. This paper shows that an
efficient finite state pattern matching machine can be
constructed quickly and simply from a restricted class of
regular expressions, namely those consisting of finite sets
of keywords. Our approach combines the ideas in the
Knuth-Morris-Pratt algorithm [13] with those of finite
state machines.
Perhaps the most interesting aspect of this paper is
the amount of improvement the finite state algorithm
gives over more conventional approaches. We used the
finite state pattern matching algorithm in a library biblio-
graphic search program. The purpose of the program is
to allow a bibliographer to find in a citation index all titles
satisfying some Boolean function of keywords and
phrases. The search program was first implemented with
a straightforward string matching algorithm. Replacing
this algorithm with the finite state approach resulted in a
program whose running time was a fifth to a tenth of the
original program on typical inputs.
2. A Pattern Matching Machine
This section describes a finite state string pattern
matching machine that locates keywords in a text string.
The next section describes the algorithms to construct
such a machine from a given finite set of keywords.
In this paper a
string
is simply a finite sequence of
symbols. Let K = {Yl,Y2 .....
Yk}
be a finite set of
strings which we shall call
keywords
and let x be an arbi-
trary string which we shall call the
text string.
Our prob-
lem is to locate and identify all substrings of x which are
keywords in K. Substrings may overlap with one another.
A pattern matching machine for K is a program which
takes as input the text string x and produces as output
the locations in x at which keywords of K appear as sub-
strings. The pattern matching machine consists of a set
of
states.
Each state is represented by a number. The
machine processes the text string x by successively read-
ing the symbols in x, making state transitions and occa-
Communications June 1975
of Volume 18
the ACM Number 6
评论2