At one end of this spectrum we have constant-time queries (for prefixes that fit in O(1) words), and
still asymptotically vanishing space usage for the index. At the other end, space is optimal and the
query time grows logarithmically with the length of the prefix. Precise statements can be found in the
technical overview below.
Technical overview. For simplicity we consider strings over a binary alphabet, but our methods
generalise to larger alphabets. Our main result is that weak prefix search needs just O(|p|/w + log |p|)
time and O(n log ℓ) space, where ℓ is the average length of the strings, p is the query string, and w is
the machine word size. In the cache-oblivious model [12], we use O(|p|/B + log |p|) I/Os. For strings
of fixed length w, this reduces to query time O(log w) and space O(n log w), and we show that the
latter is optimal regardless of query time. Throughout the paper we strive to state all space results in
terms of `, and time results in terms of the length of the actual query string p, because in a realistic
setting (e.g., term dictionaries of a search engine) string lengths might vary wildly, and queries might
be issued that are significantly shorter than the average (let alone maximum) string length. In fact,
the size of the data structure depends on the hollow trie size of the set S, a data-aware measure
related to the trie size [13] that is much more precise than the bound O(n log ℓ).
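To make the query semantics concrete, the following sketch answers a weak prefix search on a plain sorted array. This is a naive baseline for illustration only, not the paper's data structure: it stores the whole set S and takes O(|p| log n) time, whereas the structure above answers in O(|p|/w + log |p|) time without storing S. The function name and the empty-range convention for absent prefixes are our own choices; a weak prefix search is free to return an arbitrary answer when no string has p as a prefix.

```python
import bisect

def weak_prefix_search(sorted_strings, p):
    """Return the half-open rank range (lo, hi) of strings with prefix p.

    Naive baseline: binary search over the sorted set itself.
    When no string has prefix p, the range is simply empty here,
    although a weak prefix search is allowed any answer in that case.
    """
    lo = bisect.bisect_left(sorted_strings, p)
    if not p:
        return 0, len(sorted_strings)
    # Successor of every string with prefix p: p with its last
    # character "incremented" (fine for short ASCII demos).
    succ = p[:-1] + chr(ord(p[-1]) + 1)
    hi = bisect.bisect_left(sorted_strings, succ)
    return lo, hi

S = sorted(["aardvark", "abacus", "abbey", "abbot", "badge"])
print(weak_prefix_search(S, "ab"))  # (1, 4): abacus, abbey, abbot
```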
Building on ideas from [1], we then give an O(|p|/w + 1) solution (i.e., constant time for prefixes
of length O(w)) that uses space O(nℓ^{1/c} log ℓ) (for any c > 0). This structure shows that weak
prefix search is possible in constant time using sublinear space; queries require O(|p|/B + 1) I/Os
in the cache-oblivious model.
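The trade-off behind a constant-time solution can be illustrated with a toy construction that is emphatically not the paper's structure: hash every prefix occurring in S to its rank range. A single hash probe then answers any query, but the table holds Θ(total length of S) entries, far above the O(nℓ^{1/c} log ℓ) bits achieved above. The helper name below is ours.

```python
def build_prefix_index(sorted_strings):
    """Map every prefix occurring in S to its half-open rank range.

    Toy illustration: one hash lookup answers a weak prefix search,
    but space is proportional to the total length of all strings,
    which is what the paper's construction avoids.
    """
    index = {}
    for rank, s in enumerate(sorted_strings):
        for i in range(1, len(s) + 1):
            pre = s[:i]
            if pre in index:
                lo, _ = index[pre]
                index[pre] = (lo, rank + 1)  # extend range to this rank
            else:
                index[pre] = (rank, rank + 1)  # first string with this prefix
    return index

S = sorted(["aardvark", "abacus", "abbey", "abbot", "badge"])
idx = build_prefix_index(S)
print(idx["ab"])  # (1, 4): ranks of abacus, abbey, abbot
```

Note that a query for a prefix not occurring in S simply misses the table; as with any weak prefix search, no guarantee is required in that case.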
Comparison to related results. If we study the same problem in the I/O model or in the cache-oblivious
model, the nearest competitors are the String B-tree [10] and its cache-oblivious version [4],
although both require access to the set S. The static String B-tree can be modified to use
space O(n log n + n log ℓ); it has very good search performance with O(|p|/B + log_B n) I/Os per
query (supporting all query types discussed in this paper), and its cache-oblivious version guarantees
the same bounds with high probability. However, a search for p inside the String B-tree may involve
Ω(|p| + log n) RAM operations (Ω(|p|/w + log n) for the cache-oblivious version), so it may be
too expensive for intensive computations. Our first method, which achieves the optimal space usage
of O(n log ℓ) bits, uses O(|p|/w + log |p|) RAM operations and O(|p|/B + log |p|) I/Os instead.
The number of RAM operations is a strict improvement over String B-trees, while the I/O bound is
better for large enough sets. Our second method uses slightly more space (O(nℓ^{1/c} log ℓ) bits
for any c > 0) but features O(|p|/w + 1) RAM operations and O(|p|/B + 1) I/Os.
In [11], the authors discuss very succinct static data structures for the same purposes (on a generic
alphabet), decreasing the space to a lower bound that is, in the binary case, the trie size. The search
time is logarithmic in the number of strings. As in the previous case, we improve on RAM operations
and on I/Os for large enough sets.
The first cache-oblivious dictionary supporting prefix search was devised by Brodal et al. [5]
achieving O(|p|) RAM operations and O(|p|/B + log_B n) I/Os. We note that the result in [5] is
optimal in a comparison-based model, where we have a lower bound of Ω(log_B n) I/Os per query.
By contrast, our result, like those in [4, 11], assumes an integer alphabet where there is no such lower
bound.
Implicit in the paper of Alstrup et al. [1] on range queries is a linear-space structure for constant-
time weak prefix search on fixed-length bit strings. Our constant-time data structure, instead, uses
sublinear space and allows for variable-length strings.