2007 Conference on Analysis of Algorithms, AofA 07    DMTCS proc. AH, 2007, 127–146

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

Philippe Flajolet¹, Éric Fusy¹, Olivier Gandouet², and Frédéric Meunier¹

¹ Algorithms Project, INRIA–Rocquencourt, F-78153 Le Chesnay (France)
² LIRMM, 161 rue Ada, 34392 Montpellier (France)
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
Introduction
The purpose of this note is to present and analyse an efficient algorithm for estimating the number of
distinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisets
and are usually massive streams (read-once sequences). This problem has received a great deal of attention
over the past two decades, finding an ever growing number of applications in networking and traffic
monitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service),
and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequence
of packets, each packet having a header, which contains a pair (source–destination) of addresses, followed
by a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in various
time slices is an important indication for detecting attacks and monitoring traffic, as it records the number
of distinct active flows. Indeed, worms and viruses typically propagate by opening a large number of
different connections, and though they may well pass unnoticed amongst a huge traffic, their activity
becomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]).
Other applications of cardinality estimators include data mining of massive data sets of sorts—natural
language texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, where
the authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinality estimators.
Clearly, the cardinality of a multiset can be exactly determined with a storage complexity essentially
proportional to its number of elements. However, in most applications, the multiset to be treated is far too
large to be kept in core memory. A crucial idea is then to relax the constraint of computing the value n of
the cardinality exactly, and to develop probabilistic algorithms dedicated to estimating n approximately.
(In many practical applications, a tolerance of a few percent on the result is acceptable.) A whole range of algorithms has been developed that require only sublinear memory [2, 6, 10, 11, 15, 16] or, at worst, linear memory with a small implied constant [24].
All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions. The elements to be counted belong to a certain data domain D, and we assume given a hash function h : D → {0, 1}^∞; that is, we assimilate hashed values to infinite binary strings of {0, 1}^∞, or equivalently to real numbers of the unit interval. (In practice, hashing on 32 bits will suffice to estimate cardinalities in excess of 10^9; see Section 4 for a discussion.) We postulate that the hash function has been designed in such a way that the hashed values closely resemble a uniform model of randomness, namely, bits of hashed values are assumed to be independent and to each have probability 1/2 of occurring—practical methods are known [20], which vindicate this assumption, based on cyclic redundancy codes (CRC), modular arithmetic, or a simplified cryptographic use of boolean algebra (e.g., sha1).
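For concreteness, the following minimal sketch realizes such a hash function on 32 bits. It is illustrative and not part of the paper; the choice of Python and of SHA-1 truncated to its first 4 bytes is an arbitrary one that only approximates the uniformity assumption.

```python
import hashlib

def h32(item) -> int:
    """Hash an arbitrary item to a 32-bit integer whose bits are treated as
    (approximately) independent fair coin flips. Illustrative choice: the
    first 4 bytes of SHA-1 applied to a textual encoding of the item."""
    digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def h_unit(item) -> float:
    """The same hashed value viewed as a real number of the unit interval."""
    return h32(item) / 2**32

# Replicated elements hash to the same value, so duplicates cannot be
# distinguished from a single occurrence once hashed.
print(h32("src=192.168.0.1,dst=10.0.0.7"), h_unit("src=192.168.0.1,dst=10.0.0.7"))
```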
The best known cardinality estimators rely on making suitable, concise enough, observations on the hashed values h(M) of the input multiset M, then inferring a plausible estimate of the unknown cardinality n. Define an observable of a multiset S ≡ h(M) of {0, 1}^∞ strings (or, equivalently, of real [0, 1] numbers) to be a function that only depends on the set underlying S, that is, a quantity independent of replications. Then two broad categories of cardinality observables have been studied.
— Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit-pattern 0^(ρ−1)1 is more or less a likely indication that the cardinality n of S is at least 2^ρ. The algorithms known as Probabilistic Counting, due to Flajolet-Martin [15], together with the more recent LOGLOG of Durand-Flajolet [10] belong to this category.
— Order statistics observables: these are based on order statistics, like the smallest (real) values, that
appear in S. For instance, if X = min(S), we may legitimately hope that n is roughly of the order
of 1/X, since, as regards expectations, one has E(X) = 1/(n + 1). The algorithms of Bar-Yossef
et al. [2] and Giroire’s MINCOUNT [16, 18] are of this type.
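Both kinds of observables are easy to maintain in a single pass. The sketch below (illustrative, not taken from the paper; it reuses the hypothetical SHA-1-based hash of the earlier snippet) keeps one observable of each kind over a stream; since each is a max or a min over hashed values, replicated elements leave it unchanged, which is exactly the replication-independence required of an observable.

```python
import hashlib

def h32(item) -> int:
    """Illustrative 32-bit hash (first 4 bytes of SHA-1), as in the previous sketch."""
    return int.from_bytes(hashlib.sha1(repr(item).encode("utf-8")).digest()[:4], "big")

def rho(x: int, width: int = 32) -> int:
    """Position of the leftmost 1-bit of a width-bit string: rho(1...) = 1, rho(001...) = 3."""
    return width - x.bit_length() + 1 if x else width + 1

def observe(stream):
    """Maintain a bit-pattern observable (largest rho seen) and an
    order-statistics observable (smallest hashed value seen)."""
    max_rho, min_val = 0, 1.0
    for item in stream:
        v = h32(item)
        max_rho = max(max_rho, rho(v))
        min_val = min(min_val, v / 2**32)
    # Single-register indications of n: roughly 2^max_rho and roughly 1/min_val.
    # Both have high variance, which is why averaging over m registers is needed.
    return 2 ** max_rho, (1.0 / min_val if min_val > 0 else float("inf"))

# Duplicates change neither observable:
print(observe(["a", "b", "a", "c", "b"]))
```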
The observables just described can be maintained with just one or a few registers. However, as such, they only provide a rough indication of the sought cardinality n, via log_2 n or 1/n. One difficulty is due
to a rather high variability, so that one observation, corresponding to the maintenance of a single variable,
cannot suffice to obtain accurate predictions. An immediate idea is then to perform several experiments
in parallel: if each of a collection of m random variables has standard deviation σ, then their arithmetic
mean has standard deviation σ/√m, which can be made as small as we please by increasing m. That
simplistic strategy has however two major drawbacks: it is costly in terms of computation time (we would
need to compute m hashed values per element scanned), and, worse, it would necessitate a large set of
independent hashing functions, for which no construction is known [1].
Algorithm                        Cost (units)           Accuracy
Hit Counting [24]                N/10 bits              ≈ 2%
Adaptive Sampling [12]           m words (≈ 32 bits)    1.20/√m
Probabilistic Counting [15]      m words (≤ 32 bits)    0.78/√m
MINCOUNT [2, 6, 16, 18]          m words (≤ 32 bits)    1.00/√m
LOGLOG [10]                      m bytes (≤ 5 bits)     1.30/√m
HYPERLOGLOG                      m bytes (≤ 5 bits)     1.04/√m

Fig. 1: A comparison of major cardinality estimators for cardinalities ≤ N: (i) algorithms; (ii) memory complexity and units (for N ≤ 10^9); (iii) relative accuracy.

The solution introduced in [15] under the name of stochastic averaging consists in emulating the effect of m experiments with a single hash function. Roughly speaking, we divide the input stream h(M) into m substreams, corresponding to a partition of the unit interval of hashed values into [0, 1/m[, [1/m, 2/m[, ..., [(m−1)/m, 1]. Then, one maintains the m observables O_1, ..., O_m corresponding to each of the m substreams. A suitable average of the {O_j} is then expected to produce an estimate of cardinalities whose quality should improve, due to averaging effects, in proportion to 1/√m, as m increases. The benefit of this approach is that it requires only a constant number of elementary operations per element of the multiset M (as opposed to a quantity proportional to m), while only one hash function is now needed.
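A minimal sketch of this bucketing step follows. It assumes m = 2^b registers and uses the first b bits of a 32-bit hashed value as the substream index, which realizes the partition of the unit interval described above; the helper names and the SHA-1-based hash are illustrative choices, not the paper's Figure 2.

```python
import hashlib

def h32(item) -> int:
    """Illustrative 32-bit hash (first 4 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(repr(item).encode("utf-8")).digest()[:4], "big")

def rho(x: int, width: int) -> int:
    """Position of the leftmost 1-bit of a width-bit string; width + 1 if x == 0."""
    return width - x.bit_length() + 1 if x else width + 1

def make_registers(b: int):
    """m = 2^b registers, one observable per substream, initialized to 0."""
    return [0] * (1 << b)

def add(registers, item, b: int):
    """One update: the first b bits of the hash select the substream (register);
    the remaining 32 - b bits feed the LOGLOG/HYPERLOGLOG observable max(rho)."""
    x = h32(item)
    j = x >> (32 - b)                # substream index in {0, ..., m - 1}
    w = x & ((1 << (32 - b)) - 1)    # remaining 32 - b bits
    registers[j] = max(registers[j], rho(w, 32 - b))

# Usage: one hash evaluation and a constant amount of work per element.
b = 4                                # m = 16 substreams (illustrative)
M = make_registers(b)
for item in ["a", "b", "a", "c", "d", "b"]:
    add(M, item, b)
print(M)
```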
The performances of several algorithms are compared in Figure 1; see also [13] for a review. HYPERLOGLOG, described in detail in the next section, is based on the same observable as LOGLOG, namely the largest ρ value obtained, where ρ(x) is the position of the leftmost 1-bit in binary string x. Stochastic averaging in the sense above is employed. However, our algorithm differs from standard LOGLOG by its evaluation function: it is based on harmonic means, while the standard algorithm uses what amounts to a geometric mean¹. The idea of using harmonic means originally drew its inspiration from an insightful note of Chassaing and Gérin [6]: such means have the effect of taming probability distributions with slow-decaying right tails, and here they operate as a variance reduction device, thereby appreciably increasing the quality of estimates. Theorem 1 below summarizes our main conclusions to the effect that the relative accuracy (technically, the standard error) of HYPERLOGLOG is numerically close to β_∞/√m, where β_∞ = √(3 log 2 − 1) ≈ 1.03896. The algorithm needs to maintain a collection of registers, each of which is at most log_2 log_2 N + O(1) bits, when cardinalities ≤ N need to be estimated. In particular, HYPERLOGLOG achieves an accuracy matching that of standard LOGLOG by consuming only 64% of the corresponding memory. As a consequence, using m = 2048, hashing on 32 bits, and short bytes of 5 bits each, cardinalities up to values beyond N = 10^9 can be estimated with a typical accuracy of 2% using 1.5 kB (kilobytes) of storage.
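To see the effect of the two evaluation functions, the sketch below contrasts a LOGLOG-style combination of the registers (an arithmetic mean of the M[j] in the exponent, i.e., essentially a geometric mean of the per-register indications 2^M[j]) with a HYPERLOGLOG-style harmonic combination. The bias-correction constant α_m used below is the large-m approximation from the paper's full specification, which is not reproduced in this excerpt; both formulas should be read as indicative sketches rather than as the exact algorithm.

```python
def loglog_style(M):
    """Geometric-mean-style combination used by standard LOGLOG,
    up to a multiplicative bias-correction constant."""
    m = len(M)
    return m * 2.0 ** (sum(M) / m)

def hyperloglog_style(M):
    """Harmonic-mean-style combination used by HYPERLOGLOG:
    alpha_m * m^2 / sum_j 2^(-M[j]), with alpha_m the approximation
    quoted for large m in the paper's full specification (assumed here)."""
    m = len(M)
    alpha_m = 0.7213 / (1 + 1.079 / m)
    return alpha_m * m * m / sum(2.0 ** (-v) for v in M)

# One register hit by an unusually long run of leading zeros shifts the
# geometric-mean-style estimate noticeably, while the harmonic-mean-style
# estimate barely moves: this is the taming of slow-decaying right tails.
base = [5] * 256
skew = [25] + [5] * 255
for name, f in [("geometric-style", loglog_style), ("harmonic-style", hyperloglog_style)]:
    print(name, round(f(base)), "->", round(f(skew)))
```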
The proofs base themselves in part on techniques that are now standard in analysis of algorithms,
like poissonization, Mellin transforms, and saddle-point depoissonization. Some nonstandard problems
however present themselves due to the special nonlinear character of harmonic means, so that several
ingredients of the analysis are not completely routine.
1 The HYPERLOGLOG algorithm
The HYPERLOGLOG algorithm is fully specified in Figure 2, the corresponding program being discussed later, in Section 4. The input is a multiset M of data items, that is, a stream whose elements are read sequentially. The output is an estimate of the cardinality, defined as the number of distinct elements in M.
¹ The paper [10] also introduces a variant called SUPERLOGLOG, which attempts to achieve variance reduction by censoring extreme data. It has however the disadvantage of not being readily amenable to analysis, as regards bias and standard error.