Estimating unique visitors in near realtime: HyperLogLog
Context #
The Count-distinct problem is the problem of finding the number of distinct elements (i.e. the cardinality) of a data stream that contains repeated elements: a common real-world application is counting the number of unique visitors to a website in near real-time.
For small datasets we can keep track of all user IDs in an in-memory data structure (a hash map, for example), but this approach quickly becomes unviable as datasets grow to hundreds of thousands, millions, or even billions of elements.
HyperLogLog is an algorithm published in 2007 that allows us to estimate the cardinality of very large datasets (\(>10^9\)) with an average error lower than 3%, while using very little memory (less than 2 KB for cardinalities up to billions of records).
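To put that figure in perspective, here is the arithmetic for one illustrative configuration (the bucket count and register width here are an assumption, not something mandated by the algorithm): with \(m = 2048\) buckets of 6 bits each, the whole data structure fits in

$$ 2048 \times 6 \text{ bits} = 12288 \text{ bits} = 1536 \text{ bytes} \approx 1.5 \text{ KB} $$

and that footprint stays fixed no matter how many elements are added.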
Without further ado, let’s jump straight to the formula and then walk our way back:
$$ \text{Cardinality} = \alpha_m \cdot m \cdot \frac{m}{\sum_{j=1}^{m} 2^{-B_j}} $$
where \(\alpha_m\) is a correction factor introduced by the authors to correct an inherent bias of the formula towards larger estimates (details on how to calculate it are in the “Practical considerations” section of the HyperLogLog Wikipedia page).
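For reference, the values given there are \(\alpha_{16} = 0.673\), \(\alpha_{32} = 0.697\), \(\alpha_{64} = 0.709\), and, for \(m \ge 128\):

$$ \alpha_m = \frac{0.7213}{1 + 1.079/m} $$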
How does it work #
The algorithm is based on the following observation: instead of keeping track of every distinct value, we can keep track of the longest run of leading zeroes we have seen across the values so far. This works because, in a random dataset, a sequence of \(i\) leading zeroes occurs on average once every \(10^i\) values: so if the longest run we have observed is \(i\) zeroes, we have probably seen on the order of \(10^i\) distinct values.
The algorithm then addresses the following concerns:
- how to uniformly distribute values, to avoid any bias induced by the data
- how to improve accuracy
- how to mitigate the influence of outliers
Uniformly distributing values #
In order to remove any accidental bias caused by the values themselves (e.g. a common prefix that would cause similar values to accumulate close to each other), all values are passed through a hash function that must provide uniformly-distributed outputs. Since the output is a binary bit array, a run of \(i\) leading zeroes now occurs once every \(2^i\) values instead of once every \(10^i\).
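As a rough illustration (not the paper's reference implementation), here is how a value could be hashed and its leading zeroes counted in Python; truncating SHA-1 to 64 bits is just one arbitrary choice of uniform hash:

```python
import hashlib

HASH_BITS = 64  # we only look at the first 64 bits of the digest

def hash64(value: str) -> int:
    """Map a value to a uniformly-distributed 64-bit integer (via SHA-1 here)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def leading_zeroes(x: int, bits: int = HASH_BITS) -> int:
    """Length of the run of leading zeroes in the `bits`-wide binary form of x."""
    return bits - x.bit_length()

h = hash64("user-42")
print(f"{h:064b}", "->", leading_zeroes(h), "leading zeroes")
```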
Improving accuracy #
Instead of storing a single estimate, we could store several of them and average their results: doing so, however, would force us to pass each value through a number of independent hash functions, which is computationally expensive.
As an alternative, the authors proposed sticking to a single hash function and using part of its output to assign each value to one of several buckets: for example, we could use the leftmost 4 bits to identify the bucket and then take the longest run of leading zeroes from the rest of the output. By having \(m\) buckets, we can simulate having \(m\) different hash functions without any added cost.
The \(m\) bucket values are denoted as \(\{B_1, B_2, \dots, B_m\}\) (hence the \(B_j\) term in the formula).
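Continuing the sketch above, with 4 bucket bits (\(m = 16\)): one detail the prose glosses over is that the registers in the original paper store the position of the first 1-bit, i.e. the run of leading zeroes plus one, and that is the \(B_j\) the formula expects, so the sketch follows that convention.

```python
B = 4                  # bits used for the bucket index
M = 1 << B             # m = 2^4 = 16 buckets
buckets = [0] * M      # buckets[j] plays the role of B_j in the formula

def add(value: str) -> None:
    h = hash64(value)                          # hash64 from the previous snippet
    j = h >> (HASH_BITS - B)                   # leftmost 4 bits pick the bucket
    rest = h & ((1 << (HASH_BITS - B)) - 1)    # the remaining 60 bits
    # Keep the largest rank seen: leading zeroes + 1, i.e. the position of the
    # first 1-bit, which is what the published formula works with.
    buckets[j] = max(buckets[j], leading_zeroes(rest, HASH_BITS - B) + 1)
```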
On top of this, the authors found that accuracy can be further improved by discarding the 30% of buckets with the largest values and averaging only the remaining ones.
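A toy illustration of that trimming rule (the exact cut-off and how the surviving buckets are then combined vary between the papers, so treat this as a sketch only):

```python
def trimmed(values, keep=0.70):
    """Drop the 30% of bucket values that are largest; keep the rest for averaging."""
    return sorted(values)[: max(1, int(len(values) * keep))]
```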
Mitigating the influence of outliers #
The algorithm achieves this using the harmonic mean:
$$ H(x_1, x_2, …, x_n) = \frac{n}{\sum_{i=1}^{n}\frac{1}{x_i}} $$
Since the formula takes the reciprocal of each value, numbers that deviate a lot from the rest will not significantly affect the overall result: for example, for the values 2 and 100, \(\frac{1}{2} + \frac{1}{100} = 0.51\), so the harmonic mean is \(\frac{2}{0.51} \approx 3.9\), much closer to 2 than the arithmetic mean of 51.
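Putting the earlier snippets together, here is a minimal end-to-end sketch of the estimate; it deliberately omits the small- and large-range corrections that full implementations apply, so it is an illustration rather than a reference implementation:

```python
def alpha(m: int) -> float:
    # Correction factors from the "Practical considerations" section cited earlier.
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)

def estimate() -> float:
    """alpha_m * m * m divided by the harmonic sum of 2^(-B_j) over the m buckets."""
    return alpha(M) * M * M / sum(2.0 ** -b for b in buckets)

# Quick check: 10,000 events, but only 5,000 distinct users.
for i in range(10_000):
    add(f"user-{i % 5_000}")
print(round(estimate()))   # roughly 5,000, though coarse with only 16 buckets
```

The estimate gets sharper as the number of buckets grows (the relative error shrinks roughly as \(1.04/\sqrt{m}\)), which is why real implementations use thousands of buckets rather than the 16 used here for readability.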