HyperLogLog — Probabilistic Algorithm
I mentioned Bloom filters in my previous post as a probabilistic data structure for checking whether an element exists in a set. Let’s go over another algorithm, this one used to estimate the count of distinct elements. As you will see by the end of this post, though, it cannot be used for every cardinality use case.
To determine the cardinality of a set (the number of unique elements in it), you need space to store the distinct elements. If there were 8 distinct numbers in a set of 10 numbers, you would need space to store those 8 numbers. Apply such an algorithm to a larger set, like finding the number of distinct Google searches by the end of the day, and you end up requiring a huge amount of space. It is obvious why space is a problem here: if you assume an id takes 128 bits, you would need roughly 45 GB of storage for 3 billion events. The simplest approach, an in-memory hash set, would require a single machine with many gigabytes of memory. And if 3 billion events is the number per day, how about tracking it for weeks? Or months?
HyperLogLog is about using far less memory to determine the number of distinct elements, at the cost of a margin of error that becomes negligible on larger datasets. Whether the use case surrounding your cardinality logic can live with that margin of error is the key question.
Note the contrast: a Bloom filter tells you whether an element might be present, or is definitely not present. HyperLogLog (HLL) tells you the number of distinct elements in a set with a margin of error of about 2% (some implementations of HLL give you less). You cannot ask HLL, “okay, now show me all those distinct elements,” because HLL does not store the original values. It manipulates the data to reduce the memory footprint.
Let’s now dig into the algorithm.
Say you need to find the number of distinct users connected to Facebook. They could be logged in from any number of devices, i.e. connection ids, but you need the count of distinct user ids.
Hash each user id and count the leading zeros in the hash. That will help you estimate the count of distinct elements you have seen.
What is the probability of getting a leading 0? 1/2
0 _ _ _ _
What is the probability of getting 2 leading zeros? 1/4
0 0 _ _ _
i.e. one in (00, 01, 10, 11)
What is the probability of getting 3 leading zeros? 1/8
0 0 0 _ _
i.e. one in (000, 001, 010, 011, 100, 101, 110, 111)
What is the probability of getting k leading zeros? 1 in 2^k
i.e. to see k leading zeros, you are expected to go over about 2^k elements.
So just by looking at the maximum number of leading zeros, I can guess how many elements you have come across: if I have seen k leading zeros, then I have probably already seen around 2^k elements.
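The core idea can be sketched in a few lines of Python. This is a deliberately naive single-register version, not the full algorithm; the 32-bit hash derived from SHA-1 is just an illustrative choice:

```python
import hashlib

def leading_zeros(value: int, bits: int = 32) -> int:
    """Count leading zeros in a fixed-width binary representation."""
    if value == 0:
        return bits
    return bits - value.bit_length()

def crude_estimate(items) -> int:
    """Estimate cardinality from the longest run of leading zeros seen.

    If the longest run is k, the guess is 2**k elements.
    """
    max_zeros = 0
    for item in items:
        # Derive a stable 32-bit hash from SHA-1 (illustrative choice).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")
        max_zeros = max(max_zeros, leading_zeros(h))
    return 2 ** max_zeros
```

A single register like this gives a very noisy estimate, which is exactly the problem the bucketing trick below addresses.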
To keep track of N items, you merely need to store a single number that is about log N in magnitude, and storing that single number requires log log N bits. Hence the name LogLog. For example, to represent 256 numbers (0–255) you need log(256) = 8 bits, and to represent 8 numbers (0–7) you need log(8) = 3 bits.
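The log-of-a-log arithmetic above can be checked directly:

```python
import math

# To represent 256 numbers (0-255) you need log2(256) = 8 bits,
# and to represent that count itself (a value in 0-7) you need
# log2(8) = 3 bits -- hence "LogLog".
bits_to_represent_n = int(math.log2(256))               # 8
bits_to_store_that = int(math.log2(bits_to_represent_n))  # 3
```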
What if I see a hash value with too many leading zeroes way too early?
Before we proceed further, go over this practical example using coin tosses extracted from this post.
Imagine you tell me you spent your day flipping a coin, counting how many times you encountered an uninterrupted run of heads. If you tell me that the maximum run was 3 heads, I can imagine that you did not flip the coin very many times. If instead your longest run was 13, you probably spent a lot of time flipping the coin.
However, if you get lucky and get 10 heads on your first run, an event that is unlikely but possible, and then stop flipping your coin, I’ll give you a very wrong approximation of the time you spent flipping. So I may ask you to repeat the experiment, but this time using 10 coins and 10 different pieces of paper, one per coin, where you record the longest run of heads. This time, since I can observe more data, my estimation will be better.
Split the hash into two parts: the first part denotes which bucket the element belongs to, and in the second part you look for the number of leading zeros. Store the longest run of leading zeros seen against each bucket, and average across the buckets.
If you want 8 buckets, then your first part would be the first 3 bits.
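A minimal sketch of this bucketing step, assuming 8 buckets and the same illustrative 32-bit SHA-1-derived hash (the bucket count and hash choice are mine, not from any particular implementation):

```python
import hashlib

BUCKET_BITS = 3                     # first 3 bits pick one of 8 buckets
HASH_BITS = 32                      # illustrative hash width
REST_BITS = HASH_BITS - BUCKET_BITS

registers = [0] * (2 ** BUCKET_BITS)

def add(item) -> None:
    """Route the hash to a bucket by its first 3 bits, and keep the longest
    run of leading zeros seen in the remaining 29 bits for that bucket."""
    h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")
    bucket = h >> REST_BITS                    # top 3 bits
    rest = h & ((1 << REST_BITS) - 1)          # remaining 29 bits
    zeros = REST_BITS - rest.bit_length() if rest else REST_BITS
    registers[bucket] = max(registers[bucket], zeros)
```

Because each register only ever keeps a maximum, adding the same element twice changes nothing, which is what makes the structure count *distinct* elements.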
SuperLogLog is basically an improvement over what we have seen so far. The math suggests that by throwing out the 30% of buckets with the largest values, and averaging only the 70% of buckets with the smaller values, accuracy can be improved from 1.30/sqrt(m) to 1.05/sqrt(m) (where m is the number of buckets), with no increase in the space required. HyperLogLog improved on this further by using the harmonic mean instead of the geometric mean, edging the error down to about 1.04/sqrt(m).
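The harmonic-mean estimator can be sketched as follows, assuming each register holds the longest run of leading zeros observed for its bucket. The bias-correction constant alpha is the standard large-m approximation; the small- and large-range corrections of the full algorithm are omitted:

```python
def hll_estimate(registers):
    """Raw HyperLogLog estimate: a bias-corrected harmonic mean of
    2**register values across all m buckets."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)   # standard constant for large m
    harmonic = sum(2.0 ** -r for r in registers)
    return alpha * m * m / harmonic
```

The harmonic mean is what tames the outlier problem from the coin-toss story: one freakishly large register barely moves the estimate, whereas a plain average would be dragged far off.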
If you use HLL for data analytics over really huge numbers, this error becomes negligible. If you use it for accounting, even a single case of double counting is disastrous. Where you apply it is very important. It is also immensely useful in distributed big-data systems, because cardinality counters from multiple machines can easily be merged without moving the actual data around.
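The merge property falls out of the structure directly: combining two HLLs built over the same bucket layout is just a per-bucket maximum, so machines only need to exchange their small register arrays. A sketch:

```python
def merge(regs_a, regs_b):
    """Merge two HLL register arrays with the same bucket layout.

    Taking the per-bucket max yields exactly the registers you would
    have gotten by feeding both streams into a single HLL.
    """
    assert len(regs_a) == len(regs_b), "bucket layouts must match"
    return [max(a, b) for a, b in zip(regs_a, regs_b)]
```

Merging is commutative and idempotent, which is why the counters can flow through a distributed system in any order without double counting distinct elements.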
Remember I said that it reduces the memory footprint? Let’s take Redis’s implementation of HLL.
It uses a 64-bit hash: the first 14 bits address the bucket (2¹⁴ = 16,384 buckets) and the remaining 50 bits are where the run of leading zeros is counted. The longest run of zeros you can find in a bucket is therefore 50, which requires 6 bits to represent in binary (50 in binary is 110010).
For each bucket you need 6 bits: 16,384 × 6 = 98,304 bits = 12 KB for one HLL structure. With the number of buckets m = 16,384, the standard error in this implementation is 1.04/√16384 ≈ 0.81%.
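The sizing and error arithmetic can be checked directly:

```python
import math

BUCKETS = 2 ** 14           # first 14 bits of the 64-bit hash pick a bucket
BITS_PER_REGISTER = 6       # enough to store a run length up to 50 (110010)

total_bits = BUCKETS * BITS_PER_REGISTER   # 98,304 bits
total_kb = total_bits / 8 / 1024           # 12 KB per HLL structure
std_error = 1.04 / math.sqrt(BUCKETS)      # about 0.81%
```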
Simulation: http://content.research.neustar.biz/blog/hll.html