This week I picked up the paper titled **Cuckoo hashing**. The paper presents a dictionary with worst case constant *lookup*, *deletion* and *updation* time and amortized constant *insertion* time (with high probability).

# Summary

The structure of the dictionary is quite simple. Let us say that we have to maintain a set **S**, of size **n**, and keys are taken from a universe **U**. The dictionary uses two hash tables, **T** and **T’**, each consisting of **r** slots (where **r** is slightly bigger than **n**). We have two hash functions **h**, **h’**: U → [r]. Every key **x**∈ **S** will either be stored in cell **h**(x) of **T** or in cell **h’**(x) of **T’**. This means *lookup* takes **O**(1) time as only 2 slots needs to be checked. *Deletion* and *updation* also take **O**(1) time as they first *lookup* the key and then delete or update the corresponding value.

*Insertion* is more interesting and involved than other operations because it uses, what the paper calls, “the cuckoo approach” where a key kicks other keys to it finds a nest for itself. To insert **x**, we check cell **h**(x) of **T**. If it is free then we are done else we update the value of this cell which would make the previous key “nestless”. Then we insert this newly displaced key in **T’** in the same manner iteratively till all keys find a nest. The number of iterations is bound by a constant, **MaxLoop**, to avoid closed loops. After **MaxLoop**number of iterations, a *rehash* is performed. A *rehash* will also be performed once **r^2** insertions have been performed since the last *rehash*. This is all one needs to know to implement their own cuckoo-hash. Simple and straightforward.

# Digging Deeper

The approach may, at first, sound like shifting the overhead of *lookup* to *insertion,* but there is more to it. We first need to understand what is a universal family.

A family of functions, where each function is defined from **U** → **R**, is (**c**, **k**)-universal if, for any **k** distinct **x**’s ∈ U , and any **k** **y**’s ∈ **R**, probability that any randomly chosen function from this family maps these **x**’s to **y**’s ≤ **c/|R|^k** . In effect, it puts a limit on the probability of collisions when randomly picking hashing functions from this family. The paper cites a paper by Alan Siegel to explain how to construct a hash family such that probability of collisions with a set of **r^2** keys is **O**(**1**/**n^2**).

The paper also uses a clever trick to determine when a *rehash* would be needed before reaching **MaxLoop** iterations. If the *insertion* operation returns to a previously visited cell, it checks if a close loop is going to be formed. Suppose we are inserting key **x1** and let **x1**, **x2**, …, **xp** be the sequence of nestless keys. One of these keys, say **xi**, becomes nestless for the second time. So it will be put back to its original position. This will cause all the the previous keys in the sequence to be moved back to their original position and **x1** would be nestless again. So it will be put to the other table this time. This would cause some keys to becomes nestless in the second table. If one of those keys say **xk** moves to a previously visited position, we are guaranteed to have a closed loop. In that case, we do not wait for **MaxLoop** iterations to be reached and perform *rehash*.

Now we can work out the probability of *insertion* loop running for at least **t**iterations. We already know that **t** can not exceed **MaxLoop**. Adding up the probabilities for all possible values of **t** and accounting for the cost of *rehash* and taking **n** to be very large, we get the amortized *insertion* time to be **O**(1). For a detailed understanding of this part, refer the original paper.

# My thoughts

The paper has very strong mathematical and experimental backing. It is full of references related to both theoretical work and experimental evaluation. The analysis of *insertion* is clean and meticulous and all the maths work out beautifully. I particularly liked the elaborate and well-documented experiments. Authors have experimented with a variety of hashing algorithms along with variants of cuckoo hashing. They have highlighted the instances where their implementation differs from the reference and have provided justifications for the same. They have also studied the role of cache in all the hashing techniques in their experiments. They also tested with the dictionary tests of DIMACS implementation challenge. In all experiments, Linear Probing performed better than all other approaches with Cuckoo-Hashing lagging just 20–30% behind. Even then paper presents a strong case for Cuckoo-Hashing.

A few things can still be explored more like how to go about choosing a family of good hash functions as per the constraints. Secondly cache miss rate tends to increase the fastest for cuckoo hashing as load factor increases. Also, it is not possible to have a load factor of more than 50% for cuckoo hashing. So in some cases, the dictionary will have to be resized. The experiments have focused on situations where the size does not vary greatly. Also *insertion* is expected to be **O**(1) only for very large values of **n** though there is no estimate of how big **n** needs to be. This could be persuaded further.