Overview

In this post, I will talk about hash tables, their usage in databases, the most common implementations, and a number of optimization techniques.

Motivation

Hash tables are used everywhere. In databases, hash tables are used to perform GROUP BY, JOINs, SELECT DISTINCT, the IN operator, and COUNT(DISTINCT) aggregation. The performance of these operations heavily depends on the underlying hash table implementation.

A hash table is a complex data structure that consists of a lot of different parts. There are many trade-offs between hash table operation speed and memory usage. Additionally, a hash table can be fast in one scenario but extremely slow in another. You need to decide which scenarios are important to you and design your hash table implementation for these specific scenarios.

In this post, I will concentrate on the data aggregation scenario, but the optimization techniques can also be applied in many other cases.

Hash Table Basics

A hash table is a data structure that provides average constant-time complexity O(1) for the following operations:

  • Lookup.
  • Insert.
  • Erase (if this operation is required for your scenario; for the data aggregation scenario, for example, erase is not required).

Lookup and insert operations are the most important. Usually, these operations have a similar execution path. Let’s take a look at the picture:

Hash Table Overview

Let’s discuss the lookup/insert execution path (omitting collision resolution complexity); a minimal code sketch follows the list:

  1. Hash the key using the hash function.
  2. Select a bucket in the hash table array using the remainder of the division of the hash by the array size (it is possible to avoid the remainder of the division for bucket selection, see A fast alternative to the modulo reduction).
  3. Check if the key is inside the bucket for the lookup operation or insert the key into the bucket for the insert operation.
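Here is a minimal sketch of these three steps in C++ (hypothetical names, a toy single-slot-per-bucket layout, and no collision handling), just to make the execution path concrete:

#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Toy sketch of the lookup path: hash the key, pick a bucket, check the key.
// Real hash tables additionally handle empty buckets and collisions.
template <typename Key, typename Value>
std::optional<Value> toyLookup(const std::vector<std::pair<Key, Value>> & buckets, const Key & key)
{
    size_t hash = std::hash<Key>{}(key);       /// 1. hash the key
    size_t bucket = hash % buckets.size();     /// 2. select a bucket using the remainder of the division
    if (buckets[bucket].first == key)          /// 3. check the key inside the bucket
        return buckets[bucket].second;
    return std::nullopt;
}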

Performance

Now, we need to understand where a hash table spends most of its time during its operations (lookup, insert, erase).

There are two different scenarios:

  • When the hash table fits in CPU caches. The performance of hash table operations depends on arithmetic operations: hash function calculation, computing the bucket location, element comparisons, and other operations required by the specific hash table memory layout and collision resolution scheme.
  • When the hash table does not fit in CPU caches. In this scenario, the number of random memory accesses per operation is the most important factor for hash table performance.

If we check the well-known table of latency numbers, a main memory reference is about 20 times slower than an L2 cache access and about 200 times slower than an L1 cache access. In practice, memory access latency is around 100 ns, or even more if the memory is on another NUMA node.

Here are some general recommendations that can be applied to all hash table implementations:

  • You should avoid complex logic on the hot path of your hash table lookup, because if there are a lot of instructions on the hot path, the hash table will be slow even when all data fits in CPU caches.
  • You should not store a lot of additional metadata in your hash table, because otherwise your hash table will quickly stop fitting in CPU caches.
  • You should decrease the number of random memory accesses during hash table lookup, because otherwise the implementation will slow down significantly once the hash table no longer fits in CPU caches. Ideally, you should have one memory access per operation. This also implies that you cannot use complex layouts (2 hash tables, metadata and entries separation, keys and values separation, complex metadata), because usually they require additional memory accesses during operations.

Additionally, for the aggregation scenario, you need to consider not only lookup/insert latency but also the memory size of the hash table. If during aggregation the hash table does not fit in RAM, the aggregation algorithm has to switch from in-memory aggregation to external memory aggregation, which works significantly slower.

The trade-off between hash table lookup latency and memory size is really complex. For example, linear probing open addressing hash tables can work well only with a load factor near 0.5, and with a resize factor of 2, the worst-case memory overhead for such a hash table is 4x (right after a resize, the table is only a quarter full). There are also hash table implementations that can work well with a load factor of 0.875 or even higher, but in practice they provide slower lookup latency.

Hash Table Design

A hash table is a very complex data structure that consists of many components. Each component is critical to hash table performance, and a mistake in any of them can make your hash table implementation inefficient.

Generally, a hash table consists of the following components:

  • Hash function.
  • Collision resolution method.
  • Resize policy.
  • Memory layout.

There can also be dependency between components. For example, the choice of memory layout can affect the collision resolution method and hash function.

Hash function

Let’s start by choosing a hash function. This is a very important element of a hash table.

Here are common issues with hash function choice:

  • Usage of an identity hash function for numeric types. If you use only the lowest bits of an integer to compute the bucket, you can get a lot of collisions on real data (see the sketch below).
  • Usage of hash functions built for string data on numeric types. Such hash functions execute a lot of instructions, and the compiler does not inline such complex hash functions.
  • Usage of cryptographic hash functions unless it is necessary. For example, the throughput of the SipHash function (although technically it is not a cryptographic hash function) is about 1 GB/s, while the throughput of the CityHash function is about 10 GB/s. If you use the SipHash function, your hash table throughput will be limited to 1 GB/s.
  • Usage of legacy hash functions like FNV1a. Such legacy hash functions are slow and have poor data distribution compared to modern competitors.

There is the smhasher repository with benchmarks and quality tests of various hash functions, which you can use to choose a hash function for your hash table.
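To illustrate the first issue, here is a sketch of the difference between an identity hash and a cheap finalizer-style mixer for 64-bit integers (the mixer below uses the constants of MurmurHash3’s fmix64 finalizer; the struct names are just for illustration):

#include <cstdint>

// Identity hash: with a power-of-two table size, only the lowest bits of the key
// select the bucket, so keys that differ only in their high bits collide.
struct IdentityHash
{
    uint64_t operator()(uint64_t key) const { return key; }
};

// Finalizer-style mixer (constants from MurmurHash3's fmix64): a few cheap,
// easily inlined instructions that spread entropy from all bits of the key
// into the lowest bits used for bucket selection.
struct MixHash
{
    uint64_t operator()(uint64_t key) const
    {
        key ^= key >> 33;
        key *= 0xff51afd7ed558ccdULL;
        key ^= key >> 33;
        key *= 0xc4ceb9fe1a85ec53ULL;
        key ^= key >> 33;
        return key;
    }
};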

Collision resolution method

Let’s talk about collision resolution. In any hash table, by the birthday paradox, different entries will inevitably fall into the same bucket. Suppose we inserted an entry with key K1 into the table’s third bucket. Now, we are trying to insert an entry with key K2, and it turns out that, according to the remainder of the division of its hash, it falls into the same bucket.

Hash Table Collision

We need to figure out how to handle collisions. There are several ways to resolve this situation:

  • Separate chaining. Each hash table bucket uses some data structure to store collided entries, for example, a list or an array. When a collision occurs, we insert the new entry into this data structure.
  • Open addressing. We store all hash table entries in the bucket array itself. When a collision occurs, we follow a probe sequence until we find an empty bucket. The most common probe sequences are linear probing and quadratic probing.

Let’s discuss the separate chaining method.

Hash Table Collision Resolution Chaining

For example, in each bucket, we will store a list of entries. We can put a new entry in that list. Then, we will check each entry in the list during lookup or insert.

This is the way std::unordered_map is implemented.

Chaining has the following advantages:

  • Stability of pointers to key, value.
  • Ability to store large objects, non-movable objects.
  • Works well even with bad hash functions or with a very high load factor.

But chaining hash tables are not often used in practice for high-performance scenarios because of the following downsides:

  • High memory overhead.
  • Puts load on the allocator (even just a function call is expensive on the hash table lookup path).
  • Poor cache locality: such hash tables make a lot of random memory accesses during lookup.

All modern hash tables use the open addressing method.

Hash Table Collision Resolution Open Addressing

During a collision, we insert K2 into some other bucket of the same array, depending on the collision resolution scheme. There are the following schemes:

  • Linear probing. Fixed interval between probes, usually 1. Example: ClickHouse Hash Map.
  • Quadratic probing. Interval between probes is increased by adding the successive outputs of a quadratic polynomial. Example: Google Dense Hash Map, Abseil Hash Map.
  • Double hashing. Interval between probes is computed by a secondary hash function.
  • Robin Hood hashing.

You can read more about these schemes here: Open addressing.

There are more complex schemes, such as Cuckoo hashing or Hopscotch hashing, but they do not work well in practice, because they usually require additional fetches from memory and have a lot of additional instructions on the hot path of hash table lookup.

Open addressing has the following advantages:

  • Low memory overhead. Typically, we can store only keys and values.
  • Good cache locality. For example, we can find or insert a key using only a single fetch from memory. This becomes even more important for big hash tables that do not fit in CPU caches, where memory access latency is the most important factor in hash table performance.

In practice, there are the following important downsides:

  • Open addressing hash tables require a careful choice of hash function, because a bad hash function leads to clustering.
  • Such hash tables cannot operate under a high load factor; most implementations use a load factor of 0.5.
  • It is inefficient to store large objects in such hash tables, because that destroys cache locality. If you need to store big objects in an open addressing hash table, you can serialize a big object into an arena and store a pointer to this object in the hash table.
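To make the open addressing idea concrete, here is a minimal sketch of a linear probing lookup over a toy table of 64-bit keys (key 0 is reserved as the empty sentinel, the size is a power of two, and the table is assumed to never be completely full; all names are hypothetical):

#include <cstddef>
#include <cstdint>
#include <vector>

// Toy open addressing set of 64-bit keys with linear probing.
// Key 0 is reserved as the "empty bucket" sentinel.
struct ToyLinearProbingSet
{
    std::vector<uint64_t> buckets;

    explicit ToyLinearProbingSet(size_t size_power_of_two) : buckets(size_power_of_two, 0) {}

    bool contains(uint64_t key, uint64_t hash) const
    {
        size_t mask = buckets.size() - 1;
        for (size_t place = hash & mask;; place = (place + 1) & mask)
        {
            if (buckets[place] == key)
                return true;
            if (buckets[place] == 0)    /// an empty bucket terminates the probe sequence
                return false;
        }
    }
};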

Resize policy

A very important part of a hash table is the resize policy. If the number of elements in your hash table exceeds some load factor threshold, you need to create a new hash table with a greater size and copy all elements into it.

The most popular method is to keep the hash table size a power of two, so during each resize we double the hash table size. This method is good because, as we already discussed, we want to spend as little time as possible on hash table lookup; it should take nanoseconds if the table fits in CPU caches, and bucket computation is fast when the hash table size is a power of two:

size_t place = hash & (size - 1);

Avoiding division for bucket computation is very important because 64-bit integer division takes dozens of processor cycles on modern CPUs. You can read more about it in Agner Fog Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms. 4.5 Integer division and Agner Fog Optimizing subroutines in assembly language: An optimization guide for x86 platforms. 16.8 Division (all processors).

There is also a more theoretical justification for using a table size that is a prime number close to a power of two. The downside is that you need to figure out how to avoid the division. To do this, you can use a switch over constants (the compiler can optimize division by a constant into a multiply and a shift), the libdivide library, or the fastmod library.
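A sketch of the switch-over-constants trick: each case divides by a compile-time constant, so the compiler can replace the division with a multiply and a shift (the function name and the prime list are purely illustrative):

#include <cstddef>

// Each case divides by a compile-time constant, which the compiler lowers
// into a multiplication and a shift instead of a real division instruction.
inline size_t bucketForPrimeSize(size_t hash, size_t size_degree)
{
    switch (size_degree)
    {
        case 0: return hash % 7;
        case 1: return hash % 13;
        case 2: return hash % 31;
        case 3: return hash % 61;
        /// ... one case per supported prime table size
        default: return hash % 127;
    }
}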

Now, let’s check the graphs of how the load factor affects the number of probes for successful and unsuccessful lookups for linear probing, double hashing, and chained hashing (separate chaining).

Successful lookup:

Hash Table Load Factor to Number of Probes Successful Lookup

Unsuccessful lookup:

Hash Table Load Factor to Number of Probes Unsuccessful Lookup

Formulas are taken from Knuth, Donald (1998). The Art of Computer Programming. Vol. 3: Sorting and Searching (2nd ed.). Addison-Wesley. “Chapter 6.4 Hashing”.
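For reference, the standard approximations from that chapter for the expected number of probes at load factor \alpha (written in LaTeX; C is the expected number of probes for a successful lookup, C' for an unsuccessful one):

% Linear probing
C \approx \frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right), \qquad
C' \approx \frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)

% Uniform (double) hashing
C \approx \frac{1}{\alpha}\ln\frac{1}{1 - \alpha}, \qquad
C' \approx \frac{1}{1 - \alpha}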

We see that for linear probing, performance degrades quickly after a load factor of 0.5. In practice, most open addressing hash tables use a load factor of 0.5 (ClickHouse Hash Map, Google Dense Hash Map). The Abseil Hash Map uses a load factor of around 0.875.

Memory layout

The most important choice is how the hash table cells are arranged in memory.

Why do we need a special memory layout? As soon as you try to write an open addressing hash table, you will need to check whether a hash table cell is empty or not. Or suppose you need to implement erase from the hash table and want to store information about whether a cell is deleted or not. It is impossible to store this information without any overhead, and different memory layouts provide different trade-offs.

There are several options that are used in practice.

Special key values

The first option is to ask the client to provide some key value as a null key (sentinel). Also, if your hash table supports the erase operation, the client needs to provide an additional key value as a deleted key (tombstone). The null and deleted key values must never be inserted by the client into the hash table. That way, the hash table can use these special key values to identify if a cell is empty or deleted.

Hash Table Memory Layout Sentinels

This method is used in Google Dense Hash Map, the LLVM Hash Map, and a lot of other hash table implementations. The main disadvantage of this method is that we force the client to find some key that will never be used in the hash table. In general, it makes the API more complex.
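A sketch of what a cell looks like in such a layout, assuming 64-bit keys and hypothetical sentinel values chosen by the client:

#include <cstdint>

// Cell of a sentinel-based layout for 64-bit keys: emptiness and deletion are
// encoded directly in the key value, so no extra metadata is needed.
// The client promises never to insert the sentinel values themselves.
struct SentinelCell
{
    static constexpr uint64_t empty_key = 0;             /// null key chosen by the client
    static constexpr uint64_t deleted_key = UINT64_MAX;  /// tombstone chosen by the client

    uint64_t key = empty_key;
    uint64_t value = 0;

    bool isEmpty() const { return key == empty_key; }
    bool isDeleted() const { return key == deleted_key; }
    bool isOccupied() const { return !isEmpty() && !isDeleted(); }
};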

Zero value storage

There is also an interesting method that is used in the ClickHouse Hash Map. There is a special bucket for the zero value key (null key), and it is stored separately from the main hash table array. That way, the hash table can use the zero value key to identify whether a cell is empty.

Hash Table Memory Layout Null Special

During each lookup in the hash table, we first need to check if the key is null (and process it separately), and if the key is not null, continue with the default execution path. The downside is an additional branch during each hash table lookup, but in practice it is easily predicted by the CPU branch predictor and does not affect performance.

For the deleted key, you can use a similar approach. Or, for a linear probing hash table, you can use a deletion algorithm without tombstones.
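A sketch of this layout for a toy set of 64-bit keys (hypothetical names; the zero key lives in a dedicated slot, so inside the main array the value 0 means “empty”):

#include <cstddef>
#include <cstdint>
#include <vector>

// "Zero value storage" layout: the zero (null) key lives in a dedicated slot
// outside the main array, so inside the array the value 0 means "empty bucket".
struct ZeroStoredSet
{
    bool has_zero_key = false;          /// separate slot for the zero key
    std::vector<uint64_t> buckets;      /// 0 means "empty bucket"

    explicit ZeroStoredSet(size_t size_power_of_two) : buckets(size_power_of_two, 0) {}

    bool contains(uint64_t key, uint64_t hash) const
    {
        if (key == 0)                   /// extra, well-predicted branch on every lookup
            return has_zero_key;

        size_t mask = buckets.size() - 1;
        for (size_t place = hash & mask;; place = (place + 1) & mask)
        {
            if (buckets[place] == key)
                return true;
            if (buckets[place] == 0)
                return false;
        }
    }
};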

Metadata

You can store metadata for each cell or for a group of cells, either within the main hash table array or separately. In the metadata, you need to store information about whether a cell is empty or not and whether a cell is deleted or not.

There are a lot of ways to store metadata.

Hash Table Memory Layout Metadata

For example, you can store 8 bytes of metadata for every 8 hash table cells in each hash table bucket. In the metadata, you can store not only the cell state, which takes up to 2 bits, but also additional information, for example, the 7 lower bits of the hash function. During lookup, you can quickly scan the metadata to find which cells contain the same 7 lower bits of the hash function and then check only these cells.

A similar approach is used by the Abseil Hash Map from Google. Their hash table consists of buckets. Each bucket has 16 metadata bytes and 16 cells.

During lookup, we can:

  1. Find the bucket using the 57 upper bits of the hash.
  2. Load 16 metadata bytes into an SSE register.
  3. Find potential candidate cells that match the 7 lower bits of the hash using SSE instructions (a sketch of this comparison is shown below).
  4. Check each candidate cell. If no cell matches and the bucket contains an empty element, the key does not exist in the hash table. Otherwise, continue the lookup using the next probe bucket (you can choose the next probe bucket using any probing scheme, for example, linear probing).

For additional details, check Swiss Tables design documentation and Abseil containers documentation.
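Here is a sketch of the metadata comparison from step 3 using SSE2 intrinsics; it mirrors the idea of the Abseil implementation rather than its exact code, and the function name is hypothetical:

#include <cstdint>
#include <emmintrin.h>  /// SSE2 intrinsics

// Compare 16 metadata bytes against the 7-bit hash fragment of the probed key.
// The result is a 16-bit mask with one bit set per candidate cell in the bucket.
inline uint32_t matchMetadata(const uint8_t * metadata, uint8_t hash_low_7_bits)
{
    __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i *>(metadata));
    __m128i pattern = _mm_set1_epi8(static_cast<char>(hash_low_7_bits));
    __m128i equal = _mm_cmpeq_epi8(group, pattern);
    return static_cast<uint32_t>(_mm_movemask_epi8(equal));
}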

This approach works well in practice and is especially good for unsuccessful lookups. Additionally, such hash tables can work well with a high load factor. The downside is that hash table lookup works a bit slower because of the additional instructions (compared to a linear probing implementation).

Benchmarks

If you try to find out which hash table is the best, you will discover that every second person on the Internet has written their own fastest hash table. However, if you dig a bit deeper, you will see that this is not quite the case.

The main problems with hash table benchmarks are:

  • They are not reproducible: for example, there is no source code of the benchmark, or there is no specification of the hardware that was used for benchmarking.
  • They are made not on real data but on random numbers. The distribution of random numbers does not correspond to real data distributions, and for hash tables, this can be very important.
  • They do not cover any real-world scenario.
  • They do not check the memory usage of hash tables, which is important for some scenarios.

How should benchmarks be done? They need to be done on real data and on real-world scenarios. In our case, the scenario is data aggregation.

I designed a benchmark for this specific scenario: hash table aggregation benchmark.

This benchmark compares the performance of different hash tables with different hash functions for the in-memory aggregation scenario. The benchmark is based on real anonymized web analytics data from the Yandex.Metrica dataset.

The benchmark computes a mapping from each unique key to its count for columns of the dataset, similar to the SQL query SELECT column, count(column) FROM hits GROUP BY column.

Because each column in the benchmark has different cardinality and distribution, it is possible to check how different hash tables work for in-memory aggregation on real-world data.

Now let’s take a look at how different hash tables behave when the data does not fit in CPU caches (the WatchID column) and when it does (the RegionID column).

WatchID file:

File: data/WatchID.bin
Key type: Int64
Keys size: 99997497
Unique keys size: 99997493
+------------------------+---------------+---------------+--------------+
| Hash Table             | Hash Function | Elapsed (sec) | Memory Usage |
+------------------------+---------------+---------------+--------------+
| ClickHouse HashMap     | absl::Hash    |      6.73     |     4.00 GiB |
| absl::flat_hash_map    | absl::Hash    |     10.26     |     2.13 GiB |
| google::dense_hash_map | absl::Hash    |     10.18     |     4.00 GiB |
| std::unordered_map     | absl::Hash    |     54.64     |     5.23 GiB |
+------------------------+---------------+---------------+--------------+

If we look at the benchmark results, we see that the ClickHouse Hash Map and the Google hash maps are way ahead of std::unordered_map. Why? Because, as I said, std::unordered_map is not cache-local.

Let’s look at cache misses with perf stat to confirm our assumption:

+------------------------+---------------+
| Hash Table             | Cache Misses  |
+------------------------+---------------+
| ClickHouse HashMap     |   372,914,615 |
| absl::flat_hash_map    |   500,798,322 |
| google::dense_hash_map |   440,054,848 |
| std::unordered_map     | 1,478,158,598 |
+------------------------+---------------+

We see that std::unordered_map has much more cache misses than other hash tables.

Let’s now take a look at the RegionID column, where the hash table fits in CPU caches:

+------------------------+---------------+---------------+--------------+
| Hash Table             | Hash Function | Elapsed (sec) | Memory Usage |
+------------------------+---------------+---------------+--------------+
| ClickHouse HashMap     | absl::Hash    |      0.16     |     1.12 MiB |
| absl::flat_hash_map    | absl::Hash    |      0.29     |   864.00 KiB |
| google::dense_hash_map | absl::Hash    |      0.26     |     1.03 MiB |
| std::unordered_map     | absl::Hash    |      1.36     |   764.00 KiB |
+------------------------+---------------+---------------+--------------+

We can take a look at instructions, cycles, branches, branch misses, and cache misses using perf stat to confirm our assumptions:

+------------------------+---------------+---------------+---------------+---------------+--------------+
| Hash Table             | Cycles        | Instructions  | Branches      | Branch Misses | Cache Misses |
+------------------------+---------------+---------------+---------------+---------------+--------------+
| ClickHouse HashMap     | 1,372,922,746 | 3,330,251,159 |   594,385,004 |       679,947 |    6,626,740 |
| absl::flat_hash_map    | 1,927,221,156 | 4,700,631,917 |   498,715,181 |       533,743 |    6,540,744 |
| google::dense_hash_map | 1,810,270,745 | 5,447,719,444 |   893,992,903 |       849,321 |    6,909,115 |
| std::unordered_map     | 6,535,384,982 | 6,508,367,772 | 1,055,471,900 |    12,210,677 |    9,184,031 |
+------------------------+---------------+---------------+---------------+---------------+--------------+

In this case, we see that std::unordered_map is still slow because it executes a lot of additional instructions and has a lot of branches on the hot path of hash table lookup.

Optimizations

Now, I want to discuss various specific optimization techniques that can help you improve performance in different scenarios. Implementation details of most of these optimizations are discussed in my blog post here Hash tables in ClickHouse and C++ Zero-cost Abstractions C++ hash table design.

Saved hash

In each cell, the hash table can additionally store the computed hash. This can be extremely powerful if key comparison or hash computation for a given key is a slow operation (strings, ASTs, complex objects). Such an optimization can help even without a special hash table implementation; for example, for std::unordered_map, you can write a simple wrapper around your key.

This is an example from ClickHouse Analyzer query tree node:

template <typename QueryTreeNodePtrType>
struct QueryTreeNodeWithHash
{
    QueryTreeNodeWithHash(QueryTreeNodePtrType node_)
        : node(std::move(node_))
        , hash(node->getTreeHash())
    {}

    QueryTreeNodePtrType node = nullptr;
    CityHash_v1_0_2::uint128 hash;
};

template <typename T>
inline bool operator==(const QueryTreeNodeWithHash<T> & lhs, const QueryTreeNodeWithHash<T> & rhs)
{
    return lhs.hash == rhs.hash && lhs.node->isEqual(*rhs.node);
}

template <typename T>
inline bool operator!=(const QueryTreeNodeWithHash<T> & lhs, const QueryTreeNodeWithHash<T> & rhs)
{
    return !(lhs == rhs);
}

using QueryTreeNodePtrWithHash = QueryTreeNodeWithHash<QueryTreeNodePtr>;

/// QueryTreeNodePtr is the ClickHouse query tree node pointer type; std::unordered_map
/// additionally needs a std::hash specialization that returns (part of) the stored hash, omitted here.
template <typename Value>
using QueryTreeNodePtrWithHashMap = std::unordered_map<QueryTreeNodePtrWithHash, Value>;

Linear probing in place resize

For a linear probing hash table whose size is a power of two, almost half of the elements do not change their positions during a resize, so it is possible to implement the resize in place.

First, we need to reallocate the initial hash table array. Most allocators do not support realloc without copying data; they just perform malloc, memcpy, free. On Linux, you can use the mmap system call to allocate big hash tables and the mremap system call for better realloc performance.

With an in-place resize, we significantly reduce the number of random memory accesses compared to allocating another hash table with a bigger size and copying all elements into it.
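A minimal sketch of growing an mmap-backed allocation with mremap on Linux (error handling reduced to a fallback return; the function name is hypothetical):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1    /// mremap is a Linux-specific extension
#endif
#include <sys/mman.h>
#include <cstddef>

// Try to grow an mmap-backed hash table array; the kernel can often extend or
// remap the pages without copying them (MREMAP_MAYMOVE allows moving the
// mapping if it cannot be extended at the same address).
void * growMapping(void * old_address, size_t old_size, size_t new_size)
{
    void * new_address = mremap(old_address, old_size, new_size, MREMAP_MAYMOVE);
    if (new_address == MAP_FAILED)
        return nullptr;    /// caller falls back to mmap + copy
    return new_address;
}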

Clearable hash table

If we have a scenario where we periodically need to clear the whole hash table, we can store a version in the hash table and also store a version in each hash table cell.

During the clear operation, we increase the hash table version, and during hash table lookup, we compare the cell version with the hash table version (if the cell version is lower, the cell is treated as removed).
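A sketch of the versioned cell and the O(1) clear, with hypothetical names:

#include <cstdint>

// Clearable hash table: instead of touching every cell, clear() just bumps the
// table version; cells whose version is older are treated as empty.
struct ClearableCell
{
    uint64_t key = 0;
    uint32_t version = 0;    /// table version at the time of insertion
};

struct ClearableHashSetSketch
{
    uint32_t version = 1;

    bool isOccupied(const ClearableCell & cell) const { return cell.version == version; }

    void clear() { ++version; }    /// O(1), no memory is touched
};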

String hash table

There is an interesting specialized string hash table implementation that works well for strings on real-world data. It stores five hash tables for strings of different lengths and uses different hash functions for them.

More specifically, this string hash table consists of 5 hash tables:

  • for empty strings (this can be just a single bucket)
  • for strings 1-8 bytes in size
  • for strings 9-16 bytes in size
  • for strings 17-24 bytes in size
  • for strings bigger than 24 bytes

String hash table operations (lookup/insert/delete) are dispatched to the right hash table depending on the size of the key, as sketched below. Such a string hash table is used in ClickHouse and was contributed by the authors of the paper SAHA: A String Adaptive Hash Table for Analytical Databases.
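A sketch of the size-based dispatch (the submap member names are hypothetical; each submap can use a key representation and hash function tuned for its size range, for example packing strings of up to 8 bytes into a single 64-bit integer):

#include <string_view>

// Dispatch a string key to one of the five submaps based on its length and
// apply the given operation (lookup/insert/erase) to that submap.
template <typename StringHashTable, typename Func>
auto dispatchBySize(StringHashTable & table, std::string_view key, Func && func)
{
    if (key.empty())
        return func(table.empty_string_slot);
    if (key.size() <= 8)
        return func(table.map_1_to_8);
    if (key.size() <= 16)
        return func(table.map_9_to_16);
    if (key.size() <= 24)
        return func(table.map_17_to_24);
    return func(table.map_longer_than_24);
}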

Fast LRU Cache

Another interesting optimization is a fast LRU cache, an implementation of the LRU cache eviction policy.

Usually, an LRU cache is implemented as a doubly linked LRU list plus a hash table that maps a key to a list iterator. When a specific key is accessed, we move its iterator to the “most recent” end of the list. When a specific key is inserted, we create an element at the “most recent” end of the list and update the mapping. If the cache overflows during insert, we evict the element at the “least recent” end of the list.

An implementation of the LRU cache using a separate list and a hash table is not optimal because it uses two containers. It is possible to do this in a single container: we can build an intrusive list on top of the hash table array. In each hash table cell, we store pointers to the next and previous elements of a doubly linked list. That way, we can build the doubly linked list inside the hash table. Additionally, during resize, when cells are moved, these pointers need to be updated.
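A sketch of such a cell, with hypothetical names (the links live directly in the hash table array, so no separate list nodes are allocated):

// LRU cache cell: key/value plus intrusive doubly linked list links stored
// directly in the hash table array. The list orders cells by recency of use.
// When the table resizes and cells move, these links must be fixed up.
template <typename Key, typename Value>
struct LRUHashCell
{
    Key key{};
    Value value{};

    LRUHashCell * prev = nullptr;    /// towards the "least recent" end
    LRUHashCell * next = nullptr;    /// towards the "most recent" end
};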

Store metadata in pointer values

If the value of your hash table is a pointer, you can use the lowest pointer bits to store additional metadata (the lowest bits of the hash, the entry offset from its ideal position). That way, you can improve lookup performance or try to implement additional hash table layouts like Robin Hood hashing without paying the cost of metadata storage. The downside is that before the client can work with a pointer value, the hash table implementation must clear the lowest bits.
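A sketch of packing metadata into the unused low bits of an aligned pointer (here 4 bits, assuming 16-byte alignment; all function names are hypothetical):

#include <cstdint>

// For a 16-byte-aligned allocation the 4 lowest pointer bits are always zero,
// so they can carry small metadata "for free".
inline void * packPointer(void * ptr, uint8_t metadata)    /// metadata must fit in 4 bits
{
    return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(ptr) | (metadata & 0xF));
}

inline void * unpackPointer(void * tagged)    /// what the client actually works with
{
    return reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(tagged) & ~uintptr_t(0xF));
}

inline uint8_t unpackMetadata(void * tagged)
{
    return static_cast<uint8_t>(reinterpret_cast<uintptr_t>(tagged) & 0xF);
}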

Compressed pointers

In some scenarios, when you want to store pointers to complex fixed-size objects that are allocated on an arena, you can store the positions of these objects in the arena instead of the pointers. To retrieve the object pointer, you can compute arena_start_pointer + object_position_in_arena * object_size. That way, it is possible to significantly reduce the size of your hash table.
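A sketch of this computation (a 4-byte position in the hash table instead of an 8-byte pointer; names are hypothetical):

#include <cstdint>

// Reconstruct the object pointer from the arena start and the stored position.
// Storing a 4-byte position instead of an 8-byte pointer halves the value size.
template <typename Object>
Object * resolveArenaPosition(char * arena_start_pointer, uint32_t object_position_in_arena)
{
    return reinterpret_cast<Object *>(arena_start_pointer + object_position_in_arena * sizeof(Object));
}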

Separate hash computation

Most hash table implementations encapsulate hash computation and do not provide an API for the client to insert a key using an already computed hash.

For example, you may work with your hash table under a mutex and want to avoid computing the hash inside the critical section. Or you may want to reinsert data from one hash table into another and already have computed hash values for each key.

You can provide an API for your hash table operations (lookup/insert/delete) that accepts both the key and the hash of that key. That way, in many cases, you can significantly improve the performance of those operations.
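A sketch of what such an interface can look like (names and signatures are hypothetical):

#include <cstddef>

// Hash table interface that accepts a precomputed hash for each operation, so
// the caller can compute hashes outside a critical section or reuse them when
// moving entries between tables.
template <typename Key, typename Value>
class HashMapWithExternalHash
{
public:
    size_t hash(const Key & key) const;

    Value * find(const Key & key, size_t precomputed_hash);
    bool insert(const Key & key, const Value & value, size_t precomputed_hash);
    bool erase(const Key & key, size_t precomputed_hash);
};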

Summary

Hash tables are extremely complex data structures. There is no single best hash table for every scenario. For each scenario, the hash table implementation needs to be chosen carefully according to many factors and constraints.

To test and evaluate hash tables for the in-memory aggregation scenario, I designed the hash table aggregation benchmark.

Hash tables can also be optimized in a lot of different ways using low-level techniques. Hash tables are also a building block for more complex data structures or algorithms.

An interesting observation is that it is impossible to design a single hash table layout that works well in all scenarios. You can design a hybrid hash table that works fast and uses a simple implementation while it fits in CPU caches, and then, when it no longer fits in CPU caches, transforms the implementation into something more complex. For example, the more complex implementation can use more arithmetic instructions during lookup, use different hash functions, or have a more complex layout. I plan to write about these hybrid hash tables in my next blog post.