Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUsReference: Blanas, Spyros, Yinan Li, and Jignesh M. Patel. "Design and evaluation of main memory hash join algorithms for multi-core CPUs." Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011.
This is one of the several papers belong to suggested readings for Parallel Join Algorithms (Hashing) of CMU 15-721: Database Systems.
The focus of this paper is on investigating efficient hash join algorithms for modern multi-core processors in main memory environments. This paper dissects each internal phase of a typical hash join algorithm and considers different alternatives for implementing each phase, producing a family of hash join algorithms. Then, we implement these main memory algorithms on two radically different modern multi-processor systems, and carefully examine the factors that impact the performance of each method.
Our analysis reveals some interesting results – a very simple hash join algorithm is very competitive to the other more complex methods. This simple join algorithm builds a shared hash table and does not partition the input relations. Its simplicity implies that it requires fewer parameter settings, thereby making it far easier for query optimizers and execution engines to use it in practice. Furthermore, the performance of this simple algorithm improves dramatically as the skew in the input data increases, and it quickly starts to outperform all other algorithms. Based on our results, we propose that database implementers consider adding this simple join algorithm to their repertoire of main memory join algorithms, or adapt their methods to mimic the strategy employed by this algorithm, especially when joining inputs with skewed data distributions.
- Multi-core era
- Modern high performance main memory join algorithms has two extremes: (1) minimizing the number of processor cache misses, e.g. radix-based hash join; (2) minimizing processor synchronization costs, e.g., this paper.
- To the best of our knowledge, this is the first systematic exploration of multiple hash join techniques that spans multi-core architectures.
- We show that an algorithm that does not do any partitioning, but simply constructs a single shared hash table on the build relation often outperforms more complex algorithms.
- We show that the simple “no-partitioning” hash join algorithm takes advantage of intrinsic hardware optimizations to handle skew.
2. The Multi-core Landscape
We briefly consider two modern architectures that we subsequently use for evaluation:
- The Intel Nehalem family is an instance of Intel’s latest micro-architecture that offers high single-threaded performance because of its out-of-order execution and on-demand frequency scaling (TurboBoost). Multi-threaded performance is increased by using simultaneous multi-threading (Hyper-Threading).
- The Sun UltraSPARC T2 has 8 simple cores that all share a single cache. This CPU can execute instructions from up to 8 threads per core, or a total of 64 threads for the entire chip, and extensively relies on simultaneous multi-threading to achieve maximum throughput.
3. Hash Join Implementation
- A hash join operator works on two input relations, R and S. We assume that . A typical hash join algorithm has three phases: partition, build, and probe. The partition phase is optional and divides tuples into distinct sets using a hash function on the join key attribute. The build phase scans the relation R and creates an in-memory hash table on the join key attribute. The probe phase scans the relation S, looks up the join key of each tuple in the hash table, and in the case of a match creates the output tuple(s).
- We have found that the latch implementation has a crucial impact on the over- all join performance. In particular, the pthreads mutex implementation doesn’t work well (several instructions to acquire/release, significant memory footprint), hence, we implemented our own 1-byte latch for both the Intel and the Sun architectures, using the atomic primitives xchgb and ldstub, respectively.
3.1 Partition Phase
We classify the partitioning algorithms as “non-blocking” if they produce results on-the-fly and scan the input once, in contrast to a “blocking” algorithm that produces results after buffering the entire input and scanning it more than once.
3.1.1 Non-blocking Algorithms
- The first partitioning algorithm creates p shared partitions among all the threads. The threads need to synchronize via a latch to make sure that the writes to a shared partition are isolated from each other.
- The second partitioning algorithm creates p * n partitions in total and each thread is assigned a private set of p partitions. Each thread then writes to its local partitions without any synchronization overhead. When the input relation is depleted, all threads synchronize at a barrier to consolidate the p * n partitions into p partitions.
- The benefit of creating private partitions is that there is no synchronization overhead on each access; and the drawbacks, are (1) many partitions are created, possibly so many that the working set of the algorithm no longer fits in the data cache and the TLB; (2) at the end of the partition phase some thread has to chain n private partitions together to form a single partition, but this operation is quick and can be parallelized.
3.1.2 Blocking Algorithm
- Another partitioning technique is the parallel multi-pass radix partitioning algorithm described by this paper.
- The benefit of radix partitioning is that it makes few cache and TLB misses, as it bounds the number of output destinations in each pass. This particular implementation has the benefit that, it avoids the synchronization over- head that is associated with sharing an output buffer.
- Drawbacks: (1) blocking; (2) this implementation places a burden on the previous operator in a query tree to produce the compact and contiguous output format that the radix partitioning requires as input.
3.2 Build Phase
- To reduce the number of cache misses that are incurred during the next (probe) phase, each bucket of this hash table is sized so that it fits on a few cache lines.
- Each thread scans every tuple t in its partition, extracts the join key k, and then hashes this key using a hash function h(). Then, the tuple t is appended to the end of the hash bucket h(k), creating a new hash bucket if necessary.
3.3 Probe Phase
- Notice that there is parallelism even inside the probe phase: looking up the key for each tuple r in a hash bucket and comparing it to k can be parallelized with the construction of the output tuple, which primarily involves shuffling bytes from tuples r and s.
3.4 Hash Join Variants
In this paper, we focus on the following four hash join variations:
- No partitioning join: an implementation where partitioning is omitted, and this implementation creates a shared hash table in the build phase.
- Shared partitioning join: the first non-blocking partitioning algorithm of Section 3.1.1
- Independent partitioning join: the second non-blocking partitioning algorithm of Section 3.1.1
- Radix partitioning join: as described in Section 3.1.2
4. Experimental Evaluation
- Three different distribution: uniform, low skew, high skew
- Two different architectures: the Intel Nehalem and the Sun UltraSPARC T2.
- Overall, the build phase takes a very small fraction of the overall time, the performance of the join operation is therefore mostly determined by the time spent partitioning the input relations and probing the hash table.
- We see that on the Intel architecture the performance of the no partitioning join algorithm is comparable to the performance of all the other algorithms. For the Sun UltraSPARC T2, we see that the no partitioning join algorithm outperforms the other algorithms by at least 1.5X.
- The no partitioning algorithm is more robust, as the performance of the other algorithms degrades if the query optimizer does not pick the optimal value for the number of partitions.
4.4 Effect of Skew
- When using the shared hash table (no partition), performance improves in the presence of skew. On the other hand, the performance of the shared partitioning algorithm degrades rapidly with increasing skew, while the performance of the independent partitioning and the radix partitioning algorithms shows little change on the Intel Nehalem and degrades on the Sun UltraSPARC T2.
4.5 Performance Counters
- We see fewer cache and TLB misses across all algorithms when adding skew.
- Interpreting performance counters is much more challenging with modern multi-core processors and will likely get worse, this experiment reveals that blindly assigning fixed cycle penalties to architectural events can lead to misleading conclusions.
4.6 Speedup from SMT
- Simultaneous multi-threading (SMT) permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.
- For the uniform dataset, the NO algorithm causes many cache misses, and as a result, it provides more opportunities for SMT to efficiently overlap the memory accesses.
- When comparing the high skew dataset with the uniform dataset across both architectures, we see that the improvement of SMT is reduced. The skewed key distribution incurs fewer cache misses, therefore SMT loses opportunities to hide processor pipeline stalls.
- Synchronization has little impact on the non-partitioned (NO) algorithm for both the uniform and the high skew datasets, regardless of the number of threads that are running.
- The radix partitioning algorithm is significantly impacted by synchronization on both the uniform and the high skew datasets.
4.7.1 Load Balancing
- We tweaked the join algorithm to allow the faster threads that have completed their probe phase to steal work from other slower threads.
- Under skew, a load balancing technique improves the performance of the probe phase but does not address the inherent inefficiency of all the partitioning-based algorithms. In essence, there is a coordination cost to be paid for load balancing, as thread synchronization is necessary. Skew in this case causes contention, stressing the cache coherence protocol and increasing memory traffic.
- The no partitioning algorithm does skewed memory loads of read-only data, which is handled very efficiently by modern CPUs through caching.
4.8 Effect of Output Materialization
- Materialization comes at a fixed price for all algorithms and, therefore, a join algorithm will be faster regardless of the output being materialized or discarded.
4.9 Cardinality Experiments
- No join algorithm is particularly sensitive to our selection of input relation cardinalities,
4.10 Selectivity experiment
- On the Intel Nehalem, decreasing join selectivity has a marginal benefit on the probe phase, but the other two phases are unaffected, while on Sun UltraSPARC T2, the benefit of a small join selectivity on the probe phase is significant.
- Across all the architectures, the code path length (i.e. instructions executed) increases as join selectivity increases. The Intel Nehalem is practically insensitive to different join selectivities, because its out-of-order execution manages to overlap the data transfer with the byte shuffling that is required to assemble the output tuple. For the Sun UltraSPARC T2 machine, there is a strong linear correlation between the code path length and the cycles that are required for the probe phase to complete. The in-order Sun UltraSPARC T2 cannot automatically extract the instruction-level parallelism of the probe phase.
These results imply that DBMSs must reconsider their join algorithms for current and future multi-core processors:
- Modern processors are very effective in hiding cache miss latencies through multi-threading (SMT).
- Optimizing for cache performance requires partitioning, and this has additional computation and synchronization overheads, and necessitates elaborate load balancing techniques to deal with skew. These costs of partitioning on a modern multi-core machine can be higher than the benefit of an increased cache hit rate, especially on skewed datasets.
To fully leverage the current and future CPUs, high performance main memory designs have to achieve good cache and TLB performance, while fully exploiting SMT, and minimizing synchronization costs.
5. Related Work
Hash join for main memory database systems