Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited

Reference: Balkesen, Cagri, et al. "Multi-core, main-memory joins: Sort vs. hash revisited." Proceedings of the VLDB Endowment 7.1 (2013): 85-96.

This is one of several papers belonging to the suggested readings for Parallel Join Algorithms (Sorting) in CMU 15-721: Database Systems.

0. Abstract

In this paper we experimentally study the performance of main-memory, parallel, multi-core join algorithms, focusing on sort-merge and (radix-)hash join. The relative performance of these two join approaches has been a topic of discussion for a long time. With the advent of modern multi-core architectures, it has been argued that sort-merge join is now a better choice than radix-hash join. This claim is justified based on the width of SIMD instructions (sort-merge outperforms radix-hash join once SIMD is sufficiently wide), and NUMA awareness (sort-merge is superior to hash join in NUMA architectures). We conduct extensive experiments on the original and optimized versions of these algorithms. The experiments show that, contrary to these claims, radix-hash join is still clearly superior, and sort-merge approaches the performance of radix only when very large amounts of data are involved. The paper also provides the fastest implementations of these algorithms, and covers many aspects of modern hardware architectures relevant not only for joins but for any parallel data processing operator.

1. Introduction

Contributions

  • We show that radix-hash join is still superior to sort-merge join in most cases
  • We provide several insights on the implementation of data operators on modern processors
  • We present the fastest algorithms available to date for both sort-merge (2-3 times faster than available results) and radix-hash join, demonstrating how to use modern processors to improve the performance of data operators.
  • In addition, the paper sheds light on a number of relevant issues involving the processing of “big data” and the factors that affect the choice of the algorithm: (1) input size; (2) degree of parallelism; (3) cache contention; (4) SIMD performance.

  • The paper shows that hash joins still have an edge over sort-merge alternatives unless the amount of data involved is very large.
  • New processor features such as memory gather support in Intel’s upcoming Haswell series may play a bigger role in improving hash joins than the factors considered so far.

2.1 Sort vs Hash - Early Work

  • Graefe used histogram-based partitioning to improve hash-based joins and ultimately concluded that sort-merge joins win only in a number of corner cases.

2.2 Sort vs. Hash - Multi-core Era

  • Recent work has studied these hash join algorithms and showed that hardware-conscious, parallel radix join has the best overall performance.

2.3 Hardware-Assisted Sorting

2.4 The Role of NUMA

  • Li showed that a hardware-conscious “ring-based” data shuffling approach across NUMA regions achieves much better interconnect bandwidth and improves the performance of the sort-merge join algorithm of Albutiu.

3. Parallelizing Sort With SIMD

  • The dominant cost in sort-merge joins is sorting the input relations.

3.1 Run Generation

For initial run generation, many chunks with a small number of tuples need to be sorted. This favors sorting algorithms that can process multiple chunks in parallel over ones that have a good asymptotic complexity with respect to the tuple count. Sorting networks provide these characteristics and fit well with the SIMD execution model of modern CPUs.

3.1.1 Sorting Networks

  • The beauty of sorting networks is that comparators can be implemented with the help of min/max operators only. Limited data dependencies and the absence of branching instructions make such code run very efficiently on modern hardware (a scalar sketch follows this list).
  • Sorting networks are also appealing because they can be accelerated through SIMD instructions.
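
As a concrete illustration, here is a minimal scalar sketch of a 4-element sorting network built purely from min/max comparators; the element type and function names are mine, not the paper's.

```cpp
#include <algorithm>
#include <cstdint>

// One comparator of a sorting network: a branch-free exchange that
// leaves a <= b, implemented with min/max only.
static inline void comparator(uint32_t& a, uint32_t& b) {
    uint32_t lo = std::min(a, b);
    uint32_t hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Optimal 5-comparator network for 4 elements: the comparator sequence
// is fixed and independent of the data, so there are no branches.
void sort4(uint32_t v[4]) {
    comparator(v[0], v[1]);
    comparator(v[2], v[3]);
    comparator(v[0], v[2]);
    comparator(v[1], v[3]);
    comparator(v[1], v[2]);
}
```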

3.1.2 Speedup Through SIMD

  • However, the strategy illustrated above will sort input items across SIMD registers, and transposition must be achieved through SIMD shuffle instructions that can be used to move individual values within and across SIMD registers.
  • Shuffle operations significantly reduce the effective SIMD speedup for run generation (see the sketch after this list).
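
The following sketch illustrates the idea with 128-bit SSE intrinsics (the paper targets wider AVX registers and key/payload pairs, so the width and names here are assumptions for illustration): each register holds one element from four independent runs, the same 5-comparator network as above sorts all four “columns” in parallel, and a 4x4 transpose built from unpack shuffles turns the sorted columns into four sorted runs.

```cpp
#include <emmintrin.h>  // SSE2: unpack shuffles
#include <smmintrin.h>  // SSE4.1: _mm_min_epi32 / _mm_max_epi32

// One SIMD comparator stage: sorts four independent (a, b) pairs at once.
static inline void simd_comparator(__m128i& a, __m128i& b) {
    __m128i lo = _mm_min_epi32(a, b);
    __m128i hi = _mm_max_epi32(a, b);
    a = lo;
    b = hi;
}

// Sort four runs of four 32-bit keys each. On entry, r0..r3 hold the
// 1st..4th element of every run ("column" layout); the same 5-comparator
// network as in the scalar case sorts all four columns in parallel.
void simd_sort4x4(__m128i& r0, __m128i& r1, __m128i& r2, __m128i& r3) {
    simd_comparator(r0, r1);
    simd_comparator(r2, r3);
    simd_comparator(r0, r2);
    simd_comparator(r1, r3);
    simd_comparator(r1, r2);

    // Transpose via shuffles so each register ends up holding one
    // fully sorted run of four keys ("row" layout).
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);
    __m128i t1 = _mm_unpacklo_epi32(r2, r3);
    __m128i t2 = _mm_unpackhi_epi32(r0, r1);
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);
    r0 = _mm_unpacklo_epi64(t0, t1);
    r1 = _mm_unpackhi_epi64(t0, t1);
    r2 = _mm_unpacklo_epi64(t2, t3);
    r3 = _mm_unpackhi_epi64(t2, t3);
}
```

The transpose costs 8 shuffle instructions for 16 keys; these extra instructions are exactly why the effective SIMD speedup falls short of the register width.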

3.2 Merging Sorted Runs

3.2.1 Bitonic Merge Networks

  • Returning to the idea of sorting networks, larger networks can be built with the help of merging networks that combine two pre-sorted inputs into an overall sorted output.
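
A minimal scalar sketch of a bitonic merge network for two sorted 4-element runs (the paper implements the comparators with SIMD min/max; names here are mine): reversing the second run makes the concatenation a bitonic sequence, after which log2(8) = 3 fixed comparator stages sort it.

```cpp
#include <algorithm>
#include <cstdint>

static inline void comparator(uint32_t& a, uint32_t& b) {  // min/max exchange
    uint32_t lo = std::min(a, b), hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Bitonic merge network: combine two sorted 4-element runs into one
// sorted 8-element run. The comparator pattern is fixed, so the whole
// network is branch-free with respect to the data.
void bitonic_merge4(const uint32_t a[4], const uint32_t b[4], uint32_t out[8]) {
    // Reverse b so that the concatenation is a bitonic sequence.
    uint32_t v[8] = { a[0], a[1], a[2], a[3], b[3], b[2], b[1], b[0] };
    for (int dist = 4; dist >= 1; dist /= 2)     // stages: distance 4, 2, 1
        for (int i = 0; i < 8; ++i)
            if ((i & dist) == 0)                 // compare pair (i, i + dist)
                comparator(v[i], v[i + dist]);
    std::copy(v, v + 8, out);
}
```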

3.2.2 Merging Larger Lists using Bitonic Merge

4. Cache Conscious Sort Joins

4.1 Sorting and the Memory Hierarchy

  • In-Register Sorting
  • In-Cache Sorting
  • Out-of-Cache Sorting

4.2 Balancing Computation and Bandwidth

  • Intel Architecture Code Analyzer considers the super-scalar instruction pipeline of modern Intel processors and infers the expected execution time for a given instruction sequence. The tool does not consider potential pipeline stalls due to memory accesses (bandwidth and/or latency).
  • Memory bandwidth demand can be reduced by merging more than two runs at once. Multi-way merging saves round-trips to memory and thus precious memory bandwidth (see the sketch after this list).
  • Compute vs. Bandwidth: merging only two runs is bound by memory bandwidth, with plenty of stalled CPU cycles that could be spent on additional CPU instructions. As the merge fan-in increases, memory pressure is reduced until the system becomes CPU-bound.
  • Impact of NUMA: in practice, at least some merging passes will inevitably cross NUMA boundaries, and in multi-socket systems, NUMA interconnect bandwidth falls further and further behind the aggregate memory bandwidth that the individual memory controllers could provide. In other words, memory bottlenecks may arise across NUMA boundaries.
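
To see why multi-way merging saves bandwidth: a k-way merge streams every tuple through memory once, whereas a cascade of two-way merges streams it log2(k) times. The sketch below is a simplified scalar stand-in using a binary heap; the paper's implementation uses a bitonic merge tree with SIMD kernels instead.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Simplified scalar k-way merge: one pass over memory regardless of the
// number of runs. (A loser tree or the paper's SIMD bitonic merge tree
// would replace the heap in a tuned implementation.)
std::vector<uint32_t> kway_merge(const std::vector<std::vector<uint32_t>>& runs) {
    using Head = std::pair<uint32_t, size_t>;            // (value, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> heap;
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<uint32_t> out;
    while (!heap.empty()) {
        auto [val, r] = heap.top();                      // smallest head value
        heap.pop();
        out.push_back(val);
        if (++pos[r] < runs[r].size())                   // refill from that run
            heap.push({runs[r][pos[r]], r});
    }
    return out;
}
```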

5. Hash-based Joins

Shatdal identified that when the hash table is larger than the cache size, almost every access to the hash table results in a cache miss. As a result, a partitioning phase was introduced into hash joins to reduce cache misses.

5.1 Radix Partitioning

  • Manegold refined the partitioning idea by also considering the effects of translation look-aside buffers (TLBs) during the partitioning phase, leading to the multi-pass radix partitioning join.
  • Conceptually, radix partitioning takes all input tuples one-by-one and writes them to their corresponding destination partition.
  • Generally, partitions are far apart and on separate VM pages. If the fan-out of a partitioning stage is larger than the number of TLB entries in the system, copying each input tuple will cause another TLB miss. The number of TLB entries is thus treated as an upper bound on the partitioning fan-out (a sketch of single-pass radix partitioning follows this list).
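
A minimal sketch of single-pass radix partitioning (tuple layout and names are assumptions): one histogram pass sizes the partitions, a prefix sum yields the output offsets, and a scatter pass copies each tuple to its destination. Because each of the 2^bits destinations may live on a different VM page, the scatter pass is where the TLB pressure described above arises.

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key, payload; };

// Radix-partition `in` on the lowest `bits` bits of the key.
// After the call, partition p occupies out[offsets[p] .. offsets[p+1]).
void radix_partition(const std::vector<Tuple>& in, std::vector<Tuple>& out,
                     std::vector<size_t>& offsets, int bits) {
    const size_t fanout = size_t(1) << bits;
    const uint64_t mask = fanout - 1;

    std::vector<size_t> hist(fanout, 0);
    for (const Tuple& t : in) ++hist[t.key & mask];      // pass 1: histogram

    offsets.assign(fanout + 1, 0);                       // exclusive prefix sum
    for (size_t p = 0; p < fanout; ++p) offsets[p + 1] = offsets[p] + hist[p];

    out.resize(in.size());
    std::vector<size_t> pos(offsets.begin(), offsets.end() - 1);
    for (const Tuple& t : in)                            // pass 2: scatter
        out[pos[t.key & mask]++] = t;
}
```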

5.2 Software-Managed Buffers

  • By using buffers, the TLB miss limitations on maximum fan-out can be reduced. The idea is to allocate a set of buffers, one for each output partition and each with room for up to N input tuples. Buffers are copied to final destinations only when full.
  • In our implementation of radix join we utilize such software-managed buffers and configure N so that one buffer exactly fills one cache line (64 bytes). Since we are now always writing a full cache line at once to global memory, the CPU can take advantage of its write-combining facilities together with non-temporal writes, thus avoiding having to read the cache line before writing it back (see the sketch after this list).
  • In practice, the advantage of software-managed buffers is two-fold: (1) for many situations, software-managed buffers offer better absolute performance, since fewer passes can usually achieve the same overall fan-out; (2) it is possible to partition even larger data sizes in a single pass, which has not been considered previously.
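
A minimal sketch of the buffer-and-flush idea, assuming 16-byte tuples, a 64-byte cache line (so N = 4), and 64-byte-aligned partition outputs; the SSE2 non-temporal store `_mm_stream_si128` writes the full line around the cache. Structure and names are mine, not the paper's exact code.

```cpp
#include <cstdint>
#include <emmintrin.h>   // SSE2: _mm_stream_si128 (non-temporal store)

struct Tuple { uint64_t key, payload; };                 // 16 bytes

constexpr size_t TUPLES_PER_LINE = 64 / sizeof(Tuple);   // 4 tuples per line

struct alignas(64) PartitionBuffer {
    Tuple slot[TUPLES_PER_LINE];
    size_t count = 0;
};

// Flush one full cache-line buffer with non-temporal stores. Because a
// whole line is written at once, the CPU's write-combining logic never
// has to read the destination line first. `dst` must be 64-byte aligned.
static void flush(const PartitionBuffer& buf, Tuple* dst) {
    const __m128i* src = reinterpret_cast<const __m128i*>(buf.slot);
    __m128i* out = reinterpret_cast<__m128i*>(dst);
    for (int i = 0; i < 4; ++i)                          // 4 x 16 B = 64 B
        _mm_stream_si128(out + i, _mm_load_si128(src + i));
}

// Route one tuple through its partition's buffer; spill a full line.
static void buffered_write(PartitionBuffer& buf, Tuple* base, size_t& pos,
                           const Tuple& t) {
    buf.slot[buf.count++] = t;
    if (buf.count == TUPLES_PER_LINE) {
        flush(buf, base + pos);
        pos += TUPLES_PER_LINE;
        buf.count = 0;
    }
}
```

Since only the compact buffer array is written randomly (and it fits in cache on few pages), the TLB no longer bounds the fan-out.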

6. Join Algorithms Analyzed

6.1 Sort-Merge Join Algorithm - m-way

  • Phases: (1) partition R; (2) locally sort R; (3) employ multi-way merging to get a globally sorted R’; (4) use the same method to get S’; (5) each thread concurrently evaluates the join between NUMA-local sorted runs using a single-pass merge join (sketched after this list).
  • The main intuition behind partitioning in the first place is to allow threads in the subsequent phases to work independently, without any synchronization.
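
For reference, the single-pass merge join of Phase 5 reduces to the classic two-cursor loop over two sorted runs; this sketch (tuple layout and names assumed) also handles duplicate keys by emitting the cross product of equal-key groups.

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key, payload; };

// Single-pass merge join over two sorted runs: advance the cursor with
// the smaller key; on a match, emit all pairs of the equal-key groups.
template <typename EmitFn>
void merge_join(const std::vector<Tuple>& r, const std::vector<Tuple>& s,
                EmitFn emit) {
    size_t i = 0, j = 0;
    while (i < r.size() && j < s.size()) {
        if (r[i].key < s[j].key)      ++i;
        else if (r[i].key > s[j].key) ++j;
        else {
            size_t j0 = j;                       // start of the S group
            while (i < r.size() && r[i].key == s[j0].key) {
                for (j = j0; j < s.size() && s[j].key == r[i].key; ++j)
                    emit(r[i], s[j]);
                ++i;
            }
        }
    }
}
```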

6.2 Sort-Merge Join Algorithm - m-pass

  • The algorithm differs from m-way only in Phase 2 of Figure 4 in the paper. Instead of applying a multi-way merge for merging NUMA-remote runs, m-pass applies successive two-way bitonic merging.
  • The first iteration of merging of sorted runs is done as the data is transferred to the local memory. The rest of the merging continues in local memory, using the multi-pass merging technique (cf. Section 3.2.2) in an iterative manner.

6.3 Massively Parallel Sort-Merge Join - mpsm

  • The mpsm algorithm first globally range-partitions relation R (again as discussed in Section 5.2), then each thread independently sorts its partition, resulting in a globally-sorted R’. In contrast, S is sorted only partially. Each thread sorts its NUMA-local chunk of S without a prior partitioning. Therefore, during the last phase, a run of R must be merge-joined with all the NUMA-remote runs of relation S.
  • For cases where relation S is substantially larger than R, avoiding the global partitioning/sorting may pay off and the overall join may become more efficient.
  • For further details, we refer the reader to Albutiu.

6.4 Radix Hash Join - radix

  • For parallel radix-hash join, we partition both input relations as discussed in Section 5.2. The goal is to break at least the smaller input into pieces that fit into caches. Then, we run a cache-local hash join on individual partition pairs.
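
A minimal sketch of the per-partition build-and-probe step (names, the chained-table layout, and the masking “hash” are simplifications): build a small hash table over the R partition, which by construction fits in cache, then probe it with the matching S partition.

```cpp
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key, payload; };

// Cache-local hash join on one partition pair of the radix join, using
// a chained hash table built from indices: next[i] chains build tuples
// that hash to the same bucket.
template <typename EmitFn>
void hash_join_partition(const std::vector<Tuple>& r,      // build side
                         const std::vector<Tuple>& s,      // probe side
                         EmitFn emit) {
    size_t nbuckets = 1;
    while (nbuckets < 2 * r.size()) nbuckets <<= 1;         // power of two
    const uint64_t mask = nbuckets - 1;

    std::vector<int64_t> bucket(nbuckets, -1), next(r.size(), -1);
    for (size_t i = 0; i < r.size(); ++i) {                 // build phase
        uint64_t h = r[i].key & mask;                       // toy hash
        next[i] = bucket[h];
        bucket[h] = int64_t(i);
    }
    for (const Tuple& t : s)                                // probe phase
        for (int64_t i = bucket[t.key & mask]; i != -1; i = next[i])
            if (r[size_t(i)].key == t.key) emit(r[size_t(i)], t);
}
```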

6.5 No-Partitioning Hash Join - n-part

  • The no-partitioning join is a direct parallel version of the canonical hash join. Both input relations are divided into equi-sized portions that are assigned to a number of worker threads. In the build phase, all worker threads populate a shared hash table with all tuples of R. After synchronization via a barrier, all worker threads enter the probe phase and concurrently find matching join partners for their assigned S portions.
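
A structural sketch of n-part in C++20 (names and the lock-free chained table are my simplifications; real implementations size the table from |R|, preallocate nodes, and use a real hash function): every worker inserts its R slice into the shared table with a CAS push, waits at a barrier, then probes with its S slice.

```cpp
#include <atomic>
#include <barrier>
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key, payload; };
struct Node  { Tuple tuple; Node* next; };

// Shared chained hash table; inserts are a lock-free CAS push per bucket.
// The bucket count must be a power of two.
struct SharedTable {
    std::vector<std::atomic<Node*>> buckets;
    uint64_t mask;
    explicit SharedTable(size_t n) : buckets(n), mask(n - 1) {}
    void insert(Node* node) {
        auto& head = buckets[node->tuple.key & mask];
        node->next = head.load(std::memory_order_relaxed);
        // On failure, node->next is refreshed to the current head.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }
};

// One worker of n-part: build from its R slice, synchronize with all
// other workers at the barrier, then probe with its S slice.
void worker(SharedTable& table, std::barrier<>& sync,
            std::vector<Node>& r_slice, const std::vector<Tuple>& s_slice,
            std::atomic<uint64_t>& matches) {
    for (Node& n : r_slice) table.insert(&n);            // build phase
    sync.arrive_and_wait();                              // build/probe barrier
    uint64_t local = 0;                                  // probe phase
    for (const Tuple& t : s_slice)
        for (Node* n = table.buckets[t.key & table.mask]
                           .load(std::memory_order_acquire);
             n != nullptr; n = n->next)
            if (n->tuple.key == t.key) ++local;
    matches += local;                                    // count, not materialize
}
```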

7. Experimental Setup

7.1 Workloads

7.2 System

8. Analysis of The Sort Phase

  • Overall, AVX sort runs between 2.5 and 3 times faster than the C++ sort (single thread). As the size of the input increases, both algorithms suffer due to the increasing number of trips to main memory.
  • In summary, the sort algorithm behaves well in comparison to existing ones and it does not affect the relative performance of the join algorithms.

9. Analysis of The Merge Phase

9.1 Modeling the Merge Phase

9.2 Performance of the Merge Phase

  • Merge performance decreases steeply with increasing fan-in. At higher fan-in values, multi-way merge becomes less effective as the number of tuples processed at once drops significantly.
  • Another reason for performance degradation can be observed through performance counters: the increasing depth of the merge tree means a logarithmically increasing number of executed instructions per tuple.

10. Optimizing The Merge Phase

10.1 Performance of the Partitioning Phase

  • The performance of the software-managed buffers strategy for high fan-outs is robust.
  • We may use the software-managed buffering technique from the partitioning phase in the merging phase.

10.2 Using Partitioning with Sort

  • The Partition-then-Sort approach achieves a throughput of up to 680 million tuples per second. More importantly, it shows stable performance with increasing input sizes, sorting 8 GB in less than 2 seconds.
  • The performance of the Sort-then-Merge approach drops significantly beyond table sizes of 256 M, mainly due to the increasing fan-in of the multi-way merge.
  • The main performance difference stems from partitioning vs. merging performance.

10.3 Alternative Implementations for Merge

  • The cooperative m-way approach follows the original idea by Chhugani, where there is a single multi-way merge tree and multiple threads cooperatively merge the data. Once the children of a node have enough data, the node becomes available for merging by any of the threads. This approach has a potential advantage: it increases the buffer space per merge node, as there is a single merge tree resident in the last-level cache.
  • The independent sorting approach follows the Partition-then-Sort idea discussed in Section 10.2. Each thread locally partitions its data and then sorts smaller individual partitions. In this case, the thread can independently merge the sorted runs without further interaction with other threads.

  • The cooperative multi-way merging does not scale for a variety of reasons: contention for upper-level nodes in the merge tree, idle threads due to lack of enough work, and synchronization overhead.
  • The independent sorting approach, in contrast, scales linearly up to the physical number of cores.
  • However, scalability in the hyper-threading region remains limited. This is commonly the case for hardware-conscious algorithms, where most of the physical hardware resources are shared among hyper-threads.

  • As a conclusion, even though each thread has only a fraction of the last-level cache for its multi-way merge, the synchronization-free nature of this approach shows its benefit, and independent sorting proves to be better than the originally proposed cooperative multi-way merging.

11. Sort-Merge Joins

We now compare the performance of the resulting sort-merge join operator (m-way) with that of mpsm and m-pass.

11.1 Comparison with Different Table Sizes

  • m-way runs significantly faster than the other options and is robust across different relation size ratios while producing fully sorted output.

11.2 Dissecting the Speedup of m-way

  • Up to 16 threads, the speedup from multi-way merge is ≈1.5X, because there is enough aggregate memory bandwidth for that number of threads. However, once the number of threads goes beyond 16, memory bandwidth per thread becomes limited and the benefit of multi-way merge grows to a factor of 2.
  • Speedup from AVX (over the same algorithm’s scalar variant) is between 2X and 2.3X.
  • The overall speedup of m-way over mpsm is ≈ 3X.

11.3 Scalability of Sort-based Join Algorithms

  • All the algorithms exhibit linear scaling behavior up to 16 physical CPU cores.
  • As all of these algorithms are cache- and CPU-resource-sensitive, the scaling with hyper-threads is rather limited.

12. Sort or Hash?

  • The best sorting-based join algorithm: m-way
  • The best hash join algorithm: radix

12.1 Comparison with Different Workloads

  • Hash-based join algorithms still maintain an edge over sort-based counterparts.
  • For significantly larger workloads, the picture becomes more favorable for sort-merge joins.
  • Since m-way produces fully sorted output, it could be a good choice for large data volumes that must be further processed.

12.2 Effect of Input Size

  • Sort-merge join approaches the performance of radix only when the input tables become significantly large. Radix join performance degrades with the increasing input size due to the increasing number of passes for the partitioning phase. Sort-merge join demonstrates robust performance due to the optimizations previously discussed in this paper.
  • Sorted output in query plans is an attractive argument to make sort-merge joins even more favorable in this range of the spectrum of input sizes.
  • The experiments show that both algorithms are insensitive to the variance in relation size ratios, being affected only by the size of the output as the amount of data grows.

12.3 Effect of Skew

  • For handling skew in parallel radix hash join, we previously proposed a fine-granular task decomposition method. The key idea is to parallelize the processing of larger partitions through a refined partitioning.
  • Phase 2 in Figure 4 of the paper, multi-way merging, is prone to skew. We handle the skew in two steps: (1) When creating a multi-way merge task for a thread, if the total size of the merge exceeds an expected average size, we create multiple merge tasks by finding boundaries within each of the sorted runs with binary search (a sketch follows this list). These tasks are then inserted into a NUMA-local task queue shared by the threads in the same NUMA region. (2) For extremely large tasks, we identify heavy hitters by computing an equi-depth histogram over the sorted runs (which is a fast operation), in a similar approach to Albutiu. Then the heavy hitters are directly copied to their output targets without any need for merging.
  • Because of these efficient skew-handling mechanisms, the comparison of sort vs. hash joins does not change significantly under skew.
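
A minimal sketch of the boundary-finding step in (1), assuming key-only runs: for a chosen splitting key, a binary search in every sorted run yields the cut positions that divide one oversized merge task into two independent ones. Choosing the splitting keys and managing the NUMA-local task queue are omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// For a given splitting key, return for each sorted run the index of the
// first element greater than the key. Tuples before the cut belong to one
// merge subtask, tuples after it to another; the two subtasks can then be
// merged independently by different threads.
std::vector<size_t> split_positions(
        const std::vector<std::vector<uint64_t>>& runs, uint64_t split_key) {
    std::vector<size_t> cut(runs.size());
    for (size_t r = 0; r < runs.size(); ++r)
        cut[r] = size_t(std::upper_bound(runs[r].begin(), runs[r].end(),
                                         split_key) - runs[r].begin());
    return cut;
}
```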

12.4 Scalability Comparison

  • Both algorithms show almost linear scalability up to the physical number of cores.
  • Only within the SMT (simultaneous multi-threading) region, algorithms scale poorly. This is an inevitable, well-known situation for hardware-conscious algorithms: SMT provides the illusion of running 2 threads on a single core where almost all of the resources are shared.
  • SMT scalability for radix join is different in the two workloads, mainly because of the optimal partitioning configuration.

12.5 Sort vs. Hash with All Algorithms

  • Radix hash join comes out as the fastest algorithm in this comparison.
  • mpsm and n-part in fact achieve similar performance. Optimized n-part is only ≈10-15% faster, while lacking the partially sorted output benefit of mpsm.
Written on November 8, 2017