# Rethinking SIMD Vectorization for In-Memory Databases

Reference: Polychroniou, Orestis, Arun Raghavan, and Kenneth A. Ross. "Rethinking SIMD vectorization for in-memory databases." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.

This is one of the suggested readings for Vectorized Execution (Part I) in CMU 15-721: Database Systems.

## 0. Abstract

Analytical databases are continuously adapting to the underlying hardware in order to saturate all sources of parallelism. At the same time, hardware evolves in multiple directions to explore different trade-offs. The MIC architecture, one such example, strays from the mainstream CPU design by packing a larger number of simpler cores per chip, relying on SIMD instructions to fill the performance gap. Databases have been attempting to utilize the SIMD capabilities of CPUs. However, mainstream CPUs have only recently adopted wider SIMD registers and more advanced instructions, since they do not rely primarily on SIMD for efficiency. In this paper, we present novel vectorized designs and implementations of database operators, based on advanced SIMD operations, such as gathers and scatters. We study selections, hash tables, and partitioning; and combine them to build sorting and joins. Our evaluation on the MIC-based Xeon Phi co-processor as well as the latest mainstream CPUs shows that our vectorization designs are up to an order of magnitude faster than the state-of-the-art scalar and vector approaches. Also, we highlight the impact of efficient vectorization on the algorithmic design of in-memory database operators, as well as the architectural design and power efficiency of hardware, by making simple cores comparably fast to complex cores. This work is applicable to CPUs and co-processors with advanced SIMD capabilities, using either many simple cores or fewer complex cores.

## 1. Introduction

• Real-time analytics is increasingly important.
• Large main-memory capacity enables in-memory databases.
• Column stores dominate analytical workloads.
• Three sources of parallelism: thread-level, instruction-level, and data parallelism.
• The MIC (Many Integrated Core) architecture packs many simple cores per chip and relies on SIMD to fill the performance gap.

Contributions

• We introduce design principles for efficient vectorization of in-memory database operators and define fundamental vector operations that are frequently reused.
• We design and implement vectorized selection scans, hash tables, and partitioning, that are combined to design and build sorting and multiple join variants.
• We compare our implementations against state-of-the-art scalar and vectorized techniques. We achieve up to an order of magnitude speedups by evaluating on Xeon Phi as well as on the latest mainstream CPUs.
• We show the impact of vectorization on the algorithmic design of in-memory operators, as well as the architectural design and power efficiency of hardware, by making simple cores comparably fast to complex cores.

## 2. Related Work

• SIMD instructions in database operators
• Optimizing database operators for modern processors
• SIMT GPUs

## 3. Fundamental Operations

Operations

• Selective stores: writing a specific subset of the vector lanes to a memory location contiguously. The subset of vector lanes to be written is decided using a vector or scalar register as the mask, which must not be limited to a constant.
• Selective loads: loading from a memory location contiguously to a subset of vector lanes based on a mask. The lanes that are inactive in the mask retain their previous values in the vector.
• Gather: loading values from non-contiguous locations. The inputs are a vector of indexes and an array pointer. The output is a vector with the values of the respective array cells. By adding a mask as an input operand, we define the selective gather that operates on a subset of lanes. The inactive mask lanes retain their previous contents.
• Scatter: executing stores to multiple locations. The input is a vector of indexes, an array pointer, and a vector of values. If multiple vector lanes point to the same location, we assume that the rightmost value will be written. By adding a mask as an input we can store lanes selectively.

• Gathers and scatters are not really executed in parallel because the (L1) cache allows one or two distinct accesses per cycle.
• Gathers are supported on the latest mainstream CPUs (Haswell) but scatters are not.
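The semantics of the four fundamental operations can be illustrated with scalar Python sketches. These model behavior only; real implementations use SIMD intrinsics (e.g., AVX-512 compress/expand, gather, and scatter instructions), and the function names here are illustrative, not from the paper.

```python
def selective_store(out, offset, vec, mask):
    """Write the active lanes of `vec` contiguously to out[offset:]."""
    for lane, active in zip(vec, mask):
        if active:
            out[offset] = lane
            offset += 1
    return offset  # new write position

def selective_load(vec, mask, src, offset):
    """Fill the active lanes of `vec` contiguously from src[offset:];
    inactive lanes retain their previous values."""
    result = list(vec)
    for i, active in enumerate(mask):
        if active:
            result[i] = src[offset]
            offset += 1
    return result, offset

def gather(array, indexes):
    """Load one value per lane from non-contiguous locations."""
    return [array[i] for i in indexes]

def scatter(array, indexes, values):
    """Store one value per lane; on index conflicts the rightmost
    (highest) lane wins, matching the paper's assumption."""
    for i, v in zip(indexes, values):  # later lanes overwrite earlier ones
        array[i] = v
```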

## 4. Selection Scans

• When we materialize data on RAM without intent to reuse them soon, we use streaming stores. Mainstream CPUs provide non-temporal stores that bypass the higher cache levels and increase the RAM bandwidth for storing data.
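A vectorized selection scan pairs a SIMD comparison with a selective store of the qualifying row ids. The following is a minimal Python sketch of that structure (the vector width W and the range predicate are assumptions for illustration; on real hardware the inner loop is a single compare plus compress-store):

```python
def selection_scan(keys, lo, hi, W=8):
    """Per vector of W keys: build a match mask with a SIMD compare,
    then selectively store the qualifying row ids contiguously."""
    out = []
    for base in range(0, len(keys), W):
        vec = keys[base:base + W]
        mask = [lo <= k <= hi for k in vec]          # SIMD compare
        out.extend(base + i for i, m in enumerate(mask) if m)  # selective store
    return out
```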

## 5. Hash Tables

• We place multiple keys per bucket and compare them to the probing key using SIMD vector comparisons.
• We term the approach of comparing a single input (probing) key against multiple hash table keys horizontal vectorization. If we expect to access fewer than W buckets per probing key on average, horizontal vectorization is wasteful; vertical vectorization instead processes W different probing keys in parallel, one bucket access each.
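Horizontal vectorization can be sketched in Python as one probe key compared against all W keys of a bucket at once (the function name and bucket layout are illustrative; on real hardware the loop body is a single SIMD comparison):

```python
def horizontal_probe(bucket_keys, probe_key):
    """Compare one probing key against the W keys of a bucket in a
    single (conceptual) SIMD comparison; return the matching lane or -1."""
    mask = [k == probe_key for k in bucket_keys]  # one vector compare
    return mask.index(True) if any(mask) else -1
```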

### 5.1 Linear Probing

Probe

Build

• To process the last items of the input we switch to scalar code; the cost is negligible, since at most $2 \cdot W$ items remain.
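The vertically vectorized probe can be sketched as follows: W probing keys advance in parallel, lanes that resolve (match or empty bucket) are refilled from the input via a selective load, and unresolved lanes advance to the next bucket. This is a scalar Python model of the control structure, with an assumed table layout of one key per slot and `None` marking empty slots:

```python
def lp_probe_vertical(table, keys, hash_fn, W=8):
    """Vertical vectorization sketch of a linear-probing probe.
    Returns {key: slot} for matches and {key: -1} for misses."""
    out = {}
    lanes = []                     # (key, slot) per active lane
    pos = 0
    n = len(table)
    while pos < len(keys) or lanes:
        # selective load: refill freed lanes with fresh probing keys
        while len(lanes) < W and pos < len(keys):
            k = keys[pos]; pos += 1
            lanes.append((k, hash_fn(k) % n))
        nxt = []
        for k, s in lanes:         # conceptually: gather buckets + SIMD compare
            if table[s] == k:
                out[k] = s         # match: lane becomes free
            elif table[s] is None:
                out[k] = -1        # empty bucket: miss, lane becomes free
            else:
                nxt.append((k, (s + 1) % n))  # collision: advance lane
        lanes = nxt
    return out
```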

### 5.2 Double Hashing

• Double hashing uses a second hash function to distribute collisions so that the number of accessed buckets is close to the number of true matches.
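The probe sequence of double hashing can be written down directly. This sketch assumes a power-of-two table size and forces an odd step so every slot is reachable (that trick is an assumption for the example, not a detail from the paper):

```python
def double_hash_slots(key, h1, h2, n, max_steps):
    """The i-th probe location is (h1(key) + i * h2(key)) mod n, so
    keys that collide on h1 still follow different probe sequences."""
    step = h2(key) | 1   # assumed: odd step for full-cycle probing when n is a power of two
    return [(h1(key) + i * step) % n for i in range(max_steps)]
```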

## 7. Partitioning

### 7.2 Range Histogram

• The range histogram processes W keys in parallel, performing a vectorized binary search against the range splitters.

### 7.3 Shuffling

• We could use the prefix sum of the histogram to maintain an array of partition offsets, to generate the output partitions in contiguous space.
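The histogram + prefix-sum structure can be sketched in Python; the second pass is a scatter to per-partition offsets (a SIMD scatter in the paper), with the partition function left as an assumed parameter:

```python
def shuffle(keys, fanout, part_fn):
    """Partition via histogram + exclusive prefix sum: offsets[p] is
    where partition p starts in the contiguous output; each key is
    scattered to out[offsets[part]] and the offset is bumped."""
    hist = [0] * fanout
    for k in keys:
        hist[part_fn(k)] += 1
    offsets, total = [], 0
    for c in hist:                 # exclusive prefix sum
        offsets.append(total); total += c
    out = [None] * len(keys)
    for k in keys:                 # scatter pass
        p = part_fn(k)
        out[offsets[p]] = k
        offsets[p] += 1
    return out
```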

### 7.4 Buffered Shuffling

Pitfalls of Shuffling

• It suffers from TLB thrashing when the partitioning fanout exceeds the TLB capacity.
• It generates many cache conflicts; in the worst case, throughput is bounded by the associativity of the cache.
• Normal stores trigger cache-line loads before the store can execute (write allocation), wasting bandwidth on lines that will only be overwritten.

• By keeping the data in buffers and flushing them in groups with W buffer slots per partition, we reduce cache and TLB misses to 1/W.
• To partition multiple columns of payloads, we can either shuffle all the columns together as a unified tuple or shuffle one column at a time. Shuffling unified tuples should optimally compile specific code for each query at run time. Otherwise, we can process one column at a time using pre-compiled type-specialized code.
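Buffered shuffling can be sketched by staging keys in a small per-partition buffer of W slots and flushing a whole buffer at once (on real hardware the flush is a single streaming store of a cache line; the buffer here is an ordinary Python list for illustration):

```python
def buffered_shuffle(keys, fanout, part_fn, W=4):
    """Stage keys in a cache-resident buffer of W slots per partition;
    flush to the output W at a time, cutting cache/TLB misses to ~1/W."""
    hist = [0] * fanout
    for k in keys:
        hist[part_fn(k)] += 1
    offsets, total = [], 0
    for c in hist:                 # exclusive prefix sum
        offsets.append(total); total += c
    out = [None] * len(keys)
    buf = [[] for _ in range(fanout)]

    def flush(p):                  # models one streaming store of W slots
        for v in buf[p]:
            out[offsets[p]] = v; offsets[p] += 1
        buf[p].clear()

    for k in keys:
        p = part_fn(k)
        buf[p].append(k)
        if len(buf[p]) == W:       # buffer full: flush the whole group
            flush(p)
    for p in range(fanout):        # flush partially filled buffers
        flush(p)
    return out
```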

## 8. Sorting

• Recent work showed that large-scale sorting is synonymous with partitioning.
• Radixsort and comparison sorting based on range partitioning have comparable performance, by maximizing the fanout to minimize the number of partition passes.
• We implement least-significant-bit (LSB) radixsort, which is the fastest method for 32-bit keys.
• Parallel LSB radixsort splits the input equally among threads and uses the prefix sum of the histograms from all threads to interleave the partition outputs.
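A single-threaded LSB radixsort is just repeated stable partitioning on successive digits, each pass being a histogram + prefix-sum + scatter over the next group of least-significant bits. A minimal Python sketch (digit width and key width are assumed parameters):

```python
def lsb_radixsort(keys, bits=8, key_bits=32):
    """Sort non-negative integers by stable-partitioning on `bits`
    least-significant bits per pass, from lowest digit to highest."""
    for shift in range(0, key_bits, bits):
        fanout = 1 << bits
        hist = [0] * fanout
        for k in keys:
            hist[(k >> shift) & (fanout - 1)] += 1
        offsets, total = [], 0
        for c in hist:             # exclusive prefix sum
            offsets.append(total); total += c
        out = [None] * len(keys)
        for k in keys:             # stable scatter pass
            p = (k >> shift) & (fanout - 1)
            out[offsets[p]] = k
            offsets[p] += 1
        keys = out
    return keys
```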

## 9. Hash Join

• Three hash join variants by degree of partitioning: no partitioning, minimal partitioning, and maximal (full) partitioning.

## 10. Experimental Evaluation

• Previous work has shown that joins, partitioning, and sorting are faster under skew.

### 10.1 Selection Scans

• On Xeon Phi, scalar code is almost an order of magnitude slower than vector code, whereas on Haswell, vector code is about twice as fast.

### 10.2 Hash Tables

• For linear probing & double hashing, we are up to 6X faster than everything else on Xeon Phi, and gain a smaller speedup for cache resident hash tables on Haswell.
• For cuckoo hashing, vertically vectorized code is 5X faster on Xeon Phi and 1.7X on Haswell.

### 10.3 Bloom Filters

• The speedup is 3.6-7.8X on Xeon Phi and 1.3-3.1X on Haswell.

### 10.4 Partitioning

• For radix and hash histogram generation, by replicating each count W times (one copy per vector lane, avoiding scatter conflicts), we get a 2.55X speedup over scalar radix.
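The count-replication idea can be sketched as follows: each partition count is replicated W times so that the W lanes of a vector increment disjoint copies, and the copies are summed at the end. This is a scalar Python model; the lane assignment `i % W` stands in for the vector lanes:

```python
def radix_histogram_replicated(keys, fanout, radix_fn, W=4):
    """Conflict-free histogram sketch: W replicated copies per count,
    one per lane, so concurrent lane updates never collide; the
    replicas are reduced to a single histogram at the end."""
    counts = [0] * (fanout * W)
    for i, k in enumerate(keys):
        lane = i % W                         # which lane processes this key
        counts[radix_fn(k) * W + lane] += 1  # lane-private copy
    return [sum(counts[p * W:(p + 1) * W]) for p in range(fanout)]
```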
• The binary search vectorization speedup is 7-15X on Xeon Phi and 2.4-2.8X on Haswell.
• For shuffling, on Xeon Phi, among the unbuffered versions, vectorization achieves up to 1.95X speedup. Among the scalar versions, buffering achieves up to 1.8X speedup. Among the buffered versions, vectorization achieves up to 2.85X speedup, using up to 60% of the bandwidth. The optimal fanout, maximizing throughput x bits, is 5-8 radix bits per pass.

### 10.5 Sorting & Hash Joins

#### 10.5.1 Vectorization Speedup & Algorithmic Designs

• For the performance of LSB radixsort on Xeon Phi, the vectorization speedup is 2.2X over state-of-the-art scalar code, and time scales with size.
• For three hash join variants on Xeon Phi, the “no-partition” and the “min-partition” methods get small 1.05X and 1.25X speedups, while the fully partitioned gets a 3.3X speedup.
• Hash join is faster than sort-merge join.
• For the thread scalability of radixsort and partitioned hash join on Xeon Phi, the speedup is almost linear.

#### 10.5.2 Aggregate Performance & Power Efficiency

• Radixsort and hash join are both ≈ 14% slower on the Phi compared to the four Sandy Bridge (SB) CPUs.
• If we assume the operating power of both platforms to be equally proportional to TDP, both radixsort and hash join are ≈ 1.5X more power efficient on the Phi.

## 11. SIMD CPUs & SIMT GPUs

• SIMT GPUs run scalar code, but “tie” all threads in a warp to execute the same instruction per cycle.
• While conditional control flow is eliminated "manually" in SIMD code, SIMT transforms control flow into data flow automatically, by executing all paths and nullifying instructions from paths that are not taken per thread.

Written on November 25, 2017