BitWeaving: Fast Scans for Main Memory Data Processing
Reference: Li, Yinan, and Jignesh M. Patel. "BitWeaving: fast scans for main memory data processing." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013. This is one of the several papers that belong to the suggested readings for Vectorized Execution (Part II) of CMU 15-721: Database Systems.
0. Abstract
This paper focuses on running scans in a main memory data processing system at “bare metal” speed. Essentially, this means that the system must aim to process data at or near the speed of the processor (the fastest component in most system configurations). Scans are common in main memory data processing environments, and with state-of-the-art techniques it still takes many cycles per input tuple to apply simple predicates on a single column of a table. In this paper, we propose a technique called BitWeaving that exploits the parallelism available at the bit level in modern processors. BitWeaving operates on multiple bits of data in a single cycle, processing bits from different columns in each cycle. Thus, bits from a batch of tuples are processed in each cycle, allowing BitWeaving to drop the cycles per column to below one in some cases. BitWeaving comes in two flavors: BitWeaving/V, which looks like a columnar organization but at the bit level, and BitWeaving/H, which packs bits horizontally. In this paper we also develop the arithmetic framework that is needed to evaluate predicates using these BitWeaving organizations. Our experimental results show that both these methods produce significant performance benefits over the existing state-of-the-art methods, and in some cases produce over an order of magnitude in performance improvement.
1. Introduction
 The insight behind our intra-cycle parallelism paradigm is recognizing that in a single processor clock cycle there is “abundant parallelism”, as the circuits in the processor core are simultaneously computing on multiple bits of information, even when working on simple ALU operations.
 BitWeaving comes in two flavors: BitWeaving/V and BitWeaving/H, corresponding to two underlying storage formats.
2. Overview
We use fixed-length encoding to convert column values to fixed-length codes; as a result, column scans can usually be evaluated directly on the codes. For predicates involving arithmetic or similarity comparisons, scans cannot be performed directly on the encoded values: these codes have to be decoded, and then evaluated in the conventional way.
2.1 Problem Statement
 A column-scalar scan takes as input a list of n k-bit codes and a predicate with a basic comparison, e.g. =, ≠, <, >, ≤, ≥, BETWEEN, on a single column. It finds all matching codes that satisfy the predicate, and outputs an n-bit vector, called the result bit vector, to indicate the matching codes.
 A processor word is a data block that can be processed as a unit by the processor. For ease of explanation, we initially assume that a processor word is an Arithmetic Logic Unit (ALU) word, i.e. a 64-bit word for modern CPUs. The instructions that process the processor word as a unit of computation are called full-word instructions.
 Bit-parallel method
2.2 Framework
 To evaluate a predicate consisting of arbitrary boolean combinations of basic comparisons, we traverse the predicate tree in depth-first order, performing the column-scalar comparison on each leaf node, and merging result bit vectors at each internal node based on the logical operator that is represented by the internal node.
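The depth-first traversal above can be sketched in a few lines. This is a hypothetical illustration (the names and tuple encoding are mine, not the paper's): leaves run a column-scalar scan, internal nodes AND/OR/NOT the result bit vectors, with plain Python ints standing in for the bit vectors.

```python
import operator

# Comparison operators a leaf node may carry.
OPS = {'=': operator.eq, '!=': operator.ne, '<': operator.lt,
       '>': operator.gt, '<=': operator.le, '>=': operator.ge}

def column_scalar_scan(codes, op, constant):
    # Naive stand-in for a BitWeaving column-scalar scan: bit i of the
    # result bit vector is set iff codes[i] satisfies the comparison.
    result = 0
    for i, code in enumerate(codes):
        if OPS[op](code, constant):
            result |= 1 << i
    return result

def evaluate(node, table, n):
    # Leaves are (column, op, constant); internal nodes are
    # ('AND'|'OR', left, right) or ('NOT', child). n is the tuple count.
    kind = node[0]
    if kind == 'AND':
        return evaluate(node[1], table, n) & evaluate(node[2], table, n)
    if kind == 'OR':
        return evaluate(node[1], table, n) | evaluate(node[2], table, n)
    if kind == 'NOT':
        return ~evaluate(node[1], table, n) & ((1 << n) - 1)
    col, op, const = node
    return column_scalar_scan(table[col], op, const)
```

For example, with `table = {'a': [1, 7, 3], 'b': [1, 1, 0]}`, the predicate `a < 5 AND b = 1` yields the bit vector `0b001`: only the first tuple matches both leaves.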
3. Bit-Parallel Methods
3.1 Overview of the Two Bit-Parallel Methods
 If we think of a code as a tuple consisting of multiple fields (bits), HBP and VBP can be viewed as row-oriented storage and column-oriented storage at the bit level.
 Our methods are optimized for both the number of CPU instructions that are needed to process the data, as well as the number of CPU cache lines that are occupied by the underlying (HBP or VBP) data representations.
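A toy sketch (my own, not the paper's code) makes the two layouts concrete: HBP packs whole k-bit codes side by side in one word, each in a (k+1)-bit field with a leading 0 delimiter bit, while VBP stores one word per bit position, most significant bit first.

```python
def hbp_word(codes, k):
    # HBP: code i occupies the low k bits of field i, i.e. bits
    # [i*(k+1), i*(k+1)+k); the delimiter bit of each field stays 0.
    w = 0
    for i, c in enumerate(codes):
        w |= c << (i * (k + 1))
    return w

def vbp_words(codes, k):
    # VBP: words[j] holds bit (k-1-j) of every code -- bit i of
    # words[j] comes from codes[i]. MSB plane first.
    words = []
    for j in range(k - 1, -1, -1):
        wj = 0
        for i, c in enumerate(codes):
            wj |= ((c >> j) & 1) << i
        words.append(wj)
    return words
```

With 3-bit codes `[0b101, 0b011]`, `hbp_word` packs them into the single word `0b0011_0101`, while `vbp_words` transposes them into the three bit planes `[0b01, 0b10, 0b11]`.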
3.2 The Horizontal Bit-Parallel (HBP) Method
3.2.1 Storage Layout
3.2.2 Column-Scalar Scans
 Basic idea
 Comparisons: inequality, equality, less than, less than or equal to …
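A hedged sketch of how such comparisons work on packed (k+1)-bit fields (code plus one delimiter bit, delimiters assumed 0 in the operands), using classic SWAR-style arithmetic. The paper's exact bitwise functions may differ in detail; this only shows the flavor: one full-word operation compares many codes at once, leaving the answers in the delimiter bits.

```python
def field_masks(k, n):
    # H: mask of the delimiter (top) bit of each field;
    # L: mask of the least significant bit of each field.
    H = L = 0
    for i in range(n):
        H |= 1 << (i * (k + 1) + k)
        L |= 1 << (i * (k + 1))
    return H, L

def hbp_neq(X, C, H, L):
    # Delimiter bit of field i is set iff the i-th codes of X and C
    # differ: XOR leaves a nonzero field exactly where they differ,
    # and (Z|H) - L clears the delimiter only for all-zero fields.
    Z = X ^ C
    return ((Z | H) - L) & H

def hbp_lt(X, C, H):
    # Delimiter bit of field i is set iff X's i-th code < C's i-th code:
    # (X|H) - C keeps the delimiter set exactly when C_i <= X_i,
    # so complementing (within H) yields strict less-than.
    return ~((X | H) - C) & H
```

For k = 3 and two fields, packing the codes [3, 5] gives `X = 0b0101_0011` and broadcasting the constant 4 gives `C = 0b0100_0100`; `hbp_lt` then sets only the delimiter bit of the field holding 3.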
3.3 The Vertical Bit-Parallel (VBP) Method
3.3.1 Storage Layout
3.3.2 Column-Scalar Scans
4. Early Pruning
The early pruning technique is orthogonal to the two bit-parallel methods described in Section 3.
4.1 Basic Idea Behind Early Pruning
 It is often not necessary to access all the bits of a code to compute the final result. For example, when comparing two codes, we can stop as soon as a pair of bits differs (starting from the most significant bit).
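The idea can be sketched as a VBP-style less-than scan (assumptions and names are mine) that visits bit planes from the most significant bit down and stops as soon as no code is still tied with the constant:

```python
def vbp_less_than(bit_planes, c_bits, n):
    # bit_planes[j]: bit i holds the j-th most significant bit of code i.
    # c_bits[j]: the j-th most significant bit of the constant.
    full = (1 << n) - 1
    lt = 0        # codes already known to be smaller than the constant
    eq = full     # codes still equal to the constant on all bits seen
    for plane, cb in zip(bit_planes, c_bits):
        if cb:
            lt |= eq & ~plane   # tied code with a 0 where c has a 1
        eq &= ~(plane ^ (full if cb else 0))  # drop codes that differ here
        if eq == 0:             # early pruning: the result is settled
            break
    return lt & full
```

For the 3-bit codes [5, 2, 7] and constant 6, the bit planes are `[0b101, 0b110, 0b101]` and the scan returns `0b011`: codes 5 and 2 are below 6, and the tie set empties before the last plane is fully needed.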
4.2 Estimating the Early Pruning Probability
 Fill factor: the number of codes that are present in a segment divided by the maximum number of codes in the segment, i.e. the width of the processor word w.
 The early pruning probability P(b) is defined as the probability that the w·f codes in a segment all differ from the constant in their most significant b bits, i.e. it is the probability that we can terminate the computation at bit position b.
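As a rough reconstruction (my own, under a uniform-distribution assumption; the paper's exact expression may differ in detail): each code matches the constant on its top b bits with probability 2^-b, so all w·f codes in a segment differ within those bits with probability (1 - 2^-b)^(w·f).

```python
def early_pruning_probability(b, w, f):
    # Probability that all w*f codes in a segment differ from the
    # constant within the first b bits, assuming uniform random codes.
    return (1.0 - 2.0 ** (-b)) ** (w * f)
```

For a fully packed 64-bit word (f = 1), this probability is already above 98% at b = 12, which is consistent with the observation later in the notes that execution time flattens beyond 12-bit codes.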
4.3 Filter Bit Vectors on Complex Predicates
 The result bit vector that is produced from a previous step is called the filter bit vector of the current column-scalar scan.
 Thus, the filter bit vector further reduces the computation in the current column-scalar scan (since the fill factor is smaller).
5. BitWeaving
5.1 BitWeaving/V
5.1.1 Storage Layout
5.1.2 Column-Scalar Scans
 We check the cutoff condition in only one of every B iterations of processing the k words of a segment. The purpose of this design is to reduce the cost of checking the condition, as well as the cost of the CPU branch mispredictions that this step triggers. We have observed that without this attention to branch prediction in the algorithm, the scans generally run up to 40% slower.
 We feed a filter bit vector into the column-scalar comparisons.
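The periodic cutoff check can be sketched as follows. This is a hypothetical loop structure (the step logic is a toy less-than scan, not the paper's code): the cutoff test runs only once every B iterations, so the branch, and any misprediction it causes, is amortized over B bit planes instead of being paid on every one.

```python
def scan_with_periodic_cutoff(bit_planes, c_bits, n, B):
    # Toy less-than scan over MSB-first bit planes; eq tracks codes
    # still tied with the constant, lt the codes already known smaller.
    full = (1 << n) - 1
    lt, eq = 0, full
    for j, (plane, cb) in enumerate(zip(bit_planes, c_bits)):
        if cb:
            lt |= eq & ~plane
        eq &= ~(plane ^ (full if cb else 0))
        # Test the cutoff condition only on every B-th iteration.
        if (j + 1) % B == 0 and eq == 0:
            break
    return lt
```

Larger B means fewer branch tests but potentially a few wasted iterations after the result is already settled; the paper tunes this trade-off empirically.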
5.2 BitWeaving/H
 The key difference is that we store each bit group in the HBP storage layout. For a column-scalar scan, we evaluate the comparison condition on bit groups starting from the one containing the most significant bits.
 However, the effect of the early pruning technique on HBP is offset by the high overhead of computing the additional bit vector, and it has an overall negative impact on performance (see Section 6.1.2). Therefore, the BitWeaving/H method is simply the HBP method.
5.3 BitWeaving/H and BitWeaving/V
6. Evaluation
6.1 Micro-Benchmark Evaluation
 We created a table R with a single column and one billion uniformly distributed integer values in this column.
 The total number of cycles for the query is measured by using the RDTSC instruction.
6.1.1 BitWeaving vs. the Other Methods
 The SIMD-scan achieves a 50%–75% performance improvement over the Naive method. Even though a SIMD instruction can process four 32-bit banks in parallel, the number of instructions drops by only 2.2–2.6X (over the Naive method), because extra instructions are needed to align packed codes into the four banks before any computation can run on that data. Furthermore, we observe that with SIMD instructions, the CPI (Cycles Per Instruction) increases from 0.37 to 0.56, which means that a single SIMD instruction takes more cycles to execute than an ordinary ALU instruction.
 The Bitsliced and the BL methods show a near-linear increase in run time as the code width increases. However, the storage layout of the Bitsliced method occupies many CPU cache lines for wider codes; the number of L3 cache misses quickly increases and hinders overall performance.
 The BitWeaving methods outperform all the other methods across all the code widths, because: (1) unlike the Naive and the SIMD-scan methods, they do not need to move data to appropriate positions before the predicate evaluation computation; (2) the BitWeaving methods are optimized for both cache misses and instructions due to their storage layouts and scan algorithms; (3) with the early pruning technique, the execution time of BitWeaving/V does not increase for codes that are wider than 12 bits.
6.1.2 Individual BitWeaving Components
 At certain points, VBP is up to 2X faster than HBP. For example, VBP is 2X faster than HBP for 32-bit codes, because HBP has to pad each 32-bit code to an entire 64-bit word to fit both the code and the delimiter bit. In spite of this, HBP and VBP generally show a similar performance trend as the code width increases.
 For wider codes, the early pruning technique quickly reduces the query execution time for VBP, and beyond 12 bits, the query execution time with early pruning is nearly constant.
 The effect of the early pruning technique on the HBP method is offset by the high overhead of computing the additional masks.
 Applying SIMD parallelism achieves marginal speedups for both the HBP and the VBP methods. We observed that the SIMD implementation reduces the number of instructions by 40%, but also increases the CPI by 1.5X.
 The lookup performance of the HBP method is the best, and its performance is stable across the code widths. For the methods with the early pruning technique, the bits of a code are distributed into various bit groups. Assembling a code requires access to data across multiple bit groups at different locations, which incurs many CPU cache misses, and thus significantly hurts the lookup performance.
6.3 TPC-H Evaluation
 The SIMD-scan method only achieves about a 20% performance improvement over the Naive method, mainly because the SIMD-scan method performs relatively poorly when evaluating BETWEEN predicates.
 The BL method runs at a much higher speed compared to the Naive and the SIMD methods.
 The BitWeaving methods (BW/H and BW/V) outperform all existing methods.
 BW/Hidx and BW/H have similar performance. The BW/Vidx method is about 30% faster than the BW/V method.
 In general, we observe that the BW/V method has higher performance for queries with predicates that involve many columns, involve wider columns, and have highly selective predicates.
7. Related Work
 Compression schemes
 SIMD
 Bitsliced index
 Processing packed data in parallel