BitWeaving: Fast Scans for Main Memory Data Processing

Reference: Li, Yinan, and Jignesh M. Patel. "BitWeaving: fast scans for main memory data processing." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.

This is one of several papers from the suggested readings for Vectorized Execution (Part II) of CMU 15-721: Database Systems.

0. Abstract

This paper focuses on running scans in a main memory data processing system at “bare metal” speed. Essentially, this means that the system must aim to process data at or near the speed of the processor (the fastest component in most system configurations). Scans are common in main memory data processing environments, and with the state-of-the-art techniques it still takes many cycles per input tuple to apply simple predicates on a single column of a table. In this paper, we propose a technique called BitWeaving that exploits the parallelism available at the bit level in modern processors. BitWeaving operates on multiple bits of data in a single cycle, processing bits from different columns in each cycle. Thus, bits from a batch of tuples are processed in each cycle, allowing BitWeaving to drop the cycles per column to below one in some cases. BitWeaving comes in two flavors: BitWeaving/V, which looks like a columnar organization but at the bit level, and BitWeaving/H, which packs bits horizontally. In this paper we also develop the arithmetic framework that is needed to evaluate predicates using these BitWeaving organizations. Our experimental results show that both these methods produce significant performance benefits over the existing state-of-the-art methods, and in some cases produce over an order of magnitude in performance improvement.

1. Introduction

  • The insight behind our intra-cycle parallelism paradigm is recognizing that in a single processor clock cycle there is “abundant parallelism” as the circuits in the processor core are simultaneously computing on multiple bits of information, even when working on simple ALU operations.
  • BitWeaving comes in two flavors: BitWeaving/V and BitWeaving/H, corresponding to two underlying storage formats.

2. Overview

We use fixed-length encoding to convert column values into fixed-length codes; as a result, column scans can usually be evaluated directly on the codes. For example, with an order-preserving encoding a predicate such as price < 100 becomes code < c, where c is the code assigned to 100. For predicates that involve arithmetic or similarity comparisons, scans cannot be performed directly on the codes: the codes have to be decoded first and then evaluated in the conventional way.

2.1 Problem Statement

  • A column-scalar scan takes as input a list of n k-bit codes and a predicate with a basic comparison, e.g. =, ≠, <, >, ≤, ≥, BETWEEN, on a single column. It finds all codes that satisfy the predicate and outputs an n-bit vector, called the result bit vector, with one bit per code to indicate the matching codes.
  • A processor word is a data block that can be processed as a unit by the processor. For ease of explanation, we initially assume that a processor word is an Arithmetic Logic Unit (ALU) word, i.e. a 64-bit word for modern CPUs. The instructions that process a processor word as a unit of computation are called full-word instructions.
  • Bit-parallel method: pack (bits of) multiple codes into processor words and run the scan entirely with full-word instructions, so that each instruction operates on many codes at once.

2.2 Framework

  • To evaluate a predicate consisting of an arbitrary boolean combination of basic comparisons, we traverse the predicate tree in depth-first order, performing a column-scalar comparison at each leaf node and merging result bit vectors at each internal node according to the logical operator that the node represents (see the sketch below).
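
As a concrete illustration, here is a minimal sketch (names are mine, not the paper's) of the merge at an internal node, operating word by word over the children's result bit vectors:

```c
#include <stdint.h>
#include <stddef.h>

/* Merge the result bit vectors of two sub-predicates at an internal
   node of the predicate tree: an AND node intersects them; an OR
   node would use `|` instead. One bit per input code. */
void merge_and(const uint64_t *lhs, const uint64_t *rhs,
               uint64_t *out, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        out[i] = lhs[i] & rhs[i];
}
```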

3. Bit-Parallel Methods

3.1 Overview of the Two Bit-Parallel Methods

  • If we think of a code as a tuple consisting of multiple fields (bits), HBP and VBP can be viewed as row-oriented storage and column-oriented storage at the bit level, respectively.
  • Our methods are optimized for both the number of CPU instructions that are needed to process the data, as well as the number of CPU cache lines that are occupied by the underlying (HBP or VBP) data representations.

3.2 The Horizontal Bit-Parallel (HBP) Method

3.2.1 Storage Layout

3.2.2 Column-Scalar Scans

  • Basic idea: pack as many (k+1)-bit units into a processor word as fit, where each unit is a k-bit code preceded by one delimiter bit; full-word arithmetic and logical instructions then evaluate the comparison on all codes in the word at once, leaving the per-code results in the delimiter bits.
  • Comparisons: inequality, equality, less than, less than or equal to, etc., each compile down to a handful of full-word instructions (see the sketch below).
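
Below is a minimal, self-contained sketch of this trick for x < c (the code width K = 3 and the sample values are arbitrary; the paper's actual HBP layout also reorders codes across the words of a segment, which this sketch ignores). Per field, (2^k − 1 − x) + c carries into the delimiter bit exactly when x < c, so one full-word computation yields all the result bits:

```c
#include <stdint.h>
#include <stdio.h>

enum { K = 3 };                 /* code width, an example value   */
enum { FIELD = K + 1 };         /* field width incl. delimiter    */
enum { NCODES = 64 / FIELD };   /* codes packed per 64-bit word   */

/* Replicate a (k+1)-bit field value across the whole word. */
static uint64_t repeat_field(uint64_t v)
{
    uint64_t w = 0;
    for (int i = 0; i < NCODES; i++)
        w |= v << (i * FIELD);
    return w;
}

int main(void)
{
    uint64_t codes[NCODES] = {5, 1, 7, 2, 0, 6, 3, 4,
                              5, 1, 2, 6, 7, 0, 3, 4};

    /* Pack codes horizontally, a 0 delimiter bit above each code. */
    uint64_t X = 0;
    for (int i = 0; i < NCODES; i++)
        X |= codes[i] << (i * FIELD);

    uint64_t c      = 4;                              /* constant     */
    uint64_t Y      = repeat_field(c);                /* c replicated */
    uint64_t mask01 = repeat_field((1ull << K) - 1);  /* 0111 fields  */
    uint64_t mask10 = repeat_field(1ull << K);        /* 1000 fields  */

    /* Per field: (2^k - 1 - x) + c carries into the delimiter bit
       iff x < c, so the delimiter bits form the result bit vector. */
    uint64_t result = ((X ^ mask01) + Y) & mask10;

    for (int i = 0; i < NCODES; i++)
        printf("%llu < %llu : %d\n",
               (unsigned long long)codes[i], (unsigned long long)c,
               (int)((result >> (i * FIELD + K)) & 1));
    return 0;
}
```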

3.3 The Vertical Bit-Parallel (VBP) Method

3.3.1 Storage Layout

3.3.2 Column-Scalar Scans
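
In VBP, a segment of w codes (w = 64 for a 64-bit processor word) is stored transposed: word i of the segment holds bit i, counting from the most significant bit, of every code in the segment. A column-scalar scan then walks the bit slices from the most significant bit down, maintaining "still equal" and "already less" masks with one bit per code. The following is a minimal sketch, assuming K = 4 and made-up data, not the paper's kernel:

```c
#include <stdint.h>
#include <stdio.h>

enum { K = 4 };   /* code width (an example value) */

/* VBP column-scalar scan for `code < c` over one 64-code segment.
   vbp[i] holds bit i (counting from the most significant bit) of
   all 64 codes: bit j of vbp[i] belongs to code j. */
static uint64_t vbp_scan_lt(const uint64_t vbp[K], uint64_t c)
{
    uint64_t eq = ~0ull;   /* codes still equal to c so far          */
    uint64_t lt = 0;       /* codes already known to be less than c  */
    for (int i = 0; i < K; i++) {
        /* Broadcast the constant's bit for this slice to all lanes. */
        uint64_t ci = ((c >> (K - 1 - i)) & 1) ? ~0ull : 0;
        lt |= eq & ~vbp[i] & ci;   /* first differing bit: 0 vs 1    */
        eq &= ~(vbp[i] ^ ci);      /* keep only still-equal codes    */
    }
    return lt;
}

int main(void)
{
    uint64_t codes[64], vbp[K] = {0};
    for (int j = 0; j < 64; j++)
        codes[j] = j % 16;                     /* sample 4-bit codes */

    /* Transpose the codes into the VBP layout. */
    for (int i = 0; i < K; i++)
        for (int j = 0; j < 64; j++)
            vbp[i] |= ((codes[j] >> (K - 1 - i)) & 1ull) << j;

    uint64_t lt = vbp_scan_lt(vbp, 9);
    for (int j = 0; j < 12; j++)               /* spot-check results */
        printf("%llu < 9 : %d\n", (unsigned long long)codes[j],
               (int)((lt >> j) & 1));
    return 0;
}
```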

4. Early Pruning

The early pruning technique is orthogonal to the two bit-parallel methods described in Section 3.

4.1 Basic Idea Behind Early Pruning

  • It is often not necessary to access all the bits of a code to compute the final result. For example, when comparing a code against a constant, we can stop as soon as a bit differs, scanning from the most significant bit down.

4.2 Estimating the Early Pruning Probability

  • Fill factor f: the number of codes that are present in a segment divided by the maximum number of codes in the segment, i.e. the width of the processor word w.
  • The early pruning probability P(b) is defined as the probability that the w·f codes in a segment all differ from the constant in the most significant b bits, i.e. the probability that we can terminate the computation at bit position b.
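  • Under the paper's uniform-distribution assumption, each code matches the constant on its b most significant bits with probability (1/2)^b, so P(b) = (1 − 2^(−b))^(w·f). This approaches 1 quickly as b grows: with w = 64 and f = 1, P(12) = (1 − 2^(−12))^64 ≈ 0.98, which is why pruning after the first dozen or so bits is usually possible.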

4.3 Filter Bit Vectors on Complex Predicates

  • The result bit vector that is produced from a previous step is called the filter bit vector of the current column-scalar scan.
  • Thus, the filter bit vector further reduces the computation in the current column-scalar scan: filtered-out codes lower the effective fill factor, which raises the early pruning probability.

5. BitWeaving

5.1 BitWeaving/V

5.1.1 Storage Layout

5.1.2 Column-Scalar Scans

  • We check the cut-off condition only once every B iterations while processing the k words of a segment. This design reduces both the cost of the check itself and the cost of the CPU branch mispredictions that it triggers; without this attention to branch prediction, we observed scans running up to 40% slower.
  • We feed a filter bit vector into the column-scalar comparisons (sketched below).
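
A sketch of how both ideas could fit into an equality scan (simplified to a single segment; function and parameter names are mine, not the paper's): the undecided mask is seeded with the filter bit vector, so filtered-out codes never participate, and the cut-off test runs only once per B bit slices:

```c
#include <stdint.h>

enum { K = 16, B = 4 };   /* example code width and bit-group size */

/* VBP equality scan over one 64-code segment, with early pruning and
   a filter bit vector. vbp[i] is bit slice i (most significant bit
   first); cbits[i] is the constant's bit i broadcast to all lanes
   (all ones or all zeros). */
uint64_t vbp_scan_eq_pruned(const uint64_t vbp[K],
                            const uint64_t cbits[K],
                            uint64_t filter)
{
    /* Codes filtered out by an earlier predicate start as "decided",
       which lowers the effective fill factor and lets the scan cut
       off earlier. */
    uint64_t eq = filter;
    for (int i = 0; i < K; i++) {
        eq &= ~(vbp[i] ^ cbits[i]);
        /* Check the cut-off condition only once every B bit slices,
           amortizing the branch (and its mispredictions). */
        if ((i + 1) % B == 0 && eq == 0)
            break;
    }
    return eq;   /* result bit vector: codes equal to the constant */
}
```

In the paper's actual layout, bit groups are stored separately across many segments, so the slices of a pruned group are never even fetched from memory; this sketch only shows the control flow within one segment.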

5.2 BitWeaving/H

  • The key difference is that we store each bit group in the HBP storage layout. For a column-scalar scan, we evaluate the comparison condition on bit groups starting from the one containing the most significant bits.
  • However, the benefit of early pruning on HBP is offset by the high overhead of computing the additional bit vector, and the overall impact on performance is negative (see Section 6.1.2). Therefore, the BitWeaving/H method is simply the HBP method.

5.3 BitWeaving/H and BitWeaving/V

6. Evaluation

6.1 Micro-Benchmark Evaluation

  • We created a table R with a single column and one billion uniformly distributed integer values in this column.
  • The total number of cycles for the query is measured using the RDTSC instruction (a sketch of the idiom is shown below).
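
For reference, a minimal version of this measurement idiom using the GCC/Clang intrinsic for RDTSC (x86 only; run_scan is a placeholder for the code under test, not anything from the paper):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang, x86 only */

/* Wrap the code under test with two TSC reads to count elapsed
   cycles, in the spirit of the paper's measurement methodology. */
uint64_t cycles_for(void (*run_scan)(void))
{
    uint64_t start = __rdtsc();
    run_scan();
    return __rdtsc() - start;
}
```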

6.1.1 BitWeaving vs. the Other Methods

  • The SIMD-scan achieves a 50%-75% performance improvement over the Naive method. Even though a SIMD instruction can process four 32-bit banks in parallel, the number of instructions drops by only 2.2-2.6X (over the Naive method), because extra instructions are needed to align packed codes into the four banks before any computation can run on that data. Furthermore, we observe that with SIMD instructions the CPI (cycles per instruction) increases from 0.37 to 0.56, meaning that a single SIMD instruction takes more cycles to execute than an ordinary ALU instruction.
  • The Bit-sliced and the BL methods show a near-linear increase in run time as the code width increases. However, the storage layout of the Bit-sliced method occupies many CPU cache lines for wider codes; the number of L3 cache misses quickly increases and hinders overall performance.
  • The BitWeaving methods outperform all the other methods across all the code widths, because: (1) unlike the Naive and the SIMD-scan methods, they do not need to move data to appropriate positions before the predicate evaluation computation; (2) the BitWeaving methods are optimized for both cache misses and instructions due to their storage layouts and scan algorithms; (3) with the early pruning technique, the execution time of BitWeaving/V does not increase for codes that are wider than 12 bits.

6.1.2 Individual BitWeaving Components

  • At certain points, VBP is up to 2X faster than HBP. For example, VBP is 2X faster than HBP for 32-bit codes, because HBP has to pad a 32-bit code to an entire 64-bit word to fit both the code and the delimiter bit. In spite of this, HBP and VBP generally show a similar performance trend as the code width increases.
  • For wider codes, the early pruning technique quickly reduces the query execution time for VBP, and beyond 12 bits, the query execution time with early pruning is nearly constant.
  • The effect of the early pruning technique on the HBP method is offset by the high overhead of computing the additional masks.
  • Applying SIMD parallelism achieves only marginal speedups for both the HBP and the VBP methods. We observed that the SIMD implementation reduces the number of instructions by 40%, but also increases the CPI by 1.5X.
  • The lookup performance of the HBP method is the best, and its performance is stable across the code widths. For the methods with the early pruning technique, the bits of a code are distributed into various bit groups. Assembling a code requires access to data across multiple bit groups at different locations, which incurs many CPU cache misses, and thus significantly hurts the lookup performance.

6.3 TPC-H Evaluation

  • The SIMD-scan method only achieves about 20% performance improvement over the Naive method, mainly because the SIMD-scan method performs relatively poorly when evaluating the BETWEEN predicates.
  • The BL method runs at a much higher speed compared to the Naive and the SIMD methods.
  • The BitWeaving methods (BW/H and BW/V) outperform all existing methods.

  • BW/H-idx and BW/H have similar performance. The BW/V-idx method is about 30% faster than the BW/V method.
  • In general, we observe that the BW/V method has higher performance for queries with predicates that involve many columns, involve wider columns, and have highly selective predicates.

7. Related Work

  • Compression schemes
  • SIMD
  • Bit-sliced index
  • Processing packed data in parallel
Written on November 30, 2017