Efficiently Compiling Efficient Query Plans for Modern HardwareReference: Neumann, Thomas. "Efficiently compiling efficient query plans for modern hardware." Proceedings of the VLDB Endowment 4.9 (2011): 539-550.
This is one of the several papers belong to suggested readings for Parallel Join Algorithms (Query Compilation) of CMU 15-721: Database Systems.
As main memory grows, query performance is more and more determined by the raw CPU costs of query processing itself. The classical iterator style query processing technique is very simple and flexible, but shows poor performance on modern CPUs due to lack of locality and frequent instruction mis-predictions. Several techniques like batch oriented processing or vectorized tuple processing have been proposed in the past to improve this situation, but even these techniques are frequently out-performed by hand-written execution plans.
In this work we present a novel compilation strategy that translates a query into compact and efficient machine code using the LLVM compiler framework. By aiming at good code and data locality and predictable branch layout the resulting code frequently rivals the performance of hand-written C++ code. We integrated these techniques into the HyPer main memory database system and show that this results in excellent query performance while requiring only modest compilation time.
- The traditional way to execute these algebraic plans is the iterator model, sometimes also called Volcano-style processing: Every physical algebraic operator conceptually produces a tuple stream from its input, and allows for iterating over this tuple stream by repeatedly calling the next function of the operator.
- The disadvantages of Volcano-style processing: (1) the next function will be called for every single tuple produced as intermediate or final result, i.e., millions of times; (2) the call to next is usually a virtual call or a call via a function pointer. Consequently, the call is even more expensive than a regular call and degrades the branch prediction of modern; (3) this model often results in poor code locality and complex book-keeping.
- The block-oriented processing amortizes the costs of calling another operator over the large number of produced tuples, such that the invocation costs become negligible, and it could also allow for vectorized operations.
However, it also eliminates the ability to pipeline data. Pipelining means that an operator can pass data to its parent operator without copying or otherwise materializing the data. When producing more than one tuple during a call this pure pipelining usually cannot be used any more, as the tuples have to be materialized somewhere to be accessible.
- An interesting observation in this context is that a hand-written program clearly outperforms even very fast vectorized systems.
In this paper, we propose a query compilation strategy that differs from existing approaches in several important ways.
- Processing is data centric and not operator centric. Data is processed such that we can keep it in CPU registers as long as possible. Operator boundaries are blurred to achieve this goal.
- Data is not pulled by operators but pushed towards the operators. This results in much better code and data locality.
- Queries are compiled into native machine code using the optimizing LLVM compiler framework.
2. Related Work
- Volcano-style(Iterator) Model
- Block-oriented Processing
- The MonetDB system materializes all intermediate results, which eliminates the need to call an input operator repeatedly.
- Another way to improve query processing is to compile the query into some kind of executable format, instead of using interpreter structures.
- Reducing the impact of branching
- Processing individual expressions more efficiently by using SIMD instructions
3. The Query Optimizer
3.1 Query Processing Architecture
- An algebraic operator is a pipeline breaker for a given input side if it takes an incoming tuple out of the CPU registers. It is a full pipeline breaker if it materializes all incoming tuples from this side before continuing processing.
- Data is always pushed from one pipeline-breaker into another pipeline-breaker.
3.2 Compiling Algebraic Expressions
- The produce function asks the operator to produce its result tuples, which are then pushed towards the consuming operator by calling their consume functions.
4. Code Generation
4.1 Generating Machine Code
- Advantages of using LLVM: (1) producing assembler code using LLVM is much more robust than writing it manually; (2) the LLVM assembler is portable across machine architectures, as only the LLVM JIT compiler translates the portable LLVM assembler into architecture dependent machine code; (3) the LLVM assembler is strongly typed, which caught many latent bugs; (4) LLVM is a full strength optimizing compiler, which produces extremely fast machine code, and usually requires only a few milliseconds for query compilation, while C or C++ compilers would need seconds.
- For optimal performance it is important that the hot path, i.e., the code that is executed for 99% of the tuples, is pure LLVM.
- When calling an external function all registers have to be spilled to memory, which is somewhat expensive. In absolute terms it is very cheap, of course, as the registers will be spilled on the stack, which is usually in cache, but if this is done millions of times it becomes noticeable.
4.2 Complex Operators
- It is not possible or even desirable to compile a complex query into a single function, because: (1) there is the pragmatic reason that the LLVM code will most likely call C++ code at some point that will take over the control flow; (2) in-lining the complete query logic into a single function can lead to an exponential growth in code.
- A pipelining fragment of the algebraic expression should result in one compact LLVM code fragment.
4.3 Performance Tuning
- Two critical issues: (1) hashing; (2) branches.
5. Advanced Parallelization Techniques
- Processing more than one tuple at a time has several advantages: (1) it allows for using SIMD instructions on modern CPUs; (2) it can help delay branching, as predicates can be evaluated and combined without executing branches immediately.
- For our code generation framework, it is simple to support intra-query parallelism.
- TPC-CH Benchmark: for the OLTP side it runs a TPC-C benchmark, and for the OLAP side it executes the 22 TPC-H queries adapted to the (slightly extended) TPC-C schema.
6.1 System Comparison
- For OLTP, the performance of the LLVM version is slightly better than performance of optimized C++ code, and the compile time of LLVM is more than a factor of ten faster than using C++.
- For OLAP, HyPer with the LLVM code generation is frequently another factor 2-4 faster, and its compile time is still much shorter compared with C++.
6.2 Code Quality
- To study the effects of branching and cache, we ran all five queries using the callgrind tool of valgrind 3.6.0, which uses binary instrumentation to observe branches and cache effects. We used callgrind_control to limit profiling to query processing itself.
- Our generated LLVM code contains far less branches than the MonetDB code, since we try to generate all code up to the next pipeline breaker in one linear code fragment.
- The number of branch mis-predictions is significantly lower for the LLVM code. The relative mis-prediction rate of MonetDB is quite good, but in total MonetDB executes far too many branches and thus has many mis-predictions, too.
- For most queries the level 1 data cache misses (D1) and level 2 data misses (L2d) are very similar, which means that if data is not in the level 1 cache it is probably not in the level 2 cache either. This is expected behavior for very large hash tables. Again, the LLVM code shows a very good data locality and has less cache misses than MonetDB.
- The number of executed instructions for the generated LLVM code is much more smaller than the MonetDB code, which means it is more compact.