Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core AgeReference: Leis, Viktor, et al. "Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age." Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014.
This is one of the several papers belong to suggested readings for Execution & Scheduling of CMU 15-721: Database Systems.
With modern computer architecture evolving, two problems conspire against the state-of-the-art approaches in parallel query execution: (i) to take advantage of many-cores, all query work must be distributed evenly among (soon) hundreds of threads in order to achieve good speedup, yet (ii) dividing the work evenly is difficult even with accurate data statistics due to the complexity of modern out-of-order cores. As a result, the existing approaches for “plandriven” parallelism run into load balancing and context-switching bottlenecks, and therefore no longer scale. A third problem faced by many-core architectures is the decentralization of memory controllers, which leads to Non-Uniform Memory Access (NUMA).
In response, we present the “morsel-driven” query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. Morsel-driven query processing takes small fragments of input data (“morsels”) and schedules these to worker threads that run entire operator pipelines until the next pipeline breaker. The degree of parallelism is not baked into the plan but can elastically change during query execution, so the dispatcher can react to execution speed of different morsels but also adjust resources dynamically in response to newly arriving queries in the workload. Further, the dispatcher is aware of data locality of the NUMA-local morsels and operator state, such that the great majority of executions takes place on NUMA-local memory. Our evaluation on the TPC-H and SSB benchmarks shows extremely high absolute performance and an average speedup of over 30 with 32 cores.
- The main impetus of hardware performance improvement nowadays comes from increasing multi-core parallelism rather than from speeding up single-threaded performance.
- With increasing memory capacity, the NUMA division of the RAM has to be considered to make sure that threads work on NUMA-local data.
- Abundant research in the 1990s into parallel processing adopted “plan-driven” Volcano model: the optimizer statically determines at query compilation time how many threads should run, instantiate one query operator plan for each thread, and connects these with exchange operators.
- Morsel-driven query execution is a new parallel query evaluation framework that fundamentally differs from the traditional Volcano model in that it distributes work between threads dynamically using work-stealing. This prevents unused CPU resources due to load imbalances, and allows for elasticity, i.e., CPU resources can be reassigned between different queries at any time.
- A set of fast parallel algorithms for the most important relational operators.
- A systematic approach to integrating NUMA-awareness into database systems.
2. Morsel-Driven Execution
- HyPer uses Just-In-Time (JIT) compilation to generate highly efficient machine code. Each pipeline segment, including all operators, is compiled into one code fragment. Further, the operators in the pipelines do not even materialize their intermediate results.
- The morsel-driven execution of the algebraic plan is controlled by a so called QEPobject which transfers executable pipelines to a dispatcher.
In order to write NUMA-locally and to avoid synchronization while writing intermediate results the QEPobject allocates a storage area for each such thread/core for each executable pipeline.
- Morsel is the basic execution unit, and the scheduler will try to schedule a morsel located on the same socket where the thread is executed.
- To preserve NUMA-locality in further processing stages, the storage area of a particular core is locally allocated on the same socket.
- The perfectly sized global hash table will be probed by threads located on various sockets of a NUMA system, thus to avoid contention, it should not reside in a particular NUMA-area and is therefore is interleaved (spread) across all sockets, and it should also be lock-free.
3. Dispatcher: Scheduling Parallel Pipeline Tasks
- The dispatcher is controlling and assigning the compute resources to the parallel pipelines. This is done by assigning tasks to worker threads. We (pre-)create one worker thread for each hardware thread that the machine provides and permanently bind each worker to it.
- A task that is assigned to such a worker thread consists of a pipeline job and a particular morsel on which the pipeline has to be executed.
- We experimentally determined that a morsel size of about 100,000 tuples yields good tradeoff between instant elasticity adjustment, load balancing and low maintenance overhead.
- Three main goals for assigning tasks to threads that run on particular cores: (1) locality; (2) elasticity; (3) load balance.
- Morsel-wise processing allows to reassign cores to different pipeline jobs without any drastic interrupt mechanism.
3.2 Implementation Overview
- The dispatcher is implemented with a shared lock-free data structure that is executed on the worker thread that requested the task. QEPobject is implemented as a passive state machine executed on the worker thread.
- All threads working on the same pipeline job run to completion in a “photo finish”: they are guaranteed to reach the finish line within the time period it takes to process a single morsel.
- If, for some reason, a core finishes processing all morsels on its particular socket, the dispatcher will “steal work” from another core, i.e., it will assign morsels on a different socket.
- We currently avoid to execute multiple pipelines from one query in parallel, since the number of independent pipelines is usually much smaller than the number of cores, and it also reduces cache locality.
- Besides elasticity, morsel-driven processing also allows for a simple and elegant implementation of query canceling.
3.3 Morsel Size
- Our work-stealing data structure won’t become a bottle on multi-core systems.
4. Parallel Operator Details
4.1 Hash Join
- Under our setting, our single-table hash join is better than radix join.
4.2 Lock-Free Tagged Hash Table
- As shown in Figure 7, we encode a tag directly into 16 bits of each pointer in the hash table. This saves space and, more importantly, allows to update both the pointer and the tag using a single atomic compare-and-swap operation.
- Our tagging technique has a number of advantages in comparison to Bloom filters.
The hash table array only stores pointers, and not the tuples themselves, i.e., we do not use open addressing.
- We use large virtual memory pages (2MB) both for the hash table and the tuple storage areas. This has several positive effects: The number of TLB misses is reduced, the page table is guaranteed to fit into L1 cache, and scalability problems from too many kernel page faults during the build phase are avoided.
- We allocate the hash table using the Unix mmap system call, if available. Modern operating systems do not eagerly allocate the memory immediately, but only when a particular page is first written to.
4.3 NUMA-Aware Table Partitioning
- In order to implement NUMA-local table scans, we can partition the tuples based on some attributes, such ta join between two tables that are both partitioned on the join key, matching tuples usually reside on the same socket.
- This co-location scheme is beneficial but not decisive for the high performance of morsel-driven execution.
- The performance characteristics of the aggregation operator differs very much depending on the number of groups (distinct keys).
- In the first phase, thread-local pre-aggregation efficiently aggregates heavy hitters using a thread-local, fixed-sized hash table.
- The second phase consists of each thread scanning a partition and aggregating it into a thread-local hash table.
- In our parallel sort operator each thread first materializes and sorts its input locally and in place.
- After local sort, the parallel merge phase begins, the difficulty lies in computing separators, so that merges are independent and can be executed in parallel without synchronization. To do this, each thread first computes local separators by picking equidistant keys from its sorted run. Then, to handle skewed distribution and similar to the median-of-medians algorithm, the local separators of all threads are combined, sorted, and the eventual, global separator keys are computed. After determining the global separator keys, binary (or interpolation) search finds the indexes of them in the data arrays. Using these indexes, the exact layout of the output array can be computed.
- We integrated our parallel query evaluation framework into HyPer, a main-memory column database system that supports SQL-92 and has very good single-threaded performance, but, so far, did not use intra-query parallelism.
- In this evaluation we focus on ad hoc decision support queries, and, except for declaring primary keys, do not enable any additional index structures. Therefore, our results mainly measure the performance and scalability of the table scan, aggregation, and join (including outer, semi, anti join) operators.
5.1 Experimental Setup
- As our main competitor we chose the official single-server TPC-H leader Vectorwise
- The scalability issue of Vectorwise (and Oracle and SQL Server) are related to the use of the Volcano model for parallelizing queries.
- We point out that fixed work division combined with lack of NUMA-awareness can lead to significant performance differences between threads.
- Performance is significantly lower when we disable explicit NUMA-awareness and rely on the operating system instead.
- A further performance penalty can be observed, if we additionally disable adaptive morsel-wise processing and the performance enhancements introduced in this paper like hash tagging.
5.3 NUMA Awareness
- The importance of NUMA-awareness clearly depends on the speed and number of the cross-socket interconnects.
- To demonstrate the elasticity of our approach, we performed an experiment where we varied the number parallel query streams.
5.5 Star Schema Benchmark
- Hyper performs better on SSB than TPC-H, because TPC-H is more complex and challenging.
6. Related Works
- Multi-core join
- Aggregation processing in isolation or full systems descriptions
- Parallel execution frameworks