Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid WorkloadsReference: Arulraj, Joy, Andrew Pavlo, and Prashanth Menon. "Bridging the archipelago between row-stores and column-stores for hybrid workloads." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
This is one of the several papers belong to suggested readings for Storage Models & Data Layout of CMU 15-721: Database Systems.
Data-intensive applications seek to obtain trill insights in real-time by analyzing a combination of historical data sets alongside recently collected data. This means that to support such hybrid workloads, database management systems (DBMSs) need to handle both fast ACID transactions and complex analytical queries on the same database. But the current trend is to use specialized systems that are optimized for only one of these workloads, and thus require an organization to maintain separate copies of the database. This adds additional cost to deploying a database application in terms of both storage and administration overhead.
To overcome this barrier, we present a hybrid DBMS architecture that efficiently supports varied workloads on the same database. Our approach differs from previous methods in that we use a single execution engine that is oblivious to the storage layout of data without sacrificing the performance benefits of the specialized systems. This obviates the need to maintain separate copies of the database in multiple independent systems. We also present a technique to continuously evolve the database’s physical storage layout by analyzing the queries’ access patterns and choosing the optimal layout for different segments of data within the same table. To evaluate this work, we implemented our architecture in an in-memory DBMS. Our results show that our approach delivers up to 3× higher throughput compared to static storage layouts across different workloads. We also demonstrate that our continuous adaptation mechanism allows the DBMS to achieve a near-optimal layout for an arbitrary workload without requiring any manual tuning.
- Many organizations implement HTAP pipelines using separate DBMSs. The most common practice is to use one DBMS for transactions and another for analytical queries. With this model, the new information from transactions goes into an on-line transactional processing (OLTP) DBMS. Then in the background, the system uses an extract-transform-load utility to migrate data from the OLTP DBMS to a data warehouse for OLAP.
- Bridging the architectural gap between the OLTP and OLAP systems using a unified architecture.
- A novel on-line reorganization technique that continuously enhances each table’s physical design in response to an evolving query workload.
- Two storage layouts: (1) n-ary storage model (NSM, row-store), which works well for OLTP; (2) decomposition storage model (DSM, column-store), which works well for OLAP.
- For HTAP, flexible storage model (FSM) generalizes the NSM and DSM as the flexible storage model (FSM), which supports a wide variety of hybrid storage layouts where the DBMS co-locates the attributes that are frequently accessed together.
3. Tile-based Architecture
3.1 Physical Tile
- The fundamental physical storage unit of the DBMS is a tile tuple. Informally, a tile tuple is a subset of attribute values that belong to a tuple. A set of tile tuples from a physical tile. We refer to a collection of physical tiles as a tile group. The storage manager physically stores a table as set of tile groups. All the physical tiles belonging to a tile group contain the same number of tile tuples.
3.2 Logical Tile
- A logical tile represents values spread across a collection of physical tiles from one or more tables. Every column of a logical tile contains a list of offsets corresponding to the tuples in the under lying physical tiles.
- A column in a logical tile can represent the data stored in one or more attributes spread across a collection of physical tiles. The DBMS stores this mapping in the logical tile’s metadata region. It records his information only once for a logical tile’s column.
- For all the attributes that a column in a logical tile maps to, the DBMS uses the same list of tuple offsets during materialization. Currently, we restrict that every logical tile column maps to an attribute in a physical tile.
- During query execution, the DBMS can dynamically choose to materialize a logical tile, that is an intermediate query result, into a physical tile. In this case, the operator constructs a pass-through logical tile with only one column that directly maps to the attributes in the materialized physical tile, and propagates that logical tile upwards to its parent in the plan tree.
3.3 Logical Tile Algebra
- Layout transparency: abstracting away the physical layout of the data from the DBMS’s query processing components, which reduces the coupling between the DBMS’s storage manager and execution engine.
- Bridge Operators: the only operators that interact with the storage manager are the ones at the bottom and the top of the tree. We refer to these operators as bridge operators because they connect logical tiles and physical tiles, e.g., sequential scan, index scan.
- Metadata Operators: the metadata of a logical tile includes information about the underlying physical tiles and a bitmap that represents the rows that must be examined by the operator processing the logical tile. The metadata operator only modifies the metadata of the logical tile and not the data that it represents, e.g., projection, selection.
- Mutator: these operators modify the data stored in the data, e.g., insert, delete, update.
- Pipeline Breakers: these operators consume the logical tiles produced by their children in the plan tree, and they block the execution of the upper-level operators while they wait for their children’s output, e.g., join, set operators (union, intersect), aggregation operators. For aggregation operators, they build new physical tiles to store the aggregated tuples, and construct a set of pass-through logical tiles to encapsulate these physical tiles.
Benefits of the Logical Tile Abstraction
- Layout Transparency: the operators need not be specialized for all possible storage layouts because the logical tile algebra hides this information from them.
- Vectorized Processing: the operators in a tile-based DBMS process data logical tile at a time, which improves the CPU efficiency by reducing the interpretation overhead, while tuple-at-a-time suffers from the high interpretation overhead and prevents key compiler optimizations such as loop pipelining.
- Flexible Materialization: since neither early materialization nor late materialization is a universally good strategy, a tile-based DBMS can dynamically choose to materialize at any operator in the plan tree during the query execution. It can then propagate the pass-through logical tiles upwards in the plan tree.
- Caching Behavior: the DBMS can optimize several dimensions in how it creates tile groups, such as the number of tuples per tile group, to ensure that the tiles fit well within the cache hierarchy. Furthermore, the logical tiles’ succinct representation enables the DBMS to more easily manage complex intermediate query execution results within the cache.
- Our logical-tile algebra bridges the theoretical gap between row-stores and column-stores within a single DBMS architecture.
4. Concurrency Control
- Every transaction maintains a metadata context that includes: (1) the timestamp of last committed transaction that should be visible to this transaction; (2) references to the set of tuple versions that the transaction either inserts or deletes during its lifetime, while in our tile-based architecture, it includes the tile group identifier and the offset of the tuple within that tile group.
- Every tile group contains the following versioning metadata for each tuple: (1) TxnId: a placeholder for the identifier of the transaction that currently holds a latch on the tuple; (2) BeginCTS: the commit timestamp from which the tuple becomes visible; (3) EndCTS: the commit timestamp after which the tuple ceases to be visible; (4) PreV: reference to the previous version, if any, of the tuple.
- Bridge Operators
- Due to this design, all the other operators of the logical tile algebra are decoupled from the tuple visibility problem.
- The DBMS periodically garbage collects these invisible tuple versions in an asynchronous incremental manner. The garbage collection process not only helps in recovering the storage space, but also refreshes the statistics maintained by the query planner.
- While the key stored in the index comprises of a set of tuple attributes, the value is a logical location of the latest version of a tuple. We do not store raw tuple pointers because then the DBMS would have to update them when it reorganizes the layout of the tile groups.
- The recovery module employs a variant of the canonical ARIES recovery protocol that is adapted for in-memory DBMSs.
- During regular transaction execution, the DBMS records the changes performed by the transaction in the write-ahead log, before committing the transaction. It periodically takes snapshots that are stored on the file system to bound the recovery time after a crash.
- At the start of recovery, the DBMS first loads the latest snapshot. It then replays the log to ensure that the changes made by the transactions committed since the snapshot are present in the database. Changes made by uncommitted transactions at the time of failure are not propagated to the database.
- As we do not record the physical changes to the indexes in this log, the DBMS rebuilds all of the tables’ indexes during recovery to ensure that they are consistent with the database.
5. Layout Reorganization
- We use a separate background process to reorganize the data in an incremental manner one tile group at a time. Over time, this process optimizes the storage layout for the workload and amortizes the cost across multiple queries. This approach can be incremental.
5.1 On-line Query Monitoring
- Goal: determining which attributes should be stored in the same physical tile in the new tile group layout.
- The monitor collects information about the attributes present in the query’s SELECT clauses, as well as those that are present in the WHERE clause. It distinguishes between these two sets of attributes because by co-locating only the attributes in the WHERE clause, rather than those in both clauses together, it needs to retrieve less data for predicate evaluation. It then stores this information as a time series graph for each individual table.
- To reduce the monitoring overhead, the monitor only gathers statistics from a random subset of queries.
- To improve the overall throughput on HTAP workloads, it needs to optimize the layout for both the transactions and the data intensive analytical queries. The DBMS achieves this by recording the query plan cost computed by the optimizer. It uses this plan cost information to derive a storage layout that also benefits the analytical queries.
5.2 Partitioning Algorithm
- Two stages: (1) it first employs a clustering algorithm to determine which set of attributes should be stored together in the same physical tile; (2) it then uses a greedy algorithm that uses the output of the clustering algorithm to generate a workload-aware tile group storage layout.
- The clustering algorithm prioritizes each query based on its plan cost, and utilizes weight to drift towards the recent samples.
- Considering a query workload Q with m queries, and a table with n attributes, the time complexity of each iteration of this algorithm is O(mnk) with a space complexity of O(n(m + k)).
- We use a greedy algorithm to derive a partitioning layout for the table using these representative queries. This algorithm iterates over the representative queries in the descending order based on the weight of their associated clusters. For each cluster, the algorithm groups the attributes accessed by that cluster’s representative query together into a tile. It continues this process until it assigns each attribute in the table to some tile.
5.3 Data Layout Reorganization
- We use an incremental approach for data layout reorganization. For a given tile group, the DBMS first copies over the data to the new layout, and then atomically swaps in the newly constructed tile group into the table. Any concurrent delete or update operation only modifies the versioning metadata that is stored separate from the physical tiles. The newly constructed tile group refers to the versioning metadata of the older tile group.
- The storage space consumed by the physical tiles in the older tile group is reclaimed by the DBMS only when they are no longer referenced by any logical tiles.
- Due to its incremental nature, the overhead of data layout reorganization is amortized over multiple queries.
- he reorganization process does not target hot tile groups that are still being heavily accessed by OLTP transactions. Rather, it transforms the cold data to new layouts. That is, the updated versions of the tuples start off in a tuple-centric layout, and then are gradually transformed to an OLAP-optimized layout. Hence, the reorganization process does not impact latency-sensitive transactions while benefiting the OLAP queries.
- For some applications with periodic cycles in their workload that oscillate between different sets of queries, the DBMS prioritizes the older query samples in the clustering algorithm with a larger weight, thereby dampening the adaption mechanism.
6. Experimental Evaluation
- All transactions execute with the default snapshot isolation level.
- We disable the DBMS’s garbage collection and logging components to ensure that our measurements are only for the storage and query processing components.
6.1 ADAPT Benchmark
- Since there currently is a dearth of HTAP workloads for testing, we developed our own that we call the ADAPT benchmark that is composed of queries that are common in enterprise workloads. This benchmark is inspired by the one that Alagiannis developed for evaluating the adaptive storage manager.
6.2 Performance Impact of Storage Models
- For each workload, we first load the database and execute the queries five times till the DBMS reorganizes the tables’ layout. This is the ideal scenario for the FSM storage manager.
- When the query’s projectivity increases, DSM storage manager becomes slower because of the tuple reconstruction cost.
- The benefits of FSM are more pronounced for the wide table.
- The magnitude of the performance gaps between the different storage managers is smaller under the low projectivity settings. This is because the execution engine needs to materialize the logical tiles consisting of the aggregate tuples for all the storage managers.
6.3 Workload-Aware Adaption
- To mimic the temporal locality in HTAP workloads, and to clearly delineate the impact of the DBMS’s storage layout on the different queries, the sequence is divided into segments of 25 queries that each correspond to a particular query type. This means that the DBMS executes the same type of query (with different input parameters) in one segment, and then switches to another query type in the next segment.
- The FSM storage manager converges over time to a layout that works well for the particular segment.
6.4 Horizontal Fragmentation
- Goal: comparing query processing through logical tiles versus the canonical tuple-at-a-time iterator model.
- The coarse-grained fragmentation (10000 tuples per tile group) setting outperforms the fine-grained one (10 tuples per tile group).
6.5 Reorganization Sensitivity Analysis
- We strike a balance at . Under this setting, the storage manager adapts quickly enough for HTAP workloads, but is also not easily swayed over by ephemeral workload shifts.
6.6 Data Reorganization Strategies
- We ob- serve that when it adopts the immediate (combine query processing with reorganization, e.g., ) approach, there are spikes in the query latency.
- H2O chooses to create copies of the data with new layouts as part of the query execution. This improves the execution time of subsequent queries of the same type. But, it necessitates the construction of layout-specific access operators using code generation techniques.
- Our storage manager, however, only maintains one copy of the data with a particular layout at any given point in time. We adopt this approach to avoid the overhead of synchronizing the copies on write-intensive workloads.
7. Related Work
- Storage Models: NSM, DSM, Hybrid NSM/DSM, PAX
- Off-line Physical Design Tuning
- On-line Physical Design Tuning
- Hybrid DBMS Architectures
- Adaptive Stores