Anti-Caching: A New Approach to Database Management System Architecture

Reference: DeBrabant, Justin, et al. "Anti-caching: A new approach to database management system architecture." Proceedings of the VLDB Endowment 6.14 (2013): 1942-1953.

This is one of the suggested readings for the Larger-than-Memory Databases lecture of CMU 15-721: Database Systems.

0. Abstract

The traditional wisdom for building disk-based relational database management systems (DBMS) is to organize data in heavily-encoded blocks stored on disk, with a main memory block cache. In order to improve performance given high disk latency, these systems use a multi-threaded architecture with dynamic record-level locking that allows multiple transactions to access the database at the same time. Previous research has shown that this results in substantial overhead for on-line transaction processing (OLTP) applications.

The next generation DBMSs seek to overcome these limitations with architecture based on main memory resident data. To overcome the restriction that all data fit in main memory, we propose a new technique, called anti-caching, where cold data is moved to disk in a transactionally-safe manner as the database grows in size. Because data initially resides in memory, an anti-caching architecture reverses the traditional storage hierarchy of disk-based systems. Main memory is now the primary storage device.

We implemented a prototype of our anti-caching proposal in a high-performance, main memory OLTP DBMS and performed a series of experiments across a range of database sizes, workload skews, and read/write mixes. We compared its performance with an open-source, disk-based DBMS optionally fronted by a distributed main memory cache. Our results show that for higher skewed workloads the anti-caching architecture has a performance advantage over either of the other architectures tested of up to 9x for a data size 8x larger than memory.

1. Introduction

  • When all data fits in main memory, the cost of maintaining a buffer pool is nearly one-third of all the CPU cycles used by the DBMS.
  • The fundamental problem with main memory DBMSs, however, is that this improved performance is only achievable when the database is smaller than the amount of physical memory available in the system.
  • The two-tier architecture (a DBMS fronted by a main memory distributed cache) has two problems: (1) data objects may reside both in the cache and the DBMS buffer pool; (2) it requires developers to embed logic in their application to keep the two systems synchronized.
  • Rather than starting with data on disk and reading hot data into the cache, data starts in memory and cold data is evicted to the anti-cache on disk.
  • The two main differences between anti-caching and virtual memory: (1) fine-grained eviction; (2) non-blocking fetches.
  • The results of experiments show that the anti-caching architecture outperforms both the traditional disk-based architecture and the hybrid architecture (with a cache layer) on popular OLTP workloads.

2. H-Store System Overview

  • H-Store is designed to efficiently execute OLTP workloads on main memory-only nodes. Each partition is assigned a single-threaded execution engine at its node that is responsible for executing transactions and queries for that partition.
  • H-Store assumes a workload of transactions with the following composition: (1) single-partition transactions; (2) multi-partition transactions.
  • Each H-Store transaction is given a unique transaction ID, based on the time it arrived in the system. Standard clock-skew algorithms are used to keep the various CPU clocks synchronized.
  • To ensure that all modifications to the database are durable and persistent, each DBMS node continuously writes asynchronous snapshots of the entire database to disk at fixed intervals. In between these snapshots, the DBMS writes out a record to a command log for each transaction that completes successfully. The DBMS combines multiple records together and writes them in a group to amortize the cost of writing to disk. Any modifications that are made by a transaction are not visible to the application until this record has been written. This record only contains the original request information sent from the client, which is more lightweight than record-level logging.
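
A rough sketch of this group-commit command logging (all names here are mine, not H-Store's; the real logger lives inside H-Store's engine):

```python
import os, threading, time

class CommandLog:
    """Group commit: buffer per-transaction command records, flush in batches."""
    def __init__(self, path, flush_interval_ms=5):
        self.f = open(path, "ab")
        self.pending = []                      # (record, done_event) awaiting flush
        self.lock = threading.Lock()
        self.interval = flush_interval_ms / 1000.0
        threading.Thread(target=self._flusher, daemon=True).start()

    def append(self, txn_request: bytes) -> threading.Event:
        """Log only the original client request, not record-level changes."""
        done = threading.Event()
        with self.lock:
            self.pending.append((txn_request, done))
        return done                            # caller waits on this before acking

    def _flusher(self):
        while True:
            time.sleep(self.interval)
            with self.lock:
                batch, self.pending = self.pending, []
            if not batch:
                continue
            for rec, _ in batch:
                self.f.write(len(rec).to_bytes(4, "big") + rec)
            self.f.flush()
            os.fsync(self.f.fileno())          # one fsync amortized over the batch
            for _, done in batch:
                done.set()                     # results visible only after this
```

A transaction's result is returned to the client only after its `done` event is set, which mirrors the rule above that modifications are not visible until the log record is durable.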

3. Anti-Caching System Model

  • When the size of the database relative to the amount of available memory on the node exceeds some administrator-defined threshold, the DBMS “evicts” cold data to the anti-cache in order to make space for new data.
  • The DBMS updates a memory-resident catalog that keeps track of every tuple that was evicted. When a transaction accesses one of these evicted tuples, the DBMS switches that transaction into a “pre-pass” mode to learn about all of the tuples that the transaction needs. After this pre-pass is complete, the DBMS then aborts that transaction (rolling back any changes that it may have made) and holds it while the system retrieves the tuples in the background. Once the data has been merged back into the in-memory tables, the transaction is released and restarted.

3.1 Storage Architecture

  • Block Table: a hash table that maintains the blocks of tuples that have been evicted from the DBMS’s main memory storage. Each block is the same fixed size and is assigned a unique 4-byte key. A block’s header contains the identifier for the single table that its tuples were evicted from and the timestamp when the block was created. The body of the block contains the serialized evicted tuples from a single table. The key portion of the Block Table stays in memory while its values (i.e., the block data) are stored on disk without OS or file-system caching.
  • Evicted Table: keeps track of the tuples that have been written out to blocks on disk. The DBMS updates any indexes containing evicted tuples to reference the Evicted Table.
  • LRU Chain: an in-memory list of all the tuples for each table in LRU order. The DBMS embeds the pointers directly in the tuples’ headers. To reduce the CPU overhead of tracking the total ordering of each table’s LRU Chain, the DBMS selects a fraction of the transactions to monitor at runtime. Any table not labeled as evictable will not maintain an LRU chain and will remain entirely in main memory. (A rough sketch of these three structures follows this list.)
  • Our current implementation also requires that all of the database’s primary key and secondary indexes fit in memory.
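
The three structures above could be modeled roughly as follows (a hedged Python sketch with invented names; the real system embeds the chain pointers in C++ tuple headers and keeps secondary indexes as separate structures):

```python
from dataclasses import dataclass

@dataclass
class Tuple:
    key: int
    data: bytes
    prev: "Tuple" = None     # LRU Chain pointers embedded in the tuple header
    next: "Tuple" = None
    evicted: bool = False    # special evicted flag checked on access

@dataclass
class EvictedEntry:          # one entry per evicted tuple
    block_id: int            # which 4-byte block key holds the tuple
    offset: int              # where inside that block

class Table:
    def __init__(self, name):
        self.name = name
        self.tuples = {}         # primary index: key -> in-memory Tuple
        self.lru_head = None     # cold end: next eviction candidates
        self.lru_tail = None     # hot end: most recently used
        self.evicted_table = {}  # key -> EvictedEntry (indexes point here)

class AntiCache:
    def __init__(self):
        self.next_block_id = 0   # unique 4-byte keys handed out sequentially
        self.block_index = {}    # block_id -> (file offset, length); only these
                                 # keys stay in memory, block bodies live on disk
```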

3.2 Block Eviction

  • Our system maintains a separate LRU Chain per table that is local to a partition. The DBMS determines what tables to evict data from and the amount of data that should be evicted from a given table by the relative skew of accesses to tables.
  • After determining how much data to evict from each table, H-Store executes special single-partition transactions that select tuples for eviction and write blocks to disk.
  • When the eviction transaction executes, it creates a new block by popping tuples off the head of the target table’s LRU Chain. For each tuple being evicted, H-Store copies its data into the eviction block buffer. It then adds an entry into the Evicted Table and updates all indexes to point to this entry instead of the original tuple location. Each tuple in the Evicted Table includes a special evicted flag in its header that enables the DBMS to recognize when a transaction accesses evicted data. Groups of blocks are written out in a single sequential write.
  • The state of the database is consistent during the eviction process.
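
Continuing the sketch above, the eviction transaction might look like this (a simplification down to a single primary index; the block layout is assumed):

```python
import struct, time

BLOCK_SIZE = 1 << 20   # assumed fixed block size (e.g., 1 MB)

def evict_block(table, anti_cache, block_file):
    """Pop cold tuples off the head of the LRU Chain into one eviction block."""
    header = struct.pack("!8sq", table.name.encode()[:8], int(time.time()))
    body, moved = bytearray(), []
    while table.lru_head and len(header) + len(body) < BLOCK_SIZE:
        t = table.lru_head                      # coldest tuple
        table.lru_head = t.next                 # unlink from the chain
        if table.lru_head: table.lru_head.prev = None
        else: table.lru_tail = None
        offset = len(body)
        body += struct.pack("!qI", t.key, len(t.data)) + t.data
        moved.append((t.key, offset))
        table.tuples.pop(t.key, None)           # tuple no longer in memory
    block_id = anti_cache.next_block_id
    anti_cache.next_block_id += 1
    pos = block_file.tell()
    block_file.write(header + bytes(body))      # one sequential write
    anti_cache.block_index[block_id] = (pos, len(header) + len(body))
    for key, offset in moved:
        # the real system also updates every index to point at this entry
        table.evicted_table[key] = EvictedEntry(block_id, offset)
    return block_id
```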

3.3 Transaction Execution

  • Pre-pass Phase: a transaction enters the pre-pass phase if evicted data is needed to continue execution. The goal of the pre-pass phase is to determine all of the evicted data that the transaction needs to access so that it can be retrieved together. During the pre-pass phase, any in-memory tuples are updated in the LRU Chain to reduce the likelihood that these tuples are evicted before the transaction is re-queued.
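
A sketch of this pre-pass bookkeeping, building on the structures above (the `read_set` attribute is my simplification; the real engine discovers accesses as the transaction executes):

```python
def unlink(table, t):
    """Remove t from the chain; a no-op if t is not currently linked."""
    if table.lru_head is t: table.lru_head = t.next
    elif t.prev: t.prev.next = t.next
    if table.lru_tail is t: table.lru_tail = t.prev
    elif t.next: t.next.prev = t.prev
    t.prev = t.next = None

def touch_lru(table, t):
    """Move (or insert) a tuple at the hot end (tail) of the LRU Chain."""
    unlink(table, t)
    t.prev = table.lru_tail
    if table.lru_tail: table.lru_tail.next = t
    table.lru_tail = t
    if table.lru_head is None: table.lru_head = t

def pre_pass(txn, table):
    """Collect every evicted tuple the txn needs; refresh the LRU position of
    the in-memory tuples it touches so they survive until it is re-queued."""
    needed_blocks = set()
    for key in txn.read_set:
        if key in table.evicted_table:
            needed_blocks.add(table.evicted_table[key].block_id)
        elif key in table.tuples:
            touch_lru(table, table.tuples[key])
    return needed_blocks    # fetch these blocks, then restart the aborted txn
```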

3.4 Block Retrieval

After aborting a transaction that attempts to access evicted tuples, the DBMS schedules the retrieval of the blocks that the transaction needs from the Block Table in two steps:

  • The system first issues a non-blocking read to retrieve the blocks from disk. This operation is performed by a separate thread while regular transactions continue to execute at that partition. The DBMS stages these retrieved blocks in a separate buffer that is not accessible to queries. Any transaction that attempts to access an evicted tuple in one of these blocks is aborted as if the data was still on disk.
  • The DBMS performs a “stop-and-copy” operation whereby all transactions are blocked at that partition while the retrieved tuples are merged from the staging buffer back into the regular table storage. It then removes all of the entries for these retrieved tuples from the Evicted Table and updates the table’s indexes to point to the real tuples. Afterwards, the aborted transaction is rescheduled.
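
A compact sketch of these two steps (the `read_block`, `partition_lock`, and `requeue_txn` parameters stand in for the engine's I/O and scheduling hooks, which I am inventing here):

```python
import threading

def schedule_retrieval(table, block_ids, read_block, partition_lock, requeue_txn):
    """Step 1 on a helper thread: read blocks while normal txns keep running.
    Step 2 under the partition lock: the brief "stop-and-copy" merge."""
    def fetch():
        staged = [read_block(b) for b in block_ids]     # staging buffer; these
                                                        # tuples not yet visible
        with partition_lock:                            # pause txns at partition
            for tuples in staged:
                for key, data in tuples:
                    table.evicted_table.pop(key, None)  # drop Evicted Table entry
                    t = Tuple(key, data)
                    table.tuples[key] = t               # indexes point here again
                    touch_lru(table, t)
            requeue_txn()                               # restart the aborted txn
    threading.Thread(target=fetch, daemon=True).start()
```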

We have two different solutions for how much data to merge from a retrieved block back into the in-memory storage:

  • Block-Merging: merging the entire retrieved block back into the regular table storage. The requested tuple(s) are placed at the back of the table’s LRU Chain. Conversely, any tuples not needed by pending transactions are added to the front (i.e., cold end) of the LRU Chain, which means that they are more likely to be chosen for eviction in the next round.
  • Tuple-Merging: only merge the tuples that caused the block to be read from disk. Over time, these “holes” in the blocks accumulate, reducing the amount of valid data retrieved with each block. We employ a lazy block compaction algorithm during the merge process: when the DBMS retrieves a block from disk, it checks whether the number of holes in the block is above a threshold. If it is, then the DBMS will merge the entire block back into memory, just as with the block-merge strategy.
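
A sketch of tuple-merge with lazy compaction (the threshold value and the hole bookkeeping are my assumptions; the paper only states that a block is fully merged once its fraction of holes crosses a threshold):

```python
COMPACTION_THRESHOLD = 0.5   # assumed knob: max tolerated fraction of holes

def merge_retrieved_block(table, block_tuples, requested_keys, num_holes):
    """block_tuples: live (key, data) pairs in the block; num_holes: tuples
    merged out earlier. Merge only what was requested, unless the block has
    decayed enough to be compacted wholesale."""
    hole_fraction = num_holes / max(num_holes + len(block_tuples), 1)
    merge_all = hole_fraction >= COMPACTION_THRESHOLD   # lazy block compaction
    for key, data in block_tuples:
        if merge_all or key in requested_keys:
            table.evicted_table.pop(key, None)
            t = Tuple(key, data)
            table.tuples[key] = t
            touch_lru(table, t)
        # else: leave the tuple in the on-disk block; its entry stays valid
```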

3.5 Distributed Transactions

  • Multi-partition transactions: the transaction is aborted and not re-queued until it receives a notification that all of the blocks that it needs have been retrieved from the nodes in the cluster. The system ensures that any in-memory tuples that the transaction also accessed at any partition are not evicted during the time that it takes for each node to retrieve the blocks from disk.

3.6 Snapshots & Recovery

  • Delta snapshots: because blocks in the anti-cache change infrequently, the DBMS only needs to record the changes to the Block Table since the last snapshot rather than re-copying all evicted data into every snapshot.

4. Architecture Comparison

4.1 Benchmarks

We use the OLTP-Bench framework for the MySQL experiments and H-Store’s built-in benchmarking framework for the anti-caching experiments.

  • YCSB: a collection of workloads that are representative of large-scale services created by Internet-based companies
  • TPC-C: the current industry standard for evaluating the performance of OLTP systems

4.2 System Configurations

  • MySQL
  • MySQL + Memcached
  • H-Store with Anti-Caching: in our current implementation, we use BerkeleyDB’s hash table to store the Block Table. BerkeleyDB is configured to use direct I/O without any caching or locks.

  • In each trial, the DBMSs are allowed to “warm-up” for two minutes.
  • For the anti-caching architecture, we evaluate H-Store’s performance using an LRU Chain sampling rate of 0.01 (aLRU) and 1.00 (LRU). Thus, for aLRU, only one out of every one hundred transactions updates the LRU chain, while for LRU every transaction will update the LRU chain.
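
In sketch form, the aLRU sampling is just a coin flip per transaction (building on `touch_lru` from the earlier sketch):

```python
import random

SAMPLING_RATE = 0.01   # aLRU; 1.00 degenerates to standard LRU

def maybe_track_accesses(accessed_tuples, table, rate=SAMPLING_RATE):
    """Only a sampled fraction of transactions pays for LRU chain maintenance."""
    if random.random() >= rate:
        return                       # at 0.01, 99% of txns skip the bookkeeping
    for t in accessed_tuples:
        touch_lru(table, t)          # move accessed tuples to the hot end
```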

4.3 Results & Discussion

YCSB

  • Reasons why anti-caching architecture is better: (1) H-Store’s lightweight concurrency control scheme is more efficient than MySQL’s model; (2) tuples are not converted back-and-forth between disk and main memory format.
  • Memcached improves the throughput of MySQL most on the read-only workloads and only for high skew. The lower performance in the other workloads is due to the overhead of synchronizing values in Memcached and in MySQL in the event of a write.
  • For almost all data sizes and skews tested, aLRU performs as well or better than the standard LRU. The anti-caching with traditional LRU suffers significantly as skew is decreased, meaning the maintenance of the LRU chain is a major bottleneck at lower skews.

TPC-C

  • The slight decrease in throughput for anti-caching is a result of the increasing memory overhead as the amount of evicted data grows, since there is an entry in the Evicted Table for each evicted tuple.

5. Experimental Analysis

5.1 Merge Strategies

  • The tuple-merge policy outperforms the block-merge policy because: (1) the block-merge policy has larger merge costs, since merging blocks transactions from executing on the target partition; (2) under the block-merge policy, unrequested tuples are merged and placed at the cold end of the LRU Chain, only to be evicted again soon after.

5.2 Evicted Table Block Size

  • Larger block sizes reduce overall throughput, and the effect is most pronounced for highly skewed workloads.

5.3 Tuple Size

  • The DBMS achieves higher throughput for larger tuple sizes: with smaller tuples, many more tuples must be evicted to reclaim the same amount of memory, so the per-tuple overheads (Evicted Table entries, index updates) add up.

5.4 LRU Chain Overhead

  • For higher skewed workloads, the doubly-linked list performs within 5% of the baseline and 20% faster than the singly-linked list. The two strategies slowly converge as skew is decreased. The performance gap comes from the cost of updating a tuple in the chain: a singly-linked list must scan from the head to find the tuple’s predecessor before removing it, whereas a doubly-linked list unlinks in constant time.
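
The gap is easy to see in a sketch of the two removal routines (using the `Table`/`Tuple` model from earlier):

```python
def unlink_doubly(table, t):
    """Doubly-linked: constant-time removal via the embedded back pointer."""
    if t.prev: t.prev.next = t.next
    else:      table.lru_head = t.next
    if t.next: t.next.prev = t.prev
    else:      table.lru_tail = t.prev

def unlink_singly(table, t):
    """Singly-linked: walk from the head to find t's predecessor, so every
    LRU update is O(chain length); this scan is the gap measured above."""
    prev, cur = None, table.lru_head
    while cur is not t:
        prev, cur = cur, cur.next
    if prev: prev.next = t.next
    else:    table.lru_head = t.next
```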

5.5 Eviction Overhead Micro-benchmark

  • The cost of updating indexes and copying data to and from disk scales linearly relative to the block size.
  • The update of secondary indexes is unlikely to be a bottleneck for most OLTP workloads.

5.6 Overhead Measurements

  • In TPC-C, throughput is less volatile over time than in YCSB. There are several sources of throughput volatility: (1) the group commit of writes; (2) the creation of eviction blocks (even though the writing of eviction blocks to disk is done asynchronously).

6. Future Work

6.1 Larger-than-Memory Queries

  • For large read queries, in a traditional DBMS, the query will commence after acquiring a table-level lock, at which point no writes will be processed in parallel.
  • The second solution is to process large read queries in historical mode: each transaction is assigned a timestamp of when it enters the system and then is only allowed to read tuples with a timestamp that is less than or equal to it. The DBMS will not overwrite tuples when one of these queries is running, but instead must keep both the before and after images of the database to ensure that the large read queries are provided with the correct version (a visibility sketch follows this list).
  • The third solution is to allow dirty reads (but not dirty writes).
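
A sketch of the visibility rule behind the historical mode in the second solution (version chaining is my assumed mechanism; the paper only specifies the timestamp rule):

```python
from dataclasses import dataclass

@dataclass
class Version:
    write_ts: int            # timestamp of the txn that produced this version
    data: bytes
    older: "Version" = None  # before-image kept while historical readers run

def visible_version(newest, reader_ts):
    """A historical reader with timestamp reader_ts sees the newest version
    whose timestamp is <= reader_ts; writers append new versions instead of
    overwriting while such readers are active."""
    v = newest
    while v is not None and v.write_ts > reader_ts:
        v = v.older
    return v
```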

6.2 Block Reorganization

6.3 Query Optimizations

  • There are several potential optimizations that would allow H-Store to process queries on evicted tuples without needing to retrieve them from disk, e.g., the DBMS does not need to retrieve an evicted tuple if an index “covers” a query (see the check after this list).
  • Another idea to further reduce the size of the database that needs to be kept in memory is to evict only a portion of a tuple (i.e., some columns) to disk.
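
The covering-index case reduces to a simple check (hypothetical helper):

```python
def index_covers(index_columns, query_columns):
    """If the index contains every column the query touches, the query can be
    answered from the in-memory index alone; the evicted tuple stays on disk."""
    return set(query_columns) <= set(index_columns)

# e.g., a lookup reading only `balance` is covered by an index on (id, balance)
assert index_covers(["id", "balance"], ["balance"])
```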

7. Related Work

  • The HyPer DBMS relies on virtual memory paging to support databases that are larger than the amount of available memory.
  • Project Siberia: an effort to manage cold data in Hekaton, Microsoft’s main memory table extension for SQL Server.
  • Calvin: a main memory OLTP system that is designed to efficiently handle distributed transactions.
Written on December 2, 2017