An Empirical Evaluation of In-Memory Multi-Version Concurrency Control

Reference: Wu, Yingjun, et al. "An empirical evaluation of in-memory multi-version concurrency control." Proceedings of the VLDB Endowment 10.7 (2017): 781-792.

This is one of the suggested readings for the Multi-Version Concurrency Control lecture of CMU 15-721: Database Systems.

0. Abstract

Multi-version concurrency control (MVCC) is currently the most popular transaction management scheme in modern database management systems (DBMSs). Although MVCC was discovered in the late 1970s, it is used in almost every major relational DBMS released in the last decade. Maintaining multiple versions of data potentially increases parallelism without sacrificing serializability when processing transactions. But scaling MVCC in a multi-core and in-memory setting is non-trivial: when there are a large number of threads running in parallel, the synchronization overhead can outweigh the benefits of multi-versioning.

To understand how MVCC performs when processing transactions in modern hardware settings, we conduct an extensive study of the scheme’s four key design decisions: concurrency control protocol, version storage, garbage collection, and index management. We implemented state-of-the-art variants of all of these in an in-memory DBMS and evaluated them using OLTP workloads. Our analysis identifies the fundamental bottlenecks of each design choice.

1. Introduction

  • Almost every major relational DBMS in the last decade uses MVCC, including Oracle, Postgres, InnoDB, Microsoft Hekaton, SAP HANA, and MemSQL.
  • Until now, there has not been a comprehensive evaluation of MVCC in a modern DBMS operating environment.
[Figure: MVCC Implementations]

2. Background

2.1 MVCC Overview

  • Advantages: (1) it can potentially allow for greater concurrency than a single-version system; (2) the system can also support “time-travel” operations that allow an application to query a consistent snapshot of the database as it existed at some point of time in the past.

2.2 DBMS Meta-Data

  • Transactions: The DBMS assigns a transaction T a unique, monotonically increasing timestamp as its identifier (T_id) when it first enters the system. The concurrency control protocols use this identifier to mark the tuple versions that a transaction accesses. Some protocols also use it for the serialization order of transactions.
  • Tuples: txn-id (write lock), begin-ts and end-ts (version lifetime), pointer (next or previous version in the chain)
[Figure: Tuple Format]
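
The header fields above can be sketched as a small record (a hypothetical minimal layout, not the paper's exact encoding; INF is modeled as a float sentinel):

```python
from dataclasses import dataclass
from typing import Optional

INF = float("inf")  # sentinel for "lifetime not yet ended"

@dataclass
class TupleVersion:
    txn_id: int = 0          # write lock: 0 = unlocked, else the holder's T_id
    begin_ts: float = 0      # start of this version's valid lifetime
    end_ts: float = INF      # end of lifetime; INF while it is the latest
    pointer: Optional["TupleVersion"] = None  # next/previous version in chain
    data: object = None      # the tuple's attribute values

def visible(v: TupleVersion, tid: int) -> bool:
    """A version is visible to transaction tid if tid falls in its lifetime."""
    return v.begin_ts <= tid < v.end_ts

old = TupleVersion(begin_ts=1, end_ts=5, data="v1")
new = TupleVersion(begin_ts=5, end_ts=INF, data="v2", pointer=old)  # N2O chain
assert visible(old, 3) and not visible(old, 5)
assert visible(new, 7)
```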

3. Concurrency Control Protocol

  • Concurrency control protocol determines: (1) whether to allow a transaction to access or modify a particular tuple version in the database at runtime; (2) whether to allow a transaction to commit its modifications.

3.1 Timestamp Ordering (MVTO)

  • In addition to the fields described in Sect. 2.2, the version headers also contain the identifier of the last transaction that read it (read-ts).
  • When transaction T invokes a read operation on logical tuple A, the DBMS searches for a physical version A_x where T_id is in between the range of the begin-ts and end-ts fields. T is allowed to read version A_x if its write lock is not held by another active transaction (i.e., the value of txn-id is zero or equal to T_id).
  • Upon reading A_x, the DBMS sets A_x’s read-ts field to T_id if its current value is less than T_id. Otherwise, the transaction reads an older version without updating this field.
  • With MVTO, a transaction always updates the latest version of a tuple. Transaction T creates a new version B_x+1 if (1) no active transaction holds B_x’s write lock and (2) T_id is larger than B_x’s read-ts field. If these conditions are satisfied, then the DBMS creates a new version B_x+1 and sets its txn-id to T_id. When T commits, the DBMS sets B_x+1’s begin-ts and end-ts fields to T_id and INF respectively, and B_x’s end-ts field to T_id.
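
The read and update rules above can be sketched over an N2O version chain (a single-threaded toy, not the paper's implementation; the commit step is folded into the update for brevity):

```python
INF = float("inf")

class Version:
    def __init__(self, data, begin_ts, end_ts=INF, nxt=None):
        self.data, self.begin_ts, self.end_ts = data, begin_ts, end_ts
        self.txn_id, self.read_ts, self.next = 0, 0, nxt  # next = older version

def mvto_read(head, tid):
    """Return the version visible to tid, or None if blocked by a writer."""
    v = head
    while v and not (v.begin_ts <= tid < v.end_ts):
        v = v.next
    if v is None or v.txn_id not in (0, tid):
        return None  # no visible version, or write-locked by another txn
    if v.read_ts < tid:
        v.read_ts = tid  # record the latest reader
    return v

def mvto_update(head, tid, new_data):
    """Create a new version at the chain head; None means T must abort.
    A transaction may update a version whose last reader is itself, so the
    check rejects only strictly later readers."""
    if head.txn_id not in (0, tid) or head.read_ts > tid:
        return None  # locked by another txn, or read by a later txn
    new = Version(new_data, begin_ts=tid, nxt=head)
    head.end_ts = tid  # the old version's lifetime ends at T_id
    return new

v1 = Version("a", begin_ts=1)
assert mvto_read(v1, 3).data == "a"
head = mvto_update(v1, 5, "b")
assert head.data == "b" and v1.end_ts == 5
assert mvto_read(head, 4).data == "a"     # older reader still sees "a"
assert mvto_read(head, 7).data == "b"     # sets head.read-ts to 7
assert mvto_update(head, 6, "c") is None  # rejected: read by a later txn
```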

3.2 Optimistic Concurrency Control (MVOCC)

  • The motivation behind OCC is that the DBMS assumes that transactions are unlikely to conflict, and thus a transaction does not have to acquire locks on tuples when it reads or updates them.
  • Three phases: read, validation, and write
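
As a toy illustration of the three phases (a single-threaded sketch over a key-to-(value, version) store; the class and field names are invented, and this is not the paper's Hekaton-style implementation):

```python
class MVOCCTxn:
    def __init__(self, store):
        self.store, self.reads, self.writes = store, {}, {}

    def read(self, key):                 # read phase: no locks taken
        val, ver = self.store[key]
        self.reads[key] = ver            # remember version for validation
        return self.writes.get(key, val)

    def write(self, key, val):           # buffered until the write phase
        self.writes[key] = val

    def commit(self):
        # validation phase: abort if any read version changed underneath us
        for key, ver in self.reads.items():
            if self.store[key][1] != ver:
                return False
        # write phase: install new versions (bump version counters)
        for key, val in self.writes.items():
            _, ver = self.store.get(key, (None, 0))
            self.store[key] = (val, ver + 1)
        return True

store = {"x": ("a", 0)}
t1, t2 = MVOCCTxn(store), MVOCCTxn(store)
t1.read("x"); t2.read("x")
t1.write("x", "b"); assert t1.commit()
t2.write("x", "c"); assert not t2.commit()  # t2's read of x was invalidated
```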

3.3 Two-Phase Locking (MV2PL)

  • Every transaction acquires the proper lock on the current version of a logical tuple before it is allowed to read or modify it.
  • In a disk-based DBMS, locks are stored separately from tuples so that they are never swapped to disk. This separation is unnecessary in an in-memory DBMS, thus with MV2PL the locks are embedded in the tuple headers.
  • The tuple’s write lock is the txn-id field. For the read lock, the DBMS uses a read-cnt field to count the number of active transactions that have read the tuple.
  • The key difference among 2PL protocols is in how they handle deadlocks. Previous research has shown that the no-wait policy is the most scalable deadlock prevention technique. With this, the DBMS immediately aborts a transaction if it is unable to acquire a lock on a tuple.
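
A minimal no-wait sketch of the embedded lock fields (the helper names are invented; a real implementation would use atomic compare-and-swap on the header words and would handle read-to-write lock upgrades, which this sketch does not):

```python
class Lockable:
    def __init__(self):
        self.txn_id = 0    # write lock: holder's T_id, or 0 if free
        self.read_cnt = 0  # number of active readers

def acquire_read(v, tid):
    """No-wait policy: fail immediately instead of blocking on a conflict."""
    if v.txn_id not in (0, tid):
        return False       # write-locked by another transaction -> abort
    v.read_cnt += 1
    return True

def acquire_write(v, tid):
    if v.txn_id not in (0, tid) or v.read_cnt > 0:
        return False       # held by a writer or active readers -> abort
    v.txn_id = tid
    return True

v = Lockable()
assert acquire_read(v, 1)
assert not acquire_write(v, 2)  # a reader is present: txn 2 aborts at once
v.read_cnt -= 1                 # txn 1 releases its read lock at commit
assert acquire_write(v, 2)
```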

3.4 Serialization Certifier

  • One can employ certifier-based approaches on top of weaker isolation levels that offer better performance but allow certain anomalies.
  • Serializable Snapshot Isolation (SSI): it guarantees serializability by avoiding write-skew anomalies for snapshot isolation
  • Serial Safety Net (SSN): reducing the number of false aborts makes SSN more amenable to workloads with read-only or read-mostly transactions.

3.5 Discussion

  • These protocols handle conflicts differently, and thus each is better suited to some workloads than others.
  • Optimizations: (1) allowing a transaction to speculatively read uncommitted versions created by other transactions; (2) allowing transactions to eagerly update versions that are read by uncommitted transactions.
  • Although both optimizations can reduce the number of unnecessary aborts for some workloads, they suffer from cascading aborts. Moreover, the maintenance of a centralized data structure can become a major performance bottleneck.

4. Version Storage

4.1 Append-Only Storage

  • All of the tuple versions for a table are stored in the same storage place. This approach is used in Postgres, as well as in-memory DBMSs like Hekaton, NuoDB and MemSQL.
  • Ordering of version chains: (1) Oldest-to-Newest (O2N): the chain’s HEAD is the oldest version, so the DBMS need not update the indexes to point to a newer version whenever a tuple is modified, but lookups may have to traverse a long chain; (2) Newest-to-Oldest (N2O): the chain’s HEAD is the latest version, so lookups are fast, but the HEAD changes whenever a tuple is modified.
  • Another issue with append-only storage is sometimes creating redundant copies for non-inline attributes (e.g., BLOBs). To avoid this problem, one optimization is to allow the multiple physical versions of the same tuple to point to the same non-inline data.
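
The two orderings can be contrasted with a toy linked-list sketch (Node and the insert helpers are invented for illustration):

```python
class Node:
    def __init__(self, data):
        self.data, self.next = data, None

def insert_n2o(head, new):
    """N2O: the new version becomes the HEAD, so index entries (or an
    indirection slot) must be repointed at `new`."""
    new.next = head
    return new

def insert_o2n(head, new):
    """O2N: walk to the tail (the newest version) and append; index
    entries keep pointing at the unchanged old HEAD."""
    v = head
    while v.next is not None:
        v = v.next
    v.next = new
    return head

h = insert_n2o(Node("v1"), Node("v2"))
assert h.data == "v2" and h.next.data == "v1"

h2 = Node("v1")
assert insert_o2n(h2, Node("v2")) is h2 and h2.next.data == "v2"
```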

4.2 Time-Travel Storage

  • The DBMS maintains a master version of each tuple in the main table and multiple versions of the same tuple in a separate time-travel table.
  • This incurs additional maintenance costs during GC because the DBMS copies the data from the time-travel table back to the main table when it prunes the current master version.
  • This scheme also suffers from the same non-inline attribute problem as the append-only approach.
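
A rough sketch of a time-travel update, using dicts as stand-ins for the main table and time-travel table (the layout and helper name are assumptions):

```python
INF = float("inf")

main_table = {"A": {"val": 1, "begin_ts": 0, "end_ts": INF}}
time_travel = {"A": []}  # older versions of A, newest first

def tt_update(key, new_val, tid):
    """Copy the current master into the time-travel table, then
    overwrite the master in place in the main table."""
    master = main_table[key]
    old = dict(master, end_ts=tid)      # old version's lifetime ends at tid
    time_travel[key].insert(0, old)
    master.update(val=new_val, begin_ts=tid)

tt_update("A", 2, tid=5)
assert main_table["A"]["val"] == 2
assert time_travel["A"][0] == {"val": 1, "begin_ts": 0, "end_ts": 5}
```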

4.3 Delta Storage

  • The DBMS maintains the master versions of tuples in the main table and a sequence of delta versions in a separate delta storage. This storage is referred to as the rollback segment in MySQL and Oracle, and is also used in HyPer.
  • This scheme is ideal for UPDATE operations that modify a subset of a tuple’s attributes because it reduces memory allocations. However, it leads to higher overhead for read-intensive workloads.
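
A sketch of how delta versions can reconstruct an older snapshot by replaying pre-images backwards from the master (the record layout and helper are assumptions, not HyPer's or MySQL's actual format):

```python
# Master holds the full current tuple; each delta records only the
# attribute values that one update overwrote (the pre-images).
master = {"id": 7, "name": "alice", "balance": 300}
deltas = [  # newest first; each entry: (update timestamp, pre-image attrs)
    (20, {"balance": 200}),               # before ts 20, balance was 200
    (10, {"balance": 100, "name": "al"}), # before ts 10, two attrs differed
]

def read_as_of(tid):
    """Reconstruct the tuple as a reader with timestamp tid would see it."""
    tup = dict(master)
    for ts, old_attrs in deltas:
        if tid >= ts:
            break              # tuple already matches this reader's snapshot
        tup.update(old_attrs)  # undo: restore that update's pre-image
    return tup

assert read_as_of(25) == {"id": 7, "name": "alice", "balance": 300}
assert read_as_of(15) == {"id": 7, "name": "alice", "balance": 200}
assert read_as_of(5) == {"id": 7, "name": "al", "balance": 100}
```

The chain walk is why read-intensive workloads pay more here: every old-snapshot read re-applies deltas instead of finding a fully materialized version.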

4.4 Discussion

  • The append-only scheme is better for analytical queries that perform large scans because versions are stored contiguously in memory, which minimizes CPU cache misses and is ideal for hardware pre-fetching.
  • All of the storage schemes require the DBMS to allocate memory for each transaction from centralized data structures (i.e., tables, delta storage), which causes access contention. To avoid this problem, the DBMS can maintain separate memory spaces for each centralized structure and expand them in fixed-size segments. Each worker thread then acquires memory from its own space. This essentially partitions the database, thereby eliminating centralized contention points.
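
One way to sketch the per-thread space idea, using Python's threading.local as a stand-in for per-worker memory spaces (the class, segment size, and slot type are all invented for illustration):

```python
import threading

class PerThreadSpaces:
    """Each worker thread draws version slots from its own pool, expanded
    in fixed-size segments, so no central allocator is contended."""
    SEGMENT = 4  # slots added per expansion (tiny for illustration)

    def __init__(self):
        self.local = threading.local()

    def alloc(self):
        pool = getattr(self.local, "pool", None)
        if not pool:  # first use on this thread, or segment exhausted
            self.local.pool = pool = [bytearray(64) for _ in range(self.SEGMENT)]
        return pool.pop()

spaces = PerThreadSpaces()
slots = [spaces.alloc() for _ in range(5)]  # the fifth alloc triggers expansion
assert len(slots) == 5
```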

5. Garbage Collection

  • The GC process is divided into three steps: (1) detect expired versions; (2) unlink those versions from their associated chains and indexes; (3) reclaim their storage space.
  • The DBMS considers a version as expired if it is either an invalid version (i.e., created by an aborted transaction) or it is not visible to any active transaction. For the latter, the DBMS checks whether a version’s end-ts is less than the T_id of all active transactions. The DBMS maintains a centralized data structure to track this information, but this is a scalability bottleneck in a multi-core system.
  • An in-memory DBMS can avoid this centralized data structure with coarse-grained epoch-based memory management that tracks the versions created by transactions.
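
A coarse sketch of epoch-based reclamation (the EpochManager API is invented; a real system advances epochs on a timer and uses atomic counters rather than plain ints):

```python
class EpochManager:
    def __init__(self):
        self.current = 0
        self.active = {0: 0}    # epoch -> number of running transactions
        self.garbage = {0: []}  # epoch -> versions retired in that epoch

    def advance(self):          # start a new epoch
        self.current += 1
        self.active[self.current] = 0
        self.garbage[self.current] = []

    def enter(self):            # a transaction registers in the current epoch
        e = self.current
        self.active[e] += 1
        return e

    def exit(self, e):          # the transaction finishes
        self.active[e] -= 1

    def retire(self, version):  # an expired version, tagged with its epoch
        self.garbage[self.current].append(version)

    def reclaim(self):
        """Free garbage from every closed epoch with no live transactions."""
        freed = []
        for e in sorted(self.garbage):
            if e >= self.current or self.active[e] > 0:
                break           # epoch still open or still has members
            freed += self.garbage.pop(e)
        return freed

em = EpochManager()
t = em.enter()
em.retire("old-version")
em.advance()
assert em.reclaim() == []               # txn in epoch 0 is still running
em.exit(t)
assert em.reclaim() == ["old-version"]  # epoch 0 drained, safe to free
```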

5.1 Tuple-level Garbage Collection

The DBMS checks the visibility of each individual tuple version in one of two ways:

  • Background Vacuuming (VAC): the DBMS uses background threads that periodically scan the database for expired versions. But this mechanism does not scale for large databases, especially with a small number of GC threads. Optimizations: (1) transactions register the invalidated versions in a latch-free data structure, and the GC threads then reclaim these expired versions using the epoch-based scheme; (2) the DBMS maintains a bitmap of dirty blocks so that the vacuum threads do not examine blocks that were not modified since the last GC pass.
  • Cooperative Cleaning (COOP): when executing a transaction, the DBMS traverses the version chain to locate the visible version. During this traversal, it identifies the expired versions and records them in a global data structure. This approach scales well but only works with the O2N append-only storage. The DBMS also periodically performs a complete GC pass with a separate thread, to remove expired versions that transactions never traverse.

5.2 Transaction-level Garbage Collection

  • The DBMS considers a transaction as expired when the versions that it generated are not visible to any active transaction. After an epoch ends, all of the versions that were generated by the transactions belonging to that epoch can be safely removed. This works well with the transaction-local storage optimization, but its downside is that the DBMS must track the read/write sets of transactions for each epoch instead of just maintaining the epoch’s membership counter.

5.3 Discussion

  • Tuple-level GC with background vacuuming is the most common implementation in MVCC DBMSs.
  • In either scheme, increasing the number of dedicated GC threads speeds up the GC process.
  • The DBMS’s performance drops in the presence of long-running transactions, because all the versions generated during the lifetime of such a transaction cannot be removed until it completes.

6. Index Management

  • All MVCC DBMSs keep the database’s versioning information separate from its indexes.
  • For each index entry, its value is a pointer to the tuple.
  • Primary key indexes always point to the current version of a tuple.
  • For secondary indexes, there are two kinds of schemes for the contents of pointers.

6.1 Logical Pointers

  • The DBMS uses a fixed identifier that does not change for each tuple as its index entry value, together with an indirection layer that maps the tuple’s identifier to the HEAD of its version chain.
  • This avoids the problem of having to update all of a table’s indexes to point to a new physical location whenever a tuple is modified.
  • Two implementation choices for the mapping: (1) Primary Key (PKey); (2) Tuple Id (TupleId).
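
The indirection layer can be sketched in a few lines (dicts stand in for the secondary index and the mapping; all names are invented):

```python
indirection = {}      # TupleId -> current version (chain HEAD)
secondary_index = {}  # index key -> TupleId (never rewritten on update)

def insert(tuple_id, key, version):
    indirection[tuple_id] = version
    secondary_index[key] = tuple_id

def update(tuple_id, new_version):
    # One write to the indirection map; every index entry stays valid.
    indirection[tuple_id] = new_version

def lookup(key):
    return indirection[secondary_index[key]]

insert(42, "alice", "v1")
update(42, "v2")
assert lookup("alice") == "v2"
```

With physical pointers, by contrast, the `update` step would have to rewrite the entry in every secondary index, which is why logical pointers win on update-intensive workloads.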

6.2 Physical Pointers

  • When updating any tuple in a table, the DBMS inserts the newly created version into all the secondary indexes.
  • Several MVCC DBMSs, including MemSQL and Hekaton, employ this scheme.

6.3 Discussion

  • The logical pointer approach is better for write-intensive workloads, while the physical pointer is better for read-intensive workloads.
  • Index-only scans are not possible in a MVCC DBMS unless the tuples’ versioning information is embedded in each index.

7. Experimental Analysis

7.1 Benchmarks

  • YCSB (read-only, read-intensive, update-intensive)
  • TPC-C

7.2 Concurrency Control Protocol

  • Scalability Bottlenecks
  • Transaction Contention: MVOCC performs poorly when contention is high
  • Heterogeneous Workload
  • TPC-C

7.3 Version Storage

  • We configured the DBMS to use the MVTO protocol since it achieved the most balanced performance in the previous experiments.
  • Non-Inline Attributes: Maintaining reference counters for unmodified non-inline attributes always yields better performance
  • Version Chain Ordering: the N2O ordering always performs better than O2N in both workloads
  • Transaction Footprint: with more attributes per tuple, delta scheme achieves better performance because it uses less memory
  • Attributes Modified
  • Memory Allocation
  • TPC-C

7.4 Garbage Collection

  • Tuple-level Comparison
  • Tuple-level vs. Transaction-level

7.5 Index Management

  • The logical pointer scheme achieves higher performance than the physical pointer scheme.

8. Discussion

  • The version storage scheme is one of the most important components to scaling an in-memory MVCC DBMS in a multi-core environment.
  • Using a workload-appropriate concurrency control protocol improves the performance, particularly on high-contention workloads.
  • The performance of an MVCC DBMS is tightly coupled with its GC implementation. In particular, we found that transaction-level GC provided the best performance with the smallest memory footprint. The GC process can cause oscillations in the system’s throughput and memory footprint.
  • The index management scheme can also affect the DBMS’s performance for databases where many secondary indexes are constructed. The logical pointer scheme always achieves a higher throughput, especially when processing update-intensive workloads.

10. Conclusion

Written on August 1, 2017