Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores

Reference: Yu, Xiangyao, et al. "Staring into the abyss: An evaluation of concurrency control with one thousand cores." Proceedings of the VLDB Endowment 8.3 (2014): 209-220.

This is one of several papers from the suggested readings on Concurrency Control Challenges for CMU 15-721: Database Systems.

0. Abstract

Computer architectures are moving towards an era dominated by many-core machines with dozens or even hundreds of cores on a single chip. This unprecedented level of on-chip parallelism introduces a new dimension to scalability that current database management systems (DBMSs) were not designed for. In particular, as the number of cores increases, the problem of concurrency control becomes extremely challenging. With hundreds of threads running in parallel, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts.

To better understand just how unprepared current DBMSs are for future CPU architectures, we performed an evaluation of concurrency control for on-line transaction processing (OLTP) workloads on many-core chips. We implemented seven concurrency control algorithms in a main-memory DBMS and, using computer simulations, scaled our system to 1024 cores. Our analysis shows that all algorithms fail to scale to this magnitude but for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art DBMSs suffer from these limitations. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from the ground up and is tightly coupled with the hardware.

1. Introduction

  • We are entering the era of many-core machines that are powered by a large number of small, low-power cores on a single chip.
  • The scalability of single-node, shared-memory DBMSs is even more important in the many-core era.
  • We implemented seven concurrency control algorithms in a main memory DBMS and used a high-performance, distributed CPU simulator to scale the system to 1000 cores.
  • Our analysis showed that all algorithms fail to scale as the number of cores increases.
  • Contributions: (1) a comprehensive evaluation of the scalability of seven concurrency control schemes; (2) the first evaluation of an OLTP DBMS on 1000 cores; (3) identification of bottlenecks in concurrency control schemes that are not implementation-specific.

2. Concurrency Control Schemes

  • A transaction in the context of one of these systems is the execution of a sequence of one or more operations (e.g., SQL queries) on a shared database to perform some higher-level function.
  • The transaction is the basic unit of change in a DBMS: partial transactions are not allowed, and the effect of a group of transactions on the database’s state is equivalent to any serial execution of all transactions.
  • The transactions in modern OLTP workloads have three salient characteristics: (1) they are short-lived (i.e., no user stalls); (2) they touch a small subset of data using index look-ups (i.e., no full table scans or large joins); (3) they are repetitive (i.e., they execute the same queries with different inputs).
  • Concurrency control permits end-users to access a database in a multi-programmed fashion while preserving the illusion that each of them is executing their transaction alone on a dedicated system. It essentially provides the atomicity and isolation guarantees in the system.

We follow the canonical categorization that all concurrency control schemes are a variant of either two-phase locking or timestamp ordering protocols. Table 1 provides a summary of these different schemes.

2.1 Two-Phase Locking

Introduction

  • Two-phase locking (2PL) was the first provably correct method of ensuring the correct execution of concurrent transactions in a database system.
  • Under this scheme, transactions have to acquire locks for a particular element in the database before they are allowed to execute a read or write operation on that element. The DBMS maintains locks either for each tuple or at a higher logical level (e.g., tables, partitions).

Definition

  • In the first phase of 2PL, known as the growing phase, the transaction is allowed to acquire as many locks as it needs without releasing any.
  • When the transaction releases a lock, it enters the second phase, known as the shrinking phase; from this point on it is prohibited from obtaining additional locks.
  • When the transaction terminates (either by committing or aborting), all of its remaining locks are released (see the sketch below).
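
A minimal sketch of the two-phase rule, assuming a toy exclusive-lock table; the LockManager, record IDs, and method names are all illustrative, not the paper's DBMS:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <unordered_set>

// Toy lock table: one exclusive lock per record ID.
class LockManager {
    std::mutex m_;
    std::unordered_map<uint64_t, bool> held_;
public:
    bool acquire(uint64_t rid) {
        std::lock_guard<std::mutex> g(m_);
        if (held_[rid]) return false;   // lock already taken by someone else
        return held_[rid] = true;
    }
    void release(uint64_t rid) {
        std::lock_guard<std::mutex> g(m_);
        held_[rid] = false;
    }
};

class Transaction {
    enum class Phase { Growing, Shrinking };
    Phase phase_ = Phase::Growing;
    std::unordered_set<uint64_t> held_;
    LockManager& lm_;
public:
    explicit Transaction(LockManager& lm) : lm_(lm) {}

    bool lock(uint64_t rid) {
        // Growing phase: acquisitions are legal only before any release.
        assert(phase_ == Phase::Growing && "2PL violation: lock after unlock");
        if (!lm_.acquire(rid)) return false;   // caller must wait or abort
        held_.insert(rid);
        return true;
    }

    void unlock(uint64_t rid) {
        // The first release moves the transaction into the shrinking phase.
        phase_ = Phase::Shrinking;
        lm_.release(rid);
        held_.erase(rid);
    }

    void finish() {   // commit or abort: release all remaining locks
        for (uint64_t rid : held_) lm_.release(rid);
        held_.clear();
        phase_ = Phase::Shrinking;
    }
};
```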

Property

  • 2PL is considered a pessimistic approach in that it assumes that transactions will conflict and thus they need to acquire locks to avoid this problem.
  • If a transaction is unable to acquire a lock for an element, then it is forced to wait until the lock becomes available. If this waiting is uncontrolled (i.e., indefinite), then the DBMS can incur deadlocks.
  • A major difference among the different variants of 2PL is in how they handle deadlocks and the actions that they take when a deadlock is detected.

Schemes

  • 2PL with Deadlock Detection (DL_DETECT)
    • The DBMS monitors a waits-for graph of transactions and checks for cycles (i.e., deadlocks). When a deadlock is found, the system must choose a transaction to abort and restart in order to break the cycle.
    • In practice, a centralized deadlock detector is used for cycle detection. The detector chooses which transaction to abort based on the amount of resources it has already used (e.g., the number of locks it holds) to minimize the cost of restarting a transaction.
  • 2PL with Non-waiting Deadlock Prevention (NO_WAIT)
    • Unlike deadlock detection, where the DBMS waits to find deadlocks after they occur, this approach is more cautious: a transaction is aborted as soon as the system suspects that a deadlock might occur.
    • When a lock request is denied, the scheduler immediately aborts the requesting transaction (i.e., it is not allowed to wait to acquire the lock).
  • 2PL with Waiting Deadlock Prevention (WAIT_DIE)
    • This is a non-preemptive variation of the NO_WAIT scheme where a transaction is allowed to wait for the transaction that holds the lock that it needs only if it is older than the lock holder. If the requesting transaction is younger, it is aborted ("dies") and is forced to restart.
    • Each transaction acquires a timestamp before its execution, and this timestamp ordering guarantees that no deadlocks can occur. The conflict-handling rules of all three variants are sketched below.
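
The three variants differ mainly in how they react when a lock request is denied. Here is a minimal sketch of that decision logic, assuming smaller timestamps mean older transactions; the Action enum and function names are illustrative, not from the paper:

```cpp
#include <cstdint>

enum class Action { Wait, Abort };

// DL_DETECT: always wait; a separate detector later breaks any cycle
// by aborting one of the transactions in it.
Action on_conflict_dl_detect() { return Action::Wait; }

// NO_WAIT: never wait; abort immediately so a deadlock can never form.
Action on_conflict_no_wait() { return Action::Abort; }

// WAIT_DIE: an older requester may wait on a younger holder; a younger
// requester "dies" (aborts) and restarts with its original timestamp,
// so it eventually becomes the oldest and cannot starve.
Action on_conflict_wait_die(uint64_t requester_ts, uint64_t holder_ts) {
    return requester_ts < holder_ts ? Action::Wait : Action::Abort;
}
```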

2.2 Timestamp Ordering

Definition

  • A transaction is assigned a unique, monotonically increasing timestamp before it is executed; this timestamp is used by the DBMS to process conflicting operations in the proper order (e.g., read and write operations on the same element, or two separate write operations on the same element).
  • The key differences between the schemes are: (1) the granularity at which the DBMS checks for conflicts (i.e., tuples vs. partitions); (2) when the DBMS checks for these conflicts (i.e., while the transaction is running or at the end).

Schemes

  • Basic T/O (TIMESTAMP)
    • Every time a transaction reads or modifies a tuple in the database, the DBMS compares the timestamp of the transaction with the timestamp of the last transaction that read or wrote the same tuple.
    • For a read or write operation, the DBMS rejects the request if the transaction's timestamp is less than the timestamp of the last write to that tuple. Likewise, for a write operation, the DBMS rejects it if the transaction's timestamp is less than the timestamp of the last read of that tuple. (These rules are sketched after this list.)
    • In TIMESTAMP, a read query makes a local copy of the tuple to ensure repeatable reads since it is not protected by locks.
    • When a transaction is aborted, it is assigned a new timestamp and then restarted.
  • Multi-version Concurrency Control (MVCC)
    • Under MVCC, every write operation creates a new version of a tuple in the database. Each version is tagged with the timestamp of the transaction that created it.
    • The DBMS maintains an internal list of the versions of an element. For a read operation, the DBMS determines which version in this list the transaction will access. Thus it ensures a serializable ordering of all operations.
    • One benefit of MVCC is that the DBMS does not reject operations that arrive late. That is, the DBMS does not reject a read operation because the element that it targets has already been overwritten by another transaction.
  • Optimistic Concurrency Control (OCC)
    • The DBMS tracks the read/write sets of each transaction and stores all of their write operations in their private workspace. When a transaction commits, the system determines whether that transaction’s read set overlaps with the write set of any concurrent transactions. If no overlap exists, then the DBMS applies the changes from the transaction’s workspace into the database; otherwise, the transaction is aborted and restarted.
    • The advantage of this approach for main memory DBMSs is that transactions write their updates to shared memory only at commit time, and thus the contention period is short.
  • T/O with Partition-level Locking (H-STORE)
    • The database is divided into disjoint subsets of memory called partitions. Each partition is protected by a lock and is assigned a single-threaded execution engine that has exclusive access to that partition. Each transaction must acquire the locks for all of the partitions that it needs to access before it is allowed to start running. This requires the DBMS to know which partitions each transaction will access before it begins.
    • When a transaction request arrives, the DBMS assigns it a timestamp and then adds it to all of the lock acquisition queues for its target partitions. The execution engine for a partition removes a transaction from the queue and grants it access to that partition if the transaction has the oldest timestamp in the queue.
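
To make the TIMESTAMP rules above concrete, here is a hypothetical single-version sketch; the wts/rts fields and function names are assumptions, and MVCC would instead append a new timestamped version on every write:

```cpp
#include <algorithm>
#include <cstdint>

struct Tuple {
    uint64_t wts = 0;     // timestamp of the last write to this tuple
    uint64_t rts = 0;     // timestamp of the last read of this tuple
    uint64_t value = 0;
};

// Returns false if the request must be rejected; the transaction then
// aborts and restarts with a fresh timestamp.
bool to_read(Tuple& t, uint64_t ts, uint64_t& out) {
    if (ts < t.wts) return false;   // a logically later write already exists
    t.rts = std::max(t.rts, ts);
    out = t.value;                  // TIMESTAMP also copies for repeatable reads
    return true;
}

bool to_write(Tuple& t, uint64_t ts, uint64_t v) {
    if (ts < t.wts || ts < t.rts) return false;   // would reorder history
    t.wts = ts;
    t.value = v;
    return true;
}
```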

3. Many-Core DBMS Test-Bed

3.1 Simulator and Target Architecture

  • Graphite is a fast CPU simulator for large-scale multi-core systems.
  • The target architecture is a tiled multi-core CPU, where each tile contains a low-power, in-order processing core, 32KB L1 instruction/data cache, a 512KB L2 cache slice, and a network router.
  • We use a shared L2 cache configuration because it is the most common last-level cache design for commercial multi-cores.

3.2 DBMS

  • We implemented our own lightweight main memory DBMS based on pthreads to run in Graphite. It executes as a single process with the number of worker threads equal to the number of cores, where each thread is mapped to a different core.
  • Client threads are not simulated in our system; instead, each worker contains a fixed-length queue of transactions that are served in order.
  • Transaction statistics, such as throughput, latency and abort rates, are collected after a warm-up period that is long enough for the system to achieve a steady state.

In addition to runtime statistics, our DBMS also reports how much time each transaction spends in the different components of the system. We group these measurements into six categories:

  • USEFUL WORK: the time that the transaction is actually executing application logic and operating on tuples in the system.
  • ABORT: the overhead incurred when the DBMS rolls back all of the changes made by a transaction that aborts.
  • TS ALLOCATION: the time that it takes for the system to acquire a unique timestamp from the centralized allocator.
  • INDEX: the time that the transaction spends in the hash indexes for tables, including the overhead of low-level latching of the buckets in the hash tables.
  • WAIT: the total amount of time that a transaction has to wait. A transaction may either wait for a lock (e.g., 2PL) or for a tuple whose value is not ready yet (e.g., T/O).
  • MANAGER: the time that the transaction spends in the lock manager or the timestamp manager. This excludes any time that it has to wait.
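
Below is a minimal sketch of how such a per-transaction breakdown can be collected. The bucket names mirror the six categories; the timing helper and the usage line are illustrative, not the paper's actual instrumentation:

```cpp
#include <array>
#include <chrono>
#include <cstdint>

enum Bucket { UsefulWork, AbortWork, TsAllocation, Index, Wait, Manager, NumBuckets };

struct Breakdown {
    std::array<uint64_t, NumBuckets> ns{};   // time per category, in nanoseconds

    // Run `work` and charge the elapsed wall-clock time to bucket `b`.
    template <typename F>
    void timed(Bucket b, F&& work) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        ns[b] += std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    }
};

// Usage (hypothetical): per-worker breakdown such as
//   bd.timed(Index, [&] { index_lookup(key); });
```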

3.3 Workloads

  • YCSB: The Yahoo! Cloud Serving Benchmark is a collection of workloads that are representative of large-scale services created by Internet-based companies.
  • TPC-C: this benchmark is the current industry standard for evaluating the performance of OLTP systems.

3.4 Simulator vs. Real Hardware

  • Experiments show that all of the concurrency control schemes exhibit the same performance trends on Graphite and the real CPU.

4. Design Choices & Optimizations

  • We did our best to optimize each algorithm, removing all possible scalability bottlenecks of both the 2PL and T/O schemes and showing how hardware support mitigates the remaining problems. Most of this work went into eliminating shared data structures and devising distributed versions of the classical algorithms.

4.1 General Optimizations

  • Memory Allocation
    • When using the default Linux version of malloc, we found that the DBMS spends most of its time waiting for memory allocation. This is a problem even for read-only workloads, since the DBMS still needs to copy records for reads in TIMESTAMP and to create internal meta-data handles for its access-tracking data structures. Optimized allocators (TCMalloc, jemalloc) yielded similarly disappointing results.
    • We overcame this by writing a custom malloc implementation. As in TCMalloc and jemalloc, each thread is assigned its own memory pool; the difference is that our allocator automatically resizes the pools based on the workload. For example, if a benchmark frequently allocates large chunks of contiguous memory, the pool size increases to amortize the cost of each allocation. (A simplified sketch of this idea follows this list.)
  • Lock Table
    • Instead of having a centralized lock table or timestamp manager, we implemented these data structures in a per-tuple fashion where each transaction only latches the tuples that it needs.
    • This improves scalability but increases the memory overhead, because the DBMS maintains additional meta-data for the lock sharer/waiter information. In practice, this meta-data (several bytes) is negligible for large tuples.
  • Mutexes
    • For 2PL, the mutex that protects the centralized deadlock detector is the main bottleneck, while for the T/O algorithms it is the mutex used for allocating unique timestamps.
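
A simplified sketch of the per-thread pool idea behind the custom allocator; the bump-pointer layout, doubling policy, and 64 KB starting size are assumptions, not the paper's implementation:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

class ThreadArena {
    std::vector<void*> chunks_;   // chunks owned by this thread
    char*  cur_  = nullptr;
    size_t left_ = 0;
    size_t chunk_ = 1 << 16;      // starting chunk size: 64 KB

    void refill(size_t need) {
        // Grow the pool when demand outpaces it, amortizing malloc calls.
        while (chunk_ < need) chunk_ *= 2;
        cur_  = static_cast<char*>(std::malloc(chunk_));
        left_ = chunk_;
        chunks_.push_back(cur_);
    }
public:
    void* alloc(size_t n) {
        n = (n + 15) & ~size_t(15);   // keep 16-byte alignment
        if (n > left_) refill(n);
        void* p = cur_;
        cur_  += n;
        left_ -= n;
        return p;
    }
    ~ThreadArena() { for (void* c : chunks_) std::free(c); }
};

// One arena per worker thread, so allocation never crosses cores.
thread_local ThreadArena tl_arena;
```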

4.2 Scalable Two-Phase Locking

  • Deadlock Detection
    • For DL_DETECT, we found that the deadlock detection algorithm is a bottleneck when multiple threads compete to update the wait-for graph in a centralized data structure.
    • We solved this by partitioning the data structure across cores and making the deadlock detector completely lock-free. When a transaction waits, its thread records the transaction that it is waiting for in its own local queue without acquiring any locks. This step is local (i.e., no cross-core communication), since a thread never writes to the queues of other transactions.
    • In the deadlock detection process, a thread searches for cycles in a partial wait-for graph constructed by reading the queues of related threads without locking them. Although this approach may not discover a deadlock immediately after it forms, the thread is guaranteed to find it on a subsequent pass.
  • Lock Thrashing
    • This occurs when a transaction holds its locks until it commits, blocking all the other concurrent transactions that attempt to acquire those locks.
    • Experiments show that lock thrashing is the key bottleneck of lock-based approaches that limits scalability in high-contention scenarios.
  • Waiting vs. Aborting
    • We added a timeout threshold to the DBMS that causes the system to abort and restart any transaction that has been waiting for a lock for longer than the threshold (sketched below).
    • In practice, the threshold trades off performance against the transaction abort rate: a shorter timeout means less time wasted waiting but more work wasted on aborts.
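
A sketch of the timeout mechanism; the polling loop and names are illustrative (a real implementation would park the thread on a wait queue instead of spinning):

```cpp
#include <chrono>

// Try to acquire a lock, giving up once the wait exceeds `timeout`.
// Returns false if the caller should abort and restart the transaction.
template <typename TryAcquire>
bool wait_or_abort(TryAcquire&& try_acquire, std::chrono::microseconds timeout) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!try_acquire()) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;   // timed out: abort rather than risk thrashing
    }
    return true;            // lock acquired within the threshold
}
```

A zero timeout makes this behave like NO_WAIT, while an unbounded timeout degenerates to pure waiting under DL_DETECT; the threshold tunes between the two.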

4.3 Scalable Timestamp Ordering

  • Timestamp Allocation
    • Traditional approaches (e.g., a mutex-protected allocator or atomic addition on a shared counter) are insufficient for a 1000-core CPU.
    • Atomic addition with batching: the DBMS uses the same atomic instruction to allocate timestamps, but the timestamp manager returns multiple timestamps together in a batch for each request.
    • CPU clocks: each worker thread reads a logical clock from its local core and then concatenates it with its thread id. This scales well as long as all the clocks are synchronized.
    • Hardware counter: the counter is physically located at the center of the CPU such that the average distance to each core is minimized. No existing CPU currently supports this, so we implemented a counter in Graphite where a timestamp request is sent through the on-chip network to increment it atomically in a single cycle.
    • The hardware-based solutions (CPU clocks and the hardware counter) are able to scale with the number of cores. The batched-atomic and clock-based schemes are sketched after this list.
  • Distributed Validation
    • The original OCC algorithm contains a critical section at the end of the read phase, where the transaction's read set is compared to the write sets of previous transactions to detect conflicts. Although this step is short, any mutex-protected critical section severely hurts scalability.
    • We solve this problem by using per-tuple validation that breaks up this check into smaller operations.
  • Local Partitions
    • We optimize the original H-STORE protocol to take advantage of shared memory.
    • Because the DBMS's worker threads run in a single process, we allow multi-partition transactions to access tuples at remote partitions directly, instead of sending query requests to be executed by the remote partitions' worker threads.
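
A sketch of the batched-atomic and clock-based timestamp allocators mentioned above; the batch size of 64 and the 10-bit thread-id field are illustrative choices, not the paper's parameters:

```cpp
#include <atomic>
#include <cstdint>

// (1) Atomic addition with batching: the shared counter is bumped once
// per BATCH timestamps instead of once per transaction, cutting
// cross-core traffic on the counter's cache line.
constexpr uint64_t BATCH = 64;
std::atomic<uint64_t> g_next{1};
thread_local uint64_t tl_cur = 0, tl_end = 0;

uint64_t batched_ts() {
    if (tl_cur == tl_end) {                 // local batch exhausted
        tl_cur = g_next.fetch_add(BATCH);   // one contended op per batch
        tl_end = tl_cur + BATCH;
    }
    return tl_cur++;
}

// (2) Clock-based: concatenate a per-core logical clock with the thread
// id, so allocation touches no shared state at all; uniqueness comes
// from the thread-id bits, and correctness assumes synchronized clocks.
uint64_t clock_ts(uint64_t local_clock, uint64_t thread_id) {
    return (local_clock << 10) | (thread_id & 0x3FF);   // 10-bit thread id
}
```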

5. Experimental Analysis

5.1 Read-Only Workload

  • Timestamp allocation becomes the bottleneck with a large core count.
  • OCC hits the bottleneck even earlier since it needs to allocate timestamps twice per transaction (i.e., at transaction start and before the validation phase).
  • Both OCC and TIMESTAMP perform significantly worse than the other algorithms because they waste cycles copying tuples to perform a read, whereas the other algorithms read tuples in place.
  • See paper for more details.

5.2 Write-Intensive Workload

  • In reality, the access distribution of an OLTP application is rarely uniform. Instead, it tends to follow a Zipfian skew, where certain tuples are more likely to be accessed than others.
  • See paper for more details.

5.3 Working Set Size

  • When a transaction’s working set is large, it increases the likelihood that the same data is accessed by concurrent transactions. For 2PL algorithms, this increases the length of time that locks are held by a transaction. With T/O, however, longer transactions may reduce timestamp allocation contention.
  • See paper for more details.

5.4 Read/Write Mixture

  • TIMESTAMP and OCC do not perform well because they copy tuples for reading.
  • MVCC stands out as having the best performance when there is a small number of write transactions.
  • See paper for more details.

5.5 Database Partitioning

  • See paper for more details.

5.6 TPC-C

  • See paper for more details.

6. Discussion

6.1 DBMS Bottlenecks

  • Table 2 summarizes the findings for each of the schemes. In particular, we identified several bottlenecks to scalability: (1) lock thrashing; (2) preemptive aborts; (3) deadlocks; (4) timestamp allocation; (5) memory-to-memory copying.
  • Although no single concurrency control scheme performed the best for all workloads, one may outperform the others under certain conditions. Thus, it may be possible to combine two or more classes of algorithms into a single DBMS and switch between them based on the workload. For example, a DBMS could use DL_DETECT for workloads with little contention, but switch to NO_WAIT or a T/O-based algorithm when transactions are taking too long to finish due to thrashing. One could also employ a hybrid approach, such as MySQL’s DL_DETECT + MVCC scheme, where read-only transactions use multi-versioning and all others use 2PL.
  • These results also make it clear that new hardware support is needed to overcome some of these bottlenecks. For example, all of the T/O schemes suffer from the timestamp allocation bottleneck when the throughput is high.
  • We also saw that memory issues cause slowdown in some of the schemes. (1) One way to alleviate this problem is to add a hardware accelerator on the CPU that performs memory copying in the background; this would eliminate the need to load all data through the CPU's pipeline. (2) malloc was another bottleneck, and we believe that future CPUs will need decentralized or hierarchical memory controllers to provide faster memory allocation.

6.2 Multi-core vs. Multi-node Systems

  • Moving to a multi-node architecture introduces a new performance bottleneck: distributed transactions. Since these transactions access data that may not be on the same node, the DBMS must use an atomic commit protocol, such as two-phase commit.
  • The communication between nodes over the network is much slower compared to the communication between threads in a shared-memory environment. This means that a single many-core node with a large amount of DRAM might outperform a distributed DBMS for all but the largest OLTP applications.
  • It may be that for multi-node DBMSs two levels of abstraction are required: a shared-nothing implementation between nodes and a multi-threaded shared-memory DBMS within a single chip. This hierarchy is common in high-performance computing applications.
  • Because these limitations are inherent to these algorithms, it is possible that no workaround exists in software. In this case, software-hardware co-design is the only solution to address these issues.

8. Future Work

  • For lower core configurations, we found that 2PL-based schemes are good at handling the short transactions with low contention that are common in key-value workloads, whereas T/O-based algorithms are better at handling the higher contention and longer transactions that are more common in complex OLTP workloads.
Written on June 1, 2017