# Speedy Transactions in Multicore In-Memory Databases

Reference: Tu, Stephen, et al. "Speedy transactions in multicore in-memory databases." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013.

This is one of several papers in the suggested readings on Optimistic Concurrency Control for CMU 15-721: Database Systems.

## 0. Abstract

Silo is a new in-memory database that achieves excellent performance and scalability on modern multicore machines. Silo was designed from the ground up to use system memory and caches efficiently. For instance, it avoids all centralized contention points, including that of centralized transaction ID assignment. Silo’s key contribution is a commit protocol based on optimistic concurrency control that provides serializability while avoiding all shared-memory writes for records that were only read. Though this might seem to complicate the enforcement of a serial order, correct logging and recovery is provided by linking periodically-updated epochs with the commit protocol. Silo provides the same guarantees as any serializable database without unnecessary scalability bottlenecks or much additional latency. Silo achieves almost 700,000 transactions per second on a standard TPC-C workload mix on a 32-core machine, as well as near-linear scalability. Considered per core, this is several times higher than previously reported results.

## 1. Introduction

• Silo uses a Masstree-inspired tree structure for its underlying indexes. Masstree is a fast concurrent B-tree-like structure optimized for multi-core performance.
• Silo uses a variant of optimistic concurrency control (OCC).
• Silo provides serializability while avoiding all shared-memory writes for read transactions, by epoch-based group commit.
• Silo assumes a one-shot request model in which all parameters for each client request are available at the start, and requests always complete without further client interaction.

## 2. Related Work

### 2.1 Non-transactional systems

• Masstree: version validation instead of read locks and efficient fine-grained locking algorithms
• PALM: batching technique, extensive pre-fetching, and intra-core SIMD parallelism
• Bw-tree: a high-throughput multi-version tree structure optimized for multi-core flash storage

### 2.2 Transactional systems

• OCC and its variants often induce contention in the commit phase, such as centralized transaction ID assignment or communication among all concurrently executing transactions.
• Hekaton: OCC-based MVCC
• DORA: a locking-based system that partitions data and locks among cores, eliminating long chains of lock waits on a centralized lock manager and increasing cache affinity.
• PLP: follow-on work to DORA. In PLP, the database is physically partitioned among many trees such that only a single thread manages a tree. The partitioning scheme is flexible, and thus requires maintaining a centralized routing table.
• H-Store and its commercial successor VoltDB employ an extreme form of partitioning, treating each partition as a separate logical database even when partitions are collocated on the same physical node. Transactions local to a single partition run without locking at all, and multi-partition transactions are executed via the use of whole-partition locks. This makes single-partition transactions extremely fast, but creates additional scalability problems for multi-partition transactions.
• Multimed runs OLTP on a single multi-core machine by running multiple database instances on separate cores in a replicated setup.

### 2.3 Transactional memory

• Recent STM systems, including TL2, are based on optimistic concurrency control and maintain read and write sets similar to those in OCC databases.
• No STM beats efficient locking-based code.

## 3. Architecture

• Silo is a relational database that provides tables of typed, named records. Its clients issue one-shot requests: all parameters are available when a request begins, and the request does not interact with its caller until it completes.
• Silo tables are implemented as collections of index trees, including one primary tree and zero or more secondary trees per table. Looking up a record by a secondary index requires two tree accesses.
• Each index tree is stored in an ordered key-value structure based on Masstree.
• Although the primary copy of data in Silo is in main memory, transactions are made durable via logging to stable storage. Results are not returned to users until they are durable.

## 4. Design

• Key organizing principle: eliminating unnecessary contention by reducing writes to shared memory

### 4.1 Epochs

• A global epoch number E is visible to all threads. A designated thread periodically advances E; other threads access E while committing transactions.
• E should advance frequently, since the epoch period affects transaction latency, but epoch change should be rare compared to transaction duration so that the value of E is generally cached. Our implementation updates E once every 40 ms; shorter epochs would also work. No locking is required to handle E.
• Each worker w also maintains a local epoch number $e_w$. This can lag behind E while the worker computes, and is used to determine when it is safe to collect garbage. Silo requires that E and $e_w$ never diverge too far: E − $e_w$ ≤ 1 for all w.
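The epoch invariant can be sketched in a few lines of Python. This is a single-threaded illustration with invented names (`EpochState`, `worker_refresh`, `try_advance`); the real system uses concurrent threads, a dedicated epoch-advancing thread, and memory fences.

```python
# Minimal single-threaded sketch of Silo's epoch invariant E - e_w <= 1.
# All names are illustrative, not from the paper's implementation.

class EpochState:
    def __init__(self, n_workers):
        self.E = 1                      # global epoch
        self.e_w = [1] * n_workers      # each worker's local epoch

    def worker_refresh(self, w):
        # A worker reads E at each transaction boundary.
        self.e_w[w] = self.E

    def try_advance(self):
        # The epoch-advancing thread only increments E once every
        # worker has caught up, so E - e_w <= 1 always holds.
        if all(e >= self.E for e in self.e_w):
            self.E += 1
            return True
        return False

s = EpochState(2)
assert s.try_advance()          # all workers at epoch 1, E becomes 2
assert not s.try_advance()      # both workers still at e_w = 1
s.worker_refresh(0); s.worker_refresh(1)
assert s.try_advance()          # now E becomes 3
assert all(s.E - e <= 1 for e in s.e_w)
```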

### 4.2 Transaction IDs

• Silo assigns each transaction a transaction ID, and each record contains the TID of the transaction that most recently modified it. TID consists of an epoch number at commit time, identifier in the same epoch, and status bits.
• A worker chooses a transaction’s TID only after verifying that the transaction can commit. At that point, it calculates the smallest number that is (a) larger than the TID of any record read or written by the transaction, (b) larger than the worker’s most recently chosen TID, and (c) in the current global epoch.
• The status bits are separate from the TID itself: a lock bit, a latest-version bit, and an absent bit.
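The TID-selection rule (a)–(c) above translates almost directly into code. A hedged sketch, assuming the epoch sits in the TID's high-order bits; the shift width `EPOCH_SHIFT` is illustrative, and status bits are ignored here.

```python
# Sketch of TID selection at commit time. EPOCH_SHIFT is an assumed
# layout: epoch in the high bits, in-epoch sequence in the low bits.

EPOCH_SHIFT = 32

def choose_tid(global_epoch, last_tid, observed_tids):
    # Smallest TID that is (a) larger than every TID read or written,
    # (b) larger than the worker's last chosen TID, and
    # (c) in the current global epoch.
    floor = max([last_tid] + list(observed_tids)) + 1
    epoch_min = global_epoch << EPOCH_SHIFT   # first TID of this epoch
    return max(floor, epoch_min)

tid = choose_tid(global_epoch=5,
                 last_tid=(5 << 32) + 7,
                 observed_tids=[(5 << 32) + 3, (4 << 32) + 9])
assert tid == (5 << 32) + 8      # larger than all observed, in epoch 5
```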

### 4.3 Data layout

• A record in Silo contains: (1) a TID word (TID + status bits); (2) a previous-version pointer; (3) record data.
• Committed transactions usually modify record data in place. However, readers must then use a version validation protocol to ensure that they have read a consistent version of each record’s data.
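One plausible layout of the TID word packs the three status bits below the TID. The exact bit widths here are an assumption for illustration; only the existence of the three status bits comes from the paper.

```python
# Illustrative TID-word layout: TID in the high bits, three status
# bits (lock, latest-version, absent) in the low bits.

LOCK_BIT   = 1 << 0
LATEST_BIT = 1 << 1
ABSENT_BIT = 1 << 2

def make_tid_word(tid, locked=False, latest=True, absent=False):
    word = tid << 3                 # TID occupies the high bits
    if locked: word |= LOCK_BIT
    if latest: word |= LATEST_BIT
    if absent: word |= ABSENT_BIT
    return word

w = make_tid_word(42, locked=True)
assert w >> 3 == 42
assert (w & LOCK_BIT) and (w & LATEST_BIT) and not (w & ABSENT_BIT)
```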

### 4.4 Commit protocol

• As a worker runs a transaction, it maintains a read-set, and a write-set.
• To avoid deadlocks, workers lock records in a global order. Any deterministic global order is fine; Silo uses the pointer addresses of records.
• After all write locks are acquired, the worker takes a snapshot of the global epoch number using a single memory access. The snapshot of the global epoch number is the serialization point for transactions that commit.
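The three phases can be sketched as a toy Python function over a dict of records. All names are invented; real Silo locks by record pointer address, uses compare-and-swap on the TID word, and needs memory fences that Python cannot express.

```python
# Toy sketch of Silo's commit protocol. Each record is a dict:
# {"tid": int, "locked": bool, "data": ...}.

def commit(db, read_set, write_set, global_epoch, worker):
    # Phase 1: lock the write set in a deterministic global order to
    # avoid deadlock (Silo orders by record address; we sort by key).
    for key in sorted(write_set):
        db[key]["locked"] = True
    epoch = global_epoch()          # serialization point: one memory read
    # Phase 2: validate the read set against current TIDs.
    for key, seen_tid in read_set.items():
        rec = db[key]
        if rec["tid"] != seen_tid or (rec["locked"] and key not in write_set):
            for k in sorted(write_set):
                db[k]["locked"] = False
            return None             # abort; caller may retry
    # Phase 3: choose a TID in the snapshotted epoch, install writes,
    # and release the locks.
    tid = max([worker["last_tid"]] + list(read_set.values()) +
              [db[k]["tid"] for k in write_set]) + 1
    tid = max(tid, epoch << 32)     # assumed layout: epoch in high bits
    worker["last_tid"] = tid
    for key, value in write_set.items():
        db[key].update(tid=tid, data=value, locked=False)
    return tid

db = {"a": {"tid": 1, "locked": False, "data": 0},
      "b": {"tid": 1, "locked": False, "data": 0}}
worker = {"last_tid": 0}
tid = commit(db, {"a": 1}, {"b": 7}, lambda: 3, worker)
assert tid is not None and db["b"]["data"] == 7 and db["b"]["tid"] == tid
```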

### 4.5 Database operations

• To modify record data during Phase 3 of the commit protocol, a worker while holding the lock (a) updates the record, (b) performs a memory fence, and (c) stores the TID and releases the lock.
• To access record data during transaction execution (outside the commit protocol), a worker (a) reads the TID word, spinning until the lock is clear, (b) checks whether the record is the latest version, (c) reads the data, (d) performs a memory fence, and (e) checks the TID word again. If the record is not the latest version in step (b) or the TID word changes between steps (a) and (e), the worker must retry or abort.
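Steps (a)–(e) of the optimistic read amount to a double-check of the TID word around the data read. A minimal sketch, with invented names; the fence in step (d) has no Python equivalent and is marked by a comment.

```python
# Sketch of the version-validation read protocol. A record is a dict
# with a "tid_word" and "data"; LOCK_BIT is an assumed low-bit layout.

LOCK_BIT = 1 << 0

def stable_read(rec, is_latest, max_spins=100):
    for _ in range(max_spins):
        v1 = rec["tid_word"]
        if v1 & LOCK_BIT:
            continue                 # (a) spin until the lock is clear
        if not is_latest(rec):
            return None              # (b) older version: retry or abort
        data = rec["data"]           # (c) read the data
        # (d) a memory fence would go here in a real implementation
        v2 = rec["tid_word"]         # (e) re-check the TID word
        if v1 == v2:
            return v1, data          # consistent version of this record
    return None

rec = {"tid_word": (42 << 3) | (1 << 1), "data": "x"}   # unlocked, latest
assert stable_read(rec, is_latest=lambda r: True) == (rec["tid_word"], "x")
```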

**Deletes**

• A delete operation marks its record as “absent” using the absent bit and registers the record for later garbage collection.

**Inserts**

• An insert targets a record that does not yet exist, so there is nothing to lock during commit. To avoid this problem, Silo inserts a new record for the insert request before starting the commit protocol.

### 4.6 Range queries and phantoms

• The typical solution to the phantom problem in database systems is next-key locking.
• Silo deals with this issue by taking advantage of the underlying B+-tree’s version number on each leaf node. We add the leaf nodes that overlap with the key space [a, b) to the node-set along with the version numbers examined during the scan. We check that the version numbers of all tree nodes in the node-set have not changed, which ensures that no new keys have been added or removed within the ranges examined.
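The node-set check can be sketched as follows. The leaf/structure names are invented and much simpler than Masstree's; the key property is only that any insert, delete, or split bumps the leaf's version number.

```python
# Sketch of phantom protection via leaf-version validation.

def scan(leaves, lo, hi, node_set):
    # Record each overlapping leaf's version alongside the keys it
    # contributes to the range [lo, hi).
    out = []
    for leaf in leaves:
        if leaf["keys"] and leaf["keys"][-1] >= lo and leaf["keys"][0] < hi:
            node_set[id(leaf)] = leaf["version"]
            out += [k for k in leaf["keys"] if lo <= k < hi]
    return out

def validate(leaves_by_id, node_set):
    # At commit, every examined leaf must still carry the version we
    # saw; a changed version means keys may have appeared or vanished.
    return all(leaves_by_id[i]["version"] == v for i, v in node_set.items())

l1 = {"keys": [1, 3], "version": 7}
ns = {}
assert scan([l1], 0, 10, ns) == [1, 3]
assert validate({id(l1): l1}, ns)
l1["version"] += 1            # a concurrent structural change
assert not validate({id(l1): l1}, ns)
```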

### 4.7 Secondary indexes

• A transaction that used a secondary index will abort if the record it accessed there has changed.

### 4.8 Garbage collection

• When a worker generates garbage, it registers the garbage object and its reclamation epoch in a per-core list for that object type. The reclamation epoch is the epoch after which no thread could possibly access the object; once it is reached, we can free the object.
• The epoch-advancing thread periodically checks all $e_w$ values and sets a global tree reclamation epoch to $min \, e_w − 1$. Garbage tree nodes with smaller or equal reclamation epochs can safely be freed.
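The reclamation rule is a one-liner over the workers' local epochs. A sketch with invented names, ignoring the per-core list structure.

```python
# Sketch of tree-node garbage collection: nodes whose reclamation
# epoch is <= min(e_w) - 1 can no longer be reached by any thread.

def collect(garbage, worker_epochs):
    reclamation_epoch = min(worker_epochs) - 1
    freed = [node for node, e in garbage if e <= reclamation_epoch]
    remaining = [(node, e) for node, e in garbage if e > reclamation_epoch]
    return freed, remaining

freed, remaining = collect([("n1", 3), ("n2", 5)], worker_epochs=[5, 4])
assert freed == ["n1"] and remaining == [("n2", 5)]
```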

### 4.9 Snapshot transactions

• We provide consistent snapshots using snapshot epochs. Snapshot epoch boundaries align with epoch boundaries, and thus are consistent points in the serial order. However, snapshot epochs advance more slowly than epochs.
• Read/write transactions must not delete or overwrite record versions still relevant for a snapshot; instead they install a new record whose previous-version pointer links to the old one.

### 4.10 Durability

• Silo treats whole epochs as the durability commit units. The results of transactions in epoch e are not released to clients until all transactions with epochs ≤ e have been stored durably.
• Although epochs as a whole are serially consistent, the serial order within an epoch is not recoverable from the information we log.
• Silo uses record-level redo logging exclusively, not undo logging or operation logging. Undo logging is not necessary for our system since we log after transactions commit; logging at record level simplifies recovery.
• To recover, Silo would read the most recent $d_l$ for each logger, compute D = min $d_l$, and then replay the logs, ignoring entries for transactions whose TIDs are from epochs after D. Log records for the same record must be applied in TID order to ensure that the result equals the latest version, but replaying can otherwise be performed concurrently.
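The recovery rule can be sketched directly: take the minimum durable epoch across loggers, then apply surviving log entries per key in TID order. The epoch-in-high-bits TID layout and all names are assumptions for illustration; real replay is concurrent.

```python
# Sketch of Silo recovery: D = min(d_l) over loggers; replay entries
# whose TID epoch is <= D, last-writer-wins per key in TID order.

EPOCH_SHIFT = 32   # assumed TID layout: epoch in the high bits

def recover(logger_epochs, log_entries):
    D = min(logger_epochs)
    db = {}
    # Only per-key TID order matters; a global sort by TID is a simple
    # (if sequential) way to obtain it.
    for tid, key, value in sorted(log_entries):
        if tid >> EPOCH_SHIFT <= D:
            db[key] = (tid, value)
    return db

db = recover([3, 4], [
    ((3 << 32) + 1, "a", "old"),
    ((3 << 32) + 2, "a", "new"),
    ((4 << 32) + 1, "b", "late"),   # epoch 4 > D = 3: ignored
])
assert db == {"a": ((3 << 32) + 2, "new")}
```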

## 5. Evaluation

Performance Hypotheses

• The cost of Silo’s read/write set tracking for a simple key-value workload is low (5.2).
• Silo scales as more cores become available, even when transactions are made persistent (5.3).
• Silo’s performance is more robust to workload changes compared to a partitioned data store, specifically as we increase the level of cross-partition contention and the skew of the workload (5.4).
• Large read-only transactions benefit substantially from Silo’s snapshot transactions (5.5). This mechanism incurs only a small space overhead, even for a write-heavy workload (5.6).

### 5.1 Experimental setup

• We disabled hyperthreading on all CPUs; we found slightly worse results with hyperthreading enabled.
• We use a custom memory allocator for both B+-tree nodes and records. Our allocator takes advantage of 2MB “superpages” (a feature supported in recent Linux kernels) to reduce TLB pressure.
• Before we run an experiment, we make sure to pre-fault our memory pools so that the scalability bottlenecks of Linux’s virtual memory system are not an issue in our benchmarks.

### 5.2 Overhead of small transactions

• Varying the number of worker threads
• Silo’s read/write set tracking has low overhead compared to the underlying key-value store, which does not perform any tracking.
• Here we see a scalability collapse after 24 workers with globally generated TIDs, which demonstrates the necessity of avoiding even a single global atomic instruction during the commit phase.

### 5.3 Scalability and persistence

• Varying the number of worker threads
• Most of the loss in throughput when enabling durability is due to the overhead of transferring log records from worker threads to logger threads, rather than the overhead of writing to physical hardware.
• Overall, we see that logging does not significantly degrade the throughput of Silo and incurs only a modest increase in latency.

### 5.4 Comparison with Partitioned-Store

• Varying the percentage of cross-partition transactions, the number of worker threads
• Partitioned-Store is clearly the optimal solution for perfectly partitionable workloads. The advantages are two-fold: no concurrency control mechanisms are required, and there is better cache locality due to the partitioned trees being smaller.
• While Silo’s OCC protocol initially pays a non-trivial overhead for tracking record-level changes in low contention regimes, this work pays off as the contention increases.
• We substantially reduce this contention in MemSilo+FastIds by generating IDs outside the new-order transaction. The new-order client request runs two transactions, where the first generates a unique ID and the second uses that ID.

### 5.5 Effectiveness of snapshot transactions

• For other workloads involving many large read-only transactions over frequently updated records, we show that snapshots provide performance gains.

### 5.6 Space overhead of snapshots

• The space overhead of maintaining multiple versions of records to support snapshot transactions is low, even for update heavy workloads.

### 5.7 Factor analysis

• The two major takeaways from Figure 11 are that performing in-place updates is an important optimization for Silo and that the overhead of maintaining snapshots plus garbage collection is low.
• The takeaways from the persistence measurements are that spending extra CPU cycles to reduce the number of bytes written to disk surprisingly does not pay off for this TPC-C workload, and that the overhead of copying record modifications into logger buffers and then into persistent storage is low.

## 6. Conclusions

Written on July 10, 2017