# The End of a Myth: Distributed Transactions Can Scale

Reference: Zamanian, Erfan, et al. "The end of a myth: distributed transactions can scale." Proceedings of the VLDB Endowment 10.6 (2017): 685-696.

## 0. Abstract

The common wisdom is that distributed transactions do not scale. But what if distributed transactions could be made scalable using the next generation of networks and a redesign of distributed databases? There would be no need for developers anymore to worry about co-partitioning schemes to achieve decent performance. Application development would become easier as data placement would no longer determine how scalable an application is. Hardware provisioning would be simplified as the system administrator can expect a linear scale-out when adding more machines rather than some complex sub-linear function, which is highly application specific. In this paper, we present the design of our novel scalable database system NAM-DB and show that distributed transactions with the very common Snapshot Isolation guarantee can indeed scale using the next generation of RDMA-enabled network technology without any inherent bottlenecks. Our experiments with the TPC-C benchmark show that our system scales linearly to over 6.5 million new-order (14.5 million total) distributed transactions per second on 56 machines.

## 1. Introduction

• The common wisdom is that distributed transactions do not scale. As a result, many techniques have been proposed to avoid distributed transactions, ranging from locality-aware partitioning and speculative execution to new consistency levels and the relaxation of durability guarantees.
• Even worse, most of these techniques are not transparent to the developer. Instead, the developer not only has to understand all the implications of these techniques, but also must carefully design the application to take advantage of them.
• This paper shows that Snapshot Isolation can scale using the next generation of RDMA-enabled networking technology, with no inherent bottleneck other than the workload itself.
• With Remote Direct Memory Access (RDMA), it is possible to bypass the CPU when transferring data from one machine to another. The current generation of RDMA-capable networks, such as InfiniBand FDR 4x, is already able to provide a bandwidth similar to the aggregated memory bandwidth between a CPU socket and its attached RAM.

### 1.1 Why Distributed Transactions Are Considered Not Scalable

• The most important factor is the CPU overhead of the TCP/IP stack: the CPU spends most of its time processing network messages, leaving little room for the actual work.
• Additionally, the network bandwidth also significantly limits the transaction throughput.

### 1.2 Why We Need A System Redesign

• RDMA-enabled networks change the architecture to a hybrid shared-memory and message-passing architecture, and the transaction protocols need to be redesigned to avoid inherent bottlenecks.
• Existing (distributed) Snapshot Isolation schemes typically rely on a single global snapshot counter or timestamp, which obstructs scalability.

### 1.3 Contribution and Outline

• The authors present the full design of a truly scalable system called NAM-DB and propose scalable algorithms specifically for Snapshot Isolation (SI) with (mainly one-sided) RDMA operations.
• The authors present a novel RDMA-based and scalable global counter technique which allows compute servers to efficiently read the (latest) consistent snapshot in a distributed SI-based protocol.
• The authors show that NAM-DB is truly scalable using a full implementation of the TPC-C benchmark including additional variants where they vary factors such as degree of distribution as well as the contention rate.

## 2. System Overview

• InfiniBand offers two network communication stacks: (1) IP over InfiniBand (IPoIB), which implements a classic TCP/IP stack over InfiniBand, allowing existing database systems to run on fast networks without any modifications; (2) remote direct memory access (RDMA), which bypasses the OS and achieves low latencies.

### 2.1 The NAM Architecture

• The NAM architecture logically decouples compute and storage nodes and uses RDMA for communication between all nodes as shown in Figure 1.
• Memory Servers hold all data of a database system such as tables, indexes as well as all other state for transaction execution (e.g., logs and metadata).
• Compute Servers execute transactions over the data items stored in the memory servers.
• The separation of computation and storage distinguishes their design from traditional distributed database systems; the performance of the system is thus independent of the location of the data.

### 2.2 Design Principles

Separation of Compute and Memory

• Existing database systems that follow this design typically push data access operations into the storage layer; however, this causes two problems: (1) memory servers are likely to become a bottleneck; (2) with traditional socket-based network operations, every message consumes additional CPU cycles.
• The memory servers provide fine-grained byte-level data access, and compute servers exploit one-sided RDMA operations as much as possible to avoid any unnecessary CPU cycles for message handling.
• This architecture also allows increasing the bandwidth by scaling out the memory servers when the aggregated main memory bandwidth is the main bottleneck.

Data Location Independence

• Compute servers are able to access any data item independent of its storage location (i.e., on which memory server the item is stored), making it easy to implement work-stealing techniques.
• Data locality is just an optimization that can be added on top of the system.

Partitionable Data Structures

• A single memory region (e.g., a global read or commit timestamp) may become a bottleneck; thus it is important that every data structure is partitionable.

## 3. The Basic SI-Protocol

### 3.1 A Naive RDMA Implementation

• The rts defines a valid snapshot for the transaction, and the timestamp oracle is responsible for advancing the read timestamp by scanning the queue of completed transactions.
• Since advancing the read timestamp is not on the critical path, the oracle uses a single thread that continuously scans the memory region to find the highest commit timestamp, and also adjusts the offset if the servers run out of space.
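The naive oracle above can be sketched as a single in-memory class; the names (`Oracle`, `advance_rts`) are hypothetical, and the RDMA atomic fetch-and-add on the shared counter is replaced by a plain increment:

```python
# Minimal single-process sketch of the naive timestamp oracle (Section 3.1).
# Class and method names are illustrative, not from the paper.

class Oracle:
    def __init__(self, num_slots):
        self.rts = 0                          # global read timestamp
        self.next_cts = 1                     # shared commit-timestamp counter
        self.committed = [False] * num_slots  # queue of completed transactions

    def get_commit_ts(self):
        # In NAM-DB this is an RDMA atomic fetch-and-add on the counter.
        cts = self.next_cts
        self.next_cts += 1
        return cts

    def mark_committed(self, cts):
        self.committed[cts] = True

    def advance_rts(self):
        # The oracle thread continuously scans the queue and advances the
        # read timestamp over the gap-free prefix of committed transactions.
        while self.rts + 1 < len(self.committed) and self.committed[self.rts + 1]:
            self.rts += 1
        return self.rts
```

A single slow transaction leaves a gap in `committed` and holds `rts` back for everyone, which illustrates the abort-rate problem described in Section 3.2.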

### 3.2 Open Problems and Challenges

• The previously described protocol is not scalable because global timestamps have inherent scalability issues.
• For every transaction, each compute server uses an RDMA atomic fetch-and-add operation on the same memory region to get a unique timestamp.
• This protocol likely results in high abort rates because the read timestamp is managed by a single thread and may not be updated in time.
• Slow workers also contribute to high abort rates by holding back the most recent snapshot from being updated.
• The naive implementation does not have any support for fault-tolerance.

## 4. Timestamp Oracle

### 4.1 Scalable Timestamp Generation

• Timestamp vector: similar to a vector clock, it represents the read timestamp as $T_R = \langle t_1, t_2, \ldots, t_n \rangle$. Each component $t_i$ in $T_R$ is a unique counter that is assigned to one transaction execution thread $i$ in a compute server, where $i$ is a globally unique identifier.
• In contrast to vector clocks, they do not store the full vector with every record but only the timestamp $\langle i, t \rangle$ of the transaction execution thread that did the latest update, to mitigate the high storage overhead per record.
• Commit Timestamp: each thread already knows its latest commit timestamp and just needs to increase it by one to create the next commit timestamp. At the completion of the transaction, it issues an RDMA write to increase its latest timestamp in the vector $T_R$; no atomic operations are necessary since each transaction execution thread $i$ only executes one transaction at a time.
• Read Timestamp: each transaction execution thread $i$ reads the complete timestamp vector $T_R$ and uses it as its read timestamp $rts$. If the version $\langle i, t \rangle$ of a record satisfies $t \le t_i$ for the component $t_i$ of the vector $T_R$, the update is visible to the transaction.
• Long-running transactions, stragglers, or crashed machines do not prevent the read timestamp from advancing.
• If the timestamp is stored on a single memory server, it is guaranteed to increase monotonically.
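The per-thread counter scheme above can be sketched as follows; class and function names are hypothetical, and the RDMA reads and writes are replaced by plain in-memory operations:

```python
# Sketch of the timestamp-vector oracle from Section 4.1 (names are
# illustrative). In NAM-DB the vector lives on a memory server and is
# accessed via one-sided RDMA reads/writes.

class TimestampVector:
    def __init__(self, num_threads):
        # One counter slot per transaction execution thread.
        self.t = [0] * num_threads

    def read_timestamp(self):
        # A transaction reads the whole vector as its snapshot (rts).
        return list(self.t)

    def commit(self, thread_id):
        # Each thread increments only its own slot with a plain RDMA
        # write: no atomics, since a thread runs one transaction at a time.
        self.t[thread_id] += 1
        return (thread_id, self.t[thread_id])  # version tag <i, t>

def visible(version, rts):
    # A record version <i, t> is visible iff t <= rts[i].
    i, t = version
    return t <= rts[i]
```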

### 4.2 Further Optimizations

• Dedicated Fetch Thread: this seems likely to increase the abort rate, since pre-fetching increases the staleness of $T_R$; however, due to the reduced network load, the runtime of each transaction is heavily reduced, actually leading to a lower abort rate.
• Compression of $T_R$: we can compress $T_R$ by having only one slot $t_i$ per compute server, and the threads of a compute server use an atomic RDMA fetch-and-add operation to increase the counter value.
• Partitioning of $T_R$: storing $T_R$ on one memory server could at some point make the network bandwidth of this server the bottleneck. As a transaction execution thread only needs to update a single slot $t_i$, it is easy to partition $T_R$ across several memory nodes. However, partitioning $T_R$ no longer guarantees strict monotonicity; as a result, the order of transactions between transaction execution threads might differ.
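The compression optimization can be sketched as below: one slot per compute server instead of per thread, with co-located threads sharing the slot via an atomic fetch-and-add. The class name is hypothetical, and a lock stands in for the RDMA atomic on the memory server:

```python
import threading

# Sketch of the compressed timestamp vector from Section 4.2 (illustrative
# names). Threads of the same compute server share one slot and increment
# it atomically; here a lock simulates the RDMA fetch-and-add.

class CompressedVector:
    def __init__(self, num_servers):
        self.t = [0] * num_servers
        self.locks = [threading.Lock() for _ in range(num_servers)]

    def commit(self, server_id):
        # Atomic increment of the shared per-server slot: every thread on
        # that server receives a distinct commit timestamp.
        with self.locks[server_id]:
            self.t[server_id] += 1
            return (server_id, self.t[server_id])
```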

## 5. Memory Servers

### 5.1 Multi-Versioning Scheme

The scheme to store multiple versions of a database record in a memory server is shown in Figure 3. The most recent version of a record is stored in a dedicated memory region, thus the most recent version can always be read with a single RDMA request. The oldest versions in the old-version buffer are continuously copied to an *overflow region*, thus slots in the old-version buffer can be re-used for new versions while keeping old versions available for long-running transactions.

• Record Layout: see the paper
• Version Management: the header and data sections of the old-version buffer are separated to make it easy to search for a version of a record, because the header section is typically much smaller than the data section, so a transaction can fetch all headers cheaply; for the remaining details, see the paper.
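The version search can be sketched as follows. This is a simplified model (hypothetical names; no separate header/data fetches or overflow-region traversal): the reader first checks the current version, then scans the old versions newest-first for the latest version visible in its snapshot, where each version carries its $\langle i, t \rangle$ tag:

```python
# Sketch of version lookup over the multi-version layout (Section 5.1).
# `current` is (version_tag, payload); `old_versions` is a newest-first
# list of (version_tag, payload) pairs from the old-version buffer.

def lookup(current, old_versions, rts):
    def visible(tag):
        i, t = tag          # version tag <i, t>
        return t <= rts[i]  # visible iff within the snapshot vector

    if visible(current[0]):         # fast path: one RDMA read suffices
        return current[1]
    for tag, payload in old_versions:  # scan headers newest-first
        if visible(tag):
            return payload
    return None  # nothing visible: would fall back to the overflow region
```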

### 5.2 Table and Index Structures

• Table Structures: range-partitioned key/value hash tables.
• Index Structures: NAM-DB supports two types of secondary indexes: a hash-index for single-key lookups and a B+-tree for range lookups. Both types of secondary indexes map a value of the secondary attribute to a primary key that can then be used to look up the record using the table structure. For the B+-tree, they use two-sided RDMA operations to implement the communication between compute and memory servers.

### 5.3 Memory Management

• Allocate and Free Calls: these calls from compute servers to memory servers are implemented using two-sided RDMA operations. In order to avoid many small memory allocation calls, compute servers request memory regions from memory servers in extents.
• Garbage Collection: in NAM-DB, garbage collection is implemented with a timeout on the maximal transaction execution time, which can be defined as a parameter by the application. To garbage-collect outdated versions, a garbage collection thread runs on every memory server, continuously scanning the overflow regions and setting the deleted-bit of the selected versions of a record to 1.
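The timeout-based rule can be sketched as a single scan pass; the function and field names are hypothetical, and version age here is approximated by commit time rather than by tracking live transactions:

```python
# Sketch of the garbage-collection pass from Section 5.3 (illustrative
# names). Versions older than the maximal transaction execution time can
# no longer be needed by any live transaction, so their deleted-bit is
# set and their slots may be reclaimed.

def collect(overflow_region, now, max_txn_time):
    """overflow_region: list of dicts with 'commit_time' and 'deleted' keys.
    Returns the number of versions marked deleted after this pass."""
    for version in overflow_region:
        if now - version["commit_time"] > max_txn_time:
            version["deleted"] = True   # slot may now be reused
    return sum(v["deleted"] for v in overflow_region)
```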

## 6. Compute Servers

### 6.1 Transaction Execution

• The database catalog, which transactions use to find the storage location of tables and indexes, is hash-partitioned and stored on memory servers.
• Since the catalog rarely changes, the catalog data is cached by compute servers and refreshed in two cases (see paper for more details).

### 6.2 Failures and Recovery

• Memory Server Failures: each transaction execution thread of a compute server writes a private log journal to more than one memory server using RDMA writes.
• Compute Server Failures: compute servers are stateless and thus do not need any special handling for recovery. To prevent abandoned locks, each compute server is monitored by another compute server called monitoring compute server.

## 7. Evaluation

• Benchmark: TPC-C

### 7.4 Exp.4: Effect of Contention

• FaRM
Written on May 8, 2017