# The Bw-Tree: A B-tree for New Hardware Platforms

Reference: Levandoski, Justin J., David B. Lomet, and Sudipta Sengupta. "The Bw-Tree: A B-tree for new hardware platforms." Data Engineering (ICDE), 2013 IEEE 29th International Conference on. IEEE, 2013.

This is one of the several papers belong to suggested readings for Latch-free OLTP Indexes (Part II) of CMU 15-721: Database Systems.

## 0. Abstract

The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.

## 1. Introduction

• Atomic Record Stores: Bw-tree and its associated storage manager particularly appropriate for the new hardware environment that has emerged over the last several years.
• Design for Multi-core: (1) multi-core allows for high concurrency, but latches limits scalability; (2) updating memory in place results in cache invalidations, which may harm multi-core performance because it depends on high CPU cache hit ratios. To address the first issue, the Bw-tree is latch-free, ensuring a thread never yields or even re-directs its activity in the face of conflicts; to address the second issue, the Bw-tree performs “delta” updates that avoid updating a page in place, hence preserving previously cached lines of pages.
• Design for More Storage Devices: the Bw-tree targets flash storage, whose random write is much slower than sequential write.

## 2. Bw-tree Architecture

• Mapping Table: it maps logical pages to physical pages (in stable storage or memory), allowing for using PIDs rather than physical pointers to reference the pages.
• Delta Updating: page state changes are done by creating a delta record and we install the new memory address of the delta record into the page’s physical address slot in the mapping table using the atomic compare and swap (CAS) instruction. Occasionally, we consolidate pages (create a new page that applies all delta changes) to both reduce memory footprint and to improve search performance. The delta updating technique simultaneously enables latch-free access in the Bw-tree and preserves processor data caches by avoiding update-in-place.
• We use a form of epoch to accomplish safe garbage collection.
• Bw-tree Structure Modification: we break an SMO (because it can not be achieved with a single CAS instruction, e.g., splitting a node) into a sequence of atomic actions, each installed via a CAS. We use a $B^{link}$ design to make this easier. In order to make sure that no thread has to wait for an SMO to complete, a thread that sees a partial SMO will complete it before proceeding with its own operation.
• Log Structured Store: our LSS has the usual advantages of log structuring, pages are written sequentially in a large batch, greatly reducing the number of separate write I/Os required. However, because of garbage collection, log structuring normally incurs extra writes to relocate pages that persist in reclaimed storage areas of the log. To reduce this problem, our LSS only flushes the deltas since its previous flush and utilizes the high random read performance of flash to amortize the penalty of reading. During cleaning, LSS makes pages and their deltas contiguous for improved access performance.
• Managing Transactional Logs: our ASS utilizes log sequence number (LSN) to support recovery. We flushes pages without “recent” deltas that are not on the stable transactional log rather than block page flushes.

## 3. In-Memory Latch Free Pages

Elastic Virtual Pages

• The information stored on a Bw-tree page is similar to that of a typical B+-tree. In addition, pages also contain a low key representing the smallest key can be stored on the page, a high key, and a side link pointer that points to the node’s immediate right sibling on the same level, as in a $B^{link}$ tree.
• Pages are logical, which means PIDs are always used to link nodes; pages are elastic, meaning there is no hard limit on how large a page may grow.
• Updates: using “delta” records as mentioned in Section 2
• Leaf-level update operations: discussed in Section 4
• Page search: search delta first then the base page, more details in Section 4

Page Consolidation

• We trigger consolidation if an accessor thread, during a page search, notices a delta chain length has exceeded a system threshold.
• The thread installs the new address of the consolidated page in the mapping table with a CAS, if it succeeds, the thread requests garbage collection (memory reclamation) of the old page state; otherwise, the thread abandons the operation by deallocating the new page. The thread does not retry, as a subsequent thread will eventually perform a successful consolidation.

Range Scans

• We construct a vector of records containing all the records on the page to be processed as the part of the scan, which allows providing the “next-record” functionality.
• We treat each “next-record” operation as an atomic unit. The entire scan is not atomic. Before delivering a record from our vector, we check whether an update has affected the yet unreturned subrange in our record vector. If such an update has occurred, we reconstruct the record vector accordingly.

Garbage Collection

• Epoch-based mechanism

** 4. Bw-tree Structure Modification

Node Split

• The Bw-tree employs the $B^{link}$ atomic split installation technique that works in two phases. The first step it install the split at the child, then update the parent node, which can be split into two atomic actions since the side link provides a valid search tree after installing the split at the child level.
• Child Split: We atomically install the split by prepending a split delta record to P.
• Parent Update: We prepend an index term delta record to the parent of P and Q. If the parent has been deleted (but can not be reclaimed because of the epoch-based mechanism), we go up the tree to the grandparent node, etc., and do a re-traversal down the tree to find the parent that is “still alive”.
• Consolidation: for pages with split deltas, consolidations involves creating a new base page containing records with keys less than the separator key in the split delta; for pages with index entry deltas, we create a new consolidated base page containing the new separator keys and pointers.

Node Merge

• Marking for Delete: the node R to be merged is updated with a remove node delta, a thread encountering a remove node delta in R needs to read or update the contents of R previously contained in R by going to the left sibling, which data from R will be merged.
• Mering Children: the left node of R is updated with a node merge delta that physically points to the contents of R.
• Parent Update: the parent node P of R is updated by an index term delete delta that includes not only that R is being deleted, but also that L will now include data from the key space formerly contained by R. Once the index term delete delta is posted, all paths to R are blocked.

Serializing Structure Modifications and Updates

• To prevent thread seeing uncommitted state, we require such a thread must complete and commit the SMO it encounters before it can either (1) post its update or (2) continue with its own SMO.

## 5. Cache Management

• The cache layer is responsible for reading, flushing, and swapping pages between memory and flash, and it maintains the mapping table and provides the abstraction of logical pages to the Bw-tree layer.
• To keep track of which version of the page is on stable storage (the LSS) and where it is, we use a flush data record, which is installed at the mapping table entry for the page using a CAS. Flush data records also record which changes to a page have been flushed so that the subsequent flushes send only incremental page changes to stable storage.

Write Ahead Log Protocol and LSNs

• Bw-tree employs Deuteronomy architecture.
• Record insert and update deltas in a the Bw-tree page are tagged with the Log Sequence Number (LSN) of their operations.
• Transaction Log Coordination: Whenever the TC appends (flushes) to its stable transactional log, it updates the End of Stable Log (ESL) LSN value. It is necessary for enforcing causality via the write-ahead log protocol (WAL) that the DC not make durable any operation with an LSN greater than the latest ESL. To enforce this rule, records on a page that have LSNs larger than the ESL are not included in a page when flushed to the LSS.
• Bw-tree Structure Modifications: this part is kind of obscure to me, so I skip it

Flushing Pages to the LSS

• Page Marshalling: the cache manager marshalls the bytes from the pointer representation of the page in main memory into a linear representation that can be written to the flush buffer. The page state is captured at the time it is intended to be flushed. When marshalling records on a page for flush, multiple delta records are consolidated so that they appear contiguously in the LSS.
• Incremental Flushing: when flushing a page, the cache man- ager only marshals those delta records which have an LSN between the previously flushed largest LSN on that page and the current ESL value. The previously flushed largest LSN information is contained in the latest flush delta record on the page.
• Flush Activity: the flush buffer aggregates writes to LSS up to a configurable threshold (currently set at 1MB) to reduce I/O overhead. It uses ping-pong (double) buffers and alternates between them with asynchronous I/O calls to the LSS so that the buffer for the next batch of page flushes can be prepared while the current one is in progress. The cache manager monitors the memory used by the Bw-tree, and when it exceeds a configurable threshold, it attempts to swap out pages to the LSS.

## 6. Performance Evaluation

• Comparison with BerkeleyDB and latch-free skip list
• Workloads: XboxLIVE, Storage deduplication trace, Synthetic
• Delta chain lengths can grow longer (to about eight deltas) without performance consequences.
• The failure rates for the split and consolidate operations are larger than the update failures, because splits and consolidates must compete with the faster record update operations.
• Two main aspects lead to the superior performance of the Bw-tree than B-tree: (1) latch-free freedom, (2) CPU cache efficiency.
• These results suggest the Bw-tree has a clear advantage in search performance compared with skip list, which is mostly attributed to its better CPU cache efficiency.
• Bw-tree is more cache-friendly, since most of its time is spent performing binary search on a base page of keys compacted into a contiguous memory block. The memory usage for search pathways only takes little of the occupied memory, while the skip list scans about 50% of the nodes.
• B-trees
• Latch Free: until our work, it was not clear that B-trees could be made lock or latch-free.
• Flash Based, Log Structured
• Combination Efforts

## 8. Discussion

• We were acutely aware that we were implementing a component that had been successfully implemented any number of times. In such a case, when all is said and done, it is system performance that determines whether the effort bears.
Written on August 20, 2017