Let's Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems

Reference: Arulraj, Joy, Andrew Pavlo, and Subramanya R. Dulloor. "Let's talk about storage & recovery methods for non-volatile memory database systems." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.

This is one of the papers in the suggested readings on Non-Volatile Memory Database Systems for CMU 15-721: Database Systems.

0. Abstract

The advent of non-volatile memory (NVM) will fundamentally change the dichotomy between memory and durable storage in database management systems (DBMSs). These new NVM devices are almost as fast as DRAM, but all writes to it are potentially persistent even after power loss. Existing DBMSs are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many of the components of legacy DBMSs are unnecessary and will degrade the performance of data intensive applications.

To better understand these issues, we implemented three engines in a modular DBMS testbed that are based on different storage management architectures: (1) in-place updates, (2) copy-on-write updates, and (3) log-structured updates. We then present NVM-aware variants of these architectures that leverage the persistence and byte-addressability properties of NVM in their storage and recovery methods. Our experimental evaluation on an NVM hardware emulator shows that these engines achieve up to 5.5x higher throughput than their traditional counterparts while reducing the amount of wear due to write operations by up to 2x. We also demonstrate that our NVM-aware recovery protocols allow these engines to recover almost instantaneously after the DBMS restarts.

1. Introduction

  • Non-volatile memory (NVM) offers an intriguing blend of SSD/HDD and DRAM: it removes the need to maintain an in-memory cache of blocks (as disk-based DBMSs do) and it overcomes the volatility of DRAM (which in-memory DBMSs must work around).

2. Background

2.1 Motivation

  • Although the advantages of NVM are obvious, making full use of them in an OLTP DBMS is non-trivial. Because current DBMSs assume that memory is volatile, they spend resources on durability mechanisms such as logging that become largely redundant when memory itself is persistent.

2.2 NVM Hardware Emulator

  • NVM storage devices are currently expensive and only support small capacities. For this reason, we use an NVM hardware emulator developed by Intel Labs.
  • NVM technologies have higher read and write latency than DRAM. We are able to emulate the latency for the NVM partition by using custom CPU microcode. The microcode estimates the additional cycles that the CPU would have to wait if DRAM is replaced by slower NVM and then stalls the CPU for those cycles.

NVM Storage Interfaces

  • Allocator Interface: the emulator uses an NVM-aware memory allocator that provides the POSIX malloc interface.
  • Filesystem Interface: the emulator exposes a NVM-backed volume to the OS through a special filesystem that is optimized for non-volatile memory.
  • The filesystem interface supports a naming mechanism that ensures that file offsets remain valid after the system restarts. Its downside is that every write must go through the kernel’s virtual filesystem (VFS) layer, whereas the allocator interface lets the application read and write NVM directly from userspace without involving the kernel (the two paths are contrasted in the sketch below).
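A minimal sketch contrasting the two interfaces. This is an illustration, not the emulator's actual API: nvm_malloc/nvm_sync and the /mnt/pmfs mount point are assumed names, and nvm_malloc is stubbed with malloc purely so the example compiles.

```cpp
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Allocator interface: the application loads/stores NVM directly from
// userspace. nvm_malloc is stubbed with malloc for illustration; nvm_sync
// stands in for a cache-flush + fence primitive (see Section 2.3).
void *nvm_malloc(std::size_t size) { return std::malloc(size); }
void nvm_sync(const void *, std::size_t) { /* flush CPU caches + fence */ }

void write_via_allocator(const char *tuple, std::size_t len) {
    void *slot = nvm_malloc(len);   // directly addressable memory, no syscall
    std::memcpy(slot, tuple, len);  // ordinary store instructions
    nvm_sync(slot, len);            // make the stores durable
}

// Filesystem interface: every write crosses the kernel's VFS layer.
void write_via_filesystem(const char *tuple, std::size_t len) {
    int fd = ::open("/mnt/pmfs/table.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    ssize_t n = ::write(fd, tuple, len);  // system call; data copied via kernel
    (void)n;
    ::fsync(fd);                          // force durability
    ::close(fd);
}
```

The key point is that the allocator path issues ordinary loads and stores, while the filesystem path pays for a system call and a copy through the kernel on every write.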

2.3 NVM-aware Memory Allocator

  • sync: a durability mechanism to ensure that modifications to data stored on NVM are persisted.
  • non-volatile pointers: a naming mechanism for allocations so that pointers to locations in memory remain valid even after the system restarts (both mechanisms are sketched below).

  • The NVM allocator that we use in our evaluation is based on the open-source NVM-related libpmem library. We extended this allocator to follow a rotating best-fit allocation policy and to support multi-threaded usage.
  • The NVM-aware allocator delivers 10-12x higher write bandwidth than the filesystem.
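A minimal sketch of these two mechanisms, assuming an x86 CPU. The CLFLUSH-plus-fence sequence is one common way to implement sync; the libpmem-based allocator used in the paper may use a different instruction sequence, and the relative-offset NVPtr type is only an illustrative stand-in for its naming mechanism.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // _mm_clflush, _mm_sfence (x86)

// sync: write back every cache line covering [addr, addr + len) and fence,
// so that the stores reach the (emulated) NVM and survive power loss.
void nvm_sync(const void *addr, std::size_t len) {
    const std::size_t kCacheLine = 64;
    auto start = reinterpret_cast<std::uintptr_t>(addr) & ~(kCacheLine - 1);
    auto end   = reinterpret_cast<std::uintptr_t>(addr) + len;
    for (std::uintptr_t p = start; p < end; p += kCacheLine)
        _mm_clflush(reinterpret_cast<const void *>(p));
    _mm_sfence();
}

// Non-volatile pointer: instead of a raw virtual address (which may differ
// after a restart), persist an offset from the base of the mapped NVM region
// and resolve it against the current mapping.
struct NVPtr {
    std::uint64_t offset;  // stored on NVM
    void *resolve(char *nvm_base) const { return nvm_base + offset; }
};
```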

3. DBMS TESTBED

  • The DBMS uses pthreads to allow multiple transactions to run concurrently in separate worker threads. It executes as a single process with the number of worker threads equal to the number of cores, where each thread is mapped to a different core.
  • To avoid interference from the DBMS’s concurrency control scheme, we partition the database and use a lightweight locking scheme in which transactions execute serially at each partition based on timestamp ordering (as in H-Store). A rough sketch of this threading model follows.
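The sketch below assumes Linux: one worker per core, pinned with pthread_setaffinity_np, draining its partition's transaction queue serially. The Partition and Txn types are simplified placeholders, not the testbed's actual classes.

```cpp
#include <pthread.h>   // pthread_setaffinity_np, cpu_set_t (Linux)
#include <functional>
#include <queue>
#include <thread>
#include <vector>

using Txn = std::function<void()>;

struct Partition {
    std::queue<Txn> txn_queue;  // transactions routed to this partition
};

// Serial execution within a partition: no latches are needed on its data.
void worker_loop(Partition &part) {
    while (!part.txn_queue.empty()) {
        part.txn_queue.front()();
        part.txn_queue.pop();
    }
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    std::vector<Partition> partitions(n);
    std::vector<std::thread> workers;
    for (unsigned core = 0; core < n; core++) {
        workers.emplace_back(worker_loop, std::ref(partitions[core]));
        // Pin the worker thread to its own core (Linux-specific).
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(core, &cpus);
        pthread_setaffinity_np(workers.back().native_handle(),
                               sizeof(cpus), &cpus);
    }
    for (auto &w : workers) w.join();
}
```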

3.1 In-Place Updates Engine (InP)

  • This is the most efficient method of applying changes, since the engine does not copy the tuple before updating it, and only the updated fields are modified.
  • The design of this engine is based on VoltDB. The InP engine uses the STX B+tree library for all of its indexes.
  • Storage: tuples are stored in fixed-size slots, with non-inlined (variable-length) fields kept in separate pools (see the paper’s figure).
  • Recovery: write-ahead logging (WAL). The most well-known recovery protocol for in-place updates is ARIES. (A minimal sketch of the WAL ordering follows.)
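The sketch below shows the WAL ordering that in-place updates rely on: the log record (with before/after images) is forced to durable storage before the field is overwritten. The record layout and helper names are placeholders, not the engine's actual format.

```cpp
#include <cstdint>
#include <cstdio>
#include <unistd.h>  // fsync

// Log record carrying both the undo (before) and redo (after) images.
struct LogRecord {
    std::uint64_t txn_id;
    std::uint64_t tuple_id;
    std::uint32_t field_id;
    std::int64_t  before;
    std::int64_t  after;
};

// 1. Append and force the log record; 2. only then overwrite the field.
void update_in_place(std::FILE *wal, std::uint64_t txn, std::uint64_t tuple,
                     std::uint32_t field, std::int64_t *slot, std::int64_t value) {
    LogRecord rec{txn, tuple, field, *slot, value};
    std::fwrite(&rec, sizeof(rec), 1, wal);
    std::fflush(wal);
    fsync(fileno(wal));   // the record must be durable before the update
    *slot = value;        // in-place update of just the modified field
}
```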

3.2 Copy-on-Write Updates Engine (CoW)

  • With this approach, known as shadow paging in IBM’s System R, the DBMS maintains two look-up directories at all times: (1) the current directory, and (2) the dirty directory. To isolate transactions from the effects of uncommitted transactions, the engine maintains a master record that always points to the current directory.
  • Storage: this engine has high write amplification (i.e., the amount of data written to storage is much higher compared to the amount of data written by the application).
  • Recovery: if the DBMS crashes before the master record is updated, then the changes present in the dirty directory are simply not visible after restart, so the engine does not need an explicit recovery process. (A toy shadow-paging sketch follows.)
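A toy shadow-paging sketch under the description above: updates copy pages into a dirty directory, and commit atomically repoints the master record. The page and directory layouts are simplified placeholders, and garbage collection of old directories is omitted.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <memory>

constexpr int kPages = 4;
struct Page { std::array<char, 4096> data; };

struct Directory {
    std::array<std::shared_ptr<Page>, kPages> pages;
    Directory() { for (auto &p : pages) p = std::make_shared<Page>(); }
};

// The master record always points to the (committed) current directory.
std::atomic<Directory *> master_record{new Directory()};

// Begin: the dirty directory starts as a copy of the current one; pages are
// shared until they are modified.
Directory *begin_txn() { return new Directory(*master_record.load()); }

// Update: copy the affected page and install the copy in the dirty directory,
// so the change is visible only through that directory (write amplification).
void update_page(Directory *dirty, int pid, const char *bytes, std::size_t len) {
    auto copy = std::make_shared<Page>(*dirty->pages[pid]);
    std::copy(bytes, bytes + len, copy->data.begin());
    dirty->pages[pid] = copy;
}

// Commit: atomically repoint the master record; readers observe either the
// old or the new directory, never a partial update. (The old directory is
// leaked here for brevity; a real engine reclaims it.)
void commit(Directory *dirty) { master_record.store(dirty); }
```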

3.3 Log-structured Updates Engine (Log)

  • The design for our Log engine is based on Google’s LevelDB. The log-structured update approach performs well for write-intensive workloads as it reduces random writes to durable storage.
  • The downside of the Log engine is that it incurs high read amplification (i.e., the number of reads required to fetch a tuple is much higher than the amount of data the application actually needs). To reduce this read amplification, the Log engine periodically runs a compaction process that merges a subset of SSTables.
  • Storage: a LevelDB-style LSM tree with an in-memory MemTable and immutable SSTables on durable storage (see the paper’s figure).
  • Recovery: WAL; after a restart, the engine replays the log to rebuild the MemTable. (A toy MemTable/SSTable sketch follows.)
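A toy log-structured store illustrating the MemTable/SSTable split and the resulting read amplification (WAL writes and compaction are omitted). Class and constant names are assumptions, not LevelDB's or the Log engine's API.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

class LogStructuredStore {
  public:
    void put(const std::string &key, const std::string &value) {
        memtable_[key] = value;                 // buffered, sorted by key
        if (memtable_.size() >= kMemTableLimit) {
            sstables_.push_back(memtable_);     // "flush": freeze as an SSTable
            memtable_.clear();
        }
    }
    std::optional<std::string> get(const std::string &key) const {
        if (auto it = memtable_.find(key); it != memtable_.end()) return it->second;
        // Read amplification: may have to probe every SSTable, newest first.
        for (auto t = sstables_.rbegin(); t != sstables_.rend(); ++t)
            if (auto it = t->find(key); it != t->end()) return it->second;
        return std::nullopt;
    }
  private:
    static constexpr std::size_t kMemTableLimit = 1024;
    std::map<std::string, std::string> memtable_;
    std::vector<std::map<std::string, std::string>> sstables_;
};
```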

4. NVM-Aware Engines

4.1 In-Place Updates Engine (NVM-InP)

  • Storage: To reclaim the storage space of tuples and non-inlined fields inserted by uncommitted transactions after the system restarts, the NVM-InP engine maintains durability state in each slot’s header. We modified the STX B+tree library so that all operations that alter the index’s internal structure are atomic.
  • Recovery: the effects of committed transactions are durable, but the changes of uncommitted transactions may be present in the database because the memory controller can evict cache lines containing those changes to NVM at any time. The NVM-InP engine therefore needs to undo those transactions using the WAL. (The slot-header idea is sketched below.)
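A sketch of these two ideas: a small persistent header per NVM-resident slot records whether its contents are committed, and the WAL entry holds a non-volatile pointer (offset) to the change instead of a copy of the data. The header layout, state encoding, and nvm_sync stub are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the flush + fence primitive sketched under Section 2.3.
inline void nvm_sync(const void *, std::size_t) { /* CLFLUSH + SFENCE */ }

struct SlotHeader {
    std::uint64_t txn_id;  // transaction that last modified this slot
    std::uint8_t  state;   // 0 = unpersisted, 1 = persisted, 2 = committed
};

struct Slot {
    SlotHeader header;
    char       data[48];   // inlined tuple fields (placeholder layout)
};

// WAL entry: a non-volatile pointer (offset into the NVM region) to the
// change, rather than a copy of the modified data.
struct WalEntry {
    std::uint64_t txn_id;
    std::uint64_t slot_offset;
};

// Commit: mark the slot committed and persist just the header.
void commit_slot(Slot *slot, std::uint64_t txn) {
    slot->header.txn_id = txn;
    slot->header.state  = 2;
    nvm_sync(&slot->header, sizeof(SlotHeader));
}

// After restart: slots whose headers are not in the committed state belong to
// uncommitted transactions and are undone/reclaimed by walking the WAL.
bool needs_undo(const Slot *slot) { return slot->header.state != 2; }
```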

4.2 Copy-on-Write Updates Engine (NVM-CoW)

  • Recovery: none needed. The engine persists tuple copies and the dirty directory directly on NVM via the allocator interface, and commit is an atomic update of the master record, so committed transactions are immediately durable and uncommitted changes are simply discarded after a restart. (A sketch of the commit step follows.)
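A sketch of why the commit step makes recovery unnecessary, assuming the sync primitive and non-volatile (offset-based) pointers from Section 2.3. Names and layout are illustrative, not the engine's actual code.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Stand-in for the flush + fence primitive sketched under Section 2.3.
inline void nvm_sync(const void *, std::size_t) { /* CLFLUSH + SFENCE */ }

// The master record lives on NVM and holds the offset (a non-volatile
// pointer) of the current copy-on-write directory. It is initialized when
// the NVM region is mapped; that step is omitted here.
std::atomic<std::uint64_t> *master_record;

void commit(std::uint64_t dirty_dir_offset) {
    // The dirty directory and every tuple copy it references were already
    // persisted on NVM via the allocator interface before this point.
    // Commit is therefore a single atomic 8-byte store plus a flush:
    master_record->store(dirty_dir_offset, std::memory_order_release);
    nvm_sync(master_record, sizeof(*master_record));
    // A crash before the flush leaves the old directory current; after it,
    // the new one. Either state is consistent, so no recovery pass is needed.
}
```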

4.3 Log-structured Updates Engine (NVM-Log)

  • Our NVM-Log engine avoids data duplication in the MemTable and the WAL, as it only records non-volatile pointers to tuple modifications in the WAL (sketched at the end of this subsection).
  • Storage: the MemTable and the SSTables are both kept on NVM via the allocator interface (see the paper’s figure).
  • Recovery: the NVM-Log engine only needs to undo the effects of uncommitted transactions on the MemTable. Its recovery latency is therefore lower than the Log engine as it no longer needs to rebuild the MemTable.
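A sketch of the no-duplication idea: the tuple modification lives once on NVM, and both the WAL entry and the MemTable entry hold its offset, so recovery only has to drop uncommitted MemTable entries. The layouts below are assumptions for illustration; the real engine keeps the MemTable itself on NVM, whereas an ordinary std::map stands in here.

```cpp
#include <cstdint>
#include <map>
#include <string>

// WAL entry: records a non-volatile pointer (offset) to the NVM-resident
// tuple modification instead of a second copy of the data.
struct WalEntry {
    std::uint64_t txn_id;
    std::string   key;
    std::uint64_t tuple_offset;
};

// MemTable: a sorted index from keys to the same offsets the WAL references,
// so each modification is stored only once on NVM.
using MemTable = std::map<std::string, std::uint64_t>;

void apply(MemTable &memtable, const WalEntry &entry) {
    memtable[entry.key] = entry.tuple_offset;  // both point at one NVM tuple
}

// Recovery: committed modifications are already durable, so only the effects
// of uncommitted transactions must be undone from the MemTable.
void undo_if_uncommitted(MemTable &memtable, const WalEntry &entry, bool committed) {
    if (!committed) memtable.erase(entry.key);
}
```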

5. Experimental Analysis

  • We also use the perf toolkit to measure additional, lower-level hardware metrics of the system for each experiment.

5.1 Benchmarks

  • YCSB: a key-value store workload; the evaluation uses several workload mixtures (read-only, read-heavy, balanced, write-heavy) and varying amounts of skew.
  • TPC-C: the standard OLTP benchmark, modeling an order-processing application.

5.2 Runtime Performance

YCSB

  • Read-only workload: NVM-InP engine is not faster than the InP engine for both skew settings; NVM-CoW is 1.9-2.1x faster than the CoW engine; NVM-Log engine is 2.8x faster. The performance gap between the two types of engines decreases in the read-only workload when we increase the NVM latency.
  • Read-heavy workload: NVM-InP engine is 1.3x faster than the InP engine due to lower logging overhead.
  • The benefits of our optimizations are more prominent for the balanced and write-heavy workloads.

TPC-C

  • The NVM-aware engines are 1.8-2.1x faster than the traditional engines.

5.3 Reads & Writes

YCSB

TPC-C

  • NVM-aware engines perform 31-42% fewer writes compared to the traditional engines.

5.4 Recovery

YCSB

  • The NVM-InP and NVM-Log engines’ recovery time is independent of the number of transactions executed.

TPC-C

  • The recovery latency of the NVM-InP and NVM-Log engines is slightly higher than that in the YCSB benchmark because the TPC-C transactions perform more operations.

5.5 Execution Time Breakdown

  • On the write-heavy workload, the NVM-aware engines only spend 13-18% of their time on recovery-related tasks compared to the traditional engines that spend as much as 33% of their time on them.

5.6 Storage Footprint

  • CoW engine has the largest storage footprint.
  • The NVM-aware engines have smaller storage footprints compared to the traditional engines.

5.7 Discussion

  • The NVM access latency has the most impact on the runtime performance of the engines, more so than the amount of skew or the number of modifications to the database in the workload.
  • The NVM-aware engines also perform fewer store operations, which will help extend NVM device lifetimes.
  • The NVM-InP engine performs the best across a wide set of workload mixtures and skew settings for all NVM latency configurations.
Written on December 5, 2017