Fast Databases with Fast Durability and Recovery Through Multicore ParallelismReference: Zheng, Wenting, et al. "Fast Databases with Fast Durability and Recovery Through Multicore Parallelism." OSDI. 2014.
This is one of the several papers belong to suggested readings for Logging Protocols of CMU 15-721: Database Systems.
Multi-core in-memory databases for modern machines can support extraordinarily high transaction rates for on-line transaction processing workloads. A potential weakness, however, is recovery from crash failures. Can classical techniques, such as checkpoints, be made both efficient enough to keep up with current systems’ memory sizes and transaction rates, and smart enough to avoid additional contention? Starting from an efficient multi-core database system, we show that naive logging and checkpoints make normal-case execution slower, but that frequent disk synchronization allows us to keep up with many workloads with only a modest reduction in throughput. We design throughout for parallelism: during logging, during check-pointing, and during recovery. The result is fast. Given appropriate hardware (three SSDs and a RAID), a 32-core system can recover a 43.2 GB key-value database in 106 seconds, and a >70 GB TPC-C database in 211 seconds.
- Crash resistance mechanisms, such as logging and checkpointing, can enormously slow transaction execution if implemented naively.
- Our goal in this work was to develop an in-memory database with full persistence at relatively low cost to transaction throughput, and with fast recovery.
- Starting from Silo, a very fast in-memory database system, we built SiloR, which adds logging, checkpointing, and recovery
2. Silo Overview
- We build on Silo, a fast in-memory relational database that provides tables of typed records.
- Silo tables are stored in efficient, cache-friendly con- current B-trees. All structures are stored in shared memory, so any worker can access the entire database.
- Silo uses a variant of optimistic concurrency control (OCC) to serialize transactions. Classical OCC obtains the TID for a committing transaction by effectively incrementing a global counter. For Silo, a global epoch number E is visible to all threads. A designated thread advances it periodically (every 40 ms). Worker threads use E during the commit procedure to compute the new TID.
- Consider concurrent transactions T1 and T2 where T1 reads a key that T2 then overwrites. The relationship between T1 and T2 is called an anti-dependency: T1 must be ordered before T2 because T1 depends on the absence of T2.
- Because workers read the current epoch at the serialization point, the ordering of TIDs with different epochs is always compatible with the serial order, even in the case of anti-dependencies.
3.1 Basic Logging
- Workers generate log records as they commit transactions; they pass these records to loggers, which commit the logs to disk. When a set of logs is committed to disk via fsync, the loggers inform the workers. This allows workers to send transaction results to clients.
- A log record comprises a committed transaction’s TID plus the table, key, and value information for all records modified by that transaction. Each worker constructs log records in disk format and stores them in a memory buffer taken from a per-worker buffer pool. When a buffer fills, or at an epoch boundary, the worker passes the buffer to the logger over a shared-memory queue.
3.2 Value logging vs. operation logging
- SiloR uses value logging, in which SiloR logs contain each transaction’s output keys and values, because: (1) it is more parallelizable than operation logging; (2) operation logging also requires that the initial pre-replay database state be a transactionally consistent snapshot, which value logging does not; (3) for small transactions value and operation logs are about the same size.
- We solve the problem of value logging I/O by adding hardware until logging is not a bottleneck, and then using that hardware wisely.
3.3 Workers and loggers
- Our final design divides workers into disjoint subsets, and assigns each subset to exactly one logger. Core pinning is used to ensure that a logger and its workers run on the same socket, making it likely that log buffers allocated on a socket are only accessed by that socket.
3.4 Buffer management
- This backpressure is implemented by buffer management. Loggers allocate a maximum number of log buffers per worker core. Buffers circulate between loggers and workers as transactions execute, and a worker blocks when it needs a new log buffer and one is not available.
- A worker flushes a buffer to its logger when either the buffer is full or a new epoch begins, whichever comes first. It is important to flush buffers on epoch changes, whether or not those buffers are full, because SiloR cannot mark an epoch as persistent until it has durably logged all transactions that happened in that epoch.
- We found that log-buffer backpressure in Silo triggered unnecessarily often because it was linked with fsync times. SiloR’s loggers instead recirculate log buffers back to workers as soon as possible - after a write, rather than after the following epoch change and fsync. We also increased the number of log buffers available to workers, setting this to about 10% of the machine’s memory.
3.5 File management
- SiloR uses a distinguished logger thread to maintain another file, pepoch, that contains the current persistent epoch. The logger system guarantees that all transactions in epochs <= pepoch are durably stored in some log.
- Disadvantage: the critical path for transaction commit contains two fsyncs (one for the log file and one for pepoch) rather than one, which somewhat increases latency.
- The smaller the distance between checkpoints, the less log data needs to be replayed, and we found the size of the log to be the major recovery expense.
- Different checkpointers are responsible for different slices of the database; a distinguished checkpoint manager assigns slices to checkpointers.
- Since OCC installs modifications at commit time, all records seen by checkpointers are committed. This means that full ARIES-style undo and redo logging is unnecessary; the log can continue to contain only “redo” records for committed transactions.
- SiloR checkpoints are thus inconsistent or “fuzzy”: the checkpoint is not necessarily a consistent snapshot of the database as of a particular point in the serial order.
- We chose to produce an inconsistent checkpoint because it’s less costly in terms of memory usage than a consistent checkpoint.
- In an important optimization, checkpointer threads skip any records with current epoch .
4.2 Writing the checkpoint
- For each table, each checkpointer divides its assigned key range into m files, where m is the number of cores that would be used during recovery for that key range. Each block contains a contiguous range of records, but blocks are assigned to files in round-robin order.
- The checkpoint manager thread starts a new check- point every C seconds.
- Each time a checkpointer’s outstanding writes exceed 32 MB, it syncs them to disk.
- new tuples created during eh are not stored in the checkpoint, tuples removed or over- written during eh are also not stored in the checkpoint, so the checkpoint can’t be recovered correctly without complete logs up to and including eh. Thus, the manager waits until , where is the persistence epoch computed by the loggers. Once this point is reached, the manager installs the checkpoint on disk by writing a final record to a special checkpoint file.
- After the checkpoint is complete, SiloR removes old files that are no longer needed. This includes any previous checkpoints and any log files that contain only transactions with epochs .
5.1 Checkpoint recovery
- Each table is recorded on all n disks, partitioned so that on each disk there are m files for each table. Recovery is carried out by n x m threads.
- Since the files contain different key ranges, checkpoint recovery threads are able to reconstruct the tree in parallel with little interference; additionally they benefit from locality when processing a subrange of keys in a particular table.
5.2 Log recovery
- First the manager thread reads the pepoch file to obtain , the number of the most recent persistent epoch. All log records for transactions with TIDs for later epochs are ignored during recovery.
- The processor thread reads the entries in the file sequentially. If t contains an epoch number that is > e_p$$, the thread skips the entry. Otherwise, the thread inserts a record into the table if its key isn’t there yet; when a version of the record is already in the table, the thread overwrites only if the log record has a larger TID.
- We use reverse order for reading log files because it uses the CPU more efficiently than forward order when keys are written multiple times.
- The state of the database after processing the checkpoint is definitely not correct: it is inconsistent, and it is also missing modifications of persistent transactions that ran after it finished. All these problems are corrected by processing the log.
6.1 Experimental setup
6.2 Key-value workload
- We run SiloR on a variant of YCSB workload mix A.
- SiloR is able to run multiple checkpoints and almost match LogSilo’s throughput. Its throughput is also close to that of MemSilo.
- Frequent synchronization produces far more stable results; it also can produce a checkpoint more quickly than can the version with occasional sleeps.
- We also experimented with compressing the database checkpoints via lz4 before writing to disk. This didn’t help either latency or throughput, and it actually slowed down the time it took to checkpoint.
6.3 On-line transaction processing workload
- YCSB-A, though challenging, is a well-behaved workload: all records are in one table, there are no secondary indexes, accesses are uniform, all writes are overwrites (no inserts or deletes), all transactions are small.
- We evaluate SiloR on a more complex workload, the popular TPC-C benchmark for online transaction processing.
- Unmodified TPC-C is not a great fit for an in-memory database: very few records are removed, so the database grows without bound.
- TPC-C transactions are challenging enough for Silo’s in-memory structures that the addition of persistence has little effect on throughput: SiloR’s throughput is about 93% that of MemSilo.
- In summary, SiloR can handle more complex workloads with larger transactions as well as it can handle simple workloads with small transactions.
- We run YCSB-A and TPC-C benchmarks, and in each case, crash the database immediately before a checkpoint completes. This maximizes the length of the log that must be recovered to restore a transactionally- correct state.
- For YCSB-A, SiloR must recover 36 GB of checkpoint and 64 GB of log to recreate a 43.2 GB database. Recovery takes 106 s, or about 1.06 s/GB of recovery data. 31% of this time (33 s) is spent on the checkpoint and the rest (73 s) on the log.
- For TPCC, SiloR must recover 15.7 GB of checkpoint and 180 GB of log to recreate this database. Recovery takes 211 s, or about 1.08 s/GB of recovery data. 8% of this time (17 s) is spent on the check- point and the rest (194 s) on the log.
- Thus, recovery time is proportional to the amount of data that must be read to recover, and log replay is the limiting factor in recovery, justifying our decision to checkpoint frequently.
- In contrast with the evaluation of Silo, we disable the NUMA-aware allocator in our tests. When enabled, this allocator improves average throughput by around 25% on YCSB (to 10.91 Mtxn/s for SiloR) and 20% on TPC-C (to 644 Ktxn/s for SiloR). The cost was performance instability and dramatically worse latency.
7. Related work
- H-Store and VoltDB
- The gold standard for database logging and check-pointing is agreed to be ARIES, which combines undo and redo logging to recover inconsistent checkpoints. Undo logging is necessary because ARIES might flush uncommitted data to the equivalent of a checkpoint; since SiloR uses OCC, uncommitted data never occurs in a checkpoint, and redo logging suffices.