Fast Checkpoint Recovery Algorithms for Frequently Consistent ApplicationsReference: Cao, Tuan, et al. "Fast checkpoint recovery algorithms for frequently consistent applications." Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011.
This is one of the several papers belong to suggested readings for Checkpointing Protocols of CMU 15-721: Database Systems.
Advances in hardware have enabled many long-running applications to execute entirely in main memory. As a result, these applications have increasingly turned to database techniques to ensure durability in the event of a crash. However, many of these applications, such as massively multiplayer online games and main-memory OLTP systems, must sustain extremely high update rates – often hundreds of thousands of updates per second. Providing durability for these applications without introducing excessive overhead or latency spikes remains a challenge for application developers. In this paper, we take advantage of frequent points of consistency in many of these applications to develop novel checkpoint recovery algorithms that trade additional space in main memory for significantly lower overhead and latency. Compared to previous work, our new algorithms do not require any locking or bulk copies of the application state. Our experimental evaluation shows that one of our new algorithms attains nearly constant latency and reduces overhead by more than an order of magnitude for low to medium update rates. Additionally, in a heavily loaded main-memory transaction processing system, it still reduces overhead by more than a factor of two.
- Unlike traditional database applications, which may never reach a point of consistency without quiescing the system, FC (frequently consistent) applications reach natural points of consistency very frequently (typically at least once a second) during normal operation.
- Most MMOs use a time-stepped processing model where character behavior is divided into atomic time steps or ticks that are executed many times per second and update the entire state of the game. The game state is guaranteed to be consistent at tick boundaries.
- Requirements: (1) the algorithm should have low overhead during normal operation; (2) the algorithm should distribute its overhead uniformly and not introduce performance spikes or highly variable response times; (3) it should be possible to take checkpoints very frequently so that the logical log can be replayed quickly in the event of a failure.
- We analyze the performance bottlenecks in prior work, and propose two new algorithms to address them.
- We explore the impact of data layout in main memory on cache performance and do a careful cache-aware implementation of all of our algorithms, as well as the best previous algorithms. We find that our algorithms are particularly amenable to these low level optimizations.
- We perform a thorough evaluation of our new algorithms and compare them to existing methods. We find that Wait-Free Ping-Pong exhibits lower overhead than all other algorithms by up to an of magnitude over a wide range of update rates. It also completely eliminates the latency spikes that plagued previous consistent checkpointing algorithms.
- A point of consistency is simply a time at which the state of the node is valid according to the semantics of the application.
- In this paper, we use coordinated checkpointing to provide durability for distributed FC applications at low cost.
- We will focus exclusively on developing robust single-node checkpointing algorithms, as these form the core of most distributed checkpointing schemes.
2.1 Requirements for Checkpoint Recovery Algorithms
- The method must have low overhead.
- The method should distribute overhead uniformly, so that application performance is consistent.
- The method should have fast recovery in the event of failure.
2.2 Algorithmic Framework
- During normal operation, the application takes periodic checkpoints of its entire dynamic state and maintains a logical log of all actions.
- In order to compare with existing algorithms for games, we will use the same algorithmic framework for checkpointing originally introduced by Vaz Salles et al.. Interface 1 lists the key methods in this API.
- We model main-memory checkpointing algorithms using a single Mutator thread that executes the application logic and synchronously updates the application state.
- In addition to the Mutator, we use two threads to write data to disk. The Logical Logger thread synchronously flushes logical log entries to disk. This could be done directly in the Mutator thread, but we implement it as a separate thread so that we can overlap computation and process additional actions while we are waiting for the disk write to complete. Note that we must wait for all disk writes to finish before proceeding to the next point of consistency (e.g. reporting a transaction as committed).
- The final thread, the Asynchronous Writer, writes some or all of the main-memory state to disk. Note that this thread can be run in the background while the Mutator concurrently updates the state.
- The PointOfConsistency method must be executed at a point of consistency, but it need not be executed at every point of consistency. If points of consistency are very frequent (e.g., every few microseconds), we can wait for several of them to pass before calling the method.
- These checkpoints may be organized on disk in several different ways. In our implementation, we use a double-backup organization for all of the algorithms, as it was reported in previous work to consistently out- perform a log-based implementation.
- The recovery procedure is the same for all algorithms that implement this framework. First, the most recent consistent checkpoint is read from disk and materialized as the new application state. Then, the logical log is replayed from the time of the last checkpoint until the state is up-to-date. Since we take checkpoints very frequently, the time to replay the logical log is quite small.
2.3 Existing Algorithms
- They concluded that two algorithms, Copy-on-Update and Naive-Snapshot, performed best for low and high update rates, respectively.
- Naive-Snapshot synchronously copies the entire state of the application to the shadow copy at a point of consistency and then writes it out asynchronously to disk.
- Copy-on-Update groups application objects into blocks and copies each block to the shadow state the first time it is updated during a checkpoint period.
3. New Algorithms
3.1 Design Overview
- Bulk State Copying: the method may need to pause the application to take a snapshot of the whole application state, as in Naive-Snapshot.
- Locking: the method may need to use locking to isolate the Mutator from the Asynchronous Writer, if they works on shared regions of the application state.
- Bulk Bit-Array Reset: if the method uses metadata bits to flag dirty portions of the state, it may need to pause the application and perform a bulk clean-up of this metadata before the start of a new checkpoint period.
- Memory Usage: in order to avoid synchronous writes to disk, the method may need to allocate additional main memory to hold copies of the application state.
Table 1 shows how the factors above apply to all methods.
3.2 Wait-Free Zigzag
- The main intuition behind Wait-Free Zigzag is to maintain an untouched copy of every word in the application state for the duration of a checkpoint period.
3.3 Wait-Free Ping-Pong
- Wait-Free Ping-Pong introduces negligible overhead to the Mutator at the end of a checkpoint period; only simple pointer swaps are needed. Thus, there is no single point in time at which the algorithm introduces a latency peak.
- On the other hand, this algorithm doubles the number of updates, as each update is applied both to the application state and to a copy.
- Merge the words updated during the most recent checkpoint with the last consistent checkpoint in order to construct a new consistent checkpoint that can be written to disk: Copy or Merge.
4.1 Existing Algorithms
- Naive Snapshot (NS): with micro-benchmarks, we observed that a memcpy of a memory-aligned application state was better than our attempts to manually unroll the copy loop.
- Bit-Array Packed Copy-on-Update (BACOU): in order to minimize the overhead of bulk bit-array resetting, we packed the bits into (64 byte) cache lines and used long word instructions for all operations; furthermore, we interleaved blocks of the primary and shadow copies of the application state into one cache line, so that they will be fetched together.
4.2 Wait-Free Zigzag
The Wait-Free Zigzag algorithm has two major sources of overhead: the bit array lookups in the handleRead and handleWrite routines, and the bulk negation in the prepareForNextCheckpoint routine.
Data Layout Variations
- Naive Wait-Free Zigzag (NZZ)
- Interleaved Wait-Free Zigzag (IZZ)
- Packed Wait-Free Zigzag (PZZ)
- Bit-Array Packed Wait-Free Zigzag (BAZZ)
4.3 Wait-Free Ping-Pong
Unlike Wait-Free Zigzag, Wait-Free Ping-Pong has a very inexpensive prepareForNextCheckpoint routine. On the other hand, it must write to two copies of the application state during each update.
Data Layout Variations
- Naive Wait-Free Ping-Pong (NPP)
- Interleaved Wait-Free Ping-Pong (IPP)
- We look at the synchronous overhead per checkpoint. This added overhead indicates the total amount of work done by the checkpointing algorithm during a checkpoint period.
- We also measure how the overhead is distributed over time in order to see whether the checkpointing algorithms introduce any unacceptable latency peaks.
5.1 Setup and Datasets
- Synthetic Zipf workload, synthetic MMO workload
- We turned off both the Asynchronous Writer and logical logging to measure the synchronous overhead introduced by different algorithms more accurately, since these mechanisms perform the same amount of work independently of checkpointing method.
- Unlike in the synthetic benchmark experiments, we checkpoint as frequently as possible in order to understand the maximum impact of our algorithms in a realistic application.
- Since each EC2 instance communicates with EBS over the network, there is a CPU cost to writing out state in the Asynchronous Writer. To limit this effect, we set the thread affinity so that the Asynchronous Writer always runs on a separate core from the Mutator thread.
- To minimize synchronous I/O effects, we configured the Logger thread to perform group commit in batches of 500 transactions. We also overlap computation with IO operations so that when the Logger is writing the actions of one batch of transactions, the Mutator is processing the next batch.
5.2 Comparison of Implementation Variants
- To get a deeper understanding of the runtime characteristics, we profiled our implementation with the Intel VTune Performance Analyzer.
- Overall, we have observed that negation is the most significant source of overhead for Wait-Free Zigzag unless the update rate is extremely high. Thus, BAZZ is the best variant for this algorithm under typical workloads.
- In short, this experiment shows that IPP comfortably dominates NPP over the whole spectrum of update rates evaluated.
5.3 Synthetic Zipf Workload
- As expected, NS is essentially constant regardless of the number of updates, since it always copies the entire state. This is the worst strategy for very low update rates since many unchanged cells get copied, but it dominates the other algorithms, with the notable exception of IPP, for more than 160,000 updates per second.
- IPP displays the best performance of any of the algorithms for all but the highest update rates. IPP scales linearly with the number of updates over the entire range of update rates, since the predominant cost is updating each of two copies of the application state. The absolute overhead of the extra updates performed by IPP is extremely low, however, due to its cache-aware data layout and its wait-free operation.
Scaling the State Size
- We confirm that the trends we described above for 200 MB of application state continue to hold for larger state sizes.
- We see that NS has the worst overhead distribution of any of the algorithms.
- Overall, IPP has the most consistent overhead of any tested algorithm, and its total overhead is also lowest for all but the highest update rates.
5.4 Synthetic MMO Workload
- The experiments show that IPP is the method with the lowest overhead and the best overhead distribution for a realistic application with hundreds of thousands of updates per second. In addition, IPP exhibits short recovery times in this scenario.
5.5 TPC-C Application
- IPP-Copy always slightly outperforms IPP-Merge at the cost of maintaining an additional copy of the application state.
- IPP allows for frequent checkpointing with significantly lower overhead than the best existing methods.
- IPP distributes overhead more evenly than either Naive Snapshot or ARIES, and thus is more suitable for latency-sensitive applications
5.6 Further Optimizations: Large Pages
- Page walks resulting from TLB misses are a significant bottleneck for the Mutator in most algorithms we examined. Large pages may reduce TLB misses because they cover the same region of physical memory with a smaller number of TLB entries.
- In our scenario, the whole application state and auxiliary data structures are implemented as a few large objects in memory, so using large pages causes very little internal fragmentation.
- Large pages have little impact on NS and BACOU, since these two methods do not put much stress on TLB. In constrast, all variants of Wait-Free Zigzag and Wait-Free Ping-Pong benefit noticeably from large pages.
6. Related Work
- Checkpointing algorithms for main-memory DBMSs
- different approaches to integrating checkpoint-recovery systems with applications.
- In classic relational DBMSs, ARIES is the gold standard for recovery.
- Hot standby architectures have been commonly used to provide fault tolerance on multiple database nodes.