Fast database restarts at Facebook

Reference: Goel, Aakash, et al. "Fast database restarts at Facebook." Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014.

This is one of several papers in the suggested readings for the Checkpointing Protocols lecture of CMU 15-721: Database Systems.

0. Abstract

Facebook engineers query multiple databases to monitor and analyze Facebook products and services. The fastest of these databases is Scuba, which achieves subsecond query response time by storing all of its data in memory across hundreds of servers. We are continually improving the code for Scuba and would like to push new software releases at least once a week. However, restarting a Scuba machine clears its memory. Recovering all of its data from disk — about 120 GB per machine — takes 2.5-3 hours to read and format the data per machine. Even 10 minutes is a long downtime for the critical applications that rely on Scuba, such as detecting user-facing errors. Restarting only 2% of the servers at a time mitigates the amount of unavailable data, but prolongs the restart duration to about 12 hours, during which users see only partial query results and one engineer needs to monitor the servers carefully. We need a faster, less engineer intensive, solution to enable frequent software upgrades.

In this paper, we show that using shared memory provides a simple, effective, fast, solution to upgrading servers. Our key observation is that we can decouple the memory lifetime from the process lifetime. When we shutdown a server for a planned upgrade, we know that the memory state is valid (unlike when a server shuts down unexpectedly). We can therefore use shared memory to preserve memory state from the old server process to the new process. Our solution does not increase the server memory footprint and allows recovery at memory speeds, about 2-3 minutes per server. This solution maximizes uptime and availability, which has led to much faster and more frequent rollouts of new features and improvements. Furthermore, this technique can be applied to the in-memory state of any database, even if the memory contains a cache of a much larger disk-resident data set, as in most databases.

1. Introduction

  • Scuba, used at Facebook, needs to be upgraded frequently, but its downtime is too long.
  • We decided to decouple the memory’s lifetime from the process’s lifetime when we shut down a server for a planned upgrade.
  • Two key design decisions: (1) Scuba copies data from heap memory to shared memory at shutdown time and copies it back to the heap at startup; (2) during the copy, data structures are translated from their heap format to a (very similar but not the same) shared memory format.
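The two design decisions above can be modeled with a small sketch (in Python rather than Scuba's C++, with hypothetical names): at shutdown, heap-format columns are translated into a packed, pointer-free format suitable for shared memory; at startup, the packed bytes are translated back into heap form.

```python
import struct

def shutdown_to_shared(columns):
    """Translate heap-format columns (name -> list of ints) into a
    packed, pointer-free format that can live in shared memory."""
    blob = bytearray()
    blob += struct.pack("<I", len(columns))              # column count
    for name, values in columns.items():
        encoded = name.encode()
        blob += struct.pack("<I", len(encoded)) + encoded
        blob += struct.pack("<I", len(values))
        blob += struct.pack(f"<{len(values)}q", *values)
    return bytes(blob)

def restart_from_shared(blob):
    """Rebuild the heap-format columns from the packed format."""
    columns, off = {}, 0
    (ncols,), off = struct.unpack_from("<I", blob, off), off + 4
    for _ in range(ncols):
        (nlen,), off = struct.unpack_from("<I", blob, off), off + 4
        name, off = blob[off:off + nlen].decode(), off + nlen
        (nvals,), off = struct.unpack_from("<I", blob, off), off + 4
        columns[name] = list(struct.unpack_from(f"<{nvals}q", blob, off))
        off += 8 * nvals
    return columns
```

Note the round trip is a translation, not a bit-identical copy: the shared-memory format is very similar to, but not the same as, the heap format.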

2. Scuba Architecture

  • Scuba’s storage engine is a column store.
  • Each row block column stores a base address; all other addresses within it, such as the beginning of the dictionary, data, and footer, are offsets from that base address. BerkeleyDB is another database that uses a base address plus offsets for its pointers.
  • Using offsets enables us to copy the entire row block column between heap and shared memory in one memory copy operation. Only the address of the row block column itself (in the row block) needs to be changed for its new location.
  • Compression reduces the size of the row block column by a factor of about 30. Scuba’s compression methods are a combination of dictionary encoding, bit packing, delta encoding, and lz4 compression, with at least two methods applied to each column.
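The base-plus-offset layout described above can be sketched as follows (a toy model, not Scuba's actual on-disk or in-memory format): every internal "pointer" is an offset from the column's base, so the bytes are position-independent and relocating the column needs only one raw copy plus a new base address.

```python
import struct

# Hypothetical header: base-relative offsets of dictionary, data, footer.
HEADER = struct.Struct("<III")

def build_row_block_column(dictionary, data, footer):
    """Pack three sections behind a header of base-relative offsets."""
    dict_off = HEADER.size
    data_off = dict_off + len(dictionary)
    foot_off = data_off + len(data)
    return HEADER.pack(dict_off, data_off, foot_off) + dictionary + data + footer

def read_sections(buf, base=0):
    """Resolve every section as base + offset; the same bytes work
    no matter where the column has been copied to."""
    dict_off, data_off, foot_off = HEADER.unpack_from(buf, base)
    return (buf[base + dict_off : base + data_off],
            buf[base + data_off : base + foot_off],
            buf[base + foot_off :])
```

Because nothing inside the column is an absolute address, "relocating" it is just copying the bytes and reading them with a different base.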

3. Shared Memory

  • We use the POSIX mmap-based API (mmap, munmap, msync, mprotect) from Boost.Interprocess.
  • At Facebook, our default heap memory allocator is jemalloc.
  • We allocate data in heap memory during normal operation, and copy it to shared memory at shutdown and copy it back at start up.

4. Restart Implementation

  • When there is a clean shutdown, such as when we want to deploy a new Scuba binary, we can use shared memory rather than restarting by reading from disk. We do not use shared memory to recover from a crash; the crash may have been caused by memory corruption.

4.1 Restart from Disk

  • Restart from disk is the fallback after a crash: shared memory cannot be trusted then, since the crash may have been caused by memory corruption, so the leaf rereads and reformats all of its data from disk.

4.2 Shared Memory Layout

  • Since the number and contents of row blocks and row block columns are known at allocation time in shared memory, we can eliminate one level of indirection and allocate them contiguously.
  • Additionally, there is leaf metadata for each of the eight leaf servers, although at most one of them will roll over using shared memory at a time (it is much better to restart eight leaf servers on eight different machines in parallel than to restart all eight leaf servers on the same machine at once).
  • Each leaf has a unique, hard-coded location in shared memory for its metadata. In that location, the leaf stores a valid bit, a layout version number, and pointers to any shared memory segments it has allocated.
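The per-leaf metadata record might look like the following sketch (field widths and the version constant are assumptions, not the paper's actual layout): before trusting anything else in shared memory, a restarting leaf checks the valid bit and that its binary still understands the layout version.

```python
import struct

LAYOUT_VERSION = 3                 # hypothetical current layout version
# Record: valid bit, layout version, fixed-width shared segment name.
META = struct.Struct("<?I16s")

def write_leaf_metadata(segment_name, valid=True):
    """Pack the leaf's metadata record for its fixed shared-memory slot."""
    return META.pack(valid, LAYOUT_VERSION, segment_name.ljust(16).encode())

def can_restart_from_shared(meta_bytes):
    """Only restart from shared memory if the leaf shut down cleanly
    (valid bit set) and the layout version matches this binary."""
    valid, version, _ = META.unpack(meta_bytes)
    return valid and version == LAYOUT_VERSION
```

If either check fails, the leaf falls back to the slow restart-from-disk path.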

4.3 Restart Using Shared Memory

4.4 Copying To and From Shared Memory

  • We copy data gradually, allocating enough space for one row block column at a time in shared memory, copying it, and then freeing it from the heap, which keeps the total memory footprint of the leaf nearly unchanged during both shutdown and restart.
  • Since all pointers in a row block column are offsets from the start of the row block column, copying a row block column can be done in one call to memcpy. Therefore, copying a table only requires one call per row block column.
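The column-at-a-time copy above can be sketched with a toy model ("heap" and "shared" are plain dicts of byte strings, and the extra footprint is tracked by hand): each heap copy is freed right after its one-shot copy, so at any moment at most one column exists twice.

```python
def move_to_shared(heap_columns):
    """Move columns one at a time from the heap to 'shared memory',
    freeing each heap copy immediately so the total footprint stays
    close to one table's worth of data during shutdown."""
    shared = {}
    peak_extra = 0
    for name in list(heap_columns):
        col = heap_columns.pop(name)            # free from the heap...
        shared[name] = bytes(col)               # ...right after one copy
        peak_extra = max(peak_extra, len(col))  # at most one column doubled
    return shared, peak_extra
```

Restart runs the same loop in the opposite direction; in real Scuba each per-column copy is the single memcpy enabled by the offset-based layout.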

4.5 System-wide Rollover

  • If we plan a rollover, we keep most of the data available for queries. Typically, we restart 2% of the leaf servers at a time, and the entire rollover takes 10-12 hours to restart from disk. We therefore monitor the rollover process closely, to make sure it is making progress.
  • Using shared memory is much faster, about 2-3 minutes per server (including the time to detect that a leaf is done with recovery and then initiate rollover for the next one). We can restart the entire cluster of Scuba machines in under an hour by using shared memory, with 98% of data online and available to queries.

5. Related Work

5.1 Database Recovery

  • Most databases rely on recovery from disk. VoltDB, SAP HANA, Hekaton, and TimesTen are in-memory databases that recover using a combination of checkpoints and write-ahead logs.
  • Other database systems, such as SQLite, store the metadata required for restarts in shared memory. The metadata provides an index into the data files. For example, SQLite maintains its write-ahead log index in shared memory.
  • There are database systems that use shared memory to coordinate actions between concurrent server processes. eXtremeDB is one such example.

5.2 Shared Memory Usage in Other Systems

  • At Facebook, two other big, distributed systems use shared memory to keep data alive across software upgrades: TAO and Memcache.
Written on October 12, 2017