System Overview: LevelDB

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Many of its ideas and techniques are widely used in Big Data stacks, e.g., BigTable and HBase. The code in LevelDB is well-written with good documents, which is a ideal project to learn from.

1. Features

  • LevelDB stores keys and values in arbitrary byte arrays, and data is sorted by key. It supports batching writes, forward and backward iteration, and compression of the data via Google’s Snappy compression library.
  • LSM Tree: MemTable, SSTable, Compaction
  • It does not support SQL, indexes.

2. Access Path

Read Path

  1. MemTable
  2. Last MemTable(which is being compacted now)
  3. SSTable

Write Path

  1. Add this operation to log
  2. MemTable

3. Implementations

Official Doc

MemTable

  • Essentially a wrapper for skip list
  • LevelDB uses customized memory allocator to estimate the size of data in MemTable
  • LevelDB actually holds two instances of MemTable, one for current operations(e.g., Put), another for dumping to disk, which is similar to the double-buffer mechanism

SSTable

  • SSTable Format
  • Write: according to the format mentioned above
  • Read: (1) start from level 0; (2) for each level, use metadata to binary search the file with corresponding key; (3) for each file(SSTable), use metadata and index to binary search the key-value pair
  • Leveled structures

Redo Log

Compaction

  • Background
  • Minor Compaction: when the size of MemTable is bigger then the predefined threshold, dump the current MemTable to a Level 0 SSTable
  • Major Compaction: VersionSet::PickCompaction chooses a level for compaction
  • Manual Compaction
  • Compaction Algorithm
  • Timing

Version/VersionSet/VersionEdit

  • Version: a snapshot for current database, generated by compaction
  • VersionSet: all the versions
  • VersionEdit: diff between versions

Failure Recovery

4. Interesting Tricks

  • Iterator Design Pattern
  • Abstract for Environment(POSIX Env)
  • SequenceNumber for order of causality
  • Cache for file descriptors(TableCache)
Written on March 17, 2017