System Overview: LevelDB – Zeyuan Shang

System Overview: LevelDB

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Many of its ideas and techniques are widely used in Big Data stacks, e.g., BigTable and HBase. The code in LevelDB is well-written with good documents, which is a ideal project to learn from.

1. Features

LevelDB stores keys and values in arbitrary byte arrays, and data is sorted by key. It supports batching writes, forward and backward iteration, and compression of the data via Google’s Snappy compression library.
LSM Tree: MemTable, SSTable, Compaction
It does not support SQL, indexes.

2. Access Path

Read Path

MemTable
Last MemTable(which is being compacted now)
SSTable

Write Path

Add this operation to log
MemTable

3. Implementations

MemTable

Essentially a wrapper for skip list
LevelDB uses customized memory allocator to estimate the size of data in MemTable
LevelDB actually holds two instances of MemTable, one for current operations(e.g., Put), another for dumping to disk, which is similar to the double-buffer mechanism

SSTable

SSTable Format
Write: according to the format mentioned above
Read: (1) start from level 0; (2) for each level, use metadata to binary search the file with corresponding key; (3) for each file(SSTable), use metadata and index to binary search the key-value pair
Leveled structures

Redo Log

Redo Log Format
Failure Recovery(e.g., re-apply redo log when starting)
Append to log before actual writing(a.k.a Write-ahead Logging)

Compaction

Background
Minor Compaction: when the size of MemTable is bigger then the predefined threshold, dump the current MemTable to a Level 0 SSTable
Major Compaction: VersionSet::PickCompaction chooses a level for compaction
Manual Compaction
Compaction Algorithm
Timing

Version/VersionSet/VersionEdit

Version: a snapshot for current database, generated by compaction
VersionSet: all the versions
VersionEdit: diff between versions

Failure Recovery

Recovery

4. Interesting Tricks

Iterator Design Pattern
Abstract for Environment(POSIX Env)
SequenceNumber for order of causality
Cache for file descriptors(TableCache)

Written on March 17, 2017