System Overview: LevelDB
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Many of its ideas and techniques are widely used in Big Data stacks, e.g., BigTable and HBase. The code in LevelDB is well-written with good documents, which is a ideal project to learn from.
1. Features
- LevelDB stores keys and values in arbitrary byte arrays, and data is sorted by key. It supports batching writes, forward and backward iteration, and compression of the data via Google’s Snappy compression library.
- LSM Tree: MemTable, SSTable, Compaction
- It does not support SQL, indexes.
2. Access Path
Read Path
- MemTable
- Last MemTable(which is being compacted now)
- SSTable
Write Path
- Add this operation to log
- MemTable
3. Implementations
MemTable
- Essentially a wrapper for skip list
- LevelDB uses customized memory allocator to estimate the size of data in MemTable
- LevelDB actually holds two instances of MemTable, one for current operations(e.g., Put), another for dumping to disk, which is similar to the double-buffer mechanism
SSTable
- SSTable Format
- Write: according to the format mentioned above
- Read: (1) start from level 0; (2) for each level, use metadata to binary search the file with corresponding key; (3) for each file(SSTable), use metadata and index to binary search the key-value pair
- Leveled structures
Redo Log
- Redo Log Format
- Failure Recovery(e.g., re-apply redo log when starting)
- Append to log before actual writing(a.k.a Write-ahead Logging)
Compaction
- Background
- Minor Compaction: when the size of MemTable is bigger then the predefined threshold, dump the current MemTable to a Level 0 SSTable
- Major Compaction:
VersionSet::PickCompaction
chooses a level for compaction - Manual Compaction
- Compaction Algorithm
- Timing
Version/VersionSet/VersionEdit
- Version: a snapshot for current database, generated by compaction
- VersionSet: all the versions
- VersionEdit: diff between versions
Failure Recovery
4. Interesting Tricks
- Iterator Design Pattern
- Abstract for Environment(POSIX Env)
- SequenceNumber for order of causality
- Cache for file descriptors(
TableCache
)
Written on March 17, 2017