The Case for Learned Index StructuresReference: Kraska, Tim, et al. "The Case for Learned Index Structures." arXiv preprint arXiv:1712.01208 (2017).
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.
- Traditional general-purpose indexes do not make full usage of data distribution, which could be mitigated with machine learning techniques.
- The high cost to execute a neural net might actually be negligible in the future, e.g., the emergence of Tensor Processing Unit (TPU).
- This paper aims to open up a entirely new research direction: replacing core components of a data management system through learned models has far reaching implications for future system designs.
2. Range Index
2.1 What Model Complexity Can We Afford? A Back-Of-The-Envelope Calculation
- Let’s say that B-Tree with a page-size of 100 has a precision gain of 1/100 per node. Traversing a single B-Tree page takes roughly 50 cycles (we measured that binary search over 100 cache-resident items has roughly the same performance as scanning) and is notoriously hard to parallelize.
- A modern CPU can do 8-16 SIMD operations per cycle. Thus, a model will be faster as long as it has a better precision gain than 1/100 per 50 * 8 = 400 arithmetic operations.
A single cache-miss costs 50-100 additional cycles and would thus allow for even more complex models.
- In this paper we focus on the more limited CPUs to better study the implications of replacing/enhancing indexes through machine learning without the impact of hardware changes.
2.2 Range Index Models are CDF Models
2.3 A First, Naive Learned Index
- We trained a two-layer fully-connected neural network with 32 neurons per layer (i.e., 32 width) using ReLU activation functions; the timestamps are the input features and the positions are the labels.
- Learned index: 80, 000 nano-seconds (ns) to execute the model with Tensorflow, without even the search time.
- B-Tree: 300 ns
- Tensorflow was designed to efficiently run larger models, not small models, and thus, has a significant invocation overhead, especially with Python as the front-end.
- B-Tree, or decision trees in general, are really good in overfitting the data with a few operations as they recursively divide the space using simple if-statement; while other models can be significantly more efficient to approximate the general shape of a CDF, but have problems to be accurate at the individual data instance level.
- The typical ML optimization goal is to minimize the average error. However, for indexing, where we not only need the best guess where the item might be but also to actually find it, the min- and max-error as discussed earlier are more important.
- B-Trees are extremely cache-efficient as they keep the top nodes always in cache and access other pages if needed. However, other models are not as cache and operation efficient.
3. The RM-Index
3.1 The Learning Index Framework (LIF)
- The LIF can be regarded as an index synthesis system; given an index specification, LIF generates different index configurations, optimizes them, and tests them automatically.
- LIF uses Tensorflow to train, but it will extracts all weights from the model and generates efficient index structures in C++ based on the model specification during inference.
- As a result, we are able to execute simple models on the order of 30 nano-seconds.
3.2 The Recursive Model Index
- Note that we currently train stage-wise and not fully end-to-end. Full end-to-end training would be even better and remains future work.
- It leverages the fact, that it is easy to learn the overall shape of the data distribution.
- The architecture effectively divides the space into smaller sub-ranges like a B-Tree/decision tree to make it easier to achieve the required “last mile” accuracy with a fewer number of operations.
- There is no search process required in-between the stages, since the output of models are simply offset. This not only reduces the number of instructions to manage the structure, but also allows representing the entire index as a sparse matrix-multiplication for a TPU/GPU.
3.3 Hybrid Indexes
- RMI enabling mixtures of models, and we could even use traditional B-Trees at the bottom stage if the data is particularly hard to learn.
- We found that a threshold of 128 or 256 (typical page-sizes for B-Trees), and for CPUs we can rarely afford neural nets with more 1 or 2 fully-connected hidden layers and 8 to 128 neurons per layer.
- Given the rather low capacity of the models, it is possible to train the higher level model stages with small samples of the data, which significantly speeds up the training process.
- Hybrid indexes allow us to bound the worst case performance of learned indexes to the performance of B-Trees.
3.4 Search Strategies
- Model Binary Search
- Biased Search
- Biased Quaternary Search
3.5 Indexing Strings
- Convert strings into a equal-size vector, with each character’s decimal value as the feature value.
- Datasets: Weblogs, Maps, web-documents, and Lognormal
- Our B-Tree implementation is similar to the stx::btree but with further cache-line optimization and very competitive performance.
- We tuned the 2-stage models using simple grid-search over rather simple models. In general we found, that a simple (0 hidden layers) to semi-complex (2 hidden layers and 8 or 16 wide) models for the first stage work the best. For the second stage, it turned our that simple (0 hidden layers), which are essentially linear models, had the best performance.
- The learned index dominates the B-Tree index in almost all configurations by being up to 3x faster and being up to an order-of-magnitude smaller.
- The speedups for learned indexes over B-Trees for strings is not as prominent anymore.
3.7 Future Research Challenges
Inserts and Updates
- Use the model to find position to insert or update if the distribution does not change
- Increase the capacity of model
- Warm-start every model training by using the previous solution as a starting point.
- Leveraging the RMI structure
- Translating key to disk position then using the same index scheme
4. Point Index
- Most solution often allocate significant more memory (i.e., slots) than records to store as well as combine it with additional data structures (such as dividing the hash-space into buckets), which yet again requires more overhead. For example, Google’s Dense-hashmap has a typical overhead of of about 78% memory, whereas Google’s sparse-hashmap only has 4 bits overhead but is up to 3-7 times slower because of its search and data placement strategy.
4.1 The Hash-Model Index
- The index with the model hash function overall has similar performance while utilizing the memory better.
4.3 Alternative Approaches and Future Work
- Hybrid indexes
5. Existence Index
5.1 Learned Bloom filters
- Bloom filters as a Classification Problem: binary classification, and to preserver zero false negatives, we build a Bloom filter on the set of false negatives of the learned Bloom filter.
- Bloom filters with Model-Hashes: train a hash function to map most keys to the higher range of bit positions and non-keys to the lower range of bit positions.
- We train a character-level RNN (GRU) to predict which set a URL belongs to.
- we find that our model + the spillover Bloom filter gets a 47% reduction in size.
6. Related Works
- B-Trees and variants
- Better Hash-Functions
- Succinct Data Structures
- Modeling CDFs
- Mixture of Experts
7. Conclusion and Future Work
- Multi-Dimensional Indexes
- Beyond Indexing: Learned Algorithms