The Case for Learned Index Structures
Reference: Kraska, Tim, et al. "The Case for Learned Index Structures." arXiv preprint arXiv:1712.01208 (2017).
0. Abstract
Indexes are models: a B-Tree-Index can be seen as a model that maps a key to the position of a record within a sorted array, a Hash-Index as a model that maps a key to a position of a record within an unsorted array, and a BitMap-Index as a model that indicates if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets. More importantly, though, we believe that the idea of replacing core components of a data management system with learned models has far-reaching implications for future systems designs, and that this work provides just a glimpse of what might be possible.
1. Introduction
 Traditional general-purpose indexes do not take full advantage of the data distribution, which machine learning techniques could exploit.
 The high cost of executing a neural net may become negligible in the future, e.g., with the emergence of the Tensor Processing Unit (TPU).
 This paper aims to open up an entirely new research direction: replacing core components of a data management system with learned models has far-reaching implications for future system designs.
2. Range Index
2.1 What Model Complexity Can We Afford? A Back-of-the-Envelope Calculation
 Say a B-Tree with a page-size of 100 has a precision gain of 1/100 per node. Traversing a single B-Tree page takes roughly 50 cycles (we measured that binary search over 100 cache-resident items has roughly the same performance as scanning) and is notoriously hard to parallelize.
 A modern CPU can do 8-16 SIMD operations per cycle. Thus, a model will be faster as long as it has a better precision gain than 1/100 per 50 * 8 = 400 arithmetic operations.
 A single cache-miss costs 50-100 additional cycles and would thus allow for even more complex models.
 In this paper we focus on the more limited CPUs to better study the implications of replacing/enhancing indexes with machine learning, without the impact of hardware changes.
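The back-of-the-envelope argument above can be restated in a few lines (the cycle counts are the paper's estimates, and the constant names are ours, for illustration only):

```python
# Back-of-the-envelope: how many arithmetic operations may a model spend
# to replace one B-Tree page? (Cycle counts are the paper's estimates.)
CYCLES_PER_PAGE = 50       # traversing one B-Tree page (binary search, ~100 items)
SIMD_OPS_PER_CYCLE = 8     # conservative end of the 8-16 range

# A page with 100 entries narrows the search range by a factor of 100,
# i.e., a precision gain of 1/100. A model is competitive if it achieves
# the same gain within this many arithmetic operations:
ops_budget = CYCLES_PER_PAGE * SIMD_OPS_PER_CYCLE
print(ops_budget)  # 400
```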
2.2 Range Index Models are CDF Models
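The section's observation is that an index over sorted records approximates the cumulative distribution function of the keys: a key's position is p ≈ F(key) × N. A minimal sketch, assuming a least-squares line as the "model" (all names here are illustrative, not the paper's code):

```python
def fit_linear_cdf(sorted_keys):
    """Least-squares fit of position ~ a*key + b, i.e., a linear
    approximation of the scaled CDF F(key) * N."""
    n = len(sorted_keys)
    xs, ys = sorted_keys, range(n)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

keys = list(range(0, 1000, 10))   # 100 uniformly spaced keys
a, b = fit_linear_cdf(keys)
pos = round(a * 500 + b)          # predicted position of key 500
print(pos)  # 50
```

On uniform data the linear CDF model is exact; on skewed data its prediction error is what the later sections bound and then close with a local search.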
2.3 A First, Naive Learned Index
 We trained a two-layer fully-connected neural network with 32 neurons per layer (i.e., 32 wide) using ReLU activation functions; the timestamps are the input features and the positions are the labels.
 Learned index: 80,000 nanoseconds (ns) to execute the model with Tensorflow, not even counting the search time.
 B-Tree: 300 ns
Reasons
 Tensorflow was designed to efficiently run larger models, not small ones, and thus has significant invocation overhead, especially with Python as the front-end.
 B-Trees, or decision trees in general, are really good at overfitting the data with a few operations, as they recursively divide the space using simple if-statements; other models can be significantly more efficient at approximating the general shape of a CDF, but have difficulty being accurate at the individual data-instance level.
 The typical ML optimization goal is to minimize the average error. However, for indexing, where we not only need the best guess of where the item might be but also have to actually find it, the min- and max-error are more important.
 B-Trees are extremely cache-efficient, as they always keep the top nodes in cache and access other pages only if needed. However, other models are not as cache- and operation-efficient.
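The min-/max-error point can be made concrete: after training, a single pass over the data yields the error bounds that determine how far any subsequent search must look. A sketch with a crude linear stand-in for the trained net (the model and data here are ours, chosen for illustration):

```python
def error_bounds(predict, keys):
    """Min and max signed prediction error over all keys; a lookup then
    only needs to search [pred + min_err, pred + max_err]."""
    errs = [i - predict(k) for i, k in enumerate(keys)]
    return min(errs), max(errs)

keys = [1, 2, 4, 8, 16, 32, 64, 128]   # exponential: hard for a straight line
predict = lambda k: k * 7 / 128        # crude linear model through the endpoints
lo, hi = error_bounds(predict, keys)
print(lo, hi)  # the worst under- and over-estimates of position
```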
3. The RM-Index
3.1 The Learning Index Framework (LIF)
 The LIF can be regarded as an index synthesis system; given an index specification, LIF generates different index configurations, optimizes them, and tests them automatically.
 LIF uses Tensorflow to train, but it extracts all weights from the model and generates efficient index structures in C++ based on the model specification for inference.
 As a result, we are able to execute simple models on the order of 30 nanoseconds.
3.2 The Recursive Model Index
 Note that we currently train stagewise and not fully endtoend. Full endtoend training would be even better and remains future work.
Benefits
 It leverages the fact that it is easy to learn the overall shape of the data distribution.
 The architecture effectively divides the space into smaller sub-ranges, like a B-Tree/decision tree, to make it easier to achieve the required “last mile” accuracy with fewer operations.
 No search process is required in between the stages, since the output of a model is simply an offset. This not only reduces the number of instructions needed to manage the structure, but also allows representing the entire index as a sparse matrix multiplication on a TPU/GPU.
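The staged routing can be sketched as follows, using linear models at both stages (the function names and the two-stage shape are ours; Section 3.6 reports that 0-hidden-layer, i.e., linear, models worked best at the second stage):

```python
def fit_linear(xs, ys):
    """Least-squares line y = a*x + b; constant fallback for tiny sets."""
    xs, ys = list(xs), list(ys)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    d = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / d if d else 0.0
    return a, my - a * mx

def train_rmi(keys, m=4):
    """Two-stage RMI: a stage-1 model routes each key to one of m
    stage-2 models, which predict the final position. Note there is no
    search between stages: the stage-1 output is simply an offset."""
    n = len(keys)
    a1, b1 = fit_linear(keys, range(n))
    buckets = [[] for _ in range(m)]
    for pos, k in enumerate(keys):
        j = min(m - 1, max(0, int((a1 * k + b1) * m / n)))
        buckets[j].append((k, pos))
    stage2 = [fit_linear([k for k, _ in b], [p for _, p in b]) if b else (0.0, 0.0)
              for b in buckets]

    def predict(key):
        j = min(m - 1, max(0, int((a1 * key + b1) * m / n)))
        a2, b2 = stage2[j]
        return a2 * key + b2
    return predict

keys = list(range(0, 200, 2))   # 100 uniformly spaced keys
predict = train_rmi(keys)
print(round(predict(100)))      # key 100 sits at position 50
```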
3.3 Hybrid Indexes
 The RMI enables mixtures of models, and we could even use traditional B-Trees at the bottom stage if the data is particularly hard to learn.
 We found that a threshold of 128 or 256 (typical page-sizes for B-Trees) works well; on CPUs we can rarely afford neural nets with more than 1 or 2 fully-connected hidden layers and 8 to 128 neurons per layer.
 Given the rather low capacity of the models, it is possible to train the higher level model stages with small samples of the data, which significantly speeds up the training process.
 Hybrid indexes allow us to bound the worst case performance of learned indexes to the performance of BTrees.
3.4 Search Strategies
 Model Binary Search
 Biased Search
 Biased Quaternary Search
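A sketch of a biased search, i.e., a binary search restricted to the window guaranteed by the model's min/max error and probing the prediction first (the function name and window-passing convention are ours, for illustration):

```python
def biased_search(keys, key, pred, min_err, max_err):
    """Search only [pred + min_err, pred + max_err], probing the model's
    prediction first and then falling back to binary search in the window."""
    lo = max(0, pred + min_err)
    hi = min(len(keys) - 1, pred + max_err)
    mid = max(lo, min(hi, pred))   # first probe: the prediction itself
    while lo <= hi:
        if keys[mid] == key:
            return mid
        if keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
        mid = (lo + hi) // 2
    return -1                      # not found inside the guaranteed window

keys = list(range(0, 100, 2))      # key 42 sits at position 21
print(biased_search(keys, 42, pred=20, min_err=-3, max_err=3))  # 21
```

A tighter error bound shrinks the window and thus the number of probes, which is why the min/max error (not the average) drives lookup cost.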
3.5 Indexing Strings
 Convert strings into equal-size vectors, with each character’s decimal value as a feature value.
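The string featurization can be sketched in one function (the padding/truncation policy and the `width` parameter are our assumptions for illustration):

```python
def string_to_vector(s, width=8):
    """Fixed-size feature vector: one feature per character position,
    using the character's decimal (code point) value; shorter strings
    are zero-padded, longer ones truncated."""
    return [ord(c) for c in s[:width]] + [0] * max(0, width - len(s))

print(string_to_vector("abc", width=4))  # [97, 98, 99, 0]
```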
3.6 Results
 Datasets: Weblogs, Maps, web-documents, and Lognormal
 Our B-Tree implementation is similar to stx::btree but with further cache-line optimization and very competitive performance.
 We tuned the 2-stage models using a simple grid-search over rather simple models. In general we found that simple (0 hidden layers) to semi-complex (2 hidden layers, 8 or 16 wide) models work best for the first stage. For the second stage, it turned out that simple (0 hidden layers) models, which are essentially linear models, had the best performance.
Integer Datasets
 The learned index dominates the B-Tree index in almost all configurations, being up to 3x faster and up to an order of magnitude smaller.
String Datasets
 The speedups of learned indexes over B-Trees for strings are not as prominent.
3.7 Future Research Challenges
Inserts and Updates
 Use the model to find the position to insert or update, if the distribution does not change
 Increase the capacity of the model
 Deltaindex
 Warmstart every model training by using the previous solution as a starting point.
Paging
 Leveraging the RMI structure
 Translating the key to a disk position, then using the same index scheme
4. Point Index
 Most solutions allocate significantly more memory (i.e., slots) than there are records to store, and combine it with additional data structures (such as dividing the hash-space into buckets), which again adds overhead. For example, Google’s dense-hashmap has a typical overhead of about 78% memory, whereas Google’s sparse-hashmap has only 4 bits of overhead but is up to 3-7 times slower because of its search and data-placement strategy.
4.1 The Hash-Model Index
4.2 Results
 The index with the model hash function has overall similar performance while utilizing the memory better.
4.3 Alternative Approaches and Future Work
 Hybrid indexes
5. Existence Index
 Bloom Filter
5.1 Learned Bloom filters
 Bloom filters as a classification problem: binary classification; to preserve zero false negatives, we build a Bloom filter on the set of false negatives of the learned Bloom filter.
 Bloom filters with model-hashes: train a hash function to map most keys to the higher range of bit positions and non-keys to the lower range of bit positions.
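The classification variant can be sketched as follows. The class name, toy model, and threshold are ours, and a plain set stands in for the backup Bloom filter so the sketch stays self-contained; the structural point is that spilling the model's false negatives into a backup filter preserves zero false negatives:

```python
class LearnedBloomFilter:
    """Classifier + spillover: model(x) >= tau predicts membership; keys the
    model misses go into a backup structure (a set here, standing in for a
    real Bloom filter), so false negatives are impossible by construction."""
    def __init__(self, model, tau, keys):
        self.model, self.tau = model, tau
        # Backup structure over the model's false negatives only.
        self.spillover = {k for k in keys if model(k) < tau}

    def might_contain(self, x):
        return self.model(x) >= self.tau or x in self.spillover

# Toy "model" that scores long strings as likely keys (illustration only).
model = lambda s: len(s) / 10.0
keys = ["malicious-url.example", "bad.example", "x"]
f = LearnedBloomFilter(model, tau=0.5, keys=keys)
print(all(f.might_contain(k) for k in keys))  # True: zero false negatives
```

The better the classifier, the smaller the spillover set, and hence the smaller the backup filter, which is where the reported size reduction comes from.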
5.2 Results
 We train a characterlevel RNN (GRU) to predict which set a URL belongs to.
 We find that our model plus the spillover Bloom filter achieves a 47% reduction in size.
6. Related Work
 B-Trees and variants
 Better hash functions
 Bloom filters
 Succinct Data Structures
 Modeling CDFs
 Mixture of Experts
7. Conclusion and Future Work
 Multi-Dimensional Indexes
 Beyond Indexing: Learned Algorithms
 GPU/TPUs