# Dictionary-based Order-preserving String Compression for Main Memory Column Stores

Reference: Binnig, Carsten, Stefan Hildenbrand, and Franz Frber. "Dictionary-based order-preserving string compression for main memory column stores." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.

This is one of the several papers belong to suggested readings for Database Compression of CMU 15-721: Database Systems.

## 0. Abstract

Column-oriented database systems perform better than traditional row-oriented database systems on analytical workloads such as those found in decision support and business intelligence applications. Moreover, recent work [1, 24] has shown that light- weight compression schemes significantly improve the query processing performance of these systems. One such a lightweight compression scheme is to use a dictionary in order to replace long (variable-length) values of a certain domain with shorter (fixed-length) integer codes. In order to further improve expensive query operations such as sorting and searching, column-stores often use order-preserving compression schemes.

In contrast to the existing work, in this paper we argue that order-preserving dictionary compression does not only pay off for attributes with a small fixed domain size but also for long string attributes with a large domain size which might change over time. Consequently, we introduce new data structures that efficiently support an order-preserving dictionary compression for (variable-length) string attributes with a large domain size that is likely to change over time. The main idea is that we model a dictionary as a table that specifies a mapping from string-values to arbitrary integer codes (and vice versa) and we introduce a novel indexing approach that provides efficient access paths to such a dictionary while compressing the index data. Our experiments show that our data structures are as fast as (or in some cases even faster than) other state- of-the-art data structures for dictionaries while being less memory intensive.

## 1. Introduction

• Current column stores does not work well with dictionary-based encoding when the domain size is not known a priori, and order-preserving scheme can not generate fixed-length records which are easy to extend for new values.

Contribution

• We introduce a new approach for indexing a dictionary of string values (called shared leaves) that leverages an order-preserving encoding scheme efficiently.
• We introduce a concrete leaf structure for the shared-leaves approach that can be used by the indexes of a dictionary for efficiently encoding and decoding string values while the leaf structure itself is compressed.
• We present two new cache-conscious string indexes that can be used on top of our leaf structure to efficiently support the encoding of string data in the dictionary.
• The experiments show that in terms of performance our leaf structure is as efficient as other read-optimized indexes while using less memory.

## 2. Overview

### 2.1 Dictionary Operations

• encode: values → codes: this bulk operation involves (1) the lookup of codes for those strings that are already in the dictionary and (2) the insertion of new string values as well as the generation of order-preserving codes for those new values.
• decode: codes → values
• lookup: (value, type) → code: the parameter type specifies whether the dictionary should execute an exact-match lookup (as it is necessary for string constants in equality- predicates) or return the integer code for the next smaller (or larger) string value (as it is necessary for string constants in range-predicates).
• lookup: prefix → (mincode, maxcode): this operation is used during query compilation to rewrite the prefix of a prefix-predicate (e.g., p_name = ‘Whole Milk*’) with the corresponding integer ranges (i.e., the mincode and the maxcode).

In order to use table T to efficiently support all these operations, we propose to build indexes for both attributes of table T (value and code) because encoding and decoding usually access only a subset of the dictionary.

### 2.2 Shared-leaves Indexing

• The new idea of this paper is that the two indexes for encoding and decoding share the same leaves here both indexes directly hold the data of table T in their leaves but avoid the redundancy of the direct indexing approach.
• As the string dictionary uses an order-preserving encoding scheme, the string values and the integer codes in table T follow the same sort order (i.e., we can have clustered indexes on both columns and thus can share the leaves between two direct indexes). Consequently, as the attribute values value and code of table T can both be kept in sort order inside the leaves, the leaves can provide efficient access paths for both lookup directions using a standard search method for sorted data (e.g., binary search or interpolation search).
• In order to generate the codes for new string values that are inserted into the dictionary, we simply partition the code range where new string values are inserted in into equi-distant intervals.

### 2.3 Requirements and Design Decisions

• Leaf structure
• Encode/Decode index structure: all-bulked, hybrid, all-in-place
• In order to guarantee consistency of the data dictionary, for simplicity we decide to lock the complete indexes as well as the leaves during data loading (i.e., the encoding of string values) because this usually happens only at predefined points in time in data warehousing (i.e., once a day).
• During query processing (i.e., for decoding query results), we allow concurrency because these are read-only operations.

## 3. Leaf Structure

### 3.1 Memory Layout

• The leaf structure compresses the string values using incremental-encoding while each $n$-th string (e.g., each 16-th string value) is not compressed to enable an efficient lookup of strings without having to decompress the the complete leaf data.
• In order to enable an efficient lookup using this leaf structure, an offset vector is stored at the end of the leaf that holds references (i.e., offsets) and the integer codes of all uncompressed strings of a leaf also in a sorted way.
• Multiple parameters: leaf size, offset size, code size, prefix size, decode interval

### 3.2 Leaf Operations

Bulk Lookup

• Procedures: (1) use the offset vector to execute a binary search over the uncompressed strings of a leaf in order to find an uncompressed string value v’ that satisfies v’ ≤ v and no other uncompressed value v’’ exists with v’ < v’’ < v; (2) if v’ = v return the corresponding code c for v that is stored in the offset vector; (3) otherwise, sequentially search value v from value v’ on until v is found or the next uncompressed value appears. In the first case return the code c, in the second case indicate that the value v was not found.
• The average costs for lookups (where n is the number of strings/codes stored in a leaf and d is the decode interval): $O(log(n/d)) + O(d)$

Bulk Update

• In order to initially bulk load a leaf with a list of string values, we first have to sort the string values. Afterwards, the leaf data can be written sequentially from the beginning of the leaf while the offset vector is written reversely from the end. If the string values do not utilize the complete memory allocated for a leaf, then the offset vector can be moved to the end of the compressed data section and the unused memory can be released.
• In order to insert a list of new string values into an existing leaf, we again have to sort these string values first. Afterwards, we can do a sort merge of these new string values and the existing leaf in order to create a new leaf. The sort merge is cheaper if we can reuse as much of the compressed data of the existing leaf as possible and thus do not have to decode and compress the leaf data again.

## 4. Cache-Conscious Indexes

### 4.1 CS-Array-Trie

General Idea

• Linked lists are known to be not very cache efficient on modern CPUs because of pointer-chasing. Moreover, most existing trie implementations decompose the indexed strings completely (i.e., each letter of a string results in a node of the trie).
• A node of the CS-Array-Trie uses an array instead of a linked list to store the characters of the indexed string values, which is efficient for bulk inserting, and searching.
• The second key idea of the CS-Array-Trie is that it does not decompose the string values completely. A leaf of the CS-Array-Trie stores the complete strings and not only their suffixes (without the common prefix) in order to enable efficient decoding of integer codes using the same leaves. Moreover, storing the complete strings in a leaf (i.e., repeating the same prefix) is still space efficient because we use incremental encoding to compress the strings.
• For those trie nodes that only hold a pointer to one child node, we use the path compression used in patricia tries.

Index Operations

• In order to leverage the fact of having bulk inserts, we propagate the string values (in pre-order) to the leaves using variable buffers at each node of the trie to increase the cache locality during lookup.
• In order to keep the order of strings in the lookup probe for creating the encoded result (at the bottom of Figure 5), a sequence number is added for each value in the buffers.
• When all string values are propagated to their corresponding leaves (i.e., the new strings are inserted into the leaves), new integer codes are generated for the new string values. This is done by analyzing the number of strings that are inserted in between two existing string values.
• Another task that can efficiently be supported using the CS-Array-Trie, is the predicate rewrite: for equality- and range-predicates the constants are simply propagated through the trie without buffering. For prefix-predicates, the prefix is used to find the minimal and maximal string value that matches this prefix.

Cost Analysis

Parallelization

• The propagation of string values from the root of the trie to the leaves (including the update of the nodes) can be easily parallelized for different sub-tries because sub-tries share no data.
• The generation of new integer codes can be done in parallel.
• The lookup operation of the new string values can also be parallelized without locking any data structures.

### 4.2 CS-Prefix-Tree

General Idea

• Same as the Prefix-B-Tree, a node of a CS-Prefix-Tree contains the shortest prefixes that enable the propagation of string values to the corresponding child nodes. However, instead of storing a pointer to each child, the CS-Prefix-Tree uses a contiguous block of memory for all nodes and offsets to navigate through this block (as the CSS-Tree).
• In order to further reduce the memory footprint of the CS-Prefix-Tree, we only store the offset to the first child node explicitly (since the nodes have a fixed size s).
• In order to enable fast search over the variable-length keys of a node (e.g., binary search), we store the offsets to the keys (i.e., the string prefixes) in an offset vector at the beginning of each node.

Index Operations

Cost Analysis

Parallelization

These sections are hard to read, thus I skip them.

## 5. Performance Experiments

### 5.1 Efficiency of Leaf Structure

• The leaf structure is optimal when having a medium size (∼ 512kB).
• The performance of our leaf structure is comparable to the read-optimized index structures mentioned above while using less memory.

### 5.2 Efficiency of Encoding Indexes

• While the CS-Prefix-Tree is more expensive to bulk load than the CS-Array-Trie, the workload which represents the no-update pattern is faster on the CS-Prefix-Tree because the CS-Prefix-Tree is (1) read-optimized and (2) not as high a the CS-Array-Trie.
• In general a workload that is more update-intensive (i.e., which has more new string values) has a better performance on the CS-Array-Trie.

### 5.3 Scalability of the Dictionary

• Dictionary compression in column-stores
• Cache-conscious indexes
• Indexing in main memory databases
Written on September 7, 2017