# The Design and Implementation of Modern Column-Oriented Database Systems

Reference: Abadi, Daniel, et al. "The design and implementation of modern column-oriented database systems." Foundations and Trends® in Databases 5.3 (2013): 197-280.

## 0. Abstract

In this article, we survey recent research on column-oriented database systems, or column-stores, where each attribute of a table is stored in a separate file or region on storage. Such databases have seen a resurgence in recent years with a rise in interest in analytic queries that perform scans and aggregates over large portions of a few columns of a table. The main advantage of a column-store is that it can access just the columns needed to answer such queries. We specifically focus on three influential research prototypes, MonetDB, VectorWise, and C-Store. These systems have formed the basis for several well-known commercial column-store implementations. We describe their similarities and di erences and discuss their specific architectural features for compression, late materialization, join processing, vectorization and adaptive indexing (database cracking).

## 1. Introduction

### 1.1 Data Layout and Access Patterns

Since data transfer costs from storage(or through a storage hierarchy) are often the major performance bottlenecks in database systems, while at the same time database schemas are becoming more and more complex with fat tables with hundreds of attributes being common, a column-store is likely to be much more efficient at executing queries, that touch only a subset of a table’s attributes.

There are tradeoffs depending on the access patterns in the workload that dictate whether a column-oriented or a row-oriented physical is better(e.g., seek time, transfer time).

### 1.3 Column-store Architectures

• Virtual IDs: modern column-stores avoid storing a tuple identifier(e.g., a numeric primary key) with every column by using the position(offset) of the tuple in the column as a virtual identifier. Relying on fixed-width columns greatly simplifies locating a record and avoid storing IDs, but may not be compatible with most compression algorithms.
• Block-oriented and vectorized processing: (1) by passing cache-line sized blocks of tuples between operators, and operating on multiple values at a time, rather using a conventional tuple-at-a-time iterator, column-stores can achieve substantially better cache utilization and CPU efficiency; (2) the use of vectorized CPU instructions for selections, expressions, and other types of arithmetic on these blocks of values can further improve throughput.
• Late materialization: late materialization or late tuple reconstruction refers to delaying the joining of columns into wider tuples. In this way, late materialization means that column-stores not only store data one column-at-a-time, they also process data in a columnar format.
• Column-specific compression: (1) by compressing each column using a compression method that is most effective for it, substantial reductions in the total size of data on disk can be achieved; (2) by storing data from the same attribute (column) together, column-stores can obtain good compression ratios using simple compression schemes.
• Direct operation on compressed data: many modern column-stores delay decompressing data until it is absolutely necessary, ideally until results need to be presented to the user.
• Efficient join implementation: because columns are stored separately, join strategies similar to classic semi-joins are possible.
• Redundant representation of individual columns in different sort orders: (1) columns that are sorted according to a particular attribute can be filtered much more quickly on that attribute; (2) by storing several copies of each column sorted by attributes heavily used in an application’s query workload, substantial performance gains can be achieved; (3) low-cardinality data that is stored in sorted order can be aggressively compressed.
• Database cracking and adaptive indexing: a column-store with cracking can adaptively and incrementally sort(index) columns as a side-effect of query processing. Each query partially reorganizes the columns it touches to allow future queries to access data faster. Fixed-width allow for efficient physical reorganization, while vector processing means that we can reorganize whole blocks of columns efficiently in one go.
• Efficient loading architectures: one concern with column-stores is that they may be slower to load and update than row-store, because each column must be written separately, and because data is kept compressed. To optimize loading, for example, in the C-Store system, data is first written into an uncompressed, write-optimized buffer, and then flushed periodically in large, compressed batches to write and compress many records at a time.

### 1.4 Performance Example

SSBM is a simplified version of the TPC-H data warehousing benchmark. One reason that the unoptimized column-store does not do particularly well is that the SSBM benchmark uses relatively narrow tables. In most real-world data-warehouses, the ratio of columns-read to table-width would be much smaller, so these advantages would be much more pronounced.

### 2.1 History

• N-ary Storage Model(NSM)
• Decomposition Storage Model(DSM)
• At its core, the basic design of a relational database management system has remained to date very close to systems developed in the 1980s.
• Two metrics for disk capacity growth and disk transfer: (1) the transfer bandwidth per available byte(assuming the entire disk in used), which has been reduced over the years by two orders of magnitude; (2) the ratio of sequential access speed over random access speed, which has increased one order of magnitude. These two metrics clearly show that DBMSs need to not only avoid random disk IOs whenever possible, but, most importantly, preserve disk bandwidth.
• In order for a column-based(DSM) storage scheme to outperform row-based(NSM) storage, it needed to have a fast mechanism for reconstructing tuples(since the rest of the DBMS would still operate on rows) and it also needed to be able to amortize the cost of disk seeks when accessing multiple columns on disk. Faster CPUs would eventually enable the former and larger memories(for buffering purposes) would allow the latter.

• Over time, worse-case scenarios for column-stores(projectivity close to 100%) case increasingly closer to row-store performance.
• When it comes to solid state storage(such as Flash SSDs), column-oriented storage was shown to never be worse than row storage, and in some cases where selective predicates were used, it outperformed row storage for any projectivity.
• If selectivity is high, then column-stores can minimize the amount of intermediate results they create which otherwise represents a significant overhead.

## 3. Column-store Architectures

### 3.1 C-Store

1. The primary representation of data on disk is as a set of column files. Each column-file contains data from one column, compressed using a column-specific compression method, and sorted according to some attribute in the table that the column belongs to. This collection of files in known as the “read optimized store”(ROS).
2. Newly loaded data is stored in a write-optimized store(“WOS”), where data is uncompressed and not vertically partitioned. The WOS enables efficient loading of data, and amortizes the cost of compression and seeking.
3. Periodically, data is moved from the WOS into the ROS via a background “tuple mover” process, which sorts, compresses, and writes re-organized data to disk in a columnar form.
• Each column in C-Store may be stored several times in several different orders. Group of columns sorted on the same attribute are referred to as “projections”. Typically there is at least one projection containing all columns.
• Each column in C-Store is compressed and the choice of a compression method for each column depends on: (1) whether the column is sorted or not; (2) the data type; (3) the number of distinct values in the column.
• C-Store does not support secondary indexes on tables, but support efficient indexing into sorted projections through the use of sparse indexes. A sparse index is a small tree-based index that stores the first value contained on each physical page of a column and a similar sparse index is stored on tuple position.
• C-Store uses a “no-overwrite” storage representation, where updates are treated as deletes followed by inserts, and deletes are processed by storing a special “delete column” that records the time every tuple was deleted(if ever).
1. Query Execution in C-Store involves accessing data from both the ROS and WOS and unioning the results together.
2. Queries are run as of a specific time, which is used to filter out deleted tuples from the delete column. This allows queries to be run as of some time in the past.
3. Queries that modify the database are run using traditional two-phase locking. If read-only queries are tolerant of reading slightly stale data they can be run without setting locks by executing them in the very recent past.
4. C-Store’s query executor utilizes a number of advanced execution techniques, including late materialization, various column-oriented join techniques, and batch processing.
• C-Store was conceived as a shared-nothing massively parallel distributed database system, while the idea behind the parallel design of C-Store is that projections are horizontally partitioned across multiple nodes using hashing- or range-partitioning, and queries are pushed down and executed as much as possible on each node, with partial answers merged to produce a final answer at the output node.

### 3.2 MonetDB and VectorWise

#### 3.2.1 MonetDB

MonetDB differs from traditional RDBMS architecture in many aspects:

• Execution engine, which uses a column at-a-time-algebra.
• Processing algorithms, that minimize CPU cache misses rather than IOs
• Indexing, which is not a DBA task but happens as a by-product of query execution, i.e., database cracking
• Query optimization, which is done at run-time, during query incremental execution
• Transaction management, which is implemented using explicit additional tables and algebraic operations, so read-only workloads can omit these and avoid all transactions overhead.

Column-at-a-Time

• CPU-Friendly: MonetDB expressed its calculations typically in tight loops over fixed-width and dense arrays(i.e., columns), which is well-supported by compiler technology such as strength reduction(replacing an operation with an equivalent less costly operation), array blocking(grouping subsets of an array to increase cache locality), and loop pipelines(mapping loops into optimized pipeline executions), such that eliminates tuple-at-a-time iterator function calls, by employing column-at-a-time primitives.
• BAT(Binary Association Table): the column-at-a-time processing is realized through the BAT Algebra, which offers operations that work only on a handful of BATs, and produce new BATs. BAT refers to a two-column <surrogate, value> table as proposed in DSM, while the surrogate is a Virtual ID which is the array index of the column and is not materialized. MonetDB hence takes late tuple materialization to the extreme.

Execution

• The philosophy behind the BAT Algebra can also be paraphrased as “the RISC approach to database query languages”: by making the algebra simple, the opportunities are created for implementations that execute the common case very fast.
• To handle updates, MonetDB uses a collection of pending updates columns for each base column in a database. Every query on-the-fly merges updates by reading data both from the base columns and from the pending update columns. Periodically, pending columns are merged with their base columns.

#### 3.2.2 VectorWise

VectorWise was aimed to address many problems of MonetDB: (1) uncompressed and direct access to memory mappings; (2) absence of a buffer manager; (3) full materialization of intermediate results.

### 3.3 Other Implementations

• Columnar Storage Only: storing data one column-at-a-time on disk but relies on a standard row-store execution engine to process queries.
• Native Column-Store Designs: adopt the full column-store model, providing both a columnar storage layer and an execution engine which is tailored for operating on one column-at-a-time with late tuple reconstruction. IBM BLU employed frequency partitioning in order to reorganize columns with the intention to minimize the variability in each data page.

## 4. Column-store Internals and Advanced Techniques

### 4.1 Vectorized Processing

• Two strategies for query execution layer: (1) “Volcano-style” iterator model(tuple-at-a-time); (2) full materialization.
• The typical size for the vectors used in vectorized processing is such that each vector comfortably fits in L1 cache as this minimizes reads and writes through the memory hierarchy.
• While vectorized execution in its current form, was developed and analyzed in the column storage context, the principle can also be applied to row stores as it is not tied to the storage layout.
• Typically, sequential operators(project, selection) work best on vertical(column) vectors(exploiting automatic memory prefetching and SIMD opportunities), whereas random access operator(hash-join or -aggregation) work best using blocks of horizontal(record) records due to cache locality.

• Reduced interpretation overhead: the amount of function calls goes down by by a factor equal to the vector size compared to the tuple-at-a-time model, especially for computationally intensive queries.
• Better cache locality: (1) for CPU cache, all vectors needed for evaluating a query together comfortably fit in the CPU cache; (2) for instruction cache, control now stays for as many iterations as the vector size in the same primitive function, thus creating instruction locality.
• Compiler optimization opportunities: (1) vector primitives which typically perform a tight loop over arrays, are amenable to some of the most productive compiler optimizations, and typically also triggers compilers to generate SIMD instructions.
• Block algorithms: the fact that data processing algorithms now process N tuples, often gives rise to logical algorithm optimizations(e.g., checking for some conditions like output buffer full).
• Parallel memory access: algorithms that perform memory accesses in a tight vectorized loop on modern CPUs are able to generate multiple outstanding cache misses for different values in a vector, since when a cache miss occurs, modern CPUs can speculate ahead in such tight loops. By incurring cache-misses, in which case code that through out-of-order speculation generates multiple parallel misses often performs four times faster than non-vectorized memory lookups.
• Profiling: the overhead of keeping performance profiling measurements for each individual vectorized operation is low, allowing providing highly detailed performance insight into where CPU cycles are spent.
• Adaptive execution: performance profile information on vectorized primitives can also be exploited at run-time, during the execution of a query. For example, Vectorwise decides adaptively in case of arithmetic operations on vectors where only a subset of the values in the arrays is selected by some predicate, whether to compute the result only for the selected tuples iteratively, or for all tuples in the array. The latter strategy, while performing extra work, leads to a tight loop without if-then-else, where SIMD instructions can be used, making it overall faster as long as the percentage of selected tuples is relatively high.

### 4.2 Compression

• Compressing one column-at-a-time: (1) compression algorithms may be able to compress more data with the same common patterns as more data of the same type fit in a single page when storing data of just one attribute; (2) more similar data implies that in general the data structures, codes, etc. used for compression will be smaller and thus this leads to better compression; (3) if the data is sorted by one of the columns, which is common with column-store projections, that column will be super-compressible (for example, runs of the same value can be run-length encoded).
• Exploiting extra CPU cycles: (1) if data is compressed, then less time is spent in I/O during query processing; (2) CPUs are getting much faster compared to memory bandwidth, the cost of accessing data costs more in terms of CPU cycles than it did in the past, which means that now we prefer sparing in decompressing compressed data fast to transferring uncompressed and thus bigger data at slow speeds through the memory hierarchy.
• Fixed-width arrays and SIMD: light-weight compression schemes that compress a column into mostly fixed-width (smaller) values (with exceptions handled carefully) are often preferred.

Techniques

• Frequency partitioning: IBM BLINK reorganizes each column based on the frequency of values that appear in the column, i.e., frequent values are stored together in the same page(s). This allows the system to use compact per page dictionaries dictionary compression, requiring fewer codes which can in turn be stored with fewer bits.
• Compression schemes: run-length encoding, bit-vector encoding, dictionary com- pression and patching.

#### 4.2.1 Run-length Encoding

• Run-length encoding(RLE) compresses runs of the same value in a column to a compact singular representation. Thus, it is well-suited for columns that are sorted or that have reasonable-sized runs of the same value. These runs are replaced with triples: (value, start position, runLength) where each element of the triple is typically given a fixed number of bits.
• Tradeoff: Given that RLE replaces arbitrary blocks of values with a single triple at a time, it results in variable width and variable length columns. This implies that we cannot use the kind of operators described previously for fixed-width columns as well as that tuple reconstruction becomes a little bit more complicated.

#### 4.2.2 Bit-Vector Encoding

• Bit-vector encoding is most useful when columns have a limited number of possible data values.
• In this type of encoding, a bit-string(whose number of bits is equal to the size of the column) is associated with each possible unique element from a column’s domain, with a 1 in the $i^{th}$ position in the bitstring if the $i^{th}$ value in the column is equal to the domain element that the bitstring is associated with, and a 0 otherwise.

#### 4.2.3 Dictionary

• Dictionary encoding works well for distribution with a few very frequent values, and can also be applied to strings.
• One benefit of dictionary compression is that it can result in fixed width columns if the system chooses all codes to be of the same width.

#### 4.2.4 Frame of Reference(FOR)

• If the column distribution has value locality, one may represent it as some constant base plus a value, the value then is a small integer(which takes fewer bits to store than larger integers).
• The physical representation of a block of FOR values is the base followed by one small integer for each tuple.

#### 4.2.5 The Patching Technique

• If the frequency of the distribution is skewed, then we can still compress the data if the compression is done only for the most frequent values, while others are called exception values.
• For tuples encoded as exception values, then compressed code would be a special escape. Checking for this escape with an if-then-else, constitutes a difficult to predict branch in the very kernel of the algorithm, which does not run well on modern CPUs(branch mis-predictions).
• The patching technique, uses these to maintain a linked list. Decompression first compresses all codes regardless exceptions; in a second step, the linked list is traversed and the exception values are “patched into” the decompressed output.

### 4.3 Operating Directly on Compressed Data

• Operating directly on compressed data requires modifications to the query execution engine, since query operators must be aware of how data is compressed and adjust the way they process data accordingly.
• One solution to this problem is to abstract the general properties of compression algorithm in order to facilitate the their direct operations so that operators only have to be concerned with these properties, allowing adding new compression algorithms without adjustments to the query execution engine code.
• In practice, this is done by adding a component to the query executor that encapsulates an intermediate representation for compressed data called a compression block, which contains a buffer of column data in compressed format and provides an API that allows the buffer to be accessed by query operators in several ways.
• Results from experiments in the literature show that compression not only saves space, but significantly improves performance. However, without operation on compressed data, it is rare to get more than a factor of three improvement in performance.
• Once the query execution engine is extended with extensible compression-aware techniques, it is possible to obtain more than an order of magnitude improvement in performance, especially on columns that are sorted or have some order to them.

### 4.4 Late Materialization

• Join-like materialization of tuples(also called tuple reconstruction): (1) most queries access more than one attributes from a particular entity; (2) most database output standard(e.g., ODBC and JDBC) access database results entity-at-a-time.
• Late materialization means that we always operate on individual columns and maintain position lists containing qualifying tuples.
• Every time we fetch the values of a column given a position list(which is the result of a previous operator), we say that we perform a tuple reconstruction action. Such actions have to be performed multiple times within a query plan, i.e., at least N - 1 times for each table, where N is the number of attributes of a given table referenced in a query.
• Given that each projection in C-Store is sorted by one leading column, tuple reconstruction becomes a more lightweight action.

• Selection and aggregation operators tend to render the construction of some tuples unnecessary. Therefore, if the executor waits long enough before constructing a tuple, it might be able to avoid the overhead of constructing it altogether.
• If data is compressed using a column-oriented compression method(that potentially allow compressions symbols to span more than one value within a column, such as RLE), it must be decompressed during tuple reconstruction, which removes the advantages of operating directly on compressed data.
• Cache performance is improved when operating directly on column data, since a given cache line is not polluted with surrounding irrelevant attributes for a given operation.
• The vectorized optimizations described above have a higher impact on performance for fixed-length attributes. In a late materialized column-store, fixed-width columns can be operated on separately.

Despite all these advantages, late materialization can sometimes be slower than early materialization, e.g., a predicate that is not restrictive on many attributes within a tuple.

Multi-column Blocks

• The high level idea is to store data in groups of columns, which is called multi-column blocks(or vector blocks, or column-groups).
• Multi-column blocks allow predicates to be applied to the compressed representation of each column involved in a query predicate separately.
• Although multi-column blocks do not eliminate the need for intersecting positions, only small subsets of each column are operated on at once, allowing the pipelining of predicate evaluation output(position lists) directly to the intersection operator, and therefore enabling the construction of tuples to occur while the values that are going to be stitched together are still in cache.
• One drawback of multi-column blocks is that one needs to make such decisions a priori, i.e., to decide which columns will be grouped together at loading time.

### 4.5 Joins

The output of join is a set of pairs of positions in the two input relations for which the predicate succeeded. For example, the figure below shows the results of a join of a column of size 5 with a column of size 4(left input in order and right input out of order):

• There is at least one set of output positions will not be sorted, which is problematic since typically after the join, other columns from the joined tables will be needed(causing significant slowdown since most storage devices have much slower random access than sequential).

• Jive join: sort by positions, scan data then revert, which allows all columns to be iterated through sequentially, at the cost of adding two sorts of the join output data(can be addressed with efficient external sort algorithm).
• Radix join: Actually, the database does not need to completely sort the position list before using to extract values from columns; rather, it just needs to be partitioned into the blocks on storage(or an approximation thereof) in which those positions can be found. Within each partition, the positions can remain unordered, since random access within a storage block is much cheaper(e.g., the difference between memory and disk I/O, or the difference between cache and memory I/O).
• Hybrid materialization: for the right(inner) table, instead of sending only the column(s) which compose the join predicate, all relevant columns are materialized before the join and input to the join operator, while the left(outer) relation sends only the single join predicate column. For join algorithms that iterate through both input relations out of order, both relations are materialized before the join.
• Multi-column blocks: instead of materializing the tuples of inner table, the relevant set of columns are input tot he join operator in a sequence of multi-column blocks. As inner table values match the join predicate, the position of the value is used to retrieve the values for other columns(within the same block) and tuples are constructed on the fly. This technique is useful when the join selectivity is low and few tuples need to be constructed, but is otherwise expensive, since it potentially requires a particular tuple from the inner relation to be constructed multiple times.

### 4.6 Group-by, Aggregation and Arithmetic Operations

• Group-by is typically a hash-table based operation in modern column-stores and thus it exploits similar properties as discussed in the previous section.
• Aggregation operations make heavy use of the columnar layout, in particular, they can work on only the relevant column with tight for-loops.
• Arithmetic operations can use vectorization to help in minimizing the memory footprint of intermediate results at any given time, and it has been shown that it may also be beneficial to on-the-fly transform intermediate results into column-groups in order to work with (vectors of) multiple columns, avoiding materialization of intermediate results completely.

Problems

• Inherently, column-stores are more sensitive to updates compared to row-stores. By storing each column separately in a separate file, this means that each record/tuple of a relational table is stored in more than one files, i.e., as many files as the number of attributes in the table(in order to perform even a single update action on a single row, we need multiple IO actions to update all files).
• Column-stores in addition to fragmentation make heavy use of compression, and may also store multiple table replicas or projections in different value orders, all to enhance analytical query performance. Even if a user wants to insert many tuples at once, these disk I/Os are scattered (random) I/Os, because of the ordered or clustered table storage.
• Compression makes updates computationally more expensive and complex and extra complications occur if the updated data no longer fits the original location.

Solutions

• C-Store and MonetDB handle updates by splitting their architecture into a “read-store” that manages the bulk of all data and a “write-store” that manages updates that have been made recently. Consequently, all queries merge base table information from the read-store and all corresponding differences from the write-store on-the-fly. In order to keep the write-store small(if resides typically in RAM), changes in it are periodically propagated into the read-store.
• In-memory structure for the write-store: (1) MonetDB uses plain columns; (2) C-Store uses a row-format which speeds up updates even more as only one I/O is needed to write a single new row(but merging of updates in the column format becomes potentially more expensive); (3) VectorWise uses a novel data structure, called Positional Delta Trees(PDTs) to store differences.
• Differential data structures such as PDTs, but also previous approaches like differential files, can be layered: one can create deltas on deltas on deltas, etc. For example, placing very small deltas in the CPU cache, larger ones in RAM, and huge deltas on disk or on solid state memory. Additionally, layered deltas are a tool for implementing isolation and transaction management.

### 4.8 Indexing, Adaptive Indexing and Database Cracking

#### 4.8.1 Indexing

• C-Store proposed the concept of projections, i.e., to replicate each table multiple times and each replica may be ordered by a different attribute. In addition, each replica does not necessarily have to contain all of the table’s attributes. Given that columns compress very well, materializing these extra projections does not bring a significant storage overhead when compared to a traditional row-store system. The amount and the kinds of projections needed depends on the workloads and having extra projections bring an overhead for updates.
• Another form of indexing which is typically used in column-stores are zonemaps, i.e., to store light-weight metadata on a per page basis, e.g., min/max. Other attractive ideas include the use of cache conscious bitmap indexing which create a bitmap for each zone as opposed to having simply min/max information.

### 4.8.2 Database Cracking and Adaptive Indexing

• The main motivation is that the physical data store is continuously changing with each incoming query q, using q as a hint on how data should be stored. Cracking brings drastic improvements in column-store performance.
• The terminology “cracking” reflects the fact that the database is partitioned(cracked) into smaller and manageable pieces.
• The database cracking processes queries and refined indexes at the same time, gaining both an immediate and a long term performance benefit. In particular, bulk processing and columnar storage enabled these adaptive indexing ideas.

### 4.9 Summary and Design Principles Taxonomy

• As it is evident by the plethora of those features, modern column stores go beyond simply storing data one column-at-a-time; they provide a completely new database architecture and execution engine tailored for modern hardware and data analytics.

## 5. Discussion, Conclusions, and Future Directions

### 5.1 Comparing MonetDB/VectorWise/C-Store

• Block-oriented execution