Cardinality Estimation Done Right: Index-Based Join Sampling

Reference: Leis, Viktor, et al. "Cardinality Estimation Done Right: Index-Based Join Sampling." CIDR. 2017.

0. Abstract

After four decades of research, today’s database systems still suffer from poor query execution plans. Bad plans are usually caused by poor cardinality estimates, which have been called the “Achilles Heel” of modern query optimizers. In this work we propose index-based join sampling, a novel cardinality estimation technique for main-memory databases that relies on sampling and existing index structures to obtain accurate estimates. Results on a real-world data set show that this approach significantly improves estimation as well as overall plan quality. The additional sampling effort is quite low and can be configured to match the desired application profile. The technique can be easily integrated into most systems.

1. Introduction

  • Cost-based query optimization is fundamentally based on cardinality estimation.
  • Virtually all industrial-strength systems estimate cardinalities by combining some fixed-size, per-attribute summary statistics (histograms) with strong assumptions (uniformity, independence, inclusion, ad hoc constants).
  • Since they try to approximate an arbitrarily large database in a constant amount of space (i.e., summary), estimators have a very hard time detecting complex patterns like join-crossing correlations.
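
To see why fixed statistics plus the independence assumption go wrong, consider a small illustration (mine, not the paper's): the textbook estimator multiplies per-predicate selectivities, which is only valid if the predicates are uncorrelated.

    # Hypothetical toy data: genre and production year are correlated.
    rows = [("sci-fi", 1977), ("sci-fi", 1980), ("drama", 2005), ("drama", 2010)]

    sel_genre = sum(g == "sci-fi" for g, y in rows) / len(rows)   # 0.5
    sel_year  = sum(y < 1990 for g, y in rows) / len(rows)        # 0.5

    # Independence assumption: multiply the selectivities.
    estimated = sel_genre * sel_year * len(rows)                  # 1.0 row
    actual    = sum(g == "sci-fi" and y < 1990 for g, y in rows)  # 2 rows

    print(estimated, actual)  # off by 2x after a single predicate pair;
                              # such errors compound multiplicatively across joins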

  • Although sampling can detect arbitrary correlations and therefore produces much more accurate estimates, few systems actually use sampling in production.
  • One reason is that in the past, when main memory capacity was small, sampling was too expensive to be practical due to the high cost of random disk I/O operations. Nowadays, many databases fully reside in RAM, making sampling much more practical.
  • A second reason is that even a relatively long sampling time may not be sufficient to estimate joins: joining independent samples of the base tables yields very few (often zero) matching tuples, so per-table samples alone say little about join sizes.

2. Background: Sampling-based Query Optimization

3. Index-based Join Sampling

3.1 The Index-Based Sampling Operator

  • To estimate the cardinalities of a query, we first compute random samples of constant, configurable size for each base relation in the query.
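
Beyond the base-table samples, the core operator extends a sample of an intermediate result by one join: each sampled tuple probes an existing index on the join attribute, the matching tuples form a sample of the larger intermediate result, and scaling the match count by the inverse sample fraction yields a cardinality estimate. Below is a minimal Python sketch under those assumptions; the hash-table index mirrors the paper's prototype, but the function names and tuple layout are illustrative, not the paper's API.

    import random

    def base_sample(table, n):
        """Draw a uniform random sample of up to n rows from a base relation."""
        return random.sample(table, min(n, len(table)))

    def sample_join_step(sample, fraction, index, key_of, max_size):
        """Extend a sample of an intermediate result by one join via an index.

        sample   : tuples sampled from the intermediate result
        fraction : |sample| / (size of the intermediate result it represents)
        index    : dict mapping join key -> list of matching rows (hash index)
        key_of   : function extracting the join key from a sample tuple
        max_size : cap on the size of the new sample
        """
        joined = [t + m for t in sample for m in index.get(key_of(t), [])]
        # Every sampled tuple found all of its join partners, so scaling the
        # match count by 1/fraction estimates the full join's cardinality.
        estimate = len(joined) / fraction if fraction else 0.0
        if len(joined) > max_size:                    # keep later steps cheap
            joined = random.sample(joined, max_size)
        new_fraction = len(joined) / estimate if estimate else 0.0
        return joined, new_fraction, estimate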

3.2 Greedy vs. Bottom-Up Enumeration

  • We do not sample cross products, as most join enumeration algorithms ignore them anyway.
  • The number of intermediate results grows exponentially with the number of joins in a query (e.g., in a clique-shaped query graph over n relations, every subset of two or more relations forms a connected intermediate result, 2^n - n - 1 in total).

Greedy Approach

  • ROX-style greedy approach: start with a small intermediate result and greedily extend it one relation at a time until all relations in the query graph are covered.
  • The advantage is that one quickly obtains accurate estimates for large intermediate results.
  • The disadvantage is that many small intermediate results are not sampled and thus have to be estimated using traditional estimation. It is well known that - due to the independence assumption - traditional estimators tend to underestimate result sizes. Therefore, when this mix of (accurate) sampling-based estimates and traditional (under-)estimates is injected into a query optimizer, it will often pick a plan based on the traditional estimates (as they appear to be very cheap). This phenomenon has been called "fleeing from knowledge to ignorance" and - paradoxically - causes additional, accurate information to decrease plan quality.

Bottom-Up Enumeration

  • We compute intermediate results in a bottom-up fashion, i.e., we first sample all 2-way joins, then all 3-way joins, and so on until we run out of budget.

3.3 Enumeration Algorithm
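
The paper gives the full pseudocode; the following is a minimal sketch of the budget-constrained, bottom-up enumeration described in 3.2, reusing the sample_join_step helper from the sketch in 3.1. The query-graph representation and the probe-based budget accounting are simplifying assumptions.

    def enumerate_bottom_up(base_samples, joins, budget, max_sample=1000):
        """Sample intermediate results level by level until the budget runs out.

        base_samples : {rel: (sample, fraction)} per-relation samples (3.1)
        joins        : list of (left_rel, right_rel, key_of, index) edges of
                       the query graph with an index on right_rel's join column
        budget       : total number of index probes we are willing to spend
        """
        estimates = {}                            # frozenset(rels) -> estimate
        level = {frozenset([r]): s for r, s in base_samples.items()}
        while level and budget > 0:
            next_level = {}
            for rels, (sample, fraction) in level.items():
                for left, right, key_of, index in joins:
                    # Extend connected subgraphs by one new relation only;
                    # cross products are never sampled.
                    if left not in rels or right in rels:
                        continue
                    if budget < len(sample):      # budget exhausted: stop here
                        return estimates
                    budget -= len(sample)         # one index probe per tuple
                    joined, frac, est = sample_join_step(
                        sample, fraction, index, key_of, max_sample)
                    estimates[rels | {right}] = est
                    next_level[rels | {right}] = (joined, frac)
            level = next_level                    # proceed to (k+1)-way joins
        return estimates

Intermediate results the loop never reaches (because the budget ran out, or because a sample became empty) keep their traditional estimates in the real system.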

3.4 Database System Integration

  • We inject the estimates computed by index-based sampling into the existing cardinality estimation component of the query optimizer.
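
A minimal sketch of this injection, assuming a hypothetical estimator interface: sampled cardinalities are kept in a table keyed by the set of relations, and the optimizer falls back to its traditional, formula-based estimator for intermediate results that were never sampled.

    class InjectedEstimator:
        """Cardinality estimator that prefers sampling-based estimates."""

        def __init__(self, sampled, traditional):
            self.sampled = sampled          # {frozenset(relations): cardinality}
            self.traditional = traditional  # fallback, e.g. histogram-based

        def estimate(self, relations):
            key = frozenset(relations)
            if key in self.sampled:         # accurate, sampling-based estimate
                return self.sampled[key]
            return self.traditional(relations)

    # During plan enumeration, the optimizer requests cardinalities as usual;
    # injected values transparently override the traditional ones (toy values):
    est = InjectedEstimator({frozenset({"R", "S"}): 42_000.0},
                            lambda rels: 100.0)
    print(est.estimate({"R", "S"}))   # 42000.0, from index-based sampling
    print(est.estimate({"R", "T"}))   # 100.0, traditional fallback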

4. Evaluation

4.1 Experimental Setup

  • In prior work, we have shown that cardinality estimation for synthetic benchmarks like TPC-H is unrealistically easy; we therefore use the Join Order Benchmark (JOB), which is based on the Internet Movie Database (IMDb).
  • The experiments were performed in a single-threaded in-memory prototype, developed for query optimization experimentation, that is similar to MonetDB (column-wise storage, full materialization after each operator). For simplicity, indexes are implemented as hash tables.
  • For most experiments, we create indexes on all primary key and foreign key columns, which is the worst case in terms of relative plan quality (more indexes make it harder to find the optimal plan) and sampling overhead (more indexes allow for more sampling steps).

  • By default, cardinalities are estimated using index-based sampling and a sample size of 1,000.
  • As a competitor, we implemented sampling-based query re-optimization.
  • The quality of cardinality estimates in most commercial systems is similar to PostgreSQL's, so we use PostgreSQL as the baseline for cardinality estimation.
  • We also experimented with ROX-style join ordering.

4.2 Does Sampling Improve Estimation?

  • Two major trends: (1) with an increasing number of joins, the errors grow exponentially; (2) underestimation is much more common than overestimation and becomes more pronounced with each additional join.
  • Index-based join sampling does much better than PostgreSQL, in particular with a higher budget. Underestimation occurs later because smaller intermediate results are estimated accurately using sampling.

4.3 How Expensive is Sampling?

  • Even without a budget limit, for queries with up to 8 joins, the sampling overhead is quite low (less than 20 ms).
  • The sampling overhead of sampling-based re-optimization is generally much larger.

4.4 Does Sampling Improve Plan Quality?

  • Some plans are actually faster (up to a factor of 3) with inaccurate estimates than with the true cardinalities. This effect is caused by cost model errors rather than inaccurate cardinalities and explains the hesitation of many commercial database systems to change their query optimizers.
  • The ROX-style greedy algorithm performs worse than PostgreSQL, mostly because the situation in the JOB benchmark appears to be quite different from the setting ROX was originally designed for.

4.5 What If There are Few Indexes?

  • Index-based sampling improves plan quality across all configurations, but especially in those cases where plan quality is worst - namely, when multiple indexes are available.

5. Related Work

  • Detecting cardinality estimation errors at runtime: the LEO project
  • Adaptive query processing: Eddies, Plan Bouquet, SpillBound
  • A number of techniques have been proposed that strive to detect optimizer mistakes using limited runtime adaptivity.
  • Designing the query operators in such a way that they do not rely on cardinality estimates.
Written on February 2, 2018