How Good Are Query Optimizers, Really?Reference: Leis, Viktor, et al. "How good are query optimizers, really?." Proceedings of the VLDB Endowment 9.3 (2015): 204-215.
This is one of the several papers belong to suggested readings for Cost Models of CMU 15-721: Database Systems.
Finding a good join order is crucial for query performance. In this paper, we introduce the Join Order Benchmark (JOB) and experimentally revisit the main components in the classic query optimizer architecture using a complex, real-world data set and realistic multi-join queries. We investigate the quality of industrial-strength cardinality estimators and find that all estimators routinely produce large errors. We further show that while estimates are essential for finding a good join order, query performance is unsatisfactory if the query engine relies too heavily on these estimates. Using another set of experiments that measure the impact of the cost model, we find that it has much less influence on query performance than the cardinality estimates. Finally, we investigate plan enumeration techniques comparing exhaustive dynamic programming with heuristic algorithms and find that exhaustive enumeration improves performance despite the sub-optimal cardinality estimates.
In this experiments and analyses paper we investigate the three main components of the classical query optimization architecture in order to answer the following questions:
- How good are cardinality estimators and when do bad estimates lead to slow queries?
- How important is an accurate cost model for the overall query optimization process?
- How large does the enumerated plan space need to be?
- We design a challenging workload named Join Order Benchmark (JOB), which is based on the IMDB data set. The benchmark is publicly available to facilitate further research.
- To the best of our knowledge, this paper presents the first end-to-end study of the join ordering problem using a real-world data set and realistic queries.
- By quantifying the contributions of cardinality estimation, the cost model, and the plan enumeration algorithm on query performance, we provide guidelines for the complete design of a query optimizer. We also show that many disastrous plans can easily be avoided.
2. Background and Methodology
2.1 The IMDB Dataset
- While TPC-H, TPC-DS, or the Star Schema Benchmark (SSB) have proven their value for evaluating query engines, we argue that they are not good benchmarks for the cardinality estimation component of query optimizers. The reason is that in order to easily be able to scale the benchmark data, the data generators are using the very same simplifying assumptions (uniformity, independence, principle of inclusion) that query optimizers make. Real-world data sets, in contrast, are full of correlations and non-uniform data distributions, which makes cardinality estimation much harder.
- We chose the Internet Movie Data Base (IMDB). It contains a plethora of information about movies and related facts about actors, directors, production companies, etc. Data, Processing Code
2.2 The JOB Queries
- We propose JOB for future research in cardinality estimation and query optimization. The query set is available online.
- The cardinalities of base tables are estimated using histograms (quantile statistics), most common values with their frequencies, and domain cardinalities (distinct value counts). These per-attribute statistics are computed by the analyze command using a sample of the relation.
- PostgreSQL’s cardinality estimator is based on the following assumptions: (1) uniformity: all values, except for the most frequent ones, are assumed to have the same number of tuples; (2) independence: predicates on attributes (in the same table or from joined tables) are independent; (3) principle of inclusion: the domains of the join keys overlap such that the keys from the smaller domain have matches in the larger domain.
- The query engine of PostgreSQL takes a physical operator plan and executes it using Volcano-style interpretation.
2.4 Cardinality Extraction and Injection
- We use these estimates of different systems to obtain optimal query plans (w.r.t. respective systems) and run these plans in PostgreSQL.
2.5 Experimental Setup
- PostgreSQL configurations
3. Cardinality Estimation
- Cardinality estimates are the most important ingredient for finding a good query plan.
3.1 Estimates for Base Tables
- To measure the quality of base table cardinality estimates, we use the q-error, which is the factor by which an estimate differs from the true cardinality.
- We found that DBMS A and HyPer can usually predict even complex predicates like substring search using LIKE very well.
- To estimate the selectivities for base tables HyPer uses a random sample of 1000 rows per table and applies the predicates on that sample.
- When we looked at the selections where DBMS A and HyPer produce errors above 2, we found that most of them have predicates with extremely low true selectivities (e.g., or ). This routinely happens when the selection yields zero tuples on the sample, and the system falls back on an ad-hoc estimation method (“magic constants”).
- The estimates of the other systems are worse and seem to be based on per-attribute histograms, which do not work well for many predicates and cannot detect (anti-)correlations between attributes.
3.2 Estimates for Joins
- For all systems we routinely observe mis-estimates by a factor of 1000 or more. Furthermore, as witnessed by the increasing height of the box plots, the errors grow exponentially as the number of joins increases.
- All tested systems - though DBMS A to a lesser degree - tend to systematically underestimate the results sizes of queries with multiple joins.
- We speculate that DBMS A uses a damping factor that depends on the join size, similar to how many optimizers combine multiple selectivities. Many estimators combine the selectivities of multiple predicates (e.g., for a base relation or for a subexpression with multiple joins) not by assuming full independence, but by adjusting the selectivities “upwards”, using a damping factor.
- No system tested was able to detect join-crossing correlations.
- The results demonstrate that the state-of-the-art in cardinality estimation is far from perfect.
3.3 Estimates for TPC-H
- The TPC-H query workload does not present many hard challenges for cardinality estimators.
- In contrast, our workload contains queries that routinely lead to severe overestimation and underestimation errors, and hence can be considered a challenging benchmark for cardinality estimation.
3.4 Better Statistics for PostgreSQL
- To determine if the mis-estimated distinct counts are the underlying problem for cardinality estimation, we computed these values precisely and replaced the estimated with the true values. However, the trend to underestimate cardinalities becomes even more pronounced.
4. When Do Bad Cardinality Estimates Lead To Slow Queries?
- Query optimization is closely intertwined with the physical database design: the type and number of indexes heavily influence the plan search space, and therefore affects how sensitive the system is to cardinality mis-estimates.
4.1 The Risk of Relying on Estimates
- The vast majority of the queries are slower when estimates are used.
- we found that most queries that did not finish in a reasonable time using the estimates have one thing in common: PostgreSQL’s optimizer decides to introduce a nested-loop join (without an index lookup) because of a very low cardinality estimate, whereas in reality the true cardinality is larger. The underlying reason why PostgreSQL chooses nested-loop joins is that it picks the join algorithm on a purely cost-based basis.
- We disabled nested-loop joins in all following experiments, we observed no more timeouts despite using PostgreSQL’s estimates, and none of the queries performed slower than before despite having less join algorithm options, confirming our hypothesis that nested-loop joins (without indexes) seldom have any upside.
- Not taking into account the inherent uncertainty of cardinality estimates and the asymptotic complexities of different algorithm choices, can lead to very bad query plans. Algorithms that seldom offer a large benefit over more robust algorithms should not be chosen.
- Query processing algorithms should, if possible, automatically determine their parameters at runtime instead of relying on cardinality estimates.
4.2 Good Plans Despite Bad Cardinalities
- The database has only primary key indexes, as in all in experiments so far, and once nested loop joins have been disabled and rehashing has been enabled, the performance of most queries is close to the one obtained using the true cardinalities.
4.3 Complex Access Paths
- Indeed overall performance generally increases significantly, but the more indexes are available the harder the job of the query optimizer becomes.
4.4 Join-Crossing Relations
- Join-crossing correlations: queries where correlated predicates involve columns from different tables, connected by joins. Such correlations frequently occur in the IMDB data set, e.g., actors born in Paris are likely to play in French movies.
- The effects of query optimization are always gated by the available options in terms of access paths.
5. Cost Models
5.1 The PostgreSQL Cost Model
- PostgreSQL’s disk-oriented cost model combines CPU and I/O costs with certain weights.
5.2 Cost and Runtime
- Only using the true cardinalities makes the PostgreSQL cost model a reliable predictor of the runtime.
- Using the default cost model of PostgreSQL and the true cardinalities, the median error of the cost model is 38%.
5.3 Tuning the Cost Model for Main Memory
- In many settings the default values for CPU cost parameters and I/O cost parameters are sub optimal.
- Parameter tuning improvement is still overshadowed by the difference between the estimated and the true cardinalities.
- We observe that tuning improves the predictive power of the cost model: the median error decreases from 38% to 30%.
5.4 Are Complex Cost Models Necessary?
- In order to find out whether the complexity of cost models is actually necessary in a main-memory setting, we contrast it with a very simple cost function.
- We see that even our trivial cost model is able to fairly accurately predict the query runtime using the true cardinalities, this allows us to reiterate our main message that cardinality estimation is much more crucial than the cost model.
6. Plan Space
- We use the in-memory cost model from Section 5.4 and assume that it perfectly predicts the query runtime, giving us a very good approximation of the runtime the plan would have in reality, such that we could compare a large number of plans without executing all of them.
6.1 How Important Is the Join Order?
- We use the Quickpick algorithm to visualize the costs of different join orders. Quickpick is a simple, randomized algorithm that picks joins edges at random until all joined relations are fully connected.
- Join order matters: the slowest or even median cost is generally multiple orders of magnitude more expensive than the cheapest plan.
6.2 Are Bushy Trees Necessary?
- Using hash joins, right-deep trees are executed by first creating hash tables out of each relation except one before probing in all of these hash tables in a pipelined fashion, whereas in left-deep trees, a new hash table is built from the result of each join. In zig-zag trees, which are a super set of all left- and right-deep trees, each join operator must have at least one base relation as input.
- In general, zig-zag trees are better than left-deep trees, and right-deep trees are the worst.
6.3 Are Heuristics Good Enough?
- Quickpick-1000, Greedy Operator Ordering (GOO)
- enumerating all bushy trees exhaustively offers moderate but not insignificant performance benefits in comparison with algorithms that enumerate only a sub set of the search space. The performance potential from good cardinality estimates is certainly much larger.
7. Related Work
- Existing research work to better estimate result sizes of queries with join-crossing correlations
- Exploiting query feedback
- We found that the performance impact of estimation mistakes heavily depends on the physical database design.
- Work to make systems adaptive to estimation mistakes
- A radical approach is to move query optimization to runtime, when actual value-distributions become available.
- Optimizing cost model
- Plan enumeration