Continuous Cloud-Scale Query Optimization and Processing

Reference: Bruno, Nicolas, Sapna Jain, and Jingren Zhou. "Continuous cloud-scale query optimization and processing." Proceedings of the VLDB Endowment 6.11 (2013): 961-972.

0. Abstract

Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions. High-level scripting languages free developers from understanding various system trade-offs, but introduce new challenges for query optimization. One key optimization challenge is missing accurate data statistics, typically due to massive data volumes and their distributed nature, complex computation logic, and frequent usage of user-defined functions. In this paper we propose novel techniques to adapt query processing in the Scope system, the cloud-scale computation environment in Microsoft Online Services. We continuously monitor query execution, collect actual runtime statistics, and adapt parallel execution plans as the query executes. We discuss similarities and differences between our approach and alternatives proposed in the context of traditional centralized systems. Experiments on large-scale Scope production clusters show that the proposed techniques systematically solve the challenge of missing/inaccurate data statistics, detect and resolve partition skew and plan structure problems, and improve query latency severalfold for real workloads. Although we focus on optimizing high-level languages, the same ideas are also applicable to MapReduce systems.

1. Introduction

The choice of a plan heavily depends on properties of the input data and user-defined functions, and several aspects of highly distributed systems make this problem more challenging than in traditional database systems:

  • It is difficult to obtain and maintain good-quality statistics on huge volumes of data, and common design choices do not even account for accurate statistics on unstructured input data.
  • Most queries rely heavily on user-defined code, which makes statistical inference during optimization much more challenging.
  • Input scripts typically consist of hundreds of operators, which magnifies the well-known problem of propagation of statistical errors.

2. The SCOPE System

2.1 The Language and Data Model

  • Extractor, Processor, Reducer, Combiner, Outputter
  • Structured stream

2.2 Query Compilation and Optimization

  • SCOPE optimizer: Cost-based transformation engine
  • Backend compiler: code generation, grouping of operators into stages

2.3 Job Scheduling and Runtime

  • Job Manager: data locality, speculative execution
  • Pull-based execution model with materialization of intermediate results: simplifies job scheduling and failure recovery, and can decrease resource usage because producer and consumer vertices need not run concurrently.

3. Continuous Query Optimization

3.1 Architecture

  • During execution, the Job Manager collects runtime statistics from instrumented vertices and feeds them back to the query optimizer, which may produce a better plan for the remaining work and return it to the Job Manager.
3.2 Plan Signature

  • Plan signatures uniquely identify a logical query fragment during optimization, similar to the notion of view matching in traditional database systems.
  • We traverse back all applied transformation rules to reach the initial, semantically equivalent expression, which is used as the canonical representation. We then recursively serialize the canonical expression and compute a 64-bit hash value for each subtree; this hash serves as the plan signature.
  • To distinguish operators with the same logical result but different physical properties (e.g., partitioning) during execution, the optimizer associates each operator with both its signature and the corresponding delivered properties (see the sketch after this list).
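
A minimal sketch of the signature computation in Python (the expression node shape, the canonicalization step, and the choice of hash are illustrative assumptions; the paper only specifies a recursive 64-bit hash over the canonical expression):

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Expr:
    """A logical operator in the canonical (pre-transformation) expression tree."""
    op: str                          # e.g., "Join", "Filter", "Get"
    args: tuple = ()                 # operator parameters: predicates, columns, ...
    children: list = field(default_factory=list)

def signature(expr: Expr) -> int:
    """Recursively serialize each subtree of the canonical expression and
    fold it into a 64-bit hash; equal logical fragments get equal signatures."""
    child_sigs = tuple(signature(c) for c in expr.children)
    payload = repr((expr.op, expr.args, child_sigs)).encode()
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

# Physical alternatives of one logical fragment share a signature, so the
# optimizer keys them by (signature, delivered properties), e.g.:
sig = signature(Expr("Join", ("a.k == b.k",),
                     [Expr("Get", ("a",)), Expr("Get", ("b",))]))
key = (sig, ("hash-partitioned", "a.k"))
```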

3.3 Statistics Instrumentation

  • We model statistics as very lightweight user-defined aggregates whose execution is interleaved with normal vertex processing.
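
A sketch of what such an interleaved aggregate could look like, assuming a per-row accumulate/finalize interface and a hypothetical report_to_job_manager channel (both names are illustrative; the paper only states that statistics run as lightweight user-defined aggregates inside normal vertex processing):

```python
class CardinalityStats:
    """Lightweight statistics aggregate, updated once per row while the
    vertex runs and emitted when the vertex finishes."""
    def __init__(self):
        self.rows = 0
        self.bytes = 0

    def accumulate(self, row: bytes) -> None:
        self.rows += 1
        self.bytes += len(row)

    def finalize(self) -> dict:
        return {"rows": self.rows, "bytes": self.bytes}

def report_to_job_manager(report: dict) -> None:
    print("vertex stats:", report)       # stand-in for the real JM channel

def run_vertex(rows, operator):
    """Normal vertex processing with statistics collection interleaved."""
    stats = CardinalityStats()
    for row in rows:
        stats.accumulate(row)            # piggy-backed, constant work per row
        yield operator(row)
    report_to_job_manager(stats.finalize())
```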

3.4 Query Reoptimization

  • As shown in the experimental evaluation in Section 4, this simple reoptimization policy provides remarkably robust results.
  • We leverage skew values during cost estimation by calculating the cost of the operator for the worst case, i.e., the instance with the largest cardinality (see the sketch after this list).
  • Partially materialized views: each carries an additional cost that models the remaining work needed to complete the execution of all remaining vertices in the stage.
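
A sketch of the worst-case costing rule (the linear cost model, per-row cost, and per-partition cardinalities are made-up values for illustration):

```python
def stage_cost(partition_cards, per_row_cost,
               pending_vertices=0, per_vertex_cost=0.0):
    """Latency of a partitioned stage is gated by its slowest instance, so
    the operator is costed at the partition with the largest cardinality.
    A partially executed stage additionally carries the remaining work of
    its not-yet-finished vertices."""
    return max(partition_cards) * per_row_cost + pending_vertices * per_vertex_cost

cards = [1_000, 1_200, 950, 60_000]          # one heavily skewed partition
avg_based = sum(cards) / len(cards) * 1.0    # ~15,787: underestimates latency
worst_case = stage_cost(cards, 1.0)          # 60,000: what actually dominates
```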

3.5 Adapting Execution Plans

  • When the JM receives a new execution plan from the query optimizer, it merges it with the currently executing graph and continues execution.
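
A sketch of that merge step, assuming stages are keyed by the (signature, delivered properties) pairs from Section 3.2 (the graph and state representation are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    state: str      # "completed", "executing", or "pending"
    plan: object    # physical operators for this stage

def merge_plans(running: dict, new_plan: dict) -> dict:
    """Merge a reoptimized plan into the executing graph: a stage of the
    new plan that matches already-done (or in-flight) work by
    (signature, properties) reuses it; everything else is scheduled fresh."""
    merged = {}
    for key, stage in new_plan.items():          # key = (signature, properties)
        old = running.get(key)
        if old is not None and old.state in ("completed", "executing"):
            merged[key] = old                    # keep the existing vertices
        else:
            merged[key] = stage                  # schedule the new stage
    return merged

running = {("0xA1", "hash(a.k)"): Stage("completed", "..."),
           ("0xB2", "hash(a.k)"): Stage("executing", "...")}
updated = {("0xA1", "hash(a.k)"): Stage("pending", "..."),
           ("0xC3", "range(a.k)"): Stage("pending", "...")}
graph = merge_plans(running, updated)   # reuses 0xA1's completed output
```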

4. Experimental Evaluation

  • In our production workload, the main performance bottlenecks of problematic queries are a suboptimal degree of parallelism and data partition skew due to suboptimal partitioning keys chosen by the optimizer.
  • This usually happens due to incorrect estimation of cardinality and key distribution across partitions.
  • Most cardinality estimation errors occur due to black-box user-defined functions, whose selectivity varies widely, and unknown data distributions, either in inputs or in intermediate results.

4.1 Case Study

  • Continuous optimization: query latency is reduced by 8.4x
  • Leveraging partial materialized views: the overall latency is further reduced by 9.4%

4.2 Performance Evaluation

  • Partition skew, Over-partitioning, Under-partitioning
  • Join variants
  • Reusing intermediate results

4.3 Plan Convergence

  • The system converges to the final plan in just a couple of iterations in most cases.

4.4 Continuous Optimization Overhead

  • Most queries using CQO are just a few percentage points above the optimal strategy.
  • The main outlier is query Q6, an instance that changes the plan multiple times to recover from multi-column skew. In the future, we may address this by more aggressively choosing partitioning columns based on the history of reoptimizations.

5. Related Work

  • Virtually all optimization techniques rely on accurate data statistics to choose the best execution plan in a cost-based manner. It is well known to be very difficult, and sometimes impossible, to compute good-quality statistics on intermediate computation results.

6. Future Work

  • Making the stage boundaries aware of reoptimization opportunities, rather than the current approach which performs grouping in a reoptimization-agnostic manner.
  • Effectively exploiting statistics from partially completed vertices to detect expensive predicates before a single vertex finishes execution.
  • Leveraging different conditions to trigger reoptimization, rather than the current fixed policy.

Written on January 22, 2018