Orca: A Modular Query Optimizer Architecture for Big DataReference: Soliman, Mohamed A., et al. "Orca: a modular query optimizer architecture for big data." Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014.
This is one of the several papers belong to suggested readings for Optimizer Implementation (Part II) of CMU 15-721: Database Systems.
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system’s query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.
In this paper we present the architecture of Orca, the new query optimizer for all pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with original research resulting a modular and portable optimizer architecture.
In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
- Orca is a state-of-the-art query optimizer specifically designed for demanding analytics workloads. It is distinguished from others in these ways: (1) modularity; (2) extensibility; (3) Multi-core ready; (4) Verifiability; (5) Performance.
2.1 Massively Parallel Processing
- Pivotal’s Greenplum Database (GPDB) is a massively parallel processing (MPP) analytics database.
- Distribution: (1) hashed; (2) replicated; (3) singleton
2.2 SQL on Hadoop
- Pivot’s HAWQ is a massively parallel SQL-compliant engine on top of HDFS.
3. Orca Architecture
Orca is a modern top-down query optimizer based on the Cascades optimization framework.
- Orca includes a framework for exchanging information between the optimizer and the database system called Data eXchange Language (DXL).
- The framework uses an XML-based language to encode the necessary information for communication, output plans and metadata.
- A major benefit of DXL is packaging ORca as a stand-alone product.
Memo. The space of plan alternatives generated by the optimizer is encoded in a compact in-memory data structure called the Memo. The Memo structure consists of a set of containers called groups, where each group contains logically equivalent expressions. Memo groups capture the different sub-goals of a query (e.g., a filter on a table, or a join of two tables). Group members, called group expressions, achieve the group goal in different logical ways (e.g., different join orders). Each group expression is an operator that has other groups as its children. This recursive structure of the Memo allows compact encoding of a huge space of possible plans.
Search and Job Scheduler. Orca uses a search mecha- nism to navigate through the space of possible plan alter- natives and identify the plan with the least estimated cost. The search mechanism is enabled by a specialized Job Scheduler that creates dependent or parallel work units to perform query optimization in three main steps: exploration, where equivalent logical expressions are generated, implementation where physical plans are generated, and optimization where required physical properties (e.g., sort order) are enforced and plan alternatives are costed.
Transformations. Plan alternatives are generated by applying transformation rules that can produce either equivalent logical expressions (e.g., InnerJoin(A,B) → InnerJoin(B,A)), or physical implementations of existing expressions (e.g., Join(A,B) → HashJoin(A,B)). The results of applying transformation rules are copied-in to the Memo, which may result in creating new groups and/or adding new group expressions to existing groups. Each transformation rule is a self-contained component that can be explicitly activated/deactivated in Orca configurations.
Property Enforcement. Orca includes an extensible framework for describing query requirements and plan characteristics based on formal property specifications. Properties have different types including logical properties (e.g., output columns), physical properties (e.g., sort order and data distribution), and scalar properties (e.g., columns used in join conditions). During query optimization, each operator may request specific properties from its children. An optimized child plan may either satisfy the required properties on its own (e.g., an IndexScan plan delivers sorted data), or an enforcer (e.g., a Sort operator) needs to be plugged in the plan to deliver the required property. The framework allows each operator to control enforcers placement based on child plans’ properties and operator’s local behavior.
Metadata Cache. Since metadata (e.g., table definitions) changes infrequently, shipping it with every query incurs an overhead. Orca caches metadata on the optimizer side and only retrieves pieces of it from the catalog if something is unavailable in the cache, or has changed since the last time it was loaded in the cache. Metadata cache also abstracts the database system details from the optimizer, which is particularly useful during testing and debugging.
GPOS. In order to interact with operating systems with possibly different APIs, Orca uses an OS abstraction layer called GPOS. The GPOS layer provides Orca with an extensive infrastructure including a memory manager, primitives for concurrency control, exception handling, file I/O and synchronized data structures.
4. Query Optimization
- Exploration: transformation rules that generate logically equivalent expressions are triggered. The Memo structure has a built-in duplicate detection mechanism, based on expression topology, to detect and eliminate any duplicate expressions created by different transformations.
- Statistics Derivation: Orca’s statistics derivation mechanism is triggered at the end of exploration to compute statistics for the Memo groups. Derivation of statistics takes place on the compact Memo structure to avoid expanding the search space. In order to derive statistics for a target group, Orca picks the group expression with the highest promise of delivering reliable statistics.
- Implementation: Transformation rules that create physical implementations of logical expressions are triggered.
- In this step, properties are enforced and plan alternatives are costed. Optimization starts by submitting an initial optimization request to the Memo’s root group specifying query requirements such as result distribution and sort order. Submitting a request r to a group g corresponds to requesting the least cost plan satisfying r with a root physical operator in g.
- For each incoming request, each physical group expression passes corresponding requests to child groups depending on the incoming requirements and operator’s local requirements. And we use a group hash table to cache the best group expressions of request, and a local hash table to cache the mapping from request to child optimization requests.
The extracted plan is serialized in DXL format and shipped to the database system for execution. DXL2Plan translator at the database system translates DXL plan to an executable plan based on the underling query execution framework.
- Multi-Stage Optimization: an optimization stage in Orca is defined as a complete optimization workflow using a subset of transformation rules and (optional) time-out and cost threshold. A stage terminates when any of the following conditions is met: (1) a plan with cost below cost threshold is found, (2) time-out occurs, or (3) the subset of transformation rules is exhausted. The specification of optimization stages can be given by the user through Orca’s configuration.
- Query Execution: A copy of the final plan is dispatched to each segment. During distributed query execution, a distribution enforcer on each segment acts as both sender and receiver of data.
4.2 Parallel Query Optimization
- Optimization process is broken to small work units called optimization jobs. Orca currently has seven different types of optimization jobs:
- Orca includes a specialized job scheduler designed from scratch to maximize the fan-out of job dependency graph and provide the required infrastructure for parallel query optimization.
- When an optimization job with some goal is under processing, all other incoming jobs with the same goal are forced to wait until getting notified about the completion of the running job. At this point, the suspended jobs can pick up the results of the completed job. This functionality is enabled by attaching a job queue to each group, such that incoming jobs are queued as long as there exists an active job with the same goal.
5. Metadata Exchange
- The access to metadata is facilitated by a collection of Metadata Providers that are system-specific plug-ins to retrieve metadata from the database system.
- In addition to system-specific providers, Orca implements a file-based MD Provider to load metadata from a DXL file, eliminating the need to access a live backend system. Orca includes an automated tool for harvesting metadata that optimizer needs into a minimal DXL file.
- There is a built-in testing scheme that makes it difficult for developers to introduce regressions as part of adding new features, and makes it simple for test engineers to add test cases to be verified with every build.
- In addition, we leverage several tools and testing frameworks we built to assure the quality and verifiability of Orca, including a cardinality estimation testing framework, a number of benchmark tests at various scales, a data generator that can generate data by reversing database statistics , and two unique testing tools we discuss below.
6.1 Minimal Repros
- AMPERe is a tool for Automatic capture of Minimal Portable and Executable Repros. The motivation for building AMPERe was to be able to reproduce and debug customer issues in the optimizer without having access to the customer production system.
- An AMPERe dump is automatically triggered when an unexpected error is encountered, but can also be produced on demand to investigate suboptimal query plans. The dump captures the minimal amount of data needed to reproduce a problem, including the input query, optimizer configurations and metadata, serialized in DXL. If the dump is generated due to an exception, it also includes the exception’s stack trace.
- AMPERe allows replaying a dump outside the system where it was generated.
- AMPERe is also used as a testing framework, where a dump acts as a test case that contains an input query and its expected plan.
6.2 Testing Optimizer Accuracy
- Orca includes a built-in tool called TAQO for Testing the Accuracy of Query Optimizer. TAQO measures the optimizer’s accuracy by costing and executing plans that the optimizer considers when optimizing a given query.
- Orca automatically checks whether the ordering of the estimate cost of two plans matches their actual execution cost.
- Evaluating each single plan in the search space is infeasible, in general. This limitation can be overcome by sampling plans uniformly from the search space. Optimization requests’ linkage structure provides the infrastructure used by TAQO to build a uniform plan sampler based on the method introduced in this paper.
- Given a sample of plans from the search space of a given query, TAQO computes a correlation score between the ranking of sampled plans based on estimated costs and their ranking based on actual costs.
7.1 TPC-DS Benchmark
- TPC-DS is a widely adopted decision support benchmark that consists of a set of complex business analytic queries with rich SQL syntax.
- It has superseded the well-known TPC-H by providing a much richer schema and a larger variety of business problems ranging from business reporting, ad-hoc exploration, iterative queries to data mining. In our development process we have observed that TPC-H often lacks the sophistication of the workload from our enterprise customers.
7.2 MPP Databases
- Planner: the GPDB legacy query optimizer that inherits part of its design from the PostgreSQL optimizer.
- For the entire TPC-DS suite, Orca shows a 5x speed-up over Planner.
- The performance improvement provided by Orca is due to a combination of salient features including the following: (1) Join Ordering; (2) Correlated Subqueries; (3) Partition Elimination; (4) Common Expressions.
- For a smaller number of queries, Orca produced suboptimal plans with up to 2x slow down compared to Planner. These sub-optimal plans are partly due to cardinality estimation errors or sub-optimal cost model parameters that need further tuning.
7.3 SQL on Hadoop
- we compare the performance of Pivotal HAWQ (powered by Orca) against three Hadoop SQL engines: Impala, Presto, and Stinger.
- The average speedup ratio of HAWQ in this set of experiments is 6x against Impala and 21x against Stinger. we were unable to successfully run any TPC-DS query in Presto.
8. Related Work
8.1 Query Optimization Foundations
- Volcano Parallel Database
8.2 SQL Optimization On MPP Databases
- SQL Server Parallel Data Warehouse (PDW) makes extensive re-use of the established Microsoft’s SQL Server optimizer.
- Structured Computations Optimized for Parallel Execution (SCOPE)
- SAP HANA
8.3 SQL On Hadoop
- Hadoop-based: Impala, HAWQ, Presto, Hadapt, PolyBase
- AsterixDB, Dremel