Notes on : 15-721 (2018) 16 Optimizer Implementation Part II

16 Optimizer Implementation Part II (Spring 2018)
2018年3月31日
16:53
Optimization strategies
Choice #1: Heuristics
→ INGRES, Oracle (until mid 1990s)
Choice #2: Heuristics + Cost-based Join Search
→ System R, early IBM DB2, most open-source DBMSs
Choice #3: Randomized Search
→ Academics in the 1980s, current Postgres
Choice #4: Stratified Search
→ IBM’s STARBURST (late 1980s), now IBM DB2 + Oracle
Choice #5: Unified Search
→ Volcano/Cascades in 1990s, now MSSQL + Greenplum
Postgres optimizer
Imposes a rigid workflow for query optimization:
→ First stage performs initial rewriting with heuristics
→ It then executes a cost-based search to find optimal join ordering.
→ Everything else is treated as an “add-on”.
→ Then recursively descends into sub-queries.
Difficult to modify or extend because the ordering has to be preserved.
Optimizer generator
Framework to allow a DBMS implementer to write the declarative rules for optimizing queries.
→ Separate the search strategy from the data model.
→ Separate the transformation rules and logical operators from physical rules and physical operators.
Implementation can be independent of the optimizer's search strategy.
Examples: Starburst, Exodus, Volcano, Cascades, OPT++
Stratified search
First rewrite the logical query plan using transformation rules.
→ The engine checks whether the transformation is allowed before it can be applied.
→ Cost is never considered in this step.
Then perform a cost-based search to map the logical plan to a physical plan.
Unified search
Unify the notion of both logical→logical and logical
→ physical transformations.
→ No need for separate stages because everything is transformations.
This approach generates a lot more transformations so it makes heavy use of memorization to reduce redundant
work.
Top-down / bottom-up
Top-down Optimization
→ Start with the final outcome that you want, and then work down the tree to find the optimal plan that gets you
to that goal.
→ Example: Volcano, Cascades
Bottom-up Optimization
→ Start with nothing and then build up the plan to get to the final outcome that you want.
→ Examples: System R, Starburst
Cascades optimizer
Object-oriented implementation of the Volcano query optimizer.
Simplistic expression re-writing can be through a direct mapping function rather than an exhaustive search.
1. Optimization tasks as data structures.
2. Rules to place property enforcers.
3. Ordering of moves by promise.
分区 datebase 的第 1 页3. Ordering of moves by promise.
4. Predicates as logical/physical operators.
rules
A rule is a transformation of an expression to a logically equivalent expression.
→ Transformation Rule: Logical to Logical
→ Implementation Rule: Logical to Physical
Each rule is represented as a pair of attributes:
→ Pattern: Defines the structure of the logical expression that can be applied to the rule.
→ Substitute: Defines the structure of the result after applying the rule.
Memo table
Stores all previously explored alternatives in a compact graph structure.
Equivalent operator trees and their corresponding plans are stored together in groups.
Provides memorization, duplicate detection, and property + cost management.
Principal of optimality
Every sub-plan of an optimal plan is itself optimal.
This allows the optimizer to restrict the search space to a smaller set of expressions.
→ The optimizer never has to consider a plan containing sub-plan P1 that has a greater cost than equivalent plan
P2 with the same physical properties.
Search termination
Approach #1: Wall-clock Time
→ Stop after the optimizer runs for some length of time.
Approach #2: Cost Threshold
→ Stop when the optimizer finds a plan that has a lower cost than some threshold.
Approach #3: Transformation Exhaustion
→ Stop when there are no more ways to transform the target plan. Usually done per group.
Cascade implementations
Standalone:
→ Wisconsin OPT++ (1990s)
→ Portland State Columbia (1990s)
→ Pivotal Orca (2010s)
→ Apache Calcite (2010s)
Integrated:
→ Microsoft SQL Server (1990s)
→ Tandem NonStop SQL (1990s)
→ Clustrix (2000s)
→ CMU Peloton (2010s)
Predicate expressions
Predicates are defined as part of each operator.
→ These are typically represented as an AST.
→ Postgres implements them as flatten lists.
The same logical operator can be represented in multiple physical operators using variations of the same
expression.
Predicate pushdown
Approach #1: Logical Transformation
→ Like any other transformation rule in Cascades.
→ Can use cost-model to determine benefit.
Approach #2: Rewrite Phase
→ Perform pushdown before starting search using an initial rewrite phase. Tricky to support complex predicates.
Approach #3: Late Binding
→ Perform pushdown after generating optimal plan in Cascades. Will likely produce a bad plan.
Predicate migration
分区 datebase 的第 2 页Predicate migration
Observation: Not all predicates cost the same to evaluate on tuples.
Select ... where sha256(field) == xxx ....
The optimizer should consider selectivity and computation cost when determining the evaluation order of
predicates.
Pivotal orca
Standalone Cascades implementation.
→ Originally written for Greenplum.
→ Extended to support HAWQ.
A DBMS can use Orca by implementing API to send catalog + stats + logical plans and then retrieve physical plans.
Supports multi-threaded search.
Orca engineering hints
Issue #1: Remote Debugging
→ Automatically dump the state of the optimizer (with inputs) whenever an error occurs.
→ The dump is enough to put the optimizer back in the exact same state later on for further debugging.
Issue #2: Optimizer Accuracy
→ Automatically check whether the ordering of the estimate cost of two plans matches their actual execution
cost.
MEMSQL optimizer
Rewriter
→ Logical-to-logical transformations with access to the cost-model.
Enumerator
→ Logical-to-physical transformations.
→ Mostly join ordering.
Planner
→ Convert physical plans back to SQL.
→ Contains MemSQL-specific commands for moving data.
分区 datebase 的第 3 页