Notes on : 15-721 (2018) 15 Optimizer Implementation Part I
15 Optimizer Implementation Part I
QUERY OPTIMIZATION
- For a given query, find a correct execution plan that has the lowest “cost”.
- This is the part of a DBMS that is the hardest to implement well (proven to be NP-Complete).
RELATIONAL ALGEBRA EQUIVALENCES
- as the name means
Query planning for OLTP queries is easy because they are sargable(search argument able).
→ It is usually just picking the best index.
→ Joins are almost always on foreign key relationships with a small cardinality.
→ Can be implemented with simple heuristicsCOST ESTIMATION
Generate an estimate of the cost of executing a plan for the current state of the database.
→ Interactions with other work in DBMS
→ Size of intermediate results
→ Choices of algorithms, access methods
→ Resource utilization (CPU, I/O, network)
→ Data properties (skew, order, placement)
DESIGN DECISIONS
- Optimization Granularity
-
Choice #1: Single Query → Much smaller search space.
→ DBMS cannot reuse results across queries.
→ In order to account for resource contention, the cost model must account for what is currently running. -
Choice #2: Multiple Queries
→ More efficient if there are many similar queries.
→ Search space is much larger.
→ Useful for scan sharing.
-
- Optimization Timing
- Choice #1: Static Optimization
→ Select the best plan prior to execution.
→ Plan quality is dependent on cost model accuracy.
→ Can amortize over executions with prepared stmts. - Choice #2: Dynamic Optimization
→ Select operator plans on-the-fly as queries execute.
→ Will have re-optimize for multiple executions.
→ Difficult to implement/debug (non-deterministic) - Choice #3: Hybrid Optimization
→ Compile using a static algorithm.
→ If the error in estimate > threshold, re-optimize
- Choice #1: Static Optimization
- Prepared Statements
- Choice #1: Re-Optimize
→ Rerun optimizer each time the query is invoked. → Tricky to reuse existing plan as starting point. - Choice #2: Multiple Plans
→ Generate multiple plans for different values of the parameters (e.g., buckets). - Choice #3: Average Plan
→ Choose the average value for a parameter and use that for all invocations.
- Choice #1: Re-Optimize
- Plan Stability
- Choice #1: Hints
→ Allow the DBA to provide hints to the optimizer. - Choice #2: Fixed Optimizer Versions
→ Set the optimizer version number and migrate queries one-by-one to the new optimizer. - Choice #3: Backwards-Compatible Plans
→ Save query plan from old version and provide it to the new DBMS.OPTIMIZATION SEARCH STRATEGIES
- Choice #1: Hints
- Heuristics(Ingres no join support)
- Define static rules that transform logical operators to a physical plan.
→ Perform most restrictive selection early
→ Perform all selections before joins
→ Predicate/Limit/Projection pushdowns
→ Join ordering based on cardinality - Advantages:
→ Easy to implement and debug.
→ Works reasonably well and is fast for simple queries. - Disadvantages:
→ Relies on magic constants that predict the efficacy of a planning decision.
→ Nearly impossible to generate good plans when operators have complex inter-dependencies.
- Define static rules that transform logical operators to a physical plan.
- Heuristics + Cost-based Join Order Search(System R)
- Use static rules to perform initial optimization.
- Then use dynamic programming to determine the best join order for tables.
→ This is the first cost-based query optimizer
→ Bottom-up planning (forward chaining) using a divide and-conquer search method - System R did this:
- Break query up into blocks and generate the logical operators for each block.
- For each logical operator, generate a set of physical operators that implement it.
- All combinations of join algorithms and access paths
- Then iteratively construct a “left-deep” tree that minimizes the estimated amount of work to execute the plan
- Advantages:
→ Usually finds a reasonable plan without having to perform an exhaustive search. - Disadvantages:
→ All the same problems as the heuristic-only approach.
→ Left-deep join trees are not always optimal.(sometimes bush tree may be better)
→ Have to take in consideration the physical properties of data in the cost model (e.g., sort order).
- Randomized Algorithms(Postgres’ genetic algorithm)
- Perform a random walk over a solution space of all possible (valid) plans for a query.
- Continue searching until a cost threshold is reached or the optimizer runs for a particular length of time.
- SIMULATED ANNEALING Example:
- Start with a query plan that is generated using the heuristic-only approach.
- Compute random permutations of operators (e.g.,swap the join order of two tables)
→ Always accept a change that reduces cost
→ Only accept a change that increases cost with some probability.
→ Reject any change that violates correctness (e.g., sort ordering)
- Advantages:
→ Jumping around the search space randomly allows the optimizer to get out of local minimums.
→ Low memory overhead (if no history is kept). - Disadvantages:
→ Difficult to determine why the DBMS may have chosen a particular plan.
→ Have to do extra work to ensure that query plans are deterministic.
→ Still have to implement correctness rules.
- Problem of these approaches
- Writing query transformation rules in a procedural language is hard and error-prone.
→ No easy way to verify that the rules are correct without running a lot of fuzz tests.
→ Generation of physical operators per logical operator is decoupled from deeper semantics about query. - A better approach is to use a declarative DSL to write the transformation rules and then have the optimizer enforce them during planning.
- Use a rule engine that allows transformations to modify the query plan operators.
- The physical properties of data is embedded with the operators themselves.
- Writing query transformation rules in a procedural language is hard and error-prone.
- Stratified Search(Planning is done in multiple stages)
- Better implementation of the System R optimizer that uses declarative rules.
- Stage #1: Query Rewrite(a new logical plan)
→ Compute a SQL-block-level, relational calculus-like representation of queries. - Stage #2: Plan Optimization(as before)
→ Execute a System R-style dynamic programming phase once query rewrite has completed. - Advantages:
→ Works well in practice with fast performance. - Disadvantages:
→ Difficult to assign priorities to transformations
→ Some transformations are difficult to assess without computing multiple cost estimations.
→ Rules maintenance is a huge pain.
- Unified Search(Perform query planning all at once)
- General purpose cost-based query optimizer, based on equivalence rules on algebras.
→ Easily add new operations and equivalence rules.
→ Treats physical properties of data as first-class entities during planning.
→ Top-down approach (backward chaining) using branch-and-bound search. - Advantages:
→ Use declarative rules to generate transformations.
→ Better extensibility with an efficient search engine.Reduce redundant estimations using memoization. - Disadvantages:
→ All equivalence classes are completely expanded to generate all possible logical operators before the optimization search.
→ Not easy to modify predicates.
- General purpose cost-based query optimizer, based on equivalence rules on algebras.