Notes on : 15-721 (2018) 15 Optimizer Implementation Part I

15 Optimizer Implementation Part I

QUERY OPTIMIZATION

For a given query, find a correct execution plan that has the lowest “cost”.
This is the part of a DBMS that is the hardest to implement well (proven to be NP-Complete).

RELATIONAL ALGEBRA EQUIVALENCES

as the name means
Query planning for OLTP queries is easy because they are sargable(search argument able).

→ It is usually just picking the best index.
→ Joins are almost always on foreign key relationships with a small cardinality.
→ Can be implemented with simple heuristics

COST ESTIMATION

Generate an estimate of the cost of executing a plan for the current state of the database.
→ Interactions with other work in DBMS
→ Size of intermediate results
→ Choices of algorithms, access methods
→ Resource utilization (CPU, I/O, network)
→ Data properties (skew, order, placement)

DESIGN DECISIONS

Optimization Granularity
- Choice #1: Single Query → Much smaller search space.
  → DBMS cannot reuse results across queries.
  → In order to account for resource contention, the cost model must account for what is currently running.
- Choice #2: Multiple Queries
  → More efficient if there are many similar queries.
  → Search space is much larger.
  → Useful for scan sharing.
Optimization Timing
- Choice #1: Static Optimization
  → Select the best plan prior to execution.
  → Plan quality is dependent on cost model accuracy.
  → Can amortize over executions with prepared stmts.
- Choice #2: Dynamic Optimization
  → Select operator plans on-the-fly as queries execute.
  → Will have re-optimize for multiple executions.
  → Difficult to implement/debug (non-deterministic)
- Choice #3: Hybrid Optimization
  → Compile using a static algorithm.
  → If the error in estimate > threshold, re-optimize
Prepared Statements
- Choice #1: Re-Optimize
  → Rerun optimizer each time the query is invoked. → Tricky to reuse existing plan as starting point.
- Choice #2: Multiple Plans
  → Generate multiple plans for different values of the parameters (e.g., buckets).
- Choice #3: Average Plan
  → Choose the average value for a parameter and use that for all invocations.
Plan Stability
- Choice #1: Hints
  → Allow the DBA to provide hints to the optimizer.
- Choice #2: Fixed Optimizer Versions
  → Set the optimizer version number and migrate queries one-by-one to the new optimizer.
- Choice #3: Backwards-Compatible Plans
  → Save query plan from old version and provide it to the new DBMS.
  OPTIMIZATION SEARCH STRATEGIES
Heuristics(Ingres no join support)
- Define static rules that transform logical operators to a physical plan.
  → Perform most restrictive selection early
  → Perform all selections before joins
  → Predicate/Limit/Projection pushdowns
  → Join ordering based on cardinality
- Advantages:
  → Easy to implement and debug.
  → Works reasonably well and is fast for simple queries.
- Disadvantages:
  → Relies on magic constants that predict the efficacy of a planning decision.
  → Nearly impossible to generate good plans when operators have complex inter-dependencies.
Heuristics + Cost-based Join Order Search(System R)
- Use static rules to perform initial optimization.
- Then use dynamic programming to determine the best join order for tables.
  → This is the first cost-based query optimizer
  → Bottom-up planning (forward chaining) using a divide and-conquer search method
- System R did this:
  - Break query up into blocks and generate the logical operators for each block.
  - For each logical operator, generate a set of physical operators that implement it.
  - All combinations of join algorithms and access paths
  - Then iteratively construct a “left-deep” tree that minimizes the estimated amount of work to execute the plan
- Advantages:
  → Usually finds a reasonable plan without having to perform an exhaustive search.
- Disadvantages:
  → All the same problems as the heuristic-only approach.
  → Left-deep join trees are not always optimal.(sometimes bush tree may be better)
  → Have to take in consideration the physical properties of data in the cost model (e.g., sort order).
Randomized Algorithms(Postgres’ genetic algorithm)
- Perform a random walk over a solution space of all possible (valid) plans for a query.
- Continue searching until a cost threshold is reached or the optimizer runs for a particular length of time.
- SIMULATED ANNEALING Example:
  - Start with a query plan that is generated using the heuristic-only approach.
  - Compute random permutations of operators (e.g.,swap the join order of two tables)
    → Always accept a change that reduces cost
    → Only accept a change that increases cost with some probability.
    → Reject any change that violates correctness (e.g., sort ordering)
- Advantages:
  → Jumping around the search space randomly allows the optimizer to get out of local minimums.
  → Low memory overhead (if no history is kept).
- Disadvantages:
  → Difficult to determine why the DBMS may have chosen a particular plan.
  → Have to do extra work to ensure that query plans are deterministic.
  → Still have to implement correctness rules.
Problem of these approaches
- Writing query transformation rules in a procedural language is hard and error-prone.
  → No easy way to verify that the rules are correct without running a lot of fuzz tests.
  → Generation of physical operators per logical operator is decoupled from deeper semantics about query.
- A better approach is to use a declarative DSL to write the transformation rules and then have the optimizer enforce them during planning.
- Use a rule engine that allows transformations to modify the query plan operators.
- The physical properties of data is embedded with the operators themselves.
Stratified Search(Planning is done in multiple stages)
- Better implementation of the System R optimizer that uses declarative rules.
- Stage #1: Query Rewrite(a new logical plan)
  → Compute a SQL-block-level, relational calculus-like representation of queries.
- Stage #2: Plan Optimization(as before)
  → Execute a System R-style dynamic programming phase once query rewrite has completed.
- Advantages:
  → Works well in practice with fast performance.
- Disadvantages:
  → Difficult to assign priorities to transformations
  → Some transformations are difficult to assess without computing multiple cost estimations.
  → Rules maintenance is a huge pain.
Unified Search(Perform query planning all at once)
- General purpose cost-based query optimizer, based on equivalence rules on algebras.
  → Easily add new operations and equivalence rules.
  → Treats physical properties of data as first-class entities during planning.
  → Top-down approach (backward chaining) using branch-and-bound search.
- Advantages:
  → Use declarative rules to generate transformations.
  → Better extensibility with an efficient search engine.Reduce redundant estimations using memoization.
- Disadvantages:
  → All equivalence classes are completely expanded to generate all possible logical operators before the optimization search.
  → Not easy to modify predicates.

15 Optimizer Implementation Part I

QUERY OPTIMIZATION

RELATIONAL ALGEBRA EQUIVALENCES

Query planning for OLTP queries is easy because they are sargable(search argument able).

COST ESTIMATION

DESIGN DECISIONS

OPTIMIZATION SEARCH STRATEGIES