15 Optimizer Implementation Part I

QUERY OPTIMIZATION

  • For a given query, find a correct execution plan that has the lowest “cost”.
  • This is the part of a DBMS that is the hardest to implement well (proven to be NP-Complete).

RELATIONAL ALGEBRA EQUIVALENCES

  • as the name means

    Query planning for OLTP queries is easy because they are sargable(search argument able).

    → It is usually just picking the best index.
    → Joins are almost always on foreign key relationships with a small cardinality.
    → Can be implemented with simple heuristics

    COST ESTIMATION

    Generate an estimate of the cost of executing a plan for the current state of the database.
    → Interactions with other work in DBMS
    → Size of intermediate results
    → Choices of algorithms, access methods
    → Resource utilization (CPU, I/O, network)
    → Data properties (skew, order, placement)

DESIGN DECISIONS

  • Optimization Granularity
    • Choice #1: Single Query → Much smaller search space.
      → DBMS cannot reuse results across queries.
      → In order to account for resource contention, the cost model must account for what is currently running.

    • Choice #2: Multiple Queries
      → More efficient if there are many similar queries.
      → Search space is much larger.
      → Useful for scan sharing.

  • Optimization Timing
    • Choice #1: Static Optimization
      → Select the best plan prior to execution.
      → Plan quality is dependent on cost model accuracy.
      → Can amortize over executions with prepared stmts.
    • Choice #2: Dynamic Optimization
      → Select operator plans on-the-fly as queries execute.
      → Will have re-optimize for multiple executions.
      → Difficult to implement/debug (non-deterministic)
    • Choice #3: Hybrid Optimization
      → Compile using a static algorithm.
      → If the error in estimate > threshold, re-optimize
  • Prepared Statements
    • Choice #1: Re-Optimize
      → Rerun optimizer each time the query is invoked. → Tricky to reuse existing plan as starting point.
    • Choice #2: Multiple Plans
      → Generate multiple plans for different values of the parameters (e.g., buckets).
    • Choice #3: Average Plan
      → Choose the average value for a parameter and use that for all invocations.
  • Plan Stability
    • Choice #1: Hints
      → Allow the DBA to provide hints to the optimizer.
    • Choice #2: Fixed Optimizer Versions
      → Set the optimizer version number and migrate queries one-by-one to the new optimizer.
    • Choice #3: Backwards-Compatible Plans
      → Save query plan from old version and provide it to the new DBMS.

      OPTIMIZATION SEARCH STRATEGIES

  • Heuristics(Ingres no join support)
    • Define static rules that transform logical operators to a physical plan.
      → Perform most restrictive selection early
      → Perform all selections before joins
      → Predicate/Limit/Projection pushdowns
      → Join ordering based on cardinality
    • Advantages:
      → Easy to implement and debug.
      → Works reasonably well and is fast for simple queries.
    • Disadvantages:
      → Relies on magic constants that predict the efficacy of a planning decision.
      → Nearly impossible to generate good plans when operators have complex inter-dependencies.
  • Heuristics + Cost-based Join Order Search(System R)
    • Use static rules to perform initial optimization.
    • Then use dynamic programming to determine the best join order for tables.
      → This is the first cost-based query optimizer
      → Bottom-up planning (forward chaining) using a divide and-conquer search method
    • System R did this:
      • Break query up into blocks and generate the logical operators for each block.
      • For each logical operator, generate a set of physical operators that implement it.
      • All combinations of join algorithms and access paths
      • Then iteratively construct a “left-deep” tree that minimizes the estimated amount of work to execute the plan
    • Advantages:
      → Usually finds a reasonable plan without having to perform an exhaustive search.
    • Disadvantages:
      → All the same problems as the heuristic-only approach.
      → Left-deep join trees are not always optimal.(sometimes bush tree may be better)
      → Have to take in consideration the physical properties of data in the cost model (e.g., sort order).
  • Randomized Algorithms(Postgres’ genetic algorithm)
    • Perform a random walk over a solution space of all possible (valid) plans for a query.
    • Continue searching until a cost threshold is reached or the optimizer runs for a particular length of time.
    • SIMULATED ANNEALING Example:
      • Start with a query plan that is generated using the heuristic-only approach.
      • Compute random permutations of operators (e.g.,swap the join order of two tables)
        → Always accept a change that reduces cost
        → Only accept a change that increases cost with some probability.
        → Reject any change that violates correctness (e.g., sort ordering)
    • Advantages:
      → Jumping around the search space randomly allows the optimizer to get out of local minimums.
      → Low memory overhead (if no history is kept).
    • Disadvantages:
      → Difficult to determine why the DBMS may have chosen a particular plan.
      → Have to do extra work to ensure that query plans are deterministic.
      → Still have to implement correctness rules.
  • Problem of these approaches
    • Writing query transformation rules in a procedural language is hard and error-prone.
      → No easy way to verify that the rules are correct without running a lot of fuzz tests.
      → Generation of physical operators per logical operator is decoupled from deeper semantics about query.
    • A better approach is to use a declarative DSL to write the transformation rules and then have the optimizer enforce them during planning.
    • Use a rule engine that allows transformations to modify the query plan operators.
    • The physical properties of data is embedded with the operators themselves.
  • Stratified Search(Planning is done in multiple stages)
    • Better implementation of the System R optimizer that uses declarative rules.
    • Stage #1: Query Rewrite(a new logical plan)
      → Compute a SQL-block-level, relational calculus-like representation of queries.
    • Stage #2: Plan Optimization(as before)
      → Execute a System R-style dynamic programming phase once query rewrite has completed.
    • Advantages:
      → Works well in practice with fast performance.
    • Disadvantages:
      → Difficult to assign priorities to transformations
      → Some transformations are difficult to assess without computing multiple cost estimations.
      → Rules maintenance is a huge pain.
  • Unified Search(Perform query planning all at once)
    • General purpose cost-based query optimizer, based on equivalence rules on algebras.
      → Easily add new operations and equivalence rules.
      → Treats physical properties of data as first-class entities during planning.
      → Top-down approach (backward chaining) using branch-and-bound search.
    • Advantages:
      → Use declarative rules to generate transformations.
      → Better extensibility with an efficient search engine.Reduce redundant estimations using memoization.
    • Disadvantages:
      → All equivalence classes are completely expanded to generate all possible logical operators before the optimization search.
      → Not easy to modify predicates.