17 Cost Models

Cost-based query planning

  • Generate an estimate of the cost of executing a particular query plan for the current state of the database.
    → Estimates are only meaningful internally.
  • This is independent of the search strategies that we talked about last class (heuristics / randomized search / stratified search / unified search).

    Cost Model Components (how to build a cost model)

  • Choice #1: Physical Costs
    → Predict CPU cycles, I/O, cache misses, RAM consumption, pre-fetching, etc…
    → Depends heavily on hardware.
  • Choice #2: Logical Costs
    → Estimate result sizes per operator.
    → Independent of the operator algorithm.
    → Need estimations for operator result sizes.
  • Choice #3: Algorithmic Costs
    → Complexity of the operator algorithm implementation.
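
    To make the three choices concrete, here is a hedged sketch of how the same plan fragment could be costed logically, algorithmically, and physically. All tuple counts, page sizes, and weights below are made up for illustration and are not taken from any particular DBMS.

        import math

        # Illustrative numbers only: the tuple counts, page size, and weights are made up.
        N = 1_000_000                   # logical cost: estimated number of input tuples

        # Algorithmic cost: depends on the operator's algorithm, not on the hardware.
        sort_cost = N * math.log2(N)    # comparison sort ~ O(n log n) tuple touches
        hash_agg_cost = N               # hash aggregation ~ O(n) tuple touches

        # Physical cost: depends on the hardware; weight I/O and CPU with tuned constants.
        PAGE_IO_COST, CPU_TUPLE_COST = 1.0, 0.01    # hypothetical weights
        pages = N // 100                            # assume ~100 tuples per page
        physical_scan_cost = pages * PAGE_IO_COST + N * CPU_TUPLE_COST

        print(sort_cost, hash_agg_cost, physical_scan_cost)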

    Disk-based cost model

  • The number of disk accesses will always dominate the execution time of a query.
    → CPU costs are negligible.
    → Have to consider sequential vs. random I/O.
  • This is easier to model if the DBMS has full control over buffer management.
    → We will know the replacement strategy, pinning, and assume exclusive access to disk.

    Postgres Example

  • Uses a combination of CPU and I/O costs that are weighted by “magic” constant factors.
  • Default settings are obviously for a disk-resident database without a lot of memory:
    → Processing a tuple in memory is 400x faster than reading a tuple from disk.
    → Sequential I/O is 4x faster than random I/O.
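
    A minimal sketch of this weighting scheme. The constants are PostgreSQL's documented default planner settings (seq_page_cost, random_page_cost, cpu_tuple_cost); the cost functions themselves are simplified illustrations, not PostgreSQL's actual formulas.

        # PostgreSQL's default planner constants (documented GUCs).
        SEQ_PAGE_COST = 1.0      # cost of a sequentially-fetched page
        RANDOM_PAGE_COST = 4.0   # random page fetch: 4x a sequential one
        CPU_TUPLE_COST = 0.01    # processing one tuple: 400x cheaper than a random page read

        def seq_scan_cost(num_pages: int, num_tuples: int) -> float:
            # Simplified: the real planner also charges per-operator and per-index-tuple costs.
            return num_pages * SEQ_PAGE_COST + num_tuples * CPU_TUPLE_COST

        def index_scan_cost(num_random_pages: int, num_tuples: int) -> float:
            return num_random_pages * RANDOM_PAGE_COST + num_tuples * CPU_TUPLE_COST

        # Full scan of a 10,000-page table vs. an index scan touching 500 random pages.
        print(seq_scan_cost(10_000, 1_000_000))   # 10,000 + 10,000 = 20,000
        print(index_scan_cost(500, 5_000))        # 2,000 + 50 = 2,050

    With these weights, an index scan that touches only a small fraction of the pages can still beat a full sequential scan, which is exactly the trade-off the constants encode.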

    IBM DB2 Example

  • Database characteristics in system catalogs
  • Hardware environment (microbenchmarks)
  • Storage device characteristics (microbenchmarks)
  • Communications bandwidth (distributed only)
  • Memory resources (buffer pools, sort heaps)
  • Concurrency Environment
    → Average number of users
    → Isolation level / blocking
    → Number of available locks

    In memory DB cost model

  • No I/O costs, but now we have to account for CPU and memory access costs.
  • Memory costs are harder to model because the DBMS has no control over cache management.
    → Unknown replacement strategy, no pinning, shared caches, non-uniform memory access.
  • The number of tuples processed per operator is a reasonable estimate for the CPU cost.

    Smallbase (an in-memory DBMS) cost model

  • Two-phase model that automatically generates hardware costs from a logical model.
  • Phase #1: Identify Execution Primitives
    → List of operations the DBMS performs when executing a query.
    → Examples: evaluating a predicate, probing an index, sorting.
  • Phase #2: Microbenchmark
    → On start-up, profile ops to compute CPU/memory costs
    → These measurements are used in formulas that compute operator cost based on table size.
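
    A hedged sketch of the two-phase idea in Python. The primitive being profiled and the resulting cost formula are invented for illustration; they are not Smallbase's actual primitives or formulas.

        import time

        def benchmark_predicate_eval(n: int = 1_000_000) -> float:
            """Phase #2: microbenchmark one execution primitive; return seconds per tuple."""
            data = list(range(n))
            start = time.perf_counter()
            matches = sum(1 for x in data if x % 7 == 0)   # the profiled primitive
            elapsed = time.perf_counter() - start
            assert matches > 0                              # keep the measured work observable
            return elapsed / n

        # Measured once at start-up, then reused by the cost formulas.
        COST_PER_PREDICATE_EVAL = benchmark_predicate_eval()

        def filter_cost(table_size: int) -> float:
            # Operator cost formula parameterized by the measured primitive cost and table size.
            return table_size * COST_PER_PREDICATE_EVAL

        print(filter_cost(10_000_000))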

    Cost Estimation

    Selectivity

  • The selectivity of an operator is the percentage of data accessed for a predicate.
    → Modeled as the probability that a predicate will be satisfied for any given tuple.
  • The DBMS estimates selectivities using:
    → Domain Constraints
    → Precomputed Statistics (Zone Maps)
    → Histograms / Approximations
    → Sampling
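
    For example, with catalog statistics such as a column's number of distinct values and its min/max, the textbook uniformity-based formulas look like the following sketch (all statistics here are made up):

        def sel_equals(num_distinct: int) -> float:
            # P(col = constant), assuming values are uniformly distributed over the domain
            return 1.0 / num_distinct

        def sel_range(lo: float, hi: float, col_min: float, col_max: float) -> float:
            # P(lo <= col <= hi), assuming values are uniformly spread between min and max
            return max(0.0, min(hi, col_max) - max(lo, col_min)) / (col_max - col_min)

        # e.g. WHERE status = 'open' AND age BETWEEN 30 AND 40
        # Independence assumption: multiply the per-predicate selectivities.
        sel = sel_equals(num_distinct=5) * sel_range(30, 40, col_min=18, col_max=80)
        print(sel)   # ~0.032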

    IBM DB2 Learning Optimizer (did not work as expected in practice)

  • Update table statistics as the DBMS scans a table during normal query processing.
  • Check whether the optimizer’s estimates match what it encounters in the real data and incrementally update them.

    Approximation

  • Maintaining exact statistics about the database is expensive and slow.
  • Use approximate data structures called sketches to generate error-bounded estimates.
    → Count Distinct
    → Quantiles
    → Frequent Items
    → Tuple Sketch
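
    As a toy example of the "frequent items" family, here is a minimal Count-Min Sketch; a production system would more likely use a library such as Apache DataSketches, and the width/depth parameters below are arbitrary.

        import random

        class CountMinSketch:
            """Toy frequent-items sketch: estimates counts with a bounded overestimate."""

            def __init__(self, width: int = 1024, depth: int = 4, seed: int = 42):
                rng = random.Random(seed)
                self.width = width
                self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]
                self.table = [[0] * width for _ in range(depth)]

            def _buckets(self, item):
                return [hash((s, item)) % self.width for s in self.seeds]

            def add(self, item):
                for row, b in zip(self.table, self._buckets(item)):
                    row[b] += 1

            def estimate(self, item) -> int:
                # Never underestimates; the error bound depends on width and depth.
                return min(row[b] for row, b in zip(self.table, self._buckets(item)))

        cms = CountMinSketch()
        for value in ["a", "b", "a", "c", "a"]:
            cms.add(value)
        print(cms.estimate("a"))   # 3 (possibly higher if hash collisions occur)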

    Sampling

  • Execute a predicate on a random sample of the target data set.
  • The # of tuples to examine depends on the size of the table.
  • Approach #1: Maintain Read-Only Copy
    → Periodically refresh to maintain accuracy.
  • Approach #2: Sample Real Tables
    → Use READ UNCOMMITTED isolation.
    → May read multiple versions of the same logical tuple.
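
    A hedged sketch of the basic idea: evaluate the predicate on a random sample and extrapolate the selectivity to the full table. The table, predicate, and sample size below are made up; in a DBMS the sample would come from pages on disk or a maintained sample table.

        import random

        table = [{"age": random.randint(18, 80)} for _ in range(1_000_000)]

        def predicate(row) -> bool:
            return row["age"] > 65

        sample = random.sample(table, k=10_000)     # sample size scales with table size
        selectivity = sum(predicate(r) for r in sample) / len(sample)
        estimated_matches = selectivity * len(table)
        print(selectivity, round(estimated_matches))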

    Cardinality

    The number of tuples that will be generated per operator is computed from its selectivity multiplied by the number of tuples in its input.

  • Assumption #1: Uniform Data
    → The distribution of values (except for the heavy hitters) is the same.
  • Assumption #2: Independent Predicates
    → The predicates on attributes are independent.
  • Assumption #3: Inclusion Principle
    → The domains of the join keys overlap such that each key in the inner relation also exists in the outer table.
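
    Under these three assumptions the standard estimation formulas reduce to simple arithmetic; the sketch below uses made-up statistics for two tables R and S joined on attribute a.

        def scan_cardinality(input_tuples: int, selectivity: float) -> float:
            # output cardinality = selectivity * input cardinality
            return selectivity * input_tuples

        def join_cardinality(card_r: int, card_s: int, ndv_r: int, ndv_s: int) -> float:
            # Uniformity + independence + inclusion:
            #   |R join S| ~= (|R| * |S|) / max(V(R, a), V(S, a))
            return card_r * card_s / max(ndv_r, ndv_s)

        print(scan_cardinality(1_000_000, 0.05))                    # 50,000 tuples pass the filter
        print(join_cardinality(1_000_000, 50_000, 10_000, 5_000))   # 5,000,000 estimated matches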

    column group statistics

  • The DBMS can track statistics for groups of attributes together rather than just treating them all as independent variables.
    → Only supported in commercial systems.
    → Requires the DBA to declare them manually.
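
    A small worked example of why this matters, using hypothetical statistics for a correlated (city, state) pair: treating the columns as independent multiplies two selectivities and badly underestimates, while a column-group statistic on the pair gets close to the true count.

        # Hypothetical statistics for a 10M-row table with correlated city/state columns.
        total_rows = 10_000_000
        sel_city = 1 / 50_000      # 1 / NDV(city), uniformity assumption
        sel_state = 1 / 50         # 1 / NDV(state)

        # Independence assumption: multiply the selectivities. Far too low, because every
        # row matching city = 'Pittsburgh' also matches state = 'PA'.
        independent_estimate = total_rows * sel_city * sel_state      # 4 rows

        # Column-group statistic on (city, state): NDV of the pair is ~NDV(city).
        group_estimate = total_rows * (1 / 50_000)                    # 200 rows

        print(independent_estimate, group_estimate)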

    Estimator Quality (a benchmark: generate estimates, then verify against real executions)

  • Evaluate the correctness of cardinality estimates generated by DBMS optimizers as the number of joins increases.
    → Let each DBMS perform its stats collection.
    → Extract measurements from query plan.
  • Compared five DBMSs using 100k queries.

    lessons learned

  • Query optimization is more important than a fast execution engine.
    → Cost-based join ordering is necessary.
    → Cardinality estimates are routinely wrong.
    → Try to use operators that do not rely on estimates.
  • Hash joins + sequential scans are a robust execution model.
    → The more indexes that are available, the more brittle the plans become (but they are also faster on average).
  • Working on accurate cost models is a waste of time.
    → Better to improve cardinality estimation instead.

    Working with a large code base

    Passive reading will not help

  • Work with the code.

    Test case

  • Write test cases
    • They will not break production code.
    • They force you to work with the code base.
    • Helpful (assuming nobody complains about you writing test cases).

      refactoring

  • Possible aspects:
    • Comments (add missing ones, remove outdated ones)
    • Clean up messy code
    • Break out repeated logic

      get familiar with toolchain & process

  • If there’s no documentation, you can write some.