Granularity of Locks and Degrees of Consistency
Advanced Topics in Computer Systems  
Joe Hellerstein & Timothy Roscoe

SQL Query Optimization


Given: A query joining n tables
The "Plan Space": Huge number of alternative, semantically equivalent plans.
The Perils of Error: Running time of plans can vary by many orders of magnitude
Ideal Goal: Map a declarative query to the most efficient plan tree.
Conventional Wisdom: You’re OK if you avoid the rotten plans.
Industrial State of the art: Most optimizers use System R technique and work "OK" up to about 10 joins.

Approach 1: The Optimization Oracle

(definitely not to be confused with the company of the same name)
You’d like to get the following information, but in 0 time:

Approach 2: Make Up a Heuristic & See if it Works

University INGRES

Oracle 6

OK,OK. Approach 3: Think!

Three issues:

Selinger & the System R crowd the first to do this right. The Bible of Query Optimization.

SQL Refresher

SELECT {DISTINCT} <list of columns>  
FROM <list of tables>
{WHERE <list of "Boolean Factors" (predicates in CNF)>}
{GROUP BY <list of columns>
{HAVING <list of Boolean Factors>}}
{ORDER BY <list of columns>};

Semantics are:

  1. take Cartesian product (a/k/a cross-product) of tables in FROM clause, projecting to only those columns that appear in other clauses
  2. if there’s a WHERE clause, apply all filters in it
  3. if there’s a GROUP BY clause, form groups on the result
  4. if there’s a HAVING clause, filter groups with it
  5. if there’s an ORDER BY clause, make sure output is in the right order
  6. if there’s a DISTINCT modifier, remove dups

Of course the plans don’t do this exactly; query optimization interleaves 1 & 2 into a plan tree. GROUP BY, HAVING, DISTINCT and ORDER BY are applied at the end, pretty much in that order.

Plan Space

All your favorite query processing algorithms:

Logical equivalences:

Note some assumptions folded in here:

Some other popular assumptions (System R)

Cost Estimation

The soft underbelly of query optimization.


Searching the Plan Space

The System R Optimizer’s Search Algorithm

Look only at left-deep plans: there are n! plans (not factoring in choice of join method)
Observation: many of those plans share common prefixes, so don’t enumerate all of them
Sounds like a job for … Dynamic Programming!

  1. Find all plans for accessing each base relation
  1. For each relation, save cheapest unordered plan, and cheapest plan for each "interesting order". Discard all others.
  2. Now, try all ways of joining all pairs of 1-table plans saved so far. Save cheapest unordered 2-table plans, and cheapest "interesting ordered" 2-table plans.
  1. Now try all ways of combining a 2-table plan with a 1-table plan. Save cheapest unordered and interestingly ordered 3-way plans. You can now throw away the 2-way plans.
  2. Continue combining k-way and 1-way plans until you have a collection of full plan trees
  3. At top, satisfy GROUP BY and ORDER BY either by using interestingly ordered plan, or by adding a sort node to unordered plan, whichever is cheapest.

Some additional details:


Simple variations to improve plan quality:

Subqueries 101: Selinger does a very complete job with the basics.

Extensible Query Optimization

In the 1980's and 1990's there was a lot of work on alternatives and extensions to the relational model (especially object-oriented DBs, later XML).  In the 90's and 2000's, there is work on dataflow engines for networking (Click, P2) that look a lot like query processors.

Challenge: design an extensible query optimizer that can be used for non-relational contexts.

We read the Volcano paper both because it is one of the 2 main examples of this work (the other being the optional paper on Starburst), and because it presents an alternative search strategy to Selinger.  The Volcano style is used in MS SQL Server (where Graefe now works), though it was based on the successor to Volcano, called Cascades.

Core ideas:
Some details:
Given these building blocks, Volcano works "top-down" as follows. It will maintain a hashtable of existing subplans (hashed on physical/logical properties) to memoize previous work. Some notes on this "top-down" exploration: Because of this pruning, top-down can look at less of the plan space than bottom-up if it gets lucky.

Parting thoughts: