A while back I had a very interesting conversation with Jack, an application developer at a large software company in the area, and thus, so to speak, a person on the other side of the database.
He maintains that database application programmers who have to support a number of database systems have long developed a kind of taxonomy of query optimizers: they know which systems have ‘good’ and which have ‘bad’ optimizers. They know on which systems queries need “tuning” and even to what extent. Apparently, there is at least one system that ‘gets it almost always right’, as Jack would put it, and lots that are way off by default. When asked, however, how he’d quantify these differences, Jack simply shrugged: ‘It’s something you develop over time; I can’t give you a number.’
We talked about measuring the quality of an optimizer in this post earlier on, where I put forward a number of desiderata for an optimizer benchmark.
In the meantime, as we built our very own optimizer, we studied pretty much every paper on optimizer design that we could get our hands on and, naturally, did a fair bit of competitive analysis. By talking with customers we could corroborate Jack’s assessment—if only anecdotally. One expression that came up over and over again, however, was the notion of ‘mistakes’: an optimizer makes a mistake if you can prove that the plan it chooses (we call that the best plan found) is actually not optimal. Database administrators around the world spend significant amounts of time hunting down these mistakes and trying to correct them by what they call query tuning. The process is as simple as it is tedious: virtually every optimizer exposes certain controls that let users affect the plan choice. They range from implementation choices for operators—e.g., whether a logical join is implemented as a hash-based join or a nested-loops variant—to more obtuse options such as the ever so mythical and incomprehensible ‘optimization level’ one product has become famous for.
For an administrator, troubleshooting a query means toggling and adjusting any of the optimizer controls at her disposal and checking whether this affected the plan and, if so, whether the new plan outperforms the old one. Whenever this happens, the administrator has basically exposed a defect in the optimizer. As a side effect, the administrator’s status within the office may be elevated to ‘hero’—especially if this wasn’t the first time she saved the day this way. (As an unintended consequence, administrators like to convince their CIOs to purchase systems with higher defect rates and thus higher job security and hero status.) Depending on the underlying problem, query tuning can make a significant difference to the performance of a query; and tuning this way is not limited to performance problems but can target resource consumption problems as well. Let’s stick with performance for the time being, as it is the most tangible of these problems.
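To make the tuning loop concrete, here is a minimal Python sketch of toggling a single optimizer control and timing the result. The switch name `enable_hashjoin` follows PostgreSQL’s naming convention; `run_query` is a hypothetical callable standing in for whatever connectivity layer would actually apply the settings and execute the query.

```python
import time

def timed_run(run_query, query, settings):
    """Run `query` once under the given switch settings and return the
    elapsed wall-clock time in seconds.  `run_query` is a hypothetical
    stand-in for a real connectivity layer that would, e.g., issue a
    PostgreSQL-style 'SET enable_hashjoin = off' before executing."""
    start = time.perf_counter()
    run_query(query, settings)
    return time.perf_counter() - start

def tune_one_switch(run_query, query, switch):
    """The administrator's loop for a single control: try both settings,
    time each, and report which one wins."""
    timings = {value: timed_run(run_query, query, {switch: value})
               for value in ("on", "off")}
    best = min(timings, key=timings.get)
    return best, timings
```

In practice one would of course re-run each configuration several times and compare medians; a single measurement is noisy.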
How can we use this to build some sort of objective measure?
A straightforward—and, in my view, overly simplistic—approach would be to take the cost the optimizer estimated and compare it against the actual execution cost. This seems convincing at first but is usually counter-productive: almost all optimizers use so-called imaginary cost values that do not correspond to actual execution time; it’s not even close. Given that the sole purpose of the cost value is to choose the best-suited alternative among plans for the very same query, one can argue, as is usually done, that as long as the right plan is chosen, the exact magnitude of the cost value is immaterial.
Another very intuitive and equally doomed approach is to assess an optimizer by what is perceived as a key ingredient, such as the number of rows forwarded between operators, i.e., the size of intermediate results. One might think the best optimizer is the one that produces plans with the smallest intermediate results and the minimal number of intermediates. Very often this measure is indeed strongly correlated with a plan’s optimality, but there are problems with this metric: modern query processors may forfeit optimality with regard to intermediate result size in order to prevent skew. A classic example is multi-stage aggregation in MPP databases, where the optimal plan does not, in general, have the smallest intermediate results.
Instead, we looked into generalizing the manual approach outlined above: given a set of optimizer switches, we force a number of different plans by enumerating different switch settings. This way we can usually generate tens if not hundreds of plans for any single query of non-trivial size. We rank these plans by their estimated cost: the best plan found is #1, the next best #2, and so on. Then we execute the same plans, measure their execution times, and construct a second ranking by actual execution time: the fastest is #1, and so on.
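The enumeration and double ranking can be sketched in a few lines of Python. Here `estimate_cost` and `measure_time` are hypothetical stand-ins for an `EXPLAIN`-style cost probe and an actual timed run—both assumptions for illustration, not a real API:

```python
from itertools import product

def rank_plans(switches, estimate_cost, measure_time):
    """Force one plan per combination of switch settings, then rank the
    combinations twice: by the optimizer's estimated cost and by the
    measured execution time.  Returns the list of settings plus both
    rankings as lists of indices into it (position 0 = rank #1)."""
    settings_list = [dict(zip(switches, values))
                     for values in product(("on", "off"), repeat=len(switches))]
    by_cost = sorted(range(len(settings_list)),
                     key=lambda i: estimate_cost(settings_list[i]))
    by_time = sorted(range(len(settings_list)),
                     key=lambda i: measure_time(settings_list[i]))
    return settings_list, by_cost, by_time
```

Different switch settings can, of course, yield the same plan; in practice duplicates would be collapsed (e.g., by comparing plan text) before ranking.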
For a perfect optimizer, the two rankings—estimates and actuals—would show the plans in exactly the same order. In reality, this never happens, but using these rankings we can define a metric that captures the number and severity of the mistakes the optimizer made—and, best of all, without ever understanding any of the intricacies of the system, the hardware underlying the experiment, or the semantics of the query!
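One simple way to condense the two rankings into a number—my illustration here, not necessarily the exact metric from the paper—is the normalized Kendall tau distance: the fraction of plan pairs that the estimated ranking orders differently from the actual ranking.

```python
def ranking_distance(estimated, actual):
    """Normalized Kendall tau distance between two rankings of the same
    plans.  0.0 means the optimizer ordered every pair of plans
    correctly; 1.0 means the estimated ranking is the exact reverse of
    the actual one."""
    position = {plan: i for i, plan in enumerate(actual)}
    n = len(estimated)
    # Count discordant pairs: plan pairs whose relative order disagrees
    # between the estimated ranking and the actual ranking.
    discordant = sum(1
                     for i in range(n)
                     for j in range(i + 1, n)
                     if position[estimated[i]] > position[estimated[j]])
    return discordant / (n * (n - 1) / 2)
```

To capture severity as well as count, each discordant pair could additionally be weighted by the gap in actual execution time between the two plans.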
My coworker Zhongxian Gu built some very nice automation around this, with a powerful abstraction of the actual database connectivity components and the systems’ optimizer switches. The result is a highly portable analysis tool that can be deployed on any database within a few hours and test an optimizer’s accuracy for any given workload in a fully automatic fashion. Building the system was a lot of fun, but actually running it against a couple of commercial database systems was even more eye-opening. The methodology and the results are documented in great detail in this paper, published at DBTest’12. The paper made quite a splash, and, as far as we can tell from direct inquiries and from conversations when interviewing candidates from those companies, the idea has since been taken up by a number of our competitors.
The major takeaways are, firstly, that optimizers do indeed differ substantially in their accuracy, corroborating Jack’s gut feel with actual numbers! Personally, I didn’t expect such huge differences, despite having been in this industry for a while…
On a subtler note, the differences are all the more impressive given the different development efforts that we suspect have gone into these products. Some optimizers were written by teams of fewer than 10 people; others have more than 100 engineers just maintaining a seemingly sophisticated and highly tuned codebase. And while I’m not going to disclose the identities of the systems tested, it seems that ‘money spent on development’ is not an indicator of quality, and neither is ‘experience’.
As you might have realized, this approach has a number of obvious limitations, and while it could be a key ingredient of the ever-elusive optimizer benchmark, it still doesn’t measure a couple of things we’d like a true benchmark to cover. Let me know what you think!