Performance evaluation with benchmarks has a long-standing tradition in the database industry, and over the past decades it has become customary to use benchmarks to underline the usefulness of many an idea in research papers. Besides testing a system with end-to-end scenarios, most benchmarks are also suitable for demonstrating performance enhancements of individual components, such as storage or query execution, in isolation. However, I would argue that none of these benchmarks is any good at assessing the query optimizer.
There is probably no other single component in a database system on which vendors have spent more research and development dollars than the query optimizer. And yet, there is no dedicated optimizer benchmark that helps judge their return on investment. Granted, in many cases a very good or a very bad optimizer will not go unnoticed even in standard benchmarks (take a moment to compare the official TPC-H reports for individual query times across different systems and you’ll see what I mean). But apart from the obvious cases, it remains difficult even for the expert to tell whether the optimizer did a great job for a given query.
I’ve been toying with the idea of an optimizer benchmark for the past few years and tried to encourage researchers in industrial and academic labs to give it a thought. Unfortunately, after great initial excitement, everybody seemed to arrive at the conclusion that it was just ‘too difficult’. I’m not one to give up quickly, so let’s take a shot at it!
First, we need to come to an understanding of what we want to get out of such an exercise. Well, here’s a list of attributes that I would like to see associated with this benchmark:
The score of the benchmark has to be a single, simple value. Multi-dimensional measures, i.e., vectors of individual (and possibly unrelated) measures, are difficult to compare. One of the strengths of the TPC benchmarks is just that: a single score.
The score should express a strong and semantically clear assessment. Ideally, the score relates to a simple and tangible concept of goodness, e.g., the number of mistakes the optimizer has made. A great counterexample would be, say, the error margin in cardinality estimation: that’s clearly an important measure, but it tells me neither (a) whether I can do better at all, nor (b) if so, how. (The sketch after this list makes this contrast concrete.)
The benchmark should be agnostic of the particular optimizer technology used. We would like to be able to compare different optimization algorithms and strategies. I don’t care whether your optimizer uses heuristics, is cost-based, or is a smart hybrid: as long as it generates great plans, you should get a great score!
A query optimizer tries to generate the optimal plan for a given target platform, i.e., a combination of execution and storage components, among other things. In general, no two target platforms are the same. For example, execution engines differ in the operators they support, storage engines differ in performance for scans and look-ups, etc. The score must take into account that every optimizer will optimize for a different target platform.
If you thought the previous two were already hard to reconcile, you’ll enjoy this one: ideally, the benchmark can be applied to any query optimizer. If special hooks are required, they should be general enough that any optimizer can reasonably be expected to provide them.
The benchmark should work for individual queries. This will enable implementers to troubleshoot specific optimizer problems that are otherwise buried in larger workloads or masked by the interaction of other components.
The benchmark, or an extension of it, should enable application-specific comparisons. That is, it should apply to any workload in addition to individual queries. This will enable operators and users to assess whether an optimizer is suitable for a specific application scenario.
The score should be straightforward to compute and lend itself easily to automation in test suites and daily reporting within a development organization.
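To make a few of these desiderata concrete, here is a minimal sketch in Python. It is an illustration of the shape such a benchmark could take, not a proposal: the hook names (run_chosen_plan, run_forced_plan, candidate_plans), the idea of forcing alternative plans (e.g., via hints), and the noise factor are all assumptions of mine, not part of any existing benchmark.

```python
"""A minimal sketch of three of the desiderata: a single score, a
tangible notion of 'mistakes', and optimizer-agnostic hooks. All names
and numbers are illustrative assumptions."""

from typing import Protocol


class OptimizerHooks(Protocol):
    """Hypothetical hooks a system under test would provide; the point is
    that any optimizer, heuristic or cost-based, could implement them."""

    def run_chosen_plan(self, query: str) -> float:
        """Execute the optimizer's plan of choice; return wall-clock seconds."""
        ...

    def run_forced_plan(self, query: str, plan_id: int) -> float:
        """Execute one forced alternative plan (e.g., via plan hints)."""
        ...

    def candidate_plans(self, query: str) -> list[int]:
        """Identifiers of the alternative plans that can be forced."""
        ...


def mistake_count(db: OptimizerHooks, queries: list[str],
                  noise: float = 1.1) -> int:
    """Single scalar score: the number of queries whose chosen plan ran
    more than `noise` times slower than the best alternative we could
    force on the same target platform. The factor merely absorbs
    run-to-run variance; it is not an 'acceptability' threshold."""
    mistakes = 0
    for q in queries:
        chosen = db.run_chosen_plan(q)
        best_alternative = min(
            (db.run_forced_plan(q, p) for p in db.candidate_plans(q)),
            default=chosen,  # no alternatives known: cannot call it a mistake
        )
        if chosen > noise * best_alternative:
            mistakes += 1
    return mistakes
```

Two things are worth pointing out. First, the score counts mistakes based on measured runtimes on the optimizer’s own target platform, not on anticipated costs; contrast this with, say, a vector of per-query cardinality estimation errors, which is multi-dimensional and tells us neither whether a better plan exists nor how to find one. Second, the sketch sweeps the genuinely hard part, producing good candidate plans to force, into a single hook; that is exactly the part that made everybody conclude this was ‘too difficult’.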
Finally, a word of caution: a lot of people seem to get suckered into what I would call the ‘intuitive stuff’: trying to argue that plans are ‘acceptable’ if their anticipated cost is within X times the optimal, and so on. I know it’s very tempting, but I’m afraid it will not get us any closer to a scientific method. So, we’ll try to stay away from all that.
I know the list above is quite a tall order! For now, I leave you with this set of desiderata, and will present a few ideas over the next couple of months.