Performance evaluation with benchmarks has a long-standing tradition in the database industry, and it has become customary to use benchmarks to underline the usefulness of many an idea in research papers over the past decades. Besides testing a system with end-to-end scenarios, most benchmarks are also suitable for demonstrating performance enhancements of individual components, such as storage or query execution, in isolation. However, none of these benchmarks, I would argue, are any good at assessing the query optimizer.
There is probably no other single component in a database system on which vendors have spent more research and development dollars than the query optimizer. And yet, there is no dedicated optimizer benchmark that helps judge their return on investment. Granted, in many cases a very good or a very bad optimizer will not go unnoticed even in standard benchmarks (take a moment to compare the official TPC-H reports for individual query times across different systems and you’ll see what I mean). But apart from the obvious cases, it remains difficult even for the expert to tell whether the optimizer did a great job for a given query.
I’ve been toying with the idea for an optimizer benchmark over the past years and tried to encourage researchers in industrial and academic labs to give it a thought—unfortunately, after great initial excitement everybody seemed to have arrived at the conclusion that it was just ‘too difficult’. I’m not one to give up quickly, so, let’s take a shot at it!
First, we need to come to an understanding of what we want to get out of such an exercise. Well, here's a list of attributes that I would like to see associated with this benchmark:
The score of the benchmark has to be a simple value. Multi-dimensional measures, i.e., vectors of individual (maybe even unrelated) measures, are difficult to compare. One of the strengths of the TPC benchmarks is just that: a single score.
The score should express a strong and semantically clear assessment. Ideally, the score relates to a simple and tangible concept of goodness, e.g., the number of mistakes the optimizer has made. A great counterexample would be, say, the error margin in cardinality estimation: that's clearly an important measure but it doesn't tell me (a) whether I can do better at all, and (b) if so, how?
The benchmark should be agnostic of the particular optimizer technology used. We would like to be able to compare different optimization algorithms and strategies. I don’t care whether your optimizer uses heuristics or is cost-based or a smart hybrid—as long as it does generate great plans you should get a great score!
A query optimizer tries to generate the optimal plan for a given target platform, i.e., a combination of execution and storage components, among other things. In general, no two target platforms are the same. For example, execution engines differ in the operators they support, storage engines differ in performance for scans and look-ups, etc. The score must take into account that every optimizer will optimize for a different target platform.
If you thought the previous two were already hard to reconcile, you’ll enjoy this one: ideally the benchmark can be applied to any query optimizer. If special hooks are required, they should be general enough that any optimizer can be expected to provide these.
The benchmark should work for individual queries. This will enable implementers to troubleshoot specific optimizer problems that are otherwise buried in larger workloads or hidden by interactions of other components.
The benchmark, or an extension of it, should enable application-specific comparisons. That is, it should apply to any workload in addition to individual queries. This will enable operators and users to assess whether an optimizer is suitable for a specific application scenario.
The score should be straightforward to compute and lend itself easily to automation in test suites and daily reporting within a development organization.
Finally a word of caution: a lot of people seem to get suckered into what I would call the ‘intuitive stuff’: trying to argue about plans being ‘acceptable’ if their anticipated cost is within X times the optimal and so on. I know, it’s very tempting but I’m afraid it will not get us any closer to a scientific method. So, we’ll try to stay away from these.
I know the list above is quite a tall order! For now, I leave you with this set of desiderata, and will present a few ideas over the next couple of months.
There is one simple score you can use, though it is probably not the one you had in mind: optimize the query and execute it. The sum of optimization time and execution time is your score.
Of course this is completely beside the point of this benchmark, it benchmarks the execution engine too, etc., but if you think about it, it is the only measure that makes sense. Relying on query optimizer estimates is inherently wrong, because 1) they are always off, at least somewhat, 2) it is the job of the optimizer to get them right, so you cannot accept them as ground truth, and 3) if the optimizer did its job, it will find the best plan according to its cost model anyway! Now I know that in many systems 3) is not true, but there are optimizers out there that, at least for a very wide class of queries, manage 3), and then a comparison based upon optimizer estimates makes no sense.
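As a strawman, here is roughly what that score looks like in code. Note that `optimize` and `execute` are hypothetical stand-ins, not the API of any real system:

```python
import time

# Hypothetical stand-ins for a real system's optimizer and executor.
def optimize(sql):
    # Pretend to enumerate join orders, pick access paths, etc.
    return {"plan": f"chosen plan for: {sql}"}

def execute(plan):
    # Pretend to run the plan; a real call would return rows.
    return 42

def combined_score(sql):
    """Score = optimization time + execution time (lower is better)."""
    t0 = time.perf_counter()
    plan = optimize(sql)
    t_opt = time.perf_counter() - t0

    t1 = time.perf_counter()
    execute(plan)
    t_exec = time.perf_counter() - t1

    return t_opt + t_exec

score = combined_score("SELECT * FROM lineitem")
```

The appeal is that the measure is grounded in wall-clock reality rather than in the optimizer's own cost units.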
If you really want a portable, query-optimizer-only score, you could look at the generated plans, and compute the data flow (i.e., the number of tuples processed). It is a very crude cost function, but has the advantage of being deterministic, largely unaffected by engine design, and system independent. And it is often surprisingly good as a rough cost estimator. Not great, but at least it allows for reasonable comparisons between systems.
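To make that concrete, here is a minimal sketch of the data-flow score: assuming each plan operator exposes an estimated output cardinality, the score is just the sum of tuples produced over the whole tree. The operator names and cardinalities below are made up for illustration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Operator:
    name: str
    out_rows: int                          # output cardinality of this operator
    children: List["Operator"] = field(default_factory=list)

def data_flow(op: Operator) -> int:
    """Total tuples processed: sum of every operator's output cardinality."""
    return op.out_rows + sum(data_flow(c) for c in op.children)

# A toy plan: aggregate over a join of two scans.
plan = Operator("agg", 10, [
    Operator("join", 1_000, [
        Operator("scan_orders", 100_000),
        Operator("scan_customer", 5_000),
    ])
])
# data_flow(plan) == 10 + 1_000 + 100_000 + 5_000 == 106_010
```

Because the score depends only on the plan shape and cardinalities, two different systems producing the same logical plan get the same score.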
Personally I would still prefer the sum of query optimization time and execution time, though.
I agree, using the combination of optimization time and execution time is beside the point of an optimizer benchmark: a stellar optimizer gets dinged if the executor is a dog and, conversely, a bad optimizer gets a high score just because other components can make up for the damage! It's clearly not an optimizer benchmark; let's forget about that one.
Using estimates for a benchmark is probably not much better: hardly portable (most optimizers use imaginary cost units) and very technology-specific. One might argue that we can use estimation accuracy as a benchmark, but that doesn't go far enough. The optimizer might have all the estimates right but still choose a bad plan because it does not know how to exploit certain algebraic properties of the intermediates, e.g., does not know how to push an aggregate below a join. So, this won't work either.
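To illustrate why accuracy alone doesn't cut it, here is a back-of-the-envelope comparison (all cardinalities invented) of a join-then-aggregate plan against one with the aggregate pushed below the join. Even with perfect estimates, an optimizer that cannot perform this rewrite is stuck with the more expensive plan:

```python
# Toy tuple counts for an aggregate grouped by order key over TPC-H-like
# tables; all cardinalities are made up for illustration.
lineitem = 6_000_000
orders = 1_500_000

# Plan A: join first, then aggregate.
#   scan lineitem + scan orders + join output + final groups
flow_join_first = lineitem + orders + 6_000_000 + 1_500_000

# Plan B: push the aggregate below the join (eager aggregation):
#   scan lineitem + pre-aggregated partials + scan orders + join output
flow_agg_first = lineitem + 1_500_000 + orders + 1_500_000

# Both plans compute the same result, but Plan B moves fewer tuples around,
# a difference that an estimation-accuracy score would never reward.
```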
The third suggestion is an interesting alternative that comes up often in this discussion. I assume the suggestion is to consider lower data flow superior? It has its own pitfalls though: consider the case of a sort-based vs. a hash aggregate. The sort-based agg will always appear as having greater data flow when an additional sort operator is needed and yet it may outperform the hash agg when dealing with exceedingly large numbers of groups or heavy skew. Seems we can take that one off the list of interesting candidates too.
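A quick back-of-the-envelope sketch of that pitfall, with invented tuple counts:

```python
# A toy aggregation over 1M input rows producing 800k groups.
input_rows = 1_000_000
groups = 800_000

# Hash aggregate: scan -> hash agg
hash_flow = input_rows + groups

# Sort-based aggregate: scan -> sort -> streaming agg
# (the sort operator re-emits every input row)
sort_flow = input_rows + input_rows + groups

# By data flow, the sort-based plan always looks worse, even though with
# 800k groups the hash table may spill and lose in practice.
```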
Keep them coming!
I think you might be a bit rash in dismissing the data-flow measure. The question is, what do you want to measure? Do you want to see if the optimizer gets the big picture right, or do you want to see if it finds precisely the optimal plan? If you are aiming at the one and only optimal plan, then indeed data flow is too inaccurate.
On the other hand, if you think about your hash vs. sort example, how big is the impact of this decision? Of course this depends on the robustness of the runtime system, whether it has a safe fall-back at runtime, etc., but I think a factor of 2-3 is probably a conservative upper bound. A factor of 2 is noticeable, but other optimization decisions can easily add a factor of 10.
And a very fine-grained benchmark goal has other downsides, in particular if you do not intend to execute the queries. The more fine-grained you are, the more brittle your model becomes. How can you reliably model the propagation of data skew across operators? And not only in the broad picture (e.g., effects on cardinality, which is already difficult), but so fine-grained that you can reliably notice the cross-over point for hash vs. sort. I think this will be very, very difficult, and also not desirable. You will never have this accuracy for your real data, either!
I would say, rather aim at the big picture, and make sure that your runtime system can handle the difficult-to-model cases reasonably gracefully. For example, in the presence of large and/or heavily skewed data, the hash-based group-by operator will not have stellar performance, but it can be implemented in a way that handles even these cases reasonably.
Good point! I think we’re getting somewhere.
I might not have stressed this in the original text enough: we’re trying to find a scientific method for comparing query optimizers.
I’m afraid things like the concept of an ‘optimal plan’, as you mentioned, might be misleading (and I shall explain why shortly). We also need to look beyond how we can gloss over optimizer mistakes using another component. The latter is interesting in its own right and for me as an implementer certainly daily reality. But I’m not giving up on the idea of a scientifically principled approach just yet.
Why not view the physics department as an inspiration: make a theoretical model first, try to reconcile it with reality afterwards. Not sure how far this will get us but it seems like an interesting start!