Recently, a guest speaker at our Engineering Colloquium, a weekly institution at Greenplum, presented an elaborate scheme to improve the performance of a general-purpose query processor. In short, the approach combined a couple of interesting, although not necessarily fundamentally new, ideas. At some point in the presentation our guest compared his method with a well-known commercial database system using TPC-H’s Q1 and Q6. And, you guessed it, the difference in query performance was marked!
A number of audience members later mentioned that they found the comparison discredited the actual research rather than underlined its value. What had happened? Did the presenter not know the difference between a database system and a C program? Or, if his numbers were so convincing, why did he not get through to folks who, of all people, should be able to assess the benefits of his system quite accurately? Worse still, this is not the first time this sort of thing has happened and, judging from discussions with colleagues in the industry, it is not specific to this audience either.
The answer is as simple as it is profound: the two sides had utterly different understandings of what a database system is and how to characterize it. So, let’s take this as our cue and ask the obvious and ultimately fundamental question: What is a database system, really? And no, this is not a trick question. Come to think of it, it is actually a pretty good starting point for our journey.
So, let’s see. Depending on your profession, your experience with databases, or your research interests, you will probably enumerate a large number of theoretical concepts and describe a database system as the sum of these: from ACID to WAL, from query languages to query execution, and so on. That is not far from the truth. But taking into account the way database systems are used, the types of problems users try to solve, and the economic constraints IT is typically subject to, I would like to propose a much broader definition:
A database system is a cost-effective compromise between otherwise conflicting requirements.
The vagueness in this definition is intended. In particular, the costs as well as the requirements are an expression of the context in which the system is to be applied and what’s important to the users in a given application scenario.
General-purpose database systems attempt to be an average for a very large number of scenarios and customers. Specialized ones have the luxury of ignoring certain use cases and focusing on what their implementers believe are the more interesting ones, either because they are more fun to work on, because there is more money to be made, or, as the case may be, both. With growing popularity and an ever-increasing user base, the number of conflicting requirements that need to be reconciled typically increases and leads to less and less optimal compromises. Pretty much all commercial database systems are testament to this evolution, like it or not.
Back to the beginning of our story. Our visitor’s mistake was primarily to consider performance the only requirement for a query executor. More specifically, he considered performance only for a subset of queries (hash joins are generally considered more fun to implement than sort-merge joins) and for data sets with special characteristics (“everything fits in memory”), among other simplifications.
More importantly, our mistake, and by “us” I mean researchers in the industry, is that we fail to articulate appropriately the many intricate and less obvious requirements and their interplay, and to explain why it should be an exciting challenge to solve them not in isolation but in combination with other requirements. After all, it is the hard problems that we love to work on. So, let’s get the word out.