getting the basics right

Recently, a guest speaker at our Engineering Colloquium — a weekly institution at Greenplum — presented an elaborate scheme to improve the performance of a general purpose query processor. In short, the approach consisted of a couple of interesting although not necessarily fundamentally new ideas. At some point in the presentation our guest showed a comparison of his method with a well-known commercial database system using TPC-H’s Q1 and Q6. And, you guessed it, the difference in query performance was marked!

A number of audience members later mentioned that they found the comparison discredited the actual research rather than underlined its value. What had happened? Did the presenter not know the difference between a database system and a C program? Or, if his numbers were so convincing, why did he not get through to folks who, of all people, should be able to assess the benefits of his system quite accurately? Worse still, this was not the first time this sort of thing had happened and, judging from discussions with colleagues in the industry, it is not specific to this audience either.

The answer is as simple as it is profound: the two sides had utterly different understandings of what a database system is and how to characterize it. So, let’s start there and ask the obvious and ultimately fundamental question: What is a database system, really? And no, this is not a trick question. Come to think of it, it is actually a pretty good starting point for our journey.

So, let’s see. Depending on your profession, your experience with databases or your research interests, you will probably enumerate a large number of theoretical concepts and describe a database system as the sum of these: from ACID to WAL, from query languages to query execution, and so on. Not far from the truth. But taking into account the way database systems are used, the types of problems users try to solve, and the economic constraints IT is typically subject to, I would like to propose a much broader definition:

A database system is a cost-effective compromise between otherwise conflicting requirements.

The vagueness in this definition is intended. In particular, the costs as well as the requirements are an expression of the context in which the system is to be applied and what’s important to the users in a given application scenario.

General purpose database systems attempt to be an average for a very large number of scenarios and customers. Specialized ones have the luxury of ignoring certain use cases and focusing on what their implementers believe are the more interesting ones, either because they are more fun to work on, because there is more money to be made or, as the case may be, both. With growing popularity and an ever-increasing user base, the number of conflicting requirements that need to be reconciled typically increases and leads to less and less optimal compromises. Pretty much all commercial database systems are a testament to this evolution, like it or not.

Back to the beginning of our story. Our visitor’s mistake was primarily to consider performance the only requirement for a query executor. More specifically, he considered only a subset of queries (hash joins are generally considered more fun to implement than sort-merge), special characteristics of data sets (“everything fits in memory”), and so on.

More importantly, our mistake (and by “us” I mean researchers in the industry) is that we fail to articulate many of the intricate and less obvious requirements and their interplay, and to explain why it should be an exciting challenge to solve these not in isolation but in combination with other requirements. After all, it is the hard problems that we love to work on. So, let’s get the word out.

3 Responses to getting the basics right

Marcin Zukowski says:

September 1, 2010 at 6:22 am

Hi Florian,

I am very happy to be the first commenter on your blog.

Interesting post, reminds me of our visit at GP a few years back. I wonder if your team thought something similar about our research back then :)

I guess it’s a natural implication of your discussion of the compromise between conflicting requirements that we now see quite a number of specialized engines doing fewer things better. At the same time, I believe it’s not only the conflicting requirements making various aspects of ‘elephant’ databases suboptimal. The progress in research, improvements in hardware, and changes of requirements simply allow and require a level of change that is extremely hard for these systems. Hence, new players not only can focus on fewer aspects of the system, but they can also try new approaches that haven’t been tried before. So, perhaps your visitor was an extreme case of that – very narrow but cutting edge?

Anyway, keep it coming, we’ll be watching!

Best,
Marcin

Joe Hellerstein says:

September 1, 2010 at 9:33 pm

Great post, Florian! What do you make of the tall guy who’s been running around saying that “one size doesn’t fit all”, and special-purpose solutions are the next wave? Do you think that message is all wet? Is there common ground with your point above?

- flw says:
  
  September 4, 2010 at 4:02 pm
  
  Good point, Joe — and a subject that we need to discuss in more depth; after all, my team works in a specialized field for a reason.
  I will get back to this after a couple of posts that are in the pipeline at the moment!