dbtest’11: preview of EMC/DCD paper

Dealing with plan regressions caused by changes to the query optimizer has been a royal pain for any development organization. Even if you manage to shield your customers from experiencing these in their end-user workloads with clever tricks, plan regressions are a major drain on resources during software development. Therefore, knowing up front whether a change in software (or configuration) is beneficial or detrimental is a huge win!

Together with my friends at Microsoft I’ve developed such an ‘early warning system’ for plan regressions based on relative orders within a plan space.
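
The paper has the details, but here is a toy sketch of the general flavor of the idea (the function name and the numbers are mine, not from the paper): compare the order the optimizer's cost estimates impose on a set of plans against the order of the plans' actual execution times, and look at the pairs on which the two orders disagree.

```python
from itertools import combinations

# Hypothetical sketch: a pair of plans is "discordant" if the optimizer
# ranks them one way but actual execution ranks them the other way.
# Discordant pairs are exactly where plan regressions can hide.
def discordant_pairs(estimated_costs, actual_times):
    plans = range(len(estimated_costs))
    return [
        (a, b)
        for a, b in combinations(plans, 2)
        if (estimated_costs[a] - estimated_costs[b])
         * (actual_times[a] - actual_times[b]) < 0
    ]

est = [10.0, 12.0, 50.0]  # the optimizer thinks plan 0 is cheapest
act = [3.0, 1.5, 9.0]     # but plan 1 actually runs fastest
print(discordant_pairs(est, act))  # [(0, 1)]
```

A perfectly calibrated optimizer would produce no discordant pairs at all, regardless of how its cost units relate to wall-clock time.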

The details are in this paper, accepted at this year’s DBTest workshop at SIGMOD where we’ll talk about optimizer testing in detail — yet another good reason to book your ticket for Athens quickly!

Posted in Uncategorized | 1 Comment

benchmarking query optimizers

Performance evaluation with benchmarks has a long-standing tradition in the database industry, and over the past decades it has become customary to use benchmarks to underline the usefulness of many an idea in research papers. Besides testing a system with end-to-end scenarios, most benchmarks are also suitable for demonstrating performance enhancements of storage, query execution, etc. in isolation. However, none of these benchmarks, I would argue, is any good at assessing the query optimizer.

There is probably no other single component in a database system on which vendors have spent more research and development dollars than the query optimizer. And yet, there is no dedicated optimizer benchmark that helps judge their return on investment. Granted, in many cases a very good or a very bad optimizer will not go unnoticed even in standard benchmarks (take a moment to compare the official TPC-H reports for individual query times across different systems and you’ll see what I mean). But apart from the obvious cases, it remains difficult even for the expert to tell whether the optimizer did a great job for a given query.

I’ve been toying with the idea for an optimizer benchmark over the past years and tried to encourage researchers in industrial and academic labs to give it a thought—unfortunately, after great initial excitement everybody seemed to have arrived at the conclusion that it was just ‘too difficult’. I’m not one to give up quickly, so, let’s take a shot at it!

First, we need to come to an understanding what we want to get out of such an exercise. Well, here’s a list of attributes that I would like to see associated with this benchmark:

single score
The score of the benchmark has to be a simple value. Multi-dimensional measures—i.e., vectors of individual (maybe even unrelated) measures—are difficult to compare. One of the strengths of the TPC benchmarks is just that: a single score.

semantically meaningful
The score should express a strong and semantically clear assessment. Ideally, the score relates to a simple and tangible concept of goodness, e.g., the number of mistakes the optimizer has made. A great counterexample would be, say, the error margin in cardinality estimation: that’s clearly an important measure but it doesn’t tell me (a) whether I can do better at all, and (b) if so, how.

technology agnostic
The benchmark should be agnostic of the particular optimizer technology used. We would like to be able to compare different optimization algorithms and strategies. I don’t care whether your optimizer uses heuristics, is cost-based, or is a smart hybrid—as long as it generates great plans, you should get a great score!

platform specific
A query optimizer tries to generate the optimal plan for a given target platform, i.e., a combination of execution and storage components, among other things. In general, no two target platforms are the same. For example, execution engines differ in the operators they support, storage engines differ in performance for scans and look-ups, etc. The score must take into account that every optimizer will optimize for a different target platform.
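
To make this concrete, here is a toy cost model (all constants are made up for illustration): the very same query over the very same data warrants different plans depending on the target platform's scan and lookup characteristics.

```python
# Hypothetical two-operator cost model: full scan vs. index lookups.
def scan_cost(table_rows, cost_per_row):
    return table_rows * cost_per_row

def index_cost(matching_rows, cost_per_lookup):
    return matching_rows * cost_per_lookup

def best_access_path(table_rows, selectivity, cost_per_row, cost_per_lookup):
    matching = table_rows * selectivity
    if scan_cost(table_rows, cost_per_row) <= index_cost(matching, cost_per_lookup):
        return "scan"
    return "index"

# Platform A: blazing sequential scans, expensive random look-ups.
print(best_access_path(1_000_000, 0.01, cost_per_row=0.001, cost_per_lookup=0.5))
# Platform B: slower scans, cheap look-ups (think aggressive caching).
print(best_access_path(1_000_000, 0.01, cost_per_row=0.01, cost_per_lookup=0.2))
```

On platform A the scan wins, on platform B the index does; an optimizer picking "scan" is right on one platform and wrong on the other, so a useful score cannot be computed without reference to the target platform.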

universally applicable
If you thought the previous two were already hard to reconcile, you’ll enjoy this one: ideally, the benchmark can be applied to any query optimizer. If special hooks are required, they should be general enough that any optimizer can be expected to provide them.

query specific
The benchmark should work for individual queries. This will enable implementers to troubleshoot specific optimizer problems that are otherwise buried in larger workloads or hidden by interactions with other components.

application specific
The benchmark, or an extension of it, should enable application specific comparisons. That is, it should apply to any workload in addition to individual queries. This will enable operators and users to assess whether an optimizer is suitable for a specific application scenario.

easy to compute
The score should be straightforward to compute and lend itself easily to automation in test suites and daily reporting within a development organization.

Finally, a word of caution: a lot of people seem to get suckered into what I would call the ‘intuitive stuff’: trying to argue that plans are ‘acceptable’ if their anticipated cost is within X times the optimal, and so on. I know, it’s very tempting, but I’m afraid it will not get us any closer to a scientific method. So, we’ll try to stay away from these.

I know the list above is quite a tall order! For now, I leave you with this set of desiderata, and will present a few ideas over the next couple of months.

Posted in Uncategorized | 5 Comments

R and D — not: R or D

A few days ago a frustrated recruiter from an agency we have been using on and off for a while wrote to me

I [come] to the conclusion that your standards are impossibly high.  I don’t see myself being able to fill any of your roles.

As a matter of fact, we do hire engineers on an ongoing basis via all conceivable channels, including agencies, which goes to show that it isn’t impossible, just difficult. We do pride ourselves on hiring only the top candidates, but, hey, that’s what they all say, right? Anyway, this little episode made me think some more about our hiring process, and in the next couple of posts I’d like to shed some light on how to find great talent. Let’s first take a look at the target candidates.

The candidates we are looking for

A number of folks (most notably Joel Spolsky) have written extensively on this subject in the past and boiled it down to a simple and catchy formula: “smart people who get stuff done!” In a software industry context I actually prefer to call them “people who can do both research and development”. Unlike a number of other places out there, here at Greenplum we’re looking for people who can drive a research agenda, solve hard problems, and communicate the solutions effectively, and who, at the same time, are able to turn theoretical results into enterprise software. Simple enough, you might think, and yet I see plenty of candidates who tell me they do R&D but “don’t code”. I’ve come to the conclusion that most candidates are good at one or the other — but only very few excel at both. In order to find the right talent (and enable a larger team to do so in order to scale out the process) you need to understand your candidate pool. The little drawing on the left is what I found captures the landscape best: the theoreticians, the pragmatists, and, well, the people who are both — and those are the ones you want to hire.

The theoreticians. The upper left quadrant (‘research’) holds the theoreticians, who may very well be able to solve hard problems but have a hard time turning their results into robust and elegant software solutions. In fact, writing good software is, as you know, extremely challenging. Turning the solution to a hard problem into software is much harder than most people think. Turning it into maintainable software that leaves enough headroom to use it as a workbench for future research is even harder. (If you don’t believe me, try to find an implementation of Paxos that doesn’t require major rewrites before it fits your project.)

The pragmatists. The lower right quadrant (‘development’) holds the pragmatists, who rapidly produce huge amounts of code that, to no surprise, needs constant fixing afterward. The pragmatists use the generally agreed-upon fact that there is no such thing as perfect software as an excuse to create prototypes that tend to escape into production.

Ironically, both of these groups have surprisingly little regard for each other: the pragmatists call the theoreticians “too academic”, the theoreticians think of the pragmatists as “hackers”; and that’s putting it mildly. Interestingly, it also does not occur to them that combining their skills and talents would be highly desirable — they typically view the world as either research or development.

R&D: pragmatic theoreticians

Here at Greenplum, we’re looking for people who can do it all — the candidates in the upper right quadrant (‘R&D’). And that’s not just because we think it’s “somehow” better. What we’re looking for are folks who are strong abstract thinkers, who can solve hard problems on paper, file for patents, and publish their solutions, but who also design and build elegant software, driving designs top-down and implementations bottom-up.

Luckily there are folks who are willing to take on hard challenges! If you’re intrigued and think you have what it takes, visit http://www.greenplum.com/about-us/careers and apply for the Query Processing or the Distributed Systems Engineering position.

P.S. we’re opening offices in Beijing and Seattle!

Posted in Uncategorized | 1 Comment

icde’11: preview of EMC/DCD paper

Is workload management via admission control good enough? Definitely not.

Admission-control-based workload management is only one part of the solution, and most database systems support it by now, one way or another. However, the ability to run queries in the background without affecting more important query workloads is key to better resource utilization as well as customer satisfaction.

Recently, we’ve designed some pretty nifty mechanisms, based on control theory, for prioritization of queries and workloads that are both intuitive and easy to administer as well as highly resource efficient—including advanced features like handling priority inversion, i.e., temporarily speeding up a low-priority query that got in the way of a high-priority one.
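
To give a flavor of the control-theoretic angle, here is a hypothetical sketch (not Greenplum's actual mechanism, and the gains are made up): a proportional-integral controller that throttles the resource share of a background workload so that the high-priority workload's latency tracks a target.

```python
# Hypothetical PI controller: shrink the background share when the
# high-priority workload runs slower than its latency target.
class BackgroundThrottle:
    def __init__(self, target_latency, kp=0.1, ki=0.02):
        self.target = target_latency
        self.kp, self.ki = kp, ki
        self.integral = 0.0
        self.share = 0.5  # resource share granted to background queries

    def observe(self, measured_latency):
        # Positive error: high-priority queries are too slow -> back off.
        error = measured_latency - self.target
        self.integral += error
        self.share -= self.kp * error + self.ki * self.integral
        self.share = min(max(self.share, 0.0), 1.0)  # clamp to [0, 1]
        return self.share

throttle = BackgroundThrottle(target_latency=1.0)
s1 = throttle.observe(2.0)  # latency above target: share drops below 0.5
s2 = throttle.observe(1.0)  # on target, but accumulated error keeps pressure on
```

The appeal of a feedback loop like this is that administrators state an intuitive goal (a latency target) rather than hand-tuning resource percentages per workload.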

The details are in this paper, accepted at next year’s ICDE.

Posted in Uncategorized | Leave a comment

beer: to buy or to brew — and where does map-reduce come into the picture?

Some 1,500 breweries in the U.S. produce around 6 billion gallons of beer every year. Interestingly, according to the American Homebrewers Association, there are currently about 750,000 homebrewers in the U.S. In other words, several hundred thousand people who could simply walk into a supermarket and buy a case of beer — apart from maybe the brewing youth — go out of their way to create their own beverage.

Homebrewing is a fascinating activity. With some talent you can produce beer of a quality similar to what you get in the supermarket! More importantly, doing it yourself allows you to create special flavors, experiment with all sorts of ingredients, and, simply put, customize the process to your own needs. It comes at a price, though. Homebrewing demands commitment. Not only will it affect your vacation schedule and disrupt your social life, it also tends to stink up the house.

Mind you, the result is beer, like the one from the supermarket. Is it better — or even just different (in a good way)? That’s a matter of perception.

Well, you see where I’m going with this. Much has been written about why MapReduce is or is not on par with database technology. And much of it resembles the discussion of beer — including a number of enthusiastic homebrewers with refreshingly little respect for an established discipline. This past weekend I participated in a lively panel discussion at CIKM’s DOLAP. Carlos Ordonez and Il-Yeol Song did a great job of organizing and running this year’s edition of this long-standing data warehouse and OLAP workshop series! Our panel was an interesting mix of folks from industry: database vendors as well as MapReduce users. Many of the usual questions were pondered and we touched on a number of urban myths — databases do not scale, databases cannot be expanded easily, databases are difficult to install, and so on; most of these are problems of specific products from specific vendors rather than shortcomings of database technology.

My takeaway is simple, though:

  1. there’s a place for MapReduce’s programming model (as there is for a plethora of programming languages), and
  2. marrying some of MapReduce’s conveniences such as its fault-tolerance with the power of consistency in the database world is not only an interesting challenge but a promising direction.
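
For readers who have only heard the arguments but never seen the model itself, here is a toy, sequential sketch of the MapReduce programming model (a real framework would partition the input, run mappers and reducers in parallel across machines, and restart failed tasks; none of that is shown here).

```python
from collections import defaultdict
from itertools import chain

# Map: turn each input into a stream of (key, value) pairs.
def map_phase(inputs, mapper):
    return chain.from_iterable(mapper(item) for item in inputs)

# Shuffle: group all values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: fold each key's values into a single result.
def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word count, in keeping with the beer theme:
docs = ["to brew or to buy", "brew at home"]
pairs = map_phase(docs, lambda doc: ((word, 1) for word in doc.split()))
counts = reduce_phase(shuffle(pairs), lambda _word, ones: sum(ones))
print(counts["brew"], counts["to"])  # 2 2
```

The programming model is just these three phases; everything people actually argue about (fault tolerance, scheduling, storage formats) lives in the machinery around them.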

In other words, it’s not about whether brewing beer at home is the right thing to do or not — it’s about how we can make it more convenient so it does not stink up the house!

Posted in Uncategorized | Leave a comment

HyPer — rethinking architecture and execution

A few weeks ago Alfons Kemper of TU Munich visited EMC/Greenplum and gave a presentation on his latest research project (joint work with Thomas Neumann): an elegant yet astonishingly powerful implementation of a database system, called HyPer, that boasts both massive OLTP throughput and strong OLAP query performance. More importantly, there are a few general takeaways buried in there that merit a closer look.

The idea is as simple as it is striking—exploit the OS’s memory management when forking processes to implement shadow paging. Recall that when forking a process, the child inherits a snapshot of the parent’s memory, and subsequent modifications to any page are handled copy-on-write style, i.e., only the pages that are modified are actually copied. Make sure all data fits into main memory, serialize all writes, and besides the one writer permit only read queries, and you get a database system that excels at both OLAP and OLTP workloads.
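
The mechanism is easy to demonstrate on any Unix system. The sketch below (my own toy illustration, not HyPer's actual code, which is written in C++) forks an "OLAP" child that reads a consistent snapshot while the "OLTP" parent keeps updating in place:

```python
import os

data = {"balance": 100}

ready_r, ready_w = os.pipe()    # parent -> child: "update is done"
result_r, result_w = os.pipe()  # child -> parent: value seen in snapshot

pid = os.fork()
if pid == 0:
    # Child: inherits the fork-time snapshot; the parent's later writes
    # only touch the parent's (copied) pages, never this snapshot.
    os.close(ready_w); os.close(result_r)
    os.read(ready_r, 1)  # wait until the parent has updated
    os.write(result_w, str(data["balance"]).encode())  # still 100
    os._exit(0)
else:
    # Parent: updates proceed concurrently with the child's "query".
    os.close(ready_r); os.close(result_w)
    data["balance"] = 999  # the modified page is copied, not shared
    os.write(ready_w, b"go")
    snapshot_value = int(os.read(result_r, 16))
    os.waitpid(pid, 0)
    print(snapshot_value, data["balance"])  # 100 999
```

The child observes the balance as of fork time even though the parent has long since overwritten it; the OS's copy-on-write machinery provides the snapshot isolation for free.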

The principle is phenomenally simple and keeps unnecessary complexity at bay. HyPer boasts some impressive performance on a mixed TPC-H/TPC-C benchmark which captures both extremes of high-frequency updates and read-only OLAP queries (although nothing in between). At its current stage, HyPer is a prototype that comes with a number of limitations which suggest a fair amount of engineering is required before it can be put in the hands of users. Also, not everybody might agree that TPC-C and TPC-H together actually make for what people commonly call a mixed workload. However, all this aside, HyPer offers a couple of profound insights:

  1. Leverage your OS. Database systems re-implement lots of OS functionality, to achieve portability among other things. HyPer is astonishingly simple to implement because it avoids exactly this re-implementation. Its raw performance of 100,000+ tps is accomplished by using one of the most fundamental OS facilities. Given the central role of memory management within the OS, we can safely expect continuous performance improvements on the OS’s side. This is going to be a free ride for HyPer.
  2. Leverage your compiler. Queries are written in C and compiled. A number of research prototypes have recently demonstrated significant performance gains through compilation—a technique that, by the way, is as old as relational databases but had temporarily gone out of fashion. Compiler technology continuously tries to better exploit new hardware developments, and we can expect immediate performance increases here too.

The first insight questions conventional database architecture, and you may or may not agree with the resulting design limitations. However, the second is much more interesting: the whole discussion of CPU caches and batching or vectorizing of data we have seen in recent years simply does not matter here—and the numbers seem to back this up. Supposedly, these techniques would not even add to HyPer’s performance if applied on top of the current implementation.

This is good news for the practitioner, as it suggests we can safely trust the OS/hardware/compiler folks to do what they are best at, and we can focus on the real database problems!

Posted in Architecture, Performance | Leave a comment

into a new era

Much has been written about EMC’s acquisition of Greenplum a few weeks back. Most of the discourse focused on the impact EMC, with the newly formed Data Computing Products Division, will have on the data warehousing landscape as a whole and the appliance market in particular. The contributions spanned the whole spectrum from serious punditry all the way to plain gossip. And I’m sure investors would find these quips useful or — in case of the latter — at least entertaining.

For the candidates we are currently interviewing, a couple of additional questions that haven’t been answered in the blogosphere are important. Here are some of the most frequently asked ones:

Q: What has changed for Greenplum R&D through the acquisition?

A: Not much, really — we’ll continue to innovate, run circles around the competition, and have fun doing so! Just with more money in the bank, access to additional sales channels and infrastructure, etc., which will help us accelerate growth. We had already partnered closely with EMC in the past and worked on joint projects, including exploring new storage configurations. It’s fair to say you will see more of this :-)


Q: Isn’t EMC an odd partner to be acquired by, compared to, say, one of the more traditional database companies?

A: The best part about the acquisition is that EMC did NOT have a database department! Why? Any company with a database department that could have acquired Greenplum would have had to integrate different product groups, i.e., reconcile different architectures, design philosophies, feature sets, etc. There are plenty of examples in the database world where this kind of thing went awfully wrong. (Quick, how many of the train wrecks I was alluding to can you name? More than half a dozen? Wasn’t difficult, huh?)

Q: Who sets the direction of the new Data Computing Products Division?

A: Greenplum brings database expertise to EMC and is in charge of defining the product agenda (Bill Cook, the former CEO of Greenplum, heads the new DCP Division, after all). As you can tell from other recent acquisitions of EMC’s, they understand the value of letting the companies they acquire do what they are best at.

At Greenplum we always believed we are in charge of our own destiny — no change of plans!

Posted in Hiring | Leave a comment

getting the basics right

Recently, a guest speaker at our Engineering Colloquium — a weekly institution at Greenplum — presented an elaborate scheme to improve the performance of a general-purpose query processor. In short, the approach consisted of a couple of interesting, although not necessarily fundamentally new, ideas. At some point in the presentation our guest showed a comparison of his method with a well-known commercial database system using TPC-H’s Q1 and Q6. And, you guessed it, the difference in query performance was marked!

A number of members of the audience later mentioned they found the comparison discredited the actual research rather than underlining its value. What had happened? Did the presenter not know the difference between a database system and a C program? Or, if his numbers were so convincing, why did he not get through to folks who, of all people, should be able to assess the benefits of his system quite accurately? Worse still, this is not the first time this sort of thing has happened, and from discussions with colleagues in the industry I take it that it is not specific to this audience either.

The answer is as simple as it is profound: the two sides had an utterly different understanding of what a database system is and how to characterize it. So, let’s take this as a starting point and ask the obvious and ultimately fundamental question: what is a database system, really? And no, this is not a trick question. Come to think of it, it is actually a pretty good place to begin our journey.

So, let’s see. Depending on your profession, your experience with databases, or your research interests, you will probably enumerate a large number of theoretical concepts and describe a database system as the sum of these — from ACID to WAL, from query languages to query execution, and so on. Not far from the truth. But taking into account the way database systems are being used, the types of problems users try to solve, and the economic aspects IT is typically subject to, I would like to propose a much broader definition:

A database system is a cost-effective compromise between otherwise conflicting requirements.

The vagueness in this definition is intended. In particular, the costs as well as the requirements are an expression of the context in which the system is to be applied and what’s important to the users in a given application scenario.

General-purpose database systems attempt to strike an average across a very large number of scenarios and customers; specialized ones have the luxury of ignoring certain use cases and focusing on what their implementers believe are the more interesting ones — either because they are more fun to work on, because there is more money to be made, or, as the case may be, both. With growing popularity and an ever-increasing user base, the number of conflicting requirements that need to be reconciled typically increases and leads to less and less optimal compromises. Pretty much all commercial database systems are testament to this evolution; like it or not.

Back to the beginning of our story. Our visitor’s mistake was primarily to consider performance the only requirement for a query executor; more specifically, he considered only a subset of queries (hash joins are generally considered more fun to implement than sort-merge), special characteristics of data sets (“everything fits in memory”), etc.

More importantly, our mistake — and by “us” I mean researchers in the industry — is that we fail to articulate appropriately many of the intricate and less obvious requirements and their interplay, and why it should be an exciting challenge to solve them not in isolation but in combination with other requirements. After all, it is the hard problems that we love to work on. So, let’s get the word out.

Posted in Foundation | 3 Comments

% begin

Welcome to “theory in practice” — I intend to turn this into a place to discuss database technology, ideas related to “big data”, and what it takes to turn these ideas into products. And although you will eventually find all sorts of subjects here, there are three recurring themes that are near and dear to my heart:

Research & Development – In my job, research and development are two sides of the same coin. We are constantly looking for exciting ways to enhance database technology by solving some darn hard problems, and we are constantly looking at how best to turn these solutions into solid products.

Expectations & Applications – Users have data problems, not database problems. We database people tend to think in terms of implementations and features but sometimes forget what it actually was our users wanted to solve in the first place and — equally important — in which context.

The Human Element – To generate great ideas and build great products, you need a great team. Hiring and mentoring are a fundamental part of my professional life — hence, we will talk about observations around the hiring process and tips & tricks for candidates and hiring managers alike.

Okay, this will be fun. Stay tuned.

Posted in Uncategorized | Leave a comment