A few weeks ago Alfons Kemper of TU Munich visited EMC/Greenplum and gave a presentation on his latest research project (joint work with Thomas Neumann): an elegant yet astonishingly powerful implementation of a database system, called HyPer, that boasts both massive OLTP throughput and strong OLAP query numbers. More importantly, there are a few general takeaways buried in there that merit a closer look.
The idea is as simple as it is striking: exploit the OS’s memory management when forking processes to implement shadow paging. Recall that when a process forks, the child inherits a snapshot of the parent’s memory, and pages are managed copy-on-write-style, i.e., only the pages that are modified after the fork are actually copied. Make sure all data fits into main memory, serialize all writes, allow only read queries alongside the single writer, and you get a database system that excels at both OLAP and OLTP workloads.
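To make the fork-based snapshot idea concrete, here is a minimal sketch in Python. It is not HyPer's code: the helper names (`snapshot_query`, `collect`, `demo`) and the toy `accounts` data are made up for illustration, and it assumes a POSIX system where `os.fork` is available. The parent plays the single writer and the forked child plays a read-only OLAP query that keeps seeing the data as it was at fork time.

```python
import os

def snapshot_query(data, query):
    """Fork a child that runs `query` over a snapshot of `data` (hypothetical helper)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: its address space is a copy-on-write snapshot taken at fork time,
        # so later writes by the parent are invisible here.
        os.close(read_fd)
        try:
            os.write(write_fd, str(query(data)).encode())
        finally:
            os._exit(0)  # leave immediately, skipping the parent's cleanup handlers
    # Parent: keep only the read end and carry on as the single writer.
    os.close(write_fd)
    return pid, read_fd

def collect(pid, read_fd):
    """Read the child's (integer) result and reap the child process."""
    with os.fdopen(read_fd, "rb") as pipe:
        payload = pipe.read()
    os.waitpid(pid, 0)
    return int(payload)

def demo():
    accounts = {"alice": 100, "bob": 50}
    pid, fd = snapshot_query(accounts, lambda d: sum(d.values()))
    # OLTP-style updates continue in the parent while the query runs.
    accounts["alice"] -= 30
    accounts["carol"] = 500
    snapshot_total = collect(pid, fd)
    return snapshot_total, sum(accounts.values())
```

Calling `demo()` returns the total seen by the snapshot query next to the total in the live data. The two differ because the child never observes the parent's later updates, regardless of how the two processes are scheduled. One caveat of the sketch: CPython's reference counting writes to object headers, so a Python process copies more pages than a C implementation would.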
The principle is phenomenally simple and keeps unnecessary complexity at bay. HyPer boasts some impressive performance on a mixed TPC-H/TPC-C benchmark which captures both extremes of high-frequency updates and read-only OLAP queries (although nothing in between). At its current stage HyPer is a prototype that comes with a number of limitations that suggest a fair amount of engineering is required before it can be put in the hands of users. Also, not everybody might agree that TPC-C and TPC-H together actually make for what people commonly call a mixed workload. However, all this aside, HyPer offers a couple of profound insights:
- Leverage your OS. Database systems re-implement lots of OS functionality to achieve portability among other things. HyPer is astonishingly simple to implement because it avoids exactly this re-implementation. Its raw performance of 100,000+ tps is accomplished by using one of the most fundamental OS facilities. Given the central role of memory management within the OS we can safely expect continuous performance improvements on the OS’s side. This is going to be a free ride for HyPer.
- Leverage your compiler. The queries are written in C and compiled. A number of research prototypes have demonstrated significant performance gains through compilation recently—which by the way is a technique that is as old as relational databases but had temporarily gone out of fashion. Further development of compiler technology continuously tries to better exploit new hardware developments and we can expect immediate performance increases here too.
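As a rough illustration of the second point, the sketch below generates source code for one specific query and compiles it, instead of interpreting a generic plan row by row. According to the talk, HyPer's queries are written in C; this Python analogue only shows the shape of the technique, and the function `compile_filter_sum`, the row layout, and the sample data are all hypothetical.

```python
def compile_filter_sum(predicate_src, value_src):
    """Generate and compile a loop specialised for one filter-and-sum query."""
    source = (
        "def query(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if {predicate_src}:\n"
        f"            total += {value_src}\n"
        "    return total\n"
    )
    namespace = {}
    # The predicate and the aggregated expression are baked into the generated
    # code, so no per-row interpretation of a query plan is needed at run time.
    exec(compile(source, "<generated query>", "exec"), namespace)
    return namespace["query"]

# Toy data; roughly: SELECT sum(qty * price) FROM rows WHERE qty > 10
rows = [
    {"qty": 5, "price": 2},
    {"qty": 20, "price": 3},
    {"qty": 12, "price": 10},
]
query = compile_filter_sum("row['qty'] > 10", "row['qty'] * row['price']")
result = query(rows)
```

Here `result` sums `qty * price` over the two rows that pass the filter. A real system would hand the generated code to an optimizing compiler, which is where the free ride from compiler improvements would come from.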
The first insight questions conventional database architecture, and you may or may not agree with the resulting design limitations. The second, however, is much more interesting: the whole discussion of CPU caches and of batching or vectorizing data that we have seen in recent years simply does not seem to matter here, and the numbers seem to back this up. Supposedly, these techniques do not even add to HyPer’s performance when applied on top of the current implementation.
This is good news for the practitioner, as it suggests we can safely trust the OS/hardware/compiler folks to do what they are best at, and we can focus on the real database problems!