Hardly any book has attracted more attention among software companies than Clayton Christensen’s “The Innovator’s Dilemma”. Some companies even welcome it as a sign of “innovative spirit” when engineers slap a product manager over the head with it, figuratively that is, whenever he or she presents an idea to increase business value along the lines of the existing corporate strategy.
In a nutshell, The Dilemma is the observation that established industries are more likely to invest in existing, proven but aging technologies than to look for new, well, disruptive innovations that are initially risky or even outright unattractive economically. The incumbents miss the boat and their business model gets disrupted by a competitor who took the plunge. Eventually, the disrupter puts the incumbents out of business.
For a technology to disrupt an existing industry, a few things must happen:
- The existing products have outgrown actual market expectations, e.g., they deliver more features than customers find useful.
- The new technology changes one fundamental parameter in the equation, often at the cost of some other fundamental property like performance, e.g., a radically altered user interface.
- The new product needs a new market that is almost always downmarket from the established industry. Mind you, it’s about disruption, not immediate displacement of the incumbent technology.
The above graph captures exactly this phenomenon as it played out in the relational database (RDBMS) market towards the end of the last century. Initially, the relational model was nowhere near the performance of a CODASYL database, which had already exceeded market expectations (see gray dashed line). However, relational databases offered a fundamentally enhanced interface through a declarative query language. A truly revolutionary thing. Yet the new technology lagged behind even modest market expectations (blue dashed line). Within the next 20 years, RDBMSs pretty much grew up, broke the dominance of CODASYL, and basically killed the old technology. Perfect disruption! The y-axis is the product of a system’s expressiveness and its performance, which I call effective productivity (EP). Admittedly, it is more popular science than measurable fact, but bear with me: databases aren’t exactly one-dimensional products. The blue dots mark a number of key releases in the RDBMS world.
With Hadoop, the picture is similar in some ways but quite different in others. Its original value proposition was a ‘scalable programming model in the form of MapReduce’, but over the years it turned much more into ‘cheap scalable storage’. Simply put, the programming model is not commercially viable and has relegated the product to a market of software companies that can afford to have a large team of software engineers maintain a data management platform that would require a single DBA had they used a database.
Besides being used by start-ups with deep software expertise and an early adopter’s DNA, Hadoop is also very popular with educational institutions and research labs. It is just not popular with mainstream enterprises, although they are eager to adopt it and would love to tap into the incredible storage commoditization that Hadoop promises. And if you think enterprises would come around and start programming in MapReduce, think again: MapReduce code written by non-engineers is horribly difficult to maintain and audit. User-defined functions already cause enough headaches in the database world, and that’s nothing in comparison.
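To make the gap between the two programming models concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the map, shuffle, and reduce steps are hand-written stand-ins for what a MapReduce job has to spell out, and the declarative version uses the standard library's `sqlite3` module as a stand-in for a real database. The `SALES` records and all function names are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale amount) pairs, invented for illustration.
SALES = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_mapreduce(records):
    """Imperative style: the programmer wires the three phases together."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def totals_sql(records):
    """Declarative style: state the result wanted, the engine decides how."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(totals_mapreduce(SALES))
    print(totals_sql(SALES))
```

Both functions compute the same per-region totals. The difference is who owns the plan: in the first, every phase is code that someone has to maintain and audit, while in the second the whole job is one query.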
So, in terms of the Innovator’s Dilemma, Hadoop just moved way too far downmarket: vanilla Hadoop lets you do less than IBM’s famed System R prototype of the late 1970s, and layers that add expressiveness and accessibility, such as Hive, are not competitive because of their deplorable performance. That puts Hadoop’s EP (green line in graph below) not only below expectations, which would be perfectly okay at the introduction of a new technology, but also below what I call the Enterprise Threshold, the minimal viable EP that enterprises need in order to make sense of a new technology (thin gray line). And sadly, the slope of Hadoop’s development curve has been rather flat over the past years.
“But what about the fault tolerance of Hadoop? Isn’t that a clear winner?” you say. Yes, that’s sweet but, alas, not enough to make up for the bad EP. Again, it is attractive only to a very limited niche market, even though it is something the database folks should really look into.
So here’s my point: to disrupt the database industry with a fighting chance of broader adoption, Hadoop needs to boost its EP substantially. The most important step in this direction will be integration with a proper processing engine, something along the lines of a modern query engine. I call this HQ in the chart: Hadoop w/ Query Engine (red line in graph). Query engines are substantially more powerful processors of data, which would finally move the needle in Hadoop’s favor. And while still downmarket from MPP databases, HQ is clearly above the Enterprise Threshold.
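In symbols, the argument so far can be written as a rough shorthand. The symbol names are assumptions introduced here for illustration, not established notation: $E$ is expressiveness, $P$ is performance, and $\mathrm{EP}_{\min}$ is the Enterprise Threshold. Reading "downmarket" as "lower EP" is likewise an interpretation of the chart, not a measured result.

```latex
\mathrm{EP} = E \times P,
\qquad
\mathrm{EP}_{\text{Hadoop}} \;<\; \mathrm{EP}_{\min} \;<\; \mathrm{EP}_{\text{HQ}} \;<\; \mathrm{EP}_{\text{MPP}}
```

Because EP is a product, a large gain in one factor cannot compensate for a near-zero value in the other, which is why raw scalability alone does not lift Hadoop over the threshold.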
So, what about disruption then? Well, let’s look at the checklist:
- The target to disrupt would be the MPP RDBMS market, which is pretty close to meeting, if not outperforming, market expectations. Check!
- Scalability is the big change parameter: cheap scalable storage way beyond the 10 PB that typical MPP databases can service today. Check!
- HQ will deliver an EP that’s good enough for customers downmarket, addressing a market that MPP databases can’t serve. Check!
Porting a query engine from an existing MPP database is the most promising first step. Long term, I expect a more differentiated architecture will be needed, but to get going, a disruptive technology first needs to get its foot in the door, and that’s exactly what HQ will accomplish.