expanding without reloading — online expansion of large-scale data warehouses

DBA’s the world over dread the day when their boss walks into their office and announces that it is time to expand the Enterprise Data Warehouse, the company’s crown jewel. While not a pleasant operation on single-node databases it means major surgery in conventional MPP databases that deploy a large array of shared-nothing compute and storage nodes. The standard approach is to dump all the data, provision new capacity, and then reload all the data. What sounds¬†rather simple is actually an impressive logistic feat — if all goes well. Weeks, if not months, in advance, elaborate project plans are developed that span a number of teams across the company: from the hardware department all the way to the business customers of the database; you need all hands on deck. The process itself requires up to several days of downtime for databases in the Petabyte range — that is, if all goes well. In the event, that one or more things go wrong the crew will be scrambling to get either back to the original configuration or to a makeshift solution before the scheduled window of downtime expires and the business suffers from the outage.
The sheer prospect of difficulty of this operation makes many IT organizations put off an expansion as long as possible. Which usually makes things even harder as the system will be close to capacity when it finally needs to be expanded and spare components or capacity will be harder to come by in the heat of the battle.

With all that in mind, we developed a mechanism that allows expanding a Greenplum Database instance (1) without downtime, (2) no significant performance impact during the operation, and is (3) transactionally consistent on top of it. In addition, we took great care to make the whole process fully transparent to the DBA. The basic idea is as simple as effective: add spare capacity first, then redistribute data gradually(!) over to the new nodes as necessary/desirable. As soon as the spare capacity is added to the system, we treat data as if it was simply skewed and accidentally ended up only occupying the old nodes but not the new ones. The system is immediately fully functional even though the data has not been redistributed.
Here are the key ingredients for our solution:

Distribution-aware query processor

The query processor’s ability to deal with the fact that data may be distributed in various ways across all nodes in Greenplum DB is key to our approach. In particular, our query processor can deal with data being which ever way distributed. This is in stark contrast to systems that need data to be strictly co-located for joins or, even worse, auxiliary data structures to be precomputed to execute joins.

Transactional data redistribution

We added a simple, yet powerful, primitive to the DDL quiver of the system that allows transactional redistributing of tables to re-balance the data after the actual expansion. Its syntax looks like this:


Strictly speaking, this mechanisms is not really needed. In a data warehouse, new data is continuously loaded and old data is rolled off. Since new data is automatically distributed over the expanded configuration, data automatically converges toward being uniformly distributed over time. However, we felt this mechanism will be useful for completeness, and for the odd case of a static environment where little new data is loaded and redistributing the currently loaded data is desired.

Managing redistribution

Since we exposed the redistribution mechanism through DDL, the process of redistributing can be taken care of entirely using SQL. We provide a utility that creates a couple of tables that contain the names of objects that have not been distributed yet and provides user-defined functions that work off these tables to schedule the actual redistribution. This mechanism was heavily inspired by, i.e., modeled after, a system we use here at Greenplum internally for performance tests where the scheduling of experiments is entirely implemented using only SQL statements.
Besides scheduling, the utility provides DBA’s with full transparency of the process (including query-able state of the redistribution) should he or she choose to re-balance the data after an expansion.

By now, a number of customers have used this mechanism to expand their systems with great success. Since re-balancing data provides slightly better performance (although often not discernibly so) customers usually opt for redistribution in the background. In some cases the redistribution of data was spread over several weeks without disrupting of impacting daily operations.

To the best of my knowledge, this mechanism is unique in the industry. It’s truly a gem primarily due to its simplicity and robustness: there’s no ambiguity what data is residing where at what time. For a more detailed write-up including scale and performance experiments check out this paper accepted at this year’s VLDB conference. If you happen to attend the conference, make sure to stop by at our session!

Update: in the past days I got a number of questions as to whether the ALTER command is truly needed. Given that we expose the catalog, is it possible to do all of the alteration in SQL? Well, hypothetically, that is possible. One would have to do an elaborate scripting job to preserve indexes and other dependent objects, then create the copy of the table and recreate the dependents accordingly, and finally rename the objects. Nevertheless, I think the ALTER command is actually a nice shorthand to do all this under the covers and might also come in handy in other situations.

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s