What’s that Greenplum thing
Tomorrow I catch my flight to Las Vegas so I can attend Greenplum Days on Monday and Tuesday. I’m pretty excited about this trip, even though I will be in Vegas and spending essentially all of my time at a conference. You see, for the last month or so I’ve been working heavily with the Greenplum Database. As someone with a past in all aspects of database interaction, from DBA to programmer to sysadmin, I find the product to be a very powerful offering. It still has a little maturing to do, but it is currently dazzling me with some of the things we’ve been able to do.
Stepping back a little bit, I should explain how Greenplum Database is different from your traditional database offering. In many respects a database is very similar to a book, and in this instance I’ll compare it to an encyclopedia (those things where you found info before Google/Wikipedia). If you only have a small amount of information for your encyclopedia, the book form is fairly simple. Ideally the information is in alphabetical order, and finding an entry is as simple as opening up the book and finding the correct page between the two covers. If you are lucky there may even be a table of contents or an index where you can look up the term you are looking for and go directly to it. This is exactly what a database does in a digital setting.
As a business that uses that data, you have the book as the central source of information and one person who spends their day looking things up. As business ramps up, this one-person scenario might not be fast enough for the company, so you look to other solutions. The traditional database method was to hire someone who is faster at looking things up (a bigger server); you might even hire the Rain Man because he can memorize the whole thing. Another approach that has become fairly common is buying another copy of the encyclopedia and hiring someone to read it (replication). This works out great: not only does it double the number of questions you can answer, but if person #1 calls in sick or Rain Man leaves to accompany me to Las Vegas, you still have someone there to answer those questions. It gets a little more complicated when the next version of the encyclopedia comes out, because you need to make sure both readers get it at the same time so they don’t give out conflicting information, but all in all this is a fairly good solution.
This method of handling databases has historically worked out well, as long as people accept the bounds of what it can do. More and more, though, companies don’t just want to know what’s in the encyclopedia; they want to know what’s in the whole library. They also don’t just want to know what the entry on Edgar Allan Poe says; they want any entry that might talk about Poe, or potentially any entry that might mention literary works from the early 19th century. Your traditional ways of looking for data break down very quickly here. Nobody looks through every book that comes into the library and notes every mention of such things, and having one person search the entire library to answer that question is a monumental task that can’t be done in a timely manner.
In the real world, what you would do is hire a team of people and divide up the library so each person is in charge of a specific section; the more people you bring in, the faster this is going to go. This runs well, of course, only if you have a good system and someone very competent overseeing the whole project. Databases have started down this same track, commonly referred to as sharding, and this is what Greenplum does. It splits the data up among a set of servers, so when a question comes in it asks all of the workers to look at their smaller set of data and report back, and then the manager consolidates those results and comes up with the answer. This works extremely well for massive sets of data compared to the old way of doing things. Greenplum is not the first to tackle the problem in this manner, but I do believe they are the first to abstract away most of the complication of setting up and maintaining such a system at a reasonable price.
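To make the library analogy concrete, here is a minimal Python sketch of the idea: split the entries among several workers, have each worker search only its own share, and let a manager combine the answers. This is a toy illustration and not Greenplum's actual implementation. The function names (`distribute`, `search_shard`, `search_all`), the choice of hashing the entry title to pick a worker, and the sample entries are all assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


def shard_for(title, num_shards):
    """Pick a worker for an entry by hashing its title (an assumed scheme)."""
    return crc32(title.encode("utf-8")) % num_shards


def distribute(entries, num_shards):
    """Divide the 'library' so each worker holds only part of the data."""
    shards = [[] for _ in range(num_shards)]
    for title, text in entries:
        shards[shard_for(title, num_shards)].append((title, text))
    return shards


def search_shard(shard, term):
    """One worker scans only its own section for the term."""
    term = term.lower()
    return [title for title, text in shard if term in text.lower()]


def search_all(shards, term):
    """The 'manager': ask every worker in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(search_shard, shards, [term] * len(shards))
    return sorted(title for partial in partials for title in partial)


# Hypothetical sample data for the illustration.
entries = [
    ("Edgar Allan Poe", "Edgar Allan Poe was an American writer."),
    ("Raven", "A large black bird, and the title of a poem by Poe."),
    ("Las Vegas", "A city in Nevada."),
]
shards = distribute(entries, 3)
matches = search_all(shards, "poe")
print(matches)
```

In this toy version the workers are threads in one process; in a real sharded database they would be separate servers, and the manager would also have to plan the query and cope with workers that fail.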
So tomorrow I head out to Vegas to be a part of Greenplum’s announcement of their new software version, as well as something called Greenplum Chorus. Greenplum 4.0 looks very promising and is going to have a huge impact on how we use the product. As for Chorus, they’ve been touting it as the next big thing in data warehousing, so it will be interesting to find out exactly what it is.