Greenplum vs Hadoop Disk Space
I’ve been spending a whole lot of time calculating Greenplum vs Hadoop disk usage. So here is the general equation for effective space:
(MaxAllocFactor * DiskSize * ( #Disk - RaidDisks ) ) / ReplicationFactor
MaxAllocFactor = Max recommended allocation. 70% for Greenplum and 75% for Hadoop
DiskSize = Size of your drive
#Disk = Number of drives
RaidDisks = Number of disks eaten up by RAID; for Hadoop this is 0
ReplicationFactor = Number of copies kept of the data. In Greenplum everything is mirrored, so the replication factor is 2. Hadoop recommends three copies of the data, so it gets a replication factor of 3.
So let’s look at a 24-drive attached storage array, using 500GB drives.
(MaxAllocFactor * DiskSize * ( #Disk - RaidDisks ) ) / ReplicationFactor
Greenplum: ( .70 * 500GB * ( 24 - 4 ) ) / 2 = 3.5 TB effective space
Hadoop: ( .75 * 500GB * ( 24 - 0 ) ) / 3 = 3.0 TB effective space
Next we’ll look at a single server, let’s say a 1U with four 3.5″ 2TB drives.
Greenplum: ( .70 * 2TB * ( 4 - 1 ) ) / 2 = 2.1 TB effective space
Hadoop: ( .75 * 2TB * ( 4 - 0 ) ) / 3 = 2 TB effective space
How about a single 2U server with twelve 1TB drives?
Greenplum: ( .70 * 1TB * ( 12 - 2 ) ) / 2 = 3.5 TB effective space
Hadoop: ( .75 * 1TB * ( 12 - 0 ) ) / 3 = 3 TB effective space
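The arithmetic in the three scenarios above can be reproduced with a short script. This is only a sketch: the function name effective_space and the scenario labels are made up for illustration, sizes are in TB (so 500GB is written as 0.5), and the allocation factors, RAID disk counts, and replication factors are simply the ones used in this post.

```python
def effective_space(max_alloc_factor, disk_size_tb, num_disks, raid_disks, replication_factor):
    """Apply the general equation from this post; the result is in TB."""
    return max_alloc_factor * disk_size_tb * (num_disks - raid_disks) / replication_factor


# (label, disk size in TB, number of disks, disks eaten up by RAID under Greenplum)
scenarios = [
    ("24 x 500GB array", 0.5, 24, 4),
    ("1U, 4 x 2TB", 2.0, 4, 1),
    ("2U, 12 x 1TB", 1.0, 12, 2),
]

for label, size_tb, disks, raid_disks in scenarios:
    # Greenplum: 70% max allocation, RAID overhead, mirrored (factor 2)
    greenplum = effective_space(0.70, size_tb, disks, raid_disks, 2)
    # Hadoop: 75% max allocation, no RAID, three copies (factor 3)
    hadoop = effective_space(0.75, size_tb, disks, 0, 3)
    print(f"{label}: Greenplum {greenplum:.1f} TB, Hadoop {hadoop:.1f} TB")
```

Swapping in your own drive sizes, RAID layout, or replication settings only means changing the scenario tuples or the factors passed to the function.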
So what does this mean? It means that you shouldn’t run laughing to the bank on your backend savings by choosing Hadoop over Greenplum, provided you plan to use the same storage architecture: in all three examples the effective space comes out within about half a terabyte of each other. Greenplum and Hadoop are two very different technologies, so comparing the two is kind of silly in the first place. They fall into the same category of processing large datasets in the same manner that a Ford F350 and a Mazda Miata are both vehicles. They will both get you down the road, but in entirely different manners.
Don’t talk to me about compression factors. Everyone wants to say how their grandmother in Pensacola got 20x compression on system X. System X never happens to be my system, so I’ve stopped drinking the compression factor koolaid.