Greenplum was founded in 2003. It was acquired in 2010 by EMC, which went on to acquire Pivotal Labs in 2012. Pivotal was in turn acquired by VMware at the end of 2019.
The company has offices throughout North America and in France, Germany, Ireland, China, South Korea, Japan and Australia.
Greenplum is open source, though not an Apache project, and is based on a Postgres code base. It can be deployed on bare metal (the company has a partnership with Dell, along with a reference architecture) or in private or public clouds.
Greenplum is designed to be massively parallel, essentially connecting a series of separate database instances and allowing them to be addressed as a single unit. The company has customers in 35 countries, with the largest customers deploying Greenplum data warehouses of several petabytes.
Company Info
Headquarters: 3495 Deer Creek Rd, Palo Alto, CA 94304, US
Telephone: +1 (415) 777 4868
Greenplum is a massively parallel shared-nothing data warehouse based on a PostgreSQL kernel. It supports the analysis of both structured and unstructured data, enables federated query processing and includes text search capabilities. Geospatial, time-series and image processing are all supported, while user-defined functions are provided for both R and Python.
In addition, and this is a major differentiator, the company also supports Apache MADlib, which was developed in-house in conjunction with work done at several universities and which Pivotal has contributed to Apache. This provides machine learning capabilities using SQL and runs against PostgreSQL-based databases. MADlib algorithms, of which there are forty, are parallelised (where relevant) and run within the database engine, but can be called from R, Python and Java (but not Scala) programs. Greenplum also supports the deployment of GPUs within a cluster to further improve processing performance for relevant (deep learning) algorithms. Jupyter notebooks are supported.
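To illustrate what "machine learning using SQL" looks like when called from Python, the sketch below builds the SQL statement for MADlib's linear regression training function, madlib.linregr_train. The table and column names (houses, houses_model, price, tax, bath, size) are hypothetical, and the helper function is our own illustration rather than part of any Greenplum or MADlib client library. Actually running the statement would require a database connection, which is deliberately left out here.

```python
def madlib_linregr_sql(source_table, model_table, dependent, independents):
    """Build a SQL statement that trains a MADlib linear regression in-database.

    All table and column names passed in are hypothetical examples.
    """
    # Only accept plain identifiers, since the names are interpolated into SQL.
    for name in [source_table, model_table, dependent, *independents]:
        if not name.isidentifier():
            raise ValueError(f"not a plain identifier: {name!r}")
    # MADlib takes the independent variables as a SQL array expression;
    # the leading 1 adds an intercept term.
    features = "ARRAY[1, " + ", ".join(independents) + "]"
    return (
        "SELECT madlib.linregr_train("
        f"'{source_table}', '{model_table}', '{dependent}', '{features}');"
    )


sql = madlib_linregr_sql("houses", "houses_model", "price", ["tax", "bath", "size"])
print(sql)
```

The point of the design is that only the short SQL string travels from the Python program to the database: the training itself runs inside the engine, next to the data.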
It is particularly worth commenting on the graph algorithms supported. All the main graph algorithms that can sensibly be parallelised (not all can) have been implemented within MADlib. So, if these are the only graph algorithms you want, you won’t need to install a graph database.
Customer Quotes
“We chose Greenplum due to its superior performance … Without Greenplum, we could not have achieved our regulatory reporting requirements.” Morgan Stanley
Figure 1 illustrates Greenplum’s Platform Extension Framework (PXF). The warehouse offers what the company calls “polymorphic storage”, by which it means that data is stored in row and/or column format, or in external tables in HDFS or Amazon S3, depending on the queries you want to run. Row-based data is indexed for faster access, while columnar storage leverages Zstandard compression.
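As a rough illustration of the choice between row and column format, the following sketch generates Greenplum-style CREATE TABLE statements for each orientation. The helper function, table names and column names are hypothetical, and the compression type shown (zlib) is only an illustrative default; the storage options follow the general form of Greenplum's append-optimised table syntax, but should be checked against the documentation for the version in use.

```python
def create_table_sql(name, columns, orientation="row",
                     distribution_key=None, compresstype="zlib"):
    """Return a Greenplum-style CREATE TABLE statement (illustrative only)."""
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    if orientation == "row":
        # Default heap storage: row-oriented, suited to indexed lookups.
        storage = ""
    elif orientation == "column":
        # Append-optimised, column-oriented storage with compression,
        # suited to scans over a few columns of a wide table.
        storage = (" WITH (appendonly=true, orientation=column, "
                   f"compresstype={compresstype})")
    else:
        raise ValueError("orientation must be 'row' or 'column'")
    if distribution_key:
        dist = f" DISTRIBUTED BY ({distribution_key})"
    else:
        dist = " DISTRIBUTED RANDOMLY"
    return f"CREATE TABLE {name} ({cols}){storage}{dist};"


columns = [("id", "bigint"), ("region", "text"), ("amount", "numeric")]
row_ddl = create_table_sql("orders_row", columns, "row", "id")
col_ddl = create_table_sql("orders_col", columns, "column", "id")
print(row_ddl)
print(col_ddl)
```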
But query capability extends beyond these confines, as illustrated, so that you can federate queries across multiple data stores, either running queries in the source system or pulling data into the warehouse. The company’s “next generation” cost-based database optimiser, called Orca (see Figure 2), introduced in 2017, knows the location of relevant data and, in federated environments, can use both push-down filters and column projection to improve performance. The database employs the concept of slices, which are used in parallel to execute scans, joins, sorts, aggregations and so forth. A variety of index types are supported, including specialised ones for geospatial and text data.
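The benefit of push-down filters and column projection can be shown with a small simulation. This is not Greenplum or PXF code: the "remote source" is just a Python list, and the function names and sample rows are invented for illustration. It compares pulling every row and column into the warehouse with asking the source to filter and project first.

```python
# Hypothetical rows held in an external data store.
REMOTE_ROWS = [
    {"id": 1, "region": "EU", "amount": 120, "notes": "n/a"},
    {"id": 2, "region": "US", "amount": 80, "notes": "n/a"},
    {"id": 3, "region": "EU", "amount": 200, "notes": "n/a"},
    {"id": 4, "region": "APAC", "amount": 50, "notes": "n/a"},
]


def scan_remote(rows, columns=None, predicate=None):
    """Simulate a source that can filter and project before returning data."""
    result = []
    for row in rows:
        if predicate is not None and not predicate(row):
            continue  # push-down filter: the row never leaves the source
        if columns is not None:
            row = {c: row[c] for c in columns}  # column projection
        result.append(row)
    return result


def cells_transferred(rows):
    """Count the values that crossed the 'network'."""
    return sum(len(row) for row in rows)


# Without push-down: fetch everything, then filter in the warehouse.
pulled = scan_remote(REMOTE_ROWS)
naive_total = sum(r["amount"] for r in pulled if r["region"] == "EU")

# With push-down: the source returns only the EU rows and only the amount column.
pushed = scan_remote(REMOTE_ROWS, columns=["amount"],
                     predicate=lambda r: r["region"] == "EU")
pushed_total = sum(r["amount"] for r in pushed)

print(naive_total, cells_transferred(pulled))
print(pushed_total, cells_transferred(pushed))
```

Both approaches produce the same answer, but the second moves far fewer values, which is the effect the optimiser is aiming for in a federated query.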
Figure 2 - The architecture of Greenplum
Looking at the architecture shown in Figure 2 in more detail, the key to Greenplum’s performance is the way that it distributes data, using a combination of vertical partitioning and segments, where segments also support redundancy for failover purposes. The product also features dynamic pipelining, in-memory query processing, workload management (significantly enhanced in the latest version, 6.0), automatic elastic query execution (where more resources are spun up as required), and the ability to expand or shrink the cluster dynamically when the product is deployed in the cloud.
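The idea of spreading data across segments and working on each segment in parallel can be sketched in a few lines. The example assumes rows are assigned to segments by hashing a distribution key; zlib.crc32 is used purely as a deterministic stand-in and is not Greenplum's own hash function. The row layout and function names are hypothetical, and the "parallel" step is run sequentially here for simplicity.

```python
import zlib
from collections import defaultdict


def segment_for(key, num_segments):
    """Map a distribution key to a segment (crc32 is an illustrative stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key, num_segments):
    """Place each row on the segment chosen by hashing its distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key], num_segments)].append(row)
    return segments


def partial_sums(segment_rows, group_col, value_col):
    """Work done independently on one segment: a local GROUP BY ... SUM."""
    partial = defaultdict(int)
    for row in segment_rows:
        partial[row[group_col]] += row[value_col]
    return dict(partial)


def merge_partials(partials):
    """Final step on the coordinator: combine the per-segment results."""
    totals = defaultdict(int)
    for partial in partials:
        for group, value in partial.items():
            totals[group] += value
    return dict(totals)


rows = [
    {"customer_id": i, "region": ("EU", "US", "APAC")[i % 3], "amount": i * 10}
    for i in range(1, 13)
]
segments = distribute(rows, "customer_id", 4)
partials = [partial_sums(seg, "region", "amount") for seg in segments]
totals = merge_partials(partials)
print(totals)
```

Because each segment only sees its own rows, the local aggregation needs no coordination, and only the small partial results have to be merged at the end.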
Other noteworthy capabilities include ACID compliance with support for two-phase commit (with major performance improvements in the latest version of the product); connectivity with both Spark and Kafka (mapping Kafka topics to Greenplum tables on a continuous basis, with exactly-once delivery); the Greenplum Command Center for database administration and management (significantly enhanced in the latest release); high availability; snapshotting to support disaster recovery; and write-ahead logging, which improves stability and fault tolerance in version 6. In the latest release, the company has also made a concerted effort to reduce the need to move data. For example, it has introduced a replicated tables capability and, when querying a third-party environment such as Amazon S3, you now only have to move the required data into Greenplum rather than loading it in bulk.
Greenplum is very highly featured. Outside of the behemoths of the data warehousing space there are few, if any, competitors that can offer the range of datatype support that Greenplum offers. In particular, many other suppliers cannot support time-series or text, let alone offer support for image processing. The ability to leverage MADlib is also a major advantage.
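For readers unfamiliar with two-phase commit, the following is a minimal, generic sketch of the protocol: a coordinator first asks every participant to prepare, and only commits if all of them agree. It is a textbook simplification with hypothetical class and function names, and it does not represent Greenplum's internal implementation (there is no logging, timeout handling or recovery here).

```python
class Participant:
    """A hypothetical database instance taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote on whether the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on every participant, or on none of them."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if the vote was unanimous, otherwise roll back all.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


healthy = [Participant("seg0"), Participant("seg1"), Participant("seg2")]
ok = two_phase_commit(healthy)

faulty = [Participant("seg0"), Participant("seg1", can_commit=False)]
failed = two_phase_commit(faulty)

print(ok, [p.state for p in healthy])
print(failed, [p.state for p in faulty])
```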
The Bottom Line
We are impressed with Greenplum. It has far more features than most, if not all, of the recent entrants into the data warehousing market while, on the other hand, we would expect it to be attractive, from a price/performance perspective, compared to the traditional suppliers to this market.