Cloudera Data Warehouse

Solution updated on July 1, 2020

The Cloudera Data Warehouse is based on the Cloudera Data Platform (CDP), which is effectively a convergence of the Hadoop platforms that were previously offered separately by Cloudera and Hortonworks. The new Cloudera has adopted the open source principle previously advocated by Hortonworks: everything within CDP is now available under an open source licence, whereas some Cloudera products were previously treated as proprietary. The company has also adopted a cloud-first development process, whereby new features are released first in the cloud (AWS or Azure) and only subsequently for on-premises implementations. For example, Cloudera ML (machine learning) Experience and the Cloudera Data Catalog are both available for in-cloud deployments today (2019) but will only be available on-premises sometime during the first half of 2020.

Figure 1 is a marketecture diagram showing the capabilities provided by CDP, of which the Cloudera Data Warehouse is one use case. The whole environment involves more than 30 different open source (Apache) projects, especially in the security and governance layer and the analytics layer. It would be tedious to call each of these out by name, though Figure 2 identifies those relevant to the data warehouse. Suffice it to say that, as Figure 1 illustrates, the environment is comprehensive.

Figure 1 – Marketecture, showing capabilities of Cloudera Data Warehouse

Customer Quotes

“Thanks to the Cloudera platform, we can serve our customers much better and faster, we can respond to regulatory compliance much better and faster, and we can improve our fraud detection capabilities.”
KBTG (Kasikorn Bank)

“We’ve created a platform that provides our scientists with insights that can shorten delivery timelines, reduce costs, expand reach, increase safety, and, in the end, improve, extend, and save lives.”
GlaxoSmithKline

Figure 2 shows the various elements of the platform that are specific to the Cloudera Data Warehouse environment.

Figure 2 – Elements of platform specific to Cloudera Data Warehouse

As can be seen, a major feature of the Cloudera Data Warehouse is that, from a storage perspective, it is much more than an HDFS platform. Other notable capabilities, which you may not be able to infer directly from Figure 2, include support for time-series and geospatial data (important in Internet of Things contexts, as is support for Apache NiFi and Kafka); auto-scaling (the cluster shrinks or expands depending on workload, and compute is separated from storage), along with auto-suspension and auto-resumption; workload isolation, so that differing tasks do not compete with each other for resources (different compute engines can run against the same data); and data and metadata caching to improve performance. Because of this caching, subsequent queries run faster than the first, even when a query is similar rather than identical to an earlier one, or when it is a different query running against the same data. There are also migration capabilities for users moving to the cloud from on-premises implementations.
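The data caching point can be illustrated with a toy sketch in Python. It is not Cloudera's implementation: the `CachingEngine` class, its fields and the sample `readings` table are hypothetical names invented for this illustration, and a plain dictionary stands in for remote storage such as HDFS or S3. The sketch only shows why a second, different query against the same data can avoid a further read from storage.

```python
from dataclasses import dataclass, field


@dataclass
class CachingEngine:
    """Toy query engine that keeps table data in memory after the first scan."""

    storage: dict                                # table name -> rows (stand-in for remote storage)
    cache: dict = field(default_factory=dict)    # table name -> rows already fetched
    storage_reads: int = 0                       # how many times storage was actually read

    def _load(self, table):
        # Fetch from storage only on a cache miss; later calls reuse the cached rows.
        if table not in self.cache:
            self.storage_reads += 1
            self.cache[table] = self.storage[table]
        return self.cache[table]

    def query(self, table, predicate):
        # Every query filters the (possibly cached) rows of one table.
        return [row for row in self._load(table) if predicate(row)]


storage = {
    "readings": [
        {"sensor": "a", "temp": 21},
        {"sensor": "b", "temp": 35},
    ]
}
engine = CachingEngine(storage)

# First query: cache miss, so the table is read from storage.
hot = engine.query("readings", lambda r: r["temp"] > 30)

# Different query, same data: served from the cache without another storage read.
from_a = engine.query("readings", lambda r: r["sensor"] == "a")

print(hot, from_a, engine.storage_reads)
```

In this sketch `storage_reads` stays at one after both queries, which is the behaviour the paragraph above describes for different queries over the same data. A real engine would also have to handle eviction and invalidation, which are omitted here.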

It is also worth commenting on Data Steward Studio, which provides a user interface for data stewards that includes profiling and classification capabilities, as well as the ability to discover sensitive data using both machine learning and natural language processing. Policy management and auditing are also provided. Finally, Cloudera Machine Learning (CML) offers a self-service data science and machine learning development environment with on-demand access to (governed) business data, auto-scaling computing resources, and users’ preferred libraries, frameworks and IDEs for the Python, R and Scala ecosystems. The Cloudera Data Science Workbench is CML’s on-premises equivalent.

Hadoop clusters are complex to implement and manage, so moving to a cloud environment where that complexity is removed makes a lot of sense. You don’t have to worry about resource management or, for that matter, governance and security, because these are built in. Moreover, Cloudera Data Warehouse isn’t just about Hadoop: if you want to use Amazon S3 or Azure Blob Storage instead of, or in addition to, HDFS, then you can do that, and you can also leverage other Apache database engines.

Going beyond this purely architectural perspective, Cloudera Data Warehouse is unusual in that it also provides a data catalog, the Data Steward Studio, the Data Science Workbench and Hue (a SQL editor). Most competing data warehouses, in the cloud or otherwise, do not offer all of these complementary technologies, and it is arguable that Cloudera Data Warehouse is a misnomer: the product is much more than just a data warehouse.

However, those are technical arguments. The other major benefit that Cloudera offers is that it is completely based on open source projects. Whether for licensing reasons or simply because of preference, this will be a major argument in favour of Cloudera for many decision makers.

The Bottom Line

We are going to have to stop thinking about Cloudera as a Hadoop company and start considering it as a general-purpose, data- and database-oriented organisation that is focused on open source and the cloud. That will be a powerful argument in its favour for many people.
