Cloudera was founded in 2008. Initially backed by venture capital, it floated on the New York Stock Exchange in 2017. The following year it announced a merger with Hortonworks, which was completed in January 2019. The company is expecting revenues of around $780m for the financial year ending January 2020. While historically known as a commercial provider of a Hadoop distribution, the company now markets itself as the “Enterprise Data Cloud Company”. This does not mean that it is eschewing its heritage, but rather that it is focusing on the provisioning of data in the cloud, so that users do not have to worry about the complexity involved in provisioning big data and data warehouse clusters.
Company Info
Headquarters: 5470 Great America Pkwy, Santa Clara, CA 95054 Telephone: +1 888 789 1488
The Cloudera Data Platform (CDP) is effectively a convergence of the Hadoop platforms that were previously offered by Cloudera and Hortonworks individually. The new Cloudera has adopted the open source principle previously advocated by Hortonworks, in that everything within the CDP platform is now available with an open source license, where previously some Cloudera products were treated as proprietary. The company has also adopted a cloud-first development process, whereby new features appear first in the cloud (AWS or Azure) and only subsequently in on-premises implementations.
Fig 01 - The Cloudera Data Platform
Figure 1 is a marketecture diagram showing the capabilities provided by CDP. The whole environment involves more than 30 different open source (Apache) projects, especially in the security, governance and analytics layers, but it would be tedious to call each of these out by name. Note that the Cloudera Data Warehouse represents one use case for CDP but is not otherwise discussed here.
Customer Quotes
“Real-time analytics and scalability are factors imperative to the sustainable growth of BSE. This will ensure that our critical systems are future proof so we can continue to enable the industry by building capital market flows. Cloudera meets our custom requirement by providing us with industry-standard technology and infrastructural expertise that has helped us deploy the highest number of references with the lowest total cost of ownership among vendors.” Bombay Stock Exchange
Fig 02 - The CDP integration environment and functions
For data ingestion, stream processing, real-time streaming analytics and large-scale data movement into the data lake or cloud stores, CDP’s DataFlow capabilities, powered by Apache NiFi, MiNiFi, Kafka, Flink or Spark Streaming, are ideal. They enable the journey of data-in-motion from the edge to the cloud or the enterprise through an integrated platform. ELT (extract, load and transform) is supported, with Apache NiFi powering the extract and load parts, while Hive (with Tez) or Spark provides the transform part. Traditional ETL is also possible, though the majority of CDP use cases involve a much higher scale of volume and velocity than traditional ETL. Figure 2 provides an illustration of the sorts of capabilities provided.
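To make the ELT ordering concrete, the following is a deliberately minimal sketch in plain Python of the pattern described above: raw records are landed in the lake unchanged first, and transformation happens afterwards, in place. This is illustrative only; in CDP the extract and load stages would be handled by NiFi or Kafka and the transform stage by Hive, Tez or Spark, none of whose APIs appear here.

```python
# Toy ELT flow: extract and load land raw data first; transform runs later.
# Names and data are invented for illustration; no Cloudera APIs are used.

def extract(source):
    """Pull raw events from a source (here, just an in-memory list)."""
    return list(source)

def load(lake, records):
    """Land raw records in the 'data lake' untransformed (the E and L of ELT)."""
    lake.extend(records)
    return lake

def transform(lake):
    """Transform after loading: aggregate event counts per user."""
    counts = {}
    for event in lake:
        counts[event["user"]] = counts.get(event["user"], 0) + 1
    return counts

source = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
lake = []
load(lake, extract(source))
print(transform(lake))  # {'a': 2, 'b': 1}
```

The point of the ordering is that the lake retains the untransformed records, so different engines can later run different transformations over the same raw data.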
As far as data quality is concerned, Cloudera’s approach is that you land your data in your data lake, curate it using a third-party data preparation tool, and then move it into your data warehouse if appropriate. Thus, while data cleansing transformations are available, there is no dedicated data quality tool per se. On the other hand, CDP includes Cloudera SDX (shared data experience), which includes the Hive Metastore, Apache Ranger for security, Apache Atlas for governance and Apache Knox for single sign-on.

Atlas is especially worth calling out because it provides a metadata repository for assets within the enterprise: that is, details about the assets derived from use of technologies such as Hive, Impala, Spark, Kafka and so on. More than 100 of these are supported out of the box and there is support for defining additional asset classes. Under the hood, a graph database is used to store asset definitions and instances, and the graph-based nature of the product allows you to explore relationships between different asset classes and to support data lineage. Atlas also supports the classification of assets, and classifications can be linked to glossary terms to enable easier discovery of assets. The ability to apply policies to assets is enabled by attaching tags to columns; these propagate through lineage and are applied automatically to all derived tables, which is especially useful for activities such as masking sensitive data. Atlas integrates with Apache Ranger to enable classification-based access control.
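The propagation mechanism described above can be sketched in a few lines. This is a toy model, not Atlas code: the real product stores assets and lineage in a graph database and propagates classifications for you, and all table and tag names below are invented for the illustration.

```python
# Toy model of tag propagation through lineage: a tag attached to a source
# asset is inherited by every table derived from it. Illustrative only.

lineage = {                      # derived table -> tables it was built from
    "sales_clean": ["sales_raw"],
    "sales_report": ["sales_clean"],
}

tags = {"sales_raw": {"PII"}}    # tags attached directly to assets

def effective_tags(asset):
    """Union of an asset's own tags with those inherited via lineage."""
    result = set(tags.get(asset, set()))
    for parent in lineage.get(asset, []):
        result |= effective_tags(parent)
    return result

print(effective_tags("sales_report"))  # {'PII'}
```

Because the derived report inherits the PII tag, a masking policy defined once against that tag would apply to the report automatically, which is the behaviour the Ranger integration exploits.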
In addition to the metadata repository provided by Atlas, Cloudera also offers the Cloudera Data Catalog. This is intended for data stewards and end users to browse, curate and tag data. It provides a single pane of glass for data security and governance across all deployments, with the caveat that it is currently limited to data stored within the Cloudera environment.
Fig 03 - Cloudera Data Warehouse within the broader CDP environment
The major reason for adopting CDP is that it is all open source and it supports a very wide range of open source technologies, many of which are not widely supported by other vendors. For example, relatively few other suppliers support Apache Flink or Sqoop. In addition, CDP forms the underlying platform for implementing the Cloudera Data Warehouse as either a data warehouse or a data lake. In this context see Figure 3, which we present to illustrate that Cloudera is not just about Hadoop: note the support for object storage, Apache Druid and so on.
Going into specifics, we particularly like the way that Apache Atlas enables governance policies to be defined once and then applied automatically across all relevant tables. Conversely, we would like to see more automation built into the platform: machine learning has been implemented to support workload management, but otherwise it is primarily a roadmap item at present. Cloudera is clearly on the right track here but, being purists, we would like to see it go faster!
The Bottom Line
Readers need to let go of the idea of equating Cloudera with Hadoop. Cloudera is a general-purpose, data- and database-oriented organisation that is focused on open source as well as the public, hybrid and private cloud. That will be a powerful argument in its favour for many people.
The Cloudera Data Warehouse is based on the Cloudera Data Platform (CDP), which is effectively a convergence of the Hadoop platforms that were previously offered by Cloudera and Hortonworks individually. The new Cloudera has adopted the open source principle previously advocated by Hortonworks, in that everything within the CDP platform is now available with an open source license, where previously some Cloudera products were treated as proprietary. The company has also adopted a cloud-first development process, whereby new features appear first in the cloud (AWS or Azure) and only subsequently in on-premises implementations. For example, Cloudera ML (machine learning) Experience and the Cloudera Data Catalog are both available for in-cloud deployments today (2019) but will only be available on-premises sometime during the first half of 2020.
Figure 1 - Marketecture, showing capabilities of Cloudera Data Warehouse
Figure 1 is a marketecture diagram showing the capabilities provided by CDP, of which the Cloudera Data Warehouse is one use case. The whole environment involves more than 30 different open source (Apache) projects, especially in the security, governance and analytics layers, but it would be tedious to call each of these out by name (see, however, Figure 2). Suffice it to say that, as Figure 1 illustrates, the environment is comprehensive.
Customer Quotes
“Thanks to the Cloudera platform, we can serve our customers much better and faster, we can respond to regulatory compliance much better and faster, and we can improve our fraud detection capabilities.” KBTG (Kasikorn Bank)
“We’ve created a platform that provides our scientists with insights that can shorten delivery timelines, reduce costs, expand reach, increase safety, and, in the end, improve, extend, and save lives.” GlaxoSmithKline
Figure 2 - Elements of platform specific to Cloudera Data Warehouse
Figure 2 shows the various elements of the platform that are specific to the Cloudera Data Warehouse environment.
As can be seen, a major feature of the Cloudera Data Warehouse is that, from a storage perspective, it is much more than an HDFS platform. Other notable capabilities, which you may not be able to infer directly from Figure 2, include support for time-series and geospatial data (important in Internet of Things contexts, as is support for Apache NiFi and Kafka); auto-scaling (the cluster shrinks or expands depending on workload, and compute is separated from storage), auto-suspension and auto-resumption; workload isolation, so that differing tasks do not compete with each other for resources (different compute engines can run against the same data); and data and metadata caching to improve performance, so that subsequent queries run faster than the first, whether a later query is similar rather than identical or is a different query running against the same data. There are also migration capabilities for users moving to the cloud from on-premises implementations.
It is also worth commenting on Data Steward Studio, which provides a user interface for data stewards that includes profiling and classification capabilities as well as the ability to discover sensitive data using both machine learning and natural language processing. Policy management and auditing are also provided. Finally, Cloudera Machine Learning (CML) offers a self-service data science and machine learning development environment with on-demand access to (governed) business data, auto-scaling computing resources, and users’ preferred libraries, frameworks and IDEs for the Python, R and Scala ecosystems. The Cloudera Data Science Workbench is CML’s on-premises equivalent.
Hadoop clusters are complex to implement and manage, so moving to a cloud environment where that complexity is removed makes a lot of sense: you do not have to worry about resource management or, for that matter, governance and security, because these are built in. Moreover, Cloudera Data Warehouse isn’t just about Hadoop: if you want to use Amazon S3 or Azure Blob storage instead of, or in addition to, HDFS, then you can do that, and/or you can leverage other Apache database engines.
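By way of illustration, Hadoop-based engines typically address S3 through the s3a connector, so switching storage can be largely a matter of configuration rather than re-architecture. A minimal sketch of such a configuration follows; the credential values are placeholders, and in a managed CDP deployment Cloudera would handle this provisioning for you.

```xml
<!-- core-site.xml fragment: placeholder values, for illustration only -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With this in place, tables and files can be addressed via `s3a://bucket/path` URIs instead of (or alongside) `hdfs://` paths.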
Going beyond this purely architectural perspective, Cloudera Data Warehouse is unusual in that it also provides a data catalog, the Data Steward Studio, the Data Science Workbench and Hue (a SQL editor). Most competitive data warehouses, in the cloud or otherwise, do not offer all of these complementary technologies, and it is arguable that Cloudera Data Warehouse is a misnomer: the product is much more than just a data warehouse.
However, those are technical arguments. The other major benefit that Cloudera offers is that it is completely based on open source projects. Whether for licensing reasons or simply because of preference, this will be a major argument in favour of Cloudera for many decision makers.
The Bottom Line
We are going to have to stop thinking about Cloudera as a Hadoop company and start considering it as a general-purpose, data- and database-oriented organisation that is focused on open source and the cloud. That will be a powerful argument in its favour for many people.