Cassandra and Hadoop

I am continuing to investigate Hadoop storage options as I get briefed by more vendors and as new products get released. In this article I want to focus on Cassandra.

DataStax is the leading commercial provider of Cassandra distributions. Cassandra is a BDDB (big data database). However, unlike HDFS (the standard storage mechanism for Hadoop) or GPFS (IBM’s alternative), which are file systems, Cassandra is a database, and specifically a column-family store rather than a simple key-value store. This is not to be confused with a column-based relational database such as HP Vertica or ParAccel. In fact, it is unfortunate that whoever coined the name “column-family” didn’t think of something else. The point is that while Infobright and Sensage (two more columnar relational databases) and Cassandra all use columns, that is the limit of their similarity: the former two are relational and Cassandra isn’t.

I don’t intend to go into the details of column-family databases and how they are architected, at least not now. But the main difference between a column-family database such as Cassandra and a key-value data store is that the latter stores just a key and a value, while the former stores tuples that consist of a name, a value and a timestamp. It is the timestamp that makes a big difference: there are lots of environments (smart metering, security logs and so on) where understanding time series is important, and this means that Cassandra can support applications that Hadoop alone cannot readily support. Not surprisingly, DataStax is exploiting this capability. Thus, for example, you can order data either by the time at which it arrived in the database or by the time at which the events actually occurred (which may not be the same thing). You can also index against the timestamps and, indeed, the software supports secondary indexes as well. One further notable feature is that DataStax has introduced CQL as a query language. This is modelled on a subset of SQL, although you can’t do such things as joins, because there are no relational tables.
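To make the name/value/timestamp tuple more concrete, here is a minimal in-memory sketch in Python. It is an illustration only and not Cassandra’s actual implementation or API: the Column and ColumnFamily classes, the meter-42 row key and the sample readings are all hypothetical. It assumes one common modelling choice, which is to use the event time as the column name and the arrival time as the timestamp, so that the same row can be read back in either order.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Column:
    """One column-family tuple: a name, a value and a timestamp."""
    name: str
    value: Any
    timestamp: int  # hypothetical arrival time, e.g. a logical clock


class ColumnFamily:
    """Toy store (an assumption for illustration): row key -> named columns."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Column]] = {}

    def insert(self, row_key: str, name: str, value: Any, timestamp: int) -> None:
        row = self._rows.setdefault(row_key, {})
        current = row.get(name)
        # Keep whichever write carries the newest timestamp for this name.
        if current is None or timestamp >= current.timestamp:
            row[name] = Column(name, value, timestamp)

    def by_name(self, row_key: str, start: str, end: str) -> List[Column]:
        """Columns whose names fall in [start, end], sorted by name."""
        row = self._rows.get(row_key, {})
        return [row[n] for n in sorted(row) if start <= n <= end]

    def by_timestamp(self, row_key: str) -> List[Column]:
        """All columns of a row, sorted by arrival timestamp."""
        row = self._rows.get(row_key, {})
        return sorted(row.values(), key=lambda c: c.timestamp)


# Hypothetical smart-meter readings; the 10:15 reading arrives last.
readings = ColumnFamily()
readings.insert("meter-42", "2012-03-01T10:00", 1.2, timestamp=100)
readings.insert("meter-42", "2012-03-01T10:30", 1.7, timestamp=200)
readings.insert("meter-42", "2012-03-01T10:15", 1.5, timestamp=300)

# A stale write for an existing column name is ignored.
readings.insert("meter-42", "2012-03-01T10:00", 9.9, timestamp=50)

by_event = [c.value for c in readings.by_name(
    "meter-42", "2012-03-01T10:00", "2012-03-01T10:30")]
by_arrival = [c.value for c in readings.by_timestamp("meter-42")]

print(by_event)    # order in which the events occurred
print(by_arrival)  # order in which the data arrived
```

A plain key-value store would hold only the name and the value, so the second ordering could not be recovered unless the application stored the arrival time itself.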

As far as Hadoop is concerned, you can implement Hadoop and Cassandra on the same cluster. This means that you can have your time-based and real-time applications (real-time being a strength of Cassandra) running under Cassandra while batch-based analytics and queries that do not require a timestamp run on Hadoop. In practice, in this environment, Cassandra replaces HDFS under the covers, but this is invisible to the developer. You can reassign nodes between the Cassandra and Hadoop environments (dynamically, where appropriate) to suit your workload. The other major upside is that using Cassandra removes the single points of failure associated with a standard Hadoop deployment, namely the HDFS NameNode and the MapReduce JobTracker, which I have discussed in previous articles.

One final point is that Cassandra has a reputation for being difficult to get started with. In order to simplify this process, DataStax is providing installers, examples and so forth within its Community Edition, while the Enterprise Edition, amongst other things, includes a visual, point-and-click, web-based management environment that integrates with third-party environments such as Tivoli and OpenView.