Is Hadoop developing too fast for its own good?
Consider vendors' (and some users') initial attitude towards Hadoop. You would have thought it was a panacea for all ills. Of course, people tried to improve on it incrementally, removing the NameNode and JobTracker single points of failure, for example. That's fine; I have no argument with that.
But think about the task of writing queries to run against Hadoop. Initially there was MapReduce, but that required programming skills, so Facebook developed Hive to give us a more declarative environment: a query that takes dozens of lines of MapReduce Java code can be expressed in a few lines of SQL-like HiveQL. However, because Hive compiles its queries down to batch MapReduce jobs it is too slow for interactive use, so we got initiatives like Impala, which bypasses MapReduce altogether. I have no argument with that either, apart from the lack of database optimisation in Hadoop, which I've discussed previously.
Now we have Spark, Shark and Spark Streaming, not to mention MLlib (machine learning) and GraphX, where Spark replaces MapReduce and Shark (which is built on top of Spark) replaces, but remains compatible with, Hive. The other three capabilities, also built on top of Spark, are new.
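To make that layering concrete, here is a minimal sketch, against the Spark 0.9-era Scala API, of how Spark Streaming sits on top of the Spark core: the streaming job below uses the same flatMap/map/reduceByKey transformations a batch Spark job would, just applied to small time-sliced batches. The hostname and port are placeholders of my own, not anything from a real deployment.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // A StreamingContext wraps an ordinary Spark core, slicing the
    // incoming stream into one-second micro-batches.
    val ssc = new StreamingContext("local[2]", "StreamingWordCount", Seconds(1))

    // Placeholder source: lines of text arriving on a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Exactly the transformations a batch Spark job would use,
    // applied to each micro-batch in turn.
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()   // emit each batch's counts to stdout
    ssc.start()
    ssc.awaitTermination()
  }
}
```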
Spark is currently at version 0.9, so it's not ready for prime time yet, and the other sets of capabilities are, necessarily, further in the future. Nevertheless, Spark, which is an Apache project, is claimed to run up to 100 times faster than MapReduce for in-memory processing and 10 times faster for disk-based processing. It also requires programming (in Java, Scala or Python), but it doesn't force every job through MapReduce's two-phase (or three-phase, if you count shuffling) cycle, with intermediate results written to disk between phases; instead it chains arbitrary transformations together and keeps working data in memory.
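As an illustration, here is a minimal word count in the Spark 0.9-era Scala API; the input path is a placeholder. The point to notice is that the job is an open-ended chain of transformations, evaluated lazily and cacheable in memory, rather than one fixed map-shuffle-reduce pass.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; a cluster master URL would go here instead.
    val sc = new SparkContext("local[2]", "WordCount")

    // Build a pipeline of transformations; nothing executes yet (lazy evaluation).
    val counts = sc.textFile("input.txt")          // placeholder path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Keep the computed result in memory so later actions
    // reuse it without recomputing from the source file.
    counts.cache()

    counts.take(10).foreach(println)  // print ten of the (word, count) pairs
    sc.stop()
  }
}
```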
I am ambivalent about all of this. On the one hand, I like to see better performance and improved technology, so from that perspective I am very pleased with Spark. On the other hand, it worries me when innovations arrive this quickly and this regularly. And these are not minor changes but big ones.
The problem, of course, is that Hadoop was not truly enterprise-ready when everybody jumped onto its bandwagon. Those who didn't jump on then would do well to wait until Spark is production-ready and take second-mover advantage. But the real question is this: if Spark represents a second generation of Hadoop, how long will it be before we get a third generation? Given the rate of change in the Hadoop community, is there any reason to doubt that there will be one? In which case, should companies wait for third-mover advantage?
The big problem with Hadoop is that it’s not mature and, worse, nobody knows when it will be.