Data Lake Management
Date:
By: Philip Howard and Daniel Howard
Classification: Spotlight
Part of the reason data lakes have become popular is that setting up a basic one is inexpensive and easy: all you really need is some spare hardware and Hadoop, and you’re off to the races. Unfortunately, without additional software, such lakes are likely to fail when used for anything substantial, because they lack effective processes. The open source nature of the data lake community compounds the problem: many of the available software offerings are open source, and therefore cheap to try out, but, in part because of this proliferation, there are no pre-packaged, one-size-fits-all solutions. Building a truly effective data lake is therefore difficult, as a suite of mostly open source solutions must be assembled manually to address a variety of issues. This paper discusses these issues and how they might be addressed. A companion paper, a Market Update on Data Lake Management, discusses the solutions that a range of vendors provide to prevent your lake turning into a swamp.