Big data integration
Content Copyright © 2013 Bloor. All Rights Reserved.
In the previous articles in this series I have considered the need for trust in, context around and the security of big data. In each of these cases, governance capabilities are required that parallel those for conventional data, even though, taken individually, these requirements are typically simpler than those for transactional data. Conversely, governing the variety of different types of data that may be analysed will typically require a more agile approach. Nevertheless, it is not really any more complex or complicated. Unfortunately, the same cannot be said of integration.
Take smart meters as an example. You collect the data to feed your sales invoicing and your CRM system, as well as to support capacity planning in your power stations (if you are an electricity generator). In addition, you will want reconciliations between the smart meter data and billing systems to prevent leakage, you will want integration with fraud systems, and you will need smart metering (error) data to feed into your service management applications. That’s half a dozen different applications that you will want to feed from your smart meters; and there are probably others that I haven’t thought of (such as loading the data into a data warehouse for analysis and subsequently archiving it).
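To make the fan-out concrete, here is a minimal sketch in Python of one smart meter reading being delivered to each of the downstream applications mentioned above. All of the names (the consumer functions, the `MeterReading` fields, the rule that error readings only go to service management) are illustrative assumptions, not a description of any real product.

```python
from dataclasses import dataclass

@dataclass
class MeterReading:
    meter_id: str
    timestamp: str      # ISO 8601, e.g. "2013-06-01T00:30:00Z"
    kwh: float
    error_code: int     # 0 = normal reading; non-zero = meter fault

# Hypothetical downstream consumers; each is a stand-in for a real system.
def to_billing(r):            return ("billing", r.meter_id, r.kwh)
def to_crm(r):                return ("crm", r.meter_id)
def to_capacity_planning(r):  return ("capacity", r.timestamp, r.kwh)
def to_fraud(r):              return ("fraud", r.meter_id, r.kwh)
def to_service_mgmt(r):       return ("service", r.meter_id, r.error_code)
def to_warehouse(r):          return ("warehouse", r.meter_id, r.kwh)

CONSUMERS = [to_billing, to_crm, to_capacity_planning,
             to_fraud, to_service_mgmt, to_warehouse]

def fan_out(reading):
    """Deliver one reading to every downstream application.

    Illustrative rule: error readings are only of interest to
    service management, so other consumers are skipped for them.
    """
    results = []
    for consume in CONSUMERS:
        if reading.error_code and consume is not to_service_mgmt:
            continue
        results.append(consume(reading))
    return results
```

Even this toy version shows the point of the paragraph above: a single data source ends up wired to half a dozen targets, each with its own delivery rules.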
That’s an awful lot of integration, and that’s before considering that you may have a streaming platform in the mix if you want to do real-time analysis, Cassandra if you want real-time trending as well, and/or Hadoop if you want neither of these, plus your data warehouse and archiving platforms – but that’s an aside. Some of this integration is going to be hard-wired, but you’re not going to hard-wire all of it, at least not to start with, so you’re going to need ETL (extract, transform and load) or ELT, data federation and, quite possibly, data replication as well. And, of course, you need to manage all of this: that means having the metadata to understand where and how these different integration techniques are used, and you are going to need lineage capabilities across this environment, which feeds into (if it is not actually part of) your big data governance requirement.
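The metadata and lineage requirement can be sketched very simply: record which integration technique moves which fields between which systems, and you can then walk those records to answer the lineage question "what ultimately feeds this system?". The system names, techniques and field lists below are made-up examples, assuming the smart meter scenario discussed earlier, not the schema of any real metadata repository.

```python
# Each record: one data movement, the technique used, and the fields moved.
LINEAGE = [
    {"source": "smart_meters", "target": "billing",
     "technique": "ETL", "fields": ["meter_id", "kwh"]},
    {"source": "smart_meters", "target": "fraud",
     "technique": "replication", "fields": ["meter_id", "kwh"]},
    {"source": "billing", "target": "warehouse",
     "technique": "ELT", "fields": ["meter_id", "kwh", "invoice_id"]},
]

def upstream_of(system, records):
    """Walk lineage records back to every system that (directly or
    indirectly) feeds `system`."""
    sources = set()
    frontier = {system}
    while frontier:
        current = frontier.pop()
        for rec in records:
            if rec["target"] == current and rec["source"] not in sources:
                sources.add(rec["source"])
                frontier.add(rec["source"])
    return sources
```

Asking `upstream_of("warehouse", LINEAGE)` traces the warehouse back through billing to the smart meters themselves – exactly the kind of cross-environment lineage the governance requirement calls for, regardless of whether a given hop was ETL, ELT, federation or replication.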
To be fair, this issue will not be quite as severe in big data environments where you are ingesting data purely for analytic purposes, as opposed to environments where there are also transactional implications. For example, if you are analysing social media data then your integration requirements will be more limited (although there are many social media sites, and you may be adding or changing the sites you derive data from on an ongoing basis); nevertheless, they will still be more complex than before.
So, bearing in mind that I started off by saying that your governance environment needs to be more flexible for big data, the same applies, sometimes in spades, to integrating it with your conventional environment.