Is data warehousing holding back the advance of analytics?
Content Copyright © 2013 Bloor. All Rights Reserved.
I have worked in data warehousing and analytics ever since it became accepted that Business Intelligence solutions had to offer more than reports bolted onto operational systems. Of late, though, I have come to a realisation: by extracting data from the operational environment and loading it into a business intelligence environment, we introduce limitations that defeat the very purpose of doing so in the first place.
In a traditional data warehouse environment you have complex software to extract data, transform it, and reload it into a new location. Increasingly, that procedure takes a very long time, even when you throw very expensive hardware and software at it, because of the sheer volume involved.
Once we have the data, the next step is to store it in a third normal form layer, to isolate it from changes in the source systems and make it independent of future change. That is all very worthy, but again complex and time-consuming. No end user can actually use data in third normal form, so we move it again into a simpler format in a presentation layer, which also takes time.
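To make the two extra hops concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns and figures are invented for illustration and are not taken from any real warehouse. The data first lands in normalised tables, then has to be joined and aggregated into a flat table before a business user can query it.

```python
import sqlite3

# In-memory database standing in for the warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised (third normal form) layer: each fact is stored once.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        product_id  INTEGER REFERENCES product(product_id),
        amount      REAL
    );
    INSERT INTO customer VALUES (1, 'Acme', 'North'), (2, 'Globex', 'South');
    INSERT INTO product  VALUES (10, 'Widget'), (11, 'Gadget');
    INSERT INTO sale     VALUES (100, 1, 10, 120.0), (101, 1, 11, 30.0), (102, 2, 10, 75.0);

    -- Presentation layer: a second copy, flattened so end users can query it.
    CREATE TABLE sales_by_region AS
    SELECT c.region AS region, p.title AS product, SUM(s.amount) AS total
    FROM sale s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.product_id  = s.product_id
    GROUP BY c.region, p.title;
""")

rows = conn.execute(
    "SELECT region, product, total FROM sales_by_region ORDER BY region, product"
).fetchall()
for row in rows:
    print(row)
```

Even in this toy case the same data is modelled twice, and the flat table has to be rebuilt every time the normalised layer changes.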
The net result is that we have tied up immense amounts of intellect and capital to deliver data to the business that is heavily compromised by latency, cost, and difficulty of use.
As data volumes have grown, we have introduced big data solutions such as Hadoop, which remove some of the complexity by avoiding the structured stores, and which deliver affordable, scalable solutions on commodity hardware. There is still a lot of complexity in the solution, though, because extracting value requires MapReduce programming, which remains an arcane skill and is not for the average user.
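For readers who have not met it, the shape of a MapReduce job can be sketched in a few lines of plain Python. This is not Hadoop code: the function names and the sample sales records are invented for illustration, and a real job would distribute each phase across a cluster. It does show why even a simple total has to be rethought as map, shuffle and reduce steps.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for customer, amount in records:
        yield customer, amount

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

def total_by_customer(records):
    return reduce_phase(shuffle(map_phase(records)))

# Hypothetical sales records: (customer, amount).
sales = [("acme", 120), ("globex", 75), ("acme", 30)]
totals = total_by_customer(sales)
print(totals)
```

What a business analyst thinks of as a one-line "total sales by customer" becomes three cooperating functions, which is the skills gap described above.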
After twenty years, I am starting to think the orthodox solutions have run their course and that we need to think differently. So when I see something like the Pervasive combination of RushAnalytics and DataRush, I see light at the end of the tunnel. What is required is something that enables Business Intelligence to provide the business with insight quickly and affordably. That means commodity hardware, not expensive technical solutions, and software that supports rapid, iterative development through visual interfaces rather than complex, arcane programming skills. In short, we need something that is fast, easy to use and affordable. If we can tick those boxes, we can begin to keep up with the demand the business has for analytics, at a price that makes it economically feasible.
What Pervasive is offering is a platform for data access, transformation, analysis, visualisation and delivery. It uses KNIME, an open source visual data analysis workflow environment, on top of the Pervasive DataRush parallel dataflow architecture. This means that with Pervasive RushAnalytics, you do not have to move the data into a specific data store before you can start to analyse it. Domain experts can now gain insight from data in time spans that are in a different league from the days that traditional analysis takes, and they can do so on commodity technology. This offers what business really craves: a speed of return on investment that is measured in hours, or even minutes, not weeks!
KNIME offers the tools for the data mining tasks required for risk management, fraud detection and superior decision support, including association rules, classifiers, clustering, principal component analysis and regression. These are all key to effective data mining, and all are available via a graphical interface, so it's point and click, not code and sweat. The resulting workflow is then executed on the highly parallel Pervasive DataRush processing engine.
When I first came across DataRush a couple of years ago, I thought it was the best-kept secret in the IT industry. It is designed to let code run in parallel across multiple cores without having to be redesigned to exploit the additional cores as we move from a single-threaded environment up through dual-core, quad-core, eight-core, sixteen-core machines and so on. DataRush detects the number of cores and nodes available at runtime and adjusts the processing workflow to exploit them, so its model is "build once and run on whatever": total future proofing.
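The idea of sizing the work to the machine at runtime can be sketched with Python's standard library. This is not DataRush code and says nothing about how DataRush works internally; the function names are invented for illustration. It uses a thread pool for simplicity, which shows the pattern but will not speed up CPU-bound Python code; a process pool would be the usual swap for that.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def parallel_sum_of_squares(values, workers=None):
    """Partition the work according to the core count found at runtime."""
    # Nothing is hard-coded: the same code adapts to 2, 4 or 16 cores.
    workers = workers or os.cpu_count() or 1
    chunks = split_evenly(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: sum(v * v for v in chunk), chunks)
        return sum(partials)

result = parallel_sum_of_squares(list(range(10)))
print(result)
```

The caller never states how many cores to use, so the same script runs unchanged on a laptop or a large server.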
I am hoping this is a sign that we are seeing the end of the world as we know it, one in which analytics is held back by the technology we make it run on. In its place comes a new era in which analytics can run unfettered and deliver the returns we all crave, which is an exciting prospect.