The case for a data quality platform
Here at Bloor Research we have recently been
investigating why so many data migration projects (84% of them) run
over time or over budget. Over half of the respondents in our survey
whose projects had run over budget blamed inadequate scoping (that
is, they had not properly or fully profiled their data), and more
than two thirds of those whose projects had overrun on time placed
the blame in the same place.
I mention this because it is symptomatic of all data integration
and data movement projects: data quality needs to start before you
begin your project (so that you can properly budget time and
resources), continue right through the course of the project and,
where it is not a one-time exercise like a data migration, be
maintained on an on-going basis once you are in production. Further,
in order to maintain quality you need to be able to monitor it (via
dashboards and the like) on an on-going basis as well.
This is especially important within the context of data
governance.
In other words, data quality follows the lifecycle of your data,
and it spans multiple applications and systems as the data is
reused and shared. In this article I discuss the need for an
integrated data quality platform to support such an environment,
with particular reference to the Trillium Software System, whose
latest release (version 11) has just been announced.
Version 11 includes some very significant features, not least of
which is that this release represents the culmination of the
efforts the company has made, over the last few years, to fully
integrate its Avellino acquisition. Specifically, this means that
there is now a single repository (Metabase) which is shared across
the environment (or you can have multiple Metabases), and that
there is a single integrated interface across the product. The
consequence of this is that data profiling and analysis need not be
distinct from data cleansing and matching. In other words, you can
now easily swap between one function and the other, as requirements
dictate, rather than being forced to use a more waterfall-style
approach in which profiling came first and quality came second.
The second major enhancement is the introduction of phrase
analysis. This provides the ability to identify unique words and
phrases (substrings), and combinations thereof, within a selected
attribute. The importance of this is that it allows you to parse
unstructured and semi-structured data and then to build data quality
rules based on the words and phrases identified. This is particularly important
if you want to apply data quality rules to product data or other
descriptive data that comes into the organisation in unstructured
formats.
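To make the idea of phrase analysis concrete, here is a minimal
sketch in Python (my own illustration, not Trillium's
implementation): it pulls candidate words and short phrases out of a
free-text product attribute, counts how often each occurs, and then
applies a simple data quality rule built on the phrases found. The
sample descriptions and the rule itself are invented.

    from collections import Counter

    # Invented sample of an unstructured product-description attribute
    descriptions = [
        "blue cotton shirt size M",
        "shirt, cotton, blue, medium",
        "COTTON SHIRT BLUE M",
    ]

    def phrases(text, max_words=2):
        # Yield single words and short word combinations (substrings) from one value
        words = text.lower().replace(",", " ").split()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                yield " ".join(words[i:i + n])

    # Profiling step: which words and phrases occur in the attribute, and how often
    frequencies = Counter(p for d in descriptions for p in phrases(d))
    print(frequencies.most_common(5))

    # A simple rule built from the discovered phrases: every description
    # should mention a material ("cotton") and a colour ("blue")
    def has_material_and_colour(text):
        found = set(phrases(text, max_words=1))
        return "cotton" in found and "blue" in found

    failures = [d for d in descriptions if not has_material_and_colour(d)]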
Shortly, phrase analysis will be extended by the introduction of
Universal Data Libraries that will provide standardised taxonomies
for things such as colours, units of measure, currencies, sizes and
shapes. Although the core libraries will be in English at
first, you will be able to customise these libraries so that you
can automatically recognise that blau = bleu = blue for
example.
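In practice, that customisation amounts to mapping local variants
onto a standard term. A hypothetical sketch of such a lookup (the
library contents here are invented purely for illustration):

    # Invented colour taxonomy: variants, including other languages, map to one standard value
    COLOUR_LIBRARY = {
        "blue": "blue", "blau": "blue", "bleu": "blue",
        "red": "red", "rot": "red", "rouge": "red",
    }

    def standardise_colour(value):
        # Return the standard colour term, or None if the value is not recognised
        return COLOUR_LIBRARY.get(value.strip().lower())

    assert standardise_colour("Blau") == "blue"
    assert standardise_colour("BLEU") == "blue"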
In addition, the company has opened up the Metabase so that
customers and partners can more easily integrate data quality and
profiling processes into their applications and in order to provide
a foundation for interactive report building. Through this API, for
example, customers can easily incorporate metadata from Trillium,
such as profiling results or business rules, into broader metadata
repositories or other applications.
Finally, this release sees the introduction of time series
analysis. Previously, you could only take snapshots of quality
information but, with time series support, trending becomes
possible so that you can monitor data quality and profiling
statistics over time.
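As a rough illustration of what trending adds over isolated
snapshots, the sketch below (with invented figures) keeps one
profiling statistic per run and reports how it moves between runs:

    from datetime import date

    # Invented profiling snapshots: completeness (% of non-null values) of one attribute per run
    snapshots = [
        (date(2008, 1, 1), 91.2),
        (date(2008, 2, 1), 89.5),
        (date(2008, 3, 1), 86.0),
    ]

    # A single snapshot only gives you the current figure; a series shows the trend
    for (_, earlier), (when, later) in zip(snapshots, snapshots[1:]):
        print(f"{when}: completeness changed by {later - earlier:+.1f} percentage points")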
Anyway, so much for the major new features in version 11 (there are
a number of smaller ones as well). Now I want to focus more on
strategy.
There are two types of data quality vendor: pure plays (like
Trillium) and those integrated with ETL tools.
As far as pure plays in the market are concerned, most of these
(apart from Trillium) are smaller companies that specialise in a
particular area such as real-time data quality (embedding data
quality into call centre applications) or product data quality. In
contrast, Trillium is recognised as an enterprise-wide
solution, providing the tools and content to support data quality
needs across a wide range of business applications, data domains,
and implementation types. However, there is more to it than that:
by choosing the smaller suppliers of point solutions (whether for
price or because they offer deeper capabilities in a specific area)
you will end up with multiple data quality tools when it might be
better to have a single solution that did everything, even if, in
some circumstances, it wasn't quite the best thing since sliced
bread.
Now, this may seem like the traditional integrated solution
versus best-of-breed argument. Some people like one, some people
like the other. However, in the case of data quality it is not
quite as simple as that because using a data quality platform such
as Trillium's allows you to reuse business rules across both
real-time and batch processes, and across different application
environments. Further, if we consider the implementation of data
governance then one of the precepts involved would be the adoption
of common data quality standards across the organisation, which is
best facilitated by using a common platform and reusable rules.
For example, consider the
respective requirements for data quality in the data warehouse
versus real-time data quality in the call centre. Separate tools
for each of these instances would almost certainly lead to
different standards, duplicate values and other inconsistencies
between the transactional application and the data warehouse. This,
in turn, will perpetuate the misalignment and lack of understanding
that is so common across different business functions, and between
the people making strategic decisions and those actually executing
day-to-day processes. The remedy for such a mismatch is to look for
a data quality platform that extends across the enterprise both for
different types of applications and different types of
implementation.
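To illustrate the reuse point in the simplest possible terms, here
is a sketch (not Trillium's rule syntax) of one business rule
defined once and invoked both by a batch routine over a warehouse
extract and by a call-centre application as a record is captured:

    import re

    # A single, shared business rule, defined once
    def postcode_is_valid(record):
        # Hypothetical rule: a UK-style postcode must be present and roughly well-formed
        postcode = (record.get("postcode") or "").strip().upper()
        return bool(re.fullmatch(r"[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}", postcode))

    # Batch context: screen a warehouse extract and report the failures
    def batch_check(records):
        return [r for r in records if not postcode_is_valid(r)]

    # Real-time context: the same rule applied as a record is keyed into a call-centre application
    def on_record_entered(record):
        if not postcode_is_valid(record):
            raise ValueError("Postcode fails the shared data quality rule")
        return record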
If you accept the idea that the most desirable approach will be
to have a single data quality platform you might then ask whether
this should be a part of a larger data integration platform or
whether your needs will be better served by a data quality
specialist such as Trillium.
Specifically, of course, Trillium aims (and claims) to provide
more functionality than its competitors, especially those coming
from the ETL space. Hence the introduction of phrase analysis, the
range and extent of Trillium's interfaces (it aims to integrate with
anything), its new API, its Unicode support, its TS Insight product
for trending and reporting, and so on.
However, leaving aside these technical considerations, there is a
clear argument for investing in an integrated ETL/data quality
environment: one vendor, one tool set and so on. Equally, there is
a clear argument against investing in this way, because many data
quality requirements have nothing to do with data integration or
ETL. If you want to embed data matching, say, in a call centre
application then this has nothing to do with ETL, so why use an
ETL-based tool rather than a pure data quality offering?
To take this further: the fundamental purpose of IT is to
provide information to the business in a suitable format and in a
timely manner; and data quality is fundamental to realising the
value of that information. However, most large organisations have a
wide variety of technologies in production and have very rapidly
evolving business requirements. Because of this, data quality
processes and capabilities must be deployed in a wide variety of
contexts: batch and real-time, at the point of data capture and the
point of data extraction, in data migrations and in on-going
database maintenance operations. A data quality platform that can
be directly integrated within these various contexts is likely to
be more flexible and scalable (quicker to deploy and cheaper to
operate) than data quality components that are wrapped within a
specific environment such as a data integration suite. If you buy
into this argument then Trillium must be a leading contender for
any such platform.