Rules for data warehousing and data warehouse appliances

Written By: and Jameel Abdul
Content Copyright © 2006 Bloor. All Rights Reserved.

One of the major discussion points at Bloor Research’s recent
conference on “Data Warehousing: the rise of the
appliance” was the set of rules (though they might
equally be regarded as reference points rather than rules) that
might apply to data warehouse appliances as opposed to enterprise
data warehouses.

I presented an initial set of rules based on my own research and
these have been subsequently modified in light of the comments made
by people at the conference. These are detailed below and, as will
be seen, they are broken down into three categories: generic rules
for appliances, specific rules for data warehouse appliances and
specific rules for enterprise data warehouses. These rules exclude
general considerations that are applicable to all sorts of
offerings such as security, integration with third party vendors,
support for open standards, encryption, load and unload speeds, and
so on. The list of rules, with comments pertinent to data warehouse
appliances, is as follows:

Appliance rules:

Rule 0: An appliance may be a
multi-purpose device but exists within a limited
context—currently, data warehouse appliances may be good for
fast table scans, complex analytics and as aggregation engines, for
example—thus they certainly do more than one thing; but they
are only just moving into the enterprise data warehouse space.
Context is a matter of perspective.

Rule 1: You plug it in and it
goes—how quickly it goes may be variable. For example,
Netezza set up a system at our conference in just 15 minutes. On
the other hand, IBM reckons that it takes around 6 hours to set up
a BCU (Balanced Configuration Unit). Again, the extent to which you
consider either of these figures as “plug it in and it goes” is a
matter of viewpoint. Note that you may still require a special power supply
to make it go.

Rule 2: It is simple to implement,
administer and maintain—this is a no-brainer; administration
here refers to the appliance rather than the database.

Rule 3: No (minimal) tuning is
required—ditto. It is worth noting that no tuning is
difficult to achieve unless the context of your product is very
limited. For example, you can have no indexes and no aggregates but
such things as prioritisation and scheduling arguably involve
tuning. Note that the sort of autonomics provided by the likes of
IBM and Oracle at least makes index tuning simple.

Rule 4: It is data centre
friendly—less footprint, lower power requirements and reduced
cooling needs are all increasingly important and appliances tend in
this direction. We all want more for less.

Data Warehouse Appliance (DWA) rules:

Rule 5: In a DWA the hardware and
software have been designed to optimise each other—this is
the ideal position if you want to get maximum
performance—some vendors only optimise the hardware or
software and not both. The downside (which I do not consider
particularly significant) of optimising both is that you don’t have
a choice of hardware platform: do you care?

Rule 6: A DWA attempts to minimise all
potential system bottlenecks—in theory, any system can have
an I/O, CPU, memory or interconnect bottleneck though I/O is by far
the most common. Different vendors in the market use different
approaches to overcome their point(s) of weakness, which may impact
on their performance in different environments. This has important
implications for both live running and proofs of concept, which I
discuss further in my (forthcoming) article “Data warehouse
appliances: designing a proof of concept”.

Rule 7: A DWA is easily
upgradeable—systems should be easily upgradeable at the
component, disk and software levels—the need to replace
systems should be absolutely minimised. Note that this is less of
an issue now than it used to be.

Rule 8: A DWA provides high
availability—there should be no single point of failure:
mirrored disks, dual interconnects, failover and so forth should
all be implemented. Note that if you are building your own solution
based on a software-only appliance then you should not attempt to
cut any corners here.

Enterprise Data Warehouse (EDW) rules (providing
functionality beyond a DWA):

Rule 9: An EDW supports a mixed query
workload—more and more users are accessing data warehouses
with a wider and wider range of queries (and query types). DWA
suppliers are starting to develop capabilities in this area but
most such solutions are limited in their capability today, though
some more than others. Netezza, for example, has a number of
facilities in this area, such as short query bias (which DATAllegro
also offers), scheduling, prioritisation, guaranteed resource
allocation and so on.

Rule 10: An EDW is scalable both in
capacity and for users, with maximised concurrency. Scalability for
users is a significant issue for appliance vendors at
present—typically, user scalability for a DWA is measured in
hundreds at best, rather than thousands. Capacity is less of an
issue: Netezza has offerings up to 100TB while
DATAllegro can grow significantly larger than this.

Rule 11: An EDW supports real-time
data loading and operational and actionable (process aware)
BI—this is not something that appliance vendors are much
involved with right now, though this is, at least in part, about
partnerships.

Rule 12: An EDW handles unstructured
(text and XML) data as well as structured data—none of the
appliance vendors can do this yet. Both this rule and rule 11 will
become increasingly important over the next five years, in my
opinion.

Looking at the various offerings in the market in terms of this
reference model is instructive. It is clear that the
appliance vendors are working their way down this list while traditional
suppliers already have the capabilities in rules 9 to 12 (some more than
others perhaps) and are working on introducing the earlier rules. Sybase,
for example, which presented at our conference, discussed Sybase IQ
as being appliance-like (because of its performance, low storage
requirements and so on) while IBM is going down the same path (in a
different way) with the introduction of the BCU.

Having said that, it is important to appreciate that what counts as
an EDW is in the eye of the beholder. I certainly know of users
that claim to have a Netezza EDW: they don’t require the
functionality of rules 11 or 12, and Netezza is scalable enough and
has sufficient mixed query capability for their needs. This
is potentially true of other appliance vendors also.

In principle, one could rate all vendors against these rules,
add in the generic considerations not discussed, apply relevant
weighting factors and come up with a league table of results. Maybe
I’ll get around to doing that in due course.
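
As a rough illustration of what such a league table might involve, the
sketch below computes a weighted average score per vendor and sorts the
results. Everything in it is an assumption made for illustration: the
vendor names, the rule keys, the 1 to 5 scoring scale and the weights are
all hypothetical placeholders, not ratings of any real product.

```python
def league_table(scores, weights):
    """Rank vendors by their weighted average score across the rules.

    scores  -- {vendor: {rule: score}}; a missing rule counts as 0
    weights -- {rule: weight}; a larger weight means the rule matters more
    """
    total_weight = sum(weights.values())
    table = []
    for vendor, rule_scores in scores.items():
        weighted_sum = sum(
            weight * rule_scores.get(rule, 0)
            for rule, weight in weights.items()
        )
        table.append((vendor, weighted_sum / total_weight))
    # Highest weighted average first.
    return sorted(table, key=lambda row: row[1], reverse=True)


# Hypothetical weights: this imaginary buyer cares twice as much about
# the EDW rules as about ease of set-up.
weights = {
    "rule_1_plug_and_go": 1.0,
    "rule_9_mixed_workload": 2.0,
    "rule_10_scalability": 2.0,
}

# Hypothetical scores on an assumed 1 (poor) to 5 (excellent) scale.
scores = {
    "Vendor A": {"rule_1_plug_and_go": 5, "rule_9_mixed_workload": 2,
                 "rule_10_scalability": 3},
    "Vendor B": {"rule_1_plug_and_go": 2, "rule_9_mixed_workload": 5,
                 "rule_10_scalability": 4},
}

ranking = league_table(scores, weights)
for vendor, score in ranking:
    print(f"{vendor}: {score:.2f}")
```

The choice of weights matters as much as the scores themselves: a buyer
who only needs fast set-up would weight rule 1 heavily and could see the
order reverse, which is another way of saying that the result is in the
eye of the beholder.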