Data warehouse appliances: designing a proof of concept
During the course of Bloor Research’s recent conference on
“Data warehousing: the rise of the appliance”, a full
report on which can be found here, one of the discussion points was
the “rules” for a data warehouse appliance, which provide
reference points for the sort of features one would like to see in
such a device. While full details of these rules are included in the
event report, I would like to discuss here Rule 6, “a DWA
attempts to minimise all potential system bottlenecks”,
as this did not receive much attention during our discussions.
Any database may be I/O bound, CPU bound, memory bound or
interconnect bound. While there will always be some limiting factor
on performance, notably the raw speed of disk access, appliances
should attempt to push each of these boundaries as far as possible.
However, different vendors take different approaches. For example,
Netezza aims to get as close as possible to raw disk speed and to
minimise the amount of data read from disk; DATAllegro takes a
similar approach but relies heavily on partitions (which may be
replicated) and a very fast interconnect; indeed, it claims to be
CPU bound rather than I/O bound. Conversely, IBM makes extensive use
of bufferpools (caching) and partitioning to minimise reliance on
disk, while Kognitio relies heavily on memory.
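To make this taxonomy concrete during a PoC, it can help to watch which resource actually saturates while a test query runs. The following is a minimal sketch in Python using the psutil library; it assumes you are able to run it on the appliance’s nodes (which not every vendor will permit), and the interpretation thresholds are a matter of judgment rather than anything the vendors publish.

```python
# Minimal sketch: sample host CPU utilisation and disk read throughput
# while a long-running query executes elsewhere. High CPU with low disk
# reads suggests a CPU-bound workload, and vice versa. Duration and
# interval values are illustrative assumptions.
import psutil

def sample_bottleneck(duration_s=60, interval_s=1.0):
    samples = []
    last_io = psutil.disk_io_counters()
    for _ in range(int(duration_s / interval_s)):
        cpu = psutil.cpu_percent(interval=interval_s)  # blocks for interval_s
        io = psutil.disk_io_counters()
        read_mb_s = (io.read_bytes - last_io.read_bytes) / interval_s / 1e6
        last_io = io
        samples.append((cpu, read_mb_s))
    avg_cpu = sum(c for c, _ in samples) / len(samples)
    avg_read = sum(r for _, r in samples) / len(samples)
    print(f"avg CPU: {avg_cpu:.0f}%  avg disk read: {avg_read:.0f} MB/s")

if __name__ == "__main__":
    sample_bottleneck()
```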
Now consider the significance of this for proofs of concept (more
or less all data warehouse appliances are nowadays installed only
after a proof of concept): if you ask each vendor to demonstrate a
single query then the vendors that focus on getting data off disk as
fast as possible (a group that also includes Hewlett-Packard) will
always win, especially if we are talking about complex queries run
against large datasets, which are typically the ones that take
forever to run today or cannot run at all. And this advantage will
clearly persist if you test a series of such queries run serially.
This may well be your issue, in which case such an approach is
fine. However, in many environments the issue will not be running
individual queries quickly but running a number of queries in
parallel, as fast as possible, over a prolonged period. This is
where IBM’s Balanced Configuration Unit (BCU) has a chance to come
into its own: if these queries make substantial reuse of data that
has already been accessed then IBM can serve that data directly from
its bufferpools and thereby dramatically improve performance.
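To illustrate the difference between the two test designs, here is a minimal sketch of a PoC workload driver that times the same query set run serially and then across several concurrent streams. The DSN, query files and the choice of pyodbc are placeholder assumptions; any DB-API driver for the appliance under test would do, and the queries are assumed to be SELECTs.

```python
# Minimal sketch of a PoC workload driver: times one query set run
# one-at-a-time, then repeated across N concurrent streams. DSN and
# query files are hypothetical placeholders for your own workload.
import time
from concurrent.futures import ThreadPoolExecutor

import pyodbc  # assumed: an ODBC driver exists for the appliance under test

DSN = "DSN=dwa_poc"  # hypothetical data source name
QUERIES = [open(f).read() for f in ("q1.sql", "q2.sql", "q3.sql")]

def run_query(sql):
    # One connection per stream, as separate PoC client sessions would use.
    with pyodbc.connect(DSN) as conn:
        conn.cursor().execute(sql).fetchall()

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")

# Serial: favours engines optimised for raw single-query scan speed.
timed("serial", lambda: [run_query(q) for q in QUERIES])

# Concurrent: repeats the set across 8 streams for a sustained run,
# which rewards engines that can reuse cached or bufferpooled data.
def concurrent_run(streams=8, repeats=5):
    with ThreadPoolExecutor(max_workers=streams) as pool:
        list(pool.map(run_query, QUERIES * repeats))

timed("concurrent", concurrent_run)
```

Run both shapes against each candidate appliance: the point of the exercise is that the ranking of vendors can reverse between the serial and the concurrent test.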
The bottom line is that it is important to design a proof of
concept that reflects your operating environment. Do not allow
individual vendors to persuade you that one approach to the PoC is
better than another: different suppliers’ products lend themselves
to particular tests, and each has a vested interest in adopting the
methodology most likely to be sympathetic to its cause.