Yellowbrick Data
Last updated: July 1, 2020

What is it?
Yellowbrick Data Warehouse is a massively parallel data warehouse that is available on-premises as an appliance, or as a multi-cloud (Amazon, Azure and Google) Cloud Data Warehouse that provides the Yellowbrick environment as a hybrid-cloud managed service. The latter is the company’s primary focus: it combines the benefits of managed cloud services, including disaster recovery, with the performance gains offered by the hardware architecture. The warehouse can also be implemented on a private cloud if required.
The company’s products are targeted at traditional enterprise data warehouses with, as of January 2020, a maximum capacity of 3.5PB. Since IBM began withdrawing Netezza support in June 2019, Yellowbrick has also been targeting Netezza replacements, not least because Yellowbrick, like Netezza, is based on PostgreSQL. The company has also added a library of functions specifically tailored to provide Netezza compatibility.
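Because of that PostgreSQL heritage, standard PostgreSQL client tooling should work against a Yellowbrick database. The sketch below is illustrative only: the endpoint, credentials and table are placeholders, and we have assumed the widely used psycopg2 driver rather than anything Yellowbrick-specific.

```python
# Minimal sketch: querying a Yellowbrick database with a standard
# PostgreSQL driver (psycopg2). Host, credentials and table names
# are hypothetical placeholders, not real Yellowbrick defaults.
import psycopg2

conn = psycopg2.connect(
    host="yb-warehouse.example.com",  # hypothetical endpoint
    port=5432,                        # PostgreSQL wire-protocol port
    dbname="analytics",
    user="report_user",
    password="********",
)

with conn, conn.cursor() as cur:
    # Ordinary PostgreSQL-dialect SQL; no vendor-specific syntax required.
    cur.execute(
        "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY 2 DESC"
    )
    for region, total in cur.fetchall():
        print(region, total)

conn.close()
```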
Customer Quotes
“In our testing of Yellowbrick, we compared the performance of a six-rack (Netezza) TwinFin to the six-U (30cm high) baseline Yellowbrick system. And performance was anywhere from 3 to 50 to 100 times faster.”
TEOCO
What does it do?
The fundamental principle behind Yellowbrick’s thinking is that traditional data warehousing architecture built on spinning disks is simply old-fashioned. Its view is that even more modern, in-memory systems with flash storage simply move the bottleneck from disk to memory. In these architectures, incoming data goes to memory, which in turn leverages, or tries to leverage, the CPU cache. Yellowbrick argues that this is the wrong way around: it is better for data to be processed directly in CPU cache (L1, L2 and L3, starting with L3), with the CPU cache and memory-based capabilities interacting, as illustrated in Figure 1.
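To see why layout relative to the CPU cache matters, consider the small, illustrative timing sketch below. It is not Yellowbrick code; it simply demonstrates the underlying principle that scanning a field stored contiguously (as in a column store) keeps the working set in cache, while scanning the same field interleaved with others (as in a row store) wastes most of every cache line fetched.

```python
# Illustrative only: why cache-friendly (columnar) layouts scan faster.
import time
import numpy as np

n_rows, n_cols = 2_000_000, 20
table = np.random.rand(n_rows, n_cols)

# Row-store-like access: the target field is strided across memory,
# so each read pulls in a cache line that is mostly discarded.
row_major = np.ascontiguousarray(table)
t0 = time.perf_counter()
row_major[:, 3].sum()
t_row = time.perf_counter() - t0

# Column-store-like access: the same field is contiguous, so the CPU
# streams it through cache with hardware prefetching.
col_major = np.asfortranarray(table)
t0 = time.perf_counter()
col_major[:, 3].sum()
t_col = time.perf_counter() - t0

print(f"strided (row-store-like) scan:    {t_row:.4f}s")
print(f"contiguous (column-store) scan:   {t_col:.4f}s")
```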

The company also argues that the current trend towards separating compute from storage is something of an illusion. Certainly, there are environments with little seasonality, where workloads are more or less consistent, in which it brings no advantage. More specifically, Yellowbrick concedes that, yes, it may be appropriate for some smaller environments and, yes, it may have cost advantages when you can scale compute power up and down. However, its view is that the interconnect is typically too slow, and that the time needed to warm up caches means that performance is impaired. There is some truth in this argument, and the desirability of separating compute from storage is not as clear-cut as some vendors might have you believe. Indeed, even some of the suppliers that offer this approach do so only as an option.

Figure 2 – Notable features of Yellowbrick Data
More generally, the broader architecture of Yellowbrick is illustrated in Figure 2, with notable features including parallel loaders, a fast row store alongside columnar storage, a cost-based optimiser, workload management, a system management console and a customised (vectorised) SQL processor. Note that the latter replaces the standard PostgreSQL processor, which Yellowbrick does not believe is fast enough. In the same context, it is worth noting that, as a product built on top of PostgreSQL, Yellowbrick should be able to leverage the PostgreSQL extensions supporting geospatial and time-series data, which will be important in Internet of Things environments. Not shown in this diagram is the fact that Yellowbrick offers asynchronous replication across Yellowbrick instances, regardless of whether these are on-premises or in the cloud.
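Since loading is exposed through the PostgreSQL interface, a bulk load can be sketched with the standard COPY command, as below. The file, table and connection details are placeholders, and this again assumes the psycopg2 driver; a production load would normally go through the parallel loaders shown in Figure 2 rather than a single COPY stream.

```python
# Hedged sketch: bulk-loading CSV data through the standard PostgreSQL
# COPY path. Connection details, file and table names are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="yb-warehouse.example.com",
    dbname="analytics",
    user="loader",
    password="********",
)

with conn, conn.cursor() as cur, open("sales_2020.csv") as f:
    # copy_expert streams the file to the server via COPY ... FROM STDIN.
    cur.copy_expert(
        "COPY sales (sale_date, region, revenue) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )

conn.close()
print("load complete")
```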
Why should you care?
Yellowbrick’s cache-first architecture is designed to remove the bottlenecks from which both disk-based and in-memory warehouses suffer, and customer experience, such as TEOCO’s quoted above, suggests the performance gains over previous-generation appliances can be substantial. For organisations facing the withdrawal of Netezza support, the PostgreSQL heritage that Yellowbrick shares with Netezza, together with its library of Netezza compatibility functions, should make it a natural migration target.
Moreover, because the Cloud Data Warehouse option delivers the same environment as a hybrid-cloud managed service, complete with disaster recovery and asynchronous replication across on-premises and cloud instances, you can get the performance of the appliance architecture without having to operate it yourself.
The Bottom Line
Yellowbrick Data Warehouse is a high-performance, PostgreSQL-based warehouse available as an on-premises appliance, as a hybrid-cloud managed service, or on a private cloud. Whether you are replacing an ageing Netezza estate or building a new enterprise warehouse, it should be on your radar.