The Algebra of Data - A necessary foundation for a resilient data economy
Content Copyright © 2024 Bloor. All Rights Reserved.
Also posted on: Bloor blogs
This is a review of “The Algebra of Data, A Foundation for the Data Economy”, by Professor Gary Sherman & Robin Bloor, together with some thoughts on a previous approach to such things (Codd’s Relational Model) and the need for a mathematical basis for data storage, retrieval and manipulation.
First, for full disclosure, Robin Bloor is the founder and (currently) an active participant in Bloor Research – but one of the things I like about Bloor is that it is quite happy for its analysts to disagree (with appropriate research backup) so I am under no pressure to market the book particularly. It is available on Amazon. One review on Amazon includes “I’m thinking that I’m not smart enough to appreciate this book” and I know what he means, but I feel much the same about C. J. Date’s “An Introduction to Database Systems” [here]. In any case, this is a bit unfair to Sherman and Bloor’s book, which is written in an entertaining and accessible style – although some familiarity with set theory would help. Suffice it to say that anyone calling themselves a “data scientist” (which I don’t, by the way, although I was a DBA and data analyst for many years) should have no trouble fully comprehending both books, and any IT professional interested in data should get a lot out of them. For completeness, there is a brief summary of Robin Bloor’s view of “data science” here.
Secondly, some background. Why do we need an Algebra of Data? Don’t we already have one, and isn’t SQL good enough for all practical purposes anyway? Well, in my opinion (we’ll see if Sherman and Bloor agree later), we need a mathematically sound meta-model for data so we can process it algorithmically with absolutely predictable results. We want to be able, for example, to optimise a query automatically, for performance, with complete confidence that optimisation will not affect the results of the query. If the mathematics isn’t right, we don’t have this confidence. Similarly, if the mathematics isn’t right, two formulations of a query that ought to be logically equivalent may return different answers. If this doesn’t happen very often, perhaps we can live with it in a manual system (where people can recognise and reject “wrong” answers) but in an automated system at scale, processing millions of queries a day, unlikely events can happen sufficiently often to cause real problems (and we can’t even predict the likelihood of these problems easily).
The problems arise particularly (but not exclusively) with the way that implementations of Codd’s Relational Model of data treat nulls. There are lots of different kinds of null – unknown (as yet) values, things that simply don’t apply and can’t have a value and so on; and treating them all as the same thing – perhaps stored as a “special value” of all nines or something – leads to significant problems. Chris Date talks about “The Problem of Missing Information” at OUG Harmony 2012 in Finland [youtube.com], including an example of how an optimised and unoptimised query can yield different results. It is worth noting that Ted Codd invented the relational model in 1969 and only added nulls in 1979, so the relational model worked for a decade without them!
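As a small illustration of why this matters (this is not Date’s example from the talk, just a minimal sketch using Python’s built-in sqlite3 module and a hypothetical `part` table of my own invention), consider a filter that would be a tautology in ordinary two-valued logic. An optimiser reasoning in two-valued logic might feel entitled to drop it, yet with a null in the column it silently removes a row:

```python
import sqlite3

def count_rows(where_clause):
    """Count rows of a tiny, hypothetical 'part' table that satisfy a WHERE clause."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE part (id INTEGER, city TEXT)")
    conn.executemany(
        "INSERT INTO part VALUES (?, ?)",
        [(1, "London"), (2, "Paris"), (3, None)],  # part 3 has a null city
    )
    (n,) = conn.execute(
        f"SELECT COUNT(*) FROM part WHERE {where_clause}"
    ).fetchone()
    conn.close()
    return n

# No real restriction: every row qualifies.
all_rows = count_rows("1 = 1")

# "p OR NOT p" is always true in two-valued logic, but for the null city
# both comparisons evaluate to unknown, so that row is filtered out.
tautology_rows = count_rows("city = 'Paris' OR city <> 'Paris'")

print(all_rows, tautology_rows)
```

The two counts differ, so a rewrite that looks harmless on paper changes the answer as soon as a null is present.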
What I have just said implies that there are outstanding issues with the mathematical basis of Codd’s Relational Model of data, at least when applied to real-world databases, where sometimes data is simply missing or not applicable. I’ll leave that discussion to the mathematicians, but I am quite clear that there are even more issues with the standard SQL data language, which isn’t even a particularly good implementation of the Relational Model allegedly behind it. Nulls are optional (there are various workarounds for this) but if you do allow them, SQL has only one kind of null to represent the many different kinds of null, and sometimes they can be treated and processed as values – “9999” for example. There are even further problems with physical SQL databases: the Relational Model is logical, not physical, yet many databases match logical tables to physical tables on disk (the InterSystems Iris data platform is a notable, and welcome, exception [see here]); and the original requirement for atomic relational attributes is being increasingly compromised (you can store whole Excel spreadsheets, or even whole databases, as database fields in a so-called relational database these days). There is a lot of data in relational databases now, and the lack of a firm mathematical foundation behind such databases implies that processing their data will be more complicated, slower, and less resilient than it needs to be.
So, we come to the book under review. “The Algebra of Data, A Foundation for the Data Economy” (ISBN 978-0-9789791-6-4, published by Bloor Group Press, Austin Texas, 2015) proposes a new, mathematically correct, data algebra based on Zermelo–Fraenkel set theory with the axiom of choice (ZFC), which is implemented in a database from the Algebraix Data Corporation and the Algebraixlib Python library on GitHub – both products are largely outside the scope of this review.
As I’ve said, this book requires some acquaintance with, and preferably knowledge of, set theory. Nevertheless, the authors have made a valiant (and, I think, successful, on the whole) attempt to make it entertaining and accessible. You don’t even have to read all of it – chapters 1, 2, 5, and 9 (there are only 9) are aimed at the more general reader (chapters 1 and 2 cover “just enough” set theory); the other chapters delve more into the mathematics. The book seems to agree with the justification for a mathematically sound data algebra I gave at the start of this review. The main issue, for me, having been at school in the mid-20th century, is the terminology – and the book has a useful 1.5-page Appendix on symbols and notation. Nevertheless, there are, unavoidably, new symbols to get used to and typographic conventions (bold and italic matter) to recognise. Trust me, it is worth persevering!
Chapter 5 summarises what is meant by data. It says that data inside a computer is physically represented as a couplet – one part representing value and the other part representing usage. It also says that a couplet at the logical level can represent data in terms that a human can understand and that the couplet is the fundamental unit of data in the algebra of data. Defining what we mean by data and how it relates to human understanding of data is essential to any useful algebra of data.
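To make the couplet idea a little more concrete, here is a minimal sketch in plain Python. It is my own illustration, not the book’s notation and not the Algebraixlib API: the `Couplet` class, its field names `usage` and `value`, and the helper `values_for` are all hypothetical.

```python
from typing import NamedTuple

class Couplet(NamedTuple):
    """Hypothetical stand-in for a couplet: a usage paired with a value."""
    usage: str     # how the value is to be interpreted, e.g. "born"
    value: object  # the value itself, e.g. 1815

# In this sketch a record is simply a set of couplets.
record = frozenset({
    Couplet("name", "Ada"),
    Couplet("born", 1815),
})

def values_for(rec, usage):
    """Return the set of values the record holds for a given usage."""
    return {c.value for c in rec if c.usage == usage}

print(values_for(record, "born"))
print(values_for(record, "email"))
```

In this particular sketch, information that is missing is just a couplet that is not in the set, so asking for it yields an empty set rather than a special null marker.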
Chapters 6 to 8 go into the data algebra in more detail. In particular, Chapter 8 looks at how the data algebra can provide equivalent capabilities to SQL SELECT and JOIN. In essence, I think, the authors show that anything the relational model and SQL can do, the algebra of data can do better.
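For readers who think better in code, the flavour of expressing SELECT and JOIN as set operations can be sketched as follows. This is an assumption-laden toy of my own, not the book’s definitions and not the Algebraixlib API: records are frozensets of (usage, value) pairs, a table is a set of records, and the function names `select` and `natural_join` are hypothetical.

```python
def select(table, usage, value):
    """Keep the records that contain the couplet (usage, value)."""
    return {rec for rec in table if (usage, value) in rec}

def natural_join(left, right):
    """Union every pair of records that agree on all the usages they share."""
    joined = set()
    for l_rec in left:
        for r_rec in right:
            merged = l_rec | r_rec
            usages = [u for u, _ in merged]
            # If a usage appears twice, the two records disagreed on its value.
            if len(usages) == len(set(usages)):
                joined.add(merged)
    return joined

# Hypothetical sample data.
employees = {
    frozenset({("name", "Ada"), ("dept", "D1")}),
    frozenset({("name", "Bob"), ("dept", "D2")}),
}
departments = {
    frozenset({("dept", "D1"), ("site", "London")}),
}

in_d2 = select(employees, "dept", "D2")
located = natural_join(employees, departments)
print(in_d2)
print(located)
```

Because everything here is an ordinary set, the results can be reasoned about with ordinary set theory, which is the general point the authors are making, although their actual constructions are considerably more careful than this.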
Chapter 9 describes a practical data platform based on this Data Algebra and introduces the only company (so far) that is working on it: the Algebraix Data Corporation (already mentioned). It also talks about possible broader applications in the future. The trouble with predicting the future is that it is fairly easy to say what could, or should, happen, but much harder to say when (or even if) it will happen. That said, this is an interesting chapter of plausible possibilities.
One issue that I had while reading this book was wondering how and why Codd’s relational model has been so successful. It is not just that big companies such as IBM and Microsoft have adopted it uncritically, as IBM (in particular) has put considerable intellectual effort into it. It has worked “well enough”, until now, at least. The beginning of Chapter 7 in the present book is a useful summary of how we got to where we are now; and the end of the chapter brings us up to date, talking about graph databases. I think that a detailed consideration of what is good, if anything, about relational models is (quite reasonably) outside the scope of this book; although perhaps the practical consequences of the shortcomings could be highlighted more (as Date does in his YouTube lecture, op. cit.). That said, the implication throughout is that the shortcomings of current approaches will only really become serious as mathematically-based exploitation of “big data” really takes off – so, perhaps not quite yet. Even so, I rather feel the need for another book, on the status quo – where we are now, with more or less compromised commercial “relational databases” – and its shortcomings, and on how we will journey to where we want to be – with a data platform based on a ZFC foundation – without going out of business on the way (and without requiring developers to be trained mathematicians). Starting from scratch with a platform based on the algebra of data would be comparatively easy, I think, but there might be issues with converting existing data stores. I think this book does cover the status quo, if rather generally, but not the practical journey to a better world.
This book, for me, rather confirms my view that standard SQL, for all its faults, isn’t going away but that its problems will become increasingly important as automation (including what is rather misleadingly called Artificial Intelligence – currently little more than “advanced computational analytics”) increases in importance. We need a formal, mathematically respectable, data algebra to support automation of data processing at scale – but just one company is not going to displace the Oracles and IBMs of this world. On the other hand, I rather hope Sherman and Bloor are right: “the IT industry has been bereft of a mathematical foundation of data… But now the situation is changed and the industry can move forward”. It is just that, in the past, the IT industry has proven to be a lot better at disrupting other industries than it is at disrupting itself…