The ethics of data science - A commentary on Weapons of Math Destruction
I don’t usually write book reviews and I am not going to start now (full disclosure: I actually reviewed SF books many years ago, so technically I should say that I will not restart now). Anyway, if you are a data scientist, or work for a company developing software for data scientists, or you are a chief data officer, or you are in any way involved with commissioning data science projects, then you should read “Weapons of Math Destruction: how big data increases inequality and threatens democracy” by Cathy O’Neil. Cathy is a data scientist, former quant and ex-professor of mathematics, so she knows what she is talking about: the abuse and misuse of data science.
There are three fundamental issues: lack of statistical validity, a failure to update models, and the use (or rather misuse) of proxies. Let me discuss each of these.
Statistics is, in many ways, the biggest problem, because most business people don’t understand statistics. They don’t understand that your sample needs to be sufficiently large and representative that the results are both statistically significant and free of bias. There is a good example in the book: the use of “value added” metrics to monitor the performance of teachers. This is used widely in the United States, the UK and, no doubt, elsewhere. The basic principle is that you take the predicted grades for each pupil in a class at the beginning of the year and then measure those against the actual grades achieved at the end of the year. If the latter score is higher than the former, then the teacher gets praise or a raise or both; if the reverse is true then he or she doesn’t get a raise and may (in the US) get fired. The problem is that while this might be a valid approach for monitoring the teaching community as a whole – Cathy suggests 10,000 students – it is totally invalid when applied to a class of perhaps 30 pupils. As Cathy puts it: this is “a statistical farce”.
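To see the scale of the noise, here is a minimal sketch, not O’Neil’s actual calculation, with invented parameters throughout. It assumes a teacher with zero real effect and treats each pupil’s measured gain as pure noise; the spread of the resulting “value added” score shows how easily a class-sized sample can make an average teacher look brilliant or terrible:

```python
import random
import statistics

random.seed(42)

def mean_value_added(n_students, noise_sd=1.0):
    """Average (actual - predicted) grade gain for a group of students.

    Each student's measured gain here is pure noise: the 'true' teacher
    effect is zero, so any non-zero average is statistical accident.
    """
    return statistics.mean(random.gauss(0, noise_sd) for _ in range(n_students))

# Spread of the metric across 1,000 hypothetical teachers
for n in (30, 10_000):
    scores = [mean_value_added(n) for _ in range(1_000)]
    print(f"n={n:>6}: std dev of value-added score = {statistics.stdev(scores):.3f}")
```

With 30 pupils the score swings nearly twenty times more widely than with 10,000 (the spread shrinks with the square root of the sample size), so two identical teachers can land at opposite ends of the ranking on noise alone.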
The second major problem is a failure to update. Things change: people’s opinions change and so do their circumstances. If you encapsulate how people are likely to behave in a static model, then that model will become out of date very quickly. So you need to feed results back into the model, allowing your algorithm to see how things are changing so that your business can evolve at the same time. After all, if there is no feedback loop, how can you tell whether your model is accurate? You might, perhaps, argue that your algorithm has increased sales by 10% and that you’ll worry about it later if sales drop off. But meanwhile your competitor, who has implemented feedback, has not only increased sales but is continuing to do so. By the time he overtakes you it will be too late. Updating models isn’t just about accuracy, it’s also about business sense.
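As an illustration of what a feedback loop can look like in practice, here is a minimal sketch using scikit-learn’s incremental `partial_fit` API. The data, the drift function and the weekly cadence are all invented for the example; the point is simply that the model is rescored on fresh data (the feedback) and then updated, rather than left static:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Hypothetical setup: two features predicting whether a customer buys.
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])

def fresh_batch(drift):
    """Simulate a week of data whose underlying pattern shifts over time."""
    X = rng.normal(size=(200, 2))
    # The decision boundary rotates as `drift` grows: behaviour changes.
    y = (X[:, 0] * np.cos(drift) + X[:, 1] * np.sin(drift) > 0).astype(int)
    return X, y

for week in range(10):
    X, y = fresh_batch(drift=week * 0.15)
    if week > 0:
        # Score on new data BEFORE updating: this is the feedback signal.
        print(f"week {week}: accuracy on fresh data = {model.score(X, y):.2f}")
    model.partial_fit(X, y, classes=classes)  # incremental update keeps the model current
```

Run as written, the weekly accuracy stays high because the model keeps learning; stop calling `partial_fit` after the first week and you can watch the score decay as the behaviour the model was trained on drifts away.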
Finally, there can be an issue with proxies. A proxy is a substitute metric used when the metric you actually want to measure is not available. There are two problems with proxies. The first is that they are, at best, an approximation. The second is that they are seriously open to abuse. For example, it is illegal to discriminate against people on the basis of their race or religion, but there is no law against discrimination by postal code. So, if you happen to know that a particular postal code is dominated by a particular ethnic or religious group, then you can misuse this proxy in ways that are completely unethical. Proxies need to be watched very carefully.
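One crude but useful audit is to compare outcome rates grouped by the proxy with outcome rates grouped by the protected attribute. The records, field names and numbers below are entirely invented; this is just a sketch of the shape of the check:

```python
from collections import Counter

# Invented records: (postal_code, group, loan_approved)
records = [
    ("AB1", "group_x", False), ("AB1", "group_x", False),
    ("AB1", "group_x", False), ("AB1", "group_y", False),
    ("CD2", "group_y", True),  ("CD2", "group_y", True),
    ("CD2", "group_y", True),  ("CD2", "group_x", True),
]

def approval_rate_by(field):
    """Approval rate for each distinct value of the chosen field (0 or 1)."""
    totals, approved = Counter(), Counter()
    for record in records:
        key = record[field]
        totals[key] += 1
        approved[key] += record[2]   # True counts as 1
    return {key: approved[key] / totals[key] for key in totals}

print("by postal code:", approval_rate_by(0))
print("by group:      ", approval_rate_by(1))
```

Here the decision rule is written purely in terms of postal code, yet the approval rates differ three to one between the two groups: the “neutral” field is doing the discriminating.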
So the question is, how do you ensure that data science is used ethically? It is difficult to say that data scientists themselves should be responsible because, after all, their jobs are on the line. At the end of the day this is really a governance issue and I’d like to think that data governance vendors were starting to look at how to monitor the usage of, and basis for, the algorithms, machine learning and statistical models that are becoming more and more prevalent. I would also like to think that companies that provide model management capabilities were providing, or planning to provide, suitable monitoring capabilities. I say I’d like to think these things were in the pipeline, but I also think I am going to be disappointed. A lot of pressure is going to have to build up – the sooner the better as far as I am concerned – before any sort of appropriate governance is going to be put in place.
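Real governance tooling would of course be far richer than this, but as a sketch of the kind of monitoring check I have in mind, here is a trivial drift alert; all figures and thresholds are invented:

```python
import statistics

def drift_alert(baseline_scores, recent_scores, tolerance=2.0):
    """Flag a model whose recent accuracy falls well below its baseline.

    A crude governance check: alert when recent performance drops more
    than `tolerance` standard deviations below the baseline mean.
    """
    mean = statistics.mean(baseline_scores)
    sd = statistics.stdev(baseline_scores)
    return statistics.mean(recent_scores) < mean - tolerance * sd

# Hypothetical weekly accuracy figures for a deployed model
baseline = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90]
recent = [0.84, 0.83, 0.85]
print("review this model!" if drift_alert(baseline, recent) else "within tolerance")
```

Even something this simple, applied routinely across every deployed model, would catch the stale, unexamined algorithms that O’Neil is warning about.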