Governance and risk post CrowdStrike

There has been a near-frantic level of commentary about the CrowdStrike-induced outage of about 8.5 million Windows devices on 19th July. I have no intention of revisiting the technical details or indulging in some of the finger-pointing that ensued. However, if you still want a well-informed and, in my opinion, very clear and fair analysis of what happened from a technical perspective, then these two YouTube videos are a great place to start. They were made by Dave Plummer, a retired Microsoft engineer. They are relatively short, at 13 and 17 minutes respectively, and are easily understood by all but the most technology-averse individuals.

https://www.youtube.com/watch?v=wAzEJxOo1ts

https://www.youtube.com/watch?v=ZHrayP-Y71Q

But I think we need to shift the narrative away from technical arguments about what caused the problem, or even how to mitigate the impact of a similar issue in the future, and towards a broader understanding of required business outcomes, the critical business functions that drive the attainment of those outcomes, and the underpinning information technology that supports them.

Let’s be clear: I am not advocating that IT departments should place less emphasis on reducing the likelihood and impact of IT failures. Far from it. But a world in which technological change is enabling greater flexibility and agility, is bringing forth new business models at pace, has led to greater collaborative interdependencies, and fundamentally continues to change the way we all do business requires a fundamental reset in corporate governance in general, and in corporate risk management in particular.

In the CrowdStrike event, some businesses recovered more quickly than others. Anecdotal evidence suggests that some of those that recovered more quickly had already understood the critical importance of the business functions that relied on Windows devices, and had mitigations in place that either restricted the number of devices affected or enabled them to recover swiftly. Any mitigation brings its own costs and issues. It would be great to hear the views of companies that handled the CrowdStrike issue with the least disruption. How did their mitigations reduce the financial and reputational fallout? It would also be interesting to see the corporate risk registers of those companies that lost critical business transaction functionality for a significant amount of time.

But let’s face it, managing risk is usually a bit of an afterthought. Spending extra money on mitigating risk always feels a bit like those insurance policies we resent signing up to. There is now a greater focus on operational resilience, and legislation such as DORA, aimed at financial institutions, should bring greater rigour, perhaps driven by the threat of hefty fines of up to 2% of turnover for failures in compliance.

However, if we keep managing risk in the same ways as before, we run the risk of seeing large-scale outages over and over again. At Bloor we talk about the need for businesses to be mutable: in other words, to have the ability and resilience to work in an environment of constant change. Part of being mutable is the ability to manage change in business functions. We already have the technology and techniques to reduce our reliance on monolithic IT systems. Maybe we need to adopt a similar mindset towards business functions. Perhaps, as a start, functional heads should have objectives that are outcome-based, not output-based. If those outcomes relate to things like revenue, profit, employee attraction and retention, and compliance, then maybe they will see the resilience of their supporting systems, and of their supporting digital supply chains, as an essential enabler and focus their resources accordingly.

This is not about whether to have multi-cloud instead of a single cloud. It is not about driving tougher service level agreements with vendors. It is not about considering a mix of Windows, Apple or Linux devices. There will always be trade-offs in these decisions. But if businesses can’t identify and understand which functions are critical to them, and can’t engage in risk and mitigation discussions that both IT and business executives can understand, we are all going to be subject to increasing disruption to the services we now take for granted.