Change Control Doesn’t Work: When Regulated DevOps Goes Wrong

This week I’ve been reading through the recent judgment from the Swedish FSA on the Swedbank outage. If you’re unfamiliar with this story, Swedbank had a major outage in April 2022 that was caused by an unapproved change to their IT systems. It temporarily left nearly a million customers with incorrect balances, many of whom were unable to meet payments. 

After investigation, the regulator found that Swedbank had not followed its change management process and issued a SEK850M (~85M USD) fine. That’s a lot of money to you and me but probably didn’t impact their bottom line very much. Either way, I’m sure the whole episode will have been a big wake-up call for the people at the bank whose job it is to ensure adequate risk and change controls. So, what went wrong, and how could it have been avoided? 

How Did the Swedbank Incident Happen?

The judgment doesn’t describe the technical details behind the incident, but it does offer glimpses into how the regulator assessed what went wrong.

Even if you think $85M isn’t much of a fine — simply the cost of doing business — the full range of options open to the regulator included removing Swedbank’s banking license: “It is therefore not relevant to withdraw Swedbank’s authorization or issue the bank a warning. The sanction should instead be limited to a remark and an administrative fine.” Gulp.

Change Management Doesn’t Mitigate Risk

What really interests me about cases like this is that even when followed to the letter, the old ways of managing change with manual approvals and change meetings do not mitigate risk in today’s technology organizations. These processes don’t work because complying with them is no guarantee that changes are being made safely and securely. 

Tell me if you’ve heard this one before...

The regulator’s position is self-referential: you said you would do something to manage risk; it wasn’t done; therefore, you are in violation. But is change management the best way to manage IT risk?

What the UK FCA Says About Change

I’ve written previously on some fantastic research published by the Financial Conduct Authority in the UK. They took a data-driven approach to understanding the workings of change management processes, which uncovered some provocative findings:

“One of the key assurance controls firms used when implementing major changes was the Change Advisory Board (CAB). However, we found that CABs approved over 90% of the major changes they reviewed, and in some firms the CAB had not rejected a single change during 2019. This raises questions over the effectiveness of CABs as an assurance mechanism.”

Change as a control gate doesn’t work, but everyone does it. Why? To avoid $85M fines. In the UK and USA, these can be issued to individuals as well as organizations. So, if you have followed the process, at the very least, you are compliant and not liable for heavy financial penalties. It’s also about covering your back: “It’s not my fault. I ticked all the boxes.” But is the bank actually safe? Are the systems themselves secure?

Change management gathers documentation of process conformance, but it doesn’t reduce risk in the way that you’d think. It reduces the risk of undocumented changes, but risks in changes that are fully documented can sail through the approval process unnoticed. This is an important and quite shocking finding: adherence to traditional change management doesn’t work to manage the risk of changes.

Research Shows External Approvals Don’t Work

The science of DevOps backs this up. Here’s the unvarnished truth on external approvals and CABs, based on research by Dr. Nicole Forsgren, Jez Humble, and Gene Kim in their 2018 book, Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations.

“We found that external approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a change manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all.”

Worse than no change approval process at all. So, if you want to avoid fines, cover your back, AND reduce the likelihood of production incidents, what would you do? 

Change Is Not the Problem. It’s Unaddressed Risk

If change is not the problem, then what is?

What would work? Well, the FCA has some insights on this:

“Frequent releases and agile delivery can help firms to reduce the likelihood and impact of change related incidents:

Overall, we found that firms that deployed smaller, more frequent releases had higher change success rates than those with longer release cycles. Firms that made effective use of agile delivery methodologies were also less likely to experience a change incident.”

In short, paperwork doesn’t reduce risk; less risky changes reduce risk. I’m going out on a limb here, but if Swedbank had, in fact, followed its processes and still had the outage, I believe Finansinspektionen (the Swedish FSA) would still have issued a fine, but for insufficient risk management.

Storytime: Streams Feeding the Lake

We can think of software changes as streams feeding into our environments, which are the lakes. Change management puts a gate in the stream to control what flows into the lake, but it doesn’t monitor the lake itself.

[Image: Streams feeding the lake]

If it is possible to make a change to production without detection, then change management only protects against one source of risk. The only way to be sure you don’t have undocumented production changes is with runtime monitoring.
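To make that concrete, here is a minimal sketch of what such monitoring could look like: it compares checksums of what is actually deployed against a register of approved changes. The file paths, register format, and names are illustrative assumptions, not any particular product’s API.

```python
# Minimal sketch of runtime drift detection (illustrative, not a real product's API).
# Assumption: the change pipeline writes approved artifact checksums to approved_changes.json.
import hashlib
import json
from pathlib import Path

APPROVED_REGISTER = Path("approved_changes.json")        # e.g. ["ab12...", "cd34..."]
DEPLOYED_ARTIFACTS = [Path("/srv/app/config.yaml"),      # hypothetical production files
                      Path("/srv/app/release.tar.gz")]

def checksum(path: Path) -> str:
    """SHA-256 of a deployed file, so it can be matched against an approval record."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def find_undocumented_changes() -> list[str]:
    """Return deployed artifacts whose checksums have no matching approval."""
    approved = set(json.loads(APPROVED_REGISTER.read_text()))
    return [str(p) for p in DEPLOYED_ARTIFACTS if checksum(p) not in approved]

if __name__ == "__main__":
    drifted = find_undocumented_changes()
    if drifted:
        # In a real system this would raise an alert, not just print.
        print("Undocumented production changes detected:", drifted)
```

Run on a schedule against production, a check like this turns “has anyone changed something we don’t know about?” from a hope into an alert.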

For me, what is really interesting about this story is its echoes of the Knight Capital incident, so well documented by the SEC. In both cases, an incomplete understanding of how changes had been applied to production systems, caused by insufficient observability and traceability, prolonged and amplified the scale of the outages.

It also leaves an open question: how many similar changes have been made that didn’t cause an outage? Without monitoring, it is really hard to know.

If Change Management Doesn’t Work, Why Do We Do It?

It all goes back to software history. Traditionally, changes were rare, big, and risky: the annual upgrade or the monthly patch. Because these big batches of change were risky, companies introduced long testing and qualification processes, change management, service windows, and a large number of checklists to help mitigate the risks and test quality in.

Before we had modern practices of test automation, continuous delivery, DevSecOps, and rolling deployments with fast rollback, this was the only way. The trouble is that the financial services industry is packed full of legacy systems and outsourcing arrangements, where implementing these practices is technically challenging and uneconomic.

Maybe it is time we acknowledged that legacy software, outdated risk management practices, and outsourcing are major systemic risks in the financial sector.

The flip side is also true. Many next-generation systems in financial services are so dynamic and distributed that it is really hard to get a handle on the volume of changes occurring.

Risk Management That Works

The only way not to get burned is to avoid playing with fire. Checklists can help, but if you carry a lot of IT risk, the only way to really reduce it is to do the technical work to make changes less risky and move to smaller, more frequent changes. You can reduce the toil this involves by automating change controls and documentation, and by introducing monitoring and alerting to detect unauthorized changes.
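As a sketch of what automating the control itself might look like, the snippet below fails a deployment pipeline step unless the release references an approved change record. The endpoint, environment variable, and response fields are assumptions for illustration, not a specific tool’s API.

```python
# Minimal sketch of an automated change control in a deployment pipeline.
# Assumption: a change-record service is reachable at CHANGE_API and returns JSON
# with a "status" field; the pipeline exports the change reference as CHANGE_ID.
import json
import os
import sys
import urllib.request

CHANGE_API = "https://change-records.example.com/api/changes/"   # hypothetical endpoint

def change_is_approved(change_id: str) -> bool:
    """True if the referenced change record exists and is marked approved."""
    with urllib.request.urlopen(CHANGE_API + change_id) as resp:
        record = json.load(resp)
    return record.get("status") == "approved"

if __name__ == "__main__":
    change_id = os.environ.get("CHANGE_ID", "")
    if not change_id or not change_is_approved(change_id):
        print("Deployment blocked: no approved change record found.")
        sys.exit(1)   # non-zero exit fails this pipeline step
    print(f"Change {change_id} approved; continuing deployment.")
```

The point is not the specific check but that approval evidence is gathered and enforced by the pipeline itself, so the documentation becomes a by-product of the deployment rather than a separate paper trail.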
