Playing the Blame(less) Game

Angga Kho
4 min read · Jan 21, 2024

When something goes wrong, it is in our nature to look for something or someone to blame. In software especially, human error is often the cause of bugs that result in server outages, data leaks, lost revenue, and so on. So the very existence of the “blameless post-mortem” in software development can seem like an oxymoron. What is the point of uncovering the root causes of a problem, if not to find someone to blame? And what about accountability? Doesn’t a blameless post-mortem mean people can’t be held accountable for their negligence? In this post, I trace the history of the concept, explain why it is preferable if we want a well-functioning team, and show how accountability can in fact go hand in hand with a blameless culture. Finally, I’ll give some examples of root causes that you can consider before blaming your team for incompetence.

The earliest reference I could find to the blameless post-mortem is a 2012 Business Insider article by the CEO of Etsy. It was later elaborated in another article by Etsy’s CTO, John Allspaw, who cited a 2001 medical paper titled “Patient Safety and Just Culture” by David Marx as the inspiration for the practice. The paper argued that a risk-based approach to disciplinary action is the better way to reduce mistakes in medical settings. Under this approach, people who make honest mistakes are not punished, no matter how severe the consequences. Instead, punishment is reserved for people who purposely take unnecessary risks (being “reckless” or “grossly negligent”). This encourages people to self-report mistakes, and these reports allow the organization to learn and make improvements that prevent more mistakes in the future. This result (fewer mistakes) is preferable to punishing a few individuals who might themselves be victims of a dysfunctional organization.

This concept fits software development particularly well because modern IT systems are often very complex, with different layers of technology that require different kinds of expertise. For example, a typical mobile application comprises a front-end, a back-end, an API gateway, a load balancer, caches, databases, CDNs, marketing trackers, and more. In bigger companies, these are usually handled by different teams, so a change in one place may have an unforeseen impact in another. As Etsy’s CEO succinctly put it:

“In complex systems, there’s rarely one root cause (for a problem)”

It therefore does not make sense to play the blame game, because the root causes of mistakes are often not the people but the process.

Blameless doesn’t mean no accountability. The team responsible for the mistakes shows accountability by sharing with the whole organization how the mistakes were made and what steps can be taken to prevent them in the future. They take accountability by actively proposing improvements and evangelizing those improvements to the rest of the organization. And without fear of punishment, they are empowered to do more honest and thorough investigations.

So if individuals are not to be blamed, what are some possible causes of bad software? Here I put forward several, though this list is by no means comprehensive:

1. Is there enough investment in paying off “tech debt”?

We can’t separate code quality from the context in which the system was built. For example, when there are only a few end users, shortcuts taken during development may not have much impact on performance. But when the number of users grows (as we hope it will), the code may not be able to cope. It is therefore important to reserve some capacity for paying off this “tech debt”.

2. Is there enough investment in building the right infrastructure for the developers?

New monitoring and alerting technologies keep emerging, and adopting them may prevent some pain in the future.

3. Is there enough time given to the developers?

To meet deadlines, the work required to ensure quality is often what gets sacrificed: unit tests are omitted, code reviews are skipped, and so on. Remember the trifecta of project management: cost, quality, time. When you demand faster delivery without providing enough people (“cost”), don’t expect to get a quality product.

4. Are people in the right role?

For example, just because someone is a senior engineer does not mean they are fit to become an engineering manager. As another example, making someone a manager while still requiring them to code is a recipe for disaster. Depending on the size of the organization, a managerial role is usually a full-time job.

5. Are there alignment problems between teams? Is the management doing enough to bridge these gaps?

The complexity of modern software requires collaboration among teams with different goals and expertise. Blaming a few individuals for mistakes therefore not only shows a lack of maturity in the organization but also creates an environment in which bad software keeps being delivered.

Angga Kho

I worked as a product manager for a tech company based in Jakarta. Opinions in this blog are my own.