Can a good explanation really prevent a prod incident?

A story of how properly positioned rationales can be powerful enough to prevent prod incidents

A missed opportunity to communicate

It was March 21, 2022. I remember the day like yesterday. I just started my On-Call Shift. Our team On-Call rotation is set up so each engineer goes on call for a week, once every 6 or so weeks. The On-Call Engineer is responsible for responding to incoming customer support requests, upgrading packages for our core microservice, and addressing a dedicated backlog of small system enhancements. Our core microservice is one of the older systems in our company. In the earlier years, priorities, unfortunately, trumped the kind of TLC necessary to keep the microservice lean and focused. Our microservice boasts lots of outdated packages, and accreted impostors that never got evicted. Despite successful efforts to rectify this, there still remains an entire journey ahead. As such, the On-Call is never bored, or without a steady stream of work.

I walked into this shift feeling inspired, and desirous of making material improvements. Having worked with the same service for 4 years, I carry the burden of service improvement every day. Unsettled with the monotone of package updates, I sifted through the list of updates looking for an opportunity to make a bigger impact. As I scanned the list of packages that required updates and stumbled across two third-party packages I didn't recognize. What's defusedxml and bcrypt?!? Why do we need those?

Let's start with defusedxml, I thought! I use VS Code for editing, despite my unresolved feelings for Sublime Text. I opened up the file with the list of packages and their versions. With one keystroke, I invoked a nifty feature of Git Lens, which deposited the Github URL for the commit to my clipboard. I quickly dropped the link in the browser to start the interrogation. I was anxious to see the commit message...

"saml2 backend for [company X]"

Hmmm... not too much context there. Somebody was trying to add a SAML authentication backend to our Identity microservice. SAML is a Single-Sign-On mechanism. Some years ago, we wanted users from Company X to be able to sign in to our Identity microservice once they had an account at Company X. Ok... but what does that have to do with defusedxml? Maybe let's look at the PR. I found a PR with 18 files and ~1000 lines of code. The author had long left the company years ago. I did some additional research on the package and discovered it's used to prevent certain types of XML attacks when parsing XML. There were no direct references to the package in our code. I concluded that there was perhaps some interest in using the package, but the engineers never got around to using it. Seemed safe to remove!

Now for bcrypt! I checked across our codebase and realized there were no direct references to the package except some old, unused SQL scripts that were used previously to move data from prod to our staging environment. Enough for me! Let's remove the packages and get a PR out there!

I baked up a quick PR removing the packages and pushed my branch to Circle CI for testing. In parallel, I also checked a few endpoints to ensure nothing critical was broken. The tests in Circle CI were all green! I wrote a descriptive commit message.
I captioned my commit message, then provided my rationale in the body of the message.

bcrypt - this is a package that provides some crypto functions for hashing. It was never used across our repo. I'm not sure if this was a dependency of the Django password hasher previously. However, today, we're not using it.

defusedxml - this package is apparently used to prevent certain types of XML attacks when parsing XML. It was added in this PR while the Company X integration was being developed, but it doesn't seem to have been used. We still have Company X code to remove but none of it is being used today.

I got the +1 to merge, and even a heart emoji from my manager! Who doesn't love to see unused code being removed?

As my finger depressed the mouse over the merge button, there was a sense of pride welling up inside. I felt like a proud citizen of our team, and steward of our microservice. I had discovered defunct dependencies and chipped away at the unnecessary cruft inside our service. For all intents and purposes, I could have ended my shift early! My work was done!

Not so fast

But then, Slack started getting noisy -- one of our monitors went off. People were having issues logging into our site. The monitor was quickly followed by a trove of Sentry errors -- something about bcyrpt missing. It turns out, bcrypt was actually required.

Wait... what?!?

I could feel the contentment being collected by the alerts. I saw the alerts, and was able to jump in and roll back the deployment. The incident lasted just a few minutes, and not too many customers were impacted. Hmmm... I guess bcrypt is actually a thing, huh?

After deeper investigation, I realized that bcrypt is actually a runtime dependency of our system. We use Django, a Python framework that provides lots of batteries. One such battery is authentication. Django provides functions to authenticate users with some configurations. One such configuration is the password hasher. We dictate the hashing algorithm Django uses when persisting and verifying passwords. To complicate things, apparently, bcrypt is rather slow, so we disable it in Circle CI. This explained why all our tests passed. None of them ran using this hasher.

Thankfully, nothing actually broke in prod related to defusedxml. Seems like I wasn't incorrect in considering defusedxml safe to remove. However, I rolled back the change so we could perform a deeper investigation.

What did I learn?

So let's talk about what I learned from this incident.

A good commit message goes a long way

Writing a good commit message is really critical. You can use a commit message to help future you understand why you introduced a change. Usually, someone can derive what was done by looking at the change, but the why is that much more important! Years later, someone may be looking at your code with the thought of modifying it or removing it. How will someone know if your code can be safely modified? A good commit message bakes in the context for why the change was made, and all the working assumptions involved in the introduced change. More importantly, we can illuminate any potential conundrums. However, commits aren't the only place where we can provide context.

Context can (and should) be provided in many forms

In the case of the bcrypt dependency, it would have really helped to have a comment right beside the dependency (in the dependency file) explaining why it was added. Additionally, that would also have been a great place to help folks understand why bcrypt wasn't used in testing. Those are some inconspicuous connections that are worth highlighting for folks. I realized after that I didn't review the commit history for bcrypt. I must have been riding the adrenaline rush after discovering defusedxml was unused. Sure, I could have done more digging. However, that's a personal change we can't guarantee other engineers will use. Everyone has their own approach and may check different places. The more places the context exists, the better! The dependency was quite obscure. It took a fair amount of digging to realize the connection between bcrypt and our authentication system.

This other time a great commit saved prod

Fast forward to another day when I was working on a ticket to upgrade an outdated package that was pinned. By pinned package I mean we've told our package manager that we only want a specific version for a particular package. Some of these packages end up being pinned for years and never get updated. Ideally, packages are unpinned and the lock file captures the actual version. However, this time, we had an oauth2client dependency that was pinned, wasn't updated, and was now replaced by a new package Google had.

In this case, there were some dependencies that were pinned whose version conflicted with the new package Google came out with. I couldn't install Google's new package without upgrading those dependencies. I wanted to know if I could safely upgrade them. Were they pinned for a particular reason? Was it safe to upgrade?

Thankfully, when I took a look at the commit history I found a fantastic explanation that indicated that the packages were pinned to avoid breaking changes in yet another package. So, I was trying to replace oauth2client, but I needed to upgrade package X, and the commit told me that package X's version was pinned to avoid a conflict with package Y, which happened to be X's dependency. It was just a tad convoluted, but I was eternally grateful. Without this commit message, it meant I could have updated package X and then broke prod because of some conflicts with package Y.

Conclusion

Don't ever think a little extra can't prevent some downtime in production. A great commit message can provide a lot of context and really give us the confidence to make modifications. Remember though, it's not just commit messages! Think about how you can sprinkle context in other helpful locations so we help others and our future selves avoid making destructive changes!

Do you have a story of how commits prevented a disaster, or other ideas to preserve context? Let me know in the comments!