Introduction
Observability is a great vehicle for lowering the barrier to understanding and supporting production systems for everyone, regardless of company or professional tenure. A system that emits enough useful, thoughtful telemetry enables system comprehension -- especially for the newbies on the team. Not only does it lower the barrier, it also reduces the time-to-useful debugging (TTUD -- pronounced ta-tud). It's a new metric I've created that I'm hoping catches on. With great telemetry, any newcomer can play a pivotal role in debugging the gnarliest of issues in production and reducing remediation time. And if a newcomer can do that, just imagine what the rest of your team will be capable of!
It always surprises me how tribal knowledge becomes a barrier to democratizing system comprehension. There's a wealth of tribal knowledge that never makes its way into our system telemetry. As a result, many of our teammates' TTUDs (there it goes again) stretch way longer than they need to. If only we could get our telemetry to speak the native tribal tongue! If we did, it would surprise us how quickly folks could orient themselves, learn our systems and chip in on production incidents.
Note that this reference to "newbie" isn't a euphemism for junior engineers -- professional tenure isn't the only dimension that benefits from better observability. You could have just onboarded a Senior Engineer who has little familiarity with your tech stack. Perhaps her dominant language is Go, but your company codes in Python. Or maybe you use PostgreSQL and he's spent years working with Oracle. Each language, framework, database, tool, and process comes with its idiosyncrasies that create barriers to system comprehension. Observability helps us break those barriers.
So let's dig in! Let's put ourselves in the shoes of the newcomer and explore four steps to improve system observability. In this article, I'll focus mainly on improving traces, but the principles apply just as well to metrics and logs.
But first...
Before we jump in, let's see how quickly you can diagnose an issue with my system.
Here's the problem...
I've got a worker that processes events. It makes a single call to fetch a batch of events from our database, then it iterates over each event, reads it and performs some additional logic depending on the event type. The events are basically CRUD operations for subscriptions: depending on the event type, we could be creating a new subscription, updating one or deleting one.
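If it helps to picture the system, here's a minimal sketch of what that worker might look like in Python. It's a hypothetical reconstruction: the EventService method names match the spans in the trace below, but the table, columns and handler details are assumptions.

from dataclasses import dataclass

@dataclass
class Event:
    id: int
    event_type: str  # "create", "update" or "delete"
    payload: dict

class EventService:
    def __init__(self, db):
        self.db = db  # any DB client with a query() method (assumed)

    def fetch_events(self, batch_size: int = 100) -> list[Event]:
        # One database call to pull the next batch of unprocessed events.
        rows = self.db.query(
            "SELECT id, event_type, payload FROM events "
            "WHERE processed = false ORDER BY id LIMIT %s",
            (batch_size,),
        )
        return [Event(*row) for row in rows]

    def process_events(self, events: list[Event]) -> None:
        for event in events:
            self.process_event(event)

    def process_event(self, event: Event) -> None:
        # Basically CRUD for subscriptions, keyed off the event type.
        if event.event_type == "create":
            self.create_subscription(event.payload)
        elif event.event_type == "update":
            self.update_subscription(event.payload)
        elif event.event_type == "delete":
            self.delete_subscription(event.payload)

    # Placeholder handlers -- the real logic isn't important for this article.
    def create_subscription(self, payload: dict) -> None: ...
    def update_subscription(self, payload: dict) -> None: ...
    def delete_subscription(self, payload: dict) -> None: ...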
Here's an example of a typical trace.
The root span shows the entry point for the event processor. Below it you can see the call to EventService.fetch_events, which leads to a database call in the green span. After fetching events, on the right you can see a call to EventService.process_events, which leads to several smaller EventService.process_event spans.
The issue is that sometimes events take a really long time to process. We can observe from the trace that some of the EventService.process_event spans complete quickly (the ones with the truncated text), while occasionally the processing takes much longer. The last span gives an example of this.
What do you think is going on here?!?!?!?
Maybe you were thinking...
Some event types take longer than others
Could it be the payload size? Maybe there's a poison pill
Maybe some retries are happening somewhere
Possibly there's a cache that's messing things up
Maybe it's a Kafka consumer and there's some contention for processing requests
Database latency -- maybe an index is missing
Possibly you've concocted some ideas that are much more novel than the simple ones I listed.
Anyway... if you ended up guessing and speculating about the cause, you're wrong.
The correct answer actually is...
I don't know! Show me the data!
Instead of formulating ideas based only on the problem description and then seeking to confirm those ideas, we should use the data available to us to form theories. Fundamentally, we need to flip the order in which we approach our investigations. Don't go in with a theory. Rather, go in with a curious eye, ready to interrogate the data available to you. Only then should we start to speculate about what's happening. Changing our approach is going to be fundamental to levelling the playing field and making systems interrogable for all -- especially the new guy.
Let's improve Observability
Now let's consider four practical steps we can take to elevate the interrogability of our systems. As we build and extend our application, we want to take these four steps cyclically to improve observability. First, we start by getting a baseline of how easily we can interrogate a particular part of our system. That's the Observe step. Next, we'll provide a central mechanism for collecting our telemetry. That's the Centralize step. Once that's in, we'll add telemetry according to best practices. That's the Telemeterize step. Finally, we'll attach this central mechanism to all the tools that we send our telemetry to. That's the Permeate step. Using these four steps, we'll ensure that our systems can be queried to distill whatever state they've gotten themselves into.
Step 1 - Observe
We want to start by understanding how easily we can debug a particular area of our system based on the telemetry available to us. We're examining our system with the ultimate goal of reducing the time-to-useful debugging (TTUD). We want to ensure that the tribal knowledge makes its way into our telemetry so our systems are easier to debug.
We begin by looking at the area we'll be working in and evaluating how easily we can answer different types of questions about system behaviour and state. We need to look at the logs we have available and determine whether sufficient breadcrumbs exist. We also need to examine the current metrics and explore the tags associated with each metric. Then we want to look at the traces. How much information can we glean structurally from the traces? What theories can we test with the information available to us? How much annotation was added to each span in the trace? We want to establish a baseline to determine how much work needs to be done.
As we explore the telemetry and context available in each tool, we want to determine the level of transparency we have into our system's state and behaviour. Let's consider a few questions we can ask ourselves (there's a short sketch after this list showing how telemetry might answer the first two):
Do we have enough information to confirm success or investigate failures? What kind of failures are made obvious? Are different types of failures represented (client-side failures like validation failure, or backend failures like a broken/failing dependency)?
Do we have a clear view of the system behaviour when forks or conditionals are involved? Does our telemetry explain which path our system took through a tree of conditionals?
Is it obvious why the system performs particular actions? Is it obvious when the system doesn't perform a particular action? For example, can we tell why our system performed 2 database queries in one particular trace as opposed to 5 in another?
If we're seeing problems or partial degradation, can we decompose our system into arbitrary segments to examine performance?
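To make the first two questions concrete, here's a minimal sketch using the OpenTelemetry Python API. The attribute names (failure.kind, orders.path) and the helpers are illustrative assumptions, not a standard; the point is that the span itself distinguishes client-side validation failures from backend failures and records which branch the code took.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

class DependencyError(Exception):
    """Raised when a downstream dependency fails (hypothetical)."""

def handle_request(request: dict) -> dict:
    with tracer.start_as_current_span("orders.handle_request") as span:
        # Client-side failure: name the category instead of leaving a bare error.
        if not request.get("items"):
            span.set_attribute("failure.kind", "client.validation")
            span.set_status(Status(StatusCode.ERROR, "validation failed"))
            return {"status": 400}

        # Record which path we took through the conditionals.
        path = "priority" if request.get("priority") else "standard"
        span.set_attribute("orders.path", path)

        try:
            result = submit_order(request)  # hypothetical downstream call
        except DependencyError as exc:
            # Backend failure: a broken/failing dependency, captured distinctly.
            span.set_attribute("failure.kind", "backend.dependency")
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, "dependency failed"))
            raise
        return {"status": 200, "result": result}

def submit_order(request: dict) -> str:
    return "ok"  # stand-in for the real dependency call

Attributes like these let us confirm success, categorize failures, and see which path a request took without reading the code.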
Let's consider a few examples of traces where we can practice this step of observing and taking stock of where we are.
The Flat Trace
I've encountered many traces where there's a span representing the service followed by a sea of heterogeneous spans. This doesn't communicate much and is especially hostile to the unfamiliar. For starters, you need to examine the details of each span and work out what the operation is doing. For some spans, that means reading the database queries and trying to understand their purpose. Often arguments are unavailable in the trace, which makes it harder to derive the intent. Then, after understanding what each span represents, you still need to speculate about why. It raises more questions than answers:
Why are we selecting a bunch of records, inserting into one table, and then updating another?
Why are we pulling data from the cache? Was the fetch from the cache successful? What's the expected system behaviour for a cache hit/miss?
How long would it take a newcomer looking at this trace to work out that we are (compare with the sketch after this list):
Fetching events from the database...
Inserting each into the database
Updating some internal state that keeps track of what event we last fetched
Pulling the latest timestamp from other workers and inserting it into the database
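For contrast, here's a minimal sketch (OpenTelemetry Python API) of the same work grouped under descriptive spans, so the trace itself tells that story. The span names and the db/workers helpers are hypothetical; the shape is what matters.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def run_worker_cycle(db, workers):
    # db and workers stand in for whatever clients the worker actually uses.
    with tracer.start_as_current_span("event_worker.run_cycle"):
        with tracer.start_as_current_span("event_worker.fetch_events") as span:
            events = db.fetch_unprocessed_events()
            span.set_attribute("events.count", len(events))

        with tracer.start_as_current_span("event_worker.persist_events"):
            for event in events:
                db.insert_event(event)

        with tracer.start_as_current_span("event_worker.update_fetch_checkpoint") as span:
            last_id = events[-1].id if events else None
            span.set_attribute("checkpoint.last_event_id", str(last_id))
            db.save_checkpoint(last_id)

        with tracer.start_as_current_span("event_worker.sync_worker_timestamps"):
            timestamps = workers.fetch_latest_timestamps()
            db.insert_worker_timestamps(timestamps)

Even before reading any span details, the four named child spans answer the "what are we doing here" question that the flat trace leaves to tribal knowledge.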
The Unexplained Shapeshifter
Imagine these two traces represent different calls to the same endpoint (a login endpoint). They're different for some reason, but only the tenured understand why. Both traces start with a call to the cache, but everything after that is different. The first selects some data from an auth_user table, then inserts into the auth_log. The second trace only inserts into the auth_log. Without more context, or without looking at the code, it's not very clear what's happening.
For the uninitiated, we don't know...
Why do we take different pathways after calling the cache? What's the significance of that?
If we're investigating a particular scenario, how do we confirm the expected behaviour? Does the first trace represent success, failure or something else?
How would a faulty cache impact the functioning of our endpoint?
A newcomer would need someone to explain that this login endpoint checks the cache to see if a user's email is banned. If the user is banned, we insert a record into the log capturing the attempt. If the user isn't banned, we check their credentials and then log the success or failure of the authentication.
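A little annotation carries that tribal knowledge into the trace itself. Here's a minimal sketch with the OpenTelemetry Python API; the attribute names and the cache/db helpers are assumptions for illustration, not our actual login code.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def login(email: str, password: str, cache, db) -> bool:
    with tracer.start_as_current_span("auth.login") as span:
        banned = cache.get(f"banned:{email}") is not None
        # Explain why the trace forks after the cache call.
        span.set_attribute("auth.banned_check.hit", banned)

        if banned:
            span.set_attribute("auth.outcome", "rejected_banned_user")
            db.insert_auth_log(email, outcome="banned")
            return False

        ok = db.check_credentials(email, password)
        span.set_attribute("auth.outcome", "success" if ok else "invalid_credentials")
        db.insert_auth_log(email, outcome="success" if ok else "failure")
        return ok

With attributes like auth.banned_check.hit and auth.outcome on the root span, the two shapeshifting traces explain themselves.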
The Unidentified Victim
Too many times we have traces that don't allow us to identify or categorize requests. We have traces that don't contain sufficient identifiers -- there are no links to any useful entities in our application. The traces don't have user IDs or any other useful domain identifiers that could help us identify the victims of the service degradation or incident we're investigating. Even worse, our traces often lack the traits that could at least be used to bucket users into similar categories.
Where I work, one of our core products is an Accounting SaaS. Our customers typically work with invoices, receipts and other document types. Our traces must contain identifiers for the users, the businesses they're operating on, and the entities they're acting on. Speaking of traits, it's also useful to be able to bucket requests by other traits of those objects. For example, it can be useful to know whether the user submitting the request is a business owner or a user they've delegated access to. We also now have several monetized areas of our core product, so it's useful to know whether a request comes from a paying or non-paying user.
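Concretely, that can be as small as a helper that stamps identifiers and traits onto the current span at the start of each request. The attribute names below are illustrative, not our real schema.

from opentelemetry import trace

def annotate_request_span(user, business, document) -> None:
    """Attach domain identifiers and traits to whatever span is active."""
    span = trace.get_current_span()

    # Identifiers: who and what this request is about.
    span.set_attribute("app.user.id", str(user.id))
    span.set_attribute("app.business.id", str(business.id))
    span.set_attribute("app.document.id", str(document.id))
    span.set_attribute("app.document.type", document.type)  # invoice, receipt, ...

    # Traits: enough to bucket requests during an incident.
    span.set_attribute("app.user.role", "owner" if user.is_owner else "delegate")
    span.set_attribute("app.business.plan", business.plan)  # e.g. "paid" or "free"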
Flip the script... sometimes it's a villain we need to identify! Once we had a credential-sniffing attack where a user tried different credentials against our public API. They likely got the credentials from some illegal site and wanted to check whether any of them worked. To access our public API, users have to create an application. Fortunately, our API team had captured the application_id on the traces, so across the sea of requests the malicious user threw at our application, they were able to pinpoint exactly which application the traffic was coming from -- all because the traces carried identifiers.
Conclusion
In conclusion, the journey towards effective observability starts with observing our systems to see how accessible they are for the newcomer. Understanding the posture of our system observability is the first step to demystifying complex systems and democratizing system comprehension. It helps reduce the time-to-useful debugging (TTUD) for all our team members -- not just the team newbie. We looked at some examples of traces that aren't the most welcoming and questioned how much tribal knowledge we pour into our telemetry.
In the next article, we'll talk about Centralizing our telemetry in a way that pushes us further on the journey of better observability. We'll make it super easy for anyone to capture telemetry!