Accelerating the pace of custom instrumentation

Introduction

I've found a new love - custom instrumentation, and I can’t keep it to myself. We manually increased the custom instrumentation in one of our newer services and loved the impact. The instrumentation accelerated forensic investigations, root cause analysis and system comprehension. I’m hopelessly in love because I experienced worse. I’ve felt the pain of data insufficiency. We’ve had too many incidents where our traces could barely help us identify the impacted customer segment, much less tell us how many customers were impacted. Custom instrumentation has demonstrated its value, and we wanted to share it with the rest of the microservices in our fleet.

There was only one problem. Although we had one well-instrumented microservice, we still had over 40 other microservices that needed to be instrumented. Adding high-quality custom instrumentation takes time. Assuming it’d take about 3 months to instrument each of those services, it’d take ~10 years to get to the quality we needed. And which engineering leadership team doesn’t like to hear a feature which takes only 10 years to build? So… yah… we sought to fix that problem.

The rest of the talk covers a new approach we created called Config-Accelerated Custom Instrumentation, which we’re using to accelerate the time taken to add high-quality custom instrumentation to our services.

Solution Goals

We had a few goals in mind for building an automated solution.

Speed - We want instrumentation creation to be really fast too. We had ~40+ services in our fleet that needed this tooling, and 10 years wasn’t going to cut it.
Simplicity - we wanted a solution that enabled engineers to add instrumentation easily. Relative to the manual effort, it should be a piece of cake to get quality instrumentation captured. We want engineers to enjoy adding telemetry.
Data-centric - we want our engineers to think simply about the data they want to capture. We’d do the heavy lifting and capture the data for them. So, we want them to think “these are the data points I want to capture”. We’d provide the necessary automation and logic to observe the application stack and collect the data.
Small footprint - we also wanted a super unobtrusive implementation. We don’t want our tooling littering the codebase. A large footprint puts a real damper on engineer productivity and any evolutions we want to make.
Verifiability - we want engineers to verify that the instrumentation works as expected. Our desire for simplicity would likely remove some visibility, so we want to restore that visibility with verifiable mechanisms. We needed to answer questions like “will this datapoint be captured in my telemetry?" or “which areas of my app are under-instrumented?”
Secure - we don't want to risk any PII leakages. If we’re automating custom instrumentation, we’ll likely touch some PII and we’ll need to be careful nothing’s unintentionally leaked.

How it works

Our Config-Accelerated Custom Instrumentation (CACI) approach consisted of three distinct pieces. There’s the configuration that tells us what to instrument, a hook (metaclass) that captures the telemetry from classes and a report that provides feedback on what’ll be captured. Altogether, this simple set of tools helps us accelerate the pace of custom instrumentation dramatically!

The Configuration

Let’s talk about the Configuration first. The configuration provides guidelines for what our tooling should capture for engineers. This is where engineers tell us what they want to instrument. The config captures allowed field names, field types, data classes and dictionaries. It functions as a purely opt-in configuration. Data has to be explicitly included; nothing is captured by default. Here’s what a sample configuration looks like:

capture_configuration = TelemetryCaptureConfiguration(
    allowed_field_names=[
        "account_id",
        "business_id",
    ],
    disallowed_field_names=[],
    allowed_field_types=[UUID, datetime, bool, Enum],
    disallowed_field_types=[],
    allowed_dataclasses=[
        # Common
        AllowedDataclass(
            dataclass_name="Business",
            allowed_field_names=[
                "business_id",
                ...
            ],
        ),
    ],
)

The configuration places the focus squarely on the application’s data. We want the configuration to be completely decoupled from the application’s architecture. This avoids fragility and is change-resistant. We don’t want engineers having to update their configuration every time they add a new function, or change an existing one. The configuration is deliberately simple, easy to maintain and extend. Using this, engineers can tell us what application data is important for us to capture. Once this configuration exists, we can collect the data with our hooks!

The Hook

Once we’ve got the configuration telling us what needs to be captured, now we need a mechanism for collecting the data exchanged between components in the application stack. We want to capture information exchanged by components. We started by capturing the function inputs for the component. So, effectively, every time some component calls another component, we want to inspect the function input to see if any of the variables are specified in the configuration. To enable this, we created a metaclass that wraps each public method, captures relevant inputs then passes through outputs and errors. We only want to do this for public methods to keep things simple and easy to understand. We opted to use a metaclass as our hook. Using a metaclass, we can hook into the class creation and refine the methods dynamically, inserting our logic.

The basic logic looks like this…

class ComponentMetaclass(type):

    def __new__(...):
        # Loop over all methods
        for method_name, method in methods:
            # Wrap the method with our logic
            new_method = wrap_method_with_our_logic(method)
            # Replace the method definition
            attrs[method_name] = new_method

Then, using it is really easy. We just specify the metaclass on any class to use it.

class SubscriptionService(metaclass=ComponentMetaclass):

    def create_subscription(self, business_id: BusinessId, user_id: UserId):
        ...

And that’s it! Just by specifying that a component class wants to use our metaclass, our Observability package gets to work! It starts intercepting all calls. For instance, when a call is made to SubscriptionService().create_subscription(….) it calls our logic and checks if the business_id and user_id are in the configuration (TelemetryCaptureConfiguration). If it is, we capture those inputs for submission later.

Coverage Report

An important aspect of our design was to provide an explicit feedback loop engineers could use. These tools enable engineers to cover more ground faster, however, it also removes considerable visibility for them. Previously, an engineer could look at a method and see what’s instrumented. In the example below you can see where engineers would manually stuff data into a ctx. It was a tremendous feedback loop. Engineers can see the code and touch it. They can write test assertions to ensure it works as expected. By adding a configuration and these hooks, we’re granting power but removing visibility.

class SubscriptionService:

    def create_subscription(self, ctx, business: BusinessId, user_id: UserId):
        # Capture some telemetry
        ctx["business_id"] = business_id
        ctx["user_id"] = user_id

        # do actual logic ...

To replace this visibility, we provide engineers with a Coverage Report. It’s a report that generates a method-by-method commentary on what will or won’t get captured by our tooling. It also provides a percentage of how many variables are captured.

- SubscriptionService
  - create_subscription(self, business_id: BusinessId, user_id: UserId)
  - Inputs
    - business_id:BusinessId - ✅
    - user_id:UserId - 🚫

-----------------------------------------------------------------
Parameters captured: 1/2 (50%)

This restores visibility for engineers aiming to understand what’s instrumented and what’s not. It also provides a goal post they can aim for. Currently, we encourage engineers to aim for 70% or higher, but it’s not enforced currently.

Next Steps

There are a few improvements we’re thinking of making. We want to decouple it from specific object types to simplify things even further. Engineers are less likely to write format-specific rules for the same piece of data. Our configuration has explicit support for data classes and Python dictionaries. However, I’ve thought of using it in other places where we’re using different data types but handling the same data. I want to transition it to where it thinks generically about primitive and complex object types (including nested ones), regardless of the format.

We also want to move beyond coverage. Coverage is a controversial metric in engineering. It may improve code quality but aiming too high can be pointless, arbitrary and force engineers to game the system. Coverage gives us a basic idea of what we’ll capture, but it falls short of the real picture. We want to know if the data will be captured in a particular workflow. It doesn’t matter that the data exists at a certain level. We’ll definitely want to extend this model so it’s more sophisticated.

Finally, we want to capture function outputs too. I haven’t figured a great way to represent outputs given they’re unnamed in Python. I don’t want to couple the naming to the structure. Eventually we’ll figure that out.

Conclusion

Config-Accelerated Custom Instrumentation has been a game-changer for us. What started as a 10-year problem became something we could tackle in a reasonable timeframe. A configuration file, a metaclass, and a coverage report—and that's actually it!

We're still iterating and improving, but even in its current form, CACI has proven its worth. Our services are better instrumented, our incidents are easier to debug, and our engineers are spending less time manually adding telemetry and more time building features.

If you're staring down a similar mountain of under-instrumented services, I'd encourage you to think about what automation could do for you. The manual approach works, but it doesn't scale. And in a world where observability can make or break your ability to respond to incidents, waiting 10 years isn't an option.

Thanks for stopping by!

Instrumenting at 10 years per second

Introduction

Solution Goals

How it works

The Configuration

The Hook

Coverage Report

Next Steps

Conclusion

Comments

More from this blog

On shipping smaller PRs

Small PRs ship faster - QED

Order matters - making a compound index 50x faster

Dear Editor: We need better Database Observability

Command Palette

Introduction

Solution Goals

How it works

The Configuration

The Hook

Coverage Report

Next Steps

Conclusion

Comments

More from this blog