Dear Editor: We need better Database Observability
A call for better database observability, or for awareness of a better reality that already exists
Introduction
Dear Editor, hear my plea. I pen this piece in earnest search of enlightenment, or the beginnings of a much needed change: we need better database observability! Specifically, more granular and useful database spans. Coarse, high-level metrics we lack not, but detailed, infrastructure-intent-revealing, behaviour-illuminating spans... those we have not. We've got fantastic silos and a bevy of different tools, but we lack the cohesion that simplifies database debugging for all.
Please tell me I'm wrong, blinded or misguided - show me an existing reality that eludes me, one I'm oblivious to. I'll confess the pain I've felt and the gaps I've seen. I'll recall my suboptimal optimization attempts and struggles with error interpretation. Come get me in the comments, tell me I've been living under a rock (a purple and white rock with a dog logo) and that better exists today. If not, join the cry, regurgitate and amplify, protest with me. Let's ask more of database vendors, telemetry vendors and maybe OpenTelemetry. We need better database observability!
For context, Dear Editor... kindly note we're using Postgres, a little MySQL and DataDog for Monitoring and Observability. My thoughts and reflections are based on this perspective.
Don't lock up all the context 🤦‍♂️
A few months back a faulty deploy increased our API latency 20x. Our endpoint slowed to ~1s when it previously responded in 30-50ms. Fortunately we quickly identified the faulty deploy and rolled back. The investigation, on the other hand... that took some effort. A really smart co-worker of mine discovered it was a locking issue. A sub-optimal query obtained a lock and blocked other concurrent requests, preventing them from completing successfully. So, we had enough context to know that our database was conflicting with something... but with what, Madam Editor?
My contention (see what I did there… locks… contention) is that there's insufficient context to confidently navigate locking issues. Why can't the database tell me what it's conflicting with? I'd personally love a proper EXPLAINation. There are likely a myriad of different possibilities. Our service only has ~15 endpoints and this was difficult enough. Could I get the process ID of the other process? Perhaps the transaction ID of the conflicting concurrent transaction? Or better yet, the SQL statement it's conflicting with? Actually, what would be ideal would be a link to the conflicting trace ID 🥇. The database spans I examine today are woefully inadequate, and don't help much if there's a locking issue.
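For context, here is roughly what that manual digging looks like today, outside of any trace. This is only a sketch against Postgres's pg_stat_activity view and the pg_blocking_pids() function, and it only works if you happen to be connected while the contention is actually happening.

-- find sessions that are currently blocked, and who is blocking them
SELECT
  blocked.pid    AS blocked_pid,
  blocked.query  AS blocked_query,
  blocking.pid   AS blocking_pid,
  blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;

Useful in the moment, but it's a live snapshot in another tool; by the time you're reading the trace of a slow request, this information is long gone.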
Here's an example of what's available currently… I can see the statement, the number of rows impacted (row_count), some connection details (ip, port and user), but that's about it.
{
  "application": "my-application",
  "db": {
    "row_count": 1,
    "statement": "UPDATE product SET name = ? where id = ?",
    "user": "db-user"
  },
  "network": {
    "ip": "prod.my-application.cluster.us-east-1.rds.amazon.com",
    "port": 5432
  },
  ...
}
The lack of proper context complicates database debugging unnecessarily. How does one without a wealth of knowledge on databases and isolation levels understand what's happening? Given a locking error, or some other obtuse error without sufficient pointers, what steps does one take next? Tool hopping can't be the answer, Mr. Editor. We want more information in our traces to make debugging more accessible. Is sending folks up another tool's learning curve the best we can do? I think not, Sir Editor!
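To make the ask concrete, here's a purely hypothetical sketch of the lock-related context I'm wishing for on the span. None of these attribute names exist in any vendor or OpenTelemetry convention I know of; they're just an illustration of the pointers that would turn a dead end into a next step.

{
  "db": {
    "statement": "UPDATE product SET name = ? where id = ?",
    "lock": {
      "wait_ms": 940,
      "blocking_pid": 4182,
      "blocking_transaction_id": "561234",
      "blocking_statement": "UPDATE product SET price = ? where id = ?",
      "blocking_trace_id": "8a3f5c2e91d04b7a"
    }
  }
}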
Looking up index issues
I recall another time when missing index information drew out our performance optimization attempts. There was a stubborn query resisting our efforts; its latency had climbed 50x over the previous month. Given the query included multiple fields, we toyed around with different compound indexes hoping for a resolution. Madam Editor... the feedback loop wasn't the greatest. We'd run EXPLAIN, add the index, deploy the code, cross our fingers hoping the latency decreased, then run EXPLAIN again to see if the index was used. That is a protracted feedback loop. There's got to be a better way!
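For reference, this is the round trip I mean. The table and index below are made up for illustration; the point is that confirming index usage means reading raw plan output by hand, after a deploy, one guess at a time.

-- hypothetical compound index we hope the planner will pick up
CREATE INDEX product_name_category_idx ON product (name, category);

-- run the query with the real plan and runtime statistics
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM product WHERE name = 'widget' AND category = 'tools';

-- then scan the output for a line like
--   Index Scan using product_name_category_idx on product ...
-- versus "Seq Scan on product", which means the index was ignored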
What if we could bring up a trace and immediately determine which index was used for the query? Or group the API requests by the index being used? That'd allow quick evaluation of index usage and confirm whether our new index is actually being picked up. Is it being used exclusively? Do some combinations of arguments change the database's plan? And what about impact: are queries faster when using this index? Zooming out... I could look at index usage across lots of different use cases and spot other improvement areas across the system. Is there a set of arguments, or a particular use case, where the index is used but the latency is still high?
Oh, how much better things would be if traces could read my database's mind.
Parseless errors
Madam Editor, it's not all bad. Occasionally I'll get some useful context in my database span. I want to formally register my appreciation of said context. Here's a sample error I've encountered:
django.db.utils.IntegrityError: duplicate key value violates unique constraint "email_key"
DETAIL: Key (email)=(thisuser@gmail.com) already exists.
Here, an SQL statement violates a database constraint, causing an integrity error. This isn't bad at all -- there's some detail here! It provides the following:
Error type (integrity error)
Error reason (unique constraint violation)
Constraint name (email_key)
The field in question (email)
The actual email value (thisuser@gmail.com)
The only issue is that this context is contained in a single string (error.message) on the database span. The span lacks dedicated tags for the items mentioned above. That's really useful context for understanding the application behaviour, and such span tags would enable quick assessment of the endpoint degradation. Is the API degradation dominated by, or exclusively attributable to, this error? Is it always the same email, a small set of problem ones, or is it equally distributed across different users? Is it even the same constraint being violated every time, or a mixture of different ones?
I'd love to have these span tags to dig deeper into these issues. How about something like this? A somewhat predictable schema navigable by tools with less expressive power. cough cough woof woof
{
  "error": {
    "error_type": "integrity-error",
    "error_subtype": "unique-constraint-violation",
    "error_code": "INT002",
    "error_details": {
      "constraint": "email_key",
      "field": "email",
      "value": "thisuser@gmail.com"
    }
  }
}
Can someone explain EXPLAIN? 😖
Dear Editor, this rant would not be complete without a request to make EXPLAIN more accessible. EXPLAIN in Postgres is pretty informative... especially when you have ChatGPT ready to translate the cryptic syntax you don't remember. I've consistently found it too cryptic for my infrequent usage; I simply don't use it often enough to keep the working context in my head. It's really inaccessible... not to mention, not every company is set up for everyone to run EXPLAIN. And yes... I know it's also possible to automatically log EXPLAIN output for slow queries (Postgres's auto_explain, for instance)... that doesn't change my rant.
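For the record, that auto-logging path looks roughly like this in Postgres. This is a session-level sketch; in production you'd typically configure the module in postgresql.conf instead. It helps, but the plans still land in the database logs, far away from my traces.

-- load the auto_explain module for this session (illustrative; normally configured server-wide)
LOAD 'auto_explain';
-- log the plan of any statement slower than 200ms
SET auto_explain.log_min_duration = '200ms';
-- include actual runtimes and row counts in the logged plan
SET auto_explain.log_analyze = true;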
Having this information in our telemetry tool would be so much more useful! If the information from the EXPLAIN could be embedded as span tags instead, it'd really increase the accessibility. Who is going to remember what cost, rows and width mean? What does Rows removed by filter mean? A thoughtful list of span tags capturing the information in an EXPLAIN query could really accelerate database investigations and grow the number of folks who can jump in.
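Something like this is what I have in mind: a hypothetical translation of a few EXPLAIN fields into plainly named span tags (the names below are mine, not an existing convention).

{
  "db": {
    "plan": {
      "scan_type": "index-scan",
      "index_used": "product_name_category_idx",
      "estimated_cost": 8.44,
      "estimated_rows": 3,
      "actual_rows": 3,
      "rows_removed_by_filter": 120
    }
  }
}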
Painful performance investigations
I've previously used AWS RDS Insights sporadically to investigate database CPU spikes. Yes, yes, yes... you're asking why I'm chasing down CPU spikes instead of user-facing degradation. I've since learnt that symptom-based monitoring is better. However, bear with me, I'm going somewhere! Back then... I was approaching the problem from a different angle. The database CPU is spiking... what's causing it to spike? I didn't have a particular use case in mind to limit the scope... I wanted to understand why we were getting these spikes.
PS: If it helps… 🪄🎩 let's imagine there was some pronounced and unacceptable user impact.
RDS Insights provides several high-level metrics. Here are some examples:
Innodb_rows_read.avg – total rows read from InnoDB tables
db.sql.select_scan.avg – the number of joins that did a full scan of the first table.
db.sql.sort_scan.avg – the number of sorts done by scanning the table.
A CPU spike I saw aligned with a spike in the Innodb_rows_read metric -- it told me the database was reading 250k rows when it normally read ~20k rows. Interesting! I broke the metric down by DB activity and saw a few suboptimal SQL statements. Great, now I had a short list of suboptimal queries... right? The only issue was... these queries all get used in different use cases, are passed different arguments, and are called a varying number of times in each use case. An ORM abstraction adds further complexity -- you're typically interacting with your ORM, not writing SQL queries directly, so you've got to find the ORM syntax that generates said SQL. Our system is 10+ years old and has a really wide surface area to cover. Linking a query in RDS Insights back to our application code was a cumbersome and difficult task.
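For what it's worth, the closest manual approximation I know of on the MySQL side is digging through performance_schema's statement digests, roughly like this (assuming performance_schema is enabled):

-- rank normalized statements by how many rows they examine
SELECT digest_text,
       count_star        AS calls,
       sum_rows_examined AS rows_examined,
       sum_rows_sent     AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_rows_examined DESC
LIMIT 10;

Even then, the digest is a normalized SQL string. Mapping it back to the ORM call site and the endpoint that triggered it is exactly the part a trace could do for me.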
But... Dear Editor! What if all that data existed on the trace instead? What if I could easily see query-level attributes inside DataDog? Was a full table scan performed? How many sorts are done in the query? How does the database break down its work (e.g. through additional spans)? How much time was spent actually crunching the query vs sending the data back? How much CPU was utilized for the query? I can see how this could really empower a team looking to optimize database performance overall. It'd also be really helpful in incidents when there's an unexplained spike.
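Again, purely as a strawman, here's the kind of per-query work breakdown I'm imagining as span attributes (every name below is invented for illustration):

{
  "db": {
    "statement": "SELECT ... FROM product WHERE ...",
    "work": {
      "full_table_scan": false,
      "sort_operations": 1,
      "rows_examined": 248312,
      "rows_returned": 240,
      "execution_time_ms": 312,
      "network_time_ms": 44,
      "cpu_time_ms": 190
    }
  }
}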
Conclusion
And thus ends my plea... I'm calling for better database observability as a necessary step towards more efficient, intuitive debugging. I've recounted the pain of being locked out, the struggles with index feedback loops, and the complexity of performance investigations. We need more detailed spans that synthesize the tool silos into a formidable, coherent base for investigations. If database vendors and telemetry providers can rise to meet these needs, we'll be far better equipped to maintain optimal system performance, reduce troubleshooting time, and foster a deeper understanding of our databases. It's time we push for more, or for someone to bring me the enlightenment I've been missing all along!
If you know better, let’s talk… I’d really love to improve this!
Thanks for stopping by and reading this piece. I'd love to hear about your experiences and how you've navigated them.