That time PostgreSQL said "no thanks, I don't need your index"

Explore our journey of optimizing PostgreSQL queries: from stubborn index rejections to achieving a 70x speed improvement with strategic adjustments

Introduction

Today I have an interesting tale of victory after PostgreSQL rejected our attempts to optimize a query. We had been watching a query grow in latency. We analyzed it and determined that the best course of action was to add an index. PostgreSQL disagreed, and outright refused to use the index. What follows is the story of the negotiation it took to push past that blunt rejection.

Up and to the right...

Below is the pattern one of our on-calls picked up: the p95 latency for one of our critical endpoints, called InitializeBusiness. The trend started some time in February, after a big launch, and has been growing since.

We did a bit of digging and found the query responsible for the latency trend. The query targeted our subscription table, which holds both paid and internal subscriptions. In total, there are about 14 million rows in the table. It has a few columns of interest:

  • plan_id - a foreign key to the plan the subscription is for. We have a premium plan called PRO, and we have a STARTER plan which is the default, free plan.

  • status - this indicates whether the subscription is active or cancelled.

  • business_id - this is a UUID representing the business the user bought the subscription for.

We got some variables from one of our traces in DataDog, and then proceeded to inquire about how Postgres was running our query.

Our query looked something like this...

SELECT * FROM subscription
WHERE (
    subscription.business_id = '...'
    AND subscription.plan_id = '...'
    AND subscription.status = 0
)
LIMIT 1

You can see from the query that we're targeting the subscription table and filtering on the columns mentioned earlier: business_id, plan_id and status.

Explain yourself Postgres

Our next step was to interrogate Postgres to understand how it was executing our query. Postgres has a query planner that determines how a query will run. Will it scan all the rows in the table? Will it join tables this way, or that way? Will it use an index? Prefixing a query with EXPLAIN ANALYZE prompts Postgres to run the query and explain both the plan it picked and how the execution actually went.

So we ran EXPLAIN ANALYZE <query>. The result looked something like this...

Limit  (cost=0.56..4.58 rows=1 width=94) (actual time=56.301..56.302 rows=1 loops=1)
   ->  Index Scan using subscription_plan_id on subscription  (cost=0.56..4.58 rows=1 width=94) (actual time=56.301..56.301 rows=1
 loops=1)
         Index Cond: (plan_id = '987aa064-129b-40e6-aa1e-133985a3c31e'::uuid)
         Filter: ((business_id = 'bb97897b-035f-4bdb-af5e-79cec295dba1'::uuid) AND (status = 0))
         Rows Removed by Filter: 228731
 Planning Time: 0.535 ms
 Execution Time: 56.340 ms

From the output, we could see Postgres was explaining a few things to us:

  • The estimated cost to run the query is pretty low. The cost=0.56..4.58 portion indicates that. Cost is an arbitrary unit meant to estimate the CPU and I/O work involved: 0.56 is the estimated cost to return the first row, while 4.58 is the estimated cost to return all the rows. Looks good.

  • The actual time=56.301..56.302 portion reports, in milliseconds, when the node produced its first row and when it finished. So Postgres spent ~56 ms before it could return even the first row, which lines up with the Execution Time: 56.340 ms at the bottom, and with the latency we'd been chasing. Not so good.

  • It was using an index to run the query, an index called subscription_plan_id. This is the index on the subscription.plan_id field. Well... this is a little suspect. The subscription table has 14M rows and we have ~4 plans, so a single plan_id can still match millions of rows. If the database is using plan_id to cut down the number of rows it has to search, it's going to have to do a lot of work!

  • Finally, Postgres is using the index to narrow by plan_id, then applying a filter on business_id and status. The Rows Removed by Filter: 228731 line means that after using the index, Postgres still had to examine and discard an additional 228K rows!

Well, we can help Postgres out! If we add an index on the fields we're targeting in our query (business_id, plan_id and status), that should take care of it... right?
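The migration itself boils down to a single statement. Here's a sketch of what it looked like (the CONCURRENTLY flag is my assumption here; it's the usual way to avoid blocking writes while an index builds on a 14M-row table):

-- Composite index covering the three columns our query filters on.
-- CONCURRENTLY builds the index without blocking writes, though it
-- can't run inside a transaction block.
CREATE INDEX CONCURRENTLY business_id_plan_id_and_status
    ON subscription (business_id, plan_id, status);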

Postgres says no

We eagerly got a PR up to add the migration and merged it as fast as we could! We opened up DataDog, awaiting the sweet relief of a nosedive in the latency curve. Unfortunately... the graph still looked kinda like this...

[Image: Usain Bolt striking his victory pose, pointing up and to the right]

So... we went in, ran explain again... and got this.

Limit  (cost=0.56..4.58 rows=1 width=94) (actual time=56.301..56.302 rows=1 loops=1)
   ->  Index Scan using subscription_plan_id on subscription  (cost=0.56..4.58 rows=1 width=94) (actual time=56.301..56.301 rows=1
 loops=1)
         Index Cond: (plan_id = '987aa064-129b-40e6-aa1e-133985a3c31e'::uuid)
         Filter: ((business_id = 'bb97897b-035f-4bdb-af5e-79cec295dba1'::uuid) AND (status = 0))
         Rows Removed by Filter: 228731
 Planning Time: 0.535 ms
 Execution Time: 56.340 ms

๐Ÿคฆ๐Ÿพโ€โ™‚๏ธ Don't bother to re-read it. I copied it from above. It's the same result. Even though we added a new index with just the fields Postgres needed to execute the query, Postgres still proceeded with its previous plan. Now we needed to figure out why Postgres wasn't using the index, and how we get it to use it.

Let the negotiations begin...

The first thing I learned was that Postgres will not be forced. There's no way to directly push Postgres to use a particular index or plan. Unlike MySQL, Postgres doesn't support hints or any other mechanism for telling it which index to use. Instead, it has coarse knobs for toggling the strategies available for executing queries. In their own words...

These configuration parameters provide a crude method of influencing the query plans chosen by the query optimizer.

Query Planning, PostgreSQL Documentation

One example of such a knob is enable_indexscan, which toggles whether the planner will consider index-scan plans at all. There's also an enable_seqscan option which, when disabled, strongly discourages sequential scans (reading every row in the table). So, Postgres does provide some configuration, but nothing that lets you name the exact index it should use.
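These knobs can be flipped per session, which makes them handy for experimenting with EXPLAIN. A quick sketch (nothing we shipped, just a way to see what the planner does when a strategy is taken away):

-- Discourage sequential scans for this session only. The planner
-- treats a seq scan as prohibitively expensive, so it will pick an
-- index if one is usable at all.
SET enable_seqscan = off;

EXPLAIN SELECT * FROM subscription WHERE business_id = '...';

-- Restore the default once you're done experimenting.
RESET enable_seqscan;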

Next, we tried a few suggestions from an article one of our Staff Engineers shared with us.

One reason Postgres might not use your index is that it can't. This can happen if you're wrapping a column in a function, or if there's a datatype mismatch. Wrapping a column in a function (e.g. WHERE lower(plan_id) = '...') can prompt Postgres to skip the index, because the index stores raw column values, not the function's output; you'd need an index on that exact expression to get Postgres to use one. We weren't using any functions, so we ruled that out. A datatype mismatch between the index contents and your query (e.g. comparing against a string, WHERE status = '1', when the index holds integer statuses) can have the same effect. We checked our datatypes and ruled that out as well. So, our problem wasn't caused by functions or datatype mismatches.
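To make the function case concrete, here's a sketch using a hypothetical text column (call it customer_email; our table doesn't actually have one):

-- Hypothetical: a plain index on customer_email can't serve
--   WHERE lower(customer_email) = 'alice@example.com'
-- because the index stores raw values, not lower(...) results.
-- An expression index that matches the function call can:
CREATE INDEX subscription_customer_email_lower
    ON subscription (lower(customer_email));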

The other reason Postgres may not use an index is that it incorrectly thinks the index isn't fast enough. The article highlighted that Postgres won't use your index if it believes there's a faster alternative. The planner considers multiple candidate plans, estimates the "cost" of each, and picks the one it thinks is cheapest. So Postgres might weigh a sequential scan (looping over all the records in a table) against a search through an index, and estimate the "cost" of each. One key point, though: the planner doesn't count rows in real time, it relies on stored statistics. And those estimates could be wrong! Let's maybe try updating them?
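Before touching anything, you can peek at the numbers the planner is working from. The pg_class and pg_stats catalogs are the standard places these estimates live; something like:

-- The planner's row-count estimate for the table (not a live count).
SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'subscription';

-- Per-column statistics: how many distinct values Postgres thinks
-- each column has, and which values it thinks are most common.
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'subscription'
  AND attname IN ('business_id', 'plan_id', 'status');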

We found out how to update Postgres' estimates, but had concerns about running the command in production. Postgres has a command, ANALYZE <table name>, that refreshes the statistics it uses. We weren't entirely sure how expensive it was, so we ran it in our staging environments first. That gave us mixed signals. We ran ANALYZE subscription in four different environments. Two of them continued to disregard our index. The other two started using it! That was encouraging, and also quite puzzling. Sooo... this left us with only one option. Let's run it in prod! YOLO!
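(For what it's worth, the command itself is a one-liner. ANALYZE reads a random sample of the table rather than every row, which is why it's fast even on large tables; VERBOSE makes it report what it sampled.)

-- Refresh the planner's statistics for one table. VERBOSE prints
-- progress messages, including how many rows were sampled.
ANALYZE VERBOSE subscription;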

We booked some time to run the command in production and, well... we were in for quite a surprise! Fast forward a day... we all jumped on a Zoom, held hands, and stared intently as the ceremony started. We watched feverishly as my colleague typed the longest two words I've ever encountered... ANALYZE ... subscription. No sooner had my colleague hit ENTER than the command returned. I was baffled! If the command ran that quickly, there's no way it did anything, I quipped. I wasn't very hopeful that anything had changed, and started thinking about other plans we could consider... but let's still run EXPLAIN to see what Postgres is doing now...

Limit  (cost=0.43..8.45 rows=1 width=94) (actual time=0.039..0.039 rows=0 loops=1)
   ->  Index Scan using business_id_plan_id_and_status on subscription  (cost=0.43..8.45 rows=1 width=94) (actual time=0.038..0.038 rows=0 
loops=1)
        Index Cond: (
         (business_id = 'cb97897b-045f-4bdb-af5e-79cbc295dba1'::uuid) 
         AND (plan_id = '987aa064-129b-40e6-aa1e-133985a3c51f'::uuid)        
         AND (status = 1))
 Planning Time: 0.787 ms
 Execution Time: 0.085 ms

Hmm... something looks different. Oh wait, would you look at that! Now it's using an index called business_id_plan_id_and_status (we can tell from the second line Index Scan using business_id_plan_id_and_status on subscription). Ahhmmm... that's our index! Postgres is using our index! Finally!

Well... that should also mean that the query latency should have decreased... let's check... 👀

๐Ÿ˜ Where's that image of the little guy saying yes...

[Image: the "Success Kid" meme]

Yess... Postgres was now using our index, because we had updated its estimates for the subscription table. As a result, the query is now much faster... 70x faster, as a matter of fact. We won! We got Postgres to use our index!

If you're wondering whether Postgres should be keeping its estimates up to date on its own, it should! Postgres automatically re-analyzes a table after a certain number of rows have been modified, and we simply never hit that threshold, so the estimates for our table went stale. We've since started looking into tweaking those thresholds. Oh, and the reason Postgres chose different plans across our staging environments was the different table sizes, or more precisely, the different table estimates.
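For the curious: with stock settings, autovacuum re-analyzes a table once autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * rows have been modified, i.e. 50 + 10% of the table, which works out to roughly 1.4 million changed rows for our 14M-row table. You can check how far behind a table is, and lower the threshold per table; a sketch:

-- How stale are the stats? n_mod_since_analyze counts row changes
-- since the table was last analyzed (manually or by autovacuum).
SELECT last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'subscription';

-- Re-analyze after ~1% of rows change instead of the default 10%.
ALTER TABLE subscription SET (autovacuum_analyze_scale_factor = 0.01);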

Well... that's a wrap!

Conclusion

Despite Postgres' blatant initial rejection, we fortified our resolve and achieved a significant reduction in query latency. Postgres stubbornly adhered to its original execution plan even after we introduced a seemingly suitable index. After exploring some of the deeper intricacies of PostgreSQL's query planning, we suspected Postgres was working from faulty insights. Running ANALYZE refreshed the database's stale estimates and ultimately allowed Postgres to recognize... we were right, and it needed to use our index.

I enjoyed writing this, and I hope you enjoyed reading it too! Have you waged similar battles with your tools? I'd love to hear about it! 👇🏾 Let's hear it in the comments!
