Every slow query situation has its own quirks. Here’s my generic template for troubleshooting them.
First, I start with the holy trinity of IT Troubleshooting Questions:
- What are you trying to do?
- What data do you want?
- What is the end result you want?
- How does your result differ from what you expected?
- “I thought I could do a 7-way join on unindexed fields in tables with over 3 million records each in less than 2 seconds. I’m pretty disappointed with Postgres’s performance.”
- What changed?
- Has it always been this slow?
- Did this happen suddenly, or did it slowly degrade?
- If suddenly, what changed? (“This started right after our last deploy.”)
Next, I collect as much as I can of the following before I do any further work:
The query, preferably from the database logs or pg_stat_statements so I know what’s *actually* being run. I’ve learned that when using an ORM, the query that’s in the code sometimes bears little resemblance to the end result.
DDL (table definitions) for all tables involved field names, datatypes, indexes, index types
Information about the dataset: approximate number of records (pg_stat_user_tables.n_live_tup and pg_class.reltuples) and distribution of values (see the pg_stats view)
System specs + cluster configuration, plus any per-database or per-table settings.
Metrics to support the “slow” assertion. I’m not saying that user perception of slowness isn’t a valuable datapoint, just that I need actual measurements in addition to subjective values in order to determine if my solution works.
Database activity logs and/or values from pg_stat_statements. It’s better to have these configured before you start having problems1. I find both to be useful; pg_stat_statements is great for finding average timings etc, but it’s the activity log that shows I’m having slowness issues at certain times of the day, or for certain versions of the query (example: an IN clause with 980 values), or what have you.
Application logs, if I have them and they include results/errors/timings or other potentially useful information.
pg_stat_user_tables metrics for all tables involved. This helps me determine if VACUUM/ANALYZE is keeping up with table churn.
What’s the EXPLAIN plan? How about EXPLAIN (ANALYZE, BUFFERS)? Remember EXPLAIN ANALYZE actually runs the query, so act accordingly.
Additional useful questions:
How do query run times compare measured locally vs user perception? Example: the query runs in 2ms locally, but my end user doesn’t see data for half a minute. As a DBA, there is not a lot I can do about this one.
Is it actually the query that’s slow? Example: `SELECT * FROM blah`, where =’blah’ is a 500k record table, then the app code filters that down to 2 records. That may be part of the problem. Let the database do what the database is good at: sorting and limiting results to specific records within the database can dramatically improve performance.
How many times is this query run per request? I had a situation recently where it appeared that a very small and usually fast query was suddenly running very slowly. It turned out the code had changed and instead of running the query just once, it was running an average of 1000 times before the app returned a result to the user.
Does this query call any functions? How many times are the functions executed? Do they call other functions2
Then it’s time to dig in and fix stuff!
These steps are numbered, but aren’t necessarily in order. If EXPLAIN shows sequential scans, I’ll start by adding an index on the field involved if it makes sense to do so. (Good index candidates is a whole ‘nother discussion3.)
I test before and after every change and compare the results with what I collected earlier, especially around timings and system load.
- Tune the query
Are there any obvious culprits, e.g. CTEs4, lots of JOINs, LIKEs of this format: ‘%so cute%’ (check your index types)?
- Tune the DDL
Am I missing indexes? Do I need to normalize or :gasp: denormalize?
- Tune the database config
Do I have enough memory to do what I want? Is autovacuum keeping up with table churn?
I learn something new every time I do this; I hope you do, too!
1 – See https://gorthx.wordpress.com/2013/07/05/postgresql-conf-rudimentary-log-settings-annotated/
2 – Useful tool: PL Profiler
3 – I can’t recommend this book enough: Refactoring SQL Applications