Archive for June, 2014

13 June, 2014

Postgres Monitoring Wishlist

by gorthx

Since “what should I monitor in my database” has come up in conversation several times lately, I thought I’d put this here where I (theoretically) won’t lose it. I’ll save for later the discussion of where to get this info and which tools give me which stats :)

Bare minimum:
server CPU, memory, I/O, network usage, and all “slow” queries logged.

More extensive:
System stats:
CPU usage, per-proc if available
Memory usage, including swap
disk usage (in terms of space – pay special attention to database partitions)
disk I/O
disk busy
network stats, including errors (if you have a Cisco network & are friends with the network team, netflow data is cool to have)

If I could have everything I wanted: everything from vmstat, plus the extended stats from iostat (iostat -x)

Pg stats:
number of connections
idle transactions
commits vs rollbacks
checkpoint frequency
database size
table size (plus bloat, if we can find a good query for it)
index size (same)

If I could have everything I wanted:
everything from pg_stat_database, pg_stat_activity, pg_stat_user_*, pg_stat_sys_*, and pg_statio_*
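For the curious, here is a rough sketch of how a few of the Pg stats above could be pulled straight from the statistics views. It's an illustration, not a recommendation of any particular tool: the `QUERIES` dict and the `collect()` and `rollback_ratio()` helpers are names I made up, the `state` column assumes Postgres 9.2 or later, and the checkpoint counters live in pg_stat_bgwriter only on versions before 17.

```python
# Hypothetical helper names; the views and columns are standard Postgres ones.
QUERIES = {
    # connections, commits, and rollbacks per database
    "database_activity": """
        SELECT datname, numbackends, xact_commit, xact_rollback
        FROM pg_stat_database
    """,
    # sessions sitting idle inside an open transaction (state column: 9.2+)
    "idle_in_transaction": """
        SELECT count(*)
        FROM pg_stat_activity
        WHERE state = 'idle in transaction'
    """,
    # timed vs. requested checkpoints (pg_stat_bgwriter: versions before 17)
    "checkpoints": """
        SELECT checkpoints_timed, checkpoints_req
        FROM pg_stat_bgwriter
    """,
    "database_size": """
        SELECT datname, pg_database_size(datname) AS size_bytes
        FROM pg_database
    """,
    # table and index sizes for the 20 largest user tables (no bloat estimate)
    "largest_tables": """
        SELECT relname,
               pg_table_size(relid) AS table_bytes,
               pg_indexes_size(relid) AS index_bytes
        FROM pg_stat_user_tables
        ORDER BY pg_total_relation_size(relid) DESC
        LIMIT 20
    """,
}


def collect(cursor):
    """Run every query on a DB-API cursor and return {query name: rows}."""
    results = {}
    for name, sql in QUERIES.items():
        cursor.execute(sql)
        results[name] = cursor.fetchall()
    return results


def rollback_ratio(xact_commit, xact_rollback):
    """Fraction of finished transactions that rolled back (0.0 if none ran)."""
    total = xact_commit + xact_rollback
    return xact_rollback / total if total else 0.0
```

The counters in pg_stat_database are cumulative, so for graphing you would sample them on a schedule and feed the difference between two samples into `rollback_ratio()`, not the raw totals.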

Activity logs configured as outlined here.

Then there’s a whole class of things that fall under “How long does it take to…”: do a backup, restore a backup, etc.
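A minimal sketch of the "how long does it take to…" idea: wrap the command in a timer and ship the number to whatever collects your metrics. `timed_run()` is a made-up helper name, and where the result goes is left up to you.

```python
import subprocess
import time


def timed_run(cmd):
    """Run cmd (a list of arguments) and return its wall-clock time in seconds.

    Raises subprocess.CalledProcessError if the command exits non-zero,
    so a failed backup is never recorded as a fast one.
    """
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start
```

For a backup, the call might look like `timed_run(["pg_dump", "-Fc", "-f", "backup.dump", "mydb"])`, where `mydb` and `backup.dump` are placeholder names. Timing the restore the same way is the part people tend to skip.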

6 June, 2014

autovacuum: long naps aren’t better

by gorthx

There’s that saying about “the first time’s an accident, the second’s a coincidence, the third is a pattern”. It’s probably because I’ve been studying Postgres’s autovacuum feature so much lately and these things stand out to me now, but I’ve noticed a really intriguing pattern (n>5) over the past month or so: folks with their autovacuum_naptime set too dang high.

The autovacuum_naptime GUC is the amount of time the daemon waits between checks for tables that need vacuuming and analyzing (or have xids that need freezing). If you set this to, for example, one day [1], each table in your database gets checked only once a day. It’s certainly possible that this will work OK for your use case. What tends to happen, though, is that you end up with several tables that need maintenance, but the long naptime doesn’t allow that to happen in a timely fashion. And of course, the longer you go between vacuuming/analyzing/freezing them, the longer it takes to get that job done. Soon you’re in a downward spiral, and maybe your database ends up taking a dirt nap, at least from the viewpoint of your users, who are all saying “OMG, why is this so sloooooow?” Then you conclude “autovacuum sucks,” disable it, and try to stay on top of things with cron jobs etc., like back in the Bad Old Days When We Didn’t Have Nice Things.

So, while the autovac config params are all interdependent to some extent [2], this is kind of a big one and IME you don’t want to increase the naptime beyond the already-conservative default of 1 minute.
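To put rough numbers on it, here is a back-of-the-envelope sketch. It uses the documented trigger for a vacuum (autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times the table's row count, with the stock defaults of 50 and 0.2). Everything else is an assumption for illustration: the table size, the steady rate of dead tuples, and the simplification that a table crosses the threshold right after a check and then waits one full naptime. It ignores worker limits and cost-based delays entirely.

```python
def vacuum_threshold(reltuples, base_threshold=50, scale_factor=0.2):
    """Dead tuples needed before autovacuum considers a table for vacuum."""
    return base_threshold + scale_factor * reltuples


def worst_case_dead_tuples(reltuples, dead_per_second, naptime_seconds):
    """Dead tuples at the next check if the threshold was crossed just after
    the previous one (simplified model, assumes a constant dead-tuple rate)."""
    return vacuum_threshold(reltuples) + dead_per_second * naptime_seconds


# Hypothetical table: 1 million rows, 10 dead tuples generated per second.
for label, naptime in (("1 minute", 60), ("1 day", 86400)):
    backlog = worst_case_dead_tuples(1_000_000, 10, naptime)
    print(f"naptime {label}: up to {backlog:,.0f} dead tuples before the next check")
```

The point of the comparison is only the shape: with a short nap the backlog stays close to the threshold, while with a day-long nap it is dominated by whatever piled up while the daemon was asleep.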

1 – Yes, really. Not trying to embarrass anyone here! Rest assured you are not alone.
2 – I was going to say that individual GUCs don’t operate in a vacuum, but figured that might be taking things Just Too Far.