Several years back, a new fellow on a ride with my bike club had a rather serious crash. I’ll spare you the gory details here[1], other than to say that he was very lucky we had two Wilderness First Responders, a nurse, and an Army medic with us. The experience made quite an impression on me, and I got a Wilderness First Aid certification a few months after that.
The way the NOLS courses work is you have lectures about, e.g., “how to splint a broken arm with found materials”, then you divvy up into pairs or small groups for practice. One person is the “victim” and the other has to fix them. The first time I took the class, I “killed” most of my patients. I did much better the next few times, and have had several occasions to be grateful for these skills.
Over the past year, I’ve been thinking a lot about how training like this applies to our everyday work as IT specialists. A lot of us inherit systems that may not be documented, or maybe the backup’s not actually running, etc. And then disaster strikes, and we think, “well, I know *in theory* what I’m supposed to do”. Good luck!
My point is, we need to practice this stuff. Preferably in a low-pressure environment where it’s ok to make mistakes. This is right up there with the concept that “you don’t have a backup unless you’ve successfully restored it recently”. Do many places encourage (or even allow) you to practice restores, though? I know of a few that do, but for the most part people are pretty silent about this and I suspect it’s because nobody’s doing “what’s right”.
This was my reasoning behind starting a lab series with PDXPUG. Streaming Rep seemed like a good candidate for the first lab; several PDXPUG members expressed interest, and I personally was feeling a bit rusty because I haven’t had to actually configure it since 9.1.
My goals were:
- create a checklist of steps to follow to set it up. This is one of those things you probably won’t have to do very often, but why not make it as streamlined as possible?
- figure out if it’s actually working correctly
- practice a failover, practice restoring a standby, etc.
The group recap is here.
The checklist is up on github; the group found a few mistakes in my original.
Additional cool tidbits I learned:
wal_level has to be hot_standby on the master, or the standby won’t start up. (Postgres is pretty good about letting you know exactly what the problem is, though.) I was unable to get around this without completely reimaging the master; I would like to dig into that further and see if there’s a shortcut.
SELECT * FROM pg_stat_replication; shows you the current rep status. Combine this with \watch for extra entertainment. Available on the master, and on standbys if you’re cascading. (Hmm, setting up cascading rep might be another fun lab…)
pg_xlogfile_name translates a transaction log location (such as the ones shown in pg_stat_replication) into a WAL file name. It’s only available on the master.
\df *log* …somehow, all these years I have missed that you can pass globs to the \ commands.
You can actually both ship and stream WALs at the same time, but be prepared for 2X the network load.
CREATE USER applies login privs; CREATE ROLE does not.
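For the curious, here is a rough Python sketch of the arithmetic behind two of the tidbits above: turning a log location into a WAL file name (roughly what pg_xlogfile_name does) and working out how far a standby is behind from two locations in pg_stat_replication. The function names are made up for illustration, and it assumes the default 16 MB segment size and timeline 1. One known difference: for a location that sits exactly on a segment boundary, the real function reports the preceding file, which this sketch does not try to reproduce.

```python
import re

# Assumption: the default WAL segment size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn):
    """Turn a location such as '0/16B3740' into a single 64-bit integer."""
    match = re.fullmatch(r"([0-9A-Fa-f]{1,8})/([0-9A-Fa-f]{1,8})", lsn)
    if match is None:
        raise ValueError("not a log location: %r" % (lsn,))
    high, low = (int(part, 16) for part in match.groups())
    return (high << 32) | low


def wal_file_name(lsn, timeline=1):
    """Name of the WAL segment file that contains the given location.

    Hypothetical helper; it ignores the boundary case described above.
    """
    segment_number = parse_lsn(lsn) // WAL_SEGMENT_SIZE
    # 256 segments per "log id" when segments are 16 MB
    segments_per_log_id = 0x100000000 // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_log_id,
        segment_number % segments_per_log_id,
    )


def lag_bytes(sent_location, replay_location):
    """Bytes the master has sent that the standby has not replayed yet."""
    return parse_lsn(sent_location) - parse_lsn(replay_location)


print(wal_file_name("0/16B3740"))           # 000000010000000000000001
print(lag_bytes("0/3000060", "0/3000000"))  # 96
```

The two arguments to lag_bytes correspond to the sent_location and replay_location columns that pg_stat_replication reports for each standby on the 9.x releases.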
Items that we’ll be looking at during Streaming Rep part 2[2]:
- what is the “right” # for max_wal_senders – is it just the number of standbys? Any reason you’d want to have more?
- what causes that “FATAL: database is starting up” problem, and how do you fix it?
- how the heck do we monitor this thing, and what is a reason for alarm?
- the “human-readable” WAL transform
Yes, there will be more labs. Watch the PDXPUG blog for announcements. If you’re in the Portland area, please come join us.
[1] – I will tell them over beer, though.
[2] – Thursday Feb 27, sign up here