3 November, 2014

PgConf.EU recap

by gorthx

I’m safely home from PgConf.EU. Madrid at this time of year was glorious, particularly to this Portlander. (I came home to a steady 12*C and rainy for the next week or … so ;))

We had over 300 attendees, making this the biggest Postgres conference to date, I hear. Of course, I couldn’t get to every talk I wanted (does that ever happen?), but here are some highlights:

Performance Archaeology was a thorough review of how Postgres performance has improved (or not) from version to version. I’m a sucker for benchmarks, and it makes me very happy that Tomas Vondra did this work :)

“Who’s the Fairest of Them All? Pg Interface Performance Comparison” was good from an informational standpoint (ODBC pretty much sucks) but also from a test design standpoint (hey, a valid use case for a Cartesian join!) Most relevant tip for me: complaints about db performance usually turn out to be caused by running queries returning one row at a time, one connection each – and usually from an ORM.

The demo of 3D rendering from Vincent Picavet’s “PostGIS Latest News” looked very promising. There’s a docker container available on his github; make sure you follow the setup instructions. I’m also excited about SP-GiST, spatial GiST indexes, which will provide faster reads and is 3X faster to build. It’s a WIP, and so far it only works on points.

XoF’s talk on “Finding and Repairing Data Corruption” covered some case histories from PgExperts. You all know I like the “war stories”; one thing I like especially about XoF’s talks is he includes “oh yeah, btw, don’t do [x] to try to fix this because you’ll make things worse”. Additional recommendation: disable autovacuum while you’re debugging corruption, because you don’t want it kicking off & changing things.

As usual, Simon and Alvaro packed a ton of info into “Locks Unpicked”. The most immediately useful tip for me was how to avoid the ACCESS EXCLUSIVE lock when adding an FK; do it in two steps: 1. ALTER TABLE [blahdeblah] NOT VALID 2. ALTER TABLE [blahdeblah] VALIDATE CONSTRAINT.

I didn’t get to attend Dimitri’s “You Better Have Tested Backups”, but I was in on a rehearsal. My reaction can be summed up with “You had to do what?!” If this talk didn’t scare you, I don’t know what will.

Craig Ringer’s talk about usability started with a round of “Error Message Jeopardy”, and included a reminder that we were all new once, and have forgotten how much we know. I personally accidentally tried to run psql on a -FC pg_dump just last week, and really appreciate the addition of the HINT message! I also hadn’t heard about the update problems on Yosemite.

Stephen Frost’s “Hacking Postgres” was one of my favorites. We got a tour of the source tree, backend components, and some background about the community coding conventions. (“Programming in Postgres may not always be standard C.”)
General advice:
- check the mailing list for people working on similar problems
- create your patch as a context diff or git –diff
- read your actual patch before you submit it, just in case you did something dorky.

Don’t forget to post your slides & leave conference and speaker feedback.

It was wonderful meeting people I’ve only known from the mailing lists & IRC. (For example.) Thanks very much to the PgConf.EU and SPI for helping me out! I hope to see you next year in Vienna.

6 October, 2014

RDS: Three weeks in

by gorthx

I’ve spent the past few weeks learning my way around Amazon’s RDS offering (specifically Postgres, and a bit of elasticache). It’s a mixed bag so far; for every feature I think “Hey, this is neat!” I find at least one or two others that are not so thrilling.

One of the things that may annoy you if you’re used to running your own Pg server is not having “real” superuser access to your own cluster. There’s an rdsadmin database which you won’t have access to, and a couple of rds users as well. This is part of Amazon’s security implementation to protect all their users from destroying each other.

You need to exclude the rdsadmin database out of any management queries you run, like so:
SELECT datname, pg_size_pretty(pg_database_size(datname))
FROM pg_database
WHERE datname NOT IN ('postgres','template0','template1', 'rdsadmin')
ORDER BY datname;

Otherwise you’ll get a permissions error.

Today, I had this bit of fun:
pg_dumpall --globals-only
ERROR: permission denied for relation pg_authid

So no dumping your roles. (No pg_dumpall, period.) I don’t have high hopes for an alternative.

I also haven’t found a place to report bugs – the “Feedback” button only allows you to contact tech support if you have a contract. So far I’m relying on the “I know someone who knows someone” method and the RDS user forums; AWS team members respond fairly quickly to posts there.

Aside from all that, I’m primarily interested in automating my instance management, so I’ve focused on the CLI tools.

My first thought was “Egads, I can’t get away from Java, can I”, but installing the toolkit turned out to be the easiest part. I set my $JAVA_HOME as outlined here to avoid that annoying “Unable to find $JAVA_HOME” error.

The CLI support is much more extensive than I expected – you can manage pretty much everything, very easily, once you find your way around and learn the appropriate incantation. Which is my biggest complaint: the docs on the web and the cli help don’t always match each other, and sometimes neither are correct. It cost me a significant amount of startup time flipping back and forth between the cli help, the web help, and just flat-out experimenting and cursing that mysterious “Malformed input” error message. (I probably have unrealistically high standards after working with the Pg docs for so many years.)

Fun stuff you can do:
RDS offers “event subscriptions” to help you keep tabs on your instance health (failover, storage, etc)1. They’re pretty easy to configure from the web console, but once you’ve done so, there’s no way to view or edit them except from the CLI. (At least, not that I can find.)

You can grab your Cloudwatch metrics, if you need to “roll your own” and integrate them into an existing monitoring system. (Yes, I briefly considered this!)

Log files are also accessible through the CLI for watching or downloading. It’s a huge improvement on the tiny green-text-on-black background on the web console. There’s no glob expression matching, so you have to request them one at a time. If you request a log file that doesn’t exist, you don’t get an error.

log_line_prefix is one of the unconfigurable GUCs on RDS, so if you are planning to use pgbadger 2, specify Amazon’s format as outlined at the bottom of this page.

1 – Be warned that restoring a snapshot will generate an “instance deleted” alert for your newly-restored instance. Your on-call person may not appreciate this.

2 – Alternatively, the pg_stat_statements extension is available on RDS, so you could get query stats this way.

29 September, 2014

My PgConf.EU Schedule

by gorthx

Yep, I’m headed to Madrid! I’ll be reprising my Autovacuum talk from SCALE, and am really looking forward to meeting some new folks. I’ll be helping out at the conference in some capacity, so come say hello.

For reference, the conference schedule is here:

Other talks I plan to attend:

Performance Archaeology sounds pretty cool!

Joe Conway’s Who’s the Fairest of Them All, since I didn’t get to catch it at PgOpen.

Open Postgres Monitoring

I’m interested in hearing Devrim’s opinions about Pg filesystems. (And pgpool, but that’s a discussion for the pub. ;) )

Next, a case study from Dimitri, this one on backups. (This sounds like one of those talks that will have people muttering “oh crap!!” and running out of the room.)

Dmitri again, on pgloader, because I always have data loading needs.

I want to see all three sessions in the next time slot (Logical decoding, PostGIS, and authentication), so I’ll wait until the day of to make up my mind.

Hmm Bruce’s indexing talk, or Christophe’s on Data Corruption?

I hope I never have to join 1 million tables.

Locks unpicked, Analytical Postgres, and of course the Lightning Talks will finish out the day.

Unit testing with PgTAP

Disaster Planning and Recovery

Logical decoding for auditing

Replication of a single database? Sign me up!

Saturday I plan to do touristy things: check out the park, a museum or two, and hopefully a fabric shop, before my flight out. If anyone has any recs, I’d love to hear them.

22 September, 2014

PgOpen 2014 – quick recap

by gorthx

Many thanks to the speakers, my fellow conference committee members, and especially our chair, Kris Pennella, for organizing the best PgOpen yet.

(Speakers: please upload your slides or a link to your slides to the wiki.)

I came back with a big to-do/to-try list: check out Catherine Devlin’s DDL generator, familiarize myself with the FILTER aggregates in 9.4, make a web interface to the PDXPUG talks db (on a tiny little heroku instance), re-do the examples from the PostGIS tutorial, etc. Plus apparently I have a tiny little patch to write (flw). Many thanks to Denish Patel of OmniTI and Will Leinweber of Heroku for the personalized help sessions.

All in all, it was a wonderful conference & I’m looking forward to 2015′s version. If you’re interested in being on next year’s committee, let us know at program2014 at postgresopen.org.

15 September, 2014


by gorthx

Last weekend we held the biggest PDXPUGDay we’ve had in a while! 5 speakers + a few lightning talks added up to a fun lineup. About 1/3 of the ~50 attendees were in town for FOSS4G; I think the guy from New Zealand will be holding the “visitor farthest from PDXPUG” for a good long while. Some folks from SEAPUG daytripped down (hi!) and we made plans for PDXPUG to road trip up there, probably for next year’s LinuxFestNW.

My highlights:
HSTORE, XML, JSON, and JSONB – David Wheeler
- Pg’s XML features are pretty neat, but I still think XML needs to DIAF. Perhaps that’s just my previous experience speaking.
- We renamed the HSTORE containment operator (@>) to “ice cream cone operator”, courtesy Mark Wong.
- Operations on JSON are slower than on HSTORE. That’s interesting.
- The storage overhead for JSONB is higher than for regular JSON, because it doesn’t compress very well. Josh B took an audience vote on improving compression at the expense of slowing down operations, and it was pretty evenly split.
- As usual, David included benchmarks and gave good overviews of when to use which data type.

Snapshotted Data Versioning – Eric Hanson
Eric gave a talk about this at PDXPUG last year and was showing an updated version of what Aquameta’s up to. Eric’s philosophy is “make everything data, and then make a UI for it”.
- Implemented FUSE for Pg, bidirectional, so you can change your data by making updates directly in the database or by editing a text file on the filesystem. I believe this was described as “perverse” by a certain audience member.

Data Near Here – Veronika Megler

- Another update to a previous PDXPUG talk
- Scientists report that they spend up to 80% of their time just finding data relevant to their research. Not collecting – locating previously saved data. What a time sink.
- Parsers for each data format have to be custom coded.

Portal Update – Kristin Tufte
- Another example of pulling data from many different sources in many “unique” formats!
- Current research on pedestrian counts uses the crosswalk buttons as a potential method to count pedestrians.
- I’d like to get ahold of the traffic light data, to see if the light at 32nd and Powell really is the longest light in Portland, or if that’s just my imagination.

AWS Faceoff (Cloud Shootout!) – Josh Berkus
I don’t care too much about Postgres on AWS – if I’m going to go that route, I’ll buy my own hardware, TYVM.
- RDS has a limited number of extensions installed, and PL/R isn’t one of them.* They did just add pg_stat_statements, which is cool. The Amazon support people are taking requests, and are attentive to the community, according to Josh. (I don’t have enough experience with that to have an opinion.)
- performance on RDS just isn’t that great; Josh got 325 TPS read/write, and 1430 TPS read-only.
- Then there was the cost comparison; RDS and Heroku don’t look that great compared to hosting it yourself, but you’d need to factor in the cost of support staff there.

Thanks for a great event!

* I decided to see for myself what extensions were available. Mark warned me “don’t shed too many tears for what they don’t have”. To my surprise, many of my favorites are available – pgperl, plpgsql, postgis, and tablefunc! (SO EXCITE MUCH PIVOT)

Check what’s available on your instance with this command:
SHOW rds.extensions;

Note that “SELECT * FROM pg_available_extensions ORDER BY name;” will show you a bunch of stuff that’s not necessarily available on RDS. (Something I wish they’d fix.)

8 September, 2014

Updating My Linux Command line Toolbox, episode 2

by gorthx

Part 1

Five more, all from this week:

1. date -u to get your date in UTC

2. pushd and popd – create your own directory stack.  I’m still trying this one out.  (“why not just use the up arrow?”)

3. pbcopy – copy to clipboard from the command line.

4. !$ contains the last arg of the previous command, so you can do something like this:
ls -l filename.*    # check what you have in the dir
vi !$

5. This one is my favorite: !?[string] runs last command that contains that string.

Tags: ,
18 August, 2014

My PostgresOpen schedule

by gorthx

We had so many good submissions for Postgres Open this year, we had to make some very difficult choices. While I haven’t quite achieved Dan Langille’s (of PgCon fame) level of conference-running zen and rarely get to all the talks I’d like to see, here are my picks:

The PostGIS tutorials. I am really excited about this tutorial, and that Regina Obe and Leo Hsu will be at our conference!

Bruce Momjian and Vibhor Kumar’s NoSQL on ACID (plus pizza and root beer!) Most everyone knows who Bruce is, but you may not know Vibhor – I got to work with him at EDB, and he’s very kind and always had answers to my most obscure questions. This tutorial is free, but you have to register in advance. If you’re coming to the conf early, this is a great thing to do on your first evening.

Denish Patel gives great presentations, and I’m looking forward to learning about Pg on RDS from him.

Streaming replication was first introduced in 9.0. Simon Rigg’s Pg Replication Overview will bring you up to speed on the new developments since then.

You can read the 9.4 release notes or just go to Magnus Hagander’s talk.

I’m really interested in the tools that Gleb Arshinov used for his customer retention analysis use case.

Jonathan Katz’s data types tour has had rave reviews, and I’m glad he’s giving it again here.

Jeff Amiel’s Monolithic Query Syndrome talk looks educational and entertaining.

I can’t decide between Sehrope Sarkuni’s Audit logging talk and Gary Seiling’s discussion about immutable data. (Note: if you are new to Postgres and configuring its security features, I recommend pairing one of these with Sarah Conway’s introductory talk on Thursday.)

Full text search is a hot topic right now.

I will almost always choose a monitoring talk over any other topics, so Shaun Thomas’s collectd and graphite talk is on my list.

What’s the state of Row level Security in Postgres? Go to this talk by Stephen Frost and find out.

John Melesky is a fellow PDXPUGer and always bring interesting problems to the table. This time he’s talking about partitions.

It’s not too late to join us in Chicago!

13 June, 2014

Postgres Monitoring Wishlist

by gorthx

Since “what should I monitor in my database” has come up in conversation several times lately, I thought I’d put this here where I (theoretically) won’t lose it. I’ll save for later the discussion of where to get this info and which tools give me which stats :)

Bare minimum:
server CPU, memory, I/O, network usage, and all “slow” queries logged.

More extensive:
System stats:
CPU usage, per-proc if available
Memory usage, including swap
disk usage (in terms of space – pay special attention to database partitions)
disk I/O
disk busy
network stats, including errors (if you have a Cisco network & are friends with the network team, netflow data is cool to have)

If I could have everything I wanted: everything from vmstat and iostat extended data

Pg stats:
number of connections
idle transactions
commits vs rollbacks
checkpoint frequency
database size
table size (plus bloat, if we can find a good query for it)
index size (same)

If I could have everything I wanted:
everything from pg_stat_database, pg_stat_sys_*, pg_statio_*, and pg_stat_user_activity

Activity logs configured as outlined here.

Then there’s a whole class of things that fall under “How long does it take to…”: do a backup, restore a backup, etc.

6 June, 2014

autovacuum: long naps aren’t better

by gorthx

There’s that saying about “the first time’s an accident, the second’s a coincidence, the third is a pattern”. It’s probably because I’ve been studying Postgres’s autovacuum feature so much lately and these things stand out to me now, but I’ve noticed a really intriguing pattern (n>5) over the past month or so: folks with their autovacuum_naptime set too dang high.

The autovacuum_naptime GUC is the amount of time the daemon waits between checking for tables that need vacuuming and analyzing (or have xids that need freezing). If you set this to, for example, one day1, each table in your database gets checked only once a day. It’s certainly possible that this will work ok for your use case. What tends to happen, though, is that you end up with several tables that need maintenance, but the long naptime doesn’t allow that to happen in a timely fashion. And of course, the longer you go between vaccing/analyzing/freezing them, the longer it takes to get that job done and soon you’re in a downward spiral, and maybe your database ends up taking a dirt nap, at least from the viewpoint of your users who are all saying “OMG, why is this so sloooooow?” Then you conclude “autovacuum sucks” and disable it and try to stay on top of it with cron jobs etc like back in the Bad Old Days When We Didn’t Have Nice Things.

So, while the autovac config params are all interdependent to some extent2, this is kind of a big one and IME you don’t want to increase the naptime beyond the already-conservative default of 1 minute.

1 – Yes, really. Not trying to embarrass anyone here! Rest assured you are not alone.
2 – I was going to say that individual GUCs don’t operate in a vacuum, but figured that might be taking things Just Too Far.

25 April, 2014

Mistakes I make with Perl hashes

by gorthx

I’ve been doing a bit of Perl programming the past couple months. I felt pretty rusty at first (it’s been over a year since I’ve written anything serious with it) but am getting back into the swing of things. My main use of Perl is manipulating delimited text data (from databases or flat files) for reports or loading into databases and the like. For these types of tasks, I really prefer hashes (and HoH…oH) to arrays because I can can give my variables appropriately-named keys, such as $switch{‘card’}{‘port’}. It’s a lot more obvious what that’s doing than $switch[12][2]. Obvious is good, especially 6 months later when I’ve come back to a project and am saying “what the hell is this” (as we all do).

I always seem to make the same three mistakes with hashes. The first two feature the fun symptom “I’m getting data that’s different from what I expect, and it’s randomly different every time I run my program”:

1. Oops, I need my data in a certain order. This is the first thing I forget if I haven’t written any Perl code for a while: Perl hands hash data back in whatever order it feels like. When I’m writing my initial tests, I’m using smaller data sets (like maybe one or two hash “rows”) and I get lucky and my data’s in the sort order I want. Then I get some real data, and … “oh right!” I fix this by using an AoH instead, or storing (some of) the keys in a separate array (feels kludgy), and I suppose some day I will get around to trying one of the permutations of Tie::Hash.

2. I fail to provide a unique key for the hash. This is another one that doesn’t become apparent until I’m working with real data: a “random” small data sample has two unique identifiers, but when get a bigger data set I find out there’s actually five, and so on, and pretty soon I’ve got a HoH…oH that’s 17 levels deep. (I kid.) (Maybe.)

3. The bane of my existence: typos in my hash keys. ‘use strict’ doesn’t protect from this. Writing tests helps[1], but they still occasionally slip through[2]. I troubleshoot with rigorous use of Data::Dumper::Simple and the debugger.

1 – Thank you, PDX.pm.
2 – My test for variable names: 3 months (6 months, a year) later, is it still obvious what this variable holds, and would I name it the same thing?


Get every new post delivered to your Inbox.