
Adventures with a PostgreSQL RAM Drive


Some days, you want to distribute your application to 500 nodes around the world. Other days, you just want it to run faster on this one stupid little box. Today is one of the latter days.

Here at AppNeta, we run everything in Amazon. When we were starting out, like most people, we treated Amazon instances as physical server replacements. Also like most people, we quickly figured out that I/O performance on EBS volumes can be … spotty. Running PostgreSQL on an EBS-backed instance in an application that collects 2 trillion metrics per month can be challenging. Last summer, Amazon addressed this issue with Provisioned IOPS EBS volumes, promising predictable high-speed I/O. However, you end up paying $100 a month, and that isn’t cheap if you’re running a couple dozen PostgreSQL servers. The question is: how can we get that kind of performance out of AWS without the cost?

[As with all good experiments, I prefer to concentrate on the experiment instead of the experimenter. Therefore, the cost of my time is not included in the calculation :)]

There are plenty of solutions posted on the Web that promise higher TPS by tuning “wal_buffers”, “commit_delay”, and “commit_siblings” across different Postgres versions. These are valuable for tuning an instance to its workload, and we’ll get there, but blindly turning these knobs isn’t going to give us the kind of speedup we’re looking for. If you’re looking for a quick win, I’d certainly recommend taking a look at these, but I’ve got a better idea.
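
For reference, these are the kinds of knobs that advice usually centers on. The values below are an illustrative starting point of mine, not something taken from this experiment (synchronous_commit isn’t mentioned above, but it’s in the same family); every one of them should be validated against your own workload:

wal_buffers = 16MB          # shared memory for WAL that hasn't been written to disk yet
synchronous_commit = off    # return success before the WAL flush; a crash can lose recent commits
commit_delay = 1000         # microseconds to wait so nearby commits can share one WAL flush
commit_siblings = 5         # only apply commit_delay if at least this many transactions are active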

A Baseline PostgreSQL Setup

root@jsrvr31-asl-vm:/data/dbservers/pgsql9/bin# ./pgbench -h localhost -p 5432 -U postgres -c 10 -t 10000 appnetdb
Password:
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 568.146099 (including connections establishing)
tps = 568.566356 (excluding connections establishing)

My baseline gives me 568 tps.  Not great, if you ask me.
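
For completeness, the run above assumes the pgbench tables already exist in appnetdb. They would have been created beforehand with something along these lines (the scaling factor of 1 matches the output above):

./pgbench -i -s 1 -h localhost -p 5432 -U postgres appnetdb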

I have a second hard disk on my machine.  Let’s put that to work.

Separate Disks

root@jsrvr31-asl-vm:/data/databases/db001# mount
/dev/sda1 on / type ext3 (rw)
/dev/sdb1 on /data type ext3 (rw)
root@jsrvr31-asl-vm:~# mkdir /db001_pg_xlog
root@jsrvr31-asl-vm:/db001_pg_xlog# cp -ar /data/databases/db001/pg_xlog/* /db001_pg_xlog/.
root@jsrvr31-asl-vm:/data/databases/db001# mv pg_xlog/ pg_xlog.bak
root@jsrvr31-asl-vm:/data/databases/db001# ln -s /db001_pg_xlog/ pg_xlog
root@jsrvr31-asl-vm:/data/databases/db001# chown postgres:postgres pg_xlog

The commands above move the WAL (Write Ahead Log) directory to a separate disk. The WAL and the main data directory have somewhat different access patterns. In theory, all of my write contention will be limited to a single disk, while the database’s actual files remain on another disk. This reduced contention could allow for a bump in performance.
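
One detail the transcript glosses over: the server should be stopped while pg_xlog is relocated, otherwise PostgreSQL keeps writing WAL into the old directory mid-move. A sketch of the full sequence, repeating the steps above with the stop/start added (the init script name is the one used later in this post):

/etc/init.d/postgresql-9.0 stop
mkdir /db001_pg_xlog
cp -ar /data/databases/db001/pg_xlog/* /db001_pg_xlog/
chown -R postgres:postgres /db001_pg_xlog
cd /data/databases/db001
mv pg_xlog pg_xlog.bak
ln -s /db001_pg_xlog pg_xlog              # point the data directory's pg_xlog at the other disk
chown -h postgres:postgres pg_xlog        # chown the symlink itself
/etc/init.d/postgresql-9.0 start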

root@jsrvr31-asl-vm:/db001_pg_xlog# /data/dbservers/pgsql9/bin/pgbench -h localhost -p 5432 -U postgres -c 10 -t 10000 appnetdb
Password:
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 594.288200 (including connections establishing)
tps = 594.776358 (excluding connections establishing)

From 568 to 594. That’s around a 5% improvement. Performance gains don’t come easy, right? If we know disk speed is the bottleneck, let’s try something stupid: let’s put the WAL on a RAM disk.

Adventures in Data Loss

root@jsrvr31-asl-vm:~# mkdir /tmp/ramdisk;
root@jsrvr31-asl-vm:~# chmod 777 /tmp/ramdisk/
root@jsrvr31-asl-vm:~# mount -t tmpfs -o size=1024M tmpfs /tmp/ramdisk
root@jsrvr31-asl-vm:/data/databases/db001# ln -s /tmp/ramdisk/ pg_xlog
root@jsrvr31-asl-vm:/data/databases/db001/pg_xlog# cp -ar /db001_pg_xlog/* .
root@jsrvr31-asl-vm:/data/databases/db001/pg_xlog# /data/dbservers/pgsql9/bin/pgbench -h localhost -p 5432 -U postgres -c 10 -t 10000 appnetdb
Password:
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 964.496814 (including connections establishing)
tps = 965.963811 (excluding connections establishing)

From 568 to 964. 70% improvement!  Plus, AWS will let us scale this up and down pretty easily, since we don’t have to match EBS disks to instance sizes anymore. Great!
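
One small operational note of my own: a tmpfs mounted by hand like this disappears on reboot, mount and all. An /etc/fstab line re-creates the (empty) mount automatically at boot, using the same 1024M size as above; the directory itself still has to exist:

tmpfs   /tmp/ramdisk   tmpfs   size=1024M,mode=0777   0 0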

Of course, there’s a reason why RAM disks are a bad idea for any database: durability. If we so much as turn off this machine, we’d lose everything on the RAM disk, including the WAL. I think I can safely say that’s not a good tradeoff, no matter what the use case is.

On the other hand, let’s try to add durability back in without going back to conventional disks. One way to do this is to add replication to a durable disk. This is a proof of concept, so let’s quickly spin up a replication server. We can even run it on the same machine and reuse the disk we already have attached:

Master settings (postgresql.conf):

port = 5432
wal_level = hot_standby
archive_mode = on
archive_command = 'cp %p /backup/db001/%f'
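
Not shown here, but implied by the “replication connection authorized” lines in the logs further down: the master also needs WAL sender slots and a pg_hba.conf entry so the slave can make its streaming connection. Something like this, with illustrative values:

# postgresql.conf (master)
max_wal_senders = 3        # allow standby connections for streaming replication
wal_keep_segments = 32     # optional: keep extra WAL around in case the standby falls behind

# pg_hba.conf (master)
host    replication    postgres    127.0.0.1/32    trust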

Slave settings (postgresql.conf):

port = 5433
wal_level = hot_standby
archive_mode = on
archive_command = '/bin/true'
hot_standby = on

recovery.conf (on the slave):

standby_mode = 'on'
primary_conninfo = 'host=127.0.0.1 port=5432 user=postgres'
trigger_file = '/tmp/postgresql.trigger.5433'
restore_command = 'cp /backup/db001/%f "%p"'
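
The post also skips over how the slave’s data directory gets seeded. On 9.0 (before pg_basebackup existed) the usual approach is a file-level base backup, roughly like this; the slave path /data/databases/db002 is my placeholder, not something from the original setup:

# take a low-level base backup of the master into the slave's (hypothetical) data directory
psql -h localhost -p 5432 -U postgres -c "SELECT pg_start_backup('seed_slave');"
rsync -a --exclude=pg_xlog /data/databases/db001/ /data/databases/db002/
psql -h localhost -p 5432 -U postgres -c "SELECT pg_stop_backup();"
mkdir -p /data/databases/db002/pg_xlog
chown -R postgres:postgres /data/databases/db002
# then drop the recovery.conf above into /data/databases/db002/ and set port = 5433 in its postgresql.conf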

Let’s start both the master and the slave and run a couple of insert statements to make some changes to the DB.
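
A quick way to confirm that changes actually flow through looks something like this; the table name is mine, not from the post, and since the hot standby is read-only we only query it there:

# on the master (port 5432)
psql -h localhost -p 5432 -U postgres appnetdb -c "CREATE TABLE ramdisk_test (id int);"
psql -h localhost -p 5432 -U postgres appnetdb -c "INSERT INTO ramdisk_test VALUES (1), (2);"

# on the slave (port 5433), after giving replication a moment to catch up
psql -h localhost -p 5433 -U postgres appnetdb -c "SELECT count(*) FROM ramdisk_test;"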

The first thing I want to know is whether the database would survive a normal shutdown. Assume the contents of the RAM disk are completely wiped out after the shutdown. The procedure (sketched in shell after the list) is:

  1. Stop the Master
  2. Stop the Slave
  3. Remove the pg_xlog directory from Master
  4. Reconstruct the pg_xlog directory from Slave
  5. Start the Master
  6. Start the Slave
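
Roughly, in shell. The slave’s data directory path is my placeholder from earlier, and the exact stop/start commands depend on how each instance was launched:

/data/dbservers/pgsql9/bin/pg_ctl -D /data/databases/db001 stop    # 1. stop the master
/data/dbservers/pgsql9/bin/pg_ctl -D /data/databases/db002 stop    # 2. stop the slave
rm -rf /tmp/ramdisk/*                                              # 3. the master's pg_xlog (the RAM disk) is wiped
cp -ar /data/databases/db002/pg_xlog/* /tmp/ramdisk/               # 4. reconstruct it from the slave's pg_xlog
chown -R postgres:postgres /tmp/ramdisk
/etc/init.d/postgresql-9.0 start                                   # 5. start the master
/data/dbservers/pgsql9/bin/pg_ctl -D /data/databases/db002 start   # 6. start the slave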

Let’s check the log to see if the Master started up correctly:

2013-11-15 21:56:42 PST LOG:  database system was shut down at 2013-11-15 21:51:06 PST
2013-11-15 21:56:42 PST LOG:  database system is ready to accept connections
2013-11-15 21:56:42 PST LOG:  autovacuum launcher started
2013-11-15 21:56:43 PST LOG:  connection received: host=[local]
2013-11-15 21:56:47 PST LOG:  connection received: host=127.0.0.1 port=38471
2013-11-15 21:56:47 PST LOG:  replication connection authorized: user=postgres host=127.0.0.1 port=38471
2013-11-15 21:56:58 PST LOG:  connection received: host=127.0.0.1 port=38472
2013-11-15 21:57:01 PST LOG:  connection received: host=127.0.0.1 port=38473
2013-11-15 21:57:01 PST LOG:  connection authorized: user=postgres database=appnetdb

Everything works fine, with no data lost.
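
A quick sanity check with the throwaway table from earlier would look like this:

psql -h localhost -p 5432 -U postgres appnetdb -c "SELECT count(*) FROM ramdisk_test;"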

Of course, if everything always went the best-case way, anything could be made fast. Let’s formulate a worst-case scenario: kill -9 the postmaster on the master. Will I corrupt the files inside pg_clog? Will I lose all my data? Worst case, I’d just run dd to zero everything out, and we’re back to square one.

root@jsrvr31-asl-vm:/data/databases/db001# cat postmaster.pid
30921
/data/databases/db001
  5432001   1048576
root@jsrvr31-asl-vm:/data/databases/db001# kill -9 30921

Let’s replace the pg_xlog on the master with the slave’s pg_xlog again, then start it back up:

root@jsrvr31-asl-vm:/data/databases/db001# /etc/init.d/postgresql-9.0 start
Starting PostgreSQL 9.0:
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start... done
server started
PostgreSQL 9.0 started successfully
2013-11-16 02:16:57 PST LOG:  database system was interrupted; last known up at 2013-11-16 02:15:34 PST
2013-11-16 02:16:57 PST LOG:  database system was not properly shut down; automatic recovery in progress
2013-11-16 02:16:57 PST LOG:  consistent recovery state reached at 54/4A000078
2013-11-16 02:16:57 PST LOG:  record with zero length at 54/4A000078
2013-11-16 02:16:57 PST LOG:  redo is not required
2013-11-16 02:16:57 PST LOG:  autovacuum launcher started
2013-11-16 02:16:57 PST LOG:  database system is ready to accept connections
2013-11-16 02:16:58 PST LOG:  connection received: host=[local]
2013-11-16 02:17:01 PST LOG:  connection received: host=127.0.0.1 port=38495
2013-11-16 02:17:01 PST LOG:  replication connection authorized: user=postgres host=127.0.0.1 port=38495

Yes, the database started up successfully. In theory, you might experience data loss, though we didn’t here: because both the Master and the Slave are on the same machine, there isn’t enough latency to cause any. (There was also data loss for any open transactions that didn’t get a chance to commit. We’re certainly being a bit optimistic by reporting success before the actual durable write, but it worked for MongoDB, right?) If the processes were on different machines, we’d expect a much higher chance of loss, but even that can be acceptable. For example, if there were an Amazon EC2 zone shortage, we could temporarily separate the two machines, accepting an increased data-loss risk to compensate for the lack of new machines. If we could replay these transactions, or the data were low-value, this might be an acceptable solution.

A Few Caveats

  1. RAM disks are limited by the allocated memory, and memory is much more expensive than disk space. On top of the raw amount of data, PostgreSQL allocates WAL files in 16MB increments, and that extra size might matter in certain configurations. You can manage the pg_xlog directory size by tweaking checkpoint_segments and checkpoint_timeout: lower but still acceptable values for both keep the directory bounded (see the sketch after this list).
  2. More moving parts are dangerous. Restarts now require manually copying over pg_xlog, and communication problems between the master and slave can mean extra log space consumed on the master, or data loss. We could certainly address that with custom scripts for archive_command and restore_command that ship the logs locally, but with more things in play there are more places for bugs to creep in.
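
Here is the kind of tweak the first caveat refers to; the numbers are illustrative only. On 9.0, pg_xlog normally stays under roughly (2 + checkpoint_completion_target) * checkpoint_segments + 1 segments of 16MB each, so keeping checkpoint_segments modest bounds the RAM disk usage at the cost of more frequent checkpoints:

checkpoint_segments = 3      # the default; roughly 9 segments of 16MB, well under the 1024M ramdisk above
checkpoint_timeout = 5min    # checkpoint at least this often so old segments can be recycled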

Should You Use It?

Probably not. In addition to the two issues above, I haven’t run this enough to get comfortable, even though, in theory, it’s an interesting tradeoff between performance and durability. I’m still planning to play around with it and see what comes out — stay tuned!

Alan Leung