Solid State Drives (SSDs) deliver several times the performance of enterprise SAS drives, and we at Nordeus use them heavily with Postgres. SSDs especially shine in random page reads and writes, because they sustain far more input/output operations per second (IOPS) than regular drives. For example, the best 15k RPM SAS drives can achieve about 200 IOPS, while SSDs can do 75,000 read and 15,000 write IOPS. This is one of the main reasons we run Postgres on SSDs: our applications' workloads are heavy on random reads and writes, which SAS drives simply cannot handle, not even in RAID configurations. Of course, this performance comes at a price: SSDs have several times smaller capacity, cost more, and have a shorter lifetime, since flash cells endure only a limited number of writes. SSDs also have a reputation for being less reliable than regular drives. Newer enterprise-ready SSDs are highly reliable, so reliability shouldn't be a problem with them, but they also cost a lot more.
Enough about SSDs in general; let's get back to the title of this article and why we started doing power failure tests with our SSDs. Over the past 2 years we have had power failures on our servers. They are not frequent, but they happen. Usually these are not a problem, but a few times we noticed index corruption in Postgres instances, and on a few of them even unrecoverable corruption. Of course we had backups, but this shouldn't happen. Also, a few months ago we migrated hundreds of servers from one data center to another, after which we again noticed data and index corruption. The most interesting thing is that these servers were powered off normally before they were migrated, yet we still had problems. That's when we started investigating, and one thing we noticed was that only particular SSD models were giving us problems, while others weren't. Here is a list of the models we used in production:
| SSD Model | Detected Issues |
|---|---|
| Intel 320 | No issues |
| Intel 330 | Database data/index corruption |
| Intel 335 | Database data/index corruption |
| Intel 520 | Database data/index corruption |
| Samsung 840 Pro | No issues |
The drives are listed in the order in which they were added to our servers, with the Intel 320 being the oldest and the Samsung 840 Pro the newest. As you can see, the Intel 330, 335 and 520 were the ones causing problems, and when we compared their specs with the other two models, it turned out not to be a coincidence. Those three models are consumer SSDs with no enterprise-ready features to keep them reliable during a power failure and in many other situations. The Intel 320, on the other hand, is one of the rare consumer-oriented SSDs with supercapacitors (SC), which allow the SSD to survive power failures: the SC holds just enough power to flush everything from the SSD cache to flash when power is lost. The Samsung 840 Pro doesn't have an SC, but we haven't had any problems with it, at least so far, which of course doesn't mean it is safe.
What was even more interesting is that our SSDs are connected to Dell H700/H710 RAID controllers, which have a battery backup unit (BBU) that should make our drives resilient to power failures. In a power failure, a RAID controller with a BBU can hold the cached data until power returns and then flush it to the drives once they come back online. The problem is that with most consumer SSDs you can never be 100% sure the SSD has finished a write and flushed the data from its own cache to flash. Because of this, even an SSD connected to a RAID controller with a BBU is not reliable if it doesn't have its own power loss protection.
Since we wanted to fix this problem once and for all, we first needed to reproduce it in a test environment. We decided to use the following three models of SSDs:
- Intel 520 - We know it failed in production, so we wanted to reproduce this in a test environment.
- Samsung 840 Pro - Since it never failed in production, we wanted to test how resilient it really is.
- Intel DC S3500 - This is a new enterprise ready SSD which should be extremely resilient and durable. We use it in all our new servers.
Here is the testing configuration we used:
- Dell R420
- H710p RAID controller
- 2 x SSD (models above)
- 48GB RAM
- CPU and other things are not important for this test.
CentOS 6.5 was used, and the SSDs were formatted with XFS and mounted at /mnt/ssd1 and /mnt/ssd2. XFS was used because it is our main file system for databases. We chose XFS not just because it is awesome, but also because when we started a few years ago we were using CentOS 5, which had ext3 as its main file system and no ext4 support, so XFS was a no-brainer thanks to its much better performance compared to ext3.
We used two types of tests to determine if the drives are working correctly in case of a power failure:
- fsync test
- pgbench with Postgres 9.3 with data checksum
Before I explain these tests, there are two important things you need to know. What is fsync and what are write barriers?
What is fsync?
In a nutshell, fsync is a system call which flushes file system buffers/cache to the disk. Databases use it heavily whenever they need to make sure data has made it to the disk. A successful fsync doesn't necessarily mean the data was really written to persistent storage, since it can still remain in the disk's own cache, but enterprise drives usually make sure the data is really written on fsync calls. If you want to learn more about fsync, I would recommend that you read Everything You Always Wanted to Know About Fsync().
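As a minimal sketch of a write followed by fsync, GNU dd can be asked to call fsync on its output file before exiting via `conv=fsync`. The `TARGET` path here is a placeholder chosen for illustration; in a real test it would point at a file on the drive under test.

```shell
#!/bin/sh
# Sketch: write one 8 kB block and have dd fsync the output file before
# it exits. Requires GNU dd (conv=fsync).
# TARGET is a hypothetical path; by default a temporary file is used.
TARGET="${TARGET:-$(mktemp)}"

# Without conv=fsync the data may still sit in the OS page cache when dd exits.
dd if=/dev/zero of="$TARGET" bs=8192 count=1 conv=fsync 2>/dev/null

# Normalize the byte count reported by wc to a plain integer.
SIZE=$(($(wc -c < "$TARGET")))
echo "wrote $SIZE bytes to $TARGET"
```

Note that a successful fsync only tells us the kernel handed the data to the device. Whether the drive keeps it across a power cut is exactly what the tests below try to find out.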
What Are Write Barriers?
I won't go into much detail about what write barriers are, but in short they make sure that the file system write cache is written to the disk in the correct order. This means that when barriers are enabled on a file system (they usually are by default on modern file systems) and you call fsync to flush a particular file, fsync will first flush everything in the file system cache that precedes that file and then flush the file itself, preserving the order of writes as it was in the file system cache. This carries a big write performance penalty, especially for applications which call fsync frequently (like databases). Barriers matter most during a power failure, where they prevent data corruption. If write barriers are disabled, the file system cache can be written to the disk in a different order, making the file system vulnerable to power failures and corruption. The only time it is considered safe to turn off write barriers is when you have a RAID controller with a BBU and your SSDs have power loss protection such as supercapacitors. If you would like to read more about write barriers, I would suggest Barriers, Caches, Filesystems.
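To see whether a file system is currently mounted with barriers disabled, one option is to inspect its mount options. The sketch below assumes the `/proc/mounts` line format and treats `nobarrier` (the XFS option of that era) or `barrier=0` as "off". The function name and the sample device names are made up for this example.

```shell
#!/bin/sh
# Sketch: report whether a mount point has write barriers disabled.
# Input lines follow /proc/mounts: device mountpoint fstype options dump pass
# The second argument defaults to /proc/mounts; overriding it is only used
# here so the example can run against sample data.
barrier_state() {
    awk -v mp="$1" '
        $2 == mp {
            found = 1
            state = "barriers on (default)"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "nobarrier" || opts[i] == "barrier=0")
                    state = "barriers off"
        }
        END { if (found) print state; else print "not mounted" }
    ' "${2:-/proc/mounts}"
}

# Sample data with hypothetical devices, not taken from the test machine.
sample=$(mktemp)
printf '%s\n' \
    '/dev/sdb /mnt/ssd1 xfs rw,noatime,nobarrier 0 0' \
    '/dev/sdc /mnt/ssd2 xfs rw,noatime 0 0' > "$sample"

barrier_state /mnt/ssd1 "$sample"   # prints "barriers off"
barrier_state /mnt/ssd2 "$sample"   # prints "barriers on (default)"
```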
How Did We Test?
Both tests were run while varying 3 different parameters:
- Disk cache on/off - Disabling disk cache reduces disk performance, and also decreases the lifetime of the SSD.
- Write barriers on/off - As explained above, turning off write barriers increases performance, but if the hardware doesn't have proper protection, it can cause data corruption.
- RAID controller write-back/write-through - In write-back mode the RAID controller's cache is used to cache writes and make them faster, while in write-through mode the cache is not used and the data is written directly to the disks, which again reduces performance.
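For reference, three on/off style parameters give eight possible combinations. The sketch below simply enumerates them. The label names are illustrative, applying each setting is hardware specific and not shown, and not every combination was actually tested.

```shell
#!/bin/sh
# Sketch: list every combination of the three parameters described above.
# This only prints labels; it does not change any drive, file system or
# controller setting.
list_combinations() {
    for cache in on off; do
        for barrier in on off; do
            for mode in write-back write-through; do
                echo "disk_cache=$cache barriers=$barrier raid_mode=$mode"
            done
        done
    done
}

list_combinations
```

The first line printed (disk cache on, RAID write-back) is the performance-oriented end of the range, and the last (disk cache off, barriers on, write-through) is the most conservative combination.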
Each test was run with different combinations of these 3 parameters and repeated at least 5 times per combination, stopping at the first failure. An SSD needed to pass a test at least 5 times for it to count as passed, but most of the passing tests were repeated many more than 5 times, just in case.
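The pass/fail bookkeeping described here can be sketched as a small loop that stops at the first failure. The `run_trials` helper is hypothetical, and the command it runs stands in for one full power-cut cycle, which in reality involves manual steps such as pulling the plug.

```shell
#!/bin/sh
# Sketch: run a test command up to N times and stop at the first failure.
# Usage: run_trials N command [args...]
run_trials() {
    max="$1"; shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "FAIL on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "PASS after $max runs"
}

# 'true' and 'false' stand in for a passing and a failing test cycle.
run_trials 5 true            # prints "PASS after 5 runs"
run_trials 5 false || true   # prints "FAIL on run 1"
```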
For the fsync test we decided to use the well-known diskchecker.pl script (more details). For this script to work you need two hosts, a server and a client. In short, the script writes to a file on the host you are testing, calls fsync after each write, and reports to the server host what it has actually written. During the test you pull the plug, wait a few minutes, power the machine back on and re-run the script to check what was written and what wasn't. We also needed to modify the script a little, since it didn't support running two instances of the server on the same host.
Here is the full testing procedure:
- On the server (non-testing host) run two instances of the diskchecker utility, since we will be testing two SSDs in parallel on the same host:
./diskchecker.pl -l 1212
./diskchecker.pl -l 1213
- On the client (testing host) run:
./diskchecker.pl -s SERVER_IP:1212 create /mnt/ssd1/first_test 51200
./diskchecker.pl -s SERVER_IP:1213 create /mnt/ssd2/first_test 51200
- Unplug power on the testing server after 5 minutes, power the host back on after 10 minutes and run the verify step:
./diskchecker.pl -s SERVER_IP:1212 verify /mnt/ssd1/first_test
./diskchecker.pl -s SERVER_IP:1213 verify /mnt/ssd2/first_test
- Clean everything up and run the test again at least 5 more times.
| Model | Disk Cache | Barrier | Write Mode | Disk 1 | Disk 2 |
|---|---|---|---|---|---|
| Samsung 840 Pro | On | Off | Write-back | FAIL | FAIL |
As we can see from the results, the Intel S3500 works without any problems with all parameters configured for best performance, which is why we didn't test other combinations, since they would only give us lower performance. The Samsung 840 Pro failed the first test, but didn't report any problems with disk cache off. The Intel 520, on the other hand, did not pass with any combination of settings we tried: even with the RAID controller cache disabled (write-through), disk cache turned off and barriers enabled, it still failed the tests. I would like to note that we didn't test the Intel 520 with all possible combinations of parameters, since if the last combination (which should be the safest, but with the lowest performance) fails, we expect all the others to fail too.
The idea behind the pgbench test is to do a lot of writing into the Postgres database and to unplug the power cord while this writing is in progress. To verify that the data written into the database is correct, we use data checksums, which were introduced in Postgres 9.3. The verification is done by running pg_dumpall, since every page it reads is checked against its checksum. This is how we tested:
- Install PostgreSQL 9.3 from the official repositories
- Create needed directories on SSDs and link them to be used as a default Postgres data directory:
mkdir -p /mnt/ssd1/pgsql/9.3/data
mkdir -p /mnt/ssd2/pgsql/9.3/index
chown postgres:postgres /mnt/ssd1/pgsql/9.3/data
chown postgres:postgres /mnt/ssd2/pgsql/9.3/index
rm -rvf /var/lib/pgsql/9.3/*
ln -s /mnt/ssd1/pgsql/9.3/data /var/lib/pgsql/9.3/data
ln -s /mnt/ssd2/pgsql/9.3/index /var/lib/pgsql/9.3/index
- Initialize the Postgres cluster on SSD1 with data checksums enabled and start Postgres:
runuser -l postgres -c "/usr/pgsql-9.3/bin/initdb \
    --pgdata=/var/lib/pgsql/9.3/data \
    --auth='trust' \
    --data-checksums"
service postgresql-9.3 start
- Create the pgbench database, add the table space index on SSD2, and initialize the pgbench database with indices moved to the index table space (this way both SSDs will be used while testing):
psql -U postgres -c "CREATE DATABASE pgbench;"
psql -U postgres -c "CREATE TABLESPACE index LOCATION '/var/lib/pgsql/9.3/index';"
/usr/pgsql-9.3/bin/pgbench -U postgres -i pgbench --index-tablespace=index
# Create additional indexes so there is more writing on SSD2
psql -U postgres -d pgbench -c "CREATE INDEX ON pgbench_accounts(abalance,aid) TABLESPACE index;"
psql -U postgres -d pgbench -c "CREATE INDEX ON pgbench_history(mtime) TABLESPACE index;"
psql -U postgres -d pgbench -c "CREATE INDEX ON pgbench_history(aid,delta) TABLESPACE index;"
- Run pgbench for a few minutes, e.g.:
/usr/pgsql-9.3/bin/pgbench -U postgres -c 12 -T 3600 pgbench
- Unplug the power, wait 10 minutes, and power on the server.
- Run pg_dumpall (just to be sure) and check for any errors:
/usr/pgsql-9.3/bin/pg_dumpall -U postgres > /dev/null
- Destroy the Postgres instance and run everything again.
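One way to make the "check for any errors" step less manual is to capture the stderr of pg_dumpall into a file and scan it for the messages Postgres emits when a page fails checksum verification. This is only a sketch: the `has_checksum_errors` helper and the file names are assumptions made for this example, and the sample log lines are illustrative rather than output captured from these tests.

```shell
#!/bin/sh
# Sketch: detect checksum-related failures in captured pg_dumpall stderr
# (or a server log). The file could be produced with, for example:
#   /usr/pgsql-9.3/bin/pg_dumpall -U postgres > /dev/null 2> dump.err
has_checksum_errors() {
    grep -E -q 'page verification failed|invalid page in block' "$1"
}

# Illustrative sample only, not real output from the tests in this article.
log=$(mktemp)
cat > "$log" <<'EOF'
WARNING:  page verification failed, calculated checksum 1234 but expected 5678
ERROR:  invalid page in block 0 of relation base/16384/16400
EOF

if has_checksum_errors "$log"; then
    echo "checksum errors found"
else
    echo "no checksum errors"
fi
```

A clean scan does not prove the drive is safe; it only means no corrupted page was read during that particular dump.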
| Model | Disk Cache | Barrier | Write Mode | Postgres |
|---|---|---|---|---|
| Samsung 840 Pro | On | Off | Write-back | OK |
As in the previous test, the Intel S3500 worked without any problems with default settings for best performance, and this time the Samsung 840 Pro didn't cause any problems either. On the Intel 520, Postgres kept reporting errors until we disabled the disk cache, but this doesn't make the drive safe, only less prone to corruption. The Intel 520 probably "survived" with the disk cache disabled because Postgres executes fsync less often than diskchecker.pl, so the probability of hitting a failed fsync is smaller.
As we can see from the results, the Intel S3500 is an enterprise-ready drive which works reliably out of the box, and we would recommend it to everyone who needs a reliable, high-performance drive which isn't very expensive. The Samsung 840 Pro, on the other hand, looks to be doing pretty well with the disk cache disabled in the fsync test, and in the Postgres test it even managed to pass with disk cache enabled. I would still not recommend it for mission-critical databases, but if you have no choice, at least disable the disk cache (and test performance). I'm not exactly sure what to say about the Intel 520, except that this test confirmed all the concerns we had: it is not reliable for production and should probably only be used in consumer devices.
Finally, never believe what hardware vendors and/or the internet say. Having a RAID controller with a BBU doesn't mean your data is safe when you lose power. As you can see from this article, it depends on many factors, like the quality of the controller, the quality of the drives, the type of drives, etc. If you want to be sure your data is safe, it's best to test it yourself.