Some quick numbers about SSD for PostgreSQL

16/07/2009 , by Sylvain Munaut

SSDs are not new anymore, and there are plenty of benchmarks and tests out there. Even for the specific combination of SSDs and PostgreSQL, quite a lot of data is available. This post by Jignesh Shah is a good example (fyi, if you use PostgreSQL on Solaris, that blog's awesome!).

So when the time came to choose hardware for one of our DB servers, we chose to include some SSDs in it. But once it was installed and used with real data under a real workload, I wanted to make sure that the predictions we made at the time were accurate, and that the latency of the SSDs was indeed low and being put to good use.

To this end, I wrote a little DTrace script (btw, DTrace _rocks_!) that records all calls to the read syscall with a size of 8192 bytes (the PostgreSQL page size) and displays a statistical distribution graph of their completion times, both globally and by tablespace.

I chose to analyze only the reads, because the various special techniques involved in DB writes would have made the results hard to interpret.

Here is the script:

#!/usr/sbin/dtrace -qs

/* Record every read() of exactly one PostgreSQL page (8192 bytes) */
syscall::read:entry
/arg2 == 8192/
{
        self->start = timestamp;
        self->dir = dirname(fds[arg0].fi_pathname);
}

syscall::read:return
/self->start/
{
        this->elapsed = timestamp - self->start;
        @read_time_all = quantize(this->elapsed);
        @read_time[self->dir] = quantize(this->elapsed);
        /* Clear the thread-local variables to release their storage */
        self->start = 0;
        self->dir = 0;
}

dtrace:::END
{
        /* Print the global distribution first; the per-directory
           aggregation @read_time is printed automatically on exit */
        printa("%@d\n", @read_time_all);
        printf("\n");
}

Then you just need to log on to a machine under load and run it.

In this test, I used a PostgreSQL setup with the main PostgreSQL data files on the OS drives and three tablespaces for the tables/indexes:

  • ts_main: Stored on a minimal RAID10 array of four Seagate SAS 300 GB 10krpm drives
  • ts_aux: Stored on the same drives as the OS
  • ts_ssd: Stored on a RAID1 mirror of two 32 GB Intel X25-E drives (this array also hosts the pg_xlog)

Here are the results after a few minutes of data collection (the value column is the time it took to complete the read, in ns):

root@painkiller:~/dtrace# ./pg-latency.d
^C

          value  ------------- Distribution ------------- count
           8192 |                                         0
          16384 |@                                        403
          32768 |@                                        223
          65536 |                                         6
         131072 |                                         49
         262144 |@@@@@@@@@                                3176
         524288 |@                                        247
        1048576 |@@                                       754
        2097152 |@@@                                      1071
        4194304 |@@@@@@@@@                                3390
        8388608 |@@@@@@@@                                 2962
       16777216 |@@@@                                     1566
       33554432 |@                                        545
       67108864 |                                         100
      134217728 |                                         20
      268435456 |                                         15
      536870912 |                                         5
     1073741824 |                                         2
     2147483648 |                                         0        

 /var/postgres/8.3/data/pg_clog
          value  ------------- Distribution ------------- count
           8192 |                                         0
          16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@               394
          32768 |@@@@@@@@@@@@@                            201
          65536 |                                         0
         131072 |                                         0
         262144 |                                         0
         524288 |                                         0
        1048576 |                                         0
        2097152 |                                         0
        4194304 |                                         1
        8388608 |                                         0        

 /var/postgres/8.3/data/ts_ssd/17946
          value  ------------- Distribution ------------- count
          16384 |                                         0
          32768 |                                         1
          65536 |                                         0
         131072 |                                         3
         262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          3049
         524288 |@@                                       211
        1048576 |@@@@@@                                   598
        2097152 |                                         32
        4194304 |                                         9
        8388608 |                                         6
       16777216 |                                         5
       33554432 |                                         5
       67108864 |                                         3
      134217728 |                                         2
      268435456 |                                         6
      536870912 |                                         1
     1073741824 |                                         0

 /var/postgres/8.3/data/ts_aux/17946
          value  ------------- Distribution ------------- count
         131072 |                                         0
         262144 |                                         2
         524288 |                                         0
        1048576 |@@                                       48
        2097152 |@@@@@@@@@                                290
        4194304 |@@@@@@@@@@@@@@@@@@                       570
        8388608 |@@@@@@@@                                 252
       16777216 |@@@                                      91
       33554432 |                                         7
       67108864 |                                         1
      134217728 |                                         2
      268435456 |                                         0        

 /var/postgres/8.3/data/ts_main/17946
          value  ------------- Distribution ------------- count
           8192 |                                         0
          16384 |                                         9
          32768 |                                         21
          65536 |                                         6
         131072 |                                         46
         262144 |@                                        125
         524288 |                                         36
        1048576 |                                         108
        2097152 |@@@                                      749
        4194304 |@@@@@@@@@@@@@                            2810
        8388608 |@@@@@@@@@@@@                             2704
       16777216 |@@@@@@@                                  1470
       33554432 |@@                                       533
       67108864 |                                         96
      134217728 |                                         16
      268435456 |                                         9
      536870912 |                                         4
     1073741824 |                                         2
     2147483648 |                                         0
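As an aside, the value column follows the power-of-two bucketing of DTrace's quantize() action: each row counts the reads whose completion time (in ns) fell between that value and twice that value. A minimal Python sketch of the same bucketing (the function name is mine, not DTrace's):

```python
import math
from collections import Counter

def quantize(values):
    """Bucket positive values into power-of-two bins, like DTrace's
    quantize(): a value v lands in the bucket [2^k, 2^(k+1)) where
    2^k is the largest power of two not exceeding v."""
    buckets = Counter()
    for v in values:
        buckets[1 << int(math.log2(v))] += 1
    return dict(buckets)

# Latencies in ns: 300000 and 500000 both fall in the 262144 (2^18)
# bucket, 5000000 falls in the 4194304 (2^22) bucket.
print(quantize([300_000, 500_000, 5_000_000]))
```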

Let's first focus on the first graph. You can clearly see three zones:

  • 16-32 us: This is very fast! Probably data served directly from the OS-level cache.
  • ~250 us: This spike is quite sharp. Not fast enough to come from the OS cache, and probably too fast to come from the RAID of spinning disks. So I'd guess these are the reads hitting the SSDs.
  • 1-32 ms: This is a pretty wide range. Most likely the HDDs' seek times mixed with queuing delays from other I/O operations.

OK, now that we have some hypotheses, we can confirm them (or not!) by looking at the detailed graphs.

The second graph shows the reads hitting the pg_clog directory of PostgreSQL. They are obviously all served from the OS cache, which is not surprising given the space/time locality of the data stored there.

The third graph shows all the reads to the tables/indexes stored in the SSD tablespace. And as predicted, they are the ones in the 250 us range. I'm not quite sure what's up with the "sub-spike" at around 1 ms though. Possibly the effect of queued commands that have to wait for the completion of other reads being issued... but that would require more investigation to be sure. Also note that those 250 us are not all due to latency! Even at 100 Mbytes/s, the transfer time alone already represents around 80 us...
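That 80 us figure is simple arithmetic, sketched here with a 100 MB/s sustained read rate as the assumption:

```python
# Rough check: how long does transferring a single 8 KiB PostgreSQL
# page take at an assumed ~100 Mbytes/s sustained read rate?
PAGE_SIZE = 8192        # bytes, PostgreSQL default page size
THROUGHPUT = 100e6      # bytes/s, assumed sustained rate

transfer_us = PAGE_SIZE / THROUGHPUT * 1e6
print(f"{transfer_us:.1f} us")  # ~82 us of the observed ~250 us
```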

Finally, the fourth and fifth graphs are quite similar, and are indeed all the reads hitting the spinning disks, centered around 4-8 ms. This matches the disks' specs.
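As a sanity check against those specs: at 10krpm, the average rotational latency is half a revolution; adding a typical average seek time (the ~4 ms figure below is an assumption, not from the drives' datasheet) lands squarely in the observed 4-8 ms band.

```python
# Average rotational latency of a 10krpm drive is half a revolution.
RPM = 10_000
avg_rotational_ms = 60_000 / RPM / 2   # 60000 ms per minute -> 3.0 ms

# Assumed typical average seek time for a 10k SAS drive of that era.
avg_seek_ms = 4.0

print(avg_rotational_ms + avg_seek_ms)  # ~7 ms average service time
```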

So, what's the conclusion ? Are the SSDs as good as they say ?

Well, as always, there is no unique, simple, universal answer! Yes, they are indeed very good at access times, but whether this really matters for the real-world performance of your application depends on your workload.

If you have a lot of random access in some huge table or index, the access time of the SSDs can give you a ~10-fold improvement! On the other hand, if you mostly have sequential scans, and if all your random accesses are confined to a set of pages small enough to fit in the PostgreSQL or OS cache, then you might not need them. Because, let's face it, at around 1000 EUR apiece, those X25-E are kinda expensive, and you might be better off with a lot of 15krpm spindles.
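The order of magnitude of that improvement follows directly from the measured distributions above; a back-of-the-envelope sketch, taking ~250 us and ~4 ms as rough per-read figures:

```python
# Back-of-the-envelope speed-up for purely random single-page reads,
# using rough central values from the distributions measured above.
ssd_read_s = 250e-6   # ~250 us per read on the X25-E mirror
hdd_read_s = 4e-3     # ~4 ms per read on the spinning disks

print(f"SSD: {1 / ssd_read_s:.0f} random reads/s")  # ~4000
print(f"HDD: {1 / hdd_read_s:.0f} random reads/s")  # ~250
print(f"speed-up: ~{hdd_read_s / ssd_read_s:.0f}x")  # ~16x
```

With the wider 4-8 ms disk band the ratio swings between roughly 16x and 32x, so "~10-fold" is if anything conservative for a fully random workload.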
