What Does PostgreSQL 9.5 Bring to Developers

Dev & Ops

It's been a while since PostgreSQL brought so many new things that are interesting to developers, like the version 9.5. We found a lot of features that will make our life easier and our code safer. Atomic UPSERT and skipping locked rows SELECT...SKIP LOCKED is definitely something that we were looking for for a long time.

UPSERT

UPSERT is probably one of the most wanted features in PostgreSQL. It will be used by developers who work on highly concurrent software. It brings us the opportunity to atomically execute insert or update commands based on a certain condition.

Let's prepare the table for testing the UPSERT command:

CREATE TABLE upsert_test (
    id SERIAL NOT NULL PRIMARY KEY,
    type int4 NOT NULL,
    number int4 NOT NULL,
    name text NOT NULL
);

ALTER TABLE upsert_test ADD CONSTRAINT upsert_test_unique UNIQUE (type, number);

Now we can try some basic UPSERT variations. Let's suppose we populated this table with two rows, so it looks like this:

 id | type | number |  name
----+------+--------+---------
  1 |    1 |      1 | Name 1
  2 |    1 |      2 | Name 2

According to the documentation, we can implement a behavior if a conflict happens during the INSERT in one of two ways: we can decide to DO NOTHING (skip insert) or execute UPDATE on the conflicted row. What is most important is that PostgreSQL will ensure our update is executed atomically with check (it can't happen that another transaction adds a conflicted row between check and insert). This is great because developers had to handle this on their own with some SQL gymnastics. Let's see the most common use case. We want to insert a new entry in the table, but in case of any constraint violation, we would skip that insert (with the assumption that another transaction has already inserted the wanted value):

INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT DO NOTHING;
INSERT 0 0

As we can see, the insert was finished without any errors, but no new values were inserted. It is important to properly handle this in code. Sometimes successful inserts are handled in code with the assumption that one row is inserted. Now we have to consider the case when there are no inserted rows because the conflict is ignored during the insert. What happens if we want to change the conflicted row:

-- If there is conflict on unique constraint on columns type and number, update conflicted row with new name
INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT (type, number) DO UPDATE SET name = EXCLUDED.name;
INSERT 0 1

It is clear from the example that if the conflicted row exists, it will be updated with the new name. It is important to know that we must have a unique index on columns type and number. If such an index doesn't exist, an error is raised. We can also address the exact constraint name, but it won't be wise. It is better to address exact columns, as we can change the constraint name in the future or create other constraints, with different order of columns. We can also add predicate to conflict the columns:

-- If there is conflict on unique constraint on columns type and number, update conflicted row with new name
INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT (type, number) WHERE type = 1 DO UPDATE SET name = EXCLUDED.name;

If we have predicate (in this case it is WHERE type = 1) and we don't have an exact index, PostgreSQL will try to find an index without predicate (non-partial index) and use it. This means that the conflict will be checked on all conflicts on type and number, not just those with type = 1! We can also add WHERE when updating a conflicted row:

-- If there is conflict on unique constraint on columns type and number, update conflicted row with new name
INSERT INTO upsert_test (type, number, name) VALUES (1, 1, 'Name 3') ON CONFLICT (type, number) DO UPDATE SET name = EXCLUDED.name WHERE type = 1;

In this example only rows with type = 1 will be updated. If the row doesn't satisfy that condition, it will be ignored. It is important to know that all conflicted rows will be locked, even if they are not updated because of the WHERE condition.

New options for GROUP BY

In PostgreSQL 9.5 we have a new option for GROUP BY and we can use it by creating grouping sets. There are three ways (options) for these grouping sets: GROUPING SETS, ROLLUP and CUBE. Let's suppose we have this employee table:

SELECT * FROM employee;

 id |    full_name     | age |  salary   | gender |      role
----+------------------+-----+-----------+--------+-----------------
  1 | Barry Sutton     |  31 |  60000.00 | Male   | Manager
  2 | Austin Jensen    |  26 |  52500.00 | Male   | Developer
  3 | Charles Robbins  |  48 |  80400.00 | Male   | Project Manager
  4 | Sylvester Carson |  22 |  45800.00 | Male   | Developer
  5 | Julius Gonzales  |  38 |  71000.00 | Male   | Sales
  6 | Lana Lewis       |  34 |  63700.00 | Female | Manager
  7 | Gail Ingram      |  30 |  54300.00 | Female | Developer
  8 | Delores Brock    |  24 |  48250.00 | Female | Sales
  9 | Whitney Mcguire  |  51 |  79800.00 | Female | Project Manager
 10 | Kristy Romero    |  55 | 102800.00 | Female | CEO
(10 rows)

If we want to see the average salary by all combinations of gender, role and age groups, we can do that as follows:

SELECT role, gender, (age / 10)::text || 'x' AS age_group, avg(salary) FROM employee GROUP BY GROUPING SETS (role, gender, age / 10, ());

      role       | gender | age_group |  avg
-----------------+--------+-----------+--------
 CEO             |        |           | 102800
 Developer       |        |           |  50867
 Manager         |        |           |  61850
 Project Manager |        |           |  80100
 Sales           |        |           |  59625
                 |        |           |  65855
                 |        | 2x        |  48850
                 |        | 3x        |  62250
                 |        | 4x        |  80400
                 |        | 5x        |  91300
                 | Female |           |  69770
                 | Male   |           |  61940
(12 rows)

We can see average salaries for each group and also overall (the line with empty values for role, gender and age). Apart from that, we can see the average salaries grouped by the combination of all three groups:

SELECT role, gender, (age / 10)::text || 'x' AS age_group, avg(salary) FROM employee GROUP BY CUBE (role, gender, age / 10);

      role       | gender | age_group |  avg
-----------------+--------+-----------+--------
 CEO             | Female | 5x        | 102800
 CEO             | Female |           | 102800
 CEO             |        |           | 102800
 Developer       | Female | 3x        |  54300
 Developer       | Female |           |  54300
 Developer       | Male   | 2x        |  49150
 Developer       | Male   |           |  49150
 Developer       |        |           |  50867
 Manager         | Female | 3x        |  63700
 Manager         | Female |           |  63700
 Manager         | Male   | 3x        |  60000
 Manager         | Male   |           |  60000
 Manager         |        |           |  61850
 Project Manager | Female | 5x        |  79800
 Project Manager | Female |           |  79800
 Project Manager | Male   | 4x        |  80400
 Project Manager | Male   |           |  80400
 Project Manager |        |           |  80100
 Sales           | Female | 2x        |  48250
 Sales           | Female |           |  48250
 Sales           | Male   | 3x        |  71000
 Sales           | Male   |           |  71000
 Sales           |        |           |  59625
                 |        |           |  65855
 Developer       |        | 2x        |  49150
 Sales           |        | 2x        |  48250
                 |        | 2x        |  48850
 Developer       |        | 3x        |  54300
 Manager         |        | 3x        |  61850
 Sales           |        | 3x        |  71000
                 |        | 3x        |  62250
 Project Manager |        | 4x        |  80400
                 |        | 4x        |  80400
 CEO             |        | 5x        | 102800
 Project Manager |        | 5x        |  79800
                 |        | 5x        |  91300
                 | Female | 2x        |  48250
                 | Female | 3x        |  59000
                 | Female | 5x        |  91300
                 | Female |           |  69770
                 | Male   | 2x        |  49150
                 | Male   | 3x        |  65500
                 | Male   | 4x        |  80400
                 | Male   |           |  61940
(44 rows)

We now see all possible combinations of values from columns that are mentioned in the GROUP BY statements. At the end, if we want to just group from left to right, we can use the ROLLUP option:

SELECT role, gender, (age / 10)::text || 'x' AS age_group, avg(salary) FROM employee GROUP BY ROLLUP (role, gender, age / 10);

      role       | gender | age_group |  avg
-----------------+--------+-----------+--------
 CEO             | Female | 5x        | 102800
 CEO             | Female |           | 102800
 CEO             |        |           | 102800
 Developer       | Female | 3x        |  54300
 Developer       | Female |           |  54300
 Developer       | Male   | 2x        |  49150
 Developer       | Male   |           |  49150
 Developer       |        |           |  50867
 Manager         | Female | 3x        |  63700
 Manager         | Female |           |  63700
 Manager         | Male   | 3x        |  60000
 Manager         | Male   |           |  60000
 Manager         |        |           |  61850
 Project Manager | Female | 5x        |  79800
 Project Manager | Female |           |  79800
 Project Manager | Male   | 4x        |  80400
 Project Manager | Male   |           |  80400
 Project Manager |        |           |  80100
 Sales           | Female | 2x        |  48250
 Sales           | Female |           |  48250
 Sales           | Male   | 3x        |  71000
 Sales           | Male   |           |  71000
 Sales           |        |           |  59625
                 |        |           |  65855
(24 rows)

SELECT ... SKIP LOCKED

As opposite to NOWAIT, SKIP LOCKED is added as an option in SELECT query. The NOWAIT option (from previous versions of PostgreSQL) simply raises an error if a row can't be locked. SKIP LOCKED, on the other hand, just skips that row. This is very useful in some highly concurrent environments when we are selecting a certain number of rows, but we can live without some of them (for example selecting some number of random rows to be shown to the user).

SELECT ... TABLESAMPLE

The TABLESAMPLE option allows us to get just one percent of the table. This is particularly useful for creating statistical analyses on large tables. By default, there are two methodologies of choosing a random part of a table: SYSTEM and  BERNOULLI. SYSTEM gets one (or more) physical parts of a table (pages on disk) and returns them. If our inserts are evenly distributed, this is good enough for us, but if they are not we should use BERNOULLI method. It scans the entire table and returns random rows. It is slower, especially when we get lower percent of entries from the table. For example, if we have a table with one million entries and we get one percent of rows, the SYSTEM method (in my testing environment) lasts about 7 ms and BERNOULLI about 43 ms. If we want 50 percent of entries, the SYSTEM method lasts about 292 ms and BERNOULLI about 331 ms. We can pass random seed as a parameter and, if the table did not change in the meantime, then the same row set is returned each time. If we omit a random seed then a new one is generated on each call. For example:

SELECT * FROM employees TABLESAMPLE BERNOULLI (1);

UPDATE multiple columns from SELECT sub-query

To update multiple columns with a sub-select query returning a row with the same number and types of columns, you can now do this:

UPDATE executors ex
SET (full_name, role) = (
  SELECT em.name, em.role
  FROM employee em
  WHERE em.employee_id = ex.employee_id
);

In the above query, two columns (full_name and role) are updated with the result of the sub-query. There is no need to create two separate assignments, and keep the result of the sub-query in temporal variables, or use the WITH statement.

Suggest misspelled column names

This small addition is very useful, especially with bad autocomplete on the default psql client. Very often, we misspell column names when executing some queries:

SELECT id, nam FROM employees;
ERROR:  column "nam" does not exist
LINE 1: select id, nam from employees;
                   ^
HINT:  Perhaps you meant to reference the column "employees.name".

Unfortunately, if we misspell more than one column name, we will be notified about the first one only, so we have to execute the same query again to see the issue in subsequent parts of the query.

DDL changes

The most important is that IF NOT EXISTS clause is added to CREATE TABLE AS, CREATE INDEX, CREATE SEQUENCE and CREATE MATERIALIZED VIEW, and IF EXISTS clause is added to ALTER TABLE...RENAME CONSTRAINT.

Changes in indexes

Block Range Index (BRIN) is added to PostgreSQL 9.5. BRIN index is useful for columns whose values do have some correlation with the location in the table (physical location on pages on a disk). For example, if we have the time of row insertion, most of the time rows with similar time will be inserted in similar periods and thus will be on the same pages on the disk. This is very useful for huge tables with large B-TREE indexes. The B-TREE index is still much faster as BRIN index sometimes fetches data for pages that don't contain the searched value. On the other hand, they are much smaller and faster to create. Which index will be used is just a question of compromise and use case of that particular table. We can compare some values from a table with 66 million rows:

  BRIN B-Tree
Index size 136 KB 1414 MB
Duration 4.5 ms 0.374 ms

This is an analysis for counting the number of rows with a certain int value. The query without index lasts about 11.6 seconds. This is a big difference in size and in performance. With these numbers, the BRIN index will definitely be used for huge tables and queries that are executed occasionally. It is important to make sure that the index is maintained regularly. If new rows are added with existing range values (the values are already in existing rows), then the index is maintained as expected — the new page is added to existing set of pages. On the other hand, if a new range is added (for example with a value greater than any of the previously inserted), the index must be maintained manually. This can be done by calling VACUUM on the table, or by calling the brin_summarize_new_pages(regclass) function.

Sorting optimization

According to the release notes for PostgreSQL 9.5, sorting should be significantly faster than before. I tried some tests and it is obvious that sorting is faster, especially for non-indexed tables. In one specific example, I sorted 1 million random strings with 32 characters. On PostgreSQL 9.4 it lasted for 6.2 seconds and on PostgreSQL 9.5, 3.4 seconds. On indexed columns, this improvement is less visible but it exists — it is about 5%.

Utilities

The most notable change in tools is that now vacuumdb has a -j option with the number of jobs to be executed in parallel:

vacuumdb -j 8

Note that each job creates its own connection to DB, so be sure the max number of connections is greater than the number of jobs.

Petar Partlov

Petar Partlov

Software Developer

April 11, 2016