Introduction
Many articles deal with promoting a PostgreSQL standby server to primary server using the command pg_ctl promote.
Many of them conclude that the standby system should be rebuilt when the primary server is back.
What about a scheduled maintenance on the primary server including shutdowns? Do we have to rebuild the primary/standby system? Obviously the answer is: NO.
Assuming there is enough disk space for WAL file retention during the maintenance task, if the replication slot and the recovery option recovery_target_timeline are cautiously managed during the failover/failback procedures, there is no need to rebuild the whole standby system, which would imply backing up/restoring databases and/or using pg_rewind.
For more information about installing and configuring a standby server with PostgreSQL streaming replication, see: Streaming replication with PostgreSQL 9.6, 10 & 11 - PostgreSQL Standby Databases
Client applications connect to the database using a logical server name LSRVSQLPAC (virtual IP address) on port 30001.
The DNS server resolves the right physical address depending on the context (failover, failback…).
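For instance, client connection strings only reference the logical name; the database and user names below are hypothetical:
psql "host=lsrvsqlpac port=30001 dbname=appdb user=appuser"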
Before performing the switch, servers must be checked and prepared.
Let’s see how to properly fail over and fail back a PostgreSQL server for a scheduled maintenance on the primary server, without losing any transaction and without any need to rebuild.
On each server, the context is the following:
Operating System: Ubuntu 18.04
Binaries (PostgreSQL 9.6.15): /opt/postgres/pgsql-9.6/bin
$PATH: /opt/postgres/pgsql-9.6/bin:$PATH
$PGLIB: /opt/postgres/pgsql-9.6/lib
Port: 30001
$PGDATA: /sqlpac/postgres/srvpg1
$CFG: /opt/postgres/dba/srvpg1/cfg
Configuration files:
$CFG/postgresql.conf
$CFG/pg_hba.conf
$CFG/pg_ident.conf
Controlling the PostgreSQL server: pg_ctl <start|stop|restart|reload> -D $CFG
Replication role user: repmgr
Replication slot name: standby1
The procedure below is only valid for PostgreSQL versions 9, 10 and 11. For versions 12 and onwards, the procedure is slightly different and not covered here.
Preparing the primary server VPSFRSQLPAC1 to become a standby server
In advance, on the primary server, set the parameter hot_standby to on (this parameter is ignored when it is a primary server).
VPSFRSQLPAC1 : postgresql.conf
hot_standby = on
Preparing the standby server VPSFRSQLPAC2 to become a primary server
On the standby server, check that the parameters needed to act as a primary server are appropriately set (minimal values…):
VPSFRSQLPAC2 : postgresql.conf
listen_addresses = '*'
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3
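These settings can be quickly verified from psql before going further, for example with a simple query on pg_settings:
postgres@vpsfrsqlpac2$ psql -p 30001
select name, setting, pending_restart
from pg_settings
where name in ('listen_addresses','wal_level','max_wal_senders','max_replication_slots','hot_standby');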
If some of these static parameters have to be adjusted, modify the values and restart the standby server:
postgres@vpsfrsqlpac2$ pg_ctl restart -D $CFG
Check that the replication role is defined in the standby server authentication file pg_hba.conf with the IP address of the current primary server: this entry will allow connections from VPSFRSQLPAC1 when it becomes a standby server in a later step.
VPSFRSQLPAC2 : pg_hba.conf
host replication repmgr 79.xxx.xxx.xxx/32 md5
Reload the configuration if the entry had to be added:
postgres@vpsfrsqlpac2$ pg_ctl reload -D $CFG
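Optionally, one possible quick test of this entry is to open a replication connection from VPSFRSQLPAC1 with the repmgr role; this sketch assumes the libpq replication connection parameter is available and that the password is supplied interactively or through ~/.pgpass:
postgres@vpsfrsqlpac1$ psql "host=vpsfrsqlpac2 port=30001 user=repmgr replication=1" -c "IDENTIFY_SYSTEM"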
Failover
Failover - Step 1: Stop the applications, back up the database and then stop the database on primary server VPSFRSQLPAC1
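Once the applications are stopped, a backup can be taken before the shutdown, for example a full logical dump with pg_dumpall (the target file name below is hypothetical):
postgres@vpsfrsqlpac1$ pg_dumpall -p 30001 -f /sqlpac/backup/srvpg1_before_maintenance.sql
The database can then be stopped: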
postgres@vpsfrsqlpac1$ pg_ctl stop -D $CFG
At this stage, deactivate all scripts that could restart the database on VPSFRSQLPAC1 (startup shell scripts…). The maintenance can now be performed on the server VPSFRSQLPAC1.
Failover - Step 2: Promote the standby server to primary server
Use pg_ctl promote to promote the standby server VPSFRSQLPAC2 to primary server.
postgres@vpsfrsqlpac2$ pg_ctl promote -D $CFG
server promoting
Check the log file:
2019-11-22 16:23:20.394 CET LOG: received promote request
2019-11-22 16:23:20.394 CET LOG: redo done at 0/3016198
2019-11-22 16:23:20.394 CET LOG: last completed transaction was at log time 2019-11-22 16:22:12.212852+01
2019-11-22 16:23:20.407 CET LOG: selected new timeline ID: 2
2019-11-22 16:23:20.600 CET LOG: archive recovery complete
2019-11-22 16:23:20.620 CET LOG: MultiXact member wraparound protections are now enabled
2019-11-22 16:23:20.623 CET LOG: database system is ready to accept connections
2019-11-22 16:23:20.623 CET LOG: autovacuum launcher started
A new timeline id is applied when promoting: "selected new timeline ID: 2" appears in the log file.
postgres@vpsfrsqlpac2$ psql -p 30001
select substr(pg_xlogfile_name(pg_current_xlog_location()), 1, 8);
  substr
----------
 00000002
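The timeline switch is also materialized by a small history file in the WAL directory of the new primary server (pg_xlog in version 9.6, pg_wal from version 10):
postgres@vpsfrsqlpac2$ ls $PGDATA/pg_xlog/*.history
/sqlpac/postgres/srvpg1/pg_xlog/00000002.history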
The file recovery.conf in the directory $PGDATA is renamed to recovery.done when the command pg_ctl promote has run.
When the switch to VPSFRSQLPAC2 is performed for the first time, the replication slot standby1 does not exist ($PGDATA/pg_replslot).
select slot_name, restart_lsn from pg_replication_slots;
 slot_name | restart_lsn
-----------+-------------
(0 rows)
In this context, immediately create an active replication slot standby1 on the new primary server:
postgres@vpsfrsqlpac2$ psql -p 30001
select * from pg_create_physical_replication_slot('standby1',true);
 slot_name | xlog_position
-----------+---------------
 standby1  | 0/3016318
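A quick query confirms that the slot exists and reserves WAL; the active column will switch to t once the future standby server connects:
select slot_name, active, restart_lsn from pg_replication_slots;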
The former standby server is now the primary server.
Failover - Step 3: Switch applications to the new primary server
The logical address LSRVSQLPAC switches to the VPSFRSQLPAC2 physical address in the DNS server. Applications can now connect.
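The switch can be verified with a simple name resolution check, which should now return the VPSFRSQLPAC2 physical address (the exact tooling depends on the DNS infrastructure):
dig +short lsrvsqlpac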
Failover - Step 4: Convert the former primary server to standby server
The former primary server VPSFRSQLPAC1 is now back and the maintenance task is completed.
At this stage, do not start the database: some configuration must be performed first!
On the former primary server, create a file recovery.conf in the directory $PGDATA.
VPSFRSQLPAC1 : $PGDATA/recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=repmgr host=vpsfrsqlpac2 port=30001 password=***********!'
primary_slot_name = 'standby1'
recovery_target_timeline = 'latest'
- The standby mode is set to on.
- Connection information to the primary server VPSFRSQLPAC2 is stored in the parameter primary_conninfo.
- The replication slot name previously recreated on VPSFRSQLPAC2 is specified (primary_slot_name).
- The parameter recovery_target_timeline is set to latest: this is important as the timeline has changed from 1 to 2 on the new primary server. If not set to latest, recovery does not automatically switch to the new timeline id.
When the file recovery.conf is properly defined, start the former primary server, check the log file and test replication.
At startup, VPSFRSQLPAC1 replays the WAL files generated on VPSFRSQLPAC2 since its promotion to primary status. VPSFRSQLPAC1 becomes the standby server.
postgres@vpsfrsqlpac1$ pg_ctl start -D $CFG
2019-11-22 18:33:09.163 CET LOG: entering standby mode
2019-11-22 18:33:09.176 CET LOG: consistent recovery state reached at 0/3016208
2019-11-22 18:33:09.176 CET LOG: invalid record length at 0/3016208: wanted 24, got 0
2019-11-22 18:33:09.177 CET LOG: database system is ready to accept read only connections
2019-11-22 18:33:09.203 CET LOG: fetching timeline history file for timeline 2 from primary server
2019-11-22 18:33:09.211 CET LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2019-11-22 18:33:09.215 CET LOG: replication terminated by primary server
2019-11-22 18:33:09.215 CET DETAIL: End of WAL reached on timeline 1 at 0/3016208.
2019-11-22 18:33:09.216 CET LOG: new target timeline is 2
2019-11-22 18:33:09.217 CET LOG: restarted WAL streaming at 0/3000000 on timeline 2
2019-11-22 18:33:09.579 CET LOG: redo starts at 0/3016208
Wait for replication to be fully synced.
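Replication status can be monitored from the new primary server VPSFRSQLPAC2; the column names below are the PostgreSQL 9.6 ones (they are suffixed _lsn from version 10 onwards):
postgres@vpsfrsqlpac2$ psql -p 30001
select client_addr, state, sent_location, write_location, flush_location, replay_location
from pg_stat_replication;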
If the parameter recovery_target_timeline is not set to latest, recovery loops with the error below when reaching the end of timeline id 1 and does not switch to timeline id 2:
LOG: started streaming WAL from primary at 0/36000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/362E3B70.
LOG: restarted WAL streaming at 0/36000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/362E3B70.
In this case, simply restart the database with recovery_target_timeline set to latest: it is not an unrecoverable error for the database.
Now VPSFRSQLPAC1 is the standby server. Automatic database startup scripts can be reactivated on VPSFRSQLPAC1.
Failback
The failback procedure is similar to the failover procedure.
Failback - Step 1: Stop the applications, back up the database and then stop it on primary server VPSFRSQLPAC2
postgres@vpsfrsqlpac2$ pg_ctl stop -D $CFG
Deactivate all scripts that could restart the database on VPSFRSQLPAC2 (startup shell scripts…).
Failback - Step 2: Promote the standby server to primary server
Use pg_ctl promote to promote the standby server VPSFRSQLPAC1 to primary server.
postgres@vpsfrsqlpac1$ pg_ctl promote -D $CFG
server promoting
Check the log file:
2019-11-22 18:59:53.743 CET LOG: received promote request
2019-11-22 18:59:53.743 CET LOG: redo done at 0/E98A7988
2019-11-22 18:59:53.743 CET LOG: last completed transaction was at log time 2019-11-22 18:40:13.347683+01
2019-11-22 18:59:53.752 CET LOG: selected new timeline ID: 3
2019-11-22 18:59:54.074 CET LOG: archive recovery complete
2019-11-22 18:59:54.092 CET LOG: MultiXact member wraparound protections are now enabled
2019-11-22 18:59:54.122 CET LOG: database system is ready to accept connections
2019-11-22 18:59:54.123 CET LOG: autovacuum launcher started
The timeline id is incremented by 1 when promoting: "selected new timeline ID: 3" appears in the log file.
The file recovery.conf in the directory $PGDATA is renamed to recovery.done by pg_ctl promote.
When the failback to VPSFRSQLPAC1 is performed, the replication slot standby1 already exists ($PGDATA/pg_replslot), but it is obsolete as it was the replication slot used before the switch to VPSFRSQLPAC2.
select slot_name, restart_lsn from pg_replication_slots;
 slot_name | restart_lsn
-----------+-------------
 standby1  | 0/3016198
Immediately recreate an active replication slot standby1 on the new primary server:
postgres@vpsfrsqlpac1$ psql -p 30001
select * from pg_drop_replication_slot('standby1');
select * from pg_create_physical_replication_slot('standby1',true);
 slot_name | xlog_position
-----------+---------------
 standby1  | 0/E98A7B08
VPSFRSQLPAC1 is again the primary server.
Failback - Step 3: Switch applications to the primary server
The logical address LSRVSQLPAC switches to the VPSFRSQLPAC1 physical address in the DNS server. Applications can now connect.
Failback - Step 4: Convert VPSFRSQLPAC2 to standby server
Now, VPSFRSQLPAC2 must become the standby server again.
At this stage, do not start the database on VPSFRSQLPAC2.
On VPSFRSQLPAC2, rename the file recovery.done to recovery.conf in the directory $PGDATA.
Add the parameter recovery_target_timeline set to latest if this parameter did not exist in the original recovery.conf file, otherwise the switch to timeline ID 3 won't be performed automatically during recovery.
VPSFRSQLPAC2 : $PGDATA/recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=repmgr host=vpsfrsqlpac1 port=30001 password=***********!'
primary_slot_name = 'standby1'
recovery_target_timeline = 'latest'
When the file recovery.conf is properly defined, start the database on VPSFRSQLPAC2, check the log file and test replication.
At startup, VPSFRSQLPAC2 replays the WAL files generated on VPSFRSQLPAC1 since its promotion to primary status and VPSFRSQLPAC2 becomes the standby server again.
postgres@vpsfrsqlpac2$ pg_ctl start -D $CFG
2019-11-22 19:05:17.799 CET LOG: entering standby mode
2019-11-22 19:05:17.811 CET LOG: consistent recovery state reached at 0/E98A79F8
2019-11-22 19:05:17.812 CET LOG: database system is ready to accept read only connections
2019-11-22 19:05:17.812 CET LOG: invalid record length at 0/E98A79F8: wanted 24, got 0
2019-11-22 19:05:17.824 CET LOG: fetching timeline history file for timeline 3 from primary server
2019-11-22 19:05:17.835 CET LOG: started streaming WAL from primary at 0/E9000000 on timeline 2
2019-11-22 19:05:18.568 CET LOG: replication terminated by primary server
2019-11-22 19:05:18.568 CET DETAIL: End of WAL reached on timeline 2 at 0/E98A79F8.
2019-11-22 19:05:18.570 CET LOG: new target timeline is 3
2019-11-22 19:05:18.571 CET LOG: restarted WAL streaming at 0/E9000000 on timeline 3
2019-11-22 19:05:19.524 CET LOG: redo starts at 0/E98A79F8
Automatic database startup scripts can be reactivated on VPSFRSQLPAC2.
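As a final sanity check, the standby server VPSFRSQLPAC2 can report its last received and replayed locations, which should match the current location on VPSFRSQLPAC1 when the system is idle (9.6 function names, renamed pg_last_wal_*_lsn from version 10):
postgres@vpsfrsqlpac2$ psql -p 30001
select pg_last_xlog_receive_location(), pg_last_xlog_replay_location();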
Situation is back to normal.
Conclusion
Scheduled failover/failback procedures with PostgreSQL streaming replication are very easy.
Assuming there is enough disk space for WAL file retention during the failover/failback tasks, managing the steps and the replication slot carefully guarantees consistency and no data loss, and above all there is no need for a full rebuild requiring database backup/restore or other painful methods (pg_rewind…).
Through this use case, it appears to be a good practice to (re)create the active replication slot as soon as the standby server is promoted to the primary role.
If there is not enough disk space for WAL file retention during the maintenance task, the procedure is slightly different and not covered here, but in a few words: WAL archiving must be active and each standby PostgreSQL server must be able to read the archives generated by its primary server.
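As a minimal sketch of that alternative (the archive directory below is hypothetical and must be readable by the standby server), WAL archiving would be enabled in postgresql.conf and the archives restored through recovery.conf:
# postgresql.conf (server holding the primary role)
archive_mode = on
archive_command = 'cp %p /sqlpac/archives/srvpg1/%f'
# recovery.conf (server holding the standby role)
restore_command = 'cp /sqlpac/archives/srvpg1/%f %p'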