欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  IT编程

Postgresql 如何选择正确的关闭模式

程序员文章站 2023-03-07 14:10:02
停止数据库的命令:pg_ctl stop -d $pgdata [-m shutdown-mode]shutdown-mode有如下几种模式:1. smart: 等所有的连接中止后,关闭数据库。如果客...

停止数据库的命令:

pg_ctl stop -d $pgdata [-m shutdown-mode]

shutdown-mode有如下几种模式:

1. smart: 等所有的连接中止后,关闭数据库。如果客户端连接不终止, 则无法关闭数据库。

开启一个空会话:

[root@localhost ~]# su - postgres
[postgres@localhost ~]$ psql
psql (9.4.4)
type "help" for help.
 
postgres=#

用smart关闭数据库:

[postgres@localhost ~]$ pg_ctl stop -d $pgdata -m smart
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
hint: the "-m fast" option immediately disconnects sessions rather than
waiting for session-initiated disconnection

2. fast: 快速关闭数据库, 断开客户端的连接,让已有的事务回滚,然后正常关闭数据库。

[postgres@localhost ~]$ pg_ctl stop -d $pgdata -m fast
waiting for server to shut down.... done
server stopped

查看关闭日志:

log: received fast shutdown request
log: aborting any active transactions
fatal: terminating connection due to administrator command
log: shutting down
log: database system is shut down

会话被强制中断,然后关闭数据库。

起一个事务,然后测试关闭:

postgres=# create table t(id int primary key, name varchar(9));
create table
postgres=# begin;
begin
postgres=# insert into t values(1,'a')
postgres-# ;
insert 0 1

不提交, 然后用fast mode去关闭数据库:

[postgres@localhost ~]$ pg_ctl stop -d $pgdata -m fast
waiting for server to shut down.... done
server stopped

查看日志:

log: received fast shutdown request
log: aborting any active transactions
log: autovacuum launcher shutting down
fatal: terminating connection due to administrator command
log: shutting down
log: database system is shut down

同样是直接中断会话, 而不去管事务有没有提交。

postgres=# select * from t;
id | name
----+------
(0 rows)

没有提交的数据, 在重启之后并不能查到。

3. immediate: 立即关闭数据库,立即停止数据库进程,直接退出,下次启动时会进行实例恢复。

postgres=# insert into t values(1,'a')
;
insert 0 1
postgres=# select * from t;
id | name
----+------
1 | a
(1 row)

关闭数据库:

[postgres@localhost ~]$ pg_ctl stop -d $pgdata -m immediate
waiting for server to shut down.... done
server stopped

查看日志:

log: received immediate shutdown request
warning: terminating connection because of crash of another server process
detail: the postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
hint: in a moment you should be able to reconnect to the database and repeat your command.
warning: terminating connection because of crash of another server process
detail: the postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
hint: in a moment you should be able to reconnect to the database and repeat your command.

启动数据库:

[postgres@localhost ~]$ pg_ctl -d /apps/pgsql/pgdata -l 1.log start
server starting

查看日志:

log: database system was interrupted; last known up at 2017-04-27 18:56:47 pdt
log: database system was not properly shut down; automatic recovery in progress #提示非正常关机,自动开启恢复。
log: redo starts at 0/181f910
log: record with zero length at 0/181fa90
log: redo done at 0/181fa60
log: last completed transaction was at log time 2017-04-27 18:59:13.727213-07
log: multixact member wraparound protections are now enabled
log: autovacuum launcher started
log: database system is ready to accept connections

查看数据:

[postgres@localhost ~]$ psql
psql (9.4.4)
type "help" for help.
 
postgres=# select * from t;
id | name
----+------
1 | a
(1 row)

提交的数据已通过实例恢复。

小结:

对比以上三种关库模式:

smart最为安全,但最慢, 需要将所有连接都断开后,才会关库,默认关库模式。

fast强制中断会话,而不管有操作有没有提交,在做系统维护(系统维护时一般应用都正常关闭了,或者不再会有事务操作。)时,需要这种模式来关闭数据库。

immediate最暴力的方式,不管数据有没有落盘(posgre是遵循wal机制),就直接关掉, 待启动时进行实例恢复, 如果在关库前有大量的事务没有写入磁盘, 那这个恢复过程可能会非常的漫长。

补充:postgresql 异步 stream replication 环境关闭 master 的验证

os: ubuntu 16.04

db: postgresql 9.6.8

验证在异步 stream replication环境下,主动关闭master时,数据是否有丢失,能丢失多少。

版本

# lsb_release -a
no lsb modules are available.
distributor id: ubuntu
description: ubuntu 16.04.5 lts
release: 16.04
codename: xenial
# su - postgres -c "psql -c \"select version();\""
                 version                 
----------------------------------------------------------------------------------------------------------------------------------------------
 postgresql 9.6.8 on x86_64-pc-linux-gnu (ubuntu 9.6.8-1.pgdg16.04+1), compiled by gcc (ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
(1 row)

用 pgbench 模拟数据库的大量数据操作

postgres=# create database pgbenchdb;
# su - postgres
$ pgbench -i -s 20 pgbenchdb
$ pgbench -r -j2 -c4 -t300 pgbenchdb

关闭 master

# su - postgres
$ /usr/lib/postgresql/9.6/bin/pg_ctl stop -m fast -d "/data/pg9.6/main"

提升 slave

# su - postgres
$ /usr/lib/postgresql/9.6/bin/pg_ctl promote -d "/data/pg9.6/main"

查看 old master 的 xlog location

$ /usr/lib/postgresql/9.6/bin/pg_xlogdump 000000010000000000000016
rmgr: xlog  len (rec/tot): 106/ 106, tx:   0, lsn: 0/16000028, prev 0/152c9a10, desc: checkpoint_shutdown redo 0/16000028; tli 1; prev tli 1; fpw true; xid 0:118746; oid 16432; multi 1; offset 0; oldest xid 543 in db 1; oldest multi 1 in db 1; oldest/newest commit timestamp xid: 550/118745; oldest running xid 0; shutdown
pg_xlogdump: fatal: error in wal record at 0/16000028: invalid record length at 0/16000098: wanted 24, got 0

可以看到 lsn: 0/16000028, prev 0/152c9a10, desc: checkpoint_shutdown redo 0/16000028;

查看 new master 的 .history文件

$ ls -lt|more
total 360456
-rw------- 1 postgres postgres 16777216 nov 30 10:32 000000020000000000000016
drwx------ 2 postgres postgres  4096 nov 30 10:16 archive_status
-rw------- 1 postgres postgres  42 nov 30 10:16 00000002.history
-rw------- 1 postgres postgres 16777216 nov 30 10:16 000000010000000000000016.partial
-rw------- 1 postgres postgres 16777216 nov 30 10:16 000000010000000000000015
-rw------- 1 postgres postgres 16777216 nov 30 10:16 000000010000000000000014
-rw------- 1 postgres postgres 16777216 nov 30 10:05 000000010000000000000013
$ cat 00000002.history 
1 0/16000098 no recovery target specified
$ /usr/lib/postgresql/9.6/bin/pg_xlogdump 000000010000000000000016
rmgr: xlog  len (rec/tot): 106/ 106, tx:   0, lsn: 0/16000028, prev 0/152c9a10, desc: checkpoint_shutdown redo 0/16000028; tli 1; prev tli 1; fpw true; xid 0:118746; oid 16432; multi 1; offset 0; oldest xid 543 in db 1; oldest multi 1 in db 1; oldest/newest commit timestamp xid: 550/118745; oldest running xid 0; shutdown
rmgr: xlog  len (rec/tot):  42/ 42, tx:   0, lsn: 0/16000098, prev 0/16000028, desc: end_of_recovery tli 2; prev tli 1; time 2018-11-30 10:16:57.249408 cst
rmgr: standby  len (rec/tot):  50/ 50, tx:   0, lsn: 0/160000c8, prev 0/16000098, desc: running_xacts nextxid 118746 latestcompletedxid 118745 oldestrunningxid 118746
rmgr: xlog  len (rec/tot):  51/ 312, tx:   0, lsn: 0/16000100, prev 0/160000c8, desc: fpi_for_hint , blkref #0: rel 1664/0/1260 blk 0 fpw
rmgr: standby  len (rec/tot):  50/ 50, tx:   0, lsn: 0/16000238, prev 0/16000100, desc: running_xacts nextxid 118746 latestcompletedxid 118745 oldestrunningxid 118746
pg_xlogdump: fatal: error in wal record at 0/16000238: invalid record length at 0/16000270: wanted 24, got 0

可以看到关键记录

lsn: 0/16000028, prev 0/152c9a10, desc: checkpoint_shutdown redo 0/16000028
lsn: 0/16000098, prev 0/16000028, desc: end_of_recovery

而 end_of_recovery 对应的 lsn 为 0/16000098,和 00000002.history 时间线文件的内容完全一致。

所以在异步 stream replication 环境下,主动关闭master时,会将最后一条记录(checkpoint_shutdown)发送给slave,不会造成数据的丢失。

而 synchronous_commit = on 保证事务有两份持久化的落盘数据。

分析 pg_log 日志

old master 上的最后几条日志

2018-11-30 10:16:40.986 cst,"postgres","pgbenchdb",7559,"[local]",5c009d79.1d87,4,"update waiting",2018-11-30 10:16:25 cst,,0,log,00000,"disconnection: session time: 0:00:15.723 user=postgres database=pgbenchdb host=[local]",,,,,,,,,"pgbench"
2018-11-30 10:16:40.993 cst,"postgres","pgbenchdb",7558,"[local]",5c009d79.1d86,3,"idle",2018-11-30 10:16:25 cst,4/0,0,fatal,57p01,"terminating connection due to administrator command",,,,,,,,,"pgbench"
2018-11-30 10:16:40.994 cst,"postgres","pgbenchdb",7560,"[local]",5c009d79.1d88,3,"idle",2018-11-30 10:16:25 cst,5/0,0,fatal,57p01,"terminating connection due to administrator command",,,,,,,,,"pgbench"
2018-11-30 10:16:40.994 cst,"postgres","pgbenchdb",7558,"[local]",5c009d79.1d86,4,"idle",2018-11-30 10:16:25 cst,,0,log,00000,"disconnection: session time: 0:00:15.729 user=postgres database=pgbenchdb host=[local]",,,,,,,,,"pgbench"
2018-11-30 10:16:40.994 cst,"postgres","pgbenchdb",7560,"[local]",5c009d79.1d88,4,"idle",2018-11-30 10:16:25 cst,,0,log,00000,"disconnection: session time: 0:00:15.725 user=postgres database=pgbenchdb host=[local]",,,,,,,,,"pgbench"
2018-11-30 10:16:40.999 cst,"postgres","pgbenchdb",7561,"[local]",5c009d79.1d89,3,"idle",2018-11-30 10:16:25 cst,6/0,0,fatal,57p01,"terminating connection due to administrator command",,,,,,,,,"pgbench"
2018-11-30 10:16:41.001 cst,"postgres","pgbenchdb",7561,"[local]",5c009d79.1d89,4,"idle",2018-11-30 10:16:25 cst,,0,log,00000,"disconnection: session time: 0:00:15.731 user=postgres database=pgbenchdb host=[local]",,,,,,,,,"pgbench"
2018-11-30 10:16:41.010 cst,,,7156,,5c009735.1bf4,7,,2018-11-30 09:49:41 cst,,0,log,00000,"shutting down",,,,,,,,,""
2018-11-30 10:16:41.209 cst,,,7156,,5c009735.1bf4,8,,2018-11-30 09:49:41 cst,,0,log,00000,"checkpoint starting: shutdown immediate",,,,,,,,,""
2018-11-30 10:16:47.623 cst,,,7156,,5c009735.1bf4,9,,2018-11-30 09:49:41 cst,,0,log,00000,"checkpoint complete: wrote 29357 buffers (89.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=5.931 s, sync=0.399 s, total=6.418 s; sync files=53, longest=0.179 s, average=0.007 s; distance=311294 kb, estimate=311294 kb",,,,,,,,,""
2018-11-30 10:16:47.683 cst,"repl","",7227,"192.168.56.90:52556",5c009795.1c3b,3,"streaming 0/16000098",2018-11-30 09:51:17 cst,,0,log,00000,"disconnection: session time: 0:25:30.149 user=repl database= host=192.168.56.90 port=52556",,,,,,,,,"walreceiver"
2018-11-30 10:16:47.730 cst,,,7153,,5c009735.1bf1,5,,2018-11-30 09:49:41 cst,,0,log,00000,"database system is shut down",,,,,,,,,""

注意倒数第二条信息 streaming 0/16000098 ,说明当时的master关闭时,已经和salve沟通过,确认已经接收到 end_of_recovery 之前所有的数据了。

old slave 日志

2018-11-30 10:16:47.660 cst,,,7256,,5c009795.1c58,2,,2018-11-30 09:51:17 cst,,0,log,00000,"replication terminated by primary server","end of wal reached on timeline 1 at 0/16000098.",,,,,,,,""
2018-11-30 10:16:47.660 cst,,,7256,,5c009795.1c58,3,,2018-11-30 09:51:17 cst,,0,fatal,xx000,"could not send end-of-streaming message to primary: no copy in progress
",,,,,,,,,""
2018-11-30 10:16:47.660 cst,,,7255,,5c009795.1c57,5,,2018-11-30 09:51:17 cst,1/0,0,log,00000,"invalid record length at 0/16000098: wanted 24, got 0",,,,,,,,,""
2018-11-30 10:16:47.854 cst,,,7443,,5c009d8f.1d13,1,,2018-11-30 10:16:47 cst,,0,fatal,xx000,"could not connect to the primary server: could not connect to server: connection refused
 is the server running on host ""192.168.56.119"" and accepting
 tcp/ip connections on port 5432?
",,,,,,,,,""
2018-11-30 10:16:52.668 cst,,,7444,,5c009d94.1d14,1,,2018-11-30 10:16:52 cst,,0,fatal,xx000,"could not connect to the primary server: could not connect to server: connection refused
 is the server running on host ""192.168.56.119"" and accepting
 tcp/ip connections on port 5432?
",,,,,,,,,""
2018-11-30 10:16:56.875 cst,,,7255,,5c009795.1c57,6,,2018-11-30 09:51:17 cst,1/0,0,log,00000,"received promote request",,,,,,,,,""
2018-11-30 10:16:56.875 cst,,,7255,,5c009795.1c57,7,,2018-11-30 09:51:17 cst,1/0,0,log,00000,"redo done at 0/16000028",,,,,,,,,""
2018-11-30 10:16:56.875 cst,,,7255,,5c009795.1c57,8,,2018-11-30 09:51:17 cst,1/0,0,log,00000,"last completed transaction was at log time 2018-11-30 10:16:40.986869+08",,,,,,,,,""
2018-11-30 10:16:56.888 cst,,,7255,,5c009795.1c57,9,,2018-11-30 09:51:17 cst,1/0,0,log,00000,"selected new timeline id: 2",,,,,,,,,""
2018-11-30 10:16:57.166 cst,,,7255,,5c009795.1c57,10,,2018-11-30 09:51:17 cst,1/0,0,log,00000,"archive recovery complete",,,,,,,,,""
2018-11-30 10:16:57.267 cst,,,7255,,5c009795.1c57,11,,2018-11-30 09:51:17 cst,1/0,0,log,00000,"multixact member wraparound protections are now enabled",,,,,,,,,""
2018-11-30 10:16:57.267 cst,,,7257,,5c009795.1c59,1,,2018-11-30 09:51:17 cst,,0,log,00000,"checkpoint starting: force",,,,,,,,,""
2018-11-30 10:16:57.275 cst,,,7253,,5c009795.1c55,3,,2018-11-30 09:51:17 cst,,0,log,00000,"database system is ready to accept connections",,,,,,,,,""
2018-11-30 10:16:57.276 cst,,,7447,,5c009d99.1d17,1,,2018-11-30 10:16:57 cst,,0,log,00000,"autovacuum launcher started",,,,,,,,,""

信息也是相当的清晰。

wal_retrieve_retry_interval = 5s 控制 salve 到 master 失败时,再次重试的等待时间。

以上为个人经验,希望能给大家一个参考,也希望大家多多支持。如有错误或未考虑完全的地方,望不吝赐教。