在群里看到某网友问pg_authid删了可以恢复不,我想,没有备份的话很难了吧,权叔建议其他好的系统拷贝一个过去看看。再仔细一问,是delete,那可以试着用我的bbedp恢复看看。
这个系统表记录了的是role和权限,如果删了数据会报错没有role。当然,数据库的启停是没有影响的,没有用户都不能登录了。
postgres=# \d pg_authid
Table "pg_catalog.pg_authid"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+---------
oid | oid | | not null |
rolname | name | | not null |
rolsuper | boolean | | not null |
rolinherit | boolean | | not null |
rolcreaterole | boolean | | not null |
rolcreatedb | boolean | | not null |
rolcanlogin | boolean | | not null |
rolreplication | boolean | | not null |
rolbypassrls | boolean | | not null |
rolconnlimit | integer | | not null |
rolpassword | text | C | |
rolvaliduntil | timestamp with time zone | | |
Indexes:
"pg_authid_oid_index" UNIQUE, btree (oid), tablespace "pg_global"
"pg_authid_rolname_index" UNIQUE, btree (rolname), tablespace "pg_global"
Tablespace: "pg_global"
postgres=#
postgres=# delete from pg_authid;
DELETE 12
postgres=# \q
[pg@whf307 soft]$
[pg@whf307 global]$ psql
psql: FATAL: role "pg" does not exist
这种方法测试是可行的。新建一个目录来initdb,然后把对应的page拷贝过去,系统超级用户是有了,但是其他用户就根据业务来补了。也很麻烦。
bbedp是我写的一个小工具,可以查看和修改表的page每一个字节。具体用法可以参考https://whf307.github.io/pg/BBED-for-PostgreSQL.html
这里说一下恢复思路。delete没有加条件的话只要没有被vacuum清理,数据还在的。我们只要修改t_xmax、t_infomask2和t_infomask三个标志位就好了。但是infomask没有参考。那试着新建一个用户来照着新用户的infomask来恢复。t_xmax改为0就好了。
首先备份数据库。任何有风险的操作都应该进行备份。如果实在不行把pg_authid对应的物理文件先备份一下。
避免dead tuple被autovacuum清理先关掉vacuum
postgres=# show autovacuum;
autovacuum
------------
on
(1 row)
postgres=# alter system set autovacuum=off;
ALTER SYSTEM
postgres=# show autovacuum;
autovacuum
------------
on
(1 row)
postgres=# \q
[pg@whf307 ~]$pg_ctl reload
server signaled
[pg@whf307 ~]$psql
psql (11.2)
Type "help" for help.
postgres=# show autovacuum;
autovacuum
------------
off
(1 row)
postgres=#
[pg@whf307 soft]$ postgres --single -D /oracle/soft/pg11_data postgres
2019-10-30 17:46:49.341 CST [18101] LOG: database system was interrupted; last known up at 2019-10-30 17:46:12 CST
2019-10-30 17:46:49.353 CST [18101] LOG: database system was not properly shut down; automatic recovery in progress
2019-10-30 17:46:49.355 CST [18101] LOG: redo starts at 0/169B0F8
2019-10-30 17:46:49.355 CST [18101] LOG: invalid record length at 0/169B980: wanted 24, got 0
2019-10-30 17:46:49.355 CST [18101] LOG: redo done at 0/169B948
2019-10-30 17:46:49.355 CST [18101] LOG: last completed transaction was at log time 2019-10-30 17:46:25.53714+08
2019-10-30 17:46:49.357 CST [18101] LOG: checkpoint starting: end-of-recovery immediate
2019-10-30 17:46:49.363 CST [18101] LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.007 s; sync files=1, longest=0.000 s, average=0.000 s; distance=2 kB, estimate=2 kB
2019-10-30 17:46:49.364 CST [18101] WARNING: no roles are defined in this database system
2019-10-30 17:46:49.364 CST [18101] HINT: You should immediately run CREATE USER "pg" SUPERUSER;.
PostgreSQL stand-alone backend 11.2
backend> create user sa;
2019-10-30 17:46:55.115 CST [18101] LOG: statement: create user sa;
2019-10-30 17:46:55.116 CST [18101] LOG: duration: 1.860 ms
backend> select * from pg_authid;
2019-10-30 17:46:58.274 CST [18101] LOG: statement: select * from pg_authid;
1: rolname (typeid = 19, len = 64, typmod = -1, byval = f)
2: rolsuper (typeid = 16, len = 1, typmod = -1, byval = t)
3: rolinherit (typeid = 16, len = 1, typmod = -1, byval = t)
4: rolcreaterole (typeid = 16, len = 1, typmod = -1, byval = t)
5: rolcreatedb (typeid = 16, len = 1, typmod = -1, byval = t)
6: rolcanlogin (typeid = 16, len = 1, typmod = -1, byval = t)
7: rolreplication (typeid = 16, len = 1, typmod = -1, byval = t)
8: rolbypassrls (typeid = 16, len = 1, typmod = -1, byval = t)
9: rolconnlimit (typeid = 23, len = 4, typmod = -1, byval = t)
10: rolpassword (typeid = 25, len = -1, typmod = -1, byval = f)
11: rolvaliduntil (typeid = 1184, len = 8, typmod = -1, byval = t)
----
1: rolname = "sa" (typeid = 19, len = 64, typmod = -1, byval = f)
2: rolsuper = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
3: rolinherit = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
4: rolcreaterole = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
5: rolcreatedb = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
6: rolcanlogin = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
7: rolreplication = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
8: rolbypassrls = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
9: rolconnlimit = "-1" (typeid = 23, len = 4, typmod = -1, byval = t)
----
2019-10-30 17:46:58.276 CST [18101] LOG: duration: 2.303 ms
可以看到只有sa用户。
查到pg_authid的物理文件绝对路径,新建file文件
vi file
/oracle/soft/pg11_data/global/1260
bbed查看,可以看到这个page有13行tuple,说明没有被vacuum清理,还有机会。
[pg@whf307 soft]$ ./bbedp
Password:
BBEDP: Release 0.1.0.0.0 - Limited Production on Wed Oct 30 17:59:09 2019
Copyright (c) 2018, 2019, whf307 and/or its affiliates. All rights reserved.
************* !!! For PostgreSQL Internal Test only !!! ***************
BBEDP> map
File: /oracle/soft/pg11_data/global/1260
Block: 0 offset: 0
-------------------------------------------------------------------
struct PageHeaderData , 24 bytes @0
struct Linps , 52 bytes @24
struct Tuples , 1456 bytes @76
struct Pd_special , 0 bytes @8191
There are 13 tuples in the block
超级用户一般都是第一个tuple,如果不是按tupledata,按utf8转换16进制找就可以了。
BBEDP> p tuple 1
tuple 1
-----------------------------------------------------
lp_off 8080 @24 0x00d89f90
lp_flags 1 @24 0x00d89f90
lp_len 108 @24 0x00d89f90
t_xmin 1 @8080 0x01000000
t_xmax 588 @8084 0x4c020000
t_field3 0 @8088 0x00000000
t_ctid (0,1) @8092 0x000000000100
t_infomask2 8203 @8098 0x0b20
t_infomask 265 @8100 0x0901
t_hoff 32 @8102 0x20
bits8 0b11111111 @8103 0xff
tupledata @8104 0x010000000a000000706700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101010101010100ffffffff
这里tupledata中,0100这段后面的就是rolname,7076转换utf8就是pg。而且这里t_xmax不是0,说明是dead tuple。
查看新建的用户对应的数据
BBEDP> p tuple 13
tuple 13
-----------------------------------------------------
lp_off 6736 @72 0x00d89ba0
lp_flags 1 @72 0x00d89ba0
lp_len 108 @72 0x00d89ba0
t_xmin 587 @6736 0x4b020000
t_xmax 0 @6740 0x00000000
t_field3 0 @6744 0x00000000
t_ctid (0,13) @6748 0x000000000d00
t_infomask2 11 @6754 0x0b00
t_infomask 2313 @6756 0x0909
t_hoff 32 @6758 0x20
bits8 0b11111111 @6759 0xff
tupledata @6760 0x0100000014400000736100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff
这里,7361对应的就是sa。t_xmax是0,是正常的tuple。我们按照tuple 13去恢复tuple 1
BBEDP> m /x 0000 offset 8084
BBEDP> m /x 0b00 offset 8098
BBEDP> m /x 0909 offset 8100
BBEDP> p tuple 1
tuple 1
-----------------------------------------------------
lp_off 8080 @24 0x00d89f90
lp_flags 1 @24 0x00d89f90
lp_len 108 @24 0x00d89f90
t_xmin 1 @8080 0x01000000
t_xmax 0 @8084 0x00000000
t_field3 0 @8088 0x00000000
t_ctid (0,1) @8092 0x000000000100
t_infomask2 11 @8098 0x0b00
t_infomask 2313 @8100 0x0909
t_hoff 32 @8102 0x20
bits8 0b11111111 @8103 0xff
tupledata @8104 0x010000000a000000706700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101010101010100ffffffff
这里正常起库测试。
[pg@whf307 soft]$ pg_ctl start
waiting for server to start....2019-10-30 18:03:35.517 CST [19514] LOG: listening on IPv4 address "0.0.0.0", port 2345
2019-10-30 18:03:35.518 CST [19514] LOG: could not create IPv6 socket for address "::": Address family not supported by protocol
2019-10-30 18:03:35.521 CST [19514] LOG: listening on Unix socket "/oracle/soft/pg11_data/.s.PGSQL.2345"
2019-10-30 18:03:35.532 CST [19514] LOG: redirecting log output to logging collector process
2019-10-30 18:03:35.532 CST [19514] HINT: Future log output will appear in directory "log".
done
server started
[pg@whf307 soft]$ psql
psql (11.2)
Type "help" for help.
postgres=# select * from pg_authid;
rolname | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication | rolbypassrls | rolconnlimit | rolpassword | rolvaliduntil
---------+----------+------------+---------------+-------------+-------------+----------------+--------------+--------------+-------------+---------------
pg | t | t | t | t | t | t | t | -1 | |
sa | f | t | f | f | t | f | f | -1 | |
(2 rows)
测试是可以了,但是很尴尬。退出再登录就不行了。
[pg@whf307 soft]$ psql
psql: FATAL: role "pg" does not exist
再去单用户模式确认
[pg@whf307 soft]$ postgres --single -D /oracle/soft/pg11_data postgres
PostgreSQL stand-alone backend 11.2
backend> select * from pg_authid;
2019-10-30 18:04:06.082 CST [19539] LOG: statement: select * from pg_authid;
1: rolname (typeid = 19, len = 64, typmod = -1, byval = f)
2: rolsuper (typeid = 16, len = 1, typmod = -1, byval = t)
3: rolinherit (typeid = 16, len = 1, typmod = -1, byval = t)
4: rolcreaterole (typeid = 16, len = 1, typmod = -1, byval = t)
5: rolcreatedb (typeid = 16, len = 1, typmod = -1, byval = t)
6: rolcanlogin (typeid = 16, len = 1, typmod = -1, byval = t)
7: rolreplication (typeid = 16, len = 1, typmod = -1, byval = t)
8: rolbypassrls (typeid = 16, len = 1, typmod = -1, byval = t)
9: rolconnlimit (typeid = 23, len = 4, typmod = -1, byval = t)
10: rolpassword (typeid = 25, len = -1, typmod = -1, byval = f)
11: rolvaliduntil (typeid = 1184, len = 8, typmod = -1, byval = t)
----
1: rolname = "pg" (typeid = 19, len = 64, typmod = -1, byval = f)
2: rolsuper = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
3: rolinherit = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
4: rolcreaterole = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
5: rolcreatedb = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
6: rolcanlogin = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
7: rolreplication = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
8: rolbypassrls = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
9: rolconnlimit = "-1" (typeid = 23, len = 4, typmod = -1, byval = t)
----
1: rolname = "sa" (typeid = 19, len = 64, typmod = -1, byval = f)
2: rolsuper = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
3: rolinherit = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
4: rolcreaterole = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
5: rolcreatedb = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
6: rolcanlogin = "t" (typeid = 16, len = 1, typmod = -1, byval = t)
7: rolreplication = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
8: rolbypassrls = "f" (typeid = 16, len = 1, typmod = -1, byval = t)
9: rolconnlimit = "-1" (typeid = 23, len = 4, typmod = -1, byval = t)
----
2019-10-30 18:04:06.084 CST [19539] LOG: duration: 2.084 ms
有点奇怪,重启库第一次可以登录,退出再登录就不行了。而且但用户模式去看数据是恢复出来了的。我strace跟踪了一把也没发现什么。。。
重新梳理了一下思路,page中的数据肯定是在了,会不会是哪里遗漏了什么?索引呢?对,我直接改的page,索引还里面是没有这个记录的,数据不一致了。重建索引试试。
postgres=# reindex index pg_authid_oid_index;
REINDEX
postgres=# reindex index pg_authid_rolname_index;
REINDEX
postgres=# select * from pg_authid;
rolname | rolsuper | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication | rolbypassrls | rolconnlimit | rolpassword | rolvaliduntil
---------+----------+------------+---------------+-------------+-------------+----------------+--------------+--------------+-------------+---------------
pg | t | t | t | t | t | t | t | -1 | |
sa | f | t | f | f | t | f | f | -1 | |
(2 rows)
postgres=# \q
[pg@whf307 global]$ psql
psql (11.2)
Type "help" for help.
postgres=# \q
[pg@whf307 global]$ psql
psql (11.2)
Type "help" for help.
果然,重建了索引就可以了,重启库也没有问题。如果要恢复其他用户,恢复pg一样的步骤,不再赘述。
这个工具我搞了很久,很强大,可以绕过数据库层面修改page,也修改了很多bug,我对他是很有信心的。这次一遍测试,一遍帮助网友恢复。
网友这边的生产环境是数据恢复出来了。但是登陆的时候还是报错role不存在。当我建议网友重建索引的时候,网友说他新建库重新导数据了。。重建索引的明天再试。。。
就差那么一点了,可能人家生产环境急,也可能数据可以丢。
备份终于一切!