PostgreSQL Production AQMS System Operations Manual
AQMS/DB: Commands
- Start PostgreSQL database:
[postgres@aqms-bk-rt ~]$ sudo systemctl [re]start postgresql-14
- Stop PostgreSQL database:
[postgres@aqms-bk-rt ~]$ sudo systemctl stop postgresql-14
- PostgreSQL database Status:
[postgres@aqms-bk-rt ~]$ systemctl status postgresql-14
- Show RT/PP system role:
[aqms@aqms-bk-rt ~]$ showrole
System is shadow
[aqms@aqms-bk-rt ~]$
[aqms@aqms-bk-pp events]$ showrole
System is primary
[aqms@aqms-bk-pp events]$
- AQMS command:
[aqms@aqms-bk-rt ~]$ aqms -h
AQMS realtime-processing control script can be used to start, stop, restart, show status
Usage: aqms [start|stop|status|restart|forcestop]
stop does a kill -1 and forcestop does a kill -9, returns 1 when stopped
status exits with 0 if all required processes are running and 1 if not.
restart does a stop 1 or 9 followed by start and status.
[aqms@aqms-bk-rt ~]$
- Start all AQMS+EW codes on each system:
[aqms@aqms-bk-rt ~]$ aqms start
- Status of what is running:
[aqms@aqms-bk-rt ~]$ aqms status
Usage: /home/aqms/utils/init/startStopCMS {start|stop}
adadup_slave /home/aqms/configs/adadup_slave/adadup_mnlons1_to_aqms-bk-rt.cfg running on pid 2068089
adadup_slave /home/aqms/configs/adadup_slave/adadup_mnlons2_to_aqms-bk-rt.cfg running on pid 2068097
adadup_slave /home/aqms/configs/adadup_slave/adadup_ucbns1_to_aqms-bk-rt.cfg running on pid 2068105
adadup_slave /home/aqms/configs/adadup_slave/adadup_ucbns2_to_aqms-bk-rt.cfg running on pid 2068113
ntrcg2 /home/aqms/configs/ntrcg2/ntrcg2.cfg running on pid 2068263
sigswitch /home/aqms/configs/sigswitch/sigswitch_event.cfg running on pid 2068274
sigswitch /home/aqms/configs/sigswitch/sigswitch_trigger.cfg running on pid 2068289
eqrun /home/aqms/configs/eqrun/eqrun_db2nsmp.cfg running on pid 605941
eqrun /home/aqms/configs/eqrun/eqrun_rcg.cfg running on pid 2068242
ec /home/aqms/configs/ec/ec.cfg running on pid 2068149
tc2 /home/aqms/configs/tc2/tc2.cfg running on pid 2068175
trig2db /home/aqms/configs/trig2db/trig2db.cfg running on pid 2068188
trimag /home/aqms/configs/trimag/trimag.cfg running on pid 2068201
ampgen /home/aqms/configs/ampgen/ag_early.cfg running on pid 2068213
ampgen /home/aqms/configs/ampgen/ag_late.cfg running on pid 2068223
alarmdec /home/aqms/configs/alarmdec/alarmdec-ql.cfg running on pid 294066
alarmdec /home/aqms/configs/alarmdec/alarmdec-tm.cfg running on pid 294076
alarmdec /home/aqms/configs/alarmdec/alarmdec-amp-early.cfg running on pid 294086
alarmdec /home/aqms/configs/alarmdec/alarmdec-amp-late.cfg running on pid 294095
alarmdec /home/aqms/configs/alarmdec/alarmdec-mw.cfg running on pid 294106
alarmdec /home/aqms/configs/alarmdec/alarmdec-mech.cfg running on pid 294115
alarmdec /home/aqms/configs/alarmdec/alarmdec-tmts-done.cfg running on pid 294126
alarmact /home/aqms/configs/alarmact/alarmact-rt.cfg running on pid 294137
alarmdist /home/aqms/configs/alarmdist/alarmdist-rt.cfg running on pid 294148
runner /home/aqms/configs/runner/sendcancel_runner.cfg running on pid 294160
telestifle /home/aqms/configs/telestifle/telestifle.cfg running on pid 294172
pws /home/aqms/configs/pws/pws.cfg running on pid 2068255
solserver.pl running on pids 2068319 2068332 2068372 2068410
PDL Exec Rec running (pid=2068648)
EIDS Notification Server running (pid=2068763)
EIDS Server running (pid=2068850)
PDL Hub running (pid=2068956)
PDL LI Poll Sender running (pid=2069106)
[aqms@aqms-bk-rt ~]$
[aqms@aqms-bk-rt ~]$ status
using default config file startstop_unix.d
NOTE: If next line reads "ERROR: tport_attach...", Earthworm is not running.
Sent request for status; waiting for response...

                  EARTHWORM-64 SYSTEM STATUS

        Hostname-OS:            aqms-bk-rt - Linux 4.18.0-425.10.1.el8_7.x86_64
        Start time (UTC):       Wed Mar 22 23:07:49 2023
        Current time (UTC):     Wed Mar 29 22:07:07 2023
        Disk space avail:       508789568 kb
        Ring  1 name/key/size:  HYPO_RING / 1015 / 1024 kb
        Ring  2 name/key/size:  PICK_RING / 1005 / 1024 kb
        Ring  3 name/key/size:  FILTERPICK_RING / 1039 / 1024 kb
        Startstop's Log Dir:    /home/aqms/logs/
        Startstop's Params Dir: /home/aqms/ew/params/
        Startstop's Bin Dir:    /home/aqms/ew/bin
        Startstop Version:      v7.10 2019-08-13 (64 bit)

          Process  Process          Class/    CPU
            Name      Id    Status  Priority  Used      Argument
          -------  -------  ------  --------  ----      --------
        startstop   853366  Alive    ??/ 0    00:00:21  -
       import_ack   853367  Alive    ??/ 0    00:03:21  imp_pktrg_menlo1.d
       import_ack   853368  Alive    ??/ 0    00:03:21  imp_pktrg_menlo2.d
       import_ack   853369  Alive    ??/ 0    00:02:05  imp_pktrg_ucb1.d
       import_ack   853370  Alive    ??/ 0    00:02:36  imp_pktrg_ucb2.d
         pkfilter   853371  Alive    ??/ 0    00:04:08  pkfilter.d
    statrigfilter   853372  Alive    ??/ 0    00:03:39  statrigfilter.d
        binder_ew   853373  Alive    ??/ 0    00:09:16  binder_ew.d
       eqassemble   853374  Alive    ??/ 0    00:03:57  eqassemble.d
          hyps2ps   853375  Alive    ??/ 0    00:00:19  hyps2ps.d
      carlsubtrig   853376  Alive    ??/ 0    00:01:23  carlsubtrig.d
          trig2ps   853377  Alive    ??/ 0    00:00:18  trig2ps.d
          statmgr   853378  Alive    ??/ 0    00:00:22  statmgr.d
       telafeeder   853379  NoExec                      telafeeder.d
          diskmgr   853380  Alive    ??/ 0    00:00:17  diskmgr.d
       copystatus   853381  Alive    ??/ 0    00:00:18  PICK_RING HYPO_RING
       copystatus   853382  Alive    ??/ 0    00:00:16  FILTERPICK_RING <_RING
[aqms@aqms-bk-rt ~]$
[aqms@aqms-bk-pp ~]$ aqms status
alarmdec /home/aqms/configs/alarmdec/alarmdec-pp.cfg running on pid 1640591
alarmdec /home/aqms/configs/alarmdec/alarmdec-pp-mech.cfg running on pid 1640600
alarmdec /home/aqms/configs/alarmdec/alarmdec-pp-mw.cfg running on pid 1640610
alarmact /home/aqms/configs/alarmact/alarmact-pp.cfg running on pid 1640623
alarmdist /home/aqms/configs/alarmdist/alarmdist-pp.cfg running on pid 1640636
pws /home/aqms/configs/pws/pws.cfg running on pid 865789 865833 1640697
pws /home/aqms/configs/pws/pws_UCB.cfg running on pid 865827 1640707
pws /home/aqms/configs/pws/pws_CE.cfg running on pid 1640714
pws /home/aqms/configs/pws/pws_dart.cfg running on pid 865829 1640721
pws /home/aqms/configs/pws/pws_NN.cfg running on pid 865828 1640728
aqms  201218  0.3  0.0  342384  50712 ?  S  Mar28   7:25  wanc wanc_AZBKBPLBPBUS.cfg
aqms  201219  0.0  0.0   20276   2216 ?  S  Mar28   0:03  conlog /home/aqms/logs/wanc_AZBKBPLBPBUS_con
aqms  201227  0.2  0.0  305004  13168 ?  S  Mar28   4:24  wanc wanc_BG.cfg
aqms  201228  0.0  0.0   20276   2068 ?  S  Mar28   0:00  conlog /home/aqms/logs/wanc_BG_con
aqms  201243  0.3  0.0  346852  55604 ?  S  Mar28   6:44  wanc wanc_CE.cfg
aqms  201244  0.0  0.0   20276   2056 ?  S  Mar28   0:00  conlog /home/aqms/logs/wanc_CE_con
aqms  201258  0.0  0.0  317196  23496 ?  S  Mar28   1:48  wanc wanc_CINN.cfg
aqms  201259  0.0  0.0   20276   2032 ?  S  Mar28   0:00  conlog /home/aqms/logs/wanc_CINN_con
aqms  201269  0.6  0.0  390836  99372 ?  S  Mar28  11:24  wanc wanc_NC.cfg
aqms  201270  0.0  0.0   20276   3744 ?  S  Mar28   0:04  conlog /home/aqms/logs/wanc_NC_con
aqms  201283  0.5  0.0  342224  50608 ?  S  Mar28  10:04  wanc wanc_GMGSNPPGRESBSFUOUWWR.cfg
aqms  201284  0.0  0.0   20276   2124 ?  S  Mar28   0:00  conlog /home/aqms/logs/wanc_GMGSNPPGRESBSFUOUWWR_con
solserver.pl running on pids 1640802 1640808 1640814 1640820
pcs_poller /home/aqms/configs/waveform_gifs/drp_pcs.cfg running on pid 1640828
pcs_poller /home/aqms/configs/waveform_gifs/trp_pcs.cfg running on pid 1640836
[aqms@aqms-bk-pp ~]$
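Because 'aqms status' exits 0 when all required processes are running and 1 otherwise (per 'aqms -h' above), the exit code can drive simple automation. A minimal sketch under that assumption; 'check_and_restart' is a hypothetical wrapper, not part of AQMS, and the control command is passed in so the logic can be exercised without a live system:

```shell
#!/bin/sh
# check_and_restart CMD: act on the documented exit codes of
# 'CMD status' (0 = all required processes running, 1 = not).
# CMD is the control script, e.g. 'aqms'.
check_and_restart() {
  if "$1" status >/dev/null 2>&1; then
    echo "OK"
  else
    echo "DEGRADED"
    "$1" restart >/dev/null 2>&1
  fi
}

# On an RT or PP host this would be:  check_and_restart aqms
```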
AQMS: Metadata Loading
- As the 'aqms' user on the active pp (aqms-bk-pp or aqms-nc-pp), cd into /home/aqms/DBmaint/metadata/.
- To load all stations for all networks:
- Run 'getNCEDCstaXML' to retrieve all stationXML files from the NCEDC.
- Run 'loadAllStaXML.sh' to load all stations (3-4h for ~3500 stations).
- To load a given network:
- Run 'getNCEDCstaXML <Network>' to retrieve all stationXML files for the given network.
- Run 'loadAllStaXML.sh' to load all stations for the given network.
- To load a given station:
- Run 'getNCEDCstaXML <Network> <Station>' to retrieve the stationXML file for the given station.
- Run 'loadAllStaXML.sh' to load the station.
- To load a list of networks and stations: [Preferred Method]
- Run 'getFromList' to retrieve the stationXML files for the networks and stations specified in 'netsta.lst', e.g.:
[aqms@aqms-bk-pp metadata]$ cat netsta.lst
SB
BK MERC
NP 1858
UO
[aqms@aqms-bk-pp metadata]$
- To refresh the metadata on aqms-bk-rt & aqms-nc-rt (~30 mins):
[aqms@aqms-bk-pp ~]$ cd /home/aqms/DBmaint/metadata/
[aqms@aqms-bk-pp metadata]$ ./refreshMetadata.csh
- To refresh the PWS channel mappings:
[aqms@aqms-bk-pp ~]$ cd /home/aqms/DBmaint/PWS/
[aqms@aqms-bk-pp PWS]$ ./pws
- To refresh the PWS channel mappings on aqms-bk-rt & aqms-nc-rt:
[aqms@aqms-bk-pp ~]$ cd /home/aqms/DBmaint/misc_tables/
[aqms@aqms-bk-pp misc_tables]$ ./refreshMiscTables.csh
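The list-driven 'getFromList' method above amounts to fetching stationXML for each netsta.lst entry (a bare network code, or network plus station) and then loading everything. A sketch of that loop under that assumption; 'fetch_from_list' is illustrative, not the actual getFromList implementation, and the fetch command is injectable so the loop itself can be tested:

```shell
#!/bin/sh
# fetch_from_list LIST CMD: for each "NET" or "NET STA" line in LIST,
# invoke CMD with those arguments (e.g. CMD = ./getNCEDCstaXML).
fetch_from_list() {
  while read -r net sta; do
    [ -z "$net" ] && continue   # skip blank lines
    "$2" "$net" $sta            # $sta is empty for whole-network entries
  done < "$1"
}

# On the active pp this would be roughly:
#   cd /home/aqms/DBmaint/metadata
#   fetch_from_list netsta.lst ./getNCEDCstaXML
#   ./loadAllStaXML.sh
```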
AQMS: Miscellaneous DB Tables
- As the 'aqms' user on the active pp (aqms-bk-pp or aqms-nc-pp), use:
Table | Maintenance Application |
---|---|
APPCHANNELS | StationUI, AppChan.sql, codaparamsDB (section 111) & pws (section 101) |
APPLICATIONS | StationUI, AppChan.sql, codaparamsDB (section 111) & pws (section 101) |
CONFIG_CHANNEL | AppChan.sql (derived from APPCHANNELS table) |
PROGRAM | AppChan.sql (derived from APPLICATIONS table) |
CHANNELMAP_AMPPARMS | codaparamsDB |
CHANNELMAP_CODAPARMS | codaparamsDB |
CREDIT_ALIAS | psql, by hand |
GAZETTEERBIGTOWN | placesDB |
GAZETTEERTOWN | placesDB |
GAZETTEERQUAKE | placesDB |
GAZETTEERQUARRY | quarryDB |
GAZETTEERPT | placesDB & quarryDB |
GAZETTEER_REGION | regionDB |
MAGPREFPRIORITY | magprefDB |
STACORRECTIONS | codaparamsDB (Md), loadCISNmagcorr (Ml) |
- To refresh the AppChan.sql tables on the active pp:
[aqms@aqms-bk-pp ~]$ cd /home/aqms/DBmaint/AppCha/
[aqms@aqms-bk-pp AppCha]$ vi appChan_pg.sql
[aqms@aqms-bk-pp AppCha]$ ./refreshAppCha.csh
- To refresh the miscellaneous tables on aqms-bk-rt & aqms-nc-rt:
[aqms@aqms-bk-pp ~]$ cd /home/aqms/DBmaint/misc_tables/
[aqms@aqms-bk-pp misc_tables]$ ./refreshMiscTables.csh
AQMS: Adding New Stations
- (See instructions).
AQMS: Real Time Role Switch
NOTES: Make sure no earthquakes are being processed by the RT system before switching. In a terminal window, look at the current eqassemble log file on one of the RT systems and check that it has not changed in the last few minutes. Then run "tail -f" on the eqassemble log to watch what is happening. If a new event starts to be processed, wait until the RT systems have finished processing it.
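The "has the eqassemble log changed in the last few minutes" check above can be approximated with find's -mmin test. A sketch, assuming logs live under /home/aqms/logs and are named eqassemble*; 'quiet_check' is a hypothetical helper, not an AQMS script:

```shell
#!/bin/sh
# quiet_check DIR: succeed (exit 0) only if no eqassemble log under DIR
# has been modified in the last 5 minutes.
quiet_check() {
  [ -z "$(find "$1" -name 'eqassemble*' -mmin -5 2>/dev/null)" ]
}

# Example gate before switching roles (log path is an assumption):
if quiet_check /home/aqms/logs; then
  echo "RT system quiet - OK to run switchrole"
else
  echo "event may still be processing - wait and re-check"
fi
```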
The command 'switchrole' can be launched on either aqms-bk-rt or aqms-nc-rt. The real time switch should happen before the post processing one because currently a PP system is considered primary at startup if its corresponding RT system is primary (showrole script logic).
In order to do a complete switchover:
- Disable the 'rsync_archive' cron jobs on aqms-bk-pp & aqms-nc-pp.
- Perform the steps in the 'AQMS: Real Time Role Switch' section.
- Perform the steps in the 'AQMS/DB: Post Processing Role Switch' section.
- Enable the 'rsync_archive' cron jobs on aqms-bk-pp & aqms-nc-pp.
- Perform the steps in the 'AQMS: Auxiliary Role Switch & Other Items' section.
- Switchover:
[aqms@aqms-bk-rt ~]$ switchrole
Will switch aqms-bk-rt (rtdb1) from primary to shadow and aqms-nc-rt (rtdb2) from shadow to primary
OK (Y or N): Y
Will do it.
rtem@rtdb1 System was primary, changing database to shadow
spawn psql -q -h aqms-bk-rt.ncss-ucb.cisn.org -d rtdb1 -U rtem -W
Password:
rtdb1=> \i /home/aqms/utils/scripts/set_shadow.sql
rtdb1=> \q
rtem@rtdb2 System was shadow, changing database to primary
spawn psql -q -h aqms-nc-rt.ncss-mp.cisn.org -d rtdb2 -U rtem -W
Password:
rtdb2=> \i /home/aqms/utils/scripts/set_primary.sql
rtdb2=> \q
[aqms@aqms-bk-rt ~]$
- Standby Status:
[aqms@aqms-bk-rt ~]$ psql -q -h aqms-bk-rt.ncss-ucb.cisn.org -d rtdb1 -U rtem -W
Password:
rtdb1=> select * from rt_role;
 primary_system |     modification_time
----------------+----------------------------
 true           | 2023-03-07 00:45:20.992049
 false          | 2023-04-21 17:01:55.751883
 true           | 2023-04-21 17:04:13.129714
 false          | 2023-04-21 17:21:42.727915
 true           | 2023-04-21 17:21:54.800726
 false          | 2023-06-08 21:30:29.775161
 true           | 2023-06-08 21:59:35.722209
 false          | 2023-07-12 17:06:11.835567
 true           | 2023-07-12 17:10:47.257782
 false          | 2023-07-12 17:12:49.347408
(10 rows)
rtdb1=> \q
[aqms@aqms-bk-rt ~]$
- Primary Status:
[aqms@aqms-bk-rt ~]$ psql -q -h aqms-nc-rt.ncss-mp.cisn.org -d rtdb2 -U rtem -W
Password:
rtdb2=> select * from rt_role;
 primary_system |     modification_time
----------------+---------------------------
 false          | 2023-03-27 00:00:00
 true           | 2023-07-12 17:12:49.54559
(2 rows)
rtdb2=> \q
[aqms@aqms-bk-rt ~]$
AQMS/DB: Post Processing Role Switch
NOTE:
MAKE SURE that the DB instances are started and stopped with systemctl, not pg_ctl:
systemctl status postgresql-14
sudo systemctl stop postgresql-14
sudo systemctl [re]start postgresql-14
- Post Processing Failover:
- In a controlled environment (e.g. patching the active PP):
  * On the standby system as 'postgres', do a dry run first:
    % repmgr standby switchover --verbose --force-rewind --dry-run
  * If you get no errors from the dry run, on the active & standby systems as 'aqms' shut down AQMS:
    % aqms forcestop
  * Check to see if anything is still running; kill remaining processes:
    % aqms status
  * On the standby system as 'aqms', sync the event waveform archive:
    % rsync -av -e ssh --progress aqms@{primary}:/home/aqms/data/events/active/ /home/aqms/data/events/active
    where {primary} = {aqms-bk-pp.ncss-ucb.cisn.org|aqms-nc-pp.ncss-mp.cisn.org}
  * On the standby system as 'postgres', do the switchover:
    % repmgr standby switchover --verbose --force-rewind
  * On aqms-nc-aux as 'ncss', switch the DB for the DDRT process:
    % cp /home/ncss/run/params/dd2db.conf.{new_primary} /home/ncss/run/params/dd2db.conf
    where {new_primary} = {aqms-bk-pp|aqms-nc-pp}
  * On the active & standby systems as 'aqms', start AQMS:
    % aqms start
  * Check to make sure that all processes are running:
    % aqms status
- In an emergency (e.g. active PP has crashed):
  * If something bad happened or you don't expect the standby to be able to become primary, instead of using --force-rewind use this command, which kills the original primary after taking over:
    % repmgr standby switchover --always-promote
  * Stop AQMS on the old primary if it isn't already, and start it up on the new primary (see above).
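The dry-run-first discipline in the controlled procedure can be enforced with a tiny wrapper. A sketch; 'safe_switchover' is hypothetical, and the repmgr command is passed as a parameter so the control flow can be exercised without a live cluster:

```shell
#!/bin/sh
# safe_switchover CMD: run the real switchover only if CMD's dry run
# exits 0, mirroring the documented procedure. CMD would be 'repmgr'.
safe_switchover() {
  if "$1" standby switchover --verbose --force-rewind --dry-run; then
    "$1" standby switchover --verbose --force-rewind
  else
    echo "dry run failed - aborting switchover" >&2
    return 1
  fi
}

# On the standby PP as 'postgres' this would be:  safe_switchover repmgr
```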
- Switchover dry run:
[postgres@aqms-nc-pp ~]$ repmgr standby switchover --verbose --force-rewind --dry-run
INFO: checking for package configuration file "/etc/repmgr/14/repmgr.conf"
INFO: configuration file found at: "/etc/repmgr/14/repmgr.conf"
NOTICE: checking switchover on node "walreceiver" (ID: 2) in --dry-run mode
INFO: searching for primary node
INFO: checking if node 1 is primary
INFO: current primary node is 1
INFO: prerequisites for using pg_rewind are met
INFO: SSH connection to host "aqms-bk-pp.ncss-ucb.cisn.org" succeeded
INFO: able to execute "repmgr" on remote host "aqms-bk-pp.ncss-ucb.cisn.org"
INFO: 1 walsenders required, 2 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: archive mode is "off"
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "walreceiver" (ID: 2) would be promoted to primary; current primary "aqmsbkpp" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "aqmsbkpp": "sudo /bin/systemctl stop postgresql-14"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
[postgres@aqms-nc-pp ~]$
- Switchover:
[postgres@aqms-nc-pp ~]$ repmgr standby switchover --verbose --force-rewind
INFO: checking for package configuration file "/etc/repmgr/14/repmgr.conf"
INFO: configuration file found at: "/etc/repmgr/14/repmgr.conf"
NOTICE: executing switchover on node "walreceiver" (ID: 2)
INFO: searching for primary node
INFO: checking if node 1 is primary
INFO: current primary node is 1
INFO: prerequisites for using pg_rewind are met
INFO: SSH connection to host "aqms-bk-pp.ncss-ucb.cisn.org" succeeded
INFO: archive mode is "off"
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "walreceiver" (ID: 2) will be promoted to primary; current primary "aqmsbkpp" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "aqmsbkpp" (ID: 1)
NOTICE: issuing CHECKPOINT on node "aqmsbkpp" (ID: 1)
DETAIL: executing server command "sudo /bin/systemctl stop postgresql-14"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 12/4ED71AF8
NOTICE: promoting standby to primary
DETAIL: promoting server "walreceiver" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
INFO: standby promoted to primary after 1 second(s)
NOTICE: STANDBY PROMOTE successful
DETAIL: server "walreceiver" (ID: 2) was successfully promoted to primary
NOTICE: issuing CHECKPOINT on node "walreceiver" (ID: 2)
ERROR: unable to execute CHECKPOINT
INFO: node "aqmsbkpp" (ID: 1) is pingable
WARNING: node "aqmsbkpp" attached in state "startup"
INFO: waiting for node "aqmsbkpp" (ID: 1) to connect to new primary; 1 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: node "walreceiver" (ID: 1) is currently attached to its upstream node in state "startup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
INFO: waiting for node "aqmsbkpp" (ID: 1) to connect to new primary; 6 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: node "walreceiver" (ID: 1) is currently attached to its upstream node in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
INFO: waiting for node "aqmsbkpp" (ID: 1) to connect to new primary; 11 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: node "walreceiver" (ID: 1) is currently attached to its upstream node in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
INFO: waiting for node "aqmsbkpp" (ID: 1) to connect to new primary; 16 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: node "walreceiver" (ID: 1) is currently attached to its upstream node in state "catchup"
WARNING: node "aqmsbkpp" attached in state "catchup"
INFO: node "aqmsbkpp" (ID: 1) has attached to its upstream node
NOTICE: node "walreceiver" (ID: 2) promoted to primary, node "aqmsbkpp" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "walreceiver" is now primary and node "aqmsbkpp" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
[postgres@aqms-nc-pp ~]$
- Primary Status:
[postgres@aqms-bk-pp ~]$ repmgr node status
Node "aqmsbkpp":
        PostgreSQL version: 14.6
        Total data size: 2088 MB
        Conninfo: host=aqms-bk-pp.ncss-ucb.cisn.org user=repadmin dbname=repmgr connect_timeout=2 application_name=aqmsbkpp
        Role: primary
        WAL archiving: off
        Archive command: (none)
        Replication connections: 2 (of maximal 3)
        Replication slots: 1 physical (of maximal 80; 0 missing)
        Replication lag: n/a
[postgres@aqms-bk-pp ~]$ psql
psql (14.6)
Type "help" for help.
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 944229
usesysid         | 19831
usename          | repadmin
application_name | walreceiver
client_addr      | 192.168.119.115
client_hostname  |
client_port      | 45236
backend_start    | 2023-04-17 16:43:34.556498-07
backend_xmin     |
state            | streaming
sent_lsn         | A/5C76A040
write_lsn        | A/5C76A040
flush_lsn        | A/5C76A040
replay_lsn       | A/5C76A040
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2023-04-17 16:55:38.587616-07
-[ RECORD 2 ]----+------------------------------
pid              | 946732
usesysid         | 23101
usename          | barman
application_name | barman_receive_wal
client_addr      | 169.229.197.35
client_hostname  |
client_port      | 48700
backend_start    | 2023-04-17 16:50:01.99932-07
backend_xmin     |
state            | streaming
sent_lsn         | A/5C76A040
write_lsn        | A/5C76A040
flush_lsn        | A/5C000000
replay_lsn       |
write_lag        | 00:00:05.659113
flush_lag        | 00:05:32.17972
replay_lag       | 00:05:32.17972
sync_priority    | 0
sync_state       | async
reply_time       | 2023-04-17 16:55:34.204999-07
postgres=# \q
[postgres@aqms-bk-pp ~]$
- Standby Status:
[postgres@aqms-nc-pp ~]$ repmgr node status
Node "walreceiver":
        PostgreSQL version: 14.7
        Total data size: 2088 MB
        Conninfo: host=aqms-nc-pp.ncss-mp.cisn.org user=repadmin dbname=repmgr connect_timeout=2 application_name=walreceiver
        Role: standby
        WAL archiving: off
        Archive command: (none)
        Replication connections: 1 (of maximal 3)
        Replication slots: 1 physical (of maximal 80; 0 missing)
        Upstream node: aqmsbkpp (ID: 1)
        Replication lag: 0 seconds
        Last received LSN: A/5CD499C8
        Last replayed LSN: A/5CD499C8
[postgres@aqms-nc-pp ~]$ psql
psql (14.7)
Type "help" for help.
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+------------------------------
pid                   | 1932766
status                | streaming
receive_start_lsn     | A/5C000000
receive_start_tli     | 1
written_lsn           | A/5CE0DED8
flushed_lsn           | A/5CE0DED8
received_tli          | 1
last_msg_send_time    | 2023-04-17 17:04:42.446616-07
last_msg_receipt_time | 2023-04-17 17:04:42.447379-07
latest_end_lsn        | A/5CE0DED8
latest_end_time       | 2023-04-17 17:04:42.446616-07
slot_name             |
sender_host           | aqms-bk-pp.ncss-ucb.cisn.org
sender_port           | 5432
conninfo              | user=repadmin password=******** channel_binding=prefer dbname=replication host=aqms-bk-pp.ncss-ucb.cisn.org port=5432 fallback_application_name=walreceiver sslmode=prefer sslcompression=0 sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres target_session_attrs=any
postgres=# \q
[postgres@aqms-nc-pp ~]$
AQMS: Auxiliary Role Switch & Other Items
- TDMT Users Email Notification
On aqms-bk-aux, run this command if Berkeley will be primary:
/home/aqms/bin/update_tmts_webpage.sh Berkeley
If Menlo will be primary:
/home/aqms/bin/update_tmts_webpage.sh Menlo
- ShakeMap Config Changes
On the new primary aux system:
run_shakecontrol stop
cd /home/aqms/shakemap_profiles/default/install/config
\rm shake.conf transfer.conf
ln -s transfer.conf-prod transfer.conf
ln -s shake.conf-prod shake.conf
run_shakecontrol start
On the new shadow aux system:
run_shakecontrol stop
cd /home/aqms/shakemap_profiles/default/install/config
\rm shake.conf transfer.conf
ln -s transfer.conf-back transfer.conf
ln -s shake.conf-back shake.conf
run_shakecontrol start
- DRP Missing WaveForm SnapShots
On the primary RT system (e.g. aqms-bk-rt is primary and evid 75000006 has missing snapshots), run this command on aqms-bk-rt:
/home/aqms/bin/siggen /signals/RTDB1/event 75000006 aqms-bk-rt_SIGSWITCH_EVENT_AQMS-BK-RT /home/aqms/cms/cms.cfg
Check the sigswitch log on the primary RT system; if no new events are being sent out, restart sigswitch:
sigswitch_ctl.sh forcestop
sigswitch_ctl.sh start
DB: Shadow Post Processing DB Recovery
- Streaming Replication Status (shadow node aqms-bk-pp is detached):
[postgres@aqms-nc-pp ~]$ repmgr cluster show
WARNING: node "aqms-bk-pp" not found in "pg_stat_replication"
 ID | Name        | Role    | Status    | Upstream      | Location | Priority | Timeline | Connection string
----+-------------+---------+-----------+---------------+----------+----------+----------+---------------------------------------------------------------------------------
 1  | aqmsbkpp    | standby |   running | ! walreceiver | default  | 100      | 12       | host=aqms-bk-pp.ncss-ucb.cisn.org user=repadmin dbname=repmgr connect_timeout=2
 2  | walreceiver | primary | * running |               | default  | 100      | 12       | host=aqms-nc-pp.ncss-mp.cisn.org user=repadmin dbname=repmgr connect_timeout=2
WARNING: following issues were detected
  - node "aqms-bk-pp" (ID: 1) is not attached to its upstream node "walreceiver" (ID: 2)
[postgres@aqms-nc-pp ~]$
- If WAL segments have not yet been recycled:
==> Run the following on the shadow PP system (e.g. aqms-bk-pp):
[postgres@aqms-bk-pp ~]$ repmgr node rejoin --force-rewind --verbose -d 'host=aqms-nc-pp.ncss-mp.cisn.org user=repadmin dbname=repmgr connect_timeout=2'
- If WAL segments have been recycled:
- If errors are present in the log, or pg_rewind fails when running 'repmgr node rejoin' with an error such as:
  2021-07-01 14:46:12.115 PDT [2491] ERROR: requested WAL segment 0000000C0000003A000000AA has already been removed
  First, on the shadow PP DB, move $PGDATA out of the way (and back it up if you want to). Then run:
  /usr/pgsql-14/bin/pg_basebackup -h source.host.com -R -D $PGDATA -U repadmin -v -P
  The "-R" generates a recovery file during the copy, so when you start up this instance it will be a streaming standby kept in sync with source.host.com.
- Recovery example:
[postgres@aqms-bk-pp ~]$ sudo systemctl stop postgresql-14
[postgres@aqms-bk-pp ~]$ pwd
/var/lib/pgsql
[postgres@aqms-bk-pp ~]$ mv 14 14.0
[postgres@aqms-bk-pp ~]$ /usr/pgsql-14/bin/pg_basebackup -h aqms-nc-pp.ncss-mp.cisn.org -R -D $PGDATA -U repadmin -v -P
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 6D/4F1BE930 on timeline 12
pg_basebackup: starting background WAL receiver
5474940/5474940 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 6D/83E8EE50
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
[postgres@aqms-bk-pp ~]$
[postgres@aqms-bk-pp ~]$ cd 14/data
[postgres@aqms-bk-pp ~]$ cp ../../14.0/data/pg_hba.conf .
[postgres@aqms-bk-pp ~]$ sudo systemctl start postgresql-14
[postgres@aqms-bk-pp ~]$ repmgr node check
INFO: connection is not a superuser connection, falling back to simple check
HINT: provide a superuser with -S/--superuser, or add the "repadmin" user to role "pg_read_all_settings" or "pg_monitor"
Node "aqmsbkpp":
        Server role: OK (node is standby)
        Replication lag: OK (0 seconds)
        WAL archiving: OK (0 pending archive ready files)
        Upstream connection: OK (node "aqmsbkpp" (ID: 1) is attached to expected upstream node "walreceiver" (ID: 2))
        Downstream servers: OK (this node has no downstream nodes)
        Replication slots: OK (node has no physical replication slots)
        Missing physical replication slots: OK (node has no missing physical replication slots)
        Configured data directory: OK (configured "data_directory" is "/home/pgsql/data")
[postgres@aqms-bk-pp ~]$ repmgr node status
Node "aqmsbkpp":
        PostgreSQL version: 14.6
        Total data size: 3746 MB
        Conninfo: host=aqms-bk-pp.ncss-ucb.cisn.org user=repadmin dbname=repmgr connect_timeout=2
        Role: standby
        WAL archiving: off
        Archive command: (none)
        Replication connections: 0 (of maximal 3)
        Replication slots: 0 physical (of maximal 80; 0 missing)
        Upstream node: walreceiver (ID: 2)
        Replication lag: 0 seconds
        Last received LSN: 6D/869FEEA8
        Last replayed LSN: 6D/869FEEA8
[postgres@aqms-bk-pp ~]$ repmgr cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string
----+-------------+---------+-----------+-------------+----------+----------+----------+---------------------------------------------------------------------------------
 1  | aqmsbkpp    | standby |   running | walreceiver | default  | 100      | 12       | host=aqms-bk-pp.ncss-ucb.cisn.org user=repadmin dbname=repmgr connect_timeout=2
 2  | walreceiver | primary | * running |             | default  | 100      | 12       | host=aqms-nc-pp.ncss-mp.cisn.org user=repadmin dbname=repmgr connect_timeout=2
[postgres@aqms-bk-pp ~]$
[postgres@aqms-nc-pp ~]$ repmgr node check
INFO: connection is not a superuser connection, falling back to simple check
HINT: provide a superuser with -S/--superuser, or add the "repadmin" user to role "pg_read_all_settings" or "pg_monitor"
Node "walreceiver":
        Server role: OK (node is primary)
        Replication lag: OK (N/A - node is primary)
        WAL archiving: OK (0 pending archive ready files)
        Upstream connection: OK (N/A - node is primary)
        Downstream servers: OK (1 of 1 downstream nodes attached)
        Replication slots: OK (node has no physical replication slots)
        Missing physical replication slots: OK (node has no missing physical replication slots)
        Configured data directory: OK (configured "data_directory" is "/home/pgsql/data")
[postgres@aqms-nc-pp ~]$ repmgr node status
Node "walreceiver":
        PostgreSQL version: 10.7
        Total data size: 3746 MB
        Conninfo: host=aqms-nc-pp.ncss-mp.cisn.org user=repadmin dbname=repmgr connect_timeout=2
        Role: primary
        WAL archiving: off
        Archive command: (none)
        Replication connections: 1 (of maximal 3)
        Replication slots: 0 physical (of maximal 80; 0 missing)
        Replication lag: n/a
[postgres@aqms-nc-pp ~]$ repmgr cluster show
 ID | Name        | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string
----+-------------+---------+-----------+-------------+----------+----------+----------+---------------------------------------------------------------------------------
 1  | aqmsbkpp    | standby |   running | walreceiver | default  | 100      | 12       | host=aqms-bk-pp.ncss-ucb.cisn.org user=repadmin dbname=repmgr connect_timeout=2
 2  | walreceiver | primary | * running |             | default  | 100      | 12       | host=aqms-nc-pp.ncss-mp.cisn.org user=repadmin dbname=repmgr connect_timeout=2
[postgres@aqms-nc-pp ~]$
DB: Streaming Replication Status
- Node Status:
[postgres@aqms-bk-pp ~]$ repmgr node check
Node "aqmsbkpp":
        Server role: OK (node is primary)
        Replication lag: OK (N/A - node is primary)
        WAL archiving: OK (0 pending archive ready files)
        Upstream connection: OK (N/A - node is primary)
        Downstream servers: OK (1 of 1 downstream nodes attached)
        Replication slots: OK (1 of 1 physical replication slots are active)
        Missing physical replication slots: OK (node has no missing physical replication slots)
        Configured data directory: OK (configured "data_directory" is "/home/pgsql/data")
[postgres@aqms-bk-pp ~]$ repmgr node status
Node "aqmsbkpp":
        PostgreSQL version: 14.6
        Total data size: 2088 MB
        Conninfo: host=aqms-bk-pp.ncss-ucb.cisn.org user=repadmin dbname=repmgr connect_timeout=2 application_name=aqmsbkpp
        Role: primary
        WAL archiving: off
        Archive command: (none)
        Replication connections: 2 (of maximal 3)
        Replication slots: 1 physical (of maximal 80; 0 missing)
        Replication lag: n/a
[postgres@aqms-bk-pp ~]$
[postgres@aqms-nc-pp ~]$ repmgr node check
Node "walreceiver":
        Server role: OK (node is standby)
        Replication lag: OK (0 seconds)
        WAL archiving: OK (0 pending archive ready files)
        Upstream connection: OK (node "walreceiver" (ID: 2) is attached to expected upstream node "aqmsbkpp" (ID: 1))
        Downstream servers: OK (this node has no downstream nodes)
        Replication slots: OK (1 of 1 physical replication slots are active)
        Missing physical replication slots: OK (node has no missing physical replication slots)
        Configured data directory: OK (configured "data_directory" is "/var/lib/pgsql/14/data")
[postgres@aqms-nc-pp ~]$ repmgr node status
Node "walreceiver":
        PostgreSQL version: 14.7
        Total data size: 2088 MB
        Conninfo: host=aqms-nc-pp.ncss-mp.cisn.org user=repadmin dbname=repmgr connect_timeout=2 application_name=walreceiver
        Role: standby
        WAL archiving: off
        Archive command: (none)
        Replication connections: 1 (of maximal 3)
        Replication slots: 1 physical (of maximal 80; 0 missing)
        Upstream node: aqmsbkpp (ID: 1)
        Replication lag: 0 seconds
        Last received LSN: A/5D18B6C0
        Last replayed LSN: A/5D18B6C0
[postgres@aqms-nc-pp ~]$
- Cluster Status:
[postgres@aqms-bk-pp ~]$ repmgr cluster show
 ID | Name        | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------------------------------------------------------
 1  | aqmsbkpp    | primary | * running |          | default  | 100      | 1        | host=aqms-bk-pp.ncss-ucb.cisn.org user=repadmin dbname=repmgr connect_timeout=2 application_name=aqmsbkpp
 2  | walreceiver | standby |   running | aqmsbkpp | default  | 100      | 1        | host=aqms-nc-pp.ncss-mp.cisn.org user=repadmin dbname=repmgr connect_timeout=2 application_name=walreceiver
[postgres@aqms-bk-pp ~]$ repmgr cluster event
 Node ID | Name        | Event            | OK | Timestamp           | Details
---------+-------------+------------------+----+---------------------+-------------------------------------------------------
 2       | walreceiver | standby_register | t  | 2023-04-17 11:19:08 | standby registration succeeded; upstream node ID is 1
 1       | aqmsbkpp    | primary_register | t  | 2023-04-17 11:09:15 |
 1       | aqmsbkpp    | cluster_created  | t  | 2023-04-17 11:09:15 |
[postgres@aqms-bk-pp ~]$
AQMS/DB: Troubleshooting
- UCB-EMP connectivity down:
  - No action required.
  - Wait until connectivity comes back up. Replication should resume and catch up automatically.
- Standby RT host down:
  - No action required.
  - Bring host back up.
- Standby PP host down:
  - No action required.
  - Bring host back up. Replication should resume and catch up automatically.
  - In case of problems, rebuild the shadow PP DB.
- Active RT host down:
  - Set the other RT system as active.
  - If the active PP system is up:
    - Perform a PP role switch.
  - If the active PP system is unavailable temporarily:
    - Wait for it to come back up.
    - Perform a PP role switch.
  - If the active PP system is unavailable for an extended period of time:
    - Force-promote the other PP DB as active.
    - The old active PP DB will need to be rebuilt as shadow.
  - Bring the host back up.
- Active PP host down:
  - If the active PP system is unavailable temporarily:
    - Wait for it to come back up.
  - If the active PP system is unavailable for an extended period of time:
    - Perform an RT role switch.
    - Force-promote the other PP DB as active.
    - The old active PP DB will need to be rebuilt as shadow.
  - Bring the host back up.
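After a forced promotion, it is worth confirming that the surviving PP DB has really left recovery before pointing clients at it. A minimal sketch using a standard PostgreSQL function (not a repmgr command):

```sql
-- On the newly promoted PP DB: should return f (false).
-- A standby that is still in recovery returns t.
SELECT pg_is_in_recovery();
```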
DB: DDL Updates
===== If aqms-bk-rt/aqms-bk-pp are primary =====
0.- Use 'postgres' DB user for all queries.
1.- Alter subscription on aqms-bk-pp:
    = \c archdb1
    = SELECT pglogical.alter_subscription_add_replication_set('rt1subscr', 'ddl_sql');
2.- Create subscription on aqms-nc-rt:
    = \c rtdb2
    = SELECT pglogical.create_subscription(
          subscription_name := 'rtsubscrddl',
          provider_dsn := 'host=aqms-bk-rt.ncss-ucb.cisn.org port=5432 user=repadmin password=repadmin_pass_ucb dbname=rtdb1',
          replication_sets := ARRAY['ddl_sql'],
          synchronize_structure := 'FALSE',
          synchronize_data := FALSE );
3.- Apply schema change on aqms-bk-rt:
    = \c rtdb1
    = SELECT pglogical.replicate_ddl_command('alter table trinetdb.swarm_state add testfield bigint');
4.- Alter subscription on aqms-bk-pp:
    = \c archdb1
    = SELECT pglogical.alter_subscription_remove_replication_set('rt1subscr', 'ddl_sql');
5.- Drop subscription on aqms-nc-rt:
    = \c rtdb2
    = SELECT pglogical.drop_subscription(subscription_name := 'rtsubscrddl');
===== If aqms-nc-rt/aqms-nc-pp are primary =====
0.- Use 'postgres' DB user for all queries.
1.- Alter subscription on aqms-nc-pp:
    = \c archdb1
    = SELECT pglogical.alter_subscription_add_replication_set('rt2subscr', 'ddl_sql');
2.- Create subscription on aqms-bk-rt:
    = \c rtdb1
    = SELECT pglogical.create_subscription(
          subscription_name := 'rtsubscrddl',
          provider_dsn := 'host=aqms-nc-rt.ncss-mp.cisn.org port=5432 user=repadmin password=repadmin_pass_wr dbname=rtdb2',
          replication_sets := ARRAY['ddl_sql'],
          synchronize_structure := 'FALSE',
          synchronize_data := FALSE );
3.- Apply schema change on aqms-nc-rt:
    = \c rtdb2
    = SELECT pglogical.replicate_ddl_command('alter table trinetdb.swarm_state add testfield bigint');
4.- Alter subscription on aqms-nc-pp:
    = \c archdb1
    = SELECT pglogical.alter_subscription_remove_replication_set('rt2subscr', 'ddl_sql');
5.- Drop subscription on aqms-bk-rt:
    = \c rtdb1
    = SELECT pglogical.drop_subscription(subscription_name := 'rtsubscrddl');
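Between steps, the state of the temporary 'rtsubscrddl' subscription can be verified with pglogical's status function; a hedged sketch (pglogical 2.x, run on the subscriber side):

```sql
-- On the subscriber DB (rtdb2 or rtdb1, depending on which side is primary):
-- status should be 'replicating' and replication_sets should list ddl_sql.
SELECT subscription_name, status, replication_sets
FROM pglogical.show_subscription_status();
```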
DB: Adding New Tables
In this example, the PDL_Ids and PDL_Product tables are added.

1.- Create tables and grant permissions on rtdb1, rtdb2 and archdb1 (active PP DB):

[postgres@aqms-bk-rt PDL_Tables]$ more *.sql
::::::::::::::
create_PDL_IDS.sql
::::::::::::::
CREATE TABLE PDL_IDS
(
  DBSETID BIGINT NOT NULL,
  TNAME   VARCHAR(30) NOT NULL,
  ID      BIGINT NOT NULL,
  LDDATE  TIMESTAMP DEFAULT (CURRENT_TIMESTAMP AT TIME ZONE 'UTC'),
  CONSTRAINT PDL_IDS_PK PRIMARY KEY (DBSETID, TNAME)
);
::::::::::::::
create_PDL_PRODUCT.sql
::::::::::::::
CREATE TABLE PDL_PRODUCT
(
  EVID     BIGINT NOT NULL,
  PRODTYPE VARCHAR(30) NOT NULL,
  DEST     VARCHAR(30) NOT NULL,
  SOURCE   VARCHAR(8) NOT NULL,
  TIME     TIMESTAMP NOT NULL,
  ACTION   VARCHAR(2) NOT NULL,
  DBSETID  BIGINT,
  LDDATE   TIMESTAMP DEFAULT (CURRENT_TIMESTAMP AT TIME ZONE 'UTC'),
  CONSTRAINT PDL_PRO_PK PRIMARY KEY (EVID, PRODTYPE, ACTION, DEST, TIME, SOURCE),
  CONSTRAINT PDL_PRO_TYPE CHECK (PRODTYPE IN ('origin','phase-data','moment-tensor','focal-mechanism','nearby-cities','shakemap')),
  CONSTRAINT PDL_PRO_DEST CHECK (DEST IN ('Recent_event_PDL','ComCat_PDL','dev-ComCat_PDL','dev-Recent_event_PDL')),
  CONSTRAINT PDL_PRO_ACT CHECK (ACTION IN ('U','D','u','d'))
);
::::::::::::::
grant.sql
::::::::::::::
GRANT SELECT ON PDL_IDS TO trinetdb_read, code;
GRANT INSERT, UPDATE, DELETE ON PDL_IDS TO trinetdb_write, code;
GRANT SELECT ON PDL_PRODUCT TO trinetdb_read, code;
GRANT INSERT, UPDATE, DELETE ON PDL_PRODUCT TO trinetdb_write, code;
[postgres@aqms-bk-rt PDL_Tables]$
[postgres@aqms-bk-rt PDL_Tables]$ psql -U trinetdb -d rtdb1 < create_PDL_IDS.sql
CREATE TABLE
[postgres@aqms-bk-rt PDL_Tables]$ psql -U trinetdb -d rtdb1 < create_PDL_PRODUCT.sql
CREATE TABLE
[postgres@aqms-bk-rt PDL_Tables]$ psql -U trinetdb -d rtdb1 < grant.sql
GRANT
GRANT
GRANT
GRANT
[postgres@aqms-bk-rt PDL_Tables]$

2.- Set up replication on rtdb1, rtdb2 and archdb1. In this example, aqms-nc-rt/aqms-nc-pp are the active systems.
2.1.- Remove aqms-bk-rt replication set on aqms-nc-pp:

[postgres@aqms-nc-pp ~]$ psql -d archdb1
psql (10.14)

archdb1=# SELECT pglogical.alter_subscription_remove_replication_set('rt1subscr', 'rtrepset');
 alter_subscription_remove_replication_set
-------------------------------------------
 t
(1 row)

archdb1=# \q
[postgres@aqms-nc-pp ~]$

2.2.- Add tables to replication set on aqms-bk-rt:

[postgres@aqms-bk-rt ~]$ psql -d rtdb1
psql (9.2.24, server 10.10)
Type "help" for help.

rtdb1=# SELECT pglogical.replication_set_add_table(set_name:='rtrepset', relation:='trinetdb.pdl_ids', synchronize_data:=TRUE);
 replication_set_add_table
---------------------------
 t
(1 row)

rtdb1=# SELECT pglogical.replication_set_add_table(set_name:='rtrepset', relation:='trinetdb.pdl_product', synchronize_data:=TRUE);
 replication_set_add_table
---------------------------
 t
(1 row)

rtdb1=# \q
[postgres@aqms-bk-rt ~]$

2.3.- Add aqms-bk-rt replication set on aqms-nc-pp:

[postgres@aqms-nc-pp ~]$ psql -d archdb1
psql (10.14)
Type "help" for help.

archdb1=# SELECT pglogical.alter_subscription_add_replication_set('rt1subscr', 'rtrepset');
 alter_subscription_add_replication_set
----------------------------------------
 t
(1 row)

archdb1=# \q
[postgres@aqms-nc-pp ~]$

2.4.- Do an active/standby role switch (aqms-bk-rt/aqms-bk-pp are now active).

2.5.- Remove aqms-nc-rt replication set on aqms-bk-pp:

[postgres@aqms-bk-pp ~]$ psql -d archdb1
psql (10.14)
Type "help" for help.

archdb1=# SELECT pglogical.alter_subscription_remove_replication_set('rt2subscr', 'rtrepset');
 alter_subscription_remove_replication_set
-------------------------------------------
 t
(1 row)

archdb1=# \q
[postgres@aqms-bk-pp ~]$

2.6.- Add tables to replication set on aqms-nc-rt:

[postgres@aqms-nc-rt ~]$ psql -d rtdb2
psql (10.16)
Type "help" for help.

rtdb2=# SELECT pglogical.replication_set_add_table(set_name:='rtrepset', relation:='trinetdb.pdl_ids', synchronize_data:=TRUE);
 replication_set_add_table
---------------------------
 t
(1 row)

rtdb2=# SELECT pglogical.replication_set_add_table(set_name:='rtrepset', relation:='trinetdb.pdl_product', synchronize_data:=TRUE);
 replication_set_add_table
---------------------------
 t
(1 row)

rtdb2=# \q
[postgres@aqms-nc-rt ~]$

2.7.- Add aqms-nc-rt replication set on aqms-bk-pp:

[postgres@aqms-bk-pp ~]$ psql -d archdb1
psql (10.14)
Type "help" for help.

archdb1=# SELECT pglogical.alter_subscription_add_replication_set('rt2subscr', 'rtrepset');
 alter_subscription_add_replication_set
----------------------------------------
 t
(1 row)

archdb1=# \q
[postgres@aqms-bk-pp ~]$
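Once both RT DBs are done, membership of the new tables in 'rtrepset' can be double-checked with a query against pglogical's internal catalogs; a hedged sketch (pglogical 2.x, and the catalog layout may differ between versions):

```sql
-- List every table assigned to each replication set; the two new PDL
-- tables should appear under rtrepset on both rtdb1 and rtdb2.
SELECT s.set_name, t.set_reloid::regclass AS table_name
FROM pglogical.replication_set_table t
JOIN pglogical.replication_set s USING (set_id)
ORDER BY s.set_name;
```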