
Monitor, an Application for Monitoring Systems

Monitor is a perl script (written by Doug Neuhauser in about 1994) for generalized monitoring. Being a perl script, it is more or less agnostic to the type of unix or linux on which it runs. However, there are a number of configurable parameters (both in the script and in the configuration file) that must be adjusted to make the program work with some “standard” utilities provided by unix/linux. Don't try to copy a Solaris version of monitor to Linux; it will most likely not work properly.

Minimal help is provided by running “monitor -h”:

monitor - Monitor processes, computers, disks, and other progs.
Syntax:
monitor [-b] [-h] [-d n] [-l logfile] [config_file]
where:
	-b	Boot option.  Delay monitoring at boot time for
		BOOT_DELAY seconds specified in config file.
	-d n	Debug value n.  Value can be the OR of:
		1 = print scheduling info
		2 = print time info
		4 = print sorting info
		8 = print out commands
		16 = print alarm messages instead of sending them.
		64 = display monitor_list structure
		128 = print proc list when missing process.
	-h  Help - prints this help message.
	-l logfile
	        Log all alarms to the specified logfile.
	config_file
		Configuration file.  Default file is: monitor.config

For most applications, monitor is started at system boot and runs continuously; e.g.

monitor -b -l /home/ncss/run/logs/monitor.log /home/ncss/run/params/monitor.config

In (almost?) all systems, monitor is started at boot time with the “-b” flag. This allows a few minutes for applications to get started before monitor starts checking things. This boot delay also comes into effect when monitor is restarted with the HUP signal. This feature can be useful to avoid notifications when restarting applications being checked by monitor.

Configuration File

The configuration file, monitor.config by default, is a text file consisting of 5 sections, with comments interspersed.

The first section provides basic configuration parameters for monitor's operation. Each parameter is treated much like a unix environment variable, and is actually added to monitor's process environment. This makes these variables available to any programs that monitor runs. Thus PATH (and LD_LIBRARY_PATH, if necessary) can be set here.

Comments are prefixed with “#”.

The “EMAIL” parameter includes the subject line under which email messages are sent; e.g.:

EMAIL=/bin/mail -s 'ALARM - monitor@ucbns2'

In this first section, arbitrary variables may be defined which will be used in the following sections of monitor.config. For example, on ucbns2 we have:

 # Notification list - may be used in place of user:action
 #-----------------------------------------------------------------------
 NOTIFY_LIST_1=(peggy,lombard,cpaff):(email,pager1),(jennifer):email
 NOTIFY_LIST_2=(peggy,lombard,cpaff):(email,pager2)
 NOTIFY_NCSS_1=(peggy,lombard,cpaff):(email,pager1)
 NOTIFY_NCSS_2=(peggy,lombard,cpaff):(email,pager2)
 NOTIFY_DOUG=doug:(email,pager)

The remaining four sections configure the four different ways that monitor does its work. These are “prog”, “alive”, “proc” and “disk”. Each of these sections is optional; the only restriction is that at least one command must be provided among them. Otherwise monitor won't have anything to do, so it will simply exit.
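Pulling these together, a skeletal monitor.config using all five sections might look like the following. This is entirely hypothetical: the host, user and notification names are invented, and the BOOT_DELAY value is only illustrative; the syntax of each command is detailed below.

```text
# Section 1: environment-style parameters
PATH=/usr/bin:/home/ncss/run/bin
BOOT_DELAY=300
EMAIL=/bin/mail -s 'ALARM - monitor@example-host'
NOTIFY_OPS=(alice,bob):(email)

# prog: run a check program periodically
prog  check_page.pl  NOTIFY_OPS  60 3600 1  -t 300

# alive: ping another host
alive  example-host.berkeley.edu  NOTIFY_OPS  60 3600 1  5

# proc: look for a process in the ps -ef listing
proc  ncss:.*some_daemon.*  NOTIFY_OPS  120 3600 1

# disk: free-space check
disk  /home  NOTIFY_OPS  1800 21600 0  90%
```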

The “prog” command tells monitor to run some program (any sort of executable). The comments in monitor say:

 # Expectations of a program that is run under monitor:
 #       writes to STDOUT only when conditions are unsatisfactory;
 #           if conditions are normal, it should write nothing to STDOUT
 #       The STDOUT of the program will be sent as part of the page/email
 #       anything written to STDERR will go in monitor.error.log but
 #           nowhere else.
 #       Exit status: should be 0 under most conditions; 
 #           use non-zero exit status only for failure to execute part
 #           of the program. This non-zero exit status will REPLACE
 #           anything written to STDOUT in the page/email sent by monitor.

Note that program STDERR output is excluded from notifications; it is combined with any STDERR output from monitor itself (e.g. in monitor.error.log).
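That contract is easy to satisfy. Here is a hypothetical check (not one of the BSL scripts; the name and message are invented) that stays silent when a directory is writable and complains on STDOUT otherwise:

```shell
# Hypothetical "prog"-style check for monitor: print to STDOUT only
# when something is wrong, and exit 0 unless the check itself failed.
check_writable() {
    dir=$1
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "directory $dir is missing or not writable"
    fi
    return 0   # non-zero only if the check could not run at all
}

check_writable /tmp            # healthy: normally prints nothing
check_writable /no/such/place  # unhealthy: the message becomes the page/email body
```

When run under monitor, the message printed for the unhealthy case is exactly what would be sent to the notification list.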

Here are some comments and sample “prog” lines from ucbns2:

 # Alarmflags are the following numeric parameters:
 #       run renotify notify_clear
 #   where:
 #       run -           Run program every N seconds.
 #                       Notify users when alarm is first raised.
 #       renotify -      Renotify users every N seconds
 #                       if alarm stays raised.
 #       notify_clear -  Boolean flag (0 or 1) whether users
 #                       should be notified when alarm is cleared.
 
 #    Program         Notify_list    Alarmflags  Prog_args
 #-----------------------------------------------------------
 prog status_mon      NOTIFY_LIST_2  60 3600 1  -a 600 -d 28800
 prog status_wda      NOTIFY_LIST_2  60 3600 1  -C 5 -c 900 -T 99999999
 prog status_wda      NOTIFY_LIST_2  60 3600 1  -C 999 -c 99999999 -T 43200
 prog check_page.pl   NOTIFY_LIST_1  60 3600 1  -t 300

Notice the last item here, running the check_page.pl script. This is testing some conditions of the pager daemon on ucbns2. If it finds a problem, it must be reported using pager1, the pager system on ucbns1. It cannot use the pager system on ucbns2 to report that ucbns2's pager system is not working! This is another instance in which monitor.conf is specific for one machine, and cannot be copied intact to another machine.

The “alive” commands are used to check the network connectivity of other hosts using the “ping” utility. Here are a few sample entries:

 #       Computer                     Notify         Alarmflags    ping_count
 #-------------------------------------------------------------------
 alive   ucbns1.seismo.berkeley.edu   NOTIFY_NCSS_2  60 3600 1       5
 alive   rumble.seismo.berkeley.edu   NOTIFY_NCSS_2  60 3600 1       5
 alive   benito.seismo.berkeley.edu   NOTIFY_NCSS_2  60 3600 1       5
 alive   ucbrt.seismo.berkeley.edu    NOTIFY_NCSS_2  60 3600 1       5
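Each alive line amounts to something like the following sketch (not monitor's actual code; the real program also honors the Alarmflags schedule and notification lists):

```shell
# Sketch of an "alive" check: ping the host ping_count times and
# complain on STDOUT only when no reply comes back at all.
alive_check() {
    host=$1; count=$2
    if ! ping -c "$count" "$host" >/dev/null 2>&1; then
        echo "$host is not answering pings"
    fi
    return 0
}
```

For example, alive_check rumble.seismo.berkeley.edu 5 would print a complaint only if all five pings failed.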

The “proc” command is used to check for the presence of a given process, by user, name, and possibly some arguments. The “Program” part of the proc command uses perl regular expressions to compare against what is reported by “ps -ef”. This is a bit tricky to get right! Here are some samples:

 #       User,Program            Notify                  Alarmflags
 #-----------------------------------------------------------------
 proc    ncss:.*adadup_ucbns22ucbrt.*  NOTIFY_LIST_2     120 3600 1
 proc    ncss:.*adadup_ucbns22ncss3.*  NOTIFY_LIST_2     120 3600 1
 proc    ncss:.*crossoverSA/CA_base.* lombard:email      120 3600 1
 proc    ncss:.*SocketAgent/CA_base.* lombard:email      120 3600 1
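Before trusting a new pattern, it helps to hand-test it against a captured ps -ef line. grep -E approximates perl regex syntax well enough for patterns like these (the sample ps line below is fabricated, and monitor itself matches the user field separately from the program regex):

```shell
# Hand-test a "proc" pattern against a sample ps -ef line before
# adding it to monitor.config.
sample='ncss  4242     1  0 Jan01 ?  00:00:07 adadup_ucbns22ucbrt -c ada.cfg'
echo "$sample" | grep -E 'ncss.*adadup_ucbns22ucbrt'
```

If grep prints the line, the pattern matches; if it prints nothing, monitor would consider the process missing.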

The “disk” command is used to have monitor check for sufficient free space on a given file-system. For example:

 #       Disk            Notify         Alarmflags        Minfree|full%
 #----------------------------------------------------------------------
 disk    /home/aq12      lombard        1800 21600 0    90%
 disk    /home/aq12      NOTIFY_NCSS_2  1800  7200 0    95%
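Under the hood this kind of check is essentially a df comparison. A sketch (not monitor's actual code; df -P keeps the output format predictable):

```shell
# Sketch of the "disk" check: read df's capacity column for the
# filesystem and compare it with the configured full% threshold.
disk_full_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

pct=$(disk_full_pct /home)
if [ "$pct" -ge 90 ]; then
    echo "/home is ${pct}% full"
fi
```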

For BSL systems, monitor is run for at least the following user:host combinations:

ncss:ucbns1  BSL acquisition and AQMS net services
ncss:ucbns2  ""
ncss:ucbrt   BSL AQMS RT system
ncss:ucbpp   BSL AQMS PP system
dcmgr:ucbpp  BSL event waveform archiving
ncss:rumble  ShakeMap
redi:shaker  Finite Fault
ncss:quake7  PSD system
ncss:benito  BSL AQMS DRP system
ncss:sutter  dac480 support
dcmgr:hugo   DART

BSL Test AQMS systems:

ncss:mono    Solaris AQMS
ncss:seiche  Linux AQMS

For Menlo Park systems, monitor is run for the following user:host combinations:

ncss:mnlons1  MP AQMS net services
ncss:mnlons2  MP AQMS net services
ncss:mnlort1  MP AQMS RT system
ncss:mnlodb1  MP AQMS PP system
dcmgr:mnlodb1 MP event waveform archiving
ncss:mnlodd1  MP real-time double-difference system

Programs Used By Monitor

The following programs are used by monitor in the prog section of the configuration file on various systems. Each of these programs can be used by hand when needed. Most but not all of them will report how to use them when given the “-h” command-line option.

action_error

On AQMS systems with alarm systems (alarmdec, alarmact, alarmdist), the script action_error is used to find any alarm actions that are in the ERROR state in the Alarm_Action database table.

ncss@ucbrt.geo.berkeley.edu:action_error -h
    action_error version 0.0.2
        action_error - report alarm_actions in ERROR state
Syntax:
        action_error  [-c config] [-E evid] [-U evid action]
where:
        -E evid         - query for the action commandline for
        any alarm actions in ERROR state for event <evid>.
        -U evid action  - update any of event <evid>'s actions
        of name <action> from ERROR state to ERROR-ACK state.
        -c config       - specify an alternate configuration file.
        The default config file is /home/ncss/run/params/db.conf
        -h      Help    - prints this help message.

        When neither option -E or -U is given, action_error prints any event
        IDs and their actions which are in the ERROR state. This mode is
        suitable for use by monitor.

NOTE:

Once action_error (through monitor) has reported an alarm action in the ERROR state, the user should investigate the problem. Most alarm actions log their results and errors in files in the run/alarms/logs directory.

action_error can be used with the -E evid option to learn the command-line appropriate for manually running the alarm action script.

Once the error condition has been resolved, action_error should be run with the -U evid action options to change the alarm action state from ERROR to ERROR-ACK in the alarm_action table. This will silence the complaints from monitor when it next runs action_error.

Note that on the post-processing systems, the alarm_action table is replicated among all the archive databases. That means that only one instance of monitor should be configured to run action_error at a time; otherwise you may get multiple pager messages about a single error condition. By convention, monitor runs action_error only on the active post-processing system.

check_page.pl

On the BSL systems (ucbns1, ucbns2) which actually submit pager and SMS messages, we use check_page.pl to look for stale pager files. Their presence would indicate that the pager daemon is not able to deliver messages in a timely manner.

ncss@ucbns1:./check_page.pl -h
    check_page.pl version 0.0.1
        check_page.pl - check for stale pager files
Syntax:
        check_page.pl  [-h] [-t maxAge]
where:
        -t maxAge       - Maximum age in seconds for pending pager files
                        If older files are found, squawk about them!
                        Default max age is 300 seconds.
        -h              - prints this help message.

If check_page.pl is reporting a problem with the pager system on the local system, e.g. ucbns1, it is important to configure monitor to send reports of the problem to a different system, e.g. ucbns2.

checkampexchange

To monitor the exchange of ground-motion amplitude packets on the post-processing systems, monitor uses the checkampexchange script:

ncss@ucbpp:checkampexchange -h
    checkampexchange version 0.0.3 - 2015/01/29 NCSS
        checkampexchange - report problems with amp import/export
Syntax:
        checkampexchange [-a] [-b] [-e] [-j] [-q] [-T maxAge] [-v]
        checkampexchange -h
where:
        -a      check for unexpected files in get_amps/new directory
        -b      check for stale heartbeats
        -e      check for files in import error directory
        -j      check for jammed-up outgoing files
        -q      check for SQL errors or exceptions in Gmp2Db log
        -T maxAge  set max heartbeat or file age in minutes; default is 10
        -v      set verbose output; not suitable for use under monitor
        -h      Help    - prints this help message.

        When none of -a, -b, -e, -j, -q are specified, they ALL are implied.

NOTE:

We use only the -T 30 option in monitor.config; i.e., perform all of the checks defined for -a, -b, -e, -j, and -q with a 30 minute delay allowed in the heartbeat messages sent from our sister agencies CGS and Caltech.

See the code (run/bin/checkampexchange) for details of the checks.

checkautoposter

On the post-processing systems, we use an Oracle Database job to insert new subnet trigger events into the pcs system. To monitor the status of the autoposter job, we use the script checkautoposter:

ncss@ucbpp:checkautoposter -h

check status of autoposter jobs

  usage: /home/ncss/ncpp/bin/checkautoposter -[h] [-d dbase]

  -h        : print usage info
  -d dbase  : use this particular dbase (defaults to "MasterDB")
  -v        : verbose: report good status; normally report only bad status

example: /home/ncss/ncpp/bin/checkautoposter -d dcmp2

This script queries the database to see if the autoposter job is running and whether it has reported any errors. If errors are found, they should be reported to the Oracle DBA for help in resolving them.

comparelocks

The jiggle application is used for human-controlled event location and magnitude evaluation. In order to ensure that different jiggle users are not working on the same event, jiggle uses the database table JasiEventLock to “lock” other users out of the event that is being worked. Database replication is not adequate for this locking mechanism on different databases, so the JasiEventLock table is not replicated between the archive DBs.

Instead we use the script comparelocks to ensure that all jiggle users are using the same database:

ncss@ucbpp:comparelocks -h

    Compare current event locks.

    usage: /home/ncss/ncpp/bin/comparelocks [-v] [-m]

     -h        : print usage info
     -m        : truncated one-line output for use with monitor (NCSS)
     -v        : verbose: print all locks;
                 Normally all locks printed only if multiple DB's in use.

This script queries each of the listed databases (hard-coded in the script) to see if any events are locked on that database. If it finds events locked on more than one database, it causes monitor to send a message.

The corrective action is to tell the jiggle users to get on the same database.

dbping

dbping tests its configured database for basic functionality. This perl script connects to the database and does one query. If that succeeds, dbping is happy; otherwise it reports an error.

    dbping version $Id: dbping.pl,v 1.6 2004/11/20 16:43:26 redi Exp $
        dbping - ping a database
Syntax:
        dbping  [-c config] [-n repeats] [-t timeout] [-d debug_level] [-h]
where:
        -c config       - specify an alternate configuration file
                          The default config file is /home/redi/run/params/db.conf
        -n repeats      - number of times to try for database response
        -t timeout      - time to wait for response from database
        -d debug_level  - prints programmer debug code
        -h              - prints this help message.

Unexpected errors reported by dbping should be forwarded to the Oracle DBA for corrective action.

dircheck

dircheck is a simple shell script to monitor directories for unexpected files. We use it on directories that have files written to them and then removed by various polling programs. For example, the input directory for the Earthworm sendfileII program normally is empty. Other programs write files in this directory; sendfileII deletes the files as soon as they have been sent. If there are files sitting in the sendfileII input directory, it is because they cannot be sent. dircheck will report this problem through monitor.

ncss@ucbpp:dircheck -h
usage: /home/ncss/run/bin/dircheck max-files directory-path
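The logic of dircheck is only a few lines; in outline (a sketch, not the actual script):

```shell
# Outline of dircheck: complain on STDOUT when a directory holds more
# than max-files entries, stay silent otherwise.
dircheck_sketch() {
    max=$1; dir=$2
    n=$(ls "$dir" | wc -l)
    if [ "$n" -gt "$max" ]; then
        echo "$dir holds $n files (limit $max)"
    fi
    return 0
}
```

For a sendfileII input directory that should normally be empty, a max-files of 0 (or a small number, to tolerate files in transit) would be appropriate.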

When dircheck reports errors, the AQMS operator will have to track down the processes involved and determine the appropriate corrective action.

pcsWatchdog

The pcsWatchdog program checks for abnormal entries in the pcs state table. We run this program on whichever of the post-processing systems is configured as “active”.

ncss@ucbpp:pcsWatchdog -h

    Check for PCS backlog

    usage: /home/ncss/ncpp/bin/pcsWatchdog [-d] <config_file> [<dbase>]

     -h        : print usage info
     -d        : turn on diagnostic output
     config_file : full path to the config file
     dbase     : use this particular dbase (defaults to "dcucb")

    example: /home/ncss/ncpp/bin/pcsWatchdog /home/ncss/ncpp/conf/checkStates.cfg dcucb

This script uses a configuration file that lists the various states to check for. Note that the Group, Table and State entries in this file are used directly in a database query, so SQL wild-cards can be used. This file is /home/ncss/ncpp/conf/checkStates.cfg:

# Group Table State Age(secs) Count
EventStream % NewEvent 0 0
EventStream % NewTrigger 0 0
EventStream % MakeDRPGif 60 2
EventStream % MakeTrigGif 60 2
EventStream % ExportAmps 60 10
EventStream % FPfit 120 0
EventStream % ExportWF 760 1
EventStream % ExportArc 120 0
EventStream % ddrtFeed 120 0
EventStream % SwarmAlarm 120 0
EventStream % AssocTrig 0 0
EventStream % TrigCheck 360 10
TPP  TPP DELETED  60 0
TPP  TPP FINALIZE 60 0
TPP  TPP ALARM 60 0
TPP  TPP ddrtFeed 120 1
TPP  TPP MakeDRPGif 60 2
TPP  TPP FPfit 120 0
TPP  TPP ExportArc 60 1
TPP  TPP CANCELALARM 60 0
TPP  TPP DeleteArc 60 0
TPP  TPP REPOP 300 0
# >100 rows of any states older than 10min
% % % 600 100

Errors reported by pcsWatchdog are usually due to problems in the pcs client programs responsible for handling that state.

pdlSendCheck

pdlSendCheck is a script for monitoring the ProductClient poll directory. Since we no longer use ProductClient in polling mode, this script is no longer needed.

status_ada, status_wda

status_ada and status_wda are scripts for monitoring data latency in the AQMS ADA and WDA shared memory regions (also called GCDA, generic channel data area), respectively. These two scripts work by using the output from AQMS programs adastat and wdastat.

The two scripts are configured with files listing the SNCLs to be monitored; a flag value following each SNCL indicates whether that SNCL should be reported (flag = 1) or temporarily ignored (flag = 0). Normally one SNCL from each station whose data should be in the GCDA is listed, since most data latency problems affect all SNCLs of a station in the same way. One exception is stations with multiple dataloggers: in those cases, one SNCL from each datalogger may be included in the configuration file.
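For illustration, a configuration excerpt might look like this. The exact file layout is an assumption based on the description above, and the SNCLs are only examples:

```text
# status_wda.config excerpt (hypothetical):
# one SNCL per line, followed by a flag: 1 = report, 0 = temporarily ignore
BK.CMB.BHZ.00   1
BK.MHC.BHZ.00   1
BK.SAO.BHZ.00   0
```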

Each of these scripts offers the same command-line options. Here's status_wda:

ncss@ucbns2:status_wda -h
status_wda version 0.8 (2010.218)
status_wda - Monitor WDA region.
Syntax:
status_wda   [-f file] [-T N] [-C list] [-c list] [-d n] [-h]
where:
        -f file Name of config file. Default is:
                        /home/ncss/run/params/status_wda.config
        -T N    Total allowable delay summed over all stations.
                Default is 240 * number of channels;
        -C list Comma-delimited list of cluster sizes.
        -c list Comma-delimited list of allowable delay for each
                station for the corresponding cluster size.
        -d n    Debug option.
                1 provides basic debugging info.
                2 provides sorted delay info.
Examples:
status_wda -T 1800
        Set max total delay.
status_wda -C 1,2,3 -c 1800,900,360
        Set delay limits of
                1 station at 1800 seconds each
                2 stations at 900 seconds each
                3 stations at 360 seconds each
        Max total delay is unchanged from default.

These options provide two basic ways of monitoring data latency. To monitor the total data latency (sum of latencies for all configured SNCLs), set the -T value to some low number, while setting the -c and -C values to large numbers. On the network service systems, we typically use a -T value of 43200 (12 hours), with -C 999 (more than the configured number of SNCLs). This will generate warning messages if one SNCL is out for about 12 hours, or two SNCLs out for about 6 hours, etc.

To monitor latency for groups of SNCLs instead of total latency, we set small values for -c and -C, with a very large value for -T. On the UCB net service systems, we use -C 3 (a cluster of 3 SNCLs) and -c 600 (ten minutes). The result is that if any three SNCLs each have latency of more than 600 seconds, we will get a warning message.
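One reading of the cluster semantics is: with -C k -c D, an alarm fires when the k-th largest per-station delay exceeds D. That idea can be sketched as follows (an interpretation for illustration, not the script's actual code; the function name and message are invented):

```shell
# Sketch of the -C/-c clustering idea: for each cluster size k with
# limit D, alarm when the k-th largest station delay exceeds D.
cluster_alarm() {
    # $1 = comma list of cluster sizes, $2 = comma list of limits,
    # remaining args = per-station delays in seconds
    sizes=$1; limits=$2; shift 2
    sorted=$(printf '%s\n' "$@" | sort -rn)
    i=1
    for size in $(echo "$sizes" | tr ',' ' '); do
        limit=$(echo "$limits" | cut -d, -f"$i")
        kth=$(printf '%s\n' "$sorted" | sed -n "${size}p")
        if [ -n "$kth" ] && [ "$kth" -gt "$limit" ]; then
            echo "cluster of $size stations over ${limit}s limit"
        fi
        i=$((i+1))
    done
    return 0
}
```

With -C 3 -c 600, three stations at 700 seconds each would trip the alarm, while two late stations plus one healthy one would not.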

status_wda and status_ada are often used manually. In that case, the most useful option is -d 2, which gives a list of all configured SNCLs (including those configured with flag “0”), sorted in descending order of latency.

status_mon

The AQMS systems at UC Berkeley do not use the standard Earthworm programs startstop, statmgr or status. Instead they use locally written programs status_mgr and status_mon for checking the state of health of Earthworm programs running on these systems.

status_mgr creates a small shared-memory region containing entries for each of the Earthworm message types it is configured to monitor. Like the Earthworm statmgr, status_mgr normally is configured to watch heartbeat messages from configured modules. But status_mgr can also monitor other message types, which it considers data. In the shared-memory region, status_mgr keeps track of the last time it received each of the messages it is configured to monitor, as well as a few other useful parameters.

ncss@ucbns1:status_mgr -h
status_mgr   [-F | -K] [-h] [-d N] config_file
    where:
        -h          Help - prints syntax message.
        -r ringname Name of ring to monitor.  Default is ?
        -F | -K     Flush or Keep old contents of redi ring (default=Keep).
        -d N        Debug option N (currently unused).
        config_file Name of configuration file (default = status_mgr.config)

Note that status_mgr is hard-coded to read Earthworm messages from REDI_RING. This seemed reasonable when it was first written. It may be time to make the small code changes necessary to make the input ring name configurable.

status_mgr runs continuously, started at the same time as the other Earthworm programs. It should be restarted whenever changes are made to its configuration file:

 run_status_mgr restart

Here's a sample configuration file, taken from ucbrt. It demonstrates the use of both HEARTBEAT and DATA configuration types.

#
# Status region: Hard-coded to read from REDI_RING
#
SHMEM=3030
#
# Type=Program_Name,Message_Type,Module,Installation,Expect,Flag
#
HEARTBEAT=ImportPkTrigMenlo1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_1,INST_UCB,60,1
HEARTBEAT=PkTrigServerMenlo1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_1,INST_MENLO,60,1
HEARTBEAT=ImportPkTrigMenlo2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_2,INST_UCB,60,1
HEARTBEAT=PkTrigServerMenlo2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_2,INST_MENLO,60,1
HEARTBEAT=ImportPkTrigUcbns1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_UCBNS1,INST_UCB,60,1
HEARTBEAT=PkTrigServerUcbns1,TYPE_HEARTBEAT,MOD_EXPORT_PKTRIG_UCBNS1_BK,INST_UCB,60,1
HEARTBEAT=ImportPkTrigUcbns2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_UCBNS2,INST_UCB,60,1
HEARTBEAT=PkTrigServerUcbns2,TYPE_HEARTBEAT,MOD_EXPORT_PKTRIG_UCBNS2_BK,INST_UCB,60,1
HEARTBEAT=File2EW_Trig,TYPE_HEARTBEAT,MOD_FILE2EW_TRIG,INST_UCB,60,1
#
HEARTBEAT=Pkfilter,TYPE_HEARTBEAT,MOD_PKFILTER,INST_UCB,60,1
HEARTBEAT=Binder_ew,TYPE_HEARTBEAT,MOD_BINDER_EW,INST_UCB,60,1
HEARTBEAT=Eqassemble,TYPE_HEARTBEAT,MOD_EQASSEMBLE,INST_UCB,60,1
DATA=EvtAssemble,TYPE_HYP2000ARC,MOD_EQASSEMBLE,INST_UCB,0,1
HEARTBEAT=Hyps2ps,TYPE_HEARTBEAT,MOD_HYPS2PS,INST_UCB,60,1
#
HEARTBEAT=StatrigFilter,TYPE_HEARTBEAT,MOD_STATRIGFILTER,INST_UCB,60,1
HEARTBEAT=Carlsubtrig,TYPE_HEARTBEAT,MOD_CARLSUBTRIG,INST_UCB,60,1
DATA=SubnetTrigger,TYPE_TRIGLIST_SCNL,MOD_CARLSUBTRIG,INST_UCB,0,1
HEARTBEAT=Trig2ps,TYPE_HEARTBEAT,MOD_TRIG2PS,INST_UCB,60,1
#

The configuration file allows for naming the triplet of Earthworm message type, module ID and installation ID. This name is intended to be (hopefully) more comprehensible to human readers than the Earthworm syntax.

status_mgr ignores Earthworm TYPE_ERROR messages; these are not used in UCB Earthworm systems.

The companion program status_mon is normally used by monitor to report abnormal Earthworm conditions. It reads from the same shared-memory region controlled by status_mgr and determines the latency of each parameter being monitored. If the latency exceeds a limit, it will be reported. The latency limits are those configured in status_mgr's configuration file, unless modified on status_mon's command line. If no limits are exceeded, then status_mon is silent.

ncss@ucbns2:status_mon -h
status_mon   [-a N] [-d N] [-m shmkey] [-h] [-I list | -O list]
    where:
        -h             Help - prints syntax message.
        -m shmkey      Specify shared memory key.
        -a N           Set alive (heartbeat) threshold to N seconds.
        -d N           Set data threshold to N seconds.
        -I ignore-list Ignore items in comma-separated list.
        -O only-list   Only list items in comma-separated list.
    -I and -O cannot both be used at the same time.

The UCB command status is simply a symbolic link to status_mon. It provides a complete list of the configured names, last update times, expected update intervals and current latencies. This command is intended for manual use. For example:

ncss@ucbns2: status
Name                   Last Update (UTC)    Expect  Delta
---------------------------------------------------------
Import_Trace_Local   2018/03/19,22:35:52.0000   10      1
Slink2EW_USLB        2018/03/19,22:35:35.0000   10     18
Mcast_USLB           2018/03/19,22:35:47.0000   60      6
WfTimeFilter         2018/03/19,22:35:45.0000   60      8
Pickew               2018/03/19,22:35:45.0000   60      8
Coda_AAV             2018/03/19,22:35:47.0000   60      6
Coda_DUR             2018/03/19,22:35:51.0000   60      2
Carlstatrig          2018/03/19,22:35:26.0000   60     27
CsDetectEw           2018/03/19,22:35:49.0000   10      4
PowerMon             2018/03/19,22:35:46.0000   10      7
Export_PkTrig_Menlo  2018/03/19,22:35:50.0000   60      3
Export_PkTrig_UCB    2018/03/19,22:35:46.0000   60      7
Export_Pick_CIT      2018/03/19,22:35:00.0000   60     53   (x)
Export_Pick_Golden   2018/03/19,22:35:37.0000   60     16   (x)
Export_Trace_MP      2018/03/19,22:35:49.0000   60      4   (x)
Export_Trace_ATWC    2018/03/19,22:35:42.0000   60     11   (x)
Export_Trace_PTWC    2018/03/19,22:35:44.0000   60      9   (x)
Export_Trace_UW      2018/03/19,22:35:44.0000   60      9   (x)
Export_Trace_NCVL    2018/03/19,22:35:33.0000   60     20   (x)
Export_Trace_UCSD_KF 2018/03/19,22:35:48.0000   60      5   (x)
Wave_serverV_1       2018/03/19,22:35:47.0000   60      6
Wave_serverV_2       2018/03/19,22:35:47.0000   60      6
Wave_serverV_3       2018/03/19,22:35:47.0000   60      6
Wave_serverV_4       2018/03/19,22:35:47.0000   60      6
Wave_serverV_5       2018/03/19,22:35:47.0000   60      6

In this output, the “(x)” text indicates that the item is flagged to turn off warning messages sent through monitor. If such a parameter's latency exceeded its limit, this symbol would change to “(ALARM)”. Likewise, if a parameter not flagged for silence exceeded its limit, the symbol would be “ALARM”, without the enclosing parentheses.

Automatic Restart for Earthworm Modules at UCB

Because UCB AQMS systems do not use Earthworm's startstop or statmgr programs, they are missing the ability to automatically restart Earthworm modules when they die. To get around this lacuna, we have a restart_mgr system to handle automatic restarts. /home/ncss/run/bin/restart_mgr.pl is a perl script that parses its configuration file, runs status, and restarts any configured programs that are in the ALARM or (ALARM) state. Since this system is run by cron, we have /home/ncss/run/bin/restart_mgr.csh, a C-shell script to set the appropriate environment variables needed by the perl script and by status.

ncss@ucbns1:./restart_mgr.pl -h
    restart_mgr.pl version 0.0.1
        restart_mgr.pl - check status and restart failed programs
Syntax:
        restart_mgr.pl  [-c config] [-w wait_time] [-h] [-v]
where:
        -c config       - specify an alternate configuration file
        The default config file is /home/ncss/run/params/restart_mgr.conf
        -w wait_time
                        - how long (seconds) to wait for heartbeat before
                        restarting a program; default is 300
        -v              - verbose output
        -h              - prints this help message.

The configuration file for restart_mgr on ucbns1 is shown here. Note that this is in perl syntax that defines a hash. The left column gives the names of items in status_mgr's configuration file. The right column gives the commands needed to restart the given item when called with the restart option:

%progMap = (
            Export_Trace_Quake => "run_export_trace_quake",
            Export_Trace_MP    => "run_export_trace_menlo",
            Export_Trace_ATWC  => "run_export_trace_atwc",
            Export_Trace_PTWC  => "run_export_trace_ptwc",
            Export_Trace_UW    => "run_export_trace_uw",
            Export_Trace_NCVL  => "run_export_trace_ncal_valve",
            Export_Trace_UCSD_KF => "run_export_trace_ucsd_kf",
            Export_Pick_Golden => "run_export_pick_golden",
            );

# Following required to make perl happy:
1;

Yes, this is a kludge! But it seems to work OK.