===== Monitor, an Application for Monitoring Systems =====

Monitor is a perl script (written by Doug Neuhauser in about 1994) for
generalized monitoring. Being a perl script, it is more or less agnostic to
the type of unix or linux on which it runs. However, there are a number of
configurable parameters (both in the script and in the configuration file)
that must be adjusted to make the program work with some "standard" utilities
provided by unix/linux. Don't try to copy a Solaris version of monitor to
Linux; it will most likely not work properly.

Minimal help is provided by running "monitor -h":
<code>
monitor - Monitor processes, computers, disks, and other progs.
Syntax:
monitor [-h] [-d n] [-l logfile] [config_file]
where:
	-b	Boot option.  Delay monitoring at boot time for
		BOOT_DELAY seconds specified in config file.
	-d n	Debug value n.  Value can be the OR of:
		1 = print scheduling info
		2 = print time info
		4 = print sorting info
		8 = print out commands
		16 = print alarm messages instead of sending them.
		64 = display monitor_list structure
		128 = print proc list when missing process.
	-h  Help - prints this help message.
	-l logfile
	        Log all alarms to the specified logfile.
	config_file
		Configuration file.  Default file is: monitor.config
</code>
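
Because the //-d// value is a bitwise OR, several kinds of debug output can be requested at once by adding the listed values. The following Python sketch is not part of monitor; it only illustrates how a combined value decomposes. The bit values come from the help text above, and the short descriptions are paraphrases.

```python
# Bit values copied from the "monitor -h" text; descriptions are paraphrased.
DEBUG_FLAGS = {
    1: "scheduling info",
    2: "time info",
    4: "sorting info",
    8: "commands",
    16: "print alarms instead of sending",
    64: "monitor_list structure",
    128: "proc list when process missing",
}

def describe_debug(value):
    """Return the descriptions of the flags set in a -d value."""
    return [name for bit, name in sorted(DEBUG_FLAGS.items()) if value & bit]

# 8 + 16 = 24: show commands and print alarms rather than sending them.
print(describe_debug(24))
```

A value such as //-d 24// would therefore be a reasonable choice when trying out a new configuration file without paging anyone.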

For most applications, monitor is started at system boot and runs
continuously; e.g.

  monitor -b -l /home/ncss/run/logs/monitor.log /home/ncss/run/params/monitor.config

In (almost?) all systems, monitor is started at boot time with the "-b"
flag. This allows a few minutes for applications to get started before monitor
starts checking things. This boot delay also comes into effect when monitor is
restarted with the //HUP// signal. This feature can be useful to avoid
notifications when restarting applications being checked by monitor.

==== Configuration File ====

The configuration file, monitor.config by default, is a text file consisting of 5
sections including some comments.

The first section provides basic configuration parameters for monitor's
operations. Each parameter is treated very much like a unix environment
variable, and is in fact added to monitor's process environment. This makes
these variables available to any programs that are run by monitor. Thus PATH
(and LD_LIBRARY_PATH, if necessary) can be set here.

Comments are prefixed with "#".

The "EMAIL" parameter includes the subject line under which email messages are
sent; e.g.:

  EMAIL=/bin/mail -s 'ALARM - monitor@ucbns2'

In this first section, arbitrary variables may be defined which will be used in
the following sections of monitor.config. For example, in ucbns2 we have:

   # Notification list - may be used in place of user:action
   #-----------------------------------------------------------------------
   NOTIFY_LIST_1=(peggy,lombard,cpaff):(email,pager1),(jennifer):email
   NOTIFY_LIST_2=(peggy,lombard,cpaff):(email,pager2)
   NOTIFY_NCSS_1=(peggy,lombard,cpaff):(email,pager1)
   NOTIFY_NCSS_2=(peggy,lombard,cpaff):(email,pager2)
   NOTIFY_DOUG=doug:(email,pager)
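
The notification lists above pair users with actions. As a rough illustration of that syntax, here is a Python sketch that splits such a value into (users, actions) groups. The grammar is inferred only from the examples shown (each side is a bare word or a parenthesized comma-separated list); monitor's own parser may accept more or less than this, and the function names are invented for the sketch.

```python
import re

# One "users:actions" group; each side is a bare word or a parenthesized,
# comma-separated list. This grammar is inferred from the examples only.
GROUP = re.compile(r"(\([^)]*\)|[^,:()]+):(\([^)]*\)|[^,:()]+)")

def _items(text):
    """Split "(a,b,c)" or "a" into a list of names."""
    return [item.strip() for item in text.strip("()").split(",") if item.strip()]

def parse_notify(value):
    """Return a list of (users, actions) pairs from a notify-list value."""
    return [(_items(users), _items(actions)) for users, actions in GROUP.findall(value)]

print(parse_notify("(peggy,lombard,cpaff):(email,pager1),(jennifer):email"))
```

Read this way, NOTIFY_LIST_1 sends email and pager1 messages to three users and email only to a fourth.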

The remaining four sections configure the four different ways that monitor
does its work. These are "prog", "alive", "proc" and "disk". Each of these
sections is optional. The only restriction is that at least one command must
be provided from any of these sections. Otherwise monitor won't have anything
to do, so it will simply exit.

The "prog" command tells monitor to run some program (any sort of
executable). The comments in monitor say:

   # Expectations of a program that is run under monitor:
   #       writes to STDOUT only when conditions are unsatisfactory;
   #           if conditions are normal, it should write nothing to STDOUT
   #       The STDOUT of the program will be sent as part of the page/email
   #       anything written to STDERR will go in monitor.error.log but
   #           nowhere else.
   #       Exit status: should be 0 under most conditions; 
   #           use non-zero exit status only for failure to execute part of
   #           part of the program. This non-zero exit status will REPLACE
   #           anything written to STDOUT in the page/email sent by monitor.

Note that program STDERR output is ignored by monitor but gets combined with
any STDERR output from monitor itself.
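
The contract for "prog" programs can be summarized in a few lines. The Python sketch below is an illustration, not monitor's code: it shows how a caller following the rules quoted above might turn a program's STDOUT and exit status into alarm text. The wording of the exit-status message and both function names are assumptions.

```python
import subprocess
import sys

def alarm_text(stdout, exit_status):
    """Return the text to send, or None when there is nothing to report."""
    if exit_status != 0:
        # Per the quoted comments, a non-zero status replaces any STDOUT text.
        # The exact wording monitor uses is not shown in the source.
        return f"program exited with status {exit_status}"
    text = stdout.strip()
    return text or None

def run_check(argv):
    """Run a check program; only STDOUT is captured, STDERR is left alone."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, text=True)
    return alarm_text(result.stdout, result.returncode)

# A check that prints nothing and exits 0 raises no alarm.
print(run_check([sys.executable, "-c", "pass"]))
```

The practical consequence for anyone writing a new check script: print only when something is wrong, and reserve non-zero exit codes for failures of the script itself.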

Here are some comments and sample "prog" lines from ucbns2:

   # Alarmflags are the following numeric parameters:
   #       run renotify notify_clear
   #   where:
   #       run -           Run program every N seconds.
   #                       Notify users when alarm is first raised.
   #       renotify -      Renotify users every N seconds
   #                       if alarm stays raised.
   #       notify_clear -  Boolean flag (0 or 1) whether users
   #                       should be notified when alarm is cleared.
   
   #    Program         Notify_list    Alarmflags  Prog_args
   #-----------------------------------------------------------
   prog status_mon      NOTIFY_LIST_2  60 3600 1  -a 600 -d 28800
   prog status_wda      NOTIFY_LIST_2  60 3600 1  -C 5 -c 900 -T 99999999
   prog status_wda      NOTIFY_LIST_2  60 3600 1  -C 999 -c 99999999 -T 43200
   prog check_page.pl   NOTIFY_LIST_1  60 3600 1  -t 300

Notice the last item here, running the check_page.pl script. This is testing
some conditions of the pager daemon on ucbns2. If it finds a problem, it must
be reported using pager1, the pager system on ucbns1. It cannot use the pager
system on ucbns2 to report that ucbns2's pager system is not working! This is
another instance in which monitor.config is specific to one machine, and cannot
be copied intact to another machine.
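
The three alarmflags (run, renotify, notify_clear) interact in a way that is easiest to see as a small state machine. The Python sketch below follows the descriptions in the comments quoted above; it is an illustration only, and the class name, method name and returned strings are invented here.

```python
class Alarm:
    """Track one configured check with its three alarm flags."""

    def __init__(self, run, renotify, notify_clear):
        self.run = run                    # seconds between checks (used by a scheduler)
        self.renotify = renotify          # seconds between repeat notices
        self.notify_clear = bool(notify_clear)
        self.raised = False
        self.last_notice = None

    def update(self, now, problem):
        """Record a check result at time `now`; return the notice to send."""
        if problem and not self.raised:
            self.raised, self.last_notice = True, now
            return "raised"
        if problem and now - self.last_notice >= self.renotify:
            self.last_notice = now
            return "renotify"
        if not problem and self.raised:
            self.raised = False
            return "cleared" if self.notify_clear else None
        return None

# Flags "60 3600 1": check every minute, repeat hourly, report when cleared.
alarm = Alarm(60, 3600, 1)
for t, bad in [(60, True), (120, True), (3660, True), (3720, False)]:
    print(t, alarm.update(t, bad))
```

With flags "60 3600 1", a persistent problem produces one notice when first seen, one more each hour, and a final notice when the condition clears.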

The "alive" commands are used to check the network connectivity of other hosts
using the "ping" utility. Here are a few sample entries:

   #       Computer                     Notify         Alarmflags    ping_count
   #-------------------------------------------------------------------
   alive   ucbns1.seismo.berkeley.edu   NOTIFY_NCSS_2  60 3600 1       5
   alive   rumble.seismo.berkeley.edu   NOTIFY_NCSS_2  60 3600 1       5
   alive   benito.seismo.berkeley.edu   NOTIFY_NCSS_2  60 3600 1       5
   alive   ucbrt.seismo.berkeley.edu    NOTIFY_NCSS_2  60 3600 1       5
 
The "proc" command is used to check for the presence of a given process, by
user, name, and possibly some arguments. The "Program" part of the proc
command uses perl regular expressions to compare against what is reported by
"ps -ef". This is a bit tricky to get right! Here are some samples:

   #       User,Program            Notify                  Alarmflags
   #-----------------------------------------------------------------
   proc    ncss:.*adadup_ucbns22ucbrt.*  NOTIFY_LIST_2     120 3600 1
   proc    ncss:.*adadup_ucbns22ncss3.*  NOTIFY_LIST_2     120 3600 1
   proc    ncss:.*crossoverSA/CA_base.* lombard:email      120 3600 1
   proc    ncss:.*SocketAgent/CA_base.* lombard:email      120 3600 1
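
To experiment with a "proc" pattern before putting it in the configuration file, something like the following Python sketch can help. It is only an approximation of what monitor does: it assumes the user is the first column of "ps -ef", that the command is the eighth column onward (which fails if the start-time column contains a space), and that the pattern is searched for within the command. The sample process lines are made up.

```python
import re

def process_present(spec, ps_lines):
    """Return True if any `ps -ef` line matches a "user:regex" spec."""
    user, pattern = spec.split(":", 1)
    regex = re.compile(pattern)
    for line in ps_lines:
        fields = line.split(None, 7)      # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8 or fields[0] != user:
            continue
        if regex.search(fields[7]):
            return True
    return False

# Made-up process listing, for illustration only.
SAMPLE_PS = [
    "UID        PID  PPID  C STIME TTY          TIME CMD",
    "ncss      2211     1  0 Mar19 ?        00:01:12 adadup_master adadup_ucbns22ucbrt.cfg",
    "root      2300     1  0 Mar19 ?        00:00:03 /usr/sbin/sshd",
]

print(process_present("ncss:.*adadup_ucbns22ucbrt.*", SAMPLE_PS))
```

One pitfall worth testing for: a loose pattern also matches any other command that merely mentions the name in its arguments, such as an editor opened on the program's configuration file.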


The "disk" command is used to have monitor check for sufficient free space on
a given file-system. For example:

   #       Disk            Notify         Alarmflags        Minfree|full%
   #----------------------------------------------------------------------
   disk    /home/aq12      lombard        1800 21600 0    90%
   disk    /home/aq12      NOTIFY_NCSS_2  1800  7200 0    95%
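
The two sample lines above escalate: one user is told at 90% full, the wider list at 95%. A minimal Python sketch of the percent-full test follows. It handles only the "NN%" form of the last column, treats reaching the limit as an alarm (an assumption), and computes usage from total and free bytes, which may differ slightly from the figure reported by "df". The function names are invented.

```python
import shutil

def percent_full(total, free):
    """Percentage of the file system in use, from byte counts."""
    return 100.0 * (total - free) / total

def over_limit(total, free, limit):
    """True when usage reaches a "NN%" limit such as "90%"."""
    return percent_full(total, free) >= float(limit.rstrip("%"))

def check_disk(path, limit):
    """Apply the limit to a mounted file system."""
    usage = shutil.disk_usage(path)       # named tuple: total, used, free
    return over_limit(usage.total, usage.free, limit)

# A file system with 8% free trips the 90% limit but not the 95% one.
print(over_limit(1000, 80, "90%"), over_limit(1000, 80, "95%"))
```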


For BSL systems, monitor is run for at least the following user:host
combinations:

  ncss:ucbns1  BSL acquisition and AQMS net services
  ncss:ucbns2  ""
  ncss:ucbrt   BSL AQMS RT system
  ncss:ucbpp   BSL AQMS PP system
  dcmgr:ucbpp  BSL event waveform archiving
  ncss:rumble  ShakeMap
  redi:shaker  Finite Fault
  ncss:quake7  PSD system
  ncss:benito  BSL AQMS DRP system
  ncss:sutter  dac480 support
  dcmgr:hugo   DART

BSL Test AQMS systems:

  ncss:mono    Solaris AQMS
  ncss:seiche  Linux AQMS

For Menlo Park systems, monitor is run for the following user:host
combinations:

  ncss:mnlons1  MP AQMS net services
  ncss:mnlons2  MP AQMS net services
  ncss:mnlort1  MP AQMS RT system
  ncss:mnlodb1  MP AQMS PP system
  dcmgr:mnlodb1 MP event waveform archiving
  ncss:mnlodd1  MP real-time double-difference system

==== Programs Used By Monitor ====

The following programs are used by monitor in the //prog// section of the
configuration file on various systems. Each of these programs can be used by
hand when needed. Most but not all of them will report how to use them when
given the "-h" command-line option.

=== action_error ===

On AQMS systems with alarm systems (alarmdec, alarmact, alarmdist), the script
//action_error// is used to find any alarm actions that are in the //ERROR//
state in the //Alarm_Action// database table.

<code>
ncss@ucbrt.geo.berkeley.edu:action_error -h
    action_error version 0.0.2
        action_error - report alarm_actions in ERROR state
Syntax:
        action_error  [-c config] [-E evid] [-U evid action]
where:
        -E evid         - query for the action commandline for
        any alarm actions in ERROR state for event <evid>.
        -U evid action  - update any of event <evid>'s actions
        of name <action> from ERROR state to ERROR-ACK state.
        -c config       - specify an alternate configuration file.
        The default config file is /home/ncss/run/params/db.conf
        -h      Help    - prints this help message.

        When neither option -E or -U is given, action_error prints any event
        IDs and their actions which are in the ERROR state. This mode is
        suitable for use by monitor.

NOTE:
</code>

Once action_error (through monitor) has reported an alarm action in the
//ERROR// state, the user should investigate the problem. Most alarm actions
log their results and errors in files in the //run/alarms/logs// directory.

//action_error// can be used with the //-E evid// option to learn the
command-line appropriate for manually running the alarm action script.

Once the error condition has been resolved, //action_error// should be run
with the //-U evid action// to change the alarm action state from //ERROR// to
//ERROR-ACK// in the alarm_action table. This will silence the complaints from
monitor when it runs //action_error// again.

Note that on the post-processing systems, the alarm_action table is replicated
among all the archive databases. That means that only one instance of monitor
should be configured to run //action_error// at a time; otherwise you may get
multiple pager messages about a single error condition. By convention, monitor
runs //action_error// only on the //active// post-processing system.

=== check_page.pl ===

On the BSL systems (ucbns1, ucbns2) which actually submit pager and SMS
messages, we use //check_page.pl// to look for stale pager files. If these
were present, it would indicate that the pager daemon was not able to deliver
messages in a timely manner.

<code>

ncss@ucbns1:./check_page.pl -h
    check_page.pl version 0.0.1
        check_page.pl - check for stale pager files
Syntax:
        check_page.pl  [-h] [-t maxAge]
where:
        -t maxAge       - Maximum age in seconds for pending pager files
                        If older files are found, squawk about them!
                        Default max age is 300 seconds.
        -h              - prints this help message.
</code>

If check_page.pl is reporting a problem with the pager system on the local
system, e.g. ucbns1, it is important to configure monitor to send reports of
the problem to a different system, e.g. ucbns2.
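
The stale-file test itself is simple. The Python sketch below shows the idea under the assumption that "age" means time since last modification; it is not the code of check_page.pl, and the pager spool directory is deliberately left as a parameter because its location is not given here.

```python
import os
import time

def stale_files(directory, max_age=300, now=None):
    """Return names of regular files older than max_age seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        if entry.is_file() and now - entry.stat().st_mtime > max_age:
            stale.append(entry.name)
    return sorted(stale)

def report(directory, max_age=300):
    """Print one line per stale file and nothing when all is well."""
    for name in stale_files(directory, max_age):
        print(f"stale pager file: {name}")
```

Printing nothing in the normal case is what makes a check of this kind usable under monitor's "prog" command.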

=== checkampexchange ===

To monitor the [[postproc:ampexc|exchange of ground-motion amplitude packets]]
on the post-processing systems, monitor uses the //checkampexchange// script:

<code>
ncss@ucbpp:checkampexchange -h
    checkampexchange version 0.0.3 - 2015/01/29 NCSS
        checkampexchange - report problems with amp import/export
Syntax:
        checkampexchange [-a] [-b] [-e] [-j] [-q] [-T maxAge] [-v]
        checkampexchange -h
where:
        -a      check for unexpected files in get_amps/new directory
        -b      check for stale heartbeats
        -e      check for files in import error directory
        -j      check for jammed-up outgoing files
        -q      check for SQL errors or exceptions in Gmp2Db log
        -T maxAge  set max heartbeat or file age in minutes; default is 10
        -v      set verbose output; not suitable for use under monitor
        -h      Help    - prints this help message.

        When none of -a, -b, -e, -j, -q are specified, they ALL are implied.

NOTE:
</code>

We use only the //-T 30// option in monitor.config; i.e., perform all of the
checks defined for -a, -b, -e, -j, and -q with a 30 minute delay allowed in
the heartbeat messages sent from our sister agencies CGS and Caltech.

See the code (run/bin/checkampexchange) for details of the checks.

=== checkautoposter ===

On the post-processing systems, we use an Oracle Database //job// to insert
new subnet trigger events into the [[postproc:pcs]] system. To monitor the
status of the autoposter job, we use the script //checkautoposter//:

<code>
ncss@ucbpp:checkautoposter -h

check status of autoposter jobs

  usage: /home/ncss/ncpp/bin/checkautoposter -[h] [-d dbase]

  -h        : print usage info
  -d dbase  : use this particular dbase (defaults to "MasterDB")
  -v        : verbose: report good status; normally report only bad status

example: /home/ncss/ncpp/bin/checkautoposter -d dcmp2
</code>

This script queries the database to see if the autoposter job is running
and if it has reported any errors. If errors are found, they should be
reported to the Oracle DBA for help in resolving them.

=== comparelocks ===

The [[postproc:jiggle]] application is used for human-controlled event
location and magnitude evaluation. In order to ensure that different jiggle
users are not working on the same event, jiggle uses the database table
//JasiEventLock// to "lock" other users out of the event that is being
worked. Database replication is not adequate for this locking mechanism on
different databases, so the //JasiEventLock// table is not replicated between
the archive DBs.

Instead we use the script //comparelocks// to ensure that all jiggle users are
using the same database:

<code>
ncss@ucbpp:comparelocks -h

    Compare current event locks.

    usage: /home/ncss/ncpp/bin/comparelocks [-v] [-m]

     -h        : print usage info
     -m        : truncated one-line output for use with monitor (NCSS)
     -v        : verbose: print all locks;
                 Normally all locks printed only if multiple DB's in use.
</code>

This script queries each of the listed databases (hard-coded in the script) to
see if any events are locked on that database. If it finds events locked on
more than one database, it causes monitor to send a message.

The corrective action is to tell the jiggle users to get on the same database.

=== dbping ===

dbping tests its configured database for basic functionality. This perl script
connects to the database and does one query. If that succeeds, dbping is
happy; otherwise it reports an error.

<code>
ncss@rodgers:dbping -h
    dbping version $Id: dbping.pl,v 1.6 2004/11/20 16:43:26 redi Exp $
        dbping - ping a database
Syntax:
        dbping  [-c config] [-n repeats] [-t timeout] [-h] [-d debug level]
where:
        -c config       - specify an alternate configuration file
        The default config file is /home/redi/run/params/db.conf
        -n repeats      - number of times to try for database response
        -t timeout      - time to wait for response from database
        -h              - prints this help message.
        -d debug_level  - prints programmer debug code
</code>

Unexpected errors reported by dbping should be forwarded to the Oracle DBA for
corrective action.

=== dircheck ===

//dircheck// is a simple shell script to monitor directories for unexpected
files. We use it on directories that have files written to them and then
removed by various polling programs. For example, the input directory for the
Earthworm //sendfileII// program normally is empty. Other programs write files
in this directory; sendfileII deletes the files as soon as they have been
sent. If there are files sitting in the //sendfileII// input directory, it is
because they cannot be sent. //dircheck// will report this problem through
//monitor//.

<code>
ncss@ucbpp:dircheck -h
usage: /home/ncss/run/bin/dircheck max-files directory-path
</code>

When //dircheck// reports errors, the AQMS operator will have to track down
the processes involved and determine the appropriate corrective action.

=== pcsWatchdog ===

The //pcsWatchdog// program checks for abnormal entries in the
[[postproc:pcs|pcs state]] table. We run this program on whichever of the
post-processing systems is configured as "active".

<code>
ncss@ucbpp:pcsWatchdog -h

    Check for PCS backlog

    usage: /home/ncss/ncpp/bin/pcsWatchdog [-d] <config_file> [<dbase>]

     -h        : print usage info
     -d        : turn on diagnostic output
     config_file : full path to the config file
     dbase     : use this particular dbase (defaults to "dcucb")

    example: /home/ncss/ncpp/bin/pcsWatchdog /home/ncss/ncpp/conf/checkStates.cfg dcucb
</code>

This script uses a configuration file that lists the various states to check
for. Note that the Group, Table and State entries in this file are used
directly in a database query, so SQL wild-cards can be used. This file is
/home/ncss/ncpp/conf/checkStates.cfg: 

<code>
# Group Table State Age(secs) Count
EventStream % NewEvent 0 0
EventStream % NewTrigger 0 0
EventStream % MakeDRPGif 60 2
EventStream % MakeTrigGif 60 2
EventStream % ExportAmps 60 10
EventStream % FPfit 120 0
EventStream % ExportWF 760 1
EventStream % ExportArc 120 0
EventStream % ddrtFeed 120 0
EventStream % SwarmAlarm 120 0
EventStream % AssocTrig 0 0
EventStream % TrigCheck 360 10
TPP  TPP DELETED  60 0
TPP  TPP FINALIZE 60 0
TPP  TPP ALARM 60 0
TPP  TPP ddrtFeed 120 1
TPP  TPP MakeDRPGif 60 2
TPP  TPP FPfit 120 0
TPP  TPP ExportArc 60 1
TPP  TPP CANCELALARM 60 0
TPP  TPP DeleteArc 60 0
TPP  TPP REPOP 300 0
# >100 rows of any states older than 10min
% % % 600 100
</code>

Errors reported by //pcsWatchdog// are usually due to problems in the
[[postproc:pcs]] client programs responsible for handling that state.
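
To make the meaning of the Age and Count columns more concrete, here is a Python sketch that applies rules of this form to rows held in memory. It assumes a rule trips when more than Count matching rows are older than Age seconds, which is a reading of the "# >100 rows ..." comment rather than something confirmed from the script. The "%" and "_" wild-cards are emulated with a regular expression; the real script passes them to the database.

```python
import re

def parse_rules(text):
    """Parse "Group Table State Age Count" lines, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        group, table, state, age, count = line.split()
        rules.append((group, table, state, int(age), int(count)))
    return rules

def like(pattern, value):
    """Emulate SQL LIKE: % matches any run of characters, _ exactly one."""
    parts = [".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern]
    return re.fullmatch("".join(parts), value, re.DOTALL) is not None

def backlog(rules, rows):
    """rows: (group, table, state, age_secs). Return the rules that trip."""
    tripped = []
    for group, table, state, age, count in rules:
        n = sum(1 for g, t, s, a in rows
                if like(group, g) and like(table, t) and like(state, s) and a > age)
        if n > count:
            tripped.append((group, table, state, n))
    return tripped

rules = parse_rules("TPP  TPP REPOP 300 0\n% % % 600 100\n")
print(backlog(rules, [("TPP", "TPP", "REPOP", 400)]))
```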


=== pdlSendCheck ===

//pdlSendCheck// is a script for monitoring the ProductClient poll
directory. Since we no longer use ProductClient in polling mode, this script
is no longer needed.

=== status_ada, status_wda ===

//status_ada// and //status_wda// are scripts for monitoring data latency in
the AQMS //ADA// and //WDA// shared memory regions (also called GCDA, generic
channel data area), respectively. These two scripts work by using the output
from AQMS programs //adastat// and //wdastat//.

The two scripts are configured with files listing the SNCLs to be monitored; a
flag value following each SNCL indicates whether that SNCL should be reported
(flag = 1) or temporarily ignored (flag = 0). Normally one SNCL is listed for
each station whose data should be in the GCDA, since most data latency problems
affect all SNCLs of a station in the same way. The exception is stations with
multiple dataloggers: in those cases, one SNCL from each datalogger may be
included in the configuration file.

Each of these scripts offers the same command-line options. Here's
//status_wda//: 
<code>
ncss@ucbns2:status_wda -h
status_wda version 0.8 (2010.218)
status_wda - Monitor WDA region.
Syntax:
status_wda   [-T N] [-C N] [-c K] [-d n] [-h]
where:
        -f file Name of config file. Default is:
                        /home/ncss/run/params/status_wda.config
        -T N    Total allowable delay summed over all stations.
                Default is 240 * number of channels;
        -C list Comma-delimited list of cluster sizes.
        -c list Comma-delimited list of allowable delay for each
                station for the corresponding cluster size.
        -d n    Debug option.
                1 provides basic debugging info.
                2 provides sorted delay info.
Examples:
status_wda -T 1800
        Set max total delay.
status_wda -C 1,2,3 -c 1800,900,360
        Set delay limits of
                1 station at 1800 seconds each
                2 stations at 900 seconds each
                3 stations at 360 seconds each
        Max total delay is unchanged from default.
</code>

These options provide two basic ways of monitoring data latency. To monitor
the total data latency (sum of latencies for all configured SNCLs), set the -T
value to some low number, while setting the -c and -C values to large
numbers. On the network service systems, we typically use a -T value of 43200
(12 hours), with -C 999 (more than the configured number of SNCLs). This will
generate warning messages if one SNCL is out for about 12 hours, or two SNCLs
out for about 6 hours, etc.

To monitor latency for groups of SNCLs instead of total latency, we set small
values for -c and -C, with a very large value for -T. On the UCB net service
systems, we use -C 3 (a cluster of 3 SNCLs) and -c 600 (ten minutes). The
result is that if any three SNCLs each have latency of more than 600 seconds,
we will get a warning message.
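
The two styles of check can be written down compactly. The Python sketch below is an illustration of the logic as described in this section, not the scripts' actual code; the function names and the SNCL labels in the example are made up.

```python
def total_exceeded(latencies, max_total):
    """-T check: sum of all latencies against the total limit."""
    return sum(latencies.values()) > max_total

def clusters_exceeded(latencies, sizes, delays):
    """-C/-c check: at least sizes[i] SNCLs each later than delays[i] seconds."""
    hits = []
    for size, delay in zip(sizes, delays):
        late = sorted(name for name, lat in latencies.items() if lat > delay)
        if len(late) >= size:
            hits.append((size, delay, late))
    return hits

# Hypothetical latencies in seconds, keyed by made-up SNCL labels.
latencies = {"BK.A": 700, "BK.B": 650, "BK.C": 610, "BK.D": 5}
print(total_exceeded(latencies, 43200))           # total is far below 12 hours
print(clusters_exceeded(latencies, [3], [600]))   # three SNCLs over 600 s
```

In this example the total-latency check stays quiet while the cluster check fires, which is why the sample configuration runs the same script twice with different options.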

//status_wda// and //status_ada// are often used manually. In that case, the
most useful option is //-d2//, which gives a list of all configured SNCLs
(including those configured with flag "0"), sorted in descending order of latency.

=== status_mon ===

The AQMS systems at UC Berkeley do not use the standard Earthworm programs
//startstop//, //statmgr// or //status//. Instead they use locally written
programs //status_mgr// and //status_mon// for checking the state of health of
Earthworm programs running on these systems.

//status_mgr// creates a small shared-memory region containing entries for each
of the Earthworm message types it is configured to monitor. Like the Earthworm
//statmgr//, status_mgr normally is configured to watch heartbeat messages
from configured modules. But //status_mgr// can also monitor other message
types, which it considers //data//. In the shared-memory region, status_mgr
keeps track of the last time it received each of the messages it is configured
to monitor, as well as a few other useful parameters.

<code>
ncss@ucbns1:status_mgr -h
status_mgr   [-F | -K] [-h] [-d N] config_file
    where:
        -h          Help - prints syntax message.
        -r ringname Name of ring to monitor.  Default is ?
        -F | -K     Flush or Keep old contents of redi ring (default=Keep).
        -d N        Debug option N (currently unused).
        config_file Name of configuration file (default = status_mgr.config)
</code>

Note that //status_mgr// is hard-coded to read Earthworm messages from
//REDI_RING//. This seemed reasonable when it was first written. It may be
time to make the small code changes necessary to make the input ring name
configurable.

//status_mgr// runs continuously, started at the same time as other
Earthworm programs. It should be restarted whenever changes are made to its
configuration file:

   run_status_mgr restart

Here's a sample configuration file, taken from ucbrt. It demonstrates the use
of both //HEARTBEAT// and //DATA// configuration types.

<code>
#
# Status region: Hard-coded to read from REDI_RING
#
SHMEM=3030
#
# Type=Program_Name,Module,Net,Expect,Flag
#
HEARTBEAT=ImportPkTrigMenlo1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_1,INST_UCB,60,1
HEARTBEAT=PkTrigServerMenlo1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_1,INST_MENLO,60,1
HEARTBEAT=ImportPkTrigMenlo2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_2,INST_UCB,60,1
HEARTBEAT=PkTrigServerMenlo2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_2,INST_MENLO,60,1
HEARTBEAT=ImportPkTrigUcbns1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_UCBNS1,INST_UCB,60,1
HEARTBEAT=PkTrigServerUcbns1,TYPE_HEARTBEAT,MOD_EXPORT_PKTRIG_UCBNS1_BK,INST_UCB,60,1
HEARTBEAT=ImportPkTrigUcbns2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_UCBNS2,INST_UCB,60,1
HEARTBEAT=PkTrigServerUcbns2,TYPE_HEARTBEAT,MOD_EXPORT_PKTRIG_UCBNS2_BK,INST_UCB,60,1
HEARTBEAT=File2EW_Trig,TYPE_HEARTBEAT,MOD_FILE2EW_TRIG,INST_UCB,60,1
#
HEARTBEAT=Pkfilter,TYPE_HEARTBEAT,MOD_PKFILTER,INST_UCB,60,1
HEARTBEAT=Binder_ew,TYPE_HEARTBEAT,MOD_BINDER_EW,INST_UCB,60,1
HEARTBEAT=Eqassemble,TYPE_HEARTBEAT,MOD_EQASSEMBLE,INST_UCB,60,1
DATA=EvtAssemble,TYPE_HYP2000ARC,MOD_EQASSEMBLE,INST_UCB,0,1
HEARTBEAT=Hyps2ps,TYPE_HEARTBEAT,MOD_HYPS2PS,INST_UCB,60,1
#
HEARTBEAT=StatrigFilter,TYPE_HEARTBEAT,MOD_STATRIGFILTER,INST_UCB,60,1
HEARTBEAT=Carlsubtrig,TYPE_HEARTBEAT,MOD_CARLSUBTRIG,INST_UCB,60,1
DATA=SubnetTrigger,TYPE_TRIGLIST_SCNL,MOD_CARLSUBTRIG,INST_UCB,0,1
HEARTBEAT=Trig2ps,TYPE_HEARTBEAT,MOD_TRIG2PS,INST_UCB,60,1
#
</code>

The configuration file allows for naming the triplet of Earthworm message
type, module ID and installation ID. This name is intended to be (hopefully)
more comprehensible to human readers than the Earthworm syntax.
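
For scripting around this file (for example, to list which names are configured), the entries are easy to parse. The Python sketch below assumes one entry per line with six comma-separated fields after the "=", in the order name, message type, module ID, installation ID, expected interval and flag. The field names are chosen here for readability and do not come from status_mgr.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "kind name msg_type module inst expect flag")

def parse_status_config(text):
    """Return (shmem_key, entries) from status_mgr-style config text."""
    shmem, entries = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        if key == "SHMEM":
            shmem = int(value)
        elif key in ("HEARTBEAT", "DATA"):
            name, msg_type, module, inst, expect, flag = value.split(",")
            entries.append(Entry(key, name, msg_type, module, inst, int(expect), int(flag)))
    return shmem, entries

sample = """SHMEM=3030
HEARTBEAT=Pkfilter,TYPE_HEARTBEAT,MOD_PKFILTER,INST_UCB,60,1
DATA=EvtAssemble,TYPE_HYP2000ARC,MOD_EQASSEMBLE,INST_UCB,0,1
"""
key, entries = parse_status_config(sample)
print(key, [e.name for e in entries])
```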

//status_mgr// ignores Earthworm //TYPE_ERROR// messages. These messages are
completely ignored in UCB Earthworm systems.

The companion program //status_mon// is normally used by //monitor// to report
abnormal Earthworm conditions. It reads from the same shared-memory region
controlled by //status_mgr// and determines the latency of each parameter
being monitored. If the latency exceeds a limit, it will be reported. The
latency limits are the same as configured in status_mgr's configuration file,
unless modified by status_mon's command line. If no limits are exceeded, then
status_mon is silent.

<code>
ncss@ucbns2:status_mon -h
status_mon   [-a n] [-d N] [-h] [-I list | -O list]
    where:
        -h             Help - prints syntax message.
        -m shmkey      Specify shared memory key.
        -a N           Set alive (heartbeat) threshold to N seconds.
        -d N           Set data threshold to N seconds.
        -I ignore-list Ignore items in comma-separted list.
        -O only-list   Only list items in comma-separted list.
    -I and -O cannot both be used at the same time.
</code>


The UCB command //status// is simply a symbolic link to //status_mon//. This
provides a complete list of the configured names, last update times, expected
update interval and current latency. This command is intended for manual use.
For example:

<code>
ncss@ucbns2: status
Name                   Last Update (UTC)    Expect  Delta
---------------------------------------------------------
Import_Trace_Local   2018/03/19,22:35:52.0000   10      1
Slink2EW_USLB        2018/03/19,22:35:35.0000   10     18
Mcast_USLB           2018/03/19,22:35:47.0000   60      6
WfTimeFilter         2018/03/19,22:35:45.0000   60      8
Pickew               2018/03/19,22:35:45.0000   60      8
Coda_AAV             2018/03/19,22:35:47.0000   60      6
Coda_DUR             2018/03/19,22:35:51.0000   60      2
Carlstatrig          2018/03/19,22:35:26.0000   60     27
CsDetectEw           2018/03/19,22:35:49.0000   10      4
PowerMon             2018/03/19,22:35:46.0000   10      7
Export_PkTrig_Menlo  2018/03/19,22:35:50.0000   60      3
Export_PkTrig_UCB    2018/03/19,22:35:46.0000   60      7
Export_Pick_CIT      2018/03/19,22:35:00.0000   60     53   (x)
Export_Pick_Golden   2018/03/19,22:35:37.0000   60     16   (x)
Export_Trace_MP      2018/03/19,22:35:49.0000   60      4   (x)
Export_Trace_ATWC    2018/03/19,22:35:42.0000   60     11   (x)
Export_Trace_PTWC    2018/03/19,22:35:44.0000   60      9   (x)
Export_Trace_UW      2018/03/19,22:35:44.0000   60      9   (x)
Export_Trace_NCVL    2018/03/19,22:35:33.0000   60     20   (x)
Export_Trace_UCSD_KF 2018/03/19,22:35:48.0000   60      5   (x)
Wave_serverV_1       2018/03/19,22:35:47.0000   60      6
Wave_serverV_2       2018/03/19,22:35:47.0000   60      6
Wave_serverV_3       2018/03/19,22:35:47.0000   60      6
Wave_serverV_4       2018/03/19,22:35:47.0000   60      6
Wave_serverV_5       2018/03/19,22:35:47.0000   60      6
</code>

In this output, the "(x)" text indicates that this item is flagged to turn off
warning messages sent through //monitor//. If this parameter's latency
exceeded the limit, this symbol would change to "(ALARM)". Likewise, if a
parameter were not flagged for silence, the symbol would be "ALARM", without
the enclosing parentheses.
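
Since the //status// listing is plain text, other scripts can read it. The following Python sketch parses lines in the layout shown above and picks out the names in an alarm state. It assumes the alarm symbol appears in the same trailing position as "(x)"; the line with "ALARM" in the example is invented to show that case.

```python
import re

LINE = re.compile(
    r"^(?P<name>\S+)\s+(?P<stamp>\d{4}/\d\d/\d\d,[\d:.]+)\s+"
    r"(?P<expect>\d+)\s+(?P<delta>\d+)\s*(?P<mark>\(x\)|\(ALARM\)|ALARM)?\s*$"
)

def parse_status(text):
    """Map each name to (expect, delta, mark); mark is None when absent."""
    table = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            table[m["name"]] = (int(m["expect"]), int(m["delta"]), m["mark"])
    return table

def in_alarm(table):
    """Names whose mark is ALARM or (ALARM)."""
    return sorted(n for n, (_, _, mark) in table.items() if mark in ("ALARM", "(ALARM)"))

listing = """Name                   Last Update (UTC)    Expect  Delta
Pickew               2018/03/19,22:35:45.0000   60      8
Export_Pick_CIT      2018/03/19,22:35:00.0000   60     53   (x)
Carlstatrig          2018/03/19,22:25:26.0000   60    627   ALARM
"""
print(in_alarm(parse_status(listing)))
```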

== Automatic Restart for Earthworm Modules at UCB ==

Because UCB AQMS systems do not use Earthworm's //startstop// or //statmgr//
programs, they are missing the ability to automatically restart Earthworm
modules when they die. To fill this gap, we have a //restart_mgr//
system to handle automatic restarts. ///home/ncss/run/bin/restart_mgr.pl// is
a perl script that parses its configuration file, runs //status//, and restarts
any configured programs that are in the //ALARM// or //(ALARM)// state. Since
this system is run by cron, we have ///home/ncss/run/bin/restart_mgr.csh//, a
C-shell script to set the appropriate environment variables needed by the perl
script and by //status//.

<code>

ncss@ucbns1:./restart_mgr.pl -h
    restart_mgr.pl version 0.0.1
        restart_mgr.pl - check status and restart failed programs
Syntax:
        restart_mgr.pl  [-c config] [-h] [-v]
where:
        -c config       - specify an alternate configuration file
        The default config file is /home/ncss/run/params/restart_mgr.conf
        -w wait_time
                        - how long (seconds) to wait for heartbeat before
                        restarting a program; default is 300
        -v              - verbose output
        -h              - prints this help message.
</code>

The configuration file for restart_mgr on ucbns1 is shown here. Note that this
is in perl syntax that defines a hash. The left column gives the names of
items in status_mgr's configuration file. The right column gives the commands
needed to restart the given item when called with the //restart// option:

<code>
%progMap = (
            Export_Trace_Quake => "run_export_trace_quake",
            Export_Trace_MP    => "run_export_trace_menlo",
            Export_Trace_ATWC  => "run_export_trace_atwc",
            Export_Trace_PTWC  => "run_export_trace_ptwc",
            Export_Trace_UW    => "run_export_trace_uw",
            Export_Trace_NCVL  => "run_export_trace_ncal_valve",
            Export_Trace_UCSD_KF => "run_export_trace_ucsd_kf",
            Export_Pick_Golden => "run_export_pick_golden",
            );

# Following required to make perl happy:
1;
</code>

Yes, this is a kludge! But it seems to work OK.
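
The decision restart_mgr makes can be sketched in a few lines of Python. This is an illustration of the logic described above, not a translation of restart_mgr.pl: it assumes the wait time is compared against the latency reported by //status//, and it only builds the commands rather than running them. The function name and the input structure are invented.

```python
# Two entries borrowed from the configuration shown above.
PROG_MAP = {
    "Export_Trace_MP": "run_export_trace_menlo",
    "Export_Pick_Golden": "run_export_pick_golden",
}

def restart_commands(status, prog_map, wait_time=300):
    """status: name -> (delta_secs, mark). Return the commands to run."""
    commands = []
    for name, (delta, mark) in sorted(status.items()):
        if name in prog_map and mark in ("ALARM", "(ALARM)") and delta >= wait_time:
            commands.append([prog_map[name], "restart"])
    return commands

status = {
    "Export_Trace_MP": (900, "(ALARM)"),    # configured, in alarm, waited long enough
    "Export_Pick_Golden": (16, "(x)"),      # healthy
    "Pickew": (700, "ALARM"),               # in alarm but not in the map
}
print(restart_commands(status, PROG_MAP))
```

Items that are not listed in the map are left alone, so anything that should never be restarted automatically is simply omitted from the configuration.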