===== Monitor, an Application for Monitoring Systems =====
Monitor is a perl script (written by Doug Neuhauser in about 1994) for
generalized monitoring. Being a perl script, it is more or less agnostic to
the type of unix or linux on which it runs. However, there are a number of
configurable parameters (both in the script and in the configuration file)
that must be adjusted to make the program work with some "standard" utilities
provided by unix/linux. Don't try to copy a Solaris version of monitor to
Linux; it will most likely not work properly.
Minimal help is provided by running "monitor -h":
monitor - Monitor processes, computers, disks, and other progs.
Syntax:
monitor [-h] [-d n] [-l logfile] [config_file]
where:
-b Boot option. Delay monitoring at boot time for
BOOT_DELAY seconds specified in config file.
-d n Debug value n. Value can be the OR of:
1 = print scheduling info
2 = print time info
4 = print sorting info
8 = print out commands
16 = print alarm messages instead of sending them.
64 = display monitor_list structure
128 = print proc list when missing process.
-h Help - prints this help message.
-l logfile
Log all alarms to the specified logfile.
config_file
Configuration file. Default file is: monitor.config
For most applications, monitor is started at system boot and runs
continuously; e.g.
monitor -b -l /home/ncss/run/logs/monitor.log /home/ncss/run/params/monitor.config
In (almost?) all systems, monitor is started at boot time with the "-b"
flag. This allows a few minutes for applications to get started before monitor
starts checking things. This boot delay also comes into effect when monitor is
restarted with the //HUP// signal. This feature can be useful to avoid
notifications when restarting applications being checked by monitor.
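The -d value in the help text above is a bitwise OR of the listed flags, so
several kinds of debug output can be requested at once. As a rough
illustration, the Python sketch below combines and decodes such values. The
flag values are copied from the help output; the dictionary and function names
are invented for this page and are not part of monitor.

```python
# Flag values copied from the "monitor -h" output; names are illustrative only.
DEBUG_FLAGS = {
    1: "print scheduling info",
    2: "print time info",
    4: "print sorting info",
    8: "print out commands",
    16: "print alarm messages instead of sending them",
    64: "display monitor_list structure",
    128: "print proc list when missing process",
}


def combine(*flags):
    """OR individual flag values into a single -d argument."""
    value = 0
    for flag in flags:
        if flag not in DEBUG_FLAGS:
            raise ValueError(f"unknown debug flag: {flag}")
        value |= flag
    return value


def decode(value):
    """List the descriptions of the flags set in a -d value."""
    return [text for bit, text in sorted(DEBUG_FLAGS.items()) if value & bit]
```

For example, combining 8 and 16 gives 24, so "monitor -d 24" would print the
commands being run and print alarm messages rather than send them.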
==== Configuration File ====
The configuration file, monitor.config by default, is a text file consisting of
five sections, with optional comments.
The first section provides basic configuration parameters for the monitor's
operations. Each parameter is treated very much like a unix environment
variable, and is actually added to monitor's process environment. This makes
these variables available to any programs that are run by monitor. Thus PATH
(and LD_LIBRARY_PATH, if necessary) can be set here.
Comments are prefixed with "#".
The "EMAIL" parameter includes the subject line under which email messages are
sent; e.g.:
EMAIL=/bin/mail -s 'ALARM - monitor@ucbns2'
In this first section, arbitrary variables may be defined which will be used in
the following sections of monitor.config. For example, on ucbns2 we have:
# Notification list - may be used in place of user:action
#-----------------------------------------------------------------------
NOTIFY_LIST_1=(peggy,lombard,cpaff):(email,pager1),(jennifer):email
NOTIFY_LIST_2=(peggy,lombard,cpaff):(email,pager2)
NOTIFY_NCSS_1=(peggy,lombard,cpaff):(email,pager1)
NOTIFY_NCSS_2=(peggy,lombard,cpaff):(email,pager2)
NOTIFY_DOUG=doug:(email,pager)
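To make the notification-list syntax above easier to read, here is a small
Python sketch that splits such a value into (users, actions) pairs. It only
reflects the pattern visible in these examples (comma-separated "users:actions"
groups, where either side may be a bare word or a parenthesized list); it is
an assumption about the grammar, not monitor's own parser, and the function
names are hypothetical.

```python
import re

# One "users:actions" group; either side may be a bare word or a
# parenthesized, comma-separated list.
_GROUP = re.compile(r"(\([^)]*\)|[^,:()\s]+):(\([^)]*\)|[^,:()\s]+)")


def _items(text):
    """Split 'a' or '(a,b,c)' into a list of names."""
    return [item.strip() for item in text.strip("()").split(",") if item.strip()]


def parse_notify(spec):
    """Return a list of (users, actions) pairs from a notification spec."""
    return [(_items(m.group(1)), _items(m.group(2)))
            for m in _GROUP.finditer(spec)]
```

Read this way, NOTIFY_LIST_1 sends email and pager1 messages to three users,
and email only to a fourth.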
The remaining four sections configure the four different ways that monitor
does its work. These are "prog", "alive", "proc" and "disk". Each of these
sections is optional. The only restriction is that at least one command must
be provided from any of these sections. Otherwise monitor won't have anything
to do so it will simply exit.
The "prog" command tells monitor to run some program (any sort of
executable). The comments in monitor say:
# Expectations of a program that is run under monitor:
# writes to STDOUT only when conditions are unsatisfactory;
# if conditions are normal, it should write nothing to STDOUT
# The STDOUT of the program will be sent as part of the page/email
# anything written to STDERR will go in monitor.error.log but
# nowhere else.
# Exit status: should be 0 under most conditions;
# use non-zero exit status only for failure to execute part of
# part of the program. This non-zero exit status will REPLACE
# anything written to STDOUT in the page/email sent by monitor.
Note that program STDERR output is ignored by monitor but gets combined with
any STDERR output from monitor itself.
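The expectations quoted above amount to a small contract between monitor and
the programs it runs. The Python sketch below shows one way to read that
contract: empty STDOUT with exit status 0 means nothing to report, any STDOUT
text becomes the alarm text, and a non-zero exit status replaces the STDOUT
text. The function names and the wording of the exit-status message are
invented here; monitor itself is a perl script and may differ in detail.

```python
import subprocess
import sys


def alarm_text(stdout, exit_status):
    """Return the text to send, or None when there is nothing to report."""
    if exit_status != 0:
        # Per the quoted comments, a non-zero status replaces any STDOUT text.
        # The wording of this message is invented for the sketch.
        return f"program failed with exit status {exit_status}"
    text = stdout.strip()
    return text if text else None


def run_prog(argv):
    """Run a check program and interpret its result."""
    # STDERR is deliberately not captured, so it passes through to
    # wherever the caller's own STDERR goes.
    result = subprocess.run(argv, stdout=subprocess.PIPE, text=True)
    return alarm_text(result.stdout, result.returncode)
```

A check program written for monitor should therefore stay silent on STDOUT
when all is well, and reserve non-zero exit codes for its own failures.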
Here are some comments and sample "prog" lines from ucbns2:
# Alarmflags are the following numeric parameters:
# run renotify notify_clear
# where:
# run - Run program every N seconds.
# Notify users when alarm is first raised.
# renotify - Renotify users every N seconds
# if alarm stays raised.
# notify_clear - Boolean flag (0 or 1) whether users
# should be notified when alarm is cleared.
# Program Notify_list Alarmflags Prog_args
#-----------------------------------------------------------
prog status_mon NOTIFY_LIST_2 60 3600 1 -a 600 -d 28800
prog status_wda NOTIFY_LIST_2 60 3600 1 -C 5 -c 900 -T 99999999
prog status_wda NOTIFY_LIST_2 60 3600 1 -C 999 -c 99999999 -T 43200
prog check_page.pl NOTIFY_LIST_1 60 3600 1 -t 300
Notice the last item here, running the check_page.pl script. This is testing
some conditions of the pager daemon on ucbns2. If it finds a problem, it must
be reported using pager1, the pager system on ucbns1. It cannot use the pager
system on ucbns2 to report that ucbns2's pager system is not working! This is
another instance in which monitor.config is specific to one machine, and cannot
be copied intact to another machine.
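The three alarm flags (run, renotify, notify_clear) described in the comments
above can be illustrated with a small simulation. The Python sketch below takes
the result of each successive run (True when the alarm condition is present)
and lists when notifications would go out. It is only a reading of the
documented flag meanings, with hypothetical names, and does not reproduce
monitor's actual scheduler.

```python
def notifications(states, run, renotify, notify_clear):
    """Return (time, kind) events for a sequence of check results.

    states: one boolean per run of the program (True = alarm raised).
    run, renotify: intervals in seconds; notify_clear: 0 or 1.
    """
    events = []
    raised = False
    last_notified = None
    for i, bad in enumerate(states):
        t = i * run
        if bad and not raised:
            # Alarm first raised: always notify.
            events.append((t, "raise"))
            raised = True
            last_notified = t
        elif bad and t - last_notified >= renotify:
            # Alarm still raised and the renotify interval has passed.
            events.append((t, "renotify"))
            last_notified = t
        elif not bad and raised:
            raised = False
            if notify_clear:
                events.append((t, "clear"))
    return events
```

With the flags "60 3600 1" used in the sample lines, a problem that persists
for several hours would produce one notification when first seen, one more each
hour, and a final one when it clears.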
The "alive" commands are used to check the network connectivity of other hosts
using the "ping" utility. Here are a few sample entries:
# Computer Notify Alarmflags ping_count
#-------------------------------------------------------------------
alive ucbns1.seismo.berkeley.edu NOTIFY_NCSS_2 60 3600 1 5
alive rumble.seismo.berkeley.edu NOTIFY_NCSS_2 60 3600 1 5
alive benito.seismo.berkeley.edu NOTIFY_NCSS_2 60 3600 1 5
alive ucbrt.seismo.berkeley.edu NOTIFY_NCSS_2 60 3600 1 5
The "proc" command is used to check for the presence of a given process, by
user, name, and possibly some arguments. The "Program" part of the proc
command uses perl regular expressions to compare against what is reported by
"ps -ef". This is a bit tricky to get right! Here are some samples:
# User,Program Notify Alarmflags
#-----------------------------------------------------------------
proc ncss:.*adadup_ucbns22ucbrt.* NOTIFY_LIST_2 120 3600 1
proc ncss:.*adadup_ucbns22ncss3.* NOTIFY_LIST_2 120 3600 1
proc ncss:.*crossoverSA/CA_base.* lombard:email 120 3600 1
proc ncss:.*SocketAgent/CA_base.* lombard:email 120 3600 1
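Because these patterns are easy to get wrong, it can help to try them against
captured "ps -ef" output before putting them in the configuration file. The
Python sketch below does that under two assumptions that should be checked
against the local system: that "ps -ef" prints eight whitespace-separated
columns with the command last, and that the user part of the entry must equal
the UID column exactly. Python's regular expressions are close to perl's for
simple patterns like these, but this is not monitor's own matching code.

```python
import re


def proc_matches(spec, ps_lines):
    """Return the ps lines matched by a 'user:regex' proc specification."""
    user, pattern = spec.split(":", 1)
    regex = re.compile(pattern)
    hits = []
    for line in ps_lines:
        # UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        if fields[0] == user and regex.search(fields[7]):
            hits.append(line)
    return hits
```

An empty result for a process that is known to be running usually means the
pattern, not the process, is at fault. Note that "." in these patterns matches
any character, including "/".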
The "disk" command is used to have monitor check for sufficient free space on
a given file-system. For example:
# Disk Notify Alarmflags Minfree|full%
#----------------------------------------------------------------------
disk /home/aq12 lombard 1800 21600 0 90%
disk /home/aq12 NOTIFY_NCSS_2 1800 7200 0 95%
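The last column of a "disk" entry is either a minimum amount of free space or
a maximum percent full. The Python sketch below shows the comparison for both
forms. It is only an illustration: the units of the "Minfree" form are not
stated here, so the sketch assumes they match whatever units the caller
supplies, and the percentage is computed from total and free space, which may
differ slightly from what "df" reports on filesystems with reserved blocks.

```python
import shutil


def disk_alarm(total, free, limit):
    """Decide whether a filesystem violates a limit.

    limit is either 'NN%' (maximum percent full) or a plain number,
    read here as minimum free space in the same units as total and free.
    """
    if limit.endswith("%"):
        percent_full = 100.0 * (total - free) / total
        return percent_full > float(limit[:-1])
    return free < float(limit)


def check_path(path, limit):
    """Apply disk_alarm to a mounted filesystem, using byte counts."""
    usage = shutil.disk_usage(path)
    return disk_alarm(usage.total, usage.free, limit)
```

The two sample entries for /home/aq12 form a two-tier alarm: one user is told
at 90% full, and the wider notification list only at 95%.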
For BSL systems, monitor is run for at least the following user:host
combinations:
ncss:ucbns1 BSL acquisition and AQMS net services
ncss:ucbns2 ""
ncss:ucbrt BSL AQMS RT system
ncss:ucbpp BSL AQMS PP system
dcmgr:ucbpp BSL event waveform archiving
ncss:rumble ShakeMap
redi:shaker Finite Fault
ncss:quake7 PSD system
ncss:benito BSL AQMS DRP system
ncss:sutter dac480 support
dcmgr:hugo DART
BSL Test AQMS systems:
ncss:mono Solaris AQMS
ncss:seiche Linux AQMS
For Menlo Park systems, monitor is run for the following user:host
combinations:
ncss:mnlons1 MP AQMS net services
ncss:mnlons2 MP AQMS net services
ncss:mnlort1 MP AQMS RT system
ncss:mnlodb1 MP AQMS PP system
dcmgr:mnlodb1 MP event waveform archiving
ncss:mnlodd1 MP real-time double-difference system
==== Programs Used By Monitor ====
The following programs are used by monitor in the //prog// section of the
configuration file on various systems. Each of these programs can be used by
hand when needed. Most but not all of them will report how to use them when
given the "-h" command-line option.
=== action_error ===
On AQMS systems with alarm systems (alarmdec, alarmact, alarmdist), the script
//action_error// is used to find any alarm actions that are in the //ERROR//
state in the //Alarm_Action// database table.
ncss@ucbrt.geo.berkeley.edu:action_error -h
action_error version 0.0.2
action_error - report alarm_actions in ERROR state
Syntax:
action_error [-c config] [-E evid] [-U evid action]
where:
-E evid - query for the action commandline for
any alarm actions in ERROR state for event <evid>.
-U evid action - update any of event <evid>'s actions
of name <action> from ERROR state to ERROR-ACK state.
-c config - specify an alternate configuration file.
The default config file is /home/ncss/run/params/db.conf
-h Help - prints this help message.
When neither option -E or -U is given, action_error prints any event
IDs and their actions which are in the ERROR state. This mode is
suitable for use by monitor.
NOTE:
Once action_error (through monitor) has reported an alarm action in the
//ERROR// state, the user should investigate the problem. Most alarm actions
log their results and errors in files in the //run/alarms/logs// directory.
//action_error// can be used with the //-E evid// option to learn the
command-line appropriate for manually running the alarm action script.
Once the error condition has been resolved, //action_error// should be run
with the //-U evid action// to change the alarm action state from //ERROR// to
//ERROR-ACK// in the alarm_action table. This will silence the complaints from
monitor when it runs //action_error// again.
Note that on the post-processing systems, the alarm_action table is replicated
among all the archive databases. That means that only one instance of monitor
should be configured to run //action_error// at a time; otherwise you may get
multiple pager messages about a single error condition. By convention, monitor
runs //action_error// only on the //active// post-processing system.
=== check_page.pl ===
On the BSL systems (ucbns1, ucbns2) which actually submit pager and SMS
messages, we use //check_page.pl// to look for stale pager files. Any such
files indicate that the pager daemon is not delivering messages in a timely
manner.
ncss@ucbns1:./check_page.pl -h
check_page.pl version 0.0.1
check_page.pl - check for stale pager files
Syntax:
check_page.pl [-h] [-t maxAge]
where:
-t maxAge - Maximum age in seconds for pending pager files
If older files are found, squawk about them!
Default max age is 300 seconds.
-h - prints this help message.
If check_page.pl is reporting a problem with the pager system on the local
system, e.g. ucbns1, it is important to configure monitor to send reports of
the problem to a different system, e.g. ucbns2.
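The stale-file test that //check_page.pl// performs can be sketched in a few
lines. The Python version below is an illustration only: it assumes that "age"
means time since last modification and that only regular files count, and the
function name is hypothetical. The 300-second default matches the help text
above.

```python
import os
import time


def stale_files(directory, max_age=300, now=None):
    """Return names of regular files older than max_age seconds."""
    now = time.time() if now is None else now
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and now - entry.stat().st_mtime > max_age:
                stale.append(entry.name)
    return sorted(stale)
```

Following the conventions for programs run by monitor, a wrapper around this
function would print the stale file names and otherwise print nothing.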
=== checkampexchange ===
To monitor the [[postproc:ampexc|exchange of ground-motion amplitude packets]]
on the post-processing systems, monitor uses the //checkampexchange// script:
ncss@ucbpp:checkampexchange -h
checkampexchange version 0.0.3 - 2015/01/29 NCSS
checkampexchange - report problems with amp import/export
Syntax:
checkampexchange [-a] [-b] [-e] [-j] [-q] [-T maxAge] [-v]
checkampexchange -h
where:
-a check for unexpected files in get_amps/new directory
-b check for stale heartbeats
-e check for files in import error directory
-j check for jammed-up outgoing files
-q check for SQL errors or exceptions in Gmp2Db log
-T maxAge set max heartbeat or file age in minutes; default is 10
-v set verbose output; not suitable for use under monitor
-h Help - prints this help message.
When none of -a, -b, -e, -j, -q are specified, they ALL are implied.
NOTE:
We use only the //-T 30// option in monitor.config; i.e., perform all of the
checks defined for -a, -b, -e, -j, and -q with a 30 minute delay allowed in
the heartbeat messages sent from our sister agencies CGS and Caltech.
See the code (run/bin/checkampexchange) for details of the checks.
=== checkautoposter ===
On the post-processing systems, we use an Oracle Database //job// to insert
new subnet trigger events into the [[postproc:pcs]] system. To monitor the
status of the autoposter job, we use the script //checkautoposter//:
ncss@ucbpp:checkautoposter -h
check status of autoposter jobs
usage: /home/ncss/ncpp/bin/checkautoposter -[h] [-d dbase]
-h : print usage info
-d dbase : use this particular dbase (defaults to "MasterDB")
-v : verbose: report good status; normally report only bad status
example: /home/ncss/ncpp/bin/checkautoposter -d dcmp2
This script queries the database to see if the autoposter job is running
and if it has reported any errors. If errors are found, they should be
reported to the Oracle DBA for help in resolving them.
=== comparelocks ===
The [[postproc:jiggle]] application is used for human-controlled event
location and magnitude evaluation. In order to ensure that different jiggle
users are not working on the same event, jiggle uses the database table
//JasiEventLock// to "lock" other users out of the event that is being
worked. Database replication is not adequate for this locking mechanism on
different databases, so the //JasiEventLock// table is not replicated between
the archive DBs.
Instead we use the script //comparelocks// to ensure that all jiggle users are
using the same database:
ncss@ucbpp:comparelocks -h
Compare current event locks.
usage: /home/ncss/ncpp/bin/comparelocks [-v] [-m]
-h : print usage info
-m : truncated one-line output for use with monitor (NCSS)
-v : verbose: print all locks;
Normally all locks printed only if multiple DB's in use.
This script queries each of the listed databases (hard-coded in the script) to
see if any events are locked on that database. If it finds events locked on
more than one database, it causes monitor to send a message.
The corrective action is to tell the jiggle users to get on the same database.
=== dbping ===
dbping tests its configured database for basic functionality. This perl script
connects to the database and does one query. If that succeeds, dbping is
happy; otherwise it reports an error.
Unexpected errors reported by dbping should be forwarded to the Oracle DBA for
corrective action.
=== dircheck ===
//dircheck// is a simple shell script to monitor directories for unexpected
files. We use it on directories that have files written to them and then
removed by various polling programs. For example, the input directory for the
Earthworm //sendfileII// program normally is empty. Other programs write files
in this directory; sendfileII deletes the files as soon as they have been
sent. If there are files sitting in the //sendfileII// input directory, it is
because they cannot be sent. //dircheck// will report this problem through
//monitor//.
ncss@ucbpp:dircheck -h
usage: /home/ncss/run/bin/dircheck max-files directory-path
When //dircheck// reports errors, the AQMS operator will have to track down
the processes involved and determine the appropriate corrective action.
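The idea behind //dircheck// is simple enough to sketch. The real program is a
shell script; the Python version below is only an illustration. It assumes that
only regular files are counted and that a complaint is made when the count is
strictly greater than max-files; the message format is invented.

```python
import os


def dircheck(max_files, directory):
    """Return a one-line complaint if directory holds too many files."""
    with os.scandir(directory) as entries:
        count = sum(1 for entry in entries if entry.is_file())
    if count > max_files:
        return f"{directory}: {count} files (limit {max_files})"
    return ""
```

An empty return value corresponds to the silent, all-is-well case that monitor
expects from its programs.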
=== pcsWatchdog ===
The //pcsWatchdog// program checks for abnormal entries in the
[[postproc:pcs|pcs state]] table. We run this program on whichever of the
post-processing systems is configured as "active".
ncss@ucbpp:pcsWatchdog -h
Check for PCS backlog
usage: /home/ncss/ncpp/bin/pcsWatchdog [-d] <config_file> [<dbase>]
-h : print usage info
-d : turn on diagnostic output
config_file : full path to the config file
dbase : use this particular dbase (defaults to "dcucb")
example: /home/ncss/ncpp/bin/pcsWatchdog /home/ncss/ncpp/conf/checkStates.cfg dcucb
This script uses a configuration file that lists the various states to check
for. Note that the Group, Table and State entries in this file are used
directly in a database query, so SQL wild-cards can be used. This file is
/home/ncss/ncpp/conf/checkStates.cfg:
# Group Table State Age(secs) Count
EventStream % NewEvent 0 0
EventStream % NewTrigger 0 0
EventStream % MakeDRPGif 60 2
EventStream % MakeTrigGif 60 2
EventStream % ExportAmps 60 10
EventStream % FPfit 120 0
EventStream % ExportWF 760 1
EventStream % ExportArc 120 0
EventStream % ddrtFeed 120 0
EventStream % SwarmAlarm 120 0
EventStream % AssocTrig 0 0
EventStream % TrigCheck 360 10
TPP TPP DELETED 60 0
TPP TPP FINALIZE 60 0
TPP TPP ALARM 60 0
TPP TPP ddrtFeed 120 1
TPP TPP MakeDRPGif 60 2
TPP TPP FPfit 120 0
TPP TPP ExportArc 60 1
TPP TPP CANCELALARM 60 0
TPP TPP DeleteArc 60 0
TPP TPP REPOP 300 0
# >100 rows of any states older than 10min
% % % 600 100
Errors reported by //pcsWatchdog// are usually due to problems in the
[[postproc:pcs]] client programs responsible for handling that state.
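The rule file above can be read as "complain when more than Count rows matching
Group, Table and State are older than Age seconds", which is how the comment on
the final catch-all rule describes it. The Python sketch below applies rules of
that form to a list of rows. It is an interpretation, not the real script: the
row tuples stand in for a database query, the strict "greater than" comparisons
are assumed, and the SQL wild-cards "%" and "_" are emulated with regular
expressions.

```python
import re


def like_to_regex(pattern):
    """Translate an SQL LIKE pattern (% and _) into an anchored regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.DOTALL)


def parse_rules(text):
    """Parse 'Group Table State Age Count' lines, skipping # comments."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        group, table, state, age, count = line.split()
        rules.append((group, table, state, int(age), int(count)))
    return rules


def violations(rules, rows):
    """rows: list of (group, table, state, age_seconds) tuples."""
    found = []
    for group, table, state, max_age, max_count in rules:
        matchers = [like_to_regex(p) for p in (group, table, state)]
        old = [r for r in rows
               if r[3] > max_age
               and all(m.match(v) for m, v in zip(matchers, r[:3]))]
        if len(old) > max_count:
            found.append((group, table, state, len(old)))
    return found
```

Under this reading, a rule with Age 0 and Count 0 fires as soon as a single
row with a non-zero age is present in that state.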
=== pdlSendCheck ===
//pdlSendCheck// is a script for monitoring the ProductClient poll
directory. Since we no longer use ProductClient in polling mode, this script
is no longer needed.
=== status_ada, status_wda ===
//status_ada// and //status_wda// are scripts for monitoring data latency in
the AQMS //ADA// and //WDA// shared memory regions (also called GCDA, generic
channel data area), respectively. These two scripts work by using the output
from AQMS programs //adastat// and //wdastat//.
The two scripts are configured with files listing the SNCLs to be monitored; a
flag value following each SNCL indicates whether that SNCL should be reported
(flag = 1) or temporarily ignored (flag = 0). Normally one SNCL is listed for
each station whose data should be in the GCDA, since most data latency problems
affect all SNCLs of a station in the same way. One exception is stations with
multiple dataloggers: in those cases, one SNCL from each datalogger may be
included in the configuration file.
Each of these scripts offers the same command-line options. Here's
//status_wda//:
ncss@ucbns2:status_wda -h
status_wda version 0.8 (2010.218)
status_wda - Monitor WDA region.
Syntax:
status_wda [-T N] [-C N] [-c K] [-d n] [-h]
where:
-f file Name of config file. Default is:
/home/ncss/run/params/status_wda.config
-T N Total allowable delay summed over all stations.
Default is 240 * number of channels;
-C list Comma-delimited list of cluster sizes.
-c list Comma-delimited list of allowable delay for each
station for the corresponding cluster size.
-d n Debug option.
1 provides basic debugging info.
2 provides sorted delay info.
Examples:
status_wda -T 1800
Set max total delay.
status_wda -C 1,2,3 -c 1800,900,360
Set delay limits of
1 station at 1800 seconds each
2 stations at 900 seconds each
3 stations at 360 seconds each
Max total delay is unchanged from default.
These options provide two basic ways of monitoring data latency. To monitor
the total data latency (sum of latencies for all configured SNCLs), set the -T
value to some low number, while setting the -c and -C values to large
numbers. On the network service systems, we typically use a -T value of 43200
(12 hours), with -C 999 (more than the configured number of SNCLs). This will
generate warning messages if one SNCL is out for about 12 hours, or two SNCLs
out for about 6 hours, etc.
To monitor latency for groups of SNCLs instead of total latency, we set small
values for -c and -C, with a very large value for -T. On the UCB net service
systems, we use -C 3 (a cluster of 3 SNCLs) and -c 600 (ten minutes). The
result is that if any three SNCLs each have latency of more than 600 seconds,
we will get a warning message.
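The two monitoring modes just described can be summarized in a short Python
sketch. It follows the description on this page rather than the scripts' source
code: the total test compares the sum of all latencies with the -T value, and
each cluster test fires when at least C SNCLs each exceed the matching -c
delay. The function name, message wording and the SNCL names used with it are
hypothetical.

```python
def latency_alarms(latencies, total_limit, clusters):
    """Return alarm messages for a set of per-SNCL latencies.

    latencies: dict mapping SNCL name to latency in seconds.
    total_limit: the -T value.
    clusters: list of (size, delay) pairs from the -C and -c lists.
    """
    alarms = []
    total = sum(latencies.values())
    if total > total_limit:
        alarms.append(f"total delay {total} exceeds {total_limit}")
    for size, delay in clusters:
        late = sorted(s for s, sec in latencies.items() if sec > delay)
        if len(late) >= size:
            alarms.append(
                f"{len(late)} SNCLs over {delay} seconds: {', '.join(late)}")
    return alarms
```

With -T 43200 and a very large cluster size, only the total test can fire;
with -C 3 -c 600 and a very large -T, only the cluster test can.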
//status_wda// and //status_ada// are often used manually. In that case, the
most useful option is //-d2//, which lists all configured SNCLs (including
those configured with flag "0"), sorted in descending order of latency.
=== status_mon ===
The AQMS systems at UC Berkeley do not use the standard Earthworm programs
//startstop//, //statmgr// or //status//. Instead they use locally written
programs //status_mgr// and //status_mon// for checking the state of health of
Earthworm programs running on these systems.
//status_mgr// creates a small shared-memory region containing entries for each
of the Earthworm message types it is configured to monitor. Like the Earthworm
//statmgr//, status_mgr normally is configured to watch heartbeat messages
from configured modules. But //status_mgr// can also monitor other message
types, which it considers //data//. In the shared-memory region, status_mgr
keeps track of the last time it received each of the messages it is configured
to monitor, as well as a few other useful parameters.
ncss@ucbns1:status_mgr -h
status_mgr [-F | -K] [-h] [-d N] config_file
where:
-h Help - prints syntax message.
-r ringname Name of ring to monitor. Default is ?
-F | -K Flush or Keep old contents of redi ring (default=Keep).
-d N Debug option N (currently unused).
config_file Name of configuration file (default = status_mgr.config)
Note that //status_mgr// is hard-coded to read Earthworm messages from
//REDI_RING//. This seemed reasonable when it was first written. It may be
time to make the small code changes necessary to make the input ring name
configurable.
//status_mgr// runs continuously, started at the same time as other
Earthworm programs. It should be restarted whenever changes are made to its
configuration file:
run_status_mgr restart
Here's a sample configuration file, taken from ucbrt. It demonstrates the use
of both //HEARTBEAT// and //DATA// configuration types.
#
# Status region: Hard-coded to read from REDI_RING
#
SHMEM=3030
#
# Type=Program_Name,Module,Net,Expect,Flag
#
HEARTBEAT=ImportPkTrigMenlo1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_1,INST_UCB,60,1
HEARTBEAT=PkTrigServerMenlo1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_1,INST_MENLO,60,1
HEARTBEAT=ImportPkTrigMenlo2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_2,INST_UCB,60,1
HEARTBEAT=PkTrigServerMenlo2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_2,INST_MENLO,60,1
HEARTBEAT=ImportPkTrigUcbns1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_UCBNS1,INST_UCB,60,1
HEARTBEAT=PkTrigServerUcbns1,TYPE_HEARTBEAT,MOD_EXPORT_PKTRIG_UCBNS1_BK,INST_UCB,60,1
HEARTBEAT=ImportPkTrigUcbns2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_UCBNS2,INST_UCB,60,1
HEARTBEAT=PkTrigServerUcbns2,TYPE_HEARTBEAT,MOD_EXPORT_PKTRIG_UCBNS2_BK,INST_UCB,60,1
HEARTBEAT=File2EW_Trig,TYPE_HEARTBEAT,MOD_FILE2EW_TRIG,INST_UCB,60,1
#
HEARTBEAT=Pkfilter,TYPE_HEARTBEAT,MOD_PKFILTER,INST_UCB,60,1
HEARTBEAT=Binder_ew,TYPE_HEARTBEAT,MOD_BINDER_EW,INST_UCB,60,1
HEARTBEAT=Eqassemble,TYPE_HEARTBEAT,MOD_EQASSEMBLE,INST_UCB,60,1
DATA=EvtAssemble,TYPE_HYP2000ARC,MOD_EQASSEMBLE,INST_UCB,0,1
HEARTBEAT=Hyps2ps,TYPE_HEARTBEAT,MOD_HYPS2PS,INST_UCB,60,1
#
HEARTBEAT=StatrigFilter,TYPE_HEARTBEAT,MOD_STATRIGFILTER,INST_UCB,60,1
HEARTBEAT=Carlsubtrig,TYPE_HEARTBEAT,MOD_CARLSUBTRIG,INST_UCB,60,1
DATA=SubnetTrigger,TYPE_TRIGLIST_SCNL,MOD_CARLSUBTRIG,INST_UCB,0,1
HEARTBEAT=Trig2ps,TYPE_HEARTBEAT,MOD_TRIG2PS,INST_UCB,60,1
#
The configuration file allows for naming the triplet of Earthworm message
type, module ID and installation ID. This name is intended to be (hopefully)
more comprehensible to human readers than the Earthworm syntax.
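To show how the entries in this file break down, the Python sketch below
parses SHMEM, HEARTBEAT and DATA lines into named fields. The field names
(name, msg_type, module, inst, expect, flag) are this page's reading of the six
comma-separated values, based on the triplet described above; they are not
taken from the status_mgr source.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "kind name msg_type module inst expect flag")


def parse_status_config(text):
    """Parse HEARTBEAT= and DATA= lines; return (shmem_key, entries)."""
    shmem = None
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key == "SHMEM":
            shmem = int(value)
        elif key in ("HEARTBEAT", "DATA"):
            # Exactly six comma-separated values are expected per entry.
            name, msg_type, module, inst, expect, flag = value.split(",")
            entries.append(Entry(key, name, msg_type, module, inst,
                                 int(expect), int(flag)))
    return shmem, entries
```

A line that does not have exactly six values raises an error in this sketch,
which is a convenient way to catch entries that were accidentally run together.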
//status_mgr// ignores Earthworm //TYPE_ERROR// messages. These messages are
completely ignored in UCB Earthworm systems.
The companion program //status_mon// is normally used by //monitor// to report
abnormal Earthworm conditions. It reads from the same shared-memory region
controlled by //status_mgr// and determines the latency of each parameter
being monitored. If the latency exceeds a limit, it will be reported. The
latency limits are the same as configured in status_mgr's configuration file,
unless modified by status_mon's command line. If no limits are exceeded, then
status_mon is silent.
ncss@ucbns2:status_mon -h
status_mon [-a n] [-d N] [-h] [-I list | -O list]
where:
-h Help - prints syntax message.
-m shmkey Specify shared memory key.
-a N Set alive (heartbeat) threshold to N seconds.
-d N Set data threshold to N seconds.
-I ignore-list Ignore items in comma-separted list.
-O only-list Only list items in comma-separted list.
-I and -O cannot both be used at the same time.
The UCB command //status// is simply a symbolic link to //status_mon//. This
provides a complete list of the configured names, last update times, expected
update interval and current latency. This command is intended for manual use.
For example:
ncss@ucbns2: status
Name Last Update (UTC) Expect Delta
---------------------------------------------------------
Import_Trace_Local 2018/03/19,22:35:52.0000 10 1
Slink2EW_USLB 2018/03/19,22:35:35.0000 10 18
Mcast_USLB 2018/03/19,22:35:47.0000 60 6
WfTimeFilter 2018/03/19,22:35:45.0000 60 8
Pickew 2018/03/19,22:35:45.0000 60 8
Coda_AAV 2018/03/19,22:35:47.0000 60 6
Coda_DUR 2018/03/19,22:35:51.0000 60 2
Carlstatrig 2018/03/19,22:35:26.0000 60 27
CsDetectEw 2018/03/19,22:35:49.0000 10 4
PowerMon 2018/03/19,22:35:46.0000 10 7
Export_PkTrig_Menlo 2018/03/19,22:35:50.0000 60 3
Export_PkTrig_UCB 2018/03/19,22:35:46.0000 60 7
Export_Pick_CIT 2018/03/19,22:35:00.0000 60 53 (x)
Export_Pick_Golden 2018/03/19,22:35:37.0000 60 16 (x)
Export_Trace_MP 2018/03/19,22:35:49.0000 60 4 (x)
Export_Trace_ATWC 2018/03/19,22:35:42.0000 60 11 (x)
Export_Trace_PTWC 2018/03/19,22:35:44.0000 60 9 (x)
Export_Trace_UW 2018/03/19,22:35:44.0000 60 9 (x)
Export_Trace_NCVL 2018/03/19,22:35:33.0000 60 20 (x)
Export_Trace_UCSD_KF 2018/03/19,22:35:48.0000 60 5 (x)
Wave_serverV_1 2018/03/19,22:35:47.0000 60 6
Wave_serverV_2 2018/03/19,22:35:47.0000 60 6
Wave_serverV_3 2018/03/19,22:35:47.0000 60 6
Wave_serverV_4 2018/03/19,22:35:47.0000 60 6
Wave_serverV_5 2018/03/19,22:35:47.0000 60 6
In this output, the "(x)" text indicates that this item is flagged to turn off
warning messages sent through //monitor//. If this parameter's latency
exceeded the limit, this symbol would change to "(ALARM)". Likewise, if a
parameter were not flagged for silence, the symbol would be "ALARM", without
the enclosing parentheses.
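The four possible values of the trailing symbol follow from two yes/no
questions: is the item silenced, and has its latency exceeded the limit? The
Python sketch below spells that out. The function and parameter names are
hypothetical, and how "silenced" is derived from the configuration flag is not
covered here.

```python
def status_symbol(latency, limit, silenced):
    """Return the symbol shown at the end of a status line."""
    over = latency > limit
    if silenced:
        return "(ALARM)" if over else "(x)"
    return "ALARM" if over else ""
```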
== Automatic Restart for Earthworm Modules at UCB ==
Because UCB AQMS systems do not use Earthworm's //startstop// or //statmgr//
programs, they are missing the ability to automatically restart Earthworm
modules when they die. To get around this lacuna, we have a //restart_mgr//
system to handle automatic restarts. ///home/ncss/run/bin/restart_mgr.pl// is
a Perl script that parses its configuration file, runs //status//, and restarts
any configured programs that are in the //ALARM// or //(ALARM)// state. Since
this system is run by cron, we have ///home/ncss/run/bin/restart_mgr.csh//, a
C-shell script to set the appropriate environment variables needed by the perl
script and by //status//.
ncss@ucbns1:./restart_mgr.pl -h
restart_mgr.pl version 0.0.1
restart_mgr.pl - check status and restart failed programs
Syntax:
restart_mgr.pl [-c config] [-h] [-v]
where:
-c config - specify an alternate configuration file
The default config file is /home/ncss/run/params/restart_mgr.conf
-w wait_time
- how long (seconds) to wait for heartbeat before
restarting a program; default is 300
-v - verbose output
-h - prints this help message.
The configuration file for restart_mgr on ucbns1 is shown here. Note that this
is in perl syntax that defines a hash. The left column gives the names of
items in status_mgr's configuration file. The right column gives the commands
needed to restart the given item when called with the //restart// option:
%progMap = (
Export_Trace_Quake => "run_export_trace_quake",
Export_Trace_MP => "run_export_trace_menlo",
Export_Trace_ATWC => "run_export_trace_atwc",
Export_Trace_PTWC => "run_export_trace_ptwc",
Export_Trace_UW => "run_export_trace_uw",
Export_Trace_NCVL => "run_export_trace_ncal_valve",
Export_Trace_UCSD_KF => "run_export_trace_ucsd_kf",
Export_Pick_Golden => "run_export_pick_golden",
);
# Following required to make perl happy:
1;
Yes, this is a kludge! But it seems to work OK.
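The core of the restart logic can be sketched as follows. This Python version
is an illustration rather than a port of restart_mgr.pl: it assumes that each
line of //status// output begins with the item name and ends with "ALARM" or
"(ALARM)" when in alarm, it uses a plain dictionary in place of the perl
%progMap hash, and it leaves out the -w wait time.

```python
def restart_commands(status_text, prog_map):
    """Return restart commands for mapped items whose status is in alarm."""
    commands = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[-1] not in ("ALARM", "(ALARM)"):
            continue
        name = fields[0]
        # Items without an entry in the map are left alone.
        if name in prog_map:
            commands.append(f"{prog_map[name]} restart")
    return commands
```

Only items listed in the map are ever restarted, so an item in the //ALARM//
state that is missing from the configuration still needs operator attention.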