===== Monitor, an Application for Monitoring Systems =====

Monitor is a perl script (written by Doug Neuhauser in about 1994) for generalized monitoring. Being a perl script, it is more or less agnostic to the type of unix or linux on which it runs. However, a number of configurable parameters (both in the script and in the configuration file) must be adjusted to make the program work with the "standard" utilities provided by each unix/linux. Don't try to copy a Solaris version of monitor to Linux; it will most likely not work properly.

Minimal help is provided by running "monitor -h":

  monitor - Monitor processes, computers, disks, and other progs.
  Syntax:
  monitor [-h] [-d n] [-l logfile] [config_file]
      where:
      -b          Boot option.  Delay monitoring at boot time for
                  BOOT_DELAY seconds specified in config file.
      -d n        Debug value n.  Value can be the OR of:
                  1 = print scheduling info
                  2 = print time info
                  4 = print sorting info
                  8 = print out commands
                  16 = print alarm messages instead of sending them.
                  64 = display monitor_list structure
                  128 = print proc list when missing process.
      -h          Help - prints this help message.
      -l logfile  Log all alarms to the specified logfile.
      config_file Configuration file.  Default file is:
                  monitor.config

For most applications, monitor is started at system boot and runs continuously; e.g.

  monitor -b -l /home/ncss/run/logs/monitor.log /home/ncss/run/params/monitor.config

On almost all systems, monitor is started at boot time with the "-b" flag. This allows a few minutes for applications to get started before monitor starts checking things. The boot delay also takes effect when monitor is restarted with the //HUP// signal, which is useful for avoiding notifications while the applications being checked by monitor are restarted.

==== Configuration File ====

The configuration file, monitor.config by default, is a text file consisting of five sections, plus comments.
The first section provides basic configuration parameters for monitor's operations. Each parameter is treated very much like a unix environment variable, and is actually added to monitor's process environment. This makes these variables available to any programs that are run by monitor. Thus PATH (and LD_LIBRARY_PATH, if necessary) can be set here. Comments are prefixed with "#". The "EMAIL" parameter includes the subject line under which email messages are sent; e.g.:

  EMAIL=/bin/mail -s 'ALARM - monitor@ucbns2'

In this first section, arbitrary variables may be defined which will be used in the following sections of monitor.config. For example, in ucbns2 we have:

  # Notification list - may be used in place of user:action
  #-----------------------------------------------------------------------
  NOTIFY_LIST_1=(peggy,lombard,cpaff):(email,pager1),(jennifer):email
  NOTIFY_LIST_2=(peggy,lombard,cpaff):(email,pager2)
  NOTIFY_NCSS_1=(peggy,lombard,cpaff):(email,pager1)
  NOTIFY_NCSS_2=(peggy,lombard,cpaff):(email,pager2)
  NOTIFY_DOUG=doug:(email,pager)

The remaining four sections configure the four different ways that monitor does its work. These are "prog", "alive", "proc" and "disk". Each of these sections is optional. The only restriction is that at least one command must be provided from any of these sections; otherwise monitor has nothing to do and simply exits.

The "prog" command tells monitor to run some program (any sort of executable). The comments in monitor say:

  # Expectations of a program that is run under monitor:
  #     writes to STDOUT only when conditions are unsatisfactory;
  #     if conditions are normal, it should write nothing to STDOUT
  #     The STDOUT of the program will be sent as part of the page/email
  #     anything written to STDERR will go in monitor.error.log but
  #     nowhere else.
  # Exit status: should be 0 under most conditions;
  #     use non-zero exit status only for failure to execute part of
  #     part of the program. This non-zero exit status will REPLACE
  #     anything written to STDOUT in the page/email sent by monitor.

Note that monitor ignores a program's STDERR output; it is simply combined with any STDERR output from monitor itself.

Here are some comments and sample "prog" lines from ucbns2:

  # Alarmflags are the following numeric parameters:
  #       run renotify notify_clear
  # where:
  #       run -           Run program every N seconds.
  #                       Notify users when alarm is first raised.
  #       renotify -      Renotify users every N seconds
  #                       if alarm stays raised.
  #       notify_clear -  Boolean flag (0 or 1) whether users
  #                       should be notified when alarm is cleared.
  #       Program         Notify_list     Alarmflags      Prog_args
  #-----------------------------------------------------------
  prog    status_mon      NOTIFY_LIST_2   60 3600 1       -a 600 -d 28800
  prog    status_wda      NOTIFY_LIST_2   60 3600 1       -C 5 -c 900 -T 99999999
  prog    status_wda      NOTIFY_LIST_2   60 3600 1       -C 999 -c 99999999 -T 43200
  prog    check_page.pl   NOTIFY_LIST_1   60 3600 1       -t 300

Notice the last item here, running the check_page.pl script. It tests some conditions of the pager daemon on ucbns2. If it finds a problem, the problem must be reported using pager1, the pager system on ucbns1. It cannot use the pager system on ucbns2 to report that ucbns2's pager system is not working! This is another instance in which monitor.config is specific to one machine and cannot be copied intact to another machine.

The "alive" commands are used to check the network connectivity of other hosts using the "ping" utility. Here are a few sample entries:

  #       Computer                        Notify          Alarmflags      ping_count
  #-------------------------------------------------------------------
  alive   ucbns1.seismo.berkeley.edu      NOTIFY_NCSS_2   60 3600 1       5
  alive   rumble.seismo.berkeley.edu      NOTIFY_NCSS_2   60 3600 1       5
  alive   benito.seismo.berkeley.edu      NOTIFY_NCSS_2   60 3600 1       5
  alive   ucbrt.seismo.berkeley.edu       NOTIFY_NCSS_2   60 3600 1       5

The "proc" command is used to check for the presence of a given process, by user, name, and possibly some arguments.
The "Program" part of the proc command uses perl regular expressions to compare against what is reported by "ps -ef". This is a bit tricky to get right! Here are some samples:

  #       User,Program                        Notify          Alarmflags
  #-----------------------------------------------------------------
  proc    ncss:.*adadup_ucbns22ucbrt.*        NOTIFY_LIST_2   120 3600 1
  proc    ncss:.*adadup_ucbns22ncss3.*        NOTIFY_LIST_2   120 3600 1
  proc    ncss:.*crossoverSA/CA_base.*        lombard:email   120 3600 1
  proc    ncss:.*SocketAgent/CA_base.*        lombard:email   120 3600 1

The "disk" command is used to have monitor check for sufficient free space on a given file-system. For example:

  #       Disk            Notify          Alarmflags      Minfree|full%
  #----------------------------------------------------------------------
  disk    /home/aq12      lombard         1800 21600 0    90%
  disk    /home/aq12      NOTIFY_NCSS_2   1800 7200 0     95%

For BSL systems, monitor is run for at least the following user:host combinations:

  ncss:ucbns1     BSL acquisition and AQMS net services
  ncss:ucbns2     ""
  ncss:ucbrt      BSL AQMS RT system
  ncss:ucbpp      BSL AQMS PP system
  dcmgr:ucbpp     BSL event waveform archiving
  ncss:rumble     ShakeMap
  redi:shaker     Finite Fault
  ncss:quake7     PSD system
  ncss:benito     BSL AQMS DRP system
  ncss:sutter     dac480 support
  dcmgr:hugo      DART

BSL Test AQMS systems:

  ncss:mono       Solaris AQMS
  ncss:seiche     Linux AQMS

For Menlo Park systems, monitor is run for the following user:host combinations:

  ncss:mnlons1    MP AQMS net services
  ncss:mnlons2    MP AQMS net services
  ncss:mnlort1    MP AQMS RT system
  ncss:mnlodb1    MP AQMS PP system
  dcmgr:mnlodb1   MP event waveform archiving
  ncss:mnlodd1    MP real-time double-difference system

==== Programs Used By Monitor ====

The following programs are used by monitor in the //prog// section of the configuration file on various systems. Each of these programs can be run by hand when needed. Most but not all of them will report how to use them when given the "-h" command-line option.
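As a rough illustration of what monitor expects from a //prog// program (write to STDOUT only when something is wrong, keep diagnostics on STDERR, exit 0 unless the check itself could not run), here is a minimal sketch in Python. The names check_dir and run are hypothetical and are not part of monitor or of any of the programs described below; the check itself (too many files in a directory) is only an example.

```python
import os
import sys


def check_dir(path, max_files):
    """Return a problem message, or None when the directory looks normal."""
    count = sum(1 for entry in os.scandir(path) if entry.is_file())
    if count > max_files:
        return f"{path}: {count} files present (limit {max_files})"
    return None


def run(args):
    """Run the check; return the exit status a monitor-style caller would see."""
    if len(args) != 2:
        # Usage problems are diagnostics, so they go to STDERR, not STDOUT.
        print("usage: check_dir max-files directory-path", file=sys.stderr)
        return 1
    try:
        problem = check_dir(args[1], int(args[0]))
    except (ValueError, OSError) as err:
        # The check could not be executed: non-zero status, nothing on STDOUT.
        print(f"check_dir: {err}", file=sys.stderr)
        return 1
    if problem:
        # STDOUT is what would end up in the page/email.
        print(problem)
    # A detected problem is still a successful run of the check.
    return 0
```

A wrapper script would call something like sys.exit(run(sys.argv[1:])). The point of the sketch is the separation of channels: silence on STDOUT means "all is well", and the exit status reports only whether the check was able to run.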
=== action_error ===

On AQMS systems with alarm systems (alarmdec, alarmact, alarmdist), the script //action_error// is used to find any alarm actions that are in the //ERROR// state in the //Alarm_Action// database table.

  ncss@ucbrt.geo.berkeley.edu:action_error -h
  action_error version 0.0.2
  action_error - report alarm_actions in ERROR state
  Syntax:
  action_error [-c config] [-E evid] [-U evid action]
  where:
      -E evid         - query for the action commandline for any alarm
                        actions in ERROR state for event .
      -U evid action  - update any of event 's actions of name from
                        ERROR state to ERROR-ACK state.
      -c config       - specify an alternate configuration file.
                        The default config file is /home/ncss/run/params/db.conf
      -h              Help - prints this help message.
  When neither option -E or -U is given, action_error prints any
  event IDs and their actions which are in the ERROR state. This mode is
  suitable for use by monitor.

NOTE: Once action_error (through monitor) has reported an alarm action in the //ERROR// state, the user should investigate the problem. Most alarm actions log their results and errors in files in the //run/alarms/logs// directory. //action_error// can be used with the //-E evid// option to learn the command line appropriate for manually running the alarm action script. Once the error condition has been resolved, //action_error// should be run with the //-U evid action// option to change the alarm action state from //ERROR// to //ERROR-ACK// in the alarm_action table. This will silence the complaints from monitor when it runs //action_error// again.

Note that on the post-processing systems, the alarm_action table is replicated among all the archive databases. That means that only one instance of monitor should be configured to run //action_error// at a time; otherwise you may get multiple pager messages about a single error condition.
By convention, monitor runs //action_error// only on the //active// post-processing system.

=== check_page.pl ===

On the BSL systems (ucbns1, ucbns2) which actually submit pager and SMS messages, we use //check_page.pl// to look for stale pager files. Stale files indicate that the pager daemon is not able to deliver messages in a timely manner.

  ncss@ucbns1:./check_page.pl -h
  check_page.pl version 0.0.1
  check_page.pl - check for stale pager files
  Syntax:
  check_page.pl [-h] [-t maxAge]
  where:
      -t maxAge - Maximum age in seconds for pending pager files
                  If older files are found, squawk about them!
                  Default max age is 300 seconds.
      -h        - prints this help message.

If check_page.pl is reporting a problem with the pager system on the local system, e.g. ucbns1, it is important to configure monitor to send reports of the problem through a different system, e.g. ucbns2.

=== checkampexchange ===

To monitor the [[postproc:ampexc|exchange of ground-motion amplitude packets]] on the post-processing systems, monitor uses the //checkampexchange// script:

  ncss@ucbpp:checkampexchange -h
  checkampexchange version 0.0.3 - 2015/01/29 NCSS
  checkampexchange - report problems with amp import/export
  Syntax:
  checkampexchange [-a] [-b] [-e] [-j] [-q] [-T maxAge] [-v]
  checkampexchange -h
  where:
      -a         check for unexpected files in get_amps/new directory
      -b         check for stale heartbeats
      -e         check for files in import error directory
      -j         check for jammed-up outgoing files
      -q         check for SQL errors or exceptions in Gmp2Db log
      -T maxAge  set max heartbeat or file age in minutes; default is 10
      -v         set verbose output; not suitable for use under monitor
      -h         Help - prints this help message.
  When none of -a, -b, -e, -j, -q are specified, they ALL are implied.

NOTE: We use only the //-T 30// option in monitor.config; i.e., perform all of the checks defined for -a, -b, -e, -j, and -q, allowing a 30-minute delay in the heartbeat messages sent from our sister agencies CGS and Caltech.
See the code (run/bin/checkampexchange) for details of the checks.

=== checkautoposter ===

On the post-processing systems, we use an Oracle Database //job// to insert new subnet trigger events into the [[postproc:pcs]] system. To monitor the status of the autoposter job, we use the script //checkautoposter//:

  ncss@ucbpp:checkautoposter -h
  check status of autoposter jobs
  usage: /home/ncss/ncpp/bin/checkautoposter -[h] [-d dbase]
      -h        : print usage info
      -d dbase  : use this particular dbase (defaults to "MasterDB")
      -v        : verbose: report good status; normally report only bad status
  example: /home/ncss/ncpp/bin/checkautoposter -d dcmp2

This script queries the database to see if the autoposter job is running and if it has reported any errors. If errors are found, they should be reported to the Oracle DBA for help in resolving them.

=== comparelocks ===

The [[postproc:jiggle]] application is used for human-controlled event location and magnitude evaluation. To ensure that different jiggle users are not working on the same event, jiggle uses the database table //JasiEventLock// to "lock" other users out of the event that is being worked. Database replication is not adequate for this locking mechanism on different databases, so the //JasiEventLock// table is not replicated between the archive DBs. Instead we use the script //comparelocks// to ensure that all jiggle users are using the same database:

  ncss@ucbpp:comparelocks -h
  Compare current event locks.
  usage: /home/ncss/ncpp/bin/comparelocks [-v] [-m]
      -h : print usage info
      -m : truncated one-line output for use with monitor (NCSS)
      -v : verbose: print all locks;
           Normally all locks printed only if multiple DB's in use.

This script queries each of the listed databases (hard-coded in the script) to see if any events are locked on that database. If it finds events locked on more than one database, it causes monitor to send a message. The corrective action is to tell the jiggle users to get on the same database.
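The comparison that //comparelocks// is described as making can be sketched as follows. This is only an illustration in Python of the decision rule stated above, not the actual script: the function name find_lock_conflict, the database names db_a and db_b, and the event IDs are all hypothetical, and the database queries that would produce the lock lists are left out.

```python
def find_lock_conflict(locks_by_db):
    """Return a one-line warning if events are locked on more than one database.

    locks_by_db maps a database name to the list of event IDs currently
    locked there (as would be read from each database's JasiEventLock table).
    """
    in_use = sorted(db for db, evids in locks_by_db.items() if evids)
    if len(in_use) > 1:
        return "event locks held on multiple databases: " + ", ".join(in_use)
    # Zero or one database in use: nothing to report, so stay silent.
    return None


# Hypothetical example: two jiggle users working on different databases.
message = find_lock_conflict({"db_a": [1001], "db_b": [1002]})
if message:
    print(message)
```

Returning nothing in the normal case matches the way monitor treats program output: only a printed line results in a notification.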
=== dbping ===

//dbping// tests its configured database for basic functionality. This perl script connects to the database and does one query. If that succeeds, dbping is happy; otherwise it reports an error. Unexpected errors reported by dbping should be forwarded to the Oracle DBA for corrective action.

=== dircheck ===

//dircheck// is a simple shell script to monitor directories for unexpected files. We use it on directories that have files written to them and then removed by various polling programs. For example, the input directory for the Earthworm //sendfileII// program normally is empty. Other programs write files in this directory; sendfileII deletes the files as soon as they have been sent. If there are files sitting in the //sendfileII// input directory, it is because they cannot be sent. //dircheck// will report this problem through //monitor//.

  ncss@ucbpp:dircheck -h
  usage: /home/ncss/run/bin/dircheck max-files directory-path

When //dircheck// reports errors, the AQMS operator will have to track down the processes involved and determine the appropriate corrective action.

=== pcsWatchdog ===

The //pcsWatchdog// program checks for abnormal entries in the [[postproc:pcs|pcs state]] table. We run this program on whichever of the post-processing systems is configured as "active".

  ncss@ucbpp:pcsWatchdog -h
  Check for PCS backlog
  usage: /home/ncss/ncpp/bin/pcsWatchdog [-d] []
      -h          : print usage info
      -d          : turn on diagnostic output
      config_file : full path to the config file
      dbase       : use this particular dbase (defaults to "dcucb")
  example: /home/ncss/ncpp/bin/pcsWatchdog /home/ncss/ncpp/conf/checkStates.cfg dcucb

This script uses a configuration file that lists the various states to check for. Note that the Group, Table and State entries in this file are used directly in a database query, so SQL wild-cards can be used.
This file is /home/ncss/ncpp/conf/checkStates.cfg:

  # Group       Table   State           Age(secs)       Count
  EventStream   %       NewEvent        0               0
  EventStream   %       NewTrigger      0               0
  EventStream   %       MakeDRPGif      60              2
  EventStream   %       MakeTrigGif     60              2
  EventStream   %       ExportAmps      60              10
  EventStream   %       FPfit           120             0
  EventStream   %       ExportWF        760             1
  EventStream   %       ExportArc       120             0
  EventStream   %       ddrtFeed        120             0
  EventStream   %       SwarmAlarm      120             0
  EventStream   %       AssocTrig       0               0
  EventStream   %       TrigCheck       360             10
  TPP           TPP     DELETED         60              0
  TPP           TPP     FINALIZE        60              0
  TPP           TPP     ALARM           60              0
  TPP           TPP     ddrtFeed        120             1
  TPP           TPP     MakeDRPGif      60              2
  TPP           TPP     FPfit           120             0
  TPP           TPP     ExportArc       60              1
  TPP           TPP     CANCELALARM     60              0
  TPP           TPP     DeleteArc       60              0
  TPP           TPP     REPOP           300             0
  # >100 rows of any states older than 10min
  %             %       %               600             100

Errors reported by //pcsWatchdog// are usually due to problems in the [[postproc:pcs]] client programs responsible for handling that state.

=== pdlSendCheck ===

//pdlSendCheck// is a script for monitoring the ProductClient poll directory. Since we no longer use ProductClient in polling mode, this script is no longer needed.

=== status_ada, status_wda ===

//status_ada// and //status_wda// are scripts for monitoring data latency in the AQMS //ADA// and //WDA// shared memory regions (also called GCDA, generic channel data area), respectively. These two scripts work by using the output from the AQMS programs //adastat// and //wdastat//. The two scripts are configured with files listing the SNCLs to be monitored; a flag value following each SNCL indicates whether that SNCL should be reported (flag = 1) or temporarily ignored (flag = 0). Normally one SNCL is listed for each station whose data should be in the GCDA, since most data latency problems affect all SNCLs of a station in the same way. One exception is for stations with multiple dataloggers: in those cases, one SNCL from each datalogger may be included in the configuration file.

Each of these scripts offers the same command-line options.
Here's //status_wda//:

  ncss@ucbns2:status_wda -h
  status_wda version 0.8 (2010.218)
  status_wda - Monitor WDA region.
  Syntax:
  status_wda [-T N] [-C N] [-c K] [-d n] [-h]
      where:
      -f file     Name of config file.  Default is:
                  /home/ncss/run/params/status_wda.config
      -T N        Total allowable delay summed over all stations.
                  Default is 240 * number of channels;
      -C list     Comma-delimited list of cluster sizes.
      -c list     Comma-delimited list of allowable delay for each station
                  for the corresponding cluster size.
      -d n        Debug option.
                  1 provides basic debugging info.
                  2 provides sorted delay info.
  Examples:
      status_wda -T 1800
                  Set max total delay.
      status_wda -C 1,2,3 -c 1800,900,360
                  Set delay limits of
                          1 station at 1800 seconds each
                          2 stations at 900 seconds each
                          3 stations at 360 seconds each
                  Max total delay is unchanged from default.

These options provide two basic ways of monitoring data latency. To monitor the total data latency (sum of latencies for all configured SNCLs), set the -T value to some low number, while setting the -c and -C values to large numbers. On the network service systems, we typically use a -T value of 43200 (12 hours), with -C 999 (more than the configured number of SNCLs). This will generate warning messages if one SNCL is out for about 12 hours, or two SNCLs are out for about 6 hours, etc.

To monitor latency for groups of SNCLs instead of total latency, we set small values for -c and -C, with a very large value for -T. On the UCB net service systems, we use -C 3 (a cluster of 3 SNCLs) and -c 600 (ten minutes). The result is that if any three SNCLs each have latency of more than 600 seconds, we will get a warning message.

//status_wda// and //status_ada// are often used manually. In that case, the most useful option is //-d2//, which gives a list of all configured SNCLs (including those configured with flag "0"), sorted in descending order of latency.

=== status_mon ===

The AQMS systems at UC Berkeley do not use the standard Earthworm programs //startstop//, //statmgr// or //status//.
Instead they use the locally written programs //status_mgr// and //status_mon// for checking the state of health of Earthworm programs running on these systems. //status_mgr// creates a small shared-memory region containing entries for each of the Earthworm message types it is configured to monitor. Like the Earthworm //statmgr//, status_mgr normally is configured to watch heartbeat messages from configured modules. But //status_mgr// can also monitor other message types, which it considers //data//. In the shared-memory region, status_mgr keeps track of the last time it received each of the messages it is configured to monitor, as well as a few other useful parameters.

  ncss@ucbns1:status_mgr -h
  status_mgr [-F | -K] [-h] [-d N] config_file
      where:
      -h           Help - prints syntax message.
      -r ringname  Name of ring to monitor. Default is ?
      -F | -K      Flush or Keep old contents of redi ring (default=Keep).
      -d N         Debug option N (currently unused).
      config_file  Name of configuration file (default = status_mgr.config)

Note that //status_mgr// is hard-coded to read Earthworm messages from //REDI_RING//. This seemed reasonable when it was first written. It may be time to make the small code changes necessary to make the input ring name configurable.

//status_mgr// runs continuously, started at the same time as other Earthworm programs. It should be restarted whenever changes are made to its configuration file:

  run_status_mgr restart

Here's a sample configuration file, taken from ucbrt. It demonstrates the use of both //HEARTBEAT// and //DATA// configuration types.

  #
  # Status region: Hard-coded to read from REDI_RING
  #
  SHMEM=3030
  #
  # Type=Program_Name,Module,Net,Expect,Flag
  #
  HEARTBEAT=ImportPkTrigMenlo1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_1,INST_UCB,60,1
  HEARTBEAT=PkTrigServerMenlo1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_1,INST_MENLO,60,1
  HEARTBEAT=ImportPkTrigMenlo2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_2,INST_UCB,60,1
  HEARTBEAT=PkTrigServerMenlo2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_MP_2,INST_MENLO,60,1
  HEARTBEAT=ImportPkTrigUcbns1,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_UCBNS1,INST_UCB,60,1
  HEARTBEAT=PkTrigServerUcbns1,TYPE_HEARTBEAT,MOD_EXPORT_PKTRIG_UCBNS1_BK,INST_UCB,60,1
  HEARTBEAT=ImportPkTrigUcbns2,TYPE_HEARTBEAT,MOD_IMPORT_PKTRIG_UCBNS2,INST_UCB,60,1
  HEARTBEAT=PkTrigServerUcbns2,TYPE_HEARTBEAT,MOD_EXPORT_PKTRIG_UCBNS2_BK,INST_UCB,60,1
  HEARTBEAT=File2EW_Trig,TYPE_HEARTBEAT,MOD_FILE2EW_TRIG,INST_UCB,60,1
  #
  HEARTBEAT=Pkfilter,TYPE_HEARTBEAT,MOD_PKFILTER,INST_UCB,60,1
  HEARTBEAT=Binder_ew,TYPE_HEARTBEAT,MOD_BINDER_EW,INST_UCB,60,1
  HEARTBEAT=Eqassemble,TYPE_HEARTBEAT,MOD_EQASSEMBLE,INST_UCB,60,1
  DATA=EvtAssemble,TYPE_HYP2000ARC,MOD_EQASSEMBLE,INST_UCB,0,1
  HEARTBEAT=Hyps2ps,TYPE_HEARTBEAT,MOD_HYPS2PS,INST_UCB,60,1
  #
  HEARTBEAT=StatrigFilter,TYPE_HEARTBEAT,MOD_STATRIGFILTER,INST_UCB,60,1
  HEARTBEAT=Carlsubtrig,TYPE_HEARTBEAT,MOD_CARLSUBTRIG,INST_UCB,60,1
  DATA=SubnetTrigger,TYPE_TRIGLIST_SCNL,MOD_CARLSUBTRIG,INST_UCB,0,1
  HEARTBEAT=Trig2ps,TYPE_HEARTBEAT,MOD_TRIG2PS,INST_UCB,60,1
  #

The configuration file allows for naming the triplet of Earthworm message type, module ID and installation ID. This name is intended to be (hopefully) more comprehensible to human readers than the Earthworm syntax.

//status_mgr// ignores Earthworm //TYPE_ERROR// messages. These messages are completely ignored in UCB Earthworm systems.

The companion program //status_mon// is normally used by //monitor// to report abnormal Earthworm conditions.
It reads from the same shared-memory region controlled by //status_mgr// and determines the latency of each parameter being monitored. If the latency exceeds a limit, it will be reported. The latency limits are the same as configured in status_mgr's configuration file, unless modified by status_mon's command line. If no limits are exceeded, then status_mon is silent.

  ncss@ucbns2:status_mon -h
  status_mon [-a n] [-d N] [-h] [-I list | -O list]
      where:
      -h              Help - prints syntax message.
      -m shmkey       Specify shared memory key.
      -a N            Set alive (heartbeat) threshold to N seconds.
      -d N            Set data threshold to N seconds.
      -I ignore-list  Ignore items in comma-separted list.
      -O only-list    Only list items in comma-separted list.
      -I and -O cannot both be used at the same time.

The UCB command //status// is simply a symbolic link to //status_mon//. It provides a complete list of the configured names, last update times, expected update intervals and current latencies. This command is intended for manual use. For example:

  ncss@ucbns2: status
  Name                  Last Update (UTC)         Expect  Delta
  ---------------------------------------------------------
  Import_Trace_Local    2018/03/19,22:35:52.0000  10      1
  Slink2EW_USLB         2018/03/19,22:35:35.0000  10      18
  Mcast_USLB            2018/03/19,22:35:47.0000  60      6
  WfTimeFilter          2018/03/19,22:35:45.0000  60      8
  Pickew                2018/03/19,22:35:45.0000  60      8
  Coda_AAV              2018/03/19,22:35:47.0000  60      6
  Coda_DUR              2018/03/19,22:35:51.0000  60      2
  Carlstatrig           2018/03/19,22:35:26.0000  60      27
  CsDetectEw            2018/03/19,22:35:49.0000  10      4
  PowerMon              2018/03/19,22:35:46.0000  10      7
  Export_PkTrig_Menlo   2018/03/19,22:35:50.0000  60      3
  Export_PkTrig_UCB     2018/03/19,22:35:46.0000  60      7
  Export_Pick_CIT       2018/03/19,22:35:00.0000  60      53  (x)
  Export_Pick_Golden    2018/03/19,22:35:37.0000  60      16  (x)
  Export_Trace_MP       2018/03/19,22:35:49.0000  60      4   (x)
  Export_Trace_ATWC     2018/03/19,22:35:42.0000  60      11  (x)
  Export_Trace_PTWC     2018/03/19,22:35:44.0000  60      9   (x)
  Export_Trace_UW       2018/03/19,22:35:44.0000  60      9   (x)
  Export_Trace_NCVL     2018/03/19,22:35:33.0000  60      20  (x)
  Export_Trace_UCSD_KF  2018/03/19,22:35:48.0000  60      5   (x)
  Wave_serverV_1        2018/03/19,22:35:47.0000  60      6
  Wave_serverV_2        2018/03/19,22:35:47.0000  60      6
  Wave_serverV_3        2018/03/19,22:35:47.0000  60      6
  Wave_serverV_4        2018/03/19,22:35:47.0000  60      6
  Wave_serverV_5        2018/03/19,22:35:47.0000  60      6

In this output, the "(x)" text indicates that the item is flagged to turn off warning messages sent through //monitor//. If such an item's latency exceeded the limit, the symbol would change to "(ALARM)". Likewise, if an item not flagged for silence exceeded its limit, the symbol would be "ALARM", without the enclosing parentheses.

== Automatic Restart for Earthworm Modules at UCB ==

Because UCB AQMS systems do not use Earthworm's //startstop// or //statmgr// programs, they are missing the ability to automatically restart Earthworm modules when they die. To get around this lacuna, we have a //restart_mgr// system to handle automatic restarts. ///home/ncss/run/bin/restart_mgr.pl// is a perl script that parses its configuration file, runs //status//, and restarts any configured programs that are in the //ALARM// or //(ALARM)// state. Since this system is run by cron, we have ///home/ncss/run/bin/restart_mgr.csh//, a C-shell script that sets the environment variables needed by the perl script and by //status//.

  ncss@ucbns1:./restart_mgr.pl -h
  restart_mgr.pl version 0.0.1
  restart_mgr.pl - check status and restart failed programs
  Syntax:
  restart_mgr.pl [-c config] [-h] [-v]
  where:
      -c config     - specify an alternate configuration file
                      The default config file is /home/ncss/run/params/restart_mgr.conf
      -w wait_time  - how long (seconds) to wait for heartbeat before
                      restarting a program; default is 300
      -v            - verbose output
      -h            - prints this help message.

The configuration file for restart_mgr on ucbns1 is shown here. Note that this is in perl syntax that defines a hash. The left column gives the names of items in status_mgr's configuration file.
The right column gives the commands needed to restart the given item when called with the //restart// option:

  %progMap = (
      Export_Trace_Quake   => "run_export_trace_quake",
      Export_Trace_MP      => "run_export_trace_menlo",
      Export_Trace_ATWC    => "run_export_trace_atwc",
      Export_Trace_PTWC    => "run_export_trace_ptwc",
      Export_Trace_UW      => "run_export_trace_uw",
      Export_Trace_NCVL    => "run_export_trace_ncal_valve",
      Export_Trace_UCSD_KF => "run_export_trace_ucsd_kf",
      Export_Pick_Golden   => "run_export_pick_golden",
  );
  # Following required to make perl happy:
  1;

Yes, this is a kludge! But it seems to work OK.
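The selection step described above (find items in the //ALARM// or //(ALARM)// state and look up their restart commands) can be sketched in Python as follows. This is an illustration only, not the restart_mgr.pl code: the function name restart_commands is hypothetical, the sample status rows are invented, the sketch assumes the state marker is the last whitespace-separated column as in the //status// example earlier, and it leaves out the -w wait_time handling entirely.

```python
# Subset of the name-to-command map shown above, rewritten as a Python dict.
PROG_MAP = {
    "Export_Trace_MP": "run_export_trace_menlo",
    "Export_Pick_Golden": "run_export_pick_golden",
}

# Invented sample in the layout of the status output; not real data.
SAMPLE_STATUS = """\
Name                  Last Update (UTC)         Expect  Delta
Pickew                2018/03/19,22:35:45.0000  60      8
Export_Pick_CIT       2018/03/19,22:20:00.0000  60      953  (ALARM)
Export_Trace_MP       2018/03/19,22:28:49.0000  60      424  (ALARM)
Export_Trace_UW       2018/03/19,22:35:44.0000  60      9    (x)
"""


def restart_commands(status_text, prog_map):
    """Return restart commands for configured items that are in an alarm state."""
    commands = []
    for line in status_text.splitlines():
        fields = line.split()
        # Data rows with a marker have five fields: name, time, expect, delta, marker.
        if len(fields) < 5 or fields[-1] not in ("ALARM", "(ALARM)"):
            continue
        name = fields[0]
        if name in prog_map:
            # Items without an entry in the map are left alone.
            commands.append(f"{prog_map[name]} restart")
    return commands


print(restart_commands(SAMPLE_STATUS, PROG_MAP))
```

In this sample, Export_Pick_CIT is in alarm but has no entry in the map, so only the command for Export_Trace_MP is produced.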