Alarm Actions

Alarm actions are the goal of the AQMS alarm system: if an event meets certain conditions, alarmdist calls one or more alarm actions for the event. Alarm actions are executable commands, usually scripts. They can do essentially anything, but they should not take long to run, because alarmdist currently calls one action at a time. Alarmdist has a configured time limit for each action. If the action runs too long, alarmdist will kill it and report an error.
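The time-limit behavior can be illustrated with a short Python sketch. This is not alarmdist's actual code: the function name run_action, the time_limit parameter, and the treatment of any nonzero exit status as a failure are assumptions made for the example.

```python
import subprocess


def run_action(argv, time_limit):
    """Run one action command and report (ok, detail).

    Hypothetical helper: argv is the full command line as a list,
    time_limit is the allowed run time in seconds.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if the command is still running when the limit is reached.
        result = subprocess.run(argv, timeout=time_limit)
    except subprocess.TimeoutExpired:
        return False, "timed out after %s s" % time_limit
    if result.returncode != 0:
        # Assumption: any nonzero exit status counts as an error.
        return False, "exit code %d" % result.returncode
    return True, "ok"
```

Because the actions run one at a time, a slow action delays every action queued behind it, which is why a per-action limit matters.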

All alarm actions have the same calling convention:

action_name EVENT_ID mod_count

The mod_count value can be any number; it is required but no longer used. It is a relic from when the event version was not maintained in the Event table.

Note that every action should also have a cancel script to undo the action. The cancel command has the name of the action script with CANCEL_ prepended. The cancel command has the same calling convention as the action command.
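As an illustration of the calling convention and the cancel naming rule, here is a minimal Python sketch. The helper names parse_args and cancel_command are hypothetical, and the sketch assumes the cancel command lives in the same directory as the action, which this page does not state.

```python
import os


def parse_args(argv):
    """Parse 'action_name EVENT_ID mod_count' and return the event ID."""
    if len(argv) != 3:
        raise SystemExit("usage: %s EVENT_ID mod_count" % argv[0])
    # mod_count must be present but its value is ignored.
    event_id, _mod_count = argv[1], argv[2]
    return event_id


def cancel_command(action_path):
    """Return the cancel command: the action's name with CANCEL_ prepended.

    Assumption: the cancel command sits next to the action script.
    """
    directory, name = os.path.split(action_path)
    return os.path.join(directory, "CANCEL_" + name)
```

For example, cancel_command("MTweb") gives "CANCEL_MTweb", and the cancel command is then invoked with the same EVENT_ID and mod_count arguments as the action.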

Errors

If an alarm action encounters a problem and returns an abnormal exit code, alarmdist will send an email notice and set the action state to ERROR in the alarm_action DB table. To keep track of these errors, we have the script action_error (in /home/ncss/run/bin/).

ncss@ucbrt.geo.berkeley.edu:action_error -h
    action_error version 0.0.2
        action_error - report alarm_actions in ERROR state
Syntax:
        action_error  [-c config] [-E evid] [-U evid action]
where:
        -E evid         - query for the action commandline for
        any alarm actions in ERROR state for event <evid>.
        -U evid action  - update any of event <evid>'s actions
        of name <action> from ERROR state to ERROR-ACK state.
        -c config       - specify an alternate configuration file.
        The default config file is /home/ncss/run/params/db.conf
        -h      Help    - prints this help message.

        When neither option -E or -U is given, action_error prints any event
        IDs and their actions which are in the ERROR state. This mode is
        suitable for use by monitor.

The monitor program is configured to call action_error with no arguments, which searches the entire alarm_action table for actions in the ERROR state. This encourages AQMS operators to investigate these errors. The user should look in the action logs (/home/ncss/run/alarms/logs) to find the problem. It may be something temporary, such as an scp destination being down. In that case, simply running the action command manually is the appropriate way to get the action performed.

After any action errors have been investigated, the user should acknowledge the error. Otherwise the action_error script will continue to report it. To acknowledge an error, use the -U option to action_error. For example,

action_error -U 72282711 MTweb

This will change the action state from ERROR to ERROR-ACK, which will stop action_error from complaining.

BeltPager

As the name suggests, this action causes a pager message to be sent. This is used to notify NCSS personnel about a significant earthquake. The script does the following:

NCSSmail

This action is intended to send email notification to NCSS personnel about interesting events. It is NOT intended for email to the general public; that function is ably provided by ENS.

The NCSSmail action does the following:

EVTPRM2PDL

This cryptically named action sends event parameters to the PDL system, making them available to the world. The following things are performed:

ShakeMap

This is the action that tells ShakeMap about earthquakes. ShakeMap's queue program is listening on a TCP socket. This action connects to the TCP socket and sends a short message:

shake_alarm EVENT_ID UPDATE

Note that queue is configured with a list of host-names from which it will accept TCP connections. If the host calling the ShakeMap alarm action is not on queue's list, it will be rejected.

The script that performs this action reads a list of ShakeMap hosts: /home/ncss/run/alarms/actions/Shakemap/shake_hosts_ports. For each listed host, the action script forks a separate process to handle that host. The parent process waits a configured time (30 seconds) for each child to complete its work. This ensures that even if one ShakeMap server is down, the other ones will still get notified of the event.
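The fan-out to several ShakeMap hosts can be sketched in Python as follows. This is an illustration, not the actual action script: it uses one thread per host instead of a forked process, it takes the hosts as a list of (host, port) pairs rather than reading shake_hosts_ports (whose file format is not described here), and the trailing newline on the message is an assumption. The names notify_host and notify_all are hypothetical.

```python
import socket
import threading


def notify_host(host, port, event_id, results, timeout):
    """Send the alarm message to one ShakeMap queue listener."""
    # Assumption: the message is terminated by a newline.
    message = "shake_alarm %s UPDATE\n" % event_id
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(message.encode("ascii"))
        results[(host, port)] = True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        results[(host, port)] = False


def notify_all(hosts, event_id, timeout=30.0):
    """Notify every host in parallel; return {(host, port): succeeded}."""
    results = {}
    workers = []
    for host, port in hosts:
        worker = threading.Thread(
            target=notify_host,
            args=(host, port, event_id, results, timeout),
            daemon=True,
        )
        worker.start()
        workers.append(worker)
    for worker in workers:
        # Wait up to the time limit for each worker, then move on.
        worker.join(timeout)
    # A host whose worker has not finished is reported as failed.
    return {hp: results.get(hp, False) for hp in hosts}
```

Handling each host separately means that a server that is down or slow only costs its own time limit and does not prevent the other servers from receiving the message.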

MTweb

The MTweb action takes care of publishing moment tensor information to the world. All the work is done by /home/ncss/run/alarms/bin/mtwebpdl. The following items are performed:

MTlocalmail, MTmail

These two alarm actions send email about moment tensors to two different lists of recipients. The local list is intended to be a short list of NCSS people who should get prompt notification. The public list is longer, hence slower. These scripts do the following:

TMTSDone

This alarm action notifies Duty Response people when tmts has finished with an event on the real-time systems. It indicates that all real-time AQMS processing of the event is complete, so that it is safe to start modifying event information on the archive databases. The following steps are performed: