===== Starting and Stopping AQMS for NCSS =====

This page describes the various means of starting and stopping AQMS systems
for the NCSS. It starts at the highest level, where the OS boot and shutdown
system interacts with AQMS. The discussion then proceeds to lower levels, down to
ways of starting and stopping individual AQMS programs.


=== Init Scripts ===

Currently, almost all the computers on which NCSS runs AQMS programs use "init"
scripts to start and stop services. This is the facility that Solaris provides,
as does Linux up through Red Hat 6. Newer Linux systems running Red Hat 7 offer a
different facility for controlling processes during bootup: systemd.

The following tables show the init scripts on each system, along with the run
levels and priorities assigned to them:

== Network Service Systems ucbns1, ucbns2 ==
^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| netmon  | 2,3,4 | 91 | no automatic stopping || starts data acquisition |
| ncss    | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops AQMS |

== Network Service Systems mnlons1, mnlons2 ==
^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| ncss    | 2,3,4 | 94 | 0,1,5,6 | 05 | starts and stops non-EW AQMS |
| earthworm    | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops Earthworm |

== RT System ucbrt ==
^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| dbora   | 2,3,4 | 82 | 0,1,5,6 | 10 | starts and stops Oracle DB |
| cms     | 2,3,4 | 89 | 0,1,5,6 | 10 | starts and stops CMS |
| ncss    | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops AQMS |

== RT System mnlort1 ==
^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| dbora   | 2,3,4 | 82 | 0,1,5,6 | 10 | starts and stops Oracle DB |
| cms     | 2,3,4 | 89 | 0,1,5,6 | 10 | starts and stops CMS |
| ncss    | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops non-EW AQMS |
| earthworm | 2,3,4 | 98 | 0,1,5,6 | 02 | starts and stops Earthworm |

== Post-Proc System ucbpp ==
^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| cms     | 2,3,4 | 89 | 0,1,5,6 | 10 | starts and stops CMS |
| ncss    | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops parts of AQMS |
| dcmgr   | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops dcmgr monitoring |

== Post-Proc System mnlodb1 ==
^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| cms     | 3     | 89 | no auto shutdown || starts CMS |
| dbora   | 3     | 93 | 0,1,S  | 10 | starts and stops Oracle DB |
| ncss    | 3     | 97 | 0,1,S  | 05 | starts and stops parts of AQMS |
| dcmgr   | 3     | 98 | 0,1,S  | 05 | starts and stops dcmgr monitoring |

As the tables above show, the Start Priority values are not consistent from one
system to the next; what matters is the order in which the scripts run. Systems
that have a local Oracle database should start it before starting "ncss", the
main part that depends on the database. Likewise, on systems running CMS, CMS
should be started before "ncss", which depends on CMS. On the UCB acquisition and
network service systems, "netmon" has a lower Start Priority number than "ncss"
and so runs first, but it starts slowly enough that the AQMS part (WDA) will
already be available by the time netmon starts the WDA writers.
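
To make the ordering concrete: on SysV-style init, each table entry corresponds to a link in a run-level directory, named with S (start) or K (kill), the two-digit priority, and the script name. The links are run in name order. The following is a minimal sketch of that naming rule using the RT system values from the tables above; the function names are made up for this illustration.

```shell
#!/bin/bash
# Build the conventional SysV rc link name for an init script.
# kind is S (start) or K (kill); priority is the number from the tables.
rc_link_name() {
    local kind=$1 priority=$2 script=$3
    # 10# forces base 10 so that priorities such as 05 or 08 are not read as octal.
    printf '%s%02d%s\n' "$kind" "$((10#$priority))" "$script"
}

# Start entries for an RT system, deliberately listed out of order.
start_links() {
    rc_link_name S 95 ncss
    rc_link_name S 82 dbora
    rc_link_name S 89 cms
}

# Sorting by name reproduces the startup sequence: dbora, then cms, then ncss.
start_links | LC_ALL=C sort
```

Sorting puts //dbora// first and //ncss// last, which is the dependency order described above.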

Note that many parts of the post-processing systems are started by crontab
entries instead of by init scripts. On the two RT systems, the solution
servers are likewise started by crontab entries. Each of the NCSS computers also
has various crontab entries for running miscellaneous support codes, which are
not described here.

== User-level Scripts ==

The above init scripts (except for //dbora//, provided by Oracle) are quite
simple: each calls a user-level script to perform the startup (and shutdown, if
applicable) work, as follows:

  * ncss: calls //~ncss/run/bin/run_all// for startup, //~ncss/run/bin/stop_all// for shutdown.

  * cms: calls //~ncss/run/cms/runAll start// for startup, //~ncss/run/cms/runAll stop// for shutdown.

  * dcmgr: calls //~dcmgr/run/bin/run_all// for startup, //~dcmgr/run/bin/stop_all// for shutdown.

  * netmon: calls //~ncss/config/bin/run_netmon// for startup. Acquisition must be stopped manually.

  * earthworm: runs //startstop// in background with appropriate environment and configuration file, stdout & stderr redirected to /dev/null; kills startstop on shutdown.
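
The shape of such a wrapper can be sketched as below. This is not the actual NCSS init script: the function name and the choice to pass the script directory as an argument are assumptions made so that the example is self-contained, and details such as switching to the //ncss// user are left out.

```shell
#!/bin/bash
# Hypothetical wrapper in the style of the "ncss" init script.
# bindir is the directory holding run_all and stop_all.
ncss_init() {
    local action=$1 bindir=$2
    case "$action" in
        start) "$bindir/run_all" ;;
        stop)  "$bindir/stop_all" ;;
        *)     echo "Usage: ncss {start|stop}" >&2
               return 1 ;;
    esac
}

# A real init script would end with something like:
#   ncss_init "$1" ~ncss/run/bin
```

Keeping the init script this thin means the same //run_all// and //stop_all// scripts can also be run by hand.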

The //run_all// scripts are straightforward bash scripts. They check
some environment variables and set some others that are needed by most AQMS
programs. Then they run //dbping// to check that the configured database is
available for use. //dbping// connects to the database and does a simple query
to ensure that the database is working correctly. If the database is OK, then the
run_all script calls all the scripts and programs needed to start the AQMS
components needed for the particular user and host. Each run_all script is
custom made for that user and host! The run_all script also starts several
programs that do not depend on the Oracle database.
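
The control flow described above might look roughly like the following sketch. Every name in it is a placeholder: //AQMS_HOME//, //AQMS_LOG_DIR//, and the component names are invented for illustration, and //dbping// is assumed to exit with status 0 when the database answers. Whether the real scripts still start the database-independent programs after a failed check is not stated above; the sketch assumes they do.

```shell
#!/bin/bash
# Sketch of a run_all-style control flow. All names below are placeholders.
: "${DB_CHECK:=dbping}"     # assumed to exit 0 when the database is usable

start_components() {
    # A real run_all would call the host-specific run scripts here.
    local c
    for c in "$@"; do
        echo "starting $c"
    done
}

run_all_sketch() {
    # Check a required variable (the name is illustrative).
    if [ -z "${AQMS_HOME:-}" ]; then
        echo "AQMS_HOME is not set" >&2
        return 1
    fi
    # Set a derived variable that the started programs would inherit.
    export AQMS_LOG_DIR="$AQMS_HOME/logs"

    # Start database-dependent programs only if the database check passes.
    if "$DB_CHECK" >/dev/null 2>&1; then
        start_components db_program_a db_program_b
    else
        echo "database check failed; skipping database-dependent programs" >&2
    fi

    # Programs that do not depend on the database.
    start_components standalone_program_c
}
```

Setting //DB_CHECK// to //true// or //false// makes it possible to exercise both branches without a database.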

For stopping AQMS programs, the //stop_all// script stops many of the programs
previously started by //run_all//. Some of that work is done by searching the
//run_all// script for commands that follow a simple pattern, making that part
of //stop_all// generic.
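
As an illustration of that generic part, the sketch below derives stop commands from a //run_all// file. The pattern it looks for, an uncommented line of the form "<script> start", is an assumption; the actual pattern used by //stop_all// is not documented here.

```shell
#!/bin/bash
# Print a "<script> stop" command for every uncommented "<script> start"
# line in the given run_all file. The pattern is assumed for illustration.
derive_stop_commands() {
    local run_all_file=$1
    sed -n 's/^[[:space:]]*\([^#[:space:]][^[:space:]]*\)[[:space:]][[:space:]]*start[[:space:]]*$/\1 stop/p' "$run_all_file"
}
```

With this approach, a program added to //run_all// in the expected form is picked up by //stop_all// without further editing.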

== Individual AQMS run scripts ==

Most AQMS programs used by NCSS have their own scripts for starting and
stopping. Where possible, many of these //run// scripts have only a few lines
and then call a generic script //run/bin/runguts//. As the name implies,
//runguts// has all the guts of the script. It provides the options start,
stop, stopwait, and restart. Many AQMS programs, especially the ones connecting
to CMS, take many seconds to exit after they have been sent the //kill//
signal. The "stopwait" and "restart" options verify that the particular application
has really exited before proceeding.
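
The wait-for-exit behavior can be sketched as follows. This is not the //runguts// code: the function name, the 30-second default timeout, and the one-second polling interval are all assumptions.

```shell
#!/bin/bash
# Send the default kill signal (TERM) to a process and wait until it is gone.
# Returns 0 once the process has exited, 1 if it is still alive after
# the timeout (in seconds, default 30).
stop_and_wait() {
    local pid=$1 timeout=${2:-30} waited=0
    # If the signal cannot be delivered, treat the process as already gone.
    kill "$pid" 2>/dev/null || return 0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A restart built on this would call the start step only after //stop_and_wait// returns 0, so that two copies of the same program never run at once.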

Some exceptions to the simple run scripts:

  * run_monitor: the //monitor// script will reread its configuration file when sent the //HUP// signal. The run_monitor script does that when called with the //restart// option.

  * run_pws*: the proxy wave server (pws) works by forking a new process for each client connection. The //run_pws*// script can only kill the initial server process, not the forked ones. Thus the script cannot safely restart pws. It is up to the user to decide when it is safe to start a new pws server after stopping one.

  * run_wanc*: the NC waveform archiver //wanc// does not reliably exit on a //kill// signal. Thus the run_wanc* scripts do not provide a //restart// option.
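
For the pws case, one way a user might check whether it is safe to start a new server is to look for leftover processes first. The sketch below assumes //pgrep// is available and that the forked processes carry the same process name as the server, neither of which is confirmed above.

```shell
#!/bin/bash
# Return 0 if no process with exactly this name is running, 1 otherwise.
# Assumes pgrep is installed; the process name to check is up to the caller.
safe_to_start() {
    local name=$1
    if pgrep -x "$name" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Example use (the name "pws" is an assumption):
#   safe_to_start pws && echo "no pws processes left"
```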