===== Starting and Stopping AQMS for NCSS =====

This page describes the various ways of starting and stopping AQMS systems for the NCSS. It starts at the highest level, where the operating system's boot and shutdown facility interacts with AQMS. It then proceeds to lower levels, down to starting and stopping individual AQMS programs.

=== Init Scripts ===

Currently almost all the computers on which NCSS runs AQMS programs use "init" scripts to start and stop things. This is the facility that Solaris and Linux up through Red Hat 6 provide. Newer Linux systems running Red Hat 7 use a different facility for controlling processes during bootup: systemd.

The following tables show the init scripts on each system and the run levels and priorities assigned to them:

== Network Service Systems ucbns1, ucbns2 ==

^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| netmon | 2,3,4 | 91 | no automatic stopping || starts data acquisition |
| ncss | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops AQMS |

== Network Service Systems mnlons1, mnlons2 ==

^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| ncss | 2,3,4 | 94 | 0,1,5,6 | 05 | starts and stops non-EW AQMS |
| earthworm | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops Earthworm |

== RT System ubbrt ==

^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| dbora | 2,3,4 | 82 | 0,1,5,6 | 10 | starts and stops Oracle DB |
| cms | 2,3,4 | 89 | 0,1,5,6 | 10 | starts and stops CMS |
| ncss | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops AQMS |

== RT System mnlort1 ==

^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| dbora | 2,3,4 | 82 | 0,1,5,6 | 10 | starts and stops Oracle DB |
| cms | 2,3,4 | 89 | 0,1,5,6 | 10 | starts and stops CMS |
| ncss | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops non-EW AQMS |
| earthworm | 2,3,4 | 98 | 0,1,5,6 | 02 | starts and stops Earthworm |

== Post-Proc System ucbpp ==

^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| cms | 2,3,4 | 89 | 0,1,5,6 | 10 | starts and stops CMS |
| ncss | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops parts of AQMS |
| dcmgr | 2,3,4 | 95 | 0,1,5,6 | 05 | starts and stops dcmgr monitoring |

== Post-Proc System mnlodb1 ==

^ Script Name ^ Start Run Levels ^ Start Priority ^ Kill Run Levels ^ Kill Priority ^ Function ^
| cms | 3 | 89 | no auto shutdown || starts CMS |
| dbora | 3 | 93 | 0,1,S | 10 | starts and stops Oracle DB |
| ncss | 3 | 97 | 0,1,S | 05 | starts and stops parts of AQMS |
| dcmgr | 3 | 98 | 0,1,S | 05 | starts and stops dcmgr monitoring |

As you can see, there is little consistency in the Start Priority values across the above tables; what matters most is the relative order on each system. Systems that have a local Oracle database should start it before starting "ncss", the main part of AQMS that depends on the database. Likewise, on systems running CMS, it should be started before "ncss", which depends on CMS. On the UCB acquisition and network service systems, "netmon" (priority 91) is started before "ncss" (priority 95), but netmon starts slowly enough that the AQMS part (WDA) will already be available by the time netmon starts the WDA writers.

Note that many parts of the post-processing systems are started by crontab entries instead of by init scripts. On the two RT systems, the solution servers are also started by crontab entries. Each of the NCSS computers has various crontab entries for running miscellaneous support programs, which are not described here.

== User-level Scripts ==

The above init scripts (except for //dbora//, provided by Oracle) are quite simple. They just call a user-level script to perform the startup (and shutdown, if applicable) work, as follows:

  * ncss: calls //~ncss/run/bin/run_all// for startup, //~ncss/run/bin/stop_all// for shutdown.
  * cms: calls //~ncss/run/cms/runAll start// for startup, //~ncss/run/cms/runAll stop// for shutdown.
  * dcmgr: calls //~dcmgr/run/bin/run_all// for startup, //~dcmgr/run/bin/stop_all// for shutdown.
  * netmon: calls //~ncss/config/bin/run_netmon// for startup. Acquisition must be stopped manually.
  * earthworm: runs //startstop// in the background with the appropriate environment and configuration file, with stdout and stderr redirected to /dev/null; kills startstop on shutdown.

The //run_all// scripts are fairly straightforward bash scripts. They check some environment variables and set others that are needed by most AQMS programs. Then they run //dbping// to check that the configured database is available for use: //dbping// connects to the database and performs a simple query to verify that the database is working correctly. If the database check succeeds, the run_all script calls the scripts and programs that start the AQMS components needed for that particular user and host. Each run_all script is custom-made for its user and host! The run_all script also starts several programs that do not depend on the Oracle database.

The //stop_all// script stops many of the programs previously started by //run_all//. Part of that work is done by searching the //run_all// script for commands that follow a simple pattern, which makes that part of //stop_all// generic.

== Individual AQMS run scripts ==

Most AQMS programs used by NCSS have their own scripts for starting and stopping. Where possible, these //run// scripts are only a few lines long and then call the generic script //run/bin/runguts//. As the name implies, //runguts// contains all the guts of the script. It provides the options start, stop, stopwait, and restart. Many AQMS programs, especially the ones connecting to CMS, can take many seconds to exit after they have been sent the //kill// signal. The "stopwait" and "restart" options verify that the application has really exited before proceeding.
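The "stopwait" and "restart" idea can be illustrated with a short Python sketch. This is not runguts itself, whose implementation is not shown on this page. It assumes the caller already knows the process ID of the running program (for example from a pid file, a hypothetical detail), that "stop" means sending SIGTERM (the default signal sent by a plain `kill pid`), and that a fixed timeout is acceptable. The function names `is_running`, `stop_and_wait` and `restart` are hypothetical.

```python
import os
import signal
import subprocess
import time


def is_running(pid):
    """Return True if a process with this pid still exists."""
    try:
        # If pid is our own child, reap it once it has exited so that a
        # zombie is not mistaken for a live process.
        reaped, _ = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return False
    except ChildProcessError:
        pass  # not our child (the usual case); use the signal-0 probe below
    try:
        os.kill(pid, 0)  # signal 0 only checks that the pid exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_and_wait(pid, timeout=30.0, poll_interval=0.2):
    """Hypothetical "stopwait": send SIGTERM, then poll until the process exits.

    Returns True once the process is gone, or False if it is still
    running when the timeout expires.
    """
    if not is_running(pid):
        return True
    try:
        os.kill(pid, signal.SIGTERM)  # what a plain `kill pid` sends
    except ProcessLookupError:
        return True  # exited between the check and the signal
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_running(pid):
            return True
        time.sleep(poll_interval)
    return not is_running(pid)


def restart(pid, start_cmd, timeout=30.0):
    """Hypothetical "restart": start a new copy only after the old one has exited."""
    if not stop_and_wait(pid, timeout):
        raise RuntimeError(
            f"pid {pid} still running after {timeout}s; not starting a new copy")
    return subprocess.Popen(start_cmd)
```

Checking with signal 0 only asks the kernel whether the pid exists; it does not disturb the process. The `waitpid` call matters only when the stopped program is a child of the checking process (as in a quick test); otherwise an exited but unreaped child would still look alive.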
Some exceptions to the simple run scripts:

  * run_monitor: the //monitor// script rereads its configuration file when it receives the //HUP// signal. The run_monitor script sends it that signal when called with the //restart// option.
  * run_pws*: the proxy wave server (pws) forks a new process for each client connection. The //run_pws*// script can only kill the initial server process, not the forked ones, so it cannot safely restart pws. It is up to the user to decide when it is safe to start a new pws server after stopping one.
  * run_wanc*: the NC waveform archiver //wanc// does not reliably exit on a //kill// signal. Therefore the run_wanc* scripts do not provide a //restart// option.
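The run_monitor exception relies on a common Unix convention: a long-running program installs a handler for HUP and rereads its configuration instead of exiting. The Python sketch below illustrates only that convention; it is not monitor's actual code. The class name `ReloadingConfig` and the simple `key = value` file format are hypothetical.

```python
import signal


class ReloadingConfig:
    """Key = value settings that are reread from disk after a SIGHUP.

    Hypothetical illustration of the reread-on-HUP convention; the file
    format is an assumption, not monitor's real configuration format.
    """

    def __init__(self, path):
        self.path = path
        self.settings = self._read()
        self._reload_requested = False
        # Keep the handler trivial: it only sets a flag. The file is reread
        # later in poll(), outside of signal-handler context.
        signal.signal(signal.SIGHUP, self._on_hup)

    def _on_hup(self, signum, frame):
        self._reload_requested = True

    def _read(self):
        settings = {}
        with open(self.path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                key, _, value = line.partition("=")
                settings[key.strip()] = value.strip()
        return settings

    def poll(self):
        """Call once per pass of the main loop; returns True if a reload happened."""
        if self._reload_requested:
            self._reload_requested = False
            self.settings = self._read()
            return True
        return False
```

With a program built this way, a run script's restart only needs to send the signal (`kill -HUP pid` from the shell, or `os.kill(pid, signal.SIGHUP)` in Python) rather than stopping the program and starting it again. Note that Python only allows installing signal handlers from the main thread.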