NCEDC Archive Size Estimation

Detailed Estimate

/data/dc5/reporting.NCEDC/archive_size/ (dcmgr on strike)

This directory contains the data and programs used to determine
the size of the NCEDC archive.  Information is aggregated by 
year and month.

1.  Configuration info:

conf/gps.list
	List of directories that store GPS data to be scanned.
	This file is used by programs in the bin directory to determine
	where to look for GPS data.

conf/nets.continuous.list
	List of networks for which we archive continuous MSEED data.
	This file is used by programs in the bin directory to determine
	which networks to scan the filesystem for when computing
	the continuous MSEED data sizes.
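
As an illustration of how a program might consume one of these list files, here is a minimal Python sketch. It assumes one entry per line, and the handling of blank lines and "#" comments is a hypothetical convention, not something this README specifies. The function name parse_list is invented for the example.

```python
def parse_list(lines):
    """Return the non-empty entries from an iterable of lines, in order.

    Assumption: one entry per line; anything after '#' is a comment.
    """
    entries = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            entries.append(entry)
    return entries


if __name__ == "__main__":
    # Illustrative input only; the real file contents are not shown here.
    print(parse_list(["BK\n", "\n", "# comment\n", "NC  # note\n"]))
```

With a real file this could be called as parse_list(open("conf/nets.continuous.list")).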

2.  Monthly data files created by the programs in the bin directory contain:
a.  gzipped files created by "find" filesystem scan.
b.  csv files that summarize the info from the "find" files or SQL queries.

data/cont_mseed/year/year.month.csv
	data:	year,month,nbytes

data/gps/year/gps.year.month.csv
	data:	year,month,nbytes

data/event_mseed_egs/year.month.egs.csv
	hdr:	YEAR,MONTH,EVIDCOUNT,WAVEFORMBYTES,WVIDCOUNT,SNCLEVIDCOUNT
	data:	year,month,evidcount,nbytes,wvidcount,snclevidcount

data/event_mseed_ncss/year.month.ncss.csv
	hdr:	YEAR,MONTH,EVIDCOUNT,WAVEFORMBYTES,WVIDCOUNT,SNCLEVIDCOUNT
	data:	year,month,evidcount,nbytes,wvidcount,snclevidcount
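
To show how the event csv layout above can be read, here is a small Python sketch that sums the WAVEFORMBYTES column. It assumes the file starts with the header row listed under "hdr:". The sample values are made up for illustration, and total_waveform_bytes is a hypothetical helper, not one of the programs in the bin directory.

```python
import csv
import io


def total_waveform_bytes(csv_file):
    """Sum the WAVEFORMBYTES column of an event csv file with a header row."""
    reader = csv.DictReader(csv_file)
    return sum(int(row["WAVEFORMBYTES"]) for row in reader)


if __name__ == "__main__":
    # Made-up sample row; real files live under data/event_mseed_*.
    sample = io.StringIO(
        "YEAR,MONTH,EVIDCOUNT,WAVEFORMBYTES,WVIDCOUNT,SNCLEVIDCOUNT\n"
        "2015,11,120,1048576,3400,3400\n"
    )
    print(total_waveform_bytes(sample))
```

The cont_mseed and gps files are listed with a data line only, so they would need csv.reader with positional columns instead of DictReader.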

------------------------------------------------------------------------------

Programs for getting information on the NCEDC archive size:

==============================================================================

To use:

1.  Edit
	../setup.csh
    to update the years for the various scans and computations.

2.  Source ../setup.csh

3.  Run
	gen_sql_event_mseed_all
    to run the 2 scripts that generate the SQL query files that
    query the database to get size info for the NCSS and EGS event miniSEED
    files.  The output of each SQL query file is written to a single file
    in the ../sql directory.

4.  Run
	compute_size_event_mseed_all
    to run yasql using the 2 SQL query files.  You will need the
    EGS_RO database password and the NETDC database password.

5.  Run
	get_cont_mseed_data_all
	get_gps_data_all
    to scan the NCEDC filesystem to get archive size info for the continuous
    mseed and gps data directories.  Scanned info is saved in monthly files
    in the ../data directory.

6.  Run
	compute_size_cont_mseed_all
	compute_size_gps_all
    to compute the various archive sizes based on the scanned files or the
    SQL queries to the database.  Output info is saved in monthly csv files
    in the ../data directory.

7.  Run
	run_merge_monthly_csv
    to merge ALL of the monthly csv files into a single csv file that can be
    imported into the ncedc_archive_size.xls spreadsheet.  Output file
    is saved in the ../results directory.
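
The merge in the last step can be pictured with the following Python sketch. It is not the actual merge_monthly_csv program, whose output layout is not documented here. It only illustrates, under that caveat, how per-category monthly rows of (year, month, nbytes) could be combined into one row per month. The function name and the category names are hypothetical.

```python
from collections import defaultdict


def merge_monthly(rows_by_category):
    """Combine (year, month, nbytes) rows into one row per month.

    rows_by_category maps a category name to an iterable of
    (year, month, nbytes) tuples.  Each output row is
    (year, month, nbytes for each category in sorted name order);
    months missing from a category get 0.
    """
    categories = sorted(rows_by_category)
    merged = defaultdict(lambda: dict.fromkeys(categories, 0))
    for category, rows in rows_by_category.items():
        for year, month, nbytes in rows:
            merged[(year, month)][category] += nbytes
    return [
        (year, month, *(merged[(year, month)][c] for c in categories))
        for year, month in sorted(merged)
    ]


if __name__ == "__main__":
    # Made-up byte counts, for illustration only.
    for row in merge_monthly({
        "cont_mseed": [(2015, 1, 5), (2015, 2, 7)],
        "gps": [(2015, 1, 10)],
    }):
        print(",".join(str(field) for field in row))
```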

==============================================================================

Programs in this directory:

1.  Programs to generate SQL requests (put in ../sql)
gen_sql_event_mseed_all
gen_sql_event_mseed_egs
gen_sql_event_mseed_ncss

2.  Programs to scan archive file system for all years.
get_cont_mseed_data_all
get_gps_data_all

3.  Programs to scan archive file system for specific year
    (run by the _all programs above).
get_cont_mseed_data
get_gps_data

4.  Programs that use the scanned file system data and
    sql requests to compute the size of the NCEDC archive.
compute_size_cont_mseed_all
compute_size_event_mseed_all
compute_size_gps_all
 
5.  Programs used by the above _all programs.
compute_size_cont_mseed
compute_size_gps

6.  Program to merge the monthly csv files into a single csv file.
merge_monthly_csv

7.  Support programs.
compute_size_net_sta
sum_reduce
------------------------------------------------------------------------------

ncedc_data_archive.xls
	Updated:  2015/12/11

	This spreadsheet contains multiple sheets.
1.	sheet 1:	NCEDC_Data		- COPIED data from most recent merged.csv
						  See instructions below.
2.	sheet 2:	NCEDC_Summary		- Summary info using formulas that reference
						  data from NCEDC_Data.
3.	sheet 3:	NCEDC_Summary_Year	- Summary info using formulas that reference
						  data from NCEDC_Summary sheet.

To update data in this spreadsheet:

0.  MAKE A BACKUP COPY OF THE ncedc_data_archive.xls.

1.  Run merge_monthly_csv to create a csv file with all ncedc archive size data.
	bin/merge_monthly_csv > results/merged.csv

2.  Open the ncedc_data_archive.xls spreadsheet (e.g. with soffice).

3.  Import the new csv file into a NEW SHEET in the spreadsheet.

4.  Select ALL of the data in the NEW SHEET, copy it, and paste it into
	the NCEDC_Data sheet.   You have to copy and paste because fields
	in the other NCEDC_* sheets reference fields in the NCEDC_Data sheet.

When you want to add a new year to the spreadsheet, you will have
to CAREFULLY add rows to the NCEDC_Summary* sheets and make sure that
they reference the appropriate fields in the NCEDC_Data sheet and the
NCEDC_Summary sheet.

Summary Estimate

  • On dcmgr@transform a cron job runs at the end of each month and generates the following information in /home/dcmgr/misc/NCEDC.output:
NCEDC Archive as of:
Thu Mar 31 23:55:01 PDT 2022

Total archive:
Filesystem         Size  Used Avail Use% Mounted on
strike:/sam/ncedc  250T  169T   82T  68% /data/ncedc

Continuous MiniSEED data for current year:
454G	BG/2022
629G	BK/2022
84G	BP/2022
688M	CC/2022
156G	CE/2022
312G	CI/2022
5.7G	GM/2022
1.6G	GS/2022
576G	NC/2022
110G	NN/2022
181G	NP/2022
41G	PB/2022
19G	PG/2022
20G	SB/2022
7.0G	SF/2022
58G	UO/2022
32G	UW/2022
62G	WR/2022
2.7T	total

GPS data for current year:
151G	gps/highrate/raw/2022
56G	gps/highrate/rinex/2022
5.1G	gps/rt/BK/2022
13G	gps/rt/CI/2022
4.0K	gps/rt/events/2022
18G	gps/rt/NC/2022
45G	gps/rt/PB/2022
88G	gps/rt/PW/2022
373G	total

Total Continuous MiniSEED data:
17T	BG
28T	BK
7.3T	BP
5.0G	CC
2.2T	CE
12T	CI
28K	db
91G	GM
78G	GS
44T	NC
4.8T	NN
9.7T	NP
6.6T	PB
2.6T	PG
325G	SB
5.3T	SF
906G	TA
7.7G	UL
219G	UO
2.8G	US
108G	UW
3.6T	WR
143T	total

Total Event data:
0	events/active
0	events/active22
1.1T	events/EGSEVT
3.0T	events/NCEVT
31G	events/SFEVT
4.1T	total

Total GPS data:
11T	gps

Total Misc data sets:
3.5T	misc

Continuous data daily rate:
5.1G	BG/2022/2022.075
7.1G	BK/2022/2022.075
981M	BP/2022/2022.075
13M	CC/2022/2022.075
1.8G	CE/2022/2022.075
3.6G	CI/2022/2022.075
65M	GM/2022/2022.075
18M	GS/2022/2022.075
6.5G	NC/2022/2022.075
1.3G	NN/2022/2022.075
2.1G	NP/2022/2022.075
491M	PB/2022/2022.075
236M	PG/2022/2022.075
230M	SB/2022/2022.075
637M	UO/2022/2022.075
364M	UW/2022/2022.075
734M	WR/2022/2022.075
31G	total

GPS data daily rate:
1.7G	raw/2022/2022.075
624M	rinex/2022/2022.075
2.3G	total
21M	BK/2022/2022.075
150M	CI/2022/2022.075
596M	NC/2022/2022.075
540M	PB/2022/2022.075
1.1G	PW/2022/2022.075
2.3G	total
operations/db_ncedc/data_volume.txt · Last modified: 2022/04/03 12:50 by stephane