Metrics analysis

Distribution Information (From DB)

On transform: (as dcmgr)
- Edit and execute the SQL scripts req_by_format_prog.sql, dist_by_format_prog.sql & dist_total.sql. It will generate the files req_by_format_prog.csv, dist_by_format_prog.csv & dist_total.csv for all the data in the database.

[dcmgr@transform analysis]# cd /data/dc5/reporting.NCEDC/analysis
[dcmgr@transform analysis]# vi req_by_format_prog.sql dist_by_format_prog.sql dist_total.sql
[dcmgr@transform analysis]# yasql ncdist@dcucb @req_by_format_prog.sql
[dcmgr@transform analysis]# yasql ncdist@dcucb @dist_by_format_prog.sql
[dcmgr@transform analysis]# yasql ncdist@dcucb @dist_total.sql

Run the csv_b2g script to convert bytes into gigabytes:

[dcmgr@transform analysis]$ ./csv_b2g -f3 dist_by_format_prog.csv 

FORMAT,PROG,NBYTES
assemble,httpd,0.461
cat,ftp,0.014
cat,httpd,44.362
catalog,fdsnws-event,145.248
gps,ftp,1778.876
gps,httpd,279.055
metadata,fdsnws-station,1362.311
metadata,ncedcws-dataless,1.856
metadata,ncedcws-resp,0.103
metadata,ncedcws-sacpz,0.382
mseed,fdsnws-dataselect,40431.028
mseed,ncedcws-eventdata,10.793
mseed,sws,594.182
rawdata,httpd,0.001
xmldata,ftp,68.643
xmldata,httpd,0.005
[dcmgr@transform analysis]$ ./csv_b2g -f1 dist_total.csv 

NBYTES
44717.320
[dcmgr@transform analysis]$

Geolocation Information (From DB)

On transform: (as dcmgr)
- Edit the script sql.dist_by_type to specify the time interval (yyyymm for start and end month):

[dcmgr@transform analysis]# cd /data/dc5/reporting.NCEDC/analysis
[dcmgr@transform analysis]# vi sql.dist_by_type

Run the script sql.dist_by_type. It will generate 3 output files:
- count.data.csv - for all types of data (mseed, gps, xml, raw, assemble, …).
- count.data.cat - for all catalog queries.
- count.data.meta - for all metadata queries.

[dcmgr@transform analysis]# ./sql.dist_by_type

Run the script geolocate.pl to get geolocation information for the above files, e.g.:

[dcmgr@transform analysis]# ./geolocate.pl count.data.csv	-> # creates count.data.csv.geo

Notes:

Notes based on current data distribution processing.
1.  IPADDR is NEVER NULL.
2.  DOMAIN is NEVER NULL.
3.  IPADDR is set to '-' when no IP address is available.
4.  DOMAIN is set to the ipaddress when no domain is available.

So, values of fields can be:
case:	IPADDR		DOMAIN
---------------------------------------
1	real_ip_addr	real_domainname
2	real_ip_addr	real_ip_addr
3	-		real_domainname

To get a count of everything once and only, perform 3 selections:
	select ...        where IPADDR != DOMAIN and IPADDR != '-'	# Case 1 - use either
	select ipaddr ... where IPADDR = DOMAIN				# Case 2 - only IP: domain set to IP 
	select domain ... where IPADDR = '-'				# Case 3 - only domain

==============================================================================
Query
	curl -s http://freegeoip.net/csv/128.32.149.11

Alternatives investigated but not used:
	curl -s http://geoip.nekudo.com/api/128.32.149.11/en/short
	https://dns.google.com/resolve?name=usgs.gov

curl -s http://dns.google.com/resolve\?name=hotmail.com
{"Status": 0,"TC": false,"RD": true,"RA": true,"AD": false,"CD": false,"Question":[ {"name": "hotmail.com.","type": 1}],"Answer":[ {"name": "hotmail.com.","type": 1,"TTL": 3568,"data": "157.56.198.220"},{"name": "hotmail.com.","type": 1,"TTL": 3568,"data": "65.55.118.92"}]}

{"Status": 0,
 "TC": false,
  "RD": true,
  "RA": true,
  "AD": false,
  "CD": false,
  "Question":[ 
	{"name": "hotmail.com.","type": 1}
   ],
   "Answer":[ 
	{"name": "hotmail.com.","type": 1,"TTL": 3568,"data": "157.56.198.220"},
	{"name": "hotmail.com.","type": 1,"TTL": 3568,"data": "65.55.118.92"}
   ]
}

==============================================================================

# 1.  Identify what is a domain name vs a hostname.
# 2.  While domain name, 
		use dns.google.com, and take result from "data" attribute.
# 3.  Lookup with 

rake% curl -s http://dns.google.com/resolve\?name=icjta.csic.es
{"Status": 3,"TC": false,"RD": true,"RA": true,"AD": false,"CD": false,"Question":[ {"name": "icjta.csic.es.","type": 1}],"Authority":[ {"name": "csic.es.","type": 6,"TTL": 1611,"data": "olmo.csic.es. hostmaster.csic.es. 2010042607 86400 7200 2592000 86400"}]}
rake% curl -s http://dns.google.com/resolve\?name=olmo.csic.es
{"Status": 0,"TC": false,"RD": true,"RA": true,"AD": false,"CD": false,"Question":[ {"name": "olmo.csic.es.","type": 1}],"Answer":[ {"name": "olmo.csic.es.","type": 1,"TTL": 70114,"data": "161.111.10.3"}]}
rake% curl -s http://freegeoip.net/csv/olmo.csic.es
161.111.10.3,ES,Spain,MD,Madrid,Madrid,28001,Europe/Madrid,40.4167,-3.6838,0
rake% curl -s http://freegeoip.net/csv/161.111.10.3
161.111.10.3,ES,Spain,MD,Madrid,Madrid,28001,Europe/Madrid,40.4167,-3.6838,0

Yearly Summary Information (From files)

On transform: (as dcmgr)

[dcmgr@transform csv]$ cd /data/dc5/reporting.NCEDC/csv/

Run the usr_csv_year.csh script to generate the number of distinct users for a given year, e.g.:

[dcmgr@transform csv]$ usr_csv_year.csh 2012
11003
[dcmgr@transform csv]$

Run the sum_csv_year.csh script to generate the total amount of data for a given year, e.g.:

[dcmgr@transform csv]$ sum_csv_year.csh 2012
10975G
[dcmgr@transform csv]$

Run the sum_csv_year_sta.csh script to generate the total amount of data for a given year and list of stations (defined in sum_csv_sta), e.g.:

[dcmgr@transform csv]$ sum_csv_year_sta.csh 2018
271G
[dcmgr@transform csv]$