====== Metrics analysis ======

=== Distribution Information (From DB) ===

  * On **transform**: (as **dcmgr**)
     * Edit and execute the SQL scripts **req_by_format_prog.sql**, **dist_by_format_prog.sql** and **dist_total.sql**. They generate the files //req_by_format_prog.csv//, //dist_by_format_prog.csv// and //dist_total.csv//, covering all the data in the database.
<code>
[dcmgr@transform analysis]$ cd /data/dc5/reporting.NCEDC/analysis
[dcmgr@transform analysis]$ vi req_by_format_prog.sql dist_by_format_prog.sql dist_total.sql
[dcmgr@transform analysis]$ yasql ncdist@dcucb @req_by_format_prog.sql
[dcmgr@transform analysis]$ yasql ncdist@dcucb @dist_by_format_prog.sql
[dcmgr@transform analysis]$ yasql ncdist@dcucb @dist_total.sql
</code>

     * Run the **csv_b2g** script to convert bytes into gigabytes:
<code>
[dcmgr@transform analysis]$ ./csv_b2g -f3 dist_by_format_prog.csv 

FORMAT,PROG,NBYTES
assemble,httpd,0.461
cat,ftp,0.014
cat,httpd,44.362
catalog,fdsnws-event,145.248
gps,ftp,1778.876
gps,httpd,279.055
metadata,fdsnws-station,1362.311
metadata,ncedcws-dataless,1.856
metadata,ncedcws-resp,0.103
metadata,ncedcws-sacpz,0.382
mseed,fdsnws-dataselect,40431.028
mseed,ncedcws-eventdata,10.793
mseed,sws,594.182
rawdata,httpd,0.001
xmldata,ftp,68.643
xmldata,httpd,0.005
[dcmgr@transform analysis]$ ./csv_b2g -f1 dist_total.csv 

NBYTES
44717.320
[dcmgr@transform analysis]$ 
</code>
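
The conversion step above can be sketched as follows. This is only an illustration, not the source of **csv_b2g**: the function name ''bytes_to_gb'', the 1-based field index (mirroring ''-fN''), and the divisor of 1024^3 bytes per gigabyte are assumptions, and the real script may well divide by 10^9 instead. The byte counts in the usage example are made-up sample input.

```python
import csv
import io

# Assumption: binary gigabytes. The real csv_b2g script may use 10**9.
BYTES_PER_GB = 1024 ** 3


def bytes_to_gb(csv_text, field, divisor=BYTES_PER_GB):
    """Return csv_text with the 1-based column `field` converted from bytes to GB.

    The first row is treated as a header and passed through unchanged.
    Converted values are written with three decimal places.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row_number, row in enumerate(reader):
        if row_number > 0 and row:
            row[field - 1] = f"{int(row[field - 1]) / divisor:.3f}"
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical input in the layout of dist_by_format_prog.csv.
    sample = "FORMAT,PROG,NBYTES\ncat,ftp,1073741824\ngps,httpd,1610612736\n"
    print(bytes_to_gb(sample, 3), end="")
```

Taking the divisor as a parameter makes it easy to switch to decimal gigabytes if that is what the reports expect.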

=== Geolocation Information (From DB) ===

  * On **transform**: (as **dcmgr**)
     * Edit the script **sql.dist_by_type** to specify the time interval (yyyymm for start and end month):
<code>
[dcmgr@transform analysis]$ cd /data/dc5/reporting.NCEDC/analysis
[dcmgr@transform analysis]$ vi sql.dist_by_type
</code>

     * Run the script **sql.dist_by_type**. It generates three output files:
        * //count.data.csv// - for all types of data (mseed, gps, xml, raw, assemble, ...).
        * //count.data.cat// - for all catalog queries.
        * //count.data.meta// - for all metadata queries.
<code>
[dcmgr@transform analysis]$ ./sql.dist_by_type
</code>

     * Run the script **geolocate.pl** to get geolocation information for the above files, e.g.:
<code>
[dcmgr@transform analysis]$ ./geolocate.pl count.data.csv	# creates count.data.csv.geo
</code>

     * Notes:
<code>
Notes based on current data distribution processing.
1.  IPADDR is NEVER NULL.
2.  DOMAIN is NEVER NULL.
3.  IPADDR is set to '-' when no IP address is available.
4.  DOMAIN is set to the IP address when no domain is available.

So, values of fields can be:
case:	IPADDR		DOMAIN
---------------------------------------
1	real_ip_addr	real_domainname
2	real_ip_addr	real_ip_addr
3	-		real_domainname

To count every record exactly once, perform 3 selections:
	select ...        where IPADDR != DOMAIN and IPADDR != '-'	# Case 1 - use either
	select ipaddr ... where IPADDR = DOMAIN				# Case 2 - only IP: domain set to IP 
	select domain ... where IPADDR = '-'				# Case 3 - only domain

==============================================================================
Query
	curl -s http://freegeoip.net/csv/128.32.149.11

Alternatives investigated but not used:
	curl -s http://geoip.nekudo.com/api/128.32.149.11/en/short
	https://dns.google.com/resolve?name=usgs.gov

curl -s http://dns.google.com/resolve\?name=hotmail.com
{"Status": 0,"TC": false,"RD": true,"RA": true,"AD": false,"CD": false,"Question":[ {"name": "hotmail.com.","type": 1}],"Answer":[ {"name": "hotmail.com.","type": 1,"TTL": 3568,"data": "157.56.198.220"},{"name": "hotmail.com.","type": 1,"TTL": 3568,"data": "65.55.118.92"}]}

{"Status": 0,
 "TC": false,
  "RD": true,
  "RA": true,
  "AD": false,
  "CD": false,
  "Question":[ 
	{"name": "hotmail.com.","type": 1}
   ],
   "Answer":[ 
	{"name": "hotmail.com.","type": 1,"TTL": 3568,"data": "157.56.198.220"},
	{"name": "hotmail.com.","type": 1,"TTL": 3568,"data": "65.55.118.92"}
   ]
}

==============================================================================

# 1.  Identify what is a domain name vs a hostname.
# 2.  For a domain name,
		use dns.google.com, and take the result from the "data" attribute.
# 3.  Lookup with 

rake% curl -s http://dns.google.com/resolve\?name=icjta.csic.es
{"Status": 3,"TC": false,"RD": true,"RA": true,"AD": false,"CD": false,"Question":[ {"name": "icjta.csic.es.","type": 1}],"Authority":[ {"name": "csic.es.","type": 6,"TTL": 1611,"data": "olmo.csic.es. hostmaster.csic.es. 2010042607 86400 7200 2592000 86400"}]}
rake% curl -s http://dns.google.com/resolve\?name=olmo.csic.es
{"Status": 0,"TC": false,"RD": true,"RA": true,"AD": false,"CD": false,"Question":[ {"name": "olmo.csic.es.","type": 1}],"Answer":[ {"name": "olmo.csic.es.","type": 1,"TTL": 70114,"data": "161.111.10.3"}]}
rake% curl -s http://freegeoip.net/csv/olmo.csic.es
161.111.10.3,ES,Spain,MD,Madrid,Madrid,28001,Europe/Madrid,40.4167,-3.6838,0
rake% curl -s http://freegeoip.net/csv/161.111.10.3
161.111.10.3,ES,Spain,MD,Madrid,Madrid,28001,Europe/Madrid,40.4167,-3.6838,0
</code>
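
The case analysis and the two lookups in the notes above can be sketched as follows. This is a hedged illustration, not the logic of **geolocate.pl**: the function names are hypothetical, choosing the IP address in case 1 is an arbitrary pick (the notes say either value may be used), and the names in ''GEO_FIELDS'' are labels inferred from the sample output line rather than taken from any service documentation.

```python
import csv
import json


def lookup_key(ipaddr, domain):
    """Pick the single value to geolocate for one (IPADDR, DOMAIN) record."""
    if ipaddr == "-":
        return ("domain", domain)  # case 3: only the domain is known
    if ipaddr == domain:
        return ("ip", ipaddr)      # case 2: DOMAIN was filled with the IP
    return ("ip", ipaddr)          # case 1: either works; the IP is used here


def dns_answer_addresses(response_text):
    """Return the "data" values of the type 1 (A) answers in a DNS JSON reply.

    A reply without an "Answer" list (e.g. "Status": 3) yields an empty list.
    """
    reply = json.loads(response_text)
    return [a["data"] for a in reply.get("Answer", []) if a.get("type") == 1]


# Assumed labels for the 11 comma-separated fields of the sample output line.
GEO_FIELDS = [
    "ip", "country_code", "country_name", "region_code", "region_name",
    "city", "zip_code", "time_zone", "latitude", "longitude", "metro_code",
]


def parse_geo_line(line):
    """Split one CSV reply line into a dict keyed by GEO_FIELDS."""
    values = next(csv.reader([line.strip()]))
    return dict(zip(GEO_FIELDS, values))


if __name__ == "__main__":
    print(lookup_key("-", "olmo.csic.es"))
    geo = parse_geo_line(
        "161.111.10.3,ES,Spain,MD,Madrid,Madrid,28001,Europe/Madrid,40.4167,-3.6838,0"
    )
    print(geo["country_code"], geo["city"])
```

Because each record falls into exactly one of the three cases, summing over the keys returned by ''lookup_key'' counts every record once.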

=== Yearly Summary Information (From files) ===

  * On **transform**: (as **dcmgr**)
<code>
[dcmgr@transform csv]$ cd /data/dc5/reporting.NCEDC/csv/
</code>
  * Run the **usr_csv_year.csh** script to generate the number of distinct users for a given year, e.g.:
<code>
[dcmgr@transform csv]$ usr_csv_year.csh 2012
11003
[dcmgr@transform csv]$
</code>

  * Run the **sum_csv_year.csh** script to generate the total amount of data for a given year, e.g.:
<code>
[dcmgr@transform csv]$ sum_csv_year.csh 2012
10975G
[dcmgr@transform csv]$
</code>

  * Run the **sum_csv_year_sta.csh** script to generate the total amount of data for a given year and list of stations (defined in **sum_csv_sta**), e.g.:
<code>
[dcmgr@transform csv]$ sum_csv_year_sta.csh 2018
271G
[dcmgr@transform csv]$
</code>