Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and...

Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset
A Technology Challenge Case Study
David Lary, Hanson Center for Space Science, University of Texas at Dallas

Description

The next generation of cloud computing systems will need to:
1. Handle multiple massive datasets (large storage)
2. Provide massive memory per node
3. Facilitate automation and scheduling of repetitive tasks
4. Include high-level technical languages (e.g. Matlab)

Transcript of Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and...

Page 1: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset

A Technology Challenge Case Study

David Lary, Hanson Center for Space Science, University of Texas at Dallas

Page 2: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

What?

Page 3: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

[Figure: particle-size chart spanning 0.0001 μm to 1000 μm, organized into gas molecules, types of particulates, types of dust, and types of biological material. Examples placed along the size axis include viruses, tobacco smoke, soot, oil smoke, smog, bacteria, cat allergens, fly ash, cement dust, house dust mite allergens, mold spores, pollen, and heavy, settling, and suspended atmospheric dust, with a hair, a pin, and a cell shown for scale. Health effects are marked by size: decreased lung function < 10 μm, skin & eye disease < 2.5 μm, tumors < 1 μm, cardiovascular disease < 0.1 μm. Particulate classes are marked: PM10, the PM10-2.5 coarse fraction, PM2.5, and PM0.1 ultrafine particles.]


Table 1. PM and health outcomes (modified from Ruckerl et al. (2006)).

Health Outcomes | Short-term Studies: PM10 | PM2.5 | UFP | Long-term Studies: PM10 | PM2.5 | UFP
Mortality
  All causes | xxx | xxx | x | xx | xx | x
  Cardiovascular | xxx | xxx | x | xx | xx | x
  Pulmonary | xxx | xxx | x | xx | xx | x
Pulmonary effects
  Lung function, e.g., PEF | xxx | xxx | xx | xxx | xxx |
  Lung function growth |  |  |  | xxx | xxx |
Asthma and COPD exacerbation
  Acute respiratory symptoms |  | xx | x | xxx | xxx |
  Medication use |  |  | x |  |  |
  Hospital admission | xx | xxx | x |  |  |
Lung cancer
  Cohort |  |  |  | xx | xx | x
  Hospital admission |  |  |  | xx | xx | x
Cardiovascular effects
  Hospital admission | xxx | xxx |  | x | x |
ECG-related endpoints
  Autonomic nervous system | xxx | xxx | xx |  |  |
  Myocardial substrate and vulnerability |  | xx | x |  |  |
Vascular function
  Blood pressure | xx | xxx | x |  |  |
  Endothelial function | x | xx | x |  |  |
Blood markers
  Pro-inflammatory mediators | xx | xx | xx |  |  |
  Coagulation blood markers | xx | xx | xx |  |  |
  Diabetes | x | xx | x |  |  |
  Endothelial function | x | x | xx |  |  |
Reproduction
  Premature birth | x | x |  |  |  |
  Birth weight | xx | x |  |  |  |
  IUR/SGA | x | x |  |  |  |
Fetal growth
  Birth defects | x |  |  |  |  |
  Infant mortality | xx | x |  |  |  |
  Sperm quality | x | x |  |  |  |
Neurotoxic effects
  Central nervous system |  | x | xx |  |  |

x, few studies; xx, many studies; xxx, large number of studies.


Page 4: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Why?

Page 5: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

How?

We used around 40 different Big Data sets from satellites, meteorology, demographics, scraped web sites, and social media to estimate PM2.5. The plot below shows the average of 5,935 days from August 1, 1997 to the present.
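As a minimal sketch only (not the production workflow; the daily file pattern pm25_daily_*.nc and the variable name pm25 are hypothetical), the per-pixel averaging behind such a multi-year map could be written in Matlab as:

    % Accumulate a per-pixel multi-year mean from daily gridded PM2.5 estimates.
    files = dir('pm25_daily_*.nc');            % hypothetical daily NetCDF files
    runningSum = [];
    counts = [];
    for k = 1:numel(files)
        daily = ncread(fullfile(files(k).folder, files(k).name), 'pm25');
        if isempty(runningSum)
            runningSum = zeros(size(daily));
            counts     = zeros(size(daily));
        end
        valid = ~isnan(daily);                 % skip pixels with no retrieval
        runningSum(valid) = runningSum(valid) + daily(valid);
        counts(valid)     = counts(valid) + 1;
    end
    meanPM25 = runningSum ./ max(counts, 1);   % long-term average map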

Page 6: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Which Platform?

Page 7: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Which Platform?

Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)

Page 8: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Which Platform?

Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections

Page 9: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Which Platform?

Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape web sites for data

Page 10: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Which Platform?

Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape web sites for data
4. A high-level language with a wide range of optimized toolboxes (Matlab)

Page 11: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Which Platform?

Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape web sites for data
4. A high-level language with a wide range of optimized toolboxes (Matlab)
5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)

Page 12: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Which Platform?

Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape web sites for data
4. A high-level language with a wide range of optimized toolboxes (Matlab)
5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs

Page 13: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Which Platform?

Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape web sites for data
4. A high-level language with a wide range of optimized toolboxes (Matlab)
5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)

Page 14: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

How?

Existing: Social Media; Socioeconomic and Census; News Feeds; Environmental; Weather; Satellite; Sensors; Health; Economic

New: UAVs; Smart Dust; Autonomous Cars; Sensors

Simulation: Global Weather Models; Economic Models; Earthquake Models

Data -> Machine Learning -> Insight

The same approach is highly relevant for the validation and optimal exploitation of the next generation of satellites, e.g. the upcoming NASA Decadal Survey Missions.

Page 15: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

How?

California Children Example

Page 16: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

[Excerpt from the manuscript "Remote Sensing, Machine Learning and PM2.5", p. 4:]

Random Forests, etc.) that can provide multi-variate non-linear non-parametric regression or classification based on a training dataset. We have tried all of these approaches for estimating PM2.5 and found the best by far to be Random Forests.

B. Random Forests

In this paper we use one of the most accurate machine learning approaches currently available, namely Random Forests [53], [54]. Random forests are composed of an ensemble of decision trees [55]. Random forests have many advantages, including their ability to work efficiently with large datasets, accommodate thousands of input variables, provide a measure of the relative importance of the input variables in the regression, and effectively handle datasets containing missing data.

Each tree in the random forest is a decision tree. A decision tree is a tree-like graph that can be used for classification or regression. Given a training dataset, a decision tree can be grown to predict the value of a particular output variable based on a set of input variables [55]. The performance of the decision tree regression can be improved upon if, instead of using a single decision tree, we use an ensemble of independent trees, namely a random forest [53], [54]. This approach is referred to as tree bootstrap aggregation, or tree bagging for short.

Bootstrapping is a simple way to assign a measure of accuracy to a sample estimate or a distribution. This is achieved by repeatedly randomly resampling the original dataset to provide an ensemble of independently resampled datasets. Each member of the ensemble of independently resampled datasets is then used to grow an independent decision tree.

The statistics of random sampling mean that any given tree is trained on approximately 66% of the training dataset, and so approximately 33% of the training dataset is not used in training any given tree. Which 66% is used is different for each of the trees in the random forest. This is a very rigorous independent sampling strategy that helps minimize overfitting of the training dataset (e.g. learning the noise). In addition, in our implementation we keep back a random sample of data not used in the training for independent validation and uncertainty estimation.
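[Aside, not part of the original manuscript: the standard out-of-bag calculation behind these approximate percentages is that, for a bootstrap sample of size $n$ drawn with replacement from $n$ records, the probability that any given record is never selected is
\[
\left(1 - \tfrac{1}{n}\right)^{n} \;\longrightarrow\; e^{-1} \approx 0.37 \qquad (n \to \infty),
\]
so roughly two thirds of the records are in-bag and roughly one third are out-of-bag for each tree, which is the approximate 66%/33% split quoted above.]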

The members of the original training dataset not used in a given bootstrap resample are referred to as out of bag for this tree. The final regression estimate that is provided by the random forest is simply the average of the ensemble of individual predictions in the random forest.

A further advantage of decision trees is that they can provide us the relative importance of each of the inputs in constructing the final multi-variate non-linear non-parametric regression model (e.g. Tables II and III).
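[Illustration, not part of the manuscript: since the deck names Matlab as the high-level language used, a minimal sketch of this style of random-forest regression with the Statistics and Machine Learning Toolbox might look like the following, where X, y, and Xnew are hypothetical predictor and target arrays.]

    % X: N-by-P matrix of co-located inputs (satellite products, meteorological
    %    analyses, population density, ...); y: N-by-1 in-situ PM2.5 target.
    rng(0);                                               % reproducibility
    Mdl = TreeBagger(200, X, y, ...
        'Method', 'regression', ...
        'OOBPrediction', 'on', ...                        % keep out-of-bag predictions
        'OOBPredictorImportance', 'on');                  % permutation-based importance

    oobRMSE    = sqrt(oobError(Mdl, 'Mode', 'ensemble')); % out-of-bag skill estimate
    importance = Mdl.OOBPermutedPredictorDeltaError;      % rank the inputs (cf. Table II)
    pm25Hat    = predict(Mdl, Xnew);                      % estimate PM2.5 for new scenes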

C. Datasets Used in Machine Learning Regression

1) PM2.5 Data: As many hourly PM2.5 observations as possible that were available from the launch of Terra and Aqua to the present were used in this study. For the United States this data came from the EPA Air Quality System (AQS) http://www.epa.gov/ttn/airs/airsaqs/

Table II. Variables used in the machine learning estimate of PM2.5 for the MODIS Collection 5.1 products for the Terra and Aqua Deep Blue algorithm, sorted by their importance. The most important variable for a given regression is placed first with a rank of 1.

Terra Deep Blue

Rank | Source | Variable | Type
1 | Population Density | Input
2 | Satellite Product | Tropospheric NO2 Column | Input
3 | Meteorological Analyses | Surface Specific Humidity | Input
4 | Satellite Product | Solar Azimuth | Input
5 | Meteorological Analyses | Surface Wind Speed | Input
6 | Satellite Product | White-sky Albedo at 2,130 nm | Input
7 | Satellite Product | White-sky Albedo at 555 nm | Input
8 | Meteorological Analyses | Surface Air Temperature | Input
9 | Meteorological Analyses | Surface Layer Height | Input
10 | Meteorological Analyses | Surface Ventilation Velocity | Input
11 | Meteorological Analyses | Total Precipitation | Input
12 | Satellite Product | Solar Zenith | Input
13 | Meteorological Analyses | Air Density at Surface | Input
14 | Satellite Product | Cloud Mask Qa | Input
15 | Satellite Product | Deep Blue Aerosol Optical Depth 470 nm | Input
16 | Satellite Product | Sensor Zenith | Input
17 | Satellite Product | White-sky Albedo at 858 nm | Input
18 | Meteorological Analyses | Surface Velocity Scale | Input
19 | Satellite Product | White-sky Albedo at 470 nm | Input
20 | Satellite Product | Deep Blue Angstrom Exponent Land | Input
21 | Satellite Product | White-sky Albedo at 1,240 nm | Input
22 | Satellite Product | Scattering Angle | Input
23 | Satellite Product | Sensor Azimuth | Input
24 | Satellite Product | Deep Blue Surface Reflectance 412 nm | Input
25 | Satellite Product | White-sky Albedo at 1,640 nm | Input
26 | Satellite Product | Deep Blue Aerosol Optical Depth 660 nm | Input
27 | Satellite Product | White-sky Albedo at 648 nm | Input
28 | Satellite Product | Deep Blue Surface Reflectance 660 nm | Input
29 | Satellite Product | Cloud Fraction Land | Input
30 | Satellite Product | Deep Blue Surface Reflectance 470 nm | Input
31 | Satellite Product | Deep Blue Aerosol Optical Depth 550 nm | Input
32 | Satellite Product | Deep Blue Aerosol Optical Depth 412 nm | Input
  | In-situ Observation | PM2.5 | Target

Aqua Deep Blue

Rank | Source | Variable | Type
1 | Satellite Product | Tropospheric NO2 Column | Input
2 | Satellite Product | Solar Azimuth | Input
3 | Meteorological Analyses | Air Density at Surface | Input
4 | Satellite Product | Sensor Zenith | Input
5 | Satellite Product | White-sky Albedo at 470 nm | Input
6 | Population Density | Input
7 | Satellite Product | Deep Blue Surface Reflectance 470 nm | Input
8 | Meteorological Analyses | Surface Air Temperature | Input
9 | Meteorological Analyses | Surface Ventilation Velocity | Input
10 | Meteorological Analyses | Surface Wind Speed | Input
11 | Satellite Product | White-sky Albedo at 858 nm | Input
12 | Satellite Product | White-sky Albedo at 2,130 nm | Input
13 | Satellite Product | Solar Zenith | Input
14 | Meteorological Analyses | Surface Layer Height | Input
15 | Satellite Product | White-sky Albedo at 1,240 nm | Input
16 | Satellite Product | Deep Blue Surface Reflectance 660 nm | Input
17 | Satellite Product | Deep Blue Surface Reflectance 412 nm | Input
18 | Satellite Product | White-sky Albedo at 1,640 nm | Input
19 | Satellite Product | Sensor Azimuth | Input
20 | Satellite Product | Scattering Angle | Input
21 | Meteorological Analyses | Surface Velocity Scale | Input
22 | Satellite Product | Cloud Mask Qa | Input
23 | Satellite Product | White-sky Albedo at 555 nm | Input
24 | Satellite Product | Deep Blue Aerosol Optical Depth 550 nm | Input
25 | Satellite Product | Deep Blue Aerosol Optical Depth 660 nm | Input
26 | Satellite Product | Deep Blue Aerosol Optical Depth 412 nm | Input
27 | Meteorological Analyses | Total Precipitation | Input
28 | Satellite Product | White-sky Albedo at 648 nm | Input
29 | Satellite Product | Deep Blue Aerosol Optical Depth 470 nm | Input
30 | Satellite Product | Deep Blue Angstrom Exponent Land | Input
31 | Meteorological Analyses | Surface Specific Humidity | Input
32 | Satellite Product | Cloud Fraction Land | Input
  | In-situ Observation | PM2.5 | Target

detaildata/downloadaqsdata.htm and AirNOW http://www.airnow.gov. In Canada the data came from http://www.etc-cte.ec.gc.ca/napsdata/main.aspx. In Europe the data came from AirBase, the European air quality database maintained by the European Environment Agency and the Euro-

Page 17: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning
Page 18: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Hourly measurements from 53 countries from 1997-present

A lot of measurements, but notice the large gaps!

Page 19: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Gaps are inevitable because of the infrastructure and cost associated with making the measurements.

Page 20: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Challenge 1: Obtaining the in-situ PM2.5 data

Real-time data from:

1. EPA AirNow data for USA and Canada

2. EEA data for Europe

3. Tasmania and Australia

4. Israel

5. Russia

6. Asia and Latin America by scraping http://aqicn.org/map/

7. Harvesting social media (Twitter feeds from US Embassies)

Relatively low bandwidth from multiple sites every 5 minutes
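A minimal sketch (illustrative only; the HTML attribute used in the regular expression is hypothetical, and a real harvester would follow the site's actual structure or API) of pulling one of these low-bandwidth sources from Matlab every few minutes:

    % Fetch the page listing current readings (URL as given on the slide).
    html = webread('http://aqicn.org/map/');

    % Hypothetical parsing step: extract numeric readings with a regular expression.
    tokens    = regexp(html, 'data-aqi="(\d+)"', 'tokens');   % illustrative pattern
    aqiValues = cellfun(@(t) str2double(t{1}), tokens);

    % Time-stamp each harvest so the 5-minute pulls build up a time series.
    harvest = table(repmat(datetime('now'), numel(aqiValues), 1), aqiValues(:), ...
                    'VariableNames', {'Time', 'AQI'});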

Page 21: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Challenge 2 (easier): Obtaining the Satellite & Meteorological Data

Real-time data from:

1. Multiple satellites: MODIS Terra, MODIS Aqua, SeaWiFS, VIIRS/NPP, etc.

2. Global Meteorological Analyses

High bandwidth from a few sites every 1 to 3 hours
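A hedged sketch of this kind of scheduled bulk pull in Matlab (the server URL and file name below are placeholders, not the real data endpoints, which typically also require authentication):

    granuleURL = 'https://data.example.gov/MODIS/granule.hdf';   % placeholder URL
    localFile  = fullfile('incoming', 'granule.hdf');
    if ~exist('incoming', 'dir'); mkdir('incoming'); end
    opts = weboptions('Timeout', 300);            % large files over long-haul links
    websave(localFile, granuleURL, opts);         % high bandwidth, few sites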

Page 22: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Challenge 3: Combine multiple BigData Sets with Machine Learning

A large-member machine learning ensemble run on massively parallel computing to produce the PM2.5 data product

Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)

Development time was drastically reduced by using a high-level language (Matlab) that can easily exploit parallel execution on both multiple CPUs and GPUs.

Massively parallel every 3 hours

A high-level language that can readily use CPUs and GPUs
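A minimal sketch of how both kinds of parallelism are exposed from Matlab (illustrative only; the ensemble size and array names are hypothetical):

    % CPU parallelism: grow the forest across a pool of workers.
    if isempty(gcp('nocreate'))
        parpool('local');
    end
    opts = statset('UseParallel', true);
    Mdl  = TreeBagger(500, X, y, 'Method', 'regression', 'Options', opts);

    % GPU parallelism: element-wise pre-processing of large gridded arrays.
    aodGPU  = gpuArray(single(aodGrid));          % move a big array to the GPU
    normGPU = (aodGPU - mean(aodGPU(:))) ./ std(aodGPU(:));
    aodNorm = gather(normGPU);                    % bring the result back to the CPU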

Page 23: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Challenge 4: Continual Performance Improvement

We are currently on around the 400th version of the system.

Have been making continuous improvements in:

1. Coverage of in-situ training data set

2. Inclusion of new satellite sensors

3. Additional Big Data sets that help improve the fidelity of the non-linear, non-parametric, non-Gaussian multivariate machine learning fits

4. Using many alternative machine learning strategies

5. Estimating uncertainties

6. This requires frequent reprocessing of the entire multi-year record from 1997-present

Persistent massive data storage, much more than the usual scratch space at HPC centers

Page 24: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Fully Automated Workflow

Requires the ability to schedule automated tasks
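On most systems this is cron or a batch scheduler; within Matlab itself a hedged sketch using timer objects (fetchInSitu and runRegression are hypothetical workflow functions, not the actual pipeline) could be:

    % Harvest in-situ data every 5 minutes; rerun the global regression every 3 hours.
    harvestTimer    = timer('ExecutionMode', 'fixedRate', 'Period', 5*60, ...
                            'TimerFcn', @(~,~) fetchInSitu());
    regressionTimer = timer('ExecutionMode', 'fixedRate', 'Period', 3*3600, ...
                            'TimerFcn', @(~,~) runRegression());
    start([harvestTimer, regressionTimer]);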

Page 25: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Requires the ability to disseminate results in multiple formats, including FTP and as web and map services
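A minimal sketch of the FTP part of that dissemination from Matlab (the host, credentials, directory, and file name are placeholders):

    % Push the latest product to an FTP site so web and map services can serve it.
    conn = ftp('ftp.example.org', 'username', 'password');   % placeholder credentials
    cd(conn, 'pub/pm25');
    mput(conn, 'pm25_daily_latest.nc');
    close(conn);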

Page 26: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning
Page 27: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning
Page 28: Requirements for next generation of Cloud Computing: Case study with multiple Big Datasets and Machine Learning

Key System Requirements: not always available on current HPC systems

Requirements:
1. Large persistent storage for multiple Big Data sets, 100 TB+ (otherwise the scratch-space time limit expires before the massive datasets have been processed)
2. High-bandwidth connections
3. Ability to harvest social media (e.g. Twitter) and scrape web sites for data
4. A high-level language with a wide range of optimized toolboxes (Matlab)
5. Algorithms capable of dealing with massive non-linear, non-parametric, non-Gaussian multivariate datasets (13,000+ variables)
6. Easy use of multiple GPUs and CPUs
7. Ability to schedule tasks at precise times and time intervals to automate workflows (in this case tasks executed at intervals of 5 minutes, 15 minutes, 1 hour, 3 hours, and 1 day)

Thank you!