Data Mining Foster
Ian Foster
Computation Institute
Argonne National Lab & University of Chicago
http://ianfoster.typepad.com
www.ci.uchicago.edu www.ci.anl.gov
Data
Mining
Grid
In the Next 50 Years,
We Must …
● Increase energy production by a factor of 5, while
reducing GHG emissions by a factor of 2 or more
Innovation
Innovation
as a Systems Problem
● Quasi-ubiquitous Internet …
● … connects many potential innovators
◆ Millions of scientists, billions of people
● Who need to leverage
◆ Enormous data of tremendous complexity
◆ Immensely powerful computing
◆ Experimental apparatus of great power
[Figure: research at the center, linked to data acquisition; advanced visualization and analysis; computational resources; imaging instruments; large-scale databases]
[Figure: hosting environment containing computers, a database, a file system, and a specialized resource]
Globus Downloads
[Figures: downloads in the last 24 hours; downloads in the last month]
First Generation Grids:
On-Demand/Batch Computing
Focus on aggregation of many resources for
massively (data-)parallel applications
EGEE
Globus
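The aggregation idea above can be illustrated with a small sketch: the input is split into independent chunks, each chunk is processed by a separate worker, and the partial results are combined. This is only a single-machine stand-in for a grid, written under stated assumptions: `analyze`, `split`, and `run_data_parallel` are hypothetical names, and a local thread pool plays the role of the aggregated resources.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(chunk):
    """Hypothetical per-chunk task; stands in for one independent grid job."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Split data into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def run_data_parallel(data, n_workers=4):
    """Process every chunk independently, then combine the partial results."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(analyze, chunks))
    return sum(partials)
```

Because the chunks share no state, the same pattern scales from a thread pool to batch jobs on many machines; only the submission mechanism changes.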
Applications:
High Energy Physics
Integrating Data and
Computing, on Demand
Public PUMA
Knowledge Base
Information about
proteins analyzed
against ~2 million
gene sequences
Back Office
Analysis on Grid
Millions of BLAST,
BLOCKS, etc., on
OSG and TeraGrid
Natalia Maltsev et al., http://compbio.mcs.anl.gov/puma2
Second Generation Grids:
Service-Oriented Science
● Empower many more users by enabling
on-demand access to services
● Grids become an enabling technology for
service-oriented science (or business)
◆ Grid infrastructures host services
◆ Grid technologies used to build services
Science
Gateways
Data
Integration!
caBIG Under the Covers
[Figure labels: Grid-Enabled Client; Grid Portal; Analytical Service (Tools 1-4); Grid Data Service (Gene Database, Protein Database, Image, Microarray); Research Center; Grid Services Infrastructure (Metadata, Registry, Query, Invocation, Security, etc.)]
LIGO Data Grid
LIGO Gravitational Wave Observatory
Birmingham
Cardiff
AEI/Golm
Replicating >1 Terabyte/day to 8 sites
>150 million replicas so far
MTBF = 1 month
www.globus.org/solutions
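To give a feel for the replication figures above, the following sketch converts a daily volume into a sustained network rate. It rests on assumptions the slide does not state: a decimal terabyte (10^12 bytes), an even transfer rate over 24 hours, and each of the 8 sites receiving the full daily volume. Since the slide says ">1 Terabyte/day", the results are lower bounds under those assumptions.

```python
TERABYTE = 10**12                 # decimal terabyte in bytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60
SITES = 8                         # number of replica sites, from the slide


def sustained_mbit_per_s(bytes_per_day):
    """Average rate in Mbit/s needed to move bytes_per_day within one day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


per_site = sustained_mbit_per_s(1 * TERABYTE)   # roughly 92.6 Mbit/s
all_sites = per_site * SITES                    # roughly 740.7 Mbit/s
print(f"per site: {per_site:.1f} Mbit/s, all sites: {all_sites:.1f} Mbit/s")
```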
The Angle Project
Social Informatics Data Grid
Bennett Bertenthal et al., www.sidgrid.org
A Few Example
Research Themes
● Service discovery, composition, provisioning
◆ SOA, virtualization, cloud computing, …
● Large-scale (distributed) computation
◆ E.g., Swift, Kepler, Taverna
● Provenance
◆ E.g., “Provenance Challenge”
● “Virtual organizations”
◆ E.g., attribute-based authorization, trust
● Integration of physical systems
◆ Optimization of end-to-end workflows
Security Services for
Virtual Organization Policy
● Attribute Authority (ATA)
◆ Issue signed attribute assertions (incl. identity,
delegation & mapping)
● Authorization Authority (AZA)
◆ Decisions based on assertions & policy
[Figure labels: VO; VO Member Attribute; User B; VO A Service; VO B Service]
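The split between the two authorities can be sketched as follows: one function issues a signed attribute assertion, and another verifies the signature and then decides against a policy. This is a toy illustration, not the Globus security implementation. Production systems would typically use public-key signatures, whereas this sketch substitutes a shared-secret HMAC so it stays self-contained. All names (`issue_assertion`, `verify_assertion`, `authorize`, the policy layout) are hypothetical.

```python
import hashlib
import hmac
import json


def issue_assertion(key, subject, attributes):
    """Attribute authority: sign a statement about a subject's attributes."""
    payload = json.dumps({"subject": subject, "attributes": attributes},
                         sort_keys=True)
    signature = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_assertion(key, assertion):
    """Return the signed claims, or None if the signature does not match."""
    expected = hmac.new(key, assertion["payload"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, assertion["signature"]):
        return None
    return json.loads(assertion["payload"])


def authorize(key, assertion, policy, action):
    """Authorization authority: decide from verified assertions and policy."""
    claims = verify_assertion(key, assertion)
    if claims is None:
        return False              # forged or tampered assertion
    required = policy.get(action)
    if required is None:
        return False              # deny actions the policy does not mention
    return all(claims["attributes"].get(name) == value
               for name, value in required.items())
```

A service would call `authorize` with the assertion a user presents; denying by default when the action is missing from the policy keeps the decision conservative.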
Swift
(www.ci.uchicago.edu/swift)
An Integrated View of Modeling,
Simulation, Experiment, & Informatics
Bioinformatics Analysis Tools
Integrated Biological Databases
Robot Scientist
“The robot scientist project aims to develop a
computer system capable of originating its own
experiments, physically doing them, interpreting
the results, & then repeating the cycle.”
[Figure labels: Background Knowledge; Machine Learning; Analysis; Consistent Hypothesis; Experiment selection; Experiment(s); Robot (Biomek 200); Results; Final Theory]
Stephen Muggleton, Ross King et al., UK
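The cycle described in the quote (propose hypotheses, select an experiment, run it, keep only the hypotheses consistent with the result, repeat) can be sketched in a few lines. This is a toy illustration and not the Robot Scientist system itself: hypotheses are plain Python predicates, the "robot" is a callable supplied by the caller, and the selection rule (prefer the experiment that splits the remaining hypotheses most evenly) is one simple choice among many. All names are hypothetical.

```python
def select_experiment(hypotheses, experiments):
    """Pick the experiment whose predictions split the hypotheses most evenly."""
    def imbalance(experiment):
        positive = sum(1 for h in hypotheses if h(experiment))
        return abs(2 * positive - len(hypotheses))
    return min(experiments, key=imbalance)


def discovery_loop(hypotheses, experiments, run_experiment):
    """Repeat select -> run -> prune until one hypothesis (or no experiment) is left.

    hypotheses: dict mapping a name to a predicate over experiments.
    run_experiment: callable standing in for the robot; returns the observed result.
    """
    remaining = dict(hypotheses)
    untried = list(experiments)
    while len(remaining) > 1 and untried:
        experiment = select_experiment(list(remaining.values()), untried)
        untried.remove(experiment)
        result = run_experiment(experiment)
        # Keep only the hypotheses whose prediction matches the observation.
        remaining = {name: h for name, h in remaining.items()
                     if h(experiment) == result}
    return sorted(remaining)
```

For example, with eight threshold hypotheses and a simulated experiment, the loop narrows the candidates down in three experiments:

```python
candidates = {f"threshold={t}": (lambda x, t=t: x >= t) for t in range(1, 9)}
survivors = discovery_loop(candidates, list(range(1, 9)), lambda x: x >= 6)
print(survivors)  # ['threshold=6']
```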
Team Science meets
Data Deluge