The Scientific Data Management Center: Arie Shoshani (PI)
Universities
NCSU: Mladen Vouk
NWU: Alok Choudhary
UCD: Bertram Ludaescher
SDSC: Ilkay Altintas
UUtah: Claudio Silva
What is SciDAC?
Department of Energy program for Scientific Discovery through Advanced Computing
Brings together physical scientists, mathematicians, computer scientists, and computational scientists
Applied to science projects in: Nuclear Physics, Fusion Energy, Climate Modeling, Combustion, Astrophysics, etc.
Storage growth 1998-2008 at NERSC-LBNL (rate: ~2X / year): 6.7 petabytes, 78 million files
Task B: Move TS
Task C: Analyze TS
Task D: Visualize TS
Figure labels: Simulation Program, Data Mover, Post Processing, Parallel R, VisIt, Parallel NetCDF, PVFS, SRM, Subset extraction, File system, HDF5 Libraries
Integrated approach:
To provide a scientific workflow and dashboard capability
To support data mining and analysis tools
To accelerate storage and access to data
Sustainability
Robustness
Productize software
Work with vendors, computing centers
Results
Enables high-performance parallel I/O to netCDF data sets
Achieves up to 10-fold performance improvement over HDF5
Early performance testing showed PnetCDF outperformed HDF5 for some critical access patterns. The HDF5 team has responded by improving their code for these patterns, and now these teams actively collaborate to better understand application needs and system characteristics, leading to I/O performance gains in both libraries.
Illustration: A. Tovey
Original Pattern: Data is stored in the netCDF record format, where variables are interleaved in the file (36.0 sec).
MPI-IO Tuning: Adjusting MPI-IO parameters resulted in a significant I/O reduction (18.9 sec).
PnetCDF Enhancements: New PnetCDF large variable support stores data contiguously (13.1 sec).
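The difference between interleaved record storage and contiguous storage can be illustrated with a small offset calculation. The sketch below is a simplified model, not PnetCDF code: the variable count, record count, and record size are made-up values, and `interleaved_runs` and `contiguous_runs` are hypothetical helper names. It counts how many separate byte ranges must be read to fetch one whole variable under each layout.

```python
def interleaved_runs(num_vars, num_records, rec_bytes, var_index):
    """Byte ranges (offset, length) of one variable when the records of all
    variables are interleaved: rec0 of var0, rec0 of var1, ..., rec1 of var0, ..."""
    stride = num_vars * rec_bytes  # distance between consecutive records of one variable
    return [(r * stride + var_index * rec_bytes, rec_bytes)
            for r in range(num_records)]


def contiguous_runs(num_records, rec_bytes, var_index):
    """Byte ranges of one variable when each variable is stored contiguously."""
    var_bytes = num_records * rec_bytes
    return [(var_index * var_bytes, var_bytes)]


# Hypothetical sizes chosen only for illustration.
NUM_VARS, NUM_RECORDS, REC_BYTES = 4, 1000, 8192

inter = interleaved_runs(NUM_VARS, NUM_RECORDS, REC_BYTES, var_index=2)
contig = contiguous_runs(NUM_RECORDS, REC_BYTES, var_index=2)

print(len(inter), "separate ranges when interleaved")
print(len(contig), "range when contiguous")
```

Both layouts hold the same number of bytes per variable; only the number of separate ranges differs. For reference, the timings quoted on the slide correspond to 36.0 / 18.9 ≈ 1.9x after MPI-IO tuning and 36.0 / 13.1 ≈ 2.7x with contiguous storage.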
Find the HEP collision events with the most distinct signature of Quark Gluon Plasma
Find the ignition kernels in a combustion simulation
Track a layer of exploding supernova
These are not typical database searches:
Large high-dimensional data sets (1000 time steps X 1000 X 1000 X 1000 cells X 100 variables)
No modification of individual records during queries, i.e., append-only data
M-Dim queries: 500 < Temp < 1000 && CH3 > 10^-4 && ...
Large answers (hit thousands or millions of records)
Seek collective features such as regions of interest, histograms, etc.
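A multi-dimensional range query of the kind described above can be sketched as a plain scan over column-oriented data. This is a minimal illustration, assuming made-up values and hypothetical column names `temp` and `ch3`; it returns the matching record positions plus one collective feature (a histogram of the hits) rather than individual rows.

```python
from collections import Counter

# Hypothetical column-oriented, append-only data: one list per variable.
temp = [300.0, 650.0, 720.0, 1200.0, 980.0, 510.0]
ch3 = [2e-4, 5e-5, 3e-4, 9e-4, 1.5e-4, 8e-4]


def range_query(temp_col, ch3_col):
    """Return record positions satisfying 500 < Temp < 1000 && CH3 > 1e-4."""
    return [i for i, (t, c) in enumerate(zip(temp_col, ch3_col))
            if 500 < t < 1000 and c > 1e-4]


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter(int(v // bin_width) * bin_width for v in values)


hits = range_query(temp, ch3)
hist = histogram([temp[i] for i in hits], bin_width=250)
print(hits)
print(dict(hist))
```

A scan like this touches every record for every query, which is what an index over append-only data is meant to avoid.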
Other application domains:
real-time analysis of network intrusion attacks
fast tracking of combustion flame fronts over time
accelerating molecular docking in biology applications
query-driven visualization
Size: FastBit indexes are modest in size compared to well-known database indexes
On average about 1/3 of data volume compared to 3-4 times in common indexes (e.g. B-trees)
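FastBit is a bitmap index. The sketch below shows the general idea with an uncompressed, binned bitmap index in plain Python, using integers as bit vectors. It is not FastBit's API, and it does not reproduce FastBit's encoding or compression; the class name `BinnedBitmapIndex`, the bin edges, and the data are assumptions made for illustration. Bins that overlap the query range are OR-ed into a candidate set, candidates are checked against the raw values, and conditions on different variables are combined with a bitwise AND.

```python
import bisect


def bits(bitmap):
    """Yield the positions of the set bits, lowest first."""
    i = 0
    while bitmap:
        if bitmap & 1:
            yield i
        bitmap >>= 1
        i += 1


class BinnedBitmapIndex:
    """One bitmap (a Python int) per bin; bit i is set if record i falls in that bin."""

    def __init__(self, values, edges):
        self.values = values      # raw column, kept for the candidate check
        self.edges = edges        # sorted bin boundaries
        self.bitmaps = [0] * (len(edges) + 1)
        for i, v in enumerate(values):
            self.bitmaps[bisect.bisect_right(edges, v)] |= 1 << i

    def query(self, low, high):
        """Return a bitmap of the records with low < value < high."""
        first = bisect.bisect_right(self.edges, low)
        last = bisect.bisect_left(self.edges, high)
        candidates = 0
        for b in range(first, last + 1):
            candidates |= self.bitmaps[b]
        result = 0
        for i in bits(candidates):            # check candidates against raw values
            if low < self.values[i] < high:
                result |= 1 << i
        return result


# Hypothetical data and bin edges.
temp = [300.0, 650.0, 720.0, 1200.0, 980.0, 510.0]
ch3 = [2e-4, 5e-5, 3e-4, 9e-4, 1.5e-4, 8e-4]
temp_index = BinnedBitmapIndex(temp, edges=[250, 500, 750, 1000])
ch3_index = BinnedBitmapIndex(ch3, edges=[1e-5, 1e-4, 1e-3])

# 500 < Temp < 1000 && CH3 > 1e-4
hits = temp_index.query(500, 1000) & ch3_index.query(1e-4, float("inf"))
print(list(bits(hits)))
```

Because the data is append-only, adding a record only sets one new high-order bit in one bitmap per variable; existing bits never change.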
Region growing
Connect neighboring cells into regions
Region tracking
Track the evolution of the features through time
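The two steps above can be sketched on a small 2D grid. This is a minimal illustration under stated assumptions: cells are connected through their 4 nearest neighbors, regions in consecutive time steps are linked when they share at least one cell, and the grids and function names (`grow_regions`, `track_regions`) are hypothetical. The source does not specify the connectivity rule or the tracking criterion.

```python
from collections import deque


def grow_regions(mask):
    """Label 4-connected regions of truthy cells in a 2D grid.

    Returns a list of regions, each a set of (row, col) cells.
    """
    rows, cols = len(mask), len(mask[0])
    seen = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            region, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def track_regions(prev_regions, next_regions):
    """Link regions in consecutive time steps that share at least one cell."""
    return [(i, j)
            for i, a in enumerate(prev_regions)
            for j, b in enumerate(next_regions)
            if a & b]


# Hypothetical query results (cells satisfying the condition) at t and t+1.
step_t = [[1, 1, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 1]]
step_t1 = [[0, 1, 1, 0],
           [0, 0, 0, 0],
           [0, 0, 1, 1]]

regions_t = grow_regions(step_t)
regions_t1 = grow_regions(step_t1)
print(len(regions_t), len(regions_t1))
print(track_regions(regions_t, regions_t1))
```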
3D Analysis Examples
Query-Driven Visualization
Results
Workflow automation requirements in Fusion Center for Plasma Edge Simulation (CPES) project
Kepler is a workflow execution system based on Ptolemy (open source from UCB)
SDM Center work focuses on developing components for scientific applications (called actors)
Storage Resource Managers (SRMs): Middleware for storage interoperability and data movement
Figure labels: 170 TBs; sites ANL, NCAR, LLNL, ORNL, ISI; openDAPg server; Tomcat servlet engine; CAS (Community Authorization Services); CAS client; gridFTP servers; MyProxy client; DRM (Storage Resource Management); HRM (Storage Resource Management); MCS (Metadata Cataloguing Services); RLS (Replica Location Services); SOAP; RMI; disks
Process provenance: the steps performed in the workflow, the progress through the workflow control flow, etc.
Data provenance: history and lineage of each data item associated with the actual simulation (inputs, outputs, intermediate states, etc.)
Workflow provenance: history of the workflow evolution and structure
System provenance: machine and environment information, compilation history of the codes, information about the libraries, source code, run-time environment settings
Figure labels: Kepler, Rec API, Data Store, Management API, Access, Orchestration, Trust, Auth
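The four kinds of provenance can be modeled as tagged records appended to a store. The sketch below is a hypothetical in-memory model, not Kepler's recording API: the names `ProvenanceRecord` and `ProvenanceStore`, the field layout, and the example file names are all assumptions made for illustration.

```python
import json
import time
from dataclasses import dataclass, field, asdict

KINDS = {"process", "data", "workflow", "system"}


@dataclass
class ProvenanceRecord:
    kind: str                                   # one of KINDS
    subject: str                                # e.g. a workflow step, a file, a host
    details: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


class ProvenanceStore:
    """Hypothetical in-memory, append-only store of provenance records."""

    def __init__(self):
        self._records = []

    def record(self, kind, subject, **details):
        if kind not in KINDS:
            raise ValueError(f"unknown provenance kind: {kind}")
        rec = ProvenanceRecord(kind, subject, details)
        self._records.append(rec)
        return rec

    def lineage(self, path):
        """Return the data-provenance records that list `path` as an output."""
        return [r for r in self._records
                if r.kind == "data" and path in r.details.get("outputs", [])]

    def dump(self):
        """Serialize all records to JSON."""
        return json.dumps([asdict(r) for r in self._records], indent=2)


store = ProvenanceStore()
store.record("process", "step:move_timestep", status="done")
store.record("data", "step:move_timestep",
             inputs=["ts_0001.nc"], outputs=["archive/ts_0001.nc"])
store.record("system", "host", compiler="hypothetical-cc 1.0")

print([r.subject for r in store.lineage("archive/ts_0001.nc")])
```

A lookup such as `lineage` is the kind of query a dashboard could use to find where a file was written before requesting a download.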
Dashboard uses provenance for finding location of files and automatic download with SRM
Download window
Results
Use scientific data mining techniques to analyze data from various SciDAC applications
Techniques borrowed from image and video processing, machine learning, statistics, pattern recognition, ...
Target Data -> Preprocessed Data -> Transformed Data -> Patterns -> Knowledge
Data Preprocessing, Data Fusion, Sampling, Multi-resolution analysis, De-noising, Object identification, Feature extraction, Normalization, Dimension reduction
We used independent component analysis to separate El Niño and volcano signals in climate simulations
Showed that the technique can be used to enable better comparisons of simulations
Using image and video processing techniques to identify and track blobs in experimental data from NSTX to validate and refine theories of edge turbulence
t t+1 t+2
Denoised original
Detection of blobs
Goal: Parallel R (pR) aims:
(1) to automatically detect and execute task-parallel analyses
(2) to easily plug in data-parallel MPI-based C/Fortran codes
(3) to retain a high level of interactivity, productivity, and abstraction
Task-parallel analyses:
Likelihood Maximization
Re-sampling schemes: Bootstrap, Jackknife
Markov Chain Monte Carlo (MCMC)
Animations
Data-parallel analyses:
k-means clustering
Principal Component Analysis
Hierarchical clustering
Distance matrix, histogram, etc.
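A task-parallel analysis such as the bootstrap consists of many independent resampling tasks whose results are collected at the end. The sketch below shows that structure in Python rather than R; it is not pR code, and the function names, data, and seeds are made up for illustration. Threads are used only to keep the example self-contained: in CPython they do not speed up CPU-bound work, and `ProcessPoolExecutor` offers the same `map` interface for real parallelism.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_mean(task):
    """One independent task: resample with replacement and return the mean."""
    data, seed = task
    rng = random.Random(seed)        # per-task seed keeps the run reproducible
    sample = [rng.choice(data) for _ in data]
    return statistics.fmean(sample)


def parallel_bootstrap(data, n_resamples, workers=4):
    """Run the resampling tasks concurrently and collect the estimates in order."""
    tasks = [(data, seed) for seed in range(n_resamples)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(bootstrap_mean, tasks))


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]   # made-up measurements
estimates = parallel_bootstrap(data, n_resamples=200)
print(len(estimates), round(statistics.stdev(estimates), 3))
```

The spread of `estimates` approximates the standard error of the mean; no task depends on another, which is what makes the analysis task-parallel.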
>1,000 downloads
Climate Modeling (Drake); Astrophysics (Blondin); Combustion (Jackie Chen); Combustion (Bell); Fusion (PPPL); Fusion (CPES); Materials - QBOX (Galli); High Energy Physics; Groundwater Modeling; Accelerator Science (Ryne); SNS; Biology; Climate Cloud modeling (Randall); Data-to-Model Conversion (Kotamathi)
workflow data movement data movement data-move, code-couple Lattice-QCD identified 4-5 workflows workflow ScalaBlast
Biology (H2); Fusion (RF) (Bachelor); Subsurface Modeling (Lichtner); Flow with strong shocks (Lele); Fusion (extended MHD) (Jardin); Nanoscience (Rack); other activities
Poincaré plots
Status legend: currently in progress; problem identified; interest expressed
It is becoming impractical to move large parts of simulation data to end user facilities
"Near data" could mean reachable over a high-capacity wide-area network (100 Gbps)
On-the-fly processing capabilities as data is generated
Implications to XLDB
Fast I/O is very important to scientists
Take advantage of append-only data for fast indexes
Workflow (pipeline) processing is extremely useful
Integrated end-to-end capabilities can be very useful for getting scientists' interest (saves them time, one-stop capability)
Real-time monitoring and visualization are highly desirable
A data-side analysis facility may be required as a practical adjunct or alternative to UDFs
Table-of-contents
Scientific Process Management
Data Analysis, Integration, and Visualization Methods
Specialized Database Systems and Retrieval Techniques
The END