The Scientific Data Management Center: Arie Shoshani (PI)

The Scientific Data Management Center focuses on providing tools and technologies for managing the large and increasing volumes of scientific data being generated. It takes an integrated approach based on a three-layer organization: the Scientific Process Automation layer supports workflows and dashboards; the Data Mining and Analysis layer has tools for parallel analysis; and the Storage Efficient Access layer focuses on high performance parallel I/O and indexing. The center works on technologies for high performance, enabling data understanding, usability, and sustainability. Results include accelerating data transfer with Parallel NetCDF and indexing for fast searches with FastBit.

The Scientific Data Management Center

Arie Shoshani (PI), Lawrence Berkeley National Laboratory
http://sdmcenter.lbl.gov

Co-Principal Investigators

DOE Laboratories
  ANL: Rob Ross
  LBNL: Doron Rotem
  LLNL: Chandrika Kamath
  ORNL: Nagiza Samatova
  PNNL: Terence Critchlow

Universities
  NCSU: Mladen Vouk
  NWU: Alok Choudhary
  UCD: Bertram Ludaescher
  SDSC: Ilkay Altintas
  UUtah: Claudio Silva

XLDB meeting, Lyon, August 2009


Arie Shoshani

What is SciDAC?
Department of Energy program for Scientific Discovery through Advanced Computing
Brings together physical scientists, mathematicians, computer scientists, and computational scientists
Applied to science projects in:
  Nuclear Physics, Fusion Energy, Climate Modeling, Combustion, Astrophysics, etc.

Scientific Data Management


Scientific data management is a collection of methods, algorithms and software that enables efficient capturing, storing, moving, and analysis of scientific data.

Figure: Storage Growth 1998-2008 at NERSC-LBNL (rate: ~2X / year); 6.7 Petabytes, 78 million files

Problems and Goals

Why is Managing Scientific Data Important for Scientific Investigations?


The sheer volume and increasing complexity of the data being collected are already interfering with the scientific investigation process
Managing the data themselves takes up much of the time scientists could otherwise spend on their application work
Data I/O, storage, transfer, and archiving often conflict with effectively using computational resources
Effectively managing and analyzing this data and its associated metadata requires a comprehensive, end-to-end approach that encompasses all stages, from initial data acquisition to final analysis of the data

A motivating SDM Scenario (dynamic monitoring)

Figure: layered diagram (Flow Tier and Work Tier)

Control Flow Layer
  Task A: Generate Time-Steps
  Task B: Move TS
  Task C: Analyze TS
  Task D: Visualize TS

Applications & Software Tools Layer
  Simulation Program, Data Mover, Post Processing, Parallel R, VisIt

I/O System Layer
  Parallel NetCDF, PVFS, SRM, Subset extraction, File system, HDF5 Libraries

Storage & Network Resources Layer
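As a rough illustration of the scenario above, the four tasks can be chained so that each time-step flows through move, analyze, and visualize as soon as it is generated. This is a minimal sketch under assumptions: every function name and the toy data are hypothetical stand-ins, not the center's tools (the slide names a Data Mover, Parallel R, and VisIt for these roles).

```python
def generate_time_steps(count):
    """Task A: stand-in for the simulation; yields (step, data) pairs."""
    for step in range(count):
        yield step, [float(step + i) for i in range(4)]

def move(time_steps, staging):
    """Task B: stand-in for the data mover; copies each step into a staging dict."""
    for step, data in time_steps:
        staging[step] = list(data)
        yield step, staging[step]

def analyze(time_steps):
    """Task C: stand-in for the analysis; reduces each step to its mean."""
    for step, data in time_steps:
        yield step, sum(data) / len(data)

def visualize(summaries):
    """Task D: stand-in for visualization; renders one text line per step."""
    return [f"step {step}: mean={value:.2f}" for step, value in summaries]

# Generators keep the stages streaming: step 0 is analyzed before step 1 exists.
staging_area = {}
report = visualize(analyze(move(generate_time_steps(3), staging_area)))
```

Because the stages are generators, nothing waits for the whole run to finish, which is the point of dynamic monitoring in the scenario.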

Organization of the center: based on three-layer organization of technologies

Integrated approach:
  To provide a scientific workflow and dashboard capability
  To support data mining and analysis tools
  To accelerate storage and access to data

Scientific Process Automation (SPA) Layer
  Workflow Management Engine (Kepler)
  Specialized Workflow components
  Scientific Dashboard

Data Mining and Analysis (DMA) Layer
  Parallel R Statistical Analysis
  Data Analysis and Feature Identification
  Efficient indexing (Bitmap Index)

Storage Efficient Access (SEA) Layer
  Active Storage
  Storage Resource Manager (SRM)
  Adaptable I/O System (ADIOS)
  Parallel I/O (ROMIO)
  Parallel NetCDF
  Parallel Virtual File System

Hardware, Operating Systems, and Storage Systems

Focus of SDM center


High performance
  Fast, scalable
  Parallel I/O, parallel file systems
  Indexing, data movement

Enabling data understanding
  Parallelize analysis tools
  Streamline use of analysis tools
  Real-time data search tools

Usability and effectiveness
  Easy-to-use tools and interfaces
  Use of workflow, dashboards
  End-to-end use (data and metadata)

Sustainability
  Robustness
  Productize software
  Work with vendors, computing centers

Establish dialog with scientists
  Partner with scientists
  Education (students, scientists)

Results

High Performance Technologies
Usability and effectiveness
Enabling Data Understanding

The I/O Software Stack


Speeding data transfer with PnetCDF


Enables high performance parallel I/O to netCDF data sets
Achieves up to 10-fold performance improvement over HDF5

Early performance testing showed PnetCDF outperformed HDF5 for some critical access patterns. The HDF5 team has responded by improving their code for these patterns, and now these teams actively collaborate to better understand application needs and system characteristics, leading to I/O performance gains in both libraries.

Figure (two panels, processes P0 to P3): netCDF on a parallel file system, with inter-process communication; Parallel netCDF on a parallel file system. Illustration: A. Tovey

Contacts: Rob Ross, ANL; Alok Choudhary, NWU

Visualizing and Tuning I/O Access


This view shows the entire 28 Gbyte dataset as a 2D array of blocks, for three separate runs. The renderer is visualizing one variable out of five. Red blocks were accessed. Access times are in parentheses.

Original Pattern: data is stored in the netCDF record format, where variables are interleaved in the file (36.0 sec).
MPI-IO Tuning: adjusting MPI-IO parameters resulted in significant I/O reduction (18.9 sec).
PnetCDF Enhancements: new PnetCDF large variable support stores data contiguously (13.1 sec).
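To see why the interleaved record layout costs more when only one variable is read, it can help to count the byte ranges that must be touched. The sketch below is a simplified model, not the actual netCDF file format: it ignores the file header and assumes every variable contributes the same number of bytes per record. The function and parameter names are hypothetical.

```python
def record_layout_ranges(var_index, num_vars, num_records, bytes_per_slice):
    """Byte ranges (offset, length) of one variable when variables are interleaved per record."""
    record_size = num_vars * bytes_per_slice
    return [
        (rec * record_size + var_index * bytes_per_slice, bytes_per_slice)
        for rec in range(num_records)
    ]

def contiguous_layout_ranges(var_index, num_records, bytes_per_slice):
    """Byte range of one variable when each variable is stored as a single block."""
    var_size = num_records * bytes_per_slice
    return [(var_index * var_size, var_size)]

# One variable out of five, four records, 100 bytes per variable per record.
interleaved = record_layout_ranges(var_index=2, num_vars=5, num_records=4, bytes_per_slice=100)
contiguous = contiguous_layout_ranges(var_index=2, num_records=4, bytes_per_slice=100)
```

In this model the same 400 bytes are spread over four separate ranges in the interleaved layout and sit in one range in the contiguous layout, so the number of non-contiguous accesses grows with the number of records.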

Searching Problems in Data Intensive Sciences


Find the HEP collision events with the most distinct signature of Quark Gluon Plasma
Find the ignition kernels in a combustion simulation
Track a layer of an exploding supernova

These are not typical database searches:
  Large high-dimensional data sets (1000 time steps X 1000 X 1000 X 1000 cells X 100 variables)
  No modification of individual records during queries, i.e., append-only data
  M-Dim queries: 500 < Temp < 1000 && CH3 > 10^-4 && ...
  Large answers (hit thousands or millions of records)
  Seek collective features such as regions of interest, histograms, etc.

Other application domains:
  real-time analysis of network intrusion attacks
  fast tracking of combustion flame fronts over time
  accelerating molecular docking in biology applications
  query-driven visualization

FastBit: accelerating analysis of very large datasets

Most data analysis algorithms cannot handle a whole dataset
  Therefore, most data analysis tasks are performed on a subset of the data
  Need: very fast indexing for real-time analysis

FastBit is an extremely efficient compressed bitmap indexing technology
  Indexes and stores each column separately
  Uses a compute-friendly compression technique (patent 2006)
  Improves search speed by 10x to 100x over the best known bitmap indexing methods
  Excels for high-dimensional data
  Can search a billion data values in seconds

Size: FastBit indexes are modest in size compared to well-known database indexes
  On average about 1/3 of the data volume, compared to 3-4 times the data volume for common indexes (e.g. B-trees)
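The idea of indexing each column separately and combining per-column answers can be sketched with a small binned bitmap index. This is an illustration only, under stated assumptions: it is uncompressed (FastBit's compression is not reproduced here), it uses Python integers as bitmaps, and the class name, bin edges, and sample values are all hypothetical.

```python
from bisect import bisect_right

class BinnedBitmapIndex:
    """Uncompressed binned bitmap index over one column (illustration only)."""

    def __init__(self, values, bin_edges):
        self.values = list(values)
        self.edges = sorted(bin_edges)
        # One bitmap (a Python int) per bin; bit i is set when row i is in the bin.
        self.bitmaps = [0] * (len(self.edges) + 1)
        for row, value in enumerate(self.values):
            self.bitmaps[bisect_right(self.edges, value)] |= 1 << row

    def range_query(self, low, high):
        """Return a bitmap of the rows with low < value < high."""
        inf = float("inf")
        hits = 0
        for k, bitmap in enumerate(self.bitmaps):
            lo_edge = self.edges[k - 1] if k > 0 else -inf
            hi_edge = self.edges[k] if k < len(self.edges) else inf
            if hi_edge <= low or lo_edge >= high:
                continue  # the bin cannot contain a hit
            if lo_edge > low and hi_edge <= high:
                hits |= bitmap  # the whole bin lies inside the range
            else:
                hits |= self._check_rows(bitmap, low, high)  # boundary bin
        return hits

    def _check_rows(self, bitmap, low, high):
        """Test the raw values of the candidate rows in a boundary bin."""
        hits = 0
        for row in rows_of(bitmap):
            if low < self.values[row] < high:
                hits |= 1 << row
        return hits

def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

temp = [450.0, 600.0, 750.0, 999.0, 1200.0, 520.0]
ch3 = [2e-4, 5e-5, 3e-4, 1e-3, 2e-4, 9e-4]
temp_index = BinnedBitmapIndex(temp, [500.0, 1000.0, 1500.0])
ch3_index = BinnedBitmapIndex(ch3, [1e-4, 1e-3])
# 500 < Temp < 1000 && CH3 > 1e-4: AND the two per-column answers.
answer = temp_index.range_query(500.0, 1000.0) & ch3_index.range_query(1e-4, float("inf"))
matching_rows = rows_of(answer)
```

Because the data is append-only, bitmaps like these never need to be updated in place, and a multi-dimensional condition reduces to bitwise AND operations over per-column results.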

Flame Front Tracking with FastBit


Flame front identification can be specified as a query, efficiently executed for multiple timesteps with FastBit.

Finding & tracking of combustion flame fronts
  Cell identification: identify all cells that satisfy user specified conditions: 600 < Temperature < 700 AND HO2 concentration > 10^-7
  Region growing: connect neighboring cells into regions
  Region tracking: track the evolution of the features through time
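The three steps named on this slide (cell identification, region growing, region tracking) can be sketched on a small 2D grid. This is a minimal illustration under assumptions: the cell query is a plain scan rather than a FastBit index lookup, neighbors are the four edge-adjacent cells, and regions in consecutive timesteps are linked when they share at least one cell. All function names and sample grids are hypothetical.

```python
from collections import deque

def identify_cells(temp, ho2, t_lo=600.0, t_hi=700.0, ho2_min=1e-7):
    """Step 1: cells (i, j) with t_lo < temperature < t_hi and HO2 > ho2_min."""
    return {
        (i, j)
        for i, row in enumerate(temp)
        for j, t in enumerate(row)
        if t_lo < t < t_hi and ho2[i][j] > ho2_min
    }

def grow_regions(cells):
    """Step 2: connect edge-adjacent cells into regions with a breadth-first search."""
    remaining = set(cells)
    regions = []
    while remaining:
        seed = remaining.pop()
        region, queue = {seed}, deque([seed])
        while queue:
            i, j = queue.popleft()
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    region.add(nb)
                    queue.append(nb)
        regions.append(region)
    return sorted(regions, key=min)  # stable order for reporting

def track_regions(prev_regions, next_regions):
    """Step 3: link regions of consecutive timesteps that overlap in space."""
    return [
        (a, b)
        for a, ra in enumerate(prev_regions)
        for b, rb in enumerate(next_regions)
        if ra & rb
    ]

ho2 = [[1e-6] * 3 for _ in range(3)]
temp_t0 = [[650.0, 650.0, 300.0], [300.0, 300.0, 300.0], [300.0, 300.0, 650.0]]
temp_t1 = [[300.0, 650.0, 650.0], [300.0, 300.0, 300.0], [300.0, 300.0, 300.0]]
regions_t0 = grow_regions(identify_cells(temp_t0, ho2))
regions_t1 = grow_regions(identify_cells(temp_t1, ho2))
links = track_regions(regions_t0, regions_t1)
```

In this toy case the front in the top row shifts one cell to the right between timesteps and is linked, while the isolated cell in the corner disappears.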

3D Analysis Examples

Selecting particles using parallel coordinate display

Trace selected particles


Query-Driven Visualization

Collaboration between SDM and VIS centers


Use FastBit indexes to efficiently select the most interesting data for visualization

Above example: laser wakefield accelerator simulation


VORPAL produces 2D and 3D simulations of particles in a laser wakefield
Finding and tracking particles with large momentum is key to designing the accelerator
The brute-force algorithm is quadratic (taking 5 minutes on 0.5 million particles); FastBit time is linear in the number of results (takes 0.3 s, a 1000X speedup)
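The gap between a quadratic search and one that is linear in the number of results can be illustrated by tracing selected particles into another timestep. The sketch below makes assumptions that the slide does not state: the operation is taken to be a lookup by particle ID, and a Python dict stands in for the index (it is not a bitmap index and not FastBit). All names and the synthetic data are hypothetical.

```python
def trace_brute_force(selected_ids, timestep):
    """Nested scan: every selected ID is compared against every particle."""
    return [p for pid in selected_ids for p in timestep if p["id"] == pid]

def build_id_index(timestep):
    """One pass over the timestep, done once and reused for every query."""
    return {p["id"]: p for p in timestep}

def trace_with_index(selected_ids, index):
    """One lookup per selected ID, so the cost grows with the number of results."""
    return [index[pid] for pid in selected_ids if pid in index]

# Synthetic particles; "px" plays the role of momentum.
timestep = [{"id": i, "px": float(i % 7)} for i in range(1000)]
selected = [p["id"] for p in timestep if p["px"] > 5.0]
traced_slow = trace_brute_force(selected, timestep)
traced_fast = trace_with_index(selected, build_id_index(timestep))
```

Both functions return the same particles; the difference is that the brute-force version does work proportional to selected particles times all particles, while the indexed version does work proportional to the selected particles alone.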

Results

Usability and effectiveness

Workflow automation requirements in Fusion Center for Plasma Edge Simulation (CPES) project

Automate the monitoring pipeline
  Transfer of simulation output to remote machine
  Execution of conversion routines, image creation, data archiving

and the code coupling pipeline
  Run simulation on a large supercomputer
  Check linear stability on another machine
  Re-run simulation if needed

Requirements for Petascale computing
  Easy to use
  Parallel processing
  Robustness
  Configurability
  Dashboard front-end
  Dynamic monitoring

Contact: Scott Klasky, et al., ORNL
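The code coupling pipeline above (run, check stability, re-run if needed) is a control loop, and can be sketched as one. This is a minimal illustration under assumptions: "if needed" is taken to mean that the stability check failed, the re-run uses adjusted parameters, and there is a cap on attempts. The function names, the callbacks, and the toy physics are all hypothetical; a real workflow would launch remote jobs on two machines.

```python
def run_coupled(run_simulation, is_stable, adjust, params, max_attempts=5):
    """Run, check stability, and re-run with adjusted parameters until stable."""
    for attempt in range(1, max_attempts + 1):
        output = run_simulation(params)      # would run on the large supercomputer
        if is_stable(output):                # would run on another machine
            return output, attempt
        params = adjust(params)              # prepare the re-run
    raise RuntimeError("simulation did not pass the stability check")

# Toy stand-ins so the sketch runs locally.
def toy_simulation(params):
    return {"growth_rate": params["drive"] - 1.0}

def toy_stability_check(output):
    return output["growth_rate"] <= 0.0

def toy_adjust(params):
    return {"drive": params["drive"] - 0.5}

result, attempts = run_coupled(toy_simulation, toy_stability_check, toy_adjust, {"drive": 2.0})
```

Passing the three stages in as callables keeps the loop itself independent of where each stage executes, which is roughly what a workflow engine provides.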

The Kepler Workflow Engine

Kepler is a workflow execution system based on Ptolemy (open source from UCB)
SDM center work is in the development of components for scientific applications (called actors)

Real-time visualization and analysis capabilities on dashboard

visualize and compare shots



Storage Resource Managers (SRMs): Middleware for storage interoperability and data movement


SRM use in Earth System Grid

14000 users, 170 TBs

Figure: deployment diagram across sites
  Sites: LBNL, ANL, NCAR, LLNL, ORNL, ISI
  Storage: HPSS (High Performance Storage System), MSS (Mass Storage System), disk
  Storage Resource Management: HRM, DRM
  Data transfer: gridFTP server, gridFTP striped server
  Services: openDAPg server, Tomcat servlet engine, CAS (Community Authorization Services), MyProxy server, MCS (Metadata Cataloguing Services), RLS (Replica Location Services), GRAM gatekeeper
  Clients: MCS client, RLS client, MyProxy client, CAS client
  Protocols: SOAP, RMI

SDM Contact: A. Sim, A. Shoshani, LBNL

Capturing Provenance in Workflow Framework

Process provenance
  the steps performed in the workflow, the progress through the workflow control flow, etc.

Data provenance
  history and lineage of each data item associated with the actual simulation (inputs, outputs, intermediate states, etc.)

Workflow provenance
  history of the workflow evolution and structure

System provenance
  Machine and environment information
  compilation history of the codes
  information about the libraries
  source code
  run-time environment settings

Figure labels: Control Plane (light data flows); Kepler; Provenance, Tracking & Meta-Data (DBs and Portals); Execution Plane (Heavy Lifting Computations and data flows)

SDM Contact: Mladen Vouk, NCSU
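The four kinds of provenance listed above can be pictured as four fields of one record that a workflow fills in as it runs. The sketch below is a minimal illustration under assumptions, not the Kepler provenance framework: the class, the helper functions, and the toy workflow are hypothetical, and the system provenance captured here is only what the Python standard library reports.

```python
import platform
import sys
from dataclasses import dataclass, field

@dataclass
class ProvenanceStore:
    process: list = field(default_factory=list)   # ordered steps that ran
    data: dict = field(default_factory=dict)      # item -> (step, input items)
    workflow: dict = field(default_factory=dict)  # workflow name and version
    system: dict = field(default_factory=dict)    # machine and environment info

def run_step(store, items, step, func, inputs, output):
    """Run one step and record process and data provenance for it."""
    items[output] = func(*(items[name] for name in inputs))
    store.process.append(step)
    store.data[output] = (step, tuple(inputs))

def lineage(store, item):
    """Walk data provenance back to the original inputs of an item."""
    if item not in store.data:
        return [item]  # an original input with no recorded producer
    _, inputs = store.data[item]
    chain = []
    for name in inputs:
        chain.extend(lineage(store, name))
    return chain + [item]

store = ProvenanceStore(
    workflow={"name": "monitoring", "version": 1},
    system={"machine": platform.machine(), "python": sys.version.split()[0]},
)
items = {"raw": [3.0, 1.0, 2.0]}
run_step(store, items, "convert", sorted, ["raw"], "converted")
run_step(store, items, "summarize", max, ["converted"], "peak")
peak_lineage = lineage(store, "peak")
```

A store of this shape is what lets a dashboard answer questions such as which inputs produced a given result.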

FIESTA: Framework for Integrated End-to-end SDM Technologies and Applications

Figure: FIESTA architecture diagram. Labeled components: Storage; Supercomputers + Analytics Nodes; Trust; Auth; Kepler; Rec API; Data Store; Disp API; Dashboard; Access; Management API; Orchestration

Provenance is captured in a data store and used by dashboard

Dashboard uses provenance for finding location of files and automatic download with SRM

Download window


Dashboard is used for job launching and real-time machine monitoring


Allow for secure logins with OTP.
Allow for job submission.
Allow for killing jobs.
Search old jobs.
See collaborators' jobs.

Results

Enabling Data Understanding

Goal: solving the problem of data overload



Scientific data understanding: from Terabytes to Megabytes

Use scientific data mining techniques to analyze data from various SciDAC applications
Techniques borrowed from image and video processing, machine learning, statistics, pattern recognition, ...

Figure: data mining process, an iterative and interactive process
  Data states: Raw Data, Target Data, Preprocessed Data, Transformed Data, Patterns, Knowledge
  Data Preprocessing: Data Fusion, Sampling, Multi-resolution analysis, De-noising, Object identification, Feature extraction, Normalization, Dimension reduction
  Pattern Recognition: Classification, Clustering, Regression
  Interpreting Results: Visualization, Validation

Separating signals in climate data

We used independent component analysis to separate El Niño and volcano signals in climate simulations
Showed that the technique can be used to enable better comparisons of simulations

Collaboration with Ben Santer (LLNL)

Tracking blobs in fusion plasma

Using image and video processing techniques to identify and track blobs in experimental data from NSTX to validate and refine theories of edge turbulence
Figure: frames at times t, t+1, t+2; panels: Denoised original, After removal of background, Detection of blobs

Collaboration with S. Zweben, R. Maqueda, and D. Stotler (PPPL)

Task and Data Parallelism in pR


Goal: Parallel R (pR) aims:
  (1) to automatically detect and execute task-parallel analyses;
  (2) to easily plug in data-parallel MPI-based C/Fortran codes;
  (3) to retain a high level of interactivity, productivity and abstraction

Task-parallel analyses:
  Likelihood Maximization
  Re-sampling schemes: Bootstrap, Jackknife
  Markov Chain Monte Carlo (MCMC)
  Animations

Data-parallel analyses:
  k-means clustering
  Principal Component Analysis
  Hierarchical clustering
  Distance matrix, histogram, etc.

Figure: Task & Data Parallelism in pR (panels: Task Parallelism, Data Parallelism)
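Bootstrap re-sampling is listed above as a task-parallel analysis because every replicate is independent of the others. The sketch below shows that structure in Python rather than R, as an illustration only: it is not pR, it uses a thread pool from the standard library instead of MPI, and CPython threads will not give real speedup for this CPU-bound work. The function names and the sample are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def bootstrap_mean(sample, seed):
    """One independent task: resample with replacement and return the mean."""
    rng = random.Random(seed)  # a private generator keeps tasks independent
    resample = [rng.choice(sample) for _ in sample]
    return mean(resample)

def parallel_bootstrap(sample, replicates=200, workers=4):
    """Farm the replicates out to a pool; results come back in seed order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seed: bootstrap_mean(sample, seed), range(replicates)))

sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
estimates = parallel_bootstrap(sample)
spread = (min(estimates), max(estimates))
```

Seeding each task separately makes the run reproducible regardless of how the pool schedules the tasks, which is the property that lets such analyses be distributed without coordination.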

ProRata use in OBER Projects


DOE OBER Projects Using ProRata:
  Jill Banfield, Bob Hettich: Acid Mine Drainage
  Michelle Buchanan: CMCS Center
  Steve Brown, Jonathan Mielenz: BESC BioEnergy
  Carol Harwood, Bob Hettich: MCP R. palustris

J. of Proteome Research Vol. 5, No. 11, 2006

>1,000 downloads

SDM center collaboration with applications


Table columns (technologies):
  Workflow Technology (Kepler)
  Metadata and provenance
  Data Movement and storage
  Indexing (FastBit)
  Parallel I/O (pNetCDF, etc.)
  Parallel Statistics (pR, ...)
  Feature extraction
  Active Storage

Table rows (application domains):
  Climate Modeling (Drake)
  Astrophysics (Blondin)
  Combustion (Jackie Chen)
  Combustion (Bell)
  Fusion (PPPL)
  Fusion (CPES)
  Materials - QBOX (Galli)
  High Energy Physics
  Groundwater Modeling
  Accelerator Science (Ryne)
  SNS
  Biology
  Climate Cloud modeling (Randall)
  Data-to-Model Conversion (Kotamathi)

Table entries:
  workflow data movement data movement data-move, code-couple Lattice-QCD identified 4-5 workflows workflow ScalaBlast
  dashboard distributed analysis Dashboard Data Entry tool (DEB)
  DataMover-Lite DataMover-Lite DataMover-Lite SRM, DataMover
  flame front Toroidal meshes event finding
  pNetCDF Global Access XML MPIO-SRM pNetCDF
  pMatlab pMatlab pR ProRata
  transient events poincare plots Blob tracking
  ScalaBlast cloud modeling

Table rows (other activities):
  Biology (H2)
  Fusion (RF) (Bachelor)
  Subsurface Modeling (Lichtner)
  Flow with strong shocks (Lele)
  Fusion (extended MHD) (Jardin)
  Nanoscience (Rack)

Table entries:
  Over AMR conditional statistics pMatlab
  poincare plots
  integrate with Lustre

Legend: currently in progress; problem identified; interest expressed

Future Vision for Extreme Scale Data: Data-Side Analysis Facility

It is becoming impractical to move large parts of simulation data to end user facilities
  "Near data" could mean reachable over a high capacity wide-area network (100 Gbps)
  On-the-fly processing capabilities as data is generated

Data-side analysis facility (exascale workshops)
  Have an analysis cluster near the data generation site
  Have parallel analysis and visualization tools available on the facility
  Have workflow tools that let users compose analysis pipelines
  Reuse previously composed pipelines
  Package specialized components (e.g. Poincare plot analysis)

Use dynamically or as post-processing
  Invoke as part of end-to-end framework
  Use provenance store to track results

Implications to XLDB

Fast I/O is very important to scientists
Take advantage of append-only data for fast indexes
Workflow (pipeline) processing is extremely useful
Integrated end-to-end capabilities can be very useful for getting scientists' interest (saves them time, one-stop capability)
Real-time monitoring and visualization are highly desirable
A data-side analysis facility may be required as a practical adjunct / alternative to UDFs

SDM Book October 2009


New book edited, and chapters written, by group members
Scientific Data Management: Challenges, Technology, and Deployment, Chapman & Hall/CRC

Table-of-contents
  Storage Technology and Efficient Storage Access
  Data Transfer and Scheduling
  Specialized Database Systems and Retrieval Techniques
  Data Analysis, Integration, and Visualization Methods
  Scientific Process Management

Table-of-Contents

I Storage Technology and Efficient Storage Access
  1 Storage Technology, lead author: John Shalf
  2 Parallel Data Storage and Access, lead author: Rob Ross
  3 Dynamic Storage Management, lead author: Arie Shoshani

II Data Transfer and Scheduling
  4 Coordination of Access to Large-Scale Datasets in Distributed Environments, lead author: Tevfik Kosar
  5 High-Throughput Data Movement, lead author: Scott Klasky

III Specialized Retrieval Techniques and Database Systems
  6 Accelerating Queries on Very Large Datasets, lead author: Ekow Otoo
  7 Emerging Database Systems in Support of Scientific Data, lead author: Per Svensson

IV Data Analysis, Integration, and Visualization Methods
  8 Scientific Data Analysis, lead author: Chandrika Kamath
  9 Scientific Data Management Challenges in High-Performance Visual Data Analysis, lead author: E. Wes Bethel
  10 Interoperability and Data Integration in the Geosciences, lead author: Michael Gertz
  11 Analyzing Data Streams in Scientific Applications, lead author: Tore Risch

V Scientific Process Management
  12 Metadata and Provenance Management, lead author: Ewa Deelman
  13 Scientific Process Automation and Workflow Management, lead author: Bertram Ludaescher

The END
