Berkeley Data Analytics Stack (BDAS) Overview: Ion Stoica UC Berkeley
[Figure: stack layers, top to bottom: Application, Data Processing, Storage, Infrastructure]
Goals
- Batch
- Interactive
- Streaming
One stack to rule them all!
[Figure: typical server node: 10Gbps network; 16 cores; 128-512GB memory at 40-60GB/s; 10-30TB on x10 disks at 0.2-1GB/s; 1-4TB on x4 disks at 1-4GB/s]
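The bandwidth figures above hint at why in-memory processing matters. A rough back-of-the-envelope sketch follows, assuming the figure reads as memory at 40-60GB/s, x10 disks at 0.2-1GB/s, and x4 disks at 1-4GB/s, and treating each number as the aggregate bandwidth of that tier. The 100GB dataset size and the function names are hypothetical, chosen only for illustration.

```python
# Back-of-the-envelope scan times: time = data size / bandwidth.
# Bandwidth ranges are assumed readings of the slide figure;
# DATASET_GB is a hypothetical working-set size.
DATASET_GB = 100.0

TIERS_GB_PER_S = {
    "memory": (40.0, 60.0),
    "x4 disks": (1.0, 4.0),
    "x10 disks": (0.2, 1.0),
}


def scan_seconds(size_gb, bandwidth_gb_per_s):
    """Seconds needed to read size_gb once at the given bandwidth."""
    return size_gb / bandwidth_gb_per_s


def scan_range(size_gb, low_bw, high_bw):
    """(best, worst) scan time in seconds for a bandwidth range."""
    return scan_seconds(size_gb, high_bw), scan_seconds(size_gb, low_bw)


if __name__ == "__main__":
    for tier, (low, high) in TIERS_GB_PER_S.items():
        best, worst = scan_range(DATASET_GB, low, high)
        print(f"{tier:>10}: {best:7.1f}s to {worst:7.1f}s")
```

Under these assumptions a full scan from memory takes a few seconds, while the same scan from the x10 disk tier takes minutes, which is the gap interactive workloads care about.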
Techniques:
- Low-latency parallel scheduler that achieves high locality
[Figure: result returned in time Tnew (< T)]
Challenges:
- Accurately estimate error and running time for arbitrary computations
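To make the error-estimation challenge concrete, here is a minimal sketch of the simplest case: estimating a mean from a uniform random sample and attaching a normal-approximation confidence interval. This is a textbook technique used for illustration, not the method of any BDAS component, and it covers only a mean, whereas the challenge above concerns arbitrary computations. The function name `approximate_mean` and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of values from a uniform random sample.

    Returns (estimate, margin); estimate +/- margin is an approximate
    95% confidence interval when z=1.96 (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.uniform(0, 100) for _ in range(100_000)]  # hypothetical column
    exact = statistics.fmean(data)
    for n in (100, 1_000, 10_000):
        est, margin = approximate_mean(data, n)
        print(f"n={n:>6}: {est:.2f} +/- {margin:.2f} (exact {exact:.2f})")
```

Larger samples cost more time but shrink the margin roughly with the square root of the sample size, which is one simple form of trading time for quality.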
[Figure: growth trends: one quantity doubles every 18 months, another every 36 months]
Our Approach
Easy to combine batch, streaming, and interactive computations
- Single execution model that supports all computation models
Easy to develop sophisticated algorithms
- Powerful Python and Scala shells
- High-level abstractions for graph-based and ML algorithms
Compatible with existing open source ecosystem (Hadoop/HDFS)
- Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ...)
- Support existing execution models (e.g., Hive, GraphLab)
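One way to picture a single execution model serving both batch and streaming is to treat a stream as a sequence of small batches and reuse the same batch function on each. The sketch below is plain Python, not Spark code, and the small-batch view of streaming is an assumption made for illustration; `word_count`, `micro_batches`, and `streaming_word_count` are hypothetical names.

```python
from collections import Counter
from itertools import islice


def word_count(lines):
    """Batch computation: count words in a finite collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(stream, batch_size):
    """Chop an iterable into small lists (hypothetical batching helper)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream, batch_size=2):
    """Streaming computation: reuse word_count on each small batch and
    yield a copy of the running totals after every batch."""
    totals = Counter()
    for batch in micro_batches(stream, batch_size):
        totals += word_count(batch)
        yield Counter(totals)


if __name__ == "__main__":
    lines = ["spark shark", "spark streaming", "mesos", "spark"]
    print(word_count(lines))
    for running in streaming_word_count(lines):
        print(running)
```

Because the streaming path calls the batch function unchanged, the final running total matches the batch result over the same input.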
In-memory processing
Trade between time, quality, and cost
Efficient data sharing across frameworks
Share infrastructure across frameworks (multi-programming for datacenters)
- 8 CS Faculty
- ~40 students
- 3 software engineers
Organized for collaboration:
XData,
Goal: Next Generation of Analytics Data Stack for Industry & Research:
Berkeley Data Analytics Stack (BDAS)
Release as Open Source
[Figure: stack layers, top to bottom: Application, Data Processing, Data Management, Resource Management, Infrastructure]
[Figure: existing open source stack by layer. Data Processing: Hive, Pig, HBase, Storm, MPI, Hadoop. Data Management: HDFS. Resource Management: no component listed]
[Figure: stack with Mesos. Data Processing: Pig, HBase, Storm, MPI, Hadoop. Data Management: HDFS. Resource Management: Mesos]
[Figure: stack with Spark. Data Processing: Hive, Spark, Pig, Storm, MPI, Hadoop. Data Management: HDFS. Resource Management: Mesos]
Spark Community
3000 people attended online training in August
500+ meetup members
14 companies contributing
sparkproject.org
[Figure: stack with Shark. Data Processing: Shark, Spark, Hive, Pig, Storm, MPI, Hadoop. Data Management: HDFS. Resource Management: Mesos]
[Figure: stack with Spark Streaming and Tachyon. Data Processing: Spark Streaming, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop. Data Management: Tachyon, HDFS. Resource Management: Mesos]
[Figure: stack with Spark Graph and BlinkDB. Data Processing: Spark Streaming, Spark Graph, BlinkDB, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop. Data Management: Tachyon, HDFS. Resource Management: Mesos]
[Figure: stack with MLbase. Data Processing: Spark Streaming, Spark Graph, MLbase, BlinkDB, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop. Data Management: Tachyon, HDFS. Resource Management: Mesos]
[Figure: BDAS stack annotated with compatibility points. Data Processing: Spark Streaming, Spark Graph, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop. Data Management: Tachyon, HDFS. Resource Management: Mesos. Annotations: GraphLab API; Shell; HDFS API; compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos]
[Figure: BDAS stack annotated with: support HDFS API, S3 API, and Hive metadata. Data Processing: Spark Streaming, Spark Graph, BlinkDB, MLbase, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop. Data Management: Tachyon, HDFS. Resource Management: Mesos]
Summary
Holistic approach to address next generation of Big Data challenges!
Support interactive and streaming computations
- In-memory, fault-tolerant storage abstraction, low-latency scheduling, ...
Easy to combine batch, streaming, and interactive computations
- Spark execution engine supports all computation models
Easy to develop sophisticated algorithms
- Scala interface, APIs for Java, Python, Hive QL,
[Figure: Spark supporting batch, interactive, and streaming computations]
What's Next?
This tutorial:
- Matei Zaharia: Spark
- Tathagata Das (TD): Spark Streaming
- Reynold Xin: Shark
Afternoon tutorial:
- Hands-on with Spark, Spark Streaming, and Shark