Berkeley Data Analytics Stack (BDAS) Overview: Ion Stoica UC Berkeley



What Is Big Data Used For?


Reports, e.g.,
- Track business processes, transactions
Diagnosis, e.g.,
- Why is user engagement dropping?
- Why is the system slow?
- Detect spam, worms, viruses, DDoS attacks
Decisions, e.g.,
- Decide what feature to add
- Decide what ad to show
- Block worms, viruses, ...

Data is only as useful as the decisions it enables

Data Processing Goals


Low-latency (interactive) queries on historical data: enable faster decisions
- E.g., identify why a site is slow and fix it
Low-latency queries on live data (streaming): enable decisions on real-time data
- E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 sec)
Sophisticated data processing: enable better decisions
- E.g., anomaly detection, trend analysis

Today's Open Analytics Stack


...mostly focused on large on-disk datasets: great for batch, but slow

Application
Data Processing
Storage
Infrastructure

Goals

One stack to rule them all!
- Batch
- Interactive
- Streaming

Easy to combine batch, streaming, and interactive computations


Easy to develop sophisticated algorithms
Compatible with existing open source ecosystem (Hadoop/HDFS)

Our Approach: Support Interactive and Streaming Comp.


Aggressive use of memory
Why?
1. Memory transfer rates >> disk or even SSDs
- Gap is growing especially w.r.t. disk
2. Many datasets already fit into memory
- The inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
- E.g., 1 TB = 1 billion records @ 1 KB each
3. Memory density (still) grows with Moore's law
- RAM/SSD hybrid memories on the horizon

High-end datacenter node (figure):
- Network: 10 Gbps
- RAM: 128-512 GB, 40-60 GB/s
- 16 cores
- 0.2-1 GB/s (x10 disks), 10-30 TB
- 1-4 GB/s (x4 disks), 1-4 TB
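
The sizing argument on this slide can be checked with a few lines of arithmetic. The sketch below takes the slide's figures (1 billion records at 1 KB each, 128-512 GB of RAM per node) and asks how many nodes would hold the dataset in memory. Decimal units, the helper names `dataset_bytes` and `nodes_needed`, and the omission of any per-record or runtime overhead are assumptions made for illustration.

```python
import math

# Decimal units (an assumption; the slide does not say decimal or binary).
KB = 10**3
GB = 10**9
TB = 10**12


def dataset_bytes(num_records: int, record_bytes: int) -> int:
    """Raw size of a dataset, ignoring any storage overhead."""
    return num_records * record_bytes


def nodes_needed(total_bytes: int, ram_per_node_bytes: int) -> int:
    """Smallest number of nodes whose combined RAM holds the dataset."""
    return math.ceil(total_bytes / ram_per_node_bytes)


size = dataset_bytes(10**9, 1 * KB)
print(size == 1 * TB)                  # True: 1 billion x 1 KB = 1 TB
print(nodes_needed(size, 128 * GB))    # 8 nodes at the low end of the RAM range
print(nodes_needed(size, 512 * GB))    # 2 nodes at the high end
```

Under these assumptions a 1 TB dataset fits in the memory of a handful of high-end nodes, which is the point the slide makes.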

Our Approach: Support Interactive and Streaming Comp.


Increase parallelism
Why?
- Reduce work per node to improve latency
Techniques:
- Low-latency parallel scheduler that achieves high locality
- Optimized parallel communication patterns (e.g., shuffle, broadcast)
- Efficient recovery from failures and straggler mitigation
[Figure: with more parallelism, the time to produce the result drops from T to Tnew (< T)]
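
A minimal sketch of the "reduce work per node" idea, using worker threads on one machine as stand-ins for cluster nodes. The helper names `partition` and `parallel_sum` are hypothetical. Because CPython threads do not speed up CPU-bound code, the example reports the amount of work assigned to the busiest worker rather than wall-clock time.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_workers):
    """Split data into num_workers nearly equal contiguous chunks."""
    size, extra = divmod(len(data), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, num_workers):
    """Sum data in parallel; also return the largest per-worker chunk size."""
    chunks = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(sum, chunks))  # one partial result per worker
    return sum(partials), max(len(c) for c in chunks)


data = list(range(1_000_000))
total_1, work_1 = parallel_sum(data, 1)
total_8, work_8 = parallel_sum(data, 8)
print(total_1 == total_8)   # same result either way
print(work_1, work_8)       # 1000000 125000: 8x less work per worker
```

The final combine step (`sum(partials)`) is a tiny version of the communication patterns the slide lists; in a real cluster that step, scheduling, and stragglers determine how close Tnew gets to T divided by the number of nodes.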

Our Approach: Support Interactive and Streaming Comp.


Trade between result accuracy and response times
Why?
- In-memory processing does not guarantee interactive query processing
- E.g., ~10 sec just to scan 512 GB of RAM!
- Gap between memory capacity and transfer rate is increasing
Challenges:
- Accurately estimate error and running time for arbitrary computations
[Figure: node with 16 cores; memory capacity (128-512 GB) doubles every 18 months, transfer rate (40-60 GB/s) doubles every 36 months]
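
The scan-time figure on this slide follows from dividing memory size by memory transfer rate. The sketch below reproduces it and projects the gap forward, reading the figure labels as "capacity doubles every 18 months, transfer rate doubles every 36 months"; that reading, the decimal units, and the function names are assumptions.

```python
GB = 10**9  # decimal gigabyte (an assumption)


def scan_seconds(data_bytes, bandwidth_bytes_per_sec):
    """Time for one full pass over data at a given transfer rate."""
    return data_bytes / bandwidth_bytes_per_sec


def scan_seconds_after(years, data_bytes, bandwidth_bytes_per_sec,
                       capacity_doubling_months=18,
                       bandwidth_doubling_months=36):
    """Scan time if memory stays full while capacity and bandwidth both grow."""
    months = 12 * years
    grown_data = data_bytes * 2 ** (months / capacity_doubling_months)
    grown_bw = bandwidth_bytes_per_sec * 2 ** (months / bandwidth_doubling_months)
    return grown_data / grown_bw


slow = scan_seconds(512 * GB, 40 * GB)
fast = scan_seconds(512 * GB, 60 * GB)
print(f"{fast:.1f}-{slow:.1f} s")                    # 8.5-12.8 s, i.e. roughly 10 s
print(scan_seconds_after(3, 512 * GB, 40 * GB))      # about 25.6 s three years later
```

With these growth rates the time to scan a full node doubles every three years, which is why the slide argues for trading accuracy against response time instead of relying on memory alone.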

Our Approach
Easy to combine batch, streaming, and interactive computations
- Single execution model that supports all computation models
Easy to develop sophisticated algorithms
- Powerful Python and Scala shells
- High-level abstractions for graph-based and ML algorithms
Compatible with existing open source ecosystem (Hadoop/HDFS)
- Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ...)
- Support existing execution models (e.g., Hive, GraphLab)

Berkeley Data Analytics Stack (BDAS)


Application: new apps (AMP-Genomics, Carat, ...)
Data Processing: in-memory processing; trade between time, quality, and cost
Data Management (Storage): efficient data sharing across frameworks
Resource Management (Infrastructure): share infrastructure across frameworks (multi-programming for datacenters)

The Berkeley AMPLab


Launched January 2011: 6-Year Plan

- 8 CS Faculty
- ~40 students
- 3 software engineers
Organized for collaboration:

The Berkeley AMPLab


Funding:
- XData, CISE Expedition Grant

- Industrial, founding sponsors


- 18 other sponsors, including

Goal: Next Generation of Analytics Data Stack for Industry & Research:
Berkeley Data Analytics Stack (BDAS)
Release as Open Source

Berkeley Data Analytics Stack (BDAS)

Application
Data Processing
Data Management
Resource Management (Infrastructure)

Berkeley Data Analytics Stack (BDAS)


Existing stack components.

Stack:
- Data Processing: Hive, Pig, HBase, Storm, MPI, Hadoop
- Data Mgmnt.: HDFS
- Resource Mgmnt.: (no component shown)

Mesos [Released, v0.9]


Management platform that allows multiple frameworks to share a cluster
Compatible with existing open analytics stack
Deployed in production at Twitter on 3,500+ servers
Stack:
- Data Processing: Hive, Pig, HBase, Storm, MPI, Hadoop
- Data Mgmnt.: HDFS
- Resource Mgmnt.: Mesos

Spark [Released, v0.7]


In-memory framework for interactive and iterative computations
- Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction
Scala interface, Java and Python APIs


Stack:
- Data Processing: Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: HDFS
- Resource Mgmnt.: Mesos
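
To make the RDD bullet concrete, here is a toy, single-process sketch of the idea behind a fault-tolerant in-memory dataset: each dataset remembers how it was derived (its lineage), so a lost in-memory partition can be recomputed instead of being restored from a replica. `ToyRDD` and all of its methods are hypothetical names for illustration only; this is not Spark's API or implementation.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions are rebuilt from lineage on demand."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute   # lineage: how to (re)build partition i
        self._cache = {}          # in-memory copies of materialized partitions

    @classmethod
    def from_list(cls, data, num_partitions):
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn):
        # The child records only its parent and the function to apply.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)])

    def filter(self, pred):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)])

    def partition(self, i):
        if i not in self._cache:              # missing or lost: recompute
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_list(list(range(10)), num_partitions=2)
squares_of_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

before = squares_of_evens.collect()
squares_of_evens.lose_partition(0)     # "failure": partition 0 is gone
after = squares_of_evens.collect()     # rebuilt from the parent via lineage
print(before == after)                 # True
```

The same structure explains why iterative and interactive workloads benefit: once a partition is cached, repeated queries reuse it from memory rather than re-reading the input.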

Spark Community
3000 people attended online training in August
500+ meetup members
14 companies contributing

sparkproject.org

Spark Streaming [Alpha Release]


Large-scale streaming computation
Ensures exactly-once semantics
Integrated with Spark: unifies batch, interactive, and streaming computations!

Stack:
- Data Processing: Spark Streaming, Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: HDFS
- Resource Mgmnt.: Mesos
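
One way to picture "unifies batch and streaming" is to treat a stream as a sequence of small batches and run the same batch function on each one. The sketch below does that in plain Python; it is a toy model of the idea, not the Spark Streaming API, and the names `word_count`, `micro_batches`, and `streaming_word_count` are hypothetical.

```python
from collections import Counter
from itertools import islice


def word_count(lines):
    """Batch computation: count words in a finite collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterable into consecutive small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream, batch_size):
    """Reuse the batch function on each micro-batch and keep running totals."""
    running = Counter()
    for batch in micro_batches(stream, batch_size):
        running += word_count(batch)   # each line is counted in exactly one batch
        yield dict(running)


lines = ["spark shark", "spark mesos", "shark", "spark"]
snapshots = list(streaming_word_count(lines, batch_size=2))
print(snapshots[-1] == dict(word_count(lines)))   # True: stream and batch agree
```

Because every input line belongs to exactly one batch and each batch is a deterministic computation, the streaming totals match the batch result; a real system must additionally preserve that property across failures.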

Shark [Released, v0.2]


Hive over Spark: SQL-like interface (supports Hive 0.9)
- Up to 100x faster for in-memory data, and 5-10x for disk
In tests on a cluster of hundreds of nodes at ...

Stack:
- Data Processing: Spark Streaming, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: HDFS
- Resource Mgmnt.: Mesos

Spark & Shark available now on EMR!

Tachyon [Alpha Release, this Spring]


High-throughput, fault-tolerant in-memory storage
Interface compatible with HDFS
Support for Spark and Hadoop

Stack:
- Data Processing: Spark Streaming, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: Tachyon, HDFS
- Resource Mgmnt.: Mesos

BlinkDB [Alpha Release, this Spring]


Large scale approximate query engine
Allows users to specify error or time bounds
Preliminary prototype starting to be tested at Facebook

Stack:
- Data Processing: BlinkDB, Spark Streaming, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: Tachyon, HDFS
- Resource Mgmnt.: Mesos
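
A rough illustration of an approximate query with a user-specified error bound: estimate an average from a random sample and report a margin of error, choosing the sample size from a small pilot sample. This uses plain uniform sampling and a normal-approximation confidence interval; it is an assumption-laden sketch, not BlinkDB's actual method, and the function names and the synthetic `latencies` data are hypothetical.

```python
import math
import random
import statistics


def approx_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform sample; return (estimate, margin).

    The margin is a ~95% normal-approximation bound when z = 1.96.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, z * stderr


def sample_size_for_error(values, max_error, pilot_size=1000, z=1.96, seed=0):
    """Pick a sample size expected to keep the margin under max_error."""
    rng = random.Random(seed)
    pilot = rng.sample(values, min(pilot_size, len(values)))
    sigma = statistics.stdev(pilot)            # spread estimated from the pilot
    needed = math.ceil((z * sigma / max_error) ** 2)
    return min(needed, len(values))


# Synthetic data standing in for a large table column.
data_rng = random.Random(42)
latencies = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

n = sample_size_for_error(latencies, max_error=1.0)
estimate, margin = approx_mean(latencies, n)
exact = statistics.fmean(latencies)
print(f"sampled {n} of {len(latencies)} rows")
print(f"estimate {estimate:.2f} +/- {margin:.2f}, exact {exact:.2f}")
```

The query touches well under 1% of the rows while the reported margin stays near the requested bound. A time bound could be handled the same way by capping the sample size at whatever can be scanned in the allotted time and reporting the resulting margin.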

SparkGraph [Alpha Release, this Spring]


GraphLab API and Toolkits on top of Spark
Fault tolerance by leveraging Spark

Stack:
- Data Processing: Spark Streaming, Spark Graph, BlinkDB, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: Tachyon, HDFS
- Resource Mgmnt.: Mesos

MLbase [In development]


Declarative approach to ML
Develop scalable ML algorithms
Make ML accessible to non-experts

Stack:
- Data Processing: Spark Streaming, Spark Graph, MLbase, BlinkDB, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: Tachyon, HDFS
- Resource Mgmnt.: Mesos

Compatible with Open Source Ecosystem


Support existing interfaces whenever possible
- GraphLab API
- Hive interface and shell
- HDFS API
- Compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos

Stack:
- Data Processing: Spark Streaming, Spark Graph, BlinkDB, MLbase, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: Tachyon, HDFS
- Resource Mgmnt.: Mesos

Compatible with Open Source Ecosystem


Use existing interfaces whenever possible
- Accept inputs from Kafka, Flume, Twitter, TCP sockets, ...
- Support Hive API
- Support HDFS API, S3 API, and Hive metadata

Stack:
- Data Processing: Spark Streaming, Spark Graph, BlinkDB, MLbase, Shark, Spark, Hive, Pig, Storm, MPI, Hadoop
- Data Mgmnt.: Tachyon, HDFS
- Resource Mgmnt.: Mesos

Summary
Holistic approach to address next generation of Big Data challenges!
Support interactive and streaming computations
- In-memory, fault-tolerant storage abstraction, low-latency scheduling,...
Easy to combine batch, streaming, and interactive computations
- Spark execution engine supports all comp. models
Easy to develop sophisticated algorithms
- Scala interface, APIs for Java, Python, Hive QL, ...
- New frameworks targeted to graph-based and ML algorithms
[Figure: Spark at the center of batch, interactive, and streaming computations]
Compatible with existing open source ecosystem
Open source (Apache/BSD) and fully committed to releasing high-quality software
- Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)

What's Next?
This tutorial:
- Matei Zaharia: Spark
- Tathagata Das (TD): Spark Streaming
- Reynold Xin: Shark
Afternoon tutorial:
- Hands-on with Spark, Spark Streaming, and Shark
