Analyzing Big Data in Hadoop Spark
Spark
Dan Lo
Department of Computer Science
Kennesaw State University
Otto von Bismarck
• Data applications are like sausages. It is better not to see them being made.
Data
• The Large Hadron Collider produces about 30 petabytes of data per year
• Facebook’s data is growing at 8 petabytes per month
• The New York Stock Exchange generates about 4 terabytes of data per day
• YouTube had around 80 petabytes of storage in 2012
• Internet Archive stores around 19 petabytes of data
Cloud and Distributed Computing
Apache Hadoop Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
• Other modules: ZooKeeper, Impala, Oozie, etc.
[Hadoop stack diagram: HBase on top of MapReduce and other distributed-processing engines, with YARN as the resource manager]
• Master-Slave design
• Master Node
• Single NameNode for managing metadata
• Slave Nodes
• Multiple DataNodes for storing data
• Other
• Secondary NameNode, which periodically checkpoints the NameNode's metadata (not a hot standby)
HDFS Architecture
The NameNode keeps the metadata: file names, block locations, and the directory structure.
DataNodes provide storage for blocks of data.
[HDFS architecture diagram: a client talks to the NameNode (with a Secondary NameNode alongside); a file is split into blocks B1–B4 stored on DataNodes]
Shortcomings of MapReduce
• Forces your data processing into Map and Reduce (see the word-count sketch after this list)
• Other workflows missing include join, filter, flatMap, groupByKey, union, intersection, …
• Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
• Reads from and writes to disk before and after Map and Reduce (stateless between jobs)
• Not efficient for iterative tasks, e.g. machine learning
• Only Java natively supported
• Support for other languages needed
• Only for Batch processing
• No support for interactive queries or streaming data
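To make the first point concrete, here is a minimal word-count sketch in the style of Hadoop Streaming (the file names mapper.py and reducer.py and the word-count task itself are illustrative, not from the slides): every job has to be phrased as a mapper and a reducer, and intermediate results pass through disk between stages.

    #!/usr/bin/env python3
    # mapper.py -- emit (word, 1) for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- input arrives sorted by key; sum the counts per word
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Anything beyond this shape (a join, a filter chained with a group-by, an iterative loop) has to be decomposed into additional map-reduce rounds, each paying the HDFS read/write cost.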
Challenges of Data Science
• Data preprocessing or data engineering (a majority of work)
• Iteration is a fundamental part of data science
• Stochastic gradient descent (SGD)
• Maximum likelihood estimation (MLE)
• Choosing the right features, picking the right algorithms, running the right significance tests, finding the right hyperparameters, etc.
• So data should be read once and stay in memory! (See the caching sketch after this list.)
• Integration of models to real useful products
• Easy modeling but hard to work well in reality
• R is slow and lacks integration capability
• Java and C++ are poor for exploratory analytics (no Read-Evaluate-Print Loop)
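As a rough illustration of "read once and stay in memory", a minimal PySpark sketch; the Parquet path and the column names x and y are invented for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    # Read the data once from distributed storage (illustrative path).
    points = spark.read.parquet("hdfs:///data/points.parquet")
    points.cache()   # keep the dataset in cluster memory across iterations

    w = 0.0
    for step in range(10):
        # Each pass of this gradient-style loop reuses the cached data
        # instead of re-reading it from disk.
        grad = points.rdd.map(lambda row: (row["x"] * w - row["y"]) * row["x"]).mean()
        w -= 0.1 * grad

    print("fitted weight:", w)
    spark.stop()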
Apache Spark
• Open source, originated from UC Berkeley AMPLab
• Distributed over a cluster of machines
• An elegant programming model
• Predecessor: MapReduce -> linear scalability and resilience to failures
• Improvements
• Executes operations over a directed acyclic graph (DAG), like Microsoft’s Dryad, rather than map-then-reduce, keeping data in memory instead of writing it to disk
• A rich set of transformations and APIs for easier programming
• In-memory processing across operations: Resilient Distributed Datasets (RDDs)
• Supports Scala and Python APIs (a short PySpark example follows this list)
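A minimal PySpark sketch of that programming model (the HDFS path is invented for the example): transformations such as filter, flatMap, map, and reduceByKey only describe the DAG, and nothing runs until an action such as take is called.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/logs.txt")   # illustrative path

    # Lazy transformations: these only build up the DAG.
    error_counts = (lines.filter(lambda line: "ERROR" in line)
                         .flatMap(lambda line: line.split())
                         .map(lambda word: (word, 1))
                         .reduceByKey(lambda a, b: a + b))

    # The action triggers the whole pipeline, which stays in memory end to end.
    print(error_counts.take(5))

    spark.stop()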
Directed Acyclic Graphs (DAG)
[Example DAG: RDD nodes A–F connected by transformation arrows]
DAGs track dependencies (also known as lineage; see the sketch below)
nodes are RDDs
arrows are Transformations
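One way to inspect the lineage Spark records is RDD.toDebugString(); a minimal sketch, assuming a local PySpark session and made-up data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1000))   # base RDD
    result = numbers.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

    # toDebugString() prints the chain of parent RDDs (the lineage) that Spark
    # would replay to recompute lost partitions after a failure.
    print(result.toDebugString().decode("utf-8"))

    spark.stop()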
Spark Uses Memory instead of Disk
[Diagram: Hadoop shares data between Iteration 1 and Iteration 2 by writing to and reading from HDFS on disk; Spark keeps the intermediate data in memory]
Sort competition
                                Hadoop MR (2013 record)     Spark (2014 record)
Data size                       102.5 TB                    100 TB
Elapsed time                    72 mins                     23 mins
# Nodes                         2100                        206
# Cores                         50400 physical              6592 virtualized
Cluster disk throughput (est.)  3150 GB/s                   618 GB/s
Network                         dedicated data center,      virtualized (EC2),
                                10 Gbps                     10 Gbps
Sort rate                       1.42 TB/min                 4.27 TB/min
Sort rate/node                  0.67 GB/min                 20.7 GB/min

Spark: roughly 3x faster with 1/10 the nodes
[Spark stack diagram: DataFrames and ML Pipelines on top; Spark SQL, Spark Streaming, MLlib, and GraphX libraries; Spark Core at the base; Data Sources below]
Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre)
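A rough sketch of how a few of these sources are reached from PySpark; the paths, bucket, host, table, and credentials are invented, and the JDBC example also assumes the MySQL driver is on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sources-demo").getOrCreate()

    # Text file on HDFS (illustrative path).
    logs = spark.sparkContext.textFile("hdfs:///data/logs.txt")

    # JSON on Amazon S3 (illustrative bucket; S3 credentials must be configured).
    events = spark.read.json("s3a://my-bucket/events/")

    # MySQL over JDBC (illustrative connection details).
    users = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost:3306/mydb")
             .option("dbtable", "users")
             .option("user", "analyst")
             .option("password", "secret")
             .load())

    print(logs.count(), events.count(), users.count())
    spark.stop()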
Resilient Distributed Datasets (RDDs)