Apache Hadoop and Spark: Use Cases for Data Analysis
Introduction
http://www.bigdatavietnam.org
Outline
• Enable Scalability
– on commodity hardware
• Handle Fault Tolerance
• Can Handle a Variety of Data Types
– Text, Graph, Streaming Data, Images,…
• Shared Environment
• Provides Value
– Cost
Hadoop Ecosystem
[Figure: layer diagram of the Hadoop ecosystem]
Apache Hadoop Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Other modules: ZooKeeper, Impala, Oozie, etc.
[Figure: layer diagram – HBase, MapReduce, and other distributed-processing frameworks on top of YARN, the resource manager]
• Master-Slave design
• Master Node
– Single NameNode for managing metadata
• Slave Nodes
– Multiple DataNodes for storing data
• Other
– Secondary NameNode, which checkpoints the NameNode's metadata (not a hot standby)
HDFS Architecture
NameNode keeps the metadata: file name, location, and directory
DataNodes provide storage for blocks of data
[Figure: HDFS read – a client asks the NameNode for metadata, then reads blocks (B1–B4) of a file directly from the DataNodes; a Secondary NameNode checkpoints the NameNode's metadata]
Sort competition (Daytona GraySort): Spark was 3x faster with 1/10 the nodes

                         Hadoop MR Record (2013)      Spark Record (2014)
Data size                102.5 TB                     100 TB
Elapsed time             72 min                       23 min
# Nodes                  2,100                        206
# Cores                  50,400 physical              6,592 virtualized
Cluster disk
throughput (est.)        3,150 GB/s                   618 GB/s
Network                  dedicated data center,       virtualized (EC2),
                         10 Gbps                      10 Gbps
Sort rate                1.42 TB/min                  4.27 TB/min
Sort rate/node           0.67 GB/min                  20.7 GB/min
[Figure: the Spark stack – Spark SQL (DataFrames), Spark Streaming, MLlib (ML Pipelines), and GraphX on top of Spark Core]
Data Sources
Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre)
Core concepts
• Resilient Distributed Datasets (RDDs): immutable, partitioned collections of records that can be operated on in parallel and rebuilt from their lineage after a failure
Spark Operations
• Transformations (create a new RDD):
map, flatMap, filter, union, sample, join, groupByKey, cogroup,
reduceByKey, cross, sortByKey, mapValues, intersection
• Actions (return results to the driver program):
collect, first, reduce, take, count, takeOrdered, takeSample,
countByKey, saveAsTextFile, lookup, foreach
Directed Acyclic Graphs (DAG)
[Figure: example DAG with RDDs A–F connected by transformations]
DAGs track dependencies (also known as lineage)
➢ nodes are RDDs
➢ arrows are transformations
Narrow vs. Wide Transformations
[Figure: map is a narrow transformation – each output partition depends on a single input partition; groupByKey is a wide transformation – it shuffles records, e.g. all ("A", …) pairs, into the same partition]
Actions
• What is an action?
– The final stage of the workflow
– Triggers the execution of the DAG
– Returns the results to the driver
– Or writes the data to HDFS or to a file
Spark Workflow
[Figure: the driver program creates a Spark Context, which builds the DAG and schedules work on the executors; actions such as collect() return results to the driver]
Python RDD API Examples
• Word count
text_file = sc.textFile("hdfs:///usr/godil/text/book.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
# saveAsTextFile writes a directory of part files
counts.saveAsTextFile("hdfs:///usr/godil/output/wordCount.txt")
• Logistic Regression
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
• RDD Persistence
– RDD.persist()
– Storage levels:
• MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, …
• RDD Removal
– RDD.unpersist()
Broadcast Variables and Accumulators
(Shared Variables )
• Broadcast variables allow the programmer to keep a
read-only variable cached on each node, rather than sending
a copy of it with tasks
> broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
> broadcastV1.value
[1, 2, 3, 4, 5, 6]
• Accumulators are variables that are only “added” to through
an associative operation and can be efficiently supported in
parallel
accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value   # 10
Spark’s Main Use Cases
• Streaming Data
• Machine Learning
• Interactive Analysis
• Data Warehousing
• Batch Processing
• Exploratory Data Analysis
• Graph Data Analysis
• Spatial (GIS) Data Analysis
• And many more
Spark Use Cases
• Web Analytics
– Developed a Spark-based application for web analytics
• Social Media Sentiment Analysis
– Developed Spark-based sentiment-analysis code
for a social media dataset
My Use Case
Spark in the Real World (I)
• Uber – the online ride-hailing company gathers terabytes of event data from its
mobile users every day.
– Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline
– Converts raw unstructured event data into structured data as it is collected
– Uses it further for more complex analytics and optimization of operations
• Capital One – uses Spark and data science algorithms to better understand its
customers.
– Developing the next generation of financial products and services
– Finding attributes and patterns that indicate an increased probability of fraud
• Netflix – leverages Spark to gain insight into users' viewing habits and then
recommend movies to them.
– User data is also used for content creation
Spark: when not to use
"MPI definitely outpaces Hadoop, but Hadoop can be boosted using a hybrid approach of other
technologies that blend HPC and big data, including Spark and HARP." – Dr. Geoffrey Fox,
Indiana University (http://arxiv.org/pdf/1403.1528.pdf)
Conclusion
• MapReduce and Spark are two very popular open source cluster
computing frameworks for large scale data analytics
• These frameworks hide the complexity of task parallelism and
fault tolerance by exposing a simple programming API to users