
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis

Afzal Godil
Information Access Division, ITL, NIST
Outline

• Growth of big datasets
• Introduction to Apache Hadoop and Spark for developing applications
• Components of Hadoop: HDFS, MapReduce, and HBase
• Capabilities of Spark and how it differs from a typical MapReduce solution
• Some Spark use cases for data analysis
Data

• The Large Hadron Collider produces about 30 petabytes of data per year
• Facebook's data is growing at 8 petabytes per month
• The New York Stock Exchange generates about 4 terabytes of data per day
• YouTube had around 80 petabytes of storage in 2012
• The Internet Archive stores around 19 petabytes of data
Cloud and Distributed Computing

• The second trend is the pervasiveness of cloud-based storage and computational resources
  – For processing these big datasets
• Cloud characteristics
  – Provide a scalable, standard environment
  – On-demand computing
  – Pay only for what you use
  – Dynamically scalable
  – Cheaper
Data Processing and Machine Learning Methods

• Data processing (third trend)
  – Traditional ETL (extract, transform, load)
  – Data stores (HBase, …)
  – Tools for processing streaming, multimedia, and batch data
• Machine learning (fourth trend)
  – Classification
  – Regression
  – Clustering
  – Collaborative filtering

[Diagram: four overlapping circles labeled Big Datasets, Data Processing/ETL, Machine Learning, and Distributed Computing]

Working at the intersection of these four trends is very exciting and challenging, and requires new ways to store and process Big Data.
Hadoop Ecosystem

• Enables scalability
  – on commodity hardware
• Handles fault tolerance
• Can handle a variety of data types
  – Text, graph, streaming data, images, …
• Shared environment
• Provides value
  – Cost

How does Hadoop differ from HDFS?
Hadoop Ecosystem: A Layer Diagram

[Diagram: generic stack of layered boxes labeled A, B, C, and D; the concrete layers are shown on the next slide]
Apache Hadoop Basic Modules

• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce

Other modules: Zookeeper, Impala, Oozie, etc.

[Layer diagram, top to bottom:
• Spark, Storm, Tez, etc.
• Pig (scripting) | Hive (SQL-like queries) | HBase (non-relational database)
• MapReduce (distributed processing) | others (distributed processing)
• YARN (resource manager)
• HDFS: distributed file system (storage)]
Hadoop HDFS

• Hadoop Distributed File System (based on the Google File System (GFS) paper, 2004)
  – Serves as the distributed file system for most tools in the Hadoop ecosystem
  – Scalability for large data sets
  – Reliability to cope with hardware failures
• HDFS is good for:
  – Large files
  – Streaming data
• Not good for:
  – Lots of small files
  – Random access to files
  – Low-latency access

Example of scale: a single Hadoop cluster with 5,000 servers and 250 petabytes of data
Design of the Hadoop Distributed File System (HDFS)

• Master-slave design
• Master node
  – A single NameNode for managing metadata
• Slave nodes
  – Multiple DataNodes for storing data
• Other
  – A Secondary NameNode, which periodically checkpoints the NameNode's metadata (it is not a hot standby)
HDFS Architecture

The NameNode keeps the metadata: file names, block locations, and the directory tree.
DataNodes provide storage for blocks of data.

[Diagram: a Client talks to the NameNode (with a Secondary NameNode alongside) and to a pool of DataNodes; the DataNodes exchange heartbeats, commands, and data with the NameNode]
What happens if a node fails?
Replication of blocks provides fault tolerance.

[Diagram: a file is split into blocks B1–B4; each block is replicated on three different nodes, so the loss of any single node leaves every block available elsewhere]
HDFS Blocks

• HDFS files are divided into blocks
  – The block is the basic unit of read/write
  – The default size is 64 MB, but it can be larger (e.g., 128 MB)
  – This makes HDFS good for storing large files
• HDFS blocks are replicated multiple times
  – Each block is stored at multiple locations, including on different racks (usually 3 copies)
  – This makes HDFS storage fault tolerant and faster to read
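
As a quick back-of-the-envelope illustration (the 200 MB file size below is an assumption for the example; the 64 MB block size and 3x replication come from the slide):

  import math

  file_size_mb = 200   # illustrative file size (assumed)
  block_mb = 64        # default HDFS block size from the slide
  replication = 3      # usual replication factor

  # Number of blocks the file is split into (the last block may be partial)
  blocks = math.ceil(file_size_mb / block_mb)       # 4 blocks
  # Raw cluster storage consumed once every block is replicated
  raw_storage_mb = file_size_mb * replication       # 600 MB

  print(blocks, raw_storage_mb)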
A Few HDFS Shell Commands

Create a directory in HDFS:
• hadoop fs -mkdir /user/godil/dir1

List the contents of a directory:
• hadoop fs -ls /user/godil

Upload and download a file in HDFS:
• hadoop fs -put /home/godil/file.txt /user/godil/datadir/
• hadoop fs -get /user/godil/datadir/file.txt /home/

Look at the contents of a file:
• hadoop fs -cat /user/godil/datadir/book.txt

There are many more commands, similar to Unix.
MapReduce: Simple Programming for Big Data

Based on Google's MapReduce paper (2004)

• MapReduce is a simple programming paradigm for the Hadoop ecosystem
• Traditional parallel programming requires expertise in many computing/systems concepts
  – Examples: multithreading, synchronization mechanisms (locks, semaphores, and monitors)
  – Incorrect use can crash your program, produce incorrect results, or severely impact performance
  – It is usually not fault tolerant to hardware failure
• The MapReduce programming model greatly simplifies running code in parallel
  – You don't have to deal with any of the above issues
  – You only need to write map and reduce functions
The MapReduce Paradigm

• Map and Reduce are based on functional programming

Map: apply a function to all the elements of a list

  list1 = [1,2,3,4,5]
  square x = x * x
  list2 = map square list1
  print list2
  -> [1,4,9,16,25]

Reduce: combine all the elements of a list into a summary value

  list1 = [1,2,3,4,5]
  A = reduce (+) list1
  print A
  -> 15

Input → Map → Reduce → Output
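
The same pattern in ordinary Python (a minimal, runnable version of the functional-style pseudocode above):

  from functools import reduce

  list1 = [1, 2, 3, 4, 5]

  # Map: apply a function to every element of the list
  list2 = list(map(lambda x: x * x, list1))
  print(list2)   # [1, 4, 9, 16, 25]

  # Reduce: combine all elements into a single summary value
  total = reduce(lambda a, b: a + b, list1)
  print(total)   # 15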


MapReduce Word Count Example

[Diagram: an input file with the lines "I am Sam" and "Sam I am" is split across nodes. Each Map task emits (word, 1) pairs such as (I,1), (am,1), (Sam,1). The shuffle & sort phase groups the pairs by word, and the Reduce tasks sum the counts, producing (I,2), (am,2), (Sam,2).]
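
The same pipeline can be sketched in plain Python as a single-machine simulation of the map, shuffle & sort, and reduce phases (the input lines come from the diagram above):

  from itertools import groupby
  from operator import itemgetter

  lines = ["I am Sam", "Sam I am"]

  # Map phase: emit a (word, 1) pair for every word
  mapped = [(word, 1) for line in lines for word in line.split()]

  # Shuffle & sort phase: bring identical keys together
  mapped.sort(key=itemgetter(0))

  # Reduce phase: sum the counts for each word
  counts = {word: sum(count for _, count in group)
            for word, group in groupby(mapped, key=itemgetter(0))}

  print(counts)   # {'I': 2, 'Sam': 2, 'am': 2}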
Shortcomings of MapReduce

• Forces your data processing into the Map and Reduce pattern
  – Other missing workflows include join, filter, flatMap, groupByKey, union, intersection, …
• Based on an "acyclic data flow" from disk to disk (HDFS)
  – Reads from and writes to disk before and after each Map and Reduce (stateless)
  – Not efficient for iterative tasks, e.g., machine learning
• Only Java is natively supported
  – Support for other languages is needed
• Only for batch processing
  – No interactivity or streaming data
One Solution is Apache Spark

• A new general framework that solves many of the shortcomings of MapReduce
• Capable of leveraging the Hadoop ecosystem, e.g., HDFS, YARN, HBase, S3, …
• Has many other workflows: join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first, …
  – (around 30 efficient distributed operations; a few are illustrated in the sketch below)
• In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• The Spark API is extremely simple to use
• Developed at AMPLab, UC Berkeley; now maintained by Databricks.com
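
A minimal PySpark sketch showing a few of these operations; the HDFS path reuses the illustrative one from the shell-command slide:

  from pyspark import SparkContext

  sc = SparkContext("local[*]", "WordCount")

  # Read a text file from HDFS (path is illustrative)
  lines = sc.textFile("hdfs:///user/godil/datadir/book.txt")

  counts = (lines.flatMap(lambda line: line.split())   # one record per word
                 .map(lambda word: (word, 1))          # emit (word, 1) pairs
                 .reduceByKey(lambda a, b: a + b))     # sum counts per word

  print(counts.take(5))   # first five (word, count) pairs
  sc.stop()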
Spark Uses Memory Instead of Disk

Hadoop: uses disk for data sharing
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS, so Iteration 1 and Iteration 2 are separated by a full HDFS write/read cycle]

Spark: in-memory data sharing
[Diagram: the data is read from HDFS once; Iteration 1 and Iteration 2 then share it directly in memory]
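
A minimal sketch of why this matters for iterative jobs; the toy dataset and gradient-descent loop below are illustrative assumptions, not from the slides:

  from pyspark import SparkContext

  sc = SparkContext("local[*]", "IterativeDemo")

  # Toy (x, y) pairs for a one-parameter least-squares fit (assumed data)
  points = sc.parallelize([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]).cache()

  w = 0.0
  for i in range(10):
      # Each pass reuses the cached RDD from memory; a MapReduce-style
      # job would re-read the input from HDFS on every iteration.
      grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]) \
                   .reduce(lambda a, b: a + b)
      w -= 0.05 * grad

  print(w)   # converges to roughly 2.0, the slope of the toy data
  sc.stop()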
