Apache Hadoop and Spark: Use Cases for Data Analysis
Introduction
http://www.bigdatavietnam.org
Outline
• Enable Scalability
– on commodity hardware
• Handle Fault Tolerance
• Can Handle a Variety of Data Types
– Text, Graph, Streaming Data, Images,…
• Shared Environment
• Provides Value
– Cost
Hadoop Ecosystem
[Figure: layer diagram of the Hadoop ecosystem]
Apache Hadoop Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Other modules: ZooKeeper, Impala, Oozie, etc.
[Figure: layer diagram – HBase, MapReduce, and other distributed-processing frameworks on top of YARN, the resource manager]
• Master-Slave design
• Master Node
– Single NameNode for managing metadata
• Slave Nodes
– Multiple DataNodes for storing data
• Other
– Secondary NameNode, which checkpoints the NameNode's metadata (not a hot standby)
HDFS Architecture
NameNode keeps the metadata: file name, location, and directory
DataNodes provide storage for blocks of data
[Figure: HDFS read – a client asks the NameNode for metadata, then reads blocks (B1–B4) of a file directly from the DataNodes; a Secondary NameNode checkpoints the NameNode's metadata]
Sort competition (Daytona GraySort): Spark was 3x faster with 1/10 the nodes

                         Hadoop MR Record (2013)      Spark Record (2014)
Data size                102.5 TB                     100 TB
Elapsed time             72 min                       23 min
# Nodes                  2,100                        206
# Cores                  50,400 physical              6,592 virtualized
Cluster disk
throughput (est.)        3,150 GB/s                   618 GB/s
Network                  dedicated data center,       virtualized (EC2),
                         10 Gbps                      10 Gbps
Sort rate                1.42 TB/min                  4.27 TB/min
Sort rate/node           0.67 GB/min                  20.7 GB/min
[Figure: the Spark stack – Spark SQL (DataFrames), Spark Streaming, MLlib (ML Pipelines), and GraphX on top of Spark Core]
Data Sources
Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre)
Core concepts
• Resilient Distributed Datasets (RDDs): immutable, partitioned collections of records that can be operated on in parallel and rebuilt from their lineage after a failure
Spark Operations
• Transformations (create a new RDD):
map, flatMap, filter, union, sample, join, groupByKey, cogroup,
reduceByKey, cross, sortByKey, mapValues, intersection
• Actions (return results to the driver program):
collect, first, reduce, take, count, takeOrdered, takeSample,
countByKey, saveAsTextFile, lookup, foreach
Directed Acyclic Graphs (DAG)
[Figure: example DAG with RDDs A–F connected by transformations]
DAGs track dependencies (also known as lineage)
➢ nodes are RDDs
➢ arrows are transformations
Narrow vs. Wide Transformations
[Figure: map is a narrow transformation – each output partition depends on a single input partition; groupByKey is a wide transformation – it shuffles records, e.g. all ("A", …) pairs, into the same partition]
Actions
• What is an action?
– The final stage of the workflow
– Triggers the execution of the DAG
– Returns the results to the driver
– Or writes the data to HDFS or to a file
Spark Workflow
[Figure: the driver program creates a Spark Context, which builds the DAG and schedules work on the executors; actions such as collect() return results to the driver]
Python RDD API Examples
• Word count
text_file = sc.textFile("hdfs:///usr/godil/text/book.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
# saveAsTextFile writes a directory of part files
counts.saveAsTextFile("hdfs:///usr/godil/output/wordCount.txt")
• Logistic Regression
from pyspark.ml.classification import LogisticRegression

# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
# Here, we limit the number of iterations to 10.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
• RDD Persistence
– RDD.persist()
– Storage levels:
• MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY, …
• RDD Removal
– RDD.unpersist()
Broadcast Variables and Accumulators
(Shared Variables )
• Broadcast variables allow the programmer to keep a
read-only variable cached on each node, rather than sending
a copy of it with tasks
> broadcastV1 = sc.broadcast([1, 2, 3, 4, 5, 6])
> broadcastV1.value
[1, 2, 3, 4, 5, 6]
• Accumulators are variables that are only “added” to through
an associative operation and can be efficiently supported in
parallel
accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value   # 10
Spark’s Main Use Cases
• Streaming Data
• Machine Learning
• Interactive Analysis
• Data Warehousing
• Batch Processing
• Exploratory Data Analysis
• Graph Data Analysis
• Spatial (GIS) Data Analysis
• And many more
Spark Use Cases
• Web Analytics
– Developed a Spark-based application for web analytics
• Social Media Sentiment Analysis
– Developed Spark-based sentiment-analysis code
for a social media dataset
My Use Case
Spark in the Real World (I)
• Uber – the online ride-hailing company gathers terabytes of event data from its
mobile users every day.
– Uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline
– Converts raw unstructured event data into structured data as it is collected
– Uses it further for more complex analytics and optimization of operations
• Capital One – uses Spark and data science algorithms to better understand its
customers.
– Developing the next generation of financial products and services
– Finding attributes and patterns that indicate an increased probability of fraud
• Netflix – leverages Spark to gain insight into users' viewing habits and then
recommend movies to them.
– User data is also used for content creation
Spark: when not to use
"MPI definitely outpaces Hadoop, but Hadoop can be boosted using a hybrid approach of other
technologies that blend HPC and big data, including Spark and HARP." – Dr. Geoffrey Fox,
Indiana University (http://arxiv.org/pdf/1403.1528.pdf)
Conclusion
• MapReduce and Spark are two very popular open source cluster
computing frameworks for large scale data analytics
• These frameworks hide the complexity of task parallelism and
fault tolerance by exposing a simple programming API to users