Analyzing Big Data in Hadoop Spark
Spark
Dan Lo
Department of Computer Science
Kennesaw State University
Otto von Bismarck
• Data applications are like sausages. It is better not to see them being made.
Data
• The Large Hadron Collider produces about 30 petabytes of data per year
• Facebook’s data is growing at 8 petabytes per month
• The New York Stock Exchange generates about 4 terabytes of data per day
• YouTube had around 80 petabytes of storage in 2012
• Internet Archive stores around 19 petabytes of data
Cloud and Distributed Computing
Apache Hadoop Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
• Other modules: ZooKeeper, Impala, Oozie, etc.
[Hadoop stack diagram: HBase on top of MapReduce and other distributed-processing engines, with YARN as the resource manager]
• Master-Slave design
• Master Node
• Single NameNode for managing metadata
• Slave Nodes
• Multiple DataNodes for storing data
• Other
• Secondary NameNode, which periodically checkpoints the NameNode's metadata (not a hot standby)
HDFS Architecture
The NameNode keeps the metadata: file names, block locations, and the directory structure.
DataNodes provide storage for blocks of data.
[HDFS architecture diagram: a client talks to the NameNode (with a Secondary NameNode alongside); a file is split into blocks B1–B4 stored on DataNodes]
Shortcomings of MapReduce
• Forces your data processing into Map and Reduce (see the word-count sketch after this list)
• Other workflows missing include join, filter, flatMap, groupByKey, union, intersection, …
• Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
• Reads from and writes to disk before and after Map and Reduce (stateless between jobs)
• Not efficient for iterative tasks, e.g. machine learning
• Only Java natively supported
• Support for other languages needed
• Only for Batch processing
• No support for interactive queries or streaming data
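To make the first point concrete, here is a minimal word-count sketch in the style of Hadoop Streaming (the file names mapper.py and reducer.py and the word-count task itself are illustrative, not from the slides): every job has to be phrased as a mapper and a reducer, and intermediate results pass through disk between stages.

    #!/usr/bin/env python3
    # mapper.py -- emit (word, 1) for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- input arrives sorted by key; sum the counts per word
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Anything beyond this shape (a join, a filter chained with a group-by, an iterative loop) has to be decomposed into additional map-reduce rounds, each paying the HDFS read/write cost.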
Challenges of Data Science
• Data preprocessing or data engineering (a majority of work)
• Iteration is a fundamental part of data science
• Stochastic gradient descent (SGD)
• Maximum likelihood estimation (MLE)
• Choosing the right features, picking the right algorithms, running the right significance tests, finding the right hyperparameters, etc.
• So data should be read once and stay in memory! (See the caching sketch after this list.)
• Integration of models to real useful products
• Easy modeling but hard to work well in reality
• R is slow and lacks integration capability
• Java and C++ are poor for exploratory analytics (no Read-Evaluate-Print Loop)
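As a rough illustration of "read once and stay in memory", a minimal PySpark sketch; the Parquet path and the column names x and y are invented for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    # Read the data once from distributed storage (illustrative path).
    points = spark.read.parquet("hdfs:///data/points.parquet")
    points.cache()   # keep the dataset in cluster memory across iterations

    w = 0.0
    for step in range(10):
        # Each pass of this gradient-style loop reuses the cached data
        # instead of re-reading it from disk.
        grad = points.rdd.map(lambda row: (row["x"] * w - row["y"]) * row["x"]).mean()
        w -= 0.1 * grad

    print("fitted weight:", w)
    spark.stop()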
Apache Spark
• Open source, originated from UC Berkeley AMPLab
• Distributed over a cluster of machines
• An elegant programming model
• Predecessor: MapReduce -> linear scalability and resilience to failures
• Improvements
• Executes operations over a directed acyclic graph (DAG), like Microsoft’s Dryad, rather than map-then-reduce, keeping data in memory instead of writing it to disk
• A rich set of transformations and APIs for easier programming
• In-memory processing across operations: Resilient Distributed Datasets (RDDs)
• Supports Scala and Python APIs (a short PySpark example follows this list)
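A minimal PySpark sketch of that programming model (the HDFS path is invented for the example): transformations such as filter, flatMap, map, and reduceByKey only describe the DAG, and nothing runs until an action such as take is called.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/logs.txt")   # illustrative path

    # Lazy transformations: these only build up the DAG.
    error_counts = (lines.filter(lambda line: "ERROR" in line)
                         .flatMap(lambda line: line.split())
                         .map(lambda word: (word, 1))
                         .reduceByKey(lambda a, b: a + b))

    # The action triggers the whole pipeline, which stays in memory end to end.
    print(error_counts.take(5))

    spark.stop()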
Directed Acyclic Graphs (DAG)
[Example DAG: RDD nodes A–F connected by transformation arrows]
DAGs track dependencies (also known as lineage; see the sketch below)
nodes are RDDs
arrows are Transformations
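One way to inspect the lineage Spark records is RDD.toDebugString(); a minimal sketch, assuming a local PySpark session and made-up data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1000))   # base RDD
    result = numbers.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

    # toDebugString() prints the chain of parent RDDs (the lineage) that Spark
    # would replay to recompute lost partitions after a failure.
    print(result.toDebugString().decode("utf-8"))

    spark.stop()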
Spark Uses Memory instead of Disk
[Diagram: Hadoop shares data between Iteration 1 and Iteration 2 by writing to and reading from HDFS on disk; Spark keeps the intermediate data in memory]
Sort competition
                                Hadoop MR (2013 record)     Spark (2014 record)
Data size                       102.5 TB                    100 TB
Elapsed time                    72 mins                     23 mins
# Nodes                         2100                        206
# Cores                         50400 physical              6592 virtualized
Cluster disk throughput (est.)  3150 GB/s                   618 GB/s
Network                         dedicated data center,      virtualized (EC2),
                                10 Gbps                     10 Gbps
Sort rate                       1.42 TB/min                 4.27 TB/min
Sort rate/node                  0.67 GB/min                 20.7 GB/min

Spark: roughly 3x faster with 1/10 the nodes
[Spark stack diagram: DataFrames and ML Pipelines on top; Spark SQL, Spark Streaming, MLlib, and GraphX libraries; Spark Core at the base; Data Sources below]
Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre)
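A rough sketch of how a few of these sources are reached from PySpark; the paths, bucket, host, table, and credentials are invented, and the JDBC example also assumes the MySQL driver is on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sources-demo").getOrCreate()

    # Text file on HDFS (illustrative path).
    logs = spark.sparkContext.textFile("hdfs:///data/logs.txt")

    # JSON on Amazon S3 (illustrative bucket; S3 credentials must be configured).
    events = spark.read.json("s3a://my-bucket/events/")

    # MySQL over JDBC (illustrative connection details).
    users = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost:3306/mydb")
             .option("dbtable", "users")
             .option("user", "analyst")
             .option("password", "secret")
             .load())

    print(logs.count(), events.count(), users.count())
    spark.stop()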
Resilient Distributed Datasets (RDDs)