1 - HADOOP Crash Course

The document provides an introduction to batch processing systems, focusing on the Hadoop ecosystem, including its components like HDFS and the MapReduce programming model. It outlines the differences between centralized and distributed data processing, the lifecycle of batch data processing, and various tools for data ingestion, storage, processing, and access. Additionally, it discusses the architecture and functionalities of Hadoop, emphasizing its scalability, fault tolerance, and resource management capabilities.


CIT650 Introduction to Big Data

Batch Processing Systems


Part 1

1
Objectives

By the end of the lecture, you should be able to understand:

The difference between centralized and distributed data processing
The lifecycle of batch data processing
The Hadoop ecosystem
HDFS
The MapReduce programming model

2
What Computing Model Would Stand?

A single machine cannot handle that amount of data

3
Big Data System Requirements

We need distributed computing frameworks, like Hadoop

4
Two Ways to Build a System

Monolithic: paying twice the cost for double the processing power and memory does not mean twice the performance
Distributed: many small, cheap machines come together to act as a single entity. Scales linearly as you add more nodes.
5
Server Farms
All these machines need to be coordinated by a single piece of software that takes care of:
Data partitioning
Coordinating computing tasks
Handling fault tolerance and recovery
Scheduling tasks and allocating resources

6
Cluster Management
Back in 2000, Google realized that traditional data management approaches would not scale with big data, and introduced:
The Google File System, to manage distributed storage of data
MapReduce, to abstract a distributed, parallel data processing model

7
Hadoop

An open-source software framework that:
Implements a distributed file system, HDFS, following Google's file system design
Supports MapReduce as the programming model
Manages the cluster and schedules data processing jobs
Scales horizontally as new nodes join the cluster
Provides fault tolerance and transparent rescheduling of failed jobs/tasks
Manages resource allocation

8
Hadoop v1.x vs. v2.x

9
Core Hadoop

10
Hadoop Ecosystem

For storage: Hadoop Distributed File System (HDFS)
For computation: the MapReduce model
Other tools:
Data ingestion: Sqoop, Flume
Storage: HBase
Processing: MapReduce, Spark, Hive
Resource management: YARN

Source: https://www.edureka.co/blog/hadoop-ecosystem

11
Ingest Data

• Flume
• Collects, aggregates, and transfers big data
• Has a simple and flexible architecture based on streaming data flows
• Uses a simple, extensible data model that allows for online analytic applications

• Sqoop
• Designed to transfer data between relational database systems and
Hadoop
• Accesses the database to understand the schema of the data
• Generates a MapReduce application to import or export the data

12
Store Data

• HBase
• A non-relational database that runs on top of HDFS
• Provides real-time wrangling of data
• Stores data as indexes to allow random, faster access to data

• Cassandra
• A scalable, NoSQL database designed to have no single point of
failure

13
Process and Analyze Data

• Pig
• Analyzes large amounts of data
• Operates on the client side of a cluster
• A procedural data flow language

• Hive
• Used for creating reports
• Operates on the server side of a cluster
• A declarative programming language

14
Access Data

• Elasticsearch
• Provides real-time search and analytics capabilities

• Apache Drill
• A schema-free SQL query engine for big data
• Provides a unified view of different data sources, allowing users to query across them seamlessly

15
Access Data

• Impala
• A scalable and easy-to-use platform for everyone
• No programming skills required

• Hue
• Stands for Hadoop user experience
• Allows you to upload, browse, and query data
• Runs Pig jobs and workflow
• Provides editors for several SQL Query languages like Hive and MySQL

16
Processing Lifecycle

17
Apache Sqoop

A data exchange tool between Hadoop and relational databases
Relies on the RDBMS for the data schema
Imports one table at a time
Uses MapReduce jobs to parallelize data exchange
Map-only jobs, discussed later
Commands:
sqoop import --connect XXX --table MY_TABLE --num-mappers 8 ...
sqoop export ...

18
Apache Sqoop
sqoop-env.sh

19
Apache Flume
A distributed, reliable service to collect and transform (unstructured) data:
Logs
Data streams
Components:
Client: hosted at the origin of the data; delivers events to the source
Source: receives events from their origin and delivers them to one or more channels
Channel: a transient store where events are persisted until consumed by sink(s)
Sink: consumes events from a channel and delivers them to the next destination
20
Apache Pig

A data processing platform with a high-level language
Language: Pig Latin
An alternative to writing low-level MapReduce jobs
Good at transformation and joining
The Pig interpreter runs on a client machine:
Turns Pig Latin scripts into MapReduce/Spark jobs
Submits those jobs to the Hadoop cluster

orders = LOAD '/user/training/orders' AS (ordId, custId, cost);
groups = GROUP orders BY custId;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY custId;
DUMP result;

21
Apache Hive

A data warehouse on top of Hadoop
Originally started at Facebook
SQL-like interface for data processing (HiveQL)
Folder structure for data storage can be leveraged in Hive commands
HiveQL queries are translated into MapReduce jobs transparently

22
Example Big Data Applications

• Social media marketing analysis
  • Customer Segmentation and Targeting
  • Campaign Performance Analytics

• Shopping pattern analysis
  • Customer Journey Mapping
  • Inventory Management Optimization

• Traffic pattern recognition
  • Smart Transportation Systems
  • Public Safety and Emergency Response
  • Urban Planning and Infrastructure Optimization

23
Example Big Data Applications

• Large data transformation
  • Batch Processing for Data Transformation
  • Streamlining Data Pipelines

• Credit card fraud detection
  • Anomaly Detection with Machine Learning
  • Data Processing for Large Transaction Volumes

24
Hadoop vs. Other Distributed Systems

Data is distributed in advance, at its arrival (scalability)
Data is replicated throughout the cluster (availability/reliability)
Data transfer is minimized (locality)
A unified programming model that abstracts:
Data distribution
Node communication
Fault tolerance
Clean separation between business needs (what you write) and the handling of distributed data processing (what is hidden)

25
Core Hadoop

26
Storing Data in Hadoop

Efficient data storage is the foundation for efficient processing
Data can be stored in:
Raw format in HDFS
A columnar view in HBase
Data serialization formats:
Raw
Avro
We will talk about these later

27
Hadoop Distributed File System (HDFS)

The lowest level of data storage
Yet at a higher abstraction level than the OS-specific file system
Partitions files at ingestion time
Replicates data for high availability and fault tolerance
Abstracts the physical partition location (which node in the cluster) from the application
Supports serving several applications in parallel on the same file partition
Inspired by the Google File System

28
HDFS vs. Other Distributed File Systems

HDFS was designed with the following objectives in mind:
Partition and distribute a single file across different machines
Favor larger partition sizes
Replicate data
Process data locally (as much as possible)
HDFS is optimized for:
Sequential reads, rather than random access and writes
No in-place updates to files
No local caching

29
HBase vs. HDFS

30
HDFS Architecture

Source: Figure 2-1 in book Professional Hadoop Solutions

31
Name Node

A single node that keeps the metadata of HDFS
In some high-availability settings, there is a secondary name node
The whole catalog (metadata) is kept in memory for fast access
A copy of the in-memory data is periodically flushed to disk (the FsImage file)
Clients need to contact the name node first to get information about the actual blocks
Since the name node can be accessed concurrently, a logging mechanism similar to that of databases is used to track updates to the catalog
The name node runs a daemon process to handle requests and to receive heartbeats from the data nodes

32
HDFS files

A single large file is partitioned into several blocks
Block size of either 64 MB or 128 MB
Compare that to block sizes on ordinary file systems
This is why sequential access is much better: the disk makes far fewer seeks
What would be the costs/benefits of using smaller block sizes?
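Whatever the answer, the block size itself is just a configuration value (dfs.blocksize, in bytes) that applies to files created by the client. A minimal sketch, assuming the cluster settings are otherwise picked up from the classpath; the printed value depends on the Hadoop version and configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request 128 MB blocks for files created with this configuration.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Block size: " + fs.getDefaultBlockSize(new Path("/")));
    }
}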

33
Writing a File to HDFS

When a client is writing data to an HDFS file, this data is first written to a
local file.
When the local file accumulates a full block of data, the client consults the
NameNode to get a list of DataNodes that are assigned to host replicas of
that block.
The client then writes the data block from its local storage to the first
DataNode in 4K portions.
The DataNode stores the received blocks in a local file system, and forwards
that portion of data to the next DataNode in the list.
The same operation is repeated by the next receiving DataNode until the
last node in the replica set receives data.

34
Writing a File to HDFS (Cont.)

This DataNode stores data locally without sending it any further


If one of the DataNodes fails while the block is being written, it is removed
from the pipeline
The NameNode re-replicates it to make up for the missing replica caused by
the failed DataNode
When a file is closed, the remaining data in the temporary local file is
pipelined to the DataNodes
If the NameNode dies before the file is closed, the file is lost.
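The whole write path above is hidden behind the HDFS client library. Below is a minimal sketch using the Java FileSystem API (the path is illustrative); local buffering, NameNode block allocation, and the DataNode pipeline all happen inside create(), write(), and close():

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/training/example.txt");  // illustrative path

        // The client buffers locally; replicas are pipelined to the DataNodes
        // chosen by the NameNode, and close() completes the file.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}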

35
Replica Placement

Replica placement is crucial for the reliability of HDFS
Replicas should not all be placed on the same rack
All decisions about the placement of partitions/replicas are made by the NameNode
The NameNode tracks the availability of data nodes by means of heartbeats
Every 3 seconds, the NameNode should receive a heartbeat and a block report from each data node
The block report allows verifying the list of blocks stored on the data node
A data node with a missing heartbeat is declared dead; based on the catalog, the replicas missing on that node are made up for through the NameNode sending replicas to other available data nodes

36
HDFS Federation

By default, HDFS has a single NameNode. What is wrong with that? If the NameNode daemon process goes down, the cluster becomes inaccessible
A solution: HDFS Federation
Namespace scalability: horizontal scalability for accessing metadata, just as for accessing the data itself
Performance: higher throughput, as NameNodes can be queried concurrently
Isolation: blocking applications can be served by different NameNodes
Is it more reliable?
37
HDFS High-availability

Each NameNode is backed up with another, slave NameNode that keeps a copy of the catalog
The slave node provides a failover replacement for the primary NameNode
Both nodes must have access to a shared storage area
Data nodes have to send heartbeats and block reports to both the master and slave NameNodes

38
Core Hadoop

39
Processing Data with MapReduce

A simple abstraction for distributed data processing
Invented by Google in 2004
Two operations:
Map: think of it as a transformation, either 1:1 or 1:M, applied to each element of your data
Reduce: think of it as a form of aggregating or compacting the data, M:1
Hadoop provides the open-source implementation of this model (a minimal job driver is sketched below)
It was believed that all sorts of large-scale data processing could be tackled by MapReduce; this is not true, as you will see in future lectures
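A minimal, hedged sketch of a MapReduce job driver using the Hadoop Java API. WordCountMapper and WordCountReducer refer to the word-count classes sketched on the upcoming slides; input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Business logic lives in the Mapper/Reducer classes;
        // distribution, scheduling, and fault tolerance are handled by Hadoop.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}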

40
MapReduce Overview

You need to implement map, reduce, or both
Map:
Can be used to split elements
Can be used to filter elements
Reduce:
To compute aggregates
To combine results

41
MapReduce Overview

You need to implement map, reduce, or both
Map:
Can be used to split elements
Can be used to filter elements
Reduce:
To compute aggregates
To combine results

42
MapReduce Workflow

43
Word Count: Mapper

44
Word Count: Mapper
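The mapper code shown on these slides is not reproduced in the extracted text; the following is a minimal sketch against the standard Hadoop Java API (class and field names are illustrative). For each input line, it emits a (word, 1) pair per token.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split the line into tokens and emit (token, 1) for each one.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}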

45
Word Count: Combiner (optional) (local reducer)

Combiners run locally on the node where the mapper runs. They can be
used to reduce the number of records emitted to the next phase
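For word count, the reduce logic (integer summation) is associative and commutative, so the reducer class itself can serve as the combiner. A minimal sketch of the relevant driver line, assuming the WordCountReducer class sketched on the reducer slide:

// Reuse the reducer as a local combiner; safe here because summation is
// associative and commutative. Hadoop may run the combiner zero or more
// times, so it must never change the final result.
job.setCombinerClass(WordCountReducer.class);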

46
Word Count: Partitioning/Shuffling/Sorting

Partitioning
Distributes the key space over the number of available reducers
The number of reducers need not be the same as the number of mappers
Assume two reducers
Makes sure that the same key always goes to the same reducer
Usually implemented via a hash function (see the sketch after this list)
Shuffling
Prepares the partitioned keys for the reduce step
Sorting
Transforms (k, v1), (k, v2), (k, v3) → (k, {v1, v2, v3})
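A minimal sketch of hash-based partitioning, in the spirit of Hadoop's default HashPartitioner (illustrative, not the library source):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        // The same key always hashes to the same reducer;
        // masking with Integer.MAX_VALUE keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}

With the two reducers assumed above, the driver would call job.setNumReduceTasks(2).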

47
Word Count: Reducer

There will be two files written to the specified output path; the number of files equals the number of reducers.
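The reducer code from the slide is likewise not in the extracted text; a minimal sketch, assuming the same (Text, IntWritable) types as the mapper sketched earlier:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all partial counts for this word and emit the total.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

With job.setNumReduceTasks(2) in the driver, the two reducers produce two output files, conventionally named part-r-00000 and part-r-00001.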

48
Other Uses for MapReduce

The reduce phase is optional in a MapReduce job
Example: running face recognition on millions of images
A map-only job
Input: (image ID, image)
Output: (image ID, list of features)
Features are loaded into the distributed cache
The results of each mapper will be written to a separate file in the output folder
What to do if we want all results in a single file?
Add a single reducer
However, this comes with a large overhead, as shuffle and sort will be invoked (see the driver sketch below)
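A hedged sketch of the relevant driver lines; FaceFeatureMapper is a hypothetical mapper class for the image example above:

// Map-only job: with zero reduce tasks there is no shuffle/sort phase and
// each mapper writes its own output file (part-m-xxxxx) to the output folder.
job.setMapperClass(FaceFeatureMapper.class);  // hypothetical mapper
job.setNumReduceTasks(0);

// Alternative: a single reducer yields one combined output file,
// at the cost of shuffling all mapper output to one node.
// job.setNumReduceTasks(1);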

49
Core Hadoop

50
R.I.P MapReduce?

51
R.I.P MapReduce?

52
