BDA Lec5

This document discusses Big Data Analytics, focusing on the MapReduce programming model for batch processing. It covers distributed computing, the Hadoop Distributed File System (HDFS), and provides examples of climate and website data analysis using MapReduce. The lecture emphasizes the importance of parallel processing and the organization of data in distributed systems.

Big Data Analytics


Dr. Nesma Mahmoud
Lecture 5: MapReduce
for Batch Processing
Big Data Analytics (In short)
Goal: Generalizations
A model or summarization of the data.

Data/Workflow Frameworks: Spark, MapReduce, Hadoop File System, Streaming

Analytics and Algorithms: Large-scale Data Mining/ML
Big Data Analytics (In short)
What will we learn in this lecture?
01. Distributed Computing

02. MapReduce

03. Batch Processing with MapReduce


01. Distributed Computing
Distributed computing
● Distributed computing uses numerous computing resources in different
operating locations for a single computing purpose.
● Parallel computing, also known as parallel programming, is a process
where large compute problems are broken down into smaller problems
that can be solved simultaneously by multiple processors.
Cluster Computing & Commodity Clusters
● Cluster computing is a kind of distributed computing, a type of computing that
links computers together on a network to perform a computational task.

● The term, commodity cluster, is often heard in big data conversations.

● Commodity clusters are affordable parallel computers with an average
number of computing nodes.

○ They are not as powerful as traditional parallel computers and are often built
out of less specialized nodes.
Distributed File System

● The effectiveness of MapReduce, Spark, and other distributed
processing systems is due in part to their use of a distributed
file system (DFS)!
What Is a File System?
● File system definitions:
○ A file system (or filesystem) is an interface between the operating system and
the physical storage media, such as hard drives, solid-state drives (SSDs), or
external storage devices.
○ A file system is a crucial component of any operating system, responsible for
organizing and managing(create, update, delete) data.
○ A file system is a technique of arranging the files in a storage medium like a
hard disk, pen drive, DVD, etc.
● File system purpose:
○ It helps you organize the data and allows easy retrieval of files when they
are required.
○ It provides a structured way to store, access, and retrieve files and
directories.
● Without a file system, data placed on a storage medium would be one large entity
with no way to tell where one piece of data stopped and the next began.
Distributed File System?
● Distributed file system is similar to a normal file system, except that it runs on
multiple servers at once.
○ Because it’s a file system, you can do almost all the same things you’d do on
a normal file system.
● Actions such as storing, reading, and deleting files and adding security to files are
at the core of every file system, including the distributed one
Distributed File System

Notation: C2… chunk no. 2 of file C

Bring computation directly to the data!

Chunk servers also serve as compute servers


Distributed File System
◾ Chunk servers (on Data Nodes)
▪ File is split into contiguous chunks

▪ Typically each chunk is 16-64MB

▪ Each chunk replicated (usually 2x or 3x)

▪ Try to keep replicas in different racks

◾ Master node

▪ a.k.a. Name Node in Hadoop’s HDFS


▪ Stores metadata about where files are stored
▪ Master nodes are typically more robust to hardware failure and run critical cluster services.

◾ Client library for file access


▪ 1- Talks to master to find chunk servers
▪ 2- Connects directly to chunk servers to access data
Distributed File System
◾ Reliable distributed file system

◾ Data kept in “chunks” spread across machines

◾ Each chunk replicated on different machines


▪ Seamless recovery from disk or machine failure

C0 C1 D0 C1 C2 C5 C0 C5

C5 C2 C5 C3 D0 D1 … D0 C2

Chunk server 1 Chunk server 2 Chunk server 3 Chunk server N

Notation: C2… chunk no. 2 of file C


Hadoop Distributed File System
◾ The best-known distributed file system at this moment is the Hadoop File System (HDFS).

◾ It is an open source implementation of the Google File System (GFS).

◾ HDFS stores data in blocks of 64 MB or 128 MB (the default is 128 MB).
Hadoop Distributed File System
◾ We’re going to consider the case of creating a new file, writing data to it, then closing the file.
02. MapReduce
MapReduce: Overview

● Easy as 1, 2!
○ Step 1: Map Step 2: Reduce

● Easy as 1, 2, 3!
○ Step 1: Map Step 2: Sort / Group by Step 3: Reduce
Programming Model: MapReduce
◾ MapReduce is a style of programming
(programming model) designed for:
1. Easy parallel programming
2. Invisible management of hardware and software failures
3. Easy management of very-large-scale data
MapReduce: Overview
◾ 3 steps of MapReduce
◾ 1. Map (written by programmer):
▪ Apply a user-written Map function to each input element
▪ Mapper applies the Map function to a single element
▪ Many mappers grouped in a Map task (the unit of parallelism)
▪ The output of the Map function is a set of 0, 1, or more key-value pairs.

◾ 2. Group by key (system handles): Sort and shuffle


▪ System sorts all the key-value pairs by key, and outputs
key-(list of values) pairs
◾ 3. Reduce (written by programmer):
▪ User-written Reduce function is applied to each key-(list
of values)

Outline stays the same, Map and Reduce change to fit the problem
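The three steps above can be sketched in plain Python for the classic word-count example. This is a single-machine toy illustration of the programming model, not a distributed implementation:

```python
from collections import defaultdict

def map_fn(document):
    # 1. Map: emit a (word, 1) pair for every word in the input element.
    for word in document.split():
        yield (word, 1)

def group_by_key(pairs):
    # 2. Group by key (done by the system): sort and shuffle the pairs
    # into key -> list-of-values.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # 3. Reduce: combine the list of values for one key.
    return (key, sum(values))

docs = ["big data", "big analytics"]
pairs = [p for d in docs for p in map_fn(d)]
result = dict(reduce_fn(k, v) for k, v in group_by_key(pairs).items())
# result == {"analytics": 1, "big": 2, "data": 1}
```

As the slide says, the outline stays the same; only `map_fn` and `reduce_fn` change to fit the problem.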
MapReduce: In-Parallel
◾ Map: Reads the input and produces a set of key-value pairs.

◾ Group by key: Collects all pairs with the same key (hash merge, shuffle, sort, partition).

◾ Reduce: Collects all values belonging to a key and outputs the result.
MapReduce: In-Parallel
Phases of MapReduce are distributed, with many tasks doing the work in parallel. A partitioning function determines which record goes to which reducer.

DFS → Map → Map’s Local FS → Reduce → DFS
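A minimal sketch of such a partitioning function, assuming string keys (a common choice is hash-modulo; a stable hash like CRC32 guarantees that all pairs with the same key land on the same reducer across runs):

```python
import zlib

def partition(key, num_reducers):
    # Hash the key and take it modulo the reducer count, so every
    # pair with the same key is routed to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

For example, `partition("/home", 4)` always returns the same reducer index, which is what lets the reduce phase see all values for a key together.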


MapReduce: Execution Flow
MapReduce vs HDFS
HDFS, MapReduce, and Distributed Computing
03. Batch Processing
with MapReduce
Batch Processing
● Batch: a group of things or people that are dealt with at the same time.

● Batch processing is a powerful method used in data processing and computing
where data is collected, processed, and analyzed in groups or “batches” rather
than in real time.

● MapReduce is a batch-processing model because it operates on data that is
already stored, not on a live continuous stream of incoming data.
○ Input data needs to be divided and distributed before the Map phase of
MapReduce even begins.
Climate Big Data Analysis
Let us consider point locations in a very large area with gridded climate parameter values that need to
be analyzed to compute statistical mean, maximum, or minimum values of temperature, radiation, and
humidity for a given time period.
The example is simplistic but demonstrates the distributed workflow in the Hadoop framework.
The input is a comma-separated values (CSV) file containing records for latitude, longitude, time,
climate parameter names, and their values.
Climate Big Data Analysis
● MapReduce (MR) Programming Model—using the Map and Reduce functions to
compute aggregated climate parameters namely temperature (red), radiation (yellow),
and humidity (blue) for points in an area from very large gridded climate datasets
generated from satellite observations over time.
● The Hadoop framework takes the input climate CSV file and splits it into multiple parts
that are sent to worker servers in the cluster.
● Each split file contains a set of CSV records for climate parameters in an unknown
order.
● A Map function is applied to the dataset to sort CSV records for each climate
parameter at each worker server.
● Sorting ensures that once the key value changes, there is no need to look for that
variable any more. The output is shuffled to group temperature (red), radiation
(yellow), and humidity (blue) records for each split file at worker servers.
● Finally, a Reduce function aggregates groups of climate records for each parameter
across all worker servers to compute the average, minimum, or maximum values of
temperature, radiation, and humidity for a given time period.
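The aggregation described above can be sketched in Python. This is a single-machine toy of the Map/group/Reduce flow; the CSV records below are made-up examples in the lat, lon, time, parameter, value layout the slides describe:

```python
from collections import defaultdict

# Hypothetical CSV records: latitude, longitude, time, parameter name, value.
records = [
    "30.1,31.2,2024-01-01,temperature,22.5",
    "30.1,31.2,2024-01-01,humidity,0.40",
    "30.2,31.3,2024-01-01,temperature,24.1",
]

def map_fn(line):
    # Map: key each record by its climate parameter name.
    _lat, _lon, _time, param, value = line.split(",")
    return (param, float(value))

def reduce_fn(param, values):
    # Reduce: aggregate per parameter (mean, min, max over the period).
    return (param, {"mean": sum(values) / len(values),
                    "min": min(values), "max": max(values)})

# Group by key: sort the mapped pairs so each parameter's values sit together.
groups = defaultdict(list)
for key, value in sorted(map_fn(r) for r in records):
    groups[key].append(value)

stats = dict(reduce_fn(k, v) for k, v in groups.items())
```

In the real Hadoop run, the map and reduce calls happen on different worker servers and the grouping is the framework's shuffle, but the logic per record is the same.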
Visited Pages Big Data Analysis
● Imagine you have terabytes of website logs tracking every single visitor
interaction, and you want to extract information from them, such as
which pages are most popular or where visitors drop off in your
purchase funnel.
Visited Pages Big Data Analysis

Think of each worker as
a separate server that
handles its assigned
chunk.
• It has a Map
Function that extracts
the key information: in
our case, it will map
the keys, which are the
specific webpage visited,
to the values, which, if
we are counting visits,
can be the number of
visits to that page (e.g., 1)
Visited Pages Big Data Analysis
• Then, we enter the reduce phase, where all the key-value pairs generated by
the map phase are sorted and grouped by webpage (the key).

• We forward those groups to the Reduce function. For each unique webpage,
it adds up the ‘1’ values to find the total visits.
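The page-visit count can be sketched the same way, again as a single-machine toy of the model; the log lines below are hypothetical:

```python
from collections import defaultdict

# Hypothetical log: each entry records one page visit.
log = ["/home", "/products", "/home", "/checkout", "/home"]

# Map: emit (page, 1) for every visit in the worker's chunk.
pairs = [(page, 1) for page in log]

# Shuffle/sort: group the pairs by page.
groups = defaultdict(list)
for page, one in sorted(pairs):
    groups[page].append(one)

# Reduce: sum the 1s per page to get total visits.
visits = {page: sum(ones) for page, ones in groups.items()}
# visits == {"/checkout": 1, "/home": 3, "/products": 1}
```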
Thanks!
Do you have any questions?

CREDITS: This presentation template was created by Slidesgo, and includes
icons by Flaticon, and infographics & images by Freepik
