BDA Lec5

This document discusses Big Data Analytics, focusing on the MapReduce programming model for batch processing. It covers distributed computing, the Hadoop Distributed File System (HDFS), and provides examples of climate and website data analysis using MapReduce. The lecture emphasizes the importance of parallel processing and the organization of data in distributed systems.

Big Data Analytics


Dr. Nesma Mahmoud
Lecture 5: MapReduce
for Batch Processing
Big Data Analytics (In short)
Goal: Generalizations
A model or summarization of the data.

Data/Workflow Frameworks: Spark, MapReduce, Hadoop File System, Streaming

Analytics and Algorithms: Large-scale Data Mining/ML
Big Data Analytics (In short)
What will we learn in this lecture?
01. Distributed Computing

02. MapReduce

03. Batch Processing with MapReduce


01. Distributed Computing
Distributed computing
● Distributed computing uses numerous computing resources in different
operating locations for a single computing purpose.
● Parallel computing, also known as parallel programming, is a process
where large compute problems are broken down into smaller problems
that can be solved simultaneously by multiple processors.
Cluster Computing & Commodity Clusters
● Cluster computing is a kind of distributed computing, a type of computing that
links computers together on a network to perform a computational task.

● The term, commodity cluster, is often heard in big data conversations.

● Commodity clusters are affordable parallel computers with an average
number of computing nodes.

○ They are not as powerful as traditional parallel computers and are often built
out of less specialized nodes.
Distributed File System

● The effectiveness of MapReduce, Spark, and other distributed
processing systems is due in part to their use of a distributed
file system (DFS)!
What Is a File System?
● File system definitions:
○ A file system (or filesystem) is an interface between the operating system and
the physical storage media, such as hard drives, solid-state drives (SSDs), or
external storage devices.
○ A file system is a crucial component of any operating system, responsible for
organizing and managing(create, update, delete) data.
○ A file system is a technique of arranging the files in a storage medium like a
hard disk, pen drive, DVD, etc.
● File system purpose:
○ It helps you organize the data and allows easy retrieval of files when they
are required.
○ It provides a structured way to store, access, and retrieve files and
directories.
● Without a file system, data placed on a storage medium would be one large entity
with no way to tell where one piece of data stopped and the next began.
Distributed File System?
● Distributed file system is similar to a normal file system, except that it runs on
multiple servers at once.
○ Because it’s a file system, you can do almost all the same things you’d do on
a normal file system.
● Actions such as storing, reading, and deleting files and adding security to files are
at the core of every file system, including the distributed one
Distributed File System

Notation: C2… chunk no. 2 of file C

Bring computation directly to the data!

Chunk servers also serve as compute servers


Distributed File System
◾ Chunk servers (on Data Nodes)
▪ File is split into contiguous chunks

▪ Typically each chunk is 16-64MB

▪ Each chunk replicated (usually 2x or 3x)

▪ Try to keep replicas in different racks

◾ Master node

▪ a.k.a. Name Node in Hadoop’s HDFS


▪ Stores metadata about where files are stored
▪ Master nodes are typically more robust to hardware failure and run critical cluster services.

◾ Client library for file access


▪ 1- Talks to master to find chunk servers
▪ 2- Connects directly to chunk servers to access data
Distributed File System
◾ Reliable distributed file system

◾ Data kept in “chunks” spread across machines

◾ Each chunk replicated on different machines


▪ Seamless recovery from disk or machine failure

C0 C1 D0 C1 C2 C5 C0 C5

C5 C2 C5 C3 D0 D1 … D0 C2

Chunk server 1 Chunk server 2 Chunk server 3 Chunk server N

Notation: C2… chunk no. 2 of file C


Hadoop Distributed File System
◾ The best-known distributed file system at this moment is the Hadoop File System (HDFS).

◾ It is an open source implementation of the Google File System (GFS).

◾ HDFS stores data in blocks of 64 MB or 128 MB (the default is 128 MB).
Hadoop Distributed File System
◾ We’re going to consider the case of creating a new file, writing data to it, then closing the file.
02. MapReduce
MapReduce: Overview

● Easy as 1, 2!
○ Step 1: Map Step 2: Reduce

● Easy as 1, 2, 3!
○ Step 1: Map Step 2: Sort / Group by Step 3: Reduce
Programming Model: MapReduce
◾ MapReduce is a style of programming
(programming model) designed for:
1. Easy parallel programming
2. Invisible management of hardware and software failures
3. Easy management of very-large-scale data
MapReduce: Overview
◾ 3 steps of MapReduce
◾ 1. Map (written by programmer):
▪ Apply a user-written Map function to each input element
▪ Mapper applies the Map function to a single element
▪ Many mappers grouped in a Map task (the unit of parallelism)
▪ The output of the Map function is a set of 0, 1, or more key-value pairs.

◾ 2. Group by key (system handles): Sort and shuffle


▪ System sorts all the key-value pairs by key, and outputs
key-(list of values) pairs
◾ 3. Reduce (written by programmer):
▪ User-written Reduce function is applied to each key-(list
of values)

Outline stays the same, Map and Reduce change to fit the problem
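The three steps above can be sketched in plain Python for the classic word-count example. This is a single-machine toy illustration of the programming model, not a distributed implementation:

```python
from collections import defaultdict

def map_fn(document):
    # 1. Map: emit a (word, 1) pair for every word in the input element.
    for word in document.split():
        yield (word, 1)

def group_by_key(pairs):
    # 2. Group by key (done by the system): sort and shuffle the pairs
    # into key -> list-of-values.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # 3. Reduce: combine the list of values for one key.
    return (key, sum(values))

docs = ["big data", "big analytics"]
pairs = [p for d in docs for p in map_fn(d)]
result = dict(reduce_fn(k, v) for k, v in group_by_key(pairs).items())
# result == {"analytics": 1, "big": 2, "data": 1}
```

As the slide says, the outline stays the same; only `map_fn` and `reduce_fn` change to fit the problem.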
MapReduce: In-Parallel
◾ Map: Reads the input and produces a set of key-value pairs.

◾ Group by key: Collects all pairs with the same key (hash merge, shuffle, sort, partition).

◾ Reduce: Collects all values belonging to a key and outputs the result.
MapReduce: In-Parallel
Phases of MapReduce are distributed, with many tasks doing the work in parallel. A partitioning function determines which record goes to which reducer.

DFS → Map → Map’s Local FS → Reduce → DFS
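A minimal sketch of such a partitioning function, assuming string keys (a common choice is hash-modulo; a stable hash like CRC32 guarantees that all pairs with the same key land on the same reducer across runs):

```python
import zlib

def partition(key, num_reducers):
    # Hash the key and take it modulo the reducer count, so every
    # pair with the same key is routed to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```

For example, `partition("/home", 4)` always returns the same reducer index, which is what lets the reduce phase see all values for a key together.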


MapReduce: Execution Flow
MapReduce vs HDFS
HDFS, MapReduce, and Distributed Computing
03. Batch Processing
with MapReduce
Batch Processing
● Batch: a group of things or people that are dealt with at the same time.

● Batch processing is a powerful method used in data processing and computing
where data is collected, processed, and analyzed in groups or “batches” rather
than in real time.

● MapReduce is a batch-processing model because it operates on data that is
already stored, not on a live continuous stream of incoming data.
○ Input data needs to be divided and distributed before the Map phase of
MapReduce even begins.
Climate Big Data Analysis
Let us consider point locations in a very large area with gridded climate parameter values that need to
be analyzed to compute statistical mean, maximum, or minimum values of temperature, radiation, and
humidity for a given time period.
The example is simplistic but demonstrates the distributed workflow in the Hadoop framework.
The input is a comma-separated values (CSV) file containing records for latitude, longitude, time,
climate parameter names, and their values.
Climate Big Data Analysis
● MapReduce (MR) Programming Model—using the Map and Reduce functions to
compute aggregated climate parameters namely temperature (red), radiation (yellow),
and humidity (blue) for points in an area from very large gridded climate datasets
generated from satellite observations over time.
● The Hadoop framework takes the input climate CSV file and splits it into multiple parts
that are sent to worker servers in the cluster.
● Each split file contains a set of CSV records for climate parameters in an unknown
order.
● A Map function is applied to the dataset to sort CSV records for each climate
parameter at each worker server.
● Sorting ensures that once the key value changes, there is no need to look for that
variable any more. The output is shuffled to group temperature (red), radiation
(yellow), and humidity (blue) records for each split file at worker servers.
● Finally, a Reduce function aggregates groups of climate records for each parameter
across all worker servers to compute the average, minimum, or maximum values of
temperature, radiation, and humidity for a given time period.
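The aggregation described above can be sketched in Python. This is a single-machine toy of the Map/group/Reduce flow; the CSV records below are made-up examples in the lat, lon, time, parameter, value layout the slides describe:

```python
from collections import defaultdict

# Hypothetical CSV records: latitude, longitude, time, parameter name, value.
records = [
    "30.1,31.2,2024-01-01,temperature,22.5",
    "30.1,31.2,2024-01-01,humidity,0.40",
    "30.2,31.3,2024-01-01,temperature,24.1",
]

def map_fn(line):
    # Map: key each record by its climate parameter name.
    _lat, _lon, _time, param, value = line.split(",")
    return (param, float(value))

def reduce_fn(param, values):
    # Reduce: aggregate per parameter (mean, min, max over the period).
    return (param, {"mean": sum(values) / len(values),
                    "min": min(values), "max": max(values)})

# Group by key: sort the mapped pairs so each parameter's values sit together.
groups = defaultdict(list)
for key, value in sorted(map_fn(r) for r in records):
    groups[key].append(value)

stats = dict(reduce_fn(k, v) for k, v in groups.items())
```

In the real Hadoop run, the map and reduce calls happen on different worker servers and the grouping is the framework's shuffle, but the logic per record is the same.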
Visited Pages Big Data Analysis
● Imagine you have terabytes of website logs tracking every single visitor
interaction, and you want to extract information from them, such as
which pages are most popular or where visitors drop off in your
purchase funnel.
Visited Pages Big Data Analysis

Think of each worker as
a separate server that
handles its assigned
chunk.
• It has a Map
Function that extracts
the key information: in
our case, it will map
the keys, which are the
specific webpage visited,
to the values, which, if
we are counting visits,
can be the number of
visits to that page (e.g., 1)
Visited Pages Big Data Analysis
• Then, we enter the reduce phase, where all the key-value pairs generated by
the map phase are sorted and grouped by webpage (the key).

• We forward those groups to the Reduce function. For each unique webpage,
it adds up the ‘1’ values to find the total visits.
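The page-visit count can be sketched the same way, again as a single-machine toy of the model; the log lines below are hypothetical:

```python
from collections import defaultdict

# Hypothetical log: each entry records one page visit.
log = ["/home", "/products", "/home", "/checkout", "/home"]

# Map: emit (page, 1) for every visit in the worker's chunk.
pairs = [(page, 1) for page in log]

# Shuffle/sort: group the pairs by page.
groups = defaultdict(list)
for page, one in sorted(pairs):
    groups[page].append(one)

# Reduce: sum the 1s per page to get total visits.
visits = {page: sum(ones) for page, ones in groups.items()}
# visits == {"/checkout": 1, "/home": 3, "/products": 1}
```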
Thanks!
Do you have any questions?

CREDITS: This presentation template was created by Slidesgo, and includes
icons by Flaticon, and infographics & images by Freepik
