
Big Data

Hadoop: HDFS and Map Reduce


Programming model

Prafullata Kiran Auradkar


Department of Computer Science and Engineering
[email protected]

Acknowledgements:
Significant portions of the slide deck presented in Unit 1 of the course were created by Dr. K V Subramaniam, and I would like to acknowledge and thank him for the same. Some of the material may also have been leveraged from Dr. H L Phalachandra’s slide contents. I may have supplemented this with content from books and other sources on the Internet, and would like to sincerely thank and acknowledge the original authors/publishers; the credit/rights for that material remain with them. These slides are intended for classroom presentation only.
Why the need?
• As per RBI data from May 2019,
• Number of credit/debit card transactions ≈ 1.3 billion
(https://rbidocs.rbi.org.in/rdocs/ATM/PDFs/ATM052019E96EC259708C4ED9AD9E0C6B5E8B6DD5.PDF)
• If each transaction requires about 10 KB of data → roughly 13 TB of data
• That’s a lot of data, and it covers only credit/debit card transactions
• There are other kinds of transactions as well
• Suppose you want to look for fraudulent transactions
• How do you store and process this data?
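
A quick back-of-the-envelope check of that 13 TB figure (a sketch; the 10 KB-per-transaction size is the slide's assumption):

    # Rough estimate: ~1.3 billion transactions x ~10 KB each
    transactions = 1.3e9
    bytes_per_txn = 10 * 1024                # ~10 KB per transaction
    total_bytes = transactions * bytes_per_txn
    print(f"{total_bytes / 1e12:.1f} TB")    # prints 13.3 TB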
HDFS – Hadoop Distributed File System

“HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.”

Very large
• Files can be MB/GB/TB in size
• Hadoop clusters holding petabytes of data are currently operational

Read-mostly data
• The most efficient data processing pattern is write-once, read-many-times
• Each analysis will involve a large proportion of the dataset
• The time to read the whole dataset is more important than the latency in reading the first record

Commodity hardware
• Hadoop doesn’t require expensive, highly reliable hardware
• It is designed to run on clusters of commodity hardware
HDFS Motivation

• File Metadata: filename, access control information, size, and the location of the file on disk
• File Data: the actual data of the file. It is much larger, and more time is spent here

HDFS Motivation

Solution
• File Metadata: smaller than the data and accessed less frequently → keep it on a separate server, the NAMENODE
• File Data: much larger and requires parallel access → distribute it across machines, the DATANODEs
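
A conceptual sketch of this split, using purely illustrative Python data structures (the file name, block IDs and node names are made up; this is not HDFS’s actual implementation):

    # NameNode: small, frequently consulted metadata about each file
    namenode = {
        "/logs/2019-05.txt": {
            "size_mb": 300,
            "blocks": ["blk_1", "blk_2", "blk_3"],   # illustrative block IDs
        }
    }

    # DataNodes: the (much larger) block contents, spread across machines
    # so they can be read in parallel; replicas live on different nodes
    datanodes = {
        "datanode-1": ["blk_1", "blk_3"],
        "datanode-2": ["blk_2", "blk_1"],
        "datanode-3": ["blk_2", "blk_3"],
    }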
HDFS Blocks: What

• Disk blocks
• The minimum amount of data that can be read or written
• Typically 512 bytes
• HDFS blocks
• A much larger unit: 128 MB (in v2)
• Files in HDFS are broken into block-sized chunks, which are stored as independent units and mapped to disk blocks
• A file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage
• It uses only as many disk blocks as necessary
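
For example, a 300 MB file with a 128 MB block size (a sketch with made-up numbers; the helper below is not part of HDFS):

    # Illustrative: split a file into HDFS block-sized chunks
    BLOCK_SIZE_MB = 128                      # HDFS v2 default block size

    def split_into_blocks(file_size_mb):
        """Return the sizes of the block-sized chunks of a file."""
        chunks = []
        remaining = file_size_mb
        while remaining > 0:
            chunks.append(min(BLOCK_SIZE_MB, remaining))
            remaining -= BLOCK_SIZE_MB
        return chunks

    print(split_into_blocks(300))            # [128, 128, 44]

The final 44 MB chunk occupies only 44 MB of underlying storage, not a full 128 MB block.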
HDFS Blocks: Why

• Benefits of the block abstraction:
• A file can be larger than any single disk in the network
• Files can be distributed across disks
• It simplifies the storage subsystem
• Blocks fit well with replication for providing fault tolerance and availability
• % hadoop fsck / -files -blocks
• will list the blocks that make up each file in the filesystem
BIG DATA
Map Reduce Programming model and Architecture

What is Map Reduce?



Why do we do Map-Reduce?
• It is the processing component of Apache Hadoop - a way to process extremely large data sets
• We are going to
• Study the Map-Reduce paradigm, i.e. the programming model for Map-Reduce
• Study the Hadoop architecture, i.e. how Map-Reduce works internally
• Hadoop is an open-source implementation of Map-Reduce
Let’s consider a distributed word count
Example: find the number of restaurants offering each item

Menu 1: Idli, Vada, Pizza    →   Idli 1, Vada 1, Pizza 1
Menu 2: Dosa, Pizza, Burger  →   Dosa 1, Pizza 1, Burger 1

Merged results: Idli 1, Vada 1, Pizza 2, Dosa 1, Burger 1
Map + Reduce

[Figure: very big data → MAP → partitioning function → REDUCE → all matches]

• The partitioning function ensures that all identical keys are aggregated at the same reducer; each mapper uses the same partition function

• Map:
• Accepts an input key/value pair
• Emits intermediate key/value pairs
• Reduce:
• Accepts an intermediate key/value* pair (a key together with the list of all its values)
• Emits output key/value pairs
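
A minimal sketch of such a partition function (illustrative only; Hadoop’s default HashPartitioner follows the same idea but is implemented in Java):

    import zlib

    # Deterministic hash of the key modulo the number of reducers, so
    # every mapper routes a given key to the same reducer
    def partition(key: str, num_reducers: int) -> int:
        return zlib.crc32(key.encode("utf-8")) % num_reducers

    print(partition("Pizza", 4))   # same reducer index on every mapper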
Map Reduce: A look at the code
Map Reduce- Mapper for Word Count (Python)

"""mapper.py"""

import sys

for line in sys.stdin:
    line = line.strip()        # remove leading/trailing whitespace
    words = line.split()       # split the line into words

    for word in words:
        # Write the result to standard output; '\t' is the delimiter
        # between the key (the word) and the value (a count of 1)
        print(f"{word}\t1")

Output format: Word\t1, for example:
Apple   1
Mango   1
Apple   1
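
A quick local check of the mapper on its own (the sample input line is just an example):

    % echo "Apple Mango Apple" | python3 mapper.py
    Apple   1
    Mango   1
    Apple   1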
Map Reduce- Reducer for Word Count (Python)

#!/usr/bin/env python3
"""reducer.py"""

import sys

word_count = {}                            # running sum per word

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)      # split the key from the value
    word_count[word] = word_count.get(word, 0) + int(count)

for word, count in word_count.items():
    print(f"{word}\t{count}")

Example output:
Apple   2
Mango   1
Map Reduce: Sample Exercise

Modify the word count Map Reduce program to count the frequencies of only those
words whose length is >= 5. See the sketch below.
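
One possible sketch of the change (only the mapper needs to filter; the file name mapper_len5.py is illustrative, and the reducer stays exactly as above):

    #!/usr/bin/env python3
    """mapper_len5.py"""

    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            if len(word) >= 5:             # emit only words of length >= 5
                print(f"{word}\t1")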
