
Big Data

Hadoop: HDFS and Map Reduce


Programming model

Prafullata Kiran Auradkar


Department of Computer Science and Engineering
[email protected]

Acknowledgements:
Significant portions of the slide deck presented in Unit 1 of the course were created by Dr. K V Subramaniam, and I would like to acknowledge and thank him for the same. Some of the material may also have been leveraged from Dr. H L Phalachandra’s slide contents. I may have supplemented this with content from books and other sources on the Internet, and would like to sincerely thank and acknowledge the original authors/publishers; the credit/rights for that material remain with them. These slides are intended for classroom presentation only.
Why the need?
• As per RBI data from May 2019,
• Number of credit/debit card transactions ≈ 1.3 billion
(https://rbidocs.rbi.org.in/rdocs/ATM/PDFs/ATM052019E96EC259708C4ED9AD9E0C6B5E8B6DD5.PDF)
• If each transaction requires about 10 KB of data → roughly 13 TB of data
• That’s a lot of data, and it covers only credit/debit card transactions
• There are other kinds of transactions as well
• Suppose you want to look for fraudulent transactions
• How do you store and process this data?
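
A quick back-of-the-envelope check of that 13 TB figure (a sketch; the 10 KB-per-transaction size is the slide's assumption):

    # Rough estimate: ~1.3 billion transactions x ~10 KB each
    transactions = 1.3e9
    bytes_per_txn = 10 * 1024                # ~10 KB per transaction
    total_bytes = transactions * bytes_per_txn
    print(f"{total_bytes / 1e12:.1f} TB")    # prints 13.3 TB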
HDFS – Hadoop Distributed File System

“HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.”

Very large
• Files can be MB/GB/TB in size
• Hadoop clusters holding petabytes of data are currently operational

Read-mostly data
• The most efficient data processing pattern is write-once, read-many-times
• Each analysis will involve a large proportion of the dataset
• The time to read the whole dataset is more important than the latency in reading the first record

Commodity hardware
• Hadoop doesn’t require expensive, highly reliable hardware
• It is designed to run on clusters of commodity hardware
HDFS Motivation

• File Metadata: filename, access control information, size, and the location of the file on disk
• File Data: the actual data of the file. It is much larger, and more time is spent here

HDFS Motivation

Solution
• File Metadata: smaller than the data and accessed less frequently → keep it on a separate server, the NAMENODE
• File Data: much larger and requires parallel access → distribute it across machines, the DATANODEs
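
A conceptual sketch of this split, using purely illustrative Python data structures (the file name, block IDs and node names are made up; this is not HDFS’s actual implementation):

    # NameNode: small, frequently consulted metadata about each file
    namenode = {
        "/logs/2019-05.txt": {
            "size_mb": 300,
            "blocks": ["blk_1", "blk_2", "blk_3"],   # illustrative block IDs
        }
    }

    # DataNodes: the (much larger) block contents, spread across machines
    # so they can be read in parallel; replicas live on different nodes
    datanodes = {
        "datanode-1": ["blk_1", "blk_3"],
        "datanode-2": ["blk_2", "blk_1"],
        "datanode-3": ["blk_2", "blk_3"],
    }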
HDFS Blocks: What

• Disk blocks
• The minimum amount of data that can be read or written
• Typically 512 bytes
• HDFS blocks
• A much larger unit: 128 MB (in v2)
• Files in HDFS are broken into block-sized chunks, which are stored as independent units and mapped to disk blocks
• A file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage
• It uses only as many disk blocks as necessary
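
For example, a 300 MB file with a 128 MB block size (a sketch with made-up numbers; the helper below is not part of HDFS):

    # Illustrative: split a file into HDFS block-sized chunks
    BLOCK_SIZE_MB = 128                      # HDFS v2 default block size

    def split_into_blocks(file_size_mb):
        """Return the sizes of the block-sized chunks of a file."""
        chunks = []
        remaining = file_size_mb
        while remaining > 0:
            chunks.append(min(BLOCK_SIZE_MB, remaining))
            remaining -= BLOCK_SIZE_MB
        return chunks

    print(split_into_blocks(300))            # [128, 128, 44]

The final 44 MB chunk occupies only 44 MB of underlying storage, not a full 128 MB block.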
HDFS Blocks: Why

• Benefits of the block abstraction:
• A file can be larger than any single disk in the network
• Files can be distributed across disks
• It simplifies the storage subsystem
• Blocks fit well with replication for providing fault tolerance and availability
• % hadoop fsck / -files -blocks
• will list the blocks that make up each file in the filesystem
BIG DATA
Map Reduce Programming model and Architecture

What is Map Reduce?



Why do we do Map-Reduce?
• It is the processing component of Apache Hadoop - a way to process extremely large data sets
• We are going to
• Study the Map-Reduce paradigm, i.e. the programming model for Map-Reduce
• Study the Hadoop architecture, i.e. how Map-Reduce works internally
• Hadoop is an open-source implementation of Map-Reduce
Let’s consider a distributed word count
Example: find the number of restaurants offering each item

Menu 1: Idli, Vada, Pizza    →   Idli 1, Vada 1, Pizza 1
Menu 2: Dosa, Pizza, Burger  →   Dosa 1, Pizza 1, Burger 1

Merged results: Idli 1, Vada 1, Pizza 2, Dosa 1, Burger 1
Map + Reduce

[Figure: very big data → MAP → partitioning function → REDUCE → all matches]

• The partitioning function ensures that all identical keys are aggregated at the same reducer; each mapper uses the same partition function

• Map:
• Accepts an input key/value pair
• Emits intermediate key/value pairs
• Reduce:
• Accepts an intermediate key/value* pair (a key together with the list of all its values)
• Emits output key/value pairs
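
A minimal sketch of such a partition function (illustrative only; Hadoop’s default HashPartitioner follows the same idea but is implemented in Java):

    import zlib

    # Deterministic hash of the key modulo the number of reducers, so
    # every mapper routes a given key to the same reducer
    def partition(key: str, num_reducers: int) -> int:
        return zlib.crc32(key.encode("utf-8")) % num_reducers

    print(partition("Pizza", 4))   # same reducer index on every mapper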
Map Reduce: A look at the code
Map Reduce- Mapper for Word Count (Python)

"""mapper.py"""

import sys

for line in sys.stdin:
    line = line.strip()        # remove leading/trailing whitespace
    words = line.split()       # split the line into words

    for word in words:
        # Write the result to standard output; '\t' is the delimiter
        # between the key (the word) and the value (a count of 1)
        print(f"{word}\t1")

Output format: Word\t1, for example:
Apple   1
Mango   1
Apple   1
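
A quick local check of the mapper on its own (the sample input line is just an example):

    % echo "Apple Mango Apple" | python3 mapper.py
    Apple   1
    Mango   1
    Apple   1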
Map Reduce- Reducer for Word Count (Python)

#!/usr/bin/env python3
"""reducer.py"""

import sys

word_count = {}                            # running sum per word

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)      # split the key from the value
    word_count[word] = word_count.get(word, 0) + int(count)

for word, count in word_count.items():
    print(f"{word}\t{count}")

Example output:
Apple   2
Mango   1
Map Reduce: Sample Exercise

Modify the word count Map Reduce program to count the frequencies of only those
words whose length is >= 5. See the sketch below.
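
One possible sketch of the change (only the mapper needs to filter; the file name mapper_len5.py is illustrative, and the reducer stays exactly as above):

    #!/usr/bin/env python3
    """mapper_len5.py"""

    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            if len(word) >= 5:             # emit only words of length >= 5
                print(f"{word}\t1")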
