Hdfs MR Wordcount
Acknowledgements:
Significant information in this slide deck, presented through Unit 1 of the course, was created by Dr. K V Subramaniam; I would like to acknowledge and thank him for the same.
Some information may also have been leveraged from Dr. H L Phalachandra's slide contents. I may have supplemented these with content from books and other Internet
sources, and would like to sincerely thank and acknowledge the original authors/publishers, with whom the credit/rights remain. These slides are intended for
classroom presentation only.
Why the need?
• As per RBI, in May 2019:
• Number of credit/debit card transactions ≈ 1.3 billion (
https://rbidocs.rbi.org.in/rdocs/ATM/PDFs/ATM052019E96EC259708C4ED9AD9E0C6B5E8B6DD5.PDF
)
• If each transaction requires about 10 KB of data, that is
≈ 13 TB of data
• That's a lot of data, and this is only for credit/debit card transactions
• There are other kinds of transactions as well
• Suppose you want to look for fraudulent transactions
• How do we store and process this data?
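The back-of-the-envelope arithmetic above can be checked directly (assuming roughly 10 KB per transaction, in decimal units):

```python
transactions = 1.3e9     # ~1.3 billion card transactions (RBI, May 2019)
bytes_per_txn = 10_000   # assumed ~10 KB of data per transaction
total_tb = transactions * bytes_per_txn / 1e12  # decimal terabytes
print(f"{total_tb:.0f} TB")  # -> 13 TB
```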
HDFS – Hadoop Distributed File System
“HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.” (Tom White, Hadoop: The Definitive Guide)
Solution
Why do we do Map-Reduce ?
• It is the processing component of Apache Hadoop: a way to process extremely large
data
• We are going to
• Study the Map-Reduce paradigm, i.e., the programming model for Map-Reduce
• Study the Hadoop architecture, i.e., how Map-Reduce works internally
• Hadoop is an open-source implementation of Map-Reduce
Let's consider a Distributed Word Count
BIG DATA
Example: find the number of restaurants offering each item.

Menu 1: Idli, Vada, Pizza
Menu 2: Dosa, Pizza, Burger

Merged results:
Idli 1
Vada 1
Pizza 2
Dosa 1
Burger 1
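The merge above can be sketched in plain Python (menu contents taken from the example; this is the counting logic only, not yet distributed):

```python
from collections import Counter

# Each menu lists the items one restaurant offers (from the example)
menu1 = ["Idli", "Vada", "Pizza"]
menu2 = ["Dosa", "Pizza", "Burger"]

# Count, per item, how many restaurants offer it
counts = Counter(menu1) + Counter(menu2)
print(counts["Pizza"])  # -> 2 (offered by both restaurants)
print(counts["Idli"])   # -> 1
```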
Map + Reduce

[Diagram] Very big data → MAPPER → Partitioning Function → REDUCER → All matches

The partitioning function ensures that all identical keys are aggregated at the same
reducer. Each mapper uses the same partition function.
• Map:
• Accepts an input key/value pair
• Emits intermediate key/value pairs
• Reduce:
• Accepts an intermediate key/value* pair (a key together with the list of all its values)
• Emits output key/value pairs
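As a sketch, the two functions for word count look like this (plain Python, for illustration only, not the Hadoop API):

```python
def map_fn(key, value):
    """Map: input (key, value) -> intermediate (word, 1) pairs.
    Here value is one line of text; the key (e.g. file offset) is unused."""
    for word in value.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: intermediate (word, [counts]) -> output (word, total)."""
    return (word, sum(counts))

pairs = list(map_fn(0, "idli vada idli"))
print(pairs)                      # -> [('idli', 1), ('vada', 1), ('idli', 1)]
print(reduce_fn("idli", [1, 1]))  # -> ('idli', 2)
```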
Map Reduce: A look at the code
Map Reduce- Mapper for Word Count (Python)
#!/usr/bin/env python3
"""mapper.py"""
import sys

for line in sys.stdin:
    line = line.strip()        # remove leading/trailing whitespace
    words = line.split()       # split the line into words
    for word in words:
        print(f"{word}\t1")    # emit an intermediate (word, 1) pair
#!/usr/bin/env python3
"""reducer.py"""
import sys

word_count = {}                # running sum per word, e.g. {"Apple": 2, "Mango": 1}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    word_count[word] = word_count.get(word, 0) + int(count)

for word, count in word_count.items():
    print(f"{word}\t{count}")  # emit the final (word, total) pair
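Hadoop Streaming pipes the input through the mapper, a sort (the shuffle phase), and the reducer. The same pipeline can be sketched locally in plain Python, with inline equivalents of the two scripts above:

```python
def mapper(lines):
    """Emit one 'word\t1' string per word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word, like reducer.py."""
    word_count = {}
    for line in lines:
        word, count = line.strip().split('\t', 1)
        word_count[word] = word_count.get(word, 0) + int(count)
    return word_count

data = ["idli vada", "idli pizza"]
result = reducer(sorted(mapper(data)))  # sorted() plays the role of the shuffle
print(result)  # -> {'idli': 2, 'pizza': 1, 'vada': 1}
```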
Map Reduce exercise: Modify the word count Map-Reduce to count the frequencies
of only those words with word length >= 5.
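One possible approach, as a minimal sketch: only the mapper needs to change, filtering out short words before emitting; the reducer stays the same.

```python
# Modified mapper logic: emit a pair only for words of length >= 5
def mapper_long_words(lines, min_len=5):
    for line in lines:
        for word in line.strip().split():
            if len(word) >= min_len:   # the only change vs. the original mapper
                yield f"{word}\t1"

pairs = list(mapper_long_words(["idli burger pizza vada"]))
print(pairs)  # -> ['burger\t1', 'pizza\t1']
```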