Parallel Project
Apache Hadoop
Submitted by
Mahrukh Tariq (2019-BCS-028)
Hafsa Bashir (2019-BCS-019)
Maria Naveed (2019-BCS-060)
Submitted To
Mam Madiha Wahid
Dated
16 December 2022
Apache Hadoop is an open source, Java-based software platform that manages data processing
and storage for big data applications. The platform works by distributing Hadoop big data and
analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads
that can be run in parallel.
Why Hadoop?
Distributed Cluster System
Platform for massively scalable applications
Enables parallel data processing
Hadoop Invention:
Apache Hadoop was born out of a need to process ever-increasing volumes of big data and deliver web results faster, as search engines like Yahoo and Google were getting off the ground.
Inspired by Google’s MapReduce, a programming model that divides an application into small
fractions to run on different nodes, Doug Cutting and Mike Cafarella started Hadoop in 2002
while working on the Apache Nutch project. According to a New York Times article, Doug
named Hadoop after his son's toy elephant.
A few years later, Hadoop was spun off from Nutch. Nutch focused on the web crawler element,
and Hadoop became the distributed computing and processing portion. Two years after Cutting
joined Yahoo, Yahoo released Hadoop as an open-source project in 2008. The Apache Software
Foundation (ASF) made Hadoop available to the public in November 2012 as Apache Hadoop.
MapReduce:
MapReduce is a programming framework that allows us to perform distributed, parallel processing on large data sets across a cluster of machines.
In a traditional approach, where the data would be split across several machines manually, the following challenges arise:
1. Critical path problem: This is the amount of time taken to finish the job without delaying the next milestone or the actual completion date. If any one of the machines delays the job, the whole work gets delayed.
2. Reliability problem: What if one of the machines working on a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: How do we divide the data into smaller chunks so that each machine gets an even share of the data to work with? In other words, how do we divide the data so that no individual machine is overloaded or underutilized?
4. A single split may fail: If any machine fails to provide its output, the final result cannot be calculated. There should be a mechanism to ensure the fault tolerance of the system.
5. Aggregation of the result: There should be a mechanism to aggregate the results generated by each machine to produce the final output.
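As a toy illustration of challenges 3 and 5, the equal-split and aggregation steps can be sketched in plain Python (the data and the number of machines are made up for the example):

```python
# Hypothetical illustration of splitting work evenly and aggregating results.
data = list(range(1, 101))  # pretend these are 100 records to process
num_machines = 4

# Equal split: round-robin assignment so no machine is overloaded or underutilized
chunks = [data[i::num_machines] for i in range(num_machines)]

# Each "machine" computes a partial result on its own chunk, in isolation
partials = [sum(chunk) for chunk in chunks]

# Aggregation: combine the partial results into the final output
total = sum(partials)
print(total)  # 5050, the same answer a single machine would compute
```

In a real cluster the failure cases (challenges 1, 2, and 4) are what make this hard; MapReduce handles the splitting, scheduling, retry, and aggregation automatically.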
Let us understand more about MapReduce and its components. MapReduce majorly has the following components: the Mapper class, the Reducer class, and the Driver class.
Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader processes each input record and generates the corresponding key-value pair; Hadoop's Mapper store saves this intermediate output to the local disk.
Input Split
It is the logical representation of data: a block of work that contains a single map task in the MapReduce program.
RecordReader
It interacts with the input split and converts the obtained data into key-value pairs.
Reducer Class
The intermediate output generated by the mapper is fed to the reducer, which processes it and generates the final output, which is then saved in HDFS.
Driver Class
The major component of a MapReduce job is the Driver class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names.
As an example, suppose a file sample.txt contains the words:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now suppose we have to perform a word count on sample.txt using MapReduce: we will find the unique words and the number of occurrences of each.
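This word count can be simulated in plain Python, without Hadoop, to show the map, shuffle/sort, and reduce phases; the three input splits below are an assumed line-by-line division of sample.txt:

```python
from collections import defaultdict

# Input splits: each "line" would go to a separate mapper (assumed division)
splits = ["Dear Bear River", "Car Car River", "Deer Car Bear"]

# Map phase: each mapper emits a (word, 1) pair per word
mapped = []
for split in splits:
    for word in split.split():
        mapped.append((word, 1))

# Shuffle/sort phase: group the emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the counts for each unique word
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # {'Dear': 1, 'Bear': 2, 'River': 2, 'Car': 3, 'Deer': 1}
```

In real Hadoop, the mappers and reducers run on different nodes and the shuffle happens over the network, but the data flow is the same.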
Advantages of MapReduce
1. Parallel Processing
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. As the data is processed by multiple machines in parallel instead of by a single machine, the time taken to process the data is reduced tremendously.
2. Data Locality
It is far more cost-effective to move the processing unit to the data than to move the data to the processing unit.
The processing time is reduced because all the nodes work on their part of the data in parallel.
Every node gets a part of the data to process, so no node is overburdened.
1. NameNode
Stores metadata for the files, such as the directory structure of a typical file system.
The server holding the NameNode instance is quite crucial, as there is only one NameNode per cluster.
Keeps a transaction log for file deletes, adds, etc. It does not use transactions for whole blocks or file streams, only for metadata.
Handles the creation of additional replica blocks when necessary, for example after a DataNode failure.
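The last point can be sketched as follows. This is a simplified, hypothetical model of the NameNode's bookkeeping, not Hadoop's actual implementation; the node names, block IDs, and replication factor of 3 are illustrative:

```python
# Hypothetical sketch of the NameNode's re-replication decision.
REPLICATION_FACTOR = 3
block_locations = {           # block -> set of DataNodes holding a replica
    "blk_001": {"dn1", "dn2", "dn3"},
    "blk_002": {"dn1", "dn4", "dn5"},
}
live_datanodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

def handle_datanode_failure(failed, block_locations, live_datanodes):
    """Drop the failed node, then schedule new replicas for under-replicated blocks."""
    live_datanodes.discard(failed)
    scheduled = {}
    for block, holders in block_locations.items():
        holders.discard(failed)
        missing = REPLICATION_FACTOR - len(holders)
        if missing > 0:
            # pick healthy nodes that do not already hold this block
            candidates = sorted(live_datanodes - holders)[:missing]
            holders.update(candidates)
            scheduled[block] = candidates
    return scheduled

scheduled = handle_datanode_failure("dn1", block_locations, live_datanodes)
print(scheduled)  # {'blk_001': ['dn4'], 'blk_002': ['dn2']}
```

After the failure of dn1, every block is back at its target replication factor; the real NameNode does this continuously based on DataNode heartbeats.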
2. DataNode
Stores the actual blocks of file data and serves read/write requests from clients, reporting back to the NameNode periodically.
Benefits of HDFS:
Cost effectiveness
Large data set storage
Fast recovery from hardware failure
Portability
Installation
Prerequisites:
Java 8:
Installing Java:
New Variables:
System Variables
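As an example (the exact paths depend on where Java and Hadoop are extracted, so the values below are assumptions for illustration), the new system variables can be set from a Windows command prompt like this:

```shell
setx JAVA_HOME "C:\Java\jdk1.8.0_202"
setx HADOOP_HOME "C:\hadoop"
setx PATH "%PATH%;%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin"
```

The same variables can instead be added through System Properties > Environment Variables, which is what the screenshots in this section show.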
Installation:
Extraction:
XML files
Copy the locations of both folders, and now edit the HDFS site file (hdfs-site.xml) to point to them.
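A typical hdfs-site.xml for a single-node setup might look like the following; the two folder paths are assumptions standing in for the locations copied above:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop\data\datanode</value>
  </property>
</configuration>
```

Replication is set to 1 because a single-node cluster has nowhere else to place replicas.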
hadoop-env.cmd file
New paths
Yahoo!
Facebook
Amazon
Netflix
etc.
→ Yahoo!'s search Web Map runs on a 10,000-core Linux cluster and powers Yahoo! Web search.
→ Facebook's Hadoop cluster hosted 100+ PB of data (July 2012) and was growing at about 0.5 PB/day (November 2012).
The New York Times, using Hadoop and MapReduce running on EC2/S3, converted 4 TB of TIFF images into 11 million PDF articles in 24 hours.
Design requirements:
Integrate the display of email, SMS, and chat messages between pairs and groups of users. Give users strong control over who they receive messages from. Be suited for production use by 500 million people immediately after launch, with stringent latency and uptime requirements.
Classic alternatives
These requirements were typically met using a large MySQL cluster and caching tiers using Memcached. Content on HDFS could be loaded into MySQL or Memcached if needed by the web tier.
Problems with previous solutions
MySQL has low random write throughput, which is a big problem for messaging.
It is difficult to scale MySQL clusters rapidly while maintaining performance.
MySQL clusters have high management overhead and require more expensive hardware.
Facebook's solution
The major concern was availability: the NameNode is a single point of failure (SPOF), and failover times are at least 20 minutes.
A proprietary "AvatarNode" eliminates the SPOF and makes HDFS safe to deploy even with a 24/7 uptime requirement.
Performance improvements for the real-time workload: on an RPC timeout, fail fast and try a different DataNode rather than waiting.
Benefits of Hadoop
Cost-saving, efficient, and reliable data processing
Provides an economically scalable solution
Storage and processing of large amounts of data
Acts as a data grid operating system
Deployed on industry-standard servers rather than expensive, specialized data storage systems
Parallel processing of huge amounts of data across inexpensive, industry-standard servers
Working of Hadoop
Word Count Problem
Configuration
Starting namenode
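Assuming HADOOP_HOME\bin and \sbin are on the PATH, the NameNode is formatted once and the daemons are then started from a Windows command prompt (the format step erases HDFS metadata, so it is run only on first setup):

```shell
hdfs namenode -format
start-dfs.cmd
start-yarn.cmd
jps
```

`jps` should then list the NameNode, DataNode, ResourceManager, and NodeManager processes if everything started correctly.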
Hadoop Limitations
Small File Concerns
Slow Processing Speed
No Real Time Processing
No Iterative Processing
Ease of Use
Security Problem
Limitations and Solutions
Small File Concerns
The idea behind Hadoop was to have a small number of large files. Each file, directory, and block occupies a memory element inside the NameNode's memory. As a rule of thumb, this memory element is about 150 bytes. So, if you have 10 million files, each using a block, they would occupy about 1.39 GB of memory.
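The 1.39 GB figure can be checked with quick arithmetic (150 bytes per metadata object is the rule of thumb quoted above):

```python
files = 10_000_000        # ten million small files, one block each
bytes_per_object = 150    # approximate NameNode memory per metadata object

total_bytes = files * bytes_per_object   # 1,500,000,000 bytes
total_gib = total_bytes / 2**30          # convert bytes to gibibytes
print(round(total_gib, 2))  # 1.4, matching the ~1.39 GB quoted above
```

All of this metadata lives in the NameNode's RAM, which is why millions of tiny files hurt Hadoop far more than the same data in a few large files.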
Solution
Merge or archive small files into larger ones, for example using Hadoop Archive (HAR) files or SequenceFiles, so that many small files are tracked by far fewer entries in the NameNode's memory.
No Iterative Processing
Solution
MapReduce writes intermediate results to disk between jobs, so it has no native support for iterative algorithms; frameworks such as Apache Spark, which keep intermediate data in memory, are commonly used for iterative workloads instead.
References
https://www.youtube.com/watch?v=g7Qpnmi0Q-s&ab_channel=edureka%21
https://www.databricks.com/glossary/hadoop
https://www.geeksforgeeks.org/hadoop-ecosystem/
https://www.edureka.co/blog/mapreduce-tutorial/
https://www.techtarget.com/searchdatamanagement/definition/Hadoop-Distributed-File-System-HDFS
https://medium.com/@bharvi.vyas123/6-major-hadoop-limitations-with-their-solutions-1cae1d3936e1
https://www.slideshare.net/sravya/hadoop-technology-ppt-57921409
https://www.guru99.com/create-your-first-hadoop-program.html
--------------------------------------------------End-------------------------------------------------------