
Apache Hadoop

Submitted by
Mahrukh Tariq (2019-BCS-028)
Hafsa Bashir (2019-BCS-019)
Maria Naveed (2019-BCS-060)

Submitted To
Mam Madiha Wahid

Dated
16 December 2022


What is Apache Hadoop?

Apache Hadoop is an open source, Java-based software platform that manages data processing
and storage for big data applications. The platform works by distributing Hadoop big data and
analytics jobs across nodes in a computing cluster, breaking them down into smaller workloads
that can be run in parallel.

Why Hadoop?
 Distributed Cluster System
 Platform for massively scalable applications
 Enables parallel data processing
 Born out of the need to process ever-increasing volumes of big data and deliver web search results faster

Hadoop Invention:
Apache Hadoop was born out of a need to process ever-increasing volumes of big data
and deliver web results faster as search engines like Yahoo and Google were getting off the
ground.
Inspired by Google’s MapReduce, a programming model that divides an application into small
fractions to run on different nodes, Doug Cutting and Mike Cafarella began the work that became
Hadoop in 2002 while working on the Apache Nutch project. According to a New York Times
article, Doug named Hadoop after his son's toy elephant.
A few years later, Hadoop was spun off from Nutch. Nutch focused on the web crawler element,
and Hadoop became the distributed computing and processing portion. Two years after Cutting
joined Yahoo, Yahoo released Hadoop as an open-source project in 2008. The Apache Software
Foundation (ASF) made Hadoop available to the public in November 2012 as Apache Hadoop.

Apache Hadoop Ecosystem:


Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming-based data processing
 Spark: In-memory data processing
 Pig, Hive: Query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: Machine learning algorithm libraries
 Solr, Lucene: Searching and indexing
 ZooKeeper: Cluster management
 Oozie: Job scheduling


What is Hadoop used for?


When it comes to Hadoop, the possible use cases are almost endless.
Retail:
Large organizations have more customer data available on hand than ever before. But often, it’s
difficult to make connections between large amounts of seemingly unrelated data. When British
retailer M&S deployed the Hadoop-powered Cloudera Enterprise, they were more than
impressed with the results.
The insights led to more efficient warehouse use, prevented stock-outs during unexpected
peaks in demand, and gave the retailer a significant advantage over its competitors.
Finance:
Hadoop is perhaps more suited to the finance sector than any other. Early on, the software
framework was quickly pegged for primary use in handling the advanced algorithms involved
with risk modeling. It’s exactly the type of risk management that could have helped avoid the
credit default swap disaster that led to the 2008 recession.
Banks have realized that this same logic applies to managing risk for customer portfolios.
Today, it’s common for financial institutions to implement Hadoop to better manage the
financial security and performance of their clients’ assets.
Healthcare:
Whether nationalized or privatized, healthcare providers of any size deal with huge volumes of
data and patient information. Hadoop frameworks allow doctors, nurses and carers to have easy
access to the information they need when they need it, and they also make it easy to aggregate
data that provides actionable insights. This can apply to matters of public health, better
diagnostics, improved treatments and more.
Security and law enforcement:
Hadoop can help improve the effectiveness of national and local security, too. When it comes to
solving related crimes spread across multiple regions, a Hadoop framework can streamline the
process for law enforcement by connecting two seemingly isolated events. By cutting down on
the time to make case connections, agencies will be able to put out alerts to other agencies and
the public as quickly as possible.


What language is Hadoop written in?


The Hadoop framework itself is mostly written in Java, with some native code in C and
command-line utilities written as shell scripts. However, Hadoop programs can be written in
many other languages, including Python or C++. This gives programmers the flexibility to work
with the tools they’re most familiar with.
What is Hadoop programming?
In the Hadoop framework, code is mostly written in Java, but some native code is written in C.
Additionally, command-line utilities are typically written as shell scripts. For Hadoop
MapReduce, Java is the most commonly used language, but through the Hadoop Streaming utility,
users can implement the map and reduce functions in the programming language of their choice.
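As a sketch of how this works, Hadoop Streaming is invoked as below; mapper.py and reducer.py are hypothetical user scripts, and the streaming jar path and version are assumptions that depend on the installation:

hadoop jar %HADOOP_HOME%\share\hadoop\tools\lib\hadoop-streaming-3.2.4.jar ^
    -input /input/example.txt ^
    -output /output/streaming ^
    -mapper "python mapper.py" ^
    -reducer "python reducer.py"

The streaming jar passes each input record to the mapper script on stdin and reads its key-value output from stdout, so any executable can play the role of the map or reduce function.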

What are the core Hadoop modules?

 MapReduce - The programming model and execution engine for parallel, distributed processing of data; in early versions of Hadoop it was the only execution engine available.


 HDFS - Hadoop Distributed File System. HDFS is a Java-based system that allows large
data sets to be stored across nodes in a cluster in a fault-tolerant manner.
 YARN - Yet Another Resource Negotiator. YARN is used for cluster resource
management, planning tasks, and scheduling jobs that are running on Hadoop.
 Hadoop Common - Hadoop Common provides a set of services across libraries and
utilities to support the other Hadoop modules.

Map Reduce:
MapReduce is a programming framework that allows us to perform distributed and parallel
processing on large data sets in a distributed environment.


1. Critical path problem: This is the time taken to finish the job without delaying the next
milestone or the actual completion date. If any one of the machines delays its part of the
job, the whole job is delayed.
2. Reliability problem: What if one of the machines working on a part of the data fails?
Managing this failover becomes a challenge.
3. Equal split issue: How do we divide the data into smaller chunks so that each machine
gets an even share of the data to work with? In other words, how do we divide the data
so that no individual machine is overloaded or underutilized.
4. A single split may fail: If any machine fails to provide its output, the overall result cannot
be calculated. So, there should be a mechanism to ensure the fault-tolerance capability
of the system.
5. Aggregation of the result: There should be a mechanism to aggregate the results
generated by each of the machines into the final output.

 MapReduce consists of two distinct tasks – Map and Reduce.


 As the name MapReduce suggests, the reducer phase takes place after the mapper phase
has been completed.
 So, the first is the map job, where a block of data is read and processed to produce
key-value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives the key-value pair from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pair)
into a smaller set of tuples or key-value pairs which is the final output.

Let us look more closely at MapReduce and its components. A MapReduce program has three
major classes:

Mapper Class


The first stage of data processing in MapReduce is the Mapper class. Here, the RecordReader
processes each input record and generates the corresponding key-value pair. Hadoop stores this
intermediate mapper output on the local disk.

 Input Split

It is the logical representation of the data. It represents a block of work that contains a single
map task in the MapReduce program.

 RecordReader

It interacts with the input split and converts the data it reads into key-value pairs.

Reducer Class

The intermediate output generated by the mapper is fed to the reducer, which processes it and
generates the final output, which is then saved in HDFS.

Driver Class

The other major component of a MapReduce job is the Driver class. It is responsible for setting
up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes
along with the data types and the respective job name.

MapReduce Tutorial: A Word Count Example of MapReduce

Let us understand how MapReduce works by taking an example where we have a text file called
example.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

Now suppose we have to perform a word count on example.txt using MapReduce: we will find
the unique words and the number of occurrences of each of those words.
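To make the phases concrete, below is a minimal word-count job modeled on the standard Apache Hadoop WordCount example; it contains the Mapper, Reducer, and Driver classes described above (class names and file paths are illustrative, not taken from this report):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits a (word, 1) pair for every token in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts received for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper emits a (word, 1) pair for every token in its input split, and the reducer sums the values for each word to produce the final counts, which are written to HDFS.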

Advantages of MapReduce

1. Parallel Processing
In MapReduce, we divide the job among multiple nodes, and each node works on a part of
the job simultaneously. Because the data is processed by multiple machines in parallel
instead of by a single machine, the time taken to process the data is reduced tremendously.

2. Data Locality
 It is far more cost-effective to move the processing unit to the data than to move the data
to the processing unit.
 The processing time is reduced because all the nodes work on their part of the data in
parallel.
 Every node gets a part of the data to process, so there is no chance of a node being
overburdened.


HDFS (Hadoop Distributed File system)

 Distributed File System


 Traditional Hierarchical file organization
 Single namespace for entire cluster
 Write-once-read-many access model
 Aware of the network Topology

1. Name Node

 Stores metadata for the files, like the directory structure of a typical file system
 The server holding the NameNode instance is quite crucial, as there is only one NameNode
per cluster
 Keeps a transaction log for file deletes/adds, etc.; transactions cover only metadata, not
whole blocks or file streams
 Handles creation of additional replica blocks when necessary, after a DataNode failure


2. Data Node

 Stores the actual data in HDFS


 Can run on any underlying file system (ext3/4, NTFS, etc.)
 Notifies NameNode of what blocks it has
 The DataNodes are in constant communication with the NameNode to determine whether
they need to complete specific tasks.
 Consequently, the NameNode is always aware of the status of each DataNode.

Benefits of HDFS:

 Cost effectiveness
 Large data set storage
 Fast recovery from hardware failure
 Portability

 Streaming data access
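As a brief illustration of the write-once-read-many access model, the standard HDFS shell can be used as follows (the paths are examples only):

rem create a directory in HDFS and write a file into it once
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put example.txt /user/hadoop/input
rem the file can then be read any number of times
hdfs dfs -cat /user/hadoop/input/example.txt
hdfs dfs -ls /user/hadoop/input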

Installation
 Prerequisites:
 Java 8:


 Changing the download destination


 Installing Java:


 Moving the JDK to the java folder:

 Done to avoid errors while setting the environment variables for Java


 Setting the Environment Variables
 Edit in the Settings App


 New Variables:


System Variables
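For reference, the new variables typically look like the following; the JDK folder name is an assumption and should match the actual installation:

JAVA_HOME = C:\java\jdk1.8.0_351
Path      = <existing entries>;%JAVA_HOME%\bin

Opening a new Command Prompt and running java -version then confirms that Java is visible on the path.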


 Java has been installed


 Now checking version

 Hadoop version 3.2.4

 Selecting the mirror link


 Installation:

 Extraction:

 Setting Environment Variables
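The Hadoop variables mirror the Java ones; the extraction folder below is an assumption and should match where Hadoop 3.2.4 was extracted:

HADOOP_HOME = C:\hadoop-3.2.4
Path        = <existing entries>;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin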


XML files

 Editing the core-site.xml file
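A minimal core-site.xml for a single-node setup typically looks like the following; the hdfs://localhost:9000 address matches the one used later in this report, but treat it as an assumption about this particular installation:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>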


 Editing the mapred-site.xml file
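A typical mapred-site.xml for this kind of setup simply tells MapReduce to run on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>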

 Editing the yarn-site.xml file
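A common minimal yarn-site.xml enables the shuffle auxiliary service that MapReduce jobs need:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>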


 Editing the hdfs-site.xml file

 Before editing this file, create a new folder named data inside the Hadoop installation folder

 Inside data, create two more folders, one for the NameNode and one for the DataNode

 Copy the locations of both folders, and then edit the hdfs-site.xml file
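Assuming the two folders were created as data\namenode and data\datanode inside a C:\hadoop-3.2.4 installation folder (both paths are assumptions), hdfs-site.xml for a single-node cluster typically looks like:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop-3.2.4\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop-3.2.4\data\datanode</value>
  </property>
</configuration>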


 hadoop-env.cmd file

 Editing the file

 Setting the home and path for Hadoop

 New paths
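Inside hadoop-env.cmd, the usual change is pointing JAVA_HOME at the JDK folder (the path below is an assumption):

set JAVA_HOME=C:\java\jdk1.8.0_351

After this, running hadoop version in a new Command Prompt should report 3.2.4.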


Application & working of Hadoop


Hadoop is in use at most organizations that handle big data:

 Yahoo!
 Facebook
 Amazon
 Netflix
 Etc

Some examples of scale:

→Yahoo!’s search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search

→FB’s Hadoop cluster hosts 100+ PB of data (July, 2012) & growing at ½ PB/day (Nov, 2012)

Three main applications of Hadoop:

 Advertisement (Mining user behavior to generate recommendations)


 Searches (group related documents)
 Security (search for uncommon patterns)

→ Non-Realtime large dataset computing:

NY Times was dynamically generating PDFs of articles from 1851-1922

Wanted to pre-generate & statically serve articles to improve performance


Using Hadoop + MapReduce running on EC2 / S3, converted 4TB of TIFFs into 11 million PDF
articles in 24 hrs.

Hadoop in the Wild: Facebook Messages

Design requirements:
Integrate the display of email, SMS and chat messages between pairs and groups of users; give
users strong control over whom they receive messages from; be suited for production use by 500
million people immediately after launch; and meet stringent latency and uptime requirements.


Classic alternatives
These requirements are typically met using a large MySQL cluster and caching tiers based on Memcached.
Content on HDFS could be loaded into MySQL or Memcached if needed by the web tier.
Problems with previous solutions
MySQL has low random write throughput… a big problem for messaging!
It is difficult to scale MySQL clusters rapidly while maintaining performance.
MySQL clusters have high management overhead and require more expensive hardware.
Facebook’s solution

 Hadoop + HBase as foundations


 Improve & adapt HDFS and HBase to scale to FB’s workload and operational
considerations

Major concern was availability: NameNode is SPOF & failover times are at least 20 minutes

 Proprietary “AvatarNode”: eliminates SPOF, makes HDFS safe to deploy even with 24/7
uptime requirement
 Performance improvements for the real-time workload: shorter RPC timeouts, so that a slow
request fails fast and is retried on a different DataNode

Benefits of Hadoop
 Cost-effective, efficient and reliable data processing
 Provides an economically scalable solution
 Storing and processing of large amount of data
 Data grid operating system
 It is deployed on industry standard servers rather than expensive specialized data storage
systems
 Parallel processing of huge amounts of data across inexpensive, industry-standard servers

Working of Hadoop
Word Count Problem

Configuration


Starting namenode


 Web app node started at 8042


 Connection to resource manager


 Starting yarn daemons


 Directory created

The NodeManager web UI is running at localhost:8042

The directory was created in HDFS, whose NameNode serves the file system at localhost:9000
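The screenshots above correspond roughly to the following commands on a single-node Windows installation; directory names and the jar/class names are placeholders rather than the exact ones used here:

rem format the NameNode metadata directory (first run only)
hdfs namenode -format
rem start the HDFS and YARN daemons
start-dfs.cmd
start-yarn.cmd
rem list the running Java daemons (NameNode, DataNode, ResourceManager, NodeManager)
jps
rem create an input directory in HDFS and upload the sample file
hdfs dfs -mkdir /input
hdfs dfs -put example.txt /input
rem run the word-count job (jar and class names are placeholders)
hadoop jar wordcount.jar WordCount /input /output
rem inspect the result
hdfs dfs -cat /output/part-r-00000

With the daemons running, the NodeManager web UI is available at http://localhost:8042 and the NameNode serves the file system at hdfs://localhost:9000, matching the addresses noted above.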


Sales Reducer Class
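The reducer screenshot is not reproduced here; as a sketch in the spirit of the Guru99 sales example cited in the references, a reducer that totals one count per sales record for each country could look like the following (the class name and key/value types are assumptions):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums one count per sales record, giving the total number of sales per country.
public class SalesCountryReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text country, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable c : counts) {
      total += c.get();   // accumulate one count per sale record
    }
    context.write(country, new IntWritable(total));
  }
}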


Hadoop Limitation
 Small File Concerns
 Slow Processing Speed
 No Real Time Processing
 No Iterative Processing
 Ease of Use
 Security Problem
Limitations and Solutions
Small File Concerns
 The idea behind Hadoop was to have a small number of large files
 Each file, directory, and block occupy a memory element inside NameNode memory.
 As a rule of thumb, each such memory element is about 150 bytes. So 10 million files, each
using one block, occupy roughly 10,000,000 × 150 bytes ≈ 1.4 GB of NameNode memory for the file entries alone.

Solution

 Hadoop Archives (HAR files)

 A MapReduce job runs at the back end to pack the archived files into a small number of HDFS
files.
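For example, the following commands (with example paths) archive a directory of small files and then list the archive through the har:// filesystem:

rem pack the contents of /user/hadoop/input into a single archive
hadoop archive -archiveName files.har -p /user/hadoop input /user/hadoop/archive
rem browse the archived files
hdfs dfs -ls har:///user/hadoop/archive/files.har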

Slow Processing Speed

 MapReduce reads and writes the data to and from the disk at every stage.
 These disk seeks take time, thereby making the whole process very slow.
 Hadoop is slow in comparison with newer technologies like Spark and Flink, and it offers
no real-time processing (it is a batch-processing system).

Solution

 Spark: an in-memory (micro-batch) processing engine

 Flink: a stream processing engine that is even faster than Spark for streaming workloads

No Iterative Processing

 Core Hadoop does not support iterative processing

 Iterative processing requires a cyclic data flow, whereas Hadoop works on the principle of
write-once-read-many.

Solution

 Spark supports iterative processing.


 Accomplishes iterative processing through DAG i.e., Directed Acyclic Graph.
 Spark creates RDDs (Resilient Distributed Datasets) from HDFS files
 Flink iterates data using streaming architecture.


References
https://www.youtube.com/watch?v=g7Qpnmi0Q-s&ab_channel=edureka%21
https://www.databricks.com/glossary/hadoop
https://www.geeksforgeeks.org/hadoop-ecosystem/
https://www.edureka.co/blog/mapreduce-tutorial/
https://www.techtarget.com/searchdatamanagement/definition/Hadoop-Distributed-File-System-HDFS
https://medium.com/@bharvi.vyas123/6-major-hadoop-limitations-with-their-solutions-1cae1d3936e1
https://www.slideshare.net/sravya/hadoop-technology-ppt-57921409
https://www.guru99.com/create-your-first-hadoop-program.html

--------------------------------------------------End-------------------------------------------------------

