
BDA

Unit – 2
HADOOP
By: Urvi Dhamecha

History of Hadoop
• In the late 1990s, search engines and indexes were created to help people find relevant information about the content they searched for.
• An open-source web search engine was developed to return results faster by distributing the data across different machines so that tasks could be processed simultaneously.
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
• Cutting named the project after his son’s toy elephant.
Hadoop Overview
• Hadoop is an open-source software framework for storing large amounts of data and performing computation on it. The framework is written mainly in Java, with some native code in C and some shell scripts.
• Hadoop is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows the parallel processing of large datasets.

Comparisons of RDBMS and Hadoop
• RDBMS: a traditional row-column database used for data storage, manipulation and retrieval. Hadoop: open-source software used for storing data and running applications or processes concurrently.
• RDBMS: mostly processes structured data. Hadoop: processes both structured and unstructured data.
• RDBMS: best suited for an OLTP environment. Hadoop: best suited for big data.
• RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
• RDBMS: data normalization is required. Hadoop: data normalization is not required.
• RDBMS: stores transformed and aggregated data. Hadoop: stores huge volumes of data.
• RDBMS: low-latency responses. Hadoop: some latency in response.
• RDBMS: the data schema is static. Hadoop: the data schema is dynamic.
• RDBMS: high data integrity. Hadoop: lower data integrity than RDBMS.
• RDBMS: cost applies for licensed software. Hadoop: free of cost, as it is open-source software.
Distributed Computing Challenges
Here is a list of the main challenges in distributed computing:

• Heterogeneity
• Scalability
• Openness
• Transparency
• Concurrency
• Security
• Failure Handling

Hadoop Distributed File System (HDFS)
• Hadoop comes with a distributed file system called HDFS.
• In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability for parallel applications.
• It is cost-effective because it uses commodity hardware. It involves the concepts of blocks, DataNodes and the NameNode.
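To make this concrete, here is a minimal sketch (not from the original slides) that uses the HDFS Java API to write a file and read it back. The host name, port, and path are placeholder values; on a real cluster the configuration would normally come from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; on a real cluster this comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // illustrative path

        // Write: the NameNode records the metadata, DataNodes store the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the same file back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}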

Hadoop Framework
• NameNode: stores and manages all metadata about the data present on the cluster, so it is the single point of contact for Hadoop.
• JobTracker: runs on the master node (typically alongside the NameNode) and coordinates the MapReduce jobs submitted to the cluster.
• Secondary NameNode: maintains a backup of the metadata present on the NameNode, including the file system change history.
• DataNode: contains the actual data.
• TaskTracker: performs tasks on the local data, as assigned by the JobTracker.

MapReduce Architecture (diagram)

MapReduce Example (diagram)
Moving data in & out of Hadoop
Understanding inputs and outputs of
MapReduce
• In MapReduce programming, the dataset is split into independent chunks.
• Map tasks process these independent chunks in a completely parallel manner.
• The output produced by the map tasks serves as intermediate data and is stored on the local disk of that server.

• The MapReduce framework sorts the map output based on keys.
• This sorted output becomes the input to the reduce tasks.
• A reduce task produces the reduced output by combining the output of the various mappers.
• Job inputs and outputs are stored in a file system.
• The MapReduce framework operates on <key, value>
pairs, that is, the framework views the input to the
job as a set of <key, value> pairs and produces a set
of <key, value> pairs as the output of the job.

• The key and value classes must be serializable by the framework and hence need to implement the Writable interface.
• Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
• Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
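As a small illustration of the two points above, the sketch below shows what a custom key class might look like. The class and its fields (a year/temperature pair) are hypothetical and chosen only for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key used only to illustrate the serialization
// (write/readFields) and sorting (compareTo) hooks required by the framework.
public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    public YearTempKey() { }                      // no-arg constructor required by Hadoop

    public YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {                   // used by the sort phase
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}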

        Input              Output
Map     <k1, v1>           list(<k2, v2>)
Reduce  <k2, list(v2)>     list(<k3, v3>)
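The table above summarizes the signatures. As a minimal sketch, the map and reduce classes below implement the word-count example that appears a little further on, using the standard org.apache.hadoop.mapreduce API; the class names are chosen for illustration, and in practice each class would live in its own file.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: <k1 = byte offset, v1 = line of text>  ->  list(<k2 = word, v2 = 1>)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());   // normalize case
            context.write(word, ONE);                  // emit <word, 1>
        }
    }
}

// Reduce: <k2 = word, list(v2) = counts>  ->  <k3 = word, v3 = total>
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));      // emit <word, total>
    }
}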

Execution Pipeline (diagram)
Example of MapReduce
Apply the MapReduce word-count algorithm to the following input.

• Hello I am Google Assist


• How can I help you
• How can I assist you
• Are you an engineer
• Are you looking for coding
• Are you looking for interview questions
• what are you doing these days
• what are your strengths

• OUTPUT (word counts, case-insensitive, in key order)
• am,1  an,1  are,5  assist,2
• can,2  coding,1
• days,1  doing,1
• engineer,1
• for,2
• google,1
• hello,1  help,1  how,2
• i,3  interview,1
• looking,2
• questions,1
• strengths,1
• these,1
• what,2
• you,6  your,1
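For completeness, a sketch of the driver that would wire the mapper and reducer shown earlier into a job and submit it. The input and output paths are supplied as command-line arguments and are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}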

Hadoop in the cloud (diagram)
The Hadoop Ecosystem
• Hadoop Common: contains libraries and other modules
• HDFS: a distributed file system that stores data on commodity machines
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
The Hadoop Ecosystem
• When architects and developers discuss
software, they typically immediately qualify a
software tool for its specific usage. For
example, they may say that Apache Tomcat is
a web server and that MySQL is a database.
• When it comes to Hadoop, however, things
become a little bit more complicated. Hadoop
encompasses a multiplicity of tools that are
designed and implemented to work together.

The Hadoop Ecosystem
• For some people, Hadoop is a data
management system bringing together
massive amounts of structured and
unstructured data.
• For others, it is a massively parallel execution
framework bringing the power of
supercomputing to the masses.
• Some view Hadoop as an open source
community creating tools and software for
solving Big Data problems.
The Hadoop Ecosystem
• Because Hadoop provides such a wide array of
capabilities that can be adapted to solve many
problems, many consider it to be a basic
framework.

The Hadoop Ecosystem

• HDFS — A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster of computers, and data is written once but read many times for analytics. It provides the foundation for other tools, such as HBase.

The Hadoop Ecosystem

• MapReduce — Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map phases and reduce phases (thus the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. Because of the way MapReduce works, Hadoop brings the processing to the data in a parallel fashion, resulting in fast processing.

The Hadoop Ecosystem

• HBase — A column-oriented NoSQL database built on top of HDFS, HBase is used for fast read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure that all of its components are up and running.
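As a rough sketch of how an application might talk to HBase through its standard Java client; the table name, column family, and values here are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info", qualifier "city".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Rajkot"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}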

The Hadoop Ecosystem

• Zookeeper — Zookeeper is Hadoop’s distributed coordination service. Designed to run over a cluster of machines, it is a highly available service used for the management of Hadoop operations, and many components of Hadoop depend on it.
• Oozie — A scalable workflow system, Oozie is
integrated into the Hadoop stack, and is used to
coordinate execution of multiple MapReduce jobs. It
is capable of managing a significant amount of
complexity, basing execution on external events that
include timing and presence of required data.
The Hadoop Ecosystem

• Pig — An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its compiler translates Pig Latin into sequences of MapReduce programs.
• Hive — An SQL-like, high-level language used to run queries on data stored in Hadoop, Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop. Like Pig, Hive was developed as an abstraction layer, but it is geared toward database analysts who are more familiar with SQL than with Java programming.
The Hadoop Ecosystem
• Sqoop is a connectivity tool for moving data between relational databases or data warehouses and Hadoop. Sqoop uses the database to describe the schema of the imported/exported data, and uses MapReduce for parallelization and fault tolerance.
• Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines into HDFS. It is based on a simple and flexible architecture and provides streaming data flows. It uses a simple, extensible data model that allows you to move data from multiple machines within an enterprise into Hadoop.
The Hadoop Ecosystem

• Whirr — This is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.
• Mahout — This is a machine-learning and data-mining library that provides MapReduce implementations for popular algorithms used for clustering, regression, and statistical modeling.
The Hadoop Ecosystem

• BigTop — This is a formal process and framework for packaging and interoperability testing of Hadoop’s sub-projects and related components.
• Ambari — This is a project aimed at
simplifying Hadoop management by providing
support for provisioning, managing, and
monitoring Hadoop clusters.

Business value of Hadoop
Scalability:
• Hadoop can scale from a single server to thousands of machines, offering
local storage and computation.
• It can handle vast amounts of data, both structured and unstructured,
making it ideal for growing businesses.
Cost-Effective:
• Uses commodity hardware, reducing the need for expensive, specialized
equipment.
• Open-source nature eliminates software licensing costs.
Flexibility:
• Can store and process various types of data, including text, images, and
videos.
• Supports different data processing models, such as batch, interactive, and
real-time processing.

Business value of Hadoop
Fault Tolerance:
• Built-in replication and redundancy ensure data reliability and availability.
• Automatic data recovery in case of hardware failure.
Speed and Performance:
• Distributes data processing across multiple nodes, enhancing processing
speed.
• Can process large volumes of data quickly, leading to faster insights and
decision-making.
Integration with Existing Systems:
• Compatible with various data sources and tools, facilitating easy
integration into existing IT infrastructure.
• Supports multiple programming languages and big data technologies like
Apache Spark, Hive, and HBase.

Business value of Hadoop
Enhanced Data Analytics:
• Enables complex data analysis and machine learning at scale.
• Supports advanced analytics, providing deeper insights and improved
business intelligence.
Competitive Advantage:
• Businesses can derive actionable insights from big data, leading to better
decision-making and strategic planning.
• Ability to analyze customer behavior, market trends, and operational
efficiency can offer a significant competitive edge.
Security and Compliance:
• Provides robust security features such as authentication, authorization,
and encryption.
• Helps in meeting regulatory compliance by managing and securing
sensitive data effectively.

Hadoop YARN
(Yet Another Resource Negotiator)
• YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be maximized.
• The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing.

Hadoop YARN
(Yet Another Resource Negotiator)
• The Resource Manager is used to manage all the resources that are made available for running a Hadoop cluster.

Features of YARN
• Multi-Tenancy
• Scalability
• Cluster-Utilization
• Compatibility

Hadoop YARN
• YARN stands for “Yet Another Resource Negotiator”.
• It was introduced in Hadoop 2.0 to remove the bottleneck on the JobTracker which was present in Hadoop 1.0.
• YARN was described as a “Redesigned Resource Manager” at the time of its launch, but it has since evolved into a large-scale distributed operating system for big data processing.
• The YARN architecture basically separates the resource management layer from the processing layer.

Hadoop YARN
• YARN also allows different data processing engines like graph
processing, interactive processing, stream processing as well
as batch processing to run and process data stored in HDFS
(Hadoop Distributed File System) thus making the system
much more efficient.

• Through its various components, it can dynamically allocate various resources and schedule the application processing.
• For large-volume data processing, it is quite necessary to manage the available resources properly so that every application can leverage them.

Hadoop YARN Architecture
Resource Manager
• Resource Manager is the master daemon of YARN. It is
responsible for managing several other applications, along with
the global assignments of resources such as CPU and memory. It
is used for job scheduling. Resource Manager has two
components:
1. Scheduler: the Scheduler's task is to distribute resources to the running applications. It deals only with the scheduling of tasks and hence performs no tracking and no monitoring of applications.
2. Application Manager: The application Manager manages
applications running in the cluster. Tasks, such as the starting of
Application Master or monitoring, are done by the Application
Manager.
Hadoop YARN Architecture
Node Manager
• Node Manager is the slave daemon of YARN. It has the
following responsibilities:
• Node Manager has to monitor the container’s resource usage,
along with reporting it to the Resource Manager.
• The health of the node on which YARN is running is tracked by
the Node Manager.
• It takes care of each node in the cluster while managing the
workflow, along with user jobs on a particular node.
• It keeps the data in the Resource Manager updated
• Node Manager can also destroy or kill the container if it gets
an order from the Resource Manager to do so.

Hadoop YARN Architecture
Application Master
• Every job submitted to the framework is an application, and
every application has a specific Application Master associated
with it. Application Master performs the following tasks:
• It coordinates the execution of the application in the cluster,
along with managing the faults.
• It negotiates resources from the Resource Manager.
• It works with the Node Manager for executing and monitoring
other components’ tasks.
• At regular intervals, heartbeats are sent to the Resource
Manager for checking its health, along with updating records
according to its resource demands.

Hadoop YARN Architecture
Container
• A container is a set of physical resources (CPU cores, RAM,
disks, etc.) on a single node. The tasks of a container are listed
below:
• It grants the right to an application to use a specific amount of
resources (memory, CPU, etc.) on a specific host.
• YARN containers are managed by a ContainerLaunchContext (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.

Hadoop YARN Architecture
• Application workflow in Hadoop YARN:
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. The application code is executed in the container.
7. The client contacts the Resource Manager/Application Master to monitor the application’s status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.

Hadoop Limitations
1. Problem with Small Files
• Hadoop performs efficiently over a small number of large files. Hadoop stores files in the form of blocks, which are 128 MB in size by default (often configured up to 256 MB). Hadoop struggles when it has to access a very large number of small files, because so many small files overload the NameNode and make it difficult to work efficiently.
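As a rough back-of-the-envelope illustration (assuming the commonly cited figure of roughly 150 bytes of NameNode memory per file or block object): 1 GB stored as a single file of 128 MB blocks needs about 1 file + 8 blocks = 9 objects, a little over 1 KB of metadata, whereas the same 1 GB stored as 10,000 files of 100 KB needs about 10,000 files + 10,000 blocks = 20,000 objects, roughly 3 MB of NameNode memory, even though the amount of data is identical.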
2. Vulnerability
• Hadoop is a framework written in Java, and Java is one of the most commonly used programming languages, which makes Hadoop easier for cyber-criminals to exploit.

Hadoop Limitations
3. Low Performance in Small-Data Environments
• Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by organizations that generate a massive volume of data. Its efficiency decreases when it is used in small-data environments.
4. Lack of Security
• Data is everything for an organization, yet by default the security features in Hadoop are disabled. Whoever operates the cluster needs to be aware of this and take appropriate action to secure it.

Hadoop Limitations
5. Processing Overhead
• Read/write operations in Hadoop are expensive, since we are dealing with data in the TB or PB range. In Hadoop, data is read from and written to disk, which makes it difficult to perform in-memory computation and leads to processing overhead.
6. Supports Only Batch Processing
• A batch process is simply a process that runs in the background and does not have any interaction with the user. The engines used for these processes inside the Hadoop core are not very efficient, and producing output with low latency is not possible with them.

End of Unit - 2
