
BDA

Unit – 2
HADOOP
By: Urvi Dhamecha

History of Hadoop
• In the late 1990s, search engines and indexes were created to help people find relevant information about the content they searched for.
• An open-source web search engine was developed to return results faster by distributing the data across different machines so that tasks could be processed simultaneously.
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
• Cutting named the project after his son’s toy elephant.
Hadoop Overview
• Hadoop is an open-source software framework for storing large amounts of data and performing computation on it. The framework is written mainly in Java, with some native code in C and some shell scripts.
• Hadoop is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows the parallel processing of large datasets.

Comparisons of RDBMS and Hadoop
• RDBMS: a traditional row-column database used for data storage, manipulation and retrieval. Hadoop: open-source software used for storing data and running applications or processes concurrently.
• RDBMS: mostly processes structured data. Hadoop: processes both structured and unstructured data.
• RDBMS: best suited for an OLTP environment. Hadoop: best suited for big data.
• RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
• RDBMS: data normalization is required. Hadoop: data normalization is not required.
• RDBMS: stores transformed and aggregated data. Hadoop: stores huge volumes of data.
• RDBMS: low-latency responses. Hadoop: some latency in response.
• RDBMS: the data schema is static. Hadoop: the data schema is dynamic.
• RDBMS: high data integrity. Hadoop: lower data integrity than RDBMS.
• RDBMS: cost applies for licensed software. Hadoop: free of cost, as it is open-source software.
Distributed Computing Challenges
Here is a list of the main challenges in distributed computing:

• Heterogeneity
• Scalability
• Openness
• Transparency
• Concurrency
• Security
• Failure Handling

Hadoop Distributed File System (HDFS)
• Hadoop comes with a distributed file system called HDFS.
• In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability for parallel applications.
• It is cost-effective because it uses commodity hardware. It involves the concepts of blocks, DataNodes and the NameNode.
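To make this concrete, here is a minimal sketch (not from the original slides) that uses the HDFS Java API to write a file and read it back. The host name, port, and path are placeholder values; on a real cluster the configuration would normally come from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; on a real cluster this comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // illustrative path

        // Write: the NameNode records the metadata, DataNodes store the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the same file back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}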

Hadoop Framework
• NameNode: stores and manages all metadata about the data present on the cluster, so it is the single point of contact for Hadoop.
• JobTracker: runs on the master node (typically alongside the NameNode) and coordinates the MapReduce jobs submitted to the cluster.
• Secondary NameNode: maintains a backup of the metadata present on the NameNode, including the file system change history.
• DataNode: contains the actual data.
• TaskTracker: performs tasks on the local data, as assigned by the JobTracker.

MapReduce Architecture (diagram)

MapReduce Example (diagram)
Moving data in & out of Hadoop
Understanding inputs and outputs of
MapReduce
• In MapReduce programming, the dataset is split into independent chunks.
• Map tasks process these independent chunks in a completely parallel manner.
• The output produced by the map tasks serves as intermediate data and is stored on the local disk of that server.

• The MapReduce framework sorts the map output based on keys.
• This sorted output becomes the input to the reduce tasks.
• A reduce task produces the reduced output by combining the output of the various mappers.
• Job inputs and outputs are stored in a file system.
• The MapReduce framework operates on <key, value>
pairs, that is, the framework views the input to the
job as a set of <key, value> pairs and produces a set
of <key, value> pairs as the output of the job.

• The key and value classes must be serializable by the framework and hence need to implement the Writable interface.
• Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
• Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
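As a small illustration of the two points above, the sketch below shows what a custom key class might look like. The class and its fields (a year/temperature pair) are hypothetical and chosen only for the example.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key used only to illustrate the serialization
// (write/readFields) and sorting (compareTo) hooks required by the framework.
public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    public YearTempKey() { }                      // no-arg constructor required by Hadoop

    public YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // serialization
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // deserialization
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {                   // used by the sort phase
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}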

        Input              Output
Map     <k1, v1>           list(<k2, v2>)
Reduce  <k2, list(v2)>     list(<k3, v3>)
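The table above summarizes the signatures. As a minimal sketch, the map and reduce classes below implement the word-count example that appears a little further on, using the standard org.apache.hadoop.mapreduce API; the class names are chosen for illustration, and in practice each class would live in its own file.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: <k1 = byte offset, v1 = line of text>  ->  list(<k2 = word, v2 = 1>)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());   // normalize case
            context.write(word, ONE);                  // emit <word, 1>
        }
    }
}

// Reduce: <k2 = word, list(v2) = counts>  ->  <k3 = word, v3 = total>
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));      // emit <word, total>
    }
}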

Execution Pipeline (diagram)
Example of MapReduce
Apply the MapReduce word-count algorithm to the following input.

• Hello I am Google Assist


• How can I help you
• How can I assist you
• Are you an engineer
• Are you looking for coding
• Are you looking for interview questions
• what are you doing these days
• what are your strengths

• OUTPUT (word counts, case-insensitive, in key order)
• am,1  an,1  are,5  assist,2
• can,2  coding,1
• days,1  doing,1
• engineer,1
• for,2
• google,1
• hello,1  help,1  how,2
• i,3  interview,1
• looking,2
• questions,1
• strengths,1
• these,1
• what,2
• you,6  your,1
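For completeness, a sketch of the driver that would wire the mapper and reducer shown earlier into a job and submit it. The input and output paths are supplied as command-line arguments and are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}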

Hadoop in the cloud (diagram)
The Hadoop Ecosystem
• Hadoop Common: contains libraries and other modules
• HDFS: a distributed file system that stores data on commodity machines
• Hadoop YARN: Yet Another Resource Negotiator
• Hadoop MapReduce: a programming model for large-scale data processing
The Hadoop Ecosystem
• When architects and developers discuss
software, they typically immediately qualify a
software tool for its specific usage. For
example, they may say that Apache Tomcat is
a web server and that MySQL is a database.
• When it comes to Hadoop, however, things
become a little bit more complicated. Hadoop
encompasses a multiplicity of tools that are
designed and implemented to work together.

The Hadoop Ecosystem
• For some people, Hadoop is a data
management system bringing together
massive amounts of structured and
unstructured data.
• For others, it is a massively parallel execution
framework bringing the power of
supercomputing to the masses.
• Some view Hadoop as an open source
community creating tools and software for
solving Big Data problems.
The Hadoop Ecosystem
• Because Hadoop provides such a wide array of
capabilities that can be adapted to solve many
problems, many consider it to be a basic
framework.

The Hadoop Ecosystem

• HDFS — A foundational component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is the mechanism by which a large amount of data can be distributed over a cluster of computers, and data is written once but read many times for analytics. It provides the foundation for other tools, such as HBase.

The Hadoop Ecosystem

• MapReduce — Hadoop’s main execution framework is MapReduce, a programming model for distributed, parallel data processing that breaks jobs into map phases and reduce phases (thus the name). Developers write MapReduce jobs for Hadoop, using data stored in HDFS for fast data access. Because of the way MapReduce works, Hadoop brings the processing to the data in a parallel fashion, resulting in fast processing.

The Hadoop Ecosystem

• HBase — A column-oriented NoSQL database built on top of HDFS, HBase is used for fast read/write access to large amounts of data. HBase uses Zookeeper for its management to ensure that all of its components are up and running.
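As a rough sketch of how an application might talk to HBase through its standard Java client; the table name, column family, and values here are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info", qualifier "city".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Rajkot"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}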

The Hadoop Ecosystem

• Zookeeper — Zookeeper is Hadoop’s distributed coordination service. Designed to run over a cluster of machines, it is a highly available service used for the management of Hadoop operations, and many components of Hadoop depend on it.
• Oozie — A scalable workflow system, Oozie is
integrated into the Hadoop stack, and is used to
coordinate execution of multiple MapReduce jobs. It
is capable of managing a significant amount of
complexity, basing execution on external events that
include timing and presence of required data.
The Hadoop Ecosystem

• Pig — An abstraction over the complexity of MapReduce programming, the Pig platform includes an execution environment and a scripting language (Pig Latin) used to analyze Hadoop data sets. Its compiler translates Pig Latin into sequences of MapReduce programs.
• Hive — An SQL-like, high-level language used to run queries on data stored in Hadoop, Hive enables developers not familiar with MapReduce to write data queries that are translated into MapReduce jobs in Hadoop. Like Pig, Hive was developed as an abstraction layer, but it is geared toward database analysts who are more familiar with SQL than with Java programming.
The Hadoop Ecosystem
• Sqoop is a connectivity tool for moving data between relational databases or data warehouses and Hadoop. Sqoop uses the database to describe the schema of the imported/exported data, and uses MapReduce for parallelization and fault tolerance.
• Flume is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of data from individual machines into HDFS. It is based on a simple and flexible architecture and provides streaming data flows. It uses a simple, extensible data model that allows you to move data from multiple machines within an enterprise into Hadoop.
The Hadoop Ecosystem

• Whirr — This is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.
• Mahout — This is a machine-learning and data-mining library that provides MapReduce implementations for popular algorithms used for clustering, regression, and statistical modeling.
The Hadoop Ecosystem

• BigTop — This is a formal process and framework for packaging and interoperability testing of Hadoop’s sub-projects and related components.
• Ambari — This is a project aimed at
simplifying Hadoop management by providing
support for provisioning, managing, and
monitoring Hadoop clusters.

Business value of Hadoop
Scalability:
• Hadoop can scale from a single server to thousands of machines, offering
local storage and computation.
• It can handle vast amounts of data, both structured and unstructured,
making it ideal for growing businesses.
Cost-Effective:
• Uses commodity hardware, reducing the need for expensive, specialized
equipment.
• Open-source nature eliminates software licensing costs.
Flexibility:
• Can store and process various types of data, including text, images, and
videos.
• Supports different data processing models, such as batch, interactive, and
real-time processing.

Business value of Hadoop
Fault Tolerance:
• Built-in replication and redundancy ensure data reliability and availability.
• Automatic data recovery in case of hardware failure.
Speed and Performance:
• Distributes data processing across multiple nodes, enhancing processing
speed.
• Can process large volumes of data quickly, leading to faster insights and
decision-making.
Integration with Existing Systems:
• Compatible with various data sources and tools, facilitating easy
integration into existing IT infrastructure.
• Supports multiple programming languages and big data technologies like
Apache Spark, Hive, and HBase.

Business value of Hadoop
Enhanced Data Analytics:
• Enables complex data analysis and machine learning at scale.
• Supports advanced analytics, providing deeper insights and improved
business intelligence.
Competitive Advantage:
• Businesses can derive actionable insights from big data, leading to better
decision-making and strategic planning.
• Ability to analyze customer behavior, market trends, and operational
efficiency can offer a significant competitive edge.
Security and Compliance:
• Provides robust security features such as authentication, authorization,
and encryption.
• Helps in meeting regulatory compliance by managing and securing
sensitive data effectively.

Hadoop YARN
(Yet Another Resource Negotiator)
• YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be maximized.
• The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing.

Hadoop YARN
(Yet Another Resource Negotiator)
• The Resource Manager is used to manage all the resources that are made available for running a Hadoop cluster.

Features of YARN
• Multi-Tenancy
• Scalability
• Cluster-Utilization
• Compatibility

Hadoop YARN
• YARN stands for “Yet Another Resource Negotiator”.
• It was introduced in Hadoop 2.0 to remove the bottleneck on the JobTracker which was present in Hadoop 1.0.
• YARN was described as a “Redesigned Resource Manager” at the time of its launch, but it has since evolved into a large-scale distributed operating system for big data processing.
• The YARN architecture basically separates the resource management layer from the processing layer.

Hadoop YARN
• YARN also allows different data processing engines like graph
processing, interactive processing, stream processing as well
as batch processing to run and process data stored in HDFS
(Hadoop Distributed File System) thus making the system
much more efficient.

• Through its various components, it can dynamically allocate various resources and schedule the application processing.
• For large-volume data processing, it is quite necessary to manage the available resources properly so that every application can leverage them.

Hadoop YARN Architecture
Resource Manager
• Resource Manager is the master daemon of YARN. It is
responsible for managing several other applications, along with
the global assignments of resources such as CPU and memory. It
is used for job scheduling. Resource Manager has two
components:
1. Scheduler: the Scheduler's task is to distribute resources to the running applications. It deals only with the scheduling of tasks and hence performs no tracking and no monitoring of applications.
2. Application Manager: The application Manager manages
applications running in the cluster. Tasks, such as the starting of
Application Master or monitoring, are done by the Application
Manager.
Hadoop YARN Architecture
Node Manager
• Node Manager is the slave daemon of YARN. It has the
following responsibilities:
• Node Manager has to monitor the container’s resource usage,
along with reporting it to the Resource Manager.
• The health of the node on which YARN is running is tracked by
the Node Manager.
• It takes care of each node in the cluster while managing the
workflow, along with user jobs on a particular node.
• It keeps the data in the Resource Manager updated
• Node Manager can also destroy or kill the container if it gets
an order from the Resource Manager to do so.

Hadoop YARN Architecture
Application Master
• Every job submitted to the framework is an application, and
every application has a specific Application Master associated
with it. Application Master performs the following tasks:
• It coordinates the execution of the application in the cluster,
along with managing the faults.
• It negotiates resources from the Resource Manager.
• It works with the Node Manager for executing and monitoring
other components’ tasks.
• At regular intervals, heartbeats are sent to the Resource
Manager for checking its health, along with updating records
according to its resource demands.

Hadoop YARN Architecture
Container
• A container is a set of physical resources (CPU cores, RAM,
disks, etc.) on a single node. The tasks of a container are listed
below:
• It grants the right to an application to use a specific amount of
resources (memory, CPU, etc.) on a specific host.
• YARN containers are managed by a ContainerLaunchContext (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.

Hadoop YARN Architecture
• Application workflow in Hadoop YARN:
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. The application code is executed in the container.
7. The client contacts the Resource Manager/Application Master to monitor the application’s status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.

Hadoop Limitations
1. Problem with Small Files
• Hadoop performs efficiently over a small number of large files. Hadoop stores files in the form of blocks, which are 128 MB in size by default (often configured up to 256 MB). Hadoop struggles when it has to access a very large number of small files, because so many small files overload the NameNode and make it difficult to work efficiently.
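As a rough back-of-the-envelope illustration (assuming the commonly cited figure of roughly 150 bytes of NameNode memory per file or block object): 1 GB stored as a single file of 128 MB blocks needs about 1 file + 8 blocks = 9 objects, a little over 1 KB of metadata, whereas the same 1 GB stored as 10,000 files of 100 KB needs about 10,000 files + 10,000 blocks = 20,000 objects, roughly 3 MB of NameNode memory, even though the amount of data is identical.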
2. Vulnerability
• Hadoop is a framework written in Java, and Java is one of the most commonly used programming languages, which makes Hadoop easier for cyber-criminals to exploit.

Hadoop Limitations
3. Low Performance in Small-Data Environments
• Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by organizations that generate a massive volume of data. Its efficiency decreases when it is used in small-data environments.
4. Lack of Security
• Data is everything for an organization, yet by default the security features in Hadoop are disabled. Whoever operates the cluster needs to be aware of this and take appropriate action to secure it.

Hadoop Limitations
5. Processing Overhead
• Read/write operations in Hadoop are expensive, since we are dealing with data in the TB or PB range. In Hadoop, data is read from and written to disk, which makes it difficult to perform in-memory computation and leads to processing overhead.
6. Supports Only Batch Processing
• A batch process is simply a process that runs in the background and does not have any interaction with the user. The engines used for these processes inside the Hadoop core are not very efficient, and producing output with low latency is not possible with them.

End of Unit - 2
