
HADOOP [Unit 2]

Hadoop Distributed File System:

● In Hadoop, data resides in a distributed file system called the Hadoop Distributed
File System (HDFS).

● HDFS splits files into blocks and distributes them across the various nodes of large
clusters.

● The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS)
and provides a distributed file system that is designed to run on commodity hardware.

● Commodity hardware is cheap and widely available, which makes it useful for achieving
greater computational power at a low cost.

● It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

● It provides high-throughput access to application data and is suitable for applications
with large datasets.

● The Hadoop framework includes the following two modules:

▪ Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules.

▪ Hadoop YARN: This is a framework for job scheduling and cluster resource
management (managing the resources of the cluster).

Hadoop Ecosystem And Components:


There are three main components of Hadoop:

Hadoop HDFS  -

● Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.

● HDFS consists of two core components, i.e.:

▪ Name Node: The Name Node is the prime (master) node. It stores only metadata and
therefore requires comparatively fewer resources than the Data Nodes, which store the
actual data.
▪ Data Node: Data Nodes are commodity hardware in the distributed environment that
store the actual data, which undoubtedly makes Hadoop cost-effective.

● HDFS splits files into blocks and distributes them across the various nodes of large
clusters.

● It is highly fault-tolerant and is designed to be deployed on low-cost hardware.

● It provides high-throughput access to application data and is suitable for applications
with large datasets (a minimal sketch of the HDFS Java API follows this section).
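Below is a minimal sketch of how an application can write a file to HDFS and read it back through the HDFS Java API. The NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are illustrative assumptions, not taken from these notes; on a real cluster the address normally comes from core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; in practice this is taken from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/hello.txt");

            // Write a small file; large files written this way are transparently
            // split into blocks and replicated across Data Nodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back through the same FileSystem abstraction.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }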

Hadoop MapReduce  -

● Hadoop MapReduce is the processing unit of Hadoop.

● MapReduce is a computational model and software framework for writing applications
that are run on Hadoop.

● These MapReduce programs are capable of processing enormous amounts of data in parallel
on large clusters of computation nodes.

● MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are as
follows (a minimal word-count sketch follows this list):

▪ Map() performs sorting and filtering of the data, thereby organizing it into
groups. Map() generates key-value pairs as its result, which are later
processed by the Reduce() method.
▪ Reduce(), as the name suggests, performs summarization by aggregating the
mapped data. In simple terms, Reduce() takes the output generated by Map() as
input and combines those tuples into a smaller set of tuples.
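The following word-count sketch, written against Hadoop's Java MapReduce API, makes the Map()/Reduce() division concrete. It assumes the input and output HDFS paths are passed as command-line arguments; the class names are illustrative and not taken from these notes.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map(): emits a (word, 1) key-value pair for every word in its input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce(): aggregates all the counts for a word into a single total.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }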

Hadoop YARN –

● Hadoop YARN is the resource management unit of Hadoop.

● This is a framework for job scheduling and cluster resource management.


● YARN helps to open up Hadoop by allowing data stored in HDFS to be processed through
batch processing, stream processing, interactive processing and graph processing.

● It helps to run different types of distributed applications other than MapReduce.

● YARN consists of three major components, i.e.:

▪ Resource Manager: It has the privilege of allocating resources to the
applications in the system.
▪ Node Manager: It manages the allocation of resources such as CPU, memory and
bandwidth on each machine and later reports back to the Resource Manager.
▪ Application Manager: It works as an interface between the Resource Manager and
the Node Managers and negotiates resources as per the requirements of the two.

Other components:

PIG: 

● Pig was developed by Yahoo. It works on the Pig Latin language, a query-based
language similar to SQL.
● It is a platform for structuring the data flow, processing and analyzing huge data sets.
● Pig does the work of executing the commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.

● The Pig Latin language is specially designed for this framework and runs on Pig Runtime,
just the way Java runs on the Java Virtual Machine (JVM).
● Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.

Mahout:

● Mahout adds machine-learning capability to a system or application.

● It provides various libraries and functionalities such as collaborative filtering, clustering,
and classification, which are core concepts of machine learning. It allows us to invoke
algorithms as per our needs with the help of its own libraries.

HIVE: 

● With the help of an SQL-like methodology and interface, Hive performs reading and writing
of large data sets. Its query language is called HQL (Hive Query Language).
● It is highly scalable, as it allows both real-time and batch processing. Also, all
the SQL data types are supported by Hive, thus making query processing easier.
● Like other query-processing frameworks, Hive comes with two components: Java
Database Connectivity (JDBC) drivers and the Hive command line.
● The JDBC driver, along with the ODBC drivers (which use the Open Database Connectivity
interface by Microsoft), establishes the connection and data-storage permissions, whereas
the Hive command line helps in the processing of queries (a JDBC sketch follows this section).
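As a small illustration of the JDBC path mentioned above, the Java sketch below connects to a HiveServer2 instance and runs an HQL query. The host hiveserver, port 10000, the credentials and the employees table are all assumptions made for the example; a running HiveServer2 and the Hive JDBC driver on the classpath are required.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver class shipped with the Hive distribution.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 endpoint and database.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 // HQL query against an assumed 'employees' table.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }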

HBase: 

● HBase is a NoSQL database which supports all kinds of data and is thus capable of handling
anything in a Hadoop database. It provides capabilities similar to Google’s BigTable and is
therefore able to work on big data sets effectively.
● At times when we need to search for or retrieve the occurrences of something small in a
huge database, the request must be processed within a very short span of time. At such
times, HBase comes in handy, as it gives us a fault-tolerant way of storing such data and
retrieving it quickly (a small client-API sketch follows this section).
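The sketch below illustrates this kind of fast, targeted access using the HBase Java client API. The users table, the info column family and the row key are assumptions made for the example, and the table is presumed to already exist on a reachable cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath to locate the cluster.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user1", column family "info", qualifier "city".
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
                table.put(put);

                // Random read of the same row with low latency.
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }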

Zookeeper:

● There was a huge problem in managing the coordination and synchronization of the
resources and components of Hadoop, which often resulted in inconsistency.
● Zookeeper overcomes these problems by providing synchronization, inter-component
communication, grouping, and maintenance.

Oozie:

● Oozie simply performs the task of a scheduler: it schedules jobs and binds them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and
Oozie coordinator jobs.
● Oozie workflow jobs are jobs that need to be executed in a sequentially ordered manner,
whereas Oozie coordinator jobs are those that are triggered when some data or an
external stimulus is given to them.

Ambari (for management and monitoring):

● It eliminates the need for the manual tasks that were previously required to watch over Hadoop operations.
● It gives a simple and secure platform for provisioning, managing, and monitoring
Hortonworks Data Platform (HDP) deployments.

Sqoop (SQL-to-Hadoop):

● It is a big data tool that offers the capability to extract data from non-Hadoop data
stores, transform the data into a form usable by Hadoop, and then load the data into
HDFS.
● Open Database Connectivity (ODBC) is an open standard Application Programming
Interface (API) for accessing a database.
DATA FORMAT:

● A data/file format defines how information is stored in HDFS.


● Hadoop does not have a default file format and the choice of a format depends on its
use.
● Managing the processing and storage of large volumes of information is very complex,
which is why an appropriate data format is required.
● The choice of an appropriate file format can produce the following benefits:
▪ Optimum writing time
▪ Optimum reading time
▪ File divisibility
▪ Adaptive schema and compression support
● Some of the most commonly used formats of the Hadoop ecosystem are:
▪ Text/CSV: A plain text file or CSV is the most common format both outside and
within the Hadoop ecosystem.
▪ SequenceFile: The SequenceFile format stores data in binary form; it supports
compression but does not store metadata (see the sketch after this list).
▪ Avro: Avro is a row-based storage format. This format includes the definition of
the schema of your data in JSON format. Avro allows block compression along
with its divisibility, making it a good choice for most cases when using Hadoop.
▪ Parquet: Parquet is a column-based binary storage format that can store nested
data structures. This format is very efficient in terms of disk input/output
operations when the necessary columns to be used are specified.
▪ RCFile (Record Columnar File): RCFile is a columnar format that divides data into
groups of rows, and inside it, data is stored in columns.
▪ ORC (Optimized Row Columnar): ORC is considered an evolution of the RCFile
format and has all its benefits alongside some improvements such as better
compression, allowing faster queries.
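As a small illustration of one of these formats, the Java sketch below writes a few key-value records to a SequenceFile using Hadoop's own API. The output path is an assumption made for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("/user/demo/counts.seq"); // hypothetical HDFS path

            // Keys and values are stored in binary form; no schema metadata is kept.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                writer.append(new Text("hadoop"), new IntWritable(3));
                writer.append(new Text("hdfs"), new IntWritable(5));
            }
        }
    }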

Analysing Data with Hadoop :

● While the MapReduce programming model is at the heart of Hadoop, it is low-level and
as such becomes an unproductive way for developers to write complex analysis jobs.

● To increase developer productivity, several higher-level languages and APIs have been
created that abstract away the low-level details of the MapReduce programming model.

● There are several choices available for writing data analysis jobs.

● The Hive and Pig projects are popular choices that provide SQL-like and procedural data
flow-like languages, respectively.

● HBase is also a popular way to store and analyze data in HDFS. It is a column-oriented
database, and unlike MapReduce, it provides random read and write access to data with
low latency (that is, with little or no delay).
● MapReduce jobs can read and write data in HBase’s table format, but data processing is
often done via HBase’s own client API.
Scaling Up Vs Scaling Out :
● Once a decision has been made for data scaling, the specific scaling approach must be chosen.
● There are two commonly used types of data scaling :
▪ Up
▪ Out
● Scaling up, or vertical scaling :
▪ It involves obtaining a faster server with more powerful processors and more memory.
▪ This solution uses less network hardware, and consumes less power; but ultimately for
many platforms, it may only provide a short-term fix, especially if continued growth is
expected.
● Scaling out, or horizontal scaling :
▪ It involves adding servers for parallel computing.
▪ The scale-out technique is a long-term solution, as more and more servers may be
added when needed. But going from one monolithic system to this type of cluster may
be difficult, although it is an extremely effective solution.

Difference between scaling up and scaling out:

● Databases:
▪ Horizontal scaling (scaling out): In a database world, horizontal scaling is usually
based on the partitioning of data (each node contains only part of the data).
▪ Vertical scaling (scaling up): The data lives on a single node, and scaling is done
through multi-core, e.g. spreading the load between the CPU and RAM resources of the
machine.

● Downtime:
▪ Horizontal scaling (scaling out): In theory, adding more machines to the existing pool
means you are not limited to the capacity of a single unit, making it possible to scale
with less downtime.
▪ Vertical scaling (scaling up): Vertical scaling is limited to the capacity of one
machine; scaling beyond that capacity can involve downtime and has a hard upper limit,
i.e. the scale of the hardware on which you are currently running.

● Concurrency:
▪ Horizontal scaling (scaling out): Also described as distributed programming, as it
involves distributing jobs across machines over the network. Several patterns are
associated with this model: Master/Worker, Tuple Spaces, Blackboard, MapReduce.
▪ Vertical scaling (scaling up): Also described as the actor model; concurrent
programming on multi-core machines is often performed via multi-threading and
in-process message passing.

● Message passing:
▪ Horizontal scaling (scaling out): In distributed computing, the lack of a shared
address space makes data sharing more complex. It also makes the process of sharing,
passing or updating data more costly, since you have to pass copies of the data.
▪ Vertical scaling (scaling up): In a multi-threaded scenario, you can assume the
existence of a shared address space, so data sharing and message passing can be done by
passing a reference.

● Examples:
▪ Horizontal scaling (scaling out): Cassandra, MongoDB, Google Cloud Spanner
▪ Vertical scaling (scaling up): MySQL, Amazon RDS
