
UNIT-2

Hadoop:
Introduction to Hadoop:
Apache Hadoop software is an open source framework that allows for the distributed storage and
processing of large datasets across clusters of computers using simple programming models. Its
framework is based on Java programming with some native code in C and shell scripts.
Hadoop is designed to scale up from a single computer to thousands of clustered machines, with
each machine offering local computation and storage. In this way, Hadoop can efficiently store
and process large datasets ranging in size from gigabytes to petabytes. It is designed to handle
big data and is based on the MapReduce programming model, which allows for the parallel
processing of large datasets.
There are mainly two problems with big data: the first is storing such a huge amount of data,
and the second is processing that stored data. A traditional approach such as an RDBMS is not
sufficient because of the heterogeneity of the data, so Hadoop emerged as the solution to both
problems, i.e. storing and processing big data, with some extra capabilities.
Difference between RDBMS and Apache Hadoop:
1. RDBMS: a traditional row-column based database, used for data storage, manipulation and retrieval. Hadoop: an open-source framework used for storing data and running applications or processes concurrently.
2. RDBMS: mostly processes structured data. Hadoop: processes both structured and unstructured data.
3. RDBMS: best suited for an OLTP environment. Hadoop: best suited for big data.
4. RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
5. RDBMS: data normalization is required. Hadoop: data normalization is not required.
6. RDBMS: stores transformed and aggregated data. Hadoop: stores huge volumes of raw data.
7. RDBMS: responds with very low latency. Hadoop: has some latency in response.
8. RDBMS: the data schema is static. Hadoop: the data schema is dynamic.
9. RDBMS: provides high data integrity. Hadoop: provides lower data integrity than an RDBMS.
10. RDBMS: cost applies for licensed software. Hadoop: free of cost, as it is open-source software.
History of Apache Hadoop:
Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they began working on
the Apache Nutch project. Apache Nutch was an effort to build a search engine system that
could index one billion pages. After a lot of research, they concluded that such a system would
cost around half a million dollars in hardware, along with a monthly running cost of
approximately $30,000, which was very expensive. They realized that their project architecture
would not be capable of handling billions of pages on the web, so they looked for a feasible
solution that could reduce the implementation cost as well as solve the problem of storing and
processing large datasets.
In 2003, they came across a paper published by Google that described the architecture of
Google's distributed file system, GFS (Google File System), for storing large data sets. They
realized that this paper could solve their problem of storing the very large files generated by the
web crawling and indexing processes. But this paper was only half the solution. In 2004, Google
published another paper, on the MapReduce technique, which was the solution for processing
those large datasets; for Doug Cutting and Mike Cafarella, it was the other half of the solution
for the Nutch project. Both techniques (GFS and MapReduce) existed only as papers from
Google; Google did not release its implementations. Doug Cutting knew from his work on
Apache Lucene (a free and open-source information retrieval software library, originally written
in Java by Doug Cutting in 1999) that open source is a great way to spread a technology to more
people. So, together with Mike Cafarella, he started implementing Google's techniques (GFS and
MapReduce) as open source in the Apache Nutch project.
In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon
recognized two problems: (a) Nutch would not achieve its potential until it ran reliably on larger
clusters, and (b) that looked impossible with just two people (Doug Cutting and Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so he started
looking for a company interested in investing in their efforts, and he found Yahoo!, which had a
large team of engineers eager to work on the project. In 2006, Doug Cutting joined Yahoo along
with the Nutch project. He wanted to provide the world with an open-source, reliable, scalable
computing framework, with the help of Yahoo. At Yahoo he first separated the distributed
computing parts from Nutch and formed a new project, Hadoop. (He gave it the name Hadoop
after a yellow toy elephant owned by his son; it was easy to pronounce and was a unique word.)
He then wanted to make Hadoop work well on thousands of nodes, so he continued to build
Hadoop around GFS and MapReduce.
In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it. In
January 2008, Yahoo released Hadoop as an open-source project to the ASF (Apache Software
Foundation), and in July 2008 the Apache Software Foundation successfully tested a 4000-node
cluster with Hadoop. In 2009, Hadoop was successfully used to sort a petabyte (PB) of data in
less than 17 hours, handling billions of searches and indexing millions of web pages, and Doug
Cutting left Yahoo and joined Cloudera to take on the challenge of spreading Hadoop to other
industries. In December 2011, the Apache Software Foundation released Apache Hadoop
version 1.0. Version 2.0.6 became available in August 2013, and the current major release,
Apache Hadoop version 3.0, came out in December 2017.
Benefits of Hadoop
Scalability
Hadoop is important as one of the primary tools to store and process huge amounts of data
quickly. It does this by using a distributed computing model which enables the fast processing
of data that can be rapidly scaled by adding computing nodes.
Low cost
As an open source framework that can run on commodity hardware and has a large ecosystem
of tools, Hadoop is a low-cost option for the storage and management of big data.
Flexibility
Hadoop allows for flexibility in data storage, since data does not require preprocessing before
being stored. This means an organization can store as much data as it likes and decide how to
use it later.
Resilience
As a distributed computing model, Hadoop provides fault tolerance and system resilience: if one
of the hardware nodes fails, jobs are redirected to other nodes. Data stored on a Hadoop cluster
is replicated across other nodes within the system to guard against the possibility of hardware or
software failure.

Challenges of Hadoop
MapReduce complexity and limitations
As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs, such as
interactive analytical tasks. MapReduce functions also need to be written in Java and can require
a steep learning curve. The MapReduce ecosystem is quite large, with many components for
different functions that can make it difficult to determine what tools to use.
Security
Data sensitivity and protection can be issues as Hadoop handles such large datasets. An
ecosystem of tools for authentication, encryption, auditing, and provisioning has emerged to help
developers secure data in Hadoop.
Governance and management
Hadoop does not have many robust tools for data management and governance, nor for data
quality and standardization.
Talent gap
Like many areas of programming, Hadoop has an acknowledged talent gap. Finding developers
with the combined requisite skills in Java to program MapReduce, operating systems, and
hardware can be difficult. In addition, MapReduce has a steep learning curve, making it hard to
get new programmers up to speed on its best practices and ecosystem.

MASTER AND SLAVE NODES


Master and slave nodes form the HDFS cluster. The name node is called the master, and the data
nodes are called the slaves.

The name node is responsible for the workings of the data nodes. It also stores the metadata.
The data nodes read, write, process, and replicate the data. They also send signals, known as
heartbeats, to the name node. These heartbeats show the status of the data node.

Consider that 30 TB of data is loaded into the cluster. The name node distributes it across the
data nodes, and the data is replicated among those data nodes, so that each block ends up on
several machines.
Replication of the data is performed three times by default. It is done this way so that if a
commodity machine fails, you can replace it with a new machine that holds the same data.
HADOOP DISTRIBUTED FILE SYSTEM:
HDFS stands for Hadoop Distributed File System. It is the primary data storage system in
Hadoop applications: a distributed file system whose storage is spread across the machines of
the cluster. In HDFS, data is written once to the cluster and is then read many times according
to need. The goals of HDFS are as follows.
 The ability to recover from hardware failures in a timely manner
 Access to Streaming Data
 Accommodation of Large data sets
 Portability
The Hadoop Distributed File System includes two types of nodes: the Name Node and the Data
Node.
Name Node:
The Name Node is the primary component of HDFS. It maintains the file system namespace.
Actual data is not stored on the Name Node; it stores only metadata, such as file-to-block
mappings, block locations, permissions, and so on.
Data Node:
Data Node follows the instructions given by the Name Node. Data Nodes are also known as
‘slave Nodes’. These nodes store the actual data provided by the client and simply follow the
commands of the Name Node.
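As an illustration of this write-once, read-many model, the following sketch uses the Hadoop Java FileSystem API to write a small file into HDFS, request three replicas of it, and read it back. The file path and replication factor are example values only, and the snippet assumes the cluster address is provided by core-site.xml/hdfs-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // the Name Node address comes from fs.defaultFS.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");   // illustrative path

        // Write once: the Name Node records the metadata, the Data Nodes store the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask for three replicas of the file's blocks (the HDFS default).
        fs.setReplication(file, (short) 3);

        // Read many times.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}

Here the Name Node keeps only the metadata for the file, while the Data Nodes hold the actual blocks and their replicas.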
Job Tracker:
The primary function of the Job Tracker is resource management. The Job Tracker determines
the location of the data by communicating with the Name Node. It also finds the best available
Task Tracker and tracks the execution of MapReduce jobs from the client to the slave nodes. In
Hadoop, there is only one instance of the Job Tracker. The Job Tracker monitors the individual
Task Trackers, tracks their status, and oversees the execution of MapReduce in Hadoop.
Task Tracker:
The Task Tracker is the slave daemon in the cluster, which accepts instructions from the Job
Tracker. It runs in its own process and monitors its tasks, capturing their output and exit status.
The Task Tracker carries out the map, shuffle and reduce data operations, and it provides a set
of slots in which the different tasks run. It continuously updates the Job Tracker with its status
and reports the number of slots available in the cluster. If a Task Tracker becomes unresponsive,
the Job Tracker assigns its work to other nodes.
Types of Hadoop File Formats
Hive and Impala tables in HDFS can be created using four different Hadoop file formats:
 Text files
 Sequence File
 Avro data files
 Parquet file format
1. Text files
A text file is the most basic and a human-readable file. It can be read or written in any
programming language and is mostly delimited by comma or tab.
The text file format consumes more space when a numeric value needs to be stored as a string. It
is also difficult to represent binary data such as an image.
2. Sequence File
The SequenceFile format can be used to store an image in binary form. Sequence files store
key-value pairs in a binary container format and are more efficient than a text file. However,
sequence files are not human-readable.
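As a rough sketch of the idea (the file path, key class, and value class below are assumptions made for illustration), a sequence file can be written with Hadoop's Java API as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/images.seq");   // illustrative path

        // Key = file name, value = raw bytes (e.g. an image) inside a binary container.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            byte[] imageBytes = {1, 2, 3};               // stand-in for real binary data
            writer.append(new Text("photo-001.jpg"), new BytesWritable(imageBytes));
        }
    }
}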
3. Avro Data Files
The Avro file format has efficient storage due to optimized binary encoding. It is widely
supported both inside and outside the Hadoop ecosystem.
The Avro file format is ideal for long-term storage of important data. It can be read from and
written to in many languages, such as Java, Scala and so on. Schema metadata can be embedded
in the file to ensure that it will always be readable, and schema evolution can accommodate
changes over time. The Avro file format is considered the best choice for general-purpose
storage in Hadoop.
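The following is a minimal, illustrative sketch of writing an Avro data file with the Java API; the record schema and field names are invented for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // The schema is embedded in the file, which is what keeps it readable over time.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);

        // Records are written with Avro's optimized binary encoding.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}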
4. Parquet File Format
Parquet is a columnar format developed by Cloudera and Twitter. It is supported in Spark,
MapReduce, Hive, Pig, Impala, Crunch, and so on. Like Avro, schema metadata is embedded in
the file.
The Parquet file format uses advanced optimizations described in Google's Dremel paper. These
optimizations reduce storage space and increase performance, and some of them rely on
identifying repeated patterns in the data. Parquet works best when records are written in large
batches and when queries read only a subset of the columns.
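As a hedged sketch of one common way to write Parquet files from Java, the example below uses the parquet-avro module (an assumption about the build setup; exact builder methods vary somewhat across Parquet versions) and reuses an Avro schema, so the schema metadata is embedded in the file just as with Avro.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);

        // Data is laid out column by column on disk; the schema travels with the file.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("users.parquet"))
                                  .withSchema(schema)
                                  .build()) {
            writer.write(user);
        }
    }
}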
HADOOP ECOSYSTEM
Overview: Apache Hadoop is an open-source framework intended to make interaction with
big data easier. However, for those who are not acquainted with this technology, one question
arises: what is big data? Big data is a term given to data sets that cannot be processed
efficiently with traditional methodology such as an RDBMS. Hadoop has made its place in the
industries and companies that need to work on large data sets which are sensitive and need
efficient handling. Hadoop is a framework that enables the processing of large data sets which
reside in the form of clusters. Being a framework, Hadoop is made up of several modules that
are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or suite which provides various services to
solve big data problems. It includes Apache projects and various commercial tools and
solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and the
Hadoop Common utilities. Most of the other tools or solutions are used to supplement or support
these major elements. All these tools work collectively to provide services such as the ingestion,
analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System


 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing the cluster
 Oozie: Job Scheduling

HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes, thereby maintaining
the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
 The Name Node is the prime node; it contains the metadata (data about data) and requires
comparatively fewer resources than the data nodes, which store the actual data. These data
nodes are commodity hardware in the distributed environment, which undoubtedly makes
Hadoop cost-effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working at
the heart of the system.
YARN:
Yet Another Resource Negotiator: as the name implies, YARN is the component that helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas the Node Managers work on the allocation of resources such as CPU,
memory and bandwidth per machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and the Node
Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry
over the processing logic and helps to write applications which transform big data sets into a
manageable one.
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are as follows (a
word-count sketch is given after this list):
1. Map() performs sorting and filtering of the data and thereby organizes it into groups.
Map() generates key-value-pair based results which are later processed by the
Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped
data. In simple terms, Reduce() takes the output generated by Map() as input and
combines those tuples into a smaller set of tuples.
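The following word-count sketch shows the division of labour between Map() and Reduce() using the Hadoop Java MapReduce API. The class names are illustrative, and a separate Job driver (not shown) would set the input and output paths and wire the two classes together.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map(): emits a (word, 1) key-value pair for every word in its input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): receives all values emitted for one key and aggregates them into a single count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}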
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig Runtime,
just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major segment
of the Hadoop ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of
large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time (interactive) and batch processing. All the
SQL data types are supported by Hive, which makes query processing easier.
 Similar to other query-processing frameworks, HIVE comes with two
components: the JDBC drivers and the HIVE command line.
 JDBC, along with the ODBC drivers, works on establishing data storage permissions and
connections, whereas the HIVE command line helps in the processing of queries (a JDBC
example is given after this list).
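A minimal sketch of querying Hive through its JDBC driver from Java is shown below. The connection URL, database, table and credentials are assumptions for illustration; HiveServer2 conventionally listens on port 10000, and the Hive JDBC driver jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (the driver jar must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 endpoint; host, port, database and credentials are illustrative.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; behind the scenes it is compiled into jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}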
Mahout:
Mahout brings machine learning capability to a system or application. Machine learning, as the
name suggests, helps a system to improve itself based on patterns, user/environment interaction,
or algorithms. Mahout provides various libraries and functionalities such as collaborative
filtering, clustering and classification, which are core concepts of machine learning. It allows
algorithms to be invoked as per our need with the help of its own libraries.
Apache Spark:
It is a platform that handles all the compute-intensive tasks, such as batch processing,
interactive or iterative real-time processing, graph processing, and visualization.
 It processes data in memory and hence is faster than MapReduce in terms of optimization.
 Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for
structured data or batch processing; hence both are used together in most companies (a short
word-count example using Spark's Java API is given after this list).
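For comparison with MapReduce, here is a minimal in-memory word count using Spark's Java API. Spark 2.x or later is assumed, and the local master URL and input path are illustrative values only.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark inside this JVM; on a cluster this would point at YARN instead.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/sample.txt"); // illustrative path
            lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split into words
                 .mapToPair(word -> new Tuple2<>(word, 1))                       // (word, 1) pairs
                 .reduceByKey((a, b) -> a + b)                                   // aggregate in memory
                 .collect()
                 .forEach(pair -> System.out.println(pair._1() + " -> " + pair._2()));
        }
    }
}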
Apache HBase:

 It is a NoSQL database which supports all kinds of data and is thus capable of handling
almost anything in a Hadoop database. It provides the capabilities of Google's BigTable and
is therefore able to work on big data sets effectively.
 At times when we need to search for or retrieve a small number of records in a huge
database, the request must be processed within a very short span of time. At such times,
HBase comes in handy, as it gives us a fault-tolerant way of storing sparse data and looking
it up quickly (a small Java client example is given after this list).
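A small, illustrative sketch of this kind of fast point lookup with the HBase Java client API is shown below. It assumes a table named users with a column family info already exists, and that hbase-site.xml on the classpath points at the cluster.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // hbase-site.xml on the classpath tells the client where ZooKeeper and the cluster are.
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Point lookup by row key: the fast, small-read case described above.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}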
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:
 Solr, Lucene: These are two services that perform the task of searching and indexing with
the help of Java libraries. Lucene is a Java library that also provides a spell-check
mechanism, and Solr is built on top of Lucene.
 Zookeeper: There was a huge issue with the management, coordination and synchronization
of the resources or components of Hadoop, which often resulted in inconsistency.
Zookeeper overcame all these problems by performing synchronization, inter-component
communication, grouping, and maintenance.
 Oozie: Oozie simply performs the task of a scheduler: it schedules jobs and binds them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie
coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially
ordered manner, whereas Oozie coordinator jobs are those that are triggered when some
data or an external stimulus is given to them.
