Unit 3-1
NOTES
HISTORY OF HADOOP
▪ Created by Doug Cutting.
▪ He was the creator of Apache Lucene (a widely used text search library).
▪ Apache Nutch (an open-source web search engine) was started in 2002 as a part
of the Lucene project.
▪ Hadoop has its origins in Apache Nutch; it was split out as a separate project in 2006.
▪ Hadoop is a made-up name.
▪ The name was given by Doug Cutting’s kid to a yellow stuffed elephant.
▪ Projects in the Hadoop ecosystem have names that are unrelated to their function, often
based on elephants or other animals.
▪ 2004:- Nutch developers set about writing an open-source implementation of Google’s
distributed filesystem, called the Nutch Distributed File System (NDFS).
▪ 2004:- Google published the paper that introduced MapReduce.
▪ January 2008:- Hadoop became its own top-level project at Apache.
▪ By that time Hadoop was being used by many other companies like:
✓ Yahoo!
✓ Facebook
✓ New York Times
▪ April 2008:- Hadoop set a world record as the fastest system to sort an entire
terabyte of data.
▪ Running on a 910-node cluster, Hadoop sorted 1 TB in 209 seconds (about 3.5 minutes).
▪ November 2008:- Google’s MapReduce implementation sorted 1 TB in 68 seconds.
▪ April 2009:- Yahoo!’s Hadoop cluster sorted 1 TB in 62 seconds.
▪ Hadoop is now widely used in mainstream enterprises.
▪ Hadoop is used as a general-purpose storage & analysis platform for big data.
▪ Hadoop is supported by established vendors like:
✓ EMC
✓ IBM
✓ Microsoft
✓ Oracle
▪ Specialist Hadoop companies are:
✓ Cloudera
✓ Hortonworks
FEATURES OF HADOOP
1) Storage & processing of Big Data:- Processes Big Data having the 3V (volume, velocity,
variety) characteristics.
2) Open-Source Framework:-
o Open-source access & cloud services enable large data stores.
o Hadoop uses a cluster of multiple inexpensive servers or the cloud.
3) Java & Linux based:-
o Hadoop uses Java interfaces.
o Hadoop’s base is Linux.
o Hadoop also supports its own set of shell commands.
4) Fault-efficient and scalable:-
o The system provides servers with high scalability.
o The system scales by adding new nodes to handle larger amounts of data.
5) Flexible modular design:-
o Simple & modular programming model.
o Hadoop is very helpful in:
✓ Storing
✓ Manipulating
✓ Processing
✓ Analysing Big Data
o Modular functions make the system flexible.
o One can add or replace components with ease.
6) Robust design of HDFS:-
o Execution of big data applications continues even if an individual server or cluster
node fails.
o This is because Hadoop provides backup & data recovery mechanisms.
o Hence HDFS has high reliability.
7) Hardware fault tolerant:-
o A hardware fault does not affect data & application processing.
o If a node goes down, other nodes take over its work.
o This is due to multiple copies of all data blocks, which are replicated automatically.
o By default, 3 copies of each data block are kept.
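▪ For illustration only, the hedged Java sketch below uses the HDFS FileSystem API to read and change the replication factor of a single file; the path and the new factor of 2 are made-up example values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: inspect and change the replication factor of one HDFS file.
public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // client handle to HDFS
        Path file = new Path("/data/sample.txt");    // hypothetical file

        // Current replication factor of this file (3 by default)
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);

        // Ask HDFS to keep only 2 copies of this particular file
        fs.setReplication(file, (short) 2);

        fs.close();
    }
}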
NAMENODE
▪ The name node is the commodity hardware that contains the GNU/Linux operating
system and the name node software.
▪ It is software that can be run on commodity hardware.
▪ The system having the name node acts as the master server and it does the following
tasks:
✓ Manages the file system namespace.
✓ Regulates clients’ access to files.
✓ Executes file system operations such as renaming, closing, and opening files
and directories.
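▪ The operations above are namespace operations: a client issues them through the FileSystem API and the name node updates its metadata. A minimal, hedged Java sketch (directory and file names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Each call below triggers a namespace operation handled by the name node.
public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/reports"));                                        // create a directory
        fs.rename(new Path("/reports/old.csv"), new Path("/reports/new.csv")); // rename a file

        // Listing a directory returns metadata that the name node maintains
        for (FileStatus status : fs.listStatus(new Path("/reports"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}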
DATANODE
▪ The data node is commodity hardware having the GNU/Linux operating system
and the data node software.
▪ For every node (Commodity hardware/System) in a cluster, there will be a data node.
▪ These nodes manage the data storage of their system.
▪ Datanodes perform read-write operations on the file system, as per client requests.
▪ They also perform operations such as block creation, deletion, and replication
according to the instructions of the Namenode.
BLOCK
▪ Generally, the user data is stored in the files of HDFS.
▪ A file in HDFS is divided into one or more segments, which are stored in
individual data nodes.
▪ These file segments are called blocks.
▪ In other words, the minimum amount of data that HDFS can read or write is called a
Block.
▪ The default block size is 64 MB, but it can be increased as needed by changing the
HDFS configuration.
▪ The data stored in these blocks enables running distributed applications, including
analytics, data mining, and OLAP, on the cluster.
▪ Hadoop HDFS features are as follows:
✓ Create, append, delete, rename and attribute-modification functions.
✓ The content of individual files cannot be modified or replaced, but new data
can be appended at the end of a file.
✓ It is suitable for distributed storage and processing.
✓ Hadoop provides a command interface to interact with HDFS.
✓ The built-in servers of the name node and data node help users easily check
the status of the cluster.
✓ Streaming access to file system data.
✓ HDFS provides file permissions and authentication.
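▪ To make the block size and append-only behaviour concrete, here is a hedged Java sketch using the FileSystem API; the file path, the 128 MB block size and the buffer size are illustrative choices, and append must be supported by the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create a file with an explicit block size, then append to it.
// Existing content cannot be modified in place; appending is allowed.
public class BlockAndAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/logs/events.log");        // hypothetical path

        long blockSize = 128L * 1024 * 1024;             // 128 MB instead of the default
        short replication = 3;
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
        out.writeBytes("first record\n");
        out.close();

        // Write-once semantics: existing bytes stay as they are, new data goes at the end
        FSDataOutputStream appendOut = fs.append(file);
        appendOut.writeBytes("appended record\n");
        appendOut.close();

        System.out.println("Block size in use: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}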
COMPONENTS OF HADOOP
▪ Apache Pig:
✓ Software for analysing large data sets, consisting of a high-level language
(similar to SQL) for expressing data analysis programs, coupled with
infrastructure for evaluating these programs.
✓ It contains a compiler that produces sequences of MapReduce programs.
▪ HBase:
✓ A non-relational columnar distributed database designed to run on top of
Hadoop Distributed File system (HDFS).
✓ It is written in Java and modelled after Google’s Bigtable.
✓ HBase is an example of a NoSQL data store.
▪ Hive:
✓ It is a data warehousing application that provides an SQL-like interface and a
relational model.
✓ The Hive infrastructure is built on top of Hadoop and helps in providing data
summarization, query, and analysis.
▪ Cascading:
✓ A software abstraction layer for Hadoop, intended to hide the underlying
complexity of MapReduce jobs.
✓ Cascading allows users to create and execute data processing workflows on
Hadoop clusters using any JVM-based language.
▪ Avro:
✓ A data serialization system and data exchange service.
✓ It is basically used in Apache Hadoop.
✓ These services can be used together as well as independently.
▪ Bigtop:
✓ It is used for packaging and testing the Hadoop ecosystem.
▪ Oozie:
✓ Oozie is a Java-based web application that runs in a Java servlet container.
✓ Oozie uses a database to store the definition of a Workflow, which is a collection
of actions.
✓ It manages Hadoop jobs.
ANALYSING THE DATA WITH HADOOP
▪ Hadoop data analysis technologies can be used to analyse huge data sets, such as
stock data, that are generated very frequently.
▪ The most important Hadoop data analysis technology is MapReduce.
MapReduce
▪ MapReduce is a framework in which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable
manner.
▪ MapReduce is a processing technique and a program model for distributed
computing based on java.
▪ The MapReduce algorithm contains two important tasks, namely Map and Reduce.
▪ Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
▪ Second, the Reduce task takes the output from a Map as input and
combines those data tuples into a smaller set of tuples.
▪ As the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
▪ The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes.
▪ Under the MapReduce model, the data processing primitives are called mappers
and reducers.
▪ Once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change.
▪ This simple scalability is what has attracted many programmers to use the
MapReduce model.
Algorithm
▪ A MapReduce program executes in three stages:
1. Map stage
2. Shuffle stage
3. Reduce stage
Map Stage
▪ The map or mapper’s job is to process the input data.
▪ Generally, the input data is in the form of a file or directory and is stored in the
Hadoop file system (HDFS).
▪ The input file is passed to the mapper function line by line.
▪ The mapper processes the data and creates several small chunks of data.
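▪ As a concrete, hedged example of the map stage, the classic word-count mapper below is written against the org.apache.hadoop.mapreduce API; the class name is illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: receives one line of the input split at a time
// and emits a (word, 1) pair for every token in that line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // a small chunk of intermediate data
        }
    }
}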
Reduce Stage
▪ This stage is the combination of the Shuffle stage and the Reduce stage.
▪ The Reducer’s job is to process the data that comes from the mapper.
▪ After processing, it produces a new set of output, which will be stored in the HDFS.
▪ During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
▪ Most of the computing takes place on nodes with data on local disks, which reduces
the network traffic.
▪ After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
▪ The MapReduce framework operates on <key, value> pairs: it views the input to the
job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
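▪ Continuing the same hedged word-count sketch, the reducer below sums the counts for each word, and the driver wires the mapper and reducer into a job; the input and output paths are taken from the command line.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word-count reducer: receives each word together with all of its 1s
// (grouped by the shuffle stage) and writes the total to HDFS.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    // Driver: submits the job, reusing the WordCountMapper from the earlier sketch.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountReducer.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}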
SCALING OUT
▪ To scale out, we need to store the data in a distributed file system (typically HDFS).
▪ This allows Hadoop to move the MapReduce computation to each machine hosting a
part of the data, using Hadoop’s resource management system, called YARN.
▪ A MapReduce job is a unit of work that the client wants to be performed.
▪ It consists of the input data, the MapReduce program, and configuration information.
▪ Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks
and reduce tasks.
▪ The tasks are scheduled using YARN and run on nodes in the cluster.
▪ If a task fails, it will be automatically rescheduled to run on a different node.
▪ Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits.
▪ Hadoop creates one map task for each split, which runs the user-defined map
function for each record in the split.
▪ The output of the reduce is normally stored in HDFS for reliability.
▪ For each HDFS block of the reduce output, the first replica is stored on the local
node, with other replicas being stored on off-rack nodes for reliability.
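▪ As a hedged illustration of the split-to-map-task relationship, the driver fragment below bounds the maximum split size, which indirectly bounds how much data each map task handles; the 64 MB limit is an arbitrary example value.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// One map task is created per input split; capping the split size
// therefore caps how much data each map task processes.
public class SplitTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split tuning sketch");
        // Keep every split at or below 64 MB (illustrative limit only)
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // ... mapper, reducer, input and output paths would be set as usual
    }
}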
HADOOP STREAMING
▪ The core ideology of Hadoop is that data processing should be independent of the
language, i.e., it is flexible because the processing programs can be written in any
language.
▪ Hadoop Streaming is the ability of Hadoop to interface with Map & Reduce programs
written in Java or non-Java languages such as Ruby, PHP, C++, Python, etc.
▪ Hadoop streaming uses Unix standard streams as the interface between Hadoop
and our program.
DESIGN OF HDFS
▪ The HDFS is designed for big data processing.
▪ It is a core part of Hadoop, which is used for data storage.
▪ It is designed to run on commodity hardware, i.e., to run HDFS we do not need
specialized hardware; it can be easily installed and run on low-cost commodity
hardware that is readily available in the market.
FEATURES OF HDFS
▪ Distributed: In HDFS, the data is divided into multiple data blocks & stored into
different nodes. This is one of the most important features of HDFS that makes
Hadoop very powerful.
▪ Parallel computation: Data is divided and stored in different nodes; it allows parallel
computation.
▪ Highly scalable: HDFS is highly scalable as it can scale to hundreds of nodes in a single
cluster.
✓ Scaling is of two types: vertical scaling (adding resources to a single node) and
horizontal scaling (adding more nodes to the cluster).
▪ Replication:
o Due to some unfavourable conditions, the node containing the data may
fail.
o To overcome such problems, HDFS always maintains a copy of the data on a
different machine.
▪ Fault tolerance:
o The HDFS is highly fault-tolerant.
o i.e., if any machine fails, another machine containing a copy of that data
automatically becomes active.
o In HDFS, fault tolerance signifies the robustness of the system in the event
of failure.
▪ Streaming data access:
o It follows a write-once/read-many design.
o This means that once data is stored in HDFS, it cannot be altered.
o However, data can be appended at the end.
▪ Portable:
o HDFS is designed in such a way that it can be easily ported from one platform
to another.
SERIALIZATION
▪ Serialization is the process of turning structured objects into a byte stream for
transmission over a network.
▪ Deserialization is the reverse process, in which a byte stream is turned back into a
series of structured objects.
▪ The commonly used data types are:
✓ Boolean
✓ Byte
✓ Short
✓ Int
✓ Float
✓ Long
✓ Double
✓ String
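▪ In Hadoop these types are normally handled through their Writable wrappers (IntWritable, Text, and so on). The hedged sketch below serializes an IntWritable into a byte stream and deserializes it back, purely to show the round trip.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;

// Serialization round trip: structured object -> byte stream -> structured object.
public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(42);

        // Serialize: write the object's fields into a byte stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize: rebuild an equal object from that byte stream
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println("restored value = " + restored.get());   // prints 42
    }
}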