Hadoop

The document outlines the development and features of Hadoop, which was created by Yahoo in 2006 based on Google's GFS and MapReduce. It addresses the challenges of big data storage and processing through its core components, HDFS for distributed storage and YARN for resource management. The document also discusses Hadoop's reliability, scalability, and integration with systems like Oracle for enhanced data processing capabilities.


 In Oct 2003, Google released its paper on GFS (the Google File System).
 In Dec 2004, Google released its paper on MapReduce.
 In 2006, Yahoo created Hadoop, based on GFS and MapReduce, with Doug Cutting and his team.
 In 2007, Yahoo started using Hadoop on a 1000-node cluster.
 In Jan 2008, Yahoo released Hadoop as an open-source project to the Apache Software Foundation.
 Doug Cutting, quoted on Google’s contribution to the development of the Hadoop framework:
▪ “Google is living a few years in the future and sending the rest of us
messages.”
 Let’s understand the problems associated with Big Data and how Hadoop solved them.
 The first problem is storing the colossal amount of data.
 The second problem is storing heterogeneous data.
 The third problem is the processing speed.

 To solve the storage and processing issues, two core components were created in Hadoop – HDFS and YARN.
 HDFS solves the storage issue, as it stores the data in a distributed fashion and is easily scalable.
 YARN solves the processing issue by reducing the processing time drastically.
 Framework to store Big Data in a distributed environment, so that you can process it in parallel.
 Hadoop is written in the Java programming language
 The first problem is storing huge amount of data.
▪ HDFS provides a distributed way to store Big Data.
▪ Data is stored in blocks in DataNodes and you specify the size of each
block.
▪ E.g., if 512 MB of data needs to be stored
▪ If HDFS is configured to create 128 MB data blocks, HDFS will divide the data into 4 blocks (512/128 = 4) and store them across different DataNodes.
▪ Data blocks are replicated on different DataNodes to provide fault tolerance
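As an illustration of how these sizes come into play on the client side, here is a minimal sketch (not from the original slides) using the standard HDFS Java API, which lets a client choose a block size and replication factor when a file is created; the NameNode URI and path below are placeholder values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode URI

        FileSystem fs = FileSystem.get(conf);

        long blockSize = 128L * 1024 * 1024; // 128 MB blocks
        short replication = 3;               // default replication factor
        Path file = new Path("/data/example.txt"); // placeholder path

        // Create the file with an explicit replication factor and block size.
        try (FSDataOutputStream out = fs.create(file, true,
                conf.getInt("io.file.buffer.size", 4096), replication, blockSize)) {
            out.writeBytes("hello hdfs");
        }
        // A 512 MB file written this way is split into 512/128 = 4 blocks,
        // each replicated on different DataNodes.
    }
}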
 Problem of storing a variety of data.
▪ In HDFS you can store all kinds of data whether it is structured, semi-
structured or unstructured.
▪ In HDFS, there is no pre-dumping schema validation.
▪ It also follows the write-once, read-many model.
▪ You can just write any kind of data once and you can read it multiple times for
finding insights.
 Problem of processing the data faster
▪ Move the processing unit to the data instead of moving the data to the processing unit.
▪ The processing logic is sent to the nodes where the data is stored; each node can process a part of the data in parallel.
▪ All of the intermediary output produced by each node is merged
together and the final response is sent back to the client.
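To make the "send the logic to the data" idea concrete, here is a minimal word-count sketch (not part of the original slides) using the standard Hadoop MapReduce Java API: map tasks run on the nodes that hold the input blocks, and their intermediate output is merged in the reducers before the final result is written back. The input and output paths are supplied as arguments.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: runs on the nodes holding the input blocks and
    // emits (word, 1) pairs for its part of the data.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: the intermediate output of all mappers is merged here
    // and the final counts are written to the output path.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}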
 Reliability
▪ If one of the machines fails, another machine will take over the
responsibility and work in a reliable and fault-tolerant fashion.
▪ Hadoop infrastructure has inbuilt fault tolerance features and hence,
Hadoop is highly reliable.
 Economical
▪ Uses commodity hardware (like your PC or laptop)
▪ E.g., in a small Hadoop cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-10 TB hard disks and Xeon processors.
 Scalability
▪ Don’t need to worry about the scalability factor because you can go
ahead and procure more hardware and expand your set up within
minutes whenever required.
 Flexibility
▪ Hadoop is very flexible in terms of the ability to deal with all kinds of
data.
▪ Hadoop can store and process them all, whether it is structured,
semi-structured or unstructured data.
 Two services which are always mandatory for setting
up Hadoop: HDFS (storage) & YARN (processing).
▪ HDFS stands for Hadoop Distributed File System, which is a scalable
storage unit of Hadoop
 YARN is used to process the data that is stored in HDFS in a distributed and parallel fashion.
 The main components of HDFS are the NameNode and
the DataNode.
 It is the master daemon that maintains and manages the DataNodes (slave
nodes)
 It records the metadata of all the blocks stored in the cluster, e.g. location
of blocks stored, size of the files, permissions, hierarchy, etc.
 It records each and every change that takes place to the file system
metadata
 If a file is deleted in HDFS, the NameNode will immediately record this in
the EditLog
 It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are alive
 It keeps a record of all the blocks in HDFS and the DataNodes in which they are stored
 It has high availability and federation features
 It is the slave daemon which runs on each slave machine
 The actual data is stored on DataNodes
 It is responsible for serving read and write requests from the
clients
 It is also responsible for creating blocks, deleting
blocks and replicating the same based on the decisions taken
by the NameNode
 It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds
 YARN comprises two major components: ResourceManager and NodeManager.
 It is a cluster-level (one for each cluster) component and runs
on the master machine
 It manages resources and schedules applications running on
top of YARN
 It has two components: Scheduler & ApplicationManager
 The Scheduler is responsible for allocating resources to the
various running applications
 The ApplicationManager is responsible for accepting job
submissions and negotiating the first container for executing
the application
 It keeps track of the heartbeats from the Node Manager
 It is a node-level component (one on each node) and runs on
each slave machine
 It is responsible for
managing containers and monitoring resource utilization in
each container
 It also keeps track of node health and log management
 It continuously communicates with ResourceManager to
remain up-to-date
 HDFS -> Hadoop Distributed File System
 YARN -> Yet Another Resource Negotiator
 MapReduce -> Data processing using
programming
 Spark -> In-memory Data Processing
 PIG, HIVE-> Data Processing Services
using Query (SQL-like)
 HBase -> NoSQL Database
 Mahout, Spark MLlib -> Machine
Learning
 Apache Drill -> SQL on Hadoop
 Zookeeper -> Managing Cluster
 Oozie -> Job Scheduling
 Flume, Sqoop -> Data Ingesting Services
 Solr & Lucene -> Searching & Indexing
 Ambari -> Provision, Monitor and
Maintain cluster
 Search – Yahoo, Amazon, Zvents
 Log processing – Facebook, Yahoo
 Data Warehouse – Facebook, AOL
 Video and Image Analysis – New York Times, Eyealike
 Low Latency data access : Quick access to small parts of data
 Multiple data modification : Hadoop is a better fit only if we
are primarily concerned about reading data and not modifying
data.
 Lots of small files : Hadoop is suitable for scenarios where we have a small number of large files.
 Hadoop is not directly suitable for OLTP systems!
 The Large Hadron Collider is equipped with around 150 million
sensors, producing a petabyte of data every second, and the
data is growing continuously.
 CERN researchers said that this data has been scaling up in terms of amount and complexity, and one of the important tasks is to serve these scaling requirements.
 They used a Hadoop cluster to reduce their hardware cost and maintenance complexity.
 They integrated Oracle & Hadoop and got the advantages of both:
 Oracle optimized their online transactional system.
 Hadoop provided them a scalable, distributed data-processing platform.
 They designed a hybrid system:
▪ First they moved data from Oracle to Hadoop.
▪ Then they executed queries over Hadoop data from Oracle using Oracle APIs.
▪ They also used Hadoop data formats like Avro & Parquet for high-performance analytics without needing to change the end-user apps connecting to Oracle.
 Export data from Oracle to HDFS
▪ Sqoop was good enough for most cases and they also adopted some
of the other possible options like custom ingestion, Oracle DataPump,
streaming etc.
 Query Hadoop from Oracle
▪ They accessed tables in Hadoop engines using DB links in Oracle. They also built hybrid views by transparently combining data in Oracle and Hadoop.
 Use Hadoop frameworks to process data in Oracle DBs
▪ They used Hadoop engines (like Impala and Spark) to process data exported from Oracle and then read that data in an RDBMS directly from Spark SQL with JDBC.
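A rough Spark (Java) sketch of this last pattern is shown below; it is only an outline under assumed details (the JDBC URL, table name, credentials and HDFS paths are placeholders, and the Oracle JDBC driver is assumed to be on the classpath): data is read from Oracle over JDBC, stored as Parquet on HDFS, and then queried again from Spark SQL.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OracleToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("oracle-offload")
                .getOrCreate();

        // Read a table from Oracle over JDBC (connection details are placeholders).
        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
                .option("dbtable", "SCOTT.MEASUREMENTS")
                .option("user", "reader")
                .option("password", "secret")
                .load();

        // Store it in a Hadoop-friendly columnar format (Parquet) on HDFS,
        // where engines like Spark SQL or Impala can query it at scale.
        df.write().mode("overwrite").parquet("hdfs:///data/offload/measurements");

        // The same data can then be queried back from Spark SQL.
        spark.read().parquet("hdfs:///data/offload/measurements")
                .createOrReplaceTempView("measurements");
        spark.sql("SELECT COUNT(*) FROM measurements").show();

        spark.stop();
    }
}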
 Step 1: Offload data to Hadoop
 Step2: Offload queries to Hadoop
 Step 3: Access Hadoop from an Oracle query
 Hadoop is scalable and excellent for Big Data analytics
 Oracle is proven for concurrent transactional workloads
 Solutions are available to integrate Oracle and Hadoop
 There is a great value in using hybrid systems (Oracle +
Hadoop):
▪ Oracle APIs for legacy applications and OLTP workloads
▪ Scalability on commodity hardware for analytic workloads
 A distributed file system manages data, i.e. files or folders, across multiple computers or servers.
 The Hadoop Distributed File System (HDFS) is a Java-based distributed file system that allows you to store large data across multiple nodes in a Hadoop cluster.
 When you install Hadoop, you get HDFS as the underlying storage system for storing data in the distributed environment.
 HDFS is distributed in such a way that every machine contributes its individual storage for storing any kind of data.
 1. Distributed Storage:
▪ Storage not limited to the physical boundaries of each individual
machine.
 2. Distributed & Parallel Computation:
 3. Horizontal Scalability:
 Apache HDFS is a block-structured file system
 Each file is divided into blocks of a pre-determined size.
 These blocks are stored across a cluster of one or several
machines.
 Apache Hadoop HDFS architecture follows a Master/Slave architecture, where a cluster comprises a single NameNode (Master node) and all the other nodes are DataNodes (Slave nodes).
 HDFS can be deployed on a broad spectrum of machines that
support Java.
 Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.
 NameNode is the master node.
 Maintains and manages the blocks present on the DataNodes
(slave nodes).
 NameNode is a very highly available server that manages the
File System Namespace and controls access to files by clients
 The HDFS architecture is built in such a way that the user data
never resides on the NameNode.
▪ The data resides on DataNodes only.
 It is the master daemon that maintains and manages the
DataNodes (slave nodes)
 It records the metadata of all the files stored in the cluster, e.g.
The location of blocks stored, the size of the files, permissions,
hierarchy, etc. There are two files associated with the
metadata:
▪ FsImage: It contains the complete state of the file system namespace
since the start of the NameNode.
▪ EditLogs: It contains all the recent modifications made to the file
system with respect to the most recent FsImage.
 It records each change that takes place to the file system metadata.
For example, if a file is deleted in HDFS, the NameNode will
immediately record this in the EditLog.
 It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are live.
 It keeps a record of all the blocks in HDFS and in which nodes these
blocks are located.
 The NameNode is also responsible for maintaining the replication factor of all the blocks.
 In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
 DataNodes are the slave nodes in HDFS.
 DataNode is commodity hardware, that is, an inexpensive system which is not of high quality or high availability.
 The DataNode is a block server that stores the data in the local file system (e.g. ext3 or ext4).
 These are slave daemons or processes which run on each slave machine.
 The actual data is stored on DataNodes.
 The DataNodes perform the low-level read and write
requests from the file system’s clients.
 They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
 Apart from these two daemons, there is a third daemon or
a process called Secondary NameNode.
 The Secondary NameNode works concurrently with the
primary NameNode as a helper daemon.
 It is NOT a backup NameNode.
 The Secondary NameNode constantly reads all the file system metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
 It is responsible for combining the EditLogs with FsImage from
the NameNode.
 It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode, which uses it the next time the NameNode is started.
 HDFS provides a reliable way to store huge data in a
distributed environment as data blocks.
 The blocks are also replicated to provide fault tolerance.
 The default replication factor is 3 which is again configurable.
 Replication therefore occupies more space than the original size of the file.
 The NameNode collects block report from DataNode
periodically to maintain the replication factor.
 Therefore, whenever a block is over-replicated or under-replicated, the NameNode deletes or adds replicas as needed.
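Because the replication factor is per-file and configurable, a client can also change it after the fact. The following small Java sketch (not from the original slides) uses FileSystem.setReplication, with a placeholder NameNode URI and a path borrowed from the command examples later in this document.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/new_dir/sample"); // assumed to exist

        // Request a new replication factor for an existing file; the NameNode
        // then schedules DataNodes to add or delete replicas to match it.
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("Replication change accepted: " + accepted);
    }
}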
 Suppose a situation where an HDFS client wants to write a file named “example.txt” of size 248 MB.
 Assume that the system block size is configured for 128 MB (default).
 So, the client will divide the file “example.txt” into 2 blocks – one of 128 MB (Block A) and the other of 120 MB (Block B).
 At first, the HDFS client will reach out to the NameNode for a Write
Request against the two blocks, say, Block A & Block B.
 The NameNode will then grant the client the write permission and
will provide the IP addresses of the DataNodes where the file
blocks will be copied eventually.
 The selection of IP addresses of DataNodes is purely randomized
based on availability, replication factor and rack awareness that we
have discussed earlier.
 Let’s say the replication factor is set to default i.e. 3. Therefore, for
each block the NameNode will be providing the client a list of (3) IP
addresses of DataNodes. The list will be unique for each block.
 Suppose, the NameNode provided following lists of IP
addresses to the client:
▪ For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of
DataNode 6}
▪ For Block B, list B = {IP of DataNode 3, IP of DataNode 7, IP of
DataNode 9}
 Each block will be copied in three different DataNodes to
maintain the replication factor consistent throughout the
cluster.
 Now the whole data copy process will happen in three stages:
1. Set up of Pipeline
2. Data streaming and replication
3. Shutdown of Pipeline (Acknowledgement stage)
 Before writing the blocks, the client confirms whether the
DataNodes, present in each of the list of IPs, are ready to
receive the data or not.
 In doing so, the client creates a pipeline for each of the blocks
by connecting the individual DataNodes in the respective list
for that block.
 Let us consider Block A. The list of DataNodes provided by the
NameNode is:
 For Block A, list A = {IP of DataNode 1, IP of DataNode 4, IP of
DataNode 6}
 The client will choose the first DataNode in the list (DataNode IPs for Block
A) which is DataNode 1 and will establish a TCP/IP connection.
 The client will inform DataNode 1 to be ready to receive the block. It will
also provide the IPs of next two DataNodes (4 and 6) to the DataNode 1
where the block is supposed to be replicated.
 The DataNode 1 will connect to DataNode 4. The DataNode 1 will inform
DataNode 4 to be ready to receive the block and will give it the IP of
DataNode 6. Then, DataNode 4 will tell DataNode 6 to be ready for
receiving the data.
▪ Next, the acknowledgement of readiness will follow the reverse sequence, i.e. from DataNode 6 to 4 and then to 1.
 At last DataNode 1 will inform the client that all the DataNodes are ready
and a pipeline will be formed between the client, DataNode 1, 4 and 6.
 Now pipeline set up is complete and the client will finally begin the data
copy or streaming process.
 As the pipeline has been created, the client will push the data
into the pipeline.
 In HDFS, data is replicated based on replication factor.
 Block A will be stored to three DataNodes as the assumed
replication factor is 3.
 The client will copy the block (A) to DataNode 1 only.
 The replication is always done by DataNodes sequentially.
 Once the block has been written to DataNode 1 by the client,
DataNode 1 will connect to DataNode 4.
 Then, DataNode 1 will push the block in the pipeline and data
will be copied to DataNode 4.
 Again, DataNode 4 will connect to DataNode 6 and will copy
the last replica of the block.
 Series of acknowledgements will take place to ensure the
client and NameNode that the data has been written
successfully to all three data nodes.
 Then, the client will finally close the pipeline to end the TCP
session.
 Block B will also be copied into the DataNodes in parallel with
Block A. So, the following things are to be noticed here:
▪ The client will copy Block A and Block B to the first
DataNode simultaneously.
▪ Therefore, in our case, two pipelines will be formed, one for each of the blocks, and all the processes discussed above will happen in parallel in these two pipelines.
▪ The client writes the block into the first DataNode and then the
DataNodes will be replicating the block sequentially.
 The client will reach out to the NameNode asking for the block metadata of the file “example.txt”.
 The NameNode will return the list of DataNodes where each block (Block A and Block B) is stored.
 After that, the client will connect to the DataNodes where the blocks are stored.
 The client starts reading data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
 Once the client gets all the required file blocks, it will combine
these blocks to form a file.
 While serving the client’s read request, HDFS selects the replica which is closest to the client.
 This reduces the read latency and the bandwidth
consumption.
 Therefore, that replica is selected which resides on the same
rack as the reader node, if possible.
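The read path described above maps onto a few calls in the HDFS Java API. The sketch below (not part of the original slides, with a placeholder NameNode URI and path) first asks the NameNode which DataNodes hold each block and then opens the file, letting the client library read the blocks from the nearest replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode URI
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // placeholder path

        // Ask the NameNode for the block metadata: which DataNodes hold each block.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " hosts " + String.join(",", block.getHosts()));
        }

        // open() hides the block-by-block reads from the nearest replicas
        // behind an ordinary input stream.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, Math.max(n, 0), "UTF-8"));
        }
    }
}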
 Hadoop YARN knits the storage unit of Hadoop i.e. HDFS
(Hadoop Distributed File System) with the various processing
tools.
 YARN stands for “Yet Another Resource Negotiator”.
 In Hadoop version 1.0 which is also referred to as
MRV1(MapReduce Version 1), MapReduce performed both
processing and resource management functions.
 It consisted of a Job Tracker which was the single master.
 The Job Tracker
▪ allocated the resources
▪ performed scheduling
▪ monitored the processing jobs.
 It assigned map and reduce tasks on a number of subordinate
processes called the Task Trackers.
 The Task Trackers periodically reported their progress to the Job
Tracker.
 Bottleneck due to a single Job Tracker
 In MapReduce V1, scalability becomes a bottleneck when the cluster size grows to 4000+ nodes
 To overcome all these issues, YARN was introduced in Hadoop
version 2.0 in the year 2012 by Yahoo and Hortonworks.
 The basic idea behind YARN is to relieve MapReduce by taking
over the responsibility of Resource Management and Job
Scheduling.
 YARN started to give Hadoop the ability to run non-
MapReduce jobs within the Hadoop framework.
 YARN allows different data processing methods
▪ graph processing, interactive processing and stream processing, as well as batch processing, to run and process data stored in HDFS.
 YARN opens Hadoop to other types of distributed applications beyond MapReduce.
 YARN enabled the users to perform operations as per requirement
by using a variety of tools like Spark for real-time
processing, Hive for SQL, HBase for NoSQL and others.
 YARN Performs two jobs:
▪ Resource Management - allocating resources
▪ Job Scheduling - scheduling tasks
 YARN performs all your processing activities by allocating
resources and scheduling tasks.
 Resource Manager: Runs as the master daemon and manages the resource allocation in the cluster.
 Node Manager: Runs as a slave daemon on each machine and is responsible for the execution of tasks on every single DataNode.
 Application Master: Manages the user job lifecycle and
resource needs of individual applications. It works along with
the Node Manager and monitors the execution of tasks.
 Container: Package of resources including RAM, CPU,
Network, HDD etc on a single node.
 It is the ultimate authority in resource allocation.
 On receiving the processing requests,
▪ passes parts of requests to corresponding node managers accordingly,
where the actual processing takes place.
 It is the arbitrator of the cluster resources and decides the
allocation of the available resources for competing applications.
 Optimizes cluster utilization, keeping all resources in use all the time, against various constraints such as capacity guarantees, fairness, and SLAs.
 It has two major components:
▪ a) Scheduler b) Application Manager
 a) Scheduler
 The scheduler is responsible for allocating resources to the various
running applications subject to constraints of capacities, queues etc.
 It is called a pure scheduler in ResourceManager, which means that it
does not perform any monitoring or tracking of status for the applications.
 If there is an application failure or hardware failure, the Scheduler does
not guarantee to restart the failed tasks.
 Performs scheduling based on the resource requirements of the
applications.
 It has a pluggable policy plug-in, which is responsible for partitioning the
cluster resources among the various applications.
 There are two such plug-ins: Capacity Scheduler and Fair Scheduler, which
are currently used as Schedulers in ResourceManager.
 b) Application Manager
 It is responsible for accepting job submissions.
 Negotiates the first container from the Resource Manager for
executing the application specific Application Master.
 Manages running the Application Masters in a cluster and
provides service for restarting the Application Master
container on failure.
 It takes care of individual nodes in a Hadoop cluster and manages user
jobs and workflow on the given node.
 It registers with the Resource Manager and sends heartbeats with the
health status of the node.
 Its primary goal is to manage application containers assigned to it by the
resource manager.
 It keeps up-to-date with the Resource Manager.
 Application Master requests the assigned container from the Node
Manager by sending it a Container Launch Context (CLC) which includes
everything the application needs in order to run.
▪ The Node Manager creates the requested container process and starts it.
 Monitors resource usage (memory, CPU) of individual containers.
 Performs Log management.
 It also kills the container as directed by the Resource Manager.
 An application is a single job submitted to the framework.
▪ Each such application has a unique Application Master associated with it
which is a framework specific entity.
 It is the process that coordinates an application’s execution in the
cluster and also manages faults.
 Its task is to negotiate resources from the Resource Manager and
work with the Node Manager to execute and monitor the
component tasks.
 It is responsible for negotiating appropriate resource containers
from the ResourceManager, tracking their status and monitoring
progress.
 Once started, it periodically sends heartbeats to the Resource
Manager to affirm its health and to update the record of its
resource demands.
 It is a collection of physical resources such as RAM, CPU cores,
and disks on a single node.
 YARN containers are managed by a Container Launch Context (CLC), a record that describes the container life-cycle.
 This record contains a map of environment variables,
dependencies stored in a remotely accessible storage, security
tokens, payload for Node Manager services and the command
necessary to create the process.
 It grants rights to an application to use a specific amount of
resources (memory, CPU etc.) on a specific host.
1. Client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. Client contacts the Resource Manager/Application Master to monitor the application’s status
8. Once the processing is complete, the Application Master unregisters with the Resource Manager
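For reference, the client-side part of this workflow corresponds roughly to the YARN client API as in the following Java sketch. It is a simplified outline (the application name, ApplicationMaster command and resource sizes are placeholder values); a real application would also ship local resources, environment and security tokens in the Container Launch Context.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.*;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Steps 1-2: ask the ResourceManager for an application id and a
        // container in which to start the ApplicationMaster.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");

        // The Container Launch Context carries the command (and, in real
        // applications, local resources, environment and security tokens).
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("echo hello-from-AM"), // placeholder AM command
                null, null, null);
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore
        appContext.setQueue("default");

        ApplicationId appId = appContext.getApplicationId();
        yarnClient.submitApplication(appContext);

        // Step 7: the client polls the ResourceManager for the application's status.
        System.out.println("Submitted " + appId + ", state: "
                + yarnClient.getApplicationReport(appId).getYarnApplicationState());
        yarnClient.stop();
    }
}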
 fsck
▪ HDFS Command to check the health of the Hadoop file system.
▪ Command: hdfs fsck /
 ls
▪ HDFS Command to display the list of Files and Directories in HDFS.
▪ Command: hdfs dfs -ls /
 mkdir
▪ HDFS Command to create the directory in HDFS.
▪ Usage: hdfs dfs -mkdir /directory_name
▪ Command: hdfs dfs -mkdir /new_dir

 copyToLocal
▪ HDFS Command to copy the file from HDFS to Local File System.
▪ Usage: hdfs dfs -copyToLocal <hdfs source> <localdst>
▪ Command: hdfs dfs -copyToLocal /new_dir/test /home/basil
 touchz
▪ HDFS Command to create a file in HDFS with file size 0 bytes.
▪ Usage: hdfs dfs -touchz /directory/filename
▪ Command: hdfs dfs -touchz /new_dir/sample
 du
▪ HDFS Command to check the file size.
▪ Usage: hdfs dfs -du -s /directory/filename
▪ Command: hdfs dfs -du -s /new_dir/sample
 cat
▪ HDFS Command that reads a file on HDFS and prints the content of that file to the standard output.
▪ Usage: hdfs dfs -cat /path/to/file_in_hdfs
▪ Command: hdfs dfs -cat /new_dir/test
 copyFromLocal
▪ HDFS Command to copy the file from a Local file system to HDFS.
▪ Usage: hdfs dfs -copyFromLocal <localsrc> <hdfs destination>
▪ Command: hdfs dfs -copyFromLocal /home/dir/test /new_dir
 get
▪ HDFS Command to copy files from hdfs to the local file system.
▪ Usage: hdfs dfs -get <src> <localdst>
▪ Command: hdfs dfs -get /user/test /home/dir
