
Unit 2

Apache Hadoop
 Apache Hadoop is an open-source software platform for storing and processing huge volumes of data across clusters of computers.
 It gives us a massive data storage facility and supports advanced analytics like predictive analytics, machine learning, and data mining.
 Hadoop has the capability to handle different modes of data, such as structured, unstructured, and semi-structured data.
 It gives us the flexibility to collect, process, and analyze data that our old data warehouses failed to handle.
 Apache Hadoop's biggest strength is scalability: it upgrades seamlessly from working on a single node to thousands of nodes.
 Because Big Data spans many domains, we are able to manage data from videos, text, transactional data, sensor information, statistical data, social media conversations, search engine queries, e-commerce data, financial information, weather data, news updates, and so on.
 Hadoop runs applications on the basis of MapReduce, where the data is processed in parallel.
 It is a framework based on Java programming.

1. Hadoop Ecosystem Overview


The Hadoop ecosystem is a platform or framework that helps in solving big data problems. It consists of different components and services (for storing, analyzing, and maintaining data). There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common Utilities. Most of the other tools are used to support these major elements.

 Oozie: Job scheduling
 Zookeeper: Managing the cluster
 PIG, HIVE: Query-based processing of data services
 Mahout, Spark MLlib: Machine learning algorithm libraries
 MapReduce: Programming-based data processing
 YARN: Yet Another Resource Negotiator
 HDFS: Hadoop Distributed File System
HDFS:

 HDFS is the major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
 Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for Big Data.
 HDFS consists of two core components, i.e.,
 Name Node
 Data Node
 The Name Node is the prime node; it contains the metadata and does not store the actual data or dataset.
 It is also known as the Master node.
 It keeps track of files and directories.
 It executes file system operations such as naming, opening, and closing files and directories.
 The Data Node is also known as the Slave. The HDFS DataNode is responsible for storing the actual data in HDFS.
 The DataNode performs read and write operations as per the requests of the clients.
 Each block replica on a DataNode is represented by two files: the first file is for the data itself and the second file is for recording the block's metadata.
 The DataNode performs operations like block replica creation and deletion.
 The DataNode manages the data storage of the system.
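
To make the Name Node / Data Node split concrete, here is a minimal Java sketch that asks the NameNode for a file's metadata and block locations without reading any file data from the DataNodes. The NameNode URI and the file path are placeholders, not values from these notes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; use your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // A pure metadata query: the NameNode answers, no file data is read.
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
        System.out.println("Length: " + status.getLen()
                + ", replication: " + status.getReplication());

        // For each block, the NameNode reports which DataNodes hold a replica.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + loc.getOffset()
                    + " on hosts " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}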

MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing. It processes the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
Map() performs sorting and filtering of data, thereby organizing the data into groups.
Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.

The Map function takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce function takes the output from the Map as input and combines those data tuples based on the key, accordingly modifying the value of the key.
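
To make Map() and Reduce() concrete, here is a minimal sketch of the classic word-count example written against Hadoop's Java MapReduce API. The class names are our own, and a real job would also need a driver (one is sketched in the MapReduce architecture section later in this unit).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): break each input line into (word, 1) key/value tuples.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}

// Reduce(): combine all tuples that share a key into a smaller set,
// here a single (word, total) pair.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}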

Features of MapReduce
 Simplicity – MapReduce jobs are easy to run. Applications can be written in any language such as Java, C++, and Python.
 Scalability – MapReduce can process petabytes of data.
 Speed – parallel processing means jobs run fast.
 Fault Tolerance – MapReduce takes care of failures.

YARN
Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management. YARN is also one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
 It consists of three major components, i.e.,
1. Resource Manager
2. Node Manager
3. Application Master

Main features of YARN are:
 Flexibility
 Efficiency
 Shared
HIVE:

 With the help of an SQL methodology and interface, HIVE performs reading and writing of large data sets.
 Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time and batch processing.
 All the SQL data types are supported by Hive.
 HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
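
As an illustration of the JDBC driver component, the sketch below connects to a HiveServer2 instance and runs one HQL query. The host name, port, user, and the sales table are assumptions made for this example; only the driver class and JDBC URL format come from Hive itself.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load Hive's JDBC driver (must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder host and database; 10000 is HiveServer2's usual port.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; the query runs as a job on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}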

PIG:

 Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing huge data sets.
 After the processing, Pig stores the result in HDFS.
 Apache Pig is a high-level language platform; it can also be embedded in Java, as sketched after the feature list below.

Features of Apache Pig:
 Extensibility
 Handles all kinds of data
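
Pig Latin scripts are usually run from Pig's own shell, but Pig can also be embedded in Java through the PigServer class. The following is a rough sketch under that assumption; the input path, field layout, and output path are invented for illustration.

import org.apache.pig.PigServer;

public class PigEmbedded {
    public static void main(String[] args) throws Exception {
        // "mapreduce" runs on the cluster; "local" would run in-process.
        PigServer pig = new PigServer("mapreduce");

        // Pig Latin statements registered one by one.
        pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("hits = FOREACH by_ip GENERATE group, COUNT(logs);");

        // As noted above, Pig stores the result in HDFS.
        pig.store("hits", "/data/hits_per_ip");
        pig.shutdown();
    }
}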

Apache HBase:

 It is a NoSQL database that supports all kinds of data.
 HBase provides real-time access to read or write data in HDFS.
 There are two HBase components, namely HBase Master and RegionServer.
i. HBase Master: maintains and monitors the Hadoop cluster.
ii. RegionServer: the worker node which handles read, write, update, and delete requests from clients.
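
A minimal sketch of the HBase Java client API, showing a real-time write followed by a read; the RegionServer hosting the row serves both requests. The users table and its info column family are assumptions, and the table must already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "row1", family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back in real time.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}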

Apache Mahout
Mahout is an open-source framework for creating scalable machine learning algorithms, and a data mining library.
Algorithms of Mahout include:
 Clustering
 Classification

Avro
Avro is an open-source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently.
 Avro schema – Avro relies on schemas for serialization/deserialization.
 Dynamic typing – this refers to serialization and deserialization without code generation.
Features provided by Avro:
 Rich data structures.
 Remote procedure call.
 Compact, fast, binary data format.
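
A small sketch of Avro's dynamic typing: a record is built against a schema parsed at runtime (no code generation) and encoded in Avro's compact binary format. The User schema is invented for this example.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroSerialize {
    public static void main(String[] args) throws Exception {
        // Avro relies on the schema for serialization/deserialization.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record dynamically; no generated User class is needed.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Encode to Avro's compact, fast, binary data format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();
        System.out.println("Serialized to " + out.size() + " bytes");
    }
}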

Oozie
 Oozie Workflow – stores and runs workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
 Oozie Coordinator – runs workflow jobs based on predefined schedules and the availability of data.

Zookeeper: Zookeeper overcame the cluster coordination problems by performing synchronization, inter-component communication, grouping, and maintenance.

Hadoop – Architecture
The Hadoop Architecture Mainly consists of 4 components.
 MapReduce
 HDFS(Hadoop Distributed File System)
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common

HDFS
HDFS (Hadoop Distributed File System) is utilized for storage in a Hadoop cluster. HDFS provides fault tolerance and high availability to the storage layer and to the other devices present in that Hadoop cluster. Data storage nodes in HDFS:

 NameNode (Master)
 DataNode (Slave)

NameNode: The NameNode works as a Master in a Hadoop cluster and guides the DataNodes (Slaves).
The NameNode is mainly used for storing the metadata.
The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
DataNode: DataNodes work as Slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can be from 1 to 500 or even more than that.

The more DataNodes there are, the more data the Hadoop cluster can store. So it is advised that each DataNode should have high storage capacity, to store a large number of file blocks.

MapReduce
MapReduce is nothing but an algorithm-like processing model that runs on the YARN framework.
The major feature of MapReduce is that it performs distributed processing in parallel in a Hadoop cluster, which makes Hadoop work so fast.
Here, the input is provided to the Map() function, then its output is used as input to the Reduce() function, and after that we receive our final output.
The Map() function breaks the data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines these tuples or key-value pairs based on their key value, forms a set of tuples, and performs operations like sorting or summation, which are then sent to the final output node. Finally, the output is obtained.

Map Task:
 RecordReader: The purpose of the RecordReader is to break the input into records and pass them to the map function.
 Map: A map is nothing but a user-defined function whose work is to process the tuples obtained from the RecordReader.
 Combiner: The Combiner is used for local grouping (aggregation) of the map output in the Map workflow.
 Partitioner: The Partitioner is responsible for routing the key-value pairs generated in the mapper phase to the appropriate reducer.

Reduce Task:
 Shuffle and Sort: the map outputs are transferred to the reducers and sorted by key.
 Reduce: the user-defined reduce function aggregates the grouped tuples.
 OutputFormat: writes the final key-value pairs to HDFS.
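
To show how these phases are wired together, here is a sketch of a job driver for the word-count mapper and reducer from the MapReduce section earlier. The input and output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map phase: the input format's RecordReader feeds the mapper.
        job.setMapperClass(WordCountMapper.class);
        // Combiner: local aggregation of map output before the shuffle;
        // for word count, the reducer class can double as the combiner.
        job.setCombinerClass(WordCountReducer.class);
        // The default hash partitioner routes each key to a reducer; a
        // custom one could be plugged in with job.setPartitionerClass(...).

        // Reduce phase: shuffle and sort happen automatically before this.
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The OutputFormat writes the final tuples to HDFS.
        FileInputFormat.addInputPath(job, new Path("/data/in"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}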

Features of Hadoop
• Suitable for Big Data Analysis
• Scalability
• Fault Tolerance

YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works.
YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs, while the Resource Manager manages all the resources that are made available for running the Hadoop cluster.
Features of YARN:
 Multi-Tenancy
 Scalability
 Cluster Utilization
 Compatibility

Hadoop Common or Common Utilities

Hadoop Common (or Common Utilities) is nothing but the Java library and Java files that are used by HDFS, YARN, and MapReduce for running the cluster.

Analyzing data with Hadoop

Analyzing data with Hadoop brings benefits such as:
 Cost reduction
 The development of new products
 Making faster and smarter decisions
 Detecting faults
Today, Big Data is used by almost all sectors, including banking, government, manufacturing, airlines, and hospitality.

There are many open-source software frameworks for storing and managing data, and Hadoop is one of them. It has a huge capacity to store data and can process it efficiently.

Advantages of Big Data Analysis

Whenever users browse travel portals or shopping sites, search for flights or hotels, or add a particular item to their cart, ad-targeting companies can analyze this wide variety of data and activities and can provide better recommendations to the user regarding offers, discounts, and deals, based on the user's browsing history and product history.

The importance of Hadoop


 Ability to store and process huge amounts of any kind of data, quickly.
 Computing power.
 Fault tolerance.
 Flexibility.
 Low cost.
 Scalability.

Hadoop is used for:


 Machine learning
 Processing of text documents
 Image processing
 Processing of XML messages
 Web crawling
 Data analysis
 Analysis in the marketing field
 Study of statistical data

Hadoop challenges
 Security
 Lack of performance and scalability
 Low processing speed
 Lack of flexible resource management

Hadoop Distributed File System (HDFS) Concepts

The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a filesystem developed specifically for storing very large files with streaming data access patterns.
 Extremely large files: here we are talking about data in the range of petabytes (1 PB = 1000 TB).
 Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
 Commodity hardware: hardware that is inexpensive and easily available in the market.

1. Cluster: A Hadoop cluster is made by having many machines in a network; each machine is termed a node, and these nodes talk to each other over the network.
2. Name Node: The Name Node holds all the file system metadata for the cluster. The Name Node is the central controller of HDFS.
3. Secondary Name Node: Since the Name Node is a single point of failure, the Secondary Name Node constantly reads the data from the RAM of the Name Node and writes it into the hard disk or the file system.
4. Data Node: These are the workers that do the real work of storing data, as and when told by the Name Node.
5. Block Size: This is the minimum size of one block in a filesystem, in which data can be kept contiguously. The default size of a single block in HDFS is 64 MB. For example, with 64 MB blocks, a 200 MB file is stored as four blocks of 64 + 64 + 64 + 8 MB.

 Name Nodes:

 Run on the master node.
 Store metadata (data about data) like the file path, the number of blocks, block IDs, etc.
 Require a high amount of RAM.
 Store the metadata in RAM for fast retrieval, i.e., to reduce seek time, though a persistent copy of it is kept on disk.

 Data Nodes:

 Run on the slave nodes.
 Require a high amount of storage, as the data is actually stored here.
Design of HDFS
HDFS is designed for:
1. Very large files.
2. Streaming data access: the most efficient data processing pattern is a write-once, read-many-times pattern.
3. Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware.
4. Data replication and fault tolerance.
5. High throughput: in HDFS, the task is divided and shared among different systems.
6. Moving computation is better than moving data.
7. Scale-out architecture: HDFS employs a scale-out architecture, as compared to the scale-up architecture employed by an RDBMS.
HDFS is not a good fit for:
8. Low-latency data access.
9. Lots of small files.
10. Multiple writers and arbitrary file modifications.

HDFS Concepts:
Important components in HDFS Architecture are:
 Blocks
 Name Node
 Data Nodes

Hadoop Filesystems
Hadoop is an open-source software framework written in Java. Hadoop is capable of running on various file systems, and HDFS is just one implementation out of all those file systems.

The main Hadoop filesystems, with their URI scheme and Java implementation (all implementations are under org.apache.hadoop):

 Local – URI scheme: file; implementation: fs.LocalFileSystem. The Hadoop local filesystem is used for a locally connected disk with client-side checksumming. A local filesystem with no checksums uses RawLocalFileSystem.
 HDFS – URI scheme: hdfs; implementation: hdfs.DistributedFileSystem. HDFS stands for Hadoop Distributed File System, and it is designed to work efficiently with MapReduce.
 HFTP – URI scheme: hftp; implementation: hdfs.HftpFileSystem. The HFTP filesystem provides read-only access to HDFS over HTTP. There is no connection between HFTP and FTP.
 HSFTP – URI scheme: hsftp; implementation: hdfs.HsftpFileSystem. The HSFTP filesystem provides read-only access to HDFS over HTTPS. This filesystem also has no connection with FTP.
 HAR – URI scheme: har; implementation: fs.HarFileSystem. The HAR (Hadoop Archive) filesystem is mainly used to reduce the memory usage of the NameNode by archiving files in Hadoop HDFS.
 KFS (CloudStore) – URI scheme: kfs; implementation: fs.kfs.KosmosFileSystem. CloudStore, or KFS (KosmosFileSystem), is a filesystem written in C++. It is very similar to a distributed filesystem like HDFS or GFS (Google File System).
 FTP – URI scheme: ftp; implementation: fs.ftp.FTPFileSystem. The FTP filesystem is backed by an FTP server.
 S3 (native) – URI scheme: s3n; implementation: fs.s3native.NativeS3FileSystem. This filesystem is backed by Amazon S3.
 S3 (block-based) – URI scheme: s3; implementation: fs.s3.S3FileSystem. A filesystem backed by Amazon S3 which stores files in blocks (similar to HDFS), just to overcome S3's 5 GB file size limit.
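
The URI scheme is what selects the implementation at runtime. A small sketch, assuming a reachable HDFS cluster for the second call (the host and port are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // file:// resolves to the local filesystem implementation.
        FileSystem local = FileSystem.get(URI.create("file:///tmp"), conf);
        System.out.println(local.getClass().getName()); // ...fs.LocalFileSystem

        // hdfs:// resolves to the distributed filesystem implementation.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
        System.out.println(hdfs.getClass().getName()); // ...hdfs.DistributedFileSystem
    }
}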

Anatomy of File Read in HDFS

Let's get an idea of how data flows between the client interacting with HDFS, the name node, and the data nodes, with the following steps.
Step 1: The client opens the file it wishes to read by calling open() on the File System object (which for HDFS is an instance of Distributed File System).

Step 2: Distributed File System( DFS) calls the name node, using remote
procedure calls (RPCs), to determine the locations of the first few blocks
in the file. For each block, the name node returns the addresses of the
data nodes that have a copy of that block. The DFS returns an
FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages
the data node and name node I/O.

Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.

Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.

Step 5: When the end of the block is reached, DFSInputStream will close the connection to the data node and then find the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.

Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
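
The same flow, condensed into a minimal Java sketch (the NameNode address and file path are placeholders): open() returns the stream, repeated reads stream the blocks one after another, and close() ends the session.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Step 1: open() returns an FSDataInputStream wrapping DFSInputStream.
        FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
        try {
            // Steps 3-5: read() streams block after block transparently.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close the stream.
            IOUtils.closeStream(in);
        }
    }
}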

Anatomy of File Write in HDFS

Next, we'll look at how files are written to HDFS. Consider the following steps to get a better understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we
cannot edit the files which are already stored in HDFS, but we can append
data by reopening the files.
Step 1: The client creates the file by calling create() on
DistributedFileSystem(DFS).

Step 2: DFS makes an RPC call to the name node to create a new file in the file system's namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can't be created and the client is thrown an error, i.e., an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.

Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.

Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline.

Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the "ack queue".

Step 6: When the client calls close(), this action flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
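
And a matching minimal write sketch (again with a placeholder NameNode address and path): create() adds the file to the namespace, writes are split into packets and pushed down the DataNode pipeline behind the scenes, and close() flushes the remaining packets and signals completion.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Steps 1-2: create() asks the name node to record the new file
        // and returns an FSDataOutputStream to write to.
        FSDataOutputStream out = fs.create(new Path("/data/notes.txt"));

        // Steps 3-5: each write is packetized and replicated down the pipeline.
        out.writeUTF("write-once, read-many-times");

        // Step 6: close() flushes remaining packets and signals completion.
        out.close();
    }
}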
REFER YOUTUBE VIDEO (DATA FLAIR HINDI)
