BDA Unit 2
Apache Hadoop
Apache Hadoop is an open-source software platform for storing huge volumes of data.
It gives us a massive data storage facility and supports advanced analytics such as predictive analytics, machine learning, and data mining.
Hadoop has the capability to handle different modes of data: structured, unstructured, and semi-structured data.
It gives us the flexibility to collect, process, and analyze data that our old data warehouses failed to handle.
Apache Hadoop's biggest strength is scalability.
It scales from a single node to thousands of nodes in a seamless manner, without any issue.
Because Big Data spans many different domains, Hadoop lets us manage data from videos, text, transactional data, sensor information, statistical data, social media conversations, search engine queries, e-commerce data, financial information, weather data, news updates, and so on.
Hadoop runs applications on the basis of MapReduce, where the data is processed in parallel.
It is a framework based on Java programming.
Name node
Data Node
Name Node is the prime node; it contains the metadata but does not store the actual data or dataset.
It is also known as the Master node.
It keeps track of the files and directories in the file system namespace.
It executes file system operations such as naming, opening, and closing files and directories.
Data Node is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS.
The DataNode performs read and write operations as per the requests of the clients.
Each block replica is stored as two files on the DataNode's local file system: the first file is for the data and the second file is for recording the block's metadata.
The DataNode performs operations such as block replica creation and deletion.
The DataNode manages the data storage of the system.
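To see the split between the Name Node (metadata) and the Data Nodes (actual blocks) in practice, here is a minimal Java sketch that asks the Name Node for a file's block locations. The cluster URI and file path are hypothetical, and a running HDFS cluster is assumed.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; adjust to your fs.defaultFS setting.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path file = new Path("/user/demo/sample.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        // The Name Node answers this metadata query; the block contents stay on the Data Nodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}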
MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component which provides
data processing.
It processes the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
Map() performs sorting and filtering of data and thereby organizing them in
the form of group.
Reduce() takes the output generated by Map() as input and combines those
tuples into smaller set of tuples.
The Map function takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce function takes the output from the Map as an input and combines those data tuples based on the key and accordingly modifies the value of the key.
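To make the Map() and Reduce() functions concrete, below is the classic word-count job written in Java against the Hadoop MapReduce API: the map step emits (word, 1) pairs and the reduce step sums the counts per word. The input and output paths are assumed to be HDFS directories passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: break each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner does local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}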
Features of MapReduce
Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, and Python.
Scalability – MapReduce can process petabytes of data.
Speed – Data is processed in parallel across the cluster, so large jobs finish far faster than they would sequentially.
Fault Tolerance – MapReduce takes care of failures by re-executing failed tasks on other nodes.
YARN
Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop ecosystem component that provides resource management.
YARN is also one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Master
PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
After the processing, Pig stores the result in HDFS.
Apache Pig is a high-level language platform.
Apache HBase:
HBase is a column-oriented NoSQL database that runs on top of HDFS and provides real-time read and write access to large datasets.
Apache Mahout
Mahout is an open-source framework for creating scalable machine learning algorithms and a data mining library.
Algorithms of Mahout are:
Clustering
Classifications
Avro
Avro is an open-source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently.
Avro schema – Avro relies on schemas for serialization/deserialization.
Dynamic typing – It refers to serialization and deserialization without code generation (see the sketch after this list).
Features provided by Avro:
Rich data structures.
Remote procedure call.
Compact, fast, binary data format.
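As a small illustration of Avro's schema-driven, dynamically typed serialization (no code generation), the Java sketch below parses a made-up "User" schema and builds a generic record against it; the schema and field names are assumptions for this example only.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSchemaDemo {
  public static void main(String[] args) {
    // Hypothetical Avro schema, expressed as JSON: a record with two fields.
    String schemaJson =
        "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
      + "  { \"name\": \"name\", \"type\": \"string\" },"
      + "  { \"name\": \"age\",  \"type\": \"int\" } ] }";

    // Parse the schema; serialization/deserialization is driven entirely by it.
    Schema schema = new Schema.Parser().parse(schemaJson);

    // Dynamic typing: build a record against the schema without any generated classes.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);

    System.out.println(user);  // prints the record contents as JSON-like text
  }
}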
Oozie
Oozie workflow – It is used to store and run workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
Oozie Coordinator – It runs workflow jobs based on predefined schedules and the availability of data.
Hadoop – Architecture
The Hadoop Architecture Mainly consists of 4 components.
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
HDFS
HDFS (Hadoop Distributed File System) is utilized for storage in Hadoop.
HDFS provides fault tolerance and high availability to the storage layer and to the other devices present in the Hadoop cluster. Data storage nodes in HDFS:
NameNode (Master)
DataNode (Slave)
The more DataNodes there are, the more data the Hadoop cluster can store. So it is advised that DataNodes have a high storage capacity in order to store a large number of file blocks.
MapReduce
MapReduce is nothing but an algorithm-like processing model, built on the YARN framework.
The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast.
Here, the input is provided to the Map() function, its output is used as the input to the Reduce() function, and after that we receive our final output.
The Map() function breaks the data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines these tuples (key-value pairs) based on their key value, forms a smaller set of tuples, and performs operations such as sorting or summation, which is then sent to the final output node. Finally, the output is obtained.
Map Task:
RecordReader: The purpose of the RecordReader is to break the input into records and present them to the map function as key-value pairs.
Map: A map is nothing but a user-defined function whose work is to process the tuples obtained from the RecordReader.
Combiner: The Combiner is used for grouping (locally aggregating) the data in the Map workflow.
Partitioner: The Partitioner takes the key-value pairs generated in the Mapper phase and decides which Reducer each pair is sent to (see the sketch after this list).
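Here is the sketch referred to above: a minimal custom Partitioner in Java that routes key-value pairs to reducers by hashing the key, mirroring what Hadoop's default HashPartitioner does. The Text/IntWritable types are assumed for illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer each (key, value) pair emitted by the mappers is sent to.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Hash the key and map it onto the range [0, numReduceTasks).
    // All pairs with the same key therefore reach the same reducer.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

// Hypothetical wiring into a job (job is an org.apache.hadoop.mapreduce.Job):
// job.setPartitionerClass(WordPartitioner.class);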
Reduce Task: the intermediate key-value pairs are shuffled and sorted by key, the user-defined reduce function aggregates them, and the final output is written to HDFS.
Features Of 'Hadoop'
• Suitable for Big Data Analysis
• Scalability
• Fault Tolerance
There are many open-source software frameworks for storing and managing data, and Hadoop is one of them. It has a huge capacity to store data and efficient processing power.
Hadoop challenges
Security
Lack of performance and scalability
Low processing Speed
Lack of flexible resource management
HDFS Concepts:
Important components in HDFS Architecture are:
Blocks
Name Node
Data Nodes
Hadoop Filesystems
Hadoop is an open-source software framework written in Java
Hadoop is capable of running various file systems; HDFS is just one implementation out of all those file systems. A few of them are listed in the table below, followed by a small example.
Filesystem         URI scheme   Java implementation        Description
KFS (CloudStore)   kfs          fs.kfs.KosmosFileSystem    A file system written in C++; very much like a distributed file system such as HDFS or GFS (Google File System).
S3 (block-based)   s3           fs.s3.S3FileSystem         A file system supported by Amazon S3 that stores files in blocks (similar to HDFS), just to overcome S3's 5 GB file size limit.
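Because Hadoop exposes all of these file systems through the same FileSystem abstraction, the concrete implementation is chosen from the URI scheme. A minimal Java sketch with hypothetical URIs (the hdfs:// case assumes a reachable cluster):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemSchemes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // file:// resolves to the local file system implementation.
    FileSystem local = FileSystem.get(URI.create("file:///tmp/data.txt"), conf);

    // hdfs:// resolves to the DistributedFileSystem implementation (HDFS).
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf);

    System.out.println(local.getClass().getName());  // e.g. ...LocalFileSystem
    System.out.println(hdfs.getClass().getName());   // e.g. ...DistributedFileSystem
  }
}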
Let’s get an idea of how data flows between the client interacting with
HDFS, the name node, and the data nodes with the help of a diagram.
Consider the figure:
Step 1: The client opens the file it wishes to read by calling open() on the File System object (which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System (DFS) calls the name node, using remote
procedure calls (RPCs), to determine the locations of the first few blocks
in the file. For each block, the name node returns the addresses of the
data nodes that have a copy of that block. The DFS returns an
FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages
the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls
read() repeatedly on the stream.
Step 5: When the end of a block is reached, DFSInputStream closes the connection to the data node and then finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
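The read path above can be exercised directly from Java with the FileSystem API. The sketch below is a minimal example corresponding to the steps just described: open() the file, read() through the returned FSDataInputStream, and close() at the end. The cluster URI and file path are hypothetical.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadDemo {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://localhost:9000/user/demo/sample.txt";  // hypothetical file
    Configuration conf = new Configuration();

    // Step 1: get the DistributedFileSystem instance and open() the file.
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));          // returns an FSDataInputStream
      // Steps 3-5: read() pulls data, block by block, from the data nodes.
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      // Step 6: close the stream when finished reading.
      IOUtils.closeStream(in);
    }
  }
}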
Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to
get a better understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we
cannot edit the files which are already stored in HDFS, but we can append
data by reopening the files.
Step 1: The client creates the file by calling create() on
DistributedFileSystem(DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system's namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can't be created and an error, i.e. an IOException, is thrown to the client. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it
to the third (and last) data node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that
are waiting to be acknowledged by data nodes, called an “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
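Likewise, the write path can be driven from Java: the minimal sketch below calls create() on the file system, writes a few bytes through the returned FSDataOutputStream, and then calls close(), which maps onto the steps above. The cluster URI and destination path are hypothetical.

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
  public static void main(String[] args) throws Exception {
    String dst = "hdfs://localhost:9000/user/demo/output.txt";  // hypothetical destination
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);

    // Steps 1-2: create() asks the name node to record the new file in its namespace.
    FSDataOutputStream out = fs.create(new Path(dst));

    // Steps 3-5: the data is split into packets and streamed through the data node pipeline.
    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));

    // Step 6: close() flushes the remaining packets and signals completion to the name node.
    out.close();
    fs.close();
  }
}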
REFER YOUTUBE VIDEO (DATA FLAIR HINDI)