Big Data Aktu Unit 3

The document provides an overview of the Hadoop Distributed File System (HDFS), detailing its architecture, features, and operational mechanisms for managing large-scale data storage and processing. Key aspects include its fault tolerance through data replication, scalability by adding nodes, and high throughput for efficient data access. Additionally, it discusses challenges such as the small file problem and latency, while emphasizing the importance of block size and data locality in optimizing performance.


UNIT 3

Hadoop Distributed File System


HADOOP
Hadoop includes two main pieces:
• a distributed architecture for running MapReduce jobs, which are Java (and other) programs used to convert data from one format to another, and
• a distributed file system (HDFS) for storing data in a distributed architecture.
Here we discuss HDFS.
• Apache Hadoop HDFS is a distributed file system which provides redundant storage space for storing huge files, in the range of terabytes and petabytes.
• In HDFS, data is stored reliably. Files are broken into blocks and distributed across nodes in a cluster. Each block is then replicated, meaning copies of the block are created on different machines. By default, 3 copies of each block are kept on different machines.
• Hence, if a machine goes down or crashes, the data can still be retrieved and accessed from other machines, so HDFS is highly fault-tolerant.
• HDFS also provides a fast file read and write mechanism, as data is stored on different nodes in a cluster and can be accessed from any machine in the cluster. Hence HDFS is widely used as a platform for storing huge volumes and many varieties of data worldwide.
Features of HDFS
• Distributed Storage: Data is stored across multiple nodes
in a cluster.
• Fault Tolerance: Data is replicated across different nodes
to prevent data loss.
• Scalability: Easily scales by adding new nodes to the
cluster.
• High Throughput: Supports parallel processing, ensuring
efficient data access.
• Streaming Access: Designed for high-speed data
streaming rather than low-latency access.
• Write Once, Read Many (WORM): Data is written once
and read multiple times, making it suitable for big data
analytics.
HDFS Use Cases

• Big Data Analytics: processing large datasets (e.g., log analysis, fraud detection).
• Machine Learning Pipelines: storing and preprocessing training data.
• Cloud Storage Solutions: used in data lakes and enterprise data warehouses.
• Social Media & Web Analytics: storing and analyzing user interactions.
• Healthcare & Genomics: processing large medical datasets.
HDFS Design
HDFS (Hadoop Distributed File System) is designed to handle large-scale data storage and processing efficiently across a distributed cluster. The design of HDFS is influenced by the need for scalability, fault tolerance, high throughput, and cost-effectiveness, while being optimized for batch-processing workloads.
Design Goals of HDFS

• Scalability
• Fault Tolerance
• High Throughput
• Streaming Data Access
• Simple Coherency Model
• Economical Storage
HDFS Architecture

HDFS is an Open source component of the Apache Software Foundation that manages
data.

HDFS has scalability, availability, and replication as key features.

Name nodes, secondary name nodes, data nodes, checkpoint nodes, backup
nodes, and blocks all make up the architecture of HDFS.

HDFS is fault-tolerant because data is replicated. Files are distributed across the cluster systems using the NameNode and DataNodes.

The primary difference between Hadoop and Apache HBase is that Apache HBase is a
non-relational database and Apache Hadoop is a non-relational data store.
HDFS follows a master-slave architecture, which includes the following elements:
• Name nodes
• Secondary name nodes
• Data nodes
• Checkpoint nodes
• Backup nodes
• Blocks
Name Node
The NameNode (Master Node) manages all DataNode blocks, performing tasks such as:
• Monitoring and controlling DataNodes.
• Granting user access to files.
• Storing block metadata.
• Committing EditLogs to disk after every write for recovery.
DataNodes operate independently, and failures don’t impact the cluster as lost blocks are
re-replicated. Only the NameNode tracks all DataNodes, managing communication centrally.
The NameNode maintains two key files:
FsImage: A snapshot of the entire filesystem, storing directories and file metadata hierarchically.
EditLogs: Records real-time changes, tracking file modifications for recovery.
Secondary NameNode
The Secondary NameNode performs periodic checkpoints, merging the NameNode's FsImage and EditLogs so that the edit log does not grow without bound and the NameNode does not run out of disk space. The Secondary NameNode works with the following:
• Transaction logs from all sources are stored in one location for easy replay.
• Replication ensures data availability across servers, either directly or via a distributed file system.
• Metadata (information about data locations) is stored in cluster nodes and helps DataNodes read and process data.
• FsImage enables data replication, scaling, and backup. It can be recreated from an old version in case of failure.
• Backups can be stored in another Hadoop cluster or on a local file system.
DataNode
Every slave machine that stores data runs a DataNode. DataNodes keep the data on the node's local file system (for example, in ext3 or ext4 format). DataNodes do the following:

1. They store the actual data blocks.

2. They handle the requested operations on files, such as reading block content and writing new data, as described above.

3. They follow the NameNode's instructions, such as deleting blocks, creating replicas, and reporting on the blocks they hold.
Checkpoint Node

The Checkpoint Node creates checkpoints at specified intervals: it fetches the FsImage and EditLogs from the NameNode, merges them to produce a new FsImage, and delivers the new image back to the NameNode.

Its directory structure is always identical to that of the NameNode, so the checkpointed image is always available.
Backup Node

Backup nodes are used to provide high availability of the file system metadata.

If the active NameNode goes down, the backup node can be promoted to active, and clients are switched over to it.

Backup nodes are not used to recover lost data blocks; the block replicas stored on the DataNodes serve that purpose.

DataNodes store the data blocks and their replicas; they are not used to provide high availability of the metadata.
Blocks
• Block size in HDFS is configurable based on performance needs (128 MB by default).
• Blocks are written to DataNodes and replicated for fault tolerance.
• Automatic recovery occurs if a node fails; data is restored from the replicas.
• HDFS is a file system layered on top of the nodes' local storage; it does not write to raw disks directly.
• It scales horizontally as data and users grow.
• File storage mechanism: if a file exceeds the block size, it is split across multiple blocks.
Example: a 135 MB file with a 128 MB block size creates two blocks: 128 MB + 7 MB.
Benefits of HDFS (Hadoop Distributed File System)

• Scalability – HDFS is highly scalable and can store and process petabytes of data by adding new nodes to the cluster.

• Fault Tolerance – It replicates data across multiple nodes, ensuring data availability even if a node fails.

• High Throughput – HDFS is optimized for large-scale data processing and supports parallel data access, improving performance.

• Cost-Effective – It runs on commodity hardware, reducing the cost compared to traditional storage solutions.

• Write Once, Read Many (WORM) Model – Data in HDFS is written once and read multiple times, making it ideal for big data analytics.

• Data Locality – HDFS moves computation to data rather than moving data to computation, reducing network bottlenecks.

• Integration with Hadoop Ecosystem – It seamlessly integrates with Hadoop tools like MapReduce, Hive, and Spark for efficient big data processing.
Challenges of HDFS
•Small File Problem – HDFS is optimized for large files; storing many small files can cause
performance issues and increase metadata overhead.
•Latency – Since HDFS is designed for batch processing, it has high latency and is not
suitable for real-time applications.
•Write-Once Limitation – Once written, files in HDFS cannot be modified, making updates
difficult.
•Complex Setup & Maintenance – Managing and configuring an HDFS cluster requires
expertise in Hadoop administration.
•Security Concerns – By default, HDFS lacks advanced security features like encryption and
access control, which must be manually configured.
•Hardware Dependency – Though it runs on commodity hardware, disk failures and
hardware issues still impact performance and require redundancy.
•Memory Overhead – NameNode stores metadata in RAM, which can become a bottleneck
as the cluster grows.
File Size in HDFS
In the Hadoop Distributed File System (HDFS), file size considerations depend on several factors, such as block size, replication factor, and compression. Here are the key points:
1. Default Block Size:
• The default block size in HDFS is 128 MB (256 MB in some configurations).
• Files smaller than the block size do not waste space, but large files are split into multiple blocks.
2. Maximum File Size:
• Since HDFS files are divided into blocks, the maximum file size is practically limited by the number of blocks the NameNode can manage.
• Theoretically, if each block is 128 MB and a NameNode can handle hundreds of millions of blocks, a single file can be petabytes in size.
3. Replication Factor:
• Each block is replicated (default: 3 times) across different DataNodes.
• Effective storage used = file size × replication factor (see the worked example after this list).
4. Small Files Problem:
• HDFS is not efficient for handling a large number of small files, since each file's metadata is stored in NameNode memory.
• Solution: use HAR (Hadoop Archive), SequenceFile, or Avro to merge small files.
5. Compression (Optional):
• HDFS supports Snappy, Gzip, LZO, etc., which reduce file size and improve efficiency.
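Worked example of points 1-3 above, for a hypothetical 1 GB file under the default settings:
• Number of blocks = 1024 MB / 128 MB = 8 blocks.
• Block replicas stored = 8 blocks × 3 replicas = 24 block replicas.
• Effective storage used = 1024 MB × 3 = 3072 MB (3 GB), while the NameNode only has to track metadata for 8 blocks.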
Block Size in HDFS

• Files in HDFS are broken into block-sized chunks called data blocks. These blocks
are stored as independent units.
• The size of these HDFS data blocks is 128 MB by default. We can configure the block size as per our requirement by changing the dfs.block.size property in hdfs-site.xml (see the configuration sketch after this list).
• Hadoop distributes these blocks on different slave machines, and the master machine
stores the metadata about blocks location.
• All the blocks of a file are of the same size except the last one (if the file size is not a
multiple of 128). See the example below to understand this fact.
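A minimal sketch of such a configuration, assuming we want new files to use a 256 MB block size. The property name dfs.block.size is the one used in the text above (recent Hadoop releases name it dfs.blocksize), and the value is given in bytes (256 MB = 268435456 bytes):

<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>268435456</value> <!-- 256 MB, applied to newly written files -->
  </property>
</configuration>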
Block Size in HDFS
Suppose we have a file of size 612 MB, and we are using the default block
configuration (128 MB). Therefore five blocks are created, the first four blocks
are 128 MB in size, and the fifth block is 100 MB in size (128*4+100=612).
From the above example, we can conclude that:
1. A file in HDFS, smaller than a single block does not occupy a full block size
space of the underlying storage.
2. Each file stored in HDFS doesn’t need to be an exact multiple of the configured
block size.
• Now let’s see the reasons behind the large size of the data blocks in HDFS.
Why are blocks in HDFS huge?

The default size of the HDFS data block is 128 MB. The reasons for
the large size of blocks are:

To minimize the cost of seeks: with large blocks, the time taken to transfer the data from disk is much longer than the time taken to seek to the start of the block. A large file made of many such blocks therefore transfers at close to the disk transfer rate.

If blocks were small, there would be too many blocks in Hadoop HDFS and thus too much metadata to store. Managing such a huge number of blocks and their metadata would create overhead on the NameNode and extra traffic in the network.
Block Abstraction in HDFS (Hadoop
Distributed File System)
HDFS (Hadoop Distributed File System) stores large files across multiple
nodes in a Hadoop cluster. Instead of treating files as a whole, HDFS
divides them into blocks, which are the basic unit of storage.
Key Concepts of Block Abstraction in HDFS:
1. Block Size:
1. Default block size in HDFS is 128MB (can be configured).
2. Unlike traditional file systems (which have small block sizes like
4KB), HDFS uses large blocks to minimize metadata overhead and
improve performance.
2. Splitting and Storage:
• Large files are split into fixed-size blocks and distributed across multiple
nodes.
• Example: A 500 MB file with a block size of 128 MB will be split into 3 full blocks (128 MB each) and 1 smaller block (116 MB).
Block Abstraction in HDFS (Hadoop Distributed
File System)
3. Replication for Fault Tolerance:
•Each block is replicated (default replication factor:
3).
•Replicas are stored on different nodes to ensure data
availability in case of node failure.
4. Block Mapping:
•The NameNode keeps track of where each block is
stored but does not store the actual data.
•The DataNodes store the actual blocks.
5. Processing and Parallelism:
•Since files are split into blocks across multiple
nodes, Hadoop’s MapReduce can process them in
parallel, improving efficiency.
Example of Block Abstraction in HDFS:

Suppose you upload a 300 MB file into HDFS with a block size of 128 MB and a replication factor of 3:
The file is split into three blocks:
• Block 1: 128 MB
• Block 2: 128 MB
• Block 3: 44 MB (remaining data)
Each block is replicated three times across different nodes.
Data Replication in HDFS (Hadoop Distributed File System)
Data replication, with multiple copies across many nodes, helps protect against data loss. HDFS keeps at least one copy on a different rack from all other copies. This storage of data across the nodes of a large cluster increases reliability.
In addition, HDFS can take storage snapshots to save point-in-time (PIT) information.
Data Replication in HDFS (Hadoop Distributed File System)

In order to provide reliable storage, HDFS stores large files in multiple locations in a large cluster, with each file stored as a sequence of blocks. All blocks of a file are the same size, except for the final block, which fills as data is added.

For added protection, HDFS files are write-once, with only one writer at any time. To help ensure that all data is being replicated as instructed, the NameNode receives a heartbeat (a periodic status report) and a blockreport (the block ID, generation stamp, and length of every block replica) from every DataNode attached to the cluster. Receiving a heartbeat indicates that the DataNode is working correctly.
• The NameNode selects the rack ID for each DataNode by using a process called
Hadoop Rack Awareness to help prevent the loss of data if an entire rack fails.
This also enables the use of bandwidth from multiple racks when reading data.
How HDFS Store a File?

Step 1: Client Initiates the Write


• A user (or application) wants to store a file in HDFS.
• The HDFS client contacts the NameNode to request permission and metadata setup for the file.
Step 2: File is Split into Blocks
• The file is divided into fixed-size blocks, typically 128 MB each.
• For example, a 400 MB file becomes 4 blocks: Block1 (128 MB), Block2 (128 MB), Block3
(128 MB), Block4 (16 MB).
Step 3: NameNode Chooses DataNodes
• For each block, the NameNode picks three DataNodes (by default) where replicas will be stored.
• These selections are based on rack awareness and load balancing.
How HDFS Store a File?
Step 4: Pipeline is Created for Block Writing
• The client receives the list of chosen DataNodes for Block1.
• Then it starts sending Block1 to the first DataNode.
• That DataNode forwards it to the second, and the second forwards it to the third.
• This forms a write pipeline.
Step 5: Block is Replicated and Acknowledged
• Once Block1 is written and replicated on all three DataNodes, an acknowledgment is sent back to the client.
• This process repeats for each block until the entire file is stored.
Step 6: Metadata Stored by NameNode
• The NameNode stores:
• The filename
• Block IDs
• The DataNode locations for each block
Step 7: File is Ready in HDFS
• Now, the file is officially stored in HDFS and can be accessed by the client for reading.
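The same write path can be driven programmatically. Below is a minimal Java sketch, using the standard Hadoop FileSystem API, of a client storing a local file in HDFS; the local and HDFS paths are hypothetical, and the NameNode/DataNode interaction described in Steps 1-7 happens inside the copyFromLocalFile() call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        // Load Hadoop settings (core-site.xml, hdfs-site.xml) from the classpath
        Configuration conf = new Configuration();
        // Obtain a handle to the configured file system (HDFS)
        FileSystem fs = FileSystem.get(conf);
        // Copy a local file into HDFS; blocks are split, pipelined and replicated behind this call
        fs.copyFromLocalFile(new Path("/tmp/research.pdf"),       // local source (hypothetical)
                new Path("/user/data/research.pdf"));             // HDFS destination (hypothetical)
        fs.close();
    }
}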
Example
Say you store a file research.pdf (size = 300 MB):

HDFS splits it into 3 blocks:

• Block 1 → 128 MB
• Block 2 → 128 MB
• Block 3 → 44 MB

Each block is replicated to 3 different DataNodes:

• Block 1 → DN1, DN2, DN3


• Block 2 → DN2, DN4, DN5
• Block 3 → DN1, DN3, DN6

NameNode stores the block mapping info for research.pdf.


HDFS File Read Request Workflow

Step 1: The client calls open() on the FileSystem object (which, in the case of HDFS, is an instance of DistributedFileSystem) to open the file it wants to read.

Step 2: To find the locations of the file's first blocks, the DistributedFileSystem (DFS) makes a remote procedure call (RPC) to the name node. The addresses of the data nodes that have copies of each block are returned by the name node. The DFS then gives the client an FSDataInputStream to read data from. The FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.

Step 3: The client then calls read() on the stream. The DFSInputStream, which has stored the data node addresses for the first few blocks of the file, connects to the closest data node holding the file's first block.

Step 4: The client repeatedly calls read() on the stream as data is streamed back from the data node.

Step 5: When the end of a block is reached, the DFSInputStream closes the connection to that data node and finds the best data node for the next block. The client sees this transparently and perceives it as reading one continuous stream. As the client reads through the stream, the DFSInputStream opens new connections to data nodes as blocks are read; when necessary, it also calls the name node to obtain the data node locations for the next batch of blocks.
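A minimal Java sketch of this read path, using the standard FileSystem API; the HDFS path is hypothetical, and Steps 1-5 above happen inside open() and the subsequent reads:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                 // DistributedFileSystem for HDFS
        // Steps 1-2: open() makes the RPC to the name node for block locations
        FSDataInputStream in = fs.open(new Path("/user/data/sample.txt"));  // hypothetical path
        // Steps 3-5: read() streams the blocks from the closest data nodes
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}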
HDFS File Write Request Workflow

Step 1: The client calls create() on the DistributedFileSystem (DFS) to create the file.

Step 2: To create a new file in the file system namespace, without any blocks attached, the DFS sends an RPC call to the name node. The name node runs a number of checks to ensure that the file is new and that the client has the necessary permissions to create it. If these checks pass, the name node creates a record of the new file; if not, the client receives an error, such as an IOException, and the file is not created. The client can then begin writing data to the FSDataOutputStream that the DFS returns.

Step 3: The DFSOutputStream divides the data written by the client into packets, which it places on an internal buffer called the data queue. The DataStreamer consumes the data queue; it asks the name node to allocate new blocks and selects suitable data nodes to store the replicas. The chosen data nodes form a pipeline; here we assume three nodes in the pipeline, because the replication level is three. The first data node in the pipeline receives the packets from the DataStreamer, stores them, and forwards them to the second data node in the pipeline.

Step 4: In a similar manner, the packet is stored by the second data node and forwarded to the third (and final) data node in the pipeline.

Step 5: The DFSOutputStream maintains an internal "ack queue" of packets awaiting acknowledgement from the data nodes.

Step 6: When the client finishes writing, it flushes all of the remaining packets to the data node pipeline, waits for their acknowledgments, and then contacts the name node to signal that the file is complete.

HDFS adheres to the Write Once, Read Many paradigm. Therefore, files that are already saved in HDFS cannot be edited, but data can be appended by accessing the file again. Because of this design, HDFS can scale to support many concurrent clients, since data traffic is distributed across all of the cluster's data nodes. As a result, the system's throughput, scalability, and availability all increase.
Example

Saving Project.pdf (300 MB):

Block | Size   | Stored On
B1    | 128 MB | DN1, DN2, DN3
B2    | 128 MB | DN2, DN4, DN5
B3    | 44 MB  | DN1, DN5, DN6
Java interfaces to HDFS
HDFS provides several Java interfaces/APIs for interacting with the
Hadoop Distributed File System. These interfaces are used for writing
Java programs that can store, read, and manage files in HDFS.
1. File System Interface
This is the main Java interface for interacting with HDFS.
It provides methods to:
• Create and write files
• Open and read files
• Delete files or directories
• Check if a file exists
• List directory contents
Think of this as the controller that lets Java programs perform file
operations in HDFS.
Java interfaces to HDFS
2. Path Class
• Represents a file or folder location in HDFS.
• Used to specify where a file should be stored, read from, or deleted.
• Similar to how you'd use a file path in any file system.
3. Configuration Class
Loads the Hadoop environment settings from configuration files.
It helps the Java program know:
• Where the HDFS is running
• What the default file system is
• Any other Hadoop-specific settings
Java interfaces to HDFS
4. FSDataInputStream and FSDataOutputStream
These are data streams used to read from and write to HDFS
files.
They work similarly to Java's standard input/output streams but
are optimized for HDFS.
5. Other Supporting Interfaces
FileStatus: Used to get information about files, like size,
permissions, and modification date.
RemoteIterator: Used to iterate over files and directories in
HDFS.
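A short Java sketch tying the pieces above together (Configuration, FileSystem, Path, FileStatus, and RemoteIterator); the directory path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class HdfsListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // loads Hadoop environment settings
        FileSystem fs = FileSystem.get(conf);                // main entry point for file operations
        Path dir = new Path("/user/data");                   // hypothetical HDFS directory

        // FileStatus: size, permissions, modification time of a single path
        FileStatus status = fs.getFileStatus(dir);
        System.out.println(dir + " is a directory: " + status.isDirectory());

        // RemoteIterator: walk the files under the directory (recursively)
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            System.out.println(f.getPath() + "  " + f.getLen() + " bytes");
        }
        fs.close();
    }
}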
HDFS Command Line Interface (CLI)
HDFS CLI allows users to interact with the Hadoop Distributed
File System directly from the terminal. It is used for file
management, monitoring, and troubleshooting in HDFS without
writing code.
Core Purpose:
• Enables manual interaction with files stored in HDFS.
• Useful for administrators, developers, and analysts to
upload, organize, and manage large datasets.
• Supports scripting for automation of data workflows.
Common File Commands:
-ls → Lists files and directories in a given HDFS path.

-mkdir → Creates new directories in HDFS.

-put → Uploads files from the local file system to HDFS.

-get → Downloads files from HDFS to the local system.

-cat → Displays contents of an HDFS file.

-rm → Deletes files from HDFS.

-du, -df → Shows space usage and available capacity.
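For illustration, this is how the commands above are typically invoked from a terminal through the hdfs dfs front end; the paths and file names here are hypothetical:

hdfs dfs -mkdir /user/data
hdfs dfs -put sales.csv /user/data/
hdfs dfs -ls /user/data
hdfs dfs -cat /user/data/sales.csv
hdfs dfs -get /user/data/sales.csv ./sales-copy.csv
hdfs dfs -du -h /user/data
hdfs dfs -rm /user/data/sales.csv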


Hadoop File System Interfaces
HDFS is a distributed file system designed to store large datasets across multiple machines, providing high throughput, scalability, and fault tolerance.
1. Java API Interface
• The primary and most powerful interface for interacting with HDFS.
• Provides extensive functionality for file read/write, directory management, etc.
• Integrated deeply with Hadoop's MapReduce and other tools.
• Used mostly by Java developers building Hadoop-based applications.
2. Hadoop Shell (Command-Line Interface)
• Provides a set of commands similar to a UNIX/Linux shell (e.g., list, copy, move, delete).
• Very useful for administrators and testers.
• Allows direct interaction with HDFS from the terminal or scripts.
• Example commands: list directories, upload/download files, remove files.
3. Web Interface (WebHDFS / REST API)
• A language-agnostic RESTful interface that enables
communication over HTTP.
• Allows remote systems or web apps to access HDFS using
URL-based requests.
• Suitable for integration with external systems or web
dashboards.
• Supports operations like read, write, rename, and delete files.
4. C/C++ Interface (libhdfs)
• Provides access to HDFS from native C/C++ applications.
• Based on the Java Native Interface (JNI).
• Useful for legacy systems or performance-critical native applications.
• Includes functions for opening, reading, writing, and closing files.
5. Python Interface
• Widely used by data scientists and
analysts in machine learning and AI
projects.
• Enables programmatic file handling within
Python-based data workflows.
Map Reduce Data Flow
• The data processed by MapReduce should be stored in HDFS, which divides the data into blocks and stores them in a distributed manner. Below are the steps of the MapReduce data flow:
• Step 1: One block is processed by one mapper at a time. In the mapper, a developer can specify his own business logic as per the requirements. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.
• Step 2: The output of the Mapper, also known as intermediate output, is written to the local disk. The mapper output is not stored on HDFS because it is temporary data, and writing it to HDFS would create many unnecessary copies.
Map Reduce Data Flow
• Step 3: The output of the mappers is shuffled to the reducer nodes (a reducer node is a normal slave node on which the reduce phase runs). The shuffling/copying is a physical movement of data over the network.
• Step 4: Once all the mappers have finished and their output has been shuffled onto the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
• Step 5: Reduce is the second phase of processing, where the user can specify his own custom business logic as per the requirements. A reducer receives input from all the mappers. The output of the reducer is the final output, which is written to HDFS.
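A minimal sketch of a job that follows this data flow is the classic word count, written against the standard Hadoop MapReduce Java API; the class names and the input/output paths passed as arguments are illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Steps 1-2: one map task per block; emits (word, 1) pairs,
    // which the framework writes to local disk as intermediate output
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Steps 3-5: receives the shuffled, sorted (word, [counts]) groups and writes totals to HDFS
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}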
Data Ingest with Flume and Sqoop
In the Apache Hadoop ecosystem, Flume and Sqoop are essential tools for data
ingestion, each catering to specific data sources and targets.
Apache Flume: Streaming Data Ingestion
Flume is a distributed, reliable, and highly available service for collecting,
aggregating, and moving large amounts of streaming data into the Hadoop
Distributed File System (HDFS).
Key Features of Flume
Flume offers several features that make it well-suited for streaming data ingestion:
Scalability: Flume can be easily scaled horizontally by adding more agents to
accommodate increasing data volumes.
Fault Tolerance: Flume supports reliable and durable message delivery, ensuring
that no data is lost during the ingestion process.
Extensibility: Flume allows custom components to be developed and integrated,
making it highly adaptable to various data sources and sinks.
Flume Architecture: Key Components

The primary components of Flume’s architecture include:

Source: The component that receives data from external systems and
converts it into Flume events.

Channel: The storage mechanism that buffers the events between the
source and sink, providing durability and reliability.

Sink: The component that writes the events to the desired destination,
such as HDFS, HBase, or another Flume agent.
Apache Sqoop: Bulk Data Transfer

Sqoop is a tool designed to transfer bulk data between Hadoop and structured datastores, such as relational databases and data warehouses.

Key Features of Sqoop

Sqoop offers several features that make it an ideal choice for bulk data transfer:

Parallelism: Sqoop leverages MapReduce to perform parallel data transfers, improving transfer speeds and efficiency.

Connectors: Sqoop supports a wide range of connectors for various databases and datastores, providing flexibility and compatibility.
Incremental Data Import: Sqoop can perform
incremental data imports, enabling efficient and
up-to-date data synchronization between Hadoop and
external systems.
Sqoop Architecture: Key Components
The primary components of Sqoop's architecture include:
Connectors: The plugins that enable communication between Sqoop and the supported databases or datastores.
Import and Export Commands: The commands that initiate and control the data transfer process, allowing users to specify the source, target, and various transfer options.
Hadoop Archive
• A Hadoop Archive (HAR) is a special file format used in HDFS to bundle many small files into a single archive file. HARs reduce NameNode memory usage by minimizing the number of metadata entries. It is primarily used to improve performance and scalability when dealing with a large number of small files in HDFS.
Requirement of HAR:
• HDFS is designed for large files.
• When thousands or millions of small files are stored, the NameNode
gets overloaded as it must keep metadata for each file.
• HAR consolidates these files into a single logical file system,
reducing metadata overhead.
HDFS offers scalable storage, so it can be used as an archival storage system. The following Data Flow Diagram (DFD) depicts the pattern of HDFS as an archive store:
[Figure: DFD of HDFS as an archive store]
Hadoop I/O

• Hadoop comes with a set of primitives for data I/O.


• Some of these are techniques that are more general than
Hadoop, such as data integrity and compression, but deserve
special consideration when dealing with multiterabyte datasets.
• Others are Hadoop tools or APIs that form the building blocks
for developing distributed systems, such as serialization
frameworks and on-disk data structures.
Compression
File compression brings two major benefits:
• It reduces the space needed to store files, and it speeds up data
transfer across the network, or to or from disk.
• When dealing with large volumes of data, both of these savings
can be significant, so it pays to carefully consider how to use
compression in Hadoop.
A summary of compression formats
There are many different compression formats, tools, and algorithms, each with different characteristics. Table-1 lists some of the more common ones that can be used with Hadoop.

Compression format | Tool  | Algorithm | Filename extension | Multiple files | Splittable
DEFLATE[a]         | N/A   | DEFLATE   | .deflate           | No             | No
gzip               | gzip  | DEFLATE   | .gz                | No             | No
ZIP                | zip   | DEFLATE   | .zip               | Yes            | Yes, at file boundaries
bzip2              | bzip2 | bzip2     | .bz2               | No             | Yes
LZO                | lzop  | LZO       | .lzo               | No             | No
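These codecs are exposed to Java programs through Hadoop's CompressionCodec interface. Below is a minimal sketch of reading a gzip-compressed file from HDFS, letting CompressionCodecFactory pick the codec from the file extension; the path is hypothetical:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadCompressed {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/data/logs.gz");            // hypothetical gzip file in HDFS
        // The factory infers the codec (here gzip/DEFLATE) from the .gz extension
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(file);
        // Wrap the HDFS input stream so that decompressed bytes are returned
        try (InputStream in = codec.createInputStream(fs.open(file))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}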


Serialization
Serialization is the process of turning structured objects into a byte stream
for transmission over a network or for writing to persistent
storage. Deserialization is the reverse process of turning a byte stream
back into a series of structured objects.
Serialization appears in two quite distinct areas of distributed data
processing: for interprocess communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is
implemented using remote procedure calls (RPCs).
The RPC protocol uses serialization to render the message into a binary
stream to be sent to the remote node, which then deserializes the binary
stream into the original message.
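As a small illustration of serialization and deserialization, the sketch below round-trips one of Hadoop's built-in Writable types (the mechanism underlying classic Hadoop RPC) through a byte stream; the value 163 is arbitrary:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialization: turn the structured object into a byte stream
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytesOut));        // writes 4 bytes for the int

        // Deserialization: rebuild the object from the byte stream
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytesOut.toByteArray())));

        System.out.println(restored.get());                    // prints 163
    }
}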
Serialization Format
In general, it is desirable that an RPC serialization format is:
• Compact
A compact format makes the best use of network bandwidth, which is the most scarce resource in a data
center.
• Fast
Interprocess communication forms the backbone for a distributed system, so it is essential that there is as
little performance overhead as possible for the serialization and deserialization process.
• Extensible
Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol
in a controlled manner for clients and servers. For example, it should be possible to add a new argument to a
method call, and have the new servers accept messages in the old format (without the new argument) from
old clients.
Interoperable
For some systems, it is desirable to be able to support clients that are written in different languages to the
server, so the format needs to be designed to make this possible.
Avro

Apache Avro is a language-neutral data serialization system.

It was developed by Doug Cutting, the father of Hadoop.

Avro uses the JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
Avro Schemas Overview

Avro depends heavily on its schema.

It allows all data to be written with no prior knowledge of the schema.

It serializes quickly, and the resulting serialized data is smaller in size.

The schema is stored along with the Avro data in a file for any further processing.

Avro schemas are defined with JSON, which simplifies their implementation in languages with JSON libraries.

Like Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers, and Thrift.
Steps to Use Avro for Data Serialization

Step 1: Create the Schema
• Design an Avro schema in JSON format according to your data structure.

Step 2: Read the Schema into Your Program
• There are two ways to do this:
• Generate a class from the schema: compile the schema using the Avro tools; this creates a Java class corresponding to the schema.
• Use the parser library: read the schema dynamically using Avro's Schema.Parser class.

Step 3: Serialize the Data
• Use the serialization API provided in org.apache.avro.specific.
• Create an instance of DatumWriter and DataFileWriter to write the data.

Step 4: Deserialize the Data
• Use the deserialization API provided in org.apache.avro.specific.
• Create an instance of DatumReader and DataFileReader to read the data.
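A minimal Java sketch of these steps. For brevity it uses Avro's generic API (org.apache.avro.generic) with an inline schema rather than classes generated from org.apache.avro.specific; the schema, field values, and file name are illustrative:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    public static void main(String[] args) throws Exception {
        // Steps 1-2: define the schema in JSON and read it with Schema.Parser
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Step 3: serialize a record with DatumWriter + DataFileWriter
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);
        File file = new File("users.avro");                    // illustrative output file
        try (DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);                       // the schema is stored in the file
            writer.append(user);
        }

        // Step 4: deserialize with DatumReader + DataFileReader
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " " + rec.get("age"));
            }
        }
    }
}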
Hadoop Cluster
A Hadoop cluster is a special type of computational cluster designed for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
Whenever we talk about Hadoop clusters, two main terms come up, cluster and node, so defining them:
• A cluster is a collection of nodes.
• A node is a point of intersection/connection within a network, i.e., a server.
Setting Up a Hadoop Cluster
• Setting up a Hadoop cluster involves installing Hadoop on
multiple machines (or virtual machines) and configuring them to
work together in a distributed environment. Here's a
step-by-step guide to set up a basic multi-node Hadoop
cluster:
Prerequisites
1. Machines: Minimum 2 (1 Master + 1 or more Slaves/Workers)
2. OS: Ubuntu (preferred, but others like CentOS also work)
3. Java: Hadoop requires Java 8 or later.
4. SSH: Passwordless SSH between all nodes.
Setting Up a
Hadoop Cluster
1. Install Java on All Nodes
2. Create Hadoop User on All Nodes
3. Set Up SSH (Passwordless)
4. Download and Extract Hadoop
5. Set Environment Variables
6. Configure Hadoop
7. Distribute Hadoop to All Nodes
8. Format HDFS (On Master)
9. Start Hadoop Cluster
10. Verify Cluster
Hadoop Cluster Specification
Hadoop clusters have two types of machines, Master and Slave, where:
• Master: HDFS NameNode, YARN ResourceManager.
• Slaves: HDFS DataNodes, YARN NodeManagers.

• The NameNode is the HDFS master, which manages the file system namespace and regulates access to files by clients, and also consults with DataNodes (HDFS slaves) when copying data or running MapReduce operations.
• A DataNode manages the storage attached to the node it runs on. There are a number of DataNodes, typically one DataNode per slave in the cluster.
Cluster Setup and
Installation

As part of this example you will set up a


cluster with
• one load balancer,
• one administration server and
• four web server instances with session
replication enabled.
Cluster Setup and Installation
Identify the following machines:
• MachineA — Has both the load balancer and the administration server.
• MachineB, MachineC, MachineD and MachineE — Have the administration node and the web server instances running.
1. Install Administration Server on MachineA.
2. Install the Administration Node on MachineB, MachineC, MachineD
and MachineE.
3. Configure the Web Application.
Cluster Setup and Installation
4. Configure the Instances.
a) Launch WDM
b) Create a new configuration for the load balancer.
c) Set up the reverse proxy (Load balancer).
d) Create an instance.
e) Deploy the Configuration
5. Create and Start the Cluster.
a. Create a new configuration for the cluster.
b. Enable Session Replication.
c. Add the web application.
d. Create the instances.
e. Start the cluster.
Hadoop Security
3 A's of security
Organizations must prioritize the three core security
principles known as the 3 A's: Authentication, Authorization,
and Auditing to manage security concerns in a Hadoop
environment effectively.
1. Authentication:
The authentication process ensures that only authorized
users can access the Hadoop cluster. It entails
authenticating users' identities using various mechanisms
such as usernames and passwords, digital certificates, or
biometric authentication. Organizations can reduce the
risk of unauthorized access and secure their data from
dangerous actors by establishing strong authentication
protocols.
Hadoop Security

2. Authorization:
Authorization governs the actions an authenticated user can take
within the Hadoop cluster. It entails creating access restrictions and
permissions depending on the roles and responsibilities of the
users. Organizations can enforce the concept of least privilege by
allowing users only the privileges required to complete their tasks, if
adequate authorization procedures are in place. This reduces the
possibility of unauthorized data tampering or exposure.
3. Auditing:
Auditing is essential for monitoring and tracking user activity in the
Hadoop cluster. Organizations can investigate suspicious or
unauthorized activity by keeping detailed audit logs. Auditing also
aids in compliance reporting, allowing organizations to demonstrate
conformity with regulatory standards. Implementing real-time audit
log monitoring and analysis provides for the timely detection of
security incidents and the facilitation of proactive measures.
Administering Hadoop
Administering Hadoop involves managing and maintaining a
Hadoop cluster, which includes tasks like installation,
configuration, monitoring, troubleshooting, and security.
It also encompasses managing the Hadoop Distributed File
System (HDFS) and YARN (Yet Another Resource Negotiator), as
well as other components that work with the core Hadoop
system.
Key Aspects of Hadoop Administration:
• Installation and Configuration
• Cluster Management
• Monitoring and Performance Tuning
• Security
• Backup and Recovery
• Troubleshooting
• Tools and Techniques
• Hadoop Administration Roles
Monitoring
• HDFS administration includes monitoring the HDFS file structure, locations, and the updated files.
• MapReduce administration includes monitoring the list of applications, configuration of nodes, application status, etc.
Hadoop Benchmark
• Hadoop benchmarking involves using specific tools and
techniques to evaluate the performance of a Hadoop cluster.
• These benchmarks allow for testing and tuning the cluster's
hardware, software, and network for optimal performance.
Common Benchmark Tools:
• TestDFSIO – Measures HDFS I/O performance.
• TeraSort – Sorts large amounts of data; tests both MapReduce
and HDFS.
• MRBench – Tests performance of MapReduce framework.
• HiBench – A comprehensive benchmarking suite by Intel.
Hadoop in the Cloud
Hadoop in the cloud, also known as Hadoop as a Service (HaaS), refers to running the Apache Hadoop framework on a cloud provider's infrastructure, rather than managing it on-premises.
This allows organizations to leverage cloud-based resources for big data analytics without the need to invest in and manage their own hardware and software.
Examples of Cloud Hadoop Services:
1. Amazon EMR:
A fully managed Hadoop service on Amazon Web Services,
offering a wide range of options for running Hadoop clusters in
the cloud.
2. Google Dataproc:
A managed service for running Apache Hadoop and Apache
Spark clusters on Google Cloud Platform.
3. Azure HDInsight:
A managed Hadoop service on Microsoft Azure, providing a
variety of Hadoop and related services.
