Big Data AKTU Unit 3
Use cases of HDFS:
• Big Data Analytics: processing large datasets (e.g., log analysis, fraud detection).
• Machine Learning Pipelines: storing and preprocessing training data.
• Cloud Storage Solutions: used in data lakes and enterprise data warehouses.
• Social Media & Web Analytics: storing and analyzing user interactions.
• Healthcare & Genomics: processing large medical datasets.
HDFS Design
• HDFS (Hadoop Distributed File System) is designed to handle large-scale data storage and processing efficiently across a distributed cluster. The design of HDFS is influenced by the need for scalability, fault tolerance, high throughput, and cost-effectiveness, while being optimized for batch processing workloads.
Design Goals of HDFS
• Scalability
• Fault Tolerance
• High Throughput
• Streaming Data Access
• Simple Coherency Model
• Economical Storage
HDFS Architecture
HDFS is an Open source component of the Apache Software Foundation that manages
data.
Name nodes, secondary name nodes, data nodes, checkpoint nodes, backup
nodes, and blocks all make up the architecture of HDFS.
HDFS is fault-tolerant and is replicated. Files are distributed across the cluster systems
using the Name node and Data Nodes.
The primary difference between Hadoop and Apache HBase is that Apache HBase is a non-relational (NoSQL) database, whereas Apache Hadoop is a distributed framework and data store for storing and processing data, not a database.
HDFS follows a master-slave architecture, which includes the following elements:
• NameNode (along with secondary, checkpoint, and backup nodes)
• DataNodes
• Blocks
Name Node
The NameNode (Master Node) manages all DataNode blocks, performing tasks such as:
• Monitoring and controlling DataNodes.
• Granting user access to files.
• Storing block metadata.
• Committing EditLogs to disk after every write for recovery.
DataNodes operate independently, and failures don’t impact the cluster as lost blocks are
re-replicated. Only the NameNode tracks all DataNodes, managing communication centrally.
The NameNode maintains two key files:
FsImage: A snapshot of the entire filesystem, storing directories and file metadata hierarchically.
EditLogs: Records real-time changes, tracking file modifications for recovery.
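To see the block metadata that the NameNode serves, a client can ask for a file's block locations. The sketch below is illustrative only; it assumes a reachable cluster and a hypothetical file /data/sample.log, and uses the standard Hadoop FileSystem client API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockMetadataExample {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS points at the cluster's NameNode
            // and that /data/sample.log already exists in HDFS (hypothetical path).
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/data/sample.log");
                FileStatus status = fs.getFileStatus(file);

                // The NameNode answers this query from its in-memory metadata;
                // no block data is read from the DataNodes.
                BlockLocation[] blocks =
                        fs.getFileBlockLocations(status, 0, status.getLen());

                for (BlockLocation block : blocks) {
                    System.out.println("offset=" + block.getOffset()
                            + " length=" + block.getLength()
                            + " hosts=" + String.join(",", block.getHosts()));
                }
            }
        }
    }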
Secondary NameNode
The Secondary NameNode is not a standby NameNode. When the NameNode's EditLogs grow too large, the Secondary NameNode performs a checkpoint. The Secondary NameNode performs the following duties:
1. It periodically fetches the current FsImage and EditLogs from the NameNode.
2. It merges the EditLogs into the FsImage to produce a new, compacted checkpoint image.
3. It returns the checkpointed FsImage to the NameNode, which keeps the EditLogs small and shortens restart time after a failure.
Checkpoint Node
The Checkpoint Node periodically creates checkpoints of the namespace. Its directory structure is identical to that of the NameNode, so the checkpointed image is always available to the NameNode if needed.
Backup Node
The Backup Node keeps an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode, receiving a stream of edits as they occur. Because its state is already current, it does not need to download the FsImage and EditLogs from the NameNode to create a checkpoint.
The Backup Node is not used to recover from failures of DataNodes; block replication across DataNodes serves that purpose. DataNodes store the actual data blocks, while the FsImage and EditLogs are maintained by the NameNode. DataNodes themselves do not provide high availability.
Blocks
• The block size in HDFS is configurable (128 MB by default) and can be tuned based on performance needs.
• Blocks are written to DataNodes and replicated for fault tolerance.
• Automatic recovery occurs if a node fails: lost blocks are restored from their replicas.
• HDFS provides its own file-system layer rather than using the hard drive directly.
• It scales horizontally as data and users grow.
• File storage mechanism: if a file exceeds the block size, it is split across multiple blocks.
Example: A 135 MB file with a 128 MB block size creates two blocks: 128 MB + 7 MB.
HDFS (Hadoop Distributed File System)
• Scalability – HDFS is highly scalable and can store and process petabytes of data by adding new nodes to the cluster.
• High Throughput – It is optimized for batch processing and supports parallel data access, improving performance.
• Cost-Effective – It runs on commodity hardware, reducing the cost compared to traditional storage solutions.
• Write Once, Read Many (WORM) Model – Data in HDFS is written once and read multiple times, making it ideal for big data analytics.
• Data Locality – HDFS moves computation to data rather than moving data to computation, reducing network bottlenecks.
HDFS and Small Files
HDFS is inefficient at handling a large number of small files, since each file's metadata is stored in NameNode memory.
Solution: Use HAR (Hadoop Archive), SequenceFile, or Avro to merge small files (a sketch follows below).
Compression (Optional):
HDFS supports Snappy, Gzip, LZO, etc., which reduce file size and improve efficiency.
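As a rough illustration of the SequenceFile approach, the hedged sketch below packs in-memory small-file contents into one block-compressed SequenceFile, so the NameNode tracks a single large file instead of thousands of tiny ones. The class name, method name, and key/value choices are illustrative, not part of Hadoop.

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
        // Packs many small payloads into one SequenceFile.
        public static void pack(Configuration conf, Path target,
                                Map<String, byte[]> smallFiles) throws IOException {
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(target),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(CompressionType.BLOCK))) {
                for (Map.Entry<String, byte[]> e : smallFiles.entrySet()) {
                    // Key = original file name, value = its raw contents.
                    writer.append(new Text(e.getKey()), new BytesWritable(e.getValue()));
                }
            }
        }
    }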
Block Size in HDFS
• Files in HDFS are broken into block-sized chunks called data blocks. These blocks
are stored as independent units.
• The size of these HDFS data blocks is 128 MB by default. We can configure the block
size as per our requirement by changing the dfs.block.size property in hdfs-site.xml
• Hadoop distributes these blocks on different slave machines, and the master machine
stores the metadata about blocks location.
• All the blocks of a file are of the same size except the last one (if the file size is not a multiple of 128 MB). See the example below to understand this fact.
Block Size in HDFS
Suppose we have a file of size 612 MB, and we are using the default block
configuration (128 MB). Therefore five blocks are created, the first four blocks
are 128 MB in size, and the fifth block is 100 MB in size (128*4+100=612).
From the above example, we can conclude that:
1. A file in HDFS, smaller than a single block does not occupy a full block size
space of the underlying storage.
2. Each file stored in HDFS doesn’t need to be an exact multiple of the configured
block size.
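The same arithmetic can be sketched in a few lines of Java. The Configuration call only shows the property name (dfs.blocksize, the newer spelling of dfs.block.size) with an example value; running this locally does not by itself change any cluster setting.

    import org.apache.hadoop.conf.Configuration;

    public class BlockMath {
        public static void main(String[] args) {
            // In practice the block size is configured once in hdfs-site.xml;
            // this only illustrates the property name and a typical value.
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB

            long fileSizeMb = 612;   // the 612 MB file from the example above
            long blockSizeMb = 128;

            long fullBlocks = fileSizeMb / blockSizeMb;               // 4
            long lastBlockMb = fileSizeMb % blockSizeMb;              // 100
            long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0); // 5

            System.out.println(totalBlocks + " blocks: " + fullBlocks + " x "
                    + blockSizeMb + " MB + 1 x " + lastBlockMb + " MB");
        }
    }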
• Now let’s see the reasons behind the large size of the data blocks in HDFS.
Why are blocks in HDFS huge?
The default size of the HDFS data block is 128 MB. The reasons for
the large size of blocks are:
To minimize the cost of seeks: with large blocks, the time taken to transfer the data from disk is much longer than the time taken to seek to the start of the block. As a result, a file made of multiple blocks is transferred at close to the disk transfer rate.
If blocks were small, there would be too many blocks in Hadoop HDFS and thus too much metadata to store. Managing such a huge number of blocks and their metadata creates overhead and extra network traffic.
Block Abstraction in HDFS (Hadoop Distributed File System)
HDFS (Hadoop Distributed File System) stores large files across multiple
nodes in a Hadoop cluster. Instead of treating files as a whole, HDFS
divides them into blocks, which are the basic unit of storage.
Key Concepts of Block Abstraction in HDFS:
1. Block Size:
• The default block size in HDFS is 128 MB (configurable).
• Unlike traditional file systems (which have small block sizes like 4 KB), HDFS uses large blocks to minimize metadata overhead and improve performance.
2. Splitting and Storage:
• Large files are split into fixed-size blocks and distributed across multiple nodes.
• Example: A 500 MB file with a block size of 128 MB will be split into 3 full blocks (128 MB each) and 1 smaller block (116 MB).
Block Abstraction in HDFS (Hadoop Distributed File System)
3. Replication for Fault Tolerance:
• Each block is replicated (default replication factor: 3).
• Replicas are stored on different nodes to ensure data availability in case of node failure.
4. Block Mapping:
• The NameNode keeps track of where each block is stored but does not store the actual data.
• The DataNodes store the actual blocks.
5. Processing and Parallelism:
• Since files are split into blocks across multiple nodes, Hadoop's MapReduce can process them in parallel, improving efficiency.
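As a small illustration of block replication from the client side, the sketch below reads and adjusts a file's replication factor through the standard FileSystem API; the path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/data/events.log");  // hypothetical path

                // Read the replication factor recorded by the NameNode for this file.
                short current = fs.getFileStatus(file).getReplication();
                System.out.println("current replication = " + current);

                // Request a different replication factor; the NameNode schedules
                // the extra copies (or removals) on the DataNodes asynchronously.
                fs.setReplication(file, (short) 3);
            }
        }
    }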
Example of Block Abstraction in HDFS:
To provide reliable storage, HDFS stores large files in multiple locations in a large cluster, with each file stored as a sequence of blocks. All blocks in a file are the same size, except for the final block, which fills up as data is added.
For added protection, HDFS files are write-once: only one writer can modify a file at any time.
To help ensure that all data is replicated as instructed, the NameNode receives a heartbeat (a periodic status report) and a blockreport (the block ID, generation stamp, and length of every block replica) from every DataNode attached to the cluster. Receiving a heartbeat indicates that the DataNode is working correctly.
• The NameNode selects the rack ID for each DataNode by using a process called
Hadoop Rack Awareness to help prevent the loss of data if an entire rack fails.
This also enables the use of bandwidth from multiple racks when reading data.
How Does HDFS Store a File?
Example: a 300 MB file with a 128 MB block size is stored as three blocks:
• Block 1 → 128 MB
• Block 2 → 128 MB
• Block 3 → 44 MB
HDFS File Read Request Workflow
Step 2: To find the locations of the file's first blocks, the Distributed File System (DFS) makes a remote procedure call (RPC) to the name node. The addresses of the data nodes that have copies of each block are returned by the name node.
Step 5: When a block ends, DFSInputStream closes the connection to that data node and locates the optimal data node for the next block. The client observes this transparently and perceives it as reading one continuous stream. As the client reads through the stream, DFSInputStream establishes new connections with data nodes block by block. When necessary, it also calls the name node to obtain the positions of the data nodes for the upcoming batch of blocks.
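A minimal read sketch using the public HDFS client API is shown below; it assumes a configured cluster and a hypothetical file /data/sample.log. The NameNode RPC and the DataNode connections described above happen inside open() and the returned stream.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf);
                 // open() triggers the RPC to the name node described in Step 2;
                 // the returned stream pulls blocks from data nodes as it is read.
                 FSDataInputStream in = fs.open(new Path("/data/sample.log"));
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }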
HDFS File Write Request Workflow
Step 1: The client calls DistributedFileSystem (DFS)'s create() function to create the file.
Step 2: To create a new file in the file system namespace without any blocks attached, DFS sends an RPC call to the name node. To ensure that the file is new and that the client has the necessary permissions to create it, the name node runs a number of checks. If these checks succeed, the name node creates a record of the new file; if not, the client receives an error, such as an IOException, and the file is not created. The client can then begin writing data to the FSDataOutputStream that the DFS returns.
Step 3: The DFSOutputStream divides the data written by the client into packets, which it writes to an internal buffer called the data queue. The DataStreamer consumes the data queue and is responsible for asking the name node to allocate new blocks by selecting suitable data nodes to store the replicas. The chosen data nodes form a pipeline; here we assume there are three nodes in the pipeline because the replication factor is three. The first data node in the pipeline receives the packets from the DataStreamer and stores them before forwarding them to the second data node in the pipeline.
Step 4: In a similar manner, the packet is stored by the second data node and forwarded to the third (and final) data node in the pipeline.
Step 5: The DFSOutputStream maintains an internal "ack queue" of packets awaiting acknowledgement from the data nodes; a packet is removed from the queue only once it has been acknowledged by all data nodes in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream. This action pushes all of the remaining packets into the data node pipeline, waits for the acknowledgements, and then contacts the name node to signal that the file is complete.
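The corresponding write path can be sketched with the public client API; the target path is hypothetical, and the packet pipeline described in Steps 3-5 is handled internally by the stream.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                Path target = new Path("/data/output.txt");  // hypothetical path

                // create() corresponds to Steps 1-2: the name node records the new
                // file (or raises an error) and an FSDataOutputStream is returned.
                try (FSDataOutputStream out = fs.create(target, /* overwrite */ true)) {
                    // Writes are split into packets and pushed through the
                    // data node pipeline behind the scenes (Steps 3-5).
                    out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
                } // close() flushes the remaining packets and completes the file (Step 6).
            }
        }
    }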
HDFS adheres to the Write Once, Read Many paradigm. Therefore, files that are already saved in HDFS cannot be edited in place, but data can be appended by reopening the file. Because of this design, HDFS can scale to support many concurrent clients, since data traffic is distributed across all of the cluster's data nodes. As a result, the system's throughput, scalability, and availability all increase.
Example: Apache Flume Components
Source: The component that receives data from external systems and
converts it into Flume events.
Channel: The storage mechanism that buffers the events between the
source and sink, providing durability and reliability.
Sink: The component that writes the events to the desired destination,
such as HDFS, HBase, or another Flume agent.
Sqoop is a tool designed to transfer bulk data between Hadoop and structured datastores, such as relational databases and enterprise data warehouses.
Avro
Avro is a data serialization framework created by Doug Cutting, the father of Hadoop.
Steps to Use Avro for Data Serialization:
Step 1: Create Schemas
• Design an Avro schema in JSON format according to your data structure.
Step 4: Deserialize the Data
• Use the deserialization API provided in org.apache.avro.specific.
• Create an instance of DatumReader and DataFileReader to read the data.
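A hedged round-trip sketch using Avro's generic API (rather than the generated-class specific API mentioned above) is shown below; the schema, record values, and file name are illustrative.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroRoundTrip {
        public static void main(String[] args) throws Exception {
            // Step 1: the schema, defined in JSON.
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                    + "{\"name\":\"name\",\"type\":\"string\"},"
                    + "{\"name\":\"age\",\"type\":\"int\"}]}");
            File file = new File("users.avro");

            // Serialize: DatumWriter + DataFileWriter.
            try (DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, file);
                GenericRecord user = new GenericData.Record(schema);
                user.put("name", "Asha");
                user.put("age", 30);
                writer.append(user);
            }

            // Step 4, Deserialize: DatumReader + DataFileReader.
            try (DataFileReader<GenericRecord> reader =
                         new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
                for (GenericRecord record : reader) {
                    System.out.println(record.get("name") + " " + record.get("age"));
                }
            }
        }
    }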
Hadoop Cluster
For the purpose of storing and analyzing huge amounts of unstructured data in a distributed computing environment, a special type of computational cluster is used, which we call a Hadoop cluster.
Whenever we talk about Hadoop clusters, two main terms come up, cluster and node, so defining them:
• A cluster is a collection of nodes.
• A node is a point of intersection/connection within a network, i.e., a server.
Setting Up a Hadoop Cluster
• Setting up a Hadoop cluster involves installing Hadoop on
multiple machines (or virtual machines) and configuring them to
work together in a distributed environment. Here's a
step-by-step guide to set up a basic multi-node Hadoop
cluster:
Prerequisites
1. Machines: Minimum 2 (1 Master + 1 or more Slaves/Workers)
2. OS: Ubuntu (preferred, but others like CentOS also work)
3. Java: Hadoop requires Java 8 or later.
4. SSH: Passwordless SSH between all nodes.
Setting Up a Hadoop Cluster
1. Install Java on All Nodes
2. Create Hadoop User on All Nodes
3. Set Up SSH (Passwordless)
4. Download and Extract Hadoop
5. Set Environment Variables
6. Configure Hadoop (key properties are sketched after this list)
7. Distribute Hadoop to All Nodes
8. Format HDFS (On Master)
9. Start Hadoop Cluster
10. Verify Cluster
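For step 6 (Configure Hadoop), the key properties normally go into core-site.xml and hdfs-site.xml on every node. The sketch below only illustrates the property names and typical values through the Java Configuration API; the hostname "master", the directory paths, and the replication factor of 2 are assumptions.

    import org.apache.hadoop.conf.Configuration;

    public class ClusterConfigSketch {
        public static void main(String[] args) {
            // These properties normally live in core-site.xml and hdfs-site.xml;
            // setting them here only illustrates the names and typical values.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://master:9000");              // core-site.xml
            conf.set("dfs.namenode.name.dir", "/hadoop/data/namenode");  // hdfs-site.xml (master)
            conf.set("dfs.datanode.data.dir", "/hadoop/data/datanode");  // hdfs-site.xml (workers)
            conf.setInt("dfs.replication", 2);                           // e.g. 2 for a small cluster

            System.out.println("NameNode address: " + conf.get("fs.defaultFS"));
        }
    }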
Hadoop Cluster Specification
Hadoop clusters have two types of machines, Master and Slave, where:
• Master: HDFS NameNode, YARN ResourceManager.
• Slaves: HDFS DataNodes, YARN NodeManagers.
Hadoop Cluster Security
2. Authorization:
Authorization governs the actions an authenticated user can take
within the Hadoop cluster. It entails creating access restrictions and
permissions depending on the roles and responsibilities of the
users. Organizations can enforce the concept of least privilege by
allowing users only the privileges required to complete their tasks, if
adequate authorization procedures are in place. This reduces the
possibility of unauthorized data tampering or exposure.
3. Auditing:
Auditing is essential for monitoring and tracking user activity in the
Hadoop cluster. Organizations can investigate suspicious or
unauthorized activity by keeping detailed audit logs. Auditing also
aids in compliance reporting, allowing organizations to demonstrate
conformity with regulatory standards. Implementing real-time audit
log monitoring and analysis provides for the timely detection of
security incidents and the facilitation of proactive measures.
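One concrete form of authorization in a Hadoop cluster is HDFS file and directory permissions. The sketch below applies a least-privilege setting to a hypothetical directory; the user and group names are assumptions, and changing ownership requires HDFS superuser rights.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class LeastPrivilegeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                Path dir = new Path("/projects/fraud-analytics");  // hypothetical directory

                // Restrict the directory to its owning user and group: the owner can
                // read/write/traverse, the group can read/traverse, others get nothing.
                fs.setOwner(dir, "analyst", "analytics");   // requires superuser privileges
                fs.setPermission(dir, new FsPermission("750"));
            }
        }
    }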
Administering Hadoop
Administering Hadoop involves managing and maintaining a
Hadoop cluster, which includes tasks like installation,
configuration, monitoring, troubleshooting, and security.
It also encompasses managing the Hadoop Distributed File
System (HDFS) and YARN (Yet Another Resource Negotiator), as
well as other components that work with the core Hadoop
system.
Key Aspects of Hadoop Administration (Hadoop Administration Roles):
• Installation and Configuration
• Cluster Management
• Monitoring and Performance Tuning
• Security
• Troubleshooting
• Backup and Recovery
• Tools and Techniques
Monitoring
• HDFS administration includes monitoring the HDFS file structure, locations, and the updated files.
• MapReduce administration includes monitoring the list of applications, configuration of nodes, application status, etc.
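A small monitoring sketch: the FileSystem.getStatus() call reports the cluster-wide capacity figures an administrator would track when watching HDFS storage health.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    public class HdfsCapacityCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                // Capacity, used, and remaining space across the file system.
                FsStatus status = fs.getStatus();
                long gb = 1024L * 1024 * 1024;
                System.out.println("capacity : " + status.getCapacity()  / gb + " GB");
                System.out.println("used     : " + status.getUsed()      / gb + " GB");
                System.out.println("remaining: " + status.getRemaining() / gb + " GB");
            }
        }
    }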
Hadoop Benchmark
• Hadoop benchmarking involves using specific tools and
techniques to evaluate the performance of a Hadoop cluster.
• These benchmarks allow for testing and tuning the cluster's
hardware, software, and network for optimal performance.
Common Benchmark Tools:
• TestDFSIO – Measures HDFS I/O performance.
• TeraSort – Sorts large amounts of data; tests both MapReduce
and HDFS.
• MRBench – Tests performance of MapReduce framework.
• HiBench – A comprehensive benchmarking suite by Intel.
Hadoop in the Cloud
Hadoop in the cloud, also known as Hadoop as a Service (HaaS), refers to running the Apache Hadoop framework on a cloud provider's infrastructure, rather than managing it on-premises.
This allows organizations to leverage cloud-based resources for big data analytics without the need to invest in and manage their own hardware and software.
Examples of Cloud Hadoop Services:
1. Amazon EMR:
A fully managed Hadoop service on Amazon Web Services,
offering a wide range of options for running Hadoop clusters in
the cloud.
2. Google Dataproc:
A managed service for running Apache Hadoop and Apache
Spark clusters on Google Cloud Platform.
3. Azure HDInsight:
A managed Hadoop service on Microsoft Azure, providing a
variety of Hadoop and related services.