Big Data PPT Unit 2

HDFS is designed to store very large files on commodity hardware, providing reliable storage and high throughput even when hardware fails. The NameNode manages the file system namespace and the DataNodes, while clients perform write and read operations by communicating directly with the DataNodes. HDFS also supports data integrity checks, compression, and serialization, and provides several file-based data structures for efficient data processing.


HDFS Design

• HDFS stores very large files on a cluster of commodity hardware.
• It is designed for a small number of large files rather than a huge number of small files.
• HDFS stores data reliably even in the case of hardware failure.
• It provides high throughput by serving data access in parallel.
Functions of HDFS NameNode
• It executes the file system namespace operations like opening, renaming, and closing files and
directories.
• NameNode manages and maintains the DataNodes.
• It determines the mapping of blocks of a file to DataNodes.
• NameNode records each change made to the file system namespace.
• It keeps the locations of each block of a file.
• NameNode takes care of the replication factor of all the blocks.
• NameNode receives heartbeats and block reports from all DataNodes, which confirm that the DataNodes are alive.
• If a DataNode fails, the NameNode chooses new DataNodes for new replicas.

Details given already in Unit-I…


HDFS Write Operation
Write Operation
• When a client wants to write a file to HDFS, it first contacts the NameNode for metadata. The NameNode responds with the number of blocks, their locations, the replica placement, and other details. Based on this information, the client interacts directly with the DataNodes.
• The client first sends block A to DataNode 1, along with the IP addresses of the other two DataNodes where replicas will be stored. When DataNode 1 receives block A from the client, it copies the block to DataNode 2 on the same rack; since both DataNodes are on the same rack, the block is transferred via the rack switch. DataNode 2 then copies the block to DataNode 4 on a different rack; since these DataNodes are on different racks, the block is transferred via an out-of-rack switch.
• Once a DataNode receives the block, it sends a write confirmation to the NameNode.
• The same process is repeated for each block of the file; a client-side sketch of such a write using the Java API follows below.
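From the client's point of view, the write pipeline described above is hidden behind the FileSystem API: the client obtains an output stream and writes bytes, while HDFS handles block placement and replication. A minimal sketch, where the cluster URI and file path are illustrative assumptions rather than values from the slides:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Contact the NameNode for metadata; the URI below is an assumed cluster address.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/blockA.txt");   // illustrative path
    // create() asks the NameNode for target DataNodes; the data then flows
    // directly from the client into the DataNode pipeline.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("sample data written through the DataNode pipeline");
    }
  }
}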
HDFS Read Operation
Read Operation
• To read from HDFS, the client first communicates with the NameNode for metadata. The NameNode responds with the locations of the DataNodes containing the blocks. After receiving the DataNode locations, the client then interacts directly with the DataNodes.

• The client starts reading data in parallel from the DataNodes based on the information received from the NameNode. The data flows directly from the DataNodes to the client.

• When the client or application has received all the blocks of the file, it combines these blocks back into the original file.
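The corresponding client-side read is just as simple with the Java API: open() fetches block locations from the NameNode and returns a stream that pulls data straight from the DataNodes. A minimal sketch, with an assumed cluster URI and path:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // open() asks the NameNode for the block locations of the file,
    // then streams the data directly from the DataNodes.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/blockA.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // print the file contents
    }
  }
}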
HDFS Interfaces
• Shell Interface to HDFS
• Web Interface to HDFS
• JAVA Interface to HDFS
• Internals of HDFS
JAVA Interfaces to HDFS
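A minimal sketch of the Java interface to HDFS, using the org.apache.hadoop.fs.FileSystem class for common file system operations; the cluster URI, directory, and file names are illustrative assumptions:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsJavaInterfaceExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path dir = new Path("/user/demo");        // illustrative directory
    fs.mkdirs(dir);                           // create a directory

    // List the files and directories under /user/demo.
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    fs.delete(new Path("/user/demo/old.txt"), false);  // delete a file (non-recursive)
    fs.close();
  }
}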
Dataflow
• Dataflow is used for processing and enriching batch or streaming data for use cases such as analysis, machine learning, or data warehousing.
• Dataflow is a serverless, fast, and cost-effective service that supports both stream and batch processing.
• It provides portability, with processing jobs written using the open source Apache Beam libraries, and removes operational overhead from your data engineering teams by automating infrastructure provisioning and cluster management.
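Since Dataflow runs pipelines written with the open source Apache Beam SDK, a minimal Beam word-count sketch in Java gives a feel for the programming model. The input/output paths are illustrative assumptions; the same pipeline can run locally or on the Dataflow service depending on the runner passed in the options:

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("input.txt"))          // illustrative source
     .apply("SplitWords", FlatMapElements.into(TypeDescriptors.strings())
         .via((String line) -> Arrays.asList(line.split("\\s+"))))
     .apply("CountWords", Count.perElement())
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("WriteCounts", TextIO.write().to("word-counts"));      // illustrative output prefix

    p.run().waitUntilFinish();
  }
}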
Data Integrity
• Data integrity in Hadoop is achieved by maintaining a checksum of the data written to each block.
• Whenever data is written to HDFS blocks, HDFS calculates the checksum for all the data written and verifies the checksum when it reads that data back.
• A separate checksum is created for every dfs.bytes.per.checksum bytes of data.
• The default for this property is 512 bytes; each checksum is 4 bytes long.
• DataNodes are responsible for verifying the checksums of the data they store.
• When clients read data from DataNodes, they also verify the checksums.
• Each DataNode periodically runs a DataBlockScanner to verify its blocks.
• If corrupt data is found, HDFS replaces the corrupt replica with a copy made from a good replica.
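The idea is easy to see in plain Java: a checksum is computed per fixed-size chunk on write and recomputed on read, and any mismatch signals corruption. The sketch below is only an illustration of the chunked-checksum idea using java.util.zip.CRC32 (HDFS itself uses CRC32C internally); the 512-byte chunk size mirrors the default dfs.bytes.per.checksum:

import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkChecksumDemo {
  static final int BYTES_PER_CHECKSUM = 512;   // mirrors the dfs.bytes.per.checksum default

  // Compute one checksum per 512-byte chunk of data, as done when writing.
  static List<Long> checksums(byte[] data) {
    List<Long> sums = new ArrayList<>();
    for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
      int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
      CRC32 crc = new CRC32();
      crc.update(data, off, len);
      sums.add(crc.getValue());
    }
    return sums;
  }

  public static void main(String[] args) {
    byte[] block = new byte[2048];               // stand-in for block data
    List<Long> stored = checksums(block);        // computed on write
    List<Long> onRead = checksums(block);        // recomputed on read
    System.out.println("data intact: " + stored.equals(onRead));
  }
}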
Data Compression
• When we think about Hadoop, we think about very large files stored in HDFS and a lot of data transferred among nodes in the Hadoop cluster while storing HDFS blocks or running MapReduce tasks.
• If you could somehow reduce the file size, that would help in reducing both storage requirements and data transfer across the network.
• That’s where data compression in Hadoop helps.
Data compression at various stages in Hadoop
We can compress data in Hadoop MapReduce at various stages.
• Compressing input files- You can compress the input files, which reduces storage space in HDFS. If you compress the input files, they are decompressed automatically when processed by a MapReduce job. The appropriate codec is determined from the file name extension; for example, if the file name extension is .snappy, the Hadoop framework will automatically use SnappyCodec to decompress the file.
• Compressing the map output- You can compress the intermediate map output. Since map output is written to disk and data from several map outputs is used by a reducer, data from the map outputs is transferred to the nodes where the reduce tasks run. By compressing the intermediate map output you reduce both the disk space used and the data transferred across the network.
• Compressing output files- You can also compress the output of a MapReduce job; a job-configuration sketch follows below.
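A hedged sketch of how these settings are typically wired into a MapReduce job driver: the map-output properties and the FileOutputFormat helpers are standard Hadoop configuration knobs, while the job name and the choice of Snappy/gzip codecs here are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate map output with Snappy (fast, low CPU overhead).
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compression-demo");   // illustrative job name

    // Compress the final job output with gzip.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
  }
}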
Hadoop compression formats
• There are many different compression formats available in the Hadoop framework. You will have to use the one that suits your requirement.
• Parameters that you need to look at are-
• Time it takes to compress.
• Space saving.
• Whether the compression format is splittable or not.

• Deflate- Deflate is the compression algorithm whose reference implementation is zlib. The Deflate algorithm is also used by the gzip compression tool.
Filename extension is .deflate.
• gzip- gzip compression is based on the Deflate compression algorithm. gzip is not as fast as LZO or Snappy but compresses better, so the space saving is greater.
Gzip is not splittable.
Filename extension is .gz.
• bzip2- Using bzip2 for compression provides a higher compression ratio, but the compressing and decompressing speed is slow.
Bzip2 is splittable; Bzip2Codec implements the SplittableCompressionCodec interface, which provides the capability to compress/decompress a stream starting at any arbitrary position.
Filename extension is .bz2.
• Snappy- The Snappy compressor from Google provides fast compression and decompression, but the compression ratio is lower.
Snappy is not splittable.
Filename extension is .snappy.
• LZO- LZO, just like Snappy, is optimized for speed, so it compresses and decompresses faster, but the compression ratio is lower.
LZO is not splittable by default, but you can index the .lzo files as a pre-processing step to make them splittable.
Filename extension is .lzo.
• LZ4- LZ4 has fast compression and decompression speed, but the compression ratio is lower. LZ4 is not splittable.
Filename extension is .lz4.
• Zstandard- Zstandard is a real-time compression algorithm providing high compression ratios. It offers a very wide range of compression/speed trade-offs.
Zstandard is not splittable.
Filename extension is .zstd.
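Choosing a codec by filename extension, as described above for input files, is done through CompressionCodecFactory. A minimal sketch, where the compressed input path is an illustrative assumption:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressByExtension {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("logs/events.gz");     // illustrative compressed input

    // Pick the codec from the filename extension (.gz -> GzipCodec, .bz2 -> BZip2Codec, ...).
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(input);
    if (codec == null) {
      System.err.println("No codec found for " + input);
      return;
    }

    try (InputStream in = codec.createInputStream(fs.open(input))) {
      IOUtils.copyBytes(in, System.out, 4096, false);   // stream the decompressed contents
    }
  }
}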
Serialization in Hadoop
• Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

• Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

• Serialization is used in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.

• In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).

• The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.

• In general, it is desirable that an RPC serialization format is compact, fast, extensible, and interoperable.

• Hadoop uses its own serialization format, Writables, which is certainly compact and fast, but not so easy to extend or use from languages other than Java.
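A small sketch of Writable serialization: an IntWritable is written to an in-memory byte stream and read back, which is essentially what the RPC layer and the file formats below do with larger records. The helper method names are purely illustrative:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableSerializationDemo {
  // Serialize any Writable into a byte array.
  static byte[] serialize(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (DataOutputStream dataOut = new DataOutputStream(out)) {
      writable.write(dataOut);
    }
    return out.toByteArray();
  }

  // Deserialize a byte array back into the given Writable instance.
  static void deserialize(Writable writable, byte[] bytes) throws IOException {
    try (DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(bytes))) {
      writable.readFields(dataIn);
    }
  }

  public static void main(String[] args) throws IOException {
    IntWritable original = new IntWritable(163);
    byte[] bytes = serialize(original);          // 4 bytes: a compact binary form

    IntWritable restored = new IntWritable();
    deserialize(restored, bytes);
    System.out.println(restored.get());          // prints 163
  }
}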
File based Data Structure in Hadoop
• Sequence File
• Map file
Sequence File
• Provides a persistent data structure for binary key-value pairs.
• Also works well as a container for smaller files.

There are two ways to seek to a given position in a sequence file:

• seek(long pos)- positions the reader at the given point in the file. The next() method fails if the position is not a record boundary.
• sync(long pos)- positions the reader at the next sync point after the given position. We can call sync() with any position in the stream, a non-record boundary for example, and the reader will establish itself at the next sync point so reading can continue.
• SequenceFile.Writer has a method called sync() for inserting a sync point at the current position in the stream.

• The hadoop fs command has a -text option to display sequence files in textual form. It attempts to detect the type of the file and convert it to text appropriately. It recognizes gzipped files and sequence files; otherwise it assumes the input is plain text.
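A minimal sketch of writing and reading a SequenceFile with SequenceFile.Writer and SequenceFile.Reader; the file name and key/value types are illustrative choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("numbers.seq");         // illustrative file name

    // Write binary key-value pairs (IntWritable key, Text value).
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      for (int i = 0; i < 100; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
      // writer.sync() could be called here to insert an explicit sync point.
    }

    // Read the pairs back in order.
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(path))) {
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    }
  }
}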
Map File
• Sorted SequenceFile with an index to permit lookups by key.
• MapFile can be thought of as a persistent form of java.util.Map (although it doesn’t implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
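A hedged sketch of building and probing a MapFile: keys must be appended in sorted order, and get() uses the index to look up a key. The directory name and types are illustrative, and the older FileSystem-based constructors are used here for brevity (newer releases also offer option-based constructors):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "numbers.map";                  // a MapFile is a directory (data + index)

    // Keys must be appended in sorted order.
    try (MapFile.Writer writer = new MapFile.Writer(conf, fs, dir,
        IntWritable.class, Text.class)) {
      for (int i = 0; i < 100; i++) {
        writer.append(new IntWritable(i), new Text("entry-" + i));
      }
    }

    // Look up a key through the in-memory index.
    try (MapFile.Reader reader = new MapFile.Reader(fs, dir, conf)) {
      Text value = new Text();
      reader.get(new IntWritable(42), value);
      System.out.println(value);                 // prints entry-42
    }
  }
}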
Map File Variants
• SetFile is a specialization of MapFile for storing a set of Writable keys. The keys must be added in
sorted order.

• ArrayFile is a MapFile where the key is an integer representing the index of the element in the
array and the value is a Writable value.

• BloomMapFile is a MapFile which offers a fast version of the get() method, especially for
sparsely populated files. The implementation uses a dynamic bloom filter for testing whether a
given key is in the map. The test is very fast since it is in-memory, but it has a non-zero probability
of false positives, in which case the regular get() method is called.
