Big Data PPT Unit 2

HDFS is designed to store very large files on commodity hardware, providing reliable storage and high throughput even when hardware fails. The NameNode manages the file system namespace and the DataNodes, while clients perform write and read operations by communicating directly with the DataNodes. HDFS also supports data integrity checks, compression, and serialization, and provides several file-based data structures for efficient data processing.


HDFS Design

• HDFS stores very large files on a cluster of commodity hardware.
• It is designed for a small number of large files rather than a huge number of small files.
• HDFS stores data reliably even in the case of hardware failure.
• It provides high throughput by serving data access in parallel.
Functions of HDFS NameNode
• It executes the file system namespace operations like opening, renaming, and closing files and
directories.
• NameNode manages and maintains the DataNodes.
• It determines the mapping of blocks of a file to DataNodes.
• NameNode records each change made to the file system namespace.
• It keeps the locations of each block of a file.
• NameNode takes care of the replication factor of all the blocks.
• NameNode receives heartbeats and block reports from all DataNodes, which confirm that the DataNodes are alive.
• If a DataNode fails, the NameNode chooses new DataNodes for new replicas.

Details given already in Unit-I…


HDFS Write Operation
Write Operation
• When a client wants to write a file to HDFS, it first contacts the NameNode for metadata. The NameNode responds with the number of blocks, their locations, the replica placement, and other details. Based on this information, the client interacts directly with the DataNodes.
• The client first sends block A to DataNode 1, along with the IP addresses of the other two DataNodes where replicas will be stored. When DataNode 1 receives block A from the client, it copies the block to DataNode 2 on the same rack; since both DataNodes are on the same rack, the block is transferred via the rack switch. DataNode 2 then copies the block to DataNode 4 on a different rack; since these DataNodes are on different racks, the block is transferred via an out-of-rack switch.
• Once a DataNode receives the block, it sends a write confirmation to the NameNode.
• The same process is repeated for each block of the file; a client-side sketch of such a write using the Java API follows below.
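From the client's point of view, the write pipeline described above is hidden behind the FileSystem API: the client obtains an output stream and writes bytes, while HDFS handles block placement and replication. A minimal sketch, where the cluster URI and file path are illustrative assumptions rather than values from the slides:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Contact the NameNode for metadata; the URI below is an assumed cluster address.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/blockA.txt");   // illustrative path
    // create() asks the NameNode for target DataNodes; the data then flows
    // directly from the client into the DataNode pipeline.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("sample data written through the DataNode pipeline");
    }
  }
}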
HDFS Read Operation
Read Operation
• To read from HDFS, the client first communicates with the NameNode for metadata. The NameNode responds with the locations of the DataNodes containing the blocks. After receiving the DataNode locations, the client then interacts directly with the DataNodes.

• The client starts reading data in parallel from the DataNodes based on the information received from the NameNode. The data flows directly from the DataNodes to the client.

• When the client or application has received all the blocks of the file, it combines these blocks back into the original file.
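The corresponding client-side read is just as simple with the Java API: open() fetches block locations from the NameNode and returns a stream that pulls data straight from the DataNodes. A minimal sketch, with an assumed cluster URI and path:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // open() asks the NameNode for the block locations of the file,
    // then streams the data directly from the DataNodes.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/blockA.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // print the file contents
    }
  }
}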
HDFS Interfaces
• Shell Interface to HDFS
• Web Interface to HDFS
• JAVA Interface to HDFS
• Internals of HDFS
JAVA Interfaces to HDFS
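A minimal sketch of the Java interface to HDFS, using the org.apache.hadoop.fs.FileSystem class for common file system operations; the cluster URI, directory, and file names are illustrative assumptions:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsJavaInterfaceExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path dir = new Path("/user/demo");        // illustrative directory
    fs.mkdirs(dir);                           // create a directory

    // List the files and directories under /user/demo.
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    fs.delete(new Path("/user/demo/old.txt"), false);  // delete a file (non-recursive)
    fs.close();
  }
}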
Dataflow
• Dataflow is used for processing and enriching batch or streaming data for use cases such as analysis, machine learning, or data warehousing.
• Dataflow is a serverless, fast, and cost-effective service that supports both stream and batch processing.
• It provides portability, with processing jobs written using the open source Apache Beam libraries, and removes operational overhead from your data engineering teams by automating infrastructure provisioning and cluster management.
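Since Dataflow runs pipelines written with the open source Apache Beam SDK, a minimal Beam word-count sketch in Java gives a feel for the programming model. The input/output paths are illustrative assumptions; the same pipeline can run locally or on the Dataflow service depending on the runner passed in the options:

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("input.txt"))          // illustrative source
     .apply("SplitWords", FlatMapElements.into(TypeDescriptors.strings())
         .via((String line) -> Arrays.asList(line.split("\\s+"))))
     .apply("CountWords", Count.perElement())
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("WriteCounts", TextIO.write().to("word-counts"));      // illustrative output prefix

    p.run().waitUntilFinish();
  }
}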
Data Integrity
• Data integrity in Hadoop is achieved by maintaining a checksum of the data written to each block.
• Whenever data is written to HDFS blocks, HDFS calculates the checksum for all the data written and verifies the checksum when it reads that data back.
• A separate checksum is created for every dfs.bytes.per.checksum bytes of data.
• The default for this property is 512 bytes; each checksum is 4 bytes long.
• DataNodes are responsible for verifying the checksums of the data they store.
• When clients read data from DataNodes, they also verify the checksums.
• Each DataNode periodically runs a DataBlockScanner to verify its blocks.
• If corrupt data is found, HDFS replaces the corrupt replica with a copy made from a good replica.
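The idea is easy to see in plain Java: a checksum is computed per fixed-size chunk on write and recomputed on read, and any mismatch signals corruption. The sketch below is only an illustration of the chunked-checksum idea using java.util.zip.CRC32 (HDFS itself uses CRC32C internally); the 512-byte chunk size mirrors the default dfs.bytes.per.checksum:

import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkChecksumDemo {
  static final int BYTES_PER_CHECKSUM = 512;   // mirrors the dfs.bytes.per.checksum default

  // Compute one checksum per 512-byte chunk of data, as done when writing.
  static List<Long> checksums(byte[] data) {
    List<Long> sums = new ArrayList<>();
    for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
      int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
      CRC32 crc = new CRC32();
      crc.update(data, off, len);
      sums.add(crc.getValue());
    }
    return sums;
  }

  public static void main(String[] args) {
    byte[] block = new byte[2048];               // stand-in for block data
    List<Long> stored = checksums(block);        // computed on write
    List<Long> onRead = checksums(block);        // recomputed on read
    System.out.println("data intact: " + stored.equals(onRead));
  }
}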
Data Compression
• When we think about Hadoop, we think about very large files stored in HDFS and a lot of data transferred among nodes in the Hadoop cluster while storing HDFS blocks or running MapReduce tasks.
• If you could somehow reduce the file size, that would help in reducing both storage requirements and data transfer across the network.
• That’s where data compression in Hadoop helps.
Data compression at various stages in Hadoop
We can compress data in Hadoop MapReduce at various stages.
• Compressing input files- You can compress the input files, which reduces storage space in HDFS. If you compress the input files, they are decompressed automatically when processed by a MapReduce job. The appropriate codec is determined from the file name extension; for example, if the file name extension is .snappy, the Hadoop framework will automatically use SnappyCodec to decompress the file.
• Compressing the map output- You can compress the intermediate map output. Since map output is written to disk and data from several map outputs is used by a reducer, data from the map outputs is transferred to the nodes where the reduce tasks run. By compressing the intermediate map output you reduce both the disk space used and the data transferred across the network.
• Compressing output files- You can also compress the output of a MapReduce job; a job-configuration sketch follows below.
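A hedged sketch of how these settings are typically wired into a MapReduce job driver: the map-output properties and the FileOutputFormat helpers are standard Hadoop configuration knobs, while the job name and the choice of Snappy/gzip codecs here are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate map output with Snappy (fast, low CPU overhead).
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compression-demo");   // illustrative job name

    // Compress the final job output with gzip.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
  }
}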
Hadoop compression formats
• There are many different compression formats available in the Hadoop framework. You will have to use the one that suits your requirement.
• Parameters that you need to look at are-
• Time it takes to compress.
• Space saving.
• Whether the compression format is splittable or not.

• Deflate- Deflate is the compression algorithm whose reference implementation is zlib. The Deflate algorithm is also used by the gzip compression tool.
Filename extension is .deflate.
• gzip- gzip compression is based on the Deflate compression algorithm. gzip is not as fast as LZO or Snappy but compresses better, so the space saving is greater.
Gzip is not splittable.
Filename extension is .gz.
• bzip2- Using bzip2 for compression provides a higher compression ratio, but the compressing and decompressing speed is slow.
Bzip2 is splittable; Bzip2Codec implements the SplittableCompressionCodec interface, which provides the capability to compress/decompress a stream starting at any arbitrary position.
Filename extension is .bz2.
• Snappy- The Snappy compressor from Google provides fast compression and decompression, but the compression ratio is lower.
Snappy is not splittable.
Filename extension is .snappy.
• LZO- LZO, just like Snappy, is optimized for speed, so it compresses and decompresses faster, but the compression ratio is lower.
LZO is not splittable by default, but you can index the .lzo files as a pre-processing step to make them splittable.
Filename extension is .lzo.
• LZ4- LZ4 has fast compression and decompression speed, but the compression ratio is lower. LZ4 is not splittable.
Filename extension is .lz4.
• Zstandard- Zstandard is a real-time compression algorithm providing high compression ratios. It offers a very wide range of compression/speed trade-offs.
Zstandard is not splittable.
Filename extension is .zstd.
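Choosing a codec by filename extension, as described above for input files, is done through CompressionCodecFactory. A minimal sketch, where the compressed input path is an illustrative assumption:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressByExtension {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("logs/events.gz");     // illustrative compressed input

    // Pick the codec from the filename extension (.gz -> GzipCodec, .bz2 -> BZip2Codec, ...).
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(input);
    if (codec == null) {
      System.err.println("No codec found for " + input);
      return;
    }

    try (InputStream in = codec.createInputStream(fs.open(input))) {
      IOUtils.copyBytes(in, System.out, 4096, false);   // stream the decompressed contents
    }
  }
}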
Serialization in Hadoop
• Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

• Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

• Serialization is used in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.

• In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).

• The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.

• In general, it is desirable that an RPC serialization format is compact, fast, extensible, and interoperable.

• Hadoop uses its own serialization format, Writables, which is certainly compact and fast, but not so easy to extend or use from languages other than Java.
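A small sketch of Writable serialization: an IntWritable is written to an in-memory byte stream and read back, which is essentially what the RPC layer and the file formats below do with larger records. The helper method names are purely illustrative:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableSerializationDemo {
  // Serialize any Writable into a byte array.
  static byte[] serialize(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (DataOutputStream dataOut = new DataOutputStream(out)) {
      writable.write(dataOut);
    }
    return out.toByteArray();
  }

  // Deserialize a byte array back into the given Writable instance.
  static void deserialize(Writable writable, byte[] bytes) throws IOException {
    try (DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(bytes))) {
      writable.readFields(dataIn);
    }
  }

  public static void main(String[] args) throws IOException {
    IntWritable original = new IntWritable(163);
    byte[] bytes = serialize(original);          // 4 bytes: a compact binary form

    IntWritable restored = new IntWritable();
    deserialize(restored, bytes);
    System.out.println(restored.get());          // prints 163
  }
}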
File based Data Structure in Hadoop
• Sequence File
• Map file
Sequence File
• Provides a persistent data structure for binary key-value pairs.
• Also works well as a container for smaller files.

There are two ways to seek to a given position in a sequence file:

• seek(long pos)- positions the reader at the given point in the file. The next() method fails if the position is not a record boundary.
• sync(long pos)- positions the reader at the next sync point after the given position. We can call sync() with any position in the stream, a non-record boundary for example, and the reader will establish itself at the next sync point so reading can continue.
• SequenceFile.Writer has a method called sync() for inserting a sync point at the current position in the stream.

• The hadoop fs command has a -text option to display sequence files in textual form. It attempts to detect the type of the file and convert it to text appropriately. It recognizes gzipped files and sequence files; otherwise it assumes the input is plain text.
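A minimal sketch of writing and reading a SequenceFile with SequenceFile.Writer and SequenceFile.Reader; the file name and key/value types are illustrative choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("numbers.seq");         // illustrative file name

    // Write binary key-value pairs (IntWritable key, Text value).
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      for (int i = 0; i < 100; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
      // writer.sync() could be called here to insert an explicit sync point.
    }

    // Read the pairs back in order.
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(path))) {
      IntWritable key = new IntWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    }
  }
}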
Map File
• Sorted SequenceFile with an index to permit lookups by key.
• MapFile can be thought of as a persistent form of java.util.Map (although it doesn’t implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
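A hedged sketch of building and probing a MapFile: keys must be appended in sorted order, and get() uses the index to look up a key. The directory name and types are illustrative, and the older FileSystem-based constructors are used here for brevity (newer releases also offer option-based constructors):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "numbers.map";                  // a MapFile is a directory (data + index)

    // Keys must be appended in sorted order.
    try (MapFile.Writer writer = new MapFile.Writer(conf, fs, dir,
        IntWritable.class, Text.class)) {
      for (int i = 0; i < 100; i++) {
        writer.append(new IntWritable(i), new Text("entry-" + i));
      }
    }

    // Look up a key through the in-memory index.
    try (MapFile.Reader reader = new MapFile.Reader(fs, dir, conf)) {
      Text value = new Text();
      reader.get(new IntWritable(42), value);
      System.out.println(value);                 // prints entry-42
    }
  }
}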
Map File Variants
• SetFile is a specialization of MapFile for storing a set of Writable keys. The keys must be added in
sorted order.

• ArrayFile is a MapFile where the key is an integer representing the index of the element in the
array and the value is a Writable value.

• BloomMapFile is a MapFile which offers a fast version of the get() method, especially for
sparsely populated files. The implementation uses a dynamic bloom filter for testing whether a
given key is in the map. The test is very fast since it is in-memory, but it has a non-zero probability
of false positives, in which case the regular get() method is called.
