BDT - Unit - II - Hdfs and Hadoop Io
Syllabus
Unit-II
Hadoop Distributed File system:
HDFS concepts,
Command-Line Interface,
Hadoop file systems,
Java interface,
Data flow,
Hadoop archives.
Hadoop I/O:
Data integrity,
Compression,
Serialization,
File-based data structures.
1) Hadoop Distributed File System
To store data, Hadoop uses its own distributed file system, called
HDFS, which makes data available to multiple computing nodes.
HDFS provides high-throughput access to application data and
is suitable for applications that have large data sets. To enable
streaming access to file system data, HDFS relaxes a few POSIX
requirements.
HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of
commodity hardware. “Very large” in this context means files
that are hundreds of megabytes, gigabytes, or terabytes in size.
The Java abstract class org.apache.hadoop.fs.FileSystem
represents a filesystem in Hadoop, and there are several
concrete implementations.
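A minimal sketch of how the URI scheme selects a concrete implementation (the NameNode host and port below are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemSchemes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // hdfs:// URIs resolve to the HDFS implementation (DistributedFileSystem)
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    // file:// URIs resolve to the local implementation (LocalFileSystem)
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);
    System.out.println(hdfs.getClass().getSimpleName());
    System.out.println(local.getClass().getSimpleName());
  }
}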
Fig: HDFS Architecture
The Hadoop Distributed File System (HDFS) is designed to store
very large data sets reliably, and to stream those data sets at high
bandwidth to user applications. In a large cluster, thousands of
servers both host directly attached storage and execute user
application tasks.
Program: Reading data from a Hadoop URL

// Requires URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
// to have been called once per JVM so that java.net.URL understands hdfs:// URIs.
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}
Reading Data Using the FileSystem API
The URL approach above depends on setting a URLStreamHandlerFactory for your
application, which is not always possible (it can be set only once per JVM).
FileSystem is a general filesystem API, so the first step is to retrieve an
instance for the filesystem we want to use (HDFS in this case).
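A minimal sketch of this approach (the class name FileSystemCat and the command-line argument are illustrative):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                                    // e.g. an hdfs:// path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);   // step 1: get a FileSystem instance
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                           // step 2: open an input stream
      IOUtils.copyBytes(in, System.out, 4096, false);        // copy the file to standard output
    } finally {
      IOUtils.closeStream(in);
    }
  }
}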
7) Hadoop Archives: HDFS stores small files inefficiently,
since each file is stored in a block, and block metadata is
held in memory by the namenode. Hadoop Archives (HAR files)
pack many small files into HDFS blocks more efficiently,
reducing namenode memory usage while keeping the files
transparently accessible.
Fig: HAR File Layout
A Hadoop Archive is created from a collection of files using
the archive tool, for example:
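A sketch of the commands (archive name and paths are placeholders; option details vary slightly between Hadoop versions):

hadoop archive -archiveName files.har -p /user/hadoop dir1 dir2 /user/hadoop/archives
hadoop fs -ls har:///user/hadoop/archives/files.har

The first command packs dir1 and dir2 (relative to the parent path given with -p) into files.har; the second lists the archive contents through the har:// filesystem.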
HDFS Features:
Scale-Out Architecture - Add servers to increase capacity
High Availability - Serve mission-critical workflows and applications
Fault Tolerance - Automatically and seamlessly recover from failures
Flexible Access – Multiple and open frameworks for serialization and file
system mounts
Load Balancing - Place data intelligently for maximum efficiency and
utilization
Tunable Replication - Multiple copies of each file provide data protection
and computational performance
Security - POSIX-based file permissions for users and groups with optional
LDAP integration.
2. Hadoop I/O: Hadoop comes with a set of primitives for data I/O.
Some of these are techniques that are more general than Hadoop,
such as data integrity and compression, but they deserve special
consideration when dealing with multiterabyte datasets.
Others are Hadoop tools or APIs that form the building blocks for
developing distributed systems, such as serialization frameworks
and on-disk data structures.
Example…
c) ChecksumFileSystem: LocalFileSystem uses ChecksumFileSystem
to do its work, and this class makes it easy to add checksumming to
other (non-checksummed) filesystems, as ChecksumFileSystem is
just a wrapper around FileSystem.
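A minimal sketch showing the checksummed local filesystem and the raw filesystem it wraps:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class ChecksumDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    LocalFileSystem localFs = FileSystem.getLocal(conf);   // checksummed: extends ChecksumFileSystem
    FileSystem rawFs = localFs.getRawFileSystem();         // the wrapped, non-checksummed filesystem
    System.out.println(localFs.getClass().getName());
    System.out.println(rawFs.getClass().getName());
  }
}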
Compression: It consists of
a) Codecs
b) Compression and Input Splits
c) Using Compression in MapReduce
a) Codecs: A codec is the implementation of a compression-
decompression algorithm. In Hadoop, a codec is represented by an
implementation of the CompressionCodec interface; for example,
GzipCodec encapsulates the compression and decompression
algorithm for gzip.
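A minimal sketch that compresses standard input to standard output with GzipCodec (the choice of codec is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Instantiate the codec via ReflectionUtils so its Configuration gets set
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();   // flush the compressed stream without closing System.out
  }
}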
b) Compression and Input Splits: Common compression formats and tools
are gzip, bzip2, LZO (lzop), and ZIP. Whether a format supports
splitting matters for MapReduce: bzip2 files can be split, whereas a
gzip file cannot, so a large gzip file has to be processed by a single
map task.
c) Using Compression in MapReduce: When considering how to
compress data that will be processed by MapReduce, it is important
to understand whether the compression format supports splitting. If
your input files are compressed, they will be automatically
decompressed as they are read by MapReduce, using the filename
extension to determine the codec to use.
For Example…
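A minimal sketch of requesting compressed job output (assuming the newer org.apache.hadoop.mapreduce API; the job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "compressed-output");
    // Compressed input is decompressed automatically based on the filename extension;
    // output compression has to be requested explicitly:
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // ... set mapper, reducer and input/output paths as usual ...
  }
}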
3) Data Serialization: Serialization is the process of turning
structured objects into a byte stream for transmission over a network
or for writing to persistent storage. Deserialization is the process of
turning a byte stream back into a series of structured objects.
In Hadoop, interprocess communication between nodes in the
system is implemented using remote procedure calls (RPCs).
It consists of
a) The Writable Interface
b) Writable Classes
c) Implementing a Custom Writable
d) Serialization Frameworks
e) Avro
a) The Writable Interface: The Writable interface defines two methods: one for
writing the object's state to a DataOutput binary stream, and one for reading its
state from a DataInput binary stream.
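A minimal sketch of these two methods in action, using IntWritable (the value 163 is arbitrary):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
  public static void main(String[] args) throws Exception {
    IntWritable writable = new IntWritable(163);

    // write(DataOutput): serialize the object's state to a binary stream
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut);
    dataOut.close();

    // readFields(DataInput): populate a new object from the binary stream
    IntWritable copy = new IntWritable();
    DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(out.toByteArray()));
    copy.readFields(dataIn);
    System.out.println(copy.get());   // prints 163
  }
}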
b) Writable Classes: Hadoop comes with a large selection of Writable classes in the
org.apache.hadoop.io package.
File-Based Data Structures: It consists of
a) SequenceFile
b) MapFile
a) SequenceFile: Imagine a log file, where each log record is a new
line of text. If you want to log binary types, plain text isn't a
suitable format. Hadoop's SequenceFile class fits the bill in this
situation, providing a persistent data structure for
binary key-value pairs.
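A minimal sketch of writing a SequenceFile (the output path and record contents are placeholders; createWriter has several overloads across Hadoop versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("numbers.seq");                    // placeholder output path

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
      for (int i = 0; i < 100; i++) {
        key.set(i);
        value.set("record-" + i);
        writer.append(key, value);                          // append binary key-value pairs
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}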
Fig: The internal structure of a sequence file with no compression
and record compression
b) MapFile: A MapFile is a sorted SequenceFile with an index to permit lookups by key.
MapFile can be thought of as a persistent form of java.util.Map, which is able to grow
beyond the size of a Map that is kept in memory.
Writing a MapFile: Writing a MapFile is similar to writing a SequenceFile: you create an
instance of MapFile.Writer, then call the append() method to add entries; keys must be
added in sorted order.
Reading a MapFile: Iterating through the entries in order in a MapFile is similar to the
procedure for a SequenceFile: you create a MapFile.Reader, then call the next() method
until it returns false; a random lookup can be performed with the get() method.
Converting a SequenceFile to a MapFile: One way of looking at a MapFile is as an indexed
and sorted SequenceFile, so it is quite natural to want to be able to convert a
SequenceFile into a MapFile.
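A minimal sketch of writing and then looking up a MapFile (the directory name and contents are placeholders; the constructors shown are the classic ones, deprecated in newer Hadoop releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "numbers.map";                 // a MapFile is a directory holding "data" and "index"

    // Write: keys must be appended in sorted order
    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class);
    try {
      IntWritable key = new IntWritable();
      Text value = new Text();
      for (int i = 0; i < 100; i++) {
        key.set(i);
        value.set("entry-" + i);
        writer.append(key, value);
      }
    } finally {
      writer.close();
    }

    // Read: the index permits random lookups by key
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    try {
      Text result = new Text();
      reader.get(new IntWritable(42), result);
      System.out.println(result);               // prints entry-42
    } finally {
      reader.close();
    }
  }
}

For the conversion mentioned above, one common recipe is to sort the SequenceFile, place it in a MapFile directory as the data file, and then use the static MapFile.fix() method to re-create the index.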