BDT - Unit II - HDFS and Hadoop I/O


BIG DATA

Syllabus

Unit-I : Introduction to Big Data and Hadoop


Unit-II : HDFS and Hadoop I/O
Unit-III : MapReduce, Types and Formats and Features
Unit-IV : Hive, HBase and Pig
Unit-V : Mahout, Sqoop, ZooKeeper and Case Study

1
Unit-II
 Hadoop Distributed File System:
HDFS concepts,
Command-Line Interface,
Hadoop file systems,
Java interface,
Data flow,
Hadoop archives.

 Hadoop I/O:
Data integrity,
Compression,
Serialization,
File-based data structures.
2
1) Hadoop Distributed File System

 HDFS (Hadoop Distributed File System) is Hadoop's file system for data storage.

 Hadoop Distributed File System (HDFS) is a sub-project of the
Apache Hadoop project (an Apache Software Foundation project) and is
designed to provide a fault-tolerant file system that runs
on commodity hardware.

 The HDFS is the primary storage system used


by Hadoop applications.

 The HDFS is a distributed file system that provides high-


performance access to data across Hadoop clusters.

 To store data, Hadoop uses its own distributed file system, HDFS,
which makes data available to multiple computing nodes. 3
 HDFS provides high throughput access to application data and
is suitable for applications that have large data sets. HDFS
relaxes a few POSIX requirements to enable streaming access
to file system data.

 HDFS uses a master/slave architecture in which one device (the
master) controls one or more other devices (the slaves).

 HDFS features include fault tolerance, high throughput,
suitability for handling large data sets, and streaming access to
file system data.

 A typical Hadoop usage pattern involves three stages:
loading data into HDFS, MapReduce operations, and
retrieving results from HDFS.
4
Fig: HDFS Architecture 5
HDFS consists of
1. HDFS concepts
2. The command line interface
3. Hadoop file systems
4. The Java interface
5. Data flow
6. Parallel copying distcp
7. Hadoop Archives
6
1) HDFS Concepts: The main HDFS concepts are
Blocks, Namenodes and Datanodes, HDFS
Federation, and HDFS High-Availability:
i. Blocks
ii. Namenodes and Datanodes
iii. HDFS Federation
iv. HDFS High-Availability

 i. Blocks: A disk has a block size, which is the
minimum amount of data that it can read or write.
HDFS has the concept of a block, but it is a much
larger unit, 64 MB by default.
7
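 As a rough illustration (not from the original slides), the following Java sketch uses the standard FileSystem API to print the block size of a file; the path /user/example/data.txt is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: print the block size of an existing HDFS file.
public class ShowBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up fs.default.name
    FileSystem fs = FileSystem.get(conf);              // the configured filesystem (HDFS here)
    FileStatus status = fs.getFileStatus(new Path("/user/example/data.txt")); // hypothetical path
    System.out.println("Block size: " + status.getBlockSize() + " bytes");    // 64 MB by default
  }
}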
Fig: A Client reading data from HDFS 8
ii. Namenodes and Datanodes: An HDFS cluster has two types
of node operating in a master-worker pattern: a namenode
(also called the master node) and a number of datanodes
(also called slave nodes). The namenode also knows the datanodes
on which all the blocks for a given file are located.

iii. HDFS Federation: HDFS Federation, introduced in the 0.23


release series, allows a cluster to scale by adding namenodes,
each of which manages a portion of the file system
namespace.

iv. HDFS High-Availability: The combination of replicating


namenode metadata on multiple filesystems, and using the
secondary namenode to create checkpoints protects against
data loss, but does not provide high-availability of the
filesystem.
9
2) The Command-Line Interface
 The HDFS can be manipulated through a Java API or through
a command line interface.

 All commands for manipulating HDFS through Hadoop's command-
line interface begin with "hadoop", a space, and "fs" (the
filesystem shell), followed by the command name as an
argument to "hadoop fs".

 There are two properties that we set in the pseudo-distributed


configuration that deserve further explanation.

 The first is fs.default.name, set to hdfs://localhost/, which is used to


set a default filesystem for Hadoop.

 The default HDFS port is 8020. 10


 Basic Filesystem Operations: The filesystem is ready to be used,
and we can do all of the usual filesystem operations such as reading
files, creating directories, moving files, deleting data, and listing
directories (a short programmatic sketch follows after this list).
i. Hadoop provides two FileSystem methods for processing globs,
globStatus(Path pathPattern) and globStatus(Path pathPattern,
PathFilter filter), both returning FileStatus[]
ii. Directories
iii. Querying the Filesystem
11
iv. File Patterns
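 The sketch below (an illustrative example, not from the original slides) drives the same filesystem shell programmatically through Hadoop's FsShell class; each String[] is exactly what would follow "hadoop fs" on the command line, and the /user/example directory is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

// Sketch: run "hadoop fs" commands from Java via FsShell.
public class ShellExamples {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    ToolRunner.run(conf, new FsShell(), new String[] {"-mkdir", "/user/example"}); // hadoop fs -mkdir /user/example
    ToolRunner.run(conf, new FsShell(), new String[] {"-ls", "/user"});            // hadoop fs -ls /user
  }
}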
3) Hadoop File Systems

 Hadoop has an abstract notion of filesystem, of which HDFS is


just one implementation.
 HDFS is a file system used to store large data files; it handles
streaming data access and runs on clusters of commodity
hardware.

 HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of
commodity hardware. “Very large” in this context means files
that are hundreds of megabytes, gigabytes, or terabytes in size.
 The Java abstract class org.apache.hadoop.fs.FileSystem
represents a filesystem in Hadoop, and there are several
concrete implementations.
12
Fig: HDFS Architecture 13
 The Hadoop Distributed File System (HDFS) is designed to store
very large data sets reliably, and to stream those data sets at high
bandwidth to user applications. In a large cluster, thousands of
servers both host directly attached storage and execute user
application tasks.

Fig: Hadoop File Systems configuration for HDFS 14


4) The Java Interface
 Hadoop provides a Java API for interacting with its filesystems. While
we focus mainly on the HDFS implementation,
DistributedFileSystem, in general you should strive to write your
code against the FileSystem abstract class, to retain portability
across filesystems.
 Reading Data from a Hadoop URL: One of the simplest ways to
read a file from a Hadoop filesystem is by using a java.net.URL
object to open a stream to read the data from.

 Program:
// Requires URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
// to have been called once per JVM so that java.net.URL understands hdfs:// URLs.
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}
Reading Data Using the FileSystem API
 Sometimes it is impossible to set a URLStreamHandlerFactory for your
application, so the FileSystem API is used to open files instead.
 FileSystem is a general filesystem API, so the first step is
to retrieve an instance for the filesystem we want to use -
HDFS in this case.

 There are several static factory methods for getting a


FileSystem instance:
i. public static FileSystem get(Configuration conf) throws
IOException
ii. public static FileSystem get(URI uri, Configuration conf)
throws IOException
iii. public static FileSystem get(URI uri, Configuration conf,
String user) throws IOException
16
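 Putting the factory methods to work, the following sketch (patterned on the usual FileSystemCat idiom, not copied from the slides) reads a file through the FileSystem API and copies it to standard output.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: read a file through the FileSystem API instead of java.net.URL.
public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                                  // e.g. an hdfs:// path passed on the command line
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf); // factory method (ii) above
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                         // returns an FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false);      // copy to stdout; keep System.out open
    } finally {
      IOUtils.closeStream(in);
    }
  }
}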
5) Data Flow: This section gives an idea of how data flows between a
client interacting with HDFS, the namenode and the
datanodes.

Fig: A Client reading data from HDFS 17


 The client opens the file it wishes to read by calling open() on the
FileSystem object, which for HDFS is an instance of
DistributedFileSystem.
 The DistributedFileSystem returns an FSDataInputStream (an input
stream that supports file seeks) to the client for it to read data from.
 FSDataInputStream in turn wraps a DFSInputStream, which
manages the datanode and namenode I/O.

Anatomy of a File Write:


 The client creates the file by calling create() on
DistributedFileSystem.
 DistributedFileSystem makes an RPC call to the namenode to
create a new file in the filesystem’s namespace, with no blocks
associated with it.
 The data queue is consumed by the DataStreamer, whose
responsibility it is to ask the namenode to allocate new blocks by
picking a list of suitable datanodes to store the replicas.
18
Fig: A Client writing data to HDFS 19
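 The write path can be exercised with a sketch like the one below (an illustrative example; both paths are hypothetical placeholders): the create() call asks the namenode to create the file, and the data is streamed out to datanodes as it is written.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: copy a local file into HDFS via FileSystem.create().
public class FileCopy {
  public static void main(String[] args) throws Exception {
    String localSrc = "/tmp/input.txt";                      // hypothetical local file
    String dst = "hdfs://localhost/user/example/input.txt";  // hypothetical HDFS destination
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst));  // namenode records the new file; blocks are streamed to datanodes
    IOUtils.copyBytes(in, out, 4096, true);       // true closes both streams when the copy finishes
  }
}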
6) Parallel Copying with distcp
 The HDFS access patterns that we have seen so far focus
on single-threaded access.

 Hadoop comes with a useful program called distcp for


copying large amounts of data to and from Hadoop
filesystems in parallel.

 The canonical use case for distcp is for transferring data


between two HDFS clusters.

 If the clusters are running identical versions of Hadoop,


the hdfs scheme is appropriate:
20
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
 Since it’s a good idea to get each map to copy a
reasonable amount of data to minimize overheads
in task setup, each map copies at least 256 MB.

 When copying data into HDFS, it’s important to


consider cluster balance.

 HDFS works best when the file blocks are evenly


spread across the cluster, so you want to ensure
that distcp doesn’t disrupt this.

21
7) Hadoop Archives: HDFS stores small files inefficiently,
since each file is stored in a block, and block metadata is
held in memory by the name node.

 Hadoop Archives, or HAR files, are a file archiving


facility that packs files into HDFS blocks more efficiently,
thereby reducing name node memory usage while still
allowing transparent access to files.

 In particular, Hadoop Archives can be used as input to


MapReduce.

22
Fig: HAR File Layout
23
 Hadoop Archive is created from a collection of files using
the archive tool.

 The tool runs a MapReduce job to process the input files
in parallel, so you need a running MapReduce cluster to use it.

 Archives are immutable.


 Rename, delete, and create return an error.

 Hadoop Archives are exposed as a filesystem, so MapReduce
is able to use all the logical input files in Hadoop
Archives as input.
24
 HDFS is a fault tolerant and self-healing distributed file
system designed to turn a cluster of industry standard
servers into a massively scalable pool of storage.

 HDFS Features:
 Scale-Out Architecture - Add servers to increase capacity
 High Availability - Serve mission-critical workflows and applications
 Fault Tolerance - Automatically and seamlessly recover from failures
 Flexible Access – Multiple and open frameworks for serialization and file
system mounts
 Load Balancing - Place data intelligently for maximum efficiency and
utilization
 Tunable Replication - Multiple copies of each file provide data protection
and computational performance
 Security - POSIX-based file permissions for users and groups with optional
LDAP integration.
25
2. Hadoop I/O: Hadoop comes with a set of primitives for data I/O,
some of these are techniques that are more general than Hadoop,
such as data integrity and compression, but deserve special
consideration when dealing with multiterabyte datasets.

 Hadoop also provides tools and APIs that form the building blocks for developing
distributed systems, such as serialization frameworks and on-disk
data structures.

 In a Hadoop MapReduce job, input files are read from HDFS.
Data is usually compressed to reduce the file sizes; after
decompression, serialized bytes are transformed into Java objects
before being passed to a user-defined map() function.

 Conversely, output records are serialized, compressed, and eventually pushed
back to HDFS.

 Hadoop’s SequenceFile provides a persistent data structure for


binary key-value pairs. 26
 Hadoop I/O consists of
1. Data Integrity
2. Data Compression
3. Data Serialization
4. File-Based Data Structures

Fig: Basic flow of Hadoop I/O 27


1) Data Integrity: Checksumming in Hadoop only offers error detection;
it does not offer any way to fix the data.
 In Hadoop, every I/O operation on the disk or network carries with it a
small chance of introducing errors into the data that it is reading or
writing.
Data Integrity consists of
a) Data Integrity in HDFS
b) LocalFileSystem
c) ChecksumFileSystem

a) Data Integrity in HDFS: HDFS transparently checksums all data


written to it and by default verifies checksums when reading data.
 A separate checksum is created for every io.bytes.per.checksum bytes
of data (512 bytes by default).
 When a client detects an error while reading a block, it reports the bad
block and the datanode it was trying to read from to the namenode
before throwing a ChecksumException. Because HDFS stores replicas of
blocks, it can "heal" corrupted blocks by copying one of the good
replicas to produce a new, uncorrupt replica.
28
b) LocalFileSystem: The Hadoop LocalFileSystem performs client-side
checksumming. This means that when you write a file called filename,
the filesystem client transparently creates a hidden file, .filename.crc, in
the same directory containing the checksums for each chunk of the file.

 Like HDFS, the chunk size is controlled by the io.bytes.per.checksum
property, which defaults to 512 bytes.
 The chunk size is stored as metadata in the .crc file, so the file can be read
back correctly even if the setting for the chunk size has changed.

 Checksums are fairly cheap to compute, typically adding a few percent


overhead to the time to read or write a file.

 It is possible to disable checksums; the use case here is when the
underlying file system supports checksums natively. This is accomplished
by using RawLocalFileSystem in place of LocalFileSystem.

 Example… 29
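 A minimal sketch of both approaches (illustrative only, assuming the local filesystem):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.RawLocalFileSystem;

// Sketch: two ways to avoid checksum overhead on the local filesystem.
public class DisableChecksums {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // 1) Use RawLocalFileSystem directly, bypassing checksum files entirely.
    FileSystem rawFs = new RawLocalFileSystem();
    rawFs.initialize(URI.create("file:///"), conf);

    // 2) Keep the checksummed LocalFileSystem, but skip verification on reads.
    LocalFileSystem localFs = FileSystem.getLocal(conf);
    localFs.setVerifyChecksum(false);
  }
}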
c) ChecksumFileSystem: LocalFileSystem uses ChecksumFileSystem
to do its work, and this class makes it easy to add checksumming to
other (nonchecksummed) filesystems, as ChecksumFileSystem is
just a wrapper around FileSystem.


 The general idiom is as follows:
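A sketch of the wrapper idiom, using LocalFileSystem (the concrete ChecksumFileSystem for local disks) around a raw filesystem; this is illustrative rather than copied from the slides.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.RawLocalFileSystem;

// Sketch: wrap a raw filesystem so that reads and writes are checksummed.
public class ChecksumIdiom {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem rawFs = new RawLocalFileSystem();                 // the underlying "raw" filesystem
    rawFs.initialize(URI.create("file:///"), conf);
    LocalFileSystem checksummedFs = new LocalFileSystem(rawFs);  // adds .crc checksum handling
    System.out.println(checksummedFs.getRawFileSystem() == rawFs); // true: the raw filesystem is retrievable
  }
}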

 The underlying filesystem is called the raw filesystem, and may be


retrieved using the getRawFileSystem() method on
ChecksumFileSystem. 30
2. Data Compression: File compression brings two major
benefits as it reduces the space needed to store files, and it
speeds up data transfer across the network, or to or from
disk.
 When dealing with large volumes of data, both of these
savings can be significant, so it pays to carefully consider
how to use compression in Hadoop.

It consists of
a) Codecs
b) Compression and Input Splits
c) Using Compression in MapReduce

31
a) Codecs: A codec is the implementation of a compression-
decompression algorithm. In Hadoop, a codec is represented by an
implementation of the CompressionCodec interface

 The LZO codec libraries are GPL-licensed and not bundled with Apache
Hadoop; the corresponding Hadoop codecs must be downloaded separately from
http://code.google.com/p/hadoop-gpl-compression/

32
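 As an illustration (a sketch, not from the slides), the following program obtains a codec through the CompressionCodec interface and uses it to compress standard input to standard output; GzipCodec ships with Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: compress stdin to stdout using a Hadoop codec.
public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();   // flush compressed data without closing System.out
  }
}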
b) Compression: Common compression tools are gzip, ZIP, bzip2, and
lzop.

 All of these tools give some control over the space/time trade-off at
compression time by offering nine different options:
– -1 means optimize for speed and -9 means optimize for space
– e.g. gzip -1 file

 The different tools have very different compression characteristics.


– Both gzip and ZIP are general-purpose compressors, and sit in
the middle of the space/time trade-off.
– Bzip2 compresses more effectively than gzip or ZIP, but is
slower.
– LZO optimizes for speed. It is faster than gzip and ZIP, but
compresses slightly less effectively.

33
c) Using Compression in MapReduce: When considering how to
compress data that will be processed by MapReduce, it is important
to understand whether the compression format supports splitting. If
your input files are compressed, they will be automatically
decompressed as they are read by MapReduce, using the filename
extension to determine the codec to use.
 For Example…
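 For instance, a hedged sketch using the older JobConf API (the mapred.* property names in the comments are the classic equivalents):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Sketch: ask MapReduce to gzip-compress the job output.
public class CompressedOutputConfig {
  public static void configure(JobConf conf) {
    FileOutputFormat.setCompressOutput(conf, true);                   // mapred.output.compress = true
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class); // mapred.output.compression.codec
  }
}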

34
3) Data Serialization: Serialization is the process of turning
structured objects into a byte stream for transmission over a network
or for writing to persistent storage. Deserialization is the process of
turning a byte stream back into a series of structured objects.
 In Hadoop, interprocess communication between nodes in the
system is implemented using remote procedure calls(RPCs).

 In general, it is desirable that an RPC serialization format is:


 Compact: A compact format makes the best use of network bandwidth
 Fast: Interprocess communication forms the backbone for a distributed
system, so it is essential that there is as little performance overhead as
possible for the serialization and deserialization process.
 Extensible: Protocols change over time to meet new requirements, so it
should be straightforward to evolve the protocol in a controlled manner for
clients and servers.
 Interoperable: For some systems, it is desirable to be able to support clients
that are written in different languages to the server.

35
It consists of
a) The Writable Interface
b) Writable Classes
c) Implementing a Custom Writable
d) Serialization Frameworks
e) Avro
a) The Writable Interface: The Writable interface defines two methods: one for
writing its state to a DataOutput binary stream, and one for reading its state from
a DataInput binary stream.
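For reference, the interface itself (in the org.apache.hadoop.io package) is essentially the following:

package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
  // Serialize this object's fields to the binary output stream.
  void write(DataOutput out) throws IOException;
  // Populate this object's fields from the binary input stream.
  void readFields(DataInput in) throws IOException;
}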

36
b) Writable Classes: Hadoop comes with a large selection of Writable classes in the
org.apache.hadoop.io package.

Fig: Writable wrappers for Java primitives 37


c) Implementing a Custom Writable: Hadoop comes with a useful set of
Writable implementations that serve most purposes; however, on occasion, you
may need to write your own custom implementation. With a
custom Writable, you have full control over the binary representation and the
sort order.
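 The sketch below shows a hypothetical custom Writable holding a pair of ints; it is illustrative rather than taken from the slides. Implementing WritableComparable also fixes the sort order used when the type appears as a MapReduce key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Sketch: a custom Writable for a pair of ints.
public class IntPairWritable implements WritableComparable<IntPairWritable> {
  private int first;
  private int second;

  public IntPairWritable() {}                     // Writables need a no-argument constructor

  public void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);                          // binary representation: two 4-byte ints
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();                         // fields must be read in the order they were written
    second = in.readInt();
  }

  @Override
  public int compareTo(IntPairWritable o) {       // defines the sort order
    int cmp = Integer.compare(first, o.first);
    return cmp != 0 ? cmp : Integer.compare(second, o.second);
  }
}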

d) Serialization Frameworks: Hadoop has an API for pluggable


serialization frameworks. A serialization framework is represented by an
implementation of Serialization (in the org.apache.hadoop.io.serializer
package). WritableSerialization, for example, is the implementation of
Serialization for Writable types.
 A Serialization defines a mapping from types to Serializer instances (for
turning an object into a byte stream) and Deserializer instances (for
turning a byte stream into an object).

e) Avro: Apache Avro is a language-neutral data serialization system. The
project was created by Doug Cutting (the creator of Hadoop) to address
the major downside of Hadoop Writables: lack of language portability.
 The Avro specification precisely defines the binary format that all
implementations must support.
4) File-Based Data Structures: Apache Hadoop’s
SequenceFile provides a persistent data structure for
binary key-value pairs. In contrast with other persistent
key-value data structures like B-Trees, you can’t seek to a
specified key to edit, add, or remove it. The file is
append-only.

 For MapReduce-based processing, putting each blob of binary
data into its own file doesn’t scale, so Hadoop developed
a number of higher-level containers for these situations.

 It consists of
a) Sequencefile
b) Mapfile
39
a) Sequence File: Imagine a log file, where each log record is a new
line of text. If you want to log binary types, plain text isn’t a
suitable format. Hadoop’s SequenceFile class fits the bill in this
situation, providing a persistent data structure for
binary key-value pairs.

 Writing a Sequence File: To create a SequenceFile, use one of its


createWriter() static methods, which returns a SequenceFile.Writer
instance.

 Reading a SequenceFile: Reading sequence files from beginning
to end is a matter of creating an instance of SequenceFile.Reader and
iterating over records by repeatedly invoking one of the next() methods.

 The SequenceFile format: A sequence file consists of a header


followed by one or more records.
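 A combined sketch of writing and reading (illustrative; the path is a hypothetical placeholder and the older FileSystem-based factory methods are assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: write a few key-value records to a SequenceFile, then read them back.
public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/example/records.seq");   // hypothetical path

    IntWritable key = new IntWritable();
    Text value = new Text();

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 3; i++) {
        key.set(i);
        value.set("record-" + i);
        writer.append(key, value);                // records are appended one after another
      }
    } finally {
      writer.close();
    }

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      while (reader.next(key, value)) {           // iterate from beginning to end
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}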

40
Fig: The internal structure of a sequence file with no compression
and record compression 41
 b) MapFile: A MapFile is a sorted SequenceFile with an index to permit lookups by key.
MapFile can be thought of as a persistent form of java.util.Map, which is able to grow
beyond the size of a Map that is kept in memory.
 Writing a MapFile: Writing a MapFile is similar to writing a SequenceFile: you create an
instance of MapFile.Writer and call append() to add entries in order (see the sketch below).
 Reading a MapFile: Iterating through the entries in order in a MapFile is similar to the
procedure for a SequenceFile: you create a MapFile.Reader and call next() until it returns false.
 Converting a SequenceFile to a MapFile: A MapFile can be seen as an indexed and sorted
SequenceFile, so it’s quite natural to want to be able to convert a SequenceFile into a
MapFile.
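 A sketch of both operations (illustrative; the directory name is a hypothetical placeholder, and keys must be appended in sorted order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Sketch: build a small MapFile, then look an entry up by key.
public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "/user/example/lookup.map";        // a MapFile is a directory holding data + index files

    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 100; i++) {
        writer.append(new IntWritable(i), new Text("value-" + i));  // keys appended in ascending order
      }
    } finally {
      writer.close();
    }

    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    try {
      Text value = new Text();
      reader.get(new IntWritable(42), value);       // random lookup via the in-memory index
      System.out.println(value);
    } finally {
      reader.close();
    }
  }
}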

Fig: The internal structure of a sequence file with block compression 42
