BDT - Unit II - HDFS and Hadoop I/O


BIG DATA

Syllabus

Unit-I : Introduction to Big Data and Hadoop


Unit-II : HDFS and Hadoop I/O
Unit-III : MapReduce, Types and Formats and Features
Unit-IV : Hive, HBase and Pig
Unit-V : Mahout, Sqoop, ZooKeeper and Case Study

1
Unit-II
 Hadoop Distributed File System:
HDFS concepts,
Command-Line Interface,
Hadoop file systems,
Java interface,
Data flow,
Hadoop archives.

 Hadoop I/O:
Data integrity,
Compression,
Serialization,
File-based data structures.
2
1) Hadoop Distributed File System

 HDFS (Hadoop Distributed File System) is Hadoop's file system for data storage.

 Hadoop Distributed File System (HDFS) is a sub-project of the
Apache Hadoop project (an Apache Software Foundation project) and is
designed to provide a fault-tolerant file system that runs
on commodity hardware.

 The HDFS is the primary storage system used


by Hadoop applications.

 The HDFS is a distributed file system that provides high-


performance access to data across Hadoop clusters.

 To store data, Hadoop uses its own distributed file system, HDFS,
which makes data available to multiple computing nodes. 3
 HDFS provides high throughput access to application data and
is suitable for applications that have large data sets. HDFS
relaxes a few POSIX requirements to enable streaming access
to file system data.

 HDFS uses a master/slave architecture in which one device (the
master) controls one or more other devices (the slaves).

 HDFS features include fault tolerance, high throughput,
suitability for handling large data sets, and streaming access to
file system data.

 A typical Hadoop usage pattern involves three stages:
loading data into HDFS, MapReduce operations, and
retrieving results from HDFS.
4
Fig: HDFS Architecture 5
HDFS consists of
1. HDFS concepts
2. The command line interface
3. Hadoop file systems
4. The Java interface
5. Data flow
6. Parallel copying distcp
7. Hadoop Archives
6
1) HDFS Concepts: The main HDFS concepts are
Blocks, Namenodes and Datanodes, HDFS
Federation, and HDFS High-Availability:
i. Blocks
ii. Namenodes and Datanodes
iii. HDFS Federation
iv. HDFS High-Availability

 i. Blocks: A disk has a block size, which is the
minimum amount of data that it can read or write.
HDFS has the concept of a block, but it is a much
larger unit, 64 MB by default.
7
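 As a rough illustration (not from the original slides), the following Java sketch uses the standard FileSystem API to print the block size of a file; the path /user/example/data.txt is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: print the block size of an existing HDFS file.
public class ShowBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up fs.default.name
    FileSystem fs = FileSystem.get(conf);              // the configured filesystem (HDFS here)
    FileStatus status = fs.getFileStatus(new Path("/user/example/data.txt")); // hypothetical path
    System.out.println("Block size: " + status.getBlockSize() + " bytes");    // 64 MB by default
  }
}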
Fig: A Client reading data from HDFS 8
ii. Namenodes and Datanodes: An HDFS cluster has two types
of node operating in a master-worker pattern: a namenode
(also called the master node) and a number of datanodes
(also called slave nodes). The namenode also knows the datanodes
on which all the blocks for a given file are located.

iii. HDFS Federation: HDFS Federation, introduced in the 0.23


release series, allows a cluster to scale by adding namenodes,
each of which manages a portion of the file system
namespace.

iv. HDFS High-Availability: The combination of replicating


namenode metadata on multiple filesystems, and using the
secondary namenode to create checkpoints protects against
data loss, but does not provide high-availability of the
filesystem.
9
2) The Command-Line Interface
 The HDFS can be manipulated through a Java API or through
a command line interface.

 All commands for manipulating HDFS through Hadoop's command-
line interface begin with "hadoop", a space, and "fs" (the
filesystem shell), followed by the command name as an
argument to "hadoop fs".

 There are two properties that we set in the pseudo-distributed


configuration that deserve further explanation.

 The first is fs.default.name, set to hdfs://localhost/, which is used to


set a default filesystem for Hadoop.

 The default HDFS port is 8020. 10


 Basic Filesystem Operations: The filesystem is ready to be used,
and we can do all of the usual filesystem operations such as reading
files, creating directories, moving files, deleting data, and listing
directories (a short programmatic sketch follows after this list).
i. Hadoop provides two FileSystem methods for processing globs,
globStatus(Path pathPattern) and globStatus(Path pathPattern,
PathFilter filter), both returning FileStatus[]
ii. Directories
iii. Querying the Filesystem
11
iv. File Patterns
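 The sketch below (an illustrative example, not from the original slides) drives the same filesystem shell programmatically through Hadoop's FsShell class; each String[] is exactly what would follow "hadoop fs" on the command line, and the /user/example directory is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

// Sketch: run "hadoop fs" commands from Java via FsShell.
public class ShellExamples {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    ToolRunner.run(conf, new FsShell(), new String[] {"-mkdir", "/user/example"}); // hadoop fs -mkdir /user/example
    ToolRunner.run(conf, new FsShell(), new String[] {"-ls", "/user"});            // hadoop fs -ls /user
  }
}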
3) Hadoop File Systems

 Hadoop has an abstract notion of filesystem, of which HDFS is


just one implementation.
 HDFS is a file system used to store large data files; it handles
streaming data access and runs on clusters of commodity
hardware.

 HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of
commodity hardware. “Very large” in this context means files
that are hundreds of megabytes, gigabytes, or terabytes in size.
 The Java abstract class org.apache.hadoop.fs.FileSystem
represents a filesystem in Hadoop, and there are several
concrete implementations.
12
Fig: HDFS Architecture 13
 The Hadoop Distributed File System (HDFS) is designed to store
very large data sets reliably, and to stream those data sets at high
bandwidth to user applications. In a large cluster, thousands of
servers both host directly attached storage and execute user
application tasks.

Fig: Hadoop File Systems configuration for HDFS 14


4) The Java Interface
 Hadoop provides a Java API for interacting with its filesystems. While
we focus mainly on the HDFS implementation,
DistributedFileSystem, in general you should strive to write your
code against the FileSystem abstract class, to retain portability
across filesystems.
 Reading Data from a Hadoop URL: One of the simplest ways to
read a file from a Hadoop filesystem is by using a java.net.URL
object to open a stream to read the data from.

 Program:
// Requires URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
// to have been called once per JVM so that java.net.URL understands hdfs:// URLs.
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}
Reading Data Using the FileSystem API
 Sometimes it is impossible to set a URLStreamHandlerFactory for your
application, so the FileSystem API is used to open files instead.
 FileSystem is a general filesystem API, so the first step is
to retrieve an instance for the filesystem we want to use -
HDFS in this case.

 There are several static factory methods for getting a


FileSystem instance:
i. public static FileSystem get(Configuration conf) throws
IOException
ii. public static FileSystem get(URI uri, Configuration conf)
throws IOException
iii. public static FileSystem get(URI uri, Configuration conf,
String user) throws IOException
16
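 Putting the factory methods to work, the following sketch (patterned on the usual FileSystemCat idiom, not copied from the slides) reads a file through the FileSystem API and copies it to standard output.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: read a file through the FileSystem API instead of java.net.URL.
public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                                  // e.g. an hdfs:// path passed on the command line
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf); // factory method (ii) above
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                         // returns an FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false);      // copy to stdout; keep System.out open
    } finally {
      IOUtils.closeStream(in);
    }
  }
}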
5) Data Flow: This section gives an idea of how data flows between a
client interacting with HDFS, the namenode and the
datanodes.

Fig: A Client reading data from HDFS 17


 The client opens the file it wishes to read by calling open() on the
FileSystem object, which for HDFS is an instance of
DistributedFileSystem.
 The DistributedFileSystem returns an FSDataInputStream (an input
stream that supports file seeks) to the client for it to read data from.
 FSDataInputStream in turn wraps a DFSInputStream, which
manages the datanode and namenode I/O.

Anatomy of a File Write:


 The client creates the file by calling create() on
DistributedFileSystem.
 DistributedFileSystem makes an RPC call to the namenode to
create a new file in the filesystem’s namespace, with no blocks
associated with it.
 The data queue is consumed by the DataStreamer, whose
responsibility it is to ask the namenode to allocate new blocks by
picking a list of suitable datanodes to store the replicas.
18
Fig: A Client writing data to HDFS 19
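 The write path can be exercised with a sketch like the one below (an illustrative example; both paths are hypothetical placeholders): the create() call asks the namenode to create the file, and the data is streamed out to datanodes as it is written.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: copy a local file into HDFS via FileSystem.create().
public class FileCopy {
  public static void main(String[] args) throws Exception {
    String localSrc = "/tmp/input.txt";                      // hypothetical local file
    String dst = "hdfs://localhost/user/example/input.txt";  // hypothetical HDFS destination
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst));  // namenode records the new file; blocks are streamed to datanodes
    IOUtils.copyBytes(in, out, 4096, true);       // true closes both streams when the copy finishes
  }
}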
6) Parallel Copying with distcp
 The HDFS access patterns that we have seen so far focus
on single-threaded access.

 Hadoop comes with a useful program called distcp for


copying large amounts of data to and from Hadoop
filesystems in parallel.

 The canonical use case for distcp is for transferring data


between two HDFS clusters.

 If the clusters are running identical versions of Hadoop,


the hdfs scheme is appropriate:
20
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
 Since it’s a good idea to get each map to copy a
reasonable amount of data to minimize overheads
in task setup, each map copies at least 256 MB.

 When copying data into HDFS, it’s important to


consider cluster balance.

 HDFS works best when the file blocks are evenly


spread across the cluster, so you want to ensure
that distcp doesn’t disrupt this.

21
7) Hadoop Archives: HDFS stores small files inefficiently,
since each file is stored in a block, and block metadata is
held in memory by the name node.

 Hadoop Archives, or HAR files, are a file archiving


facility that packs files into HDFS blocks more efficiently,
thereby reducing name node memory usage while still
allowing transparent access to files.

 In particular, Hadoop Archives can be used as input to


MapReduce.

22
Fig: HAR File Layout
23
 Hadoop Archive is created from a collection of files using
the archive tool.

 The tool runs a MapReduce job to process the input files
in parallel, so you need a running MapReduce cluster to use it.

 Archives are immutable.


 Rename, delete, and create return an error.

 Hadoop Archives are exposed as a filesystem, so MapReduce
is able to use all the logical input files in Hadoop
Archives as input.
24
 HDFS is a fault tolerant and self-healing distributed file
system designed to turn a cluster of industry standard
servers into a massively scalable pool of storage.

 HDFS Features:
 Scale-Out Architecture - Add servers to increase capacity
 High Availability - Serve mission-critical workflows and applications
 Fault Tolerance - Automatically and seamlessly recover from failures
 Flexible Access – Multiple and open frameworks for serialization and file
system mounts
 Load Balancing - Place data intelligently for maximum efficiency and
utilization
 Tunable Replication - Multiple copies of each file provide data protection
and computational performance
 Security - POSIX-based file permissions for users and groups with optional
LDAP integration.
25
2. Hadoop I/O: Hadoop comes with a set of primitives for data I/O,
some of these are techniques that are more general than Hadoop,
such as data integrity and compression, but deserve special
consideration when dealing with multiterabyte datasets.

 Hadoop also provides tools and APIs that form the building blocks for developing
distributed systems, such as serialization frameworks and on-disk
data structures.

 In a Hadoop MapReduce job, input files are read from HDFS.
Data is usually compressed to reduce the file sizes; after
decompression, serialized bytes are transformed into Java objects
before being passed to a user-defined map() function.

 Conversely, output records are serialized, compressed, and eventually pushed
back to HDFS.

 Hadoop’s SequenceFile provides a persistent data structure for


binary key-value pairs. 26
 Hadoop I/O consists of
1. Data Integrity
2. Data Compression
3. Data Serialization
4. File-Based Data Structures

Fig: Basic flow of Hadoop I/O 27


1) Data Integrity: Checksumming in Hadoop only offers error detection;
it does not offer any way to fix the data.
 In Hadoop, every I/O operation on the disk or network carries with it a
small chance of introducing errors into the data that it is reading or
writing.
Data Integrity consists of
a) Data Integrity in HDFS
b) LocalFileSystem
c) ChecksumFileSystem

a) Data Integrity in HDFS: HDFS transparently checksums all data


written to it and by default verifies checksums when reading data.
 A separate checksum is created for every io.bytes.per.checksum bytes
of data (512 bytes by default).
 When a client detects an error while reading a block, it reports the bad
block and the datanode it was trying to read from to the namenode
before throwing a ChecksumException. Because HDFS stores replicas of
blocks, it can "heal" corrupted blocks by copying one of the good
replicas to produce a new, uncorrupt replica.
28
b) LocalFileSystem: The Hadoop LocalFileSystem performs client-side
checksumming. This means that when you write a file called filename,
the filesystem client transparently creates a hidden file, .filename.crc, in
the same directory containing the checksums for each chunk of the file.

 Like HDFS, the chunk size is controlled by the io.bytes.per.checksum
property, which defaults to 512 bytes.
 The chunk size is stored as metadata in the .crc file, so the file can be read
back correctly even if the setting for the chunk size has changed.

 Checksums are fairly cheap to compute, typically adding a few percent


overhead to the time to read or write a file.

 It is possible to disable checksums; the use case here is when the
underlying file system supports checksums natively. This is accomplished
by using RawLocalFileSystem in place of LocalFileSystem.

 Example… 29
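 A minimal sketch of both approaches (illustrative only, assuming the local filesystem):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.RawLocalFileSystem;

// Sketch: two ways to avoid checksum overhead on the local filesystem.
public class DisableChecksums {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // 1) Use RawLocalFileSystem directly, bypassing checksum files entirely.
    FileSystem rawFs = new RawLocalFileSystem();
    rawFs.initialize(URI.create("file:///"), conf);

    // 2) Keep the checksummed LocalFileSystem, but skip verification on reads.
    LocalFileSystem localFs = FileSystem.getLocal(conf);
    localFs.setVerifyChecksum(false);
  }
}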
c) ChecksumFileSystem: LocalFileSystem uses ChecksumFileSystem
to do its work, and this class makes it easy to add checksumming to
other (nonchecksummed) filesystems, as ChecksumFileSystem is
just a wrapper around FileSystem.


 The general idiom is as follows:
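A sketch of the wrapper idiom, using LocalFileSystem (the concrete ChecksumFileSystem for local disks) around a raw filesystem; this is illustrative rather than copied from the slides.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.RawLocalFileSystem;

// Sketch: wrap a raw filesystem so that reads and writes are checksummed.
public class ChecksumIdiom {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem rawFs = new RawLocalFileSystem();                 // the underlying "raw" filesystem
    rawFs.initialize(URI.create("file:///"), conf);
    LocalFileSystem checksummedFs = new LocalFileSystem(rawFs);  // adds .crc checksum handling
    System.out.println(checksummedFs.getRawFileSystem() == rawFs); // true: the raw filesystem is retrievable
  }
}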

 The underlying filesystem is called the raw filesystem, and may be


retrieved using the getRawFileSystem() method on
ChecksumFileSystem. 30
2. Data Compression: File compression brings two major
benefits as it reduces the space needed to store files, and it
speeds up data transfer across the network, or to or from
disk.
 When dealing with large volumes of data, both of these
savings can be significant, so it pays to carefully consider
how to use compression in Hadoop.

It consists of
a) Codecs
b) Compression and Input Splits
c) Using Compression in MapReduce

31
a) Codecs: A codec is the implementation of a compression-
decompression algorithm. In Hadoop, a codec is represented by an
implementation of the CompressionCodec interface

 The LZO codec libraries are GPL-licensed and not bundled with Apache
Hadoop; the corresponding Hadoop codecs must be downloaded separately from
http://code.google.com/p/hadoop-gpl-compression/

32
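 As an illustration (a sketch, not from the slides), the following program obtains a codec through the CompressionCodec interface and uses it to compress standard input to standard output; GzipCodec ships with Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: compress stdin to stdout using a Hadoop codec.
public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();   // flush compressed data without closing System.out
  }
}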
b) Compression: Common compression tools are gzip, ZIP, bzip2, and
lzop.

 All of these tools give some control over the space/time trade-off at
compression time by offering nine different options:
– -1 means optimize for speed and -9 means optimize for space
– e.g. gzip -1 file

 The different tools have very different compression characteristics.


– Both gzip and ZIP are general-purpose compressors, and sit in
the middle of the space/time trade-off.
– Bzip2 compresses more effectively than gzip or ZIP, but is
slower.
– LZO optimizes for speed. It is faster than gzip and ZIP, but
compresses slightly less effectively.

33
c) Using Compression in MapReduce: When considering how to
compress data that will be processed by MapReduce, it is important
to understand whether the compression format supports splitting. If
your input files are compressed, they will be automatically
decompressed as they are read by MapReduce, using the filename
extension to determine the codec to use.
 For Example…
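 For instance, a hedged sketch using the older JobConf API (the mapred.* property names in the comments are the classic equivalents):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Sketch: ask MapReduce to gzip-compress the job output.
public class CompressedOutputConfig {
  public static void configure(JobConf conf) {
    FileOutputFormat.setCompressOutput(conf, true);                   // mapred.output.compress = true
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class); // mapred.output.compression.codec
  }
}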

34
3) Data Serialization: Serialization is the process of turning
structured objects into a byte stream for transmission over a network
or for writing to persistent storage. Deserialization is the process of
turning a byte stream back into a series of structured objects.
 In Hadoop, interprocess communication between nodes in the
system is implemented using remote procedure calls(RPCs).

 In general, it is desirable that an RPC serialization format is:


 Compact: A compact format makes the best use of network bandwidth
 Fast: Interprocess communication forms the backbone for a distributed
system, so it is essential that there is as little performance overhead as
possible for the serialization and deserialization process.
 Extensible: Protocols change over time to meet new requirements, so it
should be straightforward to evolve the protocol in a controlled manner for
clients and servers.
 Interoperable: For some systems, it is desirable to be able to support clients
that are written in different languages to the server.

35
It consists of
a) The Writable Interface
b) Writable Classes
c) Implementing a Custom Writable
d) Serialization Frameworks
e) Avro
a) The Writable Interface: The Writable interface defines two methods: one for
writing its state to a DataOutput binary stream, and one for reading its state from
a DataInput binary stream.
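For reference, the interface itself (in the org.apache.hadoop.io package) is essentially the following:

package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
  // Serialize this object's fields to the binary output stream.
  void write(DataOutput out) throws IOException;
  // Populate this object's fields from the binary input stream.
  void readFields(DataInput in) throws IOException;
}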

36
b) Writable Classes: Hadoop comes with a large selection of Writable classes in the
org.apache.hadoop.io package.

Fig: Writable wrappers for Java primitives 37


c) Implementing a Custom Writable: Hadoop comes with a useful set of
Writable implementations that serve most purposes; however, on occasion, you
may need to write your own custom implementation. With a
custom Writable, you have full control over the binary representation and the
sort order.
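 The sketch below shows a hypothetical custom Writable holding a pair of ints; it is illustrative rather than taken from the slides. Implementing WritableComparable also fixes the sort order used when the type appears as a MapReduce key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Sketch: a custom Writable for a pair of ints.
public class IntPairWritable implements WritableComparable<IntPairWritable> {
  private int first;
  private int second;

  public IntPairWritable() {}                     // Writables need a no-argument constructor

  public void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);                          // binary representation: two 4-byte ints
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();                         // fields must be read in the order they were written
    second = in.readInt();
  }

  @Override
  public int compareTo(IntPairWritable o) {       // defines the sort order
    int cmp = Integer.compare(first, o.first);
    return cmp != 0 ? cmp : Integer.compare(second, o.second);
  }
}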

d) Serialization Frameworks: Hadoop has an API for pluggable


serialization frameworks. A serialization framework is represented by an
implementation of Serialization (in the org.apache.hadoop.io.serializer
package). WritableSerialization, for example, is the implementation of
Serialization for Writable types.
 A Serialization defines a mapping from types to Serializer instances (for
turning an object into a byte stream) and Deserializer instances (for
turning a byte stream into an object).

e) Avro: Apache Avro is a language-neutral data serialization system. The
project was created by Doug Cutting (the creator of Hadoop) to address
the major downside of Hadoop Writables: lack of language portability.
 The Avro specification precisely defines the binary format that all
implementations must support.
4) File-Based Data Structures: Apache Hadoop’s
SequenceFile provides a persistent data structure for
binary key-value pairs. In contrast with other persistent
key-value data structures like B-Trees, you can’t seek to a
specified key to edit, add, or remove it. The file is
append-only.

 For MapReduce-based processing, putting each blob of binary
data into its own file doesn’t scale, so Hadoop developed
a number of higher-level containers for these situations.

 It consists of
a) Sequencefile
b) Mapfile
39
a) Sequence File: Imagine a log file, where each log record is a new
line of text. If you want to log binary types, plain text isn’t a
suitable format. Hadoop’s SequenceFile class fits the bill in this
situation, providing a persistent data structure for
binary key-value pairs.

 Writing a Sequence File: To create a SequenceFile, use one of its


createWriter() static methods, which returns a SequenceFile.Writer
instance.

 Reading a SequenceFile: Reading sequence files from beginning
to end is a matter of creating an instance of SequenceFile.Reader and
iterating over records by repeatedly invoking one of the next() methods.

 The SequenceFile format: A sequence file consists of a header


followed by one or more records.
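 A combined sketch of writing and reading (illustrative; the path is a hypothetical placeholder and the older FileSystem-based factory methods are assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: write a few key-value records to a SequenceFile, then read them back.
public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/example/records.seq");   // hypothetical path

    IntWritable key = new IntWritable();
    Text value = new Text();

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 3; i++) {
        key.set(i);
        value.set("record-" + i);
        writer.append(key, value);                // records are appended one after another
      }
    } finally {
      writer.close();
    }

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      while (reader.next(key, value)) {           // iterate from beginning to end
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}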

40
Fig: The internal structure of a sequence file with no compression
and record compression 41
 b) MapFile: A MapFile is a sorted SequenceFile with an index to permit lookups by key.
MapFile can be thought of as a persistent form of java.util.Map, which is able to grow
beyond the size of a Map that is kept in memory.
 Writing a MapFile: Writing a MapFile is similar to writing a SequenceFile: you create an
instance of MapFile.Writer and call append() to add entries in order (see the sketch below).
 Reading a MapFile: Iterating through the entries in order in a MapFile is similar to the
procedure for a SequenceFile: you create a MapFile.Reader and call next() until it returns false.
 Converting a SequenceFile to a MapFile: A MapFile can be seen as an indexed and sorted
SequenceFile, so it’s quite natural to want to be able to convert a SequenceFile into a
MapFile.
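 A sketch of both operations (illustrative; the directory name is a hypothetical placeholder, and keys must be appended in sorted order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Sketch: build a small MapFile, then look an entry up by key.
public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "/user/example/lookup.map";        // a MapFile is a directory holding data + index files

    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 100; i++) {
        writer.append(new IntWritable(i), new Text("value-" + i));  // keys appended in ascending order
      }
    } finally {
      writer.close();
    }

    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    try {
      Text value = new Text();
      reader.get(new IntWritable(42), value);       // random lookup via the in-memory index
      System.out.println(value);
    } finally {
      reader.close();
    }
  }
}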

Fig: The internal structure of a sequence file with block compression 42
