Unit II Hadoop IO

The document discusses key concepts related to Hadoop I/O, focusing on data integrity, compression, and serialization. It explains how HDFS ensures data integrity through checksumming and error detection, as well as the importance of compression formats for efficient data storage and transfer. Additionally, it covers serialization processes, the Writable interface, and various data structures like SequenceFile and MapFile used in Hadoop for handling binary key-value pairs.


Hadoop I/O

Data Integrity, Compression, Serialization, Avro, and File-Based Data Structures
Data Integrity

• Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire lifecycle. It is a critical aspect of the design, implementation, and usage of any system that stores, processes, or retrieves data.

• Users of Hadoop rightly expect that no data will be lost or corrupted during storage or processing. However, every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing.
• When the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.
• The usual way of detecting corrupted data is by computing a checksum for the
data when it first enters the system, and again whenever it is transmitted
across a channel that is unreliable and hence capable of corrupting the data.
• The data is deemed to be corrupt if the newly generated checksum doesn’t
exactly match the original.
• This technique doesn’t offer any way to fix the data—it is merely error
detection.
• A commonly used error-detecting code is CRC-32 (32-bit cyclic redundancy
check), which computes a 32-bit integer checksum for input of any size.
• CRC-32 is used for checksumming in Hadoop’s ChecksumFileSystem, while
HDFS uses a more efficient variant called CRC-32C.
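
A minimal illustration of the compute-once, verify-later idea using Java's built-in java.util.zip.CRC32 (a sketch only; Hadoop's own checksumming uses CRC-32C internally and is not invoked like this):

import java.util.zip.CRC32;

// Sketch: detect corruption by comparing a stored checksum with a recomputed one.
public class ChecksumDemo {
  static long checksum(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data);       // feed all the bytes into the checksum
    return crc.getValue();  // 32-bit checksum, returned as a long
  }

  public static void main(String[] args) {
    byte[] original = "some block of data".getBytes();
    long stored = checksum(original);   // computed when the data first enters the system

    // Later, after the data has crossed an unreliable channel:
    byte[] received = original.clone();
    if (checksum(received) != stored) {
      System.out.println("Data is corrupt (checksums differ)");
    } else {
      System.out.println("Checksums match; no corruption detected");
    }
  }
}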
Data Integrity in HDFS
 HDFS transparently checksums all data written to it and by default verifies checksums
when reading data
– io.bytes.per.checksum
 The amount of data over which each checksum is computed
 Default is 512 bytes

 Datanodes are responsible for verifying the data they receive before storing the data
and its checksum
– If a datanode detects an error, the client receives a ChecksumException, a subclass of IOException

 When clients read data from datanodes, they verify checksums as well, comparing them
with the ones stored at the datanode

 Checksum verification log


– Each datanode keeps a persistent log recording the last time each of its blocks was verified
– When a client successfully verifies a block, it tells the datanode that sent the block
– The datanode then updates its log
Data Integrity in HDFS
 DataBlockScanner
– Background thread that periodically verifies all the blocks stored on the datanode
– Guard against corruption due to “bit rot” in the physical storage media

 Healing corrupted blocks


– If a client detects an error when reading a block, it reports the bad block and the datanode to
the namenode
– Namenode marks the block replica as corrupt
– Namenode schedules a copy of the block to be replicated on another datanode
– The corrupt replica is deleted

 Disabling verification of checksum


– Pass false to the setVerifyChecksum() method on FileSystem
– Use the -ignoreCrc option with the shell's -get or -copyToLocal command
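
A sketch of the programmatic route (hypothetical file URI passed on the command line; useful for salvaging whatever is still readable from a partially corrupt file):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: open an HDFS file with client-side checksum verification disabled
// and dump its bytes to standard output.
public class ReadIgnoringChecksums {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                          // e.g. hdfs://namenode/path/to/file
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    fs.setVerifyChecksum(false);                   // skip checksum verification on read
    FSDataInputStream in = fs.open(new Path(uri));
    IOUtils.copyBytes(in, System.out, 4096, true); // true closes the stream when done
  }
}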

Data Integrity in HDFS
 LocalFileSystem
– Performs client-side checksumming
– When you write a file called filename, the FS client transparently creates a hidden
file, .filename.crc, in the same directory containing the checksums for each chunk of the file

 RawLocalFileSystem
– Disable checksums
– Use when you don’t need checksums

 ChecksumFileSystem
– Wrapper around FileSystem
– Makes it easy to add checksumming to other (non-checksummed) filesystems
– Underlying FS is called the raw FS

FileSystem rawFs = ...
FileSystem checksummedFs = new ChecksumFileSystem(rawFs);

Compression
 Two major benefits of file compression
– Reduce the space needed to store files
– Speed up data transfer across the network

 When dealing with large volumes of data, both of these savings can be significant, so it
pays to carefully consider how to use compression in Hadoop

Compression Formats
 Compression formats

  Format   | Tool  | Algorithm | Filename Extension | Multiple Files | Splittable
  DEFLATE  | N/A   | DEFLATE   | .deflate           | No             | No
  gzip     | gzip  | DEFLATE   | .gz                | No             | No
  ZIP      | zip   | DEFLATE   | .zip               | Yes            | Yes, at file boundaries
  bzip2    | bzip2 | bzip2     | .bz2               | No             | Yes
  LZO      | lzop  | LZO       | .lzo               | No             | No

 “Splittable” column
– Indicates whether the compression format supports splitting
– Whether you can seek to any point in the stream and start reading from some point further on
– Splittable compression formats are especially suitable for MapReduce

Codecs
 A codec is the implementation of a compression-decompression algorithm

  Compression Format | Hadoop CompressionCodec
  DEFLATE            | org.apache.hadoop.io.compress.DefaultCodec
  gzip               | org.apache.hadoop.io.compress.GzipCodec
  bzip2              | org.apache.hadoop.io.compress.BZip2Codec
  LZO                | com.hadoop.compression.lzo.LzopCodec

 The LZO libraries are GPL-licensed and may not be included in Apache distributions
 CompressionCodec
– createOutputStream(OutputStream out): create a CompressionOutputStream to which
you write your uncompressed data to have it written in compressed form to the underlying
stream
– createInputStream(InputStream in): obtain a CompressionInputStream, which allows
you to read uncompressed data from the underlying stream

Example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

// Compresses standard input with the codec named on the command line
// and writes the compressed bytes to standard output.
public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec =
        (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();  // finish the compressed stream without closing System.out
  }
}
 finish()
– Tell the compressor to finish writing to the compressed stream, but doesn’t close the stream

% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
  | gunzip -
Text
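
The createInputStream() direction is not shown on the slide; here is a sketch of decompressing a file by inferring the codec from its filename extension using CompressionCodecFactory (hypothetical input path; class name FileDecompressor is just for illustration):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Sketch: decompress a file whose codec is inferred from its extension
// (e.g. file.gz -> GzipCodec), writing the result alongside it without the extension.
public class FileDecompressor {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path inputPath = new Path(uri);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);  // null if no codec matches
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }

    String outputUri =
        CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));    // decompressing stream
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}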
Compression and Input Splits
 When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting

 Example of a non-splittable compression problem
– Consider a gzip-compressed file whose compressed size is 1 GB
– Creating a split for each block won't work, since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others

Serialization
 Process of turning structured objects into a byte stream for transmission over a network
or for writing to persistent storage

 Deserialization is the reverse process of serialization

 Requirements
– Compact
 To make efficient use of storage space
– Fast
 The overhead of reading and writing the data should be minimal
– Extensible
 We can transparently read data written in an older format
– Interoperable
 We can read or write persistent data using different languages

Writable Interface
 Writable interface defines two methods
– write() for writing its state to a DataOutput binary stream
– readFields() for reading its state from a DataInput binary stream
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
 Example: IntWritable
IntWritable writable = new IntWritable();
writable.set(163);

public static byte[] serialize(Writable writable) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  DataOutputStream dataOut = new DataOutputStream(out);
  writable.write(dataOut);
  dataOut.close();
  return out.toByteArray();
}

byte[] bytes = serialize(writable);
assertThat(bytes.length, is(4));
assertThat(StringUtils.byteToHexString(bytes), is("000000a3"));
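
For completeness, a matching deserialize() helper in the same snippet style (a sketch; it repopulates a Writable from the bytes produced above):

public static void deserialize(Writable writable, byte[] bytes) throws IOException {
  ByteArrayInputStream in = new ByteArrayInputStream(bytes);
  DataInputStream dataIn = new DataInputStream(in);
  writable.readFields(dataIn);  // read back the state written by write()
  dataIn.close();
}

IntWritable newWritable = new IntWritable();
deserialize(newWritable, bytes);
assertThat(newWritable.get(), is(163));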
WritableComparable and Comparator
 IntWritable implements the WritableComparable interface
public interface WritableComparable<T> extends Writable, Comparable<T> {
}

 Comparison of types is crucial for MapReduce


 Optimization: RawComparator
– Compare records read from a stream without deserializing them into objects
 WritableComparator is a general-purpose implementation of RawComparator
– Provides a default implementation of the raw compare() method
 Deserializes the objects and invokes the objects' compare() method
– Acts as a factory for RawComparator instances

RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));

byte[] b1 = serialize(w1);
byte[] b2 = serialize(w2);
assertThat(comparator.compare(b1, 0, b1.length, b2, 0, b2.length), greaterThan(0));
Writable Classes
 Writable class hierarchy (all in org.apache.hadoop.io)
– Interfaces: Writable, WritableComparable
– Primitive wrappers: BooleanWritable, ByteWritable, IntWritable, VIntWritable, FloatWritable, LongWritable, VLongWritable, DoubleWritable
– Other implementations: NullWritable, Text, BytesWritable, MD5Hash, ObjectWritable, GenericWritable
– Collections: ArrayWritable, TwoDArrayWritable, AbstractMapWritable, MapWritable, SortedMapWritable
Writable Wrappers for Java Primitives
 There are Writable wrappers for all the Java primitive types except short and char (both of which can be stored in an IntWritable)
 get() for retrieving and set() for storing the wrapped value
 Variable-length formats
– If a value is between -112 and 127, use only a single byte
– Otherwise, use the first byte to indicate whether the value is positive or negative and how many bytes follow

  Java primitive | Writable implementation | Serialized size (bytes)
  boolean        | BooleanWritable         | 1
  byte           | ByteWritable            | 1
  int            | IntWritable             | 4
                 | VIntWritable            | 1–5
  float          | FloatWritable           | 4
  long           | LongWritable            | 8
                 | VLongWritable           | 1–9
  double         | DoubleWritable          | 8

  Example: 163 serialized as a VIntWritable is the two bytes 8f a3 (1000 1111 1010 0011). The first byte, 0x8f (-113 in two's complement), marks a positive value with one byte following; the second byte is 163.
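
A short sketch contrasting the fixed- and variable-length encodings of 163, reusing the serialize() helper from the Writable Interface section:

byte[] fixed = serialize(new IntWritable(163));
assertThat(StringUtils.byteToHexString(fixed), is("000000a3"));  // always 4 bytes

byte[] variable = serialize(new VIntWritable(163));
assertThat(StringUtils.byteToHexString(variable), is("8fa3"));   // 2 bytes: length marker + value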
Text
 Writable for UTF-8 sequences
 Can be thought of as the Writable equivalent of java.lang.String
 Replacement for the org.apache.hadoop.io.UTF8 class (deprecated)
 Maximum size is 2GB
 Use standard UTF-8
– org.apache.hadoop.io.UTF8 used Java’s modified UTF-8
 Indexing for the Text class is in terms of position in the encoded byte sequence
 Text is mutable (like all Writable implementations, except NullWritable)
– You can reuse a Text instance by calling one of the set() methods

Text t = new Text("hadoop");
t.set("pig");
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3));

Etc.
 BytesWritable
– Wrapper for an array of binary data
 NullWritable
– Zero-length serialization
– Used as a placeholder
– A key or a value can be declared as a NullWritable when you don’t need to use that position
 ObjectWritable
– General-purpose wrapper for Java primitives, String, enum, Writable, null, arrays of any of
these types
– Useful when a field can be of more than one type
 Writable collections
– ArrayWritable
– TwoDArrayWritable
– MapWritable (see the sketch below)
– SortedMapWritable
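
A brief sketch of MapWritable in use (it behaves as a java.util.Map<Writable, Writable> whose keys and values can mix Writable types):

MapWritable src = new MapWritable();
src.put(new IntWritable(1), new Text("cat"));
src.put(new VIntWritable(2), new LongWritable(163));

MapWritable dest = new MapWritable();
WritableUtils.cloneInto(dest, src);  // copy via a serialization round-trip
assertThat((Text) dest.get(new IntWritable(1)), is(new Text("cat")));
assertThat((LongWritable) dest.get(new VIntWritable(2)), is(new LongWritable(163)));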

Serialization Frameworks
 Using Writable is not mandated by the MapReduce API

 Only requirement
– Mechanism that translates to and from a binary representation of each type

 Hadoop has an API for pluggable serialization frameworks

 A serialization framework is represented by an implementation of Serialization (in the org.apache.hadoop.io.serializer package)

 A Serialization defines a mapping from types to Serializer instances and Deserializer instances

 Set the io.serializations property to a comma-separated list of classnames to register Serialization implementations
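
As a sketch, registering Java Object Serialization alongside the default Writable serialization could look like this (JavaSerialization ships with Hadoop but is generally discouraged, being neither compact nor fast):

import org.apache.hadoop.conf.Configuration;

// Sketch: register serialization frameworks programmatically.
Configuration conf = new Configuration();
conf.setStrings("io.serializations",
    "org.apache.hadoop.io.serializer.WritableSerialization",
    "org.apache.hadoop.io.serializer.JavaSerialization");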

SequenceFile
 Persistent data structure for binary key-value pairs

 Usage example
– Binary log file
 Key: timestamp
 Value: log
– Container for smaller files

 The keys and values stored in a SequenceFile do not necessarily need to be Writable

 Any types that can be serialized and deserialized by a Serialization may be used

Writing a SequenceFile
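
The original slide shows a code listing; here is a minimal sketch of writing a SequenceFile of IntWritable keys and Text values (hypothetical output path and sample data; the classic createWriter(FileSystem, Configuration, ...) signature is used):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: write 100 IntWritable/Text key-value pairs to a SequenceFile.
public class SequenceFileWriteDemo {
  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks"
  };

  public static void main(String[] args) throws Exception {
    String uri = args[0];                           // output path, e.g. numbers.seq
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
      for (int i = 0; i < 100; i++) {
        key.set(100 - i);
        value.set(DATA[i % DATA.length]);
        writer.append(key, value);                  // append the current key-value pair
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}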

Reading a SequenceFile
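
Again a sketch in place of the slide's listing: iterate over every record with next(), creating key and value instances reflectively from the types recorded in the file header:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: iterate over every record in a SequenceFile, printing position, key, and value.
public class SequenceFileReadDemo {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      // The key and value types are recorded in the file header, so instances
      // can be created reflectively whatever the actual types are.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long position = reader.getPosition();
      while (reader.next(key, value)) {             // returns false at end of file
        System.out.printf("[%s]\t%s\t%s%n", position, key, value);
        position = reader.getPosition();            // start of the next record
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}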

Sync Point
 Point in the stream which can be used to resynchronize with a record boundary if the
reader is “lost”—for example, after seeking to an arbitrary position in the stream

 sync(long position)
– Position the reader at the next sync point after position

 Do not confuse this with the sync() method defined by the Syncable interface for synchronizing buffers to the underlying device

SequenceFile Format
 Header contains the version number, the names of the key and value classes, compression details, user-defined metadata, and the sync marker
 Record format
– No compression
– Record compression
– Block compression

MapFile
 Sorted SequenceFile with an index to permit lookups by key

 Keys must be instances of WritableComparable and values must be Writable

Reading a MapFile
 Call the next() method until it returns false

 Random-access lookup can be performed by calling the get() method
– Read the index file into memory
– Perform a binary search on the in-memory index
 Very large MapFile index
– Reindex to change the index interval
– Load only a fraction of the index keys into memory by setting the io.map.index.skip property
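
A sketch of building a small MapFile and looking a key up by random access (hypothetical directory path and data; keys must be appended in sorted order):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Sketch: write a MapFile, then use get() to perform an indexed lookup.
public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                            // MapFile directory, e.g. numbers.map
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    // Write: keys must be added in sorted order.
    IntWritable key = new IntWritable();
    Text value = new Text();
    MapFile.Writer writer = new MapFile.Writer(conf, fs, uri, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < 1000; i++) {
        key.set(i);
        value.set("entry-" + i);
        writer.append(key, value);
      }
    } finally {
      writer.close();
    }

    // Read: get() performs the in-memory index binary search described above.
    MapFile.Reader reader = new MapFile.Reader(fs, uri, conf);
    try {
      Text result = new Text();
      reader.get(new IntWritable(496), result);      // returns null if the key is absent
      System.out.println(result);                    // prints "entry-496"
    } finally {
      reader.close();
    }
  }
}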

