Unit II: Hadoop I/O
Datanodes are responsible for verifying the data they receive before storing the data
and its checksum
– If a datanode detects an error, the client receives a ChecksumException, a subclass of IOException
When clients read data from datanodes, they also verify checksums, comparing them
with the ones stored at the datanode
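A minimal client-side sketch of this behaviour, assuming the usual FileSystem API; the class name ChecksumReadSketch, the URI, and the path are made up for illustration:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf); // hypothetical URI
        // fs.setVerifyChecksum(false);  // would skip client-side verification on read
        try (FSDataInputStream in = fs.open(new Path("/data/part-00000"))) {  // hypothetical path
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (ChecksumException e) {
            // corrupt data was detected while reading; getPos() reports the offset
            System.err.println("Checksum error at offset " + e.getPos());
        }
    }
}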
Data Integrity in HDFS
LocalFileSystem
– Performs client-side checksumming
– When you write a file called filename, the filesystem client transparently creates a hidden
file, .filename.crc, in the same directory, containing the checksums for each chunk of the file
RawLocalFileSystem
– Disables checksums
– Use it when you don’t need checksums
ChecksumFileSystem
– Wrapper around FileSystem
– Makes it easy to add checksumming to other (non-checksummed) filesystems
– The underlying filesystem is called the raw filesystem
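A small sketch of how these pieces fit together on the local filesystem; the class name LocalChecksumSketch and the file names are made up:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class LocalChecksumSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Checksummed local filesystem: writing example.txt also creates .example.txt.crc
        LocalFileSystem localFs = FileSystem.getLocal(conf);
        localFs.create(new Path("example.txt")).close();

        // The raw (non-checksummed) filesystem underneath, as exposed by ChecksumFileSystem
        FileSystem rawFs = localFs.getRawFileSystem();
        rawFs.create(new Path("raw.txt")).close();   // no .crc file is written
    }
}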
Compression
Two major benefits of file compression
– Reduce the space needed to store files
– Speed up data transfer across the network
When dealing with large volumes of data, both of these savings can be significant, so it
pays to carefully consider how to use compression in Hadoop
Compression Formats
Compression Format | Tool  | Algorithm | Filename Extension | Multiple Files | Splittable
DEFLATE            | N/A   | DEFLATE   | .deflate           | No             | No
gzip               | gzip  | DEFLATE   | .gz                | No             | No
ZIP                | zip   | DEFLATE   | .zip               | Yes            | Yes, at file boundaries
bzip2              | bzip2 | bzip2     | .bz2               | No             | Yes
LZO                | lzop  | LZO       | .lzo               | No             | No
“Splittable” column
– Indicates whether the compression format supports splitting, that is, whether you can seek to any point in the stream and start reading from some point further on
– Splittable compression formats are especially suitable for MapReduce, because each split can be processed independently
Codecs
A codec is the implementation of a compression-decompression algorithm
Compression Format | Hadoop CompressionCodec
DEFLATE            | org.apache.hadoop.io.compress.DefaultCodec
gzip               | org.apache.hadoop.io.compress.GzipCodec
bzip2              | org.apache.hadoop.io.compress.BZip2Codec
LZO                | com.hadoop.compression.lzo.LzopCodec
The LZO libraries are GPL-licensed and may not be included in Apache distributions
CompressionCodec
– createOutputStream(OutputStream out): create a CompressionOutputStream to which
you write your uncompressed data to have it written in compressed form to the underlying
stream
– createInputStream(InputStream in): obtain a CompressionInputStream, which allows
you to read uncompressed data from the underlying stream
Example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

// Compresses data from standard input to standard output,
// using the codec class named on the command line
public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec)
            ReflectionUtils.newInstance(codecClass, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
    }
}
finish()
– Tells the compressor to finish writing to the compressed stream, but does not close the stream
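The class above can be driven from a shell with something like:
% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
For the decompression direction, createInputStream() can be combined with CompressionCodecFactory, which infers the codec from the filename extension. The sketch below follows that pattern; the class name FileDecompressor is illustrative:
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        // Pick the codec from the filename extension (e.g. .gz -> GzipCodec)
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }
        // Strip the extension to form the output filename
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}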
Serialization
Process of turning structured objects into a byte stream for transmission over a network
or for writing to persistent storage
Requirements
– Compact
To make efficient use of storage space
– Fast
The overhead of reading and writing data should be minimal
– Extensible
We can transparently read data written in an older format
– Interoperable
We can read or write persistent data using different languages
Writable Interface
Writable interface defines two methods
– write() for writing its state to a DataOutput binary stream
– readFields() for reading its state from a DataInput binary stream
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
Example: IntWritable
IntWritable writable = new IntWritable();
writable.set(163);
Other Writable implementations include: VIntWritable, VLongWritable, FloatWritable, DoubleWritable, LongWritable, ArrayWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, AbstractMapWritable, ObjectWritable, GenericWritable, and MD5Hash
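A short round-trip sketch for the IntWritable example above; the helper names serialize() and deserialize() and the class WritableRoundTrip are made up for illustration:
import java.io.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableRoundTrip {
    // Serialize a Writable to a byte array using its write() method
    static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    // Populate a Writable from a byte array using its readFields() method
    static void deserialize(Writable writable, byte[] bytes) throws IOException {
        DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(bytes));
        writable.readFields(dataIn);
        dataIn.close();
    }

    public static void main(String[] args) throws IOException {
        IntWritable writable = new IntWritable();
        writable.set(163);
        byte[] bytes = serialize(writable);      // 4 bytes, big-endian int
        IntWritable copy = new IntWritable();
        deserialize(copy, bytes);
        System.out.println(copy.get());          // prints 163
    }
}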
Writable Wrappers for Java Primitives
There are Writable wrappers for all the Java primitive types except short and char (both
of which can be stored in an IntWritable)
get() for retrieving and set() for storing the wrapped value
Variable-length formats (VIntWritable and VLongWritable)
– If a value is between -112 and 127, it is encoded in a single byte
– Otherwise, the first byte indicates whether the value is positive or negative and how many
bytes follow
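A quick sketch of the effect, using WritableUtils.getVIntSize() to report the encoded length; the class name VIntSizes is made up:
import org.apache.hadoop.io.WritableUtils;

public class VIntSizes {
    public static void main(String[] args) {
        System.out.println(WritableUtils.getVIntSize(127));        // 1 byte
        System.out.println(WritableUtils.getVIntSize(-112));       // 1 byte
        System.out.println(WritableUtils.getVIntSize(128));        // 2 bytes
        System.out.println(WritableUtils.getVIntSize(1L << 40));   // several bytes
    }
}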
Other Writable Classes
BytesWritable
– Wrapper for an array of binary data
NullWritable
– Zero-length serialization
– Used as a placeholder
– A key or a value can be declared as a NullWritable when you don’t need to use that position
ObjectWritable
– General-purpose wrapper for Java primitives, String, enum, Writable, null, arrays of any of
these types
– Useful when a field can be of more than one type
Writable collections
– ArrayWritable
– TwoDArrayWritable
– MapWritable
– SortedMapWritable
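A brief sketch of MapWritable in use (the class name MapWritableSketch and the entries are made up); keys and values may be of different Writable types within one map:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VIntWritable;

public class MapWritableSketch {
    public static void main(String[] args) {
        MapWritable src = new MapWritable();
        src.put(new IntWritable(1), new Text("cat"));
        src.put(new VIntWritable(2), new LongWritable(163));
        // MapWritable implements java.util.Map<Writable, Writable>
        Text value = (Text) src.get(new IntWritable(1));
        System.out.println(value);   // cat
    }
}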
Serialization Frameworks
Using Writable is not mandated by the MapReduce API
Only requirement
– A mechanism that translates each type to and from a binary representation (see the configuration sketch below)
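As a sketch of how an alternative mechanism is plugged in, the io.serializations configuration property lists the Serialization implementations Hadoop consults; the class name SerializationsConfig and the particular list shown are only an example:
import org.apache.hadoop.conf.Configuration;

public class SerializationsConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Writable serialization is registered by default; JavaSerialization is added here
        conf.setStrings("io.serializations",
            "org.apache.hadoop.io.serializer.WritableSerialization",
            "org.apache.hadoop.io.serializer.JavaSerialization");
    }
}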
SequenceFile
Persistent data structure for binary key-value pairs
Usage examples
– Binary log file
Key: timestamp
Value: the log entry
– Container for smaller files
The keys and values stored in a SequenceFile do not have to be Writable; any type that
can be serialized and deserialized by a Serialization may be used
Writing a SequenceFile
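A minimal writing sketch, assuming IntWritable keys and Text values; the class name SequenceFileWriteSketch, the loop, and the sample records are illustrative:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteSketch {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path,
                key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set("record-" + i);
                writer.append(key, value);   // append key-value pairs in sequence
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}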
Reading a SequenceFile
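A matching reading sketch; the key and value classes are taken from the file's header via the reader, and ReflectionUtils instantiates them. The class name SequenceFileReadSketch is made up:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadSketch {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            Writable key = (Writable)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {        // false at end of file
                System.out.printf("%s\t%s%n", key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}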
Sync Point
A point in the stream that can be used to resynchronize with a record boundary if the
reader is “lost”, for example after seeking to an arbitrary position in the stream
sync(long position)
– Positions the reader at the next sync point after position
Do not confuse this with the sync() method defined by the Syncable interface, which
synchronizes buffers to the underlying device
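A tiny usage sketch, continuing with the reader variable from the reading sketch above; the byte offset 360 is arbitrary:
// Continuing with the SequenceFile.Reader from the reading sketch:
reader.sync(360);                        // jump to the next sync point after byte 360
long pos = reader.getPosition();         // position of the record that will be read next
boolean sawSync = reader.syncSeen();     // whether the last call to next() passed a sync marker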
SequenceFile Format
The header contains the version number, the names of the key and value classes, compres-
sion details, user-defined metadata, and the sync marker
Record formats
– No compression
– Record compression (only the values are compressed)
– Block compression (multiple records are compressed together; see the writer sketch below)
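A sketch of requesting block compression when the writer is created; the class name CompressedSequenceFileSketch, the output path, and the choice of GzipCodec are illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedSequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path path = new Path("blocks.seq");    // hypothetical output path

        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
            IntWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, codec);   // or RECORD / NONE
        writer.append(new IntWritable(1), new Text("one"));
        writer.close();
    }
}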
MapFile
A sorted SequenceFile with an index to permit lookups by key
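A writing sketch (the class name MapFileWriteSketch, the directory name numbers.map, and the entries are made up); note that MapFile.Writer requires keys to be appended in sorted order:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);

        // A MapFile is a directory holding two SequenceFiles: data and index
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, "numbers.map", IntWritable.class, Text.class);
        try {
            writer.append(new IntWritable(1), new Text("one"));   // keys in ascending order
            writer.append(new IntWritable(2), new Text("two"));
            writer.append(new IntWritable(3), new Text("three"));
        } finally {
            writer.close();
        }
    }
}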
Reading a MapFile
Iterate over the entries in order by calling the next() method until it returns false
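A reading sketch showing both sequential iteration with next() and random access with get(); the class name MapFileReadSketch continues the writing sketch above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);

        MapFile.Reader reader = new MapFile.Reader(fs, "numbers.map", conf);
        try {
            IntWritable key = new IntWritable();
            Text value = new Text();
            while (reader.next(key, value)) {            // sequential scan, false at end
                System.out.println(key + "\t" + value);
            }

            Text hit = new Text();
            // Random access via the in-memory index; returns null if the key is absent
            if (reader.get(new IntWritable(2), hit) != null) {
                System.out.println(hit);                 // two
            }
        } finally {
            reader.close();
        }
    }
}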