IT JOB Tips
Contents
Hadoop I/O
Data Integrity
Compression
Serialization
File-Based Data Structures
Hadoop I/O
Hadoop comes with a set of primitives for data I/O.
Some of these are techniques that are more general than Hadoop, such as data integrity and compression, but deserve special consideration when dealing with multiterabyte datasets.
Others are Hadoop tools or APIs that form the building blocks for developing distributed systems, such as serialization frameworks and on-disk data structures.
Data Integrity
Every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing.
When the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.
The usual way of detecting corrupted data is by computing a checksum for the data.
This technique doesn't offer any way to fix the data; it provides error detection only.
Note that it is possible that it's the checksum that is corrupt, not the data, but this is very unlikely, since the checksum is much smaller than the data.
Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. If a datanode detects an error, the client receives a ChecksumException, a subclass of IOException.
When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanode. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.
Aside from block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to "bit rot" in the physical storage media.
Data Integrity in HDFS
Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica.
If a client detects an error when reading a block:
1. It reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException.
2. The namenode marks the block replica as corrupt, so it doesn't direct clients to it, or try to copy this replica to another datanode.
3. It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level.
4. Once this has happened, the corrupt replica is deleted.
LocalFileSystem
The Hadoop LocalFileSystem performs client-side checksumming. This means that when you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory containing the checksums for each chunk of the file.
As in HDFS, the chunk size is controlled by the io.bytes.per.checksum property, which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so the file can be read back correctly even if the setting for the chunk size has changed.
Checksums are fairly cheap to compute, typically adding a few percent overhead to the time to read or write a file.
It is possible to disable checksums: the use case here is when the underlying filesystem supports checksums natively. This is accomplished by using RawLocalFileSystem in place of LocalFileSystem.
Example…
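A minimal sketch of the two ways to get checksum-free local access, assuming the Hadoop 1.x property name fs.file.impl: either remap the implementation for file:// URIs globally, or create a RawLocalFileSystem instance directly.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class DisableChecksums {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Option 1: disable checksums globally for file:// URIs.
        conf.set("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");

        // Option 2: create a raw (checksum-free) local filesystem instance directly.
        FileSystem rawLocal = new RawLocalFileSystem();
        rawLocal.initialize(URI.create("file:///"), conf);
        System.out.println(rawLocal.getUri());
    }
}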
ChecksumFileSystem
LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy to add checksumming to other filesystems, as ChecksumFileSystem is just a wrapper around FileSystem.
The general idiom is as follows:
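A sketch of the idiom, using LocalFileSystem (a concrete ChecksumFileSystem subclass) to wrap a raw local filesystem:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class ChecksumIdiom {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The raw (non-checksummed) filesystem to be wrapped.
        FileSystem rawFs = new RawLocalFileSystem();
        rawFs.initialize(URI.create("file:///"), conf);

        // Wrap it so that reads and writes are checksummed.
        LocalFileSystem checksummedFs = new LocalFileSystem(rawFs);

        // The wrapped raw filesystem can be recovered again later.
        FileSystem recovered = checksummedFs.getRawFileSystem();
        System.out.println(recovered.getClass().getName());
    }
}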
The underlying filesystem is called the raw filesystem, and may be retrieved using the getRawFileSystem() method on ChecksumFileSystem.
If an error is detected by ChecksumFileSystem when reading a file, it will call its reportChecksumFailure() method.
Compression
All of the tools listed in Table 4-1 give some control over the space/time trade-off at compression time by offering nine different options: -1 means optimize for speed and -9 means optimize for space.
e.g.) gzip -1 file
The different tools have very different compression characteristics.
Both gzip and ZIP are general-purpose compressors, and sit in the middle of the space/time trade-off.
bzip2 compresses more effectively than gzip or ZIP, but is slower.
LZO optimizes for speed: it is faster than gzip and ZIP, but compresses slightly less effectively.
Codecs
A codec is the implementation of a compression-decompression algorithm.
The LZO libraries are GPL-licensed and may not be included in Apache distributions, so the Hadoop codecs for LZO must be downloaded separately from http://code.google.com/p/hadoop-gpl-compression/
Compressing and decompressing streams with CompressionCodec
CompressionCodec has two methods that allow you to easily compress or decompress data.
To compress data being written to an output stream, use the createOutputStream(OutputStream out) method to create a CompressionOutputStream, to which you write your uncompressed data to have it written in compressed form to the underlying stream.
To decompress data being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which allows you to read uncompressed data from the underlying stream.
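For instance, a small program along these lines compresses data read from standard input and writes it to standard output, with the codec class named on the command line (a sketch; run with e.g. org.apache.hadoop.io.compress.GzipCodec as the argument):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassname = args[0];      // e.g. org.apache.hadoop.io.compress.GzipCodec
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec =
            (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        // Wrap System.out so that anything written to 'out' is compressed.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();                          // flush the compressed stream
    }
}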
Inferring CompressionCodecs using CompressionCodecFactory
If you are reading a compressed file, you can normally infer the codec to use by looking at its filename extension. A file ending in .gz can be read with GzipCodec, and so on.
CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec using its getCodec() method, which takes a Path object for the file in question.
The following example shows an application that uses this feature to decompress files.
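A sketch of such an application: it infers the codec from the file's extension, strips the suffix to form the output name, and copies the decompressed bytes.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);   // inferred from the extension
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        // Strip the codec's extension (e.g. ".gz") to form the output filename.
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}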
Native libraries
For performance, it is preferable to use a native library for compression and decompression. For example, in one test, using the native gzip libraries reduced decompression times by up to 50% and compression times by around 10% (compared to the built-in Java implementation).
Hadoop comes with prebuilt native compression libraries for 32- and 64-bit Linux, which you can find in the lib/native directory.
By default, Hadoop looks for native libraries for the platform it is running on, and loads them automatically if they are found.
Native libraries – CodecPool
If you are using a native library and you are doing a lot of compression or decompression in your application, consider using CodecPool, which allows you to reuse compressors and decompressors, thereby amortizing the cost of creating these objects.
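A sketch of a pooled variant of the earlier stream compressor: the Compressor is borrowed from CodecPool and returned when the work is done.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassname = args[0];
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec =
            (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        Compressor compressor = null;
        try {
            // Borrow a compressor from the pool rather than creating a new one.
            compressor = CodecPool.getCompressor(codec);
            CompressionOutputStream out = codec.createOutputStream(System.out, compressor);
            IOUtils.copyBytes(System.in, out, 4096, false);
            out.finish();
        } finally {
            // Return it so it can be reused by later compressions.
            CodecPool.returnCompressor(compressor);
        }
    }
}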
Compression and Input Splits
When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting.
Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.
Imagine now the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block won't work, since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others.
In this case, MapReduce will do the right thing, and not try to split the gzipped file. This will work, but at the expense of locality: a single map will process the 16 HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular, and so may take longer to run.
[Figure: an uncompressed file is read by one mapper per split, while a gzip-compressed file is processed by a single mapper]
Using Compression in MapReduce
If your input files are compressed, they will be automatically decompressed as they are read by MapReduce, using the filename extension to determine the codec to use.
For Example…
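As a sketch using the old "mapred" API (JobConf): compressed inputs need no extra configuration, while gzip-compressing the job output takes two calls. CompressedOutputDriver is just an illustrative name.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressedOutputDriver {
    public static JobConf configure() {
        JobConf conf = new JobConf(CompressedOutputDriver.class);
        // Input files with a known compressed extension (e.g. .gz) are
        // decompressed automatically when read; no configuration needed.
        // Compress the job output with gzip:
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        return conf;
    }
}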
Compressing map output
Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase.
Since the map output is written to disk and transferred across the network to the reducer nodes, by using a fast compressor such as LZO, you can get performance gains simply because the volume of data to transfer is reduced.
Here are the lines to add to enable gzip map output compression in your job:
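A sketch of those settings with the old "mapred" API; the JobConf convenience methods below correspond to the mapred.compress.map.output and mapred.map.output.compression.codec properties:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class MapOutputCompression {
    public static void enable(JobConf conf) {
        // Compress the intermediate map output with gzip.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);
        // Equivalent property settings:
        //   mapred.compress.map.output = true
        //   mapred.map.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec
    }
}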
Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the process of turning a byte stream back into a series of structured objects.
In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.
Writable Interface
The Writable interface defines two methods: one for writing its state to a DataOutput binary stream, and one for reading its state from a DataInput binary stream.
We will use IntWritable, a wrapper for a Java int. We can create one and set its value using the set() method.
To examine the serialized form of the IntWritable, we write a small helper method that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream to capture the bytes in the serialized stream.
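A sketch putting these pieces together; the serialize() helper captures the bytes produced by the Writable's write() method.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableExample {
    // Helper that captures the bytes written by a Writable's write() method.
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        IntWritable writable = new IntWritable();
        writable.set(163);                    // or: new IntWritable(163)
        byte[] bytes = serialize(writable);
        System.out.println(bytes.length);     // an int serializes to 4 bytes
    }
}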
Writable Class
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package. They form the class hierarchy shown in Figure 4-1.
Writable Class
Writable wrappers for Java primitives
There are Writable wrappers for all the Java primitive types except short and char.
All have a get() and a set() method for retrieving and storing the wrapped value.
Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String.
The Text class uses an int to store the number of bytes in the string encoding, so the maximum value is 2 GB. Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8.
Text
Indexing
Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string, or the Java char code unit. For ASCII strings, these three concepts of index position coincide.
Notice that charAt() returns an int representing a Unicode code point, unlike the String variant that returns a char. Text also has a find() method, which is analogous to String's indexOf().
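A small sketch illustrating byte-oriented indexing with charAt() and find():

import org.apache.hadoop.io.Text;

public class TextIndexing {
    public static void main(String[] args) {
        Text t = new Text("hadoop");
        System.out.println(t.getLength());    // 6 bytes in the UTF-8 encoding
        System.out.println(t.charAt(2));      // 100, the Unicode code point for 'd'
        System.out.println(t.charAt(100));    // -1 for an out-of-range index
        System.out.println(t.find("do"));     // 2, the byte offset (like String.indexOf)
    }
}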
Text
Unicode
When we start using characters that are encoded with more than a single byte, the differences between Text and String become clear. Consider the Unicode characters shown in Table 4-7.
All but the last character in the table, U+10400, can be expressed using a single Java char.
Text
Iteration
Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can't just increment the index.
The idiom for iteration is a little obscure: turn the Text object into a java.nio.ByteBuffer, then repeatedly call the bytesToCodePoint() static method on Text with the buffer. This method extracts the next code point as an int and updates the position in the buffer.
For Example…
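A sketch of the iteration idiom, using the characters U+0041, U+00DF, U+6771, and U+10400:

import java.nio.ByteBuffer;
import org.apache.hadoop.io.Text;

public class TextIterator {
    public static void main(String[] args) {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
        int cp;
        // bytesToCodePoint() returns the next code point and advances the buffer.
        while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
            System.out.println(Integer.toHexString(cp));   // 41, df, 6771, 10400
        }
    }
}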
Text
Mutability
Another difference from String is that Text is mutable. You can reuse a Text instance by calling one of the set() methods on it.
For Example…
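A small sketch of reuse via set(), including the usual caveat that getBytes() may return a backing array longer than getLength():

import org.apache.hadoop.io.Text;

public class TextMutability {
    public static void main(String[] args) {
        Text t = new Text("hadoop");
        t.set(new Text("pig"));                   // reuse the same Text instance
        System.out.println(t.getLength());        // 3: the logical length of the value
        System.out.println(t.getBytes().length);  // 6: the backing array is not shrunk,
                                                   // so always pair getBytes() with getLength()
    }
}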
Resorting to String
Text doesn't have as rich an API for manipulating strings as java.lang.String, so in many cases you need to convert the Text object to a String.
NullWritable
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder.
NullWritable can also be useful as a key in a SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().
Serialization Frameworks
Although most MapReduce programs use Writable key and value types, this isn't mandated by the MapReduce API. In fact, any type can be used; the only requirement is that there be a mechanism that translates to and from a binary representation of each type.
To support this, Hadoop has an API for pluggable serialization frameworks. A serialization framework is represented by an implementation of Serialization. WritableSerialization, for example, is the implementation of Serialization for Writable types.
File-Based Data Structures
For some applications, you need a specialized data structure to hold your data. For doing MapReduce-based processing, putting each blob of binary data into its own file doesn't scale, so Hadoop developed a number of higher-level containers for these situations.
Higher-level containers:
SequenceFile
MapFile
SequenceFile
Imagine a logfile, where each log record is a new line of text. If you want to log binary types, plain text isn't a suitable format.
Hadoop's SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs. To use it as a logfile format, you would choose a key, such as a timestamp represented by a LongWritable, and the value would be a Writable that represents the quantity being logged.
SequenceFiles also work well as containers for smaller files. HDFS and MapReduce are optimized for large files, so packing files into a SequenceFile makes storing and processing the smaller files more efficient.
Writing a SequenceFile
To create a SequenceFile, use one of its createWriter() static methods, which returns a SequenceFile.Writer instance.
The keys and values stored in a SequenceFile do not necessarily need to be Writable. Any types that can be serialized and deserialized by a Serialization may be used.
Once you have a SequenceFile.Writer, you then write key-value pairs using the append() method. Then, when you've finished, you call the close() method (SequenceFile.Writer implements java.io.Closeable).
For example…
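A sketch of a writer: it appends 100 IntWritable/Text pairs to a SequenceFile at the path given on the command line.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door"
    };

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                writer.append(key, value);       // write one key-value pair
            }
        } finally {
            IOUtils.closeStream(writer);         // SequenceFile.Writer implements Closeable
        }
    }
}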
Reading a SequenceFile
Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader and iterating over records by repeatedly invoking one of the next() methods.
If you are using Writable types, you can use the next() method that takes a key and a value argument, and reads the next key and value in the stream into these variables:
For example…
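A sketch of a reader for the file written above; the key and value classes are discovered from the file header, and next() is called until it returns false.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // Instantiate the (Writable) key and value types recorded in the file header.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {    // returns false at end of file
                System.out.printf("%s\t%s%n", key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}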
MapFile
A MapFile is a sorted SequenceFile with an index to permit lookups by key. A MapFile can be thought of as a persistent form of java.util.Map (although it doesn't implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
Writing a MapFile
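A sketch of a writer: a MapFile is really a directory holding a data file and an index file, and keys must be appended in sorted order.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileWriteDemo {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                  // a directory: MapFile stores data + index files
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        IntWritable key = new IntWritable();
        Text value = new Text();
        MapFile.Writer writer = null;
        try {
            writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass());
            for (int i = 0; i < 1024; i++) {
                key.set(i + 1);                // keys must be appended in sorted order
                value.set("entry-" + (i + 1));
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}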
Reading a MapFile
Iterating through the entries in order in a MapFile is similar to the procedure for a SequenceFile: you create a MapFile.Reader, then call the next() method until it returns false, signifying that no entry was read because the end of the file was reached.
A random-access lookup is performed by calling the get() method. The return value is used to determine whether an entry was found in the MapFile: if it's null, no value exists for the given key; if the key was found, the value for that key is read into val, as well as being returned from the method call.
For this operation, the MapFile.Reader reads the index file into memory.
A very large MapFile's index can take up a lot of memory. Rather than reindexing to change the index interval, it is possible to load only a fraction of the index keys into memory when reading the MapFile by setting the io.map.index.skip property.
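A sketch of a random-access lookup with get(); the key 496 is just an example value assumed to exist in the MapFile written earlier.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        MapFile.Reader reader = new MapFile.Reader(fs, uri, conf);
        try {
            Text val = new Text();
            // get() returns null if the key is absent; otherwise val holds the value.
            Writable entry = reader.get(new IntWritable(496), val);
            if (entry == null) {
                System.out.println("Key not found");
            } else {
                System.out.println(val);
            }
        } finally {
            reader.close();
        }
    }
}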
Converting a SequenceFile to a MapFile
One way of looking at a MapFile is as an indexed and sorted SequenceFile, so it's quite natural to want to be able to convert a SequenceFile into a MapFile.
For example…
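A sketch of the final step of the conversion, assuming the sequence file has already been sorted and its output renamed to the MapFile data file inside the map directory; MapFile.fix() then rebuilds the index.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;

public class MapFileFixer {
    public static void main(String[] args) throws Exception {
        String mapUri = args[0];                   // the MapFile directory
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(mapUri), conf);

        Path map = new Path(mapUri);
        Path mapData = new Path(map, MapFile.DATA_FILE_NAME);

        // Read the key and value types from the (sorted) data sequence file.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
        Class keyClass = reader.getKeyClass();
        Class valueClass = reader.getValueClass();
        reader.close();

        // Create the MapFile index for the existing data file.
        long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
        System.out.printf("Created MapFile %s with %d entries%n", map, entries);
    }
}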
THANK YOU.