Data Types in Hadoop
While programming for a distributed system, we cannot use the standard Java data types directly. This is
because they do not know how to write themselves to, or read themselves from, a byte stream, i.e. they
are not serializable in the form Hadoop requires.
Hadoop provides data types based on the Writable interface for the serialization and de-serialization of
data stored in HDFS and exchanged during MapReduce computations.
Serialization is the process of converting object data into a byte stream, either for transmission over
the network between nodes in a cluster or for persistent storage.
De-serialization is the reverse process: it converts byte-stream data back into object data, for
example when reading data from HDFS.
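To make this concrete, here is a minimal sketch (assuming Hadoop's client libraries are on the classpath) that serializes an IntWritable into a byte stream and de-serializes it back, using the write() and readFields() methods that every Writable must implement:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialization: the object writes its fields to a byte stream
        IntWritable original = new IntWritable(42);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        original.write(new DataOutputStream(out));

        // De-serialization: a fresh object rebuilds itself from the bytes
        IntWritable restored = new IntWritable();
        restored.readFields(
                new DataInputStream(new ByteArrayInputStream(out.toByteArray())));

        System.out.println(restored.get()); // prints 42
    }
}
```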
All types in Hadoop must implement the Writable interface. Hadoop provides Writable wrappers for
almost all Java primitive types and some other types. However, we might sometimes need to create our
own wrappers for custom objects.
All the Writable wrapper classes have a get() and a set() method for retrieving and storing the wrapped
value.
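For instance, IntWritable wraps a Java int; a minimal sketch:

```java
import org.apache.hadoop.io.IntWritable;

public class GetSetExample {
    public static void main(String[] args) {
        IntWritable count = new IntWritable();
        count.set(163);                  // store a primitive int in the wrapper
        System.out.println(count.get()); // retrieve the wrapped value: prints 163
    }
}
```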
Hadoop also provides another interface called WritableComparable. The WritableComparable interface
is a sub-interface of Hadoop’s Writable and Java’s Comparable interfaces.
Note that the Writable and WritableComparable interfaces are provided in the org.apache.hadoop.io
package.
As we know, data flows from mappers to reducers in the form of (key, value) pairs. It is important to
note that any data type used for a key must implement the WritableComparable interface (which itself
extends Writable), so that keys of that type can be compared with each other for sorting purposes; any
data type used for a value only needs to implement the Writable interface.
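To make the rule concrete, here is a word-count style mapper, a sketch against the org.apache.hadoop.mapreduce API with illustrative class and field names. The output key type, Text, implements WritableComparable, so the framework can sort the keys between the map and reduce phases; the output value type, IntWritable, only needs Writable:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token; Text keys are sortable,
        // IntWritable values are merely serializable.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```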
Writable Classes
These are Writable wrappers for Java primitive data types and they hold a single primitive value. Below
is the list of primitive writable data types available in Hadoop:
BooleanWritable
ByteWritable
IntWritable
VIntWritable
FloatWritable
LongWritable
VLongWritable
DoubleWritable
Note that the serialized sizes of the fixed-length Writable types above are the same as the sizes of the
corresponding Java primitives. VIntWritable and VLongWritable, however, use a variable-length encoding,
so their serialized size depends on the value stored (between 1 and 5 bytes for VIntWritable, and 1 and
9 bytes for VLongWritable).
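The difference is easy to verify. The sketch below (the helper method serializedSize is illustrative, not part of Hadoop) serializes a fixed-length IntWritable and a variable-length VIntWritable and prints the resulting byte counts:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.io.Writable;

public class SerializedSizes {
    // Serialize any Writable into a buffer and report its size in bytes.
    static int serializedSize(Writable w) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        w.write(new DataOutputStream(out));
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(serializedSize(new IntWritable(1)));        // 4 bytes, always
        System.out.println(serializedSize(new IntWritable(1 << 30)));  // 4 bytes, always
        System.out.println(serializedSize(new VIntWritable(1)));       // 1 byte for small values
        System.out.println(serializedSize(new VIntWritable(1 << 30))); // 5 bytes for large values
    }
}
```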
Hadoop provides two types of array Writable classes: one for single-dimensional and another for two-
dimensional arrays:
ArrayWritable
TwoDArrayWritable
The elements of these arrays must themselves be Writable objects, such as IntWritable or FloatWritable,
not Java native data types like int or float.
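For example, a minimal sketch (the class name is illustrative) building an ArrayWritable of IntWritable elements:

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class IntArrayExample {
    public static void main(String[] args) {
        // The element class must be declared up front: readFields() needs it
        // to instantiate the elements during de-serialization.
        ArrayWritable scores = new ArrayWritable(IntWritable.class);
        scores.set(new Writable[] { new IntWritable(10), new IntWritable(20) });

        for (Writable w : scores.get()) {
            System.out.println(((IntWritable) w).get()); // prints 10, then 20
        }
    }
}
```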
Hadoop provides the following map Writable data types, which implement the java.util.Map interface:
AbstractMapWritable: this is the abstract base class for the other MapWritable classes
MapWritable: this is a general-purpose map, mapping Writable keys to Writable values (see the
sketch after this list)
SortedMapWritable: this is a specialization of the MapWritable class that also implements the
SortedMap interface
Hadoop also provides a few other frequently used Writable classes:
NullWritable: a special type of Writable representing a null value. No bytes are read or
written when a data type is specified as NullWritable. So, in MapReduce, a key or a value can be
declared as a NullWritable when we don't need to use that field
ObjectWritable: a general-purpose wrapper that can store objects such as Java primitives,
String, Enum, Writable, null, or arrays of these types
Text: the Writable equivalent of java.lang.String, with a maximum size of 2 GB. Unlike
Java's String data type, Text is mutable in Hadoop
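The following sketch (illustrative class name, assuming Hadoop's client libraries are on the classpath) populates a MapWritable and also demonstrates the NullWritable singleton and the mutability of Text:

```java
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapWritableExample {
    public static void main(String[] args) {
        // A general-purpose map from Writable keys to Writable values.
        MapWritable counts = new MapWritable();
        counts.put(new Text("apples"), new IntWritable(3));
        counts.put(new Text("pears"), new IntWritable(7));

        for (Map.Entry<Writable, Writable> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }

        // NullWritable is a singleton obtained via get(); no new instances exist.
        System.out.println(NullWritable.get() == NullWritable.get()); // true

        // Text is mutable, unlike java.lang.String: the same instance
        // can be reused with a new value.
        Text t = new Text("old");
        t.set("new");
        System.out.println(t); // prints "new"
    }
}
```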
As discussed earlier, Hadoop already has Writable wrappers for most primitive Java data types.
However, if we need to use a data type for which Hadoop doesn't already have a Writable wrapper, we
can do one of the following: