Text Output Lecture
Text Output:
The default output format is TextOutputFormat, which writes records as lines of text.
Keys and values can be of any type, because TextOutputFormat converts them to strings by
calling toString() on them.
Each key-value pair is separated by a tab character (though this separator can be changed
using the mapreduce.output.textoutputformat.separator property, or
mapred.textoutputformat.separator in the old API).
The counterpart to TextOutputFormat for reading is KeyValueTextInputFormat.
KeyValueTextInputFormat breaks lines into key-value pairs based on a configurable
separator.
You can suppress the key or the value (or both) in TextOutputFormat by using NullWritable as
the key or value type.
Using NullWritable causes no separator to be written, making the output suitable for reading
using TextInputFormat.
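The separator and NullWritable rules above can be sketched in plain Java. This is a simplified model for illustration, not Hadoop's own code: the class TextLineSketch and its methods are hypothetical, a null argument stands in for NullWritable, and the separator is treated as an ordinary string.

```java
// Simplified stand-in for the behavior described above; not the Hadoop classes themselves.
class TextLineSketch {
    // Formats one record as a line. A null argument plays the role of NullWritable.
    static String formatRecord(Object key, Object value, String separator) {
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key.toString());
        }
        if (key != null && value != null) {
            line.append(separator); // written only when both key and value are present
        }
        if (value != null) {
            line.append(value.toString());
        }
        return line.toString();
    }

    // Splits a line at the first separator, the way a key-value text reader would.
    static String[] splitRecord(String line, String separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" }; // no separator: whole line is the key
        }
        return new String[] {
            line.substring(0, pos),
            line.substring(pos + separator.length())
        };
    }
}
```

In a real job the separator would be set on the job configuration through the mapreduce.output.textoutputformat.separator property rather than passed as an argument.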
SequenceFileOutputFormat:
Writes sequence files as output.
A good choice when the output will be the input to a further MapReduce job, because sequence
files are compact and can be readily compressed.
Compression is controlled using static methods on SequenceFileOutputFormat.
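A driver might switch on sequence file output and compression roughly as follows. This is a minimal sketch that assumes the new (org.apache.hadoop.mapreduce) API and the Hadoop client libraries on the classpath. The class name SequenceFileOutputConfig, the job name, and the choice of GzipCodec with block compression are illustrative assumptions, not settings taken from these notes.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical helper class; only the static calls on the output formats are Hadoop API.
class SequenceFileOutputConfig {
    // Returns a job configured to write block-compressed sequence files.
    static Job configure(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "sequence file output sketch");
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // Turn compression on and pick a codec (gzip is an assumption here).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // BLOCK compresses groups of records together instead of one record at a time.
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        return job;
    }
}
```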
SequenceFileAsBinaryOutputFormat:
Counterpart to SequenceFileAsBinaryInputFormat.
Writes keys and values in raw binary format into a SequenceFile container.
Suitable for cases where binary data needs to be stored efficiently in SequenceFiles.
MapFileOutputFormat:
Writes MapFiles as output.
Requires keys to be added in order, so reducers must emit keys in sorted order.
A MapFile is a sorted SequenceFile with an index, which makes lookups by key efficient.
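The reason for the ordering requirement is easier to see in a toy model. The sketch below is not Hadoop's MapFile: SortedFileSketch, its in-memory lists, and the indexInterval parameter are assumptions made for illustration. It shows an append-only store that rejects out-of-order keys and keeps a sparse index, so a lookup can binary-search the index and then scan only a short run of entries.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a sorted, indexed file. Not the Hadoop MapFile class.
class SortedFileSketch {
    private final List<String> keys = new ArrayList<>();
    private final List<String> values = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>(); // positions of indexed entries
    private final int indexInterval;

    SortedFileSketch(int indexInterval) {
        this.indexInterval = indexInterval;
    }

    // Appends an entry; a key smaller than the previous one is rejected.
    void append(String key, String value) {
        if (!keys.isEmpty() && key.compareTo(keys.get(keys.size() - 1)) < 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % indexInterval == 0) {
            indexPositions.add(keys.size()); // index every indexInterval-th entry
        }
        keys.add(key);
        values.add(value);
    }

    // Finds the last indexed key <= the target, then scans forward from there.
    String get(String key) {
        int lo = 0;
        int hi = indexPositions.size() - 1;
        int start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys.get(indexPositions.get(mid)).compareTo(key) <= 0) {
                start = indexPositions.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // target sorts before the first entry
        }
        for (int i = start; i < keys.size(); i++) {
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) {
                return values.get(i);
            }
            if (cmp > 0) {
                break; // passed the position where the key would be
            }
        }
        return null;
    }
}
```

The binary search is only valid because the entries are sorted, which is why a reducer writing this kind of output has to emit its keys in order.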
Custom partitioner:
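import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StationPartitioner extends Partitioner<LongWritable, Text> {
  private final NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    parser.parse(value); // Parse the record held in the input Text value.
    // getPartition(String) is a helper not shown in these notes; it maps a station ID to a partition.
    return getPartition(parser.getStationId());
  }
}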
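The partitioner above delegates to a getPartition(String) helper that these notes do not show. One plausible way to write such a helper, offered only as an assumption, is to hash the station ID with the same arithmetic that Hadoop's HashPartitioner applies to keys. The class and method names below are hypothetical.

```java
// Hypothetical helper, not the lecture's own code: maps a station ID to a partition by hashing.
class StationPartitionSketch {
    static int partitionFor(String stationId, int numPartitions) {
        // Clearing the sign bit keeps the result non-negative even when hashCode() is negative.
        return (stationId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Hashing spreads stations across the available partitions while still sending every record for a given station to the same one. A scheme that needs exactly one partition per station would require a lookup table of known station IDs instead.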