Text Output Lecture


Output File Format

Text Output:
- TextOutputFormat is the default output format.
- TextOutputFormat writes records as lines of text.
- Keys and values can be of any type, because TextOutputFormat converts them to strings by calling toString().
- Each key-value pair is separated by a tab character (the separator can be changed with the mapreduce.output.textoutputformat.separator property, or mapred.textoutputformat.separator in the old API).
- The counterpart to TextOutputFormat for reading is KeyValueTextInputFormat, which breaks lines into key-value pairs based on a configurable separator.
- You can suppress the key or the value (or both) in TextOutputFormat by using a NullWritable type.
- Using NullWritable also causes no separator to be written, which makes the output suitable for reading back with plain TextInputFormat.
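
These settings can be sketched as a small configuration helper in the new API. The helper class below is illustrative (not part of the lecture), and it assumes a reducer, not shown, that emits NullWritable keys and Text values:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TextOutputConfig {
  // Hypothetical helper: configures text output for a job whose reducer
  // emits NullWritable keys and Text values (reducer not shown).
  public static void configure(Job job) {
    job.setOutputFormatClass(TextOutputFormat.class);

    // Change the separator written between key and value (default is a tab).
    job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");

    // Declaring the key as NullWritable suppresses it; no separator is written
    // either, so the files can be read back with TextInputFormat.
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
  }
}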

SequenceFileOutputFormat:
- Writes sequence files as output.
- Ideal for use as input to further MapReduce jobs.
- Sequence files are compact and can be readily compressed.
- Compression can be controlled using static methods on SequenceFileOutputFormat.
- Useful for scenarios where compactness and compressibility are important.
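
A rough sketch of that compression wiring follows; the key/value types and the choice of gzip with block compression are illustrative, and the helper class is not part of the lecture:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceOutputConfig {
  // Hypothetical helper: wires a job to write block-compressed sequence files.
  public static void configure(Job job) {
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    // Static methods on SequenceFileOutputFormat (some inherited from
    // FileOutputFormat) control whether and how the output is compressed.
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
  }
}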

SequenceFileAsBinaryOutputFormat:
- Counterpart to SequenceFileAsBinaryInputFormat.
- Writes keys and values in raw binary format into a SequenceFile container.
- Suitable for cases where binary data needs to be stored efficiently in SequenceFiles.
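
A minimal configuration sketch, assuming the reducer emits BytesWritable keys and values; the helper class and the IntWritable/Text types recorded for the underlying SequenceFile are illustrative:

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileAsBinaryOutputFormat;

public class BinarySequenceOutputConfig {
  // Hypothetical helper: the reducer emits BytesWritable keys and values,
  // which are written untouched into the SequenceFile container.
  public static void configure(Job job) {
    job.setOutputFormatClass(SequenceFileAsBinaryOutputFormat.class);
    job.setOutputKeyClass(BytesWritable.class);
    job.setOutputValueClass(BytesWritable.class);

    // Declare the logical key/value types recorded in the SequenceFile header.
    SequenceFileAsBinaryOutputFormat.setSequenceFileOutputKeyClass(job, IntWritable.class);
    SequenceFileAsBinaryOutputFormat.setSequenceFileOutputValueClass(job, Text.class);
  }
}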

MapFileOutputFormat:
- Writes MapFiles as output.
- Requires keys to be added in order, so reducers must emit keys in sorted order.
- MapFiles are an efficient way to store key-value pairs in Hadoop, often used for indexing.
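
Because MapFiles carry an index, a typical follow-up is random lookup against a job's output directory. A rough sketch, assuming the job wrote Text keys and Text values and used the default HashPartitioner (the lookup class itself is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class MapFileLookup {
  // Hypothetical lookup: open the MapFiles under a job's output directory and
  // fetch the value stored for one key, using the same partitioner the job used.
  public static Text lookup(Path outputDir, Configuration conf, Text key) throws Exception {
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(outputDir, conf);
    Text value = new Text();
    // getEntry picks the right part file via the partitioner, then uses the
    // MapFile index to seek to the key; it returns null if the key is absent.
    return (Text) MapFileOutputFormat.getEntry(readers, new HashPartitioner<Text, Text>(), key, value);
  }
}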

Multiple Outputs in Hadoop MapReduce:
- FileOutputFormat and its subclasses generate one file per reducer, with default names such as part-r-00000.
- Sometimes you need more control over file naming, or want to produce multiple files per reducer.
- Hadoop provides the MultipleOutputs class to help achieve this; a reducer sketch follows below.
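
A minimal reducer sketch using MultipleOutputs, assuming upstream mappers key each record by station ID and that NullWritable/Text output types fit the job (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Writes each weather station's records to its own file, named after the
// station ID (e.g. 029070-99999-r-00000) instead of the default "part" prefix.
public class MultipleOutputsReducer
    extends Reducer<Text, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> multipleOutputs;

  @Override
  protected void setup(Context context) {
    multipleOutputs = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // The third argument is the base output path used in place of "part".
      multipleOutputs.write(NullWritable.get(), value, key.toString());
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    multipleOutputs.close();
  }
}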

Example: Partitioning Data by Weather Station:
- Imagine a scenario where you want to partition weather data by weather station, creating one file per station.
- One way to achieve this is a custom partitioner, with the number of reducers set to match the number of weather stations.
- The partitioner assigns records from the same weather station to the same partition based on their station ID.

Custom partitioner (the lecture leaves the getPartition(String) helper unimplemented; the hash-based body below is one possible implementation, and it adds a numPartitions parameter to support it):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StationPartitioner extends Partitioner<LongWritable, Text> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    parser.parse(value); // Parse the raw weather record in the Text value.
    // Delegate to a helper that maps the station ID to a partition index.
    return getPartition(parser.getStationId(), numPartitions);
  }

  // Maps a station ID to a partition index. The lecture does not prescribe the
  // implementation; hashing the ID is one simple option that keeps all records
  // for a station in the same partition.
  private int getPartition(String stationId, int numPartitions) {
    return (stationId.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

- StationPartitioner is a custom partitioner class that extends Hadoop's Partitioner class.
- It partitions records whose key is a LongWritable (the line offset supplied by TextInputFormat) and whose value is the Text line of the record; the station ID is carried inside the value, not the key.
- In the public getPartition method:
  - It parses the Text value with NcdcRecordParser to extract the station ID.
  - It then calls the private getPartition helper to determine the partition index for that station ID.
- The private getPartition(String) helper:
  - This method is crucial, but the lecture does not prescribe its implementation; it must assign a partition index to a given station ID.
  - The exact logic depends on your requirements: it could use hashing (as in the sketch above), a lookup table, or any other mechanism that ensures records with the same station ID end up in the same partition.
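
For completeness, here is a rough sketch of how a driver might wire in this partitioner. The class name, paths, and the reducer count are illustrative; the reducer count must match the actual number of stations, and the identity map and reduce defaults simply pass each record through:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PartitionByStationDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "partition by station");
    job.setJarByClass(PartitionByStationDriver.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Route every record for a station to the same reducer...
    job.setPartitionerClass(StationPartitioner.class);
    // ...and give each station its own reducer, and hence its own output file.
    // The count of 100 is illustrative; it must equal the number of stations.
    job.setNumReduceTasks(100);

    // The identity mapper/reducer emit the TextInputFormat key and value as-is.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}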
