Text Output Lecture


Output File Format

Text Output:
- TextOutputFormat is the default output format.
- TextOutputFormat writes records as lines of text.
- Keys and values can be of any type, because TextOutputFormat converts them to strings by calling toString().
- Each key-value pair is separated by a tab character (the separator can be changed with the mapreduce.output.textoutputformat.separator property, or mapred.textoutputformat.separator in the old API).
- The counterpart to TextOutputFormat for reading is KeyValueTextInputFormat, which breaks lines into key-value pairs based on a configurable separator.
- You can suppress the key or the value (or both) in TextOutputFormat by using a NullWritable type.
- Using NullWritable also causes no separator to be written, which makes the output suitable for reading back with plain TextInputFormat.
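
These settings can be sketched as a small configuration helper in the new API. The helper class below is illustrative (not part of the lecture), and it assumes a reducer, not shown, that emits NullWritable keys and Text values:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TextOutputConfig {
  // Hypothetical helper: configures text output for a job whose reducer
  // emits NullWritable keys and Text values (reducer not shown).
  public static void configure(Job job) {
    job.setOutputFormatClass(TextOutputFormat.class);

    // Change the separator written between key and value (default is a tab).
    job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");

    // Declaring the key as NullWritable suppresses it; no separator is written
    // either, so the files can be read back with TextInputFormat.
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
  }
}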

SequenceFileOutputFormat:
- Writes sequence files as output.
- Ideal for use as input to further MapReduce jobs.
- Sequence files are compact and can be readily compressed.
- Compression can be controlled using static methods on SequenceFileOutputFormat.
- Useful for scenarios where compactness and compressibility are important.
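
A rough sketch of that compression wiring follows; the key/value types and the choice of gzip with block compression are illustrative, and the helper class is not part of the lecture:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceOutputConfig {
  // Hypothetical helper: wires a job to write block-compressed sequence files.
  public static void configure(Job job) {
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    // Static methods on SequenceFileOutputFormat (some inherited from
    // FileOutputFormat) control whether and how the output is compressed.
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
  }
}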

SequenceFileAsBinaryOutputFormat:
- Counterpart to SequenceFileAsBinaryInputFormat.
- Writes keys and values in raw binary format into a SequenceFile container.
- Suitable for cases where binary data needs to be stored efficiently in SequenceFiles.
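
A minimal configuration sketch, assuming the reducer emits BytesWritable keys and values; the helper class and the IntWritable/Text types recorded for the underlying SequenceFile are illustrative:

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileAsBinaryOutputFormat;

public class BinarySequenceOutputConfig {
  // Hypothetical helper: the reducer emits BytesWritable keys and values,
  // which are written untouched into the SequenceFile container.
  public static void configure(Job job) {
    job.setOutputFormatClass(SequenceFileAsBinaryOutputFormat.class);
    job.setOutputKeyClass(BytesWritable.class);
    job.setOutputValueClass(BytesWritable.class);

    // Declare the logical key/value types recorded in the SequenceFile header.
    SequenceFileAsBinaryOutputFormat.setSequenceFileOutputKeyClass(job, IntWritable.class);
    SequenceFileAsBinaryOutputFormat.setSequenceFileOutputValueClass(job, Text.class);
  }
}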

MapFileOutputFormat:
- Writes MapFiles as output.
- Requires keys to be added in order, so reducers must emit keys in sorted order.
- MapFiles are an efficient way to store key-value pairs in Hadoop, often used for indexing.
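
Because MapFiles carry an index, a typical follow-up is random lookup against a job's output directory. A rough sketch, assuming the job wrote Text keys and Text values and used the default HashPartitioner (the lookup class itself is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class MapFileLookup {
  // Hypothetical lookup: open the MapFiles under a job's output directory and
  // fetch the value stored for one key, using the same partitioner the job used.
  public static Text lookup(Path outputDir, Configuration conf, Text key) throws Exception {
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(outputDir, conf);
    Text value = new Text();
    // getEntry picks the right part file via the partitioner, then uses the
    // MapFile index to seek to the key; it returns null if the key is absent.
    return (Text) MapFileOutputFormat.getEntry(readers, new HashPartitioner<Text, Text>(), key, value);
  }
}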

Multiple Outputs in Hadoop MapReduce:
- FileOutputFormat and its subclasses generate one file per reducer, with default names such as part-r-00000.
- Sometimes you need more control over file naming, or want to produce multiple files per reducer.
- Hadoop provides the MultipleOutputs class to help achieve this; a reducer sketch follows below.
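
A minimal reducer sketch using MultipleOutputs, assuming upstream mappers key each record by station ID and that NullWritable/Text output types fit the job (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Writes each weather station's records to its own file, named after the
// station ID (e.g. 029070-99999-r-00000) instead of the default "part" prefix.
public class MultipleOutputsReducer
    extends Reducer<Text, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> multipleOutputs;

  @Override
  protected void setup(Context context) {
    multipleOutputs = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // The third argument is the base output path used in place of "part".
      multipleOutputs.write(NullWritable.get(), value, key.toString());
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    multipleOutputs.close();
  }
}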

Example: Partitioning Data by Weather Station:
- Imagine a scenario where you want to partition weather data by weather station, creating one file per station.
- One way to achieve this is a custom partitioner, with the number of reducers set to match the number of weather stations.
- The partitioner assigns records from the same weather station to the same partition based on their station ID.

Custom partitioner (the lecture leaves the getPartition(String) helper unimplemented; the hash-based body below is one possible implementation, and it adds a numPartitions parameter to support it):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class StationPartitioner extends Partitioner<LongWritable, Text> {

  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    parser.parse(value); // Parse the raw weather record in the Text value.
    // Delegate to a helper that maps the station ID to a partition index.
    return getPartition(parser.getStationId(), numPartitions);
  }

  // Maps a station ID to a partition index. The lecture does not prescribe the
  // implementation; hashing the ID is one simple option that keeps all records
  // for a station in the same partition.
  private int getPartition(String stationId, int numPartitions) {
    return (stationId.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

- StationPartitioner is a custom partitioner class that extends Hadoop's Partitioner class.
- It partitions records whose key is a LongWritable (the line offset supplied by TextInputFormat) and whose value is the Text line of the record; the station ID is carried inside the value, not the key.
- In the public getPartition method:
  - It parses the Text value with NcdcRecordParser to extract the station ID.
  - It then calls the private getPartition helper to determine the partition index for that station ID.
- The private getPartition(String) helper:
  - This method is crucial, but the lecture does not prescribe its implementation; it must assign a partition index to a given station ID.
  - The exact logic depends on your requirements: it could use hashing (as in the sketch above), a lookup table, or any other mechanism that ensures records with the same station ID end up in the same partition.
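
For completeness, here is a rough sketch of how a driver might wire in this partitioner. The class name, paths, and the reducer count are illustrative; the reducer count must match the actual number of stations, and the identity map and reduce defaults simply pass each record through:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PartitionByStationDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "partition by station");
    job.setJarByClass(PartitionByStationDriver.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Route every record for a station to the same reducer...
    job.setPartitionerClass(StationPartitioner.class);
    // ...and give each station its own reducer, and hence its own output file.
    // The count of 100 is illustrative; it must equal the number of stations.
    job.setNumReduceTasks(100);

    // The identity mapper/reducer emit the TextInputFormat key and value as-is.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}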
