Hadoop Week 4

The document provides contact information for the Edureka Hadoop training program and an overview of the course topics across 8 weeks. It recaps the concepts covered in weeks 1-3, which include building a Hadoop cluster, the MapReduce concepts of Mapper, Reducer, and Driver, and input file formats. It then discusses input splits and record readers, with examples of input formats such as TextInputFormat, NLineInputFormat, SequenceFileInputFormat, and KeyValueInputFormat.


Connect with us

• 24x7 Support on Skype, Email & Phone
• Skype ID – edureka.hadoop
• Email – [email protected]
• Call us – +91 88808 62004
• Venkat – [email protected]
Course Topics

• Week 1 – Introduction to HDFS
• Week 2 – Setting Up Hadoop Cluster
• Week 3 – Map-Reduce Basics, Types and Formats
• Week 4 – PIG
• Week 5 – HIVE
• Week 6 – HBASE
• Week 7 – ZOOKEEPER
• Week 8 – SQOOP
Recap of Weeks 1, 2 and 3

• Built a Hadoop cluster
• MapReduce concepts
  – Mapper
  – Reducer
  – Driver
• InputFormat
  – Input Split
  – Record Reader
• Overview of input file formats
Map Reduce – A closer look
Input files: where the data for a MapReduce task is initially stored.

InputFormat: defines how these files are split up and read. It provides the following functionality:

• Splits up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
• Provides the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
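
These two responsibilities correspond directly to the two abstract methods of InputFormat in the new (org.apache.hadoop.mapreduce) API. A minimal sketch of the contract, imports omitted:

public abstract class InputFormat<K, V> {

  // Responsibility 1: carve the input into logical splits, one per Mapper
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Responsibility 2: supply the RecordReader that turns a split into records
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}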
Let us run a simple Map Reduce Task

WordCount Example

• Mapper
• Reducer
• Driver
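
A hedged sketch of all three pieces, using the new (org.apache.hadoop.mapreduce) API; class and variable names here are illustrative, not taken from the slides:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the job together; input/output paths come from the command line
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}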
FileSplit is the default Input Split
(Note: some InputFormats, such as NLineInputFormat, override getSplits().)

public abstract class FileInputFormat<K, V> extends InputFormat<K, V> {
  ...
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        ...
      }
    }
    ...
  }

  protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }
}
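
Following the note above, switching a job to NLineInputFormat is a two-line change in the driver. A minimal sketch, assuming an existing Job named job; the value 10 is an arbitrary example:

job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 10);  // each InputSplit now carries up to 10 lines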
TextInputFormat is the default InputFormat; its record reader is LineRecordReader

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text>
      createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new LineRecordReader();
  }
}

public class LineRecordReader extends RecordReader<LongWritable, Text> {
  ...
  public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    key.set(pos);
    ...
    while (pos < end) {
      newSize = in.readLine(value, maxLineLength,
          Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
      ...
      pos += newSize;
    }
    ...
  }
}
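
Since createRecordReader() is the hook shown above, a custom record reader can be plugged in the same way. A minimal sketch (UpperCaseRecordReader is a hypothetical name): it delegates all I/O to the stock LineRecordReader and upper-cases each line before the Mapper sees it.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class UpperCaseRecordReader extends RecordReader<LongWritable, Text> {
  // delegate the splitting and file I/O to the stock line reader
  private final LineRecordReader delegate = new LineRecordReader();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    delegate.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!delegate.nextKeyValue()) {
      return false;
    }
    // transform the record before handing it to the Mapper
    value.set(delegate.getCurrentValue().toString().toUpperCase());
    return true;
  }

  @Override
  public LongWritable getCurrentKey() throws IOException, InterruptedException {
    return delegate.getCurrentKey();
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }

  @Override
  public float getProgress() throws IOException, InterruptedException {
    return delegate.getProgress();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }
}

To use it, a small FileInputFormat subclass would return this reader from its createRecordReader() method, exactly as TextInputFormat does above.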
Key InputFormats
• TextInputFormat – the default format; reads lines of text files. Key: the byte offset of the line. Value: the line contents.
• KeyValueInputFormat – parses lines into key/value pairs. Key: everything up to the first tab character. Value: the remainder of the line.
• SequenceFileInputFormat – a Hadoop-specific high-performance binary format. Key: user-defined. Value: user-defined.
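
A minimal driver-side sketch of choosing among these, assuming an existing Job named job. Note: the concrete class Hadoop ships for the tab-separated format is named KeyValueTextInputFormat; all three live under org.apache.hadoop.mapreduce.lib.input.

// pick exactly one per job:
job.setInputFormatClass(TextInputFormat.class);          // the default; byte offset -> line
job.setInputFormatClass(KeyValueTextInputFormat.class);  // text before first tab -> rest of line
job.setInputFormatClass(SequenceFileInputFormat.class);  // user-defined binary key/value pairs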
Sample code and execution
• Custom Record Reader
• NLineInputFormat
• SequenceFileInputFormat
• KeyValueInputFormat
• Custom Partitioner (a sketch follows below)
• Custom Combiner
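
A minimal custom partitioner sketch (AlphabetPartitioner is a hypothetical name; it assumes the job is configured with two reducers): keys beginning with a-m go to reducer 0, the rest to reducer 1.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString();
    if (numPartitions < 2 || k.isEmpty()) {
      return 0;  // degenerate cases: single reducer or empty key
    }
    char first = Character.toLowerCase(k.charAt(0));
    // keys starting a-m go to partition 0, everything else to partition 1
    return (first >= 'a' && first <= 'm') ? 0 : 1;
  }
}

It is registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class) alongside job.setNumReduceTasks(2). A custom combiner is simply a Reducer registered via job.setCombinerClass(...), as the WordCount sketch earlier shows.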
Display running jobs

hadoop job -list

A job can be killed with the following command:

hadoop job -kill <job-id>
Clarifications

Q & A..?
Thank You
See You in Class Next Week
