Hadoop Week 4
Skype ID – edureka.hadoop
Email – [email protected]
Venkat – [email protected]
Course Topics
Week 1 – Introduction to HDFS
Week 2 – Setting Up Hadoop Cluster
Week 3 – Map-Reduce Basics, types and formats
Week 4 – PIG
Week 5 – HIVE
Week 6 – HBASE
Week 7 – ZOOKEEPER
Week 8 – SQOOP
Recap of Weeks 1, 2 and 3
WordCount Example
• Mapper
• Reducer
• Driver
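To recap the data flow the Mapper, Reducer and Driver implement, here is a minimal local sketch of the WordCount logic in plain Java collections (no Hadoop dependency) — the class and method names are illustrative, not part of the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local sketch of WordCount's map -> shuffle -> reduce data flow.
public class WordCountLocal {

    // Map phase: emit a (word, 1) pair for every token of every line,
    // just like the WordCount Mapper.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by key and sum the counts,
    // which is what the framework's sort/group step plus the Reducer do.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("hello hadoop", "hello world");
        System.out.println(reduce(map(input))); // {hadoop=1, hello=2, world=1}
    }
}
```

In the real job, the Driver wires these two phases together via Job configuration; the framework, not user code, performs the grouping between them.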
FileSplit is the default InputSplit
Note: some InputFormats, such as NLineInputFormat, override getSplits().

public abstract class FileInputFormat<K, V> extends InputFormat<K, V> {
    ...
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
        long maxSize = getMaxSplitSize(job);
        // generate splits
        List<InputSplit> splits = new ArrayList<InputSplit>();
        List<FileStatus> files = listStatus(job);
        for (FileStatus file : files) {
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            long length = file.getLen();
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
            if ((length != 0) && isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);
                ...
            }
        }
        ...
    }

    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}
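The clamp in computeSplitSize() can be checked in isolation. Below is a standalone copy of that one-line formula (the demo class name is ours, not Hadoop's) showing how the split size tracks the HDFS block size until the min/max limits are tightened:

```java
public class SplitSizeDemo {
    // Same formula as FileInputFormat.computeSplitSize():
    // clamp blockSize into the [minSize, maxSize] range.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block

        // With the defaults (minSize = 1, maxSize = Long.MAX_VALUE),
        // the split size equals the block size: one split per block.
        System.out.println(computeSplitSize(blockSize, 1, Long.MAX_VALUE));

        // Lowering maxSize to 64 MB forces smaller, more numerous splits.
        System.out.println(computeSplitSize(blockSize, 1, 64L * 1024 * 1024));

        // Raising minSize above the block size forces larger splits,
        // each spanning more than one block.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
    }
}
```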
TextInputFormat is the default InputFormat; its createRecordReader() returns a LineRecordReader
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text>
            createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new LineRecordReader();
    }
}
• NLineInputFormat
• SequenceFileInputFormat
• KeyValueTextInputFormat
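To recap why NLineInputFormat overrides getSplits(): it assigns a fixed number of input lines, rather than a byte range, to each mapper. A plain-Java sketch of that grouping idea (illustrative names, no Hadoop classes):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of NLineInputFormat's splitting rule: each split holds
// exactly N input lines (the last split may hold fewer).
public class NLineSplitDemo {
    static List<List<String>> splitIntoNLineGroups(List<String> lines, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            splits.add(lines.subList(i, Math.min(i + n, lines.size())));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("l1", "l2", "l3", "l4", "l5");
        // With N = 2: three splits, so three mappers would run.
        System.out.println(splitIntoNLineGroups(lines, 2));
    }
}
```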
• Custom partitioner
• Custom combiner
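A custom Partitioner decides which reducer receives each key; Hadoop's default HashPartitioner reduces to a single line. Below is a standalone copy of that routing rule (the demo class is ours, not a Hadoop class), which a custom partitioner would replace with its own logic:

```java
public class PartitionDemo {
    // Same routing rule as Hadoop's default HashPartitioner:
    // mask off the sign bit of the key's hash, then take it
    // modulo the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key maps to the same partition,
        // so one reducer sees all values for that key.
        System.out.println(getPartition("hadoop", 4));
        System.out.println(getPartition("hadoop", 4)); // same partition again
    }
}
```

A custom combiner, by contrast, runs reducer-style aggregation on the map side to shrink the data shuffled across the network; for WordCount the Reducer class itself can serve as the combiner because summation is associative.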
Display running jobs
Q & A..?
Thank You
See You in Class Next Week