Module 1
Mahesh G Huddar
Asst. Professor
Dept. of CSE, HIT, Nidasoshi
• The Hadoop Distributed File System (HDFS) was designed
for Big Data processing.
• It is a core part of Hadoop and is used for data storage.
• It is designed to run on commodity hardware.
• Distributed
• Parallel Computation
• Replication
• Fault tolerance
• Streaming Data Access
• Portable
Features of HDFS
• Distributed and Parallel Computation - This is one of the most important features
of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks
and stored across nodes.
• Suppose it takes 43 minutes to process a 1 TB file on a single machine. How long
will it take if you have 10 machines in a Hadoop cluster with
similar configuration – 43 minutes or 4.3 minutes? 4.3 minutes,
Right! What happened here? Each of the nodes is working with a
part of the 1 TB file in parallel.
• Therefore, the work which was taking 43 minutes before gets
finished in just 4.3 minutes now, as the work got divided over ten
machines.
Features of HDFS
• Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a
single cluster.
• Upon startup, each DataNode provides a block report to the
NameNode.
• The block reports are sent every 10 heartbeats. (The interval
between reports is a configurable property; see the sketch after this list.)
• The reports enable the NameNode to keep an up-to-date account
of all data blocks in the cluster.
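The report interval can be tuned in hdfs-site.xml; a minimal sketch, assuming the
default value of 21600000 milliseconds (six hours):

<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>21600000</value>
</property>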
HDFS Components
• In almost all Hadoop deployments, there is a SecondaryNameNode.
• While not explicitly required by a NameNode, it is highly
recommended.
• The term "SecondaryNameNode" (now called
CheckPointNode) is somewhat misleading.
• It is not an active failover node and cannot replace the
primary NameNode in case of its failure.
HDFS Components
• The purpose of the SecondaryNameNode is to perform periodic
checkpoints that evaluate the status of the NameNode.
• The NameNode keeps all system metadata in memory for fast access. It
also has two disk files that track changes to the metadata:
– An image of the file system state when the NameNode was started. This
file begins with fsimage_* and is used only at startup by the NameNode.
– A series of modifications done to the file system after starting the
NameNode. These files begin with edits_* and reflect the changes made
after the file was read.
• The location of these files is set by the dfs.namenode.name.dir property in
the hdfs-site.xml file.
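A minimal hdfs-site.xml sketch; the path shown is an assumed example, not a
required location:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/var/data/hadoop/hdfs/nn</value>
</property>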
HDFS Components
• The SecondaryNameNode periodically downloads fsimage and
edits files, joins them into a new fsimage, and uploads the new
fsimage file to the NameNode.
• Thus, when the NameNode restarts, the fsimage file is reasonably
up-to-date and requires only the edit logs to be applied since the
last checkpoint.
• If the SecondaryNameNode were not running, a restart of the
NameNode could take a long time due to the number of changes
to the file system.
HDFS Components
[Figure: where do clients read or write data?]
1. HDFS uses a master/slave model designed for large file
reading/streaming.
2. The NameNode is a metadata server or "data traffic cop."
3. HDFS provides a single namespace that is managed by the
NameNode.
4. Data is redundantly stored on DataNodes; there is no data on
the NameNode.
5. The SecondaryNameNode performs checkpoints of the NameNode
file system's state but is not a failover node.
Replication
[Figure: block replication example (reconstructed) - block placement before and after replication]

DataNode   Data Block
1          1 and 2
2          3
3          (none yet)

DataNode   Data Block
1          1 and 2
2          3
3          1
4          2 and 3

• Blocks are then replicated so that each block is stored on more than one node.
HDFS Block Replication
[Figure: rack-aware block replication - a table of Rack, DataNode, and Data Block
assignments showing each block replicated on DataNodes spread across two racks]
HDFS Safe Mode
• When there is a file system issue that must be addressed, the
administrator can enter safe mode manually by using the command:
hdfs dfsadmin -safemode enter
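Related invocations to query and leave safe mode (a short usage sketch):

hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave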
NameNode High Availability
• The Standby NameNode maintains enough state to provide a fast
failover (if required).
• To guarantee the file system state is preserved, both the Active
and Standby NameNodes receive block reports from the
DataNodes.
• In an HA configuration, the Active NameNode writes its namespace edits to a group
of JournalNodes (typically three); each edit must be written to a majority of them.
• This design enables the system to tolerate the failure of a single JournalNode
machine.
• The Standby node continuously reads the edits from the Journal Nodes to
ensure its namespace is synchronized with that of the Active node.
• In the event of an Active NameNode failure, the Standby node reads all
remaining edits from the JournalNodes before promoting itself to the
Active state.
NameNode High Availability
• To prevent confusion between NameNodes, the JournalNodes
allow only one NameNode to be a writer at a time. During
failover, the NameNode that is chosen to become active takes
over the role of writing to the JournalNodes.
• A SecondaryNameNode is not required in the HA configuration
because the Standby node also performs the tasks of the
Secondary NameNode.
• Apache ZooKeeper is used to monitor the NameNode health.
• HDFS failover relies on ZooKeeper for failure detection and for
Standby to Active NameNode election; a configuration sketch follows below.
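A minimal configuration sketch for automatic failover; the property names are
standard, but the ZooKeeper hostnames are placeholders
(dfs.ha.automatic-failover.enabled belongs in hdfs-site.xml, ha.zookeeper.quorum
in core-site.xml):

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>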
HDFS NameNode Federation
• Older versions of HDFS provided a single namespace for the entire cluster managed
by a single NameNode.
• Thus, the resources of a single NameNode determined the size of the namespace.
• Federation addresses this limitation by adding support for multiple NameNodes /
namespaces to the HDFS file system.
• The key benefits are as follows:
– Namespace scalability. HDFS cluster storage scales horizontally without placing a
burden on the NameNode.
– Better performance. Adding more NameNodes to the cluster scales the file
system read/write operations throughput by separating the total namespace.
– System isolation. Multiple NameNodes enable different categories of
applications to be distinguished, and users can be isolated to different
namespaces.
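Federation is configured by declaring multiple nameservices in hdfs-site.xml; a
minimal sketch, where ns1/ns2 and the hostnames are assumed examples:

<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>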
HDFS NameNode Federation

[Figure: NameNode Federation - multiple NameNodes, each managing its own namespace, share a common pool of DataNodes]

HDFS BackupNode
• HDFS also provides a BackupNode, which maintains an up-to-date copy of the
file system namespace in memory. Unlike a CheckpointNode, it does not need to
download the fsimage and edits files from the active
NameNode because it already has an up-to-date namespace
state in memory.
• A NameNode supports one BackupNode at a time.
• No CheckpointNodes may be registered if a Backup node is in
use.
HDFS Snapshots
• HDFS snapshots are similar to backups, but are created by administrators using
the hdfs dfs -createSnapshot command.
• HDFS snapshots are read-only point-in-time copies of the file system.
• They offer the following features:
– Snapshots can be taken of a sub-tree of the file system or the entire file
system.
– Snapshots can be used for data backup, protection against user errors, and
disaster recovery.
– Snapshot creation is instantaneous.
– Blocks on the DataNodes are not copied, because the snapshot files record
the block list and the file size. There is no data copying, although it appears to
the user that there are duplicate files.
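A short command sketch (the path /data and snapshot name snap1 are assumed
examples; the directory must first be made snapshottable):

hdfs dfsadmin -allowSnapshot /data
hdfs dfs -createSnapshot /data snap1
hdfs dfs -ls /data/.snapshot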
HDFS User Commands
• hdfs version
  – Prints the Hadoop version, e.g., Hadoop 2.6.0.2.2.4.3-2
• hdfs dfs
  – Lists all commands available in HDFS
HDFS User Commands
Lists files in HDFS
• hdfs dfs -ls /
  – Lists files in the root HDFS directory
• hdfs dfs -ls OR
• hdfs dfs -ls /user/hdfs
  – Lists the files in the user's home directory
HDFS User Commands
Make Directory in HDFS
• hdfs dfs -mkdir stuff
  – Creates a directory named stuff
• hdfs dfs -put test stuff
  – Copies the local file test into stuff (this step is implied by the listing below)
• hdfs dfs -ls stuff
-rw-r--r--   2 hdfs hdfs      12857 2020-04-18 12:12 stuff/test
MapReduce Model
• In the map phase, each person counts the words in their own part of the text and
emits <word, count> pairs:
– First Person:
<Hello, 1>
<World, 2>
<Hadoop, 1>
– Second Person:
<Hello, 1>
<Hadoop, 2>
<Goodbye, 1>
MapReduce Model
• The reduce phase happens when everyone is done counting: the reducer sums
the totals for each word as each person reports their counts.
<Hello, 2>
<World, 2>
<Hadoop, 3>
<Goodbye, 1>
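A minimal Python sketch (ordinary Python, not Hadoop code) showing how the
reduce phase arrives at the totals above:

# combine the <word, count> pairs emitted by the two counters
pairs = [("Hello", 1), ("World", 2), ("Hadoop", 1),
         ("Hello", 1), ("Hadoop", 2), ("Goodbye", 1)]
totals = {}
for word, count in pairs:
    totals[word] = totals.get(word, 0) + count
print(totals)  # {'Hello': 2, 'World': 2, 'Hadoop': 3, 'Goodbye': 1}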
MapReduce Parallel Data Flow
1. Input Splits
– HDFS distributes and replicates data over multiple
servers called DataNodes.
– The default data chunk or block size is 64MB.
– Thus, a 150MB file would be broken into 3 blocks and
written to different machines in the cluster.
– The data are also replicated on multiple machines
(typically three machines).
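The block count is just the file size divided by the block size, rounded up; a
quick Python check:

import math
block_size_mb = 64
file_size_mb = 150
print(int(math.ceil(file_size_mb / float(block_size_mb))))  # 3 blocks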
MapReduce Parallel Data Flow
2. Map Step
– The mapping process is where the parallel nature of Hadoop comes into play.
– For large amounts of data, many mappers can be operating at the same time.
– The user provides the specific mapping process.
– MapReduce will try to execute the mapper on the machines where the block
resides.
– Because the file is replicated in HDFS, the least busy node with the data will be
chosen.
– If all nodes holding the data are too busy, MapReduce will try to pick a node that is
closest to the node that hosts the data block (a characteristic called
rack awareness).
– The last choice is any node in the cluster that has access to HDFS.
MapReduce Parallel Data Flow
3. Combiner Step
– It is possible to provide an optimization or pre-reduction as part of the map stage
where key-value pairs are combined prior to the next stage. The combiner stage
is optional.
4. Shuffle Step
– Results of the map stage must be collected by key-value pairs and shuffled to the
same reducer process. If only a single reducer process is used, the Shuffle stage is
not needed.
Compiling and Running the WordCount Example
1. Make a local wordcount_classes directory:
$ mkdir wordcount_classes
2. Compile the WordCount program using the `hadoop classpath` command to include all the
available Hadoop class paths:
$ javac -cp `hadoop classpath` -d wordcount_classes WordCount.java
3. The jar file can be created using the following command:
$ jar -cvf wordcount.jar -C wordcount_classes/ .
4. To run the example, create an input directory in HDFS and place a text file in the new
directory. For this example, we will use the war-and-peace.txt file (available from the
book download page; see Appendix A):
$ hdfs dfs -mkdir /Demo
$ hdfs dfs -put war-and-peace.txt /Demo
5. Run the WordCount application using the following command:
$ hadoop jar wordcount.jar WordCount /Demo /output
Debugging MapReduce Applications
• The first (and best) method is to use log aggregation.
• In this mode, logs are aggregated in HDFS and can be displayed in the
YARN ResourceManager user interface or examined with the yarn logs
command.
• Second, if log aggregation is not enabled, the logs will be placed locally on the
cluster nodes on which the mapper or reducer ran. The location of the
unaggregated local logs is given by the yarn.nodemanager.log-dirs property in
the yarn-site.xml file.
Enabling YARN Log Aggregation
• To manually enable log aggregation, follow these steps:
• As the HDFS superuser administrator (usually user hdfs), create the log
directory in HDFS (see the sketch below), then add the following properties
to the yarn-site.xml file:
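The directory commands themselves are not shown on the slide; a typical
sequence, assuming the /yarn/logs path configured below (the yarn:hadoop
ownership is an assumption):

hdfs dfs -mkdir -p /yarn/logs
hdfs dfs -chown -R yarn:hadoop /yarn/logs
hdfs dfs -chmod -R g+rw /yarn/logs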
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/yarn/logs</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
Web Interface Log View
• The most convenient way to view logs is to use the
YARN ResourceManager web user interface.
• In the figure, the contents of stdout, stderr, and syslog are
displayed on a single page.
• If log aggregation is not enabled, a message stating that the logs
are not available will be displayed.
• The following URL is used to launch the web interface:
http://localhost:8088/
Hadoop Log Management
• First, find the applicationId of the completed application by listing finished
applications:
$ yarn application -list -appStates FINISHED
• Next, run the following command to produce a dump of all the logs for
that application. Note that the output can be long and is best saved to a
file.
$ yarn logs -applicationId application_1432667013445_0001 > AppOut
WordCount Program using Streaming Interface in Python
Mapper Program
#!/usr/bin/env python
import sys

# read lines from standard input; emit a <word, 1> pair for every word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
WordCount Program using Streaming Interface in Python
Reducer Program
#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input arrives sorted by key, so all counts for a word are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the final word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
Steps to execute WordCount Program using Streaming
Interface in Python
1. Create a directory and move the file into HDFS:
hdfs dfs -mkdir war-and-peace-input
hdfs dfs -put war-and-peace.txt war-and-peace-input
2. Make sure the output directory is removed from any previous runs:
hdfs dfs -rm -r war-and-peace-output
3. Run the streaming job:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -file ./mapper.py \
  -mapper ./mapper.py \
  -file ./reducer.py \
  -reducer ./reducer.py \
  -input war-and-peace-input \
  -output war-and-peace-output
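4. After the job completes, the results can be inspected directly in HDFS; a
usage sketch (the part file name may vary):

hdfs dfs -ls war-and-peace-output
hdfs dfs -cat war-and-peace-output/part-00000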
WordCount Program in Java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
  // mapper (reconstructed): tokenizes each line and emits <word, 1>
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
  {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException
    {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens())
      {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // reducer: sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
  {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException
    {
      int sum = 0;
      for (IntWritable val : values)
      {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
WordCount Program in Java
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Steps to execute Basic Hadoop Java word count example
Hadoop V2
1. Make a local wordcount_classes directory:
mkdir wordcount_classes
2. Compile the WordCount program:
javac -cp `hadoop classpath` -d wordcount_classes WordCount.java
3. Create the jar file:
jar -cvf wordcount.jar -C wordcount_classes/ .
4. Create an input directory in HDFS and copy the input file into it:
hdfs dfs -mkdir war-and-peace-input
hdfs dfs -put war-and-peace.txt war-and-peace-input
5. Run word count (first make sure the output directory does not already exist):
hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output
6. Retrieve the results from HDFS:
hdfs dfs -get war-and-peace-output/part-00000
hdfs dfs -get war-and-peace-output/part-00001
Note:
• If you run the program again it won't work because /war-and-peace-output
exists.
• Hadoop will not overwrite files!
Hadoop MapReduce
WordCount Program in C++ using Pipes interface
#include <string>
#include <vector>
#include "stdint.h" // <--- to prevent uint64_t errors!
#include "Pipes.hh"
#include "StringUtils.hh"
using namespace std;

// mapper class declaration (reconstructed): emits <word, 1> for every word in a line
class WordCountMapper : public HadoopPipes::Mapper
{
public:
  WordCountMapper( HadoopPipes::TaskContext& context ) {}
  void map( HadoopPipes::MapContext& context )
  {
    string line = context.getInputValue();
    vector< string > words = HadoopUtils::splitString( line, " " );
    for ( unsigned int i=0; i < words.size(); i++ )
    {
      context.emit( words[i], HadoopUtils::toString( 1 ) );
    }
  }
};
WordCount Program in C++ using Pipes interface
// reducer class declaration (reconstructed): sums the counts emitted for each word
class WordCountReducer : public HadoopPipes::Reducer
{
public:
  WordCountReducer( HadoopPipes::TaskContext& context ) {}
  void reduce( HadoopPipes::ReduceContext& context )
  {
    int count = 0;
    while ( context.nextValue() )
    {
      count += HadoopUtils::toInt( context.getInputValue() );
    }
    context.emit( context.getInputKey(), HadoopUtils::toString( count ) );
  }
};
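The slides omit the program's entry point; a minimal sketch of the usual Pipes
driver, assuming the TemplateFactory header is also included:

#include "TemplateFactory.hh"

int main( int argc, char *argv[] )
{
  // run the task, wiring up the mapper and reducer defined above
  return HadoopPipes::runTask( HadoopPipes::TemplateFactory< WordCountMapper, WordCountReducer >() );
}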
WordCount Program in C++ using Pipes interface
g++ wordcount.cpp -o wordcount -L$HADOOP_HOME/lib/native/ \
  -I$HADOOP_HOME/../usr/include -lhadooppipes -lhadooputils -lpthread -lcrypto
hdfs dfs -mkdir war-and-peace-input
hdfs dfs -put war-and-peace.txt war-and-peace-input
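Because -program below refers to bin/wordcount in HDFS, the compiled binary
presumably has to be uploaded first; an assumed step:

hdfs dfs -mkdir bin
hdfs dfs -put wordcount bin/wordcount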
mapred pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input war-and-peace-input \
  -output war-and-peace-output \
  -program bin/wordcount