
MapReduce Program – Weather Data Analysis For Analyzing Hot And Cold Days


Here, we will write a MapReduce program for analyzing weather datasets to understand its data processing programming model. Weather sensors collect weather information across the globe in a large volume of log data. This weather data is semi-structured and record-oriented.
The data is stored in a line-oriented ASCII format, where each row represents a single record. Each row has many fields such as longitude, latitude, daily max-min temperature, daily average temperature, etc. For simplicity, we will focus on the main element, i.e. temperature. We will use data from the National Centers for Environmental Information (NCEI), which has a massive amount of historical weather data that we can use for our data analysis.
Problem Statement:
Analyzing weather data of Fairbanks, Alaska to find cold and hot days using MapReduce Hadoop.

Step 1:

We can download the dataset from this link for various cities in different years. Choose the year of your choice and select any one of the data text files for analysis. In my case, I have selected the CRND0103-2020-AK_Fairbanks_11_NE.txt dataset for the analysis of hot and cold days in Fairbanks, Alaska.
We can get information about the data from the README.txt file available on the NCEI website.

Step 2:

Below is an example of our dataset, where columns 6 and 7 show the maximum and minimum temperature, respectively.
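The dataset preview image is not reproduced here, but the fixed-width layout can be inferred from the substring offsets used in the mapper below: characters 6 to 14 hold the date (yyyymmdd), characters 39 to 45 the daily maximum temperature, and characters 47 to 53 the daily minimum temperature. The following minimal sketch (not part of the original program) isolates just that field extraction; the offsets are copied from the MyMaxMin code below, and the record line is assumed to be passed in as the first argument.

Java
// Field-extraction sketch; offsets are copied from the MyMaxMin mapper below.
// 'line' is assumed to be one fixed-width NCEI daily record.
public class NceiFields {

    // characters 6 to 14: date in yyyymmdd form
    static String date(String line) {
        return line.substring(6, 14);
    }

    // characters 39 to 45: daily maximum temperature
    static float maxTemp(String line) {
        return Float.parseFloat(line.substring(39, 45).trim());
    }

    // characters 47 to 53: daily minimum temperature
    static float minTemp(String line) {
        return Float.parseFloat(line.substring(47, 53).trim());
    }

    public static void main(String[] args) {
        String line = args[0]; // pass one NCEI record as the first argument
        System.out.println(date(line) + " max=" + maxTemp(line) + " min=" + minTemp(line));
    }
}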
Step 3:

Make a project in Eclipse with below steps:

 First Open Eclipse -> then select File -> New -> Java Project ->Name
it MyProject -> then select use an execution environment -> choose JavaSE-
1.8 then next -> Finish.

 In this project, create a Java class with the name MyMaxMin -> then click Finish
 Copy the below source code into this MyMaxMin Java class

JAVA
// importing Libraries
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyMaxMin {

    // Mapper

    /*
     * MaxTemperatureMapper class is static and extends the Mapper abstract
     * class having four Hadoop generic types: LongWritable, Text, Text, Text.
     */
    public static class MaxTemperatureMapper extends
            Mapper<LongWritable, Text, Text, Text> {

        /**
         * @method map
         * This method takes the input as a text data type. It reads the
         * 6th field as temp_max and the 7th field as temp_min. Records with
         * temp_max > 30 or temp_min < 15 are passed to the reducer.
         */

        // records holding this value contain inconsistent (missing) data
        public static final int MISSING = 9999;

        @Override
        public void map(LongWritable arg0, Text Value, Context context)
                throws IOException, InterruptedException {

            // Convert the single row (record) to a String
            // and store it in the String variable line
            String line = Value.toString();

            // Check for an empty line
            if (!(line.length() == 0)) {

                // from character 6 to 14 we have the date in our dataset
                String date = line.substring(6, 14);

                // the maximum temperature occupies characters 39 to 45
                float temp_Max = Float.parseFloat(line.substring(39, 45).trim());

                // the minimum temperature occupies characters 47 to 53
                float temp_Min = Float.parseFloat(line.substring(47, 53).trim());

                // if the maximum temperature is greater than 30, it is a hot day
                if (temp_Max > 30.0) {
                    // Hot day
                    context.write(new Text("The Day is Hot Day :" + date),
                            new Text(String.valueOf(temp_Max)));
                }

                // if the minimum temperature is less than 15, it is a cold day
                if (temp_Min < 15) {
                    // Cold day
                    context.write(new Text("The Day is Cold Day :" + date),
                            new Text(String.valueOf(temp_Min)));
                }
            }
        }
    }

    // Reducer

    /*
     * MaxTemperatureReducer class is static and extends the Reducer abstract
     * class having four Hadoop generic types: Text, Text, Text, Text.
     */
    public static class MaxTemperatureReducer extends
            Reducer<Text, Text, Text, Text> {

        /**
         * @method reduce
         * This method takes the key and the list of values from the mapper,
         * aggregates them based on the key and produces the final output.
         */
        @Override
        public void reduce(Text Key, Iterable<Text> Values, Context context)
                throws IOException, InterruptedException {

            // each key produced by the mapper is unique (it contains the date),
            // so simply write out the temperature stored with it
            String temperature = Values.iterator().next().toString();
            context.write(Key, new Text(temperature));
        }
    }

    /**
     * @method main
     * This method is used for setting all the configuration properties.
     * It acts as a driver for the MapReduce code.
     */
    public static void main(String[] args) throws Exception {

        // reads the default configuration of the
        // cluster from the configuration XML files
        Configuration conf = new Configuration();

        // Initializing the job with the
        // default configuration of the cluster
        Job job = new Job(conf, "weather example");

        // Assigning the driver class name
        job.setJarByClass(MyMaxMin.class);

        // Key type coming out of the mapper
        job.setMapOutputKeyClass(Text.class);

        // Value type coming out of the mapper
        job.setMapOutputValueClass(Text.class);

        // Defining the mapper class name
        job.setMapperClass(MaxTemperatureMapper.class);

        // Defining the reducer class name
        job.setReducerClass(MaxTemperatureReducer.class);

        // Defining the input format class, which is responsible
        // for parsing the dataset into key-value pairs
        job.setInputFormatClass(TextInputFormat.class);

        // Defining the output format class
        job.setOutputFormatClass(TextOutputFormat.class);

        // setting the second argument as a path in a Path variable
        Path OutputPath = new Path(args[1]);

        // Configuring the input path from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Configuring the output path from the filesystem into the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // deleting the output path automatically from HDFS
        // so that we don't have to delete it explicitly
        OutputPath.getFileSystem(conf).delete(OutputPath);

        // exiting the job only if the flag value becomes false
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

 Now we need to add external jars for the packages that we have imported. Download the jar packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version.
You can check the Hadoop version with:

hadoop version

 Now we add these external jars to our MyProject. Right-click on MyProject -> then select Build Path -> click on Configure Build Path and select Add External JARs… and add the jars from their download location, then click -> Apply and Close.

 Now export the project as a jar file. Right-click on MyProject, choose Export… and go to Java -> JAR file, click -> Next and choose your export destination, then click -> Next.
Choose the Main Class as MyMaxMin by clicking -> Browse and then click -> Finish -> Ok.
Step 4:

Start our Hadoop Daemons


start-dfs.sh
start-yarn.sh

Step 5:

Move your dataset to the Hadoop HDFS.


Syntax:

hdfs dfs -put /file_path /destination


In the below command, / denotes the root directory of our HDFS.

hdfs dfs -put /home/dikshant/Downloads/CRND0103-2020-AK_Fairbanks_11_NE.txt /

Check that the file was sent to our HDFS.

hdfs dfs -ls /

Step 6:

Now run your jar file with the below command and produce the output in the MyOutput file.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS
/output-file_name
Command:
hadoop jar /home/dikshant/Documents/Project.jar /CRND0103-2020-
AK_Fairbanks_11_NE.txt /MyOutput

Step 7:

Now go to localhost:50070/, under Utilities select Browse the file system, and download part-r-00000 from the /MyOutput directory to see the result.
Step 8:

See the result in the downloaded file.

In the above image, you can see the top 10 results showing the cold days. The date in each record is in yyyymmdd format. For example, 20200101 means
year = 2020
month = 01
day = 01
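As a small aside (not part of the original article), the same yyyymmdd decoding can be done in a couple of lines of Java:

Java
// Decode a yyyymmdd date string such as the ones emitted in the job output.
public class DateDecode {
    public static void main(String[] args) {
        String date = "20200101";
        System.out.println("year  = " + date.substring(0, 4));
        System.out.println("month = " + date.substring(4, 6));
        System.out.println("day   = " + date.substring(6, 8));
    }
}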

MapReduce Program – Finding the Average Age of Males and Females Who Died in the Titanic Disaster
All of us are familiar with the disaster that happened on April 14, 1912. The giant ship, weighing about 46,000 tons, sank to a depth of 13,000 feet in the North Atlantic Ocean. Our aim is to analyze the data obtained after this disaster. Hadoop MapReduce can be utilized to deal with such large datasets efficiently to find a solution for a particular problem.
Problem Statement: Analyzing the Titanic disaster dataset to find the average age of the male and female persons who died in this disaster, with MapReduce Hadoop.

Step 1:

We can download the Titanic dataset from this link. Below is the column structure of our Titanic dataset. It consists of 12 columns, where each row describes the information of a particular person.

Step 2:

The first 10 records of the dataset are shown below.
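The preview image is not reproduced here. From the parsing logic in the code below, each record is a comma-plus-space separated line in which the 2nd field is the survival flag (0 means the person died), the 5th field is the gender, and the 6th field is the age. The following minimal sketch (with a hypothetical placeholder record, not real data) shows just that split:

Java
// Split sketch for one hypothetical Titanic record; the field layout is
// inferred from the Average_age mapper below, and the values are placeholders.
public class TitanicRecordSketch {
    public static void main(String[] args) {
        String line = "1, 0, 3, PASSENGER NAME, male, 22, 1, 0, TICKET, 7.25, , S";
        String[] str = line.split(", ");
        System.out.println("survived flag: " + str[1]); // "0" means the person died
        System.out.println("gender: " + str[4]);
        System.out.println("age: " + str[5]);
    }
}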


Step 3:

Make the project in Eclipse with below steps:


 First Open Eclipse -> then select File -> New -> Java Project ->Name
it Titanic_Data_Analysis -> then select use an execution environment ->
choose JavaSE-1.8 then next -> Finish.

 In this project, create a Java class with the name Average_age -> then click Finish
 Copy the below source code into this Average_age Java class
 Java
// import libraries
import java.io.IOException;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Making a class with name Average_age
public class Average_age {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        // private Text variable gender which stores the gender
        // of the person who died in the Titanic disaster
        private Text gender = new Text();

        // private IntWritable variable age stores the age of the person;
        // the map output key is the gender and the value is the age
        private IntWritable age = new IntWritable();

        // overriding the map method (runs once for each record in the dataset)
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            // storing the complete record in a variable named line
            String line = value.toString();

            // splitting the line on ", " as the values
            // are separated by this delimiter
            String str[] = line.split(", ");

            /*
             * checking that the number of columns in our dataset is more
             * than 6. This helps to avoid an ArrayIndexOutOfBoundsException
             * when a record in our dataset is malformed.
             */
            if (str.length > 6) {

                // storing the gender, which is in the 5th column
                gender.set(str[4]);

                // checking the 2nd column value in our dataset;
                // proceed only if the person died
                if ((str[1].equals("0"))) {

                    // checking for numeric data with a
                    // regular expression in this column
                    if (str[5].matches("\\d+")) {

                        // converting the numeric data to int
                        int i = Integer.parseInt(str[5]);

                        // storing the person's age
                        age.set(i);

                        // writing key and value to the context,
                        // which will be the output of our map phase
                        context.write(gender, age);
                    }
                }
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        // overriding the reduce method (runs once for every key)
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {

            // the variable sum stores the sum of the ages of people
            int sum = 0;

            // the variable l counts the number of values for this key
            int l = 0;

            // foreach loop
            for (IntWritable val : values) {
                l += 1;
                // accumulating the sum of values
                sum += val.get();
            }
            sum = sum / l;
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "Averageage_survived");
        job.setJarByClass(Average_age.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path out = new Path(args[1]);
        out.getFileSystem(conf).delete(out);
        job.waitForCompletion(true);
    }
}

 Now we need to add external jars for the packages that we have imported. Download the jar packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version.
Check the Hadoop version:
hadoop version

 Now we add these external jars to our Titanic_Data_Analysis project. Right-click on Titanic_Data_Analysis -> then select Build Path -> click on Configure Build Path and select Add External JARs… and add the jars from their download location, then click -> Apply and Close.
 Now export the project as a jar file. Right-click on Titanic_Data_Analysis, choose Export… and go to Java -> JAR file, click -> Next and choose your export destination, then click -> Next. Choose the Main Class as Average_age by clicking -> Browse and then click -> Finish -> Ok.
Step 4:

Start Hadoop Daemons


start-dfs.sh
start-yarn.sh
Then, check the running Hadoop daemons.
jps

Step 5:

Move your dataset to the Hadoop HDFS.


Syntax:
hdfs dfs -put /file_path /destination
In the below command, / denotes the root directory of our HDFS.
hdfs dfs -put /home/dikshant/Documents/titanic_data.txt /
Check that the file was sent to our HDFS.
hdfs dfs -ls /

Step 6:

Now run your jar file with the below command and produce the output in the Titanic_Output file.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS
/output-file_name
Command:
hadoop jar /home/dikshant/Documents/Average_age.jar
/titanic_data.txt /Titanic_Output

Step 7:

Now go to localhost:50070/, under Utilities select Browse the file system, and download part-r-00000 from the /Titanic_Output directory to see the result.
Note: We can also view the result with below command
hdfs dfs -cat /Titanic_Output/part-r-00000
In the above image, we can see that, according to our dataset, the average age of the females who died in the Titanic disaster is 28 and that of the males is 30.

How to Execute Character Count Program in MapReduce Hadoop?
Prerequisites: Hadoop and MapReduce
The required setup for completing the below task:
1. Java installation
2. Hadoop installation
Our task is to count the frequency of each character present in our input file. We are using Java for implementing this particular scenario. However, the MapReduce program can also be written in Python or C++. Execute the below steps to complete the task of finding the occurrence of each character.
Example:
Input
GeeksforGeeks
Output
G 2
e 4
f 1
k 2
o 1
r 1
s 2
Step 1: First Open Eclipse -> then select File -> New -> Java Project ->Name
it CharCount -> then select use an execution environment -> choose JavaSE-
1.8 then next -> Finish.
Step 2: Create three Java classes in the project. Name them CharCountDriver (having the main function), CharCountMapper, and CharCountReducer.
Mapper Code: You have to copy and paste this program into the CharCountMapper Java class file.
 Java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class CharCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        // split the line into individual characters
        String tokenizer[] = line.split("");
        for (String SingleChar : tokenizer) {
            Text charKey = new Text(SingleChar);
            IntWritable One = new IntWritable(1);
            output.collect(charKey, One);
        }
    }
}

Reducer Code: You have to copy-paste this below program into the CharCountReducer Java class file.
 Java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CharCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Driver Code: You have to copy-paste this below program into the CharCountDriver Java class file.
 Java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class CharCountDriver {
public static void main(String[] args)
throws IOException
{
JobConf conf = new JobConf(CharCountDriver.class);
conf.setJobName("CharCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(CharCountMapper.class);
conf.setCombinerClass(CharCountReducer.class);
conf.setReducerClass(CharCountReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,
new Path(args[0]));
FileOutputFormat.setOutputPath(conf,
new Path(args[1]));
JobClient.runJob(conf);
}
}

Step 3: Now we need to add external jars for the packages that we have imported. Download the jar packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version. You can check the Hadoop version with the below command:
hadoop version

Step 4: Now we add these external jars to our CharCount project. Right-click on CharCount -> then select Build Path -> click on Configure Build Path and select Add External JARs… and add the jars from their download location, then click -> Apply and Close.
Step 5: Now export the project as a jar file. Right-click on CharCount, choose Export… and go to Java -> JAR file, click -> Next and choose your export destination, then click -> Next. Choose the Main Class as CharCountDriver (the class with the main function) by clicking -> Browse and then click -> Finish -> Ok.
Now the jar file is successfully created and saved in the /Documents directory with the name charectercount.jar in my case.
Step 6: Create a simple text file and add some data to it.
nano test.txt
You can also add text to the file manually or using some other editor like Vim or
gedit.
To see the content of the file use cat command available in Linux.
cat test.txt

Step 7: Start our Hadoop Daemons


start-dfs.sh
start-yarn.sh
Step 8: Move your test.txt file to the Hadoop HDFS.
Syntax:
hdfs dfs -put /file_path /destination
In the below command, / denotes the root directory of our HDFS.
hdfs dfs -put /home/dikshant/Documents/test.txt /
Check whether the file is present in the root directory of HDFS or not.
hdfs dfs -ls /

Step 9: Now Run your Jar File with the below command and produce the output
in CharCountResult File.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS
/output-file_name
Command:
hadoop jar /home/dikshant/Documents/charectercount.jar
/test.txt /CharCountResult
Step 10: Now go to localhost:50070/, under Utilities select Browse the file system and download the output file in the /CharCountResult directory to see the result. We can also check the result, i.e. that part-00000 file, with the cat command as shown below.
hdfs dfs -cat /CharCountResult/part-00000

Hadoop Streaming Using Python – Word Count Problem
Hadoop Streaming is a feature that comes with Hadoop and allows users or developers to use various languages for writing MapReduce programs, like Python, C++, Ruby, etc. It supports all languages that can read from standard input and write to standard output. We will be implementing Python with Hadoop Streaming and will observe how it works. We will implement the word count problem in Python to understand Hadoop Streaming. We will be creating mapper.py and reducer.py to perform the map and reduce tasks.
Let's create one file which contains multiple words that we can count.
Step 1: Create a file with the name word_count_data.txt and add some data to it.
cd Documents/              # to change the directory to /Documents
touch word_count_data.txt  # touch is used to create an empty file
nano word_count_data.txt   # nano is a command line editor to edit the file
cat word_count_data.txt    # cat is used to see the content of the file
Step 2: Create a mapper.py file that implements the mapper logic. It will read the
data from STDIN and will split the lines into words, and will generate an output of
each word with its individual count.
cd Documents/    # to change the directory to /Documents
touch mapper.py  # touch is used to create an empty file
cat mapper.py    # cat is used to see the content of the file
Copy the below code to the mapper.py file.
Python3
#!/usr/bin/env python

# import sys because we need to read and write data to STDIN and STDOUT
import sys

# reading entire line from STDIN (standard input)
for line in sys.stdin:
    # to remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()

    # we are looping over the words array and printing the word
    # with the count of 1 to the STDOUT
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print('%s\t%s' % (word, 1))

Here, in the above program, #! is known as a shebang and is used for interpreting the script. The file will be run using the interpreter we specify there.
Let's test our mapper.py locally to check whether it is working fine or not.
Syntax:
cat <text_data_file> | python <mapper_code_python_file>
Command(in my case)
cat word_count_data.txt | python mapper.py
The output of the mapper is shown below.

Step 3: Create a reducer.py file that implements the reducer logic. It will read the
output of mapper.py from STDIN(standard input) and will aggregate the occurrence
of each word and will write the final output to STDOUT.
cd Documents/     # to change the directory to /Documents
touch reducer.py  # touch is used to create an empty file

Python3

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# read the entire line from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # splitting the data on the basis of the tab we have provided in mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Now let's check whether our reducer code reducer.py is working properly with mapper.py, with the help of the below command.
cat word_count_data.txt | python mapper.py | sort -k1,1 |
python reducer.py
We can see that our reducer is also working fine in our local system.
Step 4: Now let’s start all our Hadoop daemons with the below command.
start-dfs.sh

start-yarn.sh

Now make a directory word_count_in_python in the root directory of our HDFS that will store our word_count_data.txt file, with the below command.
hdfs dfs -mkdir /word_count_in_python
Copy word_count_data.txt to this folder in our HDFS with the help of the copyFromLocal command.
Syntax to copy a file from your local file system to the HDFS is given below:
hdfs dfs -copyFromLocal /path 1 /path 2 .... /path n
/destination
Actual command(in my case)
hdfs dfs -copyFromLocal
/home/dikshant/Documents/word_count_data.txt
/word_count_in_python
Now our data file has been sent to HDFS successfully. We can check whether it was sent or not by using the below command or by manually visiting our HDFS.
hdfs dfs -ls / # list down content of the root directory

hdfs dfs -ls /word_count_in_python   # list down content of the /word_count_in_python directory

Let's give executable permission to our mapper.py and reducer.py with the help of the below command.
cd Documents/

chmod 777 mapper.py reducer.py   # changing the permission to read, write, execute for user, group and others

In the below image, we can observe that we have changed the file permission.
Step 5: Now download the latest hadoop-streaming jar file from this link. Then place this hadoop-streaming jar file at a location from where you can easily access it. In my case, I am placing it in the /Documents folder where the mapper.py and reducer.py files are present.
Now let’s run our python files with the help of the Hadoop streaming utility as
shown below.
hadoop jar /home/dikshant/Documents/hadoop-streaming-2.7.3.jar \
  -input /word_count_in_python/word_count_data.txt \
  -output /word_count_in_python/output \
  -mapper /home/dikshant/Documents/mapper.py \
  -reducer /home/dikshant/Documents/reducer.py


In the above command, with -output we specify the location in HDFS where we want our output to be stored. So let's check our output in the output file at the location /word_count_in_python/output/part-00000, in my case. We can check the results by manually visiting the location in HDFS or with the help of the cat command as shown below.
hdfs dfs -cat /word_count_in_python/output/part-00000

Basic options that we can use with Hadoop Streaming:

Option     Description
-mapper    The command to be run as the mapper
-reducer   The command to be run as the reducer
-input     The DFS input path for the Map step
-output    The DFS output directory for the Reduce step

Hadoop – File Permission and ACL (Access Control List)
In general, a Hadoop cluster enforces security at many layers. The level of protection depends upon the organization's requirements. In this article, we are going to learn about Hadoop's first level of security. It contains mainly two components. Both of these features are part of the default installation.
1. File Permission
2. ACL(Access Control List)

1. File Permission

The HDFS (Hadoop Distributed File System) implements a POSIX (Portable Operating System Interface)-like file permission model. It is similar to the file permission model in Linux. In Linux, we use Owner, Group, and Others, each of which has permissions for every file and directory available in our Linux environment.

Owner/user   Group   Others
rwx          rwx     rwx
Similarly, the HDFS file system also implements a set of permissions for Owner, Group, and Others. In Linux we use rwx permissions for a specific user, where r is read, w is write or append, and x is execute. In HDFS, for a file we have r for reading and w for writing and appending, but x (execute permission) is meaningless, because in HDFS all files are supposed to be data files and there is no concept of executing a file in HDFS. Since we don't have an executable concept in HDFS, we also don't have setUID and setGID in HDFS.
Similarly, we can have permissions for a directory in our HDFS, where r is used to list the contents of a directory, w is used for the creation or deletion of a directory, and x permission is used to access the children of a directory. Here also we don't have setUID and setGID in HDFS.

How You Can Change this HDFS File’s Permission?

-chmod, which stands for change mode, is the command used for changing the permission of files in our HDFS. First, list the directories available in our HDFS and have a look at the permission assigned to each of these directories.
You can list the directory in your HDFS root with the below command.
hdfs dfs -ls /
Here, / represents the root directory of your HDFS.
Let me first list down files present in my Hadoop_File directory.
hdfs dfs -ls /Hadoop_File

In the above image, you can see that for file1.txt, only the owner user has read and write permission. So I am adding write permission for the group and others as well.
Pre-requisite:
You have to be familiar with the use of the -chmod command in Linux, i.e. how to use permission switches for users. To add write permission for the group and others, use the below command.
hdfs dfs -chmod go+w /Hadoop_File/file1.txt
Here, go stands for group and others, w means write, and the + sign shows that I am adding write permission for the group and others. Then list the file again to check whether it worked or not.
hdfs dfs -ls /Hadoop_File

And we are done with it. Similarly, you can change the permission for any file or directory available in our HDFS (Hadoop Distributed File System), and you can change permissions as per your requirements for any user. You can also change the group or owner of a directory with -chgrp and -chown, respectively.
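For reference, the same kind of permission change can also be made programmatically. The following is a minimal sketch (not part of the original article) using the Hadoop FileSystem Java API; it assumes the /Hadoop_File/file1.txt path used above and an HDFS reachable through the client's default Configuration.

Java
// Sketch: add write permission for group and others on an HDFS file,
// roughly equivalent to: hdfs dfs -chmod go+w /Hadoop_File/file1.txt
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class ChmodSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/Hadoop_File/file1.txt");

        // read the current permission bits of the file
        FsPermission current = fs.getFileStatus(file).getPermission();

        // add WRITE to the group and other actions, keep the owner action unchanged
        FsPermission updated = new FsPermission(
                current.getUserAction(),
                current.getGroupAction().or(FsAction.WRITE),
                current.getOtherAction().or(FsAction.WRITE));
        fs.setPermission(file, updated);
    }
}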

2. ACL(Access Control List)

ACL provides a more flexible way to assign permissions for a file system. It is a list of access permissions for a file or a directory. We need ACLs in case you have made a separate user for your Hadoop single-node cluster setup, or you have a multi-node cluster setup where various nodes are present and you want to change permissions for other users.
This is because you cannot change permissions for different users with the -chmod command. For example, for a single-node Hadoop cluster your main user is root and you have created a separate user for the Hadoop setup with a name, let's say, Hadoop. Now, if you want to change permissions for the root user for files that are present in your HDFS, you cannot do it with the -chmod command. Here ACL (Access Control List) comes into the picture. With ACL you can set permissions for a specific named user or named group.
In order to enable ACLs in HDFS, you need to add the below property to the hdfs-site.xml file.
<property>

<name>dfs.namenode.acls.enabled</name>

<value>true</value>

</property>

Note: Don't forget to restart all the daemons, otherwise the changes made to hdfs-site.xml won't be reflected.
You can check the entries in your access control list (ACL) with the -getfacl command for a directory, as shown below.
hdfs dfs -getfacl /Hadoop_File

You can see that we have 3 different entries in our ACL. Suppose you want to change the permission of your root user for any HDFS directory; you can do it with the below command.
Syntax:
hdfs dfs -setfacl -m user:user_name:r-x /Hadoop_File
You can change the permission for any user by adding an entry for it to the ACL of that directory. Below are some examples of changing the permission of different named users for an HDFS file or directory.
hdfs dfs -setfacl -m user:root:r-x /Hadoop_File
Another example, for raj user:
hdfs dfs -setfacl -m user:raj:r-x /Hadoop_File
Here, r-x denotes only read and execute permission on the HDFS directory for the root and raj users.
In my case, I don't have any other user, so I am changing the permission for my only user, i.e. dikshant.
hdfs dfs -setfacl -m user:dikshant:rwx /Hadoop_File
Then list the ACL with the -getfacl command to see the changes.
hdfs dfs -getfacl /Hadoop_File

Here, you can see another entry, user:dikshant:rwx, in the ACL of this directory, for the new permission of the dikshant user. Similarly, in case you have multiple users, you can change their permissions for any HDFS directory. Below is another example, changing the permission of the dikshant user to r-x mode.
Here, you can see that I have changed the dikshant user's permission from rwx to r-x.
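For completeness, the same named-user ACL change can be made through the Hadoop FileSystem Java API. The sketch below is not part of the original article; it assumes the Hadoop 2.x ACL API (FileSystem.modifyAclEntries) and reuses the /Hadoop_File directory and the dikshant user from the examples above.

Java
// Sketch: grant the named user "dikshant" rwx access on /Hadoop_File,
// roughly equivalent to: hdfs dfs -setfacl -m user:dikshant:rwx /Hadoop_File
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class AclSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/Hadoop_File");

        // build an access ACL entry for the named user "dikshant" with rwx permission
        AclEntry entry = new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.USER)
                .setName("dikshant")
                .setPermission(FsAction.ALL)
                .build();

        // merge the entry into the directory's existing ACL
        fs.modifyAclEntries(dir, Collections.singletonList(entry));

        // print the resulting ACL entries, similar to hdfs dfs -getfacl
        System.out.println(fs.getAclStatus(dir));
    }
}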
Hadoop – copyFromLocal Command
The Hadoop copyFromLocal command is used to copy a file from your local file system to the HDFS (Hadoop Distributed File System). The copyFromLocal command has an optional switch -f which is used to replace an already existing file in the system, meaning it can be used to update that file. The -f switch is similar to first deleting a file and then copying it. If the file is already present in the folder, then copying the same file into the same folder will throw an error.
Syntax to copy a file from your local file system to HDFS is given below:

hdfs dfs -copyFromLocal /path 1 /path 2 .... /path n /destination
The copyFromLocal command is similar to the -put command used in HDFS. We can also use hadoop fs as a synonym for hdfs dfs. The command can take multiple arguments, where all the paths provided are sources from which we want to copy the files, except the last one, which is the destination where the files are copied. Make sure that the destination is a directory.
Our objective is to copy the file from our local file system to HDFS. In my case, I
want to copy the file name Salaries.csv which is present
at /home/dikshant/Documents/hadoop_file directory.

Steps to execute copyFromLocal Command

Let’s see the current view of my Root directory in HDFS.


Step 1: Make a directory in HDFS where you want to copy this file with the below
command.

hdfs dfs -mkdir /Hadoop_File

Step 2: Use the copyFromLocal command as shown below to copy it to the HDFS /Hadoop_File directory.

hdfs dfs -copyFromLocal /home/dikshant/Documents/hadoop_file/Salaries.csv /Hadoop_File

Step 3: Check whether the file is copied successfully or not by moving to its
directory location with below command.
hdfs dfs -ls /Hadoop_File

Overwriting or Updating the File In HDFS with -f switch


From the below image, you can observe that the copyFromLocal command by itself does not copy a file of the same name to the same location; it says that the file already exists.

To update the content of the file or to Overwrite it, you should use -f switch as
shown below.

hdfs dfs -copyFromLocal -f /home/dikshant/Documents/hadoop_file/Salaries.csv /Hadoop_File
Now you can easily observe that using copyFromLocal with the -f switch does not produce any error; it simply updates or overwrites your file in HDFS.
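As a side note, the same copy (including the overwrite behaviour of the -f switch) can be performed from Java through the FileSystem API. The following minimal sketch is not part of the original walkthrough; it reuses the local Salaries.csv path and the /Hadoop_File destination from the steps above.

Java
// Sketch: copy a local file into HDFS, overwriting any existing copy,
// roughly equivalent to: hdfs dfs -copyFromLocal -f <src> /Hadoop_File
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocalSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path src = new Path("/home/dikshant/Documents/hadoop_file/Salaries.csv");
        Path dst = new Path("/Hadoop_File");

        // delSrc = false keeps the local file, overwrite = true mirrors the -f switch
        fs.copyFromLocalFile(false, true, src, dst);
    }
}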
Hadoop – getmerge Command
The Hadoop -getmerge command is used to merge multiple files in an HDFS (Hadoop Distributed File System) into one single output file in our local file system.
We want to merge the 2 files present inside our HDFS, i.e. file1.txt and file2.txt, into a single file output.txt in our local file system.

Steps To Use -getmerge Command

Step 1: Let’s see the content of file1.txt and file2.txt that are available in our
HDFS. You can see the content of File1.txt in the below image:

Content of File2.txt

In this case, we have copied both of these files into my HDFS in the Hadoop_File folder. If you don't know how to make the directory and copy files to HDFS, then follow the below commands to do so.
 Making the Hadoop_File directory in our HDFS
hdfs dfs -mkdir /Hadoop_File
 Copying files to HDFS

hdfs dfs -copyFromLocal /home/dikshant/Documents/hadoop_file/file1.txt /home/dikshant/Documents/hadoop_file/file2.txt /Hadoop_File

Below is the Image showing this file inside my /Hadoop_File directory in HDFS.
Step 2: Now it's time to use the -getmerge command to merge these files into a single output file in our local file system. For that, follow the below procedure.
Syntax:
hdfs dfs -getmerge -nl /path1 /path2 ..../path n /destination
-nl is used for adding a newline; this will add a newline between the contents of these n files. In this case, we have merged the files into the /hadoop_file folder inside my /Documents folder.
hdfs dfs -getmerge -nl /Hadoop_File/file1.txt /Hadoop_File/file2.txt /home/dikshant/Documents/hadoop_file/output.txt

Now let's see whether the files got merged into the output.txt file or not.

In the above image, you can easily see that the files were merged successfully into our output.txt file.
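For reference, Hadoop 2.x also exposes a merge helper in its Java API, FileUtil.copyMerge, which concatenates the files of an HDFS directory into a single destination file. The sketch below is not part of the original steps; it assumes a Hadoop 2.x client and reuses the /Hadoop_File directory and the local output.txt path from above.

Java
// Sketch: merge every file under /Hadoop_File into one local output file,
// roughly equivalent to the -getmerge -nl command shown above
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class GetMergeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);        // source: HDFS
        FileSystem local = FileSystem.getLocal(conf);  // destination: local file system

        // "\n" is appended after each file, playing the role of the -nl switch;
        // deleteSource = false keeps the original files in HDFS
        FileUtil.copyMerge(hdfs, new Path("/Hadoop_File"),
                local, new Path("/home/dikshant/Documents/hadoop_file/output.txt"),
                false, conf, "\n");
    }
}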
