MapReduce
Step 1:
We can download the dataset from this link, which provides data for various cities across different years. Choose the year of your choice and select any one of the data text files for analysis. In my case, I have selected the CRND0103-2020-AK_Fairbanks_11_NE.txt dataset to analyze the hot and cold days in Fairbanks, Alaska.
We can get information about the data from the README.txt file available on the NCEI website.
Step 2:
Below is an example of our dataset, where column 6 and column 7 show the maximum and minimum temperature, respectively.
Step 3:
First open Eclipse -> then select File -> New -> Java Project -> name it MyProject -> then select Use an execution environment -> choose JavaSE-1.8 -> then Next -> Finish.
In this project, create a Java class with the name MyMaxMin -> then click Finish.
Copy the below source code into this MyMaxMin Java class.
JAVA
// importing Libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {

// Mapper
/**
 * @method map
 * This method takes the input as a text data type.
 * Leaving the first five tokens aside, the 6th token
 * is taken as temp_max and the 7th token is taken
 * as temp_min. Days with temp_max > 30 or
 * temp_min < 15 are passed to the reducer.
 */
public static class MaxMinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (!line.isEmpty()) {
            // split the record into whitespace-separated tokens;
            // token 2 is the date, tokens 6 and 7 are the max/min temperature
            String[] tokens = line.trim().split("\\s+");
            String date = tokens[1];
            float temp_Max = Float.parseFloat(tokens[5]);
            float temp_Min = Float.parseFloat(tokens[6]);

            // if maximum temperature is
            // greater than 30, it is a hot day
            if (temp_Max > 30.0) {
                // Hot day
                context.write(new Text("The Day is Hot Day :" + date),
                        new Text(String.valueOf(temp_Max)));
            }

            // if minimum temperature is
            // less than 15, it is a cold day
            if (temp_Min < 15) {
                // Cold day
                context.write(new Text("The Day is Cold Day :" + date),
                        new Text(String.valueOf(temp_Min)));
            }
        }
    }
}
// Reducer
/**
 * @method reduce
 * This method takes the input as the key and
 * list-of-values pair from the mapper,
 * aggregates the values based on the key and
 * writes the final output to the context.
 */
/**
* @method main
* This method is used for setting
* all the configuration properties.
* It acts as a driver for map-reduce
* code.
*/
}
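For reference, a minimal sketch of the reducer and driver described by the two comments above is shown below; both would sit inside the MyMaxMin class next to the mapper. The reducer name MaxMinReducer and the exact reduce logic (a simple pass-through, since every day produces its own key) are illustrative assumptions, not the exact original code.
Java
// Reducer: forwards each day's reported temperature
public static class MaxMinReducer
        extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // emit every temperature reported for this key
        for (Text value : values) {
            context.write(key, value);
        }
    }
}

// Driver: sets the configuration properties and submits the job
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "weather example");
    job.setJarByClass(MyMaxMin.class);

    job.setMapperClass(MaxMinMapper.class);
    job.setReducerClass(MaxMinReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // dataset location in HDFS and output directory come from the command line
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}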
Now we need to add the external jars for the packages that we have imported. Download the jar packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version.
You can check Hadoop Version:
hadoop version
Now we add these external jars to our MyProject. Right-click on MyProject -> then select Build Path -> click on Configure Build Path and select Add External JARs… and add the jars from their download location, then click -> Apply and Close.
Now export the project as a jar file. Right-click on MyProject, choose Export… and go to Java -> JAR file, click -> Next and choose your export destination, then click -> Next.
Choose the Main Class as MyMaxMin by clicking -> Browse and then click -> Finish -> Ok.
Step 4: Start all the Hadoop daemons.
start-dfs.sh
start-yarn.sh
Step 5: Copy the dataset to HDFS; here it is placed in the HDFS root directory, which is where the run command below expects it.
hdfs dfs -put /local_path_to/CRND0103-2020-AK_Fairbanks_11_NE.txt /
Step 6: Now run your jar file with the below command and produce the output in the MyOutput file.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS
/output-file_name
Command:
hadoop jar /home/dikshant/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput
Step 7:
Now move to localhost:50070/, under Utilities select Browse the file system, and download part-r-00000 from the /MyOutput directory to see the result.
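Note: We can also view the result directly with the below command (the /MyOutput directory is the one used in the run command above):
hdfs dfs -cat /MyOutput/part-r-00000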
Step 8:
In the above image, you can see the top 10 results showing the cold days. The second column is the date in yyyymmdd format. For example, 20200101 means:
year = 2020
month = 01
date = 01
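If you want to split that date value apart in code, a simple illustrative snippet (variable names are arbitrary) looks like this:
String date = "20200101";             // value from the second column
String year  = date.substring(0, 4);  // "2020"
String month = date.substring(4, 6);  // "01"
String day   = date.substring(6, 8);  // "01"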
Step 1:
We can download the Titanic dataset from this link. Below is the column structure of our Titanic dataset. It consists of 12 columns, where each row describes the information of a particular passenger.
Step 2:
In this project, create a Java class with the name Average_age -> then click Finish.
Copy the below source code into this Average_age Java class.
Java
// import libraries
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Average_age {

    // Reducer: averages the ages received for each key (gender)
    public static class Reduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            int l = 0;
            // foreach loop: storing and calculating
            // the sum and the count of values
            for (IntWritable val : values) {
                l += 1;
                sum += val.get();
            }
            sum = sum / l;
            context.write(key, new IntWritable(sum));
        }
    }
    // Driver
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "Averageage_survived");
        job.setJarByClass(Average_age.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // input dataset and output directory are taken from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
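The driver above registers a mapper class named Map that is not shown in this listing. A minimal sketch of what it might look like is given below; it assumes the standard 12-column Titanic layout (PassengerId, Survived, Pclass, Name, Sex, Age, ...) with no embedded commas in the Name field, and it emits (gender, age) only for passengers who did not survive, since that is what the final averages describe. The field indices and parsing are assumptions for illustration, not the exact original code.
Java
// Mapper sketch: would sit inside Average_age next to Reduce
public static class Map
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // assumes a plain comma-separated row with no embedded commas
        String[] fields = value.toString().split(",");

        // skip the header row and rows without an age value
        if (fields.length >= 6 && !fields[1].equals("Survived")
                && !fields[5].isEmpty()) {
            String survived = fields[1]; // 0 = died, 1 = survived
            String gender = fields[4];   // "male" or "female"
            int age = (int) Float.parseFloat(fields[5]);

            // only passengers who died in the disaster
            if (survived.equals("0")) {
                context.write(new Text(gender), new IntWritable(age));
            }
        }
    }
}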
Now we need to add the external jars for the packages that we have imported. Download the jar packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version.
Check the Hadoop version:
hadoop version
Step 5: Export the project as a jar file (Average_age.jar) the same way as in the previous example, then start all the Hadoop daemons.
Step 6: Copy titanic_data.txt to the HDFS root directory, then run your jar file with the below command and produce the output in the Titanic_Output file.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS
/output-file_name
Command:
hadoop jar /home/dikshant/Documents/Average_age.jar /titanic_data.txt /Titanic_Output
Step 7:
Now move to localhost:50070/, under Utilities select Browse the file system, and download part-r-00000 from the /Titanic_Output directory to see the result.
Note: We can also view the result with the below command:
hdfs dfs -cat /Titanic_Output/part-r-00000
In the above image, we can see that, according to our dataset, the average age of the females who died in the Titanic disaster is 28 and that of the males is 30.
Step 3: Now we need to add the external jars for the packages that we have imported. Download the jar packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version. You can check your Hadoop version with the below command:
hadoop version
Step 4: Now we add these external jars to our CharCount project. Right-click on CharCount -> then select Build Path -> click on Configure Build Path and select Add External JARs… and add the jars from their download location, then click -> Apply and Close.
Step 5: Now export the project as a jar file. Right-click on CharCount, choose Export… and go to Java -> JAR file, click -> Next and choose your export destination, then click -> Next. Choose the Main Class as CharCount by clicking -> Browse and then click -> Finish -> Ok.
Now the jar file is successfully created and saved in the /Documents directory with the name charectercount.jar in my case.
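The CharCount source itself is not reproduced in these steps; a minimal sketch of what a character-count job might look like is shown below. The class and method names are illustrative assumptions; the only hard requirement is that the Main Class selected while exporting the jar matches the driver class.
Java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CharCount {

    // Mapper: emits (character, 1) for every character in the line
    public static class CharMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            for (int i = 0; i < line.length(); i++) {
                context.write(new Text(String.valueOf(line.charAt(i))), one);
            }
        }
    }

    // Reducer: sums the counts for each character
    public static class CharReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "character count");
        job.setJarByClass(CharCount.class);
        job.setMapperClass(CharMapper.class);
        job.setReducerClass(CharReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}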
Step 6: Create a simple text file and add some data to it.
nano test.txt
You can also add text to the file manually or by using some other editor like Vim or gedit.
To see the content of the file, use the cat command available in Linux.
cat test.txt
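Before running the jar in Step 9, the Hadoop daemons must be running and test.txt must be available in HDFS. Assuming the file is placed in the HDFS root directory (which is what the run command below expects), that looks like:
start-dfs.sh
start-yarn.sh
hdfs dfs -put test.txt /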
Step 9: Now run your jar file with the below command and produce the output in the CharCountResult file.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS
/output-file_name
Command:
hadoop jar /home/dikshant/Documents/charectercount.jar /test.txt /CharCountResult
Step 10: Now move to localhost:50070/, under Utilities select Browse the file system, and download part-r-00000 from the /CharCountResult directory to see the result. We can also check the result, i.e. that part-r-00000 file, with the cat command as shown below.
hdfs dfs -cat /CharCountResult/part-r-00000
#!/usr/bin/env python

# import sys because we need to read and write data to STDIN and STDOUT
import sys

# read each line from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # we are looping over the words array and printing the word
    # with the count of 1 to the STDOUT
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print('%s\t%s' % (word, 1))
Here, in the above program, #! is known as the shebang and is used for selecting the interpreter for the script; the file will be run using the command we specify there.
Let's test our mapper.py locally to check whether it is working fine or not.
Syntax:
cat <text_data_file> | python <mapper_code_python_file>
Command (in my case):
cat word_count_data.txt | python mapper.py
The output of the mapper is shown below.
Step 3: Create a reducer.py file that implements the reducer logic. It will read the output of mapper.py from STDIN (standard input), aggregate the occurrences of each word, and write the final output to STDOUT.
cd Documents/                    # to change the directory to /Documents
touch reducer.py                 # touch is used to create an empty file
Python3
#!/usr/bin/env python

# import sys to read the mapper output from STDIN
import sys

current_word = None
current_count = 0

# the input arriving from mapper.py is sorted by word,
# so all counts for one word come in consecutively
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        # a new word has started: emit the finished one
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

# emit the last word
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
Now let's check whether our reducer code in reducer.py is working properly together with mapper.py, with the help of the below command.
cat word_count_data.txt | python mapper.py | sort -k1,1 | python reducer.py
We can see that our reducer is also working fine in our local system.
Step 4: Now let’s start all our Hadoop daemons with the below command.
start-dfs.sh
start-yarn.sh
Let's give executable permission to our mapper.py and reducer.py with the help of the below command.
cd Documents/
chmod +x mapper.py reducer.py
The job is then submitted with the Hadoop streaming utility, whose -mapper option is the command to be run as the mapper (here, mapper.py), as shown below.
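A typical streaming invocation looks like the sketch below. The exact location of the hadoop-streaming jar depends on your Hadoop installation, and the HDFS paths /word_count_data.txt (the input copied into HDFS) and /word_count_output are placeholders you should adjust to your setup:
hadoop jar /path/to/hadoop-streaming.jar \
  -input /word_count_data.txt \
  -output /word_count_output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py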
1. File Permission
-chmod, which stands for change mode, is the command used for changing the permissions of files in our HDFS. First, list the directories available in our HDFS and have a look at the permissions assigned to each of these directories.
You can list the directory in your HDFS root with the below command.
hdfs dfs -ls /
Here, / represents the root directory of your HDFS.
Let me first list down files present in my Hadoop_File directory.
hdfs dfs -ls /Hadoop_File
In the above image, you can see that for file1.txt, only the owner user has read and write permission. So I am adding write permission for group and others as well.
Pre-requisite:
You have to be familiar with the use of the -chmod command in Linux, i.e. how to use its permission switches for users. To add write permission for group and others, use the below command.
hdfs dfs -chmod go+w /Hadoop_File/file1.txt
Here, go stands for group and others, w means write, and the + sign shows that I am adding write permission for group and others. Then list the file again to check whether it worked or not.
hdfs dfs -ls /Hadoop_File
And we are done with it. Similarly, you can change the permissions of any file or directory available in our HDFS (Hadoop Distributed File System), and you can change permissions as per your requirement for any user. You can also change the group or owner of a directory with -chgrp and -chown respectively, as shown in the example below.
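For example (hadoopgroup below is just a placeholder for whatever group exists on your system, dikshant is the user from this setup, and changing the owner requires superuser privileges):
hdfs dfs -chgrp hadoopgroup /Hadoop_File/file1.txt
hdfs dfs -chown dikshant /Hadoop_File/file1.txt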
An ACL provides a more flexible way to assign permissions in a file system. It is a list of access permissions for a file or a directory. We need ACLs in case you have made a separate user for your Hadoop single-node cluster setup, or you have a multi-node cluster setup where various nodes are present, and you want to change permissions for other users. If you want to change permissions for different users, you cannot do it with the -chmod command, because -chmod only manages the owner, group, and others permission classes. For example, on a single-node Hadoop cluster your main user is root and you have created a separate user for the Hadoop setup named, let's say, hadoop. Now, if you want to change the permissions of the root user for files that are present in your HDFS, you cannot do it with the -chmod command. This is where the ACL (Access Control List) comes into the picture. With an ACL, you can set permissions for a specific named user or named group.
In order to enable ACLs in HDFS, you need to add the below property to the hdfs-site.xml file.
<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>
Note: Don't forget to restart all the daemons, otherwise the changes made to hdfs-site.xml won't take effect.
You can check the entries in your access control list (ACL) for a directory with the -getfacl command, as shown below.
hdfs dfs -getfacl /Hadoop_File
You can see that we have 3 different entries in our ACL. Suppose you want to change the permissions of your root user for an HDFS directory; you can do it with the below command.
Syntax:
hdfs dfs -setfacl -m user:user_name:r-x /Hadoop_File
You can change the permissions for any user by adding that user to the ACL of the directory. Below are some examples of changing the permissions of different named users for an HDFS file or directory.
hdfs dfs -setfacl -m user:root:r-x /Hadoop_File
Another example, for raj user:
hdfs dfs -setfacl -m user:raj:r-x /Hadoop_File
Here, r-x denotes only read and execute permission on the HDFS directory for the root and raj users.
In my case, I don't have any other user, so I am changing the permissions for my only user, i.e. dikshant.
hdfs dfs -setfacl -m user:dikshant:rwx /Hadoop_File
Then list the ACL with -getfacl command to see the changes.
hdfs dfs -getfacl /Hadoop_File
Here, you can see another entry in the ACL of this directory, user:dikshant:rwx, for the new permission of the dikshant user. Similarly, in case you have multiple users, you can change their permissions for any HDFS directory. As another example, the permission of the dikshant user can be changed back to r-x mode:
hdfs dfs -setfacl -m user:dikshant:r-x /Hadoop_File
Here, you can see that I have changed the dikshant user's permission from rwx to r-x.
Hadoop – copyFromLocal Command
The Hadoop copyFromLocal command is used to copy a file from your local file system to HDFS (Hadoop Distributed File System). The copyFromLocal command has an optional switch -f which is used to replace an already existing file in the destination, which means it can be used to update that file. The -f switch is similar to first deleting a file and then copying it; if the file is already present in the destination folder, copying it into the same folder again will throw an error.
The syntax to copy a file from your local file system to HDFS is given below:
hdfs dfs -copyFromLocal /path/in/local/file/system /destination/path/in/HDFS
Step 3: Check whether the file was copied successfully or not by moving to its directory location with the below command.
hdfs dfs -ls /Hadoop_File
To update the content of the file, or to overwrite it, you should use the -f switch as shown below.
hdfs dfs -copyFromLocal -f <local_file_path> /Hadoop_File
Step 1: Let's see the content of file1.txt and file2.txt that are available in our HDFS. You can see the content of file1.txt in the below image:
Content of file2.txt:
Below is the image showing these files inside my /Hadoop_File directory in HDFS.
Step 2: Now it's time to use the -getmerge command to merge these files into a single output file in our local file system. For that, follow the below procedure.
Syntax:
hdfs dfs -getmerge -nl /path1 /path2 .... /path_n /destination
-nl is used for adding a newline; this will add a newline between the contents of these n files. In this case, we merge them into the /hadoop_file folder inside my /Documents folder.
hdfs dfs -getmerge -nl /Hadoop_File/file1.txt /Hadoop_File/file2.txt /home/dikshant/Documents/hadoop_file/output.txt
Now let's see whether the files got merged into the output.txt file or not.
In the above image, you can easily see that the files have been merged successfully into our output.txt file.