SSJ Bda File
Mayapuri, 110064
2024
Hadoop can be run in three modes:
- Standalone Mode
In this mode you need an IDE such as Eclipse and the Hadoop library files (which you can
download from the Apache website). You write your MapReduce program and run it on
your local machine against a small sample data set. This lets you check the logic of the code
and catch syntax errors, but it does not give you the full experience of Hadoop.
- Pseudo-Distributed Mode
In this mode all the Hadoop daemons run on a single machine. You can get a preconfigured
VM from Cloudera or Hortonworks that works out of the box, with all the necessary tools
installed and configured. In this mode you can scale up your data to check how your code
performs and optimize it to get the job done in the required time.
- Fully-Distributed Mode
In this mode the daemons run on different machines. This mode is mostly used in the
production stage of a project, once the code has already been verified.
To practice Hadoop code, the simplest starting point is Standalone mode: install Eclipse
on your PC, download the libraries, and start coding.
Conclusion:
In this practical, we learned how to create directories in the Hadoop Distributed File System
(HDFS) at specified paths using the hadoop fs -mkdir command. Creating directories in HDFS
is a fundamental operation when organizing and managing data in a Hadoop cluster.
Mastering this skill allows users to efficiently structure their data storage in HDFS for various
big data processing tasks.
Aim: The aim of this practical is to demonstrate how to view the contents of a file in the
Hadoop Distributed File System (HDFS) using the hadoop fs -cat command, similar to the cat
command in Unix/Linux.
Procedure:
1. Viewing Contents of a File:
• Use the hadoop fs -cat command to view the contents of a file in HDFS.
• Syntax: hadoop fs -cat <HDFS_file_path>
• Replace <HDFS_file_path> with the path to the file in HDFS whose contents you
want to view.
• Example: hadoop fs -cat /user/saurzcode/dir1/abc.txt
• This command will display the contents of the file named abc.txt located in the
/user/saurzcode/dir1/ directory in HDFS.
Conclusion:
In this practical, we learned how to view the contents of a file in the Hadoop Distributed File
System (HDFS) using the hadoop fs -cat command. This command is similar to the cat
command in Unix/Linux and allows you to quickly view the contents of a file stored in HDFS.
Being able to view file contents is essential for data inspection and debugging purposes in
Hadoop environments.
Aim: The aim of this practical is to demonstrate how to copy files from a source location to a
destination location in the Hadoop Distributed File System (HDFS) using the hadoop fs -cp
command.
Procedure:
1. Copying a File from Source to Destination:
• Use the hadoop fs -cp command to copy a file from a source location to a
destination location in HDFS.
• Syntax: hadoop fs -cp <source_path> <destination_path>
• Replace <source_path> with the path to the source file in HDFS.
• Replace <destination_path> with the path to the destination location in HDFS.
• Example: hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
• This command will copy the file abc.txt from the /user/saurzcode/dir1/
directory to the /user/saurzcode/dir2/ directory in HDFS.
Conclusion:
In this practical, we learned how to copy files from a source location to a destination location
in the Hadoop Distributed File System (HDFS) using the hadoop fs -cp command. This
command is useful for duplicating files within HDFS and organizing data in the Hadoop cluster.
Mastering file copying operations in HDFS is essential for efficient data management and
workflow in Hadoop environments.
Aim: The aim of this practical is to demonstrate how to remove files or directories from the
Hadoop Distributed File System (HDFS) using the hadoop fs -rm command for files and
hadoop fs -rmr command for directories.
Procedure:
1. Removing a File:
• Use the hadoop fs -rm command to remove a file from HDFS.
• Syntax: hadoop fs -rm <file_path>
• Replace <file_path> with the path to the file you want to remove from HDFS.
• Example: hadoop fs -rm /user/saurzcode/dir1/abc.txt
• This command will remove the file abc.txt from the /user/saurzcode/dir1/
directory in HDFS.
2. Removing a Directory (Recursive):
• Use the hadoop fs -rmr command to remove a directory and its contents recursively
from HDFS. In current Hadoop releases -rmr is deprecated; hadoop fs -rm -r is the
replacement.
• Syntax: hadoop fs -rmr <directory_path>
• Replace <directory_path> with the path to the directory you want to remove from
HDFS.
• Example: hadoop fs -rmr /user/saurzcode/dir1/
• This command will remove the directory /user/saurzcode/dir1/ and all its contents
recursively from HDFS.
Conclusion:
In this practical, we learned how to remove files or directories from the Hadoop Distributed
File System (HDFS) using the hadoop fs -rm command for files and hadoop fs -rmr command
for directories. These commands are essential for managing data in HDFS and maintaining
the organization of files and directories within the Hadoop cluster. Understanding how to
remove files and directories safely is crucial to avoid unintended data loss in Hadoop
environments.
Conclusion:
In this practical, we learned how to move files from a source location to a destination location
in the Hadoop Distributed File System (HDFS) using the hadoop fs -mv command. Moving
files in HDFS allows you to reorganize data within the Hadoop cluster efficiently. Mastering
file moving operations in HDFS is essential for maintaining data organization and managing
workflows in Hadoop environments.
10 | S H I V SHEKHAR JHA
00276807721
EXPERIMENT 9
Display the Aggregate Length of a File
Aim: The aim of this practical is to demonstrate how to display the aggregate length of a file
in the Hadoop Distributed File System (HDFS) using the hadoop fs -du command.
Procedure:
1. Displaying Aggregate Length of a File:
• Use the hadoop fs -du command to display the aggregate length of a file in HDFS.
• Syntax: hadoop fs -du <file_path>
• Replace <file_path> with the path to the file in HDFS for which you want to
display the aggregate length.
• Example: hadoop fs -du /user/saurzcode/dir1/abc.txt
• This command will display the aggregate length of the file abc.txt located in the
/user/saurzcode/dir1/ directory in HDFS.
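To make "aggregate length" concrete: it is the total number of bytes in the file or, for a directory, the sum over the files beneath it (with the -s option, hadoop fs -du prints one summary total instead of per-entry sizes). The following is a minimal Java sketch of that summation on the local filesystem using java.nio.file. It is only an analogy and does not talk to HDFS; the class name AggregateLengthSketch, its method names, and the sample file contents are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Local-filesystem analogy only: no Hadoop or HDFS classes are used.
class AggregateLengthSketch {

    // Sum the byte lengths of every regular file at or below 'root'.
    static long aggregateLength(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a throwaway directory with two small files (hypothetical sample data).
    static long demo() {
        try {
            Path dir = Files.createTempDirectory("du-sketch");
            Files.write(dir.resolve("abc.txt"), "hello".getBytes(StandardCharsets.UTF_8));  // 5 bytes
            Files.write(dir.resolve("def.txt"), "hadoop".getBytes(StandardCharsets.UTF_8)); // 6 bytes
            return aggregateLength(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 5 + 6 = 11
    }
}
```

Passing the path of a single file to aggregateLength returns that file's own length, which corresponds to the single-file case used in this experiment.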
Conclusion:
In this practical, we learned how to display the aggregate length of a file in the Hadoop
Distributed File System (HDFS) using the hadoop fs -du command. This command provides
information about the total length occupied by the specified file in HDFS. Understanding how
to retrieve file size information is essential for managing and monitoring data storage in
Hadoop environments.
EXPERIMENT 10
Implement a Word Count MapReduce Program to Understand the
MapReduce Paradigm
Program-
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
// ... the WordCount class declaration, the Mapper and Reducer classes, and the
// start of main() are listed part by part in the sections that follow ...
// Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Deleting the output path from HDFS automatically so that we don't have to delete it explicitly
outputPath.getFileSystem(conf).delete(outputPath, true);
// Exit with status 0 if the job completes successfully, 1 otherwise
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The entire MapReduce program can be fundamentally divided into three parts:
• Mapper Phase Code
• Reducer Phase Code
• Driver Code
We will understand the code for each of these three parts sequentially.
Mapper code:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
// Emits (word, 1) for every token in the input line
public void map(LongWritable key, Text value, Context context) throws
IOException, InterruptedException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
value.set(tokenizer.nextToken());
context.write(value, new IntWritable(1));
}
}
}
• We have created a class Map that extends the class Mapper, which is already defined in the
MapReduce framework.
• We define the data types of the input and output key/value pairs after the class declaration
using angle brackets.
• Both the input and the output of the Mapper are key/value pairs.
• Input:
◦ The key is the offset of each line in the text file: LongWritable
◦ The value is each individual line (as shown in the figure at the right): Text
• Output:
◦ The key is each tokenized word: Text
◦ The value is hardcoded to 1 in our case: IntWritable
◦ Example – Dear 1, Bear 1, etc.
• We have written Java code that tokenizes each line into words and assigns each word a
hardcoded value equal to 1.
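The map step described above can be tried without a Hadoop installation. The following is a minimal plain-Java sketch, assuming ordinary String and Integer values in place of Text and IntWritable and a returned list in place of context.write(); the class name MapPhaseSketch and the sample line are hypothetical and not part of the practical's program.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map phase: no Hadoop classes are used.
class MapPhaseSketch {

    // Mirrors Map.map(): tokenize one input line and emit (word, 1) per token.
    // The offset plays the role of the LongWritable key and is ignored,
    // just as the Hadoop mapper ignores it.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            emitted.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Sample line chosen for illustration
        System.out.println(map(0L, "Dear Bear Car")); // prints [Dear=1, Bear=1, Car=1]
    }
}
```

Each word is emitted with the value 1 regardless of how often it has already appeared; counting is left entirely to the reducer.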
Reducer Code:
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
// Sums the list of counts received for one word
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException
{
int sum = 0;
for (IntWritable x : values)
{
sum += x.get();
}
context.write(key, new IntWritable(sum));
}
}
• We have created a class Reduce which extends the class Reducer, like that of Mapper.
• We define the data types of the input and output key/value pairs after the class declaration
using angle brackets, as done for Mapper.
• Both the input and the output of the Reducer are key/value pairs.
• Input:
◦ The key is one of the unique words generated after the sorting and shuffling
phase: Text
◦ The value is a list of integers corresponding to each key: IntWritable
◦ Example – Bear, [1, 1], etc.
• Output:
◦ The key is each of the unique words present in the input text file: Text
◦ The value is the number of occurrences of each of the unique words: IntWritable
◦ Example – Bear, 2; Car, 3, etc.
• We have aggregated the values present in the list corresponding to each key and
produced the final answer.
• The reduce() method is invoked once for each unique word. The number of reducer tasks
is a separate setting that you can specify in mapred-site.xml.
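The grouping and summing described above can also be sketched without Hadoop. This is a minimal plain-Java sketch, assuming a TreeMap as a stand-in for the sorting and shuffling phase and Integer in place of IntWritable; the class name ReducePhaseSketch, its method names, and the input words are hypothetical, chosen to reproduce the Bear, 2 and Car, 3 example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for shuffle/sort and the reduce phase: no Hadoop classes are used.
class ReducePhaseSketch {

    // Stand-in for sorting and shuffling: collect one 1 per occurrence
    // under each word, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<String> mappedWords) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : mappedWords) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Mirrors Reduce.reduce(): add up the list of values for one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int x : values) {
            sum += x;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Words as they would leave the map phase, each carrying the value 1
        List<String> mapped = Arrays.asList("Dear", "Bear", "Car", "Car", "Bear", "Car");
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            // reduce() is called once per unique word
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

With this input, reduce() receives Bear with [1, 1] and Car with [1, 1, 1], matching the input and output examples in the list above.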
Driver Code:
Configuration conf = new Configuration();
// Job.getInstance() replaces the deprecated Job constructor
Job job = Job.getInstance(conf, "My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
1. In the driver class, we set the configuration of our MapReduce job to run in Hadoop.
2. We specify the name of the job and the data types of the input/output of the mapper and
reducer.
3. We also specify the names of the mapper and reducer classes.
4. The paths of the input and output folders are also specified.
5. The method setInputFormatClass() specifies how a Mapper reads the input data, that is,
what the unit of work is. Here, we have chosen TextInputFormat so that the mapper reads
a single line at a time from the input text file.
6. The main() method is the entry point for the driver. In this method, we instantiate a
new Configuration object for the job.