
BIRLA VISHVAKARMA
MAHAVIDYALAYA
(An Autonomous Institution)
Computer Department
Course Code: 4CP02
Course Name: Data Analysis and
Visualization
Faculty: Divyang sir
Name: Devarshi Chinivar
ID: 21CP059

Practical List
No. Date Practical Sign

1. Configure Hadoop cluster in distributed mode.


2. Try Linux and Hadoop File operation commands.
For Linux File System:
 pwd
 ls
 mkdir
 cd
 touch
 gedit / vi
 cat
 more
 less
 cp
 mv
 rm
 rmdir

For Hadoop File System:
 ls
 mkdir
 put
 get
 cat
 mv
 rmr
 expunge
 touchz
3. Write a Map Reduce Code for Count Frequency of words from a large
file.
4. Develop a MapReduce program to Analyze weather data set and print
whether the day is shiny or cool
5. Develop a MapReduce program to find the number of products sold
in each country by considering sales data containing fields like

6. Configure Hive in distributed mode.

7. To write queries to sort and aggregate the data in a table using HiveQL.
8. Write Hive Query for the following task for movie dataset. Movie
dataset consists of movie id, movie name, release year, rating, and
runtime in seconds. A sample of the dataset is as follows:
a. The Nightmare Before Christmas,1993,3.9,4568
b. The Mummy,1932,3.5,4388
c. Orphans of the Storm,1921,3.2,9062
d. The Object of Beauty,1991,2.8,6150
e. Night Tide,1963,2.8,5126

Write a Hive query for the following:


i. Load the data
ii. List the movies that are having a rating greater than 4
iii. Store the result of previous query into file
iv. List the movies that were released between 1950 and 1960
v. List the movies that have duration greater than 2 hours
vi. List the movies that have rating between 3 and 4
vii. List the movie names and its duration in minutes

9. Configure Pig and try different Pig commands.

10. Configure HBase and try different HBase commands.

11. Write a java program to insert, update and delete records from HBase.

12. Install Apache Spark and try basic commands.



Practical - 1 :
Configure Hadoop cluster in distributed mode.
 To download Cloudera, visit https://community.cloudera.com/t5/Support-Questions/Cloudera-QuickStart-VM-Download/

 Open Oracle VirtualBox, go to the File menu, select Import Appliance, and give the
location of the downloaded Cloudera appliance files.

 After the import completes, run the Cloudera VM in VirtualBox.
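 To verify the cluster (an optional check; these are standard Hadoop CLI commands, and the output details vary by Cloudera release), open a terminal inside the VM and run:

 hadoop version
 hdfs dfsadmin -report

The report lists the live DataNodes, which confirms HDFS is up in distributed mode.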

Practical - 2 :

Try Linux and Hadoop File operation commands.

 For Linux File System:

1. pwd (Print Working Directory)


• Displays the current directory path.
Command:
• pwd

Example Output:
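(Illustrative; in the Cloudera VM the home directory is typically:)

/home/cloudera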

2. ls (List Directory Contents)


• Lists files and directories in the current directory.
Command:
• ls

Example Output :
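(Illustrative listing; your files will differ:)

Desktop Documents Downloads new_directory newfile.txt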

3. mkdir (Make Directory)


• Creates a new directory.
Command:
• mkdir new_directory

Example Output :
(No output if successful)

4. cd (Change Directory)
• Changes the current directory to the specified path.

Command:
• cd new_directory

Example Output :
(No output if successful)

5. touch (Create Empty File)


• Creates an empty file or updates the timestamp of an existing file.
Command:
• touch newfile.txt

Example Output :
(No output if successful)

6. gedit / vi (Text Editors)


• Opens a file in the text editor gedit or vi.
Command:
• gedit file.txt

Example Output :
(The file opens in the respective text editor; no command-line output)

7. cat (Concatenate and Display File Content)


• Displays the contents of a file.
Command:
• cat file.txt

Example Output :
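(Assuming file.txt holds the sample text used in the examples below:)

Hello, this is a sample file.
It contains some text.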

8. more (View File Content Page by Page)


• Displays file content one screen at a time.

Command:
• more file.txt

Example Output :

Hello, this is a sample file.


It contains some text.
--
(END)

9. less (View File Content with Scrolling)


• Displays file content with the ability to scroll up and down.
Command:
• less file.txt

Example Output :

Hello, this is a sample file.


It contains some text.

10. cp (Copy Files or Directories)


• Copies files or directories.
Command:
• cp file.txt copy_file.txt

Example Output :
(No output if successful)

11. mv (Move or Rename Files or Directories)


• Moves or renames files or directories.
Command:
• mv file.txt new_location/

Example Output :
(No output if successful)

12. rm (Remove Files)


• Deletes files.

Command:
• rm file.txt

Example Output :
(No output if successful; will prompt for confirmation if using -i option)

13. rmdir (Remove Directory)


• Deletes empty directories.
Command:
• rmdir empty_directory

Example Output :
(No output if successful; will error if the directory is not empty)

 For Hadoop File System:


1. ls (List Directory Contents)
• Lists the files and directories in a specified directory in HDFS.
Command:
• hdfs dfs -ls /path/to/directory

Example Output :
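(Illustrative; names, sizes and timestamps will differ:)

Found 2 items
drwxr-xr-x - cloudera supergroup 0 2020-03-10 10:30 /path/to/directory/subdir
-rw-r--r-- 1 cloudera supergroup 52 2020-03-10 10:31 /path/to/directory/file.txt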

2. mkdir (Make Directory)


• Creates a new directory in HDFS.

Command:
• hdfs dfs -mkdir /path/to/new_directory

Example Output :
(No output if successful)

3. put (Upload File)


• Uploads a local file to HDFS.
Command:
• hdfs dfs -put local_file.txt /path/to/hdfs_directory/

Example Output :
(No output if successful)

4. get (Download File)


• Downloads a file from HDFS to the local file system.

Command:
• hdfs dfs -get /path/to/hdfs_file.txt local_directory/

Example Output :
(No output if successful)

5. cat (Display File Content)


• Displays the contents of a file in HDFS.
Command:
• hdfs dfs -cat /path/to/hdfs_file.txt

Example Output :
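(Assuming hdfs_file.txt holds the same sample text:)

Hello, this is a sample file.
It contains some text.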

6. mv (Move or Rename File)


• Moves or renames a file or directory in HDFS.

Command:
• hdfs dfs -mv /path/to/old_location /path/to/new_location

Example Output :
(No output if successful)

7. rm / rmr (Remove Files)


• Deletes a file in HDFS. (The older -rmr command removed directories recursively; it is deprecated in favour of -rm -r.)
Command:
• hdfs dfs -rm /path/to/hdfs_file.txt

Example Output :
Deleted /path/to/hdfs_file.txt
(If HDFS trash is enabled, a "moved to trash" message appears instead)

8. expunge (Remove Deleted Files)


• Permanently removes files that are in the Trash.

Command:
• hdfs dfs -expunge
Example Output :
• (No output if successful)

9. touchz (Create an Empty File)


• Creates an empty file in HDFS (similar to touch but for HDFS).
Command:
• hdfs dfs -touchz /path/to/new_file.txt
Example Output :
• (No output if successful)

Practical - 3 :
Write a Map Reduce Code for Count Frequency of words from a large file.

Steps:
 First open Eclipse -> then select File -> New -> Java Project -> Name
it WordCount -> then Finish.

 Create three Java classes in the project. Name them WCDriver (having the main
function), WCMapper, WCReducer.

 You have to include two reference libraries for that:


Right Click on Project -> then select Build Path-> Click on Configure Build
Path

 In the above figure, you can see the Add External JARs option on the right-hand
side. Click on it and add the files mentioned below. You can find these files
in /usr/lib/
1. /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1-cdh5.13.0.jar
2. /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar

 Mapper Code:

// Importing libraries
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WCMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    // Map function: emits (word, 1) for every word in the line
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException {

        String line = value.toString();

        // Splitting the line on spaces
        for (String word : line.split(" ")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}

 Reducer Code:

// Importing libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WCReducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce function: sums the counts emitted for each word
    public void reduce(Text key, Iterator<IntWritable> value,
            OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException {

        int count = 0;

        // Counting the frequency of each word
        while (value.hasNext()) {
            IntWritable i = value.next();
            count += i.get();
        }

        output.collect(key, new IntWritable(count));
    }
}

 Driver Code:

// Importing libraries
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool {

    public int run(String args[]) throws IOException {

        if (args.length < 2) {
            System.out.println("Please give valid inputs");
            return -1;
        }

        JobConf conf = new JobConf(WCDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WCMapper.class);
        conf.setReducerClass(WCReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    // Main Method
    public static void main(String args[]) throws Exception {
        int exitCode = ToolRunner.run(new WCDriver(), args);
        System.out.println(exitCode);
    }
}

_________________________________________________________________

 Now you have to make a jar file. Right-click on the project -> click on Export
-> select the export destination as JAR file -> name the jar file (WordCount.jar)
-> click Next -> at last click Finish. Now copy this file into the workspace
directory of Cloudera.

 Open the terminal on CDH and change the directory to the workspace. You can do
this by using the “cd workspace/” command. Now, create a text file (WCFile.txt) and
move it to HDFS. For that, open the terminal and run the commands shown below
(remember, you should be in the same directory as the jar file you have just created).

 Now, run this command to copy the input file into HDFS.
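For example, assuming the input file is named WCFile.txt as above (this copies it into your HDFS home directory):

 hadoop fs -put WCFile.txt WCFile.txt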

 Now run the jar file with the following command.
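For example, with the jar, class, and file names used above (the driver class is passed explicitly, since no main class was set in the jar manifest):

 hadoop jar WordCount.jar WCDriver WCFile.txt WCOutput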

 After executing the code, you can see the result in the WCOutput file or by writing
the following command on the terminal.

hadoop fs -cat WCOutput/part-00000

Input:

Hello I am DevarshiChinivar
Hello I am a Student

Output:

DevarshiChinivar 1
Hello 2
I 2
Student 1
am 2
a 1

Practical - 4 :
Develop a MapReduce program to Analyze weather
data set and print whether the day is shiny or cool.

Steps:

 Download the CRND0103-2020-AK_Fairbanks_11_NE.txt dataset for analysis of hot
and cold days in Fairbanks, Alaska.

 Example of our dataset, where column 6 and column 7 show the maximum and
minimum temperature, respectively.

 Make a project in Eclipse with below steps:

 First open Eclipse -> then select File -> New -> Java Project -> Name
it MyProject -> then select use an execution environment ->
choose JavaSE-1.8 -> then Next -> Finish.

 In this project, create a Java class with the name MyMaxMin -> then click Finish.

 MyMaxMin.java:

// importing Libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;

public class MyMaxMin {

    // Mapper

    /* MaxTemperatureMapper class is static
     * and extends Mapper abstract class
     * having four Hadoop generics type
     * LongWritable, Text, Text, Text.
     */
    public static class MaxTemperatureMapper extends
            Mapper<LongWritable, Text, Text, Text> {

        /**
         * @method map
         * This method takes the input as a text data type.
         * Leaving the first five tokens, the 6th token is
         * taken as temp_max and the 7th token as temp_min.
         * Records with temp_max > 30 or temp_min < 15
         * are passed to the reducer.
         */

        // records carrying this value in our
        // data set are inconsistent data
        public static final int MISSING = 9999;

        @Override
        public void map(LongWritable arg0, Text Value, Context context)
                throws IOException, InterruptedException {

            // Convert the single row (record) to a String
            // and store it in the String variable line
            String line = Value.toString();

            // Check for the empty line
            if (!(line.length() == 0)) {

                // characters 6 to 14 hold the date in our dataset
                String date = line.substring(6, 14);

                // the maximum temperature occupies characters 39 to 45
                float temp_Max = Float.parseFloat(line.substring(39, 45).trim());

                // the minimum temperature occupies characters 47 to 53
                float temp_Min = Float.parseFloat(line.substring(47, 53).trim());

                // if the maximum temperature is
                // greater than 30, it is a hot day
                if (temp_Max > 30.0) {

                    // Hot day
                    context.write(new Text("The Day is Hot Day :" + date),
                            new Text(String.valueOf(temp_Max)));
                }

                // if the minimum temperature is
                // less than 15, it is a cold day
                if (temp_Min < 15) {

                    // Cold day
                    context.write(new Text("The Day is Cold Day :" + date),
                            new Text(String.valueOf(temp_Min)));
                }
            }
        }
    }

    // Reducer

    /* MaxTemperatureReducer class is static
     * and extends Reducer abstract class
     * having four Hadoop generics type
     * Text, Text, Text, Text.
     */
    public static class MaxTemperatureReducer extends
            Reducer<Text, Text, Text, Text> {

        /**
         * @method reduce
         * This method takes the key and the list of
         * values from the mapper and writes one
         * (key, temperature) pair to the final context.
         */
        @Override
        public void reduce(Text Key, Iterable<Text> Values, Context context)
                throws IOException, InterruptedException {

            // take the first temperature reported for
            // this key and store it in a String variable
            String temperature = Values.iterator().next().toString();
            context.write(Key, new Text(temperature));
        }
    }

    /**
     * @method main
     * This method is used for setting
     * all the configuration properties.
     * It acts as a driver for the map-reduce code.
     */

    public static void main(String[] args) throws Exception {

        // reads the default configuration of the
        // cluster from the configuration XML files
        Configuration conf = new Configuration();

        // Initializing the job with the
        // default configuration of the cluster
        Job job = new Job(conf, "weather example");

        // Assigning the driver class name
        job.setJarByClass(MyMaxMin.class);

        // Key type coming out of mapper
        job.setMapOutputKeyClass(Text.class);

        // value type coming out of mapper
        job.setMapOutputValueClass(Text.class);

        // Defining the mapper class name
        job.setMapperClass(MaxTemperatureMapper.class);

        // Defining the reducer class name
        job.setReducerClass(MaxTemperatureReducer.class);

        // Defining the input Format class which is
        // responsible for parsing the dataset
        // into key value pairs
        job.setInputFormatClass(TextInputFormat.class);

        // Defining the output Format class which is
        // responsible for writing the key value pairs
        job.setOutputFormatClass(TextOutputFormat.class);

        // setting the second argument
        // as a path in a path variable
        Path OutputPath = new Path(args[1]);

        // Configuring the input path
        // from the filesystem into the job
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Configuring the output path from
        // the filesystem into the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // deleting the output path automatically
        // from hdfs so that we don't have
        // to delete it explicitly
        OutputPath.getFileSystem(conf).delete(OutputPath);

        // exiting the job only if the
        // flag value becomes false
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

 Now we need to add external jars for the packages that we have imported. Download
the jar packages Hadoop Common and Hadoop MapReduce Core according to
your Hadoop version.
You can check the Hadoop version with:

 hadoop version

 Now we add these external jars to our MyProject. Right-click on MyProject ->
then select Build Path -> click on Configure Build Path, select Add
External JARs…, add the jars from their download location, then click -> Apply
and Close.

 Now export the project as a jar file. Right-click on MyProject, choose Export…,
go to Java -> JAR file, click -> Next, choose your export destination, then
click -> Next.
Choose the Main Class as MyMaxMin by clicking -> Browse and then click -
> Finish -> Ok.

Start our Hadoop Daemons

 start-dfs.sh
 start-yarn.sh
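You can confirm the daemons started with:

 jps

(It should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager, in addition to Jps.)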

Move your dataset to the Hadoop HDFS.

Syntax:

 hdfs dfs -put /file_path /destination

In the command below, / denotes the root directory of our HDFS.

 hdfs dfs -put /home/dikshant/Downloads/CRND0103-2020-AK_Fairbanks_11_NE.txt /

Check that the file was sent to our HDFS.

 hdfs dfs -ls /



Now run your jar file with the below command to produce the output
in the MyOutput file.

Syntax:

 hadoop jar /jar_file_location /dataset_location_in_HDFS /output-file_name

Command:

 hadoop jar /home/dikshant/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput

Now go to localhost:50070/, under Utilities select Browse the file system, and
download part-r-00000 from the /MyOutput directory to see the result.
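Alternatively, view the result directly from the terminal:

 hdfs dfs -cat /MyOutput/part-r-00000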

See the result in the Downloaded File.

In the above image, you can see the top 10 results showing the cold days. The
second column is the date in yyyymmdd format. For example, 20200101 means:

 year = 2020
 month = 01
 date = 01
