
Big Data Fundamentals and Platforms

ASSIGNMENT

Question 3: Show a practical example of listing files, inserting data, retrieving data, and
shutting down HDFS.
Initially, you have to format the configured HDFS file system. Open the NameNode (HDFS
server) and execute the following command.

$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will
start the NameNode as well as the DataNodes as a cluster.
$ start-dfs.sh 

Listing Files in HDFS


After loading the information into the server, we can list the files in a directory, or check
the status of a file, using ‘ls’. Given below is the syntax of ls; you can pass a
directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
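For example, assuming a directory named /user already exists on HDFS, the following would list its contents:
$ $HADOOP_HOME/bin/hadoop fs -ls /user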

Inserting Data into HDFS


Assume we have data in a file called file.txt on the local system that ought to be
saved in the HDFS file system. Follow the steps given below to insert the required file into
the Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input 

Step 2
Transfer and store a data file from the local system to the Hadoop file system using the
put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input 

Step 3
You can verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input 

Retrieving Data from HDFS


Assume we have a file in HDFS called outfile. Given below is a simple demonstration
of retrieving the required file from the Hadoop file system.
Step 1
Initially, view the data from HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile

Step 2
Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/ 

Shutting Down the HDFS


You can shut down the HDFS by using the following command.
$ stop-dfs.sh

Q4. Building on the simple WordCount example done in class and in the Hadoop
tutorial, your task is to perform simple processing on the provided COVID-19
dataset.

The task is to count the total number of reported cases for every country/location
till April 8th, 2020 (NOTE: the data does contain case rows for Dec 2019; you
will have to filter that data).
COVID-19 analysis from the CSV file using MapReduce programming.
File: MapperClass.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MapperClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    // column indices in the CSV file and the field separator
    private final static int LOCATION = 1;
    private final static int NEW_CASES = 2;
    private final static String CSV_SEPARATOR = ",";

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

        // convert the input line to a string
        String valueString = value.toString();

        // split the line on CSV_SEPARATOR
        String[] columnData = valueString.split(CSV_SEPARATOR);

        // emit (location, new_cases) for this row
        output.collect(new Text(columnData[LOCATION]), new IntWritable(Integer.parseInt(columnData[NEW_CASES])));
    }
}
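
A note on the date filter: the Mapper above emits every row it reads, while the task asks for counts only till April 8th, 2020 and flags the Dec 2019 rows. Below is a minimal sketch of how map() could restrict the rows; it assumes the date sits in the first CSV column (index 0) in yyyy-MM-dd form, which you should verify against the provided file before using it.

    // Sketch only: the DATE index and date format are assumptions, not taken from the assignment code.
    private final static int DATE = 0;
    private final static String START = "2020-01-01";   // drops the Dec 2019 rows mentioned in the NOTE
    private final static String CUTOFF = "2020-04-08";  // last date to include

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String[] columnData = value.toString().split(CSV_SEPARATOR);
        String date = columnData[DATE];

        // yyyy-MM-dd dates compare correctly as plain strings, so no date parsing is needed
        if (date.compareTo(START) < 0 || date.compareTo(CUTOFF) > 0) {
            return;   // row is outside the requested window; ignore it
        }
        output.collect(new Text(columnData[LOCATION]), new IntWritable(Integer.parseInt(columnData[NEW_CASES])));
    }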
 
File: ReducerClass.java
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ReducerClass extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // the output key and a running total for that key
        Text key = t_key;
        int counter = 0;

        // iterate over all values that share the same key
        // and add up the reported new cases
        while (values.hasNext()) {
            IntWritable value = values.next();
            counter += value.get();
        }
        output.collect(key, new IntWritable(counter));
    }
}
 
File: MainClass.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class MainClass {

    public static void main(String[] args) {

        // create a new JobClient
        JobClient my_client = new JobClient();

        // create a configuration object for the job
        JobConf job_conf = new JobConf(MainClass.class);

        // set the name of the job
        job_conf.setJobName("MapReduceCSV");

        // specify the data types of the output key and value
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);

        // specify the Mapper and Reducer classes
        job_conf.setMapperClass(MapperClass.class);
        job_conf.setReducerClass(ReducerClass.class);

        // specify the input and output formats
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);

        // set the input and output directories from the command-line arguments:
        // args[0] = name of the input directory on HDFS, and
        // args[1] = name of the output directory to be created to store the output file
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

        my_client.setConf(job_conf);
        try {
            // run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
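
Because the reduce step is a plain summation, the same ReducerClass could also serve as a combiner to shrink the data shuffled between the map and reduce phases. This is an optional tweak, not part of the assignment code above; with the old mapred API it would be a single extra line in MainClass:

        // optional: pre-aggregate each mapper's output locally before the shuffle
        job_conf.setCombinerClass(ReducerClass.class);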
 
How to execute:
Step 1: Create a directory named classes/
$ mkdir classes

Step 2: Compile the source files using the following command.


$ javac -cp hadoop-common-2.2.0.jar:hadoop-mapreduce-client-core-2.7.1.jar:classes:. -d classes/ *.java

Step 3: Create a JAR file for the classes created above, which are stored in the classes/
folder.
$ jar -cvf CountMe.jar -C classes/ .
# do not forget to put a <space> dot at the end of the above command.

Output: 
added manifest
adding: MainClass.class(in = 1609) (out= 806)(deflated 49%)
adding: MapperClass.class(in = 1911) (out= 754)(deflated 60%)
adding: ReducerClass.class(in = 1561) (out= 628)(deflated 59%)

Step 4: Upload the CSV file to the Hadoop distributed file system.

$ hadoop fs -put covid.csv .
# do not forget to remove the header line from the provided covid file before uploading it to HDFS
# do not forget to put the <space> dot in the above command so the file goes to the home folder.

Step 5: Run the JAR file using the hadoop jar command.


$ hadoop jar CountMe.jar MainClass covid.csv output/
Step 6: Check whether the output folder has been populated and print the output on the
terminal.

$ hadoop fs -ls output/


$ hadoop fs -cat output/part-00000

Afghanistan     367
Albania 383
Algeria 1468
Andorra 545
Angola  17
Anguilla        3
Antigua and Barbuda     15
Argentina       1715
Armenia 853
Aruba   74
Australia       5956
Austria 12640
Azerbaijan      717
Bahamas 36
Bahrain 811
Bangladesh      164
Barbados        63
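
To spot-check a single country instead of printing the whole file, the output can also be piped through grep, for example:

$ hadoop fs -cat output/part-00000 | grep Australia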
