
CSE, VII Semester

Subject: CS-704 Big Data

LIST OF EXPERIMENTS

S.No. Title Signature

1. Installation of Hadoop
2. File Management tasks in Hadoop
3. Word Count Map Reduce program to understand Map Reduce Paradigm.
4. Weather Report POC: Map Reduce program to analyse time-temperature statistics and generate report with max/min temperature.
5. Implementing Matrix Multiplication with Hadoop Map Reduce
6. Pig Latin scripts to sort, group, join, project, and filter your data.
7. Move file from source to destination.
8. Remove a file or directory in HDFS.
9. Display last few lines of a file.
10. Display the aggregate length of a file.
EXPERIMENT: 1
AIM: Installation of Hadoop

Hadoop is a well-known big data processing system for storing and analysing enormous volumes of data. It’s an open-
source project that you can use for free. If you’re new to Hadoop, you may find the installation process difficult.
In this tutorial, I’ll walk you through the steps of installing Hadoop on Windows.
Step 1: Download and install Java
Hadoop is built on Java, so you must have Java installed on your PC. You can get the most recent version of Java from
the official website. After downloading, follow the installation wizard to install Java on your system.
JDK: https://www.oracle.com/java/technologies/javase-downloads.html
Step 2: Download Hadoop
Hadoop can be downloaded from the Apache Hadoop website. Make sure to have the latest stable release of Hadoop. Once
downloaded, extract the contents to a convenient location.
Hadoop: https://hadoop.apache.org/releases.html
Step 3: Set Environment Variables
You must configure environment variables after downloading and unpacking Hadoop. Open the Start menu, type “Edit the
system environment variables,” and select the result. This will launch the System Properties dialogue box; click the
“Environment Variables” button to open it.
Click “New” under System Variables to add a new variable. Enter the variable name “HADOOP_HOME” and the path to
the Hadoop folder as the variable value, then press “OK.”
Then, under System Variables, locate the “Path” variable and click “Edit.” Click “New” in the Edit Environment Variable
window and enter “%HADOOP_HOME%\bin” as the new entry. Use the “OK” button to close all the windows.
Step 4: Setup Hadoop
You must configure Hadoop in this phase by modifying several configuration files. Navigate to the “etc/hadoop” folder in
the Hadoop folder. You must make changes to three files:
core-site.xml
hdfs-site.xml
mapred-site.xml
Open each file in a text editor and edit the following properties:
In core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
In hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop-3.3.1/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop-3.3.1/data/datanode</value>
</property>
</configuration>
In mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
Save the changes in each file.
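Note: the three files above mirror the original instructions. On Hadoop 2.x/3.x installations it is also common to tell MapReduce to run on YARN and to enable the shuffle service; a minimal sketch of that extra configuration (not part of the original steps) is:

```xml
<!-- mapred-site.xml: run MapReduce jobs on YARN (common on Hadoop 2.x/3.x) -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml: enable the MapReduce shuffle auxiliary service -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```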
Step 5: Format Hadoop NameNode
You must format the NameNode before you can start Hadoop. Navigate to the Hadoop bin folder using a command prompt.
Execute this command:
hadoop namenode -format
Step 6: Start Hadoop
To start Hadoop, open a command prompt and navigate to the Hadoop bin folder. Run the following command:
start-all.cmd
This command will start all the required Hadoop services. On current Hadoop releases these are the NameNode, DataNode,
ResourceManager and NodeManager (older releases used a JobTracker instead). Wait a few minutes until all the services
have started.
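Before opening the web interface, you can also verify from the command prompt that the daemons are running. Assuming the JDK's `jps` tool is on your PATH, it lists the running Java processes:

```bash
jps
```

Entries such as NameNode and DataNode should appear in the output; if one is missing, check the corresponding log file in the Hadoop logs folder.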
Step 7: Verify Hadoop Installation
To ensure that Hadoop is properly installed, open a web browser and go to http://localhost:50070/ (on Hadoop 3.x the
NameNode web UI is served on port 9870, i.e. http://localhost:9870/). This opens the web interface for the Hadoop
NameNode, and you should see a page with Hadoop cluster information.
Wrapping Up
By following the instructions provided in this article, you should be able to get Hadoop up and operating on your
machine. Remember to get the most recent stable version of Hadoop, install Java, configure Hadoop, format the
NameNode, and start Hadoop services. Finally, check the NameNode web interface to ensure that Hadoop is properly
installed.
EXPERIMENT: 2

AIM: File Management tasks in Hadoop.


File management tasks in Hadoop primarily revolve around managing files within the Hadoop Distributed File
System (HDFS). Here are some essential file management tasks:
1. **Adding Files to HDFS**
- Use the `hadoop fs -put` or `hdfs dfs -put` command to upload files from the local filesystem to HDFS.
- Example:
```bash
hadoop fs -put /local/path/to/file /hdfs/path
```

2. **Viewing Files in HDFS**
- Use the `hadoop fs -ls` or `hdfs dfs -ls` command to list files and directories in HDFS.
- Example:
```bash
hadoop fs -ls /hdfs/path/
```

3. **Reading Files from HDFS**
- Use the `hadoop fs -cat` or `hdfs dfs -cat` command to read the content of a file stored in HDFS.
- Example:
```bash
hadoop fs -cat /hdfs/path/to/file
```

4. **Copying Files in HDFS**
- You can copy files within HDFS using the `hadoop fs -cp` or `hdfs dfs -cp` command.
- Example:
```bash
hadoop fs -cp /hdfs/path/source /hdfs/path/destination
```

5. **Moving Files in HDFS**
- The `hadoop fs -mv` or `hdfs dfs -mv` command moves files from one directory to another within HDFS.
- Example:
```bash
hadoop fs -mv /hdfs/path/source /hdfs/path/destination
```

6. **Removing Files from HDFS**
- Use `hadoop fs -rm` or `hdfs dfs -rm` to delete files from HDFS. Use `-r` for recursive deletion.
- Example:
```bash
hadoop fs -rm /hdfs/path/to/file
hadoop fs -rm -r /hdfs/path/to/directory
```

7. **Checking File Status in HDFS**
- Use the `hadoop fs -stat` command to check a file's details (like file size, modification date, etc.) in HDFS.
- Example:
```bash
hadoop fs -stat %s /hdfs/path/to/file
```

8. **Changing Permissions on HDFS Files**
- Modify file permissions using `hadoop fs -chmod`. You can set read, write, and execute permissions.
- Example:
```bash
hadoop fs -chmod 755 /hdfs/path/to/file
```

9. **Changing Ownership of Files**
- Change file ownership using `hadoop fs -chown`.
- Example:
```bash
hadoop fs -chown user:group /hdfs/path/to/file
```

10. **Setting Quotas**
- HDFS allows setting quotas to control storage usage.
- Use `hdfs dfsadmin -setQuota` to set quotas on directories.
- Example:
```bash
hdfs dfsadmin -setQuota 100 /hdfs/path/to/directory
```

11. **Viewing Disk Usage**
- Check the amount of storage used by a directory with `hadoop fs -du` or `hdfs dfs -du`.
- Example:
```bash
hadoop fs -du -h /hdfs/path/
```

12. **Replicating Files in HDFS**
- Change replication factors for redundancy with `hadoop fs -setrep`.
- Example:
```bash
hadoop fs -setrep 3 /hdfs/path/to/file
```

13. **Archiving Files in HDFS**
- You can use Hadoop archives (HAR files) to pack many small files together for more efficient storage.
- Example:
```bash
hadoop archive -archiveName myArchive.har -p /hdfs/source/path/ /hdfs/destination/path/
```
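For reference, a `.har` archive created this way can be browsed through the `har://` URI scheme (a usage sketch; the archive name and destination path are the ones assumed in the example above):

```bash
hadoop fs -ls har:///hdfs/destination/path/myArchive.har
```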
EXPERIMENT: 3

AIM: Word Count Map Reduce program to understand Map Reduce Paradigm.
In Hadoop, MapReduce is a computation framework that decomposes large data-manipulation jobs into individual tasks that can
be executed in parallel across a cluster of servers. The results of the tasks can then be joined together to compute the
final result.
MapReduce consists of 2 steps:
Map Function – It takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (Key-Value pair).
Example – (Map function in Word Count)
Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN

Output (converted into another set of (Key, Value) data):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of
tuples.
Example – (Reduce function in Word Count)
Input (set of tuples, the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)
Work Flow of the Program

Workflow of MapReduce consists of 5 steps:


Splitting – the splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line (‘\n’).
Mapping – as explained above.
Intermediate splitting – the entire process runs in parallel on different nodes of the cluster. To group records in the
“Reduce Phase”, all data with the same KEY must end up on the same node.
Reduce – essentially a group-by phase that aggregates the values for each key.
Combining – the last phase, where all the data (the individual result set from each node) is combined together to
form the final result.
Now Let’s See the Word Count Program in Java
Fortunately, we don’t have to write all of the above steps, we only need to write the splitting parameter, Map
function logic, and Reduce function logic. The rest of the remaining steps will execute automatically.
Make sure that Hadoop is installed on your system with the Java SDK.
Steps
Open Eclipse> File > New > Java Project >( Name it – MRProgramsDemo) > Finish.
Right Click > New > Package ( Name it - PackageDemo) > Finish.
Right Click on Package > New > Class (Name it - WordCount).
Add Following Reference Libraries:
Right Click on Project > Build Path > Add External Archives, and add:
/usr/lib/hadoop-0.20/hadoop-core.jar
/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
5. Type the following code:

```java
package PackageDemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each line on commas and emits (WORD, 1)
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
```
The above program consists of three classes:
The Driver class (the public static void main method; this is the entry point).
The Map class, which extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> and implements the map function.
The Reduce class, which extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> and implements the reduce function.
6. Make a jar file
Right Click on Project> Export> Select export destination as Jar File > next> Finish.
7. Take a text file and move it into HDFS:

To move this into Hadoop directly, open the terminal and enter the following command:

```bash
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
```

8. Run the jar file:
(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)

```bash
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1
```
9. Open the result:

```bash
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r-- 1 training supergroup 0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup 0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup 20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000

[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6
```
EXPERIMENT: 4

AIM: Weather Report POC: Map Reduce program to analyse time-temperature statistics and generate a report with max/min temperature.
Problem Statement:
1. The system receives temperatures of various cities (Austin, Boston, etc.) of the USA, captured at regular
intervals of time on each day, in an input file.
2. The system will process the input data file and generate a report with the Maximum and Minimum
temperatures of each day along with the time.
3. It generates a separate output report for each city.
Ex: Austin-r-00000
Boston-r-00000
Newjersy-r-00000
Baltimore-r-00000
California-r-00000
Newyork-r-00000

Expected output:- In each output file record should be like this:


25-Jan-2014 Time: 12:34:542 MinTemp: -22.3 Time: 05:12:345 MaxTemp: 35.7

First download the input file, which contains temperature statistics with time for multiple cities. Schema of the
record set: CA_25-Jan-2014 00:12:345 15.7 01:19:345 23.1 02:34:542 12.3 ......
CA is the city code (here it stands for California), followed by the date. After that, each pair of values represents
time and temperature.

Mapper class and map method:-


The very first thing required for any MapReduce problem is to decide the types of keyIn, valueIn, keyOut and
valueOut for the given Mapper class, followed by the types of the map method parameters.

 public class WhetherForcastMapper extends Mapper <Object, Text, Text, Text>


Object (keyIn) - Offset for each line, line number 1, 2...
Text (ValueIn) - Whole string for each line (CA_25-Jan-2014 00:12:345 ...... )
Text (KeyOut) - City information with date information as string
Text (ValueOut) - Temperature and time information which need to be passed to reducer as string.
 public void map(Object keyOffset, Text dayReport, Context con) { }
keyOffset is like the line number for each line in the input file.
dayReport is the input to the map method: the whole string present in one line of the input file.
con is the context where we write the mapper output; it is consumed by the reducer.
Reducer class and reduce method:-
Similarly, we have to decide the types of keyIn, valueIn, keyOut and valueOut for the given Reducer class,
followed by the types of the reduce method parameters.

 public class WhetherForcastReducer extends Reducer<Text, Text, Text, Text>


Text (keyIn) - same as keyOut of the Mapper.
Text (valueIn) - same as valueOut of the Mapper.
Text (keyOut) - the date as a string.
Text (valueOut) - the reducer writes the max and min temperature with time as a string.
 public void reduce(Text key, Iterable<Text> values, Context context)
Text key is the key of the mapper output, i.e. the city and date information.
Iterable<Text> values stores the multiple temperature values for a given city and date.
The context object is where the reducer writes its processed outcome, which is finally written to a file.
MultipleOutputs:- In general, a reducer generates one output file (i.e. part_r_00000); however, in this use case we
want to generate multiple output files. To deal with such a scenario we use MultipleOutputs from
"org.apache.hadoop.mapreduce.lib.output.MultipleOutputs", which provides a way to write multiple files
depending on the reducer outcome. See the reducer class below for more details. For each reducer task a
MultipleOutputs object is created, and the key/result is written to the appropriate file.

Let's create a Map/Reduce project in Eclipse and create a class file named
CalculateMaxAndMinTemeratureWithTime. For simplicity, here we have written the mapper and reducer
classes as inner static classes. Copy the following code lines and paste them into the newly created class file.

/**
* Question:- To find Max and Min temperature from record set stored in
* text file. Schema of record set :- tab separated (\t) CA_25-Jan-2014
* 00:12:345 15.7 01:19:345 23.1 02:34:542 12.3 03:12:187 16 04:00:093
* -14 05:12:345 35.7 06:19:345 23.1 07:34:542 12.3 08:12:187 16
* 09:00:093 -7 10:12:345 15.7 11:19:345 23.1 12:34:542 -22.3 13:12:187
* 16 14:00:093 -7 15:12:345 15.7 16:19:345 23.1 19:34:542 12.3
* 20:12:187 16 22:00:093 -7
* Expected output:- Creates files for each city and store maximum & minimum
* temperature for each day along with time.
*/

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
* @author devinline
*/
public class CalculateMaxAndMinTemeratureWithTime {
public static String calOutputName = "California";
public static String nyOutputName = "Newyork";
public static String njOutputName = "Newjersy";
public static String ausOutputName = "Austin";
public static String bosOutputName = "Boston";
public static String balOutputName = "Baltimore";

public static class WhetherForcastMapper extends
Mapper<Object, Text, Text, Text> {

public void map(Object keyOffset, Text dayReport, Context con)
throws IOException, InterruptedException {
StringTokenizer strTokens = new StringTokenizer(
dayReport.toString(), "\t");
int counter = 0;
Float currnetTemp = null;
Float minTemp = Float.MAX_VALUE;
Float maxTemp = Float.MIN_VALUE;
String date = null;
String currentTime = null;
String minTempANDTime = null;
String maxTempANDTime = null;

while (strTokens.hasMoreElements()) {
if (counter == 0) {
date = strTokens.nextToken();
} else {
if (counter % 2 == 1) {
currentTime = strTokens.nextToken();
} else {
currnetTemp = Float.parseFloat(strTokens.nextToken());
if (minTemp > currnetTemp) {
minTemp = currnetTemp;
minTempANDTime = minTemp + "AND" + currentTime;
}
if (maxTemp < currnetTemp) {
maxTemp = currnetTemp;
maxTempANDTime = maxTemp + "AND" + currentTime;
}
}
}
counter++;
}
// Write to context - MinTemp, MaxTemp and corresponding time
Text temp = new Text();
temp.set(maxTempANDTime);
Text dateText = new Text();
dateText.set(date);
try {
con.write(dateText, temp);
} catch (Exception e) {
e.printStackTrace();
}

temp.set(minTempANDTime);
dateText.set(date);
con.write(dateText, temp);

}
}

public static class WhetherForcastReducer extends
Reducer<Text, Text, Text, Text> {
MultipleOutputs<Text, Text> mos;

public void setup(Context context) {
mos = new MultipleOutputs<Text, Text>(context);
}

public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int counter = 0;
String reducerInputStr[] = null;
String f1Time = "";
String f2Time = "";
String f1 = "", f2 = "";
Text result = new Text();
for (Text value : values) {

if (counter == 0) {
reducerInputStr = value.toString().split("AND");
f1 = reducerInputStr[0];
f1Time = reducerInputStr[1];
}

else {
reducerInputStr = value.toString().split("AND");
f2 = reducerInputStr[0];
f2Time = reducerInputStr[1];
}

counter = counter + 1;
}
if (Float.parseFloat(f1) > Float.parseFloat(f2)) {
result = new Text("Time: " + f2Time + " MinTemp: " + f2 + "\t"
+ "Time: " + f1Time + " MaxTemp: " + f1);
} else {
result = new Text("Time: " + f1Time + " MinTemp: " + f1 + "\t"
+ "Time: " + f2Time + " MaxTemp: " + f2);
}
String fileName = "";
if (key.toString().substring(0, 2).equals("CA")) {
fileName = CalculateMaxAndMinTemeratureWithTime.calOutputName;
} else if (key.toString().substring(0, 2).equals("NY")) {
fileName = CalculateMaxAndMinTemeratureWithTime.nyOutputName;
} else if (key.toString().substring(0, 2).equals("NJ")) {
fileName = CalculateMaxAndMinTemeratureWithTime.njOutputName;
} else if (key.toString().substring(0, 3).equals("AUS")) {
fileName = CalculateMaxAndMinTemeratureWithTime.ausOutputName;
} else if (key.toString().substring(0, 3).equals("BOS")) {
fileName = CalculateMaxAndMinTemeratureWithTime.bosOutputName;
} else if (key.toString().substring(0, 3).equals("BAL")) {
fileName = CalculateMaxAndMinTemeratureWithTime.balOutputName;
}
String strArr[] = key.toString().split("_");
key.set(strArr[1]); //Key is date value
mos.write(fileName, key, result);
}

@Override
public void cleanup(Context context) throws IOException,
InterruptedException {
mos.close();
}
}

public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Wheather Statistics of USA");
job.setJarByClass(CalculateMaxAndMinTemeratureWithTime.class);

job.setMapperClass(WhetherForcastMapper.class);
job.setReducerClass(WhetherForcastReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

MultipleOutputs.addNamedOutput(job, calOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, nyOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, njOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, bosOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, ausOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, balOutputName,
TextOutputFormat.class, Text.class, Text.class);

// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
Path pathInput = new Path(
"hdfs://192.168.213.133:54310/weatherInputData/input_temp.txt");
Path pathOutputDir = new Path(
"hdfs://192.168.213.133:54310/user/hduser1/testfs/output_mapred3");
FileInputFormat.addInputPath(job, pathInput);
FileOutputFormat.setOutputPath(job, pathOutputDir);

try {
System.exit(job.waitForCompletion(true) ? 0 : 1);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}

}
Explanation:-
In the map method, we parse each input line and maintain a counter for extracting the date and each
temperature & time value. For a given input line, we first extract the date (counter == 0) and then alternately
extract the time (counter % 2 == 1), since time is at the odd positions (1, 3, 5, ...), and the temperature otherwise.
We compare for max & min temperature and store them accordingly. Once the while loop terminates for a given
input line, we write maxTempANDTime and minTempANDTime along with the date.
In the reduce method, the setup method is executed for each reducer task and creates the MultipleOutputs
object. For a given key, we have two entries (maxTempANDTime and minTempANDTime). We iterate over the
values list, split each value and get the temperature & time. We compare the temperature values and build the
actual value string, which the reducer writes to the appropriate file.
In the main method, an instance of Job is created with a Configuration object. The job is configured with the
mapper and reducer classes along with the input and output formats. MultipleOutputs information is added to the
Job to indicate the file names to be used for the output. For this sample program, we use an input
file ("/weatherInputData/input_temp.txt") placed on HDFS, and the output directory
(/user/hduser1/testfs/output_mapred3) will also be created on HDFS. Use the commands below to copy the
downloaded input file from the local file system to HDFS and to give write permission to the client executing
this program unit, so that the output directory can be created.
Copy an input file from the local file system to HDFS:

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop fs -put /home/zytham/input_temp.txt /weatherInputData/

Give write permission to all users for creating the output directory:

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop fs -chmod -R 777 /user/hduser1/testfs/

Before executing the above program unit, make sure the Hadoop services are running (to start all services execute
./start-all.sh from <hadoop_home>/sbin).
Now execute the above sample program: Run -> Run as Hadoop. Wait for a moment and check whether the output
directory is in place on HDFS. Execute the following command to verify the same.

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop fs -ls /user/hduser1/testfs/output_mapred3
Found 8 items
-rw-r--r-- 3 zytham supergroup 438 2015-12-11 19:21 /user/hduser1/testfs/output_mapred3/Austin-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21 /user/hduser1/testfs/output_mapred3/Baltimore-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21 /user/hduser1/testfs/output_mapred3/Boston-r-00000
-rw-r--r-- 3 zytham supergroup 511 2015-12-11 19:21 /user/hduser1/testfs/output_mapred3/California-r-00000
-rw-r--r-- 3 zytham supergroup 146 2015-12-11 19:21 /user/hduser1/testfs/output_mapred3/Newjersy-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21 /user/hduser1/testfs/output_mapred3/Newyork-r-00000
-rw-r--r-- 3 zytham supergroup 0 2015-12-11 19:21 /user/hduser1/testfs/output_mapred3/_SUCCESS
-rw-r--r-- 3 zytham supergroup 0 2015-12-11 19:21 /user/hduser1/testfs/output_mapred3/part-r-00000
Open one of the files and verify the expected output schema; execute the following command for the same.

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop fs -cat /user/hduser1/testfs/output_mapred3/Austin-r-00000
25-Jan-2014 Time: 12:34:542 MinTemp: -22.3 Time: 05:12:345 MaxTemp: 35.7
26-Jan-2014 Time: 22:00:093 MinTemp: -27.0 Time: 05:12:345 MaxTemp: 55.7
27-Jan-2014 Time: 02:34:542 MinTemp: -22.3 Time: 05:12:345 MaxTemp: 55.7
29-Jan-2014 Time: 14:00:093 MinTemp: -17.0 Time: 02:34:542 MaxTemp: 62.9
30-Jan-2014 Time: 22:00:093 MinTemp: -27.0 Time: 05:12:345 MaxTemp: 49.2
31-Jan-2014 Time: 14:00:093 MinTemp: -17.0 Time: 03:12:187 MaxTemp: 56.0

Note:-

 In order to reference the input file from the local file system instead of HDFS, uncomment the lines below in the
main method and comment out the addInputPath and setOutputPath lines added above. Here Path(args[0])
and Path(args[1]) read the input and output location paths from the program arguments; alternatively, create the
Path objects with string inputs for the input file and output location.
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));

Execute WeatherReportPOC.jar on a single node cluster

We can create a jar file out of this project and run it on a single node cluster too. Download the WeatherReportPOC
jar and place it at some convenient location. Start the Hadoop services (./start-all.sh from <hadoop_home>/sbin).
I have placed the jar at "/home/zytham/Downloads/WeatherReportPOC.jar".
Execute the following command to submit the job, with the input file HDFS location
"/wheatherInputData/input_temp.txt" and the output directory location
"/user/hduser1/testfs/output_mapred7":

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop jar /home/zytham/Downloads/WeatherReportPOC.jar CalculateMaxAndMinTemeratureWithTime /wheatherInputData/input_temp.txt /user/hduser1/testfs/output_mapred7
15/12/11 22:16:12 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/12/11 22:16:12 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/12/11 22:16:14 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed.
Implement the Tool interface and execute your application with ToolRunner to remedy this.
...........
15/12/11 22:16:26 INFO output.FileOutputCommitter: Saved output of task
'attempt_local1563851561_0001_r_000000_0' to
hdfs://hostname:54310/user/hduser1/testfs/output_mapred7/_temporary/0/task_local1563851561_0001_r_000000
15/12/11 22:16:26 INFO mapred.LocalJobRunner: reduce > reduce
15/12/11 22:16:26 INFO mapred.Task: Task 'attempt_local1563851561_0001_r_000000_0' done.
15/12/11 22:16:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local1563851561_0001_r_000000_0
15/12/11 22:16:26 INFO mapred.LocalJobRunner: reduce task executor complete.
15/12/11 22:16:26 INFO mapreduce.Job: map 100% reduce 100%
15/12/11 22:16:27 INFO mapreduce.Job: Job job_local1563851561_0001 completed successfully
15/12/11 22:16:27 INFO mapreduce.Job: Counters: 38
......
EXPERIMENT: 5

AIM: Implementing Matrix Multiplication with Hadoop Map Reduce.

MapReduce is a technique in which a huge program is subdivided into small tasks that are run in parallel to
make computation faster and save time; it is mostly used in distributed systems. It has 2 important parts:
 Mapper: It takes raw data input and organizes it into (key, value) pairs. For example, in a dictionary you
search for the word “Data” and its associated meaning is “facts and statistics collected together for
reference or analysis”. Here the key is “Data” and the value associated with it is “facts and statistics
collected together for reference or analysis”.
 Reducer: It is responsible for processing the data in parallel and producing the final output.
Let us consider the matrix multiplication example to visualize MapReduce. Consider the following
matrix:

2×2 matrices A and B

Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2.
Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of a
matrix is labelled Aij or Bjk; e.g. element 3 in matrix A is called A21, i.e. 2nd row, 1st column. One-step matrix
multiplication has 1 mapper and 1 reducer. The formulas are:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Therefore computing the mapper for Matrix A:
# k, i, j computes the number of times it occurs.
# Here all are 2, therefore when k=1, i can have
# 2 values 1 & 2, each case can have 2 further
# values of j=1 and j=2. Substituting all values
# in formula
k=1 i=1 j=1 ((1, 1), (A, 1, 1))
j=2 ((1, 1), (A, 2, 2))
i=2 j=1 ((2, 1), (A, 1, 3))
j=2 ((2, 1), (A, 2, 4))
k=2 i=1 j=1 ((1, 2), (A, 1, 1))
j=2 ((1, 2), (A, 2, 2))
i=2 j=1 ((2, 2), (A, 1, 3))

j=2 ((2, 2), (A, 2, 4))

Computing the mapper for Matrix B


i=1 j=1 k=1 ((1, 1), (B, 1, 5))
k=2 ((1, 2), (B, 1, 6))
j=2 k=1 ((1, 1), (B, 2, 7))
k=2 ((1, 2), (B, 2, 8))

i=2 j=1 k=1 ((2, 1), (B, 1, 5))


k=2 ((2, 2), (B, 1, 6))
j=2 k=1 ((2, 1), (B, 2, 7))

k=2 ((2, 2), (B, 2, 8))

The formula for the Reducer is:

Reducer(k, v) = (i, k) => make a sorted Alist and Blist
(i, k) => Summation (Aij * Bjk) over j
Output => ((i, k), sum)
Therefore computing the reducer:
# We can observe from Mapper computation
# that 4 pairs are common (1, 1), (1, 2),
# (2, 1) and (2, 2)
# Make a list separate for Matrix A &
# B with adjoining values taken from
# Mapper step above:
(1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}
Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(1*5) + (2*7)] =19 ------- (i)
(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}
Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(1*6) + (2*8)] =22 ------- (ii)
(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}
Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] =43 ------- (iii)
(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}
Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(3*6) + (4*8)] =50 ------- (iv)
From (i), (ii), (iii) and (iv) we conclude that
((1, 1), 19)
((1, 2), 22)
((2, 1), 43)

((2, 2), 50)

Therefore the final matrix is:

19 22
43 50

(Final output of the matrix multiplication.)
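The lab sheet stops at the worked example; a minimal Hadoop MapReduce sketch of the same one-step algorithm is given below. It assumes (these choices are not from the original text) that both matrices are 2×2 and that the input file contains one element per line in the form matrixName,row,col,value, e.g. A,0,1,2.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {
    static final int N = 2; // assumed dimension of the square matrices

    public static class MatMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String[] t = value.toString().split(",");   // matrixName,row,col,value
            String matrix = t[0];
            int row = Integer.parseInt(t[1]);
            int col = Integer.parseInt(t[2]);
            String val = t[3];
            if (matrix.equals("A")) {
                // A(i,j) is needed for every output column k -> key (i,k), value (A, j, Aij)
                for (int k = 0; k < N; k++)
                    con.write(new Text(row + "," + k), new Text("A," + col + "," + val));
            } else {
                // B(j,k) is needed for every output row i -> key (i,k), value (B, j, Bjk)
                for (int i = 0; i < N; i++)
                    con.write(new Text(i + "," + col), new Text("B," + row + "," + val));
            }
        }
    }

    public static class MatReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context con)
                throws IOException, InterruptedException {
            // Build Alist and Blist indexed by j, then sum Aij * Bjk over j
            Map<Integer, Float> a = new HashMap<>();
            Map<Integer, Float> b = new HashMap<>();
            for (Text v : values) {
                String[] t = v.toString().split(",");   // matrixName,j,value
                int j = Integer.parseInt(t[1]);
                float f = Float.parseFloat(t[2]);
                if (t[0].equals("A")) a.put(j, f); else b.put(j, f);
            }
            float sum = 0;
            for (int j = 0; j < N; j++)
                sum += a.getOrDefault(j, 0f) * b.getOrDefault(j, 0f);
            con.write(key, new Text(String.valueOf(sum)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "matrix-multiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setMapperClass(MatMapper.class);
        job.setReducerClass(MatReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

For larger matrices, the dimension N would normally be passed through the Configuration instead of being hard-coded.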


EXPERIMENT: 6

AIM: Pig Latin scripts to sort, group, join, project, and filter your data.
Sample Data: To make these scripts concrete, let’s assume two example datasets:

1. Employees (stored in a file named employees.txt):

```
1,John,Engineering,50000
2,Sara,Marketing,60000
3,Mike,Engineering,70000
4,Jane,Marketing,55000
5,Bob,HR,45000
```

2. Departments (stored in a file named departments.txt):

```
Engineering,101
Marketing,102
HR,103
```

 employees.txt schema: (emp_id: int, name: chararray, department: chararray, salary: int)
 departments.txt schema: (department: chararray, dept_id: int)

Loading Data in Pig

First, load these datasets into Pig variables for use in transformations.

```pig
-- Load the employees dataset
employees = LOAD 'employees.txt'
    USING PigStorage(',')
    AS (emp_id: int, name: chararray, department: chararray, salary: int);

-- Load the departments dataset
departments = LOAD 'departments.txt'
    USING PigStorage(',')
    AS (department: chararray, dept_id: int);
```

1. Sorting Data

Sort the employees data by salary in descending order.

```pig
sorted_employees = ORDER employees BY salary DESC;
DUMP sorted_employees;
```

2. Grouping Data

Group the employees data by department to calculate the total salary per department.

```pig
grouped_employees = GROUP employees BY department;
total_salary_per_department = FOREACH grouped_employees GENERATE
    group AS department,
    SUM(employees.salary) AS total_salary;
DUMP total_salary_per_department;
```

3. Joining Data

Join the employees dataset with the departments dataset based on the department field to associate each employee with their dept_id.

```pig
joined_data = JOIN employees BY department, departments BY department;
projected_data = FOREACH joined_data GENERATE
    employees::emp_id AS emp_id,
    employees::name AS name,
    employees::department AS department,
    departments::dept_id AS dept_id,
    employees::salary AS salary;
DUMP projected_data;
```

4. Projecting Specific Fields

Project only the name and salary fields from the employees dataset.

```pig
projected_employees = FOREACH employees GENERATE name, salary;
DUMP projected_employees;
```

5. Filtering Data

Filter the employees dataset to show only employees with a salary greater than 50000.

```pig
high_salary_employees = FILTER employees BY salary > 50000;
DUMP high_salary_employees;
```

Summary of Operations

Each script performs one of the following:

 Sorting (ORDER): sorts data by one or more fields.
 Grouping (GROUP): groups data by a specified key.
 Joining (JOIN): joins two datasets based on a common field.
 Projecting (FOREACH ... GENERATE): selects specific fields.
 Filtering (FILTER): filters records based on a condition.
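These statements can be run interactively in the Grunt shell, or saved to a script file and executed with the `pig` command. A typical invocation (the script name here is assumed) is:

```bash
# Run against the local filesystem, useful for testing
pig -x local employee_analysis.pig

# Run in MapReduce mode against the Hadoop cluster, reading the files from HDFS
pig -x mapreduce employee_analysis.pig
```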
EXPERIMENT: 7

AIM: Move file from source to destination.


Hadoop FS Command

```bash
hadoop fs -mv /source/path/filename /destination/path/
```

Step-by-Step Process

1. Verify Source File Existence: Check if the source file exists at the specified path.
2. Check Permissions: Ensure that the user has necessary permissions to read from the source location and
write to the destination location.
3. Lock Source File: Lock the source file to prevent concurrent modifications.
4. Create Destination Directory: Create the destination directory if it does not exist.
5. Copy File Contents: Copy the contents of the source file to the destination file.
6. Update File Metadata: Update the file metadata (e.g., ownership, permissions, timestamps) at the
destination.
7. Delete Source File: Delete the source file.
8. Unlock Destination File: Unlock the destination file.

HDFS Move Operation

When moving a file in HDFS, the following occurs:

1. Rename: Rename the file in the namespace.


2. Update Block Locations: Update the block locations in the namespace.
3. Delete Source File: Delete the source file's metadata.

Java API

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Create a FileSystem object from the cluster configuration
FileSystem fs = FileSystem.get(new Configuration());

// Move (rename) the file from source to destination within HDFS
fs.rename(new Path("/source/path/filename"), new Path("/destination/path/filename"));

Python API (pyhdfs)

from pyhdfs import HdfsClient

# Create an HdfsClient object


hdfs = HdfsClient(hosts='localhost', port=9000)
# Move file from source to destination
hdfs.move('/source/path/filename', '/destination/path/')

Error Handling

Common errors that may occur during file move operations:

- FileNotFound: Source file does not exist.
- PermissionDenied: Lack of permissions to move file.
- IOException: I/O error during file move operation.

Best Practices

- Verify file existence and permissions before moving.
- Use try-catch blocks to handle exceptions.
- Ensure destination directory exists before moving.
- Use Hadoop's built-in file management tools.

Use Cases

- Moving data from staging area to production area.
- Moving processed data to archive area.
- Moving files between directories in HDFS.

Advantages

- Efficient file management.
- Improved data organization.
- Reduced data redundancy.

Disadvantages

- Potential data loss during move operation.
- Permission issues may arise.
- May impact system performance.
EXPERIMENT: 8

AIM: Remove a file or directory in HDFS.


To remove a file or directory in HDFS (Hadoop Distributed File System), you can use the `hdfs dfs -rm`
and `hdfs dfs -rm -r` commands. Here’s a detailed, step-by-step guide to removing files and directories from
HDFS, along with a diagram to illustrate the process.

Step-by-Step Guide to Removing Files and Directories in HDFS

Removing a File from HDFS

1. **Open Terminal**: Use the command line or terminal to enter HDFS commands.
2. **Run Remove Command**: Use the following command to delete a file:
```bash
hdfs dfs -rm /path/to/your/file
```
- **Example**: If you have a file named `example.txt` in `/user/hadoop/`, the command will be:
```bash
hdfs dfs -rm /user/hadoop/example.txt
```
3. **Confirm Deletion**: The command will delete the specified file. You’ll see a confirmation message
in the terminal if the file is successfully deleted.

Removing a Directory from HDFS (Recursive Delete)

1. **Open Terminal**: Use the command line or terminal.
2. **Run Remove Command with Recursive Flag**: To delete a directory and all its contents, use the `-r` (recursive) flag:
```bash
hdfs dfs -rm -r /path/to/your/directory
```
- **Example**: To delete the directory `/user/hadoop/data`, use:
```bash
hdfs dfs -rm -r /user/hadoop/data
```
3. **Confirm Deletion**: The terminal will show a message confirming that the directory and its contents
have been deleted.
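If the HDFS trash feature is enabled, a deleted path is first moved to the user's `.Trash` directory rather than being removed immediately; the `-skipTrash` option bypasses this behaviour (shown here with the same example directory):

```bash
hdfs dfs -rm -r -skipTrash /user/hadoop/data
```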
EXPERIMENT: 9

AIM: Display last few lines of a file.


Use the tail command to write the file specified by the File parameter to standard output beginning at a
specified
point.
See the following examples:
To display the last 10 lines of the notes file, type the following:
tail notes
To specify the number of lines to start reading from the end of the notes file, type the following:
tail -20 notes
To display the notes file one page at a time, beginning with the 200th byte, type the following:
tail -c +200 notes | pg
To follow the growth of the file named accounts, type the following:
tail -f accounts
This displays the last 10 lines of the accounts file. The tail command continues to display lines as they are
added to the accounts file. The display continues until you press the (Ctrl-C) key sequence to stop the
display.
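For a file stored in HDFS rather than the local filesystem, the equivalent command is `hdfs dfs -tail`, which prints the last kilobyte of the file (the path below is a hypothetical example):

```bash
hdfs dfs -tail /user/hadoop/accounts
# -f keeps printing new data as it is appended, similar to the local tail -f
hdfs dfs -tail -f /user/hadoop/accounts
```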
EXPERIMENT: 10

AIM: Display the aggregate length of a file.

To display the aggregate length (total size) of a file in HDFS, you can use the `hdfs dfs -du` or `hdfs dfs -stat`
command. Here are the methods to achieve this.

Method 1: Using `hdfs dfs -du`

This command displays the disk usage of files in HDFS.

1. **Open Terminal**: Access the terminal or command line.


2. **Run the `du` Command**: Use the following command to see the size of the file.
```bash
hdfs dfs -du -s /path/to/your/file
```
- **Explanation**:
  - `-du`: Displays disk usage for files or directories.
  - `-s`: Shows a summary (total size).
- **Example**: To get the size of `example.txt` in `/user/hadoop/`, use:
```bash
hdfs dfs -du -s /user/hadoop/example.txt
```
3. **Output**: The output will display the total size (in bytes) of the specified file.

Method 2: Using `hdfs dfs -stat`

This command gives detailed information about a file, including its size.

1. **Open Terminal**.
2. **Run the `stat` Command**:
```bash
hdfs dfs -stat %b /path/to/your/file
```
- **Explanation**:
- `%b`: Displays the file length in bytes.
- **Example**:
```bash
hdfs dfs -stat %b /user/hadoop/example.txt
```
3. **Output**: The total file size in bytes will be shown.

Example Output

If `example.txt` is 1048576 bytes (1 MB), the command output might look like:

```bash
1048576
```

This output represents the total size (aggregate length) of the file.
