BIG Data File
LIST OF EXPERIMENTS
1. Installation of Hadoop
2. File Management tasks in Hadoop
3. Word Count Map Reduce program to understand Map Reduce Paradigm.
EXPERIMENT: 1
AIM: Installation of Hadoop.
Hadoop is a well-known big data processing system for storing and analysing enormous volumes of data. It’s an open-
source project that you can use for free. If you’re new to Hadoop, you may find the installation process difficult.
In this tutorial, I’ll walk you through the steps of installing Hadoop on Windows.
Step 1: Download and install Java
Hadoop is built on Java, so you must have Java installed on your PC. You can get the most recent version of Java from
the official website. After downloading, follow the installation wizard to install Java on your system.
JDK: https://www.oracle.com/java/technologies/javase-downloads.html
Step 2: Download Hadoop
Hadoop can be downloaded from the Apache Hadoop website. Make sure to have the latest stable release of Hadoop. Once
downloaded, extract the contents to a convenient location.
Hadoop: https://hadoop.apache.org/releases.html
Step 3: Set Environment Variables
You must configure environment variables after downloading and unpacking Hadoop. Launch the Start menu, type “Edit the
system environment variables,” and select the result. This will launch the System Properties dialogue box. Click on
“Environment Variables” button to open.
Click “New” under System Variables to add a new variable. Enter “HADOOP_HOME” as the variable name and the path to
the Hadoop folder (for example, C:\hadoop-3.3.1, if that is where you extracted it) as the variable value. Then press “OK.”
Then, under System Variables, locate the “Path” variable and click “Edit.” Click “New” in the Edit Environment Variable
window and enter “%HADOOP_HOME%\bin” as the variable value. Use the “OK” button to close all the windows.
Step 4: Setup Hadoop
You must configure Hadoop in this phase by modifying several configuration files. Navigate to the “etc/hadoop” folder in
the Hadoop folder. You must make changes to three files:
core-site.xml
hdfs-site.xml
mapred-site.xml
Open each file in a text editor and edit the following properties:
In core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
In hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop-3.3.1/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop-3.3.1/data/datanode</value>
</property>
</configuration>
In mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
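Note: on Hadoop 2.x/3.x releases (such as the 3.3.1 build referenced above), the old mapred.job.tracker property is ignored. If you intend to run MapReduce jobs on YARN, a commonly used setting (an assumption about your setup, not part of the original steps) is:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>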
Save the changes in each file.
Step 5: Format Hadoop NameNode
You must format the NameNode before you can start Hadoop. Navigate to the Hadoop bin folder using a command prompt.
Execute this command:
hadoop namenode -format
Step 6: Start Hadoop
To start Hadoop, open a command prompt and navigate to the Hadoop bin folder. Run the following command:
start-all.cmd
This command will start all the required Hadoop services; on recent releases these are the HDFS daemons (NameNode,
DataNode) and the YARN daemons (ResourceManager, NodeManager). Wait a few minutes until all the services have started.
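To confirm the daemons are running, you can list the Java processes with the JDK’s jps tool (the exact set of daemon names depends on your Hadoop version):
jps
You should see entries such as NameNode, DataNode, ResourceManager and NodeManager, plus Jps itself.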
Step 7: Verify Hadoop Installation
To ensure that Hadoop is properly installed, open a web browser and go to http://localhost:9870/ (on Hadoop 2.x and
earlier the NameNode UI is at http://localhost:50070/). This will open the web interface for the Hadoop NameNode, and
you should see a page with Hadoop cluster information.
Wrapping Up
By following the instructions provided in this article, you should be able to get Hadoop up and operating on your
machine. Remember to get the most recent stable version of Hadoop, install Java, configure Hadoop, format the
NameNode, and start Hadoop services. Finally, check the NameNode web interface to ensure that Hadoop is properly
installed.
EXPERIMENT: 2
AIM: Word Count Map Reduce program to understand Map Reduce Paradigm.
In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into individual tasks that can
be executed in parallel across a cluster of servers. The results of tasks can be joined together to compute final
results.
MapReduce consists of 2 steps:
Map Function – It takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key-value pairs).
Example – (Map function in Word Count)
Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)
Reduce Function – It takes the output of the Map function as input and combines those data tuples into a smaller
set of tuples based on the key.
Work Flow of the Program
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Driver: configures the job and submits it to the cluster
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each line on commas and emits (WORD, 1) for every token
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the 1s emitted for each word and writes (WORD, count)
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
The above program consists of three classes:
Driver class (the public static void main method; this is the entry point).
The Map class which extends the public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and
implements the Map function.
The Reduce class which extends the public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and
implements the Reduce function.
6. Make a jar file
Right-click on the project > Export > select the export destination as JAR file > Next > Finish.
7. Take a text file and move it into HDFS:
To move this into Hadoop directly, open the terminal and enter the following command:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
8. Run the jar file:
(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1
9. Open the result:
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r-- 1 training supergroup 0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup 0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup 20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6
EXPERIMENT: 4
AIM: MapReduce program to find the maximum and minimum temperature (with time) for each day, per city, from a weather record set.
First, download the input file, which contains temperature statistics with time for multiple cities. Schema of the
record set: CA_25-Jan-2014 00:12:345 15.7 01:19:345 23.1 02:34:542 12.3 ......
CA is the city code (here it stands for California), followed by the date. After that, each pair of values represents a time
and a temperature.
Let's create a Map/Reduce project in Eclipse and create a class file named
CalculateMaxAndMinTemeratureWithTime. For simplicity, the mapper and reducer classes are written here as static
inner classes. Copy the following code and paste it into the newly created class file.
/**
* Question:- To find Max and Min temperature from record set stored in
* text file. Schema of record set :- tab separated (\t) CA_25-Jan-2014
* 00:12:345 15.7 01:19:345 23.1 02:34:542 12.3 03:12:187 16 04:00:093
* -14 05:12:345 35.7 06:19:345 23.1 07:34:542 12.3 08:12:187 16
* 09:00:093 -7 10:12:345 15.7 11:19:345 23.1 12:34:542 -22.3 13:12:187
* 16 14:00:093 -7 15:12:345 15.7 16:19:345 23.1 19:34:542 12.3
* 20:12:187 16 22:00:093 -7
* Expected output:- Creates files for each city and store maximum & minimum
* temperature for each day along with time.
*/
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
/**
* @author devinline
*/
public class CalculateMaxAndMinTemeratureWithTime {
public static String calOutputName = "California";
public static String nyOutputName = "Newyork";
public static String njOutputName = "Newjersy";
public static String ausOutputName = "Austin";
public static String bosOutputName = "Boston";
public static String balOutputName = "Baltimore";
// ... (declaration of the mapper class WhetherForcastMapper, its local variables
// — counter, date, currentTime, currnetTemp, minTemp, maxTemp, minTempANDTime,
// maxTempANDTime, strTokens — and the start of its map() method omitted) ...
while (strTokens.hasMoreElements()) {
if (counter == 0) {
date = strTokens.nextToken();
} else {
if (counter % 2 == 1) {
currentTime = strTokens.nextToken();
} else {
currnetTemp = Float.parseFloat(strTokens.nextToken());
if (minTemp > currnetTemp) {
minTemp = currnetTemp;
minTempANDTime = minTemp + "AND" + currentTime;
}
if (maxTemp < currnetTemp) {
maxTemp = currnetTemp;
maxTempANDTime = maxTemp + "AND" + currentTime;
}
}
}
counter++;
}
// Write to context - MinTemp, MaxTemp and corresponding time
Text temp = new Text();
temp.set(maxTempANDTime);
Text dateText = new Text();
dateText.set(date);
try {
con.write(dateText, temp);
} catch (Exception e) {
e.printStackTrace();
}
temp.set(minTempANDTime);
dateText.set(date);
con.write(dateText, temp);
}
}
// ... (declaration of the reducer class WhetherForcastReducer, its MultipleOutputs
// field mos, its setup() method, the start of reduce(), and the loop over the
// values omitted) ...
if (counter == 0) {
reducerInputStr = value.toString().split("AND");
f1 = reducerInputStr[0];
f1Time = reducerInputStr[1];
}
else {
reducerInputStr = value.toString().split("AND");
f2 = reducerInputStr[0];
f2Time = reducerInputStr[1];
}
counter = counter + 1;
}
if (Float.parseFloat(f1) > Float.parseFloat(f2)) {
// ... (body of this comparison — writing the result to the city-specific named
// output via mos.write — the else branch, and the end of reduce() omitted) ...
@Override
public void cleanup(Context context) throws IOException,
InterruptedException {
mos.close();
}
}
// ... (start of the main() method — creating the Configuration and Job objects,
// setting the jar class, and adding the input/output paths — omitted) ...
job.setMapperClass(WhetherForcastMapper.class);
job.setReducerClass(WhetherForcastReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleOutputs.addNamedOutput(job, calOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, nyOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, njOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, bosOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, ausOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, balOutputName,
TextOutputFormat.class, Text.class, Text.class);
try {
System.exit(job.waitForCompletion(true) ? 0 : 1);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Explanation:-
In the map method, we parse each input line and maintain a counter for extracting the date and each
temperature & time value. For a given input line, we first extract the date (counter == 0), then alternately
extract the time (counter % 2 == 1, since times sit at odd positions: 1, 3, 5, ...) and the temperature otherwise.
We compare against the current max & min temperature and store them accordingly. Once the while loop
terminates for a given input line, we write maxTempANDTime and minTempANDTime along with the date.
In the reduce method, the setup method is executed once per reducer task and creates a MultipleOutputs object. For
a given key there are two entries (maxTempANDTime and minTempANDTime). We iterate over the values list, split each
value and get its temperature & time. We compare the temperature values and build the actual value string which
the reducer writes to the appropriate city-specific file (a minimal sketch of such a reducer follows below).
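Since the reducer appears only in fragments in the listing above, here is a minimal sketch of such a reducer, written as a static inner class of CalculateMaxAndMinTemeratureWithTime and assuming Text keys and values. The city routing shown (based on the prefix of the date key) is an assumption for illustration, not the exact original code:

public static class WhetherForcastReducer extends Reducer<Text, Text, Text, Text> {
    MultipleOutputs<Text, Text> mos;

    @Override
    public void setup(Context context) {
        // One MultipleOutputs instance per reducer task
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int counter = 0;
        String f1 = "", f1Time = "", f2 = "", f2Time = "";
        String[] reducerInputStr;
        for (Text value : values) {
            // Each key (date) carries two values: "maxTempANDtime" and "minTempANDtime"
            reducerInputStr = value.toString().split("AND");
            if (counter == 0) {
                f1 = reducerInputStr[0];
                f1Time = reducerInputStr[1];
            } else {
                f2 = reducerInputStr[0];
                f2Time = reducerInputStr[1];
            }
            counter++;
        }
        // Order the two readings as (max, min) and build the output string
        String result;
        if (Float.parseFloat(f1) > Float.parseFloat(f2)) {
            result = "Max: " + f1 + " at " + f1Time + ", Min: " + f2 + " at " + f2Time;
        } else {
            result = "Max: " + f2 + " at " + f2Time + ", Min: " + f1 + " at " + f1Time;
        }
        // Route the record to a city-specific named output; the prefix-based choice
        // below is an assumption (keys are assumed to look like "CA_25-Jan-2014")
        String namedOutput = key.toString().startsWith("CA") ? calOutputName : nyOutputName;
        mos.write(namedOutput, key, new Text(result));
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

MultipleOutputs is what lets a single reducer write to several city-specific files, one for each name registered with addNamedOutput in the driver.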
In the main method, an instance of Job is created with a Configuration object. The Job is configured with the mapper
and reducer classes along with the input and output formats. MultipleOutputs information is added to the Job to
indicate the file names to be used with the output format. For this sample program, we use an input
file ("/weatherInputData/input_temp.txt") placed on HDFS, and the output directory
(/user/hduser1/testfs/output_mapred5) will also be created on HDFS. Use the command below to copy the
downloaded input file from the local file system to HDFS, and give write permission to the client
executing this program so that the output directory can be created.
Copy the input file from the local file system to HDFS:
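A typical set of commands (the HDFS paths match those mentioned above; the local file path and the permission mode are assumptions) would be:
hadoop fs -mkdir -p /weatherInputData
hadoop fs -put /home/hduser1/input_temp.txt /weatherInputData/input_temp.txt
hadoop fs -chmod -R 777 /user/hduser1/testfs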
Before executing the above program, make sure the Hadoop services are running (to start all services, execute
./start-all.sh from <hadoop_home>/sbin).
Now execute the above sample program: Run -> Run as Hadoop. Wait a moment and check whether the output
directory is in place on HDFS. Execute the following command to verify the same.
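For example, using the output path mentioned above:
hadoop fs -ls /user/hduser1/testfs/output_mapred5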
Note:-
To reference the input file from the local file system instead of HDFS, uncomment the lines below in the main
method and comment out the hard-coded addInputPath and setOutputPath lines. Here Path(args[0])
and Path(args[1]) read the input and output locations from the program arguments; alternatively, create the Path
objects with the string of the input file and output location.
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2.
Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of a
matrix is labelled as Aij or Bjk; e.g. element 3 in matrix A is called A21, i.e. 2nd row, 1st column. Now,
one-step matrix multiplication has 1 mapper and 1 reducer. The formula is:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A:
# k, i and j each range over the matrix dimensions.
# Here all dimensions are 2, therefore when k=1, i can have
# 2 values (1 & 2), and each case can have 2 further
# values of j (1 and 2). Substituting all values
# in the formula:
k=1 i=1 j=1 ((1, 1), (A, 1, 1))
        j=2 ((1, 1), (A, 2, 2))
    i=2 j=1 ((2, 1), (A, 1, 3))
        j=2 ((2, 1), (A, 2, 4))
k=2 i=1 j=1 ((1, 2), (A, 1, 1))
        j=2 ((1, 2), (A, 2, 2))
    i=2 j=1 ((2, 2), (A, 1, 3))
        j=2 ((2, 2), (A, 2, 4))
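A minimal mapper sketch for matrix A that produces the emissions above (assumptions: input lines of the form "A,i,j,value", e.g. "A,2,1,3", and a 2×2 matrix B so that k ranges over 1..2; this is an illustration, not a prescribed implementation):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixAMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int K = 2; // number of columns of B (assumption: 2x2 matrices)

    @Override
    public void map(LongWritable offset, Text line, Context con)
            throws IOException, InterruptedException {
        // Expected input line format (assumption): "A,i,j,value"
        String[] parts = line.toString().split(",");
        String matrixName = parts[0];
        int i = Integer.parseInt(parts[1]);
        int j = Integer.parseInt(parts[2]);
        String value = parts[3];
        if (matrixName.equals("A")) {
            // Emit ((i, k), (A, j, Aij)) for every column k of B
            for (int k = 1; k <= K; k++) {
                con.write(new Text(i + "," + k), new Text("A," + j + "," + value));
            }
        }
    }
}

For matrix B, an analogous mapper would emit ((i, k), (B, j, Bjk)) for every row i of A; the reducer for a given (i, k) key then multiplies matching j entries of A and B and sums the products.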
AIM: Pig Latin scripts to sort, group, join, project, and filter your data.
Sample Data: To make these scripts concrete, let’s assume two example datasets:
Employees dataset (employees.txt):
1,John,Engineering,50000
2,Sara,Marketing,60000
3,Mike,Engineering,70000
4,Jane,Marketing,55000
5,Bob,HR,45000
Departments dataset:
Engineering,101
Marketing,102
HR,103
First, load these datasets into Pig variables for use in transformations.
-- Load the employees dataset
employees = LOAD 'employees.txt'
    USING PigStorage(',')
    AS (emp_id: int, name: chararray, department: chararray, salary: int);
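The join in step 3 also needs the departments dataset loaded; a matching LOAD statement (the file name departments.txt is an assumption, while the field names follow the join and projection below) would be:
-- Load the departments dataset
departments = LOAD 'departments.txt'
    USING PigStorage(',')
    AS (department: chararray, dept_id: int);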
1. Sorting Data
Sort the employees data by salary in descending order.
sorted_employees = ORDER employees BY salary DESC;
DUMP sorted_employees;
2. Grouping Data
Group the employees data by department to calculate the total salary per department.
grouped_employees = GROUP employees BY department;
total_salary_per_department = FOREACH grouped_employees GENERATE
    group AS department,
    SUM(employees.salary) AS total_salary;
DUMP total_salary_per_department;
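With the sample data above, the expected totals are Engineering = 120000, Marketing = 115000 and HR = 45000, so the DUMP should print roughly:
(Engineering,120000)
(Marketing,115000)
(HR,45000)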
3. Joining Data
Join the employees dataset with the departments dataset on the department field to associate each employee with
their dept_id.
joined_data = JOIN employees BY department, departments BY department;
projected_data = FOREACH joined_data GENERATE
    employees::emp_id AS emp_id,
    employees::name AS name,
    employees::department AS department,
    departments::dept_id AS dept_id,
    employees::salary AS salary;
DUMP projected_data;
4. Projecting Data
Project only the name and salary fields from the employees dataset.
projected_employees = FOREACH employees GENERATE name, salary;
DUMP projected_employees;
5. Filtering Data
Filter the employees dataset to show only employees with a salary greater than 50000.
high_salary_employees = FILTER employees BY salary > 50000;
DUMP high_salary_employees;
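With the sample data, only Sara, Mike and Jane earn more than 50000, so the DUMP should print roughly:
(2,Sara,Marketing,60000)
(3,Mike,Engineering,70000)
(4,Jane,Marketing,55000)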
Summary of Operations
These scripts cover the core Pig Latin operations named in the aim: ORDER BY for sorting, GROUP with SUM for grouping and aggregation, JOIN for joining, FOREACH ... GENERATE for projection, and FILTER for filtering.
Step-by-Step Process: Moving a File within HDFS
1. Verify Source File Existence: Check if the source file exists at the specified path.
2. Check Permissions: Ensure that the user has necessary permissions to read from the source location and
write to the destination location.
3. Lock Source File: Lock the source file to prevent concurrent modifications.
4. Create Destination Directory: Create the destination directory if it does not exist.
5. Copy File Contents: Copy the contents of the source file to the destination file.
6. Update File Metadata: Update the file metadata (e.g., ownership, permissions, timestamps) at the
destination.
7. Delete Source File: Delete the source file.
8. Unlock Destination File: Unlock the destination file.
Java API: the same move can be performed programmatically with the Hadoop FileSystem API.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
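A minimal sketch (the paths and the fs.defaultFS value are assumptions) that performs the move with FileSystem.rename(), which updates the NameNode metadata rather than physically copying blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMoveFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: the NameNode address configured earlier in this document
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/user/hadoop/source.txt");         // hypothetical source file
        Path dst = new Path("/user/hadoop/archive/source.txt"); // hypothetical destination

        fs.mkdirs(dst.getParent());          // create the destination directory if missing
        boolean moved = fs.rename(src, dst); // move the file (metadata-only operation in HDFS)
        System.out.println("Move " + (moved ? "succeeded" : "failed"));

        fs.close();
    }
}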
To delete a file from HDFS, use the `hdfs dfs -rm` command:
1. **Open Terminal**: Use the command line or terminal to enter HDFS commands.
2. **Run Remove Command**: Use the following command to delete a file:
```bash
hdfs dfs -rm /path/to/your/file
```
- **Example**: If you have a file named `example.txt` in `/user/hadoop/`, the command will be:
```bash
hdfs dfs -rm /user/hadoop/example.txt
```
3. **Confirm Deletion**: The command will delete the specified file. You’ll see a confirmation message
in the terminal if the file is successfully deleted.
To check the total size of a file, use the `hdfs dfs -stat` command, which gives detailed information about a file, including its size.
1. **Open Terminal**.
2. **Run the `stat` Command**:
```bash
hdfs dfs -stat %b /path/to/your/file
```
- **Explanation**:
- `%b`: Displays the file length in bytes.
- **Example**:
```bash
hdfs dfs -stat %b /user/hadoop/example.txt
```
3. **Output**: The total file size in bytes will be shown.
Example Output
If `example.txt` is 1048576 bytes (1 MB), the command output might look like:
```bash
1048576
```
This output represents the total size (aggregate length) of the file.