Big Data Lab
B.Tech (CSE) – III-II L T P C
0 0 3 1.5
Course Objectives:
1. Get familiar with Hadoop distributions, configuring Hadoop and performing file management tasks
2. Experiment with MapReduce in the Hadoop framework
3. Implement MapReduce programs for a variety of applications
4. Explore MapReduce support for debugging
5. Understand different approaches for building Hadoop MapReduce programs for real-time applications
Experiments:
2. Develop a MapReduce program to calculate the frequency of a given word in a given file.
6. Develop a MapReduce program to find the maximum electrical consumption in each year, given the electrical consumption for each month of each year.
7. Develop a MapReduce program to analyze a weather data set and print whether the day is sunny or a cool day.
8. Develop a MapReduce program to find the number of products sold in each country by considering sales data containing fields like
9. Develop a MapReduce program to find the tags associated with each movie by analyzing MovieLens data.
The data is coming in log files and looks like as shown below.
111115 | 222 | 0 | 1 | 0
111113 | 225 | 1 | 0 | 0
111117 | 223 | 0 | 1 | 1
111115 | 225 | 1 | 0 | 0
11. Develop a MapReduce program to find the frequency of books published each year and find in which year the maximum number of books were published, using the following data.
12. Develop a MapReduce program to analyze Titanic ship data and to find the average age of the people (both male and female) who died in the tragedy. Also find how many persons survived in each class.
13. Develop a MapReduce program to analyze the Uber data set to find the days on which each basement has more trips, using the following dataset.
14. Develop a program to calculate the maximum recorded temperature year-wise for the weather dataset in Pig Latin.
16. Develop a Java application to find the maximum temperature using Spark.
Text Books:
1. Tom White, “Hadoop: The Definitive Guide”, Fourth Edition, O’Reilly Media, 2015.
Reference Books:
1. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007; Pete Warden, Big Data Glossary, O’Reilly, 2011.
2. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
3. Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Publishing, 2012.
4. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press, 2012.
Course Outcomes:
AIM:
To install a single-node Hadoop cluster backed by the Hadoop Distributed File System on Ubuntu
using VMware.
PREREQUISITES:
Description:
1. Installing VMware
i. Double-click to launch the VMware-workstation-full-15 application.
ii. A security warning panel appears; click Run to continue.
iii. The initial screen will appear; wait for the process to complete.
iv. The VMware Workstation setup wizard opens; click Next.
v. Select "I accept the terms in the License Agreement" and click Next.
vi. Select the directory in which you would like to install the application; also select the Enhanced Keyboard Driver checkbox.
vii. Leave the default settings and click Next.
viii. Select both the options Desktop and Start Menu Programs Folder and click Next.
ix. Click Install to start the installation process.
x. Installation is in progress; wait for this to complete.
Output:
5. Installing Java
Output:
$ssh localhost
Output:
$wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Output:
Now type:
$source ~/.bashrc
==========================
2nd File
==========================
$sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
$export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
==========================
3rd File
============================
$sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
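The actual contents entered into core-site.xml are not reproduced in this manual. For a single-node setup the file typically carries the default filesystem URI and a temporary-data directory; the values below (the port 9000 and the /home/bigdata/tmpdata path) are assumptions, not the original configuration:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/bigdata/tmpdata</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>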
==========================
4th File
==========================
$sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.name.dir</name>
<value>/home/bigdata/dfsdata/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/bigdata/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
==========================
5th File
===========================
$sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
==========================
6th File
==========================
$sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
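Once all the configuration files are saved, the NameNode is formatted and the Hadoop daemons are started. The standard commands for this step (shown here as a reminder; any additional options used in the original run are not known) are:
$hdfs namenode -format
$start-dfs.sh
$start-yarn.sh
$jps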
Output:
RESULT:
The installation of a single-node Hadoop cluster backed by the Hadoop Distributed File System on Ubuntu using VMware was executed successfully.
Aim:
To write a MapReduce program for counting the number of occurrences of each word in a text file, using MapReduce concepts.
PROCEDURE:
1. Make sure Java and Hadoop are installed and running:
$hadoop version
$javac -version
Program: WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
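Only the import statements of WordCount.java survive above. The remainder of the program, consistent with these imports and following the stock Apache word-count example, is sketched below:
public class WordCount {
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one); // emit (word, 1) for every token
}
}
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get(); // add up the counts for this word
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}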
Output:
# Check the NameNode web UI in a browser: http://localhost:9870 (http://localhost:50070 on Hadoop 2.x)
Output:
Check in browser
$ cd /home/bigdata/Desktop/
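The compile-and-run steps between the screenshots are not reproduced above. A typical sequence (the input file name, the bigdata_classes folder and the HDFS paths are assumptions consistent with the rest of this manual) is:
$javac -classpath `hadoop classpath` -d bigdata_classes WordCount.java
$jar -cvf WordCount.jar -C bigdata_classes/ .
$hadoop fs -mkdir -p /WordCount/Input
$hadoop fs -put input1.txt /WordCount/Input
$hadoop jar WordCount.jar WordCount /WordCount/Input /WordCount/Output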
Final Output:
Aim:
To find the maximum temperature per year from a sensor temperature data set, using the Hadoop MapReduce framework.
Description:
$start-all.sh
Output:
Program: MaxTemp.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999; // sentinel for a missing temperature reading
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19); // year field of the NCDC record
int temperature;
if (line.charAt(87) == '+') // sign of the temperature reading
temperature = Integer.parseInt(line.substring(88, 92));
else
temperature = Integer.parseInt(line.substring(87, 92));
String quality = line.substring(92, 93);
if (temperature != MISSING && quality.matches("[01459]"))
context.write(new Text(year), new IntWritable(temperature));
}
}
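The reducer and driver of MaxTemp.java are not reproduced above. A sketch consistent with the mapper follows; it additionally needs imports for Reducer, FileInputFormat and FileOutputFormat, and the class names are assumptions:
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get()); // keep the highest temperature for this year
}
context.write(key, new IntWritable(maxValue));
}
}
public class MaxTemp {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "max temperature");
job.setJarByClass(MaxTemp.class);
job.setMapperClass(MaxTempMapper.class);
job.setReducerClass(MaxTempReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}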
Output:
# check output
$hadoop dfs -cat /MaxTemp/Output/*
Final Output:
Aim:
To find the grades of students using a MapReduce program.
Description:
$start-all.sh
Output:
Program: StudentGrade.java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
class GradeReducer extends Reducer<Text, IntWritable, Text, Text> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
int l = 0;
for (IntWritable val : values) {
l += 1;
sum += val.get();
}
int avg = sum / l; // average marks of this student
String grade;
if (avg >= 80)
grade = "A";
else if (avg >= 60 && avg < 80)
grade = "B";
else if (avg >= 40 && avg < 60)
grade = "C";
else
grade = "D";
context.write(key, new Text(grade + " (avg = " + avg + ")")); // student name -> grade and average
}
}
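The mapper and the driver of StudentGrade.java are not reproduced above. A minimal sketch, assuming each input line holds a student name followed by tab-separated marks (an assumption about the data file), wired to the GradeReducer shown above:
class MarksMapper extends Mapper<Object, Text, Text, IntWritable> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] parts = value.toString().split("\t");
String name = parts[0]; // student name
for (int i = 1; i < parts.length; i++) {
context.write(new Text(name), new IntWritable(Integer.parseInt(parts[i]))); // one record per subject mark
}
}
}
public class StudentGrade {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "student grade");
job.setJarByClass(StudentGrade.class);
job.setMapperClass(MarksMapper.class);
job.setReducerClass(GradeReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}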
Output:
# check output
$hadoop dfs -cat /StudentGrade/Output/*
Final Output:
Aim:
To perform matrix multiplication using a MapReduce program.
Description:
$start-all.sh
Output:
Program: MatrixMul.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
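Only the imports of MatrixMul.java appear above. A hedged sketch of a one-step matrix-multiplication mapper and reducer follows; it assumes the input format "MatrixName,row,column,value" (e.g. M,0,1,5) and that the dimensions m (rows of M) and n (columns of N) are passed through the job Configuration. All of these details are assumptions, and a standard driver that sets conf.set("m", ...) and conf.set("n", ...) is still needed:
class MatMapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m")); // rows of M
int n = Integer.parseInt(conf.get("n")); // columns of N
String[] t = value.toString().split(","); // e.g. M,i,j,value
if (t[0].equals("M")) {
for (int k = 0; k < n; k++)
context.write(new Text(t[1] + "," + k), new Text("M," + t[2] + "," + t[3]));
} else {
for (int i = 0; i < m; i++)
context.write(new Text(i + "," + t[2]), new Text("N," + t[1] + "," + t[3]));
}
}
}
class MatReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
java.util.HashMap<Integer, Float> a = new java.util.HashMap<>(); // M[i][j] keyed by j
java.util.HashMap<Integer, Float> b = new java.util.HashMap<>(); // N[j][k] keyed by j
for (Text v : values) { // separate the tagged M and N entries
String[] t = v.toString().split(",");
if (t[0].equals("M")) a.put(Integer.parseInt(t[1]), Float.parseFloat(t[2]));
else b.put(Integer.parseInt(t[1]), Float.parseFloat(t[2]));
}
float sum = 0;
for (java.util.Map.Entry<Integer, Float> e : a.entrySet())
sum += e.getValue() * b.getOrDefault(e.getKey(), 0f); // sum over j of M[i][j] * N[j][k]
context.write(key, new Text(Float.toString(sum))); // value of C[i][k]
}
}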
Output:
# check output
$hadoop dfs -cat /MatrixMul/Output/*
Final Output:
Develop a MapReduce program to find the maximum electrical consumption in each year given
electrical consumption for each month in each year.
Aim:
To find the maximum electrical consumption in each year, given the electrical consumption for each month of each year, using a MapReduce program.
Description:
$start-all.sh
Output:
Program: ProcessUnits.java
import java.util.*;
import java.io.IOException;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
while(s.hasMoreTokens())
{
lasttoken=s.nextToken();
}
while (values.hasNext())
{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}
}
}
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
}
}
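Only fragments of the mapper, reducer and driver of ProcessUnits.java survive above. A hedged sketch of the surrounding class structure is given below; the class names E_EMapper and E_EReduce come from the driver fragment, while the input layout (a year followed by consumption values, with the yearly figure as the last tab-separated column) and the threshold of 30 are assumptions:
public static class E_EMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer s = new StringTokenizer(line, "\t");
String year = s.nextToken(); // first token is the year
String lasttoken = null;
while (s.hasMoreTokens()) {
lasttoken = s.nextToken(); // keep walking to the last column
}
int avgprice = Integer.parseInt(lasttoken); // yearly consumption figure (assumed last column)
output.collect(new Text(year), new IntWritable(avgprice));
}
}
public static class E_EReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int maxavg = 30, val; // only values above this threshold are emitted
while (values.hasNext()) {
if ((val = values.next().get()) > maxavg) {
output.collect(key, new IntWritable(val));
}
}
}
}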
Output:
# check output
$hadoop dfs -cat /ProcessUnits/Output/*
Final Output:
Develop a MapReduce program to analyze a weather data set and print whether the day is sunny or a cool day.
Aim:
To analyze a weather data set and print whether the day is sunny or a cool day, using a MapReduce program.
Description:
$start-all.sh
Output:
Program: MyMaxMin.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
@Override
public void map(LongWritable arg0, Text Value,
OutputCollector<Text, Text> output, Reporter arg3)
throws IOException {
// Example of Input
// Date Max Min
// 25380 20130101 2.514 -135.69 58.43 8.3 1.1 4.7 4.9 5.6 0.01 C 1.0 -0.1 0.4 97.3 36.0 69.4 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0
conf.setMapperClass(MaxTemperatureMapper.class);
conf.setReducerClass(MaxTemperatureReducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
}
}
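The body of the map() method of MyMaxMin.java is missing above. A hedged sketch, based on the fixed-width record layout shown in the comment, is given below; the character positions, the 30-degree and 15-degree thresholds, and the old-style org.apache.hadoop.mapred imports (OutputCollector, Reporter) are assumptions:
public void map(LongWritable arg0, Text Value, OutputCollector<Text, Text> output, Reporter arg3) throws IOException {
String line = Value.toString();
if (!(line.length() == 0)) {
String date = line.substring(6, 14); // date field of the record
float temp_Max = Float.parseFloat(line.substring(39, 45).trim()); // daily maximum temperature
float temp_Min = Float.parseFloat(line.substring(47, 53).trim()); // daily minimum temperature
if (temp_Max > 30.0) {
output.collect(new Text("Sunny Day " + date), new Text(String.valueOf(temp_Max)));
}
if (temp_Min < 15) {
output.collect(new Text("Cool Day " + date), new Text(String.valueOf(temp_Min)));
}
}
}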
Output:
4. Create a text file containing some NCDC data and name it “input7.txt”.
Output:
# check output
$hadoop dfs -cat /MyMaxMin/Output/*
Final Output:
Develop a MapReduce program to find the number of products sold in each country by considering sales data containing fields like
Aim:
To develop a MapReduce program to find the number of products sold in each country by considering sales data containing fields like
Description:
$start-all.sh
Output:
Program: SalesCountry.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public void map(LongWritable key, Text value, OutputCollector <Text, IntWritable> output,
Reporter reporter) throws IOException {
}
output.collect(key, new IntWritable(frequencyForCountry));
}
}
public static void main(String[] args)
{
JobClient my_client = new JobClient();
// Create a configuration object for the job
my_client.setConf(job_conf);
try {
// Run the job
JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
}
}
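The mapper body, the reducer loop and the job configuration of SalesCountry.java are not fully reproduced above. A hedged sketch follows; the class names SalesMapper and SalesCountryReducer and the position of the country field (the 8th comma-separated column) are assumptions about the original program and data layout:
public static class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String[] fields = value.toString().split(",");
String country = fields[7]; // country column (assumed position)
output.collect(new Text(country), new IntWritable(1));
}
}
public static class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int frequencyForCountry = 0;
while (values.hasNext()) {
frequencyForCountry += values.next().get(); // count products sold in this country
}
output.collect(key, new IntWritable(frequencyForCountry));
}
}
// Inside main(), the job_conf object referenced above would be prepared roughly as:
// JobConf job_conf = new JobConf(SalesCountry.class);
// job_conf.setJobName("per_country_sales");
// job_conf.setOutputKeyClass(Text.class);
// job_conf.setOutputValueClass(IntWritable.class);
// job_conf.setMapperClass(SalesMapper.class);
// job_conf.setReducerClass(SalesCountryReducer.class);
// FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
// FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));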
Output:
4. Create a text file containing some sales data and name it “input8.txt”.
Output:
5. Create a folder to store the compiled Java classes and name it “bigdata_classes”.
Output:
Output:
# check output
$hadoop dfs -cat /SalesCountry/Output/*
Develop a MapReduce program to find the tags associated with each movie by analyzing MovieLens data.
Aim:
To develop a MapReduce program to find the tags associated with each movie by analyzing MovieLens data.
Description:
$start-all.sh
Output:
Program: MovieLens.java
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
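Only the imports of MovieLens.java survive above. A minimal sketch of a mapper and reducer that collect the tags for each movie is given below; it assumes the MovieLens tags file format userId,movieId,tag,timestamp (an assumption about the input), and the class names are illustrative:
class TagMapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] fields = value.toString().split(",");
if (fields.length >= 3) {
context.write(new Text(fields[1]), new Text(fields[2])); // movieId -> tag
}
}
}
class TagReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
StringBuilder tags = new StringBuilder();
for (Text t : values) {
if (tags.length() > 0) tags.append(",");
tags.append(t.toString());
}
context.write(key, new Text(tags.toString())); // movieId -> comma-separated list of tags
}
}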
Output:
# check output
$hadoop dfs -cat /MovieLens/Output/*
Final Output:
XYZ.com is an online music website where users listen to various tracks, and the data gets collected as given below.
The data comes in log files and looks as shown below (fields: UserId | TrackId | Shared | Radio | Skip).
111115 | 222 | 0 | 1 | 0
111113 | 225 | 1 | 0 | 0
111117 | 223 | 0 | 1 | 1
111115 | 225 | 1 | 0 | 0
Aim:
To develop a MapReduce program that analyzes the XYZ.com online music website log data shown below and computes, for each track, statistics such as the number of unique listeners, shares, radio plays and skips.
The data comes in log files and looks as shown below.
111115 | 222 | 0 | 1 | 0
111113 | 225 | 1 | 0 | 0
111117 | 223 | 0 | 1 | 1
111115 | 225 | 1 | 0 | 0
Description:
$start-all.sh
Output:
2. Write the below program and save it as “MusicTrack.java” on the Desktop.
Program: MusicTrack.java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
//import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MusicTrack
{
public static class MusicMapper extends Mapper<Object,Text,Text,Text>
{
public void map(Object key,Text value,Context context) throws
IOException,InterruptedException
{
String[] tokens=value.toString().split("\\|");
String trackid = /*"1";*/tokens[1];
String others = tokens[0]+"\t"+tokens[2]+"\t"+tokens[3]+"\t"+tokens[4];
context.write(new Text(trackid),new Text(others));
}
}
public static class MusicReducer extends Reducer<Text, Text, Text, Text>
{
public void reduce(Text key, Iterable<Text> value, Context context) throws IOException, InterruptedException
{
Set<String> userIdSet = new HashSet<String>();
int shared = 0, radio = 0, skip = 0, listen = 0;
for(Text val:value)
{
String[] valTokens = val.toString().split("\t");
String cus = valTokens[0]; // user id
int sh = Integer.parseInt(valTokens[1]); // shared flag
int ra = Integer.parseInt(valTokens[2]); // radio flag
int sk = Integer.parseInt(valTokens[3]); // skip flag
shared = shared+sh;
radio = radio+ra;
skip = skip+sk;
listen = shared + radio;
userIdSet.add(cus);
}
// emit per-track statistics: unique listeners, shares, radio plays, listens and skips
context.write(key, new Text(userIdSet.size() + "\t" + shared + "\t" + radio + "\t" + listen + "\t" + skip));
}
}
public static void main(String args[]) throws Exception
{
Configuration conf=new Configuration();
Job job=new Job(conf,"MusicTrack");
job.setNumReduceTasks(1);
job.setJarByClass(MusicTrack.class);
job.setMapperClass(MusicMapper.class);
job.setReducerClass(MusicReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
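// The main() method above stops before the I/O paths are configured. A possible
// completion (taking the paths from the command line is an assumption):
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}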
Output:
# check output
$hadoop dfs -cat /MusicTrack/Output/*
Final Output:
Develop a MapReduce program to find the frequency of books published each year and find in which year the maximum number of books were published, using the following data.
Aim:
To develop a MapReduce program to find the frequency of books published each year and find in which year the maximum number of books were published, using the following data.
Description:
$start-all.sh
Output:
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
if (line.charAt(87)=='+')
author =Integer.parseInt(line.substring(25, 32));
else
author = Integer.parseInt(line.substring(20, 25));
String quality = line.substring(92, 93);
if(author != MISSING && quality.matches("[01200]"))
context.write(new Text(year),new IntWritable(author));
}
}
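The exact layout of the book data is not shown in this manual, so a simpler hedged sketch of a mapper and reducer that count books per publication year is given below; it assumes each record carries the year as its last tab-separated field and the usual org.apache.hadoop.io / mapreduce imports, all of which are assumptions:
class YearMapper extends Mapper<Object, Text, Text, IntWritable> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] fields = value.toString().split("\t");
String year = fields[fields.length - 1].trim(); // publication year (assumed last field)
context.write(new Text(year), new IntWritable(1)); // one book published in this year
}
}
class YearReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int count = 0;
for (IntWritable v : values) count += v.get();
context.write(key, new IntWritable(count)); // frequency of books for the year
}
}
// The year with the maximum count can then be read off the small output,
// or found with a second pass that tracks the running maximum in a single reducer.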
Output:
# check output
$hadoop dfs -cat /BookMax/Output/*
Final Output:
Develop a MapReduce program to analyze Titanic ship data and to find the average age of the people (both male and female) who died in the tragedy. Also find how many persons survived in each class.
Aim:
To develop a MapReduce program to analyze Titanic ship data and to find the average age of the people (both male and female) who died in the tragedy, and to find how many persons survived in each class.
Description:
$start-all.sh
Output:
Program: Average_age.java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
@SuppressWarnings("deprecation")
Job job = new Job(conf, "Averageage_survived");
job.setJarByClass(Average_age.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// job.setNumReduceTasks(0);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
}
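The Map and Reduce classes referenced by the driver above are not reproduced. A minimal sketch, nested inside the Average_age class, is given below; the column positions follow the standard Kaggle Titanic CSV layout (Survived at index 1, Sex at index 4, Age at index 5), which is an assumption about the input file:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] str = value.toString().split(",");
if (str.length > 5 && str[1].equals("0") && !str[5].isEmpty()) { // person died and age is present
context.write(new Text(str[4]), new IntWritable((int) Float.parseFloat(str[5]))); // sex -> age
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0, count = 0;
for (IntWritable v : values) { sum += v.get(); count++; }
context.write(key, new IntWritable(count == 0 ? 0 : sum / count)); // average age of the deceased, per gender
}
}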
Output:
# check output
$hadoop dfs -cat /Average_age/Output/*
Final Output:
Develop a MapReduce program to analyze the Uber data set to find the days on which each basement has more trips, using the following dataset.
Aim:
To develop a MapReduce program to analyze the Uber data set to find the days on which each basement has more trips, using the following dataset.
Description:
$start-all.sh
Output:
Program: Trip_tracking.java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public void map(Object key, Text record, Context context) throws IOException,
InterruptedException {
String[] parts = record.toString().split("[,]");
basement.set(parts[0]);
try {
date = format.parse(parts[1]);
calendar.setTime(date);
} catch (ParseException e) {
e.printStackTrace();
}
}
public class Sum_reducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get(); // total trips for this basement/day key
}
result.set(sum);
context.write(key, result);
}
}
public class Trip_tracking extends Configured implements Tool {
Job job = Job.getInstance(getConf(), "Uber Trip tracking to find the days with more trips for each basement");
job.setJarByClass(getClass());
job.setMapperClass(TokenizerMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setCombinerClass(Sum_reducer.class);
job.setReducerClass(Sum_reducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
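// The run() method above ends before the input/output paths are set and the job is
// submitted. A possible completion (command-line paths and the ToolRunner-based main()
// are assumptions, not the original code):
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
System.exit(ToolRunner.run(new Trip_tracking(), args));
}
}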
Output:
# check output
$hadoop dfs -cat /UberTrack/Output/*
Final Output:
Develop a program to calculate the maximum recorded temperature year-wise for the weather dataset in Pig Latin.
Aim:
To develop a program to calculate the maximum recorded temperature year-wise for the weather dataset in Pig Latin.
Description:
1. Install java
2. Install hadoop
3. Run hadoop
$start-all.sh
Output:
$wget http://www.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
Output:
$nano .bashrc
$export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
OUTPUT:
$pig
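The Pig Latin statements themselves are not reproduced in this manual. A minimal sketch that produces output like the one shown below, assuming a tab-separated file of (year, temperature, quality) records at an illustrative HDFS path, is:
grunt> records = LOAD '/PigInput/sample.txt' AS (year:chararray, temperature:int, quality:int);
grunt> filtered = FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9);
grunt> grouped = GROUP filtered BY year;
grunt> max_temp = FOREACH grouped GENERATE group, MAX(filtered.temperature);
grunt> DUMP max_temp;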
final output:
1949 111
1950 22
Write queries to sort and aggregate the data in a table using HiveQL.
Aim:
To write queries to sort and aggregate the data in a table using HiveQL.
Description:
1. Install java
2. Install hadoop
3. Run hadoop
$start-all.sh
Output:
$wget “https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz”
$nano .bashrc
$export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
OUTPUT:
$hive
hive> create table emp (Id int, Name string, Salary float, Department string)
row format delimited
fields terminated by ',' ;
Final output:
14. Now, fetch the data in the descending order (sorting) by using the following command.
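The sorting and aggregation queries themselves are not reproduced; typical HiveQL examples on the emp table created above (column names follow that table) are:
hive> SELECT * FROM emp ORDER BY Salary DESC;
hive> SELECT Department, COUNT(*) AS emp_count, AVG(Salary) AS avg_salary FROM emp GROUP BY Department;
hive> SELECT Department, MAX(Salary) AS max_salary FROM emp GROUP BY Department;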
Aim:
To Develop a Java application to find the maximum temperature using Spark.
Description:
1. Install java
2. Install hadoop
3. Run hadoop
$start-all.sh
Output:
$wget “https://www.apache.org/dyn/closer.lua/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz”
5. Extract the Spark archive
$nano .bashrc
export SPARK_HOME=/home/bigdata/spark-2.4.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
$source .bashrc
09. Set JAVA_HOME
$export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
11. Create a text file “input11.txt” in your local machine and write some weather data set into it.
import org.apache.spark.SparkContext._
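Only a single Scala import (as used in spark-shell) is shown above. Since the aim asks for a Java application, a hedged Java sketch that finds the maximum temperature per year is given below; the tab-separated (year, temperature) record layout, the class name and the command-line paths are assumptions:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class MaxTemperatureSpark {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("MaxTemperatureSpark");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile(args[0]); // e.g. the input11.txt weather file
JavaPairRDD<String, Integer> yearTemp = lines.mapToPair(line -> {
String[] parts = line.split("\t"); // assumed layout: year<TAB>temperature
return new Tuple2<>(parts[0], Integer.parseInt(parts[1]));
});
JavaPairRDD<String, Integer> maxByYear = yearTemp.reduceByKey(Math::max); // keep the highest value per year
maxByYear.saveAsTextFile(args[1]); // or collect() and print to the console
sc.close();
}
}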
Final output:
1950 22