CP5261 Data Analytics Laboratory
REGISTER NO : ………………………………………………….
YEAR / SEM : ………………………………………………….
DEPARTMENT : ………………………………………………….
SUBJECT : ………………………………………………….
BONAFIDE CERTIFICATE
Mr./Ms.…….…………………………Register No…………………..
of .......……....…………………………………………… Department
EX.NO   DATE   NAME OF THE EXPERIMENT   PAGE NO   SIGN

1. INSTALL, CONFIGURE AND RUN HADOOP AND HDFS
2. IMPLEMENT WORD COUNT / FREQUENCY PROGRAMS USING MAPREDUCE
3. IMPLEMENT AN MR PROGRAM THAT PROCESSES A WEATHER DATASET
4. IMPLEMENT LINEAR AND LOGISTIC REGRESSION
5. IMPLEMENT SVM / DECISION TREE CLASSIFICATION TECHNIQUES
6. IMPLEMENT CLUSTERING TECHNIQUES
7. VISUALIZE DATA USING ANY PLOTTING FRAMEWORK
8. IMPLEMENT AN APPLICATION THAT STORES BIG DATA IN HBASE / MONGODB / PIG USING HADOOP / R
EX.NO: 1 INSTALL, CONFIGURE AND RUN HADOOP AND HDFS
DATE:

AIM:
To install, configure and run Hadoop and HDFS on a single-node cluster.
PROCEDURE:
1. Installing Java
prince@prince-VirtualBox:~$ cd ~
3. Installing SSH
SSH is pre-enabled on Linux, but in order to start the sshd daemon, we need to install ssh first. Use this command to do that:
This will install ssh on our machine. If we get something similar to the following, we can assume it is set up properly:
/usr/bin/ssh
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local
machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to
localhost.
So, we need to have SSH up and running on our machine and configure it to allow SSH public-key authentication.
Hadoop uses SSH (to access its nodes) which would normally require the user to enter a
password. However, this requirement can be eliminated by creating and setting up SSH
certificates using the following commands. If asked for a filename just leave it blank and
press the enter key to continue.
prince@prince-VirtualBox:~$ su hduser
Password:
prince@prince-VirtualBox:~$ ssh-keygen -t rsa -P ""
prince@prince-VirtualBox:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The second command adds the newly created key to the list of authorized keys so that
Hadoop can use ssh without prompting for a password.
hduser@prince-VirtualBox:/home/k$ ssh localhost
5. Install Hadoop
hduser@prince-VirtualBox:~$ wget
http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
hduser@prince-VirtualBox:~$ tar xvzf hadoop-2.6.0.tar.gz
We want to move the Hadoop installation to the /usr/local/hadoop directory using the
following command:
Oops!... We got:
This error can be resolved by logging in as a root user and then adding hduser to the sudo group:
hduser@prince-VirtualBox:~/hadoop-2.6.0$ su prince
Password:
prince@prince-VirtualBox:/home/hduser$ sudo adduser hduser sudo
Now that hduser has root privileges, we can move the Hadoop installation to the
/usr/local/hadoop directory without any problem:
prince@prince-VirtualBox:/home/hduser$ sudo su hduser
The following files will have to be modified to complete the Hadoop setup:
i. ~/.bashrc
ii. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
iii. /usr/local/hadoop/etc/hadoop/core-site.xml
iv. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
v. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
i. ~/.bashrc:
Before editing the .bashrc file in our home directory, we need to find the path where Java has
been installed, in order to set the JAVA_HOME environment variable, using the following command:
hduser@prince-VirtualBox:~$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.
ii. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Adding the above statement in the hadoop-env.sh file ensures that the value of
JAVA_HOME variable will be available to Hadoop whenever it is started up.
iii. /usr/local/hadoop/etc/hadoop/core-site.xml:
Open the file and enter the following in between the <configuration></configuration> tag:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
iv. /usr/local/hadoop/etc/hadoop/mapred-site.xml
hduser@prince-VirtualBox:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
v. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Before editing this file, we need to create two directories which will contain the namenode
and the datanode for this Hadoop installation.
This can be done using the following commands:
hduser@prince-VirtualBox:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@prince-VirtualBox:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
Open the file and enter the following content in between the <configuration></configuration>
tag:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
Now, the Hadoop file system needs to be formatted so that we can start to use it. The format
command should be issued with write permission, since it creates the current directory
under the /usr/local/hadoop_store/hdfs/namenode folder:
hduser@prince-VirtualBox:~$ hadoop namenode -format
Note that the hadoop namenode -format command should be executed once, before we start
using Hadoop.
If this command is executed again after Hadoop has been used, it will destroy all the data on the
Hadoop file system.
8. Starting Hadoop
Now it's time to start the newly installed single node cluster.
We can use start-all.sh or (start-dfs.sh and start-yarn.sh)
prince@prince-VirtualBox:~$ cd /usr/local/hadoop/sbin
prince@prince-VirtualBox:/usr/local/hadoop/sbin$ ls
prince@prince-VirtualBox:/usr/local/hadoop/sbin$ sudo su hduser
hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ start-all.sh
hduser@prince-VirtualBox:~$ start-all.sh
hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ jps
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
The output means that we now have a functional instance of Hadoop running on our virtual machine.
We run stop-all.sh or (stop-dfs.sh and stop-yarn.sh) to stop all the daemons running on our
machine:
hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ pwd
/usr/local/hadoop/sbin
hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ ls
hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ stop-all.sh
Let's start Hadoop again and see its Web UI:
hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ start-all.sh
RESULT:
Thus Hadoop and HDFS were installed, configured and run successfully on a single-node cluster.

EX.NO: 2 IMPLEMENT WORD COUNT / FREQUENCY PROGRAMS USING MAPREDUCE
DATE:

AIM:
To write a Java program that counts the number of occurrences of each word in a
text file using the MapReduce concept.
PROCEDURE:
1. Install Hadoop.
2. Start all the services using the following commands:
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ jps
3242 Jps
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
16/09/15 15:38:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-
prince-VirtualBox.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-
prince-VirtualBox.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-
secondarynamenode-prince-VirtualBox.out
16/09/15 15:39:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-
prince-VirtualBox.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-
prince-VirtualBox.out
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ jps
16098 NameNode
16214 DataNode
16761 NodeManager
16636 ResourceManager
16429 SecondaryNameNode
19231 Jps
PROGRAM CODING:
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ nano wordcount7.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
TO COMPILE:
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ hadoop com.sun.tools.javac.Main wordcount7.java
TO CREATE THE JAR FILE:
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ jar cf wc2.jar wordcount7*.class
TO EXECUTE:
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ hadoop jar wc2.jar wordcount7 /deepika/wc1.txt /deepika/out2
INPUT FILE:
wc1.txt
STEPS:
1. Open an editor and type WordCount program and save as WordCount.java
2. Set the path as export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
3. To compile the program, bin/hadoop com.sun.tools.javac.Main WordCount.java
4. Create a jar file, jar cf wc.jar WordCount*.class
5. Create input files input.txt,input1.txt and input2.txt and create a directory in hdfs, /mit/wordcount/input
6. Move these i/p files to hdfs system, bin/hadoop fs –put /opt/hadoop-2.7.0/input.txt /mit/wordcount/input/input.txt repeat this step for other two i/p files.
7. To execute, bin/hadoop jar wc.jar WordCount /mit/wordcount/input /mit/wordcount/output.
8. The mapreduce result will be available in the output directory.
OUTPUT:
/mit/wordcount/input 2
/mit/wordcount/input/input.txt 1
/mit/wordcount/output. 1
/opt/hadoop-2.7.0/input.txt 1
1. 1
2. 1
3. 1
4. 1
5. 1
6. 1
7. 1
8. 1
Create 2
HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar 1
Move 1
Open 1
STEPS: 1
Set 1
The 1
To 2
WordCount 2
WordCount*.class 1
WordCount.java 2
a 2
an 1
and 4
as 2
available 1
be 1
bin/hadoop 3
cf 1
com.sun.tools.javac.Main 1
compile 1
create 1
directory 1
directory. 1
editor 1
execute, 1
export 1
file, 1
files 2
files. 1
for 1
fs 1
hdfs 1
hdfs, 1
i/p 2
in 2
input 1
input.txt,input1.txt 1
input2.txt 1
jar 3
mapreduce 1
other 1
output 1
path 1
program 1
program, 1
repeat 1
result 1
save 1
step 1
system, 1
the 3
these 1
this 1
to 1
two 1
type 1
wc.jar 2
will 1
–put 1
RESULT:
Thus the word count program was implemented using MapReduce and executed successfully.
EX.NO: 3 IMPLEMENT AN MR PROGRAM THAT PROCESSES A WEATHER DATASET
DATE:

AIM:
To implement a MapReduce program that processes a weather dataset and reports the maximum and minimum temperature, with the time of occurrence, for each day.
ALGORITHM:
Step 1:
Step 2:
Step 3:
Step 4:
Step 5:
Step 6:
Step 7:
Step 8:
PROGRAM CODING:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
/**
* @author devinline
*/
public class CalculateMaxAndMinTemeratureWithTime {
public static String calOutputName = "California";
public static String nyOutputName = "Newyork";
public static String njOutputName = "Newjersy";
public static String ausOutputName = "Austin";
public static String bosOutputName = "Boston";
public static String balOutputName = "Baltimore";
while (strTokens.hasMoreElements()) {
if (counter == 0) {
date = strTokens.nextToken();
} else {
if (counter % 2 == 1) {
currentTime = strTokens.nextToken();
} else {
currnetTemp = Float.parseFloat(strTokens.nextToken());
if (minTemp > currnetTemp) {
minTemp = currnetTemp;
minTempANDTime = minTemp + "AND" + currentTime;
}
if (maxTemp < currnetTemp) {
maxTemp = currnetTemp;
maxTempANDTime = maxTemp + "AND" + currentTime;
}
}
}
counter++;
}
// Write to context - MinTemp, MaxTemp and corresponding time
Text temp = new Text();
temp.set(maxTempANDTime);
Text dateText = new Text();
dateText.set(date);
try {
con.write(dateText, temp);
} catch (Exception e) {
e.printStackTrace();
}
temp.set(minTempANDTime);
dateText.set(date);
con.write(dateText, temp);
}
}
public static class WhetherForcastReducer extends
Reducer<Text, Text, Text, Text> {
MultipleOutputs<Text, Text> mos;
if (counter == 0) {
reducerInputStr = value.toString().split("AND");
f1 = reducerInputStr[0];
f1Time = reducerInputStr[1];
}
else {
reducerInputStr = value.toString().split("AND");
f2 = reducerInputStr[0];
f2Time = reducerInputStr[1];
}
counter = counter + 1;
}
if (Float.parseFloat(f1) > Float.parseFloat(f2)) {
@Override
public void cleanup(Context context) throws IOException,
InterruptedException {
mos.close();
}
}
job.setMapperClass(WhetherForcastMapper.class);
job.setReducerClass(WhetherForcastReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleOutputs.addNamedOutput(job, calOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, nyOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, njOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, bosOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, ausOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, balOutputName,
TextOutputFormat.class, Text.class, Text.class);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
Path pathInput = new Path(
"hdfs://192.168.213.133:54310/weatherInputData/input_temp.txt");
Path pathOutputDir = new Path(
"hdfs://192.168.213.133:54310/user/hduser1/testfs/output_mapred3");
FileInputFormat.addInputPath(job, pathInput);
FileOutputFormat.setOutputPath(job, pathOutputDir);
try {
System.exit(job.waitForCompletion(true) ? 0 : 1);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}}}
OUTPUT:
Verify whether the output directory is in place on HDFS. Execute the following command to verify the same:
hduser@prince-VirtualBox:~$ hdfs dfs -ls /user/hduser1/testfs/output_mapred3
RESULT:
Thus the MapReduce program that processes a weather dataset was implemented and executed successfully.
EX.NO: 4 IMPLEMENT LINEAR AND LOGISTIC REGRESSION
DATE:
AIM:
To implement linear and logistic regression models in R.

IMPLEMENTATION:
The in-built data set "mtcars" describes different models of a car with their various engine
specifications. In the "mtcars" data set, the transmission mode (automatic or manual) is described
by the column am, which is a binary value (0 or 1). We can create a logistic regression model
between the column "am" and 3 other columns: hp, wt and cyl, as sketched below.
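A minimal R sketch covering both models is given below. The logistic model follows the description above (am against hp, wt and cyl); the linear model's choice of predicting mpg from wt is only an illustrative assumption, since the manual does not name the variables for the linear case.

# Linear regression (illustrative: predict mileage mpg from weight wt in mtcars)
linear_model <- lm(mpg ~ wt, data = mtcars)
summary(linear_model)                           # Coefficients, R-squared
predict(linear_model, data.frame(wt = 3.0))     # Predicted mpg for wt = 3.0 (3000 lbs)

# Logistic regression: transmission mode (am) from hp, wt and cyl
logit_model <- glm(am ~ hp + wt + cyl, data = mtcars, family = binomial)
summary(logit_model)                            # Coefficients and significance
# Predicted probability of a manual transmission for a hypothetical car
predict(logit_model, data.frame(hp = 120, wt = 2.8, cyl = 4), type = "response")

In glm(), family = binomial selects the logit link appropriate for the binary am column, and type = "response" returns the predicted probability rather than the log-odds.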
RESULT:
Thus linear and logistic regression models were implemented in R and the outputs were verified.
EX.NO: 5 IMPLEMENT SVM / DECISION TREE CLASSIFICATION TECHNIQUES
DATE:
AIM:
To implement SVM and decision tree classification techniques in R.
IMPLEMENTATION: (SVM)
To use SVM in R, we have the package e1071. The package is not preinstalled, hence
one needs to run install.packages("e1071") to install it and then import the package
contents using the library command: library(e1071).
R CODE:
x <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y <- c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)
train <- data.frame(x, y)

# Linear regression
model <- lm(y ~ x, train)

# SVM regression
library(e1071)
model_svm <- svm(y ~ x, train)
pred <- predict(model_svm, train)

# Plot the predictions over the data to see our model fit
plot(train, pch = 16)
points(train$x, pred, col = "blue", pch = 4)

# The linear model has a residuals part which we can extract and directly calculate RMSE
error <- model$residuals
lm_error <- sqrt(mean(error^2))      # 3.832974

# For svm, we have to manually calculate the difference between the actual values (train$y) and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2))   # 2.696281

# Tune the SVM over a grid of epsilon and cost values
# (illustrative grid; the reported best parameters were epsilon = 0, cost = 8)
svm_tune <- tune(svm, y ~ x, data = train,
                 ranges = list(epsilon = seq(0, 1, 0.1), cost = 2^(2:9)))
plot(svm_tune)

# Refit with the best parameters and plot the improved fit
best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)
plot(train, pch = 16)
points(train$x, best_mod_pred, col = "blue", pch = 4)
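The code above fits SVM regression models on a small synthetic data set. For classification, which this experiment also targets, a minimal sketch is shown below; the use of the built-in iris data set and the radial kernel are illustrative assumptions, not part of the original manual.

library(e1071)

# Split iris into training and test sets (illustrative 80/20 split)
set.seed(1)
idx <- sample(nrow(iris), 0.8 * nrow(iris))
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# Fit an SVM classifier; since Species is a factor, svm() performs classification
svm_clf <- svm(Species ~ ., data = train_set, kernel = "radial")

# Evaluate on the held-out data with a confusion matrix
pred_cls <- predict(svm_clf, test_set)
table(Predicted = pred_cls, Actual = test_set$Species)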
OUTPUT :
IMPLEMENTATION: (DECISION TREE)
Use the command below in the R console to install the package. You also have to install the
dependent packages, if any.
install.packages("party")
The basic syntax for creating a decision tree in R is
ctree(formula, data)
Input data:
We will use the R in-built data set named readingSkills to create a decision tree. For each
person it records the variables "age", "shoeSize" and "score", and whether the person is a
native speaker or not.
# Load the party package. It will automatically load other dependent packages.
library(party)
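A minimal sketch of the decision tree is given below. It predicts nativeSpeaker from the remaining readingSkills variables; fitting on the full data set (rather than a separate training sample) is an illustrative simplification.

# Load the readingSkills data set shipped with the party package (loaded above)
data("readingSkills")
head(readingSkills)

# Build the tree: predict nativeSpeaker from age, shoeSize and score
output_tree <- ctree(nativeSpeaker ~ age + shoeSize + score,
                     data = readingSkills)
print(output_tree)   # Text summary of the fitted tree
plot(output_tree)    # Visualize the decision tree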
RESULT:
Thus SVM and decision tree classification techniques were implemented in R and the outputs were verified.
EX.NO:6 IMPLEMENT CLUSTERING TECHNIQUES
DATE:
AIM:
To implement clustering techniques in R.
PROGRAM CODING:
install.packages("factoextra")
install.packages("cluster")
install.packages("magrittr")
library("cluster")
library("factoextra")
library("magrittr")
Data preparation
data("USArrests")
my_data <- USArrests %>%
  na.omit() %>%   # Remove missing values (NA)
  scale()         # Scale variables
head(my_data, n = 3)
Distance measures
set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data, ellipse.type = "convex",
             palette = "jco", ggtheme = theme_minimal())
MODEL-BASED CLUSTERING:
library("MASS") data("geyser")
library("mclust")
data("diabetes")
head(diabetes, 3)
Model-based clustering can be computed using the function Mclust() as follows:
library(mclust)
df <- scale(diabetes[, -1]) # Standardize the data
mc <- Mclust(df) # Model-based-clustering
summary(mc) # Print a summary
mc$G                         # Optimal number of clusters => 3
head(mc$z, 30)               # Probability of belonging to a given cluster
head(mc$classification, 30)  # Cluster assignment of each observation
VISUALIZING MODEL-BASED CLUSTERING
library(factoextra)
fviz_mclust(mc, "classification", geom = "point")  # Cluster classification
fviz_mclust(mc, "uncertainty")                     # Classification uncertainty
RESULT:
Thus clustering techniques were implemented in R and the outputs were verified.

EX.NO: 7 VISUALIZE DATA USING ANY PLOTTING FRAMEWORK
DATE:

AIM:
To visualize data using a plotting framework in R.
IMPLEMENTATION:
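No plotting code is given in the manual for this experiment, so the sketch below is a minimal example. The choice of the ggplot2 package and the built-in mtcars data set are assumptions; any plotting framework and data set could be used instead.

# install.packages("ggplot2")   # Install once
library(ggplot2)

# Scatter plot: car weight vs mileage, coloured by number of cylinders
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(title = "Mileage vs Weight", x = "Weight (1000 lbs)",
       y = "Miles per gallon", colour = "Cylinders") +
  theme_minimal()

# Histogram of horsepower
ggplot(mtcars, aes(x = hp)) +
  geom_histogram(bins = 10, fill = "steelblue", colour = "black") +
  labs(title = "Distribution of horsepower", x = "Horsepower", y = "Count")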
RESULT:
Thus data was visualized using a plotting framework in R.
EX.NO: 8 IMPLEMENT AN APPLICATION THAT STORES BIG DATA IN HBASE / MONGODB / PIG USING HADOOP / R
DATE:

AIM:
To implement an application that stores big data in HBase / MongoDB / Pig using Hadoop / R.
PROGRAM CODING:
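No program is given in the manual for this experiment, so the following is a minimal sketch of the MongoDB option using the mongolite R package. It assumes a MongoDB server is running on localhost at the default port, and the database and collection names ("bigdata", "mtcars") are illustrative; an HBase or Pig based version would use different tools.

# install.packages("mongolite")   # Install once
library(mongolite)

# Connect to a local MongoDB instance (assumed running on the default port 27017)
coll <- mongo(collection = "mtcars", db = "bigdata",
              url = "mongodb://localhost:27017")

# Store a data set in MongoDB
coll$insert(mtcars)

# Verify the stored data
coll$count()                 # Number of documents stored
coll$find('{"cyl": 6}')      # Query: all cars with 6 cylinders
head(coll$find())            # Retrieve and preview all stored documents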
OUTPUT:
RESULT:
Thus an application that stores big data was implemented using MongoDB and R, and the output was verified.