
DS With R Lab Record

The program simulates filtering of data from the diamonds dataset using R. It applies three separate filters: diamonds with cut equal to 'Ideal', diamonds with price greater than 2000, and diamonds with cut not equal to 'Ideal'. This demonstrates how to use the filter function in R to subset data based on conditional logic.

Uploaded by

PEDARLA MANOJ SAI KUMAR


Vishnupur, Narsapur, Medak District

Lab Record of

DATA SCIENCE WITH R LAB

Name :

Roll No :

Branch : Section:
B V Raju Institute of Technology
(UGC - Autonomous)
Vishnupur, Narsapur, Medak District-502313.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

Certified that this is a bonafide record of lab work done by

Mr./Ms. ……………..………………….. bearing Roll No. ………………… of ..…...

Year ….. Semester B.Tech. .……… in the ……….………………….... during the

academic year …………..…..

Number of Programs done:

Date:

Internal Examiner External Examiner


STUDENT PROFILE
NAME :

ROLL NO :

YEAR :

SEMESTER :

B.TECH :

BATCH :

LAB NAME :

TOTAL NO. OF PROGRAMS :

TOTAL NO. OF PROGRAMS DONE :

TOTAL CONDUCTED LAB CLASSES :

ATTENDED NO. OF LAB CLASSES :

ACADEMIC YEAR :

Signature of Staff Signature of HOD

INDEX

S.No.  Name of the Program  Date  Page No.  Signature

1. Write a program to illustrate the working of countdown function
2. Write a program to simulate random number game
3. Calculate the median of numbers using a vector
4. Write a program to compute Matrix Multiplication
5. Write a program which simulates Filtering of Data Using R
6. Write a program which simulates Match and Sample in R
7. Implement Logistic Regression in R
8. Write a program to write into CSV in R
9. Write a program to illustrate reading of .CSV file
10. Write a program to illustrate the use of cat() function with R code
11. Write a program to illustrate saving a plot as a PDF image
12. Perform KNN algorithm in R
13. Perform naïve Bayes classifier algorithm
14. Perform decision tree algorithm
15. Perform PCA in R
Subject: DATA SCIENCE WITH R LAB Program No : 1
Roll No: Date:
Write a program to illustrate the working of countdown function.

Aim : to implement a program to illustrate the working of countdown function

Code:

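# Print from, from - 1, ..., 0, pausing one second between numbers
countdown <- function(from)
{
  print(from)
  # from > 0 (rather than from != 0) also stops for negative or fractional input
  while (from > 0)
  {
    Sys.sleep(1)
    from <- from - 1
    print(from)
  }
}
countdown(5)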

Output:

BVRIT DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Page No:

Subject: DATA SCIENCE WITH R LAB Program No : 2
Roll No: Date:
Write a Program to simulate random number game

Aim : to implement a program to simulate random number game.

Code:
# Read a non-negative integer from the keyboard, asking again on invalid input
readinteger <- function()
{
  n <- readline(prompt = "Enter an integer: ")
  if (!grepl("^[0-9]+$", n))
  {
    return(readinteger())
  }
  return(as.integer(n))
}

# Secret number between 0 and 100
num <- round(runif(1) * 100, digits = 0)
guess <- -1

cat("Guess a number between 0 and 100.\n")

while (guess != num)
{
  guess <- readinteger()
  if (guess == num)
  {
    cat("Congratulations,", num, "is right.\n")
  }
  else if (guess < num)
  {
    cat("It's bigger!\n")
  }
  else if (guess > num)
  {
    cat("It's smaller!\n")
  }
}
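
One detail of the number generation is worth checking: round(runif(1) * 100) reaches 0 and 100 only about half as often as the values in between, because each endpoint collects values from a half-width interval. A minimal sketch of an alternative, assuming a uniform draw over the integers 0 to 100 is wanted, uses sample(). The helper name give_hint is hypothetical; it only wraps the comparison logic of the game so that it can be checked without keyboard input.

```r
# Draw the secret number uniformly from the integers 0..100
set.seed(42)                    # seed only to make the sketch repeatable
num <- sample(0:100, 1)

# Hypothetical helper: return the hint the game would print for a guess
give_hint <- function(guess, num) {
  if (guess == num) {
    "right"
  } else if (guess < num) {
    "bigger"
  } else {
    "smaller"
  }
}

give_hint(num - 1, num)   # "bigger"
give_hint(num + 1, num)   # "smaller"
give_hint(num, num)       # "right"
```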


Output:


Subject: DATA SCIENCE WITH R LAB Program No : 3
Roll No: Date:
Calculate the median of number using vector

Aim : to calculate the median of a numeric vector. The median() function in R returns the median of the
elements of the numeric vector passed as its argument.

Syntax: median(x, na.rm = FALSE)

Parameters:
x: Numeric vector
na.rm: Logical value; if TRUE, NA values are removed before the median is computed

Code: # R program to calculate median of a vector

x <- c(2, 3, 5, NA, 4, NA, 6)

# Calling median() function with NA values included (the result is NA)
median(x, na.rm = FALSE)

# Calling median() function with NA values excluded
median(x, na.rm = TRUE)
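
To make explicit what median() computes, the sketch below recomputes it by hand: sort the values, then take the middle one (odd length) or the mean of the two middle ones (even length). The helper name manual_median is hypothetical and is meant only as an illustration, not as a replacement for median().

```r
# Hypothetical helper: median computed from the sorted values
manual_median <- function(v) {
  v <- sort(v)                        # sort() drops NA values by default
  n <- length(v)
  if (n == 0) return(NA_real_)
  if (n %% 2 == 1) {
    v[(n + 1) / 2]                    # odd length: the middle value
  } else {
    mean(v[c(n / 2, n / 2 + 1)])      # even length: mean of the two middle values
  }
}

x <- c(2, 3, 5, NA, 4, NA, 6)
manual_median(x)                # 4, same as median(x, na.rm = TRUE)
manual_median(c(1, 2, 3, 10))   # 2.5
median(x)                       # NA, because na.rm defaults to FALSE
```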

Output:


Subject: DATA SCIENCE WITH R LAB Program No : 4
Roll No: Date:
Write a program to compute Matrix Multiplication

Aim : to implement a program to compute matrix multiplication

Matrix multiplication is one of the most useful matrix operations. It is widely used in areas such as
network theory and the transformation of coordinates. A matrix in R can be created using the matrix()
function, which takes an input vector, nrow, ncol, byrow and dimnames as arguments.
Creating a matrix
A matrix can be created using the matrix() function.
n <- matrix(8:15, nrow=2)
Multiplication of Matrices
The multiplication operator * multiplies a matrix by a scalar or multiplies two matrices element by
element. The matrix product in the linear-algebra sense is computed with the %*% operator instead.
Multiplication with scalar
If you multiply a matrix by a scalar value, every element of the matrix is multiplied by that scalar.
Multiplication between Matrices
When two matrices are multiplied with *, the multiplication is element-wise: each element of the first
matrix is multiplied by the corresponding element of the second, so both matrices must have the same
dimensions.
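
The difference between the element-wise operator * and the matrix product operator %*% can be seen on two small matrices. This is a minimal sketch; the names a, b and p are chosen here for illustration only.

```r
a <- matrix(1:6, nrow = 2)      # 2 x 3, filled column by column
b <- matrix(1:6, nrow = 3)      # 3 x 2

a * a        # element-wise product, needs equal dimensions
a %*% b      # matrix product: (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix

# Each entry of the matrix product is a row-by-column dot product
p <- a %*% b
p[1, 1] == sum(a[1, ] * b[, 1])   # TRUE
```

For %*% the number of columns of the left matrix must equal the number of rows of the right matrix, whereas * requires both matrices to have identical dimensions.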
Code:
# R program for matrix multiplication with a scalar
m <- matrix(1:8, nrow=2)
m <- 2*m
print(m)
# Element-wise multiplication of two 2 x 4 matrices
m <- matrix(1:8, nrow=2)
n <- matrix(8:15, nrow=2)
print(m*n)
Output :
Subject: DATA SCIENCE WITH R LAB Program No : 5
Roll No: Date:

Write a program which simulates Filtering of Data Using R

Aim: to implement a program which simulates filtering of data using R.

Data: diamonds

## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows

Code:
library(dplyr)    # provides filter()
library(ggplot2)  # provides the diamonds data set

filter(diamonds, cut == 'Ideal')

filter(diamonds, price > 2000)

filter(diamonds, cut != 'Ideal')

Output:
## # A tibble: 29,733 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
## 2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
## 3 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5
## 4 0.71 Very Good E VS2 62.4 57 2759 5.68 5.73 3.56
## 5 0.78 Very Good G SI2 63.8 56 2759 5.81 5.85 3.72
## 6 0.7 Good E VS2 57.5 58 2759 5.85 5.9 3.38
## 7 0.7 Good F VS1 59.4 62 2759 5.71 5.76 3.4
## 8 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
## 9 0.73 Very Good E SI1 61.6 59 2760 5.77 5.78 3.56
## 10 0.8 Premium H SI1 61.5 58 2760 5.97 5.93 3.66
## # … with 29,723 more rows

Here, we select only the diamonds where the price is greater than 2000.

## # A tibble: 32,389 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## 10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## # … with 32,379 more rows

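And here, we select all the diamonds whose cut is NOT equal to 'Ideal'. This is the exact opposite of the
first filter (cut == 'Ideal'): the two results together account for all 53,940 rows, so the 'Ideal' filter
returns 53,940 - 32,389 = 21,551 rows.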
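
The complement relationship between the == and != filters can be checked on a small example. The sketch below uses a tiny hypothetical data frame (gems, not the real diamonds data) and base R logical indexing, which expresses the same row conditions as filter(). It also shows two conditions combined with &.

```r
# Tiny hypothetical stand-in for the diamonds data
gems <- data.frame(
  cut   = c("Ideal", "Premium", "Good", "Ideal", "Fair"),
  price = c(326, 2757, 2759, 2760, 337),
  stringsAsFactors = FALSE
)

ideal     <- gems[gems$cut == "Ideal", ]      # like filter(gems, cut == "Ideal")
not_ideal <- gems[gems$cut != "Ideal", ]      # like filter(gems, cut != "Ideal")
pricey    <- gems[gems$price > 2000, ]        # like filter(gems, price > 2000)

# The == and != filters split the rows into two disjoint groups
nrow(ideal) + nrow(not_ideal) == nrow(gems)   # TRUE

# Two conditions at once: Ideal cut AND price above 2000
both <- gems[gems$cut == "Ideal" & gems$price > 2000, ]
both
```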
Subject: DATA SCIENCE WITH R LAB Program No : 6
Roll No: Date:
Write a Program which simulates Match and Sample in R.

Aim : to implement a program which simulates match and sample in R

The match() function returns a vector with the position of the first occurrence of each element of vector1 in
vector2. If an element of vector1 does not exist in vector2, NA is returned for it.

Syntax: match(vector1, vector2, nomatch = NA_integer_, incomparables = NULL)

vector1: vector, the values to be matched
vector2: vector, the values to be matched against
nomatch: the value which should be returned when no match is found
incomparables: the vector of values that cannot be matched.

By default, nomatch is NA, so NA is returned when no match is found in vector2.

According to the R documentation, the %in% operator is defined in terms of match(): vector1 %in% vector2 is
match(vector1, vector2, nomatch = 0) > 0. It returns a logical vector which indicates whether a match was
located for each element of vector1 in vector2. The result values are either TRUE or FALSE, never NA, so the
%in% operator is useful in if conditions.

Syntax: vector1 %in% vector2

Code:
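
A minimal sketch of match() and %in% on two small character vectors follows. The vectors are made up for illustration.

```r
vector1 <- c("b", "z", "a")
vector2 <- c("a", "b", "c", "b")

match(vector1, vector2)                # 2 NA 1  (first position of each value)
match(vector1, vector2, nomatch = 0)   # 2 0 1   (0 instead of NA when not found)
vector1 %in% vector2                   # TRUE FALSE TRUE

# %in% never yields NA, so it is safe inside if()
if ("z" %in% vector2) {
  found <- "found"
} else {
  found <- "not found"
}
found                                  # "not found"
```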

Sample:
Definition: The sample() function takes a random sample or a random permutation of a data object.

Basic R Syntax: The basic R programming syntax of the sample function is shown below.
sample(values, size_of_subsample) # Basic syntax of sample

Typical uses include drawing a subsample without replacement, permuting a whole vector, and sampling with
replacement or with unequal probabilities.

Code:
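
A short sketch of common sample() calls follows. The vector values and the seed are assumptions made for illustration; the drawn numbers depend on the seed and the R version, so none are shown.

```r
set.seed(123)                      # seed only to make the sketch repeatable
values <- 1:10

s1 <- sample(values, 3)                    # 3 distinct values, without replacement
s2 <- sample(values)                       # random permutation of all 10 values
s3 <- sample(values, 15, replace = TRUE)   # replacement allows size > length(values)
s4 <- sample(c("H", "T"), 5, replace = TRUE,
             prob = c(0.8, 0.2))           # weighted draw, "H" is four times as likely
```

Without replace = TRUE, asking for more elements than the vector holds is an error.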


Subject: DATA SCIENCE WITH R LAB Program No : 7
Roll No: Date:
Implement Logistic Regression in R.

Aim: to implement logistic regression in R.

Code:
install.packages("ROCR")
install.packages("caTools")
install.packages("caret")
install.packages("ggplot2")
install.packages("car")
library(ROCR)
library(ggplot2)
library(caret)
library(caTools)
library(car)

# Load the data. Adjust the path to wherever Titanic.csv is stored;
# na.strings = "" turns empty fields into NA.
Titanic=read.csv("Titanic.csv",header = TRUE,na.strings = "")
# Alternatively, pick the file interactively:
#Titanic=read.csv(file.choose(),header = TRUE,na.strings = "")
View(Titanic)
summary(Titanic)
str(Titanic)
sum(is.na(Titanic))

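# Baseline model on the raw data, before any cleaning
set.seed(123)
idx=sample(nrow(Titanic),nrow(Titanic)*0.7)
train=Titanic[idx,]
test=Titanic[-idx,]

# family="binomial" is required for logistic regression; without it glm() fits a linear model
model=glm(survived~pclass+sex+age+sibsp+parch+embarked,data=train,family="binomial")
summary(model)
# type="response" returns predicted probabilities rather than log-odds
pred=predict(model,newdata=test,type="response")
pred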

library(Amelia)
missmap(Titanic)

is.na(Titanic)
any(is.na(Titanic))
sum(is.na(Titanic))
colSums(is.na(Titanic))
table(is.na(Titanic))
prop.table(table(is.na(Titanic)))
prop.table(table(is.na(Titanic$body)))
prop.table(table(is.na(Titanic$cabin)))
prop.table(table(is.na(Titanic$boat)))
prop.table(table(is.na(Titanic$home.dest)))
prop.table(table(is.na(Titanic$age)))

names(Titanic)
#Titanic[,c("name","ticket","cabin","boat","body","home.dest")]
Titanic1=Titanic[,-c(3,8,10,12,13,14)]
sum(is.na(Titanic1))
colSums(is.na(Titanic1))
missmap(Titanic1)

na.omit(Titanic1$pclass)
na.omit(Titanic1$survived)
Titanic1[1310,]
Titanic2=Titanic1[-1310,]
colSums(is.na(Titanic2))

# After imputing the missing values, convert the column to numeric
# before drawing a boxplot:
#Titanic2$age=as.numeric(Titanic2$age)

#summary(Titanic2$age)
boxplot(Titanic2)
boxplot(Titanic2$age)
bench=quantile(Titanic2$age,0.75,na.rm=T)+
1.5*IQR(Titanic2$age,na.rm=T)
bench
#bench=quantile(Titanic2$age,0.99,na.rm=T)
Titanic2$age[Titanic2$age>bench]=bench
boxplot(Titanic2$age)

boxplot(Titanic2$fare)
bench=quantile(Titanic2$fare,0.75,na.rm=T)+
1.5*IQR(Titanic2$fare,na.rm=T)
bench
Titanic2$fare[Titanic2$fare>bench]=bench
boxplot(Titanic2$fare)

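library(Hmisc)
Titanic2$age=impute(Titanic2$age,mean)
Titanic2$fare=impute(Titanic2$fare,median)
# mode() returns the storage mode of a vector, not its most frequent value,
# so the most frequent port is computed explicitly and imputed as a constant
top_port=names(which.max(table(Titanic2$embarked)))
Titanic2$embarked=impute(Titanic2$embarked,top_port)
colSums(is.na(Titanic2))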

missmap(Titanic2)

table(Titanic2$survived)
# If we want percentages
prop.table(table(Titanic2$survived))

plot(Titanic2)
#install.packages("corrplot")
library(corrplot)
str(Titanic2)
corrplot(cor(Titanic2[,-c(3,8)]))
cor(Titanic2[,-c(3,8)])
names(Titanic2)

set.seed(123)
idx=sample(nrow(Titanic2),nrow(Titanic2)*0.7)
train=Titanic2[idx,]
test=Titanic2[-idx,]

model1=glm(survived~.,data=train,family="binomial")
summary(model1)

model2=glm(survived~.-parch,data=train,family="binomial")
summary(model2)

model3=glm(survived~.-parch-fare,data=train,family="binomial")
summary(model3)

model3$coefficients
model3$residuals
model3$xlevels
model3$rank

attach(Titanic2)
names(Titanic2)
str(Titanic)
model4=glm(survived~pclass+sex+age+sibsp+parch+fare+
embarked,
data=train,family="binomial")
summary(model4)

predict = predict(model2, newdata=test,

type = "response")
predict
head(predict)
head(test$survived)

table(ActualValue=test$survived, PredictedValue=predict)
table(ActualValue=test$survived,PredictedValue=predict>0.5)

library(caret)
#confusionMatrix(table(ActualValue=test$survived,PredictedValue=predict>0.5))
pred1=ifelse(predict>0.56,1,0)
#install.packages("caret")
library(caret)
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1))
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1),
positive = "1")

table(ActualValue=test$survived,PredictedValue=predict>0.56)
table(test$survived,predict>0.59)

table(ActualValue=test$survived,PredictedValue=predict>0.58)
pred1=ifelse(predict>0.58,1,0)
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1),
positive = "1")

set.seed(123) # Set seed for reproducible results

idx = sample(nrow(Titanic2),nrow(Titanic2)*0.8) #create an index
idx
train = Titanic2[ idx,] #create training set
test = Titanic2[ -idx,] #creating testing set

train
test

model3=glm(survived ~ age+sex+pclass+sibsp+parch,
data=train, family="binomial")
summary(model3)

model4=glm(survived~age+sex+pclass+sibsp+parch+fare,
data=train, family="binomial")
summary(model4)

# ticket was dropped when Titanic1 was built, so it is not available as a predictor here
#model5=glm(survived~age+sex+pclass+sibsp+parch+fare+ticket,data=train, family="binomial")
#summary(model5)

predict = predict(model3, newdata=test,

type = "response")
predict

head(predict)
head(test$survived)

table(ActualValue=test$survived, PredictedValue=predict)
table(ActualValue=test$survived,PredictedValue=predict>0.58)

pred1=ifelse(predict>0.58,1,0)
#install.packages("caret")
library(caret)
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1))
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1),
positive = "1")
table(ActualValue=test$survived,PredictedValue=predict>0.56)
table(test$survived,predict>0.59)
table(ActualValue=test$survived,PredictedValue=predict>0.5)

accuracy=(116+57)/(116+12+29+57)
accuracy
precision=57/(57+12)
precision
recall=57/(57+29)
recall
F1Score=2*(precision*recall)/(precision+recall)
F1Score

#multi collinearity
install.packages("car")
library(car)
vif(model2)

#RoC curve to calculate threshold value

install.packages("ROCR")
library(ROCR)

#predict=predict(model2,newdata=test,type="response")
pred=prediction(predict,test$survived)
pred
perf=performance(pred,"tpr","fpr")
perf
#dev.off()
plot(perf)
plot(perf,colorize=T)
#abline(h=0.4,v=0.25)
#abline(a=0,b=1)
plot(perf,colorize=T,print.cutoffs.at=seq(0,1,0.1))

#max=which.max(slot(eval,"y.values")[[1]])
auc=performance(pred,"auc")

auc

#Complete separation state

y<- c(0,0,0,0,1,1,1,1)
x1<-c(1,2,3,3,5,6,10,11)
x2<-c(3,2,-1,-1,2,4,1,0)
m1<- glm(y~ x1+x2, family=binomial)
#Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(m1)

#quasi separation state

y<- c(0,0,0,0,1,1,1,1,1,1)
x1<-c(1,2,3,3,3,4,5,6,10,11)
x2<-c(3,0,-1,4,1,0,2,7,3,4)
m1<- glm(y~ x1+x2, family=binomial)

#Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(m1)
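
The accuracy, precision, recall and F1 score in the listing are computed from hard-coded counts (116, 12, 29, 57). A minimal sketch of deriving the same metrics from a confusion matrix is given below. The matrix layout (rows are actual classes, columns are predicted classes) is an assumption inferred from those formulas, and the object name cm is chosen here for illustration.

```r
# Confusion matrix with the counts used in the listing;
# rows = actual class, columns = predicted class (assumed layout)
cm <- matrix(c(116, 29, 12, 57), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

TN <- cm["0", "0"]   # 116 true negatives
FN <- cm["1", "0"]   # 29 false negatives
FP <- cm["0", "1"]   # 12 false positives
TP <- cm["1", "1"]   # 57 true positives

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
F1Score   <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, F1 = F1Score), 4)
```

Reading the counts from the table returned by table() avoids retyping them whenever the split or the threshold changes.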

Output :


Subject: DATA SCIENCE WITH R LAB Program No : 8
Roll No: Date:
Write a program to write into csv in R
Aim : to implement a program to write into csv in R.
There are several arguments for the write.csv(...) function call, a few of which are particularly important for
how the data are exported.
Syntax
# Export the data. The write.csv() function requires a minimum of two
# arguments, the data to be saved and the name of the output file.

Code:
# carSpeeds is the data frame read from 'data/car-speeds.csv' (see Program No : 9)
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')
If you open the file, you’ll see that it has header names, because the data had headers within R, but that
there are numbers in the first column. These are the row names of the data frame; passing row.names = FALSE
leaves them out.
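
A minimal sketch of writing a data frame without the row-number column and reading it back follows. The small carSpeeds data frame and the temporary output path are stand-ins made up for illustration.

```r
# Hypothetical stand-in for the carSpeeds data frame
carSpeeds <- data.frame(
  Color = c("Blue", "Red", "Blue"),
  Speed = c(32, 45, 35),
  State = c("NewMexico", "Arizona", "Colorado")
)

out_file <- tempfile(fileext = ".csv")   # temporary path, for illustration only

# row.names = FALSE leaves out the column of row numbers
write.csv(carSpeeds, file = out_file, row.names = FALSE)

readLines(out_file)[1]          # header line: "Color","Speed","State"
back <- read.csv(out_file)
names(back)                     # "Color" "Speed" "State"
```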

Output: data written successfully


Subject: DATA SCIENCE WITH R LAB Program No : 9
Roll No: Date:
Write a program to illustrate reading of .csv file.

Aim : to implement a program to illustrate reading of .csv files.

Code:
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
head(carSpeeds)

Output:
  Color Speed     State
1  Blue    32 NewMexico
2   Red    45   Arizona
3  Blue    35  Colorado
4 White    34   Arizona
5   Red    25   Arizona
6  Blue    41   Arizona


Subject: DATA SCIENCE WITH R LAB Program No : 10
Roll No: Date:
Write a program to illustrate the use of cat() function with R code

Aim : to implement a program to illustrate the use of cat() function with R code.
The cat() function in R is used to print output to the screen or to a file.

Syntax:
cat(..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
Parameters:
...: atomic vectors, names, NULL and objects with no output
file: the file in which printing will be done
sep: specified separator
fill: if TRUE, the output is broken into lines of the console width and ends with a newline; if FALSE
(the default), only explicit "\n" newlines are printed
labels: specified labels
append: if TRUE, output is added to the end of the file instead of overwriting it

Code:
# R program to illustrate cat function

# Creating some strings and printing them
x <- "GeeksforGeeks\n"
y <- "Geeks\n"

# Calling cat() function
cat(x)
cat(y)

# Creating a sequence from 1 to 9
x <- 1:9

# Calling cat() function with different separators
cat(x, sep =" + ")
cat("\n")
cat(x, sep =" / ")
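
The file and append arguments listed under Parameters are not exercised in the listing. A minimal sketch, assuming a temporary file is an acceptable target, could look like this. The name log_file is chosen here for illustration.

```r
log_file <- tempfile(fileext = ".txt")   # temporary path, for illustration only

cat("first line\n", file = log_file)                      # creates or overwrites the file
cat("second", "line\n", file = log_file, append = TRUE)   # appends; sep defaults to " "
cat(1:3, sep = ", ", file = log_file, append = TRUE)
cat("\n", file = log_file, append = TRUE)

readLines(log_file)   # "first line" "second line" "1, 2, 3"
```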

Output:


Subject: DATA SCIENCE WITH R LAB Program No : 11
Roll No: Date:
Write a program to illustrate to save a plot as pdf image.

Aim : to implement a program to illustrate to save the plot as pdf image.

Code:

# Note: this listing is Python (matplotlib), not R
import matplotlib.pyplot as plt

year = [2014, 2015, 2016, 2017, 2018, 2019]
tutorial_count = [39, 117, 111, 110, 67, 29]

plt.plot(year, tutorial_count, color="#6c3376", linewidth=3)
plt.xlabel('Year')
plt.ylabel('Number of futurestud.io Tutorials')

To export a graph with matplotlib, call .savefig(path). matplotlib infers the file type from the extension of
the path passed in.

This will save the plot in line_plot.pdf.

plt.savefig('line_plot.pdf')
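
Since this is an R lab, the same plot can also be saved from R. The sketch below is one possible R equivalent of the listing, not a transcription of it: it uses the base graphics pdf() device, and the output path under tempdir() is an assumption made for illustration.

```r
year           <- c(2014, 2015, 2016, 2017, 2018, 2019)
tutorial_count <- c(39, 117, 111, 110, 67, 29)

pdf_file <- file.path(tempdir(), "line_plot.pdf")   # illustrative output path

pdf(pdf_file, width = 7, height = 5)   # open the PDF graphics device
plot(year, tutorial_count, type = "l", col = "#6c3376", lwd = 3,
     xlab = "Year", ylab = "Number of futurestud.io Tutorials")
dev.off()                              # close the device; this writes the file

file.exists(pdf_file)                  # TRUE
```

Everything drawn between pdf() and dev.off() goes into the file; forgetting dev.off() leaves the PDF incomplete.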
Output :


Subject: DATA SCIENCE WITH R LAB Program No : 12
Roll No: Date:
Perform KNN Algorithm in R.

Aim: to implement KNN algorithm in R.

Code: # Installing Packages

install.packages("e1071")
install.packages("caTools")
install.packages("class")

# Loading package
library(e1071)
library(caTools)
library(class)

# Loading data
data(iris)
head(iris)

# Splitting data into train and test data.
# sample.split() expects the label vector; passing the whole data frame
# would split by column and recycle the mask over the rows.
set.seed(120) # for a reproducible split
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)

# Feature Scaling (the test set reuses the training means and standard deviations)
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4],
                    center = attr(train_scale, "scaled:center"),
                    scale = attr(train_scale, "scaled:scale"))

# Fitting KNN Model to training dataset
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$Species,
                      k = 1)
classifier_knn

# Confusion Matrix
cm <- table(test_cl$Species, classifier_knn)
cm

# Model Evaluation - Choosing K

# Calculate out of Sample error
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

# K = 3
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 3)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

# K = 5
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 5)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

# K = 7
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 7)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

# K = 15
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 15)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))

# K = 19
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 19)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))
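
The near-identical blocks for each value of k can be condensed into a single loop. The sketch below is one way to do that, under some assumptions: it uses a plain random 70/30 row split with base R sample() instead of sample.split(), and the names idx, train_x, test_x and k_values are chosen here for illustration. The accuracies it prints depend on the seed, so none are shown.

```r
library(class)   # provides knn()

set.seed(1)      # seed only to make the sketch repeatable
idx   <- sample(nrow(iris), size = round(0.7 * nrow(iris)))  # random 70/30 row split
train <- iris[idx, ]
test  <- iris[-idx, ]

# Scale the test set with the training means and standard deviations
train_x <- scale(train[, 1:4])
test_x  <- scale(test[, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Fit one KNN model per k and record the share of correct predictions
k_values <- c(1, 3, 5, 7, 15, 19)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train$Species, k = k)
  mean(pred == test$Species)
})
data.frame(k = k_values, accuracy = accuracy)
```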

Output :
• Model classifier_knn(k=1):

The KNN model is fitted with the training data, the test data and a value of k. The Species column of the
training data supplies the class labels.
• Confusion Matrix:

In the recorded run the test set held 20 flowers of each species. All 20 setosa are correctly classified as
setosa. Out of 20 versicolor, 17 are correctly classified as versicolor and 3 are classified as virginica. Out
of 20 virginica, 17 are correctly classified as virginica and 3 are classified as versicolor.
• Model Evaluation:
(k=1)

With k = 1 the model achieved 90% accuracy.

(K=3)

With k = 3 the model achieved 88.33% accuracy, which is lower than with k = 1.
(K=5)

With k = 5 the model achieved 91.66% accuracy, which is higher than with k = 1 and 3.
(K=7)

With k = 7 the model achieved 93.33% accuracy, which is higher than with k = 1, 3 and 5.
(K=15)

With k = 15 the model achieved 95% accuracy, which is higher than with k = 1, 3, 5 and 7.
(K=19)

With k = 19 the model achieved 95% accuracy, which is higher than with k = 1, 3, 5 and 7 and the same as
with k = 15, so increasing k beyond 15 did not improve the accuracy in this run.
K Nearest Neighbor is widely used in industry.


Subject: DATA SCIENCE WITH R LAB Program No : 13
Roll No: Date:
Perform Naïve Bayes classifier algorithm

Aim : to implement naïve bayes classifier algorithm

Code:
install.packages("e1071")
install.packages("caTools")
install.packages("caret")

# Loading package
library(e1071)
library(caTools)
library(caret)

# Splitting data into train and test data.
# sample.split() expects the label vector, so pass iris$Species.
set.seed(120) # Setting seed before the random split so that it is reproducible
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)

# Feature Scaling (not used below; naive Bayes is fitted on the unscaled data)
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])

# Fitting Naive Bayes Model to training dataset
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl

# Predicting on test data
y_pred <- predict(classifier_cl, newdata = test_cl)

# Confusion Matrix
cm <- table(test_cl$Species, y_pred)
cm

# Model Evaluation
confusionMatrix(cm)

Output :
• Model classifier_cl:

• The conditional probabilities for each feature are estimated by the model separately for each class.
The a priori probabilities are also calculated, which indicate the distribution of the classes in the
data.
• Confusion Matrix:

• In the recorded run, 20 setosa are correctly classified as setosa. Out of 16 versicolor, 15 are
correctly classified as versicolor and 1 is classified as virginica. Out of 24 virginica,
19 are correctly classified as virginica and 5 are classified as versicolor.
• Model Evaluation:
• The model achieved 90% accuracy. confusionMatrix() also reports a p-value for the test that the
accuracy exceeds the no-information rate. Together with the reported sensitivity, specificity and
balanced accuracy, this indicates that the model fits well.
Naive Bayes is widely used in industry for sentiment analysis, document categorization, email spam
filtering and similar tasks.


Subject: DATA SCIENCE WITH R LAB Program No : 14
Roll No: Date:
Perform decision tree algorithm.

Aim : to implement decision tree algorithm

Code:
library(party)    # provides ctree() and the readingSkills data set
library(caTools)  # provides sample.split()
data(readingSkills)

# Split on the response vector, as sample.split() expects the labels
sample_data <- sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
train_data <- subset(readingSkills, sample_data == TRUE)
test_data <- subset(readingSkills, sample_data == FALSE)

Separating data into training and testing sets is an important part of evaluating data mining models. After a
model has been fitted on the training set, you test it by making predictions against the test set. Because the
data in the testing set already contains known values for the attribute that you want to predict, it is easy to
determine whether the model’s guesses are correct.

#Create the decision tree model using ctree and plot the model
model <- ctree(nativeSpeaker ~ ., train_data)
plot(model)
#The basic syntax for creating a decision tree in R is:
#ctree(formula, data)
where formula describes the predictor and response variables and data is the data set used. In this case,
nativeSpeaker is the response variable and the other predictor variables are represented by the dot (.).
Plotting the model displays the fitted tree.

# testing the people who are native speakers and those who are not
predict_model <- predict(model, test_data)

# creates a table to count how many are classified
# as native speakers and how many are not
table_mat <- table(test_data$nativeSpeaker, predict_model)
table_mat

# accuracy = correctly classified cases (the diagonal) / all cases
ac_Test <- sum(diag(table_mat)) / sum(table_mat)

print(paste('Accuracy for test is found to be', ac_Test))

Output:


Subject: DATA SCIENCE WITH R LAB Program No : 15
Roll No: Date:
Perform PCA in R.
Aim : to implement PCA in R

Code:
# Loading Data
data(mtcars)

# Apply PCA using prcomp function

# Need to scale / normalize, as PCA is sensitive to the
# variance (units) of each variable
my_pca <- prcomp(mtcars, scale. = TRUE,
                 center = TRUE, retx = TRUE)
names(my_pca)

# Summary
summary(my_pca)
my_pca

# View the principal component loading

# my_pca$rotation[1:5, 1:4]
my_pca$rotation

# See the principal components

dim(my_pca$x)
my_pca$x

# Plotting the resultant principal components

# The parameter scale = 0 ensures that arrows
# are scaled to represent the loadings
biplot(my_pca, main = "Biplot", scale = 0)

# Compute standard deviation

my_pca$sdev

# Compute variance
my_pca.var <- my_pca$sdev ^ 2
my_pca.var

# Proportion of variance for a scree plot

propve <- my_pca.var / sum(my_pca.var)
propve

# Plot variance explained for each principal component

plot(propve, xlab = "principal component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b",
main = "Scree Plot")

# Plot the cumulative proportion of variance explained

plot(cumsum(propve),
xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")

# Find the smallest number of principal components
# which cover at least 90 % of the variance
which(cumsum(propve) >= 0.9)[1]

# Predict disp using the first 4 principal components
# (note: disp itself was among the variables used to compute them)

# Add a training set with principal components
train.data <- data.frame(disp = mtcars$disp, my_pca$x[, 1:4])

# Running a decision tree algorithm

## Installing and loading packages
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

rpart.model <- rpart(disp ~ .,
                     data = train.data, method = "anova")

rpart.plot(rpart.model)

Output:
• Biplot

• The resultant principal components are plotted as a biplot. With scale = 0, the arrows are scaled to
represent the loadings.
• Variance explained for each principal component
• The scree plot shows the proportion of variance explained by each principal component. The first two
principal components explain the largest share of the variance, as seen in the plot.
• Cumulative proportion of variance

• This plot shows the cumulative proportion of variance explained against the number of principal
components. Beyond the first two components, the cumulative proportion rises only slowly, as seen in
the plot.
• Decision tree model

• A decision tree model was built with the ANOVA method to predict disp from the first four principal
components. The decision tree is then plotted.
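
The variances behind the scree plot can be cross-checked without prcomp(). As a sketch, assuming the same scaled and centered mtcars data: the variance of each principal component equals the corresponding eigenvalue of the correlation matrix. The names eig and propve_eig are chosen here for illustration.

```r
data(mtcars)
my_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Variance of each principal component = eigenvalue of the correlation matrix
eig <- eigen(cor(mtcars))$values
all.equal(my_pca$sdev^2, eig)            # TRUE

propve_eig <- eig / sum(eig)             # proportion of variance explained
round(cumsum(propve_eig), 3)             # cumulative proportion
which(cumsum(propve_eig) >= 0.9)[1]      # components needed to reach 90 %
```

Because the data are scaled, the eigenvalues sum to the number of variables (11 for mtcars), which is why dividing by their sum gives proportions.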