DS With R Lab Record
Lab Record of
Name :
Roll No :
Branch : Section:
B V Raju Institute of Technology
(UGC - Autonomous)
Vishnupur, Narsapur, Medak District-502313.
CERTIFICATE
Date:
STUDENT PROFILE
NAME :
ROLL NO :
YEAR :
SEMESTER :
B.TECH :
BATCH :
LAB NAME :
ACADEMIC YEAR :
15 Perform PCA in R
Subject: DATA SCIENCE WITH R LAB Program No : 1
Roll No: Date:
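Write a program to illustrate the working of a countdown function.
Code:
# Count down from `from` to 0, printing one value per second
countdown <- function(from)
{
  print(from)
  # from > 0 (rather than from != 0) also terminates for negative or fractional input
  while(from > 0)
  {
    Sys.sleep(1)
    from <- from - 1
    print(from)
  }
}
countdown(5)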
Output:
Code:
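# Prompt repeatedly until the user enters a non-negative whole number
readinteger <- function()
{
  n <- readline(prompt="Enter an integer: ")
  # Ask again if the input contains anything other than digits
  if(!grepl("^[0-9]+$",n))
  {
    return(readinteger())
  }
  return(as.integer(n))
}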
Output:
Aim : To implement median. The median() function in R is used to calculate the median of the
elements of the numeric vector passed as an argument.
Syntax: median(x, na.rm = FALSE)
Parameters:
x: Numeric vector
na.rm: Logical value; if TRUE, NA values are ignored
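The effect of the na.rm parameter can be illustrated with a minimal sketch. The vectors x and y below are hypothetical sample data, not values taken from this record.

```r
# Hypothetical sample vectors chosen for illustration
x <- c(2, 7, 5, 9, 12)
y <- c(2, 7, NA, 9, 12, 4)

med_x       <- median(x)                # middle value of the sorted vector
med_y       <- median(y)                # NA, because y contains a missing value
med_y_clean <- median(y, na.rm = TRUE)  # NA is dropped before the median is taken

print(med_x)
print(med_y)
print(med_y_clean)
```

Without na.rm = TRUE a single NA makes the whole result NA, which is why the parameter matters for real data sets.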
Output:
Matrix multiplication is one of the most frequently used matrix operations. It appears in areas such as
network theory and the transformation of coordinates. A matrix in R can be created using the matrix()
function, which takes an input vector, nrow, ncol, byrow and dimnames as arguments.
Creating a matrix
A matrix can be created using the matrix() function.
n <- matrix(8:15, nrow=2)
Multiplication of Matrices
The multiplication operator * is used for multiplying a matrix by a scalar or for element-wise
multiplication of two matrices. The matrix product of linear algebra is computed with the %*% operator.
Multiplication with scalar
If you multiply a matrix by a scalar value, every element of the matrix is multiplied by that scalar.
Multiplication between Matrices
When a matrix is multiplied by another matrix using *, element-wise multiplication takes place:
the corresponding elements of the two matrices are multiplied, which requires both matrices to
have the same dimensions.
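The difference between element-wise multiplication with * and the matrix product with %*% can be sketched as follows. The two 2 x 2 matrices are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical 2 x 2 matrices; matrix() fills column by column
a <- matrix(1:4, nrow = 2)   # columns (1, 2) and (3, 4)
b <- matrix(5:8, nrow = 2)   # columns (5, 6) and (7, 8)

elementwise <- a * b         # multiplies the elements at matching positions
product     <- a %*% b       # rows of a combined with columns of b

print(elementwise)
print(product)
```

The element-wise result has 5, 12, 21, 32 (column by column), while the matrix product has 23, 34, 31, 46, so the two operators generally give different answers.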
Code:
# R program for matrix multiplication with a scalar
m <- matrix(1:8, nrow=2)
m <- 2*m
print(m)
# Element-wise multiplication of two matrices of the same dimension
m <- matrix(1:8, nrow=2)
n <- matrix(8:15, nrow=2)
print(m*n)
Output :
Subject: DATA SCIENCE WITH R LAB Program No : 5
Roll No: Date:
Data: diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
Output:
## # A tibble: 29,733 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
## 2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
## 3 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5
## 4 0.71 Very Good E VS2 62.4 57 2759 5.68 5.73 3.56
## 5 0.78 Very Good G SI2 63.8 56 2759 5.81 5.85 3.72
## 6 0.7 Good E VS2 57.5 58 2759 5.85 5.9 3.38
## 7 0.7 Good F VS1 59.4 62 2759 5.71 5.76 3.4
## 8 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
## 9 0.73 Very Good E SI1 61.6 59 2760 5.77 5.78 3.56
## 10 0.8 Premium H SI1 61.5 58 2760 5.97 5.93 3.66
## # … with 29,723 more rows
Here, we select only the diamonds where the price is greater than 2000.
## # A tibble: 32,389 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## 10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## # … with 32,379 more rows
And here, we select all the diamonds whose cut is NOT equal to 'Ideal'. Note that this is the exact
opposite of what we filtered before.
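The code that produced these outputs is not included in the record. A minimal sketch of how such row selections might be written, assuming the dplyr package for filter() and the ggplot2 package for the diamonds data set, is shown below. The object names are illustrative.

```r
library(dplyr)     # provides filter()
library(ggplot2)   # provides the diamonds data set

expensive <- filter(diamonds, price > 2000)     # rows with a price above 2000
ideal     <- filter(diamonds, cut == "Ideal")   # rows whose cut is Ideal
not_ideal <- filter(diamonds, cut != "Ideal")   # the complementary selection

nrow(expensive)
nrow(not_ideal)
```

Because cut == "Ideal" and cut != "Ideal" are complementary conditions, the row counts of ideal and not_ideal add up to the size of the full data set.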
Subject: DATA SCIENCE WITH R LAB Program No : 6
Roll No: Date:
Write a Program which simulates Match and Sample in R.
The match() function returns a vector with the position of the first occurrence of each element of vector1
in vector2. If an element of vector1 does not exist in vector2, NA is returned.
By default the nomatch argument is NA, so NA is returned when no match is found in vector2.
According to the R documentation, the %in% operator is built on match(). It returns a logical vector which
indicates whether a match was located in vector2 for each element of vector1. The result values are either
TRUE or FALSE but never NA, so the %in% operator is useful in if conditions.
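The behaviour of match(), the nomatch argument and %in% described above can be sketched with two small vectors. The contents of vector1 and vector2 are hypothetical.

```r
# Hypothetical vectors chosen for illustration
vector1 <- c("b", "z", "a")
vector2 <- c("a", "b", "c", "b")

pos      <- match(vector1, vector2)               # first positions, NA where absent
pos_zero <- match(vector1, vector2, nomatch = 0)  # 0 instead of NA where absent
found    <- vector1 %in% vector2                  # TRUE or FALSE, never NA

print(pos)
print(pos_zero)
print(found)
```

"b" occurs twice in vector2 but only its first position (2) is reported, and "z" is the element that yields NA, 0 or FALSE depending on the call.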
Sample:
Definition: The sample R function takes a random sample or permutation of a data object.
Basic R Syntax: The basic R programming syntax of the sample function is shown below.
sample(values, size_of_subsample) # Basic syntax of sample
The sample function can be used to permute a data object or to draw a subsample with or without
replacement.
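A minimal sketch of common uses of sample() is given here. The data object values, the seed and the sample sizes are assumptions made for illustration.

```r
set.seed(42)        # arbitrary seed so that the random draws are repeatable
values <- 1:10      # hypothetical data object

perm <- sample(values)                       # random permutation of all elements
sub  <- sample(values, 3)                    # 3 elements drawn without replacement
boot <- sample(values, 15, replace = TRUE)   # more draws than elements needs replacement

print(perm)
print(sub)
print(boot)
```

Requesting more elements than the object contains without replace = TRUE raises an error, which is why the last call sets it.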
Code:
install.packages("ROCR")
install.packages("caTools")
install.packages("caret")
install.packages("ggplot2")
install.packages("car")
library(ROCR)
library(ggplot2)
library(caret)
library(caTools)
library(car)
#data(Titanic)
getwd()
setwd("E:/R")
Titanic=read.csv("Titanic.csv",header = TRUE)
View(Titanic)
Titanic=read.csv(file.choose(),header = T)
View(Titanic)
Titanic=read.csv(file.choose(),na.strings = "")
View(Titanic)
summary(Titanic)
str(Titanic)
sum(is.na(Titanic))
set.seed(123)
idx=sample(nrow(Titanic),nrow(Titanic)*0.7)
train=Titanic[idx,]
test=Titanic[-idx,]
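# Preliminary model on the raw data; family="binomial" makes this a logistic regression
model=glm(survived~pclass+sex+age+sibsp+parch+embarked,data=train,family="binomial")
summary(model)
# type="response" returns predicted probabilities instead of log-odds
pred=predict(model,newdata=test,type="response")
pred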
library(Amelia)
missmap(Titanic)
is.na(Titanic)
any(is.na(Titanic))
sum(is.na(Titanic))
colSums(is.na(Titanic))
table(is.na(Titanic))
prop.table(table(is.na(Titanic)))
prop.table(table(is.na(Titanic$body)))
prop.table(table(is.na(Titanic$cabin)))
prop.table(table(is.na(Titanic$boat)))
prop.table(table(is.na(Titanic$home.dest)))
prop.table(table(is.na(Titanic$age)))
names(Titanic)
#Titanic[,c("name","ticket","cabin","boat","body","home.dest")]
Titanic1=Titanic[,-c(3,8,10,12,13,14)]
sum(is.na(Titanic1))
colSums(is.na(Titanic1))
missmap(Titanic1)
na.omit(Titanic1$pclass)
na.omit(Titanic1$survived)
Titanic1[1310,]
Titanic2=Titanic1[-1310,]
colSums(is.na(Titanic2))
#summary(Titanic2$age)
boxplot(Titanic2)
boxplot(Titanic2$age)
bench=quantile(Titanic2$age,0.75,na.rm=T)+
1.5*IQR(Titanic2$age,na.rm=T)
bench
#bench=quantile(Titanic2$age,0.99,na.rm=T)
Titanic2$age[Titanic2$age>bench]=bench
boxplot(Titanic2$age)
boxplot(Titanic2$fare)
bench=quantile(Titanic2$fare,0.75,na.rm=T)+
1.5*IQR(Titanic2$fare,na.rm=T)
bench
Titanic2$fare[Titanic2$fare>bench]=bench
boxplot(Titanic2$fare)
library(Hmisc)
Titanic2$age=impute(Titanic2$age,mean)
Titanic2$fare=impute(Titanic2$fare,median)
Titanic2$embarked=impute(Titanic2$embarked,mode)
colSums(is.na(Titanic2))
missmap(Titanic2)
table(Titanic2$survived)
# If we want percentages
prop.table(table(Titanic2$survived))
plot(Titanic2)
#install.packages("corrplot")
library(corrplot)
str(Titanic2)
corrplot(cor(Titanic2[,-c(3,8)]))
cor(Titanic2[,-c(3,8)])
names(Titanic2)
set.seed(123)
idx=sample(nrow(Titanic2),nrow(Titanic2)*0.7)
train=Titanic2[idx,]
test=Titanic2[-idx,]
model1=glm(survived~.,data=train,family="binomial")
summary(model1)
model2=glm(survived~.-parch,data=train,family="binomial")
summary(model2)
model3=glm(survived~.-parch-fare,data=train,family="binomial")
summary(model3)
model3$coefficients
model3$residuals
model3$xlevels
model3$rank
attach(Titanic2)
names(Titanic2)
str(Titanic)
model4=glm(survived~pclass+sex+age+sibsp+parch+fare+
embarked,
data=train,family="binomial")
summary(model4)
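# Predicted probabilities on the test set; model2 is assumed, following the commented-out line near the ROC code
predict=predict(model2,newdata=test,type="response")
table(ActualValue=test$survived, PredictedValue=predict)
table(ActualValue=test$survived,PredictedValue=predict>0.5)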
library(caret)
#confusionMatrix(table(ActualValue=test$survived,PredictedValue=predict>0.5))
pred1=ifelse(predict>0.56,1,0)
#install.packages("caret")
library(caret)
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1))
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1),
positive = "1")
table(ActualValue=test$survived,PredictedValue=predict>0.56)
table(test$survived,predict>0.59)
table(ActualValue=test$survived,PredictedValue=predict>0.58)
pred1=ifelse(predict>0.58,1,0)
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1),
positive = "1")
train
test
model3=glm(survived ~ age+sex+pclass+sibsp+parch,
data=train, family="binomial")
summary(model3)
model4=glm(survived~age+sex+pclass+sibsp+parch+fare,
data=train, family="binomial")
summary(model4)
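# ticket was dropped when Titanic1 was created, so this model cannot be fitted on train as it stands
#model5=glm(survived~age+sex+pclass+sibsp+parch+fare+ticket,data=train, family="binomial")
#summary(model5)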
head(predict)
head(test$survived)
table(ActualValue=test$survived, PredictedValue=predict)
table(ActualValue=test$survived,PredictedValue=predict>0.58)
pred1=ifelse(predict>0.58,1,0)
#install.packages("caret")
library(caret)
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1))
confusionMatrix(table(Actualvalue=test$survived ,
Predictedvalue=pred1),
positive = "1")
table(ActualValue=test$survived,PredictedValue=predict>0.56)
table(test$survived,predict>0.59)
table(ActualValue=test$survived,PredictedValue=predict>0.5)
# Counts taken from the confusion table: TN=116, FP=12, FN=29, TP=57
accuracy=(116+57)/(116+12+29+57)
accuracy
precision=57/(57+12)
precision
recall=57/(57+29)
recall
F1Score=2*(precision*recall)/(precision+recall)
F1Score
#multicollinearity
#install.packages("car")
library(car)
vif(model2)
#predict=predict(model2,newdata=test,type="response")
pred=prediction(predict,test$survived)
pred
perf=performance(pred,"tpr","fpr")
perf
#dev.off()
plot(perf)
plot(perf,colorize=T)
#abline(h=0.4,v=0.25)
#abline(a=0,b=1)
plot(perf,colorize=T,print.cutoffs.at=seq(0,1,0.1))
#max=which.max(slot(eval,"y.values")[[1]])
auc=performance(pred,"auc")
auc
# m1 is not defined in this script; model1 is assumed to be the model meant here
summary(model1)
#Warning message:
# glm.fit: fitted probabilities numerically 0 or 1 occurred
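The accuracy, precision, recall and F1 score in the listing above are computed from hard-coded counts. A minimal sketch of the same arithmetic is given below, assuming that 116, 12, 29 and 57 are the true negatives, false positives, false negatives and true positives, as the formulas in the listing imply. The object names are illustrative.

```r
# Confusion counts from the listing: rows = actual class, columns = predicted class
cm <- matrix(c(116, 29, 12, 57), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

tn <- cm["0", "0"]   # actual 0, predicted 0
fp <- cm["0", "1"]   # actual 0, predicted 1
fn <- cm["1", "0"]   # actual 1, predicted 0
tp <- cm["1", "1"]   # actual 1, predicted 1

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1_score  <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1_score), 4))
```

Reading the counts from a table rather than typing them into each formula keeps the four metrics consistent when the cutoff, and therefore the table, changes.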
Output :
Code:
write.csv(carSpeeds, file = 'data/car-speeds-cleaned.csv')
If you open the file, you’ll see that it has header names, because the data had headers within R, but that
there are numbers in the first column. These numbers are the row names of the data frame, which
write.csv() writes by default.
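The extra first column can be avoided with the row.names argument of write.csv(). The sketch below uses a small hypothetical data frame with the same three columns as carSpeeds and writes to a temporary file rather than to the data/ folder.

```r
# Hypothetical stand-in for carSpeeds with the same column names
car_speeds <- data.frame(Color = c("Blue", "Red"),
                         Speed = c(32, 45),
                         State = c("NewMexico", "Arizona"))

out_file <- tempfile(fileext = ".csv")                      # temporary output path
write.csv(car_speeds, file = out_file, row.names = FALSE)   # omit the row-name column

col_names <- names(read.csv(out_file))   # columns found when the file is read back
print(col_names)
```

With row.names = FALSE the file contains only the Color, Speed and State columns.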
Code:
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
head(carSpeeds)
Output:
Color Speed State
1 Blue 32 NewMexico
2 Red 45 Arizona
3 Blue 35 Colorado
4 White 34 Arizona
5 Red 25 Arizona
6 Blue 41 Arizona
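Aim : To implement a program that illustrates the use of the cat() function in R.
The cat() function in R is used to print output to the screen or to a file.
Syntax:
cat(..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)
Parameters:
...: R objects to print (atomic vectors, names, NULL and objects with no output)
file: the file to which the output is written; the default "" prints to the console
sep: the separator placed between the printed elements
fill: if TRUE, the output is broken into lines and a newline is appended; otherwise no newline is added
labels: labels for the lines printed when fill is used
append: if TRUE, the output is appended to the file instead of overwriting it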
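The sep, file and append parameters can be tried out with a short sketch. The strings and the temporary file below are hypothetical and serve only as a demonstration.

```r
out_file <- tempfile(fileext = ".txt")   # temporary file used for the demonstration

cat("Numbers:", 1:5, "\n", sep = " ")               # printed to the console
cat("alpha", "beta", sep = "\n", file = out_file)   # written to the file, one item per line
cat("\ngamma\n", file = out_file, append = TRUE)    # appended instead of overwriting

file_lines <- readLines(out_file)   # read the file back to inspect the result
print(file_lines)
```

Without append = TRUE the second call on the same file would replace "alpha" and "beta" rather than adding a third line.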
Code:
# R program to illustrate
# cat function
x <- "GeeksforGeeks\n"
y <- "Geeks\n"
cat(x)
cat(y)
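x <- 1:9
# Print the elements of x; the call is missing from the record and the separator is an illustrative choice
cat(x, sep = " + ")
cat("\n")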
Code:
import matplotlib.pyplot as plt
plt.xlabel('Year')
To export a graph with matplotlib, call .savefig(path). matplotlib determines the file type from the
extension of the path that is passed.
plt.savefig('line_plot.pdf')
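Since the rest of this record uses R, a comparable export in base R might look like the sketch below. It assumes the pdf() graphics device; the data and the temporary output path are hypothetical.

```r
years  <- 2015:2020               # hypothetical data for the line plot
values <- c(3, 5, 4, 6, 8, 7)

out_file <- file.path(tempdir(), "line_plot.pdf")   # temporary output path
pdf(out_file)                                       # open a PDF graphics device
plot(years, values, type = "l", xlab = "Year", ylab = "Value")
dev.off()                                           # close the device so the file is written

print(file.exists(out_file))
```

Unlike savefig(), the file type in base R is chosen by the device function that is opened (pdf(), png() and so on), not by the file extension.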
Output :
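# Loading package
library(e1071)
library(caTools)
library(class)
# Loading data
data(iris)
head(iris)
# Splitting data into train and test data
# (this step is missing from the record; a 60/40 split per species is assumed,
# which gives the 20 test rows per species described below)
split <- sample.split(iris$Species, SplitRatio = 0.6)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)
# Feature Scaling
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])
# K = 1 (fit assumed; the classifier must exist before the confusion matrix is built)
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$Species,
                      k = 1)
# Confusion Matrix
cm <- table(test_cl$Species, classifier_knn)
cm
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))
# K = 3
classifier_knn <- knn(train = train_scale,
                      test = test_scale,
                      cl = train_cl$Species,
                      k = 3)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))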
# K = 5
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 5)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))
# K = 7
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 7)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))
# K = 15
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 15)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))
# K = 19
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Species,
k = 19)
misClassError <- mean(classifier_knn != test_cl$Species)
print(paste('Accuracy =', 1-misClassError))
The KNN model is fitted with the scaled training data, the scaled test data, and a value of k. The Species
column of the training data supplies the class labels.
• Confusion Matrix:
So, 20 Setosa are correctly classified as Setosa. Out of 20 Versicolor, 17 are correctly
classified as Versicolor and 3 are classified as Virginica. Out of 20 Virginica, 17 are correctly
classified as Virginica and 3 are classified as Versicolor.
• Model Evaluation:
(K=1)
The confusion matrix above corresponds to 90% accuracy, since 54 of the 60 test rows are classified
correctly.
(K=3)
The model achieved 88.33% accuracy with k = 3, which is lower than with k = 1.
(K=5)
The model achieved 91.66% accuracy with k = 5, which is higher than with k = 1 and 3.
(K=7)
The model achieved 93.33% accuracy with k = 7, which is higher than with k = 1, 3, and 5.
(K=15)
The model achieved 95% accuracy with k = 15, which is higher than with k = 1, 3, 5, and 7.
(K=19)
The model achieved 95% accuracy with k = 19, which is higher than with k = 1, 3, 5, and 7. It is the
same accuracy as with k = 15, so increasing k further no longer changes the accuracy here.
K Nearest Neighbor is widely used in industry.
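The comparison of k values described above can also be written as a loop. The sketch below is an illustration under stated assumptions: the split is drawn with base R (20 test rows per species), the seed is arbitrary, and the test set is scaled with the training means and standard deviations, which differs from the listing. The accuracies it prints depend on the random split and need not match the figures quoted in this record.

```r
library(class)   # provides knn()

data(iris)
set.seed(123)    # arbitrary seed, only for repeatability

# Assumed split: 20 test rows per species, the remaining 30 per species for training
test_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                          function(rows) sample(rows, 20)))
train_cl <- iris[-test_idx, ]
test_cl  <- iris[test_idx, ]

# Scale the training features, then apply the same centering and scaling to the test features
train_scale <- scale(train_cl[, 1:4])
test_scale  <- scale(test_cl[, 1:4],
                     center = attr(train_scale, "scaled:center"),
                     scale  = attr(train_scale, "scaled:scale"))

# Fit one classifier per k and record the share of correctly classified test rows
k_values <- c(1, 3, 5, 7, 15, 19)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_scale, test = test_scale,
              cl = train_cl$Species, k = k)
  mean(pred == test_cl$Species)
})

print(data.frame(k = k_values, accuracy = accuracy))
```

Collecting the results in one data frame makes it easier to see where the accuracy stops improving as k grows.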
Code:
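install.packages("e1071")
install.packages("caTools")
install.packages("caret")
# Loading package
library(e1071)
library(caTools)
library(caret)
# Loading data
data(iris)
# Splitting data into train and test data
# (missing from the record; a 60/40 split is assumed, giving the 60 test rows described below)
split <- sample.split(iris$Species, SplitRatio = 0.6)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)
# Feature Scaling
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])
# Fitting the Naive Bayes model to the training data (step assumed; not in the record)
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl
# Predicting on the test data
y_pred <- predict(classifier_cl, newdata = test_cl)
# Confusion Matrix
cm <- table(test_cl$Species, y_pred)
cm
# Model Evaluation
confusionMatrix(cm)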
• The conditional probabilities for each feature or variable are computed by the model separately.
The a priori probabilities are also calculated, which indicate the distribution of the classes in
our data.
• Confusion Matrix:
• So, 20 Setosa are correctly classified as Setosa. Out of 16 Versicolor, 15 are
correctly classified as Versicolor, and 1 is classified as Virginica. Out of 24 Virginica,
19 are correctly classified as Virginica and 5 are classified as Versicolor.
• Model Evaluation:
• The model achieved 90% accuracy (54 of the 60 test rows are classified correctly). Judging by the
sensitivity, specificity, and balanced accuracy, the model that was built is good.
Naive Bayes is widely used in sentiment analysis, document categorization, email spam
filtering, etc. in industry.
Code:
library(party)    # ctree() and the readingSkills data
library(caTools)  # sample.split()
# Split on the response variable: 80% training rows, 20% test rows
sample_data = sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
train_data <- subset(readingSkills, sample_data == TRUE)
test_data <- subset(readingSkills, sample_data == FALSE)
Separating data into training and testing sets is an important part of evaluating data mining models, which
is why the data is split here. After a model has been fitted on the training set, you test it by making
predictions against the test set. Because the data in the testing set already contains known values for the
attribute that you want to predict, it is easy to determine whether the model’s guesses are correct.
#Create the decision tree model using ctree and plot the model
model<- ctree(nativeSpeaker ~ ., train_data)
plot(model)
#The basic syntax for creating a decision tree in R is:
ctree(formula, data)
where formula describes the predictor and response variables and data is the data set used. In this case,
nativeSpeaker is the response variable and all other variables are used as predictors, which is what the
dot (.) in the formula stands for. When we plot the model we get the following output.
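Testing the fitted tree against the test set, as described above, could look like the following sketch. It assumes the party package for ctree() and the readingSkills data; the split is drawn with base R and an arbitrary seed, so the resulting accuracy is only an illustration.

```r
library(party)   # provides ctree() and the readingSkills data set

set.seed(1)      # arbitrary seed for a repeatable split
n <- nrow(readingSkills)
train_rows <- sample(n, size = round(0.8 * n))   # assumed 80/20 split
train_data <- readingSkills[train_rows, ]
test_data  <- readingSkills[-train_rows, ]

model <- ctree(nativeSpeaker ~ ., data = train_data)

# Predict the class of each test row and compare with the known values
predicted <- predict(model, newdata = test_data)
confusion <- table(Actual = test_data$nativeSpeaker, Predicted = predicted)
accuracy  <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

The diagonal of the confusion table holds the correctly classified rows, so dividing its sum by the total gives the share of correct guesses.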
Code:
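# Loading packages for the decision tree
library(rpart)
library(rpart.plot)
# Loading Data
data(mtcars)
# Apply PCA (this call is missing from the record; prcomp() on the scaled data is assumed)
my_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
# Summary
summary(my_pca)
my_pca
# Biplot of the principal components
biplot(my_pca, main = "Biplot", scale = 0)
# Compute variance
my_pca.var <- my_pca$sdev ^ 2
my_pca.var
# Decision tree predicting disp with the ANOVA method (fit assumed; not in the record)
rpart.model <- rpart(disp ~ ., data = mtcars, method = "anova")
rpart.plot(rpart.model)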
Output:
• Biplot
• The resultant principal components are plotted as a biplot. A scale value of 0 means that the
arrows are scaled to represent the loadings.
• Variance explained for each principal component
• The scree plot shows the proportion of variance explained by each principal component. The first 2
principal components explain the largest proportion of the variance, as clearly seen in the plot.
• Cumulative proportion of variance
• The plot shows the cumulative proportion of variance against the number of principal components.
Beyond the first 2 principal components, the cumulative proportion of variance increases only
slowly, as clearly seen in the plot.
• Decision tree model
• A decision tree model was built to predict disp from the other variables in the dataset using the
ANOVA method. The decision tree is plotted and displays the information.
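The quantities behind the two variance plots described above can be computed as in the sketch below. It assumes that the PCA was run with prcomp() on the centered and scaled mtcars data; the object names and plot labels are illustrative.

```r
data(mtcars)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)   # assumed PCA call

pca_var  <- pca$sdev^2                 # variance of each principal component
prop_var <- pca_var / sum(pca_var)     # proportion of variance explained
cum_var  <- cumsum(prop_var)           # cumulative proportion of variance

# Scree plot: proportion of variance per component
plot(prop_var, type = "b", main = "Scree Plot",
     xlab = "Principal Component", ylab = "Proportion of Variance Explained")

# Cumulative plot: share of variance covered by the first components
plot(cum_var, type = "b", main = "Cumulative Scree Plot",
     xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained")

print(round(prop_var, 4))
print(round(cum_var, 4))
```

The proportions sum to 1, so the cumulative curve always ends at 1; how quickly it flattens shows how many components are worth keeping.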