Datamining 2

The document discusses different classification algorithms and compares their accuracy on iris data. It divides the data into training and test sets with different proportions, standardizes the data, and compares the accuracy of naive bayes, KNN and decision tree classifiers under different training/test set configurations.

DATA MINING

PRACTICAL FILE
Name: Swati Saini
Roll No.: 2102066
Course: B.Sc. (H) Computer Science

Submitted to: Dr. Sangeeta Srivastava


Ques 1:-

Create a file “people.txt” with the following data:

i) Read the data from the file “people.txt”.


ii) Create a ruleset E that contains rules to check for the following conditions:
1. The age should be in the range 0-150.
2. The age should be greater than yearsmarried.
3. The status should be married or single or widowed.
4. If age is less than 18 the agegroup should be child, if age is between 18 and 65 the agegroup should be adult, if age is more than 65 the agegroup should be elderly.
iii) Check whether ruleset E is violated by the data in the file people.txt.
iv) Summarize the results obtained in part (iii)
v) Visualize the results obtained in part (iii)

Code:-

#Reading file people.txt

people = read.table(file = 'people.txt', sep = '\t',header = TRUE)

print(people)

#install package edit rules


#install.packages("editrules")

library(editrules)

#Create a ruleset E that contain rules to check for the following conditions:

#1. The age should be in the range 0-150.

#2. The age should be greater than yearsmarried.

#3. The status should be married or single or widowed.

#4. If age is less than 18 the agegroup should be child, if age is between 18 and 65

# the agegroup should be adult, if age is more than 65 the agegroup should be elderly.

E<-editset(expression(

Age>=0,

Age<=150,

Age>yearsmarried,

status %in% c('married','single','widowed'),

if(Age<18) agegroup=='child',

if(Age>=18 & Age<65) agegroup=='adult',

if(Age>=65) agegroup=='elderly'

))

print(E)

ve<-violatedEdits(E,people)

summary(ve)

print(ve)
summary(ve,E)

plot(E)

plot(ve)

plot(E,layout=layout.circle)
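
To make clear what a rule violation means, the following is a minimal base R sketch that checks the same seven rules without the editrules package. The five records in `people_demo` are hypothetical stand-ins, not the contents of people.txt, and the names `rules_demo` and `violated_demo` are introduced only for this illustration. A conditional rule "if A then B" is written as `!A | B`, so it is violated exactly when A holds and B does not.

```r
# Hypothetical sample records (NOT the real people.txt data)
people_demo <- data.frame(
  Age          = c(21, 2, 18, 221, 34),
  agegroup     = c("adult", "child", "adult", "elderly", "child"),
  status       = c("single", "married", "married", "widowed", "married"),
  yearsmarried = c(-1, 0, 20, 2, 3),
  stringsAsFactors = FALSE
)

# Each rule is an unevaluated expression that must be TRUE for a valid record
rules_demo <- list(
  age_min        = quote(Age >= 0),
  age_max        = quote(Age <= 150),
  age_gt_married = quote(Age > yearsmarried),
  status_valid   = quote(status %in% c("married", "single", "widowed")),
  child          = quote(!(Age < 18) | agegroup == "child"),
  adult          = quote(!(Age >= 18 & Age < 65) | agegroup == "adult"),
  elderly        = quote(!(Age >= 65) | agegroup == "elderly")
)

# Logical matrix: one row per record, one column per rule, TRUE = rule violated
violated_demo <- sapply(rules_demo, function(r) !eval(r, people_demo))

print(violated_demo)
print(colSums(violated_demo))   # how often each rule is broken
print(rowSums(violated_demo))   # how many rules each record breaks
```

The matrix has the same layout as the result of violatedEdits: records in rows, rules in columns.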

Output Screen:-
Ques 2:-

Perform the following preprocessing tasks on the dirty_iris dataset.


i) Calculate the number and percentage of observations that are complete.
ii) Replace all the special values in data with NA.
iii) Define these rules in a separate text file and read them.
(Use editfile function in R (package editrules). Use similar function in Python).
Print the resulting constraint object.
– Species should be one of the following values: setosa, versicolor or virginica.
– All measured numerical properties of an iris should be positive.
– The petal length of an iris is at least 2 times its petal width.
– The sepal length of an iris cannot exceed 30 cm.
– The sepals of an iris are longer than its petals.
iv) Determine how often each rule is broken (violatedEdits). Also summarize and plot the result.
v) Find outliers in sepal length using boxplot and boxplot.stats
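
Tasks (i) and (ii) can be tried on a small data frame first. The sketch below is only an illustration: `toy` is a hypothetical four-row table, not dirty_iris, and `is_special` is a helper name introduced here. It treats NA, NaN, Inf and -Inf in numeric columns as special values.

```r
# Hypothetical miniature data set with missing and special values
toy <- data.frame(
  Sepal.Length = c(5.1, NA, Inf, 6.3),
  Petal.Width  = c(0.2, 1.4, 1.8, NaN),
  Species      = c("setosa", "versicolor", "virginica", NA),
  stringsAsFactors = FALSE
)

# (i) number and percentage of complete observations
n_complete   <- sum(complete.cases(toy))
pct_complete <- 100 * mean(complete.cases(toy))
print(paste("Complete cases:", n_complete, "(", pct_complete, "% )"))

# (ii) replace special values with NA
# numeric columns: anything that is not finite; other columns: existing NA only
is_special <- function(col) if (is.numeric(col)) !is.finite(col) else is.na(col)
toy[sapply(toy, is_special)] <- NA
print(toy)
```

Note that complete.cases() does not count Inf as missing, so the number of complete rows can drop once the special values have been replaced.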

ques2Rules.txt:-

Species %in% c("setosa", "versicolor", "virginica")

Sepal.Length > 0

Sepal.Width > 0

Petal.Length > 0

Petal.Width > 0

Petal.Length >= 2*Petal.Width

Sepal.Length <= 30

Sepal.Length > Petal.Length

Code:-

data<-read.table(file = 'dirty_iris.csv', sep=',', header = TRUE)

print(head(data,10))

print(summary(data))

str(data)

#i)Calculate the number and percentage of observations that are complete.

complete_cases <- complete.cases(data)

as.numeric(complete_cases)

n_complete<-sum(complete_cases)

print(n_complete)

percent_complete<- 100*n_complete/nrow(data)

print(paste("Cases in DIRTY_IRIS dataframe that are complete: ", percent_complete, "%"))


#ii)Replace all the special values in data with NA.

is.na(data)<- sapply(data,is.infinite)

is.na(data)

#iii)Define these rules in a separate text file and read them.

#(Use editfile function in R (package editrules).

#Print the resulting constraint object.

#-Species should be one of the following values: setosa, versicolor or virginica.

#-All measured numerical properties of an iris should be positive.

#-The petal length of an iris is at least 2 times its petal width.

#-The sepal length of an iris cannot exceed 30 cm.

#-The sepals of an iris are longer than its petals.

library(editrules)

(Rules <- editfile('ques2Rules.txt'))

#iv)Determine how often each rule is broken (violatedEdits). Also summarize and plot the result.

print(Rules)

rules_violations <- violatedEdits(Rules,data)

print(rules_violations)

summary(rules_violations)

#par(mar=c(3,3,3,3))

plot(rules_violations)

#v)Find outliers in sepal length using boxplot and boxplot.stats

boxplot(data$Sepal.Length, horizontal=TRUE)
print(boxplot.stats(data$Sepal.Length))
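
The outliers reported by boxplot.stats are the values lying more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below reproduces that rule by hand on a hypothetical vector `x_demo` (not the dirty_iris column), as a way to see where the values in `$out` come from.

```r
# Hypothetical measurements with one implausible value and one NA
x_demo <- c(4.9, 5.1, 5.8, 6.0, 6.3, 6.4, 7.0, 49.0, NA)

# Outliers as reported by boxplot.stats (default coef = 1.5)
out_builtin <- boxplot.stats(x_demo)$out

# The same rule by hand: hinges come from fivenum()
h      <- fivenum(x_demo, na.rm = TRUE)   # min, lower hinge, median, upper hinge, max
spread <- h[4] - h[2]
lower  <- h[2] - 1.5 * spread
upper  <- h[4] + 1.5 * spread
out_manual <- x_demo[!is.na(x_demo) & (x_demo < lower | x_demo > upper)]

print(out_builtin)
print(out_manual)
```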

Output Screen:-
Ques 3:-

Load the data from wine dataset. Check whether all attributes are standardized or not (mean is
0 and standard deviation is 1). If not, standardize the attributes. Do the same with Iris dataset.

Code:-

library(caret)
#----WINE DATASET-----

df_wine<-read.csv("wine.csv", header=FALSE)

View(df_wine)

str(df_wine)

summary(df_wine)

apply(df_wine,2,sd)

wine_pre<-preProcess(df_wine[,],method=c("center","scale"))

wine_standard<-predict(wine_pre,df_wine[,])

summary(wine_standard)

apply(wine_standard,2,sd)

#----IRIS DATASET-----

data("iris")

View(iris)

summary(iris)

apply(iris[,1:4],2,sd)

#for making mean = 0 and standard deviation = 1 in IRIS dataset.

iris_pre<-preProcess(iris[,1:4], method=c("center", "scale"))

iris_standard<-predict(iris_pre, iris[,1:4])

summary(iris_standard)

apply(iris_standard,2,sd)
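
The check "mean is 0 and standard deviation is 1" can also be written as a single test. The following is a sketch in base R rather than caret; `is_standardized` and its tolerance `tol` are names and values assumed for this illustration. A tolerance is needed because floating-point means are rarely exactly zero.

```r
# TRUE if every numeric column has mean 0 and standard deviation 1 (within tol)
is_standardized <- function(df, tol = 1e-8) {
  num <- df[sapply(df, is.numeric)]
  all(abs(colMeans(num)) < tol) && all(abs(apply(num, 2, sd) - 1) < tol)
}

print(is_standardized(iris))      # raw measurements

# scale() centers and scales each column, like method = c("center", "scale")
iris_z <- as.data.frame(scale(iris[, 1:4]))
print(is_standardized(iris_z))    # after standardization
```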

Output Screen:-
Ques 4:-

Run Apriori algorithm to find frequent itemsets and association rules


1.1 Use minimum support as 50% and minimum confidence as 75%
1.2 Use minimum support as 60% and minimum confidence as 60 %

Code:-

receipt_df <- read.csv("datasets/Dataset for Apriori/1000/1000i.csv")

View(receipt_df)
#Applying column names

names(receipt_df) <- c("Receipt_Number", "Food", "Quantity")

#Applying Data Preprocessing

id<- c(1:5)

food<- c("milk", "sugar", "chocolate", "apple", "curd")

df <- data.frame(id, food)

print(df)

#map id to its text representation

receipt_df$Food <- df$food[match(receipt_df$Food,df$id)]

#after Preprocessing

head(receipt_df)

typeof(receipt_df)

library(arules)

library(arulesViz)

#convert into basket format to run in apriori

test_df <- receipt_df[,c("Receipt_Number", "Food", "Quantity")]

df_trans <- as(split(test_df$Food, test_df$Receipt_Number),"transactions")

#support and confidence

#part 1:- minimum support as 50% and minimum confidence as 75%

rules <- apriori(df_trans,parameter = list(supp = .5, conf = .75))

plot(rules)
rules<- sort(rules, by="support", decreasing = T)

inspect(head(rules))

#part 2:- minimum support as 60% and minimum confidence as 60%

rules <- apriori(df_trans,parameter = list(supp = .6, conf = .6))

plot(rules)

rules<- sort(rules, by="support", decreasing = T)

inspect(head(rules))
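
The two thresholds passed to apriori can be checked by hand on a tiny example. The sketch below uses base R only; the four baskets and the helper names `support_demo` and `confidence_demo` are hypothetical. Support of an itemset is the fraction of transactions that contain all of its items, and confidence of a rule A => B is support(A and B) divided by support(A).

```r
# Hypothetical transactions (one character vector per receipt)
baskets <- list(
  c("milk", "sugar", "curd"),
  c("milk", "sugar"),
  c("milk", "apple"),
  c("sugar", "chocolate")
)

# Fraction of baskets containing every item of the itemset
support_demo <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Confidence of the rule lhs => rhs
confidence_demo <- function(lhs, rhs) {
  support_demo(c(lhs, rhs)) / support_demo(lhs)
}

supp_ms <- support_demo(c("milk", "sugar"))
conf_ms <- confidence_demo("milk", "sugar")
print(supp_ms)
print(conf_ms)

# Would the rule milk => sugar survive each setting of the question?
print(supp_ms >= 0.5 && conf_ms >= 0.75)   # part 1.1
print(supp_ms >= 0.6 && conf_ms >= 0.6)    # part 1.2
```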

Output Screen:-
Ques 5:-

Use Naive bayes, K-nearest, and Decision tree classification algorithms and build classifiers.
Divide the data set into training and test set. Compare the accuracy of the different classifiers
under the following situations:
5.1 a) Training set = 75%, Test set = 25% b) Training set = 66.6% (2/3rd of total), Test set = 33.3%
5.2 Training set is chosen by i) hold out method ii) Random subsampling iii) Cross-Validation.
Compare the accuracy of the classifiers obtained.
5.3 Data is scaled to standard format.
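
The three sampling schemes in 5.2 differ only in how row indices are assigned to the training and test sets. The following base R sketch shows that difference on iris. The variable names and the seed are assumptions made for this illustration, and the splits here are not stratified by species.

```r
set.seed(1)
n <- nrow(iris)
n_train <- floor(0.75 * n)

# i) Hold-out: one random split into training and test rows
train_idx <- sample(n, size = n_train)
test_idx  <- setdiff(seq_len(n), train_idx)

# ii) Random subsampling: repeat the hold-out split several times,
#     drawing a fresh random training set on every repetition
splits <- replicate(5, sample(n, size = n_train), simplify = FALSE)

# iii) k-fold cross-validation: every row gets a fold number;
#      each fold serves as the test set exactly once
k <- 4
fold <- sample(rep(seq_len(k), length.out = n))

print(length(train_idx))
print(length(test_idx))
print(table(fold))
```

With k = 4, each round trains on three folds (about 75% of the rows) and tests on the remaining one; k = 3 corresponds to the 2/3 and 1/3 configuration.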

Code:-

#Use Naive bayes, K-nearest, and Decision tree classification algorithms

#and build classifiers. Divide the data set into training and test set.

#Compare the accuracy of the different classifiers under the following

#situations:

#5.1 a) Training set = 75% Test set = 25%

#b) Training set = 66.6% (2/3rd of total), Test set = 33.3%


#5.2 Training set is chosen by i) hold out method

#ii) Random subsampling iii) Cross-Validation.

#Compare the accuracy of the classifiers obtained.

#5.3 Data is scaled to standard format.

data(iris)

print(summary(iris))

print(head(iris,5))

#library imports

library(caTools)

library(rpart)

#?rpart

library(e1071)

library(class)

library(rpart.plot)

library(caret)

#part a

print("Part a start Training set = 75% Test set = 25% ")

#Hold-out Method

print("Hold out Method")

set.seed(143)

split<-sample.split(iris$Species,SplitRatio = 0.75)
training_set = subset(iris,split==TRUE)

testing_set = subset(iris,split==FALSE)

#summary(training_set)

#summary(testing_set)

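#feature scaling
#(the test set is scaled with the means and standard deviations of the training set)

train_scale <- scale(training_set[,1:4])

test_scale <- scale(testing_set[,1:4],

center = attr(train_scale,"scaled:center"),

scale = attr(train_scale,"scaled:scale"))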

print(dim(training_set))

print(dim(testing_set))

#Naive bayes

print("Naive bayes-Hold out-part a")

classifier_naive <- naiveBayes(Species~., data = training_set)

print(classifier_naive)

y_pred <- predict(classifier_naive, newdata = testing_set)

print(y_pred)

#confusion matrix (rows = predicted class, columns = actual class, as confusionMatrix expects)

cm <- table(y_pred, testing_set$Species)

print(cm)
print(confusionMatrix(cm))

#K-Nearest

print("KNN-Hold out-part a")

# Fitting KNN model to training dataset

classifier_knn <- knn(train = train_scale, test = test_scale,

cl = training_set$Species,

k=1)

print(classifier_knn)

#confusion matrix

cm <- table(testing_set$Species, classifier_knn)

print(cm)

misClassError <- mean(classifier_knn != testing_set$Species)

print(paste("Accuracy of KNN model: ", 1 - misClassError))

#DECISION TREE

print("Decision Tree-Hold out-part a")

dtm<-rpart(Species~.,training_set,method = "class")

#dtm #gives textual description

#plot(dtm)

#text(dtm) #adding text to plot


rpart.plot(dtm)

rpart.plot(dtm,type = 4, extra = 101)

p<-predict(dtm,testing_set,type = "class")

#confusionMatrix takes the predicted classes first and the actual classes second

#print(confusionMatrix(p,testing_set[,5]))

print(confusionMatrix(p,testing_set[,5])$table)

print(confusionMatrix(p,testing_set[,5])$overall["Accuracy"]*100)

#Random subsampling

print("Random Subsampling")

dtm_acc=list()

knn_acc=list()

naive_acc=list()

for(x in 1:5)
{

#a different seed on every repetition, so that every split is different
set.seed(123 + x)

split<-sample.split(iris$Species,SplitRatio = 0.75)

train_set_rndm = subset(iris,split==TRUE)

test_set_rndm = subset(iris,split==FALSE)

#scale the test set with the means and standard deviations of the training set
train_scale_rndm <- scale(train_set_rndm[,1:4])

test_scale_rndm <- scale(test_set_rndm[,1:4],
center = attr(train_scale_rndm,"scaled:center"),
scale = attr(train_scale_rndm,"scaled:scale"))

# DECISION TREE USING RANDOM SUBSAMPLING (PART A)

dtm_rndm = rpart(Species~., train_set_rndm,method = "class")

p=predict(dtm_rndm , test_set_rndm,type="class")

dtm_acc[x] = (confusionMatrix(p,test_set_rndm[,5])$overall['Accuracy']*100)

print(paste("Decision tree Accuracy: ",dtm_acc[x]))

# K-NEAREST USING RANDOM SUBSAMPLING (PART A)

classifier_knn_rndm <- knn(train = train_scale_rndm, test = test_scale_rndm,
cl = train_set_rndm$Species,k=1)

#confusion matrix (rows = predicted, columns = actual)
cm = table(classifier_knn_rndm, test_set_rndm$Species)

print(cm)

misclasserror = mean(classifier_knn_rndm!=test_set_rndm$Species)

print(paste("Accuracy of knn model is :: ", 1 - misclasserror))

knn_acc[x] = confusionMatrix(cm)$overall['Accuracy']*100

# NAIVE BAYES USING RANDOM SUBSAMPLING (PART A)

classifier_naive_rndm = naiveBayes(Species~., data= train_set_rndm)

predicted_y = predict(classifier_naive_rndm,newdata = test_set_rndm)

#confusion matrix (rows = predicted, columns = actual)
cm = table(predicted_y, test_set_rndm$Species)

print("Naive Bayes confusion matrix")

print(cm)

naive_acc[x]=confusionMatrix(cm)$overall['Accuracy']*100

}

#average accuracy (in %) over the 5 random splits
print(paste("Mean Decision tree accuracy: ", mean(unlist(dtm_acc))))

print(paste("Mean KNN accuracy: ", mean(unlist(knn_acc))))

print(paste("Mean Naive Bayes accuracy: ", mean(unlist(naive_acc))))

#cross-validation

print("Cross Validation")
#Naive Bayes

print("Naive bayes-Cross Validation-part a")

nb_model<-train(iris[,1:4],iris[,5],'nb',

trControl = trainControl(method = 'cv', number=4))

print(nb_model)

#KNN

print("KNN-Cross Validation-part a")

knn_model<-train(iris[,1:4],iris[,5],'knn',

trControl = trainControl(method = 'cv', number=4))

print(knn_model)

#Decision Tree

print("Decision Tree-Cross Validation-part a")

Dtree_model<-train(iris[,1:4],iris[,5],'rpart',

trControl = trainControl(method = 'cv', number=4))

print(Dtree_model)

#part b

print("Part b start Training set = 66.6% (2/3rd of total), Test set = 33.3% ")
#Hold-out Method

print("Hold out Method")

set.seed(143)

split<-sample.split(iris$Species,SplitRatio = 0.6666)

training_set = subset(iris,split==TRUE)

testing_set = subset(iris,split==FALSE)

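#feature scaling
#(the test set is scaled with the means and standard deviations of the training set)

train_scale <- scale(training_set[,1:4])

test_scale <- scale(testing_set[,1:4],

center = attr(train_scale,"scaled:center"),

scale = attr(train_scale,"scaled:scale"))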

print(dim(training_set))

print(dim(testing_set))

#Naive bayes

print("Naive bayes-Hold out-part b")

classifier_naive <- naiveBayes(Species~., data = training_set)

print(classifier_naive)

y_pred <- predict(classifier_naive, newdata = testing_set)

print(y_pred)

#confusion matrix (rows = predicted class, columns = actual class, as confusionMatrix expects)

cm <- table(y_pred, testing_set$Species)

print(cm)

print(confusionMatrix(cm))

#K-Nearest

print("KNN-Hold out-part b")

# Fitting KNN model to training dataset

classifier_knn <- knn(train = train_scale, test = test_scale,

cl = training_set$Species,

k=1)

print(classifier_knn)

#confusion matrix

cm <- table(testing_set$Species, classifier_knn)

print(cm)

misClassError <- mean(classifier_knn != testing_set$Species)

print(paste("Accuracy of KNN model: ", 1 - misClassError))

#DECISION TREE

print("Decision Tree-Hold out-part b")

dtm<-rpart(Species~.,training_set,method = "class")
#rpart.plot(dtm)

rpart.plot(dtm,type = 4, extra = 101)

p<-predict(dtm,testing_set,type = "class")

#confusionMatrix takes the predicted classes first and the actual classes second
print(confusionMatrix(p,testing_set[,5]))

#Random subsampling

print("Random Subsampling")

dtm_acc=list()

knn_acc=list()

naive_acc=list()

for(x in 1:5)
{

#a different seed on every repetition, so that every split is different
set.seed(123 + x)

split<-sample.split(iris$Species,SplitRatio = 2/3)

train_set_rndm = subset(iris,split==TRUE)

test_set_rndm = subset(iris,split==FALSE)

#scale the test set with the means and standard deviations of the training set
train_scale_rndm <- scale(train_set_rndm[,1:4])

test_scale_rndm <- scale(test_set_rndm[,1:4],
center = attr(train_scale_rndm,"scaled:center"),
scale = attr(train_scale_rndm,"scaled:scale"))

# DECISION TREE USING RANDOM SUBSAMPLING (PART B)

dtm_rndm = rpart(Species~., train_set_rndm,method = "class")

p=predict(dtm_rndm , test_set_rndm,type="class")

dtm_acc[x] = (confusionMatrix(p,test_set_rndm[,5])$overall['Accuracy']*100)

print(paste("Decision tree Accuracy: ",dtm_acc[x]))

# K-NEAREST USING RANDOM SUBSAMPLING (PART B)

classifier_knn_rndm <- knn(train = train_scale_rndm, test = test_scale_rndm,
cl = train_set_rndm$Species,k=1)

#confusion matrix (rows = predicted, columns = actual)
cm = table(classifier_knn_rndm, test_set_rndm$Species)

print(cm)

misclasserror = mean(classifier_knn_rndm!=test_set_rndm$Species)

print(paste("Accuracy of knn model is :: ", 1 - misclasserror))

knn_acc[x] = confusionMatrix(cm)$overall['Accuracy']*100

# NAIVE BAYES USING RANDOM SUBSAMPLING (PART B)

classifier_naive_rndm = naiveBayes(Species~., data= train_set_rndm)

predicted_y = predict(classifier_naive_rndm,newdata = test_set_rndm)

#confusion matrix (rows = predicted, columns = actual)
cm = table(predicted_y, test_set_rndm$Species)

print("Naive Bayes confusion matrix")

print(cm)

naive_acc[x]=confusionMatrix(cm)$overall['Accuracy']*100

}

#average accuracy (in %) over the 5 random splits
print(paste("Mean Decision tree accuracy: ", mean(unlist(dtm_acc))))

print(paste("Mean KNN accuracy: ", mean(unlist(knn_acc))))

print(paste("Mean Naive Bayes accuracy: ", mean(unlist(naive_acc))))

#cross-validation

print("Cross Validation")

#Naive Bayes
print("Naive bayes-Cross Validation-part b")

nb_model<-train(iris[,1:4],iris[,5],'nb',

trControl = trainControl(method = 'cv', number=3))

print(nb_model)

#KNN

print("KNN-Cross Validation-part b")

knn_model<-train(iris[,1:4],iris[,5],'knn',

trControl = trainControl(method = 'cv', number=3))

print(knn_model)

#DECISION TREE

print("Decision Tree-Cross Validation-part b")

Dtree_model<-train(iris[,1:4],iris[,5],'rpart',

trControl = trainControl(method = 'cv', number=3))

print(Dtree_model)
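
To compare the two training proportions side by side, the repeated hold-out runs can be wrapped in a small function. The sketch below does this for the 1-nearest-neighbour classifier only; `knn_accuracy` is a helper name assumed for this illustration, the split is a plain random sample rather than the stratified sample.split used above, and the seeds 1 to 5 are arbitrary. The test rows are scaled with the training means and standard deviations so that no information from the test set enters preprocessing.

```r
library(class)   # knn()

# Accuracy of 1-NN on iris for one random split with the given training fraction
knn_accuracy <- function(train_frac, seed) {
  set.seed(seed)
  idx <- sample(nrow(iris), round(train_frac * nrow(iris)))
  train_x <- scale(iris[idx, 1:4])
  test_x  <- scale(iris[-idx, 1:4],
                   center = attr(train_x, "scaled:center"),
                   scale  = attr(train_x, "scaled:scale"))
  pred <- knn(train = train_x, test = test_x, cl = iris$Species[idx], k = 1)
  mean(pred == iris$Species[-idx])
}

# Five random splits for each configuration
acc_75 <- sapply(1:5, function(s) knn_accuracy(0.75, s))
acc_66 <- sapply(1:5, function(s) knn_accuracy(2/3, s))

comparison <- data.frame(
  split         = c("75/25", "66.6/33.3"),
  mean_accuracy = c(mean(acc_75), mean(acc_66)),
  sd_accuracy   = c(sd(acc_75), sd(acc_66))
)
print(comparison)
```

Reporting the standard deviation next to the mean shows how much the accuracy depends on the particular split.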

Output Screen:-
PLOTS:-
Ques 6:-

Use Simple Kmeans, DBScan, Hierarchical clustering algorithms for clustering. Compare the performance of clusters by changing the parameters involved in the algorithms.

Code:-

#Q6. Use Simple Kmeans, DBScan, Hierarchical clustering algorithms

#for clustering. Compare the performance of clusters by changing

#the parameters involved in the algorithms.

#working on HTRU_2 dataset

df <- read.csv("datasets/HTRU2/HTRU_2.csv")

print(summary(df))

#View(df)

str(df)

library(ggplot2)

#normalize data for variables measured on different scale


dfScaled <- scale(df[,-9])

dfScaled

summary(dfScaled)

set.seed(234)

#cluster the scaled data, so that variables with large ranges do not dominate the distances
kmeans_model <- kmeans(x = dfScaled, centers = 2, nstart = 20)

print(kmeans_model)

print(kmeans_model$cluster)

cm <- table(df[,9], kmeans_model$cluster)

print(cm)

#library(caret)

#print(confusionMatrix(cm))

# ----------------- HIERARCHICAL CLUSTERING --------------------

#distance matrix - matrix of distances between every pair of points

d= dist(dfScaled)

fitH = hclust(d,"ward.D2")

print(fitH)

plot(fitH)

rect.hclust(fitH,k=3,border = "blue")

cluster= cutree(fitH,3) #we can cut off the tree at the desired number of clusters using cutree

cluster
table(df[,9],cluster)

# ---------------------- DBSCAN ---------------------------

#install.packages("dbscan",dependencies = TRUE)

library(dbscan)

kNNdistplot(dfScaled,k=3) #to decide value of eps

abline(h=0.7,col="red",lty = 2)

fitD<-dbscan(dfScaled,eps=0.7,minPts = 5)

print(fitD)

#cluster 0 marks noise points; adding 1 gives them a visible colour
plot(df,col=fitD$cluster + 1)

table(df[,9],fitD$cluster)
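
The question also asks how the clusters change with the parameters. The sketch below shows one way to set up such a comparison using base R only. It runs on the built-in iris measurements instead of HTRU_2 so that it is self-contained, and `purity` is a helper defined here, not a library function. For k-means the number of centers is varied and the total within-cluster sum of squares is recorded; for hierarchical clustering the linkage method is varied and each result is scored against the known labels.

```r
x <- scale(iris[, 1:4])

# k-means: total within-cluster sum of squares for k = 1..6
set.seed(234)
wss <- sapply(1:6, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
print(round(wss, 2))

# Purity: share of points that belong to the majority class of their cluster
purity <- function(labels, clusters) {
  tab <- table(labels, clusters)
  sum(apply(tab, 2, max)) / length(labels)
}

# Hierarchical clustering: same distance matrix, different linkage methods
d <- dist(x)
linkages <- c("ward.D2", "complete", "average", "single")
hc_purity <- sapply(linkages, function(m) {
  purity(iris$Species, cutree(hclust(d, method = m), k = 3))
})
print(round(hc_purity, 3))
```

The same loop pattern applies to DBSCAN by varying eps and minPts and tabulating the resulting cluster labels.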

Output Screen:-
Plots:-
