
Name: Le Ho Thao Nguyen

Student ID: 20194224

QUESTION 1:

The concept hierarchies are as follows:

Level 1       Level 2          Level 3
status        undergraduate    {freshman, sophomore, junior, senior}
status        graduate         {M.Sc., M.A., Ph.D.}
major         science          {physics, chemistry, math}
major         appl._science    {cs, engineering}
major         arts             {French, philosophy}
age           young            {16…20, 21…25}
age           old              {26…30, over_30}
nationality   foreign          {Asia, Europe, Latin_America}
nationality   North_America    {USA, Canada}

For categorical data, a concept hierarchy can be generated automatically from the number of distinct values of each attribute in the given schema: the attribute with the fewest distinct values is placed at the most general (top) level, and the attribute with the most distinct values at the most specific (bottom) level; a sketch is shown below.

For numeric data, a concept hierarchy can be generated automatically using the equal-width/equal-depth (equal-frequency) partitioning rule, as worked through in parts (a) and (b) below.
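A minimal R sketch of the distinct-value-count heuristic, assuming a hypothetical data frame `students` holding the four categorical attributes from the schema above:

```{r}
# Rank the categorical attributes by their number of distinct values;
# fewer distinct values => more general => higher hierarchy level.
n_distinct <- sapply(students[c("status", "major", "age", "nationality")],
                     function(x) length(unique(x)))
hierarchy <- names(sort(n_distinct))  # most general level first
hierarchy
```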

major        status  age      nationality    gpa      count
engineering  junior  21…25    Canada         3.6-4.0  96
engineering  Ph.D.   26…30    Asia           3.6-4.0  78
cs           junior  16…20    Canada         3.2-3.6  76
engineering  junior  21…25    Europe         3.6-4.0  71
cs           senior  16…20    Canada         3.2-3.6  70
math         junior  16…20    Canada         3.6-4.0  59
French       junior  16…20    Canada         3.2-3.6  52
chemistry    junior  16…20    USA            3.6-4.0  46
French       senior  16…20    Canada         3.2-3.6  40
math         senior  16…20    USA            3.6-4.0  32
cs           junior  16…20    Europe         3.2-3.6  29
chemistry    junior  21…25    USA            3.6-4.0  25
philosophy   senior  26…30    Canada         2.8-3.2  19
physics      M.Sc.   26…30    Latin_America  3.2-3.6  18
philosophy   M.Sc.   over_30  Canada         3.6-4.0  15
philosophy   Ph.D.   26…30    Canada         3.6-4.0  14
philosophy   junior  26…30    Canada         2.8-3.2  9
philosophy   M.Sc.   26…30    Asia           3.2-3.6  9
French       junior  16…20    USA            2.8-3.2  8
philosophy   junior  21…25    USA            2.8-3.2  8
math         Ph.D.   26…30    Latin_America  3.6-4.0  7
philosophy   Ph.D.   26…30    Europe         3.2-3.6  5
French       M.Sc.   over_30  Latin_America  2.8-3.2  4
French       M.A.    over_30  Canada         2.8-3.2  3
French       Ph.D.   over_30  Canada         2.8-3.2  1

(a) Equal-frequency (equidepth) partitioning

Partition the 25 count values into 3 equal-frequency bins (9, 8, and 8 values):

Bin 1: 1, 3, 4, 5, 7, 8, 8, 9, 9

Bin 2: 14, 15, 18, 19, 25, 29, 32, 40

Bin 3: 46, 52, 59, 70, 71, 76, 78, 96

(b) Equal-width partitioning

Partitioning the data into 3 equal-width bins requires a width of (96 − 1)/3 ≈ 31.7, giving the intervals [1, 32.7), [32.7, 64.3), and [64.3, 96]. We get:

Bin 1: 1, 3, 4, 5, 7, 8, 8, 9, 9, 14, 15, 18, 19, 25, 29, 32

Bin 2: 40, 46, 52, 59

Bin 3: 70, 71, 76, 78, 96
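Both partitionings can be reproduced in R; the sketch below uses the 25 count values from the table above (base R's cut() produces essentially the same bin boundaries):

```{r}
counts <- sort(c(96, 78, 76, 71, 70, 59, 52, 46, 40, 32, 29, 25, 19,
                 18, 15, 14, 9, 9, 8, 8, 7, 5, 4, 3, 1))

# (a) equal-frequency: split the sorted values into 3 roughly equal groups
equidepth <- split(counts, cut(seq_along(counts), breaks = 3, labels = FALSE))

# (b) equal-width: 3 bins of width (max - min)/3 over the value range
equiwidth <- split(counts, cut(counts, breaks = 3, labels = FALSE))

equidepth
equiwidth
```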
QUESTION 2 (Naive Bayes classifier)

a. P(K=1 | a=1 ∧ b=1 ∧ c=0)
= P(K=1 ∧ a=1 ∧ b=1 ∧ c=0) / P(a=1 ∧ b=1 ∧ c=0)
= P(K=1)·P(a=1|K=1)·P(b=1|K=1)·P(c=0|K=1) / [P(a=1 ∧ b=1 ∧ c=0 ∧ K=1) + P(a=1 ∧ b=1 ∧ c=0 ∧ K=0)]
= 1/2

b. P(K=0 | a=1 ∧ b=1)
= P(K=0 ∧ a=1 ∧ b=1) / P(a=1 ∧ b=1)
= P(K=0)·P(a=1|K=0)·P(b=1|K=0) / [P(a=1 ∧ b=1 ∧ K=1) + P(a=1 ∧ b=1 ∧ K=0)]
= 2/3

Under the naive (conditional independence) assumption, each joint term in the denominators factorizes the same way as the numerator: P(evidence ∧ K=k) = P(K=k) · P(a|K=k) · P(b|K=k) · P(c|K=k).
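As a generic illustration, the posterior can be computed in R as below. The question's training table is not reproduced here, so the probabilities in this sketch are placeholders, not the actual values behind the 1/2 and 2/3 results:

```{r}
# Posterior P(K = k | evidence) under the naive independence assumption.
# `prior` holds c(P(K=0), P(K=1)); each row of `lik` holds the per-attribute
# conditionals P(x_i | K) for K = 0 and K = 1 respectively.
nb_posterior <- function(prior, lik, k) {
  joint <- prior * apply(lik, 1, prod)  # P(K) * prod_i P(x_i | K)
  joint[k + 1] / sum(joint)             # normalize over both classes
}

# Placeholder numbers only -- NOT taken from the question's training table:
prior <- c(0.5, 0.5)
lik <- rbind(c(0.4, 0.6, 0.5),  # P(a=1|K=0), P(b=1|K=0), P(c=0|K=0)
             c(0.6, 0.4, 0.5))  # P(a=1|K=1), P(b=1|K=1), P(c=0|K=1)
nb_posterior(prior, lik, k = 1)
```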

QUESTION 3 (Classification model evaluation)


a. Accuracy (M1) = (5+90)/(5+95+90+10) = 47.5%

Accuracy (M2) = (85+5)/(85+15+5+95) = 45%

b. Precision (M1) = 5/(5+10) = 33.33%

This means that 33.33% of the instances M1 predicts as positive are actually positive.

Precision (M2) = 85/(85+95) = 47.22%

This means that 47.22% of the instances M2 predicts as positive are actually positive.

If precision is of paramount importance in the application, we should choose model M2, since higher precision is better.

c. Assuming the cost of a false positive is higher than that of a false negative, the optimal model is the one with the fewest false positives, i.e., the model with the higher specificity.

Specificity (M1) = 90/(90+10)=90%

Specificity (M2) = 5/(95+5) = 5%

Therefore, model M1 will be chosen.
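For reference, a short R sketch that recomputes all three metrics from the confusion-matrix counts used above (TP, FN, FP, TN):

```{r}
# Accuracy, precision, and specificity from raw confusion-matrix counts.
metrics <- function(TP, FN, FP, TN) {
  c(accuracy    = (TP + TN) / (TP + FN + FP + TN),
    precision   = TP / (TP + FP),
    specificity = TN / (TN + FP))
}
metrics(TP = 5,  FN = 95, FP = 10, TN = 90)  # M1: 0.475, 0.333, 0.900
metrics(TP = 85, FN = 15, FP = 95, TN = 5)   # M2: 0.450, 0.472, 0.050
```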

Question 4 (R programming)

```{r}

# load the packages used throughout this analysis

library(ggplot2)

library(caret)

library(tidyverse)

library(caretEnsemble)
library(Amelia)

library(mice)

library(randomForest)

library(naivebayes)

library(foreign)

library(haven)

hsbdemo <- read_dta("Downloads/hsbdemo.dta")

str(hsbdemo)

```

```{r}

# display first 5 rows of data

head(hsbdemo, n=5)

```

```{r}

#Check dimension

dim(hsbdemo)

```

```{r}

#check data types

sapply(hsbdemo, class)

```

```{r}

# summary statistics for each attribute

summary(hsbdemo)

```

```{r}

# calculate standard deviation for all attributes


sapply(hsbdemo[,1:13], sd)

```

```{r}

#Check skewness

library(mlbench)

library(e1071)

skew <- apply(hsbdemo[,1:13], 2, skewness)

print(skew)

```

```{r}

# calculate a correlation matrix for numeric variables

correlations <- cor(hsbdemo[,2:13])

print(correlations)

```

```{r}

#Create separate boxplots for each attribute

par(mfrow=c(1,8))

for(i in 1:8) {
  # hsbdemo is a tibble, so use [[i]] to extract the column as a vector
  boxplot(hsbdemo[[i]], main=names(hsbdemo)[i])
}

```

```{r}

#create a missing map

missmap(hsbdemo, col=c("black", "grey"), legend=FALSE)

```

```{r}

#Correlation plot

library(corrplot)
corrplot(correlations, method="circle")

```

```{r}

#scatterplot matrix

pairs(hsbdemo[2:13])

#Q-Q plot

ggplot(hsbdemo, aes(sample=read)) + stat_qq()

```

```{r}

# load the packages

library(caret)

library(klaR)

# define a 70%/30% train/test split of the dataset

# convert the target to a factor before splitting so both partitions inherit it
hsbdemo$prog <- as.factor(hsbdemo$prog)

trainIndex <- createDataPartition(hsbdemo$prog, p=0.70, list=FALSE)

dataTrain <- hsbdemo[ trainIndex,]

dataTest <- hsbdemo[-trainIndex,]

# train a naive Bayes model

fit <- NaiveBayes(prog~., data=dataTrain)

# make predictions

predictions <- predict(fit, dataTest[,2:13])

# summarize results

confusionMatrix(predictions$class, dataTest$prog)

library(knitr)

# extract the R code from this notebook into a standalone script
purl("test.Rmd", output = "test2.R", documentation = 2)

Question 5
a. Neural Network
Time taken to build model: 0.23 seconds

=== Evaluation on test split ===

Time taken to test model on test split: 0 seconds

=== Summary ===

Correctly Classified Instances 90 87.3786 %


Incorrectly Classified Instances 13 12.6214 %
Kappa statistic 0.7472
Mean absolute error 0.1334
Root mean squared error 0.3144
Relative absolute error 26.7735 %
Root relative squared error 62.6602 %
Total Number of Instances 103

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.860    0.113    0.878      0.860   0.869      0.747  0.955     0.955     Sick
              0.887    0.140    0.870      0.887   0.879      0.747  0.955     0.961     Healthy
Weighted Avg. 0.874    0.127    0.874      0.874   0.874      0.747  0.955     0.958

=== Confusion Matrix ===

a b <-- classified as
43 7 | a = Sick
6 47 | b = Healthy

b. Decision Tree
=== Run information ===

Scheme: weka.classifiers.trees.J48 -L -C 0.25 -M 2


Relation: cardiology
Instances: 303
Attributes: 14
age
sex
chest-pain-type
blood-pressure
cholesterol
Fasting-blood-sugar<120
resting-ecg
maximum-heart-rate
angina
peak
slope
#colored-vessels
thal
class
Test mode: 10-fold cross-validation

Number of Leaves : 30

Size of the tree : 49

Time taken to build model: 0.07 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 224 73.9274 %


Incorrectly Classified Instances 79 26.0726 %
Kappa statistic 0.4734
Mean absolute error 0.2956
Root mean squared error 0.4727
Relative absolute error 59.5847 %
Root relative squared error 94.903 %
Total Number of Instances 303

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.703    0.230    0.719      0.703   0.711      0.474  0.738     0.689     Sick
              0.770    0.297    0.756      0.770   0.763      0.474  0.738     0.699     Healthy
Weighted Avg. 0.739    0.267    0.739      0.739   0.739      0.474  0.738     0.694

=== Confusion Matrix ===

a b <-- classified as
97 41 | a = Sick
38 127 | b = Healthy
c. Comment: The Neural Network model, built with the Multilayer Perceptron/backpropagation algorithm, achieves higher accuracy than the J48 Decision Tree (87.4% vs. 73.9%). Note that the Neural Network was evaluated on a held-out test split while the Decision Tree was evaluated with 10-fold cross-validation, so the two figures are not measured under identical protocols.
