Name: Le Ho Thao Nguyen Student ID: 20194224
QUESTION 1:
The automatic generation of a concept hierarchy for categorical data based on the number of
distinct values of attributes in the given schema
The automatic generation of a concept hierarchy for numeric data based on the
equiwidth/equidepth partitioning rule
Equidepth (equal-frequency) partitioning:
Bin2: 14, 15, 18, 19, 25, 29, 32, 40
Bin3: 46, 52, 59, 70, 71, 76, 78, 96
Equiwidth partitioning:
Bin2: 40,
Bin3: 46, 52, 59, 70, 71, 78, 96
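For reference, below is a minimal R sketch of the two partitioning rules. The vector is only illustrative: it contains the Bin2/Bin3 values listed above but not the Bin1 values (which are not shown here), so its output will not reproduce the answer exactly.

```{r}
# Minimal sketch of equiwidth vs. equidepth partitioning on an illustrative vector
x <- sort(c(14, 15, 18, 19, 25, 29, 32, 40, 46, 52, 59, 70, 71, 76, 78, 96))

# Equiwidth: 3 bins covering value ranges of equal width
equiwidth <- cut(x, breaks = 3)
split(x, equiwidth)

# Equidepth (equal frequency): 3 bins with roughly the same number of values
equidepth <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
                 include.lowest = TRUE)
split(x, equidepth)
```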
QUESTION 2 (Naive Bayes classifier)
a. P(K=1 | a=1 ∧ b=1 ∧ c=0)
= P(K=1 ∧ a=1 ∧ b=1 ∧ c=0) / P(a=1 ∧ b=1 ∧ c=0)
= P(K=1)·P(a=1|K=1)·P(b=1|K=1)·P(c=0|K=1) / [P(a=1 ∧ b=1 ∧ c=0 ∧ K=1) + P(a=1 ∧ b=1 ∧ c=0 ∧ K=0)]
(the numerator is expanded using the naive Bayes conditional-independence assumption)
= 1/2
b. P(K=0 | a=1 ∧ b=1)
= P(K=0 ∧ a=1 ∧ b=1) / P(a=1 ∧ b=1)
= P(K=0)·P(a=1|K=0)·P(b=1|K=0) / [P(a=1 ∧ b=1 ∧ K=1) + P(a=1 ∧ b=1 ∧ K=0)]
= 2/3
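The same calculation can be written out in R. The conditional probabilities below are placeholders, not the values from the question's training table, so this is only a sketch of the computation:

```{r}
# Placeholder conditional probabilities -- NOT the values from the question's
# training data; they only illustrate the naive Bayes computation.
p_k1 <- 0.5; p_a1_k1 <- 0.5; p_b1_k1 <- 0.5; p_c0_k1 <- 0.5
p_k0 <- 0.5; p_a1_k0 <- 0.5; p_b1_k0 <- 0.5; p_c0_k0 <- 0.25

# joint probabilities under the conditional-independence assumption
joint_k1 <- p_k1 * p_a1_k1 * p_b1_k1 * p_c0_k1   # P(a=1, b=1, c=0, K=1)
joint_k0 <- p_k0 * p_a1_k0 * p_b1_k0 * p_c0_k0   # P(a=1, b=1, c=0, K=0)

# P(K = 1 | a = 1, b = 1, c = 0)
joint_k1 / (joint_k1 + joint_k0)
```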
c. Assuming the cost of a False Positive is higher than the cost of a False Negative, the optimal model is the one with the fewest False Positives, i.e., the model with the higher specificity should be selected.
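A small sketch of how that comparison could be made in R, using hypothetical confusion-matrix counts (not values from this question):

```{r}
# Hypothetical confusion matrices (rows = actual class, columns = predicted),
# used only to illustrate the specificity comparison in part (c)
model_A <- matrix(c(40, 10,   # actual positive: TP, FN
                     5, 45),  # actual negative: FP, TN
                  nrow = 2, byrow = TRUE)
model_B <- matrix(c(45,  5,
                    12, 38), nrow = 2, byrow = TRUE)

# specificity = TN / (TN + FP): the model with fewer false positives wins
specificity <- function(cm) cm[2, 2] / (cm[2, 2] + cm[2, 1])
specificity(model_A)  # higher specificity -> preferred when FP cost dominates
specificity(model_B)
```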
QUESTION 4 (R programming)
```{r}
# load the libraries used in Question 4
library(ggplot2)
library(caret)
library(tidyverse)
library(caretEnsemble)
library(Amelia)
library(mice)
library(randomForest)
library(naivebayes)
library(foreign)
library(haven)
# read the hsbdemo data (assumed to be a local copy of the Stata file hsbdemo.dta)
hsbdemo <- read.dta("hsbdemo.dta")
str(hsbdemo)
```
```{r}
head(hsbdemo, n=5)
```
```{r}
# Check the dimensions of the data set
dim(hsbdemo)
```
```{r}
# Check the class of each attribute
sapply(hsbdemo, class)
```
```{r}
# Summary statistics for each attribute
summary(hsbdemo)
```
```{r}
# Check the skewness of each numeric attribute
library(mlbench)
library(e1071)
num_cols <- sapply(hsbdemo, is.numeric)
skew <- apply(hsbdemo[, num_cols], 2, skewness)
print(skew)
```
```{r}
# Pairwise correlations between the numeric attributes
correlations <- cor(hsbdemo[, sapply(hsbdemo, is.numeric)])
print(correlations)
```
```{r}
# Boxplots of the numeric attributes
num_idx <- which(sapply(hsbdemo, is.numeric))
par(mfrow = c(1, length(num_idx)))
for (i in num_idx) {
  boxplot(hsbdemo[, i], main = names(hsbdemo)[i])
}
```
```{r}
#Correlation plot
library(corrplot)
corrplot(correlations, method="circle")
```
```{r}
# Scatterplot matrix
pairs(hsbdemo[2:13])
# Q-Q plot of the read scores
ggplot(hsbdemo, aes(sample = read)) + stat_qq()
```
```{r}
library(caret)
library(klaR)
# assumed 70/30 train/test split on prog (the outcome, treated as a factor)
hsbdemo$prog <- as.factor(hsbdemo$prog)
inTrain <- createDataPartition(hsbdemo$prog, p = 0.7, list = FALSE)
dataTrain <- hsbdemo[inTrain, ]
dataTest <- hsbdemo[-inTrain, ]
# make predictions with a naive Bayes model fitted on the test-score attributes
model <- NaiveBayes(prog ~ read + write + math + science + socst, data = dataTrain)
predictions <- predict(model, dataTest)
# summarize results
confusionMatrix(predictions$class, dataTest$prog)
library(knitr)
purl("test.Rmd", output = "test2.R", documentation = 2)
```
QUESTION 5
a. Neural Network
Time taken to build model: 0.23 seconds
=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.860    0.113    0.878      0.860   0.869      0.747  0.955     0.955     Sick
               0.887    0.140    0.870      0.887   0.879      0.747  0.955     0.961     Healthy
Weighted Avg.  0.874    0.127    0.874      0.874   0.874      0.747  0.955     0.958

=== Confusion Matrix ===

  a  b   <-- classified as
 43  7 |  a = Sick
  6 47 |  b = Healthy
b. Decision Tree
=== Run information ===
Number of Leaves : 30

=== Confusion Matrix ===

   a   b   <-- classified as
  97  41 |  a = Sick
  38 127 |  b = Healthy
c. Comment: The Neural Network model, built with the Multilayer Perceptron (backpropagation) algorithm, achieves higher accuracy than the J48 Decision Tree (87.4% vs. 73.9%), so the Neural Network is the preferred model on this data set.
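As a quick check, the two accuracy figures can be recomputed from the confusion matrices reported above (the variable names below are only illustrative):

```{r}
# confusion matrices copied from the Weka output above (rows = actual class)
nn_cm  <- matrix(c(43,  7,
                    6, 47), nrow = 2, byrow = TRUE)   # Multilayer Perceptron
j48_cm <- matrix(c(97,  41,
                   38, 127), nrow = 2, byrow = TRUE)  # J48 decision tree

# accuracy = correctly classified instances / all instances
accuracy <- function(cm) sum(diag(cm)) / sum(cm)
accuracy(nn_cm)   # ~0.874
accuracy(j48_cm)  # ~0.739
```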