
Name: Le Ho Thao Nguyen

Student ID: 20194224

QUESTION 1:

The concept hierarchies are as follows:

Level 1       Level 2          Level 3
status        undergraduate    {freshman, sophomore, junior, senior}
status        graduate         {M.Sc., M.A., Ph.D.}
major         science          {physics, chemistry, math}
major         appl._science    {cs, engineering}
major         arts             {French, philosophy}
age           young            {16…20, 21…25}
age           old              {26…30, over_30}
nationality   foreign          {Asia, Europe, Latin_America}
nationality   North_America    {USA, Canada}

For categorical data, a concept hierarchy can be generated automatically from the number of distinct values of each attribute in the given schema: the attribute with the fewest distinct values is placed at the most general (top) level, and the attribute with the most distinct values at the most specific (bottom) level; a sketch is shown below.

For numeric data, a concept hierarchy can be generated automatically using the equal-width/equal-depth (equal-frequency) partitioning rule, as worked through in parts (a) and (b) below.
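A minimal R sketch of the distinct-value-count heuristic, assuming a hypothetical data frame `students` holding the four categorical attributes from the schema above:

```{r}
# Rank the categorical attributes by their number of distinct values;
# fewer distinct values => more general => higher hierarchy level.
n_distinct <- sapply(students[c("status", "major", "age", "nationality")],
                     function(x) length(unique(x)))
hierarchy <- names(sort(n_distinct))  # most general level first
hierarchy
```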

major        status  age      nationality    gpa      count
engineering  junior  21…25    Canada         3.6-4.0  96
engineering  Ph.D.   26…30    Asia           3.6-4.0  78
cs           junior  16…20    Canada         3.2-3.6  76
engineering  junior  21…25    Europe         3.6-4.0  71
cs           senior  16…20    Canada         3.2-3.6  70
math         junior  16…20    Canada         3.6-4.0  59
French       junior  16…20    Canada         3.2-3.6  52
chemistry    junior  16…20    USA            3.6-4.0  46
French       senior  16…20    Canada         3.2-3.6  40
math         senior  16…20    USA            3.6-4.0  32
cs           junior  16…20    Europe         3.2-3.6  29
chemistry    junior  21…25    USA            3.6-4.0  25
philosophy   senior  26…30    Canada         2.8-3.2  19
physics      M.Sc.   26…30    Latin_America  3.2-3.6  18
philosophy   M.Sc.   over_30  Canada         3.6-4.0  15
philosophy   Ph.D.   26…30    Canada         3.6-4.0  14
philosophy   junior  26…30    Canada         2.8-3.2  9
philosophy   M.Sc.   26…30    Asia           3.2-3.6  9
French       junior  16…20    USA            2.8-3.2  8
philosophy   junior  21…25    USA            2.8-3.2  8
math         Ph.D.   26…30    Latin_America  3.6-4.0  7
philosophy   Ph.D.   26…30    Europe         3.2-3.6  5
French       M.Sc.   over_30  Latin_America  2.8-3.2  4
French       M.A.    over_30  Canada         2.8-3.2  3
French       Ph.D.   over_30  Canada         2.8-3.2  1

(a) Equal-frequency (equidepth) partitioning

Partition the 25 count values into 3 equal-frequency bins (9, 8, and 8 values):

Bin 1: 1, 3, 4, 5, 7, 8, 8, 9, 9

Bin 2: 14, 15, 18, 19, 25, 29, 32, 40

Bin 3: 46, 52, 59, 70, 71, 76, 78, 96

(b) Equal-width partitioning

Partitioning the data into 3 equal-width bins requires a width of (96 − 1)/3 ≈ 31.7, giving the intervals [1, 32.7), [32.7, 64.3), and [64.3, 96]. We get:

Bin 1: 1, 3, 4, 5, 7, 8, 8, 9, 9, 14, 15, 18, 19, 25, 29, 32

Bin 2: 40, 46, 52, 59

Bin 3: 70, 71, 76, 78, 96
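Both partitionings can be reproduced in R; the sketch below uses the 25 count values from the table above (base R's cut() produces essentially the same bin boundaries):

```{r}
counts <- sort(c(96, 78, 76, 71, 70, 59, 52, 46, 40, 32, 29, 25, 19,
                 18, 15, 14, 9, 9, 8, 8, 7, 5, 4, 3, 1))

# (a) equal-frequency: split the sorted values into 3 roughly equal groups
equidepth <- split(counts, cut(seq_along(counts), breaks = 3, labels = FALSE))

# (b) equal-width: 3 bins of width (max - min)/3 over the value range
equiwidth <- split(counts, cut(counts, breaks = 3, labels = FALSE))

equidepth
equiwidth
```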
QUESTION 2 (Naive Bayes classifier)

a. P(K=1 | a=1 ∧ b=1 ∧ c=0)
= P(K=1 ∧ a=1 ∧ b=1 ∧ c=0) / P(a=1 ∧ b=1 ∧ c=0)
= P(K=1)·P(a=1|K=1)·P(b=1|K=1)·P(c=0|K=1) / [P(a=1 ∧ b=1 ∧ c=0 ∧ K=1) + P(a=1 ∧ b=1 ∧ c=0 ∧ K=0)]
= 1/2

b. P(K=0 | a=1 ∧ b=1)
= P(K=0 ∧ a=1 ∧ b=1) / P(a=1 ∧ b=1)
= P(K=0)·P(a=1|K=0)·P(b=1|K=0) / [P(a=1 ∧ b=1 ∧ K=1) + P(a=1 ∧ b=1 ∧ K=0)]
= 2/3

Under the naive (conditional independence) assumption, each joint term in the denominators factorizes the same way as the numerator: P(evidence ∧ K=k) = P(K=k) · P(a|K=k) · P(b|K=k) · P(c|K=k).
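As a generic illustration, the posterior can be computed in R as below. The question's training table is not reproduced here, so the probabilities in this sketch are placeholders, not the actual values behind the 1/2 and 2/3 results:

```{r}
# Posterior P(K = k | evidence) under the naive independence assumption.
# `prior` holds c(P(K=0), P(K=1)); each row of `lik` holds the per-attribute
# conditionals P(x_i | K) for K = 0 and K = 1 respectively.
nb_posterior <- function(prior, lik, k) {
  joint <- prior * apply(lik, 1, prod)  # P(K) * prod_i P(x_i | K)
  joint[k + 1] / sum(joint)             # normalize over both classes
}

# Placeholder numbers only -- NOT taken from the question's training table:
prior <- c(0.5, 0.5)
lik <- rbind(c(0.4, 0.6, 0.5),  # P(a=1|K=0), P(b=1|K=0), P(c=0|K=0)
             c(0.6, 0.4, 0.5))  # P(a=1|K=1), P(b=1|K=1), P(c=0|K=1)
nb_posterior(prior, lik, k = 1)
```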

QUESTION 3 (Classification model evaluation)


a. Accuracy (M1) = (5+90)/(5+95+90+10) = 47.5%

Accuracy (M2) = (85+5)/(85+15+5+95) = 45%

b. Precision (M1) = 5/(5+10) = 33.33%

This means that 33.33% of the instances M1 predicts as positive are actually positive.

Precision (M2) = 85/(85+95) = 47.22%

This means that 47.22% of the instances M2 predicts as positive are actually positive.

If precision is of paramount importance in the application, we should choose model M2, since higher precision is better.

c. Assuming the cost of a false positive is higher than that of a false negative, the optimal model is the one with the fewest false positives, i.e., the model with the higher specificity.

Specificity (M1) = 90/(90+10)=90%

Specificity (M2) = 5/(95+5) = 5%

Therefore, model M1 will be chosen.
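For reference, a short R sketch that recomputes all three metrics from the confusion-matrix counts used above (TP, FN, FP, TN):

```{r}
# Accuracy, precision, and specificity from raw confusion-matrix counts.
metrics <- function(TP, FN, FP, TN) {
  c(accuracy    = (TP + TN) / (TP + FN + FP + TN),
    precision   = TP / (TP + FP),
    specificity = TN / (TN + FP))
}
metrics(TP = 5,  FN = 95, FP = 10, TN = 90)  # M1: 0.475, 0.333, 0.900
metrics(TP = 85, FN = 15, FP = 95, TN = 5)   # M2: 0.450, 0.472, 0.050
```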

Question 4 (R programming)

```{r}

# load the packages used throughout this analysis

library(ggplot2)

library(caret)

library(tidyverse)

library(caretEnsemble)
library(Amelia)

library(mice)

library(randomForest)

library(naivebayes)

library(foreign)

library(haven)

hsbdemo <- read_dta("Downloads/hsbdemo.dta")

str(hsbdemo)

```

```{r}

# display first 5 rows of data

head(hsbdemo, n=5)

```

```{r}

#Check dimension

dim(hsbdemo)

```

```{r}

#check data types

sapply(hsbdemo, class)

```

```{r}

# summary statistics for each attribute

summary(hsbdemo)

```

```{r}

# calculate standard deviation for all attributes


sapply(hsbdemo[,1:13], sd)

```

```{r}

#Check skewness

library(mlbench)

library(e1071)

skew <- apply(hsbdemo[,1:13], 2, skewness)

print(skew)

```

```{r}

# calculate a correlation matrix for numeric variables

correlations <- cor(hsbdemo[,2:13])

print(correlations)

```

```{r}

#Create separate boxplots for each attribute

par(mfrow=c(1,8))

for(i in 1:8) {
  # hsbdemo is a tibble, so use [[i]] to extract the column as a vector
  boxplot(hsbdemo[[i]], main=names(hsbdemo)[i])
}

```

```{r}

#create a missing map

missmap(hsbdemo, col=c("black", "grey"), legend=FALSE)

```

```{r}

#Correlation plot

library(corrplot)
corrplot(correlations, method="circle")

```

```{r}

#scatterplot matrix

pairs(hsbdemo[2:13])

#Q-Q plot

ggplot(hsbdemo, aes(sample=read)) + stat_qq()

```

```{r}

# load the packages

library(caret)

library(klaR)

# define a 70%/30% train/test split of the dataset

# convert the target to a factor before splitting so both partitions inherit it
hsbdemo$prog <- as.factor(hsbdemo$prog)

trainIndex <- createDataPartition(hsbdemo$prog, p=0.70, list=FALSE)

dataTrain <- hsbdemo[ trainIndex,]

dataTest <- hsbdemo[-trainIndex,]

# train a naive Bayes model

fit <- NaiveBayes(prog~., data=dataTrain)

# make predictions

predictions <- predict(fit, dataTest[,2:13])

# summarize results

confusionMatrix(predictions$class, dataTest$prog)

library(knitr)

# extract the R code from this notebook into a standalone script
purl("test.Rmd", output = "test2.R", documentation = 2)

Question 5
a. Neural Network
Time taken to build model: 0.23 seconds

=== Evaluation on test split ===

Time taken to test model on test split: 0 seconds

=== Summary ===

Correctly Classified Instances 90 87.3786 %


Incorrectly Classified Instances 13 12.6214 %
Kappa statistic 0.7472
Mean absolute error 0.1334
Root mean squared error 0.3144
Relative absolute error 26.7735 %
Root relative squared error 62.6602 %
Total Number of Instances 103

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.860    0.113    0.878      0.860   0.869      0.747  0.955     0.955     Sick
              0.887    0.140    0.870      0.887   0.879      0.747  0.955     0.961     Healthy
Weighted Avg. 0.874    0.127    0.874      0.874   0.874      0.747  0.955     0.958

=== Confusion Matrix ===

a b <-- classified as
43 7 | a = Sick
6 47 | b = Healthy

b. Decision Tree
=== Run information ===

Scheme: weka.classifiers.trees.J48 -L -C 0.25 -M 2


Relation: cardiology
Instances: 303
Attributes: 14
age
sex
chest-pain-type
blood-pressure
cholesterol
Fasting-blood-sugar<120
resting-ecg
maximum-heart-rate
angina
peak
slope
#colored-vessels
thal
class
Test mode: 10-fold cross-validation

Number of Leaves : 30

Size of the tree : 49

Time taken to build model: 0.07 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 224 73.9274 %


Incorrectly Classified Instances 79 26.0726 %
Kappa statistic 0.4734
Mean absolute error 0.2956
Root mean squared error 0.4727
Relative absolute error 59.5847 %
Root relative squared error 94.903 %
Total Number of Instances 303

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.703    0.230    0.719      0.703   0.711      0.474  0.738     0.689     Sick
              0.770    0.297    0.756      0.770   0.763      0.474  0.738     0.699     Healthy
Weighted Avg. 0.739    0.267    0.739      0.739   0.739      0.474  0.738     0.694

=== Confusion Matrix ===

a b <-- classified as
97 41 | a = Sick
38 127 | b = Healthy
c. Comment: The Neural Network model, built with the Multilayer Perceptron/backpropagation algorithm, achieves higher accuracy than the J48 Decision Tree (87.4% vs. 73.9%). Note that the Neural Network was evaluated on a held-out test split while the Decision Tree was evaluated with 10-fold cross-validation, so the two figures are not measured under identical protocols.
