
Assignment-2 Group 8 ADM 3308 A

Uploaded by

Ayoub Essadouky

ADM3308-Fall 2023: Business Data Mining

Assignment #2 (Group Work)


_____________________________________________________________________

Submission Instruction:
● Submit the assignment to Brightspace no later than midnight on Oct. 30,
2023.
● Only one submission per group. Please choose a group representative to submit
one copy of the report to Brightspace.

Weight: 10% of the final course mark.


______________________________________________________________________

Statements on Group Contribution and Academic Integrity

When submitting your group work, the submission must include the following two
statements. Without the following two statements included in your submission, your
assignment will not be marked.

(a) Statement of Contributions


In a brief statement, explain the contribution of each group member to the
assignment. Mention the name of the group member and the specific tasks (or
items) accomplished by that group member as their contribution to the
assignment.

Ayoub: Q1, Q2, Q5 and practical problem 6
Reana: Q3, Q4 and practical problem 8
Gabriel: practical problem 9

(b) Academic Integrity Statement


Each individual member of the group must read the Academic Integrity
Statement, and type in their full name and student ID. The Academic Integrity
Statement must be e-signed by ALL members of the group UNLESS a group
member has not contributed to the assignment.

NOTE: If the above two statements are not included in your original submission, the
assignment will not be marked. Then, the following deductions will be applied:

-20% if the statement was not submitted with the original submission, but was
submitted within 24 hours after being reminded by the TA.

-100% if the statement was not submitted within 24 hours after being reminded by
the TA.

University of Ottawa Telfer School of Management Page 1 of 13



Important Note: Each member of the group must read the following academic
integrity statement, and type in their full name and student ID. The Academic
Integrity Statement must be e-signed by ALL members of the group UNLESS a
group member has not contributed to the assignment. Your assignment will not be
marked if the following academic integrity statement is not submitted.

Statement of Academic Integrity


Group Assignment Checklist & Disclosure

Please read the disclosure below following the completion of your group assignment.
Once all team members have verified these points, hand in this signed disclosure with
your group assignment.
1. All team members acknowledge to have read and understood their responsibilities for
maintaining academic integrity, as defined by the University of Ottawa’s policies and
regulations. Furthermore, all members understand that any violation of academic
integrity may result in strict disciplinary action as outlined in the regulations.
2. If applicable, all team members have referenced and/or footnoted all ideas, words, or
other intellectual property from other sources used in completing this assignment.
3. A proper bibliography is included, which includes acknowledgement of all sources used
to complete this assignment.
4. This is the first time that any member of the group has submitted this assignment or
essay (either partially or entirely) for academic evaluation.
5. No member of the team has utilized unauthorized assistance or aids including but not
limited to outsourcing assignment solutions, and unethical use of online services such as
artificial intelligence tools and course-sharing websites.
6. Each member of the group has read the full content of the submission and is assured
that the content is free of violations of academic integrity. Group discussions regarding
the importance of academic integrity have taken place.
7. All team members have identified their individual contributions to the work submitted
such that if violations of academic integrity are suspected, then the student(s) primarily
responsible for the violations may be identified. Note that the remainder of the team will
also be subject to disciplinary action.

Course Code: ADM3308, Fall 2023

Group Number: 8

Assignment # or Title: Assignment #2

Date of Submission: Monday, October 30, 2023


Student Full Name Signature

Ayoub Essadouky AE

Reana Agil RA

Gabriel Torres Stelluto GTS


Assignment #2 (Group Work)


_____________________________________________________________________

Part I- Review Questions

Q1) [4 marks] Aside from the Gini and Entropy, explain two other parameters to measure
purity (best split) in a decision tree.

Aside from Gini and Entropy, two other parameters to measure purity (best split) in a
decision tree are:

Information Gain Ratio:

Information gain ratio is a purity measure used for categorical targets. It addresses a
problem with entropy-based information gain: part of the decrease in entropy is due solely
to the number of branches (the intrinsic information of the split), so splits with many
branches are favoured.
Info gain ratio is calculated as the total information gain due to the proposed split divided
by the intrinsic information. It is used in algorithms like C5.

Reduction in Variance:

Reduction in variance is a purity measure used for numeric targets. It assesses the spread
of values within a node. If the values in a node are close to each other, they are close to
the node's mean value and the variance is close to zero.
This measure is specifically designed for numerical variables: the best split is the one
that most reduces the variance within the child nodes relative to the parent node.
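
Both measures can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function names and the toy data are hypothetical and are not taken from the course material, and entropy is computed in base 2.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(parent, children):
    """Information gain of a split divided by its intrinsic information."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    info_gain = entropy(parent) - weighted
    # Intrinsic information depends only on the branch sizes, not on the labels.
    intrinsic = -sum(len(ch) / n * math.log2(len(ch) / n) for ch in children)
    return info_gain / intrinsic if intrinsic > 0 else 0.0


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the child nodes."""
    n = len(parent)
    return variance(parent) - sum(len(ch) / n * variance(ch) for ch in children)


# Hypothetical toy data, not taken from the assignment.
labels = ["yes", "yes", "no", "no"]
ratio = gain_ratio(labels, [["yes", "yes"], ["no", "no"]])
values = [1.0, 2.0, 9.0, 10.0]
reduction = variance_reduction(values, [[1.0, 2.0], [9.0, 10.0]])
print("gain ratio:", ratio)
print("reduction in variance:", reduction)
```

For a categorical target one would pick the split with the highest gain ratio; for a numeric target, the split with the largest reduction in variance.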

Q2) [5 marks] (a) Explain “over fitting”. What is its negative impact on learner models?
(b) Explain two approaches to prevent overfitting?

a.
Overfitting occurs when a machine learning model learns the training data too well, to the
point where it captures noise and randomness in the data rather than the underlying
pattern. This leads to a model that performs exceptionally well on the training data but
poorly on new, unseen data.

The negative impact of overfitting on learner models is significant:

Poor Generalization: Overfit models have learned the training data too closely and are
overly sensitive to small fluctuations in the data. This leads to poor performance on new
data because the model has essentially memorized the training set, rather than learning
the underlying relationship.


Reduced Predictive Power: The primary goal of a machine learning model is to make
accurate predictions on new, unseen data. Overfit models fail in this regard, as they are
tailored too specifically to the training data and can't effectively generalize to different
data distributions.

Unreliable Insights: Overfit models might infer relationships that are purely coincidental
and not actually meaningful. This can lead to incorrect conclusions about the data.

b. There are two possible approaches to avoid overfitting:

Pruning: After growing a full tree, the pruning process begins. Pruning selectively
removes branches and nodes from the tree to improve its generalization capabilities.
The goal is to reduce model complexity: a simpler, more interpretable tree that still
maintains good predictive performance on unseen data.

Cross-validation: The training data is partitioned into subsets. The model is
trained on a portion of the data (training set) and evaluated on the remaining part
(validation set). This process is repeated multiple times, with different subsets used for
training and validation each time. In other words, we should not use all of the data at
once: the model is always evaluated on records it did not see during training.
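
The partitioning idea behind cross-validation can be sketched in a few lines of Python. This is a minimal illustration, assuming a k-fold scheme; the helper name `k_fold_indices` and the record count are hypothetical.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


splits = list(k_fold_indices(n_records=10, k=5))
for train, valid in splits:
    # Each record is held out for validation exactly once across the k rounds.
    print(sorted(valid), "held out;", len(train), "records used for training")
```

A model would be fitted on each `train` list and scored on the matching `valid` list, and the k scores averaged.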

Q3) [4 marks] (a) What do the X-axis and Y-axis represent in an ROC curve? What does
the area under the curve represent? Draw an example ROC curve for a classifier that
performs better than a random guess. (b) Describe a real world application where false
negative rate is of high interest (in addition to the accuracy of the classifier).

a) An ROC curve plots sensitivity against 1 - specificity for every possible
cut-off. The x-axis shows 1 - specificity, i.e. the false positive rate =
FP / (FP + TN). It measures how often a false alarm occurs, i.e. how often an
actual negative value is classified as a positive value. The y-axis shows the
sensitivity, i.e. the true positive rate = TP / (TP + FN), which measures the
probability that an actual positive value is classified as a positive value.
The area under the curve (AUC) measures the test's discriminative ability: it is
the probability that the classifier scores a randomly chosen positive higher
than a randomly chosen negative. A random guess follows the diagonal from (0, 0)
to (1, 1) and has an AUC of 0.5. A classifier that performs better than a random
guess has a curve that bows above the diagonal, and the higher its AUC score (up
to 1), the better it separates the two classes.
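
To make the axes concrete, the following Python sketch computes ROC points and the AUC for a handful of records. The labels and scores are hypothetical, and the trapezoidal rule is one common way (assumed here) to approximate the area.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the cut-off over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))  # x = 1 - specificity, y = sensitivity
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points)
print("AUC =", round(auc(points), 4))
```

An AUC above 0.5 for these toy scores indicates a classifier that ranks positives above negatives more often than a random guess would.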


b) In email spam filtering, we want the false positive rate to be very low, so
that legitimate mail is not flagged as spam. The false negative rate,
FN / (TP + FN), is also of high interest: when testing spam-filtering software
we want the most accurate view of how many spam mails can still come through,
and each of those is a false negative. Since the false negative rate equals
1 - true positive rate, it can be read directly from the sensitivity.

Q4) [5 marks] In a Neural Networks model, should we prefer a large hidden layer over a
small one? Explain the benefits and drawbacks of each option.

The hidden layer is the middle layer of the network, and its size matters. A larger hidden
layer makes the network more powerful, because it can recognize more patterns and
capture more complex ones. The drawback is that a larger layer requires more resources
and carries a greater risk of overfitting if there is not enough data. In contrast, a smaller
hidden layer makes the network less powerful, but it is computationally more efficient,
and usually only one hidden layer is needed. A smaller layer is also less prone to
overfitting, although it may not capture complex relationships as well. Therefore, the
choice depends on the problem and the data that is available. In practice, one can
experiment: start with a small hidden layer and enlarge it only when necessary.

Q5) [8 marks] A neural network classifier is trained on labeled transaction data to
classify a transaction as fraudulent or non-fraudulent. The output of the classifier is a
numerical value between [0, 1]. A threshold is used to interpret the output of the classifier
as a binary value of “fraudulent” or “non-fraudulent”.

We first set the threshold at 0.5, meaning that, if the output of the model is higher than
0.5, then the transaction is classified as “fraudulent”, and if the output of the model is less
than or equal to 0.5, then the transaction is classified as “non-fraudulent”.

Let us assume that for threshold=0.5, the confusion matrix for this classifier is measured
as follows:

                                     Actual class
                                     0 (non-fraudulent)   1 (fraudulent)
Predicted class 0 (non-fraudulent)           d                  b
Predicted class 1 (fraudulent)               c                  a

Referring to the values of a, b, c, and d in the confusion matrix, fill out the blanks in the
following statements using the options provided in the brackets:

(a) The classification error rate for truly fraudulent records (with this 0.5 threshold) is
______b/(a+b)___________ (present the formula in terms of values a, b, c, and d).

(b) The classification error rate for truly non-fraudulent records (with this 0.5 threshold)
is ____c/(c+d)_________ (present the formula in terms of values a, b, c, and d).

(c) Lowering the threshold below 0.5 leads to classifying more records (both actual
fraudulent and non-fraudulent records) as ___fraudulent______ (fraudulent or
non-fraudulent), therefore, the values _____a and c___ (a, b, c, d) increase, and the
values ___b and d____ (a, b, c, d) decrease. This means,

a. With respect to the classification error rate for truly fraudulent records, the error
rate ___decreases____ (increases or decreases). This means, as you lower the
threshold for calling a record fraudulent, you catch ___more___ (more or less) of
the real frauds.


b. With respect to the classification error rate for truly non-fraudulent records, the
error rate ___increases_____ (increases or decreases). This means, as you lower
the threshold for calling a record fraudulent, you mistakenly identify
___more____ (more or less) non-frauds as frauds.

(d) Increasing the threshold above 0.5 leads to classifying more records (both actual
fraudulent and non-fraudulent records) as ____non-fraudulent______ (fraudulent or
non-fraudulent), therefore the values ______b and d____ (a, b, c, d) increase, and the
values ___a and c_____ (a, b, c, d) decrease. This means,

a. With respect to the classification error rate for truly fraudulent records, the error
rate ___increases______ (increases or decreases). This means, as you raise the
threshold for calling a record fraudulent, you miss ____more____ (more or less)
of the real frauds.
b. With respect to the classification error rate for truly non-fraudulent records, the
error rate ___decreases____ (decreases or increases). This means, as you raise
the threshold for calling a record fraudulent, ____fewer____ (fewer or more)
non-fraudulent transactions are misclassified as frauds.
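
The effect of moving the threshold can be checked with a small Python sketch that uses the a, b, c, d notation of the confusion matrix in this question. The labels and scores are hypothetical and serve only to illustrate the direction of the changes.

```python
def confusion_counts(labels, scores, threshold):
    """Return (a, b, c, d) in the notation of the question:
    a = frauds predicted fraud, b = frauds predicted non-fraud,
    c = non-frauds predicted fraud, d = non-frauds predicted non-fraud."""
    a = b = c = d = 0
    for y, s in zip(labels, scores):
        predicted_fraud = s > threshold  # output above the threshold means fraudulent
        if y == 1 and predicted_fraud:
            a += 1
        elif y == 1:
            b += 1
        elif predicted_fraud:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical classifier outputs; 1 = fraudulent, 0 = non-fraudulent.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.45, 0.3, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.7):
    a, b, c, d = confusion_counts(labels, scores, threshold)
    print(f"threshold={threshold}: a={a} b={b} c={c} d={d} "
          f"fraud error={b / (a + b):.2f} non-fraud error={c / (c + d):.2f}")
```

With these toy scores, lowering the threshold moves records from b to a and from d to c, and raising it does the opposite.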

Part II- Problems

Q6) [5 marks] The following confusion matrix is produced during the test phase of a
classifier.

Actual vs. Classified Positive Negative


Positive 82 10
Negative 8 20

Calculate the following measures for this classifier:

From the table above we conclude that:

TP= 82
FN= 10
FP= 8
TN= 20

● Accuracy (also called Correctness)

Correctness = (TP+TN)/ (TP+TN+FP+FN)


= (82 + 20) / (82 + 20 + 8 + 10)
= 0.85


● Specificity
Specificity = TN / (FP + TN)
= 20 / (8 + 20)
= 0.714

● Sensitivity
Sensitivity = TP / (TP + FN)
= 82 / (82 + 10)
= 0.891

● Precision
Precision = TP / (TP + FP)
= 82 / (82 + 8)
= 0.911

● F_Measure

F_Measure = 2*(Precision*Recall) / (Precision + Recall)

Precision = 0.911
Recall = Sensitivity = 0.891

So, F_Measure = 2*(0.911*0.891) / (0.911 + 0.891)
= 0.901
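
As a cross-check, the five measures can be recomputed directly from the four counts TP = 82, FN = 10, FP = 8 and TN = 20. The sketch below is a minimal illustration; the variable names are chosen here and are not part of the assignment.

```python
TP, FN, FP, TN = 82, 10, 8, 20  # counts read from the confusion matrix of Q6

accuracy = (TP + TN) / (TP + TN + FP + FN)
specificity = TN / (TN + FP)
sensitivity = TP / (TP + FN)  # also called recall
precision = TP / (TP + FP)
f_measure = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [("accuracy", accuracy), ("specificity", specificity),
                    ("sensitivity", sensitivity), ("precision", precision),
                    ("F-measure", f_measure)]:
    print(f"{name}: {value:.3f}")
```

Because the F-measure is the harmonic mean of precision and recall, it always lies between the two.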

Q7) [12 marks] Consider the training data set collected by the ABC bank as shown in the
following Table. The last column is the target value (label).

Customer Savings Assets Credit


1 Medium High High
2 Low Low Low
3 High Medium Low
4 Medium Medium High
5 Low Medium High
6 High High High
7 Low Medium Low
8 Medium Medium High

We are interested in building a decision tree to classify new customers as “High” or
“Low” credit customers. Consider the following two candidate splits at the root:

i. using the “Savings” attribute only.


ii. using the “Assets” attribute only.

(a) Draw a decision tree for each of the above candidate splits
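USING “SAVINGS” ATTRIBUTE:
Root: Savings (8 customers: 5 High credit, 3 Low credit)
- HIGH SAVINGS: customers 3, 6 (1 High credit, 1 Low credit)
- MEDIUM SAVINGS: customers 1, 4, 8 (3 High credit, 0 Low credit)
- LOW SAVINGS: customers 2, 5, 7 (1 High credit, 2 Low credit)

USING “ASSETS” ATTRIBUTE:
Root: Assets (8 customers: 5 High credit, 3 Low credit)
- HIGH ASSETS: customers 1, 6 (2 High credit, 0 Low credit)
- MEDIUM ASSETS: customers 3, 4, 5, 7, 8 (3 High credit, 2 Low credit)
- LOW ASSETS: customer 2 (0 High credit, 1 Low credit)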

(b) Calculate the Gini index of the split. Based on your calculation of the Gini index,
which split is better?

SAVINGS:
GINI:
- CREDIT (ROOT): 1 - [ (3/8) ^ 2 + (5/8) ^ 2] = 0.46875
- HIGH SAVINGS: 1 - [ (1/2) ^ 2 + (1/2) ^ 2] = 0.5
- MEDIUM SAVINGS: 1 - [ (0/3) ^ 2 + (3/3) ^ 2] = 0
- LOW SAVINGS: 1 - [ (2/3) ^ 2 + (1/3) ^ 2] = 0.4444

GINI INDEX SAVINGS: (2/8) * 0.5 + (3/8) * 0 + (3/8) * 0.4444 = 0.2917

ASSETS:
GINI:
- CREDIT (ROOT): 1 - [ (3/8) ^ 2 + (5/8) ^ 2] = 0.46875
- HIGH ASSETS: 1 - [ (0/2) ^ 2 + (2/2) ^ 2] = 0
- MEDIUM ASSETS: 1 - [ (2/5) ^ 2 + (3/5) ^ 2] = 0.48
- LOW ASSETS: 1 - [ (1/1) ^ 2 + (0/1) ^ 2] = 0

GINI INDEX ASSETS: (2/8) * 0 + (5/8) * 0.48 + (1/8) * 0 = 0.3

ANSWER:
Split “Savings” is better because it has a lower Gini index (0.2917) than split “Assets” (0.3).
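
The weighted Gini index of each candidate split can be recomputed from the training table, weighting every branch by its share of the 8 records. The Python sketch below is an illustration only; the function names are chosen here and the records are copied from the table in Q7.

```python
from collections import Counter

# (savings, assets, credit) for customers 1 to 8, copied from the training table.
records = [
    ("Medium", "High", "High"), ("Low", "Low", "Low"),
    ("High", "Medium", "Low"), ("Medium", "Medium", "High"),
    ("Low", "Medium", "High"), ("High", "High", "High"),
    ("Low", "Medium", "Low"), ("Medium", "Medium", "High"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_of_split(rows, column):
    """Size-weighted Gini impurity of the branches created by one attribute."""
    branches = {}
    for row in rows:
        branches.setdefault(row[column], []).append(row[2])  # row[2] is Credit
    n = len(rows)
    return sum(len(lbls) / n * gini(lbls) for lbls in branches.values())


gini_savings = gini_of_split(records, 0)
gini_assets = gini_of_split(records, 1)
print(f"Savings: {gini_savings:.4f}  Assets: {gini_assets:.4f}")
```

The split with the lower weighted Gini index produces the purer child nodes.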

(c) Calculate the Entropy of the split. Based on your calculation of the Entropy index,
which split is better? (Note: For simplicity, use Log10 in your calculations)

SAVINGS:
- CREDIT (ROOT): -(3/8) * Log10(3/8) - (5/8) * Log10(5/8) = 0.2873
- HIGH SAVINGS: -(1/2) * Log10(1/2) - (1/2) * Log10(1/2) = 0.3010
- MEDIUM SAVINGS: -(0/3) * Log10(0/3) - (3/3) * Log10(3/3) = 0
- LOW SAVINGS: -(2/3) * Log10(2/3) - (1/3) * Log10(1/3) = 0.2764
- ENTROPY TOTAL: (2/8) * 0.3010 + (3/8) * 0 + (3/8) * 0.2764 = 0.1789

ASSETS:
- CREDIT (ROOT): -(3/8) * Log10(3/8) - (5/8) * Log10(5/8) = 0.2873
- HIGH ASSETS: -(0/2) * Log10(0/2) - (2/2) * Log10(2/2) = 0
- MEDIUM ASSETS: -(2/5) * Log10(2/5) - (3/5) * Log10(3/5) = 0.2923
- LOW ASSETS: -(1/1) * Log10(1/1) - (0/1) * Log10(0/1) = 0
- ENTROPY TOTAL: (2/8) * 0 + (5/8) * 0.2923 + (1/8) * 0 = 0.1827

Note: a term of the form 0 * Log10(0) is taken as 0.

ANSWER:
Split “Savings” is better because it has a lower Entropy total (0.1789) than split “Assets” (0.1827).
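
The entropy of each split can be recomputed in the same way as the Gini index, using Log10 as the question allows and weighting each branch by its share of the 8 records. This Python sketch is an illustration only; the function names are chosen here and the records are copied from the table in Q7.

```python
import math
from collections import Counter

# (savings, assets, credit) for customers 1 to 8, copied from the training table.
records = [
    ("Medium", "High", "High"), ("Low", "Low", "Low"),
    ("High", "Medium", "Low"), ("Medium", "Medium", "High"),
    ("Low", "Medium", "High"), ("High", "High", "High"),
    ("Low", "Medium", "Low"), ("Medium", "Medium", "High"),
]


def entropy10(labels):
    """Entropy with base-10 logarithms."""
    n = len(labels)
    # Classes with a zero count never appear in the Counter,
    # so the 0 * log(0) terms are implicitly treated as 0.
    return -sum((c / n) * math.log10(c / n) for c in Counter(labels).values())


def entropy_of_split(rows, column):
    """Size-weighted entropy of the branches created by one attribute."""
    branches = {}
    for row in rows:
        branches.setdefault(row[column], []).append(row[2])  # row[2] is Credit
    n = len(rows)
    return sum(len(lbls) / n * entropy10(lbls) for lbls in branches.values())


entropy_savings = entropy_of_split(records, 0)
entropy_assets = entropy_of_split(records, 1)
print(f"Savings: {entropy_savings:.4f}  Assets: {entropy_assets:.4f}")
```

As with Gini, the split with the lower weighted entropy is preferred.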

Q8) [8 marks] In the following neural networks, the range of the input data is between 1
and 10, and its current value is 6. The transfer function for the nodes in the hidden layer
is Hyperbolic Tangent, and for the output node is a Logistic function. Calculate the output
x. Remember to normalize the input between [0, 1]. Also, remember to scale back the
output value x to its actual value (i.e. de-normalize the output value x).


Step 1: Normalize the input to [0, 1] using its range of 1 to 10

= (input - min) / (max - min)
= (6 - 1) / (10 - 1)
= 0.5556

Step 2: Multiply the normalized input by the weights (w11 and w12) and apply the
hyperbolic tangent

= w11 * normalized input value
= 0.65 * 0.5556
= 0.3611

and

= w12 * normalized input value
= 0.50 * 0.5556
= 0.2778

Tanh(0.3611) = 0.3462
Tanh(0.2778) = 0.2708

Step 3: Multiply the hidden node outputs by the weights (w21 and w22) and add

= Tanh(0.3611) * w21 + Tanh(0.2778) * w22
= 0.3462 * 0.40 + 0.2708 * 0.22
= 0.1385 + 0.0596
= 0.1981

Step 4: Calculate the logistic (sigmoid) function

= 1 / (1 + exp(-x)), where x = 0.1981
= 1 / (1 + 0.8203)
= 0.5494

Step 5: Denormalize the output value x (assuming the output has the same range of 1 to 10)

= min + 0.5494 * (max - min)
= 1 + 0.5494 * 9
= 5.94

Therefore, the denormalized actual output value x is approximately 5.94.
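
The forward pass can be sketched in Python as a cross-check. This is a minimal sketch under stated assumptions: the input is scaled with min-max normalization, (x - 1) / (10 - 1); the network has one input, two hidden nodes and one output with no bias terms; the four weights are the ones used in the steps of this answer; and the output is scaled back to the same 1 to 10 range.

```python
import math

X_MIN, X_MAX = 1.0, 10.0  # stated range of the input data
W11, W12 = 0.65, 0.50     # input-to-hidden weights used in the answer
W21, W22 = 0.40, 0.22     # hidden-to-output weights used in the answer


def forward(x):
    """One forward pass: tanh hidden nodes, logistic output node."""
    x_norm = (x - X_MIN) / (X_MAX - X_MIN)  # assumption: min-max scaling to [0, 1]
    h1 = math.tanh(W11 * x_norm)            # hidden node 1
    h2 = math.tanh(W12 * x_norm)            # hidden node 2
    net = h1 * W21 + h2 * W22               # weighted sum into the output node
    out = 1.0 / (1.0 + math.exp(-net))      # logistic function
    # Assumption: the output is scaled back to the same 1-to-10 range.
    return X_MIN + out * (X_MAX - X_MIN)


print(round(forward(6.0), 3))
```

Note that the logistic function uses exp(-net) in the denominator and is applied once, directly to the weighted sum of the hidden node outputs.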
