Assignment-2 Group 8 ADM 3308 A
Submission Instruction:
● Submit the assignment to Brightspace no later than midnight on Oct. 30,
2023.
● Only one submission per group. Please choose a group representative to submit
one copy of the report to Brightspace.
When submitting your group work, the submission must include the following two
statements. Without the following two statements included in your submission, your
assignment will not be marked.
NOTE: If the above two statements are not included in your original submission, the
assignment will not be marked. Then, the following deductions will be applied:
-20% if the statement was not submitted with the original submission, but was
submitted within 24 hours after being reminded by the TA.
-100% if the statement was not submitted within 24 hours after being reminded
by the TA.
Important Note: Each member of the group must read the following academic
integrity statement, and type in their full name and student ID. The Academic
Integrity Statement must be e-signed by ALL members of the group UNLESS a
group member has not contributed to the assignment. Your assignment will not be
marked if the following academic integrity statement is not submitted.
Please read the disclosure below following the completion of your group assignment.
Once all team members have verified these points, hand in this signed disclosure with
your group assignment.
1. All team members acknowledge to have read and understood their responsibilities for
maintaining academic integrity, as defined by the University of Ottawa’s policies and
regulations. Furthermore, all members understand that any violation of academic
integrity may result in strict disciplinary action as outlined in the regulations.
2. If applicable, all team members have referenced and/or footnoted all ideas, words, or
other intellectual property from other sources used in completing this assignment.
3. A proper bibliography is included, which includes acknowledgement of all sources used
to complete this assignment.
4. This is the first time that any member of the group has submitted this assignment or
essay (either partially or entirely) for academic evaluation.
5. No member of the team has utilized unauthorized assistance or aids including but not
limited to outsourcing assignment solutions, and unethical use of online services such as
artificial intelligence tools and course-sharing websites.
6. Each member of the group has read the full content of the submission and is assured
that the content is free of violations of academic integrity. Group discussions regarding
the importance of academic integrity have taken place.
7. All team members have identified their individual contributions to the work submitted
such that if violations of academic integrity are suspected, then the student(s) primarily
responsible for the violations may be identified. Note that the remainder of the team will
also be subject to disciplinary action.
Group Number: 8
Ayoub Essadouky AE
Reana Agil RA
Q1) [4 marks] Aside from the Gini and Entropy, explain two other parameters to measure
purity (best split) in a decision tree.
Aside from Gini and Entropy, two other parameters to measure purity (best split) in a
decision tree are:
Information Gain Ratio:
Information gain ratio is a purity measure used for categorical targets. It addresses a
problem with plain information gain, which favours splits with many branches; the
intrinsic information (the decrease in entropy due solely to the number and size of the
branches) penalizes such splits. The gain ratio is calculated as the total information gain
of the proposed split divided by the intrinsic information. It is used in algorithms such as
C4.5/C5.0.
Reduction in Variance:
Reduction in variance is a purity measure used for numeric targets. It assesses the spread
of values within a node: if the values in a node are close to each other, their variance is
low, meaning the node is pure. This measure is specifically designed for numeric
variables and chooses the split that most reduces the size-weighted variance of the child
nodes compared to the parent.
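The reduction-in-variance measure can be sketched in a few lines; the target values and split below are hypothetical, made up for illustration:

```python
# A minimal sketch of reduction in variance for a candidate split
# (the numeric target values below are hypothetical, not from the assignment).
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    # Parent variance minus the size-weighted variance of the child nodes;
    # the split with the largest reduction is the best (purest) split.
    n = len(parent)
    weighted = sum(len(child) / n * variance(child) for child in children)
    return variance(parent) - weighted

parent = [10, 12, 30, 32]            # numeric target values at the node
children = [[10, 12], [30, 32]]      # one candidate split into two children
print(variance_reduction(parent, children))  # prints 100.0
```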
Q2) [5 marks] (a) Explain “over fitting”. What is its negative impact on learner models?
(b) Explain two approaches to prevent overfitting?
a.
Overfitting occurs when a machine learning model learns the training data too well, to the
point where it captures noise and randomness in the data rather than the underlying
pattern. This leads to a model that performs exceptionally well on the training data but
poorly on new, unseen data.
Poor Generalization: Overfit models have learned the training data too closely and are
overly sensitive to small fluctuations in the data. This leads to poor performance on new
data because the model has essentially memorized the training set, rather than learning
the underlying relationship.
Reduced Predictive Power: The primary goal of a machine learning model is to make
accurate predictions on new, unseen data. Overfit models fail in this regard, as they are
tailored too specifically to the training data and can't effectively generalize to different
data distributions.
Unreliable Insights: Overfit models might infer relationships that are purely coincidental
and not actually meaningful. This can lead to incorrect conclusions about the data.
b.
Pruning: After growing a full tree, the pruning process begins. Pruning selectively
removes branches and nodes from the tree in order to improve its generalization. The
goal is a simpler, more interpretable tree that still maintains good predictive performance
on unseen data, by reducing model complexity.
Cross-validation: Cross-validation partitions the training data into subsets. The model is
trained on a portion of the data (training set) and evaluated on the remaining part
(validation set). This process is repeated multiple times, with different subsets used for
training and validation each time. Because the model is always evaluated on data it was
not trained on, cross-validation exposes overfitting: a model that merely memorizes the
training set will score poorly on the validation folds.
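The cross-validation procedure described above can be sketched in pure Python; the fold count and sample size are arbitrary, and this simple version assumes k divides the number of samples evenly:

```python
# A minimal sketch of k-fold cross-validation index splitting
# (assumes n_samples is divisible by k; the model-fitting step is a comment).
def k_fold(n_samples, k):
    # Yield (train, validation) index lists; each sample is validated exactly once.
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = [j for j in indices if j not in val]
        yield train, val

for train_idx, val_idx in k_fold(10, 5):
    # fit the model on train_idx, score it on val_idx, then average the k scores
    print(val_idx)
```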
Q3) [4 marks] (a) What do the X-axis and Y-axis represent in an ROC curve? What does
the area under the curve represent? Draw an example ROC curve for a classifier that
performs better than a random guess. (b) Describe a real world application where false
negative rate is of high interest (in addition to the accuracy of the classifier).
a) An ROC curve shows the relationship between sensitivity and 1-specificity for
every possible cut-off. The x-axis shows 1-specificity, i.e. the false positive rate
(FPR = FP/(FP+TN)), which measures how often a false alarm occurs, i.e. how
often an actual negative value is classified as positive. The y-axis shows the
sensitivity, i.e. the true positive rate (TPR = TP/(TP+FN)), which measures the
probability that an actual positive value is classified as positive. The area under
the curve (AUC) represents the test's discriminative ability: the higher the AUC
score, the better the classifier separates the two classes. A classifier that performs
better than a random guess has a curve that bows above the diagonal from (0, 0)
to (1, 1), giving an AUC greater than 0.5. See the example ROC curve below.
b) In email spam filtering, the false negative rate is of high interest in addition to the
accuracy of the classifier. A false negative is a spam message classified as
legitimate, so the false negative rate directly measures how much spam still gets
through to the inbox. When testing a spam filter, we therefore want both a low
false negative rate (little spam slipping through) and a low false positive rate
(few legitimate emails flagged as spam).
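The two ROC axes described in part (a) can be computed at a single cut-off as follows; the classifier scores and labels below are hypothetical:

```python
# Sketch: the two ROC axes at one cut-off, from hypothetical classifier
# scores and true labels (1 = positive, 0 = negative).
def tpr_fpr(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
    tpr = tp / (tp + fn)   # sensitivity, the y-axis
    fpr = fp / (fp + tn)   # 1 - specificity, the x-axis
    return tpr, fpr

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
# Sweeping the threshold from 1 down to 0 traces the ROC curve point by point.
for t in (0.7, 0.5, 0.2):
    print(tpr_fpr(scores, labels, t))
```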
Q4) [5 marks] In a Neural Networks model, should we prefer a large hidden layer over a
small one? Explain the benefits and drawbacks of each option.
The hidden layer is the middle layer of the network, and its size matters. A large hidden
layer makes the network more powerful: it can recognize more patterns and capture more
complex relationships. The drawbacks are that it requires more computational resources
and carries a greater risk of overfitting when there is not enough data. In contrast, a small
hidden layer yields a less powerful network, but it is computationally more efficient, and
it is less prone to overfitting precisely because it cannot fit overly complex relationships.
The choice therefore depends on the problem and the data that is available; in practice, it
is best to experiment, starting with a small layer and enlarging it only when necessary.
Q5) A classifier for detecting fraudulent transactions outputs a
numerical value between [0, 1]. A threshold is used to interpret the output of the classifier
as a binary value of “fraudulent” or “non-fraudulent”.
We first set the threshold at 0.5, meaning that, if the output of the model is higher than
0.5, then the transaction is classified as “fraudulent”, and if the output of the model is less
than or equal to 0.5, then the transaction is classified as “non-fraudulent”.
Let us assume that for threshold=0.5, the confusion matrix for this classifier is measured
as follows:
                                      Actual class
                            0 (non-fraudulent)   1 (fraudulent)
Predicted  0 (non-fraudulent)        d                 b
class      1 (fraudulent)            c                 a
Referring to the values of a, b, c, and d in the confusion matrix, fill out the blanks in the
following statements using the options provided in the brackets:
(a) The classification error rate for truly fraudulent records (with this 0.5 threshold) is
______b/(a+b)___________ (present the formula in terms of values a, b, c, and d).
(b) The classification error rate for truly non-fraudulent records (with this 0.5 threshold)
is ____c/(c+d)_________ (present the formula in terms of values a, b, c, and d).
(c) Lowering the threshold below 0.5 leads to classifying more records (both actual
fraudulent and non-fraudulent records) as ___fraudulent______ (fraudulent or
non-fraudulent), therefore, the values _____a and c___ (a, b, c, d) increase, and the
values ___b and d____(a, b, c, d) decrease. This means,
a. With respect to the classification error rate for truly fraudulent records, the error
rate ___decreases____ (increases or decreases). This means, as you lower the
threshold for calling a record fraudulent, you catch ___more___ (more or less) of
the real frauds.
b. With respect to the classification error rate for truly non-fraudulent records, the
error rate ___increases_____ (increases or decreases). This means, as you lower
the threshold for calling a record fraudulent, you mistakenly identify
___more____ (more or less) non-frauds as frauds.
(d) Increasing the threshold above 0.5 leads to classifying more records (both actual
fraudulent and non-fraudulent records) as ____non-fraudulent______ (fraudulent or
non-fraudulent), therefore the values ______b and d____ (a, b, c, d) increase, and the
values ___a and c_____ (a, b, c, d) decrease. This means,
a. With respect to the classification error rate for truly fraudulent records, the error
rate ___increases______ (increases or decreases). This means, as you raise the
threshold for calling a record fraudulent, you miss ____more____ (more or less)
of the real frauds.
b. With respect to the classification error rate for truly non-fraudulent records, the
error rate ___decreases____ (decreases or increases). This means, as you raise
the threshold for calling a record fraudulent, ____fewer____ (fewer or more)
non-fraudulent transactions are misclassified as frauds.
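The effect of moving the threshold can be checked with a small sketch; the scores and labels below are hypothetical, not from the assignment. Lowering the cut-off can only move records into the "fraudulent" prediction row, so a (TP) and c (FP) grow while b (FN) and d (TN) shrink:

```python
# Sketch: how lowering the threshold moves records into the "fraudulent"
# prediction row of the confusion matrix (hypothetical scores and labels).
def confusion(scores, labels, threshold):
    a = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)   # TP
    b = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)  # FN
    c = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)   # FP
    d = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)  # TN
    return a, b, c, d

scores = [0.9, 0.7, 0.45, 0.4, 0.2, 0.1]
labels = [1,   1,   1,    0,   0,   0]
print(confusion(scores, labels, 0.5))  # prints (2, 1, 0, 3)
print(confusion(scores, labels, 0.3))  # prints (3, 0, 1, 2): a and c grew
```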
Q6) [5 marks] The following confusion matrix is produced during the test phase of a
classifier.
TP= 82
FN= 10
FP= 8
TN= 20
● Specificity
Specificity = TN / (FP + TN)
= 20 / (8 + 20)
= 0.714
● Sensitivity
Sensitivity = TP / (TP + FN)
= 82 / (82 + 10)
= 0.891
● Precision
Precision = TP / (TP + FP)
= 82 / (82 + 8)
= 0.911
● F_Measure
Precision = 0.911
Recall = Sensitivity = 0.891
F_Measure = 2 * Precision * Recall / (Precision + Recall)
= 2 * 0.911 * 0.891 / (0.911 + 0.891)
= 0.901
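These four metrics can be verified directly from the given confusion matrix:

```python
# Verifying the Q6 metrics from the given confusion matrix
# (TP = 82, FN = 10, FP = 8, TN = 20), with F-measure = 2PR / (P + R).
TP, FN, FP, TN = 82, 10, 8, 20

specificity = TN / (FP + TN)       # 20 / 28
sensitivity = TP / (TP + FN)       # recall, 82 / 92
precision = TP / (TP + FP)         # 82 / 90
f_measure = 2 * precision * sensitivity / (precision + sensitivity)

print(round(specificity, 3))  # 0.714
print(round(sensitivity, 3))  # 0.891
print(round(precision, 3))    # 0.911
print(round(f_measure, 3))    # 0.901
```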
Q7) [12 marks] Consider the training data set collected by the ABC bank as shown in the
following Table. The last column is the target value (label).
(a) Draw a decision tree for each of the above candidate splits
USING “SAVINGS” ATTRIBUTE:
(b) Calculate the Gini index of the split. Based on your calculation of the Gini index,
which split is better?
SAVINGS:
GINI:
- ROOT (CREDIT): 1 - [(3/8)^2 + (5/8)^2] = 0.46875
- HIGH SAVINGS: 1 - [(1/2)^2 + (1/2)^2] = 0.5
- MEDIUM SAVINGS: 1 - [(0/3)^2 + (3/3)^2] = 0
- LOW SAVINGS: 1 - [(1/3)^2 + (2/3)^2] = 0.44
- GINI OF THE SPLIT (weighted): (2/8)(0.5) + (3/8)(0) + (3/8)(0.44) = 0.2917
ASSETS:
GINI:
- ROOT (CREDIT): 1 - [(3/8)^2 + (5/8)^2] = 0.46875
- HIGH ASSETS: 1 - [(0/2)^2 + (2/2)^2] = 0
- MEDIUM ASSETS: 1 - [(2/5)^2 + (3/5)^2] = 0.48
- LOW ASSETS: 1 - [(1/1)^2 + (0/1)^2] = 0
- GINI OF THE SPLIT (weighted): (2/8)(0) + (5/8)(0.48) + (1/8)(0) = 0.30
ANSWER:
Split “Savings” is better because its weighted Gini index (0.2917) is lower than that of
split “Assets” (0.30).
(c) Calculate the Entropy of the split. Based on your calculation of the Entropy index,
which split is better? (Note: For simplicity, use Log10 in your calculations)
SAVINGS:
- ROOT (CREDIT): -(3/8) * Log10(3/8) - (5/8) * Log10(5/8) = 0.2873
- HIGH SAVINGS: -(1/2) * Log10(1/2) - (1/2) * Log10(1/2) = 0.3010
- MEDIUM SAVINGS: 0 (all records in one class; 0 * Log10(0) is taken as 0)
- LOW SAVINGS: -(1/3) * Log10(1/3) - (2/3) * Log10(2/3) = 0.2764
- ENTROPY OF THE SPLIT (weighted): (2/8)(0.3010) + (3/8)(0) + (3/8)(0.2764) = 0.1789
ASSETS:
- ROOT (CREDIT): -(3/8) * Log10(3/8) - (5/8) * Log10(5/8) = 0.2873
- HIGH ASSETS: 0 (all records in one class)
- MEDIUM ASSETS: -(2/5) * Log10(2/5) - (3/5) * Log10(3/5) = 0.2922
- LOW ASSETS: 0 (all records in one class)
- ENTROPY OF THE SPLIT (weighted): (2/8)(0) + (5/8)(0.2922) + (1/8)(0) = 0.1827
ANSWER:
Split “Savings” is better because its weighted entropy (0.1789) is lower than that of split
“Assets” (0.1827).
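The split comparison above can be re-checked in code; the per-node class counts are read off the per-node calculations in the answer (the Gini/entropy of a split is the size-weighted average over the child nodes, with log10 as the question asks):

```python
# Re-checking Q7: size-weighted Gini and entropy (log10) for both splits.
from math import log10

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log10(c / n) for c in counts if c > 0)

def weighted(measure, children):
    # Size-weighted average of the measure over the child nodes of the split.
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * measure(c) for c in children)

# (class 1, class 2) record counts per child node, from the answer above
savings = [(1, 1), (0, 3), (1, 2)]   # high, medium, low savings
assets = [(0, 2), (2, 3), (1, 0)]    # high, medium, low assets

print(round(weighted(gini, savings), 4), round(weighted(gini, assets), 4))
print(round(weighted(entropy, savings), 4), round(weighted(entropy, assets), 4))
# "Savings" has the lower weighted value on both measures.
```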
Q8) [8 marks] In the following neural networks, the range of the input data is between 1
and 10, and its current value is 6. The transfer function for the nodes in the hidden layer
is Hyperbolic Tangent, and for the output node is a Logistic function. Calculate the output
x. Remember to normalize the input between [0, 1]. Also, remember to scale back the
output value x to its actual value (i.e. de-normalize the output value x).
Normalized input: (6 - 1) / (10 - 1) = 5/9 = 0.556
Tanh(0.39) = 0.37136022787
Tanh(0.30) = 0.29131261245
Logistic output: 0.64392073
De-normalized output: x = 0.64392073 * (10 - 1) + 1 = 6.7953
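The forward pass can be sketched as follows. The network's weights come from a figure not reproduced here, so the hidden-layer weighted sums (0.39 and 0.30) are taken as given, and the output-node weights below are hypothetical placeholders:

```python
# Sketch of the Q8 forward pass: normalize the input to [0, 1], tanh hidden
# layer, logistic output node, then de-normalize back to the [1, 10] range.
# The output-node weights are hypothetical (the question's figure is missing).
from math import exp, tanh

def logistic(z):
    return 1 / (1 + exp(-z))

lo, hi = 1, 10                     # input/output range stated in the question
x_in = 6
x_norm = (x_in - lo) / (hi - lo)   # normalize: (6 - 1) / 9 = 0.556

h1 = tanh(0.39)                    # hidden node 1 output, 0.3714
h2 = tanh(0.30)                    # hidden node 2 output, 0.2913

w1, w2 = 1.0, 1.0                  # hypothetical output-node weights
out = logistic(w1 * h1 + w2 * h2)  # logistic output, in [0, 1]

x_out = out * (hi - lo) + lo       # de-normalize back to [1, 10]
print(round(x_out, 4))
```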