IDAI610 PS1 DecisionTree
Objective: In this problem set, you will implement the standard decision tree algorithm including two
ways to implement node splitting, for a comparison of their results. You will also use your decision
tree implementation to create a predictive model with the Wisconsin Diagnostic Breast Cancer Data,
using training, tuning, and test sets. Subsequently, you will analyze and discuss the results. In a final
task, you will compare with a decision tree implementation in a standard machine learning package
named scikit-learn and explore additional hyperparameters of decision trees as well as decision-tree
visualization.
Submission Instructions: Submit your report and notebook(s) as ps1-[LastName], as in this example:
ps1-alm.[zip|tar.gz], in the assigned dropbox on our myCourses website. Remember
to include your written report, code, and a succinct readme explaining how to run your code. Clearly
indicate which question you are responding to, using the format Qn.
Please start this assignment by reading the paper: W. Nick Street, W. H. Wolberg, and O. L. Mangasarian,
"Nuclear feature extraction for breast tumor diagnosis", Proceedings of SPIE 1905, Biomedical Image
Processing and Biomedical Visualization (29 July 1993); https://doi.org/10.1117/12.148698.
Based on the example, draw a decision tree for each of the following Boolean expressions:
1 Original dataset at: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic.
1. A ∧ B̄ ∧ C
2. (X ∧ Ȳ) ∨ (X̄ ∧ Y)
3. (X ∧ Y ∧ Z) ∨ (X ∧ Ȳ ∧ W) ∨ (X̄ ∧ Y)
Q2: Root node selection using Information Gain and Gini (4 points)
Table 1: Data available to the new soccer club's sports director for shortlisting forward players A through N.
Based on Table 1, work through the selection of the root decision node (considering all four
features) using standard Information Gain and Gini. Show all intermediate calculation steps.
Your decision tree implementation should include two attribute/node splitting techniques, based on
informativeness, and allow the user to select which one to use:
1. Entropy and Information Gain
Entropy is a measure of heterogeneity (uncertainty) in data: higher entropy implies more uncertainty
(the node is less informative about the class), and vice versa. In a decision tree, entropy quantifies the
uncertainty about the class distribution within a node. A node with low entropy comprises mostly one
class, while a node with high entropy contains a broader distribution across classes. The entropy of a
node is given by:
$$\mathrm{Entropy}(D_i) = -\sum_{c_i \in C} P(c_i)\,\log_2 P(c_i) \tag{1}$$
where D_i is the set of instances within the node under consideration, C is the set of classes,
and P(c_i) is the proportion of instances in D_i belonging to class c_i.
Information Gain (IG) quantifies the quality of a split based on entropy. It is used to
evaluate the effectiveness of a feature in reducing entropy. You will implement IG as the difference
between the entropy of the parent node and the weighted average entropy of the child nodes after
the split. The feature with the highest IG is selected as the splitting feature. The Information
Gain IG(D_i, f) of splitting on a feature f with levels l, relative to the set of instances D_i in the parent
node, is given by:
$$IG(D_i, f) = \mathrm{Entropy}(D_i) - \sum_{l \in \mathrm{levels}(f)} \frac{|D_{i,l}|}{|D_i|}\,\mathrm{Entropy}(D_{i,l}) \tag{2}$$
where D_{i,l} represents the subset of instances in D_i that have value l for feature f.
2. Gini
The Gini index measures the likelihood of a randomly selected instance being classified incorrectly.
(It can be particularly valuable for continuous features, e.g., in CART, or Classification and
Regression Trees.) Like entropy, it quantifies the impurity or uncertainty of values in a dataset:
a lower Gini index indicates a homogeneous class distribution, while a higher one indicates a
heterogeneous mix of instances within a node. The Gini index is given by:
$$\begin{aligned}
\mathrm{Gini} &= \sum_{c_i \in C} P(c_i)\,\bigl(1 - P(c_i)\bigr) && (3)\\
&= 1 - \sum_{c_i \in C} P(c_i)^2 && (4)
\end{aligned}$$
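For reference, a minimal sketch of both splitting criteria above is shown below (in Python with NumPy and pandas); the function and variable names are illustrative only, not a required interface.

```python
# Minimal sketch of the two splitting criteria; names are illustrative only.
import numpy as np
import pandas as pd

def entropy(labels):
    """Entropy of a node's class distribution, Eq. (1)."""
    probs = pd.Series(labels).value_counts(normalize=True).to_numpy()
    return -np.sum(probs * np.log2(probs))

def gini(labels):
    """Gini index of a node's class distribution, Eq. (4)."""
    probs = pd.Series(labels).value_counts(normalize=True).to_numpy()
    return 1.0 - np.sum(probs ** 2)

def information_gain(data, feature, target):
    """IG of splitting the DataFrame `data` on `feature`, Eq. (2)."""
    parent = entropy(data[target])
    weighted_children = 0.0
    for level, subset in data.groupby(feature):
        weighted_children += (len(subset) / len(data)) * entropy(subset[target])
    return parent - weighted_children
```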
Finally, you must implement tuning and testing procedures (functions) as part of your decision tree
implementation. You will use these in Problem 3, so we recommend reading that part before beginning
your own implementation.
Your report on this problem should discuss Q3: your implementation in one paragraph, Q4: any
challenges you faced, and Q5: how you overcame them.
Extra credit (8 points): Implement χ² pruning for your decision tree, as described in Section 19.3.4
of R&N.
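If you attempt this, one possible reading of the R&N significance test is sketched below for the binary (M/B) case; the 5% significance level and the helper name are assumptions, and scipy is used only to obtain the χ² critical value.

```python
# Hedged sketch of the chi-squared test from R&N Sec. 19.3.4 for binary labels.
# The 5% significance level and function name are assumptions, not requirements.
from scipy.stats import chi2

def split_is_significant(child_counts, alpha=0.05):
    """child_counts: list of (p_k, n_k) positive/negative counts per child node.

    Returns True if the split deviates significantly from what the parent's
    class ratio alone would predict, i.e., the split should be kept.
    """
    p = sum(pk for pk, _ in child_counts)   # positives in the parent
    n = sum(nk for _, nk in child_counts)   # negatives in the parent
    delta = 0.0
    for pk, nk in child_counts:
        expected_p = p * (pk + nk) / (p + n)
        expected_n = n * (pk + nk) / (p + n)
        if expected_p > 0:
            delta += (pk - expected_p) ** 2 / expected_p
        if expected_n > 0:
            delta += (nk - expected_n) ** 2 / expected_n
    dof = len(child_counts) - 1             # v - 1 degrees of freedom
    return delta > chi2.ppf(1 - alpha, dof)
```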
Problem 3: Use your decision tree to develop a model for WDBC (14 points)
For this problem, you will use the train, dev (short for dev-test, also called the validation or tuning set), and test
sets, available as CSV files provided on myCourses. You will train, tune (using the data partition
called dev), and test your own decision tree to classify tumor nuclei into two classes: malignant (M)
and benign (B). Thus, it is a Boolean classification problem.
The tuning phase will focus on selecting the best-performing node-splitting criterion (Gini or IG).
If you implemented χ² pruning in the implementation problem above, you can earn 2 extra points by
also including it in your tuning process, i.e., treating χ² pruning as active or inactive (on or off).
Finally, you will test your decision tree’s predictions on the held-out test data. The test set aims to
approximate data seen in deployment. It may not be used before predicting your final results. Thus,
you may NOT review the test set during development, or re-tune to the test set.
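As a rough illustration of the tuning loop (not a prescribed design), the sketch below selects the splitting criterion, and optionally the χ² pruning flag, on the dev set; build_tree and accuracy are hypothetical stand-ins for your own training and scoring functions.

```python
def tune(train_df, dev_df, build_tree, accuracy):
    """Pick the best splitting criterion (and pruning flag) on the dev set.

    `build_tree(df, criterion, chi2_pruning)` and `accuracy(tree, df)` are
    placeholders for your own implementation's training and scoring functions.
    """
    best_config, best_dev_acc = None, -1.0
    for criterion in ("gini", "information_gain"):
        for prune in (False, True):          # pruning loop only if implemented
            tree = build_tree(train_df, criterion=criterion, chi2_pruning=prune)
            dev_acc = accuracy(tree, dev_df)
            if dev_acc > best_dev_acc:
                best_config, best_dev_acc = (criterion, prune), dev_acc
    # Only after fixing best_config do you evaluate once on the held-out test set.
    return best_config, best_dev_acc
```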
Discretization:
All the actual feature values are continuous. However, a standard ID3 decision tree implementation
expects categorical features. Thus, the continuous, real-valued features in the original dataset need to
be discretized, or binned. This has been done for you, with values binned into six categorical
levels (l1, l2, l3, l4, l5, l6). For discretization, each feature column was first Z-score normalized:
$$Z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} \tag{5}$$
where x_{ij} is the j-th feature (column) of the i-th image instance (row), μ_j and σ_j are the mean and
standard deviation of the j-th feature, respectively, and Z_{ij} is the Z-score normalized value of x_{ij}. The
normalized values were then mapped to six levels (l) of the feature f, coded as:
$$\mathrm{Code}(Z_{ij}) = \begin{cases}
l_1 & \text{if } Z_{ij} < -2\sigma_j \\
l_2 & \text{if } -2\sigma_j \le Z_{ij} < -\sigma_j \\
l_3 & \text{if } -\sigma_j \le Z_{ij} < \mu_j \\
l_4 & \text{if } \mu_j \le Z_{ij} < \sigma_j \\
l_5 & \text{if } \sigma_j \le Z_{ij} < 2\sigma_j \\
l_6 & \text{if } Z_{ij} \ge 2\sigma_j
\end{cases}$$
You can use the discretized data in ./Final data/wdbc {train/dev/test}.csv. (Details on the
discretization process can be found in Discretization prepartion.ipynb and reviewing it is optional.)
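For intuition only, a sketch of the binning scheme above is given below; it assumes the feature columns have already been separated from the label, and it reads the thresholds −2σ_j, …, 2σ_j as −2, …, 2 on the normalized Z scale.

```python
# Sketch of the described discretization. The interpretation of the thresholds
# (-2, -1, 0, 1, 2 on the Z scale) and the use of pandas' sample std (ddof=1)
# are assumptions; the provided binned files may differ slightly.
import numpy as np
import pandas as pd

def discretize(df, feature_cols):
    out = df.copy()
    for col in feature_cols:
        mu, sigma = df[col].mean(), df[col].std()
        z = (df[col] - mu) / sigma                       # Eq. (5)
        bins = [-np.inf, -2, -1, 0, 1, 2, np.inf]        # left-closed intervals
        out[col] = pd.cut(z, bins=bins, right=False,
                          labels=["l1", "l2", "l3", "l4", "l5", "l6"])
    return out
```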
Extra credit (5 pts.)
You can discretize the original continuous data in ./Final data/wdbc {train/dev/test} raw.csv
into categorical data yourself using (a) your own approach, or (b) the procedure briefly explained
in R&N, Section 19.3.5 (p. 664). Please describe the process you used in plain English and do
your best to express it formally as well, as exemplified above.
In your report, answer these questions. Q6: summarize the dataset paper you read and discuss two
key observations. Then, Q7: provide the performance results on the test set in a five-column table
reporting accuracy, error, precision, and recall (plus a column identifying the method), with one row
each for Gini and IG. Your table should also include a comparative baseline result for a trivial
majority-class classifier that labels all instances with the most frequent class in the training data.
Note that for precision and recall, you will need to decide, on a reasoned basis, which class you treat
as the positive class, since this determines what counts as a true or false positive prediction.
Additionally, Q8: discuss whether precision or recall is the more critical (most important)
performance metric for this problem. Finally, Q9: explain your tuning procedure with the dev (tuning)
set and what you learned from this part of the problem.
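To make the table in Q7 concrete, here is a minimal sketch of the requested metrics and the majority-class baseline; treating malignant (M) as the positive class is an assumption you still need to justify in your answer.

```python
# Minimal metric sketch; treating "M" as the positive class is an assumption.
from collections import Counter

def metrics(y_true, y_pred, positive="M"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy,
            "error": 1 - accuracy,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0}

def majority_baseline(y_train, y_test):
    majority = Counter(y_train).most_common(1)[0][0]   # most frequent training label
    return metrics(y_test, [majority] * len(y_test))
```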
Q12: elaborate on the similarities and differences between your implementation and that of scikit-learn.
Additionally, Q13: compare and contrast the performance on the binned versus non-binned (normalized)
data and report whether there is a difference in performance between the two, and Q14: speculate on why
that may be.
Finally, review the documentation on decision trees for the library at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html and then explore two additional
decision tree hyperparameters that you did not yet consider, such as a maximum-depth stopping
criterion, the minimum number of instances required for a split, or the minimum number of instances at
a leaf node. Discuss Q15: your observations, e.g., based on visualized evidence or performance results.
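A hedged starting point for this exploration is sketched below; the file paths and the "diagnosis" column name are guesses based on the problem description (adjust them to the actual CSVs), and note that scikit-learn trees expect numeric features, so use the raw (non-binned) data or encode the l1–l6 levels first.

```python
# Exploratory sketch only: paths and the "diagnosis" column name are assumptions.
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

train = pd.read_csv("Final data/wdbc_train_raw.csv")   # hypothetical file name
dev = pd.read_csv("Final data/wdbc_dev_raw.csv")       # hypothetical file name
X_train, y_train = train.drop(columns=["diagnosis"]), train["diagnosis"]
X_dev, y_dev = dev.drop(columns=["diagnosis"]), dev["diagnosis"]

# Sweep two extra hyperparameters: a depth cap and a minimum leaf size.
for max_depth in (3, 5, None):
    for min_samples_leaf in (1, 5, 20):
        clf = DecisionTreeClassifier(criterion="gini",
                                     max_depth=max_depth,
                                     min_samples_leaf=min_samples_leaf,
                                     random_state=0)
        clf.fit(X_train, y_train)
        print(max_depth, min_samples_leaf, clf.score(X_dev, y_dev))

# Visualize the last fitted tree (requires matplotlib).
plot_tree(clf, feature_names=list(X_train.columns),
          class_names=list(clf.classes_), filled=True)
plt.show()
```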