IDAI610 PS1 DecisionTree
Objective: In this problem set, you will implement the standard decision tree algorithm including two
ways to implement node splitting, for a comparison of their results. You will also use your decision
tree implementation to create a predictive model with the Wisconsin Diagnostic Breast Cancer Data,
using training, tuning, and test sets. Subsequently, you will analyze and discuss the results. In a final
task, you will compare with a decision tree implementation in a standard machine learning package
named scikit-learn and explore additional hyperparameters of decision trees as well as decision-tree
visualization.
Submission Instructions: Submit your report and notebook(s) as ps1-[LastName], as in this example:
ps1-alm.[zip|tar.gz], in the assigned dropbox on our myCourses website. Remember
to include your written report, code, and a succinct readme explaining how to run your code. Clearly
indicate which question you are responding to, using the format Qn.
Please start this assignment by reading the paper: W. Nick Street, W. H. Wolberg, and O. L. Mangasarian,
"Nuclear feature extraction for breast tumor diagnosis", Proceedings of SPIE 1905, Biomedical Image
Processing and Biomedical Visualization (29 July 1993); https://doi.org/10.1117/12.148698.
Based on the example, draw a decision tree for each of the following Boolean expressions:
1 Original dataset at: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic.
1. A ∧ B̄ ∧ C
2. (X ∧ Ȳ) ∨ (X̄ ∧ Y)
3. (X ∧ Y ∧ Z) ∨ (X ∧ Ȳ ∧ W) ∨ (X̄ ∧ Y)
Q2: Root node selection using Information Gain and Gini (4 points)
Table 1: Data available to the new soccer club's sports director for shortlisting forward players A through N.
Based on Table 1, work through the selection of the root decision node (considering all four
features) using standard Information Gain and Gini. Show all intermediate calculation steps.
Your decision tree implementation should include two attribute/node splitting techniques, based on
informativeness, and allow the user to select which one to use:
1. Entropy and Information Gain
Entropy is a measure of heterogeneity (uncertainty) in data: higher entropy implies more uncertainty
(the node is less informative about the class), and vice versa. In a decision tree, entropy quantifies the
uncertainty about the class distribution within a node. A node with low entropy comprises mostly one
class, while a node with high entropy contains a broader distribution across classes. The entropy of a
node is given by:
$$\mathrm{Entropy}(D_i) = -\sum_{c_i \in C} P(c_i)\,\log_2 P(c_i) \tag{1}$$
where D_i is the set of instances within the node under consideration, C is the set of classes,
and P(c_i) is the proportion of instances in D_i belonging to class c_i.
Information Gain (IG) quantifies the quality of a split based on entropy. It is used to
evaluate the effectiveness of a feature in reducing entropy. You will implement IG as the difference
between the entropy of the parent node and the weighted average entropy of the child nodes after
the split. The feature with the highest IG is selected as the splitting feature. The Information
Gain IG(D_i, f) of splitting on a feature f with levels l, relative to the set of instances D_i in the parent
node, is given by:
$$IG(D_i, f) = \mathrm{Entropy}(D_i) - \sum_{l \in \mathrm{levels}(f)} \frac{|D_{i,l}|}{|D_i|}\,\mathrm{Entropy}(D_{i,l}) \tag{2}$$
where D_{i,l} represents the subset of instances in D_i that have value l for feature f.
2. Gini
The Gini index measures the likelihood of a randomly selected instance being classified incorrectly.
(It can be particularly valuable for continuous features, e.g., in CART, or Classification and
Regression Trees.) Like entropy, it quantifies the impurity or uncertainty of values in a dataset:
a lower Gini index indicates a homogeneous class distribution, while a higher one indicates a
heterogeneous mix of instances within a node. The Gini index is given by:
$$\begin{aligned}
\mathrm{Gini} &= \sum_{c_i \in C} P(c_i)\,\bigl(1 - P(c_i)\bigr) && (3)\\
&= 1 - \sum_{c_i \in C} P(c_i)^2 && (4)
\end{aligned}$$
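For reference, a minimal sketch of both splitting criteria above is shown below (in Python with NumPy and pandas); the function and variable names are illustrative only, not a required interface.

```python
# Minimal sketch of the two splitting criteria; names are illustrative only.
import numpy as np
import pandas as pd

def entropy(labels):
    """Entropy of a node's class distribution, Eq. (1)."""
    probs = pd.Series(labels).value_counts(normalize=True).to_numpy()
    return -np.sum(probs * np.log2(probs))

def gini(labels):
    """Gini index of a node's class distribution, Eq. (4)."""
    probs = pd.Series(labels).value_counts(normalize=True).to_numpy()
    return 1.0 - np.sum(probs ** 2)

def information_gain(data, feature, target):
    """IG of splitting the DataFrame `data` on `feature`, Eq. (2)."""
    parent = entropy(data[target])
    weighted_children = 0.0
    for level, subset in data.groupby(feature):
        weighted_children += (len(subset) / len(data)) * entropy(subset[target])
    return parent - weighted_children
```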
Finally, you must implement tuning and testing procedures (functions) as part of your decision tree
implementation. You will use these in Problem 3, so we recommend reading that part before beginning
your own implementation.
Your report on this problem should discuss Q3: your implementation in one paragraph, Q4: any
challenges you faced, and Q5: how you overcame them.
Extra credit (8 points): Implement χ² pruning for your decision tree, as described in Section 19.3.4
of R&N.
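If you attempt this, one possible reading of the R&N significance test is sketched below for the binary (M/B) case; the 5% significance level and the helper name are assumptions, and scipy is used only to obtain the χ² critical value.

```python
# Hedged sketch of the chi-squared test from R&N Sec. 19.3.4 for binary labels.
# The 5% significance level and function name are assumptions, not requirements.
from scipy.stats import chi2

def split_is_significant(child_counts, alpha=0.05):
    """child_counts: list of (p_k, n_k) positive/negative counts per child node.

    Returns True if the split deviates significantly from what the parent's
    class ratio alone would predict, i.e., the split should be kept.
    """
    p = sum(pk for pk, _ in child_counts)   # positives in the parent
    n = sum(nk for _, nk in child_counts)   # negatives in the parent
    delta = 0.0
    for pk, nk in child_counts:
        expected_p = p * (pk + nk) / (p + n)
        expected_n = n * (pk + nk) / (p + n)
        if expected_p > 0:
            delta += (pk - expected_p) ** 2 / expected_p
        if expected_n > 0:
            delta += (nk - expected_n) ** 2 / expected_n
    dof = len(child_counts) - 1             # v - 1 degrees of freedom
    return delta > chi2.ppf(1 - alpha, dof)
```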
Problem 3: Use your decision tree to develop a model for WDBC (14 points)
For this problem, you will use the train, dev (short for dev-test, also called the validation or tuning set), and test
sets, available as CSV files provided on myCourses. You will train, tune (using the data partition
called dev), and test your own decision tree to classify tumor nuclei into two classes: malignant (M)
and benign (B). Thus, it is a Boolean classification problem.
The tuning phase will focus on selecting the best-performing node-splitting criterion (Gini or IG).
If you implemented χ² pruning in the implementation problem above, you can earn 2 extra points by
also including it in your tuning process, i.e., treating χ² pruning as active or inactive (on or off).
Finally, you will test your decision tree’s predictions on the held-out test data. The test set aims to
approximate data seen in deployment. It may not be used before predicting your final results. Thus,
you may NOT review the test set during development, or re-tune to the test set.
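As a rough illustration of the tuning loop (not a prescribed design), the sketch below selects the splitting criterion, and optionally the χ² pruning flag, on the dev set; build_tree and accuracy are hypothetical stand-ins for your own training and scoring functions.

```python
def tune(train_df, dev_df, build_tree, accuracy):
    """Pick the best splitting criterion (and pruning flag) on the dev set.

    `build_tree(df, criterion, chi2_pruning)` and `accuracy(tree, df)` are
    placeholders for your own implementation's training and scoring functions.
    """
    best_config, best_dev_acc = None, -1.0
    for criterion in ("gini", "information_gain"):
        for prune in (False, True):          # pruning loop only if implemented
            tree = build_tree(train_df, criterion=criterion, chi2_pruning=prune)
            dev_acc = accuracy(tree, dev_df)
            if dev_acc > best_dev_acc:
                best_config, best_dev_acc = (criterion, prune), dev_acc
    # Only after fixing best_config do you evaluate once on the held-out test set.
    return best_config, best_dev_acc
```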
Discretization:
All the actual feature values are continuous. However, a standard ID3 decision tree implementation
expects categorical features. Thus, the continuous, real-valued features in the original dataset need to
be discretized, or binned. This has been done for you, with values binned into six categorical
levels (l1, l2, l3, l4, l5, l6). For discretization, each feature column was first Z-score normalized:
$$Z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} \tag{5}$$
where x_{ij} is the j-th feature (column) of the i-th image instance (row), μ_j and σ_j are the mean and
standard deviation of the j-th feature, respectively, and Z_{ij} is the Z-score normalized value of x_{ij}. The
normalized values were then mapped to six levels (l) of the feature f, coded as:
$$\mathrm{Code}(Z_{ij}) = \begin{cases}
l_1 & \text{if } Z_{ij} < -2\sigma_j \\
l_2 & \text{if } -2\sigma_j \le Z_{ij} < -\sigma_j \\
l_3 & \text{if } -\sigma_j \le Z_{ij} < \mu_j \\
l_4 & \text{if } \mu_j \le Z_{ij} < \sigma_j \\
l_5 & \text{if } \sigma_j \le Z_{ij} < 2\sigma_j \\
l_6 & \text{if } Z_{ij} \ge 2\sigma_j
\end{cases}$$
You can use the discretized data in ./Final data/wdbc {train/dev/test}.csv. (Details on the
discretization process can be found in Discretization prepartion.ipynb and reviewing it is optional.)
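For intuition only, a sketch of the binning scheme above is given below; it assumes the feature columns have already been separated from the label, and it reads the thresholds −2σ_j, …, 2σ_j as −2, …, 2 on the normalized Z scale.

```python
# Sketch of the described discretization. The interpretation of the thresholds
# (-2, -1, 0, 1, 2 on the Z scale) and the use of pandas' sample std (ddof=1)
# are assumptions; the provided binned files may differ slightly.
import numpy as np
import pandas as pd

def discretize(df, feature_cols):
    out = df.copy()
    for col in feature_cols:
        mu, sigma = df[col].mean(), df[col].std()
        z = (df[col] - mu) / sigma                       # Eq. (5)
        bins = [-np.inf, -2, -1, 0, 1, 2, np.inf]        # left-closed intervals
        out[col] = pd.cut(z, bins=bins, right=False,
                          labels=["l1", "l2", "l3", "l4", "l5", "l6"])
    return out
```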
Extra credit (5 pts.)
You can discretize the original continuous data in ./Final data/wdbc {train/dev/test} raw.csv
into categorical data yourself using (a) your own approach, or (b) the procedure briefly explained
in R&N, Section 19.3.5 (p. 664). Please describe the process you used in plain English and do
your best to express it formally as well, as exemplified above.
In your report, answer these questions. Q6: summarize the dataset paper you read and discuss two
key observations. Then, Q7: provide the performance results on the test set in a five-column table
reporting accuracy, error, precision, and recall (plus a column identifying the method), with one row
each for Gini and IG. Your table should also include a comparative baseline result for a trivial
majority-class classifier that labels all instances with the most frequent class in the training data.
Note that for precision and recall, you will need to decide, on a reasoned basis, which class you treat
as the positive class, since this determines what counts as a true or false positive prediction.
Additionally, Q8: discuss whether precision or recall is the more critical (most important)
performance metric for this problem. Finally, Q9: explain your tuning procedure with the dev (tuning)
set and what you learned from this part of the problem.
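To make the table in Q7 concrete, here is a minimal sketch of the requested metrics and the majority-class baseline; treating malignant (M) as the positive class is an assumption you still need to justify in your answer.

```python
# Minimal metric sketch; treating "M" as the positive class is an assumption.
from collections import Counter

def metrics(y_true, y_pred, positive="M"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy,
            "error": 1 - accuracy,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0}

def majority_baseline(y_train, y_test):
    majority = Counter(y_train).most_common(1)[0][0]   # most frequent training label
    return metrics(y_test, [majority] * len(y_test))
```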
Q12: elaborate on the similarities and differences between your implementation and that of scikit-learn.
Additionally, Q13: compare and contrast the performance on the binned versus non-binned (normalized)
data and report whether there is a difference in performance between the two, and Q14: speculate on why
that may be.
Finally, review the documentation on decision trees for the library at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html and then explore two additional
decision tree hyperparameters that you did not yet consider, such as a maximum-depth stopping
criterion, the minimum number of instances required for a split, or the minimum number of instances at
a leaf node. Discuss Q15: your observations, e.g., based on visualized evidence or performance results.
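A hedged starting point for this exploration is sketched below; the file paths and the "diagnosis" column name are guesses based on the problem description (adjust them to the actual CSVs), and note that scikit-learn trees expect numeric features, so use the raw (non-binned) data or encode the l1–l6 levels first.

```python
# Exploratory sketch only: paths and the "diagnosis" column name are assumptions.
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

train = pd.read_csv("Final data/wdbc_train_raw.csv")   # hypothetical file name
dev = pd.read_csv("Final data/wdbc_dev_raw.csv")       # hypothetical file name
X_train, y_train = train.drop(columns=["diagnosis"]), train["diagnosis"]
X_dev, y_dev = dev.drop(columns=["diagnosis"]), dev["diagnosis"]

# Sweep two extra hyperparameters: a depth cap and a minimum leaf size.
for max_depth in (3, 5, None):
    for min_samples_leaf in (1, 5, 20):
        clf = DecisionTreeClassifier(criterion="gini",
                                     max_depth=max_depth,
                                     min_samples_leaf=min_samples_leaf,
                                     random_state=0)
        clf.fit(X_train, y_train)
        print(max_depth, min_samples_leaf, clf.score(X_dev, y_dev))

# Visualize the last fitted tree (requires matplotlib).
plot_tree(clf, feature_names=list(X_train.columns),
          class_names=list(clf.classes_), filled=True)
plt.show()
```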