IntroAI - 2425 HK2 - Project 2
Project 2
Decision Tree
1 Description
In this assignment, you are going to build decision trees on real-world datasets using scikit-learn.
• Binary class dataset: The UCI Heart Disease dataset is used for classifying whether a
patient has heart disease based on age, blood pressure, cholesterol level, and other
medical indicators. This dataset includes 303 samples, with labels indicating the presence (1)
or absence (0) of heart disease.
• Multi-class dataset: The Palmer Penguins dataset is used for classifying penguin species
based on physical characteristics. The dataset includes 344 samples of three penguin species:
Adelie, Chinstrap, and Gentoo, with features such as bill length, flipper length, body mass,
and sex.
• Additional dataset: You have to find another dataset and build the decision tree for it.
Please provide a detailed description of the dataset information in your report.
Your dataset must:
2 Specifications
You are required to write Python Notebooks (.ipynb) and use the scikit-learn library to
complete the following tasks described for the Heart Disease dataset.
For the remaining datasets (Penguins dataset and your additional dataset), perform similar
tasks as with the Heart Disease dataset.
While there are no strict guidelines for code organization, each task must be clearly documented
and fully comply with all specified requirements.
You need to shuffle the dataset before splitting and ensure it is split in a stratified fashion.
Other parameters (if there are any) should remain at their default settings.
The experiments use training and test sets in four proportions: 40/60, 60/40, 80/20, and
90/10 (train/test). Each proportion yields four subsets (training features, training labels,
test features, and test labels), so you will need 16 subsets in total.
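The splitting step above can be sketched as follows. This is a minimal illustration, not a required implementation: the data is a synthetic stand-in generated with make_classification (303 samples to mirror the Heart Disease dataset), the names feature_train, label_train, feature_test, label_test, and splits are illustrative, and random_state is an assumption added only so the result is reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 303 samples with binary labels.
# Replace X and y with the features and labels of the real dataset.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

train_fractions = [0.4, 0.6, 0.8, 0.9]  # 40/60, 60/40, 80/20, 90/10
splits = {}
for frac in train_fractions:
    # shuffle=True and stratify=y give a shuffled, stratified split.
    # random_state is an assumption for reproducibility.
    feature_train, feature_test, label_train, label_test = train_test_split(
        X, y, train_size=frac, shuffle=True, stratify=y, random_state=42
    )
    splits[frac] = (feature_train, label_train, feature_test, label_test)

# Four proportions, four subsets each: 16 subsets in total.
n_subsets = sum(len(subsets) for subsets in splits.values())
print(n_subsets)
```

Each entry of splits holds the four subsets for one proportion, so later tasks can look up a proportion by its training fraction, for example splits[0.8].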
Visualize the class distributions in all datasets (the original set, training sets, and test sets)
across all proportions to demonstrate that they have been appropriately prepared.
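One possible way to check and plot the class distributions is sketched below, assuming matplotlib is available and again using synthetic stand-in data. The helper class_proportions and the subset names are hypothetical, and only the 80/20 split is shown; the same code would be repeated for the other proportions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the real dataset.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
_, _, label_train, label_test = train_test_split(
    X, y, train_size=0.8, shuffle=True, stratify=y, random_state=42
)

def class_proportions(labels, classes):
    """Return the fraction of samples in each class, in `classes` order."""
    labels = np.asarray(labels)
    return np.array([(labels == c).mean() for c in classes])

classes = np.unique(y)
subsets = {"original": y, "train (80%)": label_train, "test (20%)": label_test}
proportions = {name: class_proportions(lab, classes) for name, lab in subsets.items()}

# Grouped bar chart: one group per subset, one bar per class.
fig, ax = plt.subplots(figsize=(6, 3))
width = 0.8 / len(classes)
positions = np.arange(len(subsets))
for i, c in enumerate(classes):
    heights = [p[i] for p in proportions.values()]
    ax.bar(positions + i * width, heights, width, label=f"class {c}")
ax.set_xticks(positions + width * (len(classes) - 1) / 2)
ax.set_xticklabels(list(subsets.keys()))
ax.set_ylabel("proportion of samples")
ax.legend()
```

If the split is stratified, the bars for each class should have nearly the same height in every group.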
How do you interpret the classification report and the confusion matrix? Based on the results,
provide your insights into the performance of these decision tree classifiers.
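As a hedged sketch of how these two outputs can be produced, the example below trains a DecisionTreeClassifier with default settings on synthetic stand-in data and prints both. The data, the 80/20 split, and random_state are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with the real dataset.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
feature_train, feature_test, label_train, label_test = train_test_split(
    X, y, train_size=0.8, shuffle=True, stratify=y, random_state=42
)

# random_state is an assumption, added so the fitted tree is reproducible.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(feature_train, label_train)
predictions = clf.predict(feature_test)

# Text table with per-class precision, recall, F1-score, and support.
report = classification_report(label_test, predictions)
print(report)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(label_test, predictions)
print(cm)
```

In the confusion matrix, the diagonal counts correct predictions. For class k, recall is cm[k, k] divided by the sum of row k, and precision is cm[k, k] divided by the sum of column k, which is how the two outputs relate to each other.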
• Provide the decision trees, visualized using Graphviz, for each max_depth value.
• Report the accuracy_score (on the test set) of the decision tree classifier for each value of
the max_depth parameter in the following table.
max_depth | None | 2 | 3 | 4 | 5 | 6 | 7
Accuracy  |      |   |   |   |   |   |
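The depth experiment can be sketched as a loop over the max_depth values in the table. This is an illustration under stated assumptions: the data is a synthetic stand-in, the 80/20 split and random_state are choices made for the example, and the dictionaries accuracies and dot_sources are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data (assumption); replace with the real dataset.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
feature_train, feature_test, label_train, label_test = train_test_split(
    X, y, train_size=0.8, shuffle=True, stratify=y, random_state=42
)

depths = [None, 2, 3, 4, 5, 6, 7]
accuracies = {}   # max_depth -> accuracy on the test set
dot_sources = {}  # max_depth -> Graphviz DOT description of the tree
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(feature_train, label_train)
    accuracies[depth] = accuracy_score(label_test, clf.predict(feature_test))
    # With out_file=None, export_graphviz returns the DOT source as a string.
    dot_sources[depth] = export_graphviz(clf, out_file=None, filled=True, rounded=True)

# Print one line per table column.
for depth in depths:
    print(f"max_depth={depth}: accuracy={accuracies[depth]:.3f}")
```

Each DOT string can then be rendered in the notebook, for example with the Source class of the graphviz Python package if it is installed.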
3 Requirements
3.1 Report
The report must include the following sections:
• Work assignment table, which lists the tasks assigned to each team member along with
each member's completion rate for those tasks. For example, if student A has a completion
rate of 90% and the group's total score is 9.0, then A receives a score of
9.0 × 90% = 8.1.
• All visualizations must be presented in the .ipynb file, while statistical results and insights
must be presented in the report.
• The report must be well-formatted and exported to PDF. Points will be deducted for layout
problems such as figures cut off by page breaks.
3.2 Submission
• All reports, code, etc., must be submitted as a single compressed file (.zip, .rar, .7z)
named according to the format: StudentID1_StudentID2_etc.zip/.rar/.7z.
• If the compressed file is larger than 25MB, prioritize compressing the report and source code.
Images and other large files may be uploaded to Google Drive and shared via a link.
4 Assessment
The detailed assessment criteria for this project are outlined as follows:
The detailed assessment criteria for each dataset are outlined as follows:
5 Notices
Please pay attention to the following notices:
• Any plagiarism, trickery, or dishonesty will result in a score of 0 for the course grade.
The end.