INT354 - Unit 2
• d) Data Imbalance
• Balanced → Most standard models can be applied without special treatment
• Imbalanced → Consider:
• Resampling (Oversampling, SMOTE, Undersampling)
• Weighted loss function (XGBoost, CatBoost, Deep Learning)
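As a concrete illustration of the options above, the sketch below (assuming scikit-learn is installed, plus imbalanced-learn for the SMOTE line) contrasts a weighted loss with oversampling on a synthetic 90/10 dataset; the dataset, model, and parameters are illustrative choices, not part of the original notes.

```python
# Minimal sketch of two imbalance remedies (assumes scikit-learn;
# SMOTE additionally assumes imbalanced-learn). Illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced binary problem.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Remedy 1: weighted loss, so minority-class errors are penalized more.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))

# Remedy 2: oversample the minority class before fitting (imbalanced-learn).
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
```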
Choosing a Classification Algorithm
• a) Interpretability
• High Interpretability Required → Logistic Regression, Decision Trees, Naïve Bayes
• Low Interpretability (but High Accuracy) → Random Forest, XGBoost, Deep Learning
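To make the interpretability contrast concrete, the small sketch below (assuming scikit-learn) prints a shallow decision tree's learned rules as plain if/else text; the dataset and depth are illustrative choices.

```python
# A shallow tree's rules are directly human-readable (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```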
• b) Computational Complexity
• Low Complexity (Fast Training & Inference) → Logistic Regression, Naïve Bayes, SVM (linear)
• Medium Complexity → Decision Trees, Random Forest, XGBoost
• High Complexity (Slow Training, High GPU Requirements) → Deep Learning (CNN, LSTM, Transformers)
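One rough way to see the complexity gap is to time the training step. The sketch below (assuming scikit-learn) compares a linear model with a random forest; absolute times depend on hardware, and only the relative gap matters.

```python
# Rough, illustrative timing comparison (assumes scikit-learn).
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=300, random_state=0)):
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{type(model).__name__}: {time.perf_counter() - start:.2f}s")
```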
• c) Accuracy vs Speed Trade-off
• High Accuracy Required → XGBoost, Random Forest, Deep Learning
• Fast Computation Needed → Logistic Regression, Naïve Bayes
• d) Sensitivity to Noise
• Robust to Noise → Random Forest, XGBoost, Deep Learning
• Sensitive to Noise → SVM, Decision Trees (prone to overfitting)
• e) Overfitting Risk
• Low Risk → Random Forest, XGBoost (with regularization), Logistic Regression with L1/L2 (Lasso/Ridge) regularization
• High Risk → Decision Trees (without pruning), Deep Learning (without dropout)
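Points d) and e) can be demonstrated together. In the sketch below (assuming scikit-learn; the dataset and noise level are illustrative), 10% label noise is injected: an unpruned tree memorizes the training set, while capping depth (a simple stand-in for pruning) narrows the train/test gap.

```python
# Noise sensitivity and overfitting in one comparison (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)  # flip_y injects 10% label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = grow fully (high risk); 3 = pruned
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```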
• f) Sequential or Time-Series Data
• Yes → LSTM, GRU, Transformer, 1D-CNN
• No → Use traditional ML models
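For the sequential branch, a minimal recurrent classifier sketch (assuming PyTorch; the layer sizes, class count, and input shapes are illustrative) shows the typical pattern: the LSTM's final hidden state feeds a linear classification head.

```python
# Minimal LSTM sequence classifier sketch (assumes PyTorch).
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, n_features=8, hidden=32, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, time, features)
        _, (h_n, _) = self.lstm(x)     # final hidden state summarizes the sequence
        return self.head(h_n[-1])      # class logits

model = SequenceClassifier()
logits = model(torch.randn(4, 20, 8))  # 4 sequences of 20 time steps
print(logits.shape)                    # torch.Size([4, 3])
```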
• g) Multiclass vs Binary
• Binary Classification → Logistic Regression, SVM, Decision Trees, Deep Learning
• Multiclass (≥3 classes) → Random Forest, XGBoost, Neural Networks (softmax activation)
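A short sketch of the multiclass case (assuming scikit-learn): the fit/predict interface is the same as for binary problems, and predict_proba returns one probability column per class.

```python
# Multiclass classification sketch (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba(X[:2]))    # shape (2, 3): one probability per class
```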
C4.5 Algorithm
Comparison of ID3 and C4.5
• Splitting Criterion
  • ID3 (Iterative Dichotomiser 3): Uses Entropy & Information Gain to choose the best feature for splitting.
  • C4.5 (Successor of ID3): Uses Entropy & Gain Ratio, which normalizes Information Gain to prevent bias towards multi-valued attributes.
• Handling Continuous Features
  • ID3: Cannot handle numerical (continuous) attributes; requires discretization (e.g., "High Credit Score" vs. "Low Credit Score").
  • C4.5: Can handle continuous (numerical) attributes by creating threshold splits (e.g., "Credit Score ≤ 680").
• Handling Missing Values
  • ID3: Cannot handle missing values.
  • C4.5: Can handle missing values by assigning probabilities to different attribute values.
• Bias Towards Multi-Valued Attributes
  • ID3: Yes, because Information Gain favors attributes with more unique values.
  • C4.5: No, because Gain Ratio corrects this bias.
• Output
  • ID3: Produces a large tree with possible overfitting.
  • C4.5: Produces a simpler tree with better generalization.
• Efficiency
  • ID3: Faster but less accurate due to lack of pruning.
  • C4.5: Slower than ID3 but more accurate due to pruning and handling continuous values.
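The splitting-criterion row can be checked numerically. The sketch below (plain Python; the toy labels and candidate splits are invented for illustration) shows why Information Gain rates an ID-like, many-valued attribute as highly as a meaningful binary split, while Gain Ratio downweights it.

```python
# Information Gain vs. Gain Ratio on a toy 6-yes/6-no node.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

def gain_ratio(parent, children):
    n = len(parent)
    # Split information penalizes attributes that shatter the data finely.
    split_info = -sum((len(ch) / n) * math.log2(len(ch) / n) for ch in children)
    return info_gain(parent, children) / split_info

parent = ["yes"] * 6 + ["no"] * 6
id_like = [[label] for label in parent]  # 12 singleton branches (ID-like attribute)
binary = [parent[:6], parent[6:]]        # a meaningful two-way split

# Information Gain rates both splits 1.0, so ID3 sees the ID-like attribute
# as just as good; Gain Ratio (C4.5) downweights it: about 0.28 vs 1.0.
print(f"ID-like: IG={info_gain(parent, id_like):.2f}, GR={gain_ratio(parent, id_like):.2f}")
print(f"binary:  IG={info_gain(parent, binary):.2f}, GR={gain_ratio(parent, binary):.2f}")
```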