Classification
Deduction (Testing): Apply the model to unseen data to predict the class labels,
then assess the model's performance and use the results to improve the training process.
What if all attribute values are identical, but the class labels differ?
a. When: Noise or inconsistencies in the data.
b. Solution: Declare it a leaf node and assign it the most common class label among
the training instances associated with this node.
How to determine the best attribute test?
a. Attribute Test Conditions:
i. Binary Attributes→Binary Split
ii. Nominal Attributes→{Multiway Split, Binary Split: by grouping attribute
values}
iii. Ordinal Attributes→Binary or multiway splits
iv. Grouping should not violate the order.
v. Continuous Attributes→{Multiway Split: discretization into non-overlapping
intervals, Binary Split: determine a threshold}.
a. Entropy(t) = -Σ_i p(i|t) log2 p(i|t)
b. Gini(t) = 1 - Σ_i p(i|t)^2
c. Classification Error: ME(t) = 1 - max_i p(i|t)
d. Gini is faster than entropy (it doesn’t compute log) and it often produces
simpler trees.
e. Ranges (for binary classification): Entropy→[0,1], Gini→[0,0.5], ME→[0,0.5].
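A minimal Python sketch of these three impurity measures, computed from a node's class counts (the function names are ours, not from the notes):

```python
import numpy as np

def entropy(counts):
    """Entropy(t) = -sum(p_i * log2(p_i)); max 1.0 for a balanced binary node."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()          # drop empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini(t) = 1 - sum(p_i^2); max 0.5 for a balanced binary node."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def misclassification_error(counts):
    """ME(t) = 1 - max(p_i); max 0.5 for a balanced binary node."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - p.max()

print(entropy([5, 5]), gini([5, 5]), misclassification_error([5, 5]))
# -> 1.0 0.5 0.5 (a maximally impure binary node hits each measure's upper bound)
```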
Gain Ratio:
a. Used to select the optimal attribute for splitting data, like Information Gain (IG).
b. Addresses a limitation of IG by reducing its bias toward attributes with many
values (we use it when an attribute has a very large number of distinct values, such as an ID).
c. Gain Ratio = Information Gain / Split Info, where Split Info = -Σ_i (n_i/n) log2(n_i/n) over the k children of the split.
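A small sketch of this penalty at work, assuming the information gain has already been computed; it shows how an ID-like split with many children gets discounted:

```python
import numpy as np

def split_info(child_sizes):
    """Split Info = -sum((n_i/n) * log2(n_i/n)) over the k child nodes."""
    w = np.asarray(child_sizes, dtype=float)
    w = w / w.sum()
    return -np.sum(w * np.log2(w))

def gain_ratio(info_gain, child_sizes):
    """Gain Ratio = Information Gain / Split Info."""
    return info_gain / split_info(child_sizes)

# An ID-like attribute that splits 8 records into 8 singleton children has
# a large Split Info (log2(8) = 3), which shrinks its gain ratio:
print(gain_ratio(1.0, [1] * 8))   # 0.333...
print(gain_ratio(1.0, [4, 4]))    # 1.0 for a balanced binary split
```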
b. Expressiveness:
i. Universal Representation: Can encode any function of discrete-valued
attributes.
ii. Efficient Encoding: A discrete-valued function can be represented as an
assignment table, and decision trees can often represent it efficiently. A DT
can group combinations of attribute values under shared leaf nodes (compact
representation). But not all decision trees for discrete-valued attributes
can be simplified (e.g., the parity function).
iii. Rectilinear Splits:
1. The test conditions described so far involve using only a single attribute
at a time. As a consequence, the tree-growing procedure can be viewed as the
process of partitioning the attribute space into disjoint regions until each
region contains records of the same class. The border between two neighboring
regions of different classes is known as a decision boundary.
2. Since the test condition involves only a single attribute, the decision
boundaries are rectilinear; i.e., parallel to the coordinate axes.
3. Effective in handling both categorical and continuous variables.
4. Disadvantages of Rectilinear Splits:
1. Struggle with Non-linear Boundaries
2. Limited Flexibility: Restricts decision boundaries to orthogonal lines,
limiting flexibility.
3. Oversimplification Risks: Can lead to oversimplified models that fail
to capture the true nature of the data.
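To see the rectilinear behavior concretely, here is a brief sketch (the dataset and tree depth are illustrative assumptions, not from the notes) that prints the axis-parallel tests of a fitted tree:

```python
# Every test in the fitted tree thresholds a single axis, so the induced
# decision boundaries are axis-parallel (rectilinear).
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x0", "x1"]))
# Each printed test has the form "x0 <= 0.42" or "x1 <= -0.17": the
# non-linear moon boundary is approximated by axis-parallel rectangles.
```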
Model Evaluation:
a. After training, we estimate the performance on new, unseen data.
1. Defining Evaluation Metrics:
a. Classification Metrics: Confusion matrix, Accuracy, Precision, F1 score ...
2. Choosing a Data Splitting Strategy:
a. Holdout: A single division of the data, reserving a portion for testing.
b. Cross-Validation: Repeated splits for a robust performance estimate.
c. Stratified Sampling: Ensures class balance in each split, especially for
imbalanced data.
Confusion Matrix:
a. Compares the predicted labels against the true labels.
b. In BINARY CLASSIFICATION.
c. Entries: TP (true positives), FN (false negatives), FP (false positives), TN (true negatives).
d. Accuracy = (TP + TN) / (TP + TN + FP + FN).
e. Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall).
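A compact sketch computing these counts and metrics for binary labels (the helper name is ours):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and the derived metrics for 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
# -> TP=2, TN=2, FP=1, FN=1; accuracy 0.667, precision/recall/F1 0.667
```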
c. Factors:
i. Limited Training Size:
1. A finite number of instances can provide only a limited representation of
the overall data, so the patterns learned from a training set may not fully
represent the true patterns in the overall data.
2. Increasing the size of the training set → better pattern learning → closer
resemblance to the true patterns in the overall data.
Model Selection:
a. There are many possible classification models with different levels of
complexity; we want to select the model with the lowest generalization error rate.
b. The training error rate cannot be reliably used as the sole criterion for model
selection.
c. Generic approaches:
i. Using a Validation Set:
1. The idea is to use out-of-sample estimates by evaluating the model on a
separate validation set that is not used for training the model.
2. The validation error rate (the error rate on the validation set) is a
better indicator of generalization performance than the training error
rate, since the validation data is unseen during training.
3. The process is the following (a sketch follows this list):
1. Partition D.train into D.tr and D.val.
2. For any model m trained on D.tr, estimate its validation error rate
error.val(m) on D.val.
3. Select the model with the lowest value of error.val(m).
4. Drawbacks: sensitivity to the sizes of D.tr and D.val.
1. If D.tr is small, the learned model will be less representative.
2. If D.val is small, the validation error rate might not be reliable
for selecting models.
Model Evaluation:
a. The estimates of generalization performance used to guide the selection of
the classification model are biased indicators of the performance on unseen
instances.
b. We need to evaluate the performance on unseen data D.test by computing the
error.test rate.
c. Data partitioning :
i. Holdout Method:
1. D.train and D.test.
2. Choosing the right fraction for training data is not trivial.
3. Small size of D.train → bad pattern learning/bad generalization.
4. Small size of D.test → Error.test less reliable.
5. Moreover, error.test can have a high variance as we change the random
partitioning of D into D.train and D.test.
6. Random subsampling = repeated holdout ⇾ repeat the random split to obtain a
distribution of error.test values and understand its variance (see the sketch
at the end of this section).
ii. Cross-Validation:
1. Aims to make effective use of all labeled instances in D for both training
and testing, avoiding the split bias of the holdout method.
2. The k-fold cross-validation method segments the labeled data D of size N
into K equal-sized folds.
3. Each fold is used exactly once for error calculation.(error.test(i))
4. The overall estimate aggregates the fold errors: error.test = (1/k) Σ_i error.test(i) (for equal-sized folds, the total number of errors across all folds divided by N).
iii. Stratified Sampling:
1. Ensures equal representation of classes in each partition (see the sketch below).
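A short sketch contrasting repeated holdout with stratified k-fold cross-validation (the dataset, model, and split settings are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# An imbalanced binary dataset, where stratification matters.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

# Random subsampling = repeated holdout: a distribution of error.test values.
errs = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed, stratify=y)
    m = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errs.append(1.0 - m.score(X_te, y_te))
print("holdout error: mean %.3f, std %.3f" % (np.mean(errs), np.std(errs)))

# Stratified k-fold CV: each instance is tested exactly once, and each fold
# preserves the class proportions of the full dataset.
fold_errs = []
for tr_idx, te_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                      random_state=0).split(X, y):
    m = DecisionTreeClassifier(random_state=0).fit(X[tr_idx], y[tr_idx])
    fold_errs.append(1.0 - m.score(X[te_idx], y[te_idx]))
print("5-fold CV error:", np.mean(fold_errs))
```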