
Classification

Role of classification models:


a. Predictive Models: Learn patterns from historical data in order to predict class
labels for unseen data.
b. Descriptive Models: Describe the distinguishing features of different classes by
analysing the data to find common characteristics.

Framework for Classification:


a. Induction (Training): Learn a model from a labeled training set using a learning
algorithm.

b. Deduction (Testing): Apply the model to unseen data to predict the class label,
and assess the model's performance, which can be used to improve the training
process.

Decision Tree Induction:


a. Goal: Find optimal splits in the data that separate the classes as accurately as
possible.
b. Steps:
i. Select attribute: To split the data.
ii. Split data: Into subsets based on attribute values.
iii. Repeat: Until stopping criteria are met.

Hunt’s Algorithm for Decision Tree building:


a. Initial Node: Start with a single node containing all the training instances.
b. Expansion and Child Node Formation: Select an attribute and split the instances
into child nodes based on its values; instances settled into pure leaf nodes are
removed from consideration before the next split.
c. Recursive Process: Repeat the process on each child node until termination.
d. Termination: A node becomes a leaf when it contains instances of only one class.
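
A minimal Python sketch of this recursion, assuming categorical attributes stored as
dicts and a Gini-based split chooser (the impurity measures themselves are covered in
the sections below):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(instances, attributes):
    """Pick the attribute whose split minimizes the weighted child Gini."""
    n = len(instances)
    def weighted_gini(attr):
        groups = {}
        for x, y in instances:
            groups.setdefault(x[attr], []).append(y)
        return sum(len(g) / n * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)

def hunt(instances, attributes):
    """Hunt's algorithm: recursively split until nodes are pure."""
    labels = [y for _, y in instances]
    # Termination: pure node (or nothing left to split on) -> leaf
    # labeled with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(instances, attributes)
    children = {}
    for value in {x[attr] for x, _ in instances}:
        subset = [(x, y) for x, y in instances if x[attr] == value]
        children[value] = hunt(subset, [a for a in attributes if a != attr])
    return {attr: children}

# Tiny illustrative dataset (hypothetical):
data = [({"outlook": "sunny", "windy": "no"}, "yes"),
        ({"outlook": "sunny", "windy": "yes"}, "no"),
        ({"outlook": "rain", "windy": "no"}, "yes")]
print(hunt(data, ["outlook", "windy"]))
# {'windy': {'no': 'yes', 'yes': 'no'}}
```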

How to handle an empty test outcome?


a. When: The training set has no instances with a specific attribute value, so the
corresponding child node is empty (this can happen for testing instances).
b. Solution: Assign the empty node the most common class label of its parent node.

All attribute values are identical, but the class labels differ?
a. When: Noise or inconsistencies in the data.
b. Solution: Declare it a leaf node and assign it the most common class label among
the training instances associated with this node.
How to determine the best attribute test?
a. Attribute Test Conditions:
i. Binary Attributes → Binary Split
ii. Nominal Attributes → {Multiway Split; Binary Split by grouping attribute
values}
iii. Ordinal Attributes → Binary or multiway splits
iv. Grouping should not violate the order.
v. Continuous Attributes → {Multiway Split: discretization into non-overlapping
intervals; Binary Split: determine a threshold}.

b. Objective: Prefer attribute tests leading to pure child nodes, so expansion can
stop early. Pure nodes → fewer expansions → less complexity, which reduces the
probability of overfitting and keeps the tree easy to interpret.

Impurity Measure for a Single Node:

a. Entropy(t) = −Σᵢ pᵢ(t) log₂ pᵢ(t)

b. Gini(t) = 1 − Σᵢ pᵢ(t)²

c. Classification error ME(t) = 1 − maxᵢ pᵢ(t)

where pᵢ(t) is the fraction of training instances of class i at node t.

d. Gini is faster than entropy (it doesn't compute a log) and it often produces
simpler trees.
e. For binary classification: Entropy → [0, 1], Gini → [0, 0.5], ME → [0, 0.5].
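
A small Python sketch of these three measures, computed from the class counts at a
node:

```python
import math

def impurities(counts):
    """Entropy, Gini, and classification error for one node,
    given the class counts at that node."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    gini = 1.0 - sum(p * p for p in probs)
    error = 1.0 - max(probs)
    return entropy, gini, error

# A pure node and a maximally impure binary node:
print(impurities([10, 0]))  # (0.0, 0.0, 0.0)
print(impurities([5, 5]))   # (1.0, 0.5, 0.5)
```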

Collective Impurity of Child Nodes:

a. I(children) = Σⱼ [N(vⱼ)/N] · I(vⱼ), the weighted average of the child impurities,
where N(vⱼ) is the number of instances at child node vⱼ and N is the number at the
parent.

b. Δ = I(parent) − I(children) → gain in purity (called the information gain when I
is entropy)

c. Maximizing gain is equivalent to minimizing weighted child impurity.
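
A sketch of the weighted child impurity and the resulting gain Δ, using Gini as the
impurity and assuming class counts are known for the parent and each child:

```python
def gini(counts):
    """Gini impurity from class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, children_counts):
    """Purity gain: I(parent) minus the weighted sum of I(child)."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini(child)
                   for child in children_counts)
    return gini(parent_counts) - weighted

# A 10/10 parent split into two fairly pure children:
print(gain([10, 10], [[9, 1], [1, 9]]))  # 0.5 - 0.18 = 0.32
```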

Gain Ratio:
a. Used, like the IG, to select the optimal attribute for splitting the data.
b. Addresses a limitation of IG by reducing the bias toward attributes with many
values (we use it when an attribute has a large number of distinct values, such as
an ID).

c. Gain Ratio = Δ_info / Split Info, where Split Info = −Σᵢ [N(vᵢ)/N] log₂ [N(vᵢ)/N]
over the children v₁, …, v_k of the split.
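
Continuing under the same assumptions (class counts per node), a gain-ratio
computation using entropy; note how a many-valued, ID-like split is penalized by its
large split information:

```python
import math

def entropy(counts):
    """Entropy from class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    """Information gain divided by split information."""
    n = sum(parent_counts)
    weights = [sum(child) / n for child in children_counts]
    info_gain = entropy(parent_counts) - sum(
        w * entropy(child) for w, child in zip(weights, children_counts))
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    return info_gain / split_info

# An ID-like split into 10 pure singleton children: IG is perfect (1.0),
# but split info is log2(10), so the ratio drops to about 0.30.
print(gain_ratio([5, 5], [[1, 0]] * 5 + [[0, 1]] * 5))
```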

Characteristics of Decision Tree Classifiers


a. Applicability:
i. Nonparametric Approach: No assumptions on the probability distribution of
the data.
ii. Wide Applicability: Any type of data.
iii. No Data Transformation Required
iv. Multiclass Problem Handling: Without decomposing them into multiple binary
classification tasks.
v. Interpretability: easy to understand (particularly shorter ones).
vi. Competitive Accuracy

b. Expressiveness:
i. Universal Representation: Can encode any function of discrete-valued
attributes.
ii. Efficient Encoding: Any discrete-valued function can be represented as an
assignment table, and decision trees can often represent it more efficiently; a
DT can group combinations of attribute values under shared leaf nodes (compact
representation). But not all decision trees for discrete-valued attributes can
be simplified (e.g., the parity function).
iii. Rectilinear Splits:
1. The test conditions described so far involve using only a single attribute
at a time. As a consequence, the tree-growing procedure can be viewed as the
process of partitioning the attribute space into disjoint regions until each
region contains records of the same class. The border between two neighboring
regions of different classes is known as a decision boundary.
2. Since each test condition involves only a single attribute, the decision
boundaries are rectilinear, i.e., parallel to the coordinate axes.
3. Effective in handling both categorical and continuous variables.
4. Disadvantages of Rectilinear Splits:
1. Struggle with Non-linear Boundaries
2. Limited Flexibility: Decision boundaries are restricted to axis-parallel
lines.
3. Oversimplification Risk: Can lead to oversimplified models that fail to
capture the true structure of the data.

Model Evaluation:
a. After training, we estimate the performance on new, unseen data.
b. Defining Evaluation Metrics:
i. Classification metrics: confusion matrix, accuracy, precision, F1 score, ...
c. Choosing a Data Splitting Strategy:
i. Holdout: A single division of the data, reserving a portion for testing.
ii. Cross-Validation: Repeated splits for a robust performance estimate.
iii. Stratified Sampling: Ensures class balance in each split, especially for
imbalanced data.

Confusion Matrix:
a. Compares the predicted labels against the true labels.
b. In binary classification:

c.
                Predicted +   Predicted −
   Actual +         TP            FN
   Actual −         FP            TN

d. Accuracy = (TP + TN) / (TP + TN + FP + FN)

e. Error rate = (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy
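
A minimal sketch computing these counts and metrics from lists of true and predicted
binary labels:

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts plus accuracy and error rate
    for binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": accuracy, "error_rate": 1 - accuracy}

print(confusion_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
# {'TP': 1, 'TN': 1, 'FP': 1, 'FN': 1, 'accuracy': 0.5, 'error_rate': 0.5}
```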

Model Overfitting / model Underfitting:


a. Overfitting: The model fits the training data well but shows poor generalization
performance.
b. Underfitting: Both the training and test error rates are large.
c. When the training and test error rates are close, the performance on the training
set is fairly representative of the generalization performance.
d. The training error rate keeps decreasing as the decision tree grows, while the
test error rate stops decreasing at a certain tree size and then begins to
increase.
e. The training error rate thus grossly under-estimates the test error rate once
the tree becomes too large.
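
A short scikit-learn sketch of this behavior (assumption: scikit-learn is available;
the synthetic dataset and depths are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Training error keeps falling with depth; test error turns back up.
for depth in (1, 3, 5, 10, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, 1 - tree.score(X_tr, y_tr), 1 - tree.score(X_te, y_te))
```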

Reasons for Model Overfitting:


a. A bigger decision tree gives a more complex model and a more complex decision
boundary.
b. When the tree becomes large, it tries to perfectly fit all the training data,
fine-tuning itself to specific patterns in the training data and leading to poor
performance on an independently chosen test set.

c. Factors:
i. Limited Training Size:
1. A finite number of instances provides only a limited representation of the
overall data, so the patterns learned from a training set may not fully
represent the true patterns in the overall data.
2. Increasing the size of the training set → better pattern learning → closer
resemblance to the true patterns in the overall data.

ii. High Model Complexity:


1. Complex models do not always give the best performance (as noted above).
2. One measure of model complexity is the number of “parameters” that need to
be inferred from the training set.
3. Parameters are the elements of the tree that are learned from the training
data. These include:
1. Attribute Test Conditions:
1. The rules or conditions at internal nodes (e.g., Age > 30) that decide how
to split the data.

2. Thresholds or Split Points:


1. The specific values used to split continuous attributes (e.g.,
Salary > 50,000).

3. Class Labels in Leaf Nodes:


1. The predicted class for each leaf node based on the majority class of
the training instances in that node.

4. A more complex tree risks overfitting as it infers more parameters from the
training set.

Model Selection:
a. There are many possible classifiers with different levels of complexity; we want
to select the model with the lowest generalization error rate.
b. The training error rate cannot be reliably used as the sole criterion for model
selection.
c. Generic approaches:
i. Using a Validation Set:
1. The idea is to use out-of-sample estimates by evaluating the model on a
separate validation set that is not used for training the model.
2. The validation error rate (the error rate on the validation set, which acts
as unseen data) is a better indicator of generalization performance than the
training error rate.
3. The process is the following (see the sketch after this list):
1. Partition D.train into D.tr and D.val.
2. For any model m trained on D.tr, estimate its validation error rate
error.val(m) on D.val.
3. Select the model with the lowest value of error.val(m).
4. Drawbacks: sensitivity to the sizes of D.tr and D.val.
1. If D.tr is small, the training set will be less representative.
2. If D.val is small, the validation error rate might not be reliable for
selecting models.

ii. Incorporating Model Complexity:


1. As the complexity of the model increases, the chance of overfitting
increases, so we need to take the complexity of the model into consideration,
not only the training error rate.
2. Principle of parsimony: given two models with the same errors, the simpler
model is preferred over the more complex model.
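
The sketch referenced above: validation-based model selection with scikit-learn,
using max_depth as an illustrative complexity knob:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=1)
# D.train -> D.tr (for fitting) and D.val (for selection).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=1)

best_depth, best_err = None, 1.0
for depth in (1, 2, 3, 5, 8, None):
    m = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    err_val = 1 - m.score(X_val, y_val)   # error.val(m)
    if err_val < best_err:
        best_depth, best_err = depth, err_val
print(best_depth, best_err)
```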

Model Selection for Decision Trees:


a. Prepruning (Early Stopping Rule):
i. Halting tree growth before generating a tree that perfectly fits the training
data.
ii. Limits tree growth by bounding the maximum depth or minimum leaf size.
iii. The advantage of prepruning is that it avoids the computations associated
with generating overly complex subtrees that overfit the training data.
iv. Drawback: if the best possible model lies at some greater depth and we stop
before reaching it, we never find it.

b. Post-pruning (Applied after the tree is fully grown) :


i. Removes branches that contribute little to classification accuracy.
ii. Reduces model complexity, enhancing generalization to new data.
iii. Subtree Replacement : Replace an entire subtree with a single leaf node. The
leaf node's class label is determined by the majority class of the instances
in that subtree.
iv. Subtree Raising: Replace a subtree by promoting the most frequently used
branch (the branch that has the most instances) to the parent node.
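
A sketch of both strategies in scikit-learn; note that scikit-learn's built-in
post-pruning is cost-complexity pruning (ccp_alpha) rather than the subtree
replacement/raising described above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=2)

# Prepruning: cap depth and leaf size while growing the tree.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             random_state=2).fit(X, y)

# Post-pruning: grow fully, then prune with cost-complexity alpha.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=2).fit(X, y)

print(pre.get_n_leaves(), post.get_n_leaves())
```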

Model Evaluation:
a. The estimates of the generalization performance used to guide the selection of
the classification model are biased indicators of the performance on unseen
instances.
b. We need to evaluate the performance on unseen data D.test by computing the error
rate error.test.
c. Data partitioning (a code sketch follows this list):
i. Holdout Method:
1. D.train and D.test.
2. Choosing the right fraction for training data is not trivial.
3. Small size of D.train → bad pattern learning/bad generalization.
4. Small size of D.test → Error.test less reliable.
5. Moreover, error.test can have a high variance as we change the random
partitioning of D into D.train and D.test.
6. Random subsampling (repeated holdout) → obtain a distribution of error.test
values to understand its variance.

ii. Cross-Validation:
1. Aims to make effective use of all labeled instances in D for both training
and testing, avoiding the split bias of the holdout method.
2. The k-fold cross-validation method segments the labeled data D of size N
into k equal-sized folds.
3. Each fold is used exactly once for error calculation (error.test(i)).

4. The overall error rate aggregates the per-fold errors: error.test =
Σᵢ₌₁ᵏ (number of errors on fold i) / N.

5. Every instance in the data is used exactly once for testing and (k−1) times
for training; every run uses a (k−1)/k fraction of the data for training and
1/k for testing.
6. Leave-one-out:
1. k = N.
2. Advantage: utilizes as much data as possible for training.
3. But it can be misleading and computationally expensive for large data
sets.

7. Stratified Sampling:
1. Ensures equal representation of classes in each partition.
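
The sketch referenced above: holdout and stratified k-fold cross-validation with
scikit-learn (StratifiedKFold also demonstrates stratified sampling):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# An imbalanced synthetic dataset (illustrative only).
X, y = make_classification(n_samples=300, weights=[0.8], random_state=3)

# Holdout: one split; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=3)

# Stratified k-fold: each fold is used exactly once for testing.
errors = 0
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
for train_idx, test_idx in skf.split(X, y):
    m = DecisionTreeClassifier(random_state=3).fit(X[train_idx], y[train_idx])
    errors += (m.predict(X[test_idx]) != y[test_idx]).sum()
print("error.test =", errors / len(y))  # sum of fold errors over N
```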
