ML 1

A version space is a hierarchical representation of knowledge that enables you to keep track of all the useful information supplied by a sequence of learning examples without remembering any of the examples.

The version space method is a concept learning process accomplished by managing multiple
models within a version space.

Version Space Characteristics

Tentative heuristics are represented using version spaces.

A version space represents all the alternative plausible descriptions of a heuristic.


A plausible description is one that is applicable to all known positive examples and no known
negative example.

A version space description consists of two complementary trees:

1. One that contains nodes connected to overly general models, and


2. One that contains nodes connected to overly specific models.

Node values/attributes are discrete.

Fundamental Assumptions

1. The data is correct; there are no erroneous instances.


2. A correct description is a conjunction of some of the attributes with values.

Diagrammatical Guidelines

There is a generalization tree and a specialization tree.

Each node is connected to a model.

Nodes in the generalization tree are connected to a model that matches everything in its
subtree.

Nodes in the specialization tree are connected to a model that matches only one thing in its
subtree.

Links between nodes and their models denote

 generalization relations in a generalization tree, and


 specialization relations in a specialization tree.

Diagram of a Version Space

[Diagram omitted: the specialization tree is colored red, and the generalization tree is colored green.]
Generalization and Specialization Lead to Version Space Convergence

The key idea in version space learning is that specialization of the general models and
generalization of the specific models may ultimately lead to just one correct model that
matches all observed positive examples and does not match any negative examples.

That is, each time a negative example is used to specialize the general models, those specific models that match the negative example are eliminated, and each time a positive example is used to generalize the specific models, those general models that fail to match the positive example are eliminated. Eventually, the positive and negative examples may be such that only one general model and one identical specific model survive.

Version Space Method Learning Algorithm: Candidate-Elimination

The version space method handles positive and negative examples symmetrically.

Given:

 A representation language.
 A set of positive and negative examples expressed in that language.
Compute: a concept description that is consistent with all the positive examples and none of
the negative examples.

Method:

 Initialize G, the set of maximally general hypotheses, to contain one element: the null
description (all features are variables).
 Initialize S, the set of maximally specific hypotheses, to contain one element: the first
positive example.
 Accept a new training example.
o If the example is positive:
1. Generalize all the specific models to match the positive example, but
ensure the following:
 The new specific models involve minimal changes.
 Each new specific model is a specialization of some general
model.
 No new specific model is a generalization of some other
specific model.
2. Prune away all the general models that fail to match the positive
example.
o If the example is negative:
1. Specialize all general models to prevent match with the negative
example, but ensure the following:
 The new general models involve minimal changes.
 Each new general model is a generalization of some specific
model.
 No new general model is a specialization of some other general
model.
2. Prune away all the specific models that match the negative example.
o If S and G are both singleton sets, then:
 if they are identical, output their value and halt.
 if they are different, the training cases were inconsistent. Output this
result and halt.
 else continue accepting new training examples.

The algorithm stops when:

1. It runs out of data.


2. The number of hypotheses remaining is:
o 0 - no consistent description for the data exists in the language.
o 1 - the answer (the version space converges).
o 2+ - the version space has not yet converged; all descriptions consistent with the data so far are implicitly bounded by S and G.
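
To make the bookkeeping concrete, the following is a minimal Python sketch of Candidate-Elimination for conjunctive hypotheses over discrete attributes, where '?' matches any value. The three-attribute weather-style dataset and every name in it are invented for illustration; S is kept as a single hypothesis, the first example is assumed positive, and the check that each specialization of G still generalizes a member of S is elided.

def matches(hypothesis, example):
    # A hypothesis matches an example if every non-'?' attribute agrees.
    return all(h == '?' or h == e for h, e in zip(hypothesis, example))

def min_generalize(s, example):
    # Minimally generalize the specific hypothesis s to cover a positive example.
    return tuple(si if si == ei else '?' for si, ei in zip(s, example))

def min_specialize(g, s, example):
    # Minimally specialize g to reject a negative example, pinning attributes
    # to the values in s so consistency with past positives is preserved.
    out = []
    for i, gi in enumerate(g):
        if gi == '?' and s[i] != '?' and s[i] != example[i]:
            h = list(g)
            h[i] = s[i]
            out.append(tuple(h))
    return out

def candidate_elimination(examples, n_attrs):
    g_set = {tuple('?' for _ in range(n_attrs))}   # G: maximally general
    s = None                                       # S: maximally specific
    for example, positive in examples:
        if positive:
            s = example if s is None else min_generalize(s, example)
            g_set = {g for g in g_set if matches(g, example)}   # prune G
        else:
            g_set = ({h for g in g_set if matches(g, example)
                        for h in min_specialize(g, s, example)}
                     | {g for g in g_set if not matches(g, example)})
    return s, g_set

examples = [(('sunny', 'warm', 'normal'), True),
            (('sunny', 'warm', 'high'), True),
            (('rainy', 'cold', 'high'), False)]
s, g = candidate_elimination(examples, 3)
print('S:', s)   # ('sunny', 'warm', '?')
print('G:', g)   # {('sunny', '?', '?'), ('?', 'warm', '?')}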

Comments on the Version Space Method

The version space method is still a trial and error method.


The program does not base its choice of examples, or its learned heuristics, on an analysis of
what works or why it works, but rather on the simple assumption that what works will
probably work again.
Unlike the decision tree ID3 algorithm,

 Candidate-elimination searches an incomplete hypothesis space (i.e., only a subset of the potentially teachable concepts is included in the hypothesis space).
 Candidate-elimination finds every hypothesis that is consistent with the training data, meaning it searches that hypothesis space completely.
 Candidate-elimination's inductive bias is a consequence of how well it can represent the subset of possible hypotheses it will search. In other words, the bias is a product of its search space.
 No additional bias is introduced through Candidate-elimination's search strategy.

Advantages of the version space method:

 Can describe all the possible hypotheses in the language consistent with the data.
 Fast (close to linear).

Disadvantages of the version space method:

 Inconsistent data (noise) may cause the target concept to be pruned.


 Learning disjunctive concepts is challenging.

Inductive bias is the set of assumptions or preferences that a learning algorithm uses to make
predictions beyond the data it has been trained on. Without inductive bias, machine learning
algorithms would be unable to generalize from training data to unseen situations, as the
possible hypotheses or models could be infinite.


For instance, in a classification problem, if the model is trained on data that suggests a linear
relationship between features and outcomes, the inductive bias of the model might favor
a linear hypothesis. This preference guides the model to choose simpler, linear relationships
rather than complex, nonlinear ones, even if such relationships might exist in the data.

Examples:

 Inductive bias in decision trees: A preference for shorter trees with fewer splits.
 Inductive bias in linear regression: The assumption that the data follows a linear
trend.
These biases help the algorithm make predictions more efficiently, even in situations where
there is uncertainty.

Types of Inductive Bias

Inductive bias can be categorized into different types based on the constraints or preferences
that guide a learning algorithm:

1. Language Bias

Language bias refers to the constraints placed on the hypothesis space, which defines the
types of models a learning algorithm can consider. For instance, linear regression models
assume a linear relationship between variables, thereby limiting the hypothesis space to linear
functions.

2. Search Bias

Search bias refers to the preferences that an algorithm has when selecting hypotheses from
the available options. For example, many algorithms prefer simpler models over complex
ones due to the principle of Occam’s Razor, which suggests that simpler models are more
likely to generalize well.

3. Algorithm-Specific Biases

Certain algorithms have specific biases based on their structure:

 Linear Models: Assume that the data has linear relationships.


 k-Nearest Neighbors (k-NN): Assumes that similar data points exist in close
proximity.
 Decision Trees: Typically biased towards choosing splits that result in the most
homogeneous subgroups.
Each type of inductive bias impacts how an algorithm approaches the learning process,
guiding it towards certain types of models and predictions.

Inductive Biases in Machine Learning Algorithms

Different machine learning algorithms incorporate distinct inductive biases that shape their
learning and prediction processes:

1. Bayesian Models

In Bayesian models, prior knowledge is treated as a form of inductive bias. This prior helps
the model make predictions even when the available data is limited. The model updates its
predictions as new data becomes available, balancing the prior with the likelihood of the
observed data.

2. k-Nearest Neighbors (k-NN)


The inductive bias in k-NN lies in its assumption that similar data points are located close to
each other in feature space. As a result, k-NN tends to perform well in datasets where locally
similar data points share the same classification.

3. Linear Regression

The inductive bias in linear regression is the assumption that the relationship between input
variables and output is linear. This bias works well for datasets with linear patterns but may
fail to capture more complex, nonlinear relationships.

4. Logistic Regression

Logistic regression assumes a linear decision boundary between classes, which makes it
effective for binary classification tasks with linearly separable data.

Each of these algorithms leverages specific inductive biases to balance accuracy and
generalization, ensuring that the model doesn’t overfit or underfit the training data.
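
As a small illustration of these contrasting biases, the sketch below (with invented, sine-shaped data) fits a linear model and a k-NN regressor to the same nonlinear target; the linear bias forces a straight line, while k-NN's locality assumption lets it track the curve.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)     # nonlinear target

linear = LinearRegression().fit(X, y)               # assumes a straight line
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)  # assumes local similarity

# The linear model underfits the sine curve; k-NN follows it locally.
print('Linear R^2:', round(linear.score(X, y), 3))
print('k-NN   R^2:', round(knn.score(X, y), 3))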

Importance of Inductive Bias

Inductive bias plays a critical role in ensuring that machine learning models can generalize
effectively from training data to unseen data. Without bias, a learning algorithm would have
to consider every possible hypothesis, which is computationally infeasible.

Generalization and Bias-Variance Trade-off:

Inductive bias helps balance the bias-variance trade-off. A model with too much bias
may underfit the data, resulting in poor predictions on unseen data. Conversely, a model
with too little bias may overfit, capturing noise in the training data but failing to generalize.

The goal is to find the right balance: enough inductive bias to ensure generalization, but not
so much that the model becomes too rigid. This is especially important in real-world machine
learning tasks, where data is often noisy and incomplete, and making assumptions about the
data is necessary for the model to make reasonable predictions.

Challenges and Considerations in Inductive Bias

While inductive bias is essential for guiding machine learning models, it comes with
challenges:

Overfitting

When the inductive bias is too weak, the model may overfit the training data by learning
noise rather than meaningful patterns. Overfitting occurs when the model fits the training data
too closely, resulting in poor performance on unseen data.

Underfitting
Conversely, if the inductive bias is too strong, the model may underfit the data, failing to
capture important patterns. This can lead to overly simplistic models that don’t perform well
on either the training or test data.

Finding the Right Balance

Finding the optimal level of inductive bias requires tuning the model’s complexity and
flexibility. For instance, regularization techniques can help control the degree of bias by
penalizing overly complex models, thus encouraging generalization without overfitting.

Machine learning practitioners must carefully consider the trade-off between bias and
flexibility to create models that are both accurate and generalizable.

Conclusion

Inductive bias is a fundamental concept in machine learning that guides models in making
predictions beyond the training data. By introducing assumptions about the data, inductive
bias allows algorithms to generalize and learn more efficiently. However, the strength of the
bias must be carefully balanced to avoid underfitting or overfitting the model. Understanding
the role of inductive bias in different machine learning algorithms is crucial for selecting the
right model for a given task. Further exploration of bias-variance trade-offs will lead to
better-performing models in real-world applications.

Performance metrics in machine learning are used to evaluate the performance of a machine learning model. These metrics provide quantitative measures to assess how well a model is performing and to compare the performance of different models. Performance metrics are important because they help us understand how well our model is performing and whether it is meeting our requirements. In this way, we can make informed decisions about whether to use a particular model or not.

We must carefully choose the metrics for evaluating ML performance because −

 How the performance of ML algorithms is measured and compared will depend entirely on the metric you choose.
 How you weight the importance of various characteristics in the result will be influenced completely by the metric you choose.

There are various metrics which we can use to evaluate the performance of ML algorithms, for classification as well as regression. Let's discuss these metrics for classification and regression problems separately.

Performance Metrics for Classification Problems


We have discussed classification and its algorithms in the previous
chapters. Here, we are going to discuss various performance metrics that
can be used to evaluate predictions for classification problems.

 Confusion Matrix
 Classification Accuracy
 Classification Report
 Precision
 Recall or Sensitivity
 Specificity
 Support
 F1 Score
 ROC AUC Score
 LOGLOSS (Logarithmic Loss)

Confusion Matrix

The confusion matrix is the easiest way to measure the performance of a classification problem where the output can be of two or more types of classes. A confusion matrix is nothing but a table with two dimensions, "Actual" and "Predicted", whose cells count the "True Positives (TP)", "True Negatives (TN)", "False Positives (FP)" and "False Negatives (FN)" as shown below −

             Predicted: 1   Predicted: 0
Actual: 1    TP             FN
Actual: 0    FP             TN

The terms associated with a confusion matrix are explained as follows −
 True Positives (TP) − It is the case when both the actual class and the predicted class of the data point are 1.
 True Negatives (TN) − It is the case when both the actual class and the predicted class of the data point are 0.
 False Positives (FP) − It is the case when the actual class of the data point is 0 and the predicted class is 1.
 False Negatives (FN) − It is the case when the actual class of the data point is 1 and the predicted class is 0.

Classification Accuracy

Accuracy is the most common performance metric for classification algorithms. It may be defined as the number of correct predictions made as a ratio of all predictions made. We can easily calculate it from the confusion matrix with the help of the following formula −

Accuracy = (TP + TN) / (TP + TN + FP + FN)

We can use the accuracy_score function of sklearn.metrics to compute the accuracy of our classification model.
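
For instance, a short sketch with made-up labels shows the confusion matrix layout and the accuracy formula in action:

from sklearn.metrics import confusion_matrix, accuracy_score

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_actual, y_predicted))
# [[3 1]    rows = actual 0/1, columns = predicted 0/1,
#  [1 3]]   so TN = 3, FP = 1, FN = 1, TP = 3
print(accuracy_score(y_actual, y_predicted))   # (TP + TN) / total = 6/8 = 0.75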

Classification Report

This report consists of the scores of Precision, Recall, F1 and Support. They are explained as follows −

Precision

Precision measures the proportion of true positive instances out of all predicted positive instances. It is calculated as the number of true positive instances divided by the sum of true positive and false positive instances.

We can easily calculate it from the confusion matrix with the help of the following formula −

Precision = TP / (TP + FP)

Precision, as used in document retrieval, may be understood as the fraction of the documents returned by our ML model that are correct.

Recall or Sensitivity
Recall measures the proportion of true positive instances out of all actual
positive instances. It is calculated as the number of true positive
instances divided by the sum of true positive and false negative instances.

We can easily calculate it from the confusion matrix with the help of the following formula −

Recall = TP / (TP + FN)

Specificity

Specificity, in contrast to recall, measures the proportion of actual negative instances that are correctly identified as negative by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −

Specificity = TN / (TN + FP)

Support

Support may be defined as the number of samples of the true response that lie in each class of target values.

F1 Score

F1 score is the harmonic mean of precision and recall. It is a balanced measure that takes both precision and recall into account. The best value of F1 is 1 and the worst is 0. We can calculate the F1 score with the help of the following formula −

F1 = 2 * (Precision * Recall) / (Precision + Recall)

The F1 score gives equal relative contribution to precision and recall.

We can use the classification_report function of sklearn.metrics to get the classification report of our classification model.
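
Reusing the made-up labels from the accuracy example, classification_report bundles precision, recall, F1 and support per class:

from sklearn.metrics import classification_report

y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# For class 1: precision = TP/(TP+FP) = 3/4 and recall = TP/(TP+FN) = 3/4
print(classification_report(y_actual, y_predicted))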

ROC AUC Score

The ROC (Receiver Operating Characteristic) Area Under the Curve (AUC) score is a measure of the ability of a classifier to distinguish between positive and negative instances. It is calculated by plotting the true positive rate against the false positive rate at different classification thresholds and calculating the area under the curve.

As the name suggests, ROC is a probability curve and AUC measures separability. In simple words, the ROC-AUC score tells us how capable the model is of distinguishing between the classes. The higher the score, the better the model.

We can use the roc_auc_score function of sklearn.metrics to compute the AUC-ROC score.

LOGLOSS (Logarithmic Loss)

It is also called logistic regression loss or cross-entropy loss. It is defined on probability estimates and measures the performance of a classification model where the input is a probability value between 0 and 1. It can be understood more clearly by contrasting it with accuracy. Accuracy is the count of correct predictions (predicted value = actual value) in our model, whereas Log Loss measures the uncertainty of our predictions based on how much they vary from the actual labels. With the help of the Log Loss value, we can have a more accurate view of the performance of our model. We can use the log_loss function of sklearn.metrics to compute Log Loss.
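
Both roc_auc_score and log_loss consume predicted probabilities rather than hard labels; the toy values below are for illustration only:

from sklearn.metrics import roc_auc_score, log_loss

y_actual = [1, 0, 1, 1, 0]
y_proba  = [0.9, 0.2, 0.6, 0.8, 0.4]   # predicted P(class = 1)

print(roc_auc_score(y_actual, y_proba))   # 1.0: every positive outranks every negative
print(log_loss(y_actual, y_proba))        # lower is better; 0 only for perfect confident predictions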

Performance Metrics for Regression Problems

We have discussed regression and its algorithms in previous chapters. Here, we are going to discuss various performance metrics that can be used to evaluate predictions for regression problems.

 Mean Absolute Error (MAE)


 Mean Square Error (MSE)
 R Squared (R2) Score

Mean Absolute Error (MAE)

It is the simplest error metric used in regression problems. It is basically the average of the absolute differences between the predicted and actual values. In simple words, with MAE, we can get an idea of how wrong the predictions were. MAE does not indicate the direction of the error, i.e., it gives no indication of underperformance or overperformance of the model. The following is the formula to calculate MAE −

MAE = (1/n) Σ |Y − Ŷ|

Here, Y = actual output values and Ŷ = predicted output values.

We can use the mean_absolute_error function of sklearn.metrics to compute MAE.

Mean Square Error (MSE)

MSE is like MAE, but the only difference is that it squares the difference of actual and predicted output values before summing them all, instead of using the absolute value. The difference can be noticed in the following equation −

MSE = (1/n) Σ (Y − Ŷ)²

Here, Y = actual output values and Ŷ = predicted output values.

We can use the mean_squared_error function of sklearn.metrics to compute MSE.

R Squared (R2) Score

The R Squared metric is generally used for explanatory purposes and provides an indication of the goodness of fit of a set of predicted output values to the actual output values. The following formula will help us understand it −

R2 = 1 − [(1/n) Σi (Yi − Ŷi)²] / [(1/n) Σi (Yi − Ȳ)²]

In the above equation, the numerator is the MSE of the predictions and the denominator is the variance in the actual Y values (Ȳ is their mean).

We can use the r2_score function of sklearn.metrics to compute the R squared value.
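
A quick sanity check of all three regression metrics on invented values:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual    = [3.0, 5.0, 2.5, 7.0]
y_predicted = [2.5, 5.0, 3.0, 6.0]

print(mean_absolute_error(y_actual, y_predicted))  # (0.5 + 0 + 0.5 + 1) / 4 = 0.5
print(mean_squared_error(y_actual, y_predicted))   # (0.25 + 0 + 0.25 + 1) / 4 = 0.375
print(r2_score(y_actual, y_predicted))             # 1 - MSE / variance of y_actual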

Before delving into the intricacies of the ID3 algorithm, let's grasp the essence of decision
trees. Picture a tree-like structure where each internal node represents a test on an attribute,
each branch signifies an outcome of that test, and each leaf node denotes a class label or a
decision. Decision trees mimic human decision-making processes by recursively splitting
data based on different attributes to create a flowchart-like structure for classification or
regression.
ID3 Algorithm
A well-known decision tree approach for machine learning is the Iterative Dichotomiser
3 (ID3) algorithm. By choosing the best characteristic at each node to partition the data
depending on information gain, it recursively constructs a tree. The goal is to make the final
subsets as homogeneous as possible. By choosing features that offer the greatest reduction in
entropy or uncertainty, ID3 iteratively grows the tree. The procedure keeps going until a
halting requirement is satisfied, like a minimum subset size or a maximum tree depth.
Although ID3 is a fundamental method, other iterations such as C4.5 and CART have addressed many of its limitations.
How ID3 Works
The ID3 algorithm is specifically designed for building decision trees from a given dataset.
Its primary objective is to construct a tree that best explains the relationship between
attributes in the data and their corresponding class labels.

1. Selecting the Best Attribute

ID3 employs the concept of entropy and information gain to determine the attribute that best
separates the data. Entropy measures the impurity or randomness in the dataset.

The algorithm calculates the entropy of each attribute and selects the one that results in the most
significant information gain when used for splitting the data.

2. Creating Tree Nodes

The chosen attribute is used to split the dataset into subsets based on its distinct values.

For each subset, ID3 recurses to find the next best attribute to further partition the data, forming
branches and new nodes accordingly.

3. Stopping Criteria

The recursion continues until one of the stopping criteria is met, such as when all instances in a
branch belong to the same class or when all attributes have been used for splitting.

4. Handling Missing Values

ID3 can handle missing attribute values by employing various strategies like attribute mean/mode
substitution or using majority class values.

5. Tree Pruning

Pruning is a technique to prevent overfitting. While not directly included in ID3, post-processing
techniques or variations like C4.5 incorporate pruning to improve the tree's generalization.

Mathematical Concepts of ID3 Algorithm
Now let's examine the formulas linked to the main theoretical ideas in the ID3 algorithm:
1. Entropy
A measure of disorder or uncertainty in a set of data is called entropy. Entropy is a tool used
in ID3 to measure a dataset's disorder or impurity. By dividing the data into as homogenous
subsets as feasible, the objective is to minimize entropy.
For a set S with classes {c1, c2, ..., cn}, the entropy is calculated as:
H(S) = −Σ pi log2(pi), summing over the classes i = 1, ..., n
Where, pi is the proportion of instances of class ci in the set.
2. Information Gain
A measure of how well a certain quality reduces uncertainty is called Information Gain. ID3
splits the data at each stage, choosing the property that maximizes Information Gain. It is
computed using the distinction between entropy prior to and following the split.
Information Gain measures the effectiveness of an attribute A in reducing uncertainty in set
S.
IG(A, S) = H(S) − Σ v ∈ values(A) (|Sv| / |S|) · H(Sv)
Where, |Sv | is the size of the subset of S for which attribute A has value v.
3. Gain Ratio
Gain Ratio is an improvement on Information Gain that considers the inherent worth of
characteristics that have a wide range of possible values. It deals with the bias of Information
Gain in favor of characteristics with more pronounced values.
GR(A, S) = IG(A, S) / (−Σ v ∈ values(A) (|Sv| / |S|) · log2(|Sv| / |S|))
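
These three formulas translate directly into Python; the tiny weather-style dataset below is made up for illustration:

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum(pi * log2(pi)) over the class proportions.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    # IG(A, S) = H(S) - sum(|Sv|/|S| * H(Sv)) over the values v of attribute A.
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for a, l in zip(values, labels) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    # IG divided by the split information: entropy over the attribute's values.
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0

outlook = ['sunny', 'sunny', 'overcast', 'rain', 'rain']
play    = ['no',    'no',    'yes',      'yes',  'yes']
print(information_gain(outlook, play))   # 0.971: outlook separates the classes perfectly
print(gain_ratio(outlook, play))         # ~0.638 after penalizing the three-way split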

Overfitting happens when a model learns too much from the training data, including details that
don’t matter (like noise or outliers).

For example, imagine fitting a very complicated curve to a set of points. The curve will go through
every point, but it won’t represent the actual pattern.

As a result, the model works great on training data but fails when tested on new data.

Overfitting models are like students who memorize answers instead of understanding the topic.
They do well in practice tests (training) but struggle in real exams (testing).

Reasons for Overfitting:

High variance and low bias.

The model is too complex.

The size of the training data.

Techniques to Reduce Overfitting

Improving the quality of training data reduces overfitting by focusing on meaningful patterns and mitigating the risk of fitting noise or irrelevant features.

Increasing the training data can improve the model's ability to generalize to unseen data and reduce the likelihood of overfitting.

Reduce model complexity.

Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the loss begins to increase, stop training).

Ridge regularization and Lasso regularization (see the sketch after this list).

Use dropout for neural networks to tackle overfitting.
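
As a sketch of the Ridge/Lasso item above (invented noisy data, arbitrary alpha values), the snippet fits a deliberately over-flexible degree-10 polynomial with and without regularization; the regularized fits keep the coefficients, and hence the curve's wiggliness, small:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 0.3, 20)   # underlying trend is linear

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.01, max_iter=10000)):
    fit = make_pipeline(PolynomialFeatures(degree=10), model).fit(X, y)
    coefs = fit.named_steps[type(model).__name__.lower()].coef_
    # Huge coefficients on a degree-10 fit are a symptom of overfitting.
    print(type(model).__name__, 'max |coef|:', round(float(np.max(np.abs(coefs))), 2))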

In decision tree learning (like ID3, C4.5, CART), handling continuous (numerical) values is
crucial, because most real-world data includes attributes like age, income, temperature, etc.

Here’s how decision trees handle continuous values:

1. Basic Idea: Convert Continuous to Categorical Splits

For a continuous attribute like "age", decision trees:

 Find a threshold value (like age < 45).


 Split the data based on that threshold:
o Left branch: age < 45
o Right branch: age ≥ 45

This makes the tree behave like it's using a categorical attribute, but it’s just using a binary
decision.

2. Steps to Handle Continuous Attributes

Suppose you're using CART (Classification and Regression Trees) or C4.5:

Step 1: Sort the attribute values.

Example: For age = [25, 32, 40, 45, 50]

Step 2: Try splits between each pair of values.

Try thresholds like:

 (25 + 32)/2 = 28.5


 (32 + 40)/2 = 36
 (40 + 45)/2 = 42.5
 etc.

Step 3: For each threshold, split the data into two groups and calculate:

 Information Gain (for C4.5/ID3)


 or Gini Index (for CART)

Step 4: Choose the threshold that gives the best split.

Example

Let’s say you have:

Age    Class
25     No
32     No
40     Yes
45     Yes
50     Yes

Try different splits like:

 Age < 36 and Age ≥ 36


 Age < 42.5 and Age ≥ 42.5

Pick the one with the highest information gain.
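
The whole threshold search fits in a few lines of Python; this sketch reuses the ID3 entropy definition and the invented age data above:

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Sort, try the midpoint between each consecutive pair of values,
    # and keep the threshold with the highest information gain.
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        t = (v1 + v2) / 2
        left  = [l for v, l in pairs if v < t]
        right = [l for v, l in pairs if v >= t]
        gain = base - (len(left) / len(pairs) * entropy(left)
                       + len(right) / len(pairs) * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

ages   = [25, 32, 40, 45, 50]
labels = ['No', 'No', 'Yes', 'Yes', 'Yes']
print(best_threshold(ages, labels))   # (36.0, 0.971): Age < 36 splits the classes cleanly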

How It's Represented in the Tree

      [Age < 36]
       /      \
     Yes       No
(Class = No)  (Class = Yes)

(Using the threshold 36 found above, both branches are pure.)

This works because decision trees:

 Can naturally handle both continuous and categorical data.
 For continuous values, however, need to find good split points to make decisions.
