Unit-3

Bayesian Concept Learning:

• SYLLABUS:

• Bayesian Concept Learning: Introduction, Bayes’ Theorem, Naïve Bayes Classifier, Applications of Naïve Bayes Classifier. Supervised Learning: Classification, Example of Supervised Learning, Classification Model Learning Steps, Common Classification Algorithms: KNN, Decision Tree, Random Forest model, Support Vector Machines. Introduction of Regression: Example of Regression, Linear Regression, Multiple Linear Regression.
Bayesian Concept Learning:
• Introduction

• Principles of probability applied to classification tasks form an important area of machine learning algorithms.

• In our practical life, our decisions are affected by our prior knowledge or belief
about an event.

• Bayes' Theorem is used in machine learning because it provides a powerful framework for reasoning under uncertainty, updating beliefs with data, and making predictions.

• The key reasons why Bayes' Theorem is important in machine learning are:
Introduction
• 1. Uncertainty Estimation

• Machine learning models often have uncertainty due to incomplete data, noise,
or model limitations.

• Bayes' Theorem helps quantify and manage this uncertainty by computing


probabilities for different outcomes, parameters, or models.

• 2. Updating Beliefs

• A core strength of Bayes' Theorem is its ability to update beliefs based on new
evidence.

• In machine learning, models can start with some prior assumptions (priors)
and improve predictions as new data (likelihood) becomes available.
Introduction

• Here, new evidence (the data) updates our belief in the hypothesis (model
parameters), making the model adaptable and data-driven.

• 3. Regularization and Overfitting Control

• Bayes' Theorem inherently applies regularization through the use of priors.

• By specifying prior distributions over model parameters, we can control the


complexity of the model. This helps avoid overfitting.
Introduction
• 4. Probabilistic Predictions

• Many machine learning tasks benefit from not just predicting outcomes but
understanding the likelihood of those predictions.

• Bayes' Theorem allows for generating probabilistic predictions, which give a fuller picture of model certainty and risk, making the results more interpretable and actionable.

• 5. Bayesian Networks

• Bayesian networks are a specific application of Bayes' Theorem in machine learning, used in areas such as speech recognition, medical diagnostics, autonomous systems, etc.
Introduction
• These networks model the complex relationships between variables using
conditional dependencies.
• Bayes' Theorem is used to update the probability of one variable given
the evidence from others.
• Uses of Bayesian classifiers in real-life applications
• Bayesian classifiers use a simple idea that the training data are utilized
to calculate an observed probability of each class based on feature values.
• When the same classifier is used later for unclassified data, it uses the
observed probabilities to predict the most likely class for the new
features.
Introduction
• Some of the real-life uses of Bayesian classifiers are as follows:

• 1. Text-based classification such as spam or junk mail filtering, author


identification, or topic categorization.

• 2. Medical diagnosis such as given the presence of a set of observed


symptoms during a disease, identifying the probability of new patients
having the disease.

• 3. Network security such as detecting illegal intrusion or anomaly in


computer networks
Introduction
• One of the strengths of Bayesian classifiers is that they utilize all
available parameters to subtly change the predictions, while many other
algorithms tend to ignore the features that have weak effects.

• Features of Bayesian learning methods

1. Prior knowledge of the candidate hypothesis is combined with the


observed data for arriving at the final probability of a hypothesis.

2. The Bayesian approach to learning is more flexible than the other


approaches.
Introduction
3. Bayesian methods can perform better than the other methods while
validating the hypotheses that make probabilistic predictions.

4. Bayesian methods make it possible to classify new instances by combining the predictions of multiple hypotheses.

5. Bayesian methods can be used to create a standard for the optimal


decision.

• Bayesian methods largely depend on the availability of initial knowledge about the probabilities of the hypothesis set.
What is concept learning ?
• Concept learning is about recognizing common characteristics from a
set of training examples, both positive and negative.

• In concept learning, a hypothesis is a specific guess about the concept


based on the data provided.

• The hypothesis space represents all possible concepts the model can
consider, and the learning algorithm's task is to find the best hypothesis
that fits the training data.

• Concept learning is used in areas such as image classification, language


processing, and decision-making, where the system needs to learn
concepts like "cat," "chair," or "spam" from examples.
Bayes’ Theorem
• Bayes’ probability rule as given below:
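P(A|B) = [P(B|A) × P(A)] / P(B)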

• where A and B are conditionally related events and p(A|B) denotes the
probability of event A occurring when event B has already occurred.

• Suppose we have a training data set D where we have noted some


observed data. Our task is to determine the best hypothesis in space H
by using the knowledge of D.
Bayes’ Theorem
• Key concepts involved in Bayes’ rule are:

• Prior, posterior, evidence, and likelihood.

• These are formally related as:
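P(Hypothesis | Data) = [P(Data | Hypothesis) × P(Hypothesis)] / P(Data)

i.e. Posterior = (Likelihood × Prior) / Evidence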


Bayes’ Theorem
• 1. Prior Probability (P(Hypothesis))

• The prior probability represents our belief or knowledge about a


hypothesis or parameter before we observe any data.

• It reflects any existing knowledge or assumptions.

• 2. Likelihood (P(Data | Hypothesis))

• The likelihood represents the probability of observing the data given


that a specific hypothesis is true.

• It measures how well the hypothesis explains the data.


Bayes’ Theorem
• 3. Posterior Probability (P(Hypothesis | Data))

• The posterior probability is the updated probability of the hypothesis


after taking the new evidence or data into account.

• It combines the prior and the likelihood to give a new belief.

• 4. Evidence (P(Data))

• The evidence (or marginal likelihood) is the total probability of


observing the data across all possible hypotheses.

• It serves as a normalizing constant to ensure the posterior is a valid


probability.
Bayesian Concept Learning:
• Example :

• Consider the working of an email spam filter. Historically, 80% of the emails are not spam and 20% are spam. We know that 90% of spam emails contain the word "offer", and 5% of non-spam emails contain the word "offer".

• Suppose, if we receive an email that contains the word "offer," what is the
probability that it is spam?

• Solution:

• P(Spam) = 0.20 (Prior probability that an email is spam).

• P(Not Spam) = 0.80 (Prior probability that an email is not spam).


Bayesian Concept Learning:
• P(Word 'offer' | Spam) = 0.90 (Probability that the word "offer" appears in
spam).

• P(Word 'offer' | Not Spam) = 0.05 (Probability that the word "offer" appears in
non-spam).

• We need to find the probability that the email is spam given that it contains the word "offer", i.e. P(Spam | Word 'offer').

• Using Bayes' Theorem:


Bayesian Concept Learning:
• We need to calculate P(Word 'offer') using the law of total
probability:
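P(Word 'offer') = P(Word 'offer' | Spam) × P(Spam) + P(Word 'offer' | Not Spam) × P(Not Spam)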

• Substituting the values
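P(Word 'offer') = 0.90 × 0.20 + 0.05 × 0.80 = 0.18 + 0.04 = 0.22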

• Now we can calculate P(Spam | Word 'offer'):
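P(Spam | Word 'offer') = (0.90 × 0.20) / 0.22 = 0.18 / 0.22 ≈ 0.818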


The probability that an email is spam, given that it contains the word "offer", is about 81.8%.
Example 2:
• Defective Product Detection

• A factory produces 5% defective items. A quality control inspection detects


defective items 98% of the time. However, the inspection also incorrectly classifies
3% of non-defective items as defective.

• What is the probability that an item is actually defective if it is classified as


defective during inspection?

• Solution

• We need to find the probability that an item is actually defective given that it has
been classified as defective by the inspection system, denoted as P(D ∣ T)
Example 2:
• D = the event that an item is defective.

• T = the event that an item is classified as defective by the


inspection system.
Example 2:
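Carrying out the Bayes' theorem computation with the numbers stated above:

P(D) = 0.05, P(T | D) = 0.98, P(T | not D) = 0.03

P(T) = P(T | D) × P(D) + P(T | not D) × P(not D) = 0.98 × 0.05 + 0.03 × 0.95 = 0.049 + 0.0285 = 0.0775

P(D | T) = (0.98 × 0.05) / 0.0775 = 0.049 / 0.0775 ≈ 0.632

So an item classified as defective is actually defective with a probability of about 63.2%.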
Example 3:
• A weather forecasting model predicts rain with 90% accuracy when it
actually rains. When it doesn’t rain, the model predicts rain 20% of the
time. The probability of rain on any given day is 30%.

• If the model predicts rain, what is the probability that it will actually
rain?
Example 3:
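Carrying out the computation with the stated numbers:

P(Rain) = 0.30, P(Predict Rain | Rain) = 0.90, P(Predict Rain | No Rain) = 0.20

P(Predict Rain) = 0.90 × 0.30 + 0.20 × 0.70 = 0.27 + 0.14 = 0.41

P(Rain | Predict Rain) = 0.27 / 0.41 ≈ 0.659

So when the model predicts rain, it actually rains with a probability of about 65.9%.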
Bayes’ Theorem : Maximum A Posteriori (MAP) hypothesis
• We will assume that P(h) is the initial probability of a hypothesis ‘h’
called the prior probability P(h).

• P(T) is the prior probability that the training data will be observed.

• We will denote P(T|h) as the probability of observing data T in a


space where ‘h’ holds true.

• We are interested in finding out P(h|T), i.e. the probability that the hypothesis holds true given the observed training data T. This is called the posterior probability.

• According to Bayes’ theorem


Bayes’ Theorem
• According to Bayes’ theorem

• From the above equation, we can deduce that P(h|T) increases as P(h) and P(T|h) increase, and also as P(T) decreases.

• Our goal is to find the most probable hypothesis h from a set of hypotheses H (h∈H), given the observed training data T.
Bayes’ Theorem
• This maximally probable hypothesis is called the maximum a
posteriori (MAP) hypothesis.

• By using Bayes’ theorem, we can identify the MAP hypothesis from


the posterior probability of each candidate hypothesis:
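h_MAP = argmax (h ∈ H) P(h | T) = argmax (h ∈ H) [P(T | h) × P(h)] / P(T)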
Bayes’ Theorem
• and as P(T) is a constant independent of h, in this case, we can
write
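h_MAP = argmax (h ∈ H) P(T | h) × P(h)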

• Conceptually, Bayes’ theorem relates the Prior, Posterior and Likelihood as: Posterior ∝ Likelihood × Prior.
Applications of Naïve Bayes classifier
• Text classification:

• Naïve Bayes classifier is among the most successful known algorithms for
learning to classify text documents.

• In text classification, the features are typically the individual words (or tokens)
present in the document.

• A common way to represent text is using the Bag of Words (BoW) model,

• where:

• The text is converted into a vector of word frequencies or word counts.


Applications of Naïve Bayes classifier
• Each word in the document becomes a feature, and its value is the
number of times it appears in the document.

• For instance, in the sentence "I love machine learning," the words "I,"
"love," "machine," and "learning" are the features.

• Naive Bayes assumes that all features (words) are conditionally


independent of each other given the class label.

• This means that the presence of one word in the document does not affect
the presence of another word, which simplifies the computation of the
likelihood:
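P(w1, w2, …, wn | Class) = P(w1 | Class) × P(w2 | Class) × … × P(wn | Class)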
Applications of Naïve Bayes classifier
• During training, the Naive Bayes classifier calculates the following:

• The prior probability P(Class): The probability of each class in the training
data.

• The likelihood P(Word∣Class): The probability of each word appearing in


documents of a particular class.

• Classification

• For a new document, the Naive Bayes classifier computes the posterior
probability for each class using the trained priors and likelihoods.

• The class with the highest posterior probability is chosen as the predicted class.
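A minimal sketch of this workflow in Python, assuming scikit-learn is available (CountVectorizer for the Bag of Words step, MultinomialNB as the classifier); the toy documents and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training corpus (hypothetical labels: 1 = spam, 0 = not spam)
docs = ["win a free offer now", "meeting agenda for monday",
        "limited offer click here", "project report attached"]
labels = [1, 0, 1, 0]

# Bag of Words: each word becomes a feature, valued by its count
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Training estimates the priors P(Class) and likelihoods P(Word | Class)
model = MultinomialNB()
model.fit(X, labels)

# Classification: the posterior is computed for each class, highest wins
new_doc = vectorizer.transform(["free offer for you"])
print(model.predict(new_doc))        # predicted class label
print(model.predict_proba(new_doc))  # posterior probabilities per class
```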
Applications of Naïve Bayes classifier
• Spam filtering:

• Spam filtering is the best known use of Naïve Bayesian text


classification.

• Presently, almost all email providers have this as a built-in functionality, which makes use of a Naïve Bayes classifier to identify spam email on the basis of certain conditions and the probability of classifying an email as ‘Spam’.

• Server-side email filters such as SpamAssassin, SpamBayes, etc. make use of Bayesian spam filtering techniques.
Applications of Naïve Bayes classifier
• Hybrid Recommender System:

• It uses Naïve Bayes classifier and collaborative filtering.

• Recommender systems (used by e-retailors like eBay, Alibaba, Target,


Flipkart, etc.) apply machine learning and data mining techniques for
filtering unseen information and can predict whether a user would like a
given resource.

• One of the algorithms is combining a Naïve Bayes classification approach


with collaborative filtering, and experimental results show that this
algorithm provides better performance regarding accuracy and coverage
than other algorithms.
Applications of Naïve Bayes classifier
• Online Sentiment Analysis:

• In the case of sentiment analysis, let us assume there are three


sentiments such as nice, nasty, or neutral, and Naïve Bayes classifier is
used to distinguish between them.

• Simple emotion modelling combines a statistically based classifier with a


dynamical model.

• It allocates user utterances into nice, nasty, and neutral classes, labelled
as +1, −1, and 0, respectively.
Advantages of Naïve Bayes classifier
1. Simplicity and Ease of Implementation: The algorithm is
straightforward to implement and easy to understand.
2. Speed and Efficiency: It is very fast in terms of both training and
prediction, making it suitable for real-time applications.
3. Scalability: Naïve Bayes scales linearly with the number of features and
data points.

4. Low Memory Requirement: It requires a small amount of training data


to estimate the parameters necessary for classification.
Advantages of Naïve Bayes classifier
5. Works Well with High-Dimensional Data: Particularly
effective for text classification where the feature space is very
high-dimensional.
6. Multiclass Classification: Naturally supports multi-class
classification problems.
Disadvantages of Naïve Bayes Classifier
1. Independence Assumption: The assumption that features are
independent is rarely true in real-world applications, which can lead to
suboptimal performance.
2. Zero Probability Problem: If a feature category is not present in the
training set, it will assign a zero probability to it, making it impossible for
the model to predict a class that depends on that feature.
• This can be mitigated with techniques like Laplace smoothing.
Disadvantages of Naïve Bayes Classifier
3. Limited Expressiveness: Naïve Bayes is less expressive
compared to more complex models like decision trees, random
forests, and neural networks.
It might not capture complex relationships between features.
4. Sensitivity to Data Quality: Performance can be
significantly affected by noisy or imbalanced data.
Supervised Learning: Classification
• As we have seen, the Naïve Bayes algorithm is a very simple but powerful classifier based on Bayes’ theorem of conditional probability.

• However, other than the Naïve Bayes classifier, there are more
algorithms for classification.

• The first algorithm is k-Nearest Neighbour (kNN), which tries to classify


unlabelled data instances based on the similarity with the labelled
instances in the training data.

• Then, another critical classifier, named as decision tree, is a classifier


based on a series of logical decisions, which resembles a tree with
branches.
Supervised Learning: Classification
• Next, the random forest classifier, which, in very simplistic terms, can be
thought as a collection of many decision trees.

• Finally, a very powerful and popular classifier named Support Vector


Machine (SVM) will be explored.
Supervised Learning: Classification
• Classification is a type of supervised learning in machine learning and
artificial intelligence where the goal is to categorize input data into
predefined labels or classes.

• The process involves training a model using labeled training data so


that it can predict the class labels of new, unseen data.

• In supervised learning, the labelled training data provide the basis for
learning.

• Supervised learning, in which a machine learns from labelled training data, can be related to a teacher supervising the learning process.
Supervised Learning: Classification
• Training data is the past information with a known value of the class or ‘label’.

• The training data is labelled in the case of supervised learning; there is no labelled training data for unsupervised learning; and in the case of semi-supervised learning, a small amount of labelled data along with unlabelled data is used for training, as shown in Figure below.
Supervised Learning: Classification
• Some examples of supervised learning are as follows:

1. Prediction of results of a game based on the past analysis of results.

2. Predicting whether a tumour is malignant(positive) or benign(negative)


on the basis of the analysis of data.

3. Price prediction in domains such as real estate, stocks, etc.


CLASSIFICATION MODEL
• Consider two examples, say ‘predicting whether a tumour is malignant or
benign’ and ‘price prediction in the domain of real estate’.

• Are these two problems same in nature ?

• It is true that both of them are problems related to prediction.

• However, for tumour prediction, we are trying to predict which category or


class, i.e. ‘malignant’ or ‘benign’, an unknown input data belongs to.

• In the other case, that is, for price prediction, we are trying to predict an
absolute value and not a class.
CLASSIFICATION MODEL
• When we are trying to predict a categorical or nominal variable, the
problem is known as a classification problem.

• Whereas when we are trying to predict a numerical variable such as


‘price’, ‘weight’, etc. the problem falls under the category of regression.

• In a classification problem, the goal is to assign a label or category or class to test data on the basis of the label or category or class information that is imparted by the training data.

• We call this type of problem a classification problem.


CLASSIFICATION MODEL
• Figure below shows the typical process of classification where a
classification model is obtained from the labelled training data by a
classifier algorithm.
• On the basis of the
model, a class label
e.g. ‘Intel’ is assigned
to the test data.
CLASSIFICATION MODEL
• Classification is a type of supervised learning where a target feature,
which is of categorical type, is predicted for test data on the basis of the
information imparted by the training data.

• The target categorical feature is known as class.

• Some typical classification problems include the following:

1. Image classification.

2. Disease prediction.

3. Win–loss prediction of games.

4. Prediction of natural calamity such as earthquake, flood, etc.

5. Handwriting recognition, etc.


CLASSIFICATION LEARNING STEPS
• Following Figure depicts
the classification
learning steps
categorized into 7 steps.
CLASSIFICATION LEARNING STEPS
1. Problem Identification: Identifying the problem is the first step in the supervised
learning model.

• The problem needs to be a well-formed problem.

2. Identification of Required Data: On the basis of the problem identified above, the
required data set that precisely represents the identified problem needs to be
identified/evaluated.

3. Data Pre-processing: This is related to the cleaning/transforming the data set.


This step ensures that all the unnecessary/irrelevant data elements are removed.

• This step ensures that the data is ready to be fed into the machine learning
algorithm.
CLASSIFICATION LEARNING STEPS
4. Definition of Training Data Set: Before starting the analysis, the user
should decide what kind of data set is to be used as a training set.

• Thus, a set of data inputs (X) and corresponding outputs (Y) is gathered either from human experts or from experiments/analysis.

5. Algorithm Selection: This involves determining the structure of the


learning function and the corresponding learning algorithm.

• On the basis of various parameters, the best algorithm for a given problem
is chosen.
CLASSIFICATION LEARNING STEPS
6. Training: The learning algorithm identified in the previous step is run on the
gathered training set for further fine tuning.

• Some supervised learning algorithms require the user to determine specific


control parameters.

• These parameters may also be adjusted by optimizing performance of the


training set.

7. Evaluation with the Test Data Set: The trained model is run on the test data set, and its performance is measured.

• If a suitable result is not obtained, further training of parameters may be


required.
COMMON CLASSIFICATION ALGORITHMS
• Following are the most common classification algorithms:

1. k-Nearest Neighbour (kNN)

2. Decision tree

3. Random forest

4. Support Vector Machine (SVM)

5. Naïve Bayes classifier


k-Nearest Neighbour (kNN)
• The kNN algorithm is a simple but extremely powerful classification
algorithm.

• The name of the algorithm originates from the underlying philosophy of


kNN – i.e. people having similar background or mindset tend to stay close
to each other. In other words, neighbours in a locality have a similar
background.

• As a part of the kNN algorithm, the unknown and unlabelled data which
comes for a prediction problem is judged on the basis of the training data
set elements which are similar to the unknown element.
k-Nearest Neighbour (kNN)
• Hence, the class label of the unknown element is assigned on the basis of
the class labels of the similar training data set elements.

• How kNN works ?

• Consider a very simple Student data set as depicted in Figure below:

• Each of the students has been assigned a score on a scale of 10 on two


performance parameters–‘Aptitude’ and ‘Communication’.

• Class value is assigned to each student based on the following criteria:


k-Nearest Neighbour (kNN)
1. Students having good communication
skills as well as a good level of
aptitude have been classified as
‘Leader’

2. Students having good communication


skills but not so good level of aptitude
have been classified as ‘Speaker’

3. Students having not so good


communication skill but a good level
of aptitude have been classified as
‘Intel’
k-Nearest Neighbour (kNN)
• To build a classification model,
a part of the labelled input
data is retained as test data.

• The remaining portion of the


input data is used to train the
model – hence known as
training data as shown in
Figure.
k-Nearest Neighbour (kNN)
• In the kNN algorithm, the class label of the test data elements is decided by the
class label of the training data elements which are neighbouring, i.e. similar in
nature.

• There are two challenges:

1. What is the basis of this similarity, or when can we say that two data elements are similar?

2. How many similar elements should be considered for deciding the class label of each test data element?

Ans (to the first question): Euclidean distance.


k-Nearest Neighbour (kNN)
• Considering a very simple data set having two features (say f1 and
f2), Euclidean distance between two data elements d1 and d2 can be
measured by :
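distance(d1, d2) = √[(f11 − f12)² + (f21 − f22)²]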

f11 = value of feature f1 for data element d1

f12 = value of feature f1 for data element d2


f21 = value of feature f2 for data element d1
f22 = value of feature f2 for data element d2
k-Nearest Neighbour (kNN)
• The training data points of the Student data
set considering only the features ‘Aptitude’
and ‘Communication’ can be represented as
dots in a two- dimensional feature space.

• As shown in the figure, the training data


points having the same class value are
coming close to each other.

• The test data point for student Josh is


represented as an asterisk in the same
space.
k-Nearest Neighbour (kNN)
• To find out the closest or nearest
neighbours of the test data point,
Euclidean distance of the different
dots need to be calculated from the
asterisk.
• Then, the class value of the closest
neighbours helps in assigning the
class value of the test data element.
k-Nearest Neighbour (kNN)
• How many similar elements should be
considered for deciding the class label of each
test data element?

• The answer lies in the value of ‘k’ which is a


user-defined parameter given as an input to
the algorithm.

• In the kNN algorithm, the value of ‘k’


indicates the number of neighbours that need
to be considered.

• We want to see what class value kNN will assign to the test data point for the student Josh.
k-Nearest Neighbour (kNN)
• How to decide the value of k ?

• It is often a tricky decision to decide the value of k.

• The reasons are as follows:

• 1. A small value of K (e.g., K=1) makes the algorithm highly sensitive to


noise, as it only considers the closest data point. This can lead to
overfitting.

• 2. A larger K value (e.g., K=20) smooths out the decision boundary,


making it less sensitive to noise. However, too large a K may lead to
underfitting.
k-Nearest Neighbour (kNN)
• Therefore, the best k value is somewhere between these two extremes.

• What are the best possible strategies to arrive at a value of k ?

• 1. One common practice is to set k equal to the square root of the number of

training records.

• 2. An alternative approach is to test several k values on a variety of test data

sets and choose the one that delivers the best performance.

• 3. Choose a larger value of k, but apply a weighted voting process in which the

vote of close neighbours is considered more influential than the vote of distant

neighbours.
k-Nearest Neighbour (kNN) Algorithm
1. Input:
• Training data set: A labeled set of data points used to "train" the model.
• Test data set: A new, unlabeled set of data points for which we want to
predict the class label.

• Value of K: The number of nearest neighbors to consider when


classifying the test point.

2. Steps:

• For all test data points:

• Calculate the distance (commonly Euclidean distance) between the test


data point and every data point in the training set.

• Find the K nearest training data points (i.e., the K training points
that have the smallest distance from the test point).
k-Nearest Neighbour (kNN) Algorithm

• If K = 1:

• Assign the class label of the nearest training data point


to the test data point.

• Else (if K > 1):

• Assign the class label that is most frequent (majority


voting) among the K nearest neighbors to the test data
point.

3. End:

• This process repeats for each test data point.
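A minimal sketch of these steps in plain Python; the helper names euclidean and knn_predict are illustrative, not from the slides, and the data is assumed to be numeric feature vectors:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Distance between two feature vectors of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, test_point, k):
    # Step 1: distance from the test point to every training point
    distances = [(euclidean(x, test_point), label)
                 for x, label in zip(train_X, train_y)]
    # Step 2: keep the k nearest neighbours
    neighbours = sorted(distances, key=lambda d: d[0])[:k]
    # Step 3: majority vote among their class labels (k = 1 reduces to the nearest label)
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```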


Example on KNN
• Considering only two features, Sepal Length and Sepal Width, from the simplified Iris dataset, classify the test data point (Sepal Length = 5.0, Sepal Width = 3.4): to which species class label does it belong?

Sepal Length | Sepal Width | Species
5.1 | 3.5 | Setosa
4.9 | 3.0 | Setosa
7.0 | 3.2 | Versicolor
6.4 | 3.2 | Versicolor
5.5 | 2.3 | Versicolor
5.0 | 3.4 | ??
Example on KNN
• Solution :

1. Calculate the Euclidean distance between the test data point and each of the
training data points.

• The formula for Euclidean distance with two features is:

• where x1, y1 represent the Sepal Length and Sepal Width of the test point, and
x2, y2 represent the Sepal Length and Sepal Width of the training data points.

2. Using the above formula, calculate the distances:

• Distance to (5.1, 3.5) (Setosa): √((5.0 − 5.1)² + (3.4 − 3.5)²) = √0.02 ≈ 0.14

Example on KNN
• Distance to (4.9, 3.0) (Setosa): √((5.0 − 4.9)² + (3.4 − 3.0)²) = √0.17 ≈ 0.41

• Distance to (7.0, 3.2) (Versicolor): √((5.0 − 7.0)² + (3.4 − 3.2)²) = √4.04 ≈ 2.01

• Distance to (6.4, 3.2) (Versicolor): √((5.0 − 6.4)² + (3.4 − 3.2)²) = √2.00 ≈ 1.41

• Distance to (5.5, 2.3) (Versicolor): √((5.0 − 5.5)² + (3.4 − 2.3)²) = √1.46 ≈ 1.21


Example on KNN
3. Pick K nearest neighbors:

• Let’s assume K = 3. The three nearest neighbors to the test point (5.0, 3.4) are (5.1, 3.5) and (4.9, 3.0), both Setosa, and (5.5, 2.3), Versicolor.

4. Majority voting:

• Out of the three nearest neighbors, two belong to the species Setosa and one

belongs to Versicolor.

• Therefore, based on majority voting, we assign the test data point to the Setosa

class.
Example on KNN
• Thus, the test data point with Sepal Length = 5.0 and Sepal Width = 3.4 would

be classified as Setosa using KNN with K = 3.

• Why the kNN algorithm is called a lazy learner ?

• The k-Nearest Neighbors (kNN) algorithm is called a lazy learner because it

delays the learning process until a query is made during prediction.

• Unlike other algorithms, kNN does not build a model in the training phase.

Instead, it stores the entire training dataset and performs computations (like

calculating distances between points) only when making predictions.


k-Nearest Neighbour (kNN)
❖ Strengths of the kNN algorithm
1. Extremely simple algorithm – easy to understand
2. Very effective in certain situations, e.g. for recommender system design; it is very fast, as almost no time is required for the training phase.
❖ Weaknesses of the kNN algorithm
1. Does not learn anything in the real sense. Classification is done completely on
the basis of the training data and the classification process is very slow.
2. If the training data does not represent the problem domain comprehensively,
the algorithm fails to make an effective classification.
3. A large amount of computational space is required to load the training data for
classification.
k-Nearest Neighbour (kNN)

• Application of the kNN algorithm

1. The kNN algorithm is widely adopted in recommender systems.

• Recommender systems recommend users different items which are similar to a

particular item that the user seems to like.

• The liking pattern may be revealed from past purchases or browsing history

and the similar items are identified using the kNN algorithm.

2. Another area where there is widespread adoption of kNN is searching

documents/contents similar to a given document/content. This is a core area

under information retrieval and is known as concept search.


Decision Tree
• Decision Tree is a supervised learning technique that can be used
for both classification and Regression problems, but mostly it is
preferred for solving Classification problems.

• It is a tree-structured classifier, where internal nodes represent the


features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.

• In a Decision tree, there are two nodes, which are the Decision Node
and Leaf Node.
Decision Tree
• Decision nodes are used to make any decision

and have multiple branches, whereas Leaf

nodes are the output of those decisions and do

not contain any further branches.

• It is called a decision tree because, similar

to a tree, it starts with the root node, which

expands on further branches and constructs a

tree-like structure.

• In order to build a tree, we use the CART

(Classification and Regression Tree)algorithm.


Decision Tree
• A decision tree simply asks a question, and based on the answer (Yes/No), it
further split the tree into subtrees.
• How Decision Tree Works ?

• The decision tree works on the basis of ID3(Iterative Dichotomiser 3).

• The ID3 algorithm is specifically designed for building decision trees from a
given dataset.

• Its primary objective is to construct a tree that best explains the relationship
between attributes in the data and their corresponding class labels.

• The following steps are followed during the working of decision tree :
Decision Tree
1. Selecting the Best Attribute
❑ ID3 employs the concept of entropy and information gain to
determine the attribute that best separates the data.
❑ Entropy measures the impurity or randomness in the dataset.
❑ The algorithm calculates the entropy of each attribute and selects
the one that results in the most significant information gain when
used for splitting the data.
Decision Tree
2. Creating Tree Nodes
• The chosen attribute is used to split the dataset into subsets based on its distinct
values.

• For each subset, ID3 recurses to find the next best attribute to further partition
the data, forming branches and new nodes accordingly.
3. Stopping Criteria

• The recursion continues until one of the stopping criteria is met, such as when all
instances in a branch belong to the same class or when all attributes have been
used for splitting.
4. Handling Missing Values

• ID3 can handle missing attribute values by employing various strategies like mean/mode substitution for the attribute or using majority class values.
Decision Tree
5. Tree Pruning

❑ Pruning is a technique to prevent overfitting. While not directly included in ID3, post-processing techniques or variations of ID3 incorporate pruning to improve the tree’s generalization.

❑ Note : Tree pruning is a technique used in decision trees to reduce their


size and complexity by removing parts of the tree that are not necessary
or that contribute to overfitting.
Decision Tree
• Formulas linked to the main theoretical ideas in the ID3 algorithm :

1. Entropy

• A measure of disorder or uncertainty in a set of data is called entropy.

• Entropy is a tool used in ID3 to measure a dataset’s disorder or impurity.

• By dividing the data into as homogenous subsets as feasible, the objective


is to minimize entropy.
• For a set S with classes {c1, c2, …, cn}, the entropy is calculated as:
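Entropy(S) = − [P(c1) log2 P(c1) + P(c2) log2 P(c2) + … + P(cn) log2 P(cn)]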
Decision Tree
2. Information Gain
• A measure of how well a certain attribute reduces uncertainty is called Information Gain.
• ID3 splits the data at each stage, choosing the property that maximizes
Information Gain.
• It is computed as the difference between the entropy before and after the split.
• Information Gain measures the effectiveness of an attribute A in reducing
uncertainty in set S
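Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) × Entropy(Sv), summed over each value v of attribute A, where Sv is the subset of S for which A takes value v.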
Example on Decision tree
Decision Tree-Based ID3 Algorithm
• The ID3 algorithm (Iterative Dichotomiser 3) is a popular decision tree
algorithm.

• It constructs a decision tree by recursively splitting the dataset based on


the attribute that provides the maximum information gain.

• Steps for the ID3 Algorithm:

1. Calculate Entropy: Entropy measures the impurity of the dataset. ID3


uses entropy to measure the information gain for different attributes.

2. Choose the Best Attribute: The attribute that provides the most
information gain is chosen for splitting.
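A minimal sketch of these two calculations in Python, assuming each training row is a dict of attribute values with a separate list of class labels; the function names are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Gain(S, A) = Entropy(S) - weighted entropy of the subsets split on A
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum((len(sub) / total) * entropy(sub) for sub in subsets.values())
    return entropy(labels) - weighted

# ID3 step 2: pick the attribute with the highest information gain
# best = max(attributes, key=lambda a: information_gain(rows, labels, a))
```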
Advantages of using Decision tree
1. Easy to Understand and Interpret: Decision trees are easy to visualize and
understand. They can be easily explained to non-technical stakeholders.
2. Requires Little Data Preparation: Decision trees do not require normalization
of data or scaling of variables.
3. Handles Both Numerical and Categorical Data: Decision trees can handle both
types of data.

4. Non-Linear Relationships: Capable of capturing non-linear relationships


between features.

5. Feature Selection: Inherently performs feature selection by choosing the most


important features to split on.
6. Handling Missing Values: Can handle missing values in the dataset.
Dis-advantages of using Decision tree
1. Overfitting: Decision trees are prone to overfitting, especially with noisy data
or when the tree is too deep.
2. Instability: Small variations in the data can result in a completely different
tree being generated.
3. Bias: Can be biased if some classes dominate.

4. Complexity: Can become complex and less interpretable if the tree is too deep.

5. Non-optimal Decisions: Greedy algorithms used to build decision trees (like


ID3, C4.5, CART) do not guarantee the global optimal tree.
Decision tree Real-time Applications
1. Medical Diagnosis: Used to determine the probability of a patient having a
particular disease based on various symptoms and test results.
2. Customer Relationship Management (CRM): Helps in predicting customer
churn, segmenting customers, and targeting the right audience for marketing
campaigns.
3. Credit Scoring: Used by financial institutions to predict the creditworthiness of
applicants.

4. Fraud Detection: Identifies potentially fraudulent transactions in real-time by


examining transaction attributes.
5. Recommendation Systems: Assists in recommending products or services to
customers based on their past behavior and preferences.
Strengths of decision tree
1. It produces very simple understandable rules.

2. For smaller trees, not much mathematical and computational


knowledge is required to understand this model.

3. Works well for most of the problems.

4. It can handle both numerical and categorical variables.

5. Can work well both with small and large training data sets.
Weaknesses of decision tree
1. Decision tree models are often biased towards features having a larger number of possible values, i.e. levels.
2. This model gets overfitted or underfitted quite easily.
3. Decision trees are prone to errors in classification problems with many
classes and relatively small number of training examples.
4. A decision tree can be computationally expensive to train.
5. Large trees are complex to understand.
Random forest model
• Random forest is an ensemble classifier, i.e. a combining classifier that
uses and combines many decision tree classifiers.

• Ensembling is usually done using the concept of bagging with different


feature sets.

• Why use Random Forest?

1. It takes less training time as compared to other algorithms.

2. It predicts output with high accuracy, even for the large dataset it runs
efficiently.

3. It can also maintain accuracy when a large proportion of data is missing.


Random forest model
• The reason for using a large number of trees in a random forest is to train the trees enough such that each feature contributes to a number of models.

• After the random forest is generated by


combining the trees, majority vote is
applied to combine the output of the
different trees.
• A simplified random forest model is depicted in Figure.

• The result from the ensemble model is usually better than that from the individual decision tree.
Random forest model
• The random forest algorithm works as follows:

1. If there are N variables or features in the input data set, select


a subset of ‘m’ (m < N) features at random out of the N features.

2. Use the best split principle on these ‘m’ features to calculate the
number of nodes ‘d’.

3. Keep splitting the nodes to child nodes till the tree is grown to
the maximum possible extent.
Random forest model
4. Final class assignment is done on the basis of the majority votes from
the ‘n’ trees.
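A minimal sketch using scikit-learn's RandomForestClassifier (the dataset and hyperparameter values are illustrative; oob_score=True reports the out-of-bag estimate discussed next):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators = number of trees; max_features = size of the random feature subset 'm'
model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               oob_score=True, random_state=42)
model.fit(X, y)

print(model.oob_score_)            # out-of-bag accuracy estimate
print(model.feature_importances_)  # which features matter most overall
print(model.predict(X[:3]))        # majority-vote predictions from the trees
```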
• Out-of-bag (OOB) error in random forest
• In random forests, we have seen, that each tree is constructed using a
different bootstrap sample from the original data.
• The samples left out of the bootstrap and not used in the construction of
the i-th tree can be used to measure the performance of the model.
• At the end of the run, predictions for each such sample evaluated each
time are tallied, and the final prediction for that sample is obtained by
taking a vote.
Random forest model
• The total error rate of predictions for such samples is termed as out-
of-bag (OOB) error rate.

• Strengths of random forest

1. It runs efficiently on large and expansive data sets.

2. It has a robust method for estimating missing data and maintains


precision when a large proportion of the data is absent.

3. It has powerful techniques for balancing errors in a class


population of unbalanced data sets.
Random forest model
4. It gives estimates (or assessments) about which features are the
most important ones in the overall classification.

5. Generated forests can be saved for future use on other data.

6. Lastly, the random forest algorithm can be used to solve both


classification and regression problems.
Random forest model
• Weaknesses of random forest

1. This model is not as easy to understand as a decision tree model, because it combines a number of decision tree models.

2. It is computationally much more expensive than a simple model like


decision tree.
Support vector machines
• SVM is a model, which can do linear classification as well as
regression.

• SVM is based on the concept of a surface, called a hyperplane, which


draws a boundary between data instances plotted in the multi-
dimensional feature space.

• The output prediction of an SVM is one of two conceivable classes


which are already defined in the training data.
Support vector machines
• Classification using hyperplanes

• In SVM, a model is built to discriminate


the data instances belonging to different
classes.

• In a two-dimensional space, the data instances belonging to different classes fall on different sides of a straight line drawn in that space, as depicted in Figure.
Support vector machines
• The goal of the SVM analysis is to find a plane, or a hyperplane, which
separates the instances on the basis of their classes.

• New examples (i.e. new instances) are then mapped into that same space
and predicted to which class the new instance belongs.

• In the overall training process, the SVM algorithm analyses input data
and identifies a surface in the multi-dimensional feature space called the
hyperplane.

• There may be many possible hyperplanes, and one of the challenges with
the SVM model is to find the optimal hyperplane.
Support vector machines
• Support Vectors in the SVM

• Support vectors are the data points (representing the classes), the critical components in a data set, which lie nearest to the identified hyperplane.

• Hyperplane and Margin: For an N-dimensional feature space, hyperplane is a


flat subspace of dimension (N−1) that separates and classifies a set of data.

• For example, if we consider a two-dimensional feature space, a hyperplane will


be a one-dimensional subspace or a straight line.

• For a three-dimensional feature space (data set having three features and a
class variable), hyperplane is a two-dimensional subspace or a simple plane.
Support vector machines
• Mathematically, in a two-dimensional space, a hyperplane can be defined by a linear equation in the two features, which is nothing but the equation of a straight line.

• For an N-dimensional space, a hyperplane can be defined by a linear equation over all N features.

• The distance between the hyperplane and the nearest data points is known as the margin.
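Written out, the standard forms are:

In two dimensions: w0 + w1·x1 + w2·x2 = 0 (a straight line)

In N dimensions: w0 + w1·x1 + … + wN·xN = 0, written compactly as w · x + b = 0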
Identifying the correct hyperplane in SVM
• Let us examine a few examples to identify
which hyperplanes will result in the best
classification.
• Example 1 : As depicted in Figure, we have
three hyperplanes: A, B, and C.
• Now, we need to identify the correct hyperplane
which better segregates the two classes
represented by the triangles and circles.
• hyperplane ‘A’ has performed this task quite
well.
Identifying the correct hyperplane in SVM
• Example 2 : As depicted in Figure (a) and (b), we
have three hyperplanes: A, B, and C.

• We have to identify the correct hyperplane which


classifies the triangles and circles in the best
possible way.

• Maximizing the distances between the nearest


data points of both the classes will help us to
decide the correct hyperplane.

• This distance is called as margin.


• In Figure b, we can see that the margin for hyperplane A is high as compared to
those for both B and C. Hence, hyperplane A is the correct hyperplane.
Identifying the correct hyperplane in SVM
• Example 3: In this Example, as shown in
Figure, it is not possible to distinctly
segregate the two classes by using a straight
line, as one data instance belonging to one of
the classes (triangle) lies in the territory of the
other class (circle) as an outlier.

• SVM has a feature to ignore outliers and find


the hyperplane that has the maximum margin

• Hence, we can say that SVM is robust to


outliers
Identifying the correct hyperplane in SVM
• How to find out a way to identify a
hyperplane which maximizes the margin ?

• Finding the Maximum Margin Hyperplane


(MMH) is nothing but identifying the
hyperplane which has the largest
separation with the data instances of the
two classes.

• Support vectors, are observed in Figure


which are data instances from the two
classes that are closest to the MMH.
Identifying the MMH for linearly separable data
• Finding out the MMH is relatively
straightforward for the data that is linearly
separable.
• An outer boundary needs to be drawn for the
data instances belonging to the different
classes.
• These outer boundaries are known as convex
hull, as depicted in Figure.
• A hyperplane in the N-dimensional feature space can be represented by the equation:
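w · x + b = 0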
Identifying the MMH for linearly separable data
• Using this equation, the objective is to find a set of values for the weight vector such that two parallel hyperplanes can be drawn, separated as widely as possible.

• This is to ensure that all the data instances that belong to one class fall above one hyperplane and all the data instances belonging to the other class fall below the other hyperplane.
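In the standard formulation, the two hyperplanes are chosen so that:

w · x + b ≥ +1 for data instances of one class

w · x + b ≤ −1 for data instances of the other class

and the separation (margin) between these two hyperplanes, 2 / ||w||, is maximized.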
Identifying the MMH for non-linearly separable data
• For identifying MMH in non-linearly
separable data we have to use a slack
variable ξ, which provides some soft
margin for data instances in one class
that fall on the wrong side of the
hyperplane as shown in Figure.

• A cost value ‘C’ is imposed on all such


data instances that fall on the wrong
side of the hyperplane.
Identifying the MMH for non-linearly separable data
• The task of SVM is now to minimize the total cost due to such data instances.

• SVM has another technique, called the kernel trick, to deal with non-linearly separable data as shown in Figure.
Identifying the MMH for non-linearly separable data
• In the process, the kernel trick applies functions that convert linearly non-separable data into linearly separable data. These functions are called kernels.

• Some of the common kernel functions for transforming from a lower dimension
‘i’ to a higher dimension ‘j’ used by different SVM implementations are as
follows:
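Commonly used kernel functions (standard forms) include:

1. Linear kernel: K(x, y) = x · y

2. Polynomial kernel: K(x, y) = (x · y + c)^d

3. Gaussian (RBF) kernel: K(x, y) = exp(−||x − y||² / (2σ²))

4. Sigmoid kernel: K(x, y) = tanh(κ (x · y) + c)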
Strengths of SVM
1. SVM can be used for both classification and regression.

2. It is robust, i.e. not much impacted by data with noise or


outliers.

3. The prediction results using this model are very promising


Weaknesses of SVM
1. SVM is applicable only for binary classification, i.e. when there
are only two classes in the problem domain.

2. The SVM model is very complex – almost like a black box when it
deals with a high-dimensional data set. Hence, it is very difficult
and close to impossible to understand the model in such cases.

3. It is slow for a large dataset, i.e. a data set with either a large
number of features or a large number of instances
Application of SVM
1. SVM is most effective when it is used for binary classification,
i.e. for solving a machine learning problem with two classes.

2. One common problem on which SVM can be applied is in the


field of bioinformatics – more specifically, in detecting cancer
and other genetic disorders.

3. It can also be used in detecting the image of a face by binary


classification of images into face and non-face components.
Supervised Learning: Regression
• Here, we will build concepts on prediction of numerical variables – which

is another key area of supervised learning.

• This area, known as regression, focuses on solving problems such as

predicting value of real estate, demand forecast in retail, weather

forecast, etc.

• The most popular and simplest algorithm is simple linear regression. This

model roots from the statistical concept of fitting a straight line and the

least squares method.

• Next, we will also explore the concept of multiple linear regression.


Supervised Learning: Regression
• Then, we briefly discuss the other important algorithms in regression,
namely multivariate adaptive regression splines, logistic regression, and
maximum likelihood estimation.

• EXAMPLE OF REGRESSION

• A regression model is used to solve for example real estate price


prediction problem.

• In the context of regression, dependent variable (Y) is the one whose


value is to be predicted, e.g. the price quote of the real estate.

• In other words, the dependent variable depends on independent


variable(s) or predictor(s).
Supervised Learning: Regression
• Regression is essentially finding a relationship (or) association between the
dependent variable (Y) and the independent variable(s) (X), i.e. to find the function ‘f
’ for the association Y = f (X).

• COMMON REGRESSION ALGORITHMS

1. Simple linear regression

2. Multiple linear regression

3. Polynomial regression

4. Multivariate adaptive regression splines

5. Logistic regression

6. Maximum likelihood estimation (least squares)


1. Simple Linear Regression
• Simple linear regression is the
simplest regression model
which involves only one
predictor.

• This model assumes a linear


relationship between the
dependent variable and the
predictor variable as shown in
Figure.
1. Simple Linear Regression
• In a real estate problem, if we take Price of a Property as the dependent variable
and the Area of the Property (in sq. m.) as the predictor variable, we can build a
model using simple linear regression.

• Assuming a linear association, we can reformulate the model as :
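Price of Property = a + b × (Area of Property), i.e. Ŷ = a + bX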

• where ‘a’ and ‘b’ are intercept and slope of the straight line, respectively
1. Simple Linear Regression
• Recall, straight lines can be defined in a slope intercept form

• Y = (a + bX), where a = intercept and b = slope of the straight line.

• The value of intercept indicates the value of Y when X = 0.

• It is known as ‘the intercept or Y intercept’ because it specifies where the


straight line crosses the vertical or Y-axis.
1. Slope of the simple linear regression model
• Slope of a straight line represents how
much the line in a graph changes in
the vertical direction (Y-axis) over a
change in the horizontal direction (X-
axis) as shown in Figure.

Slope = Change in Y / Change in X


1. Slope of the simple linear regression model
• Example of slope

• Find the slope of the graph where the lower point on the line is
represented as (−3, −2) and the higher point on the line is
represented as (2, 2).
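Slope = Change in Y / Change in X = (2 − (−2)) / (2 − (−3)) = 4 / 5 = 0.8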
1. Slope of the simple linear regression model
• There can be two types of slopes in a linear regression model: positive
slope and negative slope.
• Different types of regression lines based on the type of slope include :
1. Linear positive slope
2. Curve linear positive slope
3. Linear negative slope
4. Curve linear negative slope
Types of regression lines
1. Linear positive slope

• A positive slope always moves upward on a


graph from left to right as shown in Figure.

Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta (Y) / Delta (X)

• Scenario 1 for positive slope: Delta (Y) is


positive and Delta (X) is positive

• Scenario 2 for positive slope: Delta (Y) is


negative and Delta (X) is negative
Types of regression lines
2. Curve linear positive slope
• Curves in these graphs slope upward from left to right.
• Slope = (Y2 − Y1 ) / (X2 − X1 )
• = Delta (Y) / Delta(X)
• Slope for a variable (X) may vary between two graphs, but it will always
be positive; hence, the above graphs are called as graphs with curve
linear positive slope.
Types of regression lines
3. Linear negative slope

• A negative slope always moves downward on a graph


from left to right.

• As X value (on X-axis) increases, Y value decreases.

• Slope = Rise/Run

= (Y2 − Y1 ) / (X2 − X1 )

= Delta (Y) / Delta(X)

• Scenario 1 for negative slope: Delta (Y) is positive


and Delta (X) is negative

• Scenario 2 for negative slope: Delta (Y) is negative and Delta (X) is positive

Types of regression lines
4. Curve linear negative slope

• Curves in these graphs slope downward from left to right.

• Slope = (Y2 − Y1 ) / (X2 − X1 )

• = Delta (Y) / Delta(X)

• Slope for a variable (X) may vary between two graphs, but it will always be negative;
hence, the above graphs are called as graphs with curve linear negative slope.
Types of regression lines

• No relationship graph

• Graph shown in Figure indicates ‘no


relationship’ curve as it is very difficult to
conclude whether the relationship between X
and Y is positive or negative.
Error in simple regression
• The regression equation model in machine learning uses the above slope–
intercept format in algorithms.
• X and Y values are provided to the machine, and it identifies the values of
a (intercept) and b (slope) by relating the values of X and Y.
• Identifying the exact match of values for a and b is not always possible.
• There will be some error value (ɛ) associated with it.
• This error is called marginal or residual error.
Example of Simple Linear Regression:
• A college professor believes that if the grade for internal examination is high
in a class, the grade for external examination will also be high.

• A random sample of 15 students in that class was selected, and the data is
given below:

• We need to predict the value of Y for any given X.
Example of Simple Linear Regression:
• A scatter plot is shown in Figure to explore the relationship between the
independent variable (internal marks) mapped to X-axis and dependent
variable (external marks) mapped to Y-axis.

• The line (i.e. the regression line)


does not predict the data exactly,
instead, it just cuts through the
data.
• Some predictions are lower than
expected, while some others are
higher than expected.
Example of Simple Linear Regression:
• As we know, in simple linear regression, the line is drawn using the
regression formula

• If we know the values of ‘a’ and ‘b’, then it is easy to predict the value of Y
for any given X.

• Ordinary Least Squares (OLS) is the technique used to estimate a line


that will minimize the error (ε), which is the difference between the
predicted and the actual values of Y.
Example of Simple Linear Regression:
• OLS minimizes the sum of the errors of each prediction or, more appropriately, the Sum of the Squares of the Errors (SSE).

• The slope b takes the formula :
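b = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σ (Xᵢ − X̄)²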

• The corresponding value of ‘a’ calculated using the above value of ‘b’ is :
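a = Ȳ − b × X̄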
Example of Simple Linear Regression:

• The extended version of the regression


graph is shown in Figure:
OLS algorithm for simple linear regression
Step 1: Calculate the mean of X and Y
Step 2: Calculate the errors of X and Y
Step 3: Get the product
Step 4: Get the summation of the products
Step 5: Square the difference of X
Step 6: Get the sum of the squared difference
Step 7: Divide output of step 4 by output of step 6 to calculate ‘b’
Step 8: Calculate ‘a’ using the value of ‘b
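The same steps as a minimal Python sketch, assuming X and Y are equal-length lists of numbers (the function name ols_fit is illustrative):

```python
def ols_fit(X, Y):
    # Steps 1-2: means of X and Y, and the deviations from those means
    mean_x = sum(X) / len(X)
    mean_y = sum(Y) / len(Y)
    dev_x = [x - mean_x for x in X]
    dev_y = [y - mean_y for y in Y]
    # Steps 3-4: product of the deviations and its summation
    sum_products = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    # Steps 5-6: squared deviations of X and their sum
    sum_squares = sum(dx ** 2 for dx in dev_x)
    # Step 7: slope b; Step 8: intercept a
    b = sum_products / sum_squares
    a = mean_y - b * mean_x
    return a, b
```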
Example of Simple Linear Regression:
1. Let us consider an example where five weeks’ sales data (in thousands) is given as shown in the table below.

2. Apply the Linear Regression technique to predict the 7th and 12th week sales.

Xi (Week) | Yi (Sales in Thousands)
1 | 1.2
2 | 1.8
3 | 2.6
4 | 3.2
5 | 3.8
Example of Simple Linear Regression:

• STEPS:

1. The Linear Regression equation is given by Y = a0 + a1*x + e,

• where a0 is the intercept, a1 is the slope, and e is the error term.
Example of Simple Linear Regression:
• The coefficients a1 (slope) and a0 (intercept) are calculated from the tabulated values of X and Y using the OLS formulas given earlier.
Example of Simple Linear Regression:
• Regression Equation is given by Y= a0+a1*x+e

• Final Equation and Result
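Working through the OLS steps with the five data points in the table:

Mean of X = 3, Mean of Y = 2.52

Σ (Xᵢ − X̄)(Yᵢ − Ȳ) = 2.64 + 0.72 + 0 + 0.68 + 2.56 = 6.6

Σ (Xᵢ − X̄)² = 4 + 1 + 0 + 1 + 4 = 10

a1 (slope) = 6.6 / 10 = 0.66 and a0 (intercept) = 2.52 − 0.66 × 3 = 0.54

Final equation: Y = 0.54 + 0.66 X

Predicted sales: Week 7 → 0.54 + 0.66 × 7 = 5.16 (thousand); Week 12 → 0.54 + 0.66 × 12 = 8.46 (thousand)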


2. Multiple Linear Regression
• In a multiple regression model, two or more independent variables, i.e.
predictors are involved in the model.

• From the real estate price prediction problem involving dependent variable
(property price) and independent variable (Area of property).

• However, location, floor, number of years since purchase, amenities available,


etc. are also important features (variables ) which should not be ignored.

• Thus, if we consider Price of a Property (in $) as the dependent variable and


Area of the Property (in sq. m.), location, floor, number of years since purchase
and amenities available as the independent variables.
2. Multiple Linear Regression
• We can form a multiple regression equation as shown below :

• The following expression describes the equation involving the relationship


with two predictor variables, namely X1 and X2 .
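Ŷ = a + b1·X1 + b2·X2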
2. Multiple Linear Regression
• The model describes a plane in the three-dimensional space of Ŷ, X1 ,
and X2.

• Parameter ‘a’ is the intercept of the plane. Parameters ‘b1’ and ‘b2’ are referred to as partial regression coefficients.

• Multiple regression for estimating equation when there are ‘n’ predictor
variables is as follows:
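Ŷ = a + b1·X1 + b2·X2 + … + bn·Xn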

• While finding the best fit line, we can fit either a polynomial or
curvilinear regression.
2. Multiple Linear Regression
• Consider an example where weekly sales data (in thousands) is given as shown in the table. Apply the Multiple Linear Regression technique to predict the weekly sales when X1 = 6 and X2 = 9.

X1 (Product 1 Sales) | X2 (Product 2 Sales) | Weekly Sales
1 | 4 | 1
2 | 5 | 6
3 | 8 | 8
4 | 2 | 12
2. Multiple Linear Regression
• The Multiple Linear Regression of two variables x1 and x2 is given as follows :

• In general, this is given for ‘n ‘independent variable as:

• Apply multiple Linear Regression for the values given in the table, where
weekly sales along with sales products x1 and x2 are provided.
2. Multiple Linear Regression
• Use the matrix approach for finding the Multiple Linear Regression coefficients: design the matrix X with a column of 1’s for the intercept.

• Here the matrices for Y and X are given as follow

• In the table above, X holds the independent variables and Y is the dependent variable.

• The coefficients of the Multiple Linear Regression model are then computed as shown below.
2. Multiple Linear Regression
• The regression coefficient for Multiple Linear Regression is calculated as :
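a = (XᵀX)⁻¹ XᵀY, where a is the vector of coefficients [a0, a1, a2]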

• Take the inverse of the matrix to obtain.


2. Multiple Linear Regression

• Hence, the values of a0, a1, and a2 are :


2. Multiple Linear Regression
• The Multiple Linear Regression Equation is

• Hence, the constructed model equation given below predicts the y value
given the variable x1 and x2.

• In the given problem x1 = 6 and x2 = 9 is given to predict the weekly


sales.

• The final value of Multiple Linear Regression = -1.69 + 3.48 * 6 - 0.05 * 9 = 18.74
Main Problems in Regression Analysis
• In multiple regressions, there are two primary problems: multicollinearity
and heteroskedasticity.

• 1. Multicollinearity

• Two variables are perfectly collinear if there is an exact linear


relationship between them.

• Multicollinearity is the situation in which the degree of correlation is not


only between the dependent variable and the independent variable, but
there is also a strong correlation within (among) the independent
variables themselves.
Main Problems in Regression Analysis
• A multiple regression equation can make good predictions when there is
multicollinearity, but it is difficult for us to determine how the dependent
variable will change.

• When multicollinearity is present, it increases the standard errors of the


coefficients and it becomes difficult to determine the unique contribution
of each variable to the outcome.

• One way to gauge multicollinearity is to calculate the Variance Inflation


Factor (VIF).
Main Problems in Regression Analysis
• If multicollinearity is high, We can try the following :

1. Removing highly correlated predictors

2. Using dimensionality reduction techniques like Principal


Component Analysis (PCA).

3. Using regularization methods (e.g., Ridge or Lasso regression)


Main Problems in Regression Analysis
• 2. Heteroskedasticity

• Heteroscedasticity occurs when the variance of errors (residuals) changes


across levels of an independent variable.

• This violates one of the assumptions of linear regression and can lead to
inefficient estimates, potentially making hypothesis tests invalid.

• The solution to this problem is :

• Use heteroscedasticity-robust standard errors.

• Transform the dependent variable (e.g., taking the log or square root).

• Try weighted least squares regression, which gives less weight to observations
with high variance.
Improving Accuracy of the Linear Regression Model
• To improve the accuracy of a linear regression model, consider the
following strategies:

• Feature Engineering:
• Include Relevant Features: Ensure that all important and relevant
variables are included in the model.

• Remove Irrelevant Features: Exclude variables that do not contribute


meaningfully to the prediction.

• Create Derived Features: Generate new features that better capture


relationships in the data, such as interaction terms or polynomial features.
Improving Accuracy of the Linear Regression Model
• 2. Feature Scaling:
• Normalize or standardize features to bring them to the same scale, especially for models where
feature magnitude affects predictions.

• 3. Handle Outliers:
• Detect and address outliers in the dataset, as they can disproportionately influence the
regression coefficients.

• 4. Address Multicollinearity:
• If independent variables are highly correlated, use techniques like Variance Inflation Factor
(VIF) to detect multicollinearity and drop or combine correlated features.
Improving Accuracy of the Linear Regression Model
• 5. Transform Variables:
• Use transformations (e.g., log, square root) to linearize relationships between features and
the target variable or stabilize variance.

• 6. Regularization:
• Apply Lasso (L1) or Ridge (L2) regression to penalize overly complex models and reduce the
likelihood of overfitting.

• 7. Cross-Validation:
• Use k-fold cross-validation to evaluate model performance on unseen data and ensure
robustness.

• 8. Improve Data Quality:


• Remove noise, fix missing data, and ensure accurate measurements.
