
BI Lecture-Mod 2

Probabilistic and graphical models use probability to represent observations given model parameters. Examples include Hidden Markov Models and belief networks. These models are fitted to data using maximum likelihood or Bayesian methods to estimate parameters. Graphical models have many applications in bioinformatics for modeling sequences, medical diagnosis with belief networks, and analyzing gene expression patterns. Information theory concepts like entropy, joint entropy, and mutual information can be applied in medical contexts to quantify uncertainty and information. Decision theory combined with probabilities and utilities allows optimal decisions under uncertainty, such as in medical diagnosis problems.

Uploaded by

Risheel Chheda

PROBABILISTIC AND GRAPHICAL MODELS

• By probabilistic models, we mean the models that express the probability of some
observations given a set of model parameters
• Such models are graphical when this probability can be represented as a graph.
• Examples of probabilistic graphical models are Hidden Markov Models (HMMs), used for
modelling protein families, and belief networks, used for the reconstruction of gene
networks from expression data.
• Once the probabilistic model has been set up, the goal is to find model parameters
matching the observed data.
• This can be achieved by maximum likelihood or maximum a posteriori estimation or by
Bayesian inference.
• In Bayesian inference, we use the data to update a prior probability distribution into a
posterior probability distribution over the parameters given the data.
• After the modelling criterion has been chosen, a variety of algorithms are available for
estimating the model, such as gradient descent, Expectation-Maximization etc.
Applications

• The application of graphical models in bioinformatics is broad.


• DNA, RNA, and protein sequences lend themselves to simple probabilistic modelling because of their sequential
structure.
• In medical informatics, belief networks provide a powerful tool for decision support in diagnosis.
• Another domain where probabilistic graphical models play an important role is statistical genomics.
• The patterns of expression of genes and proteins can be analysed with graphical models for
clustering and with belief networks.
INFORMATION THEORETIC METRICS

• Information theory deals with the study of the transmission, processing, extraction, and
utilization of information at a mathematical level.
• Many of the outcomes from the study of information theory have been reduced to
engineering practice in applications like Artificial Intelligence and Machine Learning.
• So, the question naturally arises: could it be used to inform the practice of medicine?
Information Theory in Medical Perspective

• Information theory studies situations where one agent (the transmitter) conveys a message over a
channel to another agent (the receiver).
• The transmitter does this by sending a series of partial messages.
• Each partial message can be thought of as resolving some of the receiver's uncertainty about
the content of the original message.
• The amount of uncertainty resolved by a partial message is its information content.
Quantities of Information

• Entropy:
• The entropy H(X) of a discrete random variable X is a measure of the amount of uncertainty
associated with the value of X when only its distribution is known.
• Asking how uncertain the outcome of an event is amounts to asking for the average
information content of the source of the event.
• Entropy can therefore be defined as a measure of the average information content per source
symbol.
Quantities of Information

• Joint Entropy:
• The joint entropy of two discrete random variables X and Y is the entropy of their
pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is
the sum of their individual entropies.
• For example, if (X, Y) represents the position of a chess piece—X the row and Y the
column, then the joint entropy of the row of the piece and the column of the piece will be
the entropy of the position of the piece.
Quantities of Information

• Conditional Entropy:
• Quantifies the amount of information needed to describe the outcome of a random
variable Y given that the value of another random variable X is known.
• The entropy of Y conditioned on X is written as H(Y|X).
• The conditional entropy of Y given X is defined as
$$H(Y \mid X) = -\sum_{x, y} p(x, y) \log p(y \mid x)$$
Quantities of Information

• Mutual Information:
• Mutual information measures the amount of information that can be obtained about one
random variable by observing another.
• It can be used to maximize the amount of information shared between sent and received
signals.
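The four quantities above can be computed directly from a joint probability table. The following is a minimal sketch, assuming base-2 logarithms (so results are in bits) and the standard identities H(Y|X) = H(X, Y) - H(X) and I(X; Y) = H(X) + H(Y) - H(X, Y). The function names and the dictionary layout are illustrative choices, not part of the lecture.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_quantities(joint):
    """joint[x][y] = p(X=x, Y=y). Returns the entropies and mutual information."""
    # Marginal distributions of X and Y.
    p_x = {x: sum(row.values()) for x, row in joint.items()}
    p_y = {}
    for row in joint.values():
        for y, p in row.items():
            p_y[y] = p_y.get(y, 0.0) + p
    h_x = entropy(p_x.values())
    h_y = entropy(p_y.values())
    h_xy = entropy(p for row in joint.values() for p in row.values())
    return {
        "H(X)": h_x,
        "H(Y)": h_y,
        "H(X,Y)": h_xy,
        "H(Y|X)": h_xy - h_x,        # chain rule
        "I(X;Y)": h_x + h_y - h_xy,  # information shared by X and Y
    }

# Chess-piece example: row X and column Y are uniform and independent.
chess = {r: {c: 1 / 64 for c in range(8)} for r in range(8)}
print(information_quantities(chess))
```

For the chess position, the row and the column each carry 3 bits, the joint entropy is their sum (6 bits), and the mutual information is zero because the two are independent.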
Properties of Information

• The more uncertain the message, the more information it carries.
• If the receiver already knows the message being transmitted, the amount of information carried is
zero.
• If I1 is the information carried by message m1 and I2 is the information carried by an independent message m2, then
the amount of information carried by m1 and m2 together is I1 + I2.

• If there are M = 2^N equally likely messages, then the amount of information carried by each
message is N bits.
DECISION SUPPORT VIA PROBABILITIES AND UTILITIES

• Decision theory when combined with probabilities and utilities allows us to make optimal
decisions in situations involving uncertainty.
• Suppose we have an input vector x together with a corresponding vector t of target
variables, and our goal is to predict t given a new value for x
• The joint probability distribution p (x, t) provides a complete summary of the uncertainty
associated with these variables.
• In a practical application, we need to make a prediction for the value of t, or at least understand the
values t is likely to take, and this is the subject of decision theory.
DECISION THEORY

• Consider a medical diagnosis problem in which we have taken an X-ray image of a patient,
and we wish to determine whether the patient has cancer or not.
• In this case, the input vector x is the set of pixel intensities in the image, and output variable t
will represent the presence of cancer, which we denote by the class C1, or the absence of cancer,
which we denote by the class C2.
• We can choose t to be a binary variable such that t = 0 corresponds to class C1 and t = 1
corresponds to class C2.
• The general inference problem then involves determining the joint distribution p (x, Ck), or
equivalently p (x, t).
DECISION THEORY

• In the end we must decide either to give treatment to the patient or not.
• This is the decision step, and it is the subject of decision theory to tell us how to make
optimal decisions given the appropriate probabilities.
• When we obtain the X-ray image x for a new patient, our goal is to decide which of the
two classes to assign to the image.
• We are interested in the probabilities of the two classes given the image, which are given
by p(Ck|x)

• Using Bayes’ theorem, these probabilities can be expressed in the form
$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)}$$
DECISION THEORY

• We can now interpret p(Ck) as the prior probability for the class Ck and p(Ck|x) as the
corresponding posterior probability.
• Thus p(C1) represents the probability that a person has cancer, before we take the X-ray
measurement.
• Similarly, p(C1|x) is the corresponding probability, revised using Bayes’ theorem
according to the information contained in the X-ray.
• If our aim is to minimize the chance of assigning x to the wrong class, then we would
choose the class having the higher posterior probability.
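The prior-to-posterior update described above can be sketched in a few lines. The numbers below are hypothetical and chosen only for illustration (a 1% prior for cancer and made-up likelihoods for one observed image); the function name `posterior` is likewise an assumption.

```python
def posterior(priors, likelihoods):
    """priors[k] = p(C_k); likelihoods[k] = p(x | C_k) for the observed x.
    Returns p(C_k | x) for every class via Bayes' theorem."""
    joint = {k: likelihoods[k] * priors[k] for k in priors}  # p(x, C_k)
    evidence = sum(joint.values())                           # p(x)
    return {k: v / evidence for k, v in joint.items()}

# Hypothetical values: C1 = cancer, C2 = no cancer.
priors = {"C1": 0.01, "C2": 0.99}
likelihoods = {"C1": 0.9, "C2": 0.1}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)  # class with the higher posterior probability
print(post, best)
```

With these values the posterior for cancer rises from 1% to roughly 8%, yet C2 still has the higher posterior, so the minimum-misclassification rule assigns the image to C2.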
DECISION MAKING-GENERAL CRITERIA

Minimizing the misclassification rate:


• Suppose that our goal is to make as few misclassifications as possible. We need a rule that
assigns each value of x to one of the available classes.
• Such a rule will divide the input space into regions Rk called decision regions, one for
each class, such that all points in Rk are assigned to class Ck

• The boundaries between decision regions are called decision boundaries or decision
surfaces.
DECISION MAKING-GENERAL CRITERIA

Minimizing the misclassification rate:


• Consider first the case of two classes, as in the cancer problem for instance.
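• A mistake occurs when an input vector belonging to class C1 is assigned to class C2 or
vice versa. The probability of this occurring is given by
$$p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx$$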
DECISION MAKING-GENERAL CRITERIA

Minimizing the misclassification rate:


• To minimize p(mistake) we should arrange that each x is assigned to whichever class has
the smaller value of the integrand.
• Thus, if p(x, C1) > p(x, C2) for a given value of x, then we should assign that x to class C1.

• From the product rule of probability, we have p(x, Ck) = p(Ck|x)p(x).

• Because the factor p(x) is common to both terms, we can restate this result as saying that the
minimum probability of making a mistake is obtained if each value of x is assigned to the
class for which the posterior probability p(Ck|x) is largest
DECISION MAKING-GENERAL CRITERIA

Minimizing the misclassification rate:


• For the more general case of K classes, it is slightly easier to maximize the probability of
being correct, which is given by
$$p(\text{correct}) = \sum_{k=1}^{K} p(x \in R_k, C_k) = \sum_{k=1}^{K} \int_{R_k} p(x, C_k)\,dx$$
DECISION MAKING-GENERAL CRITERIA

Minimizing the misclassification rate:

• p(correct) is maximized when the regions Rk are chosen such that each x is assigned to the
class for which p(x, Ck) is largest.

• Using the product rule p (x, Ck) = p(Ck|x)p(x) and noting that the factor of p(x) is
common to all terms, we see that each x should be assigned to the class having the largest
posterior probability p(Ck|x).
DECISION MAKING-GENERAL CRITERIA

Minimizing the expected loss:


• We note that, if a patient who does not have cancer is incorrectly diagnosed as having cancer,
the consequences may be some patient distress plus the need for further investigations.
• Conversely, if a patient with cancer is diagnosed as healthy, the result may be premature death
due to lack of treatment.
• We can formalize such issues through the introduction of a loss function, also called a cost
function, which is a single, overall measure of loss incurred in taking any of the available
decisions or actions.
DECISION MAKING-GENERAL CRITERIA

Minimizing the expected loss:


• Our goal is then to minimize the total loss incurred.

• Suppose that, for a new value of x, the true class is Ck and that we assign x to class Cj
(where j may or may not be equal to k). In so doing, we incur some level of loss that we
denote by Lkj, which we can view as the k, j element of a loss matrix.
DECISION MAKING-GENERAL CRITERIA

Minimizing the expected loss:


• The optimal solution is the one which minimizes the loss function.
• However, the loss depends on the true class, which is unknown. For a given input vector x, our uncertainty in the true class is expressed through the joint
probability distribution p(x, Ck), and so we seek instead to minimize the average loss, where the
average is computed with respect to this distribution, which is given by
$$E[L] = \sum_{k} \sum_{j} \int_{R_j} L_{kj}\, p(x, C_k)\,dx$$
DECISION MAKING-GENERAL CRITERIA

Minimizing the expected loss:


• Each x can be assigned independently to one of the decision regions Rj.

• Our goal is to choose the regions Rj to minimize the expected loss.

• Thus, the decision rule that minimizes the expected loss is the one that assigns each new x to the
class j for which the quantity
$$\sum_{k} L_{kj}\, p(C_k \mid x)$$
is minimum.
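The rule above can be sketched as follows, assuming the posteriors p(Ck|x) are already available. The loss values are illustrative only: missing a cancer is taken to be 1000 times worse than a false alarm. The names `expected_losses` and `min_loss_decision` are hypothetical.

```python
def expected_losses(post, loss):
    """post[k] = p(C_k | x); loss[k][j] = L_kj (true class k, decision j).
    Returns, for every decision j, the sum over k of L_kj * p(C_k | x)."""
    decisions = next(iter(loss.values())).keys()
    return {j: sum(loss[k][j] * post[k] for k in post) for j in decisions}

def min_loss_decision(post, loss):
    """Pick the decision j with the smallest expected loss."""
    risks = expected_losses(post, loss)
    return min(risks, key=risks.get)

# Illustrative loss matrix (rows: true class, columns: decision).
loss = {
    "cancer": {"cancer": 0, "normal": 1000},
    "normal": {"cancer": 1, "normal": 0},
}
post = {"cancer": 0.05, "normal": 0.95}

print(expected_losses(post, loss))
print(min_loss_decision(post, loss))
```

Here "normal" has the higher posterior, but deciding "normal" has an expected loss of about 50 against 0.95 for deciding "cancer", so the minimum-loss decision is "cancer". With a 0/1 loss matrix the same function reduces to picking the largest posterior.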
DECISION MAKING-GENERAL CRITERIA

Reject Option:
• In some applications, it will be appropriate to avoid making decisions on the difficult cases in
anticipation of a lower error rate on those examples for which a classification decision is made.
• For example, in our medical illustration, it may be appropriate to use an automatic system to
classify those X-ray images for which there is little doubt as to the correct class, while leaving
a human expert to classify the more ambiguous cases.
• We can achieve this by introducing a threshold θ and rejecting those inputs x for which the
largest of the posterior probabilities p(Ck|x) is less than or equal to θ.
• Thus, the fraction of examples that get rejected is controlled by the value of θ.
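A minimal sketch of the reject option, assuming the posteriors are given as a dictionary and using `None` to signal that the case is passed to a human expert. Both conventions are assumptions made for illustration.

```python
def classify_with_reject(post, theta):
    """post[k] = p(C_k | x). Returns the most probable class, or None (reject)
    when the largest posterior is less than or equal to the threshold theta."""
    best = max(post, key=post.get)
    return best if post[best] > theta else None

print(classify_with_reject({"C1": 0.95, "C2": 0.05}, theta=0.8))  # confident case
print(classify_with_reject({"C1": 0.55, "C2": 0.45}, theta=0.8))  # ambiguous case
```

Setting theta to 1 rejects every input, while any theta below 1/K (for K classes) rejects none, so theta directly controls the fraction of rejected examples.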
DECISION SUPPORT VIA EXPERT SYSTEMS

• Expert Systems are AI-driven software
applications that emulate the decision-making
abilities of human experts in specific
domains.
• They leverage a knowledge base, which
contains a vast amount of information and
rules, to analyse complex problems and
provide recommendations or solutions.
• Expert Systems utilize inference engines to
process data and apply logical reasoning to
arrive at conclusions.
DECISION SUPPORT VIA EXPERT SYSTEMS

KNOWLEDGE BASE:
• The knowledge base is the foundation of an Expert System.
• It consists of structured information and rules derived from
human experts in a particular domain.
• This knowledge can be acquired through interviews,
documentation, or by analysing existing data.
• For example, in the medical field, an Expert System may
incorporate knowledge about symptoms, diseases, and
treatment options gathered from experienced doctors.
DECISION SUPPORT VIA EXPERT SYSTEMS

INFERENCE ENGINE:
• The inference engine is the core component of an Expert System.
• It applies logical rules and reasoning techniques to process data and draw
conclusions.
• It uses the knowledge base to make informed decisions or
recommendations based on the given input.
DECISION SUPPORT VIA EXPERT SYSTEMS

USER INTERFACE:
• A well-designed user interface (UI) is crucial for an Expert System to
effectively interact with users.
• The interface should be intuitive, user-friendly, and capable of capturing
input data required for analysis.
• It should also present the output in a clear and understandable manner.
DECISION TREE

• The process of selecting a specific model, given a new input x, can be described by a
sequential decision-making process corresponding to the traversal of a binary tree (one
that splits into two branches at each node).
• Here we focus on a particular tree-based framework called classification and regression
trees, or CART.
DECISION TREE

• The first step divides the whole of the input space into two regions according to whether
x1 ≤ θ1 or x1 > θ1, where θ1 is a parameter of the model.

• This creates two subregions, each of which can then be subdivided independently.

• For instance, the region x1 ≤ θ1 is further subdivided according to whether x2 ≤ θ2 or x2 > θ2,
giving rise to the regions denoted A and B.
DECISION TREE

• For any new input x, we determine which region it falls into by starting at the top of the tree at the root
node and following a path down to a specific leaf node according to the decision criteria at each node.
• Consider first a regression problem in which the goal is to predict a single target variable t from a
D-dimensional vector x = (x1, . . . , xD)^T of input variables.

• The training data consists of input vectors {x1, . . . , xN} along with the corresponding continuous labels
{t1, . . . , tN}.

• If the partitioning of the input space is given, and we minimize the sum-of-squares error function, then
the optimal value of the predictive variable within any given region is just given by the average of the
values of tn for those data points that fall in that region.
DECISION TREE

• Consider how to determine the structure of the decision tree.
• Even for a fixed number of nodes in the tree, the problem of determining the optimal structure to minimize
the sum-of-squares error is usually computationally infeasible due to the combinatorially large number of
possible solutions.
• Instead, a greedy optimization is generally done by starting with a single root node, corresponding to the
whole input space, and then growing the tree by adding nodes one at a time.
• At each step there will be some number of candidate regions in input space that can be split, corresponding
to the addition of a pair of leaf nodes to the existing tree.
• For each of these, there is a choice of which of the D input variables to split, as well as the value of the
threshold.
DECISION TREE

• The joint optimization of the choice of region to split, and the choice of input variable
and threshold, can be done efficiently by exhaustive search, noting that, for a given choice
of split variable and threshold, the optimal choice of predictive variable is given by the
local average of the data, as noted earlier.
• This is repeated for all possible choices of variable to be split, and the one that gives the
smallest residual sum-of-squares error is retained.
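One greedy step of this search, for a single region, might look like the sketch below. It assumes that candidate thresholds are the midpoints between consecutive distinct values of each input variable, a common convention that the slides do not specify. The names `sse` and `best_split` and the toy data are hypothetical.

```python
def sse(values):
    """Residual sum of squares around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, t):
    """Exhaustively search every input variable and threshold for one region.
    X: list of input vectors, t: list of targets.
    Returns (residual error, variable index, threshold) of the best split."""
    best = None
    for d in range(len(X[0])):
        values = sorted(set(x[d] for x in X))
        # Assumed candidate thresholds: midpoints between distinct values.
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            left = [tn for x, tn in zip(X, t) if x[d] <= theta]
            right = [tn for x, tn in zip(X, t) if x[d] > theta]
            # Each side predicts its local average, so the error is the summed SSE.
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, d, theta)
    return best

# Toy data: the target depends on the first variable only.
X = [[1, 5], [2, 3], [3, 8], [10, 4], [11, 7], [12, 6]]
t = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(best_split(X, t))
```

On the toy data the search selects the first variable with a threshold of 6.5, which separates the low targets from the high ones. Growing a full tree would apply the same step recursively to each resulting region.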
DECISION TREE

• Given a greedy strategy for growing the tree, there remains the issue of when to stop
adding nodes.
• A simple approach would be to stop when the reduction in residual error falls below some
threshold.
DECISION TREE

• A simple error threshold can stop too early, because a split that reduces the error only slightly may
enable later splits that reduce it substantially. In practice, a large tree is grown first and then pruned back.
• The pruning is based on a criterion that balances residual error against a measure of
model complexity.
• If we denote the starting tree for pruning by T0, then we define T ⊂ T0 to be a subtree of
T0 if it can be obtained by pruning nodes from T0 (in other words, by collapsing internal
nodes by combining the corresponding regions).
• Suppose the leaf nodes are indexed by τ = 1, . . . , |T|, with leaf node τ representing a
region Rτ of input space having Nτ data points, and |T| denoting the total number of leaf
nodes.
DECISION TREE

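The optimal prediction for region Rτ is then given by
$$y_\tau = \frac{1}{N_\tau} \sum_{x_n \in R_\tau} t_n$$
and the corresponding contribution to the residual sum-of-squares is then
$$Q_\tau(T) = \sum_{x_n \in R_\tau} (t_n - y_\tau)^2$$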
DECISION TREE

• The pruning criterion is then given by
$$C(T) = \sum_{\tau=1}^{|T|} Q_\tau(T) + \lambda |T|$$
• The regularization parameter λ determines the trade-off between the overall residual sum-
of-squares error and the complexity of the model as measured by the number |T| of leaf
nodes, and its value is chosen by cross-validation.
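• For classification problems, the sum-of-squares error is replaced by a measure better suited to class labels.
If we define pτk to be the proportion of data points in region Rτ assigned to class k,
where k = 1, . . . , K, then two commonly used choices are the cross-entropy
$$Q_\tau(T) = -\sum_{k=1}^{K} p_{\tau k} \ln p_{\tau k}$$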
DECISION TREE

• and the Gini index
$$Q_\tau(T) = \sum_{k=1}^{K} p_{\tau k} (1 - p_{\tau k})$$
• These both vanish for pτk = 0 and pτk = 1 and have a maximum at pτk = 0.5.
• They encourage the formation of regions in which a high proportion of the data points are assigned to one
class.
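The behaviour of the two impurity measures can be checked numerically. This is a small sketch assuming the class proportions of one region are passed in as a list that sums to 1. Natural logarithms are used for the cross-entropy, and terms with zero proportion are taken to contribute zero.

```python
import math

def cross_entropy(p):
    """Cross-entropy impurity: minus the sum of p_k * ln(p_k) over the classes."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def gini(p):
    """Gini index: the sum of p_k * (1 - p_k) over the classes."""
    return sum(pk * (1 - pk) for pk in p)

for proportions in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(proportions, cross_entropy(proportions), gini(proportions))
```

Both measures are zero for a pure region and largest for the evenly mixed two-class region, which is why minimizing them favours splits that produce regions dominated by one class.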
MODELLING AND BAYESIAN NETWORKS

• Bayesian networks, or Bayesian graphical models, are probabilistic graphical models that
represent a set of variables and their conditional dependencies using a directed acyclic
graph (DAG).
• Nodes represent random variables, and edges represent probabilistic dependencies
between them.
• Each node is associated with a conditional probability table that quantifies the probability
of that variable given its parents in the graph.
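To make the role of the conditional probability tables concrete, here is a minimal sketch of a three-node chain Smoking → Cancer → XRay. The network and every probability in it are hypothetical, chosen only for illustration. The joint distribution is the product of each node's table entry given its parents, and a query is answered by summing the joint over the unobserved variables.

```python
from itertools import product

# Hypothetical conditional probability tables (all values are made up).
p_s = {True: 0.3, False: 0.7}                     # p(Smoking)
p_c_given_s = {True: {True: 0.05, False: 0.95},   # p_c_given_s[s][c] = p(Cancer=c | Smoking=s)
               False: {True: 0.01, False: 0.99}}
p_x_given_c = {True: {True: 0.9, False: 0.1},     # p_x_given_c[c][x] = p(XRay=x | Cancer=c)
               False: {True: 0.2, False: 0.8}}

def joint(s, c, x):
    """Joint probability as the product of each node's CPT entry given its parents."""
    return p_s[s] * p_c_given_s[s][c] * p_x_given_c[c][x]

def prob_cancer_given_xray(x):
    """p(Cancer=True | XRay=x), by enumeration over the unobserved variables."""
    numerator = sum(joint(s, True, x) for s in (True, False))
    evidence = sum(joint(s, c, x) for s, c in product((True, False), repeat=2))
    return numerator / evidence

print(prob_cancer_given_xray(True))
```

Enumeration like this is exponential in the number of unobserved variables, so it only suits very small networks.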
MODELLING AND BAYESIAN NETWORKS

• Bayesian networks are widely used for reasoning under uncertainty, making predictions,
and performing inference tasks.
• They have applications in various fields such as medicine, finance, engineering, and
natural language processing.
• Bayesian networks are also used in decision analysis and can be employed to model
complex systems where uncertainty plays a significant role.
LEARNING BAYESIAN NETWORKS

• Learning Bayesian networks involves the process of inferring the structure and
parameters of the network from data. There are primarily two types of learning in
Bayesian networks:
• Structure Learning: Involves determining the graphical structure of the Bayesian
network, i.e., identifying the dependencies between variables and the network topology.
• Parameter Learning: Once the structure is determined, the next step is to estimate the
parameters (conditional probability distributions) associated with each node in the
network.
STRUCTURAL LEARNING

PC Algorithm:
• Constraint-based algorithm used for learning the structure of Bayesian networks from
observational data.
• It is widely used due to its efficiency and effectiveness in identifying causal relationships
among variables.
• The PC algorithm is particularly useful when the number of variables is relatively large.
STRUCTURAL LEARNING

PC Algorithm:
Here's a brief overview of how the PC algorithm works:
Step 1: Construct the Initial Graph: The PC algorithm begins by constructing a complete undirected graph
in which every variable in the dataset is a node and every pair of nodes is joined by an edge. Edges are
then removed as conditional independence relationships are discovered.
Step 2: Test Conditional Independence: For each pair of adjacent variables, the algorithm tests
whether they are conditionally independent given subsets of other variables, starting with the empty
subset and then increasing the subset size.
Step 3: Remove Edges Between Independent Variables: Whenever a test finds a pair of variables to be
conditionally independent, the algorithm removes the edge between them.
STRUCTURAL LEARNING

PC Algorithm:
Step 4: Orient Undirected Edges: After removing edges, the algorithm attempts to orient
the remaining undirected edges. It first orients unshielded colliders (X → Z ← Y, where X and Y
are not adjacent and Z was not in the set that separated them), using the results of the earlier
conditional independence tests.
Step 5: Finalize the Structure: Finally, the algorithm applies orientation rules that direct further
edges wherever leaving them undirected would create a new collider or a directed cycle. Edges that
cannot be oriented in this way remain undirected, so the result in general represents a class of equivalent DAGs.
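The edge-removal phase (Steps 1 to 3) can be sketched as below. This is a simplified illustration, not a complete PC implementation: the conditional independence test is passed in as a function `independent(x, y, cond)`, and the example replaces a statistical test with a hand-written oracle for the chain A → B → C. All names here are hypothetical, and edge orientation is not covered.

```python
from itertools import combinations

def pc_skeleton(nodes, independent):
    """Skeleton phase of the PC algorithm.
    independent(x, y, cond) -> bool is a user-supplied conditional independence test.
    Returns (set of undirected edges, dict of separating sets)."""
    adj = {v: set(nodes) - {v} for v in nodes}  # start from the complete graph
    sepset = {}
    size = 0
    # Continue while some adjacent pair still has enough other neighbours
    # to form a conditioning set of the current size.
    while any(len(adj[x]) - 1 >= size for x in nodes):
        for x in nodes:
            for y in list(adj[x]):
                others = adj[x] - {y}
                for cond in combinations(sorted(others), size):
                    if independent(x, y, set(cond)):
                        # Remove the edge and remember what separated the pair.
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
        size += 1
    edges = {frozenset((x, y)) for x in nodes for y in adj[x]}
    return edges, sepset

def chain_oracle(x, y, cond):
    """Hypothetical oracle for A -> B -> C: A and C are independent given B."""
    return {x, y} == {"A", "C"} and "B" in cond

edges, sepset = pc_skeleton(["A", "B", "C"], chain_oracle)
print(sorted(sorted(e) for e in edges), sepset)
```

For the chain, the edge A-C is removed once B is tried as the conditioning set, leaving A-B and B-C. The recorded separating sets are what the orientation phase later uses to identify colliders.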
PARAMETER LEARNING

EM Algorithm:
• The Expectation-Maximization (EM) algorithm is an iterative method used for estimating
the parameters of probabilistic models when there are latent (unobserved) variables
involved.
• It's particularly useful in situations where there is incomplete data or missing values.
• The EM algorithm aims to maximize the likelihood (or log-likelihood) of the observed
data by iteratively updating estimates of the parameters until convergence.
PARAMETER LEARNING

EM Algorithm:
Here's a simplified overview of how the EM algorithm works:
Initialization: Start with initial estimates for the parameters of the model. These can be
randomly chosen or based on some prior knowledge.
E-step (Expectation Step): In this step, we compute the expected values of the latent
variables given the observed data and the current parameter estimates. This involves
calculating the posterior distribution of the latent variables using Bayes' theorem.
PARAMETER LEARNING

EM Algorithm:
M-step (Maximization Step): In this step, we update the parameter estimates to maximize
the expected complete-data log-likelihood computed in the E-step. This often involves taking derivatives of this
expected log-likelihood with respect to the parameters and setting them to zero to find the
maximizing values.
Iteration: The E-step and M-step are repeated until the parameter estimates converge to a
stable solution. Convergence can be determined based on various criteria, such as when the
change in parameter estimates between iterations falls below a certain threshold.
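As an illustration of these steps, the sketch below runs EM on a one-dimensional mixture of Gaussians, where the latent variable is the component that generated each point. The model choice, the toy data, the small variance floor, and the stopping rule (change in log-likelihood rather than change in parameters) are all assumptions made for this example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, tol=1e-8, max_iter=500):
    """EM for a 1-D Gaussian mixture, starting from the given initial estimates."""
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(data)
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: posterior probability of each component for each point
        # (Bayes' theorem), accumulating the log-likelihood along the way.
        resp, ll = [], 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])
        # M-step: closed-form updates that maximize the expected
        # complete-data log-likelihood.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid a zero variance
            weights[j] = nj / n
        # Stop when the log-likelihood no longer improves noticeably.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return means, variances, weights

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
print(em_gmm_1d(data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]))
```

On this well-separated toy data the estimated means settle close to 1 and 5 with roughly equal weights. On harder data EM can converge to a local maximum, so the result depends on the initial estimates.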
MACHINE LEARNING

• Machine learning is a field of computer science that gives computers the ability to learn
without being explicitly programmed.
• Supervised learning and Unsupervised learning are two main types of machine learning.
• In supervised learning, the machine is trained on a set of labeled data, which means that
the input data is paired with the desired output. The machine then learns to predict the
output for new input data.
• Supervised learning is often used for tasks such as classification, regression, and object
detection.
MACHINE LEARNING

• In unsupervised learning, the machine is trained on a set of unlabelled data, which means
that the input data is not paired with the desired output.
• The machine then learns to find patterns and relationships in the data.
• Unsupervised learning is often used for tasks such as clustering, dimensionality
reduction, and anomaly detection.
SUPERVISED LEARNING

• Supervised learning involves training a machine from labelled data.


• Labelled data consists of examples with the correct answer or classification.
• The machine learns the relationship between inputs and outputs.
• The trained machine can then make predictions on new, unlabelled data.
SUPERVISED LEARNING

Types of Supervised Learning


Supervised learning is classified into two categories of algorithms:
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Classification: A classification problem is when the output variable is a category, such as
“red” or “blue”, or “disease” or “no disease”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
SUPERVISED LEARNING

Classification:
• Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters.
• In Classification, a computer program is trained on the training dataset and based on that
training, it categorizes the data into different classes.
• The task of the classification algorithm is to find the mapping function to map the
input(x) to the discrete output(y).
SUPERVISED LEARNING

Classification:
Types of ML Classification Algorithms:
Classification Algorithms can be further divided into the following types:
• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Decision Tree Classification
SUPERVISED LEARNING

Regression:
• Regression is a process of finding the correlations between dependent and independent
variables.
• It helps in predicting the continuous variables such as prediction of Market Trends,
prediction of House prices, etc.
• The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).
SUPERVISED LEARNING

Regression:
Types of Regression Algorithm:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
SUPERVISED LEARNING

Regression Algorithm | Classification Algorithm
--- | ---
In Regression, the output variable must be continuous (a real value). | In Classification, the output variable must be a discrete value.
In Regression, we try to find the best fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc. | Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
Regression algorithms can be further divided into Linear and Non-linear Regression. | Classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
SUPERVISED LEARNING

Linear Regression vs Logistic Regression:
• Linear Regression and Logistic Regression are two well-known Machine Learning
algorithms that come under the supervised learning technique.
• Linear Regression is used for solving Regression problems, whereas Logistic
Regression is used for solving Classification problems.
SUPERVISED LEARNING

Linear Regression:
• Used for solving regression problems.
• The goal of the Linear regression is to find the best fit line that can accurately predict the
output for the continuous dependent variable.
• If a single independent variable is used for prediction, then it is called Simple Linear
Regression, and if there is more than one independent variable, then it is
called Multiple Linear Regression.
SUPERVISED LEARNING

Linear Regression:
• By finding the best fit line, the algorithm establishes the relationship between the dependent variable and
the independent variable.
• The relationship is assumed to be linear.
• The output of Linear regression is a continuous value, such as price, age, or salary.
• With the dependent variable (salary) on the y-axis and the independent variable (experience) on the
x-axis, the regression line can be written as:

• y = a0 + a1x + ε, where a0 and a1 are the coefficients and ε is the error term.
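For a single independent variable, the least-squares coefficients a0 and a1 have a closed form. The sketch below uses it on made-up experience and salary figures; the data and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the line y = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx               # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1

# Hypothetical data: years of experience and salary (in thousands).
experience = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

a0, a1 = fit_line(experience, salary)
print(a0, a1)
print(a0 + a1 * 6)  # predicted salary for six years of experience
```

Because the toy data lie exactly on a line, the fit recovers an intercept of 30 and a slope of 5 and the error term ε is zero for every point. Real data would leave non-zero residuals.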
SUPERVISED LEARNING

Logistic Regression:
• It can be used for Classification as well as for Regression problems, but mainly used for
Classification problems.
• Logistic regression is used to predict the categorical dependent variable with the help of
independent variables.
• The output of Logistic Regression is a probability, so it always lies between 0 and 1.
• Logistic regression can be used where the probabilities of two classes are required.
SUPERVISED LEARNING

• In logistic regression, we pass the weighted sum of inputs through an
activation function that maps values into the range between 0 and 1.
• This activation function is known as the sigmoid function, and the curve
obtained is called the sigmoid curve or S-curve.
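A small sketch of this mapping: the sigmoid function is 1 / (1 + e^(-z)), applied to the weighted sum of inputs plus a bias term. The weights, bias, and input below are hypothetical, and the two-branch form of `sigmoid` is an implementation choice that avoids overflow for large negative z.

```python
import math

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Probability of the positive class for the input vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical weights and input.
p = predict_proba([2.0, -1.0], 0.5, [1.0, 0.5])
print(p, "positive" if p >= 0.5 else "negative")
```

A weighted sum of zero maps to exactly 0.5, large positive sums approach 1, and large negative sums approach 0, which gives the S-shaped curve.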
SUPERVISED LEARNING

Linear Regression | Logistic Regression
--- | ---
Linear regression is used to predict a continuous dependent variable using a given set of independent variables. | Logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
Linear regression is used for solving regression problems. | Logistic regression is used for solving classification problems.
In linear regression, we find the best fit line, by which we can easily predict the output. | In logistic regression, we find the S-curve by which we can classify the samples.
The least squares estimation method is used to estimate the parameters. | The maximum likelihood estimation method is used to estimate the parameters.
The output of linear regression must be a continuous value, such as price, age, etc. | The output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
EVALUATING SUPERVISED LEARNING

For Regression
• Mean Squared Error (MSE): MSE measures the average squared difference between the predicted
values and the actual values. Lower MSE values indicate better model performance.
• Root Mean Squared Error (RMSE): RMSE is the square root of MSE, representing the standard
deviation of the prediction errors. Like MSE, lower RMSE values indicate better model performance.
• Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted
values and the actual values. It is less sensitive to outliers compared to MSE or RMSE.
• R-squared (Coefficient of Determination): R-squared measures the proportion of the variance in the
target variable that is explained by the model. Higher R-squared values indicate better model fit.
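The four regression metrics above can be computed directly from their definitions. The sketch below uses only the standard library; the function name `regression_metrics` and the sample values are illustrative assumptions.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE and R-squared from paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n      # average squared error
    rmse = math.sqrt(mse)                     # same units as the target
    mae = sum(abs(e) for e in errors) / n     # average absolute error

    # R-squared: 1 minus the ratio of residual variation to total variation.
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum(e * e for e in errors)
    r2 = 1.0 - ss_residual / ss_total

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

# Made-up values for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 10.0]
metrics = regression_metrics(actual, predicted)
```

In this example the single error of size 1 contributes more to MSE than the two errors of size 0.5 combined, which illustrates why MSE and RMSE are more sensitive to outliers than MAE.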
EVALUATING SUPERVISED LEARNING

For Classification
Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It is
calculated by dividing the number of correct predictions by the total number of predictions.
Precision: Precision is the percentage of positive predictions that the model makes that are
correct. It is calculated by dividing the number of true positives by the total number of positive
predictions.
Recall: Recall is the percentage of all positive examples that the model correctly identifies. It is
calculated by dividing the number of true positives by the total number of positive examples.
EVALUATING SUPERVISED LEARNING

For Classification
F1 score: The F1 score is the harmonic mean of precision and recall, so it is high only
when both precision and recall are high.
Confusion matrix: A confusion matrix is a table that shows the number of predictions for
each class, along with the actual class labels. It can be used to visualize the performance of
the model and identify areas where the model is struggling.
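For a binary problem, all of the classification metrics above follow from the four cells of the confusion matrix. The sketch below is a minimal illustration; the function names, the choice of 1 as the positive class, and the sample labels are assumptions.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(actual, predicted, positive=1):
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for illustration only.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classification_metrics(actual, predicted)
```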
APPLICATIONS OF SUPERVISED LEARNING

• Supervised learning can be used to solve a wide variety of problems, including:


• Spam filtering: Supervised learning algorithms can be trained to identify and classify spam
emails based on their content, helping users avoid unwanted messages.
• Image classification: Supervised learning can automatically classify images into different
categories, such as animals, objects, or scenes, facilitating tasks like image search, content
moderation, and image-based product recommendations.
• Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient
data, such as medical images, test results, and patient history, to identify patterns that suggest
specific diseases or conditions.
APPLICATIONS OF SUPERVISED LEARNING

• Fraud detection: Supervised learning models can analyze financial transactions and
identify patterns that indicate fraudulent activity, helping financial institutions prevent
fraud and protect their customers.
• Natural language processing (NLP): Supervised learning plays a crucial role in NLP
tasks, including sentiment analysis, machine translation, and text summarization, enabling
machines to understand and process human language effectively.
UNSUPERVISED LEARNING

• Unsupervised learning is a type of machine learning that learns from unlabelled data.
• The goal of unsupervised learning is to discover patterns and relationships in the data
without any explicit guidance.
• Here the task of the machine is to group unsorted information according to similarities,
patterns, and differences without having been trained on labelled examples.
• Unlike supervised learning, the machine is given no labels or correct answers. It
therefore has to find the hidden structure in the unlabelled data by itself.
TYPES OF UNSUPERVISED LEARNING

• Unsupervised learning is classified into two categories of algorithms:


• Clustering
• Association
• Clustering is a type of unsupervised learning that is used to group similar data points
together.
• Many clustering algorithms work iteratively: each data point is assigned to its nearest cluster
centre, and the centres are then recomputed, until the assignments stop changing.
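One centroid-based procedure of this kind is k-means. The sketch below is a minimal standard-library version for illustration; the function names, the sample points and the hand-picked starting centres are assumptions, and practical implementations choose starting centres more carefully.

```python
def assign(points, centres):
    """Return, for each point, the index of its nearest centre."""
    labels = []
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
        labels.append(distances.index(min(distances)))
    return labels

def update(points, labels, centres):
    """Move each centre to the mean of the points assigned to it."""
    new_centres = []
    for k, old in enumerate(centres):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centres.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            new_centres.append(old)  # keep an empty cluster's centre unchanged
    return new_centres

def kmeans(points, centres, max_iterations=100):
    centres = [tuple(c) for c in centres]
    for _ in range(max_iterations):
        labels = assign(points, centres)
        new_centres = update(points, labels, centres)
        if new_centres == centres:  # converged: nothing moved
            break
        centres = new_centres
    return centres, assign(points, centres)

# Two visibly separate groups of made-up 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centres, labels = kmeans(points, centres=[(0.0, 0.0), (10.0, 10.0)])
```

Note that the data points themselves never move; only the cluster centres and the assignments change between iterations.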
TYPES OF UNSUPERVISED LEARNING

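Clustering Types:
• Hierarchical clustering
• K-means clustering
• Gaussian Mixture Models (GMMs)
Related unsupervised techniques (dimensionality reduction and signal separation rather than clustering):
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis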
TYPES OF UNSUPERVISED LEARNING

• Association rule learning is a type of unsupervised learning that is used to identify
co-occurrence patterns in data.
• Association rule learning algorithms work by finding relationships between different items
in a dataset.
• Some common association rule learning algorithms include:

Apriori Algorithm
Eclat Algorithm
FP-Growth Algorithm
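Relationships between items are usually quantified with support and confidence. These two measures are standard in association rule mining but are not defined in the slides, so the sketch below is an illustrative assumption, as are the function names and the made-up transactions. It does not implement Apriori, Eclat or FP-Growth, only the quantities those algorithms search over.

```python
# Each transaction is the set of items bought together (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

bread_and_milk = support({"bread", "milk"}, transactions)
bread_implies_milk = confidence({"bread"}, {"milk"}, transactions)
```

Here bread and milk appear together in 3 of 5 transactions, and milk appears in 3 of the 4 transactions that contain bread.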
EVALUATING UNSUPERVISED LEARNING

• There are several different metrics that can be used to evaluate unsupervised learning models, but some of
the most common ones include:
• Silhouette score: The silhouette score measures how well each data point is clustered with its own cluster
members and separated from other clusters. It ranges from -1 to 1, with higher scores indicating better
clustering.
• Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the variance between
clusters and the variance within clusters. It ranges from 0 to infinity, with higher scores indicating better
clustering.
• Adjusted Rand index: The adjusted Rand index measures the similarity between two clusterings, corrected for
chance. It is 1 for identical clusterings, close to 0 for random labelling, and can be negative when agreement is worse than chance.
EVALUATING UNSUPERVISED LEARNING

• Davies-Bouldin index: The Davies-Bouldin index measures the average similarity
between each cluster and the cluster most similar to it. It ranges from 0 to infinity,
with lower scores indicating better clustering.
• F1 score: The F1 score is the harmonic mean of precision and recall, two
metrics that are commonly used in supervised learning to evaluate classification models.
It can also be used to evaluate unsupervised models, such as clustering models,
but only when ground-truth labels are available for comparison.
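As an illustration of an internal metric that needs no ground-truth labels, the silhouette score can be computed from pairwise distances alone. The sketch below is a minimal standard-library version using Euclidean distance; the function name and the sample points are assumptions, and it expects at least two clusters.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster.
        same = [math.dist(p, q)
                for j, (q, other) in enumerate(zip(points, labels))
                if other == label and j != i]
        if not same:
            scores.append(0.0)  # usual convention for a single-point cluster
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

# Two tight, well-separated pairs of made-up points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # groups mixed across the gap
```

The natural grouping scores close to 1, while the mixed grouping scores below 0, matching the interpretation of the range given above.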
APPLICATIONS OF UNSUPERVISED LEARNING

• Unsupervised learning can be used to solve a wide variety of problems, including:


• Anomaly detection: Unsupervised learning can identify unusual patterns or deviations from normal
behaviour in data, enabling the detection of fraud, intrusion, or system failures.
• Scientific discovery: Unsupervised learning can uncover hidden relationships and patterns in
scientific data, leading to new hypotheses and insights in various scientific fields.
• Recommendation systems: Unsupervised learning can identify patterns and similarities in user
behaviour and preferences to recommend products, movies, or music that align with their interests.
APPLICATIONS OF UNSUPERVISED LEARNING

• Customer segmentation: Unsupervised learning can identify groups of customers with
similar characteristics, allowing businesses to target marketing campaigns and improve
customer service more effectively.
• Image analysis: Unsupervised learning can group images based on their content,
facilitating tasks such as image classification, object detection, and image retrieval.
FEATURE EXTRACTION AND MAPPING

• Feature extraction and mapping are fundamental processes in machine learning that
involve transforming raw data into a format suitable for modelling.
• Feature extraction involves transforming raw data into a more informative representation
with reduced dimensionality.
• Feature mapping involves transforming data into a higher-dimensional space to facilitate
learning.
• Both processes are crucial for effectively training machine learning models.
FEATURE EXTRACTION

• Feature extraction is a process in machine learning and data analysis that involves
identifying and extracting relevant features from raw data.
• These features are later used to create a more informative dataset, which can be further
utilized for various tasks such as Classification, Prediction and Clustering.
• Feature extraction aims to reduce data complexity (often known as “data dimensionality”)
while retaining as much relevant information as possible.
FEATURE EXTRACTION

• This helps to improve the performance and efficiency of machine learning algorithms and
simplify the analysis process.
• Feature extraction may involve creating new features and manipulating the data to
separate meaningful features from irrelevant ones.
FEATURE EXTRACTION

What is a feature?
• In machine learning and statistics, features are often called “variables” or “attributes.”
• Relevant features are those that correlate with the quantity the model is meant to predict.
• In a patient medical dataset, features could be age, gender, blood pressure, cholesterol
level, and other observed characteristics relevant to the patient.
FEATURE EXTRACTION

Why is feature extraction important?


• Feature extraction is critical for processes such as image and speech recognition, predictive
modelling, and Natural Language Processing (NLP).
• In these scenarios, the raw data may contain many irrelevant or redundant features. This makes it
difficult for algorithms to accurately process the data.
• By performing feature extraction, the relevant features are separated (“extracted”) from the
irrelevant ones.
• With fewer features to process, the dataset becomes simpler, and the accuracy and efficiency of the
analysis improves.
FEATURE EXTRACTION

Common Feature Types:


Numerical Features: Features with numeric values (int, float, etc.). Examples: age, salary, height.
Categorical Features: Features that can take one of a limited number of values. Examples:
gender (male, female, X), colour (red, blue, green).
Ordinal Features: Categorical features that have a clear ordering. Examples: T-shirt size
(S, M, L, XL).
Binary Features: A special case of categorical features with only two categories.
Examples: is_smoker (yes, no), has_subscription (true, false).
COMMON FEATURE EXTRACTION TECHNIQUES

Feature extraction techniques vary depending on the type of data and the specific problem.
Some common methods include:
• Principal Component Analysis (PCA): A technique used to reduce the dimensionality
of data by projecting it onto a lower-dimensional subspace while preserving the
maximum variance.
• Singular Value Decomposition (SVD): Closely related to PCA, SVD factorizes a matrix into
singular vectors and singular values, which can be truncated to reduce dimensionality and extract relevant features.
COMMON FEATURE EXTRACTION TECHNIQUES

Principal Component Analysis (PCA)


• Principal Component Analysis (PCA) is a widely used technique for dimensionality
reduction in data analysis and machine learning.
• It's particularly useful when dealing with high-dimensional data, as it helps in identifying
patterns and reducing the number of features while preserving the most important
information.
COMMON FEATURE EXTRACTION TECHNIQUES

Principal Component Analysis (PCA)


Objective:
• PCA aims to transform the original features of a dataset into a new set of orthogonal
(uncorrelated) features called principal components.
• The first principal component accounts for the maximum variance in the data, the second
principal component for the second maximum variance, and so on.
• By retaining a subset of the principal components that capture most of the variance, PCA
effectively reduces the dimensionality of the data.
COMMON FEATURE EXTRACTION TECHNIQUES

Steps:
• Standardization: If the features of the dataset are measured on different scales, it is important to
standardize them (subtract the mean and divide by the standard deviation) so that each feature contributes
equally to the analysis.
• Covariance Matrix Computation: PCA computes the covariance matrix of the standardized data, which
represents the relationships between pairs of features.
• Eigenvalue Decomposition: PCA then performs eigenvalue decomposition on the covariance matrix to
obtain the eigenvalues and corresponding eigenvectors. The eigenvectors represent the directions
(principal components) of maximum variance in the data, and the eigenvalues represent the magnitude of
variance along those directions.
COMMON FEATURE EXTRACTION TECHNIQUES

Steps:
• Selection of Principal Components: The principal components are sorted by their
corresponding eigenvalues, with the highest eigenvalue indicating the direction of
maximum variance. The top k components are then retained.
• Projection: Finally, the original data is projected onto the selected principal components
to obtain the transformed dataset with reduced dimensionality.
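The five steps above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not a reference one: the function name `pca`, the returned values and the synthetic data are all made up for the example.

```python
import numpy as np

def pca(data, n_components):
    """PCA on a (samples x features) array via the covariance matrix."""
    # 1. Standardization: zero mean and unit variance per feature.
    standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

    # 2. Covariance matrix of the standardized features (columns).
    covariance = np.cov(standardized, rowvar=False)

    # 3. Eigenvalue decomposition. eigh is meant for symmetric matrices
    #    and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)

    # 4. Selection: sort by eigenvalue, largest first, and keep the top k.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order[:n_components]]

    # 5. Projection onto the selected principal components.
    projected = standardized @ components
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return projected, components, explained_ratio

# Synthetic data: the second feature is almost a multiple of the first,
# so three features carry roughly two dimensions of information.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2.0 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])
projected, components, explained_ratio = pca(data, n_components=2)
```

Because two of the three features are nearly redundant, two components retain almost all of the variance, which is the situation in which PCA reduces dimensionality with little information loss.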
COMMON FEATURE EXTRACTION TECHNIQUES

Applications:
• Dimensionality Reduction: PCA is primarily used for reducing the dimensionality of
high-dimensional datasets while retaining as much variance as possible.
• Data Visualization: PCA can be employed to visualize high-dimensional data in lower-
dimensional space (usually 2D or 3D) for easier interpretation and visualization.
• Noise Reduction: PCA can also help in removing noise from the data by retaining only
the principal components that capture meaningful information.
COMMON FEATURE EXTRACTION TECHNIQUES

Considerations:
• Interpretability: While PCA reduces the dimensionality of the data, the resulting
principal components might not always be directly interpretable in terms of the original
features.
• Information Loss: Dimensionality reduction inevitably leads to some loss of
information, and the challenge lies in balancing the reduction in dimensionality with the
preservation of relevant information.
FEATURE MAPPING

• Feature mapping involves defining a mapping function that transforms the original
features into a higher-dimensional space.
• The aim is to make the data more suitable for modelling by transforming it into a space
where relationships are easier to find, for example where classes become linearly separable.
Methods:
• Feature mapping is commonly used in kernel methods such as Support Vector Machines
(SVMs) and kernelized versions of other algorithms, such as kernel PCA.
FEATURE MAPPING

• The kernel function implicitly maps the data into a higher-dimensional space without
explicitly computing the transformed feature vectors.
Common kernel functions include:
• Polynomial kernel: Maps data into a higher-dimensional space using polynomial
functions.
• Gaussian (RBF) kernel: Maps data into an infinite-dimensional space using Gaussian
functions.
• Sigmoid kernel: Maps data into a higher-dimensional space using sigmoid functions.
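The three kernels can be written as short functions, and the implicit mapping can be checked for the polynomial case: a degree-2 polynomial kernel on 2-D inputs gives the same value as an ordinary dot product in an explicit 6-dimensional feature space. The parameter names (`degree`, `gamma`, `coef0`) and their default values below are assumptions chosen for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

def explicit_quadratic_map(x):
    """Explicit feature map for 2-D input whose dot product equals
    (x . y + 1) ** 2, i.e. the degree-2 polynomial kernel with coef0 = 1."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, y = (1.0, 2.0), (3.0, 0.5)
implicit = polynomial_kernel(x, y)  # computed in the 2-D input space
explicit = dot(explicit_quadratic_map(x), explicit_quadratic_map(y))  # in 6-D
```

Both routes give the same number, but the kernel never builds the 6-dimensional vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit route is available.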
FEATURE MAPPING TECHNIQUES

Support Vector Machines:


• Support Vector Machines (SVMs) are powerful supervised learning algorithms used for
classification, regression, and outlier detection tasks.
• SVMs are effective in high-dimensional spaces and when the number of features exceeds
the number of samples.
FEATURE MAPPING TECHNIQUES

Support Vector Machines:


Basic Idea:
• SVMs aim to find the optimal hyperplane that best separates data points of different
classes in a feature space.
• In a binary classification scenario, the hyperplane is a decision boundary that maximizes
the margin between the closest points (support vectors) of different classes.
• SVMs can also handle nonlinear separation by implicitly mapping the input features into
a higher-dimensional space using a kernel function.
FEATURE MAPPING TECHNIQUES

Support Vector Machines:


Margin Maximization:
• The margin is the distance between the decision boundary and the closest data points
(support vectors).
• SVMs seek to maximize this margin, as a larger margin implies better generalization and
robustness to noise.
• The decision boundary is determined by the support vectors, which are the data points
that lie closest to the decision boundary.
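For a linear decision boundary w·x + b = 0, the distance from a point x to the boundary is |w·x + b| / ‖w‖, and the margin is the smallest such distance over the data. The sketch below evaluates this for a hand-picked hyperplane; the values of `w` and `b` and the points are hypothetical and nothing here is trained.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points):
    """Distance from the hyperplane to the closest point."""
    return min(abs(signed_distance(w, b, x)) for x in points)

def closest_points(w, b, points, tolerance=1e-9):
    """Points lying at the margin distance: the candidate support vectors."""
    m = margin(w, b, points)
    return [x for x in points
            if abs(abs(signed_distance(w, b, x)) - m) < tolerance]

# Hypothetical boundary x1 + x2 - 3 = 0 and made-up points on either side.
w, b = [1.0, 1.0], -3.0
points = [(1.0, 1.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0)]

m = margin(w, b, points)
support_vectors = closest_points(w, b, points)
```

Only the two closest points determine the margin; moving the other two further away would not change it, which is why the decision boundary depends on the support vectors alone.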
FEATURE MAPPING TECHNIQUES

Support Vector Machines:


Kernel Trick:
• In cases where the data is not linearly separable, SVMs employ a kernel trick to map the
input features into a higher-dimensional space where separation is possible.
• Popular kernel functions include the linear kernel, the polynomial kernel and the Gaussian
Radial Basis Function (RBF) kernel.
FEATURE MAPPING TECHNIQUES

Support Vector Machines:


Regularization:
• SVMs incorporate regularization parameters to balance margin maximization and
classification error.
• The regularization parameter C controls the trade-off between maximizing the margin
and minimizing the classification error on the training data.
• A smaller C value leads to a wider margin and potentially more misclassifications, while
a larger C value penalizes misclassifications more heavily, giving a narrower margin that may overfit.
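The role of C is easiest to see in the soft-margin objective. The notation below is the standard textbook form and does not appear in the slides: w and b define the hyperplane, the labels y_i take values in {-1, +1}, and each slack variable ξ_i measures how far point i violates the margin.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

The first term favours a wide margin (the margin width is proportional to 1/‖w‖), and the second term charges C for every unit of margin violation. With a small C violations are cheap, so the optimizer accepts them in exchange for a wider margin. With a large C violations are expensive, so the boundary bends to fit the training points.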
FEATURE MAPPING TECHNIQUES

Support Vector Machines:


Multi-Class Classification:
• SVMs are inherently binary classifiers. For multi-class classification, strategies such as
one-vs-one or one-vs-rest are commonly used.
• In one-vs-one, an SVM is trained for each pair of classes, and the class with the most
votes is predicted. In one-vs-rest, a separate SVM is trained for each class against the rest,
and the class whose classifier gives the highest score is predicted.
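The two strategies differ in how many binary classifiers they need and in how the final class is chosen. The sketch below shows only the combination logic: the classifier outputs are hypothetical values typed in by hand, no SVM is trained, and picking the highest score for one-vs-rest is the usual convention rather than something stated in the slides.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_pairs(classes):
    """One binary classifier per unordered pair: k * (k - 1) / 2 in total."""
    return list(combinations(classes, 2))

def one_vs_one_predict(pairwise_winners):
    """Majority vote over the winners chosen by the pairwise classifiers."""
    return Counter(pairwise_winners).most_common(1)[0][0]

def one_vs_rest_predict(scores):
    """Pick the class whose 'this class vs the rest' classifier scores highest."""
    return max(scores, key=scores.get)

classes = ["A", "B", "C", "D"]
pairs = one_vs_one_pairs(classes)

# Hypothetical winners of the six pairwise classifiers, in the order of `pairs`:
# (A,B) (A,C) (A,D) (B,C) (B,D) (C,D)
winners = ["A", "C", "A", "C", "B", "C"]
ovo_prediction = one_vs_one_predict(winners)

# Hypothetical decision scores of the four one-vs-rest classifiers.
scores = {"A": -0.3, "B": 0.2, "C": 1.1, "D": -1.4}
ovr_prediction = one_vs_rest_predict(scores)
```

With four classes, one-vs-one needs six classifiers and one-vs-rest needs four. One-vs-one trains more models, but each sees only the data of two classes.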
FEATURE MAPPING TECHNIQUES

Support Vector Machines:


Advantages:
• Effective in high-dimensional spaces: SVMs perform well even when the number of
features exceeds the number of samples.
• Versatile: SVMs can handle both linear and nonlinear decision boundaries through the use
of appropriate kernel functions.
• Robust: SVMs are less prone to overfitting, especially in high-dimensional spaces, due to
margin maximization.
FEATURE MAPPING TECHNIQUES

Support Vector Machines:


Disadvantages:
• Computationally intensive: Training an SVM can be computationally expensive, especially with large datasets.
• Sensitivity to parameters: SVMs are sensitive to the choice of hyperparameters such as the kernel function and
regularization parameter.
• Limited interpretability: The decision function learned by SVMs may be difficult to interpret, particularly in
high-dimensional spaces with complex kernels.
• Proper tuning of hyperparameters and careful handling of large datasets are essential for effective utilization of
SVMs in machine learning applications.
