BI Lecture-Mod 2
GRAPHICAL MODELS
Probabilistic And Graphical Models
• By probabilistic models, we mean the models that express the probability of some
observations given a set of model parameters
• Such models are graphical when this probability can be represented as a graph.
• Examples of probabilistic graphical models are Hidden Markov Models (HMMs)
(used for modelling protein families) and belief networks (used for the reconstruction of
gene networks from expression data).
• Once the probabilistic model has been set up, the goal is to find model parameters
matching the observed data.
• This can be achieved by maximum likelihood or maximum a posteriori estimation or by
Bayesian inference.
• In Bayesian inference, we use the data to update a prior probability distribution into a
posterior probability distribution over the parameters given the data.
• After the modelling criterion has been chosen, a variety of algorithms are available for
estimating the model, such as gradient descent, Expectation-Maximization etc.
Applications
• Information theory deals with the study of the transmission, processing, extraction, and
utilization of information at a mathematical level.
• Many of the outcomes from the study of information theory have been reduced to
engineering practice in applications like Artificial Intelligence and Machine Learning.
• So, the question naturally arises: could it be used to inform the practice of medicine?
Information Theory in Medical Perspective
• Study of situations where one agent (the transmitter) conveys some message over a
channel to another agent (the receiver).
• This is performed by having the transmitter send a series of partial messages.
• Each partial message resolves some of the receiver's uncertainty as to the content of the
original message.
• The amount of uncertainty resolved by a partial message is its information content.
Quantities of Information
• Entropy: The entropy of a discrete random variable X measures the average uncertainty
in its outcomes and is defined as H(X) = −Σx p(x) log2 p(x).
Joint Entropy:
• The joint entropy of two discrete random variables X and Y is the entropy of their
pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is
the sum of their individual entropies.
• For example, if (X, Y) represents the position of a chess piece—X the row and Y the
column, then the joint entropy of the row of the piece and the column of the piece will be
the entropy of the position of the piece.
Quantities of Information
• Conditional Entropy:
• Quantifies the amount of information needed to describe the outcome of a random
variable Y given that the value of another random variable X is known.
• The entropy of Y conditioned on X is written as H(Y|X)
• The conditional entropy of Y given X is defined as H(Y|X) = −Σx,y p(x, y) log2 p(y|x)
Quantities of Information
• Mutual Information:
• Mutual information measures the amount of information that can be obtained about one
random variable by observing another.
• It can be used to maximize the amount of information shared between sent and received
signals.
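Since joint entropy, conditional entropy, and mutual information are all derivable from the joint distribution via the identities H(Y|X) = H(X, Y) − H(X) and I(X; Y) = H(X) + H(Y) − H(X, Y), a minimal NumPy sketch (toy joint distribution assumed for illustration) can compute all of the quantities above:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Assumed toy joint distribution p(x, y): rows index X, columns index Y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)            # marginal p(x)
p_y = p_xy.sum(axis=0)            # marginal p(y)

H_xy = H(p_xy.flatten())          # joint entropy H(X, Y)
H_y_given_x = H_xy - H(p_x)       # conditional entropy H(Y|X)
I_xy = H(p_x) + H(p_y) - H_xy     # mutual information I(X; Y)

print(H(p_x), H(p_y), H_xy, H_y_given_x, I_xy)
```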
Properties of Information
• If there is more uncertainty about the message, information carried is also more
• If the receiver knows the message being transmitted, the amount of information carried is
zero.
• If I1 is the information carried by message m1 and I2 is the information carried by m2, then
the amount of information carried by m1 and m2 together is I1 + I2.
• If there are M = 2^N equally likely messages, then the amount of information carried by each
message will be N bits.
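• As a quick check of the last property: with M = 8 = 2^3 equally likely messages, each
message carries I = log2 M = 3 bits of information.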
DECISION SUPPORT VIA PROBABILITIES AND UTILITIES
• Decision theory when combined with probabilities and utilities allows us to make optimal
decisions in situations involving uncertainty.
• Suppose we have an input vector x together with a corresponding vector t of target
variables, and our goal is to predict t given a new value for x
• The joint probability distribution p (x, t) provides a complete summary of the uncertainty
associated with these variables.
• In a practical application, we need to do a prediction for the value of t, or understand the
values t is likely to take, and this aspect is the subject of decision theory.
DECISION THEORY
• Consider a medical diagnosis problem in which we have taken an X-ray image of a patient,
and we wish to determine whether the patient has cancer or not.
• In this case, the input vector x is the set of pixel intensities in the image, and output variable t
will represent the presence of cancer, which we denote by the class C1, or the absence of cancer,
which we denote by the class C2.
• We can choose t to be a binary variable such that t = 0 corresponds to class C1 and t = 1
corresponds to class C2.
• The general inference problem then involves determining the joint distribution p (x, Ck), or
equivalently p (x, t).
DECISION THEORY
• In the end we must decide either to give treatment to the patient or not.
• This is the decision step, and it is the subject of decision theory to tell us how to make
optimal decisions given the appropriate probabilities.
• When we obtain the X-ray image x for a new patient, our goal is to decide which of the
two classes to assign to the image.
• We are interested in the probabilities of the two classes given the image, which are given
by p(Ck|x)
• We can now interpret p(Ck) as the prior probability for the class Ck and p(Ck|x) as the
corresponding posterior probability.
• Thus p(C1) represents the probability that a person has cancer, before we take the X-ray
measurement.
• The boundaries between decision regions are called decision boundaries or decision
surfaces.
DECISION MAKING-GENERAL CRITERIA
• Because the factor p(x) is common to both terms, we can restate this result as saying that the
minimum probability of making a mistake is obtained if each value of x is assigned to the
class for which the posterior probability p(Ck|x) is largest
DECISION MAKING-GENERAL CRITERIA
• P(correct) is maximized when the regions Rk are chosen such that each x is assigned to the
class for which p(x, Ck) is largest.
• Using the product rule p (x, Ck) = p(Ck|x)p(x) and noting that the factor of p(x) is
common to all terms, we see that each x should be assigned to the class having the largest
posterior probability p(Ck|x).
DECISION MAKING-GENERAL CRITERIA
• Suppose that, for a new value of x, the true class is Ck and that we assign x to class Cj
(where j may or may not be equal to k). In so doing, we incur some level of loss that we
denote by Lkj, which we can view as the k, j element of a loss matrix.
DECISION MAKING-GENERAL CRITERIA
• Thus, the decision rule that minimizes the expected loss is the one that assigns each new x
to the class j for which the quantity Σk Lkj p(Ck|x) is minimum.
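A small sketch of this minimum-expected-loss rule, with a hypothetical loss matrix and posterior vector (numbers assumed for illustration; in the cancer example, missing a cancer is penalized far more heavily than a false alarm):

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: cost of assigning class j when truth is k.
# Classes: 0 = cancer, 1 = normal. Missing a cancer (k=0, j=1) is very costly.
L = np.array([[0.0, 1000.0],
              [1.0, 0.0]])

def decide(posteriors):
    """Return the class j minimizing sum_k L[k, j] * p(C_k | x)."""
    expected_loss = L.T @ posteriors   # expected_loss[j] = sum_k L[k, j] p_k
    return int(np.argmin(expected_loss))

print(decide(np.array([0.3, 0.7])))    # 0: treats as cancer despite lower posterior
```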
DECISION MAKING-GENERAL CRITERIA
Reject Option:
• In some applications, it will be appropriate to avoid making decisions on the difficult cases in
anticipation of a lower error rate on those examples for which a classification decision is made.
• For example, in our medical illustration, it may be appropriate to use an automatic system to
classify those X-ray images for which there is little doubt as to the correct class, while leaving
a human expert to classify the more ambiguous cases.
• We can achieve this by introducing a threshold θ and rejecting those inputs x for which the
largest of the posterior probabilities p(Ck|x) is less than or equal to θ.
• Thus, the fraction of examples that get rejected is controlled by the value of θ.
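A minimal sketch of the reject option (the threshold value θ = 0.9 is an assumption for illustration):

```python
import numpy as np

def classify_with_reject(posteriors, theta=0.9):
    """Assign the max-posterior class, or reject if that posterior <= theta."""
    k = int(np.argmax(posteriors))
    if posteriors[k] <= theta:
        return None                # defer the case to a human expert
    return k

print(classify_with_reject(np.array([0.55, 0.45])))   # None: ambiguous, rejected
print(classify_with_reject(np.array([0.97, 0.03])))   # 0: confident, classified
```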
DECISION SUPPORT VIA EXPERT SYSTEMS
KNOWLEDGE BASE:
• The knowledge base is the foundation of an Expert System.
• It consists of structured information and rules derived from
human experts in a particular domain.
• This knowledge can be acquired through interviews,
documentation, or by analysing existing data.
• For example, in the medical field, an Expert System may
incorporate knowledge about symptoms, diseases, and
treatment options gathered from experienced doctors.
DECISION SUPPORT VIA EXPERT SYSTEMS
INFERENCE ENGINE:
• The inference engine is the core component of an Expert System.
• It applies logical rules and reasoning techniques to process data and draw
conclusions.
• It uses the knowledge base to make informed decisions or
recommendations based on the given input.
DECISION SUPPORT VIA EXPERT SYSTEMS
USER INTERFACE:
• A well-designed user interface (UI) is crucial for an Expert System to
effectively interact with users.
• The interface should be intuitive, user-friendly, and capable of capturing
input data required for analysis.
• It should also present the output in a clear and understandable manner.
DECISION TREE
• The process of selecting a specific model, given a new input x, can be described by a
sequential decision-making process corresponding to the traversal of a binary tree (one
that splits into two branches at each node).
• Here we focus on a particular tree-based framework called classification and regression
trees, or CART.
DECISION TREE
• The first step divides the whole of the input space into two regions according to whether
x1≤ θ1 or x1 > θ1 where θ1 is a parameter of the model.
• This creates two subregions, each of which can then be subdivided independently.
• For instance, the region x1≤ θ1 is further subdivided according to whether x2≤θ2 or x2 > θ2,
giving rise to the regions denoted A and B.
DECISION TREE
• For any new input x, we determine which region it falls into by starting at the top of the tree at the root
node and following a path down to a specific leaf node according to the decision criteria at each node.
• Consider first a regression problem in which the goal is to predict a single target variable t from a
D-dimensional vector x = (x1, . . . , xD)T of input variables.
• The training data consists of input vectors {x1, . . . , xN} along with the corresponding continuous labels
{t1, . . . , tN}.
• If the partitioning of the input space is given, and we minimize the sum-of-squares error function, then
the optimal value of the predictive variable within any given region is just given by the average of the
values of tn for those data points that fall in that region.
DECISION TREE
• The joint optimization of the choice of region to split and the choice of input variable
and threshold can be done efficiently by exhaustive search, noting that, for a given choice
of split variable and threshold, the optimal choice of predictive variable is given by the
local average of the data, as noted earlier.
• This is repeated for all possible choices of variable to be split, and the one that gives the
smallest residual sum-of-squares error is retained.
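A minimal sketch of this exhaustive search for a regression tree's first split (toy data assumed for illustration); each candidate split is scored by the residual sum of squares around the local means of the two resulting regions:

```python
import numpy as np

def best_split(X, t):
    """Exhaustively search (variable, threshold) pairs; return the split
    minimizing the residual sum of squares with local-mean predictions."""
    best = (None, None, np.inf)
    for d in range(X.shape[1]):               # every input variable
        for theta in np.unique(X[:, d]):      # every candidate threshold
            left, right = t[X[:, d] <= theta], t[X[:, d] > theta]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best[2]:
                best = (d, theta, rss)
    return best

rng = np.random.default_rng(0)                # assumed toy data
X = rng.uniform(size=(50, 2))
t = np.where(X[:, 0] <= 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=50)
print(best_split(X, t))                       # should recover variable 0, threshold near 0.5
```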
DECISION TREE
• Given a greedy strategy for growing the tree, there remains the issue of when to stop
adding nodes.
• A simple approach would be to stop when the reduction in residual error falls below some
threshold; in practice, however, a better strategy is to grow a large tree and then prune it back.
DECISION TREE
• The pruning is based on a criterion that balances residual error against a measure of
model complexity.
• If we denote the starting tree for pruning by T0, then we define T ⊂ T0 to be a subtree of
T0 if it can be obtained by pruning nodes from T0 (in other words, by collapsing internal
nodes by combining the corresponding regions).
• Suppose the leaf nodes are indexed by τ = 1, . . . , |T|, with leaf node τ representing a
region Rτ of input space having Nτ data points, and |T| denoting the total number of leaf
nodes.
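Written out (following the standard CART presentation, with Qτ(T) denoting the residual error for region Rτ), the pruning criterion that trades error against complexity is:

C(T) = Σ_{τ=1}^{|T|} Qτ(T) + λ|T|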
DECISION TREE
• The regularization parameter λ determines the trade-off between the overall residual sum-
of-squares error and the complexity of the model as measured by the number |T| of leaf
nodes, and its value is chosen by cross-validation.
• If we define pτk to be the proportion of data points in region Rτ assigned to class k,
where k = 1, . . . , K, then two commonly used choices are the cross-entropy and the Gini
index, shown below.
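For reference, the two measures in their standard forms (e.g., as given in Bishop's PRML) are:

Cross-entropy: Qτ(T) = −Σ_{k=1}^{K} pτk ln pτk
Gini index: Qτ(T) = Σ_{k=1}^{K} pτk (1 − pτk)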
DECISION TREE
• These both vanish for pτk = 0 and pτk = 1, and have a maximum at pτk = 0.5.
• They encourage the formation of regions in which a high proportion of the data points are assigned to one
class.
MODELLING AND BAYESIAN NETWORKS
• Bayesian networks or Bayesian graphical models, are probabilistic graphical models that
represent a set of variables and their conditional dependencies using a directed acyclic
graph (DAG).
• Nodes represent random variables, and edges represent probabilistic dependencies
between them.
• Each node is associated with a conditional probability table that quantifies the probability
of that variable given its parents in the graph.
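A minimal sketch (a toy two-node network with assumed probabilities) showing how conditional probability tables attached to a DAG define the joint distribution via the chain rule, and how Bayes' theorem then yields a posterior:

```python
# Toy network, structure and numbers assumed for illustration: Disease -> Symptom.
p_disease = {True: 0.01, False: 0.99}                 # P(Disease)
p_symptom = {True: {True: 0.9, False: 0.1},           # P(Symptom | Disease=True)
             False: {True: 0.2, False: 0.8}}          # P(Symptom | Disease=False)

def joint(d, s):
    """Chain rule over the DAG: P(d, s) = P(d) * P(s | d)."""
    return p_disease[d] * p_symptom[d][s]

# Posterior P(Disease=True | Symptom=True) via Bayes' theorem:
num = joint(True, True)
den = joint(True, True) + joint(False, True)
print(num / den)    # ~0.043: even with the symptom observed, disease stays unlikely
```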
MODELLING AND BAYESIAN NETWORKS
• Bayesian networks are widely used for reasoning under uncertainty, making predictions,
and performing inference tasks.
• They have applications in various fields such as medicine, finance, engineering, and
natural language processing.
• Bayesian networks are also used in decision analysis and can be employed to model
complex systems where uncertainty plays a significant role.
LEARNING BAYESIAN NETWORKS
• Learning Bayesian networks involves the process of inferring the structure and
parameters of the network from data. There are primarily two types of learning in
Bayesian networks:
• Structure Learning: Involves determining the graphical structure of the Bayesian
network, i.e., identifying the dependencies between variables and the network topology.
• Parameter Learning: Once the structure is determined, the next step is to estimate the
parameters (conditional probability distributions) associated with each node in the
network.
STRUCTURAL LEARNING
PC Algorithm:
• Constraint-based algorithm used for learning the structure of Bayesian networks from
observational data.
• It is widely used due to its efficiency and effectiveness in identifying causal relationships
among variables.
• The PC algorithm is particularly useful when the number of variables is relatively large.
STRUCTURAL LEARNING
PC Algorithm:
Here's a brief overview of how the PC algorithm works:
Step 1: Construct an Independence Graph: The PC algorithm begins by constructing an undirected graph
called the "independence graph." This graph represents conditional independence relationships among
variables in the dataset. Initially, all variables are nodes in the graph, and every pair of nodes is
connected by an edge (a complete graph).
Step 2: Test Conditional Independence: For each pair of variables in the dataset, the algorithm tests
whether they are conditionally independent given subsets of other variables.
Step 3: Remove Dependent Edges: Based on the conditional independence tests, the algorithm removes
edges that violate conditional independence assumptions.
STRUCTURAL LEARNING
PC Algorithm:
Step 4: Orient Undirected Edges: After removing edges, the algorithm attempts to orient
the remaining undirected edges to create a directed acyclic graph (DAG) that represents
causal relationships. This is done using additional conditional independence tests and causal
inference principles.
Step 5: Finalize the Structure: Finally, the algorithm may perform additional refinement
steps, such as checking for unshielded colliders and adjusting the graph accordingly to
ensure consistency with causal semantics.
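The full PC algorithm is intricate; the sketch below illustrates only the spirit of Steps 1-3, starting from a complete graph and deleting edges whose pairwise Pearson test fails to reject independence (a real implementation also conditions on growing subsets of neighbours and then orients the remaining edges). The data, the test, and the threshold are all assumptions for illustration:

```python
import itertools
import numpy as np
from scipy import stats

def skeleton(data, alpha=0.05):
    """Toy sketch of PC Steps 1-3: start fully connected, drop an edge when
    a marginal independence test fails to reject independence."""
    n_vars = data.shape[1]
    edges = set(itertools.combinations(range(n_vars), 2))
    for i, j in list(edges):
        r, p_value = stats.pearsonr(data[:, i], data[:, j])
        if p_value > alpha:             # cannot reject independence: remove edge
            edges.discard((i, j))
    return edges

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + 0.5 * rng.normal(size=500)      # y depends on x
z = rng.normal(size=500)                # z independent of both
print(skeleton(np.column_stack([x, y, z])))   # expect only the (0, 1) edge
```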
PARAMETER LEARNING
EM Algorithm:
• The Expectation-Maximization (EM) algorithm is an iterative method used for estimating
the parameters of probabilistic models when there are latent (unobserved) variables
involved.
• It's particularly useful in situations where there is incomplete data or missing values.
• The EM algorithm aims to maximize the likelihood (or log-likelihood) of the observed
data by iteratively updating estimates of the parameters until convergence.
PARAMETER LEARNING
EM Algorithm:
Here's a simplified overview of how the EM algorithm works:
Initialization: Start with initial estimates for the parameters of the model. These can be
randomly chosen or based on some prior knowledge.
E-step (Expectation Step): In this step, we compute the expected values of the latent
variables given the observed data and the current parameter estimates. This involves
calculating the posterior distribution of the latent variables using Bayes' theorem.
PARAMETER LEARNING
EM Algorithm:
M-step (Maximization Step): In this step, we update the parameter estimates to maximize
the expected likelihood obtained in the E-step. This often involves taking derivatives of the
likelihood function with respect to the parameters and setting them to zero to find the
maximum likelihood estimates.
Iteration: Steps 2 and 3 are repeated iteratively until the parameter estimates converge to a
stable solution. Convergence can be determined based on various criteria, such as when the
change in parameter estimates between iterations falls below a certain threshold.
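A compact NumPy sketch of EM for a two-component, one-dimensional Gaussian mixture (toy data and initial values assumed for illustration), showing the E and M steps described above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: a mixture of two Gaussians with assumed parameters.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Initialization: rough starting estimates for weights, means, std devs.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    r = pi * np.stack([normal_pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities.
    n_k = r.sum(axis=0)
    pi = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print(pi, mu, sigma)   # estimates approach the true mixture parameters
```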
MACHINE LEARNING
• Machine learning is a field of computer science that gives computers the ability to learn
without being explicitly programmed.
• Supervised learning and Unsupervised learning are two main types of machine learning.
• In supervised learning, the machine is trained on a set of labeled data, which means that
the input data is paired with the desired output. The machine then learns to predict the
output for new input data.
• Supervised learning is often used for tasks such as classification, regression, and object
detection.
MACHINE LEARNING
• In unsupervised learning, the machine is trained on a set of unlabelled data, which means
that the input data is not paired with the desired output.
• The machine then learns to find patterns and relationships in the data.
• Unsupervised learning is often used for tasks such as clustering, dimensionality
reduction, and anomaly detection.
SUPERVISED LEARNING
Classification:
• Classification is a process of finding a function which helps in dividing the dataset into
classes based on different parameters.
• In Classification, a computer program is trained on the training dataset and based on that
training, it categorizes the data into different classes.
• The task of the classification algorithm is to find the mapping function to map the
input(x) to the discrete output(y).
SUPERVISED LEARNING
Classification:
Types of ML Classification Algorithms:
Classification Algorithms can be further divided into the following types:
• Logistic Regression
• K-Nearest Neighbours
• Support Vector Machines
• Kernel SVM
• Decision Tree Classification
SUPERVISED LEARNING
Regression:
• Regression is a process of finding the correlations between dependent and independent
variables.
• It helps in predicting continuous variables, such as Market Trends, House prices, etc.
• The task of the Regression algorithm is to find the mapping function to map the input
variable(x) to the continuous output variable(y).
SUPERVISED LEARNING
Regression:
Types of Regression Algorithm:
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
SUPERVISED LEARNING
Regression vs Classification:
• Regression algorithms can be used to solve regression problems such as Weather Prediction,
House price prediction, etc.; Classification algorithms can be used to solve classification
problems such as Identification of spam emails, Speech Recognition, Identification of cancer
cells, etc.
• The Regression algorithm can be further divided into Linear and Non-linear Regression; the
Classification algorithms can be divided into Binary Classifier and Multi-class Classifier.
SUPERVISED LEARNING
Linear Regression vs Logistic Regression:
• Linear Regression and Logistic Regression
are the two famous Machine Learning
Algorithms which come under supervised
learning technique.
• The Linear Regression is used for solving
Regression problems whereas Logistic
Regression is used for solving the
Classification problems.
SUPERVISED LEARNING
Linear Regression:
• Used for solving regression problems.
• The goal of the Linear regression is to find the best fit line that can accurately predict the
output for the continuous dependent variable.
• If a single independent variable is used for prediction, then it is called Simple Linear
Regression, and if more than one independent variable is used, then such regression is
called Multiple Linear Regression.
SUPERVISED LEARNING
Linear Regression:
• By finding the best fit line, the algorithm establishes the relationship between the dependent
variable and the independent variable.
• The relationship should be of a linear nature.
• The output for Linear regression should only be the continuous values such as price, age, salary, etc.
• With the dependent variable (salary) on the Y-axis and the independent variable (experience)
on the X-axis, the regression line can be written as:
• y = a0 + a1x + ε, where a0 and a1 are the coefficients and ε is the error term.
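A minimal NumPy sketch fitting a0 and a1 by least squares; the experience/salary numbers are assumed for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # years of experience (assumed)
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])  # salary in thousands (assumed)

# Closed-form least-squares estimates of the slope a1 and intercept a0.
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print(a0, a1)           # best fit line y = a0 + a1 * x
print(a0 + a1 * 6.0)    # predicted salary for 6 years of experience
```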
SUPERVISED LEARNING
Logistic Regression:
• It can be used for Classification as well as for Regression problems, but mainly used for
Classification problems.
• Logistic regression is used to predict the categorical dependent variable with the help of
independent variables.
• The output of a Logistic Regression problem can only be between 0 and 1.
• Logistic regression can be used where the probabilities of two classes are required.
SUPERVISED LEARNING
• In logistic regression, we pass the weighted sum of inputs through an activation function
that maps values into the range between 0 and 1.
• Such an activation function is known as the sigmoid function, and the curve obtained is
called the sigmoid curve or S-curve.
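The sigmoid is σ(z) = 1 / (1 + e^(−z)). A small sketch of passing a weighted sum of inputs through it (weights, bias, and input are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -0.4]), 0.1    # assumed weights and bias
x = np.array([2.0, 1.0])             # one input vector
p = sigmoid(w @ x + b)               # predicted probability of class 1
print(p, int(p >= 0.5))              # probability, then the hard decision
```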
EVALUATING SUPERVISED LEARNING
For Regression
• Mean Squared Error (MSE): MSE measures the average squared difference between the predicted
values and the actual values. Lower MSE values indicate better model performance.
• Root Mean Squared Error (RMSE): RMSE is the square root of MSE, representing the standard
deviation of the prediction errors. Like MSE, lower RMSE values indicate better model performance.
• Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted
values and the actual values. It is less sensitive to outliers compared to MSE or RMSE.
• R-squared (Coefficient of Determination): R-squared measures the proportion of the variance in the
target variable that is explained by the model. Higher R-squared values indicate better model fit.
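A short NumPy sketch computing all four metrics on assumed toy predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # assumed actual values
y_pred = np.array([2.5, 5.5, 6.5, 9.5])   # assumed model predictions

err = y_true - y_pred
mse = np.mean(err ** 2)                    # Mean Squared Error
rmse = np.sqrt(mse)                        # Root Mean Squared Error
mae = np.mean(np.abs(err))                 # Mean Absolute Error
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R-squared

print(mse, rmse, mae, r2)                  # 0.25, 0.5, 0.5, 0.95
```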
EVALUATING SUPERVISED LEARNING
For Classification
Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It is
calculated by dividing the number of correct predictions by the total number of predictions.
Precision: Precision is the percentage of positive predictions that the model makes that are
correct. It is calculated by dividing the number of true positives by the total number of positive
predictions.
Recall: Recall is the percentage of all positive examples that the model correctly identifies. It is
calculated by dividing the number of true positives by the total number of positive examples.
EVALUATING SUPERVISED LEARNING
For Classification
F1 score: The F1 score is a weighted average of precision and recall. It is calculated by
taking the harmonic mean of precision and recall.
Confusion matrix: A confusion matrix is a table that shows the number of predictions for
each class, along with the actual class labels. It can be used to visualize the performance of
the model and identify areas where the model is struggling.
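A short NumPy sketch (toy labels assumed) deriving the confusion-matrix counts and the four metrics above:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # assumed labels (1 = positive)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # assumed predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))    # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))    # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(np.array([[tp, fn], [fp, tn]]))         # confusion matrix
print(accuracy, precision, recall, f1)
```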
APPLICATIONS OF SUPERVISED LEARNING
• Fraud detection: Supervised learning models can analyze financial transactions and
identify patterns that indicate fraudulent activity, helping financial institutions prevent
fraud and protect their customers.
• Natural language processing (NLP): Supervised learning plays a crucial role in NLP
tasks, including sentiment analysis, machine translation, and text summarization, enabling
machines to understand and process human language effectively.
UNSUPERVISED LEARNING
• Unsupervised learning is a type of machine learning that learns from unlabelled data.
• The goal of unsupervised learning is to discover patterns and relationships in the data
without any explicit guidance.
• Here the task of the machine is to group unsorted information according to similarities,
patterns, and differences without any prior training of data.
• Unlike supervised learning, no training is given to the machine. Therefore, the
machine must find the hidden structure in unlabelled data by itself.
TYPES OF UNSUPERVISED LEARNING
Clustering and dimensionality-reduction techniques:
• Hierarchical clustering
• K-means clustering
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
• Gaussian Mixture Models (GMMs)
TYPES OF UNSUPERVISED LEARNING
• Association rule learning is a type of unsupervised learning that is used to identify patterns
in data.
• Association rule learning algorithms work by finding relationships between different items
in a dataset.
• Some common association rule learning algorithms include:
Apriori Algorithm
Eclat Algorithm
FP-Growth Algorithm
EVALUATING UNSUPERVISED LEARNING
• There are several different metrics that can be used to evaluate unsupervised learning models, but some of
the most common ones include:
• Silhouette score: The silhouette score measures how well each data point is clustered with its own cluster
members and separated from other clusters. It ranges from -1 to 1, with higher scores indicating better
clustering.
• Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the variance between
clusters and the variance within clusters. It ranges from 0 to infinity, with higher scores indicating better
clustering.
• Adjusted Rand index: The adjusted Rand index measures the similarity between two clusterings. It ranges
from -1 to 1, with higher scores indicating more similar clusterings.
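A brief sketch using scikit-learn's implementations of these three scores; the two well-separated toy blobs are assumed for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             adjusted_rand_score)

rng = np.random.default_rng(0)
# Two well-separated toy blobs (assumed data).
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
true_labels = np.array([0] * 50 + [1] * 50)

pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, pred_labels))               # near 1: tight clusters
print(calinski_harabasz_score(X, pred_labels))        # large: well separated
print(adjusted_rand_score(true_labels, pred_labels))  # 1.0: matches ground truth
```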
FEATURE EXTRACTION AND MAPPING
• Feature extraction and mapping are fundamental processes in machine learning that
involve transforming raw data into a format suitable for modelling.
• Feature extraction involves transforming raw data into a more informative representation
with reduced dimensionality.
• Feature mapping involves transforming data into a higher-dimensional space to facilitate
learning.
• Both processes are crucial for effectively training machine learning models.
FEATURE EXTRACTION
• Feature extraction is a process in machine learning and data analysis that involves
identifying and extracting relevant features from raw data.
• These features are later used to create a more informative dataset, which can be further
utilized for various tasks such as Classification, Prediction and Clustering.
• Feature extraction aims to reduce data complexity (often known as “data dimensionality”)
while retaining as much relevant information as possible.
FEATURE EXTRACTION
• This helps to improve the performance and efficiency of machine learning algorithms and
simplify the analysis process.
• Feature extraction may involve the creation of new features and data manipulation to
separate and simplify the use of meaningful features from irrelevant ones.
FEATURE EXTRACTION
What is a feature?
• In machine learning and statistics, features are often called “variables” or “attributes.”
• Relevant features are those that have a bearing on the model's use case.
• In a patient medical dataset, features could be age, gender, blood pressure, cholesterol
level, and other observed characteristics relevant to the patient.
FEATURE EXTRACTION
Feature extraction techniques vary depending on the type of data and the specific problem.
Some common methods include:
• Principal Component Analysis (PCA): A technique used to reduce the dimensionality
of data by projecting it onto a lower-dimensional subspace while preserving the
maximum variance.
• Singular Value Decomposition (SVD): Similar to PCA, SVD decomposes a matrix into
its constituent parts to reduce dimensionality and extract relevant features.
COMMON FEATURE EXTRACTION TECHNIQUES
PCA Steps:
• Standardization: If the features of the dataset are measured in different scales, it's important to
standardize them (subtract mean and divide by standard deviation) to ensure that each feature contributes
equally to the analysis.
• Covariance Matrix Computation: PCA computes the covariance matrix of the standardized data, which
represents the relationships between pairs of features.
• Eigenvalue Decomposition: PCA then performs eigenvalue decomposition on the covariance matrix to
obtain the eigenvalues and corresponding eigenvectors. The eigenvectors represent the directions
(principal components) of maximum variance in the data, and the eigenvalues represent the magnitude of
variance along those directions.
COMMON FEATURE EXTRACTION TECHNIQUES
PCA Steps:
• Selection of Principal Components: The principal components are sorted based on their
corresponding eigenvalues, with the highest eigenvalue indicating the direction of
maximum variance.
• Projection: Finally, the original data is projected onto the selected principal components
to obtain the transformed dataset with reduced dimensionality.
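A minimal NumPy sketch walking through the five steps above; the correlated toy data is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
# Toy data: the first two features are strongly correlated, the third is noise.
X = np.hstack([z, -z, rng.normal(size=(100, 1))]) + 0.1 * rng.normal(size=(100, 3))

# Step 1: standardize (zero mean, unit variance per feature).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data.
C = np.cov(Xs, rowvar=False)

# Step 3: eigenvalue decomposition (eigh, since C is symmetric).
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: sort principal components by descending eigenvalue; keep the top 2.
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]

# Step 5: project the data onto the selected components.
Z = Xs @ W
print(eigvals[order])   # first eigenvalue dominates: most variance lies on PC1
print(Z.shape)          # (100, 2): reduced-dimension representation
```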
COMMON FEATURE EXTRACTION TECHNIQUES
PCA Applications:
• Dimensionality Reduction: PCA is primarily used for reducing the dimensionality of
high-dimensional datasets while retaining as much variance as possible.
• Data Visualization: PCA can be employed to visualize high-dimensional data in lower-
dimensional space (usually 2D or 3D) for easier interpretation and visualization.
• Noise Reduction: PCA can also help in removing noise from the data by retaining only
the principal components that capture meaningful information.
COMMON FEATURE EXTRACTION TECHNIQUES
PCA Considerations:
• Interpretability: While PCA reduces the dimensionality of the data, the resulting
principal components might not always be directly interpretable in terms of the original
features.
• Information Loss: Dimensionality reduction inevitably leads to some loss of
information, and the challenge lies in balancing the reduction in dimensionality with the
preservation of relevant information.
FEATURE MAPPING
• Feature mapping involves defining a mapping function that transforms the original
features into a higher-dimensional space.
• The aim is to make the data more suitable for modelling by transforming it into a space
where it may be easier to find linear or nonlinear relationships.
Methods:
• Feature mapping is commonly used in kernel methods such as Support Vector Machines
(SVMs) and kernelized versions of algorithms like kernel PCA.
FEATURE MAPPING
• The kernel function implicitly maps the data into a higher-dimensional space without
explicitly computing the transformed feature vectors.
Common kernel functions include:
• Polynomial kernel: Maps data into a higher-dimensional space using polynomial
functions.
• Gaussian (RBF) kernel: Maps data into an infinite-dimensional space using Gaussian
functions.
• Sigmoid kernel: Maps data into a higher-dimensional space using sigmoid functions.
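Minimal NumPy sketches of the three kernels; the parameter values (degree, c, gamma, alpha) are assumptions for illustration:

```python
import numpy as np

def polynomial_kernel(x, y, degree=3, c=1.0):
    """Polynomial kernel: k(x, y) = (x.y + c)^degree"""
    return (np.dot(x, y) + c) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2)"""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    """Sigmoid kernel: k(x, y) = tanh(alpha * x.y + c)"""
    return np.tanh(alpha * np.dot(x, y) + c)

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(a, b), rbf_kernel(a, b), sigmoid_kernel(a, b))
```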
FEATURE MAPPING TECHNIQUES