ML Notes Unit 1-2
Introduction:
Basic concepts: Definition of learning systems, Goals and applications of machine learning. Aspects of
developing a learning system: training data, concept representation, function approximation.
Types of Learning: Supervised learning and unsupervised learning. Overview of classification: setup, training,
test, validation dataset, over fitting.
Classification Families: linear discriminative, non-linear discriminative, decision trees, probabilistic
(conditional and generative), nearest neighbor. [T1,
T2][No. of Hrs: 12]
UNIT-II
Logistic regression, Perceptron, Exponential family, Generative learning algorithms, Gaussian discriminant
analysis, Naive Bayes, Support vector machines: Optimal hyper plane, Kernels. Model selection and feature
selection. Combining classifiers: Bagging, boosting (The Ada boost algorithm), Evaluating and debugging
learning algorithms, Classification errors.
[T1, T2][No. of Hrs: 11]
1.1 Basic concepts:
A machine is said to be learning from past experience (data fed in) with respect to some class of tasks if its performance in the given tasks improves with the experience. More formally:
A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.
A computer program which learns from experience is called a machine learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.
The learning process, whether by a human or a machine, can be divided into four
components, namely, data storage, abstraction, generalization and evaluation. These
components and the steps involved in the learning process are described below.
1. Data storage- Facilities for storing and retrieving huge amounts of data are an important
component of the learning process. Humans and computers alike utilize data storage as a
foundation for advanced reasoning.
2. Abstraction- Abstraction is the process of extracting knowledge from the stored data and
translating it into broader concepts and representations.
3. Generalization- Generalization is the process of turning the abstracted knowledge into a
form that can be used for action on new, previously unseen situations.
4. Evaluation- Evaluation is the last component of the learning process. It is the process of
giving feedback to the user to measure the utility of the learned knowledge. This feedback is
then utilised to effect improvements in the whole learning process.
1.1.2 Machine learning algorithms build a mathematical model based on sample data,
known as “training data”, in order to make predictions or decisions without being explicitly
programmed to do so.
Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined machine
learning as “the field of study that gives computers the ability to learn without being
explicitly programmed.”
Machine Learning (ML) is basically the field of computer science with the help of which
computer systems can make sense of data in much the same way as human beings do. In
simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by
using an algorithm or method. The key focus of ML is to allow computer systems to learn
from experience without being explicitly programmed and without human intervention.
Goals of machine learning:
1. To make computers smarter and more intelligent. The more direct objective in this aspect
is to develop systems (programs) for specific practical learning tasks in application domains.
1) Supervised ML Algorithms
Binary Classification: this is where the ML system must choose between exactly two possible
outcomes (e.g. spam or not spam).
Multi-class Classification: this is where the system must assign the data into three or more
distinct categories.
2) Unsupervised ML Algorithms
Dimensionality Reduction: here, variables that are not needed are automatically excluded to
streamline the prediction process.
Association Mining / Skewness Detection: this is where items or datasets that are similar to
one another, or unlike the rest, are detected and identified.
Data Labeling: in these applications, ML systems can take a very small number of labeled
inputs (datasets) and automatically create and apply labels to a much larger dataset.
4) Reinforcement ML Algorithms
Resource Allocation: a key principle of economics is how to do more with a finite number
of resources. These ML systems are used to determine the optimal mix of how the inputs
should be distributed to yield the maximum output possible.
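As a quick illustration of the supervised vs. unsupervised distinction above, here is a minimal sketch; the synthetic data, scikit-learn models and parameters are assumptions chosen for demonstration, not part of the notes:

```python
# A minimal sketch contrasting supervised and unsupervised learning with
# scikit-learn; the synthetic data and model choices are illustrative only.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: the labels y are given to the learner.
clf = LogisticRegression().fit(X, y)
print("supervised predictions:", clf.predict(X[:5]))

# Unsupervised: only X is given; the algorithm finds structure on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:  ", km.labels_[:5])
```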
Effective machine learning techniques can solve complex problems that engineers may
struggle with. Used correctly, machine learning is capable of disrupting and creating new
capabilities in most industries. This can help improve productivity while reducing costs and
increasing ROI.
Machine learning has also seen usage in the business sector when it comes to understanding
their respective customer base.
1. Breaking down the lifecycle of the customer acquisition, from identifying prospects to
becoming paying customers.
2. Knowing when customers leave to acquire products and services from the
competition.
3. Learning how to market the right products/services to the appropriate customer
segment at the proper timing.
4. In retail business, machine learning is used to study consumer behaviour and
understand the buying patterns of the most valuable customers.
5. In finance, banks analyze their past data to build models to use in credit applications,
fraud detection, and the stock market
6. In manufacturing, learning models are used for optimization, control, and
troubleshooting.
7. In medicine, learning programs are used for medical diagnosis.
8. In telecommunications, call patterns are analyzed for network optimization and
maximizing the quality of service.
9. In science, large amounts of data in physics, astronomy, and biology can only be
analyzed fast enough by computers. The World Wide Web is huge and constantly
growing, and searching it for relevant information cannot be done manually.
10. In artificial intelligence, it is used to teach a system to learn and adapt to changes so
that the system designer need not foresee and provide solutions for all possible
situations.
11. It is used to find solutions to many problems in vision, speech recognition, and
robotics.
12. Machine learning methods are applied in the design of computer-controlled vehicles
to steer correctly when driving on a variety of roads.
13. Machine learning methods have been used to develop programmes for playing games
such as chess, backgammon and Go.
1.1.4 History:-
● 1834: In 1834, Charles Babbage, the father of the computer, conceived a device that
could be programmed with punch cards. However, the machine was never built, but
all modern computers rely on its logical structure.
● 1936: In 1936, Alan Turing published a theory of how a machine can determine and
execute a set of instructions.
The era of stored program computers:
● 1945: ENIAC, the first electronic general-purpose computer, was completed in 1945. It
was not a stored-program machine; stored-program computers such as EDSAC (1949)
and EDVAC (1951) followed.
● 1943: In 1943, a neural network was modeled for the first time with an electrical circuit.
In 1950, scientists began putting this idea to work and analyzed how human neurons
might function.
● 1950: In 1950, Alan Turing published a seminal paper, "Computing Machinery and
Intelligence," on the topic of artificial intelligence. In his paper, he asked, "Can
machines think?"
● 1952: Arthur Samuel, a pioneer of machine learning, created a program that helped an
IBM computer play the game of checkers. It performed better the more it played.
● 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.
● The period from 1974 to 1980 was a tough time for AI and ML researchers; this period
is known as the AI winter.
● During this period, machine translation failed to deliver, people lost interest in AI, and
government funding for research was reduced.
● 1959: In 1959, the first neural network was applied to a real-world problem to remove
echoes over phone lines using an adaptive filter.
● 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural network
NETtalk, which was able to teach itself how to correctly pronounce 20,000 words in
one week.
● 1997: IBM's Deep Blue intelligent computer won a chess match against the chess expert
Garry Kasparov, and it became the first computer to have beaten a human chess expert.
● 2006: In the year 2006, computer scientist Geoffrey Hinton gave neural-network research
the new name "deep learning," and nowadays it has become one of the most trending
technologies.
● 2012: In 2012, Google created a deep neural network which learned to recognize the
image of humans and cats in YouTube videos.
● 2014: In 2014, the chatbot "Eugene Goostman" was reported to have passed the Turing
Test. It was the first chatbot to convince 33% of the human judges that it was not a machine.
● 2014: DeepFace was a deep neural network created by Facebook, and they claimed
that it could recognize a person with the same precision as a human can do.
● 2016: AlphaGo beat the world-class Go player Lee Sedol. In 2017 it beat Ke Jie, at the
time the number one player of the game.
● 2017: In 2017, Alphabet's Jigsaw team built an intelligent system that was able to learn
about online trolling. It read millions of comments on different websites in order to learn
to stop online trolling.
To make a system learn, training data must be provided. For example, if we want to make a
system learn to differentiate between animals and birds, we provide it with some examples of
animals and some examples of birds (a labelled dataset, i.e. each example also contains
information on whether it is a bird or an animal).
From these examples, the system learns some concept; for example, birds have wings, a beak
and two legs, whereas animals have four legs and a nose, etc.
After concept representation, the system should be able to identify new objects. However, the
learning system needs to use function approximation, as some new examples may not exactly
match the learned concepts; e.g. an elephant has a trunk (which is neither a nose nor a beak),
so the system should be able to approximate whether it is an animal or a bird.
If we want to design a learning system that follows the learning process, we need to consider
a few design choices. The design choices will be to decide the following key components:
● Training data
● Concept representation
● Function approximation
Training Data:- Training data, also known as a training dataset, learning set or training set,
is an essential component of every machine learning model and helps it make accurate
predictions or perform the desired task.
For a system being designed to recognize handwritten words, the T, P, E would be:
Task T: To recognize and classify handwritten words within the given images.
Performance measure P: Total percent of words being correctly classified by the program.
Training experience E: A set of handwritten word images with given classifications (labels).
For a system being designed to detect spam emails, the T, P, E would be:
Task T: To classify emails as 'spam' or 'not spam'.
Performance measure P: Total percent of mails being correctly classified as 'spam' (or 'not
spam') by the program.
Training experience E: A set of mails with given labels ('spam' / 'not spam').
If we are able to find the factors T, P, and E of a learning problem, we will be able to decide
the following three key components:
1. The exact type of knowledge to be learned (choosing the target function)
2. A representation for this target knowledge (choosing a representation for the target function)
3. A learning mechanism (choosing a function approximation algorithm)
The next important step is choosing the target function. It means that, according to the
knowledge fed to the algorithm, the learner chooses a NextMove function that describes which
legal move should be taken. For example, while playing checkers against an opponent, when
the opponent makes a move, the learning algorithm decides which of the possible legal moves
to take in order to succeed. Let's take the example of a checkers-playing program that can
generate the legal moves (M) from any board state (B). The program needs only to learn how
to choose the best move from among these legal moves. Let's assume a function NextMove
such that:
NextMove: B -> M
Here, B denotes the set of board states and M denotes the set of legal moves given a board
state. NextMove is our target function.
Concept representation:-
Once the algorithm knows all the possible legal moves, the next step is to choose a
representation for the optimized move, e.g. linear equations, a hierarchical graph
representation, a tabular form, etc. Out of the possible moves, the NextMove function should
pick the one that gives the higher success rate. For example, if the machine has 4 possible
moves, it will choose the optimized move that leads it towards success. We need to choose a
representation that the learning algorithm will use to describe the function NextMove. The
function NextMove will be calculated as a linear combination of a set of board features
x1, x2, ..., x6 (for checkers, these are typically simple counts describing the board position).
Thus, the learning program will represent NextMove as a linear function of the form:
NextMove = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
where w0 through w6 are numerical coefficients, or weights, that will be chosen by the
learning algorithm. The learned values of the weights w1 through w6 determine the relative
importance of the various board features in determining the value of the board.
Function approximation:-
An optimized move cannot be chosen from the training data alone. The training data consists
of a set of examples; from these examples the learner approximates which moves should be
chosen, and the outcome of each move provides feedback. For example, when training data
for the game is fed to the algorithm, the algorithm will at first fail or succeed, and from each
failure or success it re-estimates which move should be chosen next and what its success rate
is. To learn the target function NextMove, we require a set of training examples, each
describing a specific board state b and the training value (correct move) y for b. The training
algorithm learns/approximates the coefficients w0, w1, ..., w6 from these training examples
by estimating and adjusting the weights.
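A minimal sketch of this weight-adjustment idea follows, assuming an LMS-style update rule and made-up board features and training values (the notes do not specify the exact update rule or features):

```python
# Minimal sketch of learning the weights w0..w6 of the linear evaluation
# function with an LMS-style update rule; the feature vectors and target
# values below are made up purely for illustration.
import random

def evaluate(board_features, w):
    """NextMove value = w0 + w1*x1 + ... + w6*x6."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], board_features))

def lms_update(w, board_features, y_train, lr=0.01):
    """Adjust each weight in proportion to the prediction error."""
    error = y_train - evaluate(board_features, w)
    w[0] += lr * error
    for i, xi in enumerate(board_features, start=1):
        w[i] += lr * error * xi
    return w

# Hypothetical training examples: (six board features, training value y).
examples = [([4, 3, 1, 0, 2, 1], 10.0), ([2, 5, 0, 2, 1, 3], -8.0)]
w = [0.0] * 7
for _ in range(100):
    for x, y in random.sample(examples, len(examples)):
        w = lms_update(w, x, y)
print([round(wi, 2) for wi in w])
```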
The final design is obtained after the system has gone through a number of examples, failures
and successes, correct and incorrect decisions, and so on. Example: Deep Blue is an ML-based
intelligent computer that won a chess match against the chess expert Garry Kasparov,
becoming the first computer to beat a human chess expert.
Choices in designing the checkers learning problem
1.2 Types of Learning:
a) Supervised learning: The computer is presented with example inputs and their
desired outputs, and the goal is to learn a general rule that maps inputs to outputs.
Supervised learning is used in predictive analytics i.e. classification and regression.
Classification deals with discrete output, and regression deals with continuous output.
Types of Supervised Learning:
1. Classification: It is a supervised learning task where the output has defined labels
(discrete values). For example, in a dataset recording whether a customer purchased a
product, the output Purchased has defined labels, i.e. 0 or 1; 1 means the customer will
purchase and 0 means the customer won't purchase. The goal here is to predict discrete
values belonging to a particular class and evaluate them on the basis of accuracy.
Example: Gmail classifies mails in more than one class like social, promotions, updates,
forums.
2. Regression: It is a supervised learning task where the output has a continuous value.
For example, in a dataset predicting Wind Speed, the output does not take discrete values but
is continuous within a particular range. The goal here is to predict a value as close to the
actual output value as our model can, and evaluation is then done by calculating the error
value. The smaller the error, the greater the accuracy of our regression model.
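A small illustrative sketch of the classification vs. regression distinction, assuming scikit-learn and synthetic data (not datasets used in these notes):

```python
# A small sketch of the classification / regression distinction using
# scikit-learn on synthetic data (datasets and models are illustrative).
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: discrete output labels (0 or 1).
Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print("predicted class:", clf.predict(Xc[:1]))      # e.g. [0] or [1]

# Regression: continuous output values.
Xr, yr = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("predicted value:", reg.predict(Xr[:1]))      # a real number
```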
Advantages of machine learning:
1. Automation of everything.
2. Wide range of applications.
3. Scope of improvement.
4. Efficient handling of data.
5. Best for education.
b) Unsupervised learning: The computer is given only input data, without labeled outputs,
and must find structure in the data on its own. Unsupervised learning is used for more
complex tasks as compared to supervised learning because, in unsupervised learning, we
don't have labeled input data. Its drawbacks are:
1) Unsupervised learning is intrinsically more difficult than supervised learning, as it does not
have corresponding output labels.
2) The result of an unsupervised learning algorithm might be less accurate, as the input data is
not labeled and the algorithm does not know the exact output in advance.
The main differences between Supervised and Unsupervised learning are given below:
● Supervised learning algorithms are trained using labeled data, whereas unsupervised
learning algorithms are trained using unlabeled data.
● A supervised learning model takes direct feedback to check if it is predicting the correct
output or not; an unsupervised learning model does not take any feedback.
● A supervised learning model predicts the output; an unsupervised learning model finds
the hidden patterns in the data.
● In supervised learning, input data is provided to the model along with the output; in
unsupervised learning, only input data is provided to the model.
● The goal of supervised learning is to train the model so that it can predict the output when
given new data; the goal of unsupervised learning is to find the hidden patterns and useful
insights from the unknown dataset.
● Supervised learning needs supervision to train the model; unsupervised learning does not
need any supervision to train the model.
● Supervised learning can be used for cases where we know the inputs as well as the
corresponding outputs; unsupervised learning can be used for cases where we have only
input data and no corresponding output data.
● A supervised learning model generally produces an accurate result; an unsupervised
learning model may give a less accurate result as compared to supervised learning.
● Supervised learning is not close to true artificial intelligence, as we first train the model
for each datum and only then can it predict the correct output; unsupervised learning is
closer to true artificial intelligence, as it learns similarly to how a child learns daily
routine things from experience.
1.2.1 Datasets
Training Dataset:
Sample of data used to fit the model. It is the actual dataset that we use to train the model
(weights and biases). The model sees and learns from this data.
Validation Dataset
Sample of data used to provide an unbiased evaluation of a model fit on the training dataset
while tuning model hyper-parameters. The evaluation becomes more biased as skill on the
validation dataset is incorporated into the model configuration. The model occasionally sees
this data, but it never "learns" from it.
Test Dataset
Sample of data used to provide an unbiased evaluation of a final model fit on the training
dataset. It is used once a model is completely trained (using the train and validation sets).
The test set is generally what is used to evaluate competing models. It contains carefully
sampled data that spans the various classes the model would face when used in the real world.
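A minimal sketch of carving a dataset into train, validation and test sets, assuming scikit-learn and an illustrative 60/20/20 split:

```python
# Split a dataset into train / validation / test subsets with scikit-learn;
# the 60/20/20 proportions are an illustrative choice.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First hold out 20% as the test set, then split the rest 75/25
# so that the final proportions are 60% train, 20% validation, 20% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```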
Overfitting
A statistical model is said to be overfitted when it fits the training data too closely. Such a
model starts learning from the noise and inaccurate data entries in our data set, and it then
fails to categorize new data correctly because of too many details and noise. Overfitting is
more common with non-parametric and non-linear methods, because these types of machine
learning algorithms have more freedom in building the model based on the dataset and can
therefore build unrealistic models. Ways to avoid overfitting include using a linear algorithm
if we have linear data, limiting model complexity (e.g. the maximal depth if we are using
decision trees), and early stopping during the training phase (keep an eye on the validation
loss over the training period; as soon as the loss begins to increase, stop training).
In a nutshell: high variance and low bias.
Underfitting
A model is said to be underfitted when it is too simple to capture the underlying pattern of the
data, so it performs poorly even on the training data. In a nutshell: high bias and low variance.
Remedies include increasing the model complexity and increasing the number of epochs or
the duration of training to get better results.
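An illustrative sketch of under- and over-fitting, assuming a scikit-learn decision tree whose depth is varied on noisy synthetic data; as depth grows, training accuracy keeps rising while validation accuracy eventually drops:

```python
# Vary the depth of a decision tree to see underfitting (too shallow) and
# overfitting (too deep) on noisy synthetic data; settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.15, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 10, None):          # None lets the tree grow fully
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train={tree.score(X_tr, y_tr):.2f} "
          f"val={tree.score(X_val, y_val):.2f}")
```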
Classification is a process of categorizing a given set of data into classes. It can be performed
on both structured and unstructured data. The process starts with predicting the class of given
data points. The classes are often referred to as targets, labels or categories.
• Speech Recognition
• Handwriting Recognition
• Biometric Identification
• Document Classification
Advantages:
• Helps Banks and Financial Institutions to identify defaulters so that they may
approve Cards, Loan, etc.
Disadvantages:
Privacy: When data is collected and shared, there are chances that a company may give some
information about its customers to other vendors or use this information for its own profit.
Accuracy Problem: An accurate model must be selected in order to get the best accuracy and
results.
Linear Discriminant Analysis (LDA)
For example, suppose we have two classes and we need to separate them efficiently. Classes
can have multiple features. Using only a single feature to classify them may result in some
overlapping, so we keep increasing the number of features for proper classification.
Here, LDA uses both the axes (X and Y) to create a new axis and projects the data onto this
new axis in a way that maximizes the separation of the two categories, thereby reducing the
2D graph into a 1D graph.
But LDA fails when the means of the distributions are shared, as it becomes impossible for
LDA to find a new axis that makes both classes linearly separable. In such cases, we use
non-linear discriminant analysis.
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
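A small sketch of LDA and its QDA extension, assuming scikit-learn and synthetic 2-D data (an illustrative setup, not one from the notes):

```python
# Fit LDA and QDA on synthetic 2-D data and project with LDA onto the
# single discriminant axis; data and settings are illustrative.
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = make_blobs(n_samples=300, centers=2, n_features=2, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # per-class covariance
print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))

# LDA can also project the data onto the single discriminant axis:
X_1d = lda.transform(X)       # shape (300, 1): the 2-D data reduced to 1-D
print(X_1d[:3].ravel())
```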
Decision Trees
· The tree is built by splitting the source set, which constitutes the root node of the tree,
into subsets, which constitute the successor children.
· The recursion is completed when the subset at a node has all the same values of the
target variable, or when splitting no longer adds value to the predictions.
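A minimal decision-tree sketch, assuming scikit-learn and the Iris dataset as an illustrative choice; the learned splits print as readable if-else rules, which matches the interpretability point below:

```python
# Train a small decision tree and print its splits as if-else rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree, feature_names=load_iris().feature_names))
print("training accuracy:", tree.score(X, y))
```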
Advantages:
1. Clear Visualization: The algorithm is simple to understand, interpret and visualize, as the
idea is mostly used in our daily lives. The output of a Decision Tree can be easily interpreted
by humans.
2. Simple and easy to understand: Decision Tree looks like simple if-else statements which
are very easy to understand.
3. Decision Tree can be used for both classification and regression problems.
6. Handles non-linear parameters efficiently: Non-linear parameters don't affect the
performance of a Decision Tree, unlike curve-based algorithms. So, if there is high non-
linearity between the independent variables, Decision Trees may outperform other
curve-based algorithms.
8. Decision Tree is usually robust to outliers and can handle them automatically.
9. Less Training Period: Training period is less as compared to Random Forest because it
generates only one tree unlike forest of trees in the Random Forest.
Disadvantages:
1. Overfitting: This is the main problem of the Decision Tree. It generally leads to overfitting
of the data, which ultimately leads to wrong predictions. In order to fit the data (even noisy
data), it keeps generating new nodes and ultimately the tree becomes too complex to
interpret. In this way, it loses its generalization capabilities. It performs very well on the
trained data but starts making a lot of mistakes on unseen data.
2. High variance: As mentioned in point 1, Decision Tree generally leads to the overfitting of
data. Due to the overfitting, there are very high chances of high variance in the output which
leads to many errors in the final estimation and shows high inaccuracy in the results. In order
to achieve zero bias (overfitting), it leads to high variance.
3. Unstable: Adding a new data point can lead to re-generation of the overall tree and all
nodes need to be recalculated and recreated.
4. Affected by noise: Little bit of noise can make it unstable which leads to wrong
predictions.
5. Not suitable for large datasets: If data size is large, then one single tree may grow complex
and lead to overfitting. So in this case, we should use Random Forest instead of a single
Decision Tree
Probabilistic model gives the probability of each class for the given data instance. The class
whose probability comes out to be highest for the instance is assigned. A probabilistic model
can be further classified as conditional/discriminative and generative. A conditional model
uses a discriminant function to calculate the probabilities e.g. logistic regression, whereas a
generative model models the probability distribution of the given dataset and based on this
distribution predicts the probabilities of the instance belonging to each class e.g. Naive
Bayes.
Discriminative Model
The discriminative model is used particularly for supervised machine learning. Also called a
conditional model, it learns the boundaries between classes or labels in a dataset using
probability estimates and maximum likelihood. However, discriminative models are not
capable of generating new data points. The ultimate goal of discriminative models is to
separate one class from another.
Logistic regression: It is one of the most popular machine learning algorithms. Logistic
regression is a mathematical model in statistics that uses previous data to estimate an event’s
probability. The output is a categorical or a discrete value. In principle, logistic regression is
similar to linear regression. However, there is a small difference between them. While linear
regression is used for regression problems, logistic regression is used for classification
problems.
Support vector machine: It is a supervised learning algorithm used both for classification and
regression problems. A type of discriminative modelling, support vector machine (SVM)
creates a decision boundary to segregate n-dimensional space into classes. The best decision
boundary is called a hyperplane created by choosing the extreme points called the support
vectors.
Decision tree: A type of supervised machine learning model where data is continuously split
according to certain parameters. It has two main entities–decision nodes and leaves. While
leaves are the final outcomes or decisions, nodes are the points where data is split.
Random forest: It is a flexible and easy-to-use machine learning algorithm that gives great
results without even using hyper-parameter tuning. Because of its simplicity and diversity, it
is one of the most used algorithms for both classification and regression tasks.
Generative models
Generative models are a class of statistical models that generate new data instances. These
models are used in unsupervised machine learning to perform tasks such as probability and
likelihood estimation, modelling data points, and distinguishing between classes using these
probabilities. Generative models rely on the Bayes theorem to find the joint probability.
Common examples of generative models are:
Hidden Markov model: It is a statistical model known for its effectiveness in modelling the
correlation between adjacent symbols, events or domains, finding major application in speech
recognition and digital communication.
Autoregressive model: An AR model predicts future values based on past values. This kind
of model is good at handling a wide range of time-series patterns.
Generative adversarial network: GANs have gained much popularity recently. A GAN model
has two parts–generator and discriminator. The generative model captures the data
distribution and the discriminative model estimates the probability of sample coming from
training data rather than the generative model.
Difference:
Discriminative models draw boundaries in the data space, while generative models model how
data is placed throughout the space. Mathematically speaking, a discriminative model is
trained by learning parameters that maximise the conditional probability P(Y|X), while a
generative model learns parameters by maximising the joint probability P(X, Y).
Because of their different approaches to machine learning, both are suited for specific tasks.
Generative models are useful for unsupervised learning tasks and discriminative models work
better for supervised learning tasks.
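An illustrative comparison of a discriminative model (logistic regression, which learns P(y|x) directly) and a generative one (Gaussian Naive Bayes, which models P(x|y) and P(y)); the data and model choices are assumptions for demonstration:

```python
# Train one discriminative and one generative classifier on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

disc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # maximises P(y|x)
gen = GaussianNB().fit(X_tr, y_tr)                          # models P(x|y), P(y)

print("discriminative accuracy:", disc.score(X_te, y_te))
print("generative accuracy:   ", gen.score(X_te, y_te))
```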
Nearest Neighbour (K-NN)
K-NN is one of the simplest machine learning algorithms and is based on the supervised
learning technique. The K-NN algorithm assumes similarity between the new case/data and
the available cases and puts the new case into the category that is most similar to the available
categories. It stores all the available data and classifies a new data point based on similarity:
when new data appears, it can be easily classified into a well-suited category. It can be used
for regression as well as for classification. It is a non-parametric algorithm, which means it
does not make any assumption about the underlying data, and a lazy learner algorithm,
because it does not learn from the training set immediately; instead it stores the dataset and,
at the time of classification, performs an action on the dataset.
Steps:
1. Select the number K of neighbours.
2. Calculate the distance (e.g. Euclidean distance) from the new data point to each training point.
3. Take the K nearest neighbours as per the calculated distance.
4. Among these K neighbours, count the number of data points in each category.
5. Assign the new data point to the category for which the number of neighbours is maximum.
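A minimal K-NN sketch following the steps above, assuming a tiny made-up 2-D dataset and K = 3:

```python
# Classify a new point by majority vote among its K nearest neighbours.
import numpy as np
from collections import Counter

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

def knn_predict(x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # K closest points
    votes = Counter(y_train[nearest])                 # count per category
    return votes.most_common(1)[0][0]                 # majority class

print(knn_predict(np.array([2, 2])))   # -> "A"
print(knn_predict(np.array([6, 5])))   # -> "B"
```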
Advantages
· simple to implement
Disadvantages
· computation cost is high because of calculating the distance between the new data
point and all the training samples
UNIT-II
2.1 Logistic Regression
o Logistic Regression is used to predict a categorical dependent variable; the outcome can be
Yes or No, 0 or 1, true or False, etc., but instead of giving the exact value as 0 and 1, it gives
the probabilistic values which lie between 0 and 1.
o Logistic Regression is much similar to Linear Regression except in how they are used:
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image is showing the logistic function:
Logistic Function (Sigmoid Function):
● The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
● It maps any real value into another value within a range of 0 and 1.
● The value of the logistic regression must be between 0 and 1, which cannot go beyond
this limit, so it forms a curve like the "S" form. The S-form curve is called the
Sigmoid function or the logistic function.
● In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and a
value below the threshold values tends to 0.
The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
● The equation of the straight line is: y = b0 + b1x1 + b2x2 + ... + bnxn
● In Logistic Regression, y can only be between 0 and 1, so we take the odds y / (1 - y),
which is 0 for y = 0 and infinity for y = 1.
● But we need a range between -[infinity] and +[infinity], so taking the logarithm of the
equation, it becomes: log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
● Binomial: In binomial Logistic Regression, there can be only two possible types of the
dependent variable, such as 0 or 1, Pass or Fail, etc.
● Multinomial: In multinomial Logistic Regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
● Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types
of the dependent variable, such as "low", "Medium", or "High".
Unlike linear regression, which outputs continuous number values, logistic regression
transforms its output using the logistic sigmoid function to return a probability value, which
can then be mapped to two or more discrete classes.
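A small sketch of the sigmoid function and a thresholded logistic-regression-style prediction, assuming made-up weights purely for illustration:

```python
# Map a linear combination of features to a probability with the sigmoid,
# then threshold it at 0.5; the weights below are made-up numbers.
import numpy as np

def sigmoid(z):
    """Map any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

b = np.array([-1.0, 2.0, 0.5])       # hypothetical b0, b1, b2
x = np.array([1.0, 0.8, 1.5])        # leading 1.0 multiplies the intercept b0

p = sigmoid(b @ x)                   # predicted probability of class 1
label = int(p >= 0.5)                # threshold at 0.5
print(f"probability={p:.3f}, predicted class={label}")
```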
2.2 Perceptron
The perceptron computes a weighted sum of its inputs and applies a threshold θ:
f(x) = +1 if w·x > θ
f(x) = -1 if w·x ≤ θ
The perceptron is a linear binary classifier. The training phase of perceptron performs
multiple iterations on the training data points. Each iteration goes through all the training
instances, and if a misclassified instance is encountered, the parameters of the hyperplane are
changed so that the misclassified instance moves closer to the hyperplane or maybe even
across the hyperplane onto the correct side.
This process is repeated until all the instances in the training data are classified correctly.
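A minimal perceptron training loop matching this description, assuming toy linearly separable data and a learning rate of 0.1:

```python
# Iterate over the training points and nudge the hyperplane whenever a
# point is misclassified; data and learning rate are illustrative.
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])          # target labels are +1 / -1

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        pred = 1 if w @ xi + b > 0 else -1
        if pred != yi:                 # misclassified: move the hyperplane
            w += lr * yi * xi
            b += lr * yi
            errors += 1
    if errors == 0:                    # stop once everything is correct
        break

print("weights:", w, "bias:", b, "epochs:", epoch + 1)
```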
2.4 Generative Learning Algorithms
Generative Adversarial Networks (GANs) are a powerful class of neural networks that are
used for unsupervised learning. They were developed and introduced by Ian J. Goodfellow in
2014. GANs are basically made up of a system of two competing neural network models
which compete with each other and are able to analyze, capture and copy the variations
within a dataset. Generative approaches try to build a model of the positives and a model of
the negatives. You can think of a model as a “blueprint” for a class. A decision boundary is
formed where one model becomes more likely. As these create models of each class they can
be used for generation.
To create these models, a generative learning algorithm learns the joint probability
distribution P(x, y):
P(x, y) = P(x | y) P(y) ...(i)
By Bayes' theorem, the posterior probability of a class is
P(y | x) = P(x | y) P(y) / P(x) ...(ii)
Since, to predict a class label y, we are only interested in the arg max over y, the denominator
P(x) can be removed from (ii).
Hence, to predict the label y from the training example x, generative models evaluate:
y = argmax over y of P(x | y) P(y)
The most important part in the above is P(x | y). This is what allows the model to be
generative: P(x | y) tells us which features x are likely given class y. Hence, using the joint
probability distribution (i), given a y, you can calculate ("generate") a corresponding x. For
this reason they are called generative models.
Generative Adversarial Networks (GANs) can be broken down into three parts:
· Generative: To learn a generative model, which describes how data is generated in
terms of a probabilistic model.
· Adversarial: The training of a model is done in an adversarial setting.
· Networks: Use deep neural networks as the artificial intelligence (AI) algorithms for
training purpose.
2.5 Gaussian Discriminant Analysis
LDA/GDA makes some simplifying assumptions about your data:
· That your data is Gaussian, i.e. each variable is shaped like a bell curve when plotted.
· That each attribute has the same variance, i.e. the values of each variable vary around the
mean by the same amount on average.
With these assumptions, the LDA model estimates the mean and variance from your data for
each class.
Steps:
1. Train on the data and obtain a discriminant function, which tells which class a data point
has a higher probability of belonging to.
2. Compute μ and σ for each class, then calculate the probability that a data point belongs to
each class; the class with the highest probability is chosen.
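A tiny numpy/scipy sketch of these two steps, assuming made-up 1-D data and one Gaussian per class:

```python
# Estimate a Gaussian per class and classify a point by the highest
# prior * likelihood; the 1-D data is made up for illustration.
import numpy as np
from scipy.stats import norm

class_a = np.array([1.0, 1.2, 0.8, 1.1, 0.9])     # samples of class A
class_b = np.array([3.0, 3.2, 2.8, 3.1, 2.9])     # samples of class B

params = {}
for name, data in (("A", class_a), ("B", class_b)):
    params[name] = (data.mean(), data.std(ddof=1), len(data))

total = len(class_a) + len(class_b)

def classify(x):
    scores = {name: (n / total) * norm.pdf(x, mu, sigma)   # P(y) * P(x|y)
              for name, (mu, sigma, n) in params.items()}
    return max(scores, key=scores.get)

print(classify(1.05))   # -> "A"
print(classify(2.7))    # -> "B"
```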
2.6 Naive Bayes,
Naive Bayes is a supervised learning algorithm, based on Bayes' theorem and used for solving
classification problems. It is mainly used in text classification involving high-dimensional
training datasets. It is one of the simplest and most effective classification algorithms. It is a
probabilistic classifier, which means it predicts on the basis of the probability of an object.
Examples are spam filtration, sentiment analysis, and classifying articles. It assumes that the
occurrence of a certain feature is independent of the occurrence of the other features.
By Bayes' theorem, P(H|E) = P(E|H) * P(H) / P(E). Here P(H) is the prior probability of the
hypothesis H, P(E|H) is the likelihood of the evidence E given H, and P(E) is the probability
of the evidence; these are used to calculate the conditional (posterior) probability P(H|E).
In Naive Bayes, we have to predict the class (C) of an example(X), so the equation can be re-
written as
P(C|X) = P(X|C) * P(C) / P(X)
Let's understand Naive Bayes by an example. Suppose, we are given the following dataset for
training the classifier, where "Play" is the output with 2 labels "yes" and "no".
So we have to build a classifier using the above training set, i.e. we have to calculate the
probabilities P(C), P(X|C) and P(X).
As we have only two classes in our training dataset, P(C) takes the values P(yes) and P(no).
P(yes) = 9/14
P(no) = 5/14
P(sunny) = 5/14
P(overcast) = 4/14
P(rainy) = 5/14
P(hot) = 4/14
P(mild) = 6/14
P(cool) = 4/14
P(high) = 7/14
P(normal) = 7/14
P(false) = 8/14
P(true) = 6/14
We have obtained these probabilities from the training dataset. Now, we want to classify a
new, unclassified example.
Let the example be X = {sunny, cool, high, true}; we have to predict its class, which can be
'yes' or 'no'.
Case I: Yes
P(yes|X) ∝ P(sunny|yes) * P(cool|yes) * P(high|yes) * P(true|yes) * P(yes)
        = 2/9 * 3/9 * 3/9 * 3/9 * 9/14
Case II: No
P(no|X) ∝ P(sunny|no) * P(cool|no) * P(high|no) * P(true|no) * P(no)
        = 3/5 * 1/5 * 4/5 * 3/5 * 5/14
Result:
P(yes|sunny,cool,high,true) = 0.00529
P(no|sunny,cool,high,true) = 0.02057
Since P(no|X) > P(yes|X), the example is classified as 'no'.
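A short sketch reproducing the hand computation above, using the same probabilities listed in the example:

```python
# Reproduce the Naive Bayes hand calculation for X = {sunny, cool, high, true}.
p_yes = 9 / 14
p_no = 5 / 14

# Conditional probabilities of the four feature values given each class.
p_x_given_yes = (2/9) * (3/9) * (3/9) * (3/9)
p_x_given_no = (3/5) * (1/5) * (4/5) * (3/5)

score_yes = p_x_given_yes * p_yes
score_no = p_x_given_no * p_no
print(f"P(yes|X) ~ {score_yes:.5f}, P(no|X) ~ {score_no:.5f}")
print("Predicted class:", "yes" if score_yes > score_no else "no")
```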
Types
1. Gaussian:
○ assumes that features follow a normal distribution
○ if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution
2. Multinomial:
○ used when the data is multinomial distributed
○ used for document classification problems
○ uses the frequency of words for the predictors
3. Bernoulli:
○ works similarly to the Multinomial classifier, but the predictor variables are
independent Boolean variables
2.7 Support Vector Machines
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
used for Classification as well as Regression problems. However, it is primarily used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in the
correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector
Machine.
Types of SVM
SVM can be of two types:
● Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then such data
is termed as linearly separable data, and classifier is used called as Linear SVM
classifier.
● Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which
means if a dataset cannot be classified by using a straight line, then such data is
termed as non-linear data and classifier used is called as Non-linear SVM classifier.
o The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features, the hyperplane will be a straight line, and if there are 3 features, the
hyperplane will be a 2-dimensional plane.
o We always create a hyperplane that has a maximum margin, which means the
maximum distance between the data points.
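A minimal linear-SVM sketch, assuming scikit-learn, synthetic data and illustrative kernel/C choices:

```python
# Fit a linear SVM and inspect its support vectors; settings are illustrative.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", clf.n_support_)
print("prediction for a new point:", clf.predict([[0.0, 2.0]]))
```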
Kernels.
● SVM algorithms use a set of mathematical functions that are defined as the kernel
● function of kernel is to take data as input and transform it into the required form
● different SVM algorithms use different types of kernel functions. These functions can
be different types. For example linear, nonlinear, polynomial, radial basis function
(RBF), and sigmoid
● most used type of kernel function is RBF. Because it has localized and finite response
along the entire x-axis
● kernel functions return the inner product between two points in a suitable feature
space, thus defining a notion of similarity, with little computational cost even in very
high-dimensional spaces
● examples:
1. Polynomial
2. Gaussian
4. Laplace RBF
A radial basis function is a real-valued function whose value depends only on the distance
from the origin. Any function that satisfies the property ϕ(x)=ϕ(||x||) is a radial function.
There are various types of RBF: Gaussian, Multi-quadratic, Inverse quadratic, etc.
Gaussian Kernel
The Gaussian kernel is an example of RBF kernel. The adjustable parameter sigma plays a
major role in the performance of the kernel, and should be carefully tuned to the problem at
hand. If over-estimated the exponential will behave almost linearly and the higher
dimensional projection will start to lose its non-linear power. On the other hand, if under-
estimated, the function will lack regularization and the decision boundary will be highly
sensitive to noise in training data.
Exponential kernel
The exponential kernel is closely related to the Gaussian kernel, with only the square of the
norm left out. It is also a radial basis function kernel.
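A small sketch of the Gaussian (RBF) and exponential kernels described above, computed directly as a similarity between two points (sigma is an illustrative value):

```python
# Two kernel functions as direct similarity computations between points.
import numpy as np

def gaussian_rbf(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def exponential_kernel(x, z, sigma=1.0):
    """Like the Gaussian kernel but without squaring the norm."""
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))

a, b = np.array([1.0, 2.0]), np.array([1.5, 1.0])
print("Gaussian RBF:", gaussian_rbf(a, b))
print("Exponential :", exponential_kernel(a, b))
```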
2.8 Model selection and feature selection
Model selection
· Given a set of models, choose the model that is expected to give the best results.
· This includes choosing among different learning algorithms, e.g. choosing kNN over other
classification algorithms.
Feature selection
· Choose the subset of the available features that gives the best results. Note: feature
selection is different from feature extraction; the latter transforms the original features into
new ones rather than selecting a subset of them.
· How?
· Filter methods: use some criteria to rank features and keep the top-ranked features.
· Wrapper methods: require repeated runs of the learning algorithm with different sets of
features.
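A minimal model-selection sketch, assuming scikit-learn: compare two candidate models with cross-validation and keep the one with the better mean score:

```python
# Compare candidate models by cross-validated accuracy and pick the best.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

candidates = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```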
2.9 Combining classifiers: Bagging, boosting (The Ada boost algorithm),
Ensemble Models
Bagging
Its objective is to create several subsets of data from the training sample, chosen randomly
with replacement. Each subset of data is used to train its own decision tree, so we get an
ensemble of different models. The average (or majority vote) of all the predictions from the
different trees is used, which is more robust than a single decision tree classifier.
Steps:
1. Suppose there are N observations and M features in the training data set; a sample from
the training data set is taken randomly with replacement.
2. A subset of features is selected randomly, and whichever feature gives the best split is
used to split the node iteratively.
3. The tree is grown to the largest extent possible.
4. The above steps are repeated several times, and the prediction is given based on the
aggregation of the predictions from all the trees.
Advantages:
Reduces the variance of the model and so helps to avoid overfitting.
Disadvantages:
Since the final prediction is based on the mean of the predictions from the subset trees, it
won't give precise values for the classification and regression models.
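A short bagging sketch, assuming scikit-learn: many trees trained on bootstrap samples and combined by voting, compared with a single tree:

```python
# Compare a single decision tree with a bagged ensemble of trees
# (BaggingClassifier's default base estimator is a decision tree).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bagged = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("single tree :", single.score(X_te, y_te))
print("bagged trees:", bagged.score(X_te, y_te))
```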
Boosting
It is used to create a collection of predictors. Learners are learned sequentially, with early
learners fitting simple models to the data and then analysing the data for errors. Consecutive
trees are fit and, at every step, the goal is to improve the accuracy over the prior tree. When an
input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is
more likely to classify it correctly. This process converts weak learners into a better-performing
model.
Steps:
1. Draw a random subset of training samples d1 without replacement from the training set D
to train a weak learner C1.
2. Draw a second random training subset d2 without replacement from the training set and
add a portion of the samples that were previously misclassified by C1, to train a second
weak learner C2.
3. Find the training samples d3 in the training set D on which C1 and C2 disagree, to train a
third weak learner C3.
4. Combine all the weak learners via majority voting.
Advantages
Reduces bias and usually achieves higher accuracy than any individual weak learner.
Disadvantages
Prone to over-fitting.
Adaboost
Weak models are added sequentially, trained using the weighted training data.
The training weights are updated giving more weight to incorrectly predicted instances, and
less weight to correctly predicted instances.
The process continues until a pre-set number of weak learners have been created (a user
parameter) or no further improvement can be made on the training dataset.
Once completed, you are left with a pool of weak learners each with a stage value.
A stage value is calculated for the trained model which provides a weighting for any
predictions that the model makes.
Predictions are made by calculating the weighted average of the weak classifiers.
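A brief AdaBoost sketch, assuming scikit-learn's AdaBoostClassifier with its default decision-stump weak learners and an illustrative number of estimators:

```python
# Fit AdaBoost and inspect the stage weights assigned to the weak learners.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each weak learner gets a stage weight; predictions are a weighted vote.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", ada.score(X_te, y_te))
print("first few stage weights:", ada.estimator_weights_[:5])
```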
2.10 Evaluating and debugging learning algorithms, Classification errors
Classification Accuracy
Classification accuracy is the ratio of the number of correct predictions to the total number of
predictions made. It works well only if there are an equal number of samples belonging to
each class.
For example, consider that there are 98% samples of class A and 2% samples of
class B in our training set. Then our model can easily get 98% training accuracy by
simply predicting every training sample as belonging to class A.
When the same model is tested on a test set with 60% samples of class A and
40% samples of class B, then the test accuracy would drop down to 60%.
Classification Accuracy is great, but gives us the false sense of achieving high
accuracy.
The real problem arises when the cost of misclassifying the minority class samples is very
high. If we deal with a rare but fatal disease, the cost of failing to diagnose the disease in a
sick person is much higher than the cost of sending a healthy person for more tests.
Logarithmic Loss
Logarithmic Loss (Log Loss) penalises false classifications and works well for multi-class
classification. For N samples and M classes it is computed as:
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)
where y_ij indicates whether sample i belongs to class j (1 or 0), and p_ij is the probability
the classifier assigns to sample i belonging to class j.
Log Loss has no upper bound and it exists on the range [0, ∞). A Log Loss nearer to 0
indicates higher accuracy, whereas a Log Loss farther from 0 indicates lower accuracy.
In general, minimising Log Loss gives greater accuracy for the classifier.
Confusion Matrix
A confusion matrix describes the complete performance of a classifier in terms of four
outcomes:
True Positives : The cases in which we predicted YES and the actual output was
also YES.
True Negatives : The cases in which we predicted NO and the actual output was
NO.
False Positives : The cases in which we predicted YES and the actual output was
NO.
False Negatives : The cases in which we predicted NO and the actual output was
YES.
Accuracy can be calculated from the matrix by summing the values lying across the
"main diagonal" (TP + TN) and dividing by the total number of samples.
The Confusion Matrix forms the basis for the other types of metrics.
Area Under Curve
Area Under Curve(AUC) is one of the most widely used metrics for evaluation. It is
used for binary classification problem. AUC of a classifier is equal to the
probability that the classifier will rank a randomly chosen positive example
higher than a randomly chosen negative example. Before defining AUC, let us
understand two basic terms :
True Positive Rate (Sensitivity) : True Positive Rate is defined as TP/ (FN+TP). True
Positive Rate corresponds to the proportion of positive data points that are
correctly considered as positive, with respect to all positive data points.
False Positive Rate : False Positive Rate is defined as FP / (FP+TN). False Positive
Rate corresponds to the proportion of negative data points that are mistakenly
considered as positive, with respect to all negative data points.
False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR
and TPR are both computed at varying threshold values such as (0.00, 0.02, 0.04,
..., 1.00) and a graph is drawn. AUC is the area under the curve obtained by plotting
True Positive Rate against False Positive Rate at these different thresholds in [0, 1].
As evident, AUC has a range of [0, 1]. The greater the value, the better is the
performance of our model.
F1 Score
F1 Score is the harmonic mean between precision and recall. The range for F1
Score is [0, 1]. It tells you how precise your classifier is (how many of its positive
predictions are correct), as well as how robust it is (whether it misses a significant
number of positive instances).
High precision but lower recall gives you an extremely accurate classifier, but one that
misses a large number of instances that are difficult to classify. The greater the
F1 Score, the better the performance of our model. Mathematically, it can be
expressed as:
F1 = 2 * (precision * recall) / (precision + recall)
Precision: It is the number of correct positive results divided by the number of positive
results predicted by the classifier, i.e. Precision = TP / (TP + FP).
Recall: It is the number of correct positive results divided by the number of all
relevant samples (all samples that should have been identified as positive), i.e.
Recall = TP / (TP + FN).
Mean Absolute Error
Mean Absolute Error (MAE) is the average of the absolute differences between the original
values and the predicted values. It gives us a measure of how far the predictions were from
the actual output. However, it doesn't give us any idea of the direction of the error, i.e.
whether we are under-predicting or over-predicting the data. Mathematically, it is
represented as:
MAE = (1/N) * Σ |y_i - ŷ_i|
Mean Squared Error
Mean Squared Error (MSE) is quite similar to Mean Absolute Error; the only difference is
that MSE takes the average of the square of the difference between the original values and
the predicted values:
MSE = (1/N) * Σ (y_i - ŷ_i)^2
The advantage of MSE is that it is easier to compute the gradient, whereas Mean Absolute
Error requires complicated linear programming tools to compute the gradient. As we take the
square of the error, the effect of larger errors becomes more pronounced than that of smaller
errors, so the model can now focus more on the larger errors.
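A compact sketch computing the metrics discussed above, assuming scikit-learn and made-up true/predicted labels and values (purely illustrative numbers):

```python
# Compute the classification and error metrics from this section.
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score, log_loss,
                             mean_absolute_error, mean_squared_error)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted P(class=1)

print("accuracy :", accuracy_score(y_true, y_pred))
print("confusion:\n", confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob))
print("log loss :", log_loss(y_true, y_prob))

# Regression-style errors on continuous predictions:
y_val = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
print("MAE:", mean_absolute_error(y_val, y_hat))
print("MSE:", mean_squared_error(y_val, y_hat))
```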