
BTCOC603: Machine Learning

[Unit 1] [7 Hours]

Basic definitions, types of learning, hypothesis space and inductive bias, evaluation, cross-
validation, Linear regression, Decision trees, over fitting, Instance based learning, Feature
reduction, Collaborative filtering based recommendation

[Unit 2] [7 Hours]

Probability and Bayes learning, Logistic Regression, Support Vector Machine, Kernel function
and Kernel SVM.

[Unit 3] [7 Hours]

Perceptron, multilayer network, back propagation, introduction to deep neural network.

[Unit 4] [7 Hours]

Computational learning theory, PAC learning model, Sample complexity, VC Dimension,


Ensemble learning.

[Unit 5] [7 Hours]

Clustering k-means, adaptive hierarchical clustering, Gaussian mixture model.

Text Book:

1. Tom Mitchell, Machine Learning, First Edition, McGraw Hill, 1997.

Reference Books:

1. Ethem Alpaydin, Introduction to Machine Learning, 2nd Edition, MIT Press, 2010.


[Unit 1]

What Is Machine Learning?

To solve a problem on a computer, we need an algorithm. An algorithm is a sequence of
instructions that should be carried out to transform the input to output. For example, one can devise
an algorithm for sorting. The input is a set of numbers and the output is their ordered list. For the
same task, there may be various algorithms and we may be interested in finding the most efficient
one, requiring the least number of instructions or memory or both. For some tasks, however, we
do not have an algorithm—for example, to tell spam emails from legitimate emails. We know what
the input is: an email document that in the simplest case is a file of characters. We know what the
output should be: a yes/no output indicating whether the message is spam or not. We do not know
how to transform the input to the output. What can be considered spam changes in time and from
individual to individual. What we lack in knowledge, we make up for in data. We can easily
compile thousands of example messages some of which we know to be spam and what we want is
to “learn” what constitutes spam from them. In other words, we would like the computer
(machine) to extract automatically the algorithm for this task. There is no need to learn to sort
numbers, we already have algorithms for that; but there are many applications for which we do not
have an algorithm but do have example data. With advances in computer technology, we currently
have the ability to store and process large amounts of data, as well as to access it from physically
distant locations over a computer network. Most data acquisition devices are digital now and record
reliable data. Think, for example, of a supermarket chain that has hundreds of stores all over a
country selling thousands of goods to millions of customers. The point of sale terminals record the
details of each transaction: date, customer identification code, goods bought and their amount, total
money spent, and so forth. This typically amounts to gigabytes of data every day. What the
supermarket chain wants is to be able to predict who are the likely customers for a product. Again,
the algorithm for this is not evident; it changes in time and by geographic location. The stored data
becomes useful only when it is analyzed and turned into information that we can make use of, for
example, to make predictions.

We do not know exactly which people are likely to buy this ice cream flavor, or the next book of
this author, or see this new movie, or visit this city, or click this link. If we knew, we would not
need any analysis of the data; we would just go ahead and write down the code. But because we
do not, we can only collect data and hope to extract the answers to these and similar questions
from data. We do believe that there is a process that explains the data we observe. Though we do
not know the details of the process underlying the generation of data—for example, consumer
behavior—we know that it is not completely random. People do not go to supermarkets and buy
things at random. When they buy beer, they buy chips; they buy ice cream in summer and spices
for Glühwein in winter. There are certain patterns in the data. We may not be able to identify the
process completely, but we believe we can construct a good and useful approximation. That
approximation may not explain everything, but may still be able to account for some part of the
data. We believe that though identifying the complete process may not be possible, we can still
detect certain patterns or regularities. This is the niche of machine learning. Such patterns may
help us understand the process, or we can use those patterns to make predictions: Assuming that
the future, at least the near future, will not be much different from the past when the sample data
was collected, the future predictions can also be expected to be right. Application of machine
learning methods to large databases is called data mining. The analogy is that a large volume of
earth and raw material is extracted from a mine, which when processed leads to a small amount of
very precious material; similarly, in data mining, a large volume of data is processed to construct
a simple model with valuable use, for example one having high predictive accuracy. Its application
areas are abundant: In addition to retail, in finance banks analyze their past data to build models
to use in credit applications, fraud detection, and the stock market. In manufacturing, learning
models are used for optimization, control, and troubleshooting. In medicine, learning programs are
used for medical diagnosis. In telecommunications, call patterns are analyzed for network
optimization and maximizing the quality of service. In science, large amounts of data in physics,
astronomy, and biology can only be analyzed fast enough by computers. The World Wide Web is
huge; it is constantly growing, and searching for relevant information cannot be done manually.

Types of Learning:-

1. Supervised learning

Gartner, a business consulting firm, predicts that supervised learning will remain the most utilized
type of machine learning among enterprise information technology leaders in 2022 [2]. This type of
machine learning feeds historical input and output data in machine learning algorithms, with
processing in between each input/output pair that allows the algorithm to shift the model to create
outputs as closely aligned with the desired result as possible. Common algorithms used during
supervised learning include neural networks, decision trees, linear regression, and support vector
machines.

This machine learning type got its name because the machine is “supervised” while it's learning,
which means that you’re feeding the algorithm information to help it learn. The outcome you
provide the machine is labeled data, and the rest of the information you give is used as input
features.

For example, if you were trying to learn about the relationships between loan defaults and borrower
information, you might provide the machine with 500 cases of customers who defaulted on their
loans and another 500 who didn't. The labeled data “supervises” the machine to figure out the
information you're looking for.

Supervised learning is effective for a variety of business purposes, including sales forecasting,
inventory optimization, and fraud detection. Some examples of use cases include:

• Predicting real estate prices

• Classifying whether bank transactions are fraudulent or not

• Finding disease risk factors

• Determining whether loan applicants are low-risk or high-risk

• Predicting the failure of industrial equipment's mechanical parts

2. Unsupervised learning

While supervised learning requires users to help the machine learn, unsupervised learning doesn't
use the same labeled training sets and data. Instead, the machine looks for less obvious patterns in
the data. This machine learning type is very helpful when you need to identify patterns and use
data to make decisions. Common algorithms used in unsupervised learning include Hidden
Markov models, k-means, hierarchical clustering, and Gaussian mixture models.
Using the example from supervised learning, let's say you didn't know which customers did or
didn't default on loans. Instead, you'd provide the machine with borrower information and it would
look for patterns between borrowers before grouping them into several clusters.

This type of machine learning is widely used to create predictive models. Common applications
also include clustering, which creates a model that groups objects together based on specific
properties, and association, which identifies the rules existing between the clusters. A few example
use cases include:

• Creating customer groups based on purchase behavior

• Grouping inventory according to sales and/or manufacturing metrics

• Pinpointing associations in customer data (for example, customers who buy a specific style
of handbag might be interested in a specific style of shoe)

3. Reinforcement learning

Reinforcement learning is the closest machine learning type to how humans learn. The algorithm
or agent used learns by interacting with its environment and getting a positive or negative reward.
Common algorithms include temporal difference, deep adversarial networks, and Q-learning.

Going back to the bank loan customer example, you might use a reinforcement learning algorithm
to look at customer information. If the algorithm classifies them as high-risk and they default, the
algorithm gets a positive reward. If they don't default, the algorithm gets a negative reward. In the
end, both instances help the machine learn by understanding both the problem and environment
better.

Gartner notes that most ML platforms don't have reinforcement learning capabilities because it
requires higher computing power than most organizations have [2]. Reinforcement learning is
applicable in areas capable of being fully simulated that are either stationary or have large volumes
of relevant data. Because this type of machine learning requires less management than supervised
learning, it’s viewed as easier to work with when dealing with unlabeled data sets. Practical applications
for this type of machine learning are still emerging. Some examples of uses include:

• Teaching cars to park themselves and drive autonomously


• Dynamically controlling traffic lights to reduce traffic jams

• Training robots to learn policies using raw video images as input that they can use to
replicate the actions they see

Why and How to Predict: Hypothesis Space and Inductive Bias

Predictions have become essential in our lives today, from predicting the weather to predicting the
status of a stock in finance. But how do we actually predict, and what tools are required for
successful prediction? The answer is that while making predictions we need, or are given, examples
or data. The examples are of the form (x̂, y), where for a particular instance x̂ comprises the
values of the different features of that instance and y is the value of the output attribute.
Features refer to properties that describe each instance. Another way of writing this is (x̂, f(x̂)),
which reflects the fact that the output of an instance is a function of the input feature vector.
This function is what we are trying to learn in machine learning.

Inductive Bias

Consider the two types of supervised learning problems, classification and regression, which are
distinguished by the output attribute type (discrete valued or continuous valued). In
classification, f(x̂) is discrete, while in regression f(x̂) is continuous. Apart from classification
and regression, in some cases we may want to determine the probability of a particular value of y;
in such probability-estimation problems, our f(x̂) is that probability. These are the types of
learning problems we are trying to look at. We call this induction because we are given some data
and are trying to identify a function that can explain the data. Unless we can see all the instances
(all possible data points) or we make some restrictive assumptions about the language in which the
hypothesis is expressed, the problem is not well defined; such restrictive assumptions are therefore
called an inductive bias.

What Is Feature Space?

Figure 1 (source: slide by Jesse Davis, University of Washington)

We know that features refer to properties that describe each instance. Often we have multiple
features, which together form a feature vector. For example, for a particular task we may describe
all the instances with 10 features, so the feature vector will be a one-dimensional vector of size
10. Based on this we can define a feature space. Consider for simplicity that we have two features,
x1 and x2. These features define a two-dimensional space, and the space defined by the features is
known as the feature space. For n features we can define an n-dimensional space.

Example Situation

Let us look at a classification problem. Assume that it is a 2-class classification problem, that is,
we are provided with a number of instances or examples: some of them belong to class 1 and the
others belong to class 2.

We are also provided with a training set which comprises a subset of these instances; some of them
are marked class 1 and some are marked class 2, and we can say that class 1 is positive and class 2
is negative. We can map the different points in the feature space as shown in Figure 1.

Our objective is for the function to predict, when a new instance is provided, whether this instance
will be positive or negative. The function should separate the positive from the negative with the
help of a curve or a line. Depending upon the separating line, we can determine whether the new
instance is positive or negative.

Let's again consider Figure 1 and mark some test points (the question-mark points in Figure 2).
What we have to determine in the prediction problem is the class of these points (positive or
negative). In order to answer the prediction problem we have to come up with a curve, that is, a
function. Consider the function marked by the pink line. According to this function, the test points
to its right would be negative (based on the trends) and those to its left would be positive.

Figure 2 (source: slide by Jesse Davis, University of Washington)

We can consider any curve, but that curve must appropriately predict (not necessarily with 100%
accuracy) the class of the test points. This function (the pink line) is also referred to as a
hypothesis, and we use this hypothesis for predictions.

Hypothesis Space

Instead of this particular line, we could have used other functions as the hypothesis. As shown in
Figure 3, all of these are possible functions that we could have found. The set of all such legal
functions (those that are possible) defines the hypothesis space. In a particular learning problem
we first define the hypothesis space (the class of functions we are going to consider), and then,
given the data points, we try to come up with the best hypothesis.

Figure 3 (source: slide by Jesse Davis, University of Washington)


Other Decision Curves

Inductive learning is a way to predict the class of the test points using the hypothesis space.
Various types of representation have been considered for making predictions. One example is the
linear discriminator (discussed above), which acts as a separator between two classes. Another
structure that is used is a decision tree: a tree in which at every node we take a decision based on
the value of an attribute and, based on this, move to different branches of the tree; every leaf
node is labelled with a value of y. Other representations are multivariate representations, neural
networks, the single-layer perceptron (the basic unit of a neural network) and the multi-layer
perceptron.

Source: https://www.researchgate.net/figure/The-different-regions-in-hypothesis-space-representing-the-knowledge-of-the-learner_fig3_227155069

A hypothesis is described by the features and the language that are selected. From this set, the
learning algorithm will pick a hypothesis. The hypothesis space is represented by H, and the
learning algorithm outputs a hypothesis h ∈ H, where h is the chosen hypothesis. h depends on the
data points that are selected and also on certain types of restrictions (bias) that we have imposed.
Thus supervised learning can be thought of as a device that explores the hypothesis space in order
to find one hypothesis that satisfies the given criteria.
An Example to Conclude

Let's consider an example for further understanding. Take four Boolean features x1, x2, x3, x4;
thus x1 can take either 1 (= T) or 0 (= F), and similarly each of x2, x3, x4 can take either 0 or 1,
as shown.

x1 x2 x3 x4

1 1 1 1
0 0 0 0

Thus the number of possible instances is 16 (= 2^4). How many Boolean functions are possible? A
function classifies some of the 16 points as positive and the rest as negative, so the number of
functions is the number of possible subsets of the 16 instances, which is 2^16. This generalises to
n Boolean features: the number of possible instances is 2^n and the number of possible functions
is 2^(2^n).

As can be seen, the hypothesis space is gigantic in size, and it is not possible to look at every
hypothesis individually in order to select the best one. So one puts restrictions on the hypothesis
space to consider only specific hypotheses. These restrictions are also referred to as bias, and
they are of many types.
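As a quick sanity check, a short Python snippet (illustrative only) reproduces these counts:

# Count possible instances and Boolean functions over n Boolean features.
for n in (2, 3, 4):
    instances = 2 ** n            # possible input vectors
    functions = 2 ** instances    # possible labelings (subsets of instances)
    print(f"n={n}: {instances} instances, {functions} Boolean functions")
# n=4 gives 16 instances and 65536 functions, matching the text.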

An example is Occam's razor, which states that the simplest hypothesis consistent with the data is
the best and should be chosen. Other types of bias include minimum description length and maximum
margin bias. The choice of bias depends on the requirements and the available data sets.

EVALUATION:-

Types of evaluation

Many types of evaluation exist; consequently, evaluation methods need to be customised according
to what is being evaluated and the purpose of the evaluation [1,2]. It is important to understand the
different types of evaluation that can be conducted over a program’s life-cycle and when they
should be used. The main types of evaluation are process, impact, outcome and summative
evaluation [1].

Before you are able to measure the effectiveness of your project, you need to determine if the
project is being run as intended and if it is reaching the intended audience [3]. It is futile to try to
determine how effective your program is if you are not certain of the objective, structure,
programming and audience of the project. This is why process evaluation should be done prior to
any other type of evaluation [3].

Process evaluation

Process evaluation is used to “measure the activities of the program, program quality and who it
is reaching” [3]. Process evaluation, as outlined by Hawe and colleagues [3], will help answer
questions about your program such as:

• Has the project reached the target group?

• Are all project activities reaching all parts of the target group?

• Are participants and other key stakeholders satisfied with all aspects of the project?

• Are all activities being implemented as intended? If not why?

• What if any changes have been made to intended activities?

• Are all materials, information and presentations suitable for the target audience?

Impact evaluation

Impact evaluation is used to measure the immediate effect of the program and is aligned with the
program’s objectives. Impact evaluation measures how well the program’s objectives (and sub-
objectives) have been achieved [1,3].

Impact evaluation will help answer questions such as:

• How well has the project achieved its objectives (and sub-objectives)?

• How well have the desired short term changes been achieved?

For example, one of the objectives of the My-Peer project is to provide a safe space and learning
environment for young people, without fear of judgment, misunderstanding, harassment or abuse.
Impact evaluation will assess the attitudes of young people towards the learning environment and
how they perceived it. It may also assess changes in participants’ self-esteem, confidence and
social connectedness.

Impact evaluation measures the program’s effectiveness immediately after the completion of the
program and up to six months after completion.

Outcome evaluation

Outcome evaluation is concerned with the long-term effects of the program and is generally used
to measure the program goal. Consequently, outcome evaluation measures how well the program
goal has been achieved [1,3].

Outcome evaluation will help answer questions such as:

• Has the overall program goal been achieved?

• What factors, if any, outside the program have contributed to or hindered the desired change?

• What unintended change, if any, has occurred as a result of the program?

In peer-based youth programs outcome evaluation may measure changes to: mental and physical
wellbeing, education and employment and help-seeking behaviours.

Outcome evaluation measures changes at least six months after the implementation of the program
(longer term). Although outcome evaluation measures the main goal of the program, it can also be
used to assess program objectives over time. It should be noted that it is not always possible or
appropriate to conduct outcome evaluation in peer-based programs.

cross-validation:-

What is cross-validation?

Cross-validation is a technique for evaluating a machine learning model and testing its
performance. CV is commonly used in applied ML tasks. It helps to compare and select an
appropriate model for the specific predictive modeling problem.

CV is easy to understand, easy to implement, and it tends to have a lower bias than other methods
used to estimate the model’s efficiency scores. All this makes cross-validation a powerful tool for
selecting the best model for a specific task.
There are a lot of different techniques that may be used to cross-validate a model. Still, all of them
have a similar algorithm:

1. Divide the dataset into two parts: one for training, the other for testing

2. Train the model on the training set

3. Validate the model on the test set

4. Repeat steps 1–3 several times; the number of repetitions depends on the CV method that you
are using

As you may know, there are plenty of CV techniques. Some of them are commonly used, others
work only in theory. Let’s see the cross-validation methods that will be covered in this article.

• Hold-out

• K-folds

• Leave-one-out

• Leave-p-out

• Stratified K-folds

• Repeated K-folds

• Nested K-folds

• Time series CV

Hold-out cross-validation

Hold-out cross-validation is the simplest and most common technique. You might not know that
it is a hold-out method but you certainly use it every day.

The algorithm of hold-out technique:


1. Divide the dataset into two parts: the training set and the test set. Usually, 80% of the
dataset goes to the training set and 20% to the test set but you may choose any splitting that
suits you better

2. Train the model on the training set

3. Validate on the test set

4. Save the result of the validation

That’s it.

We usually use the hold-out method on large datasets as it requires training the model only once.

It is really easy to implement hold-out. For example, you may do it using
sklearn.model_selection.train_test_split.

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=111)

Still, hold-out has a major disadvantage.

For example, consider a dataset that is not completely even distribution-wise; we may end up in a
rough spot after the split, with a training set that does not represent the test set. The two sets
may differ a lot, and one of them might be easier or harder than the other.

Moreover, the fact that we test our model only once might be a bottleneck for this method. Due to
the reasons mentioned before, the result obtained by the hold-out technique may be considered
inaccurate.

k-Fold cross-validation

k-Fold cross-validation is a technique that minimizes the disadvantages of the hold-out method. k-
Fold introduces a new way of splitting the dataset which helps to overcome the “test only once
bottleneck”.

The algorithm of the k-Fold technique:

1. Pick a number of folds – k. Usually, k is 5 or 10 but you can choose any number which is
less than the dataset’s length.

2. Split the dataset into k equal (if possible) parts (they are called folds)

3. Choose k – 1 folds as the training set. The remaining fold will be the test set

4. Train the model on the training set. On each iteration of cross-validation, you must train a
new model independently of the model trained on the previous iteration

5. Validate on the test set

6. Save the result of the validation

7. Repeat steps 3 – 6 k times. Each time use the remaining fold as the test set. In the end, you
should have validated the model on every fold that you have.

8. To get the final score average the results that you got on step 6.
To perform k-Fold cross-validation you can use sklearn.model_selection.KFold.

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]


In general, it is always better to use the k-Fold technique instead of hold-out. In a head-to-head
comparison, k-Fold gives a more stable and trustworthy result since training and testing are
performed on several different parts of the dataset. We can make the overall score even more robust
if we increase the number of folds to test the model on many different sub-datasets.

Still, the k-Fold method has a disadvantage: increasing k results in training more models, and the
training process might be really expensive and time-consuming.

Leave-one-out cross-validation

Leave-one-out cross-validation (LOOCV) is an extreme case of k-Fold CV. Imagine that k is equal
to n, where n is the number of samples in the dataset. Such a k-Fold case is equivalent to the
Leave-one-out technique.

The algorithm of LOOCV technique:

1. Choose one sample from the dataset which will be the test set

2. The remaining n – 1 samples will be the training set

3. Train the model on the training set. On each iteration, a new model must be trained

4. Validate on the test set

5. Save the result of the validation

6. Repeat steps 1 – 5 n times as for n samples we have n different training and test sets

7. To get the final score average the results that you got on step 5.
For LOOCV sklearn also has a built-in method. It can be found in the model_selection library –
sklearn.model_selection.LeaveOneOut.

import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])

loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

The greatest advantage of Leave-one-out cross-validation is that it doesn’t waste much data. We
use only one sample from the whole dataset as a test set, whereas the rest is the training set. But
when compared with k-Fold CV, LOOCV requires building n models instead of k, and n (which stands
for the number of samples in the dataset) is usually much higher than k. This means LOOCV is more
computationally expensive than k-Fold; it may take plenty of time to cross-validate the model using
LOOCV.

Thus, the Data Science community has a general rule, based on empirical evidence and various
studies, which suggests that 5- or 10-fold cross-validation should be preferred over LOOCV.

Leave-p-out cross-validation

Leave-p-out cross-validation (LpOC) is similar to Leave-one-out CV as it creates all the possible
training and test sets by using p samples as the test set. Everything mentioned about LOOCV is also
true for LpOC.

Still, it is worth mentioning that, unlike LOOCV and k-Fold, test sets will overlap for LpOC if p is
higher than 1.

The algorithm of LpOC technique:

1. Choose p samples from the dataset which will be the test set

2. The remaining n – p samples will be the training set

3. Train the model on the training set. On each iteration, a new model must be trained

4. Validate on the test set

5. Save the result of the validation

6. Repeat steps 2 – 5 C(n, p) times, where C(n, p) is the number of ways to choose p samples from
a dataset of n samples

7. To get the final score average the results that you got on step 5

You can perform Leave-p-out CV using sklearn – sklearn.model_selection.LeavePOut.

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
lpo = LeavePOut(2)
for train_index, test_index in lpo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

LpOC has all the disadvantages of LOOCV but, nevertheless, it is as robust as LOOCV.

Stratified k-Fold cross-validation

Sometimes we may face a large imbalance of the target value in the dataset. For example, in a
dataset concerning wristwatch prices, there might be a larger number of wristwatches having a high
price. In the case of classification, in a cats-and-dogs dataset there might be a large shift towards
the dog class.

Stratified k-Fold is a variation of the standard k-Fold CV technique which is designed to be
effective in such cases of target imbalance.

It works as follows. Stratified k-Fold splits the dataset on k folds such that each fold contains
approximately the same percentage of samples of each target class as the complete set. In the case
of regression, Stratified k-Fold makes sure that the mean target value is approximately equal in all
the folds.

The algorithm of Stratified k-Fold technique:

1. Pick a number of folds – k


2. Split the dataset into k folds. Each fold must contain approximately the same percentage
of samples of each target class as the complete set

3. Choose k – 1 folds which will be the training set. The remaining fold will be the test set

4. Train the model on the training set. On each iteration a new model must be trained

5. Validate on the test set

6. Save the result of the validation

7. Repeat steps 3 – 6 k times. Each time use the remaining fold as the test set. In the end, you
should have validated the model on every fold that you have.

8. To get the final score average the results that you got on step 6.

As you may have noticed, the algorithm for Stratified k-Fold technique is similar to the standard
k-Folds. You don’t need to code something additionally as the method will do everything
necessary for you.

Stratified k-Fold also has a built-in method in sklearn – sklearn.model_selection.StratifiedKFold.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]



Everything mentioned above about k-Fold CV is true for the Stratified k-Fold technique. When
choosing between different CV methods, make sure you are using the proper one. For example, you
might think that your model performs badly simply because you are using k-Fold CV to validate a
model which was trained on a dataset with a class imbalance. To avoid that, you should always do a
proper exploratory data analysis on your data.

Repeated k-Fold cross-validation

Repeated k-Fold cross-validation (or repeated random sub-sampling CV) is probably the most
robust of all CV techniques in this article. It is a variation of k-Fold, but in the case of Repeated
k-Fold, k is not the number of folds; it is the number of times we will train the model.

The general idea is that on every iteration we will randomly select samples all over the dataset as
our test set. For example, if we decide that 20% of the dataset will be our test set, 20% of samples
will be randomly selected and the rest 80% will become the training set.

The algorithm of Repeated k-Fold technique:

1. Pick k – number of times the model will be trained

2. Pick a number of samples which will be the test set


3. Split the dataset

4. Train on the training set. On each iteration of cross-validation, a new model must be trained

5. Validate on the test set

6. Save the result of the validation

7. Repeat steps 3-6 k times

8. To get the final score average the results that you got on step 6.

Repeated k-Fold has clear advantages over standard k-Fold CV. Firstly, the proportion of train/test
split is not dependent on the number of iterations. Secondly, we can even set unique proportions
for every iteration. Thirdly, random selection of samples from the dataset makes Repeated k-Fold
even more robust to selection bias.

Still, there are some disadvantages. k-Fold CV guarantees that the model will be tested on all
samples, whereas Repeated k-Fold is based on randomization, which means that some samples may
never be selected for the test set at all while others might be selected multiple times. This makes
it a bad choice for imbalanced datasets.

Sklearn will help you to implement a Repeated k-Fold CV. Just use
sklearn.model_selection.RepeatedKFold. In sklearn implementation of this technique you must set
the number of folds that you want to have (n_splits) and the number of times the split will be
performed (n_repeats). It guarantees that you will have different folds on each iteration.

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)
for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Nested k-Fold
Unlike the other CV techniques, which are designed to evaluate the quality of an algorithm, Nested
k-fold CV is used to train a model in which hyperparameters also need to be optimized. It estimates
the generalization error of the underlying model and its (hyper)parameter search.
[Figure: Nested k-Fold cross-validation resampling]

The algorithm of the Nested k-Fold technique:

1. Define the set of hyper-parameter combinations, C, for the current model. If the model has no
hyper-parameters, C is the empty set.

2. Divide the data into K folds with approximately equal distribution of cases and controls.

3. (outer loop) For each fold k in the K folds:

   1. Set fold k as the test set.

   2. Perform automated feature selection on the remaining K-1 folds.

   3. For each parameter combination c in C:

      1. (inner loop) For each fold k' in the remaining K-1 folds:

         1. Set fold k' as the validation set.

         2. Train the model on the remaining K-2 folds.

         3. Evaluate the model performance on fold k'.

      2. Calculate the average performance over the inner-loop folds for parameter combination c.

   4. Train the model on the K-1 folds using the hyper-parameter combination that yielded the best
   average performance over all steps of the inner loop.

   5. Evaluate the model performance on fold k.

4. Calculate the average performance over the K folds.

The inner loop performs cross-validation to identify the best features and model hyper-parameters
using the K-1 data folds available at each iteration of the outer loop. The model is trained once
for each outer-loop step and evaluated on the held-out data fold. This process yields K evaluations
of the model performance, one for each data fold, and allows the model to be tested on every
sample.

It is to be noted that this technique is computationally expensive because plenty of models are
trained and evaluated. Unfortunately, there is no built-in method in sklearn that would perform
Nested k-Fold CV for you.

You can either implement it yourself or refer to an existing implementation.
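There is, however, a common pattern that approximates Nested k-Fold CV with sklearn building
blocks: a GridSearchCV estimator (inner loop) evaluated by cross_val_score (outer loop). The model,
grid, and fold counts below are illustrative assumptions, not part of the original article:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10]}                      # illustrative grid
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: GridSearchCV picks the best C on each outer training split.
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv)
# Outer loop: estimate the generalization error of the whole tuning procedure.
scores = cross_val_score(clf, X, y, cv=outer_cv)
print("Nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))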

Time-series cross-validation

Traditional cross-validation techniques don’t work on sequential data such as time-series because
we cannot choose random data points and assign them to either the test set or the train set as it
makes no sense to use the values from the future to forecast values in the past. There are mainly
two ways to go about this:

1. Rolling cross-validation

Cross-validation is done on a rolling basis i.e. starting with a small subset of data for training
purposes, predicting the future values, and then checking the accuracy on the forecasted data
points. The following image can help you get the intuition behind this approach.
[Figure: Rolling cross-validation]
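sklearn implements rolling cross-validation as sklearn.model_selection.TimeSeriesSplit; the toy
array below is only for illustration:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)   # six time-ordered samples
y = np.arange(6)
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    # Training indices always come before test indices in time.
    print("TRAIN:", train_index, "TEST:", test_index)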

2. Blocked cross-validation

The first technique may introduce leakage from future data to the model. The model will observe
future patterns to forecast and try to memorize them. That’s why blocked cross-validation was
introduced.
[Figure: Blocked cross-validation]

It works by adding margins at two positions. The first is between the training and validation folds
in order to prevent the model from observing lag values which are used twice, once as a regressor
and another as a response. The second is between the folds used at each iteration in order to prevent
the model from memorizing patterns from one iteration to the next.
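There is no built-in blocked CV in sklearn. The following is a minimal hand-rolled sketch; the
function name, fold count, train fraction, and margin size are all illustrative assumptions:

import numpy as np

def blocked_cv_splits(n_samples, n_folds=5, train_frac=0.7, margin=2):
    # Split the series into disjoint contiguous blocks; inside each block,
    # leave a gap of `margin` samples between train and validation so that
    # lag features cannot leak across the boundary.
    for block in np.array_split(np.arange(n_samples), n_folds):
        cut = int(len(block) * train_frac)
        train, valid = block[:cut], block[cut + margin:]
        if len(train) and len(valid):
            yield train, valid

for train, valid in blocked_cv_splits(50, n_folds=3):
    print("TRAIN:", train[[0, -1]], "VALID:", valid[[0, -1]])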

Cross-validation in Machine Learning

When is cross-validation the right choice?

Although doing cross-validation of your trained model can never be termed as a bad choice, there
are certain scenarios in which cross-validation becomes an absolute necessity:

1. Limited dataset

Let’s say we have 100 data points and we are dealing with a multi-class classification problem
with 10 classes, this averages out to ~10 examples per class. In an 80-20 train-test split, this number
would go down even further to 8 samples per class for training. The smart thing to do here would
be to use cross-validation and utilize our entire dataset for training as well as testing.


2. Dependent data points

When we perform a random train-test split of our data, we assume that our examples are
independent. It means that knowing some instances will not help us understand other instances.
However, that’s not always the case, and in such situations, it’s important that our model gets
familiar with the entire dataset which is possible with cross-validation.

3. Cons of single metric

In the absence of cross-validation, we only get a single value of accuracy or precision or recall
which could be an outcome of chance. When we train multiple models, we eliminate such
possibilities and get a metric per model which results in robust insights.

4. Hyperparameter tuning

Although there are many methods to tune the hyperparameters of your model, such as grid search,
Bayesian optimization, etc., this exercise can’t be done on the training or test set, and a need for
a validation set arises. Thus, we fall back to the same splitting problem discussed above, and
cross-validation can help us out of it.

Linear regression:-

What is linear regression in machine learning?


Linear Regression is a machine learning algorithm based on supervised learning. It performs a
regression task. Regression models a target prediction value based on independent variables. It is
mostly used for finding out the relationship between variables and forecasting.
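As a small illustrative sketch (the toy data below is made up, not from the notes), fitting a line
with sklearn looks like this:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2x + 1 plus a little noise.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

reg = LinearRegression().fit(X, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("prediction for x=6:", reg.predict([[6]])[0])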

Decision trees :-

What is decision trees in machine learning?

A decision tree is a type of supervised machine learning used to categorize or make predictions
based on how a previous set of questions were answered. The model is a form of supervised
learning, meaning that the model is trained and tested on a set of data that contains the desired
categorization.

What is decision tree in machine learning with example?

A decision tree is a flowchart-like structure in which each internal node represents a test on a
feature (e.g. whether a coin flip comes up heads or tails), each leaf node represents a class label
(the decision taken after computing all features) and branches represent conjunctions of features
that lead to those class labels.
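A minimal sklearn sketch (the dataset and tree depth are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Each internal node tests one feature; each leaf holds a class label.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict(X[:2]))   # predicted class labels for two samples
print(clf.score(X, y))      # accuracy on the training data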

Over fitting:-

What is the overfitting in machine learning?

Overfitting is an undesirable machine learning behavior that occurs when the machine learning
model gives accurate predictions for training data but not for new data. When data scientists use
machine learning models for making predictions, they first train the model on a known data set.
What is meant by overfitting?

Overfitting is a concept in data science, which occurs when a statistical model fits exactly against
its training data. When this happens, the algorithm unfortunately cannot perform accurately against
unseen data, defeating its purpose.
What is overfitting in machine learning and how can you avoid it?

Overfitting makes the model relevant to its data set only, and irrelevant to any other data
sets. Some of the methods used to prevent overfitting include ensembling, data augmentation, data
simplification, and cross-validation.

Instance based learning:-

What is meant by instance-based learning?

Definition. Instance-based learning refers to a family of techniques for classification and
regression, which produce a class label/prediction based on the similarity of the query to its
nearest neighbor(s) in the training set.

What is instance-based learning with example?

Instance-based learners may simply store a new instance or throw an old instance away. Examples
of instance-based learning algorithms are the k-nearest neighbors algorithm, kernel machines
and RBF networks.

Why KNN is called instance-based learning?

Instance-based learning: the raw training instances are used to make predictions. As such, KNN is
often referred to as instance-based learning or case-based learning (where each training instance
is a case from the problem domain).
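A short sketch with sklearn's k-nearest neighbors classifier (the dataset and k are illustrative):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# fit() essentially just stores the training instances.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# Each prediction is decided by the 3 nearest stored instances.
print(knn.predict(X[:2]))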
Feature reduction:-
Feature reduction leads to the need for fewer resources to complete computations or tasks. Less
computation time and less storage capacity needed means the computer can do more work. During
machine learning, feature reduction removes multicollinearity resulting in improvement of the
machine learning model in use.
What are the benefits of feature reduction?

Fewer features mean less complexity. You will need less storage space because you have less data.
Fewer features require less computation time. Model accuracy improves due to less misleading
data.

What are 3 ways of reducing dimensionality?

Principal Component Analysis (PCA), Factor Analysis (FA), Linear Discriminant Analysis
(LDA) and Truncated Singular Value Decomposition (SVD) are examples of linear
dimensionality reduction methods.
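For instance, PCA in sklearn reduces the four iris features to two principal components (an
illustrative sketch):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 4 features per sample
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # now 2 features per sample
print(X_reduced.shape, pca.explained_variance_ratio_)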

Collaborative filtering based recommendation:-

What is collaborative filtering based recommendation system?

Collaborative filtering filters information by using the interactions and data collected by the
system from other users. It’s based on the idea that people who agreed in their evaluation of
certain items are likely to agree again in the future.
What is collaborative recommendation system?
Collaborative filtering is a family of algorithms where there are multiple ways to find similar
users or items and multiple ways to calculate rating based on ratings of similar users.
Depending on the choices you make, you end up with a type of collaborative filtering approach.

What is an example of collaborative filtering?

Amazon is known for its use of collaborative filtering, matching products to users based on past
purchases. For example, the system can identify all of the products a customer and users with
similar behaviors have purchased and/or positively rated.
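A tiny user-based collaborative filtering sketch; the ratings matrix and helper names are made up
for illustration:

import numpy as np

# Rows = users, columns = items; 0 means "not rated".
ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [1, 0, 4, 4]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target, item = 1, 2                      # predict item 2 for user 1
sims = np.array([cosine(ratings[target], ratings[u])
                 for u in range(len(ratings))])
sims[target] = 0.0                       # exclude the user themselves

# Similarity-weighted average of ratings from users who rated the item.
rated = ratings[:, item] > 0
pred = sims[rated] @ ratings[rated, item] / sims[rated].sum()
print("predicted rating:", round(pred, 2))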
[Unit 2]

Probability and Bayes learning :-

Bayes Theorem of Conditional Probability

Before we dive into Bayes theorem, let’s review marginal, joint, and conditional probability.

Recall that marginal probability is the probability of an event, irrespective of other random
variables. If the random variable is independent, then it is the probability of the event directly,
otherwise, if the variable is dependent upon other variables, then the marginal probability is the
probability of the event summed over all outcomes for the dependent variables, called the sum
rule.

• Marginal Probability: The probability of an event irrespective of the outcomes of other


random variables, e.g. P(A).

The joint probability is the probability of two (or more) simultaneous events, often described in
terms of events A and B from two dependent random variables, e.g. X and Y. The joint probability
is often summarized as just the outcomes, e.g. A and B.

• Joint Probability: Probability of two (or more) simultaneous events, e.g. P(A and B) or
P(A, B).

The conditional probability is the probability of one event given the occurrence of another event,
often described in terms of events A and B from two dependent random variables e.g. X and Y.

• Conditional Probability: Probability of one (or more) event given the occurrence of another
event, e.g. P(A given B) or P(A | B).

The joint probability can be calculated using the conditional probability; for example:

• P(A, B) = P(A | B) * P(B)

This is called the product rule. Importantly, the joint probability is symmetrical, meaning that:

• P(A, B) = P(B, A)
The conditional probability can be calculated using the joint probability; for example:

• P(A | B) = P(A, B) / P(B)

The conditional probability is not symmetrical; for example:

• P(A | B) != P(B | A)

We are now up to speed with marginal, joint and conditional probability. If you would like more
background on these fundamentals, see the tutorial:

• A Gentle Introduction to Joint, Marginal, and Conditional Probability

An Alternate Way To Calculate Conditional Probability

Now, there is another way to calculate the conditional probability.

Specifically, one conditional probability can be calculated using the other conditional probability;
for example:

• P(A|B) = P(B|A) * P(A) / P(B)

The reverse is also true; for example:

• P(B|A) = P(A|B) * P(B) / P(A)

This alternate approach of calculating the conditional probability is useful either when the joint
probability is challenging to calculate (which is most of the time), or when the reverse conditional
probability is available or easy to calculate.

This alternate calculation of the conditional probability is referred to as Bayes Rule or Bayes
Theorem, named for Reverend Thomas Bayes, who is credited with first describing it. It is
grammatically correct to refer to it as Bayes’ Theorem (with the apostrophe), but it is common to
omit the apostrophe for simplicity.

• Bayes Theorem: Principled way of calculating a conditional probability without the joint
probability.

It is often the case that we do not have access to the denominator directly, e.g. P(B).
We can calculate it in an alternative way; for example:

• P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

This gives a formulation of Bayes Theorem that uses the alternate calculation of P(B), described
below:

• P(A|B) = P(B|A) * P(A) / P(B|A) * P(A) + P(B|not A) * P(not A)

Or with brackets around the denominator for clarity:

• P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))

Note: the denominator is simply the expansion we gave above.

As such, if we have P(A), then we can calculate P(not A) as its complement; for example:

• P(not A) = 1 – P(A)

Additionally, if we have P(not B|not A), then we can calculate P(B|not A) as its complement; for
example:

• P(B|not A) = 1 – P(not B|not A)

Now that we are familiar with the calculation of Bayes Theorem, let’s take a closer look at the
meaning of the terms in the equation.


Naming the Terms in the Theorem

The terms in the Bayes Theorem equation are given names depending on the context where the
equation is used.
It can be helpful to think about the calculation from these different perspectives, which helps to
map your problem onto the equation.

Firstly, in general, the result P(A|B) is referred to as the posterior probability and P(A) is referred
to as the prior probability.

• P(A|B): Posterior probability.

• P(A): Prior probability.

Sometimes P(B|A) is referred to as the likelihood and P(B) is referred to as the evidence.

• P(B|A): Likelihood.

• P(B): Evidence.

This allows Bayes Theorem to be restated as:

• Posterior = Likelihood * Prior / Evidence

We can make this clear with a smoke and fire case.

What is the probability that there is fire given that there is smoke?

Where P(Fire) is the Prior, P(Smoke|Fire) is the Likelihood, and P(Smoke) is the evidence:

• P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)

You can imagine the same situation with rain and clouds.

Now that we are familiar with Bayes Theorem and the meaning of the terms, let’s look at a scenario
where we can calculate it.

Worked Example for Calculating Bayes Theorem

Bayes theorem is best understood with a real-life worked example with real numbers to
demonstrate the calculations.

First we will define a scenario then work through a manual calculation, a calculation in Python,
and a calculation using the terms that may be familiar to you from the field of binary classification.
1. Diagnostic Test Scenario

2. Manual Calculation

3. Python Code Calculation

4. Binary Classifier Terminology

Let’s go.

Diagnostic Test Scenario

An excellent and widely used example of the benefit of Bayes Theorem is in the analysis of a
medical diagnostic test.

Scenario: Consider a human population that may or may not have cancer (Cancer is True or False)
and a medical test that returns positive or negative for detecting cancer (Test is Positive or
Negative), e.g. like a mammogram for detecting breast cancer.

Problem: If a randomly selected patient has the test and it comes back positive, what is the
probability that the patient has cancer?

Manual Calculation

Medical diagnostic tests are not perfect; they have error.

Sometimes a patient will have cancer, but the test will not detect it. This capability of the test to
detect cancer is referred to as the sensitivity, or the true positive rate.

In this case, we will contrive a sensitivity value for the test. The test is good, but not great, with a
true positive rate or sensitivity of 85%. That is, of all the people who have cancer and are tested,
85% of them will get a positive result from the test.

• P(Test=Positive | Cancer=True) = 0.85

Given this information, our intuition would suggest that there is an 85% probability that the patient
has cancer.

Our intuitions of probability are wrong.


This type of error in interpreting probabilities is so common that it has its own name; it is referred
to as the base rate fallacy.

It has this name because the error in estimating the probability of an event is caused by ignoring
the base rate. That is, it ignores the probability of a randomly selected person having cancer,
regardless of the results of a diagnostic test.

In this case, we can assume the probability of breast cancer is low, and use a contrived base rate
value of one person in 5,000, or 0.0002 (0.02%).

• P(Cancer=True) = 0.02% (= 0.0002)

We can correctly calculate the probability of a patient having cancer given a positive test result
using Bayes Theorem.

Let’s map our scenario onto the equation:

• P(A|B) = P(B|A) * P(A) / P(B)

• P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) /


P(Test=Positive)

We know the probability of the test being positive given that the patient has cancer is 85%, and
we know the base rate or the prior probability of a given patient having cancer is 0.02%; we can
plug these values in:

• P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / P(Test=Positive)

We don’t know P(Test=Positive); it’s not given directly.

Instead, we can estimate it using:

• P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

• P(Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) +


P(Test=Positive|Cancer=False) * P(Cancer=False)
Firstly, we can calculate P(Cancer=False) as the complement of P(Cancer=True), which we already
know:

• P(Cancer=False) = 1 – P(Cancer=True)

• = 1 – 0.0002

• = 0.9998

We can plug in our known values as follows:

• P(Test=Positive) = 0.85 * 0.0002 + P(Test=Positive|Cancer=False) * 0.9998

We still do not know the probability of a positive test result given no cancer.

This requires additional information.

Specifically, we need to know how good the test is at correctly identifying people that do not have
cancer. That is, returning a negative result (Test=Negative) when the patient does not have cancer
(Cancer=False); this is called the true negative rate or the specificity.

We will use a contrived specificity value of 95%.

• P(Test=Negative | Cancer=False) = 0.95

With this final piece of information, we can calculate the false positive or false alarm rate as the
complement of the true negative rate.

• P(Test=Positive|Cancer=False) = 1 – P(Test=Negative | Cancer=False)

• = 1 – 0.95

• = 0.05

We can plug this false alarm rate into our calculation of P(Test=Positive) as follows:

• P(Test=Positive) = 0.85 * 0.0002 + 0.05 * 0.9998


• P(Test=Positive) = 0.00017 + 0.04999

• P(Test=Positive) = 0.05016

Excellent, so the probability of the test returning a positive result, regardless of whether the person
has cancer or not is about 5%.

We now have enough information to calculate Bayes Theorem and estimate the probability of a
randomly selected person having cancer if they get a positive test result.

• P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) /


P(Test=Positive)

• P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / 0.05016

• P(Cancer=True | Test=Positive) = 0.00017 / 0.05016

• P(Cancer=True | Test=Positive) = 0.003389154704944

The calculation suggests that if the patient is informed they have cancer with this test, then there
is only a 0.33% chance that they actually have cancer.

It is a terrible diagnostic test!

The example also shows that the calculation of the conditional probability
requires enough information.

For example, if we have the values used in Bayes Theorem already, we can use them directly.

This is rarely the case, and we typically have to calculate the bits we need and plug them in, as we
did in this case. In our scenario we were given three pieces of information: the base rate, the
sensitivity (or true positive rate), and the specificity (or true negative rate).

• Sensitivity: 85% of people with cancer will get a positive test result.

• Base Rate: 0.02% of people have cancer.

• Specificity: 95% of people without cancer will get a negative test result.
We did not have the P(Test=Positive), but we calculated it given what we already had available.

We might imagine that Bayes Theorem allows us to be even more precise about a given scenario.
For example, if we had more information about the patient (e.g. their age) and about the domain
(e.g. cancer rates for age ranges), we could offer an even more accurate probability estimate.
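The Python code calculation promised above is a direct transcription of the manual steps (values
as in the scenario):

# Bayes Theorem for the diagnostic test scenario.
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    # P(B) via the sum rule, then Bayes Theorem.
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

p_cancer = 0.0002            # base rate, P(Cancer=True)
sensitivity = 0.85           # P(Test=Positive | Cancer=True)
specificity = 0.95           # P(Test=Negative | Cancer=False)
false_alarm = 1 - specificity

p = bayes_theorem(p_cancer, sensitivity, false_alarm)
print("P(Cancer=True | Test=Positive) = %.6f" % p)   # ~0.003389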

Logistic Regression:-

Simple introduction to logistic regression

Before we dive into understanding logistic regression, let us start with some basics about the
different types of machine learning algorithms.

What are the differences between supervised learning, unsupervised learning & reinforcement
learning?

Machine learning algorithms are broadly classified into three categories - supervised learning,
unsupervised learning, and reinforcement learning.

1. Supervised Learning - Learning where data is labeled and the motivation is to classify
something or predict a value. Example: Detecting fraudulent transactions from a list of
credit card transactions.
2. Unsupervised Learning - Learning where data is not labeled and the motivation is to find
patterns in given data. In this case, you are asking the machine learning model to process
the data from which you can then draw conclusions. Example: Customer segmentation
based on spend data.
3. Reinforcement Learning - Learning by trial and error. This is the closest to how humans
learn. The motivation is to find the optimal policy of how to act in a given environment. The
machine learning model examines all possible actions, makes a policy that maximizes
benefit, and implements the policy (trial). If there are errors from the initial policy, apply
reinforcements back into the algorithm and continue to do this until you reach the optimal
policy. Example: Personalized recommendations on streaming platforms like YouTube.
What are the two types of supervised learning?

As supervised learning is used to classify something or predict a value, naturally there are two
types of algorithms for supervised learning - classification models and regression models.

1. Classification model - In simple terms, a classification model predicts one of a discrete set
of possible outcomes. Example: Predicting if a transaction is fraud or not.
2. Regression model - Are used to predict a numerical value. Example: Predicting the sale
price of a house.

What is logistic regression?

Logistic regression is an example of supervised learning. It is used to calculate or predict the


probability of a binary (yes/no) event occurring. An example of logistic regression could be
applying machine learning to determine if a person is likely to be infected with COVID-19 or not.
Since we have two possible outcomes to this question - yes they are infected, or no they are not
infected - this is called binary classification.

In this imaginary example, the probability of a person being infected with COVID-19 could be
based on the viral load and the symptoms and the presence of antibodies, etc. Viral load, symptoms,
and antibodies would be our factors (Independent Variables), which would influence our outcome
(Dependent Variable).

How is logistic regression different from linear regression?

In linear regression, the outcome is continuous and can be any possible value. However in the case
of logistic regression, the predicted outcome is discrete and restricted to a limited number of
values.

For example, say we are trying to apply machine learning to the sale of a house. If we are trying
to predict the sale price based on the size, year built, and number of stories, we would use linear
regression, as linear regression can predict a sale price of any possible value. If we are using those
same factors to predict if the house sells or not, we would use logistic regression, as the possible
outcomes here are restricted to yes or no.
Hence, linear regression is an example of a regression model and logistic regression is an example
of a classification model.

Where to use logistic regression

Logistic regression is used to solve classification problems, and the most common use case
is binary logistic regression, where the outcome is binary (yes or no). In the real world, you can
see logistic regression applied across multiple areas and fields.

• In health care, logistic regression can be used to predict if a tumor is likely to be benign or
malignant.
• In the financial industry, logistic regression can be used to predict if a transaction is
fraudulent or not.
• In marketing, logistic regression can be used to predict if a targeted audience will respond
or not.

Are there other use cases for logistic regression aside from binary logistic regression? Yes. There
are two other types of logistic regression that depend on the number of predicted outcomes.

The three types of logistic regression

1. Binary logistic regression - When we have two possible outcomes, like our original
example of whether a person is likely to be infected with COVID-19 or not.
2. Multinomial logistic regression - When we have multiple outcomes, say if we build out our
original example to predict whether someone may have the flu, an allergy, a cold, or
COVID-19.
3. Ordinal logistic regression - When the outcome is ordered, like if we build out our original
example to also help determine the severity of a COVID-19 infection, sorting it into mild,
moderate, and severe cases.

Training data assumptions for logistic regression

Training data that satisfies the below assumptions is usually a good fit for logistic regression.
• The predicted outcome is strictly binary or dichotomous. (This applies to binary logistic
regression).
• The factors, or the independent variables, that influence the outcome are independent of
each other. In other words, there is little or no multicollinearity among the independent
variables.
• The independent variables can be linearly related to the log odds.
• Fairly large sample sizes.
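
As a quick illustration of binary logistic regression in practice, here is a minimal scikit-learn
sketch on a synthetic dataset (the dataset and variable names are ours, not from a real application):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data: 4 independent variables, 1 binary outcome
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

print(model.predict(X_test[:5]))        # predicted classes (0 or 1)
print(model.predict_proba(X_test[:5]))  # predicted probabilities for each class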

Support Vector Machine:-

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression. Though it can be applied to regression problems as well, it is best
suited for classification. The objective of the SVM algorithm is to find a hyperplane in an
N-dimensional space that distinctly classifies the data points. The dimension of the hyperplane
depends upon the number of features. If the number of input features is two, then the hyperplane
is just a line. If the number of input features is three, then the hyperplane becomes a 2-D plane.
It becomes difficult to imagine when the number of features exceeds three.

Let’s consider two independent variables x1, x2 and one dependent variable which is either a blue
circle or a red circle.
[Figure: linearly separable data points]

From the figure above it is very clear that there are multiple lines (our hyperplane here is a line
because we are considering only two input features x1, x2) that segregate our data points, i.e.
perform a classification between red and blue circles. So how do we choose the best line, or in
general the best hyperplane, that segregates our data points?

Selecting the best hyper-plane:

One reasonable choice as the best hyperplane is the one that represents the largest separation or
margin between the two classes.

So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard
margin. So from the above figure, we choose L2.

Let's consider a scenario like the one shown below.

Here we have one blue ball inside the boundary of the red balls. So how does SVM classify the data?
It's simple! The blue ball inside the boundary of the red ones is an outlier of the blue balls. The
SVM algorithm has the ability to ignore the outlier and find the best hyperplane that maximizes the
margin. SVM is robust to outliers.
For data points like these, SVM finds the maximum margin as it did with the previous data sets, and
in addition it adds a penalty each time a point crosses the margin. The margins in these types of
cases are called soft margins. When there is a soft margin, the SVM tries to minimize
(1/margin) + λ(∑ penalty). Hinge loss is a commonly used penalty: if there are no violations there
is no hinge loss; if there are violations, the hinge loss is proportional to the distance of the violation.
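
As a small illustration, the hinge loss for a single point can be written as max(0, 1 − y·f(x)),
where y is the true label (+1 or −1) and f(x) is the signed output of the classifier. Here is a
minimal sketch of that formula (not any particular library's implementation):

def hinge_loss(y_true, decision_value):
    # y_true is +1 or -1; decision_value is the signed output f(x)
    # zero loss outside the margin, otherwise proportional to the violation
    return max(0.0, 1.0 - y_true * decision_value)

print(hinge_loss(+1, 2.0))   # 0.0 -> correctly classified, outside the margin
print(hinge_loss(+1, 0.5))   # 0.5 -> inside the margin
print(hinge_loss(+1, -1.0))  # 2.0 -> misclassified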

Till now, we were talking about linearly separable data (the group of blue balls and red balls are
separable by a straight line). What do we do if the data are not linearly separable?

Say our data looks like the figure above. SVM solves this by creating a new variable using a
kernel. We take a point xi on the line and create a new variable yi as a function of its distance
from the origin O. If we plot this, we get something like what is shown below.

In this case, the new variable y is created as a function of distance from the origin. A non-linear
function that creates a new variable like this is referred to as a kernel.
SVM Kernel:

The SVM kernel is a function that takes a low-dimensional input space and transforms it into a
higher-dimensional space, i.e. it converts a non-separable problem into a separable problem. It is
mostly useful in non-linear separation problems. Simply put, the kernel does some extremely complex
data transformations and then finds out the process to separate the data based on the labels or
outputs defined.

Advantages of SVM:

• Effective in high-dimensional cases.

• It is memory efficient, as it uses a subset of training points in the decision function, called
support vectors.

• Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.
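
Here is a minimal scikit-learn sketch of the ideas above, fitting a linear soft-margin SVM on
synthetic two-feature data (the dataset and parameter values are ours, purely for illustration):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs of points in a two-feature space
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

clf = SVC(kernel='linear', C=1.0)  # C controls how soft the margin is
clf.fit(X, y)

print(clf.support_vectors_)        # the training points that define the margin
print(clf.predict([[0.0, 2.0]]))   # classify a new point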

Kernel function and Kernel SVM:-

Kernel Functions-Introduction to SVM Kernel & Examples


1. Objective

In our previous Machine Learning blog we discussed SVM (Support Vector Machine) in Machine
Learning. Now we are going to provide a detailed description of the SVM Kernel and different
kernel functions with examples, such as linear, nonlinear, polynomial, Gaussian kernel, Radial
Basis Function (RBF), sigmoid, etc.

2. SVM Kernel Functions

SVM algorithms use a set of mathematical functions that are defined as the kernel. The function
of a kernel is to take data as input and transform it into the required form. Different SVM
algorithms use different types of kernel functions, for example linear, nonlinear, polynomial,
radial basis function (RBF), and sigmoid.
Kernel functions can be introduced for sequence data, graphs, text, images, as well as vectors.
The most used type of kernel function is RBF, because it has a localized and finite response along
the entire x-axis.
Kernel functions return the inner product between two points in a suitable feature space. They
thus define a notion of similarity, with little computational cost even in very high-dimensional
spaces.

3. Kernel Rules

Define a kernel or window function K as follows:

K(z) = 1 if ||z|| <= 1, and K(z) = 0 otherwise

The value of this function is 1 inside the closed ball of radius 1 centered at the origin, and 0
otherwise, as shown in the figure below.

For a fixed xi, the function K((z - xi)/h) = 1 inside the closed ball of radius h centered at xi,
and 0 otherwise, as shown in the figure below.

So, by choosing the argument of K(·), you have moved the window to be centered at the point xi
and to be of radius h.

4. Examples of SVM Kernels

Let us see some common kernels used with SVMs and their uses:

4.1. Polynomial kernel

It is popular in image processing.


Equation is:

K(x, y) = (x·y + 1)^d

where d is the degree of the polynomial.

4.2. Gaussian kernel

It is a general-purpose kernel; used when there is no prior knowledge about the data. Equation is:

K(x, y) = exp(-||x - y||^2 / (2*sigma^2))

4.3. Gaussian radial basis function (RBF)


It is a general-purpose kernel; used when there is no prior knowledge about the data.
Equation is:

K(x, y) = exp(-gamma * ||x - y||^2), for gamma > 0

Sometimes parametrized using:

gamma = 1 / (2*sigma^2)

4.4. Laplace RBF kernel

It is a general-purpose kernel; used when there is no prior knowledge about the data.
Equation is:

K(x, y) = exp(-||x - y|| / sigma)

4.5. Hyperbolic tangent kernel

We can use it in neural networks.


Equation is:

K(x, y) = tanh(k*x·y + c), for some (not every) k > 0 and c < 0.


4.6. Sigmoid kernel

We can use it as a proxy for neural networks. Equation is:

K(x, y) = tanh(alpha*x·y + c)

4.7. Bessel function of the first kind Kernel

We can use it to remove the cross term in mathematical functions. Equation is:

K(x, y) = J(v+1)(sigma*||x - y||) / ||x - y||^(-n(v+1))

where J is the Bessel function of the first kind.

4.8. ANOVA radial basis kernel

We can use it in regression problems. Equation is:

K(x, y) = Σ (k = 1 to n) exp(-sigma*(x_k - y_k)^2)^d

4.9. Linear splines kernel in one-dimension

It is useful when dealing with large sparse data vectors. It is often used in text categorization. The
splines kernel also performs well in regression problems. Equation is:

K(x, y) = 1 + x*y + x*y*min(x, y) - ((x + y)/2)*min(x, y)^2 + (1/3)*min(x, y)^3
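
To see how the choice of kernel changes what an SVM can separate, here is a minimal scikit-learn
sketch that fits a few of the built-in kernels on the same non-linearly-separable data (an
illustration, not a benchmark):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line
X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = SVC(kernel=kernel, degree=3, gamma='scale')
    clf.fit(X, y)
    print(kernel, clf.score(X, y))  # training accuracy for each kernel

The RBF kernel typically scores close to 1.0 on this data, while the linear kernel cannot do much
better than chance, which is exactly the non-linear separation the kernel trick provides.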

[Unit 3]

Perceptron :-

The original Perceptron was designed to take a number of binary inputs, and produce
one binary output (0 or 1).

The idea was to use different weights to represent the importance of each input, and that the sum
of the values should be greater than a threshold value before making a decision
like true or false (0 or 1).


Perceptron Example

Imagine a perceptron (in your brain).


The perceptron tries to decide if you should go to a concert.

Is the artist good? Is the weather good?

What weights should these facts have?

Criteria Input Weight

Artist is Good x1 = 0 or 1 w1 = 0.7

Weather is Good x2 = 0 or 1 w2 = 0.6

Friend will Come x3 = 0 or 1 w3 = 0.5

Food is Served x4 = 0 or 1 w4 = 0.3

Alcohol is Served x5 = 0 or 1 w5 = 0.4

The Perceptron Algorithm

Frank Rosenblatt suggested this algorithm:

1. Set a threshold value

2. Multiply all inputs by their weights

3. Sum all the results

4. Activate the output

1. Set a threshold value:

• Threshold = 1.5
2. Multiply all inputs by their weights:

• x1 * w1 = 1 * 0.7 = 0.7

• x2 * w2 = 0 * 0.6 = 0

• x3 * w3 = 1 * 0.5 = 0.5

• x4 * w4 = 0 * 0.3 = 0

• x5 * w5 = 1 * 0.4 = 0.4

3. Sum all the results:

• 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)

4. Activate the Output:

• Return true if the sum > 1.5 ("Yes I will go to the Concert")

Note

If the weather weight is 0.6 for you, it might be different for someone else. A higher weight means
that the weather is more important to them.

If the threshold value is 1.5 for you, it might be different for someone else. A lower threshold
means they are more willing to go to the concert.

Example

const threshold = 1.5;

const inputs = [1, 0, 1, 0, 1];
const weights = [0.7, 0.6, 0.5, 0.3, 0.4];

// Compute the weighted sum of the inputs
let sum = 0;
for (let i = 0; i < inputs.length; i++) {
  sum += inputs[i] * weights[i];
}

// Activate the output if the weighted sum exceeds the threshold
const activate = (sum > threshold);

Perceptron Terminology

• Perceptron Inputs

• Node values

• Node Weights

• Activation Function

Perceptron Inputs

Perceptron inputs are called nodes.

The nodes have both a value and a weight.

Node Values

In the example above, the node values are: 1, 0, 1, 0, 1

The binary input values (0 or 1) can be interpreted as (no or yes) or (false or true).

Node Weights

Weights show the strength of each node.

In the example above, the node weights are: 0.7, 0.6, 0.5, 0.3, 0.4

The Activation Function

The activation function maps the result (the weighted sum) into a required value like 0 or 1.

In the example above, the activation function is simple: (sum > 1.5)

The binary output (1 or 0) can be interpreted as (yes or no) or (true or false).
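
The example above only computes the output for fixed weights. Rosenblatt's perceptron also learns
its weights from labeled examples by nudging them whenever a prediction is wrong. Here is a minimal
Python sketch of that update rule (our own illustration; the learning rate and epoch count are
arbitrary):

def train_perceptron(samples, labels, epochs=10, lr=0.1):
    # Start with all-zero weights; the bias plays the role of a (negative) threshold
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            total = sum(w * xi for w, xi in zip(weights, x)) + bias
            output = 1 if total > 0 else 0
            error = target - output  # -1, 0, or +1
            # Nudge each weight toward the correct answer
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Learn the logical AND function from its truth table
print(train_perceptron([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 0, 0, 1]))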


Note

It is obvious that a decision is NOT made by one neuron alone.

Other neurons must provide input: Is the artist good? Is the weather good? ...

In neuroscience, there is a debate about whether single-neuron encoding or distributed encoding is
more relevant for understanding brain functions.

Neural Networks

The Perceptron defines the first step into Neural Networks.

Multi-Layer Perceptrons can be used for very sophisticated decision making.

In the Neural Network Model, input data (yellow) are processed against a hidden layer (blue)
and modified against more hidden layers (green) to produce the final output (red).

The First Layer:


The 3 yellow perceptrons are making 3 simple decisions based on the input evidence. Each
single decision is sent to the 4 perceptrons in the next layer.
The Second Layer:
The blue perceptrons are making decisions by weighing the results from the first layer. This layer
makes more complex decisions at a more abstract level than the first layer.

The Third Layer:


Even more complex decisions are made by the green perceptrons.

Multilayer Network:-

Multi-layer neural networks can be set up in numerous ways. Typically, they have at least one
input layer, which sends weighted inputs to a series of hidden layers, and an output layer at the
end. These more sophisticated setups are also associated with nonlinear builds using sigmoids and
other functions to direct the firing or activation of artificial neurons. While some of these systems
may be built physically, with physical materials, most are created with software functions that
model neural activity.
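
To make this concrete, here is a minimal sketch of a forward pass through a multi-layer network
with one hidden layer and sigmoid activations (the weights are random here, purely for illustration;
a real network would learn them):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))  # input layer (3 features) -> hidden layer (4 units)
W2 = rng.normal(size=(4, 1))  # hidden layer -> output layer (1 unit)

x = np.array([1.0, 0.0, 1.0])
hidden = sigmoid(x @ W1)       # weighted inputs pass through the hidden layer
output = sigmoid(hidden @ W2)  # hidden activations are combined into the output
print(output)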

Convolutional neural networks (CNNs), so useful for image processing and computer vision, as
well as recurrent neural networks, deep networks and deep belief systems are all examples of multi-
layer neural networks. CNNs, for example, can have dozens of layers that work sequentially on an
image. All of this is central to understanding how modern neural networks function.

Back Propagation :-

What is a backpropagation algorithm?

Backpropagation, or backward propagation of errors, is an algorithm that is designed to test for


errors working back from output nodes to input nodes. It is an important mathematical tool for
improving the accuracy of predictions in data mining and machine learning. Essentially,
backpropagation is an algorithm used to calculate derivatives quickly.

There are two leading types of backpropagation networks:

1. Static backpropagation. Static backpropagation is a network developed to map static inputs
to static outputs. Static backpropagation networks can solve static classification problems,
such as optical character recognition (OCR).
2. Recurrent backpropagation. The recurrent backpropagation network is used for fixed-point
learning. Recurrent backpropagation activation feeds forward until it reaches a fixed value.

The key difference here is that static backpropagation offers instant mapping and recurrent
backpropagation does not.


What is a backpropagation algorithm in a neural network?

Artificial neural networks use backpropagation as a learning algorithm to compute a gradient


descent with respect to weight values for the various inputs. By comparing desired outputs to
achieved system outputs, the systems are tuned by adjusting connection weights to narrow the
difference between the two as much as possible.

The algorithm gets its name because the weights are updated backward, from output to input.

The advantages of using a backpropagation algorithm are as follows:

• It does not have any parameters to tune except for the number of inputs.

• It is highly adaptable and efficient and does not require any prior knowledge about the
network.

• It is a standard process that usually works well.


• It is user-friendly, fast and easy to program.

• Users do not need to learn any special functions.

The disadvantages of using a backpropagation algorithm are as follows:

• It prefers a matrix-based approach over a mini-batch approach.

• Data mining is sensitive to noise and irregularities.

• Performance is highly dependent on input data.

• Training is time- and resource-intensive.

What is the objective of a backpropagation algorithm?

Backpropagation algorithms are used extensively to train feedforward neural networks in areas
such as deep learning. They efficiently compute the gradient of the loss function with respect to
the network weights. This approach eliminates the inefficient process of directly computing the
gradient with respect to each individual weight. It enables the use of gradient methods, like
gradient descent or stochastic gradient descent, to train multilayer networks and update weights to
minimize loss.
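
Here is a minimal numpy sketch of backpropagation in action: a one-hidden-layer network trained on
XOR with plain gradient descent. It is an illustration of the chain rule described above, not a
production implementation; the layer sizes, learning rate, and iteration count are arbitrary choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error from the output layer to the input layer
    d_out = (out - y) * out * (1 - out)   # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # gradient at the hidden layer
    # Update the weights backward, from output to input
    W2 -= 0.5 * (h.T @ d_out)
    b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * (X.T @ d_h)
    b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)

print(out.round(2))  # should approach [[0], [1], [1], [0]] (depends on the random start)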

The difficulty of understanding exactly how changing weights and biases affect the overall
behavior of an artificial neural network was one factor that held back more comprehensive use of
neural network applications, arguably until the early 2000s when computers provided the
necessary insight.
Today, backpropagation algorithms have practical applications in many areas of artificial
intelligence (AI), including OCR, natural language processing and image processing.

What is a backpropagation algorithm in machine learning?

Backpropagation requires a known, desired output for each input value in order to calculate the
loss function gradient -- how a prediction differs from actual results -- as a type of supervised
machine learning. Along with classifiers such as Naïve Bayesian filters and decision trees, the
backpropagation training algorithm has emerged as an important part of machine learning
applications that involve predictive analytics.

What is the time complexity of a backpropagation algorithm?

The time complexity of each iteration -- how long it takes to execute each statement in an algorithm
-- depends on the network's structure. For a multilayer perceptron, matrix multiplications dominate
the computation time.

What is a backpropagation momentum algorithm?

The concept of momentum in backpropagation states that previous weight changes must influence
the present direction of movement in weight space.

What is a backpropagation algorithm pseudocode?

The backpropagation algorithm pseudocode represents a plain language description of the steps in
a system.

What is the Levenberg-Marquardt backpropagation algorithm?

The Levenberg-Marquardt method helps adjust the weight and bias variables. Then, the
backpropagation algorithm is used to calculate the Jacobian matrix of performance functions
considering the weight and bias variables.
Introduction to Deep Neural Network:-

What is Deep Learning? Deep learning is a branch of machine learning which is completely based
on artificial neural networks; since neural networks mimic the human brain, deep learning is also
a kind of mimicry of the human brain. In deep learning, we don't need to explicitly program
everything. The concept of deep learning is not new; it has been around for a number of years. It
is popular nowadays because earlier we did not have that much processing power or that much data.
As processing power has increased exponentially over the last 20 years, deep learning and machine
learning have come into the picture. A formal definition of deep learning follows.

Deep Learning is a subset of Machine Learning that is based on artificial neural networks (ANNs)
with multiple layers, also known as deep neural networks (DNNs). These neural networks are
inspired by the structure and function of the human brain, and they are designed to learn from large
amounts of data in an unsupervised or semi-supervised manner.

Deep Learning models are able to automatically learn features from the data, which makes them
well-suited for tasks such as image recognition, speech recognition, and natural language
processing. The most widely used architectures in deep learning are feedforward neural networks,
convolutional neural networks (CNNs), and recurrent neural networks (RNNs).

Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow of
information through the network. FNNs have been widely used for tasks such as image
classification, speech recognition, and natural language processing.

Convolutional Neural Networks (CNNs) are a special type of FNNs designed specifically for
image and video recognition tasks. CNNs are able to automatically learn features from the images,
which makes them well-suited for tasks such as image classification, object detection, and image
segmentation.

Recurrent Neural Networks (RNNs) are a type of neural networks that are able to process
sequential data, such as time series and natural language. RNNs are able to maintain an internal
state that captures information about the previous inputs, which makes them well-suited for tasks
such as speech recognition, natural language processing, and language translation.
Deep Learning models are trained using large amounts of labeled data and require significant
computational resources. With the increasing availability of large amounts of data and
computational resources, deep learning has been able to achieve state-of-the-art performance in a
wide range of applications such as image and speech recognition, natural language processing, and
more.

Deep learning is a particular kind of machine learning that achieves great power and flexibility
by learning to represent the world as a nested hierarchy of concepts, with each concept defined in
relation to simpler concepts, and more abstract representations computed in terms of less abstract
ones.

The human brain contains approximately 100 billion neurons, and each neuron is connected to
thousands of its neighbours. The question here is how do we recreate these neurons in a computer.
We create an artificial structure called an artificial neural network, where we have nodes or
neurons. We have some neurons for input values and some for output values, and in between there
may be many interconnected neurons in the hidden layers.

Architectures:

1. Deep Neural Network – It is a neural network with a certain level of complexity (having
multiple hidden layers in between input and output layers). They are capable of modeling
and processing non-linear relationships.

2. Deep Belief Network (DBN) – It is a class of Deep Neural Network composed of multiple
layers of belief networks. Steps for training a DBN:
a. Learn a layer of features from visible units using the Contrastive Divergence algorithm.
b. Treat activations of the previously trained features as visible units and then learn
features of features.
c. Finally, the whole DBN is trained when the learning for the final hidden layer is achieved.

3. Recurrent Neural Network – Performs the same task for every element of a sequence and
allows for parallel and sequential computation. It is similar to the human brain (a large
feedback network of connected neurons). These networks are able to remember important things
about the input they received, which enables them to be more precise.
Difference between Machine Learning and Deep Learning:

Machine Learning | Deep Learning
Works on a small amount of data for accuracy. | Works on a large amount of data.
Dependent on low-end machines. | Heavily dependent on high-end machines.
Divides the task into sub-tasks, solves them individually, and finally combines the results. | Solves the problem end to end.
Takes less time to train. | Takes longer to train.
Testing time may increase. | Takes less time to test the data.

Working: First, we need to identify the actual problem in order to get the right solution, and it
should be well understood; the feasibility of deep learning should also be checked (whether the
problem fits deep learning or not). Second, we need to identify the relevant data which corresponds
to the actual problem, and prepare it accordingly. Third, choose the deep learning algorithm
appropriately. Fourth, use the algorithm to train the model on the dataset. Fifth, perform final
testing on the dataset.

Tools used: Anaconda, Jupyter, PyCharm, etc. Languages used: R, Python, Matlab, CPP, Java, Julia,
Lisp, JavaScript, etc.

Real Life Examples:
How do we recognize a square from other shapes?

a) Check the four lines!

b) Is it a closed figure?

c) Are the sides perpendicular to each other?

d) Are all sides equal?

So, deep learning breaks the complex task of identifying the shape down into simpler tasks on a
larger scale.

Recognizing an Animal! (Is it a Cat or a Dog?)

Deep learning defines the facial features which are important for classification, and the system
then identifies them automatically.

(Whereas machine learning requires those features to be given manually for classification.)

Limitations :

1. Learning through observations only.

2. The issue of biases.

Advantages :

1. Best in-class performance on problems.

2. Reduces need for feature engineering.

3. Eliminates unnecessary costs.

4. Identifies defects easily that are difficult to detect.

Disadvantages :

1. Large amount of data required.

2. Computationally expensive to train.

3. No strong theoretical foundation.


Applications :

1. Automatic Text Generation – A corpus of text is learned, and from this model new text is
generated word-by-word or character-by-character. The model is capable of learning how to
spell, punctuate, and form sentences, and it may even capture the style.

2. Healthcare – Helps in diagnosing various diseases and treating them.

3. Automatic Machine Translation – Certain words, sentences or phrases in one language are
transformed into another language (deep learning is achieving top results in the areas of
text and images).

4. Image Recognition – Recognizes and identifies people and objects in images, as well as
understanding content and context. This area is already being used in gaming, retail,
tourism, etc.

5. Predicting Earthquakes – Teaches a computer to perform viscoelastic computations which


are used in predicting earthquakes.

6. Deep learning has a wide range of applications in various fields such as computer vision,
speech recognition, natural language processing, and many more. Some of the most
common applications include:

7. Image and video recognition: Deep learning models are used to automatically classify
images and videos, detect objects, and identify faces. Applications include image and video
search engines, self-driving cars, and surveillance systems.

8. Speech recognition: Deep learning models are used to transcribe and translate speech in
real-time, which is used in voice-controlled devices, such as virtual assistants, and
accessibility technology for people with hearing impairments.

9. Natural Language Processing: Deep learning models are used to understand, generate and
translate human languages. Applications include machine translation, text summarization,
and sentiment analysis.
10. Robotics: Deep learning models are used to control robots and drones, and to improve their
ability to perceive and interact with the environment.

11. Healthcare: Deep learning models are used in medical imaging to detect diseases, in drug
discovery to identify new treatments, and in genomics to understand the underlying causes
of diseases.

12. Finance: Deep learning models are used to detect fraud, predict stock prices, and analyze
financial data.

13. Gaming: Deep learning models are used to create more realistic characters and
environments, and to improve the gameplay experience.

14. Recommender Systems: Deep learning models are used to make personalized
recommendations to users, such as product recommendations, movie recommendations,
and news recommendations.

15. Social Media: Deep learning models are used to identify fake news, to flag harmful content
and to filter out spam.

16. Autonomous systems: Deep learning models are used in self-driving cars, drones, and other
autonomous systems to make decisions based on sensor data.
[Unit 4]

Computational Learning Theory:-

Computational learning theory (CoLT) refers to applying formal mathematical methods to learning
systems using theoretical computer science tools to quantify learning problems. This task includes
discerning how hard it is for an artificial intelligence (AI) system to learn specific tasks.

Simply put, CoLT is an AI subfield devoted to studying the design and analysis of machine
learning (ML) algorithms. It analyzes how difficult it will be for an AI system to learn a task.


CoLT is typically applied to supervised learning, an ML technique requiring a human to train a


machine by giving it data specifically meant to help the computer achieve the desired results. Apart
from determining the level of difficulty of performing a task, CoLT also gauges how much time it
will take a machine to achieve the desired outcome, and even whether the task is feasible to begin with.

What Does Computational Learning Theory Determine?

CoLT formalizes three key aspects, namely:

• The way the learner interacts with its environment

• The definition of success in terms of completing the learning task

• A formal definition of efficiency of both data usage (sample complexity) and processing
time (time complexity)

CoLT considers a computation feasible if it can be performed in polynomial time. That means the
number of steps required to complete the algorithm for a given input is bounded by a polynomial in
the size of the input. CoLT produces two kinds of results for this: positive (the machine can learn
the task in polynomial time) and negative (the machine can't learn the task in polynomial time).

Why Is Computational Learning Theory Important?

You should remember that the theoretical learning models AI systems analyze represent real-life
problems abstractly. As such, ML experts have to validate or change the abstractions to ensure that
the computers will produce theoretical results that represent real-life solutions.

CoLT is thus critical to ML research. Besides the predictive capability CoLT offers, it also
addresses other vital factors, including simplicity, robustness to variations in the learning scenario,
and the ability to create insights into empirically observed phenomena. In other words, CoLT
simplifies the data the AI system has to process. Finally, it helps users ensure the computer can
adapt to changes in its environment even while learning a task. And it lets users understand and
apply the results to real-life situations.

What Fields Employ Computational Learning Theory?

CoLT is usually applied to statistics, calculus, geometry, information theory, probability theory,
and programming optimization.

What Questions Can Computational Learning Theory Answer?

CoLT can respond to questions, such as:

• How can you tell if a model decently approximates the goal function? (How can you tell if
an ML algorithm accurately represents your objective?)

• How can you determine if you have a good answer at the local or global level? (How can
you tell if the ML model successfully provided you with the correct results for a specific
or general task?)

• What type of hypothesis space should be employed? (How many hypotheses should the
machine come up with?)

• What can you do to avoid overfitting? (What can you do to prevent the results from
becoming only applicable to the data studied?)
• How many examples of data are required?

What Are the Practical Applications of Computational Learning Theory?

CoLT has several uses, including:

• It can help programmers predict how well their algorithms will do in processing and
analyzing specific volumes of data. Will they work as well on 5 million data points as they
did on 1 million data points? Were the results just as accurate?

• It simplifies the process of modifying algorithms by limiting the possible parameters,


making software development and upgrading faster. Here, unsupervised learning plays a
huge part, specifically in labeling data and choosing data points that are likely to produce
the best results.

PAC learning model:-

We want to learn the concept "medium-built person" from examples. We are given the height and
weight of m individuals, the training set. We are told for each [height, weight] pair whether or not
it is of medium build. We would like to learn this concept, i.e. produce an algorithm that in the
future answers correctly whether a pair [height, weight] represents a medium-built person or not.
We are interested in knowing which value of m to use if we want to learn this concept well. Our
concern is to characterize what we mean by well or good when evaluating learned concepts.

First we examine what we mean by saying the probability of error of the learned concept is at most
epsilon.

Say that c is the (true) concept we are learning, h is the concept we have learned, then

error(h) = Probability[c(x) ≠ h(x) for an individual x] < epsilon

For a specific learning algorithm, what is the probability that a concept it learns will have an error
that is bounded by epsilon? We would like to set a bound delta on the probability that this error is
greater than epsilon. That is,

Probability[error(h) > epsilon] < delta


We are now in a position to say when a learned concept is good:

When the probability that its error is greater than the accuracy epsilon is less than
the confidence delta.

Different degrees of "goodness" will correspond to different values of epsilon and delta. The
smaller epsilon and delta are, the better the learned concept will be.

This method of evaluating learning is called Probably Approximately Correct (PAC) Learning and
will be defined more precisely in the next section.

Our problem, for a given concept to be learned, and given epsilon and delta, is to determine the
size of the training set. This may or may not depend on the algorithm used to derive the learned concept.

Going back to our problem of learning the concept medium-built people, we can assume that the
concept is represented as a rectangle, with sides parallel to the axes height/weight, and with
dimensions height_min, height_max, weight_min, weight_max. We assume that also the
hypotheses will take the form of a rectangle with the sides parallel to the axes.

We will use a simple algorithm to build the learned concept from the training set:

1. If there are no positive individuals in the training set, the learned concept is null.

2. Otherwise it is the smallest rectangle with sides parallel to the axes which contains the
positive individuals.

We would like to know how good this learning algorithm is. We choose epsilon and delta and
determine a value for m that will satisfy the PAC learning condition.

An individual x will be classified incorrectly by the learned concept h if x lays in the area between
h and c. We divide this area into 4 strips, on top, bottom and sides of h. We allow these strips,
pessimistically, to overlap in the corners. In figure 1 we represent the top strip as t'. If each of these
strips is of area at most epsilon/4, i.e. is contained in the strip t of area epsilon/4, then the error for
our hypothesis h will be bound by epsilon. [In determining the area of a strip we need to make
some hypothesis about the probability of each point. However our analysis is valid for any chosen
distribution.]
What is the probability that the hypothesis h will have an error bound by epsilon? We determine
the probability that one individual will be outside of the strip t, (1 - epsilon/4). Then we determine
the probability that all m individuals will be outside of the strip t, (1 - epsilon/4)^m. The probability
of all m individuals to be simultaneously outside of at least one of the four strips is 4*(1 -
epsilon/4)^m. When all the m individuals are outside of at least one of the strips, then the
probability of error for an individual can be greater than epsilon [this is pessimistic]. Thus if we
bound 4*(1 - epsilon/4)^m by delta, we make sure that the probability of the individual error being
greater than epsilon is at most delta.

From the condition

4*(1 - epsilon/4)^m < delta

we obtain

m > ln(delta/4) / ln(1 - epsilon/4)

If we remember that for y < 1, we have


-ln(1 - y) = y + y^2/2 + y^3/3 + ..
and thus
(1 - y) < e^(-y)
We derive
m > (4/epsilon) * ln(4/delta)
which tells us the size of the training set required for the chosen values of epsilon and delta.
Here are some representative values:
epsilon | delta | m
=======================
0.1 | 0.1 | 148
0.1 | 0.01 | 240
0.1 | 0.001 | 332
-----------------------
0.01 | 0.1 | 1476
0.01 | 0.01 | 2397
0.01 | 0.001 | 3318
-----------------------
0.001 | 0.1 | 14756
0.001 | 0.01 | 23966
0.001 | 0.001 | 33176
=======================
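Here is a minimal Python sketch of this bound, rounding m up to the next integer (the function name
is ours, for illustration):

import math

# Training-set size bound m > (4/epsilon) * ln(4/delta)
def sample_size(epsilon, delta):
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

for epsilon in (0.1, 0.01, 0.001):
    for delta in (0.1, 0.01, 0.001):
        print(epsilon, delta, sample_size(epsilon, delta))
# e.g. epsilon=0.1, delta=0.1 gives m = 148, matching the table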
It is not difficult, though tedious, to verify this bound. One chooses a rectangle c, chooses values
for epsilon and delta, and runs an experiment with a training set of size m. One then runs a test set
and determines if the probability of error is greater than epsilon.
One repeats this procedure a number of times to determine the probability that the error was greater
than epsilon. One can then verify that the resulting probability is bound by delta.
Here are some subcases of our problem of learning "rectangular concepts":

Two of the sides overlap the axes (thus we have two strips only)

The concept is a one dimensional interval (we still have two strips)

The concept is a one dimensional interval starting at 0 (thus only one strip).

PAC Learning deals with the question of how to choose the size of the training set, if we want to
have confidence delta that the learned concept will have an error that is bound by epsilon.

Formal Definition of PAC Learning

We set up the learning conditions as follows:

f is the function that we want to learn, the target function.

F is the class of functions from which f can be selected. f is an element of F.

X is the set of possible individuals. It is the domain of f.

N is the cardinality of X.
D is a probability distribution on X; this distribution is used both when the training set is created
and when the test set is created.

ORACLE(f,D), a function that in a unit of time returns a pair of the form (x,f(x)), where x is
selected from X according to D.

H is the set of possible hypotheses.

h is the specific hypothesis that has been learned. h is an element of H.

m is the cardinality of the training set.

NOTE: We can examine learning in terms of functions or of concepts, i.e. sets. They are
equivalent, if we remember the use of characteristic functions for sets.

The error of an hypothesis h is defined as follows:

error(h) = Probability[f(x) ≠ h(x), x chosen from X according to D]

DEFINITION: A class of functions F is Probably Approximately Correct (PAC) Learnable if there is a
learning algorithm L that for all f in F, all distributions D on X, all epsilon (0 < epsilon < 1) and
delta (0 < delta < 1), will produce an hypothesis h, such that the probability is at most delta that
error(h) > epsilon.
L has access to the values of epsilon and delta, and to ORACLE(f,D).
F is Efficiently PAC Learnable if L runs in time polynomial in 1/epsilon, 1/delta, and ln(N). It is
Polynomial PAC Learnable if m is polynomial in 1/epsilon, 1/delta, and the size of (minimal)
descriptions of individuals and of the concept.

Lower Bound on the Size of the Training Set

Surprisingly, we can derive a general lower bound on the size m of the training set required in
PAC learning. We assume that for each x in the training set we will have h(x) = f(x), that is, the
hypothesis is consistent with the target concept on the training set.

If the hypothesis h is bad, i.e. it has an error greater than epsilon, and is consistent on the training
set, then the probability that on one individual x we have h(x) = f(x) is at most (1 - epsilon), and the
probability of having f(x) = h(x) for m individuals is at most (1 - epsilon)^m. This is an upper bound
on the probability of a bad consistent function. If we multiply this value by the number of bad
hypotheses, we have an upper bound on the probability of learning a bad consistent hypothesis.

The number of bad hypotheses is certainly less than the number N of hypotheses. Thus
Probability[h is bad and consistent] < N*(1 - epsilon)^m < delta
Which we can rewrite as
Probability[h is bad and consistent] < N*(e^(-epsilon))^m = N*(e^(-m*epsilon)) < delta
Solving this inequality for m

m > (1/epsilon)*(ln(1/delta)+ln N)
Learning a Boolean Function
Suppose we want to learn a boolean function of n variables. The number N of such functions is
2^(2^n). Thus
m > (1/epsilon)*(ln(1/delta)+(2^n)ln2)

Here are some values of m as a function of n, epsilon, and delta.


n | epsilon | delta | m
===========================
5 | 0.1 | 0.1 |245
5 | 0.1 | 0.01 |268
5 | 0.01 | 0.1 |2450
5 | 0.01 | 0.01 |2680
---------------------------
10 | 0.1 | 0.1 |7123
10 | 0.1 | 0.01 |7146
10 | 0.01 | 0.1 |71230
10 | 0.01 | 0.01 |71460
===========================
Of course it would be much easier to learn a symmetric boolean function since there are many
fewer functions of n variables, namely 2^(n+1).

Learning a Boolean Conjunction

We can efficiently PAC learn concepts that are represented as the conjunction of boolean literals
(i.e. positive or negative boolean variables). Here is a learning algorithm:

1. Start with an hypothesis h which is the conjunction of each variable and its negation x1 &
~x1 & x2 & ~x2 & .. & xn & ~xn.
2. Do nothing with negative instances.

3. On a positive instance a, eliminate in h ~xi if ai is positive, eliminate xi if ai is negative.


For example if a positive instance is 01100 then eliminate x1, ~x2, ~x3, x4, and x5.
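
Here is a minimal Python sketch of this elimination algorithm (the representation is ours: the
string 'xi' stands for the literal xi and '~xi' for its negation):

def learn_conjunction(positive_examples, n):
    # Start with every variable and its negation in the hypothesis
    h = {f'x{i}' for i in range(1, n + 1)} | {f'~x{i}' for i in range(1, n + 1)}
    for a in positive_examples:       # negative instances are ignored
        for i, bit in enumerate(a, start=1):
            if bit == '1':
                h.discard(f'~x{i}')  # ai positive: eliminate ~xi
            else:
                h.discard(f'x{i}')   # ai negative: eliminate xi
    return h

print(sorted(learn_conjunction(['01100'], n=5)))
# x1, ~x2, ~x3, x4 and x5 are eliminated; ~x1, x2, x3, ~x4, ~x5 remain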

In this algorithm h, viewed as a set of instances, is non-decreasing and at all times contained in
the set denoted by c. [By induction: certainly true initially, ...]. We will have an error when h
contains a literal z which is not in c.

We compute first the probability that a literal z is deleted from h because of one specific positive
example. Clearly this probability is 0 if z occurs in c, and if ~z is in c the probability is 1. At issue
are the z where neither z nor ~z is in c. We would like to eliminate both of them from h. If one of
the two remains, we have an error for an instance a that is positive for c and negative for h. Let's
call these literals free literals.
We have:

error(h) is less than or equal to the Sum of the probabilities of the free literals z in h not to be
eliminated by one positive example.

Since there are at most 2*n literals in h, if h is a bad hypothesis, i.e. an hypothesis with error greater
than epsilon, we will have

Probability[free literal z is eliminated from h by one positive example] > epsilon/(2*n)

From this we obtain :

Probability[free literal z survives one positive example] = 1 - Probability[free literal z is eliminated


from h by one positive example] < (1 - epsilon/(2*n))

Probability[free literal z survives m positive examples] < (1 - epsilon/(2*n))^m

Probability[some free literal z survives m positive examples] < 2n*(1 - epsilon/(2*n))^m <
2n*(e^(-epsilon/(2*n)))^m = 2n*e^(-(m*epsilon)/(2*n))

That is

m > (2*n/epsilon)*(ln(1/delta) + ln(2*n))
Sample complexity:-

What is complexity in machine learning?

In machine learning, model complexity often refers to the number of features or terms included
in a given predictive model, as well as whether the chosen model is linear, nonlinear, and so
on. It can also refer to the algorithmic learning complexity or computational complexity.
How do you calculate the time complexity of a machine learning model?

Typical training time complexities (where n is the number of training samples and m is the number
of features) include:

• Linear Regression: train time complexity = O(n*m^2 + m^3)
• Logistic Regression: train time complexity = O(n*m)
• K Nearest Neighbors: train time complexity = O(k*n*m)
• SVM: train time complexity = O(n^2)
• Decision Tree
• Random Forest
• Naive Bayes

VC Dimension :-

What is the purpose of VC dimension?

The Vapnik-Chervonenkis dimension, more commonly known as the VC dimension, is a model


capacity measurement used in statistics and machine learning. It is termed informally as a measure
of a model's capacity. It is used frequently to guide the model selection process while developing
machine learning applications.

Why is VC dimension important for machine learning?


VC dimension is a formal measure of bias which has played an important role in
mathematical work on learnability. the maximum number of datapoints that can be separated
(i.e., grouped) in all possible ways.

What is VC dimension for a classifier?

The VC dimension of a classifier is defined by Vapnik and Chervonenkis to be the cardinality


(size) of the largest set of points that the classification algorithm can shatter [1].

What is VC dimension in SVM?

The VC dimension of {f(α)} is the maximum number of training points that can be shattered
by {f(α)}. For example, the VC dimension of a set of oriented lines in R2 is three. In general, the
VC dimension of a set of oriented hyperplanes in Rn is n+1. Note: we need to find just one set of
points.
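
As a small empirical illustration of the claim about oriented lines in R2, the sketch below tries
every labeling of 3 non-collinear points and checks that a linear classifier can realize it (this
demonstrates shattering for one point set; it is an illustration, not a proof):

from itertools import product
from sklearn.svm import LinearSVC

points = [[0, 0], [1, 0], [0, 1]]  # 3 non-collinear points
shattered = True
for labels in product([0, 1], repeat=3):
    if len(set(labels)) < 2:
        continue  # one-class labelings are trivially separable
    clf = LinearSVC(C=1e6).fit(points, list(labels))
    if clf.score(points, list(labels)) < 1.0:
        shattered = False
print("shattered:", shattered)  # expected: True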

Ensemble learning.-

What is meant by ensemble learning?

Ensemble learning is the process by which multiple models, such as classifiers or experts, are
strategically generated and combined to solve a particular computational intelligence
problem. Ensemble learning is primarily used to improve the performance of a model on tasks such
as classification, prediction, and function approximation.

What is ensemble learning give an example?


An example of an ensemble learning algorithm is bagging [2]. Given a learning algorithm for
creating single predictive models and a data set, bagging creates diverse predictive models by
feeding different uniform samples of the data set to the learning algorithm in order to create each
model.
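
Here is a minimal scikit-learn sketch of bagging as described above: each of the trees is trained
on a different bootstrap sample of the same synthetic dataset (the dataset and parameter choices
are ours, for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# 10 decision trees, each fit on a different uniform (bootstrap) sample
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
bagging.fit(X, y)
print(bagging.score(X, y))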

Which are the three types of ensemble learning?

The three main classes of ensemble learning methods are bagging, stacking, and boosting, and it
is important to both have a detailed understanding of each method and to consider them on your
predictive modeling project.

[Unit 5]
Clustering k-means :-
K-means clustering is one of the simplest and popular unsupervised machine learning algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors
without referring to known, or labelled, outcomes.
As AndreyBu puts it, "the objective of K-means is simple: group similar data points together and
discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of
clusters in a dataset."
A cluster refers to a collection of data points aggregated together because of certain similarities.
You’ll define a target number k, which refers to the number of centroids you need in the dataset.
A centroid is the imaginary or real location representing the center of the cluster.
Every data point is allocated to one of the clusters by reducing the in-cluster sum of squares.
In other words, the K-means algorithm identifies k centroids, and then allocates every data point
to the nearest cluster, while keeping the clusters as compact as possible.
The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.
How the K-means algorithm works
To process the learning data, the K-means algorithm in data mining starts with a first group of
randomly selected centroids, which are used as the beginning points for every cluster, and then
performs iterative (repetitive) calculations to optimize the positions of the centroids.
It halts creating and optimizing clusters when either:
• The centroids have stabilized — there is no change in their values because the clustering
has been successful.
• The defined number of iterations has been achieved.
K-means algorithm example problem
Let’s see the steps on how the K-means machine learning algorithm works using the Python
programming language.
We’ll use the Scikit-learn library and some random data to illustrate a K-means clustering simple
explanation.
Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline
As you can see from the above code, we’ll import the following libraries in our project:
• Pandas for reading and writing spreadsheets
• Numpy for carrying out efficient computations
• Matplotlib for visualization of data
Step 2: Generate random data
Here is the code for generating some random data in a two-dimensional space:
X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()
A total of 100 data points has been generated and divided into two groups, of 50 points each.
Here is how the data is displayed on a two-dimensional space:

Step 3: Use Scikit-Learn


We’ll use some of the available functions in the Scikit-learn library to process the randomly
generated data.
Here is the code:
from sklearn.cluster import KMeans

Kmean = KMeans(n_clusters=2)
Kmean.fit(X)

In this case, we arbitrarily gave k (n_clusters) a value of two.
Here is the output of the K-means parameters we get if we run the code:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
Step 4: Finding the centroid
Here is the code for finding the center of the clusters:
Kmean.cluster_centers_
Here is the result of the value of the centroids:
array([[-0.94665068, -0.97138368],
[ 2.01559419, 2.02597093]])
Let’s display the cluster centroids (using green and red color).
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()
Here is the output:

Step 5: Testing the algorithm


Here is the code for getting the labels property of the K-means clustering example dataset; that is,
how the data points are categorized into the two clusters.
Kmean.labels_
Here is the result of running the above K-means algorithm code:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
As you can see above, 50 data points belong to the 0 cluster while the rest belong to the 1 cluster.
For example, let’s use the code below for predicting the cluster of a data point:
sample_test = np.array([-3.0, -3.0])
second_test = sample_test.reshape(1, -1)
Kmean.predict(second_test)
Here is the result:
array([0])
It shows that the test data point belongs to the 0 (green centroid) cluster.
Wrapping up
Here is the entire K-means clustering algorithm code in Python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()

Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
Kmean.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.scatter(-0.94665068, -0.97138368, s=200, c='g', marker='s')
plt.scatter(2.01559419, 2.02597093, s=200, c='r', marker='s')
plt.show()

Kmean.labels_
sample_test = np.array([-3.0, -3.0])
second_test = sample_test.reshape(1, -1)
Kmean.predict(second_test)
K-means clustering is an extensively used technique for data cluster analysis.
It is easy to understand, especially if you accelerate your learning using a K-means clustering
tutorial. Furthermore, it delivers training results quickly.
However, its performance is usually not as competitive as that of the more sophisticated
clustering techniques, because slight variations in the data can lead to high variance.
Furthermore, clusters are assumed to be spherical and evenly sized, which may reduce
the accuracy of the K-means clustering results.

Adaptive Hierarchical Clustering:-


Introduction to Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together
the unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall
into the following two categories.
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data
point is treated as a single cluster, and then pairs of clusters are successively merged or
agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram
or tree structure.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the
data points are treated as one big cluster and the process of clustering involves dividing (Top-down
approach) the one big cluster into various small clusters.
Steps to Perform Agglomerative Hierarchical Clustering
We are going to explain the most used and important hierarchical clustering, i.e. agglomerative.
The steps to perform it are as follows −
• Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters
at the start. The number of data points will also be K at the start.
• Step 2 − Now, in this step we need to form a big cluster by joining the two closest data points.
This will result in a total of K-1 clusters.
• Step 3 − Now, to form more clusters we need to join the two closest clusters. This will result
in a total of K-2 clusters.
• Step 4 − Now, to form one big cluster, repeat the above three steps until only one cluster
remains, i.e. there are no more clusters left to join.
• Step 5 − At last, after making one single big cluster, dendrograms will be used to divide it
into multiple clusters depending upon the problem.
Role of Dendrograms in Agglomerative Hierarchical Clustering
As we discussed in the last step, the role of the dendrogram starts once the big cluster is formed. The dendrogram is used to split the clusters into multiple clusters of related data points depending upon our problem. It can be understood with the help of the following example −
Example 1
To understand, let us start with importing the required libraries as follows −
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
Next, we will be plotting the datapoints we have taken for this example −
X = np.array([[7,8],[12,20],[17,19],[26,15],[32,37],
              [87,75],[73,85],[62,80],[73,60],[87,96]])
labels = range(1, 11)
plt.figure(figsize = (10, 7))
plt.subplots_adjust(bottom = 0.1)
plt.scatter(X[:,0], X[:,1], label = 'True Position')
for label, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(label, xy = (x, y), xytext = (-3, 3),
                 textcoords = 'offset points', ha = 'right', va = 'bottom')
plt.show()
[Scatter plot of the ten annotated data points]
From the above diagram, it is very easy to see that we have two clusters in our data points, but in real-world data there can be thousands of clusters. Next, we will plot the dendrogram of our data points by using the SciPy library −
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
linked = linkage(X, 'single')
labelList = range(1, 11)
plt.figure(figsize = (10, 7))
dendrogram(linked, orientation = 'top', labels = labelList,
           distance_sort = 'descending', show_leaf_counts = True)
plt.show()
Now, once the big cluster is formed, the longest vertical distance that no horizontal line passes through is selected, and a horizontal line is drawn through it, as shown in the following diagram. As this horizontal line crosses the blue vertical lines at two points, the number of clusters would be two.
[Dendrogram of the data points with the horizontal cut line]
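The same cut can also be performed programmatically: SciPy's fcluster function extracts a fixed number of flat clusters from the linkage matrix computed above (a short sketch, assuming the linked variable from the previous snippet) −

from scipy.cluster.hierarchy import fcluster

# Cut the hierarchy so that at most two flat clusters remain.
cluster_ids = fcluster(linked, t = 2, criterion = 'maxclust')
print(cluster_ids)   # one cluster id (1 or 2) per data point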
Next, we need to import the class for clustering and call its fit_predict method to predict the
cluster. We are importing AgglomerativeClustering class of sklearn.cluster library −
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters = 2, affinity = 'euclidean', linkage = 'ward')
cluster.fit_predict(X)
Next, plot the cluster with the help of following code −
plt.scatter(X[:,0],X[:,1], c = cluster.labels_, cmap = 'rainbow')
The above diagram shows the two clusters from our data points.
Example 2
Now that we have understood the concept of dendrograms from the simple example discussed above, let us move to another example in which we create clusters of the data points in the Pima Indians Diabetes dataset by using hierarchical clustering.
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import numpy as np
from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names = headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
data.shape
(768, 9)
data.head()
   Preg  Plas  Pres  Skin  Test  Mass   Pedi  Age  Class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1
patient_data = data.iloc[:, 3:5].values

import scipy.cluster.hierarchy as shc
plt.figure(figsize = (10, 7))
plt.title("Patient Dendrograms")
dend = shc.dendrogram(shc.linkage(patient_data, method = 'ward'))

from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters = 4, affinity = 'euclidean', linkage = 'ward')
cluster.fit_predict(patient_data)
plt.figure(figsize = (10, 7))
plt.scatter(patient_data[:,0], patient_data[:,1], c = cluster.labels_, cmap = 'rainbow')
Gaussian mixture model:-
Suppose there is a set of data points that needs to be grouped into several parts or clusters based on their similarity. In machine learning, this is known as clustering.
There are several methods available for clustering:
• K-Means Clustering
• Hierarchical Clustering
• Gaussian Mixture Models
In this article, the Gaussian Mixture Model will be discussed.
Normal or Gaussian Distribution
In real life, many datasets can be modeled by a Gaussian distribution (univariate or multivariate). So it is quite natural and intuitive to assume that the clusters come from different Gaussian distributions. In other words, we try to model the dataset as a mixture of several Gaussian distributions. This is the core idea of this model.
In one dimension, the probability density function of a Gaussian distribution is given by

G(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

where \mu and \sigma^2 are respectively the mean and variance of the distribution.
For a multivariate (let us say d-variate) Gaussian distribution, the probability density function is given by

G(X \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X-\mu)^{T}\Sigma^{-1}(X-\mu)\right)

Here \mu is a d-dimensional vector denoting the mean of the distribution and \Sigma is the d \times d covariance matrix.
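As a quick numeric check of this density, SciPy's multivariate_normal can evaluate it directly; the mean vector and covariance matrix below are illustrative assumptions −

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])          # d-dimensional mean vector (d = 2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])     # d x d covariance matrix

x = np.array([0.5, -0.5])
print(multivariate_normal.pdf(x, mean = mu, cov = Sigma))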
Gaussian Mixture Model
Suppose there are K clusters (for the sake of simplicity, it is assumed here that the number of clusters is known and it is K). So \mu_k and \Sigma_k must also be estimated for each k. Had it been only one distribution, they would have been estimated by the maximum-likelihood method. But since there are K such clusters, the probability density is defined as a linear function of the densities of all these K distributions, i.e.

p(X) = \sum_{k=1}^{K} \pi_k \, G(X \mid \mu_k, \Sigma_k)

where \pi_k is the mixing coefficient for the k-th distribution.
For estimating the parameters by the maximum log-likelihood method, compute p(X \mid \mu, \Sigma, \pi):

\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, G(x_n \mid \mu_k, \Sigma_k)

Now define a random variable \gamma_k(X) such that \gamma_k(X) = p(k \mid X).

From Bayes' theorem,

\gamma_k(X) = \frac{\pi_k \, G(X \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, G(X \mid \mu_j, \Sigma_j)}
Now, for the log-likelihood function to be maximum, its derivatives with respect to \mu_k, \Sigma_k and \pi_k should be zero. Equating the derivative with respect to \mu_k to zero and rearranging the terms gives

\mu_k = \frac{\sum_{n=1}^{N} \gamma_k(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_k(x_n)}

Similarly, taking the derivatives with respect to \Sigma_k and \pi_k respectively, one can obtain the following expressions:

\Sigma_k = \frac{\sum_{n=1}^{N} \gamma_k(x_n)\,(x_n - \mu_k)(x_n - \mu_k)^{T}}{\sum_{n=1}^{N} \gamma_k(x_n)}

and

\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_k(x_n)

Note: N_k = \sum_{n=1}^{N} \gamma_k(x_n) denotes the effective number of sample points in the k-th cluster. Here it is assumed that there is a total of N samples, and each sample containing d features is denoted by x_n.
Since \gamma_k(x_n) itself depends on \mu_k, \Sigma_k and \pi_k, these equations are coupled, so it can be clearly seen that the parameters cannot be estimated in closed form. This is where the Expectation-Maximization algorithm is beneficial.
Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for model parameters when the data is incomplete, has some missing data points, or has some hidden (latent) variables. EM chooses some random values for the unknown quantities and uses them to estimate a first set of values; these new values are then used recursively to produce better estimates, filling in the missing points, until the values converge.
These are the two basic steps of the EM algorithm, namely the E step (Expectation or Estimation step) and the M step (Maximization step).
• Estimation step:
  • Initialize \mu_k, \Sigma_k and \pi_k by some random values, or by the K-means clustering results, or by the hierarchical clustering results.
  • Then, for those given parameter values, estimate the values of the latent variables (i.e. \gamma_k).
• Maximization step:
  • Update the values of the parameters (i.e. \mu_k, \Sigma_k and \pi_k) using the maximum-likelihood update equations derived above.
Algorithm:
• Initialize the means \mu_k, the covariances \Sigma_k and the mixing coefficients \pi_k.
• Compute the \gamma_k values for all k.
• Re-estimate all the parameters using the current \gamma_k values.
• Compute the log-likelihood function.
• Put some convergence criterion.
• If the log-likelihood value converges to some value (or if all the parameters converge to some values), then stop; else return to Step 2.
A compact code sketch of this loop is given below.
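Below is a compact, illustrative NumPy sketch of this loop for a one-dimensional, two-component mixture; the synthetic data and the initialization choices are assumptions made for demonstration −

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians.
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
N, K = len(X), 2

# Step 1: initialize mu_k, sigma_k and pi_k.
mu = rng.choice(X, K)
sigma = np.ones(K)
pi = np.full(K, 1.0 / K)

def gauss(x, m, s):
    # 1-D Gaussian density G(x | m, s^2).
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

log_lik = -np.inf
for _ in range(200):
    # E step: responsibilities gamma_k(x_n) via Bayes' theorem.
    dens = np.stack([pi[k] * gauss(X, mu[k], sigma[k]) for k in range(K)])
    gamma = dens / dens.sum(axis = 0)

    # M step: re-estimate mu_k, sigma_k and pi_k with the update equations.
    Nk = gamma.sum(axis = 1)
    mu = (gamma * X).sum(axis = 1) / Nk
    sigma = np.sqrt((gamma * (X - mu[:, None]) ** 2).sum(axis = 1) / Nk)
    pi = Nk / N

    # Convergence check on the log-likelihood.
    new_log_lik = np.log(dens.sum(axis = 0)).sum()
    if abs(new_log_lik - log_lik) < 1e-6:
        break
    log_lik = new_log_lik

print(mu, sigma, pi)   # means should land near -2 and 3, pi near 0.5 each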
Example: In this example, the IRIS dataset is taken. In Python, scikit-learn provides a GaussianMixture class to implement GMM.
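A possible minimal implementation of that example is sketched below; the choice of three components (the dataset has three species) and the plotted feature pair are illustrative −

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

iris = load_iris()
X = iris.data

# Fit a three-component GMM with full covariance matrices.
gmm = GaussianMixture(n_components = 3, covariance_type = 'full', random_state = 0)
labels = gmm.fit_predict(X)

# Plot the first two features coloured by the predicted component.
plt.scatter(X[:, 0], X[:, 1], c = labels, cmap = 'rainbow')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()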