Machine Learning Notes
[Unit 1] [7 Hours]
Basic definitions, types of learning, hypothesis space and inductive bias, evaluation, cross-
validation, Linear regression, Decision trees, over fitting, Instance based learning, Feature
reduction, Collaborative filtering based recommendation
[Unit 2] [7 Hours]
Probability and Bayes learning, Logistic Regression, Support Vector Machine, Kernel function
and Kernel SVM.
[Unit 3] [7 Hours]
[Unit 4] [7 Hours]
[Unit 5] [7 Hours]
Text Book:
Reference Books:
We do not know exactly which people are likely to buy this ice cream flavor, or the next book of
this author, or see this new movie, or visit this city, or click this link. If we knew, we would not
need any analysis of the data; we would just go ahead and write down the code. But because we
do not, we can only collect data and hope to extract the answers to these and similar questions
from data. We do believe that there is a process that explains the data we observe. Though we do
not know the details of the process underlying the generation of data—for example, consumer
behavior—we know that it is not completely random. People do not go to supermarkets and buy
things at random. When they buy beer, they buy chips; they buy ice cream in summer and spices
for Glühwein in winter. There are certain patterns in the data. We may not be able to identify the
process completely, but we believe we can construct a good and useful approximation. That
approximation may not explain everything, but may still be able to account for some part of the
data. We believe that though identifying the complete process may not be possible, we can still
detect certain patterns or regularities. This is the niche of machine learning. Such patterns may
help us understand the process, or we can use those patterns to make predictions: Assuming that
the future, at least the near future, will not be much different from the past when the sample data
was collected, the future predictions can also be expected to be right.
Application of machine learning methods to large databases is called data mining. The analogy is that a large volume of earth and raw material is extracted from a mine, which when processed leads to a small amount of very precious material; similarly, in data mining, a large volume of data is processed to construct a simple model with valuable use, for example having high predictive accuracy. Its application
areas are abundant: In addition to retail, in finance banks analyze their past data to build models
to use in credit applications, fraud detection, and the stock market. In manufacturing, learning
models are used for optimization, control, and troubleshooting. In medicine, learning programs are
used for medical diagnosis. In telecommunications, call patterns are analyzed for network
optimization and maximizing the quality of service. In science, large amounts of data in physics,
astronomy, and biology can only be analyzed fast enough by computers. The World Wide Web is
huge; it is constantly growing, and searching for relevant information cannot be done manually.
Types of Learning:-
1. Supervised learning
Gartner, a business consulting firm, predicts that supervised learning will remain the most utilized type of machine learning among enterprise information technology leaders in 2022 [2]. This type of machine learning feeds historical input and output data into machine learning algorithms, with processing in between each input/output pair that allows the algorithm to shift the model to create
outputs as closely aligned with the desired result as possible. Common algorithms used during
supervised learning include neural networks, decision trees, linear regression, and support vector
machines.
This machine learning type got its name because the machine is “supervised” while it's learning,
which means that you’re feeding the algorithm information to help it learn. The outcome you
provide the machine is labeled data, and the rest of the information you give is used as input
features.
For example, if you were trying to learn about the relationships between loan defaults and borrower
information, you might provide the machine with 500 cases of customers who defaulted on their
loans and another 500 who didn't. The labeled data “supervises” the machine to figure out the
information you're looking for.
Supervised learning is effective for a variety of business purposes, including sales forecasting, inventory optimization, and fraud detection.
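For illustration only, here is a minimal supervised-learning sketch in scikit-learn; the borrower features, labels, and numbers below are invented, not taken from these notes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled data: each row is a borrower (age/100, income/100000); label 1 = defaulted (invented)
X = np.array([[0.25, 0.4], [0.45, 0.9], [0.35, 0.6], [0.50, 1.2], [0.23, 0.2], [0.40, 0.8]])
y = np.array([1, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = LogisticRegression().fit(X_train, y_train)   # the labels "supervise" the learning
print(model.predict(X_test))                         # predicted labels for unseen borrowers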
2. Unsupervised learning
While supervised learning requires users to help the machine learn, unsupervised learning doesn't
use the same labeled training sets and data. Instead, the machine looks for less obvious patterns in
the data. This machine learning type is very helpful when you need to identify patterns and use
data to make decisions. Common algorithms used in unsupervised learning include Hidden
Markov models, k-means, hierarchical clustering, and Gaussian mixture models.
Using the example from supervised learning, let's say you didn't know which customers did or
didn't default on loans. Instead, you'd provide the machine with borrower information and it would
look for patterns between borrowers before grouping them into several clusters.
This type of machine learning is widely used to create predictive models. Common applications
also include clustering, which creates a model that groups objects together based on specific
properties, and association, which identifies the rules existing between the clusters. A few example
use cases include:
• Pinpointing associations in customer data (for example, customers who buy a specific style
of handbag might be interested in a specific style of shoe)
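As a hedged illustration (not from the source text), the k-means algorithm mentioned above can be run on unlabeled borrower-style data in a few lines with scikit-learn; the numbers below are invented purely for demonstration.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled borrower data: [income (scaled), outstanding debt (scaled)] -- invented values
X = np.array([[0.2, 0.8], [0.25, 0.75], [0.9, 0.1], [0.85, 0.15], [0.5, 0.5], [0.55, 0.45]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each borrower
print(kmeans.cluster_centers_)   # centre of each discovered group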
3. Reinforcement learning
Reinforcement learning is the closest machine learning type to how humans learn. The algorithm
or agent used learns by interacting with its environment and getting a positive or negative reward.
Common algorithms include temporal difference, deep adversarial networks, and Q-learning.
Going back to the bank loan customer example, you might use a reinforcement learning algorithm
to look at customer information. If the algorithm classifies them as high-risk and they default, the
algorithm gets a positive reward. If they don't default, the algorithm gets a negative reward. In the
end, both instances help the machine learn by understanding both the problem and environment
better.
Gartner notes that most ML platforms don't have reinforcement learning capabilities because it
requires higher computing power than most organizations have [2]. Reinforcement learning is
applicable in areas capable of being fully simulated that are either stationary or have large volumes
of relevant data. Because this type of machine learning requires less management than supervised learning, it is viewed as easier to work with when dealing with unlabeled data sets. Practical applications
for this type of machine learning are still emerging. Some examples of uses include:
• Training robots to learn policies using raw video images as input that they can use to
replicate the actions they see
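The Q-learning update mentioned above can be written in a few lines. The sketch below is illustrative only: the reward values, learning rate, and tiny environment are assumptions, not part of these notes.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))      # Q-table: expected return per (state, action)
alpha, gamma = 0.1, 0.9                  # learning rate and discount factor (assumed values)

def q_update(state, action, reward, next_state):
    # Classic tabular Q-learning rule:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])

q_update(state=0, action=1, reward=1.0, next_state=2)   # e.g. the agent received a positive reward
print(Q[0])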
Inductive Bias
Consider the two types of supervised learning problems, classification and regression, which depend on the output attribute type (that is, discrete valued or continuous valued). In classification, this f(x) is discrete, while in regression f(x) is continuous. Apart from classification and regression, in some cases we may want to determine the probability of a particular value of y; in such cases of probability estimation, our f(x) is the probability of x. These are the types of learning problems for which we look at inductive bias.
We call this inductive bias because we are given some data and we are trying to do induction, that is, to identify a function which can explain the data. Unless we can see all the instances (all the possible data points), or we make some restrictive assumptions about the language in which the hypothesis is expressed, or some bias, this problem is not well defined. Therefore it is called an inductive bias.
Example Situation
Let us look at a classification problem. Let us assume that it is a 2-class classification problem, that is, we are provided with a number of instances or examples: some of them belong to Class 1 and the others belong to Class 2.
We are also provided with a training set which comprises a subset of these instances; some of them are marked Class 1 and some of them are marked Class 2, and we can say that Class 1 is positive and Class 2 is negative. We can map the different points in the feature space as shown in Figure 1.
Our objective is for the function to know/predict, when a new instance is provided, whether this instance will be positive or negative. The function should separate the positive points from the negative points with the help of a curve or a line. Depending upon the separating line, we can determine whether a new instance is positive or negative.
Let us again consider Figure 1 and mark some test points (the question-mark points in Figure 2). What we have to determine in the prediction problem is the class of these points (positive or negative). In order to answer the prediction problem we have to come up with a curve (that is, a function). Consider the function marked by the pink line. According to this function, the test points to the right would be negative (based on the trends) and those to the left would be positive.
Hypothesis Space
Instead of this particular line, we could have used other functions as the hypothesis. As shown in Figure 3, all of these are possible functions which we could have found.
The set of all such legal functions (that are possible) defines the hypothesis space. In a particular learning problem we first define the hypothesis space (the class of functions we are going to consider), and then, given the data points, we try to come up with the best hypothesis.
Consider, for example, instances described by four Boolean features X1, X2, X3 and X4, each of which can take the value 1 or 0.
Thus the number of possible instances is 16 (= 2^4). How many Boolean functions are possible? A function will classify some of the 16 points as positive and the others as negative, so the number of functions is the number of possible subsets of the 16 instances, which is 2^16. This can be generalised to n Boolean features: instead of 4 Boolean features, if we have n Boolean features, then the number of possible instances is 2^n and the number of possible functions is 2^(2^n).
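These counts can be checked with a couple of lines of Python (a small illustrative calculation, not part of the original notes):
# Number of possible instances and Boolean functions over n Boolean features
for n in (2, 3, 4):
    instances = 2 ** n              # each feature can be 0 or 1
    functions = 2 ** instances      # every subset of the instances is a possible function
    print(n, instances, functions)  # n = 4 gives 16 instances and 65,536 functions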
As can be seen, the hypothesis space is gigantic in size and it is not possible to look at every hypothesis individually in order to select the best one. So one puts restrictions on the hypothesis space in order to consider only a specific subset of hypotheses. These restrictions are also referred to as bias, and they are of many types.
An example is Occam's Razor, which states that the simplest hypothesis consistent with the data is the best and should be chosen as the hypothesis. Other types of bias include minimum description length and maximum margin bias. The choice of bias depends on the requirements and the available data sets.
EVALUATION:-
Types of evaluation
Many types of evaluation exist; consequently, evaluation methods need to be customised according to what is being evaluated and the purpose of the evaluation.1,2 It is important to understand the
different types of evaluation that can be conducted over a program’s life-cycle and when they
should be used. The main types of evaluation are process, impact, outcome and summative
evaluation.1
Before you are able to measure the effectiveness of your project, you need to determine if the
project is being run as intended and if it is reaching the intended audience.3 It is futile to try to determine how effective your program is if you are not certain of the objective, structure, programming and audience of the project. This is why process evaluation should be done prior to
any other type of evaluation.3
Process evaluation
Process evaluation is used to “measure the activities of the program, program quality and who it is reaching”.3 Process evaluation, as outlined by Hawe and colleagues,3 will help answer questions
about your program such as:
• Are all project activities reaching all parts of the target group?
• Are participants and other key stakeholders satisfied with all aspects of the project?
• Are all materials, information and presentations suitable for the target audience?
Impact evaluation
Impact evaluation is used to measure the immediate effects of the program and is aligned with the program's objectives. Impact evaluation measures how well the program's objectives (and sub-objectives) have been achieved.1,3
• How well has the project achieved its objectives (and sub-objectives)?
• How well have the desired short term changes been achieved?
For example, one of the objectives of the My-Peer project is to provide a safe space and learning
environment for young people, without fear of judgment, misunderstanding, harassment or abuse.
Impact evaluation will assess the attitudes of young people towards the learning environment and
how they perceived it. It may also assess changes in participants’ self-esteem, confidence and
social connectedness.
Impact evaluation measures the program's effectiveness immediately after the completion of the program and up to six months after the completion of the program.
Outcome evaluation
Outcome evaluation is concerned with the long term effects of the program and is generally used
to measure the program goal. Consequently, outcome evaluation measures how well the program
goal has been achieved.1,3
• What factors outside the program, if any, have contributed to or hindered the desired change?
In peer-based youth programs outcome evaluation may measure changes to: mental and physical
wellbeing, education and employment and help-seeking behaviours.
Outcome evaluation measures changes at least six months after the implementation of the program
(longer term). Although outcome evaluation measures the main goal of the program, it can also be
used to assess program objectives over time. It should be noted that it is not always possible or
appropriate to conduct outcome evaluation in peer-based programs.
Cross-validation:-
What is cross-validation?
Cross-validation is a technique for evaluating a machine learning model and testing its
performance. CV is commonly used in applied ML tasks. It helps to compare and select an
appropriate model for the specific predictive modeling problem.
CV is easy to understand, easy to implement, and it tends to have a lower bias than other methods used to estimate the model's performance. All this makes cross-validation a powerful tool for selecting the best model for a specific task.
There are a lot of different techniques that may be used to cross-validate a model. Still, all of them follow a similar algorithm:
1. Divide the dataset into two parts: one for training, the other for testing
2. Train the model on the training set
3. Validate the model on the test set
4. Repeat steps 1-3 a couple of times. This number depends on the CV method that you are using
As you may know, there are plenty of CV techniques. Some of them are commonly used, others
work only in theory. Let’s see the cross-validation methods that will be covered in this article.
• Hold-out
• K-folds
• Leave-one-out
• Leave-p-out
• Stratified K-folds
• Repeated K-folds
• Nested K-folds
• Time series CV
Hold-out cross-validation
Hold-out cross-validation is the simplest and most common technique. You might not know that it is a hold-out method, but you certainly use it every day. The algorithm is simply: split the dataset into a training set and a test set (commonly 80% for training and 20% for testing), train the model on the training set, and validate it on the test set.
We usually use the hold-out method on large datasets as it requires training the model only once. To perform hold-out cross-validation, sklearn provides train_test_split:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(20).reshape((10, 2)), np.arange(10)   # toy data for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
However, hold-out cross-validation has a major disadvantage. Consider, for example, a dataset that is not completely even distribution-wise; if so, we may end up in a rough spot after the split. For example, the training set will not represent the test set: both sets may differ a lot, and one of them might be easier or harder than the other.
Moreover, the fact that we test our model only once might be a bottleneck for this method. Due to
the reasons mentioned before, the result obtained by the hold-out technique may be considered
inaccurate.
k-Fold cross-validation
k-Fold cross-validation is a technique that minimizes the disadvantages of the hold-out method. k-
Fold introduces a new way of splitting the dataset which helps to overcome the “test only once
bottleneck”.
1. Pick a number of folds – k. Usually, k is 5 or 10 but you can choose any number which is
less than the dataset’s length.
2. Split the dataset into k equal (if possible) parts (they are called folds)
3. Choose k – 1 folds as the training set. The remaining fold will be the test set
4. Train the model on the training set. On each iteration of cross-validation, you must train a new model independently of the model trained on the previous iteration
5. Validate the model on the test set
6. Save the result of the validation
7. Repeat steps 3 – 6 k times. Each time use the remaining fold as the test set. In the end, you should have validated the model on every fold that you have.
8. To get the final score, average the results that you got on step 6.
To perform k-Fold cross-validation you can use sklearn.model_selection.KFold.
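A minimal sketch, assuming a small invented toy array:
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])   # toy data, invented for illustration
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)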
Still, the k-Fold method has a disadvantage: increasing k results in training more models, and the training process might be really expensive and time-consuming.
Leave-one-out cross-validation
Leave-one-out cross-validation (LOOCV) is an extreme case of k-Fold CV in which k equals n, the number of samples in the dataset. The algorithm is:
1. Choose one sample from the dataset which will be the test set
2. The remaining n – 1 samples will be the training set
3. Train the model on the training set. On each iteration, a new model must be trained
4. Validate the model on the test sample
5. Save the result of the validation
6. Repeat steps 1 – 5 n times, as for n samples we have n different training and test sets
7. To get the final score, average the results that you got on step 5.
For LOOCV sklearn also has a built-in method. It can be found in the model_selection library –
sklearn.model_selection.LeaveOneOut.
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4], [5, 6]])   # toy data for illustration
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
The greatest disadvantage of LOOCV is that it requires training n models, which can be extremely expensive for large datasets. Thus, the Data Science community has a general rule, based on empirical evidence and different studies, which suggests that 5- or 10-fold cross-validation should be preferred over LOOCV.
Leave-p-out cross-validation
Leave-p-out cross-validation (LpOC) is similar to LOOCV, but it uses p samples as the test set instead of one. Still, it is worth mentioning that, unlike LOOCV and k-Fold, test sets will overlap for LpOC if p is higher than 1. The algorithm is:
1. Choose p samples from the dataset which will be the test set
2. The remaining n – p samples will be the training set
3. Train the model on the training set. On each iteration, a new model must be trained
4. Validate the model on the test set
5. Save the result of the validation
6. Repeat the previous steps so that every possible group of p samples is used as a test set once
7. To get the final score, average the results that you got on step 5
import numpy as np
from sklearn.model_selection import LeavePOut

y = np.array([1, 2, 3, 4])
lpo = LeavePOut(2)   # every possible pair of samples is used once as a test set
Stratified k-Fold cross-validation
Sometimes we may face a large imbalance of the target value in the dataset. For example, in a dataset concerning wristwatch prices, there might be a larger number of wristwatches having a high price. In the case of classification, in a cats-and-dogs dataset there might be a large shift towards the dog class. Stratified k-Fold is a variation of standard k-Fold designed for such cases.
It works as follows. Stratified k-Fold splits the dataset on k folds such that each fold contains
approximately the same percentage of samples of each target class as the complete set. In the case
of regression, Stratified k-Fold makes sure that the mean target value is approximately equal in all
the folds.
1. Pick a number of folds – k
2. Split the dataset into k folds, such that each fold preserves the class proportions of the full dataset
3. Choose k – 1 folds which will be the training set. The remaining fold will be the test set
4. Train the model on the training set. On each iteration a new model must be trained
5. Validate the model on the test set
6. Save the result of the validation
7. Repeat steps 3 – 6 k times. Each time use the remaining fold as the test set. In the end, you should have validated the model on every fold that you have.
8. To get the final score, average the results that you got on step 6.
As you may have noticed, the algorithm for Stratified k-Fold technique is similar to the standard
k-Folds. You don’t need to code something additionally as the method will do everything
necessary for you.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=2)
Repeated k-Fold cross-validation
The general idea is that on every iteration we randomly select samples from all over the dataset as our test set. For example, if we decide that 20% of the dataset will be our test set, 20% of the samples will be randomly selected and the remaining 80% will become the training set. The algorithm is:
1. Pick k – the number of times the model will be trained
2. Pick the train/test split proportion
3. Randomly split the dataset according to that proportion
4. Train on the training set. On each iteration of cross-validation, a new model must be trained
5. Validate on the test set
6. Save the result of the validation
7. Repeat steps 2 – 6 k times
8. To get the final score, average the results that you got on step 6.
Repeated k-Fold has clear advantages over standard k-Fold CV. Firstly, the proportion of train/test
split is not dependent on the number of iterations. Secondly, we can even set unique proportions
for every iteration. Thirdly, random selection of samples from the dataset makes Repeated k-Fold
even more robust to selection bias.
Still, there are some disadvantages. k-Fold CV guarantees that the model will be tested on all
samples, whereas Repeated k-Fold is based on randomization which means that some samples may
never be selected to be in the test set at all. At the same time, some samples might be selected multiple times. This makes it a bad choice for imbalanced datasets.
Sklearn will help you to implement a Repeated k-Fold CV. Just use
sklearn.model_selection.RepeatedKFold. In sklearn implementation of this technique you must set
the number of folds that you want to have (n_splits) and the number of times the split will be
performed (n_repeats). It guarantees that you will have different folds on each iteration.
import numpy as np
from sklearn.model_selection import RepeatedKFold

rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)   # 2 folds, repeated twice
Nested k-Fold cross-validation
Nested k-Fold cross-validation is used when hyper-parameters also need to be tuned: an inner loop selects the best hyper-parameter combination, and an outer loop estimates the performance of the resulting model. In outline:
1. Define the set of hyper-parameter combinations, C, for the current model. If the model has no hyper-parameters, C is the empty set.
2. Divide the data into K folds with approximately equal distribution of cases and controls.
3. For each outer fold, run an inner cross-validation loop on the remaining K – 1 folds to score every combination in C, then train the model on those K – 1 folds using the hyper-parameter combination that yielded the best average performance over all steps of the inner loop, and evaluate it on the held-out fold.
The inner loop performs cross-validation to identify the best features and model hyper-parameters
using the k-1 data folds available at each iteration of the outer loop. The model is trained once for
each outer loop step and evaluated on the held-out data fold. This process yields k evaluations of
the model performance, one for each data fold, and allows the model to be tested on every sample.
Time-series cross-validation
Traditional cross-validation techniques don’t work on sequential data such as time-series because
we cannot choose random data points and assign them to either the test set or the train set as it
makes no sense to use the values from the future to forecast values in the past. There are mainly
two ways to go about this:
1. Rolling cross-validation
Cross-validation is done on a rolling basis i.e. starting with a small subset of data for training
purposes, predicting the future values, and then checking the accuracy on the forecasted data
points. The following image can help you get the intuition behind this approach.
Figure: Rolling cross-validation
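As a hedged sketch (not part of the original text), scikit-learn's TimeSeriesSplit implements this rolling idea: each split trains only on past observations and tests on the ones that immediately follow.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)        # six time-ordered observations (invented)
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)   # test indices always come after the train indices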
2. Blocked cross-validation
The first technique may introduce leakage from future data to the model. The model will observe
future patterns to forecast and try to memorize them. That’s why blocked cross-validation was
introduced.
Figure: Blocked cross-validation
It works by adding margins at two positions. The first is between the training and validation folds
in order to prevent the model from observing lag values which are used twice, once as a regressor
and another as a response. The second is between the folds used at each iteration in order to prevent
the model from memorizing patterns from one iteration to the next.
Although cross-validating your trained model can never be termed a bad choice, there are certain scenarios in which cross-validation becomes an absolute necessity:
1. Limited dataset
Let’s say we have 100 data points and we are dealing with a multi-class classification problem
with 10 classes, this averages out to ~10 examples per class. In an 80-20 train-test split, this number
would go down even further to 8 samples per class for training. The smart thing to do here would
be to use cross-validation and utilize our entire dataset for training as well as testing.
2. Dependent data points
When we perform a random train-test split of our data, we assume that our examples are independent, meaning that knowing some instances will not help us understand other instances. However, that's not always the case, and in such situations it's important that our model gets familiar with the entire dataset, which is possible with cross-validation.
3. A single metric may not tell the whole story
In the absence of cross-validation, we only get a single value of accuracy or precision or recall, which could be an outcome of chance. When we train multiple models, we eliminate such possibilities and get a metric per model, which results in more robust insights.
4. Hyperparameter tuning
Although there are many methods to tune the hyperparameters of your model, such as grid search, Bayesian optimization, etc., this exercise can't be done on the training or test set, and a need for a validation set arises. Thus, we fall back to the same splitting problem discussed above, and cross-validation can help us out of this.
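For example, scikit-learn's GridSearchCV combines the grid search mentioned above with k-Fold cross-validation, so no separate validation set has to be carved out by hand. A hedged sketch on the built-in iris data (the parameter grid below is an assumption chosen for illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}   # candidate hyper-parameters
search = GridSearchCV(SVC(), param_grid, cv=5)                  # 5-fold CV for every combination
search.fit(X, y)
print(search.best_params_, search.best_score_)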
Linear regression:-
Decision trees :-
A decision tree is a type of supervised machine learning used to categorize or make predictions based on how a previous set of questions was answered. The model is a form of supervised learning, meaning that the model is trained and tested on a set of data that contains the desired categorization.
A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g. whether a coin flip comes up heads or tails), each leaf node represents a class label (the decision taken after computing all features), and branches represent conjunctions of features that lead to those class labels.
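As an illustrative sketch (using scikit-learn and its built-in iris data, which are not part of these notes), a decision tree classifier can be trained and its flowchart of feature tests inspected like this:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # prints the learned sequence of feature tests and leaf labels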
Overfitting:-
Overfitting is an undesirable machine learning behavior that occurs when the machine learning
model gives accurate predictions for training data but not for new data. When data scientists use
machine learning models for making predictions, they first train the model on a known data set.
What is meant by overfitting?
Overfitting is a concept in data science, which occurs when a statistical model fits exactly against
its training data. When this happens, the algorithm unfortunately cannot perform accurately against
unseen data, defeating its purpose.
What is overfitting in machine learning and how can you avoid it?
Overfitting makes the model relevant to its data set only, and irrelevant to any other data
sets. Some of the methods used to prevent overfitting include ensembling, data augmentation, data
simplification, and cross-validation.
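A hedged illustration (with invented noisy data, not taken from these notes): a very flexible model can fit its training data almost perfectly while scoring much worse on held-out data, which is exactly the gap that signals overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.rand(30, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.randn(30)      # noisy sine curve (invented data)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))   # close to 1.0: the model memorizes the noise
print("test R^2:", model.score(X_test, y_test))      # usually noticeably lower on unseen data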
Instance-based learning:-
Instance-based learners may simply store a new instance or throw an old instance away. Examples of instance-based learning algorithms are the k-nearest neighbors algorithm, kernel machines and RBF networks.
The raw training instances are used to make predictions. As such, KNN is often referred to as instance-based learning or case-based learning (where each training instance is a case from the problem domain).
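A minimal k-nearest-neighbours sketch (scikit-learn with its built-in iris data; illustrative only): the model essentially stores the training instances and classifies a new case by looking at its closest stored neighbours.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # "training" is essentially storing the instances
print(knn.predict(X[:2]))                             # classify cases by their 3 nearest neighbours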
Feature reduction:-
Feature reduction leads to the need for fewer resources to complete computations or tasks. Less computation time and less storage capacity needed means the computer can do more work. In machine learning, feature reduction removes multicollinearity, resulting in an improvement of the machine learning model in use.
What are the benefits of feature reduction?
Fewer features mean less complexity. You will need less storage space because you have less data. Fewer features also require less computation time, and model accuracy improves due to less misleading data.
Principal Component Analysis (PCA), Factor Analysis (FA), Linear Discriminant Analysis
(LDA) and Truncated Singular Value Decomposition (SVD) are examples of linear
dimensionality reduction methods.
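As a hedged sketch of one of these linear methods, PCA can project the four iris features down to two components while keeping most of the variance (scikit-learn, illustrative only):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape)                   # (150, 2): four features reduced to two
print(pca.explained_variance_ratio_)     # share of variance kept by each component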
Collaborative filtering based recommendation:-
Collaborative filtering filters information by using the interactions and data collected by the system from other users. It is based on the idea that people who agreed in their evaluation of certain items are likely to agree again in the future.
What is collaborative recommendation system?
Collaborative filtering is a family of algorithms in which there are multiple ways to find similar users or items and multiple ways to calculate a rating based on the ratings of similar users. Depending on the choices you make, you end up with a particular type of collaborative filtering approach.
Amazon is known for its use of collaborative filtering, matching products to users based on past
purchases. For example, the system can identify all of the products a customer and users with
similar behaviors have purchased and/or positively rated.
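A very small user-based collaborative filtering sketch (the rating matrix and every number in it are invented for illustration): users with similar past ratings are found with cosine similarity, and their ratings are used to score an item the target user has not seen.
import numpy as np

# Rows = users, columns = items; 0 means "not rated yet" (all values invented)
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 4, 1],
                    [1, 1, 0, 5],
                    [1, 2, 1, 4]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target_user, target_item = 0, 2                          # predict item 2 for user 0
sims = np.array([cosine(ratings[target_user], ratings[u]) for u in range(len(ratings))])
sims[target_user] = 0.0                                  # ignore the user's similarity to themselves
rated = ratings[:, target_item] > 0                      # users who have already rated item 2
predicted = np.sum(sims[rated] * ratings[rated, target_item]) / np.sum(sims[rated])
print(round(predicted, 2))                               # weighted average of similar users' ratings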
[Unit 2]
Before we dive into Bayes theorem, let’s review marginal, joint, and conditional probability.
Recall that marginal probability is the probability of an event, irrespective of other random
variables. If the random variable is independent, then it is the probability of the event directly,
otherwise, if the variable is dependent upon other variables, then the marginal probability is the
probability of the event summed over all outcomes for the dependent variables, called the sum
rule.
The joint probability is the probability of two (or more) simultaneous events, often described in
terms of events A and B from two dependent random variables, e.g. X and Y. The joint probability
is often summarized as just the outcomes, e.g. A and B.
• Joint Probability: Probability of two (or more) simultaneous events, e.g. P(A and B) or
P(A, B).
The conditional probability is the probability of one event given the occurrence of another event,
often described in terms of events A and B from two dependent random variables e.g. X and Y.
• Conditional Probability: Probability of one (or more) event given the occurrence of another
event, e.g. P(A given B) or P(A | B).
The joint probability can be calculated using the conditional probability; for example:
• P(A, B) = P(A | B) * P(B)
This is called the product rule. Importantly, the joint probability is symmetrical, meaning that:
• P(A, B) = P(B, A)
The conditional probability can be calculated using the joint probability; for example:
• P(A | B) = P(A, B) / P(B)
Importantly, the conditional probability is not symmetrical, meaning that:
• P(A | B) != P(B | A)
We are now up to speed with marginal, joint and conditional probability. If you would like more
background on these fundamentals, see the tutorial:
Specifically, one conditional probability can be calculated using the other conditional probability; for example:
• P(A | B) = P(B | A) * P(A) / P(B)
This alternate approach of calculating the conditional probability is useful either when the joint
probability is challenging to calculate (which is most of the time), or when the reverse conditional
probability is available or easy to calculate.
This alternate calculation of the conditional probability is referred to as Bayes Rule or Bayes
Theorem, named for Reverend Thomas Bayes, who is credited with first describing it. It is
grammatically correct to refer to it as Bayes’ Theorem (with the apostrophe), but it is common to
omit the apostrophe for simplicity.
• Bayes Theorem: Principled way of calculating a conditional probability without the joint
probability.
It is often the case that we do not have access to the denominator directly, e.g. P(B).
We can calculate it an alternative way; for example:
• P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
This gives a formulation of Bayes Theorem that uses this alternate calculation of P(B):
• P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))
As such, if we have P(A), then we can calculate P(not A) as its complement; for example:
• P(not A) = 1 – P(A)
Additionally, if we have P(not B|not A), then we can calculate P(B|not A) as its complement; for example:
• P(B|not A) = 1 – P(not B|not A)
Now that we are familiar with the calculation of Bayes Theorem, let’s take a closer look at the
meaning of the terms in the equation.
The terms in the Bayes Theorem equation are given names depending on the context where the
equation is used.
It can be helpful to think about the calculation from these different perspectives and help to map
your problem onto the equation.
Firstly, in general, the result P(A|B) is referred to as the posterior probability and P(A) is referred
to as the prior probability.
Sometimes P(B|A) is referred to as the likelihood and P(B) is referred to as the evidence.
• P(A|B): Posterior probability.
• P(A): Prior probability.
• P(B|A): Likelihood.
• P(B): Evidence.
This allows Bayes Theorem to be restated as:
• Posterior = Likelihood * Prior / Evidence
What is the probability that there is fire given that there is smoke?
• P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)
Where P(Fire) is the prior, P(Smoke|Fire) is the likelihood, and P(Smoke) is the evidence.
You can imagine the same situation with rain and clouds.
Now that we are familiar with Bayes Theorem and the meaning of the terms, let’s look at a scenario
where we can calculate it.
Bayes theorem is best understood with a real-life worked example with real numbers to
demonstrate the calculations.
First we will define a scenario then work through a manual calculation, a calculation in Python,
and a calculation using the terms that may be familiar to you from the field of binary classification.
1. Diagnostic Test Scenario
2. Manual Calculation
Let’s go.
An excellent and widely used example of the benefit of Bayes Theorem is in the analysis of a
medical diagnostic test.
Scenario: Consider a human population that may or may not have cancer (Cancer is True or False)
and a medical test that returns positive or negative for detecting cancer (Test is Positive or
Negative), e.g. like a mammogram for detecting breast cancer.
Problem: If a randomly selected patient has the test and it comes back positive, what is the
probability that the patient has cancer?
Manual Calculation
Sometimes a patient will have cancer, but the test will not detect it. This capability of the test to
detect cancer is referred to as the sensitivity, or the true positive rate.
In this case, we will contrive a sensitivity value for the test. The test is good, but not great, with a
true positive rate or sensitivity of 85%. That is, of all the people who have cancer and are tested,
85% of them will get a positive result from the test.
Given this information, our intuition would suggest that there is an 85% probability that the patient has cancer. Our intuition of probability is wrong here; in fact, this mistake is so common that it has its own name: the base rate fallacy.
It has this name because the error in estimating the probability of an event is caused by ignoring the base rate. That is, it ignores the probability of a randomly selected person having cancer, regardless of the results of a diagnostic test.
In this case, we can assume the probability of breast cancer is low, and use a contrived base rate value of one person in 5,000, or 0.02% (0.0002).
• P(Cancer=True) = 0.02%, i.e. 0.0002.
We can correctly calculate the probability of a patient having cancer given a positive test result
using Bayes Theorem.
We know the probability of the test being positive given that the patient has cancer is 85%, and we know the base rate or the prior probability of a given patient having cancer is 0.02%; we can plug these values in:
• P(Cancer=True | Test=Positive) = P(Test=Positive | Cancer=True) * P(Cancer=True) / P(Test=Positive)
• = 0.85 * 0.0002 / P(Test=Positive)
We do not know P(Test=Positive) directly, but we can calculate it as:
• P(Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) + P(Test=Positive|Cancer=False) * P(Cancer=False)
We can calculate the complement of the base rate:
• P(Cancer=False) = 1 – P(Cancer=True)
• = 1 – 0.0002
• = 0.9998
We still do not know the probability of a positive test result given no cancer.
Specifically, we need to know how good the test is at correctly identifying people that do not have cancer. That is, returning a negative result (Test=Negative) when the patient does not have cancer (Cancer=False), called the true negative rate or the specificity. We will contrive a specificity of 95%.
• P(Test=Negative | Cancer=False) = 0.95
With this final piece of information, we can calculate the false positive or false alarm rate as the complement of the true negative rate.
• P(Test=Positive | Cancer=False) = 1 – P(Test=Negative | Cancer=False)
• = 1 – 0.95
• = 0.05
We can plug this false alarm rate into our calculation of P(Test=Positive) as follows:
• P(Test=Positive) = 0.85 * 0.0002 + 0.05 * 0.9998
• = 0.00017 + 0.04999
• = 0.05016
Excellent, so the probability of the test returning a positive result, regardless of whether the person
has cancer or not is about 5%.
We now have enough information to calculate Bayes Theorem and estimate the probability of a randomly selected person having cancer if they get a positive test result.
• P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / 0.05016
• = 0.00017 / 0.05016
• = 0.0034 (approximately)
The calculation suggests that if the patient is informed they have cancer with this test, then there is only about a 0.33% chance that they have cancer.
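The same numbers can be checked with a few lines of Python (a small verification sketch written for these notes, not the original article's code):
# Bayes Theorem for the contrived diagnostic test scenario
p_cancer = 0.0002                     # base rate, P(Cancer=True)
p_pos_given_cancer = 0.85             # sensitivity, P(Test=Positive | Cancer=True)
p_neg_given_no_cancer = 0.95          # specificity, P(Test=Negative | Cancer=False)

p_no_cancer = 1 - p_cancer
p_pos_given_no_cancer = 1 - p_neg_given_no_cancer                              # false alarm rate = 0.05
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer    # = 0.05016
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_pos)                  # about 0.05016
print(p_cancer_given_pos)     # about 0.0034, i.e. roughly 0.33%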
The example also shows that the calculation of the conditional probability
requires enough information.
For example, if we have the values used in Bayes Theorem already, we can use them directly.
This is rarely the case, and we typically have to calculate the bits we need and plug them in, as we did in this case. In our scenario we were given three pieces of information: the base rate, the sensitivity (or true positive rate), and the specificity (or true negative rate).
• Sensitivity: 85% of people with cancer will get a positive test result.
• Specificity: 95% of people without cancer will get a negative test result.
We did not have the P(Test=Positive), but we calculated it given what we already had available.
We might imagine that Bayes Theorem allows us to be even more precise about a given scenario.
For example, if we had more information about the patient (e.g. their age) and about the domain (e.g. cancer rates for age ranges), we could in turn offer an even more accurate probability estimate.
Logistic Regression:-
Before we dive into understanding logistic regression, let us start with some basics about the
different types of machine learning algorithms.
What are the differences between supervised learning, unsupervised learning & reinforcement
learning?
Machine learning algorithms are broadly classified into three categories - supervised learning,
unsupervised learning, and reinforcement learning.
1. Supervised Learning - Learning where data is labeled and the motivation is to classify
something or predict a value. Example: Detecting fraudulent transactions from a list of
credit card transactions.
2. Unsupervised Learning - Learning where data is not labeled and the motivation is to find
patterns in given data. In this case, you are asking the machine learning model to process
the data from which you can then draw conclusions. Example: Customer segmentation
based on spend data.
3. Reinforcement Learning - Learning by trial and error. This is the closest to how humans
learn. The motivation is to find optimal policy of how to act in a given environment. The
machine learning model examines all possible actions, makes a policy that maximizes benefit, and implements the policy (trial). If there are errors from the initial policy, apply reinforcements back into the algorithm and continue to do this until you reach the optimal policy. Example: Personalized recommendations on streaming platforms like YouTube.
What are the two types of supervised learning?
As supervised learning is used to classify something or predict a value, naturally there are two
types of algorithms for supervised learning - classification models and regression models.
Imagine, for example, that we are trying to predict whether a person is infected with COVID-19. In this imaginary example, the probability of a person being infected could be based on the viral load, the symptoms, the presence of antibodies, etc. Viral load, symptoms, and antibodies would be our factors (independent variables), which would influence our outcome (dependent variable).
In linear regression, the outcome is continuous and can be any possible value. However in the case
of logistic regression, the predicted outcome is discrete and restricted to a limited number of
values.
For example, say we are trying to apply machine learning to the sale of a house. If we are trying
to predict the sale price based on the size, year built, and number of stories we would use linear
regression, as linear regression can predict a sale price of any possible value. If we are using those same factors to predict whether the house sells or not, we would use logistic regression, as the possible outcomes here are restricted to yes or no.
Hence, linear regression is an example of a regression model and logistic regression is an example
of a classification model.
Logistic regression is used to solve classification problems, and the most common use case
is binary logistic regression, where the outcome is binary (yes or no). In the real world, you can
see logistic regression applied across multiple areas and fields.
• In health care, logistic regression can be used to predict if a tumor is likely to be benign or
malignant.
• In the financial industry, logistic regression can be used to predict if a transaction is
fraudulent or not.
• In marketing, logistic regression can be used to predict if a targeted audience will respond
or not.
Are there other use cases for logistic regression aside from binary logistic regression? Yes. There
are two other types of logistic regression that depend on the number of predicted outcomes.
1. Binary logistic regression - When we have two possible outcomes, like our original
example of whether a person is likely to be infected with COVID-19 or not.
2. Multinomial logistic regression - When we have multiple outcomes, say if we build out our
original example to predict whether someone may have the flu, an allergy, a cold, or
COVID-19.
3. Ordinal logistic regression - When the outcome is ordered, like if we build out our original
example to also help determine the severity of a COVID-19 infection, sorting it into mild,
moderate, and severe cases.
Training data that satisfies the below assumptions is usually a good fit for logistic regression.
• The predicted outcome is strictly binary or dichotomous. (This applies to binary logistic
regression).
• The factors, or the independent variables, that influence the outcome are independent of
each other. In other words there is little or no multicollinearity among the independent
variables.
• The independent variables can be linearly related to the log odds.
• Fairly large sample sizes.
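As a minimal sketch (scikit-learn, with the built-in breast cancer dataset standing in for a binary yes/no outcome; illustrative only, not part of the original notes):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)       # binary target: malignant vs benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))   # predicted probabilities for each class
print(clf.score(X_test, y_test))       # classification accuracy on held-out data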
Support Vector Machine:-
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Though it can be applied to regression problems as well, it is best suited for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. The dimension of the hyperplane depends upon the number of features. If the number of input features is two, then the hyperplane is just a line. If the number of input features is three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of features exceeds three.
Let’s consider two independent variables x1, x2 and one dependent variable which is either a blue
circle or a red circle.
Figure: Linearly separable data points
From the figure above it is very clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features x1, x2) that segregate our data points or do a classification between red and blue circles. So how do we choose the best line, or in general the best hyperplane, that segregates our data points?
One reasonable choice as the best hyperplane is the one that represents the largest separation or
margin between the two classes.
So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard
margin. So from the above figure, we choose L2.
Till now, we were talking about linearly separable data(the group of blue balls and red balls are
separable by a straight line/linear line). What to do if data are not linearly separable?
Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. We call a point xi on the line and we create a new variable yi as a function of distance from the origin o. So if we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from the origin. A non-linear
function that creates a new variable is referred to as kernel.
SVM Kernel:
The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems. Simply put, the kernel does some extremely complex data transformations and then finds out the process to separate the data based on the labels or outputs defined.
Advantages of SVM:
• It is memory efficient, as it uses a subset of training points in the decision function, called support vectors
• Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels
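A hedged scikit-learn sketch of the ideas above (built-in iris data; illustrative only): a linear kernel for data that is close to linearly separable, and an RBF kernel for data that is not.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(kernel, scores.mean())       # compare the two kernels with 5-fold cross-validation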
1. Objective
Previously we discussed SVM (Support Vector Machine) in machine learning. Now we provide a detailed description of SVM kernels and different kernel functions, with examples such as linear, nonlinear, polynomial, Gaussian kernel, radial basis function (RBF), sigmoid, etc.
Kernel Functions-Introduction to SVM Kernel & Examples
SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of a kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions, for example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
Kernel functions can be introduced for sequence data, graphs, text, images, as well as vectors. The most used type of kernel function is RBF, because it has a localized and finite response along the entire x-axis.
The kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity, with little computational cost even in very high-dimensional spaces.
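This inner-product view can be checked directly with a small sketch (the vectors below are invented): the polynomial kernel (x·y + 1)^2 gives the same value as an explicit degree-2 feature mapping, but without ever building the higher-dimensional vectors.
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

def phi(v):
    # Explicit degree-2 feature map matching the kernel: (v1^2, v2^2, sqrt(2)v1v2, sqrt(2)v1, sqrt(2)v2, 1)
    v1, v2 = v
    return np.array([v1**2, v2**2, np.sqrt(2)*v1*v2, np.sqrt(2)*v1, np.sqrt(2)*v2, 1.0])

kernel_value = (x @ y + 1) ** 2           # polynomial kernel of degree 2
feature_space_value = phi(x) @ phi(y)     # inner product in the 6-dimensional feature space
print(kernel_value, feature_space_value)  # both equal 25.0 (up to floating-point rounding)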
3. Kernel Rules
A simple example is a window (or boxcar) kernel K(u), whose value is 1 inside the closed ball of radius 1 centered at the origin, and 0 otherwise, as shown in the figure below:
• K(u) = 1 if ||u|| ≤ 1, and 0 otherwise
For a fixed xi, the function K((z – xi)/h) is 1 inside the closed ball of radius h centered at xi, and 0 otherwise, as shown in the figure below (a kernel or window function).
So, by choosing the argument of K(·), we have moved the window to be centered at the point xi and made it of radius h.
Let us see some common kernels used with SVMs and their uses:
• Gaussian kernel: a general-purpose kernel, used when there is no prior knowledge about the data. Equation: K(x, y) = exp(−||x − y||² / (2σ²))
• Laplace RBF kernel: also a general-purpose kernel, used when there is no prior knowledge about the data. Equation: K(x, y) = exp(−||x − y|| / σ)
• Linear splines kernel (in one dimension): useful when dealing with large sparse data vectors; often used in text categorization. The splines kernel also performs well in regression problems.
[Unit 3]
Perceptron :-
The original Perceptron was designed to take a number of binary inputs, and produce
one binary output (0 or 1).
The idea was to use different weights to represent the importance of each input, and that the sum
of the values should be greater than a threshold value before making a decision
like true or false (0 or 1).
Perceptron Example
Imagine a perceptron deciding whether to go to a concert, based on five binary inputs (for example: is the artist good, is the weather good, and so on), with inputs x = [1, 0, 1, 0, 1] and weights w = [0.7, 0.6, 0.5, 0.3, 0.4].
1. Set a threshold value:
• Threshold = 1.5
2. Multiply all inputs with their weights:
• x1 * w1 = 1 * 0.7 = 0.7
• x2 * w2 = 0 * 0.6 = 0
• x3 * w3 = 1 * 0.5 = 0.5
• x4 * w4 = 0 * 0.3 = 0
• x5 * w5 = 1 * 0.4 = 0.4
3. Sum all the results:
• 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (this is called the weighted sum)
4. Activate the output:
• Return true if the sum > 1.5 ("Yes, I will go to the concert")
Note
If the weather weight is 0.6 for you, it might be different for someone else. A higher weight means that the weather is more important to them.
If the threshold value is 1.5 for you, it might be different for someone else. A lower threshold means they are more eager to go to the concert.
Example
const inputs = [1, 0, 1, 0, 1];               // the five binary inputs
const weights = [0.7, 0.6, 0.5, 0.3, 0.4];    // the corresponding weights
let sum = 0;
for (let i = 0; i < inputs.length; i++) {
  sum += inputs[i] * weights[i];              // weighted sum = 1.6
}
const activate = (sum > 1.5);                 // true: go to the concert
Perceptron Terminology
• Perceptron Inputs
• Node values
• Node Weights
• Activation Function
Perceptron Inputs
Perceptron inputs are called nodes. Each node has both a value and a weight.
Node Values
The binary input values (0 or 1) can be interpreted as (no or yes) or (false or true).
Node Weights
In the example above, the node weights are: 0.7, 0.6, 0.5, 0.3, 0.4
Activation Function
The activation function maps the result (the weighted sum) into a required value like 0 or 1.
In the example above, the activation function is simple: (sum > 1.5).
In a larger network, other neurons must provide the input values: Is the artist good? Is the weather good? ...
Neural Networks
In the Neural Network Model, input data (yellow) are processed against a hidden layer (blue)
and modified against more hidden layers (green) to produce the final output (red).
Multilayer Network:-
Multi-layer neural networks can be set up in numerous ways. Typically, they have at least one
input layer, which sends weighted inputs to a series of hidden layers, and an output layer at the
end. These more sophisticated setups are also associated with nonlinear builds using sigmoids and
other functions to direct the firing or activation of artificial neurons. While some of these systems
may be built physically, with physical materials, most are created with software functions that
model neural activity.
Convolutional neural networks (CNNs), so useful for image processing and computer vision, as
well as recurrent neural networks, deep networks and deep belief systems are all examples of multi-
layer neural networks. CNNs, for example, can have dozens of layers that work sequentially on an
image. All of this is central to understanding how modern neural networks function.
Back Propagation :-
There are two types of backpropagation: static and recurrent. The key difference is that static backpropagation offers instant mapping, while recurrent backpropagation does not.
The algorithm gets its name because the weights are updated backward, from output to input. Advantages of backpropagation include the following:
• It does not have any parameters to tune except for the number of inputs.
• It is highly adaptable and efficient and does not require any prior knowledge about the network.
Backpropagation algorithms are used extensively to train feedforward neural networks in areas
such as deep learning. They efficiently compute the gradient of the loss function with respect to
the network weights. This approach eliminates the inefficient process of directly computing the
gradient with respect to each individual weight. It enables the use of gradient methods, like
gradient descent or stochastic gradient descent, to train multilayer networks and update weights to
minimize loss.
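A minimal NumPy sketch of the idea (a tiny one-hidden-layer network on invented data; this illustrates gradient computation by backpropagation under assumed values, not any specific library's implementation):
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(8, 3)                                  # 8 invented samples, 3 features
y = rng.randint(0, 2, size=(8, 1)).astype(float)     # invented binary targets

W1, b1 = rng.randn(3, 4) * 0.1, np.zeros((1, 4))     # hidden layer weights
W2, b2 = rng.randn(4, 1) * 0.1, np.zeros((1, 1))     # output layer weights
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(1000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the error from output to input, layer by layer
    d_out = (out - y) * out * (1 - out)              # gradient at the output (squared-error loss)
    d_h = (d_out @ W2.T) * h * (1 - h)               # gradient pushed back through the hidden layer
    # gradient-descent weight updates
    W2 -= 0.5 * (h.T @ d_out)
    b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * (X.T @ d_h)
    b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)

print(np.round(out.ravel(), 2))   # outputs move toward the 0/1 targets as the loss decreases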
The difficulty of understanding exactly how changing weights and biases affect the overall
behavior of an artificial neural network was one factor that held back more comprehensive use of
neural network applications, arguably until the early 2000s when computers provided the
necessary insight.
Today, backpropagation algorithms have practical applications in many areas of artificial
intelligence (AI), including OCR, natural language processing and image processing.
Backpropagation requires a known, desired output for each input value in order to calculate the
loss function gradient -- how a prediction differs from actual results -- as a type of supervised
machine learning. Along with classifiers such as Naïve Bayesian filters and decision trees, the
backpropagation training algorithm has emerged as an important part of machine learning
applications that involve predictive analytics.
The time complexity of each iteration -- how long it takes to execute each statement in an algorithm
-- depends on the network's structure. For a multilayer perceptron, matrix multiplications dominate the running time.
The concept of momentum in backpropagation states that previous weight changes must influence
the present direction of movement in weight space.
The backpropagation algorithm pseudocode represents a plain language description of the steps in
a system.
The Levenberg-Marquardt method helps adjust the weight and bias variables. Then, the
backpropagation algorithm is used to calculate the Jacobian matrix of performance functions
considering the weight and bias variables.
Introduction to Deep Neural Network:-
What is Deep Learning? Deep learning is a branch of machine learning which is completely based on artificial neural networks; since neural networks mimic the human brain, deep learning is also a kind of mimicry of the human brain. In deep learning, we don't need to explicitly program everything. The concept of deep learning is not new; it has been around for a number of years. It is getting so much attention now because earlier we did not have that much processing power or that much data. As processing power has increased exponentially over the last 20 years, deep learning and machine learning have come into the picture. A formal definition of deep learning follows:
Deep Learning is a subset of Machine Learning that is based on artificial neural networks (ANNs)
with multiple layers, also known as deep neural networks (DNNs). These neural networks are
inspired by the structure and function of the human brain, and they are designed to learn from large
amounts of data in an unsupervised or semi-supervised manner.
Deep Learning models are able to automatically learn features from the data, which makes them
well-suited for tasks such as image recognition, speech recognition, and natural language
processing. The most widely used architectures in deep learning are feedforward neural networks,
convolutional neural networks (CNNs), and recurrent neural networks (RNNs).
Feedforward neural networks (FNNs) are the simplest type of ANN, with a linear flow of
information through the network. FNNs have been widely used for tasks such as image
classification, speech recognition, and natural language processing.
Convolutional Neural Networks (CNNs) are a special type of FNNs designed specifically for
image and video recognition tasks. CNNs are able to automatically learn features from the images,
which makes them well-suited for tasks such as image classification, object detection, and image
segmentation.
Recurrent Neural Networks (RNNs) are a type of neural networks that are able to process
sequential data, such as time series and natural language. RNNs are able to maintain an internal
state that captures information about the previous inputs, which makes them well-suited for tasks
such as speech recognition, natural language processing, and language translation.
Deep Learning models are trained using large amounts of labeled data and require significant
computational resources. With the increasing availability of large amounts of data and
computational resources, deep learning has been able to achieve state-of-the-art performance in a
wide range of applications such as image and speech recognition, natural language processing, and
more.
Deep learning is a particular kind of machine learning that achieves great power and flexibility
by learning to represent the world as a nested hierarchy of concepts, with each concept defined in
relation to simpler concepts, and more abstract representations computed in terms of less abstract
ones.
The human brain contains approximately 100 billion neurons all together, and each neuron is connected to thousands of its neighbours. The question here is how we recreate these neurons in a computer. We create an artificial structure called an artificial neural net, in which we have nodes or neurons. We have some neurons for input values and some for output values, and in between there may be lots of neurons interconnected in the hidden layer.
Architectures:
1. Deep Neural Network – It is a neural network with a certain level of complexity (having
multiple hidden layers in between input and output layers). They are capable of modeling
and processing non-linear relationships.
3. Recurrent Neural Network (performs the same task for every element of a sequence) – Allows for parallel and sequential computation. Similar to the human brain (a large feedback network of connected neurons). They are able to remember important things about the input they received, which enables them to be more precise.
Difference between Machine Learning and Deep Learning:
Machine learning works on smaller datasets for accuracy, whereas deep learning works on large amounts of data.
Working: First, we need to identify the actual problem in order to get the right solution, and it should be understood whether the problem fits deep learning or not (the feasibility of deep learning should be checked). Second, we need to identify the relevant data, which should correspond to the actual problem and should be prepared accordingly. Third, choose the deep learning algorithm appropriately. Fourth, the algorithm should be used while training the dataset. Fifth, final testing should be done on the dataset.
Tools used: Anaconda, Jupyter, PyCharm, etc. Languages used: R, Python, Matlab, CPP, Java, Julia, Lisp, JavaScript, etc.
Real-life examples:
How do we recognize a square from other shapes? Deep learning breaks the complex task of identifying the shape down into simpler tasks. Similarly, for face recognition, deep learning defines the facial features which are important for classification, and the system will then identify them automatically (whereas machine learning requires those features to be given manually for classification).
Limitations :
Advantages :
Disadvantages :
Applications :
1. Automatic Text Generation – A corpus of text is learned, and from this model new text is generated word-by-word or character-by-character. Such a model can learn how to spell, punctuate and form sentences, and may even capture the style of the corpus.
2. Image Recognition – Recognizes and identifies people and objects in images, and understands content and context. This area is already being used in gaming, retail, tourism, etc.
Deep learning has a wide range of applications in fields such as computer vision, speech recognition, natural language processing, and many more. Some of the most common applications include:
3. Image and video recognition: Deep learning models are used to automatically classify images and videos, detect objects, and identify faces. Applications include image and video search engines, self-driving cars, and surveillance systems.
4. Speech recognition: Deep learning models are used to transcribe and translate speech in real time, which is used in voice-controlled devices such as virtual assistants and in accessibility technology for people with hearing impairments.
5. Natural Language Processing: Deep learning models are used to understand, generate and translate human languages. Applications include machine translation, text summarization, and sentiment analysis.
6. Robotics: Deep learning models are used to control robots and drones, and to improve their ability to perceive and interact with the environment.
7. Healthcare: Deep learning models are used in medical imaging to detect diseases, in drug discovery to identify new treatments, and in genomics to understand the underlying causes of diseases.
8. Finance: Deep learning models are used to detect fraud, predict stock prices, and analyze financial data.
9. Gaming: Deep learning models are used to create more realistic characters and environments, and to improve the gameplay experience.
10. Recommender Systems: Deep learning models are used to make personalized recommendations to users, such as product recommendations, movie recommendations, and news recommendations.
11. Social Media: Deep learning models are used to identify fake news, to flag harmful content and to filter out spam.
12. Autonomous systems: Deep learning models are used in self-driving cars, drones, and other autonomous systems to make decisions based on sensor data.
[Unit 4]
Computational learning theory (CoLT) refers to applying formal mathematical methods to learning
systems using theoretical computer science tools to quantify learning problems. This task includes
discerning how hard it is for an artificial intelligence (AI) system to learn specific tasks.
Simply put, CoLT is an AI subfield devoted to studying the design and analysis of machine learning (ML) algorithms. It analyzes how difficult it will be for an AI system to learn a task, and it provides:
• A formal definition of the efficiency of both data usage (sample complexity) and processing time (time complexity)
CoLT considers a computation feasible if it can be performed in polynomial time, i.e. if the number of steps required to complete the algorithm grows no faster than a polynomial in the size of the input. CoLT produces two kinds of results for a learning problem: positive (the machine can learn the task in polynomial time) and negative (the machine cannot learn the task in polynomial time).
You should remember that the theoretical learning models analyzed in CoLT represent real-life problems only abstractly. As such, ML experts have to validate or adjust the abstractions to ensure that the theoretical results translate into real-life solutions.
CoLT is thus critical to ML research. Besides the predictive capability CoLT offers, it also
addresses other vital factors, including simplicity, robustness to variations in the learning scenario,
and the ability to create insights into empirically observed phenomena. In other words, CoLT
simplifies the data the AI system has to process. Finally, it helps users ensure the computer can
adapt to changes in its environment even while learning a task. And it lets users understand and
apply the results to real-life situations.
CoLT usually draws on tools from statistics, calculus, geometry, information theory, probability theory, and optimization.
• How can you tell if a model decently approximates the goal function? (How can you tell if
an ML algorithm accurately represents your objective?)
• How can you determine if you have a good answer at the local or global level? (How can
you tell if the ML model successfully provided you with the correct results for a specific
or general task?)
• What type of hypothesis space should be employed? (How many hypotheses should the
machine come up with?)
• What can you do to avoid overfitting? (What can you do to prevent the results from
becoming only applicable to the data studied?)
• How many examples of data are required?
• CoLT can also help programmers predict how well their algorithms will do when processing and analyzing specific volumes of data. Will they work as well on 5 million data points as they did on 1 million data points, and will the results be just as accurate?
We want to learn the concept "medium-built person" from examples. We are given the height and weight of m individuals, the training set, and we are told for each [height,weight] pair whether or not it corresponds to a medium-built person. We would like to learn this concept, i.e. produce an algorithm that in the future answers correctly whether a new [height,weight] pair represents a medium-built person. We are interested in knowing which value of m to use if we want to learn this concept well, so our first concern is to characterize what we mean by "well" or "good" when evaluating learned concepts.
First we examine what we mean by saying that the probability of error of the learned concept is at most epsilon.
Say that c is the (true) concept we are learning and h is the concept we have learned; then the error of h is the probability that a randomly drawn individual x is classified differently by h and by c:
error(h) = Probability[h(x) != c(x)]
For a specific learning algorithm, what is the probability that a concept it learns will have an error that is bounded by epsilon? We would like to set a bound delta on the probability that this error is greater than epsilon. That is, we require
Probability[error(h) > epsilon] < delta
in words: the probability that the error of h exceeds the accuracy epsilon is less than the confidence delta.
Different degrees of "goodness" correspond to different values of epsilon and delta: the smaller epsilon and delta are, the better the learned concept will be.
This method of evaluating learning is called Probably Approximately Correct (PAC) Learning and
will be defined more precisely in the next section.
Our problem, for a given concept to be learned and given epsilon and delta, is to determine the size m of the training set. This may or may not depend on the algorithm used to derive the learned concept.
Going back to our problem of learning the concept medium-built people, we can assume that the
concept is represented as a rectangle, with sides parallel to the axes height/weight, and with
dimensions height_min, height_max, weight_min, weight_max. We assume that also the
hypotheses will take the form of a rectangle with the sides parallel to the axes.
We will use a simple algorithm to build the learned concept from the training set (a code sketch follows below):
1. If there are no positive individuals in the training set, the learned concept is null.
2. Otherwise, it is the smallest rectangle with sides parallel to the axes that contains the positive individuals.
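As an illustration, here is a minimal sketch of this bounding-rectangle learner in Python; the data and the helper names are ours, not part of the notes.
import numpy as np

def learn_rectangle(points, labels):
    # points: array of shape (m, 2) with [height, weight] pairs
    # labels: list of m booleans, True for positive (medium-built) examples
    positives = points[np.asarray(labels)]
    if len(positives) == 0:
        return None  # no positive examples: the learned concept is null
    # smallest axis-parallel rectangle containing all positive examples
    return (positives[:, 0].min(), positives[:, 0].max(),
            positives[:, 1].min(), positives[:, 1].max())

def classify(rect, point):
    # a point is classified positive iff it lies inside the learned rectangle
    if rect is None:
        return False
    hmin, hmax, wmin, wmax = rect
    return hmin <= point[0] <= hmax and wmin <= point[1] <= wmax

# tiny usage example with made-up [height, weight] data
pts = np.array([[170, 70], [160, 55], [180, 95], [175, 72]])
lbl = [True, True, False, True]
h = learn_rectangle(pts, lbl)
print(h, classify(h, [172, 65]))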
We would like to know how good this learning algorithm is. We choose epsilon and delta and determine a value for m that will satisfy the PAC learning condition.
An individual x will be classified incorrectly by the learned concept h if x lies in the area between h and c. We divide this area into 4 strips, on the top, bottom and sides of h, and we allow these strips, pessimistically, to overlap in the corners. In figure 1 we represent the top strip as t'. If each of these strips has area at most epsilon/4, i.e. is contained in a strip t of area epsilon/4, then the error of our hypothesis h will be bounded by epsilon. [In determining the area of a strip we need to make some hypothesis about the probability of each point; however, the analysis is valid for any chosen distribution.]
What is the probability that the hypothesis h will have an error bounded by epsilon? The probability that one individual falls outside the strip t is (1 - epsilon/4), so the probability that all m individuals fall outside the strip t is (1 - epsilon/4)^m, and the probability that all m individuals simultaneously fall outside of at least one of the four strips is at most 4*(1 - epsilon/4)^m. Only when all m individuals miss at least one of the strips can the probability of error for an individual be greater than epsilon [this is pessimistic]. Thus if we bound 4*(1 - epsilon/4)^m by delta, we make sure that the probability of the error being greater than epsilon is at most delta. Requiring
4*(1 - epsilon/4)^m <= delta
and using the inequality (1 - x) <= e^(-x), we obtain
m >= (4/epsilon)*ln(4/delta)
For example, epsilon = 0.1 and delta = 0.05 give m >= 40*ln(80), i.e. about 176 training examples.
Similar (smaller) bounds are obtained for simpler variants of the concept:
• Two of the sides of the rectangle overlap the axes (thus we have two strips only)
• The concept is a one-dimensional interval (we still have two strips)
• The concept is a one-dimensional interval starting at 0 (thus only one strip)
PAC Learning deals with the question of how to choose the size of the training set if we want to have confidence delta that the learned concept will have an error that is bounded by epsilon.
N is the cardinality of the hypothesis space, i.e. the number of candidate concepts the learner chooses from.
D is a probability distribution on X; this distribution is used both when the training set is created
and when the test set is created.
ORACLE(f,D), a function that in a unit of time returns a pair of the form (x,f(x)), where x is
selected from X according to D.
NOTE: We can examine learning in terms of functions or of concepts, i.e. sets. They are
equivalent, if we remember the use of characteristic functions for sets.
Surprisingly, we can derive a general lower bound on the size m of the training set required in
PAC learning. We assume that for each x in the training set we will have h(x) = f(x), that is, the
hypothesis is consistent with the target concept on the training set.
If the hypothesis h is bad, i.e. it has an error greater than epsilon, and is consistent on the training set, then the probability that on one individual x we have h(x) = f(x) is at most (1 - epsilon), and the probability of having h(x) = f(x) for all m individuals is at most (1 - epsilon)^m. This is an upper bound on the probability that a bad hypothesis is consistent. If we multiply this value by the number of bad hypotheses, we get an upper bound on the probability of learning a bad consistent hypothesis.
The number of bad hypotheses is certainly less than the number N of hypotheses. Thus
Probability[h is bad and consistent] < N*(1 - epsilon)^m < delta
Using the inequality (1 - epsilon) <= e^(-epsilon), this can be relaxed to
Probability[h is bad and consistent] < N*(e^(-epsilon))^m = N*e^(-m*epsilon) < delta
Solving this inequality for m gives
m > (1/epsilon)*(ln(1/delta)+ln N)
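A quick numerical check of this bound (a small sketch of ours, not from the notes; the values of epsilon, delta and N are arbitrary examples):
import math

def pac_sample_size(epsilon, delta, num_hypotheses):
    # m > (1/epsilon) * (ln(1/delta) + ln N), rounded up to an integer
    return math.ceil((1.0 / epsilon) * (math.log(1.0 / delta) + math.log(num_hypotheses)))

# e.g. a finite hypothesis class with N = 1000, accuracy 0.1, confidence 0.05
print(pac_sample_size(0.1, 0.05, 1000))   # prints 100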
Learning a Boolean Function
Suppose we want to learn a boolean function of n variables. The number N of such functions is
2^(2^n). Thus
m > (1/epsilon)*(ln(1/delta)+(2^n)ln2)
We can efficiently PAC learn concepts that are represented as the conjunction of boolean literals (i.e. positive or negative boolean variables). Here is a learning algorithm (a code sketch follows at the end of this subsection):
1. Start with a hypothesis h which is the conjunction of each variable and its negation: x1 & ~x1 & x2 & ~x2 & .. & xn & ~xn.
2. For each positive instance, delete from h every literal that is false in that instance.
3. Do nothing with negative instances.
In this algorithm the set of instances satisfying h is non-decreasing and is at all times contained in the set denoted by c [by induction: it is certainly true initially, and it is preserved by each deletion of a literal]. We will have an error when h contains a literal z which is not in c.
We first compute the probability that a literal z is deleted from h by one specific positive example. Clearly this probability is 0 if z occurs in c, and it is 1 if ~z is in c. At issue are the literals z where neither z nor ~z is in c: we would like to eliminate both of them from h, and if one of the two remains we have an error for some instance that is positive for c and negative for h. Let's call these literals free literals.
We have:
error(h) <= the sum, over the free literals z in h, of the probability that z is contradicted by (i.e. would be deleted by) a random positive example.
Since there are at most 2*n literals in h, if h is a bad hypothesis, i.e. a hypothesis with error greater than epsilon, then some free literal in h must be contradicted by a random positive example with probability at least epsilon/(2*n), so the probability that it survives m positive examples is at most (1 - epsilon/(2*n))^m. Taking all 2*n literals into account,
Probability[some free literal z survives m positive examples] < 2*n*(1 - epsilon/(2*n))^m < 2*n*(e^(-epsilon/(2*n)))^m = 2*n*e^(-(m*epsilon)/(2*n))
Requiring this to be at most delta and solving for m gives
m > (2*n/epsilon)*(ln(1/delta)+ln(2*n))
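A minimal sketch of this literal-elimination learner (our own illustrative code, not from the notes; literals are represented as (variable index, sign) pairs):
def learn_conjunction(examples, n):
    # examples: list of (x, label) where x is a tuple of n booleans and label is True/False
    # start with the conjunction of every variable and its negation
    h = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for x, label in examples:
        if not label:
            continue  # do nothing with negative instances
        # delete from h every literal that is false in this positive instance
        h = {(i, sign) for (i, sign) in h if x[i] == sign}
    return h

def predict(h, x):
    # h classifies x as positive iff every remaining literal is satisfied
    return all(x[i] == sign for (i, sign) in h)

# usage: target concept c = x1 & ~x3 over n = 3 variables (indices 0 and 2)
data = [((True, True, False), True), ((True, False, False), True), ((False, True, False), False)]
h = learn_conjunction(data, 3)
print(predict(h, (True, False, False)))   # expected: True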
Sample complexity:-
The sample complexity of a learning algorithm is the number m of training examples it needs in order to learn the target concept with accuracy epsilon and confidence delta; the PAC bounds derived above are bounds on sample complexity. Model complexity, in contrast, often refers to the number of features or terms included in a given predictive model, as well as whether the chosen model is linear, nonlinear, and so on. It can also refer to the algorithmic learning complexity or computational complexity.
How do you calculate the time complexity of a machine learning model? In general, by counting how the number of basic operations needed for training and prediction grows with the number of training examples n and the number of features d (for example, ordinary least-squares linear regression solved via the normal equations costs roughly O(n*d^2 + d^3)).
VC Dimension :-
A set of points is shattered by a class of functions {f(α)} if, for every possible assignment of labels to those points, some member of {f(α)} classifies them accordingly. The VC dimension of {f(α)} is the maximum number of training points that can be shattered by {f(α)}. For example, the VC dimension of the set of oriented lines in R^2 is three; in general, the VC dimension of the set of oriented hyperplanes in R^n is n+1. Note: to show that the VC dimension is at least d we need to find just one set of d points that can be shattered, not show that every set of d points can be shattered.
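To make the R^2 example concrete, here is a small sketch of ours (not from the notes) that checks that three non-collinear points can be shattered by linear classifiers; the points and the use of a near-hard-margin linear SVM are illustrative choices.
import itertools
import numpy as np
from sklearn.svm import SVC

# three non-collinear points in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

shattered = True
for labels in itertools.product([0, 1], repeat=3):
    y = np.array(labels)
    if len(set(labels)) < 2:
        continue  # a single-class labelling is trivially realised by a line with all points on one side
    clf = SVC(kernel='linear', C=1e6).fit(X, y)  # (near) hard-margin linear classifier
    if clf.score(X, y) < 1.0:
        shattered = False
print("3 points shattered by lines in R^2:", shattered)   # expected: True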
Ensemble learning :-
Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve a model's performance on tasks such as classification, prediction and function approximation.
The three main classes of ensemble learning methods are bagging, stacking, and boosting, and it is important both to have a detailed understanding of each method and to consider them on your predictive modeling project.
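A minimal sketch of bagging and boosting with scikit-learn (our illustrative example; the Iris dataset, the number of estimators and 5-fold cross-validation are arbitrary choices):
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: many trees trained on bootstrap samples, predictions combined by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: estimators trained sequentially, each focusing on previously misclassified points
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging  CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())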
[Unit 5]
Clustering k-means :-
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes.
As machine learning instructor AndreyBu puts it, “the objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.”
A cluster refers to a collection of data points aggregated together because of certain similarities.
You’ll define a target number k, which refers to the number of centroids you need in the dataset.
A centroid is the imaginary or real location representing the center of the cluster.
Every data point is allocated to one of the clusters by minimizing the in-cluster (within-cluster) sum of squares. In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest centroid, while keeping the within-cluster sum of squares as small as possible.
The ‘means’ in K-means refers to averaging the data, that is, finding the centroid of each cluster.
How the K-means algorithm works
To process the learning data, the K-means algorithm in data mining starts with a first group of
randomly selected centroids, which are used as the beginning points for every cluster, and then
performs iterative (repetitive) calculations to optimize the positions of the centroids
It halts creating and optimizing clusters when either:
• The centroids have stabilized — there is no change in their values because the clustering
has been successful.
• The defined number of iterations has been achieved.
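A small sketch of these iterations in plain NumPy (our illustrative code; in practice the work is done by scikit-learn's KMeans, used in the example that follows):
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Bare-bones version of the iterative procedure described above (illustrative only;
    # empty clusters keep their previous centroid rather than being re-seeded)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # random initial centroids
    for _ in range(max_iter):                                       # stop after a defined number of iterations
        # assign every point to its nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):                   # stop when the centroids have stabilized
            break
        centroids = new_centroids
    return centroids, labels

centers, labels = kmeans(np.random.rand(200, 2), k=3)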
K-means algorithm example problem
Let’s see the steps on how the K-means machine learning algorithm works using the Python
programming language.
We’ll use the Scikit-learn library and some random data to illustrate K-means clustering with a simple example.
Step 1: Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline
As you can see from the above code, we’ll import the following libraries in our project:
• Pandas for reading and writing spreadsheets
• Numpy for carrying out efficient computations
• Matplotlib for visualization of data
Step 2: Generate random data
Here is the code for generating some random data in a two-dimensional space:
X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
plt.scatter(X[:, 0], X[:, 1], s = 50, c = 'b')
plt.show()
A total of 100 data points have been generated and divided into two groups of 50 points each.
Here is how the data is displayed in two-dimensional space:
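Note that the snippet above imports KMeans but stops before actually fitting it; a minimal sketch of that step on the same X might look like this (choosing k = 2 is our decision, based on how the data was generated):
# Fit K-means with k = 2 on the data generated above and plot the result
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=200, marker='*')
plt.show()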
From the scatter plot above, it is very easy to see that we have two clusters in our data points, but in real-world data there can be many more. Next, as an alternative view of the same data, we plot a dendrogram of our data points (hierarchical clustering) using the SciPy library −
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# single-linkage hierarchical clustering of the 100 points generated above
linked = linkage(X, 'single')
labelList = list(range(1, len(X) + 1))   # one label per data point
plt.figure(figsize = (10, 7))
dendrogram(linked, orientation = 'top', labels = labelList,
           distance_sort = 'descending', show_leaf_counts = True)
plt.show()
Now, once the full dendrogram is formed, the longest vertical distance not crossed by any horizontal line is selected and a horizontal cut is drawn through it, as shown in the diagram. As this horizontal line crosses the vertical branches at two points, the number of clusters would be two.
Next, we need to import the class for clustering and call its fit_predict method to predict the
cluster. We are importing AgglomerativeClustering class of sklearn.cluster library −
from sklearn.cluster import AgglomerativeClustering
# ward linkage uses euclidean distances by default; in recent scikit-learn versions the
# 'affinity' argument has been renamed 'metric' and can simply be omitted here
cluster = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')
cluster.fit_predict(X)
Next, plot the clusters with the help of the following code −
plt.scatter(X[:, 0], X[:, 1], c = cluster.labels_, cmap = 'rainbow')
The resulting plot shows the two clusters found in our data points.
Example 2
As we have understood the concept of dendrograms from the simple example above, let us move to another example in which we create clusters of the data points in the Pima Indians Diabetes Dataset using hierarchical clustering.
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import numpy as np
from pandas import read_csv
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names = headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
data.shape
(768, 9)
data.head()
   preg  plas  pres  skin  test  mass   pedi  age  class
1     1    85    66    29     0  26.6  0.351   31      0
3     1    89    66    23    94  28.1  0.167   21      0
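The example stops after loading the data; a minimal sketch of the actual clustering step on this dataset might look like the following (choosing two clusters mirrors the two-class structure and is our assumption, not part of the notes):
# Hierarchical (agglomerative) clustering of the Pima features X loaded above
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)      # the features are on very different scales
pima_cluster = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')
pima_labels = pima_cluster.fit_predict(X_scaled)
print(pima_labels[:20])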
Suppose there is a set of data points that needs to be grouped into several parts or clusters based on their similarity. In machine learning, this is known as clustering.
There are several methods available for clustering:
• K-Means Clustering
• Hierarchical Clustering
• Clustering with Gaussian Mixture Models (discussed below)
A d-dimensional Gaussian distribution has density
N(x | mu, Sigma) = (1 / ((2*pi)^(d/2) * |Sigma|^(1/2))) * exp( -(1/2) * (x - mu)^T * Sigma^(-1) * (x - mu) )
where mu is a d-dimensional vector denoting the mean of the distribution and Sigma is the d x d covariance matrix.
Gaussian Mixture Model
Suppose there are K clusters (for the sake of simplicity it is assumed here that the number of clusters is known and equal to K), so a mean mu_k and a covariance Sigma_k have to be estimated for each k = 1..K. Had there been only one distribution, they would have been estimated by the maximum-likelihood method. But since there are K such clusters, the probability density is defined as a linear combination of the densities of all K distributions, i.e.
p(x) = Sum over k = 1..K of pi_k * N(x | mu_k, Sigma_k)
where pi_k is the mixing coefficient for the k-th distribution (pi_k >= 0 and the pi_k sum to 1).
For estimating the parameters by the maximum log-likelihood method, we maximize
ln p(X | mu, Sigma, pi) = Sum over n = 1..N of ln( Sum over k = 1..K of pi_k * N(x_n | mu_k, Sigma_k) )
Setting the derivative with respect to mu_k to zero gives
mu_k = (1 / N_k) * Sum over n of gamma(z_nk) * x_n
where gamma(z_nk) = pi_k * N(x_n | mu_k, Sigma_k) / ( Sum over j of pi_j * N(x_n | mu_j, Sigma_j) ) is the responsibility of cluster k for sample x_n. Similarly, taking derivatives with respect to Sigma_k and pi_k respectively, one obtains
Sigma_k = (1 / N_k) * Sum over n of gamma(z_nk) * (x_n - mu_k) * (x_n - mu_k)^T
pi_k = N_k / N
Note: N_k = Sum over n of gamma(z_nk) denotes the effective number of sample points in the k-th cluster. Here it is assumed that there is a total of N samples and each sample, containing d features, is denoted by x_n.
So it can be clearly seen that the parameters cannot be estimated in closed form, because the responsibilities gamma(z_nk) themselves depend on the parameters. This is where the Expectation-Maximization algorithm is beneficial.
Expectation-Maximization (EM) Algorithm
• Initialization: start with some initial values of the parameters mu_k, Sigma_k and pi_k.
• Estimation (E) step: for the current parameter values, estimate the values of the latent variables, i.e. compute the responsibilities gamma(z_nk) using the expression given above.
• Maximization (M) step: re-estimate the parameters mu_k, Sigma_k and pi_k using the update expressions above, with the responsibilities gamma(z_nk) computed in the E step.
• Repeat the E and M steps until the log-likelihood (or the parameter values) converges.
Example: In this example, the IRIS dataset is used. In Python, scikit-learn provides a GaussianMixture class to implement GMM.
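A minimal sketch along those lines (our illustrative code; the choice of three components with full covariance matrices is an assumption matching the three Iris species):
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, y = load_iris(return_X_y=True)

# Fit a 3-component GMM with full covariance matrices (EM runs under the hood)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
labels = gmm.fit_predict(X)

print("mixing coefficients pi_k:", gmm.weights_)
print("first cluster mean mu_1:", gmm.means_[0])
print("converged:", gmm.converged_)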