Machine Learning Unit 1
Learning
• Introduction to Machine Learning, comparison of machine learning with traditional programming, ML vs AI vs Data Science.
• Types of learning: supervised, unsupervised, semi-supervised, and reinforcement learning techniques.
• Models of machine learning: geometric models, probabilistic models, logical models, grouping and grading models.
• Parametric and non-parametric models.
• Important elements of machine learning: data formats.
• Learnability.
• Statistical learning approaches.
Introduction to Machine Learning
• Arthur Samuel, an early American leader in the field of computer gaming and artificial intelligence, coined the term "Machine Learning" in 1959 while at IBM.
• He defined machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed". However, there is no universally accepted definition of machine learning.
• Different authors define the term differently.
Introduction to Machine Learning
• Machine learning is programming computers to optimize a performance criterion using example data or past experience.
• We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience.
• The model may be predictive, to make predictions in the future, or descriptive, to gain knowledge from data.
• The field of study known as machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.
Introduction to Machine Learning
• Definition of learning: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
• Example: a handwriting recognition learning problem
– Task T: recognizing and classifying handwritten words within images
– Performance P: percent of words correctly classified
– Training experience E: a dataset of handwritten words with given classifications
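As a toy illustration (a sketch, not from the slides) of the performance measure P above, "percent of words correctly classified" can be computed like this in Python:

def percent_correct(predicted, actual):
    # Performance P: percent of items correctly classified
    matches = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * matches / len(actual)

print(percent_correct(["cat", "dog", "cow"], ["cat", "dog", "dog"]))  # about 66.7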
Introduction to Machine Learning
A robot driving learning problem
– Task T: driving on highways using vision sensors
– Performance P: average distance traveled before an error
– Training experience E: a sequence of images and steering commands recorded while observing a human driver
• Definition: A computer program which learns from experience is called a machine learning program or simply a learning program.
Introduction to Machine Learning
• However, when we need to predict something, we need to use an algorithm with a variety of input parameters.
• In the case of predicting an exchange rate, it is mandatory to add details such as yesterday's rate and the external and internal economic changes in the country that issues the currency, among others.
• In short, we may need to add hundreds or thousands of parameters, whereas a limited set of them allows building only a very basic and unscalable model.
Introduction to Machine Learning
• How a data engineer develops a solution using
machine learning https://morioh.com/p/1063d43ef15e
Ref- https://pythongeeks.org/ai-vs-data-science-vs-deep-learning-vs-ml/
Introduction to Machine Learning
• Artificial Intelligence
• It focuses mainly on building smart machines that are capable of performing tasks that replicate human intelligence without any human interference.
• These systems, built using Artificial Intelligence models, tend to mimic human cognitive functions, which allows decision making and helps to improve learning.
• Example: forecasting financial and business outcomes and efficiently providing solutions for businesses.
• We can demonstrate the working of AI in brief with the following steps:
1. Collect data
2. Clean and prepare data
3. Train the model
4. Test the data
5. Improve
• Machine learning algorithms make use of computational methods and try to "learn" from the input data without requiring any predetermined equation.
• Machine learning is an application of AI that allows systems to learn and improve significantly from past data experience.
• The working of machine learning models is simply put as (see the sketch after this list):
1. Gather data from the source
2. Clean and filter the data
3. Choose an effective algorithm according to your problem
4. Train the model
5. Tune the parameters for best performance
6. Test the model and try to improve its efficiency
7. Deploy the final model having precise outputs
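A minimal sketch of this workflow in Python with scikit-learn; the built-in iris dataset stands in for steps 1-2, and the algorithm and parameter choices are illustrative, not prescribed by the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # 1-2. gather (already clean) data

# 3. Choose an algorithm suited to the problem (here, a decision tree)
model = DecisionTreeClassifier(max_depth=3)  # 5. max_depth is a tunable parameter

# 4. Train on one part of the data, holding out the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)

# 6. Test and try to improve; 7. deploy (e.g., serialize) the final model
print("test accuracy:", model.score(X_test, y_test))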
• Deep learning is again a subfield of the artificial intelligence domain.
• It makes use of a multi-layered structure of algorithms, more commonly known as a neural network.
• Deep Learning, like Machine Learning algorithms, also needs data for learning and solving problems like classification and prediction.
• We can even consider Deep Learning a subdomain of machine learning.
DEEP LEARNING
• Unlike Machine Learning, we do not need labelled data for model training in Deep Learning, since we can use this technology even when we do not have well-classified data.
• The Deep Learning system looks for appropriate differentiators in the given data points without considering any external classification.
• This is how we can avoid human interference in the training of the Deep Learning model.
• These models analyze new entities for new features at each layer and use them to choose the way in which we can classify the entries.
• The system keeps checking itself in order to look for new classifications or categories that can be generated from the new entities.
DATA SCIENCE
How Supervised Learning Works?
• Suppose we have a dataset of different types of shapes
which includes square, rectangle, triangle, and Polygon. Now
the first step is that we need to train the model for each
shape.
• If the given shape has four sides, and all the sides are equal,
then it will be labelled as a Square.
• If the given shape has three sides, then it will be labelled as
a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
• Now, after training, we test our model using the test set, and the task of the model is to identify the shape.
• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
How Supervised Learning Works?
• Steps involved in supervised learning:
• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into a training dataset, a test dataset, and a validation dataset.
• Determine the input features of the training dataset, which should have enough knowledge so that the model can accurately predict the output.
• Determine the suitable algorithm for the model, such as a support vector machine, decision tree, etc.
• Execute the algorithm on the training dataset. Sometimes we need validation sets as the control parameters, which are subsets of the training dataset.
• Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, our model is accurate. (A minimal sketch of these steps follows.)
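A short sketch of the steps above; the synthetic data and the 60/20/20 split are illustrative assumptions, not values given in the slides:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))               # 300 samples, 4 input features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # labelled training data

# Split into training, validation, and test sets (60/20/20)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Choose a suitable algorithm (here a support vector machine) and train it
model = SVC(kernel="rbf", C=1.0)
model.fit(X_train, y_train)

# Use the validation set to compare settings, then evaluate on the test set
print("validation accuracy:", model.score(X_val, y_val))
print("test accuracy:", model.score(X_test, y_test))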
Types of supervised Machine learning Algorithms:
1. Regression
• Regression algorithms are used if there is a
relationship between the input variable and the
output variable.
• It is used for the prediction of continuous variables, such as weather forecasting, market trends, etc. (a short sketch follows the list below).
• Below are some popular regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
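A small regression sketch, with synthetic data standing in for, e.g., a market trend:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)    # input variable
y = 3.0 * X.ravel() + 2.0 + np.random.default_rng(0).normal(0, 0.5, 10)

model = LinearRegression().fit(X, y)                       # learn the line from data
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x=12:", model.predict([[12]])[0])    # a continuous output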
Supervised Machine Learning
2. Classification
• Classification algorithms are used when the output variable is categorical, meaning there are discrete classes such as Yes-No, Male-Female, True-False, etc. A typical application is spam filtering (a small sketch follows the list below).
• Popular classification algorithms:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
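A tiny spam-filtering sketch using logistic regression; the example messages and labels are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash click here", "lunch with the team"]
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = not spam

vec = CountVectorizer().fit(texts)           # bag-of-words features
model = LogisticRegression().fit(vec.transform(texts), labels)
print(model.predict(vec.transform(["free prize inside"])))  # likely [1] (spam)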
Advantages of Supervised learning:
• With the help of supervised learning, the model can predict the output on the basis of prior experience.
• In supervised learning, we can have an exact idea about the classes of objects.
• Supervised learning models help us to solve various real-world problems such as fraud detection, spam filtering, etc.
Disadvantages of supervised learning:
• Supervised learning models are not suitable for handling complex tasks.
• Supervised learning cannot predict the correct output if the test data is different from the training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.
Unsupervised Machine Learning
• Unsupervised learning is where you only have input
data (X) and no corresponding output variables.
• As the name suggests, unsupervised learning is a
machine learning technique in which models are
not supervised using training dataset.
• Instead, models itself find the hidden patterns and
insights from the given data. It can be compared to
learning which takes place in the human brain
• The goal of unsupervised learning is to find the
underlying structure of dataset, group that data
according to similarities, and represent that
dataset in a compressed format.
31
Working of Unsupervised Learning
• Here, we have taken unlabeled input data, which means it is not categorized and the corresponding outputs are also not given.
• Now, this unlabeled input data is fed to the machine learning model in order to train it. First, it will interpret the raw data to find the hidden patterns in the data and then will apply suitable algorithms such as k-means clustering, hierarchical clustering, etc. (see the sketch below).
• Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to the similarities and differences between the objects.
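A minimal k-means sketch of this working; the two synthetic blobs are illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),    # one blob near (0, 0)
               rng.normal(5, 0.5, (50, 2))])   # another blob near (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # roughly the two blob centres
print(kmeans.labels_[:5])        # group assignment per point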
Types of Unsupervised Learning Algorithms:
• Clustering: clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group.
• Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
• Association: an association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset.
• Association rules make marketing strategy more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of an association rule is Market Basket Analysis (see the sketch below).
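A toy market-basket sketch that counts how often pairs of items occur together; the transactions are made up, and real association-rule mining would also compute confidence and lift:

from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"bread", "jam"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of transactions containing both items
for pair, count in pair_counts.most_common(3):
    print(pair, "support:", count / len(transactions))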
• Unsupervised Learning algorithms:
• Below is a list of some popular unsupervised learning algorithms (a PCA sketch follows the list):
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
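As one example from the list, a short Principal Component Analysis sketch; the synthetic 3-D data, which actually lies on a 2-D plane, is illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.5, 0.2],
                                          [0.0, 1.0, 0.3]])  # 3-D data on a 2-D plane

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)   # the two components explain nearly all variance
X_reduced = pca.transform(X)           # compressed representation of the dataset
print(X_reduced.shape)                 # (100, 2)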
Advantages of Unsupervised learning:
• Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labeled input data.
• Unsupervised learning is preferable because it is easier to get unlabeled data than labeled data.
Disadvantages of Unsupervised Learning
• Unsupervised learning is intrinsically more difficult than supervised learning, as it does not have corresponding outputs.
• The result of an unsupervised learning algorithm might be less accurate, as the input data is not labeled and the algorithm does not know the exact output in advance.
Difference between Supervised vs Unsupervised Learning
[Comparison table shown in the original slides]
Semi-Supervised Learning
Semi-supervised learning
• Semi-supervised learning bridges supervised learning and unsupervised learning techniques to solve their key challenges.
• With it, you train an initial model on a few labeled samples and then iteratively apply it to a larger amount of unlabeled data.
• Unlike unsupervised learning, SSL works for a variety of problems, from classification and regression to clustering and association.
• Unlike supervised learning, the method uses small amounts of labeled data together with large amounts of unlabeled data, which reduces expenses on manual annotation and cuts data preparation time.
How Semi-supervised Learning Works
• You pick a small amount of labeled data, e.g., images showing cats and
dogs with their respective tags, and you use this dataset to train a
base model with the help of ordinary supervised methods.
• Then you apply the process known as pseudo-labeling — when you
take the partially trained model and use it to make predictions for
the rest of the database which is yet unlabeled. The labels generated
thereafter are called pseudo as they are produced based on the
originally labeled data that has limitations (say, there may be an
uneven representation of classes in the set resulting in bias — more
dogs than cats).
• From this point, you take the most confident predictions made with
your model (for example, you want the confidence of over 80 percent
that a certain image shows a cat, not a dog). If any of the pseudo-
labels exceed this confidence level, you add them into the labeled
dataset and create a new, combined input to train an improved model.
• The process can go through several iterations (10 is often a standard amount) with more and more pseudo-labels being added every time. Provided the data is suitable for the process, the performance of the model will keep increasing at each iteration.
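A compact pseudo-labeling sketch of the process described above; the data is synthetic, and the 80 percent confidence threshold mirrors the text's example:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y_true = (X[:, 0] > 0).astype(int)       # ground truth (unknown in practice)

labels = np.full(500, -1)                # -1 marks "unlabeled"
labels[:20] = y_true[:20]                # a small labeled seed set (both classes assumed present)

model = LogisticRegression()
for _ in range(10):                      # 10 iterations is a common choice
    mask = labels != -1
    model.fit(X[mask], labels[mask])     # train on the currently labeled data
    proba = model.predict_proba(X[~mask])
    confident = proba.max(axis=1) > 0.8  # keep only confident predictions
    idx = np.flatnonzero(~mask)[confident]
    if idx.size == 0:
        break
    labels[idx] = proba[confident].argmax(axis=1)  # add pseudo-labels to the pool

print("labeled so far:", (labels != -1).sum(), "of", len(labels))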
Semi-supervised learning examples
• Speech Recognition- Facebook (now Meta)
has successfully applied semi-supervised
learning (namely the self-training method) to its
speech recognition models and improved them.
• Web content classification- Many search
engines, including Google, apply SSL to their
ranking component to better understand
human language and the relevance of candidate
search results to queries.
• Text document classification: building a text document classifier.
Reinforcement Learning Techniques
• Reinforcement learning is an area of Machine
Learning. It is about taking suitable action to
maximize reward in a particular situation.
• In reinforcement learning, there is no answer key; the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.
• Example: We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward.
Types of Reinforcement:
• There are two types of reinforcement:
• Positive
Positive reinforcement is defined as when an event, occurring due to a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
• Advantages of positive reinforcement:
– Maximizes performance
– Sustains change for a long period of time
• Drawback: too much reinforcement can lead to an overload of states, which can diminish the results.
Types of Reinforcement:
• Negative
Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
• Advantages of negative reinforcement:
– Increases behavior
– Helps enforce a minimum standard of performance
• Drawback: it only provides enough to meet the minimum behavior.
• State-Action-Reward-State-Action (SARSA): SARSA is an on-policy algorithm based on the Markov decision process. It uses the action performed by the current policy to learn the Q-value. The name SARSA stands for State-Action-Reward-State-Action, symbolizing the tuple (s, a, r, s', a').
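A minimal tabular SARSA sketch of the update for the (s, a, r, s', a') tuple; the environment (number of states and actions, and the env.step call) is a hypothetical placeholder:

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
Q = np.zeros((n_states, n_actions))

def choose_action(s):
    # Epsilon-greedy action under the current policy
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy update: uses the action actually chosen in s_next
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Usage inside an episode loop (env.step is a hypothetical environment call):
# a = choose_action(s)
# s_next, r = env.step(s, a)
# a_next = choose_action(s_next)
# sarsa_update(s, a, r, s_next, a_next)
# s, a = s_next, a_next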
Practical Applications of Reinforcement Learning
• RL can be used in robotics for industrial automation.
• RL can be used in machine learning and data processing
• RL can be used to create training systems that provide
custom instruction and materials according to the
requirement of students.
• RL can be used in large environments in the following situations:
– A model of the environment is known, but an analytic solution is not available;
– Only a simulation model of the environment is given (the subject of simulation-based optimization);
– The only way to collect information about the environment is to interact with it.
Models of Machine learning: Geometric model
• Some broad categories of models:
1. Geometric models
• E.g. k-nearest neighbors, linear regression, support vector machine, logistic regression, …
2. Probabilistic models
• E.g. Naïve Bayes, Gaussian process regression, conditional random field, …
3. Logical models
• E.g. decision tree, random forest, …
4. Compositional models
• E.g. neural networks, logistic regression, …
5. Ensemble models: boosting, bagging, random forest.
6. Grading vs grouping models
Models of Machine learning: Geometric model
• Machine learning is concerned with using the right features to build the right models that achieve the right tasks.
• For a given problem, the collection of all possible outcomes represents the sample space or instance space.
• The basic idea of learning models falls into the following categories:
• Using a logical expression (logical models)
• Using the geometry of the instance space (geometric models)
• Using probability to classify the instance space (probabilistic models)
• Grouping and grading
Logical models
• Logical models use a logical expression to divide the
instance space into segments and hence construct
grouping models.
• A logical expression is an expression that returns a
Boolean value, i.e., a True or False outcome.
• Once the data is grouped using a logical expression, the
data is divided into homogeneous groupings for the
problem we are trying to solve.
• For example, for a classification problem, all the
instances in the group belong to one class.
• There are mainly two kinds of logical models: Tree
models and Rule models.
Logical models
• Rule models consist of a collection of implications or IF-THEN rules.
• In tree-based models, the 'if-part' defines a segment and the 'then-part' defines the behavior of the model for this segment.
• Rule models follow the same reasoning.
• Tree models can be seen as a particular type of rule model where the if-parts of the rules are organized in a tree structure.
• Both tree models and rule models use the same approach to supervised learning.
Logical models
• Logical models and Concept learning
• To understand logical models further, we need to understand
the idea of Concept Learning.
• Concept Learning involves learning logical expressions or
concepts from examples.
• Concept learning forms the basis of both tree-based and rule-
based models.
• More formally, Concept Learning involves acquiring the
definition of a general category from a given set of positive and
negative training examples of the category.
• A Formal Definition for Concept Learning is “The inferring of a
Boolean-valued function from training examples of its input
and output.”
• In concept learning, we only learn a description for the positive class and label everything that doesn't satisfy that description as negative.
Geometric model
• We have seen that with logical models, such as decision trees, a logical expression is used to partition the instance space.
• Two instances are similar when they end up in the same logical segment.
• In this section, we consider models that define similarity by considering the geometry of the instance space.
• In geometric models, features can be described as points in two dimensions (x- and y-axis) or in a three-dimensional space (x, y, and z).
• E.g., temperature as a function of time can be modelled in two axes.
Geometric model
• There are two ways we could impose similarity.
• We could use geometric concepts like lines or planes
to segment (classify) the instance space. These are
called Linear models.
• Alternatively, we can use the geometric notion of
distance to represent similarity.
• In this case, if two points are close together, they have similar values for features and thus can be classed as similar. We call such models Distance-based models.
Geometric model
• 1. Linear models are relatively simple.
• In this case, the function is represented as a linear
combination of its inputs.
• Thus, if x1 and x2 are two scalars or vectors of the same
dimension and a and b are arbitrary scalars,
then ax1 + bx2 represents a linear combination of x1 and x2.
• In the simplest case where f(x) represents a straight line,
we have an equation of the form
• f (x) = mx + c where c represents the intercept
and m represents the slope.
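A small sketch of learning m and c from data by least squares; the synthetic points are illustrative:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.random.default_rng(0).normal(0, 0.1, x.size)

m, c = np.polyfit(x, y, deg=1)   # learn slope m and intercept c from the data
print(f"f(x) = {m:.2f}*x + {c:.2f}")   # close to f(x) = 2.00*x + 1.00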
Geometric model
• Linear models are parametric, which means that they
have a fixed form with a small number of numeric
parameters that need to be learned from data.
• For example, in f(x) = mx + c, m and c are the parameters that we are trying to learn from the data.
• Linear models are stable, i.e., small variations in the
training data have only a limited impact on the learned
model.
• In contrast, tree models tend to vary more with the
training data, as the choice of a different split at the
root of the tree typically means that the rest of the
tree is different as well.
Geometric model
• Linear models have low variance and high bias.
• This implies that Linear models are less likely to
overfit the training data than some other models.
• However, they are more likely to underfit. For example,
if we want to learn the boundaries between countries
based on labelled data, then linear models are not
likely to give a good approximation
• Errors in Machine Learning?
• Reducible errors: These errors can be reduced to
improve the model accuracy.
• Irreducible errors: These errors will always be present
in the model
Geometric model
• What is Bias?
• In general, a machine learning model analyses the data, finds patterns in it, and makes predictions.
• While training, the model learns these patterns in the dataset and applies them to test data for prediction.
• While making predictions, a difference occurs between the prediction values made by the model and the actual/expected values, and this difference is known as bias error or error due to bias.
• Low bias: a low-bias model makes fewer assumptions about the form of the target function.
• High bias: a high-bias model makes more assumptions, and the model becomes unable to capture the important features of our dataset. A high-bias model also cannot perform well on new data.
Geometric model
• Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines. Algorithms with high bias include Linear Regression, Linear Discriminant Analysis and Logistic Regression.
• Ways to reduce high bias:
• High bias mainly occurs due to an overly simple model. Below are some ways to reduce high bias:
• Increase the input features, as the model is underfitted.
• Use more complex models, such as including some polynomial features.
Geometric model
• What is a Variance Error?
• Variance specifies the amount by which the prediction would change if different training data were used.
• In simple words, variance tells how much a random variable differs from its expected value.
• Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables.
• Variance errors are either low variance or high variance.
• Low variance means there is a small variation in the prediction of the target function with changes in the training dataset, while high variance shows a large variation in the prediction of the target function with changes in the training dataset.
Models of Machine learning: Geometric model
• A model that shows high variance learns a lot and performs well on the training dataset, but does not generalize well to unseen data.
• As a result, such a model gives good results with the training dataset but shows high error rates on the test dataset.
• Since, with high variance, the model learns too much from the dataset, it leads to overfitting.
• A model with high variance has the following problems:
• A high-variance model leads to overfitting.
• Increased model complexity.
Models of Machine learning: Geometric model
Ways to Reduce High Variance:
• Reduce the input features or the number of parameters, as the model is overfitted.
• Do not use an overly complex model.
• Increase the training data.
• Increase the regularization term.
Different Combinations of Bias-Variance
Low-Bias, Low-Variance:
The combination of low bias and low variance shows an ideal machine learning model. However, it is not achievable in practice.
Models of Machine learning: Geometric model
• Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to overfitting.
• High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses a small number of parameters. It leads to underfitting problems in the model.
• High-Bias, High-Variance: With high bias and high variance, predictions are inconsistent and also inaccurate on average.
Geometric model
• 2. Distance-based models are the second class of geometric models.
• Like linear models, distance-based models are based on the geometry of data.
• As the name implies, distance-based models work on the notion of distance between points.
• In the context of machine learning, the concept of distance is not based on merely the physical distance between two points.
• Instead, we could think of the distance between two points considering the mode of transport between them.
• Travelling between two cities by plane covers less distance physically than by train, because the plane is unrestricted.
Models of Machine learning: Geometric model
– Euclidean distance: the length of a segment connecting two points.
– Most useful for low-dimensional data; as the dimensionality of your data increases, distances might become skewed and less useful.
Minkowski distance
• Minkowski distance is a metric used in a normed vector space (n-dimensional real space), which means it can be used in a space where distances can be represented as a vector that has a length.
• This measure has three requirements:
• Zero Vector — The zero vector has a length of zero, whereas every other vector has a positive length. For example, if we travel from one place to another, that distance is always positive; however, if we travel from one place to itself, that distance is zero.
• Scalar Factor — When you multiply the vector by a positive number, its length is changed while keeping its direction. For example, if we go a certain distance in one direction and add the same distance, the direction does not change.
• Triangle Inequality — The shortest distance between two points is a straight line.
Models of Machine learning: Geometric model
• The most interesting aspect of this distance measure is the use of the parameter p: the Minkowski distance of order p between points x and y is D(x, y) = (Σ_i |x_i − y_i|^p)^(1/p).
• We can use this parameter to manipulate the distance metric to closely resemble others.
• Common values of p are:
• p = 1 — Manhattan distance
• p = 2 — Euclidean distance
• p = ∞ — Chebyshev distance
Chebyshev distance is simply the maximum distance along one axis. Due to its nature, it is often referred to as Chessboard distance, since the minimum number of moves needed by a king to go from one square to another is equal to the Chebyshev distance.
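A quick sketch of the Minkowski distance and its special cases:

import numpy as np

def minkowski(x, y, p):
    # Minkowski distance of order p between two points
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))        # Manhattan distance: 7.0
print(minkowski(x, y, 2))        # Euclidean distance: 5.0
print(np.max(np.abs(x - y)))     # Chebyshev distance (the p -> infinity limit): 4.0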
Probabilistic Models
• Probabilistic models use the idea of probability to
classify new entities.
• Probabilistic models see features and target variables
as random variables.
• The process of modelling represents and manipulates
the level of uncertainty with respect to these variables.
• There are two types of probabilistic models: predictive and generative.
• Predictive probability models use the idea of a conditional probability distribution P(Y|X), from which Y can be predicted from X.
• Generative models estimate the joint distribution P(Y, X).
Probabilistic Models
• Once we know the joint distribution for the generative
models, we can derive any conditional or marginal
distribution involving the same variables.
• Thus, the generative model is capable of creating new
data points and their labels, knowing the joint
probability distribution.
• The joint distribution looks for a relationship between
two variables.
• Once this relationship is inferred, it is possible to infer
new data points.
• Naïve Bayes is an example of a probabilistic classifier
Naïve Bayes Classifier Algorithm
• Naïve Bayes algorithm is a supervised learning
algorithm, which is based on Bayes theorem and used
for solving classification problems.
• It is mainly used in text classification that includes a
high-dimensional training dataset.
• It is a probabilistic classifier, which means it predicts
on the basis of the probability of an object.
• Some popular examples of Naïve Bayes Algorithm
are spam filtration, Sentimental analysis, and
classifying articles.
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
Naïve Bayes Classifier Algorithm
• Bayes: It is called Bayes because it depends on the
principle of Bayes' Theorem.
• Bayes' Theorem is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
• The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

• Where,
• P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
• P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
Naïve Bayes Classifier Algorithm
• P(A) is Prior Probability: Probability of hypothesis
before observing the evidence.
• P(B) is Marginal Probability: Probability of Evidence.
• Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
• Convert the given dataset into frequency tables.
• Generate a likelihood table by finding the probabilities of the given features.
Naïve Bayes Classifier Algorithm
• Now, use Bayes theorem to calculate the posterior probability.
• Problem: If the weather is sunny, should the player play or not?
• Solution: To solve this, first build the frequency table from the weather dataset of 14 days (the full dataset appears as a table in the original slides; the counts below follow from the numbers used in the calculation):

Weather    Play = Yes    Play = No
Sunny          3             2
Other          7             2
Total         10             4
Naïve Bayes Classifier Algorithm
• Applying Bayes' theorem:
• P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
• P(Sunny|Yes) = 3/10 = 0.3
• P(Sunny) = 5/14 = 0.35
• P(Yes) = 10/14 = 0.71
• So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
• P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
• P(Sunny|No) = 2/4 = 0.5
• P(No) = 4/14 = 0.29
• P(Sunny) = 0.35
• So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
• As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
• Hence, on a sunny day, the player can play the game.
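The same calculation reproduced in a few lines of Python; with exact fractions the posteriors come to 0.60 and 0.40 (the slides' rounded intermediate values give 0.41 for the second):

p_sunny_yes, p_yes = 3 / 10, 10 / 14    # likelihood and prior for "Yes"
p_sunny_no,  p_no  = 2 / 4,  4 / 14     # likelihood and prior for "No"
p_sunny = 5 / 14                        # evidence

p_yes_sunny = p_sunny_yes * p_yes / p_sunny
p_no_sunny  = p_sunny_no  * p_no  / p_sunny
print(round(p_yes_sunny, 2), round(p_no_sunny, 2))  # 0.6 0.4 -> play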
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a
class of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the
other Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or
unrelated, so it cannot learn the relationship between features.
• Applications of Naïve Bayes Classifier:
• It is used for Credit Scoring.
• It is used in medical data classification.
• It can be used in real-time predictions because Naïve Bayes
Classifier is an eager learner.
• It is used in text classification, such as spam filtering and sentiment analysis.
Grouping and Grading Model
• Grading vs grouping is an orthogonal categorization to
geometric-probabilistic-logical-compositional.
• Grouping models break the instance space up into
groups or segments and in each segment apply a very
simple method (such as majority class).
• E.g. decision tree, KNN.
• Grading models form one global model over the
instance space.
• E.g. Linear classifiers – Neural networks
Parametric Machine Learning Algorithms
• A learning model that summarizes data with a set of
parameters of fixed size (independent of the number
of training examples) is called a parametric model.
• No matter how much data you throw at a parametric
model, it won’t change its mind about how many
parameters it needs.
• The algorithms involve two steps:
• Select a form for the function.
• Learn the coefficients for the function from the training
data.
• An easy-to-understand functional form for the mapping function is a line, as used in linear regression:
• b0 + b1*x1 + b2*x2 = 0
Parametric Machine Learning Algorithms
• Where b0, b1 and b2 are the coefficients of the line that
control the intercept and slope, and x1 and x2 are two
input variables.
• The prediction is a linear combination of the input variables, and as such parametric machine learning algorithms are often also called "linear machine learning algorithms".
• Some more examples of parametric machine learning algorithms include:
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naive Bayes
• Simple Neural Networks
• For example, the parameters for the normal distribution are:
• Mean
• Standard Deviation
Parametric Machine Learning Algorithms
• Benefits of Parametric Machine Learning Algorithms:
• Simpler: These methods are easier to understand and interpret
results.
• Speed: Parametric models are very fast to learn from data.
• Less Data: They do not require as much training data and
can work well even if the fit to the data is not perfect.
• Limitations of Parametric Machine Learning Algorithms:
• Constrained: By choosing a functional form these methods are
highly constrained to the specified form.
• Limited Complexity: The methods are more suited to simpler
problems.
• Poor Fit: In practice the methods are unlikely to match the
underlying mapping function.
Nonparametric Machine Learning Algorithms
• Nonparametric methods are good when you have a lot
of data and no prior knowledge, and when you don’t
want to worry too much about choosing just the right
features.
• Nonparametric methods seek to best fit the training
data in constructing the mapping function, whilst
maintaining some ability to generalize to unseen data.
As such, they are able to fit a large number of
functional forms.
• An easy to understand nonparametric model is the k-
nearest neighbors algorithm that makes predictions
based on the k most similar training patterns for a new
data instance.
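A tiny k-nearest-neighbours sketch of this idea (k = 3, majority vote over Euclidean distances; the data is synthetic):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to each training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> 1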
Nonparametric Machine Learning Algorithms
• Some more examples of popular nonparametric machine learning algorithms are:
• k-Nearest Neighbors
• Decision Trees, like CART and C4.5
• Support Vector Machines
• Benefits of Nonparametric Machine Learning Algorithms:
• Flexibility: capable of fitting a large number of functional forms.
• Power: no assumptions (or weak assumptions) about the underlying function.
• Performance: can result in higher-performance models for prediction.
• Limitations of Nonparametric Machine Learning Algorithms:
• More data: require a lot more training data to estimate the mapping function.
• Slower: a lot slower to train, as they often have far more parameters to train.
• Overfitting: more of a risk of overfitting the training data, and it is harder to explain why specific predictions are made.
Important Elements in Machine Learning
• Data formats
• In a supervised learning problem, there will always be a dataset, defined as a finite set of real vectors with m features each:

X = {x1, x2, …, xn}, where each xi is a vector in R^m
Data Format
• Labeled data: data consisting of a set of training examples, where each example is a pair consisting of an input and a desired output value (also called the supervisory signal, labels, etc.)
• Classification: The goal is to predict discrete
values, e.g. {1,0}, {True, False}, {spam, not
spam}.
• Regression: The goal is to predict continuous
values, e.g. home prices.
• Feature vector: a typical setting for machine learning is to be given a collection of objects (or data points), each of which is characterised by several different features.
• Features can be of different sorts: e.g., they might be continuous (say, real- or integer-valued) or categorical (for instance, a feature for colour can have values like green, blue, red).
• A vector containing all of the feature values for a given data point is called the feature vector.
• If this is a vector of length m, then one can think of each data point as being mapped to an m-dimensional vector space (in the case of real-valued features, this is R^m), called the feature space.
• The samples are assumed to be independent and identically distributed (i.i.d.). This means all variables belong to the same distribution D, and considering an arbitrary subset of m values, it happens that:

P(x1, x2, …, xm) = P(x1) · P(x2) · … · P(xm)
• Categorical examples are y ∈ {red, green, blue} or y ∈ {0, 1}.
• For continuous outputs, a common interpretation can be expressed in terms of additive noise: y = f(x) + n, where f is the underlying function and n is a noise term.
• Consider an example of a dataset whose points must be classified as red (Class A) or blue (Class B).
• Three hypotheses are shown: the first one (the middle line starting from the left) misclassifies one sample, while the lower and upper ones misclassify 13 and 23 samples respectively.
• The first hypothesis is optimal and should be selected; however, it's important to understand an essential concept which can determine potential overfitting.
[Figure: a dataset fitted by a linear classifier (blue) and a cubic classifier (red)]
• The blue classifier is linear while the red one
is cubic. At a glance, non-linear strategy
seems to perform better, because it can
capture more expressivity, thanks to its
concavities.
• However, if new samples are added following
the trend defined by the last four ones (from
the right), they'll be completely misclassified.
• In fact, while a linear function is globally better but cannot capture the initial oscillation between 0 and 4, a cubic approach can fit this data almost perfectly but, at the same time, loses its ability to keep a global trend.
Error measures
• In general, when working with a supervised scenario, we define a non-negative error measure e_m which takes two arguments (expected and predicted output) and allows us to compute a total error value over the whole dataset (made up of n samples):

Error(H) = Σ (i = 1 to n) e_m(predicted_i, expected_i) ≥ 0
• This value is also implicitly dependent on the specific hypothesis H through the parameter set; therefore, optimizing the error implies finding an optimal hypothesis:

H* = argmin_H Error(H)
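A small sketch tying these together: a mean-squared error measure over a synthetic dataset, and a search for the hypothesis (here just a slope m) that minimizes it; the data and the candidate grid are illustrative:

import numpy as np

x = np.linspace(0, 1, 50)
y = 2.0 * x + np.random.default_rng(0).normal(0, 0.05, 50)  # expected outputs

def total_error(m):
    # Non-negative error of hypothesis f(x) = m*x over the whole dataset
    predicted = m * x
    return np.mean((predicted - y) ** 2)

candidates = np.linspace(0.0, 4.0, 401)          # a simple family of hypotheses
best_m = candidates[np.argmin([total_error(m) for m in candidates])]
print("optimal hypothesis slope:", best_m)       # close to 2.0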
Statistical learning approaches
• Imagine that you need to design a spam-filtering algorithm, starting from this initial (over-simplistic) classification based on two parameters:

Parameter          Spam emails (X1)    Regular emails (X2)