Machine Learning
Subject Notes
Unit I
VIII Semester
Subject Name: IT 802-Machine Learning
INTRODUCTION
Traditional software engineering combines human-created rules with data to produce answers to a problem. Machine learning, by contrast, uses data and answers to discover the rules behind a problem. To learn the rules governing a phenomenon, machines go through a learning process, trying different rules and learning from how well they perform. That is why it is known as Machine Learning.
• Dataset: A set of data examples that contain features important to solving the
problem.
• Features: Important pieces of data that help us understand a problem. These are fed
into a Machine Learning algorithm to help it learn.
• Data Collection: Collect the data that the algorithm will learn from.
• Data Preparation: Format and engineer the data into the optimal format, extracting
important features and performing dimensionality reduction.
• Training: This is where the Machine Learning algorithm learns by showing it the data
that has been collected and prepared.
There are many approaches that can be taken when conducting Machine Learning. Supervised and unsupervised learning are well-established and the most widely used approaches. Semi-supervised and Reinforcement Learning are newer and more complex but have shown impressive results.
There are three basic types of learning paradigms widely associated with machine learning,
namely
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning
Supervised learning is a machine learning task in which a function mapping inputs to outputs is learned from provided input-output pairs.
In this type of learning, you need to give both the input and the output (usually in the form of labels) to the computer for it to learn from. The computer generates a function based on this data, which can be anything from a simple line to a complex function, depending on the data provided.
This is the most basic type of learning paradigm, and most algorithms we learn today are based on this type of learning pattern. For example:
Regression: the machine is trained to predict a continuous value such as price, weight, or height.
Unsupervised Learning
In this type of learning paradigm, the computer is provided with only the input from which to develop a learning pattern. It is, in essence, learning without any labeled results.
This means that the computer has to recognize a pattern in the given input and develop a learning algorithm accordingly. So, we conclude that “the machine learns through observation and finds structures in data”. This is still a largely unexplored field of machine learning, and big tech companies like Google and Microsoft are actively researching it.
Clustering: A clustering problem is where you want to discover the inherent groupings in the data.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data.
Reinforcement Learning
Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behavior within a specific context, in order to maximize their performance.
There is an excellent analogy to explain this type of learning paradigm, “training a dog”.
This learning paradigm is like a dog trainer, which teaches the dog how to respond to
specific signs, like a whistle, clap, or anything else. Whenever the dog responds correctly,
the trainer gives a reward to the dog, which can be a “Bone or a biscuit”.
Game playing — determining the best move to make in a game often depends on a number
of different factors; hence the number of possible states that can exist in a particular game
is usually very large.
Control problems — such as elevator scheduling. Again, it is not obvious what strategies
would provide the best, most timely elevator service. For control problems such as this, RL
agents can be left to learn in a simulated environment and eventually they will come up
with good controlling policies.
Perspective:
It involves searching a very large space of possible hypotheses to determine the one that best fits the observed data. Machine perception is the capability of a computer system to interpret data in a manner that is similar to the way humans use their senses to relate to the world around them. Computers take in and respond to their environment primarily through attached hardware. Until recently, input was limited to a keyboard or a mouse, but advances in technology, both in hardware and software, have allowed computers to take in sensory input in a way similar to humans. Machine perception allows the computer to use this sensory input, as well as conventional computational means of gathering information, to gather information with greater accuracy and to present it in a way that is more comfortable for the user.
The end goal of machine perception is to give machines the ability to see, feel and perceive
the world as humans do and therefore for them to be able to explain in a human way why
they are making their decisions, to warn us when it is failing and more importantly, the
reason why it is failing. This purpose is very similar to the proposed purposes for artificial
intelligence generally, except that machine perception would only grant machines limited
sentience, rather than bestow upon machines full consciousness, self-awareness, and
intentionality.
Issues:
Some of the issues that the science of machine perception still has to overcome include:
• Embodied Cognition - The theory that cognition is a full-body experience, and therefore can only exist, and therefore be measured and analyzed, in fullness if all required human abilities and processes are working together through a mutually aware and supportive systems network.
• The Principle of Similarity - The ability young children develop to determine what family a newly introduced stimulus falls under, even when the said stimulus is different from the members with which the child usually associates that family. (An example could be a child figuring out that a Chihuahua is a dog and house pet rather than vermin.)
• The innate human ability to follow the Likelihood Principle in order to learn from
circumstances and others over time.
• The Free energy principle - determining well beforehand how much energy one can safely delegate to being aware of things outside one's self, without losing the energy one needs to sustain life and function satisfactorily. This allows one to become optimally aware of the surrounding world without depleting one's energy so much that one experiences damaging stress, decision fatigue, and/or exhaustion.
CONCEPT LEARNING
Concept learning, also known as category learning, is the search for and listing of attributes that can be used to distinguish exemplars from non-exemplars of various categories. More simply put, concepts are the mental categories that help us classify objects, events, or ideas, building on the understanding that each object, event, or idea has a set of common relevant features. Thus, concept learning is a strategy which requires a learner to compare and contrast groups or categories that contain concept-relevant features with groups or categories that do not contain concept-relevant features.
Let’s design the problem formally with TPE (Task, Performance, Experience):
Task T: Learn to predict the value of EnjoySport for an arbitrary day, based on the values of the attributes of the day,
where x1, x2, x3, x4, x5 and x6 are the values of Sky, Air-Temp, Humidity, Wind, Water and Forecast.
Hence h1, corresponding to the first (positive) training example, will look like:
h1: < Sunny, Warm, Normal, Strong, Warm, Same > (Note: this corresponds to a positive example.)
We want to find the most suitable hypothesis which can represent the concept. For example, "the person enjoys his favorite sport only on cold days with high humidity" would be represented as < ?, Cold, High, ?, ?, ? >.
Here ‘?’ indicates that any value of the attribute is acceptable. The most general hypothesis is < ?, ?, ?, ?, ?, ? >, under which every day is a positive example, and the most specific hypothesis is < ∅, ∅, ∅, ∅, ∅, ∅ >, under which no day is a positive example. The two most popular approaches to finding a suitable hypothesis are:
1. Find-S Algorithm
2. List-Then-Eliminate Algorithm
Find-S Algorithm:
• Initialize h to the most specific hypothesis in H.
• For each positive training instance x:
▪ For each attribute constraint ai in h: if the constraint ai is satisfied by x, then do nothing; otherwise replace ai in h by the next more general constraint that is satisfied by x.
• Output hypothesis h
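A minimal sketch of Find-S in Python. The attribute names, training rows, and function name below are illustrative (they follow the EnjoySport-style example used above), not part of the notes.

# Find-S: start from the most specific hypothesis and generalize it
# just enough to cover every positive example.

def find_s(examples):
    """examples: list of (attribute_tuple, label) pairs, label is 'Yes' or 'No'."""
    hypothesis = None
    for attributes, label in examples:
        if label != "Yes":           # Find-S ignores negative examples
            continue
        if hypothesis is None:       # initialize with the first positive example
            hypothesis = list(attributes)
            continue
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:
                hypothesis[i] = "?"  # generalize the mismatching attribute
    return hypothesis

# Illustrative training set (Sky, AirTemp, Humidity, Wind, Water, Forecast)
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(data))   # -> ['Sunny', 'Warm', '?', 'Strong', '?', '?']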
VERSION SPACE
A version space is a hierarchical representation of knowledge that enables you to keep track
of all the useful information supplied by a sequence of learning examples without
remembering any of the examples.
The version space method is a concept learning process accomplished by managing multiple
models within a version space.
• A plausible description is one that is applicable to all known positive examples and
no known negative example.
A hypothesis is a function on the sample space, giving a value for each point in the sample
space. If the possible values are {0, 1} then we can identify a hypothesis with the subset of
those points that are given value 1. The error of a hypothesis is the probability of that
subset where the hypothesis disagrees with the true hypothesis. Learning from examples is
the process of making independent random observations and eliminating those hypotheses
that disagree with observations.
The hypothesis space is the set of all possible hypotheses (i.e., functions from inputs to the
outputs) that can be returned by a model. The hypothesis space is important because it
specifies what types of functions you can model and what types you cannot. The absolute
best error you can achieve on a dataset is lower bounded by the error of the “best” function
in your hypothesis space.
Machine learning analyses often make use of the PAC (Probably Approximately Correct) framework. The PAC analyses assume that the true answer/concept is in the given hypothesis space H. A consistent machine learning algorithm L with hypothesis space H is one that, given a training data set D, will always return a hypothesis h ∈ H consistent with D if one exists; otherwise it will indicate that no such hypothesis exists. A finite hypothesis space H has polynomial sample complexity, and a concept class with polynomial sample complexity is said to be PAC-learnable.
PAC LEARNING
In this framework, the learner receives samples and must select a generalization function
(called the hypothesis) from a certain class of possible functions. The goal is that, with high
probability (the "probably" part), the selected function will have low generalization error
(the "approximately correct" part). The learner must be able to learn the concept given any
arbitrary approximation ratio, probability of success, or distribution of the samples.
Probably approximately correct (PAC) learning theory helps analyze whether and under
what conditions a learner L will probably output an approximately correct classifier.
Approximate: A hypothesis h∈H is approximately correct if its error over the distribution of
inputs is bounded by some ϵ,0 ≤ ϵ ≤ (1/2). I.e., errorD(h)<ϵ, where D is the distribution over
inputs.
Probably: If L will output such a classifier with probability 1−δ, with 0 ≤ δ ≤ (1/2), we call
that classifier probably approximately correct.
Knowing that a target concept is PAC-learnable allows us to bound the sample size necessary to probably learn an approximately correct classifier, which is what is shown in the formula below.
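The formula itself was not reproduced above; for a finite hypothesis space H and a consistent learner, the standard bound on the number of training examples m is:

m ≥ (1/ϵ) (ln|H| + ln(1/δ))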
To gain some intuition about this, note the effects on m when you alter the variables on the right-hand side. As the allowable error ϵ decreases, the necessary sample size grows. Likewise, it grows with the required probability of an approximately correct learner (i.e., as δ decreases), and with the size of the hypothesis space H. (Loosely, a hypothesis space is the set of classifiers the algorithm considers.) More plainly, as we consider more possible classifiers, or desire a lower error or a higher probability of correctness, we need more data to distinguish between them.
VC DIMENSION
The capacity of a classification model is related to how complicated it can be. For example,
consider the threshold of a high-degree polynomial: if the polynomial evaluates above zero,
that point is classified as positive, otherwise as negative. A high-degree polynomial can be
wiggly, so it can fit a given set of training points well. But one can expect that the classifier
will make errors on other points, because it is too wiggly. Such a polynomial has a high
capacity. A much simpler alternative is to threshold a linear function.
Suppose we want a model (e.g., some classifier) that generalizes well on unseen data. And
we are limited to a specific amount of sample data.
The following figure shows some Models (S1 up to Sk) of differing complexity (VC
dimension), here shown on the x-axis and called h.
The diagram shows that a higher VC dimension allows for a lower empirical risk (the error a model makes on the sample data), but also introduces a wider confidence interval between the empirical risk and the true risk.
If we use a model of low complexity, we introduce assumptions (bias) regarding the dataset, e.g., when using a linear classifier, we assume the data can be described with a linear model. If this is not the case, our given problem cannot be solved by a linear model, for example because the problem is of a nonlinear nature. We will end up with a badly performing model which will not be able to learn the data's structure. We should therefore try to avoid introducing a strong bias.
On the other side of the x-axis, we see models of higher complexity which might be of such great capacity that they will rather memorize the data instead of learning its general underlying structure, i.e., the model overfits. After realizing this problem, it seems that we should avoid complex models.
This may seem contradictory: we should not introduce a strong bias (i.e., too low a VC dimension), but we should also not have too high a VC dimension. This problem has deep roots in statistical learning theory and is known as the bias-variance tradeoff. What we should do in this situation is be as complex as necessary and as simple as possible, so when comparing two models which end up with the same empirical error, we should use the less complex one.
Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester
Unit II
Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling and to gain knowledge
from it. Evaluate the performance of algorithms and to provide solution for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO2): Apply various supervised learning methods to appropriate
problems.
In Supervised learning, you train the machine using data which is well "labeled." It means
some data is already tagged with the correct answer. It can be compared to learning which
takes place in the presence of a supervisor or a teacher.
A supervised learning algorithm learns from labeled training data and helps you predict outcomes for unforeseen data. Successfully building, scaling, and deploying an accurate supervised machine learning data science model takes time and technical expertise from a team of highly skilled data scientists. Moreover, data scientists must rebuild models to make sure the insights given remain true as the data changes.
• Supervised learning allows you to collect data or produce a data output from previous experience.
Suppose we want to learn the class, C, of a “family car.” We have a set of examples of cars,
and we have a group of people that we survey to whom we show these cars. The people
look at the cars and label them; the cars that they believe are family cars are positive
examples, and the others are negative examples. Class learning is finding a description that is
shared by all the positive examples and none of the negative examples.
The features that separate a family car from other type of cars are the price and engine
power. These two attributes are the inputs to the class recognizer.
Input representation:
Training Set X
Class C has price as the first input attribute x1 and engine power as the second attribute x2
Figure 2.2 Example of a hypothesis class. The class of family car is a rectangle
Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates (xt1, xt2) and its type, namely positive versus negative, is given by rt (see figure 2.1). After further discussions with the expert and the analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should be in a certain range: (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2).
The aim is to find h ∈ H that is as similar as possible to C, so that the hypothesis h makes a correct prediction for an instance x. What we have is the training set X, which is a small subset of the set of all possible x. The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X.
LINEAR CLASSIFICATION
Linear classifiers classify data into labels based on a linear combination of input features.
Therefore, these classifiers separate data using a line or plane or a hyper-plane (a plane in
more than 2 dimensions). They can only be used to classify data that is linearly separable.
They can be modified to classify non-linearly separable data. Major algorithms in linear
classification are:
Perceptron: In Perceptron, we take a weighted linear combination of the input features and pass it through a thresholding function which outputs 1 or 0. The sign of wTx tells us which side
of the plane wTx=0, the point x lies on. Thus, by taking threshold as 0, perceptron classifies
data based on which side of the plane the new point lies on.
The task during training is to arrive at the plane (defined by w) that accurately classifies the training data. If the data is linearly separable, perceptron training always converges.
SVM: There can be multiple hyperplanes that separate linearly separable data. SVM calculates the optimal separating hyperplane using concepts of geometry. SVM can only be used to separate linearly separable data, but we can modify our data and project it into higher dimensions to make it linearly separable.
NONLINEAR CLASSIFICATION
Nonlinear functions can be used to separate instances that are not linearly separable. We’ve
many nonlinear classifiers:
K-nearest-neighbors (kNN): The KNN algorithm assumes that similar things exist in close
proximity. In other words, similar things are near to each other. The KNN algorithm hinges
on this assumption being true enough for the algorithm to be useful. KNN captures the idea
of similarity.
Kernel SVM: SVM algorithms use a set of mathematical functions that are defined as the
kernel. The function of the kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions.
Decision Tree: Decision tree is the most powerful and popular tool for classification and
prediction. A Decision tree is a flowchart like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (terminal node) holds a class label.
Multilayer Perceptron: Perceptron consists of an input layer and an output layer which are
fully connected. MLPs have the same input and output layers but may have multiple hidden
layers in between the aforementioned layers. The input layer is the first set of perceptrons
which output positive/negative based on the observed features in the data. A hidden layer is
a set of perceptrons that uses the outputs of the previous layer as inputs, instead of using
the original data. There can be multiple hidden layers, and the final layer is the output layer.
MULTICLASS CLASSIFICATION
Multiclass classification means a classification task with more than two classes, e.g., classify
a set of images of fruits which may be oranges, apples, or pears. Multiclass classification
makes the assumption that each sample is assigned to one and only one label: a fruit can be
either an apple or a pear but not both at the same time.
MULTILABEL CLASSIFICATION
Multilabel classification assigns to each sample a set of target labels. This can be thought of
as predicting properties of a data-point that are not mutually exclusive, such as topics that
are relevant for a document. A text might be about any of religion, politics, finance or
education at the same time or none of these.
DECISION TREES
A decision tree is a structure that contains nodes and edges and is built from a dataset
(table of columns representing features/attributes and rows corresponds to records). Each
node is either used to make a decision (known as decision node) or represent an outcome
(known as leaf node).
ID3
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each step. ID3 uses a
top-down greedy approach to build a decision tree. In simple words, the top-down approach
means that we start building the tree from the top and the greedy approach means that at
each iteration we select the best feature at the present moment to create a node.
ID3 uses Information Gain or just Gain to find the best feature. Information Gain calculates
the reduction in the entropy and measures how well a given feature separates or classifies
the target classes. The feature with the highest Information Gain is selected as the best one.
ID3 Steps
• Considering that all rows don’t belong to the same class, split the dataset S into
subsets using the feature for which the Information Gain is maximum.
• Make a decision tree node using the feature with the maximum Information gain.
• If all rows belong to the same class, make the current node as a leaf node with the
class as its label.
• Repeat for the remaining features until we run out of all features, or the decision
tree has all leaf nodes.
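A short sketch of the entropy and Information Gain computations that ID3 relies on. The function names and data layout (rows as tuples, labels as a separate list) are illustrative assumptions, not part of the notes.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Reduction in entropy obtained by splitting the dataset on one feature."""
    base = entropy(labels)
    remainder = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature_index] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return base - remainder

# ID3 would call information_gain for every remaining feature and split on
# the feature with the highest value.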
Classification and Regression Trees, or CART for short, is a term introduced to refer to Decision Tree algorithms that can be used for classification or regression predictive modeling problems.
The CART algorithm provides a foundation for important algorithms like bagged decision
trees, random forest and boosted decision trees.
The representation of the CART model is a binary tree, in which each internal node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).
The leaf nodes of the tree contain an output variable (y) which is used to make a prediction.
Given a dataset with two inputs (x), height in centimeters and weight in kilograms, and an output of gender as male or female, below is an example of a binary decision tree.
The tree can be stored to file as a graph or a set of rules. For example, below is the above
decision tree as a set of rules.
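The rule listing itself was not reproduced; a hypothetical rule set consistent with such a height/weight tree (the split values are illustrative) might read:

If Height > 180 cm Then Male
If Height <= 180 cm And Weight > 80 kg Then Male
If Height <= 180 cm And Weight <= 80 kg Then Female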
With the binary tree representation of the CART model, making predictions is relatively straightforward. The tree is traversed by evaluating the specific input, starting at the root node of the tree.
A learned binary tree is actually a partitioning of the input space. Each input variable can be viewed as a dimension of a p-dimensional space. The decision tree splits this space up into rectangles (when p = 2 input variables) or hyper-rectangles with more inputs. New data is filtered through the tree and lands in one of the rectangles, and the output value for that rectangle is the prediction made by the model. This gives some feeling for the type of decisions that a CART model is capable of making, e.g., boxy decision boundaries.
REGRESSION
Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).
Regression analysis is a predictive modeling technique that analyzes the relation between the target or dependent variable and the independent variables in a dataset. The different regression analysis techniques are used when the target and independent variables show a linear or non-linear relationship between each other and the target variable contains continuous values. Regression techniques are mainly used to determine predictor strength, forecast trends, model time series, and examine cause-and-effect relationships.
LINEAR REGRESSION
Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor variable and a dependent variable related linearly to each other. In case the data involves more than one independent variable, the model is called multiple linear regression.
y=mx+c+e
Where m is the slope of the line, c is an intercept, and e represents the error in the model.
The best fit line is determined by varying the values of m and c. The prediction error is the difference between the observed value and the predicted value, and the values of m and c are selected so as to minimize this error. It is important to note that a simple linear regression model is susceptible to outliers, so outliers should be handled before applying it to large datasets.
Multiple linear regression (MLR) is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable. The goal of multiple linear
regression (MLR) is to model the linear relationship between the explanatory (independent)
variables and response (dependent) variable.
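A brief illustrative sketch of fitting y = mx + c with scikit-learn's LinearRegression; the data points below are made up for demonstration.

# Fitting y = m*x + c by least squares with scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # single predictor
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])            # roughly y = 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # estimated slope m and intercept c
print(model.predict([[6.0]]))             # prediction for a new input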
LOGISTIC REGRESSION
Logistic regression is one of the types of regression analysis technique, which gets used
when the dependent variable is discrete. Example: 0 or 1, true or false, etc. This means the
target variable can have only two values, and a sigmoid curve denotes the relation between
the target variable and the independent variable.
Logit function is used in Logistic Regression to measure the relationship between the target
variable and independent variables. Below is the equation that denotes the logistic
regression.
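The equation was not reproduced above; in its standard logit form, with p the probability of the positive class and β0, β1, ..., βk the coefficients, it is:

logit(p) = ln( p / (1 − p) ) = β0 + β1x1 + β2x2 + ... + βkxk

equivalently, p = 1 / (1 + e^−(β0 + β1x1 + ... + βkxk)), which is the sigmoid curve mentioned above.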
When selecting logistic regression as the regression analysis technique, it should be noted that the size of the data should be large, with an almost equal occurrence of values in the target variable. Also, there should be no multicollinearity, which means that there should be no correlation between the independent variables in the dataset.
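A short illustrative sketch, assuming scikit-learn, of logistic regression on a made-up binary problem (hours studied versus pass/fail); the data and variable names are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # discrete 0/1 target

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.8], [3.2]]))         # predicted classes
print(clf.predict_proba([[1.8], [3.2]]))   # sigmoid probabilities for each class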
Neural Networks are processing devices (algorithms or actual hardware) that are loosely modeled after the neuronal structure of the human cerebral cortex, but on a much smaller scale. A large Neural Network might have hundreds or thousands of processing units, whereas a human brain has billions of neurons, with a corresponding increase in the magnitude of their overall interaction and emergent behavior. Neural networks are typically organized
in layers. Layers are made up of a number of interconnected 'nodes' which contain an
'activation function'. Patterns are presented to the network via the 'input layer', which
communicates to one or more 'hidden layers' where the actual processing is done via a
system of weighted 'connections'. The hidden layers then link to an 'output layer' where the
answer is output.
PERCEPTRON
The process of Perceptron begins by taking all the input values and multiplying them by
their weights. Then, all of these multiplied values are added together to create the weighted
sum. The weighted sum is then applied to the activation function, producing the
perceptron's output. The activation function plays the integral role of ensuring the output is
mapped between required values such as (0,1) or (-1,1). It is important to note that the
weight of an input is indicative of the strength of a node. Similarly, an input's bias value
gives the ability to shift the activation function curve up or down.
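A minimal sketch of a single perceptron step in Python with NumPy; the weights, bias, and input values below are illustrative, not taken from the notes.

# Weighted sum of the inputs plus bias, passed through a step activation
# that maps the output to {0, 1}.
import numpy as np

def perceptron_output(x, w, b):
    weighted_sum = np.dot(w, x) + b      # bias shifts the activation threshold
    return 1 if weighted_sum > 0 else 0  # step activation function

x = np.array([1.0, 0.0])    # input features
w = np.array([0.6, 0.6])    # weights (strength of each input)
b = -0.5                    # bias
print(perceptron_output(x, w, b))   # -> 1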
MULTILAYER PERCEPTRONS
The perceptron is very useful for classifying data sets that are linearly separable. It encounters serious limitations with data sets that do not conform to this pattern, as discovered with the XOR problem: there exist labelings of four points that are not linearly separable. The Multilayer Perceptron (MLP) breaks this restriction and can classify datasets which are not linearly separable. It does this by using a more robust and complex architecture to learn regression and classification models for difficult datasets.
The Perceptron consists of an input layer and an output layer which are fully connected.
MLPs have the same input and output layers but may have multiple hidden layers in
between the aforementioned layers
• Just as with the perceptron, the inputs are pushed forward through the MLP by
taking the dot product of the input with the weights that exist between the input
layer and the hidden layer (W---H). This dot product yields a value at the hidden
layer. We do not push this value forward as we would with a perceptron though.
• MLPs utilize activation functions at each of their calculated layers. There are many
activation functions to discuss: rectified linear units (ReLU), sigmoid function, tanh.
Push the calculated output at the current layer through any of these activation
functions.
• Once the calculated output at the hidden layer has been pushed through the
activation function, push it to the next layer in the MLP by taking the dot product
with the corresponding weights.
• Repeat steps two and three until the output layer is reached.
• At the output layer, the calculations will either be used for a backpropagation
algorithm that corresponds to the activation function that was selected for the MLP
(in the case of training) or a decision will be made based on the output (in the case
of testing).
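A small illustrative sketch, assuming scikit-learn's MLPClassifier, showing an MLP with one hidden layer learning the XOR problem mentioned earlier; the hidden-layer size, activation, solver, and seed are assumptions, not part of the notes.

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR labels, not linearly separable

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))   # expected: [0 1 1 0] (a different seed may be needed)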
MLPs form the basis for all neural networks and have greatly improved the power of
computers when applied to classification and regression problems. Computers are no
longer limited by XOR cases and can learn rich and complex models thanks to the multilayer
perceptron.
Support Vector Machine is a linear model for classification and regression problems. It can
solve linear and non-linear problems and work well for many practical problems. The idea of
SVM is simple: The algorithm creates a line or a hyperplane which separates the data into
classes.
SVM is an algorithm that takes the data as an input and outputs a line that separates those
classes if possible. Suppose we have a dataset as shown below and we need to classify the
red rectangles from the blue ellipses (positives from the negatives). So, our task is to find an
ideal line that separates this dataset in two classes (red and blue).
It’s visually quite intuitive in this case that the yellow line classifies better. The green line in
the image above is quite close to the red class. Though it classifies the current datasets it is
not a generalized line and in machine learning our goal is to get a more generalized
separator.
The SVM algorithm finds the points closest to the line from both the classes. These points
are called support vectors. Now, compute the distance between the line and the support
vectors. This distance is called the margin. The goal is to maximize the margin. The
hyperplane for which the margin is maximum, called the optimal hyperplane.
SVM tries to make a decision boundary in such a way that the separation between the two
classes (that street) is as wide as possible.
For a more complex dataset that is not linearly separable, we use a nonlinear SVM.
This data is clearly not linearly separable. We cannot draw a straight line that can classify
this data. But this data can be converted to linearly separable data in higher dimension.
Let’s add one more dimension and call it z-axis. Let the co-ordinates on z-axis be governed
by the constraint,
z = x²+y²
So, basically z co-ordinate is the square of distance of the point from origin. Let’s plot the
data on z-axis.
Now the data is clearly linearly separable. Let the purple line separating the data in higher
dimension be z=k, where k is a constant. Since, z=x²+y² we get x² + y² = k; which is an
equation of a circle. So, we can project this linear separator in higher dimension back in
original dimensions using this transformation.
Thus, we can classify data by adding an extra dimension to it so that it becomes linearly
separable and then projecting the decision boundary back to original dimensions using
mathematical transformation.
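An illustrative sketch, assuming scikit-learn, of exactly this trick: the z = x² + y² feature is added by hand and a linear SVM is then fitted on the lifted data. The dataset (make_circles) and parameter values are assumptions for demonstration.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)   # distance-from-origin feature
X3 = np.hstack([X, z])                             # lift the data to 3 dimensions

clf = LinearSVC(C=1.0).fit(X3, y)
print(clf.score(X3, y))   # close to 1.0: the lifted data is linearly separable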
KERNEL FUNCTIONS
A Kernel Function is a method used to take data as input and transform it into the required form for processing. The term “kernel” is used because the set of mathematical functions used in a Support Vector Machine provides a window to manipulate the data. So, a Kernel Function generally transforms the training set of data so that a non-linear decision surface can be transformed into a linear equation in a higher-dimensional space.
For implementing kernel functions, we first have to install the “scikit-learn” library from the command prompt: pip install scikit-learn
Gaussian Kernel Radial Basis Function (RBF): the same idea as the Gaussian kernel, with the radial basis method added to improve the transformation.
Polynomial Kernel: It represents the similarity of vectors in training set of data in a feature
space over polynomials of the original variables used in kernel.
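A short sketch, assuming scikit-learn, of letting SVC perform the lifting implicitly through its RBF and polynomial kernels; the dataset and hyperparameter values are illustrative assumptions.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)   # Gaussian RBF kernel
poly_svm = SVC(kernel="poly", degree=3).fit(X, y)      # polynomial kernel

print(rbf_svm.score(X, y), poly_svm.score(X, y))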
K-NEAREST NEIGHBORS
In the case of classification and regression, choosing the right K for the data is done by trying several values of K and picking the one that works best. To classify or predict a query example:
• For each example in the data, calculate the distance between the query example and the current example, and add the distance and the index of the example to an ordered collection.
• Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances; the first K entries are the K nearest neighbors.
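A compact, illustrative KNN classifier following these steps; the training points, labels, and function name are made up for demonstration.

import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # distance from the query to every training example, paired with its index
    distances = [(math.dist(query, x), i) for i, x in enumerate(train_X)]
    # sort by distance (ascending) and keep the first k entries
    nearest = sorted(distances)[:k]
    # majority vote over the labels of the k nearest neighbours
    labels = [train_y[i] for _, i in nearest]
    return Counter(labels).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (5, 5), (6, 5)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (2, 2), k=3))   # -> 'A'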
Advantages
• The algorithm is versatile. It can be used for classification, regression, and search
Disadvantages
• The algorithm gets significantly slower as the number of examples and/or features increases.
Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester
Unit III
Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling and to gain knowledge
from it. Evaluate the performance of algorithms and to provide solution for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO3): Identify and integrate more than one technique to enhance the
performance of learning.
ENSEMBLE LEARNING
Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve the (classification, prediction, function approximation, etc.) performance of a model.
Ensemble methods are techniques that create multiple models and then combine them
to produce improved results. Ensemble methods usually produce more accurate
solutions than a single model would. This has been the case in a number of machine
learning competitions, where the winning solutions used ensemble methods.
A model is the output of an algorithm trained on data. This model is then used for making predictions. The algorithm can be any machine learning algorithm, such as logistic regression, a decision tree, etc. These models, when used as inputs of ensemble methods, are called “base models”.
There are different ways the multiple base-learners are combined to generate the final
output:
• Multiexpert combination methods have base-learners that work in parallel. In the global approach, also called learner fusion, given an input, all base-learners generate an output, and all these outputs are used. Examples are voting and stacking.
• Multistage combination methods use a serial approach where the next base-
learner is trained with or tested on only the instances where the previous base-
learners are not accurate enough. The idea is that the base-learners (or the
different representations they use) are sorted in increasing complexity so that a
complex base-learner is not used (or its complex representation is not extracted)
unless the preceding simpler base-learners are not confident. An example is
cascading.
Figure 3.2: Base-learners are dj and their outputs are combined using f ()
Let us say that we have L base-learners. We denote by dj (x) the prediction of base-
learner Mj given the arbitrary dimensional input x. In the case of multiple
representations, each Mj uses a different input representation xj . The final prediction is
calculated from the predictions of the base-learners:
y = f (d1, d2,...,dL|Φ)
where f() is the combining function, with Φ denoting its parameters. When there are K outputs, for each learner there are dji(x), i = 1,...,K, j = 1,...,L, and, combining them, we also generate K values, yi, i = 1,...,K. Then, for example in classification, we choose the class with the maximum yi value: choose Ci if yi = max_k yk.
VOTING
The simplest way to combine multiple classifiers is by voting, which corresponds to
taking a linear combination of the learners (see figure 3.2):
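The combination formula was not reproduced above; in its standard weighted-sum form, with non-negative weights that sum to one, it is:

yi = Σ (j = 1 to L) wj · dji,   where wj ≥ 0 and Σj wj = 1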
This is also known as ensembles and linear opinion pools. In the simplest case, all
learners are given equal weight and we have simple voting that corresponds to taking
an average. Still, taking a (weighted) sum is only one of the possibilities and there are
also other combination rules, as shown in table 3.1. If the outputs are not posterior
probabilities, these rules require that outputs be normalized to the same scale.
An example of the use of these rules is shown in table 3.2, which demonstrates the
effects of different rules. Sum rule is the most intuitive and is the most widely used in
practice. Median rule is more robust to outliers; minimum and maximum rules are
pessimistic and optimistic, respectively.
Table 3.2: Example of combination rules on three learners and three classes
With the product rule, each learner has veto power; regardless of the other ones, if one
learner has an output of 0, the overall output goes to 0. After the combination rules, yi
do not necessarily sum up to 1.
In weighted sum, dj i is the vote of learner j for class Ci and wj is the weight of its vote.
Simple voting is a special case where all voters have equal weight, namely, wj = 1/L. In
classification, this is called plurality voting where the class having the maximum
number of votes is the winner. When there are two classes, this is majority voting where
the winning class gets more than half of the votes. If the voters can also supply the
additional information of how much they vote for each class (e.g., by the posterior
probability), then after normalization, these can be used as weights in a weighted voting
scheme. Equivalently, if dj i are the class posterior probabilities, P (Ci|x,Mj ), then we can
just sum them up (wj = 1/L) and choose the class with maximum yi.
In the case of regression, simple or weighted averaging or median can be used to fuse
the outputs of base-regressors. Median is more robust to noise than the average.
Another possible way to find wj is to assess the accuracies of the learners (regressor or
classifier) on a separate validation set and use that information to compute the weights,
so that we give more weights to more accurate learners. These weights can also be
learned from data.
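An illustrative sketch, assuming scikit-learn, of simple (hard) plurality voting and weighted soft voting over heterogeneous classifiers; the dataset, base models, and weights are made up for demonstration.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
base = [("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0))]

hard = VotingClassifier(estimators=base, voting="hard").fit(X, y)   # plurality vote
soft = VotingClassifier(estimators=base, voting="soft",             # average posteriors,
                        weights=[2, 1, 1]).fit(X, y)                # weighted
print(hard.score(X, y), soft.score(X, y))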
We cannot integrate over all models; we only choose a subset for which we believe P(Mj) is high, or we can have another Bayesian step and calculate P(Mj|X), the probability of a model given the sample, and sample highly probable models from this density.
Let us assume that the dj are iid with expected value E[dj] and variance Var(dj); then when we take a simple average with wj = 1/L, the expected value and variance of the output are
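(The formulas were not reproduced above; for iid dj they are:)

E[y] = E[(1/L) Σj dj] = E[dj]
Var(y) = Var((1/L) Σj dj) = (1/L²) · L · Var(dj) = (1/L) Var(dj)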
The expected value does not change, so the bias does not change. But the variance, and therefore the mean square error, decreases as the number of independent voters, L, increases. In the general case,
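(the general expression, keeping the covariance terms, which was not reproduced above, is:)

Var(y) = (1/L²) [ Σj Var(dj) + 2 Σj Σ(i<j) Cov(di, dj) ]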
which implies that if learners are positively correlated, variance (and error) increases.
We can thus view using different algorithms and input features as efforts to decrease, if
not eliminate, the positive correlation.
Further decrease in variance is possible if the voters are not independent but negatively
correlated. The error then decreases if the accompanying increase in bias is not higher
because these aims are contradictory; we cannot have a number of classifiers that are
all accurate and negatively correlated. In mixture of experts for example, where learners
are localized, the experts are negatively correlated but biased.
ERROR-CORRECTING OUTPUT CODES (ECOC)
Base-learners are binary classifiers having output −1/+1, and there is a code matrix W
of K × L whose K rows are the binary codes of classes in terms of the L base-learners dj .
For example, if the second row of W is [−1, +1, +1, −1], this means that for us to say an
instance belongs to C2, the instance should be on the negative side of d1 and d4, and on
the positive side of d2 and d3. Similarly, the columns of the code matrix define the task
of the base-learners. For example, if the third column is [−1, +1, +1]T , we understand
that the task of the third base-learner, d3, is to separate the instances of C1 from the
instances of C2 and C3 combined. This is how we form the training set of the base-
learners. For example in this case, all instances labeled with C2 and C3 form X+3 and
instances labeled with C1 form X−3 , and d3 is trained so that xt ∈ X+3 give output +1
and xt ∈ X−3 give output −1.
The code matrix thus allows us to define a polychotomy (K > 2 classification problem) in
terms of dichotomies (K = 2 classification problem), and it is a method that is applicable
using any learning algorithm to implement the base-learners—for example, linear or
multilayer perceptrons (with a single output), decision trees, or SVMs whose original
definition is for two-class problems.
The typical one discriminant per class setting corresponds to the diagonal code matrix
where L = K. For example, for K = 4, we have
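(The matrix was not reproduced above; the diagonal code matrix, with +1 on the diagonal so that each base-learner dj separates class Cj from all the others, is:)

W =
+1 −1 −1 −1
−1 +1 −1 −1
−1 −1 +1 −1
−1 −1 −1 +1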
The problem here is that if there is an error with one of the base learners, there may be
a misclassification because the class code words are so similar. So the approach in
error-correcting codes is to have L>K and increase the Hamming distance between the
code words. One possibility is pairwise separation of classes where there is a separate
base learner to separate Ci from Cj , for i<j (section 10.4). In this case, L = K(K − 1)/2
and with K = 4, the code matrix is
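(The matrix was not reproduced above; with columns d1,...,d6 corresponding to the pairs (C1,C2), (C1,C3), (C1,C4), (C2,C3), (C2,C4), (C3,C4), it is:)

W =
+1 +1 +1  0  0  0
−1  0  0 +1 +1  0
 0 −1  0 −1  0 +1
 0  0 −1  0 −1 −1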
where a 0 entry denotes “don’t care.” That is, d1 is trained to separate C1 from C2 and
does not use the training instances belonging to the other classes. Similarly, we say that
an instance belongs to C2 if d1 = −1 and d4 = d5 = +1, and we do not consider the values
of d2, d3, and d6. The problem here is that L is O(K²), and for large K pairwise separation may not be feasible.
If we can have L high, we can just randomly generate the code matrix with −1/ + 1 and
this will work fine, but if we want to keep L low, we need to optimize W. The approach is
to set L beforehand and then find W such that the distances between rows, and at the
same time the distances between columns, are as large as possible, in terms of Hamming
distance. With K classes, there are 2^(K−1) − 1 possible columns, namely, two-class problems. This is because K bits can be written in 2^K different ways, and complements
(e.g., “0101” and “1010,” from our point of view, define the same discriminant) dividing
the possible combinations by 2 and then subtracting 1 because a column of all 0s (or 1s)
is useless. For example, when K = 4, we have
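(The matrix was not reproduced above. One possible code matrix listing all 2^(K−1) − 1 = 7 columns for K = 4, with columns identified by which classes fall on the +1 side, namely {C1}, {C2}, {C3}, {C4}, {C1,C2}, {C1,C3}, {C1,C4}, is:)

W =
+1 −1 −1 −1 +1 +1 +1
−1 +1 −1 −1 +1 −1 −1
−1 −1 +1 −1 −1 +1 −1
−1 −1 −1 +1 −1 −1 +1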
When K is large, for a given value of L, we look for L columns out of the 2^(K−1) − 1. We
would like these columns of W to be as different as possible so that the tasks to be
learned by the base-learners are as different from each other as possible. At the same
time, we would like the rows of W to be as different as possible so that we can have
maximum error correction in case one or more base learners fail. ECOC can be written
as a voting scheme where the entries of W, wij, are considered as vote weights:
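(the weighted vote referred to, not reproduced above, is:)

yi = Σ (j = 1 to L) wij · dj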
and then we choose the class with the highest yi. Taking a weighted sum and then
choosing the maximum instead of checking for an exact match allows dj to no longer
need to be binary but to take a value between −1 and +1, carrying soft certainties
instead of hard decisions. Note that a value pj between 0 and 1, for example, a posterior
probability, can be converted to a value dj between −1 and +1 simply as dj = 2pj – 1
BAGGING
Bagging is a voting method whereby base learners are made different by training them
over slightly different training sets. Bagging, that often considers homogeneous weak
learners, learns them independently from each other in parallel and combines them
following deterministic averaging process.
In parallel methods we fit the different considered learners independently from each other, so it is possible to train them concurrently. The most famous such approach is “bagging” (standing for “bootstrap aggregating”), which aims at producing an ensemble model that is more robust than the individual models composing it.
Because the training dataset is only a (random) sample drawn from the true underlying distribution, the fitted model is also subject to variability: if another dataset had been observed, we would have obtained a different model.
The idea of bagging is then simple: we want to fit several independent models and
“average” their predictions in order to obtain a model with a lower variance. However,
we can’t, in practice, fit fully independent models because it would require too much
data. So, we rely on the good “approximate properties” of bootstrap samples
(representativity and independence) to fit models that are almost independent.
First, we create multiple bootstrap samples so that each new bootstrap sample acts as another (almost) independent dataset drawn from the true distribution. Then, we can fit a weak learner for each of these samples and finally aggregate them such that we kind of “average” their outputs and, so, obtain an ensemble model with less variance than its components. Roughly speaking, as the bootstrap samples are approximately independent and identically distributed (i.i.d.), so are the learned base models. Then, “averaging” the weak learners' outputs does not change the expected answer but reduces its variance (just like averaging i.i.d. random variables preserves the expected value but reduces variance).
In other words, we can fit L almost independent weak learners, one on each bootstrap sample, and then aggregate them through some kind of averaging process in order to get an ensemble model with a lower variance. For example, we can define our strong model such that
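(the formula was not reproduced above; denoting the fitted weak learners w_1(·), ..., w_L(·), the simple average used for a regression problem is:)

s_L(·) = (1/L) Σ (l = 1 to L) w_l(·)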
There are several possible ways to aggregate the multiple models fitted in parallel. For a
regression problem, the outputs of individual models can literally be averaged to obtain
the output of the ensemble model. For a classification problem, the class outputted by
each model can be seen as a vote and the class that receives the majority of the votes is
returned by the ensemble model (this is called hard voting). Still for a classification
problem, we can also consider the probabilities of each class returned by all the models,
average these probabilities and keep the class with the highest average probability (this
is called soft-voting). Averages or votes can either be simple or weighted if any relevant
weights can be used.
Finally, we can mention that one of the big advantages of bagging is that it can be parallelized. As the different models are fitted independently from each other, intensive parallelization techniques can be used if required.
Figure 3.3: Bagging consists in fitting several base models on different bootstrap
samples
Random forests extend bagging by fitting deep decision trees on bootstrap samples while also sampling, at each split, a random subset of the features. Sampling over features has the effect that all trees do not look at the exact same information to make their decisions and, so, it reduces the correlation between the different returned outputs. Another advantage of sampling over the features is that it makes the decision-making process more robust to missing data: observations (from the training dataset or not) with missing data can still be regressed or classified based on the trees that take into account only features where data are not missing. Thus, the random forest algorithm combines the concepts of bagging and random feature subspace selection to create more robust models.
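An illustrative sketch, assuming scikit-learn, of bagged decision trees and a random forest (which adds per-split feature sampling); the dataset and parameter values are assumptions for demonstration.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# BaggingClassifier uses decision trees as its default base estimator.
bagging = BaggingClassifier(n_estimators=100, random_state=0).fit(X, y)

# RandomForestClassifier additionally samples sqrt(n_features) at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(bagging.score(X, y), forest.score(X, y))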
BOOSTING
Boosting methods work in the same spirit as bagging methods: we build a family of
models that are aggregated to obtain a strong learner that performs better. However,
unlike bagging, which mainly aims at reducing variance, boosting is a technique that consists in fitting sequentially multiple weak learners in a very adaptive way: each
model in the sequence is fitted giving more importance to observations in the dataset
that were badly handled by the previous models in the sequence. Intuitively, each new
model focuses its efforts on the most difficult observations to fit up to now, so that we
obtain, at the end of the process, a strong learner with lower bias (even if we can notice
that boosting can also have the effect of reducing variance). Boosting, like bagging, can
be used for regression as well as for classification problems.
Being mainly focused at reducing bias, the base models that are often considered for
boosting are models with low variance but high bias. For example, if we want to use
trees as our base models, we will choose most of the time shallow decision trees with only a few levels of depth. Another important reason that motivates the use of low variance but
high bias models as weak learners for boosting is that these models are in general less
computationally expensive to fit (few degrees of freedom when parametrized). Indeed,
as computations to fit the different models can’t be done in parallel (unlike bagging), it
could become too expensive to fit sequentially several complex models.
Once the weak learners have been chosen, we still need to define how they will be
sequentially fitted (what information from previous models do we consider when fitting
current model?) and how they will be aggregated (how do we aggregate the current
model to the previous ones?). We will discuss these questions in the two following
subsections, describing more especially two important boosting algorithms: adaboost
and gradient boosting.
In a nutshell, these two meta-algorithms differ on how they create and aggregate the
weak learners during the sequential process. Adaptive boosting updates the weights
attached to each of the training dataset observations whereas gradient boosting updates
the value of these observations. This main difference comes from the way both methods
try to solve the optimization problem of finding the best model that can be written as a
weighted sum of weak learners.
ADABOOST
In adaptive boosting (often called “adaboost”), we try to define our ensemble model as
a weighted sum of L weak learners
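(the weighted sum was not reproduced above; with w_l the weak learners and c_l their coefficients, it is:)

s_L(·) = Σ (l = 1 to L) c_l × w_l(·)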
Finding the best ensemble model with this form is a difficult optimization problem.
Then, instead of trying to solve it in one single shot (finding all the coefficients and weak
learners that give the best overall additive model), we make use of an iterative
optimization process that is much more tractable, even if it can lead to a sub-optimal
solution. More especially, we add the weak learners one by one, looking at each iteration
for the best possible pair (coefficient, weak learner) to add to the current ensemble
model. In other words, we define recurrently the (s_l)’s such that
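(the recurrence referred to, not reproduced above, is:)

s_l(·) = s_(l−1)(·) + c_l × w_l(·)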
where c_l and w_l are chosen such that s_l is the model that best fits the training data and, so, is the best possible improvement over s_(l-1). We can then denote
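(the optimization problem referred to, not reproduced above, can be written as:)

(c_l, w_l(·)) = argmin over (c, w(·)) of E( s_(l−1)(·) + c × w(·) )
             = argmin over (c, w(·)) of Σ (n = 1 to N) e( y_n, s_(l−1)(x_n) + c × w(x_n) )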
where E(.) is the fitting error of the given model and e(.,.) is the loss/error function.
Thus, instead of optimizing “globally” over all the L models in the sum, we approximate
the optimum by optimizing “locally” building and adding the weak learners to the
strong model one by one.
More specifically, when considering a binary classification, we can show that the adaboost algorithm can be re-written into a process that proceeds as follows. First, it updates the observation weights in the dataset and trains a new weak learner with a special focus given to the observations misclassified by the current ensemble model. Second, it adds the weak learner to the weighted sum according to an update coefficient that expresses the performance of this weak model: the better a weak learner performs, the more it contributes to the strong learner.
So, assume that we are facing a binary classification problem, with N observations in
our dataset and we want to use adaboost algorithm with a given family of weak models.
At the very beginning of the algorithm (first model of the sequence), all the observations
have the same weights 1/N. Then, we repeat L times (for the L learners in the sequence)
the following steps:
• fit the best possible weak model with the current observations weights
• compute the value of the update coefficient that is some kind of scalar evaluation
metric of the weak learner that indicates how much this weak learner should be
taken into account into the ensemble model
• update the strong learner by adding the new weak learner multiplied by its
update coefficient
• compute new observation weights that express which observations we would like to focus on at the next iteration (weights of observations wrongly predicted by the aggregated model increase and weights of the correctly predicted observations decrease)
Repeating these steps, we have then built our L models sequentially and aggregated them into a simple linear combination weighted by coefficients expressing the performance of each learner. Notice that there exist variants of the initial adaboost algorithm, such as LogitBoost (classification) or L2Boost (regression), that mainly differ in their choice of loss function.
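An illustrative sketch, assuming scikit-learn, of AdaBoost with its default depth-1 decision-tree ("stump") weak learner, matching the low-variance / high-bias base models discussed above; the dataset and parameters are assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The default weak learner is a depth-1 decision tree (a "stump").
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5,
                         random_state=0).fit(X, y)
print(ada.score(X, y))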
STACKING
Stacking mainly differs from bagging and boosting on two points. First, stacking often considers heterogeneous weak learners (different learning algorithms are combined), whereas bagging and boosting consider mainly homogeneous weak learners. Second, stacking learns to combine the base models using a meta-model, whereas bagging and boosting combine weak learners following deterministic algorithms.
The idea of stacking is to learn several different weak learners and combine them by
training a meta-model to output predictions based on the multiple predictions returned
by these weak models. So, we need to define two things in order to build our stacking
model: the L learners we want to fit and the meta-model that combines them.
For example, for a classification problem, we can choose as weak learners a KNN classifier, a logistic regression and an SVM, and decide to learn a neural network as the meta-model. Then, the neural network will take as inputs the outputs of our three weak learners and will learn to return final predictions based on them.
So, assume that we want to fit a stacking ensemble composed of L weak learners. Then we have to follow the steps thereafter:
• split the training data into two folds
• choose L weak learners and fit them to the data of the first fold
• for each of the L weak learners, make predictions for the observations in the second fold
• fit the meta-model on the second fold, using the predictions made by the weak learners as inputs
Thus, an obvious drawback of this split of our dataset in two parts is
that we only have half of the data to train the base models and half of the data to train
the meta-model. In order to overcome this limitation, we can however follow some kind
of “k-fold cross-training” approach (similar to what is done in k-fold cross-validation)
such that all the observations can be used to train the meta-model: for any observation,
the predictions of the weak learners are made with instances of these weak learners trained on the k-1 folds that do not contain the considered observation. In other words, it consists in training on k-1 folds in order to make predictions on the remaining fold, and doing that iteratively so as to obtain predictions for observations in all folds. Doing so, we
can produce relevant predictions for each observation of our dataset and then train our
meta-model on all these predictions.
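An illustrative sketch, assuming scikit-learn's StackingClassifier, of the KNN + logistic regression + SVM example above with a neural-network meta-model and 5-fold cross-training; the dataset and all parameter values are assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
base = [("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True))]

stack = StackingClassifier(estimators=base,
                           final_estimator=MLPClassifier(max_iter=2000, random_state=0),
                           cv=5).fit(X, y)   # 5-fold "cross-training" of the meta-model
print(stack.score(X, y))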
Multi-levels Stacking
For each meta-model of the different levels of a multi-levels stacking ensemble model,
we have to choose a learning algorithm that can be almost whatever we want (even
algorithms already used at lower levels). We can also mention that adding levels can
either be data expensive (if a k-fold-like technique is not used and, then, more data are needed) or time expensive (if a k-fold-like technique is used and, then, a lot of models need to be fitted).
Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester
Unit IV
Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling and to gain knowledge
from it. Evaluate the performance of algorithms and to provide solution for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO4): Design probabilistic and unsupervised learning models for handling
unknown pattern.
INTRODUCTION TO CLUSTERING
Clustering is the most important technique of unsupervised learning. Clustering is an
unsupervised learning technique in which there are no predefined classes and no prior
information defining how the data should be grouped or labeled into separate classes.
A cluster is a collection of data objects which are similar to one another within the same
group (class or category) and are different from the objects in the other clusters. It is an
Exploratory Data Analysis (EDA) process which helps to discover hidden patterns of
interest or structure in data. Clustering can also work as a standalone tool to get insights
into the data distribution or as a preprocessing step in other algorithms.
High-quality clusters can be created by reducing the distance between the objects in the
same cluster, known as intra-cluster minimization, and increasing the distance from the
objects in the other clusters, known as inter-cluster maximization.
Intra-cluster minimization: The closer the objects in a cluster, the more likely they
belong to the same cluster.
Inter-cluster Maximization: This makes the separation between two clusters. The
main goal is to maximize the distance between 2 clusters.
HIERARCHICAL
Hierarchical clustering does not partition the dataset into clusters in a single step.
Instead, it involves multiple steps which run from a single cluster containing all the data
points to n clusters, each containing a single data point.
Divisive Method
This method is also known as the top-down clustering method. It assigns all the data points
to a single cluster and then partitions the cluster into the two least similar clusters. Then
the same method is applied recursively on both clusters until we get one cluster for
each data point.
Agglomerative method
AGNES
AGNES (Agglomerative Nesting) is a bottom-up approach. Initially, every data point is
assigned to its own cluster, so if there are n data points, n clusters are formed. In each
subsequent iteration, the most similar clusters are merged (based on density and
distances), and this continues until similar points are clustered together and are distinct
from the other clusters.
In step one, all the data points are assigned as individual clusters. In step two, depending
upon density and distances, the points are grouped into clusters. Lastly, in step three, all
the similar points are clustered together, distinct from the other clusters, thus forming
the final clusters.
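A minimal sketch of agglomerative (AGNES-style) clustering, assuming scikit-learn and NumPy are available; the three synthetic blobs are an assumption for illustration:

# Agglomerative clustering sketch: every point starts as its own cluster and
# the two closest clusters are merged repeatedly until 3 clusters remain.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),        # three loose blobs
               rng.normal(4, 0.5, (20, 2)),
               rng.normal((0, 4), 0.5, (20, 2))])

agnes = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agnes.fit_predict(X)
print("cluster sizes:", np.bincount(labels))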
DIANA
DIANA is also known as the Divisive Analysis clustering algorithm. It is the top-down
form of hierarchical clustering, where all data points are initially assigned to a single
cluster. The clusters are then split into the two least similar clusters, and this is done
recursively until cluster groups are formed which are distinct from each other.
In step 1, the blue outlined circle can be thought of as all the points assigned to a
single cluster. Moving forward, it is divided into two red-colored clusters based on the
distances/density of the points, so we have two red-colored clusters in step 2. Lastly, in
step 3, the two red clusters are each further divided into two black dotted clusters, again
based on density and distances, to give the final four clusters. Since the points in the
respective four clusters are very similar to each other and very different when compared
to the other cluster groups, they are not divided further. Thus, this is how the user obtains
DIANA clusters, or top-down hierarchical clusters.
PARTITIONING METHOD
When a database contains multiple objects, the partitioning method constructs a
user-specified number of partitions of the data, in which each partition represents a
cluster and a particular region. There are many algorithms that come under the
partitioning method; some of the popular ones are K-Means clustering, K-Modes
clustering, etc.
K-MEANS CLUSTERING
The K-means algorithm takes the input parameter K from the user and partitions a
dataset containing N objects into K clusters so that the similarity among the data objects
inside a group (intra-cluster) is high while the similarity with data objects outside the
cluster (inter-cluster) is low. The similarity of a cluster is determined with respect to the
mean value of the cluster.
It is a type of squared-error algorithm. At the start, K objects are randomly chosen from
the dataset, each of which represents a cluster mean. The remaining data objects are
assigned to the nearest cluster based on their distance from the cluster mean. The new
mean of each cluster is then calculated with the added data objects.
Method:
1. Arbitrarily choose K objects from the dataset as the initial cluster means.
2. Reassign each object to the cluster whose mean it is most similar to, based upon the mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated
values.
4. Repeat steps 2 and 3 until the cluster assignments no longer change.
The K-means clustering algorithm works efficiently only for numerical datasets. It cannot
give proper results for categorical data because of the improper spatial
representation, so K-means clustering fails to find patterns in categorical datasets.
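A minimal sketch of K-means on numerical data, assuming scikit-learn is available; the two synthetic blobs are an assumption for illustration:

# K-means sketch: partition numerical data into K clusters around their means;
# the assign/update steps are iterated internally until assignments stabilize.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.6, (30, 2)),
               rng.normal(5, 0.6, (30, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1)
labels = km.fit_predict(X)
print("cluster means:\n", km.cluster_centers_)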
K-MODES CLUSTERING
It is a widely used algorithm for grouping categorical data because it is easy to
implement and efficiently handles large amounts of data. It defines clusters based on the
number of matching categories between data points. The k-modes clustering algorithm
is an extension of the k-means clustering algorithm.
Let X = {x11, x12, …, xnm} be a dataset consisting of n objects with m attributes each. The
main objective of the k-modes clustering algorithm is to group the data objects of X into
K clusters by minimizing the cost function, i.e., the total matching dissimilarity between
every object and the mode of the cluster it is assigned to:
Cost = Σ (over clusters j = 1..K) Σ (over objects xi assigned to cluster j) d(xi, Cj)
The K-modes algorithm takes as inputs the data objects (X) and the number of clusters (K).
Step 1: Randomly select the K initial modes from the data objects, such that Cj, j = 1, 2, …, K.
Step 2: Find the matching dissimilarity between each of the K initial cluster modes and
each data object using the equation
d(xi, Cj) = Σ (over attributes a = 1..m) δ(xia, cja), where δ = 0 if the two categorical values are equal and 1 otherwise.
Step 4: Find the minimum dissimilarity value for each data object, i.e., find the cluster
mode nearest to each object.
Step 5: Assign the data objects to the nearest cluster mode.
Step 6: Update the modes by applying the frequency-based method on the newly formed
clusters.
Step 7: Recalculate the similarity between the data objects and the updated modes.
Step 8: Repeat step 4 and step 5 until there are no changes in the cluster membership of the data objects.
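A minimal NumPy sketch of the matching dissimilarity and the frequency-based mode update used in the steps above; the toy categorical table and K = 2 are assumptions for illustration:

# K-modes sketch: dissimilarity = number of mismatched attributes; modes are
# updated with the most frequent category of each attribute in a cluster.
import numpy as np

X = np.array([["red",   "small", "round"],
              ["red",   "large", "round"],
              ["green", "small", "oval"],
              ["green", "large", "oval"]])
modes = X[[0, 2]].copy()               # Step 1: pick K = 2 initial modes

for _ in range(5):
    # Steps 2-5: count mismatching attributes and assign to the nearest mode
    diss = np.array([[np.sum(x != m) for m in modes] for x in X])
    labels = diss.argmin(axis=1)
    # Step 6: frequency-based mode update per attribute
    for k in range(len(modes)):
        members = X[labels == k]
        for j in range(X.shape[1]):
            vals, counts = np.unique(members[:, j], return_counts=True)
            modes[k, j] = vals[counts.argmax()]

print("labels:", labels)
print("modes:\n", modes)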
SELF-ORGANIZING MAP
A self-organizing map (SOM) is a clustering technique that helps to uncover categories
in large datasets, such as to find customer profiles based on a list of past purchases. Self-
organizing maps are unsupervised neural networks, where nodes (reference vectors)
are arranged in a single, 2-dimensional grid, which can take the shape of either
rectangles or hexagons.
One example of a data type with more than two dimensions is color. Colors have three
dimensions, typically represented by RGB (red, green, blue) values. SOM can distinguish
between two color clusters.
SOM comprises neurons in the grid, which gradually adapt to the intrinsic shape of data.
The result allows visualizing data points and identifying clusters in a lower dimension.
The iterative process to learn the shape of data by SOM is given below:
Step 1: Select one data point, either randomly or systematically cycling through the
dataset in order
Step 2: Find the neuron that is closest to the chosen data point. This neuron is called the
Best Matching Unit (BMU).
Step 3: Move the BMU closer to that data point. The distance moved by the BMU is
determined by a learning rate, which decreases after each iteration.
Step 4: Move the BMU’s neighbors closer to that data point as well, with farther away
neighbors moving less. Neighbors are identified using a radius around the BMU, and the
value for this radius decreases after each iteration.
Step 5: Update the learning rate and BMU radius, before repeating Steps 1 to 4. Iterate
these steps until positions of neurons have been stabilized.
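A minimal NumPy sketch of this iterative process on three-dimensional colour data; the grid size, the decay schedules and the Gaussian neighbourhood function are assumptions for illustration:

# Self-organizing map sketch: a 5x5 grid of reference vectors is pulled toward
# RGB colour samples; learning rate and neighbourhood radius shrink each step.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))                   # RGB colours in [0, 1]
grid_w, grid_h = 5, 5
weights = rng.random((grid_w, grid_h, 3))     # one reference vector per node
coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h),
                              indexing="ij"), axis=-1)

n_iter, lr0, radius0 = 2000, 0.5, 3.0
for t in range(n_iter):
    lr = lr0 * (1 - t / n_iter)               # Step 5: decaying learning rate
    radius = radius0 * (1 - t / n_iter) + 1e-3
    x = data[rng.integers(len(data))]         # Step 1: pick a data point
    # Step 2: best matching unit = node whose vector is closest to x
    d = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(d.argmin(), d.shape)
    # Steps 3-4: move the BMU and its neighbours toward x (closer nodes move more)
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)

print("trained reference vectors shape:", weights.shape)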
EXPECTATION MAXIMIZATION
The Expectation-Maximization (EM) algorithm is a way to find maximum-likelihood
estimates for model parameters when your data is incomplete, has missing data points,
or has unobserved (hidden) latent variables.
1. An initial guess is made for the model’s parameters and a probability distribution
is created. This is sometimes called the “E-Step” for the “Expected” distribution.
2. Newly observed data are fed into the model.
3. The probability distribution from the E-step is tweaked to include the new data.
This is sometimes called the “M-step.”
4. Steps 2 through 4 are repeated until stability (i.e. a distribution that doesn’t
change from the E-step to the M-step) is reached.
The EM algorithm always improves a parameter’s estimate through the process given
above. However, it sometimes needs a few random starts to find the best model. The
EM algorithm can be very slow, even on the fastest computer. It works best when the
dataset has a small percentage of missing data and the dimensionality of the data isn’t
too big.
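A minimal NumPy sketch of EM for a two-component, one-dimensional Gaussian mixture, where the component label of each point is the hidden (latent) variable; the synthetic data and the number of iterations are assumptions for illustration:

# EM sketch: E-step computes responsibilities of each component for each point,
# M-step re-estimates the mixture weights, means and standard deviations.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: responsibility of each component for each data point
    resp = pi * gauss(x[:, None], mu, sigma)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibility-weighted data
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", pi, "means:", mu, "std devs:", sigma)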
PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis (PCA) is a dimensionality-reduction technique that is carried out in the following steps.
Step 1: Standardization
The first step is to standardize the range of the continuous initial variables so that each one
of them contributes equally to the analysis. This can be done by subtracting the mean
and dividing by the standard deviation for each value of each variable.
Step 2: Covariance Matrix Computation
The second step is to understand how the variables of the input data set vary from the
mean with respect to each other, i.e., whether they are correlated in such a way that they
contain redundant information. In order to identify these correlations, the covariance
matrix is computed.
Step 3: Eigenvectors and Eigenvalues of the Covariance Matrix
Eigenvectors and eigenvalues are the linear algebra concepts that are computed from
the covariance matrix in order to determine the principal components of the data.
Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables. These combinations are done in such a way that the
new variables are uncorrelated and most of the information within the initial variables
is squeezed or compressed into the first components. So, the idea is that 10-dimensional
data gives you 10 principal components, but PCA tries to put the maximum possible
information in the first component, then the maximum remaining information in the
second, and so on.
Step 4: Feature Vector
In this step, we choose whether to keep all the components or discard those of lesser
significance (those with low eigenvalues), and form a feature vector with the remaining
ones. The feature vector is simply a matrix that has as its columns the eigenvectors of
the components that are kept.
Step 5: Recast the Data Along the Principal Component Axes
In this step, the aim is to use the feature vector formed from the eigenvectors of the
covariance matrix to reorient the data from the original axes to the ones represented by
the principal components. This can be done by multiplying the transpose of the original
data set by the transpose of the feature vector.
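A minimal NumPy sketch of the five steps above; the synthetic correlated data and the choice to keep two components are assumptions for illustration:

# PCA sketch: standardize, compute the covariance matrix, eigen-decompose it,
# keep the top components as the feature vector, and recast the data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]        # inject correlation

Z = (X - X.mean(axis=0)) / X.std(axis=0)       # Step 1: standardization
C = np.cov(Z, rowvar=False)                    # Step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)           # Step 3: eigenvalues/eigenvectors
order = eigvals.argsort()[::-1]                # sort by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :2]                             # Step 4: feature vector (top 2)
scores = Z @ W                                 # Step 5: recast the data
print("explained variance ratio of kept components:", eigvals[:2] / eigvals.sum())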
For example, in the house price prediction problem, the dataset may have a feature like
the age of the seller, which may not affect the house price. Dimensionality reduction
helps to keep the more important features in the feature set, reducing the number of
features required to predict the output.
LOCALLY LINEAR EMBEDDING (LLE)
LLE first finds the k-nearest neighbors of the points. Then, it approximates each data
vector as a weighted linear combination of its k-nearest neighbors. Finally, it computes
the weights that best reconstruct the vectors from their neighbors and then produces the
low-dimensional vectors that are best reconstructed by these weights.
The cost function given below is minimized to find the reconstruction weights, where Xj
is the j’th nearest neighbor of point Xi:
E(W) = Σi | Xi − Σj Wij Xj |²
Now, for defining the new vector space Y such that it minimizes the cost for Y as the new
points, the cost function given below is minimized:
Φ(Y) = Σi | Yi − Σj Wij Yj |²
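A minimal sketch of LLE, assuming scikit-learn is available; the swiss-roll data, the number of neighbours and the target dimensionality are assumptions for illustration:

# LLE sketch: reconstruct each point from its k nearest neighbours, then find a
# low-dimensional embedding that preserves those reconstruction weights.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)      # 3-D manifold data
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)                                     # 2-D embedding
print(Y.shape, "reconstruction error:", lle.reconstruction_error_)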
FACTOR ANALYSIS
Factor analysis is a way to take a mass of data and shrink it into a smaller data set that
is more manageable and more understandable. It’s a way to find hidden patterns, show
how those patterns overlap and show what characteristics are seen in multiple patterns.
It is also used to create a set of variables for similar items in the set. It can be a very
useful tool for complex sets of data involving psychological studies, socioeconomic
status, and other involved concepts. A “factor” is a set of observed variables that have
similar response patterns; they are associated with a hidden variable (called a
confounding variable) that cannot be directly measured. Factors are listed according to
factor loadings, or how much variation in the data they can explain.
There are two types of factor analysis:
1. Exploratory factor analysis is used when the user doesn’t have any idea about
what the structure of the data is or how many dimensions are in a set of variables.
2. Confirmatory factor analysis is used for verification, when the user already has a
specific idea about the structure of the data or how many dimensions are in the set of
variables.
The key concept of factor analysis is that multiple observed variables have similar
patterns of responses because they are all associated with a latent (i.e., not directly
measured) variable. In every factor analysis, there are the same number of factors as
there are variables. Each factor captures a certain amount of the overall variance in the
observed variables, and the factors are always listed in order of how much variation
they explain.
The Eigen value is a measure of how much of the variance of the observed variables a
factor explains. Any factor with an Eigen value ≥1 explains more variance than a single
observed variable. So, if the factor for socioeconomic status had an Eigen value of 2.3 it
would explain as much variance as 2.3 of the three variables. This factor, which
captures most of the variance in those three variables, could then be used in other
analyses.
Factor Loading
The relationship of each variable to the underlying factor is expressed by the factor
loading. Here is an example of the output of a simple factor analysis looking at
indicators of wealth, with just six variables and two resulting factors.
The variable with the strongest association to the underlying latent variable Factor 1 is
income, with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, the
variable income has a correlation of 0.65 with Factor 1. This would be considered a
strong association for a factor analysis.
Two other variables, education and occupation, are also associated with Factor 1. Based
on the variables loading highly onto Factor 1, we can call it “Individual socioeconomic status.”
House value, number of public parks, and number of violent crimes per year, however,
have high factor loadings on the other factor, Factor 2. They seem to indicate the overall
wealth within the neighborhood, so Factor 2 can be called “Neighborhood socioeconomic status.”
The variable house value also is marginally important in Factor 1 (loading = 0.38) since
the value of a person’s house should be associated with his or her income.
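A minimal sketch of factor analysis, assuming scikit-learn is available; the simulated data, in which two hidden factors drive six observed variables, is an assumption for illustration:

# Factor analysis sketch: extract two latent factors from six observed
# indicators; the fitted components play the role of factor loadings.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))                 # two hidden factors
loadings = rng.normal(size=(2, 6))                 # how factors drive variables
X = latent @ loadings + 0.3 * rng.normal(size=(300, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)
print("estimated loadings (factors x variables):\n", fa.components_.round(2))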
Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester
Unit V
Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling and to gain knowledge
from it. Evaluate the performance of algorithms and to provide solution for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO5): Analyze the co-occurrence of data to find interesting frequent
patterns and preprocess the data before applying to any real-world problem for evaluation.
PROBABILISTIC LEARNING
Probabilistic classification learning is one form of implicit learning in which cues are
probabilistically associated with outcomes and participants process associations
without explicit awareness. Probabilistic classifier is a classifier that is able to predict,
given an observation of an input, a probability distribution over a set of classes, rather
than only outputting the most likely class that the observation should belong to.
Probabilistic classifiers provide classification that can be useful in its own right or when
combining classifiers into ensembles.
A probabilistic method or model is based on the theory of probability, or on the fact that
randomness plays a role in predicting future events. The opposite is deterministic, which
tells us that something can be predicted exactly, without the added complication of
randomness.
BAYESIAN LEARNING
In Bayesian machine learning, we follow these three steps:
1. Define a model using a “generative process” for the data, i.e., a sequence of
steps describing how the data was created.
a. The generative process identifies the unknown model parameters.
b. Incorporate prior beliefs about these parameters, which take the form of
distributions over the values that the parameters might take.
2. Observe the data and run a learning (inference) algorithm that updates the beliefs
about the parameters given that data.
3. After running the learning algorithm, we are left with an updated belief about the
parameters — i.e., a new distribution over the parameters.
Bayesian machine learning is particularly useful when:
• The user has prior beliefs about unknown model parameters or explicit information
about data generation — i.e., useful information the user wants to incorporate.
• The user has few data points or many unknown model parameters, and it is hard to get
an accurate result with the data alone (without the added structure or information).
• The user wants to capture the uncertainty about the result — how sure or unsure the
model is instead of only a single “best” result.
Suppose a user grabs a carton of milk from the fridge, sees that it is seven days past the
expiration date, and wants to know if the milk is still good or if it has gone bad. A quick
internet search leads him to believe that there is roughly a 50–50 chance that the milk is
still good. This is a prior belief (Figure 1).
From past experience, the user has some knowledge about how smelly milk gets when it
has gone bad. Specifically, let’s suppose he rates smelliness on a scale of 0–10 (0 being no
smell and 10 being completely rancid) and has probability distributions over the
smelliness of good milk and of bad milk (Figure 2).
Here’s how Bayesian learning works: when he gets some data, i.e., when he smells the
milk (Figure 3), he can apply the machinery of Bayesian inference (Figure 4) to
compute an updated belief about whether the milk is still good or has gone bad (Figure 5).
For example, if the user observes that the milk is about a 5 out of 10 on the smelly scale,
he can then use Bayesian learning to factor in his prior beliefs and the distributions over
the smelliness of good vs. bad milk to return an updated belief — that there is now a 33%
chance that the milk is still good and a 67% chance that the milk has gone bad.
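A minimal numeric sketch of this update; the two likelihood values are assumptions chosen so that the 50–50 prior leads to roughly the 33%/67% posterior quoted above:

# Bayes-rule sketch for the milk example, with assumed likelihoods at smell = 5.
prior_good, prior_bad = 0.5, 0.5
p_smell5_given_good = 0.10      # assumed value read off the "good milk" curve
p_smell5_given_bad = 0.20       # assumed value read off the "bad milk" curve

evidence = p_smell5_given_good * prior_good + p_smell5_given_bad * prior_bad
posterior_good = p_smell5_given_good * prior_good / evidence
print("P(good | smell=5) =", round(posterior_good, 2))      # about 0.33
print("P(bad  | smell=5) =", round(1 - posterior_good, 2))  # about 0.67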
Consider a program that computes an updated belief about whether the milk has gone bad
whenever the user smells the milk. The program will do the following:
1. Encode the prior beliefs about whether the milk is still good or has gone bad and the
probability distributions over the smelliness of good vs. bad milk.
2. Smell the milk and give this observation as an input to the program.
3. Run the inference algorithm and output the updated belief about whether the milk is
still good or has gone bad.
For a Bayesian model, the user needs to mathematically derive an inference algorithm, i.e.,
the learning algorithm that computes the final distribution over beliefs given the data.
The equation below demonstrates how to calculate the conditional probability for a new
instance (vj) given the training data (D) and a space of hypotheses (H):
P(vj | D) = Σ (over hi in H) P(vj | hi) P(hi | D)
where vj is a new instance to be classified, H is the set of hypotheses for classifying the
instance, hi is a given hypothesis, P(vj | hi) is the posterior probability for vj given
hypothesis hi, and P(hi | D) is the posterior probability of the hypothesis hi given the
data D.
Selecting the outcome with the maximum probability is an example of a Bayes optimal
classification.
Any model that classifies examples using this equation is a Bayes optimal classifier and
no other model can outperform this technique, on average.
Any system that classifies new instances according to [the equation] is called a Bayes
optimal classifier, or Bayes optimal learner. No other classification method using the
same hypothesis space and same prior knowledge can outperform this method on
average. Although the classifier makes optimal predictions, it is not perfect given the
uncertainty in the training data and incomplete coverage of the problem domain and
hypothesis space. As such, the model will make errors. These errors are often referred
to as Bayes errors. Because the Bayes classifier is optimal, the Bayes error is the
minimum possible error.
NAIVE BAYES
Naive Bayes is a classification technique based on Bayes’ theorem with an assumption of
independence among predictors; it assumes that the presence of a particular feature in a
class is unrelated to the presence of any other feature. For example, some fruit may be an
apple if it is red, round, and about 3 inches in diameter. Even if these features depend on
each other or upon the existence of the other features, all these properties independently
contribute to the probability that this fruit is an apple, and that is why it is known as ‘Naive’.
Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c). The equation is given below:
P(c|x) = P(x|c) P(c) / P(x)
where,
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x,
attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
A training data set of weather conditions and the corresponding target variable ‘Play’
(indicating the possibility of playing) is given below. The user needs to classify whether
players will play or not based on the weather condition. The steps given below are used
to perform the classification:
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities, e.g., the Overcast probability is
0.29 and the probability of playing is 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher probability, so the
prediction is that the players will play.
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.
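A minimal sketch of the same calculation with scikit-learn’s CategoricalNB; the 14-day table is an assumption reconstructed to be consistent with the counts quoted above (5 Sunny days, 3 of which are ‘Yes’, and 9 ‘Yes’ days overall), and smoothing is turned almost off so the result matches the hand computation:

# Naive Bayes sketch for the weather/"Play" example on categorical data.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Encoding: 0 = Sunny, 1 = Overcast, 2 = Rainy ; target 1 = Play, 0 = Don't play
weather = np.array([[0]] * 5 + [[1]] * 4 + [[2]] * 5)
play    = np.array([1, 1, 1, 0, 0] + [1] * 4 + [1, 1, 0, 0, 0])

nb = CategoricalNB(alpha=1e-9)       # near-zero smoothing to match hand counts
nb.fit(weather, play)
print("P(No | Sunny), P(Yes | Sunny) =", nb.predict_proba([[0]]).round(2))
# expected: roughly [0.40, 0.60], matching P(Yes | Sunny) = 0.60 above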
Some advantages and applications of the Naive Bayes algorithm are:
• Real-time prediction: Naive Bayes is an eager learning classifier and it is very
fast. Thus, it can be used for making predictions in real time.
• Multi-class prediction: This algorithm is also well known for its multi-class
prediction feature. Here we can predict the probabilities of multiple classes of the
target variable.
BAYESIAN BELIEF NETWORKS
The probabilities in a belief network are calculated by the following formula, which
factorizes the joint distribution over all the node variables:
P(X1, X2, …, Xn) = Π (over i = 1..n) P(Xi | Parents(Xi))
To be able to calculate the joint distribution, one needs to have the conditional probabilities
indicated by the network.
A Bayesian network can be used for building models from data and experts’ opinions, and
it consists of two parts:
• a directed acyclic graph
• a table of conditional probabilities
The generalized form of a Bayesian network that represents and solves decision problems
under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
Each node corresponds to a random variable, and a variable can be continuous or
discrete.
The arcs (directed links) represent that one node directly influences the other node; if
there is no directed link between two nodes, it means that they are independent of each
other.
• If node B is connected to node A by a directed arrow, then node A is called the parent
of node B.
A Bayesian network has mainly two components:
• Causal component
• Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi |
Parents(Xi)), which determines the effect of the parents on that node.
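A minimal Python sketch of the factorization P(X1, …, Xn) = Π P(Xi | Parents(Xi)) for a tiny two-node network; the network and its conditional probability values are hypothetical:

# Joint distribution of a two-node network  Rain -> WetGrass  from its CPTs.
p_rain = {True: 0.2, False: 0.8}                        # P(Rain)
p_wet_given_rain = {True: {True: 0.9, False: 0.1},      # P(WetGrass | Rain)
                    False: {True: 0.2, False: 0.8}}

def joint(rain, wet):
    # each node contributes P(node | parents(node))
    return p_rain[rain] * p_wet_given_rain[rain][wet]

total = sum(joint(r, w) for r in (True, False) for w in (True, False))
print("P(Rain=T, Wet=T) =", joint(True, True))   # 0.2 * 0.9 = 0.18
print("joint distribution sums to", total)       # sanity check: 1.0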
ASSOCIATION RULE MINING
The strength of the association rule between two items (for instance, item1 and item2),
also called the association confidence, is the number of transactions containing both
item1 and item2 divided by the number of transactions containing item1.
The confidence metric estimates the likelihood that a transaction containing item1 will
also include item2.
The lift is the ratio of the observed frequency of the two items appearing together to the
frequency expected if the items were independent. A lift greater than 1 means that item1
and item2 are more likely to be present together in transactions, while values below 1
apply to the cases when the two items are rarely associated.
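A minimal Python sketch of the support, confidence and lift computations; the toy list of transactions is an assumption for illustration:

# Association-rule metrics for the rule  bread -> milk  on toy transactions.
transactions = [{"bread", "milk"},
                {"bread", "butter"},
                {"bread", "milk", "butter"},
                {"milk"},
                {"bread", "milk"}]
n = len(transactions)

def support(items):
    return sum(items <= t for t in transactions) / n   # fraction containing items

sup_both   = support({"bread", "milk"})
confidence = sup_both / support({"bread"})             # P(milk | bread)
lift       = confidence / support({"milk"})            # > 1 => positive association
print(f"support={sup_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")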
Apriori algorithm
The Apriori algorithm uses data organized in a horizontal layout. It is founded on the fact
that if a subset S appears k times in a database, any other subset S1 which contains S will
appear k times or less. This implies that, when deciding on a minimum support
threshold (the minimum frequency an itemset needs to have in order not to be
discarded), we can avoid counting S1, or any other superset of S, if support(S) <
minimum support. It can be said that all such candidates are discarded a priori.
The algorithm computes the counts for all itemsets of k elements (starting with k = 1).
During the next iterations the previous sets are joined, thereby creating all possible
(k + 1)-itemsets. Only the combinations appearing at a frequency below the minimum
support rate are discarded. The iterations end when no further extensions (joins) can
be found.
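A minimal pure-Python sketch of the Apriori join-and-prune loop described above; the toy transactions and the minimum support of 0.4 are assumptions for illustration:

# Apriori sketch: count 1-itemsets, keep those above the minimum support, then
# join the survivors into (k+1)-itemsets, pruning infrequent candidates a priori.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}, {"bread", "milk"}]
min_support = 0.4
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

items = sorted({i for t in transactions for i in t})
frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
k = 1
while frequent:
    print(f"frequent {k}-itemsets:", [set(s) for s in frequent])
    # join step: build candidate (k+1)-itemsets from the frequent k-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    # prune step: discard candidates below the minimum support
    frequent = {c for c in candidates if support(c) >= min_support}
    k += 1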
Eclat algorithm
The Eclat (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm uses
data organized in a vertical layout, which associates each item with the list of
transactions containing it. In an iterative, depth-first-search manner, the algorithm
proceeds by calculating, for all combinations of k items, the list of common transactions
(starting from k = 1; for k = 2 it calculates the common transactions of all pairs of 2 items,
and so on). In a nutshell, during step k all combinations of k items are calculated by
intersecting the lists of transactions associated with the (k-1)-itemsets. k is incremented
by 1 each time, until no frequent itemsets or candidate itemsets can be found.
The Eclat algorithm is generally faster than Apriori and requires only one database scan,
which finds the support for all itemsets with one element; all iterations with k > 1 rely
only on previously stored data.
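A minimal Python sketch of the vertical layout and the transaction-list (tidset) intersection used by Eclat; the toy transactions are an assumption for illustration:

# Eclat sketch: item -> set of transaction ids; support of a 2-itemset is the
# size of the intersection of the two tidsets.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}, {"bread", "milk"}]

tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

common = tidsets["bread"] & tidsets["milk"]
print("bread:", sorted(tidsets["bread"]))
print("milk :", sorted(tidsets["milk"]))
print("{bread, milk} appears in transactions", sorted(common),
      "-> support =", len(common) / len(transactions))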
FP tree algorithm
The first database scan counts all items and sorts them in the order of their global
occurrence in the database (the equivalent of applying a counter to the unraveled data of
all transactions). The second pass iterates line by line through the list of transactions; for
each transaction it sorts the elements by the global order (obtained in the first database
pass) and introduces them as nodes of a tree grown in depth. These nodes are introduced
with a count value of 1. Continuing the iterations, for each line new nodes are added to
the tree at the point where the ordered items differ from the existing tree. If the same
pattern already exists, all common nodes increase their count value by one.
The FP tree can be pruned by removing all nodes having a count value below a minimum
occurrence threshold. The remaining tree can be traversed; for instance, all paths from
the root node to the leaves correspond to groups of frequently co-occurring items.
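A minimal Python sketch of the two-pass FP-tree construction described above; the toy transactions are an assumption for illustration, and pruning and traversal are omitted for brevity:

# FP-tree sketch: items sorted by global frequency, each transaction inserted
# as a path whose node counts accumulate on shared prefixes.
from collections import Counter

transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk"], ["bread", "milk"]]

counts = Counter(item for t in transactions for item in t)   # first pass

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

root = Node(None)
for t in transactions:                                        # second pass
    ordered = sorted(t, key=lambda i: (-counts[i], i))        # global order
    node = root
    for item in ordered:
        node = node.children.setdefault(item, Node(item))
        node.count += 1        # shared prefixes just increase their counts

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}({child.count})")
        show(child, depth + 1)

show(root)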