R16 – B.Tech – CSE – IV/II – Machine Learning – Unit IV
Introduction: Geometric Models
Linear models: The least-squares method, The perceptron: a heuristic learning algorithm for linear
classifiers, Support vector machines, obtaining probabilities from linear classifiers, going beyond
linearity with kernel methods. Distance Based Models: Introduction, Neighbors and exemplars,
Nearest Neighbors classification, Distance Based Clustering, Hierarchical Clustering.
Instance space: An instance space is the space of all possible instances for some learning task. In
other words, it is defined by a set of unique instances.
Note: The goal of model-based machine learning is that a single modeling framework should support a wide range of models.
Linear Models
Geometric Models
Distance based Models
Linear Models:
Geometric concepts like lines or planes are used to segment (classify) the instance space.
Linear functions take particular forms, depending on the domain and codomain of the function f.
If x and f(x) are scalars, it follows that f is of the form
f(x) = a + bx for some constants a and b; a is called the intercept and b the slope.
If x = (x1, . . ., xd) is a vector and f(x) is a scalar, then f is of the form
f(x) = a + b1x1 + . . . + bdxd = a + b · x
with b = (b1, . . ., bd). The equation f(x) = 0 defines a plane in R^d perpendicular to the normal vector b.
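As a small illustration of this form (the intercept a, weight vector b and point x below are made-up values, not from these notes), f(x) = a + b · x is just a dot product plus a constant:

```python
import numpy as np

# Illustrative parameters (assumed values, not from the notes)
a = 0.5                      # intercept
b = np.array([2.0, -1.0])    # weight vector b = (b1, b2)

def f(x):
    """Linear function f(x) = a + b . x"""
    return a + np.dot(b, x)

x = np.array([1.0, 3.0])
print(f(x))                  # 0.5 + 2*1 - 1*3 = -0.5
# Points with f(x) = 0 lie on a plane (here, a line in R^2) perpendicular to b.
```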
Linear models are parametric, meaning that they have a fixed form with a small number of
numeric parameters that need to be learned from data.
o For example, in f(x) = a + bx, a and b are the parameters that we are trying to learn
from data. This is different from tree or rule models, where the structure of the model
(e.g., which features to use in the tree, and where) is not fixed in advance.
Linear models are stable, which is to say that small variations in the training data have only
limited impact on the learned model. In contrast, tree models tend to vary more with the
training data, as the choice of a different split at the root of the tree typically means that the
rest of the tree is different as well.
Linear models are less likely to overfit the training data than some other models, largely
because they have relatively few parameters. The flipside of this is that they sometimes lead to
underfitting: e.g., if you are learning where the border runs between two countries from
labelled samples, a linear model is unlikely to give a good approximation.
Linear models exist for all predictive tasks, including classification, probability estimation and
regression.
A few real-life examples:
Hours spent studying vs. marks scored by students
Amount of rainfall vs. agricultural yield
Electricity usage vs. electricity bill
Suicide rates vs. number of people under stress
Years of experience vs. salary, and demand vs. product price
The least-squares method fits such a line to observed data: for a worked example relating hours of sunshine (x) to ice creams sold (y), the fitted line is y = 1.518x + 0.305. For instance, for x = 9 hours of sunshine the line predicts 1.518 × 9 + 0.305 = 13.97 ice creams sold, against an observed value of 15, an error of −1.03.
Mr. Sairam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above equation to estimate that he will sell
y = 1.518 × 8 + 0.305 ≈ 12.45 ice creams.
Problem:
Study the relationship between the monthly e-commerce sales and the online advertising
costs. You have the survey results for 7 online stores for the last year.
Online Store    Monthly E-commerce Sales (in $1000s)    Online Advertising Dollars (in $1000s)
1               368                                     1.7
2               340                                     1.5
3               665                                     2.8
4               954                                     5.0
5               331                                     1.3
6               556                                     2.2
7               376                                     1.3
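One way to tackle this problem is to compute the least-squares slope and intercept directly from the table (a minimal sketch using numpy; the query value of $4000 in advertising at the end is an illustrative assumption, not part of the problem statement):

```python
import numpy as np

# Online advertising dollars (x) and monthly e-commerce sales (y), both in $1000s
x = np.array([1.7, 1.5, 2.8, 5.0, 1.3, 2.2, 1.3])
y = np.array([368, 340, 665, 954, 331, 556, 376])

# Least-squares formulas:
#   b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"y = {a:.1f} + {b:.1f} x")   # approximately y = 125.8 + 171.5 x

# Predicted sales when $4000 (x = 4) is spent on advertising:
print(a + b * 4)                     # roughly 812, i.e. about $812,000
```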
Perceptron
A perceptron is a neural network unit (an artificial neuron) that performs certain computations to detect features or patterns in the input data.
The Perceptron receives multiple input signals, and if the sum of the input signals exceeds a
certain threshold, it either outputs a signal or does not return an output. In the context of
supervised learning and classification, this can then be used to predict the class of a sample.
Perceptron Function
The perceptron is a function that maps its input "x", multiplied by the learned weight coefficients, to an output value "f(x)".
A Boolean output is based on inputs such as salaried, married, age, past credit profile, etc. It
has only two values: Yes and No or True and False. The summation function “∑” multiplies all
inputs of “x” by weights “w” and then adds them up as follows:
For example:
If ∑ wixi > 0, then the final output "o" = 1 (issue bank loan); else, the final output "o" = −1 (deny bank loan)
Step function gets triggered above a certain value of the neuron output; else it outputs zero.
Sign Function outputs +1 or -1 depending on whether neuron output is greater than zero or
not. Sigmoid is the S-curve and outputs a value between 0 and 1.
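As a sketch of these pieces (the inputs and weights below are made-up loan-style values), the weighted sum ∑ wixi is computed first and then passed through a step, sign or sigmoid activation:

```python
import numpy as np

def weighted_sum(w, x):
    """Summation unit: sum of wi * xi."""
    return np.dot(w, x)

def step(z, threshold=0.0):
    """Step activation: 1 above the threshold, else 0."""
    return 1 if z > threshold else 0

def sign(z):
    """Sign activation: +1 if z > 0, else -1."""
    return 1 if z > 0 else -1

def sigmoid(z):
    """Sigmoid activation: S-curve with values between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs (salaried, married, good credit history) and weights
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.4, 0.2, 0.5])
z = weighted_sum(w, x)
print(z, step(z), sign(z), sigmoid(z))   # 0.9 -> 1, +1, ~0.71
```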
Output of Perceptron
Perceptron with a Boolean output:
Inputs: x1, …, xn
Output: o(x1, …, xn)
An output of +1 specifies that the neuron is triggered. An output of -1 specifies that the neuron did
not get triggered.
“sgn” stands for sign function with output +1 or -1.
Error in Perceptron
In the Perceptron Learning Rule, the predicted output is compared with the known output. If it
does not match, the error is propagated backward to allow weight adjustment to happen.
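A minimal sketch of this weight adjustment (the learning rate, number of epochs and training data are assumptions for illustration): whenever the predicted output o differs from the known target t, the weights are nudged by η(t − o)·x:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=10):
    """Perceptron learning rule: w <- w + eta * (target - output) * x.
    X: inputs with a leading 1 for the bias unit; t: targets in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            o = 1 if np.dot(w, xi) > 0 else -1   # sign of the weighted sum
            w += eta * (ti - o) * xi             # no change when o equals ti
    return w

# Illustrative, linearly separable data (first column is the bias input x0 = 1)
X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, -1.0, -2.0], [1, -2.0, -1.0]])
t = np.array([1, 1, -1, -1])
print(train_perceptron(X, t))   # learned weights [w0, w1, w2]
```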
Perceptron: Decision Function
A decision function φ(z) of Perceptron is defined to take a linear combination of x and w
vectors.
Bias Unit
For simplicity, the threshold θ can be brought to the left-hand side and represented as w0x0, where w0 = −θ and x0 = 1.
The value w0 is called the bias unit. The decision function then becomes φ(z), with z = wᵀx = w0x0 + w1x1 + … + wmxm.
Output
The decision function squashes wᵀx to either +1 or −1, and it can be used to discriminate between two linearly separable classes.
Perceptron at a Glance
Perceptron has the following characteristics:
o Perceptron is an algorithm for supervised learning of single-layer binary linear classifiers.
o Optimal weight coefficients are learned automatically.
o Weights are multiplied with the input features and a decision is made as to whether the neuron fires or not.
o The activation function applies a step rule to check whether the output of the weighting function is greater than zero.
o A linear decision boundary is drawn, enabling the distinction between the two linearly separable classes +1 and −1.
o If the sum of the input signals exceeds a certain threshold, it outputs a signal;
otherwise, there is no output.
Types of activation functions include the sign, step, and sigmoid functions.
Support Vector Machines
"Support Vector Machine" (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes very well.
Support vectors are simply the co-ordinates of individual observations. The Support Vector Machine is the frontier (a hyper-plane/line) which best segregates the two classes.
How does it work?
Let’s understand:
Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and C).
Now, identify the right hyper-plane to classify the stars and circles.
o You need to remember a thumb rule to identify the right hyper-plane: "Select the hyper-plane which segregates the two classes better." In this scenario, hyper-plane "B" has performed this job excellently.
Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and C) and all are segregating the classes well. Now, how can we identify the right hyper-plane?
o Here, maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us decide the right hyper-plane. This distance is called the margin.
o The margin for hyper-plane C is higher compared to both A and B. Hence, we name the right hyper-plane C. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of misclassification.
Identify the right hyper-plane (Scenario-3): Hint: use the rules discussed in the previous scenarios to identify the right hyper-plane.
o Some of you may have selected hyper-plane B as it has a higher margin compared to A. But here is the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin. Here, hyper-plane B has a classification error and A has classified all points correctly. Therefore, the right hyper-plane is A.
Can we classify two classes (Scenario-4)? Here, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
o As already mentioned, the one star at the other end is like an outlier for the star class. SVM has a feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, we can say that SVM is robust to outliers.
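As a rough sketch of this tolerance to outliers (using scikit-learn, with made-up 2-D points that include one outlying star), the soft-margin parameter C controls how much a single outlier is allowed to pull the hyper-plane around:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D data: circles (-1) on the left, stars (+1) on the right,
# plus one star sitting deep inside the circle region as an outlier.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],   # circles
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0],   # stars
              [1.5, 1.5]])                           # outlying star
y = np.array([-1, -1, -1, 1, 1, 1, 1])

# A modest C keeps a wide margin and effectively ignores the outlier.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # the learned hyper-plane w and b
print(clf.support_vectors_)        # the support vectors that define the margin
```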
Find the hyper-plane to segregate the two classes (Scenario-5): In the scenario below, we can't have a linear hyper-plane between the two classes, so how does SVM classify them? Till now, we have only looked at linear hyper-planes.
SVM solves this problem by introducing an additional feature. Here, we add a new feature z = x² + y². Now, let's plot the data points on the x and z axes:
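A quick way to see why this works (the points below are made up; only the transformation z = x² + y² comes from the text): after adding z, points near the origin and points far from it get very different z values, so a straight line in the (x, z) plane separates them.

```python
import numpy as np

# Made-up points: circles near the origin, stars further out
circles = np.array([[0.5, 0.2], [-0.4, 0.6], [0.1, -0.7]])
stars = np.array([[2.0, 1.5], [-1.8, 2.1], [1.9, -2.2]])

def add_z(points):
    """Append the new feature z = x^2 + y^2 to each (x, y) point."""
    z = points[:, 0] ** 2 + points[:, 1] ** 2
    return np.column_stack([points, z])

print(add_z(circles)[:, 2])   # small z values (all below about 0.6)
print(add_z(stars)[:, 2])     # large z values (all above 6)
# A simple threshold on z (a linear boundary in the new space) now separates the classes.
```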
In SVM, it is easy to have a linear hyper-plane between these two classes. But another burning question which arises is: do we need to add this feature manually to obtain a hyper-plane? No; SVM has a technique called the kernel trick. Kernels are functions which take a low-dimensional input space and transform it into a higher-dimensional space, i.e., they convert a non-separable problem into a separable problem. They are mostly useful in non-linear separation problems. Simply put, the kernel does some extremely complex data transformations, then finds out how to separate the data based on the labels or outputs you have defined.
When we look at the hyper-plane in the original input space, it looks like a circle:
The magic of the kernel is to find a function that avoids all the trouble implied by the high-dimensional computation. The result of a kernel is a scalar; said differently, we are back to a one-dimensional quantity. After you have found this function, you can plug it into the standard linear classifier.
Let's see an example to understand the concept of a kernel. You have two vectors, x1 and x2. The objective is to create a higher dimension by using a polynomial mapping. The output is equal to the dot product of the new feature maps: map each vector into the higher-dimensional space and take the dot product of the two mapped vectors; the kernel computes the same value directly from x1 and x2.
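A small numeric check of this idea (the specific degree-2 mapping φ(x) = (x1², √2·x1·x2, x2²) and the two example vectors are illustrative assumptions): the polynomial kernel K(x1, x2) = (x1 · x2)² returns exactly the dot product of the mapped vectors, without ever building the feature map.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(a, b):
    """Degree-2 polynomial kernel computed directly in the original space."""
    return np.dot(a, b) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])

print(np.dot(phi(x1), phi(x2)))   # dot product in the higher-dimensional space: 121.0
print(poly_kernel(x1, x2))        # the same value, computed from x1 and x2 alone
```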
Application areas of kernel methods are diverse and include geostatistics (kriging), inverse distance weighting, 3D reconstruction, bioinformatics, chemoinformatics, information extraction and handwriting recognition.
Popular kernels:
Fisher kernel
Graph kernels
Kernel smoother
Polynomial kernel
Radial basis function kernel (RBF)
String kernels
Neural tangent kernel
Distance Based Models
Distance is applied through the concept of neighbours and exemplars. Neighbours are points in
proximity with respect to the distance measure expressed through exemplars. Exemplars are
either centroids that find a centre of mass according to a chosen distance metric or medoids
that find the most centrally located data point. The most commonly used centroid is the
arithmetic mean, which minimises squared Euclidean distance to all other points.
Notes:
The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of all the points in the figure. This definition extends to any object in n-dimensional space: its centroid is the mean position of all the points.
Medoids are similar in concept to means or centroids. Medoids are most commonly used on
data when a mean or centroid cannot be defined. They are used in contexts where the
centroid is not representative of the dataset, such as in image data.
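A small sketch contrasting the two kinds of exemplar (the data points are made up): the centroid is the arithmetic mean position, which need not coincide with any data point, while the medoid is the actual data point with the smallest total distance to all the others.

```python
import numpy as np

# Made-up 2-D data points, with one point far away from the rest
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [8.0, 8.0]])

# Centroid: the arithmetic mean position
centroid = X.mean(axis=0)

# Medoid: the data point whose summed Euclidean distance to all points is smallest
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
medoid = X[np.argmin(dists.sum(axis=1))]

print(centroid)   # [3.125 3.125], pulled towards the far-away point (8, 8)
print(medoid)     # [2.  1.5], an actual data point inside the dense group
```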
Examples of distance-based models include the nearest-neighbour models, which use the
training data as exemplars – for example, in classification. The K-means clustering algorithm
also uses exemplars to create clusters of similar data points.
Nearest Neighbor Classifiers
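A minimal sketch of nearest-neighbour classification (the training exemplars, query points and k = 3 are illustrative assumptions): the predicted class of a query point is the majority label among its k closest training exemplars.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k closest exemplars
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative training exemplars and labels
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["circle", "circle", "circle", "star", "star"])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))   # circle
print(knn_predict(X_train, y_train, np.array([4.9, 5.1])))   # star
```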
Clustering: Introduction
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.
Now that we understand what clustering is, let's take a look at the types of clustering.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or not. For example, in the above example each customer is put into exactly one of the 10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a single cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For example, in the above scenario each customer is assigned a probability of being in each of the 10 clusters of the retail store.
K-Means Clustering
K-means is an iterative, distance-based clustering algorithm. For a small set of 2-D data points it proceeds as follows:
1. Specify the desired number of clusters K: here, let us choose K = 2 for the 5 data points in 2-D space.
2. Randomly assign each data point to a cluster: let's assign three points to cluster 1, shown using red colour, and two points to cluster 2, shown using grey colour.
3. Compute the cluster centroids: the centroid of the data points in the red cluster is shown using a red cross and that of the grey cluster using a grey cross.
4. Re-assign each point to the closest cluster centroid: note that only the data point at the bottom is assigned to the red cluster, even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
5. Re-compute the cluster centroids for both clusters.
6. Repeat steps 4 and 5 until no improvements are possible: we repeat the 4th and 5th steps until the cluster assignments stop changing. When there is no further switching of data points between the two clusters for two successive repeats, the algorithm terminates (unless a maximum number of iterations is specified explicitly).
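A compact sketch of these steps (the five data points and K = 2 are illustrative; unlike step 2 above, the sketch initialises the centroids by picking K data points rather than by randomly assigning points to clusters):

```python
import numpy as np

def kmeans(X, k=2, max_iters=100, seed=0):
    """Plain k-means: repeat assignment (step 4) and centroid update (step 5) until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(max_iters):
        # step 4: assign every point to its closest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
        # step 5: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # step 6: stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

# Five illustrative 2-D points forming two natural groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8], [8.0, 8.0], [8.5, 7.5]])
print(kmeans(X, k=2))
```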
Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. This algorithm starts with all the data points assigned to a cluster of their own. Then the two nearest clusters are merged into the same cluster. In the end, the algorithm terminates when there is only a single cluster left.
The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be interpreted as follows:
At the bottom, we start with 25 data points, each assigned to separate clusters. Two closest
clusters are then merged till we have just one cluster at the top. The height in the dendrogram at
which two clusters are merged represents the distance between two clusters in the data space.
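A brief sketch of the same idea with SciPy (the data points are made up): linkage records which clusters are merged and at what distance, which is exactly the information the dendrogram displays, and fcluster cuts the tree into a chosen number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D data points
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

# Bottom-up (agglomerative) clustering; each row of Z records one merge:
# [cluster_i, cluster_j, merge_distance, size_of_new_cluster]
Z = linkage(X, method="single", metric="euclidean")
print(Z)

# Cut the hierarchy so that at most 3 clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2 3]
```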
The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice of the number of clusters will be 4, as the red horizontal line in the dendrogram covers the maximum vertical distance AB.
Two important things that you should know about hierarchical clustering are:
This algorithm has been implemented above using a bottom-up approach. It is also possible to follow a top-down approach, starting with all data points assigned to the same cluster and recursively performing splits till each data point is assigned to a separate cluster.
The decision of merging two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters:
o Euclidean distance: ||a − b||2 = √(Σi (ai − bi)²)
o Squared Euclidean distance: ||a − b||2² = Σi (ai − bi)²
o Manhattan distance: ||a − b||1 = Σi |ai − bi|
o Maximum distance: ||a − b||∞ = maxi |ai − bi|
o Mahalanobis distance: √((a − b)ᵀ S⁻¹ (a − b)), where S is the covariance matrix
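These metrics can be checked numerically (the two vectors and the covariance matrix below are illustrative; the identity covariance reduces the Mahalanobis distance to the ordinary Euclidean distance):

```python
import numpy as np

a = np.array([2.0, 4.0, 6.0])
b = np.array([1.0, 1.0, 2.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))    # ||a - b||2
squared_euclidean = np.sum((a - b) ** 2)     # ||a - b||2 squared
manhattan = np.sum(np.abs(a - b))            # ||a - b||1
maximum = np.max(np.abs(a - b))              # ||a - b||infinity

S = np.eye(3)                                # covariance matrix (identity for illustration)
diff = a - b
mahalanobis = np.sqrt(diff @ np.linalg.inv(S) @ diff)

print(euclidean, squared_euclidean, manhattan, maximum, mahalanobis)
# 5.099 26.0 8.0 4.0 5.099
```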
*****END******