USAGE OF BASIC GEOMETRICAL IDEAS AND PRINCIPLES OF SYMMETRY FOR MACHINE LEARNING

Vatsal Srivastava

January 2024

Abstract—This paper attempts to give the reader an understanding of existing simple classifiers, introduced as long ago as 1943, along with their pros and cons. The aim is to perform an analysis of the simplest of classifiers and attempt to build a classifier that can classify linearly separable data while balancing accuracy and computational efficiency using geometrical ideas. We will not go into the depths of the mechanics of these classifiers and their various differing versions, since our focus is not on their mechanics; rather, we aim to understand the broader set of underlying concepts used in them. Instead, we will look at a train of successive classifiers closely related to each other, each being an improvement on its predecessor, and also briefly touch upon some present geometrical classifiers. Our ultimate aim is to deviate from the decades-old track of statistical inference and try to create a model based on geometry, starting, of course, with small and easy problems.
I. INTRODUCTION

The first mathematical model of a neuron was described by Warren McCulloch and Walter Pitts in 1943. This was followed by Frank Rosenblatt's implementation of a Perceptron. It was further improved upon by Prof. Bernard Widrow and his student Ted Hoff with the introduction of the Adaptive Linear Neuron. This was also the time when the geometrical model K-Nearest Neighbors was introduced. Each of these implementations or models, though better than its predecessor, still has some inherent flaws. The aim of this research is to understand and avoid them while building a new classifier. It is important to note that stating these models have inherent flaws by no means diminishes their importance as foundations of the study of neural activity, machine learning and artificial intelligence. These models are far from obsolete and are still implemented in various advanced machine learning libraries like Scikit-Learn [7]. Our aim is to improve the computational performance of these models while using a field of mathematics that is not generally associated with machine learning. This is an experiment to diverge from the branch of statistical inference and explore the relationship between other fields of mathematics and machine learning.

A. Problem Statement

The original implementations of machine learning models had been computationally non-optimum. Though various modern classes of objects such as NumPy boosted the processing speed of these classifiers, the base algorithms have some aspects which introduce redundancy or cause updates of less than optimum magnitude to take place. Some classifiers which generally perform well in terms of speed and accuracy are also included in our study. Our aim is not to beat any classifier or family of classifiers, but rather to perform a comparison to establish the potential of extensive usage of geometrical ideas in machine learning and classification problems.

II. EXISTING CLASSIFIERS

Machine Learning models can be broadly classified into two types, namely Classifiers and Regressors. Linear classifiers were the first step in the development of Artificial Intelligence. They gave a machine the ability to behave like a single neuron. Since 1943, from the inception of the first mathematical model of a neuron, we have come a long way, with powerful tools like Decision Trees, Logistic Regression and Support Vector Machines. We will go through three of the foundational classifiers, namely the McCulloch-Pitts neuron, the Perceptron and ADALINE. After that we will take a look at an existing geometrical classifier as well.

A. McCulloch-Pitts Mathematical Model of a Neuron, 1943

Warren McCulloch, a neuroscientist, and Walter Pitts, a logician, together created the first mathematical model of a neuron in 1943 [6]. The model consists of two parts (or functions), which will be denoted as g(x) and f(y). The function g(x) is an aggregator which takes a series of successive stimuli (or inputs) and calculates the net input. The series of stimuli can be represented as a matrix X with elements from the set S ∈ {0, 1}. The net input is a singular, real and measurable value which acts as the input y to the function f(y). The function f(y) is the decider or threshold function which governs whether the neuron will fire or not.

In the context of this paper, the expression will fire refers to the act of a neuron sending an electrical or electromechanical impulse upon receiving a set of stimuli in a given time frame whose aggregate value is greater than or equal to the threshold value θ. The statement that a neuron will fire upon receiving an aggregate stimulus equal to the threshold value is subjective and can change depending on the model and differing use cases, for which such a model may adapt not by itself but due to the intervention of an outside human agent which is not considered to be a part of the system itself. This threshold function is what gives the implementations of the neuron up to the Perceptron their characteristic property of being an all-or-none object.

Fig. 1. A simple representation of the mathematical process in an M-P neuron.

The reader should note that this does not summarize the original paper of McCulloch and Pitts, titled 'A Logical Calculus of Ideas Immanent in Nervous Activity'. It is in fact a summary of the inferences derived from the analysis of the original paper, presented for application in machine learning. This inference was arrived at from the intensive discussion and general acceptance of the all-or-none nature of biological neurons, which gave way to the study of neurological systems as a sum of different propositional logic. Interestingly, the original paper did not endeavour to interpret the neurological or physiological mechanism of the neuron for the purpose of direct implementation in machine learning practices; rather, it was aimed at deriving a calculus that could consistently describe the action of a neuron with respect to synapses and dendrites, via the argument that the nature of any compound statement can be derived from its individual claims.

The M-P neuron does not have the ability to learn from inputs given at a previous time. This is a direct result of the fact that the authors did not describe any learning method for their model of a neuron. Such a neuron cannot undergo changes due to stimuli received at a previous time, or, in the case of machine learning, such a neuron cannot retain information that corresponds to an input received at a previous time. This is where an M-P neuron differs from its natural counterparts; neurons present in sentient beings can undergo changes to retain information associated with the stimuli they receive. The mechanism of that process is beyond the scope of this research.

The application of this model is highly limited and can be described as follows: suppose we have a set of stimuli c_1, c_2, c_3, ..., c_j where each has a value x_1, x_2, x_3, ..., x_j. Each x_i can take values from the set S ∈ {0, 1}. The aggregator function g(x) is defined as:

g(x) = \sum_{i=1}^{j} x_i \qquad (1)

The threshold function is defined as:

f(y) = \begin{cases} 0 & y < \theta \\ 1 & y \geq \theta \end{cases} \qquad (2)

The M-P neuron was not a model created for implementation in electronic or electromechanical devices. It served the purpose of providing an insight into how naturally occurring neural networks work. Even if the model is implemented using modern software, it would not be regarded as an intelligent model, since the object itself lacks a learning rule or mechanism. The threshold value θ needs to be calculated manually for different use cases.
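The paper does not give an implementation, but purely as an illustrative sketch of equations (1) and (2), an M-P neuron can be written as follows; the threshold θ is picked by hand, as the text notes, and the example inputs are assumptions made for the demonstration.

```python
# Minimal sketch of a McCulloch-Pitts neuron, following equations (1) and (2).
# The threshold `theta` is not learned; it must be chosen manually per use case.

def mp_aggregate(stimuli):
    """g(x): sum a sequence of binary stimuli (each 0 or 1) into the net input."""
    return sum(stimuli)

def mp_fire(net_input, theta):
    """f(y): all-or-none threshold decision."""
    return 1 if net_input >= theta else 0

# Example: a neuron that fires only when at least 2 of 3 inputs are active.
print(mp_fire(mp_aggregate([1, 0, 1]), theta=2))  # -> 1
print(mp_fire(mp_aggregate([1, 0, 0]), theta=2))  # -> 0
```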
B. Perceptron by Frank Rosenblatt, 1957

Frank Rosenblatt was a project engineer at the Cornell Aeronautical Laboratory who worked alongside the US Navy to develop the first electromechanical implementation of a neuron, which he called a Perceptron [8], or an object that could perceive. Rosenblatt wrote in 1958, "Yet we are about to witness the birth of such a machine – a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control." What Rosenblatt proposed was not limited to an algorithmic model as is popular in modern machine learning practice; rather, he described the workings of a machine that is capable of learning.

A machine such as the one described by Rosenblatt would consist of three systems:
1) The S-System (Sensory System)
2) The A-System (Association System)
3) The R-System (Response System)

Fig. 2. The basic structure of a Perceptron from the original paper of Rosenblatt, captioned General Organization of the Perceptron.

The sensory system acts as the input function, the association system is synonymous with the learning rule, and the response system is the threshold function. The basic idea is that the S-System can send excitatory (positive) or inhibitory (negative) impulses to the A-System, which will then build an association between responses and stimuli. The notation A_{s.r} denotes the subset of the A units activated by stimulus s and response r. Suppose stimulus S1 is presented and subsets A_{1.1} and A_{1.2} are activated, meaning two possible responses 1 and 2. The net value of both sets is compared; the one with the greater net value is considered dominant and the one with the lesser value is suppressed. The elements of the subset of correct correlation will gather an increment to their value, while the elements of suppressed subsets remain unchanged. Since the value has been incremented, the next time stimulus S1 is presented, the net input corresponding to the previously greater subset will be even greater, thus ensuring a higher probability of selection of the previously established correct response due to the reinforcement that was done when the stimulus S1 was presented at a previous time. The machine has associated the stimulus S1 with whichever the correct response was.

It is important to note here that modern implementations of the Perceptron differ in the approach to performing updates. Most software-based implementations use the threshold function similarly, but they focus on incorrect classifications. The idea is that if a stimulus (or data point) is incorrectly classified, the update should attempt to move the model in the direction of correctness. Instead of reinforcing correct classifications, they penalize incorrect classifications. The learning algorithm of the perceptron is defined as shown in Fig. 3.

Fig. 3. Perceptron algorithm, from Convergence Proof for the Perceptron Algorithm, Michael Collins.

The only aspect that slows down the performance of a Perceptron is that it has to use the threshold function to determine the update value, which means that the updates always take place with fixed values. The degree of incorrectness is not propagated to the neuron. This makes convergence slightly slower.
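To make the modern, error-driven update concrete, here is a small illustrative sketch (not from the paper) of a perceptron-style training loop in Python/NumPy; the data layout (X as an m × n array, labels in {−1, +1}) and the learning rate eta are assumptions made for the example.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=10):
    """Illustrative perceptron loop: update only on misclassified points,
    always by a fixed-magnitude step (the degree of error is not used).
    X: (m, n) feature array, y: (m,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Thresholded prediction: sign of the net input.
            pred = 1 if (xi @ w + b) >= 0 else -1
            if pred != yi:                 # penalize only incorrect classifications
                w += eta * yi * xi         # fixed-size correction toward the right side
                b += eta * yi
    return w, b
```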

C. Adaptive Linear Element, Bernard Widrow and Ted Hoff, 1959

This object was an improvement built upon the Perceptron. There is not much difference between the two unless we go into the depths of their papers [11], but those are more concerned with the mechanics of their respective machines, which is not our area of interest. Since we are not building an electromechanical machine, a majority of that content is beyond the scope of this paper. Our interest is in the learning algorithm. In the case of ADALINE, similar to the Perceptron, we will consider LMS as the primary means of error calculation. From the schematic of ADALINE, the difference is pretty apparent. The ADALINE does not process the output of the summer (or aggregator) through the quantizer (or threshold function); rather, it is processed through an activation function σ(x) = y, which gives a continuous value that will be used in the update rule:

w ← w + η(o − y)x \qquad (3)

The derivation of this rule can be done using least mean squares to calculate the error and differentiating the expression to perform regular gradient descent. The activation function is present in the Perceptron as well; the only difference is that it is an identity function, meaning σ(x) = x.

The benefit of using a continuous value for checking the error is that we can then incorporate a measure of the degree of incorrectness when updating the weights, hence making the classifier adaptive, as in the name. Thus, theoretically, convergence would be faster. This, however, brings about a disadvantage: since we are trying to fit multiple continuous values to a fixed value, i.e. the class label, we cannot be sure when the model has sufficiently fit the data. The guarantee that distinct errors produce distinct updates comes from:

Theorem II.1. No two real and distinct numbers, when multiplied by a non-zero real number, shall yield results that are equal.

Proof. To prove this, we will use contradiction. Let us assume that we have two real and distinct numbers a and b and a non-zero real number c, and that when a and b are multiplied with c, they yield the same number. Then

a · c = b · c

or

a = b

which contradicts the assumption that a and b are distinct. Hence our theorem is true.

Fig. 4. Schematic of the original ADALINE algorithm (Widrow, 1960).

Since we cannot determine objectively when convergence has been reached, we must let the model train for a set number of epochs until satisfactory results are achieved. One could run the weights through the threshold function and check accuracy, but that would mean losing the time advantage gained by using the adaptive learning method. This is the only disadvantage of using this method. Still, the algorithm is very fast.
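A minimal illustrative sketch (not from the paper) of an ADALINE-style update following equation (3): the error is the continuous difference between the target o and the linear output, with the identity activation, so larger mistakes produce proportionally larger updates. Array shapes and the learning rate are assumptions.

```python
import numpy as np

def adaline_train(X, y, eta=0.01, epochs=50):
    """Illustrative ADALINE (LMS) loop following w <- w + eta * (o - y) * x.
    Unlike the perceptron, the continuous error scales each update,
    so the degree of incorrectness is propagated to the weights."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # fold the bias into the weights
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, oi in zip(X, y):
            out = xi @ w                  # identity activation: sigma(x) = x
            w += eta * (oi - out) * xi    # error-proportional (adaptive) update
    return w
```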
D. K-Nearest Neighbors by Evelyn Fix and Joseph Hodges, 1951

This is a non-parametric algorithm and an essentially very simple classifier. Calculate the k nearest data points that exist in the training dataset, where k is an integer and k > 0. A majority or plurality vote is then carried out to determine the class label. In essence, we determine the k nearest data points of known classification and use them to determine the class of the given point. It is one of the most widely used geometrical classifiers. Distance can be calculated in multiple ways. We will use the Minkowski distance formula, from which various other distance metrics can be obtained.

The Minkowski distance d is defined as:

d = \left( \sum_{i=1}^{m} \left| x_i^{[a]} - x_i^{[b]} \right|^{p} \right)^{\frac{1}{p}} \qquad (4)

where:
• m = number of features
• a = point from the training set
• b = point to be classified
• p = an arbitrary constant

The value of p determines which kind of distance is calculated. For example, p = 1 corresponds to the Manhattan distance and p = 2 corresponds to the Euclidean distance.
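As an illustration (not part of the paper), equation (4) and the voting step can be sketched in a few lines; the array layout and the choice of k are assumptions for the example.

```python
import numpy as np
from collections import Counter

def minkowski(a, b, p=2):
    """Equation (4): p=1 gives the Manhattan distance, p=2 the Euclidean distance."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p)

def knn_predict(X_train, y_train, query, k=3, p=2):
    """Classify `query` by a plurality vote among its k nearest training points."""
    dists = [minkowski(x, query, p) for x in X_train]
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```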
The most apparent disadvantage of this system is that the model does not learn anything from the training data. It simply stores the data and iterates through it whenever it is used. This gives rise to a variety of problems:

1) We have to store the entire dataset. This means that a large storage space is needed for the training data, and it cannot be discarded since it is required for every classification.
2) The distance calculation is resource-heavy on the system. Suppose we have a million training examples, each with five features. A single classification could possibly take minutes.

Due to extensive research on this classifier, there have been many solutions to the above problems, such as clustering instead of storing all training examples. A brief account of advancements and improvements in the KNN algorithm is given in the article by Haiyan Wang et al. [10].

With this we conclude our analysis of foundational machine learning classifiers.

III. MIGRATING POINTS ALGORITHM

To lay the foundation of a new classifier, let us first establish some postulates for the cartesian plane.

Postulate III.1. Let there be two sets of points X0 and X1, such that there exist no two line segments that can be drawn from a set of four distinct points, with two and only two points from one set, that will intersect unless the two endpoints of one segment belong to different sets. Then:
(a) There will exist at least one line l1 that will divide the plane into two disjoint sets A and B (excluding the set of points on the line) such that X0 ⊂ A and X1 ⊂ B.
(b) No line segment drawn by joining two points of the same set, from X0 or X1, shall meet the line l1.
(c) Any line segment drawn by joining two points from the different sets X0 and X1 will always meet the line l1.

A. Initialization Algorithm

Since the postulate states that a line segment joining points from different sets will always meet the decision boundary, we can be sure that a line segment joining the medians of the two datasets X0 and X1 will intersect a decision boundary. The reason is shown as follows.

Let us consider the sets X0 and X1 as sample spaces Ω0 and Ω1. X_1, X_2, ..., X_n ∼ δ_i ∀ i ∈ [P_1, P_n], where P_1, P_2, P_3, ... are points in Ω0. Similarly, Y_1, Y_2, ..., Y_m ∼ δ_j ∀ j ∈ [Q_1, Q_m], where Q_1, Q_2, Q_3, ... are points in Ω1. If X_1, X_2, ..., X_n are IID, then we can say that¹:

\bar{X}_n \xrightarrow{P} \mu_{X_0} \qquad (5)

which can be interpreted as: the distribution of \bar{X}_n becomes more concentrated around µ as n gets large. The same can be said for \bar{Y}_m:

\bar{Y}_m \xrightarrow{P} \mu_{X_1} \qquad (6)

¹ Theorem 5.6, All of Statistics, Larry Wasserman, Springer.

Since we can calculate \bar{X}_n from our sample, we can say that the expected value of the total populace will lie somewhere in proximity. Thus, by minimizing collinearity between the decision boundary l1 and the line segment joining the sample means (or other measures of central tendency), we greatly reduce the collinearity between the expectation µ_{X_0} and l1. Trivially, the sample and the total populace would be distributed uniformly around µ_{X_0}. Hence, by reducing the collinearity, we increase the accuracy of our decision boundary.

The term collinear is used here for classification. Since we want to "reduce" collinearity, we need a measure for the degree of collinearity. Such a measurement can be done using the angle that two segments (or a line and a segment) subtend and their proximity (distance).

Definition III.1. Two line segments are said to be near collinear if the angle subtended by them and their proximity are both below an arbitrary threshold, and collinear if both values are 0.

To extend this definition to our line l1 and the line segment joining the means of sets X0 and X1 (say AB), we can consider Euclid's second postulate [3]: any straight line segment can be extended indefinitely in a straight line. Let us consider the line segment AB as \vec{s} and a line segment \vec{r} on l1 such that \vec{r} ∩ \vec{s} ≠ ∅. The angle θ between \vec{s} and \vec{r} is given by:

\theta = \cos^{-1}\!\left( \frac{\vec{s} \cdot \vec{r}}{|\vec{s}|\,|\vec{r}|} \right) \qquad (7)

The maximum value of cos⁻¹ is π, but we need to remember that there are two angles when lines intersect. Both angles are supplementary, hence to maximise both, each needs to be π/2. The above shows why a line l1 passing through the line segment joining the sample means and perpendicular to it is a good choice for the initialization of the two random points.
a good choice for initialization of two random points.
B. Learning Algorithm
After initialization, we will have two points E ≡ (x1, y1) and F ≡ (x2, y2) which can be used to write the equation of a line as:

y - y_1 = \frac{y_2 - y_1}{x_2 - x_1}(x - x_1) \qquad (8)

y = \frac{y_2 - y_1}{x_2 - x_1}(x - x_1) + y_1 \qquad (9)

(x_2 - x_1)y = x y_2 - x_1 y_2 - y_1 x + y_1 x_1 + y_1 x_2 - y_1 x_1 \qquad (10)

(x_2 - x_1)y = (y_2 - y_1)x - x_1 y_2 + y_1 x_2 \qquad (11)

Compare equation (11) to Ax + By + C = 0. Putting in the values of the mean points (or medians), we will get one positive and one negative value, which will be used to assign a pseudo class ∈ {−1, 1} to the sets X0 and X1. The sign of the pseudo class should match that of the respective outputs.

Then we calculate the displacement value d of each point T ≡ (x0, y0):

d = \frac{A x_0 + B y_0 + C}{\sqrt{A^2 + B^2}} \qquad (12)

Using values from equation (11), we can write the equation of displacement as:

d = \frac{(y_1 - y_2)x_0 + (x_2 - x_1)y_0 + (x_1 y_2 - x_2 y_1)}{\sqrt{(y_1 - y_2)^2 + (x_2 - x_1)^2}} \qquad (13)

If d × pseudo class < 0, we can infer that the classification is incorrect. Let this quantity be λ. Thus λ carries information both about whether the classification is correct or not and about the degree of misclassification (points further away, i.e. with higher displacement, are more misclassified than nearer ones). Hence we can move the decision boundary in an adaptive manner.

1) Upon iterating through the dataset, suppose we encounter a point Q which was misclassified. Let the position vector of that point be \vec{d}.
2) Out of the two points E and F, we calculate which point is closer to point Q. Let the position vector of this point be \vec{c}.
3) We use these two position vectors to calculate the vector joining these points:
\vec{u} = \vec{c} - \vec{d} \qquad (14)
4) Now we select a random point from the near-cluster² and let this point have the position vector \vec{g}. Using this we calculate another vector \vec{w} as:
\vec{w} = \vec{g} - \vec{d} \qquad (15)

² A near cluster is a set of points within a specified distance metric from the sample mean. Random points are selected to suppress the effects of outliers in learning.

Now we can calculate the displacement vector \vec{v} (not related to the displacement value d) as:

\vec{v} = \vec{w} - \vec{u} \qquad (16)

Fig. 5. Graphical representation of the calculation of the displacement vector.

Then the unit displacement vector, which tells us the direction of movement, is given by:

\hat{v} = \frac{\vec{v}}{|\vec{v}|} \qquad (17)

The displacement factor, which tells us the magnitude of the movement (or update), is given by |η × λ|. The addition or movement vector is given by:

\vec{t} = \hat{v} \times |\eta \times \lambda| \qquad (18)

Finally, the new position vector \vec{n} of the migrating point (whose current position vector is \vec{a}) is given by:

\vec{n} = \vec{a} + (\hat{v} \times |\eta \times \lambda|) \qquad (19)

The reason why this would work is again based on proximity. Postulate 3.1(a) states that X0 ⊂ A, X1 ⊂ B and A ∩ B = ∅. Therefore a misclassification would imply a situation such that Q ∈ X0 and Q ∈ B. If the data is linearly separable, then in order to correct the classification the decision boundary should move in such a way that the resulting boundary satisfies Q ∉ B. For that to happen, the decision boundary (or its points) shall move in such a way that it moves towards the sample mean of the class whose points are a subset of B.

There are various ways to move the decision boundary in a manner that satisfies the above conditions, for example by moving points along the vector joining the sample means of both classes, or by simply moving the point towards the sample mean of the opposite class. The reason for using the method described is that it takes into account the relative position of the point that was misclassified. Therefore, if an outlier is misclassified, the model would update accordingly.
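Under the assumption that the closer of the two migrating points is the one being moved, a sketch of one update step (equations (13) through (19)) could look as follows. The near-cluster sampling is simplified here to choosing a random opposite-class point that lies close to that class's mean; that simplification, the array conventions and the learning rate are assumptions, not the paper's implementation.

```python
import numpy as np

def signed_displacement(E, F, T):
    """Equation (13): signed distance of point T from the line through E and F."""
    (x1, y1), (x2, y2), (x0, y0) = E, F, T
    num = (y1 - y2) * x0 + (x2 - x1) * y0 + (x1 * y2 - x2 * y1)
    return num / np.hypot(y1 - y2, x2 - x1)

def update_step(E, F, Q, pseudo_class, opposite_mean, opposite_pts, eta=0.1, rng=None):
    """One migrating-points update for a possibly misclassified point Q (eqs (14)-(19))."""
    rng = rng or np.random.default_rng()
    lam = signed_displacement(E, F, Q) * pseudo_class
    if lam >= 0:
        return E, F                                   # correctly classified: no update
    moving_is_E = np.linalg.norm(E - Q) <= np.linalg.norm(F - Q)
    c = E if moving_is_E else F                       # closer migrating point (assumed to move)
    # Near-cluster sample: a random opposite-class point near that class's mean (assumption).
    d2 = np.linalg.norm(opposite_pts - opposite_mean, axis=1)
    g = opposite_pts[rng.choice(np.argsort(d2)[:max(1, len(opposite_pts) // 4)])]
    u, w = c - Q, g - Q                               # eqs (14), (15)
    v = w - u                                         # eq (16)
    v_hat = v / np.linalg.norm(v)                     # eq (17)
    n = c + v_hat * abs(eta * lam)                    # eqs (18), (19)
    return (n, F) if moving_is_E else (E, n)
```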
C. Over-fitting Cases

Like all classifiers, the MPA can also suffer from over-training. As the positions of the points are updated, they are displaced towards the central tendency of the class they misclassified. This also gives them a displacement towards each other. The closer two points are, the more drastically the orientation of the line made by joining them changes with a change in the relative position of the two points. Therefore, if the model trains for a long time, the two points will get too close to each other and the decision boundary will start swinging.

Fig. 6. Two moving points start moving close to each other as the model trains.

The figure above shows how two migrating points E and F might move along the respective displacement vectors \vec{v} and \vec{b}. The misclassified point is Q, and R is a random point in the near cluster of the opposing class. It can be observed that the points get drawn closer as the model trains.

This problem becomes even more prominent if the datasets are not linearly separable. In this case, after the decision boundary reaches an optimum position, the moving points, say E and F, will effectively have a displacement along the line connecting the central tendencies equal to 0³, but the displacement towards the line segment joining the central tendencies is in the same direction. Hence the two points will converge.

³ Upon reaching a position, in the case of non-linearly separable datasets, where no further improvement is possible, it is probable that the model will misclassify a near-equal number of examples; hence the displacement towards either class gets cancelled out by a movement towards the opposite class.

Fig. 7. Model trains over a non-linearly separable dataset for 150 epochs.

Fig. 8. Same model as above trains over the same dataset for 5000 epochs.

Figure 7 and Figure 8 compare the positions of the two moving points after training over the same dataset with the same initialization for 150 and 5000 epochs. As can be seen, when we kept updating the model for a high number of epochs, the points did not move around much, but they did converge at a point⁴, and due to this the decision boundary changed drastically and yielded incorrect results, because the effects of randomness become more pronounced.

⁴ This will not always be the result when training the model through an extremely high number of epochs. In some cases a higher number of epochs will not affect the position of the moving points much. This is case specific.

In order to address this flaw, we can update the algorithm with an addendum: after the movement vector \vec{b} is calculated, the algorithm checks the distance between both the migrating points used to draw the decision boundary, and if the distance is less than⁵ or equal to a predefined threshold α (say), then the unit displacement vector is adjusted such that the point which is to be moved along \vec{b} moves along \vec{b}' instead:

\vec{b}' = \begin{cases} \vec{b} - (\widehat{EF} \times |\vec{b}| \cos\delta) & \delta \leq 90^\circ \\ \vec{b} & \delta > 90^\circ \end{cases} \qquad (20)

where δ is the angle between \vec{b} and \overrightarrow{FE}. What this does is simply cancel out the displacement of the points towards each other if they come too close, as judged by the arbitrary threshold α. This is done only if the angle between the calculated displacement vector and \overrightarrow{FE} is less than 90°, because if the angle is greater than 90° the points would naturally move further away.

⁵ "Less than" is included because it is possible that the previous update caused the distance between the moving points to fall below the threshold.
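A sketch of this correction, under the same array conventions as the earlier snippets and under one reading of equation (20): when the two migrating points are within α of each other, the component of the movement vector directed towards the other migrating point is removed, which matches the stated intent of cancelling the mutual-approach displacement. The sign conventions here are an assumption.

```python
import numpy as np

def adjust_movement(b, moving_pt, other_pt, alpha):
    """One reading of equation (20): if the migrating points are closer than alpha,
    strip from the movement vector b its component towards the other migrating point."""
    if np.linalg.norm(other_pt - moving_pt) > alpha:
        return b                                # far enough apart: no adjustment
    toward_other = other_pt - moving_pt
    toward_other = toward_other / np.linalg.norm(toward_other)
    proj = b @ toward_other                     # |b| cos(delta), signed
    if proj <= 0:                               # delta > 90 degrees: already moving apart
        return b
    return b - proj * toward_other              # cancel the mutual-approach component
```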
D. Higher Dimensional Datasets

So far we have only discussed datasets with two dimensions, and the method to find the decision boundary was also specific to a line, which can only separate points on a plane. What about datasets like:

D = \begin{bmatrix} x_1^{[1]} & x_2^{[1]} & x_3^{[1]} & \dots & x_n^{[1]} \\ x_1^{[2]} & x_2^{[2]} & x_3^{[2]} & \dots & x_n^{[2]} \\ x_1^{[3]} & x_2^{[3]} & x_3^{[3]} & \dots & x_n^{[3]} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_1^{[m]} & x_2^{[m]} & x_3^{[m]} & \dots & x_n^{[m]} \end{bmatrix} \qquad (21)

where m is the number of examples and n is the number of features.

In order to deal with data points described in n dimensions, we need to first establish a few points:
1) If we have a dataset of n features, then that dataset can be mapped to an ambient space of n dimensions.
2) It would be meaningless to define a hyperplane of m dimensions in q dimensions such that m ≥ q. Therefore every m-dimensional hyperplane is defined or "embedded" in m + 1 or more dimensions.

Hence, we shall state: a hyperplane of n − 1 dimensions can separate an ambient space of n dimensions into two regions of infinite extent⁶, which can be framed as a theorem:

⁶ Just as a plane has an infinite area and 3-D space has an infinite volume, an n-dimensional space has an infinite "extent" which does not have a specific quantity.

Theorem III.1. A hyperplane construct of n − 1 dimensions can divide an ambient space of n dimensions into three disjoint sets X1, X2 and X3 (say) of points, with one set containing all and only the points that satisfy the equation of the hyperplane.

Proof. By definition, a hyperplane is a subspace whose dimension is one less than that of the ambient space. Let w be a vector normal to the hyperplane H ∈ R^n and x be any point inside the ambient space. Therefore,

w^T x + b = 0 \qquad (22)

is the general equation of the hyperplane, where w = [w_1, w_2, w_3, ..., w_n]. Now, we can define three sets of points from this equation: points that satisfy

w^T x + b = 0 \qquad (23)

or

w^T x + b < 0 \qquad (24)

or

w^T x + b > 0 \qquad (25)

No point can satisfy two or more of these equations simultaneously, and every point in the ambient space satisfies one of them; hence our theorem is true.

Therefore, in order to accomplish the task of making a decision boundary, we just need to find the equation of this hyperplane of n − 1 dimensions. The general equation of such a hyperplane is given by:

a_1 x_1 + a_2 x_2 + \dots + a_n x_n = c \qquad (26)

where c is a constant. This is analogous to equation (12). The distance d between a point and a hyperplane is given by:

d = \frac{|a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n - c|}{\sqrt{a_1^2 + a_2^2 + a_3^2 + \dots + a_n^2}} \qquad (27)

Now, before we move on to how we will calculate the equation of the hyperplane from a set of given points, we shall discuss the nature of the points required to define a hyperplane of n − 1 dimensions embedded in an ambient space of n dimensions:
1) To describe a hyperplane of n − 1 dimensions, you need at least n points.
2) The set of n points must not lie on the same (n − 2)-dimensional hyperplane.

Earlier we used an equation specific to lines to find the equation of the line that would pass through two points and form a decision boundary. Now we shall look at the method that gives us the equation of any (n − 1)-dimensional hyperplane embedded in an n-dimensional space using n points.

Let each point that will be used to draw the hyperplane be of the type t = (x_1, x_2, ..., x_n). We need n such points to describe the hyperplane. To find the equation, solve the determinant⁷:

\det \begin{bmatrix} x_1 & x_2 & x_3 & \dots & x_n & 1 \\ x_1^{[1]} & x_2^{[1]} & x_3^{[1]} & \dots & x_n^{[1]} & 1 \\ x_1^{[2]} & x_2^{[2]} & x_3^{[2]} & \dots & x_n^{[2]} & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ x_1^{[n]} & x_2^{[n]} & x_3^{[n]} & \dots & x_n^{[n]} & 1 \end{bmatrix} = 0 \qquad (28)

Solving the determinant gives an equation of the form w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + c = 0, where c is a constant term (usually a sum of constant values). Comparing this equation to the general equation of the hyperplane (equation (26)), we get w_1 = a_1, w_2 = a_2, w_3 = a_3, ..., w_n = a_n. This therefore gives us the coefficients of the hyperplane equation, using which we can apply the same algorithm as discussed in the case of 2-D datasets.

⁷ The reasoning behind this method is explained well in this answer [1] by user amd on Mathematics Stack Exchange.
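As an illustration of this step (not code from the paper), the coefficients can be recovered without expanding the symbolic determinant: appending a column of ones to the n points and taking the SVD null vector of the resulting matrix gives, up to scale, the same hyperplane that equation (28) describes.

```python
import numpy as np

def hyperplane_through_points(P):
    """Given n points in n dimensions (P has shape (n, n)), return (a, c) such that
    a . x = c holds for every point, i.e. the hyperplane of equation (26).
    Equivalent to expanding the determinant in equation (28): we take the null
    vector of [P | 1] via SVD instead of expanding cofactors symbolically."""
    P = np.asarray(P, dtype=float)
    A = np.hstack([P, np.ones((P.shape[0], 1))])   # each row: (x_1 ... x_n  1)
    _, _, vt = np.linalg.svd(A)
    coeffs = vt[-1]                                # null-space direction of A
    a, c = coeffs[:-1], -coeffs[-1]                # a . x + coeffs[-1] = 0  ->  a . x = c
    return a, c

def signed_distance(a, c, x):
    """Signed version of equation (27): distance of x from the hyperplane a . x = c."""
    return (a @ x - c) / np.linalg.norm(a)
```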
E. Performance Evaluation

We evaluated the model against a Perceptron, a KNN model and an SVC model from the scikit-learn library [7]. To get a generalized performance figure we calculated the mean errors over 500 datasets, each containing two feature columns and one true class label column, without changing the parameters. Both the Perceptron and SVC classes have an adaptive learning rate feature, which is absent in our classifier implementation. The 500 datasets were obtained using the make_blobs method of sklearn. Integer values from 0-50 were used as random seeds, and further the standard deviation of these 50 datasets was varied over values from 1.0 to 2.0, which gave us 10 datasets from each randomly generated dataset, hence a total of 500 datasets. Due to this we also did not apply standardization to the datasets. The Perceptron was initialized with (random_state = 8). The KNN model was initialized with (n_neighbors = 3, n_jobs = -1). SVC was initialized with (kernel = 'linear'). This was done to prevent SVC from using the kernel trick and increasing the dimensionality of the dataset. This might seem like bad practice, to restrict the capabilities of one classifier in order to compare performance, but it must be understood that SVC and our classifier basically do the same task of creating a hyperplane in order to categorize the data. So theoretically the kernel trick can also be used with our classifier, but since we have not implemented it in the testing object, that potential feature of our classifier is restricted. Hence we even out the playing field by restricting SVC to a linear boundary only.

The results of the performance evaluation are as follows:
• KNN performed the best, with a mean error of 4.2909.
• The Migrating Points Algorithm had a mean error of 7.6781, which is just slightly worse than that of SVC at 7.2254.
• The Moving Points Algorithm performed better than a Perceptron, which had a mean error of 9.1436.

The similar errors of SVC and our classifier are expected, since they are basically doing the same thing, i.e. creating a hyperplane. Only the methods of creating the hyperplane differ, and apparently SVC performed slightly better on these datasets. Our model out-performed a Perceptron by a significant margin.
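For reference, the comparison setup described above could be reproduced along the following lines. This is a hedged sketch, not the authors' script: the dataset size, the error measure (here, the count of misclassified training points) and the `MigratingPointsClassifier` placeholder are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def mean_errors(classifier_factories, seeds=range(50), stds=np.linspace(1.0, 2.0, 10)):
    """Count misclassified points per generated dataset and average them per classifier."""
    errors = {name: [] for name in classifier_factories}
    for seed in seeds:
        for std in stds:
            X, y = make_blobs(n_samples=200, centers=2, n_features=2,
                              cluster_std=std, random_state=seed)
            for name, make_clf in classifier_factories.items():
                clf = make_clf()
                clf.fit(X, y)
                errors[name].append(np.sum(clf.predict(X) != y))
    return {name: float(np.mean(v)) for name, v in errors.items()}

factories = {
    "Perceptron": lambda: Perceptron(random_state=8),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=3, n_jobs=-1),
    "SVC (linear)": lambda: SVC(kernel="linear"),
    # "MPA": lambda: MigratingPointsClassifier(),  # hypothetical implementation of this paper
}
print(mean_errors(factories))
```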
Fig. 9. Sample non-linearly separable dataset.

Fig. 10. The decision boundary calculated by the Moving Points Algorithm.

Upon testing the custom-made implementation of the algorithm on the Iris dataset [9] to classify the samples of Iris-setosa and Iris-versicolor, with 50 samples of each class, it was found that the algorithm was able to classify the linearly separable dataset in two dimensions with 100% accuracy. The feature vectors selected are SepalWidthCm and SepalLengthCm.

Fig. 11. The decision boundary calculated by the algorithm on the Iris dataset; label 0: Iris-versicolor, label 1: Iris-setosa.

Further, the model was tested on a sample of classes not linearly separable in two dimensions, using the labels "Iris-versicolor" and "Iris-virginica". For this test, the features "PetalLengthCm" and "SepalLengthCm" were used. The resultant decision boundary is presented in Fig. 12.

The accuracy score was determined using the formula:

accuracy score = Number of Correct Predictions / Total Number of Predictions \qquad (29)

The proposed algorithm gave an accuracy score of 0.96, compared to a support vector machine classifier used with a linear kernel, which gave an accuracy score of 0.94 on the same dataset. The decision boundary found by the SVM is represented in Fig. 13.

The Perceptron performed rather poorly on the same dataset and had an accuracy score of 0.86; the decision boundary is represented in Fig. 14.

Fig. 12. The decision boundary drawn by the model for a dataset that is not linearly separable in two dimensions; label 1: Iris-virginica, label 0: Iris-versicolor.

Fig. 13. Decision boundary found using SVM on the same dataset.

Fig. 14. Decision boundary found using Perceptron on the same dataset.

IV. USING MULTIPLE POINTS FOR FINDING COMPLEX PATTERNS IN DATA

In the above sections, we discussed the algorithm that can be used to draw a linear decision boundary. However, we also have an approach that allows us to create non-linear decision boundaries.

A. Polynomial Interpolation

Two popularly used interpolation techniques are Lagrange interpolation [5] and Newton's divided difference interpolation formula [2]. However, both of these techniques are susceptible to Runge's phenomenon, which refers to the inaccurate oscillations of the polynomial around the boundaries of the interval. The critical effect it can have on predictions in machine learning can be seen in the blog post Explore Runge's Polynomial Interpolation Phenomenon by Cleve Moler [4].

Runge's phenomenon can be avoided if we use a Chebyshev distribution to calculate the polynomial, by concentrating several points towards the ends of the distribution, but that would increase the training time and at the same time make the calculation of the polynomial expensive. Another downside is that whatever number of points we use to ensure that the function does not oscillate do not add to the information about the underlying patterns in the data. To understand what this means, let us state an important mechanism of the working of this algorithm. The algorithm essentially has two parts:

• The Points - they are the objects that contain the information about the underlying patterns.
• Method of Inference - basically the method that we use to draw or find the decision boundary.

For example, when we defined the algorithm for strictly making a linear decision boundary, the steps for calculating the equation of that boundary (eq. 23) are the method of inference. Similarly, using more than the stipulated n − 1 points to find the decision boundary of a dataset of n dimensions calls for a different method of inference from those points. Once the training ends, the amount of information contained in the points is fixed, but how accurately and efficiently we are able to extract that information depends on the method of inference. Coming back to the original statement: the increased number of points required to remove the effects of Runge's phenomenon does not add to the information about the data pattern contained by the points; it solely improves the accuracy of the method of inference. This can be seen trivially in the following example (Fig. 15): 15 points were used to fit the polynomial satisfactorily, but it can be seen by inspection that this many points are not needed for such a simple graph. Hence, the extra points do not add to the information present in the model.

Fig. 15. Credit: Explore Runge's Polynomial Interpolation Phenomenon, Cleve Moler [4].

B. Splines

This is an alternative method used in interpolation problems. It is basically a piecewise function made up of linear, quadratic, cubic or higher-degree polynomials. The biggest plus point of this method is that it can avoid Runge's phenomenon even at higher degrees. Hence, in our use case we can make do with a relatively small number of points.
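As an illustration of why splines suit this use case (again a sketch, not the paper's code): SciPy's CubicSpline interpolates a handful of knot points piecewise, so adding knots does not trigger the end-of-interval oscillations that a single high-degree polynomial fit can exhibit. The knot values and the side-of-curve classification rule below are assumptions about how such a boundary might be used, not something specified in the paper.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Illustrative knot points that a set of migrating points might supply.
x_knots = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_knots = np.array([0.5, 0.1, 0.0, 0.2, 0.6])

# Piecewise cubic boundary through the knots; smooth, and free of the
# end-of-interval oscillations a single high-degree polynomial can show.
boundary = CubicSpline(x_knots, y_knots)

def classify(point, boundary):
    """Assign a class by which side of the spline curve the point falls on."""
    x, y = point
    return 1 if y >= boundary(x) else -1

print(classify((0.5, 0.4), boundary))   # -> 1 (above the curve)
print(classify((0.5, -0.3), boundary))  # -> -1 (below the curve)
```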
V. CONCLUSION

This paper discussed some foundational classifiers and then, inspired by those, described a classifier that operates on the most basic element of mathematics: points. As atoms are to the universe, points are to geometry. Using this fundamental element of mathematics, we were able to shift machine learning from statistical inference to geometrical inference. We focused extensively on the initialization process and on the ability of this algorithm to deal with outliers, which helps boost runtime efficiency and accuracy. The algorithm we described explored only one method to move a point, using the position vectors of an initialized point, a test point and one median point. There can be various other methods, such as using both median points or just one point altogether. The computational performance of these different methods can only be judged after testing them. The main focus ahead will be on developing an optimized and efficient implementation of this algorithm for further testing and on understanding potential real-life applications.
REFERENCES

[1] amd. Answer to "How to determine the equation of the hyperplane that contains several points". Apr. 2018. URL: https://math.stackexchange.com/a/2723930 (visited on 08/24/2024).
[2] Biswajit Das and Dhritikesh Chakrabarty. "Newton's Divided Difference Interpolation formula: Representation of Numerical Data by a Polynomial curve". In: International Journal of Mathematics Trends and Technology 35 (July 2016), pp. 197-203. DOI: 10.14445/22315373/IJMTT-V35P528.
[3] Euclid. Elements. Venice: Erhard Ratdolt, May 1482. URL: https://hdl.loc.gov/loc.wdl/wdl.18198.
[4] Explore Runge's Polynomial Interpolation Phenomenon. Dec. 2018. URL: https://blogs.mathworks.com/cleve/2018/12/10/explore-runges-polynomial-interpolation-phenomenon/ (visited on 08/24/2024).
[5] Jim Farmer. "Lagrange's interpolation formula". In: Australian Senior Mathematics Journal 32.1 (2018), pp. 8-12. ISSN: 0819-4564.
[6] Warren S. McCulloch and Walter Pitts. "A logical calculus of the ideas immanent in nervous activity". In: The Bulletin of Mathematical Biophysics 5.4 (Dec. 1943), pp. 115-133. ISSN: 0007-4985, 1522-9602. DOI: 10.1007/BF02478259. URL: http://link.springer.com/10.1007/BF02478259 (visited on 08/24/2024).
[7] Fabian Pedregosa et al. "Scikit-learn: Machine learning in Python". In: The Journal of Machine Learning Research 12 (2011). Publisher: JMLR.org, pp. 2825-2830. ISSN: 1532-4435.
[8] Frank Rosenblatt. The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.
[9] UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/dataset/53/iris (visited on 09/11/2024).
[10] Haiyan Wang, Peidi Xu, and Jinghua Zhao. "Improved KNN Algorithm Based on Preprocessing of Center in Smart Cities". In: Complexity 2021.1 (Jan. 2021). Ed. by Zhihan Lv, p. 5524388. ISSN: 1076-2787, 1099-0526. DOI: 10.1155/2021/5524388. URL: https://onlinelibrary.wiley.com/doi/10.1155/2021/5524388 (visited on 08/24/2024).
[11] Bernard Widrow. Adaptive "ADALINE" Neuron Using Chemical "Memristor".
