USAGE OF BASIC GEOMETRICAL IDEAS AND PRINCIPLES OF SYMMETRY FOR MACHINE LEARNING

Vatsal Srivastava

January 2024

Abstract—This paper attempts to give the reader an understanding of existing simple classifiers, introduced as long ago as 1943, along with their pros and cons. The aim is to perform an analysis of the simplest of classifiers and attempt to build a classifier that can classify linearly separable data while balancing accuracy and computational efficiency using geometrical ideas. We will not go into the depths of the mechanics of these classifiers and their various differing versions, since our focus is not on their mechanics; rather, we aim to understand the broader set of underlying concepts used in them. Instead, we will look at a train of successive classifiers closely related to each other, each being an improvement on its predecessor, and also briefly touch upon some present geometrical classifiers. Our ultimate aim is to deviate from the decades-old track of statistical inference and try to create a model based on geometry, starting, of course, with small and easy problems.
I. INTRODUCTION

The first mathematical model of a neuron was described by Warren McCulloch and Walter Pitts in 1943. This was followed by Frank Rosenblatt's implementation of a Perceptron. It was further improved upon by Prof. Bernard Widrow and his student Ted Hoff with the introduction of the Adaptive Linear Neuron. This was also the time when the geometrical model K-Nearest Neighbors was introduced. Each of these implementations or models, though better than its predecessor, still has some inherent flaws. The aim of this research is to understand and avoid them while building a new classifier. It is important to note that stating these models have inherent flaws by no means diminishes their importance as foundations of the study of neural activity, machine learning and artificial intelligence. These models are far from obsolete and are still implemented in various advanced machine learning libraries like Scikit-Learn [7]. Our aim is to improve the computational performance of these models while using a field of mathematics that is not generally associated with machine learning. This is an experiment to diverge from the branch of statistical inference and explore the relationship between other fields of mathematics and machine learning.

A. Problem Statement

The original implementations of machine learning models had been computationally non-optimum. Though various modern classes of objects such as NumPy boosted the processing speed of these classifiers, the base algorithms have some aspects which introduce redundancy or cause updates of less than optimum magnitude to take place. Some classifiers which generally perform well in terms of speed and accuracy are also included in our study. Our aim is not to beat any classifier or family of classifiers, but rather to perform a comparison to establish the potential of extensive usage of geometrical ideas in machine learning and classification problems.

II. EXISTING CLASSIFIERS

Machine Learning models can be broadly classified into two types, namely Classifiers and Regressors. Linear classifiers were the first step in the development of Artificial Intelligence. They gave a machine the ability to behave like a single neuron. Since 1943, from the inception of the first mathematical model of a neuron, we have come a long way, with powerful tools like Decision Trees, Logistic Regression and Support Vector Machines. We will go through three of the foundational classifiers, namely the McCulloch-Pitts neuron, the Perceptron and ADALINE. After that we will take a look at an existing geometrical classifier as well.

A. McCulloch-Pitts Mathematical Model of a Neuron, 1943

Warren McCulloch, a neuroscientist, and Walter Pitts, a logician, together created the first mathematical model of a neuron in 1943 [6]. The model consists of two parts (or functions), which will be denoted as g(x) and f(y). The function g(x) is an aggregator which takes a series of successive stimuli (or inputs) and calculates the net input. The series of stimuli can be represented as a matrix X with elements from the set S ∈ {0, 1}. The net input is a singular, real and measurable value which acts as the input y to the function f(y). The function f(y) is the decider or threshold function which governs whether the neuron will fire or not.

In the context of this paper, the expression will fire refers to the act of a neuron sending an electrical or electromechanical impulse upon receiving a set of stimuli in a given time frame whose aggregate value is greater than or equal to the threshold value θ. The statement that a neuron will fire upon receiving an aggregate stimulus equal to the threshold value is subjective and can change depending on the model and differing use cases, for which such a model may adapt not by itself but due to the intervention of an outside human agent which is not considered to be a part of the system itself. This threshold function is what gives the implementations of the neuron up to the Perceptron their characteristic property of being an all-or-none object.

Fig. 1. A simple representation of the mathematical process in an M-P neuron.

The reader should note that this does not summarize the original paper of McCulloch and Pitts, titled 'A Logical Calculus of Ideas Immanent in Nervous Activity'. It is in fact a summary of the inferences derived from the analysis of the original paper, presented for application in machine learning. This inference was arrived at from the intensive discussion and general acceptance of the all-or-none nature of biological neurons, which gave way to the study of neurological systems as a sum of different propositional logic. Interestingly, the original paper did not endeavour to interpret the neurological or physiological mechanism of the neuron for the purpose of direct implementation in machine learning practices; rather, it was aimed at deriving a calculus that could consistently describe the action of a neuron with respect to synapses and dendrites, via the argument that the nature of any compound statement can be derived from its individual claims.

The M-P neuron does not have the ability to learn from inputs given at a previous time. This is a direct result of the fact that the authors did not describe any learning method for their model of a neuron. Such a neuron cannot undergo changes due to stimuli received at a previous time, or, in the case of machine learning, such a neuron cannot retain information that corresponds to an input received at a previous time. This is where an M-P neuron differs from its natural counterparts; neurons present in sentient beings can undergo changes to retain information associated with the stimuli they receive. The mechanism of that process is beyond the scope of this research.

The application of this model is highly limited and can be described as follows: suppose we have a set of stimuli c_1, c_2, c_3, ..., c_j where each has a value x_1, x_2, x_3, ..., x_j. Each x_i can take values from the set S ∈ {0, 1}. The aggregator function g(x) is defined as:

g(x) = \sum_{i=1}^{j} x_i \qquad (1)

The threshold function is defined as:

f(y) = \begin{cases} 0 & y < \theta \\ 1 & y \geq \theta \end{cases} \qquad (2)

The M-P neuron was not a model created for implementation in electronic or electromechanical devices. It served the purpose of providing an insight into how naturally occurring neural networks work. Even if the model is implemented using modern software, it would not be regarded as an intelligent model, since the object itself lacks a learning rule or mechanism. The threshold value θ needs to be calculated manually for different use cases.
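The paper does not give an implementation, but purely as an illustrative sketch of equations (1) and (2), an M-P neuron can be written as follows; the threshold θ is picked by hand, as the text notes, and the example inputs are assumptions made for the demonstration.

```python
# Minimal sketch of a McCulloch-Pitts neuron, following equations (1) and (2).
# The threshold `theta` is not learned; it must be chosen manually per use case.

def mp_aggregate(stimuli):
    """g(x): sum a sequence of binary stimuli (each 0 or 1) into the net input."""
    return sum(stimuli)

def mp_fire(net_input, theta):
    """f(y): all-or-none threshold decision."""
    return 1 if net_input >= theta else 0

# Example: a neuron that fires only when at least 2 of 3 inputs are active.
print(mp_fire(mp_aggregate([1, 0, 1]), theta=2))  # -> 1
print(mp_fire(mp_aggregate([1, 0, 0]), theta=2))  # -> 0
```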
B. Perceptron by Frank Rosenblatt, 1957

Frank Rosenblatt was a project engineer at the Cornell Aeronautical Laboratory who worked alongside the US Navy to develop the first electromechanical implementation of a neuron, which he called a Perceptron [8], or an object that could perceive. Rosenblatt wrote in 1958, "Yet we are about to witness the birth of such a machine – a machine capable of perceiving, recognizing and identifying its surroundings without any human training or control." What Rosenblatt proposed was not limited to an algorithmic model as is popular in modern machine learning practice; rather, he described the workings of a machine that is capable of learning.

A machine such as the one described by Rosenblatt would consist of three systems:
1) The S-System (Sensory System)
2) The A-System (Association System)
3) The R-System (Response System)

Fig. 2. The basic structure of a Perceptron from the original paper of Rosenblatt, captioned General Organization of the Perceptron.

The sensory system acts as the input function, the association system is synonymous with the learning rule, and the response system is the threshold function. The basic idea is that the S-System can send excitatory (positive) or inhibitory (negative) impulses to the A-System, which will then build an association between responses and stimuli. The notation A_{s.r} denotes the subset of the A units activated by stimulus s and response r. Suppose stimulus S1 is presented and subsets A_{1.1} and A_{1.2} are activated, meaning two possible responses 1 and 2. The net value of both sets is compared; the one with the greater net value is considered dominant and the one with the lesser value is suppressed. The elements of the subset of correct correlation will gather an increment to their value, while the elements of suppressed subsets remain unchanged. Since the value has been incremented, the next time stimulus S1 is presented, the net input corresponding to the previously greater subset will be even greater, thus ensuring a higher probability of selection of the previously established correct response due to the reinforcement that was done when the stimulus S1 was presented at a previous time. The machine has associated the stimulus S1 with whichever the correct response was.

It is important to note here that modern implementations of the Perceptron differ in the approach to performing updates. Most software-based implementations use the threshold function similarly, but they focus on incorrect classifications. The idea is that if a stimulus (or data point) is incorrectly classified, the update should attempt to move the model in the direction of correctness. Instead of reinforcing correct classifications, they penalize incorrect classifications. The learning algorithm of the perceptron is defined as shown in Fig. 3.

Fig. 3. Perceptron algorithm, from Convergence Proof for the Perceptron Algorithm, Michael Collins.

The only aspect that slows down the performance of a Perceptron is that it has to use the threshold function to determine the update value, which means that the updates always take place with fixed values. The degree of incorrectness is not propagated to the neuron. This makes convergence slightly slower.
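To make the modern, error-driven update concrete, here is a small illustrative sketch (not from the paper) of a perceptron-style training loop in Python/NumPy; the data layout (X as an m × n array, labels in {−1, +1}) and the learning rate eta are assumptions made for the example.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=10):
    """Illustrative perceptron loop: update only on misclassified points,
    always by a fixed-magnitude step (the degree of error is not used).
    X: (m, n) feature array, y: (m,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Thresholded prediction: sign of the net input.
            pred = 1 if (xi @ w + b) >= 0 else -1
            if pred != yi:                 # penalize only incorrect classifications
                w += eta * yi * xi         # fixed-size correction toward the right side
                b += eta * yi
    return w, b
```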

C. Adaptive Linear Element, Bernard Widrow and Ted Hoff, 1959

This object was an improvement built upon the Perceptron. There is not much difference between the two unless we go into the depths of their papers [11], but those are more concerned with the mechanics of their respective machines, which is not our area of interest. Since we are not building an electromechanical machine, a majority of that content is beyond the scope of this paper. Our interest is in the learning algorithm. In the case of ADALINE, similar to the Perceptron, we will consider LMS as the primary means of error calculation. From the schematic of ADALINE, the difference is pretty apparent. The ADALINE does not process the output of the summer (or aggregator) through the quantizer (or threshold function); rather, it is processed through an activation function σ(x) = y, which gives a continuous value that will be used in the update rule:

w ← w + η(o − y)x \qquad (3)

The derivation of this rule can be done using least mean squares to calculate the error and differentiating the expression to perform regular gradient descent. The activation function is present in the Perceptron as well; the only difference is that it is an identity function, meaning σ(x) = x.

The benefit of using a continuous value for checking the error is that we can then incorporate a measure of the degree of incorrectness when updating the weights, hence making the classifier adaptive, as in the name. Thus, theoretically, convergence would be faster. This, however, brings about a disadvantage: since we are trying to fit multiple continuous values to a fixed value, i.e. the class label, we cannot be sure when the model has sufficiently fit the data. The guarantee that distinct errors produce distinct updates comes from:

Theorem II.1. No two real and distinct numbers, when multiplied by a non-zero real number, shall yield results that are equal.

Proof. To prove this, we will use contradiction. Let us assume that we have two real and distinct numbers a and b and a non-zero real number c, and that when a and b are multiplied with c, they yield the same number. Then

a · c = b · c

or

a = b

which contradicts the assumption that a and b are distinct. Hence our theorem is true.

Fig. 4. Schematic of the original ADALINE algorithm (Widrow, 1960).

Since we cannot determine objectively when convergence has been reached, we must let the model train for a set number of epochs until satisfactory results are achieved. One could run the weights through the threshold function and check accuracy, but that would mean losing the time advantage gained by using the adaptive learning method. This is the only disadvantage of using this method. Still, the algorithm is very fast.
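A minimal illustrative sketch (not from the paper) of an ADALINE-style update following equation (3): the error is the continuous difference between the target o and the linear output, with the identity activation, so larger mistakes produce proportionally larger updates. Array shapes and the learning rate are assumptions.

```python
import numpy as np

def adaline_train(X, y, eta=0.01, epochs=50):
    """Illustrative ADALINE (LMS) loop following w <- w + eta * (o - y) * x.
    Unlike the perceptron, the continuous error scales each update,
    so the degree of incorrectness is propagated to the weights."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # fold the bias into the weights
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, oi in zip(X, y):
            out = xi @ w                  # identity activation: sigma(x) = x
            w += eta * (oi - out) * xi    # error-proportional (adaptive) update
    return w
```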
D. K-Nearest Neighbors by Evelyn Fix and Joseph Hodges, 1951

This is a non-parametric algorithm and an essentially very simple classifier. Calculate the k nearest data points that exist in the training dataset, where k is an integer and k > 0. A majority or plurality vote is then carried out to determine the class label. In essence, we determine the k nearest data points of known classification and use them to determine the class of the given point. It is one of the most widely used geometrical classifiers. Distance can be calculated in multiple ways. We will use the Minkowski distance formula, from which various other distance metrics can be obtained.

The Minkowski distance d is defined as:

d = \left( \sum_{i=1}^{m} \left| x_i^{[a]} - x_i^{[b]} \right|^{p} \right)^{\frac{1}{p}} \qquad (4)

where:
• m = number of features
• a = point from the training set
• b = point to be classified
• p = an arbitrary constant

The value of p determines which kind of distance is calculated. For example, p = 1 corresponds to the Manhattan distance and p = 2 corresponds to the Euclidean distance.
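As an illustration (not part of the paper), equation (4) and the voting step can be sketched in a few lines; the array layout and the choice of k are assumptions for the example.

```python
import numpy as np
from collections import Counter

def minkowski(a, b, p=2):
    """Equation (4): p=1 gives the Manhattan distance, p=2 the Euclidean distance."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p)

def knn_predict(X_train, y_train, query, k=3, p=2):
    """Classify `query` by a plurality vote among its k nearest training points."""
    dists = [minkowski(x, query, p) for x in X_train]
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```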
The most apparent disadvantage of this system is that the model does not learn anything from the training data. It simply stores the data and iterates through it whenever it is used. This gives rise to a variety of problems:

1) We have to store the entire dataset. This means that a large storage space is needed for the training data, and it cannot be discarded since it is required for every classification.
2) The distance calculation is resource-heavy on the system. Suppose we have a million training examples, each with five features. A single classification could possibly take minutes.

Due to extensive research on this classifier, there have been many solutions to the above problems, such as clustering instead of storing all training examples. A brief account of advancements and improvements in the KNN algorithm is given in the article by Haiyan Wang et al. [10].

With this we conclude our analysis of foundational machine learning classifiers.

III. MIGRATING POINTS ALGORITHM

To lay the foundation of a new classifier, let us first establish some postulates for the cartesian plane.

Postulate III.1. Let there be two sets of points X0 and X1, such that there exist no two line segments that can be drawn from a set of four distinct points, with two and only two points from one set, that will intersect unless the two endpoints of one segment belong to different sets. Then:
(a) There will exist at least one line l1 that will divide the plane into two disjoint sets A and B (excluding the set of points on the line) such that X0 ⊂ A and X1 ⊂ B.
(b) No line segment drawn by joining two points of the same set, from X0 or X1, shall meet the line l1.
(c) Any line segment drawn by joining two points from the different sets X0 and X1 will always meet the line l1.

A. Initialization Algorithm

Since the postulate states that a line segment joining points from different sets will always meet the decision boundary, we can be sure that a line segment joining the medians of the two datasets X0 and X1 will intersect a decision boundary. The reason is shown as follows.

Let us consider the sets X0 and X1 as sample spaces Ω0 and Ω1. X_1, X_2, ..., X_n ∼ δ_i ∀ i ∈ [P_1, P_n], where P_1, P_2, P_3, ... are points in Ω0. Similarly, Y_1, Y_2, ..., Y_m ∼ δ_j ∀ j ∈ [Q_1, Q_m], where Q_1, Q_2, Q_3, ... are points in Ω1. If X_1, X_2, ..., X_n are IID, then we can say that¹:

\bar{X}_n \xrightarrow{P} \mu_{X_0} \qquad (5)

which can be interpreted as: the distribution of \bar{X}_n becomes more concentrated around µ as n gets large. The same can be said for \bar{Y}_m:

\bar{Y}_m \xrightarrow{P} \mu_{X_1} \qquad (6)

¹ Theorem 5.6, All of Statistics, Larry Wasserman, Springer.

Since we can calculate \bar{X}_n from our sample, we can say that the expected value of the total populace will lie somewhere in proximity. Thus, by minimizing collinearity between the decision boundary l1 and the line segment joining the sample means (or other measures of central tendency), we greatly reduce the collinearity between the expectation µ_{X_0} and l1. Trivially, the sample and the total populace would be distributed uniformly around µ_{X_0}. Hence, by reducing the collinearity, we increase the accuracy of our decision boundary.

The term collinear is used here for classification. Since we want to "reduce" collinearity, we need a measure for the degree of collinearity. Such a measurement can be done using the angle that two segments (or a line and a segment) subtend and their proximity (distance).

Definition III.1. Two line segments are said to be near collinear if the angle subtended by them and their proximity are both below an arbitrary threshold, and collinear if both values are 0.

To extend this definition to our line l1 and the line segment joining the means of sets X0 and X1 (say AB), we can consider Euclid's second postulate [3]: any straight line segment can be extended indefinitely in a straight line. Let us consider the line segment AB as \vec{s} and a line segment \vec{r} on l1 such that \vec{r} ∩ \vec{s} ≠ ∅. The angle θ between \vec{s} and \vec{r} is given by:

\theta = \cos^{-1}\!\left( \frac{\vec{s} \cdot \vec{r}}{|\vec{s}|\,|\vec{r}|} \right) \qquad (7)

The maximum value of cos⁻¹ is π, but we need to remember that there are two angles when lines intersect. Both angles are supplementary, hence to maximise both, each needs to be π/2. The above shows why a line l1 passing through the line segment joining the sample means and perpendicular to it is a good choice for the initialization of the two random points.
a good choice for initialization of two random points.
B. Learning Algorithm
After initialization, we will have two points E ≡ (x1, y1) and F ≡ (x2, y2) which can be used to write the equation of a line as:

y - y_1 = \frac{y_2 - y_1}{x_2 - x_1}(x - x_1) \qquad (8)

y = \frac{y_2 - y_1}{x_2 - x_1}(x - x_1) + y_1 \qquad (9)

(x_2 - x_1)y = x y_2 - x_1 y_2 - y_1 x + y_1 x_1 + y_1 x_2 - y_1 x_1 \qquad (10)

(x_2 - x_1)y = (y_2 - y_1)x - x_1 y_2 + y_1 x_2 \qquad (11)

Compare equation (11) to Ax + By + C = 0. Putting in the values of the mean points (or medians), we will get one positive and one negative value, which will be used to assign a pseudo class ∈ {−1, 1} to the sets X0 and X1. The sign of the pseudo class should match that of the respective outputs.

Then we calculate the displacement value d of each point T ≡ (x0, y0):

d = \frac{A x_0 + B y_0 + C}{\sqrt{A^2 + B^2}} \qquad (12)

Using values from equation (11), we can write the equation of displacement as:

d = \frac{(y_1 - y_2)x_0 + (x_2 - x_1)y_0 + (x_1 y_2 - x_2 y_1)}{\sqrt{(y_1 - y_2)^2 + (x_2 - x_1)^2}} \qquad (13)

If d × pseudo class < 0, we can infer that the classification is incorrect. Let this quantity be λ. Thus λ carries information both about whether the classification is correct or not and about the degree of misclassification (points further away, i.e. with higher displacement, are more misclassified than nearer ones). Hence we can move the decision boundary in an adaptive manner.

1) Upon iterating through the dataset, suppose we encounter a point Q which was misclassified. Let the position vector of that point be \vec{d}.
2) Out of the two points E and F, we calculate which point is closer to point Q. Let the position vector of this point be \vec{c}.
3) We use these two position vectors to calculate the vector joining these points:
\vec{u} = \vec{c} - \vec{d} \qquad (14)
4) Now we select a random point from the near-cluster² and let this point have the position vector \vec{g}. Using this we calculate another vector \vec{w} as:
\vec{w} = \vec{g} - \vec{d} \qquad (15)

² A near cluster is a set of points within a specified distance metric from the sample mean. Random points are selected to suppress the effects of outliers in learning.

Now we can calculate the displacement vector \vec{v} (not related to the displacement value d) as:

\vec{v} = \vec{w} - \vec{u} \qquad (16)

Fig. 5. Graphical representation of the calculation of the displacement vector.

Then the unit displacement vector, which tells us the direction of movement, is given by:

\hat{v} = \frac{\vec{v}}{|\vec{v}|} \qquad (17)

The displacement factor, which tells us the magnitude of the movement (or update), is given by |η × λ|. The addition or movement vector is given by:

\vec{t} = \hat{v} \times |\eta \times \lambda| \qquad (18)

Finally, the new position vector \vec{n} of the migrating point (whose current position vector is \vec{a}) is given by:

\vec{n} = \vec{a} + (\hat{v} \times |\eta \times \lambda|) \qquad (19)

The reason why this would work is again based on proximity. Postulate 3.1(a) states that X0 ⊂ A, X1 ⊂ B and A ∩ B = ∅. Therefore a misclassification would imply a situation such that Q ∈ X0 and Q ∈ B. If the data is linearly separable, then in order to correct the classification the decision boundary should move in such a way that the resulting boundary satisfies Q ∉ B. For that to happen, the decision boundary (or its points) shall move in such a way that it moves towards the sample mean of the class whose points are a subset of B.

There are various ways to move the decision boundary in a manner that satisfies the above conditions, for example by moving points along the vector joining the sample means of both classes, or by simply moving the point towards the sample mean of the opposite class. The reason for using the method described is that it takes into account the relative position of the point that was misclassified. Therefore, if an outlier is misclassified, the model would update accordingly.
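Under the assumption that the closer of the two migrating points is the one being moved, a sketch of one update step (equations (13) through (19)) could look as follows. The near-cluster sampling is simplified here to choosing a random opposite-class point that lies close to that class's mean; that simplification, the array conventions and the learning rate are assumptions, not the paper's implementation.

```python
import numpy as np

def signed_displacement(E, F, T):
    """Equation (13): signed distance of point T from the line through E and F."""
    (x1, y1), (x2, y2), (x0, y0) = E, F, T
    num = (y1 - y2) * x0 + (x2 - x1) * y0 + (x1 * y2 - x2 * y1)
    return num / np.hypot(y1 - y2, x2 - x1)

def update_step(E, F, Q, pseudo_class, opposite_mean, opposite_pts, eta=0.1, rng=None):
    """One migrating-points update for a possibly misclassified point Q (eqs (14)-(19))."""
    rng = rng or np.random.default_rng()
    lam = signed_displacement(E, F, Q) * pseudo_class
    if lam >= 0:
        return E, F                                   # correctly classified: no update
    moving_is_E = np.linalg.norm(E - Q) <= np.linalg.norm(F - Q)
    c = E if moving_is_E else F                       # closer migrating point (assumed to move)
    # Near-cluster sample: a random opposite-class point near that class's mean (assumption).
    d2 = np.linalg.norm(opposite_pts - opposite_mean, axis=1)
    g = opposite_pts[rng.choice(np.argsort(d2)[:max(1, len(opposite_pts) // 4)])]
    u, w = c - Q, g - Q                               # eqs (14), (15)
    v = w - u                                         # eq (16)
    v_hat = v / np.linalg.norm(v)                     # eq (17)
    n = c + v_hat * abs(eta * lam)                    # eqs (18), (19)
    return (n, F) if moving_is_E else (E, n)
```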
C. Over-fitting Cases

Like all classifiers, the MPA can also suffer from over-training. As the positions of the points are updated, they are displaced towards the central tendency of the class they misclassified. This also gives them a displacement towards each other. The closer two points are, the more drastically the orientation of the line made by joining them changes with a change in the relative position of the two points. Therefore, if the model trains for a long time, the two points will get too close to each other and the decision boundary will start swinging.

Fig. 6. Two moving points start moving close to each other as the model trains.

The figure above shows how two migrating points E and F might move along the respective displacement vectors \vec{v} and \vec{b}. The misclassified point is Q, and R is a random point in the near cluster of the opposing class. It can be observed that the points get drawn closer as the model trains.

This problem becomes even more prominent if the datasets are not linearly separable. In this case, after the decision boundary reaches an optimum position, the moving points, say E and F, will effectively have a displacement along the line connecting the central tendencies equal to 0³, but the displacement towards the line segment joining the central tendencies is in the same direction. Hence the two points will converge.

³ Upon reaching a position, in the case of non-linearly separable datasets, where no further improvement is possible, it is probable that the model will misclassify a near-equal number of examples; hence the displacement towards either class gets cancelled out by a movement towards the opposite class.

Fig. 7. Model trains over a non-linearly separable dataset for 150 epochs.

Fig. 8. Same model as above trains over the same dataset for 5000 epochs.

Figure 7 and Figure 8 compare the positions of the two moving points after training over the same dataset with the same initialization for 150 and 5000 epochs. As can be seen, when we kept updating the model for a high number of epochs, the points did not move around much, but they did converge at a point⁴, and due to this the decision boundary changed drastically and yielded incorrect results, because the effects of randomness become more pronounced.

⁴ This will not always be the result when training the model through an extremely high number of epochs. In some cases a higher number of epochs will not affect the position of the moving points much. This is case specific.

In order to address this flaw, we can update the algorithm with an addendum: after the movement vector \vec{b} is calculated, the algorithm checks the distance between both the migrating points used to draw the decision boundary, and if the distance is less than⁵ or equal to a predefined threshold α (say), then the unit displacement vector is adjusted such that the point which is to be moved along \vec{b} moves along \vec{b}' instead:

\vec{b}' = \begin{cases} \vec{b} - (\widehat{EF} \times |\vec{b}| \cos\delta) & \delta \leq 90^\circ \\ \vec{b} & \delta > 90^\circ \end{cases} \qquad (20)

where δ is the angle between \vec{b} and \overrightarrow{FE}. What this does is simply cancel out the displacement of the points towards each other if they come too close, as judged by the arbitrary threshold α. This is done only if the angle between the calculated displacement vector and \overrightarrow{FE} is less than 90°, because if the angle is greater than 90° the points would naturally move further away.

⁵ "Less than" is included because it is possible that the previous update caused the distance between the moving points to fall below the threshold.
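A sketch of this correction, under the same array conventions as the earlier snippets and under one reading of equation (20): when the two migrating points are within α of each other, the component of the movement vector directed towards the other migrating point is removed, which matches the stated intent of cancelling the mutual-approach displacement. The sign conventions here are an assumption.

```python
import numpy as np

def adjust_movement(b, moving_pt, other_pt, alpha):
    """One reading of equation (20): if the migrating points are closer than alpha,
    strip from the movement vector b its component towards the other migrating point."""
    if np.linalg.norm(other_pt - moving_pt) > alpha:
        return b                                # far enough apart: no adjustment
    toward_other = other_pt - moving_pt
    toward_other = toward_other / np.linalg.norm(toward_other)
    proj = b @ toward_other                     # |b| cos(delta), signed
    if proj <= 0:                               # delta > 90 degrees: already moving apart
        return b
    return b - proj * toward_other              # cancel the mutual-approach component
```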
D. Higher Dimensional Datasets

So far we have only discussed datasets with two dimensions, and the method to find the decision boundary was also specific to a line, which can only separate points on a plane. What about datasets like:

D = \begin{bmatrix} x_1^{[1]} & x_2^{[1]} & x_3^{[1]} & \dots & x_n^{[1]} \\ x_1^{[2]} & x_2^{[2]} & x_3^{[2]} & \dots & x_n^{[2]} \\ x_1^{[3]} & x_2^{[3]} & x_3^{[3]} & \dots & x_n^{[3]} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_1^{[m]} & x_2^{[m]} & x_3^{[m]} & \dots & x_n^{[m]} \end{bmatrix} \qquad (21)

where m is the number of examples and n is the number of features.

In order to deal with data points described in n dimensions, we need to first establish a few points:
1) If we have a dataset of n features, then that dataset can be mapped to an ambient space of n dimensions.
2) It would be meaningless to define a hyperplane of m dimensions in q dimensions such that m ≥ q. Therefore every m-dimensional hyperplane is defined or "embedded" in m + 1 or more dimensions.

Hence, we shall state: a hyperplane of n − 1 dimensions can separate an ambient space of n dimensions into two regions of infinite extent⁶, which can be framed as a theorem:

⁶ Just as a plane has an infinite area and 3-D space has an infinite volume, an n-dimensional space has an infinite "extent" which does not have a specific quantity.

Theorem III.1. A hyperplane construct of n − 1 dimensions can divide an ambient space of n dimensions into three disjoint sets X1, X2 and X3 (say) of points, with one set containing all and only the points that satisfy the equation of the hyperplane.

Proof. By definition, a hyperplane is a subspace whose dimension is one less than that of the ambient space. Let w be a vector normal to the hyperplane H ∈ R^n and x be any point inside the ambient space. Therefore,

w^T x + b = 0 \qquad (22)

is the general equation of the hyperplane, where w = [w_1, w_2, w_3, ..., w_n]. Now, we can define three sets of points from this equation: points that satisfy

w^T x + b = 0 \qquad (23)

or

w^T x + b < 0 \qquad (24)

or

w^T x + b > 0 \qquad (25)

No point can satisfy two or more of these equations simultaneously, and every point in the ambient space satisfies one of them; hence our theorem is true.

Therefore, in order to accomplish the task of making a decision boundary, we just need to find the equation of this hyperplane of n − 1 dimensions. The general equation of such a hyperplane is given by:

a_1 x_1 + a_2 x_2 + \dots + a_n x_n = c \qquad (26)

where c is a constant. This is analogous to equation (12). The distance d between a point and a hyperplane is given by:

d = \frac{|a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n - c|}{\sqrt{a_1^2 + a_2^2 + a_3^2 + \dots + a_n^2}} \qquad (27)

Now, before we move on to how we will calculate the equation of the hyperplane from a set of given points, we shall discuss the nature of the points required to define a hyperplane of n − 1 dimensions embedded in an ambient space of n dimensions:
1) To describe a hyperplane of n − 1 dimensions, you need at least n points.
2) The set of n points must not lie on the same (n − 2)-dimensional hyperplane.

Earlier we used an equation specific to lines to find the equation of the line that would pass through two points and form a decision boundary. Now we shall look at the method that gives us the equation of any (n − 1)-dimensional hyperplane embedded in an n-dimensional space using n points.

Let each point that will be used to draw the hyperplane be of the type t = (x_1, x_2, ..., x_n). We need n such points to describe the hyperplane. To find the equation, solve the determinant⁷:

\det \begin{bmatrix} x_1 & x_2 & x_3 & \dots & x_n & 1 \\ x_1^{[1]} & x_2^{[1]} & x_3^{[1]} & \dots & x_n^{[1]} & 1 \\ x_1^{[2]} & x_2^{[2]} & x_3^{[2]} & \dots & x_n^{[2]} & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ x_1^{[n]} & x_2^{[n]} & x_3^{[n]} & \dots & x_n^{[n]} & 1 \end{bmatrix} = 0 \qquad (28)

Solving the determinant gives an equation of the form w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + c = 0, where c is a constant term (usually a sum of constant values). Comparing this equation to the general equation of the hyperplane (equation (26)), we get w_1 = a_1, w_2 = a_2, w_3 = a_3, ..., w_n = a_n. This therefore gives us the coefficients of the hyperplane equation, using which we can apply the same algorithm as discussed in the case of 2-D datasets.

⁷ The reasoning behind this method is explained well in this answer [1] by user amd on Mathematics Stack Exchange.
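As an illustration of this step (not code from the paper), the coefficients can be recovered without expanding the symbolic determinant: appending a column of ones to the n points and taking the SVD null vector of the resulting matrix gives, up to scale, the same hyperplane that equation (28) describes.

```python
import numpy as np

def hyperplane_through_points(P):
    """Given n points in n dimensions (P has shape (n, n)), return (a, c) such that
    a . x = c holds for every point, i.e. the hyperplane of equation (26).
    Equivalent to expanding the determinant in equation (28): we take the null
    vector of [P | 1] via SVD instead of expanding cofactors symbolically."""
    P = np.asarray(P, dtype=float)
    A = np.hstack([P, np.ones((P.shape[0], 1))])   # each row: (x_1 ... x_n  1)
    _, _, vt = np.linalg.svd(A)
    coeffs = vt[-1]                                # null-space direction of A
    a, c = coeffs[:-1], -coeffs[-1]                # a . x + coeffs[-1] = 0  ->  a . x = c
    return a, c

def signed_distance(a, c, x):
    """Signed version of equation (27): distance of x from the hyperplane a . x = c."""
    return (a @ x - c) / np.linalg.norm(a)
```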
E. Performance Evaluation

We evaluated the model against a Perceptron, a KNN model and an SVC model from the scikit-learn library [7]. To get a generalized performance figure we calculated the mean errors over 500 datasets, each containing two feature columns and one true class label column, without changing the parameters. Both the Perceptron and SVC classes have an adaptive learning rate feature, which is absent in our classifier implementation. The 500 datasets were obtained using the make_blobs method of sklearn. Integer values from 0-50 were used as random seeds, and further the standard deviation of these 50 datasets was varied over values from 1.0 to 2.0, which gave us 10 datasets from each randomly generated dataset, hence a total of 500 datasets. Due to this we also did not apply standardization to the datasets. The Perceptron was initialized with (random_state = 8). The KNN model was initialized with (n_neighbors = 3, n_jobs = -1). SVC was initialized with (kernel = 'linear'). This was done to prevent SVC from using the kernel trick and increasing the dimensionality of the dataset. This might seem like bad practice, to restrict the capabilities of one classifier in order to compare performance, but it must be understood that SVC and our classifier basically do the same task of creating a hyperplane in order to categorize the data. So theoretically the kernel trick can also be used with our classifier, but since we have not implemented it in the testing object, that potential feature of our classifier is restricted. Hence we even out the playing field by restricting SVC to a linear boundary only.

The results of the performance evaluation are as follows:
• KNN performed the best, with a mean error of 4.2909.
• The Migrating Points Algorithm had a mean error of 7.6781, which is just slightly worse than that of SVC at 7.2254.
• The Moving Points Algorithm performed better than a Perceptron, which had a mean error of 9.1436.

The similar errors of SVC and our classifier are expected, since they are basically doing the same thing, i.e. creating a hyperplane. Only the methods of creating the hyperplane differ, and apparently SVC performed slightly better on these datasets. Our model out-performed a Perceptron by a significant margin.
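For reference, the comparison setup described above could be reproduced along the following lines. This is a hedged sketch, not the authors' script: the dataset size, the error measure (here, the count of misclassified training points) and the `MigratingPointsClassifier` placeholder are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def mean_errors(classifier_factories, seeds=range(50), stds=np.linspace(1.0, 2.0, 10)):
    """Count misclassified points per generated dataset and average them per classifier."""
    errors = {name: [] for name in classifier_factories}
    for seed in seeds:
        for std in stds:
            X, y = make_blobs(n_samples=200, centers=2, n_features=2,
                              cluster_std=std, random_state=seed)
            for name, make_clf in classifier_factories.items():
                clf = make_clf()
                clf.fit(X, y)
                errors[name].append(np.sum(clf.predict(X) != y))
    return {name: float(np.mean(v)) for name, v in errors.items()}

factories = {
    "Perceptron": lambda: Perceptron(random_state=8),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=3, n_jobs=-1),
    "SVC (linear)": lambda: SVC(kernel="linear"),
    # "MPA": lambda: MigratingPointsClassifier(),  # hypothetical implementation of this paper
}
print(mean_errors(factories))
```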
Fig. 9. Sample non-linearly separable dataset.

Fig. 10. The decision boundary calculated by the Moving Points Algorithm.

Upon testing the custom-made implementation of the algorithm on the Iris dataset [9] to classify the samples of Iris-setosa and Iris-versicolor, with 50 samples of each class, it was found that the algorithm was able to classify the linearly separable dataset in two dimensions with 100% accuracy. The feature vectors selected are SepalWidthCm and SepalLengthCm.

Fig. 11. The decision boundary calculated by the algorithm on the Iris dataset; label 0: Iris-versicolor, label 1: Iris-setosa.

Further, the model was tested on a sample of classes not linearly separable in two dimensions, using the labels "Iris-versicolor" and "Iris-virginica". For this test, the features "PetalLengthCm" and "SepalLengthCm" were used. The resultant decision boundary is presented in Fig. 12.

The accuracy score was determined using the formula:

accuracy score = Number of Correct Predictions / Total Number of Predictions \qquad (29)

The proposed algorithm gave an accuracy score of 0.96, compared to a support vector machine classifier used with a linear kernel, which gave an accuracy score of 0.94 on the same dataset. The decision boundary found by the SVM is represented in Fig. 13.

The Perceptron performed rather poorly on the same dataset and had an accuracy score of 0.86; the decision boundary is represented in Fig. 14.

Fig. 12. The decision boundary drawn by the model for a dataset that is not linearly separable in two dimensions; label 1: Iris-virginica, label 0: Iris-versicolor.

Fig. 13. Decision boundary found using SVM on the same dataset.

Fig. 14. Decision boundary found using Perceptron on the same dataset.

IV. USING MULTIPLE POINTS FOR FINDING COMPLEX PATTERNS IN DATA

In the above sections, we discussed the algorithm that can be used to draw a linear decision boundary. However, we also have an approach that allows us to create non-linear decision boundaries.

A. Polynomial Interpolation

Two popularly used interpolation techniques are Lagrange interpolation [5] and Newton's divided difference interpolation formula [2]. However, both of these techniques are susceptible to Runge's phenomenon, which refers to the inaccurate oscillations of the polynomial around the boundaries of the interval. The critical effect it can have on predictions in machine learning can be seen in the blog post Explore Runge's Polynomial Interpolation Phenomenon by Cleve Moler [4].

Runge's phenomenon can be avoided if we use a Chebyshev distribution to calculate the polynomial, by concentrating several points towards the ends of the distribution, but that would increase the training time and at the same time make the calculation of the polynomial expensive. Another downside is that whatever number of points we use to ensure that the function does not oscillate do not add to the information about the underlying patterns in the data. To understand what this means, let us state an important mechanism of the working of this algorithm. The algorithm essentially has two parts:

• The Points - they are the objects that contain the information about the underlying patterns.
• Method of Inference - basically the method that we use to draw or find the decision boundary.

For example, when we defined the algorithm for strictly making a linear decision boundary, the steps for calculating the equation of that boundary (eq. 23) are the method of inference. Similarly, using more than the stipulated n − 1 points to find the decision boundary of a dataset of n dimensions calls for a different method of inference from those points. Once the training ends, the amount of information contained in the points is fixed, but how accurately and efficiently we are able to extract that information depends on the method of inference. Coming back to the original statement: the increased number of points required to remove the effects of Runge's phenomenon does not add to the information about the data pattern contained by the points; it solely improves the accuracy of the method of inference. This can be seen trivially in the following example (Fig. 15): 15 points were used to fit the polynomial satisfactorily, but it can be seen by inspection that this many points are not needed for such a simple graph. Hence, the extra points do not add to the information present in the model.

Fig. 15. Credit: Explore Runge's Polynomial Interpolation Phenomenon, Cleve Moler [4].

B. Splines

This is an alternative method used in interpolation problems. It is basically a piecewise function made up of linear, quadratic, cubic or higher-degree polynomials. The biggest plus point of this method is that it can avoid Runge's phenomenon even at higher degrees. Hence, in our use case we can make do with a relatively small number of points.
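As an illustration of why splines suit this use case (again a sketch, not the paper's code): SciPy's CubicSpline interpolates a handful of knot points piecewise, so adding knots does not trigger the end-of-interval oscillations that a single high-degree polynomial fit can exhibit. The knot values and the side-of-curve classification rule below are assumptions about how such a boundary might be used, not something specified in the paper.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Illustrative knot points that a set of migrating points might supply.
x_knots = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_knots = np.array([0.5, 0.1, 0.0, 0.2, 0.6])

# Piecewise cubic boundary through the knots; smooth, and free of the
# end-of-interval oscillations a single high-degree polynomial can show.
boundary = CubicSpline(x_knots, y_knots)

def classify(point, boundary):
    """Assign a class by which side of the spline curve the point falls on."""
    x, y = point
    return 1 if y >= boundary(x) else -1

print(classify((0.5, 0.4), boundary))   # -> 1 (above the curve)
print(classify((0.5, -0.3), boundary))  # -> -1 (below the curve)
```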
V. CONCLUSION

This paper discussed some foundational classifiers and then, inspired by those, described a classifier that operates on the most basic element of mathematics: points. As atoms are to the universe, points are to geometry. Using this fundamental element of mathematics, we were able to shift machine learning from statistical inference to geometrical inference. We focused extensively on the initialization process and on the ability of this algorithm to deal with outliers, which helps boost runtime efficiency and accuracy. The algorithm we described explored only one method to move a point, using the position vectors of an initialized point, a test point and one median point. There can be various other methods, such as using both median points or just one point altogether. The computational performance of these different methods can only be judged after testing them. The main focus ahead will be on developing an optimized and efficient implementation of this algorithm for further testing and on understanding potential real-life applications.
REFERENCES

[1] amd. Answer to "How to determine the equation of the hyperplane that contains several points". Apr. 2018. URL: https://math.stackexchange.com/a/2723930 (visited on 08/24/2024).
[2] Biswajit Das and Dhritikesh Chakrabarty. "Newton's Divided Difference Interpolation formula: Representation of Numerical Data by a Polynomial curve". In: International Journal of Mathematics Trends and Technology 35 (July 2016), pp. 197-203. DOI: 10.14445/22315373/IJMTT-V35P528.
[3] Euclid. Elements. Venice: Erhard Ratdolt, May 1482. URL: https://hdl.loc.gov/loc.wdl/wdl.18198.
[4] Explore Runge's Polynomial Interpolation Phenomenon. Dec. 2018. URL: https://blogs.mathworks.com/cleve/2018/12/10/explore-runges-polynomial-interpolation-phenomenon/ (visited on 08/24/2024).
[5] Jim Farmer. "Lagrange's interpolation formula". In: Australian Senior Mathematics Journal 32.1 (2018), pp. 8-12. ISSN: 0819-4564.
[6] Warren S. McCulloch and Walter Pitts. "A logical calculus of the ideas immanent in nervous activity". In: The Bulletin of Mathematical Biophysics 5.4 (Dec. 1943), pp. 115-133. ISSN: 0007-4985, 1522-9602. DOI: 10.1007/BF02478259. URL: http://link.springer.com/10.1007/BF02478259 (visited on 08/24/2024).
[7] Fabian Pedregosa et al. "Scikit-learn: Machine learning in Python". In: The Journal of Machine Learning Research 12 (2011). Publisher: JMLR.org, pp. 2825-2830. ISSN: 1532-4435.
[8] Frank Rosenblatt. The Perceptron, a Perceiving and Recognizing Automaton (Project Para). Cornell Aeronautical Laboratory, 1957.
[9] UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/dataset/53/iris (visited on 09/11/2024).
[10] Haiyan Wang, Peidi Xu, and Jinghua Zhao. "Improved KNN Algorithm Based on Preprocessing of Center in Smart Cities". In: Complexity 2021.1 (Jan. 2021). Ed. by Zhihan Lv, p. 5524388. ISSN: 1076-2787, 1099-0526. DOI: 10.1155/2021/5524388. URL: https://onlinelibrary.wiley.com/doi/10.1155/2021/5524388 (visited on 08/24/2024).
[11] Bernard Widrow. Adaptive "ADALINE" Neuron Using Chemical "Memristor".
