Most Classification Problems
This paper focuses on the action of neurons in the brain, especially the EPSP/IPSP cancellation between excitatory and inhibitory synapses, and proposes a new machine learning method. Its feature is to consider a single neuron and to give the input layer a multivariable X_j (j = 1, 2, ...) together with its function value F(X_j). The multivariable input layer and the processing neuron are linked by two lines for each variable node. One line is called an EPSP edge and the other an IPSP edge, and a parameter Δ_j common to both edges is introduced. The processing neuron is divided into a front part and a back part. In the front part, a pulse of width 2Δ_j and height 1 is defined around an input X_j. In the back part, a pulse of width 2Δ_j centered on the input X_j and of height F(X_j) is defined from the value F(X_j) obtained from the input layer. This information is defined as belonging to group i. This grouping is learned and stored for the Teaching signals, and the output of a TEST signal is predicted from the group to which the TEST signal belongs. The parameter Δ_j is optimized so that the accuracy of the prediction is maximized. We apply the proposed method to classification problems and confirm that it is faster and more accurate than the conventional neural network method.
1. Introduction
This paper aims to propose a new machine learning method for classification problems. The classification problem is the task of classifying data into categories. Including pattern recognition, classification is an important and popular problem in machine learning. In this paper, we consider the multiclass classification problem in particular. Machine learning approaches to classification problems include classification by Euclidean distance from a representative vector, neural networks, decision trees, Bayesian networks, clustering, and ensemble learning.
Classification by the Euclidean distance from a representative vector assumes that, when the data of each class are locally aggregated, each class can be represented by a single representative vector. This approach can be further classified into the Template Matching method [1], the k-Nearest Neighbor method [2], etc.
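As an illustration only (not part of the proposed method), the following minimal Python sketch shows classification by Euclidean distance from class representatives; the use of class means as representative vectors and all names are our own assumptions.

import numpy as np

def fit_representatives(X, y):
    # One representative vector per class: here, the class mean (template-matching style).
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest(representatives, x):
    # Assign x to the class whose representative is closest in Euclidean distance.
    return min(representatives, key=lambda c: np.linalg.norm(x - representatives[c]))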
Neural networks, especially hierarchical neural networks, are currently the mainstream classification method, and their development depends largely on (1) the ReLU function as the activation function of the intermediate layers [3], (2) the Softmax function as the activation function of the output layer, (3) the cross-entropy function as the loss function [4], and (4) the stochastic gradient descent method as the optimization algorithm [5]. In particular, many optimization algorithms such as Stochastic Gradient Descent (SGD) [6], AdaGrad [7], RMSProp [8], and Adam [9] have been proposed and reviewed. It is well known that many models specialized for various fields have been proposed for the construction of neural networks, such as convolutional neural networks, data compression by auto-encoders, and recurrent neural networks.
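For reference, a minimal NumPy sketch of the three building blocks named above (ReLU, Softmax, and cross entropy); it is illustrative only and is not the network used later in this paper.

import numpy as np

def relu(z):
    # ReLU activation for hidden layers.
    return np.maximum(0.0, z)

def softmax(z):
    # Numerically stable Softmax for the output layer.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, target_index):
    # Cross-entropy loss for a single sample with an integer class label.
    return -np.log(p[target_index] + 1e-12)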
The decision tree [10] is a method of analyzing data using a tree structure (tree diagram). This method is used particularly in data mining because the classification process of the model can be easily interpreted. In this case, the decision tree has a tree structure in which the leaves represent the classifications and the branches represent the combinations of features leading to those classifications. The usefulness of the decision tree is that it is a nonparametric method that does not assume a distribution for the data to be analyzed. Both explanatory and objective variables can range from nominal scales to interval scales, and the method is said to be robust to outliers. On the other hand, its classification accuracy is lower than that of other machine learning methods, and it is not suitable for linear data.
The Bayesian network is one of the probabilistic inference techniques for reasoning about events. By combining multiple cause-effect relationships, the phenomenon that arises while causes and effects influence each other is visualized in terms of network diagrams and probabilities. It has the following features: 1) it is possible to analyze and infer the connection between "cause" and "effect"; 2) when a certain "cause" is assumed, it is possible to infer the "effect" that may occur from it; and 3) when an expected "effect" is assumed, it is possible to infer the "cause" that may lead to it.
Clustering [11], [12] is characterized by unsupervised learning, whereas neural network classification is supervised learning with teaching signals. K-means [13], an example of clustering, is a non-hierarchical clustering algorithm that uses cluster means to classify data into a given number k of clusters. It is characterized by fast execution and scalability.
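As a purely illustrative sketch of the k-means idea mentioned above (not used elsewhere in this paper), assuming NumPy and a fixed number of iterations:

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # initial cluster centers
    for _ in range(iters):
        # Assign each point to the nearest center (Euclidean distance).
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Recompute each center as the mean of its assigned points (keep old center if empty).
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                            for c in range(k)])
    return labels, centers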
Thus, there are many approaches to the classification problem, and many faster and more accurate methods have been proposed.
The authors have proposed MOST (Monte Carlo Stochastic) [17], a new optimization method that includes the learning of hierarchical neural networks, applied it to the Iris classification problem, and verified its validity by comparing it with other optimization methods. In this paper, we propose a new approach to the classification problem based on MOST optimization, which does not belong to any of the categories above, and verify its validity. In the modeling, we focused on the action of neurons in the brain, especially the EPSP/IPSP cancellation between excitatory and inhibitory synapses, and designed the model with reference to it. The details are described below.
2. Overview of neurons, especially synaptic action, in the brain
At the EPSP edge shown in Fig. 4 a), the step function is shifted by (X_i - Δ):

u(x - (X_i - Δ))   1)

where the step function u(x) is defined as

u(x) = 0 (x < 0), 1 (x ≥ 0)   2)
In the figure, the red line indicates the step function, and the yellow-green line indicates the function after processing.
Next, at the IPSP edge shown in Fig. 4 b), the well-known step function multiplied by -1 is shifted by (X_i + Δ). In other words:

-u(x - (X_i + Δ))   3)

In the figure, as in the case of the EPSP edge, the red line indicates the step function, and the yellow-green line indicates the step function on the negative side after processing.
The operation node in FIG. 2 is divided into two parts, a front part and a back part, and the following is calculated in the front part:

ψ_i(x) = u(x - (X_i - Δ)) - u(x - (X_i + Δ))   4)

which is a pulse of width 2Δ and height 1 centered on the input X_i. In the back part, this pulse is multiplied by the value F(X_i) obtained from the input layer:

φ_i(x) = F(X_i) ψ_i(x)   5)
The range defined from X_i and F(X_i) is called Group i. The meaning is simple: "For a variable x in the range X_i - Δ to X_i + Δ, all outputs are F(X_i)."
Next, consider a case where this processing is applied to multiple data items using FIG. 5. The dashed red line in the figure is the function behind the data. As shown in FIG. 5 a), (X_1, F(X_1)) determines Group-1 under a certain Δ. These operations are called learning and memory. Next, if the above processing is performed with new data, such as X_2 and X_3, then Groups-2 and -3 are obtained. This repetition is called experience. Next, consider new data represented by the red and green triangles as shown in FIG. 5 c). Because the data represented by the red triangle belongs to Group-2 in the figure, the output is expected to be F(X_2).
On the other hand, the data represented by the green triangle does not belong to any group, so Group-4, based on the new green-triangle data, is learned. The processes in FIG. 5 d) are called "relearning" and "additional memory." By repeating this operation, the model learns from the Teaching signals, makes predictions from the memorized groups, and, whenever a datum belongs to no group, constructs ever smarter groups one after another through relearning. The above is an outline of the proposed model. Although our proposal has a simple structure, as shown in FIG. 3, it can express processing similar to the way our brain learns. If this operation is applied to a continuous variable discretized at a constant interval Δ, the function F(x) can be approximated to the resolution of the given Δ, as shown in FIG. 6. It is easy to imagine that the discretized function asymptotically approaches the original function F(x) as Δ → 0. Presumably, this function approximation with continuous variables discretized at constant intervals Δ is equivalent to the universal function approximation theorem [16], which provides the theoretical validity of neural networks. Learning in this model means optimizing the Δ associated with the EPSP and IPSP edges. This optimization of Δ is described in detail in the algorithm in the next chapter.
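The grouping, prediction, and relearning cycle described above can be summarized in the following illustrative Python sketch, which builds on the single-variable pulse above; the list-based group store and all function names are our assumptions, not the paper's notation.

def learn_group(groups, X_i, F_i):
    # "Learning and memory": store one group per teaching datum.
    groups.append((X_i, F_i))

def predict(groups, x, delta):
    # "Prediction": return F(X_i) of the first group whose pulse contains x, else None.
    for X_i, F_i in groups:
        if X_i - delta <= x <= X_i + delta:
            return F_i
    return None

def predict_or_relearn(groups, x, f_true, delta):
    # "Relearning / additional memory": if x belongs to no group, add a new one.
    out = predict(groups, x, delta)
    if out is None:
        learn_group(groups, x, f_true)
        out = f_true
    return out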
3.2 Case of multiple variables
Again, the basic idea is the same as for a single variable. The case of two variables is shown in FIG. 7. The input nodes are X_1i, X_2i, and F(X_1i, X_2i). The X_1i node, the X_2i node, and the operation node are connected by two lines, an EPSP edge and an IPSP edge, respectively, as in the case of a single variable. The processing at each node and edge is summarized below:
1) Processing of the EPSP edge of each variable x_j (j = 1, 2):

u(x_j - (X_ji - Δ_j))   6)

2) Processing of the IPSP edge of each variable x_j:

-u(x_j - (X_ji + Δ_j))   7)

3) Processing of ψ:

ψ_i(x_1, x_2) = Π_{j=1,2} [u(x_j - (X_ji - Δ_j)) - u(x_j - (X_ji + Δ_j))]   8)

4) Processing of φ:

φ_i(x_1, x_2) = F(X_1i, X_2i) ψ_i(x_1, x_2)   9)
It differs from the single-variable case only in that ψ is the product of the pulse functions of the variables x_1 and x_2. The range defined here and its output F(X_1i, X_2i) are called group-i. As in the case of a single variable, the meaning is that the outputs for variables x_1 and x_2 in the ranges X_ji - Δ_j to X_ji + Δ_j (j = 1, 2) are all F(X_1i, X_2i). The "learning" and "memory" of these two variables are shown in FIG. 8. The other operations of FIG. 5, such as "experience," "prediction," and "relearning," are the same as for a single variable. This concept can be applied to any number of variables; only the number of pulse functions multiplied in Eq. 8) increases as the number of variables increases. Learning in the case of multiple variables likewise consists of optimizing the Δ_j associated with the EPSP and IPSP edges of each variable. The systems shown in FIGS. 3 and 7 are similar in structure to a perceptron, but the value of the function F(x) is an input rather than an output, and the processing at the operation node is completely different. The number of variables to be optimized coincides with the number of input parameters and is much smaller than in conventional neural networks, which is expected to reduce the computational load. In addition, it is well known that when the Teaching signals are biased toward specific data, the learning results of conventional neural networks are strongly affected by the more abundant data. The proposed algorithm, however, is not affected by the amount of data, because the weights of the obtained groups are all equal. Hereinafter, this model, which is common to both single and multiple variables, is called the Single Neural Grouping (SiNG) method.
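A hedged sketch of the multivariable group membership test of Eq. 8) as a product of per-variable pulses; all names are our assumptions.

def psi_multi(x, X_i, deltas):
    # Product of per-variable unit pulses (Eq. 8); x, X_i, deltas are equal-length sequences.
    p = 1.0
    for xj, Xji, dj in zip(x, X_i, deltas):
        p *= 1.0 if (Xji - dj) <= xj <= (Xji + dj) else 0.0
    return p

def phi_multi(x, X_i, F_i, deltas):
    # Eq. 9): output F(X_1i, ..., X_ni) if x falls inside group i, else 0.
    return F_i * psi_multi(x, X_i, deltas)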
4. Algorithm of proposed model
4.1 Fundamental algorithm
[STEP-10] In the flow above, the whole set of Teaching signals was divided into new Teaching and TEST signals in order to optimize Δ_j; in this STEP, therefore, the prediction is made on the separately prepared TEST signals. Since the optimal Δ_j has already been determined, steps [STEP-1] to [STEP-6] above are simply applied to the TEST signals.
This is the algorithm of the proposed model. The overall flow of the algorithm
is shown in FIG. 10.
In MOST, the integration value I of an objective function f is evaluated numerically by the Monte Carlo method:

I ≈ (1/K) Σ_{k=1}^{K} f(x_1^(k), x_2^(k), ..., x_n^(k))   10)
where n is the number of variables and K is the total number of Monte Carlo samples. The Monte Carlo method generates random numbers corresponding to each variable, even for a multivariable function, substitutes them into the objective function, sums the results, and divides by the number of random numbers to obtain the numerical integration value.
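A minimal sketch of this Monte Carlo estimate (Eq. 10), assuming the search region is sampled uniformly; it is illustrative, not the authors' implementation.

import numpy as np

def monte_carlo_mean(f, lows, highs, K=50, seed=0):
    # Draw K uniform samples per variable in [lows, highs] and average f over them (Eq. 10).
    rng = np.random.default_rng(seed)
    samples = rng.uniform(lows, highs, size=(K, len(lows)))
    return np.mean([f(x) for x in samples])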
Thus, the optimization method based on the Monte Carlo method can be applied to an objective function composed of multiple variables. However, in the multivariable case, if every variable is divided into two regions at once, the number of divisions of the search region becomes 2^n. For example, when n is 100, the number of divisions is 1.26 × 10^30. The actual calculation then becomes intractable as the number of variables n increases, because the integration must be repeated about 20 times for the solution to converge, so that 2^n × 20 integrations are required in total. To solve this problem, we first divide only the variable x_1 into two regions and fix the domains of the other variables. The integration value of each region is calculated by the Monte Carlo method, and the region with the smaller integration value is chosen.
Next, the selected region of x_1 is fixed, and only the region of x_2 is divided into two; the domains of the variables from x_3 onward remain fixed. The integration values are again computed by the Monte Carlo method and the region of x_2 is selected. By repeating this process for the remaining variables x_i, all variables converge to the optimal solution. In this case, the number of divisions is only 2 × n for n variables, and 2 × n × 20 = 40 × n even if 20 repetitions are required for convergence. Therefore, even if n is 100, the number of calculations is only 4000, and the computational cost is drastically reduced without exponential growth. The comparison is shown in FIG. 11. To apply this to the proposed model described above, the error between the true output and the predicted value on the TEST signals, defined below, is minimized instead of the integral value in order to obtain the optimum Δ_j.
Error = (1/2) (Truth value - Predicted value)^2   11)
The smaller the error, the higher the accuracy rate. For this loss function, there
is an option to use cross entropy or the like.
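A hedged sketch of the successive region-splitting search described above, applied to minimizing an error such as Eq. 11) over the widths Δ_j; the evaluation function and the fixed 20 sweeps are our assumptions. In practice, evaluate would itself be a Monte Carlo average, as in Eq. 10), of the prediction error over the Teaching/TEST split.

import numpy as np

def most_bisection(evaluate, lows, highs, sweeps=20):
    # evaluate(lows, highs) -> mean error over the box [lows, highs] (e.g., by Monte Carlo sampling).
    lows, highs = np.array(lows, float), np.array(highs, float)
    for _ in range(sweeps):                      # repeat until the boxes are narrow enough
        for j in range(len(lows)):               # split one variable at a time (2 x n boxes per sweep)
            mid = 0.5 * (lows[j] + highs[j])
            lo_box = (lows.copy(), highs.copy()); lo_box[1][j] = mid   # lower half of variable j
            hi_box = (lows.copy(), highs.copy()); hi_box[0][j] = mid   # upper half of variable j
            if evaluate(*lo_box) <= evaluate(*hi_box):
                highs[j] = mid                   # keep the half with the smaller error
            else:
                lows[j] = mid
    return 0.5 * (lows + highs)                  # center of the final box as the optimum Δ_j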
5. Verification of the Proposed Model on Actual Classification Problems
The proposed method combined with MOST was applied to three problems: (1) the Iris flower species classification problem, (2) the used car rank assessment problem, and (3) the abalone age classification problem. These three classification problems are sufficiently reliable as validation benchmarks for neural networks.
In this paper, we calculated the classification problems of Iris, used cars, and abalone using a network with two hidden layers between the input layer and the output layer. The ReLU function is applied to the first hidden layer and the Softmax function to the second hidden layer. The final output layer selects, as the output result, the class with the larger Softmax value. From the difference between this output and the flower species number (class number) of the actual learning data, the same squared error as in Eq. 11) is applied. To minimize the total error over the learning data, the weighting factors on the lines connecting the nodes and the bias nodes are optimized. The MOST method described above is used for the optimization. A bias is applied to the input layer and to the hidden layers, respectively.
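An illustrative sketch of the baseline network just described (first hidden layer with ReLU, second with Softmax); the layer sizes are placeholders, and the weights would in practice be optimized by MOST rather than set here.

import numpy as np

def forward(x, W1, b1, W2, b2):
    # First hidden layer with ReLU, second hidden layer with Softmax (baseline comparison network).
    h = np.maximum(0.0, W1 @ x + b1)
    z = W2 @ h + b2
    p = np.exp(z - z.max()); p /= p.sum()
    return p                      # class probabilities; the predicted class is p.argmax()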
5.2 Iris flower species classification problem
Data on four parameters of Iris, namely "Sepal length", "Sepal width", "Petal length", and "Petal width", are given, and the Iris samples are classified into three flower varieties, "Versicolor", "Setosa", and "Virginica", depending on their characteristics [18] and [19]. Each flower species is given a number: Versicolor: 1, Setosa: 2, and Virginica: 3. The comparison between the proposed method and neural networks on this Iris classification is shown in FIG. 13. Hidden layers 1 and 2 of the neural network each have 3 nodes. Including biases, the neural network has 27 weighting factors to determine. The proposed method requires only Δ_1 to Δ_4 to be determined, one for each of the inputs X_1 to X_4, so the number of variables to be optimized is smaller than that of a neural network. This feature reduces the computational load of the optimization. The inputs X_1 to X_4 are the four input data obtained from the UC Irvine Machine Learning Repository: "Sepal length", "Sepal width", "Petal length", and "Petal width". Examples are shown in Table 2. The 150 data consist of 50 "Versicolor", 50 "Setosa", and 50 "Virginica" data, of which 120 are used for training as Teaching signals and the remaining 30 are used for testing. For the TEST signals, 10 data per flower species were randomly selected from the original 150 data. Optimization with MOST requires specifying the search region and the number of random numbers used for the Monte Carlo integration. For SiNG, the search region is 0 < Δ_j < 1.0 and the number of random numbers is 50. For the neural networks, the search region is -2.0 < w_ij < 2.0 and the number of random numbers is 200.
Tables 3 and 4 show the results of optimizing the weighting coefficients w_ij of the neural networks and the widths Δ_j of SiNG.
Table 5 shows the actual flower species and the SiNG evaluation results for the 30 TEST signals using the optimized values. The three columns on the right of the table are the count variables P(k) of the groups to which each signal belongs, as defined in [STEP-5] of the basic algorithm. Each TEST signal may belong to as many as 40 groups. However, within the range of this calculation, it can be judged that an appropriate optimization was carried out, because no case spanned groups with different outputs. The prediction accuracy rates obtained by applying SiNG and the neural networks to the 30 TEST signals are compared in Table 6. The neural networks show good prediction accuracy, with a 99% correct rate in learning and a 93% correct rate in prediction. The results obtained by SiNG were 100% for both learning and prediction, confirming that more accurate results are obtained. It is a well-known fact that in neural networks, especially when the number of nodes is small, learning results differ from one optimization run to the next, and learning must be repeated to increase the accuracy rate. In SiNG, on the other hand, the reproducibility of the learned Δ_j is high, and repeated learning is not required.
5.3 Rating classification problem of used cars
The validity of the proposed method was verified by applying SiNG to the above three cases and comparing the results with those of neural networks.
6. Conclusions