INTRODUCTION
achieved great success. However, linear models suffer from low predictivity; more essentially, the success of machine learning is mainly due to the application of nonlinear models. It is therefore important to develop explanation methods that are applicable to both linearly separable and linearly inseparable features, and to investigate the difference between the learning of these two kinds of features.
where $x_{i_l}^{(l)}$ and $h_{i_l}^{(l)}$ are the output and the local field of the $i_l$th neuron in the $l$th layer, respectively, $w_{i_l i_{l-1}}^{(l)}$ is the weight connecting the $i_{l-1}$th neuron in the $(l-1)$th layer to the $i_l$th neuron in the $l$th layer, $f(x)$ is the neuron transfer function, and $N_{l-1}$ is the number of neurons in the $(l-1)$th layer. The weights in each layer are bounded in the interval $(-1,1)$ and are randomly initialized. Note that, for simplicity, we do not include neuron biases in the model; extending the analysis to cases with biases is straightforward.
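As a concrete illustration of this model, a minimal sketch of the forward pass of such a fully connected network without biases is given below; the layer sizes, array names, and function names are our own choices, while the weight initialization in $(-1,1)$ and the linear output layer follow the description above.

```python
import numpy as np

def init_weights(layer_sizes, rng):
    """Random weights in (-1, 1); w[l][i_l, i_{l-1}] connects layer l-1 to layer l."""
    return [rng.uniform(-1.0, 1.0, size=(n_out, n_in))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights, f=np.tanh):
    """Return the activations of every layer; the output layer is kept linear."""
    activations = [x]
    for l, w in enumerate(weights):
        h = w @ activations[-1]                   # local fields of layer l+1
        last = (l == len(weights) - 1)
        activations.append(h if last else f(h))   # linear transfer on the output layer
    return activations

rng = np.random.default_rng(0)
weights = init_weights([4, 8, 3], rng)            # toy layer sizes, for illustration only
outputs = forward(rng.uniform(-1, 1, 4), weights)
print(outputs[-1])                                # output-layer values x^(L)
```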
Assume that the training set consists of $P$ samples $\{(\boldsymbol{x}(\mu), \boldsymbol{y}(\mu)),\ \mu = 1, \cdots, P\}$, where $\boldsymbol{x}(\mu)$ is the input vector of the $\mu$th sample with components $x_{i_0}^{(0)}(\mu)$ $(i_0 = 1, \ldots, N_0)$, and $\boldsymbol{y}(\mu)$ defines its expected state in the output layer. Here we set $y_{i_L}(\mu) = 1$ if the sample belongs to the $i_L$th class (we call this neuron the label neuron of that class), and $y_{i_L}(\mu) = -1$ otherwise. For the sake of simplicity, we always apply a linear transfer function to the output-layer neurons, i.e., set $x_{i_L}^{(L)} = h_{i_L}^{(L)}$, while applying $f(h) = h$ and $f(h) = \tanh(h)$ to the hidden-layer neurons of linear neural networks (LNNs) and nonlinear neural networks (NNNs), respectively. We then define the cost function as
$$S(d) = \frac{1}{P N_L} \sum_{\mu=1}^{P} \sum_{i_L=1}^{N_L} \left( x_{i_L}^{(L)}(\mu)\, y_{i_L}(\mu) - d \right)^2, \qquad (2)$$
where $d$ is a parameter that controls the gap between the output of the label neuron and the outputs of the other neurons in the output layer. This cost function is equivalent to the margin commonly used by support vector machines. Minimizing the cost function to zero gives $x_{i_L}^{(L)}(\mu)\, y_{i_L}(\mu) = d$ for all samples (we call this relation the goal of training), but a sample is correctly classified as long as $x_{i_L}^{(L)}(\mu)$ is largest on its label neuron (we call this condition the goal of classification).
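A short sketch of the cost function of Eq. (2) and of the two goals defined above may be useful; it assumes, purely for illustration, that the network outputs are collected in a $P \times N_L$ array, and all function names are ours.

```python
import numpy as np

def cost(outputs, labels, d):
    """Eq. (2): outputs and labels are P x N_L arrays of x^(L)(mu) and y(mu)."""
    P, NL = outputs.shape
    return np.sum((outputs * labels - d) ** 2) / (P * NL)

def goal_of_training(outputs, labels, d, tol=1e-6):
    """x^(L)(mu) * y(mu) = d for every sample and every output neuron."""
    return np.all(np.abs(outputs * labels - d) < tol)

def goal_of_classification(outputs, class_index):
    """Each sample's output is largest on its label neuron."""
    return np.all(np.argmax(outputs, axis=1) == class_index)
```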
The WPA approach is applicable to all learning algorithms in principle. However, in order to clearly demonstrate how to control the transition from the linear learning mode to the nonlinear learning mode, we use a simple gradient-free algorithm, namely the Monte Carlo (MC) algorithm [21,22]. The algorithm is quite simple: select a weight randomly with equal probability, then change it randomly to a new value; accept the change if it reduces the cost function and discard it otherwise. The operation is repeated until the cost function is minimized. Because each update is judged on all samples, the cost function decreases monotonically. Moreover, since the neurons affected by a mutation are only those connected to the mutated weight by weight pathways, the computation time is acceptable for neural networks with a few hidden layers [21].
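The following is a minimal sketch of this Monte Carlo update for a network with a single hidden layer. For clarity it recomputes the full cost after each trial mutation, whereas an efficient implementation would update only the neurons connected to the mutated weight by weight pathways, as noted above; all names are ours.

```python
import numpy as np

def mc_train(X, Y, weights, d, steps, rng, f=np.tanh):
    """Accept a random single-weight mutation only if it lowers the cost S(d)."""
    def cost(ws):
        a = X.T                                    # (N0, P): inputs as columns
        for l, w in enumerate(ws):
            h = w @ a
            a = h if l == len(ws) - 1 else f(h)    # linear output layer
        return np.mean((a.T * Y - d) ** 2)

    s = cost(weights)
    for _ in range(steps):
        l = rng.integers(len(weights))             # pick a layer, then a weight, uniformly
        i, j = rng.integers(weights[l].shape[0]), rng.integers(weights[l].shape[1])
        old = weights[l][i, j]
        weights[l][i, j] = rng.uniform(-1.0, 1.0)  # mutate within the bound |w| <= 1
        s_new = cost(weights)
        if s_new < s:
            s = s_new                              # keep the mutation
        else:
            weights[l][i, j] = old                 # discard it
    return weights, s
```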
Since there is no need for differentiation and back-propagation in this algorithm, we can limit the range of the weights, for example by setting $|w_{i_l i_{l-1}}^{(l)}| \le 1$, and can likewise limit the input-output sensitivity of the network to avoid overlearning. Due to the restriction of the norm of the weights, the linearly separable feature of a sample cannot be amplified infinitely, and thus the crossover transition from the linear to the nonlinear learning mode can be clearly shown by increasing the parameter $d$. Furthermore, with the limitation on the input-output sensitivity of the network, the width of the network can be greatly increased without overlearning. In contrast, traditional algorithms, such as the back-propagation (BP) algorithm, do not have these properties. Therefore, although its training speed is not as fast as that of traditional algorithms, the MC algorithm is more suitable for studying the mechanism of machine learning, owing to its flexibility.
2 RESULTS
the $i_L$th neuron in the output layer is proportional to $c_{i_0 i_L}^{(i_l,l)} x_{i_0}^{(0)}(\mu)$, and the map,
$$H_{i_L}^{(i_l,l)}(\mu) = \sum_{i_0=1}^{N_0} c_{i_0 i_L}^{(i_l,l)}\, x_{i_0}^{(0)}(\mu), \qquad (4)$$
characterizes the contribution of the whole subnetwork to the local field $h_{i_L}^{(L)}(\mu)$. Note that in the particular case of $l = L$ (as in Fig. 1(a)), $H_{i_L}^{(i_L,L)}(\mu) = h_{i_L}^{(L)}(\mu)$ for an LNN, and $H_{i_L}^{(i_L,L)}(\mu) \propto h_{i_L}^{(L)}(\mu)$ for an NNN. Therefore, penetration coefficients characterize the influence of each component of an input vector on an output neuron through a specific subnetwork of weight pathways; they help us infer whether an input component's influence on that output neuron through the subnetwork is positive or negative. By decomposing a neural network into subnetworks, we can investigate how each part of the neural network works, individually and cooperatively, to achieve the goal of classification and the goal of training.
Suppose the samples are images (with a single grey channel) consisting of $M \times M$ bitmap pixels, and the representation vector of the $\mu$th sample is coded as $(x_{i_0}^{(0)}(\mu),\ i_0 = M\alpha + \beta,\ \alpha = 1, \cdots, M,\ \beta = 1, \cdots, M)$. By plotting a two-dimensional heat map as
$$l_{\alpha\beta}^{(i_l,l)}(i_L) = c_{i_0 i_L}^{(i_l,l)}, \qquad (5)$$
we obtain a visualization of the penetration coefficients. Note that a penetration coefficient $c_{i_0 i_L}^{(i_l,l)}$ can be either positive or negative, corresponding to p-pathway or n-pathway dominance respectively, and the corresponding pixels of the heat map $l_{\alpha\beta}^{(i_l,l)}(i_L)$ are displayed positively (in shades of red pseudo-color) or negatively (in shades of blue pseudo-color), respectively. The resulting heat-map image shows patterns of positively or negatively displayed regions, corresponding to regions of input pixels connected to the output neuron with positive or negative penetration coefficients. With these patterns, one can infer the enhancement or suppression effect on a given sample image at the pixel level by following the characteristic map. The visualization can be considered a view of the internal structure of a subnetwork. We thus call the heat map $l_{\alpha\beta}^{(i_l,l)}(i_L)$ the radiograph of the subnetwork pertaining to the $i_l$th neuron in the $l$th layer, and call the patterns in the radiograph the mode of the subnetwork, or the mode of this neuron.
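For a three-layer network, these quantities can be sketched as follows, assuming (as the relation $H_{i_L}^{(i_L,L)}(\mu) = h_{i_L}^{(L)}(\mu)$ for an LNN suggests) that the penetration coefficient of a weight pathway is the product of the weights along it; the function names are ours.

```python
import numpy as np

def penetration_coeffs_output(w1, w2, i2):
    """c^{(i2,2)}_{i0 i2}: subnetwork of output neuron i2 of a three-layer network.
    Sum over hidden neurons of the weight products along each pathway."""
    return w2[i2, :] @ w1                          # shape (N0,)

def penetration_coeffs_hidden(w1, w2, i1, i2):
    """c^{(i1,1)}_{i0 i2}: subnetwork of pathways passing through hidden neuron i1 only."""
    return w2[i2, i1] * w1[i1, :]                  # shape (N0,)

def radiograph(coeffs, M):
    """Eq. (5): reshape the coefficients of an M x M image into a 2-D heat map."""
    return coeffs.reshape(M, M)

def characteristic_map(coeffs, x):
    """Eq. (4): H(mu) = sum_i0 c_{i0} x_{i0}(mu) for one input vector x."""
    return float(coeffs @ x)
```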
We illustrate the basic principles of the WPA approach with three sets of toy samples. The input samples are $100 \times 100$ bitmaps, and all contain an identical circle at different positions. The pixels inside the circle (the face zone) are assigned a value of $e = 1$, 2, or 3, while the pixels outside the circle (the ground zone) are always assigned a value of $-1$. Fig. 2(a) shows the three samples of the first training set, where circles with values $e = 1$, 2, and 3 are shown in yellow, light brown, and dark brown, respectively. There is no overlap between the face regions in the first training set. Fig. 2(b) shows the second training set; the difference from the first one is that the face regions of the three samples completely overlap with each other. Fig. 2(c) shows the third set of training samples; the difference from the first set is that the face regions of the three samples partially overlap. Each sample can be vectorized as an $N = 100 \times 100 = 10^4$ dimensional vector. We train a $10^4$-200-3 neural network to perform the classification task; the cost function is defined by Eq. (2). Each sample represents one class; the first to third output neurons are the label neurons of the first to third sample (class), respectively. Achieving the goal of classification means that $h_{i_2}^{(2)}(\mu)$ is the largest when $\mu = i_2$, while achieving the goal of training means that $h_{i_2}^{(2)}(\mu) y_{i_2}(\mu) = d$. It will be seen that the second sample of the second set is linearly inseparable, and the second sample of the third set has both linearly separable and inseparable features; all the other samples are linearly separable.
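For reference, the three toy training sets described above can be generated along the following lines; the circle radius and positions are our own illustrative choices, and only the structure (identical circles, face values $e = 1, 2, 3$, ground value $-1$, and non-overlapping, fully overlapping, or partially overlapping face zones) follows the text.

```python
import numpy as np

def circle_sample(center, radius, value, M=100):
    """An M x M bitmap: `value` inside the circle (face zone), -1 outside (ground zone)."""
    yy, xx = np.mgrid[0:M, 0:M]
    face = (xx - center[0]) ** 2 + (yy - center[1]) ** 2 <= radius ** 2
    return np.where(face, float(value), -1.0)

radius = 12                                        # illustrative radius
set1 = [circle_sample(c, radius, e) for e, c in    # set 1: disjoint face zones
        enumerate([(25, 25), (50, 70), (75, 25)], start=1)]
set2 = [circle_sample((50, 50), radius, e) for e in (1, 2, 3)]   # set 2: fully overlapping
set3 = [circle_sample(c, radius, e) for e, c in    # set 3: partially overlapping
        enumerate([(44, 50), (50, 50), (56, 50)], start=1)]

X1 = np.stack([s.ravel() for s in set1])           # P x N input matrix, N = 10^4
```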
Fig. 2 The visualization of penetration coefficients. (a)-(c) show the first, second, and third sample sets; the first line gives overlap images of the three samples. (d)-(f) show radiographs of penetration coefficients of subnetworks for the three sample sets, respectively. The first to third columns are for the first to third output neurons; the first to third rows are for the LNN30, the NNN30, and the NNN200, respectively. (g) shows radiographs of penetration coefficients of subnetworks of hidden-layer neurons classified by the LNI for the NNN30; the plots are arranged following the order of the matrix $\Pi(\mu|i_2)$ with $i_2, \mu = 1, 2, 3$. (h) is the same as (g) but for the NNN200. (i) shows radiographs of penetration coefficients of subnetworks of hidden-layer neurons of the $C_1$ to $C_6$ classes for the second sample set and the NNN30; here the first to third columns are for the first to third output neurons. (j) and (k) are the same as (g) and (h) but for the third sample set.
Fig. 3 The transition from the linear learning mode to the nonlinear learning mode. (a)-(c) show $\langle x_{i_L}^{(2)}(\mu) y_{i_L}(\mu)\rangle$ as a function of training time (MC steps) for the three sample sets, respectively. The dotted (blue), solid (green), and dashed (red) lines are for the LNN30, the NNN30, and the NNN200. (d)-(f) show the LNI for the three sample sets, respectively; the first, second, and third columns are for the LNN30, the NNN30, and the NNN200, respectively, and the parameter $i_2$ indicates the order of the output neuron. (g) The transition from the phase of the linear learning mode to the mixed phase of linear and nonlinear learning modes. The triangles, circles, and stars represent $\Pi(1|2)$, $\Pi(2|2)$, and $\Pi(3|2)$ obtained after these quantities have become stationary following a sufficiently long training time.
As a reference, we first train the three training sets with linear neurons, $f(h) = 0.002h$, and with $d = 30$; we refer to this neural network as LNN30. In Fig. 3(a)-3(c), we show the evolution of the local fields of the output neurons, $\langle x_{i_L}^{(2)}(\mu) y_{i_L}(\mu)\rangle$, with training time as blue dotted lines. Here, $\langle\cdot\rangle$ denotes the average over the three samples and the three output neurons. We see that the goal of training is reached for both the first and the third sets. For the second set, $\langle x_{i_L}^{(2)}(\mu) y_{i_L}(\mu)\rangle \approx 20$. Averaging only over the three samples, we find that $\langle x_1^{(2)}(\mu) y_1(\mu)\rangle \approx \langle x_3^{(2)}(\mu) y_3(\mu)\rangle \approx 30$ for the first and the third output neurons, but $\langle x_2^{(2)}(\mu) y_2(\mu)\rangle \approx 0$ for the second output neuron. It can be checked that the output of the second sample is never the largest.
We then train the training sets with nonlinear neurons, $f(h) = \tanh(0.002h)$, and with $d = 30$; we refer to this neural network as NNN30. The local fields of the output neurons, $\langle x_{i_L}^{(2)}(\mu) y_{i_L}(\mu)\rangle$, are shown as green solid lines in Fig. 3(a)-3(c). It can be seen that the goal of training is again completely fulfilled for the first and third training sets. For the second training set, $\langle x_{i_L}^{(2)}(\mu) y_{i_L}(\mu)\rangle \approx 28$. Again, averaging only over the three samples, we find that $\langle x_1^{(2)}(\mu) y_1(\mu)\rangle \approx \langle x_3^{(2)}(\mu) y_3(\mu)\rangle \approx 30$ and $\langle x_2^{(2)}(\mu) y_2(\mu)\rangle \approx 25$. Thus, both the goal of classification and the goal of training are fully achieved for the first and the third samples. The goal of training is not fully achieved for the second sample; however, it can be checked that $x_2^{(2)}(2)$ is larger than $x_2^{(2)}(1)$ and $x_2^{(2)}(3)$, i.e., the goal of classification is achieved. When $d$ is increased to $d = 200$ (we refer to this neural network as NNN200), the goal of classification is achieved for all training sets, but the goal of training is not fully achieved for any of them, as shown by the red dashed lines in Fig. 3(a)-(c). For the first and third sets, $\langle x_{i_L}^{(2)}(\mu) y_{i_L}(\mu)\rangle \approx 100$; in detail, it can be checked that $\langle x_1^{(2)}(\mu) y_1(\mu)\rangle \approx \langle x_2^{(2)}(\mu) y_2(\mu)\rangle \approx \langle x_3^{(2)}(\mu) y_3(\mu)\rangle \approx 100$ for these sets. For the second set, $\langle x_{i_L}^{(2)}(\mu) y_{i_L}(\mu)\rangle \approx 70$, and it can be checked that $\langle x_1^{(2)}(\mu) y_1(\mu)\rangle \approx 90$, $\langle x_2^{(2)}(\mu) y_2(\mu)\rangle \approx 50$, and $\langle x_3^{(2)}(\mu) y_3(\mu)\rangle \approx 75$, where $\langle\cdot\rangle$ is averaged over the three samples.
After the training has finished, we show the radiographs of the subnetworks of the three output neurons, $l_{\alpha\beta}^{(1,2)}(1)$, $l_{\alpha\beta}^{(2,2)}(2)$, and $l_{\alpha\beta}^{(3,2)}(3)$, for the LNN30, the NNN30, and the NNN200 in Fig. 2(d)-2(f), where the columns correspond to the output neurons and the rows correspond to the three types of neural network. For better visualization, we calculate an ensemble average of the penetration coefficients over 16 replicas of neural networks for each training set. We see that each visualization image clearly shows the patterns of all the samples in the training set, indicating that each subnetwork "holographically" encodes all samples of a training set. Some zones are positively displayed while others are negatively displayed, implying that the subnetwork of weight pathways establishes coherent structures for the enhancement or suppression of different features. The "holographic" structure reveals how a neural network stores information, and indicates that classification is achieved through the interaction between the input samples and the "holographic" structure. In the following, we explain how the "holographic" structures are formed and in what way they work.
2.2.2 The extraction of linearly separable features and the classification of hidden-layer neurons.
We study the first training set in this section. For this set, the characteristic map can be decomposed into four parts, $H_{i_2}^{(i_l,l)}(\mu) = c(G)x(G) + c(F_1)x(F_1) + c(F_2)x(F_2) + c(F_3)x(F_3)$, which respectively represent the contributions of the common ground zone (zone $G$), the face zone of the first sample (zone $F_1$), the face zone of the second sample (zone $F_2$), and the face zone of the third sample (zone $F_3$). Here $c(G)$, $c(F_1)$, $c(F_2)$, and $c(F_3)$ are obtained by summing the penetration coefficients, $\sum_{i_0} c_{i_0 i_L}^{(i_l,l)}$, over the $G$, $F_1$, $F_2$, and $F_3$ zones, respectively. From Fig. 2(d) we see that the polarity of each zone stays the same in all three rows, meaning that the modes of the output neurons of the LNN and of the NNNs with a small $d$ and a large $d$ are almost the same.
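Given boolean masks for the $G$, $F_1$, $F_2$, and $F_3$ zones, the zone coefficients and the resulting neuron mode are simply sums of the penetration coefficients over each zone and their signs; a sketch (with our own names) follows.

```python
import numpy as np

def zone_coefficients(coeffs, masks):
    """Sum the penetration coefficients of one subnetwork over each zone mask.
    `coeffs` is the flattened c_{i0} array; `masks` maps zone name -> boolean mask."""
    return {name: float(coeffs[mask.ravel()].sum()) for name, mask in masks.items()}

def neuron_mode(zone_c):
    """The mode of a neuron: the signs of (c(G), c(F1), c(F2), c(F3))."""
    return tuple(np.sign(zone_c[z]) for z in ("G", "F1", "F2", "F3"))
```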
How the goal of classification is achieved can be revealed by the patterns of the radiographs shown in Fig. 2(d). These patterns indicate the modes of the three output neurons. As positive and negative penetration coefficients represent enhancement and suppression effects, the signs of $c(G)$, $c(F_1)$, $c(F_2)$, and $c(F_3)$ are essential and are employed to represent the mode of a neuron. From Fig. 2(d) we see that $(\mathrm{sign}(c(G)), \mathrm{sign}(c(F_1)), \mathrm{sign}(c(F_2)), \mathrm{sign}(c(F_3))) = (-,+,-,-)$, $(+,-,+,-)$, and $(+,-,-,+)$ for the three output neurons, respectively. The input values in the four zones, $(x(G), x(F_1), x(F_2), x(F_3))$, are $(-1,1,-1,-1)$, $(-1,-1,2,-1)$, and $(-1,-1,-1,3)$ for the first, the second, and the third samples, respectively. So, as long as $c(F_1)$ is large enough, the conditions $H_1^{(1,2)}(1) > 0$, $H_1^{(1,2)}(2) < 0$, and $H_1^{(1,2)}(3) < 0$ can always be reached. A similar argument applies to the second and third output neurons; one can thus obtain a positive local field on the label neuron ($\mu = i_L$) while keeping the local fields of all non-label neurons ($\mu \ne i_L$) negative. Hence the goal of classification can always be achieved.
Following the modes of the output neurons, one can construct a linear classification hyperplane for each sample. For example, a vector in the input space can be selected as the classification vector whose components lying within the $G$, $F_1$, $F_2$, and $F_3$ zones have the signs $(-,+,-,-)$, i.e., the signs of $(c(G), c(F_1), c(F_2), c(F_3))$. Then the projection of the first sample onto this vector must be the largest among the three input samples. The existence of such classification vectors indicates that the samples in this set are all linearly separable. This conclusion is also confirmed by the fact that the samples in this training set can be classified by the LNN.
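As a quick numerical check of this argument, one can take the zone-level input values listed above and project the three samples onto a classification vector whose components have the signs $(-,+,-,-)$; the magnitudes below are arbitrary and only the signs matter.

```python
import numpy as np

# zone-level input values (x(G), x(F1), x(F2), x(F3)) for the three samples (see text)
samples = np.array([[-1, 1, -1, -1],
                    [-1, -1, 2, -1],
                    [-1, -1, -1, 3]], dtype=float)

# a classification vector for the first sample: signs (-, +, -, -); magnitudes arbitrary
v = np.array([-1.0, 5.0, -1.0, -1.0])

projections = samples @ v
print(projections)            # the first sample's projection is the largest
assert np.argmax(projections) == 0
```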
However, the goal of training, i.e., the requirement that $h_{i_2}^{(2)}(\mu) y_{i_2}(\mu) = d$ for each sample and each output neuron, imposes additional tasks on the hidden-layer neurons; they accomplish this goal by working cooperatively. Specifically, the hidden-layer neurons self-organize into classes that are specific to each sample and to each output neuron. One way to capture such self-organization is to count the number of hidden-layer neurons that give their largest contribution to each output neuron when a specific sample is fed to the neural network, and to study the distribution of such counts. We denote the number of largest-contribution hidden-layer neurons from input sample $\mu$ for the $i_2$th output neuron as $\Pi(\mu|i_2)$, and call it the largest-contribution hidden-layer neuron classification index, or simply the largest neuron index (LNI). Without hidden-layer neuron self-organization, the LNI of the label neuron would equal the total number of hidden-layer neurons, while that of the other output neurons would equal zero. Deviation from such behavior is an indication of the classification of the hidden-layer neurons.
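One plausible implementation of the LNI for a three-layer network, under the reading that the contribution of hidden-layer neuron $i_1$ to output neuron $i_2$ for sample $\mu$ is $w_{i_2 i_1}^{(2)} x_{i_1}^{(1)}(\mu)$, is sketched below; the names are ours.

```python
import numpy as np

def lni(hidden_outputs, w2):
    """Largest neuron index Pi(mu | i2) for a three-layer network.

    hidden_outputs: P x N1 array of x^(1)_{i1}(mu); w2: N2 x N1 output weights.
    For every hidden neuron and every output neuron, find the sample that gives the
    largest contribution w2[i2, i1] * x^(1)_{i1}(mu), then count the neurons per sample.
    """
    P, N1 = hidden_outputs.shape
    N2 = w2.shape[0]
    contrib = w2[None, :, :] * hidden_outputs[:, None, :]   # shape (P, N2, N1)
    best_mu = np.argmax(contrib, axis=0)                     # shape (N2, N1)
    Pi = np.zeros((P, N2), dtype=int)
    for i2 in range(N2):
        for mu in range(P):
            Pi[mu, i2] = np.count_nonzero(best_mu[i2] == mu)
    return Pi                                                # Pi[mu, i2] = Pi(mu | i2)
```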
Fig. 3(d) shows the training-time evolution of the LNIs of the first, second, and third output neurons in the LNN30 for the first training set. In each sub-plot, the solid, dashed, and dotted lines represent the number of largest-contribution neurons of the first, second, and third samples. Similar LNI evolutions for the NNN30 and the NNN200 are also shown in Fig. 3(d). We see that although the label neuron has the biggest LNI, i.e., $\Pi(\mu|i_2)$ has its biggest value when $\mu = i_2$, there are still many neurons that give their largest contribution to non-label output neurons. This fact reveals the differentiation and self-organization of the hidden-layer neurons.
The classification of hidden-layer neurons can provide more information on how the cost function is minimized. In Fig. 2(g)-2(h), we show the radiographs of the subnetworks of hidden-layer neurons classified by $\Pi(\mu|i_2)$ for the NNN30 (Fig. 2(g)) and the NNN200 (Fig. 2(h)). We denote the radiographs obtained in this way as $l_{\alpha\beta}^{(\Pi(\mu|i_2),1)}(\mu)$, where $\Pi(\mu|i_2)$ specifies the class of hidden-layer neurons. The radiographs for the LNN are similar to those of the NNN30 and are not shown here. Taking the first output neuron of the NNN30 as an example (the first column of Fig. 2(g)), we see that the radiograph $l_{\alpha\beta}^{(\Pi(1|1),1)}(1)$ (the first row and first column of Fig. 2(g)) shows the mode $(c(G), c(F_1), c(F_2), c(F_3)) \sim (-,+,-,-)$, which is the same as that of $l_{\alpha\beta}^{(1,2)}(1)$. With this mode, the characteristic map gives $H_1^{(\Pi(1|1),1)}(1) \sim (+,+,+,+)$, $H_1^{(\Pi(1|1),1)}(2) \sim (+,-,-,+)$, and $H_1^{(\Pi(1|1),1)}(3) \sim (+,-,+,-)$ for the first, the second, and the third samples. Since this class of neurons has the largest population (Fig. 3(d)), it can achieve the goal of classification, as discussed earlier; we call these hidden-layer neurons the dominant neurons. Meanwhile, the radiograph $l_{\alpha\beta}^{(\Pi(2|1),1)}(1)$ (the second row and first column of Fig. 2(g)) shows the mode $(c(G), c(F_1), c(F_2), c(F_3)) \sim (+,+,+,-)$; the characteristic map gives $H_1^{(\Pi(2|1),1)}(1) \sim (-,+,-,+)$, $H_1^{(\Pi(2|1),1)}(2) \sim (-,-,+,+)$, and $H_1^{(\Pi(2|1),1)}(3) \sim (-,-,-,-)$. The radiograph $l_{\alpha\beta}^{(\Pi(3|1),1)}(1)$ (the third row and first column of Fig. 2(g)) shows the mode $(c(G), c(F_1), c(F_2), c(F_3)) \sim (+,+,-,+)$; the characteristic map gives $H_1^{(\Pi(3|1),1)}(1) \sim (-,+,+,-)$, $H_1^{(\Pi(3|1),1)}(2) \sim (-,-,-,-)$, and $H_1^{(\Pi(3|1),1)}(3) \sim (-,-,+,+)$. Obviously, the differentiation of hidden-layer neurons provides more freedom to adjust the outputs.
For example, neurons of the $\Pi(2|1)$ class decrease the output of the third sample, and neurons of the $\Pi(3|1)$ class decrease the output of the second sample, driving $h_1^{(2)}(2)$ and $h_1^{(2)}(3)$ towards $-d$. For the NNN200, the radiographs of the output neurons show similar patterns (see the third row of Fig. 2(d)) as those of the NNN30; the time evolutions of the LNIs are also similar (see Fig. 3(d)), but the radiographs of the subnetworks of neurons classified by the LNI show certain significant differences. The $G$ zones of these radiographs almost all reverse their sign relative to those of the NNN30. This phenomenon is a result of self-organization aimed at further minimization of the cost function. From the patterns, one can also see how the self-organization is performed. For example, the $\Pi(2|3)$ class almost disappears, so the $G$ zone of the $\Pi(1|3)$ class has to become strongly positive in order to decrease the local field of the third output neuron when the first and the second samples are input. Note that, due to the limits placed on the norm of the weights, the cost function is not fully minimized to zero.
For the second training set, each sample has only a ground zone ($G$) and a face zone ($F$), and we can thus perform a detailed analysis of the modes of the neurons before reporting the results. In this case, a neuron has four possible modes, $(c(G), c(F)) \sim (+,-), (-,+), (+,+), (-,-)$. Since $x(G) = -1$ and $x(F) = 1, 2, 3$ for the three samples, the characteristic map $H_{i_2}^{(i_l,l)}(\mu) = c(G)x(G) + c(F)x(F)$ is determined directly by the neuron modes. For example, mode $(+,-)$ leads to $H_{i_2}^{(i_l,l)}(\mu) < 0$ and mode $(-,+)$ leads to $H_{i_2}^{(i_l,l)}(\mu) > 0$ for all three samples, while mode $(+,+)$ has $c(G)x(G) < 0$ and $c(F)x(F) > 0$, so that $H_{i_2}^{(i_l,l)}(\mu)$ may have different signs for different samples depending on the amplitudes of $c(G)$ and $c(F)$. In more detail, for a fixed value of $c(G)x(G) < 0$ and depending on the amplitude of $c(F)$, there are four possible sign combinations of $H_{i_2}^{(i_l,l)}(\mu)$ when the three samples are input, i.e., $(-,-,-)$, $(-,-,+)$, $(-,+,+)$, and $(+,+,+)$. None of these combinations can convert the output of the second sample to the largest value. This is because its face zone is identical to those of the other samples and its pixel value lies between those of the first and third samples; a monotonic transformation therefore cannot make its output the largest in any way. In other words, a single neuron or a single subnetwork alone cannot realize the classification of the second sample. This sample is thus linearly inseparable, which is confirmed by the fact that an LNN cannot classify it.
Fig. 4 The formation of the nonlinear learning mode. The right-hand and left-hand plots of (a) show the $(+,+)_3$ and $(-,-)_2$ modes of hidden-layer neurons. (b) and (c) show the number of neurons in the six classes (see text) for the NNN30 and the NNN200, respectively.
Therefore, the fact that the NNN can achieve the goal of classification implies that it makes use of combinations of neurons with different modes. As shown in Fig. 4(a), a pair of neurons with $(+,+)_3$ and $(-,-)_2$ modes can convert the output of the second sample into the largest, as long as the outputs for the first and third samples lie in the nonlinear region of $\tanh(h) \approx \pm 1$. In this case, the outputs for the first and third samples approximately cancel, while that for the second sample appears as a large positive quantity, since both neurons contribute a positive term. It is easy to check that the conversion can also be realized by combinations of $(+,+)_3$ and $(-,-)_3$, or $(+,+)_2$ and $(-,-)_2$, although these are less efficient at converting. We emphasize that the nonlinearity of the transfer function plays a key role here. With a linear neuron transfer function, $f(h) = kh$, the modes $(+,+)_3$ and $(-,-)_2$ must lead to a vanishing combined output for the second sample if one wants to offset the outputs of the first and the third samples.
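A small numerical illustration of this cancellation mechanism (the specific local-field values are ours and are chosen only so that the first and third samples fall in the saturated region of $\tanh$): two hidden neurons whose local fields are linear in the face value approximately cancel for samples 1 and 3 but add up for sample 2, whereas with a linear transfer function their combined output is the same for all three samples, so the second sample can never become strictly the largest.

```python
import numpy as np

# local fields of two hidden neurons for samples mu = 1, 2, 3 (linear in the face value)
hA = np.array([-4.0, 1.0, 6.0])    # increasing with the face value, a (+,+)-type neuron
hB = np.array([6.0, 1.0, -4.0])    # decreasing with the face value, a (-,-)-type neuron

nonlinear_sum = np.tanh(hA) + np.tanh(hB)
print(nonlinear_sum)               # ~[0.0007, 1.52, 0.0007]: sample 2 is clearly the largest

k = 0.5
linear_sum = k * hA + k * hB
print(linear_sum)                  # [1., 1., 1.]: with a linear transfer function all three
                                   # samples receive exactly the same combined output
```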
It is obvious that, for the first output neuron for example, the goal of training cannot be achieved by neurons with the mode $(-,-)_3$ alone, since this gives $0 > h_1^{(2)}(2) > h_1^{(2)}(3)$ and thus fails to approach the condition $h_1^{(2)}(2) = h_1^{(2)}(3) = -d$. Therefore, auxiliary neurons with different modes must be utilized. In Fig. 3(e) we show the LNI for this training set. It clearly confirms the differentiation of the hidden-layer neurons, as the non-vanishing $\Pi(1|i_2)$ and $\Pi(3|i_2)$ indicate that each output neuron has hidden-layer neurons in different modes. The fact that $\Pi(2|i_2)$ vanishes indicates that no single neuron can convert the output of the second sample into the largest. Note that neurons in a class $\Pi(\mu|i_2)$ may have several neuron modes. For example, neurons of $\Pi(1|1)$ may involve the modes $(-,-)_2$, $(-,-)_3$, and $(+,+)_4$, since all of them give their largest contribution to the first output neuron from the first input sample.
input sample. For a more detailed investigation, we can check the distribution of
neurons for all of the possible modes. Note that neuron modes of this training set can
be divided into six classes: (+, −), (+, +)1 and (−, −)4 leading to negative outputs
for all samples and are classified into one class (designated as C1 class); (−, +),
(+, +)4 and (−, −)1 leading to positive outputs for all samples and are classified into
another class (designated as C2 class); For the rest modes: (+, +)2 , (+, +)3 ,
(−, −)2 , and (−, −)3, each represents a different class (designated as C3 , C4 , C5 , and
C6 classes respectively). For the sake of simplicity, we calculate the number of hidden-
layer neurons of the six classes instead of the 10 modes for the NNN30 and NNN200,
and show them in Fig. 4(b)-4(c). This approach provides another classification standard
for checking the differentiation. The results reveal more details of the classification of
hidden-layer neurons than LNI does.
With these details, we can understand how the classification of hidden-layer neurons is essential for minimizing the cost function. Using the first output neuron as an example again, it can be seen that although $(-,-)_3$ accounts for a large proportion, neurons of all six classes exist. Neurons with mode $(+,+)_2$ raise the output of the third sample, driving the local fields towards $h_1^{(2)}(2) = h_1^{(2)}(3) = -d$. The $C_1$ class decreases the outputs globally, but can also adjust the amplitude of the contributions for different samples. In this way, the local fields evolve towards the goal of minimizing the cost function to zero. For this purpose, neurons with modes $(-,-)_3$ and $(+,+)_2$, as well as the modes in the $C_1$ class, are more essential; therefore, in the case of the NNN200, these modes remain while the others are suppressed dramatically. Due to the insufficient number of hidden-layer neurons, minimization of the cost function can only give $\langle x_{i_L}(\mu) y_{i_L}(\mu)\rangle \approx 70$, and due to the insufficient number of auxiliary neurons, $\langle x_1(\mu) y_1(\mu)\rangle \approx 90$, $\langle x_2(\mu) y_2(\mu)\rangle \approx 50$, and $\langle x_3(\mu) y_3(\mu)\rangle \approx 75$.
In Fig. 2(i), we show the radiographs $l_{\alpha\beta}^{(C_i,1)}(i_L)$ of the subnetworks of the six classes for the NNN30 as examples. We see that they are in good agreement with the patterns of the modes of each class. Note that the radiograph $l_{\alpha\beta}^{(i_L,2)}(i_L)$ of Fig. 2(e) is the result of the superposition of the radiographs of the six classes. For the first and the third output neurons, since hidden-layer neurons with modes $(-,-)_3$ and $(+,+)_2$ have the largest populations respectively, $l_{\alpha\beta}^{(1,2)}(1)$ and $l_{\alpha\beta}^{(3,2)}(3)$ show patterns of $(c(G), c(F)) \sim (-,-)$ and $\sim (+,+)$. For the second output neuron, neurons with modes $(+,+)_3$ and $(-,-)_2$ have roughly equal numbers (see Fig. 4(b)), so the patterns produced by them almost cancel each other. The pattern of $l_{\alpha\beta}^{(2,2)}(2)$, $(c(G), c(F)) = (+,-)$, should be caused by neurons of mode $(+,-)$ in the $C_1$ class; note that, according to Fig. 4(b), this class contains a large number of hidden-layer neurons.
Therefore, there is an essential difference between extracting linearly separable and linearly inseparable features. For a linearly separable feature, neurons can independently achieve the goal of classification; the differentiation of the hidden-layer neurons serves the purpose of achieving the goal of training. Specifically, for the first training set, an LNN can achieve the goal of classification with modes and radiographs similar to those of an NNN (see Fig. 2(d) and Fig. 3(d)). We call this way of learning the linear learning mode. For linearly inseparable features, for example the second sample of the second training set, the neural network must use a combination of neurons with different modes and rely on the nonlinearity of the neuron transfer function to convert the output into the largest; linear networks cannot achieve such a goal. We call this way of learning the nonlinear learning mode. In principle, the nonlinear learning mode can extract both linearly separable and inseparable features; however, it requires more network resources and hence is reserved for the extraction of nonlinear features or for reaching a difficult goal of training. As shown in Fig. 3(b), even in the case of $d = 30$, the goal $h_2^{(2)}(2) = d$ has not been strictly reached, because the number of auxiliary neurons is not sufficient (Fig. 4(c)).
In the third sample set, the second sample has both linearly separable features (the part of its face zone that does not overlap with the face zone of any other sample, which we shall call the unit linear feature of the sample, and the part that overlaps with some but not all of the other face zones) and linearly inseparable features (the part of its face zone that overlaps with the face zones of all the other samples). It can therefore be used to investigate how the neural network works when linearly separable and inseparable features coexist. The first and third samples have only linearly separable features. For convenience of the discussion below, we denote the unit-linear-feature zones of the three samples by $F_1^0$, $F_2^0$, and $F_3^0$, respectively.
Fig. 3(f) shows the training-time evolution of the LNI for the LNN30, the NNN30, and the NNN200 in the first to third columns, respectively. We see that the LNI of the NNN30 behaves almost exactly like that of the LNN30. Furthermore, Fig. 2(f) shows that the radiographs of these two neural networks are also qualitatively similar. Therefore, the NNN30 behaves almost identically to the LNN30, indicating that the nonlinear network invokes only the linear learning mode and extracts only the linearly separable features.
Radiographs of the output neurons can reveal how the linear learning mode works. For example, from $l_{\alpha\beta}^{(2,2)}(2)$ in Fig. 2(f) we see that $c(F_2^0)$ is positive, while $c(F_1^0)$ and $c(F_3^0)$ are negative. Following the characteristic map, the signs of the contributions of these zones to the outputs should be $H_2^{(2,2)}(1) \propto x(F_1^0)c(F_1^0) + x(G)c(F_2^0) + x(G)c(F_3^0) \sim (-,-,+)$, $H_2^{(2,2)}(2) \propto x(G)c(F_1^0) + x(F_2^0)c(F_2^0) + x(G)c(F_3^0) \sim (+,+,+)$, and $H_2^{(2,2)}(3) \propto x(G)c(F_1^0) + x(G)c(F_2^0) + x(F_3^0)c(F_3^0) \sim (+,-,-)$, respectively. Similar to the case of the first sample set, these modes provide the freedom to fulfill the goal of classification, for example by amplifying only the amplitude of the p-pathways in the $F_2^0$ zone.
The subnetworks of hidden-layer neurons classified by the LNI reveal more details. In the second column of Fig. 2(j) we show the radiographs of the three classes, $\Pi(1|2)$, $\Pi(2|2)$, and $\Pi(3|2)$, of hidden-layer neurons of the NNN30 with respect to the second output neuron. The non-vanishing $\Pi(1|2)$ and $\Pi(3|2)$ reveal the phenomenon of differentiation. The $\Pi(2|2)$ class holds the dominant population of hidden-layer neurons (see Fig. 3(f)), which confirms that the classification information is extracted by single neurons and should thus be the linearly separable feature. As a result, $l_{\alpha\beta}^{(2,2)}(2)$ is qualitatively identical to $l_{\alpha\beta}^{(\Pi(2|2),1)}(2)$, since the former is also dominated by the $\Pi(2|2)$ class. The role of the $\Pi(1|2)$ and $\Pi(3|2)$ classes is to assist the $\Pi(2|2)$ class in fulfilling the goal of training. For example, the center zones of $l_{\alpha\beta}^{(\Pi(2|2),1)}(2)$ and $l_{\alpha\beta}^{(\Pi(3|2),1)}(2)$ are positive, which gives positive contributions to the second output neuron for all the samples, of which that of the third sample is the largest since the pixel value of its face zone is the largest. To offset this positive contribution from the third sample, i.e., to lower $h_2^{(2)}(3)$, the face zone of the third sample in $l_{\alpha\beta}^{(\Pi(1|2),1)}(2)$ is negative, so the $\Pi(1|2)$ class can decrease the output for the third sample, which provides the freedom to move $h_2^{(2)}(3)$ towards $h_2^{(2)}(3) = -d$.
These facts indicate that applying an NNN does not necessarily mean that linearly inseparable features are extracted when both linearly separable and inseparable features exist in the samples. The reason is that the linear learning mode is performed by hidden-layer neurons independently, while the nonlinear learning mode requires multiple neurons working cooperatively and is thus more difficult to manifest. So, if the goal of training can be achieved using only the linearly separable features, the neural network avoids activating the nonlinear learning mode. Therefore, one risks losing the information of the linearly inseparable features when linearly separable and inseparable features coexist, even when using an NNN.
However, since we limit the norm of the weights, we also limit the products along weight pathways. Once the parameter $d$ exceeds a threshold, weight pathways connecting only the linearly separable zones can no longer fulfill the goal of training. In this case, the neural network has to initiate the nonlinear learning mode to extract the linearly inseparable features and further drive the local fields towards the goal of training. For the NNN200, we see from Fig. 2(f) and Fig. 3(f) that there is no substantial change for the first and the third output neurons in either the radiographs or the LNIs, indicating that the neural network still extracts information with the linear learning mode, since in these cases all the information is linearly separable. However, remarkable changes can be seen for the second output neuron. The center zone of $l_{\alpha\beta}^{(2,2)}(2)$ changes to the same pseudo-color as that of the second sample set ($l_{\alpha\beta}^{(2,2)}(2)$ for the two NNNs, see Fig. 2(e)). Fig. 3(f) shows that $\Pi(2|2)$ is no longer the largest; instead, $\Pi(1|2)$ becomes the largest, as in the case of the second sample set (see Fig. 3(e) for the two NNNs).
These facts indicate that the nonlinear learning mode is initiated for the second sample, i.e., the features of the center zone of this sample begin to contribute to the local field $h_2^{(2)}(2)$, which can be observed more clearly by studying the radiographs of the subnetworks of hidden-layer neurons classified by the LNI. Fig. 2(k) shows these images for the NNN200. It is seen that $l_{\alpha\beta}^{(2,2)}(2)$ and $l_{\alpha\beta}^{(\Pi(1|2),1)}(2)$ have the same patterns, as a result of the $\Pi(1|2)$ class being the dominant population of neurons. The radiograph of $\Pi(2|2)$ shows no substantial change relative to that of the NNN30, indicating that the main role of this class is still to extract the linearly separable features of the second sample. But the radiographs of $\Pi(1|2)$ and $\Pi(3|2)$, i.e., $l_{\alpha\beta}^{(\Pi(1|2),1)}(2)$ and $l_{\alpha\beta}^{(\Pi(3|2),1)}(2)$, have changed substantially from those of Fig. 2(j), indicating that a different learning mode is involved. For the $\Pi(1|2)$ class, by definition, each neuron gives its largest contribution to the second output neuron when the first sample is input. The center zone of its radiograph $l_{\alpha\beta}^{(\Pi(1|2),1)}(2)$ is negative, making the contribution of the third sample to this output neuron the most negative, or the smallest in value. In addition, from the radiograph we see that the unit-linear-feature zone $F_2^0$ is positive, which makes the contribution of the second sample larger. As a result, for this class we have $H_2^{(\Pi(1|2),1)}(1) > H_2^{(\Pi(1|2),1)}(2) > H_2^{(\Pi(1|2),1)}(3)$. A similar analysis can be applied to the $\Pi(3|2)$ class.
3 Application.
handwritten digits.
The first row of Fig. 5 shows 10 handwritten digits selected from the MNIST set. Since each sample is a 28×28 bitmap, we design a 784-600-P network with P = 3, 5, 10 to classify the first 3, the first 5, and all 10 samples, respectively, with control parameters d = 30 and β = 0.15. Radiographs of the label neurons, i.e., $l_{\alpha\beta}^{(i_L,2)}(i_L)$ for $i_L = 1, 2, \ldots, P$, are shown in the second to fourth rows of Fig. 5, respectively. The radiographs are obtained after the neural network has been trained until its cost function is less than 0.01. It can be seen that, in the case of only a few samples, the radiographs are similar to those of the toy sample sets, i.e., the patterns of all the digits in the training set appear distinctly in every radiograph, indicating that the "holographic" nature of the neural network remains. With an increasing number of samples, the pattern of the digit corresponding to a label neuron can still be positively recognized, while the patterns of the other digits gradually become less distinguishable. By following the progressive trend of the radiographs from P = 3 to P = 5 and to P = 10, one realizes that even for P = 10 the patterns of all the digits are indeed still there; they overlap with each other to form the negatively displayed region. In other words, the "holographic" structure always exists.
From the radiographs we see that the zones containing the unit features of the labeled digit (the zones of a digit that do not overlap with other digits), which represent the main part of the linearly separable feature of that digit, are positively highlighted, while the zones containing the unit features of the other, non-labeled digits are negatively displayed. The zones of the labeled digit that overlap with other digits are also positively displayed, but to a lesser degree than the unit features. Following the characteristic map, these patterns enhance the output of a sample on its label neuron while suppressing the outputs of samples of other classes on that output neuron. These patterns thus indicate how the linearly separable feature is extracted by the linear learning mode.
Fig. 6 The LNI for the NNNs with (a) d = 30 and (b) d = 75. The lines in each panel represent $\Pi(\mu|i_2)$ with $\mu = 1, \ldots, 10$ for the specific output neuron $i_2$; the dark line represents $\Pi(\mu|i_2)$ with $\mu = i_2$. The time length in each panel is $8 \times 10^6$ MC steps in (a) and $2 \times 10^7$ MC steps in (b).
of digits emerge in the radiographs. The positively highlighted parts of a digit represent the zones of the common linearly separable feature of a class of samples. Again, these radiographs show less distinctive features, and we see no significant changes from the 9th row to the 11th row. As revealed by Fig. 7(a), the extra improvement of the accuracy by the NNN is about 8%. Therefore, in the case of the MNIST data set, the linearly separable features are dominant, so the radiographs of MNIST are dominated by linearly separable features, and thus no obvious difference can be discerned between the radiographs with or without the nonlinear learning mode.
How to maximally extract the linearly separable and inseparable features should be an essential problem for obtaining the optimal neural network. Fig. 7(b) shows the accuracy of the three-layer nonlinear network for several values of β as a function of d. Here, the accuracy at a given value of d is defined in the same way as the final accuracy in Fig. 7(a); that is, for each value of d we train the neural network and record the maximum accuracy reached during the training process. We see that, for a given β, increasing d initially increases the accuracy until it reaches a maximum, while further increasing d actually decreases the accuracy, which is a sign of overlearning. Similarly, at a fixed d, increasing β initially increases the accuracy, but increasing it further causes overlearning and the accuracy decreases. It appears that understanding and controlling overlearning is the key to the optimal solution.
3.2 Increasing the width or depth of neural networks to balance the linearly separable and inseparable features
The remaining questions are then what the optimal distribution is and what factors determine it. We will show that increasing the width and the depth of neural networks provides two effective ways to obtain the optimal distribution.
To show the width effect, we study a 784-N1-10 neural network trained on the first 600 MNIST samples with β = 0.15. Here, the value of β is the optimal value from the last section with N1 = 600, and we have checked that it remains optimal for the values of N1 used in this section. In Fig. 8(a), the circles and triangles show the accuracy as a function of d for N1 = 1200 and N1 = 1800, where each data point is the highest accuracy achieved at that value of d. Fig. 8(b) shows the optimal accuracy as a function of the width, where an optimal d is searched for and used in its calculation. We see that the accuracy increases with the width and gradually tends to saturate around N1 = 1500.
In Fig. 8(c), we show the distribution of the outputs of hidden-layer neurons for N1 = 1200 and 1800. Fig. 8(d) shows the height of the peaks $\bar{h}$ (averaged over the values at x = ±1) as a function of the width. It can be seen that $\bar{h}$ decreases with N1.
Fig. 8. The improvement of network performance with width and depth. (a) The accuracy as a function of d for N1 = 1200 and N1 = 1800. (b) The maximum accuracy as a function of N1. (c) The distribution of the outputs of hidden-layer neurons for N1 = 1200 and N1 = 1800 at the maximum accuracies. (d) The height of the peaks of the distribution as a function of N1 on a log-log scale. (e) The accuracy as a function of d for the four- and five-layer neural networks. (f) The distribution of the outputs of first-hidden-layer neurons for the four- and five-layer neural networks.
The reason that larger widths improve test accuracy can be understood in terms of balancing the extraction of linearly separable and inseparable features. The amounts of linearly separable and inseparable features in the training set are fixed. A small neural network does not have sufficient neurons to extract both the linear and the nonlinear features completely. Since the linearly separable features are dominant, the network is inclined to use more neurons to extract this kind of information, leading to an insufficient number of nonlinear neurons and a suboptimally small ratio of extremely nonlinear neurons. With increasing width, more neurons become available, some of which can be spared to extract the linearly inseparable features, and the ratio of extremely nonlinear neurons can thus increase to an optimal value. However, we can expect the total number of nonlinear neurons used for extracting the linearly inseparable features to stay approximately constant for even larger widths; we denote this number as $N_{non}$. The reason is that, as shown in Fig. 4, the nonlinear learning mode requires combinations of different neurons and thus occupies more resources of the neural network. Therefore, further increasing the width does not increase the number of neurons that extract linearly inseparable features. As a result, one can expect the ratio of extremely nonlinear neurons, characterized by the amplitude of the peaks around x = ±1, to decrease as $N_{non}/(N_1 - N_{non})$, or to scale as $1/N_1$ if $N_1$ is large enough; the extra neurons should concentrate towards the region of x = 0. These predictions are confirmed by Fig. 8(c) and 8(d). In this situation, the neurons for extracting both linearly separable and inseparable features are approximately saturated; therefore, continuing to increase the width does not increase the test accuracy significantly, as can be seen in Fig. 8(b). In summary, with sufficient width, the neural network can extract both linearly separable and inseparable features while keeping the balance between the linear and nonlinear learning modes.
To show the depth effect, we study a four-layer 784-600-600-10 neural network and a five-layer 784-600-600-600-10 neural network. Fig. 8(e) shows the accuracy of these two neural networks as a function of d, and Fig. 8(f) shows the output distribution of neurons in the first hidden layer. Together with the result of the three-layer neural networks shown in Fig. 7(b), we see that the test accuracy increases with the depth of the neural network, while the number of extremely nonlinear neurons decreases with it.
Comparing Fig. 8(e) and Fig. 8(a), we see an important advantage of a deeper neural network over a wider one for practical applications. As shown in Fig. 8(a), the optimal value of d increases quite significantly with the width. To obtain the optimal test accuracy of a wider neural network, one would have to search over a wide range for the optimal d. In contrast, as shown in Fig. 8(e), the optimal value of d does not change with the depth of the neural network. In fact, there is a wide range of d that gives approximately optimal test accuracy for the five-layer neural network (triangles). It is difficult for us to apply our MC algorithm to training neural networks with more layers, but from the trend exhibited by the three- to five-layer neural networks we can speculate that, with a further increase of the depth, the optimal-solution region should appear as a perfect plateau over a large range of d values. As a result, it would be much easier to obtain the optimal solution by sparsely sampling the control-parameter space. Furthermore, once an optimal control parameter is found, one would not have to search for it again for deeper networks. This property is very favorable for neural networks using the softmax cost function. We suspect that this property may be at least one of the advantages of deep neural networks.
The mechanism by which the accuracy increases and the number of extremely nonlinear neurons decreases with increasing depth is similar to that of increasing the width, since adding layers also increases the total number of neurons in the network. In the last row of Fig. 5, we show the radiographs of the ten output neurons of the four-layer neural network, which indicate that the "holographic" structure of the network is retained. From the point of view of weight pathways, a deep neural network has many more weight pathways than a three-layer neural network with the same number of hidden-layer neurons. For example, a 784-1200-10 network has 784×1200×10 ≈ 9×10^6 weight pathways, while a 784-600-600-10 network has 784×600×600×10 ≈ 3×10^9 weight pathways. Therefore, deep networks have much more freedom to establish subnetworks and should provide more freedom to construct learning modes. This should be the reason why a deep neural network can invoke the nonlinear learning mode with a relatively small d. The details of this mechanism are still to be investigated in the future.
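The pathway counts quoted above are simply products of the layer widths; a one-line check confirms them.

```python
import numpy as np

def n_weight_pathways(layer_sizes):
    """Number of weight pathways from any input pixel to any output neuron:
    the product of all layer widths."""
    return int(np.prod(layer_sizes, dtype=np.int64))

print(n_weight_pathways([784, 1200, 10]))       # 9408000        (~9 x 10^6)
print(n_weight_pathways([784, 600, 600, 10]))   # 2822400000     (~3 x 10^9)
```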
Fig. 9 Visualization of the degree of attention for the NNN with d = 30. The first, second, and third lines are for the first, second, and third training sample sets, and the first, second, and third columns are for the first, the second, and the third samples.
to visualize the degree of attention. Fig. 9 shows the heat maps of the degree of attention of the NNN30 for the three sample sets. The information contained in the other plots can be easily interpreted in the same way.
The visualized degree of attention reveals many details of the learning process. In particular, we find that there exist two modes of attention, namely the face attention mode and the background attention mode. In Fig. 9, we see that the degree of attention differs from region to region. For the first sample of the first sample set, the background has the largest degree of attention, while the face region of the first sample has an attention that is larger than that of the third sample but smaller than that of the second. Therefore, to establish the subnetwork from the first sample to the first output neuron, the attention is indeed focused on the background; we call this mode the background attention mode. For the third sample, on the contrary, the face zone of the third sample has the largest attention, while the other regions have a lower degree of attention; we call this mode the face attention mode. It can be found that the training performs the background attention mode for the second sample of the first set, the first and second samples of the second set, and the first sample of the third set, while it performs the face attention mode for the others. Indeed, the degree of attention has a complex structure, particularly for samples with complex features. For example, for the second sample of the third set, an extremely high degree of attention is focused on the overlap area of the face zones of the first and the second samples.
Fig. 10. The visualization of the degree of attention. First to third lines: the visualization images of the degree of attention for the NNN with d = 30 trained on the first three, the first five, and all of the ten digit samples shown in the first line of Fig. 7. The last line shows the results of the NNN trained on 600 MNIST samples.
Usually, the identifiable information lies in the regions of both the target and the background. In the limit where $x_{i_0}(\mu) \to 0$ in the face zone, the neural network has to recognize the samples entirely through the geometry of the background, while in the opposite case of $x_{i_0}(\mu) \to 0$ in the ground region, the geometry of the face becomes the only recognizable feature of the sample, and the recognition should be performed by the face attention mode. Generally, these two modes of attention coexist, which is similar to the way the brain recognizes objects.
As can be seen, for example, from the heat maps of a specific sample, the degree of attention in the face zones of the other samples is not zero, which means that identifying a sample also requires the information of the others. This confirms the fact revealed by the weight-pathway analysis that the subnetwork of each sample also stores the information of the others, i.e., the holographic feature of the neural network.
The degree of attention is also helpful for analyzing the learning process on the MNIST data set. In the first to third lines of Fig. 10 we show the visualized degree of attention for the first three, the first five, and all of the ten digit samples shown in the first line of Fig. 7. In the last line of Fig. 10 we show the results for the first 600 samples with respect to the ten output neurons. We can see that both the background attention mode (the second plot in the second line) and the face attention mode (all the others) also appear here. In addition, the unit features of a sample, as well as those features having a high degree of overlap with other samples, usually receive more attention.
ACKNOWLEDGEMENTS
This work is supported by the NSFC (Grants No. 11975189, No. 11975190)
Author contributions:
H.Z. proposed the idea of WPA; F.S. performed all the simulations and proposed the
idea of the degree of attention of learning; all authors contributed significantly to
analysis with constructive discussions and manuscript preparation; H.Z., F.M. and
F.S. wrote the manuscript.
References
[1] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436-444 (2015).
[2] T. Poggio, A. Banburski, and Q. Liao, Theoretical issues in deep networks, PNAS 117(48), 30039 (2020).
[3] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM 64(3), 107 (2021).
[4] G. Alain and Y. Bengio, Understanding intermediate layers using linear classifier probes, arXiv:1610.01644 (2016).
[5] M. Gabrié, A. Manoel, C. Luneau, J. Barbier, N. Macris, F. Krzakala, and L. Zdeborová, Journal of Statistical Mechanics: Theory and Experiment 2019, 124014 (2019).
[6] R. Shwartz-Ziv and N. Tishby, Opening the black box of deep neural networks via information, arXiv:1703.00810 (2017).
[7] S. Chung, D. D. Lee, and H. Sompolinsky, Physical Review X 8, 031003 (2018).
[8] F. Gerace, B. Loureiro, F. Krzakala, et al., Generalisation error in learning with random features and the hidden manifold model, arXiv:2002.09339 (2020).
[9] Y. Yoshida, R. Karakida, M. Okada, and S.-I. Amari, Journal of Physics A: Mathematical and Theoretical 52, 184002 (2019).
[10] T. Hou and H. Huang, Statistical physics of unsupervised learning with prior knowledge in neural networks, Physical Review Letters 124, 248302 (2020).
[11] S. Mei, A. Montanari, and P.-M. Nguyen, PNAS 115, E7665 (2018).
[12] D. Bau, J.-Y. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, and A. Torralba, Understanding the role of individual units in a deep neural network, PNAS 117(48), 30071 (2020).
[13] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, and K.-R. Müller, Explaining deep neural networks and beyond: A review of methods and applications, Proceedings of the IEEE 109(3), 247 (2021).
[14] A. M. Saxe, J. L. McClelland, and S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, arXiv:1312.6120 (2013).
[15] S. Arora, N. Cohen, W. Hu, and Y. Luo, Implicit regularization in deep matrix factorization, Advances in Neural Information Processing Systems 32, 7413 (2019).
[16] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev, The building blocks of interpretability, Distill 3(3), e10 (2018).
[17] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, Network dissection: Quantifying interpretability of deep visual representations, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541-6549 (2017).
[18] L. Phillips, G. Goh, and N. Hodas, Explanatory masks for neural network interpretability, arXiv:1911.06876 (2019).
[19] D. Kalimeris, G. Kaplun, P. Nakkiran, B. Edelman, T. Yang, B. Barak, and H. Zhang, SGD on neural networks learns functions of increasing complexity, Advances in Neural Information Processing Systems 32, 3496 (2019).
[20] W. Hu, L. Xiao, B. Adlam, and J. Pennington, The surprising simplicity of the early-time learning dynamics of neural networks, arXiv:2006.14599 (2020).
[21] H. Zhao, A general theory for training learning machine, arXiv:1704.06885 (2017).
[22] H. Zhao, Inferring the dynamics of "black-box" systems using a learning machine, Science China Physics, Mechanics & Astronomy 64, 270511 (2021).
[23] L. F. Barrett, Seven and a Half Lessons About the Brain, Houghton Mifflin Harcourt, New York (2020).
[24] J. Cepelewicz, The Brain Doesn't Think the Way You Think It Does, Quanta Magazine, 24 Aug 2021 (https://www.quantamagazine.org/mental-phenomena-dont-map-into-the-brain-as-expected-20210824/)