
Neurocomputing 551 (2023) 126520


Additive autoencoder for dimension estimation


Tommi Kärkkäinen *, Jan Hänninen
Faculty of Information Technology, University of Jyväskylä, Finland

Article info

Article history:
Received 10 February 2023
Revised 15 May 2023
Accepted 30 June 2023
Available online 6 July 2023
Communicated by Zidong Wang

Keywords:
Autoencoder
Dimension reduction
Intrinsic dimension
Deep learning

Abstract

Dimension reduction is one of the key data transformation techniques in machine learning and knowledge discovery. It can be realized by using linear and nonlinear transformation techniques. An additive autoencoder for dimension reduction, which is composed of a serially performed bias estimation, linear trend estimation, and nonlinear residual estimation, is proposed and analyzed. Compared to the classical model, adding an explicit linear operator to the overall transformation and considering the nonlinear residual estimation in the original data dimension significantly improves the data reproduction capabilities of the proposed model. The computational experiments confirm that an autoencoder of this form, with only a shallow network to encapsulate the nonlinear behavior, is able to identify an intrinsic dimension of a dataset with low autoencoding error. This observation leads to an investigation in which shallow and deep network structures, and how they are trained, are compared. We conclude that the deeper network structures obtain lower autoencoding errors during the identification of the intrinsic dimension. However, the detected dimension does not change compared to a shallow network. As far as we know, this is the first experimental result concluding no benefit from a deep architecture compared to its shallow counterpart.

© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Dimension reduction is one of the main tasks in unsupervised learning [1]. Both linear and nonlinear techniques can be used to transform a set of features into a smaller dimension [2]. A specific and highly popular set of nonlinear methods are provided with autoencoders, AE [3], which by using the original inputs as targets integrate unsupervised and supervised learning for dimension reduction. The main purpose of this paper is to propose and thoroughly test a new autoencoding model, which comprises an additive combination of linear and nonlinear dimension reduction techniques through serially performed bias estimation, linear trend estimation, and nonlinear residual estimation. Preliminary, limited investigations of such a model structure have been reported in [4,5].

With the proposed autoencoding model, we consider its ability to estimate the intrinsic dimensionality of data [6–8]. According to [9], the intrinsic dimension can be defined as the size of the lower-dimension manifold where data lies without information loss. For the most popular linear dimension reduction technique, the principal component analysis (PCA), the information loss can be measured with the explained variance which is provided by the eigenvalues of the covariance matrix [10]. Indeed, the use of an autoencoder to estimate the intrinsic dimension can be considered a nonlinear extension of the projection method based on PCA [11]. However, in the nonlinear case the measures for characterizing the essential information, and how this is used to reduce the dimensionality, vary [12,13].

With the intrinsic dimension, for instance the data reduction step in the KDD process [14,15] would not lose information. In [16] it was concluded that a shallow autoencoder shows the best performance when the size of the squeezing dimension is approximately the intrinsic dimension. In this direction, techniques that are closely related to our work were proposed in [17], where the intrinsic dimension was estimated using an autoencoder with a sparsity-favoring l1 penalty and singular value proxies of the squeezing layer's encoding. In the experiments, the superiority of the autoencoder compared to PCA was concluded. This and the preliminary work by [18] applied a priori fixed architectures of the autoencoder and a different autoencoding error measure compared to our work. Here, multiple feedforward models are used and compared, with a simple thresholding technique to detect the intrinsic dimension based on the data reconstruction error.

* Corresponding author. E-mail address: [email protected] (T. Kärkkäinen).
https://doi.org/10.1016/j.neucom.2023.126520

1.1. Motivation of the autoencoding model

The proposed model is formally derived in Section 3.1, but the basic idea to apply sequentially linear and nonlinear transformation models for dimension reduction and estimation could be realized in many ways. However, in order to reduce the reconstruction error efficiently – and to reveal the intrinsic dimension – there is one pivotal selection, which is illustrated in Fig. 1: after the linear transformation, the residual for the nonlinear model's encoding should be treated in the original dimension.

Interestingly, our experiments reveal that the intrinsic dimension can be identified by using only a shallow feedforward network as the nonlinear residual operator in the additive autoencoding model. This results exactly from the two essential choices: (i) to include the linear operator in the overall transformation and (ii) to consider the unexplained residual in the original data dimension. This does not happen with the classical autoencoder (without the linear term) or if the nonlinear autoencoding would be applied in the reduced dimension after the linear transformation. These phenomena, with the models and techniques fully specified in the subsequent sections, are illustrated in Fig. 1. Therefore, in addition to exploring the capabilities of revealing the intrinsic dimension, we assess the advantages of applying deeper networks as nonlinear operators.

1.2. Contributions and contents

The main contribution of the paper is the derivation and evaluation of the additive autoencoding model. We provide an experimental confirmation of the new autoencoder's ability to reveal the intrinsic dimension and study the effect of model depth. In a minor role, mostly covered in the Supplementary Material (SM), we also discuss and provide an experimental illustration of the difficulties of currently popular deep learning techniques in realizing the potential of deep network models. Apparently, our results diversify views on the general superiority of deeper network architectures. They suggest that current and upcoming applications in deep learning and in the use of autoencoders could be improved by using an explicit separation of the linear and nonlinear aspects of the data-driven model.

The remainder of the paper is organized in the following manner: In Section 2, we provide a brief summary of the forms and applications of autoencoders and provide a preliminary discussion on certain aspects of deep learning. In Section 3, we discuss the formalization of the proposed method as a whole. In Section 4, we describe the computational experiments and summarize the main findings. In Section 5, we provide the overall conclusions and discussion. In the SM, we provide more background and preliminary material and, especially, report the computational experiments as a whole. Main findings solely covered in the SM confirm the quality of the implementation of the methods and especially indicate that different autoencoder models, as depicted in the previous section, could be used for the nonlinear residual estimation.

Fig. 1. Top row: Residual errors with the Wine dataset for the usual autoencoders with different numbers of hidden layers (left) and for the proposed, additive autoencoders (right). Bottom row: Comparison of the classical and proposed autoencoder with five hidden layers for the MNIST dataset. Here the x-axis contains the squeezing dimension and the y-axis the autoencoding (reconstruction) error. Deeper models provide lower autoencoding errors on the top left, but only with the additional linear operator as proposed here is the intrinsic dimension revealed on the top right and bottom figures. Improvement due to the depth of the model is significant on the top left but only moderate on the top right, where all models are strictly better than the linear PCA alone. On the bottom, the better capability of the proposed autoencoder to encapsulate the variability of MNIST compared to the classical approach is clearly visible.

2. Related work and preliminaries

In this section, we introduce the different variants of autoencoders, summarize their wide range of applications, and provide some preliminaries on deep learning techniques.

2.1. On autoencoders

Feedforward autoencoders have a versatile history, beginning from [19,20]. Their development as part of the evolution from shallow network models into deep learning techniques with various network architectures has been depicted in numerous large and comprehensive reviews [3,21,22]. Therefore, we will not go into details in what follows.

Deep feedforward autoencoding was highly influenced by the seminal work of Hinton and Salakhutdinov [23]. In what follows, we refer to this approach, where the encoder was composed of multiple feedforward layers until the squeezing dimension and estimation of the network weights was based on the least-squares fidelity, possibly with a Tykhonov type of squared regularization (penalization) term [24], as the classical autoencoder. The work in [23] emphasized the importance of pretraining and the usability of stacking (i.e., layer-by-layer construction of the deeper architecture). Such techniques, by directing the determination of weights to a potential region of the search space, particularly alleviate the vanishing gradient problem (see Section 5.9 in [3] and references therein).

As summarized, for example, by [25], many architectures and training variants for deep autoencoders (AE) have been proposed over the years. A summary of the main AE approaches is given in Table 1, where 'Architecture' refers to the AE's model structure, 'Learning principle' summarizes the essential basis of learning, 'Evaluation' contains the given information on the assessment of quality, efficiency and complexity, and 'Applications' points to the basic use cases within the cited papers.

In addition to the approaches given in Table 1, many other combinations and hybrid forms of AEs exist. Special AEs typically integrate concepts and techniques from relevant areas, for instance, autoencoder bottlenecks (AE-BN) that are based on Deep Belief Networks [26] and rough autoencoders where rough set based neurons are used in the layers [27]. In methodological combinations with AEs, the base case is to use an AE for dimension reduction before supervised learning. Interesting unsupervised hybrids are provided by clustering techniques that incorporate AEs for feature transformation [28,29] and unsupervised AE-based hashing methods that can be used for large-scale information retrieval [30]. More involved combinations include the use of an AE with unsupervised clustering and generative models [31,32] and for representation learning of various learning tasks, e.g., low-dimensional visualization, semantic compression, and data generation [33].

A wide variety of tasks and domains has been and can be addressed with autoencoders. AEs are typically used in numerous application domains in scenarios where transfer learning can be utilized – for example, in speech processing [55], time series prediction [56], fault diagnostics in condition monitoring [57–59], and machine vision and imaging tasks [60,61,33,62,63,44,64–68]. Further, AEs have been used for the estimation of data distribution [69]. Use of a variational AE for joint estimation of a normal latent data distribution and the corresponding contributing dimensions has been addressed in [70]. Yet another use case of autoencoders is outlier or anomaly detection [71,33,72–74]. Also data imputation has been realized using a shallow autoencoder in [75], with deeper models in [76–78], and especially for spatio-temporal data in [79,76,80–83]. More exotic use cases include (but are not limited to) the use of an AE to transfer domain knowledge in recommendation systems [84] and to construct efficient surrogates in astronomy [85].

2.2. Preliminaries on deep learning

Deep learning has been an extremely active field of research and development in the twenty-first century [86]. These techniques reveal repeatedly promising results in new application areas [87]. On the other hand, deep networks might be overly complex and equal performance could be reestablished after significant pruning [88]. Deep neural networks are sensitive to small variations in training and architectural design parameters, thereby requiring careful calibration [89] and meticulousness in analyzing and comparing different models [90]. These factors have caused, e.g., the emergence of automated neural architecture search techniques [91]. Therefore, it is important to improve our basic understanding of both the empirical and theoretical bases of deep learning [92]. Systematic methods and studies to analyze the behavior of deep neural networks are needed [93].

Indeed, there exist some fundamental aspects of deep learning in which theory and practice are not fully aligned. The first issue is the universal approximation capability, whose main shallow results are reviewed, for example, by Pinkus [94] and whose general importance was excellently summarized in [95, p. 363]: "We have thus established that such [feedforward] 'mapping' networks are universal approximators. This implies that any lack of success in applications must arise from inadequate learning, insufficient numbers of hidden units or the lack of a deterministic relationship between input and target." To put it succinctly, a sufficiently large width of one or at most two hidden layers is, according to the approximation theory, enough for an accurate approximation of a nonlinear function, which is implicitly represented by a discrete set of examples in the training data.

Another aspect is the dominant approach to training a deep network structure – that is, estimating unknown weights in different layers. This is realized by applying a certain form of the steepest gradient descent method with a rough approximation of the true gradient, using one observation (online or stochastic gradient descent) or a small subset of observations (minibatch). In classical optimization, even the batch gradient descent using the complete data is considered a slow algorithm but still a convergent one, provided that the gradient of the minimized function is Lipschitz continuous and the line search – that is, the determination of the step size (the learning rate in neural networks terminology) when moving to the search direction – satisfies specific decrease conditions as given in Theorems 6.3.2 and 6.3.3 in [96] and Theorem 3.2 in [97]. In deep learning, the step size may be fixed to a small positive constant [98] or be based on monitoring the first- and second-order moments of the gradient during the search with direct updates [99]. Often restricted to a fixed number of iterations and not assuring stopping criteria measuring fulfillment of the optimality conditions (see Section 7.2 in [96]), this implies that the actual optimization problem for determining the weights of a DNN might be solved inaccurately [100].

In a genuine supervised learning setting for a regression model or a classifier, inexact optimization can be tolerated when seeking the best generalizing network. Then, the search for weights that provide better minimizers of the cost/loss function is stopped prematurely when the test or validation error of the model begins increasing as an indication of overlearning. In such a case, we are not seeking the most accurately optimized network but the best generalizing model based on another error criterion at the meta-optimization level. Theoretically, however, generalization and optimality can be linked in certain respects: an outer-layer locally optimal network, independently of the level of optimality of the hidden layers, provides an unbiased estimate of the prediction error over the training data in the sense of mean, median, or spatial median [101].

In a typical use case, the goal of a dimension reduction through autoencoding is to obtain a reduced, compact representation that encapsulates the variation of data. Similar to linear dimension reduction techniques, when attempting to represent the given data accurately in lower dimensions, we aim for the best possible reconstruction accuracy. Then, the cost function that measures the autoencoding error (such as the least-squares error function) must be solved with sufficiently high optimization accuracy because of the direct correlation: the smaller the cost function, the better the autoencoder.
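To make the contrast above concrete, the following small sketch (illustrative only, not taken from the paper or its reference implementation; the quadratic test problem and all names are hypothetical) compares minibatch descent with a fixed learning rate and a fixed iteration budget against batch gradient descent whose step length is chosen by an Armijo-type backtracking line search and which stops on a gradient-norm test.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))
b = rng.normal(size=200)

def f(w):                      # least-squares cost 0.5*||Aw - b||^2 / N
    r = A @ w - b
    return 0.5 * float(r @ r) / len(b)

def grad(w):                   # exact (batch) gradient of f
    return A.T @ (A @ w - b) / len(b)

def minibatch_fixed_step(w, lr=1e-2, epochs=50, batch=10):
    """Deep-learning style: fixed small step, fixed number of passes, no stopping test."""
    n = len(b)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), n // batch):
            g = A[idx].T @ (A[idx] @ w - b[idx]) / len(idx)   # rough gradient estimate
            w = w - lr * g
    return w

def batch_gd_armijo(w, tol=1e-8, c1=1e-4, max_iter=1000):
    """Classical batch descent: backtrack until the sufficient-decrease condition holds."""
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:                           # optimality-condition test
            break
        t = 1.0
        while f(w - t * g) > f(w) - c1 * t * float(g @ g):    # Armijo condition
            t *= 0.5
        w = w - t * g
    return w

w0 = np.zeros(20)
print("fixed step:", f(minibatch_fixed_step(w0.copy())))
print("line search:", f(batch_gd_armijo(w0.copy())))
```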
Table 1
Summary of the main forms of autoencoders.

Denoising AE (DAE) [34–39]
  Architecture: Classical
  Learning principle: Add noise to inputs [34] or pretraining [34] or weight updates [35]
  Evaluation: Linear complexity for Marginalized Stacked Linear DAE [36]; higher imputation accuracy [38]
  Applications: Better higher-order representations [34]; time series classification [37]; generative models for genetic operators [39]; data imputation [38]

Sparse AE (SAE) [40] (see also Section 5.6.4 in [3])
  Architecture: Classical
  Learning principle: Introduce sparsity-favoring regularization and special solvers
  Evaluation: Reduced models with fewer active weights
  Applications: High-dimensional regression and classification problems, e.g., in Chemometrics

Separable AE (SAE) [41] (cf. Siamese neural networks that do the opposite and use shared weights)
  Architecture: Classical
  Learning principle: Pretrain a multilayer AE for speech and an on-the-fly multilayer AE for residual noise from the spectrogram with a nonnegativity constraint
  Evaluation: Quality of enhanced speech, analysis of noise distribution
  Applications: Separate speech and noise

Graph AE (GAE) [42–47]; Variational GAE [48] (see VAE next)
  Architecture: Shrinking and enlarging forward–backward transformation of the data graph (GNN); use an AE to encode-decode a Graph Convolution Network layer [43]; Masked GAE [46]; encode-decode both topology and node attributes [47]; GNN encoder + separate feature & label decoders [48]
  Learning principle: Reconstruction of the adjacency matrix in the least-squares sense [42] or cross-entropy sense [43]; apply weighted least-squares and graph Laplacian regularization [44,48]; regularize both topology and node attribute proximity [47]
  Evaluation: Use the latent representation to test preservation of topological features or node or graph structure; classification and clustering quality; linear training scalability [48]
  Applications: Suppressed graph encapsulation of linked data representing citations, emails, flights, species, social networks, blockchains, collaborations, functional magnetic resonance imaging (fMRI); estimation of missing feature information on nodes for link prediction [48]

Variational AE (VAE) [49–52]
  Architecture: Structurally classical, input and/or latent distributions, generative use emphasized [49,50]
  Learning principle: Stochastic fitting to samples and regularization of the latent space, e.g., with the Kullback–Leibler divergence; non-convex optimization problems [49,50]
  Evaluation: Recovery of the generated latent space [49], size of the latent space [49], quality of generated topics [50]; imbalanced classification quality [51,52]; quality of density estimation and number of active units [52]
  Applications: Generation of topics [50]; generation of samples for a minority class [51]

Regularized AE (RAE) [53]
  Architecture: Any
  Learning principle: Combination of the classical reconstruction error and a contractive regularizer (encoder-decoder derivative with respect to the input)
  Evaluation: Function approximation, recovery of manifolds
  Applications: Recovery of the data-generation density

Multi-modal AE [54]
  Architecture: Set autoencoder to process different data modalities, recurrent units
  Learning principle: Input-output order matching using the sum of the reconstruction error and a cross-entropy loss between set memberships
  Evaluation: Quality of set assignments and matching, quality of kernel density estimation
  Applications: Genes-to-proteins mapping
3. Methods

In this section, we describe the methods used here as part of the autoencoding approach. In the following account, we assume that a training set of N observations X = {x_i}_{i=1}^N, where x_i ∈ R^n, is given.

3.1. The autoencoding model

In mathematical modelling, linear and nonlinear models are typically treated separately [102]. Following [5], according to Taylor's formula, in the neighborhood of a point x_0 ∈ R^n, the value of a sufficiently smooth real-valued function f(x) can be approximated as

$$f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T(\mathbf{x} - \mathbf{x}_0) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\mathbf{x}_0)(\mathbf{x} - \mathbf{x}_0) + \ldots,$$

where ∇f(x_0) denotes the gradient vector and ∇²f(x_0) the Hessian matrix at x_0. According to Lemma 4.1.5 in [96], there exists z ∈ ls(x, x_0), z ∈ R^n, where ls refers to the line segment connecting the two points x and x_0, such that

$$f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)^T(\mathbf{x} - \mathbf{x}_0) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_0)^T \nabla^2 f(\mathbf{z})(\mathbf{x} - \mathbf{x}_0). \quad (1)$$

This formula yields the sufficient condition for x ∈ R^n to be the local minimizer of a convex f (whose Hessian is positive semidefinite), Theorem 2.2 in [97]: ∇f(x) = 0. However, another interpretation of the formula above is that we can locally approximate the value of a smooth function as a sum of its bias (i.e., constant level), a linear term, and a nonlinear higher-order residual operator. This observation is the starting point for proposing an autoencoder that has exactly such an additive structure.

The bias estimation simply involves the elimination of its effect through normalization by subtracting the data mean and scaling each feature separately into the same range [−1, 1] with the scaling factor 2/(max(x) − min(x)). Thus, we combine the mean component from z-scoring and the scaling component from min–max scaling. The reason for this is that the unit value of the standard deviation in z-scoring does not guarantee equal ranges, and min–max scaling into [−1, 1] does not preserve the zero mean.

In the second phase, we estimate the linear behavior of the normalized data by using the classical principal component transformation [103]. For a zero-mean vector x ∈ R^n, the transformation to a smaller-dimensional space m < n spanned by m principal components (PCs) is given by y = U^T x, where U ∈ R^{n×m} consists of the m most significant (as measured by the sizes of the corresponding eigenvalues) eigenvectors of the covariance matrix. Thus, because of the orthonormality of U, the unexplained residual variance of the PC coordinates (i.e., the linear trend in R^m) in the original space (see SI, Section 7) can be estimated in the following manner:

$$\tilde{\mathbf{x}} = \mathbf{x} - \mathbf{U}\mathbf{y} = \mathbf{x} - \mathbf{U}\mathbf{U}^T\mathbf{x} = \left(\mathbf{I} - \mathbf{U}\mathbf{U}^T\right)\mathbf{x}. \quad (2)$$

This transformation is referred to as PCA. With erroneous data or data with missing values, the mean and the classical PCA can be replaced with their statistically robust counterparts [104].

In the third, nonlinear phase, we apply the classical fully connected feedforward autoencoder to the residual vectors in (2). We note that structurally, in comparison with the classical autoencoding model [23] (see Section 2.1), the difference according to (2) is the inclusion of the linear dimension reduction operator acting on the original dimension. To use the same data dimension in the linear approximation is suggested by the Taylorian analogy according to Eq. (1). Furthermore, one needs to return to the original dimension in the linear approximation, because otherwise (i.e., if we would consider the residual in the reduced dimension m) the overall reconstruction error would be constrained by the accuracy of the linear part. Here, instead, when the residual data is represented in the original dimension, both the linear and nonlinear parts of the additive autoencoder contribute to the reduction of the autoencoding error in the original data dimension.

As anticipated by the scaling, the tanh activation function f(x) = 2/(1 + exp(−2x)) − 1 is used. This ensures the smoothness of the entire transformation and of the optimization problem of determining the weights. The currently popular rectified linear units are nondifferentiable [101] and, therefore, are not theoretically compatible with the gradient-based methods (Section 6.3.1 in [98]). The importance of differentiability was also noted in [105].

The formalism introduced by [24] is used for the compact derivation of the optimality conditions. By representing the layerwise activation using the diagonal function-matrix F = F(·) = Diag{f_i(·)}_{i=1}^m, where f_i ≡ f, the output of a feedforward network with L layers and linear activation on the final layer reads as

$$\mathbf{o} = \mathbf{o}^L = \mathcal{N}(\tilde{\mathbf{x}}) = \mathbf{W}^L \mathbf{o}^{(L-1)}, \quad (3)$$

where o^0 = x̃ for an input vector x̃ ∈ R^{n_0}, and o^l = F(W^l o^{(l−1)}) for l = 1, …, L − 1. To allow the formal adjoint transformation to be used as the decoder, we assume that L is even and that the bias nodes are not included in (3). The dimensions of the weight matrices are then given by dim(W^l) = n_l × n_{l−1}, l = 1, …, L. In the autoencoding context, n_L = n_0 and n_l, 0 < l < L, define the sizes (the number of neurons) of the hidden layers with the squeezing dimension n_{L/2} < n_0.

To determine the weights in (3), we minimize the regularized mean least-squares cost function of the form

$$\mathcal{J}\left(\{\mathbf{W}^l\}_{l=1}^{L}\right) = \frac{1}{2N}\sum_{i=1}^{N}\left\|\mathbf{W}^L\mathbf{o}_i^{(L-1)} - \tilde{\mathbf{x}}_i\right\|^2 + \frac{\alpha}{2\sqrt{\sum_{l=1}^{L}\#\left(\mathbf{W}^l_1\right)}}\sum_{l=1}^{L}\left\|\mathbf{W}^l - \mathbf{W}^l_0\right\|^2, \quad (4)$$

where ‖·‖ denotes the Frobenius norm and #(W^l_1) the number of rows of W^l. Let α be fixed to 1e-6 throughout; to simplify the notations, we define β = α/√(Σ_{l=1}^L #(W^l_1)). The underlying idea in (4) is to average in both terms: in the first, the data fidelity (least-squares error, LSE) term, and in the second, the regularization term. Averaging the first term with 1/N implies that the term scales automatically by the size of the data subset, for instance, in minibatching, thereby providing an approximation of the entire LSE on a comparable scale. In the second term, because α is fixed, the inverse scaling constant 1/√(Σ_{l=1}^L #(W^l_1)) balances the effect of the regularization compared to the data fidelity for networks with a different number of layers of different sizes. Because (4) will be minimized with local optimizers, we simply use the initial guesses {W^l_0} of the weight values in the second term to improve the local coercivity of (4) and to restrict the magnitude of the weights, thereby attempting to improve generalization [106]. Because of the residual approximation, the random initialization of the weight matrices is generated from the uniform distribution U([−0.1, 0.1]).
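The three phases described above can be sketched in a few lines. The following NumPy code is a minimal illustration under the assumptions of Section 3.1 (toy data, a single tied-weight hidden layer, plain gradient descent instead of the optimizers used in the paper) and is not the authors' Matlab reference implementation; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 12))   # toy data, N x n

# Phase 1: bias estimation -- subtract the mean and scale each feature by 2/(max - min).
mu = X.mean(axis=0)
scale = 2.0 / (X.max(axis=0) - X.min(axis=0))
Xn = (X - mu) * scale

# Phase 2: linear trend -- m most significant principal components and the residual, Eq. (2).
m = 5
_, _, Vt = np.linalg.svd(Xn, full_matrices=False)
U = Vt[:m].T                                # n x m matrix with orthonormal columns
R = Xn - Xn @ U @ U.T                       # residual (I - U U^T) x in the original dimension

# Phase 3: shallow symmetric (tied-weight) autoencoder on the residual, squeezing size m.
n = Xn.shape[1]
W = rng.uniform(-0.1, 0.1, size=(m, n))     # initialization as in Section 3.1

def ae(R, W):                               # encode with tanh, decode with the adjoint W^T
    return np.tanh(R @ W.T) @ W

for _ in range(2000):                       # plain gradient descent on the mean LSE
    H = np.tanh(R @ W.T)                    # N x m hidden activations
    E = H @ W - R                           # reconstruction error of the residual
    gW = (H.T @ E + ((E @ W.T) * (1 - H**2)).T @ R) / len(R)
    W -= 0.05 * gW

X_hat = Xn @ U @ U.T + ae(R, W)             # additive reconstruction in the normalized space
mrse = np.sqrt(np.sum((Xn - X_hat) ** 2)) / len(Xn)   # MRSE as in Eq. (9)
print(mrse)
```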

The gradient matrices ∇_{W^l} J({W^l}_{l=1}^L), l = L, …, 1, for (4) are of the following form (see [24]):

$$\nabla_{\mathbf{W}^l}\mathcal{J}\left(\{\mathbf{W}^l\}\right) = \frac{1}{N}\sum_{i=1}^{N}\mathbf{d}_i^l\left(\mathbf{o}_i^{(l-1)}\right)^T + \beta\left(\mathbf{W}^l - \mathbf{W}^l_0\right), \quad (5)$$

where the layerwise error backpropagation reads as

$$\mathbf{d}_i^L = \mathbf{e}_i = \mathbf{W}^L\mathbf{o}_i^{(L-1)} - \tilde{\mathbf{x}}_i, \quad (6)$$

$$\mathbf{d}_i^l = \mathrm{Diag}\left\{(\mathcal{F}^l)'\left(\mathbf{W}^l\mathbf{o}_i^{(l-1)}\right)\right\}\left(\mathbf{W}^{(l+1)}\right)^T\mathbf{d}_i^{(l+1)}. \quad (7)$$

The use of different weights in the encoding – that is, in the transformation until layer L/2 – and in the decoding, from layer L/2 to L, implies more flexibility in the residual autoencoder but also roughly doubles the amount of weights to be determined. Therefore, it is common to use tied weights, which means that the formal adjoint (W^1)^T F((W^2)^T F(… (W^{L/2})^T …)) of the encoder is used as the decoder. Then, it is easy to see that the layerwise optimality conditions for l = 1, …, L/2 read as

$$\nabla_{\mathbf{W}^l}\mathcal{J} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{d}_i^l\left(\mathbf{o}_i^{(l-1)}\right)^T + \mathbf{o}_i^{(\tilde{l}-1)}\left(\mathbf{d}_i^{\tilde{l}}\right)^T\right) + \beta\left(\mathbf{W}^l - \mathbf{W}^l_0\right), \quad (8)$$

where l̃ = L − (l − 1). For convenience, we define L̃ = L/2 – that is, the number of layers to be optimized with the symmetric models. We note that when the layerwise formulae above are used with vector-based optimizers, we always need reshape operations to toggle between the weight matrices and a column vector of all weights.

Remark 1. Let us briefly summarize the use of the additive autoencoder for an unseen dataset after it has been estimated (and the corresponding data structures have been stored) for the training data through the three phases. First, data is normalized through mean subtraction and feature scaling into the same range [−1, 1]. Then, residuals according to formula (2) are computed and this residual data is fed to the feedforward autoencoder. Again due to (2), the reduced, m-dimensional representation of new data is obtained as a sum of its PC projection and the output of the autoencoder's squeezing layer. Formula (2) shows that the explicit formation of the residual data between the linear and nonlinear representations can be replaced by setting W̃^1 = W^1(I − UU^T) and using this as the first transformation layer of the autoencoder for the normalized, unseen data.

3.2. Layerwise pretraining

We apply the classical stacking procedure depicted, e.g., in Section 6.2 of [107]. A similar idea appears with the deep residual networks (ResNets) in [108], where consecutive residuals are stacked together using layer skips – for example, over two or three layers with batch normalization. However, in ResNets, the layer skips can introduce additional weight matrices, whereas the layer-by-layer pretraining follows the originally chosen network architecture.

The stacking procedure is illustrated in Fig. 2. For three hidden layers with two unknown weight matrices, W^1 and W^2, we first estimate W^1 with the given data {x̃_i}. Then, the output data of the estimated layer {W^1 x̃_i} are used as the training data (the input and the desired output) for the second layer W^2. Thereafter, the entire network is fine-tuned by optimizing over both weight matrices. The process from the heads to the inner layers is naturally enlarged for a larger number of hidden layers. We could then also apply partial fine-tuning – for example, to fine-tune the three hidden layers during the process of constructing a five-hidden-layer network. However, according to our tests and similar to [23], the layerwise pretraining suffices before fine-tuning the entire network. A special case of utilizing a simpler structure is the one-hidden-layer case: the symmetric model 1SYM with one weight matrix is first optimized to obtain W^1_* and then used in the form (W^1_*, (W^1_*)^T) as the initial guess for optimizing the nonsymmetric model 1HID with two weight matrices. Again, such an approach could be generalized to the multiple-hidden-layer case for the nonsymmetric, deep autoencoding model.

Fig. 2. Layerwise pretraining from heads to inner layers. The outermost layer is trained first and its residual is then fed as training data for the next hidden layer until all layers have been sequentially pretrained.

Remark 2. As stated in Section 2.1, stacking attempts to mitigate the vanishing gradient problem, which may prevent the adaptation of the weights in deeper layers. We assessed the possibility of such a phenomenon by studying the relative changes in the weight matrix norms |‖W^l_0‖ − ‖W^l_*‖|/‖W^l_0‖ while fine-tuning the symmetric autoencoders with 3–7 layers (3SYM, 5SYM, and 7SYM; see Section 4). The subscripts '0' and '*' refer to the initial and final weight matrix values, respectively. This study revealed that the relative changes in the weights in the deeper layers were not on a smaller numerical scale compared to the other layers. Apparently, the double role of the layers in the symmetric models as part of the encoder and the decoder, with the corresponding effect on the gradient as seen in formula (8), is also helpful in avoiding a vanishing gradient.
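As a complement to Fig. 2, the sketch below outlines the heads-to-inner-layers stacking for a three-hidden-layer symmetric model. It is a schematic illustration under stated assumptions, not the reference code: the helper train_tied is the kind of shallow tied-weight trainer sketched in Section 3.1, the encoding of the outer layer is taken here after the activation, and the names train_tied and pretrain_3sym are hypothetical.

```python
import numpy as np

def train_tied(data, k, W0=None, lr=0.05, iters=2000, rng=np.random.default_rng(2)):
    """Fit one tied-weight tanh layer of width k to reproduce its own input."""
    n = data.shape[1]
    W = W0 if W0 is not None else rng.uniform(-0.1, 0.1, size=(k, n))
    for _ in range(iters):
        H = np.tanh(data @ W.T)
        E = H @ W - data
        W -= lr * (H.T @ E + ((E @ W.T) * (1 - H**2)).T @ data) / len(data)
    return W

def pretrain_3sym(R, n_squeeze):
    """Layerwise pretraining of a three-hidden-layer symmetric model (sizes 2*n_squeeze, n_squeeze)."""
    W1 = train_tied(R, 2 * n_squeeze)            # outer layer, trained on the residual data itself
    H1 = np.tanh(R @ W1.T)                       # its encoding becomes the next training set
    W2 = train_tied(H1, n_squeeze)               # inner (squeezing) layer
    return W1, W2                                # fine-tuning over both matrices would follow

rng = np.random.default_rng(3)
R = rng.uniform(-0.5, 0.5, size=(300, 10))       # hypothetical residual data
W1, W2 = pretrain_3sym(R, n_squeeze=3)
print(W1.shape, W2.shape)
```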
3.3. Determination of intrinsic dimension

The basic procedure to determine the intrinsic dimension is to gradually increase the size of the squeezing layer and to seek a small value of the reconstruction error measuring the autoencoding error, with a knee point [109] indicating that the level of nondeterministic residual noise has been reached in autoencoding. We apply the mean root squared error (MRSE) to compute the autoencoding error:

$$e = \frac{1}{N}\sqrt{\sum_{i=1}^{N}\left|\mathbf{x}_i - \mathcal{N}(\mathbf{x}_i)\right|^2}, \quad (9)$$

where {x_i} is assumed to be normalized and N denotes the application of the autoencoder. This choice was made because in [110], the MRSE correlated better with the independent validation error than the usual root mean squared error (RMSE). In practice, the difference between the RMSE and the MRSE is only the scaling factor, 1/√N vs. 1/N. After the linear PC trend estimation, the MRSE is obtained by using (9) for the residual data defined in formula (2). Note that the reconstruction error is a strict error measure and its use requires higher accuracy from autoencoding compared to other measures: with the Wine dataset in Fig. 1, the linear PCA needs all dimensions of the rotated coordinate axis for the reconstruction, whereas already 10 principal components out of 13 would explain over 96% of the data variance.

An example of determining the intrinsic dimension of the Glass dataset (see the next section) is presented in Fig. 3. In the figure, the x-axis "SqDim" presents the squeezing dimension and the y-axis on the left the "MRSE" and on the right its change "Δ(MRSE)" (i.e., backward difference) for the symmetric model with one hidden dimension (1SYM) and the corresponding nonsymmetric model 1HID. The intrinsic dimension of data is detected by first locating a sufficiently small change in the autoencoding error (the right plot). For this purpose, a user-defined threshold τ = 4e-3 is applied. The detected dimension 5 on the left, marked with a circle, is the dimension below the threshold on the right minus one. The intrinsic dimension 5 is also characterized by a clear knee point in the MRSE behavior.

Fig. 3. Identification of the intrinsic dimension for the Glass dataset. The hidden dimension (plus one) on the left is captured by the sufficiently small error improvement on the right.

4. Results

The main focus of the computational experiments, which are fully reported in the Supplementary Material (SM), was to investigate the ability of the proposed additive autoencoder model to represent a dataset in a lower-dimensional space. Therefore, we confined ourselves to the use of Matlab as the platform (mimicking the experiments in [23]) to have full control over the realization of the methods in order to study the effects of different parameters and configurations. A reference implementation of the proposed method and its basic testing is available in GitHub¹.

We apply and compare the following set of techniques to approximate the nonlinear residual of the autoencoder, after normalization and identification of the linear trend: 1HID (model with one hidden layer and separate weight matrices for the encoder and the decoder), 1SYM (symmetric model with one hidden layer and a shared weight matrix), 3SYM (three-hidden-layer symmetric model with two shared weight matrices), 5SYM, and 7SYM. To systematically increase the flexibility and the approximation capability of the deeper models, the sizes of the layers for l = L̃, …, 1 are given below, where n_L̃ is the size of the squeezing layer:

3SYM: n_L̃ – 2n_L̃ – n,
5SYM: n_L̃ – 2n_L̃ – 4n_L̃ – n,
7SYM: n_L̃ – 2n_L̃ – 3n_L̃ – 4n_L̃ – n.

Note that for n_L̃ > n/2 the size of the second layer and, therefore, the dimension of the first intermediate representation, is larger than the input dimension for all these models.

4.1. Identification of the intrinsic dimension

The first purpose of the experiments was to search for the intrinsic dimension of a dataset via autoencoding. This was done using the shallow models 1SYM and 1HID. The optimization settings and visualization of all results are given in the online SM.

The experiments were carried out for two groups of datasets, one with small-dimension data (less than 100 features) and the other with large-dimension data (up to 1024 features). The datasets were obtained from the UCI repository [111], except the FashMNIST, which was downloaded from GitHub². For most of the datasets, only the training data was used; however, with Madelon, the given training, testing, and validation datasets were combined. The datasets do not contain missing values. The constant features were identified and eliminated according to whether the difference between the maximum and minimum values of a feature was less than √MEps, where MEps denotes the machine epsilon (this is a classically used numerical proxy of zero, see [96, p. 12]). Because of this preprocessing, the number of features n in Tables 2 and 3 is not necessarily the same as that in the UCI repository.

During the search, the squeezing dimension for the small-dimension datasets began from one and was incremented one by one up to n − 1. For the large-dimension cases, we began from 10 and used increments of 10 until the maximum squeezing dimension ⌊0.6·n⌋ was reached (cf. the "Red" values in Tables 2 and 3). The experiments were run with Matlab on a laptop with a 2.3 GHz Intel i7 processor and 64 GB RAM and on a server with a Xeon E5-2690 v4 CPU and 384 GB of memory.

¹ https://github.com/TommiKark/AdditiveAutoencoder
² https://github.com/zalandoresearch/fashion-mnist
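The thresholding rule of Section 3.3, applied throughout this section, can be written compactly. The following sketch uses hypothetical error values and only illustrates the detection logic (the dimension whose backward MRSE difference falls below τ, minus one); it is not part of the reference implementation.

```python
import numpy as np

def detect_id(sq_dims, mrse, tau):
    """sq_dims: increasing squeezing dimensions; mrse: the corresponding autoencoding errors."""
    d_mrse = -np.diff(mrse)              # backward differences, i.e., error improvements
    below = np.where(d_mrse < tau)[0]    # improvements that fall below the threshold
    if len(below) == 0:
        return None
    k = below[0] + 1                     # index of the dimension with a sufficiently small change
    return sq_dims[k] - 1                # detected intrinsic dimension: that dimension minus one

sq_dims = np.arange(1, 10)               # hypothetical search over squeezing dimensions 1..9
mrse = np.array([0.30, 0.20, 0.12, 0.06, 0.02, 0.018, 0.017, 0.016, 0.015])
print(detect_id(sq_dims, mrse, tau=4e-3))   # -> 5 for these synthetic values
```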

Table 2
Results of the identification of the intrinsic dimension for small-dimension datasets. The intrinsic dimensions were identified with the reduction rates varying between 0.41–0.54.
The SteelPlates and COIL2000 (with the most discrete feature profile) have the best reduction rate. The residual errors are between 1.1e-2–4.3e-4.

Dataset N n ID Red MRSE FeatProf (%)


Glass 214 10 5 0.50 1.3e-3 10–40-50–0
Wine 178 13 7 0.54 1.2e-3 0–46-54–0
Letter 20 000 16 8 0.50 9.4e-4 0–100-0–0
SML2010 2 763 17 9 0.53 5.4e-4 0–12-18–71
FrogMFCC 7 195 22 11 0.50 1.1e-3 0–0-5–95
SteelPlates 1 941 27 11 0.41 4.3e-3 11–11-56–22
BreastCancerW 569 30 14 0.47 6.9e-3 0-0–100-0
Ionosphere 351 33 17 0.52 1.9e-3 3–0-97–0
SatImage 6 435 36 18 0.50 4.3e-4 0–75-25–0
SuperCond 21 263 82 37 0.45 1.1e-2 2–1-12–84
COIL2000 5 822 85 35 0.41 2.8e-2 99–1-0–0

Table 3
Results of the identification of the intrinsic dimension for large-dimension datasets. The intrinsic dimensions were identified with the reduction rates varying between 0.39–0.55.
The HumActRec dataset with a continuous feature profile has the best reduction rate. The residual errors are between 8.0e-2–2.9e-3.

Dataset N n ID Red MRSE FeatProf (%)


USPS 9 298 256 130 0.51 2.9e-3 0–0-0–100
BlogPosts 52 397 277 130 0.47 3.9e-3 79–8-12–1
CTSlices 53 500 379 180 0.47 6.3e-2 8–4-10–78
UJIIndoor 19 937 473 200 0.42 7.0e-2 26–73-1–0
Madelon 4 400 500 250 0.50 7.9e-2 2–31-67–0
HumActRec 7 351 561 220 0.39 8.0e-2 0–2-0–98
Isolet 7 797 617 310 0.50 9.4e-3 1–6-14–80
MNIST 60 000 717 350 0.49 9.3e-3 9–14-77–0
FashMNIST 60 000 784 380 0.48 5.0e-2 0–2-98–0
COIL100 7 200 1 024 560 0.55 6.4e-3 0–11-89–0

Fig. 4. Identification of the intrinsic dimension for the Letter dataset and SuperCond dataset. Left: Clearly identified knee-point in ID = 8 for Letter. Right: Gradual decrease of
MRSE with ID = 37 for SuperCond.

Fig. 5. Identification of the intrinsic dimension for the MNIST dataset. The hidden dimension (plus one) on the left is captured by the sufficiently small error improvement on the right. Left: Gradual decrease of MRSE with ID = 350 for MNIST. Right: Zoom of MRSE improvement with the threshold τ = 3e-3 confirms the detection.

In Tables 2 and 3, we present the name of the dataset, the number of observations N, the number of features n, and the detected intrinsic dimension ID. The autoencoding error trajectories and thresholdings are illustrated for all datasets in the online SM. The detection threshold for small-dimension datasets was fixed to τ = 4e-3. The reduction rate ID/n for the intrinsic data dimension is reported in the Red column, and the autoencoding error of 1HID for ID according to (9) is included in the MRSE column. There is no averaging over the data dimension n in (9), so that for higher-dimension datasets this error is expected to remain larger. This was probably one of the reasons why, for large-dimension datasets, we needed to use two values of the threshold τ (based on visual inspection; see the zoomed illustrations in the SM): 3e-3 for USPS, BlogPosts, HumActRec, MNIST, and COIL100, and 3e-2 for the remaining five datasets.

Fig. 6. Identification of the intrinsic dimension for the FashMNIST dataset and COIL100 dataset. Left: Clear knee-point of MRSE with ID = 380 for FashMNIST. Right: More
gradual decrease of MRSE with ID = 560 for COIL100.

For the analysis, we also included a depiction of how discrete or continuous the set of features for a dataset is. We categorized the features into four groups based on the number of unique values (UV) each feature has: C1 = {UV ≤ 10}, C2 = {10 < UV ≤ 100}, C3 = {100 < UV ≤ 1000}, and C4 = {1000 < UV}. The FeatProf column in Tables 2 and 3 presents the proportions of C1–C4 in percentages.

4.1.1. Conclusions

Examples of the ID detection are given in Figs. 3 (Glass), 4 (Letter on the left, SuperCond on the right), 5 (MNIST), and 6 (FashMNIST on the left, COIL100 on the right). Identifications in the first two cases and for FashMNIST are characterized by clear knee-points in the IDs. For SuperCond, with a gradual decrease of the MRSE, the determination of ID is based on the mutual threshold value τ = 4e-3 of the small-dimension datasets. Also MNIST has such a behavior, and the zoom in Fig. 5 (right) illustrates the detection decision with τ = 3e-3.

Overall, the intrinsic dimensions were successfully identified for all tested datasets. The use of a feedforward network to approximate the nonlinear residual notably decreased the autoencoding error of the linear PCA. The overall transformation summarizing the essential behavior of data roughly halved the original dimension: the mean reduction rate over the 21 datasets was 0.48. The reduction rate was independent of the form of the features – that is, the best reduction rates for small-dimension datasets were obtained with the very categorical COIL2000 and the primarily continuous SteelPlates datasets. This suggests further experiments on how to treat different types and forms of features, for instance, to address whether different loss functions should be applied [76]. The best reduction rate, 0.39, was anyway obtained for HumActRec, which is characterized by a continuous feature profile. This indicates that we may obtain smaller reduction rates with more continuous sets of features with the proposed approach.

4.2. Comparison of shallow and deep models

The second aim of the experiments was to examine whether deeper network structures and deep learning techniques (the network structure and optimization of the weights) can improve the identification of the intrinsic dimension and the data restoration ability of the additive autoencoder. This aim is pursued as follows: here, we compare shallow and deep networks in cases where fine-tuning is performed using a classical optimization approach, i.e., using the L-BFGS optimizer with the complete dataset. In the SM, we report the results of using different minibatch-based approaches. Also detailed depictions of the parameter choices and visualization of the results for all datasets are included there.

In addition to visual assessment, we performed a quantitative comparison between the deep and shallow models. First, the MRSE values of all models were divided with the corresponding value of the 1HID model's error. This was done for the squeezing dimensions from the first until the next to last of ID, to cover the essential search phase of the intrinsic dimension. To exemplify relative performance, if the MRSE value of a model divided by the 1HID's value for a particular squeezing dimension would be 0.5, then such a model would have half the error level and, conversely, twice the efficiency compared to 1HID. Therefore, the model's efficiency is defined as the reciprocal of the relative performance.

The relative performances are illustrated in Figs. 7 and 8. Descriptive statistics of the efficiencies of the symmetric models are given in Tables 4 and 5. In each cell there, both the mean efficiency and the maximal efficiency are provided. The latter includes, in parentheses, the squeezing dimension where it was encountered.

4.2.1. Conclusions

During the early phases of searching the intrinsic dimension, deep networks provide smaller autoencoding errors compared to the shallow models. However, as exemplified in Fig. 7 (left) and as is evident from all illustrations in the SI, the MRSE in ID is not better for deeper models compared to 1HID. Therefore, the use of a deeper model would not change the ID values and, in fact, fluctuation of the error for the deeper models compared to 1HID may hinder the detection of a knee-point and negatively affect the simple thresholding.

Usually, the mean efficiency of the two deepest models, 5SYM and 7SYM, is better than that of 1SYM or 3SYM, but for many datasets, there is only a slight improvement. Overall, the plots for the relative efficiencies between different symmetric models and the quantitative trends in the rows of Tables 4 and 5 include varying patterns. The mean efficiency is highest for UJIIndoor, Ionosphere, Madelon, and COIL100, where the last three datasets are characterized by a high data dimension/number of observations, n/N, ratio (0.09, 0.11, and 0.08, respectively). The following are the grand means of the mean efficiencies over all 21 datasets for the symmetric models: 1SYM 0.97, 3SYM 1.19, 5SYM 1.38, and 7SYM 1.44. This concludes that deeper models improve the reduction of the autoencoding error during the search of ID. However, the speed of improvement decreases as a function of the number of layers.
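The quantitative comparison just described amounts to a simple elementwise computation. The sketch below uses hypothetical MRSE curves and only illustrates how the relative performance and efficiency values behind Figs. 7–8 and Tables 4–5 are obtained.

```python
import numpy as np

mrse_1hid = np.array([0.30, 0.20, 0.12, 0.06, 0.02])   # hypothetical errors over the search
mrse_5sym = np.array([0.24, 0.15, 0.10, 0.055, 0.02])

rel_perf = mrse_5sym / mrse_1hid        # < 1 means a lower error than 1HID
efficiency = 1.0 / rel_perf             # e.g., 2.0 = half the error level of 1HID
print(efficiency.mean(), efficiency.max(), efficiency.argmax())
```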

Fig. 7. Left: Behavior of the MRSE with all residual models for Ionosphere. Right: Relative performance of the models with respect to 1HID. During the search of the ID the
deeper models show clear improvement but the detected ID is the same for all models.

Fig. 8. Left: Relative performance of the models for BreastCancerW. Right: Relative performance of the models for COIL100. The deeper models show clear improvement over
the shallow ones but the detected ID stays the same and near ID the improved efficiency may be completely lost.

Table 4
Efficiencies of symmetric models for small-dimension datasets.

1SYM 3SYM 5SYM 7SYM


Dataset mean max (dim) mean max (dim) mean max (dim) mean max (dim)
Glass 0.91 1.03 (2) 0.91 1.10 (3) 1.18 1.31 (3) 1.15 1.40 (3)
Wine 0.97 0.99 (1) 1.07 1.22 (6) 1.25 1.59 (6) 1.36 1.75 (6)
Letter 1.00 1.00 (1) 1.03 1.06 (6) 1.10 1.14 (6) 1.11 1.15 (6)
SML2010 0.94 0.99 (1) 0.95 1.07 (5) 1.12 1.23 (3) 1.15 1.32 (3)
FrogMFCCs 0.99 1.00 (7) 1.04 1.17 (8) 1.10 1.23 (8) 1.11 1.23 (8)
SteelPlates 0.94 0.99 (4) 0.94 1.09 (5) 1.04 1.22 (5) 1.04 1.27 (5)
BreastCancerW 0.98 0.99 (11) 1.10 1.29 (13) 1.22 1.42 (12) 1.24 1.41 (11)
Ionosphere 0.91 0.97 (3) 1.04 1.26 (15) 1.74 3.19 (14) 2.07 4.06 (15)
Satimage 0.99 1.00 (10) 1.02 1.07 (17) 1.05 1.09 (17) 1.06 1.12 (1)
SuperCond 1.00 1.00 (30) 1.10 1.18 (33) 1.20 1.30 (29) 1.23 1.34 (26)
COIL2000 0.99 1.02 (16) 1.24 1.85 (32) 1.49 2.89 (29) 1.48 2.62 (30)

Table 5
Efficiencies of symmetric models for large-dimension datasets.

1SYM 3SYM 5SYM 7SYM


Dataset mean max (dim) mean max (dim) mean max (dim) mean max (dim)
USPS 0.99 1.00 (90) 1.08 1.16 (90) 1.13 1.21 (70) 1.14 1.23 (60)
BlogPosts 0.95 1.00 (110) 1.18 1.39 (70) 1.28 1.56 (80) 1.23 1.53 (70)
CTSlices 0.99 1.00 (170) 1.17 1.51 (150) 1.30 1.76 (150) 1.32 1.74 (150)
UJIIndoor 0.99 0.99 (60) 1.69 2.46 (150) 2.15 3.11 (130) 2.24 3.51 (130)
Madelon 0.97 1.00 (30) 1.38 3.74 (240) 2.29 5.77 (220) 3.01 8.02 (220)
HumActRec 0.99 1.00 (160) 1.15 1.31 (160) 1.22 1.40 (130) 1.24 1.40 (130)
Isolet 0.99 1.00 (290) 1.22 2.16 (290) 1.44 2.89 (270) 1.55 2.65 (270)
MNIST 0.99 1.00 (200) 1.32 1.99 (260) 1.42 2.20 (250) 1.41 2.25 (240)
FashMNIST 0.99 1.00 (320) 1.15 1.63 (370) 1.22 1.59 (350) 1.23 1.53 (350)
COIL100 0.98 1.00 (310) 2.18 11.19 (520) 1.96 8.51 (530) 1.79 2.63 (430)
COIL100-Min 0.98 1.00 (310) 1.75 8.05 (510) 1.79 4.22 (510) 1.87 2.63 (430)


Fig. 9. Left: Minimum autoencoding error of all models for COIL100. Right: Reduced relative performance of the models for COIL100. For COIL100, the use of the minimum autoencoding error to identify ID yielded more reasonable results.

Fig. 10. Left: RMSEs for BreastCancerW with the original 2–3–4 pattern for the hidden layers. Right: RMSEs for BreastCancerW with the 3–5–7 pattern for the hidden layers. Slightly smaller errors were encountered during the early search phase of the larger model on the right, but the detected ID and the overall behavior remained the same.

Actually, close to the intrinsic dimension, the benefits of deeper models may be completely lost. This is illustrated in Fig. 8 (right) and in Table 5 for COIL100 (see also, for example, the plots of BlogPosts and MNIST in the SI). The reason for such a behavior with COIL100 is the value of ID, 560, which was obtained with the smaller threshold τ = 3e-3 for large-dimension datasets. Therefore, with COIL100, we also tested an alternative approach to the identification of ID, where we apply the same thresholding technique (and the same τ) to the minimum autoencoding error of all the models. This error plot, the identified ID = 520 (for which the reduction rate would be 0.51), and the corresponding reduced set of relative efficiencies are illustrated in Fig. 9. The summary of the efficiencies for this modified way to identify ID is given on the last line "COIL100-Min" in Table 5. It can be concluded that for the largest dimensional dataset COIL100, the use of the minimum autoencoding error of the models yielded more reasonable results.

We used a fixed pattern for the sizes of the hidden layers in deeper models: 2–3–4 times the squeezing dimension for 7SYM, the first and last of these coefficients for 5SYM, and the first one for 3SYM. As reported in Section 4.1, the mean reduction rate over the 21 datasets was close to 0.5. Therefore, one may wonder whether this behavior is due to the fact that from this case onwards all the hidden dimensions are larger than the number of features, so that a kind of nonlinear kernel trick occurs. In other words, would a different pattern of the hidden dimensions change the results and conclusions here? This consideration was tested by considering a 3–5–7 pattern providing much more flexibility for the nonlinear operator compared to the used pattern. These tests are not reported as a whole, because the clearly identified trend of the results is readily exemplified in Fig. 10: an increase of the sizes of the hidden layers slightly improves the reduction rate during the early phase of the search but does not change the value of the ID.

4.3. Comparison of classical and additive autoencoder

As depicted in Section 3, the basic structural difference between the classical and the additive autoencoder is the inclusion of the PCA based linear dimension reduction operator, and the transformation back to the original dimension, in the latter model. One recovers the classical autoencoding model from the additive one by simply setting U = 0 in (2). Next, we study how this affects the autoencoding error and the training time. In order to solely compare the two model structures, we performed experiments for all datasets using the 5SYM architecture and exactly the same training (i.e., optimization) settings (see SM), either with U = 0 ('ClasAE') or in the proposed form ('PropAE'). As before, the tests were carried out over an increasing set of dimensions of the squeezing layer n_L̃, which started from one and were incremented one by one for the small-dimension datasets. For the large-dimension datasets, a data-specific increment of the order 10–40 starting from the squeezing layer's size 10–40 was used, so that altogether 10–15 squeezing layers were tested in all of these cases (exact specifications of the tested dimensions are given in the SM). The intrinsic dimension ID identified and reported in Tables 2 and 3 was used as the size of the last squeezing layer for all datasets.

Results of the comparison are summarized in Table 6 and illustrated, for the two large-dimension datasets, in Fig. 11 (all figures comparing RMSEs are given in the SM). In Table 6, for the compared models, the CPU time is reported in minutes and it includes the whole training time over all tested levels (for the proposed AE, also the time to compute U using PCA is included). For the latter CPU value in the fifth column, the CPU ratio 'rat' of column five to column two is reported in parentheses. For both models, the column 'RMSE' provides the final reconstruction error and the column 'J_0' includes the first value of the cost function (4) when training the models in the last tested dimension ID. The latter allows one to compare the effect of the layerwise pretraining through stacking as depicted in Section 3.2.
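The structural toggle studied in this comparison can be expressed compactly. In the sketch below (hypothetical helper names, a placeholder residual model instead of a trained network), the proposed reconstruction uses a PCA basis U, and replacing U with a zero matrix recovers the classical model that must explain the normalized data by the nonlinear part alone.

```python
import numpy as np

def reconstruct(Xn, U, residual_model):
    """Additive reconstruction: linear PCA part plus a nonlinear model of the residual."""
    linear = Xn @ U @ U.T                       # vanishes when U = 0 ('ClasAE')
    return linear + residual_model(Xn - linear)

def mrse(Xn, Xhat):                             # Eq. (9)
    return np.sqrt(np.sum((Xn - Xhat) ** 2)) / len(Xn)

Xn = np.random.default_rng(4).uniform(-1, 1, size=(100, 6))   # toy normalized data
U = np.linalg.qr(np.random.default_rng(5).normal(size=(6, 3)))[0]
shrink = lambda r: 0.5 * r                      # placeholder for a trained residual autoencoder
print("proposed:", mrse(Xn, reconstruct(Xn, U, shrink)))
print("classical:", mrse(Xn, reconstruct(Xn, np.zeros_like(U), shrink)))
```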

Table 6
Comparison of classical and additive autoencoder.

Classic AE Proposed AE
Data CPU J0 RMSE CPU (rat) J0 RMSE
Glass 4.6e-1 1.1e-1 1.62e-1 5.0e-1 (1.09) 9.0e-4 6.69e-3
Wine 7.1e-1 1.5e-1 2.77e-1 6.8e-1 (0.96) 2.7e-4 1.13e-2
Letter 9.7e0 1.1e-1 3.33e-1 9.7e0 (1.0) 2.5e-4 1.39e-2
SML2010 2.3e0 1.2e-1 1.67e-1 2.1e0 (0.91) 2.1e-4 9.65e-3
FrogMFCC 5.1e0 3.5e-2 1.70e-1 5.0e+0 (0.98) 1.3e-4 6.71e-3
SteelPlates 2.9e0 1.8e-1 2.25e-1 3.0e0 (1.03) 1.4e-3 8.78e-3
BreastCancerW 2.7e+0 7.8e-2 1.65e-1 2.7e+0 (1.0) 6.0e-5 5.71e-3
Ionosphere 2.9e+0 9.2e-1 2.98e-1 2.9e+0 (1.0) 1.5e-2 1.94e-2
SatImage 1.7e+1 1.1e-1 2.25e-1 1.7e+1 (1.0) 7.6e-7 7.21e-4
SuperCond 2.0e+2 1.6e-1 1.85e-1 1.9e+2 (0.95) 1.0e-4 9.92e-3
COIL2000 5.6e+1 3.2e-1 4.47e-1 5.7e+1 (1.02) 4.8e-3 1.90e-2
USPS 9 298 1.7e+2 1.4e0 6.27e-1 1.6e+2 (0.94) 9.6e-6 2.01e-4
BlogPosts 8.8e+2 9.0e-1 2.99e-1 8.0e+2 (0.91) 1.4e-4 4.82e-3
CTSlices 1.3e+3 6.8e0 1.51e+0 1.0e+3 (0.77) 7.5e-3 7.42e-2
UJIIndoor 5.7e+2 1.5e0 5.81e-1 5.2e+2 (0.91) 8.1e-3 7.50e-2
Madelon 1.8e+2 6.3e0 1.06e+0 1.7e+2 (0.94) 1.2e-2 3.88e-2
HumActRec 3.2e+2 7.2e-1 4.50e-1 2.9e+2 (0.91) 4.0e-5 5.58e-3
Isolet 7.0e+2 2.9e0 1.59e+0 5.7e+2 (0.81) 6.2e-4 8.34e-3
MNIST 3.1e+3 4.2e0 1.17e+0 2.4e+3 (0.77) 1.1e-3 1.27e-2
FashMNIST 3.7e+3 8.2e0 2.30e+0 2.9e+3 (0.78) 1.0e-2 5.35e-2
COIL100 1.4e+3 2.8e0 1.43e+0 1.1e+3 (0.79) 3.8e-3 9.06e-3

Fig. 11. 5SYM for CTSlices (left) and FashMNIST (right): behavior of the autoencoding error for the classical and additive autoencoder (left y-axis) and the training time (right y-axis) over a set of squeezing layer dimensions. Both the reconstruction errors and the CPU times are strictly smaller for the additive autoencoder compared to the classical one.

This comparison summarizes the additional benefits, in addition to the possibility of identifying the intrinsic dimension, of having the explicit linear operator in the autoencoder. The autoencoding error is always smaller, for all tested dimensions, for the proposed additive model. It reaches a much smaller reconstruction error in ID and, for the large-dimension datasets that require more computational effort, decreases the training time (the larger the overall CPU time, the larger the benefit, as shown, e.g., by 'rat' for the last three datasets). As exemplified in Fig. 11 (right y-axis) and confirmed by comparing columns three and six (the J0 values) of Table 6, both the better quality and the reduced CPU time of the additive AE are especially due to improvements in the quality of stacking during the shallow pretraining phase. This underlines the usefulness of the original idea of constructing a serial approximation of the data in the reduced dimension using an explicit separation of linear and nonlinear operators, both acting on the original dimension, as motivated in the beginning of Section 3.1.

Finally, in Section 4.2 of the SM, results from a second comparison that further studied the effect of the autoencoding model's structure and assessed our reference implementation are given. More precisely, with the nonsymmetric one-hidden-layer 1HID model and utilizing Matlab's own 'trainAutoencoder' method, which uses a sparsity regularizer based on the Kullback–Leibler divergence (cf. Table 1), we tested three different autoencoding pipelines: 1) apply Matlab's own AE routine in the reduced dimension after PCA and compute the reconstruction error with the inverse PCA; 2) apply Matlab's own AE, according to our proposition, to the data obtained after the linear trend estimation in the original data dimension; 3) use our own implementation of the suggested method. These tests were concluded as follows: in cases 2) and 3), both Matlab's AE and our AE worked similarly and retrieved the intrinsic dimension. This indicates that different autoencoding models, as depicted in Section 3.1, could be used for the nonlinear residual estimation in the additive model. As expected, large reconstruction errors without recovery of the intrinsic dimension were obtained if the nonlinear autoencoding model was operating directly on the reduced dimension after the linear transformation (case 1). Our implementation (which, unlike the Matlab routine, can be applied with any number and size of layers) was typically many times faster than Matlab's AE and completely stable, which was not the case with the proprietary routine.
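As an illustration of the difference between these pipelines, the sketch below (Python with scikit-learn rather than the authors' Matlab code) contrasts the additive idea, where PCA provides the explicit linear trend and a one-hidden-layer autoencoder is fitted to the residual in the original data dimension (cases 2 and 3), with an autoencoder applied only to the PCA-reduced scores (case 1). The synthetic data, the candidate dimension k, and the use of MLPRegressor as the nonlinear part are assumptions made for the example.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

def rmse(A, B):
    return np.sqrt(np.mean((A - B) ** 2))

rng = np.random.default_rng(1)
S = rng.normal(size=(1000, 5))                                  # 5-dimensional latent factors
X = S @ rng.normal(size=(5, 20)) + np.tanh(S) @ rng.normal(size=(5, 20))
k = 5                                                           # candidate squeezing dimension

# Additive scheme: linear trend via PCA (its internal mean plays the role of the bias
# term), then a nonlinear model for the residual, acting in the original 20-dim space.
pca = PCA(n_components=k).fit(X)
X_lin = pca.inverse_transform(pca.transform(X))
R = X - X_lin
ae_res = MLPRegressor(hidden_layer_sizes=(k,), activation="tanh",
                      max_iter=3000, random_state=1).fit(R, R)
X_hat_additive = X_lin + ae_res.predict(R)

# Case 1 for contrast: the nonlinear autoencoder operates on the k-dimensional PCA
# scores, and the reconstruction is mapped back through the inverse PCA.
Z = pca.transform(X)
ae_red = MLPRegressor(hidden_layer_sizes=(k,), activation="tanh",
                      max_iter=3000, random_state=1).fit(Z, Z)
X_hat_reduced = pca.inverse_transform(ae_red.predict(Z))

print("additive pipeline RMSE:  ", rmse(X, X_hat_additive))
print("reduced-dimension RMSE:  ", rmse(X, X_hat_reduced))

In this toy setting, any reconstruction of the second kind is bounded below by the PCA truncation error, whereas the additive pipeline can reduce the error further by explaining part of the residual, which mirrors the behavior reported above.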
4.4. Generalization of the autoencoder

In the last experiments, we demonstrate and evaluate the generalization of the additive 5SYM autoencoder. The search over squeezing dimensions is performed in a similar manner as that done in Section 4.2.

Fig. 12. Agreement of the training and validation set MRSE values for UJIIndoor (left) and MNIST (right). There is a large deviation between the training and validation errors on the left but a perfect match on the right. At the ID, a similar autoencoding error level is reached with both datasets.

We apply a small sample of datasets for which a separate validation set was given in the UCI repository. More precisely, we use Letter (size of training data N = 16000, size of validation data Nv = 4000, i.e., 80%–20% portions with respect to the entire data; number of nonconstant features n = 16), UJIIndoor (N = 19937; Nv = 1111, 95%–5% portions; n = 473), HumActRec (N = 7351; Nv = 2946, 71%–29% portions; n = 561), and MNIST (N = 60000; Nv = 10000, 86%–14% portions; n = 666). Note that because all data are used as is, we have no information or guarantees on how well the data distributions in the training and validation sets actually match each other.

As anticipated, both the training–validation portions and the data dimension affected the generalization results. For Letter, with its 80%–20% division between training and validation sizes and its small number of features, we witnessed a perfect match between the training and validation MRSE values. The same held true for MNIST, which is illustrated in Fig. 12 (right). The largest discrepancy between the training and validation errors, depicted in Fig. 12 (left), was obtained for UJIIndoor, which has the most deviating 95%–5% portions together with almost 500 features. This dataset also had one of the largest efficiencies (i.e., reduction potential) in Table 5. HumActRec was somewhere in the middle in its behavior, with a clearly visible deviation. Because of its data portions (roughly 70%–30%), the difference raises doubts regarding the quality of the validation set. Note that these considerations provide examples of the possibilities of autoencoders to assess the quality of data.

The visual inspection was augmented by computing the correlation coefficient between the MRSE values in the training and validation sets. The following values confirmed the conclusions of the visual inspection: Letter 1.0000, UJIIndoor 0.9766, HumActRec 0.9939, and MNIST 0.9999. Finally, an important observation from Fig. 12 is that when the squeezing dimension is increased up to the intrinsic dimension, the validation error tends to the same error level as the training error. Therefore, the additive autoencoder determined using the training data was always able to explain the variability of the validation data with comparable accuracy.
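This kind of check can be sketched in a few lines of Python. In the sketch below, PCA stands in for the full additive autoencoder, the synthetic training and validation sets, the dimension grid, and the per-sample error definition (one plausible reading of MRSE) are assumptions of the example, and only the agreement of the two error curves is of interest.

import numpy as np
from sklearn.decomposition import PCA

def recon_error(model, X):
    # Per-sample root squared reconstruction error, averaged over the set.
    X_hat = model.inverse_transform(model.transform(X))
    return np.mean(np.sqrt(np.mean((X - X_hat) ** 2, axis=1)))

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 30))                       # data with an 8-dimensional linear structure
X_tr = rng.normal(size=(2000, 8)) @ W + 0.01 * rng.normal(size=(2000, 30))
X_va = rng.normal(size=(500, 8)) @ W + 0.01 * rng.normal(size=(500, 30))

dims = range(1, 16)
err_tr, err_va = [], []
for k in dims:
    pca = PCA(n_components=k).fit(X_tr)            # fitted on the training set only
    err_tr.append(recon_error(pca, X_tr))
    err_va.append(recon_error(pca, X_va))

# Agreement of the two error curves, analogous to the per-dataset coefficients above.
r = np.corrcoef(err_tr, err_va)[0, 1]
print(f"correlation of training/validation error curves: {r:.4f}")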
5. Conclusions

This study illustrated a case where all main concerns with feedforward mappings, as quoted in Section 2.2 from [95, p. 363], were solved: learning was successful, the size and the number of hidden layers were identified, and the deterministic relationship within a dataset was found. Similar to [23], stacking was found to be an essential building block for estimating the weights of deep autoencoders. Learning and dimension estimation were based on a simply weighted, automatically scalable cost function with a compact layerwise weight calculus and a straightforward heuristic for determining the intrinsic data dimension. Intrinsic dimensions, with a low autoencoding error, were revealed for all tested datasets. A similar autoencoding error, and the corresponding intrinsic dimension, was obtained independently of the depth of the network. This was not obtained with the classical autoencoding model without the linear operator, or if the residual after the linear dimension reduction was processed further in the reduced-dimensional space. However, the experiments clearly indicated that other autoencoding techniques could be used for the nonlinear residual estimation in the additive form.

One clear advantage of the proposed methodology is the lack of meta-level parameters (e.g., the number and form of layers, the selection of the activation function, the detection of the learning rate) that are usually tuned or grid-searched when DNNs are applied. The only parameter that may need adjustment based on visual assessment is s, that is, the threshold for identifying the hidden dimension. Moreover, because of the observed smoothly decreasing behavior of the autoencoding error, the intrinsic dimension could be searched for more efficiently than just incrementally: one could attempt to utilize one-dimensional optimization techniques, such as a golden-section search and/or polynomial and spline interpolation, to identify the beginning of the error plateau more quickly.
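One simple way to realize this idea is sketched below in Python: because the autoencoding error decreases monotonically in the squeezing dimension, the first dimension at which the error falls below a chosen threshold can be located with a bisection-style search using only a logarithmic number of model fits; a golden-section search or interpolation of the error curve could be used in the same role. PCA is used as a cheap stand-in for the autoencoder, and the threshold value and the synthetic data are assumptions of the example.

import numpy as np
from functools import lru_cache
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
W = rng.normal(size=(6, 40))                       # surrogate data with intrinsic dimension 6
X = rng.normal(size=(3000, 6)) @ W + 0.01 * rng.normal(size=(3000, 40))

@lru_cache(maxsize=None)
def err(k):
    # Expensive model fit for squeezing dimension k (PCA stands in for the autoencoder).
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return float(np.sqrt(np.mean((X - X_hat) ** 2)))

def first_below(threshold, lo=1, hi=40):
    # The error is non-increasing in k, so the start of the plateau (first k with
    # err(k) <= threshold) can be found by bisection instead of an incremental sweep.
    while lo < hi:
        mid = (lo + hi) // 2
        if err(mid) <= threshold:
            hi = mid
        else:
            lo = mid + 1
    return lo

print("estimated intrinsic dimension:", first_below(threshold=0.02))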
These results challenge the common beliefs and currently popular traditions of deep learning techniques. The experiments summarized here and given in the SM suggest that many existing deep learning results could be improved by using a clear separation of linear and nonlinear data-driven modelling. Also, the use of more accurate optimization techniques to determine the weights of such models may be advantageous.

We can use the additive transformation to the intrinsic dimension as a pretrained part for transfer learning with any prediction or classification model [105]. It would be interesting to test in the future whether one should use this transformation as is, or whether a transformation into a smaller squeezing dimension than the intrinsic one would generalize better in prediction and classification tasks. Another detectable dimension of the squeezing layer worth investigating, as illustrated in the relative MRSE plots (see also the SM) and in Tables 4 and 5, could be the one with the largest nonlinear gain, that is, with the maximum difference between the PCA error and the autoencoder error or between the shallow and deep results. Moreover, we used global techniques in every part of the autoencoder. The technique might benefit from encoding locally estimated behavior, for example, using convolutional layers for local-level estimation [112]. Similarly, other linear transformation techniques and modifications of PCA might provide better performance [2,113,114], although in the proposed form we also need the inverse of the linear mapping to be able to estimate the residual error in the original vector space.

Data availability

Data and codes are available in public repositories.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Tommi Kärkkäinen reports financial support from Academy of Finland. Jan Hänninen reports financial support from Jenny and Antti Wihuri Foundation.

Acknowledgments

This work was supported by the Academy of Finland from the project 351579 (MLNovCat).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.neucom.2023.126520.

References

[1] M.A. Carreira-Perpinán, A review of dimension reduction techniques, Department of Computer Science, University of Sheffield, Tech. Rep. CS-96-09 9 (1997) 1–69.
[2] C.J. Burges et al., Dimension reduction: A guided tour, Foundations and Trends in Machine Learning 2 (4) (2010) 275–365.
[3] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117.
[4] T. Kärkkäinen, J. Rasku, Application of a knowledge discovery process to study instances of capacitated vehicle routing problems, in: Computation and Big Data for Transport, Chapter 6, Computational Methods in Applied Sciences, Springer-Verlag, 2020, pp. 1–25.
[5] T. Kärkkäinen, On the role of Taylor's formula in machine learning, in: Impact of Scientific Computing on Science and Society, Springer Nature, 2022 (18 pages, to appear).
[6] K. Fukunaga, D.R. Olsen, An algorithm for finding intrinsic dimensionality of data, IEEE Transactions on Computers 100 (2) (1971) 176–183.
[7] F. Camastra, Data dimensionality estimation methods: a survey, Pattern Recognition 36 (12) (2003) 2945–2954.
[8] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Springer Science & Business Media, 2007.
[9] K. Fukunaga, Intrinsic dimensionality extraction, Handbook of Statistics 2 (1982) 347–360.
[10] I. Jolliffe, Principal Component Analysis, 2nd Edition, Springer Verlag, 2002.
[11] E. Facco, M. d'Errico, A. Rodriguez, A. Laio, Estimating the intrinsic dimension of datasets by a minimal neighborhood information, Scientific Reports 7 (1) (2017) 1–8.
[12] F. Camastra, A. Staiano, Intrinsic dimension estimation: Advances and open problems, Information Sciences 328 (2016) 26–41.
[13] G. Navarro, R. Paredes, N. Reyes, C. Bustos, An empirical evaluation of intrinsic dimension estimators, Information Systems 64 (2017) 206–218.
[14] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM 39 (11) (1996) 27–34.
[15] A. Rotondo, F. Quilligan, Evolution paths for knowledge discovery and data mining process models, SN Computer Science 1 (2) (2020) 1–19.
[16] Y. Wang, H. Yao, S. Zhao, Auto-encoder based dimensionality reduction, Neurocomputing 184 (2016) 232–242.
[17] N. Bahadur, R. Paffenroth, Dimension estimation using autoencoders with applications to financial market analysis, in: 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2020, pp. 527–534.
[18] N. Bahadur, R. Paffenroth, Dimension estimation using autoencoders, arXiv preprint arXiv:1909.10702.
[19] G.W. Cottrell, Learning internal representations from gray-scale images: An example of extensional programming, in: Proceedings Ninth Annual Conference of the Cognitive Science Society, Irvine, CA, 1985, pp. 462–473.
[20] H. Bourlard, Y. Kamp, Auto-association by multilayer perceptrons and singular value decomposition, Biological Cybernetics 59 (4) (1988) 291–294.
[21] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[22] M.I. Jordan, T.M. Mitchell, Machine learning: Trends, perspectives, and prospects, Science 349 (6245) (2015) 255–260.
[23] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[24] T. Kärkkäinen, MLP in layer-wise form with applications to weight decay, Neural Computation 14 (6) (2002) 1451–1480.
[25] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
[26] T.N. Sainath, B. Kingsbury, B. Ramabhadran, Auto-encoder bottleneck features using deep belief networks, in: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2012, pp. 4153–4156.
[27] M. Khodayar, O. Kaynak, M.E. Khodayar, Rough deep neural architecture for short-term wind speed forecasting, IEEE Transactions on Industrial Informatics 13 (6) (2017) 2770–2779.
[28] E. Min, X. Guo, Q. Liu, G. Zhang, J. Cui, J. Long, A survey of clustering with deep learning: From the perspective of network architecture, IEEE Access 6 (2018) 39501–39514.
[29] R. McConville, R. Santos-Rodriguez, R.J. Piechocki, I. Craddock, N2d: (not too) deep clustering via clustering the local manifold of an autoencoded embedding, in: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 5145–5152.
[30] B. Zhang, J. Qian, Autoencoder-based unsupervised clustering and hashing, Applied Intelligence 51 (1) (2021) 493–505.
[31] B. Diallo, J. Hu, T. Li, G.A. Khan, X. Liang, Y. Zhao, Deep embedding clustering based on contractive autoencoder, Neurocomputing 433 (2021) 96–107.
[32] C. Ling, G. Cao, W. Cao, H. Wang, H. Ren, Iae-clustergan: A new inverse autoencoder for generative adversarial attention clustering network, Neurocomputing 465 (2021) 406–416.
[33] D. Charte, F. Charte, M.J. del Jesus, F. Herrera, An analysis on the use of autoencoders for representation learning: Fundamentals, learning task case studies, explainability and challenges, Neurocomputing 404 (2020) 93–107.
[34] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, L. Bottou, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research 11 (12).
[35] K. Ho, C.-S. Leung, J. Sum, Objective functions of online weight noise injection training algorithms for MLPs, IEEE Transactions on Neural Networks 22 (2) (2010) 317–323.
[36] M. Chen, K.Q. Weinberger, Z. Xu, F. Sha, Marginalizing stacked linear denoising autoencoders, The Journal of Machine Learning Research 16 (1) (2015) 3849–3875.
[37] H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, P.-A. Muller, Deep learning for time series classification: a review, Data Mining and Knowledge Discovery 33 (4) (2019) 917–963.
[38] Q. Ma, W.-C. Lee, T.-Y. Fu, Y. Gu, G. Yu, Midia: exploring denoising autoencoders for missing data imputation, Data Mining and Knowledge Discovery 34 (6) (2020) 1859–1897.
[39] M. Probst, F. Rothlauf, Harmless overfitting: Using denoising autoencoders in estimation of distribution algorithms, Journal of Machine Learning Research 21 (78) (2020) 1–31.
[40] P. Filzmoser, M. Gschwandtner, V. Todorov, Review of sparse methods in regression and classification with application to chemometrics, Journal of Chemometrics 26 (3–4) (2012) 42–51.
[41] M. Sun, X. Zhang, T.F. Zheng, et al., Unseen noise estimation using separable deep auto encoder for speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (1) (2015) 93–104.
[42] M. Haddad, M. Bouguessa, Exploring the representational power of graph autoencoder, Neurocomputing 457 (2021) 225–241.
[43] M. Ma, S. Na, H. Wang, Aegcn: An autoencoder-constrained graph convolutional network, Neurocomputing 432 (2021) 21–31.
[44] C. Qiao, X.-Y. Hu, L. Xiao, V.D. Calhoun, Y.-P. Wang, A deep autoencoder with sparse and graph laplacian regularization for characterizing dynamic functional connectivity during brain development, Neurocomputing 456 (2021) 97–108.
[45] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, S.Y. Philip, A comprehensive survey on graph neural networks, IEEE Transactions on Neural Networks and Learning Systems 32 (1) (2021) 4–24.
[46] Z. Hou, X. Liu, Y. Dong, C. Wang, J. Tang, et al., Graphmae: Self-supervised masked graph autoencoders, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 594–604.
[47] Y. Pan, J. Zou, J. Qiu, S. Wang, G. Hu, Z. Pan, Joint network embedding of network structure and node attributes via deep autoencoder, Neurocomputing 468 (2022) 198–210.
[48] J. Yoo, H. Jeon, J. Jung, U. Kang, Accurate node feature estimation with structured variational graph autoencoder, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 2336–2346.
[49] B. Dai, Y. Wang, J. Aston, G. Hua, D. Wipf, Connections with robust PCA and the role of emergent sparsity in variational autoencoder models, The Journal of Machine Learning Research 19 (1) (2018) 1573–1614.
[50] S. Burkhardt, S. Kramer, Decoupling sparsity and smoothness in the dirichlet variational autoencoder topic model, Journal of Machine Learning Research 20 (131) (2019) 1–27.
[51] Y. Zhao, K. Hao, X.-S. Tang, L. Chen, B. Wei, A conditional variational autoencoder based self-transferred algorithm for imbalanced classification, Knowledge-Based Systems 218 (106756) (2021) 1–10.
[52] H. Takahashi, T. Iwata, A. Kumagai, S. Kanai, M. Yamada, Y. Yamanaka, H. Kashima, Learning optimal priors for task-invariant representations in variational autoencoders, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 1739–1748.
[53] G. Alain, Y. Bengio, What regularized auto-encoders learn from the data-generating distribution, The Journal of Machine Learning Research 15 (1) (2014) 3563–3593.
[54] N. Janakarajan, J. Born, M. Manica, A fully differentiable set autoencoder, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 3061–3071.
[55] J. Deng, S. Frühholz, Z. Zhang, B. Schuller, Recognizing emotions from whispered speech based on acoustic feature transfer learning, IEEE Access 5 (2017) 5235–5246.
[56] C. Sun, M. Ma, Z. Zhao, S. Tian, R. Yan, X. Chen, Deep transfer learning based on sparse autoencoder for remaining useful life prediction of tool in manufacturing, IEEE Transactions on Industrial Informatics 15 (4) (2018) 2416–2425.
[57] M. Sun, H. Wang, P. Liu, S. Huang, P. Fan, A sparse stacked denoising autoencoder with optimized transfer learning applied to the fault diagnosis of rolling bearings, Measurement 146 (2019) 305–314.
[58] M.A. Chao, B.T. Adey, O. Fink, Implicit supervision for fault detection and segmentation of emerging fault types with deep variational autoencoders, Neurocomputing 454 (2021) 324–338.
[59] N. Amini, Q. Zhu, Fault detection and diagnosis with a novel source-aware autoencoder and deep residual neural network, Neurocomputing 488 (2022) 618–633.
[60] S. Kim, Y.-K. Noh, F.C. Park, Efficient neural network compression via transfer learning for machine vision inspection, Neurocomputing 413 (2020) 294–304.
[61] W. Sheng, X. Li, Siamese denoising autoencoders for joints trajectories reconstruction and robust gait recognition, Neurocomputing 395 (2020) 86–94.
[62] Z. Cao, X. Li, Y. Feng, S. Chen, C. Xia, L. Zhao, Contrastnet: Unsupervised feature learning by autoencoder and prototypical contrastive learning for hyperspectral imagery classification, Neurocomputing 460 (2021) 71–83.
[63] G. Lin, C. Fan, W. Chen, Y. Chen, F. Zhao, Class label autoencoder with structure refinement for zero-shot learning, Neurocomputing 428 (2021) 54–64.
[64] J. Song, G. Shi, X. Xie, Q. Wu, M. Zhang, Domain-aware stacked autoencoders for zero-shot learning, Neurocomputing 429 (2021) 118–131.
[65] D. Sun, W. Xie, Z. Ding, J. Tang, Silp-autoencoder for face de-occlusion, Neurocomputing 485 (2022) 47–56.
[66] X. Yue, J. Li, J. Wu, J. Chang, J. Wan, J. Ma, Multi-task adversarial autoencoder network for face alignment in the wild, Neurocomputing 437 (2021) 261–273.
[67] W. Yin, L. Li, F.-X. Wu, A semi-supervised autoencoder for autism disease diagnosis, Neurocomputing 483 (2022) 140–147.
[68] Q. Zhou, B. Li, P. Tao, Z. Xu, C. Zhou, Y. Wu, H. Hu, Residual-recursive autoencoder for accelerated evolution in savonius wind turbines optimization, Neurocomputing 500 (2022) 909–920.
[69] A. Khajenezhad, H. Madani, H. Beigy, Masked autoencoder for distribution estimation on small structured data sets, IEEE Transactions on Neural Networks and Learning Systems, Early Access, to appear.
[70] Y. Ikeda, K. Tajiri, Y. Nakano, K. Watanabe, K. Ishibashi, Estimation of dimensions contributing to detected anomalies with variational autoencoders, arXiv preprint arXiv:1811.04576.
[71] Y. Gao, B. Shi, B. Dong, Y. Chen, L. Mi, Z. Huang, Y. Shi, RVAE-ABFA: robust anomaly detection for high-dimensional data using variational autoencoder, in: 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), IEEE, 2020, pp. 334–339.
[72] Y. Liu, Y. Lin, Q. Xiao, G. Hu, J. Wang, Self-adversarial variational autoencoder with spectral residual for time series anomaly detection, Neurocomputing 458 (2021) 349–363.
[73] Q. Yu, M. Kavitha, T. Kurita, Autoencoder framework based on orthogonal projection constraints improves anomalies detection, Neurocomputing 450 (2021) 372–388.
[74] N. Li, F. Chang, C. Liu, Human-related anomalous event detection via spatial-temporal graph convolutional autoencoder with embedded long short-term memory network, Neurocomputing 490 (2022) 482–494.
[75] S. Narayanan, R. Marks, J.L. Vian, J. Choi, M. El-Sharkawi, B.B. Thompson, Set constraint discovery: missing sensor data restoration using autoassociative regression machines, in: Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), Vol. 3, IEEE, 2002, pp. 2872–2877.
[76] N. Abiri, B. Linse, P. Edén, M. Ohlsson, Establishing strong imputation performance of a denoising autoencoder in a wide range of missing data problems, Neurocomputing 365 (2019) 137–146.
[77] X. Lai, X. Wu, L. Zhang, W. Lu, C. Zhong, Imputations of missing values using a tracking-removed autoencoder trained with incomplete data, Neurocomputing 366 (2019) 54–65.
[78] Y. Zhou, Z. Ding, X. Liu, C. Shen, L. Tong, X. Guan, Infer-avae: An attribute inference model based on adversarial variational autoencoder, Neurocomputing 483 (2022) 105–115.
[79] L. Tran, X. Liu, J. Zhou, R. Jin, Missing modalities imputation via cascaded residual autoencoder, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1405–1414.
[80] J. Zhao, Y. Nie, S. Ni, X. Sun, Traffic data imputation and prediction: An efficient realization of deep learning, IEEE Access 8 (2020) 46713–46722.
[81] L. Li, M. Franklin, M. Girguis, F. Lurmann, J. Wu, N. Pavlovic, C. Breton, F. Gilliland, R. Habre, Spatiotemporal imputation of MAIAC AOD using deep learning with downscaling, Remote Sensing of Environment 237 (2020).
[82] M. Sangeetha, M.S. Kumaran, Deep learning-based data imputation on time-variant data using recurrent neural network, Soft Computing 24 (17) (2020) 13369–13380.
[83] S. Ryu, M. Kim, H. Kim, Denoising autoencoder-based missing value imputation for smart meters, IEEE Access 8 (2020) 40656–40666.
[84] A. Ahmed, K. Saleem, O. Khalid, J. Gao, U. Rashid, Trust-aware denoising autoencoder with spatial-temporal activity for cross-domain personalized recommendations, Neurocomputing 511 (2022) 477–494.
[85] P. Nousi, S.-C. Fragkouli, N. Passalis, P. Iosif, T. Apostolatos, G. Pappas, N. Stergioulas, A. Tefas, Autoencoder-driven spiral representation learning for gravitational wave surrogate modelling, Neurocomputing 491 (2022) 67–77.
[86] L. Alzubaidi, J. Zhang, A.J. Humaidi, A. Al-Dujaili, Y. Duan, O. Al-Shamma, J. Santamaría, M.A. Fadhel, M. Al-Amidie, L. Farhan, Review of deep learning: concepts, CNN architectures, challenges, applications, future directions, Journal of Big Data 8 (1) (2021) 1–74.
[87] V. Carletti, A. Greco, G. Percannella, M. Vento, Age from faces in the deep learning revolution, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (9) (2020) 2113–2132.
[88] S. Chen, Q. Zhao, Shallowing deep networks: Layer-wise pruning based on feature representations, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (12) (2018) 3048–3056.
[89] C. Guo, G. Pleiss, Y. Sun, K.Q. Weinberger, On calibration of modern neural networks, arXiv preprint arXiv:1706.04599 (2017).
[90] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, R. Horaud, A comprehensive analysis of deep regression, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (9) (2020) 2065–2081.
[91] T. Elsken, J.H. Metzen, F. Hutter, Neural architecture search: A survey, The Journal of Machine Learning Research 20 (1) (2019) 1997–2017.
[92] T.J. Sejnowski, The unreasonable effectiveness of deep learning in artificial intelligence, Proceedings of the National Academy of Sciences, www.pnas.org/cgi/doi/10.1073/pnas.1907373117.
[93] S. Yu, J.C. Principe, Understanding autoencoders with information theoretic concepts, Neural Networks 117 (2019) 104–123.
[94] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica 8 (1999) 143–195.
[95] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (5) (1989) 359–366.
[96] J.E. Dennis Jr., R.B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Vol. 16, SIAM, 1996.
[97] J. Nocedal, S. Wright, Numerical Optimization, Springer Science & Business Media, 2006.
[98] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[99] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of International Conference on Learning Representations, 2015, arXiv:1412.6980.
[100] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, T. Goldstein, Training neural networks without gradients: A scalable ADMM approach, in: International Conference on Machine Learning (ICML), PMLR, 2016, pp. 2722–2731.
[101] T. Kärkkäinen, E. Heikkola, Robust formulations for training multilayer perceptrons, Neural Computation 16 (4) (2004) 837–862.
[102] N. Bellomo, L. Preziosi, Modelling Mathematical Methods and Scientific Computation, Vol. 1, CRC Press, 1994.
[103] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[104] T. Kärkkäinen, M. Saarela, Robust principal component analysis of data with missing values, in: International Workshop on Machine Learning and Data Mining in Pattern Recognition, Springer, 2015, pp. 140–154.
[105] A. Ghods, D.J. Cook, A survey of deep network techniques all classifiers can adopt, Data Mining and Knowledge Discovery 35 (1) (2021) 46–87.
[106] H. Gouk, E. Frank, B. Pfahringer, M.J. Cree, Regularisation of neural networks by enforcing Lipschitz continuity, Machine Learning 110 (2) (2021) 393–416.
[107] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning 2 (1) (2009) 1–127.
[108] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[109] R.L. Thorndike, Who belongs in the family?, Psychometrika 18 (4) (1953) 267–276.
[110] T. Kärkkäinen, On cross-validation for MLP model evaluation, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2014, pp. 291–300.
[111] D. Dua, C. Graff, UCI Machine Learning Repository (2017). URL: http://archive.ics.uci.edu/ml.
[112] Y. LeCun, B.E. Boser, J.S. Denker, D. Henderson, R.E. Howard, W.E. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in: Advances in Neural Information Processing Systems, 1990, pp. 396–404.
[113] L. Song, H. Ma, M. Wu, Z. Zhou, M. Fu, A brief survey of dimension reduction, in: International Conference on Intelligent Science and Big Data Engineering, Springer, 2018, pp. 189–200.
[114] J.T. Vogelstein, E.W. Bridgeford, M. Tang, D. Zheng, C. Douville, R. Burns, M. Maggioni, Supervised dimensionality reduction for big data, Nature Communications 12 (1) (2021) 1–9.


Tommi Kärkkäinen (TK) received the Ph.D. degree in Mathematical Information Technology from the University of Jyväskylä (JYU), in 1995. Since 2002 he has been serving as a full professor of Mathematical Information Technology at the Faculty of Information Technology (FIT), JYU. TK has led 50 different R&D projects and has been supervising 60 PhD students. He has published over 200 peer-reviewed articles. TK received the Innovation Prize of JYU in 2010. He has served in many administrative positions at FIT and JYU, currently leading a Research Division and a Research Group on Human and Machine based Intelligence in Learning. The main research interests include data mining, machine learning, learning analytics, and nanotechnology. He is a senior member of the IEEE.

Jan Hänninen received the BSc and MSc degrees in Mathematical Information Technology from the University of Jyväskylä, Department of Mathematical Information Technology. He is working towards the PhD degree in Mathematical Information Technology at the Faculty of Information Technology, University of Jyväskylä. His research interests include neural networks, nonlinear optimization, and computational efficiency.
