Information Fusion
Keywords: Overfitting; Generalization; Regularization; Machine learning

Abstract: In machine learning, a more complicated model is not necessarily a better one. Good generalization ability means that the model not only performs well on the training data set, but also makes good predictions on new data. Regularization imposes a penalty on the model's complexity or smoothness, allowing for good generalization to unseen data even when training on a finite training set or with an inadequate number of iterations. Deep learning has developed rapidly in recent years, and there regularization has a broader definition: regularization is a technology aimed at improving the generalization ability of a model. This paper gives a comprehensive study and a state-of-the-art review of the regularization strategies in machine learning. The characteristics and comparisons of regularizations are then presented. In addition, it discusses how to choose a regularization for a specific task. For specific tasks, it is necessary for the regularization technology to have good mathematical characteristics. Meanwhile, new regularization techniques can be constructed by extending and combining existing regularization techniques. Finally, it concludes current opportunities and challenges of regularization technologies, as well as many open concerns and research trends.
1. Introduction
∗ School of Economics and Management, University of the Chinese Academy of Sciences, Beijing 100190, China.
E-mail addresses: [email protected] (Y. Tian), [email protected] (Y. Zhang).
https://doi.org/10.1016/j.inffus.2021.11.005
Received 13 May 2021; Received in revised form 23 October 2021; Accepted 2 November 2021
Available online 14 November 2021
1566-2535/© 2021 Elsevier B.V. All rights reserved.
fit the data. The second prior preference tends to add more convincing regularization to the model. If there is some stochastic or deterministic noise, the objective function will become unsmooth and prone to overfitting. Therefore, the regularization should make the target smooth. The third prior preference is target-dependent: usually, regularization can be constructed according to some properties of the target. The fourth prior preference is to make the model easy to solve, such as weight decay.

Deep models can learn complex representational spaces to deal with difficult learning tasks. They are prone to overfitting, particularly in networks with millions or billions of learnable parameters, such as convolutional neural networks (CNN) and recurrent neural networks (RNN). Hence, for deep learning, regularization needs a broader definition: regularization is any supplementary technique that aims to improve the model's generalization, i.e. produce better results on the test set [2]. Explicit regularization, such as dropout and weight decay, may improve generalization performance. When neural networks have far more learnable parameters than training samples, generalization takes place even in the absence of any explicit regularization. In this case, explicit regularization is useless and unnecessary. A possible explanation is that the optimization introduces some implicit regularization, so that there is no significant overfitting and the test error continues decreasing as the network size increases past the size required for achieving zero training error. It has been proved that early stopping implicitly regularizes some convex learning problems. Batch normalization is an operator that normalizes the layer responses within each mini-batch. It has been widely used in many modern neural networks. Although it is not explicitly designed as regularization, it is usually found that batch normalization can improve generalization performance. In fact, the algorithm itself is an implicit regularization solution (this paper does not consider implicit regularization based on the algorithm) [3].

This paper provides a comprehensive review of the regularization strategies in both machine learning and deep learning. Different regularizations are classified into four categories in machine learning: sparse vector regularization, sparse matrix regularization, low-rank matrix regularization, and manifold regularization. We discuss the characteristics of each approach and the applied formulation. Moreover, we review the strategies to overcome overfitting in deep learning such as data augmentation, dropout, early stopping, batch normalization, etc. Finally, we discuss several future directions for regularization.

To the best of our knowledge, there have been some regularization-related surveys conducted up to this point [2,4–6]. The work [2] gave the broader definition of regularization and proposed a taxonomy to categorize existing methods, but with few specific introductions. The work [4] only introduced the low-rank regularizations and their applications. The work [6] only considered regularization in traditional machine learning but not in deep learning. The work [5] gave a relatively complete introduction, but there are still omissions, and no applications and comparisons are given. In comparison, the novelties and contributions of this survey are as follows:

2. The regularization strategies in the traditional machine learning

In this section, a comprehensive overview of sparse and low-rank regularization is provided, which is empirically categorized into four groups. These regularizations are usually proposed to solve specific applications. Thus, this section will give various practical problems before introducing regularizations.

For brevity, vectors and matrices are written in bold to distinguish them from one-dimensional variables.

2.1. Sparse vector-based regularization

Some variables are required to be sparse in many practical problems, such as compressed sensing, feature selection, sparse signal separation, and sparse PCA. Sparse regularization, which has attracted much attention in recent years, usually imposes the penalty on the variables. Many examples in different fields can be found where sparse regularization is beneficial and favorable. This section will review four applications of sparse regularization based on sparse vectors. Meanwhile, some vector-based sparse regularizations will be summarized.

2.1.1. Application scenario

Compressing Sensing  Compressed sensing is used in radar [7], communications [8], medical imaging [9], image processing [10], and speech signal processing [11]. The objective of compressed sensing is to reconstruct a sparse signal 𝒙:

min_𝒙 P(𝒙)   s.t.  𝑨𝒙 = 𝒚   (1)

where 𝑨 ∈ R^{m×n} with m ≤ n is the sensing matrix (also called the measurement matrix), 𝒚 is the compressed measurement of 𝒙, and P is a penalty function for the sparsity. In fact, there is often measurement noise 𝜺 ∈ R^m so that

𝒚 = 𝑨𝒙 + 𝜺.   (2)

To minimize the noise 𝜺, an unconstrained formulation can be written as

min_𝒙 (1/2)‖𝑨𝒙 − 𝒚‖²₂ + μP(𝒙)   (3)

with μ > 0.
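With P(𝒙) = ‖𝒙‖₁, problem (3) can be solved by proximal gradient descent, whose proximal step is element-wise soft-thresholding. The following is a minimal illustrative sketch (not from the paper) of this iterative soft-thresholding scheme on a synthetic compressed-sensing instance; the matrix A, the noise level, the weight mu and the step size are arbitrary choices for the example.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (element-wise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, mu, n_iter=500):
    # Minimize 0.5 * ||A x - y||_2^2 + mu * ||x||_1 by proximal gradient descent.
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                # gradient of the smooth data-fit term
        x = soft_threshold(x - step * grad, step * mu)
    return x

# Toy compressed-sensing instance: recover a sparse x from m < n noisy measurements.
rng = np.random.default_rng(0)
n, m, k = 200, 80, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(m)
x_hat = ista(A, y, mu=0.05)
print("recovered support:", np.nonzero(np.abs(x_hat) > 1e-3)[0])
```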
Feature Selection  In many fields today, such as genomics, health sciences, economics, and machine learning, the analysis of data sets with a number of variables comparable to or even much larger than the sample size is required. Most high-dimensional problems are infeasible and impractical because of their expensive computational costs. Feature selection has become a very popular topic in the last decade due to its effectiveness in the high-dimensional case [12].
Table 1
The convex sets and their corresponding (orthogonal) projections.

Convex set | Projection
Nonnegative orthant C₁ = Rⁿ₊ | [𝒙]₊
Affine set C₂ = {𝒙 ∈ Rⁿ : 𝑨𝒙 = 𝒚} | 𝒙 − 𝑨ᵀ(𝑨𝑨ᵀ)⁻¹(𝑨𝒙 − 𝒚)
Box C₃ = Box{[bᵢ, uᵢ]ⁿᵢ₌₁}, where bᵢ, uᵢ ∈ (−∞, ∞] | (min{max{xᵢ, bᵢ}, uᵢ})ⁿᵢ₌₁
Half-space C₄ = {𝒙 : 𝒂ᵀ𝒙 ≤ α} | 𝒙 − ([𝒂ᵀ𝒙 − α]₊ / ‖𝒂‖²) 𝒂
Table 2
The properties of the regularizations. For each penalty, the original table additionally indicates which of the properties P1–P8 it satisfies and plots the penalty value (vertical axis) against the value of x (horizontal axis).

Penalty | Formulation
l₀ norm | 0 if x = 0, 1 otherwise
l₁ norm | |x|
l₀.₅ norm | |x|^p
Capped l₁ | λ min(|x|, γ)
LSP | λ log(1 + |x|/γ)
Rat | λ|x| / (1 + |x|/(2γ))
Atan | λ (2/√3) (tan⁻¹((1 + γ|x|)/√3) − π/6)
Exp | λ(1 − e^{−γ|x|})
Firm | λ[|x| − x²/(2γ)] if |x| ≤ γ;  λγ/2 if |x| ≥ γ
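As a rough illustration of how these scalar penalties behave, the sketch below (ours, not part of the paper) evaluates a few of the formulations in Table 2 at several values of x; the parameter values lam and gamma are arbitrary.

```python
import numpy as np

lam, gamma = 1.0, 2.0   # illustrative penalty parameters

def l1(x):        return np.abs(x)
def capped_l1(x): return lam * np.minimum(np.abs(x), gamma)
def lsp(x):       return lam * np.log(1.0 + np.abs(x) / gamma)
def exp_pen(x):   return lam * (1.0 - np.exp(-gamma * np.abs(x)))
def firm(x):
    # Piecewise "Firm" penalty from Table 2.
    a = np.abs(x)
    return np.where(a <= gamma, lam * (a - a ** 2 / (2 * gamma)), lam * gamma / 2)

xs = np.linspace(-3, 3, 7)
for name, f in [("l1", l1), ("capped l1", capped_l1), ("LSP", lsp), ("Exp", exp_pen), ("Firm", firm)]:
    print(f"{name:10s}", np.round(f(xs), 3))
```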
2.2.1. Application scenario

Large Sparse Covariance Matrix Estimation  Large covariance matrix estimation is a fundamental problem in a wide range of applications, from economics and finance to genetics, social networks, and health sciences. The estimation problem becomes more difficult when the dimension of the covariance matrix is high. The assumption for high-dimensional covariance matrix estimation is sparsity. Sparsity means that a majority of the off-diagonal elements are nearly zero, which decreases the number of free parameters to estimate [30]. Since the diagonal elements of a correlation (also known as covariance) matrix are always positive, the assumption is imposed on the off-diagonal elements.

Let 𝒔¹, …, 𝒔ⁿ ∈ Rⁿ be independent and identically distributed (i.i.d.). The covariances between the dimensions form an n × n matrix called the covariance matrix. The sample covariance matrix 𝑺 is calculated as [31]

𝑺 = (1/(n − 1)) Σ_{i=1}^{n} (𝒔ⁱ − 𝒔̄)(𝒔ⁱ − 𝒔̄)ᵀ   (25)

where 𝒔̄ is the mean of 𝒔¹, …, 𝒔ⁿ, i.e.

𝒔̄ = (1/n) Σ_{i=1}^{n} 𝒔ⁱ.   (26)

The generalized thresholding estimator solves the following problem

min_𝑿 (1/2)‖𝑿 − 𝑺‖²_F + Σ_{i=1}^{n} Σ_{j=1}^{m} M(X_{ij}),   (27)

where X_{ij} are the elements of the matrix 𝑿 and M is a generalized penalty function for sparsity. These notations still apply in the rest of this section.

Large Sparse Inverse Covariance Matrix Estimation  Sparse inverse covariance matrix estimation is a fundamental problem in a Gaussian network model and has attracted much attention in the last decade [32]. The estimation of large sparse inverse covariance matrices is a tricky statistical problem in many application areas such as mathematical finance, geology, health, and many others. Hence, sparsity-regularized negative log-likelihood minimization has become a popular approach for estimating the sparse inverse covariance matrix. A common method is to add some mechanism, such as a regularization term, to the estimation model to explicitly enforce sparsity in 𝑿:

min_𝑿 tr(𝑺𝑿) − log |𝑿| + Σ_{i≠j} M(X_{ij}),   (28)

where tr(⋅) and | ⋅ | represent the trace and the determinant of a matrix, respectively. The sample covariance matrix 𝑺 is invertible.

2.2.2. Regularization

Norm Penalty  A convex model is based on the l_{1,off}-regularized maximum likelihood problem

min_{𝑿>0} tr(𝑺𝑿) − log |𝑿| + λ‖𝑿‖_{1,off}   (29)

where

‖𝑿‖_{1,off} = Σ_{i≠j} |X_{ij}|,   (30)

which refers to the element-wise 1 norm.

The F norm is defined as

M(𝑿) = ‖𝑿‖_F = (Σ_{i=1}^{n} Σ_{j=1}^{m} X²_{ij})^{1/2}.   (31)
Generalizing the above norm, we denote the l_{p,off} norm of a matrix 𝑿 as follows:

M(𝑿) = ‖𝑿‖_{p,off} = (Σ_{i=1}^{m} Σ_{j=1}^{n} |X_{ij}|^p)^{1/p}   (32)

with p ≥ 1. The above norms are called element-wise matrix norms since they impose the same penalty on all elements.

Note that the l_{1,off} norm is distinct from the 1 norm of a matrix 𝑿 ∈ R^{m×n},

M(𝑿) = ‖𝑿‖₁ = max_{1≤j≤n} Σ_{i=1}^{m} |X_{ij}|,   (33)

which is simply the maximum absolute column sum of the matrix. Minimizing the maximum absolute column sum is equivalent to minimizing the upper bound of the absolute column sums, which enforces the column sums to be close to zero, i.e. feature selection.

The l_{2,1} norm of a matrix was first introduced in [33] as a rotational invariant l₁ norm, that is

M(𝑿) = ‖𝑿‖_{2,1} = Σ_{j=1}^{n} (Σ_{i=1}^{m} X²_{ij})^{1/2},   (34)

which was proposed to overcome the difficulty of robustness to outliers [33]. Similarly, the l_{1,2} norm of a matrix 𝑿 is defined as

M(𝑿) = ‖𝑿‖_{1,2} = (Σ_{j=1}^{n} (Σ_{i=1}^{m} |X_{ij}|)²)^{1/2}.   (35)

For p, q ≥ 1, a generalized form of the l_{1,2} norm is the l_{p,q} norm as follows [34]:

M(𝑿) = ‖𝑿‖_{p,q} = (Σ_{j=1}^{n} (Σ_{i=1}^{m} |X_{ij}|^p)^{q/p})^{1/q}.   (36)

Moreover, there are many norm penalties based on the ∞ norm

M(𝑿) = ‖𝑿‖_∞ = max_{1≤i≤m} Σ_{j=1}^{n} |X_{ij}|,   (37)

which are, respectively, the l_{∞,1} norm [35]

M(𝑿) = ‖𝑿‖_{∞,1} = Σ_{i=1}^{m} max_{1≤j≤n} |X_{ij}|,   (38)

and the l_{∞,off} norm

M(𝑿) = ‖𝑿‖_{∞,off} = max_{i≠j} |X_{ij}|.   (39)

Non-Norm Penalty  The following penalties are not real norms, because they are nonconvex and do not satisfy the triangle inequality of a norm. The capped l_{p,1} is defined as [36]:

M(𝑿) = λ Σ_{j=1}^{n} min((Σ_{i=1}^{m} |X_{ij}|^p)^{1/p}, γ).   (40)

With the given threshold γ, the capped l_{p,1} penalty focuses on the columns with l_p norms smaller than γ, which are more likely to be sparse.

The element-wise MCP function is defined as

M(X_{ij}) = λ|X_{ij}| − X²_{ij}/(2γ) if |X_{ij}| < γλ, and γλ²/2 if |X_{ij}| ≥ γλ,   (41)

as listed in Table 3.

The penalty function of a matrix usually satisfies the following properties:
Property 9 (P9): M(𝑿) ≥ 0.
Property 10 (P10): M(𝑿) ≤ ‖𝑿‖_{1,off}.
Table 3 lists several penalty functions and their properties.

The algorithms to solve these regularized problems include: (1) alternating minimization; (2) block coordinate descent.

2.3. Low-rank matrix recovery based on low-rank regularization

This section reviews low-rank regularization based on low-rank recovery problems, which include two hot topics: matrix completion and robust PCA. Matrix completion aims to recover a low-rank matrix from partially observed entries, while robust PCA aims to decompose a low-rank matrix from sparse corruption.

2.3.1. Application scenario

Matrix Completion  In many applications, the matrix is expected to be constructed in the sense that it is low-rank, which can be recovered from incomplete portions of the entries. For instance, vendors provide recommendations to users based on their preferences. Users and ratings are represented as rows and columns, respectively, in a data matrix. Users can rate movies, but they typically rate only very few movies, so that only very few scattered entries can be observed [38]. Commonly, only a few factors affect the preferences of users, so the data matrix of all user ratings can be regarded as a low-rank matrix. The goal of this problem is to complete the data matrix of all user ratings using the observed data.

Given the incomplete observations A_{ij}, the goal is to recover an m × n matrix 𝑿, that is,

min_𝑿 rank(𝑿)   s.t.  X_{ij} = A_{ij}, (i, j) ∈ Ω   (42)

where Ω ⊂ [1, …, m] × [1, …, n] is the index set in which X_{ij} is observed and rank(𝑿) is the rank of the matrix 𝑿. It is usually transformed into an unconstrained formulation, which is written as [39]

min_𝑿 (1/2)‖𝒫_Ω(𝑿) − 𝒫_Ω(𝑴)‖²_F + F(𝑿)   (43)

where 𝒫_Ω(𝑿)_{ij} = X_{ij} if (i, j) ∈ Ω and 0 otherwise.

If F(𝑿) = rank(𝑿), the objective function is nonconvex. A popular convex relaxation method is to approximate the rank function using the nuclear norm ‖𝑿‖_*, which is the sum of the singular values of 𝑿 [40].
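As a concrete illustration of nuclear-norm-based completion, the following sketch (ours, not the paper's algorithm) iterates singular value soft-thresholding on the observed entries, in the spirit of the spectral regularization approach of [39]; the rank, the observation rate and the threshold tau are illustrative.

```python
import numpy as np

def svt(Z, tau):
    # Singular value soft-thresholding: proximal operator of tau * ||.||_*.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete(A, mask, tau=1.0, n_iter=200):
    # Fill the unobserved entries of A (mask == 1 marks observed entries)
    # by repeatedly imputing with a soft-thresholded SVD.
    X = np.zeros_like(A)
    for _ in range(n_iter):
        X = svt(mask * A + (1 - mask) * X, tau)
    return X

# Toy low-rank matrix with roughly half of the entries observed.
rng = np.random.default_rng(1)
L = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 30))   # rank-4 ground truth
mask = (rng.random(L.shape) < 0.5).astype(float)
X_hat = complete(L, mask)
print("relative error on missing entries:",
      np.linalg.norm((1 - mask) * (X_hat - L)) / np.linalg.norm((1 - mask) * L))
```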
Robust PCA  Robust PCA is applied in many important applications such as foreground detection in video [41], image processing [42], fault detection and diagnosis [43], and so on. The robust PCA problem is commonly thought of as a low-rank matrix recovery problem that incorporates sparse corruption. The goal of robust PCA is to enhance the robustness of PCA against outliers or corrupted observations. In fact, the data matrix 𝑨 of this problem is a composite of sparse and low-rank components and needs to be decomposed into two parts such that 𝑨 = 𝑳 + 𝑺, where 𝑳 is a low-rank matrix and 𝑺 is a sparse matrix. The matrix 𝑨 is observed and the problem can be represented as:
Fig. 5. Left: rows sparsity (Row 1, 3, 5, 8, 9). Middle: columns sparsity (Column 1, 3, 4, 6, 10). Right: sparsity of both columns and rows (Row 6, 7, 8 and Column 6,7).
Table 3
The sparse matrix regularizations. For each penalty, the original table additionally indicates the type of sparsity induced (rows, columns, or both rows and columns; cf. Fig. 5) and whether properties P9 and P10 hold.

Element-wise penalties:
‖𝑿‖_{1,off} | Σ_{j=1}^{n} Σ_{i=1}^{m} |X_{ij}|
‖𝑿‖_F | (Σ_{i=1}^{n} Σ_{j=1}^{m} X²_{ij})^{1/2}
‖𝑿‖_{p,off} | (Σ_{i=1}^{m} Σ_{j=1}^{n} |X_{ij}|^p)^{1/p}
‖𝑿‖_{∞,off} | max_{i≠j} |X_{ij}|
MCP | γλ²/2 if |X_{ij}| ≥ γλ;  λ|X_{ij}| − X²_{ij}/(2γ) otherwise

Non-element-wise penalties:
‖𝑿‖_{2,1} | Σ_{j=1}^{n} (Σ_{i=1}^{m} X²_{ij})^{1/2}
‖𝑿‖_{2,1} | Σ_{i=1}^{m} (Σ_{j=1}^{n} X²_{ij})^{1/2}
‖𝑿‖_{p,q} | (Σ_{j=1}^{n} (Σ_{i=1}^{m} |X_{ij}|^p)^{q/p})^{1/q}
‖𝑿‖_∞ | max_{1≤i≤m} Σ_{j=1}^{n} |X_{ij}|
‖𝑿‖₁ | max_{1≤j≤n} Σ_{i=1}^{m} |X_{ij}|
‖𝑿‖_{∞,1} | Σ_{i=1}^{m} max_{1≤j≤n} |X_{ij}|
Capped l_{p,1} | λ Σ_{j=1}^{n} min((Σ_{i=1}^{m} |X_{ij}|^p)^{1/p}, γ)
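For reference, the short sketch below (ours, not from the paper) computes several of the penalties in Table 3 with plain numpy; the matrix X is an arbitrary example.

```python
import numpy as np

def l1_elementwise(X):     # element-wise l1 penalty (sum of absolute values)
    return np.abs(X).sum()

def l21_norm(X):           # sum over columns of the l2 norm of each column
    return np.sqrt((X ** 2).sum(axis=0)).sum()

def lpq_norm(X, p, q):     # generalized l_{p,q} penalty over columns
    inner = (np.abs(X) ** p).sum(axis=0) ** (1.0 / p)
    return (inner ** q).sum() ** (1.0 / q)

def linf_norm(X):          # maximum absolute row sum
    return np.abs(X).sum(axis=1).max()

X = np.arange(12, dtype=float).reshape(3, 4) - 5.0
print(l1_elementwise(X), l21_norm(X), lpq_norm(X, 2, 1), linf_norm(X))
# lpq_norm(X, 2, 1) coincides with l21_norm(X), a quick sanity check.
```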
Considering that the unconstrained problem is more tractable, the formulation can be replaced by

min_{𝑳,𝑺} ‖𝑳‖_* + ‖𝑺‖₁ + (1/(2μ))‖𝑨 − 𝑳 − 𝑺‖²_F   (46)

where μ > 0 is the penalty parameter.

2.3.2. Regularization

In recent years, the rank norm has been replaced by some alternative regularizers, which can be divided into two groups: convex relaxations and non-convex relaxations [4]. The main idea of these regularizers is to make the singular values sparse, which is equivalent to making the matrix low-rank.

The work [44] proposed the elastic-net regularization for singular values

F(𝑿) = Σ_{i=1}^{k} (σᵢ + λσᵢ²),   (47)

where σᵢ are the singular values of 𝑿 and k = min{m, n}. It is widely applicable to robust subspace learning problems with heavy corruptions such as outliers and missing entries.

The Schatten p-Norm (Sp-Norm) [45] of a matrix 𝑿 is defined as

F(𝑿) = ‖𝑿‖_{Sp} = (Σ_{i=1}^{k} σᵢ^p)^{1/p} = (Tr((𝑿ᵀ𝑿)^{p/2}))^{1/p},   (48)

which is equivalent to the l_p norm of the singular values. The Sp-norm minimization method provides a better approximation to the original NP-hard problem, resulting in better theoretical and practical results [46].

The weighted nuclear norm improves the flexibility of the nuclear norm [47]; it is defined as:

F(𝑿) = ‖𝑿‖_w = Σ_{i=1}^{k} wᵢσᵢ,   (49)

where the singular values are assigned different weights. Moreover, this idea can be generalized to Schatten p-norm minimization [48].

The capped trace norm is defined as

F(𝑿) = ‖𝑿‖_{C*ε} = Σ_{i=1}^{k} min(σᵢ, ε).   (50)

This regularization is used to improve the robustness to outliers by closely approximating the rank minimization.

Moreover, numerous non-convex relaxations derived from sparse learning of vectors have been proposed [49], such as the Log Nuclear Norm (LNN) [50]

F(𝑿) = λ log det(𝑰 + 𝑿ᵀ𝑿) = λ Σ_{i=1}^{k} log(σᵢ + 1),   (51)

MCP

F(𝑿) = Σ_{i=1}^{k} [ λσᵢ − σᵢ²/(2γ), if σᵢ < γλ;  γλ²/2, if σᵢ ≥ γλ ],   (52)
SCAD

F(𝑿) = Σ_{i=1}^{k} [ λσᵢ, if σᵢ ≤ λ;  (−σᵢ² + 2γλσᵢ − λ²)/(2(γ − 1)), if λ < σᵢ ≤ λγ;  λ²(γ + 1)/2, if σᵢ > λγ ],   (53)

ETP [51]

F(𝑿) = Σ_{i=1}^{k} λ(1 − exp(−γσᵢ))/(1 − exp(−γ)),   (54)

Logarithm [52]

F(𝑿) = Σ_{i=1}^{k} log(γσᵢ + 1)/log(γ + 1),   (55)

Geman [53]

F(𝑿) = Σ_{i=1}^{k} λσᵢ/(σᵢ + γ),   (56)

Laplace [54]

F(𝑿) = Σ_{i=1}^{k} λ(1 − exp(−σᵢ/γ)),   (57)

and so on. Note that Laplace, ETP, and Exp are similar in formulation, and they can be transformed into each other by adjusting parameters. Similarly, this relationship exists between LNN, Logarithm and LSP.

For problems with low-rank regularization, a popular method is the proximal method. It requires computing the proximal operator

Prox_{λF}(𝑿̃) = arg min_𝑿 (1/2)‖𝑿 − 𝑿̃‖²_F + F(𝑿)   (58)

where usually 𝑿̃ is known. However, a general explicit solution of the proximal operator may not exist. Hence, the proximal operator is calculated by minimizing the problem

min_𝑿 (1/2)‖𝑿 − 𝑿̃‖²_F + F(𝑿),   (59)

whose results are obtained by an optimization solver. In order to explain the shrinkage effect more intuitively, these results are regarded as approximations to explicit solutions.

Table 4 shows some low-rank regularizations. Among them, the shrinkage image shows the result Prox_{λF}(b), where b stands for a singular value. It can be seen that when b takes a small value, the shrinkage effects of different regularizers are similar. Nevertheless, when b takes a large value, the difference in Prox_{λF}(b) between non-convex regularizers and the nuclear norm is significant. In particular, the shrinkage applied by non-convex relaxations to larger values is very small, in contrast with the nuclear norm, which shrinks larger singular values severely. Hence, using non-convex regularizers can preserve the main information of b [55].

The algorithms to solve these regularized problems include: (1) proximal methods; (2) block coordinate methods; (3) alternating linearization algorithms; (4) greedy strategy approximation.
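The shrinkage behaviour discussed above can be reproduced numerically. The sketch below (an illustration we add, not the paper's code) evaluates Prox_{λF}(b) for a single singular value b by grid search, comparing the nuclear norm with the nonconvex Geman penalty; λ, γ and the grid are arbitrary.

```python
import numpy as np

def prox_1d(b, penalty, lo=0.0, hi=20.0, n=20001):
    # Numerically evaluate argmin_x 0.5 * (x - b)^2 + penalty(x) on a grid of x >= 0.
    xs = np.linspace(lo, hi, n)
    return xs[np.argmin(0.5 * (xs - b) ** 2 + penalty(xs))]

lam, gamma = 1.0, 1.0
nuclear = lambda x: lam * x                   # nuclear norm acts like l1 on a singular value
geman   = lambda x: lam * x / (x + gamma)     # Geman penalty on a singular value

for b in [0.5, 2.0, 10.0]:
    print(f"b={b:5.1f}  nuclear prox={prox_1d(b, nuclear):6.3f}  Geman prox={prox_1d(b, geman):6.3f}")
# Large singular values are barely shrunk by the nonconvex penalty,
# while the nuclear norm shrinks them by a constant lam.
```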
Fig. 6. The geometry structure of the data distribution in low-dimension space.

2.4. Manifold regularization

inherent structure. In manifold learning, this assumption is explicit: it assumes that the observed data lie on a low-dimensional manifold embedded in a higher-dimensional space, which can be seen in Fig. 6. Intuitively, this assumption states that the shape of the data is relatively simple. The low-dimensional manifold model can be applied to surface patches in a point cloud, using the patch manifold prior to seek self-similar patches and remove noise [58].

In semi-supervised learning, there are two sets of samples x ∈ Rⁿ, consisting of l labeled samples Q = {(xᵢ, yᵢ)}_{i=1}^{l} and u unlabeled samples U = {(xᵢ, yᵢ)}_{i=l+1}^{u+l}. The optimization can be written as

f* = argmin_{f∈K} (1/l) Σ_{i=1}^{l} V(xᵢ, yᵢ, f) + γ‖f‖²_K   (60)

where K represents a reproducing kernel Hilbert space. The first term is the loss function and the regularization ‖f‖_K is used to control the complexity of the classification model; γ is a trade-off parameter. Manifold learning adds a regularization ‖f‖_I to the loss function that penalizes functions which are more complex with respect to the intrinsic geometry of the data manifold:

f* = argmin_{f∈K} (1/l) Σ_{i=1}^{l} V(xᵢ, yᵢ, f) + γ_A‖f‖²_K + γ_B‖f‖²_I   (61)

where γ_A and γ_B are trade-off parameters and K represents a reproducing kernel Hilbert space. The regularization should bias learning toward smoother functions f. Since ‖∇f(x)‖ describes the smoothness of a function f at x, a notion of the smoothness of f on the entire manifold M is given as:

∫_M ‖∇f(x)‖² dx.   (62)

This regularization expresses the idea that the functions should be smooth on the manifold, not just smooth in the extrinsic space.

To smooth f in the intrinsic space, there are many manifold regularizations, including the Laplacian regularization (LapR), Hessian regularization (HesR) and p-Laplacian regularization (pLapR) [59].
Table 4
The low-rank regularizations. For each regularizer, the original table also plots its shrinkage image: the horizontal axis is the value of σᵢ and the vertical axis is the corresponding shrinkage value.

Regularization | Convexity | Formulation
‖𝑿‖_* | Convex | Σ_{i=1}^{k} σᵢ
‖𝑿‖_w | Convex | Σ_{i=1}^{k} wᵢσᵢ
‖𝑿‖_{C*ε} | Non-convex | Σ_{i=1}^{k} min(σᵢ, ε)
‖𝑿‖_{Sp} | Non-convex | (Σ_{i=1}^{k} σᵢ^p)^{1/p}
Elastic-net | Convex | Σ_{i=1}^{k} (σᵢ + λσᵢ²)
LNN | Non-convex | Σ_{i=1}^{k} λ log(σᵢ + 1)
MCP | Non-convex | Σ_{i=1}^{k} [ λσᵢ − σᵢ²/(2γ) if σᵢ < γλ;  γλ²/2 if σᵢ ≥ γλ ]
SCAD | Non-convex | Σ_{i=1}^{k} [ λσᵢ if σᵢ ≤ λ;  (−σᵢ² + 2γλσᵢ − λ²)/(2(γ − 1)) if λ < σᵢ ≤ λγ;  λ²(γ + 1)/2 if σᵢ > λγ ]
ETP | Non-convex | Σ_{i=1}^{k} λ(1 − exp(−γσᵢ))/(1 − exp(−γ))
Logarithm | Non-convex | Σ_{i=1}^{k} log(γσᵢ + 1)/log(γ + 1)
Geman | Non-convex | Σ_{i=1}^{k} λσᵢ/(σᵢ + γ)
Laplace | Non-convex | Σ_{i=1}^{k} λ(1 − exp(−σᵢ/γ))
the squared difference between a node's value and the values of its neighbors in the graph case. That is, ‖f‖²_I is approximated as

(1/n²) Σ_{i,j} W_{ij} (f(xᵢ) − f(xⱼ))².   (63)

LapR constructs a nearest neighbor graph on the feature space, which utilizes a graph similarity (affinity) matrix 𝑾 = [W_{ij}] that measures the similarity between samples xᵢ and xⱼ:

W_{ij} = exp(−‖xᵢ − xⱼ‖²/(2σ²)) if xᵢ ∈ N_q(xⱼ) or xⱼ ∈ N_q(xᵢ), and W_{ij} = 0 otherwise,   (64)

where N_q(𝒙) is the set of q-nearest neighbors of sample 𝒙, and 𝑾 is usually symmetrized by 𝑾 = (𝑾ᵀ + 𝑾)/2. It assumes that if two points x₁ and x₂ are close in the manifold geometry, their labels should be similar.

The geometry structure is exploited as

(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} W_{ij} (f(xᵢ) − f(xⱼ))² = Tr(𝒇ᵀ𝑳𝒇)   (65)

where 𝒇 = [f(x₁), f(x₂), …, f(xₙ)]ᵀ is the prediction on all of the data (labeled and unlabeled), 𝑫 is a diagonal matrix with D_{ii} = Σ_{j=1}^{n} W_{ij}, and 𝑳 = 𝑫 − 𝑾 is called the Laplacian of the graph G, which quantifies the smoothness of functions. Notice that, in Eq. (61), {fᵢ}_{i=1}^{l} should satisfy the constraint f(xᵢ) = yᵢ, i = 1, …, l, which is reflected in the term (1/l) Σ_{i=1}^{l} V(xᵢ, yᵢ, f).

Considering V as the hinge loss, the framework of LapR is written as follows:

f* = arg min_{f∈K} (1/l) Σ_{i=1}^{l} [1 − yᵢf(xᵢ)]₊ + γ_A‖f‖²_K + (γ_B/n²) 𝒇ᵀ𝑳𝒇   (66)

where n = l + u and K represents a reproducing kernel Hilbert space. In this way, the geometric structure of the data in the high-dimensional space is preserved and the data distribution information is well explored.

In image annotation, manifold regularization is used to ensure that local geometric structures are consistent between different view feature spaces and the latent semantics matrix among different views [60,61]. The work [62] further learns a graph Laplacian from each of the different views and adjusts the combination coefficients. An alternative approach assumes that the intrinsic manifold is the
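To make Eqs. (63)–(65) concrete, the following sketch (ours, not from the paper) builds the affinity matrix 𝑾 of Eq. (64) on random 2-D samples, forms the graph Laplacian 𝑳 = 𝑫 − 𝑾, and evaluates the smoothness term 𝒇ᵀ𝑳𝒇; the data, the number of neighbours q and the bandwidth sigma are arbitrary.

```python
import numpy as np

def gaussian_knn_affinity(X, q=3, sigma=1.0):
    # W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) if x_j is among the q nearest
    # neighbours of x_i (or vice versa), and 0 otherwise, then symmetrized.
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:q + 1]          # skip the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    return (W + W.T) / 2

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
f = np.sin(X[:, 0])                                # some function evaluated on the samples

W = gaussian_knn_affinity(X)
D = np.diag(W.sum(axis=1))
L = D - W                                          # graph Laplacian
print("f^T L f =", f @ L @ f)                      # equals 0.5 * sum_ij W_ij (f_i - f_j)^2
```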
Fig. 8. The differences of semi-supervised regression by using LapR, HesR, and pLapR for fitting two points on the 1-D spiral.
need to impose sparse constraints on variables, such as compressed sensing, feature selection, sparse signal separation, and sparse PCA. The l₀ norm constrains the values of the weights to be close to zero. Except for giving no penalty to zero, the l₀ norm penalizes all nonzero variables in the same way. As seen from the images of the regularizations, the penalty functions decrease rapidly as variables approach zero. Thus, the nonconvex penalties are closer to the l₀ norm. Moreover, the common properties of these regularizations were also given. It is worthwhile to discuss some good mathematical properties, since having these properties may result in some positive characteristics, such as being more conducive to solving or making convergence easier to obtain. These images and ideas can be used to construct vector-based regularizations.

The second subsection summarized the matrix-based sparse regularizations, which are important in large covariance matrix and inverse covariance matrix estimation. These regularizations are classified into three types: row sparsity, column sparsity, and column-and-row sparsity. Furthermore, two properties of these regularizations were proposed, which give the regularizations' bounds. All of the above can help us choose or construct a regularization for specific tasks.

Some matrix-based low-rank regularizations were given in the third subsection, which can be applied in matrix completion and robust PCA. The low-rank regularization adds a penalty to the singular values, which is motivated by the vector-based sparse regularizations. Thus, a straightforward way to construct regularization is to extend the vector-based regularizations.

The last subsection reviewed the manifold regularizations. The manifold assumption states that the data has some inherent structure, which is widely used in computer vision and multimedia. Manifold learning prefers to find a smoother model. This regularization captures the intuition that our functions should be smooth on the manifold, not just smooth in the extrinsic space. A natural way is to construct a regularization to measure the smoothness of f, such as ∫_M ‖∇f(x)‖² dx.

3. The regularization strategies in deep learning

There are many strategies for deep learning. Aiming at the causes of overfitting, which include noisy data, the limited size of the training set, and the complexity of classifiers, this section mainly introduces data augmentation, dropout, early stopping, batch normalization and several other common methods to avoid overfitting.

3.1. Data augmentation

One powerful way to improve model generalization performance is to train it on more data. In practice, the lack of a sufficient amount of training data or an uneven class balance within the datasets are common problems. Data augmentation is based on the assumption that more information can be obtained from the new data after augmentation [81]. Dataset augmentation has been a particularly effective technique for image classification. For some image classification tasks, it is reasonable to create new fake data to add to the data set. In computer vision, there are some basic augmentation operations, which can be seen as a kind of oversampling [82–84]. This section also covers black-box approaches focused on deep neural networks, in addition to traditional white-box methods.

3.1.1. Traditional augmentation

The augmentation methods [85,86] roughly include: flipping, cropping, resizing, rotating, transposing, inverting, brightness, sharpness, equalize, auto contrast, convert and color balance. These methods were experimented with on the bee image of the hymenoptera data; the image is 500 × 464 pixels. The result of these methods is shown in Fig. 9. Table 5 lists the methods used to augment data and their descriptions.

Affine transformation is essentially a linear transformation of the image coordinate vector space; it can be formulated as:

𝒚 = 𝑾𝒙 + 𝒃,   (76)

in which 𝒙 and 𝒚 are the 2-D image pixel coordinate vectors before and after the affine transformation, and 𝑾 and 𝒃 are the linear transformation matrix and translation vector, respectively. Common affine-transformation-based image data augmentation methods include translation, rotation, and flipping, as illustrated in Fig. 9(c), 9(d), 9(e), respectively.
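The sketch below illustrates Eq. (76) on pixel coordinates (it is our example, not the paper's code); the rotation angle, the translation and the 500 × 464 corner coordinates are illustrative.

```python
import numpy as np

def affine_augment(coords, angle_deg=15.0, shift=(4.0, -2.0)):
    # Apply y = W x + b from Eq. (76): a rotation W followed by a translation b,
    # to an array of 2-D pixel coordinates (one row per pixel).
    theta = np.deg2rad(angle_deg)
    W = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    b = np.asarray(shift)
    return coords @ W.T + b

corners = np.array([[0.0, 0.0], [0.0, 463.0], [499.0, 0.0], [499.0, 463.0]])
print(np.round(affine_augment(corners), 1))   # where the image corners land after the transform
```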
These methods have been proven to be easy, fast, repeatable, and dependable. However, some techniques result in the loss of image data. The disadvantages of color space transformations are increased memory, transformation costs, and training time. Moreover, color transformations may discard essential color details and thus are not always label-preserving transformations. For example, when the pixel values of an image are reduced to simulate a darker environment, the objects in the image may become invisible.
Table 5
The methods used to augment data and their own descriptions.

Method | Description
Crop | Crop the image by a box; this reduces the size of the input.
Resize | Zoom in or zoom out the image and select the center of the scaled image.
Rotate | Rotate the image by some degrees.
Transpose | Flip the image horizontally or vertically.
Invert | Invert the pixels of the image.
Brightness | Brightness enhancement; the magnitude is proportional to the brightness.
Sharpness | Sharpness enhancement; the magnitude is proportional to the sharpness.
Equalize | Equalize the image histogram.
Auto contrast | Maximize the contrast of the image.
Convert | Convert the mode of the image.
Color balance | Adjust the color balance of the image. A magnitude = 0 gives a black-and-white image, while magnitude = 1 gives the original image.
Table 6
The GAN-variants.
Model Description Ref
CGAN Conditional GAN: Constructed by merely extra auxiliary information (e.g., class label). [150]
DCGAN Deep Convolutional GAN: First structure of de-convolutional neural networks (de-CNN). [151]
LapGAN Laplacian GAN: Combine the CGAN with the framework of the Laplacian pyramid. [152]
InfoGAN Information Maximizing GAN: Learn disentangled design in a wholly unsupervised way. [153]
WGAN Wasserstein GAN: Its loss function derived through Earth-Mover or Wasserstein distance. [154]
BEGAN Boundary Equilibrium GAN: Keep-up an equilibrium between variety and superiority. [155]
PGGAN Progressive-Growing GAN: A multi-scale based GAN architecture. [156]
BigGAN BigGAN: Train bigger neural networks. [157]
StyleGAN Style-Based Generator Architecture for GAN: Control the image synthesis process. [158]
EBGAN Energy-Based GAN: Its discriminator works as an energy function. [159]
RPDGAN Realistic Painting Drawing GAN: An unsupervised cross-domain image translation framework. [160]
Fig. 11. Style Transfer: The style of van Gogh is applied to the content of Black-and-White image to generate an image which preserves the original content of (b) and has the
style of (a) [99].
enough data to support the convergence of network training. When training data are not sufficient, it is difficult for a GAN to achieve a satisfactory equilibrium, and it easily falls into mode collapse. Although the generated sample size is expanded, this is not obviously helpful for the diversity of samples, which is similar to simple replication of samples. For the downstream neural network, the model will be biased. At the same time, when dividing training, validation and test sets, data leakage may occur. For large-scale data, there is generally no need to expand the sample, or simple data augmentation methods are enough to expand the sample size. When the amount of data is medium, using a GAN to generate samples and expand the data set seems helpful to the training convergence of downstream neural networks. GAN-based data augmentation may be more effective in some tasks with low resolution and acceptable definition. The work [95] proposed a training scheme that first uses classical data augmentation to enlarge the training set and then uses GAN-based synthetic data augmentation to enlarge the size and diversity of the dataset. The work [96] used a GAN to generate synthetic data with the purpose of training classifiers without using the original data set, and of oversampling minority classes in unbalanced classification scenarios.

3.1.4. Neural style transfer

Neural Style Transfer algorithms [97] can apply the artistic style of one image to another image while preserving its original content, as shown in Fig. 11. It serves as a great tool for data augmentation, somewhat analogous to color space lighting transformations for images. Style transfer algorithms consist of a descriptive approach and a generative approach. The descriptive approach refers to changing the pixels of a noise image iteratively, and the generative approach uses a pre-trained model of the desired style to achieve the same effect in a single forward pass [98].

In the generative approach, a model is trained in advance for each style image. Johnson et al. [99] proposed a two-component architecture of generator and loss networks. They also introduced a novel loss function based on perceptual differences between the content and the target. In [100], an approach was proposed to move the computational burden to a learning stage to improve on the work of [99].

Some recently proposed data augmentation strategies follow a generative approach. In the work [101], Wang et al. explored Generative Adversarial Nets to generate images of different styles to augment the dataset. Zheng et al. generated two stylized images for each input image, then merged the stylized images and original images to compose the final training dataset [102]. A disadvantage of Neural Style Transfer data augmentation is the effort required to select styles to transfer images into. If the style set is too small, further biases could be introduced into the dataset.

3.1.5. Meta-learning

Meta-learning, also known as learning to learn, is a scientific approach to observing how different machine learning methods perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks faster than otherwise possible [103]. In contrast to conventional machine learning, which solves tasks from scratch using a fixed learning algorithm, meta-learning attempts to refine the learning algorithm itself based on the experience of multiple learning episodes, aiming to learn on the basis of tasks rather than samples, that is, learning task-agnostic learning systems rather than task-specific models. Successful applications have been demonstrated in areas spanning few-shot image recognition [104], unsupervised learning [105], data-efficient [106,107] and self-directed [108] reinforcement learning (RL), hyper-parameter optimization [109], and neural architecture search (NAS) [110].

Smart Augmentation is the process of learning suitable augmentations when training deep neural networks. Its goal is to learn the best augmentation strategy for a given class of input data. The work [111] used two networks, Network A and Network B. Network A is an augmentation network that generates new samples to train Network B, and Network B performs the specific task. The change in the error rate of Network B is then backpropagated to update Network A.

Auto-Augment is a reinforcement learning algorithm that searches for an optimal augmentation policy among a constrained set of geometric transformations with miscellaneous levels of distortions. The work [112] took the labeled images and predefined preprocessing
transformations as the input set. It jointly learned the classifier and the optimal preprocessing transformations for individual images.

A disadvantage of meta-learning is that it is a relatively new concept and has not been heavily tested. Additionally, meta-learning schemes can be difficult and time-consuming to implement.

Fig. 12. An example of standard dropout. The left network is fully connected, and the right has had neurons dropped with probability 0.5.

3.2. Dropout

Training a deep neural net with a large number of learnable parameters is very costly. Overfitting and long training time become two fundamental challenges. Model compression can reduce model size or running time while maintaining accuracy, that is, training fewer parameters without losing model accuracy. Dropout zeros out the activation values of randomly selected neurons during training in order to improve sparsity in the neural network weights. This property means that dropout methods can be applied to compressing neural network models by reducing the number of parameters needed to perform effectively [113].

3.2.1. Standard dropout

Hinton et al. proposed the dropout method for the first time [114], and it was subsequently applied to large-scale visual recognition problems [115]. The main idea is that individual nodes are either kept with a probability of p or omitted from the network with a probability of 1 − p in each training iteration. Dropout is not applied to the output layer. As shown in Fig. 12, on each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5.

Mathematically, the behavior of standard dropout during training for a neural network layer is given by:

𝒚 = f(𝑾𝒙) ∘ 𝒎,  mᵢ ∼ Bernoulli(p),   (77)

where 𝒚 is the layer output, f(⋅) is the activation function, 𝑾 is the layer weight matrix, 𝒙 is the layer input, and 𝒎 is the layer dropout mask, with each element mᵢ being 0 with probability 1 − p. Once trained, the layer output is given by:

𝒚 = (1 − p)f(𝑾𝒙).   (78)

Standard dropout is equivalent to adding, after a layer of neurons, a layer that simply sets values to zero with some probability during training, and multiplies them by 1 − p during testing.

The standard dropout promotes sparsity in the weights of neural networks, causing more weights to be near zero [115]. Several variants have been produced since the standard dropout was proposed, as shown in Table 7.

3.2.2. Dropconnect

One of the variations on standard dropout is Dropconnect [116]. In training, the output of a network layer is given by:

𝒚 = f((𝑾 ∘ 𝑴)𝒙),  M_{ij} ∼ Bernoulli(p),   (79)

where 𝑴 is a binary matrix encoding the connection information and M_{ij} ∼ Bernoulli(p). Dropconnect is only applicable to fully connected layers and randomly drops the weights rather than the activations.

3.2.3. Standout

To ensure that unconfident units drop out more frequently than confident units, Standout overlays a binary belief network onto a neural network, which controls the dropout properties of individual neurons [117]. The belief network is interpreted as tuning the architecture of the neural network. For each weight in the original neural network, Standout adds a corresponding weight parameter in the binary belief network. A layer's output during training is given by:

𝒚 = f(𝑾𝒙) ∘ 𝒎,  mᵢ ∼ Bernoulli(g(𝑾_s𝒙)),   (80)

where 𝑾_s represents the belief network's weights for that layer and g(⋅) represents the belief network's activation function.

An effective approach to determine the belief network weights is to set them as:

𝑾_s = α𝑾 + β   (81)

at each training iteration for some constants α and β. The output of each layer during testing is given by:

𝒚 = f(𝑾𝒙) ∘ g(𝑾_s𝒙).   (82)

3.2.4. Curriculum dropout

In a traditional machine learning algorithm, all training examples are presented to the model in an unorganized fashion, with frequent random shuffling. In curriculum learning, the level of complexity of the concepts to learn is proportional to a person's age, i.e. handling easier knowledge as a baby and harder knowledge as an adult. Inspired by this, training examples can be subdivided based on their difficulty. Then, the learning is configured so that easier examples come first, eventually complicating them and processing the hardest ones at the end of training.

Curriculum Dropout was proposed to use a time schedule for the probability of retaining neurons in the network [118]. This results in an adaptive regularization scheme that dynamically increases the expected number of suppressed units to boost the generalization of the model while smoothly increasing the difficulty of the optimization problem. Using a fixed dropout probability during training is proven to be a suboptimal choice. At the beginning of Curriculum Dropout, no entry of z₀ is set to zero. This clearly corresponds to the easiest available example and considers all possible available visual information. As learning time grows, a greater number of entries are set to zero. This complicates the challenge and necessitates a greater effort on the part of the model to capitalize on the limited amount of uncorrupted data available at that point in the training phase.

3.2.5. DropMaps

Moradi et al. proposed DropMaps, where, for a training batch, each feature map is kept with a probability of p and is dropped with a probability of 1 − p [119]. At test time, all feature maps are kept, and each one is multiplied by p. Like dropout, DropMaps has a regularization effect that causes the co-adaptation of feature maps to be avoided. DropMaps can handle the problem of overfitting large models on tiny images.

Table 7 shows some proposed methods and theoretical advances in dropout.

3.3. Early stopping

Noisy labels are very common in real-world training data. Since the model overfits noisy labels during training, generalization to the test data may be weak. As shown in Fig. 13, the training error decreases steadily over time, but the validation set error begins to rise again. A copy of the model parameters is saved every time the error on the validation set improves. When the training algorithm finishes, instead of using the most recent parameters, these saved parameters are used as the result. Usually the model is stopped when the validation set error has not improved for a while, rather than the model reaching
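The early-stopping bookkeeping described in Section 3.3 can be written in a few lines. The following sketch is our illustration rather than a specific published implementation; `train_step`, `val_error`, the patience value and the toy model are hypothetical placeholders supplied by the caller.

```python
import copy

def train_with_early_stopping(model, train_step, val_error, max_epochs=200, patience=10):
    # Keep a copy of the parameters every time the validation error improves and
    # stop once it has not improved for `patience` consecutive epochs.
    best_err, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for _ in range(max_epochs):
        train_step(model)
        err = val_error(model)
        if err < best_err:
            best_err, best_model, since_best = err, copy.deepcopy(model), 0
        else:
            since_best += 1
        if since_best >= patience:
            break
    return best_model, best_err

# Toy usage: the "model" is a single parameter nudged toward 2.0; the validation
# error starts rising once the parameter overshoots, so training stops early.
model = {"w": 0.0}
train_step = lambda m: m.update(w=m["w"] + 0.3)
val_error = lambda m: (m["w"] - 2.0) ** 2
best, err = train_with_early_stopping(model, train_step, val_error)
print(best, round(err, 4))
```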
Table 7
Some proposed methods and theoretical advances in dropout.
Year Methods
2012 Standard dropout
2013 Standout; Fast dropout; Dropconnect; Maxout
2014 Annealed dropout; DropAll
2015 Variational dropout; RNNdrop; MaxTpooling dropout; Spatial dropout
2016 Evolutionary dropout; Variational RNN dropout; Monte Carlo dropout; Selective CNN dropout; Swapout
2017 Cutout; Concrete dropout; Curriculum dropout; Tuneout.
2018 Adversarial dropout methods; Fraternal dropout; Information dropout; Targeted dropout; DropBlock; Jumpout
2019 Spectral dropout; Ising dropout; Weighted Channel Dropout; DropMaps
2020 Gradient Dropout; Surrogate Dropout
2021 Ranked dropout; LocalDrop
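As a minimal illustration of the standard dropout mask of Eq. (77), the sketch below (ours, not from the paper) applies a Bernoulli mask to a layer's activations. It uses the common "inverted dropout" convention of rescaling by 1/p during training, so the test-time pass needs no extra factor; this differs slightly from the test-time rescaling written in Eq. (78). ReLU, the layer sizes and p are arbitrary choices.

```python
import numpy as np

def dropout_forward(x, W, p=0.5, train=True, rng=None):
    # Eq. (77): the layer output f(Wx) is multiplied element-wise by a Bernoulli
    # mask m, where each unit is kept with probability p.
    rng = rng or np.random.default_rng()
    h = np.maximum(W @ x, 0.0)          # f is taken to be ReLU in this sketch
    if not train:
        return h
    m = (rng.random(h.shape) < p).astype(h.dtype)
    return h * m / p                    # inverted-dropout scaling

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
x = rng.standard_normal(4)
print("train:", np.round(dropout_forward(x, W, rng=rng), 3))
print("test :", np.round(dropout_forward(x, W, train=False), 3))
```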
Table 8
For each normalization, N is the batch axis, C is the channel axis, and (H, W) are the spatial axes. The pixels shown in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels.

Item | Content | Mean and variance
BN | BN normalizes the data in each batch. | μ_c(x) = (1/(NHW)) Σ_{n=1}^{N} Σ_{h=1}^{H} Σ_{w=1}^{W} x_{nchw};  σ_c(x) = sqrt((1/(NHW)) Σ_{n,h,w} (x_{nchw} − μ_c(x))² + ε)
IN | IN is a separate normalization operation for each channel in a sample. | μ_{nc}(x) = (1/(HW)) Σ_{h=1}^{H} Σ_{w=1}^{W} x_{nchw};  σ_{nc}(x) = sqrt((1/(HW)) Σ_{h,w} (x_{nchw} − μ_{nc}(x))² + ε)
LN | LN normalizes all data in a sample. | μ_n(x) = (1/(CHW)) Σ_{c=1}^{C} Σ_{h=1}^{H} Σ_{w=1}^{W} x_{nchw};  σ_n(x) = sqrt((1/(CHW)) Σ_{c,h,w} (x_{nchw} − μ_n(x))² + ε)
GN | GN divides the channels of a sample into multiple groups, then normalizes each group. | μ_{ng}(x) = (1/((C/G)HW)) Σ_{c=gC/G}^{(g+1)C/G} Σ_{h=1}^{H} Σ_{w=1}^{W} x_{nchw};  σ_{ng}(x) = sqrt((1/((C/G)HW)) Σ_{c=gC/G}^{(g+1)C/G} Σ_{h,w} (x_{nchw} − μ_{ng}(x))² + ε)
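A small numpy sketch (ours, not the paper's code) of the BN and LN rows of Table 8 on an (N, C, H, W) tensor follows; the tensor shape and ε are illustrative, and the learnable scale and shift used in practice are omitted.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Table 8 (BN row): mean and variance are taken over the batch and spatial
    # axes (N, H, W), separately for every channel c of an (N, C, H, W) tensor.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    sigma = np.sqrt(((x - mu) ** 2).mean(axis=(0, 2, 3), keepdims=True) + eps)
    return (x - mu) / sigma

def layer_norm(x, eps=1e-5):
    # Table 8 (LN row): statistics are taken over (C, H, W) for every sample n.
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    sigma = np.sqrt(((x - mu) ** 2).mean(axis=(1, 2, 3), keepdims=True) + eps)
    return (x - mu) / sigma

x = np.random.default_rng(0).standard_normal((4, 3, 8, 8))
print(batch_norm(x).mean(axis=(0, 2, 3)))   # per-channel means are ~0 after BN
print(layer_norm(x).std(axis=(1, 2, 3)))    # per-sample stds are ~1 after LN
```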
Table 9
Some normalization methods.
Model Description Ref
Weight normalization It normalizes the weights of the layer. [161]
Batch-Instance Normalization It tries to learn how much style information should be used for each channel. [162]
Decorrelated Batch Normalization It centers, scales and whitens activations of each layer. [163]
Iterative Normalization It employs Newton’s iterations for much more efficient whitening and avoids the eigen-decomposition. [164]
Batch Renormalization The model outputs are dependent only on the individual examples during both training and inference. [165]
Switchable Normalization It selects different normalizers for different normalization layers of a deep neural network. [166]
Differentiable Dynamic Normalization It learns arbitrary normalization forms in data- and task-driven way for deep neural networks. [167]
Table 11
The advantages and disadvantages of the regularization technologies in deep learning.
Technology Advantages Disadvantages
Data augmentation Obtains a smaller robust error Induces the data distribution bias
Dropout Prevents units from co-adapting Increases the training time
Early stopping No change to model/algorithm Need validation data
Batch normalization Removes dropout and increased accuracy Difficult to estimate 𝜇 and 𝜎 in the test.
Table 12
Model accuracy, the number of epochs the model needs to be trained, and the number of operations per input sample during the training phase [5].

Method | None | l2 norm | Dropout | Data augmentation | Batch normalization | Adding noise
Accuracy | 82.1 | 88.3 | 90.6 | 88.1 | 90.8 | 84.5
Epochs | 12 | 12 | 28 | 17 | 6 | 16
Training Ops/Sam (M) | 230 | 232 | 118 | 232 | 420 | 231
3.5.3. Multi-task learning

A special type of regularization technology is multi-task learning (MTL). Most multi-task data can be collected from diverse domains or various tasks. MTL exploits the relationships among tasks and improves the performance of all tasks by learning multiple tasks simultaneously [138].

MTL consists of the feature learning approach, the task clustering approach and the task relation learning approach. The feature learning approach assumes that related tasks lie in a common feature subspace [139]. The task clustering approach assumes that the tasks form several clusters, where the model is trained by task groups [140]. The task relation learning approach assumes more complex relationships among tasks and usually learns pairwise relations directly among tasks, such as task similarity [141], task correlations [142], and task covariance [143].

3.5.4. Adding noise

Adding noise to a model or its weights may be used to make structures more generalizable and avoid overfitting [144,145]. In deep learning, noise can be added either to the input data or to the weights of the network. Adding noise to the input data has a relatively long history in data pre-processing, and in some literature it is referred to as dithering. In deep learning, adding noise to the input data hinders memorizing but preserves learnability. The idea of adding noise has similar effects on different problems in machine learning.

3.6. Discussion

This part compares the mentioned regularization technologies and concludes existing opportunities and challenges of the regularizations in deep learning.

While simple to state, these comparisons might have profound implications:

• The main applications of these regularization technologies are shown in Table 10.
• The advantages and disadvantages of the regularization technologies in deep learning are discussed in Table 11.

Regularization technologies have rapidly become an essential part of the deep learning toolkit. However, a fundamental problem that has accompanied this rapid progress is a lack of theoretical proof. In particular, it is unclear how these technologies improve generalization; it is even unknown whether they achieve better generalization.

The lack of these theories greatly limits the development of regularization technologies. However, a recent work argues that BN makes the optimization landscape significantly smoother instead of avoiding covariate shift [146]. Specifically, the Lipschitzness of both the loss and the gradients induces a more predictive and stable behavior of the gradients, allowing for faster training. Regularizations with good properties should be preferred.

The combination of two or more regularization techniques may lead to a great effect, but other combinations might make '1 + 1 < 1'. For example, Li et al. [147] reconcile dropout and batch normalization, which reduces the error rate in those applications; they recommend applying dropout after batch normalization with a small dropout rate. In contrast, the work [148] shows that l₂ regularization has no regularizing effect when combined with normalization. Therefore, the combination of regularization technologies plays a fundamental role in regularization research.

Computational cost is a factor in the regularization selection. Table 12 shows the model accuracy, the number of epochs to convergence, and the number of operations per input sample of some regularization technologies [5]. As can be seen, weight decay and data augmentation have little computational side effect. Therefore, they can be used in most applications. In the case of sufficient computational resources, the methods related to dropout are reasonable to use. Moreover, in the case of abundant computing resources, batch-normalization-family methods are reasonable strategies to use as regularization in the network.

4. Conclusion

In this paper, we reviewed popular regularization techniques of recent years. In machine learning, this paper introduced sparse regularization, low-rank regularization, and manifold regularization. In deep learning, data augmentation, dropout, early stopping and batch normalization are seen as the common regularization strategies. This paper summarized how to choose the regularization technologies as follows:

• For sparse regularization, it is meaningful to construct a regularization that has good mathematical properties for specific tasks.
• The low-rank regularization can be constructed by extending the sparse regularizations.
• Choose the regularization technologies that lead to good optimization properties.
• Manifold regularization should guarantee the smoothness of the learning function.
• Computational cost can be considered as a factor when choosing regularizations.

It seems the trend of employing more effective regularization techniques will continue, and we will see better and smarter methods in the near future.

Several findings of this article, as well as potential future research, are summarized below: (1) Developing more effective regularization strategies has been the subject of significant research efforts. (2) The advancement of regularization technologies is severely hampered by the lack of theoretical clarification. In particular, it is unclear how these regularization technologies improved generalization, and it is also
unclear whether they achieved better generalization. (3) The combi- [24] D. Geman, G. Reynolds, Constrained restoration and the recovery of
nation of two or more regularization techniques may lead to a great discontinuities, IEEE Trans. Pattern Anal. Mach. Intell. 14 (3) (1992) 367–383.
[25] I.W. Selesnick, I. Bayram, Sparse signal estimation by maximally sparse convex
effect, but others might make ’1 + 1 < 1’. Therefore, taking into
optimization, IEEE Trans. Signal Process. 62 (5) (2014) 1078–1092.
account the combination of regularization technologies is important in [26] M. Malek-Mohammadi, C.R. Rojas, B. Wahlberg, A class of nonconvex penalties
regularization research. preserving overall convexity in optimization-based mean filtering, IEEE Trans.
Signal Process. 64 (24) (2016) 6650–6664.
CRediT authorship contribution statement [27] J. Fan, Y. Liao, H. Liu, An overview of the estimation of large covariance and
precision matrices, Econom. J. 19 (1) (2016) C1–C32.
[28] J. Fan, F. Han, H. Liu, Challenges of big data analysis, Natl. Sci. Rev. 1 (2)
Yingjie Tian: Funding acquisition, Resources, Writing – review- (2014) 293–314.
ing and editing, Conceptualization, Supervision. Yuqi Zhang: Writing- [29] R.C. Qiu, P. Antonik, Smart Grid using Big Data Analytics: A Random Matrix
original draft, Visualization, Methodology. Theory Approach, John Wiley and Sons, 2017.
[30] H. Liu, L. Wang, T. Zhao, Sparse covariance matrix estimation with eigenvalue
constraints, J. Comput. Graph. Statist. 23 (2) (2014) 439–459.
Declaration of competing interest
[31] D. Belomestny, M. Trabs, A.B. Tsybakov, Sparse covariance matrix estimation
in high-dimensional deconvolution, Bernoulli 25 (3) (2019) 1901–1938.
The authors declare that they have no known competing finan- [32] X. Liu, N. Zhang, Sparse inverse covariance matrix estimation via the-norm with
cial interests or personal relationships that could have appeared to tikhonov regularization, Inverse Problems 35 (11) (2019) 115010.
influence the work reported in this paper. [33] C. Ding, D. Zhou, X. He, H. Zha, R 1-pca: rotational invariant l 1-norm principal
component analysis for robust subspace factorization, in: Proceedings of the
23rd International Conference on Machine Learning, 2006, pp. 281–288.
Acknowledgment [34] N. Wang, Y. Xue, Q. Lin, P. Zhong, Structured sparse multi-view feature
selection based on weighted hinge loss, Multimedia Tools Appl. 78 (11) (2019)
This work has been partially supported by grants from: National 15455–15481.
Natural Science Foundation of China (No. 12071458, 71731009). [35] H. Liu, M. Palatucci, J. Zhang, Blockwise coordinate descent procedures for