0% found this document useful (0 votes)

5 views21 pages

ML Paper

This paper provides a comprehensive survey of regularization strategies in machine learning, emphasizing their importance for improving model generalization and preventing overfitting. It categorizes regularization techniques into four groups, discusses their characteristics, and offers guidance on selecting appropriate methods for specific tasks. The paper also highlights current opportunities and challenges in regularization technologies, along with future research directions.

Uploaded by

Ruhan Srivastava

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views21 pages

ML Paper

Uploaded by

Ruhan Srivastava

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

Information Fusion 80 (2022) 146–166

Contents lists available at ScienceDirect

Information Fusion
journal homepage: www.elsevier.com/locate/inffus

A comprehensive survey on regularization strategies in machine learning

Yingjie Tian a,c,d ,∗, Yuqi Zhang b,c,d
a
School of Economics and Management, University of the Chinese Academy of Sciences, Beijing 100190, China
b
School of Mathematical Sciences, University of the Chinese Academy of Sciences, Beijing 100049, China
c
Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
d Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China

ARTICLE INFO ABSTRACT

Keywords: In machine learning, the model is not as complicated as possible. Good generalization ability means that the
Overfitting model not only performs well on the training data set, but also can make good prediction on new data.
Generalization Regularization imposes a penalty on model’s complexity or smoothness, allowing for good generalization
Regularization
to unseen data even when training on a finite training set or with an inadequate iteration. Deep learning
Machine learning
has developed rapidly in recent years. Then the regularization has a broader definition: regularization is a
technology aimed at improving the generalization ability of a model. This paper gave a comprehensive study
and a state-of-the-art review of the regularization strategies in machine learning. Then the characteristics and
comparisons of regularizations were presented. In addition, it discussed how to choose a regularization for
the specific task. For specific tasks, it is necessary for regularization technology to have good mathematical
characteristics. Meanwhile, new regularization techniques can be constructed by extending and combining
existing regularization techniques. Finally, it concluded current opportunities and challenges of regularization
technologies, as well as many open concerns and research trends.

1. Introduction

Training a large-scale model is a challenging problem. A model

might be selected by maximizing its performance on some set of
training data. Its generalization might be determined by its ability
to perform well on unseen data. As an example in Fig. 1, the data
is fitted into a linear function and a polynomial function. Although
the polynomial function fits the data more perfectly, while the linear
function may make better predictions for unseen data. The concept
of overfitting is that the loss decreases on training data, but the loss
increases on unseen data.
Overfitting is a fundamental issue in supervised machine learning
because of the presence of data noise, the limited size of the training
set, and the complexity of classifiers. Regularization is a key component Fig. 1. An example of overfitting.
of machine learning [1], allowing for good generalization to unseen
that can fit data but some models overfit data. The penalty terms
data even when training on a finite training set or with an inadequate
iteration. usually limit the complexity the models to avoid overfitting. Fig. 2
Regularization adds some constraints to the minimized objective shows an overfitted model and a regularized model. The green line
function, which cannot be obtained from the data and represents the works best with training data, but it is overly reliant on it and is likely
prior preference. The first prior preference originated in Occam’s razor to have a higher error rate with new, unseen data. The black line rep-
principle which holds that it is good to explain the phenomenon with resents a regularized model, in which the regularization term limits the
the simplest hypothesis. In the early optimization literatures, ‘regular- model’s complexity and causes problems with using a simple model to
ization’ is the penalty term. In the optimization problem, many models

∗ School of Economics and Management, University of the Chinese Academy of Sciences, Beijing 100190, China.
E-mail addresses: [email protected] (Y. Tian), [email protected] (Y. Zhang).

https://doi.org/10.1016/j.inffus.2021.11.005
Received 13 May 2021; Received in revised form 23 October 2021; Accepted 2 November 2021
Available online 14 November 2021
1566-2535/© 2021 Elsevier B.V. All rights reserved.
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

• Provide a thorough review of the strategies for overcoming over-

fitting, and these strategies are regarded as the comprehensive
definition of regularization.
• Give the characteristics of regularizations and the comparison.
• Discuss how to choose or design a regularization for specific tasks.
• Introduce the related algorithms of solving the regularization
problems.
• Conclude existing regularization opportunities and challenges,
and provide several open issues and research trends.

The rest of this paper is organized as follows. Section 2 provides the

Fig. 2. The green line represents an overfitted model and the black line represents a regularization in machine learning. Subsequently, the regularization
regularized model. (For interpretation of the references to color in this figure legend, strategies of deep learning are introduced in Section 3. Finally, it
the reader is referred to the web version of this article.) concludes this paper and provides several open issues and research
trends in Section 4.

fit the data. The second prior preference tends to add more convincing 2. The regularization strategies in the traditional machine learn-
regularization to the model. If there are some stochastic/deterministic ing
noise, the objective function will become unsmooth and easy to over-
fitting. Therefore, the regularization should make target smooth. The In this section, a comprehensive overview of sparse and low-rank
third prior preference is target-dependent. Usually, regularization can regularization is provided, which are empirically categorized into four
be constructed according to some properties of target. The fourth prior groups. These regularizations are usually proposed to solve specific
preference is to make the model easy to solve, such as weight decay. applications. Thus, this section will give various practical problems
Deep models can learn complex representational spaces to deal with before introducing regularizations.
the difficult learning tasks. They are prone to overfitting, particularly For brief, note vector and matrix in bold to distinguish from the
in networks with millions or billions of learnable parameters, such as one-dimension variables.
convolutional neural networks (CNN) and Recurrent Neural Networks
(RNN). Hence, for deep learning, regularization need a broader def-
2.1. Sparse vector-based regularization
inition: regularization is any supplementary technique that aims to
improve the model’s generalization, i.e. produce better results on the
Some variables are required to be sparse in many practical prob-
test set [2]. Explicit regularization, such as dropout and weight decay,
lems, such as compressing sensing, feature selection, sparse signal
may improve generalization performance. When neural networks have
far more learnable parameters than training samples, the generalization separation, sparse PCA, and sparse signals separation. Sparse regular-
takes place even in the absence of any explicit regularization. In this ization, which has attracted much attention in recent years, usually
case, explicit regularization is useless and unnecessary. A possible imposes the penalty on the variables. Many examples in different fields
phenomenon is that the optimization is introducing some implicit can be found where sparse regularization beneficial and favorable.
regularization to make that there is no significant overfitting and test This section will review four applications of sparse regularization-based
error continues decreasing as network size increases past the size sparse vector. Meanwhile, some vector-based sparse regularizations will
required for achieving zero training error. It has been proved that be summarized.
early stopping implicitly regularize some convex learning problems.
Batch normalization is an operator that normalizes the layer responses 2.1.1. Application scenario
within each mini-batch. It has been widely used in many modern neural Compressing Sensing In compressing sensing, radar [7], commu-
networks. Although there is no explicit design for regularization, it nications [8], medical imaging [9], image processing [10], and speech
is usually found that batch normalization can improve generalization signal processing [11]. The objective of compressing sensing is to
performance. In fact, the algorithm itself is an implicit regularization reconstruct a sparse signal 𝒙:
solution (this paper does not consider the implicit regularization based
min 𝑃 (𝒙)
on algorithm) [3]. 𝒙 (1)
This paper provides a comprehensive review of the regularization s.t. 𝑨𝒙 = 𝒚
strategies in both machine learning and deep learning. Different reg- where 𝑨 ∈ R𝑚×𝑛 with 𝑚 ≤ 𝑛 is the sensing matrix (also called
ularizations are classified into four categories in machine learning: measurement matrix), 𝒚 is the compressed measurement of 𝒙, and 𝑃 is
sparse vector regularization, sparse matrix regularization, low-rank a penalty function for the sparsity. In fact, there is often measurement
matrix regularization, and manifold regularization. We discuss the noise 𝜺 ∈ R𝑚 so that
characteristics of each approach and the applied formulation. More-
over, we review the strategies to overcome the overfitting in deep 𝒚 = 𝑨𝒙 + 𝜺. (2)
learning such as data augmentation, dropout, early stopping, batch
To minimize the noise 𝒏, an unconstrained formulation can be written
normalization, etc. Finally, we discuss several future directions for
as
regularization.
1
To the best of our knowledge, there have been some regularization- min ‖𝑨𝒙 − 𝒚‖22 + 𝜇𝑃 (𝒙). (3)
𝒙 2
related surveys conducted up to this point [2,4–6]. The work [2] gave
the border definition of regularization and proposed a taxonomy to with 𝜇 > 0.
categorize existing methods but few specific introduction. The work [4] Feature Selection In many fields today, such as genomics, health
only introduced the low-rank regularizations and their applications. sciences, economics, and machine learning, the analysis of data sets
The work [6] only considered the regularization in traditional machine with a number of variables comparable to or even much larger than the
learning but not in deep learning. The work [5] gave a relatively com- sample size is required. Most high-dimensional problems is infeasible
plete introduction, but there are still omissions, and no application and and impractical because of their expensive computational costs. Feature
comparison are given. In comparison, the novelties and contributions selection has become a very popular topic in the last decade due to its
of this survey are as follows: effectiveness in high-dimensional case [12].

147
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Let 𝑨 is the design(deterministic) matrix, 𝒙 is the unknown regres-

sion coefficients, and 𝒚 is the response vector. Naturally, the penalty
function uses the 𝑙0 norm, i.e., 𝑃 (𝒙) = ‖𝒙‖0 which counts the number
of nonzero components in the vector 𝒙. The equivalent optimization
problem is

min 𝒚 = 𝑨𝒙 s.t. ‖𝒙‖0 ≤ 𝑘 (4)

𝒙

called the 𝑘-sparse approximation problem.

Sparse PCA PCA is a useful tool for dimensionality reduction and
feature extraction, which has been applied in virtually all areas of
science and engineering, such as signal processing, machine learning,
statistics, biology, medicine, finance, neurocomputing, and computer
networks, to name just a few.
Let 𝑨 ∈ R𝑚×𝑛 be a centered data matrix, and the sparse PCA prob-
lem can be formulated as maximizing the variance along a direction
represented by vector 𝒙 ∈ R𝑛 : Fig. 3. The curve of |𝑥|𝑝 for various values of 𝑝. As 𝑝 tends to zero, the curve of |𝑥|𝑝
approaches the indicator function.
1
max 𝑛−1
‖𝑨𝒙‖22
𝒙
s.t. ‖𝒙‖2 = 1 . (5)
𝑃 (𝒙) ≤ 𝑘 where 𝑥𝑗 are the elements of 𝒙. For measuring sparsity, 0 < 𝑝 < 1 is
The first constraint states that 𝑥 is a unit vector. In the second con- of most interest. The two special cases of 𝑝 = 1∕2 and 𝑝 = 2∕3 are
straint, 𝑃 is typically 𝑙0 norm, limiting the number of non-zero compo- in [14]. Fig. 3 presents the curve of the scalar weight function 𝑙𝑝 norm,
nents in 𝑥 that is less than or equal to 𝑘. An alternative to the sparsity the core of the norm computation, for various values of 𝑝, showing that
constrained formulation is the sparsity penalized formulation as follows as 𝑝 goes to zero, this measure becomes a count of the nonzeros in 𝒙.
Note that among the 𝑙𝑝 norms, the choice 𝑝 = 1 gives a convex
max ‖𝑨𝒙‖22 − 𝜇𝑃 (𝒙) function, while every choice 0 < 𝑝 < 1 yields a concave function. The
𝒙 .
s.t. ‖𝒙‖2 = 1 𝑙𝑝 penalty reduces to the soft-thresholding when 𝑝 = 1, while it tends
to the hard-thresholding in the limit as 𝑝 → −∞.
Sparse Signals Separation Sparse signals separation has broad
Projection Operator The projection operator [15], which can be
applications, such as source separation, super-resolution and inpaint-
seen as a regularization, is defined as
ing, interference cancellation, saturation and clipping restoration, and
robust sparse recovery in impulsive (sparse) noise. 𝑃 (𝒙) = 𝛿𝐶 (𝒙) (11)
The goal is to recover and demix 𝒙1 and 𝒙2 form the mixed linear {
0, if 𝒙 ∈ 𝐶
measurements where 𝐶 is a closed and convex set and 𝛿𝐶 (𝒙) =
∞, otherwise
𝒚 = 𝑨1 𝒙1 + 𝑨2 𝒙2 (6) is the indicator function that makes the solution 𝑥 project back onto
𝐶. The Table 1 shows closed and convex sets and their corresponding
via the following formulation (orthogonal) projections:
( ) ( ) Smoothly Clipped Absolute Deviation (SCAD) The SCAD was
min 𝜇𝑔1 𝒙1 + 𝑔2 𝒙2
𝒙1 ,𝒙2 (7) proposed by the work [16]. It has been commonly used in variable
s.t. 𝑨1 𝒙1 + 𝑨2 𝒙2 = 𝒚 selection problems and has shown to be efficient in high-dimensional
where 𝑔1 and 𝑔2 are penalties for sparsity promotion. 𝑨1 ∈ R𝑚×𝑛1 and variable selection problems as compared to other penalties. The SCAD
𝑨2 ∈ R𝑚×𝑛2 are known (deterministic) dictionaries, on which 𝒙1 and 𝒙2 is defined as:
can be sparsely (or approximately sparsely) represented. ⎧ 𝜆|𝑥𝑖 |, if |𝑥𝑖 | < 𝜆
A popular manner is to use convex relaxation that replaces the 𝑙0 ⎪
⎪ 2𝛾𝜆|𝑥𝑖 | − 𝑥2𝑖 − 𝜆2
norm by the 𝑙1 norm [13]: 𝑃 (𝑥𝑖 ) = ⎨ , if 𝜆 ≤ |𝑥𝑖 | < 𝛾𝜆 (12)
⎪ 2(𝛾 − 1)
min 𝜇‖ ‖ ‖ ‖
‖𝒙1 ‖1 + ‖𝒙2 ‖1 ⎪ (𝛾 + 1)𝜆2 ∕2,
𝒙1 ,𝒙2 . (8) ⎩ if |𝑥𝑖 | ≥ 𝛾𝜆
s.t. 𝑨1 𝒙1 + 𝑨2 𝒙2 = 𝒚
The SCAD penalty is continuously differentiable on R but singular
2.1.2. Regularization at 0, with its derivatives zero outside the range [−𝛾𝜆, 𝛾𝜆]. SCAD also
produces a sparse set of solutions and this penalty is nonconvex.
𝑙1 norm The 𝑙1 is defined as
Minimax Concave Penalty (MCP) The MCP function [17]
∑
𝑁
| |
‖𝒙‖1 = |𝑥𝑗 | , (9) ⎧ 𝑥2𝑖
| | ⎪ 𝜆 |𝑥 | − , if ||𝑥𝑖 || ≤ 𝛾𝜆
𝑗=1 𝑃 (𝑥𝑖 ) = ⎨ | 𝑖 | 2𝛾 (13)
where 𝑥𝑗 is the elements of 𝒙. For sparsity promotion, the 𝑙1 norm regu-
1
⎪ 2 𝛾𝜆2 , if ||𝑥𝑖 || > 𝛾𝜆
⎩
larization is easy to establish convex optimization problems. However,
is another alternative to get less biased in sparse models. The MCP
the 𝑙1 norm regularization cause a bias for large coefficients. In recent
provides the sparse convexity to the broadest extent by minimizing the
years, there has been a trend to study sparse regularizations with less
maximum concavity.
bias.
Generalized Minimax Concave (GMC) A new penalty based on
𝑙𝑝 norm With the aim of bridging the gap between the 𝑙0 norm and
MCP is GMC [18] which is non-convex but maintains the convexity
𝑙1 norm, there is a significant interest in the use of the nonconvex 𝑙𝑝
property of some cost function such as the least squared loss function
norm, 1
2
‖𝒚 − 𝑨𝒙‖22 . The GMC penalty is defined in terms of a new multivariate
(𝑁 )1 generalization of the Huber function:
∑ | |𝑝 𝑝
‖𝒙‖𝑝 = |𝑥𝑗 | , (10)
| | 𝑃 (𝒙) = ‖𝒙‖1 − 𝑆𝐵 (𝒙) (14)
𝑗=1

148
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Table 1
The convex sets and their corresponding (orthogonal) projections.
Convex set The projections
Nonnegative orthant 𝐶1 = R𝑛+ [𝒙]+
( )−1
Affine set 𝐶2 = {𝒙 ∈ R𝑛 ∶ 𝑨𝒙 = 𝒚} 𝒙 − 𝑨𝑇 𝑨𝑨𝑇 (𝑨𝒙 − 𝒚)
( { { } })𝑛
Box 𝐶3 = Box{[𝑏𝑖 , 𝑢𝑖 ]𝑛𝑖=1 } where 𝑏𝑖 , 𝑢𝑖 ∈ (−∞, ∞] min max 𝑥𝑖 , 𝑏𝑖 , 𝑢𝑖 𝑖=1
{ } [𝒂𝑇 𝒙−𝛼]+
Half-space 𝐶4 = 𝒙 ∶ 𝒂𝑇 𝒙 ≤ 𝛼 𝒙− ‖𝒂‖2
𝒂

Firm Penalty The firm penalty is formulated as [21]

{ [ ]
𝜆 |𝑥| − 𝑥2 ∕(2𝛾) , if |𝑥| < 𝛾
𝑃 (𝑥) = . (20)
𝜆𝛾∕2, if |𝑥| ≥ 𝛾
Moreover, the firm thresholding is a continuous and piecewise-linear
approximation of the hard-thresholding.
Others Many penalty functions have been developed, such as the
log sum penalty (LSP) [22,23]

𝑃 (𝑥) = 𝜆 log (1 + |𝑥| ∕𝛾) , (21)

the rational penalty(Rat) [23,24]

𝜆|𝑥|
𝑃 (𝑥) = , (22)
1 + |𝑥|∕2𝛾
the arctangent penalty (Atan) [25]
( ( ) )
Fig. 4. Geometric view of two kinds of norms. 2 1 + 𝛾|𝑥| 𝜋
−1
𝑃 (𝑥) = 𝜆 √ tan √ − , (23)
3 3 6

and the exponential penalty (Exp) [26]

where
{ }
1 𝑃 (𝑥) = 𝜆(1 − e−𝛾|𝑥| ). (24)
𝑆𝐵 (𝒙) = inf 𝑛 ‖𝒗‖1 + ‖𝐵(𝒙 − 𝒗)‖22 . (15)
𝒗∈R 2
The penalty function usually satisfies the following properties:
Obviously, when the matrix 𝐵 meets the condition
Property 1 (P1): 𝑃 is continuous on R.
1 𝑇 Property 2 (P2): 𝑃 is non-decreasing, and concave on R+ .
𝐵𝑇 𝐵 ⪯ 𝐴 𝐴, (16)
𝜆 Property 3 (P3): 𝑃 is continuously differentiable on R+ .
the whole cost function Property 4 (P4): 𝑃 ′ is convex and nonnegative on R+ .
1 Property 5 (P5): 𝑃 (0) = 0.
‖𝒚 − 𝑨𝒙‖22 + 𝜆(‖𝒙‖1 − 𝑆𝐵 (∥ 𝒙)), 𝜆>0 (17)
2 Property 6 (P6): 𝑃 (−𝑥) = 𝑃 (𝑥).
is convex. Property 7 (P7): 𝑥 → 𝑥2 ∕2𝛼 + 𝑃 (𝑥) is convex on R with 𝛼 > 0.
Capped 𝑙1 The capped 𝑙1 [19] is defined as Property 8 (P8): lim𝑥→∞ 𝑃 ′′ (𝑥) = 0.
( ) Table 2 lists several penalty functions, and their properties which
𝑃 (𝑥𝑖 ) = 𝜆 min ||𝑥𝑖 || , 𝛾 . (18) are important in both theoretical analysis and applications. These
This formulation only regularizes those weights that are below a certain properties can guide us to construct the regularizations.
threshold 𝛾. For those beyond this value, they can grow arbitrarily with- Furthermore, ElasticNet uses a combination of 𝑙1 norm and 𝑙2 norm.
out experiencing penalties. It is noteworthy that it is a generalization These models have a tuning variable that controls the impact of the
of 𝑙1 norm. Particularly, it becomes 𝑙1 norm as 𝛾 → ∞. Generally, with shrinkage method on the model parameter estimation. The algorithms
to solve these regular problems include: (1) greedy strategy approxi-
a finite 𝛾, it approximates 𝑙0 norm better than 𝑙1 norm, and therefore
mation; (2) constrained optimization; (3) proximal methods; and (4)
leads to higher sparsity in some real-world applications. The key feature
homotopy algorithm-based sparse representation.
of capped 𝑙1 norm is the lack of penalty for weights whose magnitudes
are greater than 𝛾, this feature, however, may become a drawback
under some circumstances. 2.2. Sparse matrix-based regularization
Leaky Capped 𝑙1 Norm Regularizer (LCNR) With the analysis of
capped 𝑙1 in mind, the work [20] propose the leaky capped 𝑙1 norm Estimating large covariance or inverse covariance matrices has re-
regularizer as an improved variant: cently gained prominence due to the prevalence of high-dimensional
∑ ( ) ∑ ( ) data in modern applications such as economics and finance, bioin-
𝑃 (𝑥) = 𝛼 min ||𝑥𝑖 || , 𝛾 + 𝛽 max ||𝑥𝑖 || , 𝛾 , (19) formatics, social networks, smart grid, climate studies, and health
𝑖 𝑖
sciences [27–29]. It is well known that the sample covariance based on
where 0 < 𝛽 < 𝛼. As shown in Fig. 4, the functions of LCNR observed data is singular when the dimension is greater than the sample
and capped 𝑙1 are piecewise linear. The key difference between the size. The estimation problem is generally difficult when the dimension
proposed formulation and the standard 𝑙1 norm and the capped 𝑙1 is that of the covariance matrix is large [27]. An important assumption in the
large weights (i.e. those greater than 𝛾) are still penalized positively calculation of covariance and inverse covariance matrices is that the
but less heavily. To be more specific, this generalizes the capped 𝑙1 objective matrix is sparse. This section will reviews the large covariance
norm, with an additional coefficient 𝛽 to control how much those large matrix and inverse covariance matrix estimation that are important in
weights are regularized. In particular, when 𝛽 = 0, it reduces to the modern multivariate analysis. Furthermore, this section will provides
regular capped 𝑙1 . sparse matrix-based regularizations.

149
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Table 2
The properties of the regularizations. The horizontal axis of Image is the value of 𝑥, and the vertical axis of Image is the value of the corresponding penalty formulate.
Penalty Formulation P1 P2 P3 P4 P5 P6 P7 P8 Image
{
0 if 𝑥 = 0 √ √ √ √ √ √
𝑙0 norm × ×
1 otherwise

√ √ √ √ √ √ √ √
𝑙1 norm |𝑥|

√ √ √ √ √ √ √
𝑙0.5 norm |𝑥|𝑝 ×

√ √ √ √ √
Capped 𝑙1 𝜆 min (|𝑥| , 𝛾) × – ×

⎧ 𝜆|𝑥|, if |𝑥| < 𝜆

⎪ 2𝛾𝜆|𝑥|−𝑥2 −𝜆2 √ √ √ √ √ √
SCAD ⎨ 2(𝛾−1)
, if 𝜆 ≤ |𝑥| < 𝛾𝜆 × –
⎪ (𝛾 + 1)𝜆2 ∕2, if |𝑥| ≥ 𝛾𝜆
⎩
{ 2
𝜆 |𝑥| − 𝑥2𝛾 , if |𝑥| ≤ 𝛾𝜆 √ √ √ √ √ √
MCP 1
× –
2
𝛾𝜆2 , if |𝑥| > 𝛾𝜆

√ √ √ √ √ √ √ √
LSP 𝜆 log (1 + |𝑥| ∕𝛾)

𝜆|𝑥| √ √ √ √ √ √ √ √
Rat 1+|𝑥|∕2𝛾

( ( ) ) √ √ √ √ √ √ √ √
Atan 𝜆 √2 tan−1 1+𝛾|𝑥|
√ − 𝜋
6
3 3

√ √ √ √ √ √ √ √
Exp 𝜆(1 − e−𝛾|𝑥| )

{ [ ]
𝜆 |𝑥| − 𝑥2 ∕(2𝛾) , if |𝑥| ≤ 𝛾 √ √ √ √ √ √
Firm × –
𝜆𝛾∕2, if |𝑥| ≥ 𝛾

2.2.1. Application scenario Gaussian network model and has attracted much attention in the last
Large Sparse Covariance Matrix Estimation Large inverse covari- decade [32]. The estimation of large sparse inverse covariance matri-
ance matrix estimation is a fundamental problem in a wide range of ces is a tricky statistical problem in many application areas such as
applications, from economics and finance to genetics, social networks, mathematical finance, geology, health, or many others. Hence, sparsity
and health sciences. The estimation problem becomes more difficult
regularized negative log-likelihood minimization has become a popular
when the dimension of the covariance matrix is high. The assumptions
approach for estimating the sparse inverse covariance matrix. A com-
for high-dimensional covariance matrix estimation is sparsity. The
sparsity means that a majority of the off-diagonal elements are nearly mon method is adding some mechanism such as regularization term in
zero and thus decrease the number of free parameters to estimate [30]. the estimation model to explicitly enforce sparsity in 𝑿.
Since the diagonal elements of a correlation (also known as covariance) ∑ ( )
min tr(𝑺𝑿) − log |𝑿| + 𝑀 𝑋𝑖𝑗 , (28)
matrix are always positive, the assumption is for the off-diagonal 𝑿
𝑖≠𝑗
elements.
Let 𝒔1 , … , 𝒔𝑛 ∈ R𝑛 is independent and identically distributed (i.i.d.). where 𝑡𝑟(⋅) and | ⋅ | represent the trace and the determinant of matrix
The covariance which between each dimensions forms an 𝑛 × 𝑛 matrix respectively. The sample covariance matrix 𝑺 is invertible.
called the covariance matrix. Calculate the sample covariance matrix
𝑺 [31],
1 ∑( 𝑖
𝑛
)( )⊤ 2.2.2. Regularization
𝑺= 𝒔 − 𝒔̄ 𝒔𝑖 − 𝒔̄ (25)
𝑛 − 1 𝑖=1 Norm Penalty A convex model based on the 𝑙1,of f -regularized max-
imum likelihood problem
where 𝒔̄ is the mean of 𝒔1 , … , 𝒔𝑛 , i.e.
1∑ 𝑖
𝑛
min tr(𝑺𝑿) − log |𝑿| + 𝜆‖𝑿‖1,of f (29)
𝒔̄ = 𝒔. (26) 𝐗>0
𝑛 𝑗=1
where
The generalize thresholding estimator solves the following problem
∑
𝑛 ∑
𝑚
| |
1 ∑ ( ) ‖𝑿‖1,of f = |𝑋𝑖𝑗 | , (30)
min ‖𝑿 − 𝑺‖2𝐹 + 𝑀 𝑋𝑖𝑗 , (27) | |
𝑿 2 𝑖≠𝑗
𝑖=1 𝑗=1

where 𝑋𝑖𝑗 are the elements of the matrix 𝑿 and 𝑀 is generalized which refers to the element-wise 1 norm.
penalty function for sparsity. These notations still apply in the rest of The 𝐹 norm is defined as
this section. ( 𝑛 𝑚 )1∕2
Large Sparse Inverse Covariance Matrix Estimation Sparse in- ∑∑
𝑀(𝑿) = ‖𝑿‖𝐹 = 𝑋𝑖𝑗2 . (31)
verse covariance matrix estimation is a fundamental problem in a 𝑖=1 𝑗=1

150
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Generalizing the above norm, we denote the 𝑙𝑝,of f norm of matrix 𝑿 as The penalty function of matrix usually satisfies the following prop-
follows erties:
(𝑚 𝑛 )1∕𝑝 Property 9 (P9): 𝑀(𝑿) ≥ 0.
∑ ∑ | |𝑝
𝑀(𝑿) = ‖𝑿‖𝑝,of f = |𝑋𝑖𝑗 | (32) Property 10 (P10): 𝑀(𝑿) ≤ ‖𝑿‖1,of f .
| |
𝑖=1 𝑗=1 Table 3 lists several penalty functions and their properties.
with 𝑝 ≥ 1. These above norm is called the elements-wise matrix norms The algorithms to solve these regular problems include: (1) alterna-
since they imposes the same penalty to all elements. tive minimized algorithm; (2) block coordinate descent.
Note that 𝑙1,of f norm makes a distinction with the 1 norm of matrix
𝑿 ∈ R𝑚×𝑛 , 2.3. Low-rank matrix recovery based on low-rank regularization
∑
𝑚
| |
𝑀(𝑿) = ‖𝑿‖1 = max |𝑋𝑖𝑗 | , (33) This section reviews low-rank regularization based on the low-rank
1≤𝑗≤𝑛 | |
𝑖=1 recovery problems which conclude two hop topics problems: matrix
which is simply the maximum absolute column sum of the matrix. Mini- completion and robust PCA. Matrix completion aims to recover a low-
mizing the maximum absolute column sum is equivalent to minimizing rank matrix from partially observed entries, while robust PCA aims to
the upper bound of absolute column sum which enforces the column decompose a low-rank matrix from sparse corruption.
sum close to zero, i.e. feature selection.
The 𝑙2,1 norm of a matrix was first introduced in [33] as rotational 2.3.1. Application scenario
invariant 𝑙1 norm, that is Matrix Completion In many applications, the matrix is excepted to
(𝑚 )1∕2 be constructed in the sense that it is low-rank, which can be recovered
∑𝑛 ∑
𝑀(𝑿) = ‖𝑿‖2,1 = 𝑋𝑖𝑗2 , (34) from incomplete portions of the entries. For instance, vendors provide
𝑗=1 𝑖=1 recommendations to the user based on their preferences. Users and
which is proposed to overcome the difficulty of robustness to out- ratings are represented as rows and columns, respectively, in a data ma-
liers [33]. Similarly, the 𝑙1,2 norm of matrix 𝑿 is defined as trix. Users can rate movies but they typically rate only very few movies
so that very few scattered entries can be observed [38]. Commonly,
(𝑚 )2 1∕2
⎛∑ 𝑛 ∑| | ⎞ only a few factors effect to the preference of users so that the data
𝑀(𝑿) = ‖𝑿‖1,2 = ⎜ |𝑋𝑖𝑗 | ⎟ . (35) matrix of all users-rating can be regarded as a low-rank matrix. The
⎜ 𝑗=1 𝑖=1 | | ⎟
⎝ ⎠ goal of this problem is to complete the data matrix of all users-rating
For 𝑝, 𝑞 ≥ 1, a generalized form of the 𝑙1,2 norm is the 𝑙𝑝,𝑞 norm as using the observed data.
follows [34]: Given the incomplete observations 𝐴𝑖𝑗 , the goal is to recover a 𝑚 × 𝑛
1 matrix 𝑋𝑖𝑗 , that is,
(𝑚 )𝑞 𝑞
⎛∑ 𝑛 ∑ | |𝑝 𝑝 ⎞
𝑀(𝑿) = ‖𝑿‖𝑝,𝑞 =⎜ |𝑋 | ⎟ . (36) min rank(𝑿)
⎜ 𝑗=1 𝑖=1 | 𝑖𝑗 | ⎟ 𝐗 (42)
⎝ ⎠ s.t. 𝑋𝑖𝑗 = 𝐴𝑖𝑗 (𝑖, 𝑗) ∈ 𝛺
Moreover, there are many norm penalties based on the ∞ norm where 𝛺 ⊂ [1, … , 𝑚]×[1, … , 𝑛] is the index set in which 𝑋𝑖𝑗 is observed.
∑
𝑛 The rank(𝑿) is equal to the rank of the matrix 𝑿. It usually transforms
| |
𝑀(𝑿) = ‖𝑿‖∞ = max |𝑋𝑖𝑗 | , (37) to an unconstrained formulation which is written as [39]
1≤𝑖≤𝑚 | |
𝑗=1
1‖
 (𝑿) − 𝛺 (𝑴)‖
2
min ‖F + 𝐹 (𝑿) (43)
which are respectively 𝑙∞,1 norm [35] 𝑿 2‖ 𝛺
{
∑
𝑚
| | 𝑋𝑖𝑗 if (𝑖, 𝑗) ∈ 𝛺
𝑀(𝑿) = ‖𝑿‖∞,1 = max |𝑋𝑖𝑗 | , (38) where 𝛺 (𝐗) = .
1≤𝑗≤𝑛 | | 0 otherwise
𝑖=1
If 𝐹 (𝑿) = rank(𝑿), the objective function is nonconvex. A popular
and 𝑙∞,of f norm convex relaxation method is to approximate the rank function using the
| | nuclear norm, i.e. ‖𝑿‖∗ which is the sum of singular value of 𝑿 [40].
𝑀(𝑿) = ‖𝑿‖∞,of f = max |𝑋𝑖𝑗 | . (39)
𝑖≠𝑗 | | Robust PCA Robust PCA is applied in many important applications
Non-Norm Penalty The following penalties are not real norms, such as foreground detection of video [41], image processing [42],
because they are nonconvex and do not satisfy the triangle inequality fault detection and diagnosis [43] and so on. The robust PCA problem
of a norm. The capped 𝑙𝑝,1 is defined as [36]: is commonly thought of as a low-rank matrix recovery problem with
incorporates sparse corruption. The goal of robust PCA is to enhance
(𝑚 )1
∑
𝑛 ⎛ ∑ 𝑝 ⎞ the robustness of PCA against outliers or corrupted observations of PCA.
| |𝑝
𝑀(𝑿) = 𝜆 min ⎜ |𝑋𝑖𝑗 | , 𝛾⎟ (40) In fact, the data matrix 𝑨 of this problem is a composite matrix of
⎜ 𝑖=1 | | ⎟
𝑗=1 ⎝ ⎠ sparse and low-rank recovery which needs to be decomposed into two
with the given threshold 𝜃, the capped 𝑙𝑝,1 penalty focuses on rows with components such that 𝑨 = 𝑳 + 𝑺, where 𝑳 is a low-rank matrix and 𝑺
smaller 𝑙𝑝 norms than 𝜃, which is more likely to be sparse. is a sparse matrix. The matrix 𝑨 is observed and the problem can be
The elements-wise MCP function is defined as, represented as:

⎧ 𝛾𝜆2 | | min 𝐹 (𝑳) + 𝜆𝑀(𝑺)

⎪ 2 , |𝑋𝑖𝑗 | ≥ 𝛾𝜆 𝑳,𝑺 (44)
𝑀(𝑋𝑖𝑗 ) = ⎨ 2
| | (41) s.t. 𝑨=𝑳+𝑺
| | 𝑋𝑖𝑗
⎪ 𝜆 |𝑋𝑖𝑗 | − , otherwise
⎩ | | 2𝛾 where 𝐹 and 𝑀 are penalties for low-rank and sparsity promotion,
which is considered the natural extension of MCP functions on matri- respectively. A straightforward formulation is
ces [37]. The MCP function is an elements-wise matrix penalty. min 𝑟𝑎𝑛𝑘(𝑳) + 𝜆‖𝑺‖0
𝑳,𝑺 (45)
As shown in Fig. 5, these penalties can be classified into three cate-
s.t. 𝑨=𝑳+𝑺
gories: rows sparsity, columns sparsity, and sparsity of both
columns and rows, and it depends on which part of the matrix is Since the problem is NP-hard, the nuclear norm ‖𝑳‖∗ and 𝑙1,of f norm
punished. ‖𝑺‖1,of f are used to instead of the 𝑟𝑎𝑛𝑘(𝑳) and ‖𝑺‖0 , respectively.

151
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Fig. 5. Left: rows sparsity (Row 1, 3, 5, 8, 9). Middle: columns sparsity (Column 1, 3, 4, 6, 10). Right: sparsity of both columns and rows (Row 6, 7, 8 and Column 6,7).

Table 3
The sparse matrix regularizations.
Penalty Formulation Rows Columns Rows & Columns P9 P10
sparsity sparsity sparsity
∑𝑛 ∑𝑚 | | √ √ √
‖𝑿‖1,of f 𝑗=1 | 𝑖𝑗 |
𝑋 × ×
(∑ ∑ | |)1∕2
𝑖=1
𝑛 𝑚 √ √ √
‖𝑿‖𝐹 𝑖=1
𝑋2
𝑗=1 𝑖𝑗
× ×
Elements-wise
(∑ ∑ ) √ √ √
𝑚 𝑛 | |𝑝 1∕𝑝
‖𝑿‖𝑝,of f 𝑖=1 𝑗=1 || 𝑖𝑗 ||
𝑋 × ×
| | √ √ √
‖𝑿‖∞,of f max𝑖≠𝑗 |𝑋𝑖𝑗 | × ×
| |
⎧ 𝛾𝜆2 | |
⎪ 2 , if |𝑋𝑖𝑗 | ≥ 𝛾𝜆 √ √ √
MCP | |
⎨ | | 𝑋𝑖𝑗2 × ×
⎪ 𝜆 ||𝑋𝑖𝑗 || − 2𝛾 , otherwise
⎩
∑𝑛 ( ∑𝑚 )1∕2 √ √ √
‖𝑿‖2,1 𝑗=1
𝑋2
𝑖=1 𝑖𝑗
× ×
∑𝑚 ( ∑𝑛 )1∕2 √ √ √
‖𝑿‖2,1 𝑖=1
𝑋2
𝑗=1 𝑖𝑗
× ×
Non-elements-wise ( (∑ )𝑞 )1
∑𝑛 | |𝑝 𝑝 𝑞
𝑚 √ √ √
‖𝑿‖𝑝,𝑞 |𝑋𝑖𝑗 | × ×
𝑗=1 | |𝑖=1
∑𝑛 | | √ √ √
‖𝑿‖∞ max1≤𝑖≤𝑚 𝑗=1 |𝑋𝑖𝑗 | × ×
| |
∑𝑚 | | √ √ √
|𝑿||1 max1≤𝑗≤𝑛 𝑖=1 |𝑋𝑖𝑗 | × ×
| |
∑𝑚 | | √ √ √
‖𝑿‖∞,1 𝑖=1 max1≤𝑗≤𝑛 ||𝑋𝑖𝑗 || × ×
(( )
∑𝑛 ∑𝑚 | |𝑝 ) 𝑝 √ √ √
1

𝑖=1 || 𝑖𝑗 ||
Capped 𝑙𝑝,1 𝜆 𝑗=1 min 𝑋 ,𝛾 × ×

Considering the unconstrained problem is more tractable, the for- The weighted nuclear norm is to improve the flexibility of nuclear
mulations can be alternated by norm [47], which is defined as:
1
min ‖𝑳‖∗ + ‖𝑺‖1 + ‖𝑨 − 𝑳 − 𝑺‖2𝑭 (46) ∑
𝑘
𝑳,𝑺 2𝜇
𝐹 (𝑿) = ‖𝑿‖𝑤 = 𝑤 𝑖 𝜎𝑖 , (49)
where 𝜇 > 0 is the penalty parameter. 𝑖=1

2.3.2. Regularization where the singular values are assigned different weights. Moreover, this
In recent years, the rank-norm is replaced by some alternative idea can be generalized to the Schatten 𝑝-norm minimization [48].
regularizers, which can be divided into two groups: convex relaxations
The capped trace norm is defined as
and non-convex relaxations [4]. The main idea of these regularizers
is to make singular values sparse which is equal to make the matrix ∑
𝑘
( )
low-rank. 𝐹 (𝑿) = ‖𝑿‖𝐶∗𝜀 = min 𝜎𝑖 , 𝜀 . (50)
The work [44] proposed the elastic-net regularization for singular 𝑖=1
values
This regularization is used to improve the robustness to outliers by
∑
𝑘
( )
𝐹 (𝑿) = 𝜎𝑖 + 𝜆𝜎𝑖2 , (47) closely approximating the rank minimization.
𝑖=1
Moreover, numerous non-convex relaxations derived from sparse
where 𝜎𝑖 is the singular values of 𝑿 and 𝑘 = min{𝑚, 𝑛}. It is widely learning of the vectors have been proposed [49], such as Log Nuclear
applicable to robust subspace learning problems with heavy corruptions
Norm (LNN) [50]
such as outliers and missing entries.
Schatten 𝑝-Norm (Sp-Norm) [45] of matrix 𝐗 is defined as
( ) ∑𝑘
( )
( ) 1𝑝 𝐹 (𝑿) = 𝜆 log det 𝑰 + 𝑿 𝑇 𝑿 = 𝜆 log 𝜎𝑖 + 1 , (51)
∑
𝑘 ( (( ) 𝑝 ))
1
𝑝 𝑖=1
𝐹 (𝑿) = ‖𝑿‖𝑆𝑝 = 𝜎𝑖𝑝 = Tr 𝑿 𝑇 𝑿 2 (48)
𝑖=1 MCP
which is equivalent to 𝑙𝑝 norm of the singular values. The Sp-norm
⎧ 𝜎𝑖
minimization method provides a better approximation to the orig- ∑𝑘
⎪ 𝜆𝜎𝑖 − 2𝛾 , if 𝜎𝑖 < 𝛾𝜆
inal NP-hard problem, resulting in better theoretical and practical 𝐹 (𝑿) = ⎨ 𝛾𝜆2 , (52)
results [46]. 𝑖=1 ⎪ , if 𝜎𝑖 ≥ 𝛾𝜆
⎩ 2

152
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

SCAD
⎧ 𝜆𝜎𝑖 , if 𝜎𝑖 ≤ 𝜆
𝑘 ⎪ 2 2
∑ ⎪ −𝜎𝑖 + 2𝛾𝜆𝜎𝑖 − 𝜆 , if 𝜆 < 𝜎𝑖 ≤ 𝜆𝛾
𝐹 (𝑿) = ⎨ 2(𝛾 − 1) , (53)
𝑖=1 ⎪ 𝜆2 (𝛾 + 1)
⎪ , if 𝜎𝑖 > 𝜆𝛾
⎩ 2
ETP [51]
( ( ))
∑
𝑘
𝜆 1 − exp −𝛾𝜎𝑖
𝐹 (𝑿) = , (54)
𝑖=1
1 − exp(−𝛾)
Logarithm [52]
( )
∑𝑘
log 𝛾𝜎𝑖 + 1 Fig. 6. The geometry structure of the data distribution in low-dimension space.
𝐹 (𝑿) = , (55)
𝑖=1
log(𝛾 + 1)
Geman [53]
inherent structure. In manifold learning, this assumption is explicit:
∑𝑘
𝜆𝜎𝑖
𝐹 (𝑿) = , (56) it assumes that the observed data lies on a low-dimensional manifold
𝑖=1
𝜎 𝑖 +𝛾 embedded in a higher-dimensional space which can be seen in Fig. 6.
Laplace [54] Intuitively, this assumption states that the shape of data is relatively
( ( )) simple. The low-dimensional manifold model can be applied to surface
∑
𝑘
𝜎 patches in the point cloud, which uses he patch manifold prior to seek
𝐹 (𝑿) = 𝜆 1 − exp − 𝑖 , (57)
𝑖=1
𝛾 self-similar patches and remove noise [58].
In semi-supervised learning, there are two sets of samples 𝑥 ∈ R𝑛
and so on. Note that Laplace, ETP, and Exp are similar in formulation, {( )}𝑙
and they can be transformed into each other by adjusting parameters. which consists 𝑙 labeled samples 𝑄 = 𝑥𝑖 , 𝑦𝑖 𝑖=1 and 𝑢 unlabeled
{( )}𝑢+𝑙
Similarly, this relationship exists between LNN, Logarithm and LSP. samples 𝑈 = 𝑥𝑖 , 𝑦𝑖 𝑖=1+𝑙 . The optimization can be written as
For the problems with low-rank regularization, a popular method is
1∑ (
𝑙
proximal method. It requires computing the proximal operator )
𝑓 ∗ = argmin 𝑉 𝑥𝑖 , 𝑦𝑖 , 𝑓 + 𝛾‖𝑓 ‖2𝐾 . (60)
𝑓 ∈𝐾 𝑙 𝑖=1
1 ̃ 2 + 𝐹 (𝑿)
Prox𝜆𝐹 (𝑿)
̃ = arg min ‖𝑿 − 𝑿‖ 𝐹 (58)
𝑿 2 where 𝐾 represents reproducing kernel Hilbert spaces. The first term
where usually 𝑿 ̃ is known. However, there may not exist a general ex- is the loss function and the regularization ‖𝑓 ‖𝐾 is used to control
plicit solution of the proximal operator. Hence, the proximity operator the complexity of the classification model. 𝛾 is a trade-off parameters.
is calculated by minimizing the problem Manifold learning adds a regularization ‖𝑓 ‖𝐼 to the loss function
1 that penalizes functions which are more complex with respect to the
min ‖𝑿 − 𝑿‖ ̃ 2 + 𝐹 (𝑿) , (59)
𝑿 2 𝐹 intrinsic geometry of the data manifold:
whose results are solved by the optimization solver here. In order to
1∑ (
𝑙
)
explain the shrinkage effect more intuitively, these results are regarded 𝑓 ∗ = argmin 𝑉 𝑥𝑖 , 𝑦𝑖 , 𝑓 + 𝛾𝐴 ‖𝑓 ‖2𝐾 + 𝛾𝐵 ‖𝑓 ‖2𝐼 (61)
𝑓 ∈𝐾 𝑙 𝑖=1
as approximate to explicit solutions.
Table 4 shows some low-rank regularizations. Among it, the shrink- where 𝛾𝐴 and 𝛾𝐵 are trade-off parameters and 𝐾 represents repro-
age image shows the results Prox𝜆𝐹 (𝑏) where 𝑏 is a variable of the ducing kernel Hilbert spaces. The regularization should bias learning
singular value. And it can be seen that when 𝑏 takes a small value toward smoother functions 𝑓 . Since ‖∇𝑓 (𝑥)‖ describes the smoothness
the shrinkage effect of different regularizers are similar. Nevertheless, of a function 𝑓 at 𝑥, a notion of the smoothness of 𝑓 on the entire
when 𝑏 takes a large value the difference of Prox𝜆𝐹 (𝑏) between non- manifold 𝑀 is given as:
convex regularizers and nuclear norm is significant. In particular, the
shrink of non-convex relaxations on larger values are very small, which ‖∇𝑓 (𝑥)‖2 𝑑𝑥. (62)
∫𝑀
is in contrast with the nuclear norm taking serious shrinks on larger
singular values. Hence, using non-convex regularizers can preserve This regularization expresses the idea that the functions should be
main information of 𝑏[55]. smooth on the manifold, not just smooth in the extrinsic space.
The algorithms to solve these regular problems include: (1) proximal To smooth 𝑓 in the intrinsic space, there are many manifold regu-
methods; (2) block coordinate methods; (3) alternating linearization larizations including the Laplacian regularization (LapR), Hessian reg-
algorithms; (4) greedy strategy approximation. ularization (HesR) and 𝑝-Laplacian regularization (pLapR) [59].

2.4. Manifold regularization

2.4.2. Regularization
Laplacian Regularization The manifold regularization cannot be
This section will introduce the manifold regularization, a regular-
ization technique for manifold learning, based on the assumption that calculated exactly with only finite data. The main idea behind man-
the data has some inherent structure. ifold regularization is to approximate this concept by substituting a
graph approximation for the manifold. Laplacian regularization is one
2.4.1. Application scenario of the most popular technologies which uses the graph Laplacian to
3D point cloud is a discrete collection of triples marking exterior approximate the manifold regularization.
object surface locations in 3D space [56,57]. 3D point cloud has been Let 𝐺 be a connected, undirected graph with edges and vertices.
widely adopted in indoor navigation, autopilot, and augmented reality, When discussing weighted graphs, the weight on the edge between
etc. In the conventional imperfect acquisition processes of 3D point nodes 𝑖 and 𝑗 is denote by 𝑊𝑖𝑗 . A function on a graph is smooth if its
cloud, noise data is non-negligible. An idea is the manifold learning. value at a node is similar to its value at each of the node’s neighbors,
To learn something from data, it is assumed that the data has some intuitively. Thus, the regularization ∫𝑀 ‖∇𝑓 (𝑥)‖2 𝑑𝑥 is analogous to

153
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Table 4
The low-rank regularizations. The horizontal axis of shrinkage image is the value of 𝜎𝑖 , and the vertical axis of shrinkage image is corresponding
shrinkage value.
Regularization Convexity Formulation Shrinkage image

∑𝑘
‖𝐗‖∗ Convex 𝑖=1
𝜎𝑖

∑𝑘
‖𝐗‖𝑤 Convex 𝑖=1 𝑤𝑖 𝜎𝑖

∑𝑘 ( )
‖𝐗‖𝐶∗𝜀 Non-Convex 𝑖=1
min 𝜎𝑖 , 𝜀

(∑ )1
𝑘
𝜎𝑖𝑝
𝑝
‖𝐗‖𝑆𝑝 Non-Convex 𝑖=1

∑𝑘 ( 2
)
Elastic-net Convex 𝑖=1 𝜎𝑖 + 𝜆𝜎𝑖

∑𝑘 ( )
LNN Non-Convex 𝑖=1
𝜆 log 𝜎𝑖 + 1

{ 𝜎𝑖
∑𝑘 𝜆𝜎𝑖 − 2𝛾
, if 𝜎𝑖 < 𝛾𝜆
MCP Non-Convex 𝑖=1 𝛾𝜆2
2
, if 𝜎𝑖 ≥ 𝛾𝜆

⎧ 𝜆𝜎𝑖 , if 𝜎𝑖 ≤ 𝜆
∑𝑘 ⎪ −𝜎𝑖2 +2𝛾𝜆𝜎𝑖 −𝜆2
SCAD Non-Convex 𝑖=1 ⎨ 2(𝛾−1)
, if 𝜆 < 𝜎𝑖 ≤ 𝜆𝛾
⎪ 𝜆2 (𝛾+1)
⎩ 2
, if 𝜎𝑖 > 𝜆𝛾

∑𝑘 𝜆(1−exp(−𝛾𝜎𝑖 ))
ETP Non-Convex 𝑖=1 1−exp(−𝛾)

∑𝑘 log(𝛾𝜎𝑖 +1)
Logarithm Non-Convex 𝑖=1 log(𝛾+1)

∑𝑘 𝜆𝜎𝑖
Geman Non-Convex 𝑖=1 𝜎𝑖 +𝛾

∑𝑘 ( ( ))
𝜎
Laplace Non-Convex 𝑖=1
𝜆 1 − exp − 𝛾𝑖

[ ( ) ( ) ( )]𝑇
the squared difference between a node’s value and the values of its where 𝒇 = 𝑓 𝑥1 , 𝑓 𝑥2 , … , 𝑓 𝑥𝑛 is prediction of all of the
neighbors in the graph case. That is, ‖𝑓 ‖2𝐼 is approximated as data(labeled and unlabeled), 𝐷 is a diagonal matrix with 𝑊𝑖𝑖 =
∑𝑛
1 ∑ ( ( ) ( ))2 𝑗=1 𝑊𝑖𝑗 , and 𝑳 = 𝑫 − 𝑾 is called the Laplacian of the graph 𝐺, which
𝑊𝑖𝑗 𝑓 𝑥𝑖 − 𝑓 𝑥𝑗 . (63) quantifies the smoothness of functions. Notice that, in Eq. (61), {𝑓𝑖 }𝑙𝑖=1
𝑛2 𝑖,𝑗 ( )
should satisfy the constraint 𝑓 𝑥𝑖 = 𝑦𝑖 , 𝑖 = 1, … , 𝑙, which is reflected
∑ ( )
LapR constructs a nearest neighbor graph on the feature space in the term 1𝑙 𝑙𝑖=1 𝑉 𝑥𝑖 , 𝑦𝑖 , 𝑓 .
which utilizes a graph similarity (affinity) matrix 𝑾 = [𝑊𝑖𝑗 ] which
Considering the 𝑉 as the hinge loss, the framework of LapR is
measures the similarity between samples 𝑥𝑖 and 𝑥𝑗 :
written as follow:
⎧ ⎧ ‖ ‖2 ⎫
1 ∑(
𝑙
( ))
⎪ ‖𝑥𝑖 − 𝑥𝑗 ‖
⎪ ( ) ( ) 𝛾
⎪ exp ⎨− ‖ ‖2 ⎪ 𝑓 ∗ = arg min 1 − 𝑦𝑖 𝑓 𝑥𝑖 + + 𝛾𝐴 ‖𝑓 ‖2𝐾 + 𝐵 𝒇 𝑇 𝑳𝒇 (66)
⎬, if 𝑥𝑖 ∈ 𝑁𝑞 𝑥𝑗 or 𝑥𝑖 ∈ 𝑁𝑞 𝑥𝑖 𝑓 ∈𝐾 𝑙 𝑖=1 𝑛2
𝑊𝑖𝑗 = ⎨ ⎪ 2𝜎 2
⎪ (64)
⎪ ⎩ ⎭
⎪ where 𝑛 = 𝑙 +𝑢 and 𝐾 represents reproducing kernel Hilbert spaces. In
⎩ 0, otherwise
this way, the data geometric structure existed in high dimensional space
where 𝑁𝑝 (𝒙) is the set of 𝑝-nearest neighbors of sample 𝒙 and 𝑾 is is preserved and the data distribution information is well explored.
usually symmetrized by 𝑾 = (𝑾 𝑇 + 𝑾 )∕2. It assumes that if two point In the image annotation, the manifold regularization is used to
𝑥1 and 𝑥2 are close in the manifold geometry, their labels should be
ensure that local geometric structures are consistent between different
similar.
view feature spaces and the latent semantics matrix among different
The geometry structure is exploited as
views [60,61]. The work [62] further learns different the graph Lapla-
1 ∑∑
𝑛 𝑛
( ( ) ( ))2 ( ) cian from the different views and adjusts the combination coefficients.
𝑊 𝑓 𝑥𝑖 − 𝑓 𝑥𝑗 = Tr 𝒇 𝑇 𝑳𝒇 (65)
2 𝑖=1 𝑗=1 𝑖𝑗 An alternative approach assumes that the intrinsic manifold is the

154
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

norm regularization for image restoration with biomedical applications.

The multiview Hessian regularization (mHR) optimally combined mul-
tiple Hessian regularizations which are obtained from the particular
view of instances and direct the classification function which varies
linearly along the data manifold [70].
𝑝-Laplacian Regularization As a nonlinear generalization of the
standard graph Laplacian, 𝑝-Laplacian has attracted attention from
Fig. 7. The process of defining the tangent Hessian [149].
machine learning fields.
Similar to the graph Laplacian [71], the unnormalized 𝑝-Laplacian
(for 𝑝 > 1) can be defined by
convex combination of the pre-learned manifold candidates, that is
1 ∑∑
𝑛 𝑛
∑
𝑚 ∑
𝑚 𝒇 𝑇 𝑳𝑝 𝒇 = 𝑤 |𝑓 − 𝑓𝑗 |𝑝 . (72)
𝑳= 𝜇𝑘 𝑳𝑘 , s.t. 𝜇𝑘 = 1, 𝜇𝑘 ≥ 0, for 𝑘 = 1, … , 𝑚 (67) 2 𝑖=1 𝑗=1 𝑖𝑗 𝑖
𝑘=1 𝑘=1
And if 𝑝 = 2, the 𝑝-Laplacian becomes the standard graph Laplacian.
where 𝑚 represents 𝑚 views and 𝑪 = {𝑳1 , 𝑳2 , … , 𝑳𝑚 } is the set of The objective function of 𝑝-Laplacian can be computed as fol-
the candidates graph Laplacian. Thus the 𝐿 ∈ 𝐜𝐨𝐧𝐯𝑪 is also a graph lows [72]:
Laplacian [63]. The work [64] proposed pairwise constraints where the { }
graph Laplacian matrix is composed of the must-link Laplacian matrix 𝑓 ∗ = argmin𝑓 ∈(𝑉 ) 𝑆𝑝 (𝑓 ) + 𝜇‖𝑓 − 𝑦‖2 , (73)
and cannot-link Laplacian matrix. ∑
where 𝑆𝑝 (𝑓 ) ∶= 12 𝑣∈𝑉 ‖ ‖𝑝
Hessian Regularization The geodesic function in the null space ‖∇𝑣 𝑓 ‖ and 𝜇 is the trade-off parameters.
The objective function is convex so that it has a unique solution.
of Laplacian is no other than a const, which implicates that LapR
However, it is still strenuous work to approximate graph 𝑝-Laplacian
biases the solution towards a constant function and then leads to poor
so that extremely limits the applications of 𝑝-Laplacian regularization.
extrapolation capability [65]. A theory of the Hessian is a substitute for
Luo et al. [73] used the 𝑝-Laplacian for multi-class clustering. Inspired
the Laplacian. Given a smooth manifold 𝑀 ⊂ R𝑛 and the tangent space
by Luo et al. [73], Liu et al.[74] proposed an efficient approxima-
𝑇𝑝 (𝑀) ⊂ R𝑛 for each point 𝑝 ∈ 𝑀, thus the Hessian regularization can
tion method of graph 𝑝-Laplacian and built 𝑝-Laplacian regularization
be defined as
framework
‖∇ ∇ 𝑓 ‖2 ∑ | |𝑝
𝐻(𝑓 ) =
∫𝑀 ‖ 𝑎 𝑏 ‖𝑇𝑝 (𝑀)⊗𝑇𝑝 (𝑀)
𝑑𝑉 (𝑝) (68) ∑ 𝑖𝑗 𝑤𝑖𝑗 |𝑓𝑖 −𝑓𝑗 |
| |
min 𝐽 (𝒇 ) = 𝑘 𝑝
𝒇 ‖𝑓 ‖𝑝 (74)
where ∇𝑎 ∇𝑏 𝑓 is the second covariant derivative of 𝑓 and 𝑑𝑉 (𝑝) is s.t. 𝒇𝑇 𝒇 = 𝐼
the natural volume element. [ ( ) ( ) ( )]𝑇
The construction of the Hessian matrix of a point depends on the where 𝒇 = 𝑓 𝑥1 , 𝑓 𝑥2 , … , 𝑓 𝑥𝑛 . The results show the pLapR
choice of the coordinate system. In the normal coordinates, hessian can fit the data exactly and extrapolates smoothly to unseen data with
matrix can be evaluated quite easily. The local normal coordinates is the geodesic distance. Especially, they proposed an efficient approxima-
used to define the Hessian of a function 𝑓 . Let 𝑁𝑘 (𝑝) as a neighborhood tion method of graph 𝑝-Laplacian and built 𝑝-Laplacian regularization
of the 𝑘 nearest of a sample 𝑝. Considering the tangent space as framework.
a subspace of R𝑛 , the orthonormal coordinate system of 𝑇𝑝 (𝑀) an The work [75] introduced a new class of non-local 𝑝-Laplacian
be approximated by the eigenspace of neighborhood 𝑁𝑘 (𝑝), which is operators that interpolate between non-local Laplacian and infinity
spanned by the largest 𝑑 eigenvalues. For each point 𝑝′ ∈ 𝑁𝑘 (𝑝), there Laplacian. Graph-based 𝑝-Laplacian regularization has found further
is a unique closest point 𝑣 on 𝑇𝑝 (𝑀). The details are shown in Fig. 7. applications in semi-supervised learning and image processing [76]. Liu
Then the rule 𝑔(𝑥) = 𝑓 (𝑝′ ) defines a function 𝑔 ∶ 𝑈 → R, where 𝑈 et al. [77] proposed 𝑝-Laplacian regularized sparse coding for human
is a neighborhood of 0 in R𝑛 . Thus, the Hessian of 𝑓 at 𝑝 in tangent activity recognition. Ma et al.[78] presented an efficient and effective
coordinates is defined as the ordinary Hessian of 𝑔, approximation algorithm of hyper-graph 𝑝-Laplacian and then pro-
( ) posed hypergraph 𝑝-Laplacian regularization (HpLapR) provided more
𝜕 𝜕
𝐻𝑓tan (𝑝) = 𝑔(𝑥)|𝑥=0 . (69) potential to exploiting the local structure preserving. Ma et al. [79]
𝑖,𝑗 𝜕𝑥𝑖 𝜕𝑥𝑗
developed an ensemble 𝑝-Laplacian regularization (EpLapR) to fully
Meanwhile, the Frobenius norm of a Hessian matrix is invariant to the approximate the intrinsic manifold of the data distribution. According
coordinate changes, and therefore to the manifold regularization framework, the proposed EpLapR can be
‖ tan ‖2 written as the following optimization problem:
𝐻(𝑓 ) = ‖𝐻 (𝑝)‖ 𝑑𝑝 (70)
∫𝑝∈𝑀 ‖ 𝑓 ‖𝐹
1∑ (
𝑙
)
min 𝑉 𝑥𝑖 , 𝑦𝑖 , 𝑓 + 𝛾𝐴 ‖𝑓 ‖2𝐾 + 𝛾𝐼 𝒇 𝑇 𝑳𝑝 𝒇 (75)
measures the average curviness of 𝑓 over the manifold 𝑀. 𝑓 ∈𝐾 𝑙
𝑖=1
Kim et al. [66] introduce a sparse matrix approximation 𝐻 by fitting a ∑ 𝑝 ∑𝑚
quadratic function to the data points, which yields an objective function where 𝑳𝑝 = 𝑚 𝑘=1 𝜇𝑘 𝑳𝑘 , and s.t. 𝑘=1 𝜇𝑘 = 1, 𝜇𝑘 ≥ 0, for 𝑘 = 1, … , 𝑚.
almost identical to that of Laplacian-based manifold regularization: Fig. 8 illustrates the differences of semi-supervised regression by
using LapR, HesR, and pLapR for fitting two points on the 1-D spiral.
( )2 Differences of semi-supervised regression by using LapR, HesR, and
∑
𝑚 ∑
𝑘 ∑
𝑛 ∑ ∑ (𝑖)
(𝑖)
𝐻(𝑓 ) ≈ 𝐻𝑟𝑠𝛼 𝑓𝛼 = 𝑓𝛼 𝑓𝛽 𝐻𝛼𝛽 = 𝒇 𝑇 𝑯𝒇 pLapR for fitting two points on the 1-D spiral. The LapR prefers to fit
𝑟,𝑠=1 𝛼=1 𝑖=1 𝛼∈𝑁𝑘 (𝑝𝑖 ) 𝛽∈𝑁𝑘 (𝑝𝑖 ) data as the constant function which seems improper. The HesR fits the
(71) data as linear function along the spiral thus it is better than LapR. The
pLapR can fit the data exactly and more smooth [80].
(𝑖)
where 𝐻𝑟𝑠𝛼is local Hessian operator with normal coordinate 𝑥𝑟 and 𝑥𝑠 ,
(𝑖)
and 𝑯 is the accumulated matrix summing up all the matrices 𝐻𝑟𝑠𝛼 . 2.5. Discussion
A learning framework mHLR [67] was proposed which integrates
multiple kernel learning and ensemble graph Hessian learning, and This part will conclude how to choose a regularization for tradi-
combines multiple Hessian regularizations to boost the exploring of tional machine learning.
local geometry. The work [68] applied the Hessian regularization to In the first subsection, the vector-based sparse regularizations are
remote sensing image recognition. The work [69] applied hessian-based summarized. Sparse regularizations are important since many fields

155
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Fig. 8. The differences of semi-supervised regression by using LapR, HesR, and pLapR for fitting two points on the 1-D spiral.

need to impose sparse constraints on variables, such as compressing 3.1. Data augmentation
sensing, feature selection, sparse signals separation, sparse PCA, and
sparse signals separation. 𝑙0 norm constrains the value of the weight to One powerful way to improve model generalization performance is
close to zero. Except for no penalty to zero, 𝑙0 norm penalizes nonzero to train it on more data. In practice, the lack of a sufficient amount of
variables in the same way. As seen from images of regularizations, training data or uneven class balance within the datasets are common
penalties function decrease rapidly as variables close to zero. Thus, problems. Data augmentation is based on an assumption that more in-
the nonconvex penalties are closer to 𝑙0 norm. Moreover, the common formation can be obtained from the new data after augmentations [81].
properties of these regularizations are as well proposed. It is worth- Dataset augmentation has been a particularly effective technique for
while to discuss some good mathematical properties since having these image classification. For some image classification tasks, it is reasonable
properties may result in some positive characteristics, such as being to create new fake data to add to the data set. In computer vision fields,
more conducive to solving or being easier to obtain convergence. These there are some basic augmentation operations, which can be seen as
images and ideas can be used to construct vector-based regularizations. a kind of oversampling [82–84]. This section also contains black-box
The second subsection summarized the matrix-based sparse regu- approaches focused on deep neural networks, in addition to traditional
larizations which are important in large covariance matrix and inverse white-box methods.
covariance matrix estimation. These regularizations are classified into
three types: row sparsity, column sparsity, and column and row spar- 3.1.1. Traditional augmentation
sity. Furthermore, two properties of these regularizations are proposed, The augmentation methods [85,86] include roughly: flipping, crop-
which give the regularizations’ binds. All of the above can make us ping, resizing, rotating, transposing, inverting, brightness, sharpness,
choose or construct a regularization for specific tasks. equalize, auto contrast, convert and color balance. These methods have
Some matrix-based low-rank regularizations are given in the third experimented on the bee image of hymenoptera data. The image pixels
subsection, which can be applied in matrix completion and robust PCA. are 500 × 464. The result of these methods is shown in Fig. 9. Table 5
The low-rank regularization adds penalty to singular value, which is lists the methods and their own descriptions which are used to augment
motivated by the vector-based sparse regularizations. Thus, a straight- data.
forward way to construct regularization can be proposed by extending Affine transformation is essentially a linear transformation of the
the vector-based regularizations. image coordinate vector space, it could be formulated as:
The last subsection reviewed the manifold regularizations. The man-
ifold assumes the data has some inherent structure, which is widely 𝒚 = 𝑾 𝒙 + 𝒃, (76)
used in computer vision and multimedia. The manifold learning prefers in which 𝒙 and 𝒚 are 2-D image pixel coordinate vector before and
to find a more smooth model. This regularization captures the intuition after affine transformation, 𝑾 and 𝒃 are linear transformation matrix
that our functions should be smooth on the manifold, not just smooth and translation vector respectively. The common affine-transformation-
in the extrinsic space. A natural way is to construct a regularization to based image data augmentation methods include translation, rotation,
measure the smoothness of 𝑓 , such as ∫𝑀 ‖∇𝑓 (𝑥)‖2 𝑑𝑥. and flipping, as illustrated in Fig. 9(c), 9(d), 9(e) respectively.
They have been proven to be easy, fast, repeatable, and depend-
3. The regularization strategies in deep learning able. However, some techniques result in the loss of image data. The
disadvantages of color space transformations are increased memory,
There are many strategies for deep learning. Aiming at the reason transformation costs, and training time. Moreover, color transforma-
which consists of noise data, the limited size of the training set, and tions may discard essential color details and thus are not always a
the complexity of classifiers, this section main introduced the data label-preserving transformation. For example, when the pixel value of
augmentation, dropout, early stopping, batch normalization and several an image is reduced to simulate a darker environment, the objects in
common methods to avoid the overfitting. the image may be invisible.

156
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Fig. 9. Examples of image data augmentation methods.

Table 5
The methods used to augment data and their own descriptions.
Translate Description
Crop Crop the image by a box and it will reduce the size of input.
Resize Zoom in or Zoom out the image and select the center of the scaled image.
Rotate Rotate the image some degrees.
Transpose Converted the image horizontally or vertically.
Invert Invert the pixels of the image.
Brightness Brightness enhancement and magnitude is proportional to brightness.
Sharpness Sharpness enhancement and magnitude is proportional to sharpness.
Equalize Equalize the image histogram.
Auto contrast Maximize the contrast of the image.
Convert Convert the mode of image.
Color balance Adjust the color balance of the image. A 𝑚𝑎𝑔𝑛𝑖𝑡𝑢𝑑𝑒 = 0 gives a black and white
image, while 𝑚𝑎𝑔𝑛𝑖𝑡𝑢𝑑𝑒 = 1 gives the original image.

It is worth noting that data augmentation approaches should take

into account preserving the label’s consistency post-transformation.
Rotations and cropping, for example, are usually secure on ImageNet
challenges like cat versus dog, but not on digit recognition tasks like 6
versus 9 [87].

3.1.2. Feature space augmentation

Data augmentation techniques have traditionally been applied to
input images only. However, for a computer, there is no real difference Fig. 10. GAN: 𝐺 represents the generator, 𝒛 represents the latent variable(random
between an input image and an intermediate representation. Feature noise), 𝒙𝑟 represents the real sample, 𝒙𝑓 represents the fake sample which is generated
space augmentation performs the transformation, not in input space, by 𝐺, and 𝐷 represents the discriminator which distinguishes the real or fake of 𝒙𝑓
but in a learned feature space [88]. and 𝒙𝑟 .

Shake-Shake regularization, which applies data augmentation tech-

niques to internal representations, was created as an attempt to produce
this sort of effect by stochastically ‘blending’ 2 viable tensors [89]. used to recover new vectors and convert them into images as a solution
Shake-Shake can be applied to ResNeXt only. The work [90] pro- to this problem. For deep CNNs, the proceeding is very difficult and
posed the ShakeDrop regularization which can be applied to more time-consuming to train because it needs to copy the entire encoding
neural network such as ResNet, Wide ResNet, and PyramidNet. The part of the CNN.
work [91] dynamically adjusted the regularization strength in the
training procedure, thereby balancing the underfitting and overfitting 3.1.3. Generative Adversarial Networks
of CNNs. Generative Adversarial Networks (GAN) gained considerable atten-
Adding the difference between two examples to a new example is tion in the field of unsupervised learning. An example of GAN is shown
a simple yet effective data augmentation method [92]. SMOTE is a in Fig. 10. The idea of GANs is to use two adversarial networks, 𝐺(𝒛)
popular augmentation utilizing the 𝑘 nearest neighbors to form new and 𝐷(𝒙), where one generates a photo-realistic image to fool the
instances [93]. generator 𝐺(𝒛) to better distinguish fake images which are created by
The use of auto-encoders is particularly useful for performing fea- the discriminator 𝐷(𝒛).
ture space augmentations on data [94]. An auto-encoder works by Table 6 shows the recent progress of GAN.
mapping one half of the network (encoder) to a low-dimensional vector Image data generation is one of the most frequently used fields of
representation, which the other half of the network (decoder) may re- GAN. GAN generates an approximate real data distribution. When there
construct back to the original picture. For feature space augmentations, are few real data, the performance of GAN is not ideal. Sample genera-
this encoded representation is used. The auto-encoder network can be tion based on a poor data distribution may not be effective. GAN needs

157
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Table 6
The GAN-variants.
Model Description Ref
CGAN Conditional GAN: Constructed by merely extra auxiliary information (e.g., class label). [150]
DCGAN Deep Convolutional GAN: First structure of de-convolutional neural networks (de-CNN). [151]
LapGAN Laplacian GAN: Combine the CGAN with the framework of the Laplacian pyramid. [152]
InfoGAN Information Maximizing GAN: Learn disentangled design in a wholly unsupervised way. [153]
WGAN Wasserstein GAN: Its loss function derived through Earth-Mover or Wasserstein distance. [154]
BEGAN Boundary Equilibrium GAN: Keep-up an equilibrium between variety and superiority. [155]
PGGAN Progressive-Growing GAN: A multi-scale based GAN architecture. [156]
BigGAN BigGAN: Train bigger neural networks. [157]
StyleGAN Style-Based Generator Architecture for GAN: Control the image synthesis process. [158]
EBGAN Energy-Based GAN: Its discriminator works as an energy function. [159]
RPDGAN Realistic Painting Drawing GAN: An unsupervised cross-domain image translation framework. [160]

Fig. 11. Style Transfer: The style of van Gogh is applied to the content of Black-and-White image to generate an image which preserves the original content of (b) and has the
style of (a) [99].

enough data to support the convergence of network training. When Some data augmentation strategies recently proposed methods fol-
training data are not sufficient, GAN is difficult to achieve a satisfactory low a generative approach. In the work [101], Wang et al. explored
equilibrium, and it is easy to fall into mode collapse. Although the Generative Adversarial Nets to generate images of different styles to
generated sample size has been expanded, it is not obviously helpful augment the dataset. Zheng et al. generated two stylized images for
for the diversity of samples, which is similar to the simple replication each input image, then merged the stylized images and original images
of samples. For the downstream neural network, the model will be to compose the final training dataset [102]. A disadvantage of Neural
biased. At the same time, when dividing training, verification and test Style Transfer Data Augmentation is the effort required to select styles
sets, data leakage may occur. For large scale data, there is generally no to transfer images into. If the style set is too small, further biases could
need to expand the sample or use simple data enhancement methods be introduced into the dataset.
to expand the sample size. When the amount of data is medium, it
seems that using GAN to generate samples and expand data sets is
3.1.5. Meta-learning
helpful to the training convergence of downstream neural networks.
Meta-learning, also known as learning to learn, is a scientific ap-
GAN’s data augmentation may be more effective in some tasks with
proach to observe how different machine learning methods perform in a
low resolution and acceptable definition. The work [95] proposed a
wide range of learning tasks, and then learning from this experience or
training scheme that first uses classical data augmentation to enlarge
meta data to learn new tasks faster than other possible methods [103].
the training set and then used GAN to synthetic data augmentation that
In contrast to conventional machine learning, which solves tasks from
enlarged the size and diversity of datasets. The work [96] used GAN to
scratch using a fixed learning algorithm, meta-learning attempts to
generate synthetic data with the purpose of training classifiers without
refine the learning algorithm itself based on the experience of multiple
using the original data set and oversampling a few class in unbalanced
learning episodes, aiming to learn on the basis of tasks rather than
classification scenario.
samples, that is, learning task-agnostic learning systems rather than
3.1.4. Neural style transfer task-specific models. Successful applications have been demonstrated in
Neural Style Transfer algorithms [97] can apply the artistic style of areas spanning few-shot image recognition [104], unsupervised learn-
one image to another image while preserving its original content, which ing [105], data efficient [106,107] and self-directed [108] reinforce-
is shown in Fig. 11. It serves as a great tool for Data Augmentation ment learning (RL), hyper-parameter optimization [109], and neural
which is somewhat analogous to color space lighting transformations architecture search (NAS) [110].
for images. The style transfer algorithms consist of a descriptive ap- Smart Augmentation is the process of learning suitable augmenta-
proach and a generative approach. The descriptive approach refers to tions when training deep neural networks. Its goal is to learn the best
changing the pixels of a noise image in an iteration, and the generative augmentation strategy for a given class of input data. The work [111]
approach uses a pre-trained model of the desired style to achieve the used two networks, Network A and Network B. Network A is an
same effect in a single forward pass [98]. augmentation network to generate new samples to train Network B.
A model is trained in advance for each style image in the generative Network B can perform some specific tasks. The change in the error
approach. Johnson et al. [99] proposed a two-component architecture- rate in Network B is then backpropagated to update Network A.
generator and loss networks. They also introduce a novel loss function Auto-Augment is a Reinforcement Learning algorithm that searches
based on a perceptual differences between the content and target. for an optimal augmentation policy among a constrained set of ge-
In [100], an approach was proposed to move the computational burden ometric transformations with miscellaneous levels of distortions. The
to a learning stage to improve the work of [99]. work [112] took the labeled images and predefined preprocessing

158
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

3.2.3. Standout
To ensure unconfident units more frequently dropout than confident
units, Standout overlays a binary belief network onto a neural network
which controls the dropout properties of individual neurons [117]. The
belief network is interpreted as tuning the architecture of the neural
network. For each weight in the original neural network, Standout
adds a corresponding weight parameter in the binary belief network.
A layer’s output during training is given by:
( ( ))
𝑦 = 𝑓 (𝑾 𝒙)◦𝒎, 𝑚𝑖 ∼ Bernoulli 𝑔 𝑾 𝑠 𝒙 , (80)
Fig. 12. An example of standard dropout. The left network is fully connected, and the
right has had neurons dropped with probability 0.5. where 𝐖𝑠 representing the belief network’s weights for that layer and
𝑔(⋅) representing the belief network’s activation function.
An effective approach to determine belief network weights is setting
transformations as the input set. It jointly learned the classifier and the them as:
optimal preprocessing transformations for individual images. 𝑾 𝑠 = 𝛼𝑾 + 𝛽 (81)
A disadvantage to meta-learning is that it is a relatively new concept
and has not been heavily tested. Additionally, meta-learning schemes at each training iteration for some constants 𝛼 and 𝛽. The output of
can be difficult and time-consuming to implement. each layer during testing is given by:
( )
3.2. Dropout 𝑦 = 𝑓 (𝑾 𝒙)◦𝑔 𝑾 𝑠 𝒙 . (82)

Training a deep neural net with a large number of learnable pa- 3.2.4. Curriculum dropout
rameters is very costly. Overfitting and long training time will become In a traditional machine learning algorithm, all training exam-
two fundamental challenges. Model compression can simplify model ples are presented to the model in an unorganized fashion, with fre-
size or running time but maintain accuracy, that is, training fewer quent random shuffling. The level of complexity of the concepts to
parameters without losing model accuracy. Dropout zeros out the ac- learn in curriculum learning is proportional to the age of the people,
tivation values of randomly selected neurons during training in order i.e. handling easier knowledge when babies and harder knowledge
to improve sparsity in neural network weights. This property means when adults. Inspired by this, training examples can be subdivided
that dropout methods can be applied in compressing neural network based on their difficulty. Then, the learning is configured so that easier
models by reducing the number of parameters needed to perform examples come first, eventually complicating them and processing the
effectively [113]. hardest ones at the end of training.
Curriculum Dropout was proposed to using a time schedule for the
3.2.1. Standard dropout probability of retaining neurons in the network [118]. This results in an
Hinton et al. proposed the dropout method for the first time [114] adaptive regularization scheme that dynamically increases the expected
and was subsequently applied to large scale visual recognition prob- number of suppressed units to boost the generalization of the model
lems [115]. The main idea is that individual nodes are either kept with while smoothly increasing the difficulty of the optimization problem.
a probability of 𝑝 or omitted from the network with a probability of 1−𝑝 Using a fixed dropout probability during training is proven to be a
in each training iteration. Dropout is not applied to the output layer. suboptimal choice. At the beginning of Curriculum Dropout, no entry
As shown in Fig. 12, on each presentation of each training case, each
of 𝑧0 is set to zero. This clearly corresponds to the easiest available
hidden unit is randomly omitted from the network with a probability
example and considers all possible available visual information. As
of 0.5.
learning time grows, a greater number of entries are set to zero. This
Mathematically, the behavior of standard dropout during training
complicates the challenge and necessitates a greater effort on the part
for a neural network layer is given by:
of the model to capitalize on the limited amount of uncorrupted data
𝒚 = 𝑓 (𝑾 𝒙)◦𝒎, 𝑚𝑖 ∼ Bernoulli(𝑝), (77) available at that point in the training phase.
where 𝒚 is the layer output, 𝑓 (⋅) is the activation function, 𝑾 is the
3.2.5. DropMaps
layer weight matrix, 𝒙 is the layer input, and 𝒎 is the layer dropout
Moradi et al. proposed DropMaps, where for a training batch, each
mask, with each element 𝑚𝑖 being 0 with probability 1−𝑝. Once trained,
feature is kept with a probability of 𝑝 and is dropped with the probabil-
the layer output is given by:
ity of 1 − 𝑝[119]. At test time, all feature maps are kept, and everyone
𝒚 = (1 − 𝑝)𝑓 (𝑾 𝒙). (78) is multiplied by 𝑝. Like Dropout, DropMaps has a regularization effect
that causes the coincidence of feature maps to be avoided. DropMaps
Standard dropout is equivalent to adding layer after a layer of neurons
can handle the problem of overfitting large models for tiny images.
that simply sets values to zero with some probability during training,
and multiplies them by 1 − 𝑝 during testing. Table 7 shows some proposed methods and theoretical advances in
The standard dropout promotes sparsity in the weights of neural dropout.
networks, causing more weights to be near zero [115]. Several variants
have been produced since the standard dropout was proposed, as shown 3.3. Early stopping
in Table 7.
Actually, noisy labels are very common in real-world training data.
3.2.2. Dropconnect Since model is overfitting to noisy labels in the training, the test data
One of the variations on standard dropout was Dropconnect [116]. may have weak generalization. As shown in Fig. 13, the training error
Therefore, in training, the output of a network layer is given by: decreases steadily over time, but validation set error begins to rise
again. A copy of the model parameters is saved every time the error
𝒚 = 𝑓 ((𝑾 ◦𝑴)𝒙), 𝑀𝑖𝑗 ∼ Bernoulli(𝑝), (79)
on the validation set improves. When the training algorithm finishes,
where 𝑴 is a binary matrix encoding the connection information and instead of using the most recent parameters, use these parameters as
𝑀𝑖𝑗 ∼ Bernoulli(𝑝). Dropconnect is only applicable to fully connected the result. Usually the model is stopped when the validation set error
layers and randomly drop the weights rather than the activations. has not been improved for a while, rather than the model reaching

159
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Table 7
Some proposed methods and theoretical advances in dropout.
Year Methods
2012 Standard dropout
2013 Standout; Fast dropout; Dropconnect; Maxout
2014 Annealed dropout; DropAll
2015 Variational dropout; RNNdrop; MaxTpooling dropout; Spatial dropout
2016 Evolutionary dropout; Variational RNN dropout; Monte Carlo dropout; Selective CNN dropout; Swapout
2017 Cutout; Concrete dropout; Curriculum dropout; Tuneout.
2018 Adversarial dropout methods; Fraternal dropout; Information dropout; Targeted dropout; DropBlock; Jumpout
2019 Spectral dropout; Ising dropout; Weighted Channel Dropout; DropMaps
2020 Gradient Dropout; Surrogate Dropout
2021 Ranked dropout; LocalDrop

The tested 𝛼 was 1, 2, 3, 4 and 5.

3.3.2. Low Progress (LP)

In general, it is impossible to judge that the global minimum has
already been reached or not from the start of the curve, i.e., whether
a rise in the generalization error implies real overfitting or is only
temporary. Furthermore, since the generalization loss may still be
running, the stopping may be suppressed if the training error is still
rapidly decreasing. It is assumed that overfitting will only be triggered
when the error decreases slowly. The low progress rule (LP) fires when
the improvements on the training error stall. Using a 𝑘 length training
Fig. 13. The training and validation error curves. strip to measure the training progress, the formulate of 𝑃𝑘 is
( ∑𝑡 ( ′) )
𝑡′ =𝑡−𝑘+1 𝐸𝑡𝑟 𝑡
𝑃𝑘 = 1000 ⋅ − 1 , (86)
the (local) minimum of the validation error. This strategy is known 𝑘 ⋅ min𝑡𝑡′ =𝑡−𝑘+1 𝐸𝑡𝑟 (𝑡′ )
as early stopping. Early stopping is widely used because it is simple where 𝑛 is divisible by 𝑘. It calculates that how much was the average
to understand and implement and has been reported to be superior to training error during the strip larger than the minimum training error
regularization methods in many cases. Early stopping has proven to be during the strip.
a quite effective method for avoiding overfitting [120].
This simply means to consider how much has changed in the past 𝑘
However, if the Early Stopping Rule (ESR) stops too late, the model
steps where 𝑘 is proposed to set to 5. Low Progress is then defined as:
will miss out faster training and eventually lose generality. If it stops
too early, the model may miss out on the accuracy of the predictions. 𝐿𝑃𝛼 ∶ 𝑃5 (𝑡) < 𝛼. (87)
There are two strategies for the ESR, primary rules and secondary
rules. Primary rules are known to become true at some point in time The tested 𝛼 was 1, 2, 3, 4 and 5.
during training and cannot evaluate prediction clearly, such as the
Maximum Number of Epochs. Secondary rules are that cannot be
3.3.3. Generality to Progress Ratio (PQ)
guaranteed to fire sometimes but carry the potential to improve average
accuracy and training speed. Accepting higher generalization loss when there is more develop-
To formally describe these rules, some definitions are introduced ment on the training set in the hopes of later changes on the validation
first. Let 𝐸 be the objective function (error function) of the training set seems to be often preferable. The following formulate uses the
algorithm, for example the mean squared error (MSE) [121]. Then 𝐸𝑡𝑟 (𝑡) quotient of generalization loss and progress:
is the average error per example over the training set, measured after 𝐺𝐿(𝑡)
epoch 𝑡. 𝐸𝑣𝑎 (𝑡) is the corresponding error on the validation set and is 𝑃 𝑄𝛼 ∶ >𝛼 (88)
𝑃5 (𝑡)
used by the stopping criterion. 𝐸𝑡𝑒 (𝑡) is the corresponding error on the
test set, which characterizes the quality of the model resulting from with setting 𝛼 to 1, 2, 3, 4 and 5, and measure the validation error only
training. 𝐸𝑜𝑝𝑡 is the optimal error. at the end of each strip.

3.3.1. Loss of Generality (GL) 3.3.4. Consecutive loss in generality (UP)

Loss of Generality is an automatic rule adapted from manual stop-
While GL is a global method that takes the minimum of all of the
ping procedure. Stop training when the error of the validation dataset
observed samples, UP is a local version of GL. UP is defined as:
exceeds a certain threshold, which is the relatively minimal error
observed so far [122]. 𝑈 𝑃𝑘 ∶ 𝐸𝑣𝑎 (𝑡 − 𝑘) < 𝐸𝑣𝑎 (𝑡 − 𝑘 + 1) < ⋯ < 𝐸𝑣𝑎 (𝑡) . (89)
The optimal error is defined as: ⏟⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏟⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏞⏟
( ) 𝑘
𝐸𝑜𝑝𝑡 (𝑡) = min 𝐸va 𝑡′ . (83)
′
𝑡 ≤𝑡 The main idea of this definition is that when the validation error
Then the loss of generality is defined as: increases more than once during 𝑠 consecutive strip (independent
( ) of how large the increases), it is considered that this indicates the
𝐸va (𝑡)
𝐺𝐿(𝑡) = 100 ⋅ −1 . (84) beginning of the final overfitting. The advantage of UP criteria is
𝐸opt (𝑡)
that changes can be measured locally. During long training periods,
A high generalization loss is one obvious candidate reason to stop the error must be allowed to remain at a much higher level than
training since it directly indicates overfitting. This leads the rule 𝐺𝐿𝛼 the previous minimum. Only any of these criteria cannot guarantee
as:
termination, so the maximum number of iterations is often set at the
𝐺𝐿𝛼 ∶ stop after first epoch 𝑡 with 𝐺𝐿(𝑡) > 𝛼. (85) same time.

160
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

3.3.5. Steady-State Stop Training Trigger (SSSTT)

The basic concept of Steady-State Stop Training Trigger (SSSTT)
is to stop training when the change in prediction is invisible in the
residuals measured by the training data subset. Usually, the training
data set is randomly sampled from 20% to 30% as a small subset. When
the noise exceeds the learning effect then stop training [123].

3.3.6. High Noise Ratio (HNR)

However, it is observed that in some cases the training curves 𝐸𝑡𝑟
get noisier at the end indicating the attempt of the network to overfit.
Motivated by the SSSTT rule, a novel rule is introduced to stop training Fig. 14. Covariate shift vs. Internal covariate shift.
before noise grows too high which is called the High Noise Ratio rule
(HNR). The formulate is
∑𝑘
𝐸𝑡𝑟 (𝑡 − 𝑖) − 2 ⋅ 𝐸𝑡𝑟 (𝑡 − 𝑖 − 1) + 𝐸𝑡𝑟 (𝑡 − 𝑖 − 2) and
𝐻𝑁𝑅𝑘 (𝑡) = 𝑖=1 ∑𝑘 . (90) 𝑛 ( )2
𝑗=1 𝐸𝑡𝑟 (𝑡 − 𝑗) 1 ∑ (𝓁−1)
𝜎𝐵2 = 𝑥 − 𝜇𝐵 (93)
𝑛 𝑖=1 𝑖
The 𝑘 value was set at 20, which is the best value from experience.
Early stopping occurs when are the mean and variance, respectively. Then,

𝐻𝑁𝑅𝛼 ∶ 𝐻𝑁𝑅20 (𝑡) > 𝛼 (91) 𝑥(𝓁−1) − 𝜇𝐵

𝑥̂ (𝓁−1)
𝑖 = 𝑖
√ , (94)
with setting 𝛼 to 5, 10, 15, 20, and 25. 𝜎𝐵2 + 𝜖

3.3.7. Others 𝑥(𝓁) (𝓁) (𝓁−1)

𝑖 =𝛾 𝑥 ̂𝑖 + 𝛽 (𝓁) . (95)
A cross-validated stop is one technique, in which a network is where 𝜇𝐵 is the batch mean and 𝜎𝐵2
the batch variance. The learned
registered and trained for a long time before being cross-validated at scale and shift parameters are denoted by 𝛾 and 𝛽, respectively. Any
recorded epochs [124]. A novel early stopping criteria based on quickly layer that previously received 𝑥 as the input, now receives 𝑓𝐵𝑁 (𝑥).
calculated local statistics of computed gradients eliminates the need The difference between the approaches are how to calculate the mean
for a held-out validation collection entirely [125]. Song et al. resume 𝜇𝑖 and variance 𝜎𝑖2 . The work [127] considered a mini-batch batch
training the early stopped network using a maximal safe set which normalization which replaces 𝑛 by 𝑚 in Eqs. (92) and (93).
maintains a collection of almost certainly true-labeled samples at each Another problem is the covariate shift which frequently occurs in
epoch at the beginning of the early stopping [126]. real-world problems (Fig. 14). A common covariate shift problem is
Early stopping is advantageous since it reduces the computational the difference in the distribution of the training and test set which can
cost of training procedure. In addition to the obvious cost savings by lead to suboptimal generalization performance. At test time, 𝜇 and 𝜎
minimizing the number of training iterations, it provides regularization of each set of mini-batch training data in each layer of the network are
without needing the addition of penalty terms to the cost function retained. The batch mean and variance are replaced by the statistics
or the estimation of the gradients of those additional terms. Different of all samples which, specifically, uses unbiased estimates of mean and
ESR would have individual strengths and disadvantages. Early stopping variance:
may be used individually or in combination with other regularization ( ) 2 𝑚 ( 2 )
𝜇test = E 𝜇batch , 𝜎test = E 𝜎batch . (96)
techniques because computing the ESR is not very costly. Even when 𝑚−1
using regularization strategies that modify the objective function to where E is expectation.
encourage better generalization, it is rare for the best generalization One drawback of BN is its crucial reliance on batch size since a
to occur at a local minimum of the training objective. single batch influence the calculation of the approximation of the mean
and variance. Some works suggest that this restriction has not been
3.4. Batch normalization completely eliminated, but it has been alleviated, which will be further
discussed below.
During the training period, the network parameters are continually
changed. In addition to the data of the input layer, the update of 3.4.1. Layer Normalization
the training parameters of the previous layer will cause the change Layer normalization (LN) [128] normalizes the inputs crossing the
of the input data distribution of the latter layer. The change of data features, while batch normalization normalizes the input features cross-
distribution in the hidden layers is called ’Internal Covariate Shift’ ing the batch dimension. In batch normalization, the statistics are
(Fig. 14). Ideally, each layer should be transformed into space where computed across the batch and are the same for each example in the
they have the same distribution but the functional relationship stays batch. BN is for the output value of the same dimension of multiple
the same. Batch Normalization (BN) incorporates data standardization samples. In contrast, layer normalization is for the output value of each
into the network architecture, avoiding this issue during the training dimension of a sample where the statistics are computed across each
feature and independent of other samples.
process. This is achieved by adding the layers that set the first two
Compared with BN, LN is that all hidden units in a layer share same
moments (mean and variance) of the distribution of each activation to
normalization terms 𝜇 and 𝜎, but different training cases have different
be zero and one respectively. Moreover, BN improves both convergence
normalization terms. LN has no restrictions on the size of a mini-batch
and generalization in training neural networks. BN is most widely used
and can be used in the pure online regime of batch size 1.
for visual fields, while it is limit in nature language processing (NLP). In
In BN, there will be a problem that some channels do not have cer-
NLP, the length of input sentences is different which makes it difficult
tain sample data, even if specific character padding is used, it will cause
to batch process.
uneven distribution of certain channels. The forward normalization is
During the forward pass, the batch normalization process as the
the only decisive factor to LN which makes the input distribution more
follows:
stable, thus brings better convergence. The result of [129] shows that
1 ∑ (𝓁−1)
𝑛
forward normalization has little to do with the effectiveness and the
𝜇𝐵 = 𝑥 , (92)
𝑛 𝑖=1 𝑖 derivatives of the mean and variance play a significant role in LN.

161
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Table 8
For each normalization, 𝑁 is the batch axis, 𝐶 is the channel axis, and (𝐻, 𝑊 ) is the spatial axes. The pixel in blue are normalized by the sam mean and variance, computed by
aggregating the values of these pixel.
Item Content Mean and variant Image
1 ∑𝑁 ∑𝐻 ∑𝑊
𝜇𝑐 (𝑥) = 𝑁𝐻𝑊 𝑛=1 ℎ=1
𝑥
𝑤=1 𝑛𝑐ℎ𝑤
BN BN normalizes the data in each batch. √
1 ∑𝑁 ∑𝐻 ∑𝑊 ( )2
𝜎𝑐 (𝑥) = 𝑁𝐻𝑊 𝑛=1 ℎ=1 𝑤=1 𝑥𝑛𝑐ℎ𝑤 − 𝜇𝑐 (𝑥) +𝜖
1 ∑ 𝐻 ∑ 𝑊
IN is a separate normalization operation 𝜇𝑛𝑐 (𝑥) = 𝐻𝑊 ℎ=1 𝑥
𝑤=1 𝑛𝑐ℎ𝑤
IN √
1 ∑𝐻 ∑𝑊 ( )2
for each channel in a sample.
𝜎𝑛𝑐 (𝑥) = 𝐻𝑊 ℎ=1 𝑤=1
𝑥𝑛𝑐ℎ𝑤 − 𝜇𝑛𝑐 (𝑥) + 𝜖
1 ∑ 𝐶 ∑ 𝐻 ∑ 𝑊
𝜇𝑛 (𝑥) = 𝐶𝐻𝑊 𝑐=1 ℎ=1
𝑥
𝑤=1 𝑛𝑐ℎ𝑤
LN LN normalizes all data in a sample. √
1 ∑𝐶 ∑𝐻 ∑𝑊 ( )2
𝜎𝑛 (𝑥) = 𝐶𝐻𝑊 𝑐=1 ℎ=1 𝑤=1 𝑥𝑛𝑐ℎ𝑤 − 𝜇𝑛 (𝑥) +𝜖
1 ∑(𝑔+1)𝐶∕𝐺 ∑𝐻 ∑𝑊
GN divides the channel of a sample into 𝜇𝑛𝑔 (𝑥) = (𝐶∕𝐺)𝐻𝑊 𝑐=𝑔𝐶∕𝐺 ℎ=1
𝑥
𝑤=1 𝑛𝑐ℎ𝑤
GN
multiple groups, then normalized each √
1 ∑𝑊 ∑𝐻 ∑𝑊 ( )2
group. 𝜎𝑛𝑔 (𝑥) = (𝐶∕𝐺)𝐻𝑊 𝑐=𝑔𝐶∕𝐺 ℎ=1 𝑤=1 𝑥𝑛𝑐ℎ𝑤 − 𝜇𝑛𝑔 (𝑥) +𝜖

Table 9
Some normalization methods.
Model Description Ref
Weight normalization It normalizes the weights of the layer. [161]
Batch-Instance Normalization It tries to learn how much style information should be used for each channel. [162]
Decorrelated Batch Normalization It centers, scales and whitens activations of each layer. [163]
Iterative Normalization It employs Newton’s iterations for much more efficient whitening and avoids the eigen-decomposition. [164]
Batch Renormalization The model outputs are dependent only on the individual examples during both training and inference. [165]
Switchable Normalization It selects different normalizers for different normalization layers of a deep neural network. [166]
Differentiable Dynamic Normalization It learns arbitrary normalization forms in data- and task-driven way for deep neural networks. [167]

3.4.2. Instance Normalization Table 10

The main applications of the regularization technolo-
Instance Normalization (IN) is a separate normalization operation
gies in deep learning.
for each channel in a sample [130]. LN and IN are very similar to each
Technology Main application
other but the difference between them is that IN normalizes across each
Data augmentation Computer vision
channel in each training example instead of normalizing across input
Dropout Model compress
features in a training example. Unlike BN, the instance normalization Early stopping Label noise
layer is applied at test time as well because of the non-dependency of Batch normalization Computer vision
mini-batch [131]. Pre-training Multi-task learning
Weight share Model compress
Multi-task learning Multi-task learning
3.4.3. Group normalization Adding noise Label noise
Group Normalization (GN) is a simple and effective alternative to
batch normalization. Since inaccurate batch statistics estimation, the
BN error grows exponentially as the batch size decreases. This problem
limits the application of BN to computer vision tasks including video, thought to each layer from bottom to top in this technique. Weights
detection, and segmentation which require small batches constrained find a favorable point in parameter space in this manner, resulting in
by memory consumption. Unlike BN, the accuracy of GN is stable over regularizing effects during the fine-tuning process.
a wide range of batch sizes, and its computation is independent of batch The work proposed a pre-training parameter method based on
size. unsupervised feature learning to solve the problem of gradient insta-
GN normalizes over the group of channels for each training exam- bility [133]. The original data is abstracted into highly conceptual and
ple, that is Group Norm is in between IN and LN [132]. GN divides independent features. After pre-training of network parameters, the
the channels into groups and computes the mean and variance for nor- gradient descent algorithm is used to fine-tuning the parameters for
malization within each group. The computation of GN is independent specific classification tasks. Common pre-training methods to initial-
of batch sizes, and its precision is stable in a wide variety of batch ize parameters are restricted Boltzmann machine (RBN) [134], auto-
sizes. When each channels is seen as a single group, GN is transformed encoder [135] and sparse encoding symmetric machine (SESM) [136].
into LN. When each channel is separated into separate classes, GN is
transformed into IN.
3.5.2. Weight share
Table 8 shows the comparison of four normalization methods which
mentioned above. Table 9 shows many other existing methods of Reusing trainable parameters in multiple parts of the network is
normalization. called weight sharing, which makes the model less complex [137].
Using less weights makes model simple if the weights have high prob-
3.5. Other ability density under the mixture model. Clustering the weights into
subsets, even if only one, to reduce the number of sets with weights
3.5.1. Pretraining that have very close values in each cluster. Since the effective mean
This course is so long in deep networks that the gradient signal or variance of the clustering is usually unknown in advance, it is
becomes very small at the network’s front. On the other hand, updates permissible to change the parameters of the mixed model concurrently
in later layers are done based on the weights of former layers that during the training stage.
are going to be changed. These effects slow the convergence process. A famous example is CNN. Weight sharing does not merely reduce
An excellent way to avoid this hassle is to pre-train the network in the number of weights, but also encodes the prior knowledge about the
an unsupervised manner, layer by layer. The identification feature is locality of feature extraction.

162
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

Table 11
The advantages and disadvantages of the regularization technologies in deep learning.
Technology Advantages Disadvantages
Data augmentation Obtains a smaller robust error Induces the data distribution bias
Dropout Prevents units from co-adapting Increases the training time
Early stopping No change to model/algorithm Need validation data
Batch normalization Removes dropout and increased accuracy Difficult to estimate 𝜇 and 𝜎 in the test.

Table 12
Model accuracy, number of epochs the model needs to be learned and the number of operation per input sample
during the training phase [5].
Method None 𝑙2 norm Dropout Data Batch Adding
augmentation normalization noise
Accuracy 82.1 88.3 90.6 88.1 90.8 84.5
Epochs 12 12 28 17 6 16
Training Ops/Sam (M) 230 232 118 232 420 231

3.5.3. Multi-task learning The combination of two or more regularization techniques may lead
A special type of regularization technology is multi-task learning to a great effect, but others might make ’1 + 1 < 1’. For example,
(MTL). Most multi-task data can be collected from diverse domains or Li et al. [147] reconciles dropout and batch normalization, which
various tasks. MTL helps the relationship among tasks. MTL improves reduces the error rate in those applications. It recommends that apply
the performance of all tasks by learning multiple tasks simultane- dropout after batch normalization with a small dropout rate. While
ously [138]. the work [148] shows that 𝑙2 regularization has no regularizing effect
MTL consists feature learning approach, task clustering approach when combined with normalization. Therefore, the combination of
and task relation learning approach. The feature learning approach regularization technologies plays a fundamental role in regularization
assumes that related tasks are in a common feature subspace [139]. The research.
task clustering approach assumes that the tasks form several clusters Computational cost is a factor in the regularization selection. Ta-
where the model is trained by task groups [140]. Task relation learn- ble 12 showed the model accuracy, the number of convergent epochs
ing approach assumes more complex relationships among tasks which and the number of operation per input sample of some regularization
usually learns pairwise relations directly among tasks, such as task technologies [5]. As can be seen, weight decay and data augmentation
similarity [141], task correlations [142], and task covariance [143]. have little computational side effects. Therefore, they can be used
in most applications. In the case of enough computational resources,
3.5.4. Adding noise the related methods of Dropout are reasonable to be used. Moreover,
Adding noise to a model or weight may be used to make structures in the case of abundant computing resources, batch normalization
more generalizable and avoid overfitting [144,145]. In deep learning, family methods are reasonable strategies used as regularization in the
noise can be added either to the input data or to the weights of the network.
network. Adding noise to the input data has a relatively long history in
data pre-processing and in some literature, it is referred to as dithering. 4. Conclusion
In deep learning, adding noise to the input data hinders memorizing but
preserves learnability. The idea of adding noise has similar effects on In this paper, we reviewed favorite regularization techniques in
different problems in machine learning. recent years. In machine learning, this paper introduced sparse regular-
ization, low-rank regularization, and manifold regularization. In deep
3.6. Discussion learning, data augmentation, dropout, early stopping, batch normal-
ization are seen as the common regularization strategies. This paper
This part will compare the mentioned regularization technologies summarized how to choose the regularization technologies as follows:
and conclude existing opportunities and challenges of the regulariza-
• For sparse regularization, it is meaningful to construct a reg-
tions in deep learning.
ularization that has good mathematical properties for specific
While simple to state, these comparisons might have profound
tasks.
implications:
• The low-rank regularization can be constructed by extending the
• The main applications of these regularization technologies are sparse regularizations.
shown in Table 10. • Choose the regularization technologies that lead to the optimiza-
• The advantages and disadvantages of the regularization technolo- tion properties.
gies in deep learning are discussed in Table 11. • Manifold regularization should guarantee the smoothness of the
learning function.
Regularization technologies have rapidly become an essential part
• Computational cost can be considered as a factor when choosing
of the deep learning toolkit. However, a fundamental problem that
regularizations.
has accompanied this rapid progress is a lack of theoretical proof. In
particular, it is unclear how these technologies improve generalization, It seems the trend of employing more effective regularization tech-
even unknown whether they achieved better generalization. niques will continue and we will experience better and smarter methods
The lack of these theories greatly limits the development of reg- in the near future.
ularization technologies. However, a recent work thinks BN makes Several findings of this article, as well as potential future research,
the optimization landscape significantly smoother instead of avoiding are summarized below: (1) Developing more effective regularization
covariate shift [146]. Specifically, the Lipschitzness of both the loss strategies has been the subject of significant research efforts. (2) The
and the gradients induce a more predictive and stable behavior of advancement of regularization technologies is severely hampered by
the gradients, allowing for faster training. Regularizations with good the lack of theoretical clarification. In particular, it is unclear how
properties should be preferred. these regularization technologies improved generalization, and it is also

163
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

unclear whether they achieved better generalization. (3) The combi- [24] D. Geman, G. Reynolds, Constrained restoration and the recovery of
nation of two or more regularization techniques may lead to a great discontinuities, IEEE Trans. Pattern Anal. Mach. Intell. 14 (3) (1992) 367–383.
[25] I.W. Selesnick, I. Bayram, Sparse signal estimation by maximally sparse convex
effect, but others might make ’1 + 1 < 1’. Therefore, taking into
optimization, IEEE Trans. Signal Process. 62 (5) (2014) 1078–1092.
account the combination of regularization technologies is important in [26] M. Malek-Mohammadi, C.R. Rojas, B. Wahlberg, A class of nonconvex penalties
regularization research. preserving overall convexity in optimization-based mean filtering, IEEE Trans.
Signal Process. 64 (24) (2016) 6650–6664.
CRediT authorship contribution statement [27] J. Fan, Y. Liao, H. Liu, An overview of the estimation of large covariance and
precision matrices, Econom. J. 19 (1) (2016) C1–C32.
[28] J. Fan, F. Han, H. Liu, Challenges of big data analysis, Natl. Sci. Rev. 1 (2)
Yingjie Tian: Funding acquisition, Resources, Writing – review- (2014) 293–314.
ing and editing, Conceptualization, Supervision. Yuqi Zhang: Writing- [29] R.C. Qiu, P. Antonik, Smart Grid using Big Data Analytics: A Random Matrix
original draft, Visualization, Methodology. Theory Approach, John Wiley and Sons, 2017.
[30] H. Liu, L. Wang, T. Zhao, Sparse covariance matrix estimation with eigenvalue
constraints, J. Comput. Graph. Statist. 23 (2) (2014) 439–459.
Declaration of competing interest
[31] D. Belomestny, M. Trabs, A.B. Tsybakov, Sparse covariance matrix estimation
in high-dimensional deconvolution, Bernoulli 25 (3) (2019) 1901–1938.
The authors declare that they have no known competing finan- [32] X. Liu, N. Zhang, Sparse inverse covariance matrix estimation via the-norm with
cial interests or personal relationships that could have appeared to tikhonov regularization, Inverse Problems 35 (11) (2019) 115010.
influence the work reported in this paper. [33] C. Ding, D. Zhou, X. He, H. Zha, R 1-pca: rotational invariant l 1-norm principal
component analysis for robust subspace factorization, in: Proceedings of the
23rd International Conference on Machine Learning, 2006, pp. 281–288.
Acknowledgment [34] N. Wang, Y. Xue, Q. Lin, P. Zhong, Structured sparse multi-view feature
selection based on weighted hinge loss, Multimedia Tools Appl. 78 (11) (2019)
This work has been partially supported by grants from: National 15455–15481.
Natural Science Foundation of China (No. 12071458, 71731009). [35] H. Liu, M. Palatucci, J. Zhang, Blockwise coordinate descent procedures for
the multi-task lasso, with applications to neural semantic basis discovery, in:
Proceedings of the 26th Annual International Conference on Machine Learning,
References 2009, pp. 649–656.
[36] P. Gong, J. Ye, C. Zhang, Multi-stage multi-task feature learning, J. Mach. Learn.
[1] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. Res. 14 (1) (2013) 2979–3010.
[2] J. Kukavcka, V. Golkov, D. Cremers, Regularization for deep learning: A [37] S. Wang, D. Liu, Z. Zhang, Nonconvex relaxation approaches to robust ma-
taxonomy, 2017, arXiv preprint arXiv:1710.10686. trix recovery, in: Twenty-Third International Joint Conference on Artificial
[3] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep Intelligence, 2013.
learning (still) requires rethinking generalization, Commun. ACM 64 (3) (2021) [38] E.J. Candes, B. Recht, Exact matrix completion via convex optimization, Found.
107–115. Comput. Math. 9 (6) (2009) 717–772.
[4] Z. Hu, F. Nie, R. Wang, X. Li, Low rank regularization: a review, Neural Netw. [39] R. Mazumder, T. Hastie, R. Tibshirani, Spectral regularization algorithms for
(2020). learning large incomplete matrices, J. Mach. Learn. Res. 11 (2010) 2287–2322.
[5] R. Moradi, R. Berangi, B. Minaei, A survey of regularization strategies for deep
[40] T. Hastie, R. Mazumder, J.D. Lee, R. Zadeh, Matrix completion and low-rank
models, Artif. Intell. Rev. 53 (6) (2020) 3947–3986.
SVD via fast alternating least squares, J. Mach. Learn. Res. 16 (1) (2015)
[6] F. Wen, L. Chu, P. Liu, R.C. Qiu, A survey on nonconvex regularization-
3367–3402.
based sparse and low-rank recovery in signal processing, statistics, and machine
[41] T. Bouwmans, E.H. Zahzah, Robust PCA via principal component pursuit: A
learning, IEEE Access 6 (2018) 69883–69906.
review for a comparative evaluation in video surveillance, Comput. Vis. Image
[7] L.C. Potter, E. Ertin, J.T. Parker, M. Cetin, Sparsity and compressed sensing in
Underst. 122 (2014) 22–34.
radar imaging.
[42] T. Bouwmans, S. Javed, H. Zhang, Z. Lin, R. Otazo, On the applications of robust
[8] C.R. Berger, Z. Wang, J. Huang, S. Zhou, Application of compressive sensing
PCA in image and video processing, Proc. IEEE 106 (8) (2018) 1427–1457.
to sparse channel estimation, IEEE Commun. Mag. 48 (11) (2010) 164–174.
[43] L. Luo, S. Bao, C. Tong, Sparse robust principal component analysis with
[9] M. Lustig, D. Donoho, J.M. Pauly, Sparse MRI: The application of compressed
applications to fault detection and diagnosis, Ind. Eng. Chem. Res. 58 (3) (2019)
sensing for rapid mr imaging, Magn. Reson. Med. 58 (6) (2007) 1182–1195.
1300–1309.
[10] J. Yang, J. Wright, T.S. Huang, Y. Ma, Image super-resolution via sparse
[44] E. Kim, M. Lee, S. Oh, Elastic-net regularization of singular values for robust
representation, IEEE Trans. Image Process. 19 (11) (2010) 2861–2873.
subspace learning, in: Proceedings of the IEEE Conference on Computer Vision
[11] X. Jiang, R. Ying, F. Wen, S. Jiang, P. Liu, An improved sparse reconstruction
and Pattern Recognition, 2015, pp. 915–923.
algorithm for speech compressive sensing using structured priors, in: 2016 IEEE
[45] F. Nie, H. Huang, C. Ding, Low-rank matrix recovery via efficient schatten p-
International Conference on Multimedia and Expo (ICME), IEEE, 2016, pp. 1–6.
norm minimization, in: Twenty-Sixth AAAI Conference on Artificial Intelligence,
[12] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc.
2012.
Ser. B Stat. Methodol. 58 (1) (1996) 267–288.
[13] M.B. McCoy, V. Cevher, Q.T. Dinh, A. Asaei, L. Baldassarre, Convexity in source [46] L. Liu, W. Huang, D.-R. Chen, Exact minimum rank approximation via schatten
separation: Models, geometry, and algorithms, IEEE Signal Process. Mag. 31 (3) p-norm minimization, J. Comput. Appl. Math. 267 (2014) 218–227.
(2014) 87–95. [47] S. Gu, L. Zhang, W. Zuo, X. Feng, Weighted nuclear norm minimization with
[14] Z. Xu, X. Chang, F. Xu, H. Zhang, l1/2 regularization: A thresholding represen- application to image denoising, in: Proceedings of the IEEE Conference on
tation theory and a fast solver, IEEE Trans. Neural Netw. Learn. Syst. 23 (7) Computer Vision and Pattern Recognition, 2014, pp. 2862–2869.
(2012) 1013–1027. [48] Y. Xie, S. Gu, Y. Liu, W. Zuo, W. Zhang, L. Zhang, Weighted schatten 𝑝-norm
[15] N. Parikh, S. Boyd, et al., Proximal algorithms, Found. Trends R Optimiz. 1 minimization for image denoising and background subtraction, IEEE Trans.
(2014). Image Process. 25 (10) (2016) 4842–4857.
[16] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its [49] C. Lu, J. Tang, S. Yan, Z. Lin, Generalized nonconvex nonsmooth low-rank
oracle properties, J. Amer. Statist. Assoc. 96 (456) (2001) 1348–1360. minimization, in: Proceedings of the IEEE Conference on Computer Vision and
[17] C.-H. Zhang, Nearly unbiased variable selection under minimax concave Pattern Recognition, 2014, pp. 4130–4137.
penalty, Ann. Statist. 38 (2) (2010) 894–942. [50] C. Peng, Z. Kang, H. Li, Q. Cheng, Subspace clustering using log-determinant
[18] N. Anantrasirichai, R. Zheng, I. Selesnick, A. Achim, Image fusion via sparse rank approximation, in: Proceedings of the 21th ACM SIGKDD International
regularization with non-convex penalties, Pattern Recognit. Lett. 131 (2020) Conference on Knowledge Discovery and Data Mining, 2015, pp. 925–934.
355–360. [51] C. Gao, N. Wang, Q. Yu, Z. Zhang, A feasible nonconvex relaxation approach
[19] T. Zhang, Analysis of multi-stage convex relaxation for sparse regularization., to feature selection, in: Proceedings of the AAAI Conference on Artificial
J. Mach. Learn. Res. 11 (3) (2010). Intelligence, 2011, 25(1).
[20] J. Wangni, D. Lin, Learning sparse visual representations with leaky capped [52] J.H. Friedman, Fast sparse regression and classification, Int. J. Forecast. 28 (3)
norm regularizers, 2017, arXiv preprint arXiv:1711.02857. (2012) 722–738.
[21] H.-Y. Gao, A.G. Bruce, Waveshrink with firm shrinkage, Statist. Sinica (1997) [53] D. Geman, C. Yang, Nonlinear image recovery with half-quadratic regulariza-
855–874. tion, IEEE Trans. Image Process. 4 (7) (1995) 932–946.
[22] I. Selesnick, M. Farshchian, Sparse signal approximation via nonseparable [54] J. Trzasko, A. Manduca, Highly undersampled magnetic resonance image
regularization, IEEE Trans. Signal Process. 65 (10) (2017) 2561–2575. reconstruction via homotopic l0-minimization, IEEE Trans. Med. Imaging 28
[23] M. Nikolova, Energy Minimization Methods, Springer, 2011. (1) (2008) 106–121.

164
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

[55] C. Lu, C. Zhu, C. Xu, S. Yan, Z. Lin, Generalized singular value thresholding, [86] C. Lei, B. Hu, D. Wang, S. Zhang, Z. Chen, A preliminary study on data
in: Proceedings of the AAAI Conference on Artificial Intelligence, 2015, vol. augmentation of deep learning for image classification, in: Proceedings of the
29(1). 11th Asia-Pacific Symposium on Internetware, 2019, pp. 1–6.
[56] Y. Zhang, M. Rabbat, A graph-cnn for 3d point cloud classification, in: 2018 [87] H. Bagherinezhad, M. Horton, M. Rastegari, A. Farhadi, Label refinery: Im-
IEEE International Conference on Acoustics, Speech and Signal Processing proving imagenet classification through label progression, 2018, arXiv preprint
(ICASSP), IEEE, 2018, pp. 6279–6283. arXiv:1805.02641.
[57] D. Yang, W. Gao, Pointmanifold: Using manifold learning for point cloud [88] T. DeVries, G.W. Taylor, Dataset augmentation in feature space, 2017, arXiv
classification, 2020, arXiv preprint arXiv:2010.07215. preprint arXiv:1702.05538.
[58] J. Zeng, G. Cheung, M. Ng, J. Pang, C. Yang, 3D Point cloud denoising using [89] X. Gastaldi, Shake-shake regularization, 2017, arXiv preprint arXiv:1705.07485.
graph Laplacian regularization of a low dimensional manifold model, IEEE [90] Y. Yamada, M. Iwamura, T. Akiba, K. Kise, Shakedrop regularization for deep
Trans. Image Process. 29 (2019) 3474–3489. residual learning, IEEE Access 7 (2019) 186126–186136.
[59] X. Ma, W. Liu, Recent advances of manifold regularization, in: Manifolds [91] Y. Wang, Z.-P. Bian, J. Hou, L.-P. Chau, Convolutional neural networks with
II-Theory and Applications, IntechOpen, 2018. dynamic regularization, IEEE Trans. Neural Netw. Learn. Syst. 32 (5) (2020)
[60] Y. Zhang, J. Wu, Z. Cai, S.Y. Philip, Multi-view multi-label learning with sparse 2299–2304.
feature selection for image annotation, IEEE Trans. Multimed. 22 (11) (2020) [92] V. Kumar, H. Glaude, C. de Lichy, W. Campbell, A closer look at feature
2844–2857. space data augmentation for few-shot intent classification, 2019, arXiv preprint
[61] C. Shi, Q. Ruan, G. An, C. Ge, Semi-supervised sparse feature selection based arXiv:1910.04176.
on multi-view Laplacian regularization, Image Vis. Comput. 41 (2015) 1–10. [93] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, Smote: synthetic
[62] Y. Li, X. Shi, C. Du, Y. Liu, Y. Wen, Manifold regularized multi-view feature minority over-sampling technique, J. Artificial Intelligence Res. 16 (2002)
selection for social image annotation, Neurocomputing 204 (2016) 135–141. 321–357.
[63] B. Geng, D. Tao, C. Xu, L. Yang, X.-S. Hua, Ensemble manifold regularization, [94] C. Shorten, T.M. Khoshgoftaar, A survey on image data augmentation for deep
IEEE Trans. Pattern Anal. Mach. Intell. 34 (6) (2012) 1227–1233. learning, J. Big Data 6 (1) (2019) 1–48.
[64] X. Ma, D. Tao, W. Liu, Effective human action recognition by combining [150] M. Mirza, S. Osindero, Conditional generative adversarial nets, 2014, arXiv
manifold regularization and pairwise constraints, Multimedia Tools Appl. 78 preprint arXiv:1411.1784.
(10) (2019) 13313–13329. [151] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep
[65] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview Hessian discriminative sparse convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012)
coding for image annotation, Comput. Vis. Image Underst. 118 (2014) 50–60. 1097–1105.
[149] D. Tao, L. Jin, W. Liu, X. Li, Hessian regularized support vector machines for [152] E. Denton, S. Chintala, A. Szlam, R. Fergus, Deep generative image models
mobile image annotation on the cloud, IEEE Trans. Multimed. 15 (4) (2013) using a laplacian pyramid of adversarial networks, 2015, arXiv preprint arXiv:
833–844. 1506.05751.
[66] K.I. Kim, F. Steinke, M. Hein, Semi-supervised regression using hessian energy
[153] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, Infogan:
with an application to semi-supervised dimensionality reduction, MPI for
Interpretable representation learning by information maximizing generative
Biological Cybernetics, 2010.
adversarial nets, in: Proceedings of the 30th International Conference on Neural
[67] W. Liu, H. Liu, D. Tao, Y. Wang, K. Lu, Multiview hessian regularized logistic
Information Processing Systems, 2016, pp. 2180–2188.
regression for action recognition, Signal Process. 110 (2015) 101–107.
[154] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial net-
[68] G. Feng, W. Liu, S. Li, D. Tao, Y. Zhou, Hessian-regularized multitask dictionary
works, in: International Conference on Machine Learning, PMLR, 2017, pp.
learning for remote sensing image recognition, IEEE Geosci. Remote Sens. Lett.
214–223.
16 (5) (2018) 821–825.
[155] D. Berthelot, T. Schumm, L. Metz, Began: Boundary equilibrium generative
[69] S. Lefkimmiatis, A. Bourquard, M. Unser, Hessian-based norm regularization for
adversarial networks, 2017, arXiv preprint arXiv:1703.10717.
image restoration with biomedical applications, IEEE Trans. Image Process. 21
[156] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of gans for
(3) (2011) 983–995.
improved quality, stability, and variation, 2017, arXiv preprint arXiv:1710.
[70] W. Liu, D. Tao, Multiview hessian regularization for image annotation, IEEE
10196.
Trans. Image Process. 22 (7) (2013) 2676–2687.
[157] A. Brock, J. Donahue, K. Simonyan, Large scale GAN training for high fidelity
[71] T. Buhler, M. Hein, Spectral clustering based on the graph p-Laplacian, in:
natural image synthesis, 2018, arXiv preprint arXiv:1809.11096.
Proceedings of the 26th Annual International Conference on Machine Learning,
[158] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative
2009, pp. 81–88.
adversarial networks, in: Proceedings of the IEEE/CVF Conference on Computer
[72] D. Zhou, B. Scholkopf, Regularization on discrete spaces, in: Joint Pattern
Vision and Pattern Recognition, 2019, pp. 4401–4410.
Recognition Symposium, Springer, 2005, pp. 361–368.
[159] J. Zhao, M. Mathieu, Y. LeCun, Energy-based generative adversarial network,
[73] D. Luo, H. Huang, C. Ding, F. Nie, On the eigenvectors of p-Laplacian, Mach.
2016, arXiv preprint arXiv:1609.03126.
Learn. 81 (1) (2010) 37–51.
[74] W. Liu, X. Ma, Y. Zhou, D. Tao, J. Cheng, p-Laplacian regularization for scene [160] X. Gao, Y. Tian, Z. Qi, Rpd-gan: Learning to draw realistic paintings
recognition, IEEE Trans. Cybern. 49 (8) (2018) 2927–2940. with generative adversarial network, IEEE Trans. Image Process. 29 (2020)
[75] A. Elmoataz, X. Desquesnes, O. Lezoray, Non-local morphological PDEs and 8706–8720.
p-Laplacian equation on graphs with applications in image processing and [95] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, Synthetic data
machine learning, IEEE J. Sel. Top. Sign. Proces. 6 (7) (2012) 764–779. augmentation using GAN for improved liver lesion classification, in: 2018 IEEE
[76] A. Elmoataz, F. Lozes, M. Toutain, Nonlocal pdes on graphs: From tug-of-war 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018,
games to unified interpolation on images and point clouds, J. Math. Imaging pp. 289–293.
Vision 57 (3) (2017) 381–401. [96] F.H.K.d.S. Tanaka, C. Aranha, Data augmentation using GANs, 2019, arXiv
[77] W. Liu, Z.-J. Zha, Y. Wang, K. Lu, D. Tao, p-Laplacian regularized sparse preprint arXiv:1904.09135.
coding for human activity recognition, IEEE Trans. Ind. Electron. 63 (8) (2016) [97] L.A. Gatys, A.S. Ecker, M. Bethge, A neural algorithm of artistic style, 2015,
5120–5129. arXiv preprint arXiv:1508.06576.
[78] X. Ma, W. Liu, S. Li, D. Tao, Y. Zhou, Hypergraph p-Laplacian regularization [98] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, M. Song, Neural style transfer: A review,
for remotely sensed image recognition, IEEE Trans. Geosci. Remote Sens. 57 IEEE Trans. Vis. Comput. Graphics 26 (11) (2019) 3365–3385.
(3) (2018) 1585–1595. [99] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer
[79] X. Ma, W. Liu, D. Tao, Y. Zhou, Ensemble p-laplacian regularization for scene and super-resolution, in: European Conference on Computer Vision, Springer,
image recognition, Cogn. Comput. 11 (6) (2019) 841–854. 2016, pp. 694–711.
[80] D. Slepcev, M. Thorpe, Analysis of p-laplacian regularization in semisupervised [100] D. Ulyanov, V. Lebedev, A. Vedaldi, V.S. Lempitsky, Texture networks: Feed-
learning, SIAM J. Math. Anal. 51 (3) (2019) 2085–2120. forward synthesis of textures and stylized images, in: ICML, vol. 1(2), 2016,
[81] C. Sun, A. Shrivastava, S. Singh, A. Gupta, Revisiting unreasonable effective- pp. 4.
ness of data in deep learning era, in: Proceedings of the IEEE International [101] L. Perez, J. Wang, The effectiveness of data augmentation in image classification
Conference on Computer Vision, 2017, pp. 843–852. using deep learning, 2017, arXiv preprint arXiv:1712.04621.
[82] J. Lu, P. Gong, J. Ye, C. Zhang, Learning from very few samples: A survey, [102] X. Zheng, T. Chalasani, K. Ghosal, S. Lutz, A. Smolic, Stada: Style transfer as
2020, arXiv preprint arXiv:2009.02653. data augmentation, 2019, arXiv preprint arXiv:1909.01056.
[83] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in [103] T. Hospedales, A. Antoniou, P. Micaelli, A. Storkey, Meta-learning in neural
the details: Delving deep into convolutional nets, 2014, arXiv preprint arXiv: networks: A survey, 2020, arXiv preprint arXiv:2004.05439.
1405.3531. [104] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation
[84] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, of deep networks, in: International Conference on Machine Learning, PMLR,
in: European Conference on Computer Vision, Springer, 2014, pp. 818–833. 2017, pp. 1126–1135.
[85] E.D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q.V. Le, Autoaugment: Learning [105] L. Metz, N. Maheswaranathan, B. Cheung, J. Sohl-Dickstein, Meta-learning
augmentation strategies from data, in: Proceedings of the IEEE/CVF Conference update rules for unsupervised representation learning, 2018, arXiv preprint
on Computer Vision and Pattern Recognition, 2019, pp. 113–123. arXiv:1804.00222.

165
Y. Tian and Y. Zhang Information Fusion 80 (2022) 146–166

[106] Y. Duan, J. Schulman, X. Chen, P.L. Bartlett, I. Sutskever, P. Abbeel, Rl2: Fast [132] Y. Wu, K. He, Group normalization, in: Proceedings of the European Conference
reinforcement learning via slow reinforcement learning, 2016, arXiv preprint on Computer Vision (ECCV), 2018, pp. 3–19.
arXiv:1611.02779. [161] T. Salimans, D.P. Kingma, Weight normalization: A simple reparameterization
[107] R. Houthooft, R.Y. Chen, P. Isola, B.C. Stadie, F. Wolski, J. Ho, P. Abbeel, to accelerate training of deep neural networks, Adv. Neural Inf. Process. Syst.
Evolved policy gradients, 2018, arXiv preprint arXiv:1802.04821. 29 (2016) 901–909.
[108] F. Alet, M.F. Schneider, T. Lozano-Perez, L.P. Kaelbling, Meta-learning curiosity [162] H. Nam, H.-E. Kim, Batch-instance normalization for adaptively style-invariant
algorithms, 2020, arXiv preprint arXiv:2003.05325. neural networks, 2018, arXiv preprint arXiv:1805.07925.
[109] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, M. Pontil, Bilevel program- [163] L. Huang, D. Yang, B. Lang, J. Deng, Decorrelated batch normalization,
ming for hyperparameter optimization and meta-learning, in: International in: Proceedings of the IEEE Conference on Computer Vision and Pattern
Conference on Machine Learning, PMLR, 2018, pp. 1568–1577. Recognition, 2018, pp. 791–800.
[110] H. Liu, K. Simonyan, Y. Yang, Darts: Differentiable architecture search, 2018, [164] L. Huang, Y. Zhou, F. Zhu, L. Liu, L. Shao, Iterative normalization: Beyond
arXiv preprint arXiv:1806.09055. standardization towards efficient whitening, in: Proceedings of the IEEE/CVF
[111] J. Lemley, S. Bazrafkan, P. Corcoran, Smart augmentation learning an optimal Conference on Computer Vision and Pattern Recognition, 2019, pp. 4874–4883.
data augmentation strategy, Ieee Access 5 (2017) 5858–5869. [165] S. Ioffe, Batch renormalization: Towards reducing minibatch dependence in
[112] T.N. Minh, M. Sinn, H.T. Lam, M. Wistuba, Automated image data preprocessing batch-normalized models, 2017, arXiv preprint arXiv:1702.03275.
with deep reinforcement learning, 2018, arXiv preprint arXiv:1806.05886. [166] P. Luo, J. Ren, Z. Peng, R. Zhang, J. Li, Differentiable learning-to-normalize via
[113] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: switchable normalization, 2018, arXiv preprint arXiv:1806.10779.
a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. [167] P. Luo, P. Zhanglin, S. Wenqi, Z. Ruimao, R. Jiamin, W. Lingyun, Differen-
15 (1) (2014) 1929–1958. tiable dynamic normalization for learning deep representation, in: International
[114] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R.R. Salakhutdinov, Conference on Machine Learning, PMLR, 2019, pp. 4203–4211.
Improving neural networks by preventing co-adaptation of feature detectors, [133] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief
2012, arXiv preprint arXiv:1207.0580. nets.
[115] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep [134] C. Xie, J. Lv, X. Li, Finding a good initial configuration of parameters
convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012) for restricted Boltzmann machine pre-training, Soft Comput. 21 (21) (2017)
1097–1105. 6471–6479.
[116] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, R. Fergus, Regularization of neural [135] S. Kokalj-Filipovic, R. Miller, N. Chang, C.L. Lau, Mitigation of adversarial
networks using dropconnect, in: International Conference on Machine Learning, examples in rf deep classifiers utilizing autoencoder pre-training, in: 2019
PMLR, 2013, pp. 1058–1066. International Conference on Military Communications and Information Systems,
[117] J. Ba, B. Frey, Adaptive dropout for training deep neural networks, Adv. Neural ICMCIS, IEEE, 2019, pp. 1–6.
Inf. Process. Syst. 26 (2013) 3084–3092. [136] C. Plahl, T.N. Sainath, B. Ramabhadran, D. Nahamoo, Improved pre-training of
[118] P. Morerio, J. Cavazza, R. Volpi, R. Vidal, Curriculum dropout. deep belief networks using sparse encoding symmetric machines, in: 2012 IEEE
[119] R. Moradi, R. Berangi, B. Minaei, Sparsemaps: convolutional networks with International Conference on Acoustics, Speech and Signal Processing, ICASSP,
sparse feature maps for tiny image classification, Expert Syst. Appl. 119 (2019) IEEE, 2012, pp. 4165–4168.
142–154. [137] S.J. Nowlan, G.E. Hinton, Simplifying neural networks by soft weight-sharing,
[120] A. Lodwich, Y. Rangoni, T. Breuel, Evaluation of robustness and performance of Neural Comput. 4 (4) (1992) 473–493.
early stopping rules with multi layer perceptrons, in: 2009 International Joint [138] J. Zhang, J. Miao, K. Zhao, Y. Tian, Multi-task feature selection with sparse
Conference on Neural Networks, IEEE, 2009, pp. 1877–1884. regularization to extract common and task-specific features, Neurocomputing
[121] R. Ganguli, S. Bandopadhyay, Neural network performance versus network 340 (2019) 76–89.
architecture for a quick stop training application, in: APCOM 2003: 31st [139] A. Maurer, M. Pontil, B. Romera-Paredes, Sparse coding for multitask and
International Symposium on Application of Computers and Operations Research transfer learning, in: International Conference on Machine Learning, PMLR,
in the Minerals Industries, South African Institute of Mining and Metallurgy, 2013, pp. 343–351.
2003, p. 39. [140] Y. Zhang, Q. Yang, A survey on multi-task learning, 2017, arXiv preprint
[122] L. Prechelt, Automatic early stopping using cross validation: quantifying the arXiv:1707.08114.
criteria, Neural Netw. 11 (4) (1998) 761–767. [141] C. Williams, E.V. Bonilla, K.M. Chai, Multi-task Gaussian process prediction,
[123] M.S. Iyer, R.R. Rhinehart, A novel method to stop neural network training, Adv. Neural Inf. Process. Syst. (2007) 153–160.
in: Proceedings of the 2000 American Control Conference. ACC (IEEE Cat. No. [142] Y. Zhang, D.Y. Yeung, Multilabel relationship learning, ACM Trans. Knowl.
00CH36334), vol. 2, IEEE, 2000, pp. 929–933. Discov. Data 7 (2) (2013) 1–30.
[124] R. Caruana, A. Niculescu-Mizil, An empirical comparison of supervised learning [143] Y. Zhang, D.-Y. Yeung, A regularization approach to learning task relationships
algorithms, in: Proceedings of the 23rd International Conference on Machine in multitask learning, ACM Trans. Knowl. Discov. Data 8 (3) (2014) 1–31.
Learning, 2006, pp. 161–168. [144] B. Poole, J. Sohl-Dickstein, S. Ganguli, Analyzing noise in autoencoders and
[125] M. Mahsereci, L. Balles, C. Lassner, P. Hennig, Early stopping without a deep networks, 2014, arXiv preprint arXiv:1406.1831.
validation set, 2017, arXiv preprint arXiv:1703.09580. [145] S. Hochreiter, J. Schmidhuber, Simplifying neural nets by discovering flat
[126] H. Song, M. Kim, D. Park, J.-G. Lee, How does early stopping help minima, in: Advances in Neural Information Processing Systems, 1995, pp.
generalization against label noise? 2019, arXiv preprint arXiv:1911.08059. 529–536.
[127] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training [146] S. Santurkar, D. Tsipras, A. Ilyas, A. Madry, How does batch normalization help
by reducing internal covariate shift, in: International Conference on Machine optimization? in: Proceedings of the 32nd International Conference on Neural
Learning, PMLR, 2015, pp. 448–456. Information Processing Systems, 2018, pp. 2488–2498.
[128] J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization, 2016, arXiv preprint [147] X. Li, S. Chen, X. Hu, J. Yang, Understanding the disharmony between dropout
arXiv:1607.06450. and batch normalization by variance shift, in: Proceedings of the IEEE/CVF
[129] J. Xu, X. Sun, Z. Zhang, G. Zhao, J. Lin, Understanding and improving layer Conference on Computer Vision and Pattern Recognition, 2019, pp. 2682–2690.
normalization, 2019, arXiv preprint arXiv:1911.07013. [148] T. Van Laarhoven, L2 regularization versus batch and weight normalization,
[130] D. Ulyanov, A. Vedaldi, V. Lempitsky, Instance normalization: The missing 2017, arXiv preprint arXiv:1706.05350.
ingredient for fast stylization, 2016, arXiv preprint arXiv:1607.08022.
[131] Z. Xu, X. Yang, X. Li, X. Sun, The effectiveness of instance normalization: a
strong baseline for single image dehazing, 2018, arXiv preprint arXiv:1805.
03305.

166

Tiger Tools
No ratings yet
Tiger Tools
2 pages
ANSYS Workbench: Mechanical Examples
No ratings yet
ANSYS Workbench: Mechanical Examples
54 pages
Deep Learning 02
No ratings yet
Deep Learning 02
28 pages
Regularization (Mathematics) - Wikipedia
No ratings yet
Regularization (Mathematics) - Wikipedia
13 pages
Regularization (Mathematics)
No ratings yet
Regularization (Mathematics)
11 pages
Module - 2 Ver 1.4
No ratings yet
Module - 2 Ver 1.4
35 pages
CM20315 09 Regularization
No ratings yet
CM20315 09 Regularization
44 pages
Rethinking Discount Regularization: New Interpretations, Unintended Consequences, and Solutions For Regularization in Reinforcement Learning
No ratings yet
Rethinking Discount Regularization: New Interpretations, Unintended Consequences, and Solutions For Regularization in Reinforcement Learning
48 pages
Unit - 4-NNDL - Notes
No ratings yet
Unit - 4-NNDL - Notes
14 pages
Deep Learning Basics Lecture 3 Regularization I
No ratings yet
Deep Learning Basics Lecture 3 Regularization I
32 pages
Regularization Networks and Support Vector Machines
No ratings yet
Regularization Networks and Support Vector Machines
53 pages
Unit - 4 REGULARIZATION FOR DEEP LEARNING
No ratings yet
Unit - 4 REGULARIZATION FOR DEEP LEARNING
56 pages
Fit Without Fear - Remarkable Mathematical Phenomena of Deep Learning Through The Prism of Interpolation
No ratings yet
Fit Without Fear - Remarkable Mathematical Phenomena of Deep Learning Through The Prism of Interpolation
51 pages
Unit Ii
No ratings yet
Unit Ii
8 pages
21 CF With Regularization (Guide)
No ratings yet
21 CF With Regularization (Guide)
2 pages
DL Unit-3
No ratings yet
DL Unit-3
56 pages
Unit 4
No ratings yet
Unit 4
35 pages
Deep Learning Important Questions For Ia 1
No ratings yet
Deep Learning Important Questions For Ia 1
11 pages
An Overview of Overfitting and Its Solutions
No ratings yet
An Overview of Overfitting and Its Solutions
7 pages
Regularization in Machine Learning
No ratings yet
Regularization in Machine Learning
5 pages
Generalization
No ratings yet
Generalization
10 pages
Generalization
No ratings yet
Generalization
10 pages
L09 - Regularisation
No ratings yet
L09 - Regularisation
79 pages
Stacked Generalization
No ratings yet
Stacked Generalization
58 pages
DL M2 Regularization
No ratings yet
DL M2 Regularization
12 pages
Achine Learning Egularization: Ntroduction
No ratings yet
Achine Learning Egularization: Ntroduction
10 pages
AI for Everyone: An Intermediate Guide to Artificial Intelligence
From Everand
AI for Everyone: An Intermediate Guide to Artificial Intelligence
Nova Clarke
No ratings yet
AIML Exp 7 1227
No ratings yet
AIML Exp 7 1227
3 pages
NNDL Notes
No ratings yet
NNDL Notes
73 pages
Regularization
No ratings yet
Regularization
46 pages
07 Regularization
No ratings yet
07 Regularization
7 pages
DL Chpter 3
No ratings yet
DL Chpter 3
8 pages
Lecture 05 - Regularization - 4p
No ratings yet
Lecture 05 - Regularization - 4p
21 pages
Regularization in Neural Networks: Sargur Srihari Srihari@buffalo - Edu
No ratings yet
Regularization in Neural Networks: Sargur Srihari Srihari@buffalo - Edu
31 pages
Domain Generalization by Marginal Transfer Learning: Gilles Blanchard
No ratings yet
Domain Generalization by Marginal Transfer Learning: Gilles Blanchard
55 pages
1.5 Regularization and Optimization
No ratings yet
1.5 Regularization and Optimization
17 pages
L11+ Regularization
No ratings yet
L11+ Regularization
24 pages
An Overview of Overfitting and Its Solutions
No ratings yet
An Overview of Overfitting and Its Solutions
7 pages
Mod 4
No ratings yet
Mod 4
65 pages
Generalize DL 2023
No ratings yet
Generalize DL 2023
28 pages
Foundations of Scheduling Algorithms: Definitive Reference for Developers and Engineers
From Everand
Foundations of Scheduling Algorithms: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Types of Regularization in Machine Learning - by Aqeel Anwar - Towards Data Science
No ratings yet
Types of Regularization in Machine Learning - by Aqeel Anwar - Towards Data Science
11 pages
Classification Problem: Feedforwardnet Patternnet Fitnet
No ratings yet
Classification Problem: Feedforwardnet Patternnet Fitnet
16 pages
L8 Ann
No ratings yet
L8 Ann
20 pages
Ratnn Si 2015 09 04
No ratings yet
Ratnn Si 2015 09 04
23 pages
4th Unit DL Final Class Notes
No ratings yet
4th Unit DL Final Class Notes
68 pages
07 Regularization
No ratings yet
07 Regularization
51 pages
An Overview of Overfitting and Its Solutions
No ratings yet
An Overview of Overfitting and Its Solutions
7 pages
4 MachineLearningForCV
No ratings yet
4 MachineLearningForCV
73 pages
Well Tuned Simple Nets
No ratings yet
Well Tuned Simple Nets
23 pages
Domain Generalization by Marginal Transfer Learning: Gilles Blanchard
No ratings yet
Domain Generalization by Marginal Transfer Learning: Gilles Blanchard
55 pages
Mod-1 Part-2
No ratings yet
Mod-1 Part-2
106 pages
What Is Regularization.
No ratings yet
What Is Regularization.
10 pages
Towards A Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't
No ratings yet
Towards A Mathematical Understanding of Neural Network-Based Machine Learning: What We Know and What We Don't
56 pages
Deep Learning Basics Lecture 4 Regularization II
No ratings yet
Deep Learning Basics Lecture 4 Regularization II
27 pages
Regularized Minimax Probability Machine - 2019 - Knowledge Based Systems
No ratings yet
Regularized Minimax Probability Machine - 2019 - Knowledge Based Systems
9 pages
Statement of Purpose: Jaweria Amjad
No ratings yet
Statement of Purpose: Jaweria Amjad
3 pages
UNIT-II Regularization in Deep Learning
No ratings yet
UNIT-II Regularization in Deep Learning
24 pages
Regularization in Deep Learning
No ratings yet
Regularization in Deep Learning
49 pages
Introduction To Machine Learning: The Problem of Overfitting
No ratings yet
Introduction To Machine Learning: The Problem of Overfitting
8 pages
Regularization 1704650055
No ratings yet
Regularization 1704650055
32 pages
Soft Q-Learning With Mutual Information Regularization
No ratings yet
Soft Q-Learning With Mutual Information Regularization
19 pages
FYP Final Report
No ratings yet
FYP Final Report
40 pages
Jain College, Jayanagar II PUC Mock Paper I 2018 Mathematics Duration: 3hr 15 Min Max - Marks: 100 Part A I. Answer All The Questions: 1 × 10 10
No ratings yet
Jain College, Jayanagar II PUC Mock Paper I 2018 Mathematics Duration: 3hr 15 Min Max - Marks: 100 Part A I. Answer All The Questions: 1 × 10 10
3 pages
Convolution Sum PDF
No ratings yet
Convolution Sum PDF
17 pages
ME 2019 New Question Format Chem 2
No ratings yet
ME 2019 New Question Format Chem 2
4 pages
Mathematics W 21
100% (1)
Mathematics W 21
25 pages
(Metrology and Measurement Systems) Algorithms and Methods For Analysis of The Optical Structure Factor of Fractal Aggregates
No ratings yet
(Metrology and Measurement Systems) Algorithms and Methods For Analysis of The Optical Structure Factor of Fractal Aggregates
12 pages
Department of Education: General Mathematics Weekly Home Learning Plan
No ratings yet
Department of Education: General Mathematics Weekly Home Learning Plan
3 pages
Term Project
No ratings yet
Term Project
8 pages
Grade 3
No ratings yet
Grade 3
2 pages
G2 Bayesian Analysis
No ratings yet
G2 Bayesian Analysis
4 pages
NCERT Grade 09 Mathematics Introduction-To-Euclids-Geometry
No ratings yet
NCERT Grade 09 Mathematics Introduction-To-Euclids-Geometry
8 pages
A Brief History of Feedback Control
No ratings yet
A Brief History of Feedback Control
20 pages
Optimum Power Flow Analysis by Newton Raphson Method, A Case Study
No ratings yet
Optimum Power Flow Analysis by Newton Raphson Method, A Case Study
9 pages
Quadratic Equation - Arjuna Jee 2.0 2025
No ratings yet
Quadratic Equation - Arjuna Jee 2.0 2025
15 pages
EViews Guide
100% (1)
EViews Guide
14 pages
Chapter 1 - Stress and Strain
No ratings yet
Chapter 1 - Stress and Strain
49 pages
1st Year Honours Syllabus Statistics Physics
No ratings yet
1st Year Honours Syllabus Statistics Physics
16 pages
Profitability Analysis
No ratings yet
Profitability Analysis
24 pages
Time Complexity: Dr. Zahid Halim
No ratings yet
Time Complexity: Dr. Zahid Halim
32 pages
Get Signals and Systems Principles and Applications 1st Edition Shaila Dinkar Apte Free All Chapters
No ratings yet
Get Signals and Systems Principles and Applications 1st Edition Shaila Dinkar Apte Free All Chapters
55 pages
Second Moment of Area
No ratings yet
Second Moment of Area
4 pages
Chapter1 Existence and Uniqueness Theorems
No ratings yet
Chapter1 Existence and Uniqueness Theorems
13 pages
Optimization of Line Losses Using Series Compensation: Vaibhav V. Gholase Sudhir A. Gadekar
No ratings yet
Optimization of Line Losses Using Series Compensation: Vaibhav V. Gholase Sudhir A. Gadekar
29 pages
June 2016 Paper
No ratings yet
June 2016 Paper
20 pages
Day 3 Solutions
100% (1)
Day 3 Solutions
5 pages
Diploma Strength of Materials 1st Unit
No ratings yet
Diploma Strength of Materials 1st Unit
70 pages
Example Think Aloud Script
No ratings yet
Example Think Aloud Script
1 page
Badmephisto's Speedcubing Guide First 2 Layers: Arranged by Andy Klise
No ratings yet
Badmephisto's Speedcubing Guide First 2 Layers: Arranged by Andy Klise
3 pages

ML Paper

Uploaded by

ML Paper

Uploaded by

Information Fusion 80 (2022) 146–166

Contents lists available at ScienceDirect

A comprehensive survey on regularization strategies in machine learning

ARTICLE INFO ABSTRACT

Training a large-scale model is a challenging problem. A model

• Provide a thorough review of the strategies for overcoming over-

The rest of this paper is organized as follows. Section 2 provides the

Let 𝑨 is the design(deterministic) matrix, 𝒙 is the unknown regres-

min 𝒚 = 𝑨𝒙 s.t. ‖𝒙‖0 ≤ 𝑘 (4)

called the 𝑘-sparse approximation problem.

Firm Penalty The firm penalty is formulated as [21]

𝑃 (𝑥) = 𝜆 log (1 + |𝑥| ∕𝛾) , (21)

the rational penalty(Rat) [23,24]

and the exponential penalty (Exp) [26]

⎧ 𝜆|𝑥|, if |𝑥| < 𝜆

⎧ 𝛾𝜆2 | | min 𝐹 (𝑳) + 𝜆𝑀(𝑺)

2.4. Manifold regularization

norm regularization for image restoration with biomedical applications.

Fig. 9. Examples of image data augmentation methods.

It is worth noting that data augmentation approaches should take

3.1.2. Feature space augmentation

Shake-Shake regularization, which applies data augmentation tech-

The tested 𝛼 was 1, 2, 3, 4 and 5.

3.3.2. Low Progress (LP)

3.3.1. Loss of Generality (GL) 3.3.4. Consecutive loss in generality (UP)

3.3.5. Steady-State Stop Training Trigger (SSSTT)

3.3.6. High Noise Ratio (HNR)

𝐻𝑁𝑅𝛼 ∶ 𝐻𝑁𝑅20 (𝑡) > 𝛼 (91) 𝑥(𝓁−1) − 𝜇𝐵

3.3.7. Others 𝑥(𝓁) (𝓁) (𝓁−1)

3.4.2. Instance Normalization Table 10

You might also like