Pid Ieee
Abstract—Deep neural networks (DNNs) are widely used and have demonstrated their power in many applications, such as computer vision and pattern recognition. However, training these networks can be time-consuming. This problem can be alleviated by using efficient optimizers. As one of the most commonly used optimizers, SGD-Momentum uses past and present gradients for parameter updates. However, during network training, SGD-Momentum may encounter some drawbacks, such as the overshoot phenomenon, which slows the convergence of training. To alleviate this problem and accelerate the convergence of DNN optimization, we propose a proportional-integral-derivative (PID) approach. Specifically, we first investigate the intrinsic relationships between the PID based controller and SGD-Momentum. We then propose a PID based optimization algorithm to update the network parameters, where the past, the current, and the change of gradients are exploited. Consequently, our proposed PID based optimization alleviates the overshoot problem suffered by SGD-Momentum. When tested on popular DNN architectures, it also obtains up to 50% acceleration with competitive accuracy. Extensive experiments on computer vision and natural language processing demonstrate the effectiveness of our method on benchmark datasets, including CIFAR10, CIFAR100, Tiny-ImageNet, and PTB. We have released the code at https://github.com/tensorboy/PIDOptimizer.

Index Terms—Deep neural network, optimization, PID control, SGD-Momentum.

This work is partially supported by the NSFC fund (61571259, 61831014, 61531014), in part by the Shenzhen Science and Technology Project under Grant (GGFW2017040714161462, JCYJ20170307153051701). (Corresponding author: Y. Zhang, Email: [email protected].)
H. Wang, Y. Luo, W. An, and Y. Zhang are with the Graduate School at Shenzhen, Tsinghua University, and also with Shenzhen Institute of Future Media Technology, Shenzhen 518055, China. E-mail: wang-[email protected], [email protected], [email protected], [email protected].
Q. Sun is with Department of Mathematics, Stanford University, Stanford, CA 94305. E-mail: [email protected].
J. Xu is with College of Computer Science, Nankai University, Tianjin 300071, China. E-mail: [email protected].
Y. Zhang is with Department of ECE, Northeastern University, Boston, MA 02115. E-mail: [email protected].
L. Zhang is with Department of Computing, the Hong Kong Polytechnic University, Hong Kong, and also with the Artificial Intelligence Center, Alibaba DAMO Academy. Email: [email protected].

I. INTRODUCTION

Benefitting from the availability of large amounts of data (e.g., ImageNet [1]) and the fast-growing power of GPUs, deep neural networks (DNNs) have succeeded in a wide range of applications, such as computer vision and natural language processing. Despite these significant successes, the training and inference of deep and wide DNNs are often computationally expensive and may take several days or longer even with powerful GPUs. Many stochastic optimization algorithms are used not only in the field of machine learning [2], but also in deep learning [3]. It is very important to explore how to boost the speed of training DNNs while maintaining performance. Furthermore, with a better optimization method, even computation-limited hardware (e.g., an IoT device) can save a lot of time and memory. Methods for reducing the computational time of DNNs can be divided into two parts: the speed-up of training and that of testing. The methods in [4]–[6], which aim to speed up the test process of DNNs, often focus not only on the decomposition of layers but also on the optimization solutions to the decomposition. Besides, there have been other streams of work on improving the testing performance of DNNs, such as FFT-based algorithms [7] and reduced parameters in deep nets [8]. As for the methods to speed up the training of DNNs, the key factor is the way the millions of parameters of a DNN are updated. This process mainly depends on the optimizer, and the choice of optimizer is also a key point of a model. Even with the same dataset and architecture, different optimizers could result in very different training effects: because they follow different gradient descent directions, different optimizers may reach completely different local minima [9].

The learning rate is another principal hyper-parameter for DNN training [10]. Based on different strategies of choosing learning rates, DNN optimizers can be categorized into two groups: 1. Hand-tuned learning rate optimizers, such as stochastic gradient descent (SGD) [11], SGD Momentum [12], and Nesterov's Momentum [12]. 2. Auto learning rate optimizers, such as AdaGrad [13], RMSProp [14], and Adam [15].

The SGD-Momentum method takes past and current gradients into consideration and then updates the network parameters. Although SGD-Momentum performs well in most cases, it may encounter the overshoot phenomenon [16], in which the weight exceeds its target value by too much and fails to correct its update direction. Such an overshoot problem costs more resources (e.g., time and GPUs) to train a DNN and also hampers the convergence of SGD-Momentum. So, a more efficient DNN optimizer is desired to alleviate the overshoot problem and achieve better convergence.

The similarity between optimization algorithms popularly employed in DNN training and classic control methods has been investigated in [17]. In automatic control systems, feedback control is essential. The proportional-integral-derivative (PID) controller is the most widely used feedback control mechanism, due to its simplicity and functionality [18]. Most industrial control systems are based on PID [19], such as unmanned aerial vehicles [20], robotics [21], and autonomous
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. XX, SEPTEMBER 2019 2
vehicles [22]. PID control takes the current error, the change in error (differentiation of the error over time), and the past cumulative error (integral of the error over time) into account, so that the difference between the current and expected outputs is minimized.

On the other hand, few studies have been done on the connections between PID and DNN optimization. In this work, we investigate these relationships analytically and mathematically. We first clarify the intrinsic connection between the PID controller and stochastic optimization methods, including SGD, SGD-Momentum, and Nesterov's Momentum. We then propose a PID based optimization method for DNN training. Similar to SGD-Momentum, our proposed PID optimizer also considers the past and current gradients for network update. The Laplace Transform [23] is further introduced for hyper-parameter initialization, which makes our method simple yet effective. The major contributions of this work can be summarized in three folds:

• By combining the error calculation in the feedback control system with the update of network parameters, we reveal a potential relationship between DNN optimization and feedback system control. We also find that some optimizers (e.g., SGD-Momentum) are special cases of the PID control device.
• We propose a PID based DNN optimization approach by taking the past, current, and changing information of the gradient into consideration. The hyper-parameter in our PID optimizer is initialized by the classical Laplace Transform.
• We systematically experiment with our proposed PID optimizer on the CIFAR10, CIFAR100, Tiny-ImageNet, and PTB datasets. The results show that the PID optimizer is faster than SGD-Momentum in the DNN training process.

A preliminary version of this work was presented as a conference version [24]. In the current work, we incorporate additional contents in significant ways:

• We evaluate the performance of our PID optimizer on the language modeling application by utilizing the character-level Penn Treebank (PTB-c) dataset with an LSTM network.
• The proposed PID optimizer is applied to a GAN on the MNIST dataset, and we show the digit images generated with each optimizer separately to illustrate that our method is also applicable to GANs.
• We update the conclusion that the proposed PID optimizer exceeds SGD-Momentum in GANs and RNNs.

We organize the rest of this paper as follows. Section II briefly surveys related works. Section III investigates the relationship between the PID controller and DNN optimization algorithms. Section IV introduces the proposed PID approach for DNN optimization. Experimental results and detailed analysis are reported in Section V. Section VI concludes this paper.

II. RELATED WORKS

A. Classic Deep Neural Network Architectures

CNN. Convolutional neural networks (CNNs) [25] have recently achieved great success in visual recognition tasks, including image classification [26], object detection [27]–[29], and scene parsing [30]. Recently, many deep CNN architectures, such as VGG, ResNet, and DenseNet, have been proposed to improve the performance on the tasks mentioned above. Network depth tends to improve network performance. However, the computational cost of these deep networks also increases significantly. Moreover, real-world systems may be affected by the high cost of these networks.

GAN. Goodfellow et al. first proposed the generative adversarial network (GAN) [31], which consists of generative and adversarial networks. The generator tries to produce very realistic outputs to fool the discriminator, which is optimized to distinguish between the real data and the generated outputs. GANs are trained to generate synthetic data that mimics the genuine data distribution.

In machine learning, models can be classified into two categories: generative models and discriminative models. A discriminative network (denoted as D) can discriminate between two (or more) different classes of data, such as a CNN trained for image classification. A generative network (denoted as G) can generate new data that fit the distribution of the training data. For example, a trained Gaussian Mixture Model is able to generate new random data that more or less fit the distribution of the training data.

GANs pose a challenging optimization problem due to the multiple loss functions, which must be optimized simultaneously. The optimization of a GAN is conducted in two steps: 1) optimize the discriminative network while fixing the generative one; 2) optimize the generative network while fixing the discriminative network. Here, fixing a network means only allowing the network to pass forward and not perform back-propagation. These two steps are updated alternately and depend on each other for efficient optimization. After enough training cycles, the optimization objective V(D, G) introduced in [31] will reach the situation where the probability distribution of the generator exactly matches the true probability distribution of the training data. Meanwhile, the discriminator has the capability to distinguish the realistic data from the generated ones. However, the cooperation between the generator and discriminator will fail occasionally. The whole system will then reach the status of “model collapse”, indicating that the discriminator and the generator tend to produce the same outputs.

LSTM. Hochreiter et al. first proposed the Long Short-Term Memory network, generally called LSTM, to obtain long-term dependency information from the network. As a type of recurrent neural network (RNN), LSTM has been widely used and has obtained excellent success in many applications. LSTM is deliberately designed to avoid long-term dependency problems. Remembering long-term information is the default behavior of LSTM in practice, rather than an ability acquired at great cost. All RNNs have a chained form of repeating network modules. In the standard RNN, this repeating module
often has a simple structure (e.g., a “tanh” layer). The outputs of all LSTM cells are utilized to construct a new feature, on which multinomial logistic regression is applied to form the LSTM model.

One widely used way to evaluate RNN models is the adding task [32], [33], which takes two sequences of length T as input. The first sequence is formed by sampling uniformly in the range (0, 1). For the second sequence, we set two entries to 1 and the rest to 0. The output is obtained by adding two entries of the first sequence, whose positions are determined by the two entries of 1 in the second sequence.

B. Accelerating the Training/Test Process of DNNs

Training process acceleration. Since DNNs are mostly computationally intensive, Han et al. [34] proposed a deep compression method to reduce the storage requirement of DNNs by 35× to 49× without affecting the accuracy. Moreover, the compressed model has 3× to 4× layer-wise speedup and 3× to 7× better energy efficiency. Unimportant connections are pruned, and weight sharing and Huffman coding are applied to quantize the network. This work mainly attempts to reduce the number of parameters of neural networks. Liu et al. proposed the network slimming technique that can simultaneously reduce the model size, running-time memory, and computing operations [35]. Yang et al. proposed a new filter pruning strategy based on the geometric median to accelerate the training of deep CNNs [36]. Dai et al. proposed a synthesis tool to synthesize compact yet accurate DNNs [37]. Du et al. proposed a Continuous Growth and Pruning (CGaP) scheme to minimize the redundancy from the beginning [38]. Hubara et al. introduced a method to train Quantized Neural Networks that reduce memory size and accesses during the forward pass [39]. In [40], Han et al. presented an intuitive and easier-to-tune version of ASGD (please refer to Section IV) and showed that ASGD leads to significantly faster convergence, with comparable accuracy, than SGD, Heavy Ball, and Nesterov's Momentum [12].

Test process acceleration. Denton et al. [4] proposed a method that compresses all convolutional layers. This is achieved by applying a proper low-rank approximation and then updating the upper layers until the prediction result improves. Based on singular value decompositions (SVD), this process consists of numerous tensor decomposition operations and filter clustering approaches that make use of similarities among learned features. Jaderberg et al. [5] introduced an easy-to-implement method that can significantly speed up pretrained CNNs with minimal modifications to existing frameworks. There can be a small associated loss in performance, but this is tunable to a desired accuracy level. Zhang et al. [6] first proposed a response reconstruction method, which introduces nonlinear neurons and a low-rank constraint. Without the usage of SGD and based on the generalized singular value decomposition (GSVD), a solution is developed for this nonlinear problem. Li et al. presented a method to prune filters with relatively low weight magnitudes to produce CNNs with reduced computation costs without introducing irregular sparsity [41].

C. Deep Learning Optimization

In the training of DNNs [10], the learning rate is an essential hyper-parameter. DNN optimizers can be categorized into two groups based on different strategies of setting the learning rate: 1. Hand-tuned learning rate optimizers, such as stochastic gradient descent (SGD) [11], SGD Momentum [12], and Nesterov's Momentum [12]. 2. Auto learning rate optimizers, such as AdaGrad [13], RMSProp [14], and Adam [15]. Good results have been achieved on the CIFAR10, CIFAR100, ImageNet, PASCAL VOC, and MS COCO datasets. They were mostly obtained by residual neural networks [42]–[45] trained using SGD-Momentum. This work focuses on improving the first category of optimizers, which are introduced as follows.

Classical Momentum [25] is the first variant of gradient descent involving a momentum parameter. It accelerates gradient descent by accumulating a velocity vector in directions of persistent reduction in the objective across iterations.

Stochastic Gradient Descent (SGD) [11] is a widely used optimizer for DNN training. SGD is easy to apply, but its disadvantage is that it converges slowly and may oscillate at saddle points. Moreover, choosing the learning rate reasonably is a major difficulty of SGD.

SGD Momentum (SGD-M) [12] is an optimization method that considers momentum. Compared to the original gradient descent step, SGD-M introduces variables related to the previous step. This means that the parameter update direction is decided not only by the present gradient, but also by the previously accumulated descent direction. This allows the parameters to change little in directions where the gradient changes frequently. In contrast, SGD-M changes the parameters a lot in directions where the gradient changes slowly.

Nesterov's Momentum [12] is another momentum optimization algorithm, motivated by Nesterov's accelerated gradient method [46]. Momentum improves on the SGD algorithm so that each parameter update direction depends not only on the gradient at the current position, but also on the direction of the last parameter update. In other words, Nesterov's Momentum essentially uses second-order information of the objective (loss function), so it can accelerate the convergence better.

D. PID Controller

Traditionally, the PID controller has been used to control a feedback system [19] by exploiting the present, past, and future information of the prediction error. The theoretical basis of the PID controller was first proposed by Maxwell in 1868 in his seminal paper “On Governors” [47]. The mathematical formulation was given by Minorsky [48]. In recent years, several advanced control algorithms have been proposed. We define the difference between the actual output and the desired output as the error e(t). The PID controller calculates the error e(t) in every step t, and then applies a correction u(t) to the system as a function of the proportional (P), integral (I), and derivative (D) terms of e(t). Mathematically, the PID controller can be described as

$$u(t) = K_p e(t) + K_i \int_0^t e(t)\,dt + K_d \frac{d}{dt} e(t), \tag{1}$$

where K_p, K_i, and K_d correspond to the gain coefficients of the P, I, and D terms, respectively. The error e(t) plays the same role as the gradient in deep learning optimization. The coefficients K_p, K_i, and K_d reflect the contributions of the current, past, and future errors, respectively, to the current correction.

According to our analyses, we find that PID control techniques can be useful for the optimization of deep networks. The study presented in this paper is one of the first investigations to apply PID as a new optimizer in the deep learning field. Our studies demonstrate significant advantages of the proposed optimizer. By inheriting the advantages of the PID controller, the proposed optimizer performs well despite its simplicity.

[Figure 1: top panel, feedback control loop (Desired Value, Error, PID Controller, Control Devices, System, Output, Feedback, Update); bottom panel, deep model training (inputs x0 to x3, weights θ = {w_ij, w_jk, w_kl}, Output, Loss L, Desired Value (Label), Backpropagation of ∂L/∂w_ij, ∂L/∂w_jk, ∂L/∂w_kl, SGD-Momentum Update).]
Fig. 1. Illustrations about the relationships between control system and deep model training. It also shows the connection between PID controller and SGD-Momentum.

III. PID AND DEEP NEURAL NETWORK OPTIMIZATION

We reveal the intrinsic relation between PID control and DNN optimization. This relation inspires us to explore new DNN optimization methods. The core idea of this section is to regard the parameter update in the DNN training process as using a PID controller in the system to reach an equilibrium.

A. Overall Connections

First, we summarize the training process of deep learning. Deep neural networks (DNNs) need to map the input x to the output y through parameters θ. To measure the gap between the DNN output and the desired output, the loss function L is introduced. Given some training data, we can calculate the loss function L(θ, X_train). In order to minimize the loss function L, we compute the derivative of L with respect to the parameter θ and, in most cases, update θ with the gradient descent method. DNNs gradually learn the complex relationship between input x and output y by constantly updating the parameters θ, which is called the DNN's training. The updating of θ is driven by the gradient of the loss function until it converges.

The purpose of an automated control system, in turn, is to evaluate the system status and drive it to the desired status through a controller. In a feedback control system, the controller's action is affected by the system's output. The error e(t) between the measured system status and the desired status is taken into consideration, so that the controller can bring the system close to the desired status.

More specifically, as shown in Eq. (1), the PID controller estimates a control variable u(t) by considering the current, past, and future (derivative) of the error e(t).

From here we can see that the error in the PID control system is related to the gradient in the deep neural network training process. The update of parameters during DNN training is analogous to the adjustment of the system by the PID controller.
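The controller of Eq. (1) can be illustrated with a small discrete-time sketch. This is only an assumed discretization, not something given in the text: the integral is approximated by a running sum, the derivative by a backward difference, and the function name `pid_control` and the sampling step `dt` are hypothetical.

```python
def pid_control(errors, kp, ki, kd, dt=1.0):
    """Discrete sketch of Eq. (1): u = Kp*e + Ki*integral(e) + Kd*de/dt.

    Assumptions (not from the paper): rectangular-rule integral and
    backward-difference derivative, with the first derivative set to 0.
    """
    outputs = []
    integral = 0.0   # running approximation of the integral of e(t)
    prev = None      # previous error sample, used for the derivative
    for e in errors:
        integral += e * dt
        derivative = 0.0 if prev is None else (e - prev) / dt
        outputs.append(kp * e + ki * integral + kd * derivative)
        prev = e
    return outputs


if __name__ == "__main__":
    # A shrinking error sequence; the D term reacts to the change in error.
    print(pid_control([1.0, 0.5, 0.25], kp=2.0, ki=1.0, kd=0.5))
```

With `ki = kd = 0` the correction reduces to `kp * e`, which is the proportional-only case that the following subsections relate to plain SGD.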
As can be seen from the discussion above, there is a high similarity between DNN optimization and the PID based control system. Fig. 1 shows their respective flowcharts, from which the similarity can be seen more intuitively. Both of them change the system/network based on the difference between the output and the target. The negative feedback process in the PID controller is similar to back-propagation in DNN optimization. One key difference is that the PID controller computes the update utilizing the system error e(t), whereas the DNN optimizer decides the updates by considering the gradient ∂L/∂θ. Let us regard the gradient ∂L/∂θ as the incarnation of the error e(t). Then, the PID controller can be fully related to DNN optimization. Next, we show that SGD, SGD-Momentum, and Nesterov's Momentum are all special cases of the PID controller.

B. Stochastic Gradient Descent (SGD)

In DNN training, there are widely used optimizers, such as SGD and its variants. The parameter update rule of SGD from iteration t to t + 1 is

$$\theta_{t+1} = \theta_t - r\,\partial L_t/\partial\theta_t, \tag{2}$$

where r is the learning rate. We now regard the gradient ∂L_t/∂θ_t as the error e(t) in the PID control system. Comparing with the PID controller in Eq. (1), we find that SGD can be viewed as one type of P controller with K_p = r.

C. SGD-Momentum

SGD-Momentum trains a DNN faster than SGD because it can use the history of gradients. The parameter update rule of SGD-M is given by

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\theta_t \\ \theta_{t+1} = \theta_t + V_{t+1}, \end{cases} \tag{3}$$

where V_t is a term that accumulates historical gradients and α ∈ (0, 1) is the factor that balances the past and current gradients. It is usually set to 0.9 [49]. Dividing both sides of the first formula of Eq. (3) by α^{t+1} gives

$$\frac{V_{t+1}}{\alpha^{t+1}} = \frac{V_t}{\alpha^{t}} - r\,\frac{\partial L_t/\partial\theta_t}{\alpha^{t+1}}. \tag{4}$$

By applying Eq. (4) from time t + 1 to 1, we have

$$\begin{aligned}
\frac{V_{t+1}}{\alpha^{t+1}} - \frac{V_t}{\alpha^{t}} &= -r\,\frac{\partial L_t/\partial\theta_t}{\alpha^{t+1}} \\
\frac{V_{t}}{\alpha^{t}} - \frac{V_{t-1}}{\alpha^{t-1}} &= -r\,\frac{\partial L_{t-1}/\partial\theta_{t-1}}{\alpha^{t}} \\
&\;\;\vdots \\
\frac{V_{1}}{\alpha^{1}} - \frac{V_{0}}{\alpha^{0}} &= -r\,\frac{\partial L_0/\partial\theta_0}{\alpha^{1}}.
\end{aligned} \tag{5}$$

By adding the aforementioned equations together, we get

$$\frac{V_{t+1}}{\alpha^{t+1}} = \frac{V_0}{\alpha^{0}} - r \sum_{i=0}^{t} \frac{\partial L_i/\partial\theta_i}{\alpha^{i+1}}. \tag{6}$$

To make it more general, we set the initial condition V_0 = 0, and thus the above equation can be simplified as follows

$$V_{t+1} = -r \sum_{i=0}^{t} \alpha^{t-i}\,\partial L_i/\partial\theta_i. \tag{7}$$

Substituting V_{t+1} into the second formula of Eq. (3), we have

$$\theta_{t+1} - \theta_t = -r\,\frac{\partial L_t}{\partial\theta_t} - r \sum_{i=0}^{t-1} \alpha^{t-i}\,\partial L_i/\partial\theta_i. \tag{8}$$

We can see that the parameter update process considers both the current gradient (P control) and the integral of past gradients (I control). If we assume α = 1, we get the following equation

$$\theta_{t+1} - \theta_t = -r\,\partial L_t/\partial\theta_t - r \sum_{i=0}^{t-1} \partial L_i/\partial\theta_i. \tag{9}$$

Comparing Eq. (9) with Eq. (1), we can see that SGD-Momentum is a PI controller with K_p = r and K_i = r. By using some mathematical skill [50], we simplify Eq. (3) by removing V_t. Then, the update can be rewritten as

$$\theta_{t+1} = \theta_t - r\,\partial L_t/\partial\theta_t - r \sum_{i=0}^{t-1} \alpha^{t-i}\,\partial L_i/\partial\theta_i. \tag{10}$$

We can clearly see that the network parameter update depends on both the current gradient r∂L_t/∂θ_t and the integral of past gradients r Σ_{i=0}^{t−1} α^{t−i} ∂L_i/∂θ_i. It should be noted that the I term includes a decay factor α. Due to the huge number of training data, it is better to calculate the gradient based on a mini-batch of training data, so the gradients behave in a stochastic manner. The decay term α is introduced to down-weight gradients that are far from the current iteration, which alleviates noise. In all, based on these analyses, we can view SGD-Momentum as a PI controller.

D. Nesterov's Momentum

Nesterov's Momentum improves on the SGD algorithm and considers second-order information of the objective (loss function), so it can accelerate the convergence better. We set the update rule as

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial(\theta_t + \alpha V_t) \\ \theta_{t+1} = \theta_t + V_{t+1}. \end{cases} \tag{11}$$

By using the variable transform θ̂_t = θ_t + αV_t and formulating the update rule with respect to θ̂, we have

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\hat{\theta}_t \\ \hat{\theta}_{t+1} = \hat{\theta}_t + (1+\alpha)V_{t+1} - \alpha V_t. \end{cases} \tag{12}$$

Similar to the derivation in Eqs. (4)-(6) for SGD-Momentum, we have

$$V_{t+1} = -r \sum_{i=1}^{t} \alpha^{t-i}\,\partial L_i/\partial\hat{\theta}_i. \tag{13}$$

With Eq. (13), Eq. (11) can be rewritten as

$$\hat{\theta}_{t+1} - \hat{\theta}_t = -r(1+\alpha)\,\partial L_t/\partial\hat{\theta}_t - \alpha r \sum_{i=1}^{t-1} \alpha^{t-i}\,\partial L_i/\partial\hat{\theta}_i. \tag{14}$$
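The PI interpretation of SGD-Momentum derived in Eqs. (3)-(10) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the one-dimensional loss L(θ) = θ²/2 and the function names are hypothetical, and V_0 = 0 is used as in Eq. (7). It runs the recursive form of Eq. (3) and the explicit sum of Eq. (10) side by side.

```python
def grad(theta):
    # Gradient of the hypothetical toy loss L(theta) = 0.5 * theta**2.
    return theta


def sgd_momentum(theta0, r, alpha, steps):
    """Recursive SGD-Momentum update, Eq. (3), with V_0 = 0."""
    theta, v = theta0, 0.0
    path = [theta]
    for _ in range(steps):
        v = alpha * v - r * grad(theta)
        theta = theta + v
        path.append(theta)
    return path


def pi_form(theta0, r, alpha, steps):
    """Explicit P + decayed-I form of the same update, Eq. (10)."""
    theta = theta0
    grads = []          # stores dL_i/dtheta_i for i = 0 .. t-1
    path = [theta]
    for t in range(steps):
        g_t = grad(theta)
        past = sum(alpha ** (t - i) * grads[i] for i in range(t))
        theta = theta - r * g_t - r * past
        grads.append(g_t)
        path.append(theta)
    return path


if __name__ == "__main__":
    a = sgd_momentum(1.0, r=0.1, alpha=0.9, steps=50)
    b = pi_form(1.0, r=0.1, alpha=0.9, steps=50)
    print(max(abs(x - y) for x, y in zip(a, b)))
```

The two trajectories should agree up to floating-point rounding, which is what the algebra of Eqs. (4)-(8) predicts. The explicit sum costs O(t) per step and is shown only to make the integral term visible; the recursion of Eq. (3) is the practical form.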
We can conclude that the network parameter update considers the current gradient (P control) and the integral of past gradients (I control). If we assume α = 1, then

$$\hat{\theta}_{t+1} - \hat{\theta}_t = -2r\,\partial L_t/\partial\hat{\theta}_t - r \sum_{i=0}^{t-1} \partial L_i/\partial\hat{\theta}_i. \tag{15}$$

Comparing Eq. (15) with Eq. (1), we can see that Nesterov's Momentum is a PI controller with K_p = 2r and K_i = r. Like SGD-Momentum, Nesterov's Momentum utilizes the current gradient and the integral of past gradients to update the network parameters, but it achieves a larger gain coefficient K_p.

IV. PID BASED DNN OPTIMIZATION

A. The Overshoot Problem of SGD-Momentum

We can learn from Eqs. (10) and (14) that the Momentum optimizer accumulates history gradients to accelerate training. On the other hand, the parameters may be updated along a wrong path if the history gradients lag behind the update of the parameters. According to the definition “the maximum peak value of the response curve measured from the desired response of the system” in discrete-time control systems [16], this phenomenon is named overshoot. Specifically, it can be written as

$$\text{Overshoot} = \frac{\theta_{\max} - \theta^*}{\theta^*}, \tag{16}$$

where θ_max and θ* are the maximum and optimum values of the weight, respectively.

We use De Jong's first function [51] as the test benchmark for the overshoot problem because it is smooth, unimodal, and symmetric. The function can be written as

$$f(x) = 0.1x_1^2 + 2x_2^2, \tag{17}$$

whose search domain is −10 ≤ x_i ≤ 10, i = 1, 2. For this function, x* = (0, 0) and f(x*) = 0, so we can pursue a global minimum rather than a local one.

To build a simple PID optimizer, we introduce a derivative term of the gradient based on SGD-Momentum

$$\text{PID} = \text{Momentum} + K_d\left(\partial f(x)/\partial x_c - \partial f(x)/\partial x_{c-1}\right), \tag{18}$$

where c is the present iteration index for x. With different choices of K_d in Eq. (18), we show the simulation results in Fig. 2, where the loss-contour map is shown as the background. The redder the color, the larger the loss function value; the bluer the color, the smaller the loss function value. The x-axis and y-axis denote x_1 and x_2, respectively. Both x_1 and x_2 are initialized to −10. We use red and yellow lines to show the paths of PID and SGD-Momentum, respectively. It is obvious that the SGD-Momentum optimizer suffers from the overshoot problem. By increasing K_d gradually (0.1, 0.5, and 0.93, respectively), our PID optimizer uses more “future” error, so that it can largely alleviate the overshoot problem.

B. PID Optimizer for DNN

Motivated by the simple example in Section IV-A, we seek a PID optimizer to boost the convergence of DNN training. From Eq. (10), SGD-Momentum can be viewed as a PI controller, which uses current and past gradient information. Fig. 2 shows that the PID controller introduces a derivative term of the gradient to use the future information, so that the overshoot problem can be clearly alleviated. On the other hand, it is very easy to introduce noise when computing gradients, because the training is often conducted in a mini-batch manner. We therefore estimate a moving average of the derivative part. Our proposed PID optimizer updates the network parameter θ in iteration (t + 1) by

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\theta_t \\ D_{t+1} = \alpha D_t + (1-\alpha)\left(\partial L_t/\partial\theta_t - \partial L_{t-1}/\partial\theta_{t-1}\right) \\ \theta_{t+1} = \theta_t + V_{t+1} + K_d D_{t+1}. \end{cases} \tag{19}$$

We can see from Eq. (19) that a hyper-parameter K_d is introduced in the proposed PID optimizer. We initialize K_d by introducing Laplace Transform [23] theory and the Ziegler-Nichols [52] tuning method.

C. Initialization of Hyper-parameter K_d

The Laplace Transform converts a function of a real variable t to a function of a complex variable s. The most common usage is to convert time to frequency. Denote the Laplace transformation of f(t) as F(s). Then

$$F(s) = \int_0^{\infty} e^{-st} f(t)\,dt, \quad \text{for } s > 0. \tag{20}$$

In general, it is easier to solve for F(s) than for f(t), which can be reconstructed from F(s) with the Inverse Laplace transform

$$f(t) = \frac{1}{2\pi i} \lim_{T \to \infty} \int_{\gamma - iT}^{\gamma + iT} e^{st} F(s)\,ds, \tag{21}$$

where i is the imaginary unit and γ is a real number. By using the Laplace Transform, we can first transform our PID optimizer into its Laplace transformed functions of s, and then simplify the algebra. After obtaining the transformation F(s), we can obtain the desired solution f(t) with the inverse transform.

We initialize a parameter of a node in the DNN model as a scalar θ_0. After enough updates, the optimal value θ* can be obtained. We simplify the parameter update in DNN optimization as a single step response (from θ_0 to θ*) in a control system. We introduce the Laplace Transform to set K_d and denote the time domain change of the weight θ as θ(t).

The Laplace Transform of θ* is θ*/s [53]. We denote by θ(t) the weight at iteration t. The Laplace Transform of θ(t) is denoted as θ(s), and that of the error e(t) as E(s),

$$E(s) = \frac{\theta^*}{s} - \theta(s).$$

The Laplace transform of PID [53] is

$$U(s) = \left(K_p + K_i \frac{1}{s} + K_d s\right) E(s). \tag{22}$$
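The update rule of Eq. (19) can be sketched on the benchmark function of Eq. (17). This is a minimal illustration, not the released implementation: the function names are hypothetical, V_0 = D_0 = 0 is assumed, and the "previous" gradient is initialized to the first gradient so that the first difference is zero. The sign of the K_d term follows Eq. (19) as printed; since `kd` is exposed as a parameter, other values or signs can be tried.

```python
def grad_f(x):
    # Gradient of f(x) = 0.1*x1^2 + 2*x2^2 from Eq. (17).
    return [0.2 * x[0], 4.0 * x[1]]


def pid_minimize(x0, r, alpha, kd, steps):
    """Iterate Eq. (19) on Eq. (17); kd = 0 recovers SGD-Momentum, Eq. (3)."""
    x = list(x0)
    v = [0.0, 0.0]        # V_t: accumulated (integral) term, V_0 = 0
    d = [0.0, 0.0]        # D_t: moving average of gradient differences
    g_prev = grad_f(x)    # assumption: makes the first difference zero
    for _ in range(steps):
        g = grad_f(x)
        for j in range(2):
            v[j] = alpha * v[j] - r * g[j]
            d[j] = alpha * d[j] + (1.0 - alpha) * (g[j] - g_prev[j])
            x[j] = x[j] + v[j] + kd * d[j]
        g_prev = g
    return x


if __name__ == "__main__":
    start = (-10.0, -10.0)  # initialization used for Fig. 2
    print(pid_minimize(start, r=0.1, alpha=0.9, kd=0.0, steps=500))
    print(pid_minimize(start, r=0.1, alpha=0.9, kd=0.5, steps=2))
```

The learning rate r = 0.1 and α = 0.9 are illustrative choices; the text does not state which values produced Fig. 2.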
Fig. 2. The overshoot problem of momentum with different values of Kd . The red and yellow lines indicate the results obtained by PID and SGD-Momentum
respectively.
convergence rate. It should be noted that the value of hyper- Training Loss Validation Loss
0.5
parameter Kd in calculating the derivate 0.25
0.4
K p +1 0.20
− 2K
e−ζ ωn = e d . (32) 0.3
0.15
0.2
Based on the above analyses, we know that the training of 0.1 0.10
DNN can be accelerated by using large derivate. But on the 0.05
0.0
other hand, if Kd is too large, the system will be fragile. After 0 5 10 15 0 5 10 15
Epoch Epoch
some experiments, we set the Kd based on the Ziegler-Nichols Training Acc. Validation Acc.
optimum setting rule [52]. 100.0
98
According to the Ziegler-Nichols0 rule, the ideal setup of 97.5 97
TABLE I
COMPARISONS BETWEEN PID AND SGD-MOMENTUM OPTIMIZERS IN TERMS OF TEST ERRORS AND TRAINING EPOCHS. WE REPORT THE RESULTS BASED ON CIFAR10 AND CIFAR100.

| Model | Depth-k | Params (M) | Runs | CIFAR10 (PID/SGD-M) | Epochs (PID/SGD-M) | CIFAR100 (PID/SGD-M) | Epochs (PID/SGD-M) |
|---|---|---|---|---|---|---|---|
| ResNet [42] | 110 | 1.7 | 5 | 6.23/6.43 | 239/281 | 24.95/25.16 | 237/293 |
| ResNet [42] | 1202 | 10.2 | 5 | 7.81/7.93 | 230/293 | 27.93/27.82 | 251/296 |
| PreActResNet [44] | 164 | 1.7 | 5 | 5.23/5.46 | 230/271 | 24.17/24.33 | 241/282 |
| ResNeXt29 [58] | 8-64 | 34.43 | 10 | 3.65/3.43 | 221/294 | 17.46/17.77 | 232/291 |
| ResNeXt29 [58] | 16-64 | 68.16 | 10 | 3.42/3.58 | 209/289 | 17.11/17.31 | 229/283 |
| WRN [45] | 16-8 | 11 | 10 | 4.42/4.81 | 213/290 | 21.93/22.07 | 229/283 |
| WRN [45] | 28-20 | 36.5 | 10 | 4.27/4.17 | 208/290 | 20.21/20.50 | 221/295 |
| DenseNet [43] | 100-12 | 0.8 | 10 | 3.83/4.30 | 196/291 | 19.97/20.20 | 213/294 |
| DenseNet [43] | 190-40 | 25.6 | 10 | 3.11/3.32 | 194/293 | 16.95/17.17 | 208/297 |
better at their objectives in turn, so that the generator can finally fool the most sophisticated discriminator. In practice, this method ends up with generative neural nets that are good at producing new data.

In the experiments, we use a deep convolutional generative adversarial network (DCGAN) to test our proposed PID optimizer. The discriminator of this DCGAN consists of 2 convolutional layers (with ReLU function and max pooling) and 2 fully-connected layers. The generator consists of a fully-connected layer (with batch normalization and ReLU function) and 3 convolutional layers. Binary cross entropy is used as the loss function. The learning rate is initialized to 0.0003 for all optimizers. The qualitative results of PID are shown in Fig. 8(b) and those of SGD-Momentum in Fig. 8(a). From Fig. 8, we find that the images generated with the PID optimizer are more realistic than those generated with the SGD-Momentum optimizer.

D. Results of RNNs

In this experiment, we employ a simple LSTM that has only 1 layer with 100 hidden units. Mean squared error (MSE) is used as the objective function for the adding problem. The initial learning rate is set to 0.002 for both the SGD-Momentum and PID optimizers. The learning rate is reduced by a factor of 10 every 20,000 training steps. We randomly generate all the training and testing data throughout the experiments. The results are shown in Fig. 9. The LSTM model trained with SGD-Momentum has trouble converging. However, our proposed PID optimizer reaches a small error with very fast convergence. This indicates that our proposed PID optimizer can effectively train LSTMs.

Fig. 9. Comparison between PID and SGD-Momentum for the Adding task of RNN. Top row: the curves of training and validation loss. The PID optimizer achieves lower training and validation losses than SGD-Momentum. Bottom row: the curves of training and validation accuracy. Our PID optimizer performs better in both training and test performance.

Results on PTB dataset. In this subsection, we evaluate our proposed PID optimizer on the character-level Penn Treebank (PTB-c) dataset. We follow experimental settings similar to those in [59]. Specifically, we apply frame-wise batch normalization [60] and set the batch size to 128. The learning rate is initially set to 0.0002 and is divided by 10 when the validation performance no longer improves. We also introduce dropout [61] with dropping probabilities of 0.25 and 0.3. There is no overlap between the sequences, whose length is set to T = 50 for both training and testing. We then train networks with the PID and SGD-Momentum optimizers. The results are shown in Fig. 10. Compared with SGD-Momentum, our proposed PID optimizer achieves better performance on the LSTM model.

E. Results of different Ki and Kd

We also perform an ablation study on the hyper-parameters of the PID controller. The experiments are run on the CIFAR10 dataset with DenseNet 100-12. The initial learning rate is 0.1, and it is reduced by a factor of 10 at epochs 150 and 225.
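The step-wise learning rate schedules mentioned in these experiments can be written down compactly. The sketch below is only an illustration under stated assumptions: the function names `milestone_lr` and `periodic_lr` are hypothetical, epochs and steps are assumed to be counted from 0, and each reduction is assumed to take effect from the named epoch or step onward. The source does not specify these details.

```python
def milestone_lr(epoch, base_lr=0.1, milestones=(150, 225), factor=10.0):
    """Learning rate after dividing by `factor` at each milestone epoch reached.

    Defaults mirror the ablation setting: start at 0.1, reduce at epochs 150 and 225.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / factor ** drops


def periodic_lr(step, base_lr=0.002, every=20_000, factor=10.0):
    """Learning rate divided by `factor` once per `every` training steps.

    Defaults mirror the adding-task setting: start at 0.002, reduce every 20,000 steps.
    """
    return base_lr / factor ** (step // every)


if __name__ == "__main__":
    for epoch in (0, 149, 150, 224, 225, 299):
        print(f"epoch {epoch}: lr = {milestone_lr(epoch):g}")
    for step in (0, 19_999, 20_000, 40_000):
        print(f"step {step}: lr = {periodic_lr(step):g}")
```

In a real training loop the returned value would be assigned to the optimizer's learning rate at the start of each epoch or step; most deep learning frameworks provide equivalent built-in schedulers.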
Fig. 10. Comparison between PID and SGD-Momentum to train LSTM on the PTB dataset. Top row: the curves of training and validation loss. The PID optimizer helps to achieve smaller training and validation losses. Bottom row: the curves of training and validation accuracy. The PID optimizer helps to achieve higher training and validation performance.

Fig. 11. Comparison among PID controllers with different Ki (legend: I = 50, 25, 10, 3, 1, 0.3) on the CIFAR10 dataset by using DenseNet 100-12. Kd is fixed to 10. Top row: the curves of training and validation loss. Bottom row: the curves of training and validation accuracy. Within a certain range, larger Ki achieves better validation accuracy.

The first group of experiments investigates the variation of training and validation statistics with Ki while Kd is fixed. Fig. 11 shows six PID controllers whose Kd is 10. In training, the performance of the controllers differs at an early stage, but eventually they reach the same level. In validation, the controller with Ki = 10 achieves the lowest loss and the highest validation accuracy. We also repeat this experiment with Kd = 10, 25, 50, and 100, respectively, and the results are highly similar to Fig. 11. One interesting phenomenon is that the larger Ki is, the more the performance is affected by the learning rate decay schedule.

Then we turn to Kd. The settings of the second group of experiments are kept the same as in the previous experiments, but Ki is fixed. Fig. 12 shows that their performance is highly consistent. It is also shown that the larger Kd is, the more unstable the validation performance. The reason may be that a large Kd leads to more change in the optimization path.

Fig. 12. Comparison among PID controllers with different Kd (legend: D = 100, 50, 25, 10) on the CIFAR10 dataset by using DenseNet 100-12. Ki is fixed to 3. Top row: the curves of training and validation loss. Bottom row: the curves of training and validation accuracy.

As can be seen from these experiments, Ki is more important than Kd in this specific task (CIFAR10 with DenseNet 100-12). Ki affects not only the speed of convergence but also the validation accuracy.

VI. CONCLUSION AND FUTURE WORK

Motivated by the outstanding performance of the proportional-integral-derivative (PID) controller in the field of automatic control, we reveal the connections between the PID controller and stochastic optimizers and their variants. Then we propose a new PID optimizer for deep neural network training. The proposed PID optimizer reduces the overshoot phenomenon of SGD-Momentum and accelerates the training process of DNNs by combining the present, the past, and the change information of gradients to update parameters. Our experiments on both image recognition tasks with the MNIST, CIFAR, and Tiny-ImageNet datasets and LSTM tasks with the PTB dataset validate that the proposed PID optimizer is 30% to 50% faster than SGD-Momentum, while obtaining a lower error rate. We will continue to study the relationship among the optimal hyper-parameters (Kp, Ki, and Kd) in specific tasks, and we will conduct more in-depth research for more general cases. We will also investigate how to associate the PID optimizer with an adaptive learning rate for DNN/RNN optimization in future work.

ACKNOWLEDGMENT

This work is partially supported by the NSFC fund (61571259, 61831014, 61531014), and in part by the Shenzhen Science and Technology Project under Grant (GGFW2017040714161462, JCYJ20170307153051701).
REFERENCES

[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” IEEE International Journal of Computer Vision (IJCV), 2015.
[2] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.
[3] J. Zhang, “Gradient descent based optimization algorithms for deep learning models training,” arXiv preprint arXiv:1903.03614, 2019.
[4] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in neural information processing systems, 2014, pp. 1269–1277.
[5] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” arXiv preprint arXiv:1405.3866, 2014.
[6] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
[7] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, “Fast convolutional nets with fbfft: A gpu performance evaluation,” arXiv preprint arXiv:1412.7580, 2014.
[8] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” International Conference on Learning Representations, 2015.
[9] D. J. Im, M. Tao, and K. Branson, “An empirical analysis of the optimization of deep network loss surfaces,” in International Conference for Learning Representations (ICLR), 2017.
[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[11] L. Bottou, “Online learning in neural networks,” D. Saad, Ed. Cambridge University Press, 1998, ch. Online Learning and Stochastic Approximations, pp. 9–42. [Online]. Available: http://dl.acm.org/citation.cfm?id=304710.304720
[12] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, 2013.
[13] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[14] G. Hinton, N. Srivastava, and K. Swersky, “Lecture 6a overview of mini–batch gradient descent.”
[15] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference for Learning Representations (ICLR), 2014.
[16] K. Ogata, Discrete-time control systems. Prentice Hall Englewood Cliffs, NJ, 1995, vol. 2.
[17] L. Lessard, B. Recht, and A. Packard, “Analysis and design of optimization algorithms via integral quadratic constraints,” SIAM Journal on Optimization, vol. 26, no. 1, pp. 57–95, 2016.
[18] L. Wang, T. J. D. Barnes, and W. R. Cluett, “New frequency-domain design method for pid controllers,” IEEE Control Theory and Applications, Jul 1995.
[19] K. Heong Ang, G. Chong, and Y. Li, “Pid control system analysis, design, and technology,” vol. 13, pp. 559–576, 08 2005.
[20] A. L. Salih, M. Moghavvemi, H. A. F. Mohamed, and K. S. Gaeid, “Modelling and pid controller design for a quadrotor unmanned air vehicle,” in IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), vol. 1, May 2010, pp. 1–5.
[21] P. Rocco, “Stability of pid control for industrial robot arms,” IEEE transactions on robotics and automation, 1996.
[22] P. Zhao, J. Chen, Y. Song, X. Tao, T. Xu, and T. Mei, “Design of a control system for an autonomous vehicle based on adaptive-pid,” International Journal of Advanced Robotic Systems, vol. 9, no. 2, p. 44, 2012. [Online]. Available: https://doi.org/10.5772/51314
[23] P. S. de Laplace, Théorie analytique des probabilités. Courcier, 1820, vol. 7.
[24] W. An, H. Wang, Q. Sun, J. Xu, Q. Dai, and L. Zhang, “A pid controller approach for stochastic optimization of deep networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[25] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
[26] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “Hcp: A flexible cnn framework for multi-label image classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1901–1907, Sept 2016.
[27] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, Jan 2016.
[28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, June 2017.
[29] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, K. Wang, J. Yan, C. C. Loy, and X. Tang, “Deepid-net: Object detection with deformable part based convolutional neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1320–1334, July 2017.
[30] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[31] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.
[32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[33] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent neural networks,” in Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 1120–1128. [Online]. Available: http://jmlr.org/proceedings/papers/v48/arjovsky16.html
[34] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in International Conference for Learning Representations (ICLR), 2015.
[35] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2736–2744.
[36] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for deep convolutional neural networks acceleration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4340–4349.
[37] X. Dai, H. Yin, and N. Jha, “Nest: A neural network synthesis tool based on a grow-and-prune paradigm,” IEEE Transactions on Computers, 2019.
[38] X. Du, Z. Li, and Y. Cao, “Cgap: Continuous growth and pruning for efficient deep learning,” arXiv preprint arXiv:1905.11533, 2019.
[39] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[40] R. Kidambi, P. Netrapalli, P. Jain, and S. M. Kakade, “On the insufficiency of existing momentum schemes for stochastic optimization,” in International Conference for Learning Representations (ICLR), 2018.
[41] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[43] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[44] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in IEEE European Conference on Computer Vision (ECCV), 2016.
[45] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in BMVC, 2016.
[46] Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” in Soviet Mathematics Doklady, 1983.
[47] J. C. Maxwell, “On governors,” Proceedings of the Royal Society of London, vol. 16, pp. 270–283, 1867.
[48] N. Minorsky, “Directional stability of automatically steered bodies,” Journal of ASNE, 1922.
[49] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999.
[50] M. R. Spiegel, Advanced mathematics. McGraw-Hill, Incorporated, 1991.
[51] K. De Jong, “An analysis of the behavior of a class of genetic adaptive systems,” Doctoral Dissertation, University of Michigan, 1975.
[52] J. G. Ziegler and N. B. Nichols, “Optimum settings for automatic controllers,” Trans. ASME, vol. 64, no. 11, 1942.
[53] G. E. Robert and H. Kaufman, Table of Laplace transforms. Saunders, 1966.
[54] H. K. Khalil, Nonlinear Systems. Prentice-Hall, New Jersey, 1996.
[55] Y. Le and X. Yang, “Tiny imagenet visual recognition challenge,” 2015.
[56] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[57] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
[58] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[59] T. Cooijmans, N. Ballas, C. Laurent, and A. C. Courville, “Recurrent batch normalization,” CoRR, vol. abs/1603.09025, 2016. [Online]. Available: http://arxiv.org/abs/1603.09025
[60] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, “Batch normalized recurrent neural networks,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pp. 2657–2661. [Online]. Available: https://doi.org/10.1109/ICASSP.2016.7472159
[61] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 1019–1027.

Haoqian Wang (M’13) received the B.S. and M.E. degrees from Heilongjiang University, Harbin, China, in 1999 and 2002, respectively, and the Ph.D. degree from the Harbin Institute of Technology, Harbin, in 2005. He was a Post-Doctoral Fellow with Tsinghua University, Beijing, China, from 2005 to 2007. He has been a Faculty Member with the Graduate School at Shenzhen, Tsinghua University, Shenzhen, China, since 2008, where he has also been an Associate Professor since 2011, and the director of Shenzhen Institute of Future Media Technology. His current research interests include generative adversarial networks, video communication and signal processing.

Qingyun Sun is currently working toward the Ph.D. degree in the Department of Mathematics, Stanford University. He received the B.S. degree from the School of Mathematical Sciences, Peking University, Beijing, China, in 2014. His research interests include mathematical foundation for artificial intelligence, data science, machine learning, algorithmic game theory, multi-agent decision making, optimization, and high dimensional statistics.

Jun Xu is an Assistant Professor in the College of Computer Science, Nankai University, Tianjin, China. Before that, he worked as a Research Scientist in the Inception Institute of Artificial Intelligence. He received the B.Sc. degree in pure mathematics and the M.Sc. degree in Information and Probability, both from the School of Mathematics Science, Nankai University, Tianjin, China, in 2011 and 2014, respectively. He received the Ph.D. degree in 2018 from the Department of Computing, The Hong Kong Polytechnic University, supervised by Prof. David Zhang and Prof. Lei Zhang.

Yongbing Zhang received the B.A. degree in English and the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 2004, 2006, and 2010, respectively. He joined the Graduate School at Shenzhen, Tsinghua University, Shenzhen, China, in 2010, where he is currently an associate professor. He was the recipient of the Best Student Paper Award at the IEEE International Conference on Visual Communication and Image Processing in 2015. His current research interests include signal processing, computational imaging, and machine learning.

Yulun Zhang received the B.E. degree from the School of Electronic Engineering, Xidian University, China, in 2013 and the M.E. degree from the Department of Automation, Tsinghua University, China, in 2017. He is currently pursuing the Ph.D. degree with the Department of ECE, Northeastern University, USA. He was the recipient of the Best Student Paper Award at the IEEE International Conference on Visual Communication and Image Processing (VCIP) in 2015. He also won the Best Paper Award at the IEEE International Conference on Computer Vision (ICCV) RLQ Workshop in 2019. His research interests include image restoration and deep learning.