

PID Controller based Stochastic Optimization Acceleration for Deep Neural Networks

Haoqian Wang, Yi Luo, Wangpeng An, Qingyun Sun, Jun Xu, Yongbing Zhang, Yulun Zhang, and Lei Zhang, Fellow, IEEE

Abstract—Deep neural networks (DNNs) are widely used and have demonstrated their power in many applications, such as computer vision and pattern recognition. However, the training of these networks can be time-consuming. Such a problem could be alleviated by using efficient optimizers. As one of the most commonly used optimizers, SGD-Momentum uses past and present gradients for parameter updates. However, in the process of network training, SGD-Momentum may encounter some drawbacks, such as the overshoot phenomenon, which slows the training convergence. To alleviate this problem and accelerate the convergence of DNN optimization, we propose a proportional-integral-derivative (PID) approach. Specifically, we first investigate the intrinsic relationships between the PID controller and SGD-Momentum. We then propose a PID based optimization algorithm to update the network parameters, in which the past, current, and change of gradients are exploited. Consequently, our proposed PID based optimization alleviates the overshoot problem suffered by SGD-Momentum. When tested on popular DNN architectures, it also obtains up to 50% acceleration with competitive accuracy. Extensive experiments on computer vision and natural language processing demonstrate the effectiveness of our method on benchmark datasets, including CIFAR10, CIFAR100, Tiny-ImageNet, and PTB. We have released the code at https://github.com/tensorboy/PIDOptimizer.

Index Terms—Deep neural network, optimization, PID control, SGD-Momentum.

This work is partially supported by the NSFC fund (61571259, 61831014, 61531014), in part by the Shenzhen Science and Technology Project under Grant (GGFW2017040714161462, JCYJ20170307153051701). (Corresponding author: Y. Zhang, Email: [email protected].)
H. Wang, Y. Luo, W. An, and Y. Zhang are with the Graduate School at Shenzhen, Tsinghua University, and also with Shenzhen Institute of Future Media Technology, Shenzhen 518055, China. E-mail: [email protected], [email protected], [email protected], [email protected].
Q. Sun is with the Department of Mathematics, Stanford University, Stanford, CA 94305. E-mail: [email protected].
J. Xu is with the College of Computer Science, Nankai University, Tianjin 300071, China. E-mail: [email protected].
Y. Zhang is with the Department of ECE, Northeastern University, Boston, MA 02115. E-mail: [email protected].
L. Zhang is with the Department of Computing, the Hong Kong Polytechnic University, Hong Kong, and also with the Artificial Intelligence Center, Alibaba DAMO Academy. E-mail: [email protected].

I. INTRODUCTION

Benefiting from the availability of large amounts of data (e.g., ImageNet [1]) and the fast-growing power of GPUs, deep neural networks (DNNs) have succeeded in a wide range of applications, such as computer vision and natural language processing. Despite the significant successes of DNNs, the training and inference of deep and wide DNNs are often computationally expensive, which may take several days or longer even with powerful GPUs. Many stochastic optimization algorithms are used not only in the field of machine learning [2], but also in deep learning [3]. It is very important to explore how to boost the speed of training DNNs while maintaining performance. Furthermore, with a better optimization method, even computation-limited hardware (e.g., an IoT device) can save a lot of time and memory usage. The methods for accelerating DNN computation can be divided into two parts: the speed-up of training and that of testing. The methods in [4]–[6], aiming to speed up the test process of DNNs, often focus not only on the decomposition of layers but also on the optimization solutions to the decomposition. Besides, there have been other streams of work on improving the testing performance of DNNs, such as FFT-based algorithms [7] and reduced parameters in deep nets [8]. As for the methods to speed up the training of DNNs, the key factor is the way to update the millions of parameters of a DNN. This process mainly depends on the optimizer, and the choice of optimizer is also a key point of a model. Even with the same dataset and architecture, different optimizers could result in very different training effects; due to different directions of gradient descent, different optimizers may reach completely different local minima [9].

The learning rate is another principal hyper-parameter for DNN training [10]. Based on different strategies of choosing learning rates, DNN optimizers can be categorized into two groups: 1. Hand-tuned learning rate optimizers, such as stochastic gradient descent (SGD) [11], SGD-Momentum [12], and Nesterov's Momentum [12]; 2. Auto learning rate optimizers, such as AdaGrad [13], RMSProp [14], and Adam [15].

The SGD-Momentum method puts past and current gradients into consideration and then updates the network parameters. Although SGD-Momentum performs well in most cases, it may encounter the overshoot phenomenon [16], which indicates the case where the weight exceeds its target value too much and fails to correct its update direction. Such an overshoot problem costs more resources (e.g., time and GPUs) to train a DNN and also hampers the convergence of SGD-Momentum. So, a more efficient DNN optimizer is eagerly desired to alleviate the overshoot problem and achieve better convergence.

The similarity between the optimization algorithms popularly employed in DNN training and classic control methods has been investigated in [17]. In automatic control systems, feedback control is essential. The proportional-integral-derivative (PID) controller is the most widely used feedback control mechanism, due to its simplicity and functionality [18]. Most industrial control systems are based on PID [19], such as unmanned aerial vehicles [20], robotics [21], and autonomous
vehicles [22]. PID control takes the current error, the change in error (differentiation of the error over time), and the past cumulative error (integral of the error over time) into account, so that the difference between the current and expected outputs is minimized.

On the other hand, few studies have been done on the connections between PID and DNN optimization. In this work, we investigate specific relationships analytically and mathematically along this research line. We first clarify the intrinsic connection between the PID controller and stochastic optimization methods, including SGD, SGD-Momentum, and Nesterov's Momentum. Finally, we propose a PID based optimization method for DNN training. Similar to SGD-Momentum, our proposed PID optimizer also considers the past and current gradients for the network update. The Laplace Transform [23] is further introduced for hyper-parameter initialization, which makes our method simple yet effective. The major contributions of this work can be summarized in three folds:

• By combining the error calculation in the feedback control system with the update of network parameters, we reveal a potential relationship between DNN optimization and feedback system control. We also find that some optimizers (e.g., SGD-Momentum) are special cases of the PID control device.
• We propose a PID based DNN optimization approach by taking the past, current, and changing information of the gradient into consideration. The hyper-parameter in our PID optimizer is initialized by the classical Laplace Transform.
• We systematically experiment with our proposed PID optimizer on the CIFAR10, CIFAR100, Tiny-ImageNet, and PTB datasets. The results show that the PID optimizer is faster than SGD-Momentum in the DNN training process.

A preliminary version of this work was presented as a conference paper [24]. In the current work, we incorporate additional contents in significant ways:

• We evaluate the performance of our PID optimizer on the language modeling application by utilizing the character-level Penn Treebank (PTB-c) dataset with an LSTM network.
• The proposed PID optimizer is applied to a GAN on the MNIST dataset, and the digit images generated with PID and with SGD-Momentum are shown separately to illustrate that our method is also applicable to GANs.
• We update the conclusion that the proposed PID optimizer exceeds SGD-Momentum in GANs and RNNs.

We organize the rest of this paper as follows. Section II briefly surveys related works. Section III investigates the relationship between the PID controller and DNN optimization algorithms. Section IV introduces the proposed PID approach for DNN optimization. Experimental results and detailed analysis are reported in Section V. Section VI concludes this paper.

II. RELATED WORKS

A. Classic Deep Neural Network Architectures

CNN. Convolutional neural networks (CNNs) [25] have recently achieved great successes in visual recognition tasks, including image classification [26], object detection [27]–[29], and scene parsing [30]. Recently, many deep CNN architectures, such as VGG, ResNet, and DenseNet, have been proposed to improve the performance of the tasks mentioned above. Network depth tends to improve network performance. However, the computational cost of these deep networks also increases significantly. Moreover, real-world systems may be affected by the high cost of these networks.

GAN. Goodfellow et al. first proposed the generative adversarial network (GAN) [31], which consists of generative and adversarial networks. The generator tries to produce very realistic outputs to fool the discriminator, which is optimized to distinguish between the real data and the generated outputs. GANs are trained to generate synthetic data, mimicking the genuine data distribution.

In machine learning, models can be classified into two categories: generative models and discriminative models. A discriminative network (denoted as D) can discriminate between two (or more) different classes of data, such as a CNN trained for image classification. A generative network (denoted as G) can generate new data that fit the distribution of the training data. For example, a trained Gaussian Mixture Model is able to generate new random data that more-or-less fit the distribution of the training data.

GANs pose a challenging optimization problem due to the multiple loss functions, which must be optimized simultaneously. The optimization of a GAN is conducted in two steps: 1) optimize the discriminative network while fixing the generative one; 2) optimize the generative network while fixing the discriminative network. Here, fixing a network means only allowing the network to pass forward and not performing back-propagation. These two steps are alternated seamlessly and depend on each other for efficient optimization. After enough training cycles, the optimization objective V(D, G) introduced in [31] will reach the situation where the probability distribution of the generator exactly matches the true probability distribution of the training data. Meanwhile, the discriminator has the capability to distinguish the realistic data from the virtual generated ones. However, the perfect cooperation between the generator and the discriminator fails occasionally. The whole system then reaches the status of "mode collapse", indicating that the discriminator and the generator tend to produce the same outputs.

LSTM. Hochreiter et al. first proposed the Long Short-Term Memory network, generally called LSTM, to obtain long-term dependency information in the network. As a type of recurrent neural network (RNN), LSTM has been widely used and has obtained excellent success in many applications. LSTM is deliberately designed to avoid the long-term dependency problem. Remembering long-term information is the default behavior of LSTM in practice, rather than an ability acquired at great cost. All RNNs have a chained form of repeating network modules. In the standard RNN, this repeating module
often has a simple structure (e.g., a "tanh" layer). The outputs of all LSTM cells are utilized to construct a new feature, where multinomial logistic regression is introduced to form the LSTM model.

One widely used way to evaluate RNN models is the adding task [32], [33], which takes two sequences of length T as input. The first sequence is formed by sampling uniformly in the range (0, 1). For the other sequence, we set two entries to 1 and the rest to 0. The output is obtained by adding the two entries of the first sequence whose positions are determined by the two entries of 1 in the second sequence.

B. Accelerating the Training/Test Process of DNNs

Training process acceleration. Since DNNs are mostly computationally intensive, Han et al. [34] proposed a deep compression method to reduce the storage requirement of DNNs by 35x to 49x without affecting the accuracy. Moreover, the compressed model has 3x to 4x layer-wise speedup and 3x to 7x better energy efficiency. Unimportant connections are pruned, and weight sharing and Huffman coding are applied to quantize the network. This work mainly attempts to reduce the number of parameters of neural networks. Liu et al. proposed the network slimming technique that can simultaneously reduce the model size, running-time memory, and computing operations [35]. He et al. proposed a new filter pruning strategy based on the geometric median to accelerate the training of deep CNNs [36]. Dai et al. proposed a synthesis tool to synthesize compact yet accurate DNNs [37]. Du et al. proposed a Continuous Growth and Pruning (CGaP) scheme to minimize the redundancy from the beginning [38]. Hubara et al. introduced a method to train Quantized Neural Networks that reduces memory size and accesses during the forward pass [39]. In [40], Kidambi et al. presented an intuitive and easier-to-tune version of ASGD (please refer to Section IV) and showed that ASGD leads to significantly faster convergence with accuracy comparable to SGD, Heavy Ball, and Nesterov's Momentum [12].

Test process acceleration. Denton et al. [4] proposed a method that compresses all convolutional layers. This is achieved by approximating proper low-rank decompositions and then updating the upper layers until the prediction result is enhanced. Based on singular value decomposition (SVD), this process consists of numerous tensor decomposition operations and filter clustering approaches to make use of similarities among learned features. Jaderberg et al. [5] introduced an easy-to-implement method that can significantly speed up pretrained CNNs with minimal modifications to existing frameworks. There can be a small associated loss in performance, but this is tunable to a desired accuracy level. Zhang et al. [6] first proposed a response reconstruction method, which introduces nonlinear neurons and a low-rank constraint. Without the usage of SGD and based on generalized singular value decomposition (GSVD), a solution is developed for this nonlinear problem. Li et al. presented a method to prune filters with relatively low weight magnitudes to produce CNNs with reduced computation costs without introducing irregular sparsity [41].

C. Deep Learning Optimization

In the training of DNNs [10], the learning rate is an essential hyper-parameter. DNN optimizers can be categorized into two groups based on different strategies of setting the learning rate: 1. Hand-tuned learning rate optimizers, such as stochastic gradient descent (SGD) [11], SGD-Momentum [12], and Nesterov's Momentum [12]; 2. Auto learning rate optimizers, such as AdaGrad [13], RMSProp [14], and Adam [15]. Good results have been achieved on the CIFAR10, CIFAR100, ImageNet, PASCAL VOC, and MS COCO datasets. They were mostly obtained by residual neural networks [42]–[45] trained by using SGD-Momentum. This work focuses on the improvement of the first category of optimizers. The introduction to these optimizers is as follows.

Classical Momentum [25] is the first variant of gradient descent involving the usage of a momentum parameter. It accelerates gradient descent by accumulating a velocity vector in directions of continuous reduction of the objective across iterations.

Stochastic Gradient Descent (SGD) [11] is a widely used optimizer for DNN training. SGD is easy to apply, but its disadvantage is that it converges slowly and may oscillate at saddle points. Moreover, how to choose the learning rate reasonably is a major difficulty of SGD.

SGD-Momentum (SGD-M) [12] is an optimization method that considers momentum. Compared to the original gradient descent step, SGD-M introduces variables related to the previous step. This means that the parameter update direction is decided not only by the present gradient, but also by the previously accumulated direction of descent. This allows the parameters to change little in the directions where the gradient changes frequently and, in contrast, to change a lot in the directions where the gradient changes slowly.

Nesterov's Momentum [12] is another momentum optimization algorithm, motivated by Nesterov's accelerated gradient method [46]. It is improved from the SGD algorithm so that each parameter update direction depends not only on the gradient of the current position, but also on the direction of the last parameter update. In other words, Nesterov's Momentum essentially uses second-order information of the objective (loss function), so it can better accelerate the convergence.

D. PID Controller

Traditionally, the PID controller has been used to control a feedback system [19] by exploiting the present, past, and future information of the prediction error. The theoretical basis of the PID controller was first proposed by Maxwell in 1868 in his seminal paper "On Governors" [47]. A mathematical formulation was given by Minorsky [48]. In recent years, several advanced control algorithms have been proposed. We define the difference between the actual output and the desired output as the error e(t). The PID controller calculates the error e(t) at every step t, and then applies a correction u(t) to the system as a function of the proportional (P), integral
(I), and derivative (D) terms of e(t). Mathematically, the PID controller can be described as

u(t) = K_p e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d \frac{d e(t)}{dt},   (1)

where K_p, K_i, and K_d correspond to the gain coefficients of the P, I, and D terms, respectively. The role of the error e(t) is the same as that of the gradient in the optimization of deep learning. The coefficients K_p, K_i, and K_d reflect the contributions of the current, past, and future errors to the current correction, respectively.

According to our analyses, we find that PID control techniques can be useful for the optimization of deep networks. The study presented in this paper is one of the first investigations to apply PID as a new optimizer in the deep learning field. Our studies have succeeded in demonstrating significant advantages of the proposed optimizer. By inheriting the advantages of the PID controller, the proposed optimizer performs well despite its simplicity.

III. PID AND DEEP NEURAL NETWORK OPTIMIZATION

We reveal the intrinsic relation between PID control and DNN optimization. This intrinsic relation inspires us to explore new DNN optimization methods. The core idea of this section is to regard the parameter update in the DNN training process as using a PID controller in the system to reach an equilibrium.

Fig. 1. Illustrations of the relationships between a control system (desired value, error, PID controller, control devices, system output, feedback connection) and deep model training (loss L, back-propagated gradients ∂L/∂w, SGD-Momentum update of the parameters θ = {w_ij, w_jk, w_kl}). It also shows the connection between the PID controller and SGD-Momentum.

A. Overall Connections

At first, we summarize the training process of deep learning. Deep neural networks (DNNs) need to map the input x to the output y through the parameters θ. To measure the gap between the DNN output and the desired output, the loss function L is introduced. Given some training data, we can calculate the loss function L(θ, X_train). In order to minimize the loss function L, we compute the derivative of the loss function L with respect to the parameters θ and, in most cases, update θ with the gradient descent method. DNNs gradually learn the complex relationship between the input x and the output y by constantly updating the parameters θ, which is called DNN training. The updating of θ is driven by the gradient of the loss function until it converges.

Then, the purpose of an automated control system is to evaluate the system status and drive it to the desired status through a controller. In a feedback control system, the controller's action is affected by the system's output. The error e(t) between the measured system status and the desired status is taken into consideration, so that the controller can make the system get close to the desired status.

More specifically, as shown in Eq. (1), the PID controller estimates a control variable u(t) by considering the current, past, and future (derivative) of the error e(t).

From here we can see that the error in the PID control system is related to the gradient in the deep neural network training process. The update of parameters during deep neural network (DNN) training can be analogized to the adjustment of the system by the PID controller.
As can be seen from the discussion above, there is a high similarity between DNN optimization and a PID based control system. Fig. 1 shows their flowcharts, from which we can see the similarity more intuitively. Based on the difference between the output and the target, both of them change the system/network. The negative feedback process in the PID controller is similar to the back-propagation in DNN optimization. One key difference is that the PID controller computes the update using the system error e(t), whereas the DNN optimizer decides the updates by considering the gradient ∂L/∂θ. Let us regard the gradient ∂L/∂θ as the incarnation of the error e(t). Then, the PID controller can be fully related to DNN optimization. In the following, we prove that SGD, SGD-Momentum, and Nesterov's Momentum are all special cases of the PID controller.

B. Stochastic Gradient Descent (SGD)

In DNN training, there are widely used optimizers, such as SGD and its variants. The parameter update rule of SGD from iteration t to t+1 is determined by

\theta_{t+1} = \theta_t - r\,\partial L_t/\partial \theta_t,   (2)

where r is the learning rate. We now regard the gradient ∂L_t/∂θ_t as the error e(t) in the PID control system. Comparing with the PID controller in Eq. (1), we find that SGD can be viewed as one type of P controller with K_p = r.

C. SGD-Momentum

SGD-Momentum is faster than SGD in training a DNN, because it can use the history of gradients. The update rule of SGD-M is given by

V_{t+1} = \alpha V_t - r\,\partial L_t/\partial \theta_t, \qquad \theta_{t+1} = \theta_t + V_{t+1},   (3)

where V_t is a term that accumulates historical gradients and α ∈ (0, 1) is the factor that balances the past and current gradients. It is usually set to 0.9 [49]. Dividing both sides of the first formula of Eq. (3) by α^{t+1}, we get

\frac{V_{t+1}}{\alpha^{t+1}} = \frac{V_t}{\alpha^t} - r\,\frac{\partial L_t/\partial \theta_t}{\alpha^{t+1}}.   (4)

By applying Eq. (4) from time t+1 to 1, we have

\frac{V_{t+1}}{\alpha^{t+1}} - \frac{V_t}{\alpha^t} = -r\,\frac{\partial L_t/\partial \theta_t}{\alpha^{t+1}},
\frac{V_t}{\alpha^t} - \frac{V_{t-1}}{\alpha^{t-1}} = -r\,\frac{\partial L_{t-1}/\partial \theta_{t-1}}{\alpha^t},
\ldots,
\frac{V_1}{\alpha^1} - \frac{V_0}{\alpha^0} = -r\,\frac{\partial L_0/\partial \theta_0}{\alpha^1}.   (5)

By adding the aforementioned equations together, we get

\frac{V_{t+1}}{\alpha^{t+1}} = \frac{V_0}{\alpha^0} - r \sum_{i=0}^{t} \frac{\partial L_i/\partial \theta_i}{\alpha^{i+1}}.   (6)

To make it more general, we set the initial condition V_0 = 0, and thus the above equation can be simplified as

V_{t+1} = -r \sum_{i=0}^{t} \alpha^{t-i}\,\partial L_i/\partial \theta_i.   (7)

Putting V_{t+1} into the second formula of Eq. (3), we have

\theta_{t+1} - \theta_t = -r\,\partial L_t/\partial \theta_t - r \sum_{i=0}^{t-1} \alpha^{t-i}\,\partial L_i/\partial \theta_i.   (8)

We can see that the parameter update process considers both the current gradient (P control) and the integral of past gradients (I control). If we assume α = 1, we get the following equation:

\theta_{t+1} - \theta_t = -r\,\partial L_t/\partial \theta_t - r \sum_{i=0}^{t-1} \partial L_i/\partial \theta_i.   (9)

Comparing Eq. (9) with Eq. (1), we can see that SGD-Momentum is a PI controller with K_p = r and K_i = r. By using some mathematical manipulation [50], we simplify Eq. (3) by removing V_t. Then, Eq. (9) can be rewritten as

\theta_{t+1} = \theta_t - r\,\partial L_t/\partial \theta_t - r \sum_{i=0}^{t-1} \alpha^{t-i}\,\partial L_i/\partial \theta_i.   (10)

We can see clearly that the network parameter update depends on both the current gradient r ∂L_t/∂θ_t and the integral of past gradients r Σ_{i=0}^{t-1} α^{t-i} ∂L_i/∂θ_i. It should be noted that the I term includes a decay factor α. Due to the huge amount of training data, it is better to calculate the gradient based on a mini-batch of the training data, so the gradients behave in a stochastic manner. The purpose of introducing the decay term α is to down-weight the gradients that are far away from the current step, so that it can alleviate noise. In all, based on these analyses, we can view SGD-Momentum as a PI controller.

D. Nesterov's Momentum

Nesterov's Momentum is improved from the SGD algorithm and considers second-order information of the objective (loss function), so it can better accelerate the convergence. Its update rule is

V_{t+1} = \alpha V_t - r\,\partial L_t/\partial(\theta_t + \alpha V_t), \qquad \theta_{t+1} = \theta_t + V_{t+1}.   (11)

By using the variable transform \hat{\theta}_t = \theta_t + \alpha V_t, and formulating the update rule with respect to \hat{\theta}, we have

V_{t+1} = \alpha V_t - r\,\partial L_t/\partial \hat{\theta}_t, \qquad \hat{\theta}_{t+1} = \hat{\theta}_t + (1+\alpha)V_{t+1} - \alpha V_t.   (12)

Similar to the derivation process in Eqs. (4)–(6) for SGD-Momentum, we have

V_{t+1} = -r \sum_{i=1}^{t} \alpha^{t-i}\,\partial L_i/\partial \hat{\theta}_i.   (13)

With Eq. (13), Eq. (11) can be rewritten as

\hat{\theta}_{t+1} - \hat{\theta}_t = -r(1+\alpha)\,\partial L_t/\partial \hat{\theta}_t - \alpha r \sum_{i=1}^{t-1} \alpha^{t-i}\,\partial L_i/\partial \hat{\theta}_i.   (14)
We can conclude that the network parameter update considers the current gradient (P control) and the integral of past gradients (I control). If we assume α = 1, then

\hat{\theta}_{t+1} - \hat{\theta}_t = -2r\,\partial L_t/\partial \hat{\theta}_t - r \sum_{i=0}^{t-1} \partial L_i/\partial \hat{\theta}_i.   (15)

Comparing Eq. (15) with Eq. (1), we can prove that Nesterov's Momentum is a PI controller with K_p = 2r and K_i = r. What is more, compared with SGD-Momentum, Nesterov's Momentum also utilizes the current gradient and the integral of past gradients to update the network parameters, but achieves a larger gain coefficient K_p.

IV. PID BASED DNN OPTIMIZATION

A. The Overshoot Problem of SGD-Momentum

We can learn from Eqs. (10) and (14) that the Momentum optimizer accumulates history gradients to accelerate. On the other hand, the updating of parameters may follow a wrong path if the history gradients lag behind the update of the parameters. According to the definition "the maximum peak value of the response curve measured from the desired response of the system" in discrete-time control systems [16], this phenomenon is named overshoot. Specifically, it can be written as

\mathrm{Overshoot} = \frac{\theta_{\max} - \theta^*}{\theta^*},   (16)

where θ_max and θ* are the maximum and optimum values of the weight, respectively.

The test benchmark for the overshoot problem is the first function of De Jong [51], due to its smooth, unimodal, and symmetric characteristics. The function can be written as

f(x) = 0.1 x_1^2 + 2 x_2^2,   (17)

whose search domain is −10 ≤ x_i ≤ 10, i = 1, 2. For this function, x* = (0, 0) and f(x*) = 0, so we can pursue a global minimum rather than a local one.

To build a simple PID optimizer, we introduce a derivative term of the gradient based on SGD-Momentum:

\mathrm{PID} = \mathrm{Momentum} + K_d\,(\partial f(x)/\partial x_c - \partial f(x)/\partial x_{c-1}),   (18)

where c is the present iteration index of x. With different choices of K_d in Eq. (18), we show the simulation results in Fig. 2, where the loss-contour map is shown as the background: the redder, the larger the loss value; the bluer, the smaller the loss value. The x-axis and y-axis denote x_1 and x_2, respectively. Both x_1 and x_2 are initialized to −10. We use red and yellow lines to show the paths of PID and SGD-Momentum, respectively. It is obvious that the SGD-Momentum optimizer suffers from the overshoot problem. By increasing K_d gradually (0.1, 0.5, and 0.93, respectively), our PID optimizer uses more "future" error, so that it can largely alleviate the overshoot problem.

B. PID Optimizer for DNN

We are motivated by the simple example in Section IV-A and seek a PID optimizer to boost the convergence of DNN training. From Eq. (10), SGD-Momentum can be viewed as a PI controller, which actually takes current and past gradient information. Fig. 2 shows that the PID controller introduces a derivative term of the gradient to use future information. Then, the overshoot problem can be alleviated obviously.

On the other hand, it is very easy to introduce noise when computing gradients, because the training is often conducted in a mini-batch manner. We therefore estimate a moving average of the derivative part. Our proposed PID optimizer updates the network parameters θ at iteration (t+1) by

V_{t+1} = \alpha V_t - r\,\partial L_t/\partial \theta_t,
D_{t+1} = \alpha D_t + (1-\alpha)(\partial L_t/\partial \theta_t - \partial L_{t-1}/\partial \theta_{t-1}),
\theta_{t+1} = \theta_t + V_{t+1} + K_d D_{t+1}.   (19)

We can see from Eq. (19) that a hyper-parameter K_d is introduced in the proposed PID optimizer. We initialize K_d by introducing Laplace Transform [23] theory and the Ziegler-Nichols [52] tuning method.

C. Initialization of Hyper-parameter K_d

The Laplace Transform converts a function of a real variable t to a function of a complex variable s. The most common usage is to convert time to frequency. Denote the Laplace transform of f(t) as F(s). Then

F(s) = \int_0^{\infty} e^{-st} f(t)\,dt, \quad \text{for } s > 0.   (20)

In general, it is easier to solve for F(s) than for f(t), and f(t) can be reconstructed from F(s) with the inverse Laplace transform

f(t) = \frac{1}{2\pi i} \lim_{T\to\infty} \int_{\gamma-iT}^{\gamma+iT} e^{st} F(s)\,ds,   (21)

where i is the imaginary unit and γ is a real number.

By using the Laplace Transform, we can first transform our PID optimizer into its Laplace-transformed functions of s and then simplify the algebra. After obtaining the transform F(s), we can achieve the desired solution f(t) with the inverse transform.

We initialize a parameter of a node in the DNN model as a scalar θ_0. After enough updates, the optimal value θ* can be obtained. We simplify the parameter update in DNN optimization as a one-step response (from θ_0 to θ*) in the control system. We introduce the Laplace Transform to set K_d and denote the time-domain change of the weight θ as θ(t). The Laplace Transform of θ* is θ*/s [53]. We denote by θ(t) the weight at iteration t. The Laplace Transform of θ(t) is denoted as θ(s), and that of the error e(t) as E(s), with

E(s) = \frac{\theta^*}{s} - \theta(s).

The Laplace transform of the PID controller [53] is

U(s) = \left(K_p + K_i \frac{1}{s} + K_d s\right) E(s).   (22)
Fig. 2. The overshoot problem of Momentum with different values of K_d (panels: small K_d, moderate K_d, big K_d). The red and yellow lines indicate the results obtained by PID and SGD-Momentum, respectively.

In our case, u(t) corresponds to the update of θ(t). So we replace U(s) with θ(s), and with E(s) = θ*/s − θ(s), Eq. (22) can be rewritten as

\theta(s) = \left(K_p + K_i \frac{1}{s} + K_d s\right)\left(\frac{\theta^*}{s} - \theta(s)\right).   (23)

With this form, it is easy to derive a standard closed-loop transfer function [54] as

\frac{\theta^*}{s} - \theta(s) = \frac{1}{K_d}\,\frac{\omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2},   (24)

where

\frac{K_p + 1}{K_d} = 2\zeta\omega_n, \qquad \frac{K_i}{K_d} = \omega_n^2.   (25)

Eq. (24) can be rewritten as

\frac{\theta^*}{s} - \theta(s) = \frac{(s + \zeta\omega_n) + \zeta\omega_n/\sqrt{1-\zeta^2}}{(s + \zeta\omega_n)^2 + \omega_n^2 (1 - \zeta^2)}.   (26)

We can get the time (iteration) domain form of θ(s) by using the inverse Laplace transform table [53] and the initial condition θ(0) = θ_0:

\theta(t) = \theta^* - \frac{(\theta^* - \theta_0)\, e^{-\zeta\omega_n t} \sin\!\left(\omega_n \sqrt{1-\zeta^2}\, t + \arccos\zeta\right)}{\sqrt{1-\zeta^2}},   (27)

and

\frac{K_p + 1}{K_d} = 2\zeta\omega_n, \qquad \frac{K_i}{K_d} = \omega_n^2,   (28)

where ζ is the damping ratio and ω_n is the natural frequency of the system. The evolution process of a weight, as an example of θ(t), is shown in Fig. 3. From Eq. (28), we get

K_i = \frac{(K_p + 1)^2}{4 K_d \zeta^2}.   (29)

Fig. 3. The evolution process of the weight θ(t) for the PID optimizer (the curve rises from θ_0, peaks at θ_max at time t_max, and settles to θ*).

From Eq. (29) we know that K_i is a monotonically decreasing function of ζ. Based on the definition of overshoot in Eq. (16), it is obvious that ζ is monotonically decreasing with overshoot. Then, K_i is a monotonically increasing function of overshoot. In a word, the more history error (the integral part), the more overshoot in the system. This is a good explanation of why SGD-Momentum overshoots its target and needs more training time.

By differentiating θ(t) w.r.t. time t and letting

\frac{d\theta(t)}{dt} = 0,

we obtain the peak time of the weight as

t_{\max} = \frac{\pi}{\omega_n \sqrt{1-\zeta^2}}.   (30)

Putting t_max into Eq. (27), we obtain θ_max, and putting θ_max into Eq. (16), we have

\mathrm{Overshoot} = \frac{\theta(t_{\max}) - \theta^*}{\theta^*} = e^{-\zeta\pi/\sqrt{1-\zeta^2}}.   (31)

We can learn from Eq. (27) that the term sin(ω_n√(1−ζ²) t + arccos ζ) brings a periodic oscillation to the weight, which is no more than 1. The term e^{−ζω_n t} mainly controls the
convergence rate. It should be noted that the value of the hyper-parameter K_d determines this decay term, since

e^{-\zeta\omega_n} = e^{-\frac{K_p + 1}{2 K_d}}.   (32)

Based on the above analyses, we know that the training of a DNN can be accelerated by using a large derivative term. On the other hand, if K_d is too large, the system will be fragile. After some experiments, we set K_d based on the Ziegler-Nichols optimum setting rule [52].

According to the Ziegler-Nichols rule, the ideal setting of K_d should be one third of the oscillation period T, which means K_d = (1/3)T. From Eq. (27), we can get T = 2π/(ω_n√(1−ζ²)). If we make the simplification that the α in Momentum is equal to 1, then K_i = K_p = r. Combined with Eq. (28), K_d has the closed-form solution

K_d = 0.25 r + 0.5 + \left(1 + \frac{16}{9}\pi^2\right)/r.   (33)

For real-world cases, where different DNNs are applied to different datasets, we first start with this ideal setting of K_d and then adjust it slightly.

Fig. 4. Comparison between PID and other optimizers (SGD-M, PID, Adam, Nesterov's-M) on the MNIST dataset trained for 20 epochs. Top row: the curves of training loss and validation loss; the PID optimizer obtains lower losses and converges faster. Bottom row: the curves of training accuracy and validation accuracy; the PID optimizer performs much better than the others for both training and test accuracies.

Fig. 5. PID vs. other optimizers on the MNIST dataset for 20 epochs: standard deviation over 10 runs. Top row: the curves of training loss and validation loss. Bottom row: the curves of training accuracy and validation accuracy.

V. EXPERIMENTAL RESULTS

We introduce four commonly used datasets for the experiments. Then, we compare our proposed optimizer with other optimizers by using CNNs and an LSTM on these four datasets. Specifically, we first train a multilayer perceptron (MLP) on the MNIST dataset to demonstrate the advantages of the PID optimizer. We then train CNNs on the CIFAR datasets to show that our PID optimizer achieves competitive accuracy compared with other optimizers, but with a faster training speed. Further studies are carried out to prove that our PID optimizer also performs well on a larger dataset: based on the Tiny-ImageNet dataset [55], we carry out a series of experiments, and the results indicate that our PID optimizer can be extended to modern networks. Our proposed PID optimizer is set to use all the hyper-parameters that are detailed for SGD-Momentum. The initial learning rate and the learning rate schedule vary with different experiments.

A. Datasets

MNIST Dataset. The MNIST dataset [56] consists of handwritten digits from 0 to 9. Being a subset of the larger NIST dataset, MNIST consists of 60,000 training images and 10,000 test images. The digits have been size-normalized and centered in a fixed-size image of 28 × 28 pixels. With the usage of an anti-aliasing technique, the preprocessed images contain gray levels.

CIFAR Datasets. The CIFAR10 dataset [57] has 60,000 RGB color images of size 32 × 32. There are 10 classes, each of which includes 6,000 images. 50,000 and 10,000 images are used for training and testing, respectively. Similar to CIFAR10, the CIFAR100 dataset [57] consists of 100 classes with 600 images per class. 500 and 100 images are extracted from each class for training and testing, respectively. The 100 classes in CIFAR100 [57] are further arranged into 20 super classes. We performed random crops and horizontal flips, and padded 4 pixels around each side of the original image for data augmentation.

Tiny-ImageNet Dataset. There are 200 classes in the Tiny-ImageNet [55] dataset. Each class contains 500, 50, and 50 images for training, validation, and testing, respectively. Tiny-ImageNet is harder to classify correctly than the CIFAR datasets, not only because of the larger number of classes, but also because the relevant objects to be classified usually occupy only a few pixels of the whole image.

PTB Dataset. The Penn Treebank dataset, known as the PTB dataset, is widely used in machine learning for NLP (Natural Language Processing) research. The PTB dataset has 2,499 stories which come from a three-year WSJ collection of 98,732 stories.
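For reference before the results, the following is a minimal sketch of how the update rule in Eq. (19), together with the K_d initialization of Eq. (33), could be written as a PyTorch-style optimizer. It is our own illustrative reconstruction, not the released PIDOptimizer code: the class name, the convention of deriving K_d from the learning rate when it is not given, and the use of one moving-average derivative buffer per parameter tensor are assumptions made for the sketch.

```python
import math
import torch
from torch.optim.optimizer import Optimizer

class PIDSketch(Optimizer):
    """Sketch of the PID update of Eq. (19): V is the momentum (I) buffer, D is the
    moving average of the gradient change, and K_d defaults to Eq. (33)."""

    def __init__(self, params, lr=0.1, momentum=0.9, kd=None):
        if kd is None:
            # Closed-form initialization of Eq. (33), assuming K_p = K_i = r.
            kd = 0.25 * lr + 0.5 + (1.0 + (16.0 / 9.0) * math.pi ** 2) / lr
        defaults = dict(lr=lr, momentum=momentum, kd=kd)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            r, alpha, kd = group["lr"], group["momentum"], group["kd"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["V"] = torch.zeros_like(p)         # accumulated (I) term
                    state["D"] = torch.zeros_like(p)         # moving-average (D) term
                    state["prev_grad"] = torch.zeros_like(p)
                V, D, prev = state["V"], state["D"], state["prev_grad"]
                V.mul_(alpha).add_(g, alpha=-r)                  # V_{t+1} = a*V_t - r*grad
                D.mul_(alpha).add_(g - prev, alpha=1 - alpha)    # D_{t+1} = a*D_t + (1-a)*(grad - prev)
                prev.copy_(g)
                p.add_(V).add_(D, alpha=kd)                      # theta_{t+1} = theta_t + V + K_d*D

# usage: opt = PIDSketch(model.parameters(), lr=0.1, momentum=0.9)
```

In this sketch the derivative buffer is kept per parameter tensor, mirroring how the momentum buffer is handled by SGD-Momentum; the released implementation may organize its state differently.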
TABLE I
COMPARISONS BETWEEN PID AND SGD-MOMENTUM OPTIMIZERS IN TERMS OF TEST ERRORS AND TRAINING EPOCHS. WE REPORT THE RESULTS BASED ON CIFAR10 AND CIFAR100.

Model              | Depth-k | Params (M) | Runs | CIFAR10 (PID/SGD-M) | Epochs (PID/SGD-M) | CIFAR100 (PID/SGD-M) | Epochs (PID/SGD-M)
ResNet [42]        | 110     | 1.7        | 5    | 6.23/6.43           | 239/281            | 24.95/25.16          | 237/293
ResNet [42]        | 1202    | 10.2       | 5    | 7.81/7.93           | 230/293            | 27.93/27.82          | 251/296
PreActResNet [44]  | 164     | 1.7        | 5    | 5.23/5.46           | 230/271            | 24.17/24.33          | 241/282
ResNeXt29 [58]     | 8-64    | 34.43      | 10   | 3.65/3.43           | 221/294            | 17.46/17.77          | 232/291
ResNeXt29 [58]     | 16-64   | 68.16      | 10   | 3.42/3.58           | 209/289            | 17.11/17.31          | 229/283
WRN [45]           | 16-8    | 11         | 10   | 4.42/4.81           | 213/290            | 21.93/22.07          | 229/283
WRN [45]           | 28-20   | 36.5       | 10   | 4.27/4.17           | 208/290            | 20.21/20.50          | 221/295
DenseNet [43]      | 100-12  | 0.8        | 10   | 3.83/4.30           | 196/291            | 19.97/20.20          | 213/294
DenseNet [43]      | 190-40  | 25.6       | 10   | 3.11/3.32           | 194/293            | 16.95/17.17          | 208/297

Fig. 6. Comparison among PID and other optimizers on the CIFAR10 dataset by using DenseNet 190-40. Top row: the curves of training and validation loss; the PID optimizer obtains lower losses and behaves more stably. Bottom row: the curves of training accuracy and validation accuracy; the PID optimizer performs slightly better than SGD-Momentum for both training and test accuracies.

Fig. 7. Comparison among PID and other optimizers on the Tiny-ImageNet dataset with a DenseNet 100-12 backbone. Top row: the curves of training and validation loss; the PID optimizer obtains lower training and validation losses. Bottom row: the curves of training accuracy and validation accuracy; the PID optimizer achieves the best performance.

B. Results of CNNs

Results of MLP on the MNIST dataset. To compare the proposed PID optimizer with SGD-Momentum [12], we first carry out a series of experiments. We use the MNIST dataset to train a basic network, an MLP. There are 1,000 hidden nodes in the hidden layer. ReLU acts as the nonlinearity in the MLP network, and we place a softmax layer on top. The training batch size is 128 for 20 epochs. After running the experiments 10 times, we obtain the average results. Fig. 4 shows comparisons among four methods in terms of training statistics. Adam performs well in the early stages of training, but overall it can be very unstable and slower than the PID optimizer. As Fig. 4 shows, the PID optimizer converges faster than the other optimizers. What is more, the PID optimizer achieves lower loss and higher accuracy in both the training and validation phases, and it has stronger generalization ability on the test dataset. It can be seen from Fig. 5 that the standard deviation of the PID optimizer during training is minimal, which proves its training stability. The accuracy is 98% for the PID optimizer and 97.5% for SGD-Momentum.

Results on the CIFAR datasets. In order to fully test our proposed PID optimizer, we compare it with SGD-Momentum on recent leading DNN models (ResNet [42], PreActResNet [44], ResNeXt29 [58], WRN [45], and DenseNet [43]). The details are shown in Tab. I, where the second column lists the depth of the networks and k. The k in ResNeXt29, WRN, and DenseNet represents the cardinality, widening factor, and growth rate, respectively. The third column lists the number of parameters. The fourth column shows the number of runs used to calculate the mean test error. The next four columns show the average test error and the number of epochs at which the best test error is first reached (the minimum number of epochs to reach the best accuracy).

The following conclusions can be drawn from Tab. I. First, compared with SGD-Momentum, our PID optimizer obtains lower test errors for all architectures (except for ResNet with depth 1,202) based on the results from the CIFAR10 and CIFAR100 datasets. Second, regarding the training epochs needed to reach the best results, the PID optimizer needs less training than SGD-Momentum. Specifically, compared with SGD-Momentum, our proposed PID optimizer achieves 35% and up to 50% acceleration on average. This reveals that the direction of gradient descent plays a very important role, which can be utilized to alleviate the overshoot problem and contribute to faster convergence in the training of DNNs. In Fig. 6, we further
present more training statistics on CIFAR10 to compare the PID and SGD-Momentum optimizers. For the backbone DenseNet 190-40 [43], we set its network depth as 190 and its growth rate as 40. Based on the experiments, we can clearly conclude that our PID optimizer converges faster than SGD-Momentum. More importantly, in both the training and validation phases, the PID optimizer obtains lower loss and higher accuracy.

Results on Tiny-ImageNet. We also apply our proposed PID optimizer on the Tiny-ImageNet dataset with the DenseNet 100-12 architecture to indicate its effectiveness. The initial learning rate of the four optimizers is 0.1. The decreasing schedule is set to 50% and 75% of the training epochs. The batch size is 500. In Fig. 7, we show the curves of training loss and accuracy, as well as validation loss and accuracy, over the epochs for the four optimizers. Similar to the results on the CIFAR datasets, the proposed PID optimizer not only converges faster but also obtains better performance. These results prove the generalization ability of our proposed PID optimizer.

Fig. 8. PID vs. SGD-Momentum for generating images through GANs on the MNIST dataset. (a) The generated images from SGD-Momentum. (b) The generated images from PID.

C. Results of GANs

During the training of generative adversarial networks (GANs), both G and D need to be trained. We train them both in an alternating manner. Each of their objectives can be expressed as a loss function that we can optimize via gradient descent. So, we train G for a couple of steps, then train D for a couple of steps, then give G the chance to improve itself, and so on. The result is that the generator and the discriminator each get better at their objectives in turn, so that the generator can finally fool even the most sophisticated discriminator. In practice, this method ends up with generative neural nets that are good at producing new data.

In the experiments, we use a deep convolutional generative adversarial network (DCGAN) to test our proposed PID optimizer. The discriminator of this DCGAN consists of 2 convolutional layers (with ReLU activations and max pooling) and 2 fully-connected layers. The generator of this DCGAN consists of a fully-connected layer (with batch normalization and a ReLU activation) and 3 convolutional layers. The binary cross entropy is used as the loss function. The learning rate is initialized to 0.0003 for all optimizers. The qualitative results of PID are illustrated in Fig. 8(b) and the SGD-Momentum results are shown in Fig. 8(a). From Fig. 8, we can see that the images generated with the PID optimizer are more realistic than those generated with the SGD-Momentum optimizer.

Fig. 9. Comparison between PID and SGD-Momentum for the adding task of RNNs. Top row: the curves of training and validation loss; the PID optimizer achieves lower training and validation losses than SGD-Momentum. Bottom row: the curves of training and validation accuracy; our PID optimizer performs better in both training and test performance.

D. Results of RNNs

In this experiment, we employ a simple LSTM that has only 1 layer with 100 hidden units. The mean squared error (MSE) is used as the objective function for the adding problem. The initial learning rate is set to 0.002 for both SGD-Momentum and the PID optimizer. The learning rate is reduced by a factor of 10 every 20,000 training steps. We randomly generate all the training and testing data throughout the whole experiment. The results are shown in Fig. 9. The LSTM model with SGD-Momentum has trouble converging. However, our proposed PID optimizer can reach a small error with very fast convergence. This indicates that our proposed PID optimizer can effectively train the LSTM.

Results on the PTB dataset. In this subsection, we use the character-level Penn Treebank (PTB-c) dataset to evaluate our proposed PID optimizer. We follow similar experimental settings as in [59]. Specifically, we apply frame-wise batch normalization [60] and set the batch size as 128. The learning rate is initially set to 0.0002 and is decreased by 10 times when the validation performance no longer improves. We also introduce dropout [61] with dropping probabilities of 0.25 and 0.3. There is no overlapping in the sequences, whose length is set to T = 50 for both training and testing. We then train the networks with the PID and SGD-Momentum optimizers. The results are shown in Fig. 10. Compared with SGD-Momentum, our proposed PID optimizer achieves better performance on the LSTM model.

E. Results of Different K_i and K_d

We also perform an ablation study on the hyper-parameters of the PID controller. The experiments are run on the CIFAR10 dataset with DenseNet 100-12. The initial learning rate is 0.1, and it is reduced by 10 at the 150th and 225th epochs.
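Before looking at the sweep results below (Fig. 11), note that the analysis of Section IV-C already predicts the direction of the K_i effect: with K_p and K_d held fixed, Eqs. (28)–(31) give ζ = (K_p+1)/(2√(K_i K_d)), so a larger K_i lowers the damping ratio and raises the overshoot of the step response. The small script below tabulates this prediction; the chosen K_p and K_d are illustrative values for the formula, not the exact gains used in the CIFAR10 experiments.

```python
import math

def predicted_overshoot(k_p, k_i, k_d):
    """Overshoot predicted by Eqs. (28)-(31): zeta = (K_p+1)/(2*sqrt(K_i*K_d)),
    overshoot = exp(-zeta*pi/sqrt(1-zeta^2)); it is zero when the response is
    overdamped (zeta >= 1)."""
    zeta = (k_p + 1.0) / (2.0 * math.sqrt(k_i * k_d))
    if zeta >= 1.0:
        return 0.0
    return math.exp(-zeta * math.pi / math.sqrt(1.0 - zeta ** 2))

k_p, k_d = 0.1, 10.0                      # illustrative gains; K_p = r as in Section III-C
for k_i in [0.3, 1, 3, 10, 25, 50]:       # the K_i values swept in Fig. 11
    print(f"K_i = {k_i:>4}: predicted overshoot = {predicted_overshoot(k_p, k_i, k_d):.3f}")
```

The printed values increase monotonically with K_i, which is the same qualitative trend as the "more history error, more overshoot" observation derived from Eq. (29).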
Fig. 10. Comparison between PID and SGD-Momentum to train the LSTM on the PTB dataset. Top row: the curves of training and validation loss; the PID optimizer helps to achieve smaller training and validation losses. Bottom row: the curves of training and validation accuracy; the PID optimizer helps to achieve higher training and validation performance.

Fig. 11. Comparison among PID controllers with different K_i (I = 0.3, 1, 3, 10, 25, 50) on the CIFAR10 dataset by using DenseNet 100-12. K_d is fixed to 10. Top row: the curves of training and validation loss. Bottom row: the curves of training and validation accuracy. Within a certain range, a larger K_i achieves better validation accuracy.

Fig. 12. Comparison among PID controllers with different K_d (D = 10, 25, 50, 100) on the CIFAR10 dataset by using DenseNet 100-12. K_i is fixed to 3. Top row: the curves of training and validation loss. Bottom row: the curves of training and validation accuracy.

The first group of experiments investigates the variation of training and validation statistics with K_i while K_d is fixed. Fig. 11 shows six PID controllers whose K_d is 10. In training, the performance of all controllers differs at an early stage, but eventually they reach the same level. In validation, the controller with K_i = 10 achieves the lowest loss and the highest validation accuracy. We also repeat this experiment with K_d = 10, 25, 50, and 100, respectively, and the results are highly similar to Fig. 11. One interesting phenomenon is that the larger the K_i, the more the training is affected by the decreasing schedule.

Then we turn to K_d. The settings of the second group of experiments are kept the same as in the previous experiments, but K_i is fixed. Fig. 12 shows that their performance is highly consistent. It is also shown that the larger the K_d, the more unstable the validation performance. The reason may be that a large K_d leads to more change of the optimization path.

As can be seen from these experiments, K_i is more important than K_d in this specific task (CIFAR10 with DenseNet 100-12). K_i not only affects the speed of convergence, but also affects the validation accuracy.

VI. CONCLUSION AND FUTURE WORK

Motivated by the outstanding performance of the proportional-integral-derivative (PID) controller in the field of automatic control, we reveal the connections between the PID controller and stochastic optimizers and their variants. We then propose a new PID optimizer for deep neural network training. The proposed PID optimizer reduces the overshoot phenomenon of SGD-Momentum and accelerates the training process of DNNs by combining the present, the past, and the change information of gradients to update the parameters. Our experiments on both image recognition tasks with the MNIST, CIFAR, and Tiny-ImageNet datasets and LSTM tasks with the PTB dataset validate that the proposed PID optimizer is 30% to 50% faster than SGD-Momentum, while obtaining a lower error rate. We will continue to study the relationship among the optimal hyper-parameters (K_p, K_i, and K_d) in specific tasks, and we will conduct more in-depth research for more general cases in the future. We will also investigate how to associate the PID optimizer with an adaptive learning rate for DNN/RNN optimization in future work.

ACKNOWLEDGMENT

This work is partially supported by the NSFC fund (61571259, 61831014, 61531014), in part by the Shenzhen Science and Technology Project under Grant (GGFW2017040714161462, JCYJ20170307153051701).
Haoqian Wang (M'13) received the B.S. and M.E. degrees from Heilongjiang University, Harbin, China, in 1999 and 2002, respectively, and the Ph.D. degree from the Harbin Institute of Technology, Harbin, in 2005. He was a Post-Doctoral Fellow with Tsinghua University, Beijing, China, from 2005 to 2007. He has been a Faculty Member with the Graduate School at Shenzhen, Tsinghua University, Shenzhen, China, since 2008, where he has also been an Associate Professor since 2011 and the director of the Shenzhen Institute of Future Media Technology. His current research interests include generative adversarial networks, video communication, and signal processing.

Yi Luo received the B.E. degree from Xidian University, Xi'an, China, in 2019. He is pursuing a master's degree at Tsinghua University. He is also working as a research assistant at the Graduate School at Shenzhen, Tsinghua University, Shenzhen, China. His research interests include optimization in deep learning.

Wangpeng An received the B.E. degree from Kunming University of Science and Technology in 2012. He is pursuing a master's degree at Tsinghua University, supervised by Professor Qionghai Dai. His research interests include face attribute recognition, generative adversarial networks, and deep learning optimization.

Qingyun Sun is currently working toward the Ph.D. degree in the Department of Mathematics, Stanford University. He received the B.S. degree from the School of Mathematical Sciences, Peking University, Beijing, China, in 2014. His research interests include mathematical foundations for artificial intelligence, data science, machine learning, algorithmic game theory, multi-agent decision making, optimization, and high-dimensional statistics.

Jun Xu is an Assistant Professor with the College of Computer Science, Nankai University, Tianjin, China. Before that, he worked as a Research Scientist at the Inception Institute of Artificial Intelligence. He received the B.Sc. degree in pure mathematics and the M.Sc. degree in Information and Probability, both from the School of Mathematics Science, Nankai University, Tianjin, China, in 2011 and 2014, respectively. He received the Ph.D. degree in 2018 from the Department of Computing, The Hong Kong Polytechnic University, supervised by Prof. David Zhang and Prof. Lei Zhang.

Yongbing Zhang received the B.A. degree in English and the M.S. and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 2004, 2006, and 2010, respectively. He joined the Graduate School at Shenzhen, Tsinghua University, Shenzhen, China, in 2010, where he is currently an Associate Professor. He was the recipient of the Best Student Paper Award at the IEEE International Conference on Visual Communication and Image Processing in 2015. His current research interests include signal processing, computational imaging, and machine learning.

Yulun Zhang received the B.E. degree from the School of Electronic Engineering, Xidian University, China, in 2013 and the M.E. degree from the Department of Automation, Tsinghua University, China, in 2017. He is currently pursuing the Ph.D. degree with the Department of ECE, Northeastern University, USA. He was the recipient of the Best Student Paper Award at the IEEE International Conference on Visual Communication and Image Processing (VCIP) in 2015. He also won the Best Paper Award at the IEEE International Conference on Computer Vision (ICCV) RLQ Workshop in 2019. His research interests include image restoration and deep learning.

Lei Zhang (M'04-SM'14-F'18) received the B.Sc. degree in 1995 from Shenyang Institute of Aeronautical Engineering, Shenyang, P.R. China, and the M.Sc. and Ph.D. degrees in Control Theory and Engineering from Northwestern Polytechnical University, Xi'an, P.R. China, in 1998 and 2001, respectively. From 2001 to 2002, he was a Research Associate in the Department of Computing, The Hong Kong Polytechnic University. From January 2003 to January 2006, he worked as a Postdoctoral Fellow in the Department of Electrical and Computer Engineering, McMaster University, Canada. In 2006, he joined the Department of Computing, The Hong Kong Polytechnic University, as an Assistant Professor. Since July 2017, he has been a Chair Professor in the same department. His research interests include computer vision, image and video analysis, pattern recognition, and biometrics. Prof. Zhang has published more than 200 papers in these areas. As of 2019, his publications have been cited more than 38,000 times in the literature. Prof. Zhang is a Senior Associate Editor of IEEE Transactions on Image Processing and an Associate Editor of SIAM Journal on Imaging Sciences and Image and Vision Computing. He has been a Clarivate Analytics Highly Cited Researcher from 2015 to 2018. More information can be found on his homepage: http://www4.comp.polyu.edu.hk/cslzhang/.