An A PID Controller CVPR 2018 Paper
Wangpeng An1,2 , Haoqian Wang1,3 , Qingyun Sun4 , Jun Xu2 , Qionghai Dai1,3 , and Lei Zhang ∗2
1 Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
2 Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong, China
3 Shenzhen Institute of Future Media Technology, Shenzhen, China
4 Stanford University, CA, USA
∗ Corresponding author. This work is supported by HK RGC GRF grant (PolyU 152135/16E), the NSFC fund (61571259, 61531014), Shenzhen Science and Technology Project under Grant (GGFW2017040714161462, JCYJ20170307153051701) and the National High-tech R&D Program of China (863 Program, 2015AA015901).

Abstract

Deep neural networks have demonstrated their power in many computer vision applications. State-of-the-art deep architectures such as VGG, ResNet, and DenseNet are mostly optimized by the SGD-Momentum algorithm, which updates the weights by considering their past and current gradients. Nonetheless, SGD-Momentum suffers from the overshoot problem, which hinders the convergence of network training. Inspired by the prominent success of the proportional-integral-derivative (PID) controller in automatic control, we propose a PID approach for accelerating deep network optimization. We first reveal the intrinsic connections between SGD-Momentum and the PID based controller, then present the optimization algorithm, which exploits the past, current, and change of gradients to update the network parameters. The proposed PID method greatly reduces the overshoot phenomenon of SGD-Momentum, and it achieves up to 50% acceleration on popular deep network architectures with competitive accuracy, as verified by our experiments on benchmark datasets including CIFAR10, CIFAR100, and Tiny-ImageNet.

1. Introduction

Benefitting from the availability of large-scale visual datasets such as ImageNet [1], deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have significantly improved the system accuracy in many computer vision problems, such as image classification [2], object detection [3], and face recognition [4]. Despite the great successes of deep learning, the training of deep networks on large-scale datasets is usually computationally expensive, costing several days or even weeks using GPU-equipped high-end PCs. It is important to investigate how to accelerate the training of deep models without sacrificing accuracy, which can save time and memory cost, particularly for resource-limited applications.

The key component of DNN training is the optimizer, which defines how the millions or even billions of parameters of a deep model are updated. The learning rate is one of the most important hyper-parameters to train a DNN [5]. Based on how the learning rate is set, deep learning optimizers can be categorized into two groups: hand-tuned learning rate optimizers such as stochastic gradient descent (SGD) [6], SGD Momentum [7] and Nesterov's Momentum [7], and auto learning rate optimizers such as AdaGrad [8], RMSProp [9] and Adam [10]. Auto learning rate optimizers adaptively tune an individual learning rate for each parameter. Such a goal of fine adaptation is attractive, and it is expected to yield better deep model learning results. However, the recent findings by Wilson et al. [11] show that hand-tuned SGD-Momentum achieves better results at the same or even faster speed. The hypothesis put forth here is that adaptive methods may converge to different local minima [12]. It is also noted that most of the best-performing deep models such as ResNet [13] and DenseNet [14] are usually trained by SGD-Momentum.

The strategy of SGD-Momentum is to consider both the past and present gradients to update the network parameters. However, SGD-Momentum suffers from the overshoot problem [15], which refers to the phenomenon that a weight's value far exceeds its target value and does not change its update direction. Such an overshoot problem hinders the
convergence of SGD-Momentum, and costs more training time and resources. It is of significant importance to investigate whether we can design a new DNN optimizer which is free of the overshoot problem and has faster convergence speed while maintaining good accuracy.

It has been found that many optimization algorithms popularly employed in machine learning studies share certain similarity with the classic control methods studied since the 1950s [16]. In the literature of automatic control, the feedback control system plays a key role, while the proportional-integral-derivative (PID) controller is the most commonly used feedback control mechanism due to its simplicity, functionality, and broad applicability [17]. More than 90% of industrial controllers are implemented based on PID [18], including self-driving cars [19], unmanned flying vehicles [20], robotics [21], etc. The basic idea of PID control is that the control action should be proportional to the current error (the difference between system output and desired output), the integral of the past error over time, and the derivative of the error, which represents the future trend.

Though the PID controller has gained massive success in different industries of control and automation, little study has been done on its connections with stochastic optimization, as well as its potential applications to DNN training. In this paper, we make the first attempt along this line. We first bridge the gap between the PID controller and stochastic optimization methods such as SGD, SGD-Momentum and Nesterov's Momentum, and consequently develop a PID approach for DNN optimization. Compared with SGD-Momentum, which utilizes the past and current gradients, the proposed PID optimization approach also utilizes the gradient changes to update the network. We further introduce the Laplace Transform [22] to initialize the hyper-parameter introduced in our method, resulting in a simple yet effective stochastic DNN optimization algorithm. The major contributions of this work are summarized as follows.

• By linking the calculation of errors in a feedback control system and the calculation of gradients in network updating, we reveal the intrinsic connections between deep network optimization and feedback system control, and show that SGD-Momentum is a special case of the PID controller with only proportional (P) and integral (I) components.

• We then propose a PID approach to optimize DNNs by utilizing the present, past and changing information of the gradient. The classical Laplace Transform is introduced to understand and initialize the hyper-parameter in our algorithm.

• We systematically evaluate the proposed approach, and the extensive experiments on the CIFAR10, CIFAR100 and Tiny-ImageNet datasets demonstrate the efficiency and effectiveness of our PID approach.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 connects the PID controller with DNN optimization. Section 4 introduces the proposed PID approach for DNN optimization. Experimental results and detailed analysis are reported in Section 5. Section 6 concludes this paper.

2. Related Work

2.1. Deep Learning Optimization

The learning rate is the most important hyper-parameter to train deep neural networks [9]. Based on how the learning rate is set, deep learning optimization methods can be categorized into two classes. The first class consists of fixed learning rate methods such as SGD [6], SGD Momentum [7], and Nesterov's Momentum [7], and the second class includes auto learning rate methods such as AdaGrad [8], RMSProp [9], and Adam [10]. Our work is based on fixed learning rate methods, considering that the current state-of-the-art results on the CIFAR10, CIFAR100, ImageNet, PASCAL VOC and MS COCO datasets were mostly obtained by Residual Neural Networks [13, 14, 23, 24] trained with SGD Momentum.

Stochastic Gradient Descent (SGD) [6] is a widely used optimization algorithm for machine learning in general, and especially for deep learning. SGD usually uses a fixed learning rate. This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples), and that noise does not vanish even when the loss arrives at a minimum.

SGD Momentum [7] is designed to accelerate learning, especially in the case of small and consistent gradients. The momentum algorithm accumulates an exponentially decayed moving average of past gradients and continues to move in the consistent direction. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space. A hyper-parameter α ∈ (0, 1) determines how much the past gradients contribute to the current update of the weights.

Nesterov's Momentum [7] is a variant of the momentum algorithm that was motivated by Nesterov's accelerated gradient method [25]. The difference between Nesterov's momentum and regular momentum lies in where the gradient is evaluated. With Nesterov's momentum, the gradient is estimated after the current velocity is applied. Thus one can interpret Nesterov's momentum as attempting to add a correction factor to the standard method of momentum. Recently, Nesterov's Momentum method has been characterized as a second order ordinary differential equation in the small step limit [26].
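The three fixed learning rate methods described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy loss L(θ) = 0.5 θ² and the function names `grad`, `sgd_step`, `momentum_step` and `nesterov_step` are assumptions introduced here, while `r` (learning rate) and `alpha` (momentum decay) follow the symbols used in the paper.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta**2 (illustrative assumption)."""
    return theta


def sgd_step(theta, r):
    """Plain SGD: move against the current gradient with learning rate r."""
    return theta - r * grad(theta)


def momentum_step(theta, v, r, alpha):
    """SGD-Momentum: v accumulates an exponentially decayed sum of past gradients."""
    v_new = alpha * v - r * grad(theta)
    return theta + v_new, v_new


def nesterov_step(theta, v, r, alpha):
    """Nesterov's Momentum: the gradient is evaluated after the velocity is applied."""
    v_new = alpha * v - r * grad(theta + alpha * v)
    return theta + v_new, v_new


if __name__ == "__main__":
    theta_m, v_m = 1.0, 0.0
    theta_n, v_n = 1.0, 0.0
    for _ in range(5):
        theta_m, v_m = momentum_step(theta_m, v_m, r=0.1, alpha=0.9)
        theta_n, v_n = nesterov_step(theta_n, v_n, r=0.1, alpha=0.9)
    print("momentum:", theta_m, "nesterov:", theta_n)
```

The only difference between the last two functions is the point at which `grad` is evaluated, which matches the verbal description of Nesterov's Momentum given above.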
2.2. PID Controller

The PID controller exploits the present, past and future information of the prediction error to control a feedback system [18]. PID based control originated in the 19th century for speed control. The theoretical foundation for the operation of PID was first described by Maxwell in 1868 in his seminal paper "On Governors" [27]. Minorsky [28] then gave this a mathematical formulation. Over the years, many advanced control algorithms have also been proposed. However, most industrial controllers are implemented with a PID algorithm because it is simple, robust and easy to use [29]. A PID controller continuously calculates an error $e(t)$, which is the difference between the desired optimal output and a measured system output, and applies a correction $u(t)$ to the system based on the proportional (P), integral (I), and derivative (D) terms of $e(t)$. Mathematically, there is:

$$u(t) = K_p e(t) + K_i \int_0^t e(t)\,dt + K_d \frac{d}{dt} e(t), \tag{1}$$

where $K_p$, $K_i$ and $K_d$ are the gain coefficients on the P, I and D terms, respectively.

One can see that the error $e(t)$, defined as the difference between the desired value and the actual output, has the same spirit as the gradient used in deep learning optimization. The coefficients $K_p$, $K_i$ and $K_d$ determine the contributions of present, past and future errors to the current correction. Such analyses inspire us to adapt the PID control techniques to the field of deep network optimization. To the best of our knowledge, we are the first to introduce the idea of PID into the field of deep learning as a new optimizer. As we will see later in this paper, the proposed optimizer inherits the advantages of the PID controller and stays simple and efficient.

3. PID and Deep Network Optimization

In this section, we disclose the connections between PID control and SGD based deep optimization. Such connections motivate us to propose a new optimization method to accelerate the training of DNNs. Updating the weights in a deep network can be viewed as deploying many PID controllers to drive the system to reach an equilibrium.

3.1. General Connections

The PID controller computes a control variable $u(t)$ based on the current, past and future (i.e., derivative) of the error $e(t)$, as shown in Eq. (1).

Deep learning aims to learn an approximation function or mapping function $f$ with parameters $\theta$ to map the input $x$ to the desired output $y$, i.e., $y = f(x, \theta)$, assuming that there are (complex) relationships or causality between $x$ and $y$. With enough training data, deep learning can train a network with millions of parameters (weights $w$) to fit those complex relationships, which cannot be formulated using analytical functions. Usually, a loss function $L$ will be defined based on the desired output $y$ and the predicted output $f(x, \theta)$ to measure whether the goal is reached. The loss affects the weights by performing "backward propagation of errors" [30]. That is, it distributes the error to each node by calculating the gradients of the weights. If the loss $L$ is not small enough, the network will update its weights $\theta$ based on the gradients $\partial L/\partial\theta$. Therefore, it is reasonable to associate the "error" in PID control with the "gradient" in deep learning. This procedure is iterated until $L$ converges or is small enough. Many optimizers have been proposed to minimize the loss $L$ by updating $\theta$ using the gradients $\partial L/\partial\theta$, including SGD, SGD-Momentum, Adam, etc.

From the above discussions, we can see that deep network optimization shares high similarity with PID based control. Both of them update the system/network based on the difference/loss between the actual output and the desired output. The feedback in PID control corresponds to the back-propagation in network optimization. The major difference is that the PID controller computes the update using the system error $e(t)$, while deep network optimizers determine the updates based on the gradient $\partial L/\partial\theta$. If we view the gradient $\partial L/\partial\theta$ as the incarnation of the error $e(t)$, the PID controller can be fully connected with DNN optimization. In the following, we will see that SGD, SGD-Momentum and Nesterov's Momentum can all be explained as a kind of PID controller.

3.2. SGD is a P Controller

SGD and its variants are probably the most widely used optimization algorithms for DNN optimization. The parameter update rule of SGD from time (i.e., iteration) $t$ to time $t + 1$ is given by:

$$\theta_{t+1} = \theta_t - r\,\partial L_t/\partial\theta_t, \tag{2}$$

where $r$ is the learning rate.
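The correspondence between Eq. (1) and SGD can be illustrated with a small discrete-time sketch. The class `DiscretePID`, the toy loss L(θ) = 0.5 θ², and the sign convention (the error is taken to be the negative gradient, so that the control output is added to θ) are assumptions made here for illustration; they are not taken from the paper. With `ki = kd = 0` and `kp` equal to the learning rate `r`, the controller output is exactly the SGD step.

```python
class DiscretePID:
    """Discrete-time version of Eq. (1) with a rectangular integral and a backward difference."""

    def __init__(self, kp, ki=0.0, kd=0.0, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample yet
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta**2 (illustrative assumption)."""
    return theta


def run_sgd(theta, r, steps):
    """Plain SGD trajectory."""
    trajectory = []
    for _ in range(steps):
        theta = theta - r * grad(theta)
        trajectory.append(theta)
    return trajectory


def run_p_controller(theta, r, steps):
    """P-only controller driven by the negative gradient as its error signal."""
    pid = DiscretePID(kp=r)
    trajectory = []
    for _ in range(steps):
        theta = theta + pid.update(-grad(theta))
        trajectory.append(theta)
    return trajectory


if __name__ == "__main__":
    sgd = run_sgd(1.0, r=0.1, steps=20)
    p_only = run_p_controller(1.0, r=0.1, steps=20)
    print(max(abs(a - b) for a, b in zip(sgd, p_only)))
```

Setting `ki` or `kd` to non-zero values in the same class gives the integral and derivative contributions discussed in the following sections.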
Figure 1. The connection between control system and deep model training, and the connection between PID controller and SGD-Momentum.

3.3. Momentum Optimization is a PI Controller

SGD-Momentum is able to reach the objective more quickly than SGD along the small but consistent directions, resulting in a faster convergence speed. Its parameter update rule is given by:

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\theta_t \\ \theta_{t+1} = \theta_t + V_{t+1}, \end{cases} \tag{3}$$

where $V_t$ is the accumulation of history gradients, and $\alpha \in (0, 1)$ is the rate of moving average decay.

With some mathematical tricks (Sum Formula for a Sequence of Numbers [31]), we can remove $V_t$ from Eq. (3), and rewrite the update rule as:

$$\theta_{t+1} = \theta_t - r\,\partial L_t/\partial\theta_t - r\sum_{i=0}^{t-1} \left(\partial L_i/\partial\theta_i\,\alpha^{t-i}\right). \tag{4}$$

One can see that the update of parameters relies on both the present gradient $r\,\partial L_t/\partial\theta_t$ and the integral of past gradients $r\sum_{i=0}^{t-1} (\partial L_i/\partial\theta_i\,\alpha^{t-i})$. The only difference from a standard PI controller is that there is a decay term $\alpha$ in the I term. This difference arises because deep learning algorithms use a mini-batch of training examples to compute the gradient, and thus the gradients are stochastic. The decay term $\alpha$ is introduced to forget the gradients far away from the present in order to reduce noise. Overall, SGD-Momentum can be viewed as a PI controller.

3.4. Nesterov's Momentum Optimization is a PI Controller with larger P

The Nesterov's Momentum update rule is given by:

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial(\theta_t + \alpha V_t) \\ \theta_{t+1} = \theta_t + V_{t+1}. \end{cases} \tag{5}$$

By using a variable transform $\hat\theta_t = \theta_t + \alpha V_t$, and expressing the update rule in terms of $\hat\theta$, we have:

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\hat\theta_t \\ \hat\theta_{t+1} = \hat\theta_t + (1 + \alpha)V_{t+1} - \alpha V_t. \end{cases} \tag{6}$$

Again, by using the Sum Formula for a Sequence of Numbers [31], we can have (the detailed derivation can be found in the supplementary file):

$$\hat\theta_{t+1} = \hat\theta_t - r(1 + \alpha)\,\partial L_t/\partial\hat\theta_t - r\alpha\sum_{i=1}^{t-1} \left(\alpha^{t-i}\,\partial L_i/\partial\hat\theta_i\right). \tag{7}$$
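The rewriting steps behind Eq. (4) and Eq. (6) can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the paper: it uses the toy loss L(θ) = 0.5 θ², starts from a zero velocity V_0 = 0, and the function names `check_eq4` and `check_eq6` are hypothetical. Each function runs the recursive rule and returns the largest gap between the recursion and the rewritten form.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta**2 (illustrative assumption)."""
    return theta


def check_eq4(r=0.1, alpha=0.9, steps=30, theta0=1.0):
    """Compare the recursion of Eq. (3) with the unrolled sum of Eq. (4), assuming V_0 = 0."""
    theta, v = theta0, 0.0
    grads = []
    max_gap = 0.0
    for t in range(steps):
        g = grad(theta)
        grads.append(g)
        v = alpha * v - r * g                     # Eq. (3), velocity update
        theta_next = theta + v                    # Eq. (3), parameter update
        past = sum(grads[i] * alpha ** (t - i) for i in range(t))
        unrolled = theta - r * g - r * past       # Eq. (4)
        max_gap = max(max_gap, abs(unrolled - theta_next))
        theta = theta_next
    return max_gap


def check_eq6(r=0.1, alpha=0.9, steps=30, theta0=1.0):
    """Compare Eq. (5) with its transformed form Eq. (6), where theta_hat = theta + alpha * V."""
    theta, v = theta0, 0.0
    theta_hat = theta + alpha * v
    max_gap = 0.0
    for _ in range(steps):
        g = grad(theta + alpha * v)               # gradient at the look-ahead point
        v_next = alpha * v - r * g                # Eq. (5), velocity update
        theta = theta + v_next                    # Eq. (5), parameter update
        theta_hat_next = theta_hat + (1 + alpha) * v_next - alpha * v  # Eq. (6)
        max_gap = max(max_gap, abs(theta_hat_next - (theta + alpha * v_next)))
        theta_hat, v = theta_hat_next, v_next
    return max_gap


if __name__ == "__main__":
    print("Eq. (4) gap:", check_eq4())
    print("Eq. (6) gap:", check_eq6())
```

Both gaps are expected to stay at the level of floating-point rounding, since the rewritten forms are algebraic identities of the recursions.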
One can see that, like SGD-Momentum, Nesterov's Momentum also uses the present gradient and the integral of past gradients to update the parameters, while the gain coefficient $K_p$ is larger than that in SGD-Momentum.

4. PID based Deep Optimization

4.1. The Overshoot Problem of SGD-Momentum

From Eq. (4) and Eq. (7), one can see that the Momentum will accumulate history gradients. However, if the weights should change their descending direction, the history gradients will lag the update of weights. Such a phenomenon caused by history gradients is called overshoot, which is defined in discrete-time control systems [15] as "the maximum peak value of the response curve measured from the desired response of the system". Mathematically, it is defined as:

$$\text{Overshoot} = \frac{\theta_{\max} - \theta^*}{\theta^*}, \tag{8}$$

where $\theta_{\max}$ and $\theta^*$ are the maximum and optimum values of the weight, respectively.

One commonly used test benchmark of overshoot is the first function of De Jong [32] because it is smooth, unimodal, and symmetric. The function can be defined as $f(x) = 0.1x_1^2 + 2x_2^2$, whose search domain is $-10 \le x_i \le 10$, $i = 1, 2$. This function has no local minimum other than its global minimum: $x^* = (0, 0)$, $f(x^*) = 0$.

We add a derivative (change of gradient) term to SGD-Momentum to build a simple PID optimizer:

A PID controller can effectively reduce the overshoot problem, as shown in Figure 2. Considering that the training of deep models is usually done in a mini-batch based manner, which may introduce noise in the computing of gradients, we also compute the moving average of the derivative part. The proposed PID optimizer updates parameter $\theta$ at iteration $(t + 1)$ by:

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\theta_t \\ D_{t+1} = \alpha D_t + (1 - \alpha)(\partial L_t/\partial\theta_t - \partial L_{t-1}/\partial\theta_{t-1}) \\ \theta_{t+1} = \theta_t + V_{t+1} + K_d D_{t+1}. \end{cases} \tag{10}$$

As can be seen from Eq. (10), our optimizer introduces one additional hyper-parameter $K_d$ compared with SGD-Momentum. Fortunately, this hyper-parameter $K_d$ can be well initialized by employing the theory of the Laplace Transform [22] with the Ziegler-Nichols [33] tuning method, as we describe in the following section.

4.3. Initialization of Hyper-parameter Kd

The Laplace Transform converts a function of the real variable $t$ (time) to a function of the complex variable $s$ (frequency). Denote by $F(s)$ the Laplace transform of $f(t)$. There is

$$F(s) = \int_0^\infty e^{-st} f(t)\,dt, \quad \text{for } s > 0. \tag{11}$$

Usually $F(s)$ is easier to solve than $f(t)$, and $f(t)$ can be recovered from $F(s)$ by the Inverse Laplace transform:

$$f(t) = \frac{1}{2\pi i} \lim_{T\to\infty} \int_{\gamma - iT}^{\gamma + iT} e^{st} F(s)\,ds$$
Figure 3. The evolution of the weight by PID optimizer. (Plot of $\theta(t)$ against $t$, marking $\theta_0$, $\theta^*$, $\theta_{\max}$ and $t_{\max}$.)

and

$$\begin{cases} (K_p + 1)/K_d = 2\zeta\omega_n \\ K_i/K_d = \omega_n^2, \end{cases} \tag{13}$$

where $\zeta$ and $\omega_n$ are the damping ratio and natural frequency of the system, respectively. In Figure 3, we show the evolution process of a weight as an example of $\theta(t)$. From Eq. (13), we have $K_i = \frac{(K_p + 1)^2}{4K_d\zeta^2}$. One can see that $K_i$ is a monotonically decreasing function of $\zeta$. Referring to the definition of overshoot in Eq. (8), one can see that $\zeta$ is monotonically decreasing with overshoot. Then $K_i$ is a monotonically increasing function of overshoot. So the more history error (Integral part) is used, the more overshoot the system will have. That is the reason why SGD-Momentum, which accumulates past gradients, will overshoot its target and spend more time during training.

As can be observed from Eq. (12), the term $\sin(\omega_n\sqrt{1 - \zeta^2}\,t + \arccos(\zeta))$ brings a periodic oscillation to the weight, with magnitude no more than 1. The term $e^{-\zeta\omega_n t}$ mainly controls the convergence rate. One should note the value of hyper-parameter $K_d$ in calculating the derivative $e^{-\zeta\omega_n} = e^{-\frac{K_p + 1}{2K_d}}$. It is easy to observe that the larger the derivative, the earlier the training convergence we will reach. However, when $K_d$ gets too large, the system will be fragile. In practice, we set the hyper-parameter $K_d$ based on the Ziegler-Nichols optimum setting rule [33], which has been widely used by engineers in PID feedback control since its origin in the 1940s.

According to the Ziegler-Nichols rule, the ideal setup of $K_d$ should be one third of the oscillation period, which means $K_d = \frac{1}{3}T$, where $T$ is the period of oscillation. From Eq. (12), we can get $T = \frac{2\pi}{\omega_n\sqrt{1 - \zeta^2}}$. If we make a simplification that the $\alpha$ in Momentum is equal to 1, then $K_i = K_p = r$. Combined with Eq. (13), $K_d$ will have a closed form solution:

$$K_d = 0.25r + 0.5 + \left(1 + \frac{16}{9}\pi^2\right)/r. \tag{14}$$

In practice, we can start with this ideal setting of $K_d$ and change it slightly when using different network models to train on different datasets.

5. Experimental Results

In this section, we first trained an MLP on the MNIST handwritten digit dataset in Section 5.2 to show the advantage of the PID optimizer, and then trained CNNs on the CIFAR datasets in Section 5.3 to demonstrate that the PID optimizer is competitive with SGD-Momentum in accuracy but with much faster training speed. To further validate our PID optimizer on a larger dataset, in Section 5.4 we performed experiments on the Tiny-ImageNet dataset [36]. The results showed that our PID optimizer can generalize to modern networks and datasets. It should be noted that except for the additional hyper-parameter $K_d$, which is set by Eq. (14), all the other hyper-parameters in our PID optimizer are set the same as in SGD-Momentum. The learning rate starts from 0.01 and is divided by 10 when the error plateaus. The source code of our PID optimizer can be found at https://github.com/tensorboy/PIDOptimizer.
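As a rough illustration of how Eq. (10) and Eq. (14) fit together, the sketch below transcribes both formulas literally, as printed in this paper, into plain Python. The function names `kd_init`, `dejong_grad` and `pid_step` are hypothetical, parameters are handled as plain lists, and the sign convention and scaling of the derivative term in a practical implementation (such as the authors' released code) may differ from this transcription.

```python
import math


def kd_init(r):
    """Initial K_d from Eq. (14), as printed, for learning rate r."""
    return 0.25 * r + 0.5 + (1.0 + 16.0 / 9.0 * math.pi ** 2) / r


def dejong_grad(x):
    """Gradient of the benchmark f(x) = 0.1 * x1**2 + 2 * x2**2 from Section 4.1."""
    return [0.2 * x[0], 4.0 * x[1]]


def pid_step(theta, v, d, g, g_prev, r, alpha, kd):
    """One update of Eq. (10), applied element-wise.

    theta, v, d : current parameters, velocity (I part) and smoothed derivative (D part)
    g, g_prev   : current and previous gradients
    """
    v_new = [alpha * vi - r * gi for vi, gi in zip(v, g)]
    d_new = [alpha * di + (1.0 - alpha) * (gi - gpi)
             for di, gi, gpi in zip(d, g, g_prev)]
    theta_new = [ti + vi + kd * di for ti, vi, di in zip(theta, v_new, d_new)]
    return theta_new, v_new, d_new


if __name__ == "__main__":
    r, alpha = 0.01, 0.9
    theta = [-5.0, 2.0]
    g_prev = dejong_grad(theta)  # assumption: first step sees a zero gradient change
    theta, v, d = pid_step(theta, [0.0, 0.0], [0.0, 0.0],
                           dejong_grad(theta), g_prev, r, alpha, kd=0.1)
    print("K_d from Eq. (14):", kd_init(r))
    print("theta after one step:", theta)
```

Keeping `kd` as an explicit argument of `pid_step` reflects the remark above that the value from Eq. (14) is a starting point that may be adjusted per model and dataset.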
Table 1. Test errors (%) and training epochs of PID and SGD-Momentum (SGD-M) on CIFAR10 and CIFAR100. Each error and epoch entry is given as PID/SGD-M.

Model              Depth-k  Params (M)  Runs  CIFAR10 error  CIFAR10 epochs  CIFAR100 error  CIFAR100 epochs
ResNet [13]        110      1.7         5     6.23/6.43      239/281         24.95/25.16     237/293
ResNet [13]        1202     10.2        5     7.81/7.93      230/293         27.93/27.82     251/296
PreActResNet [23]  164      1.7         5     5.23/5.46      230/271         24.17/24.33     241/282
ResNeXt29 [35]     8-64     34.43       10    3.65/3.43      221/294         17.46/17.77     232/291
ResNeXt29 [35]     16-64    68.16       10    3.42/3.58      209/289         17.11/17.31     229/283
WRN [24]           16-8     11          10    4.42/4.81      213/290         21.93/22.07     229/283
WRN [24]           28-20    36.5        10    4.27/4.17      208/290         20.21/20.50     221/295
DenseNet [14]      100-12   0.8         10    3.83/4.30      196/291         19.97/20.20     213/294
DenseNet [14]      190-40   25.6        10    3.11/3.32      194/293         16.95/17.17     208/297
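The epoch counts in Table 1 can be turned into relative speedups with a short script. This is an illustrative sketch: the dictionary `EPOCHS` copies the epoch columns of Table 1, the helper names are hypothetical, and "speedup" is taken here to mean the ratio of SGD-Momentum epochs to PID epochs minus one, which is one plausible reading of the acceleration figures quoted in the text.

```python
# (PID epochs, SGD-M epochs) copied from the epoch columns of Table 1.
EPOCHS = {
    "ResNet 110":       {"CIFAR10": (239, 281), "CIFAR100": (237, 293)},
    "ResNet 1202":      {"CIFAR10": (230, 293), "CIFAR100": (251, 296)},
    "PreActResNet 164": {"CIFAR10": (230, 271), "CIFAR100": (241, 282)},
    "ResNeXt29 8-64":   {"CIFAR10": (221, 294), "CIFAR100": (232, 291)},
    "ResNeXt29 16-64":  {"CIFAR10": (209, 289), "CIFAR100": (229, 283)},
    "WRN 16-8":         {"CIFAR10": (213, 290), "CIFAR100": (229, 283)},
    "WRN 28-20":        {"CIFAR10": (208, 290), "CIFAR100": (221, 295)},
    "DenseNet 100-12":  {"CIFAR10": (196, 291), "CIFAR100": (213, 294)},
    "DenseNet 190-40":  {"CIFAR10": (194, 293), "CIFAR100": (208, 297)},
}


def speedup(pid_epochs, sgdm_epochs):
    """Relative speedup of PID over SGD-M (assumed definition: epoch ratio minus one)."""
    return sgdm_epochs / pid_epochs - 1.0


def summarize(dataset):
    """Return (mean speedup, model with the largest speedup, largest speedup)."""
    values = {model: speedup(*epochs[dataset]) for model, epochs in EPOCHS.items()}
    best = max(values, key=values.get)
    mean = sum(values.values()) / len(values)
    return mean, best, values[best]


if __name__ == "__main__":
    for name in ("CIFAR10", "CIFAR100"):
        mean, best, top = summarize(name)
        print(f"{name}: mean {mean:.1%}, max {top:.1%} ({best})")
```

Other definitions, such as the fraction of epochs saved relative to SGD-Momentum, would give smaller percentages from the same table.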
5.3. Results on CIFAR datasets

We then compared the PID and SGD-Momentum optimizers on CIFAR10 and CIFAR100 by using five state-of-the-art CNN models, including ResNet [13], PreActResNet [23], ResNeXt29 [35], WRN [24], and DenseNet [14]. The results are summarized in Table 1. The second column lists the depth of those networks, while the third column lists the number of parameters for each network model. The fourth column indicates the number of runs used to calculate the average test error. In the fifth and sixth columns of Table 1, we present the average test errors on CIFAR10 and the numbers of epochs needed by PID and SGD-Momentum to achieve the reported test errors for the first time (i.e., the least number of epochs to reach the best accuracy). The last two columns of Table 1 present such comparisons on CIFAR100.

From Table 1, we can make the following observations. First, our proposed PID optimizer achieves lower test errors than SGD-Momentum in most settings on the two CIFAR datasets; the exceptions in Table 1 are ResNet with depth 1202 on CIFAR100, and ResNeXt29 (8-64) and WRN (28-20) on CIFAR10. Second, the PID optimizer converges faster (with fewer epochs) than SGD-Momentum to reach the best results. In particular, our PID optimizer has on average 35% and up to 50% acceleration compared with SGD-Momentum. This demonstrates the importance of the change of gradient, which can be exploited to reduce the overshoot problem and speed up the learning process of DNNs. Figure 5 shows the detailed training statistics of the two methods on CIFAR10 with DenseNet 190-40 (190 layers with growth rate of 40) [14]. One can see that the PID optimizer converges faster than SGD-Momentum with lower loss and higher accuracy.

5.4. Experiments on Tiny-ImageNet

To further demonstrate the effectiveness of our PID optimizer, we employed the DenseNet 190-40 architecture to perform experiments on the Tiny-ImageNet dataset. Figure 6 shows the curves of training loss and accuracy over epochs, as well as the validation loss and accuracy, for the PID and SGD-Momentum optimizers. The learning rate of SGD-Momentum and PID was fixed to 0.01. Training was conducted for 150 epochs using batch size 64. The results are averaged over 5 runs.

Figure 6. PID vs. SGD-Momentum on the Tiny-ImageNet dataset by using DenseNet 190-40. Top row: the curves of training loss and validation loss. Bottom row: the curves of training accuracy and validation accuracy.

Conclusions similar to those on the CIFAR datasets can be made. In both training and validation, PID converges faster than SGD-Momentum, has lower loss and achieves higher accuracy. Such results confirm the generalization capability of the PID based DNN optimizer to large-scale datasets.

6. Conclusion

Inspired by the prominent success of the PID controller in the field of automatic control, we investigated its connections with stochastic optimizers such as SGD and its variants, and presented a novel PID controller approach to deep network optimization. The proposed PID optimizer exploits the present, past and change information of gradients to update the network parameters, greatly reducing the overshoot problem of SGD-Momentum and accelerating the learning process of DNNs. Our experiments on the MNIST, CIFAR and Tiny-ImageNet datasets validated that the proposed PID optimizer is 30% to 50% faster than SGD-Momentum, while resulting in a lower error rate. In future work, we will investigate how to adapt our PID optimizer to other network architectures such as LSTM and RNN, and how to associate the PID optimizer with an adaptive learning rate for DNN optimization.

References

[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. IEEE International Journal of Computer Vision (IJCV), 2015.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[4] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[6] Léon Bottou. Online learning in neural networks. Chapter Online Learning and Stochastic Approximations, pages 9–42. Cambridge University Press, 1998.
[7] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.
[8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[9] Geoffrey Hinton, N. Srivastava, and Kevin Swersky. Lecture 6a overview of mini-batch gradient descent.
[10] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference for Learning Representations (ICLR), 2014.
[11] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems (NIPS), 2017.
[12] Daniel Jiwoong Im, Michael Tao, and Kristin Branson. An empirical analysis of the optimization of deep network loss surfaces. In International Conference for Learning Representations (ICLR), 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] Katsuhiko Ogata. Discrete-time control systems, volume 2. Prentice Hall, Englewood Cliffs, NJ, 1995.
[16] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
[17] L. Wang, T. J. D. Barnes, and W. R. Cluett. New frequency-domain design method for PID controllers. IEEE Control Theory and Applications, Jul 1995.
[18] Kiam Heong Ang, G. Chong, and Yun Li. PID control system analysis, design, and technology. 13:559–576, 2005.
[19] Pan Zhao, Jiajia Chen, Yan Song, Xiang Tao, Tiejuan Xu, and Tao Mei. Design of a control system for an autonomous vehicle based on adaptive-PID. International Journal of Advanced Robotic Systems, 9(2):44, 2012.
[20] A. L. Salih, M. Moghavvemi, H. A. F. Mohamed, and K. S. Gaeid. Modelling and PID controller design for a quadrotor unmanned air vehicle. In IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), volume 1, pages 1–5, May 2010.
[21] Paolo Rocco. Stability of PID control for industrial robot arms. IEEE Transactions on Robotics and Automation, 1996.
[22] Pierre Simon de Laplace. Théorie analytique des probabilités, volume 7. Courcier, 1820.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In IEEE European Conference on Computer Vision (ECCV), 2016.
[24] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[25] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, 1983.
[26] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems (NIPS), 2014.
[27] J. Clerk Maxwell. On governors. Proceedings of the Royal Society of London, 16:270–283, 1867.
[28] Nicolas Minorsky. Directional stability of automatically steered bodies. Journal of ASNE, 1922.
[29] Emre Sariyildiz, Haoyong Yu, and Kouhei Ohnishi. A practical tuning method for the robust PID controller with velocity feed-back. Machines, 3(3):208–222, 2015.
[30] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 1986.
[31] Murray R. Spiegel. Advanced mathematics. McGraw-Hill, Incorporated, 1991.
[32] K. A. De Jong. An analysis of the behavior of a class of genetic adaptive systems. Doctoral Dissertation, University of Michigan, 1975.
[33] John G. Ziegler and Nathaniel B. Nichols. Optimum settings for automatic controllers. Trans. ASME, 64(11), 1942.
[34] George E. Robert and Hyman Kaufman. Table of Laplace transforms. Saunders, 1966.
[35] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.