An A PID Controller CVPR 2018 Paper

This document proposes a PID controller approach for optimizing deep neural networks during training. It summarizes that SGD-Momentum, a commonly used optimization algorithm, suffers from overshooting problems that hinder convergence. The paper draws connections between PID controllers and stochastic optimization methods. It then introduces a PID-based optimization algorithm that leverages past, current, and change in gradients to update network parameters. Experiments on benchmark datasets show the PID approach achieves up to 50% faster training speed while maintaining competitive accuracy compared to SGD-Momentum.

A PID Controller Approach for Stochastic Optimization of Deep Networks

Wangpeng An 1,2, Haoqian Wang 1,3, Qingyun Sun 4, Jun Xu 2, Qionghai Dai 1,3, and Lei Zhang ∗2
1 Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
2 Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong, China
3 Shenzhen Institute of Future Media Technology, Shenzhen, China
4 Stanford University, CA, USA
1 [email protected], 1 [email protected], 2 [email protected]

∗ Corresponding author. This work is supported by HK RGC GRF grant (PolyU 152135/16E), the NSFC fund (61571259, 61531014), Shenzhen Science and Technology Project under Grant (GGFW2017040714161462, JCYJ20170307153051701) and the National High-tech R&D Program of China (863Program, 2015AA015901).

Abstract

Deep neural networks have demonstrated their power in many computer vision applications. State-of-the-art deep architectures such as VGG, ResNet, and DenseNet are mostly optimized by the SGD-Momentum algorithm, which updates the weights by considering their past and current gradients. Nonetheless, SGD-Momentum suffers from the overshoot problem, which hinders the convergence of network training. Inspired by the prominent success of the proportional-integral-derivative (PID) controller in automatic control, we propose a PID approach for accelerating deep network optimization. We first reveal the intrinsic connections between SGD-Momentum and the PID based controller, then present the optimization algorithm which exploits the past, current, and change of gradients to update the network parameters. The proposed PID method greatly reduces the overshoot phenomenon of SGD-Momentum, and it achieves up to 50% acceleration on popular deep network architectures with competitive accuracy, as verified by our experiments on the benchmark datasets including CIFAR10, CIFAR100, and Tiny-ImageNet.

1. Introduction

Benefitting from the availability of large-scale visual datasets such as ImageNet [1], deep neural networks (DNN), especially deep convolutional neural networks (CNNs), have significantly improved the system accuracy in many computer vision problems, such as image classification [2], object detection [3], and face recognition [4]. Despite the great successes of deep learning, the training of deep networks on large-scale datasets is usually computationally expensive, costing several days or even weeks on GPU-equipped high-end PCs. It is therefore important to investigate how to accelerate the training of deep models without sacrificing accuracy, which can save time and memory cost, particularly for resource-limited applications.

The key component of DNN training is the optimizer, which defines how the millions or even billions of parameters of a deep model are updated. The learning rate is one of the most important hyper-parameters to train a DNN [5]. Based on how the learning rate is set, deep learning optimizers can be categorized into two groups: hand-tuned learning rate optimizers such as stochastic gradient descent (SGD) [6], SGD Momentum [7] and Nesterov's Momentum [7], and auto learning rate optimizers such as AdaGrad [8], RMSProp [9] and Adam [10]. Auto learning rate optimizers adaptively tune an individual learning rate for each parameter. Such a goal of fine adaptation is attractive and it is expected to yield better deep model learning results. However, the recent findings by Wilson et al. [11] show that hand-tuned SGD-Momentum achieves better results at the same or even faster speed. The hypothesis put forth here is that adaptive methods may converge to different local minima [12]. It is also noted that most of the best-performing deep models such as ResNet [13] and DenseNet [14] are usually trained by SGD-Momentum.

The strategy of SGD-Momentum is to consider both the past and present gradients to update the network parameters. However, SGD-Momentum suffers from the overshoot problem [15], which refers to the phenomenon in which a weight's value greatly exceeds its target value and does not change its update direction. Such an overshoot problem hinders the

convergence of SGD-Momentum, and costs more training time and resources. It is of significant importance to investigate whether we can design a new DNN optimizer which is free of the overshoot problem and has faster convergence speed while maintaining good accuracy.

It has been found that many optimization algorithms popularly employed in machine learning studies share certain similarity to the classic control methods studied since the 1950s [16]. In the literature of automatic control, the feedback control system plays a key role, while the proportional-integral-derivative (PID) controller is the most commonly used feedback control mechanism due to its simplicity, functionality, and broad applicability [17]. More than 90% of industrial controllers are implemented based on PID [18], including self-driving cars [19], unmanned flying vehicles [20], robotics [21], etc. The basic idea of PID control is that the control action should be proportional to the current error (the difference between system output and desired output), the integral of the past error over time, and the derivative of the error, which represents the future trend.

Though the PID controller has gained massive successes in different industries of control and automation, little study has been done on its connections with stochastic optimization, as well as its potential applications to DNN training. In this paper, we make the first attempt along this line. We first bridge the gap between the PID controller and stochastic optimization methods such as SGD, SGD-Momentum and Nesterov's Momentum, and consequently develop a PID approach for DNN optimization. Compared with SGD-Momentum, which utilizes the past and current gradients, the proposed PID optimization approach also utilizes the gradient changes to update the network. We further introduce the Laplace Transform [22] to initialize the hyper-parameter introduced in our method, resulting in a simple yet effective stochastic DNN optimization algorithm. The major contributions of this work are summarized as follows.

• By linking the calculation of errors in a feedback control system and the calculation of gradients in network updating, we reveal the intrinsic connections between deep network optimization and feedback system control, and show that SGD-Momentum is a special case of the PID controller with only proportional (P) and integral (I) components.

• We then propose a PID approach to optimize DNN by utilizing the present, past and changing information of the gradient. The classical Laplace Transform is introduced to understand and initialize the hyper-parameter in our algorithm.

• We systematically evaluate the proposed approach, and the extensive experiments on CIFAR10, CIFAR100 and Tiny-ImageNet datasets demonstrate the efficiency and effectiveness of our PID approach.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 connects the PID controller with DNN optimization. Section 4 introduces the proposed PID approach for DNN optimization. Experimental results and detailed analysis are reported in Section 5. Section 6 concludes this paper.

2. Related Work

2.1. Deep Learning Optimization

The learning rate is the most important hyper-parameter to train deep neural networks [9]. Based on how the learning rate is set, deep learning optimization methods can be categorized into two classes. The first class consists of fixed learning rate methods such as SGD [6], SGD Momentum [7], and Nesterov's Momentum [7], and the second class includes auto learning rate methods such as AdaGrad [8], RMSProp [9], and Adam [10]. Our work is based on fixed learning rate methods, considering that the current state-of-the-art results on CIFAR10, CIFAR100, ImageNet, PASCAL VOC and MS COCO datasets were mostly obtained by Residual Neural Networks [13, 14, 23, 24] trained with SGD Momentum.

Stochastic Gradient Descent (SGD) [6] is a widely used optimization algorithm for machine learning in general, especially for deep learning. SGD usually uses a fixed learning rate. This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples), and that noise does not vanish even when the loss arrives at a minimum.

SGD Momentum [7] is designed to accelerate learning, especially in the case of small and consistent gradients. The momentum algorithm accumulates an exponentially decayed moving average of past gradients and continues to move in the consistent direction. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space. A hyper-parameter $\alpha \in (0, 1)$ determines how much the past gradients contribute to the current update of the weights.

Nesterov's Momentum [7] is a variant of the momentum algorithm that was motivated by Nesterov's accelerated gradient method [25]. The difference between Nesterov momentum and regular momentum lies in where the gradient is evaluated. With Nesterov's momentum, the gradient is estimated after the current velocity is applied. Thus one can interpret Nesterov's momentum as attempting to add a correction factor to the standard method of momentum. Recently, Nesterov's Momentum method has been characterized as a second order ordinary differential equation in the small step limit [26].
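
The three fixed learning rate update rules described in Section 2.1 can be compared side by side. The following is a minimal sketch, not the authors' code: the one-dimensional quadratic loss, the function names, and the values r = 0.1 and α = 0.9 are assumptions chosen only for illustration.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta**2 (assumed for illustration)."""
    return theta


def sgd_step(theta, r):
    # Plain SGD: move against the current gradient only.
    return theta - r * grad(theta)


def momentum_step(theta, v, r, alpha):
    # Momentum: v is an exponentially decayed accumulation of past gradients.
    v = alpha * v - r * grad(theta)
    return theta + v, v


def nesterov_step(theta, v, r, alpha):
    # Nesterov: the gradient is evaluated after the current velocity is applied.
    v = alpha * v - r * grad(theta + alpha * v)
    return theta + v, v


def run(method, steps=200, r=0.1, alpha=0.9, theta0=1.0):
    """Return the trajectory of theta for 'sgd', 'momentum' or 'nesterov'."""
    theta, v = theta0, 0.0
    path = [theta]
    for _ in range(steps):
        if method == "sgd":
            theta = sgd_step(theta, r)
        elif method == "momentum":
            theta, v = momentum_step(theta, v, r, alpha)
        elif method == "nesterov":
            theta, v = nesterov_step(theta, v, r, alpha)
        else:
            raise ValueError(f"unknown method: {method}")
        path.append(theta)
    return path
```

With these assumed settings, SGD approaches the minimum at 0 from one side, whereas the momentum trajectory crosses 0 and oscillates before settling.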

2.2. PID Controller

The PID controller exploits the present, past and future information of the prediction error to control a feedback system [18]. The PID based controller originated in the 19th century for speed control. The theoretical foundation for the operation of PID was first described by Maxwell in 1868 in his seminal paper "On Governors" [27]. Minorsky [28] then gave this a mathematical formulation. Over the years, many advanced control algorithms have also been proposed. However, most industrial controllers are implemented with a PID algorithm because it is simple, robust and easy to use [29]. A PID controller continuously calculates an error $e(t)$, which is the difference between the desired optimal output and a measured system output, and applies a correction $u(t)$ to the system based on the proportional (P), integral (I), and derivative (D) terms of $e(t)$. Mathematically, there is:

$$u(t) = K_p e(t) + K_i \int_0^t e(t)\,dt + K_d \frac{d}{dt} e(t), \tag{1}$$

where $K_p$, $K_i$ and $K_d$ are the gain coefficients on the P, I and D terms, respectively.

One can see that the error $e(t)$, defined as the difference between the desired value and the actual output, has the same spirit as the gradient used in deep learning optimization. The coefficients $K_p$, $K_i$ and $K_d$ determine the contributions of present, past and future errors to the current correction. Such analyses inspire us to adapt the PID control techniques to the field of deep network optimization. To the best of our knowledge, we are the first to introduce the idea of PID into the field of deep learning as a new optimizer. As we will see later in this paper, the proposed optimizer inherits the advantages of the PID controller and stays simple and efficient.

3. PID and Deep Network Optimization

In this section, we disclose the connections between PID control and SGD based deep optimization. Such connections motivate us to propose a new optimization method to accelerate the training of DNNs. Updating the weights in a deep network can be viewed as deploying many PID controllers to drive the system to reach an equilibrium.

3.1. General Connections

In Figure 1, we show the flowchart of a PID controller based feedback control system, and the flowchart of SGD-Momentum based DNN optimization. The goal of a control system is to measure the output system status consecutively and update it to the desired status by using a control unit. In feedback control, the output will affect the input quantity, and the controller will make appropriate updates of the system status based on the error $e(t)$ between the measured system status and the desired status. To reach this goal, the PID controller computes a control variable $u(t)$ based on the current, past and future (i.e., derivative) of the error $e(t)$, as shown in Eq. (1).

Deep learning aims to learn an approximation function or mapping function $f$ with parameters $\theta$ to map the input $x$ to the desired output $y$, i.e., $y = f(x, \theta)$, assuming that there are (complex) relationships or causality between $x$ and $y$. With enough training data, deep learning can train a network with millions of parameters (weights $w$) to fit those complex relationships which cannot be formulated using analytical functions. Usually, a loss function $L$ will be defined based on the desired output $y$ and the predicted output $f(x, \theta)$ to measure whether the goal is reached. The loss affects the weights by performing "backward propagation of errors" [30]. That is, it distributes the error to each node by calculating the gradients of weights. If the loss $L$ is not small enough, the network will update its weights $\theta$ based on the gradients $\partial L/\partial\theta$. Therefore, it is reasonable to associate the "error" in PID control with the "gradient" in deep learning. This procedure is iterated till $L$ converges or is small enough. Many optimizers have been proposed to minimize the loss $L$ by updating $\theta$ using the gradients $\partial L/\partial\theta$, including SGD, SGD-Momentum, Adam, etc.

From the above discussions, we can see that deep network optimization shares high similarity to PID based control. Both of them update the system/network based on the difference/loss between actual output and desired output. The feedback in PID control corresponds to the back-propagation in network optimization. The major difference is that the PID controller computes the update using the system error $e(t)$, while deep network optimizers determine the updates based on the gradient $\partial L/\partial\theta$. If we view the gradient $\partial L/\partial\theta$ as the incarnation of the error $e(t)$, the PID controller can be fully connected with DNN optimization. In the following, we will see that SGD, SGD-Momentum and Nesterov's Momentum can all be explained as a kind of PID controller.

3.2. SGD is a P Controller

SGD and its variants are probably the most widely used optimization algorithms for DNN optimization. The parameter update rule of SGD from time (i.e., iteration) $t$ to time $t + 1$ is given by:

$$\theta_{t+1} = \theta_t - r\,\partial L_t/\partial\theta_t, \tag{2}$$

where $r$ is the learning rate. By viewing the gradient $\partial L_t/\partial\theta_t$ as the error $e(t)$, and comparing Eq. (2) to the PID controller in Eq. (1), one can see that SGD only uses the present gradient to update the weights. It is a type of P controller with $K_p = r$.
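
The correspondence between Eq. (1) and Eq. (2) can be made concrete with a small discrete-time controller. The sketch below is illustrative only: the class `DiscretePID`, the unit time step, the rule that the derivative is zero on the first call, and the toy loss are all assumptions, not part of the paper. It shows that feeding the gradient in as the error, with only the P gain set to r, reproduces the SGD update.

```python
class DiscretePID:
    """Discrete-time counterpart of Eq. (1), with a unit time step by default."""

    def __init__(self, kp, ki, kd, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # running sum approximating the integral term
        self.prev_error = None   # no previous error before the first call

    def control(self, error):
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0     # assumed: no derivative on the first call
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * (theta - 3)**2 (an assumption)."""
    return theta - 3.0


def sgd(theta, r, steps):
    for _ in range(steps):
        theta = theta - r * grad(theta)            # Eq. (2)
    return theta


def p_controller(theta, r, steps):
    pid = DiscretePID(kp=r, ki=0.0, kd=0.0)        # P term only, K_p = r
    for _ in range(steps):
        theta = theta - pid.control(grad(theta))   # gradient plays the role of e(t)
    return theta
```

Calling `sgd(0.0, 0.1, 50)` and `p_controller(0.0, 0.1, 50)` yields the same value, since the I and D gains are zero.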

Figure 1. The connection between control system and deep model training, and the connection between PID controller and SGD-Momentum.

3.3. Momentum Optimization is a PI Controller

SGD-Momentum is able to reach the objective more quickly than SGD along the small but consistent directions, resulting in a faster convergence speed. Its parameter update rule is given by:

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\theta_t \\ \theta_{t+1} = \theta_t + V_{t+1}, \end{cases} \tag{3}$$

where $V_t$ is the accumulation of history gradients, and $\alpha \in (0, 1)$ is the rate of moving average decay.

With some mathematical tricks (Sum Formula for a Sequence of Numbers [31]), we can remove $V_t$ from Eq. (3), and rewrite the update rule as:

$$\theta_{t+1} = \theta_t - r\,\partial L_t/\partial\theta_t - r \sum_{i=0}^{t-1} \alpha^{t-i}\,\partial L_i/\partial\theta_i. \tag{4}$$

One can see that the update of parameters relies on both the present gradient ($r\,\partial L_t/\partial\theta_t$) and the integral of past gradients $r \sum_{i=0}^{t-1} \alpha^{t-i}\,\partial L_i/\partial\theta_i$. The only difference from a standard PI controller is that there is a decay term $\alpha$ in the I term. This difference is because deep learning algorithms use a mini-batch of training examples to compute the gradient, and thus the gradients are stochastic. The introduction of the decay term $\alpha$ is to forget the gradients far away from the present to reduce noise. Overall, SGD-Momentum can be viewed as a PI controller.

3.4. Nesterov's Momentum Optimization is a PI Controller with larger P

The Nesterov's Momentum update rule is given by:

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial(\theta_t + \alpha V_t) \\ \theta_{t+1} = \theta_t + V_{t+1}. \end{cases} \tag{5}$$

By using a variable transform $\hat\theta_t = \theta_t + \alpha V_t$, and expressing the update rule in terms of $\hat\theta$, we have:

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\hat\theta_t \\ \hat\theta_{t+1} = \hat\theta_t + (1 + \alpha) V_{t+1} - \alpha V_t. \end{cases} \tag{6}$$

Again, by using the Sum Formula for a Sequence of Numbers [31], we can have (the detailed derivation can be found in the supplementary file):

$$\hat\theta_{t+1} = \hat\theta_t - r(1 + \alpha)\,\partial L_t/\partial\hat\theta_t - r\alpha \sum_{i=1}^{t-1} \alpha^{t-i}\,\partial L_i/\partial\hat\theta_i. \tag{7}$$
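
The step from the recursive rule in Eq. (3) to the unrolled sum in Eq. (4) can be checked numerically. The sketch below is an illustration under stated assumptions: the gradient sequence is a fixed list of random numbers rather than the gradient of a real loss, $V_0 = 0$, and the function names are hypothetical. For any such sequence both forms should give the same final parameter up to rounding.

```python
import random


def momentum_recursive(grads, theta0, r, alpha):
    """Eq. (3): V_{t+1} = alpha*V_t - r*g_t, theta_{t+1} = theta_t + V_{t+1}."""
    theta, v = theta0, 0.0
    for g in grads:
        v = alpha * v - r * g
        theta = theta + v
    return theta


def momentum_unrolled(grads, theta0, r, alpha):
    """Eq. (4): a present-gradient (P) term plus a decayed sum of past gradients (I)."""
    theta = theta0
    for t, g_t in enumerate(grads):
        past = sum(alpha ** (t - i) * grads[i] for i in range(t))
        theta = theta - r * g_t - r * past
    return theta


rng = random.Random(0)                      # fixed seed for repeatability
grads = [rng.uniform(-1.0, 1.0) for _ in range(50)]
a = momentum_recursive(grads, theta0=0.5, r=0.01, alpha=0.9)
b = momentum_unrolled(grads, theta0=0.5, r=0.01, alpha=0.9)
print(abs(a - b))                           # difference between the two forms
```

The recursive form costs one multiply-add per step, while the unrolled form recomputes the whole sum, which is why optimizers keep the running accumulator $V_t$ in practice.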

One can see that like SGD-Momentum, Nesterov's Momentum also uses the present gradient and the integral of past gradients to update the parameters, while the gain coefficient $K_p$ is larger than that in SGD-Momentum.

4. PID based Deep Optimization

4.1. The Overshoot Problem of SGD-Momentum

From Eq. (4) and Eq. (7), one can see that Momentum will accumulate history gradients. However, if the weights should change their descending direction, the history gradients will lag the update of weights. Such a phenomenon caused by history gradients is called overshoot, which is defined in discrete-time control systems [15] as "the maximum peak value of the response curve measured from the desired response of the system". Mathematically, it is defined as:

$$\text{Overshoot} = \frac{\theta_{\max} - \theta^*}{\theta^*}, \tag{8}$$

where $\theta_{\max}$ and $\theta^*$ are the maximum and optimum values of the weight, respectively.

One commonly used test benchmark of overshoot is the first function of De Jong [32] because it is smooth, unimodal, and symmetric. The function can be defined as $f(x) = 0.1x_1^2 + 2x_2^2$, whose search domain is $-10 \le x_i \le 10$, $i = 1, 2$. There is no local minimum but a global minimum of this function: $x^* = (0, 0)$, $f(x^*) = 0$.

We add a derivative (change of gradient) term to SGD-Momentum to build a simple PID optimizer:

$$\text{PID} = \text{Momentum} + K_d\,(\partial f(x)/\partial x_c - \partial f(x)/\partial x_{c-1}), \tag{9}$$

where $c$ is the current iteration number for $x$. The simulation results by setting different values of $K_d$ in Eq. (9) are illustrated in Figure 2. The background is the loss-contour map; the redder, the bigger the loss value is, and the bluer, the smaller the loss value is. The x-axis and y-axis denote $x_1$ and $x_2$, respectively. Both $x_1$ and $x_2$ are initialized to $-10$. The yellow line shows the optimization route of SGD-Momentum, and the red line shows the route of the PID optimizer. One can see that SGD-Momentum has an obvious overshoot problem. With $K_d$ set to 0.1, 0.5 and 0.93, respectively, the PID optimizer exploits more "future" error (the change of gradients), and largely reduces the overshoot problem.

4.2. PID Optimizer for DNN

The toy example in Section 4.1 motivates us to propose a PID optimizer to accelerate the training of DNN. As we show in Eq. (4), SGD-Momentum is actually a PI controller which uses present and past gradient information. By adding a derivative term to introduce the future information, a PID controller can effectively reduce the overshoot problem, as shown in Figure 2. Considering that the training of deep models is usually in a mini-batch based manner, which may introduce noise in the computing of gradients, we also compute the moving average of the derivative part. The proposed PID optimizer updates parameter $\theta$ at iteration $(t + 1)$ by:

$$\begin{cases} V_{t+1} = \alpha V_t - r\,\partial L_t/\partial\theta_t \\ D_{t+1} = \alpha D_t + (1 - \alpha)(\partial L_t/\partial\theta_t - \partial L_{t-1}/\partial\theta_{t-1}) \\ \theta_{t+1} = \theta_t + V_{t+1} + K_d D_{t+1}. \end{cases} \tag{10}$$

As can be seen from Eq. (10), however, our optimizer introduces a hyper-parameter $K_d$ compared with SGD-Momentum. Fortunately, this hyper-parameter $K_d$ can be well initialized by employing the theory of the Laplace Transform [22] with the Ziegler-Nichols [33] tuning method, as we describe in the following section.

4.3. Initialization of Hyper-parameter $K_d$

The Laplace Transform converts a function of real variable $t$ (time) to a function of complex variable $s$ (frequency). Denote by $F(s)$ the Laplace transform of $f(t)$. There is

$$F(s) = \int_0^\infty e^{-st} f(t)\,dt, \quad \text{for } s > 0. \tag{11}$$

Usually $F(s)$ is easier to solve than $f(t)$, and $f(t)$ can be recovered from $F(s)$ by the Inverse Laplace transform:

$$f(t) = \frac{1}{2\pi i} \lim_{T \to \infty} \int_{\gamma - iT}^{\gamma + iT} e^{st} F(s)\,ds,$$

where $\gamma$ is a real number and $i$ is the imaginary unit. In practice, we could decompose a Laplace transform into known transforms of functions in the Laplace table [34], which includes most of the commonly used Laplace transforms, and then construct the inverse transform. With the Laplace Transform, we can convert the PID optimizer into its Laplace transformed functions of $s$, and then simplify the algebra. Once we find the transformed solution $F(s)$, we can invert the transform to obtain the required solution $f$ as a function of $t$.

A weight of a deep model node is initialized as a scalar $\theta_0$, and it is updated iteratively to reach its optimal value denoted by $\theta^*$. Then the optimization of each weight in a DNN can be simplified as a step response (from $\theta_0$ to $\theta^*$) in control theory. We can use the Laplace Transform as a guide to set $K_d$. Denote by $\theta(t)$ the time domain change of weight $\theta$. After some mathematical derivation (please refer to our supplementary file for the detailed derivation process), we have:

$$\theta(t) = \theta^* - \frac{(\theta^* - \theta_0)\,\sin\!\left(\omega_n \sqrt{1 - \zeta^2}\,t + \arccos(\zeta)\right)}{e^{\zeta\omega_n t}\,\sqrt{1 - \zeta^2}}, \tag{12}$$

and

$$\begin{cases} (K_p + 1)/K_d = 2\zeta\omega_n \\ K_i/K_d = \omega_n^2, \end{cases} \tag{13}$$

where $\zeta$ and $\omega_n$ are the damping ratio and natural frequency of the system, respectively. In Figure 3, we show the evolution process of a weight as an example of $\theta(t)$. From Eq. (13), we have $K_i = \frac{(K_p + 1)^2}{4 K_d \zeta^2}$. One can see that $K_i$ is a monotonically decreasing function of $\zeta$. Referring to the definition of overshoot in Eq. (8), one can see that $\zeta$ is monotonically decreasing with overshoot. Then $K_i$ is a monotonically increasing function of overshoot. So the more history error (Integral part), the more overshoot the system will have. That is the reason why SGD-Momentum, which accumulates past gradients, will overshoot its target and spend more time during training.

[Figure 2: three loss-contour panels titled "Small Kd", "Moderate Kd" and "Big Kd".]

Figure 2. The overshoot problem of momentum. The red and yellow lines are the results obtained by PID and SGD-Momentum, respectively.

[Figure 3: plot of θ(t) against t, marking θ0, θ*, θmax and tmax.]

Figure 3. The evolution of the weight by PID optimizer.

As can be observed from Eq. (12), the term $\sin(\omega_n \sqrt{1 - \zeta^2}\,t + \arccos(\zeta))$ brings a periodic oscillation to the weight, whose magnitude is no more than 1. The term $e^{-\zeta\omega_n t}$ mainly controls the convergence rate. One should note the value of the hyper-parameter $K_d$ in calculating the term $e^{-\zeta\omega_n} = e^{-\frac{K_p + 1}{2K_d}}$. It is easy to observe that the larger the derivative term, the earlier the training convergence we will reach. However, when $K_d$ gets too large, the system will be fragile. In practice, we set the hyper-parameter $K_d$ based on the Ziegler-Nichols optimum setting rule [33], which has been widely used by engineers in PID feedback control since its origin in the 1940s.

According to the Ziegler-Nichols rule, the ideal setup of $K_d$ should be one third of the oscillation period, which means $K_d = \frac{1}{3}T$, where $T$ is the period of oscillation. From Eq. (12), we can get $T = \frac{2\pi}{\omega_n \sqrt{1 - \zeta^2}}$. If we make the simplification that the $\alpha$ in Momentum is equal to 1, then $K_p = K_i = r$. Combined with Eq. (13), $K_d$ will have a closed form solution:

$$K_d = 0.25r + 0.5 + \left(1 + \frac{16}{9}\pi^2\right)/r. \tag{14}$$

In practice, we can start with this ideal setting of $K_d$ and change it slightly when using different network models to train on different datasets.

5. Experimental Results

In this section, we first trained an MLP on the MNIST handwritten digit dataset in Section 5.2 to show the advantage of the PID optimizer, and then trained CNNs on the CIFAR datasets in Section 5.3 to demonstrate that the PID optimizer is competitive with SGD-Momentum in accuracy but with much faster training speed. To further validate our PID optimizer on a larger dataset, in Section 5.4 we performed experiments on the Tiny-ImageNet dataset [36]. The results showed that our PID optimizer can generalize to modern networks and datasets. It should be noted that except for the additional hyper-parameter $K_d$, which is set by Eq. (14), all the other hyper-parameters in our PID optimizer are set the same as in SGD-Momentum. The learning rate starts from 0.01 and is divided by 10 when the error plateaus. The source code of our PID optimizer can be found at https://github.com/tensorboy/PIDOptimizer.
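
Independently of the referenced repository, the toy experiment of Section 4.1 can be approximated with a few lines of NumPy. This is a hedged sketch, not the authors' implementation: the learning rate r = 0.01, α = 0.9, the step count, and the function names are assumptions. One assumption matters in particular: Eq. (10) as printed adds $K_d D_{t+1}$, whereas the sketch subtracts it so that a positive $K_d$ opposes the gradient change, which is the damping role a D term plays in Eq. (1) when the gradient stands in for the error. Because $\theta^* = 0$ here, the relative overshoot of Eq. (8) is undefined, so the sketch reports how far $x_2$ travels past 0 instead.

```python
import numpy as np


def de_jong_grad(x):
    """Gradient of f(x) = 0.1*x1**2 + 2*x2**2, the test function of Section 4.1."""
    return np.array([0.2 * x[0], 4.0 * x[1]])


def optimize(kd, r=0.01, alpha=0.9, steps=500):
    """Return the path of x; kd = 0 gives plain SGD-Momentum as in Eq. (3)."""
    x = np.array([-10.0, -10.0])      # both coordinates start at -10
    v = np.zeros(2)                   # accumulated (integral-like) term V
    d = np.zeros(2)                   # moving average of the gradient change D
    prev_g = de_jong_grad(x)          # assumption: zero gradient change at step 0
    path = [x.copy()]
    for _ in range(steps):
        g = de_jong_grad(x)
        v = alpha * v - r * g
        d = alpha * d + (1.0 - alpha) * (g - prev_g)
        # Assumed sign convention: a positive kd opposes the gradient change.
        x = x + v - kd * d
        prev_g = g
        path.append(x.copy())
    return np.array(path)


momentum_path = optimize(kd=0.0)
pid_path = optimize(kd=0.5)
# Overshoot along x2: how far each path goes past the optimum x2* = 0.
print("momentum:", momentum_path[:, 1].max(), "PID:", pid_path[:, 1].max())
```

Under these assumptions the momentum path swings well past $x_2 = 0$ before returning, while the path with the derivative term stays close to it. Other values of `kd`, `r` and `alpha` may behave differently, and overly large `kd` values can destabilize the iteration.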

Table 1. Test errors and training epochs of PID and SGD-Momentum (SGD-M) on CIFAR10 and CIFAR100. Error and epoch entries are given as PID/SGD-M.

| Model | Depth-k | Params (M) | Runs | CIFAR10 error | CIFAR10 epochs | CIFAR100 error | CIFAR100 epochs |
|---|---|---|---|---|---|---|---|
| ResNet [13] | 110 | 1.7 | 5 | 6.23/6.43 | 239/281 | 24.95/25.16 | 237/293 |
| ResNet [13] | 1202 | 10.2 | 5 | 7.81/7.93 | 230/293 | 27.93/27.82 | 251/296 |
| PreActResNet [23] | 164 | 1.7 | 5 | 5.23/5.46 | 230/271 | 24.17/24.33 | 241/282 |
| ResNeXt29 [35] | 8-64 | 34.43 | 10 | 3.65/3.43 | 221/294 | 17.46/17.77 | 232/291 |
| ResNeXt29 [35] | 16-64 | 68.16 | 10 | 3.42/3.58 | 209/289 | 17.11/17.31 | 229/283 |
| WRN [24] | 16-8 | 11 | 10 | 4.42/4.81 | 213/290 | 21.93/22.07 | 229/283 |
| WRN [24] | 28-20 | 36.5 | 10 | 4.27/4.17 | 208/290 | 20.21/20.50 | 221/295 |
| DenseNet [14] | 100-12 | 0.8 | 10 | 3.83/4.30 | 196/291 | 19.97/20.20 | 213/294 |
| DenseNet [14] | 190-40 | 25.6 | 10 | 3.11/3.32 | 194/293 | 16.95/17.17 | 208/297 |
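
The epoch columns of Table 1 can be turned into relative acceleration figures with a short calculation. The sketch below copies the epoch pairs from the table; the definition of acceleration used here (SGD-Momentum epochs divided by PID epochs, minus one) is an assumption, since the paper does not spell out its formula.

```python
# (PID epochs, SGD-Momentum epochs) per model, copied from the epoch columns of Table 1.
cifar10_epochs = [(239, 281), (230, 293), (230, 271), (221, 294), (209, 289),
                  (213, 290), (208, 290), (196, 291), (194, 293)]
cifar100_epochs = [(237, 293), (251, 296), (241, 282), (232, 291), (229, 283),
                   (229, 283), (221, 295), (213, 294), (208, 297)]


def speedups(pairs):
    """Relative acceleration, assumed here to mean sgd_epochs / pid_epochs - 1."""
    return [sgd / pid - 1.0 for pid, sgd in pairs]


for name, pairs in [("CIFAR10", cifar10_epochs), ("CIFAR100", cifar100_epochs)]:
    s = speedups(pairs)
    print(f"{name}: mean {100 * sum(s) / len(s):.1f}%, max {100 * max(s):.1f}%")
```

Under this assumed definition, the CIFAR10 rows give a mean of roughly 34% and a maximum of roughly 51% (DenseNet 190-40), while the CIFAR100 rows give lower values.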

5.1. Dataset

MNIST dataset: The MNIST dataset [37] contains 60,000 training samples and 10,000 test samples of the handwritten digits from 0 to 9. The images are of 28 × 28 pixels and in grey level format.

CIFAR Dataset: The CIFAR10 and CIFAR100 datasets [38] consist of 60,000 RGB color images of resolution 32 × 32, drawn from 10 and 100 classes, respectively, and both split into 50,000 training and 10,000 test images. For data augmentation, we performed horizontal flips and random crops on the original image padded by 4 pixels on each side.

Tiny ImageNet Dataset: The Tiny-ImageNet [36] dataset has 200 classes. Each class has 500 training images, 50 validation images, and 50 test images. The Tiny-ImageNet is more difficult than the CIFAR datasets because more classes are involved, and the relevant objects to be classified often cover only a tiny subspace of the image.

5.2. Results of MLP on MNIST dataset

We first trained a simple MLP network on the MNIST handwritten digit classification dataset using the proposed PID optimizer and compared it with SGD-Momentum [7]. The MLP network has ReLU nonlinearity and 1,000 hidden nodes in the hidden layer, followed by the softmax output layer on top. The training was on mini-batches with 128 images per batch for 20 epochs through the training set. We ran the experiments 10 times and reported the average results. The detailed training statistics by the two methods are illustrated in Figure 4, from which we can see that the PID optimizer not only converges more quickly than SGD-Momentum with lower loss and higher accuracy, but also has higher generalization ability on the validation dataset. On the test dataset, the PID optimizer achieves 98% accuracy and SGD-Momentum achieves an accuracy of 97.5%.

[Figure 4: four panels (Training Loss, Validation Loss, Training Acc., Validation Acc.) plotted against Epoch from 0 to 20 for SGD-Momentum and PID.]

Figure 4. PID vs. SGD-Momentum on the MNIST dataset for 20 epochs. Top row: the curves of training loss and validation loss. Bottom row: the curves of training accuracy and validation accuracy.

[Figure 5: four panels (Training Loss, Validation Loss, Training Acc., Validation Acc.) plotted against Epoch from 0 to 300 for SGD-Momentum and PID.]

Figure 5. PID vs. SGD-Momentum on the CIFAR10 dataset by using DenseNet 190-40. Top row: the curves of training loss and validation loss. Bottom row: the curves of training accuracy and validation accuracy.

5.3. Results on CIFAR datasets

We then compared PID and SGD-Momentum optimizers on CIFAR10 and CIFAR100 by using five state-of-the-art CNN models, including ResNet [13], PreActResNet [23], ResNeXt29 [35], WRN [24], and DenseNet [14]. The results are summarized in Table 1. The second column lists the depth of those networks, while the third column lists the number of parameters for each network model. The fourth column indicates the number of runs used to calculate the average test error. In the fifth and sixth columns of Table 1, we present the average test errors on CIFAR10 and the numbers of epochs needed by PID and SGD-Momentum to achieve the reported test errors for the first time (i.e., the least number of epochs to reach the best accuracy). The last two columns of Table 1 present such comparisons on CIFAR100.

From Table 1, we can make the following observations. First, our proposed PID optimizer achieves lower test errors than SGD-Momentum for all the used CNN architectures on both CIFAR datasets, except for ResNet with depth 1202. Second, the PID optimizer converges faster (with fewer epochs) than SGD-Momentum to reach the best results. In particular, our PID optimizer has on average 35% and up to 50% acceleration compared with SGD-Momentum. This demonstrates the importance of the change of gradient, which can be exploited to reduce the overshoot problem and speed up the learning process of DNNs. Figure 5 shows the detailed training statistics by the two methods on CIFAR10 with DenseNet 190-40 (190 layers with growth rate of 40) [14]. One can see that the PID optimizer converges faster than SGD-Momentum with lower loss and higher accuracy.

5.4. Experiments on Tiny-ImageNet

To further demonstrate the effectiveness of our PID optimizer, we employed the DenseNet 190-40 architecture to perform experiments on the Tiny-ImageNet dataset. Figure 6 shows the curves of training loss and accuracy over epochs, as well as the validation loss and accuracy by the PID and SGD-Momentum optimizers. The learning rate of SGD-Momentum and PID was fixed to 0.01. Training was conducted for 150 epochs using batch size 64. The results are averaged over 5 runs.

Similar conclusions to those on the CIFAR datasets can be made. In both training and validation, PID converges faster than SGD-Momentum, has lower loss and achieves higher accuracy. Such results confirm the generalization capability of the PID based DNN optimizer to large-scale datasets.

[Figure 6: four panels (Train Loss, Valid Loss, Train Acc., Valid Acc.) plotted against epochs from 0 to 150 for SGD-Momentum and PID.]

Figure 6. PID vs. SGD-Momentum on the Tiny-ImageNet dataset by using DenseNet 190-40. Top row: the curves of training loss and validation loss. Bottom row: the curves of training accuracy and validation accuracy.

6. Conclusion

Inspired by the prominent success of the PID controller in the field of automatic control, we investigated its connections with stochastic optimizers such as SGD and its variants, and presented a novel PID controller approach to deep network optimization. The proposed PID optimizer exploits the present, past and change information of gradients to update the network parameters, greatly reducing the overshoot problem of SGD-Momentum and accelerating the learning process of DNNs. Our experiments on MNIST, CIFAR and Tiny-ImageNet datasets validated that the proposed PID optimizer is 30% ∼ 50% faster than SGD-Momentum, while resulting in a lower error rate. In future work, we will investigate how to adapt our PID optimizer to other network architectures such as LSTM and RNN, and how to associate the PID optimizer with an adaptive learning rate for DNN optimization.

References

[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. IEEE International Journal of Computer Vision (IJCV), 2015.

[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), 2012.

[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (NIPS), 2015.

[4] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern

[19] Pan Zhao, Jiajia Chen, Yan Song, Xiang Tao, Tiejuan Xu,
Recognition (CVPR), 2015. 1 and Tao Mei. Design of a control system for an autonomous
vehicle based on adaptive-pid. International Journal of Ad-
[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep vanced Robotic Systems, 9(2):44, 2012. 2
Learning. MIT Press, 2016. 1
[20] A. L. Salih, M. Moghavvemi, H. A. F. Mohamed, and K. S.
[6] Léon Bottou. Online learning in neural networks. chapter Gaeid. Modelling and pid controller design for a quadro-
Online Learning and Stochastic Approximations, pages 9– tor unmanned air vehicle. In IEEE International Conference
42. Cambridge University Press, 1998. 1, 2 on Automation, Quality and Testing, Robotics (AQTR), vol-
[7] Ilya Sutskever, James Martens, George Dahl, and Geoffrey ume 1, pages 1–5, May 2010. 2
Hinton. On the importance of initialization and momentum
[21] Paolo Rocco. Stability of pid control for industrial robot
in deep learning. In International conference on machine
arms. IEEE transactions on robotics and automation, 1996.
learning, 2013. 1, 2, 7
2
[8] John Duchi, Elad Hazan, and Yoram Singer. Adap-
[22] Pierre Simon de Laplace. Théorie analytique des proba-
tive subgradient methods for online learning and stochas-
bilités, volume 7. Courcier, 1820. 2, 5
tic optimization. Journal of Machine Learning Research,
12(Jul):2121–2159, 2011. 1, 2 [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
[9] Geoffrey Hinton, N Srivastava, and Kevin Swersky. Lecture Identity mappings in deep residual networks. In IEEE Euro-
6a overview of mini–batch gradient descent. 1, 2 pean Conference on Computer Vision (ECCV), 2016. 2, 7,
8
[10] Diederik Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. In International Conference for [24] Sergey Zagoruyko and Nikos Komodakis. Wide residual net-
Learning Representations (ICLR), 2014. 1, 2 works. In BMVC, 2016. 2, 7, 8

[11] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Sre- [25] Yurii Nesterov. A method of solving a convex programming
bro, and Benjamin Recht. The marginal value of adaptive problem with convergence rate o (1/k2). In Soviet Mathe-
gradient methods in machine learning. In Advances in Neu- matics Doklady, 1983. 2
ral Information Processing Systems (NIPS), 2017. 1
[26] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differ-
[12] Daniel Jiwoong Im, Michael Tao, and Kristin Branson. An ential equation for modeling nesterov′ s accelerated gradient
empirical analysis of the optimization of deep network loss method: Theory and insights. In Advances in Neural Infor-
surfaces. In International Conference for Learning Repre- mation Processing Systems (NIPS), 2014. 2
sentations (ICLR), 2017. 1
[27] J Clerk Maxwell. On governors. Proceedings of the Royal
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Society of London, 16:270–283, 1867. 3
Deep residual learning for image recognition. In IEEE
Conference on Computer Vision and Pattern Recognition [28] Nicolas Minorsky. Directional stability of automatically
(CVPR), 2016. 1, 2, 7, 8 steered bodies. Journal of ASNE, 1922. 3

[14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kil- [29] Emre Sariyildiz, Haoyong Yu, and Kouhei Ohnishi. A prac-
ian Q Weinberger. Densely connected convolutional net- tical tuning method for the robust pid controller with velocity
works. In IEEE Conference on Computer Vision and Pattern feed-back. Machines, 3(3):208–222, 2015. 3
Recognition (CVPR), 2017. 1, 2, 7, 8
[30] David E Rumelhart, Geoffrey E Hinton, and Ronald J
[15] Katsuhiko Ogata. Discrete-time control systems, volume 2. Williams. Learning representations by back-propagating er-
Prentice Hall Englewood Cliffs, NJ, 1995. 1, 5 rors. Nature, 1986. 3

[16] Laurent Lessard, Benjamin Recht, and Andrew Packard. [31] Murray R Spiegel. Advanced mathematics. McGraw-Hill,
Analysis and design of optimization algorithms via inte- Incorporated, 1991. 4
gral quadratic constraints. SIAM Journal on Optimization,
26(1):57–95, 2016. 2 [32] KA DE JONG. An analysis of the behavior of a class of
genetic adaptive systems. Doctoral Dissertation, University
[17] L. Wang, T. J. D. Barnes, and W. R. Cluett. New frequency- of Michigan, 1975. 5
domain design method for pid controllers. IEEE Control
Theory and Applications, Jul 1995. 2 [33] John G Ziegler and Nathaniel B Nichols. Optimum settings
for automatic controllers. trans. ASME, 64(11), 1942. 5, 6
[18] Kiam Heong Ang, G Chong, and Yun Li. Pid control system
analysis, design, and technology. 13:559 – 576, 08 2005. 2, [34] George E Robert and Hyman Kaufman. Table of Laplace
3 transforms. Saunders, 1966. 5

8530
[35] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and
Kaiming He. Aggregated residual transformations for deep
neural networks. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2017. 7, 8

[36] Ya Le and Xuan Yang. Tiny imagenet visual recognition

challenge. 2015. 6, 7

[37] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick

Haffner. Gradient-based learning applied to document recog-
nition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
7

[38] Alex Krizhevsky. Learning multiple layers of features from

tiny images. 2009. 7

8531
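As an illustration of the idea summarized in the conclusion (an update built from the present gradient, the accumulated past gradients, and the change of the gradient), the following is a minimal NumPy sketch on a one-dimensional quadratic. It is not the authors' implementation: the function names `pid_step` and `minimize_quadratic`, the sign convention of the derivative term (chosen here so that it damps oscillation), and the values `lr=0.1`, `momentum=0.9`, and `kd=5.0` are all assumptions made for this example.

```python
import numpy as np


def pid_step(theta, grad, state, lr=0.1, momentum=0.9, kd=5.0):
    """One PID-style parameter update (illustrative sketch, not the paper's code)."""
    v, d, prev_grad = state
    if prev_grad is None:
        # First step: no previous gradient, so the change of gradient is zero.
        prev_grad = grad
    # P and I parts: current gradient plus a decaying sum of past gradients.
    v = momentum * v - lr * grad
    # D part: moving average of the change between consecutive gradients.
    d = momentum * d + (1.0 - momentum) * (grad - prev_grad)
    # Assumed sign: the D term opposes rapid gradient change to damp overshoot.
    theta = theta + v - kd * d
    return theta, (v, d, grad)


def minimize_quadratic(kd, steps=200):
    """Minimize 0.5 * theta**2 from theta = 1 and return the trajectory."""
    theta = np.array([1.0])
    state = (np.zeros_like(theta), np.zeros_like(theta), None)
    path = [theta.copy()]
    for _ in range(steps):
        grad = theta  # gradient of 0.5 * theta**2
        theta, state = pid_step(theta, grad, state, kd=kd)
        path.append(theta.copy())
    return np.concatenate(path)


momentum_only = minimize_quadratic(kd=0.0)  # reduces to SGD with momentum
with_derivative = minimize_quadratic(kd=5.0)
print("lowest point, momentum only:  ", momentum_only.min())
print("lowest point, with D term:    ", with_derivative.min())
```

With these illustrative settings, the momentum-only run swings past the minimum at zero before settling, while the run with the derivative term approaches zero from one side. Setting `kd=0.0` recovers plain SGD with momentum, which makes the two runs directly comparable.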

