Q Learning
Consider a dog in a room that has to perform an action: fetching. The dog is the agent, the room is the environment it has to work in, and fetching is the action to be performed.
The Bellman equation is given below. It uses the current state and the reward associated with that state, together with the maximum expected future reward and a discount rate that determines how important future rewards are to the current state, to update the value of the agent's current state-action pair. The learning rate determines how fast or slow the model learns.
Figure 6: Bellman Equation
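For reference, since the figure is not reproduced here, a standard textbook form of the Q-learning update that this paragraph describes is:
$$Q^{\mathrm{new}}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[\, r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \,\big],$$
where $\alpha$ is the learning rate and $\gamma$ is the discount rate.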
A Q-table helps us find the best action for each state in the environment. We apply the Bellman equation at each state to compute the expected future reward of each action and store it in the table, so that the values of different actions can be compared.
Let us create a Q-table for an agent that has to learn to run, fetch and sit on command. The steps taken to construct a Q-table are:
Step 1: When we initially start, the values of all states and rewards will be 0. Consider the Q-table shown below, which shows a dog simulator learning to perform actions:
Figure 7: Initial Q-Table
Step 2: Choose an action and perform it, then update the values in the table.
Step 3: Get the reward and calculate the Q-value using the Bellman equation. For the action performed, we need to calculate the actual reward received and the Q(S, A) value.
Figure 9: Updating Q-Table with Bellman Equation
Step 4: Continue in the same way until the table is filled or an episode ends. The agent keeps taking actions; for each action, the reward and Q-value are calculated and the table is updated.
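A minimal sketch of this tabular update loop in Python; the environment here is a toy stand-in (toy_step and the state/action sizes are made up purely for illustration):
import numpy as np

n_states, n_actions = 5, 3             # e.g. positions x {run, fetch, sit} (illustrative)
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount rate, exploration rate
Q = np.zeros((n_states, n_actions))    # Step 1: all values start at 0

def toy_step(s, a):
    # Toy stand-in for the environment: random next state, reward only for one state-action pair.
    s_next = np.random.randint(n_states)
    reward = 1.0 if (s == 4 and a == 2) else 0.0
    done = np.random.rand() < 0.1      # episode ends with 10% probability per step
    return s_next, reward, done

for episode in range(1000):
    s, done = np.random.randint(n_states), False
    while not done:
        # Step 2: choose an action (epsilon-greedy) and perform it
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = toy_step(s, a)
        # Step 3: Bellman update of Q(S, A)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next                     # Step 4: repeat until the episode ends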
Deep Q-Network
DQN training relies on four techniques: Experience Replay, Target Network, Clipping Rewards, and Skipping Frames.
Experience Replay
Target Network
Clipping Rewards
Each game has a different score scale. For example, in Pong a player gets +1 point for winning a rally and -1 point for losing it, whereas in Space Invaders a player gets 10-30 points for defeating an invader. This difference in scale would make training unstable, so the Clipping Rewards technique clips scores: all positive rewards are set to +1 and all negative rewards to -1.
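A minimal sketch of this clipping, assuming the raw game score change is available as a number:
import numpy as np

def clip_reward(reward):
    # Map any positive reward to +1, any negative reward to -1, and keep 0 as 0.
    return float(np.sign(reward))
For example, clip_reward(30) returns 1.0 and clip_reward(-1) returns -1.0, so every game contributes rewards on the same scale.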
Skipping Frames
ALE is capable of rendering 60 images per second, but people do not take actions that often, and the agent does not need to compute Q-values for every frame. With the Skipping Frames technique, DQN therefore computes Q-values only every 4th frame and uses the past 4 frames as input. This reduces computational cost and gathers more experience.
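A sketch of the idea, with a hypothetical env.step() and agent.select_action() interface (neither name comes from the original text): the chosen action is repeated for 4 frames and the most recent 4 frames are kept as the network input.
import numpy as np

def act_with_frame_skip(env, agent, frame_stack, skip=4):
    # frame_stack: a collections.deque(maxlen=4) holding the most recent frames.
    action = agent.select_action(np.stack(frame_stack))  # Q-values computed once per 4 frames
    total_reward, done = 0.0, False
    for _ in range(skip):
        frame, reward, done = env.step(action)           # hypothetical environment API
        total_reward += reward
        frame_stack.append(frame)                        # keep only the last 4 frames
        if done:
            break
    return total_reward, done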
Performance
Introduction
Reinforcement learning is probably the most general framework in which reward-related learning problems of animals, humans or machines can be phrased. However, most of the methods proposed in
the reinforcement learning community are not yet applicable to many
problems such as robotics, motor control, etc. This inapplicability
may result from problems with uncertain state information. Thus,
those systems need to be modeled as partially observable Markov
decision problems which often results in excessive computational
demands. Most traditional reinforcement learning methods have no
convergence guarantees and there exist even divergence examples.
Continuous states and actions in high dimensional spaces cannot be
treated by most off-the-shelf reinforcement learning approaches.
Policy gradient methods differ significantly as they do not suffer from
these problems in the same way. For example, uncertainty in the state
might degrade the performance of the policy (if no additional state
estimator is being used) but the optimization techniques for the policy
do not need to be changed. Continuous states and actions can be dealt
with in exactly the same way as discrete ones while, in addition, the
learning performance is often increased. Convergence at least to a
local optimum is guaranteed.
The advantages of policy gradient methods for real world applications
are numerous. Among the most important ones are that the policy representation can be chosen so that it is meaningful for the task and can incorporate domain knowledge, that often fewer parameters are
needed in the learning process than in value-function based
approaches and that there is a variety of different algorithms for
policy gradient estimation in the literature which have a rather strong
theoretical underpinning. Additionally, policy gradient methods can
be used either model-free or model-based as they are a generic
formulation.
Of course, policy gradients are not the salvation to all problems but
also have significant problems. They are by definition on-policy (note
that tricks like importance sampling can slightly alleviate this
problem) and need to forget data very fast in order to avoid the
introduction of a bias to the gradient estimator. Hence, the use of
sampled data is not very efficient. In tabular representations, value
function methods are guaranteed to converge to a global maximum
while policy gradients only converge to a local maximum and there
may be many maxima in discrete problems. Policy gradient methods
are often quite demanding to apply, mainly because one has to have
considerable knowledge about the system one wants to control to
make reasonable policy definitions. Finally, policy gradient methods always have an open parameter, the learning rate, which may decide over the order of magnitude of the speed of convergence; this has led to new approaches inspired by expectation-maximization (see, e.g., Vlassis et al., 2009; Kober & Peters, 2008).
Nevertheless, due to their advantages stated above, policy gradient
methods have become particularly interesting for robotics applications
as these have both continuous actions and states. For example, there
has been a series of successful applications in robot locomotion,
where good policy parametrizations such as CPGs are known.
Benbrahim & Franklin (1997) already explored 2D dynamic biped
walking, Tedrake et al. (2004) extended these results to 3D passive
dynamics-based walkers and Endo (2005) showed that a full-body
gait with sensory feedback can be learned with policy gradients. Kohl
& Stone (2004) were able to apply policy gradients to optimize
quadruped gaits. There have also been various applications in skill
learning, starting with the peg-in-a-hole tasks learned by Gullapalli et al. (1994) and ranging to Peters & Schaal's optimization of discrete movement primitives such as T-ball swings.
Note that in most applications, there exist many local maxima; for example, if we were told to build a high-jumping robot, there is a multitude of styles. Current policy gradient methods would be helpful for improving the jumping style of a teacher, say the classical straddle jump. However, discovering a Fosbury flop when starting with a basic straddle-jump policy is probably not possible with policy gradient methods.
Assumptions and Notation
We assume that we can model the control system in a discrete-time manner and we will denote the current time step by $k$. In order to take possible stochasticity of the plant into account, we denote it using a probability distribution $x_{k+1} \sim p(x_{k+1} \mid x_k, u_k)$ as model, where $u_k \in \mathbb{R}^M$ denotes the current action and $x_k, x_{k+1} \in \mathbb{R}^N$ denote the current and next state, respectively. We furthermore assume that actions are generated by a policy $u_k \sim \pi_\theta(u_k \mid x_k)$ which is modeled as a probability distribution in order to incorporate exploratory actions; for some special problems, the optimal solution to a control problem is actually a stochastic controller (Sutton, McAllester, Singh, and Mansour, 2000). The policy is assumed to be parameterized by $K$ policy parameters $\theta \in \mathbb{R}^K$. The sequence of states and actions forms a trajectory denoted by $\tau = [x_{0:H}, u_{0:H}]$, where $H$ denotes the horizon, which can be infinite. In this article, we will use the words trajectory, history, trial, or roll-out interchangeably. At each instant of time, the learning system receives a reward denoted by $r_k = r(x_k, u_k) \in \mathbb{R}$.
The general goal of policy optimization in reinforcement learning is to optimize the policy parameters $\theta \in \mathbb{R}^K$ so that the expected return
$$J(\theta) = E\Big\{ \sum_{k=0}^{H} a_k r_k \Big\}$$
is optimized, where $a_k$ denote time-step dependent weighting factors, often set to $a_k = \gamma^k$ for discounted reinforcement learning (where $\gamma \in [0,1]$) or $a_k = 1/H$ for the average reward case.
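As a concrete illustration of the discounted case $a_k = \gamma^k$, a minimal numpy-free sketch of the return for one sampled roll-out (the reward values are made up):
def discounted_return(rewards, gamma=0.95):
    # Estimate of J for one roll-out: sum_k gamma^k * r_k
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0]))  # 0.95**2 * 1.0 = 0.9025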
For real-world applications, we require that any change to the policy
parameterization has to be smooth as drastic changes can be
hazardous for the actor as well as useful initializations of the policy
based on domain knowledge would otherwise vanish after a single
update step. For these reasons, policy gradient methods which follow
the steepest descent on the expected return are the method of choice.
These methods update the policy parameterization according to the
gradient update rule
$$\theta_{h+1} = \theta_h + \alpha_h \nabla_\theta J\big|_{\theta = \theta_h},$$
where $\alpha_h \in \mathbb{R}^{+}$ denotes a learning rate and $h \in \{0, 1, 2, \ldots\}$ the current update number.
The time step k and update number h are two different variables. In
actor-critic-based policy gradient methods, the frequency of updates
of h can be nearly as high as of k . However, in most episodic
methods, the policy update h will be significantly less frequent. Here,
cut-off allows updates before the end of the episode (for $a_k = \gamma^k$, it is obvious that there comes a point where any future reward becomes irrelevant, giving a generically good cut-off point). If the gradient estimate is unbiased and the learning rates fulfill $\sum_{h=0}^{\infty} \alpha_h > 0$ and $\sum_{h=0}^{\infty} \alpha_h^2 = \mathrm{const}$, the learning process is guaranteed to converge at least to a local optimum.
The policy gradient can be estimated with the likelihood-ratio, or REINFORCE (Williams, 1992), trick, i.e., by using the identity
$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)$$
to obtain
$$\nabla_\theta J(\theta) = \int_{\mathbb{T}} \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int_{\mathbb{T}} p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = E\{\nabla_\theta \log p_\theta(\tau)\, r(\tau)\}.$$
Note that with a stochastic policy no model derivative is needed; for a deterministic policy $u_k = \pi_\theta(x_k)$, computing such a derivative would require the model derivative $\nabla_\theta \log p(x_{k+1} \mid x_k, u_k) = \nabla_{u_k} \log p(x_{k+1} \mid x_k, u_k)\, \nabla_\theta \pi_\theta(x_k)$. This yields the REINFORCE gradient estimator
$$g_{\mathrm{RF}} = \Big\langle \Big(\sum_{k=0}^{H} \nabla_\theta \log \pi_\theta(u_k \mid x_k)\Big)\Big(\sum_{l=0}^{H} a_l r_l - b\Big) \Big\rangle,$$
where $\langle\cdot\rangle$ denotes the average over trajectories and $b$ is a variance-reducing baseline with optimal value
$$b = \frac{\Big\langle \big(\sum_{h=0}^{H} \nabla_{\theta_k} \log \pi_\theta(u_h \mid x_h)\big)^2 \sum_{l=0}^{H} a_l r_l \Big\rangle}{\Big\langle \big(\sum_{h=0}^{H} \nabla_{\theta_k} \log \pi_\theta(u_h \mid x_h)\big)^2 \Big\rangle}.$$
In that case, the natural gradient is defined by Amari (1998) as the update $\Delta\theta$ that is most similar to the true gradient $\nabla_\theta J$ while the change in the path distribution is limited to $\varepsilon$. Hence, it is given by the program
$$\operatorname*{argmax}_{\Delta\theta}\; \Delta\theta^{T} \nabla_\theta J \quad \text{s.t.} \quad \Delta\theta^{T} F_\theta\, \Delta\theta = \varepsilon.$$
The solution to this program is given by
$$\Delta\theta \propto F_\theta^{-1} \nabla_\theta J,$$
where $\nabla_\theta J$ denotes the regular likelihood-ratio policy gradient from above and $F_\theta$ the Fisher information matrix of the path distribution.
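A minimal numpy sketch of the REINFORCE estimator $g_{\mathrm{RF}}$ for a scalar-action Gaussian policy $u \sim \mathcal{N}(\theta^T x, \sigma^2)$; the roll-outs are assumed to be given as lists of (state-feature, action, reward) tuples, and all names here are illustrative rather than from the original article:
import numpy as np

def reinforce_gradient(rollouts, theta, sigma=0.5, gamma=0.95, baseline=0.0):
    # rollouts: list of trajectories, each a list of (x, u, r) with x a feature vector.
    # Policy: u ~ N(theta^T x, sigma^2), so grad_theta log pi = (u - theta^T x) * x / sigma^2.
    grads = []
    for traj in rollouts:
        score = np.zeros_like(theta)
        ret = 0.0
        for k, (x, u, r) in enumerate(traj):
            score += (u - theta @ x) * x / sigma ** 2   # sum_k grad log pi(u_k | x_k)
            ret += (gamma ** k) * r                     # sum_l a_l r_l with a_l = gamma^l
        grads.append(score * (ret - baseline))
    return np.mean(grads, axis=0)                       # average over trajectories

# Gradient ascent update theta_{h+1} = theta_h + alpha_h * g_RF, e.g.:
# theta = theta + 0.01 * reinforce_gradient(rollouts, theta)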
We can call the reward-minus-baseline term the advantage function [3][4]. An important point to note about this formulation is that the baseline b is a function of s_t, not of s_t' [4]. If the expectation of the baseline term is zero, then adding the baseline function b introduces no bias into the gradient estimate; the derivation in [3] shows exactly this.
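A minimal sketch of this, where value_baseline is a hypothetical state-value estimate b(s_t) (not a function from the original text):
import numpy as np

def advantages(returns, states, value_baseline):
    # A_t ~= G_t - b(s_t); the baseline depends only on the state s_t, not on s_t'.
    return np.array([g - value_baseline(s) for g, s in zip(returns, states)])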
Actor-critic
AUTOENCODING:
Autoencoders are very useful in the field of unsupervised machine
learning. You can use them to compress the data and reduce its
dimensionality.
Python3
autoencoder.fit(X_train, X_train, epochs=200)
Training of an Auto-encoder for data compression: For a data compression procedure, the most important aspect of the compression is the reliability of the reconstruction of the compressed data. This requirement dictates the structure of the Auto-encoder as a bottleneck.
Step 1: Encoding the input data. The Auto-encoder first tries to encode the data using the initialized weights and biases.
1. Encoder
2. Code
3. Decoder
The Encoder layer compresses the input image into a latent space
representation. It encodes the input image as a compressed
representation in a reduced dimension.
The Code layer represents the compressed input fed to the decoder
layer.
The Decoder layer decodes the encoded image back to the original dimension. The decoded image is reconstructed from the latent space representation and is a lossy reconstruction of the original image.
Convolutional autoencoding:
This differs from regular ConvNets or neural nets in the sense that the
input size and the target size must be the same.
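As an illustration, a minimal Keras sketch of a convolutional autoencoder for 28x28 grayscale inputs (the layer sizes are illustrative choices, not taken from the text); note that the output shape matches the input shape:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D

inp = Input(shape=(28, 28, 1))
# encoder: compress spatially
x = Conv2D(16, (3, 3), activation='relu', padding='same')(inp)
x = MaxPooling2D((2, 2), padding='same')(x)
# decoder: restore the original spatial size
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
out = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

conv_autoencoder = Model(inp, out)
conv_autoencoder.compile(optimizer='adam', loss='mse')
# conv_autoencoder.fit(X_train, X_train, epochs=10, batch_size=128)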
VARIATIONAL AUTOENCODING:
$$\log p(x) = \mathrm{ELBO}(\lambda) + KL\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big)$$
By Jensen’s inequality, the Kullback-Leibler divergence is always
greater than or equal to zero. This means that minimizing the
Kullback-Leibler divergence is equivalent to maximizing the ELBO.
The abbreviation is revealed: the Variational Autoencoder (VAE):
in neural net language, a VAE consists of an encoder, a decoder, and
a loss function. In probability model terms, the variational
autoencoder refers to approximate inference in a latent Gaussian
model where the approximate posterior and model likelihood are
parametrized by neural nets (the inference and generative networks).
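A minimal sketch of the corresponding loss function, assuming an encoder that outputs z_mean and z_log_var for a diagonal Gaussian posterior (these names, and the closed-form KL against a standard normal prior, are the usual choices rather than anything stated in the text); minimizing this loss maximizes the ELBO:
import tensorflow as tf

def vae_loss(x, x_reconstructed, z_mean, z_log_var):
    # Reconstruction term: how well the decoder reproduces the input.
    recon = tf.reduce_sum(tf.square(x - x_reconstructed), axis=-1)
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, in closed form.
    kl = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
    # Minimizing recon + kl is equivalent to maximizing the ELBO.
    return tf.reduce_mean(recon + kl)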
GLOSSARY:
1. Components:
3. Applications:
Image synthesis
Text-to-Image synthesis
Image-to-Image translation
Anomaly detection
Data augmentation
4. Limitations:
Autoencoder Feature Extraction for
Classification
by Jason Brownlee on December 7, 2020 in Deep Learning
The encoder can then be used as a data preparation technique to perform feature
extraction on raw data that can be used to train a different machine learning model.
In this tutorial, you will discover how to develop and evaluate an autoencoder for
classification predictive modeling.
Tutorial Overview
This tutorial is divided into three parts; they are:
1. Autoencoders for Feature Extraction
2. Autoencoder for Classification
3. Encoder as Data Preparation for Predictive Model
Autoencoders for Feature Extraction
An autoencoder is a neural network model that seeks to learn a compressed
representation of an input.
An autoencoder is a neural network that is trained to attempt to copy its input to its
output.
Autoencoders are typically trained as part of a broader model that attempts to recreate
the input.
For example:
X = model.predict(X)
The design of the autoencoder model purposefully makes this challenging by restricting
the architecture to a bottleneck at the midpoint of the model, from which the
reconstruction of the input data is performed.
There are many types of autoencoders, and their use varies, but perhaps the more
common use is as a learned or automatic feature extraction model.
In this case, once the model is fit, the reconstruction aspect of the model can be
discarded and the model up to the point of the bottleneck can be used. The output of the
model at the bottleneck is a fixed-length vector that provides a compressed
representation of the input data.
Usually they are restricted in ways that allow them to copy only approximately, and to
copy only input that resembles the training data. Because the model is forced to
prioritize which aspects of the input should be copied, it often learns useful properties of
the data.
We will use the make_classification() scikit-learn function to define a synthetic binary (2-
class) classification task with 100 input features (columns) and 1,000 examples (rows).
Importantly, we will define the problem in such a way that most of the input variables are
redundant (90 of the 100 or 90 percent), allowing the autoencoder later to learn a useful
compressed representation.
The example below defines the dataset and summarizes its shape.
# synthetic classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
Running the example defines the dataset and prints the shape of the arrays, confirming
the number of rows and columns.
(1000, 100) (1000,)
The autoencoder consists of two parts: the encoder and the decoder. The encoder
learns how to interpret the input and compress it to an internal representation defined by
the bottleneck layer. The decoder takes the output of the encoder (the bottleneck layer)
and attempts to recreate the input.
Once the autoencoder is trained, the decoder is discarded and we only keep the
encoder and use it to compress examples of input to vectors output by the bottleneck
layer.
In this first autoencoder, we won’t compress the input at all and will use a bottleneck
layer the same size as the input. This should be an easy problem that the model will
learn nearly perfectly and is intended to confirm our model is implemented correctly.
We will define the model using the Keras functional API.
...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)
We will define the encoder to have two hidden layers, the first with two times the number
of inputs (e.g. 200) and the second with the same number of inputs (100), followed by
the bottleneck layer with the same number of inputs as the dataset (100).
To ensure the model learns well, we will use batch normalization and leaky ReLU
activation.
...
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = n_inputs
bottleneck = Dense(n_bottleneck)(e)
It will have two hidden layers, the first with the number of inputs in the dataset (e.g. 100)
and the second with double the number of inputs (e.g. 200). The output layer will have
the same number of nodes as there are columns in the input data and will use a linear
activation function to output numeric values.
...
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)
The model will be fit using the efficient Adam version of stochastic gradient descent and
minimizes the mean squared error, given that reconstruction is a type of multi-output
regression problem.
...
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
We can plot the layers in the autoencoder model to get a feeling for how the data flows
through the model.
...
# plot the autoencoder
plot_model(model, 'autoencoder_no_compress.png', show_shapes=True)
Next, we can train the model to reproduce the input and keep track of the performance
of the model on the hold-out test set.
...
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2, validation_data=(X_test,X_test))
After training, we can plot the learning curves for the train and test sets to confirm the
model learned the reconstruction problem well.
...
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
Finally, we can save the encoder model for use later, if desired.
...
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
plot_model(encoder, 'encoder_no_compress.png', show_shapes=True)
# save the encoder to file
encoder.save('encoder.h5')
As part of saving the encoder, we will also plot the encoder model to get a feeling for the
shape of the output of the bottleneck layer, e.g. a 100 element vector.
# train autoencoder for classification with no compression in the bottleneck layer
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.utils import plot_model
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90, random_state=1)
# number of input columns
n_inputs = X.shape[1]
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = n_inputs
bottleneck = Dense(n_bottleneck)(e)
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
# plot the autoencoder
plot_model(model, 'autoencoder_no_compress.png', show_shapes=True)
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2, validation_data=(X_test,X_test))
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
plot_model(encoder, 'encoder_no_compress.png', show_shapes=True)
# save the encoder to file
encoder.save('encoder.h5')
Running the example fits the model and reports loss on the train and test sets along the
way.
Note: if you have problems creating the plots of the model, you can comment out the import and the call to the plot_model() function.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few
times and compare the average outcome.
In this case, we see that loss gets low, but does not go to zero (as we might have
expected) with no compression in the bottleneck layer. Perhaps further tuning the model
architecture or learning hyperparameters is required.
...
42/42 - 0s - loss: 0.0032 - val_loss: 0.0016
Epoch 196/200
42/42 - 0s - loss: 0.0031 - val_loss: 0.0024
Epoch 197/200
42/42 - 0s - loss: 0.0032 - val_loss: 0.0015
Epoch 198/200
42/42 - 0s - loss: 0.0032 - val_loss: 0.0014
Epoch 199/200
42/42 - 0s - loss: 0.0031 - val_loss: 0.0020
Epoch 200/200
42/42 - 0s - loss: 0.0029 - val_loss: 0.0017
A plot of the learning curves is created showing that the model achieves a good fit in
reconstructing the input, which holds steady throughout training, not overfitting.
Learning Curves of Training the Autoencoder Model Without Compression
Next, let’s change the configuration of the model so that the bottleneck layer has half the
number of nodes (e.g. 50).
...
# bottleneck
n_bottleneck = round(float(n_inputs) / 2.0)
bottleneck = Dense(n_bottleneck)(e)
# train autoencoder for classification with compression in the bottleneck layer
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.utils import plot_model
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90, random_state=1)
# number of input columns
n_inputs = X.shape[1]
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = round(float(n_inputs) / 2.0)
bottleneck = Dense(n_bottleneck)(e)
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
# plot the autoencoder
plot_model(model, 'autoencoder_compress.png', show_shapes=True)
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2, validation_data=(X_test,X_test))
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
plot_model(encoder, 'encoder_compress.png', show_shapes=True)
# save the encoder to file
encoder.save('encoder.h5')
Running the example fits the model and reports loss on the train and test sets along the
way.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few
times and compare the average outcome.
In this case, we see that loss gets similarly low as the above example without
compression, suggesting that perhaps the model performs just as well with a bottleneck
half the size.
...
42/42 - 0s - loss: 0.0029 - val_loss: 0.0010
Epoch 196/200
42/42 - 0s - loss: 0.0029 - val_loss: 0.0013
Epoch 197/200
42/42 - 0s - loss: 0.0030 - val_loss: 9.4472e-04
Epoch 198/200
42/42 - 0s - loss: 0.0028 - val_loss: 0.0015
Epoch 199/200
42/42 - 0s - loss: 0.0033 - val_loss: 0.0021
Epoch 200/200
42/42 - 0s - loss: 0.0027 - val_loss: 8.7731e-04
A plot of the learning curves is created, again showing that the model achieves a good fit
in reconstructing the input, which holds steady throughout training, not overfitting.
The trained encoder is saved to the file “encoder.h5” that we can load and use later.
Next, let’s explore how we might use the trained encoder model.
DENOISING AUTOENCODER:
Architecture of a DAE. Copyright by Kirill Eremenko (Deep Learning A-Z™: Hands-On Artificial Neural
Networks)
SPARSE AUTOENCODER:
Sparse Autoencoder
There are actually two different ways to construct our
sparsity penalty: L1 regularization and KL-divergence.
And here we will only talk about L1 regularization.
Loss Function
In addition to the first two terms, we add a third term which penalizes the absolute value of the vector of activations a in layer h for sample i, and we use a hyperparameter to control its effect on the whole loss function. In this way, we build a sparse autoencoder, as sketched below.
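A minimal Keras sketch of the L1 version, where the sparsity coefficient 1e-5 and the layer sizes are illustrative choices rather than values from the text:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import regularizers

inp = Input(shape=(784,))
# activity_regularizer adds lambda * sum(|a|) over the hidden activations to the loss
hidden = Dense(64, activation='relu', activity_regularizer=regularizers.l1(1e-5))(inp)
out = Dense(784, activation='sigmoid')(hidden)

sparse_autoencoder = Model(inp, out)
sparse_autoencoder.compile(optimizer='adam', loss='mse')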
Visualization
After 100 epochs of training with a batch size of 128 and Adam as the optimizer, I got the results below:
Experiment Results