
Q learning:

Reinforcement learning requires a machine learning model to learn from the problem and arrive at the optimal solution by itself. This means that we can also arrive at fast and unique solutions which the programmer might not even have thought of.

Consider the image below. You can see a dog in a room that has to
perform an action, which is fetching. The dog is the agent; the room is
the environment it has to work in, and the action to be performed is
fetching.

Figure 1: Agent, Action, and Environment

If the correct action is performed, we will reward the agent. If it performs the wrong action, we will give it no reward or a negative reward, like a scolding.

Figure 2: Agent performing an action


What Is Q-Learning?

Q-Learning is a reinforcement learning policy that finds the next best action, given a current state. While exploring, it may choose actions at random, and it aims to maximize the total reward.

Figure 3: Components of Q-Learning

Q-learning is a model-free, off-policy reinforcement learning algorithm that will find the best course of action, given the current state of the agent. Depending on where the agent is in the environment, it will decide the next action to be taken.

The objective of the model is to find the best course of action given its current state. To do this, it may come up with rules of its own, or it may operate outside the policy given to it to follow. In other words, the values it learns do not depend on the policy it follows while exploring, which is why we call it off-policy.

Model-free means that the agent does not rely on a model that predicts the environment's response. Instead of planning with such a model, it learns from the rewards it receives through trial and error.
An example of Q-learning is an Advertisement recommendation
system. In a normal ad recommendation system, the ads you get are
based on your previous purchases or websites you may have visited.
If you’ve bought a TV, you will get recommended TVs of different
brands.

Figure 4: Ad Recommendation System

Using Q-learning, we can optimize the ad recommendation system to


recommend products that are frequently bought together. The reward
will be if the user clicks on the suggested product.

Figure 5: Ad Recommendation System with Q-Learning


Important Terms in Q-Learning

1. States: The state, S, represents the current position of an agent in the environment.

2. Action: The action, A, is the step taken by the agent when it is in a particular state.

3. Rewards: For every action, the agent receives a positive or negative reward.

4. Episodes: An episode ends when the agent reaches a terminating state and can take no further action.

5. Q-Values: Used to determine how good an action, A, taken at a particular state, S, is: Q(S, A).

6. Temporal Difference: A formula used to update the Q-Value using the values of the current state and action and the previous state and action.

What Is The Bellman Equation?

The Bellman Equation is used to determine the value of a particular state and to deduce how good it is to be in (or to take) that state. The optimal state will give us the highest optimal value.

The equation is given below. It uses the current state, the reward associated with that state, the maximum expected future reward, and a discount rate, which determines the importance of future rewards to the current state, in order to compute the new Q-value. The learning rate determines how fast or slow the model learns.
Figure 6: Bellman Equation
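The equation itself is not reproduced in the referenced figure; written out, the standard Q-learning form of the Bellman update that this paragraph describes is

$$Q^{\text{new}}(S, A) \leftarrow Q(S, A) + \alpha \big[ R(S, A) + \gamma \max_{A'} Q(S', A') - Q(S, A) \big],$$

where $\alpha$ is the learning rate, $\gamma$ is the discount rate, $S'$ is the next state, and $\max_{A'} Q(S', A')$ is the maximum expected future reward.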

How to Make a Q-Table?

While running our algorithm, we will come across various solutions


and the agent will take multiple paths. How do we find out the best
among them? This is done by tabulating our findings in a table called
a Q-Table.

A Q-Table helps us to find the best action for each state in the
environment. We use the Bellman Equation at each state to get the
expected future state and reward and save it in a table to compare with
other states.

Let us create a Q-Table for an agent that has to learn to run, fetch and sit on command. The steps taken to construct a Q-Table are:

Step 1: Create an initial Q-Table with all values initialized to 0

When we initially start, the values of all states and rewards will be 0.
Consider the Q-Table shown below, which shows a dog simulator learning to perform actions:
Figure 7: Initial Q-Table (for the images, please refer to simplelearn.com)

Step 2: Choose an action and perform it. Update values in the table

This is the starting point. We have performed no other action as of


yet. Let us say that we want the agent to sit initially, which it does.
The table will change to:

Figure 8: Q-Table after performing an action

Step 3: Get the value of the reward and calculate the Q-Value using the Bellman Equation

For the action performed, we need to calculate the value of the actual reward and the Q(S, A) value
Figure 9: Updating Q-Table with Bellman Equation

Step 4: Continue the same until the table is filled or an episode ends

The agent continues taking actions and, for each action, the reward and Q-value are calculated and the table is updated.

Figure 10: Final Q-Table at end of an episode
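As a concrete illustration of Steps 1-4, here is a minimal Python sketch that fills a Q-Table with the Bellman update. The 5-state, 3-action "dog" environment (run, fetch, sit) is a toy assumption invented for this example; only the update rule itself comes from the text above.

import numpy as np

n_states, n_actions = 5, 3            # actions: 0 = run, 1 = fetch, 2 = sit
alpha, gamma, epsilon = 0.1, 0.9, 0.2
Q = np.zeros((n_states, n_actions))   # Step 1: initial Q-Table with all values 0

def step(state, action):
    """Toy environment: reward 1 only for the 'right' action in each state."""
    reward = 1.0 if action == state % n_actions else -0.1
    next_state = np.random.randint(n_states)
    done = np.random.rand() < 0.1     # the episode terminates occasionally
    return next_state, reward, done

for episode in range(500):
    state, done = np.random.randint(n_states), False
    while not done:
        # Step 2: choose an action (epsilon-greedy) and perform it
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Step 3: update Q(S, A) with the Bellman Equation
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state            # Step 4: continue until the episode ends

print(Q)                              # the final Q-Table at the end of training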

DEEP Q NETWORK (DQN):

What is DQN in reinforcement learning?


 The deep Q-network (DQN) algorithm is a model-free, online,
off-policy reinforcement learning method.
 A DQN agent is a value-based reinforcement learning agent that
trains a critic to estimate the return or future rewards. DQN is a
variant of Q-learning.
Deep Neural Network (DNN)

AlexNet achieved an incredible score in the ILSVRC 2012 image classification competition by using a DNN.

The greatest strength of a DNN is that it extracts feature representations through backpropagation.

learned weights of a convolutional layer in AlexNet

Classifiers no longer need hand-engineered features because of this ability. After sufficiently many backpropagation updates, a DNN knows which information, such as color or shape, is important for the task.

Bringing DNN into RL

People naturally think that a DNN enables an RL agent to associate images with values. However, things are not that easy.

Comparison between naive DQN and a linear model (with DQN techniques) from Nature

Naive DQN has 3 convolutional layers and 2 fully connected layers to estimate Q-values directly from images. On the other hand, the linear model has only 1 fully connected layer, together with some of the learning techniques discussed in the next section. Both models learn Q-values in the Q-learning way. As you can see in the above table, naive DQN has very poor results, worse even than the linear model, because a DNN easily overfits in online reinforcement learning.

Deep Q-Network

DQN was introduced in two papers, Playing Atari with Deep Reinforcement Learning at NIPS in 2013 and Human-level control through deep reinforcement learning in Nature in 2015. Interestingly, there were only a few papers about DQN between 2013 and 2015. I guess the reason was that people couldn't reproduce the DQN implementation without the information in the Nature version.

DQN agent playing Breakout

DQN overcomes unstable learning mainly through four techniques:

 Experience Replay

 Target Network

 Clipping Rewards

 Skipping Frames

I explain each technique one by one.

Experience Replay

Experience Replay was originally proposed in Reinforcement Learning for Robots Using Neural Networks in 1993. A DNN easily overfits to the current episodes, and once it is overfitted it is hard to produce varied experiences. To solve this problem, Experience Replay stores experiences, including state transitions, rewards and actions, which are the data necessary to perform Q-learning, and draws mini-batches from them to update the neural network. This technique has the following merits (a minimal buffer sketch is given after the list):

 reduces correlation between experiences in updating the DNN
 increases learning speed with mini-batches
 reuses past transitions to avoid catastrophic forgetting
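A minimal replay-buffer sketch in Python is shown below. This is an illustrative assumption of how such a buffer might be written (the class and method names are invented for the example), not the DQN authors' code.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are discarded first

    def push(self, state, action, reward, next_state, done):
        # Store everything needed to perform a Q-learning update later.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive experiences.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)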

Target Network

In the TD error calculation, the target function changes frequently when a DNN is used, and an unstable target function makes training difficult. The Target Network technique therefore fixes the parameters of the target function and replaces them with those of the latest network only every few thousand steps.

the target Q function in the red rectangle is fixed
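A minimal sketch of the idea, assuming for illustration that the network weights are held in plain Python dictionaries (real implementations copy framework-specific parameters instead):

SYNC_EVERY = 10_000    # illustrative interval; the paper syncs only every few thousand steps

def maybe_sync_target(step, online_params, target_params):
    # Between syncs the target parameters stay frozen, which keeps the TD target stable.
    if step % SYNC_EVERY == 0:
        target_params.update(online_params)      # copy the latest online weights
    return target_params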

Clipping Rewards

Each game has different score scales. For example, in Pong, players get 1 point when winning a rally and -1 point otherwise. In Space Invaders, however, players get 10-30 points when defeating invaders. This difference would make training unstable. Thus the Clipping Rewards technique clips scores so that all positive rewards are set to +1 and all negative rewards to -1.
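In code, the clipping described above amounts to taking the sign of the raw game score (a small numpy sketch):

import numpy as np

def clip_reward(raw_reward):
    # +1 for any positive score, -1 for any negative score; a score of 0 stays 0
    return float(np.sign(raw_reward))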

Skipping Frames
The ALE (Arcade Learning Environment) is capable of rendering 60 images per second, but people don't actually take that many actions in a second, and the AI doesn't need to calculate Q-values every frame. With the Skipping Frames technique, DQN calculates Q-values only every 4 frames and uses the past 4 frames as input. This reduces the computational cost and gathers more experience.

Performance

All of the above techniques enable DQN to achieve stable training.

DQN overwhelms naive DQN

The Nature version shows how much Experience Replay and the Target Network contribute to stability.

Performance with and without Experience Replay and Target Network

Experience Replay is very important in DQN. The Target Network also increases its performance.
Conclusion

DQN has achieved human-level control in many Atari games with the above four techniques. However, there are still some games DQN cannot play. I will introduce papers that tackle them in this series.

POLICY GRADIENT METHODS:

Policy gradient methods are a type of reinforcement learning technique that relies upon optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by
gradient descent. They do not suffer from many of the problems that
have been marring traditional reinforcement learning approaches such
as the lack of guarantees of a value function, the intractability
problem resulting from uncertain state information and
the complexity arising from continuous states & actions.

Introduction
Reinforcement learning is probably the most general framework in
which reward-related learning problems of animals, humans or
machines can be phrased. However, most of the methods proposed in
the reinforcement learning community are not yet applicable to many
problems such as robotics, motor control, etc. This inapplicability
may result from problems with uncertain state information. Thus,
those systems need to be modeled as partially observable Markov
decision problems which often results in excessive computational
demands. Most traditional reinforcement learning methods have no
convergence guarantees and there exist even divergence examples.
Continuous states and actions in high dimensional spaces cannot be
treated by most off-the-shelf reinforcement learning approaches.
Policy gradient methods differ significantly as they do not suffer from
these problems in the same way. For example, uncertainty in the state
might degrade the performance of the policy (if no additional state
estimator is being used) but the optimization techniques for the policy
do not need to be changed. Continuous states and actions can be dealt
with in exactly the same way as discrete ones while, in addition, the
learning performance is often increased. Convergence at least to a
local optimum is guaranteed.
The advantages of policy gradient methods for real world applications
are numerous. Among the most important ones are that the policy
representations can be chosen so that it is meaningful for the task and
can incorporate domain knowledge, that often fewer parameters are
needed in the learning process than in value-function based
approaches and that there is a variety of different algorithms for
policy gradient estimation in the literature which have a rather strong
theoretical underpinning. Additionally, policy gradient methods can
be used either model-free or model-based as they are a generic
formulation.
Of course, policy gradients are not the salvation to all problems but
also have significant problems. They are by definition on-policy (note
that tricks like importance sampling can slightly alleviate this
problem) and need to forget data very fast in order to avoid the
introduction of a bias to the gradient estimator. Hence, the use of
sampled data is not very efficient. In tabular representations, value
function methods are guaranteed to converge to a global maximum
while policy gradients only converge to a local maximum and there
may be many maxima in discrete problems. Policy gradient methods
are often quite demanding to apply, mainly because one has to have
considerable knowledge about the system one wants to control to
make reasonable policy definitions. Finally, policy gradient methods
always have an open parameter, the learning rate, which may decide
over the order of magnitude of the speed of convergence; this has led to new approaches inspired by expectation-maximization (see, e.g., Vlassis et al., 2009; Kober & Peters, 2008).
Nevertheless, due to their advantages stated above, policy gradient
methods have become particularly interesting for robotics applications
as these have both continuous actions and states. For example, there
has been a series of successful applications in robot locomotion,
where good policy parametrizations such as CPGs are known.
Benbrahim & Franklin (1997) already explored 2D dynamic biped
walking, Tedrake et al. (2004) extended these results to 3D passive
dynamics-based walkers and Endo (2005) showed that a full-body
gait with sensory feedback can be learned with policy gradients. Kohl
& Stone (2004) were able to apply policy gradients to optimize
quadruped gaits. There have also been various applications in skill
learning starting with the peg-in-a-hole tasks learned by Gullapalli et
al. (1994) and ranging to Peters & Schaal's optimizations of discrete
movements primitives such as T-Ball swings.
Note that in most applications, there exist many local maxima; for example, if we were told to build a high-jumping robot, there is a multitude of styles. Current policy gradient methods would be helpful for improving the jumping style of a teacher, let's say the classical straddle jump. However, discovering a Fosbury flop when starting with a basic straddle jump policy is probably not possible with policy gradient methods.
Assumptions and Notation
We assume that we can model the control system in a discrete-time manner and we will denote the current time step by $k$. In order to take possible stochasticity of the plant into account, we denote it using a probability distribution $x_{k+1} \sim p(x_{k+1} \mid x_k, u_k)$ as model, where $u_k \in \mathbb{R}^M$ denotes the current action, and $x_k, x_{k+1} \in \mathbb{R}^N$ denote the current and next state, respectively. We furthermore assume that actions are generated by a policy $u_k \sim \pi_\theta(u_k \mid x_k)$ which is modeled as a probability distribution in order to incorporate exploratory actions; for some special problems, the optimal solution to a control problem is actually a stochastic controller (Sutton, McAllester, Singh, and Mansour, 2000). The policy is assumed to be parameterized by $K$ policy parameters $\theta \in \mathbb{R}^K$. The sequence of states and actions forms a trajectory denoted by $\tau = [x_{0:H}, u_{0:H}]$, where $H$ denotes the horizon, which can be infinite. In this article, we will use the words trajectory, history, trial, or roll-out interchangeably. At each instant of time, the learning system receives a reward denoted by $r_k = r(x_k, u_k) \in \mathbb{R}$.

The general goal of policy optimization in reinforcement learning is to optimize the policy parameters $\theta \in \mathbb{R}^K$ so that the expected return

$$J(\theta) = E\Big\{\sum_{k=0}^{H} a_k r_k\Big\}$$

is optimized, where the $a_k$ denote time-step-dependent weighting factors, often set to $a_k = \gamma^k$ for discounted reinforcement learning (where $\gamma \in [0,1]$) or $a_k = 1/H$ for the average-reward case.

For real-world applications, we require that any change to the policy parameterization has to be smooth, as drastic changes can be hazardous for the actor, and useful initializations of the policy based on domain knowledge would otherwise vanish after a single update step. For these reasons, policy gradient methods which follow the steepest descent on the expected return are the method of choice. These methods update the policy parameterization according to the gradient update rule

$$\theta_{h+1} = \theta_h + \alpha_h \nabla_\theta J|_{\theta=\theta_h},$$

where $\alpha_h \in \mathbb{R}^+$ denotes a learning rate and $h \in \{0, 1, 2, \ldots\}$ the current update number.

The time step $k$ and update number $h$ are two different variables. In actor-critic-based policy gradient methods, the frequency of updates of $h$ can be nearly as high as that of $k$. However, in most episodic methods, the policy update $h$ will be significantly less frequent. Here, a cut-off allows updates before the end of the episode (for $a_k = \gamma^k$, it is obvious that there comes a point where any future reward becomes irrelevant; a generically good cut-off point). If the gradient estimate is unbiased and the learning rates fulfill $\sum_{h=0}^{\infty} \alpha_h > 0$ and $\sum_{h=0}^{\infty} \alpha_h^2 = \text{const}$, the learning process is guaranteed to converge at least to a local optimum.

The main problem in policy gradient methods is to obtain a good estimator of the policy gradient $\nabla_\theta J|_{\theta=\theta_h}$. In robotics and control, people have traditionally used deterministic model-based methods for obtaining the gradient (Jacobson & Mayne, 1970; Dyer &
McReynolds, 1970; Hasdorff, 1976). However, in order to become
autonomous we cannot expect to be able to model every detail of the
system. Therefore, we need to estimate the policy gradient simply
from data generated during the execution of a task, i.e., without the
need for a model. In this section, we will study different approaches
and discuss their properties.
Approaches to Policy Gradient Estimation
The literature on policy gradient methods has yielded a variety of
estimation methods over the last years. The most prominent
approaches, which have been applied to robotics are finite-difference
and likelihood ratio methods, better known as REINFORCE in
reinforcement learning.
Finite-difference Methods
Finite-difference methods are among the oldest policy gradient
approaches; they originated from the stochastic simulation community
and are quite straightforward to understand. The policy
parameterization is varied $I$ times by small increments $\Delta\theta_i$, $i = 1 \ldots I$, and for each policy parameter variation $\theta_h + \Delta\theta_i$ roll-outs (or trajectories) are performed, which generate estimates $\Delta\hat{J}_i \approx J(\theta_h + \Delta\theta_i) - J_{\mathrm{ref}}$ of the expected return. There are different ways of choosing the reference value $J_{\mathrm{ref}}$, e.g. forward-difference estimators with $J_{\mathrm{ref}} = J(\theta_h)$ and central-difference estimators with $J_{\mathrm{ref}} = J(\theta_h - \Delta\theta_i)$. The policy gradient estimate $g_{\mathrm{FD}} \approx \nabla_\theta J|_{\theta=\theta_h}$ can be estimated by regression yielding

$$g_{\mathrm{FD}} = (\Delta\Theta^T \Delta\Theta)^{-1} \Delta\Theta^T \Delta\hat{J},$$

where $\Delta\Theta = [\Delta\theta_1, \ldots, \Delta\theta_I]^T$ and $\Delta\hat{J} = [\Delta\hat{J}_1, \ldots, \Delta\hat{J}_I]^T$ denote the $I$ samples. This approach can be highly efficient in simulation
optimization of deterministic systems (Spall, 2003) or when a
common history of random numbers (Glynn, 1987) is being used (the
latter is known as PEGASUS in reinforcement learning (Ng & Jordan,
2000)), and the error of the gradient estimate can get close
to $O(I^{-1/2})$ (Glynn, 1987). However, the uncertainty of real systems will result in stochasticity and an artificial common history of random numbers can no longer be applied. Hence, when used on a real system, the performance degrades to a gradient estimate error ranging between $O(I^{-1/4})$ and $O(I^{-2/5})$ depending on the chosen reference
value (Glynn, 1987). An implementation of this algorithm is shown
below.
input: policy parameterization $\theta_h$
for $i = 1$ to $I$ do
  generate policy variation $\Delta\theta_i$
  estimate $\hat{J}_i \approx J(\theta_h + \Delta\theta_i) = \langle \sum_{k=0}^{H} a_k r_k \rangle$ from roll-out
  estimate $\hat{J}_{\mathrm{ref}}$, e.g., $\hat{J}_{\mathrm{ref}} = J(\theta_h - \Delta\theta_i)$ from roll-out
  compute $\Delta\hat{J}_i \approx J(\theta_h + \Delta\theta_i) - J_{\mathrm{ref}}$
end for
return gradient estimate $g_{\mathrm{FD}} = (\Delta\Theta^T \Delta\Theta)^{-1} \Delta\Theta^T \Delta\hat{J}$
Note that an alternative implementation would keep increasing I until
the gradient estimate converges. The choice of I can be essential;
empirically it can be observed that taking I as twice the number of
parameters yields very accurate gradient estimates.
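The pseudocode above can be translated into a short Python sketch. The helper expected_return(theta), which would average the return of a few roll-outs for a given parameter vector, is a placeholder assumption and not part of the original text.

import numpy as np

def finite_difference_gradient(theta, expected_return, n_variations=None, delta=0.05):
    K = len(theta)
    I = n_variations or 2 * K              # heuristic from the text: I about twice the parameter count
    J_ref = expected_return(theta)         # forward-difference reference value J(theta_h)
    dTheta = np.random.uniform(-delta, delta, size=(I, K))    # policy variations
    dJ = np.array([expected_return(theta + dt) - J_ref for dt in dTheta])
    # Regression g_FD = (dTheta^T dTheta)^-1 dTheta^T dJ, solved as a least-squares problem
    g_fd, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)
    return g_fd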
Due to the simplicity of this approach, such methods have been
successfully applied to numerous applications. However, the
straightforward application is not without peril as the generation of
the Δθi requires proper knowledge of the system, as badly
chosen $\Delta\theta_i$ can destabilize the policy so that the system becomes unstable and the gradient estimation process is prone to fail. If the parameters differ highly in scale, significant difficulties can be the consequence.
Advantages of this approach: Finite-difference methods require very
little skill and can usually be implemented out of the box. They work
both with stochastic and deterministic policies without any change. They are highly efficient in simulation with a set of common histories of random numbers and on totally deterministic systems.
Disadvantages of this approach: The perturbation of the parameters
is a very difficult problem often with disastrous impact on the
learning process when the system goes unstable. In the presence of
noise on a real system, the gradient estimate error decreases much
slower than for the following methods. Performance depends highly
on the chosen policy parametrization.
Sehnke et al. (2010) show several interesting newer methods
developed in this domain.
Likelihood Ratio Methods and REINFORCE
Likelihood ratio methods are driven by a different important insight.
Assume that trajectories $\tau$ are generated from a system by roll-outs, i.e., $\tau \sim p_\theta(\tau) = p(\tau \mid \theta)$, with return $r(\tau) = \sum_{k=0}^{H} a_k r_k$, which leads to $J(\theta) = E\{r(\tau)\} = \int_{\mathbb{T}} p_\theta(\tau)\, r(\tau)\, d\tau$. In this case, the policy gradient can be estimated using the likelihood ratio (see e.g. Glynn, 1987; Aleksandrov, Sysoyev, and Shemeneva, 1968), better known as the REINFORCE (Williams, 1992) trick, i.e., by using

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$

from standard differential calculus ($\nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau) / p_\theta(\tau)$), we obtain

$$\nabla_\theta J(\theta) = \int_{\mathbb{T}} \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int_{\mathbb{T}} p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = E\{\nabla_\theta \log p_\theta(\tau)\, r(\tau)\}.$$

As the expectation $E\{\cdot\}$ can be replaced by sample averages, denoted by $\langle\cdot\rangle$, only the derivative $\nabla_\theta \log p_\theta(\tau)$ is needed for the estimator. Importantly, this derivative can be computed without knowledge of the generating distribution $p_\theta(\tau)$, as

$$p_\theta(\tau) = p(x_0) \prod_{k=0}^{H} p(x_{k+1} \mid x_k, u_k)\, \pi_\theta(u_k \mid x_k)$$

implies that

$$\nabla_\theta \log p_\theta(\tau) = \sum_{k=0}^{H} \nabla_\theta \log \pi_\theta(u_k \mid x_k),$$

as only the policy depends on $\theta$.

Thus, the derivatives of $p(x_{k+1} \mid x_k, u_k)$ do not have to be computed and no model needs to be maintained. However, if we had a deterministic policy $u = \pi(x)$ instead of a stochastic policy $u \sim \pi(u \mid x)$, computing such a derivative would require the derivative $\nabla_\theta \log p(x_{k+1} \mid x_k, u_k) = \nabla_{u_k} \log p(x_{k+1} \mid x_k, u_k)\, \nabla_\theta \pi_\theta(x_k)$ to compute $\nabla_\theta \log p_\theta(\tau)$ and, hence, it would require a system model.


In order to reduce the variance of the gradient estimator, a constant baseline can be subtracted from the gradient, i.e.,

$$\nabla_\theta J(\theta) = E\{\nabla_\theta \log p_\theta(\tau)\, (r(\tau) - b)\},$$

where the baseline $b \in \mathbb{R}$ can be chosen arbitrarily (Williams, 1992). It is straightforward to show that this baseline does not introduce bias in the gradient, as differentiating $\int_{\mathbb{T}} p_\theta(\tau)\, d\tau = 1$ implies that

$$\int_{\mathbb{T}} \nabla_\theta p_\theta(\tau)\, d\tau = 0,$$

and, hence, the constant baseline will vanish for infinite data while reducing the variance of the gradient estimator for finite data. See Peters & Schaal, 2008 for an overview of how to choose the baseline optimally. Therefore, the general path likelihood ratio estimator or episodic REINFORCE gradient estimator is given by

$$g_{\mathrm{RF}} = \Big\langle \Big(\sum_{k=0}^{H} \nabla_\theta \log \pi_\theta(u_k \mid x_k)\Big)\Big(\sum_{l=0}^{H} a_l r_l - b\Big) \Big\rangle,$$

where $\langle\cdot\rangle$ denotes the average over trajectories. This type of method is guaranteed to converge to the true gradient at the fastest theoretically possible error decrease of $O(I^{-1/2})$, where $I$ denotes the number of roll-outs (Glynn, 1987), even if the data is generated from a highly stochastic system. An implementation of this algorithm is shown below together with the estimator for the optimal baseline.
input: policy parameterization $\theta_h$
repeat
  perform a trial and obtain $x_{0:H}, u_{0:H}, r_{0:H}$
  for each gradient element $g_k$
    estimate the optimal baseline
    $$b = \frac{\big\langle \big(\sum_{h=0}^{H} \nabla_{\theta_k} \log \pi_\theta(u_h \mid x_h)\big)^2 \sum_{l=0}^{H} a_l r_l \big\rangle}{\big\langle \big(\sum_{h=0}^{H} \nabla_{\theta_k} \log \pi_\theta(u_h \mid x_h)\big)^2 \big\rangle}$$
    estimate the gradient element
    $$g_k = \Big\langle \Big(\sum_{h=0}^{H} \nabla_{\theta_k} \log \pi_\theta(u_h \mid x_h)\Big)\Big(\sum_{l=0}^{H} a_l r_l - b\Big) \Big\rangle$$
  end for
until gradient estimate $g_{\mathrm{RF}} = [g_1, \ldots, g_K]$ converged
return gradient estimate $g_{\mathrm{RF}} = [g_1, \ldots, g_K]$
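The following Python sketch illustrates episodic REINFORCE with a simple constant baseline (the mean return, rather than the optimal baseline above). The Gaussian policy and the toy one-dimensional reward are assumptions made up for the example.

import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)        # policy parameters
sigma = 0.5                # fixed exploration noise of the Gaussian policy
alpha = 0.05               # learning rate

def rollout(theta, horizon=10):
    """One trial: returns the summed grad-log-policy terms and the return."""
    grads, rewards = [], []
    x = rng.normal(size=2)                         # state
    for _ in range(horizon):
        mean = theta @ x
        u = mean + sigma * rng.normal()            # stochastic action u ~ pi_theta(u|x)
        grads.append((u - mean) / sigma**2 * x)    # grad_theta log pi_theta(u|x)
        rewards.append(-(u - 1.0) ** 2)            # toy reward: keep the action close to 1
        x = rng.normal(size=2)                     # toy "dynamics": draw a fresh state
    return np.sum(grads, axis=0), np.sum(rewards)

for update in range(200):
    samples = [rollout(theta) for _ in range(10)]  # a batch of roll-outs
    returns = np.array([r for _, r in samples])
    b = returns.mean()                             # simple baseline (not the optimal one)
    g_rf = np.mean([g * (r - b) for g, r in samples], axis=0)
    theta += alpha * g_rf                          # gradient ascent on J(theta)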
Advantages of this approach: Besides the theoretically faster
convergence rate, likelihood ratio gradient methods have a variety of
advantages in comparison to finite difference methods when applied
to robotics. As the generation of policy parameter variations is no
longer needed, the complicated control of these variables can no
longer endanger the gradient estimation process. Furthermore, in
practice, already a single roll-out can suffice for an unbiased gradient
estimate viable for a good policy update step, thus reducing the
amount of roll-outs needed. Finally, this approach has yielded the
most real-world robotics results (Peters & Schaal, 2008) and the
likelihood ratio gradient is guaranteed to achieve the fastest
convergence of the error for a stochastic system.
Disadvantages of this approach: When used with a deterministic
policy, likelihood ratio gradients have to maintain a system model.
Such a model can be very hard to obtain for continuous states and
actions, hence, the simpler finite difference gradients are often
superior in this scenario. Similarly, finite difference gradients can still
be more useful than likelihood ratio gradients if the system is
deterministic and very repetitive. Also, the practical implementation
of a likelihood ratio gradient method is much more demanding than
the one of a finite difference method.
Natural Policy Gradients
One of the main reasons for using policy gradient methods is that we
intend to do just a small change Δθ to the policy πθ while improving
the policy. However, the meaning of small is ambiguous. When using the Euclidean metric $\sqrt{\Delta\theta^T \Delta\theta}$, the gradient is different for every parameterization $\theta$ of the policy $\pi_\theta$, even if these parameterizations are related to each other by a linear transformation (Kakade, 2002). This problem poses the question of how we can
measure the closeness between the current policy and the updated
policy based upon the distribution of the paths generated by each of
these. In statistics, a variety of distance measures for the closeness of
two distributions (e.g., $p_\theta(\tau)$ and $p_{\theta+\Delta\theta}(\tau)$) have been suggested, e.g., the Kullback-Leibler divergence $d_{\mathrm{KL}}(p_\theta, p_{\theta+\Delta\theta})$, the Hellinger distance $d_{\mathrm{HD}}$ and others (Su & Gibbs, 2002). Many of these distances (e.g., the previously mentioned ones) can be approximated by their second-order Taylor expansion, i.e., by

$$d_{\mathrm{KL}}(p_\theta, p_{\theta+\Delta\theta}) \approx \Delta\theta^T F_\theta\, \Delta\theta,$$

where

$$F_\theta = \int_{\mathbb{T}} p_\theta(\tau)\, \nabla \log p_\theta(\tau)\, \nabla \log p_\theta(\tau)^T\, d\tau = \big\langle \nabla \log p_\theta(\tau)\, \nabla \log p_\theta(\tau)^T \big\rangle$$

is known as the Fisher information matrix. Let us now assume that we restrict the change of our policy to $d_{\mathrm{KL}}(p_\theta, p_{\theta+\Delta\theta}) \approx \Delta\theta^T F_\theta\, \Delta\theta = \varepsilon$, where $\varepsilon$ needs to be a very small number (i.e., close to zero).

update Δθ~ that is most similar to the true gradient ∇θJ while the
In that case, the natural gradient is defined by Amari (1998) as the
change in our path distribution is limited to ε . Hence, it is given by
the program
argmaxΔθΔθT∇θJs.t.ΔθTFθ Δθ=ε.
The solution to this program is given by

where ∇θJ denotes the regular likelihood ratio policy gradient from
Δθ∝F−1θ∇θJ,

the previous section. The update step is unique up to a scaling factor,


which is often subsumed into the learning rate. It can be interpreted as
follows: determine the maximal improvement Δθ of the policy for a
constant fixed change of the policy ΔθTFθΔθ .
As illustrated in Figure 1, the natural gradient update in Figure 1 (b)
corresponds to a slightly rotated regular policy gradient update in
Figure 1 (a). It can be guaranteed that it is always turned by less than
90 degrees (Amari, 1998), hence all convergence properties of the
regular policy gradient transfer.

Figure 1: When plotting the expected return landscape for a simple problem such as a 1d linear quadratic regulation, the differences between regular ('vanilla') and natural policy gradients become apparent.
This type of approach has its origin in supervised learning (Amari,
1998). It was first suggested in the context of reinforcement learning
by Kakade (2002) and has been explored in greater depth in (Bagnell
& Schneider, 2003; Peters, Vijayakumar & Schaal, 2003, 2005; Peters
& Schaal, 2008). The strongest theoretical advantage of this approach
is that its performance no longer depends on the parameterization of
the policy and is therefore safe to be used for arbitrary policies.
Hence, the regular policy gradient is sometimes referred to as a
flavored or vanilla gradient, as it keeps the `vanilla flavor' of the
policy. However, a well-chosen policy parametrization can sometimes
result in better convergence of the policy gradient than the natural
policy gradient. Nevertheless, in practice, a learning process based on
natural policy gradients often converges significantly faster for most
practical cases. Figure 1 gives an impression of the differences in the
learning process: while the regular policy gradient often points to
plateaus with little exploration, the natural gradient points to the
optimal solution.
One of the fastest general algorithms for estimating natural policy
gradients which does not need complex parameterized baselines is the
episodic natural actor critic. This algorithm, originally derived in
(Peters, Vijayakumar & Schaal, 2003), can be considered the `natural'
version of REINFORCE with a baseline optimal for this gradient
estimator. However, for steepest descent with respect to a metric, the
baseline also needs to minimize the variance with respect to the same
metric. In this case, we can minimize the whole covariance matrix of
the natural gradient estimate $\Delta\hat{\theta}$ given by

$$\Sigma = \mathrm{Cov}\{\Delta\hat{\theta}\}_{F_\theta} = E\big\{(\Delta\hat{\theta} - F_\theta^{-1} g_{\mathrm{RF}}(b))^T F_\theta\, (\Delta\hat{\theta} - F_\theta^{-1} g_{\mathrm{RF}}(b))\big\},$$

with

$$g_{\mathrm{RF}}(b) = \big\langle \nabla \log p_\theta(\tau)\, (r(\tau) - b) \big\rangle$$

being the REINFORCE gradient with baseline $b$. The re-weighting
with the Fisher information ensures that the best identifiable gradient
components get the highest weight. As outlined in (Peters & Schaal,
2008), it can be shown that the minimum-variance unbiased natural
gradient estimator can be determined as shown below.
input: policy parameterization $\theta_h$
repeat
  perform $M$ trials and obtain $x_{0:H}, u_{0:H}, r_{0:H}$ for each trial
  Obtain the sufficient statistics
    Policy derivatives $\psi_k = \nabla_\theta \log \pi_\theta(u_k \mid x_k)$
    Fisher matrix $F_\theta = \big\langle \big(\sum_{k=0}^{H} \psi_k\big)\big(\sum_{l=0}^{H} \psi_l\big)^T \big\rangle$
    Vanilla gradient $g = \big\langle \big(\sum_{k=0}^{H} \psi_k\big)\big(\sum_{l=0}^{H} a_l r_l\big) \big\rangle$
    Eligibility $\phi = \big\langle \sum_{k=0}^{H} \psi_k \big\rangle$
    Average reward $\bar{r} = \big\langle \sum_{l=0}^{H} a_l r_l \big\rangle$
  Obtain natural gradient by computing
    Baseline $b = Q(\bar{r} - \phi^T F_\theta^{-1} g)$
      with $Q = M^{-1}\big(1 + \phi^T (M F_\theta - \phi\phi^T)^{-1} \phi\big)$
    Natural gradient $g_{\mathrm{NG}} = F_\theta^{-1}(g - \phi b)$
until gradient estimate $g_{\mathrm{NG}} = [g_1, \ldots, g_K]$ converged
return gradient estimate $g_{\mathrm{NG}} = [g_1, \ldots, g_K]$

For the derivation, see (Peters & Schaal, 2008).


Advantages of this approach: Natural policy gradients differ in a
deciding aspect from both finite difference gradients and regular
likelihood ratio gradients, i.e., they are independent from the choice of
policy parametrization if the choices have the same representational
power. As a result, they can be an order of magnitude faster than the
regular gradient. They also profit from most other advantages of the
regular policy gradients.
Disadvantages of this approach: In comparison to the regular policy
gradient, there are three disadvantages: first, the matrix inversion in
the gradient estimators may be numerically brittle and may scale
worse (note that there are tricks to alleviate this problem). Second, if
we can find a special policy parametrization that trivializes a problem,
the natural policy gradient may not make use of it. Third, the natural
policy gradient estimators are often much harder to implement.
Conclusion
We have presented a quick overview of policy gradient methods.
While many details needed to be omitted and may be found in (Peters
& Schaal, 2008), this entry roughly represents the state of the art in
policy gradient methods. All three major ways of estimating first
order gradients, i.e., finite-difference gradients, vanilla policy
gradients and natural policy gradients are discussed in this article and
practical algorithms are given.

ACTOR CRITIC ALGORITHM


The Actor-Critic Reinforcement Learning algorithm

Actor-Critic architecture. Source:[1]

Policy-based and value-based RL algorithm


In policy-based RL, the optimal policy is computed by manipulating the policy directly, while value-based RL implicitly finds the optimal policy by finding the optimal value function. Policy-based RL is effective in high-dimensional and stochastic continuous action spaces, and in learning stochastic policies. At the same time, value-based RL excels in sample efficiency and stability.

The main challenge of policy-gradient RL is the high variance of the gradient. The standard approach to reducing the variance of the gradient estimate is to use a baseline function b(st) [4]. A common concern is that adding the baseline will introduce bias into the gradient estimate. There is a proof that the baseline does not bias the gradient estimate.

Proof that baseline is unbiased

The policy gradient expression of the REINFORCE algorithm, in expectation form, is:

$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big]$$

We can write the reward of a trajectory, R(τ), as:

$$R(\tau) = \sum_{t=0}^{T} r(s_t, a_t)$$

Adding the baseline function modifies the policy gradient expression as follows:

$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(R(\tau) - b(s_t)\big)\Big]$$

We can call the combined reward and baseline term the advantage function:

$$A(s_t, a_t) = R(\tau) - b(s_t)$$

An important point to note in the above equation is that the baseline b is a function of s_t, not of s_t' [4].

We can rearrange the expression by splitting the expectation into two terms [4].

The above expression has the form E(X − Y). Due to the linearity of expectation, we can rewrite E(X − Y) as E(X) − E(Y) [3], so the gradient is split into two expectations:

$$\nabla_\theta J(\theta) = E\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big] - E\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\Big]$$

If the second term, the one containing the baseline, is zero, this proves that adding the baseline function b introduces no bias into the gradient estimate. The second term is indeed zero [3]: for any fixed state s_t,

$$E_{a_t \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] = b(s_t)\, \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t) = b(s_t)\, \nabla_\theta 1 = 0.$$

The derivation above proves that adding the baseline function introduces no bias into the gradient estimate.
Actor-critic

In simple terms, Actor-Critic is a Temporal Difference (TD) version of the policy gradient [3]. It has two networks: the Actor and the Critic. The actor decides which action should be taken, and the critic informs the actor how good the action was and how it should adjust. The learning of the actor is based on the policy gradient approach. In comparison, the critic evaluates the action produced by the actor by computing the value function.

This type of architecture also appears in the Generative Adversarial Network (GAN), where the discriminator and the generator participate in a game [2]. The generator generates fake images and the discriminator evaluates how good the generated fake image is against its representation of the real image [2]. Over time the generator can create fake images which the discriminator cannot distinguish [2]. Similarly, the Actor and the Critic participate in a game, but both of them improve over time, unlike in a GAN [2].

Actor-critic is similar to a policy gradient algorithm called REINFORCE with baseline. REINFORCE is Monte Carlo learning, which means that the total return is sampled from the full trajectory. In actor-critic, however, we use bootstrapping, so the main change is in the advantage function.

In the original advantage function of the policy gradient, the total return is replaced by a bootstrapped estimate [3], and b(st) becomes the value function of the current state. The modified advantage function for actor-critic can therefore be written as:

$$A(s_t, a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

This advantage function is also called the TD error, δt, in the Actor-Critic framework. As mentioned above, the learning of the actor is based on the policy gradient. The policy gradient expression of the actor is:

$$\nabla_\theta J(\theta) \approx \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t$$

Pseudocode of Actor-Critic algorithm[6]

1. Sample {s_t, a_t} using the policy πθ from the actor network.

2. Evaluate the advantage function A_t, which can be called the TD error δt. In the Actor-Critic algorithm, the advantage function is produced by the critic network.

3. Evaluate the gradient using the policy gradient expression of the actor given above.

4. Update the policy parameters, θ.

5. Update the weights of the critic using value-based RL (Q-learning); δt is equivalent to the advantage function.

6. Repeat 1 to 5 until we find the optimal policy πθ. (A minimal code sketch of these steps is given below.)
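A minimal Python sketch of these six steps is given below, using a tabular softmax actor and a tabular critic on a toy 5-state chain. The environment and the parameter values are illustrative assumptions; the TD error δt plays the role of the advantage, as described above.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))   # actor parameters (softmax policy)
V = np.zeros(n_states)                    # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.95

def policy(s):
    prefs = np.exp(theta[s] - theta[s].max())
    return prefs / prefs.sum()

def env_step(s, a):
    # Toy dynamics: action 1 moves right (reward at the far end), action 0 resets to 0.
    s2 = min(s + 1, n_states - 1) if a == 1 else 0
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        probs = policy(s)
        a = rng.choice(n_actions, p=probs)              # 1. sample an action from the actor
        s2, r, done = env_step(s, a)
        target = r + (0.0 if done else gamma * V[s2])
        delta = target - V[s]                           # 2. TD error = advantage estimate
        V[s] += alpha_critic * delta                    # 5. critic update (value-based)
        grad_log = -probs
        grad_log[a] += 1.0                              # grad of log softmax policy
        theta[s] += alpha_actor * delta * grad_log      # 3. and 4. actor policy-gradient update
        s = s2                                          # 6. repeat until the episode ends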

AUTOENCODING:
Autoencoders are very useful in the field of unsupervised machine
learning. You can use them to compress the data and reduce its
dimensionality.

The main difference between Autoencoders and Principal Component Analysis (PCA) is that while PCA finds the directions along which you can project the data with maximum variance, Autoencoders reconstruct our original input given just a compressed version of it.

Anyone who needs the original data can reconstruct it from the compressed data using an autoencoder.
The encoder part of the network is used for encoding and
sometimes even for data compression purposes although it
is not very effective as compared to other general
compression techniques like JPEG. Encoding is achieved by the
encoder part of the network which has a decreasing number of
hidden units in each layer. Thus this part is forced to pick up only
the most significant and representative features of the data. The
second half of the network performs the Decoding function. This
part has an increasing number of hidden units in each layer and
thus tries to reconstruct the original input from the encoded data.
Thus Auto-encoders are an unsupervised learning technique.
Example: See the code below; in an autoencoder the training data is fitted to itself. That's why, instead of fitting X_train to Y_train, we have used X_train in both places.

 Python3
autoencoder.fit(X_train, X_train, epochs=200)
Training of an Auto-encoder for data compression: For a data compression procedure, the most important aspect of the compression is the reliability of the reconstruction of the compressed data. This requirement dictates the structure of the Auto-encoder as a bottleneck.

Step 1: Encoding the input data. The Auto-encoder first tries to encode the data using the initialized weights and biases.

Step 2: Decoding the input data. The Auto-encoder tries to reconstruct the original input from the encoded data to test the reliability of the encoding.

Step 3: Backpropagating the error. After the reconstruction, the loss function is computed to determine the reliability of the encoding. The error generated is backpropagated.
The above-described training process is reiterated several times
until an acceptable level of reconstruction is reached.
After the training process, only the encoder part of the Auto-encoder
is retained to encode a similar type of data used in the training
process. The different ways to constrain the network are:-
 Keep small Hidden Layers: If the size of each hidden layer is kept as small as possible, then the network will be forced to pick up only the representative features of the data, thus encoding the data.
 Regularization: In this method, a loss term is added to the cost function which encourages the network to train in ways other than copying the input.
 Denoising: Another way of constraining the network is to add noise to the input and teach the network how to remove the noise from the data.
 Tuning the Activation Functions: This method involves changing the activation functions of various nodes so that a majority of the nodes are dormant, thus effectively reducing the size of the hidden layers.
The different variations of Auto-encoders are:
 Denoising Auto-encoder: This type of auto-encoder works on a partially corrupted input and trains to recover the original undistorted image. As mentioned above, this method is an effective way to constrain the network from simply copying the input.
 Sparse Auto-encoder: This type of auto-encoder typically contains more hidden units than the input, but only a few are allowed to be active at once. This property is called the sparsity of the network. The sparsity of the network can be controlled by either manually zeroing the required hidden units, tuning the activation functions, or by adding a loss term to the cost function.
 Variational Auto-encoder: This type of auto-encoder makes strong assumptions about the distribution of latent variables and uses the Stochastic Gradient Variational Bayes estimator in the training process. It assumes that the data is generated by a directed graphical model and tries to learn an approximation q_φ(z|x) to the conditional posterior, where φ and θ are the parameters of the encoder and the decoder respectively.
Below is a basic, intuition-level sketch of how to build the autoencoder model and fit X_train to itself.
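The code referenced above is not included in this document; the following is a minimal, hypothetical Keras sketch of the same idea. The layer sizes and the 784-dimensional input are assumptions chosen for illustration; only the call that fits X_train to itself comes from the text.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X_train = np.random.rand(1000, 784).astype("float32")   # stand-in for the real training data

autoencoder = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),    # encoder: decreasing number of hidden units
    layers.Dense(32, activation="relu"),     # bottleneck (the "code")
    layers.Dense(128, activation="relu"),    # decoder: increasing number of hidden units
    layers.Dense(784, activation="sigmoid"), # reconstruction of the original input
])
autoencoder.compile(optimizer="adam", loss="mse")

# The training data is fitted to itself: X_train is both the input and the target.
autoencoder.fit(X_train, X_train, epochs=200)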

An Autoencoder consists of three layers:

1. Encoder

2. Code

3. Decoder
The Encoder layer compresses the input image into a latent space
representation. It encodes the input image as a compressed
representation in a reduced dimension.

The compressed image is a distorted version of the original image.

The Code layer represents the compressed input fed to the decoder
layer.

The decoder layer decodes the encoded image back to the original dimension. The decoded image is reconstructed from the latent space representation and is a lossy reconstruction of the original image.

Convolutional autoencoding:

A convolutional autoencoder is a neural network (a special case of an unsupervised learning model) that is trained to reproduce its input image in the output layer. An image is passed through an encoder, which is a ConvNet that produces a low-dimensional representation of the image. The decoder, which is another ConvNet, takes this compressed image and reconstructs the original image.

The encoder is used to compress the data and the decoder is used to reproduce the original image. Therefore, autoencoders may be used for data compression. The compression logic is data-specific, meaning it is learned from data rather than being a predefined compression algorithm such as JPEG, MP3, and so on. Other applications of autoencoders can be image denoising (producing a cleaner image from a corrupted image), dimensionality reduction, and image search:

This differs from regular ConvNets or neural nets in the sense that the
input size and the target size must be the same.
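A minimal Keras sketch of a convolutional autoencoder is given below, assuming 28x28 grayscale images; the specific layer configuration is an illustrative assumption, not a prescribed architecture.

from tensorflow import keras
from tensorflow.keras import layers

conv_autoencoder = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    # Encoder ConvNet: produces a low-dimensional representation of the image
    layers.Conv2D(16, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2, padding="same"),               # 28x28 -> 14x14
    layers.Conv2D(8, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2, padding="same"),               # 14x14 -> 7x7 (compressed image)
    # Decoder ConvNet: reconstructs the image at the original size
    layers.Conv2D(8, 3, activation="relu", padding="same"),
    layers.UpSampling2D(2),                               # 7x7 -> 14x14
    layers.Conv2D(16, 3, activation="relu", padding="same"),
    layers.UpSampling2D(2),                               # 14x14 -> 28x28
    layers.Conv2D(1, 3, activation="sigmoid", padding="same"),
])
conv_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Input and target are the same images, so the input size equals the target size:
# conv_autoencoder.fit(X_images, X_images, epochs=50, batch_size=128)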

VARIATIONAL AUTOENCODING:

A variational autoencoder (VAE) provides a probabilistic manner


for describing an observation in latent space. Thus, rather than
building an encoder which outputs a single value to describe each
latent state attribute, we'll formulate our encoder to describe a
probability distribution for each latent attribute.
EXPLANATION:
In neural net language, a variational autoencoder consists of an
encoder, a decoder, and a loss function.

The encoder compresses data into a latent space (z). The decoder reconstructs the data given the hidden representation.
The encoder is a neural network. Its input is a datapoint x, its output is a hidden representation z, and it has weights and biases θ.
The decoder is another neural net. Its input is the representation z, it outputs the parameters to the probability distribution of the data, and it has weights and biases ϕ. The decoder is denoted by $p_\phi(x \mid z)$.
The decoder 'decodes' the real-valued numbers in z into 784 real-valued numbers between 0 and 1. Information from the original 784-dimensional vector cannot be perfectly transmitted, because the decoder only sees a compressed summary of it (the lower-dimensional vector z).
How much information is lost? We measure this using the reconstruction log-likelihood $\log p_\phi(x \mid z)$, whose units are nats. This measure tells us how effectively the decoder has learned to reconstruct an input image x given its latent representation z.
The loss function of the variational autoencoder is the negative log-likelihood with a regularizer. Because there are no global representations that are shared by all datapoints, the loss decomposes into terms that depend only on a single datapoint.
The probability model perspective

Now let’s think about variational autoencoders from a probability


model perspective. Please forget everything you know about deep
learning and neural networks for now. Thinking about the following
concepts in isolation from neural networks will clarify things. At the
very end, we’ll bring back neural nets.

In the probability model framework, a variational autoencoder


contains a specific probability model of data x and latent variables z. We can write the joint probability of the model as $p(x, z) = p(x \mid z)\, p(z)$. The generative process can be written as follows.
For each datapoint $i$:
 Draw latent variables $z_i \sim p(z)$
 Draw datapoint $x_i \sim p(x \mid z)$
We can represent this as a graphical model:
The graphical model representation of the model in the variational autoencoder: the latent variable z is a standard normal, and the data are drawn from p(x|z). The shaded node for x denotes observed data. For black and white images of handwritten digits, this data likelihood is Bernoulli distributed.
This is the central object we think about when discussing variational autoencoders from a probability model perspective. The latent variables are drawn from a prior $p(z)$. The data x have a likelihood $p(x \mid z)$ that is conditioned on latent variables z. The model defines a joint probability distribution over data and latent variables: $p(x, z)$. We can decompose this into the likelihood and prior: $p(x, z) = p(x \mid z)\, p(z)$. For black and white digits, the likelihood is Bernoulli distributed.
Now we can think about inference in this model. The goal is to infer good values of the latent variables given observed data, or to calculate the posterior $p(z \mid x)$. Bayes' rule says:

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}.$$

Examine the denominator $p(x)$. This is called the evidence, and we can calculate it by marginalizing out the latent variables: $p(x) = \int p(x \mid z)\, p(z)\, dz$. Unfortunately, this integral requires exponential time to compute, as it needs to be evaluated over all configurations of latent variables. We therefore need to approximate this posterior distribution.
Variational inference approximates the posterior with a family of distributions $q_\lambda(z \mid x)$. The variational parameter $\lambda$ indexes the family of distributions. For example, if $q$ were Gaussian, it would be the mean and variance of the latent variables for each datapoint: $\lambda_{x_i} = (\mu_{x_i}, \sigma^2_{x_i})$.
How can we know how well our variational posterior $q(z \mid x)$ approximates the true posterior $p(z \mid x)$? We can use the Kullback-Leibler divergence, which measures the information lost when using $q$ to approximate $p$ (in units of nats):

$$KL\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big) = E_q[\log q_\lambda(z \mid x)] - E_q[\log p(x, z)] + \log p(x)$$

Our goal is to find the variational parameters $\lambda$ that minimize this divergence. The optimal approximate posterior is thus

$$q_\lambda^*(z \mid x) = \arg\min_\lambda KL\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big).$$

Why is this impossible to compute directly? The pesky evidence $p(x)$ appears in the divergence. This is intractable as discussed above. We need one more ingredient for tractable variational inference. Consider the following function:

$$ELBO(\lambda) = E_q[\log p(x, z)] - E_q[\log q_\lambda(z \mid x)].$$
Notice that we can combine this with the Kullback-Leibler divergence and rewrite the evidence as

$$\log p(x) = ELBO(\lambda) + KL\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big).$$

By Jensen's inequality, the Kullback-Leibler divergence is always greater than or equal to zero. This means that minimizing the Kullback-Leibler divergence is equivalent to maximizing the ELBO.
The abbreviation is revealed: the Variational Autoencoder (VAE). In neural net language, a VAE consists of an encoder, a decoder, and a loss function. In probability model terms, the variational autoencoder refers to approximate inference in a latent Gaussian model where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).
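As a small numerical illustration of this loss, the numpy function below computes the per-datapoint negative ELBO for a Bernoulli decoder and a diagonal-Gaussian approximate posterior with a standard-normal prior. The closed-form KL term is standard; the function name and interface are assumptions made for this sketch.

import numpy as np

def vae_negative_elbo(x, x_recon_probs, z_mean, z_log_var):
    eps = 1e-7
    # Reconstruction term: negative Bernoulli log-likelihood log p(x|z), summed over pixels.
    recon = -np.sum(x * np.log(x_recon_probs + eps)
                    + (1.0 - x) * np.log(1.0 - x_recon_probs + eps), axis=-1)
    # Regularizer: KL( q(z|x) || N(0, I) ), in closed form for a diagonal Gaussian.
    kl = -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=-1)
    return recon + kl     # minimizing this is equivalent to maximizing the ELBO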
GLOSSARY:

 Loss function: in neural net language, we think of loss functions. Training means minimizing these loss functions. But in variational inference, we maximize the ELBO (which is not a loss function). This leads to awkwardness like calling optimizer.minimize(-elbo), as optimizers in neural net frameworks only support minimization.
 Encoder: in the neural net world, the encoder is a neural network that outputs a representation z of data x. In probability model terms, the inference network parametrizes the approximate posterior of the latent variables z. The inference network outputs parameters to the distribution $q(z \mid x)$.
 Decoder: in deep learning, the decoder is a neural net that learns to reconstruct the data x given a representation z. In terms of probability models, the likelihood of the data x given latent variables z is parametrized by a generative network. The generative network outputs parameters to the likelihood distribution $p(x \mid z)$.
 Local latent variables: these are the $z_i$ for each datapoint $x_i$. There are no global latent variables. Because there are only local latent variables, we can easily decompose the ELBO into terms $L_i$ that depend only on a single datapoint $x_i$. This enables stochastic gradient descent.
 Inference: in neural nets, inference usually means prediction of latent representations given new, never-before-seen datapoints. In probability models, inference refers to inferring the values of latent variables given observed data.
One jargon-laden concept deserves its own subsection:

Mean-field versus amortized inference


This issue was very confusing for me, and I can see how it might be
even more confusing for someone coming from a deep learning
background. In deep learning, we think of inputs and outputs,
encoders and decoders, and loss functions. This can lead to fuzzy,
imprecise concepts when learning about probabilistic modeling.

Let’s discuss how mean-field inference differs from amortized


inference. This is a choice we face when doing approximate inference
to estimate a posterior distribution of latent variables. We might have
various constraints: do we have lots of data? Do we have big
computers or GPUs? Do we have local, per-datapoint latent variables,
or global latent variables shared across all datapoints?

Mean-field variational inference refers to a choice of a variational distribution that factorizes across the N data points, with no shared parameters:

$$q(z) = \prod_{i}^{N} q(z_i; \lambda_i)$$

This means there are free parameters for each datapoint $\lambda_i$ (e.g. $\lambda_i = (\mu_i, \sigma_i)$ for Gaussian latent variables). How do we do 'learning' for a new, unseen datapoint? We need to maximize the ELBO for each new datapoint, with respect to its mean-field parameter(s) $\lambda_i$.
Amortized inference refers to 'amortizing' the cost of inference across datapoints. One way to do this is by sharing (amortizing) the variational parameters $\lambda$ across datapoints. For example, in the variational autoencoder, the parameters θ of the inference network are shared; these global parameters are used for all datapoints. If we see a new datapoint and want to see what its approximate posterior $q(z_i)$ looks like, we can run variational inference again (maximizing the ELBO until convergence), or trust that the shared parameters are 'good enough'. This can be an advantage over mean-field inference.
Which one is more flexible? Mean-field inference is strictly more expressive, because it has no shared parameters. The per-datapoint parameters $\lambda_i$ can ensure our approximate posterior is most faithful to the data. Another way to think of this is that we are limiting the capacity or representational power of our variational family by tying parameters across datapoints (e.g. with a neural network that shares weights and biases across data).

GENERATIVE ADVERSARIAL NETWORKS:

A Generative Adversarial Network (GAN) is a deep learning


architecture that consists of two neural networks competing against
each other in a zero-sum game framework. The goal of GANs is to
generate new, synthetic data that resembles some known data
distribution.

1.Components:

 Generator network: creates synthetic data


 Discriminator network: evaluates the synthetic data
and tries to determine if it’s real or fake
2.Training:

 The generator network produces synthetic data and


the discriminator network evaluates it.
 The generator is trained to fool the discriminator and
the discriminator is trained to correctly identify real and
fake data.
 This process continues until the generator produces
data that is indistinguishable from real data.

3.Applications:

 Image synthesis
 Text-to-Image synthesis
 Image-to-Image translation
 Anomaly detection
 Data augmentation

4.Limitations:

 Training can be unstable and prone to mode collapse,


where the generator produces limited variations of synthetic
data.
 GANs can be difficult to train and require a lot of computational resources.
 GANs can generate unrealistic or irrelevant synthetic
data if the generator and discriminator are not properly
trained.
Generative Adversarial Networks (GANs) are a powerful class of neural networks that are used for unsupervised learning. They were developed and introduced by Ian J. Goodfellow in 2014. GANs are basically made up of a system of two competing neural network models which compete with each other and are able to analyze, capture and copy the variations within a dataset.

Why were GANs developed in the first place? It has been noticed that most of the mainstream neural nets can be easily fooled into misclassifying things by adding only a small amount of noise to the original data. Surprisingly, the model after adding noise has higher confidence in the wrong prediction than when it predicted correctly. The reason for such an adversary is that most machine learning models learn from a limited amount of data, which is a huge drawback, as they are prone to overfitting. Also, the mapping between the input and the output is almost linear. Although it may seem that the boundaries of separation between the various classes are linear, in reality they are composed of linearities, and even a small change at a point in the feature space might lead to misclassification of data.

How do GANs work? Generative Adversarial Networks (GANs) can be broken down into three parts:
 Generative: To learn a generative model, which describes how data is generated in terms of a probabilistic model.
 Adversarial: The training of a model is done in an adversarial setting.
 Networks: Use deep neural networks as the artificial intelligence (AI) algorithms for training purposes.
In GANs, there is a generator and a discriminator. The Generator
generates fake samples of data(be it an image, audio, etc.) and tries
to fool the Discriminator. The Discriminator, on the other hand, tries
to distinguish between the real and fake samples. The Generator
and the Discriminator are both Neural Networks and they both run in
competition with each other in the training phase. The steps are
repeated several times and in this, the Generator and Discriminator
get better and better in their respective jobs after each repetition.
The working can be visualized by the diagram given
below:

Here, the generative model captures the distribution of the data and is
trained in such a manner that it tries to maximize the probability of
the Discriminator making a mistake. The Discriminator, on the other
hand, is a model that estimates the probability that the sample it
receives came from the training data and not from the Generator.
GANs are formulated as a minimax game, where the Discriminator tries
to maximize its reward V(D, G) and the Generator tries to minimize
the Discriminator's reward, or in other words, maximize the
Discriminator's loss. This can be described mathematically by the
formula below:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
where:
 G = Generator
 D = Discriminator
 p_data(x) = distribution of real data
 p_z(z) = distribution of the Generator's input noise
 x = sample from p_data(x)
 z = sample from p_z(z)
 D(x) = Discriminator network
 G(z) = Generator network
So, basically, training a GAN has two parts:
 Part 1: The Discriminator is trained while the Generator is idle. In this phase, the Generator is only forward propagated and no back-propagation is done on it. The Discriminator is trained on real data for n epochs to see whether it can correctly predict them as real, and it is also trained on the fake data produced by the Generator to see whether it can correctly predict them as fake.
 Part 2: The Generator is trained while the Discriminator is idle. After the Discriminator has been trained on the Generator's fake data, we take its predictions and use them to train the Generator, so that it improves on its previous state and gets better at fooling the Discriminator.
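
To make the two training parts concrete, here is a minimal, illustrative sketch of the alternating loop in Keras. It is not taken from the text above: the layer sizes, latent dimension, toy 2-D "real" data distribution and optimizer settings are all assumptions made just for this example.

# Illustrative GAN training-loop sketch (assumed sizes and toy data, not a reference implementation)
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

latent_dim, data_dim = 8, 2

# Generator: noise z -> fake sample
generator = Sequential([Dense(16, activation='relu', input_shape=(latent_dim,)),
                        Dense(data_dim, activation='linear')])

# Discriminator: sample -> probability it is real
discriminator = Sequential([Dense(16, activation='relu', input_shape=(data_dim,)),
                            Dense(1, activation='sigmoid')])
discriminator.compile(optimizer=Adam(1e-3), loss='binary_crossentropy')

# Combined model used only to train the generator; the discriminator is frozen inside it
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(optimizer=Adam(1e-3), loss='binary_crossentropy')

batch = 64
for step in range(1000):
    # Part 1: train the discriminator on real (label 1) and fake (label 0) samples
    real = np.random.multivariate_normal([2, 2], np.eye(2), batch)   # toy "real" data
    z = np.random.randn(batch, latent_dim)
    fake = generator.predict(z, verbose=0)
    discriminator.train_on_batch(real, np.ones((batch, 1)))
    discriminator.train_on_batch(fake, np.zeros((batch, 1)))

    # Part 2: train the generator so the (frozen) discriminator labels its output as "real"
    z = np.random.randn(batch, latent_dim)
    gan.train_on_batch(z, np.ones((batch, 1)))

The key design point is that the Discriminator's weights are frozen inside the combined model, so the update in Part 2 only changes the Generator.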
Different types of GANs:
1. Vanilla GAN: This is the simplest type of GAN. Here, the Generator and the Discriminator are simple multi-layer perceptrons. In a vanilla GAN, the algorithm is really simple: it tries to optimize the mathematical equation above using stochastic gradient descent.
2. Conditional GAN (CGAN): CGAN can be described as a deep learning method in which some conditional parameters are put into place. In CGAN, an additional parameter 'y' is added to the Generator for generating the corresponding data. Labels are also fed into the input of the Discriminator to help it distinguish the real data from the fake generated data.
3. Deep Convolutional GAN (DCGAN): DCGAN is one of the most popular and also the most successful implementations of GAN. It is composed of ConvNets in place of multi-layer perceptrons. The ConvNets are implemented without max pooling, which is in fact replaced by convolutional stride. Also, the layers are not fully connected.
4. Laplacian Pyramid GAN (LAPGAN): The Laplacian pyramid is a linear invertible image representation consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency residual. This approach uses multiple Generator and Discriminator networks at different levels of the Laplacian pyramid. It is mainly used because it produces very high-quality images: the image is first down-sampled at each layer of the pyramid and then up-scaled again at each layer in a backward pass, where it acquires some noise from the Conditional GAN at these layers until it reaches its original size.
5. Super Resolution GAN (SRGAN): SRGAN, as the name suggests, is a way of designing a GAN in which a deep neural network is used along with an adversarial network in order to produce higher-resolution images. This type of GAN is particularly useful for optimally up-scaling native low-resolution images to enhance their details while minimizing errors.

Advantages of Generative Adversarial Networks (GANs):

1. Synthetic data generation: GANs can generate new,


synthetic data that resembles some known data distribu-
tion, which can be useful for data augmentation, anomaly
detection, or creative applications.
2. High-quality results: GANs can produce high-quality,
photorealistic results in image synthesis, video synthesis,
music synthesis, and other tasks.
3. Unsupervised learning: GANs can be trained without
labeled data, making them suitable for unsupervised learn-
ing tasks, where labeled data is scarce or difficult to obtain.
4. Versatility: GANs can be applied to a wide range of
tasks, including image synthesis, text-to-image synthesis,
image-to-image translation, anomaly detection, data aug-
mentation, and others.

Disadvantages of Generative Adversarial Networks


(GANs):
1. Training instability: GANs can be difficult to train, with
the risk of instability, mode collapse, or failure to converge.
2. Computational cost: GANs can require a lot of compu-
tational resources and can be slow to train, especially for
high-resolution images or large datasets.
3. Overfitting: GANs can overfit to the training data, pro-
ducing synthetic data that is too similar to the training data
and lacking diversity.
4. Bias and fairness: GANs can reflect the biases and un-
fairness present in the training data, leading to discriminat-
ory or biased synthetic data.
5. Interpretability and accountability: GANs can be
opaque and difficult to interpret or explain, making it chal-
lenging to ensure accountability, transparency, or fairness in
their applications.

AUTO ENCODERS FOR FEATURE EXTRACTION:

Autoencoder Feature Extraction for Classification
by Jason Brownlee on December 7, 2020, in Deep Learning

Autoencoder is a type of neural network that can be used to learn a compressed


representation of raw data.
An autoencoder is composed of encoder and decoder sub-models. The encoder
compresses the input and the decoder attempts to recreate the input from the
compressed version provided by the encoder. After training, the encoder model is saved
and the decoder is discarded.

The encoder can then be used as a data preparation technique to perform feature
extraction on raw data that can be used to train a different machine learning model.

In this tutorial, you will discover how to develop and evaluate an autoencoder for
classification predictive modeling.

After completing this tutorial, you will know:

 An autoencoder is a neural network model that can be used to learn a com-


pressed representation of raw data.
 How to train an autoencoder model on a training dataset and save just the en-
coder part of the model.
 How to use the encoder as a data preparation step when training a machine
learning model.
Let’s get started.


Tutorial Overview
This tutorial is divided into three parts; they are:
1. Autoencoders for Feature Extraction
2. Autoencoder for Classification
3. Encoder as Data Preparation for Predictive Model
Autoencoders for Feature Extraction
An autoencoder is a neural network model that seeks to learn a compressed
representation of an input.
An autoencoder is a neural network that is trained to attempt to copy its input to its
output.

— Page 502, Deep Learning, 2016.


They are an unsupervised learning method, although technically, they are trained using
supervised learning methods, referred to as self-supervised.

Autoencoders are typically trained as part of a broader model that attempts to recreate
the input.

For example:

 X = model.predict(X)
The design of the autoencoder model purposefully makes this challenging by restricting
the architecture to a bottleneck at the midpoint of the model, from which the
reconstruction of the input data is performed.

There are many types of autoencoders, and their use varies, but perhaps the more
common use is as a learned or automatic feature extraction model.

In this case, once the model is fit, the reconstruction aspect of the model can be
discarded and the model up to the point of the bottleneck can be used. The output of the
model at the bottleneck is a fixed-length vector that provides a compressed
representation of the input data.

Usually they are restricted in ways that allow them to copy only approximately, and to
copy only input that resembles the training data. Because the model is forced to
prioritize which aspects of the input should be copied, it often learns useful properties of
the data.

— Page 502, Deep Learning, 2016.


Input data from the domain can then be provided to the model and the output of the
model at the bottleneck can be used as a feature vector in a supervised learning model,
for visualization, or more generally for dimensionality reduction.
Next, let’s explore how we might develop an autoencoder for feature extraction on a
classification predictive modeling problem.

Autoencoder for Classification


In this section, we will develop an autoencoder to learn a compressed representation of
the input features for a classification predictive modeling problem.

First, let’s define a classification predictive modeling problem.

We will use the make_classification() scikit-learn function to define a synthetic binary (2-
class) classification task with 100 input features (columns) and 1,000 examples (rows).
Importantly, we will define the problem in such a way that most of the input variables are
redundant (90 of the 100 or 90 percent), allowing the autoencoder later to learn a useful
compressed representation.
The example below defines the dataset and summarizes its shape.

# synthetic classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example defines the dataset and prints the shape of the arrays, confirming
the number of rows and columns.

(1000, 100) (1000,)

Next, we will develop a Multilayer Perceptron (MLP) autoencoder model.


The model will take all of the input columns, then output the same values. It will learn to
recreate the input pattern exactly.

The autoencoder consists of two parts: the encoder and the decoder. The encoder
learns how to interpret the input and compress it to an internal representation defined by
the bottleneck layer. The decoder takes the output of the encoder (the bottleneck layer)
and attempts to recreate the input.

Once the autoencoder is trained, the decoder is discarded and we only keep the
encoder and use it to compress examples of input to vectors output by the bottleneck
layer.

In this first autoencoder, we won’t compress the input at all and will use a bottleneck
layer the same size as the input. This should be an easy problem that the model will
learn nearly perfectly and is intended to confirm our model is implemented correctly.

We will define the model using the functional API; if this is new to you, I recommend this
tutorial:

 How to Use the Keras Functional API for Deep Learning


Prior to defining and fitting the model, we will split the data into train and test sets and
scale the input data by normalizing the values to the range 0-1, a good practice with
MLPs.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)

We will define the encoder to have two hidden layers, the first with two times the number
of inputs (e.g. 200) and the second with the same number of inputs (100), followed by
the bottleneck layer with the same number of inputs as the dataset (100).

To ensure the model learns well, we will use batch normalization and leaky ReLU
activation.
...
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = n_inputs
bottleneck = Dense(n_bottleneck)(e)

The decoder will be defined with a similar structure, although in reverse.

It will have two hidden layers, the first with the number of inputs in the dataset (e.g. 100)
and the second with double the number of inputs (e.g. 200). The output layer will have
the same number of nodes as there are columns in the input data and will use a linear
activation function to output numeric values.

...
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)

The model will be fit using the efficient Adam version of stochastic gradient descent and
minimizes the mean squared error, given that reconstruction is a type of multi-output
regression problem.

...
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')

We can plot the layers in the autoencoder model to get a feeling for how the data flows
through the model.

...
# plot the autoencoder
plot_model(model, 'autoencoder_no_compress.png', show_shapes=True)

The image below shows a plot of the autoencoder.


Plot of Autoencoder Model for Classification With No Compression

Next, we can train the model to reproduce the input and keep track of the performance
of the model on the hold-out test set.

...
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2, validation_data=(X_test, X_test))

After training, we can plot the learning curves for the train and test sets to confirm the
model learned the reconstruction problem well.

...
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

Finally, we can save the encoder model for use later, if desired.

...
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
plot_model(encoder, 'encoder_no_compress.png', show_shapes=True)
# save the encoder to file
encoder.save('encoder.h5')

As part of saving the encoder, we will also plot the encoder model to get a feeling for the
shape of the output of the bottleneck layer, e.g. a 100 element vector.

An example of this plot is provided below.


Plot of Encoder Model for Classification With No Compression
Tying this all together, the complete example of an autoencoder for reconstructing the
input data for a classification dataset without any compression in the bottleneck layer is
listed below.

# train autoencoder for classification with no compression in the bottleneck layer
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.utils import plot_model
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90, random_state=1)
# number of input columns
n_inputs = X.shape[1]
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = n_inputs
bottleneck = Dense(n_bottleneck)(e)
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
# plot the autoencoder
plot_model(model, 'autoencoder_no_compress.png', show_shapes=True)
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2, validation_data=(X_test, X_test))
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
plot_model(encoder, 'encoder_no_compress.png', show_shapes=True)
# save the encoder to file
encoder.save('encoder.h5')

Running the example fits the model and reports loss on the train and test sets along the
way.

Note: if you have problems creating the plots of the model, you can comment out the
import and the call to the plot_model() function.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few
times and compare the average outcome.
In this case, we see that loss gets low, but does not go to zero (as we might have
expected) with no compression in the bottleneck layer. Perhaps further tuning the model
architecture or learning hyperparameters is required.

...
42/42 - 0s - loss: 0.0032 - val_loss: 0.0016
Epoch 196/200
42/42 - 0s - loss: 0.0031 - val_loss: 0.0024
Epoch 197/200
42/42 - 0s - loss: 0.0032 - val_loss: 0.0015
Epoch 198/200
42/42 - 0s - loss: 0.0032 - val_loss: 0.0014
Epoch 199/200
42/42 - 0s - loss: 0.0031 - val_loss: 0.0020
Epoch 200/200
42/42 - 0s - loss: 0.0029 - val_loss: 0.0017

A plot of the learning curves is created showing that the model achieves a good fit in
reconstructing the input, which holds steady throughout training, not overfitting.
Learning Curves of Training the Autoencoder Model Without Compression

So far, so good. We know how to develop an autoencoder without compression.

Next, let’s change the configuration of the model so that the bottleneck layer has half the
number of nodes (e.g. 50).

...
# bottleneck
n_bottleneck = round(float(n_inputs) / 2.0)
bottleneck = Dense(n_bottleneck)(e)

Tying this together, the complete example is listed below.

# train autoencoder for classification with compression in the bottleneck layer
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.utils import plot_model
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=90, random_state=1)
# number of input columns
n_inputs = X.shape[1]
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# scale data
t = MinMaxScaler()
t.fit(X_train)
X_train = t.transform(X_train)
X_test = t.transform(X_test)
# define encoder
visible = Input(shape=(n_inputs,))
# encoder level 1
e = Dense(n_inputs*2)(visible)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# encoder level 2
e = Dense(n_inputs)(e)
e = BatchNormalization()(e)
e = LeakyReLU()(e)
# bottleneck
n_bottleneck = round(float(n_inputs) / 2.0)
bottleneck = Dense(n_bottleneck)(e)
# define decoder, level 1
d = Dense(n_inputs)(bottleneck)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# decoder level 2
d = Dense(n_inputs*2)(d)
d = BatchNormalization()(d)
d = LeakyReLU()(d)
# output layer
output = Dense(n_inputs, activation='linear')(d)
# define autoencoder model
model = Model(inputs=visible, outputs=output)
# compile autoencoder model
model.compile(optimizer='adam', loss='mse')
# plot the autoencoder
plot_model(model, 'autoencoder_compress.png', show_shapes=True)
# fit the autoencoder model to reconstruct input
history = model.fit(X_train, X_train, epochs=200, batch_size=16, verbose=2, validation_data=(X_test, X_test))
# plot loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
# define an encoder model (without the decoder)
encoder = Model(inputs=visible, outputs=bottleneck)
plot_model(encoder, 'encoder_compress.png', show_shapes=True)
# save the encoder to file
encoder.save('encoder.h5')

Running the example fits the model and reports loss on the train and test sets along the
way.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few
times and compare the average outcome.
In this case, we see that loss gets similarly low as the above example without
compression, suggesting that perhaps the model performs just as well with a bottleneck
half the size.

...
42/42 - 0s - loss: 0.0029 - val_loss: 0.0010
Epoch 196/200
42/42 - 0s - loss: 0.0029 - val_loss: 0.0013
Epoch 197/200
42/42 - 0s - loss: 0.0030 - val_loss: 9.4472e-04
Epoch 198/200
42/42 - 0s - loss: 0.0028 - val_loss: 0.0015
Epoch 199/200
42/42 - 0s - loss: 0.0033 - val_loss: 0.0021
Epoch 200/200
42/42 - 0s - loss: 0.0027 - val_loss: 8.7731e-04
A plot of the learning curves is created, again showing that the model achieves a good fit
in reconstructing the input, which holds steady throughout training, not overfitting.

Learning Curves of Training the Autoencoder Model With Compression

The trained encoder is saved to the file “encoder.h5”, which we can load and use later.
Next, let’s explore how we might use the trained encoder model.
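
A rough sketch of that usage follows (the logistic regression classifier is an arbitrary choice for illustration, and X_train, X_test, y_train, y_test are assumed to be the arrays prepared in the listings above):

# Illustrative sketch: use the saved encoder as a feature-extraction / data-preparation step.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import load_model

# load the encoder trained above (assumes 'encoder.h5' exists in the working directory)
encoder = load_model('encoder.h5')
# encode the train and test data into the bottleneck representation
X_train_encode = encoder.predict(X_train)
X_test_encode = encoder.predict(X_test)
# fit any supervised model on the encoded features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_encode, y_train)
yhat = clf.predict(X_test_encode)
print(accuracy_score(y_test, yhat))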

DENOISING AUTOENCODER:

Autoencoders are Neural Networks which are commonly


used for feature selection and extraction. However, when
there are more nodes in the hidden layer than there are
inputs, the network risks learning the so-called
“Identity Function”, also called the “Null Function”, meaning
that the output simply equals the input, which renders the
Autoencoder useless.

Denoising Autoencoders solve this problem by corrupting


the data on purpose, randomly setting some of the input
values to zero. In general, about 50% of the input nodes
are set to zero; other sources suggest a lower fraction,
such as 30%. The right amount depends on how much data
and how many input nodes you have.

Architecture of a DAE. Copyright by Kirill Eremenko (Deep Learning A-Z™: Hands-On Artificial Neural
Networks)

When calculating the Loss function, it is important to


compare the output values with the original input, not with
the corrupted input. That way, the risk of learning the
identity function instead of extracting features is
eliminated.

A great implementation has been posted


by opendeep.org where they use Theano to build a very
basic Denoising Autoencoder and train it on the MNIST
dataset. The OpenDeep articles are very basic and aimed
at beginners, so even if you don't have much experience
with Neural Networks, the article is definitely worth
checking out!

Original input, corrupted data and reconstructed data. Copyright by opendeep.org.

Denoising Autoencoders are an important tool for feature
selection and extraction, and now you know what they are!
Enjoy and thanks for reading!

The structure of a DAE

First, let’s do a quick recap on a high-level structure of


Autoencoders. The critical components of Autoencoders
are:

 Input layer — to pass input data into the net-


work

 Hidden layer consisting of Encoder and De-


coder — to process information by applying
weights, biases and activation functions

 Output layer — typically matches the input


neurons
Here is an illustration of the above summary:

A high-level illustration of layers within an Autoencoder Neural Network. Image by author.

The most common type of Autoencoder is


an Undercomplete Autoencoder, which squeezes
(encodes) data into fewer neurons (lower dimension) while
removing “unimportant” information. It achieves that by
training an encoder and decoder simultaneously, so the
output neurons match inputs as closely as possible.

Here is an example of what the network diagram would


look like for an Undercomplete Autoencoder:
Undercomplete Autoencoder Neural Network. Image by author, created using AlexNail’s NN-SVG
tool.

Denoising Autoencoder (DAE)

The purpose of a DAE is to remove noise. You can also


think of it as a customised denoising algorithm tuned to
your data.

Note the emphasis on the word customised. Given that


we train a DAE on a specific set of data, it will be
optimised to remove noise from similar data. For example,
if we train it to remove noise from a collection of images, it
will work well on similar images but will not be suitable for
cleaning text data.

Unlike Undercomplete AE, we may use the same or higher


number of neurons within the hidden layer, making the
DAE overcomplete.

The second difference comes from not using identical


inputs and outputs. Instead, the outputs are the original
data (e.g., images), while the inputs contain data with
some added noise.
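
As a minimal sketch of how this looks in code (the 784-dimensional inputs, the single overcomplete hidden layer, the 30% masking level and the random stand-in data are illustrative assumptions, not the article's own implementation):

# Minimal denoising autoencoder sketch: noisy inputs, clean targets.
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

n_inputs = 784                                  # e.g. flattened 28x28 images scaled to [0, 1]
X_clean = np.random.rand(1000, n_inputs)        # stand-in for real training data

# corrupt the inputs: randomly set ~30% of the values to zero (masking noise)
mask = np.random.rand(*X_clean.shape) > 0.3
X_noisy = X_clean * mask

# an overcomplete hidden layer is allowed here, because the denoising task
# prevents the network from simply learning the identity function
inp = Input(shape=(n_inputs,))
h = Dense(n_inputs, activation='relu')(inp)
out = Dense(n_inputs, activation='sigmoid')(h)
dae = Model(inputs=inp, outputs=out)
dae.compile(optimizer='adam', loss='mse')

# key point: the loss compares the output against the original, uncorrupted data
dae.fit(X_noisy, X_clean, epochs=10, batch_size=64, verbose=0)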

SPARSE AUTOENCODER:

A sparse autoencoder is simply an autoencoder whose


training criterion involves a sparsity penalty. In most
cases, we would construct our loss function by
penalizing activations of hidden layers so that only a few
nodes are encouraged to activate when a single sample is
fed into the network.

The intuition behind this method is that, for example, if a


man claims to be an expert in mathematics, computer
science, psychology, and classical music, he might be just
learning some quite shallow knowledge in these subjects.
However, if he only claims to be devoted to mathematics,
we would like to anticipate some useful insights from
him. And it’s the same for autoencoders we’re training —
fewer nodes activating while still keeping its performance
would guarantee that the autoencoder is actually
learning latent representations instead of redundant
information in our input data.

Sparse Autoencoder
There are actually two different ways to construct our
sparsity penalty: L1 regularization and KL-divergence.
And here we will only talk about L1 regularization.

Why L1 Regularization Sparse

L1 regularization and L2 regularization are widely used in


machine learning and deep learning. L1 regularization
adds “absolute value of magnitude” of coefficients as
penalty term while L2 regularization adds “squared
magnitude” of coefficient as a penalty term.

Although L1 and L2 can both be used as regularization


term, the key difference between them is that L1
regularization tends to shrink coefficients all the way to
zero, while L2 regularization moves coefficients towards
zero but they never actually reach it. Thus L1
regularization is often used as a method of feature
selection. But why does L1 regularization lead to sparsity?

Consider that we have two loss functions L1 and L2 which


represent L1 regularization and L2 regularization
respectively.
Gradient descent is always used in optimizing neural
networks. If we plot these two loss functions and their
derivatives, it looks like this:

L1 regularization and its derivative

L2 regularization and its derivative

We can see that for L1 regularization, the gradient is
either 1 or -1 except when w = 0, which means that L1
regularization always moves w towards zero with the same
step size regardless of the value of w. Once w = 0, the
gradient becomes zero and no further updates are made.
For L2 regularization things are different: it also moves w
towards zero, but the step size becomes smaller and smaller,
which means that w never actually reaches zero.
This is the intuition behind L1 regularization’s sparsity.
More mathematical details can be found here.
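
To make the gradient comparison explicit, with \lambda as the regularization strength, the two derivatives are

\frac{d}{dw}\,\lambda\lvert w\rvert \;=\; \lambda\,\operatorname{sign}(w)\quad (w \neq 0), \qquad \frac{d}{dw}\,\lambda w^{2} \;=\; 2\lambda w ,

so the L1 penalty pushes w toward zero by a fixed amount per step, while the L2 push shrinks in proportion to w and fades out before w ever reaches zero.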

Loss Function

Finally, after the above analysis, we get the idea of using


L1 regularization in a sparse autoencoder, and the loss
function is as below:
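
A standard way to write such a loss, with λ as the controlling hyperparameter (the first two terms are taken here to be the reconstruction error and a weight-decay penalty, which is an assumption about the original figure), is:

\mathcal{L} \;=\; \frac{1}{m}\sum_{i=1}^{m}\bigl\lVert x^{(i)} - \hat{x}^{(i)} \bigr\rVert^{2} \;+\; \beta\,\lVert W \rVert_{2}^{2} \;+\; \lambda \sum_{i} \bigl\lvert a_{i}^{(h)} \bigr\rvert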

In addition to the first two terms, we add a third term which
penalizes the absolute value of the vector of activations a
in layer h for sample i, and we use a hyperparameter λ to
control its effect on the whole loss function. In this way,
we build a sparse autoencoder.
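
A minimal sketch of this in Keras (assuming a single hidden layer, 784-dimensional inputs and an arbitrary penalty strength of 1e-4; this is not the exact network used in the experiments below) attaches an L1 activity regularizer to the hidden layer:

# Sparse autoencoder sketch: L1 penalty on the hidden-layer activations.
# Layer sizes and the regularization strength (1e-4) are illustrative assumptions.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import regularizers

n_inputs, n_hidden = 784, 64

inp = Input(shape=(n_inputs,))
# activity_regularizer adds lambda * sum(|activations|) to the loss for this layer
h = Dense(n_hidden, activation='relu',
          activity_regularizer=regularizers.l1(1e-4))(inp)
out = Dense(n_inputs, activation='sigmoid')(h)

sparse_ae = Model(inputs=inp, outputs=out)
# 'mse' is the reconstruction term; Keras adds the activity penalty automatically
sparse_ae.compile(optimizer='adam', loss='mse')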

Visualization

Wait, but how does it behave? To test its performance, I


tried to build a deep autoencoder and train it on MNIST
dataset without L1 regularization and with regularization.
The structure of this deep autoencoder is plotted as below:
The structure of autoencoder

And after 100 epochs of training using 128 batch size and
Adam as the optimizer, I got below results:

Experiment Results

As we can see, sparse autoencoder with L1 regularization


with best MSE loss 0.0301 actually performs better than the
plain autoencoder with best MSE loss 0.0318. Although it is
just a slight improvement, it turns out that the sparse
autoencoder actually learns a better representation than the
plain autoencoder.

And what about sparsity? We can simply extract the


weights in the first hidden layer and reshape them for
visualizations to check if the activations of sparse
autoencoder are actually more “sparse” than original
autoencoder.
Here comes the conclusion: due to the sparsity induced by L1
regularization, the sparse autoencoder actually learns better
representations and its activations are sparser, which makes
it perform better than the original autoencoder without L1
regularization.
