Unit-4 Aiml
The perceptron was intended to be a machine, rather than a program, and while its
first implementation was in software for the IBM 704, it was subsequently
implemented in custom-built hardware as the “Mark 1 perceptron”. This machine
was designed for image recognition: it had an array of 400 photocells, randomly
connected to the “neurons”. Weights were encoded in potentiometers, and weight
updates during learning were performed by electric motors.
Although the perceptron initially seemed promising, it was quickly proved that
perceptrons could not be trained to recognize many classes of patterns. This caused
the field of neural network research to stagnate for many years before it was
recognized that a feedforward neural network with two or more layers (also called a
multilayer perceptron) had greater processing power than a perceptron with one
layer (also called a single-layer perceptron).
Single-layer perceptrons are only capable of learning linearly separable patterns. For
a classification task with some step activation function, a single node will have a
single line dividing the data points forming the patterns. More nodes can create
more dividing lines, but those lines must somehow be combined to form more
complex classifications. A second layer of perceptrons, or even linear nodes, is
sufficient to solve many otherwise non-separable problems.
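To make this concrete, here is a minimal sketch (in Python, with illustrative data) of
the classic perceptron update rule, w <- w + lr * (target - prediction) * x, with a
step activation; on the linearly separable AND targets it converges, while on the
non-separable XOR targets it never can.

    import numpy as np

    def train_perceptron(X, y, lr=0.1, epochs=50):
        w = np.zeros(X.shape[1])   # weights, one per input
        b = 0.0                    # bias
        for _ in range(epochs):
            for xi, target in zip(X, y):
                pred = 1 if xi @ w + b > 0 else 0    # step activation
                w += lr * (target - pred) * xi       # perceptron rule
                b += lr * (target - pred)
        return w, b

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y_and = np.array([0, 0, 0, 1])   # linearly separable: a dividing line exists
    y_xor = np.array([0, 1, 1, 0])   # not linearly separable: no single line works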
In 1969, a famous book entitled Perceptrons by Marvin Minsky and Seymour Papert
showed that it was impossible for these classes of network to learn an XOR function.
It is often believed (incorrectly) that they also conjectured that a similar result would
hold for a multilayer perceptron network. However, this is not true, as both Minsky
and Papert already knew that multilayer perceptrons could produce an XOR
function. (See the page on Perceptrons (book) for more information.) Nevertheless,
the often-miscited Minsky/Papert text caused a significant decline in interest and
funding of neural network research. It took ten more years until neural network
research experienced a resurgence in the 1980s. The text was reprinted in 1987 as
“Perceptrons – Expanded Edition”, where some errors in the original text are shown
and corrected.
A set of data points is said to be linearly separable if the data can be divided into
two classes using a straight line. If the data cannot be divided into two classes using
a straight line, the data points are said to be non-linearly separable.
Although the perceptron rule finds a successful weight vector when the training
examples are linearly separable, it can fail to converge if the examples are not
linearly separable.
A second training rule, called the delta rule, is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta rule converges toward a
best-fit approximation to the target concept.
The key idea behind the delta rule is to use gradient descent to search the
hypothesis space of possible weight vectors to find the weights that best fit the
training examples.
This rule is important because gradient descent provides the basis for the
BACKPROPAGATION algorithm, which can learn networks with many interconnected
units.
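As a rough illustration, the delta rule trains a linear unit o = w · x by gradient
descent on the squared error E = 1/2 * sum((t - o)^2); the sketch below assumes
batch updates and an illustrative learning rate.

    import numpy as np

    def delta_rule(X, t, lr=0.01, epochs=100):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            o = X @ w              # linear unit outputs for all examples
            grad = -(t - o) @ X    # gradient dE/dw of the squared error
            w -= lr * grad         # step in the direction of steepest descent
        return w

Even when the examples are not linearly separable, for a suitably small learning rate
these updates move toward the least-squares best fit.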
Multilayer networks
In network theory, multidimensional networks, a special type of multilayer network,
are networks with multiple kinds of relations. Increasingly sophisticated attempts to
model real-world systems as multidimensional networks have yielded valuable
insight in the fields of social network analysis, economics, urban and international
transport, ecology, psychology, medicine, biology, commerce, climatology, physics,
computational neuroscience, operations management, infrastructures, and finance.
The rapid exploration of complex networks in recent years has been dogged by a
lack of standardized naming conventions, as various groups use overlapping and
contradictory terminology to describe specific network configurations (e.g.,
multiplex, multilayer, multilevel, multidimensional, multirelational, interconnected).
Formally, multidimensional networks are edge-labeled multigraphs. The term “fully
multidimensional” has also been used to refer to a multipartite edge-labeled
multigraph. Multidimensional networks have also recently been reframed as specific
instances of multilayer networks. In this case, there are as many layers as there are
dimensions, and the links between nodes within each layer are simply all the links
for a given dimension.
In another example, Stella used an “eco multiplex model” to study the spread of
Trypanosoma cruzi (cause of Chagas disease in humans) across different mammal
species. This pathogen can be transmitted either through invertebrate vectors
(Triatominae or kissing bugs) or through predation when a susceptible predator
feeds on infected prey or vectors. Thus, their model included two
ecological/transmission layers: the food-web and vector layers. Their results showed
that studying the multiplex network structure offered insights on which host species
facilitate parasite spread, and thus which would be more effective to immunize to
control the spread. At the same time, they showed that in this system, when
parasite spread occurs primarily through the trophic layer, immunizing predators
hampers parasite transmission more than immunizing prey.
For coupled processes involving a disease alongside a social process (i.e., spread of
information or disease awareness), we might expect that the spread of the pathogen
will be associated with the spread of disease awareness or preventative behaviors
such as mask-wearing, and in these cases theoretical models suggest that
considering the spread of disease awareness can result in reduced disease spread. A
model was presented by Granell, which represented two competing processes on the
same network: infection spread (modeled using a Susceptible-Infected-Susceptible
compartmental model) coupled with information spread through a social network (an
Unaware-Aware-Unaware compartmental model). The authors used their model to
show that the timing of self-awareness of infection had little effect on the epidemic
dynamics. However, the degree of immunization (a parameter which regulates the
probability of becoming infected when aware) and mass media information spread
on the social layer did critically impact disease spread. A similar framework has been
used to study the effect of the diffusion of vaccine opinion (pro or anti) across a
social network with concurrent infectious disease spread. The study showed a clear
regime shift from a vaccinated population and controlled outbreak to vaccine refusal
and epidemic spread depending on the strength of opinion on the perceived risks of
the vaccine. The shift in outcomes from a controlled to uncontrolled outbreak was
accompanied by an increase in the spatial correlation of cases. While models in the
veterinary literature have accounted for altered behavior of nodes (imposition of
control measures) because of detection or awareness of disease, it is not common
for awareness to be considered as a dynamic process that is influenced by how each
node has interacted with the pathogen (i.e., contact with an infected neighbor). For
example, the rate of adoption of biosecurity practices at a farm, such as enhanced
surveillance, use of vaccination, or installation of air filtration systems, may be
dependent on the presence of disease in neighboring farms or the farmers’
awareness of a pathogen through a professional network of colleagues.
There is also some evidence that nodes that are more connected in their “social
support” networks (e.g., connections with family and close friends in humans) can
alter network processes that result in negative outcomes, such as pathogen
exposure or engagement in high-risk behavior. In a case based on users of
injectable drugs, social connections with non-injectors can reduce drug users’
connectivity in a network based on risky behavior with other drug injectors. In a
model presented by Chen, a social-support layer of a multiplex network drove the
allocation of resources for infection recovery, meaning that infected individuals
recovered faster if they possessed more neighbors in the social support layer. In
animal (both wild and domesticated) populations, this concept could be adapted to
represent an individual’s likelihood of recovery from, or tolerance to, infection being
influenced by the buffering effect of affiliative social relationships. For domestic
animals, investment in certain resources at the farm level could influence a
premises’ ability to recover (e.g., treatment) or its onward transmission of a
pathogen (e.g., treatment or biosecurity practices). Sharing of these resources between farms could
be modeled through a “social-support” layer in a multiplex, for example, where a
farm’s transmissibility is impacted by access to shared truck-washing facilities.
Multi-Host Infections
Multilayer networks can be used to study the features of mixed-species contact
networks or to model the spread of a pathogen in a host community, providing
important insights into multi-host pathogens. Scenarios like this are commonplace at
the livestock-wildlife interface and therefore the insights provided could be of real
interest to veterinary epidemiology. In the case of multi-host pathogens, intralayer
and interlayer edges represent the contacts between individuals of the same species
and between individuals of different species, respectively. They can therefore be
used to identify bottlenecks of transmission and provide a clearer idea of how
spillover occurs. For example, Silk used an interconnected network with three layers
to study potential routes of transmission in a multi-host system. One layer consisted
of a wild European badger (Meles meles) contact network, the second a
domesticated cattle contact network, and the third a layer containing badger latrine
sites (potentially important sites of indirect environmental transmission). No
intralayer edges were possible in the latrine layer. The authors demonstrated the
importance of these environmental sites in shortening paths through the multilayer
network (for both between- and within-species transmission routes) and showed
that some latrine sites were more important than others in connecting the different
layers. Pilosof presented a theoretical model, labeling the species as focal (i.e., of
interest) and non-focal, showing that the outbreak probability and outbreak size
depend on which species originates the outbreak and on asymmetries in between-
species transmission probabilities.
Backpropagation Algorithm
In machine learning, backpropagation (backprop, BP) is a widely used algorithm for
training feedforward neural networks. Generalizations of backpropagation exist for
other artificial neural networks (ANNs), and for functions generally. These classes of
algorithms are all referred to generically as “backpropagation”. In fitting a neural
network, backpropagation computes the gradient of the loss function with respect to
the weights of the network for a single input–output example, and does so
efficiently, unlike a naive direct computation of the gradient with respect to each
weight individually. This efficiency makes it feasible to use gradient methods for
training multilayer networks, updating weights to minimize loss; gradient descent, or
variants such as stochastic gradient descent, are commonly used. The
backpropagation algorithm works by computing the gradient of the loss function
with respect to each weight by the chain rule, computing the gradient one layer at a
time, iterating backward from the last layer to avoid redundant calculations of
intermediate terms in the chain rule; this is an example of dynamic programming.
The term backpropagation strictly refers only to the algorithm for computing the
gradient, not how the gradient is used; however, the term is often used loosely to
refer to the entire learning algorithm, including how the gradient is used, such as by
stochastic gradient descent. Backpropagation generalizes the gradient computation
in the delta rule, which is the single-layer version of backpropagation, and is in turn
generalized by automatic differentiation, where backpropagation is a special case of
reverse accumulation (or “reverse mode”). The term backpropagation and its
general use in neural networks was announced in Rumelhart, Hinton & Williams
(1986a), then elaborated and popularized in Rumelhart, Hinton & Williams (1986b),
but the technique was independently rediscovered many times and had many
predecessors dating to the 1960s. A modern overview is given in the deep learning
textbook by Goodfellow, Bengio & Courville.
The algorithm is used to effectively train a neural network through the chain rule. In
simple terms, after each forward pass through a network, backpropagation performs
a backward pass while adjusting the model’s parameters (weights and biases).
Input layer
The input neurons hold the input data. This can be as simple as scalars or more
complex, like vectors or multidimensional matrices.
Hidden layers
The final values at the hidden neurons are computed using z^l, the weighted inputs
in layer l, and a^l, the activations in layer l.
Output layer
The final part of a neural network is the output layer, which produces the predicted
value. In a simple example, it is a single neuron.
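Putting the three layers together, here is a hedged sketch of one forward and one
backward pass for a tiny network with one hidden layer, sigmoid activations, and a
squared-error loss; all sizes and names (W1, z1, a1, ...) are illustrative and mirror
the z^l / a^l notation above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 1))          # input layer: a 3-dimensional vector
    y = np.array([[1.0]])                # target for the single output neuron

    W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # hidden layer (4 neurons)
    W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # output layer (1 neuron)

    # Forward pass: z^l are the weighted inputs, a^l the activations.
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)   # predicted value

    # Backward pass: apply the chain rule, last layer first.
    delta2 = (a2 - y) * a2 * (1 - a2)         # dLoss/dz2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # dLoss/dz1, reusing delta2

    lr = 0.5   # illustrative learning rate
    W2 -= lr * delta2 @ a1.T; b2 -= lr * delta2
    W1 -= lr * delta1 @ x.T;  b1 -= lr * delta1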
The adjective “deep” in deep learning refers to the use of multiple layers in the
network. Early work showed that a linear perceptron cannot be a universal classifier,
but that a network with a nonpolynomial activation function and one hidden layer of
unbounded width can. Deep learning is a modern variation concerned with an
unbounded number of layers of bounded size, which permits practical application
and optimized implementation while retaining theoretical universality under mild
conditions. In deep learning the layers are also permitted to be heterogeneous and
to deviate widely from biologically informed connectionist models, for the sake of
efficiency, trainability, and understandability, whence the “structured” part of the
alternative name “deep structured learning”.
Most modern deep learning models are based on artificial neural networks,
specifically convolutional neural networks (CNNs), although they can also include
propositional formulas or latent variables organized layer-wise in deep generative
models such as the nodes in deep belief networks and deep Boltzmann machines.
In deep learning, each level learns to transform its input data into a slightly more
abstract and composite representation. In an image recognition application, the raw
input may be a matrix of pixels; the first representational layer may abstract the
pixels and encode edges; the second layer may compose and encode arrangements
of edges; the third layer may encode a nose and eyes; and the fourth layer may
recognize that the image contains a face. Importantly, a deep learning process can
learn which features to optimally place in which level on its own. This does not
completely eliminate the need for hand-tuning; for example, varying numbers of
layers and layer sizes can provide different degrees of abstraction.
The word “Deep” in “Deep learning” refers to the number of layers through which
the data is transformed. More precisely, deep learning systems have a substantial
credit assignment path (CAP) depth. The CAP is the chain of transformations from
input to output. CAPs describe potentially causal connections between input and
output. For a feedforward neural network, the depth of the CAPs is that of the
network and is the number of hidden layers plus one (as the output layer is also
parameterized). For recurrent neural networks, in which a signal may propagate
through a layer more than once, the CAP depth is potentially unlimited. No
universally agreed-upon threshold of depth divides shallow learning from deep
learning, but most researchers agree that deep learning involves CAP depth higher
than 2. A CAP of depth 2 has been shown to be a universal approximator in the
sense that it can emulate any function; beyond that, more layers do not add to the
network’s ability as a function approximator. However, deep models (CAP > 2) can
extract better features than shallow models, and hence the extra layers help in
learning features effectively.
The universal approximation theorem for deep neural networks concerns the
capacity of networks with bounded width whose depth is allowed to grow. Lu
proved that if the width of a deep neural network with ReLU activation is strictly
larger than the input dimension, then the network can approximate any Lebesgue-
integrable function; if the width is smaller than or equal to the input dimension, a
deep neural network is not a universal approximator.
The probabilistic interpretation derives from the field of machine learning. It features
inference, as well as the optimization concepts of training and testing, related to
fitting and generalization, respectively. More specifically, the probabilistic
interpretation considers the activation nonlinearity as a cumulative distribution
function. The probabilistic interpretation led to the introduction of dropout as a
regularizer in neural networks. The probabilistic interpretation was introduced by
researchers including Hopfield, Widrow and Narendra and popularized in surveys
such as the one by Bishop.
Architectures:
Recurrent Neural Network (performs the same task for every element of a
sequence): allows for parallel and sequential computation, like the human brain (a
large feedback network of connected neurons). Recurrent networks can remember
important things about the input they have received, which enables them to be
more precise.
Applications:
Automatic Text Generation: A corpus of text is learned, and from this model new
text is generated word-by-word or character-by-character. The model is then
capable of learning how to spell, punctuate, and form sentences, and it may even
capture the style.
Architecture
In a CNN, the input is a tensor with a shape: (number of inputs) x (input height) x
(input width) x (input channels). After passing through a convolutional layer, the
image becomes abstracted to a feature map, also called an activation map, with
shape: (number of inputs) x (feature map height) x (feature map width) x (feature
map channels).
Convolutional layers convolve the input and pass its result to the next layer. This is
similar to the response of a neuron in the visual cortex to a specific stimulus. Each
convolutional neuron processes data only for its receptive field. Although fully
connected feedforward neural networks can be used to learn features and classify
data, this architecture is generally impractical for larger inputs such as high-
resolution images. It would require a very high number of neurons, even in a
shallow architecture, due to the large input size of images, where each pixel is a
relevant input feature. For instance, a fully connected layer for a (small) image of
size 100 x 100 has 10,000 weights for each neuron in the second layer. Instead,
convolution reduces the number of free parameters, allowing the network to be
deeper. For example, regardless of image size, using a 5 x 5 tiling region, each with
the same shared weights, requires only 25 learnable parameters. Using regularized
weights over fewer parameters avoids the vanishing gradients and exploding
gradients problems seen during backpropagation in traditional neural networks.
Furthermore, convolutional neural networks are ideal for data with a grid-like
topology (such as images) as spatial relations between separate features are
considered during convolution and/or pooling.
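As an illustration of these ideas, the sketch below implements a plain single-channel
2D convolution (valid padding, stride 1); the 5 x 5 kernel holds the 25 shared
weights mentioned above, and each output value depends only on its local receptive
field. The function name and random data are illustrative.

    import numpy as np

    def conv2d(image, kernel):
        kh, kw = kernel.shape
        oh = image.shape[0] - kh + 1   # output height (valid padding)
        ow = image.shape[1] - kw + 1   # output width
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                # The same kernel weights are reused at every position.
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    image = np.random.rand(100, 100)     # the "small" 100 x 100 image above
    kernel = np.random.rand(5, 5)        # 25 learnable, shared parameters
    feature_map = conv2d(image, kernel)  # shape (96, 96)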
Pooling layers
Convolutional networks may include local and/or global pooling layers along with
traditional convolutional layers. Pooling layers reduce the dimensions of data by
combining the outputs of neuron clusters at one layer into a single neuron in the
next layer. Local pooling combines small clusters; tiling sizes such as 2 x 2 are
commonly used. Global pooling acts on all the neurons of the feature map. There
are two common types of pooling in popular use: max and average. Max pooling
uses the maximum value of each local cluster of neurons in the feature map, while
average pooling takes the average value.
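A minimal sketch of both pooling types, assuming a 2 x 2 tiling with stride 2 over a
single-channel feature map:

    import numpy as np

    def pool2x2(feature_map, mode="max"):
        h, w = feature_map.shape
        # Group the map into non-overlapping 2 x 2 tiles.
        tiles = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
        if mode == "max":
            return tiles.max(axis=(1, 3))    # max pooling: largest value per tile
        return tiles.mean(axis=(1, 3))       # average pooling: mean value per tile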
Receptive field
In neural networks, each neuron receives input from some number of locations in
the previous layer. In a convolutional layer, each neuron receives input from only a
restricted area of the previous layer called the neuron’s receptive field. Typically, the
area is a square (e.g., 5 by 5 neurons), whereas in a fully connected layer the
receptive field is the entire previous layer. Thus, in each successive convolutional
layer, each neuron takes input from a larger area of the input than neurons in
previous layers do. This is due
to applying the convolution over and over, which considers the value of a pixel, as
well as its surrounding pixels. When using dilated layers, the number of pixels in the
receptive field remains constant, but the field is more sparsely populated as its
dimensions grow when combining the effect of several layers.
Weights
The vector of weights and the bias are called filters and represent features of the
input (e.g., a particular shape). A distinguishing feature of CNNs is that many
neurons can share the same filter. This reduces the memory footprint because a
single bias and a single vector of weights are used across all receptive fields that
share that filter, as opposed to each receptive field having its own bias and vector
weighting.
Convolutional layers are the major building blocks used in convolutional neural
networks.
Activation function
An activation function decides whether a neuron should be activated or not by
calculating the weighted sum of its inputs and adding a bias to it. The purpose of
the activation function is to introduce non-linearity into the output of a neuron.
A neural network’s neurons operate according to their weights, biases, and
respective activation functions. In a neural network, we update the weights and
biases of the neurons based on the error at the output. This process is known as
back-propagation. Activation functions make back-propagation possible, since the
gradients are supplied along with the error to update the weights and biases.
1) Linear Function:
Equation: A linear function has an equation similar to that of a straight line, i.e., y
= ax
No matter how many layers we have, if all of them are linear in nature, the final
activation function of the last layer is nothing but a linear function of the
input of the first layer.
Range: -inf to +inf
Uses: The linear activation function is used in just one place, i.e., the output layer.
Issues: If we differentiate a linear function, the result no longer depends on the
input “x” and becomes a constant, so the function cannot introduce any
non-linear behavior into our algorithm.
2) Sigmoid Function:
A = 1/(1 + e^(-x))
Tanh Function: The activation that almost always works better than the sigmoid
function is the tanh function, also known as the hyperbolic tangent function. It is a
mathematically shifted version of the sigmoid function. The two are similar and can
be derived from each other.
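A small sketch of both activations, using the identity tanh(x) = 2*sigmoid(2x) - 1 to
show that tanh is a scaled and shifted sigmoid with range (-1, 1) instead of (0, 1):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh_via_sigmoid(x):
        return 2.0 * sigmoid(2.0 * x) - 1.0   # equivalent to np.tanh(x)

    x = np.linspace(-3, 3, 7)
    print(sigmoid(x))                                    # values in (0, 1)
    print(np.allclose(tanh_via_sigmoid(x), np.tanh(x)))  # True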
Pooling Layer
The pooling or downsampling layer is responsible for reducing the spatial size of the
activation maps. In general, pooling layers are used after multiple stages of other
layers (i.e., convolutional and non-linearity layers) to reduce the computational
requirements progressively through the network, as well as to minimize the
likelihood of overfitting.
The key concept of the pooling layer is to provide translational invariance:
particularly in image recognition tasks, feature detection is more important than the
feature’s exact location. Therefore, the pooling operation aims to preserve the
detected features in a smaller representation, and does so by discarding less
significant data at the cost of spatial resolution.
Fully connected
Fully connected layers connect every neuron in one layer to every neuron in another
layer. It is the same as a traditional multi-layer perceptron neural network (MLP).
The flattened matrix goes through a fully connected layer to classify the images.
Fully connected neural networks (FCNNs) are a type of artificial neural network
where the architecture is such that all the nodes, or neurons, in one layer are
connected to the neurons in the next layer.
While this type of algorithm is commonly applied to some types of data, in practice
this type of network has some issues in terms of image recognition and
classification. Such networks are computationally intense and may be prone to
overfitting. When such networks are also ‘deep’ (meaning there are many layers of
nodes or neurons) they can be particularly hard for humans to understand.
With their human-like ability to problem-solve and apply that skill to huge datasets,
neural networks possess the following powerful attributes:
Adaptive Learning: Like humans, neural networks model non-linear and complex
relationships and build on previous knowledge. For example, software uses adaptive
learning to teach math and language arts.
Self-Organization: The ability to cluster and classify vast amounts of data makes
neural networks uniquely suited for organizing the complicated visual problems
posed by medical image analysis.
Fault Tolerance: When significant parts of a network are lost or missing, neural
networks can fill in the blanks. This ability is especially useful in space exploration,
where the failure of electronic devices is always a possibility.
Neural networks are highly valuable because they can carry out tasks to make sense
of data while retaining all their other attributes. Here are the critical tasks that
neural networks perform:
Clustering: They identify a unique feature of the data and classify it without any
knowledge of prior data.
Associating: You can train neural networks to “remember” patterns. When you
show an unfamiliar version of a pattern, the network associates it with the most
comparable version in its memory and reverts to the latter.
A perceptron network has two input units and one output unit, with no hidden
layers. These are also known as ‘single-layer perceptrons.’
Radial basis function networks are like feed-forward neural networks, except that a
radial basis function is used as these neurons’ activation function.
Multilayer perceptrons use more than one hidden layer of neurons, unlike the
single-layer perceptron. These are also known as Deep Feedforward Neural
Networks.
Boltzmann Machine
These networks are like the Hopfield network, except some neurons are input, while
others are hidden in nature. The weights are initialized randomly and learn through
the backpropagation algorithm.
The environment is typically stated in the form of a Markov decision process (MDP)
because many reinforcement learning algorithms for this context use dynamic
programming techniques. The main difference between the classical dynamic
programming methods and reinforcement learning algorithms is that the latter do
not assume knowledge of an exact mathematical model of the MDP and they target
large MDPs where exact methods become infeasible.
Input: The input should be an initial state from which the model will start.
Output: There are many possible outputs, as there are a variety of solutions to a
particular problem.
Training: The training is based upon the input; the model will return a state, and
the user will decide to reward or punish the model based on its output. The model
continues to learn. The best solution is decided based on the maximum reward.
Positive:
Maximizes performance
Sustains change for a long period of time
Too much reinforcement can lead to an overload of states, which can diminish
the results
Negative:
Increases behavior
Provides defiance to a minimum standard of performance
It only provides enough to meet the minimum behavior
There are many different algorithms that tackle this issue. As a matter of fact,
Reinforcement Learning is defined by a specific type of problem, and all its solutions
are classed as Reinforcement Learning algorithms. In the problem, an agent is
supposed to decide the best action to select based on its current state. When this
step is repeated, the problem is known as a Markov Decision Process.
State
A State is a set of tokens that represent every state that the agent can be in.
Model
A Model describes what happens when an action is taken in a state; it defines the
dynamics of the environment.
Actions
An Action set A is the set of all possible actions. A(s) defines the set of actions that
can be taken in state s.
For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in
the sense of maximizing the expected value of the total reward over all successive
steps, starting from the current state. Q-learning can identify an optimal action-
selection policy for any given FMDP, given infinite exploration time and a partly
random policy. “Q” refers to the function that the algorithm computes: the expected
reward for an action taken in a given state.
The goal of the agent is to maximize its total reward. It does this by adding the
maximum reward attainable from future states to the reward for achieving its
current state, effectively influencing the current action by the potential future
reward. This potential reward is a weighted sum of expected values of the rewards
of all future steps starting from the current state.
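Written out, with a discount factor gamma in [0, 1] weighting rewards that lie
further in the future less heavily, this weighted sum is the discounted return:

    G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
        = sum over k >= 0 of gamma^k * r_{t+k+1}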
On the next day, by random chance (exploration), you decide to wait and let other
people depart first. This initially results in a longer wait time, but less time spent
fighting other passengers. Overall, this path has a higher reward than that of the
previous day, since the total boarding time is lower.
Through exploration, despite the initial (patient) action resulting in a larger cost (or
negative reward) than in the forceful strategy, the overall cost is lower, thus
revealing a more rewarding strategy.
Q Learning Algorithm
Q-learning is an off-policy learner, meaning it learns the value of the optimal policy
independently of the agent’s actions. An on-policy learner, on the other hand, learns
the value of the policy being carried out by the agent, including the exploration
steps, and it will find a policy that is optimal when the exploration inherent in the
policy is taken into account.
Q-Table
The Q-table is the data structure used to calculate the maximum expected future
reward for each action at each state. Basically, this table will guide us to the best
action at each state. To learn each value of the Q-table, the Q-learning algorithm is
used.
We will first build a Q-table with n columns, where n = the number of actions, and
m rows, where m = the number of states. We will initialize the values to 0.
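A minimal sketch of such a table in Python (the sizes are illustrative):

    import numpy as np

    n_states, n_actions = 16, 4            # m rows, n columns (illustrative)
    Q = np.zeros((n_states, n_actions))    # all values initialized to 0

    def best_action(state):
        return int(np.argmax(Q[state]))    # greedy action suggested by the table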
This combination of steps is done for an undefined amount of time. This means that
this step runs until the time we stop the training, or the training loop stops as
defined in the code.
We will choose an action (a) in the state (s) based on the Q-table. But, as
mentioned earlier, when the episode initially starts, every Q-value is 0.
Steps 4 and 5: evaluate
Now we have taken an action and observed an outcome and reward. We need to
update the function Q(s,a).
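The standard Q-learning update can be sketched as follows; alpha (the learning
rate) and gamma (the discount factor) are illustrative values, and the helper name
q_update is hypothetical:

    alpha, gamma = 0.1, 0.99   # illustrative learning rate and discount factor

    def q_update(Q, s, a, r, s_next):
        # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        target = r + gamma * Q[s_next].max()    # observed reward + best future value
        Q[s, a] += alpha * (target - Q[s, a])   # move the estimate toward the target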
Manufacturing
At Fanuc, a robot uses deep reinforcement learning to pick a device from one box
and put it in a container. Whether it succeeds or fails, it memorizes the object, gains
knowledge, and trains itself to do this job with great speed and precision.
Many warehousing facilities used by eCommerce sites and other supermarkets use
these intelligent robots to sort millions of products every day and help deliver the
right products to the right people. Tesla’s factory, for example, comprises more
than 160 robots that do a major part of the work on its cars to reduce the risk of
any defect.
Finance
Inventory Management
Healthcare
With technology improving and advancing on a regular basis, it has taken over
almost every industry today, especially the healthcare sector. With the
implementation of reinforcement learning, the healthcare system has generated
better outcomes consistently. One notable example of reinforcement learning in the
healthcare domain is Quotient Health.
Delivery Management
Image Processing
Deep Q-learning
The DeepMind system used a deep convolutional neural network, with layers of tiled
convolutional filters to mimic the effects of receptive fields. Reinforcement learning is
unstable or divergent when a nonlinear function approximator such as a neural
network is used to represent Q. This instability comes from the correlations present
in the sequence of observations, the fact that small updates to Q may significantly
change the policy of the agent and the data distribution, and the correlations
between Q and the target values.
The technique used experience replay, a biologically inspired mechanism that uses a
random sample of prior actions instead of the most recent action to proceed. This
removes correlations in the observation sequence and smooths changes in the data
distribution. Iterative updates adjust Q towards target values that are only
periodically updated, further reducing correlations with the target.
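As a rough sketch of the experience-replay idea (the class name and capacity below
are illustrative): transitions are stored as they occur, and training draws uniform
random minibatches from the store, which breaks the correlations in the observation
sequence described above.

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions fall off

        def push(self, state, action, reward, next_state, done):
            # Store one transition as it occurs during interaction.
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniform random minibatch, rather than the most recent transitions.
            return random.sample(self.buffer, batch_size)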