0% found this document useful (0 votes)

29 views26 pages

04of22 - Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning

This document discusses superposition-inspired reinforcement learning (SIRL) and quantum reinforcement learning (QRL). It begins with an introduction to reinforcement learning and how quantum computation could help speed up learning. It then describes how SIRL uses a probabilistic exploration policy inspired by quantum superposition to balance exploration and exploitation. QRL is proposed as an extension of SIRL to quantum mechanical systems, representing state values with quantum states. The rest of the document outlines the fundamentals of quantum computation and provides more details on SIRL and QRL methods.

Uploaded by

Branko Nikolic

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

29 views26 pages

04of22 - Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning

Uploaded by

Branko Nikolic

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

4

Superposition-Inspired Reinforcement Learning

and Quantum Reinforcement Learning
Chun-Lin Chen and Dao-Yi Dong
Nanjing University & Institute of Systems Science, CAS
China

1. Introduction
Reinforcement Learning (RL) remains an active research area for a long time (Kaelbling et
al., 1996; Sutton & Barto, 1998) and is still one of the most rapidly developing machine
learning methods in recent years (Barto & Mahadevan, 2003). Related algorithms and
techniques have been used in different applications such as motion control, operations
research, robotics and sequential decision process (He & Jagannathan, 2005; Kondo & Ito,
2004; Morimoto & Doya, 2001; Chen et al., 2006b). However how to speed up learning has
always been one of the key problems for the theoretical research and applications of RL
methods (Sutton & Barto, 1998).
Recently there comes up a new approach for solving this problem owning to the rapid
development of quantum information and quantum computation (Preskill, 1998; Nielsen &
Chuang, 2000). Some results have shown that quantum computation can efficiently speed
up the solutions of some classical problems, and even can solve some difficult problems that
classical algorithms can not solve. Two important quantum algorithms, Shor’s factoring
Open Access Database www.i-techonline.com

algorithm (Shor, 1994; Ekert & Jozsa, 1996) and Grover’s searching algorithm (Grover, 1996;
Grover, 1997), have been proposed in 1994 and 1996 respectively. Shor’s factoring algorithm
can give an exponential speedup for factoring large integers into prime numbers and its
experimental demonstration has been realized using nuclear magnetic resonance
(Vandersypen et al., 2001). Grover’s searching algorithm can achieve a square speedup over
classical algorithms in unsorted database searching and its experimental implementations
have also been demonstrated using nuclear magnetic resonance (Chuang et al., 1998; Jones,
1998a; Jones et al., 1998b) and quantum optics (Kwiat et al., 2000; Scully & Zubairy, 2001).
Taking advantage of quantum computation, the algorithm integration inspired by quantum
characteristics will not only improve the performance of existing algorithms on traditional
computers, but also promote the development of related research areas such as quantum
computer and machine learning. According to our recent research results (Dong et al.,
2005a; Dong et al., 2006a; Dong et al., 2006b; Chen et al., 2006a; Chen et al., 2006c; Chen &
Dong, 2007; Dong et al., 2007a; Dong et al., 2007b), in this chapter the RL methods based on
quantum theory are introduced following the developing roadmap from Superposition-
Inspired Reinforcement Learning (SIRL) to Quantum Reinforcement Learning (QRL).
As for SIRL methods we concern mainly about the exploration policy. Inspired by the
superposition principle of quantum state, in a RL system, a probabilistic exploration policy
Source: Reinforcement Learning: Theory and Applications, Book edited by Cornelius Weber, Mark Elshaw and Norbert Michael Mayer
ISBN 978-3-902613-14-1, pp.424, January 2008, I-Tech Education and Publishing, Vienna, Austria
60 Reinforcement Learning: Theory and Applications

is proposed to mimic the state collapse phenomenon according to quantum measurement

postulate, which leads to a good balance between exploration and exploitation. In this way,
the simulated experiments show that SIRL may accelerate the learning process and allow
avoiding the locally optimal policies.
When SIRL is extended to quantum mechanical systems, QRL theory is proposed naturally
(Dong et al., 2005a, Dong et al., 2007b). In a QRL system, the state value can be represented
with quantum state and be obtained by randomly observing the quantum state, which will
lead to state collapse according to quantum measurement postulate. The occurrence
probability of eigenvalue is determined by probability amplitude, which is updated
according to rewards. So this approach represents the whole state-action space with the
superposition of quantum state, which leads to real parallel computing and a good tradeoff
between exploration and exploitation using probability as well.
Besides the introduction of SIRL and QRL methods, in this chapter, the relationship between
different theories and algorithms are briefly analyzed, and their applications are also
introduced respectively. The organization of this chapter is as follows. Section 2 gives a brief
introduction to the fundamentals of quantum computation, which include the superposition
principle, parallel computation and quantum gates. In Section 3, the SIRL method is
presented in a probabilistic version through mimicking the quantum behaviors. Section 4
gives the introduction of QRL method based on quantum superposition and quantum
parallelism. Related issues and future work are discussed as a conclusion in Section 5.

2. Fundamentals of quantum computation

2.1 State superposition and quantum parallel computation

In quantum computation, information unit (also called as qubit) is represented with
quantum state and a qubit is an arbitrary superposition state of two-state quantum system
(Dirac’s representation) (Preskill, 1998):

| ψ 〉 = α | 0〉 + β | 1〉 (1)

where α and β | α |2 + | β |2 = 1 . | 0〉 and | 1〉 are

are complex coefficients and satisfy
two orthogonal states (also called basis vectors of quantum state | ψ 〉 ), and they
correspond to logic states 0 and 1. | α | represents the occurrence probability of | 0〉 when
2

the qubit is measured, and | β | is the probability of obtaining result | 1〉 . The physical
2

carrier of a qubit is any two-state quantum system such as two-level atom, spin-1/2 particle
and polarized photon. The value of classical bit is either Boolean value 0 or value 1, but a
qubit can be prepared in the coherent superposition state of 0 and 1, i.e. a qubit can
simultaneously store 0 and 1, which is the main difference between classical computation
and quantum computation.
According to quantum computation theory, the quantum computing process can be looked
upon as a unitary transformation U from input qubits to output qubits. If one applies a
transformation U to a superposition state, the transformation will act on all basis vectors of
this superposition state and the output will be a new superposition state by superposing the
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 61

results of all basis vectors. So when one processes function f(x) by the method, the
transformation U can simultaneously work out many different results for a certain input
x . This is analogous with parallel process of classical computer and is called quantum
parallelism. The powerful ability of quantum algorithm is just derived from the parallelism
of quantum computation.
Suppose the input qubit | z〉 lies in the superposition state:
1
| z〉 = (| 0〉+ | 1〉 ) (2)
2
The transformation U z describing computing process is defined as the following:
U z :| z, y〉 →| z, y ⊕ f (z)〉 (3)

where | z, y〉 represents the input joint state and | z, y ⊕ f (z)〉 is the output joint state.
Let y = 0 and we can easily obtain (Nielsen & Chuang, 2000):
1
U z | z〉 = (| 0, f (0)〉+ | 1, f (1)〉 ) (4)
2
The result contains information about both f (0) and f (1) , and we seem to evaluate f (z)
for two values of z simultaneously.
Now consider an n-qubit cluster and it lies in the following superposition state:
} n } n
11L1 11L1
|ψ 〉 = ∑C
x =00L0
x | x〉 (where ∑| C
x =00L0
x |2 = 1 ) (5)

where Cx is complex coefficients and | C x |2 represents occurrence probability of | x〉

when state |ψ 〉 is measured. | x〉 can take on 2n values, so the superposition state can be
looked upon as the superposition state of all integers from 0 to 2 n − 1 . Since U is a unitary
transformation, computing function f (x) can give (Preskill, 1998):

} n } n } n
11L1 11L1 11L1
U ∑C
x =00L0
x | x,0〉 = ∑ C U | x,0〉 = ∑ C
x =00L0
x
x =00L0
x | x, f ( x )〉 (6)

Based on the above analysis, it is easy to find that an n-qubit cluster can simultaneously
process 2n states. However, this is different from the classical parallel computation, where
multiple circuits built to compute f (x) are executed simultaneously, since quantum
parallel computation doesn’t necessarily make a tradeoff between computation time and
62 Reinforcement Learning: Theory and Applications

needed physical space. In fact, quantum parallelism employs a single circuit to evaluate the
function for multiple values of x simultaneously by exploiting the quantum state
superposition principle and provides an exponential-scale computation space in the n-qubit
linear physical space. Therefore quantum computation can effectively increase the
computing speed of some important classical functions. So it is possible to obtain significant
result through fusing quantum computation into reinforcement learning theory.

2.2 Quantum gates

Analogous to classical computer, quantum computer accomplishes some quantum
computation tasks through quantum gates. A quantum gate or quantum logic gate is a basic
quantum circuit operating on a small number of qubits. They can be represented by unitary
matrices. Here we will introduce several simple quantum gates including quantum NOT
gate, Hadamard gate, phase gate and quantum CNOT gate. The detailed description of
quantum gates can refer to (Nielsen & Chuang, 2000).
A quantum NOT gate maps | 0〉 →| 1〉 and | 1〉 →| 0〉 respectively and that can be
described by the following matrix:

⎡0 1 ⎤
U NOT = ⎢ ⎥ (7)
⎣1 0 ⎦
When a quantum NOT gate is applied on a single qubit with state | ψ 〉 = α | 0〉 + β | 1〉 ,
then the output will become | ψ 〉 = α | 1〉 + β | 0〉 . The symbol for the NOT gate is drawn
in Fig.1 (a).
The Hadamard gate is one of the most useful quantum gates and can be represented as:

1 ⎡1 1 ⎤
H= ⎢ ⎥ (8)
2 ⎣1 - 1⎦
Through the Hadamard gate, a qubit in the state | 0〉 is transformed into a superposition
state in the two states, i.e.

1 ⎡1 1 ⎤⎛1 ⎞ 1 ⎛1⎞ 1 1
H | 0〉 ≡ ⎢1 − 1⎥⎜⎜ 0 ⎟⎟ = ⎜⎜ ⎟⎟ = | 0〉 + | 1〉 (9)
2⎣ ⎦⎝ ⎠ 2 ⎝1⎠ 2 2
Another important gate is phase gate which can be expressed as

⎡1 0 ⎤
Up = ⎢ ⎥ (10)
⎣0 i ⎦
Up generates a relative phase π between the two basis states of the input state, i.e.

U p | ψ 〉 = α | 0〉 + iβ | 1〉 (11)
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 63

The CNOT gate acts on two qubits simultaneously and can be represented by the following
matrix:

⎡1 0 0 0⎤
⎢0 1 0 0⎥⎥
U CNOT =⎢ (12)
⎢0 0 0 1⎥
⎢ ⎥
⎣0 0 1 0⎦
The symbol for the CNOT gate is shown as in Fig.1 (b). If the first control qubit is equal to
| 1〉 , then CNOT gate flips the target (second) qubit. Otherwise the target remains
unaffected. This can be described as follows:

⎧U CNOT 00 = 00
⎪
⎪U CNOT 01 = 01
⎨ (13)
⎪U CNOT 10 = 11
⎪U = 10
⎩ CNOT 11
Just like AND and NOT form a universal set for classical boolen circuits, the CNOT gate
combined with one qubit rotation gate can implement any kind of quantum calculation.

(a)

(b)
Fig. 1. Symbols for NOT and CNOT gate

3. Superposition-inspired reinforcement learning

Similar the standard RL, SIRL is also a RL method that is designed for the traditional
computer, instead of a quantum algorithm. However, it borrows the ideas from quantum
characteristics and provides an alternative exploration strategy, i.e., action selection method.
In this section, the SIRL will be presented after a brief introduction of the standard RL
theory and the existing exploration strategies.

3.1 Reinforcement learning and exploration strategy

Standard framework of RL is based on discrete-time, finite Markov decision processes
(MDPs) (Sutton & Barto, 1998). RL algorithms assume that state S and action A ( sn ) can be
divided into discrete values. At a certain step, the agent observes the state of the
64 Reinforcement Learning: Theory and Applications

environment (inside and outside of the agent) s t , and then choose an action a t . After
executing the action, the agent receives a reward rt +1 , which reflects how good that action is
(in a short-term sense).
The goal of reinforcement learning is to learn a mapping from states to actions, that is to say,
the agent is to learn a policy π : S × ∪i∈S A(i ) → [0,1] , so that expected sum of
discounted reward of each state will be maximized:

V(πs ) = E{rt +1 + γrt +2 + γ 2 rt +3 + L | st = s, π }

= E[rt +1 + γV(πst +1 ) | st = s, π ] (14)

= ∑ π (s, a)[r
a∈As
s
a
+ γ ∑ pssa 'V(πs ') ]
s'

where γ ∈ [0,1) is discounted factor, π ( s, a ) is the probability of selecting action a

according to state s under policy π , p = Pr{st +1 = s'| st = s, at = a}
a
ss ' is probability

for state transition and rsa = E{rt +1 | st = s, at = a} is expected one-step reward. Then
we have the optimal state-value function

V(s ) = max[rsa + γ ∑ pssa 'V(s ') ] (15)

a∈As
s'

π * = arg max V(πs ) , ∀s ∈ S (16)

*
In dynamic programming, (15) is also called Bellman equation of V .
As for state-action pairs, there are similar value functions and Bellman equations, where
Q π ( s, a ) stands for the value of taking action a in state s under policy π:
Q(πs ,a ) = E{rt +1 + γrt +2 + γ 2 rt +3 + K | st = s, at = a, π }
= rsa + γ ∑ pssa 'V π ( s ' ) (17)
s'

= rsa + γ ∑ pssa ' ∑ π ( s ' , a ' )Q(πs ',a ')

s' a'

Q(*s ,a ) = max Q( s ,a ) = rsa + γ ∑ pssa ' max Q(*s ',a ') (18)
π a'
s'

Let α be the learning rate, the one-step update rule of Q-learning (a widely used
reinforcement learning algorithm) (Watkins & Dayan, 1992) is:

Q( st , at ) ← (1 − α )Q( st , at ) + α (rt +1 + γ max a ' Q( st +1 , a' ) (19)

Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 65

Besides Q-learning, there are also many other RL algorithms such as temporal
difference (TD), SARSA and multi-step version of these algorithms. For more detail,
please refer to (Sutton & Barto, 1998).
To approach the optimal policy effectively and efficiently, the RL algorithms always
need a certain exploration strategy. One widely used exploration strategy is ε -
greedy (ε ∈ [0,1)) , where the optimal action is selected with probability 1 − ε and a
random action is selected with probability ε . Sutton and Barto (Sutton & Barto, 1998)
have compared the performance of RL for different ε , which shows that a nonzero ε
is usually better than ε = 0 (i.e., blind greedy strategy). Moreover, the exploration
probability ε can be reduced over time, which moves the agent from exploration to
exploitation. The ε -greedy method is simple and effective, but it has one drawback
that when it explores it chooses equally among all actions. This means that it makes no
difference to choose the worst action or the next-to-best action. Another problem is that
it is difficult to choose a proper parameter ε which can offer the optimal balancing
between exploration and exploitation.
Another kind of action selection methods are randomized strategies, such as Boltzmann
exploration (i.e., Softmax method) (Sutton & Barto, 1998) and Simulated Annealing (SA)
method (Guo et al., 2004). It uses a positive parameter τ called the temperature and
chooses action with the probability proportional to exp(Q( s ,a ) / τ ) . Compared with ε -
greedy method, the greedy action is still given the highest selection probability, but all
the others are ranked and weighted according to their value estimates. It can also move
from exploration to exploitation by adjusting the "temperature" parameter τ . It is
natural to sample actions according to this distribution, but it is very difficult to set and
adjust a good parameter τ and may converge unnecessarily slowly unless the
parameter τ is manually tuned with great care. It also has another potential
shortcoming that it may works badly when the values of the actions are close and the
best action can not be separated from the others. A third problem is that when the
parameter τ is reduced over time to acquire more exploitation, there is no effective
mechanism to guarantee re-exploration when necessary.
Therefore, the existing exploration strategies usually suffer from the difficulties to hold
the good balancing between exploration and exploitation and to provide an easy
method of parameter setting. Hence new ideas are necessary to explore more effective
exploration strategies to achieve better performance. Inspired by the main
characteristics of quantum computation, we present the SIRL algorithm with a
probabilistic exploration policy.

3.2 Superposition-inspired RL
The exploration strategy for SIRL is inspired by the state superposition principle of a
quantum system and collapse postulate, where a combined action form is adopted to
provide a probabilistic mechanism for each state in the SIRL system. At state s , the
action to be selected is represented as:
m
c1 c2 c c
as = f ( s ) = + + ... + m = ∑ i (20)
a1 a2 am i=1 ai
66 Reinforcement Learning: Theory and Applications

m
Where ∑c
i =1
i = 1 , 0 ≤ ci ≤ 1 , i = 1,2,...m . as is the action to be selected at state s and

the action selection set is {a1 , a2 ,..., am } . Equation (20) is not for numerical computation
and it just means that at the state s , the agent will choose the action ai with the occurrence

probability ci , which leads to a natural exploration strategy for SIRL.

After the execution of action ai from state s , the corresponding probability ci is updated
according to the immediate reward r and the estimated value of the next state V ( s ' ) .

ci ← ci + k (r + V ( s ' )) (21)

where k is the updating step and the probability distribution (c1 , c2 ,..., cm ) is normalized
after each updating process. The procedural algorithm of standard SIRL is shown as in Fig.
2.
Procedural SIRL:
Initialize V ( s ) arbitrarily, π to the policy to be evaluated
m
c1 c2 c c
π : as = f ( s ) = + + ... + m = ∑ i
a1 a2 am i=1 ai
Repeat (for each episode):
Initialize s
Repeat (for each step of episode):
a ← action given by π for s
Take actiona : observe reward, r , and next state, s '
V ( s) ← V ( s ) + α [r + γV ( s' ) − V ( s)]
ci ← ci + k (r + V ( s ' ))
s ← s'
until s is terminal
until the learning process ends
Fig. 2. A standard SIRL algorithm
In the SIRL algorithm, the exploration policy is accomplished through a probability
distribution over the action set. When the agent is going to choose an action at a certain
state, the action ai will be selected with probability ci , which is also updated along with
the value funcion updating. Comparing the SIRL algorithm with basic RL algorithms, the
main difference is that with the probabilistic exploration policy, the SIRL algorithm makes
better tradeoff between exporation and exploitation without bothering to tune it by the
designers.
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 67

3.3 Simulated experiments

The performance of the SIRL algorithm is tested with two examples, which are a puzzle
problem and a mobile robot navigation problem.

1. The puzzle problem

First, let’s consider a puzzle problem as shown in Fig. 3, which is in a 13×13(0 ~ 12)
gridworld environment. From any state the agent can perform one of four primary actions:
up, down, left and right, and actions that would lead into a blocked cell are not executed.
The task is to find an optimal policy which will let the agent move from S(11,1) to G(1,11)
with minimized cost (number of moving steps).
The experiment setting is as follows. Once the agent finds the goal state it receives a reward
of 100 and then ends this episode. All steps are punished by a reward of -1. The discount
factor γ is set to 0.99 for all the algorithms that we have carried out in this example. In this
experiment, we compare the proposed method with TD algorithm. For the action selection
policy of TD algorithm, we use ε -greedy policy ( ε = 0.01). As for SIRL method, the action
selecting policy uses the values of ci to denote the probability of an action, which is defined
m
c1 c2 c c
as as = f ( s ) = + + ... + m = ∑ i . For the four cell-to-cell actions ci is
a1 a2 am i=1 ai
initialized uniformly.

Fig. 3. A puzzle problem. The task is to move from start (S) to goal (G) with minimum
number of steps
68 Reinforcement Learning: Theory and Applications

Fig. 4. Performance of SIRL (the left figure) compared with TD algorithm (the right figure)
The experimental results of the SIRL method compared with TD method are plotted in Fig.
4. It is obvious that at the beginning phase SIRL with this superposition-inspired exploration
strategy learns extraordinarily fast, and then steadily converges to the optimal policy that
costs 40 steps to the goal G. The results show that the SIRL method makes a good tradeoff
between exploration and exploitation.

2. Mobile robot navigation

A simulation environment has also been set up with a larger grid-map of 400×600. And the
configuration of main parameters is as follows: learning rate α = 0.5 , discount factor
γ = 0.9 . Fig. 5. shows the result in complex indoor environment, which verifies the
effectiveness of robot learning using SIRL for navigation in large unknown environments.

Fig. 5. Simulation result of robot navigation in indoor environment

Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 69

4. Quantum reinforcement learning

When the SIRL is applied to a real quantum system, for example, to run the algorithm on a
quantum computer, the representation and the computation mode will be dramatically
different, which will lead to quantum reinforcement learning (QRL). Then we can take the
most advantages of this quantum algorithm, such as the speeding up due to quantum
parallel computation.

4.1 Representation
One of the most fundamental principles of quantum mechanics is the state superposition
principle. As we represent a QRL system with quantum concepts, similarly, we have the
following definitions and propositions for QRL.
Definition 1: (Eigenvalue of states or actions) States s or actions a in a RL system are
denoted as corresponding orthogonal quantum states | sn 〉 (or | a n 〉 ) and are called the
eigenvalue of states or actions in QRL.
Then we get the set of eigenvalues of states: S = {| s n 〉} and that of actions for state i:
A (i ) = {| a n 〉} .
Corollary 1: Every possible state | s〉 or action | a〉 can be expanded in terms of an
orthogonal complete set of functions, respectively. We have

| s〉 = ∑ β n | s n 〉 (22)
n

| a〉 = ∑ β n | a n 〉 (23)
n

where βn is probability amplitude, which can be a complex number, | sn 〉 and | an 〉 are

eigenvalues of states and actions, respectively. And the βn in equation (22) is not
necessarily the same as the ones in equation (23), which just mean this corollary holds for
both of | s〉 and | a〉 . | β n | 2 means the probability of corresponding eigenvalues and
satisfies

∑| βn
n |2 = 1 (24)

Proof: (sketch)
(1) State space {| s〉} in QRL system is a N -dimension Hilbert space,
(2) States {| s n 〉} in traditional RL system are the eigenvalue of states | s〉 in QRL system,
(Definition 1)
70 Reinforcement Learning: Theory and Applications

Then {| sn 〉} are N linear independent vectors for this N -dimension Hilbert space,
according to the definition of Hilbert space, any possible state | s〉 can be expanded in
terms of the complete set of | s n 〉 . And it is the same for action space {| a〉} .
So the states and actions in QRL are different from those in traditional RL.
1. The sum of several states (or actions) does not have a definite meaning in traditional
RL, but the sum of states (or actions) in QRL is still a possible state (or action) of the
same quantum system, and it will simultaneously take on the superposition state of
some eigenvalues.
2. The measurement value of | s〉 relates to its probability density. When | s〉 takes on an

eigenstate | si 〉 , its value is exclusive. Otherwise, its value has the probability of | β i | 2
to be one of the eigenstate | si 〉 .
Like what has been described in Section 2, quantum computation is built upon the concept
of qubit. Now we consider the systems of multiple qubits and propose a formal
representation of them for QRL system.
Let Ns and Na be the numbers of states and actions respectively, then choose numbers m
and n, which are characterized by the following inequalities:

N s ≤ 2m ≤ 2N s , N a ≤ 2n ≤ 2N a (25)

And use m and n qubits to represent eigenstate set S＝{s} and eigenaction set A＝{a}
respectively:

⎡a1 a2 am ⎤
s: ⎢ ⋅ ⋅ ⋅ ⎥ , where | ai |2 + | bi |2 = 1 , i = 1,2,...m
⎣b1 b2 bm ⎦

⎡α 1 α 2 α n ⎤
a: ⎢ ⋅ ⋅ ⋅ ⎥ , where | α i |2 + | β i |2 = 1 , i = 1,2,...n
⎣ β1 β 2 βn ⎦
Thus the states and actions of a QRL system may lie in superposition states:
} m
11...1
| s ( m) 〉 = ∑C
s = 00L0
s | s〉 (26)
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 71

} n
11...1
| a ( n) 〉 = ∑C
a =00L0
a | a〉 (27)

where Cs and Ca can be complex numbers and satisfy

} m
11...1

∑| C
s = 00L0
s |2 = 1 (28)

} n
11...1

∑| C
a = 00L0
a |2 = 1 (29)

4.2 Action selection policy

In QRL, the agent is also to learn a policy π : S × ∪ i∈S A(i ) → [0,1] , which will maximize
the expected sum of discounted reward of each state. That is to say, the mapping from states
to actions is f ( s ) = π : S → A , and we have
} n
11...1
f ( s ) =| a s( n ) 〉 = ∑C
a =00L0
a | a〉 (30)

where Ca is probability amplitude of action | a〉 and satisfies (29).

Definition 2: (Collapse) When a quantum state |ψ 〉 = ∑ β n |ψ n 〉 is measured, it will be

changed and collapse randomly into one |ψ n 〉 of its eigenstates with corresponding

probability | 〈ψ n | ψ 〉 | 2
:

| 〈ψ n | ψ 〉 |2 =| (| ψ n 〉 )* | ψ 〉 |2 =| β n |2 (31)

Then when an action | a s( n ) 〉 is measured, we will get | a〉 with the occurrence probability
2
of | Ca | . In QRL algorithm, we will amplify the probability of “good” action according to
corresponding rewards. It is obvious that the collapse action selection method is not a real
action selection method theoretically. It is just a fundamental phenomenon when a quantum
state is measured, which results in a good balancing between exploration and exploitation
and a natural “action selection” without setting parameters.
72 Reinforcement Learning: Theory and Applications

4.3 Value function updating and reinforcement strategy

In Corollary 1 we pointed out that every possible state of QRL | s〉 can be expanded in
terms of an orthogonal complete set of eigenstate | s n 〉 : | s〉 = ∑ β n | s n 〉 . If we use an m-
m
n
}
11...1

qubit register, it will be | s (m) 〉 = ∑C

s = 00L0
s | s〉 .

According to quantum parallel computation theory, a certain unitary transformation U

from input qubit to output qubit can be implemented. Suppose we have such a “quantum
black box” which can simultaneously process these 2m states with the value updating rule

V ( s ) ← V ( s ) + α (r + V ( s ' ) − V ( s )) (32)

where α is learning rate, and r is the immediate reward. It is like parallel value updating
of traditional RL over all states, however, it provides an exponential-scale computation
space in the m-qubit linear physical space and can speed up the solutions of related
functions.
The reinforcement strategy is accomplished by changing the probability amplitudes of the
actions according to the updated value function. As we know that action selection is
executed by measuring action | a s( n ) 〉 related to certain state | s ( m ) 〉 , which will collapse to
| a〉 with the occurrence probability of | C a |2 . So it is no doubt that probability amplitude
updating is the key of recording the “trial-and-error” experience and learning to be more
intelligent. When an action | a〉 is executed, it should be able to memorize whether it is
“good” or “bad” by changing its probability amplitude C a . For more details, please refer to
(Chen et al., 2006a; Dong et al., 2006b; Dong et al., 2007b).
As action | a s( n ) 〉 is the superposition of n possible eigenactions, to find out | a〉 and to
change its probability amplitudes are usually interactional for a quantum system. So we
simply update the probability amplitude of | a s( n ) 〉 without searching | a〉 , which is
inspired by Grover’s searching algorithm (Grover, 1996).
The updating of probability amplitude is based on Grover iteration. First, prepare the
equally weighted superposition of all eigenactions
} n
11...1
1
| a0( n ) 〉 = ( ∑ | a〉 ) (33)
2n a =00L0

This process can be done easily by applying the Hadamard transformation to each qubit of
an initial state | a = 0〉 . We know that | a〉 is an eigenaction and can get
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 73

1
〈 a | a0( n ) 〉 = (34)
2n
Now assume the eigenaction to be reinforced is | aj〉, and we can construct Grover

iteration through combining two reflections U aj and U a( n ) (Preskill, 1998; Nielsen &
0

Chuang, 2000)

U a j = I − 2 | a j 〉 〈a j | (35)

U a ( n ) = 2 | a 0( n ) 〉 〈 a 0( n ) | − I (36)
0

where I is unitary matrix. U aj flips the sign of the action | a j 〉 , but acts trivially on any
action orthogonal to | aj〉. This transformation has a simple geometrical interpretation.

Acting on any vector in the 2 n -dimensional Hilbert space, U a j reflects the vector about the

hyperplane orthogonal to | a j 〉 . On the other hand, U a ( n ) preserves | a0( n ) 〉 , but flips the
0

sign of any vector orthogonal to | a0( n ) 〉 . Grover iteration is the unitary transformation
U Grov = U a ( n ) U a j (37)
0

By repeatedly applying the transformation U Grov on | a0( n ) 〉 , we can enhance the

probability amplitude of the basis action | aj〉 while suppressing the amplitude of all other
actions. This can also be looked upon as a kind of rotation in two-dimensional space.
Applying Grover iteration U Grov for K times on | a0( n ) 〉 can be represented as

K
U Grov | a 0( n ) 〉 = sin((2 K + 1)θ ) | a j 〉 + cos((2 K + 1)θ ) | φ 〉 (38)

1
where | φ〉 = ∑ | a〉 , θ
2 − 1 a≠a j
n
satisfying sin θ = 1 / 2 n . Through repeating Grover

iteration, we can reinforce the probability amplitude of corresponding action according to

the reward value.
Thus when an action | a0( n ) 〉 is executed, the probability amplitude of | a j 〉 is updated by
carrying out [k ( r + V ( s ' ))] (an integer) times of Grover iteration. k is a parameter and
the probability amplitudes will be normalized with ∑ | C a | = 1 after each updating.
2

a
74 Reinforcement Learning: Theory and Applications

4.4 Quantum reinforcement learning algorithm

The procedural form of a standard QRL algorithm is described as Fig. 6 (Dong et al., 2007b).
QRL is inspired by the superposition principle of quantum state and quantum parallel
computation. The state value can be represented with quantum state and be obtained by
randomly observing the simulated quantum state, which will lead to state collapse
according to quantum measurement postulate. And the occurrence probability of
eigenvalue is determined by probability amplitude, which is updated according to rewards.
So this approach represents the whole state-action space with the superposition of quantum
state and makes a good tradeoff between exploration and exploitation using probability. The
merit of QRL is twofold. First, as for simulation algorithm on traditional computer it is an
effective algorithm with novel representation and computation methods. Second, the
representation and computation mode are consistent with quantum parallel computation
system and can speed up learning in exponential scale with quantum computer or quantum
logic gates.
In this QRL algorithm we use temporal difference (TD) prediction for the state value
updating, and TD algorithm has been proved to converge for absorbing Markov chain when
the stepsize is nonnegative and digressive (Sutton & Barto, 1998; Watkins & Dayan, 1992).
Since QRL is a stochastic iterative algorithm and Bertsekas and Tsitsiklis have verified the
convergence of stochastic iterative algorithms (Bertsekas & Tsitsiklis, 1996), we give the
convergence result about the QRL algorithm as Theorem 1. The proof and related
discussions can be found in (Dong et al., 2006a; Chen et al., 2006c; Dong et al., 2007b):
Theorem 1: For any Markov chain, quantum reinforcement learning algorithm converges at
the optimal state value function V (s )* with probability 1 under proper exploration policy
when the following conditions hold (where αk is stepsize and nonnegative):

T T
lim ∑α k = ∞ , lim ∑α k2 < ∞ (39)
T →∞ T →∞
k =1 k =1

From the procedure of QRL in Fig. 6, we can see that the learning process of QRL is carried
out through parallel computation, which also provides a mechanism of parallel updating.
Sutton and Barto (Sutton & Barto, 1998) have pointed out that for the basic RL algorithms
the parallel updating does not affect such performances of RL as learning speed and
convergence in general. But we find that the parallel updating will speed up the learning
process for the RL algorithms with a hierarchical setting (Sutton et al., 1999; Barto &
Mahadevan, 2003; Chen et al., 2005), because the parallel updating rules give more chance to
the updating of the upper level learning process and this experience for the agent can work
as the “sub-goals” intrinsically that will speed up the lower learning process.
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 75

Procedure QRL:
11L1 11L1
Initialize | s (m) 〉 = ∑ Cs | s〉 , f (s) =| as( n) 〉 =
s = 00L0
∑C
a = 00L0
a | a〉 and V ( s) arbitrarily

Repeat (for each episode)

11L1
For all states | s (m) 〉 = ∑C
s = 00L0
s | s〉 :

1. Observe f ( s ) =| a 〉 and get | a〉 ;

( n)
s

2. Take action | a〉 , observe next state | s ' 〉 , reward r , then

(a) Update state value: V ( s ) ← V ( s ) + α ( r + γV ( s ' ) − V ( s ))
(b) Update probability amplitudes:
repeat for [k (r + V ( s' ))] times
U Grov | as( n ) 〉 = U a( n )U a | as( n ) 〉
0

Until for all states | ΔV ( s ) |≤ ε .

Fig. 6. The algorithm of a standard QRL (Dong et al., 2007b)

4.5 Physical implementation

Now let’s simply consider the physical realization of QRL and detailed discussion can be
found in (Dong et al., 2006b). In QRL algorithm, the three main operations occur in
preparing the equally weighted superposition state for calculating the times of Grover
iteration, initializing the quantum system for representing states or actions, and carrying out
a certain times of Grover iteration for updating probability amplitude according to reward
value. In fact, we can initialize the quantum system by equally weighted superposition for
representing states or actions. So the main operations required are preparing the equally
weighted superposition state and carrying out Grover iteration. These can be implemented
using the Hadamard transform and the conditional phase shift operation, both of which are
relatively easy in quantum computation.
Consider a quantum system described by n qubits, it has 2n possible states. To prepare an
equally weighted superposition state, initially let each qubit lie in the state | 0〉 , then we can
perform the transformation H on each qubit independently in sequence and thus change
the state of the system. The state transition matrix representing this operation will be of
dimension 2 × 2 and it can be implemented by n shunt-wound Hadamard gates. This
n n

process can be represented into:

76 Reinforcement Learning: Theory and Applications

} n
67n8
11...1
1
H ⊗n | 00L 0〉 =
2n
∑ | a〉
a =00L0
(40)

The other operation is the conditional phase shift operation which is an important element
to carry out the Grover iteration. According to quantum information theory, this
transformation may be efficiently implemented using phase gates on a quantum computer.
The conditional phase shift operation does not change the probability of each state since the
square of the absolute value of the amplitude in each state stays the same.

4.6 Simulated experiments

The presented QRL algorithm is also tested using two examples: Prisoner’s Diploma and the
control of a five-qubit system.
1. Prisoner’s Diploma
The first example is derived from typical Prisoners’ Dilemma. In the Prisoners’ Dilemma,
each of the two prisoners, prisoner I and prisoner II, must independently make the action
selection to agree to give evidence against the other guy or to refuse to do so. The situation
is as described in Table 1 with the entries giving the length of the prison sentence (years in
prison) for each prisoner, in every possible situation. In this case, each of the prisoners is
assumed to minimize his sentence. As we know, this play may lead to Nash equilibrium by
giving the action selection (agree to give evidence, agree to give evidence) with the outcome of (3,
3) years in prison.

prisoner II
Agree to give evidence Refuse to give evidence
Prisoner I

Agree (3, 3) (0, 5)

Refuse (5, 0) (1, 1)

Table 1. The Prisoners’ Dilemma
Now, we assume that this Prisoners game can be played repeatedly. Each of them can
choose to agree or refuse to give evidence against the other guy and the probabilities of the
action selection (agree to give evidence, agree to give evidence) are initially equal. To find a better
outcome, the two prisoners try to improve their action selection using learning. By applying
the QRL method proposed in this chapter, we get the results as shown in Fig. 6 and Fig. 7
(Chen et al., 2006a; Chen et al., 2006c). From the results, it is obvious that the two prisoners
get smarter when they try to cooperate indeliberately and both of them select the action of
“Refuse to give evidence” after about 40 episodes of play. Then they steadily get the outcome
of (1, 1) instead of (3, 3) (Nash equilibrium).
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 77

Fig. 6. The outcome (years in prison) of the Prisoners problem for each prisoner

Fig. 7. The whole outcome of the Prisoners problem (Sum of years in prison for both
prisoners)
78 Reinforcement Learning: Theory and Applications

2. Control of a five-qubit system

The second axample is about the control of a five-qubit system (Dong et al., 2006c). With the
development of quantum information technology, quantum control theory has drawn the
attention of many scientists (Chen et al., 2005). The objective of quantum control is to
determine how to drive quantum systems from an initial given quantum state to a pre-
determined target quantum state with some given time. According to quantum mechanics,
the state | ψ ( t )〉 of arbitary time t can be reached through an evolution on the initial state
| ψ (0)〉 . It can be expressed as

| ψ ( t )〉 = Û | ψ (0)〉 (41)

where Û is a unitary operator and satisfies:

ÛÛ + = Û + Û = I (42)

where Û + Û . So the control problem of quantum

is the Hermitian conjugate operator of
state can be converted into finding appropriate unitary operator Û .
In this example, we consider the five-qubit system, it has 32 eigenstates. In practical
quantum information technology, some state transitions can easily be completed through
appropriate unitary transformations but the other ones are not easy to be accomplished.
Assume we know its state transitions satisfy the following equations through some
experiments:
| 00001〉 = Û 00 | 00000〉 ; | 00010〉 = Û 01 | 00001〉 ; | 00011〉 = Û 02 | 00010〉 ;
| 00100〉 = Û 03 | 00011〉 ; | 00101〉 = Û 04 | 00100〉 ; | 00111〉 = Û11 | 00001〉 ;
| 01000〉 = Û12 | 00010〉 ; | 01010〉 = Û14 | 00100〉 ; | 01011〉 = Û15 | 00101〉 ;
| 01000〉 = Û 21 | 00111〉 ; | 01011〉 = Û 24 | 01010〉 ; | 01101〉 = Û 31 | 00111〉 ;
| 10000〉 = Û 34 | 01010〉 ; | 10001〉 = Û 35 | 01011〉 ; | 01101〉 = Û 40 | 01100〉 ;
| 10001〉 = Û 44 | 10000〉 ; | 10010〉 = Û 50 | 01100〉 ; | 10110〉 = Û 54 | 10000〉 ;
| 10111〉 = Û 55 | 10001〉 ; | 10101〉 = Û 62 | 10100〉 ; | 10110〉 = Û 63 | 10101〉 ;
| 10111〉 = Û 64 | 10110〉 ; | 11000〉 = Û 70 | 10010〉 ; | 11100〉 = Û 74 | 10110〉 ;
| 11101〉 = Û 75 | 10111〉 ; | 11001〉 = Û 80 | 11001〉 ; | 11101〉 = Û 84 | 11100〉 ;
| 11111〉 = Û 91 | 11001〉
In the above equations, Û is reversible operator. For example, we can easily get
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 79

| 00000〉 = Û -100 | 00001〉 (43)

Assume the other transitions are impossible except the above transitions and corresponding
inverse transitions. If the initial state and the target state are | 11100〉 and | 11111〉
respectively, the following task is to find optimal control sequence through QRL.

Fig. 8. The grid representation for the quantum control problem of a five-qubit system
Therefor we first fill the eigenstates of five-qubit system in a grid room and they can be
described as shown in Fig. 8. Every eigenstate is arranged in a corresponding grid and the
hatched grid indicates that the corresponding state can not be attained. The two states with
a common side are mutually reachable through one-step control and other states can not
directly reach each other through one-step control. Now the task of the quantum learning
system is to find an optimal control sequence which will let the five-qubit system transform
from | 11100〉 to | 11111〉 . Using the QRL method proposed previously, we get the
results as shown in Fig. 9. And more experimental results are shown in Fig. 10 to
demonstrate its performance with different learning rates. From the results, it is obvious that
the control system can robustly find the optimal control sequence for the five-qubit system
through learning and the optimal control sequences are shown in Fig. 11. We can easily
obtain two optimal control sequences from Fig. 11:
80 Reinforcement Learning: Theory and Applications

Sequence_1 = {Û-174, Û54

-1 -1
, Û34 -1
, Û14 , Û-103, Û-102, Û12 , Û-121, Û31, Û-140, Û50, Û70, Û80 , Û91} (44)

Sequence_ 2 = {Û-174, Û54

-1 -1
, Û34 -1
, Û14 , Û-103, Û-102, Û-101, Û11, Û31, Û-140, Û50, Û70, Û80, Û91} (45)

Fig. 9. The performance of QRL for optimal control sequence

Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 81

Fig. 10. The performance of QRL with different learning rates

Fig. 11. The control paths for the control of a five-qubit system
82 Reinforcement Learning: Theory and Applications

5. Conclusion
According to the existing problems in RL area, such as low learning speed and tradeoff
between exploration and exploitation, SIRL and QRL methods are introduced based on the
theory of RL and quantum computation in this chapter, which follows the developing
roadmap from the superposition-inspired methods to the RL methods in quantum systems.
Just as simulated annealing algorithm comes from mimicking the physical annealing
process, quantum characteristics also broaden our mind and provide alternative approaches
to novel RL methods.
In this chapter, SIRL method emphasizes the exploration policy and uses a probabilistic
action selection method that is inspired by the state superposition principle and collapse
postulate. The experiments, which include a puzzle problem and a mobile robot navigation
problem, demanstrate the effectiveness of SIRL algorithm and show that it is superior to
basic TD algorithm with ε -greedy policy. As for QRL, the state/action value is represented
with quantum superposition state and the action selection is carried out by observing
quantum state according to quantum collapse postulate, which means a QRL system is
designed for the real quantum system although it can also be simulated on a traditional
computer. The results of simulated experiments verified its feasibility and effectiveness with
two examples: Prisoner’s Dilemma and the control of a five-qubit system. The contents
presented in this chapter are mainly the basic ideas and methods related to the combination
of RL theory and quantum computation. More theoretic research and applictions are to be
investigated in the future.

6. References
Barto, A.G. & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning,
Discrete Event Dynamic Systems: Theory and applications, Vol. 13, pp. 41-77
Bertsekas, D.P. & Tsitsiklis, J.N. (1996). Neuro-Dynamic Programming, Athena Scientific,
Belmont, MA
Chen, C.L. & Chen, Z.H. (2005). Reinforcement learning for mobile robot: from reaction to
deliberation. Journal of Systems Engineering and Electronics, Vol. 16, No. 3, pp. 611-
617
Chen, C.L.; Dong, D.Y. & Chen, Z.H. (2006a). Quantum computation for action selection
using reinforcement learning. International Journal of Quantum Information, Vol. 4,
No. 6, pp. 1071-1083
Chen, C.L.; Dong, D.Y. & Chen, Z.H. (2006b). Grey reinforcement learning for incomplete
information processing. Lecture Notes in Computer Science, Vol. 3959, pp. 399-407
Chen, C.L.; Dong, D.Y.; Dong, Y. & Shi, Q. (2006c). A quantum reinforcement learning
method for repeated game theory. Proceedings of the 2006 International Conference on
Computational Intelligence and Security, Part I, pp. 68-72, Guangzhou, China, Nov.
2006, IEEE Press
Chen, C.L. & Dong D.Y. (2007). Quantum mobile intelligent system, In: Quantum-Inspired
Evolutionary Computation, N. Nedjah, L. S. Coelho & L. M. Mourelle (Eds.), Springer,
in press
Chen, Z.H.; Dong, D.Y. & Zhang, C.B. (2005). Quantum Control Theory: An Introduction,
University of Science and Technology of China Press, ISBN 7-312-01863-7/TP. 363,
Hefei (In Chinese)
Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning 83

Chuang I.L.; Gershenfeld N. & Kubinec M. (1998). Experimental implementation of fast

quantum searching, Physical Review Letters, Vol. 80, pp. 3408-3411
Dong, D.Y.; Chen, C.L. & Chen, Z.H. (2005a). Quantum reinforcement learning. Lecture Notes
in Computer Science, Vol. 3611, pp. 686-689
Dong, D.Y.; Chen, C.L.; Zhang, C.B. & Chen, Z.H. (2005b). An autonomous mobile robot
based on quantum algorithm. Lecture Notes in Artificial Intelligence, Vol. 3801, pp.
394-399
Dong, D.Y.; Chen, C.L.; Zhang, C.B. & Chen, Z.H. (2006a). Quantum robot: structure,
algorithms and applications. Robotica, Vol. 24, No.4, July 2006, pp. 513-521
Dong, D.Y.; Chen, C.L.; Chen, Z.H. & Zhang, C.B. (2006b). Quantum mechanics helps in
learning for more intelligent robots. Chinese Physics Letters, Vol. 23, No. 7, pp. 1691-
1694
Dong, D.Y.; Chen, C.L.; Chen, Z.H. & Zhang, C.B. (2006c). Control of five-qubit system
based on quantum reinforcement learning. Proceedings of the 2006 International
Conference on Computational Intelligence and Security, Part I, pp. 164-167, Guangzhou,
China, Nov. 2006, IEEE Press
Dong, D.Y.; Chen, C.L. & Li, H.X. (2007a). Reinforcement strategy using quantum amplitude
amplification for robot learning. Proceedings of the 26th Chinese Control Conference,
Part VI, pp. 571-575, Zhangjiajie, China, Jul. 2007, IEEE Press
Dong, D.Y.; Chen, C.L.; Li, H.X. & Tarn, T.J. (2007b). Quantum reinforcement learning. IEEE
Transaction on System, Man, and Cybernetics B, under review
Ekert, A. & Jozsa, R. (1996). Quantum computation and Shor’s factoring algorithm, Reviews
of Modern Physics, Vol. 68, pp. 733-753
Grover, L. K. (1996). A fast quantum mechanical algorithm for database search, Proceedings
of the 28th Annual ACM Symposium on the Theory of Computation, pp. 212-219, New
York, 1996, ACM Press
Grover, L. K. (1997). Quantum mechanics helps in searching for a needle in a haystack,
Physical Review Letters, Vol. 79, pp. 325-327, 1997
Guo, M.Z.; Liu, Y. & Malec, J. (2004). A new Q-learning algorithm based on the metropolis
criterion, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, Vol.
34, No. 5, pp. 2140-2143
He, P. & Jagannathan, S. (2005). Reinforcement learning-based output feedback control of
nonlinear systems with input constraints, IEEE Transactions on Systems Man and
Cybernetics, Part B-Cybernetics, Vol. 35, No. 1, pp. 150-154
Jones, J.A. (1998a). Fast searches with nuclear magnetic resonance computers, Science, Vol.
280, pp. 229
Jones, J.A.; Mosca, M. & Hansen R.H. (1998b). Implementation of a quantum Search
algorithm on a quantum computer, Nature, Vol. 393, pp. 344-346
Kaelbling, L.P.; Littman, M.L. & Moore, A.W. (1996). Reinforcement learning: a survey,
Journal of Artificial Intelligence Research, Vol. 4, pp. 237-287
Kondo, T. & Ito, K. (2004). A reinforcement learning with evolutionary state recruitment
strategy for autonomous mobile robots control, Robotics and Autonomous Systems,
Vol. 46, pp. 111-124
Kwiat, P.G.; Mitchell, J.R.; Schwindt, P.D.D. et al. (2000). Grover’s search algorithm: an
optical approach, Journal of Modern Optics, Vol. 47, pp. 257-266
84 Reinforcement Learning: Theory and Applications

Morimoto, J. & Doya, K. (2001). Acquisition of stand-up behavior by a real robot using
hierarchical reinforcement learning, Robotics and Autonomous Systems, Vol. 36, pp.
37-51
Nielsen, M.A. & Chuang, I.L. (2000). Quantum Computation and Quantum Information,
Cambridge University Press, Cambridge, England
Preskill, J. (1998). Physics 229: Advanced Mathematical Methods of Physics--Quantum
Information and Computation. California Institute of Technology, 1998. Available
electronically via http://www.theory.caltech.edu/people/preskill/ph229/
Scully, M.O. & Zubairy, M.S. (2001). Quantum optical implementation of Grover’s
algorithm, Proceedings of the National Academy of Sciences of the United States of
America, Vol. 98, pp. 9490-9493
Shor, P. W. (1994). Algorithms for quantum computation: discrete logarithms and factoring,
Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pp. 124-
134, Los Alamitos, CA, IEEE Press
Sutton, R. & Barto A.G. (1998). Reinforcement Learning: An Introduction, MIT Press,
Cambridge, MA
Sutton, R.; Precup, D. and Singh, S. (1999). Between mdps and semi-mdps: a framework for
temporal abstraction in reinforcement learning, Artificial Intelligence, Vol. 112, pp.
181-211
Vandersypen, L.M.K.; Steffen, M.; Breyta, G. et al. (2001). Experimental realization of Shor’s
quantum factoring algorithm using nuclear magnetic resonance, Nature, Vol. 414,
pp. 883-887
Watkins, J.C.H. and Dayan, P. (1992). Q-learning, Machine Learning, Vol. 8, pp. 279-292

DMGR2 - The Castle Guide (2E)
100% (5)
DMGR2 - The Castle Guide (2E)
130 pages
TSR Inc - AD&D 2nd Edition - Dragonlance_ a Saga Companion
No ratings yet
TSR Inc - AD&D 2nd Edition - Dragonlance_ a Saga Companion
130 pages
Ancient Sicily (1890)
No ratings yet
Ancient Sicily (1890)
398 pages
Edg All About History Book of the Pilgrims 2018
No ratings yet
Edg All About History Book of the Pilgrims 2018
129 pages
TSR Inc - AD&D 2nd Edition - Dragonlance_ Battle Lines Adventure 1 - The Sylvan Veil
No ratings yet
TSR Inc - AD&D 2nd Edition - Dragonlance_ Battle Lines Adventure 1 - The Sylvan Veil
116 pages
sc_0000839536_00000001739230
No ratings yet
sc_0000839536_00000001739230
176 pages
HarvardTheologStud.v3.EphdArkStRecRelAnctHeb.wrarnold.hup.1917.Gs
No ratings yet
HarvardTheologStud.v3.EphdArkStRecRelAnctHeb.wrarnold.hup.1917.Gs
187 pages
GrkEcclesHistFrstSixCentChristEra.v2.EusebEcclHis.tr,CFCruse,Etc.lon.SB.1847.Gs
No ratings yet
GrkEcclesHistFrstSixCentChristEra.v2.EusebEcclHis.tr,CFCruse,Etc.lon.SB.1847.Gs
466 pages
GrkEcclesHistFrstSixCentChristEra.v5.TheodrtEcclsHis.5Bks,Etc.lon.SB.1843.Gs
No ratings yet
GrkEcclesHistFrstSixCentChristEra.v5.TheodrtEcclsHis.5Bks,Etc.lon.SB.1843.Gs
425 pages
DMGR1 - Campaign Sourcebook and Catacomb Guide PDF
100% (12)
DMGR1 - Campaign Sourcebook and Catacomb Guide PDF
130 pages
The Ashlar Vol I Sep 1855 - Aug 1856
No ratings yet
The Ashlar Vol I Sep 1855 - Aug 1856
562 pages
The Ashlar Vol II Sep 1856 - Aug 1857
No ratings yet
The Ashlar Vol II Sep 1856 - Aug 1857
562 pages
Old Assyrian Cuneiform - Ancient Scripts [English]
No ratings yet
Old Assyrian Cuneiform - Ancient Scripts [English]
161 pages
Hippocratic Oratory
No ratings yet
Hippocratic Oratory
170 pages
227387715 Bazm e Taimooriya Syed Sabahuddin Abdur Rahman
No ratings yet
227387715 Bazm e Taimooriya Syed Sabahuddin Abdur Rahman
483 pages
TSR Inc - AD&D 2nd Edition - Dragonlance_ Battle Lines Adventure 2 - Rise of the Titans
No ratings yet
TSR Inc - AD&D 2nd Edition - Dragonlance_ Battle Lines Adventure 2 - Rise of the Titans
98 pages
227387715 Bazm e Taimooriya Syed Sabahuddin Abdur Rahman
No ratings yet
227387715 Bazm e Taimooriya Syed Sabahuddin Abdur Rahman
483 pages
Lucky House Others
No ratings yet
Lucky House Others
16 pages
Mikro DM38
No ratings yet
Mikro DM38
2 pages
The Elephantine Papyri in English
No ratings yet
The Elephantine Papyri in English
662 pages
External Content
No ratings yet
External Content
484 pages
9789004506114
No ratings yet
9789004506114
324 pages
Maps Ancient-Classical Italy Neighbors
No ratings yet
Maps Ancient-Classical Italy Neighbors
52 pages
Freemasonry in The Holy Land - R Morris PDF
No ratings yet
Freemasonry in The Holy Land - R Morris PDF
612 pages
Hipocrates - VOLUME 8
No ratings yet
Hipocrates - VOLUME 8
440 pages
Plutarch's Lives PDF
No ratings yet
Plutarch's Lives PDF
615 pages
My Time Table 2024-25
No ratings yet
My Time Table 2024-25
1 page
190 MP IgM-IFU-en-EU-IVDD-V2.1
No ratings yet
190 MP IgM-IFU-en-EU-IVDD-V2.1
2 pages
AIML Curriculum Powered by IBM - Pregrad-Merged
No ratings yet
AIML Curriculum Powered by IBM - Pregrad-Merged
66 pages
Carcassonne V3 Supplement
No ratings yet
Carcassonne V3 Supplement
2 pages
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
From Everand
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
Mark Manson
4/5 (6458)
CSC441 Script Video Sawanah Koko
No ratings yet
CSC441 Script Video Sawanah Koko
2 pages
Chemistry Homework 8-1
No ratings yet
Chemistry Homework 8-1
7 pages
For Placement
No ratings yet
For Placement
7 pages
Hesiodhomerichym00hesi 0
No ratings yet
Hesiodhomerichym00hesi 0
692 pages
792224FT-371008 - 207-00011
No ratings yet
792224FT-371008 - 207-00011
13 pages
CMPC Pulp: Insulation Requirement: Heat Conservation (For Personnel Protection, See Notes 3 & 4) Service
No ratings yet
CMPC Pulp: Insulation Requirement: Heat Conservation (For Personnel Protection, See Notes 3 & 4) Service
3 pages
Principles: Life and Work
From Everand
Principles: Life and Work
Ray Dalio
4/5 (648)
ADnD DMGR9 of Ships and Sea 2nd Edition
No ratings yet
ADnD DMGR9 of Ships and Sea 2nd Edition
130 pages
GDC BCP Template
No ratings yet
GDC BCP Template
53 pages
Emergency Cart Checklist
No ratings yet
Emergency Cart Checklist
1 page
American Scientist, Vol. 111.1 (January-February 2023)
No ratings yet
American Scientist, Vol. 111.1 (January-February 2023)
68 pages
TRM401 Energy-Savings-Calculator pump-and-fan-VFD v4 1 14
No ratings yet
TRM401 Energy-Savings-Calculator pump-and-fan-VFD v4 1 14
30 pages
Studies in The Grammar and Lexicon of Neo Aramaic
No ratings yet
Studies in The Grammar and Lexicon of Neo Aramaic
543 pages
Never Split the Difference: Negotiating As If Your Life Depended On It
From Everand
Never Split the Difference: Negotiating As If Your Life Depended On It
Chris Voss
4.5/5 (1005)
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
From Everand
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
Brené Brown
4/5 (1175)
Application Form
No ratings yet
Application Form
2 pages
The Little Book of Hygge: Danish Secrets to Happy Living
From Everand
The Little Book of Hygge: Danish Secrets to Happy Living
Meik Wiking
3.5/5 (464)
A Man Called Ove: A Novel
From Everand
A Man Called Ove: A Novel
Fredrik Backman
4.5/5 (5181)
Grit: The Power of Passion and Perseverance
From Everand
Grit: The Power of Passion and Perseverance
Angela Duckworth
4/5 (650)
The Sympathizer: A Novel (Pulitzer Prize for Fiction)
From Everand
The Sympathizer: A Novel (Pulitzer Prize for Fiction)
Viet Thanh Nguyen
4.5/5 (141)
The Glass Castle: A Memoir
From Everand
The Glass Castle: A Memoir
Jeannette Walls
4.5/5 (1856)
The Emperor of All Maladies: A Biography of Cancer
From Everand
The Emperor of All Maladies: A Biography of Cancer
Siddhartha Mukherjee
4.5/5 (298)
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
From Everand
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
Ben Horowitz
4.5/5 (361)
Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
From Everand
Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
Ashlee Vance
4.5/5 (582)
Morals and Dogma by Albert Pike PDF
No ratings yet
Morals and Dogma by Albert Pike PDF
574 pages
Sing, Unburied, Sing: A Novel
From Everand
Sing, Unburied, Sing: A Novel
Jesmyn Ward
4/5 (1267)
DMGR8 - Sages and Specialists
100% (3)
DMGR8 - Sages and Specialists
130 pages
Steve Jobs
From Everand
Steve Jobs
Walter Isaacson
4.5/5 (1139)
The Obu Manuvo Tribe
No ratings yet
The Obu Manuvo Tribe
2 pages
Experiment 106: Uniform Circular Motion
No ratings yet
Experiment 106: Uniform Circular Motion
7 pages
Shoe Dog: A Memoir by the Creator of Nike
From Everand
Shoe Dog: A Memoir by the Creator of Nike
Phil Knight
4.5/5 (629)
All About History Book of Myths and Legends. (Wilkinson, Philip)
100% (6)
All About History Book of Myths and Legends. (Wilkinson, Philip)
164 pages
The Woman in Cabin 10
From Everand
The Woman in Cabin 10
Ruth Ware
3.5/5 (2814)
Hidden Figures: The American Dream and the Untold Story of the Black Women Mathematicians Who Helped Win the Space Race
From Everand
Hidden Figures: The American Dream and the Untold Story of the Black Women Mathematicians Who Helped Win the Space Race
Margot Lee Shetterly
4/5 (1022)
The Outsider: A Novel
From Everand
The Outsider: A Novel
Stephen King
4/5 (2886)
Her Body and Other Parties: Stories
From Everand
Her Body and Other Parties: Stories
Carmen Maria Machado
4/5 (903)
Cimplicity 1
No ratings yet
Cimplicity 1
119 pages
The Perks of Being a Wallflower
From Everand
The Perks of Being a Wallflower
Stephen Chbosky
4.5/5 (4103)
Electrical Electronics VOL.08 PDF
50% (2)
Electrical Electronics VOL.08 PDF
148 pages
Bad Feminist: Essays
From Everand
Bad Feminist: Essays
Roxane Gay
4/5 (1090)
Yes Please
From Everand
Yes Please
Amy Poehler
4/5 (2016)
Operation & Service Manual For Cable Tensiometer: Series
No ratings yet
Operation & Service Manual For Cable Tensiometer: Series
28 pages
Angela's Ashes: A Memoir
From Everand
Angela's Ashes: A Memoir
Frank McCourt
4.5/5 (943)
The World Is Flat 3.0: A Brief History of the Twenty-first Century
From Everand
The World Is Flat 3.0: A Brief History of the Twenty-first Century
Thomas L. Friedman
3.5/5 (2289)
Fear: Trump in the White House
From Everand
Fear: Trump in the White House
Bob Woodward
3.5/5 (836)
Lesson 5
No ratings yet
Lesson 5
2 pages
The Moon: Questions
No ratings yet
The Moon: Questions
10 pages
Rise of ISIS: A Threat We Can't Ignore
From Everand
Rise of ISIS: A Threat We Can't Ignore
Jay Sekulow
3.5/5 (144)
Designation
No ratings yet
Designation
12 pages
Team of Rivals: The Political Genius of Abraham Lincoln
From Everand
Team of Rivals: The Political Genius of Abraham Lincoln
Doris Kearns Goodwin
4.5/5 (244)
Brooklyn: A Novel
From Everand
Brooklyn: A Novel
Colm Toibin
3.5/5 (2141)
Begun and Held in Metro Manila, On Monday, The Twenty-Fifth Day of July, Two Thousand Sixteen
No ratings yet
Begun and Held in Metro Manila, On Monday, The Twenty-Fifth Day of July, Two Thousand Sixteen
3 pages
Using The Universal PE Unpacker
No ratings yet
Using The Universal PE Unpacker
11 pages
The Art of Racing in the Rain: A Novel
From Everand
The Art of Racing in the Rain: A Novel
Garth Stein
4/5 (4372)
Wolf Hall: A Novel
From Everand
Wolf Hall: A Novel
Hilary Mantel
4/5 (4135)
The Constant Gardener: A Novel
From Everand
The Constant Gardener: A Novel
John le Carré
4/5 (278)
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
From Everand
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
Dave Eggers
3.5/5 (233)
Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
From Everand
Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
Gilbert King
4.5/5 (280)
The Yellow House: A Memoir (2019 National Book Award Winner)
From Everand
The Yellow House: A Memoir (2019 National Book Award Winner)
Sarah M. Broom
4/5 (100)
A Tree Grows in Brooklyn
From Everand
A Tree Grows in Brooklyn
Betty Smith
4.5/5 (2033)
GSTR1 Excel Workbook Template V1.4
No ratings yet
GSTR1 Excel Workbook Template V1.4
84 pages
John Adams
From Everand
John Adams
David McCullough
4.5/5 (2546)
Audio Spotlight
No ratings yet
Audio Spotlight
40 pages
On Fire: The (Burning) Case for a Green New Deal
From Everand
On Fire: The (Burning) Case for a Green New Deal
Naomi Klein
4/5 (78)
GMAT - 2018.PDF Version 1
No ratings yet
GMAT - 2018.PDF Version 1
21 pages
Little Women
From Everand
Little Women
Louisa May Alcott
4.5/5 (2369)
Manhattan Beach: A Novel
From Everand
Manhattan Beach: A Novel
Jennifer Egan
3.5/5 (919)
The Light Between Oceans: A Novel
From Everand
The Light Between Oceans: A Novel
M.L. Stedman
4.5/5 (815)
The Unwinding: An Inner History of the New America
From Everand
The Unwinding: An Inner History of the New America
George Packer
4/5 (45)
A'Seeb Wastewater Project Seeb, Muscat, Sultanate of Oman
No ratings yet
A'Seeb Wastewater Project Seeb, Muscat, Sultanate of Oman
3 pages

04of22 - Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning

Uploaded by

04of22 - Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning

Uploaded by

4

Superposition-Inspired Reinforcement Learning

is proposed to mimic the state collapse phenomenon according to quantum measurement

2. Fundamentals of quantum computation

2.1 State superposition and quantum parallel computation

where α and β | α |2 + | β |2 = 1 . | 0〉 and | 1〉 are

where Cx is complex coefficients and | C x |2 represents occurrence probability of | x〉

2.2 Quantum gates

3. Superposition-inspired reinforcement learning

3.1 Reinforcement learning and exploration strategy

V(πs ) = E{rt +1 + γrt +2 + γ 2 rt +3 + L | st = s, π }

where γ ∈ [0,1) is discounted factor, π ( s, a ) is the probability of selecting action a

V(*s ) = max[rsa + γ ∑ pssa 'V(*s ') ] (15)

π * = arg max V(πs ) , ∀s ∈ S (16)

= rsa + γ ∑ pssa ' ∑ π ( s ' , a ' )Q(πs ',a ')

Q( st , at ) ← (1 − α )Q( st , at ) + α (rt +1 + γ max a ' Q( st +1 , a' ) (19)

probability ci , which leads to a natural exploration strategy for SIRL.

3.3 Simulated experiments

1. The puzzle problem

2. Mobile robot navigation

Fig. 5. Simulation result of robot navigation in indoor environment

4. Quantum reinforcement learning

where βn is probability amplitude, which can be a complex number, | sn 〉 and | an 〉 are

where Cs and Ca can be complex numbers and satisfy

4.2 Action selection policy

where Ca is probability amplitude of action | a〉 and satisfies (29).

Definition 2: (Collapse) When a quantum state |ψ 〉 = ∑ β n |ψ n 〉 is measured, it will be

4.3 Value function updating and reinforcement strategy

qubit register, it will be | s (m) 〉 = ∑C

According to quantum parallel computation theory, a certain unitary transformation U

By repeatedly applying the transformation U Grov on | a0( n ) 〉 , we can enhance the

iteration, we can reinforce the probability amplitude of corresponding action according to

4.4 Quantum reinforcement learning algorithm

Repeat (for each episode)

1. Observe f ( s ) =| a 〉 and get | a〉 ;

2. Take action | a〉 , observe next state | s ' 〉 , reward r , then

Until for all states | ΔV ( s ) |≤ ε .

Fig. 6. The algorithm of a standard QRL (Dong et al., 2007b)

4.5 Physical implementation

process can be represented into:

4.6 Simulated experiments

Agree (3, 3) (0, 5)

Refuse (5, 0) (1, 1)

2. Control of a five-qubit system

where Û is a unitary operator and satisfies:

where Û + Û . So the control problem of quantum

| 00000〉 = Û -100 | 00001〉 (43)

Sequence_1 = {Û-174, Û54

Sequence_ 2 = {Û-174, Û54

Fig. 9. The performance of QRL for optimal control sequence

Fig. 10. The performance of QRL with different learning rates

Chuang I.L.; Gershenfeld N. & Kubinec M. (1998). Experimental implementation of fast

You might also like

V(s ) = max[rsa + γ ∑ pssa 'V(s ') ] (15)