01 Module 2 Neural Network Based Reinforcement Learning
Learning Objectives
● TD-Gammon
● Deep Q Networks
○ Memory
○ Code
Agenda
TD-Gammon
Q-table
State | Action 0 | Action 1 | Action 2 | Action 3
0     | 0        | 0        | 0        | 0
1     | 0        | 0        | 0        | 0
2     | 0        | 0        | 0        | 0
3     | 0        | 0        | 0        | 0
Q Tables vs Q Networks
[Figure: the Q-table on the left is approximated by a neural network on the right: Input: State Features → Hidden Layer 1 → Output: TD(λ)]
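To make the comparison concrete, here is a minimal sketch of both representations for a 4-state, 4-action problem, written with TensorFlow/Keras; the layer sizes and variable names are illustrative rather than the course's exact network.

import numpy as np
import tensorflow as tf

n_states, n_actions = 4, 4

# Tabular representation: one stored value per (state, action) pair.
q_table = np.zeros((n_states, n_actions))

# Network representation: a learned function from state features to values.
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(n_states,)),
    tf.keras.layers.Dense(n_actions)  # one output per action
])

# Both answer the same question: how good is each action in this state?
state = tf.one_hot([2], n_states)   # state 2 encoded as a feature vector
print(q_table[2])                   # table lookup
print(q_network(state).numpy())     # network approximation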
Agenda
TD-Gammon
[Figure: the agent-environment loop; the environment returns a state and a reward to the agent]
Deep Q Learning - Training
1 Cycle Returns:
● 1 State
● 1 Chosen Action
● 1 Reward
● 1 State Prime
Train on State
Deep Q Learning - Loss Function
Loss = (y - ŷ)²
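A small sketch of how this squared error is typically formed in deep Q-learning, assuming the standard target y = r + γ · max Q(s′, a′); the gamma value and function names below are illustrative.

import numpy as np

gamma = 0.99  # assumed discount factor

def td_target(reward, next_q_values, done):
    # y = r for terminal steps, otherwise r + gamma * max_a' Q(s', a').
    return reward + gamma * np.max(next_q_values) * (1 - done)

def squared_loss(y, y_hat):
    # Loss = (y - y_hat)^2
    return (y - y_hat) ** 2

# Example: reward 1.0, predicted Q-values for state prime, episode not done.
y = td_target(1.0, np.array([0.2, 0.5, 0.1]), done=0)
print(squared_loss(y, 0.4))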
Agenda
TD-Gammon
[Figure: the agent-environment loop; the environment returns a state and a reward to the agent]
Deep Q Learning - Memory
1 Cycle Returns:
● 1 State
● 1 Chosen Action
● 1 Reward
● 1 State Prime

Memory Buffer
Idx | state | action | reward | state prime
0   | s0    | a0     | r0     | s1
1   | s1    | a1     | r1     | s2
2   | s2    | a2     | r2     | s3
...
Deep Q Learning - Memory
A random sample drawn from the Memory Buffer (e.g., indices 0 and 2):
Idx | state | action | reward | state prime
0   | s0    | a0     | r0     | s1
2   | s2    | a2     | r2     | s3
The Memory Buffer
class Memory():
    def __init__(self, memory_size, batch_size):
        ...

    def sample(self):
        ...
The Memory Buffer
class Memory():
    def __init__(self, memory_size, batch_size):
        self.buffer = deque(maxlen=memory_size)
        self.batch_size = batch_size

    def sample(self):
        ...
The Memory Buffer
from collections import deque

import numpy as np

class Memory():
    def __init__(self, memory_size, batch_size):
        self.buffer = deque(maxlen=memory_size)  # oldest experiences drop off automatically
        self.batch_size = batch_size

    def sample(self):
        # Draw a random mini-batch of experiences without replacement.
        buffer_size = len(self.buffer)
        index = np.random.choice(
            np.arange(buffer_size), size=self.batch_size, replace=False)
        batch = [self.buffer[i] for i in index]
        return batch
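A brief usage sketch, assuming experiences are stored as (state, action, reward, state_prime) tuples; the slides do not show the method that appends them, so this example writes to the deque directly.

memory = Memory(memory_size=10000, batch_size=2)

# Store a few illustrative experiences.
memory.buffer.append(('s0', 'a0', 0.0, 's1'))
memory.buffer.append(('s1', 'a1', 1.0, 's2'))
memory.buffer.append(('s2', 'a2', 0.0, 's3'))

print(memory.sample())  # e.g., [('s2', 'a2', 0.0, 's3'), ('s0', 'a0', 0.0, 's1')]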
Experience Replay
Agenda
TD-Gammon
Deep Q Learning - Network (advanced)
Training: the actions_input mask is one-hot over the chosen action (e.g., 0 0 1 0), so the mse loss only sees that action's Q-value.

    hidden_1 = Dense(hidden_neurons, activation='relu')(state_input)
    hidden_2 = Dense(hidden_neurons, activation='relu')(hidden_1)
    q_values = Dense(action_size)(hidden_2)
    masked_q_values = Multiply()([q_values, actions_input])

    model = Model(inputs=[state_input, actions_input], outputs=masked_q_values)
    optimizer = tf.keras.optimizers.RMSprop(lr=learning_rate)
    model.compile(loss='mse', optimizer=optimizer)
    return model
Deep Q Learning - Network (advanced)
Predicting: the actions_input mask is all ones (e.g., 1 1 1 1), so the model returns every action's Q-value.

def deep_q_network(state_shape, action_size, learning_rate, hidden_neurons):
    state_input = Input(state_shape, name='frames')
    actions_input = Input((action_size,), name='mask')
    hidden_1 = Dense(hidden_neurons, activation='relu')(state_input)
    hidden_2 = Dense(hidden_neurons, activation='relu')(hidden_1)
    q_values = Dense(action_size)(hidden_2)
    masked_q_values = Multiply()([q_values, actions_input])

    model = Model(inputs=[state_input, actions_input], outputs=masked_q_values)
    optimizer = tf.keras.optimizers.RMSprop(lr=learning_rate)
    model.compile(loss='mse', optimizer=optimizer)
    return model
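A short usage sketch of the two masks, assuming a model built by deep_q_network with state_shape=(4,) and action_size=4; the array shapes, hyperparameters, and target value are illustrative.

import numpy as np

action_size = 4
model = deep_q_network((4,), action_size, learning_rate=0.00025, hidden_neurons=32)
state_batch = np.random.rand(1, 4).astype(np.float32)

# Predicting: an all-ones mask returns every Q-value; pick the best action.
all_actions = np.ones((1, action_size), dtype=np.float32)
q = model.predict([state_batch, all_actions])
best_action = int(np.argmax(q[0]))

# Training: a one-hot mask so the mse loss only touches the chosen action.
one_hot = np.zeros((1, action_size), dtype=np.float32)
one_hot[0, best_action] = 1.0
target = np.zeros((1, action_size), dtype=np.float32)
target[0, best_action] = 1.5   # e.g., r + gamma * max Q(s', a')
model.train_on_batch([state_batch, one_hot], target)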
Deep Q Learning - The Act Function
def act(self, state, training=False):
    if training:
        # Random actions until enough simulations to train the model.
        if len(self.memory.buffer) >= self.memory.batch_size:
            self.random_rate *= self.random_decay
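The slide shows only the decay of the exploration rate; a minimal epsilon-greedy completion might look like the sketch below. The attribute names random_rate, random_decay, and memory come from the slides; self.action_size and self.network are assumptions.

import random

import numpy as np

def act(self, state, training=False):
    if training:
        # Random actions until enough simulations to train the model.
        if len(self.memory.buffer) >= self.memory.batch_size:
            self.random_rate *= self.random_decay
        if random.random() < self.random_rate:
            return random.randint(0, self.action_size - 1)  # explore
    # Exploit: an all-ones mask returns every Q-value; act greedily.
    mask = np.ones((1, self.action_size), dtype=np.float32)
    q_values = self.network.predict([np.expand_dims(state, 0), mask])
    return int(np.argmax(q_values[0]))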
Screencast
Policy Gradients
Agenda
Policy Gradients
Actor - Critic
Deep Q vs Policy Gradients
Deep Q Network: maps state properties to one Q-value per action.
Policy Gradient: maps state properties to a probability for each action (e.g., .4 .3 .3).
Policy Gradients - Loss
The network takes the state properties as input; the chosen action is one-hot encoded (e.g., 0 0 1) for the loss.
Policy Gradients - Loss
Δw = α ∇w πw(a*, s)

Dividing by the probability of the chosen action:
Δw = α ∇w πw(a*, s) / πw(a*, s)

which is the gradient of the log-likelihood:
Δw = α ∇w log(πw(a*, s))

Weighting each update by the discounted return Gt:
Δw = α ∇w log(πw(a, s)) · Gt
Policy Gradients - Loss
Δw = α ∇w log(πw(a, s)) · Gt

from tensorflow.keras import backend as K

def custom_loss(y_true, y_pred):
    # y_true: one-hot chosen actions; y_pred: the policy's action probabilities.
    y_pred_clipped = K.clip(y_pred, 1e-8, 1 - 1e-8)
    log_likelihood = y_true * K.log(y_pred_clipped)
    return K.sum(-log_likelihood * g)  # g: the discounted-return input G
Policy Gradients - Network
def build_networks(state_shape, action_size, learning_rate, hidden_neurons):
    state_input = Input(state_shape, name='frames')
    g = Input((1,), name='G')
    hidden_1 = Dense(hidden_neurons, activation='relu')(state_input)
    hidden_2 = Dense(hidden_neurons, activation='relu')(hidden_1)
    probabilities = Dense(action_size, activation='softmax')(hidden_2)
    policy = Model(
        inputs=[state_input, g], outputs=[probabilities])
    optimizer = Adam(lr=learning_rate)
    policy.compile(loss=custom_loss, optimizer=optimizer)
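Since custom_loss uses the G input tensor, one plausible arrangement (an assumption about code the slides do not show) is to define the loss inside build_networks so it closes over g:

# Hypothetical sketch (not shown on the slides).
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input

def build_networks(state_shape, action_size, learning_rate, hidden_neurons):
    state_input = Input(state_shape, name='frames')
    g = Input((1,), name='G')

    def custom_loss(y_true, y_pred):
        y_pred_clipped = K.clip(y_pred, 1e-8, 1 - 1e-8)
        log_likelihood = y_true * K.log(y_pred_clipped)
        return K.sum(-log_likelihood * g)

    # ... build the hidden layers and compile the policy with loss=custom_loss ...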
def sample(self):
    batch = np.array(self.buffer).T.tolist()
    states_mb = np.array(batch[0], dtype=np.float32)
    actions_mb = np.array(batch[1], dtype=np.int8)
    rewards_mb = np.array(batch[2], dtype=np.float32)
    self.buffer = []  # clear the buffer; each experience is used only once
    return states_mb, actions_mb, rewards_mb
Policy Gradients - Training
def learn(self):
    """Trains the policy network based on stored experiences."""
    # Obtain the episode's experiences from memory.
    state_mb, action_mb, reward_mb = self.memory.sample()
    actions = tf.one_hot(action_mb, self.action_size)
    # Normalized TD(1): discounted returns, then standardized.
    discount_mb = np.zeros_like(reward_mb)
    total_rewards = 0
    for t in reversed(range(len(reward_mb))):
        total_rewards = reward_mb[t] + total_rewards * self.memory.gamma
        discount_mb[t] = total_rewards
    discount_mb = (discount_mb - np.mean(discount_mb)) / np.std(discount_mb)
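A small standalone sketch of the same discounting loop with an assumed gamma, to make the "Normalized TD(1)" computation concrete.

import numpy as np

def discounted_returns(rewards, gamma=0.9):
    # G_t = r_t + gamma * G_{t+1}, computed from the end of the episode backwards.
    returns = np.zeros_like(rewards, dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns(np.array([0.0, 0.0, 1.0])))  # [0.81, 0.9, 1.0]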
Actor - Critic
Breaking Down Q
[Figure: breaking Q down, starting from the state inputs]
A2C - Network
def build_networks(state_shape, action_size, actor_lr, critic_lr, neurons):
    state_input = layers.Input(state_shape, name='frames')
    advantage = layers.Input((1,), name='A')  # Now A instead of G.

# Apply TD(0).
discount_mb = reward_mb + next_v_mb * self.memory.gamma * (1 - dones_mb)
state_values = self.critic.predict([state_mb])
advantages = discount_mb - np.squeeze(state_values)
self.actor.train_on_batch([state_mb, advantages], [action_mb, discount_mb])
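A tiny numeric sketch of the TD(0) target and advantage computation above, with assumed values for gamma and the critic's predictions:

import numpy as np

gamma = 0.99                                            # assumed discount factor
reward_mb = np.array([1.0, 0.0], dtype=np.float32)
next_v_mb = np.array([0.5, 0.2], dtype=np.float32)      # critic's value of s'
dones_mb = np.array([0.0, 1.0], dtype=np.float32)       # second transition ends the episode
state_values = np.array([0.4, 0.1], dtype=np.float32)   # critic's value of s

# TD(0) target: r + gamma * V(s') for non-terminal steps, just r otherwise.
discount_mb = reward_mb + next_v_mb * gamma * (1 - dones_mb)
advantages = discount_mb - state_values
print(discount_mb)   # [1.495, 0.0]
print(advantages)    # [1.095, -0.1]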
Lab
Use Reinforcement Learning in Trading
Lab Objectives
Screencast
What is LSTM?
Daniel Sparing
Machine Learning Solutions Engineer
Google Cloud
Agenda
Sequence Models
RNN limitations
LSTM
Feed Forward Networks
[Figure: input layer → hidden layers → output layer]
Inference is stateless.
Embeddings are aggregated using sum or average. This is essentially the “bag-of-words” approach.
[Figure: Embedding layer followed by an Aggregation layer]
Let’s wrap a DNN in a for loop!
RNNs: Networks with Loops
● A: a subgraph of the NN
● x(t): RNN input at time t
● h(t): RNN state at time t
● h(t) = (hidden state, output)

for t in range(len(x)):
    h_next = A(x[t], h[t-1].hidden)
    h.append(h_next)
loss = sum([loss_fn(y) for y in h.output])

Parts of this slide and the following slides are building on ideas and visualizations from
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
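A runnable NumPy sketch of the same idea, assuming a single tanh recurrence; the weights and sizes are illustrative, not the course's model.

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

# Shared weights: the same A is applied at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

x = rng.normal(size=(seq_len, input_size))
h = np.zeros(hidden_size)
for t in range(seq_len):
    # h_t depends on the current input and the previous hidden state.
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)
print(h)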
Unrolled Recurrent Neural Networks
[Figure: the loop unrolled; the same module A is applied to xt at every step, producing ht]
Secret sauce:
● Tie (share) weights of A for all t.
● Backprop updates same weights for all t
  (sum gradients from all t).

RNNs provide temporal context
[Figure: the unrolled chain applied to x0 ... x4, producing h0 ... h4]
RNN limitations
LSTM
[Figure: a long unrolled RNN chain from x0 through xt+2; early inputs must influence outputs many steps later]
RNN limitations
LSTM
Standard RNN:
[Figure: the repeating module A contains a single tanh layer]
Vanishing Gradients - Two Weird Tricks
● LSTM: “magic” solution to the vanishing gradient problem
● Trick #1: Memory cell carried over time
● Trick #2: Gates that learn to manage the memory
[Figure: the LSTM repeating module, with several σ gates and tanh layers instead of a single tanh]
Long Short Term Memory Networks (LSTM)
[Figure: a chain of LSTM modules; each module contains σ gates, tanh layers, and a cell state passed between steps]
LSTM - Cell State
[Figure: the cell state Ct-1 → Ct runs through the module and is modified by the gate outputs ft, it, ot and the candidate C̃t]
● “Conveyer belt”
● LSTM can “add” or “remove” information to the cell state via gates.
Gates: Optionally Let Information Through
[Figure: a gate formed by a σ layer followed by a pointwise multiplication]
Input Gate and Candidate State
it = σ (Wi · [ht-1, xt] + bi)
C̃t = tanh (WC · [ht-1, xt] + bC)
Update the Cell State
Ct = ft * Ct-1 + it * C̃t
Output Gate
ot = σ (Wo · [ht-1, xt] + bo)
ht = ot * tanh (Ct)
● Output a filtered version of the cell state
● http://karpathy.github.io/2015/05/21/rnn-effectiveness/
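A runnable NumPy sketch of one LSTM step implementing the equations above; it includes the forget gate ft used in the cell-state update, and the sizes and random weights are illustrative only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate state
    c_t = f_t * c_prev + i_t * c_tilde     # Ct = ft * Ct-1 + it * C̃t
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # ht = ot * tanh(Ct)
    return h_t, c_t

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = lambda: rng.normal(scale=0.1, size=(hidden, hidden + inputs))
b = lambda: np.zeros(hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inputs), h, c, W(), b(), W(), b(), W(), b(), W(), b())
print(h, c)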
Apply LSTM to Time
Series data
Agenda
Sequence Models
RNN limitations
LSTM
Feature (and label!) engineering
[Figure: example feature rows such as 5, 20.51, 1, 1.1 and 0.8, -0.51, 2.9, -0.82]
sale_week    | sales
2010-12-26   | 134640
2011-01-02   | 1150000
2011-01-09   | 945000
2011-01-16   | 995000
2011-01-23   | 1150000
Sliding Window to create features and label
Example: Create a feature table, window_size = 3, horizon = 1

Input table:
datetime              value
2018-01-01 0:00:00    0.7713206433
2018-01-02 0:00:00    0.02075194936
2018-01-03 0:00:00    0.6336482349
2018-01-04 0:00:00    0.7488038825
2018-01-05 0:00:00    0.4985070123
2018-01-06 0:00:00    0.2247966455
2018-01-07 0:00:00    0.1980628648
2018-01-08 0:00:00    0.7605307122
2018-01-09 0:00:00    0.1691108366
2018-01-10 0:00:00    0.08833981417

Features, label:
pred_datetime         -3_steps        -2_steps        -1_steps        label
2018-01-04 0:00:00    0.7713206433    0.02075194936   0.6336482349    0.7488038825
2018-01-05 0:00:00    0.02075194936   0.6336482349    0.7488038825    0.4985070123
2018-01-06 0:00:00    0.6336482349    0.7488038825    0.4985070123    0.2247966455
2018-01-07 0:00:00    0.7488038825    0.4985070123    0.2247966455    0.1980628648
2018-01-08 0:00:00    0.4985070123    0.2247966455    0.1980628648    0.7605307122
2018-01-09 0:00:00    0.2247966455    0.1980628648    0.7605307122    0.1691108366
2018-01-10 0:00:00    0.1980628648    0.7605307122    0.1691108366    0.08833981417
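A minimal pandas sketch of the same sliding-window idea; the lab uses time_series.create_rolling_features_label instead, so the function below and its column names are purely illustrative.

import pandas as pd

def rolling_features_label(series, window_size, horizon):
    # Each row: the previous window_size values as features and the value
    # horizon steps after that window as the label.
    frames = {f'-{window_size - i}_steps': series.shift(window_size - i + horizon - 1)
              for i in range(window_size)}
    frames['label'] = series
    return pd.DataFrame(frames).dropna()

values = pd.Series([0.77, 0.02, 0.63, 0.75, 0.50],
                   index=pd.date_range('2018-01-01', periods=5))
print(rolling_features_label(values, window_size=3, horizon=1))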
Create the features and label
import time_series

WINDOW_SIZE = 52 * 1  # 52 weekly observations (one year) per window
HORIZON = 4 * 6       # predict 24 weeks (about six months) ahead

df = time_series.create_rolling_features_label(sales,
                                               window_size=WINDOW_SIZE,
                                               pred_offset=HORIZON)

https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/gcp_forecasting/time_series.py
Date features can provide performance lift
dates = df.index
df = time_series.add_date_features(df, dates)

Sample of the added date feature columns:
155 | 3  | 6 | 2012 | 0
162 | 10 | 6 | 2012 | 0
169 | 17 | 6 | 2012 | 0
176 | 24 | 6 | 2012 | 0
183 | 1  | 7 | 2012 | 1
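A hypothetical sketch of what such date features could look like when derived directly with pandas; the exact columns that add_date_features produces are defined in the linked time_series.py, not here.

import pandas as pd

def add_simple_date_features(df):
    # Illustrative date features only; not the course's add_date_features.
    dates = pd.to_datetime(df.index)
    df = df.copy()
    df['day_of_year'] = dates.dayofyear
    df['day_of_month'] = dates.day
    df['month'] = dates.month
    df['year'] = dates.year
    return df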
Train/test set: split temporally
# Features, label.
X = df.drop('label', axis=1)
y = df['label']
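The slide names a temporal split but only shows the feature/label separation; one minimal way to hold out the most recent observations (the 0.8 ratio is an assumption) is:

# Never shuffle a time series; keep the latest observations for testing.
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]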
"""
Global Baseline Model results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RMSE: 376544.261
MAE: 316352.450
MALR: 0.207 It doesn’t beat my baseline
"""
model
Machine learn: Random Forest
from sklearn.ensemble import RandomForestRegressor

# Train model.
cl = RandomForestRegressor(n_estimators=500,
                           max_features='sqrt', random_state=10, criterion='mse')
cl.fit(X_train, y_train)
pred = cl.predict(X_test)

random_forest_metrics = time_series.Metrics(y_test, pred)
random_forest_metrics.report("Forest Model")
"""
Forest Model results
~~~~~~~~~~~~~~~~~~~~
RMSE: 259388.403
MAE: 202647.688
MALR: 0.125
"""
(Slide annotation: “it’s working”)
Machine learn: using LSTM
Instead of the simple Random Forest model, we can also build an LSTM model on the
same prepared dataset to attempt to increase model performance.
[Figure: the LSTM repeating module with its gates and cell state, as introduced earlier]
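A minimal Keras sketch of an LSTM regressor on the windowed data, assuming X_train holds only the lagged-value columns (one past value per column); in the lab the frame also carries date features, which would need separate handling, so the layer sizes and training settings here are illustrative only.

import numpy as np
import tensorflow as tf

window_size = X_train.shape[1]

# LSTM layers expect 3-D input: (samples, time steps, features per step).
X_train_seq = np.expand_dims(X_train.values, axis=-1)
X_test_seq = np.expand_dims(X_test.values, axis=-1)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window_size, 1)),
    tf.keras.layers.Dense(1)   # single regression output
])
model.compile(loss='mse', optimizer='adam')
model.fit(X_train_seq, y_train.values, epochs=20, batch_size=32, verbose=0)
lstm_pred = model.predict(X_test_seq)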
Lab
Use an LSTM framework to set up a simple Buy/Sell trading model
Lab Objectives
Screencast