01 Module 2 Neural Network Based Reinforcement Learning
Learning Objectives
● TD-Gammon
● Deep Q Networks
○ Memory
○ Code
Agenda
TD-Gammon
Q-table
State | Action 0 | Action 1 | Action 2 | Action 3
0     | 0        | 0        | 0        | 0
1     | 0        | 0        | 0        | 0
2     | 0        | 0        | 0        | 0
3     | 0        | 0        | 0        | 0
Q Tables vs Q Networks
[Figure: the Q-table on the left is approximated by a neural network on the right: Input: State Features → Hidden Layer 1 → Output: TD(λ)]
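To make the comparison concrete, here is a minimal sketch of both representations for a 4-state, 4-action problem, written with TensorFlow/Keras; the layer sizes and variable names are illustrative rather than the course's exact network.

import numpy as np
import tensorflow as tf

n_states, n_actions = 4, 4

# Tabular representation: one stored value per (state, action) pair.
q_table = np.zeros((n_states, n_actions))

# Network representation: a learned function from state features to values.
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(n_states,)),
    tf.keras.layers.Dense(n_actions)  # one output per action
])

# Both answer the same question: how good is each action in this state?
state = tf.one_hot([2], n_states)   # state 2 encoded as a feature vector
print(q_table[2])                   # table lookup
print(q_network(state).numpy())     # network approximation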
Agenda
TD-Gammon
[Figure: the agent-environment loop; the environment returns a state and a reward to the agent]
Deep Q Learning - Training
1 Cycle Returns:
● 1 State
● 1 Chosen Action
● 1 Reward
● 1 State Prime
Train on State
Deep Q Learning - Loss Function
Loss = (y - ŷ)²
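A small sketch of how this squared error is typically formed in deep Q-learning, assuming the standard target y = r + γ · max Q(s′, a′); the gamma value and function names below are illustrative.

import numpy as np

gamma = 0.99  # assumed discount factor

def td_target(reward, next_q_values, done):
    # y = r for terminal steps, otherwise r + gamma * max_a' Q(s', a').
    return reward + gamma * np.max(next_q_values) * (1 - done)

def squared_loss(y, y_hat):
    # Loss = (y - y_hat)^2
    return (y - y_hat) ** 2

# Example: reward 1.0, predicted Q-values for state prime, episode not done.
y = td_target(1.0, np.array([0.2, 0.5, 0.1]), done=0)
print(squared_loss(y, 0.4))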
Agenda
TD-Gammon
[Figure: the agent-environment loop; the environment returns a state and a reward to the agent]
Deep Q Learning - Memory
1 Cycle Returns:
● 1 State
● 1 Chosen Action
● 1 Reward
● 1 State Prime

Memory Buffer
Idx | state | action | reward | state prime
0   | s0    | a0     | r0     | s1
1   | s1    | a1     | r1     | s2
2   | s2    | a2     | r2     | s3
...
Deep Q Learning - Memory
A random sample drawn from the Memory Buffer (e.g., indices 0 and 2):
Idx | state | action | reward | state prime
0   | s0    | a0     | r0     | s1
2   | s2    | a2     | r2     | s3
The Memory Buffer
class Memory():
    def __init__(self, memory_size, batch_size):
        ...

    def sample(self):
        ...
The Memory Buffer
class Memory():
    def __init__(self, memory_size, batch_size):
        self.buffer = deque(maxlen=memory_size)
        self.batch_size = batch_size

    def sample(self):
        ...
The Memory Buffer
from collections import deque

import numpy as np

class Memory():
    def __init__(self, memory_size, batch_size):
        self.buffer = deque(maxlen=memory_size)  # oldest experiences drop off automatically
        self.batch_size = batch_size

    def sample(self):
        # Draw a random mini-batch of experiences without replacement.
        buffer_size = len(self.buffer)
        index = np.random.choice(
            np.arange(buffer_size), size=self.batch_size, replace=False)
        batch = [self.buffer[i] for i in index]
        return batch
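A brief usage sketch, assuming experiences are stored as (state, action, reward, state_prime) tuples; the slides do not show the method that appends them, so this example writes to the deque directly.

memory = Memory(memory_size=10000, batch_size=2)

# Store a few illustrative experiences.
memory.buffer.append(('s0', 'a0', 0.0, 's1'))
memory.buffer.append(('s1', 'a1', 1.0, 's2'))
memory.buffer.append(('s2', 'a2', 0.0, 's3'))

print(memory.sample())  # e.g., [('s2', 'a2', 0.0, 's3'), ('s0', 'a0', 0.0, 's1')]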
Experience Replay
Agenda
TD-Gammon
Deep Q Learning - Network (advanced)
Training: the actions_input mask is one-hot over the chosen action (e.g., 0 0 1 0), so the mse loss only sees that action's Q-value.

    hidden_1 = Dense(hidden_neurons, activation='relu')(state_input)
    hidden_2 = Dense(hidden_neurons, activation='relu')(hidden_1)
    q_values = Dense(action_size)(hidden_2)
    masked_q_values = Multiply()([q_values, actions_input])

    model = Model(inputs=[state_input, actions_input], outputs=masked_q_values)
    optimizer = tf.keras.optimizers.RMSprop(lr=learning_rate)
    model.compile(loss='mse', optimizer=optimizer)
    return model
Deep Q Learning - Network (advanced)
Predicting: the actions_input mask is all ones (e.g., 1 1 1 1), so the model returns every action's Q-value.

def deep_q_network(state_shape, action_size, learning_rate, hidden_neurons):
    state_input = Input(state_shape, name='frames')
    actions_input = Input((action_size,), name='mask')
    hidden_1 = Dense(hidden_neurons, activation='relu')(state_input)
    hidden_2 = Dense(hidden_neurons, activation='relu')(hidden_1)
    q_values = Dense(action_size)(hidden_2)
    masked_q_values = Multiply()([q_values, actions_input])

    model = Model(inputs=[state_input, actions_input], outputs=masked_q_values)
    optimizer = tf.keras.optimizers.RMSprop(lr=learning_rate)
    model.compile(loss='mse', optimizer=optimizer)
    return model
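A short usage sketch of the two masks, assuming a model built by deep_q_network with state_shape=(4,) and action_size=4; the array shapes, hyperparameters, and target value are illustrative.

import numpy as np

action_size = 4
model = deep_q_network((4,), action_size, learning_rate=0.00025, hidden_neurons=32)
state_batch = np.random.rand(1, 4).astype(np.float32)

# Predicting: an all-ones mask returns every Q-value; pick the best action.
all_actions = np.ones((1, action_size), dtype=np.float32)
q = model.predict([state_batch, all_actions])
best_action = int(np.argmax(q[0]))

# Training: a one-hot mask so the mse loss only touches the chosen action.
one_hot = np.zeros((1, action_size), dtype=np.float32)
one_hot[0, best_action] = 1.0
target = np.zeros((1, action_size), dtype=np.float32)
target[0, best_action] = 1.5   # e.g., r + gamma * max Q(s', a')
model.train_on_batch([state_batch, one_hot], target)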
Deep Q Learning - The Act Function
def act(self, state, training=False):
    if training:
        # Random actions until enough simulations to train the model.
        if len(self.memory.buffer) >= self.memory.batch_size:
            self.random_rate *= self.random_decay
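The slide shows only the decay of the exploration rate; a minimal epsilon-greedy completion might look like the sketch below. The attribute names random_rate, random_decay, and memory come from the slides; self.action_size and self.network are assumptions.

import random

import numpy as np

def act(self, state, training=False):
    if training:
        # Random actions until enough simulations to train the model.
        if len(self.memory.buffer) >= self.memory.batch_size:
            self.random_rate *= self.random_decay
        if random.random() < self.random_rate:
            return random.randint(0, self.action_size - 1)  # explore
    # Exploit: an all-ones mask returns every Q-value; act greedily.
    mask = np.ones((1, self.action_size), dtype=np.float32)
    q_values = self.network.predict([np.expand_dims(state, 0), mask])
    return int(np.argmax(q_values[0]))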
Screencast
Policy Gradients
Agenda
Policy Gradients
Actor - Critic
Deep Q vs Policy Gradients
Deep Q Network: maps state properties to one Q-value per action.
Policy Gradient: maps state properties to a probability for each action (e.g., .4 .3 .3).
Policy Gradients - Loss
The network takes the state properties as input; the chosen action is one-hot encoded (e.g., 0 0 1) for the loss.
Policy Gradients - Loss
Δw = α ∇w πw(a*, s)

Dividing by the probability of the chosen action:
Δw = α ∇w πw(a*, s) / πw(a*, s)

which is the gradient of the log-likelihood:
Δw = α ∇w log(πw(a*, s))

Weighting each update by the discounted return Gt:
Δw = α ∇w log(πw(a, s)) · Gt
Policy Gradients - Loss
Δw = α ∇w log(πw(a, s)) · Gt

from tensorflow.keras import backend as K

def custom_loss(y_true, y_pred):
    # y_true: one-hot chosen actions; y_pred: the policy's action probabilities.
    y_pred_clipped = K.clip(y_pred, 1e-8, 1 - 1e-8)
    log_likelihood = y_true * K.log(y_pred_clipped)
    return K.sum(-log_likelihood * g)  # g: the discounted-return input G
Policy Gradients - Network
def build_networks(state_shape, action_size, learning_rate, hidden_neurons):
    state_input = Input(state_shape, name='frames')
    g = Input((1,), name='G')
    hidden_1 = Dense(hidden_neurons, activation='relu')(state_input)
    hidden_2 = Dense(hidden_neurons, activation='relu')(hidden_1)
    probabilities = Dense(action_size, activation='softmax')(hidden_2)
    policy = Model(
        inputs=[state_input, g], outputs=[probabilities])
    optimizer = Adam(lr=learning_rate)
    policy.compile(loss=custom_loss, optimizer=optimizer)
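Since custom_loss uses the G input tensor, one plausible arrangement (an assumption about code the slides do not show) is to define the loss inside build_networks so it closes over g:

# Hypothetical sketch (not shown on the slides).
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input

def build_networks(state_shape, action_size, learning_rate, hidden_neurons):
    state_input = Input(state_shape, name='frames')
    g = Input((1,), name='G')

    def custom_loss(y_true, y_pred):
        y_pred_clipped = K.clip(y_pred, 1e-8, 1 - 1e-8)
        log_likelihood = y_true * K.log(y_pred_clipped)
        return K.sum(-log_likelihood * g)

    # ... build the hidden layers and compile the policy with loss=custom_loss ...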
def sample(self):
    batch = np.array(self.buffer).T.tolist()
    states_mb = np.array(batch[0], dtype=np.float32)
    actions_mb = np.array(batch[1], dtype=np.int8)
    rewards_mb = np.array(batch[2], dtype=np.float32)
    self.buffer = []  # clear the buffer; each experience is used only once
    return states_mb, actions_mb, rewards_mb
Policy Gradients - Training
def learn(self):
    """Trains the policy network based on stored experiences."""
    # Obtain the episode's experiences from memory.
    state_mb, action_mb, reward_mb = self.memory.sample()
    actions = tf.one_hot(action_mb, self.action_size)
    # Normalized TD(1): discounted returns, then standardized.
    discount_mb = np.zeros_like(reward_mb)
    total_rewards = 0
    for t in reversed(range(len(reward_mb))):
        total_rewards = reward_mb[t] + total_rewards * self.memory.gamma
        discount_mb[t] = total_rewards
    discount_mb = (discount_mb - np.mean(discount_mb)) / np.std(discount_mb)
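A small standalone sketch of the same discounting loop with an assumed gamma, to make the "Normalized TD(1)" computation concrete.

import numpy as np

def discounted_returns(rewards, gamma=0.9):
    # G_t = r_t + gamma * G_{t+1}, computed from the end of the episode backwards.
    returns = np.zeros_like(rewards, dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns(np.array([0.0, 0.0, 1.0])))  # [0.81, 0.9, 1.0]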
Actor - Critic
Breaking Down Q
[Figure: breaking Q down, starting from the state inputs]
A2C - Network
def build_networks(state_shape, action_size, actor_lr, critic_lr, neurons):
    state_input = layers.Input(state_shape, name='frames')
    advantage = layers.Input((1,), name='A')  # Now A instead of G.

# Apply TD(0).
discount_mb = reward_mb + next_v_mb * self.memory.gamma * (1 - dones_mb)
state_values = self.critic.predict([state_mb])
advantages = discount_mb - np.squeeze(state_values)
self.actor.train_on_batch([state_mb, advantages], [action_mb, discount_mb])
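A tiny numeric sketch of the TD(0) target and advantage computation above, with assumed values for gamma and the critic's predictions:

import numpy as np

gamma = 0.99                                            # assumed discount factor
reward_mb = np.array([1.0, 0.0], dtype=np.float32)
next_v_mb = np.array([0.5, 0.2], dtype=np.float32)      # critic's value of s'
dones_mb = np.array([0.0, 1.0], dtype=np.float32)       # second transition ends the episode
state_values = np.array([0.4, 0.1], dtype=np.float32)   # critic's value of s

# TD(0) target: r + gamma * V(s') for non-terminal steps, just r otherwise.
discount_mb = reward_mb + next_v_mb * gamma * (1 - dones_mb)
advantages = discount_mb - state_values
print(discount_mb)   # [1.495, 0.0]
print(advantages)    # [1.095, -0.1]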
Lab
Use Reinforcement Learning in Trading
Lab Objectives
Screencast
What is LSTM?
Daniel Sparing
Machine Learning Solutions Engineer
Google Cloud
Agenda
Sequence Models
RNN limitations
LSTM
Feed Forward Networks
[Figure: input layer → hidden layers → output layer]
Inference is stateless.
Embeddings are aggregated using sum or average. This is essentially the “bag-of-words” approach.
[Figure: Embedding layer followed by an Aggregation layer]
Let’s wrap a DNN in a for loop!
RNNs: Networks with Loops
● A: a subgraph of the NN
● x(t): RNN input at time t
● h(t): RNN state at time t
● h(t) = (hidden state, output)

for t in range(len(x)):
    h_next = A(x[t], h[t-1].hidden)
    h.append(h_next)
loss = sum([loss_fn(y) for y in h.output])

Parts of this slide and the following slides are building on ideas and visualizations from
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
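A runnable NumPy sketch of the same idea, assuming a single tanh recurrence; the weights and sizes are illustrative, not the course's model.

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

# Shared weights: the same A is applied at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

x = rng.normal(size=(seq_len, input_size))
h = np.zeros(hidden_size)
for t in range(seq_len):
    # h_t depends on the current input and the previous hidden state.
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)
print(h)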
Unrolled Recurrent Neural Networks
[Figure: the loop unrolled; the same module A is applied to xt at every step, producing ht]
Secret sauce:
● Tie (share) weights of A for all t.
● Backprop updates same weights for all t
  (sum gradients from all t).

RNNs provide temporal context
[Figure: the unrolled chain applied to x0 ... x4, producing h0 ... h4]
RNN limitations
LSTM
[Figure: a long unrolled RNN chain from x0 through xt+2; early inputs must influence outputs many steps later]
RNN limitations
LSTM
Standard RNN:
[Figure: the repeating module A contains a single tanh layer]
Vanishing Gradients - Two Weird Tricks
● LSTM: “magic” solution to the vanishing gradient problem
● Trick #1: Memory cell carried over time
● Trick #2: Gates that learn to manage the memory
[Figure: the LSTM repeating module, with several σ gates and tanh layers instead of a single tanh]
Long Short Term Memory Networks (LSTM)
[Figure: a chain of LSTM modules; each module contains σ gates, tanh layers, and a cell state passed between steps]
LSTM - Cell State
[Figure: the cell state Ct-1 → Ct runs through the module and is modified by the gate outputs ft, it, ot and the candidate C̃t]
● “Conveyer belt”
● LSTM can “add” or “remove” information to the cell state via gates.
Gates: Optionally Let Information Through
[Figure: a gate formed by a σ layer followed by a pointwise multiplication]
Input Gate and Candidate State
it = σ (Wi · [ht-1, xt] + bi)
C̃t = tanh (WC · [ht-1, xt] + bC)
Update the Cell State
Ct = ft * Ct-1 + it * C̃t
Output Gate
ot = σ (Wo · [ht-1, xt] + bo)
ht = ot * tanh (Ct)
● Output a filtered version of the cell state
● http://karpathy.github.io/2015/05/21/rnn-effectiveness/
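A runnable NumPy sketch of one LSTM step implementing the equations above; it includes the forget gate ft used in the cell-state update, and the sizes and random weights are illustrative only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_c @ z + b_c)       # candidate state
    c_t = f_t * c_prev + i_t * c_tilde     # Ct = ft * Ct-1 + it * C̃t
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # ht = ot * tanh(Ct)
    return h_t, c_t

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = lambda: rng.normal(scale=0.1, size=(hidden, hidden + inputs))
b = lambda: np.zeros(hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inputs), h, c, W(), b(), W(), b(), W(), b(), W(), b())
print(h, c)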
Apply LSTM to Time
Series data
Agenda
Sequence Models
RNN limitations
LSTM
Feature (and label!) engineering
[Figure: example feature rows such as 5, 20.51, 1, 1.1 and 0.8, -0.51, 2.9, -0.82]
sale_week    | sales
2010-12-26   | 134640
2011-01-02   | 1150000
2011-01-09   | 945000
2011-01-16   | 995000
2011-01-23   | 1150000
Sliding Window to create features and label
Example: Create a feature table, window_size = 3, horizon = 1

Input table:
datetime              value
2018-01-01 0:00:00    0.7713206433
2018-01-02 0:00:00    0.02075194936
2018-01-03 0:00:00    0.6336482349
2018-01-04 0:00:00    0.7488038825
2018-01-05 0:00:00    0.4985070123
2018-01-06 0:00:00    0.2247966455
2018-01-07 0:00:00    0.1980628648
2018-01-08 0:00:00    0.7605307122
2018-01-09 0:00:00    0.1691108366
2018-01-10 0:00:00    0.08833981417

Features, label:
pred_datetime         -3_steps        -2_steps        -1_steps        label
2018-01-04 0:00:00    0.7713206433    0.02075194936   0.6336482349    0.7488038825
2018-01-05 0:00:00    0.02075194936   0.6336482349    0.7488038825    0.4985070123
2018-01-06 0:00:00    0.6336482349    0.7488038825    0.4985070123    0.2247966455
2018-01-07 0:00:00    0.7488038825    0.4985070123    0.2247966455    0.1980628648
2018-01-08 0:00:00    0.4985070123    0.2247966455    0.1980628648    0.7605307122
2018-01-09 0:00:00    0.2247966455    0.1980628648    0.7605307122    0.1691108366
2018-01-10 0:00:00    0.1980628648    0.7605307122    0.1691108366    0.08833981417
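A minimal pandas sketch of the same sliding-window idea; the lab uses time_series.create_rolling_features_label instead, so the function below and its column names are purely illustrative.

import pandas as pd

def rolling_features_label(series, window_size, horizon):
    # Each row: the previous window_size values as features and the value
    # horizon steps after that window as the label.
    frames = {f'-{window_size - i}_steps': series.shift(window_size - i + horizon - 1)
              for i in range(window_size)}
    frames['label'] = series
    return pd.DataFrame(frames).dropna()

values = pd.Series([0.77, 0.02, 0.63, 0.75, 0.50],
                   index=pd.date_range('2018-01-01', periods=5))
print(rolling_features_label(values, window_size=3, horizon=1))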
Create the features and label
import time_series

WINDOW_SIZE = 52 * 1  # 52 weekly observations (one year) per window
HORIZON = 4 * 6       # predict 24 weeks (about six months) ahead

df = time_series.create_rolling_features_label(sales,
                                               window_size=WINDOW_SIZE,
                                               pred_offset=HORIZON)

https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/gcp_forecasting/time_series.py
Date features can provide performance lift
dates = df.index
df = time_series.add_date_features(df, dates)

Sample of the added date feature columns:
155 | 3  | 6 | 2012 | 0
162 | 10 | 6 | 2012 | 0
169 | 17 | 6 | 2012 | 0
176 | 24 | 6 | 2012 | 0
183 | 1  | 7 | 2012 | 1
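A hypothetical sketch of what such date features could look like when derived directly with pandas; the exact columns that add_date_features produces are defined in the linked time_series.py, not here.

import pandas as pd

def add_simple_date_features(df):
    # Illustrative date features only; not the course's add_date_features.
    dates = pd.to_datetime(df.index)
    df = df.copy()
    df['day_of_year'] = dates.dayofyear
    df['day_of_month'] = dates.day
    df['month'] = dates.month
    df['year'] = dates.year
    return df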
Train/test set: split temporally
# Features, label.
X = df.drop('label', axis=1)
y = df['label']
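The slide names a temporal split but only shows the feature/label separation; one minimal way to hold out the most recent observations (the 0.8 ratio is an assumption) is:

# Never shuffle a time series; keep the latest observations for testing.
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]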
"""
Global Baseline Model results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RMSE: 376544.261
MAE: 316352.450
MALR: 0.207 It doesn’t beat my baseline
"""
model
Machine learn: Random Forest
from sklearn.ensemble import RandomForestRegressor

# Train model.
cl = RandomForestRegressor(n_estimators=500,
                           max_features='sqrt', random_state=10, criterion='mse')
cl.fit(X_train, y_train)
pred = cl.predict(X_test)

random_forest_metrics = time_series.Metrics(y_test, pred)
random_forest_metrics.report("Forest Model")
"""
Forest Model results
~~~~~~~~~~~~~~~~~~~~
RMSE: 259388.403
MAE: 202647.688
MALR: 0.125
"""
(Slide annotation: “it’s working”)
Machine learn: using LSTM
Instead of the simple Random Forest model, we can also build an LSTM model on the
same prepared dataset to attempt to increase model performance.
[Figure: the LSTM repeating module with its gates and cell state, as introduced earlier]
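A minimal Keras sketch of an LSTM regressor on the windowed data, assuming X_train holds only the lagged-value columns (one past value per column); in the lab the frame also carries date features, which would need separate handling, so the layer sizes and training settings here are illustrative only.

import numpy as np
import tensorflow as tf

window_size = X_train.shape[1]

# LSTM layers expect 3-D input: (samples, time steps, features per step).
X_train_seq = np.expand_dims(X_train.values, axis=-1)
X_test_seq = np.expand_dims(X_test.values, axis=-1)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window_size, 1)),
    tf.keras.layers.Dense(1)   # single regression output
])
model.compile(loss='mse', optimizer='adam')
model.fit(X_train_seq, y_train.values, epochs=20, batch_size=32, verbose=0)
lstm_pred = model.predict(X_test_seq)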
Lab
Use an LSTM framework to set up a simple Buy/Sell trading model
Lab Objectives
Screencast