
Tutorial: NumPy deep reinforcement learning with Pong from pixels #35


Merged: 23 commits into numpy:main on Mar 15, 2021
Conversation

@8bitmp3 (Contributor) commented Oct 26, 2020

Hi @melissawm @mattip @bjnath 👋

This tutorial demonstrates how to implement a deep reinforcement learning (RL) agent from scratch using a policy gradient method that learns to play the Pong video game using screen pixels as inputs with NumPy. Your Pong agent will obtain experience on the go using an artificial neural network as its policy.

This example is based on the code developed by Andrej Karpathy for the Deep RL Bootcamp in 2017 at UC Berkeley.

Table of contents

  • A note on RL and deep RL <- research included here, it's a complex field, helping new users
  • Deep RL glossary <- helping new users
  • About policy gradients <- research included here
  1. Set up Pong
  • (Optional) Enable video playback in a notebook <- Colaboratory and Binder support
  2. Preprocess frames (the observation)
  3. Create the policy (the neural network)
  4. Define the discounted rewards function
  5. Train the agent <- a very long section because everything is from scratch
  6. Next steps <- research included here

All feedback welcome. Thank you all!


@8bitmp3 (Contributor, Author) commented Oct 26, 2020

@melissawm env = gym.make("Pong-v0") (the Pong game environment in Gym) is defined and the notebook runs fine, but the CI test fails on the env.reset() call because it reports that env isn't "defined" or something.

Note that reset() is part of gym, which performs all the game magic (the other dependency is numpy) 🤔 FYI, the user has to install gym in one of the first steps to make everything work.

For context, here's how env.reset() is used in a simple example (http://gym.openai.com):

  • You call it at the beginning of the training loop and at the end of each training episode.

[Screenshot: example code from gym.openai.com]
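
For reference, a minimal sketch of that kind of loop, assuming the classic (pre-0.26) gym API where env.step() returns four values; the variable names below are only illustrative:

import gym

env = gym.make("Pong-v0")
observation = env.reset()          # start the first episode and get the initial screen frame

for _ in range(1000):
    action = env.action_space.sample()                   # random action, for illustration only
    observation, reward, done, info = env.step(action)   # classic gym API: four return values
    if done:                                             # the episode (one game of Pong) is over
        observation = env.reset()                        # reset before starting the next episode

env.close()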

@8bitmp3 (Contributor, Author) commented Oct 26, 2020

After a few thousand episodes of training today, the agent scored 2 points but lost 21:2:

[Screenshot: training output showing the final score]

🏓🤖

@melissawm (Member) commented:

Don't worry, @8bitmp3, this is because we need to add gym as a dependency for the repo so the CI picks it up. For now it doesn't matter; we'll do that later if it's alright with you.

@melissawm (Member) commented:

Hi @8bitmp3! Just to let you know - I'm reviewing this, but I feel like it's going to require some reworking to get right. Because it's a complicated subject, I'm trying to figure out ways to make it simpler and maybe reorganize some things. I'll let you know as soon as I've reached a nice balance :)

@melissawm (Member) commented:

Hello, @8bitmp3! I'm finally back - sorry it took me so long. I took my time partly because I wanted to read this carefully, and I have to say it is a really cool project :)

Here's a slightly modified version of the tutorial.

Again, these are just suggestions - see if they make sense to you. The code apparently works: I've tested it against your original and get very similar results.

A few points:

  • I have moved some of the explanations to an appendix, trying to keep it simple at least in the beginning but leaving some information for advanced users.
  • I've added a maximum number of episodes because a distracted user may hit "Run all cells" in that notebook and not realize that there is an infinite loop. Also, this is something we have to think about: it makes no sense to add this cell to our CI here in the repo, since it will take up too many resources.
  • I've tried to simplify the glossary, but I'm not sure if I've missed something. For example, it was not immediately clear to me how the action-value, expected return and discounted return functions were all connected. Hope I got it right!
  • Under "Preprocess frames", one of the code cells has the comment "Remove the background and apply other enhancements". What are those other enhancements?
  • Under "Create the policy", one code comment says "Apply the sigmoid function for non-linearity"; what does it mean to "add non-linearity", and why is that important?
  • Under "Train the agent", you mention that for every batch of episodes you must "Compute the cumulative reward and discount it to present." I may be missing something, but what does "discount to present" mean?
  • I didn't understand the role of the render variable; it seems to me like it's always False, and I'm not sure if we should keep it. In fact, while I really like the idea of playing the video in the notebook, it doesn't seem feasible at this point (involves installation instructions which won't be available for all users and only applies to Google Colab). I'm not sure I'd keep it, so I have moved it to the end of the document.

Last thing: it would be very helpful to note some expected values for each episode, because the "-21" values are not very encouraging and don't seem like the right answer if you're not paying attention to the expected results :)

I hope this all makes sense and we'll certainly need to do a couple more passes to get it right, but that's a really interesting tutorial, so thanks again! Please reach out if you have any questions.

@8bitmp3 (Contributor, Author) commented Nov 24, 2020

Thank you again @melissawm for the awesome feedback! 🥳

I have moved some of the explanations to an appendix, trying to keep it simple at least in the beginning but leaving some information for advanced users.

I've added a maximum number of episodes because a distracted user may hit "Run all cells" in that notebook and not realize that there is an infinite loop. Also, this is something we have to think about: it makes no sense to add this cell to our CI here in the repo, since it will take up too many resources.

  • That's a great idea. I suggest keeping it at 100 (😬) for demo purposes and advising the reader to increase it if they have enough computing power. 100 is very low, but it probably won't crash anything during testing. Also, a free Colab session with a GPU can sometimes freeze with the process running in the background if you train an RL algorithm for a long time. The need for a high number of episodes (sample inefficiency) is covered in a section I wrote at the very end.
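
(For illustration, a rough sketch of what capping the loop might look like; max_episodes and episode are hypothetical names, not necessarily the ones used in the tutorial:)

max_episodes = 100      # low default for demos and CI; increase it if you have the compute
episode = 0

while episode < max_episodes:    # instead of an unbounded `while True:` training loop
    # ...play one episode, collecting observations, actions and rewards,
    # ...then compute the policy gradient and update the network weights
    episode += 1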

I've tried to simplify the glossary, but I'm not sure if I've missed something. For example, it was not immediate clear to me how the action-value, expected return and discounted return functions were all connected. Hope I got it right!

  • OK! I moved parts of the policy-gradient explanations to the glossary.
  • I also moved the sentence about the "cumulative reward function" to the "cumulative rewards" section to minimize confusion. I'd like to keep that sentence, as well as the one that talks about using a discount factor (for discounting rewards).

Under "Preprocess frames", one of the code cells mentions in a comment Remove the background and apply other enhancements.. What are those other enhancements?

  • Good catch. The first two steps remove the background and the last one puts an emphasis on 🎾 / 🏓. I added it in the comments (from the original code) to clarify this.
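
For context, here's a sketch of that preprocessing in the spirit of Karpathy's original prepro function that the tutorial is based on; the exact crop offsets and pixel values are taken from that code and may not match the tutorial exactly:

import numpy as np

def preprocess_frame(frame):
    """Turn a 210x160x3 uint8 Pong frame into an 80x80 binary float vector."""
    frame = frame[35:195]        # crop out the scoreboard and borders
    frame = frame[::2, ::2, 0]   # downsample by a factor of 2 and keep a single color channel
    frame[frame == 144] = 0      # remove the background (type 1)
    frame[frame == 109] = 0      # remove the background (type 2)
    frame[frame != 0] = 1        # set the paddles and the ball to 1 -- the "emphasis" step
    return frame.astype(np.float64).ravel()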

Under "Create the policy", one code comment says Apply the sigmoid function for non-linearity; what does it mean to "add nonlinearity" and why is that important?

  • OK! There's a section that talks about non-linear activations, and I changed a few things that will hopefully make it clearer now 😄 (e.g. "(non-linear activation)" instead of "nonlinearity").
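
To show where the non-linear activations sit, here is a rough sketch of a two-layer policy forward pass in that style (policy_forward, W1 and W2 are placeholder names; the tutorial's actual code may differ):

import numpy as np

def sigmoid(x):
    # Squash any real number into (0, 1) so the output can be read as a probability
    return 1.0 / (1.0 + np.exp(-x))

def policy_forward(x, W1, W2):
    h = np.dot(W1, x)
    h[h < 0] = 0                 # ReLU: the hidden layer's non-linear activation
    logit = np.dot(W2, h)
    prob_up = sigmoid(logit)     # probability of moving the paddle UP
    return prob_up, h            # h is kept for backpropagation

Without the non-linear activations, the two matrix multiplications would collapse into a single linear transformation, so the network could only ever represent a linear policy.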

Under "Train the agent", you mention that for every batch of episodes you must "Compute the cumulative reward and discount it to present." I may be missing something, but what does "discount to present" mean?

  • OK! I rephrased for clarity ("Compute the cumulative return and, to provide more weight to shorter-term rewards versus the longer-term ones, use a discount factor discount.")
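
As a rough illustration of that idea, here's a sketch in the style of Karpathy's discount_rewards function (the function and argument names, including gamma, are placeholders; the tutorial's own variable appears to be called discount):

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Walk backwards through the rewards so that each step's return is its own reward
    plus the discounted sum of everything that follows it."""
    discounted = np.zeros_like(rewards, dtype=np.float64)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running_add = 0.0              # Pong-specific: a non-zero reward ends a point
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted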

I didn't understand the role of the render variable; it seems to me like it's always False, and I'm not sure if we should keep it. In fact, while I really like the idea of playing the video in the notebook, it doesn't seem feasible at this point (involves installation instructions which won't be available for all users and only applies to Google Colab). I'm not sure I'd keep it, so I have moved it to the end of the document.

  • I rephrased one of the steps to "3. Set the game rendering default variable for Gym's render method (it is used to display the observation and is optional but can be useful during debugging):" - I hope this provides more clarity.
  • Putting the "how to view videos of the gameplay" section at the end of the doc is a good idea 👍 I embedded the instructions and the code block into the same cell in Markdown since it's optional.
  • Plus, that means you don't have to add any more dependencies for the CI to pass 😅. WDYT? @melissawm

Last thing: it would be very helpful to note some expected values for each episode, because the "-21" values are not very encouraging and don't seem to be the right answer if you're not paying attentions to the expected results :)

  • I think the scores are expected to have high variance ("jumping around" 🎃), as in any RL scenario: deep RL isn't nearly as "stable" as "classic" deep learning, I think, because the "datasets" (the state/observation, action and reward trajectories) are newly generated at every episode.
  • Hopefully, the agent improves its scores in the long run. In a lot of experiments, the training takes many thousands of episodes and, sometimes, days or weeks.
  • Re: "-21": the rules of Pong—as mentioned in the beginning of the tutorial—say that "if a player reaches 21 points, they win", so I hope the readers understand the output. WDYT?

DONE:

  • Add a frame preview example using Matplotlib
  • To help the readers understand the steps, provide a detailed diagram

@melissawm PTAL thanks! 👍

@melissawm (Member) commented:

Hi @8bitmp3 , thanks for the explanations! Yes, they do make sense. I'll do a thorough re-read now.

@melissawm (Member) commented:

Overall I think this is great. I like the subject and content and it feels interesting. My only remaining concern is the length. I don't think having extra content in the bottom of the document should be a problem, though - to me it makes sense and I'm the sort of person who would like to read that extra bit of info :) Others may disagree, though.

Last comment: it would be nice to follow @bjnath's template at least for the first part of the document (What you'll learn, What you'll do...) because it makes the content we have here in the repo cohesive and the users know what to expect.

Thanks again!

@8bitmp3 (Contributor, Author) commented Nov 27, 2020

My only remaining concern is the length. I don't think having extra content in the bottom of the document should be a problem, though - to me it makes sense and I'm the sort of person who would like to read that extra bit of info

Thanks for all the awesome feedback @melissawm 😃 Really appreciate it.

"...teaching RL is hard, and there are so many ways for teaching deep RL to go wrong" - from the foreword in the Grokking Deep RL book (the book uses PyTorch).

This tutorial attempts to explain the ins and outs of the "vanilla" policy gradient method in-depth using mostly NumPy. And, given all the background literature—including research papers and books—that I 🔍 scanned through in preparation for this tutorial, I think I've minimized the need for extra googling for readers (I hope, at least).

Also, this tutorial is something that I wish I'd come across earlier when researching (googling) this topic. And, on top of it all, we aren't using a library/framework like TensorFlow or PyTorch, which RL researchers typically rely on and which makes a bunch of the steps much easier to write.

But this is NumPy from scratch 👍 If you want to learn something in-depth, teach it and/or do it in NumPy 🤗 (I think those are @iamtrask's words.)


template at least for the first part of the document (What you'll learn, What you'll do...) because it makes the content we have here in the repo cohesive and the users know what to expect.

Ok! I think the first paragraph and the table of contents cover most of this—I tried following Ben's structure like in my other tutorial:

  1. What you'll learn:

"This tutorial demonstrates how to implement a deep reinforcement learning (RL) agent from scratch using a policy gradient method that learns to play the Pong video game using screen pixels as inputs with NumPy. Your Pong agent will obtain experience on the go using an artificial neural network as its policy."

  2. What you'll do:

Table of contents

...

  1. Set up Pong
  2. Preprocess frames (the observation)
  3. Create the policy (the neural network) and the forward pass
  4. Set up the update step (backpropagation)
  5. Define the discounted rewards (expected return) function
  6. Train the agent for 100 episodes
  7. Next steps
  8. Appendix
    • Notes on RL and deep RL
    • How to set up video playback in your Jupyter notebook

I'll try to think of some ways to enhance the intro! 👍

@melissawm (Member) commented:

Thanks, @8bitmp3 ! I think we're getting there! I don't see any further issues right now. When we feel it's ready I'll merge and convert to the new repo format. Cheers! 🎉

@melissawm (Member) commented:

@8bitmp3: I just pushed a commit to this PR updating the file to match the .md format. I also added a note about reducing the number of training steps because of our CI. Please let me know if this makes sense. If you want to open the .md file as a notebook, you just need to install jupytext as a Python package and use either classic Jupyter or JupyterLab to open the Markdown file "as a notebook" (there are different options depending on the interface you are using, but it's the same idea).

I'll also need to add gym[atari] to our environment.yml but I just wanted to check with you first. Thanks!

@8bitmp3 (Contributor, Author) commented Jan 21, 2021

Looks good, @melissawm, thank you. I also found a repetition ("First, First") and updated the diagram (one of the arrows should be pointing to the outer layer):

[Screenshot: the updated diagram]

@melissawm (Member) commented:

Great! I think all that is left is to fix the README so this document is listed there, and to fix environment.yml for the gym dependency - I couldn't make this work with conda, so we have to add a pip dependency for gym[atari]. Can you do that, @8bitmp3? Then we'll finally merge! 🎉
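
For reference, a hypothetical sketch of what that environment.yml change could look like (every package name except gym[atari] is a placeholder; the repo's actual file will differ):

name: numpy-tutorials
channels:
  - conda-forge
dependencies:
  - numpy
  - matplotlib
  - pip
  - pip:              # conda couldn't resolve gym[atari], so it goes in as a pip dependency
    - gym[atari]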

@8bitmp3 (Contributor, Author) commented Feb 15, 2021

@melissawm I've updated the YAML file and README but I'm getting merge conflicts which I can't resolve 🤔

@melissawm (Member) commented:

Hi @8bitmp3! I think I solved it, although I'm not sure this was the right approach (I was not expecting gh to redo all the commits like that...). I may need help here - @rossbar, would you mind letting me know if this makes sense?

@rossbar (Collaborator) commented Feb 17, 2021

Hmm, yeah, the fact that the author for all of the commits has been redone is surprising to me too (though at least @8bitmp3 is preserved as the actual author, so it's not wrong). Maybe the author was modified during a rebase? Either way, this looks fine to me :)

@8bitmp3 (Contributor, Author) commented Feb 17, 2021

Thank you. As long as some people find this tutorial useful -> 👍 @rossbar @melissawm

@rossbar (Collaborator) commented Feb 17, 2021

As long as some people find this tutorial useful

Of that I have no doubt! 🎉

@melissawm (Member) commented:

So, just to clarify: I checked out this PR using gh, fixed the merge conflicts using a rebase, and pushed it. I can undo that, but I honestly don't know of a different approach to fixing this - is there one?

@rossbar (Collaborator) commented Feb 17, 2021

No, that's definitely what I do in this situation as well; I guess I've just never paid attention to what that does to the committer/author bubble icons on GitHub after doing so.

@melissawm (Member) commented:

There are also a couple of things that crept in during the reformatting (I think). @8bitmp3, would you be willing to fix them? I'll mark them, and you can include this typo fix in the same commit. Thanks!

@melissawm (Member) left a review comment:

After that, you also need to add your new document to site/index.md, and Sphinx should be happy.

@8bitmp3 (Contributor, Author) commented Mar 2, 2021

After that, you also need to add your new document to site/index.md, and Sphinx should be happy.

Anything to keep Sphinx happy. @melissawm, am I doing this right? 😃 Here's the diff:

---
maxdepth: 1
---

content/cs231_tutorial
content/tutorial-svd
content/mooreslaw-tutorial
content/save-load-arrays
content/tutorial-deep-learning-on-mnist
+ content/tutorial-deep-reinforcement-learning-with-pong-from-pixels
content/tutorial-x-ray-image-processing

@melissawm (Member) commented:

Ah! I got it - the MyST parser apparently doesn't like the dollar sign inside the code block. I tried escaping it, but it ends up throwing errors no matter what I try. I tried all the documented options, but something seems to go wrong and I don't know where. If you have ideas, that would be great!

Also, I noticed that when building locally I was not seeing the images; it turns out you need to give a different path. So in both places where the PNG image shows up, you should have

<center><img src="../../../content/tutorial-deep-reinforcement-learning-with-pong-from-pixels.png" width="800", hspace="20" vspace="20"></center>

@8bitmp3 (Contributor, Author) commented Mar 4, 2021

@melissawm Updated 2x to

<center><img src="../../../content/tutorial-deep-reinforcement-learning-with-pong-from-pixels.png" width="800", hspace="20" vspace="20"></center>

Let me know if this works. Note that the image may not render on github.com.

Base automatically changed from master to main March 6, 2021 11:39
@8bitmp3 (Contributor, Author) commented Mar 15, 2021

@melissawm 👍

@8bitmp3 (Contributor, Author) commented Mar 15, 2021

Thanks @melissawm

@melissawm merged commit 144f46f into numpy:main on Mar 15, 2021
@melissawm (Member) commented:

Thank you, @8bitmp3 ! 🎉
