
Tutorial: NumPy deep reinforcement learning with Pong from pixels #35


Merged: 23 commits into numpy:main on Mar 15, 2021
Conversation

@8bitmp3 (Contributor) commented Oct 26, 2020

Hi @melissawm @mattip @bjnath 👋

This tutorial demonstrates how to implement a deep reinforcement learning (RL) agent from scratch using a policy gradient method that learns to play the Pong video game using screen pixels as inputs with NumPy. Your Pong agent will obtain experience on the go using an artificial neural network as its policy.

This example is based on the code developed by Andrej Karpathy for the Deep RL Bootcamp in 2017 at UC Berkeley.

Table of contents

  • A note on RL and deep RL <- research included here, it's a complex field, helping new users
  • Deep RL glossary <- helping new users
  • About policy gradients <- research included here
  1. Set up Pong
  • (Optional) Enable video playback in a notebook <- Colaboratory and Binder support
  2. Preprocess frames (the observation)
  3. Create the policy (the neural network)
  4. Define the discounted rewards function
  5. Train the agent <- a very long section because everything is from scratch
  6. Next steps <- research included here

All feedback welcome. Thank you all!


@8bitmp3 (Contributor, Author) commented Oct 26, 2020

@melissawm env = gym.make("Pong-v0") (the Pong game environment in Gym) is defined and the notebook runs fine, but the CI test fails on the env.reset() call because it reports that env isn't "defined" or something.

Note that reset() is part of gym, which performs all the game magic (the other dependency is numpy) 🤔 FYI, the user has to install gym in one of the first steps to make everything work.

For context, here's how env.reset() is used in a simple example (http://gym.openai.com):

  • You call it at the beginning of the training loop and at the end of each training episode.

[Screenshot: example code from gym.openai.com]
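
For reference, a minimal sketch of that kind of loop, assuming the classic (pre-0.26) gym API where env.step() returns four values; the variable names below are only illustrative:

import gym

env = gym.make("Pong-v0")
observation = env.reset()          # start the first episode and get the initial screen frame

for _ in range(1000):
    action = env.action_space.sample()                   # random action, for illustration only
    observation, reward, done, info = env.step(action)   # classic gym API: four return values
    if done:                                             # the episode (one game of Pong) is over
        observation = env.reset()                        # reset before starting the next episode

env.close()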

@8bitmp3 (Contributor, Author) commented Oct 26, 2020

After a few thousand episodes of training today, the agent scored 2 points but lost 21:2:

[Screenshot: training output showing the final score]

🏓🤖

@melissawm (Member) commented:

Don't worry, @8bitmp3, this is because we need to add gym as a dependency for the repo so the CI picks it up. For now it doesn't matter; we'll do that later if it's alright with you.

@melissawm (Member) commented:

Hi @8bitmp3! Just to let you know - I'm reviewing this, but I feel like it's going to require some reworking to get right. Because it's a complicated subject, I'm trying to figure out ways to make it simpler and maybe reorganize some things. I'll let you know as soon as I've reached a nice balance :)

@melissawm (Member) commented:

Hello, @8bitmp3! I'm finally back - sorry it took me so long. I took my time partly because I wanted to read this carefully, and I have to say it is a really cool project :)

Here's a slightly modified version of the tutorial.

Again, these are just suggestions - see if they make sense to you. The code apparently works: I've tested it against your original and get very similar results.

A few points:

  • I have moved some of the explanations to an appendix, trying to keep it simple at least in the beginning but leaving some information for advanced users.
  • I've added a maximum number of episodes because a distracted user may hit "Run all cells" in that notebook and not realize that there is an infinite loop. Also, this is something we have to think about: it makes no sense to add this cell to our CI here in the repo, since it will take up too many resources.
  • I've tried to simplify the glossary, but I'm not sure if I've missed something. For example, it was not immediately clear to me how the action-value, expected return and discounted return functions were all connected. Hope I got it right!
  • Under "Preprocess frames", one of the code cells has the comment "Remove the background and apply other enhancements". What are those other enhancements?
  • Under "Create the policy", one code comment says "Apply the sigmoid function for non-linearity"; what does it mean to "add non-linearity", and why is that important?
  • Under "Train the agent", you mention that for every batch of episodes you must "Compute the cumulative reward and discount it to present." I may be missing something, but what does "discount to present" mean?
  • I didn't understand the role of the render variable; it seems to me like it's always False, and I'm not sure if we should keep it. In fact, while I really like the idea of playing the video in the notebook, it doesn't seem feasible at this point (involves installation instructions which won't be available for all users and only applies to Google Colab). I'm not sure I'd keep it, so I have moved it to the end of the document.

Last thing: it would be very helpful to note some expected values for each episode, because the "-21" values are not very encouraging and don't seem like the right answer if you're not paying attention to the expected results :)

I hope this all makes sense and we'll certainly need to do a couple more passes to get it right, but that's a really interesting tutorial, so thanks again! Please reach out if you have any questions.

@8bitmp3 (Contributor, Author) commented Nov 24, 2020

Thank you again @melissawm for the awesome feedback! 🥳

I have moved some of the explanations to an appendix, trying to keep it simple at least in the beginning but leaving some information for advanced users.

I've added a maximum number of episodes because a distracted user may hit "Run all cells" in that notebook and not realize that there is an infinite loop. Also, this is something we have to think about: it makes no sense to add this cell to our CI here in the repo, since it will take up too many resources.

  • That's a great idea. I suggest keeping it at 100 (😬) for demo purposes and advising the reader to increase it if they have enough computing power. 100 is very low, but it probably won't crash anything during testing. Also, a free Colab session with a GPU can sometimes freeze with the process running in the background if you train an RL algorithm for a long time. The need for a high number of episodes (sample inefficiency) is covered in a section I wrote at the very end.
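
(For illustration, a rough sketch of what capping the loop might look like; max_episodes and episode are hypothetical names, not necessarily the ones used in the tutorial:)

max_episodes = 100      # low default for demos and CI; increase it if you have the compute
episode = 0

while episode < max_episodes:    # instead of an unbounded `while True:` training loop
    # ...play one episode, collecting observations, actions and rewards,
    # ...then compute the policy gradient and update the network weights
    episode += 1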

I've tried to simplify the glossary, but I'm not sure if I've missed something. For example, it was not immediate clear to me how the action-value, expected return and discounted return functions were all connected. Hope I got it right!

  • OK! I moved parts of the policy-gradient explanations to the glossary.
  • I also moved the sentence about the "cumulative reward function" to the "cumulative rewards" section to minimize confusion. I'd like to keep that sentence, as well as the one that talks about using a discount factor (for discounting rewards).

Under "Preprocess frames", one of the code cells mentions in a comment Remove the background and apply other enhancements.. What are those other enhancements?

  • Good catch. The first two steps remove the background and the last one puts an emphasis on 🎾 / 🏓. I added it in the comments (from the original code) to clarify this.
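
For context, here's a sketch of that preprocessing in the spirit of Karpathy's original prepro function that the tutorial is based on; the exact crop offsets and pixel values are taken from that code and may not match the tutorial exactly:

import numpy as np

def preprocess_frame(frame):
    """Turn a 210x160x3 uint8 Pong frame into an 80x80 binary float vector."""
    frame = frame[35:195]        # crop out the scoreboard and borders
    frame = frame[::2, ::2, 0]   # downsample by a factor of 2 and keep a single color channel
    frame[frame == 144] = 0      # remove the background (type 1)
    frame[frame == 109] = 0      # remove the background (type 2)
    frame[frame != 0] = 1        # set the paddles and the ball to 1 -- the "emphasis" step
    return frame.astype(np.float64).ravel()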

Under "Create the policy", one code comment says Apply the sigmoid function for non-linearity; what does it mean to "add nonlinearity" and why is that important?

  • OK! There's a section that talks about non-linear activations, and I changed a few things that will hopefully make it clearer now 😄 (e.g. "(non-linear activation)" instead of "nonlinearity").
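
To show where the non-linear activations sit, here is a rough sketch of a two-layer policy forward pass in that style (policy_forward, W1 and W2 are placeholder names; the tutorial's actual code may differ):

import numpy as np

def sigmoid(x):
    # Squash any real number into (0, 1) so the output can be read as a probability
    return 1.0 / (1.0 + np.exp(-x))

def policy_forward(x, W1, W2):
    h = np.dot(W1, x)
    h[h < 0] = 0                 # ReLU: the hidden layer's non-linear activation
    logit = np.dot(W2, h)
    prob_up = sigmoid(logit)     # probability of moving the paddle UP
    return prob_up, h            # h is kept for backpropagation

Without the non-linear activations, the two matrix multiplications would collapse into a single linear transformation, so the network could only ever represent a linear policy.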

Under "Train the agent", you mention that for every batch of episodes you must "Compute the cumulative reward and discount it to present." I may be missing something, but what does "discount to present" mean?

  • OK! I rephrased for clarity ("Compute the cumulative return and, to provide more weight to shorter-term rewards versus the longer-term ones, use a discount factor discount.")
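
As a rough illustration of that idea, here's a sketch in the style of Karpathy's discount_rewards function (the function and argument names, including gamma, are placeholders; the tutorial's own variable appears to be called discount):

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Walk backwards through the rewards so that each step's return is its own reward
    plus the discounted sum of everything that follows it."""
    discounted = np.zeros_like(rewards, dtype=np.float64)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running_add = 0.0              # Pong-specific: a non-zero reward ends a point
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted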

I didn't understand the role of the render variable; it seems to me like it's always False, and I'm not sure if we should keep it. In fact, while I really like the idea of playing the video in the notebook, it doesn't seem feasible at this point (involves installation instructions which won't be available for all users and only applies to Google Colab). I'm not sure I'd keep it, so I have moved it to the end of the document.

  • I rephrased one of the steps to "3. Set the game rendering default variable for Gym's render method (it is used to display the observation and is optional but can be useful during debugging):" - I hope this provides more clarity.
  • Putting the "how to view videos of the gameplay" section at the end of the doc is a good idea 👍 I embedded the instructions and the code block into the same cell in Markdown since it's optional.
  • Plus, that means you don't have to add any more dependencies for the CI to pass 😅. WDYT? @melissawm

Last thing: it would be very helpful to note some expected values for each episode, because the "-21" values are not very encouraging and don't seem to be the right answer if you're not paying attentions to the expected results :)

  • I think the scores are expected to have high variance ("jumping around" 🎃), as in any RL scenario: deep RL isn't nearly as "stable" as "classic" deep learning, I think, because the "datasets" (the state/observation, action and reward trajectories) are newly generated at every episode.
  • Hopefully, the agent improves its scores in the long run. In a lot of experiments, the training takes many thousands of episodes and, sometimes, days or weeks.
  • Re: "-21": the rules of Pong—as mentioned in the beginning of the tutorial—say that "if a player reaches 21 points, they win", so I hope the readers understand the output. WDYT?

DONE:

  • Add a frame preview example using Matplotlib
  • To help the readers understand the steps, provide a detailed diagram

@melissawm PTAL thanks! 👍

@melissawm (Member) commented:

Hi @8bitmp3 , thanks for the explanations! Yes, they do make sense. I'll do a thorough re-read now.

@melissawm (Member) commented:

Overall I think this is great. I like the subject and content and it feels interesting. My only remaining concern is the length. I don't think having extra content in the bottom of the document should be a problem, though - to me it makes sense and I'm the sort of person who would like to read that extra bit of info :) Others may disagree, though.

Last comment: it would be nice to follow @bjnath's template at least for the first part of the document (What you'll learn, What you'll do...) because it makes the content we have here in the repo cohesive and the users know what to expect.

Thanks again!

@8bitmp3 (Contributor, Author) commented Nov 27, 2020

My only remaining concern is the length. I don't think having extra content in the bottom of the document should be a problem, though - to me it makes sense and I'm the sort of person who would like to read that extra bit of info

Thanks for all the awesome feedback @melissawm 😃 Really appreciate it.

"...teaching RL is hard, and there are so many ways for teaching deep RL to go wrong" - from the foreword in the Grokking Deep RL book (the book uses PyTorch).

This tutorial attempts to explain the ins and outs of the "vanilla" policy gradient method in-depth using mostly NumPy. And, given all the background literature—including research papers and books—that I 🔍 scanned through in preparation for this tutorial, I think I've minimized the need for extra googling for readers (I hope, at least).

Also, this tutorial is something that I wish I'd come across earlier when researching (googling) this topic. And, on top of it all, we aren't using a library/framework like TensorFlow or PyTorch, which RL researchers typically rely on and which makes a bunch of the steps much easier to write.

But this is NumPy from scratch 👍 If you want to learn something in-depth, teach it and/or do it in NumPy 🤗 (I think those are @iamtrask's words.)


template at least for the first part of the document (What you'll learn, What you'll do...) because it makes the content we have here in the repo cohesive and the users know what to expect.

Ok! I think the first paragraph and the table of contents cover most of this—I tried following Ben's structure like in my other tutorial:

  1. What you'll learn:

"This tutorial demonstrates how to implement a deep reinforcement learning (RL) agent from scratch using a policy gradient method that learns to play the Pong video game using screen pixels as inputs with NumPy. Your Pong agent will obtain experience on the go using an artificial neural network as its policy."

  2. What you'll do:

Table of contents

...

  1. Set up Pong
  2. Preprocess frames (the observation)
  3. Create the policy (the neural network) and the forward pass
  4. Set up the update step (backpropagation)
  5. Define the discounted rewards (expected return) function
  6. Train the agent for 100 episodes
  7. Next steps
  8. Appendix
    • Notes on RL and deep RL
    • How to set up video playback in your Jupyter notebook

I'll try to think of some ways to enhance the intro! 👍

@melissawm (Member) commented:

Thanks, @8bitmp3 ! I think we're getting there! I don't see any further issues right now. When we feel it's ready I'll merge and convert to the new repo format. Cheers! 🎉

@melissawm (Member) commented:

@8bitmp3: I just pushed a commit to this PR updating the file to match the .md format. I also added a note about reducing the number of training steps because of our CI. Please let me know if this makes sense. If you want to open the .md file as a notebook, you just need to install jupytext as a Python package and use either classic Jupyter or JupyterLab to open the Markdown file "as a notebook" (there are different options depending on the interface you are using, but it's the same idea).

I'll also need to add gym[atari] to our environment.yml but I just wanted to check with you first. Thanks!

@8bitmp3 (Contributor, Author) commented Jan 21, 2021

Looks good, @melissawm, thank you. I also found a repetition ("First, First") and updated the diagram (one of the arrows should be pointing to the outer layer):

[Screenshot: the updated diagram]

@melissawm (Member) commented:

Great! I think all that is left is to fix the README so this document is listed there, and to fix environment.yml for the gym dependency - I couldn't make this work with conda, so we have to add a pip dependency for gym[atari]. Can you do that, @8bitmp3? Then we'll finally merge! 🎉
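
For reference, a hypothetical sketch of what that environment.yml change could look like (every package name except gym[atari] is a placeholder; the repo's actual file will differ):

name: numpy-tutorials
channels:
  - conda-forge
dependencies:
  - numpy
  - matplotlib
  - pip
  - pip:              # conda couldn't resolve gym[atari], so it goes in as a pip dependency
    - gym[atari]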

@8bitmp3 (Contributor, Author) commented Feb 15, 2021

@melissawm I've updated the YAML file and README but I'm getting merge conflicts which I can't resolve 🤔

@melissawm (Member) commented:

Hi @8bitmp3! I think I solved it, although I'm not sure this was the right approach (I was not expecting gh to redo all the commits like that...). I may need help here - @rossbar, would you mind letting me know if this makes sense?

@rossbar (Collaborator) commented Feb 17, 2021

Hmm, yeah, the fact that the author for all of the commits has been redone is surprising to me too (though at least @8bitmp3 is preserved as the actual author, so it's not wrong). Maybe the author was modified during a rebase? Either way, this looks fine to me :)

@8bitmp3 (Contributor, Author) commented Feb 17, 2021

Thank you. As long as some people find this tutorial useful -> 👍 @rossbar @melissawm

@rossbar (Collaborator) commented Feb 17, 2021

As long as some people find this tutorial useful

Of that I have no doubt! 🎉

@melissawm (Member) commented:

So, just to clarify: I checked out this PR using gh, fixed the merge conflicts using a rebase, and pushed it. I can undo that, but I honestly don't know of a different approach to fixing this - is there one?

@rossbar (Collaborator) commented Feb 17, 2021

No, that's definitely what I do in this situation as well; I guess I've just never paid attention to what that does to the committer/author bubble icons on GitHub after doing so.

@melissawm (Member) commented:

There are also a couple of things that crept in during the reformatting (I think). @8bitmp3, would you be willing to fix them? I'll mark them, and you can include this typo fix in the same commit. Thanks!

@melissawm (Member) left a review comment:

After that, you also need to add your new document to site/index.md, and Sphinx should be happy.

@8bitmp3 (Contributor, Author) commented Mar 2, 2021

After that, you also need to add your new document to site/index.md, and Sphinx should be happy.

Anything to keep Sphinx happy. @melissawm, am I doing this right? 😃 Here's the diff:

---
maxdepth: 1
---

content/cs231_tutorial
content/tutorial-svd
content/mooreslaw-tutorial
content/save-load-arrays
content/tutorial-deep-learning-on-mnist
+ content/tutorial-deep-reinforcement-learning-with-pong-from-pixels
content/tutorial-x-ray-image-processing

@melissawm (Member) commented:

Ah! I got it - the MyST parser apparently doesn't like the dollar sign inside the code block. I tried escaping it, but it ends up throwing errors no matter what I try. I tried all the documented options, but something seems to go wrong and I don't know where. If you have ideas, that would be great!

Also, I noticed that when building locally I was not seeing the images; it turns out you need to give a different path. So in both places where the PNG image shows up, you should have

<center><img src="../../../content/tutorial-deep-reinforcement-learning-with-pong-from-pixels.png" width="800", hspace="20" vspace="20"></center>

@8bitmp3 (Contributor, Author) commented Mar 4, 2021

@melissawm Updated 2x to

<center><img src="../../../content/tutorial-deep-reinforcement-learning-with-pong-from-pixels.png" width="800", hspace="20" vspace="20"></center>

Let me know if this works. Note that the image may not render on github.com.

Base automatically changed from master to main March 6, 2021 11:39
@8bitmp3 (Contributor, Author) commented Mar 15, 2021

@melissawm 👍

@8bitmp3 (Contributor, Author) commented Mar 15, 2021

Thanks @melissawm

@melissawm merged commit 144f46f into numpy:main on Mar 15, 2021
@melissawm (Member) commented:

Thank you, @8bitmp3 ! 🎉
