
adding tutorial for creating a Moore's law linear regression #31


Merged
merged 20 commits into numpy:master
Oct 26, 2020

Conversation

cooperrc
Member

@cooperrc cooperrc commented Oct 8, 2020

This tutorial builds a linear regression model from historical semiconductor manufacturing data on the number of transistors per microchip. It teaches a typical user workflow:

  1. You'll learn to load data from a comma separated values (*.csv) file
  2. You'll use ordinary least squares to do linear regression and predict exponential growth
  3. You'll compare exponential growth constants between models
  4. You'll share your analysis as NumPy zipped files (*.npz)
  5. You'll share your analysis as a *.csv file
  6. You'll assess the amazing progress semiconductor manufacturers have made in the last five decades
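
For reviewers skimming the PR, the workflow in the steps above can be sketched roughly as follows. This is a minimal stand-in, not the tutorial's actual code: the data here are synthetic, and it uses `np.polyfit` where the tutorial uses statsmodels' ordinary least squares.

```python
import numpy as np

# Synthetic stand-in for the tutorial's historical transistor-count data:
# year of chip introduction and transistors per microchip.
year = np.array([1971, 1979, 1989, 1999, 2009, 2019], dtype=float)
transistors = 2250.0 * 2 ** ((year - 1971) / 2)  # ideal Moore's-law growth

# Ordinary least squares on the log of the counts turns exponential
# growth into a straight-line fit: log(T) = A*year + B.
A, B = np.polyfit(year, np.log(transistors), 1)
growth_factor = np.exp(2 * A)  # fitted multiple every two years (~2.0 here)

# Share the analysis as a NumPy zipped file and as a *.csv file.
np.savez("mooreslaw_fit.npz", year=year, transistors=transistors, A=A, B=B)
np.savetxt("mooreslaw_fit.csv",
           np.column_stack([year, transistors]),
           delimiter=",", header="year,transistors")
```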

Some notes on this PR:

  1. I originally started writing this to address issue #17266, but the scope of building a linear regression model made it hard to address that issue directly. I can write a separate, shorter tutorial on saving and loading data.
  2. I do not know how to build intersphinx references for the functions I used. In the notebook I made HTML comments
    <!--need ... reference-->
    in each spot where I thought there should be a Sphinx link.
    @mattip said our best practice is numpy.org/doc/stable, so I updated the references accordingly.
  3. What makes this tutorial great is that it uses real data and compares the results to Gordon Moore's prediction. It keeps the reader focused: first learning, then building, comparing, and finally sharing the results.

Thanks to @melissawm, @mattip, @bjnath, and @rgommers for guidance and initial reviews.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@melissawm
Member

I think the failure might be related to the outputs from the notebook. Can you try removing all outputs before submitting and see if that helps? Thanks!

@cooperrc
Member Author

cooperrc commented Oct 8, 2020

> I think the failure might be related to the outputs from the notebook. Can you try removing all outputs before submitting and see if that helps? Thanks!

@melissawm, I ran clean_ipynb to remove the output from the notebook. I didn't realize that we shouldn't have output in the notebook cells. There are fewer errors now, but we will need to add statsmodels to the repo dependencies.

@bjnath
Contributor

bjnath commented Oct 8, 2020

Clearly a lot of thought and effort went into this! I have just a few comments on the opening section.

Regarding this sentence

> the rate of semiconductors on a computer chip

  • Did you mean to say it's the number (rather than the rate) that will double?
  • Better to say it's transistor count that's doubling ("semiconductor" is the chip material).
  • Although we're focusing the analysis on microprocessors, they're a subset of integrated circuits (there are also memory chips, for instance); I think Moore's prediction was for all ICs. It's more interesting to look at microprocessors because new generations emerge more often than new generations of memory. (Memory chips should follow a similar curve but with higher transistor counts -- memories are denser because of their regular layout.)

I'd propose this opening:

What you'll do

In 1965, engineer Gordon Moore predicted that transistors on a chip would double every two years in the coming decade. We'll compare that against actual transistor counts in the 53 years following his prediction.

Skills you'll learn

  • Load data from a *.csv file
  • Perform linear regression and predict exponential growth using ordinary least squares
  • Share your analysis in a file:
    • a NumPy zipped file (*.npz)
    • a *.csv file

What you'll need

  1. These packages
  • NumPy
  • Matplotlib
  • statsmodels (for ordinary least squares regression)

imported with the commands

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

We'll be using these libraries from those packages:
...

  2. Since this is an exponential law, you'll need a little background in natural logs and exponentials.

I don't think I allowed for this in my tutorial guidelines, but since there are many steps it might be good to summarize them first -- possibly just with a table of contents.

@cooperrc
Member Author

cooperrc commented Oct 8, 2020

@bjnath, I really like the suggested edits. Thanks! I can update the tutorial.

@melissawm, I think the failed checks now stem from the missing statsmodels.

----> 3 import statsmodels.api as sm

ModuleNotFoundError: No module named 'statsmodels'

It looks like the subsequent errors come from missing sm.<func> assignments.

@melissawm
Member

> @bjnath, I really like the suggested edits. Thanks! I can update the tutorial.
>
> @melissawm, I think the failed checks now stem from the missing statsmodels.
>
> ----> 3 import statsmodels.api as sm
>
> ModuleNotFoundError: No module named 'statsmodels'
>
> It looks like the subsequent errors come from missing sm.<func> assignments.

I just opened #32 to fix that, should work. Thanks @bjnath for the suggestions, @cooperrc you can update the notebook and as soon as #32 is merged we should have no other problems.

@bjnath
Contributor

bjnath commented Oct 11, 2020

Would it be clearer to start from the law itself?

Building Moore's law

Making "doubles every two years" into an equation: if the transistor count in year Y is T, in year n it will be

T_n = T * 2^((n - Y)/2)

That is, two years later T is multiplied by a factor of

2^(2/2) = 2

in four years, by

2^(4/2) = 4

and so on.

If we plot this, it's an exponential curve:
(possibly a plot here)

But we can make it a straight line by taking the log of both sides:

log2(T_n) = (1/2)*n + log2(T) - Y/2

This has the familiar form y = ax + b, where

y = log2(T_n), a = 1/2, x = n, b = log2(T) - Y/2

There were 2250 transistors in 1971, so

Y = 1971
T = 2250

giving the straight-line formula

log2(T_n) = n/2 + log2(2250) - 1971/2

It's more convenient to work with base-10 logs than base-2 logs; since

log10(x) = log10(2) * log2(x)

we can multiply both sides by log10(2) to get the equivalent formula

log10(T_n) = log10(2) * (n/2 + log2(2250) - 1971/2)
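
A quick numeric sanity check of the derivation above, with the formulas written out (the year 1981 is just an illustrative example):

```python
import numpy as np

Y, T = 1971, 2250  # first data point: 2250 transistors in 1971
n = 1981           # an arbitrary later year, for illustration

# Moore's law directly: T doubles every two years.
T_n = T * 2 ** ((n - Y) / 2)

# Straight-line form in base-2 logs: log2(T_n) = n/2 + log2(T) - Y/2
log2_T_n = n / 2 + np.log2(T) - Y / 2

# Equivalent base-10 form, multiplying through by log10(2):
log10_T_n = np.log10(2) * (n / 2 + np.log2(T) - Y / 2)

print(np.isclose(np.log2(T_n), log2_T_n))    # True
print(np.isclose(np.log10(T_n), log10_T_n))  # True
```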

@cooperrc
Member Author

> Would it be clearer to start from the law itself?
>
> Building Moore's law

I had considered this too, but I thought it was worthwhile to keep the two functions as similar as possible, i.e. linear on a natural-log scale.

You have to go through a few steps to get between log2, log10, and log_e, so I just kept it as log_e, since the natural log is the default in NumPy and in mathematics. In engineering, log_e is written ln and log means log10, so there is that extra layer of confusion.
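
To illustrate the naming clash: NumPy's `np.log` is the natural logarithm (the engineering `ln`), while the engineering `log` corresponds to `np.log10`. A small sketch:

```python
import numpy as np

x = 100.0
print(np.log(x))    # natural log (engineering ln): ~4.605
print(np.log10(x))  # base-10 log (engineering log): 2.0
print(np.log2(x))   # base-2 log: ~6.644

# Any base converts to any other by a constant factor:
print(np.isclose(np.log10(x), np.log(x) / np.log(10)))  # True
```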

@melissawm
Member

@cooperrc if you merge master now, the tests should pass. Sorry for taking so long!

@cooperrc
Member Author

> @cooperrc if you merge master now, the tests should pass. Sorry for taking so long!

No need to apologize! I took some time to remove passive voice and add the docs/stable links in the notebook. Should be much better now. Plus, it passes all checks!

@melissawm
Member

This looks good to me! I have only a few (final) minor comments, then I feel like this can be merged.

@melissawm
Member

I think those are the final comments from me.

@melissawm
Member

Thanks, @cooperrc ! @mattip do you want to take a look before we merge?

@mattip mattip merged commit 1030801 into numpy:master Oct 26, 2020
@mattip
Member

mattip commented Oct 26, 2020

As per the discussion today, merging as-is, but there may be more tweaks

@mattip
Member

mattip commented Oct 26, 2020

thanks @cooperrc

@cooperrc
Member Author

> thanks @cooperrc

You're very welcome! Thanks @melissawm and @bjnath for your careful review!

@8bitmp3
Contributor

8bitmp3 commented Dec 23, 2020

@cooperrc @melissawm @rossbar The Markdown versions/conversions of GSoD notebooks, like the Moore's Law or Deep Learning with MNIST are effectively new PRs authored by the maintainers. So, would it be OK if we add the authors' names to .md files after the titles (e.g. "by @cooperrc" or "Author: @cooperrc" — similar to: https://keras.io/guides/functional_api/, https://pytorch.org/tutorials/beginner/pytorch_with_examples.html) LMKWYT 👍

@rossbar
Collaborator

rossbar commented Dec 23, 2020

Thanks @8bitmp3 , an author tag is definitely a good idea. I think this can already be specified in the notebook metadata, but we'd need to check that and then verify that the theme picks that up and displays it prominently in the article. If not, we can always just add our own to the template.

@rgommers
Member

For the record, displaying author names used to be done in doc sections a very long time ago, and we removed it because it's usually a bad idea. There are a couple of reasons for that:

  • It's not relevant info to the audience
  • It's often not reflective of the development process; there are usually multiple participants.
  • It's yet another thing to decide on in the future when more edits are made, and then the later contributors aren't mentioned.

> The Markdown versions/conversions of GSoD notebooks, like the Moore's Law or Deep Learning with MNIST are effectively new PRs authored by the maintainers.

I may be missing something here, but authorship needs to be preserved across commits/reworks. If what happened is that a contributor wrote a lot of content but didn't end up in the commit log as an author, that's problematic. Even if you copy over content to a new file format, you should use git commit --author=<...> to preserve authorship usually. And then add more changes on top.
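
The `git commit --author` suggestion can be sketched as follows. Everything here is hypothetical (repo, file name, names, and emails are illustrative only):

```shell
# Sketch: preserving original authorship when porting content to a new
# file format (e.g. .ipynb -> .md). All names/emails below are made up.
mkdir -p authorship-demo && cd authorship-demo
git init -q .
git config user.name "Converting Maintainer"
git config user.email "[email protected]"

# Suppose this .md file was converted from a .ipynb written by someone else:
echo "# Moore's law tutorial (converted)" > mooreslaw-tutorial.md
git add mooreslaw-tutorial.md

# Credit the original notebook author, not the person doing the conversion:
git commit -q --author="Original Author <[email protected]>" \
    -m "Convert Moore's law tutorial from .ipynb to .md"

# Author and committer are now recorded separately:
git log -1 --format='author: %an, committer: %cn'
```

This keeps the file's history (and the contributors graph) attributing the content to its original author, while the commit itself records who performed the conversion.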

@rossbar
Collaborator

rossbar commented Dec 23, 2020

> Even if you copy over content to a new file format, you should use git commit --author=<...> to preserve authorship usually.

This is the step we were missing - it was simply something I failed to consider when converting from .ipynb -> .md formats. Any suggestions how to fix for the existing tutorials? We could amend the commit authors but that will change the hashes.

@rgommers
Member

So in this case the MNIST tutorial looks fine to me. @8bitmp3 has many commits in master, and @melissawm has one for the .ipynb -> .md conversion. That looks as it should, and also https://github.com/numpy/numpy-tutorials/graphs/contributors looks fine; @cooperrc and @8bitmp3 are the top two contributors to this repo.

> We could amend the commit authors but that will change the hashes.

Yeah that's bad practice in general. Here we could still consider it, in case there's something important enough to fix. There's only 15 forks right now, so fixing those up by hand isn't too painful.

@rossbar
Collaborator

rossbar commented Dec 23, 2020

Yeah, the history as a whole should be fine; the commits related to the .ipynb work should all be properly attributed. The problem mostly arises when looking at the history for individual files on GitHub. For example, the history for mooreslaw-tutorial.md doesn't show all the work @cooperrc did in originally authoring the .ipynb file it was converted from.

@rgommers
Member

I'm not sure that matters that much. People mostly look at number of commits or full history (e.g. gitk --all) I think.

Related: we need a new way at some point to acknowledge contributions on the website. The old THANKS.txt in the root of the main numpy repo makes very little sense anymore. I'm thinking we should use the bot from https://allcontributors.org/. But not on the main repo, on the website one and then make the about page use it. Could also do the team gallery that way perhaps. Looking at the html that that bot generates, it could replace what we do now.

The idea is that then we could simply have an issue there that maintainers can comment on whenever a new significant contribution is made: @bot add @username for content contributions.

@melissawm
Member

I commented on #57 before reading this - still catching up. I did not realize this problem beforehand, sorry about that.


7 participants