
adding tutorial for creating a Moore's law linear regression #31


Merged
merged 20 commits into numpy:master
Oct 26, 2020

Conversation

cooperrc
Member

@cooperrc cooperrc commented Oct 8, 2020

This tutorial builds a linear regression model from historical semiconductor manufacturing data on the number of transistors per microchip. It teaches a typical user workflow:

  1. You'll learn to load data from a comma separated values (*.csv) file
  2. You'll use ordinary least squares to do linear regression and predict exponential growth
  3. You'll compare exponential growth constants between models
  4. You'll share your analysis as NumPy zipped files (*.npz)
  5. You'll share your analysis as a *.csv file
  6. You'll assess the amazing progress semiconductor manufacturers have made in the last five decades
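
For reviewers skimming the PR, the workflow in the steps above can be sketched roughly as follows. This is a minimal stand-in, not the tutorial's actual code: the data here are synthetic, and it uses `np.polyfit` where the tutorial uses statsmodels' ordinary least squares.

```python
import numpy as np

# Synthetic stand-in for the tutorial's historical transistor-count data:
# year of chip introduction and transistors per microchip.
year = np.array([1971, 1979, 1989, 1999, 2009, 2019], dtype=float)
transistors = 2250.0 * 2 ** ((year - 1971) / 2)  # ideal Moore's-law growth

# Ordinary least squares on the log of the counts turns exponential
# growth into a straight-line fit: log(T) = A*year + B.
A, B = np.polyfit(year, np.log(transistors), 1)
growth_factor = np.exp(2 * A)  # fitted multiple every two years (~2.0 here)

# Share the analysis as a NumPy zipped file and as a *.csv file.
np.savez("mooreslaw_fit.npz", year=year, transistors=transistors, A=A, B=B)
np.savetxt("mooreslaw_fit.csv",
           np.column_stack([year, transistors]),
           delimiter=",", header="year,transistors")
```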

Some notes on this PR:

  1. I originally started writing this to address issue #17266, but the scope of building a linear regression model made it hard to address that issue directly. I can write a separate, shorter tutorial on saving and loading data.
  2. I do not know how to build intersphinx references for the functions I used. In the notebook I made HTML comments
    <!--need ... reference-->
    in each spot where I thought there should be a Sphinx link.
    @mattip said our best practice is numpy.org/doc/stable, so I updated the references accordingly.
  3. What makes this tutorial great is that it uses real data and compares the results to Gordon Moore's prediction. It keeps the reader focused: first learning, then building, comparing, and finally sharing the results.

Thanks to @melissawm, @mattip, @bjnath, and @rgommers for guidance and initial reviews.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@melissawm
Member

I think the failure might be related to the outputs from the notebook. Can you try removing all outputs before submitting and see if that helps? Thanks!

@cooperrc
Member Author

cooperrc commented Oct 8, 2020

> I think the failure might be related to the outputs from the notebook. Can you try removing all outputs before submitting and see if that helps? Thanks!

@melissawm, I ran clean_ipynb to remove the output from the notebook. I didn't realize that we shouldn't have output in the notebook cells. There are fewer errors now, but we will need to add statsmodels to the repo dependencies.

@bjnath
Contributor

bjnath commented Oct 8, 2020

Clearly a lot of thought and effort went into this! I have just a few comments on the opening section.

Regarding this sentence

> the rate of semiconductors on a computer chip

  • Did you mean to say it's the number (rather than the rate) that will double?
  • Better to say it's transistor count that's doubling ("semiconductor" is the chip material).
  • Although we're focusing the analysis on microprocessors, they're a subset of integrated circuits (there are also memory chips, for instance); I think Moore's prediction was for all ICs. It's more interesting to look at microprocessors because new generations emerge more often than new generations of memory. (Memory chips should follow a similar curve but with higher transistor counts -- memories are denser because of their regular layout.)

I'd propose this opening:

What you'll do

In 1965, engineer Gordon Moore predicted that transistors on a chip would double every two years in the coming decade. We'll compare that against actual transistor counts in the 53 years following his prediction.

Skills you'll learn

  • Load data from a *.csv file
  • Perform linear regression and predict exponential growth using ordinary least squares
  • Share your analysis in a file:
    • a NumPy zipped file (*.npz)
    • a *.csv file

What you'll need

  1. These packages
  • NumPy
  • Matplotlib
  • statsmodels (for ordinary least squares regression)

imported with the commands

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

We'll be using these libraries from those packages:
...

  2. Since this is an exponential law, you'll need a little background in natural logs and exponentials.

I don't think I allowed for this in my tutorial guidelines, but since there are many steps it might be good to summarize them first -- possibly just with a table of contents.

@cooperrc
Member Author

cooperrc commented Oct 8, 2020

@bjnath, I really like the suggested edits. Thanks! I can update the tutorial.

@melissawm, I think the failed checks now stem from the missing statsmodels.

----> 3 import statsmodels.api as sm

ModuleNotFoundError: No module named 'statsmodels'

It looks like the subsequent errors come from missing sm.<func> assignments.

@melissawm
Member

> @bjnath, I really like the suggested edits. Thanks! I can update the tutorial.
>
> @melissawm, I think the failed checks now stem from the missing statsmodels.
>
> ----> 3 import statsmodels.api as sm
>
> ModuleNotFoundError: No module named 'statsmodels'
>
> It looks like the subsequent errors come from missing sm.<func> assignments.

I just opened #32 to fix that, should work. Thanks @bjnath for the suggestions, @cooperrc you can update the notebook and as soon as #32 is merged we should have no other problems.

@bjnath
Contributor

bjnath commented Oct 11, 2020

Would it be clearer to start from the law itself?

Building Moore's law

Making "doubles every two years" into an equation: if the transistor count in year Y is T, in year n it will be

T_n = T * 2^((n - Y)/2)

That is, two years later T is multiplied by a factor of

2^(2/2) = 2

in four years, by

2^(4/2) = 4

and so on.

If we plot this, it's an exponential curve:
(possibly a plot here)

But we can make it a straight line by taking the log of both sides:

log2(T_n) = (1/2)*n + log2(T) - Y/2

This has the familiar form y = ax + b, where

y = log2(T_n), a = 1/2, x = n, b = log2(T) - Y/2

There were 2250 transistors in 1971, so

Y = 1971
T = 2250

giving the straight-line formula

log2(T_n) = n/2 + log2(2250) - 1971/2

It's more convenient to work with base-10 logs than base-2 logs; since

log10(x) = log10(2) * log2(x)

we can multiply both sides by log10(2) to get the equivalent formula

log10(T_n) = log10(2) * (n/2 + log2(2250) - 1971/2)
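
A quick numeric sanity check of the derivation above, with the formulas written out (the year 1981 is just an illustrative example):

```python
import numpy as np

Y, T = 1971, 2250  # first data point: 2250 transistors in 1971
n = 1981           # an arbitrary later year, for illustration

# Moore's law directly: T doubles every two years.
T_n = T * 2 ** ((n - Y) / 2)

# Straight-line form in base-2 logs: log2(T_n) = n/2 + log2(T) - Y/2
log2_T_n = n / 2 + np.log2(T) - Y / 2

# Equivalent base-10 form, multiplying through by log10(2):
log10_T_n = np.log10(2) * (n / 2 + np.log2(T) - Y / 2)

print(np.isclose(np.log2(T_n), log2_T_n))    # True
print(np.isclose(np.log10(T_n), log10_T_n))  # True
```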

@cooperrc
Member Author

> Would it be clearer to start from the law itself?
>
> Building Moore's law

I had considered this too, but I thought it was worthwhile to keep the two functions as similar as possible, i.e. linear on a natural-log scale.

You have to go through a few steps to get between log2, log10, and log_e, so I just kept it as log_e, since the natural log is the default in NumPy and in mathematics. In engineering, log_e is written ln and log means log10, so there is that extra layer of confusion.
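
To illustrate the naming clash: NumPy's `np.log` is the natural logarithm (the engineering `ln`), while the engineering `log` corresponds to `np.log10`. A small sketch:

```python
import numpy as np

x = 100.0
print(np.log(x))    # natural log (engineering ln): ~4.605
print(np.log10(x))  # base-10 log (engineering log): 2.0
print(np.log2(x))   # base-2 log: ~6.644

# Any base converts to any other by a constant factor:
print(np.isclose(np.log10(x), np.log(x) / np.log(10)))  # True
```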

@melissawm
Member

@cooperrc if you merge master now, the tests should pass. Sorry for taking so long!

@cooperrc
Member Author

> @cooperrc if you merge master now, the tests should pass. Sorry for taking so long!

No need to apologize! I took some time to remove passive voice and add the docs/stable links in the notebook. Should be much better now. Plus, it passes all checks!

@melissawm
Member

This looks good to me! I have only a few (final) minor comments, then I feel like this can be merged.

@melissawm
Member

I think those are the final comments from me.

@melissawm
Member

Thanks, @cooperrc ! @mattip do you want to take a look before we merge?

@mattip mattip merged commit 1030801 into numpy:master Oct 26, 2020
@mattip
Member

mattip commented Oct 26, 2020

As per the discussion today, merging as-is, but there may be more tweaks

@mattip
Member

mattip commented Oct 26, 2020

thanks @cooperrc

@cooperrc
Member Author

> thanks @cooperrc

You're very welcome! Thanks @melissawm and @bjnath for your careful review!

@8bitmp3
Contributor

8bitmp3 commented Dec 23, 2020

@cooperrc @melissawm @rossbar The Markdown versions/conversions of GSoD notebooks, like the Moore's Law or Deep Learning with MNIST are effectively new PRs authored by the maintainers. So, would it be OK if we add the authors' names to .md files after the titles (e.g. "by @cooperrc" or "Author: @cooperrc" — similar to: https://keras.io/guides/functional_api/, https://pytorch.org/tutorials/beginner/pytorch_with_examples.html) LMKWYT 👍

@rossbar
Collaborator

rossbar commented Dec 23, 2020

Thanks @8bitmp3 , an author tag is definitely a good idea. I think this can already be specified in the notebook metadata, but we'd need to check that and then verify that the theme picks that up and displays it prominently in the article. If not, we can always just add our own to the template.

@rgommers
Member

For the record, displaying author names used to be done in doc sections a very long time ago, and we removed it because it's usually a bad idea. There are a couple of reasons for that:

  • It's not relevant info to the audience
  • It's often not reflective of the development process; there are usually multiple participants.
  • It's yet another thing to decide on in the future when more edits are made, and then the later contributors aren't mentioned.

> The Markdown versions/conversions of GSoD notebooks, like the Moore's Law or Deep Learning with MNIST are effectively new PRs authored by the maintainers.

I may be missing something here, but authorship needs to be preserved across commits/reworks. If what happened is that a contributor wrote a lot of content but didn't end up in the commit log as an author, that's problematic. Even if you copy over content to a new file format, you should use git commit --author=<...> to preserve authorship usually. And then add more changes on top.
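
The `git commit --author` suggestion can be sketched as follows. Everything here is hypothetical (repo, file name, names, and emails are illustrative only):

```shell
# Sketch: preserving original authorship when porting content to a new
# file format (e.g. .ipynb -> .md). All names/emails below are made up.
mkdir -p authorship-demo && cd authorship-demo
git init -q .
git config user.name "Converting Maintainer"
git config user.email "[email protected]"

# Suppose this .md file was converted from a .ipynb written by someone else:
echo "# Moore's law tutorial (converted)" > mooreslaw-tutorial.md
git add mooreslaw-tutorial.md

# Credit the original notebook author, not the person doing the conversion:
git commit -q --author="Original Author <[email protected]>" \
    -m "Convert Moore's law tutorial from .ipynb to .md"

# Author and committer are now recorded separately:
git log -1 --format='author: %an, committer: %cn'
```

This keeps the file's history (and the contributors graph) attributing the content to its original author, while the commit itself records who performed the conversion.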

@rossbar
Collaborator

rossbar commented Dec 23, 2020

> Even if you copy over content to a new file format, you should use git commit --author=<...> to preserve authorship usually.

This is the step we were missing - it was simply something I failed to consider when converting from .ipynb -> .md formats. Any suggestions how to fix for the existing tutorials? We could amend the commit authors but that will change the hashes.

@rgommers
Member

So in this case the MNIST tutorial looks fine to me. @8bitmp3 has many commits in master, and @melissawm has one for the .ipynb -> .md conversion. That looks as it should, and also https://github.com/numpy/numpy-tutorials/graphs/contributors looks fine; @cooperrc and @8bitmp3 are the top two contributors to this repo.

> We could amend the commit authors but that will change the hashes.

Yeah that's bad practice in general. Here we could still consider it, in case there's something important enough to fix. There's only 15 forks right now, so fixing those up by hand isn't too painful.

@rossbar
Collaborator

rossbar commented Dec 23, 2020

Yeah, the history as a whole should be fine; the commits related to the .ipynb work should all be properly attributed. The problem mostly arises when looking at the history for individual files on GitHub. For example, the history for mooreslaw-tutorial.md doesn't show all the work @cooperrc did in originally authoring the .ipynb file it was converted from.

@rgommers
Member

I'm not sure that matters that much. People mostly look at number of commits or full history (e.g. gitk --all) I think.

Related: we need a new way at some point to acknowledge contributions on the website. The old THANKS.txt in the root of the main numpy repo makes very little sense anymore. I'm thinking we should use the bot from https://allcontributors.org/. But not on the main repo, on the website one and then make the about page use it. Could also do the team gallery that way perhaps. Looking at the html that that bot generates, it could replace what we do now.

The idea is that then we could simply have an issue there that maintainers can comment on whenever a new significant contribution is made: @bot add @username for content contributions.

@melissawm
Member

I commented on #57 before reading this - still catching up. I did not realize this problem beforehand, sorry about that.


7 participants