From: Paul T. <pau...@gm...> - 2012-09-26 04:05:48
|
In R, there are many default data sets one can use both to illustrate code and to explore the scripting language. Instead of having to fake data, one can pull from meaningful data sets created in the real world. For example, this one-liner actually produces a plot:

plot(mtcars$hp~mtcars$mpg)

where mtcars refers to a built-in data set taken from Motor Trend magazine. I don't believe matplotlib has anything similar. I have started to download some of the R data sets and store them as pickles for my own use. Does anyone else have any interest in creating a repository for these data sets or otherwise sharing them in some way?

Paul |
From: <jos...@gm...> - 2012-09-26 04:28:48
|
On Wed, Sep 26, 2012 at 12:05 AM, Paul Tremblay <pau...@gm...> wrote:
> In R, there are many default data sets one can use to both illustrate code
> and explore the scripting language. Instead of having to fake data, one can
> pull from meaningful data sets, created in the real world. For example, this
> one liner actually produces a plot:
>
> plot(mtcars$hp~mtcars$mpg)
>
> where mtcars refers to a built-in data set taken from Motor Trend Magazine.
> I don't believe matplotlib has anything similar. I have started to download
> some of the R data sets and store them as pickles for my own use. Does
> anyone else have any interest in creating a repository for these data sets
> or otherwise sharing them in some way?

Vincent converted several R datasets back to CSV; they can be easily loaded from the web with, for example, pandas.
http://vincentarelbundock.github.com/Rdatasets/
The collection is a bit random.

statsmodels has some datasets that we use for examples and tests:
http://statsmodels.sourceforge.net/devel/datasets/index.html
We were always a bit slow with adding datasets because we were too cautious about licensing issues, but R seems to get away with considering most datasets to be public domain. We keep adding datasets to statsmodels as we need them for new models.

The machine learning packages like sklearn have packaged the typical machine learning datasets.

If you are interested, you could join up with statsmodels or with Vincent to expand on what's available.

Josef

> Paul
>
> _______________________________________________
> Matplotlib-users mailing list
> Mat...@li...
> https://lists.sourceforge.net/lists/listinfo/matplotlib-users
|
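As a rough Python counterpart to the R one-liner, the sketch below parses an Rdatasets-style CSV and scatter-plots hp against mpg. It uses only the standard csv module for parsing and imports matplotlib inside the plotting function, so the parsing part runs without it; pandas.read_csv would do the same job in one call. The names read_columns and plot_hp_vs_mpg are made up for illustration, and the three inline rows are just the first rows of mtcars standing in for the full file.

```python
import csv
import io

# First three rows of mtcars (mpg and hp only), in the layout that R's
# write.csv produces: the row-name column has an empty header cell.
SAMPLE_CSV = '''"","mpg","hp"
"Mazda RX4",21,110
"Mazda RX4 Wag",21,110
"Datsun 710",22.8,93
'''


def read_columns(csv_file, x_name, y_name):
    """Return two lists of floats for the named columns of a CSV file object."""
    reader = csv.DictReader(csv_file)
    xs, ys = [], []
    for row in reader:
        xs.append(float(row[x_name]))
        ys.append(float(row[y_name]))
    return xs, ys


def plot_hp_vs_mpg(csv_file):
    """Scatter-plot hp against mpg, like R's plot(mtcars$hp~mtcars$mpg)."""
    # Imported here (an illustrative choice) so parsing works without matplotlib.
    import matplotlib.pyplot as plt

    mpg, hp = read_columns(csv_file, "mpg", "hp")
    fig, ax = plt.subplots()
    ax.scatter(mpg, hp)  # the R formula hp ~ mpg puts mpg on the x axis
    ax.set_xlabel("mpg")
    ax.set_ylabel("hp")
    return fig


# Parse the inline sample; a real script would pass an open file instead.
mpg, hp = read_columns(io.StringIO(SAMPLE_CSV), "mpg", "hp")
```

With a downloaded copy of the full file, plot_hp_vs_mpg(open("mtcars.csv")) would return a figure that can be saved with fig.savefig(...); the local file name here is only an example.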
From: Michael D. <md...@st...> - 2012-09-26 13:13:27
|
On 09/26/2012 12:28 AM, jos...@gm... wrote:
> [...]
> If you are interested, you could join up with statsmodels or with
> Vincent to expand on what's available.

It seems to me like contributing to (rather than duplicating) the work of one of these projects would be a great idea. It would also be nice to add functionality in matplotlib to make it easier to download these things as a one-off -- obviously not exactly the same syntax as with R, but ideally with a single function call.

Mike |
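One way the "single function call" idea might look is sketched below. fetch_dataset is a hypothetical helper, not an existing matplotlib function: it takes a full URL (so it assumes nothing about how any project lays out its files), downloads it once, and afterwards serves the cached copy from a directory the caller chooses.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def fetch_dataset(url, cache_dir):
    """Download url into cache_dir once and return the local file path.

    Later calls with the same file name reuse the cached copy, so a script
    keeps working even if the remote URL later changes or disappears.
    """
    # Derive the local file name from the path part of the URL.
    name = posixpath.basename(urlparse(url).path)
    if not name:
        raise ValueError("URL does not end in a file name: %r" % url)

    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, name)

    # Only touch the network (or other URL source) on the first request.
    if not os.path.exists(local_path):
        with urllib.request.urlopen(url) as response:
            data = response.read()
        with open(local_path, "wb") as out:
            out.write(data)
    return local_path
```

The returned path can be handed to whatever reader is convenient (csv, numpy, pandas). Caching by file name alone is a simplification; two different URLs ending in the same name would collide.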
From: Benjamin R. <ben...@ou...> - 2012-09-26 13:33:47
|
On Wed, Sep 26, 2012 at 9:10 AM, Michael Droettboom <md...@st...> wrote:
> [...]
> It would also be nice to add functionality in matplotlib to make it
> easier to download these things as a one-off -- obviously not exactly
> the same syntax as with R, but ideally with a single function call.
>
> Mike

We did have such a thing. matplotlib.cbook.get_sample_data(). I think we got rid of it for 1.2.0?

Ben |
From: <jos...@gm...> - 2012-09-26 13:41:39
|
On Wed, Sep 26, 2012 at 9:33 AM, Benjamin Root <ben...@ou...> wrote:
> We did have such a thing. matplotlib.cbook.get_sample_data(). I think we
> got rid of it for 1.2.0?

I don't know the details, but it looks like the pandas developers spent some time on Python 3 compatibility, in case that was a problem:
https://github.com/pydata/pandas/pull/970

Josef |
From: Michael D. <md...@st...> - 2012-09-26 14:18:24
|
On 09/26/2012 09:33 AM, Benjamin Root wrote:
> We did have such a thing. matplotlib.cbook.get_sample_data(). I think
> we got rid of it for 1.2.0?

The remote download was removed because the server side was a moving target and would constantly break. It was based on pulling files out of the svn (and later git) repository, and SourceForge and GitHub have had a habit of changing the URLs used to do so. All of the data that was there was moved into the main repository and is now installed alongside matplotlib, so get_sample_data() still works.

See this PR: https://github.com/matplotlib/matplotlib/pull/498

I should have mentioned earlier that we do have a very small set of standard data sets included there, but the other projects linked above are much better and more extensive. If we can rely on them to have static URLs over time, I think they are much better options than anything matplotlib has had in the past.

Mike |
From: Paul T. <pau...@gm...> - 2012-09-28 04:47:07
|
On 9/26/12 10:15 AM, Michael Droettboom wrote:
> It was removed because the server side was a moving target and would
> constantly break. It was based on pulling files out of the svn (and
> later git) repository, and sourceforge and github have had a habit of
> changing the urls used to do so. All of the data that was there was
> moved into the main repository and is now installed alongside
> matplotlib, so get_sample_data() still works.
>
> See this PR: https://github.com/matplotlib/matplotlib/pull/498
>
> I should have mentioned it earlier, that we do have a very small set
> of standard data sets included there -- but these other projects
> linked to above are much better and more extensive. If we can rely on
> them to have static urls over time, I think they are much better
> options than anything matplotlib has had in the past.
>
> Mike

Drawing on the other posts: is it conceivable to download both the R data sets and the statsmodels data sets and include them in site-packages/matplotlib/mpl-data/sample_data/? I understand that pulling data sets that are not in this directory creates problems because of moving URLs, but why even try to do a web pull when the data can exist in a reliable place?

I suppose one might raise reasonable objections to my suggestion, but at any rate, it doesn't seem I can add anything else to either project, since they both seem complete. I see only a small though significant problem with the R data sets: they leave out the header of the first column because of the structure of R data frames. Python tools that address columns by name need this header.

Paul |
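The missing first-column header can be patched when the file is read. The sketch below is one possible workaround, assuming the header cell is either present but empty (as R's write.csv writes it) or absent altogether (as write.table does); the function name and the default column name "rownames" are arbitrary choices, not part of either project. pandas users could instead pass index_col=0 to read_csv.

```python
import csv
import io


def name_first_column(csv_text, name="rownames"):
    """Return csv_text with the unnamed row-name column given a header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) > 1 and len(rows[0]) == len(rows[1]) - 1:
        # Header cell is missing entirely: the header is one field short.
        rows[0].insert(0, name)
    elif rows and rows[0] and rows[0][0].strip() == "":
        # Header cell is present but blank.
        rows[0][0] = name
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# A blank first header cell, as in an R-exported CSV.
raw = '"","mpg","hp"\n"Mazda RX4",21,110\n'
fixed = name_first_column(raw)

# The row names can now be addressed by column name.
first_row = next(csv.DictReader(io.StringIO(fixed)))
```

After the fix, first_row["rownames"] holds the car name, so the column can be used like any other.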