Alternative I/O how-to #15

bjnath · 2020-07-21T18:53:36Z

Taking a different approach than PR #14 for the I/O how-to. This is a leaner style.

The draft glosses over pandas; maybe it deserves more play. Among Stack Overflow questions tagged numpy, "Convert pandas dataframe to NumPy array" is one of the top 10 vote-getters.

Also, I wasn't sure whether additional or different formats should be listed for exchanging data with other tools.

Trying a leaner approach than numpy#14 for the I/O how-to.

mattip

Nice classification of storage formats.

Is there a reason you chose not to present this as a Jupyter Notebook?
The file itself should be moved to the content directory
Maybe the IO section of this repo should have two documents: a tutorial and a how-to. PR #14 ("Added new How To (simple io with NumPy)") is more of a tutorial; this is more of a how-to.

mattip · 2020-07-28T09:30:29Z

howto_io.md

@@ -0,0 +1,268 @@
+# How-to: NumPy I/O


Needs a two sentence intro

Thanks for taking time to review this.

> Is there a reason you chose not to present this as a Jupyter Notebook?

It was a prototype to see if there was support for the PR.

Unlike a tutorial, where the user is invited to work with the examples, these are rote steps; many are just links.

To make this executable in a user environment (for instance, via binder), files used in the examples would be introduced as dependencies, plus libraries like Zarr that the user would need to install.

If it's just documentation and not user-runnable, nothing is added by making it a notebook; Markdown is easier to update and review.

> The file itself should be moved to the content directory

Since it isn't a notebook, I wanted to keep it clear of whatever CI / build mechanism is going on in the repo till I knew what would happen if an .md were thrown in.

Also the repo isn't yet organized to separately handle tutorials and how-tos.

> Maybe the IO section of this repo should have two documents: a tutorial and a how-to. PR gh-14 is more of a tutorial, this is more of a how-to.

I agree there's value in covering topics with both tutorials and how-tos. gh-14 has a tutorial feel but doesn't qualify as a tutorial in its current form. A tutorial needs a single narrative thread; it can't be a catalog.

> Needs a two sentence intro

If the title is something like "How to: NumPy I/O", what would be added by an intro?

It would set expectations and provide keywords to help discoverability. "NumPy has many methods to serialize ndarrays. This How-To will provide an overview of when to choose which one, based on the task at hand, with links for more information and examples. Since this is a How-To (link to the how-to explanation), it is purposefully abbreviated to be as short as possible."

Opening with "NumPy has many methods to serialize ndarrays" is like answering "There are many ways to tell time" to someone who asks the time.

Just like the guy asking for the time, the reader's expectations are already in place: they have a how-to question and expect to find the answer. Our job is to meet the expectations or fail.

If it weren't obvious how to use the contents -- if, say, we were presenting a triangular mileage map -- we might want to spare a few words to explain it. But everyone can pattern-match on headings.

The difference is that we are not the only person for miles around on a rural road being queried nor are we an atlas sitting on a dusty shelf. We are a web-page on an internet full of other versions of the same information. If we wish to be noticed, we need to rise to the top of the heap: we need to be indexed and searchable and appear at the top of searches. An appropriate minimum of words helps that process. A bare list of links does not: it scores very low and will not be noticed.

I might agree more if I could find a single useful keyword in:

> NumPy has many methods to serialize ndarrays. This How-To will provide an overview of when to choose which one, based on the task at hand, with links for more information and examples. Since this is a How-To (link to the how-to explanation), it is purposefully abbreviated to be as short as possible.

The "bare links" point to headings which are full of keywords. Search engines have no trouble finding our api pages, which have no empty sentences propping them up.

mattip · 2020-07-28T09:31:25Z

howto_io.md

+- [Convert from a pandas DataFrame to a NumPy array](#convert-from-a-pandas-dataframe-to-a-numpy-array)
+- [Save/restore using `numpy.tofile` and `numpy.fromfile`](#saverestore-using-numpytofile-and-numpyfromfile)
+
+## Read a .csv or other text file with no missing values


Should define the acronym and explain in ten words what missing values are

> Should define the acronym

Some acronyms need no introduction. We don't spell out ASCII. .csv is widely known even to people who have no idea what NumPy is, and will certainly be known to the person who needs to read a file with a .csv extension. Moreover, csv delimiters aren't always commas, so spelling it out makes it less clear.

A compromise might be to link to the Wikipedia article. In that case I'd follow the Google doc style guide (which I didn't here) and call the format CSV.

> explain in ten words what missing values are

What wording do you think would be helpful? Won't it be clear from the how-to headings that a "missing value" can be either a field missing data so the column count runs short, or a field that contains a special marker indicating the data is unusable? And won't users reading data files be painfully knowledgeable on the subject of missing data?
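For readers following the thread, here is a minimal sketch of the two kinds of "missing value" described above, as `numpy.genfromtxt` treats them. The sample data and the `"N/A"` marker are made up for illustration and are not taken from the PR; the fill value 888 just mirrors the sample output quoted in the diff.

```python
import io

import numpy as np

# Hypothetical sample: row 2 has an empty field, row 3 uses the marker
# "N/A" to flag an unusable reading.
sample = "1,2,3\n4,,6\n7,N/A,9\n"

# Replace both kinds of missing value with a chosen fill value.
filled = np.genfromtxt(
    io.StringIO(sample),
    delimiter=",",
    missing_values="N/A",
    filling_values=888,
)

# Alternatively, keep track of which entries were missing with a mask.
masked = np.genfromtxt(
    io.StringIO(sample),
    delimiter=",",
    missing_values="N/A",
    usemask=True,
)

print(filled)
print(masked.mask)
```

The `filling_values` route gives a plain array that downstream code can use directly; `usemask=True` keeps the information about which entries were absent.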

mattip · 2020-07-28T09:34:12Z

howto_io.md

+values (if ``usemask=True``) or will fill in the missing value with the value
+specified in ``filling_values`` (default is ``np.nan`` for float, -1 for int).
+
+### Non-whitespace delimiters


I think this should all be under a top-level "Reading formatted text files", with an introduction narrative including the CSV acronym, loadtxt and genfromtxt, with a sentence about when to use each. Then this section's title should be "Missing Values: CSV files" and the next section "Missing Values: other formats"
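One way to illustrate the "when to use each" sentence is the sketch below. It assumes semicolon-delimited sample text invented for this example: `numpy.loadtxt` handles a file with every field present, while `numpy.genfromtxt` tolerates an empty field that makes `loadtxt` fail.

```python
import io

import numpy as np

# Hypothetical data with a non-comma delimiter.
clean = "1;2;3\n4;5;6\n"
gappy = "1;2;3\n4;;6\n"

# loadtxt: every field present.
a = np.loadtxt(io.StringIO(clean), delimiter=";")

# loadtxt cannot convert the empty field and raises ValueError.
try:
    np.loadtxt(io.StringIO(gappy), delimiter=";")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

# genfromtxt: the empty field becomes nan in a float array by default.
b = np.genfromtxt(io.StringIO(gappy), delimiter=";")

print(a)
print(loadtxt_failed)
print(b)
```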

mattip · 2020-07-28T09:43:55Z

howto_io.md

+       [  7., 888.,   9.]])
+```
+
+## Read a file in .npy or .npz format


Reverse the order of write and read, so you can say that "load can read files generated by any of save, savez, or savez_compressed".

In the how-to I've put read headings first because I think this is how work goes generally, but this is a good addition and I've added it.
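The added sentence can be checked with a short sketch like the following. The file names and array contents are placeholders chosen for this example, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

import numpy as np

x = np.arange(6).reshape(2, 3)
y = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "single.npy")
    npz_path = os.path.join(tmp, "several.npz")
    npz_c_path = os.path.join(tmp, "several_compressed.npz")

    # Three ways to write...
    np.save(npy_path, x)
    np.savez(npz_path, x=x, y=y)
    np.savez_compressed(npz_c_path, x=x, y=y)

    # ...and one function to read them all.
    x_back = np.load(npy_path)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        y_back = archive["y"]
    with np.load(npz_c_path) as archive:
        x_compressed_back = archive["x"]

print(names)
```

For `.npz` files `np.load` returns a dictionary-like archive, so arrays are fetched by the keyword names used when saving.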

mattip · 2020-07-28T09:49:37Z

howto_io.md

+* For **Zarr**, see [here](https://zarr.readthedocs.io/en/stable/tutorial.html#reading-and-writing-data).
+* For **NetCDF**, use [scipy.io.netcdf_file](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.netcdf_file.html).
+
+## Write files for reading by other (non-NumPy) tools


Not sure what this section contributes.

Which section do you mean?

mattip · 2020-07-28T09:52:21Z

howto_io.md

+ * security: not secure against erroneous or maliciously constructed data
+ * portability: may not be loadable on different Python installations
+
+Use `np.save` and `np.load`, setting ``allow_pickle=False``.


Might mention that this is required for NumPy arrays with dtypes that include python objects.

Thanks, will add.
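A small sketch of the pickle point, assuming throwaway file names and a toy object array invented for this example: numeric arrays round-trip with `allow_pickle=False`, whereas an array whose dtype is `object` can only be saved and loaded with pickling enabled.

```python
import os
import tempfile

import numpy as np

numbers = np.array([1.5, 2.5, 3.5])

# Toy object array: its elements are arbitrary Python objects.
objs = np.empty(2, dtype=object)
objs[0] = {"a": 1}
objs[1] = "text"

with tempfile.TemporaryDirectory() as tmp:
    num_path = os.path.join(tmp, "numbers.npy")
    obj_path = os.path.join(tmp, "objects.npy")

    # Plain numeric data never needs pickle.
    np.save(num_path, numbers, allow_pickle=False)
    numbers_back = np.load(num_path, allow_pickle=False)

    # Object arrays are refused when pickling is disabled.
    try:
        np.save(obj_path, objs, allow_pickle=False)
        refused = False
    except ValueError:
        refused = True

    # They work only with pickling explicitly allowed, with the
    # security and portability caveats listed above.
    np.save(obj_path, objs, allow_pickle=True)
    objs_back = np.load(obj_path, allow_pickle=True)

print(refused)
print(objs_back)
```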

Also added section numbering to clarify the subject nesting.

bjnath · 2020-07-28T16:12:42Z

Preview of updated file.

mattip · 2020-07-28T17:08:52Z

I disagree with the premise that a how-to has absolutely no narrative text. Could you point to another project that uses this kind of format? The skeleton page you link to above is too sparse.

bjnath · 2020-07-28T18:04:36Z

> the premise that a how-to has absolutely no narrative text

What's making you think I believe this? I don't. The bottom line is always, "Does it help the user?", and I'd be a charlatan if I let anything else come first.

> too sparse

I'm not certain what you're after. Is the suggestion that a page titled How-to: NumPy I/O in a section of NumPy documentation labeled How-tos needs an opening sentence "This is a how-to on NumPy I/O"?

The page is "too sparse" if people can't find the information they're looking for. It's "just right" if they find the information quickly and can use it. Do you disagree?

mattip

Someone else will have to comment on the markdown/notebook format and in which directory to put this file. In general the anchors seem a bit unwieldy. One advantage to sphinx-style labels is that even though the title may change the label will still provide a stable anchor to the section.

mattip · 2020-08-02T11:40:22Z

howto_io.md

+<a name="1-reading-text-files"></a>
+## 1 Reading text files
+
+<a name="11-read-a-csvhttpsenwikipediaorgwikicomma-separated_values-or-other-text-file-with-no-missing-values"></a>


A shorter anchor would be better

Valid point. If I may, The Zen of Anchors:

An anchor that works is better than a missing anchor.
An anchor generated by software is better than one generated by hand.
The look of an anchor is the sound of one hand clapping.

So yes, the anchor length is ridiculous, but since it isn't clear what format this doc will ultimately take, I'd like to set this consideration aside for now.

mattip · 2020-08-02T11:46:35Z

howto_io.md

+## 5 Write or read large arrays
+
+Arrays too large to fit in memory can be treated like ordinary in-memory arrays using
+[numpy.memmap](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html).


In light of issue numpy/numpy#16979 maybe add something along the lines of "also useful to reduce the number of data copies made, via large_array[some_slice] = np.load('path/to/small_array', mmap_mode='r')".

Great suggestion. Added.
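The suggestion can be sketched as below. The array sizes, slice, and file name are placeholders for this example; the point is that `mmap_mode="r"` returns a memory-mapped view of the file, so the data is copied straight into the destination slice rather than first being read into a separate in-memory array.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    small_path = os.path.join(tmp, "small_array.npy")
    np.save(small_path, np.arange(1, 5, dtype=np.float64))

    large_array = np.zeros(10)

    # Memory-map the file instead of loading it fully.
    small = np.load(small_path, mmap_mode="r")
    is_memmap = isinstance(small, np.memmap)

    # Copy from the mapped file directly into the destination slice.
    large_array[2:6] = small

    # Drop the mapping so the file can be removed with the directory.
    del small

print(large_array)
```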

bjnath · 2020-08-02T13:50:05Z

> Someone else will have to comment on the markdown/notebook format and in which directory to put this file.

My hope is that no one will comment on either of these things until we've talked about content. I chose an arbitrary format so we could start the discussion on how-tos. Let's first pick a direction, and following that have a profitable discussion on what format and location works best.

Addresses numpy#15 (comment) and numpy#15 (comment).

melissawm · 2020-09-14T20:52:34Z

As discussed in the Documentation Team meeting today, I think this can be merged, and if there are outstanding issues we can open separate PRs for them. However, I think now is the time to convert this to a notebook. @bjnath can you do that?

bjnath · 2020-09-14T22:08:11Z

I have real problems with the formatting. The how-to is written like a decision tree, and the typography of headings doesn't help you understand where in the tree you are. Although I put a TOC on top, I think most people will still be scanning through the text and getting lost. That's what I tried to solve with numbered headings.

The lost-in-a-large-file-where-am-I problem is fixed if we make this an rst -- the right sidebar in the new theme shows exactly where you are. Numbered headings and a manually added TOC become unnecessary. That is another reason I think how-tos should be rst files, not notebooks, as argued in #16.

I do agree with Matti that if we use numbered headings they must be automatically generated. And I think that if one how-to has numbered headings, we probably need to commit to using them on every how-to.

melissawm · 2020-09-14T22:24:26Z

Maybe both documents should then be moved to the main repo. The formatting is more appropriate and the content, too.

bjnath · 2020-10-02T15:49:02Z

The guide has gone Sphinx: PR 17353

Alternative I/O how-to

c7807fe

Trying a leaner approach than numpy#14 for the I/O how-to.

bjnath mentioned this pull request Jul 21, 2020

Add a How-To guide on storing and loading array data numpy/numpy#15760

Closed

mattip reviewed Jul 28, 2020

DOC: Update PR numpy#15 per mattip's comments

65ceb53

Also added section numbering to clarify the subject nesting.

melissawm mentioned this pull request Jul 31, 2020

Added new How To (simple io with NumPy) #14

Closed

mattip reviewed Aug 2, 2020

DOC: Additional info for PR numpy#15

bf6725c

Addresses numpy#15 (comment) and numpy#15 (comment).

melissawm mentioned this pull request Sep 14, 2020

DOC: How-to guide for how-tos #16

Closed

bjnath closed this Oct 2, 2020
