Alternative I/O how-to #15

Closed
wants to merge 3 commits into from

Conversation

bjnath
Contributor

@bjnath bjnath commented Jul 21, 2020

Taking a different approach than PR #14 for the I/O how-to. This is a leaner style.

The draft glosses over pandas; maybe it deserves more play. Among Stack Overflow questions tagged numpy, "Convert pandas dataframe to NumPy array" is one of the top 10 vote-getters.

Also, I wasn't sure whether additional or different formats should be listed for exchanging data with other tools.

Trying a leaner approach than numpy#14 for the I/O how-to.
Member

@mattip mattip left a comment

Nice classification of storage formats.

  • Is there a reason you chose not to present this as a Jupyter Notebook?
  • The file itself should be moved to the content directory
  • Maybe the IO section of this repo should have two documents: a tutorial and a how-to. PR #14 (Added new How To (simple io with NumPy)) is more of a tutorial; this is more of a how-to.

@@ -0,0 +1,268 @@
# How-to: NumPy I/O
Member

Needs a two sentence intro

Contributor Author

@bjnath bjnath Jul 28, 2020

Thanks for taking time to review this.

Is there a reason you chose not to present this as a Jupyter Notebook?

  • It was a prototype to see if there was support for the PR.
  • Unlike a tutorial, where the user is invited to work with the examples, these are rote steps; many are just links.
  • To make this executable in a user environment (for instance, via binder), files used in the examples would be introduced as dependencies, plus libraries like Zarr that the user would need to install.
  • If it's just documentation and not user-runnable, nothing is added by making it a notebook; Markdown is easier to update and review.

The file itself should be moved to the content directory

Since it isn't a notebook, I wanted to keep it clear of whatever CI / build mechanism is going on in the repo till I knew what would happen if an .md were thrown in.

Also the repo isn't yet organized to separately handle tutorials and how-tos.

Maybe the IO section of this repo should have two documents: a tutorial and a how-to. PR gh-14 is more of a tutorial, this is more of a how-to.

I agree there's value in covering topics with both tutorials and how-tos. gh-14 has a tutorial feel but doesn't qualify as a tutorial in its current form. A tutorial needs a single narrative thread; it can't be a catalog.

Needs a two sentence intro

If the title is something like "How to: NumPy I/O", what would be added by an intro?

Member

It would set expectations and provide keywords to help discoverability. "NumPy has many methods to serialize ndarrays. This How-To will provide an overview of when to choose which one, based on the task at hand, with links for more information and examples. Since this is a How-To (link to the how-to explanation), it is purposefully abbreviated to be as short as possible."

Contributor Author

Opening with "NumPy has many methods to serialize ndarrays" is like answering "There are many ways to tell time" to someone who asks the time.

Just like the guy asking for the time, the reader's expectations are already in place: they have a how-to question and expect to find the answer. Our job is to meet the expectations or fail.

If it weren't obvious how to use the contents -- if, say, we were presenting a triangular mileage map -- we might want to spare a few words to explain it. But everyone can pattern-match on headings.

Member

The difference is that we are not the only person for miles around on a rural road being queried nor are we an atlas sitting on a dusty shelf. We are a web-page on an internet full of other versions of the same information. If we wish to be noticed, we need to rise to the top of the heap: we need to be indexed and searchable and appear at the top of searches. An appropriate minimum of words helps that process. A bare list of links does not: it scores very low and will not be noticed.

Contributor Author

I might agree more if I could find a single useful keyword in:

NumPy has many methods to serialize ndarrays. This How-To will provide an overview of when to choose which one, based on the task at hand, with links for more information and examples. Since this is a How-To (link to the how-to explanation), it is purposefully abbreviated to be as short as possible.

The "bare links" point to headings which are full of keywords. Search engines have no trouble finding our api pages, which have no empty sentences propping them up.

howto_io.md Outdated
- [Convert from a pandas DataFrame to a NumPy array](#convert-from-a-pandas-dataframe-to-a-numpy-array)
- [Save/restore using `numpy.tofile` and `numpy.fromfile`](#saverestore-using-numpytofile-and-numpyfromfile)

## Read a .csv or other text file with no missing values
Member

Should define the acronym and explain in ten words what missing values are

Contributor Author

@bjnath bjnath Jul 28, 2020

Should define the acronym

Some acronyms need no introduction. We don't spell out ASCII. .csv is widely known even to people who have no idea what NumPy is, and will certainly be known to the person who needs to read a file with a .csv extension. Moreover, csv delimiters aren't always commas, so spelling it out makes it less clear.

A compromise might be to link to the Wikipedia article. In that case I'd follow the Google doc style guide (which I didn't here) and call the format CSV.

explain in ten words what missing values are

What wording do you think would be helpful? Won't it be clear from the how-to headings that a "missing value" can be either a field missing data so the column count runs short, or a field that contains a special marker indicating the data is unusable? And won't users reading data files be painfully knowledgeable on the subject of missing data?

howto_io.md Outdated
values (if ``usemask=True``) or will fill in the missing value with the value
specified in ``filling_values`` (default is ``np.nan`` for float, -1 for int).

### Non-whitespace delimiters
Member

I think this should all be under a top-level "Reading formatted text files", with an introductory narrative including the CSV acronym, loadtxt and genfromtxt, with a sentence about when to use each. Then this section's title should be "Missing Values: CSV files" and the next section "Missing Values: other formats".

Contributor Author

Done.
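
For readers following this thread, here is a minimal sketch of the "when to use each" distinction between `loadtxt` and `genfromtxt`. The sample data and variable names are hypothetical, not taken from the how-to; the sketch only assumes that an empty field counts as a missing value.

```python
import io

import numpy as np

# Hypothetical sample data, not taken from the how-to itself.
complete_text = "1;2;3\n4;5;6\n"
gappy_text = "1,2,3\n4,,6\n"  # the middle field of the second row is empty

# No missing values: loadtxt is enough, and the delimiter need not be a comma.
complete = np.loadtxt(io.StringIO(complete_text), delimiter=";")

# Missing values: genfromtxt fills the gap (nan by default for floats).
filled_nan = np.genfromtxt(io.StringIO(gappy_text), delimiter=",")

# filling_values chooses the replacement explicitly.
filled_custom = np.genfromtxt(
    io.StringIO(gappy_text), delimiter=",", filling_values=-1
)

# usemask=True returns a masked array that flags the gap instead.
masked = np.genfromtxt(io.StringIO(gappy_text), delimiter=",", usemask=True)
```

`loadtxt` would raise on `gappy_text`, which is roughly the dividing line the proposed narrative would need to state.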

howto_io.md Outdated
[ 7., 888., 9.]])
```

## Read a file in .npy or .npz format
Member

Reverse the order of write and read, so you can say that "load can read files generated by any of save, savez, or savez_compressed"

Contributor Author

@bjnath bjnath Jul 28, 2020

In the how-to I've put read headings first because I think this is how work goes generally, but this is a good addition and I've added it.
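
A short sketch of the point about `load`, assuming throwaway file names in a temporary directory (the names and arrays are illustrative only): one reader handles the output of `save`, `savez`, and `savez_compressed`.

```python
import os
import tempfile

import numpy as np

x = np.arange(6).reshape(2, 3)
y = np.linspace(0.0, 1.0, 4)

with tempfile.TemporaryDirectory() as tmp:
    # Three writers: one array per .npy file, several arrays per .npz archive.
    np.save(os.path.join(tmp, "one.npy"), x)
    np.savez(os.path.join(tmp, "many.npz"), x=x, y=y)
    np.savez_compressed(os.path.join(tmp, "packed.npz"), x=x, y=y)

    # One reader: np.load returns an array for .npy ...
    x_back = np.load(os.path.join(tmp, "one.npy"))

    # ... and a dict-like archive for .npz, compressed or not.
    with np.load(os.path.join(tmp, "many.npz")) as archive:
        names = sorted(archive.files)
        y_back = archive["y"]
    with np.load(os.path.join(tmp, "packed.npz")) as archive:
        x_packed = archive["x"]
```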

howto_io.md Outdated
* For **Zarr**, see [here](https://zarr.readthedocs.io/en/stable/tutorial.html#reading-and-writing-data).
* For **NetCDF**, use [scipy.io.netcdf_file](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.netcdf_file.html).

## Write files for reading by other (non-NumPy) tools
Member

Not sure what this section contributes.

Contributor Author

@bjnath bjnath Jul 28, 2020

Which section do you mean?

howto_io.md Outdated
* security: not secure against erroneous or maliciously constructed data
* portability: may not be loadable on different Python installations

Use `np.save` and `np.load`, setting ``allow_pickle=False``.
Member

Might mention that this is required for NumPy arrays with dtypes that include python objects.

Contributor Author

Thanks, will add.
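
To illustrate the object-dtype point, a hedged sketch with made-up arrays and file names: numeric arrays round-trip with `allow_pickle=False`, while an array whose dtype holds Python objects can only be stored by pickling, so `np.save` refuses it under that setting.

```python
import os
import tempfile

import numpy as np

numeric = np.array([1.5, 2.5])
objects = np.array([{"a": 1}, None], dtype=object)  # holds Python objects

with tempfile.TemporaryDirectory() as tmp:
    # Plain numeric data never needs pickle.
    path = os.path.join(tmp, "numeric.npy")
    np.save(path, numeric, allow_pickle=False)
    restored = np.load(path, allow_pickle=False)

    # Object arrays cannot be written without pickle; np.save raises ValueError.
    try:
        np.save(os.path.join(tmp, "objects.npy"), objects, allow_pickle=False)
        refused = False
    except ValueError:
        refused = True
```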

Also added section numbering to clarify the subject nesting.
@bjnath
Contributor Author

bjnath commented Jul 28, 2020

Preview of updated file.

@mattip
Member

mattip commented Jul 28, 2020

I disagree with the premise that a how-to has absolutely no narrative text. Could you point to another project that uses this kind of format? The skeleton page you link to above is too sparse.

@bjnath
Contributor Author

bjnath commented Jul 28, 2020

the premise that a how-to has absolutely no narrative text

What's making you think I believe this? I don't. The bottom line is always, "Does it help the user?", and I'd be a charlatan if I let anything else come first.

too sparse

I'm not certain what you're after. Is the suggestion that a page titled How-to: NumPy I/O in a section of NumPy documentation labeled How-tos needs an opening sentence "This is a how-to on NumPy I/O"?

The page is "too sparse" if people can't find the information they're looking for. It's "just right" if they find the information quickly and can use it. Do you disagree?

Member

@mattip mattip left a comment

Someone else will have to comment on the markdown/notebook format and in which directory to put this file. In general the anchors seem a bit unwieldy. One advantage to sphinx-style labels is that even though the title may change the label will still provide a stable anchor to the section.

<a name="1-reading-text-files"></a>
## 1 Reading text files

<a name="11-read-a-csvhttpsenwikipediaorgwikicomma-separated_values-or-other-text-file-with-no-missing-values"></a>
Member

A shorter anchor would be better

Contributor Author

Valid point. If I may, The Zen of Anchors:

An anchor that works is better than a missing anchor.
An anchor generated by software is better than one generated by hand.
The look of an anchor is the sound of one hand clapping.

So yes, the anchor length is ridiculous, but since it isn't clear what format this doc will ultimately take, I'd like to set this consideration aside for now.

## 5 Write or read large arrays

Arrays too large to fit in memory can be treated like ordinary in-memory arrays using
[numpy.memmap](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html).
Member

In light of issue numpy/numpy#16979, maybe add something along the lines of "also useful to reduce the number of data copies made via `large_array[some_slice] = np.load('path/to/small_array', mmap_mode='r')`".

Contributor Author

Great suggestion. Added.
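
A runnable sketch of that suggestion, with a hypothetical file name and deliberately tiny arrays standing in for large ones: with `mmap_mode='r'`, `np.load` returns a memory-mapped view rather than reading the whole file into a new in-memory array first.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a previously saved array on disk.
    path = os.path.join(tmp, "small.npy")
    np.save(path, np.arange(4, dtype=np.float64))

    large_array = np.zeros(10)

    # Map the file read-only instead of loading it eagerly.
    small = np.load(path, mmap_mode="r")
    is_memmap = isinstance(small, np.memmap)

    # Data is copied from the mapped file into the destination slice.
    large_array[2:6] = small

    # Drop the mapping before the temporary directory is removed.
    del small
```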

@bjnath
Contributor Author

bjnath commented Aug 2, 2020

Someone else will have to comment on the markdown/notebook format and in which directory to put this file.

My hope is that no one will comment on either of these things until we've talked about content. I chose an arbitrary format so we could start the discussion on how-tos. Let's first pick a direction, and following that have a profitable discussion on what format and location works best.

@melissawm
Member

As discussed in the Documentation Team meeting today, I think this can be merged, and if there are outstanding issues we can open separate PRs for them. However, I think now is the time to convert this to a notebook. @bjnath can you do that?

@bjnath
Contributor Author

bjnath commented Sep 14, 2020

I have real problems with the formatting. The how-to is written like a decision tree, and the typography of headings doesn't help you understand where in the tree you are. Although I put a TOC on top, I think most people will still be scanning through the text and getting lost. That's what I tried to solve with numbered headings.

The lost-in-a-large-file-where-am-I problem is fixed if we make this an rst -- the right sidebar in the new theme shows exactly where you are. Numbered headings and a manually added TOC become unnecessary. That is another reason I think how-tos should be rst files, not notebooks, as argued in #16.

I do agree with Matti that if we use numbered headings they must be automatically generated. And I think that if one how-to has numbered headings, we probably need to commit to using them on every how-to.

@melissawm
Member

Maybe both documents should then be moved to the main repo. The formatting is more appropriate and the content, too.

@bjnath
Contributor Author

bjnath commented Oct 2, 2020

The guide has gone Sphinx: PR 17353

@bjnath bjnath closed this Oct 2, 2020