Alternative I/O how-to #15
Trying a leaner approach than numpy#14 for the I/O how-to.
Nice classification of storage formats.
- Is there a reason you chose not to present this as a Jupyter Notebook?
- The file itself should be moved to the content directory
- Maybe the IO section of this repo should have two documents: a tutorial and a how-to. PR #14 ("Added new How To (simple io with NumPy)") is more of a tutorial; this is more of a how-to.
@@ -0,0 +1,268 @@
# How-to: NumPy I/O
Needs a two sentence intro
Thanks for taking time to review this.

> Is there a reason you chose not to present this as a Jupyter Notebook?

- It was a prototype to see if there was support for the PR.
- Unlike a tutorial, where the user is invited to work with the examples, these are rote steps; many are just links.
- To make this executable in a user environment (for instance, via Binder), files used in the examples would be introduced as dependencies, plus libraries like Zarr that the user would need to install.
- If it's just documentation and not user-runnable, nothing is added by making it a notebook; Markdown is easier to update and review.

> The file itself should be moved to the content directory

Since it isn't a notebook, I wanted to keep it clear of whatever CI / build mechanism is going on in the repo till I knew what would happen if an .md were thrown in.
Also the repo isn't yet organized to separately handle tutorials and how-tos.

> Maybe the IO section of this repo should have two documents: a tutorial and a how-to. PR gh-14 is more of a tutorial, this is more of a how-to.

I agree there's value in covering topics with both tutorials and how-tos. gh-14 has a tutorial feel but doesn't qualify as a tutorial in its current form. A tutorial needs a single narrative thread; it can't be a catalog.

> Needs a two sentence intro

If the title is something like "How to: NumPy I/O", what would be added by an intro?
It would set expectations and provide keywords to help discoverability. "NumPy has many methods to serialize ndarrays. This How-To will provide an overview of when to choose which one, based on the task at hand, with links for more information and examples. Since this is a How-To (link to the how-to explanation), it is purposefully abbreviated to be as short as possible."
Opening with "NumPy has many methods to serialize ndarrays" is like answering "There are many ways to tell time" to someone who asks the time.
Just like the guy asking for the time, the reader's expectations are already in place: they have a how-to question and expect to find the answer. Our job is to meet the expectations or fail.
If it weren't obvious how to use the contents -- if, say, we were presenting a triangular mileage map -- we might want to spare a few words to explain it. But everyone can pattern-match on headings.
The difference is that we are not the only person for miles around on a rural road being queried nor are we an atlas sitting on a dusty shelf. We are a web-page on an internet full of other versions of the same information. If we wish to be noticed, we need to rise to the top of the heap: we need to be indexed and searchable and appear at the top of searches. An appropriate minimum of words helps that process. A bare list of links does not: it scores very low and will not be noticed.
I might agree more if I could find a single useful keyword in:
> NumPy has many methods to serialize ndarrays. This How-To will provide an overview of when to choose which one, based on the task at hand, with links for more information and examples. Since this is a How-To (link to the how-to explanation), it is purposefully abbreviated to be as short as possible.

The "bare links" point to headings which are full of keywords. Search engines have no trouble finding our API pages, which have no empty sentences propping them up.
howto_io.md
- [Convert from a pandas DataFrame to a NumPy array](#convert-from-a-pandas-dataframe-to-a-numpy-array)
- [Save/restore using `numpy.tofile` and `numpy.fromfile`](#saverestore-using-numpytofile-and-numpyfromfile)

## Read a .csv or other text file with no missing values
Should define the acronym and explain in ten words what missing values are
> Should define the acronym

Some acronyms need no introduction. We don't spell out ASCII. `.csv` is widely known even to people who have no idea what NumPy is, and will certainly be known to the person who needs to read a file with a `.csv` extension. Moreover, CSV delimiters aren't always commas, so spelling it out makes it less clear.
A compromise might be to link to the Wikipedia article. In that case I'd follow the Google doc style guide (which I didn't here) and call the format CSV.

> explain in ten words what missing values are

What wording do you think would be helpful? Won't it be clear from the how-to headings that a "missing value" can be either a field missing data so the column count runs short, or a field that contains a special marker indicating the data is unusable? And won't users reading data files be painfully knowledgeable on the subject of missing data?
howto_io.md
values (if ``usemask=True``) or will fill in the missing value with the value
specified in ``filling_values`` (default is ``np.nan`` for float, -1 for int).

### Non-whitespace delimiters
I think this should all be under a top-level "Reading formatted text files", with an introduction narrative including the CSV acronym, `loadtxt` and `genfromtxt`, with a sentence about when to use each. Then this section's title should be "Missing Values: CSV files" and the next section "Missing Values: other formats".
Done.
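For readers following the thread, here is a minimal sketch of the `loadtxt` / `genfromtxt` split being discussed. The inline data and the fill value 888 are illustrative assumptions, not the how-to's actual example files; `delimiter`, `filling_values`, and `usemask` are the documented `genfromtxt` parameters the diff refers to.

```python
import io

import numpy as np

# Complete, whitespace-delimited data: loadtxt is sufficient.
complete = io.StringIO("1 2 3\n4 5 6\n")
a = np.loadtxt(complete)

# Comma-delimited data with one empty field (illustrative data).
# genfromtxt substitutes filling_values for the missing entry.
with_gap = io.StringIO("1,2,3\n4,,6\n7,8,9\n")
b = np.genfromtxt(with_gap, delimiter=",", filling_values=888)

# With usemask=True the missing entry is masked instead of filled.
with_gap.seek(0)
c = np.genfromtxt(with_gap, delimiter=",", usemask=True)

print(a.shape)
print(b[1])
print(c.mask[1])
```

Roughly: `loadtxt` when every field is present, `genfromtxt` when some fields may be empty or marked as missing.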
howto_io.md
[ 7., 888., 9.]])
```

## Read a file in .npy or .npz format
Reverse the order of write and read, so you can say that "load can read files generated by any of save, savez, or savez_compressed"
In the how-to I've put read headings first because I think this is how work goes generally, but this is a good addition and I've added it.
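A short sketch of the point that `load` reads the output of `save`, `savez`, and `savez_compressed`. The file names and the temporary directory are assumptions made so the example is self-contained.

```python
import os
import tempfile

import numpy as np

x = np.arange(6).reshape(2, 3)
y = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, chosen only for this illustration.
    npy_path = os.path.join(tmp, "single.npy")
    npz_path = os.path.join(tmp, "bundle.npz")
    npz_small = os.path.join(tmp, "bundle_compressed.npz")

    np.save(npy_path, x)                       # one array -> .npy
    np.savez(npz_path, x=x, y=y)               # several arrays -> .npz
    np.savez_compressed(npz_small, x=x, y=y)   # same, compressed

    # np.load handles all three.
    loaded_x = np.load(npy_path)
    with np.load(npz_path) as bundle:
        names = sorted(bundle.files)
        loaded_y = bundle["y"]
    with np.load(npz_small) as bundle:
        loaded_x_compressed = bundle["x"]

print(names)
```

For `.npz` files, `np.load` returns a dictionary-like object keyed by the names given at save time.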
howto_io.md
* For **Zarr**, see [here](https://zarr.readthedocs.io/en/stable/tutorial.html#reading-and-writing-data).
* For **NetCDF**, use [scipy.io.netcdf_file](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.netcdf_file.html).

## Write files for reading by other (non-NumPy) tools
Not sure what this section contributes.
Which section do you mean?
howto_io.md
* security: not secure against erroneous or maliciously constructed data
* portability: may not be loadable on different Python installations

Use `np.save` and `np.load`, setting ``allow_pickle=False``.
Might mention that pickling is required for NumPy arrays with dtypes that include Python objects.
Thanks, will add.
Also added section numbering to clarify the subject nesting.
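To make the pickle point concrete, a hedged sketch: purely numeric arrays round-trip with `allow_pickle=False`, while an array of dtype `object` cannot be saved or loaded without pickling. The file names and array contents are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

numeric = np.arange(3)

# An object-dtype array holding arbitrary Python objects.
objects = np.empty(2, dtype=object)
objects[0] = {"a": 1}
objects[1] = "text"

with tempfile.TemporaryDirectory() as tmp:
    numeric_path = os.path.join(tmp, "numeric.npy")   # hypothetical names
    object_path = os.path.join(tmp, "objects.npy")

    # Numeric data never needs pickle.
    np.save(numeric_path, numeric, allow_pickle=False)
    numeric_back = np.load(numeric_path, allow_pickle=False)

    # Saving an object array without pickle is refused.
    try:
        np.save(object_path, objects, allow_pickle=False)
        save_refused = False
    except ValueError:
        save_refused = True

    # With pickle enabled the save works, but loading must opt in too.
    np.save(object_path, objects, allow_pickle=True)
    try:
        np.load(object_path, allow_pickle=False)
        load_refused = False
    except ValueError:
        load_refused = True
    objects_back = np.load(object_path, allow_pickle=True)

print(save_refused, load_refused)
```

Opting in with `allow_pickle=True` brings back the security and portability caveats listed in the diff, so it is best reserved for files from a trusted source.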
I disagree with the premise that a how-to has absolutely no narrative text. Could you point to another project that uses this kind of format? The skeleton page you link to above is too sparse.
What's making you think I believe this? I don't. The bottom line is always, "Does it help the user?", and I'd be a charlatan if I let anything else come first.
I'm not certain what you're after. Is the suggestion that a page titled How-to: NumPy I/O in a section of NumPy documentation labeled How-tos needs an opening sentence "This is a how-to on NumPy I/O"? The page is "too sparse" if people can't find the information they're looking for. It's "just right" if they find the information quickly and can use it. Do you disagree?
Someone else will have to comment on the markdown/notebook format and in which directory to put this file. In general the anchors seem a bit unwieldy. One advantage to sphinx-style labels is that even though the title may change the label will still provide a stable anchor to the section.
<a name="1-reading-text-files"></a>
## 1 Reading text files

<a name="11-read-a-csvhttpsenwikipediaorgwikicomma-separated_values-or-other-text-file-with-no-missing-values"></a>
A shorter anchor would be better
Valid point. If I may, The Zen of Anchors:
An anchor that works is better than a missing anchor.
An anchor generated by software is better than one generated by hand.
The look of an anchor is the sound of one hand clapping.
So yes, the anchor length is ridiculous, but since it isn't clear what format this doc will ultimately take, I'd like to set this consideration aside for now.
## 5 Write or read large arrays

Arrays too large to fit in memory can be treated like ordinary in-memory arrays using
[numpy.memmap](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html).
In light of issue numpy/numpy#16979, maybe add something along the lines of "also useful to reduce the number of data copies made via `large_array[some_slice] = np.load('path/to/small_array', mmap_mode='r')`".
Great suggestion. Added.
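A minimal sketch of the suggested pattern, assuming small illustrative arrays and temporary file names in place of genuinely large data: a disk-backed `numpy.memmap` receives a slice directly from a `.npy` file opened with `mmap_mode='r'`.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    small_path = os.path.join(tmp, "small.npy")   # hypothetical names
    big_path = os.path.join(tmp, "big.dat")

    np.save(small_path, np.arange(4, dtype=np.float64))

    # Disk-backed array; mode="w+" creates a zero-filled file.
    big = np.memmap(big_path, dtype=np.float64, mode="w+", shape=(10,))

    # mmap_mode="r" maps the saved file instead of reading it all into memory.
    small = np.load(small_path, mmap_mode="r")
    small_is_memmap = isinstance(small, np.memmap)

    big[2:6] = small          # copy straight into the target slice
    big.flush()

    result = np.array(big)    # in-memory copy for inspection
    del big, small            # release the mappings before cleanup

print(result)
```

The arrays here are tiny so the example runs anywhere; the same calls apply when the target does not fit in memory.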
My hope is that no one will comment on either of these things until we've talked about content. I chose an arbitrary format so we could start the discussion on how-tos. Let's first pick a direction, and following that have a profitable discussion on what format and location works best.
As discussed in the Documentation Team meeting today, I think this can be merged and if there are outstanding issues we can open separate PRs for that. However, I think now is the time to convert this to a notebook. @bjnath can you do that?
I have real problems with the formatting. The how-to is written like a decision tree, and the typography of headings doesn't help you understand where in the tree you are. Although I put a TOC on top, I think most people will still be scanning through the text and getting lost. That's what I tried to solve with numbered headings. The lost-in-a-large-file-where-am-I problem is fixed if we make this an rst -- the right sidebar in the new theme shows exactly where you are. Numbered headings and a manually added TOC become unnecessary. That is another reason I think how-tos should be rst files, not notebooks, as argued in #16. I do agree with Matti that if we use numbered headings they must be automatically generated. And I think that if one how-to has numbered headings, we probably need to commit to using them on every how-to.
Maybe both documents should then be moved to the main repo. The formatting is more appropriate and the content, too.
The guide has gone Sphinx: PR 17353
Taking a different approach than PR #14 for the I/O how-to. This is a leaner style.
The draft glosses over pandas; maybe it deserves more play. Among Stack Overflow questions tagged `numpy`, "Convert pandas dataframe to NumPy array" is among the top 10 vote-getters. Also, I wasn't sure whether additional or different formats should be listed for exchanging data with other tools.