DOC: stop referring to 'S' dtype as string #8942

Merged
merged 1 commit into from
Apr 22, 2017

Conversation

juliantaylor
Contributor

The S dtype is zero-terminated bytes, which happen to match what
Python 2 called strings. As this is not the case in Python 3, we should
stop misnaming it in our documentation.

[ci skip]

@@ -178,7 +178,7 @@ Built-in Python types
:class:`bool` :class:`bool\_`
:class:`float` :class:`float\_`
:class:`complex` :class:`cfloat`
-:class:`str` :class:`string`
+:class:`str` :class:`bytes` (Python2) or :class:`unicode\_` (Python3)
Member

Shouldn't this be bytes_?

How about not listing str at all, and having

:class:`bytes`      :class:`bytes_`

Perhaps a note after the table that str in python is just an alias for either bytes or unicode.

Member

I think it's important to include str in this table, given how common it is to see dtype=str.

Contributor Author

I agree it should be kept so people can look up what it means. I have added a note in the section.
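
To illustrate why ``str`` is worth keeping in the table, here is a minimal sketch of what ``dtype=str`` resolves to. It assumes Python 3 and NumPy imported as ``np``; the variable names are only illustrative.

```python
import numpy as np

# On Python 3 the built-in str maps to the Unicode dtype, bytes to 'S'.
str_dtype = np.dtype(str)
bytes_dtype = np.dtype(bytes)
print(str_dtype.kind, bytes_dtype.kind)

# dtype=str therefore produces a Unicode ('U') array, not an 'S' array.
arr = np.array(["abc", "de"], dtype=str)
print(arr.dtype.kind, arr.dtype.itemsize)
```

The itemsize printed for ``arr`` is four bytes per character, because the Unicode dtype stores fixed-width UTF-32 code points.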

@@ -233,7 +233,7 @@ Array-protocol type strings (see :ref:`arrays.interface`)
``'m'`` timedelta
``'M'`` datetime
``'O'`` (Python) objects
-``'S'``, ``'a'`` (byte-)string
+``'S'``, ``'a'`` (byte-)string (not recommended)
Member

Probably worth unparenthesizing the (byte), and putting the word "string" after unicode on the line below, to stop people who are looking for a string type from picking "S"?

Also, any way to link this to the admonition below?

Contributor Author

Done, and added a reference label; it can be linked with :ref:`title <label>`.

remain zero terminated bytes and ``np.string_`` continues to map to
``np.bytes_``.
To use actual strings in Python 3 use ``U`` or ``np.unicode_``.
For bytes that do not need zero termination ``i1`` can be used.
Member

Or just B, which is a better mnemonic for byte anyway, and is unsigned.
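
The difference between zero-terminated ``S`` data and plain bytes can be sketched as follows. This is only an illustration, assuming Python 3 and NumPy imported as ``np``; ``raw`` is a made-up payload.

```python
import numpy as np

raw = b"ab\x00\x00"  # hypothetical payload with meaningful trailing NULs

# Stored as 'S4': the buffer keeps all four bytes, but reading an
# element back treats trailing NULs as padding and drops them.
as_s = np.array([raw], dtype="S4")
round_trip = bytes(as_s[0])

# Stored as unsigned bytes ('u1', same as 'B'): every byte survives.
as_u1 = np.frombuffer(raw, dtype="u1")

print(round_trip, as_s.tobytes(), as_u1.tobytes())
```

So binary data whose trailing zero bytes carry meaning does not round-trip through an ``S`` element, while a ``u1`` array preserves it.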

@@ -298,7 +307,6 @@ Type strings
.. admonition:: Example

>>> dt = np.dtype((void, 10)) # 10-byte wide data block
->>> dt = np.dtype((str, 35)) # 35-character string
Member

I'd be tempted to keep this, but using bytes instead. Also, void is a NameError on the line above?

Contributor Author

the bytes dtype is kind of useless in python3, I don't think it is worth mentioning.
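
For reference, a runnable form of the example under discussion might look like the sketch below. It assumes Python 3 and NumPy imported as ``np``, and qualifies ``void`` as ``np.void`` to avoid the NameError mentioned above.

```python
import numpy as np

# 10-byte wide opaque data block.
void_dt = np.dtype((np.void, 10))

# 35-byte zero-terminated bytes field ('S35').
bytes_dt = np.dtype((bytes, 35))

# On Python 3, str gives a 35-character Unicode field ('U35').
str_dt = np.dtype((str, 35))

print(void_dt.itemsize, bytes_dt.itemsize, str_dt.itemsize)
```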

@juliantaylor juliantaylor force-pushed the string-doc branch 5 times, most recently from 2574cc4 to b851c92 Compare April 14, 2017 09:48
@eric-wieser
Member

Can we add a row to that table for bytes as well?

@juliantaylor juliantaylor force-pushed the string-doc branch 2 times, most recently from 3648e4d to 220e1a3 Compare April 14, 2017 10:02
@juliantaylor
Contributor Author

added bytes

@eric-wieser
Member

Is np.string still a thing? If so, perhaps we should leave that in the table as something like np.string, an alias for bytes_ (py2) and unicode_ (py3); otherwise we lose the documentation for what np.string actually is.

@juliantaylor
Contributor Author

np.string_ is always bytes. It would be nice to deprecate that symbol if possible. But we probably can't, for the same reason we can't deprecate 'S' and 'a': it's used too much.

:class:`unicode` :class:`unicode\_`
:class:`buffer` :class:`void`
(all others) :class:`object_`
================ ===============

Note that ``str`` refers to either null terminated bytes or unicode strings
Member

If we are talking about array types, the unicode strings are null terminated UTF-32.

Contributor Author

A string kind of implies that it is terminated in some way, bytes does not.
We could also clarify this for strings better but it dilutes how bad the bytes type is a bit.
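
The storage layouts being compared here can be inspected directly. The following is a minimal sketch, assuming Python 3 and NumPy imported as ``np``; the little-endian ``<U2`` is chosen only so the byte comparison is platform independent.

```python
import numpy as np

# Unicode arrays store fixed-width UTF-32: 4 bytes per character.
u = np.array(["ab"], dtype="<U2")
u_bytes = u.tobytes()

# Shorter entries are padded with NUL code points up to the item size.
padded = np.array(["a"], dtype="<U2").tobytes()

# 'S' arrays store one byte per character, padded the same way.
s = np.array([b"a"], dtype="S2")

print(u.dtype.itemsize, s.dtype.itemsize)
```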

@mhvk
Contributor

mhvk commented Apr 15, 2017

This looks good. One question, mostly relevant only for the addition of "not recommended": are there still any plans to add an "encoding" option to S? It would make the type useful again...

@mhvk
Contributor

mhvk commented Apr 15, 2017

p.s. See astropy/astropy#5700 for efforts to add an encoding for the ndarray subclass Column (obviously, easier done in a subclass in python than in C).

@juliantaylor
Contributor Author

Yes, I wanted to look into adding an encoded byte type for the next release. I was thinking about UTF-8 for the convenience of automatically getting a relatively compact encoding for unknown strings. The disadvantage would be that it is hard to predict when adding a string to an existing array will truncate.
But I first wanted to check what is involved in adding a new dtype; what type of encoding we use (or whether it is user-selectable, similar to datetime) can be decided later.

@charris
Member

charris commented Apr 15, 2017

Make it latin-1, which is just the first 256 code points and should work well with legacy scientific data. UTF-8 is a different animal altogether and for that I think we should have a ragged string type, basically an object array.

@@ -218,7 +218,7 @@ PyArrayDescr_Type
interface typestring notation). A 'b' represents Boolean, a 'i'
represents signed integer, a 'u' represents unsigned integer, 'f'
represents floating point, 'c' represents complex floating point, 'S'
-represents 8-bit character string, 'U' represents 32-bit/character
+represents 8-bit zero-terminated bytes , 'U' represents 32-bit/character
Member

nit: extra space

remain zero terminated bytes and ``np.string_`` continues to map to
``np.bytes_``.
To use actual strings in Python 3 use ``U`` or ``np.unicode_``.
For bytes that do not need zero termination ``i1`` or ``B`` can be used.
Member

@eric-wieser eric-wieser Apr 16, 2017

Note that u1 == B, and i1 == b, so we should probably decide whether to recommend signed or unsigned here, rather than suggesting both with different syntaxes. I'd err on the side of u1/B.

Contributor Author

Hm, the above table is wrong: it lists 'b' as boolean, but boolean is '?'. Updated, adding boolean and unsigned bytes.
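
The equivalences mentioned in this exchange can be checked with a short sketch, assuming NumPy imported as ``np``; the ``aliases`` dictionary is just an illustrative helper.

```python
import numpy as np

# Explicit-width spelling -> single-character type code.
aliases = {"u1": "B", "i1": "b"}
same = {k: np.dtype(k) == np.dtype(v) for k, v in aliases.items()}

# '?' is the boolean character code; a bare 'b' is a signed byte.
bool_code = np.dtype("?")
signed_byte = np.dtype("b")

# In array-protocol notation, boolean is spelled 'b1'.
bool_typestr = np.dtype(bool).str

print(same, bool_code, signed_byte.kind, bool_typestr)
```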

@rainwoodman
Contributor

As a non-latin language speaker, I'll throw in my two cents.

I'd recommend using UTF-8 unless there is good reason against it.

  • The default encoding of byte strings appears to be UTF-8 in Python 3 (https://docs.python.org/3/library/stdtypes.html#str.encode).

  • The benefits of UTF-8 are nicely summarized in https://docs.python.org/3/howto/unicode.html#encodings: it is deterministic and robust, and also a superset of ASCII.

  • In addition, UTF-8 was settled on as the default encoding on most Linux platforms for good reasons. I recall reading about the debate on this many years ago; as I remember it, about 10 years ago the Linux community in China started switching to UTF-8 from the legacy GBK/GB18030 encodings. Even the electronic bulletin board system (BBS) of Peking University now has a UTF-8 interface; 10 years ago it was GBK/Big5 only.

  • While researchers in the West may be able to get away with latin-1, what about those in the rest of the world who use non-Latin characters in their research? Defaulting to UTF-8 will spare those people real trouble. At least UTF-8 is consistent with how Python deals with byte strings and will cause the fewest surprises.

  • No real worry about strlen confusion: people in the multi-byte string world are used to strlen never giving the correct number of characters anyway; that is taught in textbooks.
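
The trade-off in this part of the thread (latin-1 cannot represent non-Latin text, while UTF-8 byte length differs from character count and can truncate mid-character) can be sketched with today's ``S`` dtype. This is an illustration only, assuming Python 3 and NumPy imported as ``np``; no encoded dtype from the proposal is used, since none exists in this PR.

```python
import numpy as np

text = "\u65e5\u672c\u8a9e"      # three CJK characters
utf8 = text.encode("utf-8")      # three bytes per character here

# latin-1 covers only the first 256 code points, so this text fails.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Storing UTF-8 bytes in a too-narrow 'S' field truncates silently,
# possibly in the middle of a multi-byte character.
stored = bytes(np.array([utf8], dtype="S4")[0])
try:
    stored.decode("utf-8")
    decodes = True
except UnicodeDecodeError:
    decodes = False

print(len(text), len(utf8), latin1_ok, decodes)
```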

@shoyer
Member

shoyer commented Apr 17, 2017 via email

@eric-wieser
Member

eric-wieser commented Apr 17, 2017

I'd recommend using UTF-8 unless there is good reason against it.

Referring to the encoding argument added in this PR, or this proposed new string type suggested above?

I don't think we can switch the default encoding of text io from latin1 to utf8 without breaking compatibility, sadly

@shoyer
Member

shoyer commented Apr 17, 2017

Referring to the encoding argument added in this PR, or this proposed new string type suggested above?

Both?

I don't think we can switch the default encoding of text io from latin1 to utf8 without breaking compatibility, sadly

np.load defaults to encoding='ASCII': https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html

@juliantaylor juliantaylor force-pushed the string-doc branch 2 times, most recently from 4249c30 to 4724993 Compare April 18, 2017 12:07
-``'b'`` boolean
+``'?'`` boolean
+``'b'`` (signed) byte
+``'B'`` unsigned byte
Member

If we really want completeness here, then we support (most of?) the struct pack codes too (https://docs.python.org/2/library/struct.html#format-characters). Certainly, bBhHiIlLqQ are all accepted.

Contributor Author

Yes, but it is IMO not a good idea: the struct typecodes are the poor old C types, where it is unclear what they actually mean.
The numpy ones i1, i2, i4, i8 are explicit and should be used instead.
So technically b and B should also not be in there, but they do have some relevance with regard to the S type.

Member

The numpy ones i1, i2, i4, i8 are explicit and should be used instead.

Yep, I completely agree.
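
The point about explicit widths can be sketched as follows, assuming NumPy imported as ``np``. The size of the C ``long`` code ``'l'`` is deliberately not asserted to be a single value, since it depends on the platform.

```python
import numpy as np

# Explicit-width codes always mean the same number of bytes.
explicit = {code: np.dtype(code).itemsize for code in ("i1", "i2", "i4", "i8")}

# The struct-style 'l' (C long) is 4 bytes on some platforms, 8 on others.
c_long_size = np.dtype("l").itemsize

print(explicit, c_long_size)
```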

Member

Also, it seems p and P are things, which I guess correspond to intp and uintp? These have different meanings in struct.pack, which is potentially confusing

Member

@eric-wieser eric-wieser Apr 18, 2017

Ah, that's documented in array.scalars.rst. Perhaps this table should link there instead?

Contributor Author

Ah, that file still needs some string updates.

@eric-wieser
Member

Unless you've got anything you still think you need to fix @juliantaylor, I think this is good to merge?

@juliantaylor
Contributor Author

should be good

@eric-wieser eric-wieser merged commit cb640fa into numpy:master Apr 22, 2017
@eric-wieser
Member

Thanks @juliantaylor. I'm sure that there'll always be more improvements to make here, but this is a good start

@charris charris changed the title DOC: stop refering to 'S' dtype as string DOC: stop referring to 'S' dtype as string Dec 12, 2017
6 participants