DOC: stop referring to 'S' dtype as string #8942

juliantaylor · 2017-04-13T18:41:09Z

The S dtype is zero terminated bytes which happen to match what
Python 2 called strings. As this is not the case in Python 3 we should
stop naming it wrong in our documentation.

[ci skip]
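For illustration only (this snippet is not part of the PR), a minimal sketch of the behavior described above, assuming Python 3 with NumPy installed. The array contents are made up for the example.

```python
import numpy as np

# An 'S' array stores fixed-width bytes, not Python 3 text.
arr = np.array([b"abc", b"de"], dtype="S3")
first = arr[0]

# Elements come back as a bytes subclass, never as str.
is_bytes = isinstance(first, bytes)
is_text = isinstance(first, str)

# Getting text requires an explicit conversion to a 'U' dtype
# (the cast decodes the stored bytes as ASCII).
text = arr.astype("U3")
```

On Python 3, `is_bytes` is true and `is_text` is false, which is why calling `'S'` a "string" is misleading there.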

eric-wieser · 2017-04-13T18:51:30Z

doc/source/reference/arrays.dtypes.rst

@@ -178,7 +178,7 @@ Built-in Python types
    :class:`bool`     :class:`bool\_`
    :class:`float`    :class:`float\_`
    :class:`complex`  :class:`cfloat`
-    :class:`str`      :class:`string`
+    :class:`str`      :class:`bytes` (Python2) or :class:`unicode\_` (Python3)


Shouldn't this be bytes_?

How about not listing str at all, and having

:class:`bytes` :class:`bytes_`

Perhaps a note after the table that str in python is just an alias for either bytes or unicode.

I think it's important to include str in this table, given how common it is to see dtype=str.

I agree it should be kept so people can look up what it means. I have added a note in the section.
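As a rough sketch of what `dtype=str` resolves to (assuming Python 3 and NumPy; on Python 2 `str` would map to the bytes dtype instead), the following checks the dtype kinds. The sample values are invented for the example.

```python
import numpy as np

# On Python 3, the builtin str maps to the unicode dtype ('U'),
# while the builtin bytes maps to the zero-terminated bytes dtype ('S').
str_kind = np.dtype(str).kind
bytes_kind = np.dtype(bytes).kind

# The same mapping applies when the builtin is passed as dtype=...
text_arr = np.array(["a", "bc"], dtype=str)
raw_arr = np.array([b"a", b"bc"], dtype=bytes)
```

Here `text_arr.dtype.kind` is `'U'` and `raw_arr.dtype` is `'S2'`, so the two builtins select different array dtypes.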

eric-wieser · 2017-04-13T18:53:40Z

doc/source/reference/arrays.dtypes.rst

@@ -233,7 +233,7 @@ Array-protocol type strings (see :ref:`arrays.interface`)
   ``'m'``            timedelta
   ``'M'``            datetime
   ``'O'``            (Python) objects
-   ``'S'``, ``'a'``   (byte-)string
+   ``'S'``, ``'a'``   (byte-)string (not recommended)


Probably worth unparenthesizing the (byte), and putting the word "string" after unicode on the line below, to stop people who are looking for string picking "S"?

Also, any way to link this to the admonition below?

Done, and added a reference label; it can be linked with :ref:`title <label>`.

eric-wieser · 2017-04-13T18:55:39Z

doc/source/reference/arrays.dtypes.rst

+    remain zero terminated bytes and ``np.string_`` continues to map to
+    ``np.bytes_``.
+    To use actual strings in Python 3 use ``U`` or ``np.unicode_``.
+    For bytes that do not need zero termination ``i1`` can be used.


Or just B, which is a better mnemonic for byte anyway, and is unsigned.

eric-wieser · 2017-04-13T18:56:59Z

doc/source/reference/arrays.dtypes.rst

@@ -298,7 +307,6 @@ Type strings
    .. admonition:: Example

       >>> dt = np.dtype((void, 10))  # 10-byte wide data block
-       >>> dt = np.dtype((str, 35))   # 35-character string


I'd be tempted to keep this, but using bytes instead. Also, void is a NameError on the line above?

the bytes dtype is kind of useless in python3, I don't think it is worth mentioning.
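For reference, a small sketch of how the example lines under discussion could be written so that they run (assuming Python 3 and `import numpy as np`; the bare name `void` is replaced by `np.void`, which is an assumption about what the original example intended):

```python
import numpy as np

# (flexible type, size) tuples build fixed-width dtypes.
dt_void = np.dtype((np.void, 10))   # 10-byte wide data block
dt_bytes = np.dtype((bytes, 35))    # 35 zero-terminated bytes ('S35')
dt_text = np.dtype((str, 35))       # 35 unicode characters on Python 3 ('U35')

# The unicode dtype uses 4 bytes per character.
sizes = (dt_void.itemsize, dt_bytes.itemsize, dt_text.itemsize)
```

The item sizes come out as 10, 35 and 140 bytes respectively.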

shoyer · 2017-04-14T04:41:48Z

doc/source/reference/arrays.dtypes.rst

@@ -178,7 +178,7 @@ Built-in Python types
    :class:`bool`     :class:`bool\_`
    :class:`float`    :class:`float\_`
    :class:`complex`  :class:`cfloat`
-    :class:`str`      :class:`string`
+    :class:`str`      :class:`bytes` (Python2) or :class:`unicode\_` (Python3)


I think it's important to include str in this table, given how common it is to see dtype=str.

eric-wieser · 2017-04-14T09:57:55Z

Can we add a row to that table for bytes as well?

juliantaylor · 2017-04-14T10:03:08Z

added bytes

eric-wieser · 2017-04-14T10:08:11Z

Is np.string_ still a thing? If so, perhaps we should leave that in the table as something like np.string_, an alias for bytes_ (py2) and unicode_ (py3); otherwise we lose the documentation for what np.string_ actually is.

juliantaylor · 2017-04-14T10:34:58Z

np.string_ is always bytes. It would be nice to deprecate that symbol if possible, but we probably can't, for the same reason we can't deprecate 'S' and 'a': it's used too much.

charris · 2017-04-14T16:12:37Z

doc/source/reference/arrays.dtypes.rst

    :class:`unicode`  :class:`unicode\_`
    :class:`buffer`   :class:`void`
    (all others)      :class:`object_`
    ================  ===============

+    Note that ``str`` refers to either null terminated bytes or unicode strings


If we are talking about array types, the unicode strings are null terminated UTF-32.

A string kind of implies that it is terminated in some way, bytes does not.
We could also clarify this for strings better but it dilutes how bad the bytes type is a bit.
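A minimal sketch of the termination behavior being discussed, assuming Python 3 and NumPy (the sample values are made up):

```python
import numpy as np

# 'S' pads with NUL bytes in memory and strips trailing NULs on access,
# so trailing zero bytes in the input cannot be recovered from an element.
s = np.array([b"ab\x00\x00"], dtype="S4")
element = s[0]
raw_storage = s.tobytes()

# 'U' stores 4 bytes per character (UTF-32) and is padded the same way.
u = np.array(["ab"], dtype="U4")
u_itemsize = u.dtype.itemsize
```

The element reads back as `b"ab"` even though the underlying storage still holds four bytes, and the `U4` item occupies 16 bytes.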

mhvk · 2017-04-15T17:40:13Z

This looks good. One question, mostly relevant only for the addition of "not recommended": are there still any plans to add an "encoding" option to S? It would make the type useful again...

mhvk · 2017-04-15T17:41:53Z

p.s. See astropy/astropy#5700 for efforts to add an encoding for the ndarray subclass Column (obviously, easier done in a subclass in python than in C).

juliantaylor · 2017-04-15T17:47:03Z

Yes, I wanted to look into adding an encoded byte type for the next release. I was thinking about utf8 for the convenience of automatically getting a relatively compact encoding from unknown strings. The disadvantage would be that it is hard to predict when adding a string to an existing array will truncate.
But I first wanted to check what is involved in adding a new dtype, what type of encoding (or user selectable similar to datetime) we use can be decided later.
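To illustrate the truncation concern in plain Python (no such NumPy dtype is assumed to exist here; `WIDTH` and `fits_utf8` are hypothetical names used only for this sketch):

```python
# Hypothetical fixed item width in bytes, chosen only for illustration.
WIDTH = 4


def fits_utf8(text, width=WIDTH):
    """Return True if text encodes to at most `width` UTF-8 bytes."""
    return len(text.encode("utf-8")) <= width


# 'abcd', 'a-umlaut bcd', and three CJK characters.
samples = ["abcd", "\u00e4bcd", "\u65e5\u672c\u8a9e"]
char_lengths = [len(s) for s in samples]
byte_lengths = [len(s.encode("utf-8")) for s in samples]
fit_flags = [fits_utf8(s) for s in samples]
```

Strings of 4, 4 and 3 characters need 4, 5 and 9 bytes, so the character count alone does not tell you whether a value fits in a fixed-width UTF-8 field.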

charris · 2017-04-15T18:28:44Z

Make it latin-1, which is just the first 256 code points and should work well with legacy scientific data. UTF-8 is a different animal altogether and for that I think we should have a ragged string type, basically an object array.
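The "first 256 code points" property can be checked in plain Python; this is only a sketch of the encoding itself, not of any proposed NumPy type:

```python
# Every byte value maps to the code point with the same number, and back.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]
round_trip = decoded.encode("latin-1")

# Anything above U+00FF cannot be represented at all.
try:
    "\u65e5".encode("latin-1")
    encodes_cjk = True
except UnicodeEncodeError:
    encodes_cjk = False
```

This shows the trade-off: latin-1 gives exactly one byte per character, at the cost of rejecting every character outside that range.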

eric-wieser · 2017-04-16T22:01:54Z

doc/source/reference/c-api.types-and-structures.rst

@@ -218,7 +218,7 @@ PyArrayDescr_Type
    interface typestring notation). A 'b' represents Boolean, a 'i'
    represents signed integer, a 'u' represents unsigned integer, 'f'
    represents floating point, 'c' represents complex floating point, 'S'
-    represents 8-bit character string, 'U' represents 32-bit/character
+    represents 8-bit zero-terminated bytes , 'U' represents 32-bit/character


nit: extra space

eric-wieser · 2017-04-16T22:02:57Z

doc/source/reference/arrays.dtypes.rst

+    remain zero terminated bytes and ``np.string_`` continues to map to
+    ``np.bytes_``.
+    To use actual strings in Python 3 use ``U`` or ``np.unicode_``.
+    For bytes that do not need zero termination ``i1`` or ``B`` can be used.


Note that u1 == B, and i1 == b, so we should probably decide whether to recommend signed or unsigned here, rather than suggesting both with different syntaxes. I'd err on the side of u1/B.

hm the above table is wrong, it lists b as boolean but it is ?. Updated adding boolean and unsigned bytes.
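A quick sketch of the type-code equivalences mentioned in this thread, assuming Python 3 and NumPy (the sample buffer is invented):

```python
import numpy as np

# 'B' is the unsigned byte, 'b' the signed byte, '?' the boolean.
unsigned_same = np.dtype("B") == np.dtype("u1")
signed_same = np.dtype("b") == np.dtype("i1")
bool_same = np.dtype("?") == np.dtype(bool)

# Unlike 'S', a u1 array keeps every byte, including zeros.
raw = np.frombuffer(b"ab\x00\x00", dtype="u1")
values = raw.tolist()
```

All four bytes survive in `values`, whereas an `'S4'` element would drop the trailing zeros when read back.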

rainwoodman · 2017-04-17T05:42:08Z

As a non-latin language speaker, I'll throw in my two cents.

I'd recommend using UTF-8 unless there is good reason against it.

- The default encoding of byte strings appears to be UTF-8 in Python 3 (https://docs.python.org/3/library/stdtypes.html#str.encode).
- The benefits of UTF-8 are nicely summarized in https://docs.python.org/3/howto/unicode.html#encodings: it is deterministic and robust, and also a superset of ASCII.
- In addition, UTF-8 was settled on as the default encoding on most Linux platforms for good reasons. I recall reading about the debate many years ago; as I remember it, about 10 years ago the Linux community in China started switching to UTF-8 from the legacy GBK/GB18030 encodings. Even the electronic bulletin board system (BBS) of Peking University now has a UTF-8 interface; 10 years ago it was GBK/Big5 only.
- While researchers in the West may be able to get away with latin-1, what about those in the rest of the world who use non-Latin characters in their research? Defaulting to UTF-8 will spare those people real trouble. At least UTF-8 is consistent with how Python deals with byte strings and will cause the fewest surprises.
- There is no real worry about strlen confusion: people in the multi-byte string world are used to strlen never giving the correct number of characters anyway; that is taught in textbooks.

shoyer · 2017-04-17T06:17:30Z

As a point of reference, HDF5 supports two character sets (UTF-8 and ASCII), both of which come in fixed and variable lengths. Given that ASCII is a subset of UTF-8, I think we can get away with only UTF-8.


eric-wieser · 2017-04-17T08:22:03Z

I'd recommend using UTF-8 unless there is good reason against it.

Referring to the encoding argument added in this PR, or this proposed new string type suggested above?

I don't think we can switch the default encoding of text io from latin1 to utf8 without breaking compatibility, sadly

shoyer · 2017-04-17T17:32:28Z

Referring to the encoding argument added in this PR, or this proposed new string type suggested above?

Both?

I don't think we can switch the default encoding of text io from latin1 to utf8 without breaking compatibility, sadly

np.load defaults to encoding='ASCII': https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html

eric-wieser · 2017-04-18T12:13:35Z

doc/source/reference/arrays.dtypes.rst

-   ``'b'``            boolean
+   ``'?'``            boolean
+   ``'b'``            (signed) byte
+   ``'B'``            unsigned byte


If we really want completeness here, then we support (most of?) the struct pack codes too (https://docs.python.org/2/library/struct.html#format-characters). Certainly, bBhHiIlLqQ are all accepted.

Yes, but it is IMO not a good idea; the struct typecodes are the poor old C types, where it is unknown what they actually mean.
The numpy ones i1, i2, i4, i8 are explicit and should be used instead.
So technically b and B should also not be in there, but they do have some relevance with regard to the S type.

The numpy ones i1, i2, i4, i8 are explicit and should be used instead.

Yep, I completely agree.

Also, it seems p and P are things, which I guess correspond to intp and uintp? These have different meanings in struct.pack, which is potentially confusing

Ah, that's documented in array.scalars.rst. Perhaps this table should link there instead?

Ah, that file still needs some string updates.
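A small sketch of the point about explicit sizes, assuming Python 3 and NumPy. The C `long` code `'l'` is used here only as an example of a platform-dependent struct-style code:

```python
import numpy as np

# The explicit codes state their size in bytes directly.
explicit_sizes = {code: np.dtype(code).itemsize for code in ("i1", "i2", "i4", "i8")}

# A struct-style code follows the platform's C type instead;
# C long is 4 bytes on some platforms and 8 on others.
c_long_size = np.dtype("l").itemsize
```

`explicit_sizes` is the same everywhere, while `c_long_size` depends on the platform, which is the argument for preferring `i1`, `i2`, `i4` and `i8` in documentation.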

eric-wieser · 2017-04-21T22:02:54Z

Unless you've got anything you still think you need to fix @juliantaylor, I think this is good to merge?


juliantaylor · 2017-04-22T10:16:22Z

should be good

eric-wieser · 2017-04-22T10:25:02Z

Thanks @juliantaylor. I'm sure that there'll always be more improvements to make here, but this is a good start

eric-wieser reviewed Apr 13, 2017

shoyer approved these changes Apr 14, 2017

juliantaylor force-pushed the string-doc branch 5 times, most recently from 2574cc4 to b851c92 Compare April 14, 2017 09:48

juliantaylor force-pushed the string-doc branch 2 times, most recently from 3648e4d to 220e1a3 Compare April 14, 2017 10:02

charris reviewed Apr 14, 2017

charris added the 04 - Documentation label Apr 14, 2017

juliantaylor added this to the 1.13.0 release milestone Apr 14, 2017

eric-wieser approved these changes Apr 16, 2017

juliantaylor force-pushed the string-doc branch 2 times, most recently from 4249c30 to 4724993 Compare April 18, 2017 12:07

eric-wieser reviewed Apr 18, 2017

juliantaylor force-pushed the string-doc branch from 4724993 to 960d4eb Compare April 18, 2017 13:03

DOC: stop refering to 'S' dtype as string

0107956

juliantaylor force-pushed the string-doc branch from 960d4eb to 0107956 Compare April 22, 2017 10:16

eric-wieser merged commit cb640fa into numpy:master Apr 22, 2017

charris changed the title ~~DOC: stop refering to 'S' dtype as string~~ DOC: stop referring to 'S' dtype as string Dec 12, 2017
