DOC: stop referring to 'S' dtype as string #8942
@@ -178,7 +178,7 @@ Built-in Python types
 :class:`bool` :class:`bool\_`
 :class:`float` :class:`float\_`
 :class:`complex` :class:`cfloat`
-:class:`str` :class:`string`
+:class:`str` :class:`bytes` (Python2) or :class:`unicode\_` (Python3)
Shouldn't this be `bytes_`?
How about not listing `str` at all, and having
:class:`bytes` :class:`bytes_`
Perhaps a note after the table that `str` in python is just an alias for either `bytes` or `unicode`.
I think it's important to include `str` in this table, given how common it is to see `dtype=str`.
I agree it should be kept so people can look up what it means. I have added a note in the section.
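For readers looking up what `dtype=str` means, here is a minimal sketch of the mapping discussed above. It assumes Python 3 and a reasonably recent NumPy; the variable names are illustrative only.

```python
import numpy as np

# On Python 3 the built-in str maps to the unicode dtype (kind 'U'),
# while the built-in bytes maps to the zero-terminated bytes dtype (kind 'S').
str_dtype = np.dtype(str)
bytes_dtype = np.dtype(bytes)
print(str_dtype.kind, bytes_dtype.kind)

# dtype=str therefore produces a fixed-width unicode array on Python 3.
arr = np.array(["spam", "eggs"], dtype=str)
print(arr.dtype.kind, arr.dtype.itemsize)
```

On Python 2 the same `dtype=str` spelling would have selected the bytes dtype instead, which is the ambiguity the note in the table is meant to flag.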
@@ -233,7 +233,7 @@ Array-protocol type strings (see :ref:`arrays.interface`)
 ``'m'`` timedelta
 ``'M'`` datetime
 ``'O'`` (Python) objects
-``'S'``, ``'a'`` (byte-)string
+``'S'``, ``'a'`` (byte-)string (not recommended)
Probably worth unparenthesizing the (byte), and putting the word "string" after unicode on the line below, to stop people who are looking for string picking "S"?
Also, any way to link this to the admonition below?
Done, and added a reference label; it can be linked with :ref:`title <label>`.
remain zero terminated bytes and ``np.string_`` continues to map to
``np.bytes_``.
To use actual strings in Python 3 use ``U`` or ``np.unicode_``.
For bytes that do not need zero termination ``i1`` can be used.
Or just `B`, which is a better mnemonic for `byte` anyway, and is unsigned.
@@ -298,7 +307,6 @@ Type strings
 .. admonition:: Example

 >>> dt = np.dtype((void, 10)) # 10-byte wide data block
->>> dt = np.dtype((str, 35)) # 35-character string
I'd be tempted to keep this, but using `bytes` instead. Also, `void` is a NameError on the line above?
The bytes dtype is kind of useless in Python 3; I don't think it is worth mentioning.
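For reference, a sketch of how the `(flexible_dtype, itemsize)` forms under discussion behave when the names are qualified. It assumes `import numpy as np` on Python 3, and the variable names are illustrative.

```python
import numpy as np

# A bare `void` raises NameError unless it was imported; np.void works.
dt_void = np.dtype((np.void, 10))   # 10-byte wide data block
dt_str = np.dtype((str, 35))        # 35-character unicode string on Python 3
dt_bytes = np.dtype((bytes, 35))    # 35 zero-terminated bytes

print(dt_void, dt_void.itemsize)
print(dt_str, dt_str.itemsize)
print(dt_bytes, dt_bytes.itemsize)
```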
2574cc4
to
b851c92
Compare
Can we add a row to that table for |
3648e4d
to
220e1a3
Compare
added bytes
Is |
:class:`unicode` :class:`unicode\_`
:class:`buffer` :class:`void`
(all others) :class:`object_`
================ ===============

Note that ``str`` refers to either null terminated bytes or unicode strings
If we are talking about array types, the unicode strings are null terminated UTF-32.
A string kind of implies that it is terminated in some way, bytes does not.
We could also clarify this for strings better but it dilutes how bad the bytes type is a bit.
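The point about unicode arrays being null-padded UTF-32 can be checked directly. This is a sketch assuming a recent NumPy; the explicit `<` byte order is chosen only to make the comparison deterministic.

```python
import numpy as np

# A little-endian unicode dtype with room for 5 characters.
dt = np.dtype("<U5")
print(dt.itemsize)  # 4 bytes per character

# The buffer holds the UTF-32 code units, padded with zero bytes
# up to the fixed item size.
raw = np.array(["abc"], dtype=dt).tobytes()
expected = "abc".encode("utf-32-le") + b"\x00" * 8
print(raw == expected)
```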
This looks good. One question, mostly relevant only for the addition of "not recommended": are there still any plans to add an "encoding" option to |
p.s. See astropy/astropy#5700 for efforts to add an encoding for the ndarray subclass |
Yes, I wanted to look into adding an encoded byte type for the next release. I was thinking about UTF-8 for the convenience of automatically getting a relatively compact encoding from unknown strings. The disadvantage would be that it is hard to predict when adding a string to an existing array will truncate.
Make it latin-1, which is just the first 256 code points and should work well with legacy scientific data. UTF-8 is a different animal altogether and for that I think we should have a ragged string type, basically an object array.
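The trade-off between the two encodings can be sketched in plain Python. This is not the proposed NumPy type, only an emulation in which slicing stands in for storing into a fixed-width field; `width` and the sample strings are assumptions made for the example.

```python
# "naive" with a diaeresis: 5 characters.
text = "na\u00efve"
latin1 = text.encode("latin-1")   # one byte per character
utf8 = text.encode("utf-8")       # the non-ASCII character takes two bytes

width = 5  # hypothetical fixed field width in bytes
fits_latin1 = len(latin1) <= width
fits_utf8 = len(utf8) <= width
print(fits_latin1, fits_utf8)

# Truncating UTF-8 at a byte boundary can cut a character in half.
cjk = "\u65e5\u672c\u8a9e".encode("utf-8")  # 3 characters, 9 bytes
try:
    cjk[:4].decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False

# latin-1 avoids that problem but cannot represent the text at all.
try:
    "\u65e5\u672c\u8a9e".encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(truncation_ok, latin1_ok)
```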
@@ -218,7 +218,7 @@ PyArrayDescr_Type
 interface typestring notation). A 'b' represents Boolean, a 'i'
 represents signed integer, a 'u' represents unsigned integer, 'f'
 represents floating point, 'c' represents complex floating point, 'S'
-represents 8-bit character string, 'U' represents 32-bit/character
+represents 8-bit zero-terminated bytes , 'U' represents 32-bit/character
nit: extra space
remain zero terminated bytes and ``np.string_`` continues to map to
``np.bytes_``.
To use actual strings in Python 3 use ``U`` or ``np.unicode_``.
For bytes that do not need zero termination ``i1`` or ``B`` can be used.
Note that `u1 == B`, and `i1 == b`, so we should probably decide whether to recommend signed or unsigned here, rather than suggesting both with different syntaxes. I'd err on the side of `u1`/`B`.
Hm, the above table is wrong: it lists `b` as boolean but it is `?`. Updated, adding boolean and unsigned bytes.
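The equivalences mentioned in this thread are easy to confirm. The sketch below assumes a recent NumPy, and the variable names are illustrative.

```python
import numpy as np

# One-character codes and their explicitly sized equivalents.
unsigned_same = np.dtype("B") == np.dtype("u1")
signed_same = np.dtype("b") == np.dtype("i1")
bool_same = np.dtype("?") == np.dtype(bool)
print(unsigned_same, signed_same, bool_same)

# Source of the confusion: as a type code 'b' is a signed byte,
# but as a dtype *kind* 'b' means boolean.
byte_kind = np.dtype("b").kind
bool_kind = np.dtype("?").kind
print(byte_kind, bool_kind)
```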
As a non-latin language speaker, I'll throw in my two cents. I'd recommend using UTF-8 unless there is good reason against it.
As a point of reference, HDF5 supports two character sets (UTF-8 and
ASCII), both of which come in fixed and variable lengths. Given that ASCII
is a subset of UTF-8, I think we can get away with only UTF-8.
…On Sun, Apr 16, 2017 at 10:42 PM Yu Feng ***@***.***> wrote:
As a non-latin language speaker, I'll throw in my two cents.
I'd recommend using UTF-8 unless there is good reason against it.
- The default encoding of byte strings appears to be UTF-8 in Python 3 (https://docs.python.org/3/library/stdtypes.html#str.encode).
- The benefits of UTF-8 are nicely summarized in https://docs.python.org/3/howto/unicode.html#encodings: it is deterministic and robust, and also a superset of ASCII.
- In addition, UTF-8 was settled on as the default encoding on most Linux platforms for good reasons. I recall reading about the debate on this many years ago; as I remember it, 10 years ago the Linux community in China had started switching to UTF-8 from the legacy GBK/GB18030 encodings. Even the electronic bulletin board system (BBS) of Peking University now has a UTF-8 encoding interface; it was GBK/Big5 only 10 years ago.
- While researchers in the West may be able to get away with latin-1, what about those in the rest of the world, using non-latin characters in their research? Defaulting to UTF-8 will spare those people real trouble; at least UTF-8 is consistent with how Python deals with byte strings and will cause the fewest surprises.
- No real worry about strlen confusion: people in the multi-byte string world are used to the fact that strlen is never supposed to give the correct number of characters anyway; that is taught in textbooks.
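Two of the claims in the list above can be checked with plain Python 3. This is a sketch with made-up sample strings.

```python
# str.encode() with no argument uses UTF-8 on Python 3.
default_is_utf8 = "dtype".encode() == "dtype".encode("utf-8")

# ASCII text encodes to the same bytes under ASCII and UTF-8.
ascii_compatible = "dtype".encode("ascii") == "dtype".encode("utf-8")
print(default_is_utf8, ascii_compatible)

# For non-ASCII text the byte length differs from the character count,
# which is the "strlen" caveat: 4 characters here, 3 bytes each in UTF-8.
text = "\u5317\u4eac\u5927\u5b66"
n_chars = len(text)
n_bytes = len(text.encode("utf-8"))
print(n_chars, n_bytes)
```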
Referring to the encoding argument added in this PR, or this proposed new string type suggested above? I don't think we can switch the default encoding of text io from latin1 to utf8 without breaking compatibility, sadly.
Both?
4249c30
to
4724993
Compare
``'b'`` boolean
``'?'`` boolean
``'b'`` (signed) byte
``'B'`` unsigned byte
If we really want completeness here, then we support (most of?) the struct pack codes too (https://docs.python.org/2/library/struct.html#format-characters). Certainly, `bBhHiIlLqQ` are all accepted.
Yes, but it is IMO not a good idea; the struct typecodes are the poor old C types, where it is unknown what they actually mean.
The numpy ones i1, i2, i4, i8 are explicit and should be used instead.
So technically `b` and `B` should also not be in there, but they do have some relevance with regard to the `S` type.
> The numpy ones i1, i2, i4, i8 are explicit and should be used instead.
Yep, I completely agree.
Also, it seems `p` and `P` are things, which I guess correspond to `intp` and `uintp`? These have different meanings in `struct.pack`, which is potentially confusing.
Ah, that's documented in `array.scalars.rst`. Perhaps this table should link there instead?
Ah, that file still needs some string updates.
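To make the thread's distinction concrete, here is a sketch contrasting the sized codes with the C-style ones and showing how `p`/`P` differ between NumPy and `struct`. It assumes a recent NumPy on a common 32- or 64-bit platform; the variable names are illustrative.

```python
import struct

import numpy as np

# Sized codes state their width in bytes, independent of the platform.
sized = {code: np.dtype(code).itemsize for code in ("i1", "i2", "i4", "i8")}
print(sized)

# C-style codes follow the platform's C types: 'l' (C long) is commonly
# 8 bytes on 64-bit Linux and macOS but 4 bytes on Windows.
long_size = np.dtype("l").itemsize
print(long_size)

# In NumPy, 'p' and 'P' are pointer-sized signed and unsigned integers.
ptr_signed = np.dtype("p")
ptr_unsigned = np.dtype("P")
print(ptr_signed.kind, ptr_unsigned.kind, ptr_signed.itemsize)

# In struct, 'p' is a length-prefixed (Pascal) string instead.
pascal = struct.pack("3p", b"ab")
print(pascal)
```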
4724993
to
960d4eb
Compare
Unless you've got anything you still think you need to fix @juliantaylor, I think this is good to merge?
The S dtype is zero terminated bytes which happen to match what Python 2 called strings. As this is not the case in Python 3 we should stop naming it wrong in our documentation. [ci skip]
960d4eb
to
0107956
Compare
should be good
Thanks @juliantaylor. I'm sure that there'll always be more improvements to make here, but this is a good start.