Skip to content

mb_str_split without any mbfl modifications PR3715 related #3808

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 18 commits into from

Conversation

legale
Copy link
Contributor

@legale legale commented Feb 9, 2019

mb_str_split without any mbfl modifications
This PR is related to #3715

@nikic nikic self-requested a review February 9, 2019 09:12
@legale legale force-pushed the master branch 4 times, most recently from ff56509 to a501664 Compare February 9, 2019 19:07
mbniebergall and others added 3 commits February 9, 2019 20:28
@legale
Copy link
Contributor Author

legale commented Feb 9, 2019

Why windows tests are so unstable???

@petk petk added the Feature label Feb 10, 2019
This reverts commit b743eab.
@legale
Copy link
Contributor Author

legale commented Feb 11, 2019

@nikic, @guilliamxavier, @IMSoP
The test fails, but the reason is deeper. Unusual situation. Thing is that it does not pass the test for an incorrect utf16 string.

https://www.ietf.org/rfc/rfc2781.txt
RFC2781 about UTF-16 says:

2.2 Decoding UTF-16
   Decoding of a single character from UTF-16 to an ISO 10646 character
   value proceeds as follows. Let W1 be the next 16-bit integer in the
   sequence of integers representing the text. Let W2 be the (eventual)
   next integer following W1.
   1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
      of W1. Terminate.
   **2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
      is in error and no valid character can be obtained using W1.
      Terminate.**
   3) If there is no W2 (that is, the sequence ends with W1), or if W2
      is not between 0xDC00 and 0xDFFF, the sequence is in error.
      Terminate.
   4) Construct a 20-bit unsigned integer U', taking the 10 low-order
      bits of W1 as its 10 high-order bits and the 10 low-order bits of
      W2 as its 10 low-order bits.

So there is no valid character for the 0xDC00 (first low surrogate). I think that in such cases, decoder should skip the character as it is, but the mbfl library decoder changes it.

@nikic
Copy link
Member

nikic commented Feb 12, 2019

@legale As mentioned on https://bugs.php.net/bug.php?id=77607, invalid character being replaced with ? or some other substitution character is expected behavior here (the problem is actually that the same does not happen for UTF-8 due to the table optimization, but that's a much more general issue than just this function).

@legale
Copy link
Contributor Author

legale commented Feb 12, 2019 via email

@nikic
Copy link
Member

nikic commented Feb 12, 2019

@legale Yes ;)

@nikic
Copy link
Member

nikic commented Feb 12, 2019

When compiling this just now, I got a bunch of warnings:

/home/nikic/php-7.4/ext/mbstring/mbstring.c: In function ‘mbfl_split_output’:
/home/nikic/php-7.4/ext/mbstring/mbstring.c:2308:54: warning: pointer targets in passing argument 2 of ‘add_next_index_stringl’ differ in signedness [-Wpointer-sign]
         add_next_index_stringl(params->return_value, chunk->val, chunk->len); /* add chunk to the array */
                                                      ^~~~~
In file included from /home/nikic/php-7.4/main/php.h:37:0,
                 from /home/nikic/php-7.4/ext/mbstring/mbstring.c:27:
/home/nikic/php-7.4/Zend/zend_API.h:429:14: note: expected ‘const char *’ but argument is of type ‘unsigned char *’
 ZEND_API int add_next_index_stringl(zval *arg, const char *str, size_t length);
              ^~~~~~~~~~~~~~~~~~~~~~
/home/nikic/php-7.4/ext/mbstring/mbstring.c: In function ‘zif_mb_str_split’:
/home/nikic/php-7.4/ext/mbstring/mbstring.c:2384:35: warning: pointer targets in initialization differ in signedness [-Wpointer-sign]
             char const *chunk_p = p; /* chunk first byte pointer */
                                   ^
/home/nikic/php-7.4/ext/mbstring/mbstring.c:2465:50: warning: pointer targets in passing argument 2 of ‘add_next_index_stringl’ differ in signedness [-Wpointer-sign]
             add_next_index_stringl(return_value, p, chunk_len);
                                                  ^
In file included from /home/nikic/php-7.4/main/php.h:37:0,
                 from /home/nikic/php-7.4/ext/mbstring/mbstring.c:27:
/home/nikic/php-7.4/Zend/zend_API.h:429:14: note: expected ‘const char *’ but argument is of type ‘const unsigned char *’
 ZEND_API int add_next_index_stringl(zval *arg, const char *str, size_t length);
              ^~~~~~~~~~~~~~~~~~~~~~
/home/nikic/php-7.4/ext/mbstring/mbstring.c:2467:46: warning: pointer targets in passing argument 2 of ‘add_next_index_stringl’ differ in signedness [-Wpointer-sign]
         add_next_index_stringl(return_value, p, last - p);
                                              ^
In file included from /home/nikic/php-7.4/main/php.h:37:0,
                 from /home/nikic/php-7.4/ext/mbstring/mbstring.c:27:
/home/nikic/php-7.4/Zend/zend_API.h:429:14: note: expected ‘const char *’ but argument is of type ‘const unsigned char *’
 ZEND_API int add_next_index_stringl(zval *arg, const char *str, size_t length);
              ^~~~~~~~~~~~~~~~~~~~~~

@nikic nikic mentioned this pull request Feb 12, 2019
@legale
Copy link
Contributor Author

legale commented Feb 12, 2019

@nikic Did we've finished?

@nikic
Copy link
Member

nikic commented Feb 12, 2019

Merged as d77ad27 into 7.4+, with some style fixes. In particular, note that C code in php-src uses tab indentation, not space indentation.

Thanks for seeing this through :)

@nikic nikic closed this Feb 12, 2019
@kamil-tekiela
Copy link
Member

Thank you for this RFC. It is the true answer to my Stack Overflow question: https://stackoverflow.com/questions/55782088/convert-a-string-into-an-array-of-characters-multi-byte
It has passed my UTF-8 test according to my expectations: https://3v4l.org/M85Fi/rfc#output

@legale
Copy link
Contributor Author

legale commented Apr 28, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants