mb_str_split() added #3715
Co-Authored-By: legale <[email protected]>
Since the function is part of ext/mbstring, it should support all MB character encodings. As it is now, it only supports UTF-8. That said, I wonder whether splitting Unicode strings is actually useful unless grapheme clusters are taken into account. It seems that this could be accomplished in userland quite simply and efficiently by using grapheme_extract().
To support other encodings, it might be useful to take a look at some of the libmbfl functions, such as mbfl_strlen: https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfilter.c#L658 As you can see, there are basically three cases. First, fixed-length encodings, where you can simply collect characters in groups of 1, 2 or 4 bytes. Second, UTF-8, which is the mblen_table case and corresponds to what you implemented here. Finally, the use of an mbfl_convert_filter, which handles the case of arbitrary encodings; in this case the filter will feed characters into a callback, from which you can push into an array.
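As a rough illustration of the second case, here is a minimal, self-contained sketch of splitting a UTF-8 buffer into chunks of N characters by looking only at lead bytes. It is not the code from this pull request and does not use libmbfl: `utf8_seq_len` and `utf8_split_offsets` are hypothetical names, and the sequence length is computed from bit patterns rather than read from libmbfl's mblen_table.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of a UTF-8 sequence, derived from its lead byte.
 * Invalid lead bytes (stray continuation bytes, 0xF8..0xFF) are treated
 * as length 1 so that the scan always makes progress. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;           /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 1;
}

/* Hypothetical helper: records the byte offset at which each chunk of
 * `split_len` characters starts in `s` (a UTF-8 buffer of `len` bytes).
 * Writes at most `max_chunks` offsets and returns the total chunk count. */
static size_t utf8_split_offsets(const unsigned char *s, size_t len,
                                 size_t split_len,
                                 size_t *offsets, size_t max_chunks)
{
    size_t pos = 0, chunks = 0, chars_in_chunk = 0;

    assert(s != NULL && split_len > 0);

    while (pos < len) {
        if (chars_in_chunk == 0) {
            /* A new chunk starts at the current byte offset. */
            if (chunks < max_chunks) offsets[chunks] = pos;
            chunks++;
        }
        size_t step = utf8_seq_len(s[pos]);
        if (step > len - pos) step = len - pos; /* truncated trailing sequence */
        pos += step;
        if (++chars_in_chunk == split_len) chars_in_chunk = 0;
    }
    return chunks;
}
```

A caller would then copy the bytes between consecutive offsets into separate array elements. The fixed-width case is simpler still (the step is a constant), while arbitrary encodings need a decoder such as the convert filter mentioned above.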
I don't see the immediate use either, but I guess it would make some sense from an API parity point of view, if nothing else. It's also worth noting that the case where you want to split by single codepoints is already available through |
Thanks. I've now set up a user-level ~/.gitignore.
On Mon, 24 Dec 2018, 23:17, Christoph M. Becker commented on this pull request, in .gitignore:
> !/ext/bcmath/libbcmath/src/config.h
!/ext/fileinfo/libmagic/config.h
!/ext/fileinfo/libmagic.patch
!/ext/fileinfo/magicdata.patch
!/ext/mbstring/oniguruma.patch
!/ext/pcre/pcre2lib/config.h
!/win32/build/Makefile
+
+# ------------------------------------------------------------------------------
+# Clion IDE files
+# ------------------------------------------------------------------------------
+.idea/**
What was meant in the previous comment was to omit this change :)
ACK. I suggest configuring the Git client to ignore IDE files; this can
also be done globally (i.e. for all Git repos on the machine).
I’m sure that there are many other ways to split a multibyte string at the moment, preg_split() for example. But specialized functions are usually more efficient than non-specialized ones. We have str_split() now, but no multibyte analog.
The function you are proposing isn't a true multibyte analog, though, since it could only handle UTF-8, but not UTF-16, SJIS, BIG-5 etc.
I'll try to add support for additional encodings.
New review needed.
@legale Thanks! For consistency with other MBString functions there should be a final optional $encoding parameter, to allow overriding the internal encoding for a single function call.
Thanks for your kind words. This is my first collaboration experience.
Something wrong with this test on windows: |
@legale This test fails fairly often. I'm not sure why, but it is almost certainly unrelated to this pull request. I'll have a closer look at the pull request as soon as possible.
I've changed |
New review required. |
Ok, thank you.
On Tue, Jan 15, 2019, 15:09, agares commented on this pull request, in ext/mbstring/mbstring.c:
> + /* count mb_len */
+ filter = mbfl_convert_filter_new(
+ string.encoding,
+ &mbfl_encoding_wchar,
+ filter_count_output,
+ 0,
+ &mb_len);
+ if (filter == NULL){
+ mbfl_convert_filter_delete(filter);
+ RETURN_FALSE; /* something wrong with the filter */
+ }
+ while (p < last) { /* cycle each byte with callback function */
+ (*filter->filter_function)(*p++, filter);
+ }
+ mbfl_convert_filter_delete(filter);
+ /* count mb_len end */
It is NOT a bug. The number of code points in a UTF-16 string is not equal to
numberOfBytes/2.
Some characters are encoded as two 16-bit units (surrogate pairs).
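To make the surrogate-pair point concrete, here is a small self-contained sketch that counts code points in a UTF-16LE buffer. It is an illustration only, not the filter-based code quoted above: `utf16le_count_codepoints` is a hypothetical helper, and its handling of malformed input (counting each unpaired surrogate or dangling byte as one character) is an assumption, not libmbfl's behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a UTF-16LE byte buffer.
 * A high surrogate (0xD800..0xDBFF) followed by a low surrogate
 * (0xDC00..0xDFFF) counts as one code point; an unpaired surrogate or a
 * trailing odd byte is counted as one (invalid) character. */
static size_t utf16le_count_codepoints(const unsigned char *s, size_t len)
{
    size_t pos = 0, count = 0;

    assert(s != NULL || len == 0);

    while (pos + 1 < len) {
        /* Assemble one little-endian 16-bit code unit. */
        uint16_t unit = (uint16_t)(s[pos] | (s[pos + 1] << 8));
        pos += 2;
        if (unit >= 0xD800 && unit <= 0xDBFF && pos + 1 < len) {
            uint16_t next = (uint16_t)(s[pos] | (s[pos + 1] << 8));
            if (next >= 0xDC00 && next <= 0xDFFF) pos += 2; /* consume the pair */
        }
        count++;
    }
    if (pos < len) count++; /* dangling odd byte */
    return count;
}
```

For example, the 8-byte sequence 41 00 34 D8 1E DD 42 00 ("A", U+1D11E, "B") holds four 16-bit units but only three code points.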
ext/mbstring/mbstring.c
Outdated
 * | second scenario: "2- or 4-bytes width encodings UTF-16LE UTF-16BE   |
 * +----------------------------------------------------------------------+
 */
} else if (mbfl_encoding->flag & ( MBFL_ENCTYPE_MWC2LE | MBFL_ENCTYPE_MWC2BE )) {
If we want to have this kind of optimization, it should be part of libmbfl, rather than only this one function.
Personally, I don't think this is necessary. Thankfully UTF-16 is not an important encoding in PHP and it's not an issue if it does not have a very fast implementation.
I could try to implement it in libmbfl if more experienced developers approve of this approach. I mean two relatively large tables of 65k each.
This approach is at least twice as fast as the libmbfl approach.
Do I need to create a new RFC for this?
…bmbfl/mbfl/ reverted to master branch commit ecd533d
I've implemented the UTF-16 optimization for the whole mbfl library.
Closing this PR in favor of #3808.
RFC: https://wiki.php.net/rfc/mb_str_split