mb_str_split() added #3715

legale · 2018-12-23T21:37:52Z

RFC: https://wiki.php.net/rfc/mb_str_split

.gitignore

Co-Authored-By: legale <[email protected]>

cmb69 · 2018-12-24T11:15:07Z

Since the function is part of ext/mbstring, it should support all MB character encodings. As it is now, it only supports UTF-8.

That said, I wonder whether splitting Unicode strings is actually useful unless grapheme clusters are taken into account. It seems that this could be accomplished in userland quite simply and efficiently by using grapheme_extract().

nikic · 2018-12-24T12:05:31Z

To support other encodings, it might be useful to take a look at some of the libmbfl functions, such as mbfl_strlen: https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfilter.c#L658

As you can see, there are basically three cases: fixed-length encodings, where you can simply collect characters in groups of 1, 2 or 4 bytes; UTF-8, which is the mblen_table case and corresponds to what you implemented here; and finally the use of an mbfl_convert_filter, which handles arbitrary encodings. In that last case the filter feeds characters into a callback, from which you can push into an array.
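The first two cases can be illustrated with a minimal standalone sketch. It does not use libmbfl; `utf8_char_len` and `char_offsets` are hypothetical names invented for this illustration, and the lead-byte check stands in for a real length table. Invalid lead bytes are simply treated as one-byte units, which is an assumption of the sketch rather than mbstring's behaviour.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of a UTF-8 sequence, derived from its
 * lead byte. Continuation or invalid bytes are treated as 1-byte units
 * (an assumption of this sketch). */
static size_t utf8_char_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Records the start offset of each character in s[0..len).
 * width > 0:  fixed-width encoding, `width` bytes per character.
 * width == 0: UTF-8, stepping by the lead-byte length.
 * Stores at most `max` offsets; returns the total number of characters. */
static size_t char_offsets(const unsigned char *s, size_t len, size_t width,
                           size_t *offsets, size_t max)
{
    size_t pos = 0, count = 0;
    while (pos < len) {
        size_t step = width ? width : utf8_char_len(s[pos]);
        if (count < max) {
            offsets[count] = pos;
        }
        count++;
        pos += step;  /* a truncated final character just ends the loop */
    }
    return count;
}
```

With the offsets in hand, a splitter only has to copy the bytes between consecutive offsets into array elements. The third case (arbitrary encodings) has no such shortcut, which is why it goes through a conversion filter and a callback.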

> This said, I wonder whether splitting Unicode strings is actually useful, unless grapheme clusters are taken into account.

I don't see the immediate use either, but I guess it would make some sense from an API parity point of view, if nothing else.

It's also worth noting that the case where you want to split by single codepoints is already available through IntlBreakIterator::createCodePointInstance(), albeit in the intl rather than mbstring extension.

legale · 2018-12-25T01:15:16Z

Thanks. Now I've set user ~/.gitignore

On Mon, 24 Dec 2018, 23:17 Christoph M. Becker wrote:

> In .gitignore:
>
> !/ext/bcmath/libbcmath/src/config.h
> !/ext/fileinfo/libmagic/config.h
> !/ext/fileinfo/libmagic.patch
> !/ext/fileinfo/magicdata.patch
> !/ext/mbstring/oniguruma.patch
> !/ext/pcre/pcre2lib/config.h
> !/win32/build/Makefile
> +
> +# ------------------------------------------------------------------------------
> +# Clion IDE files
> +# ------------------------------------------------------------------------------
> +.idea/**
>
> What was meant in the previous comment was to omit this change :)
>
> ACK. I suggest to configure the Git client to ignore IDE files; this can also be done globally (i.e. for all Git repos on the machine).

legale · 2018-12-25T02:00:57Z

I'm sure there are many other ways to split a multibyte string at the moment, preg_split() for example. But specialized functions are usually more efficient than general-purpose ones. Now we've got str_split() but haven't got a multibyte analog.

cmb69 · 2018-12-25T12:28:09Z

> Now we've got str_split() but haven't got a multibyte analog.

The function you are proposing isn't a true multibyte analog, though, since it could only handle UTF-8, but not UTF-16, SJIS, BIG-5 etc.

legale · 2018-12-25T12:38:58Z

> > Now we've got str_split() but haven't got a multibyte analog.
>
> The function you are proposing isn't a true multibyte analog, though, since it could only handle UTF-8, but not UTF-16, SJIS, BIG-5 etc.

I'll try to add support for additional encodings.

legale · 2018-12-28T09:19:14Z

New review needed.

cmb69 · 2018-12-28T13:39:58Z

@legale Thanks! For consistency with other MBString functions there should be a final optional $encoding parameter, to allow overriding the internal encoding for a single function call.

legale · 2018-12-28T14:15:29Z

> @legale Thanks! For consistency with other MBString functions there should be a final optional $encoding parameter, to allow overriding the internal encoding for a single function call.

Thanks for your kind words. This is my first collaboration experience.

legale · 2018-12-28T18:02:32Z

Something is wrong with this test on Windows:
Bug #55509 (segfault on x86_64 using more than 2G memory) [C:\projects\php-src\Zend\tests\bug55509.phpt]
I compared the code that passed the test with the code that did not. The only difference is the data types: const char * fails the test and char * passes it. Could this be the cause?

cmb69 · 2018-12-28T18:24:39Z

@legale This test fails quite often. Not sure why, but it is almost certainly unrelated to this pull request.

I'll have a closer look at the pull request as soon as possible.

legale · 2018-12-28T18:35:45Z

> @legale This test fails quite often. Not sure why, but it is almost certainly unrelated to this pull request.
>
> I'll have a closer look at the pull request as soon as possible.

I've changed const char * to char * and the test passed.

ext/mbstring/tests/mb_str_split.phpt

ext/mbstring/mbstring.c

ext/mbstring/tests/mb_str_split_jp.phpt

legale · 2019-01-14T11:36:00Z

New review required.

ext/mbstring/libmbfl/mbfl/mbfilter.h

ext/mbstring/mbstring.c

legale · 2019-01-15T14:16:05Z

Ok, thank you.

On Tue, Jan 15, 2019, 15:09 agares wrote:

> In ext/mbstring/mbstring.c:
>
> +	/* count mb_len */
> +	filter = mbfl_convert_filter_new(
> +		string.encoding,
> +		&mbfl_encoding_wchar,
> +		filter_count_output,
> +		0,
> +		&mb_len);
> +	if (filter == NULL){
> +		mbfl_convert_filter_delete(filter);
> +		RETURN_FALSE; /* something wrong with the filter */
> +	}
> +	while (p < last) { /* cycle each byte with callback function */
> +		(*filter->filter_function)(*p++, filter);
> +	}
> +	mbfl_convert_filter_delete(filter);
> +	/* count mb_len end */
>
> It is NOT a bug. The number of codepoints in a UTF-16 string is not equal to numberOfBytes/2. Some characters might be encoded as two 16-bit units (surrogate pairs).
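The surrogate-pair point can be checked with a small standalone sketch. It is not the PR's code and does not use libmbfl; `utf16le_count` is a hypothetical name, and counting each unpaired surrogate as one code point is an assumption made only to keep the example short.

```c
#include <stddef.h>

/* Counts code points in a UTF-16LE byte buffer.
 * A high surrogate (0xD800-0xDBFF) followed by a low surrogate
 * (0xDC00-0xDFFF) counts as a single code point. Unpaired surrogates count
 * as one each, and a trailing odd byte is ignored (both are assumptions of
 * this sketch). */
static size_t utf16le_count(const unsigned char *s, size_t len)
{
    size_t pos = 0, count = 0;
    while (pos + 1 < len) {
        unsigned unit = s[pos] | ((unsigned)s[pos + 1] << 8);
        pos += 2;
        if (unit >= 0xD800 && unit <= 0xDBFF && pos + 1 < len) {
            unsigned next = s[pos] | ((unsigned)s[pos + 1] << 8);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                pos += 2;  /* consume the low surrogate of the pair */
            }
        }
        count++;
    }
    return count;
}
```

For example, the six bytes 41 00 3D D8 00 DE encode "A" followed by U+1F600 (one surrogate pair), so the function returns 2 while numberOfBytes/2 would give 3.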

ext/mbstring/mbstring.c

nikic · 2019-01-18T12:19:26Z

ext/mbstring/mbstring.c

+	 * | second scenario: "2- or 4-bytes width encodings UTF-16LE UTF-16BE    |
+	 * +----------------------------------------------------------------------+
+	 */
+	} else if (mbfl_encoding->flag & ( MBFL_ENCTYPE_MWC2LE | MBFL_ENCTYPE_MWC2BE )) {


If we want to have this kind of optimization, it should be part of libmbfl, rather than only this one function.

Personally, I don't think this is necessary. Thankfully UTF-16 is not an important encoding in PHP and it's not an issue if it does not have a very fast implementation.

legale · 2019-01-18

I could try to implement it in libmbfl if more experienced developers approve of this approach. I mean two relatively large tables of 65k each.

This way is at least twice as fast as the libmbfl way.

Do I need to create a new RFC for this?

ext/mbstring/libmbfl/mbfl/mbfilter.c

…bmbfl/mbfl/ reverted to master branch commit ecd533d

legale · 2019-01-22T20:34:49Z

I've implemented the UTF-16 optimization for the whole mbfl library.

legale · 2019-02-08T14:10:17Z

@KalleZ,
@derickr,
@nikic,
@Agares,
@rybakit,
@petk,
@cmb69,
@carusogabriel,
Dear colleagues, could one of you merge this PR? Is there something that needs to be done first?

nikic · 2019-02-12T13:03:30Z

Closing this PR in favor of #3808.

legale added 3 commits December 23, 2018 21:35

mb_str_split() added

57e82dc

mb_str_split() error fixed

2b99b9d

mb_str_split() error fixed

47ce704

legale mentioned this pull request Dec 23, 2018

mb_str_split() added #3713

Closed

petk added the Feature label Dec 23, 2018

petk reviewed Dec 23, 2018

.gitignore

petk reviewed Dec 23, 2018

.gitignore

petk added the Waiting on Review label Dec 23, 2018

Update .gitignore

63c8e79

Co-Authored-By: legale <[email protected]>

petk removed the Waiting on Review label Dec 24, 2018

legale added 2 commits December 28, 2018 03:46

new mb_str_split() using libmbfl library functions

bef21b5

new mb_str_split() using libmbfl library functions

91a7309

petk added the Waiting on Review label Dec 28, 2018

mb_str_split() optional argument "encoding" added + minor changes

223a8a9

legale added 3 commits December 28, 2018 16:21

mb_str_split() minor changes in function argument names

79978f7

mb_str_split() tests

cbdf106

minor changes to pass appveyor tests

64cd160

rybakit reviewed Jan 2, 2019

ext/mbstring/tests/mb_str_split.phpt

legale added 2 commits January 10, 2019 21:11

minor tests changes

7f2ce93

rerun test

314619e

nikic reviewed Jan 11, 2019

legale added 6 commits January 13, 2019 15:57

mb_str_split function rewritten completely

e09d115

mbfl collector_substr moved back to static

d9bb662

tests improved

2f412a3

trying to fix a memory leak 1

5a64309

tests minor changes

4b0523f

tests minor changes

f6ee1fa

nikic reviewed Jan 14, 2019

legale added 4 commits January 17, 2019 14:42

refactoring + faster way to parse UTF-16

12c5928

refactoring & more tests

e202945

minor comment changes

65eaec9

minor changes

87824d8

ramonacat reviewed Jan 18, 2019

ext/mbstring/mbstring.c

comments changes

95e0647

nikic reviewed Jan 18, 2019

ext/mbstring/libmbfl/mbfl/mbfilter.c

legale added 3 commits January 18, 2019 23:30

git checkout ecd533d -- ext/mbstring/libmbfl/ path /ext/mbstring/li…

c0b3f57

…bmbfl/mbfl/ reverted to master branch commit ecd533d

UTF-16 parse bug fixed and related test added

f036661

utf-16 optimization

d868059

legale added 2 commits January 22, 2019 22:51

endian.h replaced with brg_endian.h

ad77e03

minor changes + bug fix in php_mb_mbchar_bytes_ex()

2ff7061

legale mentioned this pull request Feb 9, 2019

mb_str_split without any mbfl modifications PR3715 related #3808

Closed

nikic closed this Feb 12, 2019
