Skip to content

mb_str_split() added #3715

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 31 commits into from
Closed

mb_str_split() added #3715

wants to merge 31 commits into from

Conversation

legale
Copy link
Contributor

@legale legale commented Dec 23, 2018

@legale legale mentioned this pull request Dec 23, 2018
@petk petk added the Feature label Dec 23, 2018
Co-Authored-By: legale <[email protected]>
@cmb69
Copy link
Member

cmb69 commented Dec 24, 2018

Since the function is part of ext/mbstring, it should support all MB character encodings. As it is now, it only supports UTF-8.

This said, I wonder whether splitting Unicode strings is actually useful, unless grapheme clusters are taken into account. It seems that this could be accomplished in userland quite simple and efficiently by using grapheme_extract().

@nikic
Copy link
Member

nikic commented Dec 24, 2018

To support other encodings, it might be useful to take a look at some of the libmbfl functions, such as mbfl_strlen: https://github.com/php/php-src/blob/master/ext/mbstring/libmbfl/mbfl/mbfilter.c#L658

As you can see, there are basically three cases: Fixed-length encodings, where you can simply collect characters in groups of 1, 2 or 4. Then UTF-8, which is the mblen_table case and corresponds to what you implemented here, and finally the use of a mbfl_convert_filter, which handles the case of arbitrary encodings. In this case the filter will feed you characters into a callback, from which you can push into an array.

This said, I wonder whether splitting Unicode strings is actually useful, unless grapheme clusters are taken into account.

I don't see the immediate use either, but I guess it would make some sense from an API parity point of view, if nothing else.

It's also worth noting that the case where you want to split by single codepoints is already available through IntlBreakIterator::createCodePointInstance(), albeit in the intl rather than mbstring extension.

@legale
Copy link
Contributor Author

legale commented Dec 25, 2018 via email

@legale
Copy link
Contributor Author

legale commented Dec 25, 2018

I’m sure that there are many other ways to split multibyte string at the moment. preg_split() for example. But specialized functions usually are more efficient than not specilized. Now we’ve got str_split() but haven’t got multibyte analog.

@cmb69
Copy link
Member

cmb69 commented Dec 25, 2018

Now we’ve got str_split() but haven’t got multibyte analog.

The function you are proposing isn't a true multibyte analog, though, since it could only handle UTF-8, but not UTF-16, SJIS, BIG-5 etc.

@legale
Copy link
Contributor Author

legale commented Dec 25, 2018

Now we’ve got str_split() but haven’t got multibyte analog.

The function you are proposing isn't a true multibyte analog, though, since it could only handle UTF-8, but not UTF-16, SJIS, BIG-5 etc.

I'll try do add additional codepages support.

@legale
Copy link
Contributor Author

legale commented Dec 28, 2018

New review needed.

@cmb69
Copy link
Member

cmb69 commented Dec 28, 2018

@legale Thanks! For consistency with other MBString functions there should be a final optional $encoding parameter, to allow to override the internal encoding for a single function call.

@legale
Copy link
Contributor Author

legale commented Dec 28, 2018

@legale Thanks! For consistency with other MBString functions there should be a final optional $encoding parameter, to allow to override the internal encoding for a single function call.

Thanks for your kind words. This is my first collaboration experience.

@legale
Copy link
Contributor Author

legale commented Dec 28, 2018

Something wrong with this test on windows:
Bug #55509 (segfault on x86_64 using more than 2G memory) [C:\projects\php-src\Zend\tests\bug55509.phpt]
I compared the code that passed the test and the one that did not pass it. The only difference is data types. const char * fails the test and char * pass it. Could this be a cause?

@cmb69
Copy link
Member

cmb69 commented Dec 28, 2018

@legale This test is failing not rarely. Not sure why, but almost certainly it is unrelated to this pull request.

I'll have a closer look at the pull request as soon as possible.

@legale
Copy link
Contributor Author

legale commented Dec 28, 2018

@legale This test is failing not rarely. Not sure why, but almost certainly it is unrelated to this pull request.

I'll have a closer look at the pull request as soon as possible.

I've changed const char * to char * and test passed.

@legale
Copy link
Contributor Author

legale commented Jan 14, 2019

New review required.

@legale
Copy link
Contributor Author

legale commented Jan 15, 2019 via email

* | second scenario: "2- or 4-bytes width encodings UTF-16LE UTF-16BE |
* +----------------------------------------------------------------------+
*/
} else if (mbfl_encoding->flag & ( MBFL_ENCTYPE_MWC2LE | MBFL_ENCTYPE_MWC2BE )) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to have this kind of optimization, it should be part of libmbfl, rather than only this one function.

Personally, I don't think this is necessary. Thankfully UTF-16 is not an important encoding in PHP and it's not an issue if it does not have a very fast implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could try to implement it for the libmbfl if more experienced developers approve of this approach. I mean 2 relatively large tables of 65k each.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way is at least twice faster than libmfl way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do i need to create new RFC for this?

@legale
Copy link
Contributor Author

legale commented Jan 22, 2019

i've implemented utf-16 optimization to the whole mbfl library.

@legale
Copy link
Contributor Author

legale commented Feb 8, 2019

@KalleZ,
@derickr,
@nikic,
@Agares,
@rybakit,
@petk,
@cmb69,
@carusogabriel,
dear colleagues, can someone of you merge this PR? Is there somethind need to be done before?

@nikic
Copy link
Member

nikic commented Feb 12, 2019

Closing this PR in favor of #3808.

@nikic nikic closed this Feb 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants