mb_detect_encoding does not return the first matching encoding anymore #8279
Comments
Yeah, that looks wrong: https://3v4l.org/YGQ8s. @alexdowad, can you please have a look?
It's on purpose. As you say, previously […]

If you really, really just want to know "which is the first text encoding in this list which is valid for this string", it would be better to call […]

If the docs need to be updated, I would like to suggest you could open a GitHub issue for any inaccurate statements you found there.

Now let's see the reason for the difference you found. First of all, your string is valid in both UTF-8 and ISO-8859-1. In UTF-8, the bytes break down as:
Whereas in ISO-8859-1, they break down in this (trivial) way:
All those 2-byte UTF-8 characters can also be interpreted as two ISO-8859-1 characters, as you can see. The current implementation of […]

I don't know if that accented letter is commonly used in any major language of the world, but currently […]

If the input string were longer, then UTF-8 would have a much better chance of winning out. In general, trying to automatically detect text encoding on short strings is very error-prone. This is also true for the simpler approach of "just picking an encoding that works"; as the input string becomes shorter, the chances that an unintended encoding will match by accident become higher.

If there are good reasons why […]
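The byte-level point in this comment can be checked directly. The following is only a minimal sketch, not the code from the original report: the sample name "Dušan" is borrowed from a later comment in this thread, and the mbstring extension is assumed to be loaded.

```php
<?php
// Illustrative sample only: "Dušan" written with explicit UTF-8 bytes,
// so the result does not depend on how this file is saved.
$bytes = "Du\xC5\xA1an"; // U+0161 (š) is C5 A1 in UTF-8

// The same byte sequence is valid under both encodings...
$validUtf8   = mb_check_encoding($bytes, 'UTF-8');
$validLatin1 = mb_check_encoding($bytes, 'ISO-8859-1');

// ...but it decodes to a different number of characters.
$lenUtf8   = mb_strlen($bytes, 'UTF-8');      // C5 A1 counts as one character
$lenLatin1 = mb_strlen($bytes, 'ISO-8859-1'); // C5 and A1 count separately

// Treating the bytes as ISO-8859-1 and re-encoding them as UTF-8
// makes the alternative reading visible.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

var_dump($validUtf8, $validLatin1, $lenUtf8, $lenLatin1, $asLatin1);
```

Both validity checks succeed, which is why a validity test alone cannot tell the two encodings apart for a string like this.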
@alexdowad Where is the bug tracker for the documentation then? It needs to state that […]
@come-nc The repository for the English version of the PHP documentation is https://github.com/php/doc-en.
Thank you
@come-nc Thank you also very much for the report.
It was reported to us that the problem was happening on common Slavic names. I tried several names from https://en.wikipedia.org/wiki/Slavic_names and found that both 'Dušan' and 'Živko' are wrongly detected as latin1.
Thanks for reporting that. What language are those names from?
Please disregard my question about 'what language these are from'. I did a bit of searching on the web and found my own answers.
Certainly, it is a problem, and I would like to thank you for reporting it. While I respect your view, I do not (at present) agree that […]

What may be more pertinent, and which we do agree on, is that […]

Also, you are very right that the new algorithm for […]
Yes, sure, this can (and should be) adjusted in the documentation!
Just adding a note here for future readers. Please correct me if this is not correct! ISO-8859-1 is an 8-bit, single-byte character encoding: it matches ASCII in the lower half and encodes extra characters in the upper half. So many/most 8-bit patterns are valid ISO-8859-1. That means that when parsing a string of 8-bit values, the string is very often going to be a possible ISO-8859-1 string. That means that, if […]
would very often return 'ISO-8859-1', including for many "shortish" strings that are actually some different set of characters encoded as UTF-8 by some application.

So, IMO, that is a reasonable reason for not strictly prioritizing: if it did strictly prioritize, then people would have to understand exactly what they are doing with the array of encodings. For example, a call like: […]

But actually, the implemented heuristics are, IMO, a better approach: they make more of an "educated guess" when a string could be valid in several of the encodings. As mentioned above, if the caller really wants strict prioritizing then they can call […]

End of dump.
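For readers who do want strict prioritizing, one way to get it is to test each candidate in order. This is only a sketch under stated assumptions: `first_valid_encoding` is a hypothetical helper name (it is not part of mbstring), it relies on `mb_check_encoding`, and the sample strings are illustrative.

```php
<?php
/**
 * Hypothetical helper (not part of mbstring): return the first encoding in
 * $candidates for which $bytes is valid, or null if none of them matches.
 */
function first_valid_encoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

$name = "\xC5\xBDivko"; // "Živko" as UTF-8 bytes (Ž is C5 BD)

// With UTF-8 listed first, the UTF-8 reading wins.
$utf8First = first_valid_encoding($name, ['UTF-8', 'ISO-8859-1']);

// With ISO-8859-1 listed first, it matches immediately, which is the
// pitfall of strict prioritizing described in the comment above.
$latin1First = first_valid_encoding($name, ['ISO-8859-1', 'UTF-8']);

// Bytes that are not valid UTF-8 ("été" in ISO-8859-1) fall through.
$fallback = first_valid_encoding("\xE9t\xE9", ['UTF-8', 'ISO-8859-1']);

var_dump($utf8First, $latin1First, $fallback);
```

The order of the candidate list fully determines the answer here, so an encoding that accepts almost any input should go last.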
@phil-davis That is exactly the point. Actually, for many 8-bit legacy text encodings, every possible string is valid. So if such an encoding was listed first as input to […]

Long before this change to the behavior of […]
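The claim that every possible string is valid in some legacy encodings can be spot-checked for ISO-8859-1. The sketch below assumes the mbstring extension is loaded and uses ISO-8859-1 as the example; other legacy encodings would need to be checked separately.

```php
<?php
// Build a string that contains every possible byte value exactly once.
$allBytes = '';
for ($i = 0; $i < 256; $i++) {
    $allBytes .= chr($i);
}

// ISO-8859-1 assigns a character to every byte value, so this should pass.
$validLatin1 = mb_check_encoding($allBytes, 'ISO-8859-1');

// UTF-8 has structural rules (lead and continuation bytes), so the same
// byte soup is rejected.
$validUtf8 = mb_check_encoding($allBytes, 'UTF-8');

var_dump($validLatin1, $validUtf8);
```

Because such an encoding never rejects anything, a validity check carries no information about whether it was the intended encoding.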
A unit test was failing due to a behavior change of `\mb_detect_encoding()` [0]. The test was testing a force encoding conversion but this does not seem to be really needed anymore. Part of request #22659: Run Tuleap with PHP 8.1 [0] php/php-src#8279 (comment) Change-Id: I9021994a0cfe8662602379dc45b281e2435c6248
…() (#195) See also: php/php-src#8279 Submitted by: Andrew Hardie This fixes #195
Description
The following code:
https://3v4l.org/kqHre
Resulted in this output:
But I expected this output instead:
It seems the behavior of mb_detect_encoding changed in PHP 8.1; it is not clear whether this is on purpose.
The documentation of mb_detect_encoding suggests that it will return the first matching encoding, which it does up until PHP 8.0.
But with 8.1 it returns ISO-8859-1 even though mb_check_encoding returns true for both UTF-8 and ISO-8859-1.
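The original snippet and its output are not reproduced in this copy of the report. As a rough sketch of the kind of comparison being described, the following uses an assumed sample string ("Dušan", taken from the comments) and an assumed candidate list; the detected value is deliberately left unstated because it depends on the PHP version.

```php
<?php
// Assumed sample input, not the string from the original report.
$bytes = "Du\xC5\xA1an"; // "Dušan" as UTF-8 bytes
$candidates = ['UTF-8', 'ISO-8859-1'];

// Which candidates accept the bytes at all?
$valid = [];
foreach ($candidates as $encoding) {
    $valid[$encoding] = mb_check_encoding($bytes, $encoding);
}

// What does detection pick, in strict mode, on this PHP version?
$detected = mb_detect_encoding($bytes, $candidates, true);

echo 'PHP ', PHP_VERSION, PHP_EOL;
var_dump($valid, $detected);
```

Running the same script on several PHP versions (for example on 3v4l.org) shows whether the detected encoding follows the list order or the heuristics.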
PHP Version
8.1
Operating System
No response