mb_detect_encoding does not return the first matching encoding anymore #8279

Closed
come-nc opened this issue Mar 30, 2022 · 12 comments

@come-nc

come-nc commented Mar 30, 2022

Description

The following code:

<?php

$str = '/dav/files/admin/%C3%BC%C3%B6%C3%A4%C3%B6%C3%A4%C3%BC%C3%B6%C3%A4%C3%BB%C5%B7%C3%AE';
$rawstr = rawurldecode($str);

var_dump(
    mb_detect_encoding($rawstr, ['UTF-8', 'ISO-8859-1']),
    mb_detect_encoding($rawstr, ['ISO-8859-1', 'UTF-8']),
    mb_check_encoding($rawstr, 'ISO-8859-1'),
    mb_check_encoding($rawstr, 'UTF-8'),
);

https://3v4l.org/kqHre

Resulted in this output:

string(10) "ISO-8859-1"
string(10) "ISO-8859-1"
bool(true)
bool(true)

But I expected this output instead:

string(5) "UTF-8"
string(10) "ISO-8859-1"
bool(true)
bool(true)

It seems the behavior of mb_detect_encoding changed in PHP 8.1, not clear if this is on purpose or not.
The documentation of mb_detect_encoding suggests that it returns the first matching encoding, which it did up to and including PHP 8.0.
But with 8.1 it returns ISO-8859-1 even though mb_check_encoding returns true for both UTF-8 and ISO-8859-1.

PHP Version

8.1

Operating System

No response

@cmb69
Member

cmb69 commented Mar 30, 2022

Yeah, that looks wrong: https://3v4l.org/YGQ8s.

@alexdowad, can you please have a look?

@alexdowad
Contributor

> It seems the behavior of mb_detect_encoding changed in PHP 8.1, not clear if this is on purpose or not.

It's on purpose. As you say, previously mb_detect_encoding would just return the first listed encoding which 'works' for the input string. Now it also applies heuristics to detect which of the valid text encodings in the specified list (if there are more than one) is most likely to be correct.

If you really, really just want to know "which is the first text encoding in this list which is valid for this string", it would be better to call mb_check_encoding in a loop.
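A minimal sketch of that loop, for readers who want the strict first-match behavior. The helper name `first_valid_encoding` is hypothetical and not part of mbstring; it only wraps `mb_check_encoding`, and the sample string is the one from the report above.

```php
<?php

/**
 * Return the first encoding in $candidates for which $str is valid,
 * or null if none of them matches.
 * Hypothetical helper for illustration; not part of mbstring.
 */
function first_valid_encoding(string $str, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        // mb_check_encoding() only validates; it applies no heuristics.
        if (mb_check_encoding($str, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

$rawstr = rawurldecode(
    '/dav/files/admin/%C3%BC%C3%B6%C3%A4%C3%B6%C3%A4%C3%BC%C3%B6%C3%A4%C3%BB%C5%B7%C3%AE'
);

// The order of the candidate list now fully decides the result.
var_dump(first_valid_encoding($rawstr, ['UTF-8', 'ISO-8859-1'])); // string(5) "UTF-8"
var_dump(first_valid_encoding($rawstr, ['ISO-8859-1', 'UTF-8'])); // string(10) "ISO-8859-1"
```

Because the order alone decides the result, stricter encodings such as UTF-8 should normally be listed before permissive single-byte ones.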

If the docs need to be updated, I suggest opening a GitHub issue for any inaccurate statements you found there.

Now let's see the reason for the difference you found. First of all, your string is valid in both UTF-8 and ISO-8859-1. In UTF-8, the bytes break down as:

2f 64 61 76 2f 66 69 6c 65 73 2f 61 64 6d 69 6e 2f c3bc c3b6 c3a4 c3b6 c3a4 c3bc c3b6 c3a4 c3bb c5b7 c3ae

Whereas in ISO-8859-1, they break down in this (trivial) way:

2f 64 61 76 2f 66 69 6c 65 73 2f 61 64 6d 69 6e 2f c3 bc c3 b6 c3 a4 c3 b6 c3 a4 c3 bc c3 b6 c3 a4 c3 bb c5 b7 c3 ae

All those 2-byte UTF-8 characters can also be interpreted as two ISO-8859-1 characters, as you can see.
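The two readings can be checked with a short script. This is only an illustrative sketch using the string from the original report: it dumps the raw bytes and counts how many characters the same bytes yield under each encoding.

```php
<?php

$rawstr = rawurldecode(
    '/dav/files/admin/%C3%BC%C3%B6%C3%A4%C3%B6%C3%A4%C3%BC%C3%B6%C3%A4%C3%BB%C5%B7%C3%AE'
);

// Raw bytes as hex: 17 ASCII bytes followed by 11 two-byte sequences.
echo bin2hex($rawstr), "\n";

var_dump(strlen($rawstr));                  // int(39), the number of bytes
var_dump(mb_strlen($rawstr, 'UTF-8'));      // int(28), each two-byte sequence is one character
var_dump(mb_strlen($rawstr, 'ISO-8859-1')); // int(39), every byte is its own character
```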

The current implementation of mb_detect_encoding has a slight preference for text encodings in which the input string comes out to a shorter sequence of codepoints. However, what makes it prefer ISO-8859-1 over UTF-8 in this case is the letter ŷ.

I don't know if that accented letter is commonly used in any major language of the world, but currently mb_detect_encoding classifies it as a "rare" character and in this case, penalizes UTF-8 because of it. If you remove ŷ from that string, mb_detect_encoding will decide that UTF-8 is more likely what you wanted.
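To try this yourself, something like the following sketch compares detection with and without ŷ. It assumes the string from the original report; no expected output is shown because the result of `mb_detect_encoding` depends on the PHP version you run.

```php
<?php

$candidates = ['UTF-8', 'ISO-8859-1'];

$with = rawurldecode(
    '/dav/files/admin/%C3%BC%C3%B6%C3%A4%C3%B6%C3%A4%C3%BC%C3%B6%C3%A4%C3%BB%C5%B7%C3%AE'
);

// "\u{177}" is ŷ (U+0177), which UTF-8 encodes as the two bytes c5 b7.
$without = str_replace("\u{177}", '', $with);

// Both strings are valid in both encodings, so only the heuristics
// decide which candidate is returned.
var_dump(mb_detect_encoding($with, $candidates));
var_dump(mb_detect_encoding($without, $candidates));
```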

If the input string was longer, then UTF-8 would have much better chances of winning out. In general, trying to automatically detect text encoding on short strings is very error-prone. This is also true for the simpler approach of "just picking an encoding that works"; as the input string becomes shorter, the chances that an unintended encoding will match by accident become higher.

If there are good reasons why mb_detect_encoding should consider ŷ to be a "common" character, please feel free to propose that.

@come-nc
Author

come-nc commented Mar 31, 2022

@alexdowad Where is the bug tracker for the documentation then?

It needs to state that mb_detect_encoding does not return the first matching encoding but applies heuristics, and this should be listed on the page about upgrading from 8.0 to 8.1, as it is a behavior change.

@alexdowad
Contributor

@come-nc The repository for the English version of the PHP documentation is https://github.com/php/doc-en.

@come-nc
Author

come-nc commented Mar 31, 2022

Thank you

@alexdowad
Contributor

@come-nc Thank you also very much for the report.

@come-nc
Author

come-nc commented Apr 25, 2022

> I don't know if that accented letter is commonly used in any major language of the world, but currently mb_detect_encoding classifies it as a "rare" character and in this case, penalizes UTF-8 because of it. If you remove ŷ from that string, mb_detect_encoding will decide that UTF-8 is more likely what you wanted.
>
> If the input string was longer, then UTF-8 would have much better chances of winning out. In general, trying to automatically detect text encoding on short strings is very error-prone. This is also true for the simpler approach of "just picking an encoding that works"; as the input string becomes shorter, the chances that an unintended encoding will match by accident become higher.
>
> If there are good reasons why mb_detect_encoding should consider ŷ to be a "common" character, please feel free to propose that.

It was reported to us that the problem happens with common Slavic names. I tried several names from https://en.wikipedia.org/wiki/Slavic_names and found that both 'Dušan' and 'Živko' are wrongly detected as Latin-1 (ISO-8859-1).
I understand that detecting the encoding of a short string is hard, but it is a problem that the function fails on common words. In this case it should use the order of the passed encodings and return the first valid one, as before.

See https://3v4l.org/dX7FR

@alexdowad
Contributor

> It was reported to us that the problem happens with common Slavic names. I tried several names from https://en.wikipedia.org/wiki/Slavic_names and found that both 'Dušan' and 'Živko' are wrongly detected as Latin-1 (ISO-8859-1). I understand that detecting the encoding of a short string is hard, but it is a problem that the function fails on common words. In this case it should use the order of the passed encodings and return the first valid one, as before.

Thanks for reporting that. What language are those names from?

@alexdowad
Contributor

Please disregard my question about 'what language these are from'. I did a bit of searching on the web and found my own answers.

> It was reported to us that the problem happens with common Slavic names. I tried several names from https://en.wikipedia.org/wiki/Slavic_names and found that both 'Dušan' and 'Živko' are wrongly detected as Latin-1 (ISO-8859-1). I understand that detecting the encoding of a short string is hard, but it is a problem that the function fails on common words. In this case it should use the order of the passed encodings and return the first valid one, as before.

Certainly, it is a problem, and I would like to thank you for reporting it. While I respect your view, I do not (at present) agree that mb_detect_encoding should rely on the order of the passed encodings in this case. However, I am always ready to be convinced by a logical argument.

What may be more pertinent, and which we do agree on, is that mb_detect_encoding should definitely return 'UTF-8' in this case. I believe that #8439 is the best resolution to this issue. If you would like to review that commit and leave some comments, please do so.

Also, you are very right that the new algorithm for mb_detect_encoding should have been listed as a change from PHP 8.0 to 8.1 in the documentation; if that can still be adjusted, it would be great.

@cmb69
Member

cmb69 commented Apr 27, 2022

> Also, you are very right that the new algorithm for mb_detect_encoding should have been listed as a change from PHP 8.0 to 8.1 in the documentation; if that can still be adjusted, it would be great.

Yes, sure, this can (and should be) adjusted in the documentation!

@phil-davis
Contributor

phil-davis commented Jun 24, 2022

Just adding a note here for future readers. Please correct me if this is not correct!

ISO-8859-1 is a single-byte (8-bit) character encoding: it matches ASCII in the lower half and encodes additional characters in the upper half. So many/most 8-bit patterns are valid ISO-8859-1 encodings. That means that when parsing a string of 8-bit values, the string will very often be a possible sequence of ISO-8859-1 characters.

That means that, if mb_detect_encoding did strictly prioritize the "first match" encoding in the array passed in, then a call like this:

mb_detect_encoding($rawstr, ['ISO-8859-1', 'UTF-8'])

would very often return 'ISO-8859-1', including for many "shortish" strings that are actually some different set of characters encoded as UTF-8 by some application.

So, IMO, that is a good reason for not strictly prioritizing: if it did strictly prioritize, then people would have to understand exactly what they are doing with the array of encodings. For example, a call like:

mb_detect_encoding($rawstr, ['UTF-8', 'ISO-8859-1'])

might generally be a better thing to recommend.

But actually, the implemented heuristics are, IMO, a better approach: they make more of an "educated guess" when a string is valid in more than one of the encodings.

As mentioned above, if the caller really wants strict prioritizing then they can call mb_check_encoding in a loop, checking for encoding matches one-by-one.

End of dump.

@alexdowad
Contributor

@phil-davis That is exactly the point. Actually, for many 8-bit legacy text encodings, every possible string is valid. So if such an encoding was listed first as input to mb_detect_encoding, it would always be "detected".
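A quick way to see this for ISO-8859-1 is to validate every possible single byte. This is only an illustrative sketch using `mb_check_encoding`, with UTF-8 shown alongside for contrast; the variable names are made up for the example.

```php
<?php

$invalidLatin1 = 0;
$invalidUtf8 = 0;

// Try each of the 256 possible byte values as a one-byte string.
for ($byte = 0x00; $byte <= 0xFF; $byte++) {
    $s = chr($byte);
    if (!mb_check_encoding($s, 'ISO-8859-1')) {
        $invalidLatin1++;
    }
    if (!mb_check_encoding($s, 'UTF-8')) {
        $invalidUtf8++;
    }
}

var_dump($invalidLatin1); // int(0): every byte is valid ISO-8859-1
var_dump($invalidUtf8);   // int(128): bytes 0x80-0xFF are never valid UTF-8 on their own
```

Since any byte is valid ISO-8859-1 in isolation, any byte string is valid as well, so a strict first-match rule with ISO-8859-1 listed first could never return anything else.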

Long before this change to the behavior of mb_detect_encoding was made, you could see posts on programming forums recommending the use of mb_check_encoding in a loop rather than mb_detect_encoding. This was because of various weird inconsistencies which existed in the 'detection' code. Those inconsistencies no longer exist; but still, if one wants to strictly evaluate a series of candidate encodings one by one and pick the first one that works, using mb_check_encoding in a loop is the way to go.

LeSuisse added a commit to Enalean/tuleap that referenced this issue Dec 19, 2022
A unit test was failing due to a behavior change of `\mb_detect_encoding()` [0].
The test was testing a force encoding conversion but this does not seem
to be really needed anymore.

Part of request #22659: Run Tuleap with PHP 8.1

[0] php/php-src#8279 (comment)

Change-Id: I9021994a0cfe8662602379dc45b281e2435c6248