Summary
When requesting an HTTP resource using the DOM or SimpleXML extensions, the wrong content-type header is used to determine the charset when the requested resource performs a redirect.
Details
When the HTTP stream wrapper follows a redirect, it does not clear the list of captured headers before performing the subsequent requests. As a result, the returned array of response headers contains the headers of every request in the redirect chain, one after another, with the headers of the final request last.
The php_libxml_input_buffer_create_filename() / php_libxml_sniff_charset_from_stream() function scans the header array from top to bottom, returning after finding the first content-type header. This content-type header does not necessarily belong to the response that corresponds to the HTML body that is being parsed.
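The difference in scanning order can be illustrated with a minimal PHP sketch. The function names firstContentType() and finalResponseContentType() and the sample header list are hypothetical and exist only for illustration. The real logic is C code inside the libxml extension, and this sketch is not a proposed patch.

```php
<?php
// Hypothetical helpers: they only illustrate the scanning order described above.

/** Mimics the described behaviour: return the first content-type header found. */
function firstContentType(array $headers): ?string
{
    foreach ($headers as $line) {
        if (stripos($line, 'content-type:') === 0) {
            return trim(substr($line, strlen('content-type:')));
        }
    }
    return null;
}

/** Alternative: only consider headers that follow the last status line. */
function finalResponseContentType(array $headers): ?string
{
    $result = null;
    foreach ($headers as $line) {
        if (stripos($line, 'HTTP/') === 0) {
            // A status line starts a new response, so forget earlier headers.
            $result = null;
        } elseif (stripos($line, 'content-type:') === 0) {
            $result = trim(substr($line, strlen('content-type:')));
        }
    }
    return $result;
}

// Assumed sample data, shaped like the header list collected across one redirect.
$headers = [
    'HTTP/1.1 302 Found',
    'content-type: text/html;charset=utf-16',
    'location: http://example.com',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=UTF-8',
];

echo firstContentType($headers), PHP_EOL;         // text/html;charset=utf-16
echo finalResponseContentType($headers), PHP_EOL; // text/html; charset=UTF-8
```

With this sample list, the first-match scan returns the charset announced by the redirecting response, while the second function returns the one sent together with the body that is actually parsed.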
PoC
redirect.php
<?php
header('content-type: text/html;charset=utf-16');
header('location: http://example.com');
Run php -S localhost:8080 in the directory containing redirect.php, and then execute:
<?php
// Or using DOMDocument / SimpleXML
$document = \Dom\HTMLDocument::createFromFile("http://localhost:8080/redirect.php");
// The body is decoded with the charset from the redirect response, so this check does not match.
if (\str_contains($document->querySelector('body')->textContent, 'Example')) {
    throw new Exception('Refusing to store example content');
}
// The serialized output still contains the text the check above was meant to reject.
var_dump(\str_contains($document->saveHtml(), 'Example')); // bool(true)
Impact
This allows an attacker to cause a document to be parsed incorrectly, changing its meaning and possibly bypassing validation. When exporting such a document with ->saveHtml(), the document will be returned with the original charset.
Users that request documents via HTTP using the DOM or SimpleXML extensions are impacted.