PharData incorrectly extracts zip file #13037

Closed
ghost opened this issue Dec 27, 2023 · 2 comments
ghost commented Dec 27, 2023

Description

The following code:

<?php
function archive_extract(string $path, string $destination = ".")
{
	$phar = new PharData($path);
	$phar->extractTo($destination);
}

archive_extract("file.zip", "extracted_dir1");

extracts the zip file, but a few of the hundreds of extracted files contain invisible characters such as \x14 and other control characters.
Files that are corrupted when extracted, as reported by diff -r -q /path/to/dir1 /path/to/dir2:

Files test/wordpress/wp-includes/blocks/site-tagline/editor-rtl.min.css and test2/wordpress/wp-includes/blocks/site-tagline/editor-rtl.min.css differ
Files test/wordpress/wp-includes/blocks/site-tagline/editor.min.css and test2/wordpress/wp-includes/blocks/site-tagline/editor.min.css differ
Files test/wordpress/wp-includes/blocks/site-title/style-rtl.css and test2/wordpress/wp-includes/blocks/site-title/style-rtl.css differ
Files test/wordpress/wp-includes/blocks/site-title/style-rtl.min.css and test2/wordpress/wp-includes/blocks/site-title/style-rtl.min.css differ
Files test/wordpress/wp-includes/blocks/site-title/style.css and test2/wordpress/wp-includes/blocks/site-title/style.css differ
Files test/wordpress/wp-includes/blocks/site-title/style.min.css and test2/wordpress/wp-includes/blocks/site-title/style.min.css differ
Files test/wordpress/wp-includes/blocks/spacer/style-rtl.css and test2/wordpress/wp-includes/blocks/spacer/style-rtl.css differ
Files test/wordpress/wp-includes/blocks/spacer/style-rtl.min.css and test2/wordpress/wp-includes/blocks/spacer/style-rtl.min.css differ

This code:

<?php
function unzip(string $path, string $destination = ".")
{
    $zip = new ZipArchive;
    $zip->open($path);
    $zip->extractTo($destination);
    $zip->close();
}
unzip("file.zip", "extracted_dir2");

extracts the zip file correctly, producing the same result as the command-line tool unzip.

Working file from ZipArchive: [screenshot: ziparchive]

Corrupted file from PharData: [screenshot: phardata]

PHP Version

PHP 8.2.7

Operating System

Debian 12.4 raspberrypi


nielsdos commented Dec 27, 2023

The corruption only seems to happen when the compression flags of the file entry are zero, i.e. no compression is used.
What's also interesting is that it seems to read 4 bytes prior to the file, which explains why we see 4 bytes of junk at the front and 4 bytes missing at the end.
Not sure yet what's going on exactly, but I'm digging into it.

EDIT: it looks like the zip file entry offset is not set up properly. The entry.offset value computed in phar_parse_zipfile seems wrong: for vendor/joypixels/emoji-toolkit/joypixels.json it points to offset 12398848, which is indeed 4 bytes before the file start.

nielsdos commented

The difference of 4 bytes comes from an inconsistency in the extra field length in the sample zip files.
In the central directory entry for the file, this is 24 bytes, but in the local header this is 28 bytes. A difference of 4.
libzip seems to use that of the local header (confirmed by compiling and debugging libzip), hence it reads at the correct offset. This has probably always been done incorrectly in phar.

According to https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT:

Immediately following the local header for a file SHOULD be placed the compressed or stored data for the file.
Note that it does not say anything about the local header and the central directory entry having to be consistent.
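
As a minimal sketch of what that sentence implies (this is not the patch that was merged, and the names zip_data_offset and read_le16 are hypothetical), the start of an entry's data can be derived from the local header alone. The field positions used here, a 30-byte fixed header with the file name length at byte 26 and the extra field length at byte 28, are the ones given in APPNOTE.TXT. The archive is assumed to be fully loaded in memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the fixed part of a local file header, per APPNOTE.TXT. */
#define ZIP_LOCAL_HEADER_SIZE 30u

/* Read a little-endian 16-bit value. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | ((unsigned)p[1] << 8));
}

/*
 * Hypothetical helper: return the offset of the entry's data, computed
 * from the LOCAL header at header_offset, or -1 if the header is not
 * plausible. The central directory's extra field length is deliberately
 * not consulted, since it may differ from the local one.
 */
long zip_data_offset(const unsigned char *archive, size_t archive_len,
                     size_t header_offset)
{
    if (header_offset > archive_len ||
        archive_len - header_offset < ZIP_LOCAL_HEADER_SIZE) {
        return -1;
    }

    const unsigned char *h = archive + header_offset;

    /* Local file header signature: "PK\x03\x04". */
    if (h[0] != 'P' || h[1] != 'K' || h[2] != 3 || h[3] != 4) {
        return -1;
    }

    size_t name_len  = read_le16(h + 26);
    size_t extra_len = read_le16(h + 28);
    size_t data_off  = header_offset + ZIP_LOCAL_HEADER_SIZE + name_len + extra_len;

    /* The variable-length fields must fit inside the archive. */
    if (data_off > archive_len) {
        return -1;
    }
    return (long)data_off;
}
```

With a local extra field of 28 bytes and a central one of 24 bytes, using the central value here would give an offset that is 4 bytes too small, which matches the symptom described above.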

Furthermore, upon finding this I googled a bit further and found this: https://stackoverflow.com/questions/58702783/finding-start-of-compressed-data-for-items-in-a-zip-with-zip4j

The reason you have to add 4 to the offset in your example is because the size of the extra data field in central directory of this entry (= file header) is different than the one in local file header, and it is perfectly legal as per zip specification to have different extra data field sizes in central directory and local header. In fact the extra data field we are talking about, Extended Timestamp extra field (signature 0x5455), has an official definition which has varied lengths between the two.

And indeed that's the case here too! What an amazing file format design!
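
To illustrate how the two copies of that field can differ in size, here is a small sketch based on the Extended Timestamp layout as described in Info-ZIP's extra field documentation (my reading of it, not something verified against the sample archive): the local copy holds a flags byte plus one 4-byte time for each flag bit that is set, while the central copy holds the flags byte plus at most the modification time. The helper names and the UT_* constants are hypothetical.

```c
#include <stddef.h>

/* Flag bits of the Extended Timestamp (0x5455) extra field. */
enum {
    UT_MTIME = 1 << 0, /* modification time present */
    UT_ATIME = 1 << 1, /* access time present */
    UT_CTIME = 1 << 2  /* creation time present */
};

/* Every extra field starts with a 2-byte ID and a 2-byte data size. */
#define EXTRA_FIELD_HEADER_SIZE 4u

/* Total size of the field as stored in the local file header. */
size_t ut_local_field_size(unsigned flags)
{
    size_t size = EXTRA_FIELD_HEADER_SIZE + 1; /* header + flags byte */
    if (flags & UT_MTIME) size += 4;
    if (flags & UT_ATIME) size += 4;
    if (flags & UT_CTIME) size += 4;
    return size;
}

/* Total size of the field as stored in the central directory entry,
 * which carries only the modification time (if any). */
size_t ut_central_field_size(unsigned flags)
{
    size_t size = EXTRA_FIELD_HEADER_SIZE + 1; /* header + flags byte */
    if (flags & UT_MTIME) size += 4;
    return size;
}
```

Assuming an entry that stores both modification and access time (flags 3), the local field is 13 bytes and the central one is 9 bytes, a difference of 4 like the one observed in this issue.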

Here's a very quick and dirty proof of concept patch:

diff --git a/ext/phar/zip.c b/ext/phar/zip.c
index 1804d926b4..c07ddf9b10 100644
--- a/ext/phar/zip.c
+++ b/ext/phar/zip.c
@@ -386,8 +386,16 @@ int phar_parse_zipfile(php_stream *fp, char *fname, size_t fname_len, char *alia
 		entry.timestamp = phar_zip_d2u_time(zipentry.timestamp, zipentry.datestamp);
 		entry.flags = PHAR_ENT_PERM_DEF_FILE;
 		entry.header_offset = PHAR_GET_32(zipentry.offset);
+		// entry.offset = entry.offset_abs = PHAR_GET_32(zipentry.offset) + sizeof(phar_zip_file_header) + PHAR_GET_16(zipentry.filename_len) +
+			// PHAR_GET_16(zipentry.extra_len);
+
+		zend_off_t loc = php_stream_tell(fp);
+		phar_zip_file_header local_file_header;
+		php_stream_seek(fp, entry.header_offset, SEEK_SET);
+		php_stream_read(fp, (char *) &local_file_header, sizeof(local_file_header));
+		php_stream_seek(fp, loc, SEEK_SET);
 		entry.offset = entry.offset_abs = PHAR_GET_32(zipentry.offset) + sizeof(phar_zip_file_header) + PHAR_GET_16(zipentry.filename_len) +
-			PHAR_GET_16(zipentry.extra_len);
+			PHAR_GET_16(local_file_header.extra_len);
 
 		if (PHAR_GET_16(zipentry.flags) & PHAR_ZIP_FLAG_ENCRYPTED) {
 			PHAR_ZIP_FAIL("Cannot process encrypted zip files");

I'll write up a proper patch and testcase tomorrow, it's late.

@nielsdos nielsdos self-assigned this Dec 28, 2023
nielsdos added a commit to nielsdos/php-src that referenced this issue Dec 28, 2023
The code currently assumes that the extra field length of the central
directory entry and the local entry are the same, but that's not the
case. For example, the "Extended Timestamp extra field" differs in size
for local vs central directory entries. This causes the file contents
offset to be incorrect because it is based on the central directory
length instead of the local entry length. Fix it by reading the local
entry and getting the size from there as well as checking consistency
for the file name length.
@nielsdos nielsdos linked a pull request Dec 28, 2023 that will close this issue
nielsdos added a commit that referenced this issue Jan 25, 2024
* PHP-8.2:
  Fix GH-10344: imagettfbbox(): Could not find/open font UNC path
  Fix GH-13037: PharData incorrectly extracts zip file
nielsdos added a commit that referenced this issue Jan 25, 2024
* PHP-8.3:
  Fix GH-10344: imagettfbbox(): Could not find/open font UNC path
  Fix GH-13037: PharData incorrectly extracts zip file