the fileSizeInBytes of orc and parquet are inconsistent #1666

@zhangjun0x01

I found that the value stored in the fileSizeInBytes field of DataFile is inconsistent between the ORC and Parquet formats. For ORC it holds the deserialized data size, while for Parquet it holds the file size.

This causes a problem. In RewriteDataFilesAction, the default value of targetSizeInBytes is 128M. With the ORC format, the data files produced by the rewrite action are only 10M each. RewriteDataFilesAction reads the ORC data according to the deserialized data size rather than the file size, so the newly generated data files fall well short of 128M.
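To make the arithmetic concrete, here is a rough sketch of the effect. It is not Iceberg code: the function name and the 1280M/100M input sizes are hypothetical, chosen only so the result matches the 10M I observed. It assumes the rewrite groups input until the recorded size reaches the target, and that the output shrinks by the same ratio as the input.

```python
MB = 1024 * 1024

def expected_output_file_size(target_size_bytes, recorded_size_bytes, on_disk_size_bytes):
    """Estimate the on-disk size of a rewritten file when input is grouped
    by the recorded fileSizeInBytes instead of the real file size."""
    # Fraction of the recorded size that actually ends up on disk.
    ratio = on_disk_size_bytes / recorded_size_bytes
    return target_size_bytes * ratio

# Hypothetical ORC file: recorded (deserialized) size 1280M, real file size 100M.
orc_output = expected_output_file_size(128 * MB, 1280 * MB, 100 * MB)

# Hypothetical Parquet file: recorded size equals the real file size.
parquet_output = expected_output_file_size(128 * MB, 100 * MB, 100 * MB)

print(orc_output / MB)      # 10.0
print(parquet_output / MB)  # 128.0
```

Under these assumptions the ORC output is about 10M per file, while the Parquet output reaches the 128M target.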

With the Parquet format, the rewritten files come out at the expected size.
