Improve read times and reduce size of metadata.json by storing schemas in external files #9734

@bdilday

Description

Feature Request / Improvement

Improve read times of metadata.json by storing schemas in separate files and referencing them by file pointer in metadata.json, instead of embedding every schema copy that accumulates with schema evolution.

Request

Instead of storing an ASCII copy of the table schema in metadata.json, store a pointer (i.e., a file path) to a file containing the schema. The metadata files would then look something like:

"schemas": [
  {"type": "str", "schema_file": "s3://some-bucket/some-schema.json", "schema-id": 0},
  {...},
  ...
]

And the file some-schema.json would contain a schema structure, e.g.,

{
  "type": "struct",
  "fields": [
    {
      "id": 1,
      "name": "vendor_id",
      "type": "long",
      "required": false
    },
    ...
  ],
  "identifier-field-ids": []
}

The schemas would be read from the external files lazily; e.g., only the current schema would need to be read here:
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L353
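
To make the lazy-read idea concrete, here is a minimal Python sketch. It is not Iceberg code: `SchemaRef`, `load_schema_refs`, and the `read_text` hook are hypothetical names, the `schema_file` key follows the proposal above, and an in-memory dict stands in for S3. The only existing metadata field it relies on is `current-schema-id`.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple


class SchemaRef:
    """Pointer to an externally stored schema; the file is read on first use."""

    def __init__(self, schema_id: int, schema_file: str, read_text: Callable[[str], str]):
        self.schema_id = schema_id
        self.schema_file = schema_file
        self._read_text = read_text  # hypothetical I/O hook (S3, local disk, ...)
        self._schema: Optional[dict] = None

    @property
    def loaded(self) -> bool:
        return self._schema is not None

    def schema(self) -> dict:
        # Fetch and parse the schema file only once, then reuse the result.
        if self._schema is None:
            self._schema = json.loads(self._read_text(self.schema_file))
        return self._schema


def load_schema_refs(
    metadata_json: str, read_text: Callable[[str], str]
) -> Tuple[int, Dict[int, SchemaRef]]:
    """Parse metadata.json into schema pointers without reading any schema file."""
    metadata = json.loads(metadata_json)
    refs = {
        entry["schema-id"]: SchemaRef(entry["schema-id"], entry["schema_file"], read_text)
        for entry in metadata["schemas"]
    }
    return metadata["current-schema-id"], refs


# Fake object store standing in for S3 (illustration only).
files = {
    "s3://some-bucket/schema-0.json": json.dumps(
        {"type": "struct", "fields": [
            {"id": 1, "name": "vendor_id", "type": "long", "required": False}]}
    ),
    "s3://some-bucket/schema-1.json": json.dumps(
        {"type": "struct", "fields": [
            {"id": 1, "name": "vendor_id", "type": "long", "required": False},
            {"id": 2, "name": "trip_id", "type": "long", "required": False}]}
    ),
}
reads: List[str] = []  # records which files were actually fetched


def read_text(path: str) -> str:
    reads.append(path)
    return files[path]


metadata_json = json.dumps({
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "schema_file": "s3://some-bucket/schema-0.json"},
        {"schema-id": 1, "schema_file": "s3://some-bucket/schema-1.json"},
    ],
})

current_id, refs = load_schema_refs(metadata_json, read_text)
current_schema = refs[current_id].schema()  # the only schema file that gets read
```

In this sketch, parsing the metadata touches no schema files, and older schemas are fetched only if something asks for them by ID.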

An alternative solution, namely creating clean-up tools for schemas, has been mentioned in #3462. However, we're interested in making an OSS contribution that could prevent the metadata files from growing so large to begin with.

Motivation

We have a use case for a table with approximately 10k columns. We expect to apply schema changes fairly often. The current JSON format is problematic for this use case because, when a schema change is applied, every previous schema is reproduced in full in every subsequent metadata.json. This causes the metadata files to grow large, using up significant disk space and slowing down reads of the JSON. This issue of large metadata files has been mentioned previously in #5219, but the issue wasn't resolved.
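
The growth can be illustrated with a rough Python sketch that compares the serialized size of the "schemas" section when every schema version is embedded against the size when only pointers are stored. Everything here is synthetic and assumed for illustration: the helper names, the `col_<n>` column names, the all-`long` types, and the file paths are not taken from a real table, and real sizes depend on the actual schemas.

```python
import json


def make_schema(num_columns: int) -> dict:
    """Build a synthetic struct schema with `num_columns` long columns."""
    return {
        "type": "struct",
        "fields": [
            {"id": i, "name": f"col_{i}", "type": "long", "required": False}
            for i in range(1, num_columns + 1)
        ],
        "identifier-field-ids": [],
    }


def inline_entry(schema_id: int, schema: dict) -> dict:
    """Current layout: the full schema is embedded in metadata.json."""
    return {"schema-id": schema_id, **schema}


def pointer_entry(schema_id: int) -> dict:
    """Proposed layout: only a pointer to the schema file (hypothetical path)."""
    return {
        "schema-id": schema_id,
        "schema_file": f"s3://some-bucket/schema-{schema_id}.json",
    }


def schemas_section_bytes(entries: list) -> int:
    """Size in bytes of the serialized "schemas" section."""
    return len(json.dumps({"schemas": entries}).encode("utf-8"))


def compare(num_columns: int, num_versions: int) -> tuple:
    """Return (inline_bytes, pointer_bytes) for the given table shape."""
    schema = make_schema(num_columns)
    inline = schemas_section_bytes(
        [inline_entry(i, schema) for i in range(num_versions)]
    )
    pointer = schemas_section_bytes([pointer_entry(i) for i in range(num_versions)])
    return inline, pointer
```

Calling, say, `compare(10_000, 20)` gives a feel for the difference: the inline size scales with columns times versions, while the pointer size scales with the number of versions only.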

We're interested in making the code changes and submitting a PR to implement the suggestion, but want to gauge community interest before pursuing it.

Query engine

None
