Improve read times and reduce size of metadata.json by storing schemas in external files

### Feature Request / Improvement

Improve read times of `metadata.json` by storing each schema in its own file and referencing it from `metadata.json` by file pointer, instead of inlining every schema copy that accumulates through schema evolution.

## Request

Instead of storing an ASCII copy of each table schema in `metadata.json`, store a pointer to a file (i.e., a file path) containing the schema. The metadata files would then look something like:

```
"schemas": [{"type": "str", "schema_file": "s3://some-bucket/some-schema.json", "schema-id": 0},
  {...}, ...]
```

And the file `some-schema.json` would contain a schema structure, e.g.,
```
{
  "type": "struct",
  "fields": [
    {
      "id": 1,
      "name": "vendor_id",
      "type": "long",
      "required": false
    },
    ...
  ],
  "identifier-field-ids": []
}
```
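
To make the proposed layout concrete, here is a minimal Python sketch of how inline schemas might be moved into external files. It is not Iceberg code: the function name `externalize_schemas`, the `schema-<id>.json` naming convention, and the use of the local filesystem in place of S3 are all assumptions made for illustration. Only the `schema_file` and `schema-id` keys come from the example above.

```python
import json
from pathlib import Path


def externalize_schemas(metadata: dict, schema_dir: Path) -> dict:
    """Return a copy of `metadata` whose inline schemas are replaced by file pointers."""
    schema_dir.mkdir(parents=True, exist_ok=True)
    pointers = []
    for schema in metadata.get("schemas", []):
        schema_id = schema["schema-id"]
        # Hypothetical naming convention: one file per schema id.
        path = schema_dir / f"schema-{schema_id}.json"
        # The external file holds the schema structure; the id stays in metadata.json.
        body = {key: value for key, value in schema.items() if key != "schema-id"}
        path.write_text(json.dumps(body))
        pointers.append({"schema-id": schema_id, "schema_file": str(path)})
    # Build a new dict so the caller's metadata is left untouched.
    return {**metadata, "schemas": pointers}
```

With this shape, each schema is written once, and later metadata files only repeat the short pointer entries.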

The schemas would be read from the external files lazily; e.g., only the current schema would need to be read here:
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L353
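
As a rough illustration of the lazy-read idea, the sketch below resolves a schema pointer only when that schema is requested and caches the parsed result. It is written in Python for brevity and does not reflect how `TableMetadataParser` works; the class `LazySchemas` and its methods are hypothetical, and pointers are assumed to be local file paths rather than S3 URIs. `current-schema-id` is the existing table metadata field.

```python
import json
from pathlib import Path


class LazySchemas:
    """Resolve schema pointers on demand and cache the parsed result."""

    def __init__(self, metadata: dict):
        self._current_id = metadata["current-schema-id"]
        # Only the pointers are held up front; no schema file is opened yet.
        self._paths = {s["schema-id"]: s["schema_file"] for s in metadata["schemas"]}
        self._cache = {}

    def get(self, schema_id: int) -> dict:
        # Read and parse the external file the first time this id is requested.
        if schema_id not in self._cache:
            text = Path(self._paths[schema_id]).read_text()
            self._cache[schema_id] = json.loads(text)
        return self._cache[schema_id]

    def current(self) -> dict:
        return self.get(self._current_id)

    @property
    def loaded_ids(self) -> list:
        """Ids of the schemas that have actually been read so far."""
        return sorted(self._cache)
```

A reader that only needs the current schema would call `current()` and leave all historical schema files unread.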


An alternative solution, namely to create clean-up tools for schemas, has been mentioned in https://github.com/apache/iceberg/pull/3462. However, we're interested in making an OSS contribution that could prevent the metadata files from growing so large to begin with.


## Motivation

We have a use case for a table with approximately 10k columns, and we expect to apply schema changes fairly often. The current JSON format is problematic for this use case because, once a schema change is applied, every previous schema is reproduced in full in every subsequent `metadata.json`. This causes the metadata files to grow large, using up significant disk space and slowing down reads of the JSON. This issue of large metadata files has been mentioned previously in https://github.com/apache/iceberg/issues/5219, but it wasn't resolved.
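
The growth can be estimated with a small Python sketch that serializes synthetic schemas and compares the inline layout with the pointer layout. Everything here is an assumption for illustration: the column names, the all-`long` column types, the pointer paths, and the helper names `make_schema`, `inline_size`, and `pointer_size`. Real metadata files contain more than the `schemas` list, so treat the numbers as a rough lower bound on the inline cost.

```python
import json


def make_schema(n_cols: int, schema_id: int) -> dict:
    """Build a synthetic schema with `n_cols` optional long columns."""
    fields = [
        {"id": i, "name": f"col_{i}", "type": "long", "required": False}
        for i in range(1, n_cols + 1)
    ]
    return {"schema-id": schema_id, "type": "struct",
            "fields": fields, "identifier-field-ids": []}


def inline_size(n_cols: int, n_versions: int) -> int:
    """Characters needed when every schema version is stored inline."""
    schemas = [make_schema(n_cols, v) for v in range(n_versions)]
    return len(json.dumps({"schemas": schemas}))


def pointer_size(n_versions: int) -> int:
    """Characters needed when each schema version is a file pointer."""
    schemas = [
        {"schema-id": v, "schema_file": f"s3://some-bucket/schema-{v}.json"}
        for v in range(n_versions)
    ]
    return len(json.dumps({"schemas": schemas}))


if __name__ == "__main__":
    # Compare both layouts for a 10k-column table as schema versions accumulate.
    for versions in (1, 5, 25):
        print(versions, inline_size(10_000, versions), pointer_size(versions))
```

In the inline layout the `schemas` section grows in proportion to columns times versions, whereas the pointer layout grows only with the number of versions.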

We're interested in making the code changes and submitting a PR to implement the suggestion, but want to gauge community interest before pursuing it.

### Query engine

None

Improve read times and reduce size of metadata.json by storing schemas in external files #9734