Core: Use V2 format in new tables by default #8381
|
|
||
| return new Builder() | ||
| .upgradeFormatVersion(formatVersion) | ||
| .setFormatVersion(formatVersion) |
I think it's a bit dangerous to expose the ability to set rather than upgrade the format version. I think people are going to call it when they shouldn't. Is it actually needed here?
The builder inits the format version to the default format version (2 now) and if users explicitly pass 1 as the desired format version, create table statements start to fail as it is not allowed to downgrade.
We can make this package-private given that it is only used during table creation. Or we could lazily init the default format version. Either way would work for me.
I made it private, which is still accessible in TableMetadata. It is kind of weird to call upgrade when creating a fresh metadata file.
I feel like upgradeFormatVersion (despite the name) might be more consistent with the other setters, which perform checks and add the MetadataUpdate as needed.
I don't think the validation in upgradeFormatVersion is applicable when creating new tables. I also don't think we are supposed to generate an update, because we are not updating the format version: the table does not exist yet.
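To make the two positions in this thread concrete, here is a minimal sketch of the split being discussed: a private setter used only at table creation, next to an upgrade method that validates and records an update. `MetadataBuilderSketch`, `setInitialFormatVersion`, `newTable`, and the `changes` list are hypothetical names for illustration, not the actual `TableMetadata.Builder` code in this PR, and the supported version range is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in; not Iceberg's actual TableMetadata.Builder.
final class MetadataBuilderSketch {
  static final int DEFAULT_FORMAT_VERSION = 2; // the new default discussed in this PR
  static final int MAX_FORMAT_VERSION = 2; // assumption for this sketch

  private int formatVersion = DEFAULT_FORMAT_VERSION;
  private final List<String> changes = new ArrayList<>();

  // Entry point for brand-new tables: the requested version is applied directly.
  static MetadataBuilderSketch newTable(int requestedFormatVersion) {
    return new MetadataBuilderSketch().setInitialFormatVersion(requestedFormatVersion);
  }

  // Private on purpose: no downgrade check and no recorded update,
  // because the table does not exist yet.
  private MetadataBuilderSketch setInitialFormatVersion(int newFormatVersion) {
    checkSupported(newFormatVersion);
    this.formatVersion = newFormatVersion;
    return this;
  }

  // For existing tables: rejects downgrades and records the change.
  MetadataBuilderSketch upgradeFormatVersion(int newFormatVersion) {
    checkSupported(newFormatVersion);
    if (newFormatVersion < formatVersion) {
      throw new IllegalArgumentException(
          String.format("Cannot downgrade v%d table to v%d", formatVersion, newFormatVersion));
    }
    if (newFormatVersion > formatVersion) {
      this.formatVersion = newFormatVersion;
      changes.add("upgrade-format-version=" + newFormatVersion);
    }
    return this;
  }

  private static void checkSupported(int version) {
    if (version < 1 || version > MAX_FORMAT_VERSION) {
      throw new IllegalArgumentException("Unsupported format version: v" + version);
    }
  }

  int formatVersion() {
    return formatVersion;
  }

  List<String> changes() {
    return changes;
  }

  public static void main(String[] args) {
    MetadataBuilderSketch table = newTable(1); // explicit v1 still works with a v2 default
    System.out.println("created with v" + table.formatVersion() + ", changes=" + table.changes());
    table.upgradeFormatVersion(2);
    System.out.println("upgraded to v" + table.formatVersion() + ", changes=" + table.changes());
  }
}
```

With this shape, creating a V1 table does not trip the downgrade check even though the builder starts at the default, and no update is recorded for a table that has never been committed.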
90d8b60 to c3e613f
@rdblue @nastra @Fokko @amogh-jahagirdar, could you help debug fails in
When testing with V2 as the default, I encountered a REST catalog test failure in the concurrent replace transaction tests. In particular, this check was failing. Say transaction 1 starts, then transaction 2 starts and commits, and then transaction 1 commits: the sequence number for the second commit will be the same as for the first. One such test is this one. It seems like concurrent transactions will always fail with V2 because of this. (Just saw your post now after I posted this, @aokolnychyi.) @nastra I think you found the issue to be with
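The collision described in the comment above can be sketched in a few lines. This is a hypothetical model, not the REST catalog or `TableMetadata` code: `SequenceNumberSketch` and its methods are made up for illustration, and it assumes the failing check requires a new snapshot's sequence number to be strictly greater than the table's last sequence number on V2.

```java
// Hypothetical model of a table's sequence-number bookkeeping; not Iceberg code.
final class SequenceNumberSketch {
  private final int formatVersion;
  private long lastSequenceNumber = 0L;

  SequenceNumberSketch(int formatVersion) {
    this.formatVersion = formatVersion;
  }

  long lastSequenceNumber() {
    return lastSequenceNumber;
  }

  // A transaction derives its sequence number from the state it sees when it starts.
  // V1 tables in this sketch always use 0.
  long startTransaction() {
    return formatVersion >= 2 ? lastSequenceNumber + 1 : 0L;
  }

  // Assumed check: on V2, a new snapshot must advance the sequence number.
  void commit(long snapshotSequenceNumber) {
    if (formatVersion >= 2 && snapshotSequenceNumber <= lastSequenceNumber) {
      throw new IllegalStateException(
          "Sequence number " + snapshotSequenceNumber
              + " is not newer than last sequence number " + lastSequenceNumber);
    }
    lastSequenceNumber = snapshotSequenceNumber;
  }

  public static void main(String[] args) {
    SequenceNumberSketch table = new SequenceNumberSketch(2);
    long txn1 = table.startTransaction(); // transaction 1 starts
    long txn2 = table.startTransaction(); // transaction 2 starts from the same state
    table.commit(txn2); // transaction 2 commits first
    try {
      table.commit(txn1); // transaction 1 reuses the same sequence number
      System.out.println("txn1 committed");
    } catch (IllegalStateException e) {
      System.out.println("txn1 rejected: " + e.getMessage());
    }
  }
}
```

In this model a V1 table accepts both commits because the check is skipped, which would be consistent with the tests passing before the default changed.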
| table.refresh(); | ||
| PartitionSpec expected = | ||
| PartitionSpec.builderFor(table.schema()).withSpecId(1).day("ts").build(); | ||
| PartitionSpecParser.fromJson( |
I had to switch to JSON so that I can correctly validate the field IDs used in the spec, which are not accessible in a programmatic way in this module.
@nastra @amogh-jahagirdar We may need to update the REST catalog commit protocol for this. When we pass the
7569eae to 8539085
@bryanck @nastra @amogh-jahagirdar, I am wrapping up required test changes. Do you think it would be possible to fix the REST catalog tests/implementation quickly or shall I keep using V1 tables there for now?
@aokolnychyi I feel like the REST catalog change will need to happen in a follow-up, so I'd say keep with V1 for now there if you want to get this in.
1500843 to c9b2be3
| Assert.assertTrue( | ||
| "All transforms should be void", | ||
| table.spec().fields().stream().allMatch(pf -> pf.transform().toString().equals("void"))); | ||
| Assert.assertTrue("Table should be unpartitioned", table.spec().isUnpartitioned()); |
minor: I think it would be good to switch assertions to AssertJ in such a case. The project is trying to move away from JUnit4-style assertions, and it would be great to have new assertions written using AssertJ to make the migration process easier.
Yeah, I've used the new assertion style whenever adding new checks or when modifying classes that have already migrated. Here, I modified an existing check, so I followed the rest of the class. I was not sure it was worth switching just a single check.
@bryanck @szehon-ho @rdblue @nastra @RussellSpitzer, this one is ready for a detailed review round. Could you take a look? This change covers many modules and rebasing it won't be fun.
|
|
||
| // it is only safe to set the format version directly while creating tables | ||
| // in all other cases, use the method below to safely upgrade the format version | ||
| private Builder setFormatVersion(int newFormatVersion) { |
The doc comment here is confusing since the method is right below here. I think you should say "in all other cases, use upgradeFormatVersion".
We also could try to encapsulate some of this doc into the method name? newTableFormatVersion?
I changed the method name and the doc.
| .withProperty("key1", "value1") | ||
| .createOrReplaceTransaction(); | ||
|
|
||
| createTxn.newAppend().appendFile(FILE_A).commit(); |
nit: but we could refactor all this out since it's all identical except for the version. Not a big deal now, but maybe makes sense since we'll have a "V3" soon enough
I refactored all applicable tests to run against both versions.
| } | ||
|
|
||
| @Test | ||
| public void testReplaceTxnBuilderDefaultFormatVersion() { |
Same nit about extracting out common code since we will probably have 3 of these tests in the near future.
| .build(); | ||
|
|
||
| Assert.assertEquals("Should have new spec field", expected, table.spec()); | ||
| Assert.assertTrue("New spec must be unpartitioned", table.spec().isUnpartitioned()); |
Do we want to keep a test for the V1 Behavior? We haven't actually deprecated it.
These are being tested against both now.
| sql("ALTER TABLE %s ADD PARTITION FIELD days(ts)", tableName); | ||
| table.refresh(); | ||
| PartitionSpec expected = | ||
| PartitionSpec.builderFor(table.schema()).withSpecId(1).day("ts").build(); |
Why do we need to make the change here? I'm not a big fan of having String constants for expected values, even if we are parsing them into a friendly object for comparison.
I mentioned the reason why some tests refer to String constants here.
| createSourceTable(CREATE_PARQUET, source); | ||
| assertSnapshotFileCount(SparkActions.get().snapshotTable(source).as(dest), source, dest); | ||
| SnapshotTable action = SparkActions.get().snapshotTable(source).as(dest); | ||
| action.tableProperty(TableProperties.FORMAT_VERSION, "1"); |
Does this test need to be on V1, or is this just to minimize changes?
This actually has to be v1 as the test verifies it is v1 and then does an update to v2.
Makes sense, didn't get that from the test name
RussellSpitzer left a comment
Left a few comments. I think this is a good thing to do but it seems like we have taken the approach of fixing each test suite in the way that is simplest. Those which are more complicated are kept on V1 and others are moved to V2. I think it's fine if we want to follow up with this later but we probably need to decide if we want to:
- Do all the tests on the default spec version
- Do all tests with all available spec versions
This is a bit of a mix at the moment. I do think this would be a good task though for a new contributor.
c9b2be3 to aa47543
|
|
||
| @Test | ||
| public void testCreateTableBuilder() throws Exception { | ||
| @ParameterizedTest |
It did not make sense to parameterize the entire suite, so I only changed applicable tests.
| .as("Table should have a spec with one void field") | ||
| .isEqualTo(v1Expected); | ||
| } else { | ||
| Assertions.assertThat(table.spec().isUnpartitioned()) |
This is actually a valid check on both versions now, unless we want to be sure to check for the void transform
https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L89-L95
I feel like it is better to highlight the difference in this case.
| } else { | ||
| expected = | ||
| PartitionSpecParser.fromJson( | ||
| table.schema(), |
Not a blocker, but I think we should expose whatever we need to do this programmatically in tests so we don't have to parse JSON.
Agreed, I'll create an issue to address that.
Filed #8434.
|
|
||
| protected Table createTablePartitioned(int partitions, int files) { | ||
| return createTablePartitioned(partitions, files, SCALE, Maps.newHashMap()); | ||
| Map<String, String> properties = ImmutableMap.of(TableProperties.FORMAT_VERSION, "1"); |
Why V1 For Rewrite DataFiles? (also just this particular create method?)
I can change that for a single test, actually. I'll update.
Updated.
RussellSpitzer left a comment
Looks good to me, left a few minor questions. Some things are left to be done in follow-ups.
aa47543 to 825cc0b
Thanks for reviewing everyone!
| C catalog = catalog(); | ||
|
|
||
| // TODO: temporarily ignore this test for REST catalogs (issue #8390) | ||
| Assumptions.assumeFalse(catalog instanceof RESTCatalog); |
@aokolnychyi, while I understand wanting to get this in, I don't think it's a good idea to set a precedent that we will disable tests that aren't passing. I also don't see much description of the problem on #8390.
This PR switches new tables to use the V2 format by default, as discussed on the dev list.
The purpose of this change is to avoid some workarounds in the V1 format that may impact the user experience for anyone starting with Iceberg. In particular, V1 tables use always-null transforms during spec evolution (confusing) and require a flag to enable snapshot ID inheritance (performance).
Note that V2 tables without delete files (disabled by default) should be no different than V1 tables. If there are engines that don't support V2 tables, they should modify their logic to check for the presence of delete files instead.
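As a rough illustration of the two points in the description, the sketch below resolves the format version for a new table from its properties (falling back to 2) and gates a reader on delete files rather than on the version. The class and method names are hypothetical; `"format-version"` is the property key that `TableProperties.FORMAT_VERSION` refers to, and the engine-side check is an assumption about how such logic could look, not code from any engine.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Iceberg.
final class FormatVersionDefaultsSketch {
  static final String FORMAT_VERSION = "format-version"; // key behind TableProperties.FORMAT_VERSION
  static final int DEFAULT_FORMAT_VERSION = 2; // the default proposed in this PR

  // Use the explicit property when present, otherwise the new default.
  static int resolveFormatVersion(Map<String, String> properties) {
    String value = properties.get(FORMAT_VERSION);
    return value == null ? DEFAULT_FORMAT_VERSION : Integer.parseInt(value);
  }

  // Assumed engine-side gate: a reader without delete support can still read
  // a table as long as the scan contains no delete files, whatever the version.
  static boolean readableWithoutDeleteSupport(int deleteFileCount) {
    return deleteFileCount == 0;
  }

  public static void main(String[] args) {
    Map<String, String> properties = new HashMap<>();
    System.out.println("default: v" + resolveFormatVersion(properties));

    properties.put(FORMAT_VERSION, "1"); // explicitly opting back into V1
    System.out.println("explicit: v" + resolveFormatVersion(properties));

    System.out.println("no deletes readable: " + readableWithoutDeleteSupport(0));
    System.out.println("with deletes readable: " + readableWithoutDeleteSupport(3));
  }
}
```

Tests in this PR that need V1 behavior take the same route and set the format version property to "1" explicitly.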