ORC: Implement buildEqualityWriter() and buildPositionWriter() #2935
Conversation
Let me do a pass. I think I agree with moving …
// the appender uses the row schema without extra columns
appenderBuilder.schema(rowSchema);
appenderBuilder.createWriterFunc(createWriterFunc);
Do we plan to introduce write table properties for ORC like we have for Parquet, for example to control the stripe size?
I think our ORC and Parquet implementations are inconsistent here. We can probably set some ORC props directly in table properties and they will be applied, but then we have no way to configure delete writers separately.
@rdblue @omalley @pvary @edgarRd @shardulm94, any particular reason why we don't have table properties to configure ORC writes?
That's a good question. I think this merits a longer discussion and probably a different PR after we have consensus. I feel a little ambivalent about the Iceberg-defined defaults: they might interfere with the execution engine / file format defaults and could cause surprises for users. We have them for Parquet, so I expect we thought about this in detail before, and I would be happy to learn what our thought process was.
+1 to keeping a consistent approach between ORC and Parquet for defining the data writer and delete writer configurations. But I'm okay with opening a separate issue to address it.
I would suggest doing this in a follow-up PR.
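For context only, here is a minimal sketch of the fallback lookup the thread above describes, where a delete-writer-specific table property overrides the data-writer one. The property keys, class name, and default value are hypothetical, chosen to mirror the Parquet naming convention; they are not existing Iceberg identifiers.

```java
import java.util.Map;

// Illustrative sketch only: hypothetical property keys showing how ORC delete
// writers could be configured separately from data writers, with a fallback
// to the data-writer value and then to a default.
public class OrcWriteConf {
  private static final String STRIPE_SIZE = "write.orc.stripe-size-bytes";               // hypothetical key
  private static final String DELETE_STRIPE_SIZE = "write.delete.orc.stripe-size-bytes"; // hypothetical key
  private static final long STRIPE_SIZE_DEFAULT = 64L * 1024 * 1024;

  static long deleteStripeSize(Map<String, String> tableProps) {
    // delete-specific key wins; otherwise fall back to the data-writer key, then the default
    String value = tableProps.getOrDefault(DELETE_STRIPE_SIZE, tableProps.get(STRIPE_SIZE));
    return value != null ? Long.parseLong(value) : STRIPE_SIZE_DEFAULT;
  }
}
```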
orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java (outdated review comments, resolved)
Thanks for working on this, @pvary! I think this is the right direction. I'd love to see this integrated into our Flink and Spark writer factories within this PR to ensure things work as expected. My biggest question is around …
kbendick left a comment:
Thank you for taking the time to work on this, @pvary! I would also like to see it integrated with Flink and Spark, though I think keeping the PRs smaller is likely better.
Maybe in a separate PR we could introduce a proof of concept for one of the engines, as mentioned by @aokolnychyi? I'm personally OK with entirely separate PRs, as I'm just happy this is getting done, but it would be nice to ensure that the integration works as expected.
Please let me know if I can help somehow =)
@Override
public void write(PositionDelete<Record> row, VectorizedRowBatch output) throws IOException {
  record.set(0, row.path());
Nit / non-blocking: Do we have named constants for the 0 / 1 record positions (for path and pos, and I guess 2 for row)?
If we have them elsewhere, it would be nice to see them used here.
We use this in Parquet and ORC, but my feeling is that this should be limited to the specific FileFormats.
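A minimal sketch of what the suggested named constants could look like; the constant names and the copyInto helper are hypothetical, assuming the (path, pos, row) layout of PositionDelete, and are not identifiers from this PR.

```java
import org.apache.iceberg.data.Record;
import org.apache.iceberg.deletes.PositionDelete;

// Sketch only: hypothetical named constants for the position-delete record layout,
// replacing the bare 0/1 literals in the writer.
class PositionDeleteOrdinals {
  static final int PATH_ORDINAL = 0; // data file path
  static final int POS_ORDINAL = 1;  // row position within the data file
  static final int ROW_ORDINAL = 2;  // optional copy of the deleted row

  static void copyInto(PositionDelete<Record> delete, Record target) {
    target.set(PATH_ORDINAL, delete.path());
    target.set(POS_ORDINAL, delete.pos());
    if (delete.row() != null) {
      target.set(ROW_ORDINAL, delete.row());
    }
  }
}
```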
orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java (outdated review comment, resolved)
openinx left a comment:
Thanks for contributing this, @pvary! I left several comments that need to be addressed. Btw, I also feel we don't need to move those GenericOrcReader(s) and GenericOrcWriter(s) in this delete-writer PR, to avoid creating extra conflicts when people cherry-pick it into their own repos.
Thanks for the review @openinx!
Changed the PositionDeleteWriter write signature to write(PositionDelete<?> row, VectorizedRowBatch output)
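For reference, a sketch of the adjusted signature described above; the interface name is hypothetical and only illustrates the wildcard type, not the actual class touched in this PR.

```java
import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.iceberg.deletes.PositionDelete;

// Sketch only: the wildcard lets the writer accept deletes regardless of the
// row type they carry, since the path/pos columns do not depend on it.
interface PositionDeleteRowWriter {
  void write(PositionDelete<?> row, VectorizedRowBatch output) throws IOException;
}
```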
@openinx: By any chance, do you have time for another round of review?
ORC java needs to access GenericOrcReader, and …
@pvary Thanks for the work, I think I will take another round on this tomorrow!
The solution in #3250 seems better, so closing this PR. Thanks,
No description provided.