Conversation

@pvary
Contributor

@pvary pvary commented Aug 4, 2021

No description provided.

@aokolnychyi
Contributor

Let me do a pass.

I think I agree with moving GenericOrcXXX classes to orc from data as we only had ORC classes in data before.


// the appender uses the row schema without extra columns
appenderBuilder.schema(rowSchema);
appenderBuilder.createWriterFunc(createWriterFunc);
Contributor

@aokolnychyi aokolnychyi Aug 4, 2021

Do we plan to introduce write table properties for ORC like we have for Parquet? Like controlling the stripe size?

I think our ORC and Parquet implementations are inconsistent here. We can probably set some ORC props directly in table properties and they will be applied, but then we have no way to configure the delete writers separately.

@rdblue @omalley @pvary @edgarRd @shardulm94, any particular reason why we don't have table properties to configure ORC writes?
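For context, the Parquet side already exposes writer tuning through table properties (e.g. write.parquet.row-group-size-bytes). A minimal sketch of what an ORC equivalent could look like — the property names, defaults, delete-specific override, and helper class below are all hypothetical, not part of the current API:

```java
import java.util.Map;

// Hypothetical ORC write properties, mirroring the existing Parquet naming
// convention (write.parquet.*). Names and defaults are illustrative only.
public final class OrcWriteProperties {
  public static final String ORC_STRIPE_SIZE_BYTES = "write.orc.stripe-size-bytes";
  public static final long ORC_STRIPE_SIZE_BYTES_DEFAULT = 64L * 1024 * 1024;

  // A separate key would let delete writers be tuned independently of data writers.
  public static final String DELETE_ORC_STRIPE_SIZE_BYTES = "write.delete.orc.stripe-size-bytes";

  private OrcWriteProperties() {
  }

  // Resolve the stripe size for a writer: delete writers fall back to the
  // data-writer setting, which in turn falls back to the default.
  public static long stripeSize(Map<String, String> tableProperties, boolean isDeleteWriter) {
    String dataValue = tableProperties.getOrDefault(
        ORC_STRIPE_SIZE_BYTES, String.valueOf(ORC_STRIPE_SIZE_BYTES_DEFAULT));
    String value = isDeleteWriter
        ? tableProperties.getOrDefault(DELETE_ORC_STRIPE_SIZE_BYTES, dataValue)
        : dataValue;
    return Long.parseLong(value);
  }
}
```

The delete-specific key falling back to the data-writer key is one way to address the "no way to configure delete writers separately" concern above.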

Contributor Author

That's a good question; I think it would merit a longer discussion and probably a separate PR once we have consensus. I feel a little ambivalent about the Iceberg-defined defaults: they might interfere with the execution engine / file format defaults and could surprise users. We have them for Parquet, so I expect we thought about this in detail before, and I would be happy to learn what our thought process was.

Member

+1 to keeping a consistent approach between ORC and Parquet for defining the data writers' and delete writers' configurations. But I'm okay with opening a separate issue to address it.

Contributor Author

I would suggest doing this in a follow-up PR.

@aokolnychyi
Contributor

Thanks for working on this, @pvary! I think this is the right direction. I'd love to see this integrated in our Flink and Spark writer factories within this PR to ensure things work as expected.

My biggest question is around DeleteWriter in GenericOrcWriters. I'd align it with the equivalent path in Parquet.

@aokolnychyi
Contributor

cc @openinx @edgarRd @shardulm94

Contributor

@kbendick kbendick left a comment

Thank you for taking the time to work on this, @pvary! I would also like to see it integrated with Flink and Spark, though I think that keeping the PRs smaller is likely better.

Maybe in a separate PR we could introduce a proof of concept for one of the engines, as mentioned by @aokolnychyi? I'm personally OK with entirely separate PRs, as I'm just happy this is getting done, but it would be nice to ensure that the integration works as expected.

Please let me know if I can help somehow =)


@Override
public void write(PositionDelete<Record> row, VectorizedRowBatch output) throws IOException {
record.set(0, row.path());
Contributor

Nit / non-blocking: do we have named constants for the record positions (0 for path, 1 for pos, and I guess 2 for row)?

If we have that elsewhere, would be nice to see it used.
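For illustration, such named constants for the position-delete record layout could look like the sketch below. The class and constant names are hypothetical; the ordinals follow the position-delete field order (file path, position, optional deleted row):

```java
// Hypothetical named constants for the position-delete record layout.
// Ordinals follow the position-delete field order: file path, position,
// and the optional deleted row. Names are illustrative only.
final class PositionDeleteOrdinals {
  static final int PATH = 0;
  static final int POS = 1;
  static final int ROW = 2;

  private PositionDeleteOrdinals() {
  }
}

// The call site in the snippet above would then read:
//   record.set(PositionDeleteOrdinals.PATH, row.path());
//   record.set(PositionDeleteOrdinals.POS, row.pos());
```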

Contributor Author

We use this in both Parquet and ORC, but my feeling is that it should stay limited to the specific FileFormats.

Member

@openinx openinx left a comment

Thanks for contributing this, @pvary! I left several comments that need to be addressed. Btw, I also feel we don't need to move those GenericOrcReader(s) and GenericOrcWriter(s) in this delete-writer PR, to avoid creating extra conflicts when people cherry-pick it into their own repos.

@pvary
Contributor Author

pvary commented Sep 27, 2021

Thanks for the review @openinx!
I will need some time to catch up with my other tasks, but I would like to come back to your review once those are done.

Peter Vary added 3 commits October 5, 2021 14:59
Changed the PositionDeleteWriter write signature to write(PositionDelete<?> row, VectorizedRowBatch output)
@pvary
Contributor Author

pvary commented Oct 6, 2021

@openinx : By any chance, do you have time for another round of review?
Thanks,
Peter

@github-actions github-actions bot added the flink label Oct 7, 2021
@pvary
Contributor Author

pvary commented Oct 7, 2021

btw, I also feel we don't need to move those GenericOrcReader(s) and GenericOrcWriter(s) in this delete-writer PR, to avoid creating extra conflicts when people cherry-pick it into their own repos.

The ORC Java code needs to access GenericOrcReader, and iceberg-data already depends on iceberg-orc and iceberg-parquet, so to avoid a circular dependency I had to move the classes. The added benefit is that the layout is now the same as for Parquet.

@openinx
Member

openinx commented Oct 8, 2021

@pvary Thanks for the work, I think I will take another round on this tomorrow!

@pvary
Contributor Author

pvary commented Oct 8, 2021

@openinx: You might want to take a look at #3250. With the help of #3248 I was able to run all of the tests for ORC as well.

@pvary
Contributor Author

pvary commented Oct 11, 2021

The solution in #3250 seems better, so closing this PR.
Will ping everybody who participated in the review there.

Thanks,
Peter
