ORC: Implement buildEqualityWriter() and buildPositionWriter() #2935
Conversation
Let me do a pass. I think I agree with moving …
// the appender uses the row schema without extra columns
appenderBuilder.schema(rowSchema);
appenderBuilder.createWriterFunc(createWriterFunc);
Do we plan to introduce write table properties for ORC like we have for Parquet, for example to control the stripe size?
I think our ORC and Parquet implementations are inconsistent here. We can probably set some ORC props directly in table properties and they will be applied, but then we have no way to configure delete writers separately.
@rdblue @omalley @pvary @edgarRd @shardulm94, any particular reason why we don't have table properties to configure ORC writes?
That's a good question. I think this merits a longer discussion and probably a different PR after we have consensus. I feel a little ambivalent about the Iceberg-defined defaults: they might interfere with the execution engine / file format defaults and could cause surprises for users. We have them for Parquet, so I expect we thought about this in detail before, and I would be happy to learn what our thought process was.
+1 to keeping a consistent approach between ORC and Parquet for defining the data writer and delete writer configurations. But I'm okay with opening a separate issue to address it.
I would suggest doing this in a follow-up PR.
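For context only, here is a minimal sketch of the fallback lookup the thread above describes, where a delete-writer-specific table property overrides the data-writer one. The property keys, class name, and default value are hypothetical, chosen to mirror the Parquet naming convention; they are not existing Iceberg identifiers.

```java
import java.util.Map;

// Illustrative sketch only: hypothetical property keys showing how ORC delete
// writers could be configured separately from data writers, with a fallback
// to the data-writer value and then to a default.
public class OrcWriteConf {
  private static final String STRIPE_SIZE = "write.orc.stripe-size-bytes";               // hypothetical key
  private static final String DELETE_STRIPE_SIZE = "write.delete.orc.stripe-size-bytes"; // hypothetical key
  private static final long STRIPE_SIZE_DEFAULT = 64L * 1024 * 1024;

  static long deleteStripeSize(Map<String, String> tableProps) {
    // delete-specific key wins; otherwise fall back to the data-writer key, then the default
    String value = tableProps.getOrDefault(DELETE_STRIPE_SIZE, tableProps.get(STRIPE_SIZE));
    return value != null ? Long.parseLong(value) : STRIPE_SIZE_DEFAULT;
  }
}
```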
orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java (outdated review comments, resolved)
Thanks for working on this, @pvary! I think this is the right direction. I'd love to see this integrated into our Flink and Spark writer factories within this PR to ensure things work as expected. My biggest question is around …
kbendick left a comment:
Thank you for taking the time to work on this, @pvary! I would also like to see it integrated with Flink and Spark, though I think keeping the PRs smaller is likely better.
Maybe in a separate PR we could introduce a proof of concept for one of the engines, as mentioned by @aokolnychyi? I'm personally OK with entirely separate PRs, as I'm just happy this is getting done, but it would be nice to ensure that the integration works as expected.
Please let me know if I can help somehow =)
@Override
public void write(PositionDelete<Record> row, VectorizedRowBatch output) throws IOException {
  record.set(0, row.path());
Nit / non-blocking: Do we have named constants for the 0 / 1 record positions (for path and pos, and I guess 2 for row)?
If we have them elsewhere, it would be nice to see them used here.
We use this in Parquet and ORC, but my feeling is that this should be limited to the specific FileFormats.
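A minimal sketch of what the suggested named constants could look like; the constant names and the copyInto helper are hypothetical, assuming the (path, pos, row) layout of PositionDelete, and are not identifiers from this PR.

```java
import org.apache.iceberg.data.Record;
import org.apache.iceberg.deletes.PositionDelete;

// Sketch only: hypothetical named constants for the position-delete record layout,
// replacing the bare 0/1 literals in the writer.
class PositionDeleteOrdinals {
  static final int PATH_ORDINAL = 0; // data file path
  static final int POS_ORDINAL = 1;  // row position within the data file
  static final int ROW_ORDINAL = 2;  // optional copy of the deleted row

  static void copyInto(PositionDelete<Record> delete, Record target) {
    target.set(PATH_ORDINAL, delete.path());
    target.set(POS_ORDINAL, delete.pos());
    if (delete.row() != null) {
      target.set(ROW_ORDINAL, delete.row());
    }
  }
}
```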
orc/src/main/java/org/apache/iceberg/data/orc/GenericOrcWriters.java (outdated review comment, resolved)
openinx left a comment:
Thanks for contributing this, @pvary! I left several comments that need to be addressed. Btw, I also feel we don't need to move those GenericOrcReader(s) and GenericOrcWriter(s) in this delete-writer PR, to avoid creating extra conflicts when people cherry-pick it into their own repos.
Thanks for the review @openinx!
Changed the PositionDeleteWriter write signature to write(PositionDelete<?> row, VectorizedRowBatch output)
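For reference, a sketch of the adjusted signature described above; the interface name is hypothetical and only illustrates the wildcard type, not the actual class touched in this PR.

```java
import java.io.IOException;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.iceberg.deletes.PositionDelete;

// Sketch only: the wildcard lets the writer accept deletes regardless of the
// row type they carry, since the path/pos columns do not depend on it.
interface PositionDeleteRowWriter {
  void write(PositionDelete<?> row, VectorizedRowBatch output) throws IOException;
}
```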
@openinx: By any chance, do you have time for another round of review?
ORC java needs to access GenericOrcReader, and …
@pvary Thanks for the work, I think I will take another round on this tomorrow!
The solution in #3250 seems better, so closing this PR. Thanks,
No description provided.