@ajantha-bhat
Member

@ajantha-bhat ajantha-bhat commented Oct 26, 2021

Adds support for the CALL procedure for rewrite_data_files.
Supported input arguments: table, [strategy], [sort_order], [options], [where]
Supported output arguments: rewritten_data_files_count, added_data_files_count
Adds test cases to cover these scenarios.

Fixes #3355
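
For illustration, the argument list above could be exercised with a statement built like the following. This is a minimal sketch, not code from this PR: the helper name `build_rewrite_call`, the catalog name `my_catalog`, and the table `db.sample` are hypothetical, and the `catalog.system.<procedure>` prefix, the `name => value` named-argument form, and passing options via `map(...)` are assumed from Iceberg's documented convention for Spark procedures. The sketch does not attempt string escaping and simply rejects values that contain quotes or backslashes.

```python
def build_rewrite_call(catalog, table, strategy=None, sort_order=None,
                       options=None, where=None):
    """Render a CALL statement for rewrite_data_files using named arguments.

    Hypothetical helper for illustration only; only `table` is required,
    matching the bracketed (optional) arguments in the description.
    """
    def quote(value):
        text = str(value)
        # No escaping is attempted in this sketch: reject risky characters.
        if "'" in text or "\\" in text:
            raise ValueError(f"unsupported character in argument: {text!r}")
        return f"'{text}'"

    args = [f"table => {quote(table)}"]
    if strategy is not None:
        args.append(f"strategy => {quote(strategy)}")
    if sort_order is not None:
        args.append(f"sort_order => {quote(sort_order)}")
    if options:
        # Options are passed as alternating key/value entries of a SQL map.
        pairs = ", ".join(f"{quote(k)}, {quote(v)}" for k, v in options.items())
        args.append(f"options => map({pairs})")
    if where is not None:
        args.append(f"where => {quote(where)}")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


if __name__ == "__main__":
    print(build_rewrite_call(
        "my_catalog",
        "db.sample",
        strategy="sort",
        sort_order="id DESC NULLS LAST",
        options={"min-input-files": "2"},
        where="id = 3",
    ))
```

The resulting string could be passed to `spark.sql(...)` in a session that has the Iceberg SQL extensions enabled; the result would be expected to carry the two output columns named in the description, rewritten_data_files_count and added_data_files_count.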

@jackye1995
Contributor

Thanks for working on this! Almost forgot we haven't added the call procedure for this yet...

I remember Russell had laid out the arguments for the procedure in https://docs.google.com/document/d/1aXo1VzuXxSuqcTzMLSQdnivMVgtLExgDWUFMvWeXRxc, could you follow that proposal?

@ajantha-bhat
Member Author

@jackye1995: Thanks for sharing this doc. I will look into it and implement the procedure accordingly.

@ajantha-bhat
Member Author

ajantha-bhat commented Oct 27, 2021

@jackye1995, @RussellSpitzer:
I have added some more arguments and updated the document.
As far as I can tell, only the sort-related options and the where argument are still pending. For where, I am not yet sure how to convert the text into an expression. For the sort-related options, I am not sure whether to expose them as a map or as individual options.
If that is OK, I would like to handle these in a subsequent PR. Please review the current changes.

Should I also copy the same code into the spark/v3.2 folder?

P.S. Please also let me know if any other arguments are missing.

@ajantha-bhat ajantha-bhat marked this pull request as ready for review October 27, 2021 14:03
@RussellSpitzer
Member

> @jackye1995, @RussellSpitzer: I have added some more arguments and updated the document. As far as I can tell, only the sort-related options and the where argument are still pending. For where, I am not yet sure how to convert the text into an expression. For the sort-related options, I am not sure whether to expose them as a map or as individual options. If that is OK, I would like to handle these in a subsequent PR. Please review the current changes.
>
> Should I also copy the same code into the spark/v3.2 folder?
>
> P.S. Please also let me know if any other arguments are missing.

For "where" can you check and see if you can produce expressions using the Spark session's parser? I think you may be able to convert to expressions using the parser and then translate them using our Spark3 utils ... I'm not sure though.

@ajantha-bhat ajantha-bhat force-pushed the call branch 2 times, most recently from 6dea11f to 36ed20a Compare October 27, 2021 16:08
@ajantha-bhat
Member Author

@jackye1995, @RussellSpitzer, @rdblue:
Is there an existing class for converting org.apache.spark.sql.catalyst.expressions.Expression to org.apache.iceberg.expressions.Expression? I searched by import statement and did not find one.

To support the filter, I used the Spark parser to parse the string into an expression, but that yields a Spark Catalyst expression. The rewrite data files action needs an Iceberg expression instead.

@ajantha-bhat ajantha-bhat changed the title CALL procedure for rewrite_data_files Spark: CALL procedure for rewrite_data_files Nov 1, 2021
@rdblue
Contributor

rdblue commented Nov 1, 2021

There is a conversion from Spark to Iceberg expressions in SparkFilters.

@ajantha-bhat
Member Author

> There is a conversion from Spark to Iceberg expressions in SparkFilters.

@rdblue: I think that class converts a Spark filter to an Iceberg expression. What I need is a conversion from a Spark expression to an Iceberg expression.

@ajantha-bhat
Member Author

@RussellSpitzer, @jackye1995, @rdblue:
Please review and help merge this PR.

I will backport this to the v3.0 folder once this PR is reviewed and merged. Thanks.

@ajantha-bhat ajantha-bhat force-pushed the call branch 2 times, most recently from 2194b42 to 170fe7b Compare November 5, 2021 03:06
@ajantha-bhat
Member Author

@RussellSpitzer: Does anything else need to be done for this PR, or can it be merged?

Contributor

@jackye1995 jackye1995 left a comment

Overall looks good to me, just some nits.

@ajantha-bhat
Member Author

@jackye1995: Thanks for the review. I have also pushed the nit fixes.

Member

@RussellSpitzer RussellSpitzer left a comment

Just had some minor comments which I think we need to wrap up, other than that I think this is looking pretty good. I know it's difficult for us to work this kind of functionality into Iceberg when Spark keeps a lot of these methods from us.

@ajantha-bhat
Member Author

@RussellSpitzer: Thanks again for the review. I have addressed the comments (and replied to one of them) and pushed the new changes.

@ajantha-bhat
Member Author

@RussellSpitzer: I have now updated the test case to use a DataFrame write instead of SQL, so I think all comments are addressed. Thanks.

@ajantha-bhat ajantha-bhat changed the title Spark: CALL procedure for rewrite_data_files Spark-3.2: CALL procedure for rewrite_data_files Dec 4, 2021
@ajantha-bhat ajantha-bhat changed the title Spark-3.2: CALL procedure for rewrite_data_files Spark: CALL procedure for rewrite_data_files Dec 4, 2021
Member

@RussellSpitzer RussellSpitzer left a comment

Thanks so much, this is a great addition to the codebase

@ajantha-bhat
Member Author

@RussellSpitzer: Thanks for merging the PR. I appreciate your guidance and your patient replies to my questions. They finally helped me avoid the custom code 👍🏻

RussellSpitzer pushed a commit that referenced this pull request Dec 9, 2021
Backport of #3375 - Support CALL procedure for rewrite_data_files
RussellSpitzer pushed a commit that referenced this pull request Dec 9, 2021
Backport of #3375 - Support CALL procedure for rewrite_data_files

Linked issue: Support CALL procedure for rewrite_datafiles similar to rewrite_manifests