Spark: Spark SQL Extensions for create branch #6617
Grammar changes (Iceberg Spark SQL extensions, ANTLR):

```
@@ -73,6 +73,13 @@ statement
     | ALTER TABLE multipartIdentifier WRITE writeSpec #setWriteDistributionAndOrdering
     | ALTER TABLE multipartIdentifier SET IDENTIFIER_KW FIELDS fieldList #setIdentifierFields
     | ALTER TABLE multipartIdentifier DROP IDENTIFIER_KW FIELDS fieldList #dropIdentifierFields
+    | ALTER TABLE multipartIdentifier CREATE BRANCH identifier (AS OF VERSION snapshotId)? (RETAIN snapshotRefRetain snapshotRefRetainTimeUnit)? (snapshotRetentionClause)? #createBranch
     ;

+snapshotRetentionClause
+    : WITH SNAPSHOT RETENTION numSnapshots SNAPSHOTS
+    | WITH SNAPSHOT RETENTION snapshotRetain snapshotRetainTimeUnit
+    | WITH SNAPSHOT RETENTION numSnapshots SNAPSHOTS snapshotRetain snapshotRetainTimeUnit
+    ;
+
 writeSpec
```
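Putting the new rule together, the statement forms this grammar accepts look like the following. A minimal sketch run through `spark.sql`, assuming a SparkSession with the Iceberg Spark SQL extensions enabled; the table name `db.tbl`, branch name `b1`, and snapshot ID are placeholders:

```scala
// Branch at the table's current snapshot, with no retention settings.
spark.sql("ALTER TABLE db.tbl CREATE BRANCH b1")

// Branch at an explicit snapshot. RETAIN sets how long the branch ref itself is kept
// (mapped to setMaxRefAgeMs by CreateBranchExec later in this diff); WITH SNAPSHOT RETENTION
// sets the minimum snapshots to keep and the maximum snapshot age on the branch.
spark.sql("""
  ALTER TABLE db.tbl CREATE BRANCH b1
  AS OF VERSION 1234567890
  RETAIN 7 DAYS
  WITH SNAPSHOT RETENTION 2 SNAPSHOTS 30 DAYS
""")
```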
The same file also extends nonReserved and adds the helper parser rules backing the new clause:

```
@@ -168,34 +175,76 @@ fieldList
     ;

 nonReserved
-    : ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
-    | DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | IDENTIFIER_KW | FIELDS | SET
+    : ADD | ALTER | AS | ASC | BRANCH | BY | CALL | CREATE | DAYS | DESC | DROP | FIELD | FIRST | HOURS | LAST | NULLS | OF | ORDERED | PARTITION | TABLE | WRITE
+    | DISTRIBUTED | LOCALLY | MINUTES | MONTHS | UNORDERED | REPLACE | RETAIN | VERSION | WITH | IDENTIFIER_KW | FIELDS | SET | SNAPSHOT | SNAPSHOTS
     | TRUE | FALSE
     | MAP
     ;

+snapshotId
+    : number
+    ;
+
+numSnapshots
+    : number
+    ;
+
+snapshotRetain
+    : number
+    ;
+
+snapshotRefRetain
+    : number
+    ;
```

Review thread on the snapshotRefRetain rule:

Contributor: Why are there so many aliases for number? Are these rules useful?

Author: @jackye1995 asked the same question. I originally added snapshotRefRetain and snapshotRetain to make the statement-parsing code more readable. Removing them is technically feasible; in the new version I have removed them (including for create branch).
```
+snapshotRefRetainTimeUnit
+    : timeUnit
+    ;
+
+snapshotRetainTimeUnit
+    : timeUnit
+    ;
+
+timeUnit
+    : DAYS
+    | HOURS
+    | MINUTES
+    ;
```

Review thread on the timeUnit rule:

Contributor: Missing SECONDS?

Author: Should we support it? I prefer at least minute-level granularity.

Contributor: Sure, I am fine with minute level. We can always add more if needed.
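The grammar exposes only DAYS, HOURS, and MINUTES, while the execution side later in this diff calls setMaxSnapshotAgeMs / setMaxRefAgeMs, which take milliseconds, so the parsed value and unit presumably get normalized before reaching the logical plan. A minimal sketch of such a conversion using java.util.concurrent.TimeUnit (the helper name `retentionToMillis` is hypothetical, not part of this PR):

```scala
import java.util.concurrent.TimeUnit

// Hypothetical helper: normalize a parsed retention value and time-unit token
// (DAYS | HOURS | MINUTES) into the milliseconds the snapshot APIs expect.
def retentionToMillis(value: Long, unit: String): Long = unit.toUpperCase match {
  case "DAYS"    => TimeUnit.DAYS.toMillis(value)
  case "HOURS"   => TimeUnit.HOURS.toMillis(value)
  case "MINUTES" => TimeUnit.MINUTES.toMillis(value)
  case other     => throw new IllegalArgumentException(s"Unsupported time unit: $other")
}

retentionToMillis(7L, "DAYS") // RETAIN 7 DAYS -> 604800000 ms
```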
The lexer gains tokens for the new keywords:

```
 ADD: 'ADD';
 ALTER: 'ALTER';
 AS: 'AS';
 ASC: 'ASC';
+BRANCH: 'BRANCH';
 BY: 'BY';
 CALL: 'CALL';
+DAYS: 'DAYS';
 DESC: 'DESC';
 DISTRIBUTED: 'DISTRIBUTED';
 DROP: 'DROP';
 FIELD: 'FIELD';
 FIELDS: 'FIELDS';
 FIRST: 'FIRST';
+HOURS: 'HOURS';
 LAST: 'LAST';
 LOCALLY: 'LOCALLY';
+MINUTES: 'MINUTES';
+MONTHS: 'MONTHS';
+CREATE: 'CREATE';
 NULLS: 'NULLS';
+OF: 'OF';
 ORDERED: 'ORDERED';
 PARTITION: 'PARTITION';
 REPLACE: 'REPLACE';
+RETAIN: 'RETAIN';
+RETENTION: 'RETENTION';
 IDENTIFIER_KW: 'IDENTIFIER';
 SET: 'SET';
+SNAPSHOT: 'SNAPSHOT';
+SNAPSHOTS: 'SNAPSHOTS';
 TABLE: 'TABLE';
 UNORDERED: 'UNORDERED';
+VERSION: 'VERSION';
 WITH: 'WITH';
 WRITE: 'WRITE';
```
New logical plan node, CreateBranch.scala (new file):

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.apache.spark.sql.catalyst.plans.logical

import org.apache.spark.sql.catalyst.expressions.Attribute

case class CreateBranch(table: Seq[String], branch: String, snapshotId: Option[Long], numSnapshots: Option[Long],
    snapshotRetain: Option[Long], snapshotRefRetain: Option[Long]) extends LeafCommand {

  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

  override lazy val output: Seq[Attribute] = Nil

  override def simpleString(maxFields: Int): String = {
    s"Create branch: ${branch} for table: ${table.quoted}"
  }
}
```
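For orientation, a hedged sketch of the node a parsed statement would map to. The table, branch, and snapshot ID are placeholders, and the retention values are assumed to already be in milliseconds, because CreateBranchExec (below) forwards them to setMaxSnapshotAgeMs and setMaxRefAgeMs unchanged:

```scala
import java.util.concurrent.TimeUnit

// ALTER TABLE db.tbl CREATE BRANCH b1
//   AS OF VERSION 123
//   RETAIN 7 DAYS
//   WITH SNAPSHOT RETENTION 2 SNAPSHOTS 30 DAYS
val node = CreateBranch(
  table = Seq("db", "tbl"),
  branch = "b1",
  snapshotId = Some(123L),                              // AS OF VERSION 123
  numSnapshots = Some(2L),                              // keep at least 2 snapshots
  snapshotRetain = Some(TimeUnit.DAYS.toMillis(30)),    // max snapshot age on the branch
  snapshotRefRetain = Some(TimeUnit.DAYS.toMillis(7)))  // max age of the branch ref itself
```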
New physical exec node, CreateBranchExec.scala (new file; Apache license header as above):

```scala
package org.apache.spark.sql.execution.datasources.v2

import org.apache.iceberg.spark.source.SparkTable
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.CreateBranch
import org.apache.spark.sql.connector.catalog.Identifier
import org.apache.spark.sql.connector.catalog.TableCatalog

case class CreateBranchExec(
    catalog: TableCatalog,
    ident: Identifier,
    createBranch: CreateBranch) extends LeafV2CommandExec {

  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

  override lazy val output: Seq[Attribute] = Nil

  override protected def run(): Seq[InternalRow] = {
    catalog.loadTable(ident) match {
      case iceberg: SparkTable =>
        val snapshotId = createBranch.snapshotId.getOrElse(iceberg.table.currentSnapshot().snapshotId())
        val manageSnapshot = iceberg.table.manageSnapshots()
          .createBranch(createBranch.branch, snapshotId)

        if (createBranch.numSnapshots.nonEmpty) {
          manageSnapshot.setMinSnapshotsToKeep(createBranch.branch, createBranch.numSnapshots.get.toInt)
        }

        if (createBranch.snapshotRetain.nonEmpty) {
          manageSnapshot.setMaxSnapshotAgeMs(createBranch.branch, createBranch.snapshotRetain.get)
        }

        if (createBranch.snapshotRefRetain.nonEmpty) {
          manageSnapshot.setMaxRefAgeMs(createBranch.branch, createBranch.snapshotRefRetain.get)
        }

        manageSnapshot.commit()

      case table =>
        throw new UnsupportedOperationException(s"Cannot add branch to non-Iceberg table: $table")
    }

    Nil
  }

  override def simpleString(maxFields: Int): String = {
    s"Create branch: ${createBranch.branch} operation for table: ${ident.quoted}"
  }
}
```

Review thread on the non-Iceberg fallback (lines +60 to +61):

Contributor: I think this case will be common for all the reference based operations. We may want to see about extracting to a common parent. Not needed at this point, but we may revisit in later DDL implementations.

Contributor: This seems like an existing pattern for all extensions, so I think it is probably fine to leave it like this.
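For reference, the DDL handled here reduces to the same Iceberg snapshot-management calls the exec issues. A standalone sketch against an already loaded org.apache.iceberg.Table, with a hypothetical branch name and retention values:

```scala
import java.util.concurrent.TimeUnit

import org.apache.iceberg.Table

// Same calls as CreateBranchExec, driven directly through the Iceberg API:
// branch the current snapshot, keep at least 2 snapshots, expire snapshots
// older than 30 days on the branch, and expire the branch ref after 7 days.
def createBranchDirectly(table: Table): Unit = {
  table.manageSnapshots()
    .createBranch("b1", table.currentSnapshot().snapshotId())
    .setMinSnapshotsToKeep("b1", 2)
    .setMaxSnapshotAgeMs("b1", TimeUnit.DAYS.toMillis(30))
    .setMaxRefAgeMs("b1", TimeUnit.DAYS.toMillis(7))
    .commit()
}
```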
Changes to ExtendedDataSourceV2Strategy to plan the new node:

```
@@ -31,6 +31,7 @@ import org.apache.spark.sql.catalyst.expressions.Literal
 import org.apache.spark.sql.catalyst.expressions.PredicateHelper
 import org.apache.spark.sql.catalyst.plans.logical.AddPartitionField
 import org.apache.spark.sql.catalyst.plans.logical.Call
+import org.apache.spark.sql.catalyst.plans.logical.CreateBranch
 import org.apache.spark.sql.catalyst.plans.logical.DeleteFromIcebergTable
 import org.apache.spark.sql.catalyst.plans.logical.DropIdentifierFields
 import org.apache.spark.sql.catalyst.plans.logical.DropPartitionField

@@ -61,6 +62,9 @@ case class ExtendedDataSourceV2Strategy(spark: SparkSession) extends Strategy wi
     case AddPartitionField(IcebergCatalogAndIdentifier(catalog, ident), transform, name) =>
       AddPartitionFieldExec(catalog, ident, transform, name) :: Nil

+    case CreateBranch(IcebergCatalogAndIdentifier(catalog, ident), _, _, _, _, _) =>
+      CreateBranchExec(catalog, ident, plan.asInstanceOf[CreateBranch]) :: Nil
```
Review thread on the CreateBranch case:

Contributor: Why does this pass the logical plan rather than passing the necessary information? Is it just to avoid a longer line?

Author: I passed in the …
The existing cases follow unchanged:

```
     case DropPartitionField(IcebergCatalogAndIdentifier(catalog, ident), transform) =>
       DropPartitionFieldExec(catalog, ident, transform) :: Nil
```
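On the question above about passing the whole logical plan: one alternative that keeps the line short without the asInstanceOf cast is to bind the matched node with a pattern binder and hand it to the exec. A hedged sketch of that variant, written as a drop-in for the case clause shown above (not the code in this PR):

```scala
// Hypothetical variant: `c @ ...` binds the matched CreateBranch node, so the
// exec receives it directly instead of re-casting `plan`.
case c @ CreateBranch(IcebergCatalogAndIdentifier(catalog, ident), _, _, _, _, _) =>
  CreateBranchExec(catalog, ident, c) :: Nil
```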
Review thread on the snapshotRetentionClause alternatives:

- Can we simplify these 3 cases to `WITH SNAPSHOT RETENTION (numSnapshots SNAPSHOTS)? (snapshotRetain snapshotRetainTimeUnit)?`?
- That form also accepts the illegal statement `WITH SNAPSHOT RETENTION` with neither part. Alternatively we can use `WITH SNAPSHOT RETENTION ((numSnapshots SNAPSHOTS)? (snapshotRetain snapshotRetainTimeUnit)? | numSnapshots SNAPSHOTS snapshotRetain snapshotRetainTimeUnit)?`, but it's not intuitive.
- I see, I will leave it here to see if anyone has better suggestions. I am not an Antlr expert 😝