@nssalian
Contributor

PR Description:

Adds a table to hold namespace names and attributes such as location, metadata, and properties.
This splits the implementation in #3274 into smaller, single-change PRs.

Changes:

  1. Split the JdbcUtil constants into JdbcCatalogNamespace and JdbcCatalogTable, which hold the SQL statements for namespaces and tables respectively.
  2. Modified JdbcCatalog to support namespace creation.
  3. Updated the tests to cover the new methods.

@github-actions github-actions bot added the core label Oct 11, 2021
@nssalian
Contributor Author

CC: @rdblue @jackye1995

@nssalian
Contributor Author

@jackye1995, thanks for the review. Made the fixes based on the comments.

NAMESPACE_NAME + " VARCHAR(255) NOT NULL," +
NAMESPACE_LOCATION + " VARCHAR(5500)," +
NAMESPACE_METADATA + " VARCHAR(65535)," +
NAMESPACE_PROPERTIES + " VARCHAR(65535)," +
Contributor

what's the difference of metadata and properties?

Contributor Author

The createNamespace method had Map<String, String> metadata as a parameter. I kept this assuming there was separation of metadata from properties. Do we wish to have one or the other?

@nssalian
Contributor Author

Thanks @jackye1995. Addressed the comments except the one about metadata versus properties. Let me know if we need to keep the two separate, given that the method signature has metadata.

@nssalian nssalian requested a review from jackye1995 October 12, 2021 22:24
@nssalian nssalian mentioned this pull request Oct 13, 2021
@nssalian
Contributor Author

@jackye1995 @rdblue, please have a look when you get a chance. Let me know if I can add or augment anything in the PR.

@nssalian nssalian requested a review from rdblue October 18, 2021 17:29
}

LOG.debug("Creating table {} to store iceberg catalog namespaces",
JdbcUtil.CATALOG_NAMESPACE_TABLE_NAME);
Contributor

Style: This one is also off.

Contributor Author

Fixed.

@nssalian nssalian requested a review from rdblue October 18, 2021 17:37
@rdblue
Contributor

rdblue commented Oct 18, 2021

Previously, a namespace existed if there was a table in that namespace. This changes the definition so that a namespace exists if there is a row in the namespaces table. That is an incompatible definition. I think we need to decide what the behavior should be for the JDBC catalog, and then make sure that the implementation is consistent. If you should be able to implicitly create a namespace by creating a table (as is still allowed) then this needs to return that a namespace exists if a table is in the namespace. On the other hand, if that isn't allowed then createTable needs to be updated to check namespaceExists. I think that because namespaces have been implicitly created up to now, we should either continue to allow it (for compatibility with existing catalogs) or we should add a property that enables the more strict behavior (probably preferred).

Next, it looks like this creates a generic table with catalog name, namespace name, and string blobs. Is that the right way to go? Why not use a namespace_properties table that has the schema catalog string, namespace string, key string, value string? A namespace would exist if there is a table with the namespace or if there is at least one property for the namespace. That seems like a better way to model the information than converting to and from JSON or base64 to me. What is the benefit of using a row per namespace instead?
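
The existence rule described in this comment can be sketched with in-memory collections standing in for the two JDBC tables. This is only an illustration under stated assumptions: the class name, the `strictNamespaces` flag, and the method shapes are hypothetical and are not taken from the PR.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the two JDBC tables; every name here is hypothetical. */
final class NamespaceExistenceSketch {
  // namespace -> table names (stands in for the table that stores Iceberg tables)
  private final Map<String, Set<String>> tables = new HashMap<>();
  // namespace -> key/value pairs (stands in for a namespace properties table)
  private final Map<String, Map<String, String>> namespaceProperties = new HashMap<>();
  // assumed opt-in switch for the stricter behavior discussed above
  private final boolean strictNamespaces;

  NamespaceExistenceSketch(boolean strictNamespaces) {
    this.strictNamespaces = strictNamespaces;
  }

  /** A namespace exists if it contains a table or has at least one property. */
  boolean namespaceExists(String namespace) {
    Set<String> tablesInNamespace = tables.get(namespace);
    Map<String, String> props = namespaceProperties.get(namespace);
    boolean hasTable = tablesInNamespace != null && !tablesInNamespace.isEmpty();
    boolean hasProperty = props != null && !props.isEmpty();
    return hasTable || hasProperty;
  }

  void setProperty(String namespace, String key, String value) {
    namespaceProperties.computeIfAbsent(namespace, ns -> new HashMap<>()).put(key, value);
  }

  /** In strict mode the namespace must already exist; otherwise it is created implicitly. */
  void createTable(String namespace, String table) {
    if (strictNamespaces && !namespaceExists(namespace)) {
      throw new IllegalStateException("Namespace does not exist: " + namespace);
    }
    tables.computeIfAbsent(namespace, ns -> new HashSet<>()).add(table);
  }
}
```

With the flag off, creating a table implicitly creates its namespace, which matches the existing behavior; with the flag on, a property has to be written first.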

@nssalian
Contributor Author

Thanks for the comments @rdblue .

re: namespace and table
I would prefer the stricter approach since it has the check in place to ensure a table is created only if a namespace exists. A config can be added for legacy reasons but that will mean a dual behavior for the NamespaceExists logic - one with the config and one without. If that breaking change is ok with the community and can be rolled out, I think I can amend the PR to have that implementation.
The only implementation issue I see here is we need to add the properties that the Catalog is initialized with to get connection properties within the JdbcTableOperations class. I have this in the larger PR #3274 by adding a dbProperties field.

re: table schema
I agree and I think that makes more sense. A namespace_properties table could hold the properties with a single row representing a property per namespace. I see this akin to the Hive Metastore.
We just have to do some stitching of the properties when retrieving for a single namespace.
Removal will be easier in this model.

@nssalian
Contributor Author

Also, I couldn't get clarity on whether we need both namespace metadata and properties or only properties, based on @jackye1995's earlier comments.

@rdblue
Contributor

rdblue commented Oct 18, 2021

> A config can be added for legacy reasons but that will mean a dual behavior for the NamespaceExists logic - one with the config and one without. If that breaking change is ok with the community and can be rolled out, I think I can amend the PR to have that implementation.

I don't think we should introduce a breaking change like this. We could, in theory, fix it up when starting up the catalog by populating the table... but that's a bit odd. I think it's better to keep behavior and allow an option to be more strict.

> The only implementation issue I see here is . . .

I didn't follow this. Can you explain it a bit more for me?

> I agree and I think that makes more sense. A namespace_properties table could hold the properties with a single row representing a property per namespace. I see this akin to the Hive Metastore.

The trade-off here is that removing the last namespace property would basically drop the namespace. In that case, the logic to check whether tables or properties exist is actually a good thing. I like the idea of the two tables being loosely coupled.

@nssalian
Contributor Author

> I didn't follow this. Can you explain it a bit more for me?

The JdbcTableOperations doesn't seem to have the properties that are used to initialize the JdbcCatalog. I noticed this in tests. For namespaceExists to work in JdbcTableOperations there has to be an instance of JdbcCatalog in JdbcTableOperations.

More explanation:

  1. I added this field
    public JdbcCatalog(Map<String, String> dbProperties) {
    to be passed
    return new JdbcTableOperations(connections, io, catalogName, tableIdentifier, dbProperties, getConf());
  2. This helps to initialize the catalog with the same connection properties as the JdbcCatalog:
    this.catalog = new JdbcCatalog(dbProperties);

Maybe there's a cleaner way to do this that I am missing.

@nssalian
Contributor Author

> I don't think we should introduce a breaking change like this. We could, in theory, fix it up when starting up the catalog by populating the table... but that's a bit odd. I think it's better to keep behavior and allow an option to be more strict.

Agreed.

> The trade-off here is that removing the last namespace property would basically drop the namespace. In that case, the logic to check whether tables or properties exist is actually a good thing. I like the idea of the two tables being loosely coupled.

Agreed.

I'll make the changes in upcoming commits.

@nssalian
Contributor Author

nssalian commented Oct 22, 2021

@rdblue added the changes we discussed.

  1. Made a properties table
  2. Updated namespaceExists to the original behavior - I don't see the strictness being needed at the moment.

Question:

  1. Should I handle namespace properties (set and get) and dropNamespace in this PR or open a separate PR for it?
  2. Also, some of the workflows don't seem to be triggering, how does one run tests on PRs?

  DatabaseMetaData dbMeta = conn.getMetaData();
- ResultSet tableExists = dbMeta.getTables(null, null, JdbcUtil.CATALOG_TABLE_NAME, null);
+ ResultSet tableExists = dbMeta.getTables(null, null,
+     JdbcUtil.CATALOG_TABLE_NAME, null);
Contributor

Did this need to change? Looks like it is a whitespace-only change.

Contributor Author

Missed that. Will fix.

@rdblue
Contributor

rdblue commented Oct 22, 2021

@nssalian, looks like there are two tables, iceberg_namespaces and namespace_properties. Do we need two? It seems to me that we only need the properties table and a namespace exists if the properties exist or if a table exists.

protected static final String CATALOG_NAMESPACE_TABLE_NAME = "iceberg_namespaces";
protected static final String NAMESPACE_NAME = "namespace_name";
protected static final String NAMESPACE_METADATA = "metadata";
protected static final String NAMESPACE_PROPERTIES_TABLE_NAME = "namespace_properties";
Contributor

This should probably be iceberg_namespace_properties to fit with the other table names.

Contributor Author

Will do.

NAMESPACE_NAME + " VARCHAR(255) NOT NULL," +
NAMESPACE_PROPERTY_KEY + " VARCHAR(5500)," +
NAMESPACE_PROPERTY_VALUE + " VARCHAR(5500)," +
"PRIMARY KEY (" + CATALOG_NAME + ", " + NAMESPACE_NAME + ")" +
Contributor

I think the primary key should be catalog, namespace, and key.

Contributor Author

Agreed.
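
A minimal sketch of what the key-value table definition might look like with the primary key widened to catalog, namespace, and key. The table name follows the reviewer's earlier suggestion and `namespace_name` comes from the snippet above; the other column names, the column lengths, and the `NOT NULL` on the key column are assumptions (primary-key columns cannot be null, and some databases limit index key length).

```java
/** Hypothetical holder for the DDL string; not the constants used in the PR. */
final class NamespacePropertiesSql {
  static final String TABLE_NAME = "iceberg_namespace_properties";
  static final String CATALOG_NAME = "catalog_name";        // assumed column name
  static final String NAMESPACE_NAME = "namespace_name";
  static final String PROPERTY_KEY = "property_key";        // assumed column name
  static final String PROPERTY_VALUE = "property_value";    // assumed column name

  // One row per (catalog, namespace, key), so a namespace can hold many properties.
  static final String CREATE_TABLE =
      "CREATE TABLE " + TABLE_NAME +
      "(" +
      CATALOG_NAME + " VARCHAR(255) NOT NULL," +
      NAMESPACE_NAME + " VARCHAR(255) NOT NULL," +
      PROPERTY_KEY + " VARCHAR(255) NOT NULL," +             // assumed length
      PROPERTY_VALUE + " VARCHAR(5500)," +
      "PRIMARY KEY (" + CATALOG_NAME + ", " + NAMESPACE_NAME + ", " + PROPERTY_KEY + ")" +
      ")";

  private NamespacePropertiesSql() {
  }
}
```

With only catalog and namespace in the key, a second property for the same namespace would violate the constraint, which is why the key column has to be part of it.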

"(" +
CATALOG_NAME + " VARCHAR(255) NOT NULL," +
NAMESPACE_NAME + " VARCHAR(255) NOT NULL," +
NAMESPACE_METADATA + " VARCHAR(65535)," +
Contributor

I don't think namespace metadata is needed. This should be the same thing as properties, right?

Contributor Author

createNamespace(Namespace namespace, Map<String, String> metadata)

is the reason I added that. What does metadata refer to here? Based on our discussion, I think createNamespace needs to change. In the master branch it is Unsupported.

Contributor

The metadata here and the properties that can be set in the other method are the same thing.

Contributor Author

I'll update the PR with setProperties as well in a follow up commit.

Contributor Author

Curious: why are metadata and properties named differently if they mean the same thing?

Contributor

Sometimes, an argument is used with a different name because it shadows a field in the class, which would break checkstyle.

So if the class has a field properties, then an argument called properties would make references to properties ambiguous. So our checkstyle doesn't allow it (outside of constructors, where it's not ambiguous and this. has to be used to reference the class's value).

This is often why they're different. Also, in some cases it just comes down to whoever wrote the function signature: the names drifted over time because of different authors or simply the word choice of the person working on that code 🤷. But in my experience, it's usually that the argument's name would shadow a field on the class.
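
The shadowing point can be illustrated with a small, hypothetical class (none of these names come from the PR). Checkstyle's `HiddenField` check is one rule that flags this pattern and can be configured to ignore constructor parameters; whether that is the exact rule configured for this project is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical class showing why a parameter is named differently from a field. */
final class ShadowingSketch {
  private final Map<String, String> properties = new HashMap<>();

  // Constructor: the parameter may share the field's name because "this." disambiguates.
  ShadowingSketch(Map<String, String> properties) {
    this.properties.putAll(properties);
  }

  // Regular method: the parameter is called metadata so that a bare reference
  // to properties always means the field, never the argument.
  void addAll(Map<String, String> metadata) {
    properties.putAll(metadata);
  }

  Map<String, String> properties() {
    return properties;
  }
}
```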

@nssalian
Contributor Author

> @nssalian, looks like there are two tables, iceberg_namespaces and namespace_properties. Do we need two? It seems to me that we only need the properties table and a namespace exists if the properties exist or if a table exists.

I think we can remove iceberg_namespaces. What would createNamespace evolve to if we don't have a separate table? Perhaps a single row in namespace_properties, or it could go back to being Unsupported.

@jackye1995
Contributor

Trying to catch up with the conversation...

> it looks like this creates a generic table with catalog name, namespace name, and string blobs. Is that the right way to go? Why not use a namespace_properties table that has the schema catalog string, namespace string, key string, value string? A namespace would exist if there is a table with the namespace or if there is at least one property for the namespace. That seems like a better way to model the information than converting to and from JSON or base64 to me.

Yes, that's actually the original idea that we thought about in the initial PR. The schema in the current PR was from @miR172's POC, and I overlooked the fact that it would break existing behavior, sorry for that. So we should switch to using one table for namespace properties and one table for just table information, as you suggested, and we don't need another namespace table because everything can be modeled as key-value pairs. createNamespace can be modeled as a createdAt=1234567890 key-value pair.

@nssalian
Contributor Author

Thanks for the comment, @jackye1995.
I'll update the implementation of createNamespace and then restore it to just one table for properties.
Do we need two checks in namespaceExists: one for a table created in the namespace and one for an existing properties row?

@github-actions github-actions bot added the INFRA label Jan 12, 2022
@nssalian
Contributor Author

@kbendick, @rdblue, let me know if you have more comments. I addressed the latest comments in the commits.

@nssalian
Contributor Author

@kbendick @rdblue, checking whether you have a chance to review this so we can get it merged. I can rebase after you've had a look.

@kbendick
Contributor

Sorry for the long delay on this @nssalian!

Now that we have the docker based JDBC catalog notebook env, I'm going to install this in it and run some actual tests (as it's been a while since I've reviewed this). If things work, then I'm ok with this as I'd like to see this supported and we've already addressed a lot of feedback.

ResultSet tableExists = dbMeta.getTables(null /* catalog name */, null /* schemaPattern */,
JdbcUtil.NAMESPACE_PROPERTIES_TABLE_NAME /* tableNamePattern */, null /* types */);
Contributor

@kbendick kbendick Feb 11, 2022

Probably using the comments to define the values that are null etc one time is fine.

Maybe making this a private function that takes in conn and then has the comments would enhance readability.

Like

/**
  * Check if a table exists in the database via a pattern to match against.
  * Searches all schemas & catalogs.
  * @return true if a table matching the pattern exists in the database
  */
private static boolean doesTableExist(Connection conn, String tableNamePatternToCheck) {
  DatabaseMetaData dbMeta = conn.getMetaData();
  ResultSet tablesMatchingPattern = dbMeta.getTables(
      null /* catalog name */, null /* schemaPattern */, tableNamePatternToCheck, null /* types */);
  return tablesMatchingPattern.hasNext();
}

That way, the null fields can be annotated once and then we could just call

connections.run(conn -> {
  if (!doesTableExist(conn, JdbcUtil.NAMESPACE_PROPERTIES_TABLE_NAME)) {
   // create table
  }
  return true;
});

I'm also not sure if there's a reason to use .next() instead of .hasNext() on the ResultSet. I'm ok with either way, but hasNext seems like the more intuitive API if there's no semantic difference. I don't know enough about JDBC to be sure if there is a difference or not though so it might be safer to just leave it as is.
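
On the `.next()` versus `.hasNext()` question: `java.sql.ResultSet` does not declare a `hasNext()` method (that belongs to `java.util.Iterator`), so `next()` is the call to use. It moves the cursor to the first row and returns false when there is none. Below is a sketch of the suggested helper with that change, plus the `throws SQLException` that `getMetaData()` and `getTables()` require; the class name is hypothetical.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical holder class for the helper suggested above. */
final class TableExistenceSketch {
  private TableExistenceSketch() {
  }

  /**
   * Checks whether a table matching the pattern exists, searching all catalogs and schemas.
   *
   * @return true if at least one matching table is found
   */
  static boolean doesTableExist(Connection conn, String tableNamePattern) throws SQLException {
    DatabaseMetaData dbMeta = conn.getMetaData();
    // try-with-resources closes the ResultSet once the check is done
    try (ResultSet tablesMatchingPattern = dbMeta.getTables(
        null /* catalog name */, null /* schemaPattern */, tableNamePattern, null /* types */)) {
      // next() advances to the first row and returns false if the result is empty
      return tablesMatchingPattern.next();
    }
  }
}
```

Because the helper throws a checked `SQLException`, any caller (for example a `connections.run` lambda) must either declare or handle it.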

sql.setString(rowIndex + 1, key);
rowIndex += 1;
}
LOG.info("Final log string {}", sql);
Contributor

Nit: This log line can be removed - or at least downgraded to debug and changed to something like LOG.debug("Running query to update namespace properties: {}", sql)

Contributor

@kbendick kbendick left a comment

Left some feedback. I'm going to try running with this and if it works, I'd say we merge it as it's been quite some time.

Comment on lines +562 to +563
AssertHelpers.assertThrows("Cannot create a namespace with null or empty metadata", IllegalArgumentException.class,
() -> catalog.createNamespace(testNamespace, null));
Contributor

@kbendick kbendick Feb 11, 2022

Can you make a separate test, testCreateNamespaceFailsWithoutNamespaceProperties()?

That way, there would be something in the code that self-documents that we are intentional about needing namespace properties.

You could then even do something like the following

AssertHelpers.assertThrows(
    "JDBC catalog cannot create a namespace without adding namespace properties as the properties are used as an existence check for RDBMs that don't support an atomic compare and swap",
    ...

The long / detailed comment probably isn't needed, but having it as a separate test (with both null and an empty map being tried in the test) would be nice.

Comment on lines +292 to +293
throw new UncheckedInterruptedException(e, "Interrupted in call to insertProperties(namespace, properties) " +
"Namespace: %s", namespace);
Contributor

Can you update this error message to use human-readable wording instead of referring to code?

If somebody said it was fine for here, then leave it. But generally speaking, we prefer to use plain English (vs English with code in it) as end users are unlikely to know the code.

I'd go with something like Interrupted while inserting properties to namespace %s in catalog %s. Insertion likely needs to be retried.

You can drop the insertion likely needs to be retried part if there's not a chance of a partial success (I don't think there is as you're using prepared statements).

Contributor

@rdblue rdblue left a comment

Looks great! I'm sorry that it took me so long to get back to this review, @nssalian! I'm going to go ahead and merge this.

I'm also working on some catalog validation tests, so I'll add a few more tests as part of that work. I'll probably fix a couple of @kbendick's suggestions also.

@rdblue rdblue merged commit caefd6e into apache:master Feb 23, 2022
arminnajafi pushed a commit to arminnajafi/iceberg that referenced this pull request Feb 23, 2022
@nssalian nssalian deleted the namespace-table branch February 23, 2022 01:25
@nssalian
Contributor Author

Thanks @rdblue @kbendick for the review.
