API: Add expression sanitizer, sanitize scan log messages #4672

rdblue · 2022-04-29T15:56:50Z

This adds an expression sanitizer and sanitizes filters logged by table scans. This is to avoid leaking IDs and other sensitive information in logs.

Literals are sanitized using the following rules:

Integers and longs are converted to (N-digit-int), where N is the number of digits in the number
Floats and doubles are converted to (N-digit-float), where N is the number of digits before the decimal point
Dates, times, and timestamps are converted to (date), (time), or (timestamp)
All other types are hashed to produce (hash-%08h) from the string representation

The result for string literals matches the result after the literals are bound and converted.
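As a rough illustration of the rules above, here is a minimal, self-contained sketch. It is not the code from this PR: the real implementation works on Iceberg `Literal` objects, while this sketch uses plain Java values and `java.time` types as stand-ins. The class name `LiteralSanitizerSketch`, the use of `String.hashCode()` as the hash function, and the `%08x` format are all assumptions made for the example.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.OffsetDateTime;
import java.util.Locale;

// Hypothetical stand-in for the sanitizer; not the Iceberg implementation.
final class LiteralSanitizerSketch {
  private LiteralSanitizerSketch() {}

  static String sanitize(Object value) {
    if (value instanceof Integer || value instanceof Long) {
      // ints and longs share one label; only the magnitude is kept
      return sanitizeNumber((Number) value, "int");
    } else if (value instanceof Float || value instanceof Double) {
      return sanitizeNumber((Number) value, "float");
    } else if (value instanceof LocalDate) {
      return "(date)";
    } else if (value instanceof LocalTime) {
      return "(time)";
    } else if (value instanceof LocalDateTime
        || value instanceof OffsetDateTime
        || value instanceof Instant) {
      return "(timestamp)";
    }
    // Assumption: String.hashCode() stands in for whatever hash the PR uses.
    return String.format(Locale.ROOT, "(hash-%08x)", String.valueOf(value).hashCode());
  }

  static String sanitizeNumber(Number value, String type) {
    // Count digits before the decimal point; the sign is ignored.
    // NaN and infinities are not handled in this sketch (BigDecimal rejects them).
    int digits = new BigDecimal(value.toString()).toBigInteger().abs().toString().length();
    return String.format(Locale.ROOT, "(%d-digit-%s)", digits, type);
  }
}
```

With this sketch, `sanitize(1234)` and `sanitize(1234L)` both yield `(4-digit-int)`, and `sanitize(123.456)` yields `(3-digit-float)`.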

rdblue · 2022-04-29T15:57:37Z

@dain, this sanitizes those logged filters you pointed out.

kbendick

This looks good to me overall, +1 from me once tests are passing.

I'm wondering if we need to be this aggressive about sanitizing dates and times. Possibly a small portion of it could be left in?

I do recognize that determining the level of sanitization required is usually a business specific issue, so it's likely better to over-sanitize than under-sanitize.

Is this something we should provide a configuration to enable / disable, possibly within the engines somehow? In many cases, these values are not sensitive, or not sensitive enough to be a concern for people's logs. And they might want to disable this for debugging purposes.

This could all be addressed in a follow up task, of course.

kbendick · 2022-04-29T17:26:02Z

api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java

+   * @param expr an Expression to sanitize
+   * @return a sanitized expression string
+   */
+  public static String toSanitizedString(Expression expr) {


As the specific needs for sanitization are often very business-specific, would it make sense to leave this as something that end users could override fairly easily?

Possibly the sanitize method itself could be abstract, with this as the default implementation.

We can improve this as needed, but I don't want to guess ahead of time what makes sense. People can always extend this later.

kbendick · 2022-04-29T17:28:10Z

The test failure seems to be a transient issue.

If I close this PR though, I'm not sure if I can reopen it.

szehon-ho

Thanks. There is one more (that I added): #4169. Can it be added to this as well?

szehon-ho · 2022-04-29T18:23:49Z

api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java

+
+    @Override
+    public Expression alwaysTrue() {
+      return Expressions.alwaysTrue();


For another PR, but it might be nice to have this boilerplate code (and the string ones) in an abstract base ExpressionVisitor class at some point to reduce the chance of errors. It seems to show up in a few places in the code.

The default is to return null, which I think is a reasonable thing to do. We'd probably have to change a lot of the implementations so I'm not sure if it is worth it.
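For reference, the kind of shared base class suggested in this thread could look roughly like the sketch below. Everything here is hypothetical: `VisitorSketch`, `StringVisitorBase`, and `RedactingVisitor` are stand-in names, predicates are reduced to plain strings, and none of this mirrors the actual Iceberg visitor signatures beyond the idea that defaults return null.

```java
// Hypothetical visitor whose defaults return null, as described in the reply above.
abstract class VisitorSketch<R> {
  R alwaysTrue() { return null; }
  R alwaysFalse() { return null; }
  R not(R child) { return null; }
  R and(R left, R right) { return null; }
  R or(R left, R right) { return null; }
  R predicate(String pred) { return null; }
}

// Shared base holding the string boilerplate, so subclasses only override
// the methods where they actually differ.
abstract class StringVisitorBase extends VisitorSketch<String> {
  @Override String alwaysTrue() { return "true"; }
  @Override String alwaysFalse() { return "false"; }
  @Override String not(String child) { return "not(" + child + ")"; }
  @Override String and(String left, String right) { return "(" + left + " and " + right + ")"; }
  @Override String or(String left, String right) { return "(" + left + " or " + right + ")"; }
}

// Example subclass: only the predicate handling is specific to it.
final class RedactingVisitor extends StringVisitorBase {
  @Override String predicate(String pred) { return "(redacted)"; }
}
```

The trade-off raised in the reply still applies: adopting such a base class would mean touching the existing visitor implementations.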

rdblue · 2022-04-29T18:55:20Z

@szehon-ho, I think we'll need to address the log in #4169 separately because that's entirely in Spark code and we don't have an Iceberg filter. We'd need to either add the ability to sanitize Spark filters (which I'm reluctant to add) or just allow it for Spark. This PR changes instances in core, which gets used in other engines as well.

szehon-ho

Fair enough. Maybe we can use the SparkUtils redact methods for that. We can discuss it in another issue so as not to block this PR.

szehon-ho · 2022-04-29T22:42:25Z

api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java

+    } else if (literal instanceof Literals.IntegerLiteral) {
+      return sanitizeNumber(((Literals.IntegerLiteral) literal).value(), "int");
+    } else if (literal instanceof Literals.LongLiteral) {
+      return sanitizeNumber(((Literals.LongLiteral) literal).value(), "int");


why not long? (and below, why not double?)

Good catch! I used int or float in both cases because the type is determined by what you pass in to create the literal, not the actual type of data in a table (because the expression is unbound). Since the types are floating anyway, I didn't think it made much sense to be strict. What really matters is the magnitude of the value. If you have a 4-digit-int it doesn't matter if it was passed in as a long or passed in as an int.

I see, though is there any disadvantage to just using the exact type being passed in? I took a look, and it seems Spark does some schema-based binding of the expression before passing it to Iceberg, so a comparison on a long column results in a long literal being passed down to the Iceberg scan. I understand it's not a huge point, though.

I think that would require more processing later to use these sanitized values. Right now, this sanitizes log messages, but if you wanted to start collecting sanitized filters and doing analysis on them, one of the first steps would be to collapse these, I think. That's why I went ahead and normalized.
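To make the normalization argument concrete, here is a small sketch of the kind of analysis mentioned above: counting how often each sanitized filter shape appears. The class `SanitizedFilterStats`, the method `countByShape`, and the example filter strings are hypothetical and only illustrate the idea; they are not part of this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that groups already-sanitized filter strings by shape.
final class SanitizedFilterStats {
  private SanitizedFilterStats() {}

  static Map<String, Integer> countByShape(List<String> sanitizedFilters) {
    // TreeMap keeps the output ordering stable for reporting
    Map<String, Integer> counts = new TreeMap<>();
    for (String filter : sanitizedFilters) {
      counts.merge(filter, 1, Integer::sum);
    }
    return counts;
  }
}
```

Because int and long literals are both labeled `int`, a filter written with `5000` and the same filter written with `5000L` fall into one bucket without an extra collapsing step. If the label followed the literal type, the same query shape would be split across two keys.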

rdblue · 2022-05-12T16:34:15Z

@szehon-ho, could you take another look at this one?

szehon-ho

Looks good to me

rdblue · 2022-05-12T19:09:27Z

Thanks, @szehon-ho!

API: Add expression sanitizer, sanitize scan log messages.

295579b

rdblue requested a review from danielcweeks April 29, 2022 15:57

github-actions bot added API core labels Apr 29, 2022

kbendick approved these changes Apr 29, 2022

szehon-ho reviewed Apr 29, 2022

Fix checkstyle.

7481b32

szehon-ho reviewed Apr 29, 2022

szehon-ho approved these changes May 12, 2022

rdblue merged commit 26d8c3e into apache:master May 12, 2022

findepi mentioned this pull request Oct 3, 2022

Reduce 'Scanning table' log verbosity for long IN list #5908

Merged
