Contributor

@rdblue rdblue commented Apr 29, 2022

This adds an expression sanitizer and sanitizes filters logged by table scans. This is to avoid leaking IDs and other sensitive information in logs.

Literals are sanitized using the following rules:

  • Integers and longs are converted to (N-digit-int), where N is the number of digits in the number
  • Floats and doubles are converted to (N-digit-float), where N is the number of digits before the decimal point
  • Dates, times, and timestamps are converted to (date), (time), or (timestamp)
  • All other types are hashed to produce (hash-%08h) from the string representation

The result for string literals matches the result after the literals are bound and converted.
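
The rules above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the class `SanitizeSketch`, its `sanitize` method, the use of plain Java objects in place of Iceberg literals, and `String.hashCode()` as the hash function are all assumptions made for the example.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical sketch of the sanitization rules; not the PR's implementation.
final class SanitizeSketch {
  private SanitizeSketch() {}

  static String sanitize(Object value) {
    if (value instanceof Integer || value instanceof Long) {
      return "(" + digits(((Number) value).longValue()) + "-digit-int)";
    } else if (value instanceof Float || value instanceof Double) {
      // Truncate toward zero so only digits before the decimal point count.
      // Assumes the magnitude fits in a long (a simplification for this sketch).
      long whole = (long) ((Number) value).doubleValue();
      return "(" + digits(whole) + "-digit-float)";
    } else if (value instanceof LocalDate) {
      return "(date)";
    } else if (value instanceof LocalTime) {
      return "(time)";
    } else if (value instanceof LocalDateTime) {
      return "(timestamp)";
    }
    // Assumption: String.hashCode() stands in for whatever hash the PR uses.
    return String.format("(hash-%08x)", String.valueOf(value).hashCode());
  }

  // Number of decimal digits, ignoring the sign.
  private static int digits(long value) {
    return Long.toString(value).replace("-", "").length();
  }

  public static void main(String[] args) {
    System.out.println(sanitize(1234));
    System.out.println(sanitize(34.56d));
    System.out.println(sanitize(LocalDate.of(2022, 4, 29)));
    System.out.println(sanitize("customer-42"));
  }
}
```

In this sketch, `sanitize(1234)` yields `(4-digit-int)` and `sanitize(34.56d)` yields `(2-digit-float)`; the hash label for a string depends on the stand-in hash function chosen here.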

Contributor Author

rdblue commented Apr 29, 2022

@dain, this sanitizes those logged filters you pointed out.

Contributor

@kbendick kbendick left a comment

This looks good to me overall, +1 from me once tests are passing.

I'm wondering if we need to be this aggressive about sanitizing dates and times. Possibly a small portion of it could be left in?

I do recognize that determining the level of sanitization required is usually a business specific issue, so it's likely better to over-sanitize than under-sanitize.

Is this something we should provide a configuration to enable / disable, possibly within the engines somehow? In many cases, these values are not sensitive, or not sensitive enough to be a concern for people's logs. And they might want to disable this for debugging purposes.

This could all be addressed in a follow up task, of course.

   * @param expr an Expression to sanitize
   * @return a sanitized expression string
   */
  public static String toSanitizedString(Expression expr) {
Contributor

Since sanitization needs are often business-specific, would it make sense to make this something end users can override fairly easily?

Possibly the sanitize method itself could be abstract, with this as the default implementation.

Contributor Author

We can improve this as needed, but I don't want to guess ahead of time what makes sense. People can always extend this later.

@kbendick
Contributor

The test failure seems to be a transient issue.

If I close this PR though, I'm not sure if I can reopen it.

Member

@szehon-ho szehon-ho left a comment

Thanks. There is one more (that I added): #4169. Could it be added to this as well?


@Override
public Expression alwaysTrue() {
  return Expressions.alwaysTrue();
Member

@szehon-ho szehon-ho Apr 29, 2022

This is for another PR, but it might be nice to move this boilerplate (and the string versions) into an abstract base ExpressionVisitor class at some point to reduce the chance of errors. It seems to show up in a few places in the code.

Contributor Author

The default is to return null, which I think is a reasonable thing to do. We'd probably have to change a lot of the implementations so I'm not sure if it is worth it.

Contributor Author

rdblue commented Apr 29, 2022

@szehon-ho, I think we'll need to address the log in #4169 separately because that's entirely in Spark code and we don't have an Iceberg filter. We'd need to either add the ability to sanitize Spark filters (which I'm reluctant to add) or just allow it for Spark. This PR changes instances in core, which gets used in other engines as well.

Member

@szehon-ho szehon-ho left a comment

Fair enough. Maybe we can use the SparkUtils redact methods for that; we can discuss it in another issue so it doesn't block this PR.

} else if (literal instanceof Literals.IntegerLiteral) {
  return sanitizeNumber(((Literals.IntegerLiteral) literal).value(), "int");
} else if (literal instanceof Literals.LongLiteral) {
  return sanitizeNumber(((Literals.LongLiteral) literal).value(), "int");
Member

why not long? (and below, why not double?)

Contributor Author

Good catch! I used int or float in both cases because the type is determined by what you pass in to create the literal, not the actual type of data in a table (because the expression is unbound). Since the types are floating anyway, I didn't think it made much sense to be strict. What really matters is the magnitude of the value. If you have a 4-digit-int it doesn't matter if it was passed in as a long or passed in as an int.

Member

I see, though is there any disadvantage to just using the exact type being passed in? I took a look, and it seems Spark does some schema-based binding of the expression before passing it to Iceberg, so a comparison against a long column results in a long literal being passed down to the Iceberg scan. I understand it's not a huge point, though.

Contributor Author

I think that would require more processing later to use these sanitized values. Right now, this sanitizes log messages, but if you wanted to start collecting sanitized filters and doing analysis on them, one of the first steps would be to collapse these, I think. That's why I went ahead and normalized.
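
As a rough illustration of why collapsing helps, the sketch below counts sanitized filter strings by shape. Everything in it is hypothetical: the class `FilterShapeCounts`, the `label` helper, and the `id = ...` filter format are made up for the example and are not part of this PR. Because int and long inputs share one label, filters that differ only in how the literal was created land in the same bucket without any extra processing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical analysis sketch; names and filter format are assumptions.
final class FilterShapeCounts {
  private FilterShapeCounts() {}

  // Int and long inputs deliberately share the "int" label.
  static String label(Number value) {
    String digits = Long.toString(value.longValue()).replace("-", "");
    return "(" + digits.length() + "-digit-int)";
  }

  // Count how often each sanitized filter string appears.
  static Map<String, Integer> countShapes(List<String> sanitizedFilters) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String filter : sanitizedFilters) {
      counts.merge(filter, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> logged = Arrays.asList(
        "id = " + label(1234),      // created from an int
        "id = " + label(5678L),     // created from a long
        "id = " + label(123456L));
    System.out.println(countShapes(logged));
  }
}
```

Here the first two filters both become `id = (4-digit-int)` and are counted together, while the third is counted separately as `id = (6-digit-int)`.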

Contributor Author

rdblue commented May 12, 2022

@szehon-ho, could you take another look at this one?

Member

@szehon-ho szehon-ho left a comment

Looks good to me

@rdblue rdblue merged commit 26d8c3e into apache:master May 12, 2022
Contributor Author

rdblue commented May 12, 2022

Thanks, @szehon-ho!
