API: Add expression sanitizer, sanitize scan log messages #4672
@dain, this sanitizes those logged filters you pointed out.
kbendick
left a comment
This looks good to me overall, +1 from me once tests are passing.
I'm wondering if we need to be this aggressive about sanitizing dates and times. Possibly a small portion of it could be left in?
I do recognize that determining the level of sanitization required is usually a business specific issue, so it's likely better to over-sanitize than under-sanitize.
Is this something we should provide a configuration to enable / disable, possibly within the engines somehow? In many cases, these values are not sensitive, or not sensitive enough to be a concern for people's logs. And they might want to disable this for debugging purposes.
This could all be addressed in a follow up task, of course.
   * @param expr an Expression to sanitize
   * @return a sanitized expression string
   */
  public static String toSanitizedString(Expression expr) {
As the specific needs for sanitization are often very business specific, would it make sense to leave this somehow as something that end users could override somewhat easily?
Possibly the sanitize method itself could be abstract, with this as the default implementation.
We can improve this as needed, but I don't want to guess ahead of time what makes sense. People can always extend this later.
The test failure seems to be a transient issue. If I close this PR though, I'm not sure if I can reopen it.
Thanks. There is one more (that I added): #4169. Can it be added to this as well?
  @Override
  public Expression alwaysTrue() {
    return Expressions.alwaysTrue();
For another PR, but it might be nice to have this boilerplate code (and the string ones) in an abstract base ExpressionVisitor class at some point to reduce the chance of errors. It seems to show up in a few places in the code.
The default is to return null, which I think is a reasonable thing to do. We'd probably have to change a lot of the implementations so I'm not sure if it is worth it.
@szehon-ho, I think we'll need to address the log in #4169 separately because that's entirely in Spark code and we don't have an Iceberg filter. We'd need to either add the ability to sanitize Spark filters (which I'm reluctant to add) or just allow it for Spark. This PR changes instances in core, which gets used in other engines as well.
szehon-ho
left a comment
Fair enough. Maybe we can use the SparkUtils redact methods for that; we can discuss it in another issue so it doesn't block this PR.
    } else if (literal instanceof Literals.IntegerLiteral) {
      return sanitizeNumber(((Literals.IntegerLiteral) literal).value(), "int");
    } else if (literal instanceof Literals.LongLiteral) {
      return sanitizeNumber(((Literals.LongLiteral) literal).value(), "int");
why not long? (and below, why not double?)
Good catch! I used int or float in both cases because the type is determined by what you pass in to create the literal, not the actual type of data in a table (because the expression is unbound). Since the types are floating anyway, I didn't think it made much sense to be strict. What really matters is the magnitude of the value. If you have a 4-digit-int it doesn't matter if it was passed in as a long or passed in as an int.
I see, though is there any disadvantage to just putting the exact one being passed in? I took a look and it seems Spark does some schema-based binding of the expression before passing it to Iceberg, so a comparison on a long column results in a long literal being passed down to the Iceberg scan. Though I understand it's not a huge point.
I think that would require more processing later to use these sanitized values. Right now, this sanitizes log messages, but if you wanted to start collecting sanitized filters and doing analysis on them, one of the first steps would be to collapse these, I think. That's why I went ahead and normalized.
@szehon-ho, could you take another look at this one?
szehon-ho
left a comment
Looks good to me
Thanks, @szehon-ho!
This adds an expression sanitizer and sanitizes filters logged by table scans. This is to avoid leaking IDs and other sensitive information in logs.
Literals are sanitized using the following rules:
- (N-digit-int), where N is the number of digits in the number
- (N-digit-float), where N is the number of digits before the decimal point
- (date), (time), or (timestamp)
- (hash-%08h) from the string representation

The result for string literals matches the result after the literals are bound and converted.
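
For illustration, the rules above could be sketched roughly as follows. This is not the code from the PR: the class `SanitizerSketch` and its `sanitize` entry point are hypothetical names, plain Java values stand in for Iceberg literals, and `String.hashCode()` is only a placeholder for whatever hash function the actual implementation uses. Integer and long inputs are both labeled "int", following the normalization discussed in the review.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical sketch of the sanitization rules; not the Iceberg implementation.
final class SanitizerSketch {

  // Replaces a number with its digit count before the decimal point.
  // Assumes a finite value; NaN and infinity are not handled in this sketch.
  static String sanitizeNumber(Number value, String type) {
    BigDecimal decimal = new BigDecimal(value.toString());
    // precision - scale is the number of integer digits; zero and
    // fractions below 1 are reported as one digit.
    int digits = Math.max(1, decimal.precision() - decimal.scale());
    return "(" + digits + "-digit-" + type + ")";
  }

  // Replaces a string with an 8-character hex hash.
  // String.hashCode() is a stand-in for the real hash function (assumption).
  static String sanitizeString(String value) {
    return String.format("(hash-%08x)", value.hashCode());
  }

  // Dispatches on the runtime type of the value, mirroring the rule list.
  static String sanitize(Object value) {
    if (value instanceof Integer || value instanceof Long) {
      // int and long are both labeled "int": only the magnitude matters
      return sanitizeNumber((Number) value, "int");
    } else if (value instanceof Float || value instanceof Double) {
      return sanitizeNumber((Number) value, "float");
    } else if (value instanceof LocalDate) {
      return "(date)";
    } else if (value instanceof LocalTime) {
      return "(time)";
    } else if (value instanceof LocalDateTime) {
      return "(timestamp)";
    } else {
      return sanitizeString(String.valueOf(value));
    }
  }

  public static void main(String[] args) {
    System.out.println(sanitize(1234));                      // (4-digit-int)
    System.out.println(sanitize(1234L));                     // (4-digit-int)
    System.out.println(sanitize(12.5));                      // (2-digit-float)
    System.out.println(sanitize(LocalDate.of(2022, 5, 1)));  // (date)
    System.out.println(sanitize("user-id-42"));              // (hash-xxxxxxxx), 8 hex digits
  }
}
```

Because the same input string always maps to the same hash, sanitized filters can still be compared or grouped without exposing the original values.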