Fix date and timestamp transforms #1981
d5ae0f1 to c5fe954
Expression projection = Projections.inclusive(spec).project(filter);
UnboundPredicate<?> predicate = assertAndUnwrapUnbound(projection);

Assert.assertEquals(predicate.op(), expectedOp);
More tests need to be added here. I just wanted to get this up for review since I'm going to be out for a couple of days.
c5fe954 to a0b8816
yyanyy
left a comment
Thank you for making this change!
To make sure I understand correctly, changes in projection methods ensure that both behaviors before and after this fix will be accounted for by the projection, so that we might not need to have separate implementations for format v1 versus v2, with a slight penalty that in v2 we may scan more data than we have to?
Yes, this fixes the transforms and ensures that predicate projection includes the partitions that were written with bad values. That means that we won't need different implementations for v2, but it also means that we can avoid scanning the extra partitions in the future. Because this is fixed before v2, we can ensure that all v2 tables have been fixed. So if a table is created as v2, we should be able to know that no older writers with the bug have written to the table. The only case where a v2 table would have bad metadata is when a v1 table has been converted. We should add a flag to signal that the table was converted from v1 or one that signals it was created as v2 that allows us to skip the extra partitions.
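A rough way to picture the widened projection: a writer that truncated toward zero may have stored ordinal v + 1 where the correct (floored) ordinal is a negative v, so an inclusive projection has to accept both. The sketch below works on plain integers rather than Iceberg's expression classes; `InclusiveProjectionSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

class InclusiveProjectionSketch {
  /** Partition ordinals to scan for an equality predicate that projects to {@code ordinal}. */
  static Set<Integer> ordinalsForEquality(int ordinal) {
    Set<Integer> ordinals = new TreeSet<>();
    ordinals.add(ordinal);
    if (ordinal < 0) {
      // A writer that truncated toward zero may have stored ordinal + 1 instead.
      ordinals.add(ordinal + 1);
    }
    return ordinals;
  }

  /** Upper bound to use for "partition <= ordinal" so that mis-written partitions still match. */
  static int upperBoundForLessThanOrEqual(int ordinal) {
    return ordinal < 0 ? ordinal + 1 : ordinal;
  }
}
```

For non-negative ordinals nothing changes, which is why the extra scanning cost is limited to predicates that touch values before the epoch.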
a6392db to 146ff9e
  } else if (pred.isLiteralPredicate()) {
-   return ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+   UnboundPredicate<Integer> projected = ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+   if (this != DAY) {
Seems the same logic (if (condition) {return...} return ...) appears multiple times, can it be included inside a method, e.g. just embed it in the method defined in ProjectionUtil?
There are only 4 instances of this, and they call 2 different fix methods. I don't think it would be worth adding 2 methods just to dedup 4 lines here.
OffsetDateTime timestamp = Instant
    .ofEpochSecond(
        Math.floorDiv(timestampMicros, 1_000_000),
        Math.floorMod(timestampMicros, 1_000_000) * 1000)
nit: extract 1_000_000 and 1000 into constants with meaningful names, or use predefined ones from the Java library.
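A minimal sketch of what this nit could look like, assuming the conversion targets UTC. The class name, method names, and constant names are hypothetical and not taken from the PR; only `Math.floorDiv`, `Math.floorMod`, `Instant.ofEpochSecond`, and `ChronoUnit.MICROS` are standard library calls.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

class MicrosToTimestampSketch {
  // Hypothetical constant names replacing the literals 1_000_000 and 1000.
  private static final long MICROS_PER_SECOND = 1_000_000L;
  private static final long NANOS_PER_MICRO = 1_000L;

  /** Floor-based split, so timestamps before the epoch get a non-negative nano adjustment. */
  static OffsetDateTime fromMicros(long timestampMicros) {
    return Instant
        .ofEpochSecond(
            Math.floorDiv(timestampMicros, MICROS_PER_SECOND),
            Math.floorMod(timestampMicros, MICROS_PER_SECOND) * NANOS_PER_MICRO)
        .atOffset(ZoneOffset.UTC);
  }

  /** Alternative that leans on the predefined ChronoUnit.MICROS instead of local constants. */
  static OffsetDateTime fromMicrosWithChronoUnit(long timestampMicros) {
    return Instant.EPOCH.plus(timestampMicros, ChronoUnit.MICROS).atOffset(ZoneOffset.UTC);
  }
}
```

Both variants should agree for negative inputs; for example, -1 microsecond maps to 1969-12-31T23:59:59.999999Z.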
}

if (hasNegativeValue) {
  return Expressions.in(projected.term(), fixedSet);
Why not just always return this new expression? We build up fixedSet no matter what.
If there is no negative value, then there is no need to fix up the expression and we can return the original. That avoids some object allocation?
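The trade-off in this exchange can be sketched with plain sets instead of Iceberg predicates. This is only an illustration under that simplification: `InSetFixSketch` and `fixInSet` are hypothetical names, and returning the caller's set stands in for returning the original projected expression.

```java
import java.util.HashSet;
import java.util.Set;

class InSetFixSketch {
  /**
   * Widens a set of projected ordinals so that each negative value also
   * matches value + 1, the ordinal a truncating writer may have stored.
   */
  static Set<Integer> fixInSet(Set<Integer> projected) {
    Set<Integer> fixedSet = new HashSet<>();
    boolean hasNegativeValue = false;
    for (int value : projected) {
      fixedSet.add(value);
      if (value < 0) {
        hasNegativeValue = true;
        fixedSet.add(value + 1);
      }
    }

    // The set is built either way; handing back the input when nothing was
    // widened avoids wrapping an identical set in a new object.
    return hasNegativeValue ? fixedSet : projected;
  }
}
```

With an input of {-2, 3} the result is {-2, -1, 3}; with {1, 2} the same instance that was passed in comes back.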
Thanks for fixing this, I've just run into this issue with Impala. (And this fix is also a prerequisite for issue #2043) So the plan is that 0.11+ writers will write the correct transformation values IIUC. Does it also mean that pre-0.11 readers will not be able to read the new files correctly when the data is before the epoch? E.g. new writer writes '1969-12-31 23:05:00' with YEAR partition transform (so ts_year=-1), old reader scans for 'ts > 1969-12-31 23:00:00' (so with old transform the projected predicate is ts_year >= 0), and the file will be skipped because the partition data has ts_year=-1. I just want to clarify this, because with Impala I'm planning to write V1 files with the fixed transforms.
Yes. Older readers will still have the bug. I think we can only make a backward-compatible fix for this.
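The '1969-12-31 23:05:00' case from the question can be reproduced with the standard `java.time` API. In this sketch `truncatedYears` is assumed to mimic the pre-fix behavior (rounding toward zero) and `flooredYears` the fixed one; both method names and the class are hypothetical, not Iceberg code.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

class YearOrdinalSketch {
  private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

  /** Whole years between the epoch and ts, rounded toward zero (assumed old behavior). */
  static int truncatedYears(LocalDateTime ts) {
    return (int) ChronoUnit.YEARS.between(EPOCH, ts);
  }

  /** Calendar-year difference, which floors for timestamps before the epoch. */
  static int flooredYears(LocalDateTime ts) {
    return ts.getYear() - EPOCH.getYear();
  }

  public static void main(String[] args) {
    LocalDateTime ts = LocalDateTime.of(1969, 12, 31, 23, 5);
    System.out.println(truncatedYears(ts)); // 0
    System.out.println(flooredYears(ts));   // -1
  }
}
```

A reader that projects the predicate with the truncating variant looks for ts_year >= 0 and so never visits the ts_year=-1 partition written by the fixed transform.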
Never mind, they can't really have transforms in the partitions anyway.
danielcweeks
left a comment
+1 LGTM
Thanks for the reviews, @danielcweeks, @yyanyy, @RussellSpitzer, @jun-he, and @boroknagyz!
Because of an Iceberg bug, Impala didn't push predicates down to Iceberg for dates/timestamps when the predicate referred to a value before the UNIX epoch. apache/iceberg#1981 fixed the Iceberg bug, and Impala recently switched to an Iceberg version that has the fix, so this patch enables predicate pushdown for all timestamp/date values.

The above Iceberg patch maintains backward compatibility with the old, wrong behavior. As a result, we sometimes need to read one more Iceberg partition than necessary.

Testing:
* Updated current e2e tests

Change-Id: Ie67f41a53f21c7bdb8449ca0d27746158be7675a
Reviewed-on: http://gerrit.cloudera.org:8080/17417
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fixes incorrect values produced by date and time transforms and adds tests.
This also updates date and timestamp transform projection methods to include incorrectly transformed values.
Fixes #1680.