Contributor

@rdblue rdblue commented Dec 23, 2020

This fixes incorrect values produced by date and time transforms and adds tests.

This also updates date and timestamp transform projection methods to include incorrectly transformed values.

Fixes #1680.
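
As a rough illustration of the kind of error involved (a sketch, not the code from this PR): integer division that rounds toward zero puts a timestamp just before the epoch in ordinal 0 instead of -1. The class name, method names, and the constant below are hypothetical and chosen only for this example.

```java
// Hypothetical sketch: day ordinal from microseconds since the Unix epoch.
class DayOrdinalSketch {
  static final long MICROS_PER_DAY = 86_400_000_000L;

  // Truncating division rounds toward zero, so pre-epoch values land one day too late.
  static int truncatedDay(long timestampMicros) {
    return (int) (timestampMicros / MICROS_PER_DAY);
  }

  // Floor division rounds toward negative infinity, keeping day ordinals contiguous.
  static int flooredDay(long timestampMicros) {
    return (int) Math.floorDiv(timestampMicros, MICROS_PER_DAY);
  }

  public static void main(String[] args) {
    long oneMicroBeforeEpoch = -1L;
    System.out.println(truncatedDay(oneMicroBeforeEpoch)); // 0
    System.out.println(flooredDay(oneMicroBeforeEpoch));   // -1
  }
}
```

With truncation, both the last day of 1969 and the first day of 1970 map to 0, which is why values before the epoch are the only ones affected.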

@github-actions github-actions bot added the API label Dec 23, 2020
@rdblue rdblue force-pushed the add-date-and-timestamp-tests branch from d5ae0f1 to c5fe954 Compare December 23, 2020 21:51
Expression projection = Projections.inclusive(spec).project(filter);
UnboundPredicate<?> predicate = assertAndUnwrapUnbound(projection);

Assert.assertEquals(predicate.op(), expectedOp);
Contributor Author

More tests need to be added here. I just wanted to get this up for review since I'm going to be out for a couple of days.

@rdblue rdblue force-pushed the add-date-and-timestamp-tests branch from c5fe954 to a0b8816 Compare December 23, 2020 21:54
Contributor

@yyanyy yyanyy left a comment

Thank you for making this change!

To make sure I understand correctly, changes in projection methods ensure that both behaviors before and after this fix will be accounted for by the projection, so that we might not need to have separate implementations for format v1 versus v2, with a slight penalty that in v2 we may scan more data than we have to?

Contributor Author

rdblue commented Dec 24, 2020

> To make sure I understand correctly, changes in projection methods ensure that both behaviors before and after this fix will be accounted for by the projection, so that we might not need to have separate implementations for format v1 versus v2, with a slight penalty that in v2 we may scan more data than we have to?

Yes, this fixes the transforms and ensures that predicate projection includes the partitions that were written with bad values. That means that we won't need different implementations for v2, but it also means that we can avoid scanning the extra partitions in the future. Because this is fixed before v2, we can ensure that all v2 tables have been fixed: if a table is created as v2, we should be able to know that no older writers with the bug have written to it. The only case where a v2 table would have bad metadata is when a v1 table has been converted. We should add a flag, either one signaling that the table was converted from v1 or one signaling that it was created as v2, that allows us to skip the extra partitions.
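
A minimal sketch of what "includes the partitions that were written with bad values" can mean for an equality predicate, assuming the buggy writers rounded toward zero and therefore stored `value + 1` for rows whose correct ordinal is a negative `value`. The class and method names are hypothetical; this is not the PR's projection code.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of an inclusive projection for `transform(col) = value`.
class InclusiveEqualitySketch {
  // Returns every partition value that may hold rows whose correct ordinal is `value`.
  static Set<Integer> candidatePartitions(int value) {
    Set<Integer> candidates = new TreeSet<>();
    candidates.add(value);
    if (value < 0) {
      // Assumption: an older writer may have recorded value + 1 for the same rows.
      candidates.add(value + 1);
    }
    return candidates;
  }
}
```

For non-negative ordinals the result is a single value, so the extra scan cost applies only to predicates that reach before the epoch.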

  } else if (pred.isLiteralPredicate()) {
-   return ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+   UnboundPredicate<Integer> projected = ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+   if (this != DAY) {
Collaborator

It seems the same logic (if (condition) {return...} return ...) appears multiple times. Can it be moved into a method, e.g. embedded in the method defined in ProjectionUtil?

Contributor Author

There are only 4 instances of this, and they call 2 different fix methods. I don't think it would be worth adding 2 methods just to dedup 4 lines here.

OffsetDateTime timestamp = Instant
.ofEpochSecond(
Math.floorDiv(timestampMicros, 1_000_000),
Math.floorMod(timestampMicros, 1_000_000) * 1000)
Collaborator

nit: extract 1_000_000 and 1000 into constants with meaningful names, or use predefined ones from the Java library.
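
One way the suggestion could look, as a sketch only: the constant names, class name, and method name are hypothetical, and the UTC offset is an assumption since the diff fragment above does not show the rest of the expression.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

// Hypothetical sketch: the same conversion with named constants.
class MicrosToTimestampSketch {
  static final long MICROS_PER_SECOND = 1_000_000L;
  static final long NANOS_PER_MICRO = 1_000L;

  static OffsetDateTime fromMicros(long timestampMicros) {
    return Instant
        .ofEpochSecond(
            // floorDiv/floorMod keep the nanosecond adjustment non-negative before the epoch
            Math.floorDiv(timestampMicros, MICROS_PER_SECOND),
            Math.floorMod(timestampMicros, MICROS_PER_SECOND) * NANOS_PER_MICRO)
        .atOffset(ZoneOffset.UTC); // assumption: timestamps are interpreted in UTC
  }
}
```

A library-based alternative would be `Instant.EPOCH.plus(timestampMicros, ChronoUnit.MICROS)`, which avoids hand-written constants altogether.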

}

if (hasNegativeValue) {
return Expressions.in(projected.term(), fixedSet);
Member

Why not just always return this new expression? We build up fixedSet no matter what.

Contributor Author

If there is no negative value, then there is no need to fix up the expression and we can return the original. That avoids some object allocation?
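
The trade-off discussed here can be sketched with plain sets, assuming that "fixing" a negative value means also including `value + 1`. The class and method names are hypothetical, and the real code returns an expression rather than a set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the IN-set fix with the early return for non-negative input.
class InSetFixSketch {
  // Returns the original set untouched when it contains no negative value.
  static Set<Integer> widen(Set<Integer> projectedValues) {
    Set<Integer> fixedSet = new HashSet<>();
    boolean hasNegativeValue = false;
    for (Integer value : projectedValues) {
      fixedSet.add(value);
      if (value < 0) {
        hasNegativeValue = true;
        fixedSet.add(value + 1); // assumption: buggy writers stored value + 1
      }
    }
    // The copy is built either way; only the wrapping into a new result is skipped.
    return hasNegativeValue ? fixedSet : projectedValues;
  }
}
```

In this shape the saving is limited to whatever is constructed from `fixedSet` afterwards, which matches the tentative tone of the reply.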

@boroknagyz
Contributor

Thanks for fixing this, I've just run into this issue with Impala. (And this fix is also a prerequisite for issue #2043)

So the plan is that 0.11+ writers will write the correct transformation values IIUC. Does it also mean that pre-0.11 readers will not be able to read the new files correctly when the data is before the epoch?

E.g. a new writer writes '1969-12-31 23:05:00' with the YEAR partition transform (so ts_year = -1), and an old reader scans for 'ts > 1969-12-31 23:00:00' (so with the old transform the projected predicate is ts_year >= 0). The file will be skipped because the partition data has ts_year = -1.
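
The scenario can be reproduced with `java.time`, under the assumption that the old behavior is modeled by `ChronoUnit.YEARS.between`, which truncates toward zero. The class and method names are hypothetical and exist only for this illustration.

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the old-reader / new-writer mismatch for the YEAR transform.
class YearTransformSketch {
  static final OffsetDateTime EPOCH = OffsetDateTime.of(1970, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);

  // Assumed model of the old behavior: whole years between, truncated toward zero.
  static int oldYear(OffsetDateTime ts) {
    return (int) ChronoUnit.YEARS.between(EPOCH, ts);
  }

  // Corrected ordinal: calendar years since 1970.
  static int fixedYear(OffsetDateTime ts) {
    return ts.getYear() - EPOCH.getYear();
  }

  public static void main(String[] args) {
    OffsetDateTime written = OffsetDateTime.of(1969, 12, 31, 23, 5, 0, 0, ZoneOffset.UTC);
    OffsetDateTime bound = OffsetDateTime.of(1969, 12, 31, 23, 0, 0, 0, ZoneOffset.UTC);
    int partitionValue = fixedYear(written); // new writer stores -1
    int oldLowerBound = oldYear(bound);      // old reader projects ts_year >= 0
    boolean scanned = partitionValue >= oldLowerBound;
    System.out.println(partitionValue + " " + oldLowerBound + " " + scanned);
  }
}
```

The row satisfies the original filter, yet the partition fails the old reader's projected predicate, so the file is pruned.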

I just want to clarify this, because with Impala I'm planning to write V1 files with the fixed transforms.

Contributor Author

rdblue commented Jan 9, 2021

> Does it also mean that pre-0.11 readers will not be able to read the new files correctly when the data is before the epoch?

Yes. Older readers will still have the bug. I think we can only make a backward-compatible fix for this.

Contributor

danielcweeks commented Jan 22, 2021

As best I can follow, this looks good, but there's one issue I'm not clear on. If we're just expanding the projection to include possible matches (e.g. the equality case), doesn't this cause a problem where the partition values are joined in (i.e. not materialized in the data, as with converted Hive tables)?

I assume we don't have a way to correct for that case since the recorded transform is incorrect and we cannot materialize the correct data to filter later?

Never mind, they can't really have transforms in the partitions anyway.

Contributor

@danielcweeks danielcweeks left a comment

+1 LGTM

Contributor Author

rdblue commented Jan 22, 2021

Thanks for the reviews, @danielcweeks, @yyanyy, @RussellSpitzer, @jun-he, and @boroknagyz!

@rdblue rdblue merged commit ae96c32 into apache:master Jan 22, 2021
XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021
asfgit pushed a commit to apache/impala that referenced this pull request May 19, 2021
Because of an Iceberg bug Impala didn't push predicates to
Iceberg for dates/timestamps when the predicate referred to a
value before the UNIX epoch.

apache/iceberg#1981 fixed the Iceberg
bug, and Impala recently switched to an Iceberg version that has
the fix, therefore this patch enables predicate pushdown for all
timestamp/date values.

The above Iceberg patch maintains backward compatibility with the
old, wrong behavior. Therefore we sometimes need to read one more
Iceberg partition than necessary.

Testing:
 * Updated current e2e tests

Change-Id: Ie67f41a53f21c7bdb8449ca0d27746158be7675a
Reviewed-on: http://gerrit.cloudera.org:8080/17417
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

Successfully merging this pull request may close these issues.

Spec: document partition transforms for timestamps before 1970

6 participants