ESQL: Change the order of the optimization rules #124335
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hi @astefan, I've created a changelog YAML for you.
@@ -695,3 +695,115 @@ emp_no:integer | languages:integer | gender:keyword | max_lang:integer | y:keywo
10012 | 5 | null | 5 | null
10014 | 5 | null | 5 | null
;
Some of the following tests fail with the same error. This is likely due to something that @alex-spies caught in a previous review: too much is sent to the data nodes for processing instead of being done on the coordinator. This is also tied to scenarios where multiple `inlinestats` commands are used in a query.
@@ -34,7 +34,6 @@ protected LogicalPlan rule(InlineJoin plan) {
// check if there's any grouping that uses a reference on the right side
Only a small cleanup here.
…ticsearch into inlinestats_pickup3
Good catch.
To double-check my understanding: integrating the eval as a project is fine in general (since the eval is just an alias); however, in the inline case the eval needs to stand alone so that it can be propagated to the left side, which lets the join work by returning the grouped column with the right alias.
Did I miss anything?
A logical optimization test that calls the two (three) rules one after the other might be useful to showcase the side effect.
Thanks!
I think this is nice and should be merged for the added test coverage alone.
However, did you maybe consider a solution where we make `ReplaceAggregateNestedExpressionWithEval` aware of whether it is applied inside the right hand side of `InlineJoin`? The root problem seems to me that this rule creates an `Eval` that should be applied not on the right hand side, but on the left hand side of an `InlineJoin`, and we require a whole additional optimizer rule, namely `PropagateInlineEvals`, applied in the correct order without runs of `ReplaceAliasingEvalWithProject` resp. `CombineProjections`, for this to work, which I find a little brittle.
...src/main/java/org/elasticsearch/xpack/esql/optimizer/rules/logical/PropagateInlineEvals.java
if (groupingRefs.isEmpty()) {
    return p;
}

// find their declaration and remove it
// TODO: this doesn't take into account aliasing
This comment may be stale now.
FROM employees
| KEEP emp_no, languages, gender, first_name
| RENAME first_name as f
| INLINESTATS max_lang = MAX(languages) BY y = gender, l = languages, f = left(f, 1)
Subtle semantic question:
The implementation assumes that expressions in the `BY` clause take effect in the main branch of the join as well. I.e. `INLINESTATS ... BY ..., f = left(f, 1)` means that the attribute `f` in the left hand side gets overwritten before being aggregated by/joined on.
But there's a different way to understand this, and it's actually useful: `INLINESTATS ... BY ..., f = left(f, 1)` could mean that `f` is overwritten only on the right hand side where the aggregation happens, and then is matched up with the left hand side.
I don't think we want the latter to be the semantics here. But that could be useful in case that e.g. the left hand side has a `counter` field that indicates the current row (or similar) - then `INLINESTATS ... BY counter = counter + 1` would actually shift the stats by 1 row.
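To make the two readings concrete, here is a minimal toy sketch in plain Java. It does not use any ES|QL code: `Row`, `Out`, and all method names are hypothetical, and it assumes that unmatched left rows get a `null` stat. It models `INLINESTATS m = MAX(value) BY counter = counter + 1` under both interpretations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; none of these names exist in Elasticsearch.
final class InlineStatsSemanticsSketch {
    record Row(int counter, int value) {}
    record Out(int counter, Integer maxValue) {}

    // Right-hand side of the join: MAX(value) grouped by counter + 1.
    static Map<Integer, Integer> aggregate(List<Row> rows) {
        Map<Integer, Integer> max = new HashMap<>();
        for (Row r : rows) {
            max.merge(r.counter() + 1, r.value(), Integer::max);
        }
        return max;
    }

    // Reading 1: the BY expression also overwrites counter on the left side
    // before joining, so every row is matched with its own group.
    static List<Out> overwriteBothSides(List<Row> rows) {
        Map<Integer, Integer> stats = aggregate(rows);
        List<Out> out = new ArrayList<>();
        for (Row r : rows) {
            int key = r.counter() + 1;
            out.add(new Out(key, stats.get(key)));
        }
        return out;
    }

    // Reading 2: only the right side is overwritten; the left side joins
    // with its original counter, so stats are shifted by one row.
    // Assumption: rows without a matching key get null.
    static List<Out> overwriteRightOnly(List<Row> rows) {
        Map<Integer, Integer> stats = aggregate(rows);
        List<Out> out = new ArrayList<>();
        for (Row r : rows) {
            out.add(new Out(r.counter(), stats.get(r.counter())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row(1, 10), new Row(2, 20), new Row(3, 30));
        System.out.println(overwriteBothSides(rows));
        System.out.println(overwriteRightOnly(rows));
    }
}
```

With rows `(1, 10)`, `(2, 20)`, `(3, 30)`, the first reading pairs each row with its own value under the rewritten key, while the second leaves the first row unmatched and gives each later row the value of the row before it.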
Yes.
I did consider another solution, but I discarded it because it didn't feel like the right approach.
I just realized that it's probably a good idea to add tests for `INLINESTATS ... BY x = bucket(...)` and `INLINESTATS ... BY x = CATEGORIZE(...)`. I think the latter cannot work because the join key in `BY x = CATEGORIZE(...)` is computed during the aggregation, whereas `INLINESTATS` requires the join key to be present before that.
Cc @jan-elastic, I think we'll have to start out with a limitation where `INLINESTATS` can't use `CATEGORIZE`, at least at first; to enable this, I think we'd somehow have to grab the categorizer from the first phase of the query (which computes the STATS) and make it available to the second phase of the query (which performs the joining with every row we see).
FWIW, I can live with the fact that the first version of INLINESTATS doesn't work with CATEGORIZE.
Just open a GitHub issue for that and it can be resolved later.
I opened #124717
This is not the alternative I'm proposing. What I observed is that Or to put it differently and bluntly: I think But that blows up the scope of this PR. I'll make a note to revisit this later - also, the general shape of the code for |
LGTM. Thanks for the iterations, Andrei!
💚 Backport successful
In a query like `INLINESTATS max_lang = MAX(languages) BY y = gender`, the "rename" of `gender` as `y` was replaced by an `eval y = gender` and then integrated back as a "projection" inside the aggregate itself. `ReplaceAggregateNestedExpressionWithEval` and `CombineProjections` seem to work one against the other. The former adds the `eval` while the latter eliminates it and puts it back in the groupings of the aggregation.

I have changed the order in which `ReplaceOrderByExpressionWithEval` and `PropagateInlineEvals` rules were called in the logical optimizer so that the `eval` mentioned above is not integrated back in the aggregation.
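
The ordering effect described above can be illustrated with a small toy model. This is only a sketch under heavy assumptions: the `Plan` record and the three methods are hypothetical stand-ins that mimic the described behaviour of the rules on strings. They are not the actual optimizer classes, and the real rules operate on logical plan trees.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; these are stand-ins, not the real optimizer rules.
final class RuleOrderSketch {
    /** Evals on each side of the join plus the BY list; evals are "name = expr" strings. */
    record Plan(List<String> leftEvals, List<String> rightEvals, List<String> groupings) {}

    // Stand-in for the rule that adds the eval: pulls "name = expr" out of
    // the BY list into a right-side eval and leaves only the name behind.
    static Plan extractEval(Plan p) {
        List<String> evals = new ArrayList<>(p.rightEvals());
        List<String> groups = new ArrayList<>();
        for (String g : p.groupings()) {
            int eq = g.indexOf(" = ");
            if (eq < 0) {
                groups.add(g);
            } else {
                evals.add(g);
                groups.add(g.substring(0, eq));
            }
        }
        return new Plan(p.leftEvals(), evals, groups);
    }

    // Stand-in for CombineProjections: folds a right-side eval back into
    // the grouping that references it and drops the eval.
    static Plan foldEvalBack(Plan p) {
        List<String> groups = new ArrayList<>(p.groupings());
        List<String> remaining = new ArrayList<>();
        for (String e : p.rightEvals()) {
            String name = e.substring(0, e.indexOf(" = "));
            int i = groups.indexOf(name);
            if (i >= 0) {
                groups.set(i, e);
            } else {
                remaining.add(e);
            }
        }
        return new Plan(p.leftEvals(), remaining, groups);
    }

    // Stand-in for PropagateInlineEvals: moves right-side evals that define
    // a grouping over to the left side of the join.
    static Plan propagate(Plan p) {
        List<String> left = new ArrayList<>(p.leftEvals());
        List<String> remaining = new ArrayList<>();
        for (String e : p.rightEvals()) {
            String name = e.substring(0, e.indexOf(" = "));
            if (p.groupings().contains(name)) {
                left.add(e);
            } else {
                remaining.add(e);
            }
        }
        return new Plan(left, remaining, p.groupings());
    }

    static Plan start() {
        return new Plan(List.of(), List.of(), List.of("y = gender"));
    }

    // The eval is folded back before it can be propagated.
    static Plan foldBeforePropagate() {
        return propagate(foldEvalBack(extractEval(start())));
    }

    // The eval is propagated first, so there is nothing left to fold back.
    static Plan propagateBeforeFold() {
        return foldEvalBack(propagate(extractEval(start())));
    }

    public static void main(String[] args) {
        System.out.println(foldBeforePropagate());
        System.out.println(propagateBeforeFold());
    }
}
```

In this toy, folding first leaves `y = gender` inside the groupings with nothing on the left side, while propagating first ends with the eval on the left side and a plain `y` grouping, which is the shape the join needs.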