-
Notifications
You must be signed in to change notification settings - Fork 25.4k
[ES|QL] Add a standard deviation function #116531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.
Documentation preview: |
final long count = state.count(); | ||
final double m2 = state.m2(); | ||
if (count == 0 || Double.isFinite(m2) == false) { | ||
return driverContext.blockFactory().newConstantNullBlock(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the result is infinity or NaN I set it to return null, but I'm not sure if there should be a warning or something similar printed (or where that would best be done)?
|
||
import static java.util.Collections.emptyList; | ||
|
||
public class StdDeviation extends AggregateFunction implements ToAggregator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure what best to name this. There were a few options: StdDeviation
, StandardDeviation
, or Stdev
(or maybe even Stddev
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think StandardDeviation is better, although it's an internal name, and we can change it anytime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about for the name of the ES|QL function? Right now it's std_deviation
, I feel like std_dev
maybe works better? I worry that standard_deviation
is kind of long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think std_dev
is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the internal name be changed to StdDev to match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer StandardDeviation
for the class name, but feel free to choose whichever name you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be easier to change both; I tried changing only the function name but it looks like the name of the tests class is used for generating the docs so I think it might be simpler to use StdDev for the class name as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can annotate the test class with the function name if it does not match cleanly. For example, see this class SpatialIntersectsTest
which tests ST_INTERSECTS
: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/spatial/SpatialIntersectsTests.java#L23
...rc/main/generated-src/org/elasticsearch/compute/aggregation/StdDeviationFloatAggregator.java
Outdated
Show resolved
Hide resolved
@limotova I extracted values from the serverless test and combined them to reproduce the test failure. Using three batches, the final value is 0.23282704603226836, while with a single batch, the value is 0.22797190865484734. Could you check if this discrepancy is acceptable? public void testBasic() {
double[] v1 = {1.97, 2.0, 1.57, 1.48, 1.77};
double[] v2 = {2.1, 1.74, 1.96, 1.42, 1.59, 2.07, 1.81, 1.59, 1.44, 2.03, 1.81};
double[] v3 = {2.03, 1.54, 1.55};
WelfordAlgorithm a1 = new WelfordAlgorithm();
for (double v : v1) {
a1.add(v);
}
WelfordAlgorithm a2 = new WelfordAlgorithm();
for (double v : v2) {
a2.add(v);
}
WelfordAlgorithm a3 = new WelfordAlgorithm();
for (double v : v3) {
a3.add(v);
}
WelfordAlgorithm merged = new WelfordAlgorithm();
for (WelfordAlgorithm a : List.of(a3, a2, a1)) {
merged.add(a.mean(), a.m2(), a.count());
}
System.err.println("--> merged = " + merged.evaluate());
WelfordAlgorithm single = new WelfordAlgorithm();
for (double v : v1) {
single.add(v);
}
for (double v : v2) {
single.add(v);
}
for (double v : v3) {
single.add(v);
}
System.err.println("--> single = " + single.evaluate());
} |
|
||
import static java.util.Collections.emptyList; | ||
|
||
public class StdDeviation extends AggregateFunction implements ToAggregator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think StandardDeviation is better, although it's an internal name, and we can change it anytime.
...gin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDeviationStates.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
public double evaluate() { | ||
return count < 2 ? 0 : Math.sqrt(m2 / count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be count-1
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with count
because I believe we use the population standard deviation elsewhere but I can change it if we'd prefer sample standard deviation?
* <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm"> | ||
* Parallel algorithm</a> | ||
*/ | ||
public final class WelfordAlgorithm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we fold this class into the StdDeviationStates#SingleState?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if we wanted to use it elsewhere (like if we wanted to support variance or have both sample and population standard deviation)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can leave it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like the kind of thing that could move moved to a common place like libs/
, but while there is only one usage, it might as well stay here for now.
...lugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/WelfordAlgorithm.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two optional comments, but overall, I think the PR is ready. LGTM! However, I’d love to have another review from the ES|QL team. Great work, thanks Larisa!
...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java
Outdated
Show resolved
Hide resolved
.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-StdDevAggregator.java.st
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Added some suggestions around tests mostly
...ql/src/test/java/org/elasticsearch/xpack/esql/expression/function/aggregate/StdDevTests.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec
Outdated
Show resolved
Hide resolved
...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java
Outdated
Show resolved
Hide resolved
...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java
Outdated
Show resolved
Hide resolved
Flyby feedback:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After adding Andrei's suggestions, LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need at least the resolveType
method, and probably the tests that Andrei suggests.
docs/changelog/116531.yaml
Outdated
@@ -0,0 +1,5 @@ | |||
pr: 116531 | |||
summary: "[ES|QL] Add a standard deviation function" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the prefix [ES|QL]
. The changelog is organized by area
, which already says it is ES|QL.
* <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm"> | ||
* Parallel algorithm</a> | ||
*/ | ||
public final class WelfordAlgorithm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like the kind of thing that could move moved to a common place like libs/
, but while there is only one usage, it might as well stay here for now.
tag = "docsStatsStdDevNestedExpression" | ||
) } | ||
) | ||
public StdDev(Source source, @Param(name = "number", type = { "double", "integer", "long" }) Expression field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The params claims to only support the numeric types, but there is no resolveType
function to enforce this. It should be similar to the Avg function: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Avg.java#L57
I tried to add in tests with
It seems to work with row as far as I can tell, but I'm not sure I understand what I should be looking for when I add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
*/ | ||
@Aggregator( | ||
{ | ||
@IntermediateState(name = "mean", type = "DOUBLE"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to specify that these are blocks or we need to always emit them as vectors (with count=0)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I changed it to returning 0 in the intermediate stages
💚 Backport successful
|
Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.
Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.
Uses Welford's online algorithm, as well as the parallel version, to
calculate standard deviation.