[ES|QL] Add a standard deviation function #116531

limotova · 2024-11-08T23:16:42Z

Uses Welford's online algorithm, as well as the parallel version, to
calculate standard deviation.

github-actions · 2024-11-08T23:16:54Z

Documentation preview:

✨ Changed pages

limotova · 2024-11-08T23:18:56Z

...compute/src/main/java/org/elasticsearch/compute/aggregation/X-StdDeviationAggregator.java.st

+        final long count = state.count();
+        final double m2 = state.m2();
+        if (count == 0 || Double.isFinite(m2) == false) {
+            return driverContext.blockFactory().newConstantNullBlock(1);


If the result is infinity or NaN I set it to return null, but I'm not sure if there should be a warning or something similar printed (or where that would best be done)?

limotova · 2024-11-08T23:19:54Z

...l/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/StdDeviation.java

+
+import static java.util.Collections.emptyList;
+
+public class StdDeviation extends AggregateFunction implements ToAggregator {


I wasn't sure what best to name this. There were a few options: StdDeviation, StandardDeviation, or Stdev (or maybe even Stddev)

I think StandardDeviation is better, although it's an internal name, and we can change it anytime.

What about for the name of the ES|QL function? Right now it's std_deviation, I feel like std_dev maybe works better? I worry that standard_deviation is kind of long

I think std_dev is better.

Should the internal name be changed to StdDev to match?

I prefer StandardDeviation for the class name, but feel free to choose whichever name you prefer.

I think it might be easier to change both. I tried changing only the function name, but it looks like the name of the test class is used for generating the docs, so it might be simpler to use StdDev for the class name as well.

You can annotate the test class with the function name if it does not match cleanly. For example, see the class SpatialIntersectsTests, which tests ST_INTERSECTS: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/spatial/SpatialIntersectsTests.java#L23

...rc/main/generated-src/org/elasticsearch/compute/aggregation/StdDeviationFloatAggregator.java

dnhatn · 2024-11-14T07:04:33Z

@limotova I extracted values from the serverless test and combined them to reproduce the test failure. Using three batches, the final value is 0.23282704603226836, while with a single batch, the value is 0.22797190865484734. Could you check if this discrepancy is acceptable?

    public void testBasic() {
        double[] v1 = {1.97, 2.0, 1.57, 1.48, 1.77};
        double[] v2 = {2.1, 1.74, 1.96, 1.42, 1.59, 2.07, 1.81, 1.59, 1.44, 2.03, 1.81};
        double[] v3 = {2.03, 1.54, 1.55};

        WelfordAlgorithm a1 = new WelfordAlgorithm();
        for (double v : v1) {
            a1.add(v);
        }

        WelfordAlgorithm a2 = new WelfordAlgorithm();
        for (double v : v2) {
            a2.add(v);
        }

        WelfordAlgorithm a3 = new WelfordAlgorithm();
        for (double v : v3) {
            a3.add(v);
        }

        WelfordAlgorithm merged = new WelfordAlgorithm();
        for (WelfordAlgorithm a : List.of(a3, a2, a1)) {
            merged.add(a.mean(), a.m2(), a.count());
        }

        System.err.println("--> merged = " + merged.evaluate());

        WelfordAlgorithm single = new WelfordAlgorithm();
        for (double v : v1) {
            single.add(v);
        }
        for (double v : v2) {
            single.add(v);
        }
        for (double v : v3) {
            single.add(v);
        }
        System.err.println("--> single = " + single.evaluate());
    }
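
For reference, a minimal self-contained sketch of the merge step being tested here, assuming the standard pairwise combination formula from the parallel variant of Welford's algorithm. The class name WelfordSketch and its methods are hypothetical and are not the PR's WelfordAlgorithm. With a correct merge, the batched and single-pass results should agree up to floating-point rounding.

```java
/** Hypothetical sketch of Welford's online algorithm with a pairwise merge. */
final class WelfordSketch {
    private double mean;
    private double m2;   // running sum of squared deviations from the mean
    private long count;

    /** Online update with a single value. */
    void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    /** Pairwise merge of another partial state (parallel variant). */
    void merge(WelfordSketch other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            mean = other.mean;
            m2 = other.m2;
            count = other.count;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        // Combined m2 = m2_a + m2_b + delta^2 * n_a * n_b / (n_a + n_b)
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    /** Population standard deviation (divides by count). */
    double evaluate() {
        return count < 2 ? 0 : Math.sqrt(m2 / count);
    }

    /** Builds a state from one batch of values. */
    static WelfordSketch of(double... values) {
        WelfordSketch state = new WelfordSketch();
        for (double v : values) {
            state.add(v);
        }
        return state;
    }

    public static void main(String[] args) {
        // Three partial states merged in arbitrary order versus a single pass.
        WelfordSketch merged = new WelfordSketch();
        merged.merge(of(2.03, 1.54, 1.55));
        merged.merge(of(2.1, 1.74, 1.96));
        merged.merge(of(1.97, 2.0, 1.57));
        WelfordSketch single = of(2.03, 1.54, 1.55, 2.1, 1.74, 1.96, 1.97, 2.0, 1.57);
        System.out.println("merged = " + merged.evaluate());
        System.out.println("single = " + single.evaluate());
    }
}
```

A difference in the second decimal place, as reported above, is far larger than rounding error would explain, which points at the merge formula rather than at floating-point noise.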

dnhatn · 2024-11-14T01:15:31Z

...l/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/StdDeviation.java

I think StandardDeviation is better, although it's an internal name, and we can change it anytime.

...gin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDeviationStates.java

dnhatn · 2024-11-14T18:34:39Z

...lugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/WelfordAlgorithm.java

+    }
+
+    public double evaluate() {
+        return count < 2 ? 0 : Math.sqrt(m2 / count);


Should this be count-1 instead?

I went with count because I believe we use the population standard deviation elsewhere but I can change it if we'd prefer sample standard deviation?
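
To make the two options concrete, here is a small sketch, assuming m2 is the accumulated sum of squared deviations from the mean. The class and method names are hypothetical, and the count < 2 guard simply mirrors the one in the hunk above. The only difference is the divisor: count for the population form, count - 1 (Bessel's correction) for the sample form.

```java
/** Hypothetical helper contrasting the two divisors; not part of the PR. */
final class StdDevVariants {
    /** Population standard deviation: divides by count. */
    static double population(double m2, long count) {
        return count < 2 ? 0 : Math.sqrt(m2 / count);
    }

    /** Sample standard deviation: divides by count - 1. */
    static double sample(double m2, long count) {
        return count < 2 ? 0 : Math.sqrt(m2 / (count - 1));
    }

    public static void main(String[] args) {
        // Values 2, 4, 4, 4, 5, 5, 7, 9 have mean 5 and m2 = 32.
        System.out.println("population = " + population(32.0, 8));
        System.out.println("sample     = " + sample(32.0, 8));
    }
}
```

The sample form is always the larger of the two for count >= 2, and the gap shrinks as count grows.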

dnhatn · 2024-11-14T18:35:34Z

...lugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/WelfordAlgorithm.java

+ *         <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm">
+ *         Parallel algorithm</a>
+ */
+public final class WelfordAlgorithm {


Should we fold this class into the StdDeviationStates#SingleState?

I wasn't sure if we wanted to use it elsewhere (like if we wanted to support variance or have both sample and population standard deviation)?

We can leave it as is.

Feels like the kind of thing that could be moved to a common place like libs/, but while there is only one usage, it might as well stay here for now.

...lugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/WelfordAlgorithm.java

dnhatn

I have two optional comments, but overall, I think the PR is ready. LGTM! However, I’d love to have another review from the ES|QL team. Great work, thanks Larisa!

...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java

.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-StdDevAggregator.java.st

ivancea

Looks good! Added some suggestions around tests mostly

...ql/src/test/java/org/elasticsearch/xpack/esql/expression/function/aggregate/StdDevTests.java

x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec

...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java

astefan · 2024-11-21T12:38:46Z

Flyby feedback:

std_dev(first_name) fails with an unfriendly error message: org.elasticsearch.xpack.esql.EsqlIllegalArgumentException: Cannot find intermediate state for: AggDef[aggClazz=class org.elasticsearch.xpack.esql.expression.function.aggregate.StdDev, type=BytesRef, extra=, grouping=false]. This probably comes from the lack of resolveType() method implementation.
std_dev(salary_change) I think deserves a test, meaning the aggregation function applied on a multi-value field. You are using salary_change in a test, but you apply mv_max on it, reducing it to a single value field.
another recent functionality we introduced for stats is the filter specific to an aggregation function. stats std_dev(salary) where languages > 3. It would be good to have a test for this as well. For example, FROM employees | stats std_dev(salary_change + 1) where languages > 3, std_dev(salary_change + 1) where languages <= 3, count(*) by gender
the row command is a different type of functionality where the source of the data is "static" and it doesn't come from ES. Would be good to have some tests with row as well. For example row a = [1,2,3], b = 5 | stats std_dev(a), std_dev(b) by a

ivancea

After adding Andrei's suggestions, LGTM!

craigtaverner

I think we need at least the resolveType method, and probably the tests that Andrei suggests.

craigtaverner · 2024-11-21T16:09:21Z

docs/changelog/116531.yaml

@@ -0,0 +1,5 @@
+pr: 116531
+summary: "[ES|QL] Add a standard deviation function"


Remove the prefix [ES|QL]. The changelog is organized by area, which already says it is ES|QL.

craigtaverner · 2024-11-21T16:13:26Z

...lugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/WelfordAlgorithm.java

Feels like the kind of thing that could be moved to a common place like libs/, but while there is only one usage, it might as well stay here for now.

craigtaverner · 2024-11-21T16:17:43Z

...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/StdDev.java

+                tag = "docsStatsStdDevNestedExpression"
+            ) }
+    )
+    public StdDev(Source source, @Param(name = "number", type = { "double", "integer", "long" }) Expression field) {


The params claim to support only the numeric types, but there is no resolveType function to enforce this. It should be similar to the Avg function: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Avg.java#L57
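
As a rough illustration of the idea (rejecting unsupported argument types up front with a readable message, instead of failing later when no aggregator is found), here is a self-contained sketch. Everything in it is hypothetical: the FieldType enum, the NumericArgumentCheck class, and the message wording are stand-ins and do not use the real ES|QL type-resolution classes or reproduce what Avg does.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of an up-front argument type check; not ES|QL code. */
final class NumericArgumentCheck {
    /** Stand-in for a field data type (assumption, not ES|QL's DataType). */
    enum FieldType { DOUBLE, INTEGER, LONG, KEYWORD, BOOLEAN }

    // The types the @Param annotation above advertises.
    private static final Set<FieldType> SUPPORTED =
        EnumSet.of(FieldType.DOUBLE, FieldType.INTEGER, FieldType.LONG);

    /** Returns an error message for unsupported types, or empty if the type is accepted. */
    static Optional<String> validate(String function, String field, FieldType type) {
        if (SUPPORTED.contains(type)) {
            return Optional.empty();
        }
        return Optional.of(String.format(
            Locale.ROOT,
            "argument [%s] of [%s] must be [double, integer or long], found type [%s]",
            field, function, type.name().toLowerCase(Locale.ROOT)
        ));
    }

    public static void main(String[] args) {
        System.out.println(validate("std_dev", "salary", FieldType.LONG));
        System.out.println(validate("std_dev", "first_name", FieldType.KEYWORD));
    }
}
```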

limotova · 2024-11-22T01:54:27Z

@astefan

another recent functionality we introduced for stats is the filter specific to an aggregation function. stats std_dev(salary) where languages > 3. It would be good to have a test for this as well. For example, FROM employees | stats std_dev(salary_change + 1) where languages > 3, std_dev(salary_change + 1) where languages <= 3, count(*) by gender

I tried to add tests with salary_change + 1, but since some of the values are empty, it looks like it doesn't handle the + 1 very well (it seems to end the calculation as soon as it hits an empty value). Is this expected? It works as expected with just salary_change (it skips empty values and calculates everything else), and it also handles something like salary * 2 (not multi-value, but no empty values) fine.
(Actually, based on the test failure, it might not be working properly. I'm unable to reproduce it locally, though, and am a bit stumped as to why it's happening.)

the row command is a different type of functionality where the source of the data is "static" and it doesn't come from ES. Would be good to have some tests with row as well. For example row a = [1,2,3], b = 5 | stats std_dev(a), std_dev(b) by a

It seems to work with row as far as I can tell, but I'm not sure what I should be looking for when I add by a here. The standard deviation of a single unique value is always 0, but since a is multi-valued here, the result of std_dev(a) by a is the same as std_dev(a); the former just repeats the result 3 times, once for each value of a.

craigtaverner

LGTM

docs/changelog/116531.yaml

dnhatn · 2024-11-22T20:28:34Z

.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-StdDevAggregator.java.st

+ */
+@Aggregator(
+    {
+        @IntermediateState(name = "mean", type = "DOUBLE"),


We need to specify that these are blocks or we need to always emit them as vectors (with count=0)?

Thank you! I changed it to return 0 in the intermediate stages.

elasticsearchmachine · 2024-11-22T22:34:59Z

💚 Backport successful

Status	Branch	Result
✅	8.x

Add a standard deviation function

7594d5a

Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.

elasticsearchmachine added the v9.0.0 label Nov 8, 2024

limotova commented Nov 8, 2024

limotova requested a review from dnhatn November 8, 2024 23:20

limotova added 4 commits November 8, 2024 13:34

lint

fa54d73

Merge branch 'main' into add-stddev-function

ffdf276

rest test

b4c9c1b

bwc

0a74c9a

dnhatn reviewed Nov 12, 2024

...rc/main/generated-src/org/elasticsearch/compute/aggregation/StdDeviationFloatAggregator.java

limotova added 3 commits November 12, 2024 10:31

Merge branch 'main' into add-stddev-function

5941da8

move states from template to individual classes

55cdef1

Merge branch 'main' into add-stddev-function

cb07ebd

dnhatn reviewed Nov 14, 2024

...lugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/WelfordAlgorithm.java

limotova added 10 commits November 14, 2024 11:37

change State names to SingleState and GroupingState

9f8b3be

change StdDeviation to StdDev

5fb5b88

fix parallel algorithm

3503934

whoops docs

2b4f80d

lint

bb32947

Merge branch 'main' into add-stddev-function

a8bff93

more renaming

f69d640

linting continues

8a5ac5c

Merge branch 'main' into add-stddev-function

78f2061

Merge branch 'main' into add-stddev-function

81fbf0a

dnhatn approved these changes Nov 20, 2024

...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java

.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-StdDevAggregator.java.st

limotova added 2 commits November 19, 2024 16:45

Merge branch 'main' into add-stddev-function

6842d40

move evaluate final to states and fix entry

93b36db

ivancea reviewed Nov 20, 2024

limotova added 3 commits November 20, 2024 14:18

change SingleState to WelfordAlgorithm and change tests

bc715f4

dot

a3477ef

Merge branch 'main' into add-stddev-function

17f0b6b

limotova requested a review from ivancea November 21, 2024 01:01

ivancea approved these changes Nov 21, 2024

dnhatn added the v8.18.0 and auto-backport labels Nov 21, 2024

craigtaverner requested changes Nov 21, 2024

limotova added 2 commits November 21, 2024 15:43

tests, resolveType, changelog

a476e03

Merge branch 'main' into add-stddev-function

3e2ae59

limotova requested a review from craigtaverner November 22, 2024 01:54

limotova added 2 commits November 21, 2024 16:35

specify order of filter test

c59efab

Merge branch 'main' into add-stddev-function

601cfa5

craigtaverner approved these changes Nov 22, 2024

docs/changelog/116531.yaml

craigtaverner reviewed Nov 22, 2024

docs/changelog/116531.yaml

fix changelog

03e2a70

dnhatn reviewed Nov 22, 2024

limotova added 2 commits November 22, 2024 10:38

set intermediate values to 0 when null

827af56

Merge branch 'main' into add-stddev-function

ad4daa7

limotova merged commit 7e801e0 into elastic:main Nov 22, 2024
16 checks passed

limotova deleted the add-stddev-function branch November 22, 2024 22:33

limotova mentioned this pull request Nov 22, 2024

[8.x] [ES|QL] Add a standard deviation function (#116531) #117398

Merged

limotova added a commit to limotova/elasticsearch that referenced this pull request Nov 22, 2024

[ES|QL] Add a standard deviation function (elastic#116531)

dcd4f96

Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.

elasticsearchmachine pushed a commit that referenced this pull request Nov 22, 2024

[ES|QL] Add a standard deviation function (#116531) (#117398)

04849b0

Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.

alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024

[ES|QL] Add a standard deviation function (elastic#116531)

e7c4776

Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.

alex-spies mentioned this pull request Dec 2, 2024

ESQL: Functions! #98545

Open

99 tasks

