Summary statistics

Basic summary statistics:

Aggregating summary statistics:

Every summary statistic can be used in aggregations of a DataFrame, DataColumn, GroupBy, Pivot, or PivotGroupBy:

df.mean()
df.age.sum()
df.groupBy { city }.mean()
df.pivot { city }.median()
df.pivot { city }.groupBy { name.lastName }.std()

sum, mean, and std are available for (primitive) number columns of types Int, Double, Float, Long, Byte, Short, and any mix of those.

min/max, median, and percentile are available for self-comparable columns, that is, columns of type T : Comparable<T> such as DateTime, String, or Int. This includes all primitive number columns, but not a mix of different number types.
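
The self-comparable requirement can be pictured with a plain Kotlin generic bound. The sketch below is only a stdlib model of the constraint, not the library's implementation; `minSketch` is a hypothetical helper name.

```kotlin
// Hypothetical helper: accepts any element type that is comparable to itself.
// Nulls are dropped first, mirroring the "null values are ignored" rule.
fun <T : Comparable<T>> minSketch(values: List<T?>): T? =
    values.filterNotNull().minOrNull()

fun main() {
    println(minSketch(listOf("Moscow", null, "London"))) // London
    println(minSketch(listOf(3, 1, 2)))                  // 1
    // A mixed list such as listOf(1, 2.5) does not satisfy T : Comparable<T>,
    // because Int and Double are not comparable to each other.
}
```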

In all cases, null values are ignored.

NaN values can optionally be ignored by setting the skipNaN flag to true. When it's set to false, a NaN in the input will be propagated to the result.
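
The null and NaN rules above can be modelled in a few lines of plain Kotlin. This is a stdlib-only sketch of the described behavior, not the library's code; `sumSketch` is a hypothetical name, and the default of `skipNaN = false` is an assumption based on the flag being described as optional.

```kotlin
// Hypothetical model of a statistic that ignores nulls and optionally skips NaN.
fun sumSketch(values: List<Double?>, skipNaN: Boolean = false): Double {
    // Nulls are always dropped before aggregating.
    val present = values.filterNotNull()
    // NaNs are dropped only when skipNaN is true; otherwise they stay in the input.
    val used = if (skipNaN) present.filterNot { it.isNaN() } else present
    return used.sum()
}

fun main() {
    val weights = listOf(54.0, null, Double.NaN, 70.0)
    println(sumSketch(weights))                 // NaN
    println(sumSketch(weights, skipNaN = true)) // 124.0
}
```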

Big numbers (BigInteger, BigDecimal) are generally not supported for statistics. Please convert them to primitive types before using statistics.
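
As a rough illustration of that conversion step, the stdlib-only sketch below turns BigDecimal values into Double before averaging them. The helper name `meanOfBig` is hypothetical, and the conversion can lose precision for values that do not fit a Double exactly.

```kotlin
import java.math.BigDecimal

// Hypothetical helper: convert to a primitive type first, then aggregate.
fun meanOfBig(values: List<BigDecimal>): Double =
    values.map { it.toDouble() }.average()

fun main() {
    val prices = listOf(BigDecimal("1.5"), BigDecimal("2.5"))
    println(meanOfBig(prices)) // 2.0
}
```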

When a statistic x is applied to several columns, it can be computed in several modes:

x(): DataRow computes a separate value for every suitable column
x { columns }: Value computes a single value across all given columns
xFor { columns }: DataRow computes a separate value for every given column
xOf { rowExpression }: Value computes a single value across the results of a row expression evaluated for every row

min/max, median, and percentile have an additional mode, by:

minBy { rowExpression }: DataRow finds a row with the minimal result of the rowExpression
medianBy { rowExpression }: DataRow finds a row where the median lies based on the results of the rowExpression
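
The difference between returning a value and returning a row can be sketched with plain Kotlin collections. This is a stdlib model under stated assumptions, not the library's implementation: `Person`, `youngest`, and `medianRow` are hypothetical names, and the example uses an odd number of rows because the text does not say how a median row is chosen for an even count.

```kotlin
data class Person(val name: String, val age: Int)

// Like minBy { age }: returns the whole row, not just the minimal value.
fun youngest(rows: List<Person>): Person? = rows.minByOrNull { it.age }

// Like medianBy { age } for an odd row count: sort by the expression, take the middle row.
fun medianRow(rows: List<Person>): Person = rows.sortedBy { it.age }[rows.size / 2]

fun main() {
    val people = listOf(Person("Alice", 15), Person("Bob", 45), Person("Charlie", 20))
    println(youngest(people))  // Person(name=Alice, age=15)
    println(medianRow(people)) // Person(name=Charlie, age=20)
}
```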

To perform statistics for a single row, see row statistics.

df.sum() // sum of values per every numeric column
df.sum { age and weight } // sum of all values in `age` and `weight`
df.sumFor(skipNaN = true) { age and weight } // sum of values per `age` and `weight` separately
df.sumOf { (weight ?: 0) / age } // sum of expression evaluated for every row
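
To make the modes concrete, the same three kinds of result can be computed over a small list of rows in plain Kotlin. This is only a stdlib sketch of what each mode returns, not the library's code; `Person` and the helper names are hypothetical.

```kotlin
data class Person(val age: Int, val weight: Int?)

// Like sumFor { age and weight } (and sum() over all numeric columns):
// one result per column; null weights are skipped.
fun sumPerColumn(rows: List<Person>): Map<String, Int> = mapOf(
    "age" to rows.sumOf { it.age },
    "weight" to rows.sumOf { it.weight ?: 0 },
)

// Like sum { age and weight }: one result over the values of both columns.
fun sumAcrossColumns(rows: List<Person>): Int = sumPerColumn(rows).values.sum()

// Like sumOf { (weight ?: 0) / age }: one result over a per-row expression.
// Both operands are Int here, so the division is integer division.
fun sumOfExpression(rows: List<Person>): Int = rows.sumOf { (it.weight ?: 0) / it.age }

fun main() {
    val people = listOf(Person(15, 54), Person(45, null), Person(20, 70))
    println(sumPerColumn(people))     // {age=80, weight=124}
    println(sumAcrossColumns(people)) // 204
    println(sumOfExpression(people))  // 6
}
```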

groupBy statistics

When a statistic is applied to a GroupBy DataFrame, it is computed for every data group.

If a statistic is applied in a mode that returns a single value for every data group, it will be stored in a single column named according to the statistic name.

df.groupBy { city }.mean { age } // [`city`, `mean`]
df.groupBy { city }.meanOf { age / 2 } // [`city`, `mean`]

You can also pass a custom name for the aggregated column:

df.groupBy { city }.mean("mean age") { age } // [`city`, `mean age`]
df.groupBy { city }.meanOf("custom") { age / 2 } // [`city`, `custom`]

If a statistic is applied in a mode that returns a separate value for every column in a data group, aggregated values will be stored in columns with original column names.

df.groupBy { city }.meanFor { age and weight } // [`city`, `age`, `weight`]
df.groupBy { city }.mean() // [`city`, `age`, `weight`, ...]
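
The two result shapes can be mimicked with stdlib grouping. The sketch below only models the shape of the output (one value per group versus one value per column per group); it is not the library's implementation, and `Person` and the helper names are hypothetical.

```kotlin
data class Person(val city: String, val age: Int, val weight: Int)

// Like groupBy { city }.mean { age }: a single value for every group.
fun meanAgeByCity(rows: List<Person>): Map<String, Double> =
    rows.groupBy { it.city }.mapValues { (_, group) -> group.map { it.age }.average() }

// Like groupBy { city }.meanFor { age and weight }: one value per column for every group.
fun meanPerColumnByCity(rows: List<Person>): Map<String, Map<String, Double>> =
    rows.groupBy { it.city }.mapValues { (_, group) ->
        mapOf(
            "age" to group.map { it.age }.average(),
            "weight" to group.map { it.weight }.average(),
        )
    }

fun main() {
    val people = listOf(
        Person("London", 15, 54),
        Person("Moscow", 45, 90),
        Person("London", 25, 66),
    )
    println(meanAgeByCity(people))       // {London=20.0, Moscow=45.0}
    println(meanPerColumnByCity(people)) // {London={age=20.0, weight=60.0}, Moscow={age=45.0, weight=90.0}}
}
```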

pivot statistics

When a statistic is applied to a Pivot or PivotGroupBy, it is computed for every data group.

If a statistic is applied in a mode that returns a single value for every data group, it will be stored in a DataFrame cell without any name.

df.groupBy { city }.pivot { name.lastName }.mean { age }
df.groupBy { city }.pivot { name.lastName }.meanOf { age / 2.0 }

df.groupBy("city").pivot { "name"["lastName"] }.mean("age")
df.groupBy("city").pivot { "name"["lastName"] }.meanOf { "age"<Int>() / 2.0 }

If a statistic is applied in a mode that returns a separate value for every column in a data group, every cell in the nested dataframe will contain a DataRow with values for every aggregated column.

df.groupBy { city }.pivot { name.lastName }.meanFor { age and weight }
df.groupBy { city }.pivot { name.lastName }.mean()

To group columns in aggregation results not by pivoted values, but by aggregated columns, apply the separate flag:

df.groupBy { city }.pivot { name.lastName }.meanFor(separate = true) { age and weight }
df.groupBy { city }.pivot { name.lastName }.mean(separate = true)
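
One way to picture the two layouts is as nested maps whose inner levels are swapped. The stdlib sketch below is an interpretation of the description above, not the library's implementation or its exact output format; `Person` and the helper names are hypothetical.

```kotlin
data class Person(val city: String, val lastName: String, val age: Int, val weight: Int)

// Default layout: city -> pivoted value (lastName) -> aggregated column -> mean.
fun byPivotValue(rows: List<Person>): Map<String, Map<String, Map<String, Double>>> =
    rows.groupBy { it.city }.mapValues { (_, inCity) ->
        inCity.groupBy { it.lastName }.mapValues { (_, group) ->
            mapOf(
                "age" to group.map { it.age }.average(),
                "weight" to group.map { it.weight }.average(),
            )
        }
    }

// Layout modelled for separate = true: city -> aggregated column -> pivoted value -> mean.
fun byColumn(rows: List<Person>): Map<String, Map<String, Map<String, Double>>> =
    byPivotValue(rows).mapValues { (_, perName) ->
        listOf("age", "weight").associateWith { column ->
            perName.mapValues { (_, means) -> means.getValue(column) }
        }
    }

fun main() {
    val people = listOf(
        Person("London", "Cooper", 15, 54),
        Person("London", "Marley", 45, 90),
        Person("Moscow", "Cooper", 25, 66),
    )
    println(byPivotValue(people)["London"]) // {Cooper={age=15.0, weight=54.0}, Marley={age=45.0, weight=90.0}}
    println(byColumn(people)["London"])     // {age={Cooper=15.0, Marley=45.0}, weight={Cooper=54.0, Marley=90.0}}
}
```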

20 May 2025