SciPy - Stats

Last Updated : 28 Apr, 2025

scipy.stats is the SciPy sub-package for probability distributions and statistical operations. It provides a wide range of continuous and discrete distributions, along with functions for summary statistics.

Distributions are built on three classes:

Class	Description
rv_continuous	Base class for continuous random variables, from which specialized distribution subclasses and instances are created.
rv_discrete	Base class for discrete random variables, from which specialized distribution subclasses and instances are created.
rv_histogram	Generates a distribution from a given histogram.
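
As a rough illustration of these classes, the sketch below builds a custom discrete distribution with rv_discrete and a histogram-based distribution with rv_histogram. The fair-die values, the seed, and the bin count are assumptions chosen for this example only.

```python
import numpy as np
from scipy.stats import rv_discrete, rv_histogram

# A fair six-sided die as a custom discrete distribution (illustrative values)
faces = np.arange(1, 7)
probs = np.full(6, 1 / 6)
die = rv_discrete(name="die", values=(faces, probs))
print(die.pmf(3))    # probability of rolling a 3
print(die.mean())    # expected value of a roll

# A continuous distribution built from the histogram of a seeded sample
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
hist = np.histogram(sample, bins=30)   # (counts, bin_edges)
hist_dist = rv_histogram(hist)
print(hist_dist.cdf(hist[1][-1]))      # CDF at the right-most bin edge
```

Both objects expose the usual distribution methods (pmf or pdf, cdf, rvs, and so on), so they can be used like the built-in distributions.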

Continuous Random Variables

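A continuous random variable is one whose value X can be any real number in an interval. Every continuous distribution in scipy.stats accepts the loc and scale keywords, which shift and scale the distribution. For the normal distribution, the mean is given by the location (loc) keyword and the standard deviation by the scale (scale) keyword.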

As discussed, the rv_continuous class is used to create distribution subclasses and instances. One of them is norm, an instance of a subclass of rv_continuous, and its cdf method will calculate the CDF for us.
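
To make the role of loc and scale concrete, here is a minimal sketch using norm. The mean of 10 and standard deviation of 2 are arbitrary values picked for illustration.

```python
from scipy.stats import norm

# A normal distribution with mean 10 and standard deviation 2 (illustrative values)
dist = norm(loc=10, scale=2)
print(dist.mean(), dist.std())

# Passing loc/scale is equivalent to standardizing the argument by hand
shifted = norm.cdf(12, loc=10, scale=2)
standard = norm.cdf((12 - 10) / 2)
print(shifted, standard)
```

Calling norm(loc=..., scale=...) returns a "frozen" distribution, which saves repeating the parameters on every call.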

Let X be a continuous random variable with PDF f and CDF F.

PDF - Probability Density Function

The PDF f of a continuous random variable X satisfies the following conditions: f\left ( x \right )\geq 0 for all x\in \mathbb{R}, and f is piecewise continuous.

\int_{-\infty}^{\infty}f\left ( x \right )dx=1

P\left ( a\leq X\leq b \right )=\int_{a}^{b}f\left ( x \right )dx

The CDF is found by integrating the PDF:

F\left ( x \right )=\int_{-\infty}^{x}f\left ( t \right )dt

The PDF can be found by differentiating the CDF:

f\left ( x \right )=\frac{\mathrm{d} }{\mathrm{d} x}\left [ F\left ( x \right ) \right ]
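
The relationships above can be checked numerically. The sketch below uses the standard normal distribution and scipy.integrate.quad; the evaluation point x = 1.0 and the step size h are arbitrary choices for this example.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = 1.0  # illustrative evaluation point

# Integrating the PDF from -infinity to x should reproduce the CDF at x
area, _ = quad(norm.pdf, -np.inf, x)
print(area, norm.cdf(x))

# A central difference of the CDF should approximate the PDF at x
h = 1e-5
slope = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
print(slope, norm.pdf(x))

# The total area under the PDF should be 1
total, _ = quad(norm.pdf, -np.inf, np.inf)
print(total)
```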

Python3

# Importing the numpy module for numpy array
import numpy as npy

# Importing the scipy.stats.norm
from scipy.stats import norm

# calculating the cdf for the numpy array
print(norm.cdf(npy.array([-2, 0, 2])))

Output:

[0.02275013 0.5        0.97724987]

Discrete Random Variables

A discrete random variable can take only a countable number of values. Any discrete distribution accepts an additional integer parameter L (the loc keyword in scipy.stats), which shifts the distribution. The general distribution p and the standard distribution p₀ have the following relationship:

p\left ( x \right )=p_{0 }\left ( x-L \right )
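
A short sketch of this shift, assuming a Poisson distribution with an arbitrary rate mu = 3 and a shift L = 2 (both chosen only for illustration):

```python
import numpy as np
from scipy.stats import poisson

mu = 3.0   # rate of the Poisson distribution (illustrative)
L = 2      # integer shift, passed to SciPy as loc
x = np.arange(0, 10)

shifted = poisson.pmf(x, mu, loc=L)   # p(x)
standard = poisson.pmf(x - L, mu)     # p0(x - L)
print(np.allclose(shifted, standard))
```

Points below the shift (here x = 0 and x = 1) get probability zero, since the standard Poisson distribution has no mass on negative integers.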

scipy.stats.circmean

circmean computes the circular mean of samples that are assumed to lie in a given range, such as angles.

Syntax:

scipy.stats.circmean(samples, high=2*pi, low=0, axis=None, nan_policy='propagate')

where,

samples - input array of samples.
high ( float or int ) - high boundary for the sample range. Default high = 2 * pi.
low ( float or int ) - low boundary for the sample range. Default low = 0.
axis ( int ) - Axis along which means are computed. The default (None) computes the mean over the flattened array.
nan_policy ( ‘propagate’, ‘raise’, ‘omit’ ) - Defines how to handle input that contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, and ‘omit’ performs the calculations ignoring nan values. The default is ‘propagate’.

Python3

# importing the required package
from scipy.stats import circmean

# calculating the circular mean
print(circmean([0.4, 2.4, 3.6], high=4, low=2))
#                    |              |         |
#               -------------  ------------  ------------
#               sample array   higher bound  lower bound

Output:

2.254068341376122
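
To see why a circular mean is needed at all, consider two angles on either side of the 0/360 degree boundary. The sketch below compares the arithmetic mean with circmean and with a by-hand computation that averages unit vectors; the two angles are illustrative values, and the by-hand formula is a common way to define the circular mean rather than a copy of SciPy's internals.

```python
import numpy as np
from scipy.stats import circmean

# Two angles in degrees, just either side of 0/360 (illustrative values)
angles = np.array([350.0, 10.0])

arithmetic = angles.mean()                    # 180.0, which points the wrong way
circular = circmean(angles, high=360, low=0)  # lies at the 0/360 boundary

# The same idea by hand: average the unit vectors, then take their angle
radians = np.deg2rad(angles)
by_hand = np.arctan2(np.sin(radians).mean(), np.cos(radians).mean())

print(arithmetic)
# Compare directions through the cosine, since 0 and 360 are the same angle
print(np.cos(np.deg2rad(circular)), np.cos(by_hand))
```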

scipy.stats.contingency.crosstab

Given the lists a and p, create a contingency table that counts the frequencies of the corresponding pairs.

Python3

# importing the required package
from scipy.stats.contingency import crosstab

# list a
a = ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B']

# list p
p = ['P', 'P', 'P', 'Q', 'R', 'R', 'Q', 'Q', 'R', 'R']

# result ndarray
print(crosstab(a, p))

# using the crosstab function and unpacking
# the result into a's unique values,
# p's unique values and the final count of the pairs.
(auv, puv), cnt = crosstab(a, p)

# printing list a's unique values
print(auv)

# printing list p's unique values
print(puv)

# printing the count object which tells us
# the pairs count for each unique values of a and p.
print(cnt)

Output:

((array(['A', 'B'], dtype='<U1'), array(['P', 'Q', 'R'], dtype='<U1')), array([[2, 3, 0],
       [1, 0, 4]]))
['A' 'B']
['P' 'Q' 'R']
[[2 3 0]
 [1 0 4]]

Note - The first line of the output shows the full result, which bundles several arrays together. The first element, array(['A', 'B'], dtype='<U1'), is the array of unique values in the list a. The second element, array(['P', 'Q', 'R'], dtype='<U1'), is the array of unique values in the list p. The third element is the table of counts: the entry in row i and column j is the number of times the i-th unique value of a is paired with the j-th unique value of p.

list a =

A B A A B B A A B B

list p =

P P P Q R R Q Q R R

Result analysis

Counts for the pairs that contain A:

A - P = 2
A - Q = 3
A - R = 0

Counts for the pairs that contain B:

B - P = 1
B - Q = 0
B - R = 4
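
The same counts can be reproduced without SciPy, which is a handy sanity check. This sketch uses collections.Counter from the standard library; the variable names pair_counts and table are just illustrative.

```python
from collections import Counter

a = ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B']
p = ['P', 'P', 'P', 'Q', 'R', 'R', 'Q', 'Q', 'R', 'R']

# Count how often each (a, p) pair occurs
pair_counts = Counter(zip(a, p))

# Rebuild the table row by row; a Counter returns 0 for missing pairs
table = [[pair_counts[(row, col)] for col in sorted(set(p))]
         for row in sorted(set(a))]
print(table)
```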

scipy.stats.describe

This function computes several descriptive statistics of the input array.

Syntax:

scipy.stats.describe(a, axis=0, ddof=1, bias=True, nan_policy='propagate')

where,

a - Input array for which we want to generate the statistics.
axis ( int or None ) { # optional } - Axis along which statistics are calculated. The default axis is 0. If None, the statistics are computed over the whole array.
ddof ( int ) { # optional } - Delta degrees of freedom, used only for the variance. Default ddof = 1.
bias ( bool ) { # optional } - If False, the skewness and kurtosis calculations are corrected for statistical bias. Default is True.
nan_policy ( 'propagate', 'raise', 'omit' ) { # optional } - Defines how to handle NaN inputs. Default is 'propagate'.

Return:

nobs ( int or ndarray ) - number of observations (length of data along the given axis).
minmax ( tuple of ndarrays or floats ) - Minimum and Maximum value of input array along the given axis.
mean ( float or ndarray ) - mean of input array.
variance ( ndarray or float ) - variance of input array along the given axis.
skewness ( float or ndarray ) - skewness of input array along the given axis.
kurtosis ( ndarray or float ) - kurtosis of input array along the given axis.

Python3

# importing the stats and numpy module
from scipy import stats as st
import numpy as npy

# 1D input array
array = npy.array([10, 20, 30, 40, 50, 60, 70, 80])

# calling the describe function
print(st.describe(array))

Output:

DescribeResult(
 nobs=8,
 minmax=(10, 80),
 mean=45.0,
 variance=600.0,
 skewness=0.0,
 kurtosis=-1.2380952380952381)

Python3

# importing the stats and numpy module
from scipy import stats as st
import numpy as npy

# 2D array
nd = npy.array([[5, 6], [2, 3], [5, 5],
                [7, 9], [9, 8], [8, 7]])

# calling the describe function
print(st.describe(nd))

Output:

DescribeResult(nobs=6,
 minmax=(array([2, 3]),
 array([9, 9])),
 mean=array([6.        , 6.33333333]),
 variance=array([6.4       , 4.66666667]),
 skewness=array([-0.40594941, -0.3380617 ]),
 kurtosis=array([-0.9140625, -0.96     ]))
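
The returned DescribeResult behaves like a named tuple, so individual statistics can be read by name. The sketch below reuses the 1D array from the first example and checks the variance against NumPy; it is an illustration of how the result might be consumed, not part of the original examples.

```python
import numpy as np
from scipy import stats as st

data = np.array([10, 20, 30, 40, 50, 60, 70, 80])
result = st.describe(data)

# Fields can be read by name
print(result.nobs, result.minmax, result.mean)

# With the default ddof=1, the variance matches NumPy's sample variance
print(result.variance, np.var(data, ddof=1))
```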

scipy.stats.kurtosis

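Kurtosis quantifies how much of a probability distribution's weight lies in the tails as opposed to near the mean.

Kurtosis is the fourth central moment divided by the square of the variance. With Fisher's definition (the default), 3.0 is subtracted from the result so that a normal distribution has a kurtosis of 0.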

Syntax:

scipy.stats.kurtosis(a, axis=0, fisher=True, bias=True, nan_policy='propagate', *, keepdims=False)

where,

a - Input array for which the kurtosis is calculated.
axis ( int or None ) { # optional } - Axis along which the kurtosis is calculated. The default axis is 0. If None, it is computed over the whole array.
fisher ( bool ) { # optional } - If True, Fisher’s definition is used (normal distribution gives 0.0). If False, Pearson’s definition is used (normal distribution gives 3.0).
bias ( bool ) { # optional } - If False, then the calculations are corrected for statistical bias.
nan_policy ( 'propagate', 'raise', 'omit' ) { # optional } - Defines how to handle NaN inputs. Default is 'propagate'.
keepdims ( bool ) { # optional } - Default is False. If True, the reduced axes are kept in the result with size one, so that the result broadcasts correctly against the input array.

Returns:

kurtosis array - along the given axis.

Python3

# importing the stats module
from scipy import stats as st

# a random sample of 88 draws from the standard normal distribution
dataset = st.norm.rvs(size=88)

# calling the kurtosis function
print(st.kurtosis(dataset))

Output (the value changes from run to run because the sample is random):

0.04606780907050423
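
For a reproducible check of the definition, the kurtosis can also be computed by hand from the central moments. This sketch uses a small fixed array (the same values as the 1D describe example) and the biased, divide-by-n moment estimators, which is what bias=True corresponds to.

```python
import numpy as np
from scipy import stats as st

data = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)

# Central moments with the biased (divide-by-n) estimators
deviations = data - data.mean()
m2 = np.mean(deviations ** 2)   # variance
m4 = np.mean(deviations ** 4)   # fourth central moment

pearson = m4 / m2 ** 2          # Pearson's definition
fisher = pearson - 3.0          # Fisher's definition (normal distribution -> 0)

print(pearson, st.kurtosis(data, fisher=False))
print(fisher, st.kurtosis(data))
```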

scipy.stats.mstats.zscore

The Z-score tells us how many standard deviations a given value lies from the mean. When a data point's Z-score is 0, its value is equal to the mean.

Z = ( Observed Value ( x ) - mean ( μ ) ) / standard deviation ( σ )

Calculate the z-score of each value in the input array, relative to the sample mean and standard deviation. The example in this section calls scipy.stats.zscore, which takes the same parameters.

Function parameters -

Syntax:

scipy.stats.mstats.zscore(a, axis=0, ddof=0, nan_policy='propagate')

where,

a - Input array containing the sample data.
axis ( int or None ) { # optional } - Axis along which statistics are calculated. The default axis is 0. If None, they are computed over the whole array.
ddof ( int ) { # optional } - Degrees of freedom correction in the calculation of the standard deviation. The default value of ddof is 0.
nan_policy ( 'propagate', 'raise', 'omit' ) { # optional } - Defines how to handle NaN inputs. Default is 'propagate'.

Returns:

zscore - array - The z-scores of input array a, normalised by mean and standard deviation.

Python3

# importing the stats module
from scipy import stats as st

# the random 1D ARRAY ( dataset )
dataset = [0.02, 0.5, 0.01, 0.33, 0.51, 1.0, 0.03]

# the random 2D ARRAY ( dataset )
nd = [[5.1, 6.1], [2.1, 3.1], [5.1, 5.1],
      [7.1, 9.1], [9.1, 8.1], [8.1, 7.1]]

# calling the zscore function
# 1D dataset
print(st.zscore(dataset))

# calling the zscore function
# 2D dataset (z-scores are computed column by column)
print(st.zscore(nd))

Output:

[-0.95649434  0.46555034 -0.98612027 -0.03809048  0.49517627  1.94684689
 -0.92686841]
[[-0.4330127  -0.16903085]
 [-1.73205081 -1.69030851]
 [-0.4330127  -0.6761234 ]
 [ 0.4330127   1.35224681]
 [ 1.29903811  0.84515425]
 [ 0.8660254   0.3380617 ]]
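
The formula Z = (x - μ) / σ can be applied directly with NumPy to confirm what zscore does. This is a minimal sketch on the 1D dataset from the example; note that the default ddof=0 means the population standard deviation is used.

```python
import numpy as np
from scipy import stats as st

data = np.array([0.02, 0.5, 0.01, 0.33, 0.51, 1.0, 0.03])

# Z = (x - mean) / standard deviation, with ddof=0 as in the default
by_hand = (data - data.mean()) / data.std(ddof=0)
print(np.allclose(by_hand, st.zscore(data)))

# After standardizing, the values have mean 0 and standard deviation 1
z = st.zscore(data)
print(z.mean(), z.std())
```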

scipy.stats.skew

Skewness tells us the direction of outliers. With a positive skew, the right tail of the distribution curve is longer, so the outliers lie farther from the mean on the right and closer to it on the left. Skewness conveys only the direction of outliers; it says nothing about how many outliers there are.

skew computes the sample skewness of a data set. For normally distributed data, the skewness should be close to zero. For a unimodal continuous distribution, a skewness value greater than zero means that the right tail carries more weight.

Syntax:

scipy.stats.skew(a, axis=0, bias=True, nan_policy='propagate', *, keepdims=False)

where,

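a - Input array for which the skewness is calculated.
axis ( int or None ) { # optional } - Axis along which the skewness is calculated. The default axis is 0. If None, it is computed over the whole array.
bias ( bool ) { # optional } - If False, then the calculations are corrected for statistical bias.
nan_policy ( 'propagate', 'raise', 'omit' ) { # optional } - Defines how to handle NaN inputs. Default is 'propagate'.
keepdims ( bool ) { # optional } - Default is False. If True, the reduced axes are kept in the result with size one, so that the result broadcasts correctly against the input array.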

Return:

skewness - ndarray

Python3

# importing the stats module
from scipy import stats as st

# 1D input array
array = [99, 10, 30, 55, 50, 0, 90, 0]

# calling the skew function
print(st.skew(array))

Output:

0.3260023450293658
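
The sign of the skewness is easiest to see on a data set with one obvious outlier. In this sketch, the arrays right_tailed and left_tailed are made-up values for illustration, and the by-hand formula is the biased sample skewness (third central moment over the 1.5 power of the second), matching the default bias=True.

```python
import numpy as np
from scipy import stats as st

# Illustrative data sets: one with a long right tail and its mirror image
right_tailed = np.array([1, 2, 2, 3, 3, 3, 4, 20], dtype=float)
left_tailed = -right_tailed

print(st.skew(right_tailed))  # positive: the outlier is on the right
print(st.skew(left_tailed))   # negative: the outlier is on the left

# The biased sample skewness by hand: m3 / m2**1.5
deviations = right_tailed - right_tailed.mean()
m2 = np.mean(deviations ** 2)
m3 = np.mean(deviations ** 3)
by_hand = m3 / m2 ** 1.5
print(by_hand)
```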

scipy.stats.energy_distance

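The energy distance is a distance between two probability distributions. Let u and v be two distributions with CDFs U and V. Let X and X' be independent random variables with distribution u, and let Y and Y' be independent random variables with distribution v. The energy distance is then the square root of:

D^{2}\left ( U,V \right )=2\mathbb{E}\left \| X-Y \right \|-\mathbb{E}\left \| X-X' \right \|-\mathbb{E}\left \| Y-Y' \right \|\geq 0

where \left \| \cdot \right \| denotes the length of a vector (the absolute value in the 1D case).

energy_distance computes the energy distance between two 1D distributions.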

Python3

# importing the stats module
from scipy import stats as st

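# calling the function: the first two lists are the values observed
# in each distribution, the last two are the weights of those values
print(st.energy_distance([5, 10], [10, 20],
                         u_weights=[20, 30], v_weights=[30, 40]))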

Output:

2.851422845685634
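
The expectation formula can be evaluated directly for small, equally weighted samples. The sketch below does so with a hypothetical helper, mean_abs_diff, which averages |x - y| over all pairs; the sample values are illustrative and no weights are used, so the result differs from the weighted example.

```python
import numpy as np
from scipy import stats as st

# Two small, equally weighted samples (illustrative values)
u = np.array([5.0, 10.0])
v = np.array([10.0, 20.0])

def mean_abs_diff(x, y):
    """Average of |x_i - y_j| over all pairs: E|X - Y| for the empirical distributions."""
    return np.abs(x[:, None] - y[None, :]).mean()

d_squared = 2 * mean_abs_diff(u, v) - mean_abs_diff(u, u) - mean_abs_diff(v, v)
by_hand = np.sqrt(d_squared)

print(by_hand, st.energy_distance(u, v))

# The distance between a sample and itself is zero
print(st.energy_distance(u, u))
```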

scipy.stats.mode

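Return an array of the most common values (the modes) in the input array. By default, the mode is taken along axis 0, and if several values are equally common, the smallest one is returned.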

Python3

# importing the stats module
from scipy import stats as st

# sample input array
array = [[2, 3], [3, 1], [1, 3],
         [3, 3], [4, 2], [4, 4],
         [1, 2], [5, 6]]

# calling the mode function
print(st.mode(array))

Output:

ModeResult(mode=array([[1, 3]]), count=array([[2, 3]]))
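
As a cross-check, the column-wise modes can be found with np.unique. This sketch assumes the default axis=0 behaviour; np.ravel is applied to the SciPy result because, as far as I am aware, the shape of the returned arrays has differed between SciPy versions.

```python
import numpy as np
from scipy import stats as st

data = np.array([[2, 3], [3, 1], [1, 3],
                 [3, 3], [4, 2], [4, 4],
                 [1, 2], [5, 6]])

for column in data.T:
    # np.unique returns sorted values, so argmax picks the smallest on ties
    values, counts = np.unique(column, return_counts=True)
    winner = counts.argmax()
    print(values[winner], counts[winner])

# np.ravel flattens the result so the comparison does not depend on its shape
result = st.mode(data)
print(np.ravel(result.mode), np.ravel(result.count))
```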

scipy.stats.variation

variation computes the coefficient of variation: the standard deviation divided by the mean.

Python3

# importing the stats module
from scipy import stats as st

# sample input array
array = [[2, 3], [3, 1], [1, 3],
         [3, 3], [4, 2], [4, 4],
         [1, 2], [5, 6]]

# calling the function
print(st.variation(array, ddof=1))

Output:

[0.5070393  0.50395263]
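
The definition translates directly into NumPy. This sketch recomputes the result column by column, using ddof=1 for the standard deviation to match the call in the example.

```python
import numpy as np
from scipy import stats as st

data = np.array([[2, 3], [3, 1], [1, 3],
                 [3, 3], [4, 2], [4, 4],
                 [1, 2], [5, 6]], dtype=float)

# Standard deviation divided by the mean, per column
by_hand = data.std(axis=0, ddof=1) / data.mean(axis=0)
print(by_hand)
print(st.variation(data, ddof=1))
```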

scipy.stats.rankdata

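Assign ranks to data, dealing with ties appropriately. Ranks begin at 1, and by default tied values receive the average of the ranks they would otherwise occupy.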

Python3

# importing the stats module
from scipy import stats as st

# sample input array
array = [2, 3, 15, 1, 6, 9, 8, 4, 5, 10]

# calling the function
print(st.rankdata(array))

Output:

[ 2.  3. 10.  1.  6.  8.  7.  4.  5.  9.]
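
The example array has no ties, so the tie handling is not visible there. The sketch below uses a small made-up list with one tie and compares several values of the method parameter.

```python
from scipy import stats as st

scores = [10, 20, 20, 30]  # illustrative data with one tie

average = st.rankdata(scores)                    # ties share the average rank: 1, 2.5, 2.5, 4
minimum = st.rankdata(scores, method="min")      # ties get the lowest rank: 1, 2, 2, 4
dense = st.rankdata(scores, method="dense")      # like "min", but without gaps: 1, 2, 2, 3
ordinal = st.rankdata(scores, method="ordinal")  # ties broken by order of appearance: 1, 2, 3, 4

print(average, minimum, dense, ordinal)
```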

harishcarpenter
