Session 12
Statistics refers to the mathematics and techniques with which we understand data.
Descriptive Statistics
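Descriptive statistics is about describing and summarizing data. It uses two main approaches: the quantitative approach, which summarizes data numerically, and the visual approach, which illustrates data with charts and plots.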
Types of Measures
Central tendency tells you about the center of the data. Useful measures include the mean, median, and mode.
Variability tells you about the spread of the data. Useful measures include variance and standard deviation.
Correlation or joint variability tells you about the relation between a pair of variables in a dataset. Useful measures include covariance and
the correlation coefficient.
Mean
Weighted mean
Geometric mean
Harmonic mean
Median
Mode
Mean
The sample mean, also called the sample arithmetic mean or simply the average, is the arithmetic average of all the items in a dataset. The mean of a dataset 𝑥 is mathematically expressed as
$$\bar{x} = \frac{\sum_{i} x_i}{n},$$
where 𝑖 = 1, 2, …, 𝑛. In other words, it’s the sum of all the elements 𝑥ᵢ divided by the number of items in the dataset 𝑥.
Out[4]: 8.7
Out[5]: 8.7
In [6]: mean_ = st.fmean(x)
        mean_
Out[6]: 8.7
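The input cells that produced the first outputs are not shown. As a minimal sketch, assuming st is the standard statistics module and using an illustrative dataset x (an assumption, chosen because its mean is 8.7 like the outputs above), the mean can be computed from the definition and with the two library functions:

```python
import statistics as st

# Illustrative dataset (assumed; the original input cell is not shown).
x = [8.0, 1, 2.5, 4, 28.0]

# Definition: sum of all items divided by the number of items.
mean_manual = sum(x) / len(x)

# Library equivalents: mean() preserves exact arithmetic where it can,
# fmean() converts everything to float and is faster.
mean_st = st.mean(x)
mean_fast = st.fmean(x)

print(mean_manual, mean_st, mean_fast)
```

All three approaches should agree; fmean() is usually the convenient choice when the data are already floats.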
However, if there are nan values among your data, then statistics.mean() and statistics.fmean() will return nan as the output:
Out[7]: nan
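The nan behaviour can be checked with a short sketch. The dataset below is illustrative (an assumption, not the original cell), and dropping nan values before averaging is only one possible workaround:

```python
import math
import statistics as st

# Illustrative dataset with one missing value (assumed).
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

mean_nan = st.mean(x_with_nan)
fmean_nan = st.fmean(x_with_nan)

# nan never compares equal to anything, so test it with math.isnan().
print(math.isnan(mean_nan), math.isnan(fmean_nan))

# One possible workaround: filter out nan values before averaging.
clean = [v for v in x_with_nan if not math.isnan(v)]
mean_clean = st.fmean(clean)
print(mean_clean)
```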
Weighted Mean
The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that enables you to
define the relative contribution of each data point to the result.
You define one weight 𝑤ᵢ for each data point 𝑥ᵢ of the dataset 𝑥, where 𝑖 = 1, 2, …, 𝑛 and 𝑛 is the number of items in 𝑥. Then, you multiply each data point with the corresponding weight, sum all the products, and divide the obtained sum with the sum of weights:
$$\frac{\sum_{i} w_i x_i}{\sum_{i} w_i}$$
Out[8]: 6.95
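The weighted mean formula translates directly into pure Python. This is a sketch under assumptions: both the dataset x and the weights w are illustrative values (the original input cell is not shown), picked so that the result is close to the 6.95 shown above.

```python
# Illustrative data and weights (both assumed).
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

# Multiply each data point by its weight, sum the products,
# and divide by the sum of the weights.
wmean = sum(w_i * x_i for w_i, x_i in zip(w, x)) / sum(w)

print(round(wmean, 2))
```

When the weights already sum to 1, the division by sum(w) changes nothing, but keeping it makes the code correct for arbitrary non-negative weights.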
Geometric Mean
The geometric mean is a special type of average where we multiply the 𝑛 numbers together and then take the 𝑛-th root: a square root for two numbers, a cube root for three numbers, and so on. For data points 𝑥ᵢ, where 𝑖 = 1, 2, …, 𝑛, it is
$$\sqrt[n]{\prod_{i} x_i}$$
In [9]: gmean = st.geometric_mean(x)
        print(round(gmean, 2))
4.68
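To connect the library call to the definition, the sketch below computes the n-th root of the product by hand and compares it with st.geometric_mean(). The dataset x is an assumption (the original definition of x is not shown); math.prod() requires Python 3.8 or later.

```python
import math
import statistics as st

# Illustrative dataset (assumed; the original input cell is not shown).
x = [8.0, 1, 2.5, 4, 28.0]

# Definition: n-th root of the product of the n values.
gmean_manual = math.prod(x) ** (1 / len(x))

# Library version.
gmean = st.geometric_mean(x)

print(round(gmean_manual, 2), round(gmean, 2))
```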
Harmonic Mean
The harmonic mean is the reciprocal of the mean of the reciprocals of all items in the dataset.
For example, the harmonic mean of three values a, b and c will be equivalent to
$$\frac{3}{1/a + 1/b + 1/c}$$
If one of the values is zero, the result will be zero.
The harmonic mean is a type of average, a measure of the central location of the data. It is often appropriate when averaging rates or ratios, for
example speeds.
Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr. What is the average speed?
Out[10]: 48.0
Out[11]: 27.97513321492007
2.76
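The car example can be worked through in code. The sketch below uses st.harmonic_mean() on the two speeds and cross-checks the answer as total distance divided by total time; the dataset x in the last step is an assumption, used only to show the definition applied to a longer list.

```python
import statistics as st

# Car example: two equal legs of 10 km at 40 km/hr and 60 km/hr.
speeds = [40, 60]
hmean_speed = st.harmonic_mean(speeds)

# Cross-check from first principles: total distance / total time.
total_distance = 10 + 10            # km
total_time = 10 / 40 + 10 / 60      # hours
avg_speed = total_distance / total_time

# Definition applied to an illustrative dataset (assumed):
# n divided by the sum of the reciprocals.
x = [8.0, 1, 2.5, 4, 28.0]
hmean_manual = len(x) / sum(1 / v for v in x)

print(hmean_speed, round(avg_speed, 2), round(hmean_manual, 2))
```

The arithmetic mean of the two speeds would be 50 km/hr, which overstates the average because the car spends more time on the slower leg.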
Median
The sample median is the middle element of a sorted dataset. The dataset can be sorted in increasing or decreasing order. If the number of
elements 𝑛 of the dataset is odd, then the median is the value at the middle position: 0.5(𝑛 + 1). If 𝑛 is even, then the median is the arithmetic
mean of the two values in the middle, that is, the items at the positions 0.5𝑛 and 0.5𝑛 + 1.
For example, if you have the data points 2, 4, 1, 8, and 9, then the median value is 4, which is in the middle of the sorted dataset (1, 2, 4, 8, 9). If
the data points are 2, 4, 1, and 8, then the median is 3, which is the average of the two middle elements of the sorted sequence (2 and 4).
Out[13]: 3
Out[14]: 4.0
Out[15]: 4.0
Out[16]: 4
Out[17]: 3.25
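The two worked examples from the text can be reproduced as follows. The helper median_manual is a hypothetical name introduced here to mirror the position rule; it is a sketch, not the library's implementation.

```python
import statistics as st

odd = [2, 4, 1, 8, 9]   # five data points from the text
even = [2, 4, 1, 8]     # four data points from the text

median_odd = st.median(odd)    # middle of 1, 2, 4, 8, 9
median_even = st.median(even)  # average of 2 and 4


def median_manual(data):
    """Median from the position rule, using 0-based indices."""
    s = sorted(data)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle element.
        return s[mid]
    # Even n: mean of the two middle elements.
    return 0.5 * (s[mid - 1] + s[mid])


print(median_odd, median_even, median_manual(odd), median_manual(even))
```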
median_low() and median_high() are two more functions related to the median in the Python statistics library. They always return an element
from the dataset:
If the number of elements is odd, then there’s a single middle value, so these functions behave just like median().
If the number of elements is even, then there are two middle values. In this case, median_low() returns the lower and median_high() the
higher middle value.
Out[18]: 3
Out[19]: 3
Out[20]: 3
Out[21]: 5
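As a short illustration of the difference, the sketch below applies the three functions to the even-length and odd-length examples used earlier for the median (the input cells for the outputs above are not shown, so these lists are an assumption about suitable inputs, not the original ones):

```python
import statistics as st

even = [2, 4, 1, 8]     # sorted: 1, 2, 4, 8 -> middle values 2 and 4
odd = [2, 4, 1, 8, 9]   # sorted: 1, 2, 4, 8, 9 -> middle value 4

low_even = st.median_low(even)    # lower of the two middle values
high_even = st.median_high(even)  # higher of the two middle values
mid_even = st.median(even)        # their average, not an element of the data

low_odd = st.median_low(odd)
high_odd = st.median_high(odd)

print(low_even, high_even, mid_even, low_odd, high_odd)
```

These functions are useful when the result must be an actual observation, for example with discrete data.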
Mode
The sample mode is the value in the dataset that occurs most frequently. If there isn’t a single such value, then the set is multimodal since it has
multiple modal values. For example, in the set that contains the points 2, 3, 2, 8, and 12, the number 2 is the mode because it occurs twice, unlike
the other items that occur only once.
Out[22]: 3
Out[23]: [1, 3]
In [24]: st.multimode('aabbbbccddddeeffffgg')
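The example from the text and the multimode() call can be checked with a short sketch. It assumes Python 3.8 or later, where multimode() exists and mode() returns the first mode encountered instead of raising an error for multimodal data.

```python
import statistics as st

# Data points from the text: 2 occurs twice, everything else once.
points = [2, 3, 2, 8, 12]
mode_points = st.mode(points)
modes_points = st.multimode(points)

# multimode() returns every value tied for the highest count,
# in the order first encountered. Here b, d and f each occur four times.
modes_letters = st.multimode('aabbbbccddddeeffffgg')

print(mode_points, modes_points, modes_letters)
```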
Variance
Standard deviation
Variance
The sample variance quantifies the spread of the data. It shows numerically how far the data points are from the mean. You can express the
sample variance of the dataset 𝑥 with 𝑛 elements mathematically as
$$S^2 = \frac{\sum_{i} (x_i - \bar{x})^2}{n - 1}$$
where
𝑆² = sample variance
𝑥ᵢ = the value of one observation
𝑥̄ = the mean value of all observations
𝑛 = the number of observations
Out[25]: 123.2
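The sample variance formula can be evaluated step by step and compared with st.variance(). The dataset x is an assumption (the original input cell is not shown); it is chosen because its sample variance is about 123.2, in line with the output above.

```python
import statistics as st

# Illustrative dataset (assumed; the original input cell is not shown).
x = [8.0, 1, 2.5, 4, 28.0]

n = len(x)
mean_ = sum(x) / n

# Sum of squared deviations from the mean, divided by n - 1.
var_manual = sum((v - mean_) ** 2 for v in x) / (n - 1)

# Library version of the sample variance.
var_ = st.variance(x)

print(round(var_manual, 2), round(var_, 2))
```

Dividing by n - 1 rather than n corrects the bias that arises from estimating the mean from the same sample.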
Standard Deviation
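In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. The population standard deviation is
$$\sigma = \sqrt{\frac{\sum_{i} (x_i - \mu)^2}{N}}$$
where
𝜎 = population standard deviation
𝑁 = the size of the population
𝑥ᵢ = each value from the population
𝜇 = the population mean
Note that statistics.stdev() computes the sample standard deviation, that is, the square root of the sample variance with 𝑛 − 1 in the denominator. The population form given here corresponds to statistics.pstdev().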
In [26]: st.stdev(x)
Out[26]: 11.099549540409287
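To make the sample versus population distinction concrete, the sketch below computes both with the statistics module and checks the population value against the formula with 𝑁 in the denominator. The dataset x is an assumption (the original input cell is not shown).

```python
import math
import statistics as st

# Illustrative dataset (assumed; the original input cell is not shown).
x = [8.0, 1, 2.5, 4, 28.0]

# Sample standard deviation: square root of the sample variance (n - 1).
sample_sd = st.stdev(x)

# Population standard deviation: divides by N instead of n - 1.
population_sd = st.pstdev(x)

# Population formula written out by hand.
N = len(x)
mu = sum(x) / N
population_sd_manual = math.sqrt(sum((v - mu) ** 2 for v in x) / N)

print(round(sample_sd, 2), round(population_sd, 2), round(population_sd_manual, 2))
```

The sample value is always at least as large as the population value for the same data, and the gap shrinks as the number of observations grows.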