To Data Science: Chapter 4: Statistical Description of Data
Introduction to Data
Science
CHAPTER 4: STATISTICAL DESCRIPTION OF DATA
Introduction
Statistics (the plural of statistic) are quantities computed from sample data, while a
parameter is a quantity computed from the whole population of data.
Statistical descriptions depend on the features of the sample data that they are
trying to describe.
1
11/23/2022
Measures of Center
Measures of center, also called measures of central tendency, describe the middle or
typical value of a column of values.
Note: when working with datasets with many outliers (an outlier is a data point that
differs significantly from other observations), the median is often a more useful measure
of center than the mean.
Sample mean (Mean)
The sample mean is obtained by adding all the values in the sample and dividing by the sample
size (which is usually denoted by small n).
$\bar{x} = \frac{\sum x}{n}$
The symbol Σ is the sum sign in mathematics, which means that you should add up all the values
in your sample.
Example: 5 students' marks in the class are collected as (22, 21, 18, 25 and 24) to estimate the
average marks of the whole class.
• The sample size is n=5.
• The sample values are written as: x1=22, x2=21, x3=18, x4=25, x5=24
$\bar{x} = \frac{22 + 21 + 18 + 25 + 24}{5} = \frac{110}{5} = 22$
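As a quick check of the arithmetic above, the sample mean can be computed directly or with Python's standard `statistics` module. This is only a minimal sketch; the variable names (`marks`, `manual_mean`, `library_mean`) are chosen here for illustration.

```python
from statistics import mean

# Marks of the 5 sampled students from the example
marks = [22, 21, 18, 25, 24]

n = len(marks)                  # sample size
manual_mean = sum(marks) / n    # sum of all values divided by n
library_mean = mean(marks)      # same computation from the standard library

print(manual_mean, library_mean)
```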
The mean is probably the most frequently used measure of center, partly because it has the
following properties:
1. It can be calculated for any numerical data set.
2. Its value is unambiguous.
3. It lends itself to further statistical measures/treatment.
4. It takes into account every value in the data set.
However, outliers have a strong effect on the mean, particularly if the sample size is small. Therefore,
the trimmed mean is sometimes used instead: the upper and lower 5% of the sorted values are deleted, and
the mean is taken of the remaining values.
In the previous example, if we trim the upper and lower 20% (a larger share, just for illustration), then
18 and 25 are removed, and the remaining values are added and divided by 3: (21 + 22 + 24) / 3 ≈ 22.3.
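The trimming procedure can be sketched in a few lines of Python. The helper `trimmed_mean` below is a hypothetical function written for this example, not a library call; it assumes the proportion is cut from each end and that the number of values to drop is rounded down.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # values removed per side, rounded down
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion is too large: no values left")
    return sum(kept) / len(kept)

marks = [22, 21, 18, 25, 24]

# Trimming 20% from each end removes 18 and 25, leaving 21, 22, 24
print(round(trimmed_mean(marks, 0.20), 1))
```

With a proportion of 0 the function reduces to the ordinary mean.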
The median
The median is a measure of center that, unlike the mean, is not strongly affected by outliers.
To obtain the median, we first need to re-arrange the data in ascending order (sorted), and then
find the middle value, namely,
◦ when n is odd, the median is the value in the middle (after sorting).
◦ when n is even, it is the mean of the two items nearest to the middle.
Example: 5 students' marks in the class are collected as (22, 21, 18, 25 and 24) to estimate the
average marks of the whole class.
• The sample size is n=5.
• The sample values are written as: x1=22, x2=21, x3=18, x4=25, x5=24
• The median of (18, 21, 22, 24, 25) is: 22
• If we add an outlier value (77) to the previous sample, then the median of (18, 21, 22, 24,
25, 77) is: (22 + 24) / 2 = 23.
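The effect of the outlier on the median and on the mean can be compared with a short sketch using the standard `statistics` module. The variable names are illustrative.

```python
from statistics import mean, median

marks = [22, 21, 18, 25, 24]
with_outlier = marks + [77]

# Odd n: the middle value after sorting
median_marks = median(marks)
# Even n: the mean of the two values nearest the middle
median_outlier = median(with_outlier)

# The mean moves much more than the median when 77 is added
mean_marks = mean(marks)
mean_outlier = mean(with_outlier)

print(median_marks, median_outlier)
print(mean_marks, round(mean_outlier, 2))
```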
Fractiles/ Percentiles
The median is one example of a fractile/percentile, a value that divides the sorted data set
into two or more equal parts. Tertiles and quartiles are other examples of fractiles.
Quartiles divide the data into 4 equal parts at the values Q1, Q2, and Q3.
◦ Q1 (25th percentile) is the value such that 25% of the data fall below it.
◦ Q2 (50th percentile) is the median, and hence 50% of the data values fall below it.
◦ Q3 (75th percentile) is the value such that 75% of the data fall below it.
Example: the marks of 5 students (22, 21, 18, 25 and 24) are collected to estimate Q1, Q2 and Q3
of the whole class.
◦ Sorted values: (18, 21, 22, 24, 25)
◦ Q1 = 19.5 (the median of the lower half: 18, 21)
◦ Q2 = 22 (the median)
◦ Q3 = 24.5 (the median of the upper half: 24, 25)
Example: the marks of 4 students (22, 21, 18 and 24) are collected to estimate Q1, Q2 and Q3
of the whole class.
◦ Sorted values: (18, 21, 22, 24)
◦ Q1 = 19.5
◦ Q2 = 21.5
◦ Q3 = 23
Example: the marks of 8 students are collected as (22, 21, 18, 13, 11, 10, 8 and 9).
◦ Sorted values: (8, 9, 10, 11, 13, 18, 21, 22)
◦ Q1 = 9.5
◦ Q2 = 12
◦ Q3 = 19.5
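The quartile values in these three examples are consistent with a "median of halves" rule: Q2 is the median, and Q1 and Q3 are the medians of the lower and upper halves, leaving the middle value out of both halves when n is odd. The sketch below assumes that rule; the function name `quartiles` is hypothetical. Other conventions exist (many libraries interpolate linearly by default) and can give different values for the same data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    When n is odd, the middle value is excluded from both halves.
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower = ordered[:half]     # values below the middle
    upper = ordered[-half:]    # values above the middle
    return median(lower), median(ordered), median(upper)

print(quartiles([22, 21, 18, 25, 24]))
print(quartiles([22, 21, 18, 24]))
print(quartiles([22, 21, 18, 13, 11, 10, 8, 9]))
```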
The mode
The mode is the measure of center defined as the value that occurs most frequently.
The mode is usually used for non-numerical data, and it is the only measure of center that can be
computed for qualitative data.
Example: [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 13, 11, 9 and 20].
◦ The number 9 occurs most frequently, appearing exactly 5 times, therefore 9 is the mode.
Notes:
1. If no number in a set of numbers occurs more than once, that set has no mode.
2. A set of numbers with two modes is bimodal, a set of numbers with three modes is trimodal,
and any set of numbers with more than one mode is multimodal.
Example: in the previous list, 9 occurs 5 times and 14 occurs 4 times. Adding one more 14 makes both
values occur 5 times.
◦ Mode = 9 and 14, so the set is bimodal.
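Modes can be found by counting occurrences. A minimal sketch with the standard library's `statistics.multimode` follows; the extra 14 in the second call is a hypothetical change made here only to show a bimodal result.

```python
from statistics import multimode

values = [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 13, 11, 9, 20]

# 9 appears 5 times and 14 appears 4 times, so there is a single mode
modes = multimode(values)

# Hypothetical variation: one more 14 ties it with 9, giving two modes
modes_with_extra_14 = multimode(values + [14])

# Note: when no value repeats, multimode returns every value, whereas
# the convention in the notes above is to say the set has no mode
no_repeats = multimode([1, 2, 3])

print(modes, sorted(modes_with_extra_14), no_repeats)
```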
Measures of Variations
Measures of variations are used to quantify how the data is "spread out". This is
a useful way to identify if our data has many outliers lurking inside.
The range
The range is a measure of the differences between the data boundaries. The range of a data set
is the largest value minus the smallest value.
One disadvantage of the range is that it is highly influenced by outliers. For that reason, other
measures of variation are used more often.
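A short sketch shows both the definition and the sensitivity to outliers. It reuses the marks from the earlier examples; the outlier 77 is the one used in the median example.

```python
marks = [22, 21, 18, 25, 24]
with_outlier = marks + [77]

# Range = largest value minus smallest value
data_range = max(marks) - min(marks)
range_with_outlier = max(with_outlier) - min(with_outlier)

# A single extreme value changes the range drastically
print(data_range, range_with_outlier)
```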
The interquartile range (IQR)
The interquartile range is the distance between Q1 (the 25th percentile) and Q3 (the 75th
percentile): IQR = Q3 - Q1.
Remember: quartiles are the values that divide a sorted series of values into four equal parts;
they were introduced earlier under fractiles/percentiles.
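The sketch below computes the IQR for the 8-student marks from the quartile examples and compares its reaction to an outlier with that of the range. It assumes the same median-of-halves quartile rule that matches those examples; `iqr` is a hypothetical helper name.

```python
from statistics import median

def iqr(values):
    """Interquartile range Q3 - Q1 using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    return q3 - q1

marks = [22, 21, 18, 13, 11, 10, 8, 9]
with_outlier = marks + [77]

# The IQR changes only slightly when an outlier is added,
# while the range (max - min) grows from 14 to 69
print(iqr(marks), iqr(with_outlier))
print(max(marks) - min(marks), max(with_outlier) - min(with_outlier))
```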
The Variance (σ²)
The variance is a measure of the spread (dispersion) of the recorded values of a variable.
Calculating the variance starts with a “deviation.”
◦ A deviation is the distance of a case’s score from the mean.
Meaning:
◦ The larger the variance, the further the individual cases are from the mean.
◦ The smaller the variance, the closer the individual scores are to the mean.
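To make the idea of deviations concrete, the sketch below computes the variance as the average of the squared deviations, using the 5 marks from the earlier examples. Dividing by n gives the population variance; dividing by n - 1 gives the sample variance. Which one applies depends on the setting, so both are shown.

```python
from statistics import mean, pvariance, variance

marks = [22, 21, 18, 25, 24]

m = mean(marks)
deviations = [x - m for x in marks]       # distance of each score from the mean
squared = [d ** 2 for d in deviations]    # squaring removes the sign

population_variance = sum(squared) / len(marks)        # divide by n
sample_variance = sum(squared) / (len(marks) - 1)      # divide by n - 1

# The standard library provides both versions directly
print(population_variance, pvariance(marks))
print(sample_variance, variance(marks))
```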
The Standard Deviation (σ)
The standard deviation (std), denoted by σ for a population or s for a sample, measures how much the
data values deviate from the arithmetic mean.
It is a way to see how spread out the data is. The general formula, dividing by n as in the example
that follows, is:
$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
where $\bar{x}$ is the mean and n is the number of values.
Example: Find the standard deviation for {3, 5, 9, 10, 11, 12, 12, 14, 14}
Step 1: Sum: 3 + 5 + 9 + 10 + 11 + 12 + 12 + 14 + 14 = 90.
Step 2: Mean: 90 / 9 = 10
Step 3: Sum of squared differences from the mean: 49 + 25 + 1 + 0 + 1 + 4 + 4 + 16 + 16 = 116
Step 4: Sum of squares / n: 116 / 9 ≈ 12.89
Step 5: std = √12.89 ≈ 3.6
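The five steps can be followed one by one in Python. This sketch divides by n, as the example does; `statistics.pstdev` uses the same divisor, while `statistics.stdev` divides by n - 1 and therefore gives a slightly larger value.

```python
from math import sqrt
from statistics import pstdev, stdev

values = [3, 5, 9, 10, 11, 12, 12, 14, 14]

total = sum(values)                                   # Step 1: sum
mean_value = total / len(values)                      # Step 2: mean
sum_sq = sum((x - mean_value) ** 2 for x in values)   # Step 3: squared differences
var_n = sum_sq / len(values)                          # Step 4: divide by n
std = sqrt(var_n)                                     # Step 5: square root

print(total, mean_value, sum_sq, round(var_n, 2), round(std, 1))
print(round(pstdev(values), 2), round(stdev(values), 2))
```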
Notes about std:
1. The larger the std, the greater the amount of variation around the mean.
For example:
◦ [19, 25, 31, 13, 25, 37], Mean: 25, STD: 7.75
◦ [23, 25, 27, 26, 25, 24], Mean: 25, STD: 1.29
2. The std is 0 when all the values are identical. In that case the mean
is the same as every value in the set and there is no variation around the
mean.
3. If you rescale a variable (for example, multiply every value by a constant), the std changes by
the same factor.
4. Like the mean, the std will be inflated by an outlier case value.
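These notes can be checked numerically. The sketch below uses `statistics.pstdev` (division by n), which reproduces the STD values quoted in note 1. Note 3 is read here as rescaling by a constant factor, which is an assumption about the intended meaning; the lists `identical`, `doubled` and `with_outlier` are made up for the illustration.

```python
from statistics import pstdev

spread_out = [19, 25, 31, 13, 25, 37]
tight = [23, 25, 27, 26, 25, 24]

# Note 1: same mean (25), very different spread
print(round(pstdev(spread_out), 2), round(pstdev(tight), 2))

# Note 2: identical values give a std of 0
identical = [25, 25, 25, 25]
print(pstdev(identical))

# Note 3 (assumed reading): doubling every value doubles the std
doubled = [2 * x for x in spread_out]
print(round(pstdev(doubled) / pstdev(spread_out), 2))

# Note 4: a single outlier inflates the std
with_outlier = tight + [90]
print(pstdev(with_outlier) > pstdev(tight))
```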
Example:
Scores on a test have a mean of 70 and a standard deviation of 11, while Ahmad has a score of 48.
Convert Ahmad's score to a z-score and discuss the meaning of the obtained results.
Solution:
$z = \frac{x - \bar{x}}{\sigma} = \frac{48 - 70}{11} = -2.00$
Ahmad has a z-score of -2. This means that Ahmad's score of 48 is 2 standard deviations below the
mean.
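The same conversion as a small Python sketch. The function name `z_score` and its parameter names are chosen here for illustration.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std

ahmad = z_score(48, mean=70, std=11)

# A negative result means the score is below the mean
print(ahmad)
```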
Percentiles partition data into groups and represent the measures of location for data values.
Percentiles divide a set of data into 100 groups with about 1% of the values in each group.
$\text{Percentile}(x) = \frac{\#\ \text{values less than } x}{\#\ \text{total values}} \times 100\%$
Example:
Find the percentile for the data value 14, given the dataset {4, 6, 14, 10, 4, 10, 18, 18, 22, 6, 6, 18, 12, 2, 18}
Solution:
There are 9 data values less than 14 and a total of 15 data values, so the percentile of 14 is
(9 / 15) × 100% = 60%.
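The counting in this solution can be written as a short function. `percentile_rank` is a hypothetical helper that follows the definition used here (the share of values strictly less than x); other definitions of percentile rank also count values equal to x.

```python
def percentile_rank(data, x):
    """Percentage of values in `data` that are strictly less than x."""
    below = sum(1 for value in data if value < x)
    return 100 * below / len(data)

data = [4, 6, 14, 10, 4, 10, 18, 18, 22, 6, 6, 18, 12, 2, 18]

below_14 = sum(1 for value in data if value < 14)
print(below_14, len(data), percentile_rank(data, 14))
```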
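[Figure: box plot marked at 8, 9.5, 12, 19.5 and 22 (minimum, Q1, median, Q3 and maximum of the 8-student marks example)]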
Import Libraries:
# Import Python libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
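# df is assumed to be a pandas DataFrame that is already loaded
# and has the columns 'cool', 'funny', 'user_id' and 'stars'
df['cool'].mean()       # mean of the 'cool' column
df['funny'].median()    # median of the 'funny' column
df['user_id'].mode()    # mode(s) of 'user_id'; a Series, since ties are possible
df['cool'].quantile(0.75) - df['cool'].quantile(0.25)   # interquartile range
df['stars'].var()       # variance; pandas divides by n - 1 by default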
df['cool'].rank()       # rank of each value in 'cool' (ties get the average rank)
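Since the DataFrame `df` is not defined in these cells, the following self-contained sketch builds a tiny stand-in with made-up values so the calls can be tried end to end. The column names match the cells above, but the numbers are hypothetical and are not taken from the original data set. Note that `quantile` interpolates linearly by default, so its quartiles can differ from the median-of-halves values in the earlier examples, and `var`/`std` divide by n - 1 unless `ddof=0` is passed.

```python
import pandas as pd

# Hypothetical stand-in data; not the data set used in the slides
df = pd.DataFrame({
    "cool": [0, 1, 2, 3, 10],
    "stars": [1, 3, 4, 4, 5],
})

cool_mean = df["cool"].mean()
cool_median = df["cool"].median()
stars_mode = df["stars"].mode().tolist()      # list, because ties are possible
cool_iqr = df["cool"].quantile(0.75) - df["cool"].quantile(0.25)
stars_var_sample = df["stars"].var()          # divides by n - 1 (default ddof=1)
stars_var_population = df["stars"].var(ddof=0)  # divides by n
cool_rank = df["cool"].rank().tolist()

print(cool_mean, cool_median, stars_mode, cool_iqr)
print(stars_var_sample, stars_var_population, cool_rank)
```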