11/23/2022

Introduction to Data Science
CHAPTER 4: STATISTICAL DESCRIPTION OF DATA

AHMAD ABU-SHAREHA (PH.D)


FACULTY OF INFORMATION TECHNOLOGY
AL-AHLIYYA AMMAN UNIVERSITY

Introduction
A statistic is a quantity computed from sample data, while a parameter is a quantity computed from the
whole population of data. The word statistics (the plural of statistic) refers to a set of such quantities.

"Statistical descriptions" is used as a synonym for statistics.

Statistical descriptions depend on the features of the sample data that they are
trying to describe.

The most common statistical descriptions are:


◦ Measures of Center.
◦ Measures of Variation.
◦ Measures of Relative Standing.


Measures of Center
Measures of Center or Central Tendency Measures describe the middle values over a
column of values.

Central Tendency (or Groups’ “Middle Values”)


◦ Mean: the sum of all the values divided by the number of data values.
◦ Median: the number found in the middle of the values when they are sorted in order.
◦ Mode: the value that appears most frequently in a data set.

Note: when working with datasets with many outliers (an outlier is a data point that
differs significantly from other observations), it is sometimes more useful to use the
median of the dataset.


Measures of Center
Sample mean (Mean)
The sample mean is obtained by adding all the values in the sample and dividing by the sample
size (which is usually denoted by small n).
$\bar{x} = \frac{\sum x}{n}$

The symbol Σ is the sum sign in mathematics, which means that you should add up all the values
in your sample.

Example: 5 students' marks in the class are collected as (22, 21, 18, 25 and 24) to estimate the
average marks of the whole class.
• The sample size is n=5.
• The sample values are written as: x1=22, x2=21, x3=18, x4=25, x5=24
$\bar{x} = \frac{\sum x}{5} = \frac{22 + 21 + 18 + 25 + 24}{5} = \frac{110}{5} = 22$
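
The calculation above can be sketched in a few lines of Python. This is only an illustration: the variable names are chosen here and are not part of the original slides.

```python
from statistics import mean

marks = [22, 21, 18, 25, 24]     # sample values x1..x5 from the example

# Add up all the values and divide by the sample size n.
n = len(marks)
sample_mean = sum(marks) / n

print(n, sample_mean)            # sample size and sample mean
print(mean(marks))               # the standard library computes the same value
```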


The mean is probably the most frequently used measure of center, partly
because it has the following properties:
1. It can be calculated for any numerical data set.
2. Its value is unambiguous.
3. It lends itself to further statistical measures/treatment.
4. It takes into account every value in the data set.

However, outliers have a strong effect on the mean, particularly if the sample size is small. Therefore,
the trimmed mean is sometimes used instead: the upper and lower 5% of the values are deleted, and the
mean is taken over the remaining values.

In the previous example, if we trim the upper and lower 20% (for illustration only), then 18 and 25
are removed, and the remaining values are added and divided by 3: (21 + 22 + 24) / 3 ≈ 22.3.
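
A minimal sketch of the trimmed mean follows. The helper name `trimmed_mean` and the rule of cutting `int(n * proportion)` values from each end are assumptions made for this illustration; the proportion is assumed to be below 0.5 so that some values remain.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    data = sorted(values)
    k = int(len(data) * proportion)       # number of values cut from each end
    kept = data[k:len(data) - k]          # values that survive the trimming
    return sum(kept) / len(kept)

marks = [22, 21, 18, 25, 24]
result = trimmed_mean(marks, 0.20)        # removes 18 and 25, keeps 21, 22, 24
print(round(result, 1))
```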


Measures of Center
The median
The median is a measure of center, like the mean, but unlike the mean it is not strongly affected by outliers.

To obtain the median, we first need to re-arrange the data in ascending order (sorted), and then
find the middle value, namely,
◦ when n is odd, the median is the value in the middle (after sorting).
◦ when n is even, it is the mean of the two items nearest to the middle.

Example: 5 students' marks in the class are collected as (22, 21, 18, 25 and 24) to estimate the
average marks of the whole class.
• The sample size is n=5.
• The sample values are written as: x1=22, x2=21, x3=18, x4=25, x5=24
• The median of (18, 21, 22, 24, 25) is: 22
• If we add an outlier value (77) to the previous sample, then the median of (18, 21, 22, 24,
25, 77) is: (22 + 24)/2 = 23.
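
The example above can be reproduced with the standard library, as a small sketch. It also prints the mean of both samples to show how much more the outlier moves the mean than the median; the variable names are chosen here for illustration.

```python
from statistics import median

marks = [22, 21, 18, 25, 24]
with_outlier = marks + [77]               # same sample plus the outlier 77

# median() sorts internally; for an even n it averages the two middle values.
print(median(marks), median(with_outlier))

# The mean moves much further than the median when the outlier is added.
mean_before = sum(marks) / len(marks)
mean_after = sum(with_outlier) / len(with_outlier)
print(mean_before, round(mean_after, 2))
```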


Measures of Center
Fractiles/ Percentiles
The median is one example of a fractile/percentile: a value that divides the data set
into two or more equal parts. Tertiles and quartiles are other examples of fractiles/percentiles.

Quartiles divide the data into 4 equal parts at the values Q1, Q2, and Q3.
◦ Q1 (25th percentile) is the value such that 25% of the data fall below it.
◦ Q2 (50th percentile) is the median, and hence 50% of the data values fall below it.
◦ Q3 (75th percentile) is the value such that 75% of the data fall below it.
How to obtain quartiles:

◦ Sort the values.
◦ Q2 (median) is the middle value.
◦ Q1 is the middle of the values less than Q2.
◦ Q3 is the middle of the values greater than Q2.

At least 4 values are required.

[Figure: a scale from 0 to 1000 divided into four 25% parts by Q1 = 250, Q2 = 500, and Q3 = 750]

Example: The marks of 5 students in a class are collected as (22, 21, 18, 25, and 24) to estimate Q1, Q2, and Q3
of the whole class.
◦ Sorted values: 18, 21, 22, 24, 25
◦ Q2 = 22
◦ Q1 = (18 + 21)/2 = 19.5
◦ Q3 = (24 + 25)/2 = 24.5

Example: The marks of 4 students in a class are collected as (22, 21, 18, and 24) to estimate Q1, Q2, and Q3
of the whole class.
◦ Sorted values: 18, 21, 22, 24
◦ Q2 = (21 + 22)/2 = 21.5
◦ Q1 = (18 + 21)/2 = 19.5
◦ Q3 = (22 + 24)/2 = 23

Example: The marks of 8 students in a class are collected as (22, 21, 18, 13, 11, 10, 8, and 9).
◦ Sorted values: 8, 9, 10, 11, 13, 18, 21, 22
◦ Q2 = (11 + 13)/2 = 12
◦ Q1 = (9 + 10)/2 = 9.5
◦ Q3 = (18 + 21)/2 = 19.5
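
The three examples can be checked with a short sketch of the "median of each half" procedure. The function name `quartiles` is hypothetical, and splitting the sorted list by position (leaving out the middle value when n is odd) is an assumption about how "values less than Q2" should be read.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                               # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:] # skip the middle value when n is odd
    return median(lower), median(data), median(upper)

print(quartiles([22, 21, 18, 25, 24]))
print(quartiles([22, 21, 18, 24]))
print(quartiles([22, 21, 18, 13, 11, 10, 8, 9]))
```

Other quartile conventions interpolate between neighbouring values and can give slightly different results on small samples.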


Measures of Center
The mode
The mode is the measure of center given by the value that occurs most frequently.

The mode is usually used for non-numerical data, and it is the only measure of center that can be
computed for qualitative data.

Example: [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 13, 11, 9 and 20].
◦ The number 9 occurs most frequently (exactly 5 times), therefore 9 is the mode.

Notes:
1. If no number in a set of numbers occurs more than once, that set has no mode.
2. A set of numbers with two modes is bimodal, a set of numbers with three modes is trimodal,
and any set of numbers with more than one mode is multimodal.

Example: [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 13, 11 and 20] (the previous data set without its last 9).
◦ 9 and 14 each occur 4 times, so Mode = 9 and 14 → Bimodal.
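
A small sketch with the standard library's `statistics.multimode` (Python 3.8+) shows how modes can be found. The second list is the first one with its last 9 removed, an assumption made here so that 9 and 14 tie at 4 occurrences each.

```python
from statistics import multimode

data = [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 13, 11, 9, 20]
variant = data[:-2] + data[-1:]           # same list without the last 9

print(multimode(data))                    # a single most frequent value
print(sorted(multimode(variant)))         # two values tie, so the set is bimodal

# Note: when no value repeats, multimode() returns every value,
# whereas the notes above treat such a set as having no mode.
print(multimode([1, 2, 3]))
```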


Measures of Variations
Measures of variations are used to quantify how the data is "spread out". This is
a useful way to identify if our data has many outliers lurking inside.

Variations (or Summary of Differences Within Groups)


◦ Range
◦ Interquartile Range (IQR)
◦ Standard Deviation
◦ Variance


Measures of Variations
The range
The range is a measure of the differences between the data boundaries. The range of a data set
is the largest value minus the smallest value.

One disadvantage of the range is that it is highly influenced by outliers. For that reason, other
measures of variation are used more often.

Class A--IQs of 13 Students


102 115 128 109 131 89 98 106 140 119 93 97 110
Class A Range = 140 - 89 = 51

Class B--IQs of 13 Students


127 162 131 103 96 111 80 109 93 87 120 105 109
Class B Range = 162 - 80= 82
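
The two ranges can be verified with a brief sketch; the helper name `value_range` is chosen here for illustration.

```python
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

print(value_range(class_a), value_range(class_b))
```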


Measures of Variations
The interquartile range (IQR) is the distance between Q1 (the 25th percentile) and Q3 (the
75th percentile).

Remember: A quartile is a value that marks one of the divisions that break a series of values
into four equal parts.

Class A--IQs of 13 Students


89 93 97 98 102 106 109 110 115 119 128 131 140
Class A Interquartile Range = Q3 – Q1 = (119 + 128)/2 – (97 + 98)/2 = 123.5 – 97.5 = 26

Class B--IQs of 13 Students


80 87 93 96 103 105 109 109 111 120 127 131 162
Class B Interquartile Range = Q3 – Q1 = (120 + 127)/2 – (93 + 96)/2 = 123.5 – 94.5 = 29
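
The following sketch recomputes both interquartile ranges. It assumes the "median of each half" quartile rule used in the hand calculation; the function names are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]  # drop the middle value when n is odd
    return median(lower), median(data), median(upper)

def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

class_a = [89, 93, 97, 98, 102, 106, 109, 110, 115, 119, 128, 131, 140]
class_b = [80, 87, 93, 96, 103, 105, 109, 109, 111, 120, 127, 131, 162]
print(iqr(class_a), iqr(class_b))
```

Library routines such as the pandas `quantile` method interpolate between values by default, so they can return different quartiles, and therefore a different IQR, for the same data.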


Measures of Variations
The Variance (σ²)
The variance is a measure of the spread (dispersion) of the recorded values of a variable.
Calculating the variance starts with a "deviation."
◦ A deviation is the distance of a case's score from the mean.

Meaning:
◦ The larger the variance, the further the individual cases are from the mean.
◦ The smaller the variance, the closer the individual scores are to the mean.
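
As a sketch, the variance can be written as the average squared deviation from the mean. The form below divides by n, an assumption chosen to match the worked standard deviation example in this chapter; the sample variance divides by n - 1 instead.

```latex
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

Here $x_i$ are the data values, $\bar{x}$ is their mean, and $n$ is the number of values.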


Measures of Variations
The Standard Deviation (σ)
The standard deviation (std), denoted by σ or s, measures how much the data values deviate from the
arithmetic mean.
It is essentially a way to see how spread out the data is. The general formula for the
standard deviation, as applied in the steps below, is:
$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

Example: Find the standard deviation for {3, 5, 9, 10, 11, 12, 12, 14, 14}
Step 1: Sum → 3 + 5 + 9 + 10 + 11 + 12 + 12 + 14 + 14 = 90
Step 2: Mean → 90 / 9 = 10
Step 3: Sum of squared differences from the mean → 49 + 25 + 1 + 0 + 1 + 4 + 4 + 16 + 16 = 116
Step 4: Sum of squares / n → 116 / 9 ≈ 12.9
Step 5: std = √12.9 ≈ 3.6
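
The five steps can be followed in code, as a minimal sketch. The variable names are chosen here; `pstdev` and `stdev` are the standard library's population (divide by n) and sample (divide by n - 1) versions.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [3, 5, 9, 10, 11, 12, 12, 14, 14]
n = len(data)

total = sum(data)                              # Step 1: sum of the values
mean = total / n                               # Step 2: mean
sum_sq = sum((x - mean) ** 2 for x in data)    # Step 3: sum of squared differences
variance = sum_sq / n                          # Step 4: divide by n
std = sqrt(variance)                           # Step 5: square root

print(total, mean, sum_sq, round(variance, 2), round(std, 1))
print(round(pstdev(data), 1))                  # population std, same as the steps above
print(round(stdev(data), 1))                   # sample std divides by n - 1, so it is larger
```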


Measures of Variations
Notes about std:
1. The larger the std, the greater the amount of variation around the mean.
For example:
◦ [19, 25, 31, 13, 25, 37], Mean: 25, STD: 7.75
◦ [23, 25, 27, 26, 25, 24], Mean: 25, STD: 1.29

2. The std is 0 only when all the values are identical. In that case, the mean
is the same as every value in the set and there is no variation around the
mean.

3. If you rescale ("normalize") a variable by a constant factor, the std changes by the same factor.

4. Like the mean, the std will be inflated by an outlier case value.


Measures of Relative Standing


Measures of relative standing are numbers showing the location of data values
relative to the other values within a data set.

Accordingly, these measures are used to:


◦ Compare values from different data sets, or to compare values within the same data set.
◦ Create normalized variables.

Most common relative standing measures:


◦ z-score.
◦ Percentiles and quartiles.
◦ Statistical graph called the boxplot.


Measures of Relative Standing


Z Scores
A z score is found by converting a value to a standardized scale; it represents the number of
standard deviations that a data value lies from the mean.
Note: z scores should be rounded to two decimal places.
$z = \frac{x - \bar{x}}{s}$
where $\bar{x}$ is the mean and $s$ is the standard deviation.

Example:
Scores on a test have a mean of 70 and a standard deviation of 11, while Ahmad has a score of 48.
Convert Ahmad's score to a z-score and discuss the meaning of the obtained results.

Solution:
$z = \frac{48 - 70}{11} = \frac{-22}{11} = -2.00$

Ahmad has a z score of -2. This means that Ahmad's score of 48 is 2 standard deviations below the
mean.
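
The same conversion can be sketched as a small function; the name `z_score` is hypothetical, and the rounding to two decimals follows the note above.

```python
def z_score(value, mean, std):
    """Number of standard deviations that `value` lies from `mean`."""
    return round((value - mean) / std, 2)

ahmad = z_score(48, 70, 11)    # test mean 70, standard deviation 11
print(ahmad)                   # a negative result means the score is below the mean
```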


Measures of Relative Standing


Percentiles

Percentiles partition data into groups and represent the measures of location for data values.
Percentiles divide a set of data into 100 groups with about 1% of the values in each group.

To find the percentile of a data value, use the formula:

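$\text{percentile of } x = \frac{\text{number of values less than } x}{\text{total number of values}} \times 100\%$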

Example:
Find the percentile for the data value 14, given the dataset {4 6 14 10 4 10 18 18 22 6 6 18 12 2 18}
Solution:
There are 9 data values less than 14 and a total of 15 data values.

$\text{percentile of } 14 = \frac{9}{15} \times 100\% = 60\%$
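
A minimal sketch of this percentile formula follows; the function name `percentile_of` is chosen here for illustration.

```python
def percentile_of(value, data):
    """Percentage of the data values that are strictly less than `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

data = [4, 6, 14, 10, 4, 10, 18, 18, 22, 6, 6, 18, 12, 2, 18]
p14 = percentile_of(14, data)
print(p14)
```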


Measures of Relative Standing


Boxplots:
A boxplot is a graph of a data set that consists of a line extending from the minimum value to the
maximum value, and a box with lines drawn at the first quartile, Q1, the median, and the third
quartile, Q3.


Measures of Relative Standing


Example:
Given the following boxplot, what are Q1, Q2, and Q3?
◦ Q1: 7
◦ Q2: 9
◦ Q3: 13

[Figure: boxplot; the values 0, 2, 6, 7, 9, 12, 13, 17, and 18 are marked along its axis]

Draw a boxplot for the following students' marks:

◦ Sorted values: 8, 9, 10, 11, 13, 18, 21, 22
◦ Q1 = 9.5, Q2 = 12, Q3 = 19.5

[Figure: boxplot with minimum 8, Q1 = 9.5, median 12, Q3 = 19.5, and maximum 22]


Measures of Relative Standing


Modified Boxplots:
For purposes of constructing modified boxplots, we can consider outliers to be data
values meeting specific criteria.
In modified boxplots, a data value is an outlier if it is above Q3 by an amount greater
than 1.5 × IQR or below Q1 by an amount greater than 1.5 × IQR.
A modified boxplot is constructed with these specifications:
◦ A special symbol (such as an asterisk) is used to identify outliers.
◦ The solid horizontal line extends only as far as the minimum data value that is not an outlier and
the maximum data value that is not an outlier.
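
The outlier rule can be sketched as follows, using the marks sample with the outlier 77 from the median example. The function names are hypothetical, and the quartiles are computed with the "median of each half" rule, which is an assumption; other quartile conventions can move the fences slightly.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

def split_outliers(values):
    """Separate values inside the 1.5 x IQR fences from the outliers."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return inside, outliers

marks = [22, 21, 18, 25, 24, 77]
inside, outliers = split_outliers(marks)
print("outliers:", outliers)
print("line extends from", min(inside), "to", max(inside))
```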


Case Study: Statistical Description on Yelp


Download and Save Data:
◦ https://www.kaggle.com/shwetakhanjanshroff/yelp-review

Import Libraries:
In [*]: #Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

Read the Data:


In [1]: #Read csv file
df = pd.read_csv("Data/yelp.csv")
df.head()


Case Study: Statistical Description on Yelp


Measures of Centers – Mean
In [2]: df['stars'].mean()

In [3]: df['cool'].mean()

Measures of Centers – Median


In [4]: df['useful'].median()

In [5]: df['funny'].median()

Measures of Centers – Mode


In [6]: df['stars'].mode()

In [7]: df['user_id'].mode()


Case Study: Statistical Description on Yelp


Measures of Variations – Range
In [2]: df['stars'].max() - df['stars'].min()

In [3]: df['cool'].max() - df['cool'].min()

Measures of Variations – Interquartile Range


In [4]: df['stars'].quantile(0.75)-df['stars'].quantile(0.25)

In [5]: df['cool'].quantile(0.75)-df['cool'].quantile(0.25)

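Measures of Variations – Variance and STD


In [6]: df['stars'].std()  # sample std (divides by n - 1); use std(ddof=0) to divide by n

In [7]: df['stars'].var()  # sample variance; use var(ddof=0) to divide by n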


Case Study: Statistical Description on Yelp


Relative Standing– Percentile
In [2]: df['stars'].rank(pct=True)  # pct=True returns percentile ranks instead of raw ranks

In [3]: df['cool'].rank(pct=True)

Relative Standing– z-Score


In [4]: M = df['stars'].mean()
St = df['stars'].std()  # sample std (divides by n - 1)
zscore = (df['stars'].iloc[0] - M) / St  # z-score of the first review's star rating
print(zscore)

Relative Standing– Boxplot


In [5]: boxplot = df.boxplot(column=['stars', 'cool', 'funny'])


Case Study: Group-By Method


Using the "group by" method, we can:
• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) for each group
In [1]: import pandas as pd
df = pd.read_csv("Data/yelp.csv")
df.head()
# Group the reviews by star rating
df_rank = df.groupby(['stars'])
# Calculate the mean of each numeric column for each group
# (numeric_only=True skips text columns such as the review text)
df_rank.mean(numeric_only=True)
