11/23/2022

Introduction to Data Science
CHAPTER 4: STATISTICAL DESCRIPTION OF DATA

AHMAD ABU-SHAREHA (PH.D)

FACULTY OF INFORMATION TECHNOLOGY
AL-AHLIYYA AMMAN UNIVERSITY

Introduction
A statistic is a quantity computed from sample data, and statistics (the plural) are a set of
such quantities. A parameter, by contrast, is a quantity computed from the whole population of
data.

Statistical description is a synonym for statistics.

Statistical descriptions depend on the features of the sample data that they are
trying to describe.

The most common statistical descriptions are:

◦ Measures of Center.
◦ Measures of Variation.
◦ Measures of Relative Standing.

IntroductIon to data ScIence 2


Measures of Center
Measures of Center or Central Tendency Measures describe the middle values over a
column of values.

Central Tendency (or Groups’ “Middle Values”)

◦ Mean: the sum of all the values divided by the number of values.
◦ Median: the value found in the middle of the data when the values are sorted in order.
◦ Mode: the value that appears most frequently in a data set.

Note: when working with datasets with many outliers (an outlier is a data point that
differs significantly from other observations), it is sometimes more useful to use the
median of the dataset.


Measures of Center
Sample mean (Mean)
The sample mean is obtained by adding all the values in the sample and dividing by the sample
size (which is usually denoted by small n).
$\bar{x} = \frac{\sum x}{n}$

The symbol Σ is the sum sign in mathematics, which means that you should add up all the values
in your sample.


The mean is probably the most frequently used measure of center, partly because it has the
following properties:
1. It can be calculated for any numerical data set.
2. Its value is unambiguous.
3. It lends itself to further statistical measures/treatment.
4. It takes into account every value in the data set.

However, outliers have a strong effect on the mean, particularly if the sample size is small. Therefore,
at times the trimmed mean is used instead: the upper and lower 5% of the values are deleted, and the
mean is taken over the remaining values.

For example, the five marks (18, 21, 22, 24, 25) have a mean of 110/5 = 22. If we trim (just for
illustration) the upper and lower 20%, then 18 and 25 are removed, and the remaining values give
(21 + 22 + 24)/3 ≈ 22.3.
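
The mean and the trimmed mean can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `trimmed_mean` and its `fraction` parameter are hypothetical, and the marks are the five values used in this chapter's examples.

```python
from statistics import mean

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end of the sorted data.

    Hypothetical helper; assumes 0 <= fraction < 0.5 so that some values remain.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # number of values dropped on each side
    kept = ordered[k:len(ordered) - k]
    return mean(kept)

marks = [22, 21, 18, 25, 24]
plain = mean(marks)                       # (22 + 21 + 18 + 25 + 24) / 5
trimmed = trimmed_mean(marks, 0.20)       # drops 18 and 25, averages 21, 22, 24
print(plain, round(trimmed, 1))
```

With only five values, a 5% trim removes nothing, which is why the illustration uses 20%.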


Measures of Center
The median
The median, like the mean, is a measure of center, but unlike the mean it is not strongly affected by outliers.

To obtain the median, we first need to re-arrange the data in ascending order (sorted), and then
find the middle value, namely,
◦ when n is odd, the median is the value in the middle (after sorting).
◦ when n is even, it is the mean of the two items nearest to the middle.

Example: 5 students' marks in the class are collected as (22, 21, 18, 25 and 24) to estimate the
average marks of the whole class.
• The sample size is n=5.
• The sample values are written as: x1=22, x2=21, x3=18, x4=25, x5=24
• The median of (18, 21, 22, 24, 25) is: 22
• If we add an outlier value (77) to the previous sample, then the median of (18, 21, 22, 24,
25, 77) is: (22 + 24)/2 = 23.
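
The odd/even rule can be written out directly. The sketch below uses a hypothetical function name `median_of` and the marks from the example; it also prints the mean with the outlier added, to show how differently the two measures react.

```python
def median_of(values):
    """Median by the rule above: middle value for odd n, mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

marks = [22, 21, 18, 25, 24]
with_outlier = marks + [77]

print(median_of(marks))                                 # odd n: the middle value
print(median_of(with_outlier))                          # even n: mean of the two middle values
print(round(sum(with_outlier) / len(with_outlier), 2))  # the mean moves far more than the median
```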


Measures of Center
Fractiles/ Percentiles
The median is one example of a fractile/percentile, a measure (value) that divides the data set
into two or more equal parts. Tertiles and Quartiles are other examples of fractile/percentile.

Quartiles divide the data into 4 equal parts at the values Q1, Q2, and Q3.
◦ Q1 (25th percentile) is the value such that 25% of the data fall below it.
◦ Q2 (50th percentile) is the median, so 50% of the data fall below it.
◦ Q3 (75th percentile) is the value such that 75% of the data fall below it.

How to obtain quartiles:

◦ Sort the values.
◦ Q2 (median) is the middle value.
◦ Q1 is the middle of the values less than Q2.
◦ Q3 is the middle of the values greater than Q2.

Requires 4 or more numbers.

[Figure: a number line from 0 to 1000 divided by Q1, Q2, and Q3 into four parts, each holding 25% of the data.]

Example: 5 students' marks are collected as (22, 21, 18, 25 and 24). Estimate Q1, Q2, and Q3 for
the whole class.
◦ Sorted values: 18, 21, 22, 24, 25
◦ Q1 = (18 + 21)/2 = 19.5
◦ Q2 = 22
◦ Q3 = (24 + 25)/2 = 24.5

Example: 4 students' marks are collected as (22, 21, 18 and 24).
◦ Sorted values: 18, 21, 22, 24
◦ Q1 = (18 + 21)/2 = 19.5
◦ Q2 = (21 + 22)/2 = 21.5
◦ Q3 = (22 + 24)/2 = 23

Example: 8 students' marks are collected as (22, 21, 18, 13, 11, 10, 8 and 9).
◦ Sorted values: 8, 9, 10, 11, 13, 18, 21, 22
◦ Q1 = (9 + 10)/2 = 9.5
◦ Q2 = (11 + 13)/2 = 12
◦ Q3 = (18 + 21)/2 = 19.5
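
The "median of each half" procedure used in these examples might be sketched as follows. The function name `quartiles` is hypothetical, and the convention (for odd n the median itself is left out of both halves) is inferred from the worked examples; other textbooks and libraries use different quartile conventions and can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule from the examples above."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 4:
        raise ValueError("need at least 4 values")
    half = n // 2
    lower = ordered[:half]         # values below the median position
    upper = ordered[n - half:]     # values above the median position
    return median(lower), median(ordered), median(upper)

print(quartiles([22, 21, 18, 25, 24]))
print(quartiles([22, 21, 18, 24]))
print(quartiles([22, 21, 18, 13, 11, 10, 8, 9]))
```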


Measures of Center
The mode
The mode is a measure of center that is the value that occurs most frequently.

The mode is usually used for non-numerical data; it is the only measure of center that can be
computed for qualitative data.

Example: [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 13, 11, 9 and 20].
◦ The number 9 occurs most frequently (exactly 5 times), so 9 is the mode.

Notes:
1. If no number in a set of numbers occurs more than once, that set has no mode.
2. A set of numbers with two modes is bimodal, a set of numbers with three modes is trimodal,
and any set of numbers with more than one mode is multimodal.

Example: [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 13, 11 and 20] (the previous list without its final 9).
◦ 9 and 14 each occur 4 times, so the modes are 9 and 14: bimodal.
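
A small counting sketch makes the notes concrete. The function name `modes` is hypothetical; it returns an empty list when no value repeats (note 1) and every tied value otherwise (note 2). The second data set is an illustrative variant of the first with its final 9 dropped, so that 9 and 14 tie.

```python
from collections import Counter

def modes(values):
    """Return the sorted list of modes; an empty list means the data has no mode."""
    counts = Counter(values)
    top = max(counts.values())     # assumes `values` is not empty
    if top == 1:
        return []                  # no value occurs more than once
    return sorted(v for v, c in counts.items() if c == top)

data = [10, 7, 14, 9, 9, 14, 18, 9, 11, 12, 16, 14, 9, 14, 13, 11, 9, 20]
variant = data[:16] + [20]         # same list without the final 9 (illustrative)

print(modes(data))                 # one mode
print(modes(variant))              # two modes: bimodal
print(modes([1, 2, 3]))            # no mode
```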


Measures of Variations
Measures of variations are used to quantify how the data is "spread out". This is
a useful way to identify if our data has many outliers lurking inside.

Variations (or Summary of Differences Within Groups)

◦ Range
◦ Interquartile Range (IQR)
◦ Standard Deviation
◦ Variance


Measures of Variations
The range
The range is a measure of the differences between the data boundaries. The range of a data set
is the largest value minus the smallest value.

One disadvantage of the range is that it is highly influenced by outliers. For that reason, other
measures of variation are used more often.

Class A--IQs of 13 Students

102 115 128 109 131 89 98 106 140 119 93 97 110
Class A Range = 140 - 89 = 51

Class B--IQs of 13 Students

127 162 131 103 96 111 80 109 93 87 120 105 109
Class B Range = 162 - 80= 82
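
As a quick check of the two ranges, and of how strongly a single extreme value drives them, here is a minimal sketch (the helper name `value_range` is an assumption):

```python
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

print(value_range(class_a))        # 140 - 89
print(value_range(class_b))        # 162 - 80

# Dropping only the single largest IQ in Class B shrinks its range considerably.
b_without_max = [x for x in class_b if x != max(class_b)]
print(value_range(b_without_max))  # 131 - 80
```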


Measures of Variations
The interquartile range (IQR) is the distance between Q1 (the 25th percentile) and Q3 (the
75th percentile).

Remember: a quartile, introduced under measures of center, is a value that marks one of the
divisions that break a sorted series of values into four equal parts.

Class A--IQs of 13 Students

89 93 97 98 102 106 109 110 115 119 128 131 140
Class A Interquartile Range = Q3 – Q1 = (119 + 128)/2 – (97 + 98)/2 = 123.5 – 97.5 = 26

Class B--IQs of 13 Students

80 87 93 96 103 105 109 109 111 120 127 131 162
Class B Interquartile Range = Q3 – Q1 = (120 + 127)/2 – (93 + 96)/2 = 123.5 – 94.5 = 29
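
The two IQR values can be reproduced with Python's standard library. This sketch relies on `statistics.quantiles` with its default "exclusive" method, which happens to agree with the hand calculation for these 13-value samples; for other sample sizes it may differ slightly from the median-of-halves rule, so treat the match as specific to this data.

```python
from statistics import quantiles

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def iqr(values):
    """Interquartile range Q3 - Q1, using the library's default quartile convention."""
    q1, _q2, q3 = quantiles(values, n=4)   # n=4 asks for the three quartile cut points
    return q3 - q1

print(iqr(class_a))
print(iqr(class_b))
```

Although the ranges of the two classes differ by 31 points, their IQRs differ by only 3, because the IQR ignores the extreme values at both ends.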


Measures of Variations
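The Variance (σ²)
The variance measures the spread (dispersion) of the recorded values of a variable.
Calculating variance starts with a “deviation.”
◦ A deviation is the distance of a case’s score from the mean.
◦ The variance is the average of the squared deviations: $\sigma^2 = \frac{\sum (x - \mu)^2}{n}$, where μ is the mean and n is the number of values.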

Meaning:
◦ The larger the variance, the further the individual cases are from the mean.
◦ The smaller the variance, the closer the individual scores are to the mean.


Measures of Variations
The Standard Deviation (σ)
The standard deviation (std), denoted by σ (population) or s (sample), measures how much data
values deviate from the arithmetic mean.
It is basically a way to see how spread out the data is. The standard deviation is the square root
of the variance; with μ the mean and n the number of values, the general formula is:

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$

Example: Find the standard deviation for {3, 5, 9, 10, 11, 12, 12, 14, 14}
Step 1: Sum = 3 + 5 + 9 + 10 + 11 + 12 + 12 + 14 + 14 = 90
Step 2: Mean = 90/9 = 10
Step 3: Sum of squared differences from the mean = 49 + 25 + 1 + 0 + 1 + 4 + 4 + 16 + 16 = 116
Step 4: Sum of squares / n = 116/9 ≈ 12.89 (the variance)
Step 5: std = √12.89 ≈ 3.6
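
The five steps translate almost line for line into Python. The sketch below divides by n, as the worked example does (the population form); the last lines cross-check against `statistics.pstdev` and show the sample form `statistics.stdev`, which divides by n − 1 and is therefore slightly larger.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [3, 5, 9, 10, 11, 12, 12, 14, 14]

n = len(data)
total = sum(data)                              # Step 1: sum of the values
mean = total / n                               # Step 2: mean
sum_sq = sum((x - mean) ** 2 for x in data)    # Step 3: sum of squared deviations
variance = sum_sq / n                          # Step 4: divide by n (population variance)
std = sqrt(variance)                           # Step 5: square root

print(total, mean, sum_sq, round(variance, 2), round(std, 1))
print(round(pstdev(data), 1))                  # library population std, same value
print(round(stdev(data), 2))                   # sample std (divides by n - 1)
```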


Measures of Variations
Notes about std:
1. The larger the std, the greater the amount of variation around the mean.
For example:
◦ [19, 25, 31, 13, 25, 37], Mean: 25, STD: 7.75
◦ [23, 25, 27, 26, 25, 24], Mean: 25, STD: 1.29

2. The std is 0 when all the values are identical. In that case the mean equals every value
in the set and there is no variation around the mean.

3. If you rescale (“normalize”) a variable by a constant factor, the std changes by the same factor.

4. Like the mean, the std will be inflated by an outlier case value.
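
Each of the four notes can be checked numerically. The sketch below uses the population form (`statistics.pstdev`), which matches the STD values quoted in note 1. It reads "normalize" in note 3 as multiplying every value by a constant, which is an interpretation rather than something the notes spell out, and the outlier value 100 is made up for illustration.

```python
from statistics import pstdev

wide = [19, 25, 31, 13, 25, 37]
narrow = [23, 25, 27, 26, 25, 24]

# Note 1: same mean (25), very different spread.
print(round(pstdev(wide), 2), round(pstdev(narrow), 2))

# Note 2: identical values give a std of exactly 0.
constant = [25] * 6
print(pstdev(constant))

# Note 3: multiplying every value by 10 multiplies the std by 10.
scaled = [10 * x for x in narrow]
print(round(pstdev(scaled) / pstdev(narrow), 6))

# Note 4: one extreme value (hypothetical) inflates the std.
with_outlier = narrow + [100]
print(pstdev(with_outlier) > pstdev(narrow))
```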


Measures of Relative Standing

Measures of relative standing are numbers showing the location of data values
relative to the other values within a data set.

Accordingly, these measures are used to:

◦ Compare values from different data sets, or to compare values within the same data set.
◦ Create normalized variables.

Most common relative standing measures:

◦ z-score.
◦ Percentiles and quartiles.
◦ Statistical graph called the boxplot.


Measures of Relative Standing

Z Scores
A z score is found by converting a value to a standardized scale; it represents the number of
standard deviations that a data value lies from the mean.
Note: z scores should be rounded to two decimal places.

$z = \frac{x - \mu}{\sigma}$, where x is the data value, μ is the mean, and σ is the standard deviation.

Example:
Scores on a test have a mean of 70 and a standard deviation of 11, while Ahmad has a score of 48.
Convert Ahmad's score to a z-score and discuss the meaning of the obtained results.

Solution:
$z = \frac{48 - 70}{11} = -2.00$

Ahmad has a z score of -2. This means that Ahmad's score of 48 is 2 standard deviations below the
mean.
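
The calculation in this example is a one-liner. A minimal sketch follows, with a hypothetical helper name `z_score` and the rounding to two decimal places taken from the note above:

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean, rounded to two decimals."""
    return round((x - mean) / std, 2)

z = z_score(48, 70, 11)
print(z)                       # negative: the score is below the mean
print(z_score(70, 70, 11))     # a score equal to the mean has z = 0
```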


Measures of Relative Standing

Percentiles

Percentiles partition data into groups and represent the measures of location for data values.
Percentiles divide a set of data into 100 groups with about 1% of the values in each group.

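To find the percentile of a data value x, use the formula:

$\text{Percentile of } x = \frac{\text{number of values less than } x}{\text{total number of values}} \times 100\%$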

Example:
Find the percentile for the data value 14, given the dataset {4 6 14 10 4 10 18 18 22 6 6 18 12 2 18}
Solution:
There are 9 data values less than 14 and a total of 15 data values.

$\text{Percentile of } 14 = \frac{9}{15} \times 100\% = 60\%$
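
The counting rule used in this example (values strictly less than the target, divided by the total count) can be sketched as below. The function name `percentile_of` is hypothetical, and other definitions of percentile rank treat ties differently.

```python
def percentile_of(value, data):
    """Percentage of data values that are strictly less than `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

data = [4, 6, 14, 10, 4, 10, 18, 18, 22, 6, 6, 18, 12, 2, 18]
print(percentile_of(14, data))   # 9 of the 15 values are below 14
```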


Measures of Relative Standing

Boxplots:
A boxplot is a graph of a data set that consists of a line extending from the minimum value to the
maximum value, and a box with lines drawn at the first quartile, Q1, the median, and the third
quartile, Q3.


Measures of Relative Standing

Example:
Given the following boxplot, what are Q1, Q2, and Q3?

[Figure: boxplot; the labels 0, 2, 6, 7, 9, 12, 13, 17, and 18 appear on the plot.]

◦ Q1: 7
◦ Q2: 9
◦ Q3: 13

Draw a boxplot for the following students' marks:

◦ Sorted values: 8, 9, 10, 11, 13, 18, 21, 22
◦ Q1 = 9.5, Q2 = 12, Q3 = 19.5

[Figure: boxplot marked at 8 (minimum), 9.5 (Q1), 12 (Q2), 19.5 (Q3), and 22 (maximum).]


Measures of Relative Standing

Modified Boxplots:
For purposes of constructing modified boxplots, we can consider outliers to be data
values meeting specific criteria.
In modified boxplots, a data value is an outlier if it is above Q3 by an amount greater
than 1.5 × IQR or below Q1 by an amount greater than 1.5 × IQR.
A modified boxplot is constructed with these specifications:
◦ A special symbol (such as an asterisk) is used to identify outliers.
◦ The solid horizontal line extends only as far as the minimum data value that is not an outlier and
the maximum data value that is not an outlier.
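
The 1.5 × IQR rule and the whisker limits can be computed before anything is drawn. The sketch below is one possible reading of the specification: the function name and the returned dictionary keys are hypothetical, quartiles are taken with the median-of-halves rule used earlier in the chapter, and the data are the five marks plus the outlier 77 from the median example.

```python
from statistics import median

def modified_boxplot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 x IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])            # middle of the lower half
    q3 = median(ordered[n - half:])        # middle of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),   # min and max values that are not outliers
        "outliers": outliers,
    }

marks = [22, 21, 18, 25, 24, 77]
stats = modified_boxplot_stats(marks)
print(stats)
```

Here Q1 = 21 and Q3 = 25, so the fences sit at 15 and 31; the value 77 lies beyond the upper fence and would be drawn as an asterisk, while the line extends only from 18 to 25.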


Case Study: Statistical Description on Yelp

Download and Save Data:
◦ https://www.kaggle.com/shwetakhanjanshroff/yelp-review

Import Libraries:
In [*]: #Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

Read the Data:

In [1]: #Read csv file
df = pd.read_csv("Data/yelp.csv")
df.head()


Case Study: Statistical Description on Yelp

Measures of Center – Mean
In [2]: df['stars'].mean()

In [3]: df['cool'].mean()

Measures of Center – Median

In [4]: df['useful'].median()

In [5]: df['funny'].median()

Measures of Center – Mode

In [6]: df['stars'].mode()

In [7]: df['user_id'].mode()


Case Study: Statistical Description on Yelp

Measures of Variations – Range
In [2]: df['stars'].max() - df['stars'].min()

In [3]: df['cool'].max() - df['cool'].min()

Measures of Variations – Interquartile Range

In [4]: df['stars'].quantile(0.75)-df['stars'].quantile(0.25)

In [5]: df['cool'].quantile(0.75)-df['cool'].quantile(0.25)

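Measures of Variations – Variance and STD

In [6]: df['stars'].std()   # sample standard deviation: pandas divides by n - 1 by default

In [7]: df['stars'].var()   # sample variance; pass ddof=0 for the population version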


Case Study: Statistical Description on Yelp

Relative Standing – Percentile
In [2]: df['stars'].rank(pct=True) * 100   # percentile rank of each value (ties share their average rank)

In [3]: df['cool'].rank(pct=True) * 100

Relative Standing– z-Score

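In [4]: M = df['stars'].mean()                      # mean of the column
        St = df['stars'].std()                      # sample standard deviation (ddof=1)
        zscore = (df['stars'].iloc[0] - M) / St     # z-score of the first review's rating
        print(round(zscore, 2))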

Relative Standing– Boxplot

In [5]: boxplot = df.boxplot(column=['stars', 'cool', 'funny'])


Case Study: Group-By Method

Using "group by" method we can:
• Split the data into groups based on some criteria
• Calculate statistics (or apply a function) to each group
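In [1]: import pandas as pd
        df = pd.read_csv("Data/yelp.csv")
        # Group the reviews by star rating
        df_rank = df.groupby(['stars'])
        # Calculate the mean of each numeric column within each group
        # (numeric_only=True skips text columns, which recent pandas versions refuse to average)
        df_rank.mean(numeric_only=True)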
