MANG6513 2023 Lecture 3

Descriptive Statistics
MANG6513: Foundation of BAMS
Fangsheng Ge
[email protected]
Understand the features of a dataset
Given a dataset as depicted, how do we describe the data? Essentially, this
means we would like to extract the observable features/patterns of the data.

Age   Height   Sex
0     50       Boy
4     96       Boy
4     90       Girl
0     55       Girl
8     120      Boy
8     110      Boy
8     112      Girl
…     …        …

How?

Age   Height(boy)   Height(girl)
0     50            50
4     96            92
8     114           110
12    129           133
…     …             …

2 Top 50 in the world for Statistics and Operational Research (QS 2021 – 2019 )
Describe data
Given the data, how do we describe it? Essentially, this means we would like
to extract the observable features/patterns of the data.
• Univariate/bivariate case
e.g., What is the maximum/minimum/average height in the class?
How does height associate with weight/age/sex?
• Measurements/graphs
e.g., Can any statistics/maths be used to measure/quantify the feature?
Can we use graphs to help the illustration? -> We will cover this part next week

Measures of Centrality
• Centrality refers to the central location in the data
• Summarize the data into one number
• Key measures:
• Mean
• Median
• Mode
• Midrange

Mean
• Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of observations

• Sensitive to outliers
Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The mean values would be 4.6 and 7, respectively.

Median
• The median specifies the middle value when the data are arranged
from least to greatest.
• Half the data are below the median, and half the data are above it.
• For an odd number of observations, the median is the middle of the sorted
numbers.
• For an even number of observations, the median is the mean of the two
middle numbers.
• Not sensitive to outliers
Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The mean values would be 4.6 and 7, respectively.
The median values would be 5 for both.
Mode
• The mode is the observation that occurs most frequently
• You can easily identify the mode from a frequency distribution by
identifying the value or group having the largest frequency or from a
histogram by identifying the highest bar
Think: what about the two data sets: (1,3,5,6,8) and (1,3,5,6,20)?

Midrange
• The midrange is the average of the greatest and least values in the
data set
• Caution must be exercised when using the midrange because extreme
values easily distort the result
Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The mean values would be 4.6 and 7, respectively.
The median values would be 5 for both.
The midrange values would be 4.5 and 10.5, respectively.

• It provides a much rougher estimate than the mean and is often used
for only small sample sizes
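
The four measures of centrality can be checked in R on the two example data sets. This is a minimal sketch: `mean()` and `median()` are base R functions, while `midrange()` and `modes()` are hypothetical helpers written here for illustration, since base R has no built-in function for either.

```r
# Example data sets used on the slides
x1 <- c(1, 3, 5, 6, 8)
x2 <- c(1, 3, 5, 6, 20)

# Hypothetical helper: average of the smallest and largest value
midrange <- function(x) (min(x) + max(x)) / 2

# Hypothetical helper: every value tied for the highest frequency
modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(x1);     mean(x2)      # 4.6 and 7
median(x1);   median(x2)    # 5 and 5
midrange(x1); midrange(x2)  # 4.5 and 10.5
modes(x1)                   # each value occurs once, so all five values tie
```

Replacing 8 with 20 moves the mean and the midrange but leaves the median unchanged; neither data set has a single most frequent value.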
If we want to analyse the dataset of personal income of all students at uni,
which measure would you suggest using to identify the central location of the data?
1. Mean
✓ 2. Median
3. Mode

Measures of Dispersion
• Dispersion refers to the degree of variation in the data; the
numerical spread (or compactness) of the data. That is, how far do the
data deviate from the central location?

• Key measures:
• Range
• Interquartile range
• Variance
• Standard deviation

Range
• The range is the simplest measure: the difference between the maximum
value and the minimum value in the data set

• The range is affected by outliers, and is often used only for very small
data sets
Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The ranges would be 8 – 1 = 7 and 20 – 1 = 19, respectively.

Percentiles
• The kth percentile is a value at or below which at least k percent of
the observations lie. The most common way to compute the kth
percentile is to order the data values from smallest to largest and
calculate the rank of the kth percentile using the formula:

• If k = 50, we are looking for the point below which half of the
data lie; that is, we are calculating the median.

Quartiles
• Quartiles break the data into four parts.
• The 25th percentile is called the first quartile, Q1;
• the 50th percentile is called the second quartile, Q2;
• the 75th percentile is called the third quartile, Q3; and
• the 100th percentile is the fourth quartile, Q4.
• One-fourth of the data fall below the first quartile, one-half are below
the second quartile, and three-fourths are below the third quartile.

Interquartile Range
• The interquartile range (IQR), or the midspread, is the difference
between the third and first quartiles, Q3 – Q1.

• This includes only the middle 50% of the data and, therefore, is less
influenced by extreme values
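
Range, quartiles and IQR can be computed in base R as sketched below. One assumption to flag: `quantile()` supports several percentile definitions, and its default (`type = 7`) may not match the rank formula used in the lecture, so results on other data sets can differ slightly from a hand calculation.

```r
# Example data sets used on the slides
x1 <- c(1, 3, 5, 6, 8)
x2 <- c(1, 3, 5, 6, 20)

# Range as a single number; base R's range() returns c(min, max) instead
diff(range(x1)); diff(range(x2))           # 7 and 19

# Quartiles Q1, Q2, Q3 under quantile()'s default definition (type = 7)
quantile(x1, probs = c(0.25, 0.50, 0.75))  # 3, 5, 6
quantile(x2, probs = c(0.25, 0.50, 0.75))  # 3, 5, 6

# The 50th percentile coincides with the median
unname(quantile(x1, 0.50)) == median(x1)   # TRUE

# Interquartile range, Q3 - Q1
IQR(x1); IQR(x2)                           # 3 and 3
```

The outlier 20 more than doubles the range but leaves the IQR unchanged.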

Identifying Outliers
• There is no standard definition of what constitutes an outlier.
• A boxplot can be used for this; points beyond the following limits are flagged as potential outliers:
Max = Q3 + 1.5 ∗ IQR
Min = Q1 – 1.5 ∗ IQR

• Some typical rules of thumb:


• z-scores greater than +3 or less than -3
• Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3
• Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3
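
The rules of thumb can be applied to the data set (1,3,5,6,20) as follows. This is a sketch under the assumption that quartiles come from R's default `quantile()` definition; the variable names are illustrative only.

```r
x <- c(1, 3, 5, 6, 20)

q1  <- unname(quantile(x, 0.25))   # 3
q3  <- unname(quantile(x, 0.75))   # 6
iqr <- q3 - q1                     # 3

# Limits at 1.5*IQR (mild) and 3*IQR (extreme) beyond the quartiles
lower_inner <- q1 - 1.5 * iqr; upper_inner <- q3 + 1.5 * iqr
lower_outer <- q1 - 3 * iqr;   upper_outer <- q3 + 3 * iqr

extreme <- x[x < lower_outer | x > upper_outer]
mild    <- x[(x < lower_inner | x > upper_inner) & !(x %in% extreme)]

# z-score rule: flag observations more than 3 standard deviations from the mean
z    <- (x - mean(x)) / sd(x)
by_z <- x[abs(z) > 3]

extreme  # 20
mild     # none
by_z     # none
```

The IQR rule flags 20 as an extreme outlier, while the z-score rule does not flag it in this very small sample, which illustrates why there is no single standard definition.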
Variance and Standard Deviation
• The variance is the “average” of the squared deviations from the mean

• The standard deviation S is the square root of the variance.


Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The sample variances (dividing by n – 1) are 7.3 and 56.5, respectively.
The SDs are 2.7 and 7.5, approximately.
• The dimension/scale of the variance is the square of the dimension/scale
of the observations, whereas the dimension/scale of the standard
deviation is the same as the data.
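
The figures for the two data sets can be reproduced in R. The sketch below assumes the sample definitions, where the sum of squared deviations is divided by n – 1; this is what base R's `var()` and `sd()` compute. `sample_var()` is a hypothetical helper that spells out the calculation.

```r
# Example data sets used on the slides
x1 <- c(1, 3, 5, 6, 8)
x2 <- c(1, 3, 5, 6, 20)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
sample_var <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

sample_var(x1); sample_var(x2)  # 7.3 and 56.5
var(x1); var(x2)                # built-in var() uses the same n - 1 divisor
sd(x1); sd(x2)                  # square roots of the variances, about 2.70 and 7.52
```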
Variance and Standard Deviation
• For many data sets encountered in practice (those with a roughly bell-shaped distribution):
• Approximately 68% of the observations fall within one standard deviation of
the mean

• Approximately 95% fall within two standard deviations of the mean

• Approximately 99.7% fall within three standard deviations of the mean

• These rules are commonly used to characterize the natural variation
in manufacturing processes and other business phenomena.

Standardized Values
• A standardized value, commonly called a z-score, provides a normalised
measure of the distance an observation is from the mean, which is
independent of the units of measurement.
e.g., data with different scales, say (1,3,5,6,9) and (10,30,50,60,90), can we say they
have different or same dispersion?

• The z-score for the ith observation in a data set is calculated as follows:
$z_i = \frac{x_i - \bar{x}}{s}$, where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.

Standardized Values
• The numerator represents the distance that x_i is from the sample
mean; a negative value indicates that x_i lies to the left of the mean,
and a positive value indicates that it lies to the right of the mean.

• By dividing by the standard deviation, s, we scale the distance from
the mean to express it in units of standard deviations. Thus,
• a z-score of 1.0 means that the observation is one standard deviation to the
right of the mean;
• a z-score of -1.5 means that the observation is 1.5 standard deviations to the
left of the mean.
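
A short R sketch of standardization, using the two data sets with different scales, (1,3,5,6,9) and (10,30,50,60,90). `z_score()` is a hypothetical helper defined here; it uses the sample standard deviation from `sd()`.

```r
a <- c(1, 3, 5, 6, 9)
b <- c(10, 30, 50, 60, 90)

# Distance from the mean, expressed in units of standard deviations
z_score <- function(x) (x - mean(x)) / sd(x)

sd(a); sd(b)                       # raw spreads differ by a factor of 10
z_score(a)                         # negative = left of the mean, positive = right
all.equal(z_score(a), z_score(b))  # TRUE: identical once the units are removed
```

After standardization the two data sets are indistinguishable, so their relative dispersion is the same even though their standard deviations differ.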

Measures of Shape
Skewness
• Skewness describes the degree of asymmetry of data
• Coefficient of Skewness (CS):
• Distributions that tail off to the right are called positively skewed; those
that tail off to the left are said to be negatively skewed

[Figure: histograms of a positively skewed and a symmetrical distribution]

Kurtosis
• Kurtosis refers to the peakedness (i.e., high, narrow) or flatness (i.e.,
short, flat-topped) of a histogram.
• The coefficient of kurtosis (CK):

• CK measures the degree of kurtosis of a population
• CK < 3 indicates the data is somewhat flat
with a wide degree of dispersion.
• CK > 3 indicates the data is somewhat
peaked with less dispersion.
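
Base R has no built-in skewness or kurtosis function, so the sketch below defines two hypothetical helpers. They assume the common population (moment-based) definitions, the third and fourth central moments divided by the cube and fourth power of the population standard deviation, which is consistent with the reference value of 3 for CK; the lecture's exact formulas may differ.

```r
# Population standard deviation (divides by N, not N - 1)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Assumed definition: third central moment over sigma^3
coef_skewness <- function(x) mean((x - mean(x))^3) / pop_sd(x)^3

# Assumed definition: fourth central moment over sigma^4
coef_kurtosis <- function(x) mean((x - mean(x))^4) / pop_sd(x)^4

symmetric <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 3, 5, 6, 20)

coef_skewness(symmetric)    # 0: perfectly symmetrical
coef_skewness(right_tail)   # positive: tails off to the right
coef_kurtosis(symmetric)    # 1.7, below 3: relatively flat
```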
Shape and Measures of Location
• Comparing measures of location can sometimes reveal information
about the shape of the distribution of observations.
• For example:
• If the distribution were perfectly symmetrical and unimodal, the mean,
median, and mode would all be the same.
• If it were negatively skewed, we would generally find that mean < median < mode
• Positive skewness would suggest that mode < median < mean

Measures of Association
Covariance
• Covariance is a measure of the linear association between two
variables, X and Y.
• The covariance between X and Y is the average of the product of the
deviations of each pair of observations from their respective means.

Covariance = 263.37

Correlation
• Correlation is a normalised measure of the linear relationship
between two variables, X and Y, which does not depend on the units of
measurement.
• The correlation coefficient is scaled between -1 and 1.

Correlation = 0.56
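
Covariance and correlation are available in base R as `cov()` and `cor()`. The sketch below uses the age and boys' height figures from the summary table earlier in the lecture purely as illustrative data; it is not the data behind the values 263.37 and 0.56 quoted on the slides, which is not shown.

```r
# Illustrative data: age and average height of boys from the summary table
age    <- c(0, 4, 8, 12)
height <- c(50, 96, 114, 129)

cov(age, height)   # sample covariance (n - 1 divisor), depends on the units
cor(age, height)   # Pearson correlation, always between -1 and 1

# Correlation is the covariance rescaled by both standard deviations
all.equal(cor(age, height), cov(age, height) / (sd(age) * sd(height)))  # TRUE

# Rescaling height changes the covariance but not the correlation
cov(age, height / 100)
cor(age, height / 100)
```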

Examples of Correlation

Measures of Association
• Two variables have a strong statistical relationship with one another if
they appear to move together.

• When two variables appear to be related, you might suspect a
cause-and-effect relationship.

• However, statistical relationships may exist even though a change in
one variable is not caused by a change in the other.

Introduction to R
The R Environment
• R is an integrated suite of software facilities for data manipulation,
calculation and graphical display. Among other things it has
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data
analysis,
• graphical facilities for data analysis and display either directly at the computer
or on hardcopy,
• a well developed, simple and effective programming language (called ‘S’)
• It has been extended by a large collection of packages
• One of the sought-after business analytics skills as perceived by industry
• Online tutorial: https://education.rstudio.com/learn/beginner/
RStudio

[Figure: RStudio window with four panes: Script, Console, Data, and Files/plots/packages/help]

RStudio

Changing working directory

Getting help with functions and features

Exiting RStudio
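
These three tasks can also be done from the console. A minimal sketch using base R functions; `tempdir()` is only a stand-in for your own project folder.

```r
old_dir <- getwd()     # show (and remember) the current working directory
setwd(tempdir())       # change it; replace tempdir() with your own folder path
getwd()                # confirm the change
setwd(old_dir)         # switch back

?mean                  # open the help page for a function
help("read.csv")       # equivalent long form

# q()                  # exits the R session (commented out so the script keeps running)
```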

Reading data file
df <- read.table("mydata.csv", header = TRUE, sep = ",")
df <- read.csv("mydata.csv", header = TRUE)
df <- read.csv2("mydata.csv", header = TRUE)

sep = the separator symbol;

The header argument is set at TRUE if the first line of the file being read contains the header with
the variable names;

read.csv() treats comma as the separator symbol

read.csv2() treats semicolon as the separator symbol
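
The sketch below shows the three readers on a tiny file so that it runs without an existing `mydata.csv`. The file contents and temporary paths are made up for illustration; in practice you would pass the path to your own file.

```r
# Write a small comma-separated file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("Age,Height,Sex", "0,50,Boy", "4,96,Boy", "4,90,Girl"), path)

# read.table() needs the separator spelled out; read.csv() assumes a comma
df_a <- read.table(path, header = TRUE, sep = ",")
df_b <- read.csv(path, header = TRUE)

str(df_b)              # column names come from the header line
summary(df_b$Height)   # quick descriptive statistics for one column

# The same data with semicolons, as read.csv2() expects
# (read.csv2() also treats a comma as the decimal mark)
path2 <- tempfile(fileext = ".csv")
writeLines(c("Age;Height;Sex", "0;50;Boy", "4;96;Boy", "4;90;Girl"), path2)
df_c <- read.csv2(path2, header = TRUE)
```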

Online resource
LinkedIn Learning resource:
https://www.linkedin.com/learning/paths/master-r-for-data-science?u=35146660
