MANG6513 2023 Lecture 3
Statistics
Fangsheng Ge
MANG6513: Foundation of BAMS
[email protected]
Understand the features of a dataset
If given a dataset as depicted, how can we describe the data? Essentially, this
means we would like to extract the observable features/patterns of the data.
Age  Height  Sex
0    50      Boy
4    96      Boy
4    90      Girl
0    55      Girl
8    120     Boy
8    110     Boy
8    112     Girl
…    …       …

How?

Age  Height(boy)  Height(girl)
0    50           50
4    96           92
8    114          110
12   129          133
…    …            …
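One way the "How?" step might be carried out in R is to average the heights within each age/sex group. The sketch below uses only the seven rows visible on the slide (the elided rows are not included, so the numbers will not match the summary table exactly); the names `heights` and `summary_table` are assumptions made for illustration.

```r
# The seven rows visible on the slide (elided rows are omitted)
heights <- data.frame(
  Age    = c(0, 4, 4, 0, 8, 8, 8),
  Height = c(50, 96, 90, 55, 120, 110, 112),
  Sex    = c("Boy", "Boy", "Girl", "Girl", "Boy", "Boy", "Girl")
)

# Mean height for each Age/Sex combination:
# rows are ages, columns are sexes
summary_table <- tapply(heights$Height, list(heights$Age, heights$Sex), mean)
print(summary_table)
```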
2 Top 50 in the world for Statistics and Operational Research (QS 2021 – 2019 )
Describe data
If given the data, how can we describe it? Essentially, this means we would like
to extract the observable features/patterns of the data.
• Univariate/bivariate case
e.g., What is the maximum/minimum/average height in the class?
How is height associated with weight/age/sex?
• Measurements/graphs
e.g., Can any statistics/maths be used to measure/quantify the features?
Can we use graphs to help the illustration? -> We will cover this part next week
Measures of Centrality
• Centrality refers to the central location in the data
• Summarize the data into one number
• Key measures:
• Mean
• Median
• Mode
• Midrange
Mean
• Formula: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
• Sensitive to outliers
Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The mean values would be 4.6 and 7, respectively.
Median
• The median specifies the middle value when the data are arranged
from least to greatest.
• Half the data are below the median, and half the data are above it.
• For an odd number of observations, the median is the middle of the sorted
numbers.
• For an even number of observations, the median is the mean of the two
middle numbers.
• Not sensitive to outliers
Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The mean values would be 4.6 and 7, respectively.
The median values would be 5 for both.
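The comparison above can be checked in R with the built-in `mean()` and `median()` functions. This is a minimal sketch; the variable names `x` and `y` are chosen here for illustration.

```r
x <- c(1, 3, 5, 6, 8)
y <- c(1, 3, 5, 6, 20)   # same data, with the largest value replaced by an outlier

mean(x)     # 4.6
mean(y)     # 7
median(x)   # 5
median(y)   # 5
```

The single outlier moves the mean from 4.6 to 7, while the median stays at 5.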
Mode
• The mode is the observation that occurs most frequently
• You can easily identify the mode from a frequency distribution by
identifying the value or group having the largest frequency or from a
histogram by identifying the highest bar
Think: what about the two data sets: (1,3,5,6,8) and (1,3,5,6,20)?
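Base R has no built-in function for the statistical mode (the function `mode()` reports the storage type of an object instead). A possible sketch using a frequency table is given below; `find_modes` is a hypothetical helper, and the first data set is made up for illustration.

```r
# Return every value that attains the highest frequency
# (find_modes is a hypothetical helper, not a built-in R function)
find_modes <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

find_modes(c(1, 3, 3, 5, 6))   # 3
find_modes(c(1, 3, 5, 6, 8))   # every value occurs once, so all five are returned
```

When every value occurs exactly once, as in both data sets on the slide, there is no single most frequent value.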
Midrange
• The midrange is the average of the greatest and least values in the
data set
• Caution must be exercised when using the midrange because extreme
values easily distort the result
Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The mean values would be 4.6 and 7, respectively.
The median values would be 5 for both.
The midrange values would be 4.5 and 10.5, respectively.
• It provides a much rougher estimate than the mean and is often used
for only small sample sizes
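The midrange can be computed directly from `min()` and `max()`. In the sketch below, `midrange` is a hypothetical helper defined for illustration.

```r
# Average of the smallest and largest values
# (midrange is a hypothetical helper, not a built-in R function)
midrange <- function(v) (min(v) + max(v)) / 2

midrange(c(1, 3, 5, 6, 8))    # (1 + 8) / 2  = 4.5
midrange(c(1, 3, 5, 6, 20))   # (1 + 20) / 2 = 10.5
```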
If we want to analyse the dataset of personal income of all students at uni,
which measure would you suggest using to identify the central location of the data?
1. Mean
2. Median (✓)
3. Mode
Measures of Dispersion
• Key measures:
• Range
• Interquartile range
• Variance
• Standard deviation
Range
• The range is the simplest and is the difference between the maximum
value and the minimum value in the data set
• The range is affected by outliers, and is often used only for very small
data sets
Consider two data sets: (1,3,5,6,8) and (1,3,5,6,20).
The ranges would be 8 – 1 = 7 and 20 – 1 = 19, respectively.
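In R, `range()` returns the minimum and maximum rather than their difference, so the range as defined above can be obtained with `diff(range(...))`. A minimal sketch:

```r
x <- c(1, 3, 5, 6, 8)
y <- c(1, 3, 5, 6, 20)

range(x)         # minimum and maximum: 1 8
diff(range(x))   # maximum minus minimum: 7
diff(range(y))   # 19
```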
Percentiles
• The kth percentile is a value at or below which at least k percent of
the observations lie. The most common way to compute the kth
percentile is to order the data values from smallest to largest and
calculate the rank of the kth percentile using the formula:
• If k = 50, we want the point below which half of the data lie;
essentially, we are calculating the median.
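Percentiles can be computed in R with `quantile()`. Note that its default method (`type = 7`) interpolates between sorted values, so for other data it may give slightly different results from a rank-based formula; the sketch below only illustrates that the 50th percentile coincides with the median.

```r
x <- c(1, 3, 5, 6, 8)

quantile(x, probs = 0.50)            # 50th percentile
median(x)                            # the same value, 5
quantile(x, probs = c(0.10, 0.90))   # 10th and 90th percentiles
```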
Quartiles
• Quartiles break the data into four parts.
• The 25th percentile is called the first quartile, Q1;
• the 50th percentile is called the second quartile, Q2;
• the 75th percentile is called the third quartile, Q3; and
• the 100th percentile is the fourth quartile, Q4.
• One-fourth of the data fall below the first quartile, one-half are below
the second quartile, and three-fourths are below the third quartile.
Interquartile Range
• The interquartile range (IQR), or the midspread, is the difference
between the third and first quartiles, Q3 – Q1.
• This includes only the middle 50% of the data and, therefore, is less
influenced by extreme values
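The quartiles and the IQR can be sketched in R with `quantile()` and `IQR()`, both of which use R's default interpolation method (an assumption worth keeping in mind when comparing with hand calculations).

```r
x <- c(1, 3, 5, 6, 8)
y <- c(1, 3, 5, 6, 20)   # same data with an outlier

quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1 = 3, Q2 = 5, Q3 = 6
IQR(x)   # Q3 - Q1 = 3
IQR(y)   # also 3: the outlier does not change the middle 50%
```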
Identifying Outliers
• There is no standard definition of what constitutes an outlier.
• A boxplot can be used for this: observations beyond the following limits
are commonly flagged as potential outliers:
Max = Q3 + 1.5 ∗ IQR
Min = Q1 – 1.5 ∗ IQR
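The two limits can be computed by hand in R as a rough check. The sketch below applies them to one of the earlier example data sets; the variable names are assumptions for illustration, and the quartiles come from R's default `quantile()` method.

```r
y <- c(1, 3, 5, 6, 20)

q1  <- quantile(y, 0.25, names = FALSE)   # 3
q3  <- quantile(y, 0.75, names = FALSE)   # 6
iqr <- q3 - q1                            # 3

upper <- q3 + 1.5 * iqr   # 10.5
lower <- q1 - 1.5 * iqr   # -1.5

# Observations outside the limits are flagged as potential outliers
flagged <- y[y > upper | y < lower]
flagged   # 20
```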
Standardized Values
• A standardized value, commonly called a z-score, provides a normalised
measure of the distance an observation is from the mean, which is
independent of the units of measurement.
e.g., for data on different scales, say (1,3,5,6,9) and (10,30,50,60,90), can we say they
have the same or different dispersion?
Standardized Values
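• The z-score of the ith observation is z_i = \frac{x_i - \bar{x}}{s}, where
\bar{x} is the sample mean and s is the sample standard deviation.
• The numerator represents the distance that x_i is from the sample
mean; a negative value indicates that x_i lies to the left of the mean,
and a positive value indicates that it lies to the right of the mean.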
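The two example data sets on different scales can be compared in R. The sketch below assumes the usual z-score definition (deviation from the sample mean divided by the sample standard deviation, which is what `sd()` computes); the variable names are chosen for illustration.

```r
a <- c(1, 3, 5, 6, 9)
b <- c(10, 30, 50, 60, 90)   # the same data multiplied by 10

sd(a)
sd(b)   # ten times sd(a): the raw dispersion depends on the scale

# z-scores: distance from the mean in units of standard deviation
z_a <- (a - mean(a)) / sd(a)
z_b <- (b - mean(b)) / sd(b)

all.equal(z_a, z_b)   # TRUE: after standardizing, the two data sets coincide
```

The built-in `scale()` function performs the same centring and scaling on a vector or on the columns of a matrix.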
Measures of Shape
Skewness
• Skewness describes the degree of asymmetry of data
• Coefficient of Skewness (CS):
• Distributions that tail off to the right are called positively skewed; those
that tail off to the left are said to be negatively skewed
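Base R does not ship a skewness function, and several definitions of the coefficient are in use. The sketch below assumes one common moment-based version (third central moment divided by the cube of the population standard deviation), which may differ from the formula on the original slide; `coef_skewness` is a hypothetical helper.

```r
# Moment-based coefficient of skewness (an assumed definition):
# third central moment divided by (second central moment)^(3/2)
coef_skewness <- function(v) {
  d  <- v - mean(v)
  m2 <- mean(d^2)   # second central moment (population variance)
  m3 <- mean(d^3)   # third central moment
  m3 / m2^(3 / 2)
}

coef_skewness(c(1, 2, 3, 4, 5))    # 0: symmetric data
coef_skewness(c(1, 3, 5, 6, 20))   # positive: long tail to the right
coef_skewness(c(1, 3, 5, 6, 8))    # negative: slightly longer tail to the left
```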
Kurtosis
• Kurtosis refers to the peakedness (i.e., high, narrow) or flatness (i.e.,
short, flat-topped) of a histogram.
• The coefficient of kurtosis (CK):
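As with skewness, base R has no kurtosis function and more than one definition exists. The sketch below assumes the moment-based version (fourth central moment divided by the squared population variance), under which a normal distribution has a value of 3; it may differ from the formula on the original slide, and `coef_kurtosis` is a hypothetical helper.

```r
# Moment-based coefficient of kurtosis (an assumed definition):
# fourth central moment divided by (second central moment)^2
coef_kurtosis <- function(v) {
  d  <- v - mean(v)
  m2 <- mean(d^2)   # second central moment (population variance)
  m4 <- mean(d^4)   # fourth central moment
  m4 / m2^2
}

coef_kurtosis(c(1, 2, 3, 4, 5))    # 1.7: flat, evenly spread data
coef_kurtosis(c(1, 3, 5, 6, 20))   # larger: one value far out in the tail
```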
Measures of Association
Covariance
• Covariance is a measure of the linear association between two
variables, X and Y.
• The covariance between X and Y is the average of the product of the
deviations of each pair of observations from their respective means.
Covariance = 263.37
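The definition can be sketched in R as follows. The data behind the value quoted on the slide are not available, so the example reuses the age and height rows visible in the first table of this lecture. Note that R's `cov()` computes the sample covariance, dividing by n - 1 rather than by n.

```r
age    <- c(0, 4, 4, 0, 8, 8, 8)
height <- c(50, 96, 90, 55, 120, 110, 112)
n      <- length(age)

# Product of the deviations of each pair from their respective means
dev_products <- (age - mean(age)) * (height - mean(height))

cov_population <- sum(dev_products) / n         # plain average of the products
cov_sample     <- sum(dev_products) / (n - 1)   # what cov() computes

all.equal(cov_sample, cov(age, height))   # TRUE
```

A positive value indicates that taller heights tend to go with higher ages in these rows.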
Correlation
• Correlation is a normalised measure of the linear relationship
between two variables, X and Y, which does not depend on the units of
measurement.
• The correlation coefficient is scaled between -1 and 1.
Correlation = 0.56
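The unit-independence of correlation can be illustrated in R with `cor()`. The data behind the value quoted on the slide are not available, so this sketch reuses the age and height rows visible in the first table of this lecture and assumes, for illustration only, that heights are recorded in centimetres.

```r
age       <- c(0, 4, 4, 0, 8, 8, 8)
height_cm <- c(50, 96, 90, 55, 120, 110, 112)
height_m  <- height_cm / 100   # same heights in different units

cov(age, height_cm)   # covariance depends on the units ...
cov(age, height_m)    # ... and is 100 times smaller here

r_cm <- cor(age, height_cm)
r_m  <- cor(age, height_m)
all.equal(r_cm, r_m)   # TRUE: correlation is unchanged

# Correlation is the covariance divided by both standard deviations
all.equal(r_cm, cov(age, height_cm) / (sd(age) * sd(height_cm)))   # TRUE
```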
Examples of Correlation
Measures of Association
• Two variables have a strong statistical relationship with one another if
they appear to move together.
Introduction to R
The R Environment
• R is an integrated suite of software facilities for data manipulation,
calculation and graphical display. Among other things it has
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data
analysis,
• graphical facilities for data analysis and display either directly at the computer
or on hardcopy,
• a well-developed, simple and effective programming language (a dialect of the ‘S’ language)
• It has been extended by a large collection of packages
• One of the sought-after business analytics skills, as perceived by industry
• Online tutorial: https://education.rstudio.com/learn/beginner/
RStudio
[Screenshot: RStudio window with the Data, Script and Console panes labelled]
RStudio
Exiting RStudio
Reading data file
df <- read.table("mydata.csv", header = TRUE, sep = ",")   # general reader; separator given explicitly
df <- read.csv("mydata.csv", header = TRUE)    # comma-separated values, "." as decimal mark
df <- read.csv2("mydata.csv", header = TRUE)   # semicolon-separated values, "," as decimal mark
Set the header argument to TRUE if the first line of the file being read contains the
variable names.
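To try the reading step without an existing file, a small CSV file can be written to a temporary location first. The file contents below are made up for illustration and reuse a few rows from the first table of this lecture.

```r
# Write a small example file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("Age,Height,Sex",
             "0,50,Boy",
             "4,90,Girl",
             "8,112,Girl"), path)

# Read it back; the first line supplies the variable names
df <- read.csv(path, header = TRUE)

str(df)           # 3 observations of 3 variables
mean(df$Height)   # 84
```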
Online resource
LinkedIn Learning resource:
https://www.linkedin.com/learning/paths/master-r-for-data-science?u=35146660