DOM503 Session 1
Categorical and Numerical
Data
Categorical data is data that is separated
into various groupings or categories for
display.
It is typically displayed in tables, bar charts, pie
charts, etc.
Numerical data consists of numbers that
have not been separated into categories.
Displays of numerical data include arrays,
frequency distributions, scatter plots, etc.
Both types of data can be displayed using
some types of tables such as Pivot Tables.
Grouped data mean example
Data: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class              Frequency   Relative Frequency   Percentage
10 but under 20        3             .15                15
20 but under 30        6             .30                30
30 but under 40        5             .25                25
40 but under 50        4             .20                20
50 but under 60        2             .10                10
Total                 20            1                  100
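Each class is represented by its midpoint (15, 25, 35, 45, 55), weighted by the class frequency:
Mean = (15*3 + 25*6 + 35*5 + 45*4 + 55*2)/20
= 660/20
= 33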
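The grouped-mean arithmetic can be checked with a short Python sketch. The variable names are illustrative only; the class limits, frequencies, and raw data are taken from the example. Comparing against the mean of the raw values shows that the grouped mean is an approximation.

```python
# Class limits (lower, upper) and frequencies from the grouped-data example.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [3, 6, 5, 4, 2]

# Each class is represented by its midpoint.
midpoints = [(lower + upper) / 2 for lower, upper in classes]
n = sum(frequencies)

# Grouped mean: frequency-weighted average of the midpoints.
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Mean of the raw (ungrouped) data, for comparison.
raw = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
       32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
raw_mean = sum(raw) / len(raw)

print(grouped_mean)  # 33.0
print(raw_mean)      # 32.4
```

The two results differ slightly because grouping replaces every observation with the midpoint of its class.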
Measures of Central
Tendency
Most sets of data show a tendency to group
around a central value; this is the ‘central
tendency’.
The most common measures of central
tendency are mean, median, and mode.
The mean, also called the arithmetic mean,
is the average of all values in the sample
space:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
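As a rough illustration, the three measures can be computed with Python's standard `statistics` module. This sketch reuses the 20 observations from the grouped-data example; the variable names are illustrative only.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

mean_value = statistics.mean(data)      # sum of the values divided by n
median_value = statistics.median(data)  # middle value (average of the two middle values when n is even)
modes = statistics.multimode(data)      # all most-frequent values

print(mean_value, median_value, modes)
```

`multimode` is used rather than `mode` because this data set has two values (24 and 27) that each occur twice.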
Distribution Shape and the
Boxplot
[Figure: three boxplots, each marking the quartiles Q1, Q2, and Q3.]
Variation and shape of data: Range
Measure of variation
Difference between the largest and the
smallest observations:
$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$
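A minimal sketch of the range calculation, reusing the 20 observations from the grouped-data example (the variable names are illustrative):

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

# Range = largest observation minus smallest observation.
data_range = max(data) - min(data)

print(data_range)  # 46
```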
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
where $\bar{X}$ is the sample mean.
Excel command: VAR.S
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where µ is the population mean.
Excel command: VAR.P
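The two variance formulas can be sketched in Python and compared with the standard library's `statistics.variance` (divides by n-1, like VAR.S) and `statistics.pvariance` (divides by N, like VAR.P). The five values below are illustrative, not data from the session.

```python
import statistics

x = [4, 6, 8, 10, 12]  # illustrative values only
n = len(x)
mean = sum(x) / n

# Sum of squared deviations from the mean.
ss = sum((xi - mean) ** 2 for xi in x)

sample_var = ss / (n - 1)  # sample variance, denominator n - 1
pop_var = ss / n           # population variance, denominator N

print(sample_var, statistics.variance(x))  # both equal 10
print(pop_var, statistics.pvariance(x))    # both equal 8
```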
Standard Deviation
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Excel command: STDEV.S
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Excel command: STDEV.P
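In Python, `statistics.stdev` and `statistics.pstdev` play roughly the same roles as STDEV.S and STDEV.P. The sketch below, on illustrative values, also confirms that each standard deviation is the square root of the corresponding variance.

```python
import math
import statistics

x = [4, 6, 8, 10, 12]  # illustrative values only

sample_sd = statistics.stdev(x)  # denominator n - 1
pop_sd = statistics.pstdev(x)    # denominator N

# Each standard deviation is the square root of the matching variance.
sample_sd_check = math.sqrt(statistics.variance(x))
pop_sd_check = math.sqrt(statistics.pvariance(x))

print(round(sample_sd, 4), round(pop_sd, 4))
```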
Why do we divide by (n-1) for
sample variance?
For a sample variance to be unbiased, the average
variance for all possible samples for a given
population has to be equal to the population
variance.
It can be shown mathematically that if the sample
variance is calculated using n instead of n-1, the
average variance over all possible samples is not
equal to the population variance; it is systematically smaller.
This is called Bessel’s correction, where we use
denominator (n-1) for calculating sample variance.
As the sample size n becomes larger, dividing by n
or by (n-1) gives nearly the same result.
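The unbiasedness argument can be checked by brute force on a tiny example. This sketch assumes a hypothetical four-value population and enumerates every ordered sample of size 2 drawn with replacement; the population, sample size, and helper names are all illustrative.

```python
from itertools import product

population = [2, 4, 6, 8]  # hypothetical tiny population
n = 2                      # sample size

def mean(values):
    return sum(values) / len(values)

def variance(values, denominator):
    """Sum of squared deviations from the mean, divided by the given denominator."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / denominator

pop_var = variance(population, len(population))

# Every ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average the sample variance over all possible samples.
avg_with_n_minus_1 = mean([variance(s, n - 1) for s in samples])
avg_with_n = mean([variance(s, n) for s in samples])

print(pop_var)             # 5.0
print(avg_with_n_minus_1)  # 5.0, matches the population variance
print(avg_with_n)          # 2.5, too small by a factor of (n - 1)/n
```

Dividing by n-1 recovers the population variance on average, while dividing by n underestimates it.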
Measuring skewness
Skewness is a measure of asymmetry in a
data distribution. One method of
calculating it is the adjusted Fisher-Pearson
coefficient, as follows: