DOM503 Session 1

The document discusses the differences between categorical and numerical data, including their displays and measures of central tendency such as mean, median, and mode. It also covers concepts like percentiles, quartiles, variance, and standard deviation, along with methods for measuring skewness and kurtosis. Additionally, it addresses common errors in data visualization and emphasizes the importance of ethical considerations in data analysis and interpretation.

Uploaded by

Mihir Ritti
Copyright
© © All Rights Reserved

DOM503 2021

Session 1
Categorical and Numerical
Data
Categorical data is data that is separated
into various groupings or categories for
display.
Takes form of tables, bar charts, pie
charts, etc.
Numerical data consists of numbers that
have not been separated into categories.
Displays of numerical data include arrays,
frequency distributions, scatter plots, etc.
Both types of data can be displayed using
some types of tables such as Pivot Tables.
Grouped data mean example
Data: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class              Frequency   Relative Frequency   Percentage
10 but under 20        3              .15               15
20 but under 30        6              .30               30
30 but under 40        5              .25               25
40 but under 50        4              .20               20
50 but under 60        2              .10               10
Total                 20             1                 100

Mean = (15*3+25*6+35*5+45*4+55*2)/20
= 33
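The grouped mean above weights each class midpoint (15, 25, ...) by its class frequency. A minimal Python sketch of that calculation follows; the function names `class_frequencies` and `grouped_mean` are illustrative and not part of the slides.

```python
from collections import Counter

# Raw data from the slide
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

def class_frequencies(values, start=10, width=10, n_classes=5):
    """Count values in classes [start, start + width), [start + width, ...), etc."""
    counts = Counter((v - start) // width for v in values)
    return [(start + k * width, start + (k + 1) * width, counts.get(k, 0))
            for k in range(n_classes)]

def grouped_mean(classes):
    """Approximate the mean using class midpoints weighted by frequency."""
    total = sum(freq for _, _, freq in classes)
    return sum((lo + hi) / 2 * freq for lo, hi, freq in classes) / total

classes = class_frequencies(data)
for lo, hi, freq in classes:
    print(f"{lo} but under {hi}: {freq}")
print("grouped mean:", grouped_mean(classes))
print("raw-data mean:", sum(data) / len(data))
```

The grouped mean is an approximation: it generally differs a little from the mean of the raw values because every observation in a class is treated as if it sat at the class midpoint.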
Measures of Central
Tendency
Most sets of data show a tendency to group
around a central value, this is the ‘central
tendency’.
The most common measures of central
tendency are mean, median, and mode.
The mean, also called the arithmetic mean,
is the average of all values in the sample.
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
n is the size of the sample. Excel command:


Mode
A measure of central tendency
It is the value that occurs most often in the
sample.
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode if all values have the
same frequency
There may be several modes if more than one
value is tied for the highest frequency.
Excel command: MODE
Note: Not suitable for small data sets
Median
Robust measure of central tendency
Not affected by extreme values
In an ordered array, the median is
the “middle” number
Median = the value at rank (n+1)/2 in the ordered array.
If n is odd, the median is the middle
number.
If n is even, the median is the
average of the two middle numbers.
EXCEL command: MEDIAN
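The three measures can be compared on a small sample. The sketch below uses Python's standard `statistics` module; the data values are made up for illustration and include one extreme value to show how the mean reacts while the median and mode barely move.

```python
import statistics

# Hypothetical sample with one extreme value (40)
values = [2, 3, 3, 5, 7, 10, 40]
trimmed = values[:-1]          # same sample without the extreme value

print("mean:  ", statistics.mean(values))        # pulled up by the extreme value
print("median:", statistics.median(values))      # middle of the ordered array
print("modes: ", statistics.multimode(values))   # list, since ties are possible

# Odd n: the single middle value. Even n: average of the two middle values.
print("median without 40:", statistics.median(trimmed))
print("mean without 40:  ", statistics.mean(trimmed))
```

`statistics.multimode` returns every value tied for the highest frequency, which matches the slide's point that a sample can have several modes.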
Percentile
To find the xth percentile, we use the same
method as for quartiles.
List the data in ascending order.
xth percentile = value at rank (n+1)x/100,
where n is the number of data points.
If the rank is fractional, interpolate
linearly between the two neighbouring data
points. E.g., the 80th percentile of 30 data
points is at rank 31*0.8 = 24.8, so the value is
24th data point * 0.2 + 25th data point * 0.8.
EXCEL command for raw data:
PERCENTILE.EXC
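The rank rule and the interpolation step can be sketched in Python as follows. The function name `percentile_exc` and the 30-point data set (the integers 1 to 30) are assumptions chosen only to mirror the slide's example.

```python
def percentile_exc(values, x):
    """x-th percentile using rank = (n + 1) * x / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) * x / 100
    if rank < 1 or rank > n:
        raise ValueError("percentile is outside the range this method supports")
    lower = int(rank)        # whole part of the 1-based rank
    frac = rank - lower      # fractional part of the rank
    if frac == 0:
        return ordered[lower - 1]
    # Weight the two neighbouring data points by the fractional part
    return ordered[lower - 1] * (1 - frac) + ordered[lower] * frac

# Hypothetical data: 30 points, so the 80th percentile sits at rank 24.8
points = list(range(1, 31))
print(percentile_exc(points, 80))   # 24th point * 0.2 + 25th point * 0.8
```

Very low or very high percentiles give a rank below 1 or above n; this sketch raises an error in that case rather than extrapolating.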
Quartiles
Quartiles split data into 4 parts.
1st Quartile splits the lowest 25% of the values from
the rest.
(25th percentile)
3rd Quartile splits the lowest 75% of the values from
the rest.
(75th percentile)
Q2 is the median.
Interquartile range: Q3 – Q1 is a measure of how
the data points are distributed around the central
value Q2
EXCEL command: We can use either
PERCENTILE.EXC or QUARTILE.EXC. The older
QUARTILE function uses the inclusive method
and can return different values.
Interquartile Range
Measure of variation
Also known as midspread
Spread in the middle 50%
Difference between the first and third
quartiles
Not affected by extreme values
Data in Ordered Array: 11 12 13 16 16 17 17 18 21

Interquartile Range = Q3 – Q1 = 17.5 – 12.5 = 5
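The quartiles in this example can be reproduced with Python's standard library. This is a sketch that assumes the (n+1)-rank method from the Percentile slide, which corresponds to `method="exclusive"` in `statistics.quantiles`.

```python
import statistics

# Ordered array from the slide
ordered = [11, 12, 13, 16, 16, 17, 17, 18, 21]

# n=4 asks for the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3)
print("Interquartile range:", iqr)
```

With 9 data points, Q1 sits at rank 2.5 (halfway between 12 and 13) and Q3 at rank 7.5 (halfway between 17 and 18).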


5-number summary and the
Box Plot
The 5 numbers: smallest X, Q1, Q2, Q3,
largest X
Boxplot
Graphical display of data using 5-number
summary

[Figure: box plot marking X smallest, Q1, Median (Q2), Q3 and X largest on an axis running from 4 to 12]
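A 5-number summary can be computed directly before drawing the box plot. The sketch below reuses the ordered array from the Interquartile Range slide; the helper name `five_number_summary` is illustrative, and the quartiles assume the (n+1)-rank method.

```python
import statistics

def five_number_summary(values):
    """Return (smallest X, Q1, Q2, Q3, largest X)."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

summary = five_number_summary([11, 12, 13, 16, 16, 17, 17, 18, 21])
print(summary)
```

The box of the plot spans Q1 to Q3, the line inside it marks Q2, and the whiskers reach out to the smallest and largest values.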
Distribution Shape and the
Boxplot

[Figure: three box plots labelled Left-Skewed, Symmetric and Right-Skewed, each marking Q1, Q2 and Q3]
Variation and shape of data:
Range
Measure of variation
Difference between the largest and the smallest observations:
$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$
Ignores the way in which data are distributed
Does not consider how the values cluster between extremes.
Variance
Important measure of variation
Shows variation about the mean
Is the average of the square of the difference
between a data point and the mean
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
where $\bar{X}$ is the sample mean.
Excel command: VAR.S
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where µ is the population mean.
Excel command: VAR.P
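The two variance formulas differ only in the denominator. The following sketch computes both by hand and checks them against Python's `statistics` module; the data values are hypothetical.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
n = len(values)
mean = sum(values) / n

# Sum of squared differences between each data point and the mean
sum_sq = sum((x - mean) ** 2 for x in values)

sample_var = sum_sq / (n - 1)   # treats the data as a sample (like VAR.S)
pop_var = sum_sq / n            # treats the data as the whole population (like VAR.P)

print("sample variance:    ", sample_var, statistics.variance(values))
print("population variance:", pop_var, statistics.pvariance(values))
```

For the same data, the sample variance is always at least as large as the population variance because it divides by the smaller number.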
Standard Deviation

Most important measure of variation


Shows variation about the mean
Has the same units as the original
data
Is the square root of the variance
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Excel command STDEV.S
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Excel command STDEV.P
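The point about units can be illustrated by changing the unit of measurement. In this sketch the data are hypothetical lengths in metres, converted to centimetres: the standard deviation scales by 100 like the data, while the variance scales by 100 squared.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical lengths in metres
in_cm = [x * 100 for x in values]     # the same lengths in centimetres

sd_m = statistics.stdev(values)       # sample standard deviation, in metres
sd_cm = statistics.stdev(in_cm)       # sample standard deviation, in centimetres
var_m = statistics.variance(values)   # sample variance, in square metres
var_cm = statistics.variance(in_cm)   # sample variance, in square centimetres

print("SD:      ", sd_m, "m ->", sd_cm, "cm")
print("Variance:", var_m, "m^2 ->", var_cm, "cm^2")
print("SD equals sqrt(variance):", math.isclose(sd_m, math.sqrt(var_m)))
```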
Why do we divide by (n-1) for
sample variance?
For a sample variance to be unbiased, the average
variance over all possible samples from a given
population has to be equal to the population
variance.
It can be shown mathematically that if the sample
variance is calculated using n instead of n-1, the
average variance of all possible samples is smaller
than the population variance.
Using the denominator (n-1) when calculating sample
variance is called Bessel’s correction.
As the sample size n becomes large, dividing by n
or by (n-1) gives nearly the same result.
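The unbiasedness argument can be checked by brute force on a tiny population. The sketch below is a demonstration under stated assumptions: the population is hypothetical, and samples of size 2 are drawn with replacement, so every ordered pair is one equally likely sample.

```python
import itertools
import statistics

population = [2, 4, 6, 8]                       # hypothetical population
pop_var = statistics.pvariance(population)      # divides by N

# Every possible sample of size 2, drawn with replacement (16 ordered pairs)
samples = list(itertools.product(population, repeat=2))

# Average sample variance with denominator n-1, and with denominator n
avg_with_n_minus_1 = sum(statistics.variance(s) for s in samples) / len(samples)
avg_with_n = sum(statistics.pvariance(s) for s in samples) / len(samples)

print("population variance:       ", pop_var)
print("average using n-1 (Bessel):", avg_with_n_minus_1)
print("average using n:           ", avg_with_n)
```

Dividing by n-1 recovers the population variance on average, while dividing by n underestimates it.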
Measuring skewness
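Skewness is the measure of asymmetry in a
data distribution. One method of
calculating it is the adjusted Fisher-Pearson
coefficient, as follows:
$$G = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{S} \right)^3$$
where S is the sample standard deviation.
A symmetrical distribution like a normal
distribution will have G = 0. A negative value
indicates left-skewed data; a positive value
indicates right-skewed data.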
Presence of extreme outliers can distort
value of G, giving erroneous results.
Excel command: SKEW
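A minimal sketch of the adjusted Fisher-Pearson coefficient in Python follows. It assumes the standard definition, with the sample standard deviation (denominator n-1) in the standardisation; the function name and the data sets are illustrative.

```python
import math

def adjusted_skewness(values):
    """Adjusted Fisher-Pearson skewness coefficient G (needs n >= 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 data points")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are equal, skewness is undefined")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]               # long tail to the right
left_skewed = [-x for x in right_skewed]      # mirror image: long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, adjusted_skewness(data))
```

The single value 20 drives most of the result in the skewed examples, which illustrates how one extreme outlier can dominate G.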
Measuring kurtosis
Kurtosis is a measure of how ‘heavy’ the tails
of a data set are, i.e., how many outliers are
present, relative to a normal distribution.

On this scale (excess kurtosis), a normal
distribution has kurtosis = 0. Higher
kurtosis indicates heavier tails and more outliers;
lower kurtosis indicates lighter tails and fewer outliers.
Like skewness, extreme outliers can distort
kurtosis values.
Excel command: KURT
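The slide gives no formula for kurtosis, so the sketch below assumes the bias-adjusted sample excess kurtosis, which is the definition Excel documents for KURT. The function name and the data sets are illustrative.

```python
import math

def excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (needs n >= 4); 0 for a normal distribution."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 data points")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are equal, kurtosis is undefined")
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

evenly_spread = [1, 2, 3, 4, 5]              # no outliers
with_outlier = [10] * 9 + [100]              # one extreme value

print("evenly spread:", excess_kurtosis(evenly_spread))
print("with outlier: ", excess_kurtosis(with_outlier))
```

The evenly spread data give a negative value (lighter tails than a normal distribution), while the single extreme value pushes the second result well above zero.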
Errors in visualizing data
Using “chart junk”: visual effects that distort
or distract from the data to be presented, e.g.,
garish graphics, irrelevant visuals.
Failing to provide a relative basis when
comparing data between groups. For
example, two separate pie charts showing the
operations of two companies do not help if
we’re trying to compare the two.
Compressing the vertical axis – using an axis
going up to 100 when the highest value is 30.
Providing no zero point on the vertical axis
Pitfalls and Ethical Considerations
Data analysis is objective
Should report the summary measures that best
meet the assumptions about the data set
Data interpretation is subjective
Should be done in a fair, neutral and clear manner

Numerical descriptive measures:


Should document both good and bad results
Should be presented in a fair, objective and
neutral manner
Should not use inappropriate summary measures
to distort facts
