Data Distribution
Measures of Data Distribution
Prepared By:
Deeman Yousif Mahmood
PhD Student
Data types
Type Example
V. Ordinal
VI. Others
Categorical vs Boolean
• Categorical is essentially several Booleans that are
grouped by some logic
• Example
– Feature (color): Green, Blue, Red
vs
– Feature (isGreen): Yes/No
– Feature (isBlue): Yes/No
– Feature (isRed): Yes/No
• Memory considerations
– Categorical vs Boolean (Male/Female or 0/1)
– Boolean can be sparse
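The grouping of one categorical feature into several Boolean features can be sketched as below. This is a minimal illustration, not a definitive encoder: the sample values and the helper name `one_hot` are assumptions made for the example.

```python
# Hypothetical sample of the categorical feature "color".
colors = ["Green", "Blue", "Red", "Blue", "Green"]

# The distinct categories, in a fixed (sorted) order.
categories = sorted(set(colors))


def one_hot(value, categories):
    """Turn one categorical value into one Boolean per category."""
    return {f"is{c}": value == c for c in categories}


encoded = [one_hot(c, categories) for c in colors]

for original, row in zip(colors, encoded):
    print(original, row)
```

Each encoded row has exactly one True value, so most entries are False. That is the sense in which the Boolean form can be sparse.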
Data Distribution Measures
Distribution measures 1: Mean, Median, Mode
• Mode
Good for nominal variables
Quick and easy
• Median
Robust central tendency statistic
• Less sensitive to outliers and extreme values
Good for “bad” distributions (e.g. heavily skewed or containing outliers)
• Mean
Most commonly used statistic for central tendency
• Generally preferred except for “bad” distributions
Based on all data in the distribution
Used for inference as well as description
• Best estimator of the population parameter (the population mean)
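The sensitivity of the mean to extreme values, compared with the median and mode, can be seen in a small sketch. The numbers are made up for illustration; only Python's standard `statistics` module is used.

```python
import statistics

# Hypothetical values with one extreme outlier (1000).
values = [30, 32, 35, 35, 38, 40, 1000]
without_outlier = values[:-1]

mean_all = statistics.mean(values)
median_all = statistics.median(values)
mode_all = statistics.mode(values)

mean_clean = statistics.mean(without_outlier)
median_clean = statistics.median(without_outlier)

print("with outlier:    mean =", round(mean_all, 2), "median =", median_all, "mode =", mode_all)
print("without outlier: mean =", mean_clean, "median =", median_clean)
```

The single outlier pulls the mean far away from the bulk of the data, while the median and mode stay where they were.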
Distribution measures 2: Skewness & kurtosis
• Skewness (tails)
– Skewness is a measure of the asymmetry of the probability distribution
• Kurtosis (shoulders, heavy tail)
– Kurtosis is the degree of peakedness of a distribution relative to a normal distribution
Excess kurtosis
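Both measures can be computed from standardized moments. The sketch below assumes the population (biased) moment definitions and the usual convention that excess kurtosis is kurtosis minus 3, so that a normal distribution scores 0. The function names and sample data are illustrative only.

```python
import statistics


def skewness(data):
    """Third standardized moment (population form): 0 for symmetric data."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)


def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (assumed convention: normal = 0)."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 4 for x in data) / len(data) - 3


symmetric = [1, 2, 3, 4, 5]        # no asymmetry
right_skewed = [1, 1, 1, 2, 10]    # long tail to the right

print("symmetric:    skew =", skewness(symmetric), "excess kurtosis =", excess_kurtosis(symmetric))
print("right-skewed: skew =", skewness(right_skewed), "excess kurtosis =", excess_kurtosis(right_skewed))
```

A positive skewness indicates a longer right tail; a negative excess kurtosis indicates a flatter shape with lighter tails than a normal distribution.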
– Bernoulli Distribution
• Represents success/failure (e.g. accuracy of
prediction)
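If n is large, then:
– $\mathcal{N}(np,\, np(1-p))$ is a good approximation for $B(n, p)$, the number of successes in $n$ independent Bernoulli trials with success probability $p$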
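The large-n approximation mentioned above is assumed here to be the usual normal approximation to the number of successes in n Bernoulli trials. The sketch compares the exact binomial CDF with a normal CDF of mean np and variance np(1-p), using a continuity correction. The values n = 200 and p = 0.8 (for example, a classifier with 80% accuracy) are hypothetical.

```python
from math import comb
from statistics import NormalDist

# Hypothetical setting: 200 predictions, each correct with probability 0.8.
n, p = 200, 0.8

# Exact binomial probabilities for 0..n successes.
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Normal distribution with the same mean and variance.
approx = NormalDist(mu=n * p, sigma=(n * p * (1 - p)) ** 0.5)

# Largest gap between the exact CDF and the approximation
# (k + 0.5 is the continuity correction for a discrete variable).
cumulative = 0.0
max_gap = 0.0
for k, prob in enumerate(pmf):
    cumulative += prob
    max_gap = max(max_gap, abs(cumulative - approx.cdf(k + 0.5)))

print("largest CDF gap:", max_gap)
```

The gap shrinks as n grows and is larger when p is close to 0 or 1, where the count of successes is more skewed.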
• Q-Q plot:
– Compare distributions based on quantiles
– Anderson–Darling test
Q-Q plot
• A plot of the quantiles of the first data set against the quantiles of the second data
set
• Data set sizes don’t have to be equal
• The greater the departure from the 45-degree reference line, the greater the
evidence for the conclusion that the two data sets have come from populations
with different distributions
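The points of a Q-Q plot can be computed without any plotting library. This sketch pairs the deciles of two data sets of different sizes and measures how far the pairs fall from the 45-degree line. The data sets and the helper name `qq_deviation` are made up for illustration.

```python
import statistics


def qq_deviation(first, second, n_quantiles=10):
    """Pair matching quantiles of two data sets and return the pairs
    plus the largest distance from the line y = x."""
    q1 = statistics.quantiles(first, n=n_quantiles)
    q2 = statistics.quantiles(second, n=n_quantiles)
    pairs = list(zip(q1, q2))
    return pairs, max(abs(x - y) for x, y in pairs)


a = list(range(1, 21))                       # 20 values spread evenly over 1..20
b = [1 + 19 * i / 49 for i in range(50)]     # 50 values spread evenly over 1..20
c = [x * x / 20 for x in a]                  # same range, but skewed toward small values

pairs_ab, dev_ab = qq_deviation(a, b)
pairs_ac, dev_ac = qq_deviation(a, c)

print("a vs b: max deviation from the 45-degree line =", round(dev_ab, 3))
print("a vs c: max deviation from the 45-degree line =", round(dev_ac, 3))
```

The pairs for a and b stay close to the reference line even though the sample sizes differ, while the pairs for a and c drift well away from it.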
Kolmogorov–Smirnov test
• A non-parametric test for the equality of continuous, one-dimensional
probability distributions
• Can be applied to test a dataset distribution against a known distribution OR
against another dataset distribution
H0: The data follow a specified distribution
H1: The data do not follow the specified distribution
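The test statistic is the largest vertical gap between the empirical CDF of the data and the CDF of the specified distribution. The sketch below computes that statistic for the one-sample case against a standard normal. The samples are constructed deterministically for illustration, and the threshold 1.36/sqrt(n) is the commonly quoted approximate 5% critical value for large n, used here as an assumption rather than an exact p-value.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """One-sample K-S statistic: largest gap between the empirical CDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


n = 100
std_normal = NormalDist(0, 1)

# Illustrative samples: evenly spaced normal quantiles, and evenly spaced points in (0, 1).
normal_like = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
uniform_like = [(i + 0.5) / n for i in range(n)]

d_normal = ks_statistic(normal_like, std_normal.cdf)
d_uniform = ks_statistic(uniform_like, std_normal.cdf)
critical = 1.36 / n**0.5  # approximate 5% critical value (assumption, large n)

print("normal-like sample:  D =", round(d_normal, 4), "reject H0:", d_normal > critical)
print("uniform-like sample: D =", round(d_uniform, 4), "reject H0:", d_uniform > critical)
```

A statistic above the critical value is evidence against H0, that is, against the data following the specified distribution.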
Goal | Parametric | Non-parametric | Categorical (binary) data
Compare two sets of independently-collected (unpaired) data | Unpaired t-test | Mann-Whitney test | χ² test or Fisher test
Compare two sets of data from the same subjects under different circumstances (paired) | Paired t-test | Wilcoxon test | McNemar’s test
Look for a relationship between two variables | Pearson correlation coefficient | Spearman correlation coefficient | Contingency correlation coefficients
Look for a linear relationship between two variables | Linear regression | Nonparametric linear regression | Simple logistic regression
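As one illustration of the parametric versus non-parametric choice, the sketch below computes the Pearson and Spearman correlation coefficients by hand. Spearman is taken as Pearson applied to ranks, which is valid here because the made-up data contain no ties. The function names are illustrative.

```python
def pearson(x, y):
    """Pearson correlation coefficient: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y grows with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v**3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)

print("Pearson: ", round(r_pearson, 3))
print("Spearman:", round(r_spearman, 3))
```

Spearman reaches 1 for any strictly increasing relationship, while Pearson falls below 1 because the relationship is not a straight line.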