Data Distribution

This document discusses different data types and measures of data distribution. It describes common data types such as numerical, categorical, boolean, and ordinal variables. It also defines key distribution measures such as mean, median, mode, skewness, and kurtosis, and compares common continuous and discrete distributions. Finally, it outlines methods used to compare distributions, such as the Q-Q plot and the Kolmogorov-Smirnov test, and explains when to use different statistical tests depending on the data and comparisons being made.

Uploaded by

ky453125
Copyright
© All Rights Reserved

Data Types and Measures of Data Distribution

Prepared By:
Deeman Yousif Mahmood
PhD Student
Data Type
Data types
Type                      Example
I.   Numerical (double)   Income (e.g. 650.34)
II.  Numerical (int)      # of children (e.g. 4)
III. Boolean              Gender (e.g. male)
IV.  Categorical          Colors (e.g. green)
V.   Ordinal              Satisfaction (e.g. pleased)
VI.  Others               Comments


Data types – Discrete and continuous
Type                      Discrete or continuous
I.   Numerical (double)   Continuous
II.  Numerical (int)      Discrete
III. Boolean              Discrete
IV.  Categorical          Discrete
V.   Ordinal              Discrete
VI.  Others               Discrete
Categorical vs Boolean
• Categorical is essentially several Booleans that are grouped by some logic

• Example
– Feature (color): Green, Blue, Red
vs
– Feature (isGreen): Yes/No
– Feature (isBlue): Yes/No
– Feature (isRed): Yes/No

• Sometimes we convert categorical into Booleans for machine learning
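
As a minimal sketch of the conversion described above (often called one-hot encoding), the snippet below turns a categorical color feature into one Boolean feature per category. The helper name `one_hot` and the sample values are illustrative assumptions, not part of the original slides.

```python
def one_hot(values, categories):
    """Turn each categorical value into a dict of Boolean is<Category> flags."""
    return [{f"is{c}": (v == c) for c in categories} for v in values]


# Hypothetical data: the three colors used in the example above.
colors = ["Green", "Blue", "Red"]
encoded = one_hot(["Green", "Red"], colors)

for row in encoded:
    print(row)
```

Each input value produces exactly one `True` flag, which is why the resulting Boolean features tend to be sparse.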
Why is knowledge of data type important?
• Model results are based on this input
– Distance measures

• Some models and techniques only use certain data types

• Memory considerations
– Categorical vs Boolean (Male/Female or 0/1)
– Boolean can be sparse
Data Distribution Measures
Distribution measures 1: Mean, Median, Mode
• Mode
– Good for nominal variables
– Quick and easy

• Median
– Robust central tendency statistic
  • Less sensitive to outliers and extreme values
– Good for “bad” (e.g. heavily skewed) distributions

• Mean
– Most commonly used statistic for central tendency
  • Generally preferred except for “bad” distributions
– Based on all data in the distribution
– Used for inference as well as description
  • An unbiased estimator of the population mean (the parameter being estimated)
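
To illustrate the point about robustness, here is a small sketch using Python's standard `statistics` module. The sample values are made up for illustration; the single extreme value stands in for an outlier.

```python
import statistics

# Hypothetical sample, then the same sample with one extreme value added.
incomes = [30, 32, 35, 35, 38, 40]
with_outlier = incomes + [1000]

for name, data in [("original", incomes), ("with outlier", with_outlier)]:
    print(
        name,
        "mean:", round(statistics.mean(data), 2),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )
```

The mean moves a long way when the outlier is added, while the median and mode stay where they were.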
Distribution measures 2: Skewness & kurtosis
• Skewness (tails)
– Skewness is a measure of the asymmetry of the probability distribution
– Right skew - positive skewness (longer right tail)
– Left skew - negative skewness (longer left tail)
– Symmetric - zero skewness

• Kurtosis (shoulders, heavy tail)
– Kurtosis is the degree of peakedness of a distribution relative to a normal distribution
– Excess kurtosis is kurtosis minus 3, the kurtosis of a normal distribution
– A normal distribution is a mesokurtic distribution
– A pure leptokurtic distribution has a higher peak than the normal distribution and has heavier tails
– A pure platykurtic distribution has a lower peak than a normal distribution and lighter tails
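
The two measures can be sketched in a few lines of plain Python. This version assumes the simple population (biased) moment formulas, with skewness as m3 / m2^1.5 and excess kurtosis as m4 / m2^2 - 3; statistical libraries may apply bias corrections, so their numbers can differ slightly. The function names and sample data are illustrative.

```python
def central_moment(data, k):
    """k-th central moment, dividing by n (population form)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    # Positive for a longer right tail, negative for a longer left tail.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    # Subtracting 3 makes a normal distribution score 0 (mesokurtic).
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```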
Common continuous distributions
Normal (Gaussian) Distribution
– Z-score
  • The distance of a value from the mean, measured in standard deviations: $z = (x - \mu) / \sigma$

Log-normal Distribution
– Used to model a variable which is a product of positive i.i.d. variables, e.g.
  • A compound return from a sequence of many trades
  • Measures of size of living tissue

Student’s t-Distribution (Gosset 1908)
– Sampling distribution (i.i.d. measures) of the t-statistic $t = (\bar{x} - \mu) / (s / \sqrt{n})$
– Approaches the Gaussian distribution as the sample size $n$ (or the degrees of freedom $\nu = n - 1$) grows
– Used for
  • Testing the difference between two sample means
  • Inference when population parameters such as $\sigma$ are unknown

The $\chi^2$ Distribution with k D.F.
– Heavily used in statistics
  • Estimating variance
  • Goodness-of-fit test
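
As a small illustration of the Z-score idea, the sketch below expresses each value as its distance from the mean in standard deviations. It assumes the population standard deviation (`statistics.pstdev`); the helper name `z_scores` and the data are illustrative.

```python
import statistics


def z_scores(data):
    """Distance of each value from the mean, in standard deviations."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation (assumption)
    return [(x - mean) / sd for x in data]


# Hypothetical sample with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(values)
print(scores)
```

Here the value 9 lies two standard deviations above the mean, and the value 2 lies one and a half below it.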
Common discrete distributions
• Bernoulli Distribution
– Bernoulli trial
  • A trial with only two possible outcomes
– Bernoulli Distribution
  • Represents success/failure (e.g. accuracy of prediction)

• Binomial distribution
– Number of successes in n independent trials
– If n is large, then $\mathcal{N}(\mu, \sigma^2)$ is a good approximation ($\mu = np$, $\sigma^2 = np(1-p)$) for $B(n, p)$

• Multinomial Distribution
– Categorical Distribution
  • A trial with k possible outcomes with probabilities $p_1, \dots, p_k$, where $p_i \ge 0$ and $\sum_{i=1}^{k} p_i = 1$
– Multinomial Distribution
  • Number of occurrences of k categories in n independent trials

• Poisson Distribution
– Number of events occurring within a fixed time interval (or space)
  • $\lambda$, the shape param., indicates the average number of events in the given time interval
– If $\lambda$ is large, then $\mathcal{N}(\mu, \sigma^2)$ is a good approximation, where $\mu = \sigma^2 = \lambda$, for $\mathrm{Pois}(\lambda)$
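
The large-n approximation for the binomial can be checked numerically. The sketch below assumes the approximation meant is the usual normal one, with mean np and variance np(1-p); the function names and the choice n = 100, p = 0.5 are illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(
        2 * math.pi * variance
    )


n, p = 100, 0.5
for k in (40, 50, 60):
    exact = binomial_pmf(k, n, p)
    approx = normal_pdf(k, n * p, n * p * (1 - p))
    print(k, round(exact, 5), round(approx, 5))
```

For these settings the two columns agree to about three decimal places; the agreement is generally worse for small n or for p close to 0 or 1.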
Comparing distributions
Examples of commonly used distribution tests

• Q-Q plot
– Compare distributions based on quantiles

• Kolmogorov–Smirnov (KS) test
– Compare distributions based on the cumulative distribution function

• Shapiro–Wilk test for normality
– Check if data is normally distributed

• Two derivatives of KS that also compare 2 distributions
– Cramér–von Mises criterion
– Anderson–Darling test
Q-Q plot
• A plot of the quantiles of the first data set against the quantiles of the second data set
• Data set sizes don’t have to be equal
• The greater the departure from the 45 deg. reference line, the greater the evidence for the conclusion that the two data sets have come from populations with different distributions
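
The points of a Q-Q plot can be sketched without any plotting library by pairing matching quantiles of the two data sets. The helper `qq_points`, the choice of deciles, and the synthetic samples below are illustrative assumptions; the two samples deliberately have different sizes.

```python
import statistics


def qq_points(sample_a, sample_b, n_quantiles=10):
    """Pair the matching quantiles of two samples (sizes may differ)."""
    qa = statistics.quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))


# Hypothetical samples: same shape, but b is shifted up by 5.
a = list(range(0, 101))                 # 101 points from 0 to 100
b = [x / 2 + 5 for x in range(0, 201)]  # 201 points from 5 to 105

points = qq_points(a, b)
for x, y in points:
    print(x, y)
```

Every point sits 5 units above the 45 degree line, so the plot would show a straight line parallel to the reference line: the same shape with a different location.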
Kolmogorov–Smirnov test
• A non-parametric test for the equality of continuous, one-dimensional probability distributions
• Can be applied to test a dataset distribution against a known distribution OR against another dataset distribution
H0: The data follow a specified distribution
H1: The data do not follow the specified distribution

• The K-S statistic is defined as: $D_n = \sup_x |F_n(x) - F(x)|$, where $F_n$ is the empirical CDF of the sample and $F$ is the CDF of the specified distribution
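
A minimal sketch of the one-sample K-S statistic, assuming the standard definition as the largest absolute gap between the empirical CDF of the sample and the CDF of the specified distribution. The function names, the uniform reference distribution, and the three data points are illustrative; the sketch computes only the statistic, not a p-value.

```python
def ks_statistic(sample, cdf):
    """Largest gap between the sample's empirical CDF and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so both sides of the jump are checked.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


d = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(d)
```

A larger statistic is stronger evidence against H0; turning it into a decision requires the K-S distribution or a statistics library.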


When to use which statistical test?
Using the correct statistical test and correcting for multiple hypotheses are recurrent issues in data science
Data comparison you are making    Data are normally         Data are not normally       Data are binomial
                                  distributed               distributed, or are ranks   (possess 2 possible values)
                                                            or scores

Compare one set of data to a      One-sample t-test         Wilcoxon test               χ² test
hypothetical value

Compare two sets of               Unpaired t-test           Mann-Whitney test           χ² test or Fisher test
independently collected
(unpaired) data

Compare two sets of data from     Paired t-test             Wilcoxon test               McNemar’s test
the same subjects under
different circumstances (paired)

Compare three or more sets of     One-way ANOVA             Kruskal-Wallis test         χ² test
data

Look for a relationship between   Pearson correlation       Spearman correlation        Contingency correlation
two variables                     coefficient               coefficient                 coefficients

Look for a linear relationship    Linear regression         Nonparametric linear        Simple logistic regression
between two variables                                       regression

Look for a non-linear             Non-linear regression     Nonparametric non-linear
relationship between two                                    regression
variables
