Data Distribution

This document discusses data types and measures of data distribution. It describes common data types such as numerical, categorical, Boolean, and ordinal variables. It defines key distribution measures (mean, median, mode, skewness, and kurtosis) and compares common continuous and discrete distributions. Finally, it outlines methods used to compare distributions, such as the Q-Q plot and the Kolmogorov–Smirnov test, and explains which statistical test to use depending on the data and the comparison being made.

Uploaded by

ky453125


Data Types and Measures of Data Distribution

Prepared By:
Deeman Yousif Mahmood
PhD Student
Data Type
Data types
Type                      Example
I. Numerical (double)     Income (e.g. 650.34)
II. Numerical (int)       # of children (e.g. 4)
III. Boolean              Gender (e.g. male)
IV. Categorical           Colors (e.g. green)
V. Ordinal                Satisfaction (e.g. pleased)
VI. Others                Comments

Data types – Discrete and continuous
Type                      Discrete or continuous
I. Numerical (double)     Continuous
II. Numerical (int)       Discrete
III. Boolean              Discrete
IV. Categorical           Discrete
V. Ordinal                Discrete
VI. Others                Discrete
Categorical vs Boolean
• Categorical is essentially several Booleans that are
grouped by some logic

• Example
– Feature (color): Green, Blue, Red
vs
– Feature (isGreen): Yes/No
– Feature (isBlue): Yes/No
– Feature (isRed): Yes/No

Sometimes we convert categorical features into Booleans for machine learning (one-hot encoding)
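The conversion from one categorical feature to several Booleans can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function name `one_hot` and the sample data are assumptions made for the example, and the column names follow the isGreen/isBlue/isRed pattern from the slide.

```python
def one_hot(values):
    """Convert a list of categorical values into Boolean indicator columns."""
    categories = sorted(set(values))  # fixed, reproducible column order
    return {
        f"is{category}": [value == category for value in values]
        for category in categories
    }


# Hypothetical sample data: one categorical feature with three levels.
colors = ["Green", "Blue", "Red", "Green"]
encoded = one_hot(colors)

for name, column in encoded.items():
    print(name, column)
```

Each row has exactly one true indicator, which is why the Boolean representation tends to be sparse when a feature has many categories.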
Why is knowledge of data type important?
• Model results are based on this input
– Distance measures

• Some models and techniques only use certain data types

• Memory considerations
– Categorical vs Boolean (Male/Female or 0/1)
– Boolean can be sparse
Data Distribution Measures
Distribution measures 1: Mean, Median, Mode
• Mode
– Good for nominal variables
– Quick and easy

• Median
– Robust central tendency statistic
  • Less sensitive to outliers and extreme values
– Good for “bad” (e.g. heavily skewed) distributions

• Mean
– Most commonly used statistic for central tendency
  • Generally preferred except for “bad” distributions
– Based on all data in the distribution
– Used for inference as well as description
  • Best estimator of the parameter (the population mean)
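A short sketch of why the median is called robust: adding one extreme value moves the mean a long way but leaves the median where it was. The numbers below are made up for illustration, and the example uses only Python's standard `statistics` module.

```python
import statistics

# Hypothetical incomes (in thousands); the last value added is an outlier.
incomes = [30, 32, 35, 35, 38, 40]
with_outlier = incomes + [1000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# The mode also works for nominal data, where mean and median are undefined.
favourite_color = statistics.mode(["green", "blue", "green", "red"])

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
print("mode of colors:", favourite_color)
```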
Distribution measures 2: Skewness & kurtosis
• Skewness (tails)
– Skewness is a measure of the asymmetry of the probability distribution
– Right skew: skewness > 0 (longer right tail)
– Left skew: skewness < 0 (longer left tail)
– Symmetric: skewness = 0

• Kurtosis (shoulders, heavy tail)
– Kurtosis is the degree of peakedness of a distribution relative to a normal distribution
– Excess kurtosis = kurtosis − 3, so a normal distribution has excess kurtosis 0
– A normal distribution is a mesokurtic distribution
– A pure leptokurtic distribution has a higher peak than the normal distribution and heavier tails
– A pure platykurtic distribution has a lower peak than a normal distribution and lighter tails
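Both measures can be computed from central moments. The sketch below assumes the simple population (biased) moment estimators, which is one common convention; statistical packages often apply small-sample corrections, so their numbers may differ slightly. The function names and data are illustrative.

```python
def central_moment(data, order):
    """Population central moment of the given order."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)


def skewness(data):
    """Third standardized moment: > 0 right skew, < 0 left skew."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (the value for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]          # long tail to the right
left_skewed = [-x for x in right_skewed]    # mirror image, long tail to the left

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
print(excess_kurtosis(symmetric))  # negative: flatter than a normal distribution
```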
Common continuous distributions
• Normal (Gaussian) Distribution
– Z-score
  • The distance of a value from the mean, measured in standard deviations: $z = (x - \mu) / \sigma$

• Log-normal Distribution
– Used to model a variable which is a product of positive i.i.d. variables
  • A compound return from a sequence of many trades
  • Measures of size of living tissue

• Student’s t-Distribution (Gosset 1908)
– Sampling distribution of a statistic computed from i.i.d. measurements
– Approaches the Gaussian distribution as the degrees of freedom grow
– Used for
  • Testing the difference between two sample means
  • Inference when the population parameters are unknown

• The $\chi^2$ Distribution with k D.F.
– Heavily used in statistics
  • Estimating variance
  • Goodness-of-fit test
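Two of the ideas above lend themselves to a small numeric sketch: the z-score as a distance from the mean in standard deviations, and the reason products of positive variables lead to the log-normal (the logarithm of a product is a sum of logarithms, and sums of i.i.d. terms tend toward a Gaussian). The data, the helper name `z_score`, and the choice of the population standard deviation are assumptions made for this example.

```python
import math
import statistics


def z_score(value, mean, std_dev):
    """Distance of a value from the mean, in units of standard deviations."""
    return (value - mean) / std_dev


# Hypothetical measurements.
heights = [160, 165, 170, 175, 180]
mu = statistics.mean(heights)
sigma = statistics.pstdev(heights)  # population standard deviation
z = z_score(180, mu, sigma)
print(f"z-score of 180: {z:.3f}")

# Compound return from a sequence of trades: a product of positive factors.
growth_factors = [1.02, 0.97, 1.05]
compound = math.prod(growth_factors)
log_compound = sum(math.log(g) for g in growth_factors)
print(f"compound return: {compound:.4f}, log of it: {log_compound:.4f}")
```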
Common discrete distributions
• Bernoulli Distribution
– Bernoulli trial
  • A trial with only two possible outcomes
– Bernoulli Distribution
  • Represents success/failure (e.g. accuracy of prediction)

• Binomial distribution
– Number of successes in n independent trials
– If n is large, then the normal distribution $N(np,\, np(1-p))$ is a good approximation for $B(n, p)$

• Multinomial Distribution
– Categorical Distribution
  • A trial with k possible outcomes with probabilities $p_1, \dots, p_k$, where $p_i \ge 0$ and $\sum_{i=1}^{k} p_i = 1$
– Multinomial Distribution
  • Number of occurrences of k categories in n independent trials

• Poisson Distribution
– Number of events occurring within a fixed time interval (or space)
  • $\lambda$, the shape param., indicates the average number of events in the given time interval
– If n is large (and p is small), then $\mathrm{Poisson}(\lambda)$, where $\lambda = np$, is a good approximation for $B(n, p)$
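The two large-n approximations of the binomial can be checked numerically. The sketch below assumes the usual forms, a normal distribution with mean np and variance np(1 − p), and a Poisson distribution with rate np; the parameter values are arbitrary choices for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x, mean, variance):
    """Density of the normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Normal approximation: n large, p not close to 0 or 1.
n, p = 100, 0.5
exact_normal_case = binomial_pmf(50, n, p)
approx_normal = normal_pdf(50, n * p, n * p * (1 - p))

# Poisson approximation: n large, p small, lambda = n * p.
n2, p2 = 1000, 0.003
exact_poisson_case = binomial_pmf(3, n2, p2)
approx_poisson = poisson_pmf(3, n2 * p2)

print(f"binomial {exact_normal_case:.5f} vs normal  {approx_normal:.5f}")
print(f"binomial {exact_poisson_case:.5f} vs Poisson {approx_poisson:.5f}")
```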
Comparing distributions
Examples of commonly used distribution tests

• Q-Q plot
– Compares distributions based on quantiles
• Kolmogorov–Smirnov (KS) test
– Compares distributions based on the cumulative distribution function
• Shapiro–Wilk test for normality
– Checks whether data are normally distributed
• Two relatives of KS that also compare 2 distributions
– Cramér–von Mises criterion
– Anderson–Darling test
Q-Q plot
• A plot of the quantiles of the first data set against the quantiles of the second data set
• Data set sizes don’t have to be equal
• The greater the departure from the 45° reference line, the greater the evidence that the two data sets come from populations with different distributions
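The points of a Q-Q plot can be computed without any plotting library: take the same quantiles from both data sets and pair them up. This is a minimal sketch using `statistics.quantiles` from the standard library; the helper name `qq_points`, the choice of quartiles, and the sample data are assumptions for illustration.

```python
import statistics


def qq_points(sample_a, sample_b, n_quantiles=4):
    """Pair the quantiles of two samples taken at the same probabilities."""
    qa = statistics.quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))


first = list(range(1, 10))   # 9 observations
same_shape = [1, 3, 5, 7, 9]  # 5 observations over the same range
shifted = [2, 4, 6, 8, 10]    # same spread, shifted up by one

on_line = qq_points(first, same_shape)
off_line = qq_points(first, shifted)

print(on_line)   # points on the 45-degree line y = x
print(off_line)  # points parallel to the line but offset from it
```

The two samples have different sizes, which is fine because only their quantiles are compared. Points that sit parallel to the reference line but offset from it suggest the same shape with a different location.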
Kolmogorov–Smirnov test
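• A non-parametric test for the equality of continuous, one-dimensional probability distributions
• Can be applied to test a dataset distribution against a known distribution OR against another dataset distribution
H0: The data follow a specified distribution
H1: The data do not follow the specified distribution

• The K-S statistic is defined as:
$$D_n = \sup_x \lvert F_n(x) - F(x) \rvert$$
where $F_n$ is the empirical cumulative distribution function of the n observations and $F$ is the cumulative distribution function of the specified distribution.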
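For the case of one data set against another, the statistic is the largest absolute gap between the two empirical cumulative distribution functions. The following sketch computes only that statistic in plain Python; it does not compute a p-value, and the helper names `ecdf` and `ks_two_sample` are hypothetical.

```python
import bisect


def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # Fraction of observations less than or equal to x.
        return bisect.bisect_right(ordered, x) / n

    return cdf


def ks_two_sample(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    # Both step functions only change at observed values, so it is
    # enough to evaluate the gap at every observed point.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


print(ks_two_sample([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples
print(ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6]))  # partially overlapping
print(ks_two_sample([1, 2, 3], [4, 5, 6]))        # no overlap at all
```

A statistic near 0 is consistent with H0, while a value near 1 means the samples barely overlap. Turning the statistic into a decision requires its sampling distribution, which a statistics library would normally provide.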

When to use which statistical test?
Using the correct statistical test and correcting for multiple hypotheses are recurrent issues in data science

Data comparison you are making | Data are normally distributed | Data are not normally distributed, or are ranks or scores | Data are binomial (possess 2 possible values)
Compare one set of data to a hypothetical value | One-sample t-test | Wilcoxon test | χ² test
Compare two sets of independently collected (unpaired) data | Unpaired t-test | Mann-Whitney test | χ² test or Fisher test
Compare two sets of data from the same subjects under different circumstances (paired) | Paired t-test | Wilcoxon test | McNemar’s test
Compare three or more sets of data | One-way ANOVA | Kruskal-Wallis test | χ² test
Look for a relationship between two variables | Pearson correlation coefficient | Spearman correlation coefficient | Contingency correlation coefficients
Look for a linear relationship between two variables | Linear regression | Nonparametric linear regression | Simple logistic regression
Look for a non-linear relationship between two variables | Non-linear regression | Nonparametric non-linear regression |
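The slide mentions correcting for multiple hypotheses without naming a method. One simple and widely used option is the Bonferroni correction, shown here purely as an illustration: each p-value is compared against the significance level divided by the number of tests. The function name and the p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which hypotheses are rejected after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # stricter per-test significance level
    return [p <= threshold for p in p_values]


# Hypothetical p-values from three separate tests.
p_values = [0.001, 0.02, 0.04]
rejected = bonferroni(p_values)

print(rejected)
```

All three p-values are below 0.05 on their own, but only the first one survives the corrected threshold of 0.05 / 3.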
