Lecture Notes

Uploaded by

kyaligonzaerick

Exploratory Data Analysis and Descriptive Statistics

Today

• What are descriptive statistics and exploratory data analysis?
• Basic numerical summaries of data
• Basic graphical summaries of data

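“Central Dogma” of Statistics

[Diagram: Population and Sample. Probability leads from the population to the sample; Inferential Statistics leads from the sample back to the population; Descriptive Statistics summarizes the sample.]
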
EDA
Before making inferences from data it is
essential to examine all your variables.

Why?

To listen to the data:

- to catch mistakes
- to see patterns in the data
- to find violations of statistical assumptions
- to generate hypotheses
…and because if you don’t, you will have trouble later
Types of Data

Categorical
• binary: 2 categories
• nominal: more categories
• ordinal: order matters

Quantitative
• discrete: numerical
• continuous: uninterrupted
Dimensionality of Data Sets

• Univariate: Measurement made on one variable per subject
• Bivariate: Measurement made on two variables per subject
• Multivariate: Measurement made on many variables per subject
Numerical Summaries of Data

• Central Tendency measures. They are computed to give a “center” around which the measurements in the data are distributed.
• Variation or Variability measures. They describe “data spread” or how far away the measurements are from the center.
Location: Mean

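1. The Mean

To calculate the average $\bar{x}$ of a set of $n$ observations $x_1, \dots, x_n$, add their values and divide by the number of observations:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$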
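
As a minimal sketch of this calculation, the Python snippet below adds the values and divides by their count. The function name `mean` is illustrative only, and the ages are the ones used in the median example of these notes.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    if not values:
        raise ValueError("mean requires at least one observation")
    return sum(values) / len(values)


# Ages of participants, taken from the median example in these notes.
ages = [17, 19, 21, 22, 23, 23, 23, 38]
average_age = mean(ages)
print(average_age)  # 186 / 8 = 23.25
```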
Location: Median
• Median – the exact middle value

• Calculation:
- If there are an odd number of observations, find the middle value
- If there are an even number of observations, find the middle two values and average them

• Example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Median = (22+23)/2 = 22.5
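
The odd/even rule above can be sketched in a few lines of Python. This is an illustrative implementation (the function name `median` is our own choice), applied to the ages from the example.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle two


ages = [17, 19, 21, 22, 23, 23, 23, 38]
median_age = median(ages)
print(median_age)  # (22 + 23) / 2 = 22.5
```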

Which Location Measure Is Best?

• Mean is best for symmetric distributions without outliers

• Median is useful for skewed distributions or data with outliers

[Figure: two dot plots on an axis from 0 to 10. Left panel: Mean = 3, Median = 3. Right panel: Mean = 4, Median = 3.]
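
To make the comparison concrete, here is a small sketch using Python's standard `statistics` module. The two data sets are hypothetical, chosen only so that they reproduce the values quoted above (mean 3 and median 3, then mean 4 and median 3); they are not the data behind the original figure.

```python
from statistics import mean, median

# Hypothetical data: a symmetric set, and the same set with one value moved far out.
symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]

print(mean(symmetric), median(symmetric))        # 3 3
print(mean(with_outlier), median(with_outlier))  # 4 3
```

A single extreme value pulls the mean from 3 to 4, while the median does not move.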
Scale: Variance

• Average of squared deviations of values from the mean
Why Squared Deviations?

• Adding deviations will yield a sum of ?
• Absolute values do not have nice mathematical properties
• Squares eliminate the negatives
• Result:
– Increasing contribution to the variance as you go farther from the mean.
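
A short sketch, reusing the ages from the median example, shows what happens when raw deviations are summed and how squaring changes the picture. Variable names are illustrative.

```python
ages = [17, 19, 21, 22, 23, 23, 23, 38]
mean_age = sum(ages) / len(ages)  # 23.25

# Raw deviations: positive and negative values cancel out.
deviations = [x - mean_age for x in ages]
sum_of_deviations = sum(deviations)

# Squared deviations: all non-negative, and distant values dominate.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

print(sum_of_deviations)        # 0.0
print(squared_deviations[-1])   # 217.5625, the contribution of the age 38
print(sum_of_squares)           # 281.5
```

The single observation at 38 accounts for most of the sum of squares, which is the "increasing contribution" described above.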
Scale: Standard Deviation
• Variance is hard to interpret on its own
• What does it mean to have a variance of 10.8? Or 2.2? Or 1459.092? Or 0.000001?
• Nothing, by itself: variance is expressed in squared units. But if you could “standardize” that value, you could talk about any variance (i.e. deviation) in equivalent terms
• The standard deviation is simply the square root of the variance
Scale: Standard Deviation

1. Score (in the units that are meaningful)

2. Mean
3. Each score’s deviation from the mean
4. Square that deviation
5. Sum all the squared deviations (Sum of Squares)
6. Divide by n-1
7. Square root – now the value is in the units we started with!!!
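
The seven steps can be sketched as follows. The function name `sample_std_dev` is our own, and the comparison with `statistics.stdev` (which also divides by n-1) is included only as a sanity check.

```python
import math
import statistics


def sample_std_dev(scores):
    """Sample standard deviation, following the steps listed above."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to divide by n - 1")
    mean = sum(scores) / n                         # step 2: mean
    deviations = [x - mean for x in scores]        # step 3: deviation of each score
    squared = [d ** 2 for d in deviations]         # step 4: square each deviation
    sum_of_squares = sum(squared)                  # step 5: sum of squares
    variance = sum_of_squares / (n - 1)            # step 6: divide by n - 1
    return math.sqrt(variance)                     # step 7: back to the original units


# Step 1: scores in meaningful units (ages in years, from the median example).
ages = [17, 19, 21, 22, 23, 23, 23, 38]
sd = sample_std_dev(ages)
print(sd, statistics.stdev(ages))  # the two values should agree
```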
Scale: Quartiles and IQR
[Figure: the ordered data divided into four segments of 25% each by Q1, Q2, and Q3; the IQR spans Q1 to Q3.]

• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3
Percentiles (aka Quantiles)
In general the nth percentile is a value such that n% of the observations fall at or below it

Q1 = 25th percentile
Median = 50th percentile
Q3 = 75th percentile
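
One way to turn this definition into code is the "nearest-rank" convention: take the smallest observation that has at least n% of the data at or below it. The sketch below assumes that convention; the function name `percentile` is illustrative, and statistical software often interpolates between observations instead, so results can differ slightly.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile requires at least one observation")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based position in the sorted data
    return ordered[rank - 1]


ages = [17, 19, 21, 22, 23, 23, 23, 38]
q1 = percentile(ages, 25)
q2 = percentile(ages, 50)
q3 = percentile(ages, 75)
print(q1, q2, q3)  # 19 22 23
```

Note that this convention returns 22 for the 50th percentile, whereas averaging the two middle values gives a median of 22.5. Both are in common use.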
Graphical Summaries of Data

A (Good) Picture Is Worth A 1,000 Words
Univariate Data: Histograms and Bar Plots
• What’s the difference between a histogram and bar plot?
Bar plot
• Used for categorical variables to show frequency or proportion in each category.
• Translates the data from frequency tables into a pictorial representation

Histogram
• Used to visualize the distribution (shape, center, range, variation) of continuous variables
• “Bin size” is important
Effect of Bin Size on Histogram
• Simulated 1000 N(0,1) and 500 N(1,1)
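
A rough sketch of this experiment in plain Python: simulate the same mixture and count the observations per bin for several bin counts. The helper `histogram_counts`, the seed, and the choices of 5, 20, and 80 bins are all assumptions made for illustration; the original slides do not state which bin sizes were used.

```python
import random


def histogram_counts(values, n_bins):
    """Counts per bin for n_bins equal-width bins spanning min(values)..max(values)."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; cannot form bins")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum would land one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


random.seed(0)  # arbitrary seed, only to make the run repeatable
sample = [random.gauss(0, 1) for _ in range(1000)] + [random.gauss(1, 1) for _ in range(500)]

results = {n_bins: histogram_counts(sample, n_bins) for n_bins in (5, 20, 80)}
for n_bins, counts in results.items():
    print(n_bins, "bins, tallest bar:", max(counts))
```

With few bins the two components blur into one lump; with many bins each bar holds few observations and the outline becomes noisy.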

[Figure: three frequency histograms of the same simulated data, drawn with different bin sizes.]
More on Histograms
• What’s the difference between a frequency histogram and a density histogram?

[Figure: Frequency Histogram (left) and Density Histogram (right).]
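
In a frequency histogram the bar height is the count in each bin; in a density histogram the height is the count divided by (number of observations × bin width), so the bar areas add up to 1. The sketch below illustrates this with the ages from the median example; the function name and the choice of three bins are assumptions.

```python
def frequency_and_density(values, n_bins):
    """Return (frequencies, densities, bin_width) for equal-width bins over the data range."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; cannot form bins")
    width = (high - low) / n_bins
    frequencies = [0] * n_bins
    for v in values:
        index = min(int((v - low) / width), n_bins - 1)  # clamp the maximum into the last bin
        frequencies[index] += 1
    # Density rescales each count so that height * width sums to 1 over all bins.
    densities = [count / (len(values) * width) for count in frequencies]
    return frequencies, densities, width


ages = [17, 19, 21, 22, 23, 23, 23, 38]
frequencies, densities, width = frequency_and_density(ages, 3)
total_area = sum(d * width for d in densities)
print(frequencies)  # [7, 0, 1]
print(densities)
print(total_area)   # approximately 1.0
```

The two histograms have the same shape; only the vertical scale differs.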
Box Plots
[Figure: box plot of the variable AGE (vertical axis: Years, 0 to 100), marking the minimum, Q1, median, Q3, and maximum, with the IQR spanning Q1 to Q3.]
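
The quantities a box plot displays can be computed directly. This sketch uses `statistics.quantiles` from the Python standard library (available from Python 3.8) with its default method; other software may place Q1 and Q3 slightly differently. The function name `five_number_summary` and the use of the ages from the median example are our own choices.

```python
import statistics


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum, plus the IQR, as a dictionary."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # three cut points for four equal groups
    return {
        "minimum": min(values),
        "Q1": q1,
        "median": q2,
        "Q3": q3,
        "maximum": max(values),
        "IQR": q3 - q1,
    }


ages = [17, 19, 21, 22, 23, 23, 23, 38]
summary = five_number_summary(ages)
for name, value in summary.items():
    print(name, value)
```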
Bivariate Data

Variable 1    Variable 2    Display
Categorical   Categorical   Crosstabs, Stacked Box Plot
Categorical   Continuous    Boxplot
Continuous    Continuous    Scatterplot, Stacked Box Plot
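
As an illustration of the categorical-by-categorical case, a crosstab simply counts how often each pair of categories occurs. The records below are hypothetical, invented only for this sketch.

```python
from collections import Counter

# Hypothetical records: (smoker, sex) for six subjects.
records = [
    ("yes", "F"), ("no", "F"), ("no", "M"),
    ("yes", "M"), ("no", "F"), ("no", "M"),
]

pair_counts = Counter(records)
rows = sorted({smoker for smoker, _ in records})
cols = sorted({sex for _, sex in records})

# Nested dictionary: crosstab[row_category][column_category] -> count.
crosstab = {r: {c: pair_counts[(r, c)] for c in cols} for r in rows}

print("smoker", *cols)
for r in rows:
    print(r, *(crosstab[r][c] for c in cols))
```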
Multivariate Data
Clustering
• Organize units into clusters
• Descriptive, not inferential
• Many approaches
• “Clusters” always produced
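
The last point deserves a demonstration. The sketch below runs a plain one-dimensional k-means (one of the many approaches; it is not necessarily the method used in the course) on uniformly distributed data that has no real group structure. The function names, seed, and k = 3 are assumptions for illustration.

```python
import random


def assign(values, centers):
    """Put each value into the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
        groups[nearest].append(v)
    return groups


def kmeans_1d(values, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) on one-dimensional data."""
    ordered = sorted(values)
    # Deterministic start: k centers spread evenly through the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = assign(values, centers)
        # Move each center to the mean of its group; keep it in place if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, assign(values, centers)


random.seed(1)  # arbitrary seed, only to make the run repeatable
uniform_data = [random.uniform(0, 10) for _ in range(300)]  # no true clusters here
centers, groups = kmeans_1d(uniform_data, 3)
print([round(c, 2) for c in centers])
print([len(g) for g in groups])
```

The algorithm still returns three tidy groups, which is why cluster output is descriptive and should not be read as evidence that clusters exist.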

Data Reduction Approaches (PCA)

• Reduce an n-dimensional dataset to a much smaller number of dimensions
• Finds a new (smaller) set of variables that retains most of the information in the total sample
• Effective way to visualize multivariate data
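
For intuition, here is a minimal sketch of PCA in the smallest possible case: reducing two variables to one. It uses the closed-form principal axis of a 2×2 covariance matrix rather than a general eigen-solver, and the data points are hypothetical. Real analyses would normally rely on a statistics package.

```python
import math


def pca_2d(points):
    """First principal component of 2-D data: (direction, scores, share of variance explained)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    trace = sxx + syy
    if trace == 0:
        raise ValueError("data has no variance")
    det = sxx * syy - sxy * sxy
    largest_eigenvalue = trace / 2 + math.sqrt(max(trace * trace / 4 - det, 0.0))

    # Angle of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # One new variable per observation: its coordinate along the principal axis.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores, largest_eigenvalue / trace


# Hypothetical, strongly correlated measurements on four subjects.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, scores, explained = pca_2d(points)
print(direction, explained)
```

When the two variables are strongly correlated, the single new variable in `scores` retains nearly all of the variance.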
How to Make a Bad Graph
The aim of good data graphics:
Display data accurately and clearly

Some rules for displaying data badly:

– Display as little information as possible
– Obscure what you do show (with chart junk)
– Use pseudo-3d and color gratuitously
– Make a pie chart (preferably in color and 3d)
– Use a poorly chosen scale

From Karl Broman: http://www.biostat.wisc.edu/~kbroman/

[Examples 1 to 5: figures not reproduced.]
R Tutorial

• Calculating descriptive statistics in R
• Useful R commands for working with multivariate data (apply and its derivatives)
• Creating graphs for different types of data (histograms, boxplots, scatterplots)
• Basic clustering and PCA analysis
