Exploratory Data Analysis

This document discusses exploratory data analysis techniques including visualization and descriptive statistics. The objective is to get an initial understanding of data through visualizing distributions, checking for outliers and missingness, and calculating metrics like mean, median, variance and skewness. Visualization tools like histograms, boxplots and scatterplots help explore relationships between variables. Descriptive statistics and normality tests provide quantitative assessments. Examples using the airquality dataset demonstrate these exploratory analysis steps.

Uploaded by

Gagana U Kumar

Exploratory data analysis

Objective
• Get a quick idea of the data
– visualization is the easiest way
– check descriptive statistics
• Clean the data to reduce the number of data problems later on
– handle missing data, outliers, typos, etc.
– be careful when doing so!
• Explore the data to determine whether the model assumptions are met
– e.g., check the normality of the data
Visualization and descriptive statistics
• Visualization
– Histogram
– Boxplot
– Scatter plot (to find the relationship between 2 variables)

• Descriptive statistics
– Mean, median, variance, skewness, kurtosis, etc.
– Correlation (for 2 variables)

• Get a rough idea of the distribution of the data
• Check for outliers or missingness
• More generally, check the normality of the data
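The scatter plot and correlation items above can be sketched with R's built-in airquality data. The choice of Ozone against Temp is an illustrative assumption; the slides do not say which pair of variables to compare.

```r
# Relationship between two variables in the built-in airquality data.
# Ozone vs. Temp is an illustrative choice, not taken from the slides.
plot(airquality$Temp, airquality$Ozone,
     xlab = "Temp", ylab = "Ozone", main = "Ozone vs. Temp")

# Pearson correlation; use = "complete.obs" drops rows with a missing value
r <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
print(r)
```

A value of r near +1 or -1 indicates a strong linear relationship; the plot shows whether that relationship is actually linear.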
Visualization and descriptive statistics
• Skewed right: typically mean > median
• Skewed left: typically mean < median
o The median is robust to extreme values; the mean is not.
o Skewness can therefore be guessed from the mean and median values.

• https://demonstrations.wolfram.com/ExploringSkewnessInBoxPlots/ (boxplot and skewness)
• Normality can be checked (informally) from plots and descriptive statistics
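A small simulation can illustrate the mean/median rule of thumb and the robustness of the median. This is a sketch on simulated data: the exponential distribution, the seed and the added extreme value of 1000 are all arbitrary assumptions chosen for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
right_skewed <- rexp(1000)       # exponential sample: long right tail
left_skewed  <- -right_skewed    # mirror image: long left tail

c(mean = mean(right_skewed), median = median(right_skewed))
c(mean = mean(left_skewed),  median = median(left_skewed))

# Robustness of the median: one extreme value moves the mean far more
with_outlier <- c(right_skewed, 1000)
mean_shift   <- mean(with_outlier) - mean(right_skewed)
median_shift <- median(with_outlier) - median(right_skewed)
c(mean_shift = mean_shift, median_shift = median_shift)
```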
Example: airquality
• Daily air quality measurements in New York, May to September 1973 (R built-in data)
• 153 observations on 6 variables – Ozone, Solar.R, Wind, …
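Before plotting, a first look at the data set might go as follows. This is a minimal sketch using only base R functions; the variable name na_counts is introduced here for illustration.

```r
dim(airquality)     # number of rows (observations) and columns (variables)
str(airquality)     # variable names and types
head(airquality)    # first few rows

# Count missing values per variable
na_counts <- colSums(is.na(airquality))
print(na_counts)
```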
hist(airquality$Ozone,main="Ozone",xlab="Ozone")

Provides the distribution of the data. This can also be used to assess potential outlier concerns.
boxplot(airquality$Ozone, ylab="Ozone")
# mark the mean (missing values removed) on the single box, which sits at x = 1
points(1, mean(airquality$Ozone, na.rm=TRUE), col="red", pch=19)
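The points a boxplot draws beyond its whiskers can also be listed numerically. The sketch below uses base R's boxplot.stats(), which applies the same 1.5 × IQR rule as boxplot() by default; treating those points as outlier candidates is a convention, not a verdict that they are errors.

```r
stats <- boxplot.stats(airquality$Ozone)   # missing values are dropped internally
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values beyond the whiskers (1.5 * IQR rule by default)

# Locate the flagged rows in the original data frame
flagged <- airquality[which(airquality$Ozone %in% stats$out), ]
print(flagged)
```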
Example of descriptive statistics
summary(airquality$Ozone)                # drops NAs itself and reports their count
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 18.00 31.50 42.13 63.25 168.00 37
mean(airquality$Ozone, na.rm=TRUE)       # mean
## [1] 42.12931
var(airquality$Ozone, na.rm=TRUE)        # variance
## [1] 1088.201
# skewness() is not in base R; it needs an add-on package (e.g., e1071 or moments)
skewness(airquality$Ozone, na.rm=TRUE)   # skewness
## [1] 1.209866
range(airquality$Ozone, na.rm=TRUE)      # range [min, max]
## [1] 1 168
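Skewness and kurtosis can also be computed from their definitions in base R, without an add-on package. The sketch below shows two common skewness conventions. Which one produced the value reported on the slide is an assumption: it appears to correspond to the version with the sample standard deviation in the denominator (the default in e1071).

```r
x <- airquality$Ozone[!is.na(airquality$Ozone)]   # drop missing values
n <- length(x)
d <- x - mean(x)

m2 <- mean(d^2)                 # second central moment (divides by n)
m3 <- mean(d^3)                 # third central moment
m4 <- mean(d^4)                 # fourth central moment

skew_moment <- m3 / m2^1.5      # classical moment estimator
skew_sd     <- m3 / sd(x)^3     # same numerator, sample sd (n - 1) in the denominator
kurt_excess <- m4 / m2^2 - 3    # excess kurtosis: 0 for a normal distribution

c(skew_moment = skew_moment, skew_sd = skew_sd, kurt_excess = kurt_excess)
```

Both skewness values are clearly positive, in line with mean (42.13) > median (31.50) in the summary above.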
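Missing data handling
• Could fill a separate semester-long course
• Missingness mechanisms:
– Missing Completely at Random (MCAR)
• Missingness occurs purely at random, unrelated to any observed or unobserved values
– Missing at Random (MAR)
• Missingness depends only on observed values
– Missing Not at Random (MNAR)
• Missingness depends on the unobserved (missing) values themselves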
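A first, informal look at missingness in the airquality data might go as follows. This is only a sketch: comparing missingness rates across months is an illustrative check chosen here, and it cannot by itself establish which mechanism (MCAR, MAR or MNAR) applies.

```r
ozone_missing <- is.na(airquality$Ozone)

sum(ozone_missing)     # total number of missing Ozone values
mean(ozone_missing)    # proportion missing

# Share of missing Ozone values within each month; large differences between
# months would suggest missingness is related to an observed variable
miss_by_month <- tapply(ozone_missing, airquality$Month, mean)
print(round(miss_by_month, 2))

# Rows with no missing value in any column
complete_rows <- airquality[complete.cases(airquality), ]
nrow(complete_rows)
```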
Important statistical assumptions
• Normality
o Why is checking normality important?
1) A t-test or ANOVA requires the normality assumption
2) In correlation and regression, lack of normality and outliers can distort your conclusions

• A normal distribution is symmetric and bell-shaped
o The converse is NOT true: a symmetric, bell-shaped distribution need not be normal, e.g., the Cauchy distribution or the t-distribution

o Many tests are available to check for normality and outliers in the data.
Inference based on Normality
• Under the normality assumption, we can perform the following tests.
✓ One-sample t-test
(e.g., test whether iPhone battery life span > 2 years)
✓ Two-sample t-test
(e.g., test whether iPhone and Galaxy have the same life span)
✓ ANOVA (simply put, comparing group means across more than two groups)
(e.g., test among iPhone, Galaxy and Android phones)
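The three tests can be sketched on the airquality data with base R. The slides give phone examples but no data for them, so the questions below (a reference value of 30 for Ozone, May vs. August, all five months) are assumptions made purely for illustration.

```r
ozone <- airquality$Ozone

# One-sample t-test: is mean Ozone greater than 30? (30 is an arbitrary reference value)
one_sample <- t.test(ozone, mu = 30, alternative = "greater")

# Two-sample t-test: May vs. August (Welch test by default, unequal variances allowed)
two_months <- subset(airquality, Month %in% c(5, 8))
two_sample <- t.test(Ozone ~ Month, data = two_months)

# One-way ANOVA: compare mean Ozone across all five months
anova_fit <- aov(Ozone ~ factor(Month), data = airquality)

print(one_sample)
print(two_sample)
print(summary(anova_fit))
```

Rows with missing Ozone values are dropped by these functions, so no separate na.rm step is needed here.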
Detection of Normality
• How to check normality?
✓ Qualitative check by looking at:
: histogram, boxplot, quantile-quantile plot (QQ plot), etc.
✓ Quantitative check by a formal test
: Shapiro-Wilk test …

• For a comparison among groups (e.g., t-test, ANOVA), the normality check should be conducted within each group
• If at least one group does not follow a normal distribution, t-test or ANOVA conclusions may NOT be valid.
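Checking normality group by group could look like the sketch below. It splits Ozone by Month and runs the Shapiro-Wilk test within each month. Grouping by month and the names by_month and group_check are assumptions introduced for illustration.

```r
# Shapiro-Wilk test within each month; shapiro.test() drops missing values itself
by_month <- lapply(split(airquality$Ozone, airquality$Month), shapiro.test)

# Collect W statistics and p-values into one table
group_check <- data.frame(
  month   = names(by_month),
  W       = sapply(by_month, function(res) unname(res$statistic)),
  p_value = sapply(by_month, function(res) res$p.value)
)
print(group_check)

# Months whose p-value falls below alpha = 0.05 (a conventional choice)
group_check$month[group_check$p_value < 0.05]
```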
qqnorm(airquality$Ozone); qqline(airquality$Ozone, col = 2)

Quantile-Quantile plots (a.k.a. Q-Q plots): a useful diagnostic of how well a specified theoretical distribution fits your data. If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line.
Shapiro-Wilk Normality test
shapiro.test(airquality$Ozone)
##
## Shapiro-Wilk normality test
##
## data: airquality$Ozone
## W = 0.87867, p-value = 2.79e-08

H0: Data follow a normal distribution

H1: Data do not follow a normal distribution
• If the p-value is larger than the significance level (in general α=0.05), we do not have enough evidence to reject the null hypothesis; the data are consistent with a normal distribution (this does not prove normality)
• If the p-value is smaller than the significance level, we have enough evidence to reject the null hypothesis, and we conclude that the data do not follow a normal distribution
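The decision rule can be applied programmatically by reading the p-value from the test result. This is a minimal sketch; alpha = 0.05 is the conventional level mentioned above, and the names alpha, result and reject_h0 are introduced for illustration.

```r
alpha  <- 0.05                             # conventional significance level
result <- shapiro.test(airquality$Ozone)   # missing values are dropped

reject_h0 <- result$p.value < alpha
if (reject_h0) {
  cat("p =", signif(result$p.value, 3), "< alpha: reject H0, data do not look normal\n")
} else {
  cat("p =", signif(result$p.value, 3), ">= alpha: no evidence against normality\n")
}
```

For the Ozone output shown above, the p-value of 2.79e-08 is far below 0.05, so H0 is rejected.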