Exploratory Data Analysis

This document discusses exploratory data analysis techniques including visualization and descriptive statistics. The objective is to get an initial understanding of data through visualizing distributions, checking for outliers and missingness, and calculating metrics like mean, median, variance and skewness. Visualization tools like histograms, boxplots and scatterplots help explore relationships between variables. Descriptive statistics and normality tests provide quantitative assessments. Examples using the airquality dataset demonstrate these exploratory analysis steps.

Uploaded by

Gagana U Kumar

Exploratory data analysis

Objective
• Get a quick idea of the data
– visualization is the easiest way
– check descriptive statistics
• Clean the data to reduce the number of
data problems later on
– handle missing data, outliers, typos, etc.
– need to be careful!
• Explore your data to determine whether the model
assumptions are met
– E.g., check normality of data
Visualization and descriptive statistics
• Visualization
– Histogram
– Boxplot
– Scatter plot (to find the relationship between 2 variables)

• Descriptive statistics
– Mean, median, variance, skewness, kurtosis etc.
– Correlation (for 2 variables)

• Get a rough idea of the distribution of the data
• Check for outliers or missingness
• More generally, check normality of the data
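The tools listed above can be sketched in a few lines of R. This is only an illustration: it assumes the built-in airquality data set, and the choice of the Wind and Ozone columns is arbitrary.

```r
# Illustrative sketch: Wind and Ozone are picked arbitrarily from airquality.
data(airquality)

# Visualization
hist(airquality$Wind, main = "Wind", xlab = "Wind (mph)")    # shape of one variable
boxplot(airquality$Wind, ylab = "Wind (mph)")                # spread and possible outliers
plot(airquality$Wind, airquality$Ozone,
     xlab = "Wind (mph)", ylab = "Ozone (ppb)")              # relationship between 2 variables

# Descriptive statistics for one variable
wind_mean   <- mean(airquality$Wind)
wind_median <- median(airquality$Wind)
wind_var    <- var(airquality$Wind)

# Correlation between two variables; rows with a missing value are dropped
wind_ozone_cor <- cor(airquality$Wind, airquality$Ozone, use = "complete.obs")
```

Ozone contains missing values, which is why `use = "complete.obs"` is passed to `cor()`; without it the result would be `NA`.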
Visualization and descriptive statistics
• Skewed right: typically mean > median
• Skewed left: typically mean < median
o The median is robust to extreme values; the mean is not.
o Skewness can be guessed from the mean and median values

• https://demonstrations.wolfram.com/ExploringSkewnessInBoxPlots/ (boxplot and skewness)
• Normality can be checked (informally) from plots and
descriptive statistics
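A small simulation can make the mean/median rule of thumb concrete. The sketch below uses simulated exponential data (an assumption made purely for illustration, not data from these slides) and also shows how a single extreme value moves the mean much more than the median.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
right_skewed <- rexp(1000)        # exponential sample: long right tail
left_skewed  <- -right_skewed     # mirror image: long left tail

# Compare mean and median in each case
right_check <- mean(right_skewed) > median(right_skewed)
left_check  <- mean(left_skewed)  < median(left_skewed)

# Robustness of the median: add one extreme value and compare the shifts
with_outlier <- c(right_skewed, 1000)
shift_mean   <- mean(with_outlier)   - mean(right_skewed)
shift_median <- median(with_outlier) - median(right_skewed)
```

The rule is a heuristic rather than a theorem, so it is best used together with a histogram or boxplot.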
Example: airquality
• Daily air quality measurements in New York,
May to September 1973. (R built-in data)
• 153 observations on 6 variables – Ozone,
Solar.R, Wind, …
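Before plotting, it can help to look at the structure of the data set. A minimal sketch using only base R functions:

```r
data(airquality)

data_dim   <- dim(airquality)     # rows (observations) and columns (variables)
data_names <- names(airquality)   # variable names

head(airquality)                  # first six rows
str(airquality)                   # type and first values of each variable
```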
hist(airquality$Ozone,main="Ozone",xlab="Ozone")

Provides the distribution of the data. This can also be
used to assess potential outlier concerns.
boxplot(airquality$Ozone,ylab="Ozone")
points(mean(airquality$Ozone, na.rm=TRUE), col="red")
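The points that a boxplot draws individually can also be listed numerically. The sketch below assumes the default boxplot rule in R (points more than 1.5 times the box length beyond the hinges) and uses `boxplot.stats()` from base R.

```r
# Numbers behind the boxplot; missing values are dropped internally
bp <- boxplot.stats(airquality$Ozone)

bp$stats            # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$n                # number of non-missing observations used
outliers <- bp$out  # values plotted as individual points beyond the whiskers
outliers
```

Flagged points are candidates for a closer look, not values to delete automatically.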
Example of descriptive statistics
summary(airquality$Ozone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    1.00   18.00   31.50   42.13   63.25  168.00      37
mean(airquality$Ozone, na.rm=TRUE)      # mean
## [1] 42.12931
var(airquality$Ozone, na.rm=TRUE)       # variance
## [1] 1088.201
library(e1071)  # skewness() is not in base R; e1071 is assumed here (moments also provides one)
skewness(airquality$Ozone, na.rm=TRUE)  # skewness
## [1] 1.209866
range(airquality$Ozone, na.rm=TRUE)     # range: [min, max]
## [1] 1 168
Missing data handling
• Could fill a separate semester-long course
• Missingness mechanisms:
– Missing Completely at Random (MCAR)
• Missingness is unrelated to any data values, observed or unobserved
– Missing at Random (MAR)
– Missing Not at Random (MNAR)
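A first look at missingness can be sketched with base R, again assuming the airquality data. Note that counts like these only describe where values are missing; they cannot by themselves establish which mechanism is at work.

```r
# Number of missing values in each variable
na_counts <- colSums(is.na(airquality))
na_counts

# Proportion of missing Ozone values and number of fully observed rows
ozone_na_rate <- mean(is.na(airquality$Ozone))
n_complete    <- sum(complete.cases(airquality))

# Informal check: does the Ozone missingness rate differ by month?
na_rate_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, mean)
na_rate_by_month
```

If the missingness rate varies strongly with an observed variable such as Month, MCAR becomes less plausible.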
Important statistical assumptions
• Normality
o Why is a normality check important?
1) A t-test or ANOVA relies on the normality
assumption
2) In correlation and regression, non-normality
and outliers can affect your conclusions

• Normal distribution is symmetric, bell-shaped
o The converse is NOT true: e.g., the Cauchy distribution and t-distribution
are symmetric and bell-shaped but not normal
o Many tests are available to check for
normality and outliers in the data.
Inference based on Normality
• Under the normality assumption, we can perform
the following tests.
✓One-sample t-test
(e.g., test if iPhone battery life span > 2 years)
✓Two-sample t-test
(e.g., test if iPhone and Galaxy have the same life span)
✓ANOVA test (simply speaking, comparing group means
among more than two groups)
(e.g., test among iPhone, Galaxy and Android phone)
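The three tests can be sketched in R as follows. Since no phone data is given, the sketch substitutes the Wind variable of the built-in airquality data, grouped by Month; the reference value of 10 mph is an arbitrary assumption chosen only for illustration.

```r
wind_may  <- airquality$Wind[airquality$Month == 5]
wind_june <- airquality$Wind[airquality$Month == 6]

# One-sample t-test: is the mean May wind speed greater than 10 mph? (10 is arbitrary)
one_sample <- t.test(wind_may, mu = 10, alternative = "greater")

# Two-sample t-test: do May and June have the same mean wind speed?
# (R uses the Welch version by default, which does not assume equal variances)
two_sample <- t.test(wind_may, wind_june)

# One-way ANOVA: are the mean wind speeds equal across all five months?
anova_fit <- aov(Wind ~ factor(Month), data = airquality)
summary(anova_fit)
```

Month is wrapped in `factor()` so that it is treated as a grouping variable rather than a numeric predictor.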
Detection of Normality
• How to check normality?
✓ Qualitative check by looking at:
: histogram, boxplot, quantile-quantile plot (Q-Q plot), etc.
✓ Quantitative check by formal test
: Shapiro-Wilk test …

• For a comparison among groups (e.g., t-test, ANOVA), the
normality check should be conducted within each group
• If at least one group does not follow normality, t-test or
ANOVA conclusions may NOT be valid.
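A by-group normality check might look like the sketch below. It assumes Ozone grouped by Month in the airquality data, and a significance level of 0.05.

```r
# Visual check: one boxplot per group
boxplot(Ozone ~ Month, data = airquality, xlab = "Month", ylab = "Ozone")

# Formal check: Shapiro-Wilk p-value within each month
# (shapiro.test() drops missing values itself)
p_by_month <- tapply(airquality$Ozone, airquality$Month,
                     function(x) shapiro.test(x)$p.value)
p_by_month

# Months in which normality is rejected at the 0.05 level
rejected <- p_by_month < 0.05
rejected
```

Small groups give the test little power, so a large p-value in a small group is weak evidence of normality.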
qqnorm(airquality$Ozone); qqline(airquality$Ozone, col = 2)

Quantile-Quantile plots (a.k.a. Q-Q plots): a useful diagnostic of how well a
specified theoretical distribution fits your data. If the quantiles of the
theoretical and data distributions agree, the plotted points fall on or near
the line.
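To see what a normal Q-Q plot is made of, it can be built by hand. This sketch pairs the sorted Ozone values with standard normal quantiles at the probabilities returned by `ppoints()`, which is what `qqnorm()` uses internally.

```r
# Sample quantiles: the sorted, non-missing data
ozone <- sort(airquality$Ozone[!is.na(airquality$Ozone)])
n     <- length(ozone)

# Theoretical quantiles: standard normal quantiles at matching probabilities
theoretical <- qnorm(ppoints(n))

# Manual Q-Q plot with the usual reference line
plot(theoretical, ozone,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(ozone, col = 2)   # line through the first and third quartiles
```

For Ozone the upper points bend away from the line, the typical pattern for right-skewed data.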
Shapiro-Wilk Normality test
shapiro.test(airquality$Ozone)
##
## Shapiro-Wilk normality test
##
## data: airquality$Ozone
## W = 0.87867, p-value = 2.79e-08

H0: Data follow a normal distribution
H1: Data do not follow a normal distribution
• If the p-value is larger than the significance level (in general α=0.05), we
do not have enough evidence to reject the null hypothesis; the data are
consistent with a normal distribution (this does not prove normality)
• If the p-value is smaller than the significance level, we have enough
evidence to reject the null hypothesis, thus our conclusion is that the
data do not follow a normal distribution (as for Ozone above, p = 2.79e-08)
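The decision rule can be written out explicitly. A minimal sketch, assuming a significance level of 0.05 and the Ozone variable used above:

```r
alpha  <- 0.05                               # assumed significance level
result <- shapiro.test(airquality$Ozone)     # missing values are dropped

if (result$p.value < alpha) {
  decision <- "reject H0: evidence against normality"
} else {
  decision <- "fail to reject H0: no evidence against normality"
}
decision
```

Failing to reject H0 is not proof of normality, so the wording of the second branch is deliberately cautious.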