Exploratory Data Analysis Reference

The document outlines the importance of Exploratory Data Analysis (EDA) and Data Visualization in understanding data distributions, identifying outliers, and assessing data quality. It covers various visualization techniques for single and multiple variables, including histograms, scatterplots, and boxplots, emphasizing the need for careful interpretation of visual data representations. Additionally, it discusses advanced methods like dimension reduction and parallel coordinates for visualizing high-dimensional data.

Uploaded by

Manish Verma


School of Computing Science and Engineering

Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Exploratory Data Analysis and Data Visualization
Credits: Chris Volinsky - Columbia University

2
Outline

• EDA
• Visualization
– One variable
– Two variables
– More than two variables
– Other types of data
– Dimension reduction

3
EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are very important steps in any analysis task.

• Get to know your data!

– distributions (symmetric, normal, skewed)
– data quality problems
– outliers
– correlations and inter-relationships
– subsets of interest
– suggest functional relationships

• Sometimes EDA or viz might be the goal!

4
Data Visualization – cake bakery

5
Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
– means, medians, quantiles, histograms, boxplots
• You should always look at every variable: you will learn something!
• data-driven (model-free)
• Think interactive and visual
– Humans are the best pattern recognizers
– You can use more than 2 dimensions!
• x, y, z, space, color, time, ...

• Especially useful in early stages of data mining

– detect outliers (e.g. assess data quality)
– test assumptions (e.g. normal distributions or skewed?)
– identify useful raw data & transforms (e.g. log(x))

• Bottom line: it is always well worth looking at your data!

6
Summary Statistics
• not visual
• sample statistics of data X
– mean: $\mu = \sum_i X_i / n$
– mode: most common value in X
– median: X = sort(X), median = $X_{n/2}$ (half below, half above)
– quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
• interquartile range: value(Q3) - value(Q1)
• range: max(X) - min(X) = $X_n - X_1$
– variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
– skewness: $\left[ \sum_i (X_i - \mu)^3 / n \right] / \left[ \sum_i (X_i - \mu)^2 / n \right]^{3/2}$
• zero if symmetric; right-skewed more common (what kind of data is right skewed?)
– number of distinct values for a variable (see unique() in R)
– You don't need to report all of these. Bottom line: do these numbers make sense?
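The statistics listed above can be computed directly in base R. The following is a minimal sketch on a made-up sample (the values of `x` are hypothetical). It assumes the divisor-n form of the variance and the moment-based skewness, whereas R's own `var()` divides by n - 1, and it uses R's default quantile rule, which interpolates rather than picking a single sorted value.

```r
# Hypothetical sample; the values are made up for illustration.
x <- c(2, 3, 3, 4, 5, 5, 5, 7, 9, 21)
n <- length(x)

mu     <- sum(x) / n                              # mean
sigma2 <- sum((x - mu)^2) / n                     # variance with divisor n
skew   <- (sum((x - mu)^3) / n) / sigma2^(3 / 2)  # moment-based skewness
med    <- median(x)                               # middle of the sorted values
q      <- quantile(x, c(0.25, 0.75))              # R's default (interpolating) rule
iqr    <- unname(q[2] - q[1])                     # interquartile range
rng    <- max(x) - min(x)                         # range

tab    <- table(x)
mode_x <- as.numeric(names(tab)[which.max(tab)])  # most common value (first if tied)
n_distinct <- length(unique(x))                   # number of distinct values

print(c(mean = mu, median = med, mode = mode_x, variance = sigma2,
        skewness = skew, IQR = iqr, range = rng, distinct = n_distinct))
```

A mean well above the median, together with a positive skewness, is the numeric signature of a right-skewed sample like this one.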
Single Variable Visualization
• Histogram:
– Shows center, variability, skewness, modality, outliers, or strange patterns.
– Bin width and position matter
– Beware of real zeros

8
Issues with Histograms

• For small data sets, histograms can be misleading.
– Small changes in the data, bins, or anchor can deceive
• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
• Histograms effectively only work with 1 variable at a time
– But ‘small multiples’ can be effective
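To see how much the anchor can matter on a small sample, the sketch below bins the same 30 simulated values twice, with identical bin widths but shifted bin edges. The seed, sample size, and break points are arbitrary choices for illustration, not taken from the slides.

```r
set.seed(1)                 # arbitrary seed so the sketch is repeatable
x <- rnorm(30)              # a deliberately small simulated sample

# Same data, same bin width (0.5), but the bin edges are shifted by 0.25.
breaks_a <- seq(-5, 5, by = 0.5)
breaks_b <- seq(-5.25, 5.25, by = 0.5)

h_a <- hist(x, breaks = breaks_a, plot = FALSE)
h_b <- hist(x, breaks = breaks_b, plot = FALSE)

# Compare the non-empty bar heights of the two histograms.
print(h_a$counts[h_a$counts > 0])
print(h_b$counts[h_b$counts > 0])

# Small multiples: one panel per anchor, drawn side by side.
op <- par(mfrow = c(1, 2))
plot(h_a, main = "edges include 0")
plot(h_b, main = "edges include 0.25")
par(op)
```

Both histograms contain all 30 points, yet the bar heights, and possibly the apparent number of modes, can differ between the two panels.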

9
But be careful with axes and scales!

10
Smoothed Histograms - Density Estimates
• Kernel estimates smooth out the contribution of each datapoint over a local neighborhood of that point.

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

h is the kernel width

• Gaussian kernel is common:

$$K\left(\frac{x - x_i}{h}\right) = C\, e^{-\frac{1}{2}\left(\frac{x - x_i}{h}\right)^2}$$
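The kernel estimate can be written out in a few lines of R. This is a sketch under stated assumptions: the helper name `kde`, the simulated data, and the bandwidth 0.4 are all illustrative choices, and `dnorm` supplies the Gaussian kernel with the constant C = 1/sqrt(2*pi). The result is cross-checked against R's built-in `density()`, which treats `bw` as the standard deviation of the kernel.

```r
# Kernel density estimate at the points `x`, from data `xi`, with bandwidth `h`.
# `kde` is a hypothetical helper written for this sketch.
kde <- function(x, xi, h) {
  sapply(x, function(x0) sum(dnorm((x0 - xi) / h)) / (length(xi) * h))
}

set.seed(1)
xi    <- rnorm(200)                     # simulated data, for illustration only
grid  <- seq(-4, 4, length.out = 201)   # points at which to evaluate the estimate
h     <- 0.4                            # arbitrary bandwidth
f_hat <- kde(grid, xi, h)

# Cross-check against the built-in estimator on the same grid.
d <- density(xi, bw = h, kernel = "gaussian", from = -4, to = 4, n = 201)
print(max(abs(d$y - f_hat)))            # small; density() uses a binned approximation

plot(grid, f_hat, type = "l", xlab = "x", ylab = "estimated density")
```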

11
Bandwidth choice is an art

Usually want to try several
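Trying several bandwidths is cheap with `density()`. The sketch below uses a simulated two-cluster sample and three arbitrary bandwidths (all assumptions for illustration), and counts local maxima of each curve as a rough indication of how wiggly the estimate is.

```r
set.seed(2)
# Simulated bimodal sample: two normal clusters (illustrative only).
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))

bandwidths <- c(0.1, 0.5, 2)   # arbitrary: narrow, moderate, wide
fits <- lapply(bandwidths, function(h) density(x, bw = h))

# Count local maxima of each curve as a rough measure of wiggliness.
n_peaks <- sapply(fits, function(d) sum(diff(sign(diff(d$y))) == -2))
names(n_peaks) <- bandwidths
print(n_peaks)

# Overlay the three estimates on one plot.
plot(fits[[1]], main = "Same data, three bandwidths")
lines(fits[[2]], lty = 2)
lines(fits[[3]], lty = 3)
```

A narrow bandwidth tends to show many spurious bumps, while a very wide one can merge the two clusters into a single mode.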

12
Boxplots

• Shows a lot of information about a variable in one plot
– Median
– IQR
– Outliers
– Range
– Skewness
• Negatives
– Overplotting
– Hard to tell distributional shape
– no standard implementation in software (many options for whiskers, outliers)
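The point about whisker conventions can be checked with base R's `boxplot.stats()`, whose `coef` argument sets how far the whiskers may reach beyond the box. The data below are hypothetical, and the two `coef` values are just two of many possible conventions.

```r
# Hypothetical data with one large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 20)

# Whiskers extend to the most extreme point within coef * (box height)
# of the box; points beyond that are flagged as outliers.
default_rule <- boxplot.stats(x, coef = 1.5)
wide_rule    <- boxplot.stats(x, coef = 3)

print(default_rule$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(default_rule$out)    # points drawn individually under the 1.5 rule
print(wide_rule$out)       # points drawn individually under the wider rule

op <- par(mfrow = c(1, 2))
boxplot(x, range = 1.5, main = "range = 1.5")
boxplot(x, range = 3, main = "range = 3")
par(op)
```

The same value is an outlier under one rule and the end of a whisker under the other, so it is worth stating which rule a given boxplot uses.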

13
Time Series
If your data has a temporal component, be sure to exploit it

[Figure: air travel time series, annotated with: steady growth trend; summer peaks; summer bifurcations in air travel (favor early/late); New Year bumps]

14
Time-Series Example 3

Scotland experiment: “milk in kid diet → better health”?
20,000 kids: 5k raw, 5k pasteurized, 10k control (no supplement)

[Figure: mean weight vs mean age for 10k control group]

Would expect smooth weight growth plot.
Visually reveals unexpected pattern (steps), not apparent from raw data table.

Possible explanations:
Grow less early in year than later?
No steps in height plots; so why does height increase uniformly but weight in spurts?
Kids weighed in clothes: summer garb lighter than winter?

15
Spatial Data

• If your data has a geographic component, be sure to exploit it

• Data from cities/states/zip codes – easy to get lat/long

• Can plot as scatterplot

16
Spatial data: choropleth Maps

• Maps using color shadings to represent numerical values are called choropleth maps
• http://elections.nytimes.com/2008/results/president/map.html

17
Two Continuous Variables

• For two numeric variables, the scatterplot is the obvious choice

[Figure annotation: “interesting?”]

18
2D Scatterplots

• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
– x,y related?
• linear
• quadratic
• other
– variance(y) depend on x?
– outliers present?

[Figure annotation: “interesting?”]

19
Scatter Plot: No apparent relationship

20
Scatter Plot: Linear relationship

21
Scatter Plot: Quadratic relationship

22
Scatter plot: Homoscedastic

Why is this important in classical statistical modelling?

23
Scatter plot: Heteroscedastic

variation in Y differs depending on the value of X

e.g., Y = annual tax paid, X = income
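One reason the distinction matters is that classical least-squares inference assumes a constant error variance. The sketch below simulates both cases and compares the residual spread for large and small x after a linear fit. The slope, noise levels, and the helper `spread_ratio` are all assumptions made for illustration.

```r
set.seed(3)
n <- 500
x <- runif(n, 1, 10)

y_homo   <- 2 * x + rnorm(n, sd = 2)         # constant noise
y_hetero <- 2 * x + rnorm(n, sd = 0.5 * x)   # noise grows with x

# Ratio of residual spread for large x versus small x after a linear fit.
# `spread_ratio` is a hypothetical helper written for this sketch.
spread_ratio <- function(x, y) {
  r <- resid(lm(y ~ x))
  sd(r[x > median(x)]) / sd(r[x <= median(x)])
}

ratio_homo   <- spread_ratio(x, y_homo)      # expected to be near 1
ratio_hetero <- spread_ratio(x, y_hetero)    # expected to be clearly above 1
print(c(homoscedastic = ratio_homo, heteroscedastic = ratio_hetero))

op <- par(mfrow = c(1, 2))
plot(x, y_homo, main = "Homoscedastic")
plot(x, y_hetero, main = "Heteroscedastic")
par(op)
```

In the heteroscedastic panel the point cloud fans out as x grows, which is the visual cue to look for before trusting standard errors from an ordinary regression.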

24
Two variables - continuous

• Scatterplots
– But can be bad with lots of data

25
Two variables - continuous

• What to do for large data sets

– Contour plots
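A contour plot of binned counts is one way to summarise a scatter that has too many points to draw individually. The following sketch uses simulated data and a fixed 40 x 40 grid. The sample size, grid limits, and correlation are arbitrary assumptions, and points outside the grid would be dropped.

```r
set.seed(4)
n <- 1e5                                   # too many points for a raw scatterplot
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)

# Count points in a 40 x 40 grid of cells covering [-6, 6] in each direction.
edges  <- seq(-6, 6, length.out = 41)
counts <- table(cut(x, edges), cut(y, edges))
z      <- matrix(as.numeric(counts), nrow = 40)   # rows index x, columns index y

# Cell midpoints give the coordinates for the contour lines.
mids <- (edges[-1] + edges[-41]) / 2
contour(mids, mids, z, xlab = "x", ylab = "y")
```

The contour lines trace regions of equal point density, so the overall shape and orientation of the cloud stay visible even when individual points would overplot completely.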

26
Transparent plotting
Alpha-blending:
• plot(rnorm(1000), rnorm(1000), col = "#0000ff22", pch = 16, cex = 3)

27
Jittering

• Jittering points helps too

• plot(age, TimesPregnant)
• plot(jitter(age), jitter(TimesPregnant))

28
Displaying Two Variables

• If one variable is categorical, use small multiples

• Many software packages have this implemented as ‘lattice’ or ‘trellis’ packages
library("lattice")
histogram(~DiastolicBP | TimesPregnant == 0)

29
Two Variables - one categorical

• Side by side boxplots are very effective in showing differences in a quantitative variable across factor levels
– tips data
• do men or women tip better
– orchard sprays
• measuring potency of various orchard sprays in repelling honeybees
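R ships with a data set called `OrchardSprays`, which appears to match the orchard sprays example mentioned above (that identification is an assumption). A minimal sketch of side-by-side boxplots, with the per-treatment medians printed as a numeric check:

```r
# OrchardSprays comes with R's datasets package; `decrease` is the response
# and `treatment` is a factor with eight levels (A to H).
data(OrchardSprays)

# One box per treatment, drawn side by side on a common scale.
boxplot(decrease ~ treatment, data = OrchardSprays,
        xlab = "treatment", ylab = "decrease")

# The same comparison as numbers: median response per treatment.
medians <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, median)
print(medians)
```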

30
Barcharts and Spineplots

stacked barcharts can be used to compare continuous values across two or more categorical ones.

[Figure legend: orange = M, blue = F]

spineplots show proportions well, but can be hard to interpret

31
More than two variables
Pairwise scatterplots

Can be somewhat ineffective for categorical data
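In base R a pairwise scatterplot matrix is a single call to `pairs()`. The sketch below uses the built-in `iris` data as a convenient example (an assumption, since the slide does not say which data its figure shows) and colours points by the categorical variable, which is often more useful than giving that variable its own panel.

```r
# Scatterplot matrix of the four numeric iris measurements,
# coloured by species (the categorical variable).
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 16)

# The numeric counterpart: pairwise correlations.
cors <- cor(iris[, 1:4])
print(round(cors, 2))
```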

32
33
Multivariate: More than two variables
• Get creative!
• Conditioning on variables
– trellis or lattice plots
– Cleveland models on human perception, all based on conditioning
– Infinite possibilities

• Earthquake data:
– locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964
– Data collected on the severity of the earthquake
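The earthquake data described above matches the `quakes` data set that ships with R. The sketch below conditions on depth with base R's `coplot()`. The choice of four non-overlapping depth intervals is arbitrary, and whether the slides used `coplot()` or lattice is not stated.

```r
# `quakes` ships with R; its columns are lat, long, depth, mag and stations.
data(quakes)

# Condition on depth: one lat/long panel per depth interval.
coplot(lat ~ long | depth, data = quakes, number = 4, overlap = 0)

# The same conditioning done by hand: mean magnitude per depth band.
depth_band <- cut(quakes$depth, breaks = 4)
mean_mag   <- tapply(quakes$mag, depth_band, mean)
print(mean_mag)
```

Each panel shows only the events in one depth band, so spatial structure that differs by depth is not hidden by overplotting.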

34
35
36
How many dimensions are represented here?

[Figure credit: Andrew Gelman blog, 7/15/2009]

37

Multivariate Vis: Parallel Coordinates

Petal, a non-reproductive part of the flower

Sepal, a non-reproductive part of the flower

The famous iris data!

38
Parallel Coordinates

[Figure: a single vertical axis, Sepal Length, with the value 5.1 marked]

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2

39
Parallel Coordinates: 2 D

[Figure: two parallel axes, Sepal Length (5.1) and Sepal Width (3.5), for the same observation]

40
Parallel Coordinates: 4 D

[Figure: four parallel axes, Sepal Length (5.1), Sepal Width (3.5), Petal Length (1.4) and Petal Width (0.2), for the same observation]

41
Parallel Visualization of Iris data

[Figure: parallel coordinates plot of the iris data, with the values 5.1, 3.5, 1.4 and 0.2 marked]

42
Multivariate: Parallel coordinates

Alpha blending can be effective

Courtesy Unwin, Theus, Hofmann

43
Parallel coordinates
• Useful in an interactive setting
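A static parallel coordinates plot of the iris data can be drawn with base graphics alone. In this sketch each measurement is rescaled to [0, 1] so the four axes share one vertical scale, and translucent colours provide the alpha blending. The colour choices and the rescaling are assumptions for illustration. The MASS package also offers a ready-made `parcoord()` function.

```r
# Rescale each measurement to [0, 1] so that all axes share one vertical scale.
vars   <- iris[, 1:4]
scaled <- as.data.frame(lapply(vars, function(v) (v - min(v)) / (max(v) - min(v))))

# One polyline per flower; translucent colours reduce the effect of overplotting.
cols <- c(rgb(1, 0, 0, 0.3), rgb(0, 0.6, 0, 0.3), rgb(0, 0, 1, 0.3))
lines_matrix <- t(as.matrix(scaled))       # 4 rows (axes) x 150 columns (flowers)

matplot(lines_matrix, type = "l", lty = 1,
        col = cols[as.integer(iris$Species)],
        xaxt = "n", xlab = "", ylab = "scaled value")
axis(1, at = 1:4, labels = names(vars))
```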

44
Networks and Graphs

• Visualizing networks is helpful, even if it is not obvious that a network exists

45
Network Visualization
• Graphviz (open source software) is a nice layout tool for big and small graphs

46
What’s missing?

• pie charts
– very popular
– good for showing simple relations of proportions
– Human perception not good at comparing arcs
– barplots, histograms usually better (but less pretty)

• 3D
– nice to be able to show three dimensions
– hard to do well
– often done poorly
– 3D best shown through “spinning” in 2D
• uses various types of projection into 2D
• http://www.stat.tamu.edu/~west/bradley/

47
Worst graphic in the world?

48
Dimension Reduction

• One way to visualize high dimensional data is to reduce it to 2 or 3 dimensions

– Variable selection
• e.g. stepwise
– Principal Components
• find linear projection onto p-space with maximal variance
– Multi-dimensional scaling
• takes a matrix of (dis)similarities and embeds the points in p-dimensional space to retain those similarities
More on this in next Topic
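Both projection methods listed above are available in base R as `prcomp()` and `cmdscale()`. The sketch below reduces the four iris measurements to p = 2 dimensions. Using iris, unscaled variables, and Euclidean distances are assumptions made for illustration.

```r
X <- as.matrix(iris[, 1:4])        # iris is used purely as a convenient example

# Principal components: project onto the 2 directions of maximal variance.
pca           <- prcomp(X, center = TRUE, scale. = FALSE)
pc_scores     <- pca$x[, 1:2]
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Classical multidimensional scaling: start from pairwise distances only.
mds_coords <- cmdscale(dist(X), k = 2)

# With Euclidean distances the two embeddings agree up to the sign of each axis.
embedding_gap <- max(abs(abs(pc_scores) - abs(mds_coords)))
print(embedding_gap)

plot(pc_scores, col = as.integer(iris$Species), pch = 16,
     xlab = "PC1", ylab = "PC2")
```

The agreement holds because classical scaling of Euclidean distances recovers the principal component scores. With other dissimilarities the two embeddings generally differ.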

49
Visualization done right

• Hans Rosling @ TED

• http://www.youtube.com/watch?v=jbkSRLYSojo
