Exploratory Data Analysis Reference
Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals
Exploratory Data Analysis and Data Visualization
Credits: Chris Volinsky - Columbia University
Outline
• EDA
• Visualization
– One variable
– Two variables
– More than two variables
– Other types of data
– Dimension reduction
EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are very important steps in any analysis task.
Data Visualization – cake bakery
Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
– means, medians, quantiles, histograms, boxplots
• You should always look at every variable - you will learn something!
• Data-driven (model-free)
• Think interactive and visual
– Humans are the best pattern recognizers
– You can use more than 2 dimensions!
• x, y, z, space, color, time, …
Summary Statistics
• not visual
• sample statistics of data X
– mean: $\bar{X} = \sum_i X_i / n$
– mode: most common value in X
– median: X = sort(X), median = $X_{n/2}$ (half below, half above)
– quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
• interquartile range: value(Q3) - value(Q1)
• range: max(X) - min(X) = $X_n - X_1$
– variance: $\sigma^2 = \sum_i (X_i - \bar{X})^2 / n$
– skewness: $\sum_i (X_i - \bar{X})^3 / \left[ \sum_i (X_i - \bar{X})^2 \right]^{3/2}$
• zero if symmetric; right-skewed more common (what kind of data is right skewed?)
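The statistics listed above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function name `summary_stats` is hypothetical, quartiles use a simple "value at position p·n" rule (real packages usually interpolate), and the skewness follows the expression as written on the slide, which differs from the usual moment coefficient by a factor of sqrt(n).

```python
from collections import Counter


def summary_stats(xs):
    """Sample statistics for a non-empty list of numbers (illustrative sketch)."""
    n = len(xs)
    s = sorted(xs)
    mean = sum(s) / n
    # Mode: most common value; ties go to the smallest value seen first.
    mode = Counter(s).most_common(1)[0][0]

    def at_fraction(p):
        # Value at (0-based) position p*n of the sorted data, clamped to the end.
        # This is one convention among several; no interpolation is done.
        return s[min(n - 1, int(p * n))]

    q1, median, q3 = at_fraction(0.25), at_fraction(0.5), at_fraction(0.75)
    ss = sum((x - mean) ** 2 for x in s)  # sum of squared deviations
    sc = sum((x - mean) ** 3 for x in s)  # sum of cubed deviations
    return {
        "mean": mean,
        "mode": mode,
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": s[-1] - s[0],
        "variance": ss / n,  # divides by n, as in the text
        # Slide's expression; multiply by sqrt(n) for the usual coefficient.
        "skewness": sc / ss ** 1.5 if ss > 0 else 0.0,
    }


if __name__ == "__main__":
    # A single large value pulls the mean above the median (right skew).
    print(summary_stats([1, 2, 2, 3, 10]))
```

A positive skewness together with a mean above the median is the typical signature of right-skewed data.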
Issues with Histograms
But be careful with axes and scales!
Smoothed Histograms - Density Estimates
• Kernel estimates smooth out the contribution of each datapoint over a local neighborhood of that point.
$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$
h is the kernel width
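The kernel estimate above can be evaluated directly. The sketch below is in Python and assumes a Gaussian kernel for K, which the slide does not specify; the names `gaussian_kernel` and `kde` are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, h):
    """Kernel density estimate at x: (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)


if __name__ == "__main__":
    data = [1.0, 2.0, 4.0]
    # Same data, several kernel widths: small h gives spiky estimates,
    # large h gives flat ones.
    for h in (0.2, 0.5, 2.0):
        print(h, [round(kde(x, data, h), 3) for x in (1.0, 2.0, 3.0, 4.0)])
```

Evaluating the same data at several values of `h` shows how much the estimate depends on the kernel width.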
Bandwidth choice is an art
Usually want to try several
Boxplots
• Shows a lot of information about a variable in one plot
– Median
– IQR
– Outliers
– Range
– Skewness
• Negatives
– Overplotting
– Hard to tell distributional shape
– no standard implementation in software (many options for whiskers, outliers)
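Because whisker and outlier rules vary between packages, it helps to see one convention spelled out. The Python sketch below uses the common 1.5 × IQR fence rule and the "inclusive" quantile method from the standard library; both are assumptions chosen for illustration, and the function name `boxplot_stats` is hypothetical.

```python
import statistics


def boxplot_stats(xs, whisker=1.5):
    """Five-number style summary using 1.5*IQR fences (one convention of many)."""
    s = sorted(xs)
    # Quartiles by linear interpolation, treating the data as the full population.
    q1, median, q3 = statistics.quantiles(s, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [x for x in s if low_fence <= x <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        # Whiskers extend to the most extreme points still inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Points beyond the fences are drawn individually as outliers.
        "outliers": [x for x in s if x < low_fence or x > high_fence],
    }


if __name__ == "__main__":
    print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Changing `whisker` or the quantile method changes which points are flagged, which is why boxplots of the same data can look different across tools.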
Time Series
If your data has a temporal component, be sure to exploit it.
Figure annotations (air travel series): steady growth trend; summer peaks; summer bifurcations in air travel (favor early/late)
Time-Series Example 3
Scotland experiment: “milk in kid diet → better health”?
20,000 kids: 5k raw, 5k pasteurized, 10k control (no supplement)
Would expect smooth weight growth plot.
Visually reveals unexpected pattern (steps), not apparent from raw data table.
Possible explanations:
Grow less early in year than later?
No steps in height plots; so why height uniformly, weight spurts?
Kids weighed in clothes: summer garb lighter than winter?
Spatial Data
• Data from cities/states/zip codes – easy to get lat/long
• Can plot as scatterplot
Spatial data: choropleth maps
• Maps using color shadings to represent numerical values are called choropleth maps
• http://elections.nytimes.com/2008/results/president/map.html
Two Continuous Variables
Figure annotations (two panels): interesting?
2D Scatterplots
Figure annotations (two panels): interesting?
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship
Scatter Plot: Quadratic relationship
Scatter Plot: Homoscedastic
Scatter Plot: Heteroscedastic
Two variables - continuous
• Scatterplots
– But can be bad with lots of data
Transparent plotting
Alpha-blending:
• plot(rnorm(1000), rnorm(1000), col = "#0000ff22", pch = 16, cex = 3)  # last two hex digits of col set the alpha (transparency)
Jittering
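Jittering adds a small amount of random noise to each value so that points sharing the same coordinates no longer sit on top of each other. A minimal Python sketch, assuming uniform noise and a hypothetical helper name `jitter`:

```python
import random


def jitter(values, amount=0.2, seed=None):
    """Return a copy of values with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)  # seeded generator keeps the plot reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


if __name__ == "__main__":
    # Discrete ratings overlap heavily in a scatterplot; jittered copies do not.
    ratings = [1, 1, 1, 2, 2, 3]
    print(jitter(ratings, amount=0.2, seed=42))
```

The noise is for display only; statistics should still be computed on the original values, and `amount` should stay small relative to the spacing between distinct values.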
Displaying Two Variables
• If one variable is categorical, use small multiples
• Many software packages have this implemented as ‘lattice’ or ‘trellis’ packages
library(lattice)
# one panel per value of the condition; supply data = ... if the columns are not attached
histogram(~DiastolicBP | TimesPregnant == 0)
Two Variables - one categorical
Barcharts and Spineplots
stacked barcharts can be used to compare continuous values across two or more categorical ones.
orange=M blue=F
spineplots show proportions well, but can be hard to interpret
More than two variables
Pairwise scatterplots
Can be somewhat ineffective for categorical data
Multivariate: More than two variables
• Get creative!
• Conditioning on variables
– trellis or lattice plots
– Cleveland models on human perception, all based on conditioning
– Infinite possibilities
• Earthquake data:
– locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964
– Data collected on the severity of the earthquake
How many dimensions are represented here?
Petal, a non-reproductive part of the flower
Sepal, a non-reproductive part of the flower
Parallel Coordinates
Figure residue: one observation drawn across parallel axes, with Sepal Length = 5.1 and Sepal Width = 3.5, followed by the values 1.4 and 0.2 on two further axes
Multivariate: Parallel coordinates
Alpha blending can be effective
Networks and Graphs
Network Visualization
• Graphviz (open source software) is a nice layout tool for big and small graphs
What’s missing?
• pie charts
– very popular
– good for showing simple relations of proportions
– Human perception not good at comparing arcs
– barplots, histograms usually better (but less pretty)
• 3D
– nice to be able to show three dimensions
– hard to do well
– often done poorly
– 3D best shown through “spinning” in 2D
• uses various types of projecting into 2D
• http://www.stat.tamu.edu/~west/bradley/
Worst graphic in the world?
Dimension Reduction
– Variable selection
• e.g. stepwise
– Principal Components
• find linear projection onto p-space with maximal variance
– Multi-dimensional scaling
• takes a matrix of (dis)similarities and embeds the points in p-dimensional space to retain those similarities
More on this in next Topic
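As a rough illustration of the principal components idea, the Python sketch below finds only the first component (the p = 1 case): the single direction along which the centered data has maximal variance. It uses power iteration on the sample covariance matrix, which is one simple way to do this and not necessarily what a statistics package does; the function name `first_principal_component` and its parameters are hypothetical.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (direction, variance_along_direction, projected_scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (divides by n - 1).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Random start so the vector is unlikely to be orthogonal to the answer.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all variances are zero; nothing to find
            break
        v = [x / norm for x in w]  # repeated multiply-and-normalize
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, variance, scores


if __name__ == "__main__":
    # Points lying exactly on the line y = 2x: one direction carries all variance.
    direction, variance, scores = first_principal_component(
        [(0, 0), (1, 2), (2, 4), (3, 6)])
    print(direction, variance, scores)
```

The sign of the returned direction is arbitrary, so only its orientation is meaningful. Projecting onto more than one component would need an extra step (for example, removing the first direction and repeating), which is omitted here.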
Visualization done right
• http://www.youtube.com/watch?v=jbkSRLYSojo