Exploratory Data Analysis Reference

The document outlines the importance of Exploratory Data Analysis (EDA) and Data Visualization in understanding data distributions, identifying outliers, and assessing data quality. It covers various visualization techniques for single and multiple variables, including histograms, scatterplots, and boxplots, emphasizing the need for careful interpretation of visual data representations. Additionally, it discusses advanced methods like dimension reduction and parallel coordinates for visualizing high-dimensional data.

Uploaded by

Manish Verma
Copyright
© All Rights Reserved

School of Computing Science and Engineering

Program: M.C.A.
Course Code: MCAS9220
Course Name: Data Science Fundamentals

Exploratory Data Analysis and Data Visualization
Credits: Chris Volinsky - Columbia University

2
Outline

• EDA
• Visualization
– One variable
– Two variables
– More than two variables
– Other types of data
– Dimension reduction

3
EDA and Visualization
• Exploratory Data Analysis (EDA) and
Visualization are very important steps in any
analysis task.

• get to know your data!


– distributions (symmetric, normal, skewed)
– data quality problems
– outliers
– correlations and inter-relationships
– subsets of interest
– suggest functional relationships

• Sometimes EDA or viz might be the goal!

4
Data Visualization – cake bakery

5
Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
– means, medians, quantiles, histograms, boxplots
• You should always look at every variable - you will learn
something!
• data-driven (model-free)
• Think interactive and visual
– Humans are the best pattern recognizers
– You can use more than 2 dimensions!
• x,y,z, space, color, time….

• Especially useful in early stages of data mining


– detect outliers (e.g. assess data quality)
– test assumptions (e.g. normal distributions or skewed?)
– identify useful raw data & transforms (e.g. log(x))

• Bottom line: it is always well worth looking at your data!

6
Summary Statistics
• not visual
• sample statistics of data X
– mean: $\mu = \sum_i X_i / n$
– mode: most common value in X
– median: X = sort(X), median = $X_{n/2}$ (half below, half above)
– quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
• interquartile range: value(Q3) - value(Q1)
• range: max(X) - min(X) = $X_n - X_1$
– variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
– skewness: $\left[\sum_i (X_i - \mu)^3 / n\right] / \left[\sum_i (X_i - \mu)^2 / n\right]^{3/2}$
• zero if symmetric; right-skewed more common (what kind of data is right skewed?)
– number of distinct values for a variable (see unique() in R)
– Don’t need to report all of these. Bottom line: do these numbers make sense?
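The sample statistics listed above can be sketched in base R as follows. The data vector is made up for illustration, and the divisor-n forms of the variance and skewness are an assumption about the intended normalization (R's own `var()` divides by n - 1, and `quantile()` interpolates between order statistics by default).

```r
# Sample statistics for a numeric vector (illustrative, right-skewed data).
x <- c(2, 3, 3, 5, 8, 13, 21)
n <- length(x)

mu         <- sum(x) / n                                # mean
mode_x     <- as.numeric(names(which.max(table(x))))    # most common value
med        <- median(x)                                 # middle of sort(x)
q          <- quantile(x, c(0.25, 0.75))                # Q1 and Q3 (default interpolation)
iqr        <- unname(q[2] - q[1])                       # interquartile range
rng        <- max(x) - min(x)                           # range
sigma2     <- sum((x - mu)^2) / n                       # variance with divisor n
skew       <- (sum((x - mu)^3) / n) / sigma2^(3 / 2)    # positive for right-skewed data
n_distinct <- length(unique(x))                         # number of distinct values

print(c(mean = mu, median = med, mode = mode_x, IQR = iqr,
        range = rng, variance = sigma2, skewness = skew,
        distinct = n_distinct))
```

For right-skewed data such as this, the mean sits above the median, which is a quick sanity check before looking at any plot.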
Single Variable Visualization
• Histogram:
– Shows center, variability, skewness, modality, outliers, or strange patterns.
– Bin width and position matter
– Beware of real zeros

8
Issues with Histograms

• For small data sets, histograms can be misleading.
– Small changes in the data, bins, or anchor can deceive

• For large data sets, histograms can be quite effective at illustrating general properties of the distribution.

• Histograms effectively only work with 1 variable at a time
– But ‘small multiples’ can be effective

9
But be careful with axes and scales!

10
Smoothed Histograms - Density Estimates
• Kernel estimates smooth out the contribution of each datapoint over a local neighborhood of that point.

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

h is the kernel width

• Gaussian kernel is common:

$$K\left(\frac{x - x_i}{h}\right) = C \, e^{-\frac{1}{2}\left(\frac{x - x_i}{h}\right)^2}$$
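The kernel estimate can be written out directly and compared with R's built-in `density()`. This is a minimal sketch: the data are simulated, the kernel width `h = 0.5` is an arbitrary choice, and `kde()` is a helper name introduced here. `dnorm()` supplies the Gaussian kernel, so the constant is C = 1/sqrt(2*pi).

```r
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))   # simulated bimodal sample
h <- 0.5                                             # kernel width (arbitrary)

# f_hat(x0) = 1/(n*h) * sum_i K((x0 - x_i) / h), with a Gaussian kernel K
kde <- function(x0, data, h) {
  sapply(x0, function(p) sum(dnorm((p - data) / h)) / (length(data) * h))
}

grid  <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)
f_hat <- kde(grid, x, h)

hist(x, freq = FALSE, breaks = 20, main = "Histogram with kernel density estimate")
lines(grid, f_hat, lwd = 2)            # hand-written estimate
lines(density(x, bw = h), lty = 2)     # built-in estimate with the same bandwidth
```

The two curves should nearly coincide, since `density()` uses a Gaussian kernel by default and interprets `bw` as the kernel's standard deviation.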

11
Bandwidth choice is an art

Usually want to try several
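Trying several bandwidths takes only a few lines in R. The sketch below uses simulated data and three arbitrarily chosen bandwidths; `bw.nrd0()` is the rule of thumb that `density()` uses by default and can serve as a starting point.

```r
set.seed(1)
x   <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))   # simulated bimodal sample
bws <- c(0.1, 0.5, 2)                                  # too small, moderate, too large

dens <- lapply(bws, function(b) density(x, bw = b))
ymax <- max(sapply(dens, function(d) max(d$y)))

plot(dens[[2]], ylim = c(0, ymax), main = "Same data, three bandwidths")
lines(dens[[1]], lty = 3)   # undersmoothed: spiky
lines(dens[[3]], lty = 2)   # oversmoothed: the two modes merge
legend("topright", legend = paste("bw =", bws), lty = c(3, 1, 2))

bw_default <- bw.nrd0(x)    # rule-of-thumb bandwidth
print(bw_default)
```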

12
Boxplots

• Shows a lot of information about a variable in one plot
– Median
– IQR
– Outliers
– Range
– Skewness
• Negatives
– Overplotting
– Hard to tell distributional shape
– no standard implementation in software (many options for whiskers, outliers)
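As one concrete instance of the whisker and outlier options, base R's `boxplot.stats()` exposes the rule through its `coef` argument: points further than `coef` times the box length from the box are flagged. The data below are simulated, with one planted outlier.

```r
set.seed(1)
x <- c(rnorm(50), 6)    # 50 standard-normal values plus one planted outlier

b15 <- boxplot.stats(x, coef = 1.5)   # R's default whisker rule
b30 <- boxplot.stats(x, coef = 3)     # a stricter rule flags fewer points

print(b15$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(b15$out)     # points drawn individually as outliers
print(b30$out)

boxplot(x, range = 1.5, horizontal = TRUE, main = "Whiskers at 1.5 x box length")
```

Other software may use different defaults, so it is worth checking the convention before comparing boxplots from two tools.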

13
Time Series
If your data has a temporal component, be sure to exploit it
[Figure: air travel time series, annotated with: steady growth trend; summer peaks; summer bifurcations in air travel (favor early/late); New Year bumps]
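A comparable picture can be produced with R's built-in `AirPassengers` series. This is a stand-in chosen for illustration, not necessarily the data behind the slide's figure; it shows a growth trend and summer peaks of the same kind.

```r
ap <- AirPassengers     # monthly airline passenger totals, a built-in ts object

# Raw series: trend and seasonality are visible at a glance
plot(ap, ylab = "Passengers (thousands)", main = "Monthly air passengers")

# Split the series into trend, seasonal and remainder components
parts <- decompose(ap, type = "multiplicative")
plot(parts)

# Seasonal factors by month (values above 1 mark busier-than-average months)
seasonal <- setNames(parts$figure, month.abb)
print(round(seasonal, 2))
```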

14
Time-Series Example 3

Scotland experiment: “milk in kid diet → better health”?
20,000 kids: 5k raw, 5k pasteurized, 10k control (no supplement)

[Figure: mean weight vs mean age for 10k control group]

Would expect smooth weight growth plot.
Visually reveals unexpected pattern (steps), not apparent from raw data table.

Possible explanations:
Grow less early in year than later?
No steps in height plots; so why does height increase uniformly but weight in spurts?
Kids weighed in clothes: summer garb lighter than winter?
Spatial Data

• If your data has a geographic component, be sure to exploit it

• Data from cities/states/zip codes – easy to get lat/long

• Can plot as scatterplot
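Plotting locations as a scatterplot needs nothing more than longitude on the x-axis and latitude on the y-axis. The sketch below uses R's built-in `state.center` and `state.abb` data as an illustrative stand-in for whatever geographic data are at hand.

```r
# state.center is a list with x = longitude and y = latitude of each US state
lon <- state.center$x
lat <- state.center$y

plot(lon, lat, pch = 16, asp = 1,
     xlab = "Longitude", ylab = "Latitude",
     main = "US state centers as a scatterplot")
text(lon, lat, labels = state.abb, pos = 3, cex = 0.6)   # label each point
```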

16
Spatial data: choropleth Maps

• Maps using color shadings to represent numerical values are called choropleth maps
• http://elections.nytimes.com/2008/results/president/map.html

17
Two Continuous Variables

• For two numeric variables, the scatterplot is the obvious choice
[Figure: two example scatterplots, each annotated “interesting?”]

18
2D Scatterplots

• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
– x,y related?
• linear
• quadratic
• other
– variance(y) depend on x?
– outliers present?
[Figure: two example scatterplots, each annotated “interesting?”]

19
Scatter Plot: No apparent relationship

20
Scatter Plot: Linear relationship

21
Scatter Plot: Quadratic relationship

22
Scatter plot: Homoscedastic

Why is this important in classical statistical modelling?

23
Scatter plot: Heteroscedastic

variation in Y differs depending on the value of X


e.g., Y = annual tax paid, X = income
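The distinction matters in classical modelling because ordinary least squares assumes a constant error variance; when the spread grows with X, the usual standard errors are no longer trustworthy. The sketch below simulates both cases (all numbers are arbitrary) and uses a hypothetical helper, `spread_ratio()`, to compare residual spread in the upper and lower halves of X.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y_homo   <- 2 * x + rnorm(n, sd = 2)         # constant noise level
y_hetero <- 2 * x + rnorm(n, sd = 0.5 * x)   # noise grows with x

op <- par(mfrow = c(1, 2))
plot(x, y_homo,   main = "Homoscedastic",   pch = 16, cex = 0.5)
plot(x, y_hetero, main = "Heteroscedastic", pch = 16, cex = 0.5)
par(op)

# Ratio of residual SD for large x to residual SD for small x (about 1 if constant)
spread_ratio <- function(x, y) {
  r <- resid(lm(y ~ x))
  sd(r[x > median(x)]) / sd(r[x <= median(x)])
}

ratio_homo   <- spread_ratio(x, y_homo)
ratio_hetero <- spread_ratio(x, y_hetero)
print(c(homoscedastic = ratio_homo, heteroscedastic = ratio_hetero))
```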

24
Two variables - continuous

• Scatterplots
– But can be bad with lots of data

25
Two variables - continuous

• What to do for large data sets


– Contour plots
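For large data sets, one way to get contours is to estimate a two-dimensional density and draw its level curves. The sketch below assumes the MASS package (shipped with standard R distributions) and uses simulated data; the grid size of 50 is an arbitrary choice.

```r
library(MASS)   # provides kde2d()

set.seed(1)
n <- 20000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)

# Two-dimensional kernel density estimate on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

plot(x, y, pch = ".", col = "grey",
     main = "20,000 points with density contours")
contour(dens, add = TRUE)   # level curves reveal structure hidden by overplotting
```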

26
Transparent plotting
Alpha-blending:
• plot(rnorm(1000), rnorm(1000), col = "#0000ff22", pch = 16, cex = 3)  # the last two hex digits (22) set the opacity

27
Jittering

• Jittering points helps too


• plot(age, TimesPregnant)
• plot(jitter(age), jitter(TimesPregnant))

28
Displaying Two Variables

• If one variable is categorical, use small multiples

• Many software packages have this implemented as ‘lattice’ or ‘trellis’ packages
library(lattice)
histogram(~DiastolicBP | TimesPregnant == 0)
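A self-contained version of the same idea, using the built-in `iris` data as a stand-in because the slide's data set is not included here: one histogram panel per level of the categorical variable.

```r
library(lattice)   # trellis graphics, shipped with standard R distributions

# One histogram of sepal length per species, on a shared scale
p <- histogram(~ Sepal.Length | Species, data = iris,
               layout = c(3, 1), xlab = "Sepal length (cm)")
print(p)   # lattice plots are objects; print() draws them
```

Because the panels share axes, differences in center and spread between the groups can be read off directly.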

29
Two Variables - one categorical

• Side by side boxplots are very effective in showing differences in a quantitative variable across factor levels
– tips data
• do men or women tip better?
– orchard sprays
• measuring potency of various orchard sprays in repelling honeybees
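R ships an `OrchardSprays` data set with the same description, so the side-by-side boxplot can be sketched as below. Whether it is the exact data behind the slide's figure is an assumption.

```r
# decrease: response measured in the experiment; treatment: spray A to H
boxplot(decrease ~ treatment, data = OrchardSprays,
        xlab = "Treatment", ylab = "Decrease",
        main = "Orchard sprays: response by treatment")

# Medians per treatment, the numbers the boxplot centers summarize
meds <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, median)
print(meds)
```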

30
Barcharts and Spineplots

stacked barcharts can be used to compare continuous values across two or more categorical ones.
[Figure legend: orange = M, blue = F]

spineplots show proportions well, but can be hard to interpret
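Both displays can be sketched in base R from a two-way table of counts. The built-in `HairEyeColor` table is used here purely as an illustrative stand-in, collapsed to hair colour by sex.

```r
# Collapse the 3-way table (Hair x Eye x Sex) to a 2-way table: Hair x Sex
tab <- margin.table(HairEyeColor, c(1, 3))

op <- par(mfrow = c(1, 2))
# Stacked bars: one bar per hair colour, split by sex; bar height shows counts
barplot(t(tab), legend.text = TRUE,
        xlab = "Hair colour", ylab = "Count", main = "Stacked barchart")
# Spineplot: bar width shows group size, split shows the proportion by sex
spineplot(tab, main = "Spineplot")
par(op)
```

The stacked barchart makes totals easy to compare, while the spineplot makes proportions easy to compare, which is why the two are often shown together.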
More than two variables
Pairwise scatterplots

Can be somewhat ineffective for categorical data
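A pairwise scatterplot matrix is a one-liner in base R. The sketch below uses the four numeric columns of the built-in `iris` data as an example and colours points by species.

```r
measurements <- iris[, 1:4]   # the four numeric variables

# One scatterplot for every pair of variables, coloured by species
pairs(measurements, col = as.integer(iris$Species), pch = 16, cex = 0.6,
      main = "Pairwise scatterplots of the iris measurements")

# The matching numeric summary: pairwise correlations
cors <- cor(measurements)
print(round(cors, 2))
```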

32
33
Multivariate: More than two variables
• Get creative!
• Conditioning on variables
– trellis or lattice plots
– Cleveland models on human perception, all based on conditioning
– Infinite possibilities

• Earthquake data:
– locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964
– Data collected on the severity of the earthquake
34
35
36
How many dimensions are represented here?

Andrew Gelman blog 7/15/2009


Multivariate Vis: Parallel Coordinates

Petal, a non-reproductive part of the flower

Sepal, a non-reproductive part of the flower

The famous iris data!

38
Parallel Coordinates

[Figure: a single vertical axis, Sepal Length, with the value 5.1 marked]

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
39
Parallel Coordinates: 2 D

[Figure: two parallel vertical axes, Sepal Length and Sepal Width; the row (5.1, 3.5, 1.4, 0.2) is drawn as a line from 5.1 on the first axis to 3.5 on the second]
40
Parallel Coordinates: 4 D

[Figure: four parallel vertical axes, Sepal Length, Sepal Width, Petal Length and Petal Width; the row (5.1, 3.5, 1.4, 0.2) is drawn as one line connecting its value on each axis]
41
Parallel Visualization of Iris data

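[Figure: parallel coordinates plot of the iris data, with the values 5.1, 3.5, 1.4 and 0.2 of the example row labeled]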
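The construction can be reproduced with `parcoord()` from the MASS package (shipped with standard R distributions). The sketch also rescales each variable to [0, 1] by hand, which is the position a row takes on each axis when every axis runs from that variable's minimum to its maximum. The colours and transparency level are arbitrary choices.

```r
library(MASS)   # provides parcoord()

measurements <- iris[, 1:4]
first_row    <- unlist(measurements[1, ])   # the row 5.1, 3.5, 1.4, 0.2

# Position of every observation on each axis: rescale each column to [0, 1]
scaled <- apply(measurements, 2, function(v) (v - min(v)) / (max(v) - min(v)))
print(round(scaled[1, ], 3))                # where the first row crosses each axis

# One line per flower; semi-transparent colours (by species) reduce overplotting
line_cols <- adjustcolor(as.integer(iris$Species) + 1, alpha.f = 0.4)
parcoord(measurements, col = line_cols, var.label = TRUE,
         main = "Parallel coordinates: iris")
```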

42
Multivariate: Parallel coordinates

Alpha blending can be effective

Courtesy Unwin, Theus, Hofmann


43
Parallel coordinates
• Useful in an interactive setting

44
Networks and Graphs

• Visualizing networks is helpful, even if it is not obvious that a network exists

45
Network Visualization
• Graphviz (open source software) is a nice layout tool for big and small graphs

46
What’s missing?

• pie charts
– very popular
– good for showing simple relations of proportions
– Human perception not good at comparing arcs
– barplots, histograms usually better (but less pretty)

• 3D
– nice to be able to show three dimensions
– hard to do well
– often done poorly
– 3D best shown through “spinning” in 2D
• uses various types of projecting into 2D
• http://www.stat.tamu.edu/~west/bradley/

47
Worst graphic in the world?

48
Dimension Reduction

• One way to visualize high dimensional data is to reduce it to 2 or 3 dimensions

– Variable selection
• e.g. stepwise
– Principal Components
• find linear projection onto p-space with maximal variance
– Multi-dimensional scaling
• takes a matrix of (dis)similarities and embeds the points in p-dimensional space to retain those similarities
More on this in next Topic
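Both projection methods are available in base R. The sketch below reduces the four iris measurements to two dimensions with `prcomp()` (principal components) and `cmdscale()` (classical multi-dimensional scaling); the data set and the choice to standardize the variables are assumptions made for illustration.

```r
X <- iris[, 1:4]

# Principal components: linear projection with maximal variance
pca <- prcomp(X, scale. = TRUE)
explained <- summary(pca)$importance["Proportion of Variance", 1:2]
print(explained)   # share of total variance kept by the first two components

# Classical MDS: embed points in 2D so as to retain pairwise distances
mds <- cmdscale(dist(scale(X)), k = 2)

op <- par(mfrow = c(1, 2))
plot(pca$x[, 1:2], col = as.integer(iris$Species), pch = 16, main = "PCA")
plot(mds, col = as.integer(iris$Species), pch = 16,
     xlab = "Coordinate 1", ylab = "Coordinate 2", main = "Classical MDS")
par(op)
```

With Euclidean distances, classical MDS recovers the same configuration as PCA up to the sign of each axis, so the two panels should look alike apart from possible reflections.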

49
Visualization done right

• Hans Rosling @ TED

• http://www.youtube.com/watch?v=jbkSRLYSojo

50
