Lec448B 20160406

The document discusses exploratory data analysis and visualization. It provides examples of questions to ask of data and different visualizations to generate, including node-link diagrams, matrices, dot plots, stem-and-leaf plots, and histograms. The document also discusses challenges in data quality and preparation for analysis. The goal of exploratory analysis is to gain insights into data through an iterative process of questioning, visualization, and further questioning.

Exploratory Data Analysis

Maneesh Agrawala

CS 448B: Visualization
Spring 2016

Assignment 2: Exploratory Data Analysis


Use Tableau to formulate & answer questions
First steps
■ Step 1: Pick a domain
■ Step 2: Pose questions
■ Step 3: Find data
■ Iterate

Create visualizations
■ Interact with data
■ Questions will evolve
■ Tableau

Make wiki notebook
■ Keep record of all steps you took to answer the questions

Due before class on Apr 18, 2016

Exploratory Data Analysis

The Future of Data Analysis, John W. Tukey 1962

The last few decades have seen the
rise of formal theories of statistics,
"legitimizing" variation by confining it
by assumption to random sampling,
often assumed to involve tightly
specified distributions, and restoring
the appearance of security by
emphasizing narrowly optimized
techniques and claiming to make
statements with "known" probabilities
of error.

The Future of Data Analysis, John W. Tukey 1962

While some of the influences of
statistical theory on data analysis
have been helpful, others have not.

The Future of Data Analysis, John W. Tukey 1962

Exposure, the effective laying open of
the data to display the unanticipated,
is to us a major portion of data
analysis. Formal statistics has given
almost no guidance to exposure;
indeed, it is not clear how the
informality and flexibility appropriate to
the exploratory character of exposure
can be fitted into any of the structures
of formal statistics so far proposed.

The Future of Data Analysis, John W. Tukey 1962

  Set A          Set B          Set C          Set D
   X      Y       X      Y       X      Y       X      Y
  10   8.04      10   9.14      10   7.46       8   6.58
   8   6.95       8   8.14       8   6.77       8   5.76
  13   7.58      13   8.74      13  12.74       8   7.71
   9   8.81       9   8.77       9   7.11       8   8.84
  11   8.33      11   9.26      11   7.81       8   8.47
  14   9.96      14   8.10      14   8.84       8   7.04
   6   7.24       6   6.13       6   6.08       8   5.25
   4   4.26       4   3.10       4   5.39      19  12.50
  12  10.84      12   9.13      12   8.15       8   5.56
   7   4.82       7   7.26       7   6.42       8   7.91
   5   5.68       5   4.74       5   5.73       8   6.89

Summary Statistics (identical for all four sets)
µX = 9.0    σX = 3.317
µY = 7.5    σY = 2.03

Linear Regression
Y = 3 + 0.5 X    R² = 0.67

[Anscombe 73]
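The point of the four sets can be checked numerically. The sketch below recomputes the summary statistics and the least-squares fit for each set, using the values as published by Anscombe. The names QUARTET and summarize are illustrative helpers, not part of the lecture.

```python
from statistics import mean, stdev

# Anscombe's quartet as published; Sets A-C share the same X values.
X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "A": (X_ABC, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (X_ABC, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (X_ABC, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample standard deviations, and the least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "sd_x": stdev(xs),
        "sd_y": stdev(ys),
        "slope": slope,
        "intercept": my - slope * mx,
        "r2": sxy ** 2 / (sxx * syy),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four summaries agree to about two decimal places, which is why the numbers alone cannot distinguish the sets and a plot is needed.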

[Figure: four scatter plots of Y against X, one panel each for Set A, Set B, Set C, and Set D]

Topics
Exploratory Data Analysis
■ Data Diagnostics
■ Graphical Methods
■ Data Transformation
Confirmatory Data Analysis
■ Statistical Hypothesis Testing
■ Graphical Inference

Data Diagnostics

Data “Wrangling”
One often needs to manipulate data prior to
analysis. Tasks include reformatting, cleaning,
quality assessment, and integration
Some approaches:
Writing custom scripts
Manual manipulation in spreadsheets
Data Wrangler: http://vis.stanford.edu/wrangler
Google Refine: http://code.google.com/p/google-refine

How to gauge the quality of a visualization?


“The first sign that a visualization is good is that it
shows you a problem in your data…
…every successful visualization that I've been
involved with has had this stage where you realize,
"Oh my God, this data is not what I thought it would
be!" So already, you've discovered something.”

- Martin Wattenberg

Node-link

Matrix

Matrix

Visualize Friends by School?


Berkeley |||||||||||||||||||||||||||||||
Cornell ||||
Harvard |||||||||
Harvard University |||||||
Stanford ||||||||||||||||||||
Stanford University ||||||||||
UC Berkeley |||||||||||||||||||||
UC Davis ||||||||||
University of California at Berkeley |||||||||||||||
University of California, Berkeley ||||||||||||||||||
University of California, Davis |||
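The tally above splits one school across several spellings. A minimal sketch of resolving them before counting follows. The ALIASES table and both function names are hypothetical, and the short friends list is made up for illustration. A real dataset would need its own alias table, or fuzzy matching.

```python
from collections import Counter

# Hypothetical alias table: each known spelling maps to one canonical name.
ALIASES = {
    "berkeley": "UC Berkeley",
    "uc berkeley": "UC Berkeley",
    "university of california at berkeley": "UC Berkeley",
    "university of california, berkeley": "UC Berkeley",
    "harvard": "Harvard",
    "harvard university": "Harvard",
    "stanford": "Stanford",
    "stanford university": "Stanford",
    "uc davis": "UC Davis",
    "university of california, davis": "UC Davis",
}


def canonical_school(raw):
    """Normalize case and whitespace, then look up the canonical name."""
    key = " ".join(raw.lower().split())
    # Unknown spellings pass through unchanged so they stay visible.
    return ALIASES.get(key, raw.strip())


def count_by_school(raw_values):
    """Count records per canonical school name."""
    return Counter(canonical_school(value) for value in raw_values)


# Made-up records, for illustration only.
friends = [
    "Berkeley",
    "UC Berkeley",
    "University of California, Berkeley",
    "Stanford University",
    "Stanford",
    "Cornell",
]
counts = count_by_school(friends)
```

Passing unknown spellings through unchanged, rather than dropping them, means any new variants still appear in the chart and can be added to the table.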

Data Quality & Usability Hurdles
Missing Data         no measurements, redacted, …?
Erroneous Values     misspelling, outliers, …?
Type Conversion      e.g., zip code to lat-lon
Entity Resolution    diff. values for the same thing?
Data Integration     effort/errors when combining data

LESSON: Anticipate problems with your data.


Many research problems around these issues!

Exploratory Analysis:
Effectiveness of Antibiotics

What questions might we ask?

The Data Set


Genus of Bacteria               String
Species of Bacteria             String
Antibiotic Applied              String
Gram-Staining?                  Pos / Neg
Min. Inhibitory Concent. (g)    Number

Collected prior to 1951

Will Burtin, 1951

How do the drugs compare?

How do the bacteria group with respect to antibiotic resistance?

Not a streptococcus!
(realized ~30 yrs later)
Really a streptococcus!
(realized ~20 yrs later)

Wainer & Lysen
American Scientist, 2009

How do the bacteria group w.r.t. resistance?
Do different drugs correlate?
Wainer & Lysen
American Scientist, 2009

Lessons
Exploratory Process
1 Construct graphics to address questions
2 Inspect “answer” and assess new questions
3 Repeat!
Transform the data appropriately (e.g., invert, log)
“Show data variation, not design variation”
-Tufte
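As a small illustration of the "invert, log" advice: minimum inhibitory concentrations can span many orders of magnitude, so a log scale spreads them evenly, and flipping the sign makes larger values mean a more effective drug. The numbers below are made up for illustration and are not Burtin's measurements. The drug names are placeholders.

```python
import math

# Made-up MIC values (smaller = less drug needed = more effective).
mic = {"drug_a": 0.001, "drug_b": 0.03, "drug_c": 1.0, "drug_d": 870.0}

# Log transform: each unit is now one order of magnitude.
log_mic = {drug: math.log10(value) for drug, value in mic.items()}

# Invert: negate so that higher numbers mean higher effectiveness.
effectiveness = {drug: -value for drug, value in log_mic.items()}

for drug in mic:
    print(f"{drug}: MIC={mic[drug]:g}  log10={log_mic[drug]:.2f}  "
          f"effectiveness={effectiveness[drug]:.2f}")
```

On the raw scale the largest value would dominate any chart axis. After the transform the four drugs sit at comparable distances from one another.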

Exploratory Analysis:
Participation on Amazon’s
Mechanical Turk

The Data Set (~200 rows)


Turker ID                     String
Avg. Completion Percentage    Number [0,1]

Collected in 2009 by Heer & Bostock.

What questions might we ask of the data?


What charts might provide insight?

Dot Plot (with transparency to indicate overlap) of Turker Completion Percentage

Dot Plot w/ Reference Lines of Turker Completion Percentage

Stem-and-Leaf Plot
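A stem-and-leaf plot is simple enough to build as text. The sketch below assumes the completion values have been converted to whole-number percentages (0 to 100), with the tens digit as the stem and the ones digit as the leaf. The function name and the sample values are illustrative, not taken from the Mechanical Turk data.

```python
from collections import defaultdict


def stem_and_leaf(percentages):
    """Return the lines of a stem-and-leaf display for integers in 0..100."""
    stems = defaultdict(list)
    for value in sorted(percentages):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the distribution remain visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


# Made-up values, for illustration only.
sample = [98, 100, 57, 91, 99, 95, 62, 99]
display = stem_and_leaf(sample)
print("\n".join(display))
```

Unlike a histogram, the display keeps every individual value readable while still showing the shape of the distribution.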

Histogram (binned counts) of Turker Completion Percentage
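The binned counts behind a histogram can be sketched as follows, assuming equal-width bins over the [0, 1] range of the completion values. The bin count of 10 is an arbitrary choice and the sample values are made up. Changing the number of bins can change the apparent shape, so it is worth trying several.

```python
def histogram(values, n_bins=10, lo=0.0, hi=1.0):
    """Count how many values fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for value in values:
        if not lo <= value <= hi:
            raise ValueError(f"{value} is outside [{lo}, {hi}]")
        index = int((value - lo) / (hi - lo) * n_bins)
        # A value equal to hi belongs in the last bin, not a new one.
        counts[min(index, n_bins - 1)] += 1
    return counts


# Made-up completion values, for illustration only.
completion = [0.05, 0.55, 0.62, 0.95, 0.97, 1.0]
bin_counts = histogram(completion)

for i, count in enumerate(bin_counts):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f})  {'#' * count}")
```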

Box (and Whiskers) Plot of Turker Completion Percentage
Marks: Min, Lower Quartile, Median, Upper Quartile, Max
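The five values a box plot marks can be computed directly. This sketch uses the standard library's statistics.quantiles with the "inclusive" method. Other quartile conventions exist and give slightly different answers on small samples. The function name and sample values are illustrative.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    lower_q, median, upper_q = quantiles(values, n=4, method="inclusive")
    return min(values), lower_q, median, upper_q, max(values)


# Made-up completion values, for illustration only.
summary = five_number_summary([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary)
```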

Used to compare two distributions; in this case, one actual and one theoretical.

Plots the quantiles (here, the percentile values) against each other.

Similar distributions lie along the diagonal. If linearly related, values will lie along a line, but with potentially varying slope and intercept.

Quantile-Quantile Plot
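The points of a Q-Q plot can be sketched as pairs of matching percentiles. The example below compares the data against a normal distribution fitted by sample mean and standard deviation. That choice of theoretical distribution is an assumption made here for illustration, and the function name and sample data are hypothetical.

```python
from statistics import NormalDist, mean, quantiles, stdev


def qq_points(values, n=100):
    """Pair each theoretical percentile with the matching empirical one."""
    # n - 1 cut points from the data (the 1st to 99th percentiles for n=100).
    empirical = quantiles(values, n=n, method="inclusive")
    # The same cut points from a normal fitted to the data.
    fitted = NormalDist(mean(values), stdev(values))
    theoretical = fitted.quantiles(n=n)
    return list(zip(theoretical, empirical))


# Made-up, evenly spread values in [0, 1], for illustration only.
sample = [i / 20 for i in range(21)]
points = qq_points(sample)
```

Plotting these pairs as (x, y) gives the Q-Q plot. Points that bend away from the diagonal show where the data depart from the theoretical distribution.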

Quantile-Quantile Plots

Histogram + Fitted Mixture of 3 Gaussians (Turker Completion Percentage)

Lessons
Even for “simple” data, a variety of graphics might
provide insight. Again, tailor the choice of graphic
to the questions being asked, but be open to
surprises.
Graphics can be used to understand and help
assess the quality of statistical models.
Premature commitment to a model and lack of
verification can lead an analysis astray.

Confirmatory Data Analysis

Some Uses of Formal Statistics
What is the probability that the pattern I'm seeing
might have arisen by chance?
With what parameters does the data best fit a given
function? What is the goodness of fit?
How well do one (or more) data variables predict
another?
…and many others.

Example: Heights by Gender


Gender         Male / Female
Height (in)    Number

µm = 69.4    σm = 4.69    Nm = 1000
µf = 63.8    σf = 4.18    Nf = 1000

Is this difference in heights significant?

In other words: assuming no true difference, what
is the prob. of seeing a difference at least this large by chance?

Histograms

Bihistogram

Formulating a Hypothesis
Null Hypothesis (H0): µm = µf (population)

Alternate Hypothesis (Ha): µm ≠ µf (population)

A statistical hypothesis test assesses how compatible the observed data are with the null hypothesis.
What is the probability of sampling data at least as extreme as those observed, assuming population means are equal?
This is called the p value.

Testing Procedure
Compute a test statistic. This is a number that in
essence summarizes the difference.

Compute test statistic

Z = (µm - µf) / √(σm²/Nm + σf²/Nf)

µm - µf = 5.6

The possible values of this statistic come from a known probability distribution.

According to this distribution, look up the probability of seeing a value meeting or exceeding the test statistic. This is the p value.
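The procedure can be traced with the heights example (µm = 69.4, σm = 4.69, µf = 63.8, σf = 4.18, N = 1000 each). This is a minimal sketch using the standard library's NormalDist. The function names are illustrative, and a two-sided p value is assumed because the alternate hypothesis is µm ≠ µf.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Z statistic: difference in means over its standard error."""
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


def two_sided_p(z):
    """Probability of |Z| at least this large under Z ~ N(0, 1)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


z = two_sample_z(69.4, 4.69, 1000, 63.8, 4.18, 1000)
p = two_sided_p(z)
print(f"Z = {z:.2f}, p = {p:.3g}")
```

Here Z comes out at roughly 28, far beyond ±1.96, so p is far below 0.05 and the difference in heights is statistically significant.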

Lookup probability of test statistic

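[Figure: standard normal distribution, Z ~ N(0, 1) (µ = 0, σ = 1). 95% of the probability mass lies between -1.96 and +1.96. A statistic inside that range (e.g., Z = .2) gives p > 0.05; one beyond it (e.g., Z > +1.96) gives p < 0.05.]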

Statistical Significance

The threshold at which we consider it safe (or reasonable?) to reject the null hypothesis.
If p < 0.05, we typically say that the observed effect
or difference is statistically significant.
This means that, if the null hypothesis were true, there would be a less than 5% chance
of observing an effect at least this large.
Note that the choice of 0.05 is a somewhat arbitrary
threshold (chosen by R. A. Fisher)

Common Statistical Methods
Parametric: Assumes a particular distribution for the data, usually normal (a.k.a. Gaussian).
Non-Parametric: Does not assume a distribution. Typically works on rank orders.

Do data distributions have different “centers”? (aka “location” tests)
■ 2 uni. dists: t-Test (parametric) or Mann-Whitney U (non-parametric)
■ > 2 uni. dists: ANOVA (parametric) or Kruskal-Wallis (non-parametric)
■ > 2 multi. dists: MANOVA (parametric) or Median Test (non-parametric)
Are observed counts significantly different?
■ Counts in categories: χ2 (chi-squared)
Are two vars related?
■ 2 variables: Pearson coeff. (parametric) or Rank correl. (non-parametric)
Do 1 (or more) variables predict another?
■ Continuous: Linear regression
■ Binary: Logistic regression
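To make the parametric versus rank-based distinction concrete for the "are two vars related?" question, the sketch below computes the Pearson coefficient and a rank correlation (Spearman's, obtained by applying Pearson to the ranks). The helper names and the toy data are illustrative, and the code is written out by hand rather than taken from a statistics package.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(xs, ys):
    """Rank correlation: Pearson applied to the ranks of the data."""
    return pearson(ranks(xs), ranks(ys))


# Toy data: perfectly monotone but not linear.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]
r_pearson = pearson(xs, ys)
r_rank = spearman(xs, ys)
```

On this toy data the rank correlation is 1 because the relationship is perfectly monotone, while the Pearson coefficient is lower because the relationship is not a straight line.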

Graphical Inference
Buja, Cook, Hofmann, Wickham et al.

Choropleth maps of cancer deaths in Texas.

One plot shows the real data set. The others are simulated under the null hypothesis of spatial independence.

Can you spot the real data? If so, you have some
evidence of spatial dependence in the data.
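A lineup like this can be sketched by hiding the real data among permuted copies. Shuffling the values across locations breaks any link between value and place, which is one way to simulate the null hypothesis of independence. The function name, the panel count of 20, and the toy data are assumptions made for illustration.

```python
import random


def make_lineup(locations, values, n_panels=20, seed=None):
    """Return (panels, real_position): one real panel among permuted ones."""
    rng = random.Random(seed)
    real_position = rng.randrange(n_panels)
    panels = []
    for position in range(n_panels):
        if position == real_position:
            # The real data, with values in their original places.
            panels.append(list(zip(locations, values)))
        else:
            # A null panel: same values, randomly reassigned to locations.
            shuffled = list(values)
            rng.shuffle(shuffled)
            panels.append(list(zip(locations, shuffled)))
    return panels, real_position


# Toy data: 30 location ids with a strong trend across them.
locations = list(range(30))
values = [loc * loc for loc in locations]
panels, real_position = make_lineup(locations, values, seed=1)
```

With 20 panels, a viewer guessing at random picks the real one with probability 1/20 = 0.05, so a correct pick plays a role similar to p < 0.05.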

Distance vs. angle for 3 point shots by the LA Lakers.

One plot is the real data. The others are generated according to a null hypothesis of quadratic relationship.

Residual distance vs. angle for 3 point shots.

One plot is the real data. The others are generated using an assumption of normally distributed residuals.

Summary
Exploratory analysis may combine graphical
methods, data transformations, and statistics

Use questions to uncover more questions

Formal methods may be used to confirm, sometimes on held-out or new data

Visualization can further aid assessment of fitted statistical models
