Lec448B 20160406
Maneesh Agrawala
CS 448B: Visualization
Spring 2016
Create visualizations
■ Interact with data
■ Question will evolve
■ Tableau
Exploratory Data Analysis
The last few decades have seen the
rise of formal theories of statistics,
"legitimizing" variation by confining it
by assumption to random sampling,
often assumed to involve tightly
specified distributions, and restoring
the appearance of security by
emphasizing narrowly optimized
techniques and claiming to make
statements with "known" probabilities
of error.
Exposure, the effective laying open of
the data to display the unanticipated,
is to us a major portion of data
analysis. Formal statistics has given
almost no guidance to exposure;
indeed, it is not clear how the
informality and flexibility appropriate to
the exploratory character of exposure
can be fitted into any of the structures
of formal statistics so far proposed.
[Figure: four scatterplots of Y against X, titled Set A, Set B, Set C, and Set D]
Topics
Exploratory Data Analysis
Data Diagnostics
Graphical Methods
Data Transformation
Confirmatory Data Analysis
Statistical Hypothesis Testing
Graphical Inference
Data Diagnostics
Data “Wrangling”
One often needs to manipulate data prior to
analysis. Tasks include reformatting, cleaning,
quality assessment, and integration
Some approaches:
Writing custom scripts
Manual manipulation in spreadsheets
Data Wrangler: http://vis.stanford.edu/wrangler
Google Refine: http://code.google.com/p/google-refine
- Martin Wattenberg
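The "writing custom scripts" approach can be sketched in a few lines. This is only an illustration of reformatting and cleaning: the raw text, the column names, and the `clean_rows` helper are all hypothetical and do not come from the lecture.

```python
import csv
import io

# Hypothetical raw export with stray whitespace, mixed case,
# a thousands separator, and an empty cell.
RAW = """name, count
 Alpha ,"1,200"
beta,
ALPHA,300
"""

def clean_rows(text):
    """Trim and lowercase names, parse counts, mark empty cells as None."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    rows = []
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        name = raw[0].strip().lower()
        cell = raw[1].strip().replace(",", "") if len(raw) > 1 else ""
        count = int(cell) if cell else None  # None marks a missing value
        rows.append({header[0]: name, header[1]: count})
    return rows

rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Keeping missing cells as `None` rather than 0 leaves the quality problem visible to later diagnostic steps.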
Node-link
Matrix
Matrix
Data Quality & Usability Hurdles
Missing Data: no measurements, redacted, …?
Erroneous Values: misspelling, outliers, …?
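A first diagnostic pass for the two hurdles above might count missing entries and flag possible outliers. The sketch below assumes missing values are stored as `None` and uses the 1.5 × IQR fence rule as one possible outlier heuristic; the rule, the `diagnose` helper, and the sample values are assumptions, not something the slides prescribe.

```python
import statistics

def diagnose(values, k=1.5):
    """Count missing entries and flag values outside the IQR fences."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {
        "missing": len(values) - len(present),
        "suspect": [v for v in present if v < low or v > high],
    }

# Hypothetical measurements with two gaps and one implausible value.
report = diagnose([12, 15, None, 14, 13, 400, None, 16, 11, 14, 15, 13])
print(report)
```

A flagged value is only a candidate for inspection; whether it is an error or a genuine extreme still has to be judged against the data source.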
Exploratory Analysis:
Effectiveness of Antibiotics
What questions might we ask?
Will Burtin, 1951
Not a streptococcus!
(realized ~30 yrs later)
Really a streptococcus!
(realized ~20 yrs later)
How do the bacteria group w.r.t. resistance?
Do different drugs correlate?
Wainer & Lysen
American Scientist, 2009
Lessons
Exploratory Process
1. Construct graphics to address questions
2. Inspect “answer” and assess new questions
3. Repeat!
Transform the data appropriately (e.g., invert, log)
“Show data variation, not design variation”
-Tufte
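The two transformations named above (log, invert) are simple to apply before plotting. The values below are hypothetical and merely span several orders of magnitude, which is the situation where a log scale tends to help; the function names are illustrative.

```python
import math

def log_transform(values):
    """Base-10 log; only defined for strictly positive values."""
    return [math.log10(v) for v in values]

def invert(values):
    """Reciprocal transform; only defined for non-zero values."""
    return [1.0 / v for v in values]

# Hypothetical measurements spanning six orders of magnitude.
measurements = [0.001, 0.1, 10.0, 1000.0]
logged = log_transform(measurements)
inverted = invert([2.0, 4.0])
print(logged, inverted)
```

After the log transform the four values are evenly spaced, so small and large measurements can share one axis.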
Exploratory Analysis:
Participation on Amazon’s
Mechanical Turk
[Figure: Turker Completion Percentage]
Stem-and-Leaf Plot
[Figure: Turker Completion Percentage]
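A stem-and-leaf plot can be produced as plain text. The sketch below assumes non-negative integer percentages, uses the tens digit (or digits) as the stem and the ones digit as the leaf, and runs on made-up values rather than the Turker data from the slide.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return 'stem | leaves' lines for non-negative integers."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines

# Hypothetical completion percentages.
plot = stem_and_leaf([5, 12, 18, 47, 51, 53, 88, 90, 95, 95, 100])
print("\n".join(plot))
```

Unlike a histogram, every individual value can be read back from the display.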
[Figure: Turker Completion Percentage, annotated with Min, Lower Quartile, Median, Upper Quartile, and Max]
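The five labeled values can be computed directly. This sketch uses `statistics.quantiles` with the "inclusive" method, which is one of several quartile conventions and is an assumption here; the input values are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Min, lower quartile, median, upper quartile, max."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "max": max(values),
    }

summary = five_number_summary([0, 25, 50, 75, 100])
print(summary)
```

Different quartile conventions can give slightly different results on small samples, so it is worth stating which one a plot uses.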
Quantile-Quantile Plot
Quantile-Quantile Plots
[Figure: Turker Completion Percentage]
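The points of a quantile-quantile plot against a normal model can be computed as follows. The sketch assumes the comparison distribution is a normal fitted by sample mean and standard deviation, and uses plotting positions (i + 0.5) / n, which is one common convention; both choices and the sample values are assumptions.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Pair each sorted observation with the matching normal quantile."""
    data = sorted(values)
    n = len(data)
    model = NormalDist(mean(data), stdev(data))
    return [(model.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]

# Hypothetical completion percentages.
points = normal_qq_points([3, 8, 12, 50, 51, 52, 90, 97, 99, 100])
for theoretical, observed in points:
    print(f"{theoretical:8.2f} {observed:6.1f}")
```

If the data followed the model, the pairs would fall close to the line y = x; systematic bends away from it suggest the model is a poor fit.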
Lessons
Even for “simple” data, a variety of graphics might
provide insight. Again, tailor the choice of graphic
to the questions being asked, but be open to
surprises.
Graphics can be used to understand and help
assess the quality of statistical models.
Premature commitment to a model and lack of
verification can lead an analysis astray.
Some Uses of Formal Statistics
What is the probability that the pattern I'm seeing
might have arisen by chance?
With what parameters does the data best fit a given
function? What is the goodness of fit?
How well do one (or more) data variables predict
another?
…and many others.
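As one concrete instance of the second and third questions, the sketch below fits a straight line by ordinary least squares and reports R² as a goodness-of-fit measure. The choice of a linear function, the `fit_line` helper, and the data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept, plus R^2 of the fit."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept, r_squared = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept, r_squared)
```

A high R² alone does not validate the model; plotting the residuals is still the quickest way to see structure the line misses.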
Histograms
Bihistogram
Formulating a Hypothesis
Null Hypothesis (H0): $\mu_m = \mu_f$ (population)
Testing Procedure
Compute a test statistic. This is a number that in
essence summarizes the difference.
Compute test statistic
$Z = \frac{\mu_m - \mu_f}{\sqrt{\sigma_m^2 / N_m + \sigma_f^2 / N_f}}$
$\mu_m - \mu_f = 5.6$
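The test statistic is a direct calculation once the group means, variances, and sizes are known. In the sketch below only the difference of 5.6 comes from the slide; the individual means, the variances, and the sample sizes are hypothetical placeholders.

```python
import math

def z_statistic(mean_m, mean_f, var_m, var_f, n_m, n_f):
    """Difference of means divided by its standard error."""
    return (mean_m - mean_f) / math.sqrt(var_m / n_m + var_f / n_f)

# Hypothetical inputs chosen so that mean_m - mean_f = 5.6.
z = z_statistic(mean_m=69.6, mean_f=64.0,
                var_m=9.0, var_f=16.0,
                n_m=100, n_f=100)
print(round(z, 2))
```

The same difference in means gives a smaller Z when the variances are larger or the samples smaller, which is why the raw difference alone is not enough.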
Lookup probability of test statistic
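Normal Distribution: Z ~ N(0, 1), i.e. µ = 0, σ = 1
[Figure: standard normal curve with cutoffs at -1.96 and +1.96. Z = .2 lies between the cutoffs (p > 0.05); Z > +1.96 lies in the tail (p < 0.05).]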
Statistical Significance
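Instead of a table lookup, the probability can be computed from the standard normal CDF. The sketch assumes a two-sided test at the 0.05 level, which matches the ±1.96 cutoffs shown; the function names are illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def two_sided_p(z):
    """Probability of a statistic at least as extreme as z under N(0, 1)."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def is_significant(z, alpha=0.05):
    return two_sided_p(z) < alpha

for z in (0.2, 1.96, 2.5):
    print(z, round(two_sided_p(z), 4), is_significant(z))
```

The cutoff 1.96 is the 97.5th percentile of the standard normal, so 2.5% of the probability lies in each tail beyond it.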
Common Statistical Methods
Question | Data Type | Parametric | Non-Parametric
Parametric: assumes a particular distribution for the data, usually normal (a.k.a. Gaussian).
Graphical Inference
Buja, Cook, Hofmann, Wickham et al.
Can you spot the real data? If so, you have some
evidence of spatial dependence in the data.
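The "spot the real data" lineup can be sketched as follows: the real dataset is hidden at a random position among decoys generated under a null hypothesis. Here the decoys are made by permuting the y values, which breaks any dependence between x and y; this null model, the panel count of 20, and the `make_lineup` helper are assumptions for illustration.

```python
import random

def make_lineup(xs, ys, panels=20, seed=None):
    """Return (list of panels, index of the real panel)."""
    rng = random.Random(seed)
    real_index = rng.randrange(panels)
    lineup = []
    for i in range(panels):
        if i == real_index:
            lineup.append(list(zip(xs, ys)))  # the real data
        else:
            shuffled = list(ys)
            rng.shuffle(shuffled)  # decoy: dependence on x removed
            lineup.append(list(zip(xs, shuffled)))
    return lineup, real_index

# Hypothetical data with a clear trend.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
lineup, real_index = make_lineup(xs, ys, seed=448)
print(len(lineup), real_index)
```

If a viewer who has not seen the data picks the real panel out of 20, that would happen by chance only 1 time in 20 under the null.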
Distance vs. angle for 3 point shots by the LA
Lakers.
Summary
Exploratory analysis may combine graphical
methods, data transformations, and statistics