
CS395T
Computational Statistics with Application to Bioinformatics

Prof. William H. Press
Spring Term, 2009
The University of Texas at Austin

Unit 1: Probability and Inference

What is Computational Statistics, anyway?

• It’s not different (mathematically) from “regular” statistics.


• Has less distinction between “statisticians” and “users of
statistics”
– since users have access to lots of computing power
• Heavy use of simulation (e.g., Monte Carlo) and
resampling techniques
– instead of analytic calculation of distributions of the null
hypothesis
• Somewhat more Bayesian, but not exclusively so
• Somewhat more driven by specifics of unique data sets
– this can be dangerous (“shopping for significance”)
– or powerful!
• Closely related to machine learning, but a somewhat
different culture
What should be in a
computational statistics course?
• No one knows!
– such courses are relatively new almost everywhere
– second year this particular course has been taught at UT
• although related courses exist
• No accepted standard textbooks
– some “pattern recognition” and “machine learning” texts are close
• closest for this course is Bishop, Pattern Recognition and Machine Learning
– other texts are more about numerical methods (e.g., quadrature) or
mathematical statistics books (e.g., prove theorems)
– web site has list of other possibly useful books
• We are still inventing this course together
– participation in class and on the course web site is required
– grade based on participation, individual projects, and a 20 minute
individual “final interview”
– occasional problem sets to be turned in, but “graded” only by gestalt
• will count only ε towards grade
• This is [going to be] [supposed to be] [had better be] fun!

http://wpressutexas.net/forum
• Last year, tried Wiki format
– didn’t work very well: people reluctant to initiate new pages
• This year, we’ll try Forum format
– you should register using same email as on sign-up sheet
– start threads or add comments under Lecture Slides or Other Topics
– add comments to Course Administration topics

What should you learn in this course?
• A lot of conventional statistics at a 1st year graduate level
– mostly by practical example, not proving theorems
– but you should also learn to read the statistics and/or machine learning
and/or pattern recognition textbook literature
• A lot about real, as opposed to idealized, data sets
– we’ll supply and discuss some
– you can also use and/or share your own
• A bunch of important computational algorithms
– often stochastic
• Some bioinformatics, especially genomics
– although that is not the main point of the course
• Some programming methodology
– e.g., data parallel methods, notated in MATLAB but more general in
concept
– A computer with either MATLAB or Octave (free) is required.

Laws of Probability
“There is this thing called probability. It obeys the laws of an
axiomatic system. When identified with the real world, it gives
(partial) information about the future.”
• What axiomatic system?
• How to identify it with the real world?
– Bayesian or frequentist viewpoints are somewhat different
“mappings” from axiomatic probability theory to the real world
– yet both are useful

“And, it gives a consistent and complete calculus of inference.”


• This is only a Bayesian viewpoint
– It’s sort of true and sort of not true, as we will see!
• R.T. Cox (1946) showed that reasonable assumptions about
“degree of belief” uniquely imply the axioms of probability (and
Bayes)
– ordered (transitive) A > B > C
– “composable” (belief in AB depends only on A and B|A)
– what about, e.g., “fuzzy logic” or “Bayes nets”? are these Bayesian
or are the assumptions violated?

Axioms:
I. P (A) ≥ 0 for an event A
II. P (Ω) = 1 where Ω is the set of all possible outcomes
III. if A ∩ B = ∅, then P (A ∪ B) = P (A) + P (B)

Example of a theorem:
Theorem: P (∅) = 0
Proof: A ∩ ∅ = ∅, so
P (A) = P (A ∪ ∅) = P (A) + P (∅), q.e.d.

Basically this is a theory of measure on Venn diagrams, so we can (informally) cheat and prove theorems by inspection of the appropriate diagrams, as we now do.

Additivity or “Law of Or-ing”

P (A ∪ B) = P (A) + P (B) − P (AB)

“Law of Exhaustion”

If the Ri are exhaustive and mutually exclusive (EME), then

∑i P(Ri) = 1

Multiplicative Rule or “Law of And-ing”

(same picture as before)

P(AB) = P(A) P(B|A) = P(B) P(A|B)          (“given”)

P(B|A) = P(AB) / P(A)          (“conditional probability”: renormalize the outcome space)

Similarly, for multiple And-ing:

P (ABC) = P (A)P (B|A)P (C|AB)

Independence:
Events A and B are independent if
P (A|B) = P (A)
so P (AB) = P (B)P (A|B) = P (A)P (B)
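
Since this is a computational course, a claim like this can also be checked by simulation. Here is a minimal MATLAB/Octave sketch (not part of the original slides) that verifies P(AB) ≈ P(A)P(B) for two events on independent dice:

  N = 1e5;                             % number of simulated rolls
  red = randi(6, N, 1);                % red die
  green = randi(6, N, 1);              % green die, rolled independently
  A = (red == 4);                      % event A: red face is 4
  B = (green >= 5);                    % event B: green face is 5 or 6
  fprintf('P(AB) = %.3f  vs  P(A)P(B) = %.3f\n', mean(A & B), mean(A)*mean(B));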

A symmetric die has P(1) = P(2) = ... = P(6) = 1/6.
Why? Because ∑i P(i) = 1 and P(i) = P(j) for all i, j.
Not because of “frequency of occurrence in N trials”.
That comes later!

The sum of faces of two dice (red and green) is > 8.


What is the probability that the red face is 4?

P(R4 | >8) = P(R4 ∩ >8) / P(>8) = (2/36) / (10/36) = 0.2
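
A quick numerical sanity check of this conditional probability, as a minimal MATLAB/Octave sketch (not part of the original slides):

  N = 1e5;                                   % simulated rolls of two dice
  red = randi(6, N, 1);
  green = randi(6, N, 1);
  cond = (red + green) > 8;                  % condition: the sum exceeds 8
  pEst = sum(red(cond) == 4) / sum(cond);    % estimate P(red = 4 | sum > 8)
  fprintf('estimated P(R4 | >8) = %.3f (exact 0.2)\n', pEst);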

Law of Total Probability or “Law of de-Anding”

H’s are exhaustive and mutually exclusive (EME)

P(B) = P(BH1) + P(BH2) + ... = ∑i P(BHi)

P(B) = ∑i P(B|Hi) P(Hi)
“How to put Humpty-Dumpty back together again.”

Example: A barrel has 3 minnows and 2 trout, with
equal probability of being caught. Minnows must
be thrown back. Trout we keep.
What is the probability that the 2nd fish caught is a
trout?

H1 ≡ 1st caught is minnow, leaving 3 + 2


H2 ≡ 1st caught is trout, leaving 3 + 1
B ≡ 2nd caught is a trout
P(B) = P(B|H1)P(H1) + P(B|H2)P(H2)
     = (2/5)·(3/5) + (1/4)·(2/5) = 0.34
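
The same answer can be checked by direct simulation; a minimal MATLAB/Octave sketch (not part of the original slides):

  N = 1e5; trout2 = 0;
  for k = 1:N
      barrel = [0 0 0 1 1];                  % 0 = minnow, 1 = trout
      first = barrel(randi(5));              % 1st catch, uniformly at random
      if first == 1
          barrel = [0 0 0 1];                % a trout is kept; minnows go back
      end
      second = barrel(randi(numel(barrel))); % 2nd catch
      trout2 = trout2 + (second == 1);
  end
  fprintf('estimated P(2nd is trout) = %.3f (exact 0.34)\n', trout2/N);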

Bayes Theorem

Thomas Bayes
1702 - 1761

(same picture as before)

P(Hi|B) = P(Hi B) / P(B)                               (Law of And-ing)
        = P(B|Hi) P(Hi) / ∑j P(B|Hj) P(Hj)             (Law of de-Anding)
We usually write this as

P (Hi |B) ∝ P (B|Hi )P (Hi )

this means, “compute the normalization by using the completeness of the Hi’s”
• As a theorem relating probabilities, Bayes is
unassailable
• But we will also use it in inference, where the H’s are
hypotheses, while B is the data
– “what is the probability of an hypothesis, given the data?”
– some (defined as frequentists) consider this dodgy
– others (Bayesians like us) consider this fantastically powerful
and useful
– in real life, the war between Bayesians and frequentists is long
since over, and most statisticians adopt a mixture of techniques
appropriate to the problem.
• Note that you generally have to know a complete set of
EME hypotheses to use Bayes for inference
– perhaps its principal weakness

Let’s work a couple of examples using Bayes Law:

Example: Trolls Under the Bridge

Trolls are bad. Gnomes are benign.


Every bridge has 5 creatures under it:
20% have TTGGG (H1)
20% have TGGGG (H2)
60% have GGGGG (benign) (H3)

Before crossing a bridge, a knight captures one of the 5 creatures at random. It is a troll. “I now have an 80% chance of crossing safely,” he reasons, “since only the case

    20% had TTGGG (H1) → now have TGGG

is still a threat.”

P(Hi|T) ∝ P(T|Hi) P(Hi)

so, P(H1|T) = (2/5 · 1/5) / (2/5 · 1/5 + 1/5 · 1/5 + 0 · 3/5) = 2/3

The knight’s chance of crossing safely is actually only 33.3%


Before he captured a troll (“saw the data”) it was 60%.
Capturing a troll actually made things worse! [well…discuss]
(80% was never the right answer!)
Data changes probabilities!
Probabilities after assimilating data are called posterior
probabilities.
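
The arithmetic on this slide is easy to reproduce numerically; a minimal MATLAB/Octave sketch (not part of the original slides):

  prior = [0.2 0.2 0.6];        % P(H1), P(H2), P(H3)
  likeT = [2/5 1/5 0];          % P(captured creature is a troll | Hi)
  post  = prior .* likeT;       % Bayes: posterior proportional to likelihood * prior
  post  = post / sum(post);     % normalize over the EME hypotheses
  fprintf('P(H1|T) = %.3f, chance of crossing safely = %.3f\n', post(1), post(2) + post(3));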

Example: The Monty Hall or
Let’s Make a Deal Problem

• Three doors
• Car (prize) behind one door
• You pick a door, but don’t open it yet
• Monty then opens one of the other doors, always revealing no
car (he knows where it is)
• You now get to switch doors if you want
• Should you?
• Most people reason: Two remaining doors were equiprobable
before, and nothing has changed. So doesn’t matter whether
you switch or not.
• Marilyn vos Savant (“highest IQ person in the world”) famously
thought otherwise (Parade magazine, 1990)
• No one seems to care what Monty Hall thought!

Hi = car behind door i, i = 1, 2, 3
Wlog, you pick door 2 (relabeling).
Wlog, Monty opens door 3 (relabeling).
P (Hi |O3) ∝ P (O3|Hi )P (Hi )

“Without loss of generality…”

P(H1|O3) ∝ 1 · 1/3 = 2/6
P(H2|O3) ∝ 1/2 · 1/3 = 1/6     (ignorance of Monty’s preference between 1 and 3, so take 1/2)
P(H3|O3) ∝ 0 · 1/3 = 0

So you should always switch: doubles your chances!
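
This result is easy to confirm by simulation; a minimal MATLAB/Octave sketch (not part of the original slides):

  N = 1e5; winStay = 0; winSwitch = 0;
  for k = 1:N
      car  = randi(3);                         % door hiding the car
      pick = randi(3);                         % your initial pick
      opts = setdiff(1:3, [pick car]);         % doors Monty is allowed to open
      monty = opts(randi(numel(opts)));        % he opens one, never revealing the car
      switchTo = setdiff(1:3, [pick monty]);   % the remaining closed door
      winStay   = winStay   + (pick == car);
      winSwitch = winSwitch + (switchTo == car);
  end
  fprintf('P(win | stay) = %.3f, P(win | switch) = %.3f\n', winStay/N, winSwitch/N);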

Exegesis on Monty Hall

• Very important example! Master it.


• P(Hi) = 1/3 is the “prior probability” or “prior”
• P (Hi |O3) is the “posterior probability” or “posterior”
• P (O3|Hi ) is the “evidence factor” or “evidence”
• Bayes says posterior ∝ prior × evidence

Bayesian viewpoint:
Probabilities are modified by data. This
makes them intrinsically subjective,
because different observers have
access to different amounts of data
(including their “background information”
or “background knowledge”).

Commutativity/Associativity of Evidence
P (Hi |D1 D2 ) desired
We see D1 :
P (Hi |D1 ) ∝ P (D1 |Hi )P (Hi )

Then, we see D2 :
P (Hi |D1 D2 ) ∝ P (D2 |Hi D1 )P (Hi |D1 ) this is now a prior!

But,
= P (D2 |Hi D1 )P (D1 |Hi )P (Hi )
= P (D1 D2 |Hi )P (Hi )
this being symmetrical shows that we would get the same answer
regardless of the order of seeing the data

All priors P(Hi) are actually P(Hi|D), conditioned on previously seen data! Often write this as P(Hi|I), where I is the “background information”.
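
A minimal MATLAB/Octave sketch of this order-independence (not part of the original slides; the prior and likelihoods below are made-up numbers, and D1, D2 are assumed conditionally independent given Hi so that P(D2|Hi D1) = P(D2|Hi)):

  prior = [0.5 0.3 0.2];                                % hypothetical P(Hi)
  like1 = [0.1 0.4 0.7];                                % hypothetical P(D1|Hi)
  like2 = [0.6 0.3 0.2];                                % hypothetical P(D2|Hi)
  postA = prior .* like1; postA = postA / sum(postA);   % see D1 first ...
  postA = postA .* like2; postA = postA / sum(postA);   % ... then D2
  postB = prior .* like2; postB = postB / sum(postB);   % see D2 first ...
  postB = postB .* like1; postB = postB / sum(postB);   % ... then D1
  disp([postA; postB])                                  % identical rows: order of the data doesn't matter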

Bayes Law is a “calculus of inference”, often better (and
certainly more self-consistent) than folk wisdom.

Example: Hempel’s Paradox


Folk wisdom: A case of a hypothesis adds support to that
hypothesis.
Example: “All crows are black” is supported by each new
observation of a black crow.

“All crows are black”  ⇔  “All non-black things are non-crows”

But this is supported by the observation of a white shoe.


So the observation of a white shoe is evidence that all crows are black!

I.J. Good: “The White Shoe
is a Red Herring” (1966)

We observe one bird, and it is a black crow.


a) Which world are we in?
b) Are all crows black?
P(H1|D) / P(H2|D) = [P(D|H1) P(H1)] / [P(D|H2) P(H2)]
                  = (0.0001 / 0.1) · P(H1)/P(H2) = 0.001 · P(H1)/P(H2)
So the observation strongly supports H2 and the existence of white crows.
Hempel’s folk wisdom premise is not true.
Data supports those hypotheses under which it is more likely, compared with the other hypotheses.
We must have some kind of background information about the universe of
hypotheses, otherwise data has no meaning at all.
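
The odds form of Bayes used above is just as convenient in code; a minimal MATLAB/Octave sketch (not part of the original slides; the likelihoods are the slide's values, and the prior odds of 1 is an arbitrary assumption):

  likeRatio = 0.0001 / 0.1;            % P(D|H1) / P(D|H2), the slide's values
  priorOdds = 1;                       % assumed prior odds P(H1)/P(H2)
  postOdds  = likeRatio * priorOdds;   % posterior odds P(H1|D)/P(H2|D)
  fprintf('posterior odds of H1 vs H2 = %.4g\n', postOdds);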
