Reading Material 02
Mathematical Preliminaries
You must walk before you can run. Similarly, there is a certain level of mathe-
matical maturity which is necessary before you should be trusted to do anything
meaningful with numerical data.
In writing this book, I have assumed that the reader has had some degree
of exposure to probability and statistics, linear algebra, and continuous math-
ematics. I have also assumed that they have probably forgotten most of it, or
perhaps didn’t always see the forest (why things are important, and how to use
them) for the trees (all the details of definitions, proofs, and operations).
This chapter will try to refresh your understanding of certain basic math-
ematical concepts. Follow along with me, and pull out your old textbooks if
necessary for future reference. Deeper concepts will be introduced later in the
book when we need them.
2.1 Probability
Probability theory provides a formal framework for reasoning about the likelihood of events. Because it is a formal discipline, there is a thicket of associated definitions to pin down exactly what we are reasoning about.
All this you have presumably seen before. But it provides the language we will use to connect probability and statistics. The data we see usually
comes from measuring properties of observed events. The theory of probability
and statistics provides the tools to analyze this data.
Both subjects are important, relevant, and useful. But they are different, and
understanding the distinction is crucial in properly interpreting the relevance
of mathematical evidence. Many a gambler has gone to a cold and lonely grave
for failing to make the proper distinction between probability and statistics.
This distinction will perhaps become clearer if we trace the thought process
of a mathematician encountering her first craps game:
• If this mathematician were a probabilist, she would see the dice and think
“Six-sided dice? Each side of the dice is presumably equally likely to land
face up. Now assuming that each face comes up with probability 1/6, I
can figure out what my chances are of crapping out.”
• If instead a statistician wandered by, she would see the dice and think
“How do I know that they are not loaded? I’ll watch a while, and keep
track of how often each number comes up. Then I can decide if my ob-
servations are consistent with the assumption of equal-probability faces.
Once I’m confident enough that the dice are fair, I’ll call a probabilist to
tell me how to bet.”
Figure 2.1: Venn diagrams illustrating set difference (left), intersection (middle), and union (right).
The probabilist would do so, along the way establishing that the house wins this dice game with probability p = 1 − (5/6)^4 ≈ 0.517, where p = 0.5 would denote a fair game in which the house wins exactly half the time.
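A quick Monte Carlo check in Python makes this concrete. This is a sketch, assuming the game in question is the classic one where the house wins if at least one six appears in four rolls of a fair die, which is exactly the event that 1 − (5/6)^4 computes:

    import random

    def house_wins():
        # House wins if at least one six appears in four rolls of a fair die.
        return any(random.randint(1, 6) == 6 for _ in range(4))

    trials = 1_000_000
    wins = sum(house_wins() for _ in range(trials))
    print(wins / trials)   # approximately 0.5177 = 1 - (5/6)**4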
Consider rolling two dice, where A is the event that at least one die comes up even, and B is the event that the total is 7 or 11. Then

A − B = {(1, 2), (1, 4), (2, 1), (2, 2), (2, 3), (2, 4), (2, 6), (3, 2), (3, 6), (4, 1), (4, 2), (4, 4), (4, 5), (4, 6), (5, 4), (6, 2), (6, 3), (6, 4), (6, 6)}.
This is the set difference operation. Observe that here B − A = {}, because
every pair adding to 7 or 11 must contain one odd and one even number.
The outcomes in common between both events A and B are called the in-
tersection, denoted A ∩ B. This can be written as
A ∩ B = A − (S − B).
Events A and B are independent if and only if

P(A ∩ B) = P(A) × P(B).
This means that there is no special structure of outcomes shared between events
A and B. Assuming that half of the students in my class are female, and half
the students in my class are above average, we would expect that a quarter of
my students are both female and above average if the events are independent.
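These set and probability operations translate directly into code. A minimal sketch using Python sets over the two-dice sample space, with the events A and B from above:

    S = {(i, j) for i in range(1, 7) for j in range(1, 7)}    # sample space
    A = {(i, j) for (i, j) in S if i % 2 == 0 or j % 2 == 0}  # at least one even
    B = {(i, j) for (i, j) in S if i + j in (7, 11)}          # total is 7 or 11

    print(A - B)                    # the set difference listed above
    print(B - A)                    # empty, as claimed
    print(A & B == A - (S - B))     # intersection identity: True

    # A and B are not independent here: P(A ∩ B) = 8/36,
    # while P(A) × P(B) = (27/36) × (8/36).
    print(len(A & B) / 36, (len(A) / 36) * (len(B) / 36))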
Figure 2.2: The probability density function (pdf) of the sum of two dice contains exactly the same information as the cumulative density function (cdf), but looks very different.
what we got before. We will revisit Bayes' theorem in Section 5.6, where it will
establish the foundations of computing probabilities in the face of evidence.
Figure 2.3: iPhone quarterly sales data presented as cumulative and incremental
(quarterly) distributions. Which curve did Apple CEO Tim Cook choose to
present?
number of observations:

P(X = k) = h(X = k) / Σ_x h(X = x)
There is another way to represent random variables which often proves use-
ful, called a cumulative density function or cdf. The cdf is the running sum of
the probabilities in the pdf; as a function of k, it reflects the probability that
X ≤ k instead of the probability that X = k. Figure 2.2 (right) shows the
cdf of the dice sum distribution. The values increase monotonically from left
to right, because each term comes from adding a positive probability to the
previous total. The rightmost value is 1, because all outcomes produce a value
no greater than the maximum.
It is important to realize that the pdf P (V ) and cdf C(V ) of a given random
variable V contain exactly the same information. We can move back and forth
between them because

P(X = k) = C(X ≤ k) − C(X ≤ k − δ),

where δ = 1 for integer distributions. The cdf is the running sum of the pdf, so

C(X ≤ k) = Σ_{x ≤ k} P(X = x).
Just be aware of which distribution you are looking at. Cumulative distribu-
tions always get higher as we move to the right, culminating with a probability
of C(X ≤ ∞) = 1. By contrast, the total area under the curve of a pdf equals
1, so the probability at any point in the distribution is generally substantially
less.
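As a sketch of these definitions, both the pdf and the cdf of the two-dice total can be built directly from the 36 equally likely outcomes:

    from collections import Counter
    from itertools import product

    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    pdf = {k: counts[k] / 36 for k in range(2, 13)}

    cdf, running = {}, 0.0
    for k in range(2, 13):          # the cdf is the running sum of the pdf
        running += pdf[k]
        cdf[k] = running

    print(pdf[7])    # 6/36, the most likely total
    print(cdf[12])   # 1.0: every outcome is at most the maximum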
Descriptive statistics come in two main flavors:
• Central tendency measures, which capture the center around which the
data is distributed.
• Variation or variability measures, which describe the data spread, i.e. how
far the measurements lie from the center.
• Mean: You are probably quite comfortable with the use of the arithmetic
mean, where we sum values and divide by the number of observations:
µ_X = (1/n) Σ_{i=1}^{n} x_i
We can easily maintain the mean under a stream of insertions and deletions, by keeping the sum of values separate from the frequency count, and dividing only on demand, as in the sketch below.
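A minimal sketch of this streaming scheme:

    class StreamingMean:
        def __init__(self):
            self.total = 0.0    # running sum of the current values
            self.count = 0      # how many values are currently present

        def insert(self, x):
            self.total += x
            self.count += 1

        def delete(self, x):
            self.total -= x
            self.count -= 1

        def mean(self):
            return self.total / self.count   # divide only on demand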
The mean is very meaningful to characterize symmetric distributions with-
out outliers, like height and weight. That it is symmetric means the num-
ber of items above the mean should be roughly the same as the number
below. That it is without outliers means that the range of values is rea-
sonably tight. Note that a single MAXINT creeping into an otherwise
sound set of observations throws the mean wildly off. The median is a
centrality measure which proves more appropriate with such ill-behaved
distributions.
• Geometric mean: The geometric mean is the nth root of the product of n
values:

(∏_{i=1}^{n} a_i)^{1/n} = (a_1 a_2 ⋯ a_n)^{1/n}
The geometric mean is always less than or equal to the arithmetic mean. For example, the geometric mean of the 36 possible totals of two dice is 6.5201, as opposed to the arithmetic mean of 7. It is very sensitive to values near
zero. A single value of zero lays waste to the geometric mean: no matter
what other values you have in your data, you end up with zero. This is
somewhat analogous to having an outlier of ∞ in an arithmetic mean.
But geometric means prove their worth when averaging ratios. The geometric mean of 1/2 and 2/1 is 1, whereas the arithmetic mean is 1.25. There is
less available “room” for ratios to be less than 1 than there is for ratios
above 1, creating an asymmetry that the arithmetic mean overstates. The
geometric mean is more meaningful in these cases, as is the arithmetic
mean of the logarithms of the ratios.
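A short sketch contrasting the two means on this example, and confirming that the geometric mean is the same as exponentiating the arithmetic mean of the logs:

    import math

    ratios = [1/2, 2/1]
    n = len(ratios)

    arithmetic = sum(ratios) / n                              # 1.25
    geometric = math.prod(ratios) ** (1 / n)                  # 1.0
    log_mean = math.exp(sum(map(math.log, ratios)) / n)       # also 1.0

    print(arithmetic, geometric, log_mean)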
• Median: The median is the exact middle value among a data set; just as
many elements lie above the median as below it. There is a quibble about
what to take as the median when you have an even number of elements.
You can take either one of the two central candidates: in any reasonable
data set these two values should be about the same. Indeed in the dice
example, both are 7.
A nice property of the median as so defined is that it must be a genuine
value of the original data stream. There actually is someone of median
height whom you can point to as an example, but presumably no one in the
world is of exactly average height. You lose this property when you average
the two center elements.
Which centrality measure is best for applications? The median typically
lies pretty close to the arithmetic mean in symmetrical distributions, but
it is often interesting to see how far apart they are, and on which side of
the mean the median lies.
The median generally proves to be a better statistic for skewed distributions or data with outliers, like wealth and income. Bill Gates adds $250
to the mean per capita wealth in the United States, but nothing to the
median. If he makes you personally feel richer, then go ahead and use the
mean. But the median is the more informative statistic here, as it will be
for any power law distribution.
Figure 2.4: Two distinct probability distributions with µ = 3000 for the lifespan of light bulbs: normal (left) and with zero variance (right).
• Mode: The mode is the most frequent element in the data set. This is 7
in our ongoing dice example, because it occurs six times out of thirty-six
elements. Frankly, I’ve never seen the mode as providing much insight
as a centrality measure, because it often isn't close to the center. Samples
measured over a large range should have very few repeated elements or
collisions at any particular value. This makes the mode a matter of hap-
penstance. Indeed, the most frequently occurring elements often reveal
artifacts or anomalies in a data set, such as default values or error codes
that do not really represent elements of the underlying distribution.
The related concept of the peak in a frequency distribution (or histogram)
is meaningful, but interesting peaks only get revealed through proper buck-
eting. The current peak of the annual salary distribution in the United
States lies between $30,000 and $40,000 per year, although the mode pre-
sumably sits at zero.
cartridge bulb," where the evil manufacturer builds very robust bulbs, but includes a counter that prevents each bulb from ever glowing after 3000 hours of
use. Here µ = 3000 and σ = 0. Both distributions have the same mean, but
substantially different variance.
The sum of squares penalty in the formula for σ means that one outlier value d units from the mean contributes as much to the variance as d² points each one unit from the mean, so the variance is very sensitive to outliers.
An often confusing matter concerns the denominator in the formula for stan-
dard deviation. Should we divide by n or n−1? The difference here is technical.
The standard deviation of the full population divides by n, whereas the standard
deviation of the sample divides by n − 1. The issue is that sampling just one
point tells us absolutely nothing about the underlying variance in any popu-
lation, where it is perfectly reasonable to say there is zero variance in weight
among the population of a one-person island. But for reasonable-sized data sets
n ≈ (n − 1), so it really doesn’t matter.
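In numpy, this choice of denominator is exposed through the ddof ("delta degrees of freedom") parameter; a quick sketch:

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
    print(np.std(x, ddof=0))   # population standard deviation: divide by n
    print(np.std(x, ddof=1))   # sample standard deviation: divide by n - 1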
• The stock market: Consider the problem of measuring the relative “skill”
of different stock market investors. We know that Warren Buffet is much
better at investing than we are. But very few professional investors prove
consistently better than others. Certain investment vehicles wildly out-
perform the market in any given time period. However, the hot fund one
Figure 2.5: Sample variance on hitters with a real 30% success rate results in a
wide range of observed performance even over 500 trials per season.
year usually underperforms the market the year after, which shouldn’t
happen if this outstanding performance was due to skill rather than luck.
The fund managers themselves are quick to credit profitable years to their
own genius, but losses to unforeseeable circumstances. However, several
studies have shown that the performance of professional investors is es-
sentially random, meaning there is little real difference in skill. Most
investors are paying managers for previously-used luck. So why do these
entrail-readers get paid so much money?
can be pretty sure that they will ascribe such a breakout season to their
improved conditioning or training methods instead of the fact they just
got lucky. Good or bad season, or lucky/unlucky: it is hard to tell the
signal from the noise.
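A sketch of the simulation idea behind Figure 2.5: give every hitter an identical true 30% success rate, simulate seasons of 500 trials each, and watch how widely the observed averages spread through luck alone (the number of simulated seasons here is arbitrary):

    import random

    def simulate_season(p=0.3, trials=500):
        hits = sum(random.random() < p for _ in range(trials))
        return hits / trials

    averages = [simulate_season() for _ in range(1000)]
    print(min(averages), max(averages))   # typically a spread of roughly
                                          # 0.23 to 0.37, from luck alone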
Take-Home Lesson: Report both the mean and standard deviation to charac-
terize your distribution, written as µ ± σ.
• Are taller people more likely to remain lean? The observed correlation
between height and BMI is r = −0.711, so height is indeed negatively
correlated with body mass index (BMI).
Let's parse this equation. Suppose X and Y are strongly correlated. Then we would expect that when x_i is greater than the mean X̄, then y_i should be bigger than its mean Ȳ. When x_i is lower than its mean, y_i should follow. Now look at the numerator. The sign of each term is positive when both values are above (1 × 1) or below (−1 × −1) their respective means. The sign of each term is negative ((−1 × 1) or (1 × −1)) if they move in opposite directions, suggesting negative correlation. If X and Y were uncorrelated, then positive and negative terms should occur with equal frequency, offsetting each other and driving the value to zero.
The numerator’s operation determining the sign of the correlation is so useful
that we give it a name, covariance, computed:
Cov(X, Y) = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ).
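A direct sketch of this covariance sum, along with the standard normalization by both standard deviations that turns it into the Pearson correlation coefficient:

    import math

    def covariance(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

    def pearson(xs, ys):
        # normalize by the product of the two standard deviations
        return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

    print(pearson([1, 2, 3, 4], [2, 4, 5, 9]))   # near 1: strongly correlated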
Figure 2.6: The function y = |x| has no good linear model, even though it seems like it should be easily fitted; its correlations are weak.
The Spearman rank correlation coefficient is

ρ = 1 − (6 Σ_i d_i²) / (n(n² − 1)),

where d_i is the difference between the ranks of x_i and y_i.
Figure 2.7: A monotonic but not linear point set has a Spearman coefficient
r = 1 even though it has no good linear fit (left). Highly-correlated sequences
are recognized by both coefficients (center), but the Pearson coefficient is much
more sensitive to outliers (right).
Suppose p is the point with the largest value of y in a given data set, and we replace p with p′ = (x₁, ∞). The Pearson correlation will go crazy, since the best fit now becomes the vertical line x = x₁. But the Spearman correlation will be unchanged, since all the points were under p, just as they are now under p′.
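A sketch of this sensitivity using scipy, with a large finite value standing in for ∞:

    import numpy as np
    from scipy import stats

    x = np.arange(1.0, 21.0)
    y = x + np.sin(x)       # strictly increasing, so ranks match exactly

    print(stats.pearsonr(x, y)[0], stats.spearmanr(x, y)[0])   # both near 1

    y[-1] = 1e6             # send the top point off toward "infinity"
    print(stats.pearsonr(x, y)[0])    # drops sharply: the linear fit is wrecked
    print(stats.spearmanr(x, y)[0])   # still 1.0: the point keeps its top rank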
Figure 2.8: Limits in interpreting significance. The r² value shows that weak correlations explain only a small fraction of the variance (left). The level of correlation necessary to be statistically significant decreases rapidly with sample size n (right).
Figure 2.9: Plotting r_i = y_i − f(x_i) shows that the residual values have lower variance and mean zero. The original data points are on the left, with the corresponding residuals on the right.
Weak but significant correlations can have value in big data models involving
large numbers of features. Any single feature/correlation might explain/predict
only small effects, but taken together a large number of weak but independent
correlations may have strong predictive power. Maybe. We will discuss signifi-
cance again in greater detail in Section 5.3.
Figure 2.11: Cyclic trends in a time series (left) are revealed through correlating it against shifts of itself (right).
the other. For example, the fact that we can put people on a diet that makes
them lose weight without getting shorter is convincing evidence that weight does
not cause height. But it is often harder to do these experiments the other way,
e.g. there is no reasonable way to make people shorter other than by hacking
off limbs.
How can we recognize such cyclic patterns in a sequence S? Suppose we
correlate the values of S_i with S_{i+p}, for all 1 ≤ i ≤ n − p. If the values are in sync
for a particular period length p, then this correlation with itself will be unusually
high relative to other possible lag values. Comparing a sequence to itself is called
an autocorrelation, and the series of correlations for all 1 ≤ k ≤ n − 1 is called
the autocorrelation function. Figure 2.11 presents a time series of daily sales,
and the associated autocorrelation function for this data. The peak at a shift of
seven days (and every multiple of seven days) establishes that there is a weekly
periodicity in sales: more stuff gets sold on weekends.
Autocorrelation is an important concept in predicting future events, because
it means we can use previous observations as features in a model. The heuristic
that tomorrow’s weather will be similar to today’s is based on autocorrelation,
with a lag of p = 1 day. Certainly we would expect such a model to be
more accurate than predictions made on weather data from six months ago (lag
p = 180 days).
Generally speaking, the autocorrelation function for many quantities tends
to be highest for very short lags. This is why long-term predictions are less accu-
rate than short-term forecasts: the autocorrelations are generally much weaker.
But periodic cycles do sometimes stretch much longer. Indeed, a weather fore-
cast based on a lag of p = 365 days will be much better than one of p = 180,
because of seasonal effects.
Computing the full autocorrelation function requires calculating n − 1 differ-
ent correlations on points of the time series, which can get expensive for large n.
Fortunately, there is an efficient algorithm based on the fast Fourier transform
(FFT), which makes it possible to construct the autocorrelation function even
for very long sequences.
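A sketch of the FFT-based approach in numpy, which computes all lags at once (the zero-padding avoids wraparound effects from the circular convolution); the weekly sales pattern here is synthetic:

    import numpy as np

    def autocorrelation(x):
        x = np.asarray(x, dtype=float)
        n = len(x)
        x = x - x.mean()
        f = np.fft.rfft(x, 2 * n)                 # zero-pad to length 2n
        acov = np.fft.irfft(f * np.conj(f))[:n]   # autocovariance at each lag
        return acov / acov[0]                     # normalize so lag 0 equals 1

    # synthetic daily sales with a weekend bump: peaks at lags 7, 14, 21, ...
    sales = np.tile([5, 3, 3, 3, 3, 4, 8], 52) + np.random.normal(0, 0.5, 364)
    acf = autocorrelation(sales)
    print(acf[7], acf[14])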
2.4 Logarithms
The logarithm is the inverse of the exponential function y = b^x, an equation that can be rewritten as x = log_b y. This definition is the same as saying that

b^(log_b y) = y.

Equivalently, y = log_b x ⟷ b^y = x.
Logarithms are very useful things, and arise often in data analysis. Here
I detail three important roles logarithms play in data science. Surprisingly,
only one of them is related to the seven algorithmic applications of logarithms
I present in The Algorithm Design Manual [Ski08]. Logarithms are indeed very
useful things.
We can raise our sum to an exponential if we need the real probability, but
usually this is not necessary. When we just need to compare two probabilities
to decide which one is larger we can safely stay in log world, because bigger
logarithms correspond to bigger probabilities.
There is one quirk to be aware of. Recall that log_2(1/2) = −1. The
logarithms of probabilities are all negative numbers except for log(1) = 0. This
is the reason why equations with logs of probabilities often feature negative
signs in strange places. Be on the lookout for them.
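A sketch of why log space matters here: multiplying many small probabilities underflows double precision to zero, while summing their logarithms stays comfortably in range and remains usable for comparisons:

    import math

    probs = [0.001] * 200         # 200 small independent probabilities

    product = math.prod(probs)
    log_sum = sum(math.log(p) for p in probs)

    print(product)    # 0.0: the true value 10**-600 is below float range
    print(log_sum)    # about -1381.6, and still comparable across models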
Figure 2.12: Plotting ratios on a linear scale cramps the space allocated to small ratios relative to large ratios (left). Plotting the logarithms of ratios better represents the underlying data (right).
One solution here would have been to use the geometric mean. But better is
taking the logarithm of these ratios, so that they yield equal displacement, since
log2 2 = 1 and log2 (1/2) = −1. We get the extra bonus that a unit ratio maps
to zero, so positive and negative numbers correspond to improper and proper
ratios, respectively.
A rookie mistake my students often make involves plotting the value of ratios
instead of their logarithms. Figure 2.12 (left) is a graph from a student paper,
showing the ratio of new score over old score on data over 24 hours (each red
dot is the measurement for one hour) on four different data sets (each given a
row). The solid black line shows the ratio of one, where both scores give the
same result. Now try to read this graph: it isn’t easy because the points on the
left side of the line are cramped together in a narrow strip. What jumps out at
you are the outliers. Certainly the new algorithm does terribly on 7UM917 in
the top row: that point all the way to the right is a real outlier.
Except that it isn't. Now look at Figure 2.12 (right), where we plot the logarithms of the ratios. The space devoted to the left and right of the black line is now equal. And it shows that this point wasn't really such an outlier at all. The magnitude of improvement of the leftmost points is much greater than that of the rightmost points. This plot reveals that the new algorithm generally makes things better, a fact visible only because we are showing logs of ratios instead of the ratios themselves.
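The transformation itself is a one-liner; a sketch with hypothetical score ratios:

    import numpy as np

    ratios = np.array([0.25, 0.5, 1.0, 2.0, 4.0])   # hypothetical new/old scores
    print(np.log2(ratios))    # [-2. -1.  0.  1.  2.]: symmetric about zero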
Figure 2.13: Hitting a skewed data distribution (left) with a log often yields a
more bell-shaped distribution (right).
Consider the distribution in Figure 2.13 (left). The tail on the right goes much further than the tail on the left. And
we are destined to see far more lopsided distributions when we discuss power
laws, in Section 5.1.5. Wealth is representative of such a distribution, where
the poorest human has zero or perhaps negative wealth, the average person
(optimistically) is in the thousands of dollars, and Bill Gates is pushing $100
billion as of this writing.
We need a normalization to convert such distributions into something easier
to deal with. To ring the bell of a power law distribution we need something
non-linear, that reduces large values to a disproportionate degree compared to
more modest values.
The logarithm is the transformation of choice for power law variables. Hit
your long-tailed distribution with a log and often good things happen. The
distribution in Figure 2.13 happened to be the log normal distribution, so taking
the logarithm yielded a perfect bell curve on the right. Taking the logarithm of
variables with a power law distribution brings them more in line with traditional
distributions. For example, as an upper-middle class professional, my wealth is
roughly the same number of logs from my starving students as I am from Bill
Gates!
Sometimes taking the logarithm proves too drastic a hit, and a less dramatic
non-linear transformation like the square root works better to normalize a dis-
tribution. The acid test is to plot a frequency distribution of the transformed
values and see if it looks bell-shaped: grossly-symmetric, with a bulge in the
middle. That is when you know you have the right function.
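A sketch of this acid test on synthetic log-normal data: before the transform the mean and median disagree badly, and after taking the log they nearly coincide, as they should in a grossly-symmetric distribution:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # long right tail

    for data in (x, np.log(x)):
        print(data.mean(), np.median(data))   # close together only after the log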
Sequence data is fun to play with, and I have played bioinformatician in research projects since the very beginnings of the human genome project.
DNA sequences are strings over the four-letter alphabet {A, C, G, T}. Proteins
form the stuff that we are physically constructed from, and are composed of
strings of 20 different types of molecular units, called amino acids. Genes are
the DNA sequences which describe exactly how to make specific proteins, with
the units each described by a triplet of {A, C, G, T }s called codons.
For our purposes, it suffices to know that there are a huge number of possible
DNA sequences describing genes which could code for any particular desired
protein sequence. But only one of them is used. My biologist collaborators and
I wanted to know why.
Originally, it was assumed that all of these different synonymous encodings
were essentially identical, but statistics performed on sequence data made it
clear that certain codons are used more often than others. The biological con-
clusion is that “codons matter,” and there are good biological reasons why this
should be.
We became interested in whether “neighboring pairs of codons matter.” Per-
haps certain pairs of triples are like oil and water, and hate to mix. Certain
letter pairs in English have order preferences: you see the bigram gh far more
often than hg. Maybe this is true of DNA as well? If so, there would be pairs
of triples which should be underrepresented in DNA sequence data.
To test this, we needed a score comparing the number of times we actually
see a particular triple (say x = CAT ) next to another particular triple (say
y = GAG) to what we would expect by chance. Let F (xy) be the frequency
of xy, i.e. the number of times we actually see codon x followed by codon y in the
DNA sequence database. These codons code for specific amino acids, say a
and b respectively. For amino acid a, the probability that it will be coded by
x is P (x) = F (x)/F (a), and similarly P (y) = F (y)/F (b). Then the expected
number of times of seeing xy is
Expected(xy) = (F(x) / F(a)) × (F(y) / F(b)) × F(ab)
Based on this, we can compute a codon pair score for any given hexamer xy
as follows:
CPS(xy) = ln(Observed(xy) / Expected(xy)) = ln( F(xy) / ((F(x) F(y) / (F(a) F(b))) × F(ab)) )
Taking the logarithm of this ratio produced very nice properties. Most im-
portantly, the sign of the score distinguished over-represented pairs from under-
represented pairs. Because the magnitudes were symmetric (+1 was just as
impressive as −1) we could add or average these scores in a sensible way to give
a score for each gene. We used these scores to design genes that should be bad
for viruses, which gave an exciting new technology for making vaccines. See the
chapter notes (Section 2.6) for more details.
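A sketch of the scoring computation itself; the frequency counts below are hypothetical stand-ins, not values from the actual sequence database:

    import math

    def codon_pair_score(F_xy, F_x, F_y, F_a, F_b, F_ab):
        # log of the ratio of observed to expected codon-pair counts
        expected = (F_x / F_a) * (F_y / F_b) * F_ab
        return math.log(F_xy / expected)

    # hypothetical counts for x = CAT, y = GAG and their amino acids a, b
    score = codon_pair_score(F_xy=180, F_x=5000, F_y=7000,
                             F_a=12000, F_b=15000, F_ab=1300)
    print(score)   # negative: this pair would be under-represented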
Figure 2.14: Patterns in DNA sequences with the lowest codon pair scores
become obvious on inspection. When interpreted in-frame, the stop symbol
TAG is substantially depleted (left). When interpreted in the other two frames,
the most avoided patterns are all very low complexity, like runs of a single base
(right).
Knowing that certain pairs of codons were bad did not explain why they were
bad. But by computing two related scores (details unimportant) and sorting
the triplets based on them, as shown in Figure 2.14, certain patterns popped
out. Do you notice the patterns? All the bad sequences on the left contain
T AG, which turns out to be a special codon that tells the gene to stop. And
all the bad sequences on the right consist of C and G in very simple repetitive
sequences. These explain biologically why these patterns are avoided by evolution,
meaning we discovered something very meaningful about life.
There are two take-home lessons from this story. First, developing numerical
scoring functions which highlight specific aspects of items can be very useful
to reveal patterns. Indeed, Chapter 4 will focus on the development of such
systems. Second, hitting such quantities with a logarithm can make them even
more useful, enabling us to see the forest for the trees.
2.7 Exercises
Probability
2-1. [3] Suppose that 80% of people like peanut butter, 89% like jelly, and 78% like
both. Given that a randomly sampled person likes peanut butter, what is the
probability that she also likes jelly?
2-2. [3] Suppose that P (A) = 0.3 and P (B) = 0.7.
(a) Can you compute P (A and B) if you only know P (A) and P (B)?
(b) Assuming that events A and B arise from independent random processes:
• What is P (A and B)?
• What is P (A or B)?
• What is P (A|B)?
2-3. [3] Consider a game where your score is the maximum value from two dice.
Compute the probability of each event from {1, . . . , 6}.
2-4. [8] Prove that the cumulative distribution function of the maximum of a pair of
values drawn from random variable X is the square of the original cumulative
distribution function of X.
2-5. [5] If two binary random variables X and Y are independent, are X̄ (the com-
plement of X) and Y also independent? Give a proof or a counterexample.
Statistics
2-6. [3] Compare each pair of distributions to decide which one has the greater
mean and the greater standard deviation. You do not need to calculate the
actual values of µ and σ, just how they compare with each other.
(a) i. 3, 5, 5, 5, 8, 11, 11, 11, 13.
ii. 3, 5, 5, 5, 8, 11, 11, 11, 20.
(b) i. −20, 0, 0, 0, 15, 25, 30, 30.
ii. −40, 0, 0, 0, 15, 25, 30, 30.
(c) i. 0, 2, 4, 6, 8, 10.
ii. 20, 22, 24, 26, 28, 30.
(d) i. 100, 200, 300, 400, 500.
ii. 0, 50, 300, 550, 600.
2-7. [3] Construct a probability distribution where none of the mass lies within one
σ of the mean.
2-8. [3] How do the arithmetic and geometric means compare on random integers?
2-9. [3] Show that the arithmetic mean equals the geometric mean when all terms
are the same.
Correlation Analysis
2-10. [3] True or false: a correlation coefficient of −0.9 indicates a stronger linear
relationship than a correlation coefficient of 0.5. Explain why.
2-11. [3] What would be the correlation coefficient between the annual salaries of
college and high school graduates at a given company, if for each possible job
title the college graduates always made:
(a) $5,000 more than high school grads?
(b) 25% more than high school grads?
(c) 15% less than high school grads?
2-12. [3] What would be the correlation between the ages of husbands and wives if
men always married women who were:
(a) Three years younger than themselves?
(b) Two years older than themselves?
(c) Half as old as themselves?
2-13. [5] Use data or literature found in a Google search to estimate/measure the
strength of the correlation between:
(a) Hits and walks scored for hitters in baseball.
(b) Hits and walks allowed by pitchers in baseball.
2-14. [5] Compute the Pearson and Spearman rank correlations for uniformly drawn samples of points (x, x^k). How do these values change as a function of increasing k?
Logarithms
2-15. [3] Show that the logarithm of any positive number less than 1 is negative.
2-16. [3] Show that the logarithm of zero is undefined.
2-17. [5] Prove that
x · y = b^(log_b x + log_b y)
2-18. [5] Prove the correctness of the formula for changing a base-b logarithm to base-
a, that
log_a(x) = log_b(x) / log_b(a).
Implementation Projects
2-19. [3] Find some interesting data sets, and compare how similar their means and
medians are. What are the distributions where the mean and median differ the most?
2-20. [3] Find some interesting data sets and search all pairs for interesting correla-
tions. Perhaps start with what is available at http://www.data-manual.com/data. What do you find?
Interview Questions
2-21. [3] What is the probability of getting exactly k heads on n tosses, where the
coin has a probability of p in coming up heads on each toss? What about k or
more heads?
2-22. [5] Suppose that the probability of getting a head on the ith toss of an ever-
changing coin is f (i). How would you efficiently compute the probability of
getting exactly k heads in n tosses?
2-23. [5] At halftime of a basketball game you are offered two possible challenges:
(a) Take three shots, and make at least two of them.
(b) Take eight shots, and make at least five of them.
Which challenge should you pick to have a better chance of winning the game?
2-24. [3] Tossing a coin ten times resulted in eight heads and two tails. How would
you analyze whether a coin is fair? What is the p-value?
2-25. [5] Given a stream of n numbers, show how to select one uniformly at random
using only constant storage. What if you don’t know n in advance?
2-26. [5] A k-streak starts at toss i in a sequence of n coin flips when the outcome of the
ith flip and the next k − 1 flips are identical. For example, sequence HTTTHH
contains 2-streaks starting at the second, third, and fifth tosses. What are the
expected number of k-streaks that you will see in n tosses of a fair coin?
2-27. [5] A person randomly types an eight-digit number into a pocket calculator.
What is the probability that the number looks the same even if the calculator
is turned upside down?
2-28. [3] You play a dice rolling game where you have two choices:
(a) Roll the dice once and get rewarded with a prize equal to the outcome
number (e.g., $3 for number “3”) and then stop the game.
(b) You can reject the first reward according to its outcome and roll the dice
a second time, and get rewarded in the same way.
Which strategy should you choose to maximize your reward? That is, for what
outcomes of the first roll should you choose to play the second game? What is
the statistical expectation of reward if you choose the second strategy?
2-29. [3] What is A/B testing and how does it work?
2-30. [3] What is the difference between statistical independence and correlation?
2-31. [3] We often say that correlation does not imply causation. What does this
mean?
2-32. [5] What is the difference between a skewed distribution and a uniform one?
Kaggle Challenges
2-33. Cause–effect pairs: correlation vs. causation.
https://www.kaggle.com/c/cause-effect-pairs
2-34. Predict the next “random number” in a sequence.
https://www.kaggle.com/c/random-number-grand-challenge
2-35. Predict the fate of animals at a pet shelter.
https://www.kaggle.com/c/shelter-animal-outcomes