Transformations
A. Log
B. Tukey's Ladder of Powers
C. Box-Cox Transformations
D. Exercises
The focus of statistics courses is the exposition of appropriate methodology to
analyze data to answer the question at hand. Sometimes the data are given to you,
while other times the data are collected as part of a carefully-designed experiment.
Often the time devoted to statistical analysis is less than 10% of the time devoted
to data collection and preparation. If aspects of the data preparation fail, then the
success of the analysis is in jeopardy. Sometimes errors are introduced into the
recording of data. Sometimes biases are inadvertently introduced in the selection of
subjects or the mis-calibration of monitoring equipment.
In this chapter, we focus on the fact that many statistical procedures work
best if individual variables have certain properties. The measurement scale of a
variable should be part of the data preparation effort. For example, the correlation
coefficient does not require that the variables have a normal shape, but often
relationships can be made clearer by re-expressing the variables. An economist
may choose to analyze the logarithm of prices if the relative price is of interest. A
chemist may choose to perform a statistical analysis using the inverse temperature
as a variable rather than the temperature itself. But note that the inverse of a
temperature will differ depending on whether it is measured in °F, °C, or K.
The introductory chapter covered linear transformations. These
transformations normally do not change statistics such as Pearson’s r, although
they do affect the mean and standard deviation. The first section here is on log
transformations which are useful to reduce skew. The second section is on Tukey’s
ladder of powers. You will see that log transformations are a special case of the
ladder of powers. Finally, we cover the relatively advanced topic of the Box-Cox
transformation.
Log Transformations
by David M. Lane
Prerequisites
• Chapter 1: Logarithms
• Chapter 1: Shapes of Distributions
• Chapter 3: Additional Measures of Central Tendency
• Chapter 4: Introduction to Bivariate Data
Learning Objectives
1. State how a log transformation can help make a relationship clear
2. Describe the relationship between logs and the geometric mean
The log transformation can be used to make highly skewed distributions less
skewed. This can be valuable both for making patterns in the data more
interpretable and for helping to meet the assumptions of inferential statistics.
Figure 1 shows an example of how a log transformation can make patterns
more visible. Both graphs plot the brain weight of animals as a function of their
body weight. The raw weights are shown in the upper panel; the log-transformed
weights are plotted in the lower panel.
[Figure 1. Upper panel: Brain weight plotted against Body/1000 on the raw scale. Lower panel: Log(Brain) plotted against Log(Body).]
It is hard to discern a pattern in the upper panel whereas the strong relationship is
shown clearly in the lower panel.
The comparison of the means of log-transformed data is actually a
comparison of geometric means. This occurs because, as shown below, the anti-log
of the arithmetic mean of log-transformed values is the geometric mean.
Table 1 shows the logs (base 10) of the numbers 1, 10, and 100. The arithmetic
mean of the three logs is

(0 + 1 + 2)/3 = 1.

The anti-log of this arithmetic mean of 1 is

10^1 = 10,

which is the geometric mean of the original numbers:

(1 × 10 × 100)^(1/3) = 10.

Table 1. Logarithms.

X      Log10(X)
1      0
10     1
100    2
Therefore, if the arithmetic means of two sets of log-transformed data are equal
then the geometric means are equal.
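This relationship is easy to verify numerically. The sketch below uses the three values from Table 1; the variable names are illustrative:

```python
import math

# The values from Table 1.
x = [1, 10, 100]

# Arithmetic mean of the base-10 logs: (0 + 1 + 2)/3 = 1.
mean_log = sum(math.log10(v) for v in x) / len(x)

# The anti-log of that mean: 10^1 = 10.
antilog = 10 ** mean_log

# The geometric mean of the raw values: (1 * 10 * 100)^(1/3) = 10.
geometric_mean = math.prod(x) ** (1 / len(x))

print(mean_log, antilog, geometric_mean)
```

The anti-log of the mean log and the geometric mean agree, which is why comparing means of log-transformed data amounts to comparing geometric means.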
Tukey Ladder of Powers
by David W. Scott
Prerequisites
• Chapter 1: Logarithms
• Chapter 4: Bivariate Data
• Chapter 4: Values of Pearson Correlation
• Chapter 12: Independent Groups t Test
• Chapter 13: Introduction to Power
Learning Objectives
1. Give the Tukey ladder of transformations
2. Find a transformation that reveals a linear relationship
3. Find a transformation to approximate a normal distribution
Introduction
We assume we have a collection of bivariate data
(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)
and that we are interested in the relationship between variables x and y. Plotting
the data on a scatter diagram is the first step. As an example, consider the
population of the United States for the 200 years before the Civil War. Of course,
the decennial census began in 1790. These data are plotted two ways in Figure 1.
Malthus predicted that geometric growth of populations coupled with arithmetic
growth of grain production would have catastrophic results. Indeed the US
population followed an exponential curve during this period.
Figure 1. The US population from 1670 - 1860. The Y-axis on the right panel
is on a log scale.
y = b0 + b1/x

By convention, when λ = 0 the transformation is taken to be log x rather than the
constant 1. We shall revisit this convention shortly. The following table gives
examples of the Tukey ladder of transformations.
If x takes on negative values, then special care must be taken so that the
transformations make sense, if possible. We generally limit ourselves to variables
where x > 0 to avoid these considerations. For some dependent variables such as
the number of errors, it is convenient to add 1 to x before applying the
transformation.
Also, if the transformation parameter λ is negative, then the transformed
variable xλ is reversed. For example, if x is increasing, then 1/x is decreasing. We
choose to redefine the Tukey transformation to be -(xλ) if λ < 0 in order to preserve
the order of the variable after transformation. Formally, the Tukey transformation
is defined as
x̃λ =  x^λ        if λ > 0
      log x      if λ = 0
      −(x^λ)     if λ < 0          (2)
In Table 2 we reproduce Table 1 but using the modified definition when λ < 0.
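The modified transformation of Equation (2) can be sketched as a small function; the name `tukey` is illustrative:

```python
import math

def tukey(x, lam):
    """Tukey transformation of Equation (2): x**lam for lam > 0,
    log(x) for lam == 0, and -(x**lam) for lam < 0 so that the
    transformed variable preserves the order of x. Assumes x > 0."""
    if lam > 0:
        return x ** lam
    elif lam == 0:
        return math.log(x)
    else:
        return -(x ** lam)

# The sign flip preserves order when lam < 0: 2 < 4 maps to -0.5 < -0.25.
print(tukey(2, -1), tukey(4, -1))
```

Without the sign flip, 1/x would reverse the order of an increasing variable; with it, larger x always maps to a larger transformed value.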
The Best Transformation for Linearity
The goal is to find a value of λ that makes the scatter diagram as linear as possible.
For the US population, the logarithmic transformation applied to y makes the
relationship almost perfectly linear. The red dashed line in the right frame of
Figure 1 has a slope of about 1.35; that is, the US population grew at a rate of
about 35% per decade.
The logarithmic transformation corresponds to the choice λ = 0 by Tukey's
convention. In Figure 2, we display the scatter diagram of the US population data
for λ = 0 as well as for other choices of λ.
[Figure 2. Scatter diagrams of the transformed US population data for λ = −1, −0.5, −0.25, 0, 0.25, 0.5, 0.75, and 1.]
The raw data are plotted in the bottom right frame of Figure 2 when λ = 1. The
logarithmic fit is in the upper right frame when λ = 0. Notice how the scatter
diagram smoothly morphs from convex to concave as λ increases. Thus intuitively
there is a unique best choice of λ corresponding to the “most linear” graph.
One way to make this choice objective is to optimize a numerical criterion. One
approach might be to fit a straight line to the transformed points and minimize
the sum of squared residuals. However, an easier approach is based on the fact that
the correlation coefficient, r, is a measure of the linearity of a scatter diagram. In
particular, if the points fall on a straight line then their correlation will be r = 1.
(We need not worry about the case when r = −1 since we have defined the Tukey
transformed variable xλ to be positively correlated with x itself.)
In Figure 3, we plot the correlation coefficient of the scatter diagram (x, ỹλ)
as a function of λ. It is clear that the logarithmic transformation (λ = 0) is nearly
optimal by this criterion.
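The scan over λ can be sketched in code. The data below are a hypothetical exponential series standing in for the census values, and `tukey` and `pearson_r` are helper functions written here for illustration:

```python
import math

# An exponentially growing series; lambda = 0 should make it exactly linear.
xs = list(range(1, 21))
ys = [math.exp(0.3 * t) for t in xs]

def tukey(v, lam):
    # Tukey transformation of Equation (2), assuming v > 0.
    return v ** lam if lam > 0 else (math.log(v) if lam == 0 else -(v ** lam))

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = math.sqrt(sum((p - ma) ** 2 for p in a))
    sb = math.sqrt(sum((q - mb) ** 2 for q in b))
    return cov / (sa * sb)

# Pick the lambda whose transformed y is most linearly related to x.
lambdas = [-1, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]
best = max(lambdas, key=lambda lam: pearson_r(xs, [tukey(v, lam) for v in ys]))
print(best)   # lambda = 0: the log of an exponential series is linear in x
```

For exponential growth the log transformation wins exactly, mirroring the US population example.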
[Figure 3. The correlation coefficient of the scatter diagram as a function of λ.]
[Figure 4. The US population (millions), 1700-2000, plotted on a log scale.]

[Figure 5. The population of New York State (millions), 1800-2000, together with the correlation coefficient as a function of λ; the maximum occurs near λ = 0.41.]
Indeed, the growth of population in Arizona is logarithmic, and appears to still be
logarithmic through 2005.
[Figure 6. The population of Arizona, 1920-2000, the correlation coefficient as a function of λ, and the log-transformed population.]
Reducing Skew
Many statistical methods such as t tests and the analysis of variance assume normal
distributions. Although these methods are relatively robust to violations of
normality, transforming the distributions to reduce skew can markedly increase
their power.
As an example, the data in the “Stereograms” case study are very skewed. A t
test of the difference between the two conditions using the raw data results in a p
value of 0.056, a value not conventionally considered significant. However, after a
log transformation (λ = 0) that greatly reduces the skew, the p value is 0.023,
which is conventionally considered significant.
The demonstration in Figure 7 shows distributions of the data from the
Stereograms case study as transformed with various values of λ. Decreasing λ
makes the distribution less positively skewed. Keep in mind that λ = 1 is the raw
data. Notice that there is a slight positive skew for λ = 0 but much less skew than
found in the raw data (λ = 1). Values of λ below 0 result in negative skew.
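The effect of λ on skew can be illustrated with simulated data. The sample below is a hypothetical lognormal one, not the Stereograms data, and the moment-based skew formula is one common choice:

```python
import math
import random

# A positively skewed sample: exponentiated normals (lognormal).
random.seed(1)
data = [math.exp(random.gauss(0, 1)) for _ in range(1000)]

def tukey(v, lam):
    # Tukey transformation of Equation (2), assuming v > 0.
    return v ** lam if lam > 0 else (math.log(v) if lam == 0 else -(v ** lam))

def skew(xs):
    # Moment coefficient of skewness: m3 / m2^(3/2).
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

for lam in (1, 0.5, 0, -0.5):
    print(lam, round(skew([tukey(v, lam) for v in data]), 3))
```

Running this shows a large positive skew at λ = 1, roughly zero skew at λ = 0 (the log of lognormal data is normal), and negative skew once λ drops below 0.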
Figure 7. Distribution of data from the Stereogram case study for various values
of λ.
Box-Cox Transformations
by David Scott
Prerequisites
This section assumes a higher level of mathematics background than most other
sections of this work.
• Chapter 1: Logarithms
• Chapter 3: Additional Measures of Central Tendency (Geometric Mean)
• Chapter 4: Bivariate Data
• Chapter 4: Values of Pearson Correlation
• Chapter 16: Tukey Ladder of Powers
George Box and Sir David Cox collaborated on one paper (Box and Cox, 1964). The story
is that while Cox was visiting Box at Wisconsin, they decided they should write a
paper together because of the similarity of their names (and that both are British).
In fact, Professor Box is married to the daughter of Sir Ronald Fisher.
The Box-Cox transformation of the variable x is also indexed by λ, and is
defined as
x′λ = (x^λ − 1)/λ.          (Equation 1)
At first glance, although the formula in Equation (1) is a scaled version of the
Tukey transformation x^λ, this transformation does not appear to be the same as the
Tukey formula in Equation (2). However, a closer look shows that when λ < 0,
both x̃λ and x′λ change the sign of x^λ to preserve the ordering. Of more interest is
the fact that when λ = 0, the Box-Cox variable is the indeterminate form 0/0.
Rewriting the Box-Cox formula as

x′λ = (e^(λ log x) − 1)/λ = (1 + λ log x + λ²(log x)²/2 + ··· − 1)/λ = log x + λ(log x)²/2 + ··· ,

we see that x′λ → log x as λ → 0. This same result may also be obtained using l'Hôpital's rule from your
calculus course. This gives a rigorous explanation for Tukey's suggestion that the
log transformation (which is not an example of a polynomial transformation) may
be inserted at the value λ = 0.
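A minimal sketch of the Box-Cox transformation, substituting the log x limit at λ = 0 as just described (the function name and tolerance are illustrative choices):

```python
import math

def box_cox(x, lam, eps=1e-8):
    """Box-Cox transform of Equation (1): (x**lam - 1)/lam, with the
    limiting value log(x) substituted when lam is numerically zero.
    Assumes x > 0."""
    if abs(lam) < eps:
        return math.log(x)
    return (x ** lam - 1) / lam

# The limit in action: for small lambda the formula approaches log(x).
print(box_cox(5, 0.001), math.log(5))
# And x = 1 maps to 0 for every lambda.
print(box_cox(1, 2), box_cox(1, 0), box_cox(1, -1))
```

Guarding the λ = 0 case numerically avoids the 0/0 indeterminate form while keeping the transformation continuous in λ.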
Notice with this definition of x′λ that x = 1 always maps to the point x′λ = 0
for all values of λ. To see how the transformation works, look at the examples in
Figure 1. In the top row, the choice λ = 1 simply shifts x to the value x−1, which is
a straight line. In the bottom row (on a semi-logarithmic scale), the choice λ = 0
corresponds to a logarithmic transformation, which is now a straight line. We
superimpose a larger collection of transformations on a semi-logarithmic scale in
Figure 2.
[Figure 1. The Box-Cox transformation for λ = −1, 0, and 1, on a linear scale (top row) and a semi-logarithmic scale (bottom row).]
Figure 2. Examples of the Box-Cox transformation versus log(x) for −2 < λ <
3. The bottom curve corresponds to λ = −2 and the upper to λ = 3.
Transformation to Normality
Another important use of variable transformation is to eliminate skewness and
other distributional features that complicate analysis. Often the goal is to find a
simple transformation that leads to normality. In the article on q-q plots, we discuss
how to assess the normality of a set of data, x1,x2,...,xn.
Data that are normal lead to a straight line on the q-q plot. Since the correlation
coefficient is maximized when a scatter diagram is linear, we can use the same
approach as above to find the most normal transformation. Specifically, we form
the n pairs
( x(i), Φ⁻¹((i − 0.5)/n) ),  for i = 1, 2, ..., n,
where Φ−1 is the inverse CDF of the normal density and x(i) denotes the ith sorted
value of the data set. As an example, consider a large sample of British household
incomes taken in 1973, normalized to have mean equal to one (n = 7,125). Such
data are often strongly skewed, as is clear from Figure 3. The data were sorted and
paired with the 7125 normal quantiles. The value of λ that gave the greatest
correlation (r = 0.9944) was λ = 0.21.
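The procedure can be sketched as follows. The sample here is synthetic skewed data standing in for the British incomes, and the plotting position (i − 0.5)/n is an assumed common choice, not necessarily the one used in the text:

```python
import math
import random
import statistics

# A hypothetical skewed (lognormal) sample, sorted for pairing with quantiles.
random.seed(2)
sample = sorted(math.exp(random.gauss(0, 0.8)) for _ in range(500))
n = len(sample)
quantiles = [statistics.NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def box_cox(x, lam):
    # Box-Cox transform, with the log x limit at lam = 0.
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def pearson_r(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = math.sqrt(sum((p - ma) ** 2 for p in a))
    sb = math.sqrt(sum((q - mb) ** 2 for q in b))
    return cov / (sa * sb)

# Scan lambda and keep the value making the q-q plot most linear.
grid = [k / 20 for k in range(-20, 41)]        # -1.00 to 2.00 in steps of 0.05
best = max(grid, key=lambda lam: pearson_r(quantiles, [box_cox(x, lam) for x in sample]))
print(best)
```

For lognormal data the search settles near λ = 0, since the log of a lognormal sample is exactly normal; for the income data the analogous search gave λ = 0.21.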
Figure 3. (L) Density plot of the 1973 British income data. (R) The best value
of λ is 0.21.
The kernel density plot of the optimally transformed data is shown in the left frame
of Figure 4. While this distribution is much less skewed than the one in Figure 3,
there is clearly an extra “component” that might reflect the poor. Economists
often analyze the logarithm of income, corresponding to λ = 0; see the right frame
of Figure 4. The correlation is only r = 0.9901 in this case, but for convenience
the log transformation will probably be preferred.
Figure 4. (L) Density plot of the 1973 British income data transformed with λ
= 0.21. (R) The log-transform with λ = 0.
Other Applications
Regression analysis is another application where variable transformation is
frequently applied, either to the predictor variables or to the response.
Occasionally, the response variable y may be transformed. In this case, care must
be taken because the variance of the residuals is not comparable as λ varies.
For more examples and discussions, see Kutner, Nachtsheim, Neter, and Li (2004).
Statistical Literacy
by David M. Lane
Prerequisites
• Chapter 1: Logarithms
Many financial web pages give you the option of using a linear or a logarithmic
Y-axis. An example from Google Finance is shown below.
References
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations, Journal of
the Royal Statistical Society, Series B, 26, 211-252.
Kutner, M., Nachtsheim, C., Neter, J., and Li, W. (2004). Applied Linear
Statistical Models, McGraw-Hill/Irwin, Homewood, IL.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
Exercises
Prerequisites
All Content in This Chapter
1. When is a log transformation valuable?
2. If the arithmetic mean of log10-transformed data were 3, what would be the
geometric mean?
3. Using Tukey's ladder of transformations, transform the following data using a λ
of 0.5: 9, 16, 25
4. What value of λ in Tukey's ladder decreases skew the most?
5. What value of λ in Tukey's ladder increases skew the most?
6. In the ADHD case study, transform the data in the placebo condition (D0) with
λ's of .5, 0, -.5, and -1. How does the skew in each of these compare to the
skew in the raw data? Which transformation leads to the least skew?