Statistics Using Stata: An Integrative Approach
Weinberg and Abramowitz, 2016
Contents
Preface Acknowledgments
1 Introduction
The Role of the Computer in Data Analysis Statistics: Descriptive and
Inferential Variables and Constants The Measurement of Variables
Discrete and Continuous Variables Setting a Context with Real Data
Exercises
2 Examining Univariate Distributions
Counting the Occurrence of Data Values When Variables Are Measured at
the Nominal Level
Frequency and Percent Distribution Tables Bar Charts Pie Charts When
Variables Are Measured at the Ordinal, Interval, or Ratio Level
Frequency and Percent Distribution Tables Stem-and-Leaf Displays
Histograms Line Graphs Describing the Shape of a Distribution
Accumulating Data
Cumulative Percent Distributions Ogive Curves Percentile Ranks
Percentiles Five-Number Summaries and Boxplots Modifying the
Appearance of Graphs Summary of Graphical Selection Summary of Stata
Commands in Chapter 2 Exercises
3 Measures of Location, Spread, and Skewness Characterizing the
Location of a Distribution
The Mode The Median The Arithmetic Mean
Interpreting the Mean of a Dichotomous Variable The Weighted Mean
Comparing the Mode, Median, and Mean Characterizing the Spread of a
Distribution
The Range and Interquartile Range The Variance The Standard Deviation
Characterizing the Skewness of a Distribution Selecting Measures of
Location and Spread Applying What We Have Learned Summary of Stata
Commands in Chapter 3
Helpful Hints When Using Stata
Online Resources
The Stata Command Stata TIPS Exercises
4 Reexpressing Variables
Linear and Nonlinear Transformations Linear Transformations: Addition,
Subtraction, Multiplication, and Division
The Effect on the Shape of a Distribution The Effect on Summary Statistics
of a Distribution Common Linear Transformations Standard Scores z-
Scores
Using z-Scores to Detect Outliers Using z-Scores to Compare Scores in
Different Distributions Relating z-Scores to Percentile Ranks Nonlinear
Transformations: Square Roots and Logarithms Nonlinear Transformations:
Ranking Variables Other Transformations: Recoding and Combining
Variables
Recoding Variables Combining Variables Data Management Fundamentals
– the Do-File Summary of Stata Commands in Chapter 4 Exercises
5 Exploring Relationships between Two Variables
When Both Variables Are at Least Interval-Leveled
Scatterplots The Pearson Product Moment Correlation Coefficient
Interpreting the Pearson Correlation Coefficient
Judging the Strength of the Linear Relationship, The Correlation Scale
Itself Is Ordinal, Correlation Does Not Imply Causation, The Effect of Linear
Transformations, Restriction of Range, The Shape of the Underlying
Distributions, The Reliability of the Data, When at Least One Variable Is
Ordinal and the Other Is at Least Ordinal: The Spearman Rank Correlation
Coefficient When at Least One Variable Is Dichotomous: Other Special
Cases of the Pearson Correlation Coefficient
The Point Biserial Correlation Coefficient: The Case of One at Least
Interval and One Dichotomous Variable The Phi Coefficient: The Case of
Two Dichotomous Variables Other Visual Displays of Bivariate
Relationships Selection of Appropriate Statistic/Graph to Summarize a
Relationship Summary of Stata Commands in Chapter 5 Exercises
6 Simple Linear Regression
The “Best-Fitting” Linear Equation The Accuracy of Prediction Using the
Linear Regression Model The Standardized Regression Equation R as a
Measure of the Overall Fit of the Linear Regression Model Simple Linear
Regression When the Independent Variable Is Dichotomous Using r and R
as Measures of Effect Size Emphasizing the Importance of the Scatterplot
Summary of Stata Commands in Chapter 6
Exercises
7 Probability Fundamentals
The Discrete Case The Complement Rule of Probability The Additive Rules
of Probability
First Additive Rule of Probability Second Additive Rule of Probability The
Multiplicative Rule of Probability The Relationship between Independence
and Mutual Exclusivity Conditional Probability The Law of Large Numbers
Exercises
8 Theoretical Probability Models
The Binomial Probability Model and Distribution
The Applicability of the Binomial Probability Model The Normal Probability
Model and Distribution Using the Normal Distribution to Approximate the
Binomial Distribution Summary of Chapter 8 Stata Commands Exercises
9 The Role of Sampling in Inferential Statistics
Samples and Populations Random Samples
Obtaining a Simple Random Sample Sampling with and without
Replacement Sampling Distributions Describing the Sampling Distribution
of Means Empirically
Describing the Sampling Distribution of Means Theoretically: The Central
Limit Theorem
Central Limit Theorem (CLT) Estimators and Bias Summary of Chapter 9
Stata Commands Exercises
10 Inferences Involving the Mean of a Single Population When σ Is Known
Estimating the Population Mean, μ, When the Population Standard
Deviation, σ, Is Known Interval Estimation Relating the Length of a
Confidence Interval, the Level of Confidence, and the Sample Size
Hypothesis Testing The Relationship between Hypothesis Testing and
Interval Estimation Effect Size Type II Error and the Concept of Power
Increasing the Level of Significance, α Increasing the Effect Size, δ
Decreasing the Standard Error of the Mean, σX̄ Closing Remarks Summary of Chapter 10 Stata Commands Exercises
11 Inferences Involving the Mean When σ Is Not Known: One- and Two-
Sample Designs
Single Sample Designs When the Parameter of Interest Is the Mean and σ
Is Not Known The t Distribution
Degrees of Freedom for the One Sample t-Test Violating the Assumption of
a Normally Distributed Parent Population in the One Sample t-Test
Confidence Intervals for the One Sample t-Test Hypothesis Tests: The One
Sample t-Test Effect Size for the One Sample t-Test Two Sample Designs
When the Parameter of Interest Is μ, and σ Is Not Known
Independent (or Unrelated) and Dependent (or Related) Samples
Independent Samples t-Test and Confidence Interval The Assumptions of
the Independent Samples t-Test Effect Size for the Independent Samples t-
Test Paired Samples t-Test and Confidence Interval The Assumptions of
the Paired Samples t-Test Effect Size for the Paired Samples t-Test The
Bootstrap Summary Summary of Chapter 11 Stata Commands Exercises
12 Research Design: Introduction and Overview
Questions and Their Link to Descriptive, Relational, and Causal Research
Studies
The Need for a Good Measure of Our Construct, Weight The Descriptive
Study From Descriptive to Relational Studies From Relational to Causal
Studies The Gold Standard of Causal Studies: The True Experiment and
Random Assignment Comparing Two Kidney Stone Treatments Using a
Non-randomized
Controlled Study Including Blocking in a Research Design Underscoring
the Importance of Having a True Control Group Using Randomization
Analytic Methods for Bolstering Claims of Causality from Observational
Data (Optional Reading) Quasi-Experimental Designs Threats to the
Internal Validity of a Quasi-Experimental Design Threats to the External
Validity of a Quasi-Experimental Design Threats to the Validity of a Study:
Some Clarifications and Caveats Threats to the Validity of a Study: Some
Examples Exercises
13 One-Way Analysis of Variance
The Disadvantage of Multiple t-Tests The One-Way Analysis of Variance
A Graphical Illustration of the Role of Variance in Tests on Means ANOVA
as an Extension of the Independent Samples t-Test Developing an Index of
Separation for the Analysis of Variance Carrying Out the ANOVA
Computation The Between Group Variance (MSB) The Within Group
Variance (MSW) The Assumptions of the One-Way ANOVA Testing the
Equality of Population Means: The F-Ratio How to Read the Tables and
Use Stata Functions for the F-Distribution ANOVA Summary Table
Measuring the Effect Size Post-Hoc Multiple Comparison Tests
The Bonferroni Adjustment: Testing Planned Comparisons The Bonferroni
Tests on Multiple Measures Summary of Stata Commands in Chapter 13
Exercises
14 Two-Way Analysis of Variance
The Two-Factor Design The Concept of Interaction The Hypotheses That
Are Tested by a Two-Way Analysis of Variance Assumptions of the Two-
Way Analysis of Variance Balanced versus Unbalanced Factorial Designs
Partitioning the Total Sum of Squares Using the F-Ratio to Test the Effects
in Two-Way ANOVA Carrying Out the Two-Way ANOVA Computation by
Hand Decomposing Score Deviations about the Grand Mean Modeling
Each Score as a Sum of Component Parts Explaining the Interaction as a
Joint (or Multiplicative) Effect Measuring Effect Size Fixed versus Random
Factors Post-hoc Multiple Comparison Tests Summary of Steps to be
Taken in a Two-Way ANOVA Procedure Summary of Stata Commands in
Chapter 14 Exercises
15 Correlation and Simple Regression as Inferential Techniques
The Bivariate Normal Distribution Testing Whether the Population Pearson
Product Moment Correlation Equals Zero Using a Confidence Interval to
Estimate the Size of the Population
Correlation Coefficient, ρ Revisiting Simple Linear Regression for
Prediction
Estimating the Population Standard Error of Prediction, σY|X Testing the b-
Weight for Statistical Significance Explaining Simple Regression Using an
Analysis of Variance Framework Measuring the Fit of the Overall
Regression Equation: Using R and R² Relating R² to σ²Y|X Testing R² for Statistical Significance Estimating the True Population R²: The Adjusted R²
Exploring the Goodness of Fit of the Regression Equation: Using
Regression Diagnostics
Residual Plots: Evaluating the Assumptions Underlying Regression
Detecting Influential Observations: Discrepancy and Leverage Using Stata
to Obtain Leverage Using Stata to Obtain Discrepancy Using Stata to
Obtain Influence Using Diagnostics to Evaluate the Ice Cream Sales
Example Using the Prediction Model to Predict Ice Cream Sales Simple
Regression When the Predictor is Dichotomous Summary of Stata
Commands in Chapter 15 Exercises
16 An Introduction to Multiple Regression The Basic Equation with Two
Predictors Equations for b, β, and RY.12 When the Predictors Are Not
Correlated Equations for b, β, and RY.12 When the Predictors Are
Correlated
Summarizing and Expanding on Some Important Principles of Multiple
Regression Testing the b-Weights for Statistical Significance Assessing the
Relative Importance of the Independent Variables in the Equation
Measuring the Drop in R² Directly: An Alternative to the Squared Semipartial Correlation Evaluating the Statistical Significance of the Change in R² The b-Weight as a Partial Slope in Multiple Regression
Multiple Regression When One of the Two Independent Variables is
Dichotomous The Concept of Interaction between Two Variables That Are
at Least Interval-Leveled Testing the Statistical Significance of an
Interaction Using Stata Centering First-Order Effects to Achieve Meaningful
Interpretations of b-Weights Understanding the Nature of a Statistically
Significant Two-Way Interaction Interaction When One of the Independent
Variables Is Dichotomous and the Other Is Continuous Summary of Stata
Commands in Chapter 16 Exercises
17 Nonparametric Methods
Parametric versus Nonparametric Methods Nonparametric Methods When
the Dependent Variable Is at the Nominal Level The Chi-Square
Distribution (χ²) The Chi-Square Goodness-of-Fit Test The Chi-Square
Test of Independence
Assumptions of the Chi-Square Test of Independence Fisher’s Exact Test
Calculating the Fisher Exact Test by Hand Using the Hypergeometric
Distribution Nonparametric Methods When the Dependent Variable Is
Ordinal-Leveled Wilcoxon Sign Test The Mann-Whitney U Test The
Kruskal-Wallis Analysis of Variance Summary of Stata Commands in
Chapter 17 Exercises
Appendix A Data Set Descriptions Appendix B Stata .do Files and Data
Sets in Stata Format Appendix C Statistical Tables Appendix D
References Appendix E Solutions Index
Preface
This text capitalizes on the widespread availability of software packages, and on the authors’ many years of experience teaching statistics to undergraduate students at a liberal arts university and to graduate students at a large research university who come from a variety of disciplines, including education, psychology, health, and policy analysis, to create a course of study that links good statistical practice to the analysis of real data. Because of its versatility and power, our software package of choice for this text is the widely used Stata, which provides both a menu-driven and a command-line approach to the analysis of data and, in so doing, facilitates the
transition to a more advanced course of study in statistics. Although the
choice of software is different, the content and general organization of the
text derive from its sister text, now in its third edition, Statistics Using IBM
SPSS: An Integrative Approach, and as such, this text also embraces and
is motivated by several important guiding principles found in the sister text.
First, and perhaps most important, we believe that a good data analytic
plan must serve to uncover the story behind the numbers: what the data tell us about the phenomenon under study. To begin, a good data analyst must know his or her data well and have confidence that they satisfy the underlying assumptions of the statistical methods used. Accordingly, we emphasize
the usefulness of diagnostics in both graphical and statistical form to
expose anomalous cases, which might unduly influence results, and to help
in the selection of appropriate assumption-satisfying transformations so
that ultimately we may have
confidence in our findings. We also emphasize the importance of using
more than one method of analysis to answer fully the question posed and
understanding potential bias in the estimation of population parameters.
Second, because we believe that data are central to the study of good
statistical practice, the textbook’s website contains several data sets used
throughout the text. Two are large sets of real data that we make repeated
use of in both worked-out examples and end-of-chapter exercises. One
data set contains forty-eight variables and five hundred cases from the
education discipline; the other contains forty-nine variables and nearly
forty-five hundred cases from the health discipline. By posing interesting
questions about variables in these large, real data sets (e.g., Is there a
gender difference in eighth graders’ expected income at age thirty?), we
are able to employ a more meaningful and contextual approach to the
introduction of statistical methods and to engage students more actively in
the learning process. The repeated use of these data sets also contributes
to creating a more cohesive presentation of statistics, one that links
different methods of analysis to each other and avoids the perception that
statistics is an often-confusing array of so many separate and distinct
methods of analysis, with no bearing or relationship to one another.
Third, to facilitate the analysis of these data, and to provide students with a
platform for actively engaging in the learning process associated with what
it means to be a good researcher and data analyst, we have incorporated
the latest version of Stata (version 14), a popular statistical software
package, into the presentation of statistical material using a highly
integrative approach that reflects practice. Students learn Stata along with
each new statistical method covered, thereby allowing them to apply their
newly learned knowledge to the real world of applications. In addition to
demonstrating the use of Stata within each chapter, all chapters have an
associated .do-file, designed to allow students not only to replicate all
worked out examples within a chapter but also to
reproduce the figures embedded in a chapter, and to create their own .do-files by extracting and modifying commands from them. Emphasizing data workflow management throughout the text using the Stata .do-file allows
students to begin to appreciate one of the key ingredients to being a good
researcher. Of course, another key ingredient to being a good researcher is
content knowledge, and toward that end, we have included in the text a
more comprehensive coverage of essential topics in statistics not covered
by other textbooks at the introductory level, including robust methods of
estimation based on resampling using the bootstrap, regression to the
mean, the weighted mean, and potential sources of bias in the estimation
of population parameters based on the analysis of data from quasi-
experimental designs, Simpson’s Paradox, counterfactuals, and other issues related to research design.
Fourth, in accordance with our belief that the result of a null hypothesis test
(to determine whether an effect is real or merely apparent) is only a means
to an end (to determine whether the effect is important or useful) rather
than an end in itself, we stress the need to evaluate the magnitude of an
effect if it is deemed to be real, and of drawing clear distinctions between
statistically significant and substantively significant results. Toward this
end, we introduce the computation of standardized measures of effect size
as common practice following a statistically significant result. Although we
provide guidelines for evaluating, in general, the magnitude of an effect, we
encourage readers to think more subjectively about the magnitude of an
effect, bringing into the evaluation their own knowledge and expertise in a
particular area.
Finally, we believe that a key ingredient of an introductory statistics text is a
lively, clear, conceptual, yet rigorous approach. We emphasize conceptual
understanding through an exploration of both the mathematical principles
underlying statistical methods and real-world applications. We use an easy-
going, informal style of writing that we have found gives readers the
impression
that they are involved in a personal conversation with the authors. And we
sequence concepts with concern for student readiness, reintroducing topics
in a spiraling manner to provide reinforcement and promote the transfer of
learning.
Another distinctive feature of this text is the inclusion of a large bibliography
of references to relevant books and journal articles, and many end-of-chapter exercises with detailed answers on the textbook’s website. Along with the topics mentioned earlier, the text covers linear and nonlinear transformations, diagnostic tools for the analysis of model fit, tests of inference, an in-depth discussion of interaction and its interpretation in both two-way analysis of variance and multiple regression, and nonparametric statistics. In so doing, it provides highly comprehensive coverage of essential topics in introductory statistics, gives instructors flexibility in curriculum planning, and provides students with more advanced material for future work in statistics.
The book, consisting of seventeen chapters, is intended for use in a one- or
two-semester introductory applied statistics course for the behavioral,
social, or health sciences at either the graduate or undergraduate level, or
as a reference text. It is not intended for readers who wish to
acquire a more theoretical understanding of mathematical statistics. To
offer another perspective, the book may be described as one that begins
with modern approaches to Exploratory Data Analysis (EDA) and
descriptive statistics, and then covers material similar to what is found in an
introductory mathematical statistics text, designed, for example, for
undergraduates in math and the physical sciences, but stripped of calculus
and linear algebra and grounded instead in data examples. Thus,
theoretical probability distributions, the Law of Large Numbers, sampling distributions, and the Central Limit Theorem are all covered, but in the
context of solving practical and interesting problems.
Acknowledgments
This book has benefited from the many helpful comments of our New York
University and Drew University students, too numerous to mention by
name, and from the insights and suggestions of several colleagues. For
their help, we would like to thank (in alphabetical order) Chris Apelian, Ellie
Buteau, Jennifer Hill, Michael Karchmer, Steve Kass, Jon Kettenring, Linda
Lesniak, Kathleen Madden, Isaac Maddow-Zimet, Meghan McCormick,
Joel Middleton, and Marc Scott. Of course, any errors or shortcomings in
the text remain the responsibility of the authors.
Chapter One
Introduction
◈
Welcome to the study of statistics! It has been our experience that many
students face the prospect of taking a course in statistics with a great deal
of anxiety, apprehension, and even dread. They fear not having the
extensive mathematical background that they assume is required, and they
fear that the contents of such a course will be irrelevant to their work in
their fields of concentration.
Although it is true that an extensive mathematical background is required at
more advanced levels of statistical study, it is not required for the
introductory level of statistics presented in this book. Greater reliance is
placed on the use of the computer for calculation and graphical display so
that we may focus on issues of conceptual understanding and
interpretation. Although hand computation is deemphasized, we believe,
nonetheless, that a basic mathematical background – including the
understanding of fractions, decimals, percentages, signed (positive and
negative) numbers, exponents, linear equations, and graphs – is essential
for an enhanced conceptual understanding of the material.
As for the issue of relevance, we have found that students better
comprehend the power and purpose of statistics when it is presented in the
context of a substantive problem with real data. In this information age,
data are available on a myriad of topics. Whether our interests are in
health, education,
psychology, business, the environment, and so on, numerical data may be
accessed readily to address our questions of interest. The purpose of
statistics is to allow us to analyze these data to extract the information that
they contain in a meaningful way and to write a story about the numbers
that is both compelling and accurate.
Throughout this course of study we make use of a series of real data sets
that are located on the website for this book and may be accessed by clicking on the Resources tab at www.cambridge.org/Stats-Stata. We will pose relevant questions and learn the appropriate methods
of analysis for answering such questions. Students will learn that more than
one statistic or method of analysis typically is needed to address a question
fully. Students also will learn that a detailed description of the data,
including possible anomalies, and an ample characterization of results, are
critical components of any data analytic plan. Through this process we
hope that this book will help students come to view statistics as an
integrated set of data analytic tools that when used together in a
meaningful way will serve to uncover the story contained in the numbers.
The Role of the Computer in Data Analysis
From our own experience, we have found that the use of a statistics
software package to carry out computations and create graphs not only
enables a greater emphasis on conceptual understanding and
interpretation but also allows students to study statistics in a way that
reflects statistical practice. We have selected the latest version of Stata
available to us at the time of writing, version 14, for use with this text. We
have selected Stata because it is a well-established comprehensive
package with a robust technical support infrastructure that is widely used
by behavioral and social scientists. In addition, not only does Stata include
a menu-driven “point and click” interface, making it accessible to the new
user, but it also includes a command line or program syntax interface,
allowing students to be guided from the comfortable “point and click”
environment to the beginnings of statistical programming. Like MINITAB,
JMP, Data Desk, Systat, SPSS, and SPlus, Stata is powerful enough to
handle the analysis of large data sets quickly. By the end of the course,
students will have obtained a conceptual understanding of statistics as well
as an applied, practical skill in how to carry out statistical operations.
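To give a concrete flavor of the command-line interface just described, the lines below are a minimal sketch of the kind of commands a student might type in the Command window or save in a .do-file. The file name class.dta and the variable score are placeholders chosen for illustration; they are not data sets supplied with the text.

    use "class.dta", clear       // open a Stata-format data set
    describe                     // list the variables and how they are stored
    summarize                    // descriptive statistics for every numeric variable
    histogram score, normal      // histogram of one variable with a normal curve overlaid

The same results could be obtained entirely through the menus; typing or saving the commands simply makes the analysis easier to document and reproduce.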
Statistics: Descriptive and Inferential
The subject of statistics may be divided into two general branches:
descriptive and inferential. Descriptive statistics are used when the purpose
of an investigation is to describe the data that have been (or will be)
collected. Suppose, for example, that a third-grade elementary school
teacher is interested in determining the proportion of children who are
firstborn in her class of thirty children. In this case, the focus of the
teacher’s question is her own class of thirty children and she will be able to
collect data on all of the students about whom she would like to draw a
conclusion. The data collection operation will involve noting whether each
child in the classroom is firstborn or not; the statistical operations will
involve counting the number who are, and dividing that number by thirty,
the total number of students in the class, to obtain the proportion sought.
Because the teacher is using statistical methods merely to describe the
data she collected, this is an example of descriptive statistics.
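As a hedged sketch of the teacher’s calculation in Stata, suppose a hypothetical file holds one row per child with an indicator variable firstborn coded 1 for firstborn and 0 otherwise; the file and variable names are illustrative assumptions, not part of the textbook’s data sets.

    use "thirdgrade_class.dta", clear
    tabulate firstborn                         // counts and percentages for each category
    summarize firstborn                        // the mean of a 0/1 variable is the proportion of 1s
    display "Proportion firstborn = " r(mean)  // the proportion the teacher is seeking

Because the class itself is the entire group of interest, these results describe the data completely; no generalization beyond the thirty children is involved.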
Suppose, by contrast, that the teacher is interested in determining the
proportion of children who are firstborn in all third-grade classes in the city
where she teaches. It is highly unlikely that she will be able to (or even
want to) collect the relevant data on all individuals about whom she would
like to draw a conclusion. She will probably have to limit the data collection
to some randomly selected smaller group and use inferential statistics to
generalize to the larger group the conclusions obtained from the smaller
group. Inferential statistics are used when the purpose of the research is
not to describe the data that have been collected but to generalize or make
inferences based on them. The smaller group on which she collects data is
called the sample, whereas the larger group to whom conclusions are
generalized (or inferred), is called the population. In general,
two major factors influence the teacher’s confidence that what holds true
for the sample also holds true for the population at large. These two factors
are the method of sample selection and the size of the sample. Only when
data are collected on all individuals about whom a conclusion is to be
drawn (when the sample is the population and we are therefore in the
realm of descriptive statistics), can the conclusion be drawn with 100
percent certainty. Thus, one of the major goals of inferential statistics is to
assess the degree of certainty of inferences when such inferences are
drawn from sample data. Although this text is divided roughly into two
parts, the first on descriptive statistics and the second on inferential
statistics, the second part draws heavily on the first.
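To preview how such an inference might look in practice, the sketch below draws a simple random sample from a hypothetical citywide file and attaches a confidence interval to the sample proportion. The file name, variable, and sample size are assumptions made for illustration; the ci proportions syntax is the form introduced in Stata 14.

    use "citywide_grade3.dta", clear
    set seed 1234                 // make the random draw reproducible
    sample 100, count             // keep a simple random sample of 100 children
    ci proportions firstborn      // sample proportion and a confidence interval
                                  //   for the population proportion

The resulting interval quantifies the uncertainty that comes from observing a sample rather than the entire population, which is precisely the concern of inferential statistics.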
Variables and Constants
In the previous section, we discussed a teacher’s interest in determining
the proportion of students who are firstborn in the third grade of the city
where she teaches. What made this question worth asking was the fact that
she did not expect everyone in the third grade to be firstborn. Rather, she
quite naturally expected that in the population under study, birth order
would vary, or differ, from individual to individual and that only in certain