
STATISTICS Using Stata

An Integrative Approach

Sharon Lawner Weinberg
Sarah Knapp Abramowitz


STATISTICS USING STATA
Engaging and accessible to students from a wide variety of mathematical
backgrounds, Statistics Using Stata combines the teaching of statistical
concepts with the acquisition of the popular Stata software package. It
closely aligns Stata commands with numerous examples based on real
data, enabling students to develop a deep understanding of statistics in a
way that reflects statistical practice. Capitalizing on the fact that Stata has
both a menu-driven “point and click” and program syntax interface, the text
guides students effectively from the comfortable “point and click”
environment to the beginnings of statistical programming. Its
comprehensive coverage of essential topics gives instructors flexibility in
curriculum planning and provides students with more advanced material to
prepare them for future work. Online resources – including complete
solutions to exercises, PowerPoint slides, and Stata syntax (Do-files) for
each chapter – allow students to review independently and to adapt code to
solve new problems, reinforcing their programming skills.
Sharon Lawner Weinberg is Professor of Applied Statistics and
Psychology and former Vice Provost for Faculty Affairs at New York
University. She has authored numerous articles, books, and reports on
statistical methods, statistical education, and evaluation, as well as in
applied disciplines, such as psychology, education, and health. She is the
recipient of several major grants, including a recent grant from the Sloan
Foundation to support her current work with NYU colleagues to evaluate
the New York City Gifted and Talented programs.
Sarah Knapp Abramowitz is Professor of Mathematics and Computer
Science at Drew University, where she is also Department Chair and
Coordinator of Statistics Instruction. She is the coauthor, with Sharon
Lawner Weinberg, of
Statistics Using IBM SPSS: An Integrative Approach, Third Edition
(Cambridge University Press) and an Associate Editor of the Journal of
Statistics Education.

STATISTICS USING STATA


An Integrative Approach
Sharon Lawner Weinberg, New York University
Sarah Knapp Abramowitz, Drew University
32 Avenue of the Americas, New York NY 10013
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning,
and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781107461185
© Sharon Lawner Weinberg and Sarah Knapp Abramowitz 2016
This publication is in copyright. Subject to statutory exception and to the provisions of relevant
collective licensing agreements, no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2016
Printed in the United States of America by Sheridan Books, Inc.
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloguing in Publication Data
Names: Weinberg, Sharon L. | Abramowitz, Sarah Knapp, 1967–
Title: Statistics using stata: an integrative approach / Sharon Lawner Weinberg, New York
University,
Sarah Knapp Abramowitz, Drew University.
Description: New York NY: Cambridge University Press, 2016. | Includes index.
Identifiers: LCCN 2016007888 | ISBN 9781107461185 (pbk.)
Subjects: LCSH: Mathematical statistics – Data processing. | Stata.
Classification: LCC QA276.4.W455 2016 | DDC 519.50285/53–dc23
LC record available at http://lccn.loc.gov/2016007888
ISBN 978-1-107-46118-5 Paperback
Additional resources for this publication are at www.cambridge.org/Stats-Stata.
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for
external or third-party Internet Web sites referred to in this publication and does not guarantee that
any content on such Web sites is, or will remain, accurate or appropriate.
To our families

Contents

Preface
Acknowledgments

1 Introduction
   The Role of the Computer in Data Analysis
   Statistics: Descriptive and Inferential
   Variables and Constants
   The Measurement of Variables
   Discrete and Continuous Variables
   Setting a Context with Real Data
   Exercises
2 Examining Univariate Distributions
   Counting the Occurrence of Data Values
   When Variables Are Measured at the Nominal Level
   Frequency and Percent Distribution Tables
   Bar Charts
   Pie Charts
   When Variables Are Measured at the Ordinal, Interval, or Ratio Level
   Frequency and Percent Distribution Tables
   Stem-and-Leaf Displays
   Histograms
   Line Graphs
   Describing the Shape of a Distribution
   Accumulating Data
   Cumulative Percent Distributions
   Ogive Curves
   Percentile Ranks
   Percentiles
   Five-Number Summaries and Boxplots
   Modifying the Appearance of Graphs
   Summary of Graphical Selection
   Summary of Stata Commands in Chapter 2
   Exercises
3 Measures of Location, Spread, and Skewness
   Characterizing the Location of a Distribution
   The Mode
   The Median
   The Arithmetic Mean
   Interpreting the Mean of a Dichotomous Variable
   The Weighted Mean
   Comparing the Mode, Median, and Mean
   Characterizing the Spread of a Distribution
   The Range and Interquartile Range
   The Variance
   The Standard Deviation
   Characterizing the Skewness of a Distribution
   Selecting Measures of Location and Spread
   Applying What We Have Learned
   Summary of Stata Commands in Chapter 3
   Helpful Hints When Using Stata
   Online Resources
   The Stata Command
   Stata TIPS
   Exercises
4 Reexpressing Variables
   Linear and Nonlinear Transformations
   Linear Transformations: Addition, Subtraction, Multiplication, and Division
   The Effect on the Shape of a Distribution
   The Effect on Summary Statistics of a Distribution
   Common Linear Transformations
   Standard Scores
   z-Scores
   Using z-Scores to Detect Outliers
   Using z-Scores to Compare Scores in Different Distributions
   Relating z-Scores to Percentile Ranks
   Nonlinear Transformations: Square Roots and Logarithms
   Nonlinear Transformations: Ranking Variables
   Other Transformations: Recoding and Combining Variables
   Recoding Variables
   Combining Variables
   Data Management Fundamentals – the Do-File
   Summary of Stata Commands in Chapter 4
   Exercises
5 Exploring Relationships between Two Variables
   When Both Variables Are at Least Interval-Leveled
   Scatterplots
   The Pearson Product Moment Correlation Coefficient
   Interpreting the Pearson Correlation Coefficient
   Judging the Strength of the Linear Relationship
   The Correlation Scale Itself Is Ordinal
   Correlation Does Not Imply Causation
   The Effect of Linear Transformations
   Restriction of Range
   The Shape of the Underlying Distributions
   The Reliability of the Data
   When at Least One Variable Is Ordinal and the Other Is at Least Ordinal: The Spearman Rank Correlation Coefficient
   When at Least One Variable Is Dichotomous: Other Special Cases of the Pearson Correlation Coefficient
   The Point Biserial Correlation Coefficient: The Case of One at Least Interval and One Dichotomous Variable
   The Phi Coefficient: The Case of Two Dichotomous Variables
   Other Visual Displays of Bivariate Relationships
   Selection of Appropriate Statistic/Graph to Summarize a Relationship
   Summary of Stata Commands in Chapter 5
   Exercises
6 Simple Linear Regression
   The “Best-Fitting” Linear Equation
   The Accuracy of Prediction Using the Linear Regression Model
   The Standardized Regression Equation
   R as a Measure of the Overall Fit of the Linear Regression Model
   Simple Linear Regression When the Independent Variable Is Dichotomous
   Using r and R as Measures of Effect Size
   Emphasizing the Importance of the Scatterplot
   Summary of Stata Commands in Chapter 6
   Exercises
7 Probability Fundamentals
   The Discrete Case
   The Complement Rule of Probability
   The Additive Rules of Probability
   First Additive Rule of Probability
   Second Additive Rule of Probability
   The Multiplicative Rule of Probability
   The Relationship between Independence and Mutual Exclusivity
   Conditional Probability
   The Law of Large Numbers
   Exercises
8 Theoretical Probability Models
   The Binomial Probability Model and Distribution
   The Applicability of the Binomial Probability Model
   The Normal Probability Model and Distribution
   Using the Normal Distribution to Approximate the Binomial Distribution
   Summary of Chapter 8 Stata Commands
   Exercises
9 The Role of Sampling in Inferential Statistics
   Samples and Populations
   Random Samples
   Obtaining a Simple Random Sample
   Sampling with and without Replacement
   Sampling Distributions
   Describing the Sampling Distribution of Means Empirically
   Describing the Sampling Distribution of Means Theoretically: The Central Limit Theorem
   Central Limit Theorem (CLT)
   Estimators and Bias
   Summary of Chapter 9 Stata Commands
   Exercises
10 Inferences Involving the Mean of a Single Population When σ Is Known
   Estimating the Population Mean, μ, When the Population Standard Deviation, σ, Is Known
   Interval Estimation
   Relating the Length of a Confidence Interval, the Level of Confidence, and the Sample Size
   Hypothesis Testing
   The Relationship between Hypothesis Testing and Interval Estimation
   Effect Size
   Type II Error and the Concept of Power
   Increasing the Level of Significance, α
   Increasing the Effect Size, δ
   Decreasing the Standard Error of the Mean
   Closing Remarks
   Summary of Chapter 10 Stata Commands
   Exercises
11 Inferences Involving the Mean When σ Is Not Known: One- and Two-Sample Designs
   Single Sample Designs When the Parameter of Interest Is the Mean and σ Is Not Known
   The t Distribution
   Degrees of Freedom for the One Sample t-Test
   Violating the Assumption of a Normally Distributed Parent Population in the One Sample t-Test
   Confidence Intervals for the One Sample t-Test
   Hypothesis Tests: The One Sample t-Test
   Effect Size for the One Sample t-Test
   Two Sample Designs When the Parameter of Interest Is μ, and σ Is Not Known
   Independent (or Unrelated) and Dependent (or Related) Samples
   Independent Samples t-Test and Confidence Interval
   The Assumptions of the Independent Samples t-Test
   Effect Size for the Independent Samples t-Test
   Paired Samples t-Test and Confidence Interval
   The Assumptions of the Paired Samples t-Test
   Effect Size for the Paired Samples t-Test
   The Bootstrap
   Summary
   Summary of Chapter 11 Stata Commands
   Exercises
12 Research Design: Introduction and Overview
   Questions and Their Link to Descriptive, Relational, and Causal Research Studies
   The Need for a Good Measure of Our Construct, Weight
   The Descriptive Study
   From Descriptive to Relational Studies
   From Relational to Causal Studies
   The Gold Standard of Causal Studies: The True Experiment and Random Assignment
   Comparing Two Kidney Stone Treatments Using a Non-randomized Controlled Study
   Including Blocking in a Research Design
   Underscoring the Importance of Having a True Control Group Using Randomization
   Analytic Methods for Bolstering Claims of Causality from Observational Data (Optional Reading)
   Quasi-Experimental Designs
   Threats to the Internal Validity of a Quasi-Experimental Design
   Threats to the External Validity of a Quasi-Experimental Design
   Threats to the Validity of a Study: Some Clarifications and Caveats
   Threats to the Validity of a Study: Some Examples
   Exercises
13 One-Way Analysis of Variance
   The Disadvantage of Multiple t-Tests
   The One-Way Analysis of Variance
   A Graphical Illustration of the Role of Variance in Tests on Means
   ANOVA as an Extension of the Independent Samples t-Test
   Developing an Index of Separation for the Analysis of Variance
   Carrying Out the ANOVA Computation
   The Between Group Variance (MSB)
   The Within Group Variance (MSW)
   The Assumptions of the One-Way ANOVA
   Testing the Equality of Population Means: The F-Ratio
   How to Read the Tables and Use Stata Functions for the F-Distribution
   ANOVA Summary Table
   Measuring the Effect Size
   Post-Hoc Multiple Comparison Tests
   The Bonferroni Adjustment: Testing Planned Comparisons
   The Bonferroni Tests on Multiple Measures
   Summary of Stata Commands in Chapter 13
   Exercises
14 Two-Way Analysis of Variance
   The Two-Factor Design
   The Concept of Interaction
   The Hypotheses That Are Tested by a Two-Way Analysis of Variance
   Assumptions of the Two-Way Analysis of Variance
   Balanced versus Unbalanced Factorial Designs
   Partitioning the Total Sum of Squares
   Using the F-Ratio to Test the Effects in Two-Way ANOVA
   Carrying Out the Two-Way ANOVA Computation by Hand
   Decomposing Score Deviations about the Grand Mean
   Modeling Each Score as a Sum of Component Parts
   Explaining the Interaction as a Joint (or Multiplicative) Effect
   Measuring Effect Size
   Fixed versus Random Factors
   Post-hoc Multiple Comparison Tests
   Summary of Steps to be Taken in a Two-Way ANOVA Procedure
   Summary of Stata Commands in Chapter 14
   Exercises
15 Correlation and Simple Regression as Inferential Techniques
   The Bivariate Normal Distribution
   Testing Whether the Population Pearson Product Moment Correlation Equals Zero
   Using a Confidence Interval to Estimate the Size of the Population Correlation Coefficient, ρ
   Revisiting Simple Linear Regression for Prediction
   Estimating the Population Standard Error of Prediction, σY|X
   Testing the b-Weight for Statistical Significance
   Explaining Simple Regression Using an Analysis of Variance Framework
   Measuring the Fit of the Overall Regression Equation: Using R and R²
   Relating R² to σ²Y|X
   Testing R² for Statistical Significance
   Estimating the True Population R²: The Adjusted R²
   Exploring the Goodness of Fit of the Regression Equation: Using Regression Diagnostics
   Residual Plots: Evaluating the Assumptions Underlying Regression
   Detecting Influential Observations: Discrepancy and Leverage
   Using Stata to Obtain Leverage
   Using Stata to Obtain Discrepancy
   Using Stata to Obtain Influence
   Using Diagnostics to Evaluate the Ice Cream Sales Example
   Using the Prediction Model to Predict Ice Cream Sales
   Simple Regression When the Predictor Is Dichotomous
   Summary of Stata Commands in Chapter 15
   Exercises
16 An Introduction to Multiple Regression
   The Basic Equation with Two Predictors
   Equations for b, β, and RY.12 When the Predictors Are Not Correlated
   Equations for b, β, and RY.12 When the Predictors Are Correlated
   Summarizing and Expanding on Some Important Principles of Multiple Regression
   Testing the b-Weights for Statistical Significance
   Assessing the Relative Importance of the Independent Variables in the Equation
   Measuring the Drop in R² Directly: An Alternative to the Squared Semipartial Correlation
   Evaluating the Statistical Significance of the Change in R²
   The b-Weight as a Partial Slope in Multiple Regression
   Multiple Regression When One of the Two Independent Variables Is Dichotomous
   The Concept of Interaction between Two Variables That Are at Least Interval-Leveled
   Testing the Statistical Significance of an Interaction Using Stata
   Centering First-Order Effects to Achieve Meaningful Interpretations of b-Weights
   Understanding the Nature of a Statistically Significant Two-Way Interaction
   Interaction When One of the Independent Variables Is Dichotomous and the Other Is Continuous
   Summary of Stata Commands in Chapter 16
   Exercises
17 Nonparametric Methods
   Parametric versus Nonparametric Methods
   Nonparametric Methods When the Dependent Variable Is at the Nominal Level
   The Chi-Square Distribution (χ²)
   The Chi-Square Goodness-of-Fit Test
   The Chi-Square Test of Independence
   Assumptions of the Chi-Square Test of Independence
   Fisher’s Exact Test
   Calculating the Fisher Exact Test by Hand Using the Hypergeometric Distribution
   Nonparametric Methods When the Dependent Variable Is Ordinal-Leveled
   Wilcoxon Sign Test
   The Mann-Whitney U Test
   The Kruskal-Wallis Analysis of Variance
   Summary of Stata Commands in Chapter 17
   Exercises
Appendix A Data Set Descriptions
Appendix B Stata .do Files and Data Sets in Stata Format
Appendix C Statistical Tables
Appendix D References
Appendix E Solutions
Index

Preface
This text capitalizes on the widespread availability of software packages to
create a course of study that links good statistical practice to the analysis of
real data, and draws on the authors’ many years of experience teaching
statistics to undergraduate students at a liberal arts university and to
graduate students at a large research university, in a variety of disciplines
including education, psychology, health, and policy analysis. Because of its
versatility and power, our software package of choice for this text is the
popularly used Stata, which provides both a menu-driven and command
line approach to the analysis of data, and, in so doing, facilitates the
transition to a more advanced course of study in statistics. Although the
choice of software is different, the content and general organization of the
text derive from its sister text, now in its third edition, Statistics Using IBM
SPSS: An Integrative Approach, and as such, this text also embraces and
is motivated by several important guiding principles found in the sister text.
First, and perhaps most important, we believe that a good data analytic
plan must serve to uncover the story behind the numbers, what the data tell
us about the phenomenon under study. To begin, a good data analyst must
know the data well and have confidence that they satisfy the underlying
assumptions of the statistical methods used. Accordingly, we emphasize
the usefulness of diagnostics in both graphical and statistical form to
expose anomalous cases, which might unduly influence results, and to help
in the selection of appropriate assumption-satisfying transformations so
that ultimately we may have
confidence in our findings. We also emphasize the importance of using
more than one method of analysis to answer fully the question posed and
understanding potential bias in the estimation of population parameters.
Second, because we believe that data are central to the study of good
statistical practice, the textbook’s website contains several data sets used
throughout the text. Two are large sets of real data that we make repeated
use of in both worked-out examples and end-of-chapter exercises. One
data set contains forty-eight variables and five hundred cases from the
education discipline; the other contains forty-nine variables and nearly
forty-five hundred cases from the health discipline. By posing interesting
questions about variables in these large, real data sets (e.g., Is there a
gender difference in eighth graders’ expected income at age thirty?), we
are able to employ a more meaningful and contextual approach to the
introduction of statistical methods and to engage students more actively in
the learning process. The repeated use of these data sets also contributes
to creating a more cohesive presentation of statistics; one that links
different methods of analysis to each other and avoids the perception that
statistics is an often-confusing array of so many separate and distinct
methods of analysis, with no bearing or relationship to one another.
Third, to facilitate the analysis of these data, and to provide students with a
platform for actively engaging in the learning process associated with what
it means to be a good researcher and data analyst, we have incorporated
the latest version of Stata (version 14), a popular statistical software
package, into the presentation of statistical material using a highly
integrative approach that reflects practice. Students learn Stata along with
each new statistical method covered, thereby allowing them to apply their
newly learned knowledge to the real world of applications. In addition to
demonstrating the use of Stata within each chapter, all chapters have an
associated .do-file, designed to allow students not only to replicate all
worked-out examples within a chapter but also to reproduce the figures
embedded in a chapter, and to create their own .do-files by extracting and
modifying commands from them. Emphasizing data workflow management
throughout the text using the Stata .do-file allows students to begin to
appreciate one of the key ingredients of being a good researcher. Of
course, another key ingredient of being a good researcher is
content knowledge, and toward that end, we have included in the text a
more comprehensive coverage of essential topics in statistics not covered
by other textbooks at the introductory level, including robust methods of
estimation based on resampling using the bootstrap, regression to the
mean, the weighted mean, and potential sources of bias in the estimation
of population parameters based on the analysis of data from quasi-
experimental designs, Simpson’s Paradox, counterfactuals, and other
issues related to research design.
Fourth, in accordance with our belief that the result of a null hypothesis test
(to determine whether an effect is real or merely apparent) is only a means
to an end (to determine whether the effect is important or useful) rather
than an end in itself, we stress the need to evaluate the magnitude of an
effect if it is deemed to be real, and of drawing clear distinctions between
statistically significant and substantively significant results. Toward this
end, we introduce the computation of standardized measures of effect size
as common practice following a statistically significant result. Although we
provide guidelines for evaluating, in general, the magnitude of an effect, we
encourage readers to think more subjectively about the magnitude of an
effect, bringing into the evaluation their own knowledge and expertise in a
particular area.
Finally, we believe that a key ingredient of an introductory statistics text is a
lively, clear, conceptual, yet rigorous approach. We emphasize conceptual
understanding through an exploration of both the mathematical principles
underlying statistical methods and real world applications. We use an easy-
going, informal style of writing that we have found gives readers the
impression
that they are involved in a personal conversation with the authors. And we
sequence concepts with concern for student readiness, reintroducing topics
in a spiraling manner to provide reinforcement and promote the transfer of
learning.
Another distinctive feature of this text is the inclusion of a large bibliography
of references to relevant books and journal articles, and many end- of-
chapter exercises with detailed answers on the textbook’s website. Along
with the earlier topics mentioned, the inclusion of linear and nonlinear
transformations, diagnostic tools for the analysis of model fit, tests of
inference, an in-depth discussion of interaction and its interpretation in both
two-way analysis of variance and multiple regression, and nonparametric
statistics, the text provides a highly comprehensive coverage of essential
topics in introductory statistics, and, in so doing, gives instructors flexibility
in curriculum planning and provides students with more advanced material
for future work in statistics.
The book, consisting of seventeen chapters, is intended for use in a one- or
two-semester introductory applied statistics course for the behavioral,
social, or health sciences, at either the graduate or undergraduate level, or
as a reference text. It is not intended for readers who wish to
acquire a more theoretical understanding of mathematical statistics. To
offer another perspective, the book may be described as one that begins
with modern approaches to Exploratory Data Analysis (EDA) and
descriptive statistics, and then covers material similar to what is found in an
introductory mathematical statistics text, designed, for example, for
undergraduates in math and the physical sciences, but stripped of calculus
and linear algebra and grounded instead in data examples. Thus,
theoretical probability distributions, the Law of Large Numbers, sampling
distributions, and the Central Limit Theorem are all covered, but in the
context of solving practical and interesting problems.

Acknowledgments
This book has benefited from the many helpful comments of our New York
University and Drew University students, too numerous to mention by
name, and from the insights and suggestions of several colleagues. For
their help, we would like to thank (in alphabetical order) Chris Apelian, Ellie
Buteau, Jennifer Hill, Michael Karchmer, Steve Kass, Jon Kettenring, Linda
Lesniak, Kathleen Madden, Isaac Maddow-Zimet, Meghan McCormick,
Joel Middleton, and Marc Scott. Of course, any errors or shortcomings in
the text remain the responsibility of the authors.
Chapter One
Introduction

Welcome to the study of statistics! It has been our experience that many
students face the prospect of taking a course in statistics with a great deal
of anxiety, apprehension, and even dread. They fear not having the
extensive mathematical background that they assume is required, and they
fear that the contents of such a course will be irrelevant to their work in
their fields of concentration.
Although it is true that an extensive mathematical background is required at
more advanced levels of statistical study, it is not required for the
introductory level of statistics presented in this book. Greater reliance is
placed on the use of the computer for calculation and graphical display so
that we may focus on issues of conceptual understanding and
interpretation. Although hand computation is deemphasized, we believe,
nonetheless, that a basic mathematical background – including the
understanding of fractions, decimals, percentages, signed (positive and
negative) numbers, exponents, linear equations, and graphs – is essential
for an enhanced conceptual understanding of the material.
As for the issue of relevance, we have found that students better
comprehend the power and purpose of statistics when it is presented in the
context of a substantive problem with real data. In this information age,
data are available on a myriad of topics. Whether our interests are in
health, education,
psychology, business, the environment, and so on, numerical data may be
accessed readily to address our questions of interest. The purpose of
statistics is to allow us to analyze these data to extract the information that
they contain in a meaningful way and to write a story about the numbers
that is both compelling and accurate.
Throughout this course of study we make use of a series of real data sets
that are located on the website for this book, www.cambridge.org/Stats-
Stata, and may be accessed by clicking on the Resources tab. We will
pose relevant questions and learn the appropriate methods
of analysis for answering such questions. Students will learn that more than
one statistic or method of analysis typically is needed to address a question
fully. Students also will learn that a detailed description of the data,
including possible anomalies, and an ample characterization of results, are
critical components of any data analytic plan. Through this process we
hope that this book will help students come to view statistics as an
integrated set of data analytic tools that when used together in a
meaningful way will serve to uncover the story contained in the numbers.
The Role of the Computer in Data Analysis
From our own experience, we have found that the use of a statistics
software package to carry out computations and create graphs not only
enables a greater emphasis on conceptual understanding and
interpretation but also allows students to study statistics in a way that
reflects statistical practice. We have selected the latest version of Stata
available to us at the time of writing, version 14, for use with this text. We
have selected Stata because it is a well-established comprehensive
package with a robust technical support infrastructure that is widely used
by behavioral and social scientists. In addition, not only does Stata include
a menu-driven “point and click” interface, making it accessible to the new
user, but it also includes a command line or program syntax interface,
allowing students to be guided from the comfortable “point and click”
environment to the beginnings of statistical programming. Like MINITAB,
JMP, Data Desk, Systat, SPSS, and SPlus, Stata is powerful enough to
handle the analysis of large data sets quickly. By the end of the course,
students will have obtained a conceptual understanding of statistics as well
as an applied, practical skill in how to carry out statistical operations.
Statistics: Descriptive and Inferential
The subject of statistics may be divided into two general branches:
descriptive and inferential. Descriptive statistics are used when the purpose
of an investigation is to describe the data that have been (or will be)
collected. Suppose, for example, that a third-grade elementary school
teacher is interested in determining the proportion of children who are
firstborn in her class of thirty children. In this case, the focus of the
teacher’s question is her own class of thirty children and she will be able to
collect data on all of the students about whom she would like to draw a
conclusion. The data collection operation will involve noting whether each
child in the classroom is firstborn or not; the statistical operations will
involve counting the number who are, and dividing that number by thirty,
the total number of students in the class, to obtain the proportion sought.
Because the teacher is using statistical methods merely to describe the
data she collected, this is an example of descriptive statistics.
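A small Stata sketch makes the teacher’s computation concrete. This is a hypothetical illustration, not an example from the text: the variable name firstborn and the simulated values are assumptions, and the key idea is simply that the mean of a 0/1 indicator variable is the proportion of cases coded 1.

```stata
* Hypothetical class of 30 children; firstborn = 1 if the child is
* firstborn and 0 otherwise. Values here are simulated for illustration.
clear
set obs 30
set seed 1234
generate firstborn = runiform() < 0.4

* The mean of the 0/1 indicator equals the proportion of firstborn children.
summarize firstborn

* A frequency table shows the same result as counts and percentages.
tabulate firstborn
```

Because the indicator is coded 0/1, the mean reported by summarize is the number of firstborn children divided by thirty, which is exactly the proportion the teacher set out to describe.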
Suppose, by contrast, that the teacher is interested in determining the
proportion of children who are firstborn in all third-grade classes in the city
where she teaches. It is highly unlikely that she will be able to (or even
want to) collect the relevant data on all individuals about whom she would
like to draw a conclusion. She will probably have to limit the data collection
to some randomly selected smaller group and use inferential statistics to
generalize to the larger group the conclusions obtained from the smaller
group. Inferential statistics are used when the purpose of the research is
not to describe the data that have been collected but to generalize or make
inferences based on it. The smaller group on which she collects data is
called the sample, whereas the larger group to whom conclusions are
generalized (or inferred), is called the population. In general,
two major factors influence the teacher’s confidence that what holds true
for the sample also holds true for the population at large. These two factors
are the method of sample selection and the size of the sample. Only when
data are collected on all individuals about whom a conclusion is to be
drawn (when the sample is the population and we are therefore in the
realm of descriptive statistics), can the conclusion be drawn with 100
percent certainty. Thus, one of the major goals of inferential statistics is to
assess the degree of certainty of inferences when such inferences are
drawn from sample data. Although this text is divided roughly into two
parts, the first on descriptive statistics and the second on inferential
statistics, the second part draws heavily on the first.
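In Stata, drawing a simple random sample from a larger data set is done with the sample command. The following is a minimal sketch under stated assumptions: the data set name thirdgraders and the variable firstborn are hypothetical, standing in for the population of third graders described above.

```stata
* Hypothetical sketch: estimate the citywide proportion of firstborn
* third graders from a simple random sample of 100 cases.
use thirdgraders, clear      // assumed population data set (hypothetical)
set seed 5678                // makes the random draw reproducible
sample 100, count            // keep a simple random sample of 100 observations
summarize firstborn          // the sample mean estimates the population proportion
```

The sample proportion obtained this way is then used, via the methods of inferential statistics covered in the second part of the text, to draw a conclusion about the full population, together with an assessment of the uncertainty of that conclusion.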
Variables and Constants
In the previous section, we discussed a teacher’s interest in determining
the proportion of students who are firstborn in the third grade of the city
where she teaches. What made this question worth asking was the fact that
she did not expect everyone in the third grade to be firstborn. Rather, she
quite naturally expected that in the population under study, birth order
would vary, or differ, from individual to individual and that only in certain
individuals would it be first. Characteristics of persons or objects that vary
from person to person or object to object are called variables, whereas
characteristics that remain constant from person to person or object to
object are called constants. Whether a characteristic is designated as a
variable or as a constant depends, of course, on the study in question. In
the study of birth order, birth order is a variable; it can be expected to vary
from person to person in the given population. In that same study, grade
level is a constant; all persons in the population under study are in the third
grade.
EXAMPLE 1.1.
Identify some of the variables and constants in a study comparing the math
achievement of tenth-grade boys and girls in the southern United States.
Solution.
Constants: Grade level; Region of the United States
Variables: Math achievement; Sex
EXAMPLE 1.2.
Identify some of the variables and constants in a study of math
achievement of secondary-school boys in the southern United States.
Solution.
Constants: Sex; Region of the United States
Variables: Math achievement; Grade level
Note that grade level is a constant in Example 1.1 and a variable in
Example 1.2. Because constants are characteristics that do not vary in a
given population, the study of constants is neither interesting nor
informative. The major focus of any statistical study is therefore on the
variables rather than the constants. Before variables can be the subject of
statistical study, however, they need to be
numerically valued. The next section describes the process of measuring
variables so as to achieve that goal.
The Measurement of Variables
Measurement involves the observation of characteristics on persons or
objects, and the assignment of numbers to those persons or objects so that
the numbers represent the amounts of the characteristics possessed. As
introduced by S. S. (Stanley Smith) Stevens (1946) in a paper, “On the
theory of scales of measurement,” and later described by him in a chapter,
“Mathematics, Measurement, and Psychophysics,” in the Handbook of
Experimental Psychology, edited by Stevens (1951), we describe four
levels of measurement in this text. Each of the four levels is defined by the
nature of the observation and the way in which the numbers assigned
correspond to the amount of the underlying characteristic that has been
observed. The level of measurement of a variable determines which
numerical operations (e.g., addition, subtraction, multiplication, or division)
are permissible on that variable. If numerical operations other than the
permissible ones are applied to a variable, given its level of measurement,
the statistical conclusions drawn with respect to that variable may be
questionable.
Nominal Level. The nominal level of measurement is based on the
simplest form of observation – whether two objects are similar or dissimilar;
for example, whether they are short versus nonshort, male versus female,
or college student versus noncollege student. Objects observed to be
similar on some characteristic (e.g., college student) are assigned to the
same class or category, while objects observed to be dissimilar on that
characteristic are assigned to different classes or categories. In the nominal
level of measurement, classes or categories are not compared as, say,
taller or shorter, better or worse, or more educated or less educated.
Emphasis is strictly on observing whether the objects are similar or
dissimilar. As its label suggests, classes or categories in the nominal level
of measurement are merely named, but not compared.
Given the nature of observation for this level of measurement, numbers are
assigned to objects using the following simple rule: if objects are dissimilar,
they are assigned different numbers; if objects are similar, they are
assigned the same number. For example, all persons who are college
students would be assigned the same number (say, 1); all persons who are
noncollege students also would be assigned the same number different
from 1 (say, 2) to distinguish them from college students. Because the
focus is on distinction and not comparison, in this level of measurement,
the fact that the number 2 is larger than the number 1 is irrelevant in
terms of the underlying characteristic being measured (whether or not the
person is a college student). Accordingly, the number 1 could have been
assigned, instead, to all persons who are noncollege students and the
number 2 to all persons who are college students. Any numbers other than
1 and 2 could have been used as well.
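As a concrete sketch of this coding rule in Stata (the variable name collstat and label name cstat here are hypothetical, invented for illustration), we might enter a small nominal variable by hand and attach value labels to it:

```stata
* Hypothetical illustration: code college status as a nominal variable.
* The particular numbers (1 and 2) are arbitrary; only distinctness matters.
clear
input collstat
1
1
2
1
end
label define cstat 1 "college student" 2 "noncollege student"
label values collstat cstat
tabulate collstat   // frequencies by category; the order of the codes carries no meaning
```

Because the level of measurement is nominal, swapping the codes (2 for college students, 1 for noncollege students) would change nothing of substance in the resulting frequency table.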
Although the examples in this section (e.g., college student versus
noncollege student) may be called dichotomous in that they have only two
categories, nominal variables also may have more than two categories
(e.g., car manufacturers – Toyota, Honda, General Motors, Ford, Chrysler,
etc.).
Ordinal Level. The ordinal level of measurement is not only based on
observing objects as similar or dissimilar but also on ordering those
observations in terms of an underlying characteristic. Suppose, for
example, we were not interested simply in whether a person was a college
student or not but, rather, in ordering college students in terms of the
degree of their success in college (e.g., whether the college student was
below average, average, or above average). We would, therefore, need to
observe such ordered differences among these college students in terms of
their success in college and we would choose numbers to assign to the
categories that corresponded to that ordering. For this example, we
might assign the number 1 to the below average category, the number 2 to
the average category, and the number 3 to the above average category.
Unlike in the nominal level of measurement, in the ordinal level of
measurement, it is relevant that 3 is greater than 2, which, in turn, is
greater than 1, as this ordering conveys in a meaningful way the ordered
nature of the categories relative to the underlying characteristic of interest.
That is, comparisons among the numbers correspond to comparisons
among the categories in terms of the underlying characteristic of success in
college. In summary, the ordinal level of measurement applies two rules for
assigning numbers to categories: (1) different numbers are assigned to
persons or objects that possess different amounts of the underlying
characteristic, and (2) the higher the number assigned to a person or
object, the more (or less) of the underlying characteristic that person or
object is observed to possess. From these two rules it does not follow,
however, that equal numerical differences along the number scale
correspond to equal increments in the underlying characteristic being
measured in the ordinal level of measurement. Although the differences
between 3 and 2 and between 2 and 1 in our college student success
example are both equal to 1, we cannot infer from this that the difference in
success between above average college students and average college
students equals the difference in success between average college
students and below average college students.
We consider another example that may convey this idea more clearly.
Suppose we line ten people up according to their size place and assign a
number from 1 to 10, respectively, to each person so that each number
corresponds to the person’s size place in line. We could assign the number
1 to the shortest person, the number 2 to the next shortest person, and so
forth, ending by assigning the number 10 to the tallest person. While
according to this method, the numbers assigned to each pair of adjacent
people in line will differ from each other by the same value (i.e., 1), clearly,
the heights of each pair of adjacent people will not
necessarily also differ by the same value. Some adjacent pairs will differ in
height by only a fraction of an inch, while other adjacent pairs will differ in
height by several inches. Accordingly, only some of the features of this size
place ranking are reflected or modeled by the numerical scale. In particular,
while, in this case, the numerical scale can be used to judge the relative
order of one person’s height compared to another’s, differences between
numbers on the numerical scale cannot be used to judge how much taller
one person is than another. As a result, statistical conclusions about
variables measured on the ordinal level that are based on other than an
ordering or ranking of the numbers (including taking sums or differences)
cannot be expected to be meaningful.
Interval Level. An ordinal level of measurement can be developed into a
higher level of measurement if it is possible to assess how near to each
other the persons or objects are in the underlying characteristic being
observed. If numbers can be assigned in such a way that equal numerical
differences along the scale correspond to equal increments in the
underlying characteristic, we have what is called an interval level of
measurement. As an example of an interval level of measurement,
consider the assignment of yearly dates, the chronological scale. Because
one year is defined as the amount of time necessary for the earth to
revolve once around the sun, we may think of the yearly date as a measure
of the number of revolutions of the earth around the sun up to and including
that year. Hence, this assignment of numbers to the property number of
revolutions of the earth around the sun is on an interval level of
measurement. Specifically, this means that equal numerical differences for
intervals (such as 1800 C.E. to 1850 C.E. and 1925 C.E. to 1975 C.E.) represent equal
differences in the number of revolutions of the earth around the sun (in this
case, fifty). In the interval level of measurement, therefore, we may make
meaningful statements about the amount of difference between any two
points along the scale. As such, the numerical operations of addition and
subtraction (but not
multiplication and division) lead to meaningful conclusions at the interval
level and are therefore permissible at that level. For conclusions based on
the numerical operations of multiplication and division to be meaningful, we
require the ratio level of measurement.
Ratio Level. An interval level of measurement can be developed into a
higher level of measurement if the number zero on the numeric scale
corresponds to zero or “not any” of the underlying characteristic being
observed. With the addition of this property (called an absolute zero), ratio
comparisons are meaningful, and we have what is called the ratio level of
measurement. Consider once again the chronological scale and, in
particular, the years labeled 2000 C.E. and 1000 C.E. Even though 2000 is
numerically twice as large as 1000, it does not follow that the number of
revolutions represented by the year 2000 is twice the number of revolutions
represented by the year 1000. This is because on the chronological scale,
the number 0 (0 C.E.) does not correspond to zero revolutions of the earth
around the sun (i.e., the earth had made revolutions around the sun many
times prior to the year 0 C.E.).
In order for us to make meaningful multiplicative or ratio comparisons of
this type between points on our number scale, the number 0 on the
numeric scale must correspond to 0 (none) of the underlying characteristic
being observed.
In measuring height, not by size place, but with a standard ruler, for
example, we would typically assign a value of 0 on the number scale to “not
any” height and assign the other numbers according to the rules of the
interval scale. The scale that would be produced in this case would be a
ratio scale of measurement, and ratio or multiplicative statements (such as
“John, who is 5 feet tall, is twice as tall as Jimmy, who is 2.5 feet tall”)
would be meaningful. It should be pointed out that for variables to be
considered to be measured on a ratio level, “not any” of the underlying
characteristic only needs to be meaningful theoretically. Clearly, no one
has zero height, yet using zero as an
anchor value for this scale to connote “not any” height is theoretically
meaningful.
Choosing a Scale of Measurement. Why is it important to categorize the
scales of measurement as nominal, ordinal, interval, or ratio? If we
consider college students and assign a 1 to those who are male college
students, a 2 to those who are female college students, and a 3 to those
who are not college students at all, it would not be meaningful to add these
numbers nor even to compare their sizes. For example, two male college
students together do not suddenly become a female college student, even
though their numbers add up to the number of a female college student (1
+ 1 = 2). And a female college student who is attending school only half-
time is not suddenly a male college student, even though half of her
number is the number of a male college student (2/2 = 1). By contrast, if we
were dealing with a ratio-leveled height scale, it would be possible to add,
subtract, multiply, or divide the numbers on the scale and obtain results
that are meaningful in terms of the underlying trait, height. In general, and
as noted earlier, the scale of measurement determines which numerical
operations, when applied to the numbers of the scale, can be expected to
yield results that are meaningful in terms of the underlying trait being
measured. Any numerical operation can be performed on any set of
numbers; whether the resulting numbers are meaningful, however,
depends on the particular level of measurement being used.
Note that the four scales of measurement exhibit a natural hierarchy, or
ordering, in the sense that each level exhibits all the properties of those
below it (see Table 1.1). Any characteristic that can be measured on one
scale listed in Table 1.1 can also be measured on any scale below it in that
list. Given a precise measuring instrument such as a perfect ruler, we can
measure a person’s height, for example, as a ratio-scaled variable, in which
case we could say that a person whose height is 5 feet has twice the height
of a person whose height is 2.5 feet.
Suppose, however, that no measuring instrument were available. In this
situation, we could, as we have done before, “measure” a person’s height
according to size place or by categorizing a person as tall, average, or
short. By assigning numbers to these three categories (such as 5, 3, and 1,
respectively), we would create an ordinal level of measurement for height.
Table 1.1. Hierarchy of scales of measurement
1. Ratio
2. Interval
3. Ordinal
4. Nominal
In general, it may be possible to measure a variable on more than one
level. The level that is ultimately used to measure a variable should be the
highest level possible, given the precision of the measuring instrument
used. A perfect ruler allowed us to measure heights on a ratio level, while
the eye of the observer allowed us to measure height only on an ordinal
level. If we are able to use a higher level of measurement but decide to use
a lower level instead, we would lose some of the information that would
have been available to us on the higher level. We would also be restricting
ourselves to a lower level of permissible numerical operations.
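In Stata, the cost of moving from a higher to a lower level of measurement can be made concrete. The sketch below (with invented heights and a hypothetical variable name) collapses a ratio-leveled height variable, in inches, into ordered categories; once collapsed, the original inches can no longer be recovered:

```stata
* Illustrative only: heights entered by hand, in inches (ratio level).
clear
input height
58
65
66
74
end
* cut() assigns each height to a band; the resulting variable is only
* ordinal, and exact differences in inches are lost.
egen heightcat = cut(height), at(0, 63, 69, 100) label
tabulate heightcat
```

Running summarize height before and tabulate heightcat after makes the information loss visible: the first reports means and ranges in inches, while the second can report only ordered category counts.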
EXAMPLE 1.3.
Identify the level of measurement (nominal, ordinal, interval, or ratio) most
likely to be used to measure the following variables:
1. Ice cream flavors
2. The speed of five runners in a one-mile race, as measured by the
runners’ order of finish, first, second, third, and so on.
3. Temperature measured in Centigrade degrees.
4. The annual salary of individuals.
Solution.
1. The variable ice cream flavors is most likely measured at the nominal
level of measurement because the flavors themselves may be categorized
simply as being the same or different and there is nothing inherent to them
that would lend themselves to a ranking. Any ranking would have to
depend on some extraneous property such as, say, taste preference. If
numbers were assigned to the flavors as follows,
Flavor Number
Vanilla 0
Chocolate 1
Strawberry 2
etc. etc.
meaningful numerical comparisons would be restricted to whether the
numbers assigned to two ice cream flavors are the same or
different. The fact that one number on this scale may be larger than
another is irrelevant.
2. This variable is measured at the ordinal level because it is the order of
finish (first, second, third, and so forth) that is being observed and not the
specific time to finish. In this example, the smaller the number the greater
the speed of the runner. As in the case of measuring height via a size place
ranking, it is not necessarily true that the difference in speed between the
runners who finished first and second is the same as the difference in
speed between the runners who finished third and fourth. Hence, this
variable is not measured at the interval level. Had time to finish been used
to measure the speed of the runners, the level of measurement would have
been ratio for the same reasons that height, measured by a ruler, is ratio-
leveled.
3. Temperature measured in Centigrade degrees is at the interval level of
measurement because each degree increment, no matter whether from 3 to 4
degrees Centigrade or from 22 to 23 degrees Centigrade, has the same physical
meaning in terms of the underlying characteristic, heat. In particular, it takes one
calorie to raise the temperature of 1 mL of water by 1 degree Centigrade, no
matter what the initial temperature reading on the Centigrade scale. Thus, equal
differences along the Centigrade scale correspond to equal increments in heat,
making this scale interval-leveled. The reason this scale is not ratio-scaled is
because 0 degrees Centigrade does not correspond to “not any” heat. The 0 degree
point on the Centigrade scale is the point at which water freezes, but even frozen
water contains plenty of heat. The point of “not any” heat is at −273
degrees Centigrade. Accordingly, we cannot make meaningful ratio
comparisons with respect to amounts of heat on the Centigrade scale and
say, for example, that at 20 degrees Centigrade there is twice the heat than
at 10 degrees Centigrade.
4. The most likely level of measurement for annual salary is the ratio level
because each additional unit increase in annual salary along the numerical
scale corresponds to an equal additional one dollar earned no matter
where on the scale one starts, whether it be, for example, at $10,000 or at
$100,000; and, furthermore, because the numerical value of 0 on the scale
corresponds to “not any” annual salary, giving the scale a true or absolute
zero. Consequently, it is appropriate to make multiplicative comparisons on
this scale, such as “Sally’s annual salary of $100,000 is twice Jim’s annual
salary of $50,000.”
Discrete and Continuous Variables
As we saw in the last section, any variable that is not intrinsically
numerically valued, such as the ice cream flavors in Example 1.3, may be
converted to a numerically valued variable. Once a variable is numerically
valued, it may be classified as either discrete or continuous.
Although there is really no exact statistical definition of a discrete or a
continuous variable, the following usage generally applies. A numerically
valued variable is said to be discrete (or categorical or qualitative) if the
values it takes on are integers or can be thought of in some unit of
measurement in which they are integers. A numerically valued variable is
said to be continuous if, in any unit of measurement, whenever it can take
on the values a and b, it can also theoretically take on all the values
between a and b. The limitations of the measuring instrument are not
considered when discriminating between discrete and continuous variables.
Instead, it is the nature of the underlying variable that distinguishes
between the two types.
☞ Remark. As we have said, there is really no hard and fast definition of
discrete and continuous variables for use in practice. The words discrete
and continuous do have precise mathematical meanings, however, and in
more advanced statistical work, where more mathematics and
mathematical theory are employed, the words are used in their strict
mathematical sense. In this text, where our emphasis is on statistical
practice, the usage of the terms discrete and continuous will not usually be
helpful in guiding our selection of appropriate statistical methods or
graphical displays. Rather, we will generally use the
particular level of measurement of the variable, whether it is nominal,
ordinal, interval, or ratio.
EXAMPLE 1.4.
Let our population consist of all eighth-grade students in the United States,
and let X represent the region of the country in which the student lives. X is
a variable, because there will be different regions for different students. X is
not naturally numerically valued, but because X represents a finite number
of distinct categories, we can assign numbers to these categories in the
following simple way: 1 = Northeast, 2 = North Central, 3 = South, and 4 =
West. X is a discrete variable, because it can take on only four values.
Furthermore, because X is a nominal-leveled variable, the assignment of 1,
2, 3, and 4 to Northeast, North Central, South, and West, respectively, is
arbitrary. Any other assignment rule would have been just as meaningful in
differentiating one region from another.
EXAMPLE 1.5.
Consider that we repeatedly toss a coin 100 times and let X represent the
number of heads obtained for each set of 100 tosses. X is naturally
numerically valued and may be considered discrete because the only
values it can take on are the integer values 0, 1, 2, 3, and so forth. We may
note that X is ratio-leveled in this example because 0 on the numerical
scale represents “not any” heads.
EXAMPLE 1.6.
Consider a certain hospital with 100 beds. Let X represent the proportion
of occupied beds for different days of the year. X is naturally numerically
valued as a proportion of the number of occupied beds. Although X takes
on fractional values, it is considered discrete because the proportions are
based on a count of the number of beds occupied, which is an integer
value.
EXAMPLE 1.7.
Let our population consist of all college freshmen in the United States, and
let X represent their heights, measured in inches. X is numerically valued
and is continuous because all possible values of height are theoretically
possible. Between any two heights exists another theoretically possible
height. For example, between 70 and 71 inches in height, exists a height of
70.5 inches and between 70 and 70.5 exists a height of 70.25 inches, and
so on.
☞ Remark. Even if height in Example 1.7 were reported to the nearest inch
(as an integer), it would still be considered a continuous variable because
all possible values of height are theoretically possible. Reporting values of
continuous variables to the nearest integer is usually due to the lack of
precision of our measuring instruments. We would need a perfect ruler to
measure the exact values of height. Such a measuring instrument does
not, and probably cannot, exist. When height is reported to the nearest
inch, a height of 68 inches is considered to represent all heights between
67.5 and 68.5 inches. While the precision of the measuring
instrument determines the accuracy with which a variable is measured, it
does not determine whether the variable is discrete or continuous. For that
we need only to consider the theoretically possible values that a variable
can assume.
☞ Remark. In addition to the problem of not being able to measure
variables precisely, another problem that often confronts the behavioral
scientist is the measurement of traits, such as intelligence, that are not
directly observable. Instead of measuring intelligence directly, tests have
been developed that measure it indirectly, such as the IQ test. While such
tests report IQ, for example, as an integer value, IQ is considered to be a
continuous trait or variable, and an IQ score of 109, for example, is thought
of as theoretically representing all IQ scores between 108.5 and 109.5.
Another issue, albeit a more controversial one, related to the
measurement of traits that are not directly observable, is the level of
measurement employed. While some scientists would argue that IQ scores
are only ordinal (given a good test of intelligence, a person whose IQ score
is higher than another’s on that test is said to have greater intelligence),
others would argue that they are interval. Even though equal intervals
along the IQ scale (say, between 105 and 110 and between 110 and 115)
may not necessarily imply equal amounts of change in intelligence, a
person who has an IQ score of 105 is likely to be closer in intelligence to a
person who has an IQ score of 100 than to a person who has an IQ score
of 115. By considering an IQ scale and other such psychosocial scales as
ordinal only, one would lose information that is contained in the data, as
well as the ability to make use of statistical operations based on sums and
differences rather than merely on rankings.
Another type of scale, widely used in attitude measurement, is the Likert
scale, which consists of a small number of values, usually five or seven,
ordered along a continuum representing agreement. The values
themselves are labeled typically from strongly disagree to strongly agree. A
respondent selects that score on the scale that corresponds to his or her
level of agreement with a statement associated with that scale. For the
same reasons noted earlier with respect to the IQ scale, for example, the
Likert scale is considered by many to be interval rather than strictly ordinal.
EXAMPLE 1.9.
For each variable listed, describe whether the variable is discrete or
continuous. Also, describe the level of measurement for each variable. Use
the following data set excerpted from data analyzed by Tomasi and
Weinberg (1999). The original study was carried out on 105 elementary
school children from an urban area who were classified as learning
disabled (LD) and who, as a result, were receiving special education
services for at least three years. In this example, we use only the four
variables described below.
Variable Name   What the Variable Measures   How It Is Measured
SEX             Sex of student               0 = Male; 1 = Female
GRADE           Grade level                  1, 2, 3, 4, or 5
AGE             Age in years                 Ages ranged from 6 to 14
MATHCOMP        Mathematical comprehension   Higher scores associate with greater
                                             mathematical comprehension. Scores
                                             could range from 0 to 200.
Solution. The variable SEX is discrete because the underlying construct
represents a finite number of distinct categories (male or female).
Furthermore, individuals may be classified as either male or female, but
nothing in between. The level of measurement for SEX is nominal because
the two categories are merely different from one another.
The variable GRADE is discrete because the underlying construct
represents a finite number (five) of distinct categories and individuals in this
study may only be in one of the five grades. The level of measurement for
GRADE is interval because the grades are ordered and equal numerical
differences along the grade scale represent equal increments in grade.
The variable AGE is continuous even though the reported values are
rounded to the nearest integer. AGE is continuous because the underlying
construct can theoretically take on any value between 5.5 and 14.5 for
study participants. The level of measurement for AGE is ratio because
higher numerical values correspond to greater age, equal numerical
differences along the scale represent equal increments in age within
rounding, and the value of zero is a theoretically meaningful anchor point
on the scale representing “not any” age.
The variable MATHCOMP, like AGE, is continuous. The underlying
construct, mathematical comprehension, theoretically can take on any
value between 0 and 200. The level of measurement is at least ordinal
because the higher the numerical value, the greater the comprehension in
math, and it is considered to be interval because equal numerical
differences along the scale are thought to represent equal or nearly equal
increments in math comprehension. Because individuals who score zero on
this scale do not possess a total absence of math comprehension, the
scale does not have an absolute zero and is, therefore, not ratio-leveled.
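Had the Tomasi and Weinberg data been loaded into Stata, commands suited to each variable's level of measurement might look as follows (a sketch only; the data set itself is not distributed with this example, and the commands assume the four variables are in memory under these exact names):

```stata
* Assumes SEX, GRADE, AGE, and MATHCOMP are in the active data set.
tabulate SEX              // nominal: category frequencies are all that is meaningful
tabulate GRADE            // ordered categories 1 through 5
summarize AGE MATHCOMP    // means and differences are meaningful at these levels
```

Note that tabulate, which merely counts category membership, respects the nominal level of SEX, whereas summarize, which computes sums and differences, is reserved here for the variables measured at the interval level or above.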
Setting a Context with Real Data
We turn now to some real data and gain familiarity with Stata. For this
example and for others throughout the book, we have selected data from
the National Education Longitudinal Study, begun in 1988, which we will
refer to as the NELS data set.
In response to pressure from federal and state agencies to monitor school
effectiveness in the United States, the National Center for Education
Statistics (NCES) of the U.S. Department of Education conducted a survey
in the spring of 1988. The participants consisted of a nationally
representative sample of approximately 25,000 eighth graders to measure
achievement outcomes in four core subject areas (English, history,
mathematics, and science), in addition to personal, familial, social,
institutional, and cultural factors that might relate to these outcomes.
Details on the design and initial analysis of this survey may be referenced
in Horn, Hafner, and Owings (1992). A follow-up of these students was
conducted during tenth grade in the spring of 1990; a second follow-up was
conducted during the twelfth grade in the spring of 1992; and, finally, a third
follow-up was conducted in the spring of 1994.
For this book, we have selected a sub-sample of 500 cases and 48
variables. The cases were sampled randomly from the approximately 5,000
students who responded to all four administrations of the survey, who were
always at grade level (neither repeated nor skipped a grade) and who
pursued some form of postsecondary education. The particular variables
were selected to explore the relationships between student and home-
background variables, self-concept, educational and income aspirations,
academic motivation, risk-taking behavior, and academic achievement.
Some of the questions that we will be able to address using these data
include: Do boys perform better on math achievement tests than girls?
Does socioeconomic status relate to educational and/or income
aspirations? To what extent does enrollment in advanced math in eighth
grade predict twelfth-grade math achievement scores? Can we distinguish
between those who use marijuana and those who don’t in terms of self-
concept? Does owning a computer vary as a function of geographical
region of residence (Northeast, North Central, South, and West)? As you
become familiar with this data set, perhaps you will want to generate your
own questions of particular interest to you and explore ways to answer
them.
As shown below, Stata, which must be obtained separately from this text,
opens with five sub-windows contained within one large window.
Collectively, these windows help us to navigate between information about
the variables in the data set and statistical output based on analyses that
have been carried out on that data set. The largest (the area that contains
the word Stata), is called the Results or Output Window as it will contain
the results or output from our analyses. Directly below the Output Window
is the Command Window headed by the word, Command. It is in this
window that we enter the commands we want Stata to execute (run). The
window on the upper right is the Variables Window, headed by the word
Variables. This window displays a list of all variable names and labels for
the active dataset; that is, the dataset we currently are analyzing. Directly
below the Variables Window is the Properties Window headed by the word,
Properties. This window lists the properties of each of the variables in our
data set (e.g., their formats, variable labels, value labels, data types and so
on) as well as the properties of the data set itself (e.g., its filename, number
of variables, number of observations, and so on). On the extreme left is the
Review Window, headed by the word Review. This window gives a history
of the commands that we have executed to date. Because we have not yet
executed any
commands, the Review Window currently contains the message, “There
are no items to show.”
Across the top of the screen is a list of eight words (File, Edit, Data, Graphics, Statistics, User, Window, Help) that, when clicked, may be used to
navigate the Stata environment. We will refer to this list as comprising the
main menu bar. By clicking File, for example, we would obtain a drop-down
menu covering the variety of Stata activities that are at the file level, such
as File/Open and File/Save. If we were to click Window on the main menu
bar, we would see a listing of all of Stata’s windows, the first five of which
are those just described above. In addition to these, there are five other
windows that we do not see when we open Stata (Graph, Viewer, Data
Editor, Do-file Editor, and Variables Manager). These will be discussed
later in the text as needed.
It is important to note that Stata is case-sensitive, which means that Stata
will interpret a lowercase m, for example, as different from an uppercase M.
In addition, all commands entered into the Command Window must be in
lowercase. Stata will not understand a command typed with an uppercase
letter.
Case-sensitivity also means that a variable name must be typed in the Command Window exactly as it is defined in the data set: a name defined in lowercase letters must be typed in lowercase, a name defined in uppercase letters must be typed in uppercase, and a name with a combination of lower and upper case letters must be typed with that exact combination of upper and lower case letters.
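For instance, assuming the active data set contains a variable named ses (as the NELS data set does), the following sketch illustrates which forms of a command Stata will and will not accept:

```stata
* Commands and variable names must match Stata's case rules exactly.
tabulate ses        // runs: lowercase command, name typed as defined
* Tabulate ses      // error: command begins with an uppercase letter
* tabulate SES      // error: no variable named SES exists
```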
After typing a command, Stata will execute that command after we press
Enter. Although the results of executing that command will appear in the
Output Window, Stata does not automatically save the contents of the
Output Window. To keep a permanent copy of the contents of the Output
Window, we should begin each data analytic session by creating a log file.
When a log file is created, Stata outputs results to both the Output Window
and the log file. All of this will become quite clear when you do the end-of-
chapter Exercises of Chapter 1, which includes opening the NELS data set
in Stata.
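As a sketch of this workflow, with an illustrative filename, a session that keeps a permanent record of its output might proceed as follows:

```stata
* Begin recording all subsequent output to a log file
* (session1.smcl is an illustrative name)
log using session1.smcl

* ... analysis commands entered here send results to both the
* Output Window and the log file ...

* Close the log before ending the session
log close
```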
A description of the NELS data set may be found in Appendix A. Material
for all appendices, including Appendix A, may be found online at
www.cambridge.org/Stats-Stata. Descriptions of the other real data sets we
use throughout this text are also found in Appendix A.
Exercise 1.9 introduces you to some of the variables in the NELS data set
without the use of Stata. Exercise 1.10 shows you how to open the NELS
data set in Stata and Exercise 1.15 shows you how to enter data to create
a new data set in Stata. These exercises also review how to start a log file
in Stata.
Once we have completed our analyses and saved our results, and are ready to close Stata, we may do so by clicking File/Exit on the main menu bar or by typing exit in the Command Window and hitting Enter.
☞ Remark. Although we have described four levels of measurement in this
chapter, you should not confuse this with the “Data Type” that Stata
specifies. Stata characterizes each variable in the data set as either an
alphanumeric variable (i.e., a “string” variable), or a numeric variable.
Numeric variables are stored in one of several ways depending upon how
much precision is desired by the analyst when carrying out computations.
For our purposes, in this book, we will not be concerned about a numeric
variable’s storage type, but will simply use the storage type assigned by
Stata for that variable.
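For example, once a data set has been loaded, typing describe in the Command Window lists, among other things, the storage type Stata has assigned to each variable:

```stata
* The describe command reports each variable's storage type
* (byte, int, long, float, or double for numeric variables;
* strN for string variables)
describe
```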
☞ Remark. There is an important distinction to be drawn between the
numerical results of a statistical analysis and the interpretation of these
results given by the researcher. Methods involved in the interpretation of
results, such as researcher judgment, are not statistical operations. They
are extra-statistical. For instance, to determine on the basis of having
administered the same standardized test to a group of boys and girls that
the girls attain, on average, higher scores than the boys is a statistical
conclusion. However, to add that the reason for this difference is that the
test is biased toward the girls is a researcher based, not a statistically
based, conclusion. In offering such an interpretation, the researcher is
drawing on non-statistical information. It is important to be able to separate
statistical conclusions from researcher-inferred conclusions. The latter may
not justifiably follow from the former; and unfortunately, the latter are the
ones that are usually remembered and cited.
Exercises
1.1. Suppose Julia is a researcher who is gathering information on the
yearly income and number of years of experience of all female doctors
practicing in the United States. Identify each of the following as either a
constant or variable in this study. If your answer is variable, identify its most
likely level of measurement (nominal, ordinal, interval, or ratio).
a) Sex
b) Yearly income as reported on one’s tax return
c) Ethnicity
d) Profession (not specialty)
e) Number of years of experience
1.2. Given the population of all clowns in the Ringling Brothers, Barnum
and Bailey Circus, identify the following as either constant or variable. If
your answer is variable, identify its most likely level of measurement
(nominal, ordinal, interval, or ratio).
a) Height
b) Profession
c) Age
d) Eye color
1.3. Identify each of the following numerical variables as either discrete or
continuous and identify their most likely levels of measurement (nominal,
ordinal, interval, or ratio).
a) Percentage of high school seniors each year who report that they have
never smoked cigarettes
b) Annual rainfall in Savannah, Georgia, to the nearest inch
c) Number of runs scored in a baseball game
d) Weight
e) Verbal aptitude as measured by SAT verbal score
f) Salary of individual government officials
1.4. Identify the following numerical variables as either discrete or
continuous and identify their most likely levels of measurement (nominal,
ordinal, interval, or ratio).
a) Number of students enrolled at Ohio State University in any particular
term
b) Distance an individual can run in five minutes
c) Hair length
d) Number of hot dogs sold at baseball games
e) Self-concept as measured by the degree of agreement with the
statement, “I feel good about myself,” on a five-point scale where 1
represents strongly disagree, 2 represents disagree, 3 represents neutral, 4
represents agree, and 5 represents strongly agree
f) Lack of coordination as measured by the length of time it takes an
individual to assemble a certain puzzle
1.5. Identify the following numerical variables as either discrete or
continuous and identify their most likely levels of measurement (nominal,
ordinal, interval, or ratio).
a) Baseball jersey numbers
b) Number of faculty with Ph.D.’s at an institution
c) Knowledge of the material taught as measured by grade in course
d) Number of siblings
e) Temperature as measured on the Fahrenheit scale
f) Confidence in one’s ability in statistics as measured by a yes or no
response to the statement, “I am going to do very well in my statistics class
this semester”
1.6. A survey was administered to college students to learn about student
participation in University sponsored extra-curricular activities. Identify the
most likely level of measurement of each of the following variables taking
into account the coding used.
Variable Name: AGE
Question Asked: How old are you in years?
Coding: none

Variable Name: EXTRAC
Question Asked: At which type of extra-curricular activity do you spend most of your time?
Coding: 1 = Sports; 2 = Student Government; 3 = Clubs and organizations; 4 = None

Variable Name: TIME
Question Asked: How much time do you spend weekly on University sponsored extracurricular activities?
Coding: 0 = None; 1 = 1–2 hours; 2 = 3–5 hours; 3 = 6–10 hours; 4 = 10–20 hours; 5 = More than 20 hours
a) AGE
b) EXTRAC
c) TIME
1.7. The Campus Survey of Alcohol and Other Drug Norms is administered
to college students to collect information about drug and alcohol use on
campus. Identify the most likely level of measurement of each of the
following variables from that survey taking into account the coding used.
Variable Name: ABSTAIN
Question Asked: Overall, what percentage of students at this college do you think consume no alcoholic beverages at all?
Coding: none

Variable Name: LIVING
Question Asked: Living arrangements
Coding: 1 = House/apartment/etc.; 2 = Residence Hall; 3 = Other

Variable Name: STATUS
Question Asked: Student status
Coding: 1 = Full-time (12+ credits); 2 = Part-time (1–11 credits)

Variable Name: ATTITUDE
Question Asked: Which statement about drinking alcoholic beverages do you feel best represents your own attitude?
Coding: 1 = Drinking is never a good thing to do. 2 = Drinking is all right but a person should not get drunk. 3 = Occasionally getting drunk is okay as long as it doesn't interfere with academics or other responsibilities. 4 = Occasionally getting drunk is okay even if it does interfere with academics or other responsibilities. 5 = Frequently getting drunk is okay if that's what the individual wants to do.

Variable Name: DRINK
Question Asked: How often do you typically consume alcohol (including beer, wine, wine coolers, liquor, and mixed drinks)?
Coding: 0 = Never; 1 = 1–2 times/year; 2 = 6 times/year; 3 = Once/month; 4 = Twice/month; 5 = Once/week; 6 = 3 times/week; 7 = 5 times/week; 8 = Every day
a) ABSTAIN
b) LIVING
c) STATUS
d) ATTITUDE
e) DRINK
Exercise 1.8 includes descriptions of some of the variables in the
Framingham data set. The Framingham data set is based on a longitudinal
study investigating
factors relating to coronary heart disease. A more complete description of
the Framingham data set may be found in Appendix A.
1.8. For each variable described from the Framingham data set, indicate
the level of measurement.
a) CURSMOKE1 indicates whether or not the individual smoked cigarettes
in 1956. It is coded so that 0 represents no and 1 represents yes.
b) CIGPDAY1 indicates the number of cigarettes the individual smoked per
day, on average, in 1956.
c) BMI1 indicates the body mass index of the individual in 1956. BMI is calculated as weight in kilograms divided by the square of height in meters.
d)
e) SYSBP1 and DIABP1 indicate the systolic and diastolic blood pressures,
respectively, of the individuals in 1956. Blood is carried from the heart to all
parts of your body in vessels called arteries. Blood pressure is the force of
the blood pushing against the walls of the arteries. Each time the heart
beats (about 60–70 times a minute at rest), it pumps out blood into the
arteries. Your blood pressure is at its highest when the heart beats,
pumping the blood. This is called systolic pressure. When the heart is at
rest, between beats, your blood pressure falls. This is the diastolic
pressure.
Blood pressure is always given as these two numbers, the systolic and
diastolic pressures. Both are important. Usually they are written one above
or before the other, such as 120/80 mmHg. The top number is the systolic
and the bottom the diastolic. When the two measurements are written
down, the systolic pressure is the first or top number, and the diastolic
pressure is
the second or bottom number (for example, 120/80). If your blood pressure
is 120/80, you say that it is “120 over 80.” Blood pressure changes during
the day. It is lowest as you sleep and rises when you get up. It also can rise
when you are excited, nervous, or active. Still, for most of your waking
hours, your blood pressure stays pretty much the same when you are
sitting or standing still. Normal values of blood pressure should be lower
than 120/80. When the level remains too high, for example, 140/90 or
higher, the individual is said to have high blood pressure. With high blood
pressure, the heart works harder, your arteries take a beating, and your
chances of a stroke, heart attack, and kidney problems are greater.
Exercise 1.9 includes descriptions of some of the variables in the NELS
data set. The NELS data set is a subset of a data set collected by the
National Center for Educational Statistics. They conducted a longitudinal
study investigating factors relating to educational outcomes. A more
complete description of the NELS data set may be found in Appendix A.
1.9. Read the description of the following variables from the NELS data set.
Then, classify them as either discrete or continuous and specify their levels
of measurement (nominal, ordinal, interval, or ratio). The variable names
are given in capital letters.
a) GENDER. Classifies the student as either male or female, where 0 =
“male” and 1 = “female.”
b) URBAN. Classifies the type of environment in which each student lives
where 1 = “Urban,” 2 = “Suburban,” and 3 = “Rural.”
c) SCHTYP8. Classifies the type of school each student attended in eighth
grade where 1 = “Public,” 2 = “Private, Religious,” and 3 = “Private, Non-
Religious.”
d) TCHERINT. Classifies student agreement with the statement “My
teachers are interested in students” using the Likert scale 1 = “strongly
agree,” 2 = “agree,” 3 = “disagree,” and 4 = “strongly disagree.”
e) NUMINST. Gives the number of post-secondary institutions the student
attended.
f) ACHRDG08. Gives a score for the student’s performance in eighth grade
on a standardized test of reading achievement. Actual values range from
36.61 to 77.2, from low to high achievement.
g) SES. Gives a score representing the socioeconomic status (SES) of the
student, a composite of father’s education level, mother’s education level,
father’s occupation, mother’s occupation, and family income. Values range
from 0 to 35, from low to high SES.
h) SLFCNC12. Gives a score for student self-concept in twelfth grade.
Values range from 0 to 43. Self-concept is defined as a person’s self-
perceptions, or how a person feels about himself or herself. Four items
comprise the self-concept scale in the NELS questionnaire: I feel good
about myself; I feel I am a person of worth, the equal of other people; I am
able to do things as well as most other people; and, on the whole, I am
satisfied with myself. A self-concept score, based on the sum of scores on
each of these items, is used to measure self-concept in the NELS study.
Higher scores associate with higher self-concept and lower scores
associate with lower self-concept.
i) SCHATTRT. Gives the average daily attendance rate for the school.
j) ABSENT12. Classifies the number of times the student missed school. Assigns 0 to Never, 1 to 1–2 times, 2 to 3–6 times, etc.
Exercises 1.10–1.13 require the use of Stata and the NELS data set.
1.10. Follow the instructions given below to access the NELS data set from
the website associated with this text.
Open Stata and begin a log file to keep a permanent copy of your results.
To begin a log file, click File/Log/Begin on the main menu bar. A window
will then appear prompting you to give the log file a name and save it in a
location of your choosing. Log files are denoted in Stata by the extension
.smcl added to their filename. It is a good idea to save the log file in a folder
that is associated with the analyses you will be doing using this data set. For
example, you might want to save it in a folder labeled Exercise 1.10.
Open the NELS data set by clicking File/Open on the main menu bar,
browsing to locate the NELS data set (this data set, along with all other
data sets for this text, is located under the Resources tab on the book’s
website, www.cambridge.org/Stats-Stata) and clicking Open. The NELS
data set will now be in active memory and you will see something like the
following on your computer screen:
Notice that the Variables Window now contains the complete list of variable
names and labels for each variable in the NELS data set, and the Output
Window contains the command (use) that was used to open the data set
itself. The word clear at the end of the use command clears any data that
currently may be in Stata’s active memory so that we may begin with a
clean slate. The command use is followed by quotes within which appears
the address of the location of the NELS.dta data set.
To see the complete data set displayed on the screen, on the main menu
bar click Data/Data Editor/Data Editor(Edit). More simply, you may type
the command edit in the Command Window and hit Enter. A spreadsheet
containing the NELS data set will appear on the screen, and you will now
be able to edit or modify the data in the dataset. If you do not want to edit
the data at this time and want to avoid the risk of changing some values by
mistake, you can go into browse mode instead of edit mode by clicking
Data/Data Editor/Data Editor(Browse). More simply, you may type the
command browse in the Command Window and hit Enter.
a) What is the first variable in the NELS data set?
b) What is the second variable in the data set and what is its variable label?
c) What are the value labels for this variable? That is, how is the variable
coded?
(HINT: You may use one of the following four approaches to answer this
question.)
1. On the main menu bar, click Data/ Data Utilities/ Label Utilities/List
Value Labels and select the desired variable from the drop-down list. Click
OK and the value labels (codes) for this variable will appear on the screen
in the Output Window. If no specific variable is selected from the drop-down
list, the value labels for all the variables in the dataset will appear in the
Output Window when OK is clicked.
2. Alternatively, type the command label list in the Command Window and
hit Enter to view the value codes for all the variables in the dataset. If you
only want the value codes for a particular variable, you would type the
name of that variable following the command label list in the Command
Window and hit Enter. For example, you would type label list advmath8
and hit Enter to view the value codes for only the variable advanced math
in 8th grade.
3. And still another approach would be to type the command edit in the
Command Window, hit Enter to view the complete set of data in the Data
Editor Window, and click the variable advmath8 in the Variables Window.
Notice that the Properties Window now contains information about the
variable advmath8. In the row labeled Value Label within the Properties
Window, the variable name advmath8 is followed by three dots (...).
If you click on those three dots, the following window will appear on your
screen.
If we were to click the + sign next to advmath8, the value labels for this
variable would appear on the screen. If, instead, we were interested not
only in viewing the value labels, but also in editing (changing) them, we
would click Edit Label on this screen and the following window would
appear, allowing us to make the changes we wished.
4. Finally, and perhaps, most simply, type the command codebook in the
Command Window and hit Enter to view the value codes for all the
variables in the dataset. If you only want the value codes for a particular
variable, you would type the name of that variable following the command
codebook in the Command Window and hit Enter. For
example, you would type codebook advmath8 and hit Enter to view the
value codes for only the variable advanced math in 8th grade.
d) What is the Type entered for this variable in Stata? (Hint: The Variables
Window contains a listing of the variable names and their labels. To view
the Type for this variable, click on the variable name in the Variables
Window and refer to the Properties Window below to view this variable’s
Type. Because all the variables in the NELS data set are numeric, one of
the following will be listed as the Type of that variable – byte, int, long,
float, or double. These numeric Types are listed in order of increasing
precision, but as was mentioned earlier, for our purposes we do not need to
pay much attention to these different numeric Types.)
e) What is the variable label for the variable famsize?
f) Why are there no value labels for this variable?
g) How many variables are there in the NELS data set? (Hint: On the main
menu bar, click Data/ Describe Data/Describe data in memory. Then
click OK. Or, more simply type the command describe in the Command
Window and hit Enter. A full description of all the variables in the data set
will appear in the Output Window along with information at the top of the
Window regarding the total number of observations (Obs:) in the dataset
and the total number of variables (Var:) in the dataset. Information
regarding the number of variables and observations in the data set may
also be obtained by looking under Data in the Properties Window.)
h) You may now close Stata by typing exit in the Command Window and
hitting Enter.
1.11. Click Data/Data Editor/ Data Editor (Browse) on the main menu bar
or type the command browse in the Command Window to view the full
NELS data
set on the screen. Refer to this data set to answer the following questions.
a) How many people are in the family (famsize) of the first student (ID=1) in
the NELS data set?
b) Did the first student in the NELS data set (id = 1) take advanced math in
eighth grade (advmath8)?
c) Did the second student in the NELS data set (id = 2) take advanced
math in eighth grade (advmath8)?
d) How many people are in the NELS data set? (See the Hint given for
Exercise 1.10(g))
1.12. The variable late12 gives the number of times a student has been late
for school in twelfth grade using the following coding: 0 = never, 1 = 1–2
times, 2 = 3–6 times, 3 = 7–9 times, 4 = 10–15 times, and 5 = over 15
times. Although a code of 0 indicates that the student was never late,
late12 is not measured on a ratio scale. Explain. Identify the level of
measurement used. Describe how the variable might have been measured
using a higher level of measurement.
1.13. The variable expinc30 gives, for each student in the NELS, the
expected income at age 30 in dollars. Identify the level of measurement
used. Describe how the variable might have been measured using a lower level of measurement.

Exercise 1.14 requires the use of Stata and the
States data set. The States data set contains educational information for
the 50 states and Washington DC. A more complete description of the
States data set may be found in Appendix A.
1.14. In this exercise, we introduce the notion of Type in the Properties
Window and review some of the other information that may be obtained in
Stata’s Data Editor Window. Load the States data set into active memory
by clicking on the
main menu bar File/Open and then open the Data Editor Window in
browse mode by clicking Data/Data Editor/Data Editor(Browse).
a) Verify that the string (alphanumeric) variable state has Type str42 and
the numeric variable stuteach has Type double. The number 42 following
str indicates that the state with the longest name has 42 letters/characters.
Which state is it? What is the data Type for the variable enrollmnt?
b) Value labels typically are used for variables at the nominal and ordinal
levels to remind us which values have been assigned to represent the
different categories of these variables. For the variable region, which region
of the country has been assigned a value of 3?
[Hint: One way is to click on the main menu bar, Data/Data Utilities/List
Value Labels, and select the variable region from the drop-down list.] What
are three other ways to find the answer to this problem? [Another Hint: See
Exercise 1.10(c).]
c) What is the difference between Variable Labels and Values?
d) Why does the variable region have no label?
e) In which region is Arizona?
Exercise 1.15 provides a tutorial on data entry and requires the use of
Stata.
1.15. In this exercise, we will enter data on 12 statisticians who each have
contributed significantly to the field of modern statistics. We have provided
a table of variable descriptions and a table of data values and detailed
instructions on how to proceed. In most cases, we provide the commands
that are needed to execute the steps, but we also provide the main menu
point and click steps when those are easy and not cumbersome to carry
out.
a) The data, given near the end of this exercise relate to the variables
described in the table below. We will enter these data into Stata using the
steps provided.
Statistician: Type String
Gender: Type float; Values: 1 = Female, 2 = Male
Birth: Type float; Label: Year of birth
Death: Type float; Label: Year of death
AmStat: Type float; Label: Number of references in The American Statistician, 1995–2005
Use the following steps to enter the data:
To enter these data into Stata, we need to be in edit mode in the Data
Editor Window. To do so, click on Data/Data Editor/Data Editor(Edit) or
type the command edit in the Command Window and hit Enter. A new,
blank datasheet will appear ready to receive the data you wish to enter.
Remember that Stata is case-sensitive and all commands must be written
in lowercase.
A copy of the spreadsheet with the data is given at the end of this exercise.
Begin to enter the data per cell as shown below for the first row:
var1 var2 var3 var4 var5
Sir Francis Galton 2 1822 1911 7
To select a location on your computer in which you would like to save this
file, on the main menu bar click File/Change Working Directory, and
select the location in which you would like to save this file. To save the file
with the data entered so far, and to give it the filename Statisticians, type
save Statisticians in the Command Window and hit Enter. Alternatively, on
the main menu bar, click File/Save As and browse for the location in which
you would like to save the file, give the file a name (e.g., Statisticians) and
hit Save. This data file, as well as all other Stata data files, will have the
extension .dta added to its filename to distinguish it as Stata data file.
Once the data are entered, we might wish to make changes to the dataset.
For example, if we wished to give each variable a name other than the
generic name it has been initially assigned by Stata (var1, var2, etc.), we
would click on the main menu bar Data/Data Utilities/Rename Groups of
Variables. In the panel labeled, “Existing variable names,” we would type
var1 var2 var3 var4 var5. In the panel labeled, “New variable names,” we
would type Statistician Gender Birth Death AmStat, and then click OK.
Alternatively, we could type rename (var1 var2 var3 var4 var5)
(Statistician Gender Birth Death AmStat) in the Command Window and
hit Enter.
If we wished to give the dataset a label in addition to its filename, on the
main menu bar, we would click Data/Data Utilities/Label Utilities/Label
Dataset and type the label you wish in the space provided (e.g.,
Statisticians Data – Exercise 1.15) and click OK.
Alternatively, we could type label data “Statisticians Data – Exercise
1.15” in the Command Window and hit Enter.
To label individual variables (e.g., AmStat) with labels in addition to their
variable names, on the main menu bar, we would click Data/Data
Utilities/Label Utilities/Label Variable, select the variable we wished to
label (AmStat) from the drop-down menu and type the label itself in the
space provided (# of references in the American Statistician, 1995–2005).
Alternatively, we could type label variable AmStat “# of references in the
American Statistician, 1995–2005” and hit Enter.
To define value labels and then assign these labels to the variable Gender,
on the main menu bar, we would click Data/Variables Manager, and click
on the variable Gender and in the Properties Window to the right, click
Manage and then Edit Labels in the new screen that appears and add the
desired label values. Alternatively, we could type the following two
commands in the Command Window, hitting Enter after each:
label define sex 2 “Male” 1 “Female”
label values Gender sex
We could then type the command codebook Gender to confirm that what
we did was correct.
To view how all the data in this file are represented or described internally
in the computer, we could type describe in the Command Window and hit
Enter.
Once you are finished with the task of creating your dataset, you will want
to save it, but since we already saved a portion of these data and the file
Statisticians.dta already exists, rather than simply typing in the
Command Window save Statisticians, we would type save Statisticians,
replace. Alternatively, on the main menu bar, we could click File/Save. At
this point, Stata will ask whether we want to overwrite the existing file, and
we would click Yes.
To list the contents of the data set on the screen (in the Output Window),
type list in the Command Window.
Statistician Gender Birth Death AmStat
Sir Francis Galton 2 1822 1911 7
Karl Pearson 2 1857 1936 16
William Sealy Gosset 2 1876 1937 0
Ronald Aylmer Fisher 2 1890 1962 5
Harald Cramer 2 1893 1985 0
Prasanta Chandra Mahalanobis 2 1893 1972 0
Jerzy Neyman 2 1894 1981 7
Egon S. Pearson 2 1895 1980 1
Gertrude Cox 1 1900 1978 6
Samuel S. Wilks 2 1906 1964 1
Florence Nightingale David 1 1909 1995 0
John Tukey 2 1915 2000 12
To log out of Stata, type exit in the Command Window and hit Enter.
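Gathered in one place, the command-line versions of the steps in this exercise form the beginnings of a Do-file (the data values themselves are entered interactively in the Data Editor):

```stata
edit    // open the Data Editor in edit mode to type in the data

* Rename the generic variable names initially assigned by Stata
rename (var1 var2 var3 var4 var5) (Statistician Gender Birth Death AmStat)

* Label the data set and the AmStat variable
label data "Statisticians Data - Exercise 1.15"
label variable AmStat "# of references in the American Statistician, 1995-2005"

* Define value labels and attach them to Gender
label define sex 2 "Male" 1 "Female"
label values Gender sex
codebook Gender    // confirm the value labels were assigned correctly

save Statisticians, replace    // overwrite the earlier partial save
list                           // display the data set in the Output Window
```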
Chapter Two
Examining Univariate Distributions
As noted in Chapter 1, the function of descriptive statistics is to describe
data. A first step in this process is to explore how the collection of values
for each variable is distributed across the array of values possible. Because
our concern is with each variable taken separately, we say that we are
exploring univariate distributions. Tools for examining such distributions,
including tabular and graphical representations, are presented in this
chapter along with Stata commands for implementing these tools. As we
noted in Chapter 1, we may enter Stata commands either by using a menu-
driven point and click approach or by directly typing the commands
themselves into the Command Window and hitting Enter. Because directly
typing the commands is simpler and less cumbersome than the menu-
driven approach, in only the first chapter do we provide both the menu-
driven point and click approach and the command-line approach. In this
and succeeding chapters we focus on the command-line approach, but
provide in this chapter instructions for how to obtain the particular
sequence of point and click steps on the main menu bar that would be
needed to run a command.
Counting the Occurrence of Data Values
We again make use of the NELS data set introduced in Chapter 1, by
looking at various univariate distributions it includes. To begin our
exploration, we ask the following questions about two of them. How many
students in our sample are from each of the four regions of the country
(Northeast, North Central, South, and West)? How are the students
distributed across the range of values on socioeconomic status (ses)? By
counting the number of students who live within a region, or who score
within a particular score interval on the ses scale, we obtain the frequency
of that region or score interval. When expanded to include all regions or all
score intervals that define the variable, we have a frequency distribution.
Frequency distributions may be represented by tables and graphs. The
type of table or graph that is appropriate for displaying the frequency
distribution of a particular variable depends, in part, upon whether the
variable is discrete or continuous.
When Variables Are Measured at the Nominal
Level
The variable REGION in the NELS data set is an example of a discrete
variable with four possible values or categories: Northeast, North Central,
South, and West. To display the number of student respondents who are
from each of these regions, we may use frequency and percent tables as
well as bar and pie charts.
Frequency and Percent Distribution Tables
MORE Stata: To create frequency and percent distribution tables
We type the following command in the Command Window and hit Enter:
tabulate region
☞ Remark. As described, we can type in the variable name region in the
above command, or we can click on the variable region in the Variable
Window located in the top right corner of the Stata screen (see later). An
arrow will appear as shown. Place your cursor wherever you want the
variable to appear on the command line (after the command tabulate, in
this case) and click on the arrow.
The result, as given in Table 2.1, will appear on your screen in the Output
Window.
In Table 2.1, the first column lists the four possible categories of this
variable. The second column, labeled “Freq.”, presents the number of
respondents from each region. From this column, we may note that the
fewest respondents, 93, are from the West. Note that if we sum
the frequency column we obtain 500, which represents the total number of
respondents in our data set. This is the value we should obtain if an appropriate category is listed for each student and each student responds to one and only one category; when this is the case, we say the categories are mutually exclusive and exhaustive. This is a desirable, though not essential, property of frequency distributions.
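From the Command Window, Stata's display command can serve as a calculator to check such figures; for example, the West's share of the 500 respondents:

```stata
* display evaluates an expression and prints the result; here, the
* proportion of the 500 respondents who are from the West
display 93/500
```

Stata prints .186, the proportion of respondents from the West.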
While the frequency of a category is informative, so too is the relative frequency of the category; that is, the frequency of the category relative to (divided by) the total number of individuals responding. The relative frequency represents the proportion of the total number of responses that are in the