
Reasoning with Data: An Introduction to Traditional and Bayesian Statistics Using R



Reasoning
with Data
An Introduction to Traditional
and Bayesian Statistics Using R

Jeffrey M. Stanton

THE GUILFORD PRESS


New York  London
Copyright © 2017 The Guilford Press
A Division of Guilford Publications, Inc.
370 Seventh Avenue, Suite 1200, New York, NY 10001
www.guilford.com

All rights reserved

No part of this book may be reproduced, translated, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without written
permission from the publisher.

Printed in the United States of America

This book is printed on acid-free paper.

Last digit is print number: 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data

Names: Stanton, Jeffrey M., 1961–
Title: Reasoning with data : an introduction to traditional and Bayesian
statistics using R / Jeffrey M. Stanton.
Description: New York : The Guilford Press, [2017] | Includes bibliographical
references and index.
Identifiers: LCCN 2017004984| ISBN 9781462530267 (pbk. : alk. paper) |
ISBN 9781462530274 (hardcover : alk. paper)
Subjects: LCSH: Bayesian statistical decision theory—Problems, exercises,
etc. | Bayesian statistical decision theory—Data processing. |
Mathematical statistics—Problems, exercises, etc. | Mathematical
statistics—Data processing. | R (Computer program language)
Classification: LCC QA279.5 .S745 2017 | DDC 519.50285/53—dc23
LC record available at https://lccn.loc.gov/2017004984
Preface

Back in my youth, when mammoths roamed the earth and I was learning
statistics, I memorized statistical formulas and learned when to apply them.
I studied the statistical rulebook and applied it to my own data, but I didn’t fully
grasp the underlying logic until I had to start teaching statistics myself. After
the first couple semesters of teaching, and the first few dozen really confused
students (sorry, folks!), I began to get the hang of it and I was able to explain
ideas in class in a way that made the light bulbs shine. I used examples with data
and used scenarios and graphs that I built from spreadsheets to illustrate how
hypothesis testing really worked from the inside. I deemphasized the statisti-
cal formulas or broke them open so that students could access the important
concepts hiding inside the symbols. Yet I could never find a textbook that
complemented this teaching style—it was almost as if every textbook author
wanted students to follow the same path they themselves had taken when first
learning statistics.
So this book tries a new approach that puts simulations, hands-on exam-
ples, and conceptual reasoning first. That approach is made possible in part
thanks to the widespread availability of the free and open-source R platform
for data analysis and graphics (R Core Team, 2016). R is often cited as the lan-
guage of the emerging area known as “data science” and is immensely popular
with academic researchers, professional analysts, and learners. In this book I use
R to generate graphs, data, simulations, and scenarios, and I provide all of the
commands that teachers and students need to do the same themselves.
One definitely does not have to already be an R user or a programmer to
use this book effectively. My examples start slowly, I introduce R commands
and data structures gradually, and I keep the complexity of commands and
code sequences to the minimum needed to explain and explore the statistical
concepts. Those who go through the whole book will feel competent in using
R and will have a lot of new problem-solving capabilities in their tool belts. I
know this to be the case because I have taught semester-long classes using ear-
lier drafts of this textbook, and my students have arrived at their final projects
with substantial mastery of both statistical inference techniques and the use of
R for data analysis.
Above all, in writing this book I’ve tried to make the process of learn-
ing data analysis and statistical concepts as engaging as possible, and possibly
even fun. I wanted to do that because I believe that quantitative thinking and
statistical reasoning are incredibly important skills and I want to make those
skills accessible to a much wider range of people, not just those who must take
a required statistics course. To minimize the “busy work” you need to do in
order to teach or learn from this book, I’ve also set up a companion website
with a copy of all the code as well as some data sets and other materials that can
be used in- or outside of the classroom (www.guilford.com/stanton2-materials). So,
off you go, have fun, and keep me posted on how you do.
In closing, I acknowledge with gratitude Leonard Katz, my graduate statis-
tics instructor, who got me started on this journey. I would also like to thank the
initially anonymous reviewers of the first draft, who provided extraordinarily
helpful suggestions for improving the final version: Richard P. Deshon, Depart-
ment of Psychology, Michigan State University; Diana Mindrila, Department
of Educational Research, University of West Georgia; Russell G. Almond,
Department of Educational Psychology, Florida State University; and Emily
A. Butler, Department of Family Studies and Human Development, University
of Arizona. Emily, in particular, astutely pointed out dozens of different spots
where my prose was not as clear and complete as it needed to be. Note that
I take full credit for any remaining errors in the book! I also want to give a
shout out to the amazing team at Guilford Publications: Martin Coleman, Paul
Gordon, C. Deborah Laughton, Oliver Sharpe, Katherine Sommer, and Jeannie
Tang. Finally, a note of thanks to my family for giving me the time to lurk in
my basement office for the months it took to write this thing. Much obliged!
Contents

Introduction 1
Getting Started 3

1. Statistical Vocabulary 7
Descriptive Statistics 7
Measures of Central Tendency 8
Measures of Dispersion 10
BOX. Mean and Standard Deviation Formulas 14
Distributions and Their Shapes 15
Conclusion 19
EXERCISES 20

2. Reasoning with Probability 21


Outcome Tables 21
Contingency Tables 27
BOX. Make Your Own Tables with R 30
Conclusion 34
EXERCISES 35

3. Probabilities in the Long Run 37


Sampling 38
Repetitious Sampling with R 40
Using Sampling Distributions and Quantiles to Think
about Probabilities 45


Conclusion 49
EXERCISES 50

4. Introducing the Logic of Inference 52


Using Confidence Intervals
Exploring the Variability of Sample Means
with Repetitious Sampling 57
Our First Inferential Test: The Confidence Interval 60
BOX. Formulas for the Confidence Interval 61
Conclusion 64
EXERCISES 65

5. Bayesian and Traditional Hypothesis Testing 67


BOX. Notation, Formula, and Notes on Bayes’ Theorem 69
BOX. Markov‑Chain Monte Carlo Overview 71
BOX. Detailed Output from BESTmcmc() 74
The Null Hypothesis Significance Test 77
BOX. The Calculation of t 80
Replication and the NHST 83
Conclusion 84
EXERCISES 85

6. Comparing Groups and Analyzing Experiments 88


BOX. Formulas for ANOVA 93
Frequentist Approach to ANOVA 95
BOX. More Information about Degrees of Freedom 99
The Bayesian Approach to ANOVA 102
BOX. Giving Some Thought to Priors 103
BOX. Interpreting Bayes Factors 110
Finding an Effect 111
Conclusion 115
EXERCISES 117

7. Associations between Variables 119


BOX. Formula for Pearson’s Correlation 126
Inferential Reasoning about Correlation 127
BOX. Reading a Correlation Matrix 129
Null Hypothesis Testing on the Correlation 132
Bayesian Tests on the Correlation Coefficient 135
Categorical Associations 138
Exploring the Chi‑Square Distribution with a Simulation 141
The Chi‑Square Test with Real Data 146
The Bayesian Approach to the Chi‑Square Test 147
Conclusion 154
EXERCISES 155

8. Linear Multiple Regression 157


BOX. Making Sense of Adjusted R‑Squared 169
The Bayesian Approach to Linear Regression 172
A Linear Regression Model with Real Data 176
Conclusion 179
EXERCISES 181

9. Interactions in ANOVA and Regression 183


Interactions in ANOVA 186
BOX. Degrees of Freedom for Interactions 187
BOX. A Word about Standard Error 193
Interactions in Multiple Regression 195
BOX. Diagnosing Residuals and Trying Alternative Models 200
Bayesian Analysis of Regression Interactions 204
Conclusion 208
EXERCISES 209

10. Logistic Regression 211


A Logistic Regression Model with Real Data 221
BOX. Multinomial Logistic Regression 222
Bayesian Estimation of Logistic Regression 228
Conclusion 232
EXERCISES 234

11. Analyzing Change over Time 235


Repeated‑Measures Analysis 237
BOX. Using ezANOVA 244
Time‑Series Analysis 246
Exploring a Time Series with Real Data 259
Finding Change Points in Time Series 263
Probabilities in Change‑Point Analysis 266
BOX. Quick View of ARIMA 268
Conclusion 270
EXERCISES 272

12. Dealing with Too Many Variables 274


BOX. Mean Composites versus Factor Scores 281
Internal Consistency Reliability 282
Rotation 285
Conclusion 288
EXERCISES 289

13. All Together Now 291


The Big Picture 294

APPENDIX A. Getting Started with R 297


Running R and Typing Commands 298
Installing Packages 301
Quitting, Saving, and Restoring 303
Conclusion 303

APPENDIX B. Working with Data Sets in R 304


Data Frames in R 305
Reading into Data Frames from External Files 309

APPENDIX C. Using dplyr with Data Frames 310

References 313

Index 317

About the Author 325

Purchasers of this book can find annotated R code from the book’s examples, in-class exercises, links to online videos, and other resources at www.guilford.com/stanton2-materials.
Introduction

William Sealy Gosset (1876–1937) was a 19th-century uber-geek in both math and chemistry. The latter expertise led the Guinness Brewery in
Dublin, Ireland, to hire Gosset after college, but the former made Gosset a
household name in the world of statistics. As a forward-looking business, the
Guinness brewery was alert for ways of making batches of beer more consistent
in quality. Gosset stepped in and developed what we now refer to as
“small-sample statistical techniques”—ways of generalizing from the results of a rela-
tively few observations (Lehmann, 2012).
Brewing a batch of beer takes time, and high-­quality ingredients are not
cheap, so in order to draw conclusions from experimental methods applied to
just a few batches, Gosset had to figure out the role of chance in determining
how a batch of beer had turned out. The brewery frowned upon academic
publications, so Gosset had to publish his results under the modest pseudonym
“Student.” If you ever hear someone discussing the “Student’s t-test,” that is
where the name came from.
The Student’s t-test allows us to compare two groups on some measure-
ment. This process sounds simple, but has some complications hidden in it.
Let’s consider an example as a way of illustrating both the simplicity and the
complexity. Perhaps we want to ask the question of whether “ale yeast” or
“lager yeast” produces the higher alcohol content in a batch of beer. Obviously,
we need to brew at least one batch of each type, but every brewer knows that
many factors influence the composition of a batch, so we should really brew
several batches with lager yeast and several batches with ale yeast. Let’s brew
five batches of each type. When we are done we can average the measurements
of alcohol content across the five batches made with lager yeast and do the same
across the five batches made with ale yeast.

What if the results showed that the average alcohol content among ale
yeast batches was slightly higher than among lager yeast batches? End of story,
right? Unfortunately, not quite. Using the tools of mathematical probability
available in the late 1800s, Gosset showed that the average only painted part of
the big picture. What also mattered was how variable the batches were—in
other words, Was there a large spread of results among the observations in either
or both groups? If so, then one could not necessarily rely upon one observed
difference between two averages to generalize to other batches. Repeating the
experiment might easily lead to a different conclusion. Gosset invented the
t-test to quantify this problem and provide researchers with the tools to decide
whether any observed difference in two averages was sufficient to overcome
the natural and expected effects of sampling error. Later in the book, I will
discuss both sampling and sampling error so that you can make sense of these
ideas.
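
If you are curious what such a comparison looks like in practice, here is a brief preview using R, which this book introduces shortly. This sketch is not from the book itself: the alcohol-content numbers are invented purely for illustration, and the reasoning behind the t-test is explained in later chapters.

aleABV <- c(5.2, 5.5, 5.1, 5.4, 5.3)    # five hypothetical ale yeast batches
lagerABV <- c(5.0, 5.3, 4.9, 5.2, 5.1)  # five hypothetical lager yeast batches

mean(aleABV)     # average alcohol content across the ale yeast batches
mean(lagerABV)   # average alcohol content across the lager yeast batches

# Student's t-test: is the difference between the two averages larger than
# we would expect from sampling error alone? var.equal = TRUE requests the
# classic Student's version, which assumes a similar spread in both groups.
t.test(aleABV, lagerABV, var.equal = TRUE)
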
Well, time went on, and the thinking that Gosset and other statisticians
did about this kind of problem led to a widespread tradition in applied statis-
tics known as statistical significance testing. “Statistical significance” is a
technical term that statisticians use to quantify how likely a particular result
might have been in light of a model that depicts a whole range of possible
results. Together, we will unpack that very vague definition in detail through-
out this book. During the 20th century, researchers in many different fields—
from psychology to medicine to business—relied more and more on the idea
of statistical significance as the most essential guide for judging the worth of
their results. In fact, as applied statistics training became more and more com-
mon in universities across the world, lots of people forgot the details of exactly
why the concept was developed in the first place, and they began to put a lot of
faith in scientific results that did not always have a solid basis in sensible quan-
titative reasoning. Of additional concern, as matters have progressed we often
find ourselves with so much data that the small-sample techniques developed
in the 19th century sometimes do not seem relevant anymore. When you have
hundreds of thousands, millions, or even billions of records, conventional tests
of statistical significance can show many negligible results as being statistically
significant, making these tests much less useful for decision making.
In this book, I explain the concept of statistical significance so that you
can put it in perspective. Statistical significance still has a meaningful role to
play in quantitative thinking, but it represents one tool among many in the
quantitative reasoning toolbox. Understanding significance and its limitations
will help you to make sense of reports and publications that you read, but will
also help you grasp some of the more sophisticated techniques that we can now
use to sharpen our reasoning about data. For example, many statisticians and
researchers now advocate for so-called Bayesian inference, an approach to sta-
tistical reasoning that differs from the frequentist methods (e.g., statistical sig-
nificance) described above. The term “Bayesian” comes from the 18th-century
thinker Thomas Bayes, who figured out a fresh strategy for reasoning based on
prior evidence. Once you have had a chance to digest all of these concepts and
put them to work in your own examples, you will be in a position to critically
examine other people’s claims about data and to make your own arguments
stronger.

GETTING STARTED

Contrary to popular belief, much of the essential material in applied statistics can be grasped without having a deep background in mathematics. In order to
make sense of the ideas in this book, you will need to be able to:

• Add, subtract, multiply, and divide, preferably both on paper and with
a calculator.
• Work with columns and rows of data, as one would typically find in a
spreadsheet.
• Understand several types of basic graphs, such as bar charts and scat-
terplots.
• Follow the meaning and usage of algebraic equations such as y = 2x – 10.
• Install and use new programs on a laptop or other personal computer.
• Write interpretations of what you find in data in your own words.

To illustrate concepts presented in this book as well as to practice essential data analysis skills, we will use the open-source R platform for statistical computing and graphics. R is powerful, flexible, and especially “extensible”
(meaning that people can create new capabilities for it quite easily). R is free for
everyone to use and it runs on just about every kind of computer. It is “com-
mand line” oriented, meaning that most of the work that one needs to perform
is done through carefully crafted text instructions, many of which have “fid-
dly” syntax (the punctuation and related rules for writing a command that
works). Sometimes it can be frustrating to work with R commands and to get
the syntax just right. You will get past that frustration because you will have
many working examples to follow and lots of practice at learning and using
commands. There are also two appendices at the end of the book to help you
with R. Appendix A gets you started with installing and using R. Appendix B
shows how to manage the data sets you will be analyzing with R. Appendix C
demonstrates a convenient method for sorting a data set, selecting variables, and
calculating new variables. If you are using this book as part of a course, your
instructor may assign these as readings for one of your initial classes or labs.
One of the virtues of R as a teaching tool is that it hides very little. The
successful user must fully understand what is going on or else the R commands
will not work. With a spreadsheet, it is easy to type in a lot of numbers and a
formula like =FORECAST() and, bang, a number pops into a cell like magic,
whether it makes any sense or not. With R you have to know your data, know
what you can do with it, know how it has to be transformed, and know how
to check for problems. The extensibility of R also means that volunteer pro-
grammers and statisticians are adding new capabilities all the time. Finally, the
lessons one learns in working with R are almost universally applicable to other
programs and environments. If one has mastered R, it is a relatively small step
to get the hang of a commercial statistical system. Some of the concepts you
learn in working with R will even be applicable to general-purpose program-
ming languages like Python.
As an open-source program, R is created and maintained by a team of vol-
unteers. The team stores the official versions of R at a website called CRAN—
the Comprehensive R Archive Network (Hornik, 2012). If your computer has
the Windows, Mac-OS-X, or Linux operating system, there is a version of R
waiting for you at http://cran.r-project.org. If you have any difficulties installing
or running the program, you will find dozens of great written and video tuto-
rials on a variety of websites. See Appendix A if you need more help.
We will use many of the essential functions of R, such as adding, subtract-
ing, multiplying, and dividing, right from the command line. Having some
confidence in using R commands will help you later in the book when you
have to solve problems on your own. More important, if you follow along with
every code example in this book while you are reading, it will really help you
understand the ideas in the text. This is a really important point that you should
discuss with your instructor if you are using this book as part of a class: when
you do your reading you should have a computer nearby so that you can run R
commands whenever you see them in the text!
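
As a first taste of what following along looks like, here are a few arithmetic commands you could type at the R command line right now; the particular numbers are arbitrary and just for practice:

2 + 2         # addition
10 - 3        # subtraction
6 * 7         # multiplication
100 / 8       # division
sqrt(144)     # a built-in function: the square root of 144

R prints the result of each line as soon as you press Enter.
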
We will also use the aspect of R that makes it extensible, namely the
“package” system. A package is a piece of software and/or data that downloads
from the Internet and extends your basic installation of R. Each package pro-
vides new capabilities that you can use to understand your data. Just a short
time ago, the package repository hit an incredible milestone—6,000 add-on
packages—that illustrates the popularity and reach of this statistics platform.
See if you can install a package yourself. First, install and run R as described
just above or as detailed in Appendix A. Then type the following command at
the command line:

install.packages("modeest")

This command fetches the “mode estimation” package from the Internet and
stores it on your computer. Throughout the book, we will see R code and
output represented as you see it in the line above. I rarely if ever show the
command prompt that R puts at the beginning of each line, which is usually a
“>” (greater than) character. Make sure to type the commands carefully, as a
mistake may cause an unexpected result. Depending upon how you are view-
ing this book and your instructor’s preferences, you may be able to cut and paste
some commands into R. If you can cut and paste, and the command contains
quote marks as in the example above, make sure they are “dumb” quotes and
not “smart” quotes (dumb quotes go straight up and down and there is no dif-
ference between an open quote and a close quote). R chokes on smart quotes.
R also chokes on some characters that are cut and pasted from PDF files.
When you install a new package, as you can do with the install.packages
command above, you will see a set of messages on the R console screen show-
ing the progress of the installation. Sometimes these screens will contain warn-
ings. As long as there is no outright error shown in the output, most warnings
can be safely ignored.
When the package is installed and you get a new command prompt, type:

library(modeest)

This command activates the package that was previously installed. The package
becomes part of your active “library” of packages so that you can call on the
functions that library contains. Throughout this book, we will depend heavily
on your own sense of curiosity and your willingness to experiment. Fortu-
nately, as an open-source software program, R is very friendly and hardy, so
there is really no chance that you can break it. The more you play around with
it and explore its capabilities, the more comfortable you will be when we hit the
more complex stuff later in the book. So, take some time now, while we are in
the easy phase, to get familiar with R. You can ask R to provide help by typing
a question mark, followed by the name of a topic. For example, here’s how to
ask for help about the library() command:

?library

This command brings up a new window that contains the “official” infor-
mation about R’s library() function. For the moment, you may not find R’s help
very “helpful” because it is formatted in a way that is more useful for experts
and less useful for beginners, but as you become more adept at using R, you
will find more and more uses for it. Hang in there and keep experimenting!
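
One small experiment to try: once the modeest package is installed and loaded, you can ask it to estimate the mode (the most frequently occurring value) of a set of numbers. The numbers below are made up just for practice, and if the interface has changed in a newer version of the package, ?mlv will show the current usage:

library(modeest)
tinySample <- c(1, 2, 2, 3, 3, 3, 4)   # a made-up set of values
mlv(tinySample, method = "mfv")        # "mfv" asks for the most frequent value

The result should be 3, because 3 appears more often than any other value in tinySample.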
