Reasoning With Data: An Introduction to Traditional and Bayesian Statistics Using R
Jeffrey M. Stanton
Back in my youth, when mammoths roamed the earth and I was learning
statistics, I memorized statistical formulas and learned when to apply them.
I studied the statistical rulebook and applied it to my own data, but I didn’t fully
grasp the underlying logic until I had to start teaching statistics myself. After
the first couple semesters of teaching, and the first few dozen really confused
students (sorry, folks!), I began to get the hang of it and I was able to explain
ideas in class in a way that made the light bulbs shine. I used examples with data
and used scenarios and graphs that I built from spreadsheets to illustrate how
hypothesis testing really worked from the inside. I deemphasized the statisti-
cal formulas or broke them open so that students could access the important
concepts hiding inside the symbols. Yet I could never find a textbook that
complemented this teaching style—it was almost as if every textbook author
wanted students to follow the same path they themselves had taken when first
learning statistics.
So this book tries a new approach that puts simulations, hands-on exam-
ples, and conceptual reasoning first. That approach is made possible in part
thanks to the widespread availability of the free and open-source R platform
for data analysis and graphics (R Core Team, 2016). R is often cited as the lan-
guage of the emerging area known as “data science” and is immensely popular
with academic researchers, professional analysts, and learners. In this book I use
R to generate graphs, data, simulations, and scenarios, and I provide all of the
commands that teachers and students need to do the same themselves.
One definitely does not have to already be an R user or a programmer to
use this book effectively. My examples start slowly, I introduce R commands
and data structures gradually, and I keep the complexity of commands and
code sequences to the minimum needed to explain and explore the statistical
concepts. Those who go through the whole book will feel competent in using
R and will have a lot of new problem-solving capabilities in their tool belts. I
know this to be the case because I have taught semester-long classes using ear-
lier drafts of this textbook, and my students have arrived at their final projects
with substantial mastery of both statistical inference techniques and the use of
R for data analysis.
Above all, in writing this book I’ve tried to make the process of learn-
ing data analysis and statistical concepts as engaging as possible, and possibly
even fun. I wanted to do that because I believe that quantitative thinking and
statistical reasoning are incredibly important skills and I want to make those
skills accessible to a much wider range of people, not just those who must take
a required statistics course. To minimize the “busy work” you need to do in
order to teach or learn from this book, I’ve also set up a companion website
with a copy of all the code as well as some data sets and other materials that can
be used in- or outside of the classroom (www.guilford.com/stanton2-materials). So,
off you go, have fun, and keep me posted on how you do.
In closing, I acknowledge with gratitude Leonard Katz, my graduate statis-
tics instructor, who got me started on this journey. I would also like to thank the
initially anonymous reviewers of the first draft, who provided extraordinarily
helpful suggestions for improving the final version: Richard P. Deshon, Department
of Psychology, Michigan State University; Diana Mindrila, Department
of Educational Research, University of West Georgia; Russell G. Almond,
Department of Educational Psychology, Florida State University; and Emily
A. Butler, Department of Family Studies and Human Development, University
of Arizona. Emily, in particular, astutely pointed out dozens of different spots
where my prose was not as clear and complete as it needed to be. Note that
I take full credit for any remaining errors in the book! I also want to give a
shout-out to the amazing team at Guilford Publications: Martin Coleman, Paul
Gordon, C. Deborah Laughton, Oliver Sharpe, Katherine Sommer, and Jeannie
Tang. Finally, a note of thanks to my family for giving me the time to lurk in
my basement office for the months it took to write this thing. Much obliged!
Contents
Introduction
Getting Started
1. Statistical Vocabulary
Descriptive Statistics
Measures of Central Tendency
Measures of Dispersion
BOX. Mean and Standard Deviation Formulas
Distributions and Their Shapes
Conclusion
EXERCISES
Conclusion
EXERCISES
References
Index
What if the results showed that the average alcohol content among ale
yeast batches was slightly higher than among lager yeast batches? End of story,
right? Unfortunately, not quite. Using the tools of mathematical probability
available in the late 1800s, Gosset showed that the average only painted part of
the big picture. What also mattered was how variable the batches were—in
other words, was there a large spread of results among the observations in either
or both groups? If so, then one could not necessarily rely upon one observed
difference between two averages to generalize to other batches. Repeating the
experiment might easily lead to a different conclusion. Gosset invented the
t-test to quantify this problem and provide researchers with the tools to decide
whether any observed difference in two averages was sufficient to overcome
the natural and expected effects of sampling error. Later in the book, I will
discuss both sampling and sampling error so that you can make sense of these
ideas.
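As a small preview of that idea, here is a minimal sketch in R. The alcohol readings are invented for illustration (they are not Gosset's data), and the variable names are my own. Both scenarios have exactly the same difference between the two averages; only the spread of the batches differs. The t.test() function is part of the standard R installation.

```r
# Invented alcohol-content readings (% by volume); not Gosset's data.
offsets <- c(-0.10, -0.05, 0, 0.05, 0.10)

# Scenario 1: batches within each group are very consistent
ale_tight   <- 5.2 + offsets
lager_tight <- 5.0 + offsets

# Scenario 2: same averages, but ten times more spread among batches
ale_wide   <- 5.2 + 10 * offsets
lager_wide <- 5.0 + 10 * offsets

# The difference between the averages is the same in both scenarios
mean(ale_tight) - mean(lager_tight)
mean(ale_wide) - mean(lager_wide)

# The t-test weighs that difference against the variability of the batches
tight_test <- t.test(ale_tight, lager_tight)
wide_test  <- t.test(ale_wide, lager_wide)

tight_test$p.value   # small: the difference stands out from the spread
wide_test$p.value    # large: the difference is lost in the spread
```

The point of the sketch is only that an identical gap between two averages can look convincing or unconvincing depending on how much the individual batches vary.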
Well, time went on, and the thinking that Gosset and other statisticians
did about this kind of problem led to a widespread tradition in applied statis-
tics known as statistical significance testing. “Statistical significance” is a
technical term that statisticians use to quantify how likely a particular result
might have been in light of a model that depicts a whole range of possible
results. Together, we will unpack that very vague definition in detail through-
out this book. During the 20th century, researchers in many different fields—
from psychology to medicine to business—relied more and more on the idea
of statistical significance as the most essential guide for judging the worth of
their results. In fact, as applied statistics training became more and more com-
mon in universities across the world, lots of people forgot the details of exactly
why the concept was developed in the first place, and they began to put a lot of
faith in scientific results that did not always have a solid basis in sensible quan-
titative reasoning. Of additional concern, as matters have progressed we often
find ourselves with so much data that the small-sample techniques developed
in the early 20th century sometimes do not seem relevant anymore. When you have
hundreds of thousands, millions, or even billions of records, conventional tests
of statistical significance can show many negligible results as being statistically
significant, making these tests much less useful for decision making.
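The following sketch illustrates the large-sample issue with simulated data. Everything here is an assumption chosen for the demonstration: the group size of one million, the true difference of 0.01, and the standard deviation of 1. With groups this large, a conventional t-test will typically flag even this negligible difference.

```r
# Hypothetical scenario: two huge groups whose true means differ by a
# negligible amount (0.01 on a scale with standard deviation 1).
set.seed(123)                          # make the simulated data reproducible
n <- 1000000                           # one million records per group
group_a <- rnorm(n, mean = 0.00, sd = 1)
group_b <- rnorm(n, mean = 0.01, sd = 1)

# The observed difference is tiny in practical terms
observed_diff <- mean(group_b) - mean(group_a)

# Yet the conventional test treats it as statistically significant
big_test <- t.test(group_b, group_a)

observed_diff        # close to 0.01
big_test$p.value     # typically far below the usual 0.05 threshold
```

Whether a difference of one hundredth of a standard deviation matters is a practical question that the p-value alone cannot answer.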
In this book, I explain the concept of statistical significance so that you
can put it in perspective. Statistical significance still has a meaningful role to
play in quantitative thinking, but it represents one tool among many in the
quantitative reasoning toolbox. Understanding significance and its limitations
will help you to make sense of reports and publications that you read, but will
also help you grasp some of the more sophisticated techniques that we can now
use to sharpen our reasoning about data. For example, many statisticians and
researchers now advocate for so-called Bayesian inference, an approach to sta-
tistical reasoning that differs from the frequentist methods (e.g., statistical sig-
nificance) described above. The term “Bayesian” comes from the 18th-century
thinker Thomas Bayes, who figured out a fresh strategy for reasoning based on
prior evidence. Once you have had a chance to digest all of these concepts and
put them to work in your own examples, you will be in a position to critically
examine other people’s claims about data and to make your own arguments
stronger.
GETTING STARTED
• Add, subtract, multiply, and divide, preferably both on paper and with
a calculator.
• Work with columns and rows of data, as one would typically find in a
spreadsheet.
• Understand several types of basic graphs, such as bar charts and scat-
terplots.
• Follow the meaning and usage of algebraic equations such as y = 2x - 10.
• Install and use new programs on a laptop or other personal computer.
• Write interpretations of what you find in data in your own words.
what you can do with it, know how it has to be transformed, and know how
to check for problems. The extensibility of R also means that volunteer pro-
grammers and statisticians are adding new capabilities all the time. Finally, the
lessons one learns in working with R are almost universally applicable to other
programs and environments. If one has mastered R, it is a relatively small step
to get the hang of a commercial statistical system. Some of the concepts you
learn in working with R will even be applicable to general-purpose program-
ming languages like Python.
As an open-source program, R is created and maintained by a team of vol-
unteers. The team stores the official versions of R at a website called CRAN—
the Comprehensive R Archive Network (Hornik, 2012). If your computer has
the Windows, Mac OS X, or Linux operating system, there is a version of R
waiting for you at http://cran.r-project.org. If you have any difficulties installing
or running the program, you will find dozens of great written and video tuto-
rials on a variety of websites. See Appendix A if you need more help.
We will use many of the essential functions of R, such as adding, subtract-
ing, multiplying, and dividing, right from the command line. Having some
confidence in using R commands will help you later in the book when you
have to solve problems on your own. More important, if you follow along with
every code example in this book while you are reading, it will really help you
understand the ideas in the text. This is a really important point that you should
discuss with your instructor if you are using this book as part of a class: when
you do your reading you should have a computer nearby so that you can run R
commands whenever you see them in the text!
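For instance, a first session at the command line might look something like the sketch below. The particular numbers and the variable names x and y are arbitrary choices for illustration; text after a # character is a comment that R ignores.

```r
# Basic arithmetic typed directly at the command line
7 + 3      # addition
7 - 3      # subtraction
7 * 3      # multiplication
7 / 3      # division

# Store a value under a name, then use it in an algebraic expression
x <- 6
y <- 2 * x - 10    # the equation y = 2x - 10, evaluated at x = 6
y
```

Typing the name of a stored value on a line by itself, as in the last line, asks R to display it.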
We will also use the aspect of R that makes it extensible, namely the
“package” system. A package is a piece of software and/or data that downloads
from the Internet and extends your basic installation of R. Each package pro-
vides new capabilities that you can use to understand your data. Just a short
time ago, the package repository hit an incredible milestone—6,000 add-on
packages—that illustrates the popularity and reach of this statistics platform.
See if you can install a package yourself. First, install and run R as described
just above or as detailed in Appendix A. Then type the following command at
the command line:
install.packages("modeest")
This command fetches the “mode estimation” package from the Internet and
stores it on your computer. Throughout the book, we will see R code and
output represented as you see it in the line above. I rarely if ever show the
command prompt that R puts at the beginning of each line, which is usually a
“>” (greater than) character. Make sure to type the commands carefully, as a
mistake may cause an unexpected result. Depending upon how you are view-
ing this book and your instructor’s preferences, you may be able to cut and paste
some commands into R. If you can cut and paste, and the command contains
quote marks as in the example above, make sure they are “dumb” quotes and
not “smart” quotes (dumb quotes go straight up and down and there is no dif-
ference between an open quote and a close quote). R chokes on smart quotes.
R also chokes on some characters that are cut and pasted from PDF files.
When you install a new package, as you can do with the install.packages
command above, you will see a set of messages on the R console screen show-
ing the progress of the installation. Sometimes these screens will contain warn-
ings. As long as there is no outright error shown in the output, most warnings
can be safely ignored.
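If you are unsure whether an installation succeeded, one optional way to check is sketched below. This check is my own suggestion rather than a required step; it uses requireNamespace(), a standard R function that reports TRUE or FALSE without stopping on an error. The variable names pkg and already_installed are arbitrary.

```r
# "modeest" is the package named in the text; the check itself is optional.
pkg <- "modeest"

# TRUE if the package can be found on this computer, FALSE otherwise
already_installed <- requireNamespace(pkg, quietly = TRUE)

if (already_installed) {
  message("Package '", pkg, "' is installed and ready to be activated.")
} else {
  message("Package '", pkg, "' was not found; try install.packages(\"",
          pkg, "\") again and look for errors in the output.")
}
```

Note that the quote marks in this code are plain straight quotes, which is what R expects.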
When the package is installed and you get a new command prompt, type:
library(modeest)
This command activates the package that was previously installed. The package
becomes part of your active “library” of packages so that you can call on the
functions that library contains. Throughout this book, we will depend heavily
on your own sense of curiosity and your willingness to experiment. Fortu-
nately, as an open-source software program, R is very friendly and hardy, so
there is really no chance that you can break it. The more you play around with
it and explore its capabilities, the more comfortable you will be when we hit the
more complex stuff later in the book. So, take some time now, while we are in
the easy phase, to get familiar with R. You can ask R to provide help by typing
a question mark, followed by the name of a topic. For example, here’s how to
ask for help about the library() command:
?library
This command brings up a new window that contains the “official” infor-
mation about R’s library() function. For the moment, you may not find R’s help
very “helpful” because it is formatted in a way that is more useful for experts
and less useful for beginners, but as you become more adept at using R, you
will find more and more uses for it. Hang in there and keep experimenting!
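Beyond the question mark, a few other standard R commands can help you explore on your own. The sketch below shows three of them; the choice of "median" as the topic is arbitrary and only for illustration.

```r
# Equivalent to typing ?library
help("library")

# List the names of available functions that contain the word "median"
median_related <- apropos("median")
median_related

# Show the arguments that a function accepts
args(median)
```

The apropos() command is handy when you remember part of a function's name but not all of it.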