R All-in-One For Dummies
About this ebook
A deep dive into the programming language of choice for statistics and data
With R All-in-One For Dummies, you get five mini-books in one, offering a complete and thorough resource on the R programming language and a road map for making sense of the sea of data we're all swimming in. Maybe you're pursuing a career in data science, maybe you're looking to infuse a little statistics know-how into your existing career, or maybe you're just R-curious. This book has your back. Along with providing an overview of coding in R and how to work with the language, this book delves into the types of projects and applications R programmers tend to tackle the most. You'll find coverage of statistical analysis, machine learning, and data management with R.
- Grasp the basics of the R programming language and write your first lines of code
- Understand how R programmers use code to analyze data and perform statistical analysis
- Use R to create data visualizations and machine learning programs
- Work through sample projects to hone your R coding skill
This is an excellent all-in-one resource for beginning coders who'd like to move into the data space by knowing more about R.
Book preview
R All-in-One For Dummies - Joseph Schmuller
Introduction
In this book, I’ve brought together all the information you need to hit the ground running with R. It’s heavy on statistics, of course, because R’s creators built this language to analyze data.
So it’s necessary that you learn the foundations of statistics. Let me tell you at the outset: This All-in-One is not a cookbook. I’ve never taught statistics that way and I never will. Before I show you how to use R to work with a statistical concept, I give you a strong grounding in what that concept is all about.
In fact, Books 2 and 3 of this 5-book compendium are something like an introductory statistics text that happens to use R as a way of explaining statistical ideas.
Book 4 follows that path by teaching the ideas behind machine learning before you learn how to use R to implement them. Book 5 gives you a set of projects that give you a chance to exercise your newly minted R skill set.
Want some more details? Read on.
About This All-in-One
The volume you’re holding (or the e-book you’re viewing) consists of five books that cover a lot of the length and breadth of R.
Book 1: Introducing R
As I said earlier in this introduction, R is a language that deals with statistics. Accordingly, Book 1 introduces you to the fundamental concepts of statistics that you just have to know in order to progress with R.
You then learn about R and RStudio, a widely used development environment for working with R. I begin by describing the rudiments of R code, and I discuss R functions and structures.
R truly comes alive when you use its specialized packages, which you learn about early on.
Book 2: Describing Data
Part of working with statistics is to summarize data in meaningful ways. In Book 2, you find out how to do just that.
Most people know about averages and how to compute them. But that’s not the whole story. In Book 2, I tell you about additional descriptive statistics that fill in the gaps, and I show you how to use R to calculate and work with those statistics. You also learn to create graphics that visualize the data descriptions and analyses you encounter in Books 2 and 3.
Book 3: Analyzing Data
Book 3 addresses the fundamental aim of statistical analysis: to go beyond the data and help you make decisions. Usually, the data are measurements of a sample taken from a large population. The goal is to use these data to figure out what’s going on in the population.
This opens a wide range of questions: What does an average mean? What does the difference between two averages mean? Are two things associated? These are only a few of the questions I address in Book 3, and you learn to use the R tools that help you answer them.
Book 4: Learning from Data
Effective machine learning model creation comes with experience. Accordingly, in Book 4 you gain experience by completing machine learning projects. In addition to the projects you complete along with me, I suggest additional projects for you to try on your own.
I begin by telling you about the University of California-Irvine Machine Learning Repository, which provides the data sets for most of the projects you encounter in Book 4.
To give you a gentle on-ramp into the field, I show you the Rattle package for creating machine learning applications. It’s a friendly interface to R’s machine learning functionality. I like Rattle a lot, and I think you will, too. You use it to learn about and work with decision trees, random forests, support vector machines, k-means clustering, and neural networks.
You also work with fairly large data sets — not the terabytes and petabytes data scientists work with, but large enough to get you started. In one project, you analyze a data set of more than 500,000 airline flights. In another, you complete a customer segmentation analysis of over 300,000 customers of an online retailer.
Book 5: Harnessing R: Some Projects to Keep You Busy
As its title suggests, Book 5 is also organized around projects.
In these projects, you create applications that respond to users. I show you the shiny package for working with web browsers and the shinydashboard package for creating dashboards.
All this is a little far afield from R’s original mission in life, but you get an idea of R’s potential to expand in new directions.
After you’ve worked with R for a while, maybe you can discover some of those new directions!
What You Can Safely Skip
Any reference book throws a lot of information at you, and this one is no exception. I intended it all to be useful, but I didn’t aim it all at the same level. So if you’re not deeply into the subject matter, you can avoid paragraphs marked with the Technical Stuff icon, and you can also skip the sidebars.
Foolish Assumptions
I’m assuming that you
- Know how to work with Windows or the Mac. I don’t go through the details of pointing, clicking, selecting, and so forth.
- Can install R and RStudio (I show you how in Book 1) and follow along with the examples. I use the Windows version of RStudio, but you should have no problem if you’re working on a Mac.
Icons Used in This Book
As is the case in all For Dummies books, icons help guide you through your journey. Each one is a little picture in the margin that lets you know something special about the paragraph it’s next to.
Tip This icon points out a hint or a shortcut that helps you in your work and makes you an all-around better person.
Remember This one points out timeless wisdom to take with you as you continue on the path to enlightenment.
Warning Pay attention to this icon. It’s a reminder to avoid something that might gum up the works for you.
Technical Stuff As I mention in "What You Can Safely Skip," this icon indicates material you can blow past if it’s just too technical. (I’ve kept this content to a minimum.)
Beyond This Book
In addition to what you’re reading right now, this book comes with a free, access-anywhere Cheat Sheet that will help you quickly use the tools I discuss. To find this Cheat Sheet, visit www.dummies.com and search for R All-in-One For Dummies Cheat Sheet in the Search box.
If you’ve read any of my earlier books, welcome back!
Where to Go from Here
Time to hit the books! You can start from anywhere, but here are a couple of hints. Want to introduce yourself to R and packages? Book 1 is for you. Has it been a while (or maybe never?) since your last statistics course? Hit Book 2. For anything else, find it in the table of contents or in the index and go for it.
If you prefer to read from cover to cover, just turn the page…
Book 1
Introducing R
Contents at a Glance
Chapter 1: R: What It Does and How It Does It
The Statistical (and Related) Ideas You Just Have to Know
Getting R
Getting RStudio
A Session with R
R Functions
User-Defined Functions
Comments
R Structures
for Loops and if Statements
Chapter 2: Working with Packages, Importing, and Exporting
Installing Packages
Examining Data
R Formulas
More Packages
Exploring the tidyverse
Importing and Exporting
Chapter 1
R: What It Does and How It Does It
IN THIS CHAPTER
- Introducing statistics
- Getting R and RStudio on your computer
- Starting a session with R
- Working with R functions
- Working with R structures
So you’re ready to journey into the wonderful world of R! Designed by and for statisticians and data scientists, R has a short but illustrious history.
In the 1990s, Ross Ihaka and Robert Gentleman developed R at the University of Auckland, New Zealand. The R Core Team and the R Foundation for Statistical Computing support R, which has a huge worldwide user base.
Before I tell you about R, however, I have to introduce you to the world that R lives in — the world of data and statistics.
The Statistical (and Related) Ideas You Just Have to Know
The analytical tools that R provides are based on statistical concepts I help you explore in this section. As you’ll see, these concepts are based on common sense.
Samples and populations
If you watch TV on election night, you know that one of the main events is the prediction of the outcome immediately after the polls close (and before all the votes are counted). How is it that pundits almost always get it right?
The idea is to talk to a sample of voters right after they vote. If they’re truthful about how they marked their ballots, and if the sample is representative of the population of voters, analysts can use the sample data to draw conclusions about the population.
That, in a nutshell, is what statistics is all about — using the data from samples to draw conclusions about populations.
Here’s another example. Imagine that your job is to find the average height of 10-year-old children in the United States. Because you probably wouldn’t have the time or the resources to measure every child, you’d measure the heights of a representative sample. Then you’d average those heights and use that average as the estimate of the population average.
Estimating the population average is one kind of inference that statisticians make from sample data. I discuss inference in more detail in the later section "Inferential Statistics: Testing Hypotheses."
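Here's a minimal sketch of the sample-to-population idea in R. The population is simulated, and its mean of 55 inches and standard deviation of 3 inches are made-up values chosen only for illustration, not real measurements of children's heights. The variable names are my own.

```r
# Hypothetical population: 100,000 simulated heights (in inches).
# The mean (55) and standard deviation (3) are assumptions for illustration.
set.seed(42)                          # make the random draws repeatable
population <- rnorm(100000, mean = 55, sd = 3)

# Measure a random sample of 200 instead of the whole population
heights_sample <- sample(population, size = 200)

mu_population <- mean(population)     # the parameter (usually unknown)
x_bar <- mean(heights_sample)         # the statistic (your estimate)

cat("Population mean:", round(mu_population, 2), "\n")
cat("Sample mean:    ", round(x_bar, 2), "\n")
```

In a real study you never get to compute the population mean. The simulation just lets you see how close a sample average tends to land.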
Remember Here’s some important terminology: Properties of a population (like the population average) are called parameters, and properties of a sample (like the sample average) are called statistics. If your only concern is the sample properties (like the heights of the children in your sample), the statistics you calculate are descriptive. (I discuss descriptive statistics in Book 2.) If you’re concerned about estimating the population properties, your statistics are inferential. (I discuss inferential statistics in Book 3.)
Remember Now for an important convention about notation: Statisticians use Greek letters (μ, σ, ρ) to stand for parameters, and English letters (x̄, s, r) to stand for statistics. Figure 1-1 summarizes the relationship between populations and samples, and between parameters and statistics.
FIGURE 1-1: The relationship between populations, samples, parameters, and statistics.
Variables: Dependent and independent
A variable is something that can take on more than one value — like your age, the value of the dollar against other currencies, or the number of games your favorite sports team wins. Something that can have only one value is a constant. Scientists tell us that the speed of light is a constant, and we use the constant π to calculate the area of a circle.
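As a small illustration, R has a built-in constant named pi, while a value you assign yourself can change whenever you like. The radius below is a hypothetical value I picked for the sketch.

```r
# pi is a constant that comes built into R
radius <- 2                 # a variable: hypothetical radius, any unit of length
area <- pi * radius^2       # area of a circle: pi times the radius squared
area
```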
Statisticians work with independent variables and dependent variables. In any study or experiment, you’ll find both kinds. Statisticians assess the relationship between them.
For example, imagine a computerized training method designed to increase a person’s IQ. How would a researcher find out whether this method does what it’s supposed to do? First, the researcher would randomly assign a sample of people to one of two groups. One group would receive the training method, and the other would complete another kind of computer-based activity — like reading text on a website. Before and after each group completes its activities, the researcher measures each person’s IQ. What happens next? I discuss that topic in the later section "Inferential Statistics: Testing Hypotheses."
For now, understand that the independent variable here is Type of Activity. The two possible values of this variable are IQ Training and Reading Text. The dependent variable is the change in IQ from Before to After.
Remember A dependent variable is what a researcher measures. In an experiment, an independent variable is what a researcher manipulates. In other contexts, a researcher can’t manipulate an independent variable. Instead, they note naturally occurring values of the independent variable and how they affect a dependent variable.
Remember In general, the objective is to find out whether changes in an independent variable are associated with changes in a dependent variable.
In examples that appear throughout this book, I show you how to use R to calculate characteristics of groups of scores, or to compare groups of scores. Whenever I show you a group of scores, I’m talking about the values of a dependent variable.
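To make the IQ example concrete, here's a sketch of how such a study might be laid out in an R data frame. Every number is simulated and made up for illustration. The column names (activity, iq_before, iq_after, iq_change) are my own assumptions, and both groups are drawn from the same distribution, so nothing here says anything about whether any training works.

```r
# Simulated (made-up) data for 20 hypothetical participants
set.seed(7)
study <- data.frame(
  # Independent variable: which activity each person was assigned
  activity = factor(rep(c("IQ Training", "Reading Text"), each = 10)),
  iq_before = round(rnorm(20, mean = 100, sd = 15))
)
study$iq_after <- study$iq_before + round(rnorm(20, mean = 2, sd = 5))

# Dependent variable: the change in IQ from Before to After
study$iq_change <- study$iq_after - study$iq_before

# Average change for each value of the independent variable
tapply(study$iq_change, study$activity, mean)
```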
Types of data
When you do statistical work, you can run into four kinds of data. And when you work with a variable, the way you work with it depends on what kind of data it is:
The first kind is nominal data. If a set of numbers happens to be nominal data, the numbers are labels — their values don’t signify anything. On a sports team, the jersey numbers are nominal. They just identify the players.
The next kind is ordinal data. In this data type, the numbers are more than just labels. As the name ordinal might tell you, the order of the numbers is important. If I ask you to rank ten foods from the one you like best (1) to the one you like least (10), we’d have a set of ordinal data.
But the difference between your third-favorite food and your fourth-favorite food might not be the same as the difference between your ninth-favorite and your tenth-favorite. So this type of data lacks equal intervals and equal differences.
Interval data gives us equal differences. The Fahrenheit scale of temperature is a good example. The difference between 30° and 40° is the same as the difference between 90° and 100°. So each degree is an interval.
People are sometimes surprised to find out that on the Fahrenheit scale a temperature of 80° is not twice as hot as 40°. For ratio statements ("twice as much as," "half as much as") to make sense, zero has to mean the complete absence of the thing you’re measuring. A temperature of 0°F doesn’t mean the complete absence of heat; it’s just an arbitrary point on the Fahrenheit scale. (The same holds true for Celsius.)
The fourth kind of data, ratio, provides a meaningful zero point. On the Kelvin scale of temperature, zero means absolute zero, where all molecular motion (the basis of heat) stops. So 200 kelvins is twice as hot as 100 kelvins. Another example is length. Eight inches is twice as long as 4 inches. Zero inches means a complete absence of length.
Remember An independent variable or a dependent variable can be either nominal, ordinal, interval, or ratio. The analytical tools you use depend on the type of data you work with.
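One way to represent the four kinds of data in R is sketched below. Treat it as an illustration rather than the only approach: the jersey numbers and rankings are made-up values, and the Fahrenheit-to-Kelvin conversion is the standard formula.

```r
# Nominal: jersey numbers are just labels, so store them as a factor
jersey <- factor(c(7, 10, 23))

# Ordinal: rankings have an order, but the spacing between ranks is unknown
ranking <- factor(c("third", "first", "second"),
                  levels = c("first", "second", "third"),
                  ordered = TRUE)

# Interval: Fahrenheit has equal differences but an arbitrary zero
temp_f <- c(40, 80)

# Ratio: the Kelvin scale has a meaningful zero
temp_k <- (temp_f - 32) * 5 / 9 + 273.15

ranking[1] > ranking[2]   # TRUE: "third" comes after "first"
temp_f[2] / temp_f[1]     # 2, but that doesn't mean "twice as hot"
temp_k[2] / temp_k[1]     # about 1.08, the ratio that does make sense
```

Storing nominal values as a factor also keeps you from accidentally averaging labels.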
A little probability
When statisticians make decisions, they use probability to express their confidence about those decisions. They can never be absolutely certain about what they decide. They can tell you only how probable their conclusions are.
What do we mean by probability? Mathematicians and philosophers might give you complex definitions. In my experience, however, the best way to understand probability is in terms of examples.
Here’s a simple example: If you toss a coin, what’s the probability that it turns up heads? If the coin is fair, you might figure that you have a 50-50 chance of heads and a 50-50 chance of tails. And you’d be right. In terms of the kinds of numbers associated with probability, that’s ½.
Think about rolling a fair die (one member of a pair of dice). What’s the probability that you roll a 4? Well, a die has six faces and one of them is 4, so that’s ⅙.
Still another example: Select one card at random from a standard deck of 52 cards. What’s the probability that it’s a diamond? A deck of cards has four suits, so that’s ¼.
These examples tell you that if you want to know the probability that an event occurs, count how many ways that event can happen and divide by the total number of events that can happen. In the first two examples (heads, 4), the event you’re interested in happens only one way. For the coin, we divide 1 by 2. For the die, we divide 1 by 6. In the third example (diamond), the event can happen 13 ways (Ace through King), so we divide 13 by 52 (to get ¼).
Now for a slightly more complicated example. Toss a coin and roll a die at the same time. What’s the probability of tails and a 4? Think about all the possible events that can happen when you toss a coin and roll a die at the same time. You could have tails and 1 through 6, or heads and 1 through 6. That adds up to 12 possibilities. The tails-and-4 combination can happen only one way. So the probability is 1/12.
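You can let R do the counting for the coin-and-die example. This sketch uses base R's expand.grid() to list every combination. The object names (outcomes, n_total, n_event, p_event) are my own.

```r
# List every coin-and-die combination
outcomes <- expand.grid(coin = c("heads", "tails"), die = 1:6)

n_total <- nrow(outcomes)   # total number of possible events: 12

# Count the ways "tails and a 4" can happen: 1
n_event <- sum(outcomes$coin == "tails" & outcomes$die == 4)

p_event <- n_event / n_total   # 1 divided by 12
p_event
```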
In general, the formula for the probability that a particular event occurs is
$$\Pr(\text{event}) = \frac{\text{number of ways the event can occur}}{\text{total number of possible events}}$$
At the beginning of this section, I say that statisticians express their confidence about their conclusions in terms of probability, which is why I brought all this up in the first place. This line of thinking leads to conditional probability: the probability that an event occurs given that some other event occurs. Suppose that I roll a die, look at it (so that you don’t see it), and tell you that I rolled an odd number. What’s the probability that I’ve rolled a 5? Ordinarily, the probability of a 5 is ⅙, but "I rolled an odd number" narrows it down. That piece of information eliminates the three even numbers (2, 4, 6) as possibilities. Only the three odd numbers (1, 3, 5) are possible, so the probability is ⅓.
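The die example translates into a few lines of R. This is a sketch with names of my own choosing. The condition "I rolled an odd number" simply shrinks the set of outcomes you count over.

```r
die <- 1:6

# Unconditional probability of a 5: one face out of six
p_five <- mean(die == 5)

# Condition on "odd": keep only the outcomes still possible
odd <- die[die %% 2 == 1]            # 1 3 5

# Conditional probability of a 5 given odd: one face out of three
p_five_given_odd <- mean(odd == 5)

c(p_five = p_five, p_five_given_odd = p_five_given_odd)
```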
What’s the big deal about conditional probability? What role does it play in statistical analysis? Read on.
Inferential statistics: Testing hypotheses
Before a statistician does a study, they draw up a tentative explanation — a hypothesis that tells why the data might come out a certain way. After gathering all the data, the statistician has to decide whether to reject the hypothesis.
That decision is the answer to a conditional probability question — what’s the probability of obtaining the data, given that this hypothesis is correct? Statisticians have tools that calculate the probability. If the probability turns out to be low, the statistician rejects the hypothesis.
Back to coin-tossing for an example: Imagine that you’re interested in whether a particular coin is fair, meaning that it has an equal chance of heads or tails on any toss. Let’s start with "The coin is fair" as the hypothesis.
To test the hypothesis, you’d toss the coin a number of times — let’s say 100. These 100 tosses are the sample data. If the coin is fair (as per the hypothesis), you’d expect 50 heads and 50 tails.
If it’s 99 heads and 1 tail, you’d surely reject the fair-coin hypothesis: The conditional probability of 99 heads and 1 tail given a fair coin is very low. Of course, the coin could still be fair and you could, quite by chance, get a 99-1 split, right? Sure. You never really know. You have to gather the sample data (the 100 toss results) and then decide. Your decision might be right, or it might not.
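To see just how low that conditional probability is, you can ask R. The sketch below uses dbinom(), the binomial probability function in base R's stats package, and assumes the 100 tosses are independent. The object names are mine.

```r
# Probability of exactly 99 heads in 100 tosses, given a fair coin
p_99_heads <- dbinom(99, size = 100, prob = 0.5)

# For comparison: probability of exactly 50 heads, given a fair coin
p_50_heads <- dbinom(50, size = 100, prob = 0.5)

p_99_heads   # roughly 8e-29
p_50_heads   # roughly 0.08
```

Notice that even the single most likely outcome, exactly 50 heads, has a probability of only about 0.08. What matters is how the result compares with what a fair coin typically produces.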
Juries make these types of decisions. In the United States, the starting hypothesis is that the defendant is not guilty ("innocent until proven guilty"). Think of the evidence as data. Jury members consider the evidence and answer a conditional probability question: What’s the probability of the evidence, given that the defendant is not guilty? Their answer determines the verdict.
Null and alternative hypotheses
Think again about that coin-tossing study I just mentioned. The sample data are the results from the 100 tosses. I said that we can start with the hypothesis that the coin is fair. This starting point is called the null hypothesis. The statistical notation for the null hypothesis is H0. According to this hypothesis, any heads-tails split in the data is consistent with a fair coin. Think of it as the idea that nothing in the sample data is out of the ordinary.
An alternative hypothesis is possible — that the coin isn’t a fair one and it’s biased to produce an unequal number of heads and tails. This hypothesis says that any heads-tails split is consistent with an unfair coin. This alternative hypothesis is called, believe it or not, the alternative hypothesis. The statistical notation for the alternative hypothesis is H1.
Now toss the coin 100 times and note the number of heads and tails. If the results are something like 90 heads and 10 tails, it’s a good idea to reject H0. If the results are around 50 heads and 50 tails, don’t reject H0.
Similar ideas apply to the IQ example I gave earlier. One sample receives the computer-based IQ training method, and the other participates in a different computer-based activity, such as reading text on a website. Before and after each group completes its activities, the researcher measures each person’s IQ. The null hypothesis, H0, is that one group’s improvement isn’t different from the other’s. If the improvements are greater with the IQ training than with the other activity (so much greater that a difference this large would be unlikely if H0 were true), reject H0. If they’re not, don’t reject H0.
Remember Notice that I did not say "accept H0." The way the logic works, you never accept a hypothesis. You either reject H0 or don’t reject H0. In a jury trial, the verdict is either "guilty" (reject the null hypothesis of "not guilty") or "not guilty" (don’t reject H0). "Innocent" (acceptance of the null hypothesis) is not a possible verdict.
Notice also that in the coin-tossing example, I said "around 50 heads and 50 tails." What does "around" mean? Also, I said that if it’s 90-10, reject H0. What about 85-15? 80-20? 70-30? Exactly how much different from 50-50 does the split have to be for you to reject H0? In the IQ training example, how much greater does the IQ improvement have to be to reject H0?
I don’t answer these questions now. Statisticians have formulated decision rules for situations like this, and I’ll explore those rules in Book 3.
Two types of error
Whenever you evaluate data and decide to reject H0 or to not reject H0, you can never be absolutely sure. You never really know the "true" state of the world. In the coin-tossing example, that means you can’t be certain whether the coin is fair. All you can do is make a decision based on the sample data. If you want to know for sure about the coin, you have to have the data for the entire population of tosses, which means you have to keep tossing the coin until the end of time.
Because you’re never certain about your decisions, you can make an error either way you decide. As I mention earlier, the coin could be fair and you just happen to get 99 heads in 100 tosses. That’s not likely, and that’s why you reject H0 if that happens. It’s also possible that the coin is biased, yet you just happen to toss 50 heads in 100 tosses. Again, that’s not likely and you don’t reject H0 in that case.
Although those errors are not likely, they are possible. They lurk in every study that involves inferential statistics. Statisticians have named them Type I errors and Type II errors.
If you reject H0 and you shouldn’t, that’s a Type I error. In the coin example, that’s rejecting the hypothesis that the coin is fair, when in reality it is a fair coin.
If you don’t reject H0 and you should have, that’s a Type II error. It happens if you don’t reject the hypothesis that the coin is fair, and in reality it’s biased.
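A simulation can show both kinds of error in action. Everything specific in this sketch is an assumption made for illustration: the decision rule (reject H0 at 40 or fewer heads, or at 60 or more), the biased coin's 60 percent chance of heads, and the number of simulated experiments. It is not one of the formal decision rules statisticians use.

```r
set.seed(1)
n_experiments <- 10000   # simulated studies, 100 tosses each
n_tosses <- 100

# Hypothetical decision rule (an assumption, not a formal test):
# reject H0 when heads is 40 or fewer, or 60 or more
rejects_h0 <- function(heads) heads <= 40 | heads >= 60

# Fair coin: H0 is true, so every rejection is a Type I error
heads_fair <- rbinom(n_experiments, size = n_tosses, prob = 0.5)
type1_rate <- mean(rejects_h0(heads_fair))

# Biased coin (made-up 60% heads): H0 is false,
# so every failure to reject is a Type II error
heads_biased <- rbinom(n_experiments, size = n_tosses, prob = 0.6)
type2_rate <- mean(!rejects_h0(heads_biased))

c(type1_rate = type1_rate, type2_rate = type2_rate)
```

Try tightening or loosening the rule. Making one type of error rarer tends to make the other more common.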
How do you know if you’ve made either type of error? You don’t, at least not right after you make the decision to reject or not reject H0. (If it were possible to know, you wouldn’t make the error in the first place!) All you can do is gather more data and see whether the additional data is consistent with your decision.
If you think of H0 as a tendency to maintain the status quo and not interpret anything as being out of the ordinary (no matter how it looks), a Type II error means you’ve missed out on something big. In fact, some iconic mistakes are Type II errors.
Here’s what I mean. On New Year’s Day in 1962, a rock group consisting of three guitarists and a drummer auditioned in the London studio of a major recording company. Legend has it that the recording executives didn’t like what they heard, didn’t like what they saw, and believed that guitar groups were on the way out. Although the musicians played their hearts out, the group failed the audition.
Who was that group? The Beatles!
And that’s a Type II error.
Getting R
Now that I’ve taken you through the world that R lives in, let’s dive into R.
If you don’t already have R on your computer, the first thing to do is to download R and install it.
You’ll find the appropriate software on the website of the Comprehensive R Archive Network (CRAN). In your browser, type this web address:
https://cran.rstudio.com
Click the appropriate link to download R for your computer.
Getting RStudio
Working with R is a lot easier if you do it through an application called RStudio. Computer honchos refer to RStudio as an IDE (Integrated Development Environment). Think of it as a tool that helps you write, edit, run, and keep track of your R code, and as an environment that connects you to a world of helpful hints about R.
Here’s the web address for this terrific tool:
www.rstudio.com/products/rstudio/download
Click the link for the installer for your flavor of computer and again follow the usual installation procedures. (You’ll want RStudio Desktop.)
Tip In this book, I work with R version 4.2.0 and RStudio version 2022.02.3 Build 492. By the time you read this, later versions of both might be available. Incidentally, each version of R has its own whimsical nickname. Version 4.2.0 is called Vigorous Calisthenics. Why? I have no idea. Perhaps it reflects an evolution from the previous version, One Push-Up.
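If you're curious which version (and nickname) you ended up with, R can tell you. This short sketch uses the built-in R.version object. The names r_version and r_nickname are my own.

```r
# Ask R which version is running
r_version <- R.version.string

# Each release also carries a nickname
r_nickname <- R.version$nickname

r_version
r_nickname
```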
After you finish installing R and RStudio, click your brand-new RStudio icon to open the window that looks very much like the window shown in Figure 1-2. It won’t be an exact match, because my history with RStudio — reflected in the upper right pane — is probably different from yours.
FIGURE 1-2: RStudio, immediately after you install it and click its icon.
The large Console pane on the left runs R code. One way to run R code is to type it directly into the Console pane. I show you another in a moment.
The other two panes provide helpful information as you work with R. The Environment and History pane is in the upper right. The Environment tab keeps track of the things you create (which R calls objects) as you work with R. The History tab tracks R code that you enter.
Tip Get used to the word object. Everything in R is an object.
The Files, Plots, Packages, and Help pane is in the lower right. The Files tab shows files you create. The Plots tab holds graphs you create from your data. The Packages tab shows add-ons (called packages) that have downloaded with R. Bear in mind that "downloaded" doesn’t mean "ready to use." To use a package’s capabilities, one more step is necessary, and trust me, you’ll want to use packages.
Figure 1-3 shows the Packages tab. I discuss packages later in this chapter.
FIGURE 1-3: The RStudio Packages tab.
The Help tab, shown in Figure 1-4, links you to a wealth of information about R and RStudio.
FIGURE 1-4: The RStudio Help tab.
To tap into the full power of RStudio as an IDE, click the icon in the rightmost upper corner of the Console pane. (It looks like a tall folder with a gray band across the top.) That changes the appearance of RStudio so that it looks like Figure 1-5.
FIGURE 1-5: RStudio, after you click the icon in the upper right corner of the Console pane.
The Console pane relocates to the lower left. The new pane in the upper left is the Scripts pane. You type and edit code in the Scripts pane, press Ctrl+Enter (Command+Enter on the Mac), and then the code executes in the Console pane.
Tip You can also highlight lines of code in the Scripts pane and choose Code⇒ Run Selected Line(s) from RStudio’s main menu.
A Session with R
Before you start working, choose File ⇒ Save As from RStudio’s main menu and then save the blank pane as My First R Session. This relabels the tab in the Scripts pane with the name of the file and adds the .R extension. This also causes the filename (along with the .R extension) to appear on the Files tab.
The working directory
When you follow my advice and save something called My First R Session, what exactly is R saving and where does R save it? What R saves is called the workspace, which is the environment you’re working in. R saves the workspace in the working directory. In Windows, the default working directory is
C:\Users\
If you ever forget the path to your working directory, type
> getwd()
in the Console pane, and R returns the path onscreen.
Tip In the Console pane, you don’t have to type the right-pointing arrowhead at the beginning of the line. That’s a prompt, and it’s there by default.
My working directory looks like this:
> getwd()
[1] "C:/Users/Joseph Schmuller/Documents"
Note the direction in which the slashes are slanted. They’re opposite to what you typically see in Windows file paths. This is because R uses \ as an escape character, meaning that whatever follows the \ means something different from what it usually means. For example, \t in R stands for a tab character.
Tip You can also write a Windows file path in R as
C:\\Users\\
If you like, you can change the working directory:
> setwd(
Another way to change the working directory is to choose Session⇒ Set Working Directory⇒ Choose Directory from RStudio’s main menu.
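As a minimal sketch of the commands in this section, the following lines save the current working directory, switch to another folder, and switch back. The folder returned by tempdir() is only a stand-in chosen for illustration, and old_dir is a made-up name; substitute any folder that exists on your computer. The last two lines show the escape character at work.

```r
# Remember where we started so that we can come back
old_dir <- getwd()

# tempdir() is a stand-in for illustration; any existing folder works
setwd(tempdir())
getwd()              # now shows the temporary folder

# Return to the original working directory
setwd(old_dir)

# "\t" counts as one tab character, and "\\" as one backslash
nchar("a\tb")        # 3
nchar("C:\\Users")   # 8
```

Saving the old path first is a design choice: it lets you undo the change without having to remember or retype the original folder.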
Getting started
Let’s get down to business and start writing R code. In the Scripts pane, type
x <- c(5,10,15,20,25,30,35,40)
and then press Ctrl+Enter.
That puts this line into the Console pane:
> x <- c(5,10,15,20,25,30,35,40)
As I said in an earlier Tip, the right-pointing arrowhead (the greater-than sign) is a prompt that R puts in the Console pane. You don’t see it in the Scripts pane.
Here’s what R just did: The arrow sign says that x gets assigned whatever is to the right of the arrow sign. Think of the arrow sign as R’s assignment operator.
So the set of numbers 5, 10, 15, 20 … 40 is now assigned to x.
Remember In R-speak, a set of numbers like this is a vector. I tell you more about this topic in the later section "R Structures." That c in front of the parentheses is what does the actual vector-creating.
You can read that line of code as "x gets the vector 5, 10, 15, 20 … 40."
Type x into the Scripts pane and press Ctrl+Enter, and here’s what you see in the Console pane:
> x
[1] 5 10 15 20 25 30 35 40
The 1 in square brackets is the label for the first line of output. So this signifies that 5 is the first value.
Here you have only one line, of course. What happens when R outputs many values over many lines? Each line gets a bracketed numeric label, and the number corresponds to the first value in the line. For example, if the output consists of 23 values and the 18th value is the first one on the second line, the second line begins with [18].
Creating the vector x adds the line in Figure 1-6 to the Environment tab.
FIGURE 1-6: A line in the RStudio Environment tab after creating the vector x.
Tip Another way to see the objects in the environment is to type
ls()
into the Scripts pane and then press Ctrl+Enter. Or you can type
> ls()
directly into the Console pane and press Enter. Either way, the result in the Console pane is
[1] "x"
Now you can work with x. First, add all the numbers in the vector. Typing
sum(x)
in the Scripts pane (be sure to follow with pressing Ctrl+Enter) executes the following line in the Console pane:
> sum(x)
[1] 180
How about the average of the numbers in vector x?
That would be
mean(x)
in the Scripts pane, which (when followed by pressing Ctrl+Enter) executes
> mean(x)
[1] 22.5
in the Console pane.
Tip As you type in the Scripts pane or in the Console pane, you see that helpful information pops up. As you become experienced with RStudio, you learn how to use that information.
Variance is a measure of how much a set of numbers differ from their mean. Here’s how to use R to calculate variance:
> var(x)
[1] 150
What exactly is variance and what does it mean? I tell you all about it in Book 2.
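To see what var() is doing, here is a sketch that rebuilds the calculation step by step. R’s var() computes the sample variance, which divides the sum of squared deviations by one less than the number of values. The names deviations and manual_var are made up for this illustration.

```r
x <- c(5,10,15,20,25,30,35,40)

# How far each value sits from the mean (22.5)
deviations <- x - mean(x)

# Sum of squared deviations divided by (number of values - 1): 1050 / 7
manual_var <- sum(deviations^2) / (length(x) - 1)

manual_var   # 150
var(x)       # 150
```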
After R executes all these commands, the History tab looks like the one in Figure 1-7.
FIGURE 1-7: The History tab, after creating and working with a vector.
To end a session, choose File⇒ Quit Session from RStudio’s main menu or press Ctrl+Q. As Figure 1-8 shows, a dialog box opens and asks what you want to save from the session. Saving the selections enables you, the next time you open RStudio, to reopen the session where you left off (although the Console pane doesn’t save your work).
FIGURE 1-8: The Quit R Session dialog box.
Remember Moving forward, most of the time I don’t say "Type this code into the Scripts pane and press Ctrl+Enter" whenever I take you through an example. I just show you the code and its output, as in the var() example.
Remember Also, sometimes I show code with the > prompt, and sometimes without. Generally, I show the prompt when I want you to see R code and its results. I don’t show the prompt when I just want you to see R code that I create in the Scripts pane.
R Functions
The examples in the preceding section use c(), sum(), and var(). These are three functions built into R. Each one consists of a function name immediately followed by parentheses. Inside the parentheses are arguments. In the context of a function, argument doesn’t mean "debate" or "disagreement" or anything like that. It’s the math term for whatever a function operates on.
Remember Sometimes a function takes no arguments (as is the case with ls()). You still include the parentheses.
The functions in the examples I showed you are pretty simple: Supply an argument, and each one gives you a result. Some R functions, however, take more than one argument.
R has a couple of ways for you to deal with multiargument functions. One way is to list the arguments in the order that they appear in the function’s definition. R calls this positional mapping.
Here’s an example. Remember when I created the vector x?
x <- c(5,10,15,20,25,30,35,40)
Another way to create a vector of those numbers is with the function seq():
> y <- seq(5,40,5)
> y
[1] 5 10 15 20 25 30 35 40
Think of seq() as creating a sequence.
The first argument to seq() is the number to start the sequence from (5). The second argument is the number that ends the sequence — the number the sequence goes to (40). The third argument is the increment of the sequence — the amount the sequence increases by (5, in this case).
If you name the arguments, it doesn’t matter how you order them:
> z <- seq(to=40,by=5,from=5)
> z
[1] 5 10 15 20 25 30 35 40
So if you name a function’s arguments when using it, you can place them out of order. R calls this keyword matching. This comes in handy when you use an R function that has many arguments. If you can’t remember their order, use their names and the function works.
Tip For help with a particular function — seq(), for example — type ?seq and press Ctrl+Enter to open helpful information on the Help tab.
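Here is a small sketch that puts positional mapping and keyword matching side by side. The variable names (a, b, c_mixed, d) are made up for this illustration. The length.out argument, which asks seq() for a number of values rather than a step size, is part of seq() but isn’t covered in the text above.

```r
# Positional mapping: from, to, by, in that order
a <- seq(5, 40, 5)

# Keyword matching: the order no longer matters
b <- seq(by = 5, to = 40, from = 5)

# A mix: the first two arguments by position, the third by name
c_mixed <- seq(5, 40, by = 5)

# length.out requests 8 evenly spaced values instead of a step size
d <- seq(from = 5, to = 40, length.out = 8)

# Each comparison checks that the result matches the first vector
all.equal(a, b)
all.equal(a, c_mixed)
all.equal(a, d)
```

Naming at least the less obvious arguments (by, length.out) tends to make a call easier to read later, even when positional mapping would work.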
User-Defined Functions
R enables you to create your own functions, and here are the fundamentals on how to do it.
The form of an R function is
myfunction <- function(argument1, argument2, …){
statements
return(object)
}
Here’s a function for dealing with right triangles. Remember them? A right triangle has two sides that form a right angle and a third side called a hypotenuse. You might also remember that a guy named Pythagoras showed that if one side has length a, and the other side has length b, the length of the hypotenuse, c, is
$$c = \sqrt{a^2 + b^2}$$
So here’s a simple function called hypotenuse() that takes two numbers, a and b (the lengths of the two sides of a right triangle), and returns c, the length of the hypotenuse.
hypotenuse <- function(a,b){
hyp <- sqrt(a^2+b^2)
return(hyp)
}
Type that code snippet into the Scripts pane and highlight it. Then press Ctrl+Enter. Here’s what appears in the Console pane:
> hypotenuse <- function(a,b){
+ hyp <- sqrt(a^2+b^2)
+ return(hyp)
+ }
Each plus sign is a continuation prompt. It just indicates that a line continues from the preceding line.
And here’s how to use the function:
> hypotenuse(3,4)
[1] 5
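A few more calls, sketched here for illustration, show that a user-defined function follows the same argument rules as the built-in ones. The function is repeated so that the snippet runs on its own. The last call passes vectors (covered in the "R Structures" section) and relies on the fact that sqrt() and the arithmetic operators work element by element.

```r
hypotenuse <- function(a,b){
  hyp <- sqrt(a^2+b^2)   # Pythagorean theorem
  return(hyp)
}

hypotenuse(5,12)             # 13
hypotenuse(b=4,a=3)          # 5, with the arguments matched by name
hypotenuse(c(3,5),c(4,12))   # 5 13, one answer per pair of sides
```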
Comments
A comment is a way of annotating code. Begin a comment with the # symbol, which, as everyone knows, is called an octothorpe. (Wait. What? "Hashtag?" Getattahere!) This symbol tells R to ignore everything to the right of it.
Comments help someone who has to read the code you’ve written. For example:
hypotenuse <- function(a,b){ # list the arguments
hyp <- sqrt(a^2+b^2) # perform the computation
return(hyp) # return the value
}
Here’s a heads-up: I don’t typically add comments to lines of code in this book. Instead, I provide detailed descriptions. In a book like this, I feel it’s the best way to get the message across.
R Structures
As I mention in the "R Functions" section, earlier in this chapter, an R function can have many arguments. An R function can also have many outputs. To understand the possible inputs and outputs, you must understand the structures that R works with.
Vectors
The vector is the fundamental structure in R. I show it to you in earlier examples. It’s an array of elements of the same type. The data elements in a vector are called components.
To create a vector, use the function c(), as I did in the earlier example:
x <- c(5,10,15,20,25,30,35,40)
In the vector x, of course, the components are numbers.
In a character vector, the components are quoted text strings:
> beatles <- c("john","paul","george","ringo")
It’s also possible to have a logical vector, whose components are TRUE and FALSE, or the abbreviations T and F:
> w <- c(T,F,F,T,T,F)
To refer to a specific component of a vector, follow the vector name with a bracketed number:
> beatles[2]
[1] "paul"
Within the brackets, you can use a colon (:) to refer to a range of consecutive components:
> beatles[2:3]
[1] "paul"   "george"
Want to refer to non-consecutive components? That’s a bit more complicated, but doable via c():
> beatles[c(2,4)]
[1] "paul"  "ringo"
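The same bracket notation works on any vector. The sketch below applies it to the numerical vector x from earlier in the chapter. The last two lines go slightly beyond the text: they assume a logical vector w (made up here, one value per component of x) and use it inside the brackets, where TRUE keeps a component and FALSE drops it.

```r
x <- c(5,10,15,20,25,30,35,40)

x[2]        # 10
x[2:4]      # 10 15 20
x[c(1,8)]   # 5 40

# A logical vector can do the selecting: T keeps, F drops
w <- c(T,F,F,T,T,F,F,F)
x[w]        # 5 20 25
```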
Numerical vectors
In addition to c(), R provides two shortcut functions for creating numerical vectors. One, seq(), I showed you earlier:
> y <- seq(5,40,5)
> y
[1] 5 10 15 20 25 30 35 40
Without the third argument, the sequence increases by 1:
> y <- seq(5,40)
> y
[1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[20] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Remember On my screen, and probably on yours too, all the elements in y appear on one line. The printed page, however, is not as wide as the Console pane. Accordingly, I separated the output into two lines and added the R-style bracketed number [20]. I do that throughout the book where necessary.
Tip R has a special syntax for creating a numerical vector whose elements increase by 1:
> y <- 5:40
> y
[1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[20] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Another function, rep(), creates a vector of repeating values:
> quadrifecta <- c(7,8,4,3)
> repeated_quadrifecta <- rep(quadrifecta,3)
> repeated_quadrifecta
[1] 7 8 4 3 7 8 4 3 7 8 4 3
You can also supply a vector as the second argument:
> rep_vector <-c(1,2,3,4)
> repeated_quadrifecta <- rep(quadrifecta,rep_vector)
The vector specifies the number of repetitions for each element. So here’s what happens:
> repeated_quadrifecta
[1] 7 8 8 4 4 4 3 3 3 3
The first element repeats once; the second, twice; the third, three times; and the fourth, four times.
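As a quick sketch, rep() also accepts named arguments that spell out which kind of repetition you want. The times and each arguments are part of rep(), although the text above uses only the positional form.

```r
quadrifecta <- c(7,8,4,3)

# Repeat the whole vector twice
rep(quadrifecta, times = 2)   # 7 8 4 3 7 8 4 3

# Repeat each element twice before moving on to the next
rep(quadrifecta, each = 2)    # 7 7 8 8 4 4 3 3

# A vector of counts: 1 + 2 + 3 + 4 values in all
length(rep(quadrifecta, c(1,2,3,4)))   # 10
```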
You can use append() to add an item at the end of a vector:
> xx <- c(3,4,5)
> xx
[1] 3 4 5
> xx <- append(xx,6)
> xx
[1] 3 4 5 6
How many items are in a vector? That’s
> length(xx)
[1] 4
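One detail worth a sketch: append() returns a new vector rather than changing the original, which is why the earlier example assigns the result back to xx. The after argument, which is part of append() but not shown above, places the new item somewhere other than the end.

```r
xx <- c(3,4,5)

append(xx, 6)               # 3 4 5 6
append(xx, 99, after = 1)   # 3 99 4 5
c(xx, 6)                    # 3 4 5 6, the same result as the first call

# xx itself is unchanged because none of the results were assigned back
length(xx)                  # 3
```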
Matrices
A matrix is a two-dimensional array of data elements of the same type. You can have a matrix of numbers:
5 30 55 80
10 35 60 85
15 40 65 90
20 45 70 95
25 50 75 100
or a matrix of character strings:
"john"    "paul"  "george" "ringo"
"groucho" "harpo" "chico"  "zeppo"
"levi"    "duke"  "larry"  "obie"
The numbers are a 5 (rows) X 4 (columns) matrix. The character strings matrix is 3 X 4.
To create this particular 5 X 4 numerical matrix, first create the vector of numbers from 5 to 100 in steps of 5:
> num_matrix <- seq(5,100,5)
Then you use R’s dim() function to turn the vector into a two-dimensional matrix:
> dim(num_matrix) <- c(5,4)
> num_matrix
[,1] [,2] [,3] [,4]
[1,] 5 30 55 80
[2,] 10 35 60 85
[3,] 15 40 65 90
[4,] 20 45 70 95
[5,] 25 50 75 100
Note how R displays the bracketed row numbers along the side and the bracketed column numbers along the top.
Transposing a matrix interchanges the rows with the columns. The t() function takes care of that:
> t(num_matrix)
[,1] [,2] [,3] [,4] [,5]
[1,] 5 10 15 20 25
[2,] 30 35 40 45 50
[3,] 55 60 65 70 75
[4,] 80 85 90 95 100
The function matrix() gives you another way to create matrices:
> num_matrix <- matrix(seq(5,100,5),nrow=5)
> num_matrix
[,1] [,2] [,3] [,4]
[1,] 5 30 55 80
[2,] 10 35 60 85
[3,] 15 40 65 90
[4,] 20 45 70 95
[5,] 25 50 75 100
If you add the argument byrow=T, R fills the matrix by rows, like this:
> num_matrix <- matrix(seq(5,100,5),nrow=5,byrow=T)
> num_matrix
[,1] [,2] [,3] [,4]
[1,] 5 10 15 20
[2,] 25 30 35 40
[3,] 45 50 55 60
[4,] 65 70 75 80
[5,] 85 90 95 100
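The sketch below checks that the approaches described so far agree with one another. The names values, by_dim, by_matrix, and by_rows are made up for this illustration.

```r
values <- seq(5,100,5)

# Approach 1: give the vector dimensions (filled column by column)
by_dim <- values
dim(by_dim) <- c(5,4)

# Approach 2: matrix() with the default column-by-column fill
by_matrix <- matrix(values, nrow = 5)
all(by_dim == by_matrix)    # TRUE

# Filling by rows is the same as filling a 4-row matrix by columns
# and then transposing it
by_rows <- matrix(values, nrow = 5, byrow = TRUE)
all(by_rows == t(matrix(values, nrow = 4)))   # TRUE

# Transposing swaps the dimensions
dim(t(by_matrix))           # 4 5
```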
How do you refer to a specific matrix component? You type the matrix name and then, in brackets, the row number, a comma, and the column number:
> num_matrix[5,4]
[1] 100
To refer to a whole row (like the third one):
> num_matrix[3,]
[1] 45 50 55 60
and to a whole column (like the second one):
> num_matrix[,2]
[1] 10 30 50 70 90
Although it’s a column, R displays it as a row in the Console pane.
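Here is a short sketch of why a column comes back looking like a row: when you pull out a single row or column, R simplifies the result to a plain vector, which has no dimensions. The drop argument of the bracket operator, not covered in the text above, turns that simplification off. The names col2 and col2_matrix are made up for this illustration.

```r
num_matrix <- matrix(seq(5,100,5), nrow = 5, byrow = TRUE)

num_matrix[3,2]            # 50

# A single column is simplified to a plain vector
col2 <- num_matrix[,2]
dim(col2)                  # NULL

# drop = FALSE keeps it as a 5 X 1 matrix
col2_matrix <- num_matrix[,2, drop = FALSE]
dim(col2_matrix)           # 5 1

# Vectors work inside the brackets too: rows 1 and 5, columns 3 and 4
num_matrix[c(1,5), 3:4]
```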
BUT BEAR IN MIND …
As I mention, a matrix is a two-dimensional array. In R, however, an array can have more than two dimensions. One well-known set of data (which I use as an example in Chapter 1 of Book 3) has three dimensions: Hair Color (Black, Brown, Red, Blond), Eye Color (Brown, Blue, Hazel, Green), and Gender (Male, Female). So this particular array is 4 X 4 X 2. It’s called HairEyeColor and it looks like this:
> HairEyeColor
, , Sex = Male
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
, , Sex = Female
Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
Each number represents the number of people in this group who have a particular combination of hair color, eye color, and gender — 16 brown-eyed red-haired females, for example. (Why did I choose brown-eyed red-haired females? Because I have the pleasure of looking at an extremely beautiful one every day!)
How would I refer to all the females? That’s
HairEyeColor[,,2]
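As a sketch, you can also index this array by the labels shown in its printout instead of by position. HairEyeColor comes with R, so no setup is needed. The name females is made up for this illustration.

```r
dim(HairEyeColor)                       # 4 4 2

# Same slice as HairEyeColor[,,2], selected by label
females <- HairEyeColor[,,"Female"]

# Brown-eyed, red-haired females, two ways
females["Red","Brown"]                  # 16
HairEyeColor["Red","Brown","Female"]    # 16
```

Labels tend to be safer than positions here because you don’t have to remember which level comes first.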
Lists
In R, a list is a collection of objects that aren’t necessarily the same type. Suppose you’re putting together some information on the Beatles:
> beatles <- c("john","paul","george","ringo")
One piece of important information might be each Beatle’s age when he joined the group. John and Paul started singing together when they were 17 and 15, respectively, and 14-year-old George joined them soon after. Ringo, a late arriver, became a Beatle when he was 22. So
> ages <- c(17,15,14,22)
To combine the information into a list, you use the list() function:
> beatles_info <-list(names=beatles,age_joined=ages)
Naming each argument (names, age_joined) causes R to use those names as the names of the list components.
And here’s what the list looks like:
> beatles_info
$names
[1] "john"   "paul"   "george" "ringo"
$age_joined
[1] 17 15 14 22
R uses the dollar sign ($) to indicate each component of the list. If you want to refer to a list component, you type the name of the list, the dollar sign, and the component name:
> beatles_info$names
[1] "john"   "paul"   "george" "ringo"
And