Practical R 4
Applying R to Data Manipulation,
Processing and Integration
Jon Westfall
Practical R 4: Applying R to Data Manipulation, Processing and Integration
Jon Westfall
Division of Counselor Education & Psychology
Division of Student Success Center
Delta State University
Cleveland, MS, USA
Dedicated to the little people in my life: Kaden, Ryleigh, Ryan,
Amelia, Loretta, Rhett, Grant, and Walt.
May a love of reading all sorts of books (even boring ones like this!)
be with you for life.
Table of Contents

About the Author
Acknowledgments
Introduction
Index
About the Author
Jon Westfall is an award-winning professor, author, and
practicing cognitive scientist. He teaches a variety of courses
in psychology, from introduction to psychology to graduate
seminars. His current research focuses on the variables that
influence economic and consumer finance decisions, as
well as retention and persistence of college students. With
applications to psychology, information technology, and
marketing, his work finds an intersection between basic
and applied science. His current appointments include
Associate Professor of Psychology, Coordinator of the First
Year Seminar program, and Coordinator of the Psychology
program at Delta State University. Prior to joining the
faculty at Delta State in 2014, he was a Visiting Assistant Professor at Centenary College
of Louisiana and the Associate Director for Research and Technology at the Center for
Decision Sciences, a center within Columbia Business School at Columbia University in
New York City. He now maintains a role with Columbia as a research affiliate/variable
hours officer of administration and technology consultant.
In addition to his research, Dr. Westfall also has career ties in information technology,
where he has worked as a consultant since 1997, founding his own firm, Bug Jr. Systems.
As a consultant, he has developed custom software solutions (including native Windows
32 applications, Windows .NET applications, Windows Phone 7 and Android mobile
applications, as well as ASP, ASP.NET, and PHP web applications). He has also served
as a senior network and systems architect and administrator and been recognized as a
Microsoft Most Valuable Professional (MVP) 2008–2012. He currently is the owner and
managing partner of Secure Research Services LLC. He has authored several fiction and
nonfiction books and presented at academic as well as technology conferences and
gatherings. A native of Ohio, in his spare time, he enjoys knitting, crocheting, creative
writing with the Delta Writers Group, and a variety of other hobbies.
For more information, visit jonwestfall.com, listen to him weekly on the
MobileViews podcast (mobileviews.com), or follow him on Twitter (@jonwestfall).
Acknowledgments
Writing a book is never an easy task and is seldom the task of just one individual, even
in a sole author work. I am indebted to my wife, Karey, for her support throughout this
process. It can’t be easy to have a husband who pounds out thousands of words at a time
on weekends, on road trips, and in hotel rooms. Yet she has never complained once.
I’m also thankful for my parents, Alan and Dianne, who instilled a love of reading
and learning in me early in my life. Writing is only possible after spending a ton of time
reading, whether it be fiction or nonfiction. I’m thankful to my friends who have listened
to me talk about this project and provided feedback (Steve Jocke, Jason Dunn, Matt
Rozema, and my longtime podcasting partner Todd Ogasawara). Other friends who have
supported me without knowing it (by providing inspiration for projects, or stories in this
book) include Christy Riddle, Tricia Killebrew, Kristen Land, Darla Poole, Kesha Pates,
Jontil Coleman, Elise Mallette, Jackie Goldman, Andrés García-Penagos, Sally Zengaro,
and many others in the extended Delta State family. I’d also like to acknowledge my
students, for whom many of these projects originally were designed, for inspiring me to
continue creating. I also am lucky to have the support of those at Apress, especially Mark
Powers, Steve Anglin, and Matt Moodie.
On a personal level, I’d also like to thank the Delta Writers Group (Michael Koehler,
Katy Koehler, Jason Hair, and Dick Denny) for providing a ground for sharpening my
writing and critique skills regularly. I also gain so much support from friends and family,
including Nate and Kristen Toney, Sarah Speelman, Maggie Ditto, Heather Hudgins,
Ashley Newman, Dan, Sue, Scott, Emily, Greg, Janet, Mark, and Brenda Himmel,
Margaret Lee, Christine and Carl Morris, Don Sorcinelli, Tony Rylow, Trella Williams,
Eric Johnson, Elke Weber, Karen Fosheim, Carol Beard, Maria Gaglio, Hope Hanks, Tom
Brady, and many others.
Introduction
In 2007, as I was finishing up my last full year of graduate school, I learned about
R. At the time, most of my data analysis was done in SAS or SPSS, and I had grown
weary of the headaches of trying to find a properly licensed computer to run my
data. A longtime fan of open source software, I started noticing more and more of my
colleagues talking about R, and I began exploring it. In 2009, when I took a position at
the Center for Decision Sciences, working with Eric Johnson and Elke Weber, I found
it to be a nearly R-only environment. Coupled with my background in scripting languages and other
programming duties, I jumped right in. Today I consider myself an R evangelist, having
given multiple talks on the platform, incorporating it into my courses, and now writing
my first book on the subject.
R, in my opinion, is wildly misunderstood outside of its core user base. Even within
its base, many are only familiar with what they use R for – statistics, visualizations, data
formatting, and so on. Outside the base, many of my colleagues view it as intimidating,
given its command-line appearance. Many have shared with me that they “really wish
they could learn R” but that they “can’t afford the time” to devote to it. My hope is that
this book shows that you really can’t afford to not learn R, because once you know it and
recognize its power, your time frees up. How? The report that used to take 25 minutes
now takes 25 seconds. The PowerPoint deck that you needed to update every week to
send to your boss now is automatically created and sent, all without you having to lift
a finger. The analysis that you used to have to spend an entire class period explaining
how to run in SPSS is now run with 3–4 lines of R code (allowing you to spend that class
period explaining what the analysis does, not which menus to drill into to run it). R saves
you time, and in this book, my goal is to show you how you can use it in a myriad of ways
in your life, hence the “Practical” label.
To do this, we’ll start by explaining what R is and how to get up and running with it.
Chapter 1 assumes no knowledge of R, so if you’ve just heard of it or picked up this book
thinking “I don’t even know where to begin,” you’re in the right place. Chapter 2 then
discusses how to get data into R to work with – whether that be a series of columns in a
spreadsheet or finding specific items on a web page or document. Chapters 3 and 4
get into data analysis by collecting data using open source tools (which might end
up saving you money over commercial options). Chapter 5 gives you some everyday
applications for how to use R to format and manipulate data. Chapters 6 and 7 ramp up
your automation skills, while Chapters 8–10 bring R to the cloud, allowing you to run
your analysis anywhere you need to, and have it report back to you.
The specific code in this book is just the beginning, however. My goal is to give you
tools through examples, inspiring you to mix and match as appropriate to your needs.
For example, Chapter 5 discusses sorting data based upon rules, and Chapter 10
includes code to download the latest news headlines. Imagine needing to know the
number of times a certain word appears in news headlines over a 10-day period of time.
While there is no explicit example, by grabbing code from Chapters 5 and 10, one could
easily write code to collect and process the data, and using tools introduced in Chapter 8,
schedule it to run every day. After 10 days, pull the data and sum it up, and you’re all set.
More than anything, my goal here is to inspire innovation with an extremely
powerful tool, R 4, to make your life easier and your work more enjoyable. I don’t know
about you, but I’d rather spend time thinking critically about something than spend
hours counting rows in a spreadsheet or computing statistics on a hand calculator. I’d
rather talk with others about my findings rather than spend an hour wrestling with
PowerPoint to make 30 slides filled with data and analyses (see Chapter 7). Do more of
what matters to you, and let R do the rest.
I hope you are inspired, and look forward to seeing what you create!
CHAPTER 1
Getting Up and Running with R
Welcome to the first chapter of the book that covers every pirate’s favorite programming
language and statistics package, R. With that very bad joke out of the way, let’s talk
about what this book is: your Practical R recipe book for three broad areas – research,
productivity, and automation. In the first part of this book, I’ll give you the essentials
to getting up and running with R and apply those to two research projects you might
find useful in your work: a market research study and a psychological process-tracing
study. In the second part, I’ll talk about how R can be used in your workday to enhance
your productivity – less data science, more useful scripting environment. And finally, in
the third part, I’ll use R to automate some seriously complex tasks, turning R into your
personal assistant through two projects in Chapters 9 and 10. Along the way, you’ll find
that R fits in many different areas of your life and that no job is likely too big for it!
So, the actual business of this book: helping improve your life in practical ways by
using R. Before we can do much of that, we’ll need to fill in a bit of background on this
programming language that is over two and a half decades old, yet still unknown to
many who could benefit from it. In this chapter, I’ll cover
¹ OK, so it probably isn’t really required by law, but certainly feels like it!
© Jon Westfall 2020
J. Westfall, Practical R 4, https://doi.org/10.1007/978-1-4842-5946-7_1
Along the way on our journey, I’ll share with you how I came to use R in the
examples written here and how I continue to find new uses for R in my daily life. So let’s
get started.
What Is R
In my experience, many people come to R after having heard about it from someone
else. And when you hear about it from someone else, you tend to get just one view of
what R is and where it came from. So in this section, I’ll try to give you the most holistic
and encompassing view of what R is, as well as what people think it is. According to the
R-Project homepage, R “is a language and environment for statistical computing and
graphics.” Further, they elaborate that R is
² www.r-project.org/about.html
learn a new tool.” As you’ll see in this book, R is a free and flexible tool that most anyone
can incorporate into their lives.
So who evangelizes for R in the real world? Typically people who have used it for one
particular niche scenario or another. Here are a few examples:
• The professor who wants to give her students a low-cost (e.g., free)
alternative to pricey statistics packages, so she writes a few example
scripts and posts them in her statistics or research methodology
course.
• The IT manager whose firm asks him to compile the annual budget
spreadsheet, and he finds it’s really inconvenient to wait 10 minutes
for Excel to load up just to realize he has the wrong version of the file.
• The secretary who has to update a department’s web page and writes
a script in R to take data from a spreadsheet, format it nicely, and
spit out HTML that she can copy and paste into the web page’s CMS
(content management system) software.
If you don’t see yourself in any of these scenarios, that’s fine and somewhat the
point – many people find themselves using R for a task, but don’t realize that it can do so
much more. Think of R like a Swiss Army knife, but instead of 15–30 tools, it has 15,000
tools, called packages, that anyone can write and share. Those tools do everything
from generating reports (a package named knitr) to adapting and smoothing fMRI data
(adaptsmoFMRI) to sending email (mail). If you’re seeking to do something, you can
probably do it in R.
Oh yeah, and it’s free. R is an open source software product licensed under the GNU
General Public License, and while commercial versions of R exist, the base product
will always be free. What you get with the commercial packages typically consists of
greater optimization for large data operations, specially designed packages, and support
contracts that allow you to call someone when something breaks. However, none of
those enhancements are required to download and start using R today.³ Ironically, it
is this very powerful advantage R has that I feel has actually hurt its adoption in many
scenarios. As a professor, I often encounter people in education and IT that tend to
believe that there is no such thing as a free lunch. When I explain that R is free, I can get
sideways glances that seem to say “Oh sure, it’s just as good as the $6000 statistics package
we buy.” And they’re right – it isn’t just as good, I firmly believe it’s better. How so?
³ All you need for that is a few minutes, a computer, and a desire to learn.
multilevel regression that we will run in Chapter 4), you might find it much easier to
find packages to accomplish it in R than a more common language, such as Python. R is
powerful, established, and accessible.
What R Is Not
If R sounds like it’s your savior when it comes to data processing, analysis, visualization,
reporting, automation, and calculation, then it’s very possible that it is. But it’s also
possible that you’re delusional, because R does have a few things it isn’t so good at, at
least not right now. Here are some warning signs that you might not be ready for R yet (in
which case, buy this book and keep it for later), or that R might not be the best tool out
there (in which case, still buy this book and give it to a friend).
First, R can be very user-unfriendly, for a few reasons. First, the folks behind R, as of
this writing in 2020, are still a little bit stuck in the 1995 Internet. What do I mean?
• The R Project homepage sports a clean design, but isn’t the easiest to
navigate. As you’ll see in a few pages, downloading R is not as easy as
“go to the homepage, click the big Download button.”
Second, R can also be unintuitive for those raised after the command-line era in
computing. For many today, scripting is not something they are familiar with doing.
In my own classes, I’ve found that asking an 18-year-old to “download a file and open
it in R” can be very challenging in the era of cloud computing. Tools like Jamovi, which is
built upon R but incorporates responsive design, help mitigate this problem. Also, R does
not always provide the most intuitive error messages when things go wrong. Someone
unfamiliar with searching the Web to fix a typo or mistake may find R especially
challenging.
Further, R does not run natively on iOS. It can run on a Chromebook and on an
Android device if those devices allow a Linux mode in settings. While it may seem weird
to consider someone using those devices to run R, we’ll see in Chapters 6, 7, 9, and 10
projects that one might want to be able to run from an iOS or iPadOS device, and in some
cases, I’ll show workarounds for those scenarios. There are ways to access an RStudio
Server on iOS as well as use Remote Desktop software such as VNC; however, it can be
limiting for someone who lives their life on an iPhone or iPad. R runs fine on macOS,
Windows, and Linux.
And finally, R can require a bit of future-proofing to ensure it will always run the way
you need it to. We’ll discuss this more in future chapters, but for now, the short version
is packages are updated regularly by their maintainers. If a maintainer needs to re-work
a package, to take out a function or add new functionality, it may break your script. This
means that for any mission-critical application, you’ll want to keep copies of the specific
versions of your packages. As I said, I’ll walk through this later in the book.
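As a small taste of what that future-proofing looks like, here is a sketch that records package versions using only base R; "utils" and "stats" are stand-ins for whatever packages your own script actually depends on:

```r
# Record the exact versions of the packages a script depends on, so a
# later breaking update can be detected or the environment re-created.
deps <- c("utils", "stats")  # illustrative names only
versions <- vapply(deps,
                   function(p) as.character(packageVersion(p)),
                   character(1))
print(versions)
# Saving this alongside the script documents what it was written against:
# write.csv(data.frame(package = deps, version = versions), "deps.csv")
```

Keeping a file like that next to any mission-critical script makes it much easier to spot which package update broke it.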
There you have it – my most compelling arguments for why R might not be the best
for you. Obviously, I don’t provide them to scare you away, but rather to make sure
you’re going into your R adventure with realistic expectations. After all, I can’t help you
in a practical and actionable way if you’re angry with me for the remainder of the book!
The R Landscape
In this short section, I want to give you a bit of an overview on how R has evolved over
time, with special attention on the moving parts that make it somewhat unique in the
programming world. The best way I can think to do this is a timeline (See Figure 1-1).⁴

⁴ Made with R, of course, using the timelineS package. Code available on GitHub under 1-1.timeline.r.
R can trace its inspirations back to the S programming language of the mid-1970s,
with work on R starting in the mid-1990s. It’s always hard to pin down exact dates, but
the earliest recorded conversations on R tend to center around the mailing lists that
began in 1997. A hobby at that time, R became “useful” around 2000, with CRAN (the
Comprehensive R Archive Network) supporting not only copies of the source files
and compiled binaries but also packages that had been submitted. With R’s ability to
download and install packages (via the install.packages() function) as well as update
them, the user was able to easily add functionality to the basic package with just a few
keystrokes. Looking at my timeline code, you’ll notice the very first line installs the
timelineS package, which it needs in order to create the timeline that you see.
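That install-and-load cycle is only a few keystrokes; here is a sketch of it (it assumes an Internet connection for the CRAN download):

```r
# Download and install a package from CRAN (needed once per machine/library)
install.packages("timelineS")

# Load the installed package into the current session
library(timelineS)

# Later, bring every installed package up to its latest CRAN version
update.packages(ask = FALSE)
```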
By the mid-2000s, R had begun to attract a devoted user base, with the first official R
conference, useR!, being held in 2004. The spirit of the future of R was present here, with
keynotes on new features, “Users become Developers,” and data science precursors to
the “Big Data” movement of today. Graphing with R and using R as a teaching tool were
also covered. With this activity, it was inevitable that those with an eye for profit would
begin to notice and utilize R.
In 2007, the first major corporation to make R a centerpiece was founded: Revolution
Analytics. Revolution’s business was to support an open source product, R, add their
own enhancements, and provide a version to their clients for large volume or niche
specifications. Revolution R would become so popular that Microsoft would purchase
them in 2015, with Microsoft building R into many of their cloud business applications
and releasing a free version of R based on the work done by Revolution. In parallel,
smaller software groups began writing enhancements for R, with perhaps the most
Under that Download heading, you see a link for CRAN, the Comprehensive R
Archive Network that I mentioned earlier (See Figure 1-3). Clicking that, you’re asked
to pick a mirror from the list of nearly 100 separate R mirror sites around the world. If
the World Wide Web ever suffers a major break in connectivity, I’m sure you’re happy to
know that while you might not be able to stream your favorite show or use your favorite
social media app, you will likely still be able to get to an R mirror and download the
packages you need to adequately analyze and graph your ensuing depression!
Anyway, today I typically advise people to use the Cloud option at the top of CRAN,
which will automatically find a near mirror to you. From CRAN’s homepage, you do see
installation links for the three major desktop platforms: Linux, macOS, and Windows.
Installation on Windows
For Windows users, you’ll want to follow these steps:
2. Click “base” (See Figure 1-4), which indicates that you’d like to
download the base installation of R with its included packages.
5. Finally, you’ll get an installation wizard (See Figure 1-7) that you
get the joy of pressing “Next” to several times. Accepting the default
options is fine, as we can easily change them later if need be.
Congratulations – you should now have R installed on your computer. Going to your
Start menu, you should see an “R” program group. You’ll notice two versions of R inside
there, one labeled “R i386” and the other “R x64”, with the version number after them. This is by
design, with the i386 option allowing you to run 32-bit R and the x64 allowing 64-bit. For
most things today, you’ll likely want to use the x64 option. In rare cases, you may need
the 32-bit option if you’re using a package that hasn’t been updated yet to support the
64-bit version. One thing to note is that the packages you download for each version
are specific to it, so if you find yourself using both versions interchangeably, you’ll be
frustrated thinking “Didn’t I already download that package?!?” when you try to run
your code.
Go ahead and launch the 64-bit version, and you should see a screen similar to the
one as follows (See Figure 1-8), ready for you to start work!
Installation on macOS
Those on a Mac have a similar installation routine as Windows. From the CRAN
homepage
Once the installer finishes, you’ll find R in your Launchpad – just look for the giant R
icon similar to the one shown here (See Figure 1-11).
Launching it should provide you with a screen similar to the one as follows (See
Figure 1-12), and from that point, you’re ready to work!
Installation on Linux
Installation on Linux is, oddly enough, either much easier than the other platforms
or much more complex, depending on how geeky you’d like to be. R is available pre-
compiled for four common Linux distributions: Debian, Red Hat/Fedora, SUSE, and
Ubuntu. Here are the quickest ways to install the most common pieces of R for each
distribution:
• Note that for Ubuntu, only the latest LTS release is available from the
Ubuntu servers. You can install the latest stable builds by following
the instructions at https://cloud.r-project.org/bin/linux/
ubuntu/README.html.
• For SUSE, you can use the one-click installation links available in
Section 1.4 of https://cloud.r-project.org/bin/linux/suse/
README.html, choosing the appropriate link for your version of
openSUSE.
In many cases, these are a bit easier than on Windows or macOS, since the command
does all the work of downloading the file and installing it. However, if you’re a Linux
aficionado, you may also want to install R from source, which is a bit more time-
consuming and beyond the scope of this book.
Regardless of how you get it installed, to run R on your Linux machine, you simply
run the command R – remembering that Linux is case-sensitive. r will not work; as you
can see in the following image, it must be R. Assuming you issue the appropriately cased
command, you’ll get the R version and copyright, similar to the Windows and macOS
version earlier, with a prompt waiting for your first R program to run (See Figure 1-13).
2 + 2
R executes this, as seen in the following image, and happily tells you that there is one
returned result, the number 4. Getting a little more fancy, you can try this:
x <- 2
x + 2
This also returns 4, as you just assigned the value of “2” to a variable named “x” and
then added x + 2 or 2+2 once more. If the assignment operator “arrow” (really a less than
sign and a hyphen) looks strange to you, you can also use an actual equals sign in most
cases. The following code will also give 4:

y = 2
x + y
We now have completely useless variables named x and y, but their point has been
demonstrated as you reminisce nostalgically back to basic arithmetic (See Figure 1-14).
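To recap the assignment styles just demonstrated, here they are side by side in one sketch (the right-pointing arrow is a third form the text hasn’t mentioned, but base R accepts it too):

```r
x <- 2   # the canonical "arrow" assignment
y = 2    # an equals sign works in most everyday contexts
3 -> z   # the arrow also works pointing right
x + y    # evaluates to 4, as before
```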
You may wonder how we would do something a bit more automated. This is
accomplished by piping a script into the R command-line interpreter. You can do this
from the command line by writing a script, saving it (generally with a .r file extension),
and then providing it as an argument to the R command. Typing R -f 1-2.domath.r in a
terminal will execute the code and provide the output (See Figure 1-15).
You can also open a new R script in the Windows and macOS versions by going to the
File menu and choosing New Script (Windows) or New Document (macOS). You’ll get a
new blank text window near your R interpreter window (See Figure 1-16 and 1-17).
You can type the same commands into the script window as I had earlier. To execute
them, you can put your cursor at the top and execute line by line by pressing Ctrl+R or
choosing Run Line from the Edit menu (Windows). On macOS, you use Cmd+Enter or
choose “Execute” from the Edit menu. To execute the entire script at once, you can either
select all of it and press the shortcut key for your operating system or use the “Run All”
command from the Edit menu (on Windows only).
I’ll stop and take a moment to address something slightly annoying in the last
paragraph – the fact that the menu structures are so radically different between R on
macOS and R on Windows. Because R is an open source product and different teams are
responsible for different elements, the macOS GUI and Windows GUI can seem radically
different even though the same code works equally well on both. It’s one of those things
that makes it difficult to teach with R since your students may have different screens
depending on their operating system. Further complicating matters is that R on Linux
also has a GUI that can be invoked with the R -g Tk command, which doesn’t have the
same wording either – it doesn’t even have an Edit menu! (See Figure 1-18).
Hello World
We’re nearly at the end of the chapter, so it’s time for a Hello World program, a time-
honored tradition since 1974 with another monoalphabetic programming language, C.
Admittedly, it’s not that much to look at:
print("Hello World")
If you find that a bit underwhelming, well… it is. As mentioned though, R can do
some fancy graphics work, such as the timeline earlier in the chapter. So let’s jazz it up a
very little bit. Try this code, executed in Figure 1-19:
install.packages("BlockMessage")
library("BlockMessage")
blockMessage("Hello World")⁵
As you can see, that’s a little fancier. The blockMessage() function actually has quite
a few customizable commands. To get help on it or any other function in R, all you need
to do is place a ? in front of the function and press Enter. The help should automatically
launch and take you to information on that command. Try it by typing ?blockMessage in
your R window after running the preceding code (See Figure 1-20).
⁵ 1-3.blockmessage.r in this book’s code package.
Wrapping It All Up
We’ve covered a lot of ground in this chapter, from a background on what R is and what
it is not, to a brief history of R, to getting R up and running on your computer. The rest of
this book is dedicated to putting R to good use now that it’s there. Here’s a quick preview
of what we’ll be covering:
2 – Feed the Beast: Getting Data into R
We’ll talk about how to get your data into R, how it’s stored, and how to hook R into dynamic data sources such as a database. We’ll also talk about basic web scraping – getting a website’s data into R!

3 – Project 1: Launching, Analyzing, and Reporting a Survey Using R and LimeSurvey
Ever have to give a survey out as part of your job? Want something a bit more in depth than the reports SurveyMonkey or another platform can provide to you? Want something… FREE? Then this is the chapter for you.

4 – Project 2: Advanced Statistical Analysis Using R and MouselabWEB
Sometimes you need to know what grabs someone’s attention, and in this chapter, we’ll use another free tool to see how people behave when searching for the best product on a website!

5 – R in Everyday Life
So maybe you’re not a scientist or market researcher. Maybe you just want R to help you out day by day. In this chapter, we talk about using R to automate data formatting, reporting, and more.

6 – Project 3: The R Form Mailer
Mail Merge, the ability in Microsoft Office to send out customized emails or letters using a spreadsheet, is a real timesaver. What if you could do that without the hassle of pointing and clicking? What if you could script it? Now you can!

7 – Project 4: The R Powered Presentation
We know that R can create graphics, it can create text, and it can crunch numbers. What if we put those all together with audience participation? Get ready for the Best. Presentation. Ever.

8 – R Anywhere
It’s annoying that R can only run on your computer. What if R could run in the cloud? Then you could access it anywhere. That’s what we’ll explore in this chapter.

9 – Project 5: The Change Alert!
The world is full of change, but we don’t always feel like we know about it far enough in advance. What if you had a script searching for changes in reports or web pages and notifying you nearly instantly when they occurred? In Chapter 9, we’ll cover exactly that.

10 – Project 6: The R Personal Assistant
So R can do a lot for us, but the best personal assistants do everything behind the scenes. In this final project, we’ll have R prepare a daily report for us and then find ways in which our other technological servants, such as an Amazon Echo, can deliver this vital information to us at our command.
It’s time you lived your life a bit more in tune with that favored swashbuckling stats
package, R. I’m glad you’ve decided to join me on our voyage!
CHAPTER 2
• Explain the different types of data that R can work with and how that
data is stored
We’ll begin by talking about how data is stored in R and the types of data that R can
automatically classify and use appropriately.
¹ https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/
Chapter 2 Feed the Beast: Getting Data into R
x <- 2
y <- 2L
x
y
str(x)
str(y)
You'll notice something interesting here: you stored the same value twice – the output on Lines 3 and 4 is the same, [1] 2 – despite the fact that you put the letter L after the number on Line 2. What's going on? Well, that L told R to store the same value, 2, as an integer instead of a floating-point number. The str() commands confirm that, showing x to be num and y to be int. Aside from integers and numbers, what other data types does R support?
Atomic vector. Atomic vectors contain a one-dimensional collection of items. Example: vector("logical", length=3) will produce a vector with three items, all defaulting to FALSE. Notes: The c() combine function can add items to vectors. R also supports missing data in vectors, represented by the term NA, which functions like is.na() or anyNA() can check for. Additionally, one might find NaN if a mathematical operation produces a value that is not a number.

List. Sometimes referred to as a generic vector, lists are more flexible and can have mixed types of data. Example: x <- list(grp1 = "name", grp2 = 1:5, grp3 = FALSE) will create a list of three named items, each with their own lists inside. Notes: Many different commands export their results as lists. This means that you can "peel" off different parts of a result. Using the example here, one could type x$grp1 and get "name" in return.

Matrix. Matrices are an extension of vectors or lists, just in two dimensions. Example: x <- matrix(nrow = 3, ncol = 2). Notes: You can modify individual elements within a matrix by reassigning them. For example, x[2,2] <- 3 will put 3 into the second row, second column of the matrix you created earlier.

Data frame. The powerhouse of data structures. Most large datasets you work with will be this type. Data frames can have named columns and nested structures and be easily modified and queried. Example: x <- data.frame(idnum = 1:26, alpha = letters[1:26]). Notes: You can reference specific columns in a data frame (e.g., x$idnum), rows (x[1,]), or exact cells (x[1,2]). You can also use the head() and tail() commands to view the first and last six items in the data frame.

Factor. A collection of nominal values – labels without quantity, if you will. Example: gender <- factor(c("male","male","female","female")). Notes: Factors work a bit differently than lists because R keeps track of the levels of a factor. So removing a value (such as removing the last two "female" values and replacing with NA) will not modify the levels of the factor. Additionally, using the relevel() command will change the reference level for the factor. In the example here, female is the reference level; relevel(gender, "male") would make male the reference level.
Now that we have a bit of an understanding of the types of data R can store, let’s
actually get some in there. And we’ll start with a dataset already included!
Several other datasets are included, but I think the five mentioned earlier give us
some things to play around with. First, let’s explore one of them – USArrests. To see the
entire dataset, type USArrests. If you’re like me, you’ll run out of screen space with the
list that fills your R console (See Figure 2-1).
This might be a bit overwhelming to see, so perhaps we should try to only look at
the top or the bottom. Being alphabetically challenged myself, I’ll choose the last six
items with the tail(USArrests) command. Looking at it, I’m kind of glad I didn’t live in
Virginia in 1973. But is that 8.5 really bad compared to the murder rate nationally in that
year? I can use the max(USArrests$Murder) command to get the maximum number and
mean(USArrests$Murder) to get the arithmetic average (See Figure 2-2).
Figure 2-2. The Last Six Lines, Plus the Maximum and Mean
8.5 seems higher than average (7.778), but is far lower than the maximum. This
might get me wondering what the distribution of murder rates was. Perhaps I should
look at a histogram with hist(USArrests$Murder). See Figure 2-3.
Hmm… 8.5 does seem to be a bit on the higher end of that graph now that I see it.
And wow… Vermont must really have been safe with only 2.2!
While we’re here, it’s also useful to point out that the functions above all take
several arguments to customize them. If we wanted to customize the X or Y axis
labels in that histogram, we could do so easily – try this command on your own:
hist(USArrests$Murder,xlab="Murder Rate").
Now that we’ve played around with that data, let’s look at the two time series datasets
we have from earlier – presidents and AirPassengers (See Figure 2-4). Doing a quick
str(presidents) command will tell us that this is a data type Time-Series – that wasn’t
one that I had mentioned earlier, now is it?
And here we find the real power of R – complex real-world data structures. In this
case, time series is a data structure that can accommodate data on a given time interval,
whether they be months or years.
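Time-series objects are easy to build by hand as well. As a quick illustration (the numbers here are invented for the example):

```r
# Build a monthly time series of 24 made-up values starting January 2000.
sales <- ts(100 + 1:24, start = c(2000, 1), frequency = 12)
str(sales)                                            # a Time-Series, like presidents
window(sales, start = c(2000, 6), end = c(2000, 12))  # slice out June-December 2000
```

Note that window() slices by date rather than by index – the time-series analog of subsetting a data frame by rows.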
Let's finish our walk-through of datasets with a pressing question about the ChickWeight dataset: namely, which diet produces the fattest chicks? First, let's look at
what the ChickWeight dataset is, structurally (See Figure 2-5).
We see that we have 4 variables and 578 observations. Chick is an ordinal factor,
likely the ID number assigned to the chick. Diet is the diet version that the chick
received. Time is the number of days old, and weight is, well, the weight of the chick. I
wonder if we have an equal number of observations for each diet? I’ll take a look with the
table(ChickWeight$Diet) command (See Figure 2-6).
Looks like we have a few more chicks on Diet 1 vs. Diets 2–4, but we have over 30 in
each group, so we could assume normality… probably. Just to be on the safe side, let’s
look at the distributions with a simple plot: interaction.plot(ChickWeight$Time,
ChickWeight$Diet,ChickWeight$weight) (See Figure 2-7).
The plot shows me that it certainly looks like Diet 3 produces the heaviest chicks at
21 days. I wonder if that's true, and by true, I mean based upon inferential statistics. Let's take a look at a linear model predicting chick weight from diet.
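A minimal sketch of such a model, assuming weight is simply regressed on Diet (the book's exact call may differ):

```r
# Regress chick weight on diet; Diet 1 is the baseline (reference) level.
chickmodel <- lm(weight ~ Diet, data = ChickWeight)
summary(chickmodel)  # the Diet2-Diet4 rows are contrasts against Diet 1
```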
This returns our linear model output shown here (Figure 2-8).
For those of you who haven’t taken stats in a while, I’ll walk you through the output:
• Diets 2 and 4 are not significantly different from the baseline Diet 1,
but Diet 3 is marginally worse. This seems at odds with what we saw
in the plot where Diet 3 seemed much better.
Linear models compare things relative to a baseline level – in this case Diet 1. What if
we were to make the baseline level Diet 3? Would we see that all diets were significantly
worse? Remember that relevel command I talked about earlier? Let’s modify our code a
bit and see what happens (See Figure 2-9).
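A sketch of that releveling step, again assuming the simple model form above; working on a copy avoids altering the built-in dataset:

```r
# Make Diet 3 the reference level, then refit the model.
cw <- ChickWeight
cw$Diet <- relevel(cw$Diet, ref = "3")
summary(lm(weight ~ Diet, data = cw))  # contrasts are now against Diet 3
```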
Ah, just as I suspected: with the comparison against Diet 3, the other three diets are significantly worse than Diet 3. We have our winning diet to produce fat chicks!2
One important thing to note about the last section on linear models – it’s included
here as an example of how you can easily run a complex statistical test within R in just a
few lines. We’ll use linear regression throughout the book a few more times, and if you’re
not familiar with it, I highly suggest background reading to understand the assumptions
that are made when interpreting a model. An excellent primer, if you’re not familiar
with regression diagnostics and assumptions, can be found at www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/. Throughout this book I skip some of these
steps in order to move us to the findings a little quicker; however, if you are planning
to publish your results in a scientific journal, be sure to check your diagnostics and
assumptions before reporting a completed model.
As you can see, working with data in R is pretty straightforward, but so far our data
has either been typed in or it’s been pre-loaded into R. How do we get larger datasets?
That’s what we’ll cover in the next section!
2. In all of the years I've written things, I never thought I'd get to write that last sentence and have it be completely accurate and non-pejorative.
I’ll take each one of these areas and give you some of my favorite methods; however,
it’s important to recognize that at the end of the day, all we’re doing is piping data from
one place to another, and that can take various forms. These are some that I believe are
most intuitive, but you’ll likely see others as you explore R further.
OurData <- "
Student Pretest Posttest
A 25 27
B 23 23
C 21 22
D 23 29
E 23 24
F 21 19
"
Data = read.table(textConnection(OurData),header=T)
t.test(Data$Pretest,Data$Posttest,paired=T)

3. 1-1.textConnection.r
I refer to this method as the textConnection method, since it uses that built-in
function to take the data out of the first several lines and create a data frame. I really like
this method for teaching as it allows me to show off data directly in R, without having
to open Excel or another application to show it there before running the statistics on it.
And before we move on, for the stats nerds and geeks out there, the paired t-test we just ran indicates that the students did not do so well – no improvement between pretest and posttest. Want to lend them a hand? Change Student F's posttest grade from 19 to 25. Now the p-value drops to 0.052. Not great, and an example of p-hacking, but since this is fake data, we can do what we want!
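The tweak takes only a couple of lines (the data is the same fake pretest/posttest set from above):

```r
OurData <- "
Student Pretest Posttest
A 25 27
B 23 23
C 21 22
D 23 29
E 23 24
F 21 19
"
Data <- read.table(textConnection(OurData), header = TRUE)
Data$Posttest[Data$Student == "F"] <- 25            # the generous regrade
t.test(Data$Pretest, Data$Posttest, paired = TRUE)  # p-value is now about 0.052
```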
What about larger datasets though? Something I can’t have all in one file? Let’s work
with that next!
Once I got this extremely long named file on my computer, I had to store it somewhere
that I can access with R. This typically isn’t a problem unless you’re running on an operating
system that sandboxes files to prevent one application from modifying another. In my case,
I put it in my Downloads folder, which on my computer is at /Users/jon/Downloads. The
exact command I used to read this into R and store it in an object named data was
data <- read.csv("/Users/jon/Downloads/userssharedsdfratebrthsyaw1819raceethncty20002012.csv");
And being that I’m on macOS Catalina, I received the following warning message
(See Figure 2-11).
Note If you’re downloading a particularly large file, you may want to use this
version of the code that downloads the file first – also useful if you want to keep a
copy of the file for later:
download.file("https://inventory.data.gov/dataset/cedbc0ee-d679-4ebf-8b00-502dc0de5738/resource/ef734bd0-0aff-4687-9b8a-fc69b937be63/download/userssharedsdfratebrthsyaw1819raceethncty20002012.csv",
"data.csv", method="auto", quiet=FALSE)
data <- read.csv("data.csv");
We’ve now seen one of the most powerful and versatile ways that R can save you
time – it has full Internet access and can download files directly into its own memory.
Think about what this means:
• If your data is stored on a web server and updated regularly, your
R script can download a fresh copy each time, no need for you to
download it first.
• If you want to share your script with someone else, you only need to
send them the actual script, not the data file.
• If you use a service such as Google Sheets, you can publish a specific file
to CSV by going to File and then “Publish to the web” and then using the
URL provided in your script (remember that using “Publish to the web”
does allow anyone who has the URL to access the data; See Figure 2-12).
So far our data has been in plain text – all the CSV file format does is place commas
between each data value. We can actually tweak the read.csv() command’s big brother,
read.table(), with a ton of options to fit how our data is formatted, for example:
• header lets R know if the first line of the text should be treated as the
header. This defaults to true for read.csv() and false for read.table().
• sep lets R know what the separator character is – a comma in read.
csv() and a blank space in read.table().
• quote lets R know which characters should be treated as quote marks around values in the file.
• row.names and col.names are vectors that tell R what to call the rows
and columns, respectively.
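As a sketch of mixing those arguments, here is a made-up semicolon-separated file with no header row:

```r
# read.table() handles nonstandard separators and can supply column names.
txt <- "1;Alice;25
2;Bob;31
3;Carol;28"
people <- read.table(text = txt, sep = ";", header = FALSE,
                     col.names = c("id", "name", "age"))
people$age  # 25 31 28
```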
By mixing and matching arguments as needed, we can read many different plain
text formats into R. Data not in plain text? That’s a little trickier. The main culprit in this
space: Microsoft Excel.
install.packages("openxlsx")
library(openxlsx)
data <- read.xlsx("/Users/jon/Downloads/Births-to-young-adult-women_verified.no-chart.two-tabs-with-rates.xlsx")
Replacing the path with the appropriate path on your computer, you’ll get the data
into R, and the added benefit is that the openxlsx package can also be used to write data
back to Excel, as we’ll look at later. I still believe that plain text is easier and more flexible,
but I recognize that sometimes in the business world, you need to speak Excel!
And sometimes you also need to speak “Database”!
The tunnel created earlier routes the database server, which normally runs on port
3306 of the destination machine, to port 33306 on my local machine. I prefer to add an
extra digit or change the local port number just in case I want to have multiple tunnels
open at the same time – imagine, you could use R to copy data from one database on
Machine A to another database on Machine B, using a script that you could modify as
needed!
Now that we have the connection set up, let’s look at the following code:
install.packages("RMySQL");
library("RMySQL");
username <- "pr4";
password <- "pr4";
database <- "pr4-database";
dbconn <- dbConnect(MySQL(), user=username, password=password,
dbname=database, host="127.0.0.1", port=33306);
dbListTables(dbconn);
results <- dbSendQuery(dbconn, 'select * from secretstuff');
data = fetch(results, n=-1);
data
dbClearResult(results);
data = dbGetQuery(dbconn, "SELECT * FROM secretstuff");
You’ll notice that the first two lines install and then load the RMySQL package. Once
you’ve got this installed, you could easily remove the first line or comment it out using
a # character at the front. Lines 3, 4, and 5 are used to create variables the script will use
later to connect to the database. You’d replace my username, password, and database
name with the username, password, and database name in your scenario.
Line 6 is where the heavy lifting starts – it creates an object in R that represents the
database connection to the server. If you’ve got any of the information wrong, here’s
where you’ll get errors. For example, if my tunnel isn’t working right, I’ll get a “can’t
connect” error similar to the following one (See Figure 2-14).
Assuming you don’t have any errors, your database connection has been created
and stored in an object. I chose to name my object dbconn although you could name it
whatever you want. Next, I decide to test to see if the database connection is working
by listing the tables in the database using the dbListTables() command, and it
returns one table, the ominously named secretstuff (See Figure 2-15). Sounds like
an interesting table.
Let’s actually view what’s in that table. I’ve given you two examples of code that do
the exact same thing, in order to talk about why you might want to use one over the
other. Lines 8, 9, 10, and 11 do the following actions:
• Line 8 sends the SQL query to the server with dbSendQuery(), storing a result object.
• Line 9 uses fetch() to pull every row (n=-1 means "no limit") into a data frame named data.
• Line 10 prints the data frame so we can view it.
• Line 11 clears the result object with dbClearResult(), readying the connection for the next query.

4. 1-2.mysql.r
Now the more astute of you who are working ahead may have noticed that Line 12,
the innocent looking data = dbGetQuery(dbconn, "SELECT * FROM secretstuff");
actually did everything Lines 8, 9, and 11 did. So why go through the extra work?
It all has to do with how much data you’re bringing down and what you want to
do with it. If you have a large number of records, the first route is best as it will let you
read in various “chunks” of data into your data frames. The dbSendQuery() function is
also more flexible – you don’t have to use a SELECT statement, you could use an insert,
update, or drop statement to modify the database directly. In practical work though,
dbGetQuery() will speed you up by allowing you to write cleaner, more concise code.
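To make the chunked pattern concrete, here is a sketch that swaps in an in-memory SQLite database (via the DBI-compatible RSQLite package) so it runs without a MySQL server; the fetch loop itself is the same one you would use with RMySQL:

```r
library(DBI)

# Stand-in database: 1,000 rows in an in-memory SQLite table.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "secretstuff", data.frame(id = 1:1000, value = rnorm(1000)))

results <- dbSendQuery(con, "SELECT * FROM secretstuff")
rows <- 0
while (!dbHasCompleted(results)) {
  chunk <- dbFetch(results, n = 250)  # pull 250 rows at a time
  rows <- rows + nrow(chunk)          # process each chunk as it arrives
}
dbClearResult(results)
dbDisconnect(con)
rows  # 1000
```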
Speaking of cleaner and more concise code, it’s a good time to note that those two
things can often be at odds. You may have noticed that sometimes I place semicolons
at the end of my code statements and sometimes not. The reason for this is that R can
support putting multiple statements on the same line of code. The following two code
blocks are equally valid:
x <- 2
y <- 3
or
x <- 2; y <- 3
This seems very handy, and it is – for simple code declarations. But imagine code
that looks like this – again, equally valid:
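An illustrative stand-in for that kind of line (not the book's original example):

```r
# One densely packed line: build a data frame, filter it, and summarize it.
x <- data.frame(id = 1:10, score = c(5, 8, 2, 9, 7, 4, 6, 3, 10, 1)); y <- x[x$score > 5, ]; m <- mean(y$score); paste("Mean of", nrow(y), "high scores:", m)
```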
That line of code – yes, it’s a single line – will work fine. But it can be extremely
challenging to read and understand, especially when you’re new to R. Eventually I
believe every R user finds a comfortable level of code “conciseness,” where their code
is easy to read and also doesn’t span dozens of unnecessary lines. When starting out,
though, it is sometimes very useful to “unpack” code like this, placing it on multiple lines
and executing parts of it sequentially instead of all at one time, in order to troubleshoot
where errors might exist.
Now that we’ve looked at getting data out of a database, what about the situation
where a database isn’t an option – but the data is still online? Imagine a website that
has reports of data, but it’s all nicely formatted on HTML pages with pretty styling. How
would we get that data then? By scraping it – of course!
Figure 2-16. The College Navigator Entry for Delta State University
Pulling up my university in NCES, I can see that there are a ton of collapsed tables on
the page. Expanding the first, I see the number I want – total faculty. (See Figure 2-17).
Well, from here I can see the number – 157 – and if I wanted to, I could just paste
that number into my R script as I need it. But when that number changes, say if we hire
someone new, I’m going to have to update it. And that’s just going to add extra work for
me in the future. Figure out how to scrape it once, and unless NCES changes their design
layout, I should be good for a while. Also, imagine if I have 20 schools and I want to pull
this number for all 20 – much easier to just collect a series of URLs and feed them into
my R script vs. open each new page, expand the tab, and write down the number (also
much less prone to typos!).
The first step is to figure out how the data is formatted in HTML. Using the “Inspect
Element” command in my browser (Safari; other browsers have similar commands;
See Figure 2-18 and Figure 2-19), I can see that it's an HTML table cell (<td>) inside a CSS class named tabular.
Now I know where it is, and I can begin to build my script in R. This script pulls in
everything I need, producing the output below it:
install.packages("rvest");
library("rvest");
addr <- "https://nces.ed.gov/collegenavigator/?s=MS&pg=2&id=175616";
page <- read_html(addr);
nodes <- html_nodes(page,".tabular td");
totfaculty <- html_text(nodes)[2]
paste("The total number of faculty are",totfaculty);
And the output of the script looks like this (See Figure 2-20).
5. 1-3.scraping.example.r
First off, you’ll notice a ton of downloaded packages – remember that R packages can
rely on each other, and in this case, rvest uses a lot of other packages to get its job done.
The majority of the time these don’t cause any issue, but it’s important to remember
this if you ever have a script stop working – it might be that one of the packages that it
depends on isn’t working right or loading properly.
Next, you’ll see a line declaring the variable addr which is the URL of my institution
in NCES. It’s the address I’d paste into my browser’s address bar in order to view it. Next
are the page and nodes variables, the first of which downloads the HTML code from the
page and the second breaks it down into components. We can see what these look like by
typing their name (See Figure 2-21).
By examining the nodes variable, I can see that the line I want is in element 2 of
the node. From there, all I need to do is use the html_text() function to clean it up
(by stripping off the HTML code) and store it in a variable, totfaculty. I can then do
whatever I want with it, and in this case, I’m pasting it into my output, letting the person
running the script know what the value was.
Obviously, I can expand this example however I like. I could run the code on
multiple page addresses and download the total faculty for a variety of other schools and
then build a table of them. I could also further explore the page and grab other elements,
like the University's name.
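Here is an offline sketch of that idea. The HTML snippet below is a stand-in shaped roughly like the NCES page, and .headerlg is an assumed class name; inspect the real page to find the actual selector, just as we did for the faculty count:

```r
library(rvest)

# Parse a literal HTML snippet instead of fetching the live page.
html <- '<div><span class="headerlg">Delta State University</span>
<table class="tabular"><tr><td>Total faculty</td><td>157</td></tr></table></div>'
page <- read_html(html)
schoolname <- html_text(html_nodes(page, ".headerlg"))
paste("Scraping data for", schoolname)  # "Scraping data for Delta State University"
```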
Once you start scraping, you can really get carried away with all the data you can
grab. Combine that with reading from databases, flat files, and scripts themselves, and
you can quickly build up a little data arsenal in your R console. So once it’s all there, what
are you going to do with it? Probably at some point you’ll want to save it to use later. Let’s
talk about that in the next section.
Saving R Data
As you work with R, you create a bunch of objects. You can see those objects anytime by
using the ls() command. After the previous examples, these were the objects left in my
workspace (See Figure 2-22).
Some of those objects are pretty small – just a username or an HTML page. But
others are quite large – data, for example, is a data frame from my secretstuff table. If
I close R, I’ll get a message asking what I want to do with the items on my workspace,
specifically asking if I want to save my workspace “image” (See Figure 2-23).
What exactly does this mean? Well, it means that the next time I open R, I’ll see a
message that my Workspace and History have been restored. Now if I type ls() again,
everything is just as it was when I left (See Figure 2-24).
This is great, as long as I’m still working on the same things that I was working on
before. But what if I want to work on a different set of data, or files, or projects? Then I’ll
need to work with my “workspace” a bit more by directly saving, clearing, and restoring
it. I can use the following commands:
getwd() and setwd(). Gets or sets the working directory – the path on your computer where R will look for files to read and will write files if not given a full directory path. Example: getwd(); setwd("/Users/jon/Downloads")

history(). Views previous command history. Example: history(max.show = Inf)

savehistory() and loadhistory(). Saves and loads history from a file. Example: savehistory("myhistoryfile")

save.image(). Saves everything in your workspace (e.g., all objects) to an image. Example: save.image("mywork.RData")

load(). Loads an image into R's workspace. Example: load("mywork.RData")
You’ll notice that, sometimes, individuals distribute their RData file vs. another format
because it provides a full view of the workspace – which can be helpful if you have multiple
datasets loaded or specially written functions to analyze data. If you don’t want to write
everything to an RData format, you can also use saveRDS() to save a single item, such as
saveRDS(data, file = "thedata.rds"), reading it in with readRDS("thedata.rds").
Thus, to save and distribute your data, all you might need to provide to someone is the
RData and R script file, which typically has the ending .r as we’ve seen thus far.
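A quick round trip shows how little there is to it (tempfile() keeps the demo from cluttering your working directory):

```r
# Save one object to an .rds file, read it back, and confirm nothing changed.
scores <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
path <- tempfile(fileext = ".rds")
saveRDS(scores, file = path)
restored <- readRDS(path)
identical(scores, restored)  # TRUE
```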
If your friend isn’t an R user though, and you’ve been unable to convert her to the
language yet, then I suppose you can also save in other applications.
We’ve come a long way in this chapter – from typing in all of our R data to bringing it
in through web scraping and databases. We’re now ready for our first project – launching
a web-based survey, getting the data into R, analyzing it, and becoming a sales or
marketing superstar!
CHAPTER 3
Project 1: Launching,
Analyzing, and Reporting
a Survey Using R
and LimeSurvey
When I was a graduate student, a former student who had gone into industry came back
to visit us a year post-graduation. He was working for a company in marketing, a rare
departure from the bulk of our comrades who had gone into academia. Talking with
him, I asked what the biggest difference was. He smiled and told me of a meeting he’d
recently been in with the sales team. They were very happy that sales were up that month
compared to last month. He asked “What is the standard deviation of monthly sales?”
They looked at him perplexed, until one of them asked “Is that important?”
Anyone who has taken a statistics course knows the answer – the actual average
is hard to place without the standard deviation, which is an estimate of how far any
given score is typically located away from the mean. If Month A has you earning
$500 in sales, and Month B has you at $600, that sounds good. Find out that the
standard deviation is $125, and all of a sudden there really hasn’t been too big of a
difference between A and B.
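The anecdote is easy to put into numbers. With some made-up monthly sales figures:

```r
monthly <- c(480, 650, 390, 540, 700, 430, 560, 610, 350, 590, 520, 670)
mean(monthly)              # around $540 per month
sd(monthly)                # around $110 of typical month-to-month spread
(600 - 500) / sd(monthly)  # a $100 jump is less than one standard deviation
```

In other words, a swing the size of the Month A to Month B difference happens routinely by chance alone.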
In many ways, businesses thrive on data; however, getting it and interpreting it
can be a challenging task. In this chapter, I’m going to give you a real-world scenario:
a market research project. I’ll show you how to collect the data, analyze the data, and
report the data, all using open source software. Along the way, I’ll give a bit of a dose
on survey research and data analysis that, while not intended to replace research
methodology courses or statistics courses, does come from a professor who regularly
teaches them. My goal is to give you enough to start exploring, and if you find that you
need more information, I can give you some references on that as well! Let’s get started.
Introducing LimeSurvey
Almost 2 decades ago, in 2003, an open source software project named PHPSurveyor
was first released by programmer Jason Cleeland on SourceForge.net. In a time before
widespread adoption of platforms such as Qualtrics (est. 2002) and SurveyMonkey
(est. 1999), most of the survey data that was collected on the Internet came from
custom-written software, likely HTML Forms pages that had been strung together and
coded to write to a database back end. I recall a programming project I was hired to do
in 2005 that was simply that – a survey. Today with SurveyMonkey having a free tier and
Google Forms an option, it’s easy to get basic data from large groups of people with low
cost; however, features such as conditional branching, modifiable templates, and closed
access surveys (e.g., invite-only) can be difficult to achieve without paying a premium –
unless you know about PHPSurveyor’s current name: LimeSurvey.
LimeSurvey offers a number of premium features that make it an essential tool in any
researcher-on-a-budget’s toolkit. These include
• The ability to fully customize the look and feel of the survey using
pre-defined templates.
• A panel integration feature that lets you keep and reuse a group of
respondents allowing you to send out multiple surveys and track
responses across all of them. Essentially, you could use this to build a
robust user profile of your customers or constituents.
All of this is offered free through the self-hosted Community Edition of LimeSurvey.
Those who find the following steps a bit daunting and would rather pay for a hosted
service can look into LimeSurvey Pro, which has various tiers available depending on
how many responses you plan to collect.
Assuming you’re a bit tech-savvy and you want to run LimeSurvey on your own
server or cloud storage provider, I’ll outline the steps as follows to get you up and
running. Then, we’ll build a sample survey.
If you already have a server or cloud storage bucket that you can install the software
into, you can download the latest stable version in whatever format you prefer. I tend to
simply use wget on Linux to download the latest stable gzipped version, which I can then
unpack (Figure 3-3).
Now we need to talk a little bit about server options. LimeSurvey runs as a LAMP
application, with LAMP standing for Linux, Apache, MySQL (or MariaDB), and PHP. It’s
a common acronym for an extremely popular combination of software that allows you to
run the software on a free operating system (Linux), through a free web server (Apache)
with a free database (MySQL), and free programming platform (PHP). There are several
dozen flavors of LAMP, with many having their favorite combination. You can learn
more by looking at the LAMP Wikipedia entry (Figure 3-4) and find links to install LAMP
server on a variety of platforms at the bottom (Figure 3-5).
If you are more familiar with cloud applications, you can install LimeSurvey on a
cloud platform such as Amazon Web Services or Microsoft’s Azure. You’ll need a server
instance to house the files and serve them and a database to write to. Various blog posts
and tutorials exist that can walk you through your platform and installation.
If you simply want to test out LimeSurvey on your own computer, I’ve used XAMPP
(Figure 3-6) for many years on Windows and Mac computers. It’s a simple one-stop
download that will get your computer set up with the essential elements quickly. It’s
meant for development work and is a great place to try out LimeSurvey before buying a
virtual server or cloud computing service. In a small environment, such as a corporate
office, you might even be able to use it to run surveys on a secured intranet, within your
own work group.
However you decide to house or serve your files, you’ll run through a pretty basic
installation process. Once the files are unpacked, navigating to them (by going to the
web address associated with where you placed them, e.g., http://localhost) will
launch the LimeSurvey installation. It will ask you a series of questions regarding how to
configure LimeSurvey, as well as connection details for your database. Once you’ve got
your system up and running, you can navigate to the LimeSurvey login page at /admin
(Figure 3-7), logging in with the username and password you set during installation
(Figure 3-8).
It’s important to note that earlier versions of LimeSurvey do differ slightly in their
layout, and local administrators can customize this screen as well. In Figure 3-9, we see
a slightly earlier version that you’ll see is subtly different than the version in Figure 3-7.
Regardless of version, the same basic operations exist. They just might be slightly bigger
or smaller, depending on browser size and customizations.
Figure 3-9. The LimeSurvey Admin Homepage of an Earlier Version, Showing the
Create Survey Button
Once you’ve logged in, it’s time to start building your survey. You can do this by
clicking the “Create Survey” button that we saw in Figures 3-7 and 3-9. You’ll be taken to
the Create New Survey screen (Figure 3-10).
I’ve filled in the basic information required to create a new survey in Figure 3-10,
and once I hit Save, I’ll get a screen confirming that the new survey has been saved in the
system (Figure 3-11).
You’ll notice on the Saved New Survey screen that you have all of the basic information you entered, as well as a few new entries. Most notable is the survey URL that is listed. This is the address you’ll give to people whom you’d like to take your survey.
You’ll see that it indicates the language the survey is written in. LimeSurvey allows you to
have multiple languages available for any survey, which can be useful if you work with a
bilingual population or have customers around the world.
At the top left, you’ll see a slider button labeled “Settings/Structure” (Figure 3-12).
The Settings options are visible in Figure 3-11; they include
The second section of the menu contains settings related to the specific items in the survey, as opposed to survey-wide settings. These include
• And finally, any plugins you’ve installed that modify the survey.
There is a lot to explore in LimeSurvey, and for our demonstration, we’ll simply add
a few question groups and questions. We’ll then launch the survey and collect responses.
I’d encourage you to explore the options yourself, as there are tons of features that are
very specialized but might be exactly what you need for your scenario.
To get started on our survey, I’m going to click the Settings/Structure slider to
Structure, and the screen changes to what you see in Figure 3-13.
You’ll notice an Add question group button at the top. Questions must live within
specific question groups, which can be useful to help break up the survey into logical
parts. I’m going to click Add question group and fill out the information on the screen in
Figure 3-14.
Filling out the required information, I’ve clicked “Save and add question” in the
upper right. This takes me to the screen in Figure 3-15, which lets me enter information
in for my first question. In this case, I’d like to ask about the person’s gender.
Once I hit “Select”, the Add a new question screen changes to have the gender option
preview (Figure 3-17).
Hitting “Save”, I get an overview screen that shows the question information
(Figure 3-18). It also has a series of Preview buttons at the top – one to preview the
survey, one to preview the question group, and one to preview the specific question. I’ve
clicked “Preview Survey” and see the result in Figure 3-19.
OK, now we’re making progress! I have my first question group and my first question
inside of it. Let’s add a few more questions. I’ll start by choosing Add a new question
again and then creating an Age question (Figure 3-20). I’ll then change the question type
to Numerical input, which will require the user to enter a number (Figure 3-21). Finally
I’ll make sure people answer my question by changing the slider for Mandatory to Yes
(Figure 3-22).
Once we’ve collected these two pieces, we have some demographics that might be
useful in our sales operation. However, what if we have some very targeted questions
that might not apply to everyone? We can use a Multiple choice question type that
lets individuals choose all of the options that apply to them. We start by adding a new
question and changing the question type to Multiple choice (Figure 3-23).
Filling in the required information and saving, you’ll see a saved question screen as
follows. But notice, in Figure 3-24, that it shows a warning telling us that we’ll need to
add subquestions. Clicking that warning takes us to the screen in Figure 3-25 where we
can add our subquestions or answer options.
This screen lets us create a short code to represent each subquestion and then enter
the text that we’d like. If you’ve got one or two options, it’s not too big of a deal to add
them manually by clicking the green plus arrow next to the last option to insert a new
one. I’ve done that twice to get the screen to look like it does in Figure 3-26.
Once I save that, it’s time to add the final question – one that asks the survey
respondent to rate our new product. I’m going to add a new question group just for this
purpose and add a new question inside of it. The question type is a list, under Single
Choice Questions. Similar to the Multiple choice question type, once I save it, I’m going
to have to add responses, in this case, a number between 1 and 10. I could go through
and hit the green plus arrow nine times to get all ten options entered, but instead I’ll use
the Quick Add Label screen shown in Figure 3-27. By entering the code for each label
and the label itself (the same thing for this question), I can quickly create the question’s
options without a lot of clicking. Figure 3-28 shows the answer options screen after
I chose “Replace” in the previous figure. Figure 3-29 shows me the completed rating
question.
I now have my completed survey! If you are following along and want to check yours against mine, you can import my completed survey by creating a new survey in LimeSurvey, choosing the Import tab, and importing the file limesurvey_survey_167537.lss that’s included in the code package for this chapter.
Choosing to preview the survey, I see the screen shown in Figure 3-30. Note that it’s
warning me that the survey is not yet active. As an administrator, I can go through and
test out everything in the survey, but no data will actually be saved. That’s fine for now;
let’s just make sure the survey looks OK.
OK, we’re past the first page, let’s look at the second. It’s shown in Figure 3-31.
I wonder what happens if I don’t fill out that required age question.
Turns out that missing that red asterisk that shows me the question is required does
have repercussions. If I submit the page without anything in that box, I get the error in
Figure 3-32. Going back to the page, I now have a warning text on the question, helping
me see what I missed (see Figure 3-33).
Figure 3-33. The Warning Text visible after not submitting a required question
Moving along, I can see the last section of the survey in Figure 3-34 and finally the
end message, along with another warning that the survey is not active in Figure 3-35.
If I’m happy with the survey, it’s time to activate! Activating the survey by clicking
the green “Activate this survey” button (shown in Figure 3-36) does a few things. First, as
noted on the screen (see Figure 3-37), it prevents you from adding or deleting question
groups, questions, or subquestions. This is because once you activate, a new database
table is created for your survey, and the structure of that table depends on how many
questions are defined. You also will see options for saving additional information. In
practice, I typically do not anonymize responses, and I turn everything else to “Yes” – the
more information I have, the easier it is to troubleshoot later if something goes wrong.
The timing information can also be useful to view later if you suspect something strange
is happening on your respondent’s end. Perhaps they opened the survey and then left for
several minutes? Or they finished a 50-item survey in 1 minute. That might let you know
it’s not completely honest data!
Choosing “Save & activate survey”, you’re next given a page (Figure 3-38) that asks if
you want to switch to closed access mode. In open access mode, the default, anyone with
the web link to the survey can take it. This is especially useful if you want to blast it out
over social media or get many people to take it that might not be known to you already.
However, if you need to track who is taking your survey, then you’ll want the closed
access mode. In this mode, a survey respondent must have a “token” or access code that
will let them into the survey. You can track which tokens have been used, you can modify
them to expire at a certain time, and you can create new ones as needed. If you use the
built-in respondent functions in LimeSurvey, you can even have LimeSurvey create
tokens and email them to your participants directly from the application. In our case, we
won’t enable this feature – but it might be something you’d want to look at later!
Now that you’ve activated your survey, you can go to the survey link and see a similar
screen to before (Figure 3-39) but without the warnings.
Once someone takes your survey, you will find the tallies in the Response summary
screen start to fill up (Figure 3-40). However, even with my taking the survey five times
(Figure 3-41), I suspect that won’t be enough data for us. I’ve gone into the data file (that
you can view in our book’s data download) and added more fictional data, so we have
something to analyze later.
Figure 3-41. Adding a Few More Responses – but They Still Won’t Be Enough!
Once all of your data is collected, you’ll want to see what you’ve found. There are two
ways to do this, broadly, in LimeSurvey. The first is to use the built-in statistics export
options (Figure 3-42) which allow you to view your statistics on the Web (Figure 3-43)
or in a PDF report that you can download (Figures 3-44, 3-45, 3-46, and 3-47). You’ll
see that LimeSurvey can visualize and display the data not only in table format but also
graphically. This can be very nice to use when presenting your data to a team. It’s also
live data, so if the survey is still open, you could use this screen to grab a daily snapshot
to see if sentiment changes over time.
You can also view the raw responses in LimeSurvey directly using the Survey
responses option under the Responses and statistics menu. If you’ve got a small dataset,
this can be very useful (e.g., perhaps during testing) to see if there are any anomalies in
data recording.
However, if we’re going to really get to the heart of the data, we’re going to need the
data in a format that’s easy for R to work with. For that, we’ll use the Export menu off of
the Responses and statistics area (Figure 3-49). Two of the built-in options (Figure 3-50)
reference R. The first (Figure 3-51) is the R Syntax file – this will have LimeSurvey
generate an R script that correctly loads the data in and then maps the responses onto
the question codes. You can customize how it does this mapping using the “Headings”
options on the right of Figure 3-50.
Finally, after you’ve customized and downloaded the R syntax file, you’ll need to download the R data file. This export is actually very similar to the CSV option that LimeSurvey provides; however, it formats the column names and codes to be easier for R to parse.
On your computer, after export, you should have two files – the R script (with .R as
the file extension) and the data (with .csv as the extension).
At this point, we could fire up R, load in the script, and have it load our data.
However, you’ll quickly notice that the basic R application on Windows, Mac, and
Linux leaves a lot to be desired, as I mentioned earlier. Now is a great time to introduce
RStudio – a product I briefly mentioned in Chapter 1. RStudio is an open source
integrated development environment (IDE) for R. Installing it after you’ve already
installed R on a machine allows you to have a much more comfortable place to work,
placing many of your regularly used options and commands in an easy-to-find location.
In the next section, I’ll walk you through downloading and installing RStudio and
creating a new project that you can use to house the data files from the preceding
example. For the remainder of the book, whenever we’re working in R, we’ll be working
in RStudio. This will also help us in areas of cross-platform compatibility, since the menu
options won’t be radically different!
As I’ve mentioned earlier, when I began working with R, there weren’t too many options
for how you ran R. The basic R clients for Windows, Mac, and Linux were available,
and some third-party IDEs could integrate with them, but none were too spectacular in
their implementation. Some, such as R Commander, tried to emulate a point-and-click
interface that would let you run statistical tests by selecting the appropriate GUI menu
command. Others, such as JGR (Java GUI for R, pronounced “Jaguar”), tried to create a
prettier interface by adding functionality that the core interface lacked. It wasn’t until
RStudio was released that we had an open source, cohesive, integrated development
environment that was designed with R specifically in mind. Introducing support for
basic necessities, such as projects that you can switch between easily, and advanced features such as integration with the report-generation package knitr, RStudio is the IDE that I use
the majority of the time. Let’s go ahead and download it by first heading to the RStudio
homepage at https://rstudio.com (Figure 3-54).
The homepage can be a bit overwhelming, but you’ll want to find the Download
option in the upper right to be taken to the downloads for RStudio Desktop. We’ll cover
some of the other RStudio software offerings, such as RStudio Server and Shiny, later in the book. Once you choose Download, you’ll be taken to a screen (Figures 3-55, 3-56,
and 3-57) to choose between RStudio Desktop and RStudio Server. Click “Download” for
RStudio Desktop Open Source License, which is free.
Next, you’ll see several download options for various operating systems. It’s
important to note that R must already be installed on your computer before you install
RStudio. Choose the version for your operating system, download, and install it like any
other program. See Figure 3-58.
Figure 3-59. The New Project Option from the File Menu
Figure 3-63. The Files Area after Moving the Data and Syntax File into the
Directory
Clicking the R syntax file that LimeSurvey wrote, you’ll see it open in a color-coded
script window (Figure 3-64) and the entire screen resizes to put the R syntax file in the
upper left, moving the console window to the lower left instead of the entire left-hand
pane (Figure 3-65). You can, of course, change these layouts if you prefer them to be
different.
Now that we have the script open, let’s click the “Source” button in the upper right
(Figure 3-66). This will tell R to run the entire R script. You could also select all of the lines of code and press Cmd+Enter on a Mac or Ctrl+Enter on a PC to accomplish the same task.
Once the script has been executed, the data is loaded into its own variable, aptly named data, although you could modify the script to call it whatever you like. Once
you’ve loaded the data, you can view it by typing data to show everything or head(data)
to show the first six lines (see Figure 3-67).
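For example, with a made-up data frame standing in for the survey data (the columns here are invented purely for illustration):

```r
# A stand-in for the survey data; columns are invented for illustration
data <- data.frame(id = 1:10, age = seq(20, 65, by = 5))

head(data)        # prints the first six rows
head(data, n = 3) # or however many rows you like
```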
We now have the data loaded into RStudio, and we can work with it in the confines
of our project. If you were to exit RStudio, it would ask if you’d like to save the workspace,
similar to how R normally works; however, now that becomes a bit more useful since you
can easily see what’s sitting in that workspace in the “Environment” window in the upper
right. I’d encourage you to explore the RStudio interface a bit more now to understand
how it’s laid out. Once you’ve gotten comfortable, move on to the next section where
we will mine our data a tad, talk about how to use some of RStudio’s features, and
understand exactly what predicts our customer’s satisfaction!
how the data fits. The data here is generated for the following examples, and in the real
world, it’s likely you’ll have many more variables to consider as you plan your analysis.
Most high school math sequences discuss, at some point, probabilities, means, and
standard deviations. Those, along with frequencies, give you an idea of what possible
relationships you might have in your data.
In this section, we’re going to assume that you’ve already reviewed your data and
that you’ve decided to use an advanced technique, linear regression. In Chapter 2, I
explained that there are also a host of diagnostics and assumptions you’ll want to check
before you use such a tool. My point in using it here is not to suggest that it’s a cure-all
for any statistical analysis – it’s merely to show it in an applied context, giving you one
possible way to analyze the data. With those caveats mentioned, I’ll give you a quick
version of what linear regression is in this section and how it works. For now, know that
the output you see in Figure 3-68 will have some meaning by the end of this chapter.
Avoid the temptation to freak out at the numbers, and trust me!
Before we can get into how we run and read the output in Figure 3-68, we need to
do a bit of data housekeeping. The first thing I recommend doing is taking a look at the
R syntax script to learn some useful interface tricks in RStudio. If you look at the top of
the script, on Line 1, you’ll see the code that loads the survey data file into a variable
named data. One thing I like to do is add a setwd() command at the top – this will make
sure that R is always looking in the directory we intend it to look. While RStudio defaults
your working directory to the directory your project file lives in, sometimes it is useful
to look elsewhere, and setwd() ensures you come back to where you intend. Adding
the setwd() in Figure 3-70, you’ll see that the interface has changed – the title of the file
now has an asterisk (*) at the end of it, which tells us the file has unsaved changes. A
quick Cmd+S or Ctrl+S will save the file, and the file name will change back to normal
(Figure 3-71).
Figure 3-71. The filename changes back to black without an asterisk after save
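A minimal sketch of the idea, using a temporary directory as a stand-in for your project folder (the path and file name here are hypothetical):

```r
# Stand-in for your project folder; in a real script you'd use something
# like setwd("~/surveys/project1") instead
setwd(tempdir())

# With the working directory pinned, relative paths resolve predictably
write.csv(data.frame(age = c(30, 41)), "survey.csv", row.names = FALSE)
data <- read.csv("survey.csv")
nrow(data)  # two rows, read back from the directory we set
```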
Moving over to the Environment tab on the right, you’ll see that right now we have
one item loaded – the survey data that we’ve named data – this screen lets us save and
load an R workspace image through the file folder and save icon and also lets us clear
out the entire environment by using the broom icon (Figure 3-72). It can be very useful
to use this tab whenever we’re adding or removing lines of data from our data file, since
it lets us easily see the number of observations. If we’re doing an operation we think will
remove variables (columns) or observations (rows), we can keep an eye on this output to
make sure we aren’t going in the opposite (or no) direction.
Take a moment to click the data object and it will load the data viewer, showing
you the observations and variables (Figure 3-73). If you scroll over, you’ll find a variable
after enjoy named X.with.10...I.really.love.it.and....i.m.not.so.happy.with.
it – what is that?!? Turns out that the R syntax script is only human (well, programmed
by a human) – in this case, the variables didn’t exactly line up, leaving a few empty variables (as evidenced by the NA lines) at the end of the data file. To remove those
last two columns, we use the following line of code (Line 56): data <- data[,1:11]
which tells R to reassign the data variable to a subset of the current data variable,
namely, to include all of the observations (as evidenced by nothing between the [ and
the comma), and then the first 11 variables (the 1:11 portion). Variables 12 and 13, the
data that didn’t really exist, get dropped. This is also how you might go about cutting
down a dataset – if you only wanted the first five observations, you could write the
statement data <- data[1:5,].
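To make the bracket notation concrete, here’s a minimal sketch with a made-up data frame (the column names are invented):

```r
# Five observations, four variables; 'junk' plays the role of an empty column
df <- data.frame(a = 1:5, b = 6:10, c = 11:15, junk = NA)

kept <- df[, 1:3]  # all rows, first three columns: drops 'junk'
few  <- df[1:2, ]  # first two rows, all columns
dim(kept)          # 5 rows, 3 columns
dim(few)           # 2 rows, 4 columns
```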
Figure 3-73. Clicking the Data Frame opens the Data Viewer
Now that we have the last two empty variables gone, let’s do a bit more digging into our data. First, we’re going to install a new package, psych. There are plenty of ways to look at statistics in R, but some are prettier than others, and one of my favorite functions is the describe() function in the psych package. Loading the package on Line 58 and then running it on Line 59, we see the output in Figure 3-75. It shows us summary
statistics, such as the mean and standard deviation, for each of our variables. Some of
these really don’t make much sense (a mean of gender?), but others are meaningful,
such as age and the enjoyment rating.
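As a minimal sketch with made-up numbers (this assumes you’ve installed psych with install.packages("psych"); a base-R fallback is included in case you haven’t):

```r
df <- data.frame(age = c(23, 35, 41, 29, 52), enjoy = c(7, 9, 6, 8, 10))

if (requireNamespace("psych", quietly = TRUE)) {
  # mean, sd, skew, kurtosis, and more, one row per variable
  print(psych::describe(df))
} else {
  # base-R fallback: just the mean and sd per column
  print(sapply(df, function(x) c(mean = mean(x), sd = sd(x))))
}
```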
This brings us to an interesting topic: errors. You’ll notice that after you run that
command (describe(data)), you receive the errors in Figure 3-76. Briefly, R is telling
us that some of the data wasn’t coded in a way that let us run the command properly
(Errors 1 and 2) and that others just don’t apply to the statistics we tried to run (Errors
3–76). My advice with R errors tends to always be the same: Google is your friend. R’s
errors are not always the most user-friendly, and when it comes to some packages, the
authors of the package don’t always state things in clear ways when they throw an error.
You might need to search the exact error string to find situations where others have had
the same issue and to receive guidance. Asking fellow R users can also be useful.
Figure 3-76. Many Errors Trying to Use the describe() Function with the Entire
Dataset
Returning to the output, remember how I said that some of those variables really
don’t lend themselves to means and standard deviations? Those include the gender
and “Check All That Apply” questions. For those, I really would like to know either the number of people who chose each option or the percentage of people who chose that option. I can do that pretty easily in R using the table() function. In Figure 3-78, I’ve
nested table() inside of prop.table() – reading from the outside in, it starts by pulling
a variable, data$gender, for example, and asking for a table of responses. This will give
me the raw counts. Counts are useful in some times and places (e.g., Sesame Street),
but most of the time people find percentages to be more easily understood. That’s why I
put the prop.table() function around the table. This tells R to give me the proportions
in each response category. As we can see, it’s much easier to tell someone “Yes, 60% of
respondents were male” or “40% have recently bought a book on R” vs. the raw numbers.
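Here’s a minimal sketch of that nesting with made-up responses, matching the 60/40 split just described:

```r
# Made-up gender responses: 6 Male, 4 Female
gender <- factor(c("Male", "Male", "Female", "Male", "Female",
                   "Male", "Female", "Male", "Male", "Female"))

table(gender)              # raw counts per category
prop.table(table(gender))  # the same table as proportions (0.4 / 0.6)
```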
Now let’s talk about what this section claimed to de-mystify: linear regression. To do
this, I’m going to ask that you recall your first or second algebra class, where you learned
the equation of a line. Remember that a line can be plotted by knowing the slope and the
y-intercept – this is often written algebraically as y=mx+b with m standing in for slope
and b standing in for the y-intercept.
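To see that this is really all lm() is doing, here’s a tiny sketch with made-up points that sit exactly on the line y = 2x + 1; the fitted coefficients recover b and m:

```r
x <- 1:10
y <- 2 * x + 1  # points generated from the line y = 2x + 1

fit <- lm(y ~ x)
coef(fit)  # (Intercept) comes back as 1 and x as 2: lm() recovered b and m
```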
If a line has multiple slopes, one for each predictor, each slope rocks the line up or down across the Y axis, in essence finding a sweet spot among them. Linear regression is, statistically, a process to determine the equation of a line. Feed it a bunch of Xs and Ys, and it will figure out the y-intercept and a slope for each X.
We write this equation out in our R script on Line 66, which reads fit <- lm(enjoy
~ gender + age + AllThatApply_geek + AllThatApply_rbook + AllThatApply_
catyell, data=data) – let’s break that down a bit:
• fit is the name of a new object we’re going to create in R to store the
calculation.
• data=data tells R to feed data from our dataset named data into the model. If our dataset had been named something else, for example, surveydata, this line would have read data=surveydata.
Now that you know what the line of code does, let’s run it and get a warning
(see Figure 3-79).
Turns out that we have a few issues we need to fix before we can run the linear
model. Looking at the structure of the data using the str() command (Figure 3-80), we
can see that we have a bit of a problem – our outcome variable, enjoy, is a factor, not a
number. R wants to predict a linear model, which means Y has to be a number.
We can fix this problem by casting the data from one structure to another, in this
case using the as.numeric() function. One word of warning though – casting factors to
numbers can be tricky. Remember that factors depend on a reference level. R tries to
choose reasonable reference levels, so if the factor values are “1”, “2”, “3”… “10”, it chooses
the lowest, 1. However, if R chose wrong, or someone specifically defined the levels
differently, we can run into issues.
Let’s look at this in a short example. In Figure 3-81, I’ve first listed the enjoy data,
and we can see each observation of the variable listed in order. I then ran the same
command within the as.numeric() function. The numbers match the levels perfectly. I
then reassigned data$enjoy to the numeric version of data$enjoy instead of the factor
version. Running data$enjoy again shows that the data is now a number, not a factor
(we know this because we don’t see a “Levels” line).
But what if something went wrong? Well, in Figure 3-82, I’ve created that scenario. It
starts the same, but the second line of code I enter re-orders the levels so that 10 is the
first level. Now we can see that if I do the exact same thing I did earlier, a 10 has become
a 1, a 2 has become a 3, a 5 has become a 6, and a 9 has become a 10. Mass chaos! The
moral of the story: Always test out your casting before you make it permanent to verify
that you don’t trash your data accidentally.
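Here’s a minimal sketch of that pitfall with a made-up factor. Note that as.numeric() on a factor returns level positions, not the values themselves; a common safe route is to cast through character first:

```r
enjoy <- factor(c("10", "2", "5", "9"))

# Levels sort as text, so "10" comes before "2"
levels(enjoy)      # "10" "2" "5" "9"
as.numeric(enjoy)  # 1 2 3 4 -- the level positions, not the ratings!

# Casting through character preserves the actual values
as.numeric(as.character(enjoy))  # 10 2 5 9
```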
Moving on, I’ll now run a data cast from factor to numeric and then the linear model.
You can see the output in Figure 3-83.
Or rather the lack of output. Nothing comes out. Turns out that’s what we want –
nothing coming out tells us that R did not find an error in our code, and it simply
executed what we asked for. To actually see the output, we need to ask R to show us the
output using the summary() function. Executing summary(fit) gives us the output in
Figure 3-84.
Something looks a little off here – and indeed, R tells us as much in the Warning
message at the bottom: essentially perfect fit – what does that mean? It means that the
data we have is suspiciously non-variable. And this is to be expected, because I basically copied and pasted the data ten times to create this dataset. R was not fooled; it figured out
that the data wouldn’t be expected to be this perfect and told us as much.
Perhaps I can make R happy by adding in a bit of variation to the dataset. Opening up
the dataset in Excel, I purposefully added a few values that were not consistent with the
trends reported in R. Now running the same command gives me different output and no
error (Figure 3-85).
Let’s talk about that output now that we can see all of it. First, R reminds us what equation we were attempting to estimate. Second, it lists the Residuals, which are a measure of error. The residuals tell us how far our line’s predicted values are from the actual observed values; the smaller the residuals, the better the line fits the data.
Next, we see the coefficients of the equation. Let’s walk through each one:
• (Intercept) is the y-intercept, the point at which our line crosses the
Y axis if all Xs were at 0 or baseline.
• genderMale refers to the change in the line that accounts for a change
in Gender from the reference level, female, to another level, Male. We
can interpret this value to mean that Males enjoyed our product, on
average, 1.79 more on the rating of 1–10 than females. Guess we want
to target men!
• The last two lines also work the same way; however, you’ll notice
that the p-value in the far right column is above 0.05. Typically, this
means we don’t recognize it as a significant predictor, and thus
we don’t interpret it. The first four lines all are below 0.05 and are
significant predictors.
At the bottom, we get some summary information about the model, including
standard error, and percentage of variance accounted for. Finally, we get an overall
ANOVA omnibus test of differences, showing that the model does have predictive
validity.
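The lm() workflow described above can be sketched with a small invented dataset. The variable names and values below are illustrative assumptions, not the book's survey data:

```r
# Toy data standing in for the survey: enjoyment ratings (1-10) by gender.
# All names and values here are invented for illustration.
set.seed(42)
survey <- data.frame(
  gender    = factor(rep(c("female", "Male"), each = 50),
                     levels = c("female", "Male")),  # female = reference level
  enjoyment = c(rnorm(50, mean = 5.5, sd = 1.5),    # females
                rnorm(50, mean = 7.3, sd = 1.5))    # males, ~1.8 higher
)

fit <- lm(enjoyment ~ gender, data = survey)
summary(fit)              # residuals, coefficients, R-squared, omnibus F-test
coef(fit)["genderMale"]   # estimated male-vs-female difference in enjoyment
```

The genderMale coefficient estimates how much male ratings differ from the female reference level, matching the interpretation given above.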
If we would like, we can also examine different means by group to verify our findings.
For example, in Figure 3-86, I’ve used the tapply() function to calculate a mean based
upon gender. I can see that men do indeed have higher enjoyment ratings. The second
example of tapply() also makes me wonder whether men and women differ in enjoyment based upon whether they like or dislike the meme – it certainly seems that men who liked the meme had the highest enjoyment, with a larger margin of difference than the women showed. I can test this by slightly tweaking the estimated model.
Instead of having all of my factors tested for a main effect, I can specifically look for
interactions between that meme and gender. The model shown in Figure 3-87 does
exactly that.
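As a hedged sketch of both steps – the group means and the interaction model – on toy data (the column names here are assumptions, not the book's exact variables):

```r
# Illustrative data: enjoyment by gender and attitude toward the meme.
set.seed(7)
d <- expand.grid(gender = c("female", "Male"),
                 meme   = c("dislike", "like"))
d <- d[rep(1:4, each = 25), ]                        # 25 respondents per cell
d$enjoyment <- rnorm(100, mean = 6) + (d$gender == "Male") * 1.5

# Mean enjoyment by gender, then by gender and meme attitude
tapply(d$enjoyment, d$gender, mean)
tapply(d$enjoyment, list(d$gender, d$meme), mean)

# Testing the gender x meme interaction specifically
summary(lm(enjoyment ~ gender * meme, data = d))
```

The `gender * meme` formula expands to both main effects plus their interaction, so the interaction line in the summary is the test described in the text.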
In this model, we see the combination of gender and the cat meme, the very last line,
and find that there is not a significant difference (the p-value is above 0.05). So while
that difference looked really large in Figure 3-86, it wasn’t so large that it was statistically
significant.
In this chapter, we’ve covered a lot of ground – we’ve set up a survey instrument,
collected data, analyzed that data, and have reached a point that we can make a
conclusion: in our example, we’d like to target our marketing at male computer geeks!
In the next chapter, we’ll continue this by digging deeper into human behavior, before
we pivot outward in Chapter 5 to discuss R in other parts of our daily life. But before we
go, let’s answer the question that started us off – who do we target in our marketing?
Apparently, we target somewhat older males who classify themselves as geeks. Sounds
like whatever this product is, I’m the prime market for it!
CHAPTER 4
Project 2: Advanced
Statistical Analysis Using
R and MouselabWEB
Imagine the following scenario: you’re lying in bed, sick. While you know it’s a bad idea,
you wonder if perhaps you should engage in a little at-home diagnosing, and you head
over to your search engine of choice. You plug in your symptoms and get taken to a page
that has way too many advertisements – all you want is information, yet everywhere you look you see ads for amazing cure-alls and likely-nonsense conspiracy theories. It’s
almost as if the person laying the page out knew where you would be most likely to look
and purposefully put all of the information you needed somewhere else.
And it’s actually possible they did. For many years, technology such as eye tracking
has helped market researchers figure out where people are gazing, and they can
then use that information for good or evil. Eye trackers, the combination of hardware
and software that monitor gaze and fixation, can cost between $7000 and $30,000,
depending on the model, and thus aren’t within the budgets of most of us. However,
in this chapter, I’ll introduce a piece of software that can serve much the same function, albeit less automatically and more obtrusively. By the end of it, you’ll have a working
example that helps you understand the order that people look at information, which
pieces of information they review most, and what they eventually decide based on that
information. We’ll also discuss the concept of R Packages and “future-proofing” your
code at the end of the chapter, because nothing is worse than writing a great piece of
code only to have it die in less than a year because of something out of your control!
© Jon Westfall 2020
J. Westfall, Practical R 4, https://doi.org/10.1007/978-1-4842-5946-7_4
Introducing MouselabWEB
MouselabWEB is an open source tool that enables researchers (or anyone else) to
monitor how people take in information and then use it to make a decision. As you
can see in the example at the bottom of Figure 4-1, it does this by having boxes of
information covered until the mouse rolls over them (as I did in Figure 4-2). The basic
premise is that individuals will use the information they find in the boxes to learn about
the choice they want to make. They will then refer to the boxes as needed to make their
decision.
• Choosing which car to buy, with boxes labeled for color, engine size,
price, insurance premium, and average maintenance cost
• Choosing where to eat lunch with co-workers, with boxes labeled for
“have we recently eaten there?”, “are they fast?”, and “are they pricey?”
Basically, any time you have a choice to make, you can construct a MouselabWEB
page that will list the attributes of the various options and then allow the user to select
an option. While the user is “mousing around” the page, opening and closing boxes, the
software is tracking a number of things.
To see this in action, I’m going to go to the MouselabWEB Demo Page and do the
demo experiment. I’ll put in my name as “Jon-PR4” and Condition number 1 (Figure 4-3).
The first page (Figure 4-4) asks me to review information inside each box and then
match up the professor with the subject they teach. A simple information search. The
second page (Figure 4-5) gives me choices of two cameras to buy, with information on
each, and allows me to pick one (measuring my preferences). The final page (Figure 4-6)
simply asks me to rate how difficult the decision regarding the camera was.
Figure 4-3. Choosing a Name and Condition Number for the MouselabWEB Demo
Figure 4-4. The first demo page - searching for the subject Professor Smith teaches
Figure 4-6. The final demo page, asking about the difficulty of the camera
question
Once I’ve finished the demo, I can use the MouselabWEB Datalyser screen to
view or export my demo data. In Figure 4-7, you can see the output from the first
page of the demo. The data is set up in a “long” format, which means that each
action has its own line. We’ll discuss “long” and “wide” data formats a bit later in this
chapter, principally how to convert one to the other. For now, we’ll walk through
this line by line.
The first line lists my participant number (12656), the section I was on (intro), and
my name (Jon-PR4). The next three columns show my IP address, the condition number
I was in (1), and the date and time that I ran the demo. The interesting stuff begins in the
next column, which shows that the page finished loading the body 22 milliseconds after
the page was requested. MouselabWEB needs to know this so it can figure out how long
you might have been looking at the page before ever interacting with it.
Figure 4-7. My output from the Professor Smith question seen in Figure 4-4
Reading down through the lines, I have some diagnostic information that tells me
what the layout of the boxes was, since MouselabWEB can be set up to counterbalance
information (e.g., show information on the left on one page, show the same information
on the right the next time the page is loaded – useful to determine if it’s the information
that’s driving the decision or where it is placed). We know that, in some cases, humans
have a right- or left-ward bias, or are more sensitive to the first or last pieces of
information they are shown (a primacy or recency effect, respectively). This makes the
counterbalance feature especially useful to guard against these biases. Returning to
the output, reading the mouseover events, you can see that I first opened box a0, with
Professor Marx's information inside. I opened the box at 2800 milliseconds after the page
loaded (2.8 seconds) and closed it at 5186 milliseconds, meaning I looked at it for about 2.4 seconds. I then opened up Professor Jones’s box, looked at it for just under 2 seconds,
and then opened up Professor Smith’s box, looking at it for just over 2 seconds. Finally,
after 11254 milliseconds on the page, I submitted the demo, with Philosophy as my
choice for what Professor Smith taught.
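The durations quoted above are just the difference between each box's mouseout and mouseover timestamps. A small sketch of that arithmetic (the a0 timestamps come from the text; the other two rows are invented for illustration):

```r
# Recover per-box viewing time from event timestamps (milliseconds after
# page load). Only the a0 row uses the text's actual numbers; a1/a2 are
# made up to match "just under 2 s" and "just over 2 s".
events <- data.frame(
  box  = c("a0 (Marx)", "a1 (Jones)", "a2 (Smith)"),
  over = c(2800, 5400, 7600),   # mouseover time
  out  = c(5186, 7350, 9700)    # mouseout time
)
events$duration_ms <- events$out - events$over
events$duration_s  <- events$duration_ms / 1000
events
```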
As you can probably imagine, the amount of data one could generate from these
studies can be massive. Thankfully, the creators of MouselabWEB provided not only a tool
to create the pages (the MouselabWEB Designer that we’ll see in the next section) but also
a tool to export the data (the MouselabWEB Datalyser). With those in mind, let’s create our
own example and settle an age-old debate: to be (a cat person) or not to be (a cat person).
figure you might be able to use this information to co-write a book with Mr. Flufkins on
how to identify “cat people” vs. everybody else.
In order to do this, we’re going to use the MouselabWEB Designer. MouselabWEB
can be downloaded free of charge after a quick registration of your email address (so that you can get periodic updates and information on bug fixes – the “owners” of MouselabWEB are academic researchers and have no desire to sell your information). You can download the full package or just the designer portion. Eventually,
you’ll need to host the full package on a web server of some sort, so I’d recommend
downloading everything.
It’s a bit outside the scope of this book to discuss how to set up a LAMP stack
(Linux, Apache, MySQL, PHP); however, know that you’ll need such a setup to host your MouselabWEB pages, specifically with PHP 5. PHP 7 removed the mysql_ family of functions in favor of the newer MySQL interfaces and thus broke compatibility with MouselabWEB. The quick steps that you’ll need to follow to install MouselabWEB on your
LAMP server are
Once you’ve done that, you should be able to test out MouselabWEB by running the
designer. It’s a separate download, so you may want to put it in its own directory below
your root website URL (e.g., http://mywebsite.org/designer). Loading it up, you
should see a screen similar to Figure 4-8.
The interface is pretty easy to understand – you can add new rows and columns
by pressing the “new row” or “new Col” button. If you’d like to add buttons to make a
choice, you can do that by pressing the “new Btns” button and can then choose push
buttons or radio options. In Figure 4-9, I’ve put in all of our information for our pro- or
anti-cat test.
You’ll notice that in the upper right there is an “Output” section. This lets you
download the HTML or PHP version of the page, as well as test the entire page from
within the designer interface. Pressing the “test” button brings up a screen similar
to Figure 4-10, which lets you test out your boxes and buttons. Once you’ve finished
mousing around, you can press the “Show Data” button and see what your script would
have recorded (Figure 4-11).
Figure 4-10. Pressing the “test” button to show the Test Table
Figure 4-11. The output after mousing around the test table
Finally, once you’re all done, you can press the php button to get the PHP output
(Figure 4-12) that you can then customize and save to your web server.
The PHP version is more minimal than the test version, since it doesn’t need to show
the data back to the participant. In Figures 4-13 and 4-14, we can see the output PHP
being run, first with all the boxes closed and then with them open.
And that’s about it – there are a lot of ways we could continue to customize, including
piping a participant ID number in or counterbalancing the order, but for our simple
page, we’ve got everything we need. Now we just need to interview roommates. And in
my case, because I’m an overachiever, I interviewed 34 of them. It’s time to download the
data and crunch the numbers!
Downloading the Data
Downloading MouselabWEB data is most easily done through the Datalyser, a special
PHP script created by the researchers who developed MouselabWEB that lets you get
your data in a few different ways. The Datalyser is named “datalyser.php” and lives in
the same directory as the other MouselabWEB files. Loading it up in your web browser,
you’ll get a screen similar to Figure 4-15.
There are a few things to note about the Datalyser screen. The first is that all
experiments that you run will be listed here, as it shows all experiments in the database.
If you want to separate some of them out, you’ll need to specify a different database in
the mlwebdb.inc.php file that I mentioned earlier.
We’ve already seen an example of the “Show Table” command in Figure 4-7; however,
the “Replay” command can be very useful when understanding how participants engaged
with your boxes. Clicking it shows a screen similar to Figure 4-16. You can then choose
any participant (Figure 4-17) and get the output of their session (Figure 4-18). This is very
useful on a person-by-person basis.
However, it’s not going to be as useful for us if we want to analyze everything. For that
we’ll need the data file. Pressing the “download selected” box will let you download the
raw data, with just a bit of processing to unpack the events. You’ll be taken to a screen
similar to Figure 4-19 where you can download the CSV file that we’ll import into R.
While the raw data is useful, the Datalyser has a great feature that allows you to
process selected files into easier data to analyze, by removing any data below a threshold
(200 milliseconds by default, so a quick, unintentional “mouseover” will be removed) and by adding columns for how much time was spent on that particular box (as opposed to simply a timestamp in milliseconds since the page was loaded) and how many times that box was opened. Pressing the “download and process
selected” box takes you to a page similar to Figure 4-20. It’s this processed data file that
we’ll be using in R.
Now that you have the data, let’s get started analyzing it in R!
Crunching the Numbers
To understand what my 34 respondents did, I’m going to need to do a bit of analysis.
To start
4. Finally, load in the dataset by using the command data <- read.csv(file="pr4example_proc.csv"), replacing the filename with whatever your processed file is named.
We’re now ready to start analyzing. We’re going to first explore our data, then format
our data as needed for our analyses, and then answer three important questions:
• Does the time spent viewing a box predict which choice the person
will make?
Viewing and Formatting
To start, I’ve executed the following code. For each line, I’ll tell you the code, what it
does, why I did it, and what the output looks like.
table(data$boxname) – Displays the data, breaking it down by how many times people viewed each box (output in Figure 4-22). It’s a good idea to look at your data in an easy-to-digest way. If we were to look at the data in our data viewer, we’d have 195 rows of data to view – much more than I can handle. But this lets me see at a glance whether we have data across all options and whether any look exceptionally high or low. If we had all 34 choose the catbtn, for example, I might wonder how honest everyone was being. Mr. Flufkins would be very skeptical.

by(data$boxtime, data$boxname, describe) – Gets descriptive statistics on the time spent in each box, by box name (Figure 4-23). Think of this as a sanity check: this command lets us explore a bit more, just to be sure our data is varied enough to be useful and doesn’t have anything strange lurking in it. You’ll notice that all of the button means are around 300–400 milliseconds – that’s fine, since it was the amount of time the mouse was over the button. Imagine, though, if all of the boxes were that way – we’d wonder whether people were really reading the box content or not. Right now, it sorta looks like the pro2 box wasn’t open very long. However, pro2 was the shortest text we had (“cats are cute”), so it makes sense that people wouldn’t be reading that very long. Again, we’re just making sure that the numbers make sense.

data <- subset(data, event == "mouseout"); – Removes onclick events (Figure 4-24). The onclick events represent clicking the buttons at the bottom of the boxes. We don’t need that information, so we remove it from the dataset.

data$boxtime.c <- data$boxtime - mean(data$boxtime); – “Centers” the variable boxtime by subtracting the average of all the boxtimes (Figure 4-24). In a regression analysis, we often want to center a variable so that the intercept term is meaningful. In this case, if we don’t center, we’re going to have an intercept set at 0. Since no one looks at a box for 0 seconds, it’s easier to have the intercept set to the average boxtime. This will let us interpret our intercept term roughly as the average time a box is open.

data$boxtime.c.scaled <- data$boxtime.c / 1000; – Divides boxtime.c by 1000, converting milliseconds to seconds (Figure 4-24). This helps with our interpretation as well as keeping things a little easier to model.
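The three data-preparation steps above can be combined into one runnable sketch on toy long-format data (the values are invented; the column names follow the text):

```r
# Toy long-format data standing in for the MouselabWEB export: each box
# generates a "mouseout" row (time in the box) and an "onclick" row.
data <- data.frame(
  id      = rep(1:3, each = 4),
  event   = rep(c("mouseout", "onclick"), 6),
  boxname = rep(c("pro1", "con1"), each = 2, times = 3),
  boxtime = c(1200, 50, 2400, 60, 900, 40, 3100, 55, 1800, 45, 2100, 50)
)

# 1. Keep only the mouseout events (drop the button clicks)
data <- subset(data, event == "mouseout")

# 2. Center boxtime so the intercept reflects the average viewing time
data$boxtime.c <- data$boxtime - mean(data$boxtime)

# 3. Rescale from milliseconds to seconds
data$boxtime.c.scaled <- data$boxtime.c / 1000
head(data)
```

After centering, the mean of boxtime.c is zero by construction, which is what makes the regression intercept interpretable as the average viewing time.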
Now that we have our data ready, let’s answer our questions!
please spend a few minutes with a stats textbook. In our case, we’re going to predict
choice by how long they spent inside each of the six boxes. The code to do that is
The first portion of the output gives us a reminder of the formula we used and how
the model was fitted. It then gives us some regression diagnostics, such as
Now that we’ve gotten past that, we can look at the random effects group. Random
effects refer to data that can have many possible levels, and we’re interested in all of them.
Fixed effects have a set number of levels, and we might only be interested in a few of them.
In our case, we’ve specified the ID variable as a grouping random effects variable.
It turns out that some people are just faster readers than others, and this allows us to
capture that essence of the participant apart from the rest of our data. Here R tells me
that we had 158 observations over 34 groups (id numbers). The variance was 3193, with
a standard deviation of 56.51. These numbers can be tricky to interpret, but it’s worth
noting that the standard deviation (56.51) is much, much larger than any of the fixed
effects treatments below (which mostly range from 0 to 2). Therefore, we know that the individual person variance is a big deal – some people just look longer at boxes than others, and it doesn’t affect what they do in terms of their information search; it’s just part of how they function as humans.
Finally, we get to the fixed effects. These tell us our estimates for how long people
spent looking at each box when they ultimately decided to choose the cat option over
something else. First, we notice that only two estimates are statistically significant
(indicated by the asterisks next to them): the intercept (where the regression line crosses
the Y axis) and the time spent in Con3 (the item that insinuated that cats will eat your
corpse if they’re hungry). We can interpret this to mean that, on average, people spend about 12 seconds looking at boxes overall, and those who spend about 2.79 seconds less on Con3
are more likely to choose cat. Practically, the answer to our question is: yes, people who
look at Con3 less than average, by about 2.7 seconds, are more likely to choose cat than
others. Guess that argument against cats really holds a lot of sway for some. And sadly, in
some cases, it’s absolutely true.
If you do that, however, you’ll get an error, as shown in Figure 4-26. This error is not
all that useful – “boundary (singular) fit: see ?isSingular” – however, there are a few things
we can do to troubleshoot it. The first is to run the command that it gives: ?isSingular.
Running this code brings up the help menu entry in Figure 4-27. Admittedly, though, the
explanation can be a little confusing to non-stats nerds. That’s where your search engine of
choice comes into play. It’s worth noting that the R help files, as you can see in Figure 4-27, tend to be written for someone who has a very comfortable knowledge of both programming and statistics. This is why searching can be helpful, as people typically re-explain the same concepts several times in different ways on forums such as Stack Overflow and Quora.
After a little bit of searching around, we find our issue: the random effect term.
Whereas we had multiple lines for each person in the first analysis, the variables for
this question stay largely the same for each person – therein lies the problem – we were
making this more complicated than it needed to be. Amending the code to a simple
linear model without the random effects gives us nearly the exact same output, but
without the error message.
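A hedged sketch of that simplified model – the exact formula isn't reproduced here, so the one below is an assumption, fitted to invented data:

```r
# Sketch: predicting the order in which a box was opened (relcount) from
# which box it was, as a plain lm() with no random-effects term.
# The data and formula are illustrative assumptions.
set.seed(1)
d <- data.frame(
  boxname  = rep(c("pro1", "pro2", "pro3", "con1", "con2", "con3"),
                 times = 20),
  relcount = sample(1:6, 120, replace = TRUE)
)
m2 <- lm(relcount ~ boxname, data = d)
summary(m2)   # same style of output as before, minus the random-effects block
```

Dropping the `(1 | id)` term is what resolves the singular-fit warning: when the grouped variables barely vary within a person, there is no person-level variance for the random effect to estimate.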
First, we see some familiar lines for residuals, and then we go straight into
coefficients. In this model, we can think of the coefficients as correlations, in a sense. As the relcount variable changes (the position in which a box was opened; higher numbers mean later viewing), the output tells us there is a relationship between relcount and two boxes: con3 and pro3. Because the coefficients are positive, they move with relcount: as relcount goes up, so does the likelihood that the box someone is looking at is con3 or pro3. This makes intuitive sense – those are the bottommost boxes. This
is why MouselabWEB has a counterbalancing feature built in – because otherwise, a
simple psychological effect of ordering might make us think pro3 and con3 were more
important, since it appears people go back to them at the end. In reality, they might just
be the last thing they view – a more parsimonious answer.
So answering our second question: yes, the boxes at the bottom (Pro3 and Con3) are
more likely to be opened later in the decision-making session than the boxes at the top!
Imagine how this finding might be useful in market research applications where you’re
curious what gets “the last word” as someone debates which option to choose.
Once the data is in “wide” format, all I need to do is run a logistic regression to
predict if they’re going to choose the cat button, and if they are avoiding the con
information, I should see those choosing cat not even opening up those boxes. Mr.
Flufkins wouldn’t, that’s for sure.
To do that, I need to add a new column to my data: hit – this simple variable gives
a hit value of 1 to each row. If I wanted to do something fancier, I could always assign a
higher value to certain boxes. This might be useful in a marketing context to “weight”
more important information that impacts (or is proposed to impact) the viewer more so
than the rest of the material. Next, after adding hit, I’m going to use a function from the
reshape2 package named dcast(). This function takes data in a long format and “casts”
it to a wide format. The entire reshape2 package provides a lot of useful functions that
we will play more with in future chapters. For now, you can interpret the second line
of the following code as telling dcast() to create a new data frame called data_wide
from the existing data frame. On the left side of the tilde (~) are the columns that should
remain constant on the line and on the right side are variables that should be counted
for each person. The value.var parameter tells it to use the value in the hit variable
we just created. In truth, we could skip the hit step entirely – dcast() defaults to counting the number of instances. But to be precise, and to illustrate the option to weight values, I’ve included it here.
data$hit <- 1;
data_wide <- dcast(data,id + choice ~ boxname, value.var="hit");
Now our data looks a little different, as you can see in Figure 4-31.
Figure 4-31. The data for the first 6 participants in wide format
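As an aside, the same long-to-wide counting can be sketched in base R with xtabs(), which may help if reshape2 isn't available. Toy data below; note that, unlike dcast() with its `id + choice ~ boxname` formula, you would still need to merge the choice column back in separately:

```r
# Count how many times each participant opened each box, base-R style.
# Column names and values are invented for illustration.
long <- data.frame(
  id      = c(1, 1, 1, 2, 2),
  choice  = c("catbtn", "catbtn", "catbtn", "dogbtn", "dogbtn"),
  boxname = c("pro1", "pro1", "con3", "con3", "pro2")
)
counts <- as.data.frame.matrix(xtabs(~ id + boxname, data = long))
counts   # one row per participant, one count column per box
```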
With the data in the right format, I can run a logistic regression using the following
commands:
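The original commands aren't reproduced here; a hedged sketch of such a logistic regression on invented wide-format data (all names and values are assumptions) might look like:

```r
# Sketch: predict whether the participant chose the cat option from how
# many times each "con" box was opened. Data are invented for illustration.
set.seed(3)
data_wide <- data.frame(
  choice = rep(c("catbtn", "dogbtn"), each = 17),   # 34 respondents
  con1   = rpois(34, 2),
  con2   = rpois(34, 2),
  con3   = rpois(34, 2)
)
fit <- glm(choice == "catbtn" ~ con1 + con2 + con3,
           data = data_wide, family = binomial)
summary(fit)   # coefficients are on the log-odds scale
```

Negative coefficients on the con counts would indicate that opening those boxes more often goes with being less likely to choose the cat option.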
animals potentially are avoiding opening all boxes equally, compared to the cat lovers.
Mr. Flufkins is planning a PR campaign as we speak to discuss how non-cat people aren’t
fair and balanced.
Conclusion
We’ve moved through a lot in this chapter, from thinking about how we’d track what
information people use to actually tracking and analyzing it. And like LimeSurvey in
Chapter 3, we’ve only scratched the surface of the features that MouselabWEB has –
counterbalancing, embedding images, linking multiple pages together, and more can be
accomplished. And, perhaps even more interesting is combining the two. Think about
a scenario where LimeSurvey feeds into MouselabWEB or vice versa – you could collect information not only on what people think but also on how they formed their opinions.
Combine that with a little bit of knowledge about linear regression, and suddenly you’re
the office hero that makes sense of the data. There is one thing that can be a downer
though – finding that the code you carefully crafted isn’t working anymore. Let’s talk
about that briefly in the last section of our chapter.
Future-Proofing: R Packages
The greatest strength of R is that it’s open source with literally thousands of people
contributing to it. That same strength can be a weakness at times if you’re not prepared
for it.
When I wrote this chapter, I used four different libraries: lme4, lmerTest, psych, and
reshape2. Each of those packages has a version number (for me, it was 1.1-21 for lme4, 3.1.1 for lmerTest, 1.9.12.31 for psych, and 1.4.3 for reshape2 – you can get the version
of a package by using the sensibly named packageVersion() command). With each new
version of a package, the maintainers add, edit, update, and remove elements to increase
the overall quality of the package – whether that be to include new features or kill bugs in
old features. And sometimes one man’s bug is another man’s treasure.
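The packageVersion() command mentioned above returns a proper version object, so you can both record and compare versions in a script (using the built-in stats package here, since it ships with R):

```r
# packageVersion() returns a comparable "package_version" object.
v <- packageVersion("stats")   # "stats" is part of base R, always installed
v
v >= "3.0.0"                   # version objects compare naturally to strings
```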
In 2013, the maintainers of lme4 realized that a function named mcmcsamp that had
been used to generate a Markov Chain Monte Carlo sample wasn’t always accurate. So
they stopped using it, and certain output changed; namely, p-values were removed from Gaussian-family linear mixed models. From a statistician’s point of view, this
was the correct thing to do. However, for those in other sciences, it caused an issue. For
one, our scientific views are never (or should never be) based upon one or two tests.
Scientific theories require a lot of evidence to support them. So in this situation, the lmer
model with possible inaccuracy would only be a small piece of the larger puzzle. It’s
sensible to be cautious, but absurd when 99 signs are pointing one way, and we’re going
to doubt everything based upon one.
So the R community found another way to get those p-values, initially pvals.fnc()
in the languageR package and then the lmerTest package that we used here. Today when
I use those p-values, I know that they might not be 100% accurate (of course, nothing in statistics is 100% accurate; it’s all educated guesses), but I can present that to my reader
or viewer as a caveat and be ready to address it.
However, what happens if you want to run your linear model the day after you
update your package and find the p-values missing? There hasn’t been time for others
to write new code, and taking a side trip down construction of Markov Chain Monte
Carlo samples is going to take quite a bit of time. The answer is simply to build and load the previous version of the package.
To do this, you need to download the source of the version you’d like, compile it
within R, and then load it. The first step is easy – downloading a file. In many cases,
the old versions are on CRAN in the Archive section, or if you’re paranoid, you might
download the exact versions you used when you create a project and store them
somewhere safe in case you need them later.
The second step can be a bit tricky depending on your computer. Linux and Mac
computers tend to have the tools they need to compile code already installed; however,
Windows machines sometimes lack these tools. If you run into errors compiling, you’ll need to download compiler software (on Windows, the Rtools toolchain) and install it.
Finally, the third step seems easy enough – load the newly built package using our
regular commands. In the best possible world, the following code would download the
very first version of lme4, build it, and install it:
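The original listing isn't reproduced here; one hedged way to script the idea is to build the CRAN Archive URL and then install from source. The version string below is an assumption for illustration – check the Archive directory for the one you actually need:

```r
# Build a CRAN Archive URL for a specific package version.
# The version number used here is illustrative, not necessarily real.
archive_url <- function(pkg, version,
                        base = "https://cran.r-project.org/src/contrib/Archive") {
  sprintf("%s/%s/%s_%s.tar.gz", base, pkg, pkg, version)
}

url <- archive_url("lme4", "0.999999-0")
url
# install.packages(url, repos = NULL, type = "source")  # compiles from source
```

repos = NULL tells install.packages() to treat the argument as a file or URL rather than a repository lookup, and type = "source" forces a local build.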
Thankfully, this problem was something that the folks at Revolution Analytics
thought about in 2014, before they were acquired by Microsoft, and they wrote a package
named checkpoint. checkpoint is a clever package that allows you to pick a date in the
past and install and use packages in R exactly as they were on that date. This is an easy
way to ensure you can recreate and reproduce your analyses later, in the event that
packages change. I might still suggest downloading specific versions of the package
source code if it’s vital to your operation, but tools like checkpoint should help you with
your package version woes.
And there you have it, we’ve completed our foray into market research in both
Chapter 3 and Chapter 4. But what if you aren’t a market researcher? What if you just
want to use R in your everyday life to help you with little tasks around the office? In the
next section of the book, Chapters 5–7, we’ll do exactly that!
CHAPTER 5
R in Everyday Life
A friend of mine tells the story of her serving on a budget committee many years ago.
In the midst of making tough decisions about where to make cuts, the committee was
provided with spreadsheets showing each office and how much money was assigned
to each, plus a metric of return on investment. The idea was that the offices with high
ROIs and low money would surely survive any cuts, but if you had the opposite – a ton of
spending and not much to show for it – you were in trouble.
The irony of this process was that the spreadsheets were sorted in alphabetical order,
not by either of the two columns that might be useful. My friend, being an Excel guru,
sorted each one and re-sent them to the committee. The other committee members
were enthusiastic, including the CFO. It wasn’t until one day that the CFO realized one
of his offices was at the top of the “high investment, low return” pile that he sheepishly
asked her to stop sending her lists. That CFO wasn’t around much longer.
How data is manipulated, sorted, and displayed is key to doing many of our jobs as
professionals. And while we all know the Excel gurus who can point and click our data
into shape with a flourish, we also know that this takes time and isn't easy to teach to
someone new when the guru moves on to bigger and better reports.
What if we could script it – take the data and change it exactly the way we’d like and then
report it? With R, all of that is possible. In this chapter, we’ll look at several ways to format
data, manipulate data, and finally report the data back out to software such as Microsoft
Office. We’ll also talk a bit about custom functions and show how you can use those to
speed up your scripting.
Before we begin, a word about how this chapter is formatted: it's written in a
"follow-along" style. The code snippets are small enough that I believe you will get a lot out of
typing them in and executing them. I’ll describe in text what results you should be seeing
as you go along. This will help you become more comfortable with the environment and
the operations we discuss! Let’s begin!
© Jon Westfall 2020
J. Westfall, Practical R 4, https://doi.org/10.1007/978-1-4842-5946-7_5
Data Formatting
Sometimes data just isn’t formatted in a usable way. Whether it’s because we’ve scraped
it off the Internet, or some well-meaning person has tried to make it “prettier” by
adding information, numbers might not read as numbers, or information might be the
reverse of what we want (e.g., names as Firstname Lastname vs. Lastname, Firstname).
Additionally, we might want to trim our data to just the essentials or sort and filter our
data in meaningful ways. We’ll walk through all of these in this section.
String Manipulation
When we pull data into R, we often get “extra” pieces – spaces, text, or symbols we don’t
need. Imagine receiving a spreadsheet where someone has “helpfully” added “USD” after
each price to indicate it’s in dollars – or added “$” at the front. While visually useful, R
won’t be able to add the numbers like we’d like it to, as you can see in the following code:
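The code itself appears as a screenshot in the original; a sketch of what it showed follows (the prices are my own stand-ins, chosen so the sums match the text):

```r
# y holds plain numbers; x has "USD" helpfully appended
y <- c(21.50, 54.57)
x <- c("21.50 USD", "54.57 USD")

sum(y)              # works: 76.07
sum(as.numeric(x))  # fails: coercion produces NAs with a warning
```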
As you can see, the first sum works perfectly since it's summing the y variable, which R
can easily see contains numbers. However, each item in x has "USD" at the end. We can remove that with the
substr() function, which takes a variable, starts at a certain point, and finishes at another
point. substr(x,1,5) tells R to take each item and start at the first position and read until the
fifth position. sum(as.numeric(substr(x,1,5))) will get us our sum of 76.07.
However, you might see a problem here – what if we have a three-digit amount? It
will break the code because now we need to read until the sixth position, but only for that
item. What we really need to do is strip off the last four characters, whatever those may
be. We can do that with a little bit of logic – I first need to figure out how long each item
is and then tell R to stop reading at the position 4 characters from the end. Thankfully, we have a
function to determine how long an item is – nchar() will do it. Here's our modified code,
with one item with three digits:
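The book shows this code as a screenshot; a reconstruction follows (the dollar amounts are my own, picked to produce the 179.59 total mentioned next):

```r
# One amount is now over $100, so a fixed substr() endpoint won't do
z <- c("12.10 USD", "55.42 USD", "112.07 USD")

# nchar(z) gives each item's length; stopping 4 characters early
# strips the trailing " USD" regardless of the amount's width
sum(as.numeric(substr(z, 1, nchar(z) - 4)))  # 179.59
```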
Looking at the z variable, we can see one item is over $100, yet we’ll still get the
correct sum (179.59) from these items. Also, for demonstration purposes, I’ve been
nesting things within each other, but you can imagine a situation where you’d want to
save those recoded variables so that they could be used later. In that case, you might do
something like this:
a <- substr(z,1,(nchar(z)-4))
From the preceding examples, you can probably guess our code if someone has put
in a leading $ in front of everything:
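That snippet also appears as an image in the original; following the preceding pattern, stripping a leading character would look something like this:

```r
# Start reading at position 2 to drop a leading "$"
x <- c("$21.50", "$54.57")
sum(as.numeric(substr(x, 2, nchar(x))))  # 76.07
```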
substr() is a very nice string function, but it's not the only one that R has. Here are a
few others that are useful to know:
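The book's table of functions is an image; a few of the base R standbys it likely covered are sketched here (my own selection, not the original table):

```r
nchar("hello")                   # 5 - length of a string
toupper("hello")                 # "HELLO"
tolower("HELLO")                 # "hello"
paste("Last", "First", sep=", ") # "Last, First" - combine strings
gsub("USD", "", "21.50 USD")     # "21.50 " - find and replace
trimws(" 21.50 ")                # "21.50" - strip whitespace
strsplit("a,b,c", ",")           # split a string into pieces
```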
String processing could be its own book, and in fact there are several focused on it for
different languages. In the end, most of the purpose-driven string modifications you’re
looking for will be well documented. For example, a quick search for “Switch last name
first name R” provides a Stack Overflow listing for doing exactly that. Once you know the
basic functions, you can also explore them further. In practical terms, imagine
getting a report each week from a system that appends useless string information (such
as “USD” or “beats per minute”) and being able to simply run an R script to clean it
up – much more useful than sitting in Excel and going row by row removing the extra
characters.
Variable Magic
One other thing that is annoying to do in Excel that is much easier in R is working with
data frames in terms of re-ordering or renaming variables within them or adding new
ones. There are a few common tasks in the following table, with example code on how to
accomplish them. We’ll use the built-in iris data frame in the examples.
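The table itself is an image in the original; a sketch of the kinds of tasks it covers, using base R on iris (the specific names and values are my own illustrations):

```r
# Rename a variable (here, the first column)
names(iris)[1] <- "S.Length"

# Reorder variables: put Species first
iris <- iris[, c(5, 1, 2, 3, 4)]

# Add a new variable
iris$Observed <- TRUE

# Add a new row with rbind() - here, a copy of row 1
iris <- rbind(iris, iris[1, ])
```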
You’ll notice that the last command used the rbind() function. rbind() is a row
bind – it adds new rows to an already existing dataset. cbind() does the same for
columns. These are easy commands to use if you have data in the order you want, but
they can get you in trouble if you don’t have the data formatted properly to start with
(R will complain about this as well – it won’t easily bind row if the number of columns
doesn’t match, or bind columns if the number of rows doesn’t match). Another function,
merge(), exists that can help you add two datasets together that aren’t perfectly matched
up. We’ll use merge() later in the book.
Now that we’ve seen how to do some basic manipulations, let’s move on to the next
area: sorting and filtering!
Sorting and Filtering
We’ve already seen a few ways to filter data in R earlier when we discussed removing
rows or columns. But what if we want to write code that’s a little more readable instead
of using index numbers? And how would we sort an entire data frame based on given
values? We’ll continue to use the iris dataset and do some simple operations on it:
• What about descending order? Just put a negative sign before the
column: iris <- iris[order(-iris$Sepal.Length),]
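The sort commands above appear partly as screenshots in the original; the basic ascending version that the descending bullet modifies is:

```r
# order() returns row positions sorted by Sepal.Length;
# using them as a row index re-sorts the data frame
iris <- iris[order(iris$Sepal.Length), ]
```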
It’s important to note that sorting could be used to get the highest value and return
it as an informational piece. We’ll do that in a few minutes at the end of this section.
But first, let’s talk about filtering data, which we can do in two different ways – adding
another data frame condition similar to how we’ve done earlier or the subset()
function.
Selecting all of the versicolor species out of the iris dataset can be done using this
command:
iris[iris$Species=="versicolor",]
Or if we'd like something a bit more human-readable, we could use the subset()
function, which I find makes it easier for someone reading a shared script to
understand what we're filtering for.
subset(iris,Species == "versicolor")
Adding a second condition with the & ("and") operator will pull all of the versicolor with
a Sepal.Width of 3.2. What if we wanted those in versicolor or with a Sepal.Width of 3.2?
We'd use the | ("or") operator, like this:
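The combined filters appear as screenshots in the original; reconstructions of both would be:

```r
# "and": versicolor rows that also have Sepal.Width of 3.2
subset(iris, Species == "versicolor" & Sepal.Width == 3.2)

# "or": rows that are versicolor OR have Sepal.Width of 3.2
subset(iris, Species == "versicolor" | Sepal.Width == 3.2)
```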
Let’s put together a few things we’ve had earlier into a simple example: imagine that
you want to report back to the user the highest Sepal.Length for the versicolor species.
You don’t want to show the user the data, just give them the information on one line in a
report. Here’s how we could do it:
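The original code is an image; one way to produce that one-line report, combining the subset and sort from above (the exact wording of the paste() call is my guess):

```r
# Keep only versicolor, sort descending, report the top value
v <- subset(iris, Species == "versicolor")
v <- v[order(-v$Sepal.Length), ]   # descending, so row 1 is the max
paste("The highest Sepal.Length for versicolor is", v$Sepal.Length[1])
```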
Congratulations, you’ve just written your first report output in R! Now we’ll spend
some time manipulating our data and learn how to write our own functions!
Conversions and Calculations
Simple calculations in R are, as we’ve seen, fairly easy. For example, if we wanted to
calculate a new variable in the iris dataset named Petal.Area, we could do that easily by
multiplying Petal.Length by Petal.Width.
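That calculation, shown as an image in the book, amounts to one line:

```r
# New column: petal area as length times width
iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width
```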
We can also convert values using R’s built-in libraries. One case for this might be
calculating time between two dates. Here I’ve coded two dates as strings, and then I’ve
converted them to Date objects in R and then subtracted the difference:
x <- "2019/01/01"
y <- "2020/01/01"
x.date <- as.Date(as.character(x), format="%Y/%m/%d");
y.date <- as.Date(as.character(y), format="%Y/%m/%d");
paste("The difference between ",x," and ",y," is ", (y.date-x.date), " Days");
You’ll notice that the as.Date() function takes a format argument, which you can
use to let R know if your date values are in a different format, such as 01/01/2019 or
23-2-2019, and so on. We also did a fair bit of casting there that we might not have
needed – x and y were already character strings; however, it doesn't hurt to make sure
that they are properly cast in our code.
Lastly, you’ll notice that when I recast x and y as dates, I appended .date to their
variable name. It’s a good practice to do this, especially on large datasets with many
different variables. By appending the data type to the variable name, you ensure that
you will know later, at a glance, the type of variable you’re working with. I generally use
.f for factors, .chr for characters, .date as dates, and so on. Pick a system that works for
you and go with that!
Finally, let's briefly talk about conditional statements and loops, along with useful
functions for them. R supports a few common structures, demonstrated as follows.
If-else or ifelse
if (iris$Sepal.Length[1] > 5) {paste("Greater than 5") } else {paste("Less
than 5")}
This checks the first row. If we wanted to do all rows, we could use ifelse():
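The ifelse() call is shown as a screenshot in the original; it would look like this:

```r
# Vectorized: returns one label per row of iris
ifelse(iris$Sepal.Length > 5, "Greater than 5", "Less than 5")
```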
This code returns the value of either “Greater than 5” or “Less than 5” for each of
the rows in the dataset, in essence summarizing them. In general use, I find ifelse() is
much more commonly used than if unless I’m working inside a loop or function.
For
for (i in 1:10) {print(i)}
This declares a new variable named i and works through it ten times, printing 1–10
as output.
While
i <- 1
while (i < 10) {
print(i)
i = i+2
}
This code declares a new variable named i, and while i is less than 10, it prints the
value and then adds 2 to it. The output lists the odd numbers 1, 3, 5, 7, and 9. Since the
last pass sets i to 11, the condition fails and the loop exits.
You could use these functions to select subsets of data, or random sequences, as
necessary for your work.
Now that we’ve seen some of the control functions in R, let’s reshape data in more
detail!
Reshaping
In the last chapter, we used the reshape2 package to reshape data. We’ll explore that a
little more here as we convert the iris dataset from wide data to long and back again.
To do this, we first load the reshape2 package. We then “melt” our data into
“identification” variables and “value” variables. In the following code, I’ve loaded the
package and then created a unique ID variable for each entry using the runif function
from earlier. Finally, I’ve melted the entire dataset.
library("reshape2")
iris$ID <- paste(iris$Species,runif(150))
melted <- melt(iris, id.vars=c("ID","Species"))
Looking at the first six lines, we can see how our data has changed (see Figure 5-1).
Figure 5-1. The original iris dataset and the melted dataset
Now that we have the data “melted”, we can cast it however we like. I’m going to cast
it back to the way that it had been:
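The dcast() call appears as an image in the original; given the melt() above, the reverse cast would be:

```r
# ID and Species identify each row; each level of "variable"
# becomes its own column again
newiris <- dcast(melted, ID + Species ~ variable)
```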
We can see in Figure 5-2 that iris and newiris have been resorted; however, they
still contain the exact same data, as Line 1 in iris matches with Line 2 in newiris. Line 5
of iris also matches up with Line 4 of newiris.
We’ve now just converted from wide to long then back to wide. In our data, we had
the advantage that we didn’t have multiple values in long format for any particular entry
(e.g., multiple Sepal.Lengths for the same ID). In most long formats, that’s not going
to be the case. Thankfully, the dcast() function can also take a fun.aggregate
argument with options such as mean, median, or sum. This tells R what you'd like it to do
when it encounters multiple lines in long format – basically how should it summarize
those lines?
Once you’ve started playing around with the reshape2 package, you can think of
ways it might work into your daily life. Here are a few that come to my mind:
• Sales teams send you daily reports that you need to summarize. Each
day you add the data to a CSV file, and once a week, you pull that file
into R and have it reshape the long data (sales each day) to wide (a
sales report for the week).
There are many different possibilities for converting data from wide to long, so I’d
encourage you to start looking at the data you see around you on a daily basis to consider
ways you might want to summarize it! You may also notice that data isn’t always coded in
the ways you’d like, which is where our next section comes into play: recoding data!
Recoding
Recoding variables can be useful for two primary reasons: first to summarize data by
reducing it in scope (e.g., a median split, either “above median” or “below”) and second
to change the reporting style of the variable (e.g., converting 1–5 to a scale of agreement).
We’ll look at both in our examples as follows.
The first and perhaps most obvious way that we could recode variables is to use the
ifelse() function that we saw earlier. This is good for single criteria recoding. Returning
to the iris dataset, we find that we have three species: setosa, versicolor, and virginica.
Imagine wanting to simply have two groups: Setosa and “Other”. We could do something
like this:
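Both approaches are shown as a screenshot in the book; the reconstructions below are consistent with the description that follows (variable names match the text, the exact lines are my own):

```r
# Method 1: single-criterion recode with ifelse()
iris$Species.new <- ifelse(iris$Species == "setosa", "Setosa", "Other")

# Method 2: start everything at NA, then fill in level by level
iris$Species.control <- NA
iris$Species.control[iris$Species == "setosa"] <- "Setosa"
iris$Species.control[iris$Species != "setosa"] <- "Other"
```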
Both Species.new and Species.control now have the exact same information in them;
we just went about coding it in two different ways. The difference is that our second
method could be used for more than just an “either-or” decision, since the first line sets
all of the Setosa lines to NA until the second line fills them in with Setosa. You could,
therefore, have as many lines like this as you have levels of your variable, gradually
assigning everything a line at a time, something like this:
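The example itself is an image; assigning one level at a time might look like this (the recoded labels are my own illustrations):

```r
iris$Species.recoded <- NA
iris$Species.recoded[iris$Species == "setosa"]     <- "Setosa"
iris$Species.recoded[iris$Species == "versicolor"] <- "Versicolor"
iris$Species.recoded[iris$Species == "virginica"]  <- "Virginica"
```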
A third way to recode data comes from the recode() function from the car package.
I’m a fan of this method because it gives us the flexibility of the control structures
approach while also putting most everything on one human-readable line. Here’s an
example of it adapted from the documentation for the recode() function:
library("car")
x <- rep(1:5,4);
x
recode(x, "1='A'; 2='B'; c(3,4) = 'C'; else='D'")
That fourth line makes it pretty easy to see what we’re doing – 1 becomes A, 2
becomes B, 3 or 4 becomes C, and everything else becomes D. The only trickiness
is getting the syntax right – starting the recode argument with a quotation mark, any
character with an apostrophe, and using semicolons between each “clause”. Even after
using this command for years, I sometimes still find myself with an error the first time
using it in a project. Indeed, when I wrote this code earlier, I had accidentally placed a
comma after “C”, and R was not happy with me!
Finally, it’s usually a good idea to recode into a different variable name. While R will
let you “overwrite” an existing variable, it can be hard to tell what’s going on by doing so.
Adding a new variable, by appending something like .recoded onto the end, will help
you keep your data straight as you work through your task.
Before we move on to different ways to report data, let’s talk about a situation where
you may need to recode data quite often and want to centralize your code into your own
function. As you’ll see, writing a function in R is pretty simple!
library("car")
recode(88,"lo:59 = 'F'; 59.01:69 = 'D'; 69.01:79 = 'C'; 79.01:89 = 'B';
89.01:hi = 'A'")
Now every time you want to recode a percentage to a letter grade, you could use that
recode line; however, that's a lot of typing. Instead, let's wrap it in a function
called LetterGrade(). Here’s the code:
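The function body appears as a screenshot in the original; wrapping the earlier recode() call, it would be along these lines:

```r
library("car")

# Wrap the grade mapping so it lives in exactly one place
LetterGrade <- function(pct) {
  recode(pct, "lo:59 = 'F'; 59.01:69 = 'D'; 69.01:79 = 'C';
               79.01:89 = 'B'; 89.01:hi = 'A'")
}

LetterGrade(88)  # "B"
```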
Now all we need to do is run LetterGrade(88) to get the letter grade for a score of 88.
If we ever make a change to our grade mapping, we only need to change it in one place –
the LetterGrade() function.
Using custom functions in R merely requires that they be loaded into your script
before they’re used. A common way to do this is to have all of your custom functions
at the top of your scripts, along with any packages that you need to have loaded. Some
choose to place all of these in a separate file and then use the source() function in R to
execute that entire file every time they load their project. And if you’re using RStudio or
saving your image in R, you’ll have those functions reloaded every time you open your
project, which can save time.
Finally, it’s important to note that functions can return any type of data object or
type. You can get very complex with your functions – in fact, this is how whole packages
are born! If you start using R for a lot of your daily work, you’ll likely amass a large
collection of your favorite functions to use and reuse, again and again.
Speaking of using R for your daily work, at some point, someone is probably going
to want to see that work. And we can produce a very nice product for them using a few
different data reporting methods. In the next section, I’ll discuss one of the most popular
ways to report your data – RMarkdown!
From there, create a new RMarkdown document (Figure 5-3). For our first example,
we’ll use HTML (Figure 5-4).
As you can see in Figure 5-5, RStudio opens a new document with some basic
markdown examples in it.
From this, we can actually compile the document immediately and see how the
output would look. To do this, we’ll have RStudio run the RMarkdown through a
program called knitr. Knitr takes the RMarkdown file, parses it, executes any R code in it,
and weaves it all together into one HTML document. To get it to “knit”, click the “Knit”
button shown in Figure 5-6. We can see it doing this in Figure 5-7 and see the final output
in Figure 5-8.
You probably saw other options for output, including PDF. However, as seen in
Figure 5-9, this fails if you do not have MiKTeX (Windows) or TeX Live (Mac) installed.
After installing MacTeX (which provides TeX Live), which takes up a whopping 10 GB
of space between installer and installed files, and restarting RStudio, I now have the
ability to knit PDF files (Figure 5-10 shows the compile output; Figure 5-11 shows the
PDF document).
And finally, since I have Microsoft Office installed on my machine, I can use the
Knit to Word option (see Figure 5-12) to create a Word document. First, Word does ask
permission to allow R to write to it (shown in Figure 5-13), and from there, it creates the
document automatically (See Figure 5-14).
All four test documents are available in the rmarkdown-demo code for this chapter. I'd
encourage you to play around with Markdown in R to create reports that will be useful to
you, in the proper format. I'd then encourage you to think about possibilities such as
• Each week having an R script run to create a PDF report, that you
could then attach to an email and send (we’ll explore sending emails
with R in the next chapter).
• Exporting your statistical analysis straight into Word and then opening
the document to add narrative around each analysis.
• Updating a report on your web page by having R create the HTML file
directly, which you could then drop onto your web server (or better yet, have
R write it there directly using the ftpUpload() function in the RCurl package).
Conclusion
We’ve covered a lot of basics in this chapter that take R from simply running statistical
analysis to creating rich datasets that are formatted exactly as you’d want them and then
output into formats that others could easily utilize such as PDF or Word. Along the way,
I’ve shared some of my favorite tips, and we’ve seen that there is often more than one way
to do something in R. Our next two chapters focus on projects that amp up workplace
productivity – an R Form Mailer in Chapter 6 and an R Powered Presentation in Chapter 7!
CHAPTER 6
Project 3: The R Form Mailer
• blastula – A versatile package that can send HTML encoded emails. Flexible, and offers support for a few third-party APIs.
• blatr – A Windows-only package that sends mail through the command-line program blat.
• edeR – A bit of the reverse – it doesn't send, but rather reads email. Imagine having data sent to an email box – this package can access that mailbox and download the data directly!
• emayili – An email library that supports sending through an SMTP server, with minimal R dependencies.
• IMailgun – A library to send mail through the Mailgun service (from http://mailgun.com).
• mail – One of the oldest mail libraries. Doesn't support some of the newer features you might need, such as SMTP authentication or HTML. Perfect for quick notifications.
• mailR – A wrapper for the Apache Commons Email API.
• sendmailR – A simple SMTP client that allows R to send email directly to receiving servers, although this may run into issues if your machine is not properly set up as an SMTP server (e.g., your mail may be blacklisted for failing a reverse DNS test).
• gmailR – An email library specifically designed to work with Gmail's RESTful API. This is the package we'll explore further in this chapter.
• ponyexpress – A community project to bring automated email list distribution to Gmail through R. You can find more about this at https://docs.ropensci.org/ponyexpress/.
• RDCOMClient – Another Windows-only package; however, this one utilizes Windows DCOM methods, which allow it to send email as well as send data to other applications.
As you can see, depending on what you want to accomplish, you may be better off
with certain clients over others. The simpler clients that directly connect to an SMTP
server (or include one themselves) may work just fine on their own if they’re within your
own small network, where you control spam filtering and access. However, they might
not work well with the “wider” world, where email must be somewhat vetted to prevent
as much SPAM as possible. At the end of this chapter, I will provide a few links to email
providers that I’ve found especially helpful and useful to me as a small developer and
system administrator. In many cases, these services are free for small quantities of mail,
which makes them ideal for your own personal productivity. Before we get into those,
however, let’s work with a pretty large player: Gmail.
Sending Gmail
Gmail has come a very long way since its founding on April 1, 2004 (when many thought
that its 1 GB of space was an April Fools’ Day joke!). Today Gmail serves more than a
billion users, making it the most popular email service in the United States and in many
other countries. As it has “grown up,” Gmail has added a plethora of features from robust
search to spam filtering. Given its size and scope, unfortunately, it means that setting
it up for automated email sending can be a bit of a hassle. However, in this section, I’ll
walk you through setting up your Gmail account for API access (by creating your own
application), downloading the credentials you need, and ultimately sending a test email.
Later we’ll use these same credentials and set up for our R Form Mailer!
2. Once you’ve gotten into the console, you’ll want to find the
“Create Project” button. Give your project a descriptive name that
you’ll recognize (as you’re likely the only one to ever see it!). I’ve
called mine “Email from R” as you can see in Figure 6-2.
3. Next, from the dashboard, you’re going to want to find the APIs
& Services area. You’ll see a link at the top of Figure 6-3 that says
“ENABLE APIS AND SERVICES”. Figure 6-4 shows the API and
Services page that we’re looking for. Alternatively, you can search
"Gmail API" in the search box above. Regardless of how you get
there, you want to get to the Gmail API screen, either by jumping
through Figure 6-4 or going there directly. Figure 6-5 shows the
screen we're after.
4. Enable the API by clicking the ENABLE button. The screen will
change to show the API status for your project (see Figure 6-6).
7. OK, we’re almost there. Now that we have the consent screen
done, we can return to the “Credentials” option in the left-hand
menu bar. Click the “CREATE CREDENTIALS” option and choose
“OAuth client ID”, as seen in Figure 6-11.
That was a laborious process, but it only has to be done one time – you now have
your application set up to interface between Gmail and R, and you have the client
credential you need for it. In the next section, we’re going to set up an RStudio project
named “Mailer” that is going to house the code for all three of our examples in this
chapter – sending an individual email, securing data through encryption, and, finally,
sending out bulk mail through R!
install.packages("gmailr")
library("gmailr")
Next, we need to specify the Gmail credentials. If you’ve changed the name of your
OAuth JSON credentials, here's the line to modify. Since mine are named
"client_secret.json", my code is simply
gm_auth_configure(path="client_secret.json")
This tells the gmailr package what authentication credentials to use. Next, I’m going
to set up my email – the subject, the from header, the to header, and the body:
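Those four pieces are shown as a screenshot in the original; set up as plain variables, they'd look something like this (the subject, body, and addresses are placeholders of my own):

```r
msg_subject <- "Test Email from R"
msg_from    <- "[email protected]"     # placeholder address
msg_to      <- "[email protected]"  # placeholder address
msg_body    <- "Hello from R!"
```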
Since Gmail works with MIME message objects, I can then use the following code to
create a MIME mail object named email_msg:
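The construction mirrors the bulk-mail example later in the chapter, piping the headers and body into gm_mime() (this assumes the msg_* variables from the previous step):

```r
email_msg <- gm_mime() %>%
  gm_to(msg_to) %>%
  gm_from(msg_from) %>%
  gm_subject(msg_subject) %>%
  gm_html_body(msg_body)
```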
It’s worth noting that this is the first time we’re seeing the %>% notation in R. This is a
forward pipe infix operator – it forwards information to the next function or expression.
It provides a mechanism to create a bit more readable code, although it can look a little
strange at first. You can learn more about it by looking at the magrittr package that
initially created it.
At this point, we’re ready to send the email. However, the first time we do so, we need
to authenticate with Gmail and grant it permissions to send from our Gmail account. As
you can see in Figure 6-16, I’ve done nearly everything, and I’m on the last line.
Figure 6-16. Ready to send mail, all but the last line has been executed
Now I execute the line gm_send_message(email_msg) and get the message “httpuv
not installed, defaulting to out-of-band authentication”, with a line asking me for an
authorization code. At the same time, a browser window opens in my default browser to
the Google Sign in page (Figures 6-17 and 6-18).
Once I've logged in, I'm taken to a warning screen telling me that I'm trying to authenticate
to an unverified application. This is the application I created earlier, so while, yes, it is
unverified, I’m reasonably sure it’s safe since I’m the only one who has been using it and
the only one who has authentication credentials for it! So it is safe for me, in Figure 6-19,
to click “Advanced” and then “Go to Send Mail From R (unsafe)” in Figure 6-20.
With that warning aside, I now have to grant permission to my Gmail account to use
the Gmail API (See Figures 6-21 and 6-22), and eventually, I’m given an authorization
code, in Figure 6-23.
Returning to RStudio, I paste that authorization code into the prompt and press
Enter. The email is then sent, and I receive a confirmation ID (Figures 6-24 and 6-25).
And that’s it! My message has been sent. On the receiving end, I can see it in
Microsoft Outlook (Figure 6-26), and if I log into the sending Gmail account, I can see the
message in my sent list (Figure 6-27).
Now that I’ve authenticated one time, the gmailr package can cache my
authorization, if I let it. So the next time that I send email, in Figure 6-28, it goes through
without any prompting (See Figure 6-29).
Finally, you'll notice that if you log back into the Google Developer Console and view
the OAuth User Count, you now have one user who has used your application (see
Figure 6-30). Unverified apps can have up to 100 users to test them out as they're being built,
so in practice, you’re well under the limit and can continue using your application in
Gmail for likely as long as you want. If you do ever want to build out a public application,
however, you’ll need to have your application verified and potentially pay Google to
conduct a security audit on it. Speaking of security, how about we think about how we
want to secure our information that we send out from R?
Securing Communications
Email is a protocol that harkens back to a simpler time on the Internet, where things
were not as secure as they are today. In its basic implementation, it’s 100% plain text.
Today we typically want a bit more security around our communications, and we get
that through SSL encryption in many places we go on the Web. However, sometimes
you want to have a bit more security – something that you control yourself. In this
section, I’ll talk about two packages that can help you with that in R. Once your data
is secured, you can then send it directly from R using the preceding example. Our first
package is a simple form of encryption that is useful within R only. Our second is an
industry standard that takes a bit more to set up, but can be used in a wide variety of use
scenarios.
The sodium Package
Cryptography can be a difficult subject to master, and as we'll see in our next section, it
can get quite complex. However, for something simple, such as sending data
output from R to a friend who is also using R, a simpler approach is much appreciated. This
is the point of the sodium package by Jeroen Ooms. The package offers a few different
ways to encrypt data, and in the following example, we're using one of the simplest:
secret key encryption – or, as it's commonly thought of, a password!
Here’s how this might work: I’m working with a friend on a project, and we want
to share data back and forth via email. However, we need our data to be secure, so that
someone can’t easily intercept it. We agree on a passphrase, something like you can
grow ideas in the garden of your mind (a quote from Fred Rogers). Imagine that I want
to encrypt a dataset in R (we’ll use ChickWeight that we talked about in Chapter 2) using
that passphrase. I can do this by installing the sodium package and using the following
code:
install.packages("sodium")
library("sodium")
passphrase <- "you can grow ideas in the garden of your mind"
key <- hash(charToRaw(passphrase))
secret_message <- serialize(ChickWeight, NULL)
noise <- random(24)
encrypted_text <- data_encrypt(secret_message, key, noise)
I now have an encrypted text variable that isn’t very useful to anyone. You can see a
snippet of it in Figure 6-31.
Now all I have to do is send my friend the random noise I used to encrypt with (the
noise variable) and the text (encrypted_text). Something like this will work:
saveRDS(noise, file="noise.rds");
saveRDS(encrypted_text, file="text.rds");
library("gmailr");
gm_auth_configure(path="client_secret.json");
msg_subject <- "Encrypted Data";
msg_from <- "Jon Westfall <[email protected]>";
msg_to <- "Another Jon Westfall <[email protected]>";
msg_body <- paste("See the data attached");
email_msg <- gm_mime() %>%
  gm_to(msg_to) %>%
  gm_from(msg_from) %>%
  gm_subject(msg_subject) %>%
  gm_html_body(msg_body) %>%
  gm_attach_file("noise.rds") %>%
  gm_attach_file("text.rds")
gm_send_message(email_msg)
This first saves our random noise variable and our encrypted text to R Data objects
and attaches them to an email. When our receiver gets the message (Figure 6-32), they
can load and restore the data using the secret passphrase we've agreed upon, using code
like this:
library("sodium")
noise <- readRDS(file="noise.rds")
encrypted_text <- readRDS(file="text.rds")
passphrase <- "you can grow ideas in the garden of your mind"
key <- hash(charToRaw(passphrase))
data <- unserialize(data_decrypt(encrypted_text,key,noise))
And just like that, you've got your data securely transported. Our limitation, though,
is that we have to use R both to encrypt and to decrypt our data. Our next encryption
method is a bit more complex, but it overcomes this problem by using an industry
standard: GPG, the free replacement for Symantec's PGP cryptography.
Imagine a pair of mathematically linked keys, Key A and Key B: whatever one locks,
only the other can unlock. When you wanted to send something to a friend, you'd use
Key A to lock it, and using Key B, they could ensure that you were the one who actually
sent it. Now call Key A your "Private Key" and Key B your "Public Key" and you have the
concept of Public-Private Key cryptography. It's a way to ensure that only the recipient
of a message can read it, and it's also a way for me to "sign" my own documents so that
people know it's really me.
Back when PGP/GPG first started, individuals would go to "key signing" parties
to build up a "web of trust." Imagine a bunch of people in a room meeting each other,
showing one another documents that proved their identity (e.g., driver's licenses,
passports), and the other person saying, "Yep, that's you, I'll sign your key." Eventually
you'd have a key that was well trusted. Today that same principle is used in a few
different areas, including a popular platform named Keybase (https://keybase.io).
Figure 6-33 shows my Keybase profile. Through Keybase, one can know that all of
the web profiles I've associated with it are truly mine. Keybase also allows people to
securely talk with me through the Chat feature. What’s very interesting, though, is the
line of seemingly random numbers at the top that starts with 1396. Clicking that, we see
my PGP/GPG public key fingerprint and full public key (see Figure 6-34).
We’ll use this data in a moment in an example of how one can send an encrypted
message using the gpg package in R. But first, before you can send me a message, you
must have your own public and private keys. Thankfully, they’re pretty easy to create
using the gpg package. This code will create a public/private key pair and then email the
public key to a friend:
install.packages("gpg")
library("gpg")
my_key <- gpg_keygen(name = "Testy McTestperson", email = "[email protected]")
public_key <- gpg_export(id = my_key)
library("gmailr");
gm_auth_configure(path="client_secret.json");
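Building on the exported key, the message itself can be assembled the same way as the chapter's other gmailr examples; this is a sketch, and the subject and body wording are my assumptions rather than the original listing:

```r
# Assemble the message carrying the exported public key. gm_text_body()
# is used so the ASCII-armored key keeps its line breaks.
msg_body <- paste("Here is my public key:", public_key, sep = "\n\n")
email_msg <- gm_mime() %>%
  gm_to("Another Jon Westfall <[email protected]>") %>%
  gm_from("Jon Westfall <[email protected]>") %>%
  gm_subject("My Public Key") %>%
  gm_text_body(msg_body)
```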
gm_send_message(email_msg)
Your friend will now get a message that has your Public Key, as you can see in
Figure 6-35.
This allows my friend to send messages to me by encrypting them using the public
key that I just sent them. I can decrypt it using the private key.
Once you have created your public/private key, GPG will store them on your GPG
keyring. However, it’s a good idea to back them up by using the following commands:
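A sketch of one such backup, assuming the gpg package's gpg_export() accepts secret = TRUE for exporting the private key (check the package documentation before relying on this):

```r
# Export both halves of the key pair to ASCII-armored files.
# The secret = TRUE export writes the private key -- guard that file!
library("gpg")
writeLines(gpg_export(id = my_key), "my-public-key.asc")
writeLines(gpg_export(id = my_key, secret = TRUE), "my-private-key.asc")
```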
Another way to get someone's PGP/GPG key, other than having them email it to you,
is through the fingerprint we saw in Figure 6-34. The following code downloads my
public key directly to your GPG keyring. The second half of the code then encrypts the
contents of the secure-communications.R file using the public key just downloaded.
library("gpg")
jon <- "F97D7D4A348AE209B0FE5B201396233294A0EFA4"
gpg_recv(jon)
library("gmailr");
gm_auth_configure(path="client_secret.json");
msg_subject <- "Encrypted Message";
msg_from <- "Jon Westfall <[email protected]>";
msg_to <- "Another Jon Westfall <[email protected]>";
msg_body <- gpg_encrypt("secure-communications.R",receiver=jon)
email_msg <- gm_mime() %>%
  gm_to(msg_to) %>%
  gm_from(msg_from) %>%
  gm_subject(msg_subject) %>%
  gm_html_body(msg_body)
gm_send_message(email_msg)
On the receiving end of the message (Figure 6-36), the message appears to be gibberish.
However, since it’s been encrypted with my public key, I can use my private key to decrypt it.
Now that we’ve talked about security, we can talk about automation – taking a fairly
complex task of sending multiple emails and breaking it down into an easily modified
script we can use again and again!
lastname,firstname,commission-rate,gross-sales,email_address
West, Jon, 1.5, 1500, [email protected]
Westfall, Prof Jon, 3.25, 1000, [email protected]
Jon West, who likes to shorten his last name so it's easier to reserve tables at
restaurants, sold $1500 in product last week and has a 1.5x commission rate. (Because
my business is about to go out of business, I pay people more than they bring in each
week. My bad business practices should not be copied!) My professor identity gets
even more money. It's a good thing I'm using an open source product, because I won't
have much money left after I pay out more than I'm taking in. (Or do I just have a whole
salesforce that I pay much lower commission rates to? Probably!)
Taking a look at the following code, we can see that it borrows from our generic
mailing earlier to create a series of emails and then send them out:
library("gmailr");
library("dplyr");
library("purrr");
gm_auth_configure(path="client_secret.json");
form_data <- read.csv("form-mail.csv");
## Calculate commissions
form_data$commission <- form_data$commission.rate * form_data$gross.sales;
##
msg_from <- "Jon Westfall <[email protected]>"
body <- "Hi, %s. Your commission for this week is %s."
# Uses the mutate() function from dplyr to insert the correct variables.
sending_data <- form_data %>%
  mutate(
    To = sprintf('%s <%s>', paste(firstname, lastname), email_address),
    From = msg_from,
    Subject = sprintf('Sales Commission for %s', firstname),
    body = sprintf(body, firstname, commission)) %>%
  select(To, From, Subject, body)
# Use the pmap() and safely() functions from the "purrr" package.
emails <- sending_data %>%
  pmap(mime)
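The listing stops after building the MIME objects; a sketch of the sending step, wrapping gm_send_message() with safely() from purrr so that one failing address doesn't abort the whole batch:

```r
# safely() returns a version of the function that captures errors
# instead of throwing them; map() then sends each message in turn.
safe_send <- safely(gm_send_message)
results <- emails %>% map(safe_send)
# Collect any failures for review after the batch finishes.
failures <- results %>% map("error") %>% compact()
length(failures)  # 0 means every message went out
```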
Reading from the top down, we first see that we are loading three libraries: gmailr,
dplyr, and purrr. Their use is discussed below.
Next, we load our Gmail credentials and the CSV file that holds our form mails. Then
I calculate my commissions by multiplying my very generous rates by my gross sales.
Moving along, we get to the portion of the code that sets our From header as well as
our body. Since we're inserting information into the body based on each row, we need a
function to help accomplish that. We'll use the mutate() function from dplyr as well as
the base sprintf() function.
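To see sprintf() in isolation, here's what one pass over a single row amounts to (illustrative values only):

```r
# Each %s placeholder is filled, in order, by the remaining arguments;
# mutate() simply applies this row by row across the data frame.
body <- "Hi, %s. Your commission for this week is %s."
sprintf(body, "Jon", 1.5 * 1500)
#> [1] "Hi, Jon. Your commission for this week is 2250."
```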
Finally, we use two functions from the purrr package to format each email as
MIME and then to send it out. The safely() function provides a bit of insurance in
case one email in the middle of a large job has an error. The results can be seen in the
Outlook emails in Figures 6-37 and 6-38.
While researching this chapter, I came across an excellent blog post by Artur
Hebda at the Mailtrap.io blog (https://blog.mailtrap.io/r-send-email/). It provides
a few nice tweaks to my simplified examples, including tracking successes and
failures of the sent email and exporting the finished emails to a CSV file before they are
sent, as a record of exactly what went to each person. The blog post also covers some
of the other R email packages and gives example code. Their service, Mailtrap, is also
very interesting; I'll discuss it more in the final section of this chapter.
Once you have your basic form mail setup going, you can think about ways to extend
it. In our preceding example, we calculated a simple commission value. Here are some
more complicated things that are possible:
• Collecting the times everyone is available to meet via a Google Form,
downloading the data from Google Sheets via a CSV link (as outlined
in Chapter 2), and looking for the most commonly available time –
then emailing out everyone with the results.
• Drawing a winning raffle ticket number and then emailing all of the
raffle winners and losers with the results.
Email is an essential service for many, and it makes sense that you might find it useful
to integrate into R. However, it's not the only way to communicate information to the
world. In the next chapter, we'll discuss how to use R to seamlessly incorporate
information when building a presentation. And later in the book, in Chapter 9, we'll talk
about push notifications through a product named Pushover, which can be faster than
email and more robust. Email is just the beginning of where you can use R to increase
your productivity!
CHAPTER 7
Project 4: The R Powered Presentation
© Jon Westfall 2020
J. Westfall, Practical R 4, https://doi.org/10.1007/978-1-4842-5946-7_7
The Setup
Before we can address the task at hand, we’ll need to do a little bit of thinking and
preparing. The first task is, simply, deciding what you’d want to incorporate into your
presentation. A few thoughts come to mind:
• Quizzing and testing could be done this way as well – think about
offering a test preparation course that requires you to assess where
your students are struggling each day. The students could take an
end-of-day quiz, and the teachers could meet 10 minutes later to
review them as a group.
Thinking about your own task is the first step, and after that, one needs to consider
how to get the data and where to analyze it.
• Real-time data entry: Simply going around a room and asking each
person for their result and typing it into a spreadsheet or even directly into R.
• Data entry through a survey service that can export to CSV: This is
the method I chose for the following example. I set up a very quick
Google Forms spreadsheet that I connected to a Google Sheet and
then got the Google Sheet CSV file results (as mentioned in Chapter 2).
More robust survey systems could also be employed such as
LimeSurvey (see Chapter 3) or a commercial alternative. If it can
export to a “live” link or you can download a file from it, it could
conceivably be your source. I might stay away from some purpose-driven
options, though, such as tools that integrate into PowerPoint
to run polls. They often don’t support downloading of data in an
automated way (or sometimes in any way!).
One caveat to live data, however, is that it must be “cleaned” in some way. For
example, years ago I asked students to enter their pulse, in beats per minute, into a
Google Form. I found that even with very explicit instructions, I would still get more
information than just a two- or three-digit number. Think “89 bpm”, “89 beats per
minute”, “89bpm”, “89beats”, and so on instead of simply “89”. R can clean up some of
this data if you can predict what problems you'll have. However, it's always possible that
someone in your audience will come up with a new way to mangle their data before it
gets to your spreadsheet.
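A minimal sketch of that cleanup for the pulse example, assuming every usable answer contains the digits somewhere (note this approach would also strip a decimal point):

```r
# Strip every non-digit character, then convert; entries with no
# digits at all become NA with a warning.
raw <- c("89", "89 bpm", "89 beats per minute", "89bpm", "89beats")
as.numeric(gsub("[^0-9]", "", raw))
#> [1] 89 89 89 89 89
```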
Now that we’ve talked about sources of your data, let’s talk about the computer needs
you might have in analyzing it.
Computing Needs
Depending on your technology setup, you may need to do a little or a lot of planning
before you run your presentation. I'll outline the scenarios from easiest to most difficult:

• Best case (your sufficiently powerful laptop): Basic connectivity to projector and
audio; R running on your laptop with the script and packages ready to go; a tool to
open the PowerPoint presentation and project it.

• Second best (a powerful dedicated presentation computer you can install software
on): You can access the computer early and install R and the script/packages that
you'll need. Testing will need to be done to make sure the script runs.

• Mediocre (your underpowered laptop): If your laptop cannot crunch the numbers
fast enough, you may want to investigate using RStudio Server, which I'll discuss in
Chapter 8, to run the analysis somewhere else. You can then download the file it
creates to project from your slower computer.

• Somewhat painful (an underpowered or locked-down dedicated presentation
computer): In this scenario, you'll probably want to bring your own laptop or
remotely connect to your computer to run the analysis and create the file. Dropping
the file into cloud storage would then allow you to open it on the dinosaur computer
you're stuck with.

• Excruciating (an underpowered/locked-down dedicated presentation computer
without network access, in a room without Wi-Fi or cell signal): In this case, you're
probably not going to be able to create a live data presentation. If your participants
can connect, you might be able to run the analysis through your smartphone or
laptop and then report it verbally to the group.
We then watched both videos. After the videos, I provided a link to the group that
took them to the Google Form in Figure 7-2, which my participants then took on their
phones or tablets.
Because my participants watched two videos, one in which Emma read a simple
paragraph and one in which she read a more difficult one, I wanted not only the
overall Single Score Intraclass Correlation but also a breakdown by video. I
also wanted to print out the raw values from each person who rated the videos, so that I
could use those as I discussed the results.
The following code is what I used to take the data from Google Forms and analyze
it. Your code will almost certainly look different; however, I’m providing this code as
it reshapes the Google Forms data into a long format from its original wide shape (see
Chapter 5).
## Interrater Statistics
## Original Article in http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/
##
## Enter Data
data <- read.csv("PATH_TO_CSV_FILE");
## Take out Test Data
data <- data[-c(1:5),]
data <- data[,c(3:7)]
names(data) <- c("Rater","Pronunciation1","Breaks1","Pronunciation2","Breaks2");
library(reshape)
library(irr)
mdata <- melt(data,id.vars="Rater");
mdata$video <- NA;
mdata$video <- ifelse(mdata$variable == "Pronunciation1","EasyEmma",mdata$video);
mdata$video <- ifelse(mdata$variable == "Breaks1","EasyEmma",mdata$video);
mdata$video <- ifelse(is.na(mdata$video),"HardEmma",mdata$video);
mdata$score <- NA;
mdata$score <- ifelse(mdata$variable == "Pronunciation1","Pronunciation",
mdata$score);
mdata$score <- ifelse(mdata$variable == "Pronunciation2","Pronunciation",
mdata$score);
mdata$score <- ifelse(is.na(mdata$score),"Breaks",mdata$score);
mdata <- mdata[,-2]
## Pronunciation Data Only
pro <- subset(mdata,score == "Pronunciation");
br <- subset(mdata,score=="Breaks");
easyemma <- subset(mdata,video == "EasyEmma");
hardemma <- subset(mdata,video == "HardEmma");
pro <- pro[,-4];
br <- br[,-4];
easyemma <- easyemma[,-3];
hardemma <- hardemma[,-3];
pro <- cast(pro,video~Rater);
br <- cast(br,video~Rater);
easyemma <- cast(easyemma,score~Rater);
hardemma <- cast(hardemma,score~Rater);
# Looking at task, pronunciation or breaks. Higher correlation = easier to rate
print(head(pro));
print(icc(pro,model="oneway",type="consistency",unit="single"));
print(head(br));
print(icc(br,model="oneway",type="consistency",unit="single"));
print(head(easyemma));
print(icc(easyemma,model="oneway",type="consistency",unit="single"));
print(head(hardemma));
print(icc(hardemma,model="oneway",type="consistency",unit="single"));
> print(head(pro));
video rtr1 rtr2 rtr3 rtr4 rtr5 rtr6 rtr7 rtr8 rtr9
1 EasyEmma 3 2 3 0 0 0 0 4 3
2 HardEmma 5 9 7 4 10 5 5 7 14
> print(icc(pro,model="oneway",type="consistency",unit="single"));
Single Score Intraclass Correlation
Model: oneway
Type : consistency
Subjects = 2
Raters = 9
ICC(1) = 0.702
> print(head(br));
video rtr1 rtr2 rtr3 rtr4 rtr5 rtr6 rtr7 rtr8 rtr9
1 EasyEmma 7 5 9 6 5 9 23 10 0
2 HardEmma 7 9 8 9 20 11 44 28 10
> print(icc(br,model="oneway",type="consistency",unit="single"));
Single Score Intraclass Correlation
Model: oneway
Type : consistency
Subjects = 2
Raters = 9
ICC(1) = 0.178
> print(icc(easyemma,model="oneway",type="consistency",unit="single"));
 Single Score Intraclass Correlation
Model: oneway
Type : consistency
Subjects = 2
Raters = 9
ICC(1) = 0.474
> print(head(hardemma));
score rtr1 rtr2 rtr3 rtr4 rtr5 rtr6 rtr7 rtr8 rtr9
1 Breaks 7 9 8 9 20 11 44 28 10
2 Pronunciation 5 9 7 4 10 5 5 7 14
> print(icc(hardemma,model="oneway",type="consistency",unit="single"));
Single Score Intraclass Correlation
Model: oneway
Type : consistency
Subjects = 2
Raters = 9
ICC(1) = 0.267
Because I was going to be discussing these results with my colleagues, I did not put
them directly into the PowerPoint presentation. However, in the next section, I will do
exactly that by using a rather robust and useful R package named officer.
Now that we have the presentation template, I'll save the file and name it
"powerpoint.pptx" (because sometimes I run out of fun filenames). I've placed it inside
my RStudio project named "PresenteR".
Of course, because .pptx is not a file format that normally contains data, R doesn't
exactly know how to handle it at first. Thankfully, there is an excellent package named
officer (Figure 7-5) that can work with Office documents within R. As you can see in
Figure 7-6, it has a number of commands for working with PowerPoint files.
In Figure 7-7, you can see what happens when we install the package and load up the
PowerPoint file that we just created. The officer package reads the presentation and
tells us the layouts available to us and the template that they are using.
From here, we simply need to add a few features. First, I’m going to recreate my slide
from Figure 7-1, giving the instructions to my audience. I’ll do this by adding a slide and
putting the text on to it using this code:
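A sketch of that step with officer; the layout and master names ("Title and Content", "Office Theme") and the slide text are assumptions that depend on the template in use:

```r
library("officer")
# Open the template, add a slide, and drop text into its placeholders.
presentation <- read_pptx("powerpoint.pptx")
presentation <- add_slide(presentation, layout = "Title and Content",
                          master = "Office Theme")
presentation <- ph_with(presentation, value = "Instructions",
                        location = ph_location_type(type = "title"))
presentation <- ph_with(presentation,
                        value = "Watch both videos, then rate each one on the form.",
                        location = ph_location_type(type = "body"))
```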
Next, I’d like to add in the overall data tables for each dependent variable –
pronunciation and number of breaks. I can do that with this next snippet of code:
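A sketch of that step: ph_with() also accepts a data frame and renders it as a native PowerPoint table ("pro" here is the wide-format pronunciation table built earlier; the title text and layout names are assumptions):

```r
# Add a slide and place the pronunciation ratings as a table.
presentation <- add_slide(presentation, layout = "Title and Content",
                          master = "Office Theme")
presentation <- ph_with(presentation, value = "Pronunciation Ratings",
                        location = ph_location_type(type = "title"))
presentation <- ph_with(presentation, value = as.data.frame(pro),
                        location = ph_location_type(type = "body"))
```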
The output of these slides can be found in Figure 7-9. You'll notice the table overshot
the slide a bit there; I simply grabbed the lower edge and resized it to fit the slide in
Figure 7-10, a one-second fix that is a bit annoying but not difficult. The officer
package unfortunately has some issues properly sizing data tables. There are options you
might want to explore that can reformat tables before placing them into a slide, such
as the flextable package.
Now we’re rolling. Next, we’ll tackle adding in the results of the Single Score
Intraclass Correlation. I could do this by simply outputting the results, but I’ll format it
to be a little more English-readable by accessing the elements of the output of the icc()
function.
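A sketch of that formatting idea: the object returned by irr's icc() behaves like a list, so its elements can be pulled into a sentence. A stand-in list is used here in place of a real result, and the element names ($subjects, $raters, $value) are taken from the irr documentation:

```r
# Stand-in for the object a call like
#   res <- icc(pro, model = "oneway", type = "consistency", unit = "single")
# would return.
res <- list(subjects = 2, raters = 9, value = 0.702)
sentence <- sprintf(
  "Across %d subjects and %d raters, the single-score ICC was %.3f.",
  res$subjects, res$raters, res$value)
sentence
#> [1] "Across 2 subjects and 9 raters, the single-score ICC was 0.702."
```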
Finally, once I’m done making my slides, I can compile them into a PowerPoint show
with the following line of code:
print(presentation, target="completed-powerpoint.pptx");
At this point, one could simply continue adding as much or as little information as
they like. The example pages for the officer package show how to add graphics, images,
and more depending on what you’d like to show off. Another added benefit of creating
your PowerPoint presentation like this is that you can easily create batches of slides with
minimal code changes. Perfect for long reports that contain a lot of repetition! You can
also combine what we've done here with what we've seen in earlier chapters, such as
emailing the presentation directly to someone.
Now that you’ve got your code all set, let’s talk about presentation flow!
If you've had to run R on another machine, you may need to open your new
slide deck after downloading it from cloud storage. You might also need to run the
R script remotely, depending on your setup.
And that's it: we've successfully embedded nearly live data into our R-generated
PowerPoint presentation. If you don't use PowerPoint and prefer other formats, the
earlier examples using RMarkdown might serve you better for creating a PDF or Word
version to share. However you do it, it is tremendously powerful to show an audience
their own data, whatever point you're trying to make.
This brings us to the last section of our book. In the next chapter, we’ll talk about R
running without some of the limitations we’ve discussed so far (e.g., running on your
laptop, not accessible on iOS or a Chromebook, hard to collaborate with others, etc.).
We’ll be running R on a server, away from your computer, in a space that you and others
can easily access from nearly anywhere. We’ll also talk about other ways to show off data,
using a product called Shiny. We’re taking R anywhere we need to be!
CHAPTER 8
R Anywhere
As we've seen in the first two parts of this book, R can really help you in very noticeable
ways: formatting and manipulating data, analyzing complex problems and relationships,
and even automating some of your daily tasks. One problem, though, is that we have
to actually have a computer running R in order to do these things. As I mentioned very
far back in Chapter 1, R runs on a lot of systems: Macs, Windows PCs, Linux desktops
and servers, some Android devices, and even devices like a Raspberry Pi. But it doesn't
run everywhere. In a time when many people are turning to Chromebooks and Apple
iOS devices, not having R can be frustrating. And even if R runs on your device,
you need to be physically at your device to run something, and a longer job may mean
that your computer is tied up for hours chugging away at it. In this chapter,
we'll discuss a way around this limitation: putting R "in the cloud" or, in other terms,
somewhere else, and remotely accessing it. We'll see that there are some really useful
tools that help us achieve that, and all we need is a spare computer at home, a virtual
private server on the Internet, or even a cloud computing service account like Amazon
Web Services!
© Jon Westfall 2020
J. Westfall, Practical R 4, https://doi.org/10.1007/978-1-4842-5946-7_8
Software
Here’s the software you’ll need in order to achieve your R Anywhere dream. Along the
way, I’ve given my tips and tricks for getting it installed.
R
Well, this one was kind of obvious. Using the instructions I gave in Chapter 1,
you should start by installing R on your target machine or cloud. By default, R
installs not only the graphical environment we've used for several projects but also its
command-line interface and its script interface, Rscript. While not very pretty, R is
perfectly functional in a command-line environment, as I'll discuss in a few moments.
One caveat to installing R on a machine that isn't sitting right in front
of you, with your user account logged in, relates to packages. If you're installing R on a
Unix-style machine, such as a Linux server, you'll need to think about which user account
will be running R in the following examples. If your own account runs R in every case,
packages installed in your local library will cause no issues. However, if you decide
to use some of the more advanced R Anywhere tips below, such as running a Shiny
Server, you may need to install packages as the root user so that they are available
to all accounts. Tutorials on the Web often subtly indicate this in the command they
tell you to run to install packages. As an example, the command provided to install
the shiny package (discussed later) begins with sudo su, which executes the code as the
root user on your machine.
If you don't realize that packages must be installed by root and available to all, you
might find yourself in the maddening situation of knowing you installed a package using
the install.packages() function in R, yet seeing error logs that tell you the package
cannot be loaded. Executing the code under your local user account will likely work fine,
guaranteeing your frustration since everything looks OK to you!
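For reference, the pattern from RStudio's own install instructions looks roughly like this, shown here for shiny (the repository URL may differ in current documentation):

```shell
# Run the install as root so the package lands in the site-wide library
# that every account (including the shiny user) can see.
sudo su - -c "R -e \"install.packages('shiny', repos='https://cran.rstudio.com/')\""
```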
RStudio Server
For a good portion of our book, we've used RStudio, an open source IDE for R. RStudio
PBC, the company behind RStudio, also makes a version of RStudio named RStudio
Server. It's a "server" in the sense that it typically runs on a computer you designate as a
server, but I actually think of it more as RStudio Web. While the professional version offers a
ton of resources around project sharing, multiple R versions and sessions, administrative
and security tools, and auditing, the open source edition is perfect for an individual user
or a small group. Here I’ve opened our PresenteR example from Chapter 7 in RStudio
(Figure 8-1). Copying the R project files over to my RStudio Server, I log in (Figure 8-2)
with my username and password and open the same project (Figure 8-3). As you can
see, aside from the RData file not being read properly to restore the workspace, and a
message that certain packages that I have required are not installed, things look pretty
similar. The difference is, though, that I now have these files and interface running in a
web server. Figure 8-4 is the same screen loaded up on my iPad – bringing R to a device
that I formerly wouldn’t have been able to use it on!
One thing to note with RStudio Server is that it is limited to one browser connected
to a session at a time. When I loaded up the server on my iPad, I found the error in
Figure 8-5 on my desktop. Hitting reconnect got me back to my working session.
Shiny Server
Shiny Server is another excellent tool from RStudio PBC. Imagine wanting to share
data with the world, perhaps even allowing people to work with the data to visualize or
analyze it however they choose. While you could create an RStudio Server that
everyone logged into, that would likely be overkill. Shiny Server serves small R scripts
called Shiny apps. These are interactive applications that display, manipulate, or analyze
data so that the information is easier for a person viewing a web page to understand.
In other words, it lets a viewer grab the graph and change it to suit how
they would best understand or explore the data.
Like RStudio Server, Shiny Server comes in two flavors – an open source edition
and a professional version named RStudio Connect. One downloads Shiny Server by
going to https://rstudio.com/products/shiny/download-server/ and choosing your
platform. There are a few issues that you will want to consider before you install and
during your installation:
• The open source version of Shiny Server does not support any sort
of native authentication or encryption (e.g., data is sent in cleartext
between your computer and the server and is in cleartext on the
server). This means that you can easily deploy it, but it also can be
vulnerable to people snooping around your data. You’ll want to either
run your Shiny Server on an internal server to your network or find a
secure way to serve just what you need to serve from it out onto the
Internet through your firewall.
• Shiny Server requires several packages be installed by root in
order to function properly (see earlier). These packages include
shiny (obviously!), rmarkdown, and others.
• Shiny Server will, by default, serve files under a document root, such
as /srv/shiny-server. These files are owned by root, which can
make them hard for regular user accounts to read and write. You'll
want to think about where you want to store your files so that you can
easily update them as needed. You can modify the document root
and the log directory in /etc/shiny-server/shiny-server.conf,
as seen in Figure 8-7. The Shiny Server documentation also addresses
the issue of multiple users and how to allow everyone to have their
own "publishing" space (e.g., off of their home directory or another
central location on the server).
As mentioned earlier, once Shiny Server is installed, you can visit its interface, by
default, at port 3838 on your machine. If it's configured properly, the two example scripts
at the right will load their data, as seen in more detail in Figures 8-8 and 8-9. Each of
them can be adjusted by the user to show data differently. The example in Figure 8-8, a
sample app named "hello", allows the user to adjust a slider to display data in different-
sized bins (as I've done in Figure 8-10). The example in Figure 8-9, a sample app named
"rmd", demonstrates the use of RMarkdown to display the requested data. Changing the
region in the drop-down box (Figure 8-11) changes the data, as seen in Figure 8-12.
If you think this looks like it would take a lot of work to create, it actually isn’t as
complex as it looks. Shiny apps consist of, at bare minimum, a function to tell Shiny
Server what to do and an interface that tells Shiny Server how to display the data. We’ll
build a few simple Shiny apps below, and you can also jump in by viewing the
tutorial video at https://shiny.rstudio.com/tutorial/.
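To make that two-piece structure concrete, here is a minimal app in the spirit of the "hello" sample mentioned earlier (a sketch, not the shipped source):

```r
library("shiny")

# The interface: what the visitor sees and can adjust.
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

# The function: what Shiny does whenever an input changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruption times")
  })
}

# shinyApp(ui, server)  # uncomment to launch locally in a browser
```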
R Packages
Finally, the last piece of software we're going to need is the set of R packages we use to
create our applications or projects. Later in this chapter, I'll highlight some of those
packages in a section of ideas you may want to investigate for your "always-on" R server.
Speaking of which, let's talk about where that R instance lives.
Hardware or Cloud
R Anywhere is only possible if R is somewhere. That somewhere might be a few
different places, depending on your needs. Here are a few options, along with the pros
and cons of each:
A computer you own/use: If you have an old computer lying around at home or work, you
could install R, RStudio, and Shiny on it.
Pros:
• Ability to access the hardware directly in case of issue
• Physical security – you know exactly where your data is
• Easy to upgrade
• Low cost
Cons:
• May not have the fastest upload connection to the Internet, slowing down remote access
• May not have full control over security, especially if the computer is at work – you
might not be able to access it over the Internet, or you may need to set up and use a
virtual private network (VPN; an encrypted connection between your location and your
home or work network)
• If the hardware is older, it may be more prone to fail or may take an exceedingly long
time to perform simple tasks
• If you break it, you need to fix it – no one is available to call
A portable device (e.g., a Raspberry Pi): It is possible to run R on a device you take with
you, such as an older laptop or an ultra-small form factor computing device like the
Raspberry Pi.
Pros:
• You literally are taking it with you anywhere you need to go, so Internet access is not
required.
• It could be powered in your bag, using a portable battery, to allow analyses to
continue while you're on the move.
• Data is well safeguarded in that it's not connected to the Internet, or is only
connected when you're using it.
• Not going to lie – it's pretty cool to show off to friends!
Cons:
• The computing requirements may be more than the device can handle.
• You will need some way to interface with the device while on the go, such as an ad
hoc wireless network or a cabled interface.
• It's another thing to carry, in a world where we increasingly want to be more portable!

RStudio Cloud: Recently RStudio PBC has started offering RStudio Cloud free of charge
(Figure 8-13).
Pros:
• Allows you to easily get started with RStudio and Shiny in the cloud
• Allows for collaboration and easy sharing/modifying of existing R projects
Cons:
• In beta test now, and may not be free forever
• Data uploaded is in the care of RStudio PBC, which means you may not be comfortable
uploading sensitive data.
As you can see, there are a lot of options for where to put your RStudio or Shiny Server
project. Not all options will be appropriate for everyone, and in some cases, you may
have a bit of a hybrid. I could imagine hosting a simple project on RStudio Cloud that
doesn’t contain private data, hosting a multi-user collaborative project on a VPS or cloud
service, and hosting my own “automation server” at home off of an old laptop. Since the
software we’re using is free from licensing costs, you might be able to run all three of
those examples without spending little more than your time.
Use Scenarios
Now that we’ve covered what you can do and where you can do it, let’s walk through a
few examples. We’ll cover the following three scenarios in a broad sense in this section.
In the next section, Scripts to Tinker With, I’ll provide three short projects that you could
take and modify for your own use. Let’s start with perhaps the simplest interface – the
command line.
Using this interface, I can execute basic R functions by typing them in. If I want to
plan ahead, however, I can also write an R script that I can execute using the Rscript
command. The following script, named test.R, outputs Figure 8-16 when I run Rscript
test.R:
This script shows the basics of getting output from a script that's running in
"headless" mode – meaning without interactivity by the user. Using the print()
command outputs the text with an index prefix such as [1], which is what we're
generally used to in R. However, if I want to print something without that prefix, I can
use the cat() function. Adding \n inside my output creates a line break, which helps
clean up the output a little bit.
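The listing for test.R appears only as a figure in the original, so here is a minimal sketch consistent with the description above; the exact strings printed are my own assumption:

```r
# test.R – run from the command line with: Rscript test.R
print("Hello from print()")   # appears with an index prefix: [1] "Hello from print()"
cat("Hello from cat()")       # appears as bare text, with no prefix
cat("Line one\nLine two\n")   # \n creates line breaks to clean up the output
```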
We can get a little more fancy by modifying one of our earlier examples. In Chapter 2,
we used the following example:
OurData = ("
Student Pretest Posttest
A 25 27
B 23 23
C 21 22
D 23 29
E 23 24
F 21 19
")
Data = read.table(textConnection(OurData),header=T)
t.test(Data$Pretest,Data$Posttest,paired=T)
Imagine that I’d like to output the t-test results in APA style vs. the full output. I could
change the script to this:
OurData = ("
Student Pretest Posttest
A 25 27
B 23 23
C 21 22
D 23 29
E 23 24
F 21 19
")
Data = read.table(textConnection(OurData),header=T)
res <- t.test(Data$Pretest,Data$Posttest,paired=T)
cat(paste('The output is t(',res$parameter,')=',round(res$statistic,digits=2),', p = ',round(res$p.value,digits=3),'\n',sep=""))
We can also pipe in arguments to the script from the command line. Here’s our final
iteration of the script:
args = commandArgs(trailingOnly=TRUE)
OurData = ("
Student Pretest Posttest
A 25 27
B 23 23
C 21 22
D 23 29
E 23 24
F 21 19
")
Data = read.table(textConnection(OurData),header=T)
res <- t.test(Data$Pretest,Data$Posttest,paired=T)
cat(paste('The output, ',args[1],', is t(',res$parameter,')=',round(res$statistic,digits=2),', p = ',round(res$p.value,digits=3),'\n',sep=""))
One can imagine how this might be useful when combining with the concepts we
talked about in Chapters 6 and 7. Imagine having a series of pre-built presentations that
require the use of live data. The data changes every day. A colleague, Bob, asks you to
send him the most up-to-date version you have. You pull out your phone, log into your
R server, and type a command similar to Rscript presentation-email.R "bob@test.
com". Within a few moments, a freshly written PowerPoint file (see Chapter 7) is sent
to Bob (see Chapter 6) with the most up-to-date data pulled from a database server or
public website (see Chapter 2). You may even throw in some fancy analyses similar to
what we did in Chapters 3 and 4 and include another attachment that is the formatted
data (Chapter 5). Bob will likely be overwhelmed!
Command line, while powerful, is not everyone’s cup of coffee or tea. Let’s explore
RStudio Server a bit more and see some interesting uses there.
install.packages("later")
install.packages("jsonlite")
install.packages("curl")
library("later")
library("jsonlite")
This code will download the top Today I Learned every 15 minutes and output it into
the R console. It will keep doing that until we restart the R session (using the Session ➤
Restart R command seen in Figure 8-20). As this is running on our server, we can safely
close the web browser, and when we return a few hours later, we can see our output (see
Figure 8-21 and Figure 8-22). You could imagine using this to track any historical data
you’d like – download and scrape data off of a web page every few minutes or hour. Track
daily temperatures or stock quotes. Whatever you want to track, your RStudio Server will
just keep on ticking, as long as you tell it to.
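The loop itself is shown as a figure in the book; reconstructed from the description above, it might look like the sketch below. The subreddit URL and the timestamped output format are my assumptions:

```r
library(later)
library(jsonlite)

gettil <- function() {
  # Download the current hot listing for r/todayilearned and print the top title
  til <- fromJSON("https://www.reddit.com/r/todayilearned/hot/.json")
  cat(format(Sys.time()), "-", til$data$children$data$title[1], "\n")
  # Reschedule this function to run again in 15 minutes (900 seconds)
  later(gettil, 900)
}
gettil()
```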
Now that we’ve played around with RStudio, let’s build a simple Shiny app – I think
you’ll find that you already have most of the skills necessary!
At this point, the template can actually be run all by itself – it’s a functioning single
file app! Clicking “Run App” in the upper corner of Figure 8-24 launches the Shiny app
in a local shiny server web server (see Figure 8-25). You may experience a slight hiccup
if you have your browser set to block popups (in the form of the box in Figure 8-26), but
eventually, you'll see the Hello application in its own browser window (Figure 8-27).
Now that we’re up and running, we can start modifying the default code to do
what we need it to. My first example is pretty simple – I still want the slider from the
Hello example, but I want to pipe that value into the rlatin() function from the magic
package, which will generate a Latin Square. The following code is broken into two
sections: the ui section, which tells Shiny what objects I'd like on the web page, where
they should go, how they should look, and what they should be called; and the server
section, which tells Shiny what to do when an interactive element, such as the slider,
is used. The following code transforms the Hello example into my Latin Square example:
library(shiny)
library(magic)

ui <- fluidPage(
    titlePanel("Latin Square Generator"),
    sidebarLayout(
        sidebarPanel(
            sliderInput("steps",
                        "Number of steps:",
                        min = 1,
                        max = 10,
                        value = 3)
        ),
        mainPanel(
            tableOutput("latin")
        )
    )
)

server <- function(input, output) {
    output$latin <- renderTable({
        rlatin(input$steps)
    })
}

# Run the application
shinyApp(ui = ui, server = server)
Running this app, I can see the output alongside the code in Figure 8-28. Adjusting
the slider value automatically calls the server portion and the output is re-calculated, as
you can see in Figure 8-29.
Assuming I’m happy with my code, I can deploy it to my Shiny Server in a few ways.
First, there is a Publish option in RStudio that can push the code to your server. In my
case, my Shiny Server is running on the same machine as my RStudio Server. So all I had
to do was copy the files using a command like this:
I see what the problem is – the magic library isn’t installed for the root user, only my
account (as I mentioned earlier). Logging in as root, starting R, and using
install.packages(c("magic","abind"), lib="/usr/local/lib/R/site-library") to install the two
packages that I need (magic and a package it relies on, abind) to the site-library, my app
loads up fine (see Figure 8-31).
From here, it’s really up to you to think about what you want to build. Shiny allows a
number of different inputs and outputs. One that I found particularly useful in teaching
statistics was the fileInput() – it accepts any file and then pipes it through to the server
function. I would have my students collect data, enter it into a CSV file template that I
provided, and then upload the completed CSV file to my Shiny Server. Students could
then get their data analysis and interpret it!
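The exact app isn't reproduced in the book, but the fileInput() idea described above can be sketched like this; the summary() analysis stands in for whatever analysis the students would actually run:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Upload Your Data"),
  sidebarLayout(
    sidebarPanel(
      # Accept a CSV file from the student
      fileInput("datafile", "Choose a CSV file", accept = ".csv")
    ),
    mainPanel(verbatimTextOutput("summary"))
  )
)

server <- function(input, output) {
  output$summary <- renderPrint({
    req(input$datafile)                      # wait until a file is uploaded
    df <- read.csv(input$datafile$datapath)  # the uploaded file's temporary path
    summary(df)                              # pipe the data into an analysis
  })
}

shinyApp(ui = ui, server = server)
```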
At this point, your brain is probably spinning with a few possibilities of your own
for RStudio Server and Shiny Server, and so the last section of this chapter is devoted
to some ideas that I had, which you may want to consider for your own scripts. I’ve
included links to a few packages that will, no doubt, also help get your ideas flowing!
R Anywhere Ideas
In this last section, I’m going to highlight a few automation scenarios that you might
want to investigate. I’m also going to point out some packages that you’ll probably find
helpful!
The Regularly Tweeted Report: Today a decent amount of news gets consumed through
social media. Sharing statistics about something you are passionate about can be a way
to help educate others. Perhaps you want to regularly tweet out CO2 emissions, or sports
performances, or the number of people who have signed an online petition. How could
you automate that? A few packages would likely help – twitteR allows you to post tweets
to your timeline, and we've seen an example above of the later package allowing you to
have something happen every so many seconds. Or you may choose to write up a script
and use a task scheduler, like cron, to regularly call Rscript. You can even use a package
like cronR to modify and manage this job directly from within RStudio Server!

The On Demand Resume Email: Job hunting is hard. Tailoring your resume to a
particular job can be even harder. If you've got a wide set of skills, you may want to have
different versions of your resume to send based upon the job you want to apply for. The
resumer package in R can generate an attractive resume using RMarkdown. Imagine
having a master file of resume material and then writing a function to generate a
resume using portions of the master script. You could then couple it with the packages
we mentioned in Chapter 6 to write an email cover letter, and attach the appropriate
file, to send to a recruiter or hiring manager. If the job is centered on workplace
automation and productivity, you may want to even mention how the resume email
they received was generated!
The APA Table Uploaded to a Website: In many industries, regular reports must be
made available to the general public. Those reports might need to be standardized and
regularly updated, all of which can be done with R. First, we need to make our data look
professional – one very nice package for that is apaTables, which will take your output
from R and form it into an APA Style table, which it can then save directly to Microsoft
Word format. Revisiting our friend the officer package, we could create the rest of the
report. An open source tool named Pandoc (http://pandoc.org) could be used to create
PDF documents from the Word documents, which you could then combine into one
PDF with the pdftools package and upload to your web server using the ssh package!
A little exhausting to set up – but imagine how this could save an hour or two a month
if you had to prepare this report regularly by hand!

Building a Process Monitoring Dashboard: Perhaps you work in a manufacturing
environment where processes need to be monitored at a high level to ensure
productivity. Likely you have a lot of equipment that can either send data to a database
or have data pulled from it by a script. R is a great place to centralize and monitor.
bupaR is a suite of R packages (www.bupar.net) that allow you to pull in process data,
analyze it, and display it. In particular, using the processmonitR package would allow
you to produce easy-to-read performance dashboards that could be published to a
Shiny Server. Since we've already seen notification methods such as email (in Chapter 6)
and will see another notification method in Chapter 9 (the pushover package), your
dashboard could also be a proactive early-alert warning system!
Perhaps something from the table struck you as related to your job, but if not, the
longer examples in the next two chapters might help. In the next chapter, the
Change Notifier, we’re going to return to a concept we introduced in Chapter 2 – web
scraping. But we’re going to take it up a notch by notifying you anytime something
we’re “watching” changes. And if it’s an emergency, we’ll make sure you don’t miss the
notification by using a versatile service named Pushover. And in our final chapter, we’ll
build out an R personal assistant. He or she can put together a daily briefing for you, and
even read it to you, with the help of other smart assistants you may already have!
CHAPTER 9
Project 5: The Change
Alert!
In late January 2020, I sat at my desk thinking about the projects for this book. I had
come up with all of the projects except one, the one that we're going to talk about in this
chapter. Looking for inspiration, I went down the hallway to my colleagues and asked
them what “pain points” they dealt with on a daily basis. One of them lamented to me
that she wished another office would let her know when they added information to a
report. Because they didn’t, she had to pull the report every day, just to see if anything
had been added. It got me thinking about how many times I do a similar thing – check
information just to see if it has changed – thus, the inspiration for this chapter!
We live in a world filled with notifications. We get so many that our smartphones
include entire configuration settings around managing the notifications we get so we
don’t get overwhelmed. But amid all of these notifications, we still don’t always know
about things when we need to. Forget to check a report for a week and you may find
yourself very behind if it’s been a busy week. Other weeks, nothing may be added to the
report at all, and you’ve wasted your time pulling it daily. In this chapter, we’ll talk about
the types of data you might want to monitor, how you could monitor it using R on an
RStudio Server, and how to get around some of the limitations that you may have when
it comes to interruptions (e.g., your computer loses power) or timely notification (e.g.,
email might not be fast enough!). Hopefully by the time we reach the ChangeAlert script,
at the end of this chapter, you’ll be ready to hit the ground running and roll your own
business notification system!
© Jon Westfall 2020
J. Westfall, Practical R 4, https://doi.org/10.1007/978-1-4842-5946-7_9
Chapter 9 Project 5: The Change Alert!
However, the following code will find these two needles in the haystack pretty
quickly:
install.packages("arsenal")
library(arsenal)
df1 <- read.csv(file="survey_file_1.csv")
df2 <- read.csv(file="survey_file_2.csv") # 10 new rows of data, and 2 demographic changes made
comparedf(df1,df2)
summary(comparedf(df1,df2, by="Response.ID"))
Running that code, we get a very comprehensive report that includes the following
two snippets:
statistic value
------------------------------------------------------------ ------
Number of by-variables 1
Number of non-by variables in common 10
Number of variables compared 10
Number of variables in x but not y 0
Number of variables in y but not x 0
Number of variables compared with some values unequal 2
Number of variables compared with all values equal 8
Number of observations in common 49
Number of observations in x but not y 0
Number of observations in y but not x 11
Number of observations with some compared variables unequal 2
Number of observations with all compared variables equal 47
Number of values unequal 2
Right away we can see that there are 11 observations in y (our second data frame)
that aren’t in x (our first data frame). Farther down in the list, we actually get a table that
lists the Response.IDs of each observation that isn’t shared.
And what about those two demographic pieces that I changed? We can find them
easily too:
var.x var.y Response.ID values.x values.y row.x row.y
--------------------- -------------------- ------------ --------- --------- ----- -----
What.is.your.gender. What.is.your.gender. 29 2 1 29 29
How.old.are.you. How.old.are.you. 20 29 27 20 20
These might have been changed for legitimate reasons (e.g., the person entering the
data made a typo that they fixed later in the day, after the first report had gone out), or
they might have been changed for less than legitimate reasons. Security professionals
will tell you that individuals who break into computers can modify log files to cover
their tracks. Imagine having a central monitoring script that recorded user activity and
periodically auditing the logs to see if the user count was retroactively changed. You
might find something before it’s too late.
And as I mentioned earlier, comparedf() has some features that other packages
lack. What if your column names are slightly different (“Response.ID” vs. “response.
id”)? It can handle that using the tol.vars option. What about having a level of
acceptable difference? Pass tol.num.val with an absolute difference, and a difference will
only be flagged when it exceeds that threshold. comparedf() can also support user-defined
tolerance functions, which means you can customize your criteria even more. Example
1 in the comparedf vignette shows how to allow items with a newer date to be ignored,
suggesting that those differences are intentional updates.
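A sketch of those two tolerance options, reusing df1 and df2 from the earlier listing; the tolerance value of 2 is just an illustration:

```r
library(arsenal)

# tol.vars = "case" matches columns whose names differ only in case
# (e.g., "Response.ID" vs. "response.id"); tol.num.val sets an absolute
# numeric difference below which two values are treated as equal.
ctrl <- comparedf.control(tol.vars = "case",
                          tol.num = "absolute",
                          tol.num.val = 2)
summary(comparedf(df1, df2, by = "Response.ID", control = ctrl))
```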
If only all data could be that nice. And it’s possible it might be. It is possible to coerce
a lot of data into a data frame in some way. However, in doing so, you may lose some of
the things that make that data special – like the actual content! What might we do if the
content is what we want to monitor, through either downloading it via an API or directly
from the rendered web page or document? Enter the next two scenarios!
install.packages("jsonlite")
library("jsonlite")
hot <- "https://www.reddit.com/hot/.json"
hot.df <- fromJSON(hot);
hot.df$data$children$data$title[1]
As we’ll see later, once I have the item, there are ways in which I can store it and then
compare it later. In the following example, we’ll actually check it once an hour to see if it
changes, although you can check it as often as you like. Well, almost…
Here’s a caveat about API data – not all of it is free. Many websites understand that if
you’re using an API to access their content, that’s fewer eyes that will be on their actual
web page. And while it’s nice of you to lighten their server load, you’re also lightening
their pockets by reducing their ad revenue. So many sites require that you provide
an API key in order to access their API. You purchase that key and a certain number
of data calls with it. Reddit itself allows up to 60 requests per minute and requires
that you authenticate if you’re building a client application. So don’t look to bog down a
server with a ton of calls, because even if you are able to do a few, you might find yourself
blocked if you’re grabbing the same data 10–20 times per minute.
In situations where the data is expensive or an API isn’t available, what option do we
have? The majority of the time our best option then is to use some form of web scraping,
like I discussed in Chapter 2.
The first is actually easiest thanks to the diffr package. The following code grabs
a copy of the homepage of Delta State University from the Internet Wayback Machine
at archive.org at two separate points. It then analyzes the differences and produces a
difference report of the raw HTML that highlights the differences in file2.html vs.
file1.html. As we can see from the report, quite a bit of code changed – mostly due to a
difference in file paths.
install.packages("diffr")
library(diffr)
install.packages("shiny");
library("shiny");
addr <- "https://web.archive.org/web/20200102195729/http://www.deltastate.edu/"
addr2 <- "https://web.archive.org/web/20191202220905/http://www.deltastate.edu/"
download.file(addr,"file1.html");
download.file(addr2,"file2.html");
diffr("file1.html","file2.html", contextSize=0);
diffr can work on any text file to create a difference report. It’s not going to be your
solution, however, for compiled files such as a PDF. As of this writing, there isn’t a utility
in R to compare PDF documents; however, it does exist for Python. And as hinted earlier,
diffr is not going to be your solution if the document you download has specific data
you need. For that, rvest is your best option.
However, as a caveat, rvest or any other package is going to be pretty worthless
if the data you download doesn’t include the data you need. Many web pages today
use JavaScript to download data after the page is loaded, through Asynchronous
JavaScript and XML (AJAX) requests. When you download a web page using the download.file()
function, it downloads the raw HTML. That HTML might just include placeholders for
the data that you want, because that data gets loaded later in your web browser. This
means that rvest won’t be able to find anything. Later in this chapter, we’ll look at a
solution for this in our ChangeAlert script, by emulating the page loading experience
in a “headless” web browser. For now, let’s turn back to the basics of detecting change.
Because once we can get the data one time, we can get it in the future as often as we like
to compare against that first copy.
This is great if I know I want to run something at 8:00 AM every workday, or every
hour on the hour. It’s not so great if I want to run something over a shorter period of time
(e.g., every 10 seconds) or if I want to monitor the output in real time, able to start and
stop the task as needed. For that, we need a package that will let us run something on a
loop. Enter the aptly named later package and function.
later() allows you to specify a function and a delay. Take a look at the following code:
install.packages("later")
library("later")
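The listing itself appears only as a figure; a sketch consistent with the description that follows (the 10-second delay and printed message are assumptions based on the surrounding text):

```r
library(later)

loop <- TRUE
myfunc <- function() {
  cat("This is the output of the scheduled later loop\n")
  # Reschedule only while the global loop variable is TRUE; setting
  # loop <- FALSE lets the already-scheduled run fire once more, then stop
  if (loop) later(myfunc, 10)
}
```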
Typing myfunc() into R after running that code will print "This is the output of the
scheduled later loop" repeatedly while the loop variable remains TRUE. Type loop <-
FALSE and press Enter, and you'll get one more run of the myfunc() command and then
it will be done.
We can also interrupt the later() function by using code such as this:
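This version is also shown only as a figure; based on the description, it relies on later() returning a canceller function, and might look like:

```r
library(later)

myfunc <- function() {
  cat("This is the output of the scheduled later loop\n")
  # later() returns a function that cancels the pending run; storing it
  # globally lets us type cancelfunc() at the console to stop immediately
  cancelfunc <<- later(myfunc, 10)
}
myfunc()
```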
That code will start the function, and it will keep going every 10 seconds as it
“reschedules” itself each time it runs. However, typing cancelfunc() into R and pressing
Enter will cause the loop to be cancelled. Unlike the first method, you don’t get an
additional run out of the code – it stops as soon as you execute cancelfunc(). If you’d
like the option to cancel after the next run (using the loop variable) or before the next
run (using cancelfunc()), you can modify the code to this:
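A sketch of that combined version, merging the loop-variable check with the stored canceller (delay and message again assumed from the earlier examples):

```r
library(later)

loop <- TRUE
myfunc <- function() {
  cat("This is the output of the scheduled later loop\n")
  # loop <- FALSE stops after the next run; cancelfunc() stops before it
  if (loop) cancelfunc <<- later(myfunc, 10)
}
myfunc()
```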
The best of both worlds, at least in terms of loop flexibility. And flexibility is
important, because just like humans don’t always wake up at the right time of the
morning, our computers can have off days as well. Power outages, data connectivity
issues, confused system administrators who reboot the system while people are using
it, and more all mean that your script might not keep running in perpetuity as you
intended. Let’s talk about a little bit of redundancy.
As long as my R environment is saved when I exit R, I can always come back and
compare against this. It’s fast, and it’s also easy to modify if I want to test my script – I
don’t have to wait for the top hot title to change, I simply need to change tophothist to
something new and my script will detect it as a change.
Notification and Acknowledgment
So how do you want to be disturbed? Or how disturbed do you want to be by technology?
That’s the question for this next section. We’ll talk about various ways that we can learn
about our change.
If you want to get a little bit fancier than a simple print output, you can use one of the
several packages that R has for logging information. futile.logger has a ton of features
for multiple logs and different notification thresholds. Combining those together, you
could choose to silently log certain levels of change, without disturbing you until you
want to check on the results later. Another option would be to have a pop-up dialog box
or alert. Something like the svDialogs package will do the trick:
install.packages("svDialogs")
library(svDialogs)
dlg_message("This is a test")
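For the silent-logging route mentioned above, a minimal sketch using futile.logger's flog.* functions; the log file name and messages are illustrations:

```r
library(futile.logger)

flog.threshold(WARN)                        # only WARN and above print to the console
flog.info("Minor change detected")          # silently suppressed at this threshold
flog.warn("Report gained %d new rows", 11)  # shown immediately

# A second, named logger can quietly record everything to a file for later review
flog.appender(appender.file("changes.log"), name = "audit")
flog.threshold(TRACE, name = "audit")
flog.info("Minor change detected", name = "audit")
```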
These methods are great if you are running your code in a machine you can easily
access and view; however, they might not work if you’re scheduling your script (or at
least not work in the ways that you would find useful or intuitive). Thus, you may want to
send a notification elsewhere, in a few different ways that we’ll explore.
Email/Text
Perhaps the most intuitive way to be notified would be through email or text message.
These methods are fairly well used and accepted. We’ve seen how to send emails in
Chapter 6, and dedicated texting options do exist. While this might be your knee-jerk
notification go-to, it’s worth pointing out a few potential problems. First, email can be
held up or dropped for looking untrustworthy. Sending a message that says “This is a
notification that your job has finished” will look rather suspicious to most email filters. Your
message may be dropped without you even realizing it. The same goes for text messaging.
Unless you’re using a commercial service designated for texting, you might run into the
same “suspicious” problem. At the very least you’ll have to figure out how to get your email
to text, which can be confusing as this feature isn’t as widely used as it once was.
At the end of this section, we’ll talk about a method that I endorse over email and
text for its reliability and customizability, named Pushover. However, there is one more
way you might get your notification across – updating your web page.
Pushover
To understand Pushover, you need to think back to a time when push notifications were
first being widely used, almost 10 years ago. To get a push notification, you needed to be
a developer with an infrastructure to send these notifications. You also needed to pay to
access the notification APIs, either directly or by having an app in the Google Play Store
or the Apple App Store that supported push. It wasn’t really accessible for individuals.
Pushover changed that.
To send a notification, you can simply email a special email address Pushover
assigns to you. But you won’t want to do that if you’re using R, because you can
use the pushoverr package! Here’s a small listing of use scenarios that Pushover
supports, with R code:
• Simple setup: Create an app to get an API key (see Figure 9-9). Then
put your API key and user key at the top of your script:
install.packages("pushoverr")
library("pushoverr")
set_pushover_app(token="ad7w7uqezsfd3v81sze7znjhaz1")
#Change this to your API key
set_pushover_user(user="GnoCrXCawlXUwcBDFDkhKBUgC1IMSO")
#Change this to your user key
• Support for different sounds and the iOS Critical Alerts features.
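Once the app token and user key are set as shown earlier, sending a notification from R is a single call; the message text here is just an example:

```r
library(pushoverr)

# Assumes set_pushover_app() and set_pushover_user() were called as shown earlier
pushover(message = "The report changed!", title = "ChangeAlert")

# Emergency-priority messages repeat until acknowledged on the device
pushover_emergency(message = "The report changed - acknowledge me!")
```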
I’ve been a user of Pushover for a number of years, which is why I recommend their
service when you need a notification that’s reliably delivered and trackable. Over the
years, I’ve used Pushover for the following situations outside of R:
• Weather alerts
And I’ll likely use it for others. But before that, let’s put everything we’ve discussed
together in this chapter and create our ChangeAlert script!
ChangeAlert-JSON.R
# Get Top hot Reddit Thread
install.packages("jsonlite")
install.packages("later")
library("jsonlite")
library("later")
tracktophot()
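The function definitions themselves appear as figures in the original; reconstructed from the walkthrough that follows, they might look like the sketch below. The message wording, and the detail of updating tophothist after a change, are my inferences:

```r
gettophot <- function() {
  # Declare the JSON URL for Reddit's hot listing and return the top title
  hot <- "https://www.reddit.com/hot/.json"
  hot.df <- fromJSON(hot)
  hot.df$data$children$data$title[1]
}

tracktophot <- function() {
  # Create the history variable on the first run so the comparison won't error
  if (!exists("tophothist")) tophothist <<- gettophot()
  same <- gettophot() == tophothist
  if (!same) {
    cat(format(Sys.time()), "- Top hot thread changed to:", gettophot(), "\n")
    tophothist <<- gettophot()
  }
  # Reschedule to run again in an hour; cancelfunc() cancels the pending run
  cancelfunc <<- later(tracktophot, 3600)
}
```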
The preceding code takes an example I mentioned earlier in the book and fully
develops it using other concepts we've introduced. It first installs and loads the jsonlite
and later libraries. Next, the code creates a function named gettophot(). This function
first declares the URL for the JSON download of the top hot threads from Reddit and
then returns the top hot thread. Theoretically, we could always just call these three lines
of code, but by putting it in its own function, we make things a bit more elegant in the
following section.
In that next section, we create another function, tracktophot(). This function first
checks to see if we have a variable named tophothist. If we do not, it creates it and puts
the current top hot thread title into it. If we didn’t have this line, R would complain that
we were referencing a variable that didn’t exist when we compare the current top hot
thread to our tophothist variable. Next, we ask R to compare the current top hot thread
(by calling the gettophot() function) to the history that we've saved. The variable same
is now either TRUE (if the current top thread is the same as the history variable) or
FALSE (if it's different). If it's TRUE, we don't need to do anything – there hasn't been a
change. But if it's FALSE (which we test using the !same statement), we then need to
alert someone to its change.
In this example, I’m simply writing to the console that, at a given time, the top
hot thread changed. I then output what it changed to. Technically I'm being a bit
wasteful here – I'm calling the JSON URL twice and downloading it twice. I could modify
my code to store the result temporarily in a local variable, but since this only
runs infrequently, the difference is pretty minuscule in processing time, bandwidth, and
API usage.
Finally, I use the later() function to schedule my loop to run every hour (3600
seconds). I then launch it by calling tracktophot(). If I want to cancel my running
loop, I can by calling cancelfunc(). When I resume, because tophothist is a global
environment variable, it will compare based off of my last unchanged thread title.
In thinking about what you would like this type of script to do, obviously you can
swap out the console logging for any of the other notification options. For example,
changing cat( to pushover( after loading in the pushoverr library and setting your
user key and API key will cause the message to get pushed to your cell phone instead of
written to the console.
Chapter 9 Project 5: The Change Alert!
ChangeAlert-Rendered.R

# Monitor the number of emails sent on the internet, alert me when it's
# above 130 billion
library("pushoverr")
library(rvest)
library(later)
set_pushover_app(token="ad7w7uqezsfd3v81sze7znjhaz1")   # Change this to your API key
set_pushover_user(user="GnoCrXCawlXUwcBDFDkhKBUgC1IMSO") # Change this to your user key

# This function downloads the live stats, and checks them. It then
# notifies you if the number of emails sent is above 130,000,000,000
checkmail <- function() {
  system("./phantomjs get_internetlivestats.js")
  page <- read_html("livestats.html")
  node <- html_nodes(page, "span")
  # The number we want, Emails sent Today, is element 22
  # html_text(node)[22]
  num.emails <- as.numeric(gsub(",", "", html_text(node)[22]))
  # We can now alert if that number is above our critical cutoff
  # (130 billion, 130,000,000,000)
  if (num.emails > 130000000000) {
    pushover(message = "It's Above!")
  } else {
    pushover(message = "It's Still Below!")
  }
  cancelfunc <<- later(checkmail, 300)
}
checkmail()
# This modified function downloads the live stats, and checks them. It then
# notifies you if the number of emails sent is above 130,000,000,000.
# It also requires that you acknowledge the alert, or it will keep sending.
checkmail <- function() {
  system("./phantomjs get_internetlivestats.js")
  page <- read_html("livestats.html")
  node <- html_nodes(page, "span")
checkmail()
Our goal in this script is to monitor the web page InternetLiveStats.com and send a
Pushover notification when the daily email-sent total reaches 130 billion (with a second
version that's pushier than the first). There are a lot of other stats on that page we could
also use, but given how much email I feel like I send and receive, the email number
seemed darkly comedic. There is a lot going on in this script, including calling another
application to do some heavy lifting, so let's walk through it!
First, we need to load our libraries and set our Pushover values. Next, we create a
function that downloads the live stats and checks them. This is actually a lot harder
than it sounds. Thus far we’ve downloaded static web pages, where the data we need
lives in the raw HTML files. If you download the raw HTML of internetlivestats.com, you
get a page with placeholder values. That’s because the authors of internetlivestats use
JavaScript to load the values in using AJAX calls in the background. This lets them keep
the numbers rolling higher and higher as the person views the page, but it also means
that the data lives somewhere other than the raw HTML.
Our way around this is to emulate what we would do if we went to the web page
in our browser, and download the fully rendered page. We can do this using a piece
of open source software named PhantomJS (https://phantomjs.org). PhantomJS is a
command-line "headless" browser, which will take a URL, render it, and save the output.
To instruct PhantomJS what we need it to do, we use the system() function in R to call it
with a short JavaScript control file:
// get_internetlivestats.js
var fs = require('fs'), page = require('webpage').create();
var path = 'livestats.html';
page.open('https://www.internetlivestats.com/', function () {
  fs.write(path, page.content, 'w'); phantom.exit(); // save rendered page, quit
});
Once you have that code saved and the PhantomJS browser downloaded, calling the
system() function takes about 3–4 seconds to load up the page and save it as livestats.
html. From there, we can use our old friend the rvest package to open the file, find the
number of emails that were sent, and save it to a variable, num.emails.
From there, we simply compare num.emails to the value we've set (130 billion); if
it's higher, we push a message that it's above, and if it's lower, we push a message that
it's below. We then use later() to schedule the check to run again in 5 minutes (300
seconds) and launch the whole thing by calling checkmail().
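The parsing step is worth seeing in isolation: the scraped span holds a comma-formatted string, which gsub() strips so as.numeric() can convert it. The count used here is made up for illustration:

```r
# A comma-formatted count, as html_text() would return it
raw <- "131,408,268,422"
num.emails <- as.numeric(gsub(",", "", raw))  # strip commas, then convert
num.emails > 130000000000                     # TRUE: above the cutoff
```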
Lower down in the code, you’ll see that I’ve modified the checkmail function to be a
bit more insistent in notifying the user that we’re above 130 billion emails. Specifically,
we’ve changed the following code:
There are a few things in this code that you may have noticed but not understood.
First, you'll see that when we send the emergency pushover, we use <<- instead of <-.
This is so that the msg variable lives in the global environment, not the local
environment inside the function. By default, R keeps local variables inside the scope of
the function only, so that they don't mess with other functions that might use the same
variable names. However, when we want something to live in the global environment,
we need to use <<- to indicate that.
Similarly, you'll notice the remove function (rm()) has a pos option set to
.GlobalEnv – this tells the function to delete the global environment variable msg and
not a local variable inside the function named msg. It's the same idea as using <<- – we
need to be explicit about whether we're talking about a variable that only exists locally
inside the function or one that lives in the global environment.
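Here's a short, self-contained illustration of both points; x is just a demonstration variable:

```r
x <- "global"
f_local  <- function() { x <- "local" }    # <- binds only inside the function
f_global <- function() { x <<- "changed" } # <<- rebinds the global x

f_local()
print(x)    # still "global"
f_global()
print(x)    # now "changed"

rm(x, pos = .GlobalEnv)  # remove the global binding by name
exists("x")              # FALSE
```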
Conclusion
We've come a long way in this chapter – we've broken down the types of changes we
might want to be notified about, how we'd build in some redundancy, and the
notification options themselves. We've then seen examples of tracking several different
items, including fully rendered web pages. From here, you should feel confident to mix
and match based on your needs in tracking the things that someone never bothered to
tell you changed!
CHAPTER 10

Project 6: The R Personal Assistant

© Jon Westfall 2020
J. Westfall, Practical R 4, https://doi.org/10.1007/978-1-4842-5946-7_10

• I always check my calendar to see how long I can stay in bed before
my first meeting!
• Hear about specific news items such as content about their local town
or favorite sports team
And the list could go on and on. For the following example, I've focused on the five
items that were in my first list. But the second list is completely doable as well; it just
might take some ingenuity. Let's take a moment to think about where we get the data
that goes into these lists:
Each of those items is going to require a bit of planning to pull off. For our example,
we’ve limited ourselves to the API calls; however, you can imagine using the code
examples from previous chapters in this book to scrape websites and create reports. If you
wanted to be especially daring, you could script an email to be sent out every afternoon
at 4:30 PM to your employees reminding them to fill out a form with the day’s report. You
could then have an R script download that form data (see Chapters 2, 3, and 7) every
evening and compile a report. You could finally have your daily briefing script at 8:00
AM every morning pull portions of that report to put into your briefing. All of this could
happen after you programmed it one time, without you having to chase down reports
every day or week and compile them by hand.
Once you’ve decided what sources you want to use, your next step is to figure out the
format you’d like your daily report to be in. Let’s talk about that next.
Plain Text
Perhaps the simplest and most effective way to communicate in the world is just through
words. It’s how I’m communicating in this book, and it’s how you’re taking in the
information. While infographics, charts, spreadsheets, and PowerPoint slides are useful,
sometimes they add more clutter than information. Plain old text can sometimes be king
for its brevity, its ability to be precise through small changes of phrase, its reliability, and
its small size. Plain text is also extremely easy for a text-to-speech (TTS) synthesizer to use.
Thus, your smartphone or personal assistant can easily read your daily report without
having issues navigating around photos, images, and formatting blocks.
However, text can fall short in some key spaces. First, for very complex thoughts,
a picture can really be worth a thousand words. Thinking about how I explain certain
statistics topics, such as three-way interactions, the graphs I use are way more intuitive
than simply explaining it in words. Plus, there are certain things I cannot communicate
in text. If I’m a product designer and part of my job is to review the latest design sketches
every morning, I really can't depend on text for my daily briefing. And finally, text
can take longer to parse than many other formats. Indeed, you may notice in the daily
briefing script that follows that while the text version of the report is shorter in length,
the visual appeal of the HTML/PDF version makes it easier on the eye.
HTML/PDF
Rendering text in a rich environment while also adding formatting, images, and color
can produce a stylish and professional looking report that you will enjoy reading each
morning. It also has the added benefit that it’s a report you will probably not hesitate to
share if needed – especially if it’s information that could appeal to others in your circle
of friends or co-workers. UI/UX designers use the term skeuomorphic to describe the
emulation of a real-world object virtually, such as the older version of applications on
iOS that looked like their paper counterparts (e.g., notes, contacts, and calendar).
While not many of us will go “all out” with our skeuomorphic designs, there is
something to be said about beauty as well as information. One example of this is the Tufte
Handout style that one can actually create in RStudio using RMarkdown. This style is
beautiful and simple and harkens back to a clean textbook style of decades past. However,
while it may be visually appealing, a TTS synthesizer may have issues with it, and it could
also be overwhelming to the reader who just wants to get what they need and get done
with the document. Pretty may equal professional, but it doesn’t always equal fast.
Data Frame/Graphs
Our last manner of rendering data is probably the most well developed in R, given its roots
as a statistical software package, but also the trickiest to pull off effectively: exporting data
as a data frame or in graphical representation. I won’t spend a lot of time on this here,
because it’s likely that if your daily briefing depends on statistics and numbers in columns,
you probably already know exactly what graphs you want and what tables of text you
expect to see. I will note that some of the formatting that we’ll see in the HTML version
of the daily briefing R script is very useful if you are going to be working with these areas.
And don’t forget the officer package we talked about in Chapter 7 – it can be used to
write and read Excel workbooks. I shared an example of an office having to put together
the same report daily for 2 weeks and how this could be easily done using officer. I could
see a similar scenario for someone in an organization that reviews metrics each morning
designing an R script to create an Excel workbook with several sheets that could be shared.
Once you have decided what format you want your daily report in, it's time to glue
it all together. In the next section, I'll walk through the daily briefing R script, how it
pulls in its data, and how it exports data in either plain text or HTML through
RMarkdown. Once we're done walking through it, I'll talk about how we might distribute
it to ourselves for the easiest "digestion" of our briefing.
• owmr
• tidyquant
• jsonlite
• markdown
• knitr
Weather from OpenWeatherMap.Org
Weather is a necessary consideration for most of us when we’re not sheltering in place
or quarantined (as you can imagine, during the month I’m writing this, April 2020, I’m
not as interested in the weather as I usually am). OpenWeatherMap.org offers several
weather APIs to choose from, with their basic forecast API free for up to 1,000 calls
per day. You will need to sign up and get an API key in order to access their data
remotely. Once you have yours, you'll need to replace the example APPID that I have in
the following code. You'll also want to change the ZIP code to your location. Their API
also allows you to specify a city or latitude/longitude instead.
library('owmr')
Sys.setenv(OWM_API_KEY = "a91bb498c90bcd09f445039a88474ec5")
weather <- get_current(zip="38733", units="imperial")
weather$main$temp
As you can see from the last line of code, we now have an object named weather
that contains our forecast data. Later we'll mine out the current weather conditions to
put in our report. And while we'll only be using two of the variables that we're given, we
could use much more. See Figure 10-1 for all of the information that was downloaded
through OpenWeatherMap.org's API.
library("tidyquant")
dji <- getQuote("^DJI") #Get Current Dow Jones Industrial Average
aapl <- getQuote("AAPL")
goog <- getQuote("GOOG")
In the preceding code, we load the library and download three quotes – the current
Dow Jones Industrial Average, Apple, and Alphabet Inc. (Google). We'll slot them into
our report as we find useful. Figure 10-2 shows an example of the data downloaded for
one quote, in this case Apple.
library("jsonlite")
url <- "http://newsapi.org/v2/top-headlines?country=us&apiKey=APIKEY"
news.df <- fromJSON(url);
library("jsonlite")
url <- "http://quotes.rest/qod.json?category=inspire"
quote.df <- fromJSON(url);
If you have multiple calendars, use the calendar_list() function to get a list of
them, and then replace the my_cal_id with the ID of the calendar you’d like to pull.
Figure 10-6 shows the day I used for my example throughout this chapter (ironically
filled with things I would do on an average Sunday, but didn’t do on this particular
Sunday!).
Finally, the last line of code here sorts the events in order from earliest to latest;
otherwise, they’d be alphabetical, which might cause your After Dinner Conference Call
to show up first on your list before your Zither Breakfast Meditation!
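The sorting step can be sketched with a small stand-in data frame. The column names and event data here are illustrative, not gcalendr's exact output format:

```r
events <- data.frame(
  summary = c("After Dinner Conference Call", "Zither Breakfast Meditation"),
  start   = as.POSIXct(c("2020-04-19 19:00", "2020-04-19 07:30")),
  stringsAsFactors = FALSE
)
events[order(events$summary), ]$summary[1]  # alphabetical: the dinner call first
events <- events[order(events$start), ]     # chronological instead
events$summary[1]                           # "Zither Breakfast Meditation"
```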
devtools::install_github("andrie/gcalendr")
timezoneoffset <- -4 #Set to your distance from UTC
library(gcalendr)
calendar_auth(email="[email protected]",path="client_secret.json")
With all of the pieces put together, we can now create our text report! And we'll do that
in a very basic yet easily changeable way – by building a paragraph.
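The pattern throughout the report blocks is the same: start txt with an opening sentence, then grow it one paste() call at a time. A stripped-down sketch of that pattern, with placeholder values standing in for the real API data:

```r
txt <- paste("This is your automated daily briefing for",
             format(Sys.Date(), "%a %b %d %Y"), "!")
txt <- paste(txt, "Today the high temperature will be", 63, "degrees.")
txt <- paste(txt, "You have", 3, "events today.")
cat(txt)
```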
# Stock Block
txt <- paste(txt, " In stock news, the Dow Jones closed at ",
             as.character(dji$Last), " a change of ",
             as.character(round(dji$`% Change`, digits = 2)), " percent.")
if (aapl$Change > 0) { txt <<- paste(txt, "Apple stock increased") } else {
  txt <<- paste(txt, "Apple stock decreased") }
txt <- paste(txt, as.character(round(abs(aapl$`% Change`), digits = 2)),
             " percent. Google stock ")
if (goog$Change > 0) { txt <<- paste(txt, "increased") } else {
  txt <<- paste(txt, "decreased") }
txt <- paste(txt, as.character(round(abs(goog$`% Change`), digits = 2)), " percent.")

# World News Block
txt <- paste(txt, "In world news, the top 3 headlines and their sources are:",
             news.df$articles$title[1], "followed by",
             news.df$articles$title[2], "and finally",
             news.df$articles$title[3])
# Optionally, include news.df$articles$url at the bottom of your text
# briefing so you can easily get to the full story.

# Inspirational Quote Block
txt <- paste(txt, ". Finally, to wrap up your briefing, today's inspirational",
             "quote comes from ", quote.df$contents$quotes$author[1], "and is '",
             quote.df$contents$quotes$quote[1],
             "'. That's today's briefing, have a great day, Jon")

Running the full script produces a briefing like this:
This is your automated daily briefing for Sun Apr 19 2020! Today the high
temperature will be 63 degrees and Clouds. You have 3 events today. The
first event, Lunch with Karey ,starts at 12:00 .The remaining 2 events
are: Podcasting (Mobileviews.com) with Todd at 14:00, and FaceTime with
Nate & Kristen at 19:30. In stock news, the Dow Jones closed at 24242.49 a
change of 2.99 percent. Apple stock decreased 1.36 percent. Google stock
increased 1.57 percent. In world news, the top 3 headlines and their
sources are: As coronavirus cases rise in U.S. hot spots, governors tell
Trump it's too soon to reopen America - Reuters followed by The coronavirus
pandemic will likely leave a lasting legacy on retail: Fewer department
stores - CNBC and finally Pence defends Trump's 'LIBERATE' tweets -
POLITICO . Finally, to wrap up your briefing, today's inspirational quote
comes from Brene Brown and is ' What's the greater risk? Letting go of
what people think or letting go of how I feel, what I believe, and who I
am? '. That's today's briefing, have a great day, Jon
For good measure, I’ll use the following code to save that output to a text file, which
I can easily distribute later. I’ll also save my workspace image to a file, so that I can save
some time (and API calls) later.
fileDB <-file("db.txt")
writeLines(txt, fileDB)
close(fileDB)
save.image("daily-brief")
While text is perhaps not the prettiest, it is easy for a text-to-speech synthesizer to
read to me. If I want it prettier, well, then I’ll just create an HTML version!
---
title: "Daily Briefing"
output: html_document
---
## `r format(Sys.time(),'%a %b %d %Y')`
Welcome to your daily briefing. Today the high temperature will be `r round(weather$main$temp,digits=0)` degrees and `r weather$weather$main`.
```
Here are your stock quotes from the last trading day:
```{r echo=FALSE, results='asis'}
stocks <- rbind(dji,aapl,goog)
kable(stocks)
```
Walking through the code, we can see that it starts with a YAML header giving the
title and output type. RMarkdown then marks code by using the triple back-tick (`)
followed by a brace ({) and the letter r – this tells knitr that it shouldn't just show this
code, it should actually execute it. We can see an example of r code execution in Line 11,
where we put in the date of the report. We can also see it on Line 13, when we slip in the
temperature and the weather conditions.
Moving into Lines 15–18, we can see a nicely formatted table (using the kable()
function). In Lines 20–35, we write out the five top stories one by one. I wrote this in
long form for easy modification, but it would be more efficient to do it with a for() loop,
as we did with the text output earlier.
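That for() loop alternative can be sketched like this, using placeholder headlines in place of news.df$articles$title. In the real Rmd this would sit in a chunk with results='asis' so the generated Markdown is rendered:

```r
titles <- c("Headline one", "Headline two", "Headline three",
            "Headline four", "Headline five")
md <- character(0)
for (i in seq_along(titles)) {
  md <- c(md, paste0(i, ". ", titles[i]))  # numbered Markdown list items
}
cat(md, sep = "\n")
```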
Finally, we finish with our stock quotes and our inspirational quote in block quote
format. With the file completely written, all that’s left to do is write the Markdown and
HTML versions of my report, using the following code:
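A sketch of what that rendering code can look like, assuming the RMarkdown source is saved as db.Rmd (the file names here are illustrative assumptions, and this relies on the knitr and markdown packages loaded earlier):

```r
library(knitr)
library(markdown)
knit("db.Rmd", output = "db.md")            # run the R chunks, emit Markdown
markdownToHTML("db.md", output = "db.html") # convert the Markdown to HTML
```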
Opening up the HTML version (Figure 10-7) in a web browser, we can see that it’s
nicer than the text (and more functional, since I can click the news headlines) – not as
easy to read if you’re a computerized voice in a smart speaker, but easier on human eyes.
With that, we have two versions of our daily briefing. I could easily set up an R script to
run db.R every morning at a set time and have the two files written on my RStudio Server
or my desktop and ready to go whenever I wished. Now the final question is – where?
Where do I put them so that I can access them when I need them?
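As for the scheduled morning run just mentioned, one way to set it up from R itself is the cronR package on a Linux/macOS server. This is a sketch: the 8:00 AM time, the script path, and the job id are assumptions you would change for your own setup:

```r
library(cronR)
cmd <- cron_rscript("/home/jon/db.R")  # wrap the script for cron
cron_add(cmd, frequency = "daily", at = "08:00",  # run every morning at 8
         id = "daily-briefing")
```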
Distribution Options
Once you have your daily briefing, it’s up to you how you would like to either access
it or have it delivered to yourself. Depending on your workflow, you may be more
comfortable with a “pull” method (where you go out and initiate the action of reading
the report) or a “push” method (where it’s sent to you automatically). Here are a few
options for both that you might want to consider.
Pull Methods
• Perhaps the most basic would be to have it write the file to a spot
on your hard drive that you would access each morning. It may
be easiest to have this in a cloud storage folder, such as OneDrive,
Dropbox, or Google Drive. That way you could also easily access it
from your phone or tablet.
• To make pulling your report easier, you will likely want to create a
shortcut in some way depending on the device you’re using. On iOS,
for example, you might use Siri Shortcuts to create a custom home
icon that will take you to your report location. In Windows, you might
create a shortcut to the file.
Push Methods
• You could attach your daily report to an email as seen in Chapter 6
and have it emailed to you daily. This will also be useful for archival
purposes.
Conclusion
As you can see, having your own daily briefing facilitated by R can be a fun project
to build out as well as a useful timesaver during the day. As the final project in this book,
I hope it's inspired you beyond the limitations of what I've shown.
This book has had a central goal of introducing you to ways to use R in your life
that you might not have considered. I would challenge you to dream up ways to use the
building blocks we've seen – from statistical analysis and basic data manipulation to
offloading daily work to R automation solutions – to allow you to spend more time
creating and enjoying the fruits of your creation, rather than simply being a slave to
pointing, clicking, and typing the same things over and over again. If that's been
accomplished, even if just by inspiring you, then I've done my job. I look forward
to hearing from you through your feedback to my social media accounts (principally
@jonwestfall on Twitter, with a full list available at http://jonwestfall.com/contact) and
old-fashioned communication such as email. And if you're looking for new ideas, you
can always check out my blog at http://jonwestfall.com and hear my thoughts nearly
every week on the MobileViews podcast (mobileviews.com) with my good friend, Todd
Ogasawara. Keep learning, keep creating, and keep growing!
Index

A
as.Date() function, 159
as.numeric() function, 115
Asynchronous JavaScript and XML (AJAX), 274

B
blockMessage() function, 23, 24
Built-in datasets
    base installation, 30
    ChickWeight Structure, 35
    data() command, 30
    histogram, 33
    linear model output, 37
    linear regression, 38
    output, 34
    releveling, ChickWeight, 38
    USArrests, 32

C
calendar_list() function, 301
cancelfunc() function, 278
casino package, 212
cbind() function, 157
ChangeAlert-JSON.R, 285, 287
ChangeAlert-Rendered.R, 288, 289, 291
ChangeAlert script, 275, 279
Change detection
    API data, 271, 272
    code running, 270, 271
    R data formats, 268
    web page, 272, 275
checkmail() function, 290
Command line, SSH, 249
comparedf() function, 268, 271
Content management system (CMS), 3
CRAN package, 175
cronR running, 276
Customer satisfaction survey, 62

D
Daily briefing R script
    Google Calendar, 301, 302
    HTML Markdown version, 305, 307
    NewsAPI.org, 299, 300
    OpenWeatherMap.org, 297
    plain text output, 303, 304
    R packages, 297
    stock quote, 299
    TheySaidSo.com, 300
Database servers
    can't connect error, RMySQL, 46
    code blocks, 48
    combined financial records, 44
    connection set up, 45
    core tunnel routing, 45