R. Backing up a bit, I thought I’d include a short summary of how to get started
with R, and as you do, I’m now writing a separate post on how to do that. No, I
won’t back up more and explain computers or how to connect your printer.
Setup
For today’s excursion we’ll use the programming language “R”. Why do they call it
“R”? What happened to all the languages between “C” and “R”? Moving on. But why R?
You could do the same things with Python, or most programming languages, but Python
is for hipsters, and R is for real data scientists.
Pirate data scientists are the coolest. Good job on finding this post and joining
the club. You can add R programming to your resume/CV already - it’s such a weird
language that nobody will ask you to do anything in a job interview. And in real
life, you’ll just search around for code samples to copy & paste anyway. Trust me.
I got all of this to work without really understanding what I’m doing either.
Anyway, R is available for most computers. There’s a simple IDE called RStudio
which I use for R. The open source desktop edition is free. You don’t even need an
editor. Install it and let’s move on. You might be prompted to install “R” as well
- follow the directions for that. “R” is the programming language, “RStudio” is an
easy way of working in “R”.
Using RStudio
Obviously, you should read the 291-page documentation. Someone should (I didn’t,
maybe it’s not even 291 pages long). This is just my shortcut. It might not even
be a good shortcut, as clearly I am no authority on R. I do sometimes EAT like a
pirate though.
RStudio has a few quirks that you could get used to. The main window will look like
this initially:
(It turns out, a lot of these graphics get resized when I publish, so they look a
bit terrible. Oops. But at least it’s fast.)
Now for the magic: in the file quadrant (bottom right), click “More” and “Set as
Working Directory”.
If you don’t do this, all files you read & write will end up somewhere else. It’s
super-annoying, computers can be such jerks with details like this. Always remember
to set the working directory first, even if it’s currently showing that directory.
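If you’d rather do this from code than from the menu, the same idea can be sketched
with getwd() and setwd(). This is only an illustration: it uses a throwaway
temporary folder (tempdir()) as a stand-in for your real project folder, and the
file name “test.txt” is made up.

```r
# Remember where we are so we can switch back afterwards.
old_dir <- getwd()

# Stand-in for your project folder; tempdir() is just a placeholder here.
project_dir <- tempdir()
setwd(project_dir)

# A file written with a relative name lands in the working directory.
writeLines("ahoy", "test.txt")
found <- file.exists(file.path(project_dir, "test.txt"))
print(found)

# Clean up and switch back to where we started.
removed <- file.remove("test.txt")
setwd(old_dir)
```

In real use you would call setwd() once with your own folder and skip the cleanup.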
For those used to programming languages, you assign values by using “<-”. R is a
bit weird in that you can kinda assign values to functions too, but whatever floats
your boat, R. Also, you can apply functions to individual numbers, vectors, or
arrays all at once.
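The script itself does not appear in this text, so here is a minimal sketch of what
it might look like, pieced together from the description that follows. The
barplot() call is an assumption about how the graph gets drawn; the original may
well use something else.

```r
# 1000 random numbers, mean 100, standard deviation 10, rounded down.
n <- floor(rnorm(1000, mean = 100, sd = 10))

# Count how often each whole number occurs.
t <- table(n)

# Assumption: a bar chart of the counts (the plot call is not named in the text).
barplot(t)
```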
You don’t really have to understand the code here, but very roughly:
rnorm() creates a vector of 1000 random numbers drawn from a normal distribution
with a mean of 100 and a standard deviation of 10 (so almost all of them land
between 70 and 130, math is weird too). floor() rounds them down to whole numbers.
These are now assigned to the variable “n”.
table() counts the individual occurrences of each number and places them into the
variable “t”.
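The two quirks mentioned above, assigning with “<-” and applying a function to a
whole vector at once, can be sketched like this. The variable names and values are
made up for illustration, and names(v) <- ... is just one example of what
“assigning values to functions” could refer to.

```r
# Assign a single number and a vector with "<-".
x <- 16
v <- c(1, 4, 9)

# The same function works on one number or on a whole vector at once.
root_x <- sqrt(x)   # one value
root_v <- sqrt(v)   # one result per element

# "Assigning to a function": this sets the names of the elements of v.
names(v) <- c("a", "b", "c")

print(root_x)
print(root_v)
print(v)
```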
Outcome
If all goes well, your RStudio UI should now look like this:
The console quadrant (bottom left) mentions the “source()” command. You can enter
any R command here, and it’ll be processed. This is useful for when you have no
idea what you’re doing, and need to try things out.
The file quadrant (bottom right) now shows a graph in its “Plots” tab. What the
heck, huh? So cool. But also, why.
The top right quadrant shows your variables. This is kinda useful for figuring
things out.
If you run the script a few times (remember, the “Source” button - we’re data
scientists here), it’ll create new sets of random numbers and generate new graphs.
Try it out. Clicking stuff is cool, but also, to show how to deal with these graphs
we’re going to need them.
When you have multiple graphs, you can switch between graphs (“plots” in
data-scientist-ese) using the arrows:
In the same place, you can export these graphs to save them as files, or copy them
into your clipboard if you’re writing a report.
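Saving a graph can also be done from code instead of clicking. A minimal sketch,
assuming you want a PDF: open a file with pdf(), draw the plot, then close the file
with dev.off(). The file name “my_plot.pdf” and the temporary folder are
placeholders, and the plotted numbers are just random filler.

```r
# Assumption: a hypothetical file name in a temporary folder.
out_file <- file.path(tempdir(), "my_plot.pdf")

# Open the PDF file as the place where plots get drawn (size in inches).
pdf(out_file, width = 8, height = 6)

# Draw something into it; any plot call would do here.
barplot(table(floor(rnorm(1000, mean = 100, sd = 10))))

# Close the file so it actually gets written to disk.
dev.off()

print(file.exists(out_file))
```

Forgetting dev.off() is the classic mistake: the file stays open and looks empty or
broken.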
Using packages
The default R installation doesn’t have all the cool stuff. If you use Stack
Overflow regularly to copy and paste code, I mean to learn, you’ll see mentions of
other “packages” or “libraries”. Installing these is often pain-free. You need an
internet connection though (this is kinda assumed anyway nowadays, it’s not like
we’re pirates on a boat in the ocean, oh wait).
For R, there are always two steps involved: install the package (once), and then
load it with library() (in every script or session that uses it). Why they don’t
call it the same thing, I don’t know. Gate-keeping by data scientists, obviously.
In the console quadrant (bottom left), copy the following and hit enter:
install.packages("ggplot2")
This will now install the ggplot2 library. This library helps to make nice
graphics. If you’re curious, there’s a big collection of R graphics that you can
use to copy & paste in your code, many of them use ggplot2.
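If you are not sure whether a package is already installed, you can ask R before
installing it again. This is a small sketch using requireNamespace(); the variable
name have_ggplot2 is made up, and the snippet only prints a reminder instead of
installing anything by itself.

```r
# TRUE if ggplot2 is already installed, FALSE otherwise.
have_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (have_ggplot2) {
  # Step two: load it for this session.
  library(ggplot2)
  message("ggplot2 is installed and loaded.")
} else {
  # Step one still needs to happen (once, with an internet connection).
  message("ggplot2 is missing; run install.packages(\"ggplot2\") first.")
}
```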
Your console should show something like this now (the exact content will differ):
Let’s start a new script (menu: File / New File / R Script), and use the following
code:
library(ggplot2)
The next two lines use ggplot() to create a graph (plot). ggplot() takes the
dataset (“mpg”), the items in there you want to graph with aes(), and then adds the
type of graphic (“geom_point()”) that you want to do. ggplot() does these weird
things with just adding things together with “+” to combine them.
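The ggplot() lines themselves are not shown in this text, so the following is a
sketch of what such a plot could look like, not necessarily the original code. The
choice of columns is an assumption: engine size (displ) against highway mileage
(hwy), colored by car type (class), all of which exist in the mpg dataset.

```r
library(ggplot2)

# Dataset first, then aes() says which columns go where,
# then "+" adds the type of graphic (points, in this case).
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

# In a script, print() makes sure the plot is actually drawn.
print(p)
```

Swapping geom_point() for another geom is how you would change the type of graphic
while keeping the rest the same.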
You might wonder where the data used in the graphic comes from - how did we
suddenly get “MPG” data and car types? R and its packages include a number of small
data sources that you can use for trying things out. It makes it a bit easier to
mess with simple graphics before you use your real data. In this case, it’s some
older fuel economy data for popular car models: the “mpg” dataset, which comes with
ggplot2. Base R ships a separate car dataset of its own called “mtcars”. If you
spot random car-related statistics and graphics in R, now you know why.
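To poke around in these built-in data sources yourself, something like the
following works in base R without installing anything. It is only a sketch: it
peeks at mtcars and collects the names of the datasets that ship in the “datasets”
package; the variable name base_sets is made up.

```r
# mtcars ships with base R (package "datasets"), no library() needed.
print(head(mtcars, 3))   # first three rows
print(dim(mtcars))       # number of rows and columns

# data() returns a listing of available datasets; keep just the names here.
base_sets <- data(package = "datasets")$results[, "Item"]
print(head(base_sets))
```

Typing ?mtcars in the console opens the help page that explains what the columns
mean.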
That’s mostly it
At this point, you should be ready to do things in RStudio. Remember, R is weird,
and the names of things are a bit confusing at first, so use your favorite search
engine whenever you get stuck. Regardless, I hope this helps to get you started.