Learning R
In its most basic form, R can be used as a simple calculator. Consider the following
arithmetic operators:
Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%
The ^ operator raises the number to its left to the power of the number to its right:
for example 3^2 is 9.
The modulo returns the remainder of the division of the number to the left by the
number on its right, for example 5 modulo 3 or 5 %% 3 is 2.
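The operators above can be tried directly in the console. The following is a minimal sketch; the variable names are illustrative and not part of the original notes.

```r
# Basic arithmetic with R's operators
sum_result  <- 5 + 5         # addition
diff_result <- 5 - 5         # subtraction
prod_result <- 3 * 5         # multiplication
quot_result <- (5 + 5) / 2   # division
pow_result  <- 3^2           # exponentiation: 3 to the power 2
mod_result  <- 5 %% 3        # modulo: remainder of 5 divided by 3

# Show all results at once
print(c(sum_result, diff_result, prod_result, quot_result, pow_result, mod_result))
```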
Note how the quotation marks indicate that "some text" is a character string.
On your way from rags to riches, you will make extensive use of vectors. Vectors are
one-dimensional arrays that can hold numeric data, character data, or logical data. In
other words, a vector is a simple tool to store data. For example, you can store your
daily gains and losses in the casinos.
In R, you create a vector with the combine function c(). You place the vector elements
separated by a comma between the parentheses. For example:
numeric_vector <- c(1, 2, 3)
character_vector <- c("a", "b", "c")
Once you have created these vectors in R, you can use them to do calculations.
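As a rough illustration of such calculations, R applies arithmetic to vectors element by element. The vectors a_vector and b_vector below are made-up names for this sketch.

```r
a_vector <- c(1, 2, 3)
b_vector <- c(4, 5, 6)

# Element-wise addition: 1 + 4, 2 + 5, 3 + 6
total_vector <- a_vector + b_vector
print(total_vector)

# A single number is applied to every element of the vector
doubled_vector <- a_vector * 2
print(doubled_vector)
```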
Before doing a first analysis, you decide to collect all the winnings and losses for
the last week, from Monday to Friday:
For poker_vector: 140, -50, 20, -120, 240
For roulette_vector: -24, -50, 100, -350, 10
Naming a vector
As a data analyst, it is important to have a clear view on the data that you are using.
Understanding what each element refers to is therefore essential.
In the previous exercise, we created a vector with your winnings over the week. Each
vector element refers to a day of the week but it is hard to tell which element belongs to
which day. It would be nice if you could show that in the vector itself.
You can give a name to the elements of a vector with the names() function. Have a look
at this example:
some_vector <- c("John Doe", "poker player")
names(some_vector) <- c("Name", "Profession")
This code first creates a vector some_vector and then gives the two elements a name.
The first element is assigned the name Name, while the second element is
labeled Profession. Printing some_vector to the console then shows each element under its name.
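A short sketch of how the assigned names can be inspected and used. The variable profession is an illustrative name, not part of the original example.

```r
some_vector <- c("John Doe", "poker player")
names(some_vector) <- c("Name", "Profession")

# Calling names() without an assignment returns the current names
print(names(some_vector))

# A named element can be retrieved by its name instead of its position
profession <- some_vector["Profession"]
print(profession)

# Printing the whole vector shows each name above its element
print(some_vector)
```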
Naming a vector (2)
If you want to become a good statistician, you have to become lazy. (If you are already
lazy, chances are high you are one of those exceptional, natural-born statistical talents.)
In the previous exercises you probably experienced that it is boring and frustrating to
type and retype information such as the days of the week. However, when you look at it
from a higher perspective, there is a more efficient way to do this, namely, to assign the
days of the week vector to a variable!
Just like you did with your poker and roulette returns, you can also create a variable that
contains the days of the week. This way you can use and re-use it.
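A sketch of this idea, using the winnings and the variable name days_vector that appear later in these notes: the days are typed once and then reused for both vectors.

```r
# Poker and roulette winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

# Store the days of the week once...
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# ...and reuse the same variable to name both vectors
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

print(poker_vector)
print(roulette_vector)
```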
Calculating total winnings (2)
Now that you understand how R does arithmetic with vectors, it is time to get those Ferraris
in your garage! First, you need to understand what the overall profit or loss per day of
the week was. The total daily profit is the sum of the profit/loss you realized on poker
per day, and the profit/loss you realized on roulette per day.
A function that helps you to answer this question is sum(). It calculates the sum of all
elements of a vector. For example, to calculate the total amount of money you have
lost/won with poker you do:
total_poker <- sum(poker_vector)
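Both steps can be sketched together as follows, using the winnings listed elsewhere in these notes. The names total_daily, total_roulette and total_week are illustrative choices for this sketch.

```r
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Profit or loss per day: element-wise sum of the two vectors
total_daily <- poker_vector + roulette_vector
print(total_daily)

# Totals per game with sum()
total_poker <- sum(poker_vector)        # 230
total_roulette <- sum(roulette_vector)  # -314

# Overall result for the week
total_week <- total_poker + total_roulette  # -84
print(total_week)
```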
Comparing total winnings
Oops, it seems like you are losing money. Time to rethink and adapt your strategy! This
will require some deeper analysis…
After a short brainstorm in your hotel's jacuzzi, you realize that a possible explanation
might be that your skills in roulette are not as well developed as your skills in poker. So
maybe your total gains in poker are higher (or > ) than in roulette.
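This hunch can be checked with the > operator. A minimal sketch, assuming the poker and roulette figures used elsewhere in these notes; poker_better is an illustrative name.

```r
total_poker <- sum(c(140, -50, 20, -120, 240))
total_roulette <- sum(c(-24, -50, 100, -350, 10))

# TRUE if the total poker result is higher than the total roulette result
poker_better <- total_poker > total_roulette
print(poker_better)
```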
To select multiple elements from a vector, you can add square brackets at the end of it
and indicate between the brackets which elements should be selected. Suppose you want
to select the first and the fifth day of the week: use the vector c(1, 5) between the
square brackets. For example, the code below selects the first and fifth
element of poker_vector:
poker_vector[c(1, 5)]
Vector selection: the good times (3)
Selecting multiple elements of poker_vector with c(2, 3, 4) is not very convenient.
Many statisticians are lazy people by nature, so they created an easier way to do
this: c(2, 3, 4) can be abbreviated to 2:4, which generates a vector with all natural
numbers from 2 up to 4.
So, another way to find the mid-week results is poker_vector[2:4]. Notice how the
vector 2:4 is placed between the square brackets to select elements 2 up to 4.
Just like you did in the previous exercise with numerics, you can also use the element
names to select multiple elements, for example:
poker_vector[c("Monday","Tuesday")]
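The three ways of selecting described above (by position, by range, and by name) can be compared side by side. This is a sketch using the poker winnings from these notes; the result names are illustrative.

```r
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# By position: first and fifth element
poker_start_end <- poker_vector[c(1, 5)]

# By range: elements 2 up to 4 (the mid-week results)
poker_midweek <- poker_vector[2:4]

# By name
poker_start <- poker_vector[c("Monday", "Tuesday")]

print(poker_start_end)
print(poker_midweek)
print(poker_start)
```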
As seen in the previous chapter, stating 6 > 5 returns TRUE. The nice thing about R is
that you can use these comparison operators also on vectors. For example:
c(4, 5, 6) > 5
[1] FALSE FALSE TRUE
This command tests for every element of the vector if the condition stated by the
comparison operator is TRUE or FALSE.
# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector
# Which days did you make money on poker?
selection_vector <- poker_vector > 0
# Print out selection_vector
selection_vector
In the previous exercises you used selection_vector <- poker_vector > 0 to find the
days on which you had a positive poker return. Now, you would like to know not only the
days on which you won, but also how much you won on those days.
You can select the desired elements by putting selection_vector between the square
brackets that follow poker_vector:
poker_vector[selection_vector]
R knows what to do when you pass a logical vector in square brackets: it will only select
the elements that correspond to TRUE in selection_vector.
# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector
# Which days did you make money on poker?
selection_vector <- poker_vector > 0
# Select from poker_vector these days
poker_winning_days <- poker_vector[selection_vector]
poker_winning_days
Advanced selection
Just like you did for poker, you also want to know those days where you realized a
positive return for roulette.
# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector
# Which days did you make money on roulette?
selection_vector <- roulette_vector > 0
# Select from roulette_vector these days
roulette_winning_days <- roulette_vector[selection_vector]
roulette_winning_days
What's a matrix?
In R, a matrix is a collection of elements of the same data type (numeric, character, or
logical) arranged into a fixed number of rows and columns. Since you are only working
with rows and columns, a matrix is called two-dimensional.
In a call to the matrix() function such as matrix(1:9, byrow = TRUE, nrow = 3):
The first argument is the collection of elements that R will arrange into the rows
and columns of the matrix. Here, we use 1:9 which is a shortcut for c(1, 2, 3,
4, 5, 6, 7, 8, 9).
The argument byrow indicates that the matrix is filled by the rows. If we want the
matrix to be filled by the columns, we just place byrow = FALSE.
The third argument nrow indicates that the matrix should have three rows.
# Construct a matrix with 3 rows that contain the numbers 1 up to 9
matrix(1:9, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
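To see what the byrow argument changes, the same numbers can be arranged both ways. A small sketch; by_row and by_col are illustrative names.

```r
# Filled row by row: the first row is 1 2 3
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)

# Filled column by column: the first column is 1 2 3
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)

print(by_row[1, ])  # first row of the row-filled matrix
print(by_col[1, ])  # first row of the column-filled matrix

# With these inputs, one matrix is the transpose of the other
print(identical(by_col, t(by_row)))
```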
Three vectors are defined below. Each one represents the box office numbers
from the first three Star Wars movies. The first element of each vector indicates the US
box office revenue, the second element refers to the Non-US box office (source:
Wikipedia).
In this exercise, you'll combine all these figures into a single vector. Next, you'll build a
matrix from this vector.
# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)
# Create box_office
box_office <- c(new_hope, empire_strikes, return_jedi)
# Construct star_wars_matrix: one row per movie, columns hold US and non-US revenue
star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3)
star_wars_matrix
        [,1]  [,2]
[1,] 460.998 314.4
[2,] 290.475 247.9
[3,] 309.306 165.8
Naming a matrix
To help you remember what is stored in star_wars_matrix, you would like to add the names
of the movies for the rows. Not only does this help you to read the data, but it is also
useful to select certain elements from the matrix.
Similar to vectors, you can add names for the rows and the columns of a matrix, using rownames() and colnames().
# Construct matrix
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
# Name the columns with region
colnames(star_wars_matrix) <- region
# Name the rows with titles
rownames(star_wars_matrix) <- titles
# Print out star_wars_matrix
star_wars_matrix
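Once the rows and columns are named, elements can be selected by name rather than by position. A sketch under that assumption; us_new_hope and non_us_revenue are illustrative names.

```r
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix) <- region
rownames(star_wars_matrix) <- titles

# A single cell: row name first, column name second
us_new_hope <- star_wars_matrix["A New Hope", "US"]
print(us_new_hope)

# A whole column: leave the row position empty
non_us_revenue <- star_wars_matrix[, "non-US"]
print(non_us_revenue)
```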
To calculate the total box office revenue for the three Star Wars movies, you have to
take the sum of the US revenue column and the non-US revenue column.
# Construct star_wars_matrix
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
region <- c("US", "non-US")
titles <- c("A New Hope",
            "The Empire Strikes Back",
            "Return of the Jedi")
star_wars_matrix <- matrix(box_office,
                           nrow = 3, byrow = TRUE,
                           dimnames = list(titles, region))
# Calculate worldwide box office figures
worldwide_vector <- rowSums(star_wars_matrix)
worldwide_vector
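The worldwide totals can then be attached to the matrix as an extra column with cbind(). This is a sketch; the name with_worldwide_matrix is an assumption made for illustration.

```r
box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
region <- c("US", "non-US")
titles <- c("A New Hope",
            "The Empire Strikes Back",
            "Return of the Jedi")
star_wars_matrix <- matrix(box_office,
                           nrow = 3, byrow = TRUE,
                           dimnames = list(titles, region))

# Total revenue per movie (one value per row)
worldwide_vector <- rowSums(star_wars_matrix)

# Bind the totals as a new column; the column is named after the variable
with_worldwide_matrix <- cbind(star_wars_matrix, worldwide_vector)
print(with_worldwide_matrix)
```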
Adding a row
Just like every action has a reaction, every cbind() has an rbind(). (We admit, we are
pretty bad with metaphors.)
Your R workspace, where all variables you defined 'live', has already been
initialized and contains two matrices:
star_wars_matrix, which we have used all along, with data on the original trilogy,
star_wars_matrix2, with similar data for the prequel trilogy.
Explore these matrices in the console if you want to have a closer look. If you want to
check out the contents of the workspace, you can type ls() in the console.
# star_wars_matrix and star_wars_matrix2 are available in your workspace
star_wars_matrix
star_wars_matrix2
# Combine both Star Wars trilogies in one matrix
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)
all_wars_matrix
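Since the contents of star_wars_matrix2 are not listed in these notes, the sketch below uses a stand-in matrix with made-up placeholder values (not real box office figures) to show how rbind() stacks two matrices with matching columns. The names placeholder_matrix2 and combined_matrix are hypothetical.

```r
star_wars_matrix <- matrix(c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope",
                                             "The Empire Strikes Back",
                                             "Return of the Jedi"),
                                           c("US", "non-US")))

# Stand-in for star_wars_matrix2: placeholder values, NOT real data
placeholder_matrix2 <- matrix(c(100, 200, 300, 400, 500, 600),
                              nrow = 3, byrow = TRUE,
                              dimnames = list(c("Film A", "Film B", "Film C"),
                                              c("US", "non-US")))

# rbind() appends the rows of the second matrix below those of the first
combined_matrix <- rbind(star_wars_matrix, placeholder_matrix2)
print(dim(combined_matrix))  # 6 rows, 2 columns
print(combined_matrix)
```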