Basics of R
Liana Harutyunyan
Programming for Data Science
April 8, 2024
American University of Armenia
[email protected]
Special Values in R — NAs
In R, the NA values are used to represent missing values.
(NA stands for “not available.”)
vec <- c(1, 2, 3, NA)
The is.na() function checks each element for NA.
It returns a vector of TRUE/FALSE values; use the sum function
to count how many are TRUE.
The which function returns the indices of the TRUE values.
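As a small sketch of how these three functions fit together, using the vec defined above (the other variable names are only illustrative):

```r
vec <- c(1, 2, 3, NA)

# Element-wise check: one TRUE/FALSE per element
na_flags <- is.na(vec)

# TRUE counts as 1 and FALSE as 0, so sum() counts the NAs
n_missing <- sum(na_flags)

# which() returns the positions of the TRUE values
na_positions <- which(na_flags)

print(na_flags)
print(n_missing)
print(na_positions)
```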
Type conversion
In Python, when we wanted to change the data type of an
object, we wrote for example:
int("123")
In R, there are as.numeric (identical to as.double),
as.integer, as.character, as.factor, etc. for data type
conversion.
When the conversion is not possible, R returns NA (with a
warning).
as.numeric(c(1, 10, "a"))
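A minimal sketch of these conversions follows. The variable names are illustrative, and suppressWarnings() is used here only to keep the output quiet; without it R prints a coercion warning.

```r
# c(1, 10, "a") is a character vector, because mixing types coerces to character
mixed <- c(1, 10, "a")

# "a" cannot be parsed as a number, so it becomes NA
converted <- suppressWarnings(as.numeric(mixed))
print(converted)

# Other conversions from the same family
int_value <- as.integer("123")
chr_value <- as.character(123)
fct_value <- as.factor(c("low", "high", "low"))

print(class(int_value))
print(class(chr_value))
print(levels(fct_value))
```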
Special Values in R — Infs
If a computation results in a number that is too big to
represent, or divides a nonzero number by zero, R will
return Inf for a positive result and -Inf for a negative
result.
50/0, -50/0, 50^10000
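The expressions above can be checked directly. This sketch also uses is.infinite() and is.finite(), which are base R helpers not mentioned on the slide:

```r
pos_inf  <- 50 / 0       # division of a positive number by zero
neg_inf  <- -50 / 0      # division of a negative number by zero
overflow <- 50^10000     # too large to represent as a double

print(c(pos_inf, neg_inf, overflow))

# is.infinite() is TRUE for both Inf and -Inf
print(is.infinite(c(pos_inf, neg_inf, overflow)))

# is.finite() is the opposite check for ordinary numbers
print(is.finite(c(1, pos_inf)))
```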
Special Values in R — NaN
Sometimes, a computation will produce a result that makes
little sense. In these cases, R will often return NaN (meaning
“not a number”).
0 / 0
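A short sketch of how NaN relates to NA. The is.nan() check is a base R function that the slide does not mention, and the variable names are illustrative:

```r
not_a_number <- 0 / 0
also_nan <- Inf - Inf

# is.nan() detects NaN specifically
nan_check <- is.nan(not_a_number)

# is.na() is TRUE for NaN as well as for NA
na_check_on_nan <- is.na(not_a_number)

# The reverse does not hold: NA is not NaN
nan_check_on_na <- is.nan(NA)

print(c(nan_check, na_check_on_nan, nan_check_on_na))
```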
Special Values in R — NULL
Additionally, there is NULL in R (similar to Python’s None).
• NULL is often used as an argument in functions to mean
that no value was assigned to the argument.
• NULL is used mainly to represent lists with zero
length.
• NULL is often returned by expressions and functions.
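The points above can be sketched as follows. The function describe and its label argument are hypothetical and exist only for this illustration:

```r
empty <- NULL

# NULL has length zero and is detected with is.null()
print(is.null(empty))
print(length(empty))

# NULL disappears when combined with other values
combined <- c(1, NULL, 2)
print(length(combined))

# NULL as a default meaning "no value was given" (hypothetical function)
describe <- function(x, label = NULL) {
  if (is.null(label)) {
    "no label"
  } else {
    paste("label:", label)
  }
}

# Accessing a list element that does not exist returns NULL
record <- list(a = 1)
missing_element <- record$b

print(describe(10))
print(describe(10, label = "ten"))
print(is.null(missing_element))
```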
Reading data
In Data Science, everything revolves around the data.
If the data is given in a tabular format in a CSV file, we can
easily read it into R (like we did in Python with pandas).
cities <- read.csv("cities.csv")
Use the str function to see information about each column.
• The stringsAsFactors argument of read.csv specifies
whether strings are read as strings or as factors.
• The sep argument of read.csv specifies the separator
used in the data. This can be ',' (comma), '\t' (tab),
etc.
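To try these arguments without the course file, the sketch below writes a tiny semicolon-separated file to a temporary path and reads it back. The file contents and the name cities_demo are made up for illustration and are not the cities.csv used in the lecture.

```r
# Write a small, made-up file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("City;State;LatD",
    "Seattle;WA;47",
    "Columbus;OH;39"),
  csv_path
)

# sep tells read.csv which separator the file uses;
# stringsAsFactors = TRUE reads text columns as factors
cities_demo <- read.csv(csv_path, sep = ";", stringsAsFactors = TRUE)

# str() shows the type and first values of each column
str(cities_demo)
```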
Subsetting data
Exercises:
• Create a named vector with at least 4 elements, where
the names are national football teams and the values
are goals scored.
• Take the elements 1 and 2.
• Take the elements 1 and 3.
• Take elements by subsetting with names.
Why does named_vector[1, 2] give an error?
Subsetting data
Exercises solutions:
• Create a named vector with at least 4 elements, where
the names are national football teams and the values
are goals scored.
• Take the elements 1 and 2. named_vector[1:2]
• Take the elements 1 and 3. named_vector[c(1, 3)]
• Take elements by subsetting with names.
named_vector[c("name1", "name2")]
named_vector[1, 2] gives an error because, when indices are
separated by a comma, R interprets the second one as a second
dimension: vector[row, col].
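One possible worked version of these solutions is sketched below. The team names and goal counts are invented for illustration, and tryCatch() is used only to capture the error message instead of stopping the script.

```r
# Made-up data: names are teams, values are goals scored
goals <- c(Armenia = 3, France = 5, Brazil = 4, Japan = 2)

first_two <- goals[1:2]                   # elements 1 and 2
first_third <- goals[c(1, 3)]             # elements 1 and 3
by_name <- goals[c("France", "Japan")]    # subsetting by names

# A comma means a second dimension, which a vector does not have
error_message <- tryCatch(
  goals[1, 2],
  error = function(e) conditionMessage(e)
)

print(first_two)
print(first_third)
print(by_name)
print(error_message)
```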
Subsetting of Data Frame
data[rows, cols]
What will they return?
• data[2, 5]
• data[1:10, 4:6]
• data[2, c(1, 5)]
• data[1:3, ]
• data[, c(1, 4:6)]
Subsetting of Data Frame
We can also subset by column names:
• data[, c("City", "State")]
We can exclude columns/rows by negative indexing, but it
does not work with name indexing.
• data[, -c(2, 5, 7)]
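The indexing forms on these slides can be tried on a small synthetic data frame. The 10 x 7 table below (columns X1 to X7) is an assumption made only so that every index exists; the last line shows one common workaround for excluding columns by name, which is not from the slides.

```r
# Synthetic 10 x 7 data frame; the matrix is filled column by column
data <- data.frame(matrix(1:70, nrow = 10))

single_cell <- data[2, 5]            # row 2, column 5: a single value
block <- data[1:10, 4:6]             # rows 1-10, columns 4-6
one_row <- data[2, c(1, 5)]          # row 2, columns 1 and 5
first_rows <- data[1:3, ]            # rows 1-3, all columns
some_cols <- data[, c(1, 4:6)]       # all rows, columns 1, 4, 5, 6

by_name <- data[, c("X1", "X2")]     # columns selected by name
dropped <- data[, -c(2, 5, 7)]       # all columns except 2, 5, 7

# Negative indexing does not accept names; a logical mask is one alternative
dropped_by_name <- data[, !(names(data) %in% c("X2", "X5", "X7"))]

print(single_cell)
print(dim(block))
print(names(dropped))
```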
Subsetting of Data Frame
Exercises:
• Include the first 100 rows and columns 2, 3, 5.
• Exclude rows 10, 20, 30 and exclude column 5.
Data Frames
You can access a specific column of a data frame with $.
data$City
• You can compute statistics on a column: mean(data$LatD)
for a numeric column, table(data$State) for a
categorical one.
Conditional Indexing
Create a new data frame that contains only the cities from
the "WA" state.
This means we want the ROWS that have "WA" in the "State"
column.
cities[cities$State == "WA",]
We can use the table or unique functions to check that the
result is correct.
Conditional Indexing
Create a new data frame that contains cities from EITHER the
"WA" state or the "OH" state.
How would we do this in Python?
In R, instead of isin (Python), we have %in%:
cities[cities$State %in% c("WA", "OH"),]
To combine multiple conditions, use '&' (and) or '|' (or).
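The filters above can be sketched on a small stand-in for the cities data. The four rows below are made up for illustration; only the column names City, State and LatD follow the lecture's dataset.

```r
# Made-up stand-in for the cities data frame
cities <- data.frame(
  City = c("Seattle", "Spokane", "Columbus", "Austin"),
  State = c("WA", "WA", "OH", "TX"),
  LatD = c(47, 47, 39, 30)
)

# Rows where State equals "WA"
wa_cities <- cities[cities$State == "WA", ]

# Rows where State is either "WA" or "OH"
wa_or_oh <- cities[cities$State %in% c("WA", "OH"), ]

# The same filter written with | (or)
wa_or_oh_alt <- cities[cities$State == "WA" | cities$State == "OH", ]

# Two conditions that must both hold, combined with & (and)
oh_south <- cities[cities$State == "OH" & cities$LatD < 40, ]

# Check the result with unique() and table()
print(unique(wa_or_oh$State))
print(table(wa_or_oh$State))
```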
Column Types
To change the column types, we can use as.DATATYPE
functions.
data$column = as.numeric(data$column)
New Column
To add a new column in the dataframe:
data$lat_diff <- data$LatD - data$LatM
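A minimal sketch that converts a column's type and then adds a new column. The two-row data frame is invented for illustration, with LatD deliberately stored as text so that the conversion has an effect.

```r
# Made-up data: LatD arrives as character, LatM as numeric
data <- data.frame(
  LatD = c("41", "42"),
  LatM = c(5, 52),
  stringsAsFactors = FALSE
)

# Convert the column type before doing arithmetic on it
data$LatD <- as.numeric(data$LatD)

# Assigning to a new name with $ creates the column
data$lat_diff <- data$LatD - data$LatM

str(data)
```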
Conditionals
In an if statement, the else if / else parts can be omitted.
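For reference, a minimal sketch of the full form and the shortened form. The variable names are illustrative. Note that at the top level of a script, else has to start on the same line as the closing brace of the previous branch.

```r
x <- -3

# Full form: if / else if / else
if (x < 0) {
  sign_label <- "negative"
} else if (x == 0) {
  sign_label <- "zero"
} else {
  sign_label <- "positive"
}

# Shortened form: only the if part
is_negative <- FALSE
if (x < 0) {
  is_negative <- TRUE
}

print(sign_label)
print(is_negative)
```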
Conditionals
Exercises:
• Given a year value, display whether the year is a leap
year or not.
• If the given value is negative, replace it with its
square; if the value is 0, replace it with 1; if the
value is positive, replace it with 2 * value.
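One possible solution sketch for both exercises follows. It assumes the Gregorian leap-year rule (divisible by 4 but not by 100, or divisible by 400), and the input values 2024 and -3 are arbitrary examples.

```r
# Exercise 1: leap year check
year <- 2024
if ((year %% 4 == 0 && year %% 100 != 0) || year %% 400 == 0) {
  leap_message <- paste(year, "is a leap year")
} else {
  leap_message <- paste(year, "is not a leap year")
}
print(leap_message)

# Exercise 2: transform a value depending on its sign
value <- -3
if (value < 0) {
  value <- value^2
} else if (value == 0) {
  value <- 1
} else {
  value <- 2 * value
}
print(value)
```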
For loops
• Iterating over an object.
for(i in x) { }
• Iterating over object containing the indices of the main
object.
for(i in 1:length(x)) { }
Here, we also have break and next (instead of continue in
Python).
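A short sketch of both loop styles and of next and break. The vector x is made up, and seq_along(x) is used as a safer equivalent of 1:length(x) (an addition to the slide), because 1:length(x) yields c(1, 0) when x is empty.

```r
x <- c(5, 8, 11, 14)

# Style 1: iterate over the values themselves
total <- 0
for (value in x) {
  total <- total + value
}

# Style 2: iterate over the indices
doubled <- numeric(length(x))
for (i in seq_along(x)) {
  doubled[i] <- 2 * x[i]
}

# next skips to the following iteration; break leaves the loop
kept <- c()
for (value in x) {
  if (value %% 2 == 0) next   # skip even values
  if (value > 10) break       # stop at the first odd value above 10
  kept <- c(kept, value)
}

print(total)
print(doubled)
print(kept)
```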
For loops
• Iterate over a vector and add 2 to each value. Do this
once in place, and once storing the changed values in
a new vector.
• Use a for loop to count the number of even numbers
stored inside a vector of numbers.
• Use a for loop to get indices of even numbers in the
vector.
• Use for and if to find a solution to the equation
2x^2 − 20x − 48 = 0. Stop when you find the solution.
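One possible solution sketch for these exercises. The input vector and the integer search range -100:100 are assumptions made for illustration; the exercises do not specify them.

```r
numbers <- c(3, 8, 12, 7, 10)

# Add 2 to each value, storing the result in a new vector
plus_two <- numeric(length(numbers))
for (i in seq_along(numbers)) {
  plus_two[i] <- numbers[i] + 2
}

# Count the even numbers and record their indices
even_count <- 0
even_indices <- c()
for (i in seq_along(numbers)) {
  if (numbers[i] %% 2 == 0) {
    even_count <- even_count + 1
    even_indices <- c(even_indices, i)
  }
}

# Search integers for a root of 2x^2 - 20x - 48 = 0, stopping at the first hit
solution <- NA
for (x in -100:100) {
  if (2 * x^2 - 20 * x - 48 == 0) {
    solution <- x
    break
  }
}

print(plus_two)
print(even_count)
print(even_indices)
print(solution)
```

Because the loop stops at the first root it meets, it reports -2; the equation's other integer root, 12, would be found by searching from the other end.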
For loops
To iterate over the columns of a dataset with a for loop:
for (i in 1:ncol(data)) { }
Exercise: Iterate over iris dataset’s numerical columns and
add 1000 to each.
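One way to approach this exercise, sketched on a copy of the built-in iris dataset so the original stays unchanged. The is.numeric() test is what skips the Species factor column.

```r
data <- iris   # work on a copy of the built-in dataset

for (i in 1:ncol(data)) {
  # Only numeric columns can be shifted; Species is a factor
  if (is.numeric(data[[i]])) {
    data[[i]] <- data[[i]] + 1000
  }
}

head(data, 3)
```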
While loops
while (test_expression) { statement }
Exercises:
• Given a number, print all the even numbers from 0 to
that number using while.
• Given a positive integer, calculate the sum of all the
integers from 1 to that number using while.
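A possible solution sketch for both exercises. The inputs 10 and 100 are arbitrary examples, and the even numbers are also collected in a vector so the result can be inspected after the loop.

```r
# Exercise 1: even numbers from 0 up to a given number
limit <- 10
evens <- c()
k <- 0
while (k <= limit) {
  print(k)
  evens <- c(evens, k)
  k <- k + 2
}

# Exercise 2: sum of the integers from 1 to n
n <- 100
total <- 0
i <- 1
while (i <= n) {
  total <- total + i
  i <- i + 1
}
print(total)
```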
Functions
• There are built-in R functions, such as mean, sum,
length, nchar, runif, etc.
• You can write custom functions:
function_name <- function(x) { return(x) }
This is the same as
function_name <- function(x) { x }
In R, a function returns the value of the last
expression evaluated in its body.
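A small sketch showing that the explicit and implicit forms return the same value. The function names are hypothetical.

```r
# Explicit return
square_explicit <- function(x) {
  return(x^2)
}

# Implicit return: the value of the last evaluated expression
square_implicit <- function(x) {
  x^2
}

print(square_explicit(4))
print(square_implicit(4))
```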
Functions
Functions can have default arguments and, unlike in Python,
arguments with defaults may appear in any position.
Reference: In Python, an argument without a default value
cannot follow an argument with a default value
(def func(x=2, y) is a syntax error).
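A minimal sketch of a default argument placed before a required one. The function scale_value and its argument names are hypothetical.

```r
# The default argument comes first; R accepts this definition
scale_value <- function(factor = 2, x) {
  factor * x
}

# Name the required argument to keep the default for factor
with_default <- scale_value(x = 5)

# Positional matching still fills the arguments left to right
positional <- scale_value(3, 5)

print(with_default)
print(positional)
```

A call such as scale_value(5) binds 5 to factor and leaves x missing, so it fails when x is used. Placing required arguments first avoids that surprise.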
Functions
Examples:
• A function that calculates the area of a rectangle given
its length and width as arguments.
• A function that calculates the factorial of a given
number.
• A function that takes an array of numbers as input and
returns the largest number in the array, without using
the function max.
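The three examples could be sketched as follows. The function names are illustrative, the factorial assumes a non-negative integer input, and the largest-value function assumes a non-empty numeric vector.

```r
# Area of a rectangle
rectangle_area <- function(length, width) {
  length * width
}

# Factorial with a loop (assumes n is a non-negative integer)
factorial_loop <- function(n) {
  result <- 1
  if (n > 1) {            # guard: 2:n would count down for n < 2
    for (i in 2:n) {
      result <- result * i
    }
  }
  result
}

# Largest value without using max() (assumes a non-empty vector)
largest <- function(values) {
  best <- values[1]
  for (v in values[-1]) {
    if (v > best) {
      best <- v
    }
  }
  best
}

print(rectangle_area(3, 4))
print(factorial_loop(5))
print(largest(c(3, 9, 2)))
```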
apply family of functions
These functions let you traverse the data in a number of ways
and avoid explicit use of loop constructs.
The most common functions: apply(), lapply(), sapply().
• apply(X, MARGIN, FUN, ...) - applies FUN over X along
MARGIN (the axis: 1 for rows, 2 for columns).
Example: Create a matrix object and calculate the sum of
each column.
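A sketch of the example, using a small made-up matrix. The row sums are included only to show the effect of MARGIN.

```r
# 2 x 3 matrix filled column by column: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)

col_sums <- apply(m, 2, sum)   # MARGIN = 2: one result per column
row_sums <- apply(m, 1, sum)   # MARGIN = 1: one result per row

print(col_sums)
print(row_sums)
```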
apply family of functions
• lapply(X, FUN, ...) - applies the same function on
each element of object (list, vector, dataframe) and
always returns a list.
Example: Given a list containing a vector and a data frame,
calculate the sum of each element using lapply.
Example: Given a vector, calculate the square root of each
element.
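Both examples can be sketched as below. The list contents are made up, and the data frame contains only numeric columns so that sum() is defined for it.

```r
# A list holding a vector and a (numeric) data frame
items <- list(
  numbers = c(1, 2, 3),
  table = data.frame(a = 1:2, b = 3:4)
)

# One sum per list element; the result is a list with the same names
sums <- lapply(items, sum)

# Applied to a vector, lapply still returns a list
roots <- lapply(c(4, 9, 16), sqrt)

str(sums)
str(roots)
```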
apply family of functions
sapply() and lapply() work basically the same way.
The only difference is that lapply() always returns a list,
whereas sapply() tries to simplify the result into a vector or
matrix.
sapply(X, FUN, ...) - applies FUN on X and simplifies the
result if possible.
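The difference can be seen by running the same function through both. This is a small sketch with made-up input values.

```r
values <- c(4, 9, 16)

as_list <- lapply(values, sqrt)     # always a list
as_vector <- sapply(values, sqrt)   # simplified to a numeric vector

# When each call returns a vector of the same length,
# sapply() simplifies to a matrix with one column per input element
as_matrix <- sapply(values, function(v) c(v, sqrt(v)))

print(class(as_list))
print(as_vector)
print(dim(as_matrix))
```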
Data Visualization
The most popular R package for plotting graphs is ggplot2.
As a reminder, to install a package, we write:
install.packages("ggplot2")
You can run this in the console, in an R script, or in an
Rmd chunk, but once the installation completes, delete or
comment out the line.
Then you need to "import" (load) the package.
library(ggplot2)