Brief Introduction To R
Kaustav Banerjee
Decision Sciences Area, IIM Lucknow
1 Introduction
This is a minimal introduction to R via R Studio: we discuss only what is necessary to get started
in our course, and we will learn more about R as the course progresses.
First install R from the CRAN home page, then install R Studio (free version). When you open
R Studio and open an R script file, the interface shows four windows:
Editor This top-left window is for writing your code: it's the input device. We write in R
script files. You can run just a part of your code by using the ‘Run Region’ commands,
available in the ‘Code’ menu drop-down list.
Console This bottom-left window is the output device: you get your output in the console.
Workspace This top-right window keeps track of the workspace/history of your R session.
Viewer This bottom-right window has tabs for viewing plots, help pages, packages, files etc.
Remember
1. Save your R script file as filename.R, with .R extension.
2. Do not double-click on your R script file in the working directory to open it in R Studio.
Rather, open R Studio and access the script file from the File menu.
3. While quitting R Studio, a message window (Figure 2) asks whether you want to save the R
session. By default, it will save the session history. You should always select ‘Don’t Save’.
Help files: Say you are trying to get the median of a data set and you need help. You can try
either of the following: (a) Go to the ‘Help’ tab located in the Viewer window and type ‘median’
in the search box; it will take you to the related help page. (b) Type help(median) in the console
window and press ‘Enter’; it will take you to the same help page.
Reading a help page is itself a task for beginners. The good thing is that you will find several
code samples in the help page. You can copy and paste them into the console/editor and check how they work.
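As a small sketch of the same idea without copy-pasting: the function example() runs the sample code from a help page directly. The numbers passed to median at the end are our own illustration, not taken from the help page.

```r
# ?median is a shorthand for help(median)
example(median)            # runs the sample code listed in the help page of median

# A tiny check of our own (the data values are hypothetical)
m = median(c(4, 1, 7))     # the middle value after sorting: 1, 4, 7
m                          # [1] 4
```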
> plot(x = 0:10, y = dbinom(0:10,10,prob = 0.5), type = "h", col = "red", lwd = 2)
Now you may feel that the outer box is unnecessary and the axes should have proper names. To
get that, add a few more options and obtain Figure 3:
> plot(x = 0:10, y = dbinom(0:10,10,prob = 0.5), type = "h", col = "red", lwd = 2,
bty = "n", xlab = "Outcome", ylab = "Probability")
[Figure 3: two panels plotting dbinom(0:10, 10, prob = 0.5) against 0:10. Left panel: default axis labels. Right panel: x-axis "Outcome", y-axis "Probability".]
For further details on how to control different plotting options, go to the help page for ‘plot’.
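As a sketch of a few more options (the title text and the y-axis range below are our own choices, not part of the figure above): main adds a title, ylim fixes the range of the vertical axis, and points() overlays markers on an existing plot.

```r
x = 0:10
p = dbinom(x, size = 10, prob = 0.5)   # Binomial(10, 0.5) probabilities

plot(x = x, y = p, type = "h", col = "red", lwd = 2, bty = "n",
     xlab = "Outcome", ylab = "Probability",
     main = "Binomial(10, 0.5)",        # plot title (our choice)
     ylim = c(0, 0.3))                  # y-axis range (our choice)
points(x, p, pch = 16, col = "red")     # add a filled dot on top of each spike
```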
3 First few examples
3.1 R as calculator
R can be used as a calculator. To see this, go to the console window, type the following and press ‘Enter’.
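As an illustration only: the particular expression below is an assumption on our part, chosen so that the result is 1500; any arithmetic expression behaves the same way.

```r
# Hypothetical calculator input; the console echoes the result as a vector of length 1
1000 + 500     # [1] 1500
```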
The [1] before 1500 indicates that 1500 is the first member of the output vector. To simulate 5
observations from Normal(µ = 0, σ = 1), type the following in the console window and press ‘Enter’.
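A minimal sketch, using the same rnorm call that appears in the seeded example of this section; the name obs is ours, introduced only to hold the result.

```r
# Draw 5 observations from Normal(mean = 0, sd = 1); no seed is set,
# so the numbers change every time this is run
obs = rnorm(n = 5, mean = 0, sd = 1)
obs
length(obs)    # [1] 5
```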
Note that you will get different numbers in your case, as it’s a simulation exercise. Also, while typing
the code in the console window you may have noticed a help tab (Figure 4) appearing automatically,
telling you the right way to do things. This is one of the big pluses of R Studio. When the
help tab appears, pressing the ‘Tab’ key guides you through the entire code so that you don’t go
wrong. If, for the sake of reproducibility, you need to regenerate exactly the same simulated data, fix the seed before each call:
> set.seed(5)
> rnorm(n = 5, mean = 0, sd = 1)
[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
> set.seed(5)
> rnorm(n = 5, mean = 0, sd = 1)
[1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087
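Instead of comparing the printed numbers by eye, we can let R confirm that the two seeded draws agree. This is a small sketch; the names first and second are ours.

```r
set.seed(5)
first = rnorm(n = 5, mean = 0, sd = 1)

set.seed(5)                        # resetting the same seed restarts the random stream
second = rnorm(n = 5, mean = 0, sd = 1)

identical(first, second)           # [1] TRUE
```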
3.2 Workspace
You can also give numbers a name. By doing so, they become so-called variables which can be
used later. For example, you can type in the console:
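A minimal illustration: the name a is the one used in the text that follows, while the value 4 is an assumption made only for this sketch.

```r
a = 4        # store the number 4 under the name a (value chosen arbitrarily)
a            # [1] 4
a * 5        # the variable can be reused in later calculations: [1] 20
```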
Check that ‘a’ now appears in the Workspace window (top-right), which means that R remembers what ‘a’ is. To
remove all such variables from R memory, type the following or click the icon in that window that
clears the workspace. Note that clearing the history entries only removes the list of past commands, not the variables.
> rm(list = ls())
To create a matrix with several rows and columns (by convention, columns stand for variables
and rows stand for outcomes or cases), we can set it up first and then fill it column by column:
> b = matrix(0, nrow = 5, ncol = 2)
> b[,1] = rnorm(n = 5, mean = 0, sd = 1)
> b[,2] = rnorm(n = 5, mean = 0, sd = 1)
> b
[,1] [,2]
[1,] 1.2276303 -0.1389861
[2,] -0.8017795 -0.5973131
[3,] -1.0803926 -2.1839668
[4,] -0.1575344 0.2408173
[5,] -1.0717600 -0.2593554
Now we have two independent samples drawn from Normal(µ = 0, σ = 1) stored in a matrix.
Alternatively, we can create two vectors (they must be of the same length) and bind them column-wise to create
a matrix.
> b1 = rnorm(n = 5, mean = 0, sd = 1)
> b2 = rnorm(n = 5, mean = 0, sd = 1)
> b = cbind(b1,b2)
> b
b1 b2
[1,] 0.9005119 -0.2934818
[2,] 0.9418694 1.4185891
[3,] 1.4679619 1.4987738
[4,] 0.7067611 -0.6570821
[5,] 0.8190089 -0.8527954
If we have two matrices with the same number of columns, we can also join them row-wise:
> b1 = matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
> b2 = matrix(7:12, nrow = 2, ncol = 3, byrow = TRUE)
> b = rbind(b1, b2)
> b
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[4,] 10 11 12
To do some matrix operations, the function apply is very useful. Say we need the maximum value
from each row or column. We do it this way:
> apply(b, MARGIN = 1, max)
[1] 3 6 9 12
> apply(b, MARGIN = 2, max)
[1] 10 11 12
The third argument, which is the built-in function max here, can be any function that works on
a vector: a built-in R function or a user-defined one.
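As a sketch of the user-defined case, here is a small function of our own (the name spread is hypothetical) that returns the difference between the largest and the smallest value, applied to the same 4 x 3 matrix as above.

```r
# Rebuild the matrix with rows 1:3, 4:6, 7:9, 10:12
b = rbind(matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE),
          matrix(7:12, nrow = 2, ncol = 3, byrow = TRUE))

# User-defined function: largest value minus smallest value of a vector
spread = function(v) max(v) - min(v)

row.spread = apply(b, MARGIN = 1, spread)   # one value per row
col.spread = apply(b, MARGIN = 2, spread)   # one value per column
row.spread    # [1] 2 2 2 2
col.spread    # [1] 9 9 9
```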
4.3 Data in data frame format
Often we see data presented in a tabular format similar to a spreadsheet. R has a flexible way of
handling such data, called a data frame. Let us create the following:
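A minimal sketch of such a data frame. Only the name study and the column gender (three F, one M) come from the text further down; the columns id and score and all their values are assumptions made for illustration.

```r
study = data.frame(
  id     = 1:4,                              # hypothetical case identifier
  gender = factor(c("F", "M", "F", "F")),    # categorical column stored as a factor
  score  = c(62, 71, 58, 80)                 # hypothetical numeric variable
)
study          # prints the table: one row per case, one column per variable
str(study)     # shows the type of each column
```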
There are a number of ways to access a data frame. A few are as follows; they all provide the
same output as the one obtained at the end.
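As a sketch of what such equivalent ways can look like, assuming a small data frame named study with a gender column (the id column and the values are hypothetical): each of the four lines below extracts the same column.

```r
study = data.frame(id = 1:4, gender = factor(c("F", "M", "F", "F")))

study$gender         # by name, using $
study[["gender"]]    # by name, using double brackets
study[, "gender"]    # matrix-style: all rows, column "gender"
study[, 2]           # matrix-style: all rows, second column
```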
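Notice that the column gender is categorical in nature: R treats it as a ‘factor’ (from R 4.0.0
onwards, data.frame() keeps text columns as character unless you convert them with factor()). One simple way
to deal with factors is to use the table function: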
> table(study$gender)
F M
3 1
$sample2
[1] -2.1023291 -0.3017023 -1.2723834 -0.2796661 -0.2040973
4.5 Data import/export
Say we have a data set in a spreadsheet program like MS-Excel. To import the data into R, do the following.
1. Save the Excel file in comma-separated values (csv) format (e.g. filename.csv); you will find this
option in the drop-down list of the save-as window.
2. Suppose your file is located in the ‘Data’ folder on the ‘K’ drive.
3. Run the command
read.csv("K:/Data/filename.csv", header = TRUE)
4. In case you have data in txt format, run the command
read.table("K:/Data/filename.txt", header = TRUE)
5. Check whether your data set is stored in data frame format.
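The steps above can be tried without any spreadsheet at hand. In this sketch a temporary file stands in for "K:/Data/filename.csv"; the file contents and the names csv.path and mydata are assumptions made for illustration.

```r
# Create a small csv file in a temporary location (stand-in for the real file)
csv.path = tempfile(fileext = ".csv")
writeLines(c("id,score", "1,10.5", "2,8.0", "3,9.25"), csv.path)

# Step 3: read the file; header = TRUE uses the first line as column names
mydata = read.csv(csv.path, header = TRUE)

# Step 5: check that the result is a data frame, then look at its structure
is.data.frame(mydata)    # [1] TRUE
str(mydata)
```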
Suppose we need to export a random sample created in R to the ‘Data’ folder on the ‘K’ drive. Run
the following commands:
sample = rnorm(10, 0, 1)
normal.sample = data.frame(data = sample)
# write.csv always writes the column names; it ignores col.names with a warning, so we omit it
write.csv(normal.sample, "K:/Data/normal sample.csv", row.names = FALSE)
Ideally, your data set should remain separated from your R code.
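To convince ourselves that an exported file can be read back, here is a round-trip sketch. It writes to R's temporary folder rather than to the ‘K’ drive, and the names out.path and check are ours.

```r
# Write a sample to a csv file in the temporary folder (stand-in for K:/Data)
out.path = file.path(tempdir(), "normal sample.csv")
normal.sample = data.frame(data = rnorm(10, 0, 1))
write.csv(normal.sample, out.path, row.names = FALSE)

# Read it back and compare with what we wrote
check = read.csv(out.path, header = TRUE)
all.equal(check$data, normal.sample$data)   # TRUE, up to the precision stored in the file
```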
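my.sum = function(arg1)
{
  s = 0                        # running total
  for (i in seq_along(arg1))   # seq_along also works for an empty vector, unlike 1:length(arg1)
  {
    s = s + arg1[i]            # add the i-th element to the total
  }
  return(s)
}
sample1 = 1:10
my.sum(sample1); sum(sample1) # Check my.sum against the built-in sum function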
If your function needs more arguments, separate them with commas: arg1, arg2, and so on.
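As a sketch of a function with two arguments, here is a weighted version of the sum. The function name my.weighted.sum and the choice of default weights are our own; the second argument defaults to a vector of ones, so calling the function with one argument gives the plain sum.

```r
my.weighted.sum = function(arg1, arg2 = rep(1, length(arg1)))
{
  s = 0
  for (i in seq_along(arg1))
  {
    s = s + arg1[i] * arg2[i]   # each element is multiplied by its weight
  }
  return(s)
}

my.weighted.sum(1:3, c(2, 0, 1))   # 1*2 + 2*0 + 3*1 = 5
my.weighted.sum(1:10)              # default weights of 1 give the plain sum, 55
```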