HUDM 5026 - Introduction to Data Analysis and Graphics in R
01 - Introduction
1 Preliminaries
Why use R?

• It’s free.

• It accommodates all manner of basic statistical analysis and many, many advanced
and new methods.

• There is an active community of programmers, academics, and developers who continually
work on improving R and creating and improving auxiliary software such as
contributed packages and IDEs like RStudio.

• Many new procedures and capabilities come out first in R and are often supported by
publications in peer-reviewed journals, such as the Journal of Statistical Software.

• R runs on all the major operating systems, including macOS, Windows, and Unix-like systems such as Linux.

• R’s ability to produce publication-quality plots and graphics is unparalleled.

Some recent examples of plots I have made in R can be found on my faculty webpage
at https://www.tc.columbia.edu/faculty/bsk2131/. In particular, see PDF versions of
papers entitled Variable Selection for Causal Effect Estimation and Heterogeneous Subgroup
Identification.

B. Keller, Teachers College, Columbia University 1


1.1 Downloading R and RStudio
Visit the Comprehensive R Archive Network (CRAN) website at https://cran.r-project.org/
to download R for your computer’s operating system. If you have an older version of R
on your machine, now is the time to download the most recent version from CRAN. This is
especially important because some of the packages we will use are not compatible with older
versions of R.
Although you don’t need to use RStudio to work with R, it makes R easier to work
with. RStudio is a separate download; go to https://www.rstudio.com/ and download
and install the right version for your operating system. You will likely want to put an
RStudio shortcut icon in your menu bar or start menu. Open RStudio and notice the pane
design for integrated viewing of different processes. Go to the “RStudio” drop-down menu
and select “Preferences” (on Windows, go to the “Tools” menu and select “Global Options”).
Then select “Panes”. You will see a screen that looks like the figure
below. My preference is to put the source pane in the top left, the console pane in the
bottom left, the environment/history pane in the top right, and the plots/help pane in the
bottom right.

Next, go back to the “Preferences” menu and select the “General” tab. I prefer to uncheck
all boxes except the three shown in the figure below. Furthermore, I recommend setting the
drop-down menu so that RStudio never saves your workspace on exit. The rationale for
setting this to never save is that if you don’t, R will save whatever data is in your workspace
(i.e., objects visible using ls()) in a file named .RData so that the next time you open RStudio, it
will all be available in your working environment. While, in principle, this is a nice idea, in
practice, you end up storing way more data and objects than are necessary and the clutter
causes RStudio to open slowly and get glitchy.

Instead of storing important information in your workspace, I will encourage you to begin
to think of your source file, which is a text file that contains your saved lines of code, as the
best place to store your important information in R. Most operations you will ask R to do
will take only a fraction of a second to run, so storing your code and then running the code
each time you need it is a good habit to get into. To that end, you will work on writing
efficient code that is understandable, so that the next time you read it you can see clearly
what you were trying to do when you wrote it. Adding comments to code is a big part of
making it understandable.
There are two file type extensions that we will use a lot in this course: “.R” and “.Rdata”.
At some point in the near future, you will want to instruct your computer to open both
those file types with RStudio by default. The way to do this varies based on the operating
system you are using, but typically it can be done by right-clicking on the file and choosing
“open with” and then selecting the option to make RStudio the default.

1.2 The Four RStudio Panes


The console pane is where you may interact directly with the R command line. If you type
code in the console and press enter (or return), your code will run, and, if called for, R will
produce output, also in the console. Let’s try it. Do some basic math in the console, like 5
+ 5, and press enter. You should see the answer 10 printed as output. Note that you must
use the asterisk “*” for multiplication and the forward slash “/” for division.
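As a small illustration of the operators just mentioned, lines like the following could be typed into the console one at a time. The particular numbers are arbitrary choices for this sketch, not part of the course materials.

```r
5 + 5    # addition
12 - 7   # subtraction
6 * 7    # multiplication uses the asterisk
20 / 4   # division uses the forward slash
```

After each line, R prints the result in the console, prefixed by [1].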
The history of code run in your console will be recorded in the history pane. Go to the
top right pane and click on the history tab. You should see all the code that you just ran in
the console. Click on a line of code in the history pane that you want to run again. Then,
with your cursor on the line (you don’t have to highlight the whole line) find and click the
“To Console” button. You should notice that the line now appears in your console.
In general, as I mentioned above, you will work in the source pane rather than the
console because the source pane allows you to save your code as text and adds intelligent
color coding and tabbing to help make code easier to read and debug. Go to the “File” menu
and select “New File” and “R Script”. A new text file should open up in your source
pane. Save the file. You may send code from your history to your source text file by clicking
on “To Source”. Try that as well. You should now see the line in your source pane. To run
a line of code in your source file, put your cursor anywhere on that line and, on macOS,
press Command and Return at the same time. On Windows, press Control and Enter at the
same time. The line will run and the cursor will move to the next line. This is a convenient
way to move through a document. If you wish to run a specific part of a line, or more than
one line at a time, simply select the code you wish to run and then press Command and
Return or Control and Enter. There is also a “Run” button at the top of the RStudio source
pane in case you prefer to point and click to run.

1.3 The CRAN Website and the R Community


In addition to downloading R, you may also visit the Comprehensive R Archive Network
(CRAN) to access various help manuals at https://cran.r-project.org/manuals.html.
If you have an R-related question, a Google search is typically a great first
step toward finding an answer. Check out the R questions at Stack Overflow
here: https://stackoverflow.com/. Chances are, if you have an R-related question, someone
else has already asked about it on one of the Stack Exchange sites. The Quick-R website
is another resource. It is available at http://www.statmethods.net/.


2 Working with Code in the Source Pane
Create a new R syntax file for today by going to File → New File → R Script, then
save the file and call it “01_Intro.R”. Then locate the file where you saved it, right-click
on the file, choose either get properties (Windows) or get info (macOS), and select “open with”.
Find the option to change the default so that all .R files open with RStudio by default. Then,
double-click on “01_Intro.R” and verify that it opens in RStudio. If not, try again.
Once you have written code in the syntax file, there are a few ways to run it.

• Select the code you wish to run and then click the “Run” button. This option will run
all the selected code.

• Select the code you wish to run and simultaneously press Command and Return on
macOS or Control and Enter on Windows. This option will run all the selected code.

• Put your cursor anywhere on a line you wish to run and simultaneously press
Command and Return on macOS or Control and Enter on Windows. This option will
only run one line at a time.

2.1 Using R as a calculator


• The hashtag “#” is the comment character in R. Anything on a line following a hashtag
will be ignored.

• As we have seen, R will do arithmetic operations using the usual symbols.

• Other math functions include sqrt for square root, exp for the exponential function,
log for log base e, and the trigonometric functions sin, cos, and tan.

• R will follow order of operations. For example, running 5 + 3 * 4 will return 17, not
32.

• Parentheses may be used as well. For example, running (5 + 3) * 4 will return 32.

• Scientific notation can be specified with the letter e, which is interpreted as “times ten
to the power of” when written in a numerical expression. For example, 2e2 is 200.
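The points in this list can be tried out together in a short script. The following is a minimal sketch; the specific inputs (16, 0, 1.5e-3, and so on) are arbitrary choices made for illustration.

```r
# Anything after the comment character is ignored by R
5 + 3 * 4      # multiplication is done first, so this is 17
(5 + 3) * 4    # parentheses force the addition first, so this is 32

sqrt(16)       # square root: 4
exp(0)         # exponential function: e^0 is 1
log(exp(1))    # log base e of e is 1
sin(0)         # trigonometric functions take angles in radians: 0
cos(0)         # 1

2e2            # scientific notation: 2 times 10^2, i.e. 200
1.5e-3         # 1.5 times 10^-3, i.e. 0.0015
```

Each line can be run on its own with the keyboard shortcut, or the whole block can be selected and run at once.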

Activity 1 Use R as a calculator by writing and saving code to your syntax file and then
running it. Experiment with comments, order of operations, and scientific notation. Experiment
with all three methods given above for running the code in your syntax file.

2.2 Assigning Values


• Before we get into assigning values, look at your environment tab in the environment/history
pane. It should be empty at this point, meaning that no variables have
been assigned, and nothing is stored in your R workspace. To check this with code, you
may enter ls(), which will list all the names of objects stored in your environment.

• The help page for the assign function notes that the first two arguments passed to the
function are called x and value, where x needs to be a character string representing
the name of the variable you want to create, and value is the value you want the
variable to have. Let’s call the variable var1, and let’s assign it a value of 5133.

• The value of var1 may be overwritten. Suppose we want to update var1 to be its old
value less 5000. See the source file for code to do just that.

• There is a shortcut for the assign function that involves using the less-than symbol
and the hyphen to construct an assignment arrow, <-.

• To assign the variable name x to have the value 5, we could write assign(x = "x",
value = 5). Or, with the shorthand, we could simply write x <- 5.
Now is a good time to note that R is case sensitive. So, even though a variable named
x is defined, R will not recognize X because of the case difference. Similarly, if you
try to assign a variable by writing Assign instead of assign, R will throw an error
because of the capitalization.
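The steps in this list might look as follows in a source file. This is one possible sketch and not necessarily the code in the course source file; the variable names var1 and x and the values 5133, 5000, and 5 are taken from the text above.

```r
ls()                                # names of objects currently in the workspace

assign(x = "var1", value = 5133)    # create var1 with the assign function
var1                                # prints 5133

var1 <- var1 - 5000                 # overwrite var1 with its old value less 5000
var1                                # prints 133

x <- 5                              # shorthand for assign(x = "x", value = 5)
x * 2                               # prints 10

# R is case sensitive, so uppercase X is a different name from lowercase x.
# Uncommenting the next line would stop with: Error: object 'X' not found
# X
```

Running ls() again after these lines would list var1 and x among the objects in the workspace.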

Activity 2 Use the assign() function and the assignment arrow to define some variables
and then operate on them with arithmetic operations.

2.3 Creating Vectors with the c() Function


The c() function, where “c” is for “combine”, is used to create vectors. For example, to
simultaneously create a vector of the consecutive integers from 1 to 10 and assign it to the
name vec1, we could run the following:
assign("vec1", c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
Similarly, we could use the assignment arrow shortcut to accomplish the same thing as
follows:
vec1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
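To check that the two approaches really give the same result, something like the following could be run. The names vec_a, vec_b, and vec_c are hypothetical and chosen only for this sketch; identical() and length() are base R functions not otherwise covered in these notes.

```r
vec_a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)            # assignment arrow
assign("vec_b", c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))    # assign function
identical(vec_a, vec_b)     # TRUE: both routes create the same vector

vec_c <- c(vec_a, 11, 12)   # c() can also combine an existing vector with new values
length(vec_c)               # 12
```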

3 Two Important Functions: seq and rep


The seq and rep functions are basic building block functions that are used, respectively, to
create sequences or repetitions of numbers. Some essential information about seq():
• Access the help on the seq() function by running help(seq), or ?seq.

• The first three arguments of the seq() function are from, to, and by and are used to
specify the starting point, ending point, and increment, respectively, for the generated
sequence.

• seq(from = 2, to = 30, by = 4) will produce 2 6 10 14 18 22 26 30.

• The colon may be used as a shorthand for the seq() function where the by argument
is 1.

• For example, 1:10 gives the same values as seq(from = 1, to = 10, by = 1). Both produce
1 2 3 4 5 6 7 8 9 10.
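A few calls to seq() illustrate the arguments described above. The object names s1, s2, and s3 and the countdown example are assumptions added for illustration.

```r
s1 <- seq(from = 2, to = 30, by = 4)    # named arguments
s1                                      # 2 6 10 14 18 22 26 30

s2 <- seq(2, 30, 4)                     # same call, arguments matched by position
s2                                      # 2 6 10 14 18 22 26 30

s3 <- seq(from = 10, to = 1, by = -3)   # a negative increment counts down
s3                                      # 10 7 4 1

1:10                                    # colon shorthand for an increment of 1
```

Naming the arguments, as in the first call, tends to make code easier to read later.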

Some essential information about rep():

• The help page for the rep function reveals that the first argument x is a vector.

• Other important arguments are times and each, which govern (a) how many times
to replicate the vector and (b) how many times each element of the vector should be
repeated.

• For example, rep(x = 1:4, times = 3, each = 2) will produce 1 1 2 2 3 3 4 4
1 1 2 2 3 3 4 4 1 1 2 2 3 3 4 4.
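To see how times and each work separately and together, the following sketch may help. The object names r_times, r_each, and r_both are hypothetical.

```r
r_times <- rep(x = 1:4, times = 3)            # whole vector repeated 3 times
r_times                                       # 1 2 3 4 1 2 3 4 1 2 3 4

r_each <- rep(x = 1:4, each = 2)              # each element repeated twice
r_each                                        # 1 1 2 2 3 3 4 4

r_both <- rep(x = 1:4, times = 3, each = 2)   # each is applied first, then times
length(r_both)                                # 4 * 2 * 3 = 24
```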

Activity 3 Use seq() and rep() to create sequences of numbers. In particular, explore and
understand the use of the arguments. Also use the colon as a shortcut.

Activity 4 Use rep() to create the following vector: 1 1 4 4 5 5 0 0 1 1 4 4 5 5 0 0
1 1 4 4 5 5 0 0.

4 Installing and Using Packages


Packages are collections of functions, data, and compiled code bundled up into a well-defined
format that makes them easily importable into R.

• Packages may be installed via code or via the RStudio help pane. Let’s install the car
package via the RStudio help pane.

• The car package name is an acronym for ‘Companion to Applied Regression’. The
package contains, among other things, a number of useful functions for diagnostics for
linear regression.

• Go to the ‘Packages’ tab in the help pane and click the ‘Install’ button.

• By selecting to install from ‘Repository (CRAN)’ you instruct R to download the
package from the CRAN website. You would choose the other option if you already
had a compressed file on your computer that contained the package you wanted to
install in R.

• In the space for ‘Packages’, put the name car.

• Check the box to ‘Install dependencies’ to enable packages that car depends on to also
be downloaded automatically.

• Then click ‘Install’.

If you get an error that package “car” is not available for your version of R, update to
the latest version of R by downloading it from the CRAN website. After installing the
latest version of R, restart RStudio and try to install package car again. It should work
now. The other way to install a package is to do it with code. You can go to the console
and type install.packages("car"), for example, to install the package manually. In any
case, when you install the package for the first time, all the packages that car depends on
will also need to be downloaded. Thus, you will get a message like the following:

also installing the dependencies ‘backports’, ‘digest’, ‘glue’, ‘zeallot’,
‘ellipsis’, ‘magrittr’, ‘vctrs’, ‘R6’, ‘clipr’, ‘BH’, ‘rematch’,
‘prettyunits’, ‘assertthat’, ‘utf8’, ‘forcats’, ‘hms’, ‘readr’, ‘cellranger’,
‘progress’, ‘zip’, ‘cli’, ‘crayon’, ‘fansi’, ‘pillar’, ‘pkgconfig’, ‘rlang’,
‘SparseM’, ‘MatrixModels’, ‘sp’, ‘haven’, ‘curl’, ‘data.table’, ‘readxl’,
‘openxlsx’, ‘tibble’, ‘minqa’, ‘nloptr’, ‘Rcpp’, ‘RcppEigen’, ‘carData’,
‘abind’, ‘pbkrtest’, ‘quantreg’, ‘maptools’, ‘rio’, ‘lme4’

Subsequently, you will receive status reports on the downloads of each of those packages.
Once they are all done downloading, you will get a message that “The downloaded binary
packages are in ...” Now the package is installed, but it is not yet loaded in your R session.
Type search() to see which packages and/or data sets are currently attached; car will not
be on the list yet. For example, function qqPlot() is available in package car, but not in base
R. If you ask for help on that function with help(qqPlot) before loading car, you will get a
message that there is no documentation for the function.
To load the car package, type library(car). Note that quotes around the package name
are not needed here as they are when installing, but quotes will not cause an error if you
use them here. Now run search() again and note that package car is loaded. Then ask for help
again with help(qqPlot) and note that the help page now opens.
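The install-then-load sequence described above can be sketched in code as follows. The requireNamespace() check is an addition for this sketch (it is a base R function that tests whether a package is installed without attaching it), and the names attached_before and car_attached are hypothetical.

```r
# One-time installation (needs an internet connection), left commented out here:
# install.packages("car")

attached_before <- search()   # what is attached before loading anything extra

if (requireNamespace("car", quietly = TRUE)) {
  library(car)                                  # attach car for this session
  car_attached <- "package:car" %in% search()   # TRUE once car is attached
} else {
  message('car is not installed yet; run install.packages("car") first')
  car_attached <- FALSE
}
car_attached
```

Installation only needs to happen once per computer (or per R version), whereas library(car) has to be run in every new R session that uses the package.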

5 Functions
The idea of a function in programming is analogous to the idea of a function in mathematics.
That is, a function is a rule that describes what to do with an input to create an output.
In R, a function is defined using the function() function by giving a list of arguments for
the function followed by an expression that tells R what to do with those arguments. For
example,

f1 <- function(x) { x + 2 }

To apply the function we might write f1(3), which returns 5.
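As a further sketch, the function above can be applied to several inputs, and a function may take more than one argument. The second function, add_then_double, is a hypothetical example invented here for illustration.

```r
f1 <- function(x) { x + 2 }
f1(3)        # 5
f1(10)       # 12
f1(1:3)      # the rule is applied to each element: 3 4 5

# Hypothetical function with two arguments, x and y
add_then_double <- function(x, y) {
  (x + y) * 2
}
add_then_double(3, 4)            # (3 + 4) * 2 is 14
add_then_double(x = 1, y = 1)    # arguments can also be passed by name: 4
```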

Activity 5 Create a function called f2 and apply it to some different inputs.
