HUDM 5026 - Introduction To Data Analysis and Graphics in R 01 - Introduction
1 Preliminaries
Why use R?
• It’s free.
• It accommodates all manner of basic statistical analysis and many, many advanced
and new methods.
• Many new procedures and capabilities come out first in R and are often supported by
publications in peer-reviewed journals, such as the Journal of Statistical Software.
• R runs on all the major operating platforms, including Mac, PC, and Unix.
• R’s ability to produce publication-quality plots and graphics is unparalleled.
Some recent examples of plots I have made in R can be found on my faculty webpage
at https://www.tc.columbia.edu/faculty/bsk2131/. In particular, see the PDF versions of
the papers entitled Variable Selection for Causal Effect Estimation and Heterogeneous Subgroup
Identification.
Next, go back to the “Preferences” menu and select the “General” tab. I prefer to uncheck
all boxes except the three shown in the figure below. Furthermore, I recommend setting the
drop-down menu so that RStudio never saves your workspace on exit. The rationale for
setting this to never save is that if you don’t, R will save whatever data are in your workspace
(i.e., objects visible using ls()) in an .RData file so that the next time you open RStudio,
everything will be available in your working environment. While, in principle, this is a nice idea, in
practice, you end up storing far more data and objects than are necessary, and the clutter
causes RStudio to open slowly and behave unreliably.
• Select the code you wish to run and then click the “Run” button. This option will run
all the selected code.
• Select the code you wish to run and simultaneously press command and return on
macOS or control and enter on Windows. This option will run all the selected code.
• Put your cursor anywhere on a line you wish to run and simultaneously press
command and return on macOS or control and enter on Windows. This option will
only run one line at a time.
• Other math functions include sqrt for square root, exp for the exponential function,
log for log base e, and sin, cos, and tan for the trigonometric functions.
• R will follow order of operations. For example, running 5 + 3 * 4 will return 17, not
32.
• Parentheses may be used as well. For example, running (5 + 3) * 4 will return 32.
• Scientific notation can be specified with the letter e, which is interpreted as “times ten
to the power of” when written in a numerical expression. For example, 2e2 is 200.
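The calculator features listed above can be tried out with a short script along the following lines. This is only a sketch; the particular numbers are arbitrary choices for illustration and do not come from the course source file.

```r
# Built-in math functions
sqrt(16)       # square root: 4
exp(0)         # exponential function: 1
log(exp(2))    # log base e undoes exp: 2
sin(0)         # 0
cos(0)         # 1

# Order of operations: multiplication happens before addition
5 + 3 * 4      # 17
(5 + 3) * 4    # 32, because the parentheses are evaluated first

# Scientific notation: "e" means "times ten to the power of"
2e2            # 200
1.5e-3         # 0.0015
```

Everything after a # on a line is a comment and is ignored by R when the line is run.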
Activity 1 Use R as a calculator by writing and saving code to your syntax file and then
running it. Experiment with comments, order of operations, and scientific notation.
Experiment with all three methods given above for running the code in your syntax file.
• The value of var1 may be overwritten. Suppose we want to update var1 to be its old
value less 5000. See the source file for code to do just that.
• There is a shortcut for the assign function that involves using the less than symbol
and the hyphen to construct an assignment arrow, <-.
• To assign the value 5 to a variable named x, we could write assign(x = "x",
value = 5). Or, with the shorthand, we could simply write x <- 5.
Now is a good time to note that R is case sensitive. So, even though a variable named
x is defined, R will not recognize X because of the case difference. Similarly, if you
try to assign a variable by writing Assign instead of assign, R will throw an error
because of the capitalization.
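The points above about assignment can be sketched as follows. The starting value of var1 used here (20000) is made up for illustration, since the actual value lives in the course source file; the variable name y is likewise just an example.

```r
# Long form: assign() takes the variable name as a character string
assign(x = "x", value = 5)

# Shorthand: the assignment arrow does the same job
y <- 5

x + y          # 10

# Overwrite a variable using its old value
# (20000 is a hypothetical starting value, not the one in the source file)
var1 <- 20000
var1 <- var1 - 5000
var1           # 15000

# R is case sensitive: x has been defined, X has not
exists("x")    # TRUE
exists("X")    # FALSE unless something named X has been defined
```

Typing X at the console in place of x would produce an "object not found" error, which is the case-sensitivity issue described above.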
Activity 2 Use the assign() function and the assignment arrow to define some variables
and then operate on them with arithmetic operations.
• The first three arguments of the seq() function are from, to, and by and are used to
specify the starting point, ending point, and increment, respectively, for the generated
sequence.
• The colon may be used as a shorthand for the seq() function when the by argument
is 1.
• The help page for the rep function reveals that the first argument x is a vector.
• Other important arguments are times and each, which govern (a) how many times
to replicate the vector and (b) how many times each element of the vector should be
repeated.
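A few calls illustrating the arguments described above are sketched here. The specific values are arbitrary and chosen only to make the pattern easy to see.

```r
# seq(): starting point, ending point, and increment
seq(from = 1, to = 10, by = 3)          # 1 4 7 10
seq(1, 2, 0.25)                         # 1.00 1.25 1.50 1.75 2.00

# The colon is shorthand for seq() with by = 1
1:5                                     # 1 2 3 4 5
seq(from = 1, to = 5, by = 1)           # 1 2 3 4 5

# rep(): times repeats the whole vector, each repeats every element
rep(x = c(1, 2, 3), times = 2)          # 1 2 3 1 2 3
rep(x = c(1, 2, 3), each = 2)           # 1 1 2 2 3 3
rep(x = c(1, 2), times = 2, each = 2)   # 1 1 2 2 1 1 2 2
```

When both each and times are given, each is applied first and the result is then repeated times times.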
Activity 3 Use seq() and rep() to create sequences of numbers. In particular, explore and
understand the use of the arguments. Also use the colon as a shortcut.
• Packages may be installed via code or via the RStudio help pane. Let’s install the car
package via the RStudio help pane.
• The car package name is an acronym for ‘Companion to Applied Regression’. The
package contains, among other things, a number of useful functions for diagnostics for
linear regression.
• Go to the ‘Packages’ tab in the help pane and click the ‘Install’ button.
• Check the box to ‘Install dependencies’ to enable packages that car depends on to also
be downloaded automatically.
Subsequently, you will receive status reports on the downloads of each of those packages.
Once they are all done downloading, you will get a message that “The downloaded binary
packages are in ...” Now the package is downloaded, but it is not yet loaded in your workspace.
Type search() to see which packages and/or data sets are loaded in your global working
environment; car will not be among them yet. For example, the function qqPlot() is available
in package car, but not in base R. If you ask for help on that function with help(qqPlot)
before loading car, you will get a message that there is no documentation for the function.
To load the car package, type library(car). Note that quotes around the package name
are not needed here as they are when installing, but quotes will not cause an error if you
use them here. Now run search() again and note that package car is loaded. Ask for help
again with help(qqPlot) and note that the help page is now found.
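The same install-then-load workflow can also be done entirely in code. The sketch below assumes an internet connection for the install step, which is left commented out so the script can be rerun without downloading again. The dependencies = TRUE argument is meant to play the same role as the ‘Install dependencies’ check box.

```r
# Install once per machine; the package name must be quoted here.
# install.packages("car", dependencies = TRUE)

# Before loading, "package:car" is normally absent from the search path
search()

# Load the package only if it has been installed
if (requireNamespace("car", quietly = TRUE)) {
  library(car)                          # quotes are optional here
  print("package:car" %in% search())    # TRUE once car is attached
  # help(qqPlot) now opens the documentation page
}
```

Installation only needs to happen once, whereas library(car) must be run in every new R session in which you want to use the package.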
5 Functions
The idea of a function in programming is analogous to the idea of a function in mathematics.
That is, a function is a rule that describes what to do with an input to create an output.
In R, a function is defined using the function() function by giving a list of arguments for
the function followed by an expression that tells R what to do with those arguments. For
example,
f1 <- function(x) { x + 2 }
To apply the function, we might write f1(3), which returns 5.
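As a small extension of the example above, the sketch below applies f1 to a vector and defines a second function with two arguments. The function f2 and its argument names are hypothetical and are included only to show the pattern.

```r
# The function from the text: adds 2 to its input
f1 <- function(x) { x + 2 }
f1(3)              # 5
f1(c(1, 2, 3))     # 3 4 5, because arithmetic in R works element by element

# A hypothetical function with two arguments
f2 <- function(x, y) {
  x * y
}
f2(3, 4)           # 12, arguments matched by position
f2(y = 4, x = 3)   # 12, arguments matched by name
```

The value of the last expression evaluated inside the braces is what the function returns.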