Lab Activity 1

This document serves as a guide for R Lab 1, introducing R and RStudio, and outlining essential commands and functions for statistical calculations. It covers topics such as setting the working directory, using the command line, creating and manipulating vectors, and saving scripts. Additionally, it emphasizes the importance of documenting code with comments and provides instructions for installing and loading R packages.

R Lab 1 Markdown

Millie Murphy

2025-02-06

The code above is very important. It sets the working directory and calls
upon the location of the datasets on my computer. You will need to change
this to match your computer.
Student Learning Outcomes
• Learn how to start with R and RStudio

• Use the command line

• Use functions in R

Data, R scripts, and other resources for this lab are found on Moodle.
Learning the tools
What is R?
R is a computer program that allows an extraordinary range of statistical
calculations. It is a free program, mainly written by voluntary contributions
from statisticians around the world. R is available on most operating
systems, including Windows, Mac OS, and Linux.
R can make graphics and do statistical calculations. It is also a full-fledged
computing language. In this series of labs, we will only scratch the surface of
what R can do.
What is RStudio?
RStudio is a separate program, also free, that provides a more elegant front
end for R. RStudio allows you to easily organize separate windows for R
commands, graphics, help, etc. in one place.
Getting started
By this point, you should have already downloaded R and RStudio. When you
start RStudio, it will automatically start R as well. You run R inside RStudio.
After you have started RStudio, you should see a new window with a menu
bar at the top and three main sections. One of the sections is called the
“Console” – this is where you type commands to give instructions to R and
typically where you see R’s answers to you.
Another important corner of this window can show a variety of information.
Most importantly to us, this is where graphics will appear, under the tab
marked “Plots”.
The command line
When you start RStudio, you’ll see a corner of the window called the
“Console.” By default, the console window is in the bottom left of the
RStudio screen.
You can type commands in this window where there is a prompt (which will
look like a > sign at the bottom of the window). The Console has to be the
selected window. (Clicking anywhere in the Console selects it.)
The > prompt is R’s way of inviting you to give it instructions. You
communicate with R by typing commands after the > prompt.
Type “2+2” at the > prompt, and hit return. You’ll see that R can work like a
calculator (among its many other powers). It will give you the answer, 4, and
it will label that answer with [1] to indicate that it is the first element in the
answer. (This is sort of annoying when the answers are simple like this, but
can be very valuable when the answers become more complex.)
In these labs, the input will show up in a gray box and the output, if any, will
follow in a white box.
2+2

## [1] 4

log()
You can use a wide variety of math functions to make calculations here, e.g.,
log() calculates the log of a number:
log(42)

## [1] 3.73767

(By default, this gives the natural log with base e.)
Parentheses are used both as a way to group elements of the calculation and
also as a way to denote the arguments of functions. (The “arguments” of a
function are the set of values given to it as input.) For example, log(3) is
applying the function log() to the argument 3.
sqrt()
Another mathematical function that often comes in handy is the square root
function, sqrt(). For example, the square root of 4 is:
sqrt(4)
## [1] 2

To calculate a value with an exponent, use the ^ symbol. For example, 4³
(4 raised to the power 3) is written as:
4^3

## [1] 64

Of course, many math functions can be combined to give an almost infinite
possibility of mathematical expressions. For example,

$$\frac{1}{\sqrt{2\pi(3.1)^2}}\; e^{-\frac{(12-10.7)^2}{2(3.1)}}$$

can be calculated with:
(1/(sqrt(2 * pi * (3.1)^2))) * exp(-(12-10.7)^2/(2*3.1))

## [1] 0.09798692

Saving your code


When you analyze your own data, we strongly recommend that you keep a
record of all commands used, along with copious notes, so that weeks or
years later you can retrace the steps of your earlier analysis.
In RStudio, you can create a text file (sometimes called a script), which
contains R commands that can be reloaded and used at a later date. Under
the menu at the top, choose “File”, then “New File”, and then “R Script”. This
will create a new section in RStudio with the temporary name “Untitled1” (or
similar). You can copy and paste any commands that you want from the
Console, or type directly here. (When you copy and paste, it’s better to not
include the > prompt in the script.)
If you want to keep this script for later, just hit Save under the File menu. In
the future you can open this file in all the normal ways to have those
commands available for use again.
It is best to type all your commands in the script window and run them from
there, rather than typing directly into the console. This lets you save a record
of your session so that you can more easily re-create what you have done
later.
Comments
In scripts, it can be very useful to save a bit of text which is not to be
evaluated by R. You can leave a note to yourself (or a colleague) about what
the next line is supposed to do, what its strengths and limitations are, or
anything else you want to remember later. To leave a note, we use
“comments”, which are a line of text that starts with the hash symbol #.
Anything on a line after a # will be ignored by R.
# This is a comment. Running this in R will
# have no effect.
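
As a small illustration of how comments sit alongside code in a script, here is a sketch. The calculation and the value 98.6 are only an example chosen for this note, not part of the lab script.

```r
# Convert a temperature from degrees Fahrenheit to degrees Celsius.
# (The value 98.6 is just an illustrative number.)
(98.6 - 32) * 5/9   # a comment can also follow code on the same line
```

R evaluates only the arithmetic; everything after each # is skipped.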

Functions
Most of the work in R is done by functions. A function has a name and one or
more arguments. For example, log(4) calls the function log(), which
calculates the log in base e of the value 4 given as input.
Sometimes functions have optional input arguments. For the function log(),
for example, we can specify the optional input argument base to tell the
function what base to use for the logarithm. If we don’t specify the base
argument, it has a default value of base = e. To get a log in base 10, for
example, we would use:
log(4, base = 10)

## [1] 0.60206
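
The following sketch shows a few equivalent ways of supplying the base, using only functions from base R. The particular numbers are chosen for illustration.

```r
log(100, base = 10)   # base given by name
log(100, 10)          # same call: base is the second argument of log()
log10(100)            # shortcut function for base 10
log(8, base = 2)      # log in base 2
exp(1)                # the number e, the default base of log()
```

Naming the argument (base = 10) is usually easier to read later than relying on argument order.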

Defining objects
In R, we can store information of various sorts by assigning them to objects.
For example, if we want to create an object called x and give it a value of 4,
we would write
x <- 4

The middle bit of this—a less than sign and a hyphen typed together to make
something that looks a little like a left-facing arrow—tells R to assign the
value on the right to the object on the left. After running the command
above, whenever we use x in a command it would be replaced by its value 4.
For example, if we add 3 to x, we would expect to get 7.
x+3

## [1] 7

Objects in R can store more than just simple numbers. They can store lists of
numbers, functions, graphics, etc., depending on what values get assigned
to the object.
We can always reassign a new value to an object. If we now tell R that x is
equal to 32:
x <- 32

then x takes its new value:


x

## [1] 32
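
One detail worth seeing in a short sketch: an object computed from x keeps its value when x is later reassigned. The object y below is a hypothetical name introduced only for this illustration.

```r
x <- 4
y <- x + 3   # y is computed now, using the current value of x
x <- 32      # reassigning x does not go back and change y
x            # shows the new value of x
y            # still the value computed when x was 4
```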
Names
Naming objects and functions in R is pretty flexible.
A name should start with a letter, which can be followed by letters,
numbers, underscores, or periods. There can’t be any spaces, though.
Names in R are case-sensitive, which means that Weights and weights are
completely different things to R. This is a common and incredibly frustrating
source of errors in R.
It’s a good idea to have your names be as descriptive as possible, so that
you will know what you meant later on when looking at it. (However, if they
get too long, it becomes painful and error prone to type them each time we
use them, so this, as with all things, requires moderation.)
Sometimes clear naming means that it is best to have multiple words in the
name, but we can’t have spaces. Therefore, a common approach is to chain
the words with underscores (not hyphens!), as in weights_before_hospital.
(Another solution to make separate words stand out in an object name is to
vary the case: weightsBeforeHospital.)
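
A quick way to see case sensitivity in action is sketched below. The object name body_temp_c is hypothetical, and the sketch assumes a fresh session in which no object called Body_temp_c has been created.

```r
body_temp_c <- 37

exists("body_temp_c")   # is there an object with exactly this name?
exists("Body_temp_c")   # same letters, different capitalization
```

The first call reports that the object exists; the second does not find it, because R treats the two spellings as unrelated names.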
R Command Summary

Questions
1. For each week, create an R script that captures the commands that
you use to answer the questions. Use a # at the beginning of each
comment line.

a. Open a new R script file. Start by adding comments with your name
and the week (Week 1) at the top.
#Ignore this question since we are working in R Markdown.

2. All of the commands used in this lab are in a script called “R Lab 1 –
Part 1 Script” on Moodle. (A similar file will be available for all R
activities in this class.)

a. Load this script into this file.

b. Run most or all of the commands in R. Did you get the same
answers as shown in this handout?

In the formula shown earlier (the long expression combining sqrt() and
exp()), pi is 3.1415…, and the exp() function raises the base of the
natural log (e = 2.718…) to the power of the value in the argument.
3. For each of the following, come up with an object name that would be
appropriate to use in R for the listed variable:
a. Body temperature in Celsius
#bodytemp_c

b. How much aspirin is given per dose for a patient
#asprindose_p

c. Number of televisions per person
#tvnumber_p

d. Height (including neck and extended legs) of giraffes
#giraffeheight_nl

4. Use R to calculate:

a. 15 x 17
15 * 17

## [1] 255

b. 13^3
13^3

## [1] 2197

c. log(14) (use the natural log)


log(14)

## [1] 2.639057

d. log(100) (use base 10)


log(100, base = 10)

## [1] 2

e. √81.
sqrt(81)

## [1] 9

RLAB 1 - INTRO TO R PART 2


Learning Outcomes
• Learn how to increase the power of R with packages

• Use vectors

• Use data frames

If you have not already done so, download the data files, R scripts and other
resources for Part 2 of R Lab 1 – An Introduction to R.
Vectors
One useful feature of R is the ability to apply functions to an entire
collection of numbers at once. The technical term for an ordered collection
of numbers is “vector”. For example, the following code will create a
vector of six numbers:
c(78, 85, 64, 54, 102, 98.6)

## [1] 78.0 85.0 64.0 54.0 102.0 98.6

c()
c() is a function that creates a vector, containing the list of items given in its
arguments. To help you remember, you could think of the function c()
meaning to “combine” some elements into a vector.
Let’s add a little extra here to make the computer remember this vector.
Let’s assign it to an object, called temperatureF (because these numbers are
actually a set of temperatures in degrees Fahrenheit):
temperatureF <- c(78, 85, 64, 54, 102, 98.6)

The combination of the less than sign and the hyphen makes an arrow
pointing from right to left—this tells R to assign the stuff on the right to the
name on the left. In this case we are assigning a vector to the object
temperatureF.
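Inputting this to R causes no obvious output, but R will now remember this
vector of temperatures under the name temperatureF. We can view the
contents of the vector temperatureF by simply typing its name. We can also
do arithmetic on every element of a vector at once. For example, the
following converts each temperature to degrees Celsius, stores the result
in temperatureC, and then displays it: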
temperatureC <- (temperatureF - 32) * 5/9

temperatureC

## [1] 25.55556 29.44444 17.77778 12.22222 38.88889 37.00000

To pull out one of the numbers in this vector, we add square brackets after
the vector name, and inside those brackets put the index of the element we
want. (The “index” is just a number giving the relative location in the vector
of the item we want. The first item has index 1, etc.) For example, the
second element of the vector temperatureC is:
temperatureC[2]

## [1] 29.44444

One of the common ways to slip up in R is to confuse the [square brackets],
which pull out an element of a vector, with the (parentheses), which are
used to enclose the arguments of a function.
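
The sketch below contrasts square brackets (indexing) with parentheses (function calls), reusing the temperatureF values from above. Indexing with a vector of positions or with a negative index is standard R but goes slightly beyond what this handout covers.

```r
temperatureF <- c(78, 85, 64, 54, 102, 98.6)

temperatureF[1]         # square brackets: the first element
temperatureF[c(1, 3)]   # the first and third elements
temperatureF[2:4]       # the second through fourth elements
temperatureF[-1]        # everything except the first element

length(temperatureF)    # parentheses: calling the function length()
```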
Vectors can also operate mathematically with other vectors. For example,
imagine you have a vector of the body weights of patients before entering
hospital (weight_before_hospital) and another vector with the same
patients’ weights after leaving hospital (weight_after_hospital). You can
calculate the change in weight for all these patients in one command, using
vector subtraction:
#weight_change_during_hospital <- weight_after_hospital - weight_before_hospital

The result will be a vector that has each patient’s change in weight. (Note
that the code above is written as a comment, because the two weight vectors
have not been defined. This allows R Markdown to run without an error.)
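
To see vector subtraction actually run, here is a minimal sketch with made-up weights for three patients. The numbers (in kilograms) are hypothetical and are not data from this lab.

```r
# Hypothetical weights in kg for three patients, in the same patient order.
weight_before_hospital <- c(70.2, 81.5, 64.0)
weight_after_hospital  <- c(68.9, 82.0, 61.5)

# Subtraction is done element by element: first minus first, and so on.
weight_change_during_hospital <- weight_after_hospital - weight_before_hospital
weight_change_during_hospital
```

A negative value means that patient lost weight during the stay. Both vectors must list the patients in the same order for the result to make sense.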
Basic calculation examples
In this course, we’ll learn how to use a few dozen functions, but let’s start
with a couple of basic ones.
mean()
The function mean() does just what it sounds like: it calculates the sample
mean (that is, the average) of the vector given to it as input. For example,
the mean of the vector of the temperatures in degrees Celsius from above is
26.81481:
mean(temperatureC)

## [1] 26.81481

sum()
Another simple (and simply named) function calculates the sum of all
numbers in a vector: sum().
sum(temperatureC)

## [1] 160.8889

length()
To count the number of elements in a vector, use length().
length(temperatureC)

## [1] 6

This shows that the vector temperatureC contains 6 temperature values.
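
These three functions fit together: the sample mean is the sum divided by the number of values. The short sketch below rebuilds temperatureC so that it can run on its own.

```r
temperatureF <- c(78, 85, 64, 54, 102, 98.6)
temperatureC <- (temperatureF - 32) * 5/9

# The mean computed by hand from sum() and length() ...
sum(temperatureC) / length(temperatureC)

# ... agrees with the built-in function.
mean(temperatureC)
```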
Expanding R’s capabilities
R has a lot of power in its basic form, but one of the most important parts
about R is that it is expandable by the work of other people. These
expansions are usually released in “packages”.
Each package needs to be installed on your computer only once, but to be
used it has to be loaded into R during each session.
To install a package in RStudio, click on the Packages tab in the
sub-window with tabs for Files, Plots, Packages, Help, and Viewer.
Immediately below that will be a button labeled “Install”; click that and a
window will open.

In the second row (labeled “Packages”), type “ggplot2” (without the
quotation marks). Make sure the box for “Install dependencies” near the
bottom is checked, and then click the “Install” button at bottom right. This
will install the graphics package “ggplot2”.

This only needs to be done once on a given computer, and that package is
permanently available.
There is another package that we will use in this course (including today)
called dplyr. While you are installing packages, go ahead and install dplyr
as well.
Loading a package
Once a package is installed, it needs to be loaded into R during a session if
you want to use it. You do this with a function called library().
library()
For this lab, we will use the package dplyr, which allows for easy
modification of data frames. Before using the functions in this package, we
need to load it. We do this with the library() function. In the console, enter
this:
library(dplyr)

If the dplyr package is installed on your computer, the computer will just
give a new prompt and be ready to go. If the package is not installed it will
give you an error message in red asking you to get the package installed.
(See the section above.)
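
Packages can also be installed from the console with install.packages(), which is part of base R. The sketch below is one possible pattern, not the lab's required workflow: the install line is left commented out so that nothing is downloaded unless you choose to run it, and has_dplyr is a hypothetical object name.

```r
# install.packages("dplyr")   # run once per computer; needs an internet connection

# Check whether dplyr is installed before trying to load it.
has_dplyr <- requireNamespace("dplyr", quietly = TRUE)

if (has_dplyr) {
  library(dplyr)   # load the package for this session
} else {
  message("dplyr is not installed yet; install it first.")
}
```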
Setting the working directory
The files on your computer are organized hierarchically into folders, or
“directories.” It is convenient in RStudio to tell R which directory to look
in for files at the beginning of a session, to minimize typing later.
You can set the working directory for RStudio from RStudio’s menu. From the
“Session” tab in the menu bar, choose “Set Working Directory”, and then
“Choose Directory…” This will open a dialog box that will let you find and
select the directory you want. For this course, you may want to keep all the
files associated with R in the same folder on your computer. (Note that this
was done at the top of this document. Doing this allows the code that follows
to work, allowing the dataset to be loaded.)
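
The same thing can be done from the console with the base R functions getwd() and setwd(). In the sketch below the setwd() line is commented out because the path is specific to one computer; it is the path used later in this handout and would need to be changed to match yours.

```r
getwd()                        # show the current working directory

# setwd("~/Desktop/Bio 341")   # example path from this handout; edit for your computer

file.exists("titanic.csv")     # is the data file in the working directory?
```

If file.exists() reports FALSE, the working directory is probably not the folder that holds the data.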
Reading a file
In these labs, we have saved the data in a “comma-separated values”
format. All files in this format ought to have “.csv” as the end of their file
name. A CSV file is a plain text file, easily read by a wide variety of
programs. Each row in the file (besides the first row) is the data for a given
individual, and for each individual each variable is listed in the same order,
separated by commas. It’s important to note that you can’t have commas
anywhere else in the file, besides the separators.
The first row of a CSV file should be a “header” row, which gives the names
of each variable, again separated by commas.
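
To make the layout concrete, the sketch below writes a tiny CSV file to a temporary location and reads it back. The three variables and two rows are invented for illustration and are not taken from titanic.csv.

```r
# A header row followed by one row per individual, all comma-separated.
csv_lines <- c(
  "name,age,sex",
  "PassengerA,29,female",
  "PassengerB,NA,male"
)

# Write the lines to a temporary file, then read that file as a CSV.
example_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, example_file)

exampleData <- read.csv(example_file, stringsAsFactors = TRUE)
exampleData
```

The entry NA in the file is read as a missing value, which is how unknown ages appear in the Titanic data as well.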
read.csv()
For examples in this lab, let’s use a data set about the passengers of the
RMS Titanic. One of the data sets in the folder of data attached to this lab is
called “titanic.csv”. This is a data set of 1313 passengers from the voyage of
this ship, which contains some personal information about each passenger
as well as whether they survived the accident or not.
To import a CSV file into R, use the read.csv() function as in the following
command. (This assumes that you have set the working directory to a
specific folder on your computer.)
setwd("~/Desktop/Bio 341")
titanicData <- read.csv("titanic.csv", stringsAsFactors = TRUE)

This looks for the file called titanic.csv in the working directory (here, the
folder set by setwd()). We have given the name titanicData to the object in
R that contains all this passenger data. Of course, if you wanted to load a
different data set, you would be better off giving it a more apt name than
“titanicData”. The option “stringsAsFactors = TRUE” asks R to interpret the
columns with non-numerical information as “factors”, with possibly repeated
instances of the same value of a categorical variable.
To see if the data loads appropriately, you might want to run the command
summary(titanicData)

##  passenger_class                        name           age
##  1st:322         Carlsson,MrFransOlof     :   2   Min.   : 0.1667
##  2nd:280         Connolly,MissKate        :   2   1st Qu.:21.0000
##  3rd:711         Kelly,MrJames            :   2   Median :30.0000
##                  Abbing,MrAnthony         :   1   Mean   :31.1942
##                  Abbott,MasterEugeneJoseph:   1   3rd Qu.:41.0000
##                  Abbott,MrRossmoreEdward  :   1   Max.   :71.0000
##                  (Other)                  :1304   NA's   :680
##         embarked        home_destination      sex      survive
##             :493                   :558   female:463   no :864
##  Cherbourg  :202   NewYork,NY       : 65   male  :850   yes:449
##  Queenstown : 45   London           : 14
##  Southampton:573   Montreal,PQ      : 10
##                    Cornwall/Akron,OH:  9
##                    Paris,France     :  9
##                    (Other)          :648

which will list all the variables and some summary statistics for each
variable.
R has other functions that can read other data formats besides csv files, but
the function read.csv() requires that the file be a csv file.
Finding files in other locations
For all of the examples in these labs, we have assumed that the data is in a
folder called DataForLabs. When you work on a new data set outside of these
labs, you will want to store the data somewhere else. To load data from
another location on your computer, you need to know the file path for the
file. For example, the file path to the titanic file on my computer is
“C:\Users\davisjg\Desktop\Wofford Courses\BIO341\DataForLabs\titanic.csv”.
This file path is a list of folders inside of folders that tells the computer
where to look for the file.
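Reading a file from any location can be done with read.csv like this. (Inside
an R string, write the path with forward slashes or doubled backslashes,
because a single backslash starts an escape sequence.)
#titanicData <- read.csv("C:/Users/davisjg/Desktop/Wofford Courses/BIO341/DataForLabs/titanic.csv", stringsAsFactors = TRUE)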

(Note that I have already loaded the dataset in this R Markdown file so the
code above is stated as a comment so that it is ignored.)
Introduction to data frames
A data frame is a way that R can store a data set on a number of individuals.
A data frame is a collection of columns; each column contains the values of a
single variable for all individuals. The values of each individual occur in the
same order in all the columns, so the first value for one variable represents
the same individual as the first value in the lists of all other variables.
The function read.csv() loads the data it reads into a data frame.
The data frame is usually given a name, which is used to tell R’s functions
which data set to use. For example, in the previous section we read in a data
set to a data frame that we called titanicData. This data frame now contains
information about each of the passengers on the Titanic. This data frame has
seven variables, so it has seven columns (passenger_class, name, age,
embarked, home_destination, sex, and survive).
Very importantly, we can grab one of the columns from a data frame by
itself. We write the name of the data frame, followed by a $, and then the
name of the variable.
For example, to show a list of the age of all the passengers on the Titanic,
use
titanicData$age

## [1] 29.0000 2.0000 30.0000 25.0000 0.9167 47.0000 63.0000


39.0000 58.0000
## [10] 71.0000 47.0000 19.0000 NA NA NA 50.0000
24.0000 36.0000
## [19] 37.0000 47.0000 26.0000 25.0000 25.0000 19.0000 28.0000
45.0000 39.0000
## [28] 30.0000 58.0000 NA 45.0000 22.0000 NA 41.0000
48.0000 NA
## [37] 44.0000 59.0000 60.0000 45.0000 NA 53.0000 58.0000
36.0000 33.0000
## [46] NA NA 36.0000 36.0000 14.0000 11.0000 49.0000
NA 36.0000
## [55] NA 46.0000 47.0000 27.0000 31.0000 NA NA
NA NA
## [64] 27.0000 26.0000 NA NA 64.0000 37.0000 39.0000
55.0000 NA
## [73] 70.0000 69.0000 36.0000 39.0000 38.0000 NA 27.0000
31.0000 27.0000
## [82] NA 31.0000 17.0000 NA NA 4.0000 27.0000
50.0000 48.0000
## [91] 49.0000 48.0000 39.0000 23.0000 53.0000 36.0000 NA
NA 30.0000
## [100] 24.0000 19.0000 28.0000 23.0000 64.0000 60.0000 NA
49.0000 NA
## [109] 44.0000 22.0000 60.0000 48.0000 37.0000 35.0000 47.0000
22.0000 45.0000
## [118] 49.0000 NA 71.0000 54.0000 38.0000 19.0000 58.0000
45.0000 23.0000
## [127] 46.0000 25.0000 21.0000 48.0000 49.0000 45.0000 36.0000
NA 55.0000
## [136] 52.0000 24.0000 NA NA NA 16.0000 44.0000
51.0000 42.0000
## [145] 35.0000 35.0000 38.0000 35.0000 NA 50.0000 49.0000
46.0000 NA
## [154] 58.0000 41.0000 NA 42.0000 40.0000 NA NA
NA 42.0000
## [163] 55.0000 50.0000 16.0000 NA 29.0000 21.0000 30.0000
15.0000 30.0000
## [172] NA NA NA 46.0000 54.0000 36.0000 28.0000
NA 65.0000
## [181] 33.0000 44.0000 37.0000 NA 55.0000 47.0000 36.0000
58.0000 31.0000
## [190] 23.0000 19.0000 64.0000 NA 64.0000 22.0000 28.0000
NA NA
## [199] 22.0000 NA NA 18.0000 17.0000 52.0000 46.0000
56.0000 NA
## [208] NA 43.0000 31.0000 NA NA 33.0000 NA
27.0000 55.0000
## [217] 54.0000 NA 61.0000 48.0000 18.0000 13.0000 21.0000
NA NA
## [226] NA 34.0000 40.0000 36.0000 50.0000 39.0000 56.0000
28.0000 56.0000
## [235] 56.0000 24.0000 18.0000 NA 24.0000 23.0000 45.0000
40.0000 6.0000
## [244] 57.0000 NA 32.0000 62.0000 54.0000 43.0000 52.0000
NA 62.0000
## [253] 67.0000 63.0000 61.0000 46.0000 52.0000 39.0000 18.0000
48.0000 NA
## [262] 49.0000 39.0000 17.0000 46.0000 NA 31.0000 NA
61.0000 47.0000
## [271] 64.0000 60.0000 60.0000 55.0000 54.0000 21.0000 57.0000
45.0000 31.0000
## [280] 50.0000 50.0000 27.0000 20.0000 51.0000 NA 21.0000
NA NA
## [289] 36.0000 NA NA NA NA NA NA
NA NA
## [298] NA NA NA NA NA NA NA
NA NA
## [307] 40.0000 NA NA 32.0000 NA NA NA
NA NA
## [316] NA 33.0000 NA NA NA NA NA
30.0000 28.0000
## [325] 18.0000 NA 34.0000 32.0000 57.0000 18.0000 23.0000
36.0000 28.0000
## [334] 51.0000 32.0000 19.0000 28.0000 36.0000 4.0000 1.0000
12.0000 34.0000
## [343] 19.0000 23.0000 26.0000 NA 27.0000 15.0000 45.0000
40.0000 20.0000
## [352] 25.0000 36.0000 25.0000 NA 42.0000 26.0000 26.0000
0.8333 31.0000
## [361] NA 19.0000 54.0000 44.0000 52.0000 30.0000 30.0000
NA NA
## [370] 29.0000 NA 29.0000 27.0000 24.0000 35.0000 31.0000
8.0000 22.0000
## [379] 30.0000 NA 20.0000 NA 21.0000 49.0000 8.0000
28.0000 18.0000
## [388] NA 28.0000 22.0000 25.0000 18.0000 32.0000 18.0000
NA 42.0000
## [397] 34.0000 8.0000 NA NA 23.0000 21.0000 19.0000
NA NA
## [406] NA 38.0000 NA 38.0000 35.0000 35.0000 38.0000
24.0000 16.0000
## [415] 26.0000 45.0000 24.0000 21.0000 22.0000 NA 34.0000
30.0000 50.0000
## [424] 30.0000 23.0000 1.0000 44.0000 28.0000 6.0000 30.0000
NA 43.0000
## [433] 45.0000 7.0000 24.0000 24.0000 49.0000 48.0000 NA
34.0000 32.0000
## [442] 21.0000 18.0000 53.0000 23.0000 21.0000 NA 52.0000
42.0000 36.0000
## [451] 21.0000 41.0000 NA NA 33.0000 17.0000 NA
NA NA
## [460] NA NA NA 23.0000 34.0000 NA 22.0000
NA NA
## [469] 45.0000 NA NA 31.0000 30.0000 26.0000 NA
34.0000 26.0000
## [478] 22.0000 1.0000 3.0000 NA NA NA 25.0000
NA 48.0000
## [487] NA 57.0000 NA NA NA 2.0000 NA
27.0000 19.0000
## [496] 30.0000 20.0000 45.0000 NA 46.0000 41.0000 13.0000
19.0000 30.0000
## [505] 48.0000 71.0000 54.0000 NA NA 64.0000 32.0000
18.0000 2.0000
## [514] 32.0000 3.0000 26.0000 19.0000 NA 20.0000 29.0000
39.0000 22.0000
## [523] NA 24.0000 NA 28.0000 NA 50.0000 20.0000
40.0000 42.0000
## [532] 21.0000 32.0000 34.0000 NA NA 33.0000 2.0000
8.0000 36.0000
## [541] 34.0000 30.0000 28.0000 23.0000 0.8333 25.0000 3.0000
50.0000 NA
## [550] 21.0000 NA NA 25.0000 18.0000 20.0000 30.0000
59.0000 30.0000
## [559] 35.0000 22.0000 NA 25.0000 41.0000 25.0000 14.0000
50.0000 22.0000
## [568] NA 27.0000 29.0000 27.0000 30.0000 22.0000 35.0000
30.0000 28.0000
## [577] 23.0000 NA 12.0000 40.0000 36.0000 28.0000 32.0000
29.0000 4.0000
## [586] 2.0000 NA NA 36.0000 33.0000 NA NA
NA 32.0000
## [595] NA NA 26.0000 NA 30.0000 24.0000 NA
18.0000 42.0000
## [604] 13.0000 16.0000 35.0000 16.0000 25.0000 18.0000 20.0000
30.0000 26.0000
## [613] 40.0000 24.0000 41.0000 18.0000 0.8333 23.0000 20.0000
25.0000 35.0000
## [622] 17.0000 32.0000 20.0000 39.0000 39.0000 6.0000 2.0000
17.0000 38.0000
## [631] 9.0000 26.0000 11.0000 4.0000 20.0000 26.0000 25.0000
18.0000 24.0000
## [640] 35.0000 40.0000 38.0000 5.0000 9.0000 3.0000 13.0000
23.0000 5.0000
## [649] NA 45.0000 23.0000 17.0000 27.0000 23.0000 20.0000
32.0000 33.0000
## [658] 3.0000 NA NA NA 18.0000 40.0000 26.0000
15.0000 45.0000
## [667] 18.0000 27.0000 22.0000 19.0000 26.0000 22.0000 20.0000
32.0000 21.0000
## [676] 18.0000 26.0000 6.0000 NA NA 9.0000 40.0000
32.0000 NA
## [685] 26.0000 18.0000 20.0000 NA 29.0000 22.0000 22.0000
35.0000 21.0000
## [694] 20.0000 19.0000 18.0000 18.0000 38.0000 NA 30.0000
17.0000 21.0000
## [703] 21.0000 21.0000 NA NA 24.0000 33.0000 33.0000
28.0000 16.0000
## [712] 37.0000 28.0000 NA 24.0000 21.0000 NA 32.0000
29.0000 26.0000
## [721] 18.0000 20.0000 19.0000 24.0000 24.0000 36.0000 31.0000
31.0000 30.0000
## [730] 22.0000 NA 43.0000 35.0000 27.0000 19.0000 30.0000
36.0000 3.0000
## [739] 9.0000 59.0000 19.0000 44.0000 17.0000 NA 45.0000
22.0000 19.0000
## [748] 29.0000 30.0000 34.0000 28.0000 0.3333 27.0000 25.0000
24.0000 22.0000
## [757] 21.0000 17.0000 NA NA 26.0000 33.0000 1.0000
0.1667 25.0000
## [766] 36.0000 36.0000 30.0000 NA 23.0000 26.0000 19.0000
65.0000 NA
## [775] 42.0000 43.0000 32.0000 19.0000 30.0000 24.0000 23.0000
NA 24.0000
## [784] 24.0000 23.0000 22.0000 NA 18.0000 16.0000 45.0000
NA NA
## [793] NA 47.0000 5.0000 NA NA NA NA
NA NA
## [802] NA NA NA NA NA 21.0000 18.0000
9.0000 48.0000
## [811] 16.0000 NA NA 25.0000 NA NA 22.0000
16.0000 NA
## [820] 33.0000 NA 9.0000 41.0000 38.0000 40.0000 43.0000
14.0000 16.0000
## [829] 9.0000 10.0000 6.0000 11.0000 40.0000 32.0000 NA
20.0000 37.0000
## [838] 28.0000 19.0000 NA NA NA NA NA
NA NA
## [847] NA NA NA NA NA NA NA
NA NA
## [856] NA NA NA NA NA NA NA
NA NA
## [865] NA NA NA NA NA NA NA
NA NA
## [874] NA NA NA NA NA NA NA
NA NA
## [883] NA NA NA NA NA NA NA
NA NA
## [892] NA NA NA NA NA NA NA
NA NA
## [901] NA NA NA NA NA NA NA
NA NA
## [910] NA NA NA NA NA NA NA
NA NA
## [919] NA NA NA NA NA NA NA
NA NA
## [928] NA NA NA NA NA NA NA
NA NA
## [937] NA NA NA NA NA NA NA
NA NA
## [946] NA NA NA NA NA NA NA
NA NA
## [955] NA NA NA NA NA NA NA
NA NA
## [964] NA NA NA NA NA NA NA
NA NA
## [973] NA NA NA NA NA NA NA
NA NA
## [982] NA NA NA NA NA NA NA
NA NA
## [991] NA NA NA NA NA NA NA
NA NA
## [1000] NA NA NA NA NA NA NA
NA NA
## [1009] NA NA NA NA NA NA NA
NA NA
## [1018] NA NA NA NA NA NA NA
NA NA
## [1027] NA NA NA NA NA NA NA
NA NA
## [1036] NA NA NA NA NA NA NA
NA NA
## [1045] NA NA NA NA NA NA NA
NA NA
## [1054] NA NA NA NA NA NA NA
NA NA
## [1063] NA NA NA NA NA NA NA
NA NA
## [1072] NA NA NA NA NA NA NA
NA NA
## [1081] NA NA NA NA NA NA NA
NA NA
## [1090] NA NA NA NA NA NA NA
NA NA
## [1099] NA NA NA NA NA NA NA
NA NA
## [1108] NA NA NA NA NA NA NA
NA NA
## [1117] NA NA NA NA NA NA NA
NA NA
## [1126] NA NA NA NA NA NA NA
NA NA
## [1135] NA NA NA NA NA NA NA
NA NA
## [1144] NA NA NA NA NA NA NA
NA NA
## [1153] NA NA NA NA NA NA NA
NA NA
## [1162] NA NA NA NA NA NA NA
NA NA
## [1171] NA NA NA NA NA NA NA
NA NA
## [1180] NA NA NA NA NA NA NA
NA NA
## [1189] NA NA NA NA NA NA NA
NA NA
## [1198] NA NA NA NA NA NA NA
NA NA
## [1207] NA NA NA NA NA NA NA
NA NA
## [1216] NA NA NA NA NA NA NA
NA NA
## [1225] NA NA NA NA NA NA NA
NA NA
## [1234] NA NA NA NA NA NA NA
NA NA
## [1243] NA NA NA NA NA NA NA
NA NA
## [1252] NA NA NA NA NA NA NA
NA NA
## [1261] NA NA NA NA NA NA NA
NA NA
## [1270] NA NA NA NA NA NA NA
NA NA
## [1279] NA NA NA NA NA NA NA
NA NA
## [1288] NA NA NA NA NA NA NA
NA NA
## [1297] NA NA NA NA NA NA NA
NA NA
## [1306] NA NA NA NA NA NA NA
NA

This will show a vector that has all the values for this variable age, one for
each individual in the data set.
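
Because a column pulled out with $ is an ordinary vector, the functions from earlier in this lab work on it. The sketch below uses a small made-up data frame (exampleData is a hypothetical name) rather than the Titanic file, and shows one thing to watch for: missing values (NA) make mean() return NA unless they are removed with the na.rm argument.

```r
# A tiny stand-in data frame; the values are invented for illustration.
exampleData <- data.frame(
  name = c("A", "B", "C", "D"),
  age  = c(29, NA, 2, 47)
)

exampleData$age                        # the age column as a vector
mean(exampleData$age)                  # NA, because one age is missing
mean(exampleData$age, na.rm = TRUE)    # mean of the known ages only
sum(is.na(exampleData$age))            # how many ages are missing
```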
Adding a new column
Sometimes we would like to add a new column to a data frame. The easiest
way to do this is to simply assign a new vector to a new column name, using
the $.
For example, to add the log of age as a column in the titanicData data
frame, we can write
titanicData$log_age <- log(titanicData$age)

You can run the command head(titanicData) to see that log_age is now a
column in titanicData.
head(titanicData)

##   passenger_class                                      name     age    embarked
## 1             1st                 Allen,MissElisabethWalton 29.0000 Southampton
## 2             1st                  Allison,MissHelenLoraine  2.0000 Southampton
## 3             1st           Allison,MrHudsonJoshuaCreighton 30.0000 Southampton
## 4             1st Allison,MrsHudsonJ.C.(BessieWaldoDaniels) 25.0000 Southampton
## 5             1st                Allison,MasterHudsonTrevor  0.9167 Southampton
## 6             1st                          Anderson,MrHarry 47.0000 Southampton
##              home_destination    sex survive     log_age
## 1                  StLouis,MO female     yes  3.36729583
## 2 Montreal,PQ/Chesterville,ON female      no  0.69314718
## 3 Montreal,PQ/Chesterville,ON   male      no  3.40119738
## 4 Montreal,PQ/Chesterville,ON female      no  3.21887582
## 5 Montreal,PQ/Chesterville,ON   male     yes -0.08697501
## 6                  NewYork,NY   male     yes  3.85014760
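
The same pattern works for any calculation on an existing column. The sketch below uses a small invented data frame so that it runs without the Titanic file; exampleData and age_months are hypothetical names chosen for this illustration.

```r
exampleData <- data.frame(age = c(29, 2, 30))

# Each assignment to a new column name adds one column.
exampleData$log_age    <- log(exampleData$age)
exampleData$age_months <- exampleData$age * 12

names(exampleData)   # lists the original column plus the two new ones
```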
Choosing subsets of data
Sometimes we want to do an analysis only on some of the data that fit
certain criteria. For example, we might want to analyze the data from the
Titanic using only the information from females.
The easiest way to do this is to use the filter() function from the package
dplyr. Make sure you have installed the dplyr package as described above,
and then load it into R using library():
library(dplyr)

In the titanic data set there is a variable named sex, and an individual is
female if that variable has value “female”. We can create a new data frame
that includes only the data from females with the following command:
titanicDataFemalesOnly <- filter(titanicData, sex == "female")

This new data frame will include all the same columns as the original
titanicData, but it will only include the rows for which the sex was “female”.
Note that the syntax here requires a double equal sign, ==. In R (and many
other computer languages), the double equal sign creates a statement that can
be evaluated as true or false, while a single equal sign assigns the value on
the right-hand side to the object on the left. Here we are asking, for each
individual, whether sex is “female”, not assigning the value “female” to the
variable sex. So we use a double equal sign, ==.
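The difference between the two operators can be seen in a minimal sketch. The vector sex and the object x below are hypothetical example values, not objects from the lab data.

```r
# Hypothetical example values
sex <- c("female", "male", "female")

# == compares element by element and returns TRUE or FALSE for each individual
is_female <- sex == "female"
is_female

# sum() counts the TRUE values, i.e. the number of females
sum(is_female)

# A single = assigns instead: after this line x holds the value 5
x = 5
x
```

filter() keeps exactly those rows for which the comparison is TRUE.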
R Commands Summary

Working together with your group, construct R code to answer the
questions below. Use R Markdown to generate your answers by
modifying this code. You should submit a copy of your final
document to the “R Lab1 Submission” link in Moodle. Although you
can work together, each person must submit their own completed
assignment. This assignment is not graded, but by completing this,
you are demonstrating your familiarity with using R and that you
are ready for the next step. The assignment is due by 11:30 AM on
Monday Feb. 10.
Questions
For the answers to the following questions, create a script in the spaces
provided that captures each of these calculations, and record your answers
where appropriate for each question. Make sure to include your name at the
top of this document. Use the Knit function to generate your document and
save it as a Word or PDF for submission.
1. People are notoriously dishonest about revealing how often they
perform antisocial behaviors like peeing in swimming pools. (In addition
to being disgusting, the nitrogenous chemicals in urine combine with
the pool’s chlorine to produce some toxic chemicals like trichloramine,
the source of most skin irritations for swimmers.) A group of
researchers (Jmaiff Blackstock et al. 2017) recently realized that an
artificial sweetener called ACE passes out in urine unmetabolized and
in known average quantities, and therefore by measuring ACE
concentrations we can measure the amount of urine in a pool. Here is
a list of measurements of the concentration of ACE (measured in ng/L),
each from a different pool, for 23 pools in Canada.
640, 1070, 780, 70, 160, 130, 60, 50, 2110, 70, 350, 30, 210, 90, 470, 580,
250, 310, 460, 430, 140, 1070, 130
a. In R, create a vector of these data, and name it appropriately.
ACE_concentration <- c(640, 1070, 780, 70, 160, 130, 60, 50, 2110, 70,
350, 30, 210, 90, 470, 580, 250, 310, 460, 430, 140, 1070, 130)

b. What is the mean ACE concentration of these 23 pools?
mean(ACE_concentration)

## [1] 420

c. Urine on average has 4000 ng ACE / ml. Therefore, to convert these
measurements of ng ACE / L pool water to ml urine / L pool water we
need to divide each by 4000. Make a new vector showing the
concentration of urine per liter in these 23 pools. Give it a suitable
name.
ACE_concentration_liters <- ACE_concentration/4000

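d. What is the mean concentration of urine per liter? How did this change
relative to the mean measurement of ng ACE / L ?
mean_urine_concentration <- mean(ACE_concentration_liters)
mean_urine_concentration

## [1] 0.105

The mean is 0.105 ml urine / L, which is the mean ACE measurement
(420 ng / L) divided by 4000.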

e. The arithmetic mean is calculated by adding up all the numbers and
dividing by how many numbers there are. Calculate the mean of these
numbers using sum() and length(). Did you get the same answer as
with using mean()? Yes.
sum(ACE_concentration_liters)/length(ACE_concentration_liters)

## [1] 0.105

f. Use R to calculate the average amount of urine (in ml) in a 500,000 L
pool.
# ml urine per L of pool water times 500,000 L gives the total ml of urine
mean_urine_concentration * 500000

## [1] 52500
2. Weddell seals live in Antarctic waters and take long strenuous dives in
order to find fish to feed upon. Researchers (Williams et al. 2004)
wanted to know whether these feeding dives were more energetically
expensive than regular dives (perhaps because they are deeper, or the
seal has to swim further or faster). They measured the metabolic costs
of dives using the oxygen consumption of 10 animals (in ml O2 / kg)
during a feeding dive. Here are the data:
71.0, 77.3, 82.6, 96.1, 106.6, 112.8, 121.2, 126.4, 127.5, 143.1
For the same 10 animals, they also measured the oxygen consumption in
non-feeding dives. With the 10 animals in the same order as before, here are
those data:
42.2, 51.7, 59.8, 66.5, 81.9, 82.0, 81.3, 81.3, 96.0, 104.1
a. Make a vector for each of these lists, and give them appropriate
names.
oxygen_consumption_feeding <- c(71.0, 77.3, 82.6, 96.1, 106.6, 112.8,
121.2, 126.4, 127.5, 143.1)
oxygen_consumption_non_feeding <- c(42.2, 51.7, 59.8, 66.5, 81.9,
82.0, 81.3, 81.3, 96.0, 104.1)

b. Confirm (using R) that both of your vectors have the same number of
individuals in them.
length(oxygen_consumption_feeding)

## [1] 10

length(oxygen_consumption_non_feeding)

## [1] 10

c. Create a vector called MetabolismDifference by calculating the
difference in oxygen consumption between feeding dives and
nonfeeding dives for each animal.
MetabolismDifference <- oxygen_consumption_feeding -
oxygen_consumption_non_feeding

d. What is the average difference between feeding dives and nonfeeding
dives in oxygen consumption?
mean(MetabolismDifference)

## [1] 31.78

e. Another appropriate way to represent the relationship between these
two numbers would be to take the ratio of O2 consumption for feeding
dives over the O2 consumption of nonfeeding dives. Make a vector
which gives this ratio for each seal.
ratio_oxygen_consumption <- oxygen_consumption_feeding /
oxygen_consumption_non_feeding
ratio_oxygen_consumption

## [1] 1.682464 1.495164 1.381271 1.445113 1.301587 1.375610 1.490775 1.554736
## [9] 1.328125 1.374640

f. Sometimes ratios are easier to analyze when we look at the log of the
ratio. Create a vector which gives the log of the ratios from the
previous step. (Use the natural log.) What is the mean of this log-ratio?
log_oxygen_consumption <- log(ratio_oxygen_consumption)
log_oxygen_consumption

## [1] 0.5202597 0.4022362 0.3230040 0.3681874 0.2635845 0.3188971 0.3992961
## [8] 0.4413055 0.2837682 0.3181917
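The question also asks for the mean of the log-ratio, which one more call to mean() would give. The sketch below is self-contained and uses the shorter names feeding, non_feeding and log_ratio as stand-ins (assumptions made for this illustration) for the vectors created above.

```r
# Oxygen consumption data from the question (ml O2 / kg), same seal order
feeding     <- c(71.0, 77.3, 82.6, 96.1, 106.6, 112.8, 121.2, 126.4, 127.5, 143.1)
non_feeding <- c(42.2, 51.7, 59.8, 66.5, 81.9, 82.0, 81.3, 81.3, 96.0, 104.1)

# Natural log of the feeding / non-feeding ratio for each seal
log_ratio <- log(feeding / non_feeding)

# Mean of the log-ratio
mean(log_ratio)

# Because log(a / b) = log(a) - log(b), this gives the same value
mean(log(feeding)) - mean(log(non_feeding))
```

A mean log-ratio above zero indicates that, on average, feeding dives use more oxygen than non-feeding dives.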

3. The data file called “countries.csv” on Moodle contains information
about all the countries on Earth. Each row is a country, and each
column contains a variable.

a. Use read.csv() to read the data from this file into a data frame
called countries. (Change the path in setwd() to the folder on your
computer that contains countries.csv.)
setwd("~/Desktop/Bio 341")
countries <- read.csv("countries.csv", stringsAsFactors = TRUE)
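To see what read.csv() and stringsAsFactors = TRUE do without depending on a particular folder, here is a minimal sketch that writes a tiny CSV to a temporary file and reads it back. The file contents and the names tmp and demo are hypothetical and only stand in for countries.csv.

```r
# Write a tiny hypothetical CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,continent,population",
             "A,Europe,10",
             "B,Europe,20",
             "C,Africa,"), tmp)

# Read it the same way as countries.csv
demo <- read.csv(tmp, stringsAsFactors = TRUE)

# Text columns are read as factors (categorical variables)
class(demo$continent)

# An empty cell in a numeric column is read as NA (missing data)
is.na(demo$population)
```

If read.csv() reports that it cannot open the file, the usual cause is that the working directory does not contain the file; getwd() shows the current working directory.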

b. Use summary() to get a quick description of this data set. What
are the first three variables?
summary(countries)

##                 country    total_population_in_thousands_2015
##  Afghanistan        :  1   Min.   :      1.6
##  Albania            :  1   1st Qu.:   1875.8
##  Algeria            :  1   Median :   8069.6
##  Andorra            :  1   Mean   :  37721.9
##  Angola             :  1   3rd Qu.:  26413.0
##  Antigua and Barbuda:  1   Max.   :1400000.0
##  (Other)            :190   NA's   :2
##  gross_national_income_per_capita_2013 life_expectancy_at_birth_female
##  Min.   :   600                        Min.   :48.80
##  1st Qu.:  3070                        1st Qu.:67.05
##  Median :  9800                        Median :75.90
##  Mean   : 14792                        Mean   :73.42
##  3rd Qu.: 20370                        3rd Qu.:79.25
##  Max.   :123860                        Max.   :86.70
##  NA's   :27                            NA's   :13
##  life_expectancy_at_birth_male life_expectancy_at_age_60_female
##  Min.   :47.40                 Min.   :12.70
##  1st Qu.:62.90                 1st Qu.:18.00
##  Median :69.80                 Median :20.40
##  Mean   :68.53                 Mean   :20.81
##  3rd Qu.:73.95                 3rd Qu.:23.40
##  Max.   :81.10                 Max.   :28.60
##  NA's   :13                    NA's   :13
##  life_expectancy_at_age_60_male physicians_density_per_1000
##  Min.   :12.50                  Min.   :0.029
##  1st Qu.:15.80                  1st Qu.:1.681
##  Median :17.50                  Median :2.765
##  Mean   :18.07                  Mean   :2.725
##  3rd Qu.:20.20                  3rd Qu.:3.510
##  Max.   :23.90                  Max.   :7.519
##  NA's   :13                     NA's   :125
##  number_neonatal_deaths_in_thousands_2014 measles_immunization_oneyearolds
##  Min.   :  0.00                           Min.   :22.00
##  1st Qu.:  0.00                           1st Qu.:83.25
##  Median :  1.00                           Median :93.00
##  Mean   : 14.11                           Mean   :87.28
##  3rd Qu.:  9.50                           3rd Qu.:97.00
##  Max.   :722.00                           Max.   :99.00
##  NA's   :2                                NA's   :2
##  dpt2_vaccination_oneyearolds fines_for_tobacco_advertising_2014
##  Min.   :20.00                No  : 54
##  1st Qu.:84.25                Yes :140
##  Median :94.00                NA's:  2
##  Mean   :87.91
##  3rd Qu.:97.00
##  Max.   :99.00
##  NA's   :2
##  mortality_rate_cancer_2012 cigarette_price_2014         continent
##  Min.   : 54.00             Min.   : 0.360        Africa       :54
##  1st Qu.: 88.62             1st Qu.: 1.320        Asia         :44
##  Median :108.00             Median : 2.620        Europe       :47
##  Mean   :109.64             Mean   : 3.798        North America:23
##  3rd Qu.:124.53             3rd Qu.: 4.965        Oceania      :16
##  Max.   :223.00             Max.   :16.140        South America:12
##  NA's   :24                 NA's   :89
##  ecological_footprint_2000 ecological_footprint_2012
##  Min.   : 0.600            Min.   :0.700
##  1st Qu.: 1.097            1st Qu.:1.400
##  Median : 2.140            Median :2.000
##  Mean   : 3.147            Mean   :2.353
##  3rd Qu.: 4.872            3rd Qu.:3.000
##  Max.   :15.990            Max.   :5.300
##  NA's   :58                NA's   :147
##  cell_phone_subscriptions_per_100_people_2012
##  Min.   :  5.47
##  1st Qu.: 69.83
##  Median :103.25
##  Mean   : 99.90
##  3rd Qu.:126.10
##  Max.   :198.62
##  NA's   :10

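c. Using the output of summary(), how many countries are from
Africa in this data set?
The continent column of the summary() output lists 54 countries from
Africa. (length(countries) returns the number of columns in the data
frame, 18, so it does not answer this question.) The count can be
confirmed directly:
sum(countries$continent == "Africa")
## [1] 54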

d. What kinds of variables (i.e., categorical or numerical) are
continent, cell_phone_subscriptions_per_100_people_2012,
total_population_in_thousands_2015 and
fines_for_tobacco_advertising_2014? (Don’t go by their variable
names; look at the data in the summary results to decide.)
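One way to check the summary-based answer is to ask R for the class of each column. In summary() output, a categorical variable (a factor) is shown as counts per category, while a numerical variable is shown with Min., Median, Mean and so on. The sketch below uses a hypothetical data frame toy with made-up values rather than the countries data.

```r
# Hypothetical toy data frame with two categorical columns and one numerical column
toy <- data.frame(
  continent     = factor(c("Africa", "Asia", "Africa")),
  subscriptions = c(5.47, 103.25, 198.62),
  fines         = factor(c("Yes", "No", "Yes"))
)

# class() of every column: "factor" means categorical, "numeric" means numerical
column_types <- sapply(toy, class)
column_types
```

The same call, sapply(countries, class), could be run on the countries data frame once it has been read in.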

e. CAN’T DO

f. Add a new column to your countries data frame that has the
difference in ecological footprint between 2012 and 2000. What
is the mean of this difference? (Note: this variable will have
“missing data”, which means that some of the countries do not
have data in this file for one or the other of the years of
ecological footprint. By default, R doesn’t calculate a mean
unless all the data are present. To tell R to ignore the missing
data, add an option to the mean() command that says
na.rm=TRUE. We’ll learn more about this later.)
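The effect of na.rm = TRUE can be sketched on a small made-up data frame. The data frame toy, its column names and its values are hypothetical; on the real data the columns would be ecological_footprint_2012 and ecological_footprint_2000 in countries.

```r
# Hypothetical toy data with some missing footprint values
toy <- data.frame(
  footprint_2000 = c(1.0, 2.0, NA, 4.0),
  footprint_2012 = c(1.5, NA, 3.0, 3.0)
)

# New column: difference between 2012 and 2000 (NA wherever either year is missing)
toy$footprint_difference <- toy$footprint_2012 - toy$footprint_2000

# Without na.rm the result is NA because some differences are missing
mean(toy$footprint_difference)

# With na.rm = TRUE the missing values are dropped first: mean of 0.5 and -1.0
mean(toy$footprint_difference, na.rm = TRUE)
```

Only rows with data for both years contribute to the mean, so the result describes that subset of countries.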

g. CAN’T DO

4. Using the countries data again, create a new data frame called
AfricaData, that only includes data for countries in Africa. What is the
sum of the total_population_in_thousands_2015 for this new data
frame?
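A possible approach combines filter() from dplyr with sum(). The sketch below runs on a hypothetical data frame toy with made-up values and assumes dplyr is installed. The summary() output above shows 2 missing values in total_population_in_thousands_2015, so na.rm = TRUE may be needed on the real data as well.

```r
library(dplyr)

# Hypothetical toy data (not the countries file)
toy <- data.frame(
  country    = c("A", "B", "C", "D"),
  continent  = c("Africa", "Asia", "Africa", "Africa"),
  population = c(100, 200, NA, 50)
)

# Keep only the rows where continent is "Africa"
africa_toy <- filter(toy, continent == "Africa")
nrow(africa_toy)

# Total population, ignoring missing values: 100 + 50
sum(africa_toy$population, na.rm = TRUE)
```

On the lab data the corresponding calls would use countries, the new name AfricaData, and the column total_population_in_thousands_2015.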

5. CAN’T DO
