R Tutorial

The document outlines a plan for learning R programming, covering topics such as using R as a calculator, creating and manipulating objects, loading data, generating graphics, and performing regressions. It includes examples of arithmetic operations, object creation, data frame manipulation, and basic plotting techniques. Additionally, it discusses simple and multiple linear regression, including how to interpret model outputs and diagnostic plots.

Uploaded by

yuruoqianqy
Copyright
© All Rights Reserved

Today’s plan:

1. R as a calculator
2. Objects in R
3. Loading data
4. Graphics
5. Regressions
6. R markdown

1. Using R as a calculator (Arithmetic Operations)


5+3

## [1] 8
5/3

## [1] 1.666667
5^3

## [1] 125
5*(10-3)

## [1] 35
sqrt(4)

## [1] 2

2. Creating and manipulating various objects in R


2.1 Objects
R can store information as an object.
• Objects are “shortcuts” to some piece of information or data.
• We use the assignment operator <- to assign some value to an object.
– < Left Angle Bracket
– - Dash
• We can store the result as an object, and then access the value by referring to the object’s name.
– either enter the object name and press Enter, or use the print( ) function
• Object names are case sensitive.
• In the upper-right window, called Environment, we will see the objects we created.

result <- 5+3
result

## [1] 8
print(result)

## [1] 8
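
To illustrate the point that object names are case sensitive, here is a small sketch. The object names total and Total are made up for this example.

```r
total <- 5 + 3   # an object called "total"
Total <- 10      # a separate object: "Total" and "total" are different names

total            # prints 8
Total            # prints 10

exists("TOTAL")  # FALSE: no object with exactly this name was created
```

Both objects appear separately in the Environment window.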

2.2 Vectors
A vector or a one-dimensional array simply represents a collection of information stored in a specific order.

x <- 1:5 # generate a sequence from 1 to 5


x

## [1] 1 2 3 4 5
y <- c(1,2,3,4,5) # use the function c( ), which stands for “concatenate,” to enter a data vector
y

## [1] 1 2 3 4 5

Indexing:
To access specific elements of a vector, we use square brackets [ ].
• Within square brackets, a minus sign, -, excludes the corresponding element from the result (the vector itself is not changed).
• Within square brackets, multiple elements can be extracted via a vector of indices.

x[2] # access the second element of the vector

## [1] 2
x[-2] # access all elements except for the second one

## [1] 1 3 4 5
x[c(1,3)] # access the first and the third elements

## [1] 1 3
x[1:3] # access the first three elements

## [1] 1 2 3
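
As a further sketch of the indexing rules above, the minus sign can be combined with a vector of indices, and the original vector stays as it was unless we assign the result. The name x_short is an assumption made for this example.

```r
x <- 1:5

x[-c(1, 2)]        # exclude the first two elements: 3 4 5
x[c(5, 1)]         # indices can be given in any order: 5 1
x                  # x itself is unchanged: 1 2 3 4 5

x_short <- x[-2]   # to keep a result, store it as a new object
x_short            # 1 3 4 5
```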

2.3 Functions
A function often takes multiple input objects and returns an output object.
• e.g., sqrt( ), print( ), and c( )
• funcname(input): funcname is the function name and input is the input object (arguments)
• To find out more information about a function (e.g. sqrt( )), type ?sqrt.

length(x) # display the length of vector x

## [1] 5
mean(x) # display the mean value

## [1] 3
max(x) # display the maximum value

## [1] 5
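
The functions above take a single input. As a small sketch of a function with several arguments, seq( ) and round( ) from base R accept named arguments. The object name rounded is made up for this example.

```r
seq(from = 1, to = 9, by = 2)   # a sequence 1 3 5 7 9
round(5 / 3, digits = 2)        # 1.67
round(5 / 3, 2)                 # same call, arguments matched by position

# Functions can be nested: the output of one is the input of another.
rounded <- round(mean(c(1, 2, 4)), digits = 1)
rounded                         # 2.3
```

Naming the arguments (digits = 2) is optional but tends to make the call easier to read.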

3. Loading data sets into R


3.1 Working Directory
The working directory is where R, by default, loads data from and saves data to.
1. getwd( ): get working directory
2. setwd( ): change working directory
• e.g. setwd("/Users/Desktop")
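
A minimal sketch of how the two functions work together. It switches to R's temporary folder, tempdir( ), only because that folder exists on every machine; in practice we would give the path of our own project folder. The name old_wd is made up for this example.

```r
old_wd <- getwd()     # remember the current working directory
setwd(tempdir())      # switch to another folder (here: R's temporary folder)
getwd()               # now shows the temporary folder
setwd(old_wd)         # switch back to where we started
```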

3.2 Data Files

• CSV or comma-separated values files: tabular data


• RData files: a collection of R objects

ads <- read.csv("Advertising.csv")


class(ads)

## [1] "data.frame"
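
Since Advertising.csv may not be at hand, the following sketch writes a tiny data frame to a temporary CSV file and an RData file and reads both back. The object toy and the file paths are assumptions made for this example; the three rows are the first rows of the advertising data shown later in this section.

```r
toy <- data.frame(TV = c(230.1, 44.5, 17.2), sales = c(22.1, 10.4, 9.3))

# CSV: one table per file
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)   # save without a row-number column
toy_csv <- read.csv(csv_path)                 # read it back as a data frame

# RData: stores R objects together with their names
rdata_path <- tempfile(fileext = ".RData")
save(toy, file = rdata_path)
rm(toy)                                       # remove the object from the workspace
loaded_names <- load(rdata_path)              # restores "toy" under its original name
loaded_names
```

Note that read.csv( ) returns the data, so we assign it to a name, whereas load( ) recreates the saved objects directly in the workspace.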

3.3 Data Frame Object


A data frame object is a collection of vectors of equal length, one per variable (column).

names(ads) # return variable names

## [1] "X" "TV" "radio" "newspaper" "sales"


nrow(ads) # return the number of rows

## [1] 200

ncol(ads) # return the number of columns

## [1] 5
dim(ads) # return the dimensions of the data

## [1] 200 5
summary(ads) # produce a summary of the data.

## X TV radio newspaper
## Min. : 1.00 Min. : 0.70 Min. : 0.000 Min. : 0.30
## 1st Qu.: 50.75 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75
## Median :100.50 Median :149.75 Median :22.900 Median : 25.75
## Mean :100.50 Mean :147.04 Mean :23.264 Mean : 30.55
## 3rd Qu.:150.25 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10
## Max. :200.00 Max. :296.40 Max. :49.600 Max. :114.00
## sales
## Min. : 1.60
## 1st Qu.:10.38
## Median :12.90
## Mean :14.02
## 3rd Qu.:17.40
## Max. :27.00

Indexing (extraction):

• square brackets [ ]
• two indexes: one for rows and the other for columns
• Call specific rows (columns) by either row (column) numbers or names.
• If we use row (column) numbers, the sequencing tools : and c() are useful for selecting several at once.
• If we do not specify a row (column) index, then the syntax will return all rows (columns).
• To access an individual variable in a data frame: use the $ operator.

ads[2,"sales"] # extract the second row of the "sales" column

## [1] 10.4
ads_subset1 <- ads[1:3,] # extract the first three rows (and all columns)
ads_subset1

## X TV radio newspaper sales


## 1 1 230.1 37.8 69.2 22.1
## 2 2 44.5 39.3 45.1 10.4
## 3 3 17.2 45.9 69.3 9.3
ads_subset2 <- ads[c(1,3),] # extract the first and the third rows (and all columns)
ads_subset2

## X TV radio newspaper sales


## 1 1 230.1 37.8 69.2 22.1
## 3 3 17.2 45.9 69.3 9.3
ads_subset1[,"sales"] # extract the column called "sales" from ads_subset1

## [1] 22.1 10.4 9.3
ads_subset1$sales # another way of retrieving individual variable "sales": operator $

## [1] 22.1 10.4 9.3
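
The same rules can be tried without the CSV file. The sketch below rebuilds three columns of the first three rows shown above as a small data frame called toy (a name made up for this example) and selects rows and columns at the same time.

```r
toy <- data.frame(TV    = c(230.1, 44.5, 17.2),
                  radio = c(37.8, 39.3, 45.9),
                  sales = c(22.1, 10.4, 9.3))

toy[2, "sales"]                   # one cell: row 2 of "sales", 10.4
toy[c(1, 3), c("TV", "sales")]    # rows 1 and 3, columns "TV" and "sales"
toy[, 3]                          # the third column, selected by number
toy$sales                         # the same column, selected with $
```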

3.4 Packages
• R packages are collections of functions and data sets developed by the community.
• To use the package, we must load it into the workspace using the library() function.
• In some cases, a package needs to be installed before being loaded: we can use the install.packages()
function.

install.packages("foreign") # install package


library("foreign") # load package
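
Installing is needed only once per computer, while loading is needed in every new R session. One possible way to avoid reinstalling each time is sketched below. The helper load_or_install( ) is hypothetical (it is not part of R); it relies on the base functions requireNamespace( ), install.packages( ), and library( ).

```r
# Hypothetical helper: load a package, installing it first only if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # reached only when the package is not installed
  }
  library(pkg, character.only = TRUE)   # character.only: pkg holds the name as a string
}

load_or_install("stats")   # "stats" ships with R, so nothing is installed here
```

Calling load_or_install("foreign") would follow the same steps for the package used above.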

4. Graphics
4.1 Scatterplot

• The plot( ) function is the primary way to plot data in R.


• plot(x, y) produces a scatterplot of the numbers in x versus the numbers in y.

x <- rnorm(100) # generates a vector of 100 random normal variables


y <- rnorm(100)
plot(x, y) # scatterplot of x and y

[Figure: scatterplot of y against x]

plot(x, y, xlab = "this is the x-axis",
ylab = "this is the y-axis", main = "Plot of X vs Y") # add xlabel, ylabel, title

[Figure: scatterplot titled "Plot of X vs Y", with axes labeled "this is the x-axis" and "this is the y-axis"]



We return to the advertising data. To show the relationship between TV and sales in a scatterplot, we need
to specifically indicate the variable name using $.

plot(ads$TV, ads$sales)

[Figure: scatterplot of ads$sales against ads$TV]

Alternatively, we can use the attach( ) function in order to tell R to make the variables in this data frame
available by name.

attach(ads)
plot(TV, sales)
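
An attached data frame stays on R's search path until it is detached. The sketch below shows attach( ) paired with detach( ), and also with( ), a base R function that makes the variables available for a single expression only. The data frame toy (three rows of the advertising data) and the name avg_sales are assumptions made for this example.

```r
toy <- data.frame(TV = c(230.1, 44.5, 17.2), sales = c(22.1, 10.4, 9.3))

# Option 1: attach, use the variables by name, then detach again
attach(toy)
plot(TV, sales)
detach(toy)

# Option 2: with() evaluates one expression inside the data frame
with(toy, plot(TV, sales))
avg_sales <- with(toy, mean(sales))
avg_sales
```

Detaching (or using with( )) reduces the risk of mixing up variables that share a name across data frames.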

4.2 Histogram

• The hist( ) function can be used to plot a histogram.


attach(ads) # tell R to make the variables in "ads" available by name
hist(sales) # histogram of the variable "sales"

[Figure: histogram of sales, with Frequency on the y-axis]

hist(sales, col = 2, breaks = 15) # change the color to red, change the number of bins to 15

[Figure: histogram of sales in red with narrower bins, with Frequency on the y-axis]

5. Regressions
5.1 Simple Linear Regression
• We use the lm() function to fit a simple linear regression model.
• The basic syntax is lm(y ~ x, data), where y is the response, x is the predictor, and data is the data set
in which these two variables are kept.
• Again, we tell R which data frame to use, either through the data argument of lm() or by calling attach() beforehand.
• We store the regression results in an object called lm.fit.
• If we type lm.fit, some basic information about the model is output.
• If we want to get detailed information about the model, we use the summary( ) function.

lm.fit <- lm(sales ~ TV, data = ads) # simple linear regression of sales on TV

attach(ads) # alternative
lm.fit <- lm(sales ~ TV)
lm.fit # basic information about the model

##
## Call:
## lm(formula = sales ~ TV)
##
## Coefficients:
## (Intercept) TV
## 7.03259 0.04754
summary(lm.fit) # detailed information about the model

##
## Call:
## lm(formula = sales ~ TV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16

• To obtain a confidence interval for the coefficient estimates, we can use the confint() command:
confint(lm.fit, level=0.95)

## 2.5 % 97.5 %
## (Intercept) 6.12971927 7.93546783
## TV 0.04223072 0.05284256
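
To see what lm( ), coef( ), and confint( ) return when the true relationship is known, here is a sketch on simulated data. The names sim_x, sim_y, sim_fit, and ci, as well as the chosen intercept 2 and slope 3, are assumptions made for this example.

```r
set.seed(1)                               # make the simulated data reproducible
sim_x <- 1:100
sim_y <- 2 + 3 * sim_x + rnorm(100)       # true intercept 2, true slope 3, plus noise

sim_fit <- lm(sim_y ~ sim_x)              # simple linear regression
coef(sim_fit)                             # estimates should be close to 2 and 3

ci <- confint(sim_fit, level = 0.95)      # one row per coefficient: lower and upper bound
ci
```

The result of confint( ) is a matrix, so a single bound can be extracted with square brackets, e.g. ci["sim_x", 1].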

• We can plot sales and TV along with the least squares regression line using the plot() and abline()
functions.
• diagnostic plots (optional):
– use the par() function with its mfrow argument to tell R to split the display screen into separate panels so
that multiple plots can be viewed simultaneously
– e.g., par(mfrow = c(2, 2)) divides the plotting region into a 2 × 2 grid of panels

plot(TV, sales)
abline(lm.fit)
[Figure: scatterplot of sales against TV with the least squares regression line]

par(mfrow = c(2, 2))
plot(lm.fit)

[Figure: 2 × 2 grid of diagnostic plots. Panels: Residuals vs Fitted; Q-Q Residuals (standardized residuals against theoretical quantiles); Scale-Location (against fitted values); Residuals vs Leverage (with Cook's distance). Observations 26, 36, and 179 are labeled.]


The diagnostic plots show residuals in four different ways:


1. Residuals vs Fitted. Used to check the linearity assumption. A roughly horizontal line without
distinct patterns indicates a linear relationship, which is good.
2. Normal Q-Q. Used to examine whether the residuals are normally distributed. It is good if the residual
points follow the straight dashed line.
3. Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals
(homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity.
This is not the case in our example, where we have a heteroscedasticity problem.
4. Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence
the regression results when included or excluded from the analysis.
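
The first diagnostic panel can also be drawn by hand, which shows where its numbers come from. This is a sketch on simulated data, not a reproduction of the plots above; the names sim_x, sim_y, sim_fit, res, and fit_vals are made up for this example. It uses the base R functions resid( ) and fitted( ).

```r
set.seed(2)                                # make the simulated data reproducible
sim_x <- runif(50, 0, 10)
sim_y <- 1 + 0.5 * sim_x + rnorm(50)       # a linear relationship plus noise
sim_fit <- lm(sim_y ~ sim_x)

res      <- resid(sim_fit)                 # observed value minus fitted value
fit_vals <- fitted(sim_fit)                # values predicted by the regression line

plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                     # dashed reference line at zero

sum(res)                                   # numerically zero when the model has an intercept
```

Because the data were simulated from a straight line with constant noise, the points should scatter around zero without a clear pattern.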

5.2 Multiple Linear Regression


• The syntax lm(y ~ x1 + x2 + x3) is used to fit a model with three predictors, x1, x2, and x3.
• The summary( ) function outputs the regression coefficients for all the predictors.
• We can access the individual components of a summary object by name.

lm.fit2 <- lm(sales ~ TV + radio + newspaper, ads)


summary(lm.fit2)

##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ads)
##
## Residuals:
## Min 1Q Median 3Q Max

## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
names(summary(lm.fit2))

## [1] "call" "terms" "residuals" "coefficients"


## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
summary(lm.fit2)$r.squared # the R-squared

## [1] 0.8972106
summary(lm.fit2)$fstatistic # the F-statistic

## value numdf dendf


## 570.2707 3.0000 196.0000
names(ads) # The data frame "ads" contains 5 variables.

## [1] "X" "TV" "radio" "newspaper" "sales"


lm(sales ~ ., ads) # a regression on all of the predictors
lm(sales ~ .-X, ads) # a regression excluding the predictor X
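
To check which predictors the dot includes, we can look at the coefficient names. The sketch below uses a simulated data frame toy with the same kind of column layout (an index column X, two predictors, and sales); all names and values are assumptions made for this example.

```r
set.seed(3)                                 # make the simulated data reproducible
toy <- data.frame(X     = 1:20,             # a row index, like the X column in ads
                  TV    = runif(20, 0, 300),
                  radio = runif(20, 0, 50))
toy$sales <- 3 + 0.05 * toy$TV + 0.2 * toy$radio + rnorm(20)

fit_all <- lm(sales ~ ., data = toy)        # the dot stands for all other columns
fit_noX <- lm(sales ~ . - X, data = toy)    # all other columns except X

names(coef(fit_all))   # "(Intercept)" "X" "TV" "radio"
names(coef(fit_noX))   # "(Intercept)" "TV" "radio"
```

Since X is only a row index, excluding it is usually the sensible choice.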

5.3 Interaction Terms


• The syntax TV:radio tells R to include an interaction term between TV and radio.
• The syntax TV*radio simultaneously includes TV, radio, and the interaction term TV×radio as
predictors.
– it is a short-hand for TV + radio + TV:radio

summary(lm(sales ~ TV*radio, data = ads))

##
## Call:
## lm(formula = sales ~ TV * radio, data = ads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##

## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
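
The short-hand can be verified directly: fitting the model both ways should give the same coefficients. This sketch uses simulated data; the data frame toy and the names fit_star and fit_long are made up for this example.

```r
set.seed(4)                                 # make the simulated data reproducible
toy <- data.frame(TV = runif(30, 0, 300), radio = runif(30, 0, 50))
toy$sales <- 5 + 0.02 * toy$TV + 0.03 * toy$radio +
  0.001 * toy$TV * toy$radio + rnorm(30)

fit_star <- lm(sales ~ TV * radio, data = toy)             # short-hand
fit_long <- lm(sales ~ TV + radio + TV:radio, data = toy)  # written out in full

all.equal(coef(fit_star), coef(fit_long))   # TRUE: both formulas describe the same model
```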

6. R Markdown (Optional)
install.packages("rmarkdown")

An R Markdown (.Rmd) file contains three types of content:
• A YAML (YAML Ain't Markup Language) header surrounded by ---
– A YAML header contains YAML arguments, such as “title”, “author”, and “output”.
• R code chunks
– Each code chunk starts with three backticks followed by {r} and ends with three backticks. On a US keyboard, the backtick can be found
on the same key as the tilde (~).
• text, tables, figures, images
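
To make the three parts concrete, the sketch below writes a very small .Rmd file from R. The title, the text, the file location, and the names fence, rmd_lines, and rmd_path are all made up for this example. The last line is left as a comment because rendering requires the rmarkdown package (and pandoc) to be installed.

```r
fence <- strrep("`", 3)                    # three backticks, built in code

rmd_lines <- c(
  "---",                                   # YAML header starts
  "title: \"My first report\"",
  "output: html_document",
  "---",                                   # YAML header ends
  "",
  "This sentence is ordinary text.",       # text
  "",
  paste0(fence, "{r}"),                    # code chunk starts
  "5 + 3",
  fence                                    # code chunk ends
)

rmd_path <- file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_path)            # save the lines as an .Rmd file

# rmarkdown::render(rmd_path)              # would turn the file into an HTML report
```

In RStudio, the same file can instead be created via the menu and rendered with the Knit button.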
