R Tutorial
1. R as a calculator
2. Objects in R
3. Loading data
4. Graphics
5. Regressions
6. R markdown
## [1] 8
5/3
## [1] 1.666667
5^3
## [1] 125
5*(10-3)
## [1] 35
sqrt(4)
## [1] 2
result <- 5+3
result
## [1] 8
print(result)
## [1] 8
2.2 Vectors
A vector (a one-dimensional array) is an ordered collection of values of the same type.
## [1] 1 2 3 4 5
y <- c(1,2,3,4,5) # use the function c( ), which stands for “concatenate,” to enter a data vector
y
## [1] 1 2 3 4 5
Indexing:
To access specific elements of a vector, we use square brackets [ ].
• Within square brackets, the dash, -, removes the corresponding element from a vector.
• Within square brackets, multiple elements can be extracted via a vector of indices.
## [1] 2
x[-2] # access all elements except for the second one
## [1] 1 3 4 5
x[c(1,3)] # access the first and the third elements
## [1] 1 3
x[1:3] # access the first three elements
## [1] 1 2 3
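The indexing rules above can be combined. The following is a minimal sketch using a hypothetical vector v (not part of the tutorial's data) so that positions and values are easy to tell apart:

```r
# Hypothetical vector for illustration only
v <- c(10, 20, 30, 40, 50)

v[2]          # second element: 20
v[-2]         # everything except the second element: 10 30 40 50
v[c(1, 3)]    # first and third elements: 10 30
v[-c(1, 3)]   # drop the first and third elements: 20 40 50
v[2:4]        # elements two through four: 20 30 40
```

Negative and positive indices should not be mixed inside the same pair of brackets.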
2.3 Functions
A function often takes multiple input objects and returns an output object.
• e.g., sqrt( ), print( ), and c( )
• funcname(input): funcname is the function name and input is the input object (arguments)
• To find out more information about a function (e.g. sqrt( )), type ?sqrt.
length(x) # display the length of vector x
## [1] 5
mean(x) # display the mean value
## [1] 3
max(x) # display the maximum value
## [1] 5
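Many functions accept more than one argument, and arguments can be supplied by name. The sketch below uses a hypothetical vector z with one missing value to show how a named argument changes the result:

```r
# Hypothetical vector with one missing value (NA)
z <- c(4, 9, NA, 16)

m_all   <- mean(z)                 # NA, because one element is missing
m_clean <- mean(z, na.rm = TRUE)   # named argument na.rm drops the NA first
roots   <- sqrt(z)                 # applied element by element: 2 3 NA 4
rounded <- round(3.14159, digits = 2)   # 3.14

m_all
m_clean
roots
rounded
```

Typing ?mean lists the arguments a function accepts together with their defaults.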
## [1] "data.frame"
## [1] 200
ncol(ads) # return the number of columns
## [1] 5
dim(ads) # return the dimensions of the data
## [1] 200 5
summary(ads) # produce a summary of the data.
## X TV radio newspaper
## Min. : 1.00 Min. : 0.70 Min. : 0.000 Min. : 0.30
## 1st Qu.: 50.75 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75
## Median :100.50 Median :149.75 Median :22.900 Median : 25.75
## Mean :100.50 Mean :147.04 Mean :23.264 Mean : 30.55
## 3rd Qu.:150.25 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10
## Max. :200.00 Max. :296.40 Max. :49.600 Max. :114.00
## sales
## Min. : 1.60
## 1st Qu.:10.38
## Median :12.90
## Mean :14.02
## 3rd Qu.:17.40
## Max. :27.00
Indexing (extraction):
• square brackets [ ]
• two indexes: one for rows and the other for columns
• Select specific rows (columns) by their numbers or their names.
• When selecting by row (column) numbers, the sequencing tools : and c() make it easy to pick several at once.
• If we do not specify a row (column) index, then the syntax will return all rows (columns).
• To access an individual variable in a data frame: use the $ operator.
## [1] 10.4
ads_subset1 <- ads[1:3,] # extract the first three rows (and all columns)
ads_subset1
## [1] 22.1 10.4 9.3
ads_subset1$sales # another way of retrieving individual variable "sales": operator $
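The same indexing rules can be tried without the ads file. The sketch below builds a small data frame called toy whose values are invented for illustration; only the column names mirror the tutorial's data:

```r
# Invented values; the column names mirror the tutorial's data
toy <- data.frame(
  TV    = c(100, 50, 20, 150),
  radio = c(30, 40, 45, 35),
  sales = c(15, 9, 6, 19)
)

toy[2, 3]                  # row 2, column 3: 9
toy[2, "sales"]            # the same element, selected by column name
toy[1:3, ]                 # first three rows, all columns
toy[, c("TV", "sales")]    # all rows, two named columns
toy$sales                  # one variable as a vector: 15 9 6 19
```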
3.4 Packages
• R packages are collections of functions and data sets developed by the community.
• To use the package, we must load it into the workspace using the library() function.
• In some cases, a package needs to be installed before being loaded: we can use the install.packages()
function.
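A common pattern is to install a package only if it is missing and then load it. The sketch below uses MASS purely as an example package name; it normally ships with R, so the install step is usually skipped:

```r
# MASS is only an example; substitute the package you need
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")   # one-time download; needs an internet connection
}
library(MASS)                # load the package for the current session
```

Installation is needed once per machine, whereas library() must be called in every new R session.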
4. Graphics
4.1 Scatterplot
[Figure: scatterplot of y against x]
plot(x, y, xlab = "this is the x-axis",
ylab = "this is the y-axis", main = "Plot of X vs Y") # add xlabel, ylabel, title
[Figure: scatterplot titled "Plot of X vs Y" with the axis labels set above]
plot(ads$TV, ads$sales)
[Figure: scatterplot of ads$sales (y-axis) against ads$TV (x-axis)]
Alternatively, we can use the attach( ) function in order to tell R to make the variables in this data frame
available by name.
attach(ads)
plot(TV, sales)
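One caveat with attach() is that objects already in the workspace with the same names take precedence over the attached columns. As a possible alternative, with() or a formula with a data argument evaluates the variable names inside the data frame for a single call only. A minimal sketch, using an invented data frame toy:

```r
# Invented values for illustration
toy <- data.frame(TV = c(100, 50, 20, 150), sales = c(15, 9, 6, 19))

with(toy, plot(TV, sales))     # TV and sales are looked up inside toy
plot(sales ~ TV, data = toy)   # formula interface: same plot

avg_sales <- with(toy, mean(sales))   # with() works for any expression
avg_sales
```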
4.2 Histogram
[Figure: "Histogram of sales"; x-axis: sales, y-axis: Frequency]
hist(sales, col = 2, breaks = 15) # change the color to red and request about 15 bins (R treats a single number for breaks as a suggestion)
[Figure: "Histogram of sales" drawn with the settings above; x-axis: sales, y-axis: Frequency]
5. Regressions
5.1 Simple Linear Regression
• We use the lm() function to fit a simple linear regression model.
• The basic syntax is lm(y ~ x, data), where y is the response, x is the predictor, and data is the data set
in which these two variables are kept.
• Again, we tell R which data frame to use: either pass it through the data argument, refer to variables with $, or call attach().
• We store the regression results in an object called lm.fit.
• If we type lm.fit, some basic information about the model is output.
• If we want to get detailed information about the model, we use the summary( ) function.
lm.fit <- lm(sales ~ TV, data = ads) # simple linear regression of sales on TV
attach(ads) # alternative
lm.fit <- lm(sales ~ TV)
lm.fit # basic information about the model
##
## Call:
## lm(formula = sales ~ TV)
##
## Coefficients:
## (Intercept) TV
## 7.03259 0.04754
summary(lm.fit) # detailed information about the model
##
## Call:
## lm(formula = sales ~ TV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
• To obtain a confidence interval for the coefficient estimates, we can use the confint() command:
confint(lm.fit, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 6.12971927 7.93546783
## TV 0.04223072 0.05284256
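The interval reported by confint() for each coefficient is the estimate plus or minus a t quantile times the standard error. The sketch below checks this by hand on simulated data; the data frame sim and its coefficients are assumptions made for illustration, not the ads data:

```r
# Simulated data, used only to illustrate the calculation
set.seed(1)
sim <- data.frame(x = runif(50, 0, 100))
sim$y <- 5 + 0.05 * sim$x + rnorm(50)

fit <- lm(y ~ x, data = sim)

est <- coef(summary(fit))[, "Estimate"]     # coefficient estimates
se  <- coef(summary(fit))[, "Std. Error"]   # their standard errors
tq  <- qt(0.975, df = df.residual(fit))     # t quantile for a 95% interval

manual <- cbind(lower = est - tq * se, upper = est + tq * se)
manual
confint(fit, level = 0.95)                  # should agree with manual
```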
• We can plot sales and TV along with the least squares regression line using the plot() and abline()
functions.
• diagnostic plots (optional):
– use the par() function with its mfrow argument to tell R to split the display screen into separate panels so that multiple plots can be viewed simultaneously
– e.g., par(mfrow = c(2, 2)) divides the plotting region into a 2 × 2 grid of panels
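A layout set with par() stays in effect for later plots on the same device. One way to handle this, sketched here with R's built-in cars data set rather than the ads data, is to save the previous settings and restore them afterwards:

```r
# cars is a data set that ships with R; used here only for illustration
fit <- lm(dist ~ speed, data = cars)

old_par <- par(mfrow = c(2, 2))   # par() returns the previous settings
plot(fit)                         # four diagnostic plots fill the 2 x 2 grid
par(old_par)                      # restore the single-panel layout
```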
plot(TV, sales)
abline(lm.fit)
[Figure: scatterplot of sales against TV with the least squares regression line]
par(mfrow = c(2, 2))
plot(lm.fit)
[Figure: 2 × 2 grid of diagnostic plots: Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage]
##
## Call:
## lm(formula = sales ~ TV + radio + newspaper, data = ads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
names(summary(lm.fit2))
## [1] 0.8972106
summary(lm.fit2)$fstatistic # the F-statistic
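The object returned by summary() is a list, so individual results can be pulled out with $. The sketch below uses R's built-in mtcars data set as a stand-in for the ads data; the model itself is only an illustration:

```r
# mtcars ships with R; the model is illustrative only
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

names(s)          # components available for extraction
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
s$sigma           # residual standard error
s$fstatistic      # F-statistic with its two degrees of freedom
s$coefficients    # matrix of estimates, standard errors, t values, p-values
```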
##
## Call:
## lm(formula = sales ~ TV * radio, data = ads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
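In a model formula, a * b is shorthand for a + b + a:b, which is why the output above contains a TV:radio row in addition to the two main effects. A minimal sketch of the equivalence, using R's built-in mtcars data set as a stand-in:

```r
# mtcars ships with R; the variables are chosen only for illustration
fit_star  <- lm(mpg ~ wt * hp, data = mtcars)          # shorthand
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)  # written out in full

coef(fit_star)                                # includes a wt:hp term
all.equal(coef(fit_star), coef(fit_colon))    # the two fits coincide
```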
6. R Markdown (Optional)
install.packages("rmarkdown")
An R Markdown (.Rmd) file contains three types of content:
• A YAML header surrounded by --- (YAML originally stood for "Yet Another Markup Language" and is now expanded as "YAML Ain't Markup Language")
– A YAML header contains YAML arguments, such as “title”, “author”, and “output”.
• R code chunks
– All code chunks start and end with three backticks. On your keyboard, the backticks can be found
on the same key as the tilde (~).
• text, tables, figures, images
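To make the three parts concrete, the sketch below writes a minimal .Rmd file from R. The file name, title, author, and chunk contents are placeholders chosen for illustration, and the backtick fence is built with strrep() only to keep this example easy to read:

```r
fence <- strrep("`", 3)   # the three backticks that open and close a chunk

rmd_lines <- c(
  "---",                                   # YAML header starts
  "title: \"My first report\"",
  "author: \"Your Name\"",
  "output: html_document",
  "---",                                   # YAML header ends
  "",
  "Some ordinary text describing the analysis.",
  "",
  paste0(fence, "{r}"),                    # R code chunk starts
  "summary(cars)",
  fence                                    # R code chunk ends
)

rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_path)

# To knit the file (requires the rmarkdown package and pandoc):
# rmarkdown::render(rmd_path)
```

In practice the file is usually created and knitted from the RStudio menus rather than written from code.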