Chapter4 Notes
Linear Regression:
Regression fits a line or curve through the data points on the target-predictor
graph in such a way that the vertical distances between the data points and the regression line
are as small as possible (least squares minimizes the sum of their squares).
Linear regression is a statistical model used to predict the value of an outcome variable y on
the basis of one or more input predictor variables x.
In other words, linear regression is used to establish a linear relationship between the predictor
and response variables.
Regression analysis is a very widely used statistical tool for establishing a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other is called the response variable, whose value is derived
from the predictor variable.
Linear regression is one of the most basic statistical models.
In linear regression, predictor and response variables are related through an equation in which
the exponent (power) of both these variables is 1. Mathematically, a linear relationship denotes a
straight line, when plotted as a graph.
The general mathematical equation for linear regression is:
y = ax + b
y is the response variable (dependent variable).
x is the predictor variable (independent variable).
a and b are constants called the coefficients: a is the slope and b is the intercept.
Example:
Predicting the weight of a person when their height is known is a simple example of
regression. To predict the weight, we need a relationship between the height and
weight of a person: Weight = a × Height + b
When you calculate the age of a child based on their height, you are assuming that the older
they are, the taller they will be.
Positive Linear Relationship: If the dependent variable (Y-axis) increases as the
independent variable (X-axis) increases, the relationship is called a positive linear
relationship.
Negative Linear Relationship: If the dependent variable (Y-axis) decreases as the
independent variable (X-axis) increases, the relationship is called a negative linear
relationship.
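A small sketch of the two cases. The numbers below are made up for illustration and do not come from the notes; the point is only that the sign of the fitted slope tells you which kind of relationship you have.

```r
# Illustrative data only: y_up rises with x, y_down falls with x.
x      <- c(1, 2, 3, 4, 5)
y_up   <- c(2.1, 3.9, 6.2, 7.8, 10.1)
y_down <- c(9.8, 8.1, 6.0, 4.2, 1.9)

# coef(...)[["x"]] extracts the fitted slope from each model.
slope_up   <- coef(lm(y_up ~ x))[["x"]]
slope_down <- coef(lm(y_down ~ x))[["x"]]

slope_up > 0     # positive linear relationship
slope_down < 0   # negative linear relationship
```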
Hence, we try to find a linear function that predicts the response value(y) as accurately as
possible as a function of the feature or independent variable(x).
Y = β₀ + β₁X + ε
The dependent variable, also known as the response or outcome variable, is represented
by the letter Y.
The independent variable, often known as the predictor or explanatory variable, is
denoted by the letter X.
The intercept, or value of Y when X is zero, is represented by the β₀.
The slope or change in Y resulting from a one-unit change in X is represented by the β₁.
The error term or the unexplained variation in Y is represented by the ε.
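As a rough sketch of how β₀ and β₁ are estimated, the least-squares formulas can be computed by hand and compared with lm(). The height and weight values are the sample observations used elsewhere in these notes; the variable names (height, weight, b0, b1, res) are chosen here for illustration.

```r
# Height (predictor) and weight (response) values from the sample data in these notes.
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Least-squares estimates for simple linear regression.
b1 <- cov(height, weight) / var(height)   # slope (estimate of beta_1)
b0 <- mean(weight) - b1 * mean(height)    # intercept (estimate of beta_0)

# Residuals estimate the error term: observed minus fitted values.
fitted_weight <- b0 + b1 * height
res <- weight - fitted_weight

# The hand-computed values should agree with lm() up to floating-point error.
all.equal(unname(coef(lm(weight ~ height))), c(b0, b1))
```

The residuals of a least-squares fit with an intercept always sum to zero, which is why the summary output described later reports a residual distribution centred near zero.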
Steps to Establish a Regression:
A simple example of regression is predicting weight of a person when his height is known. To do
this we need to have the relationship between height and weight of a person.
1. Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
2. Create a relationship model using the lm() function in R.
3. Find the coefficients from the model created and create the mathematical equation using
these.
4. Get a summary of the relationship model to know the average error in prediction. The
individual prediction errors are called residuals.
5. To predict the weight of new persons, use the predict() function in R.
Step 1.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
Step 2.
lm( ) Function:
This function creates the relationship model between the predictor and the response variable.
Syntax:
The basic syntax for the lm() function in linear regression is:
lm(formula, data)
formula is a symbolic description of the relation between x and y, such as y ~ x.
data is the data frame containing the variables named in the formula. It can be omitted when
the variables already exist as vectors in the workspace, as in the example below.
Note: Tilde (~) is used to separate the left- and right-hand sides in a model formula.
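A minimal sketch of the formula and data arguments working together, assuming the observations are stored in a data frame. The names people and fit are hypothetical and chosen only for this illustration.

```r
# Sample observations from the notes, stored as columns of a data frame.
people <- data.frame(
  height = c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131),
  weight = c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
)

# Response on the left of ~, predictor on the right.
# lm() looks up the names weight and height inside `people`.
fit <- lm(weight ~ height, data = people)
coef(fit)
```

With a data frame, the coefficient is labelled with the column name (height) rather than a generic x, which makes the printed model easier to read.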
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131) #height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48) #weight
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
#Get the Summary of the Relationship
#print(summary(relation))
Output
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
   -38.4551       0.6746
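predict( ) Function:
Syntax:
The basic syntax for predict() in linear regression is:
predict(object, newdata)
object: It is the model that we have already created using the lm() function.
newdata: It is the data frame that contains the new value(s) for the predictor variable.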
Predict the weight of new persons:
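The code for this step is not shown in the notes, so the following is a reconstruction. The height value 170 is an assumption, inferred from the fact that the fitted equation -38.4551 + 0.6746 × 170 reproduces the printed output; the name a for the new data frame is also assumed.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # weight
relation <- lm(y ~ x)

# newdata must be a data frame whose column name matches the predictor (x).
# The value 170 is an assumption inferred from the printed output.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
```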
Output
1
76.22869
The output of summary(relation) contains the following parts:
Call: Shows the function call used to compute the regression model.
Residuals: Provides a quick view of the distribution of the residuals, which by definition
have a mean of zero. Therefore, the median should not be far from zero, and the minimum
and maximum should be roughly equal in absolute value.
Coefficients: Shows the regression beta coefficients and their statistical significance.
Predictor variables that are significantly associated with the outcome variable are marked
by stars.
Residual standard error (RSE), R-squared (R2) and the F-statistic are metrics that are
used to check how well the model fits our data.
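A short sketch of how these pieces can be produced and inspected for the height and weight model. The name fit_summary is hypothetical; the component names ($coefficients, $sigma, $r.squared, $fstatistic) are those of the object returned by summary() for an lm fit.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # weight
relation <- lm(y ~ x)

fit_summary <- summary(relation)
print(fit_summary)

# Individual pieces can be pulled out of the summary object.
fit_summary$coefficients    # estimates, standard errors, t values, p-values
fit_summary$sigma           # residual standard error (RSE)
fit_summary$r.squared       # R-squared
fit_summary$fstatistic      # F-statistic with its degrees of freedom
mean(residuals(relation))   # essentially zero, as described above
```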
Example:
Input Data:
Consider the data set "mtcars" available in the R environment. It gives a comparison between
different car models in terms of miles per gallon ("mpg"), cylinder displacement ("disp"),
horsepower ("hp"), weight of the car ("wt") and some more parameters.
The goal of the model is to establish the relationship between "mpg" as a response variable with
"disp","hp" and "wt" as predictor variables. We create a subset of these variables from the mtcars
data set for this purpose.
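The code for this example is missing from the notes. The sketch below is one way to produce the output that follows: the name input comes from the Call line in that output, while model, a, Xdisp, Xhp and Xwt are assumed names.

```r
# Keep only the response and the three predictors.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Several predictors are joined with + on the right-hand side of the formula.
model <- lm(mpg ~ disp + hp + wt, data = input)
print(model)

# Extract each coefficient individually.
a     <- coef(model)[1]   # intercept
Xdisp <- coef(model)[2]
Xhp   <- coef(model)[3]
Xwt   <- coef(model)[4]
print(a)
print(Xdisp)
print(Xhp)
print(Xwt)
```

With these values the fitted equation is mpg = a + Xdisp × disp + Xhp × hp + Xwt × wt.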
Output
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept)         disp           hp           wt
  37.105505    -0.000937    -0.031157    -3.800891
(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891
Advanced graphics:
Handling the Graphics Device:
So far, your plotting has dealt with one image at a time. It’s possible to have multiple graphics
devices open, but only one will be deemed active at any given time.
Manually Opening a New Device:
The typical base R commands you’ve met already (such as plot, hist, boxplot, and so on) will
automatically open a device for plotting and draw the desired plot, if nothing is currently open.
You can also open new device windows using dev.new( ); this newest window will immediately
become active, and any subsequent plotting commands will affect that particular device.
As an example, first close any open graphics windows and then enter the following at the R
prompt:
R> plot(quakes$long, quakes$lat)
Now, let’s say you’d also like to see a histogram of the number of stations that detected each
event. Execute the following to open a new plotting window:
R> dev.new( )
At this point, you can enter the usual command to bring up the desired histogram in Device 3:
R> hist(quakes$stations)
If you hadn’t used dev.new, the histogram would’ve just overwritten the plot of the spatial
locations in Device 2.
Switching Between Devices:
To change something in Device 2 without closing Device 3, use dev.set followed by the device
number you want to make active.
R> dev.set(2)
quartz
2
R> plot(quakes$long,quakes$lat,cex=0.02*quakes$stations, xlab="Longitude",ylab="Latitude")
R> dev.set(3)
quartz
3
R> abline(v=mean(quakes$stations),lty=2)
Closing a Device:
To close a graphics device, use the dev.off( ) function, passing the number of the device to
close. It prints the device that is now active:
R> dev.off(2)
quartz
3
Then repeat the call without an argument to close the remaining device:
R> dev.off()
null device
1
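The device bookkeeping above can be checked with dev.list() and dev.cur(). This is a sketch under one assumption: it opens off-screen devices with pdf(NULL) so that it runs in any session, where interactively you would call dev.new() and see a device name such as quartz instead of pdf.

```r
graphics.off()                  # close every open device so numbering starts at 2

pdf(NULL)                       # device 2 (off-screen stand-in for a plot window)
plot(quakes$long, quakes$lat)
pdf(NULL)                       # device 3, which becomes the active device
hist(quakes$stations)

open_devices <- dev.list()      # numbers of all open devices
active <- dev.cur()             # number of the active device

dev.set(2)                      # make device 2 active again
active_after_set <- dev.cur()

remaining <- dev.off(2)         # close device 2; returns the new active device
graphics.off()                  # close whatever is left
```

graphics.off() closes all devices at once and is convenient at the start of a script.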
Multiple Plots in One Device:
You can also control the number of individual plots in any one device. The par( ) function is
used to control various graphical parameters of traditional R plots.
Setting the mfrow Parameter:
The mfrow argument instructs a new (or the currently active) device to “invisibly” divide itself
into a grid of the specified dimensions, with each cell holding one plot. You pass the mfrow
option a numeric integer vector of length 2 in the order of c(rows,columns); as you might guess,
its default is c(1,1).
Now, say you want the two plots of the quakes data side by side in the same device.
You would set mfrow as a 1 × 2 grid with the vector c(1,2), that is, one row of plots and two
columns:
R> dev.new(width=8,height=4)
R> par(mfrow=c(1,2))
R> plot(quakes$long,quakes$lat, cex=0.02*quakes$stations, xlab="Longitude",ylab="Latitude")
R> hist(quakes$stations)
R> abline(v=mean(quakes$stations), lty=2)
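Because mfrow stays in effect for the device until it is changed, it can be useful to save and restore the previous settings. The sketch below assumes an off-screen pdf(NULL) device so that it runs anywhere; interactively you would use dev.new(width=8,height=4) as above. The names old_par, grid_now and grid_restored are illustrative.

```r
pdf(NULL)                             # off-screen device for this sketch

old_par <- par(mfrow = c(1, 2))       # par() returns the previous settings
grid_now <- par("mfrow")              # the grid currently in effect

plot(quakes$long, quakes$lat, cex = 0.02 * quakes$stations,
     xlab = "Longitude", ylab = "Latitude")
hist(quakes$stations)
abline(v = mean(quakes$stations), lty = 2)

par(old_par)                          # restore the 1 x 1 default
grid_restored <- par("mfrow")
dev.off()
```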
Defining a Particular Layout
You can refine the arrangements of plots in a single device using the layout( ) function, which
offers more ways to individualize the panels into which the plots will be drawn.
When you use layout, you provide the dimensions in a matrix mat as the first argument; these
govern an invisible rectangular grid, just like controlling the mfrow option. The difference now
is that you can use numeric integer entries in mat to tell layout which plot number will go where.
Examine the following object:
R> lay.mat <- matrix(c(1,3,2,3),2,2)
R> lay.mat
[,1] [,2]
[1,] 1 2
[2,] 3 3
The dimensions of this matrix create a 2 × 2 grid of plotting cells, but the values inside lay.mat
tell R that you want plot 1 to take the upper-left cell, plot 2 to take the upper-right cell, and plot 3
to stretch itself over the two bottom cells.
Calling layout as follows will either initialize the active device based on lay.mat or open a new
one (if the null device is the only device currently available) and initialize it.
R> layout(mat=lay.mat)
If you’re ever unsure of the result of your specification, you can use the layout.show( ) function
to see how plots will be placed.
R> layout.show(n=max(lay.mat))
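To see the layout filled in, three plots can be drawn in order; each new high-level plotting call moves to the next panel number. This is a sketch: the choice of plots for each panel is arbitrary, and pdf(NULL) is assumed as an off-screen device so the code runs in any session.

```r
pdf(NULL)                                   # off-screen device for this sketch

lay.mat <- matrix(c(1, 3, 2, 3), 2, 2)
n_panels <- layout(mat = lay.mat)           # layout() returns the number of panels

plot(quakes$long, quakes$lat,
     xlab = "Longitude", ylab = "Latitude") # panel 1: upper left
hist(quakes$stations)                       # panel 2: upper right
boxplot(quakes$mag, horizontal = TRUE,
        xlab = "Magnitude")                 # panel 3: both bottom cells

layout(1)                                   # reset to a single panel
dev.off()
```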
R> plot(survey$Wr.Hnd, survey$Height, xlab="Writing handspan", ylab="Height")
R> box("outer",lty=3)
R> mtext("Figure region margins\nmar[ . ]",line=2)
R> mtext("Outer region margins\noma[ . ]",line=0.5,outer=TRUE)
mtext( ):
Here, you provide the text you want written in a character string as the first argument, and the
argument line instructs how many lines of space away from the inside border the text should
appear.
Clipping:
Controlling clipping allows you to draw in or add elements to the margin regions with reference
to the user coordinates of the plot itself. For example, you might want to place a legend outside
the plotting area, or you might want to draw an arrow that extends beyond the plot region to
enhance a particular observation.
The graphical parameter xpd controls clipping in base R graphics. By default, xpd is set to
FALSE, so all drawing is clipped to the available plot region only (with the exception of special
margin-addition functions such as mtext).
Setting xpd to TRUE allows you to draw things outside the formally defined plot region into the
figure margins but not into any outer margins.
Setting xpd to NA will permit drawing in all three areas—plot region, figure margins, and the
outer margins.
For example, take a look at the images in Figure 23-5, showing side-by-side boxplots of mileage
split by number of cylinders, created with the following code:
R> dev.new()
R> par(oma=c(1,1,5,1),mar=c(2,4,5,4))
R> boxplot(mtcars$mpg~mtcars$cyl,xaxt="n",ylab="MPG")
R> box("figure",lty=2)
R> box("outer",lty=3)
R> arrows(x0=c(2,2.5,3),y0=c(44,37,27),x1=c(1.25,2.25,3),y1=c(31,22,20), xpd=FALSE)
R> text(x=c(2,2.5,3),y=c(45,38,28),c("V4 cars","V6 cars","V8 cars"), xpd=FALSE)
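With xpd=FALSE, the parts of the arrows and labels that fall above the plot region are clipped. A sketch of the same plot with xpd=NA, which lets the annotations extend through the figure margin into the outer margin, is given below. It assumes an off-screen pdf(NULL) device in place of dev.new(); default_xpd is an illustrative name.

```r
pdf(NULL)                                   # off-screen device for this sketch
par(oma = c(1, 1, 5, 1), mar = c(2, 4, 5, 4))
default_xpd <- par("xpd")                   # the device-wide setting is untouched below

boxplot(mtcars$mpg ~ mtcars$cyl, xaxt = "n", ylab = "MPG")
box("figure", lty = 2)
box("outer", lty = 3)

# xpd = NA applies only to these two calls: drawing is allowed in the
# plot region, the figure margins, and the outer margins.
arrows(x0 = c(2, 2.5, 3), y0 = c(44, 37, 27),
       x1 = c(1.25, 2.25, 3), y1 = c(31, 22, 20), xpd = NA)
text(x = c(2, 2.5, 3), y = c(45, 38, 28),
     c("V4 cars", "V6 cars", "V8 cars"), xpd = NA)
dev.off()
```

Passing xpd to an individual drawing function affects only that call, whereas par(xpd=...) changes the setting for the whole device.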
The locator( ) command allows you to find and return user coordinates.
To see how it works, first execute a call to plot(1,1) to bring up a simple plot with a single point
in the middle. To use locator, you simply execute the function (with no arguments for default
behavior), which will “hang” the console, without returning you to the prompt. Then, on an
active graphics device, your mouse cursor will change to a + symbol (you may need to first click
your device once to bring it to the foreground of your computer desktop). With your cursor as the
+, you can perform a series of (left) mouse clicks inside the device, and R will silently record the
precise user coordinates. To stop this, simply right-click to terminate the command and once you
do, the coordinates you identified in the device are returned as a list with components $x and $y.
R> plot(1,1)
R> locator()
$x
[1] 0.8275456 1.1737525 1.1440526 0.8201909
$y
[1] 1.1581795 1.1534442 0.9003221 0.8630254
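The returned list can be passed on to other drawing functions. Since locator() needs mouse input, this sketch hard-codes the four coordinates printed above instead of recording new clicks, and assumes an off-screen pdf(NULL) device; the names clicks and n_clicks are illustrative.

```r
# Coordinates copied from the locator() output above.
clicks <- list(x = c(0.8275456, 1.1737525, 1.1440526, 0.8201909),
               y = c(1.1581795, 1.1534442, 0.9003221, 0.8630254))

# In an interactive session you could record fresh clicks instead:
# clicks <- locator()

pdf(NULL)                                    # off-screen device for this sketch
plot(1, 1)
polygon(clicks$x, clicks$y, border = "red", lty = 2)  # join the points into a closed shape
points(clicks, pch = 4)                      # points() accepts a list with x and y components
n_clicks <- length(clicks$x)
dev.off()
```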
Visualizing Selected Coordinates:
You can also use locator( ) to plot the points you select as either individual points or as lines.
R> plot(1,1)
R> Rtist <- locator(type="o",pch=4,lty=2,lwd=3,col="red",xpd=TRUE)
R> Rtist
$x
[1] 0.5013189 0.6267149 0.7384407 0.7172250 1.0386740 1.2765699
[7] 1.4711542 1.2352573 1.2220592 0.8583484 1.0483300 1.0091491
$y
[1] 0.6966016 0.9941945 0.9636752 1.2819852 1.2766579 1.4891270
[7] 1.2439071 0.9630832 0.7625887 0.7541716 0.6394519 0.9618461
Ad Hoc Annotation:
The locator function also allows you to place ad hoc annotations, such as legends, on your plot.
The example below uses the student survey data in the MASS package, which you first load by
calling library("MASS").
R> library("MASS")
R> plot(survey$Height~survey$Wr.Hnd,pch=16,
        col=c("gray","black")[as.numeric(survey$Sex)],
        xlab="Writing handspan",ylab="Height")
R> legend(locator(n=1),legend=levels(survey$Sex), pch=16, col=c("gray","black"))
If you specify n=1, locator will automatically terminate after you left-click once in the device.
This plot is almost the same as the default, but note now that there’s no padding space at the end
of the axes.
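The plot that this remark refers to is not shown in the notes. In base R, axes without padding are obtained by setting the axis style parameters xaxs and yaxs to "i", so the sketch below uses that. The vectors hp, mpg and wtcex are not defined anywhere in these notes; the definitions here (in particular wtcex as car weight divided by its mean) are assumptions, and pdf(NULL) is used as an off-screen device.

```r
# Assumed definitions of the vectors used in the plotting calls in these notes.
hp    <- mtcars$hp
mpg   <- mtcars$mpg
wtcex <- mtcars$wt / mean(mtcars$wt)   # assumption: point size relative to average weight

pdf(NULL)                              # off-screen device for this sketch
# xaxs = "i" and yaxs = "i" make the axes end exactly at the data range.
plot(hp, mpg, cex = wtcex, xaxs = "i", yaxs = "i")
axis_limits <- par("usr")              # c(x-min, x-max, y-min, y-max) of the plot region
dev.off()
```

The default style "r" extends the data range by 4 percent at each end, which is the padding the remark above says is missing.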
R> plot(hp,mpg,cex=wtcex,axes=FALSE,ann=FALSE)
Customizing Boxes:
To add a box specific to the current plot region in the active graphics device, you use box and
specify its type with bty.
The bty argument is supplied a single character: "o" (default), "l", "7", "c", "u", "]", or "n".
You can use other relevant parameters that you’ve met already, such as lty, lwd, and col, to
further control the appearance of a box.
R> box(bty="l",lty=3,lwd=2)
R> box(bty="]",lty=2,col="gray")
Customizing Axes
Once you have the box the way you want it, you can focus on the axes. The axis( ) function
allows you to control the addition and appearance of an axis on any of the four sides of the plot
region in greater detail.
The first argument it takes is side, provided with a single integer: 1 (bottom), 2 (left), 3 (top), or
4 (right). These numbers are consistent with the positions of the relevant margin-spacing values
when you’re setting graphical parameter vectors like mar.
R> hpseq <- seq(min(hp),max(hp),length.out=10)
R> plot(hp,mpg,cex=wtcex,xaxt="n",bty="n",ann=FALSE)
R> axis(side=1,at=hpseq)
R> axis(side=3,at=round(hpseq))
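axis() can also take explicit labels and tick settings. The following sketch varies the example above; the definitions of hp, mpg and wtcex are assumptions (they are not given in these notes), and pdf(NULL) is used as an off-screen device.

```r
# Assumed definitions of the vectors used in the notes.
hp    <- mtcars$hp
mpg   <- mtcars$mpg
wtcex <- mtcars$wt / mean(mtcars$wt)

hpseq <- seq(min(hp), max(hp), length.out = 10)

pdf(NULL)                                   # off-screen device for this sketch
plot(hp, mpg, cex = wtcex, xaxt = "n", yaxt = "n", bty = "n", ann = FALSE)

# Bottom axis: ticks at the exact sequence values, labelled with rounded
# numbers, and las = 2 turning the labels perpendicular to the axis.
axis(side = 1, at = hpseq, labels = round(hpseq), las = 2)

# Left axis: ticks every 5 mpg; a positive tcl draws the ticks inward.
axis(side = 2, at = seq(10, 35, by = 5), tcl = 0.5)
dev.off()
```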
Specialized Text and Label Notation:
Font:
The displayed font is controlled by two graphical parameters: family for the specific font family
and font, an integer selector for controlling bold and italic typeface.
There are three generic families—"sans" (the default), "serif", and "mono"—that are always
available.
These are paired with the four possible values of font—1 (normal text, default), 2 (bold), 3
(italic), and 4 (bold and italic).
R> text(0,6,label="sans text (default)\nfamily=\"sans\", font=1")
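A sketch that draws every combination of the three generic families and the four font values on one empty plot. The coordinates and labels are arbitrary choices for illustration, and pdf(NULL) is assumed as an off-screen device.

```r
families <- c("sans", "serif", "mono")
styles   <- c("normal", "bold", "italic", "bold italic")   # font = 1, 2, 3, 4

pdf(NULL)                                   # off-screen device for this sketch
# Empty plot used only as a canvas for the text.
plot(1, 1, type = "n", xlim = c(0, 4), ylim = c(0, 5), ann = FALSE, axes = FALSE)

for (i in seq_along(families)) {
  for (j in seq_along(styles)) {
    # One column per family, one row per font value.
    text(x = i, y = 5 - j, labels = paste(families[i], styles[j]),
         family = families[i], font = j)
  }
}
dev.off()
```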
Mathematical Expressions:
R> expr1 <- expression(c^2==a[1]^2+b[1]^2)
R> expr2 <- expression(paste(pi^{x[i]},(1-pi)^(n-x[i])))
R> expr3 <- expression(paste("Sample mean: ", italic(n)^{-1}, sum(italic(x)[italic(i)],
italic(i)==1, italic(n))==frac(italic(x)[1]+...+italic(x)[italic(n)], italic(n))))
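Expressions like these are displayed by passing them wherever a plot label or text string is accepted. A small sketch follows; the third expression is made up for illustration, the coordinates are arbitrary, and pdf(NULL) is assumed as an off-screen device.

```r
expr1 <- expression(c^2 == a[1]^2 + b[1]^2)
expr2 <- expression(paste(pi^{x[i]}, (1 - pi)^(n - x[i])))

pdf(NULL)                                   # off-screen device for this sketch
# An expression can be used as a title...
plot(1, 1, type = "n", xlim = c(0, 2), ylim = c(0, 3),
     xlab = "", ylab = "", main = expr1)
# ...or placed inside the plot with text().
text(1, 2, labels = expr2, cex = 1.5)
# Illustrative expression (not from the notes): hat() adds a hat, [1] a subscript.
text(1, 1, labels = expression(hat(beta)[1] == 0.6746))
dev.off()
```

In this notation == produces an equals sign, ^ a superscript, [ ] a subscript, and paste() places several pieces side by side.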
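Colors in R can be described by their red, green, and blue (RGB) components, each an integer
from 0 to 255:
R> col2rgb(c("black","green3","pink"))
      [,1] [,2] [,3]
red      0    0  255
green    0  205  192
blue     0    0  203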
These RGB triplets are frequently expressed as hexadecimals, a numeric coding system often
used in computing. In R, a hexadecimal, or hex code, is a character string with a # followed by
six alphanumeric characters: valid characters are the letters A through F and the digits 0 through
9. The first pair of characters represents the red component, and the second and third pairs
represent green and blue, respectively.
R> rgb(t(col2rgb(c("black","green3","pink"))), maxColorValue=255)
[1] "#000000" "#00CD00" "#FFC0CB"
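The conversion works in both directions, as this short sketch shows. The variable names are illustrative.

```r
# Named colors -> RGB matrix (one column per color).
rgb_matrix <- col2rgb(c("black", "green3", "pink"))

# RGB matrix -> hex codes; rgb() expects one color per row, hence t().
hex_codes <- rgb(t(rgb_matrix), maxColorValue = 255)

# Hex codes -> RGB matrix again: col2rgb() accepts hex strings too.
back_again <- col2rgb(hex_codes)

# A single color built directly from its three components:
# 0 red, 205 green (CD in hexadecimal), 0 blue.
from_numbers <- rgb(0, 205, 0, maxColorValue = 255)
```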
3D Scatterplots
3D scatterplots allow you to plot raw observations based on three continuous
variables at once, as opposed to only two in a conventional 2D scatterplot.
Basic Syntax:
The syntax of the scatterplot3d function is similar to the default plot function. In the latter, you
supply a vector of x- and y-axis coordinates; in the former, you merely supply an additional third
vector of values providing the z-axis coordinates. With that additional dimension, you can think
of these three axes in terms of the x-axis increasing from left to right, the y-axis increasing from
foreground to background, and the z-axis increasing from bottom to top.
Install and load the scatterplot3d package.
library("scatterplot3d")
pwid <- iris$Petal.Width
plen <- iris$Petal.Length
swid <- iris$Sepal.Width
slen <- iris$Sepal.Length
scatterplot3d(x=pwid,y=plen,z=swid)
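A possible extension of the basic call, coloring the points by species and labelling the axes. This is a sketch: the colors, plotting symbol and viewing angle are arbitrary choices, the code is skipped if the scatterplot3d package is not installed, and pdf(NULL) is assumed as an off-screen device.

```r
# One color per observation, chosen by species (arbitrary color choices).
species_cols <- c("black", "red", "blue")[as.numeric(iris$Species)]

if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  library("scatterplot3d")
  pdf(NULL)                                 # off-screen device for this sketch
  scatterplot3d(x = iris$Petal.Width, y = iris$Petal.Length, z = iris$Sepal.Width,
                color = species_cols, pch = 16,
                xlab = "Petal width", ylab = "Petal length", zlab = "Sepal width",
                type = "h",                 # vertical lines from each point to the base
                angle = 55)                 # angle between the x- and y-axes
  dev.off()
}
```

The vertical lines drawn by type="h" make it easier to judge where each point sits in the x-y plane.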