How To Use "Qqplot": X: Independent Variable, Y: Dependent Variable

The document provides instructions on how to use qqplot in R to create quantile-quantile (Q-Q) plots for normal and other distributions like Poisson. It then discusses simple linear regression, including how to find the regression line and coefficients using the lm() function, make predictions, and identify outliers. Methods for resistant regression like least trimmed squares (lqs()) and rlm() are introduced. Finally, examples are given of adding trend lines using scatter.smooth(), smooth.spline(), and supsmu() for non-linear relationships.

Uploaded by Daniel Wu

How to use qqplot


(1) Normal quantile-quantile plot
# Generating data from the normal distribution
x <- rnorm(500)
hist(x)
qqnorm(x)
qqline(x)

(2) Q-Q plot for other distributions


# Generating data from Poisson distribution
x <- rpois(100, lambda=5)
hist(x)
par(mfrow=c(1,2), pty="s")
# Comparing against a Poisson
# First, generate theoretical quantiles. ppoints() gives evenly spaced
# probabilities strictly inside (0, 1); using seq(0, 1, ...) would include
# probability 1, whose Poisson quantile is Inf.
th.quantile = qpois( ppoints(length(x)), lambda=mean(x) )
qqplot( th.quantile, x, xlab="Theoretical Quantiles", ylab="x")
title(main="Poisson Q-Q Plot")
# Comparing against a Normal
qqnorm(x, ylab="x")
qqline(x)
par(mfrow=c(1,1))

3.4 Simple linear regression


variables x and y have a linear relationship
=> y = mx + b, where m is the slope, b the intercept.
x : independent variable, y : dependent variable

simple linear regression model :

y_i = β0 + β1 x_i + ε_i

ε_i : error term
β0 and β1 : regression coefficients
x : predictor variable
y : response variable
meaning of "linear" : applies to the way the regression coefficients are used.
meaning of "simple" : only one predictor variable is used.
The estimated regression line : ŷ = b0 + b1 x
the predicted value : ŷ_i = b0 + b1 x_i
residual : e_i = y_i − ŷ_i, the difference between the observed value and the predicted value,
i.e., the signed vertical distance of the point (x_i, y_i) to the prediction line.

Estimation method : The method of least squares

- chooses the coefficients so that the sum of the squared residuals is as small as possible :

b1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
b0 = ȳ − b1 x̄
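The least-squares formulas can be checked numerically against lm(). A minimal sketch on simulated data (the variable names and simulated values are illustrative, not from the text):

```r
# Simulate data with a known linear relationship
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() produces the same coefficients
fit <- lm(y ~ x)
c(b0, b1)
coef(fit)
```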

3.4.1 Using the regression model for prediction


- make predictions for the response value for new values of the predictor.

3.4.2 Finding the regression coefficients using lm()


lm() : function for linear model fitting
lm(model.formula)

ex) lm(y ~ x) : y is modeled by x

Example 3.5: The regression line for the Maple wood home data
homedata(UsingR) : a strong linear trend between the 1970 and the 2000 assessments.
attach(homedata)
lm(y2000 ~ y1970)
fit1 = lm(y2000 ~ y1970) # save the result
fit1

Adding the regression line to a scatterplot: abline()


plot(y1970, y2000, main="-113,000+5.43x")

abline(fit1)
Remark) abline() function can add other lines too :
abline(a,b) : the line y=a+bx
abline(h=c) : the horizontal line y=c
abline(v=c) : the vertical line x=c

Using the regression line for predictions


predict the y value for a given x value
(1)
-113000 + 5.43*50000
(2)
betas = coef(fit1)
sum(betas * c(1, 50000)) # beta0 * 1 + beta1 * 50000
Remark) Other useful extractor functions like coef()
residuals() : returns the residuals
predict() : perform predictions

Eg. Find the predicted and residual value at the data point (55100, 130200)
To specify the x value, a data frame is required with properly named variables.
predict(fit1, data.frame(y1970=55100))
130200 - predict(fit1, data.frame(y1970=55100)) # residual

More on model formulas


summary() is a generic function : its output differs depending on the input argument.
The plot() function is another example of a generic function in R :
plot( model formula ) : a scatterplot is created.
plot( the output of the density() function ) : a density plot is produced.
plot(y2000 ~ y1970)
fit1 = lm(y2000 ~ y1970)
abline(fit1)
3.4.3 Transformations of the data
Example 3.6: Kids weights: Is weight related to height squared?
kid.weights(UsingR) data set : the relationship between height and weight.
The body mass index (BMI) suggests a relationship between height squared and weight.

height.sq = kid.weights$height^2
plot(weight ~ height.sq, data=kid.weights)
fit2 = lm(weight ~ height.sq, data=kid.weights)
abline(fit2)
fit2

Using a model formula with transformations


(1) Wrong method
plot(weight ~ height^2, data=kid.weights) # not as expected
fit2 = lm(weight ~ height^2, data=kid.weights)
abline(fit2)

(2) Right method


plot(weight ~ I(height^2), data=kid.weights)
fit2 = lm(weight ~ I(height^2), data=kid.weights)
abline(fit2)
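The difference between the two methods can be verified directly: inside a model formula, ^ is a formula operator, so x^2 is interpreted as just x unless wrapped in I(). A small sketch on simulated data (the names and values are illustrative):

```r
set.seed(3)
x <- runif(40, 1, 5)
y <- x^2 + rnorm(40, sd = 0.2)

# Without I(), x^2 in a formula reduces to just x
fit.wrong <- lm(y ~ x^2)
fit.plain <- lm(y ~ x)
all.equal(unname(coef(fit.wrong)), unname(coef(fit.plain)))  # identical fits

# With I(), the square is actually computed
fit.right <- lm(y ~ I(x^2))
coef(fit.right)
```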

3.4.4 Interacting with a scatterplot


identify() : R function to identify points on a scatterplot.
usage : identify(x, y, labels=, n=)
The value n= : specifies the number of points to identify.
The argument labels= : allows for the placement of other text.
locator() : locates the (x, y) coordinates of points selected with the mouse.
called with the number of points desired, as with locator(2).

Example 3.7: Florida 2000


florida(UsingR) data set : county-by-county vote counts for the 2000 U.S. presidential election
in the state of Florida.
plot(BUCHANAN ~ BUSH, data=florida) # two outliers.
res = lm(BUCHANAN ~ BUSH, data=florida)
abline(res)
with(florida, identify(BUSH, BUCHANAN, n=2, labels=County))
florida$County[c(13,50)]
The predicted amount and residual for Palm Beach :
with(florida, predict(res, data.frame(BUSH = BUSH[50])))

residuals(res)[50]

Buchanan received an estimated 2,610 of Gore's votes, many more than the 567 that decided
the state and the presidency.

3.4.5 Outliers in the regression model


Two types of outliers:
outlier for the individual variables
outlier in the regression : points that are far from the trend or pattern of the data.
Example 3.8: Emissions versus GDP
emissions(UsingR) data set : data for several countries on CO2 emissions and per-capita
gross domestic product (GDP).
f = CO2 ~ perCapita # save formula
plot(f, data=emissions) # one isolated point that seems to pull the regression line upward
abline( lm(CO2 ~ perCapita, data=emissions) )
abline( lm(f, data=emissions, subset=-1), lty=2 )
Remark) U.S. point is an outlier for the CO2 variable, but not for the per-capita GDP.
an outlier in regression, as it stands far off from the trend set by the rest of the data.
an influential observation, as its presence dramatically affects the regression line.

3.4.6

Resistant regression lines: lqs() and rlm()

The regression coefficients are subject to strong influences from outliers.

use resistant regression methods.

(1) Least-trimmed squares


The method of least-trimmed squares
- use the sum of the q smallest squared residuals, where q is roughly n/2.
- lqs() function from the MASS package.
library(MASS)
abline( lqs(f, data=emissions), lty=3 )
(2) Resistant regression using rlm()
rlm() function, from the MASS package.
abline( rlm(f, data=emissions, method="MM"), lty=4 )
(3) Adding legends to plots
legend()
The placement : in (x, y) coordinates or done with the mouse using locator(n=1).

The labels : legend= argument.


The markings : different line types (lty=); with different colors (col=);
or with different plot characters (pch=).
the.labels = c("lm", "lm w/o 1", "least trimmed squares", "rlm with MM")
the.ltys = 1:4
legend(5000, 6000, legend=the.labels, lty=the.ltys)

3.4.7 Trend lines


When no transformation produces a linear relationship, a trend line can be superimposed
on the data using one of several smoothing techniques :
scatter.smooth() : uses the loess() function to plot both the scatterplot and a trend line.
smooth.spline() : fits the data using cubic splines.
supsmu() : performs Friedman's super smoother algorithm.
Example 3.9: Five years of temperature data
five.yr.temperature(UsingR) : five years of New York City temperature data.
- scatterplot shows a periodic, sinusoidal pattern.
attach(five.yr.temperature)
scatter.smooth(temps ~ days, col=gray(0.75))
lines(smooth.spline(temps ~ days), lty=2, lwd=2)
lines(supsmu(days, temps), lty=3, lwd=2)
legend(locator(1), lty=c(1,2,3), lwd=c(1,2,2),
       legend=c("scatter.smooth", "smooth.spline", "supsmu"))
detach(five.yr.temperature)
