
Model visualisation (with ggplot2)

Hadley Wickham
Rice University

Monday, 13 July 2009


1. Introducing plot.lm
2. The current state of play. Why this is suboptimal.
3. A better strategy: separate data from representation.
4. Why a canned set of plots is not good enough.


plot.lm(mod, which = 1)

[Figure: "Residuals vs Fitted" plot drawn by plot.lm. x axis: Fitted values (about -0.2 to 0.6); y axis: Residuals (about -0.3 to 0.3); observations 624, 133 and 574 are labelled. Sub-caption: lm(log10(sales) ~ city * ns(date, 3) + factor(month))]

# File src/library/stats/R/plot.lm.R
# Part of the R package, http://www.R-project.org
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# A copy of the GNU General Public License is available at
# http://www.r-project.org/Licenses/

plot.lm <-
function (x, which = c(1L:3,5), ## was which = 1L:4,
          caption = list("Residuals vs Fitted", "Normal Q-Q",
                         "Scale-Location", "Cook's distance",
                         "Residuals vs Leverage",
                         expression("Cook's dist vs Leverage " * h[ii] / (1 - h[ii]))),
          panel = if(add.smooth) panel.smooth else points,
          sub.caption = NULL, main = "",
          ask = prod(par("mfcol")) < length(which) && dev.interactive(), ...,
          id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75,
          qqline = TRUE, cook.levels = c(0.5, 1.0),
          add.smooth = getOption("add.smooth"),
          label.pos = c(4,2), cex.caption = 1)
{
    dropInf <- function(x, h) {
        if(any(isInf <- h >= 1.0)) {
            warning("Not plotting observations with leverage one:\n ",
                    paste(which(isInf), collapse=", "),
                    call.=FALSE)
            x[isInf] <- NaN
        }
        x
    }

    if (!inherits(x, "lm"))
        stop("use only with \"lm\" objects")
    if(!is.numeric(which) || any(which < 1) || any(which > 6))
        stop("'which' must be in 1L:6")
    isGlm <- inherits(x, "glm")
    show <- rep(FALSE, 6)
    show[which] <- TRUE
    r <- residuals(x)
    yh <- predict(x) # != fitted() for glm
    w <- weights(x)
    if(!is.null(w)) { # drop obs with zero wt: PR#6640
        wind <- w != 0
        r <- r[wind]
        yh <- yh[wind]
        w <- w[wind]
        labels.id <- labels.id[wind]
    }
    n <- length(r)
    if (any(show[2L:6L])) {
        s <- if (inherits(x, "rlm")) x$s
             else if(isGlm) sqrt(summary(x)$dispersion)
             else sqrt(deviance(x)/df.residual(x))
        hii <- lm.influence(x, do.coef = FALSE)$hat
        if (any(show[4L:6L])) {
            cook <- if (isGlm) cooks.distance(x)
                    else cooks.distance(x, sd = s, res = r)
        }
    }
    if (any(show[2L:3L])) {
        ylab23 <- if(isGlm) "Std. deviance resid." else "Standardized residuals"
        r.w <- if (is.null(w)) r else sqrt(w) * r
        ## NB: rs is already NaN if r=0, hii=1
        rs <- dropInf( r.w/(s * sqrt(1 - hii)), hii )
    }

    if (any(show[5L:6L])) { # using 'leverages'
        r.hat <- range(hii, na.rm = TRUE) # though should never have NA
        isConst.hat <- all(r.hat == 0) ||
            diff(r.hat) < 1e-10 * mean(hii, na.rm = TRUE)
    }
    if (any(show[c(1L, 3L)]))
        l.fit <- if (isGlm) "Predicted values" else "Fitted values"
    if (is.null(id.n))
        id.n <- 0
    else {
        id.n <- as.integer(id.n)
        if(id.n < 0L || id.n > n)
            stop(gettextf("'id.n' must be in {1,..,%d}", n), domain = NA)
    }
    if(id.n > 0L) { ## label the largest residuals
        if(is.null(labels.id))
            labels.id <- paste(1L:n)
        iid <- 1L:id.n
        show.r <- sort.list(abs(r), decreasing = TRUE)[iid]
        if(any(show[2L:3L]))
            show.rs <- sort.list(abs(rs), decreasing = TRUE)[iid]
        text.id <- function(x, y, ind, adj.x = TRUE) {
            labpos <-
                if(adj.x) label.pos[1+as.numeric(x > mean(range(x)))] else 3
            text(x, y, labels.id[ind], cex = cex.id, xpd = TRUE,
                 pos = labpos, offset = 0.25)
        }
    }
    getCaption <- function(k) # allow caption = "" , plotmath etc
        as.graphicsAnnot(unlist(caption[k]))

    if(is.null(sub.caption)) { ## construct a default:
        cal <- x$call
        if (!is.na(m.f <- match("formula", names(cal)))) {
            cal <- cal[c(1, m.f)]
            names(cal)[2L] <- "" # drop " formula = "
        }
        cc <- deparse(cal, 80) # (80, 75) are ``parameters''
        nc <- nchar(cc[1L], "c")
        abbr <- length(cc) > 1 || nc > 75
        sub.caption <-
            if(abbr) paste(substr(cc[1L], 1L, min(75L, nc)), "...") else cc[1L]
    }
    one.fig <- prod(par("mfcol")) == 1
    if (ask) {
        oask <- devAskNewPage(TRUE)
        on.exit(devAskNewPage(oask))
    }
    ##---------- Do the individual plots : ----------
    if (show[1L]) {
        ylim <- range(r, na.rm=TRUE)
        if(id.n > 0)
            ylim <- extendrange(r= ylim, f = 0.08)
        plot(yh, r, xlab = l.fit, ylab = "Residuals", main = main,
             ylim = ylim, type = "n", ...)
        panel(yh, r, ...)
        if (one.fig)
            title(sub = sub.caption, ...)
        mtext(getCaption(1), 3, 0.25, cex = cex.caption)
        if(id.n > 0) {
            y.id <- r[show.r]
            y.id[y.id < 0] <- y.id[y.id < 0] - strheight(" ")/3
            text.id(yh[show.r], y.id, show.r)
        }
        abline(h = 0, lty = 3, col = "gray")
    }
    if (show[2L]) { ## Normal
        ylim <- range(rs, na.rm=TRUE)
        ylim[2L] <- ylim[2L] + diff(ylim) * 0.075
        qq <- qqnorm(rs, main = main, ylab = ylab23, ylim = ylim, ...)
        if (qqline) qqline(rs, lty = 3, col = "gray50")
        if (one.fig)
            title(sub = sub.caption, ...)
        mtext(getCaption(2), 3, 0.25, cex = cex.caption)
        if(id.n > 0)
            text.id(qq$x[show.rs], qq$y[show.rs], show.rs)
    }
    if (show[3L]) {
        sqrtabsr <- sqrt(abs(rs))
        ylim <- c(0, max(sqrtabsr, na.rm=TRUE))
        yl <- as.expression(substitute(sqrt(abs(YL)), list(YL=as.name(ylab23))))
        yhn0 <- if(is.null(w)) yh else yh[w!=0]
        plot(yhn0, sqrtabsr, xlab = l.fit, ylab = yl, main = main,
             ylim = ylim, type = "n", ...)
        panel(yhn0, sqrtabsr, ...)
        if (one.fig)
            title(sub = sub.caption, ...)
        mtext(getCaption(3), 3, 0.25, cex = cex.caption)
        if(id.n > 0)
            text.id(yhn0[show.rs], sqrtabsr[show.rs], show.rs)
    }
    if (show[4L]) {
        if(id.n > 0) {
            show.r <- order(-cook)[iid]# index of largest 'id.n' ones
            ymx <- cook[show.r[1L]] * 1.075
        } else ymx <- max(cook, na.rm = TRUE)
        plot(cook, type = "h", ylim = c(0, ymx), main = main,
             xlab = "Obs. number", ylab = "Cook's distance", ...)
        if (one.fig)
            title(sub = sub.caption, ...)
        mtext(getCaption(4), 3, 0.25, cex = cex.caption)
        if(id.n > 0)
            text.id(show.r, cook[show.r], show.r, adj.x=FALSE)
    }
    if (show[5L]) {
        ylab5 <- if (isGlm) "Std. Pearson resid." else "Standardized residuals"
        r.w <- residuals(x, "pearson")
        if(!is.null(w)) r.w <- r.w[wind] # drop 0-weight cases
        rsp <- dropInf( r.w/(s * sqrt(1 - hii)), hii )
        ylim <- range(rsp, na.rm = TRUE)
        if (id.n > 0) {
            ylim <- extendrange(r= ylim, f = 0.08)
            show.rsp <- order(-cook)[iid]
        }
        do.plot <- TRUE
        if(isConst.hat) { ## leverages are all the same
            if(missing(caption)) # set different default
                caption[[5]] <- "Constant Leverage:\n Residuals vs Factor Levels"
            ## plot against factor-level combinations instead
            aterms <- attributes(terms(x))
            ## classes w/o response
            dcl <- aterms$dataClasses[ -aterms$response ]
            facvars <- names(dcl)[dcl %in% c("factor", "ordered")]
            mf <- model.frame(x)[facvars]# better than x$model
            if(ncol(mf) > 0) {
                ## now re-order the factor levels *along* factor-effects
                ## using a "robust" method {not requiring dummy.coef}:
                effM <- mf
                for(j in seq_len(ncol(mf)))
                    effM[, j] <- sapply(split(yh, mf[, j]), mean)[mf[, j]]
                ord <- do.call(order, effM)
                dm <- data.matrix(mf)[ord, , drop = FALSE]
                ## #{levels} for each of the factors:
                nf <- length(nlev <- unlist(unname(lapply(x$xlevels, length))))
                ff <- if(nf == 1) 1 else rev(cumprod(c(1, nlev[nf:2])))
                facval <- ((dm-1) %*% ff)
                ## now reorder to the same order as the residuals
                facval[ord] <- facval
                xx <- facval # for use in do.plot section.

                plot(facval, rsp, xlim = c(-1/2, sum((nlev-1) * ff) + 1/2),
                     ylim = ylim, xaxt = "n",
                     main = main, xlab = "Factor Level Combinations",
                     ylab = ylab5, type = "n", ...)
                axis(1, at = ff[1L]*(1L:nlev[1L] - 1/2) - 1/2,
                     labels= x$xlevels[[1L]][order(sapply(split(yh,mf[,1]),
                                                          mean))])
                mtext(paste(facvars[1L],":"), side = 1, line = 0.25, adj=-.05)
                abline(v = ff[1L]*(0:nlev[1L]) - 1/2, col="gray", lty="F4")
                panel(facval, rsp, ...)
                abline(h = 0, lty = 3, col = "gray")
            }
            else { # no factors
                message("hat values (leverages) are all = ",
                        format(mean(r.hat)),
                        "\n and there are no factor predictors; no plot no. 5")
                frame()
                do.plot <- FALSE
            }
        }
        else { ## Residual vs Leverage
            xx <- hii
            ## omit hatvalues of 1.
            xx[xx >= 1] <- NA

            plot(xx, rsp, xlim = c(0, max(xx, na.rm = TRUE)), ylim = ylim,
                 main = main, xlab = "Leverage", ylab = ylab5, type = "n",
                 ...)
            panel(xx, rsp, ...)
            abline(h = 0, v = 0, lty = 3, col = "gray")
            if (one.fig)
                title(sub = sub.caption, ...)
            if(length(cook.levels)) {
                p <- length(coef(x))
                usr <- par("usr")
                hh <- seq.int(min(r.hat[1L], r.hat[2L]/100), usr[2L],
                              length.out = 101)
                for(crit in cook.levels) {
                    cl.h <- sqrt(crit*p*(1-hh)/hh)
                    lines(hh, cl.h, lty = 2, col = 2)
                    lines(hh,-cl.h, lty = 2, col = 2)
                }
                legend("bottomleft", legend = "Cook's distance",
                       lty = 2, col = 2, bty = "n")
                xmax <- min(0.99, usr[2L])
                ymult <- sqrt(p*(1-xmax)/xmax)
                aty <- c(-sqrt(rev(cook.levels))*ymult,
                         sqrt(cook.levels)*ymult)
                axis(4, at = aty,
                     labels = paste(c(rev(cook.levels), cook.levels)),
                     mgp = c(.25,.25,0), las = 2, tck = 0,
                     cex.axis = cex.id, col.axis = 2)
            }
        } # if(const h_ii) .. else ..
        if (do.plot) {
            mtext(getCaption(5), 3, 0.25, cex = cex.caption)
            if (id.n > 0) {
                y.id <- rsp[show.rsp]
                y.id[y.id < 0] <- y.id[y.id < 0] - strheight(" ")/3
                text.id(xx[show.rsp], y.id, show.rsp)
            }
        }
    }
    if (show[6L]) {
        g <- dropInf( hii/(1-hii), hii )
        ymx <- max(cook, na.rm = TRUE)*1.025
        plot(g, cook, xlim = c(0, max(g, na.rm=TRUE)), ylim = c(0, ymx),
             main = main, ylab = "Cook's distance",
             xlab = expression("Leverage " * h[ii]),
             xaxt = "n", type = "n", ...)
        panel(g, cook, ...)
        ## Label axis with h_ii values
        athat <- pretty(hii)
        axis(1, at = athat/(1-athat), labels = paste(athat))
        if (one.fig)
            title(sub = sub.caption, ...)
        p <- length(coef(x))
        bval <- pretty(sqrt(p*cook/g), 5)

        usr <- par("usr")
        xmax <- usr[2L]
        ymax <- usr[4L]
        for(i in 1L:length(bval)) {
            bi2 <- bval[i]^2
            if(ymax > bi2*xmax) {
                xi <- xmax + strwidth(" ")/3
                yi <- bi2*xi
                abline(0, bi2, lty = 2)
                text(xi, yi, paste(bval[i]), adj = 0, xpd = TRUE)
            } else {
                yi <- ymax - 1.5*strheight(" ")
                xi <- yi/bi2
                lines(c(0, xi), c(0, yi), lty = 2)
                text(xi, ymax-0.8*strheight(" "), paste(bval[i]),
                     adj = 0.5, xpd = TRUE)
            }
        }

        ## axis(4, at=p*cook.levels, labels=paste(c(rev(cook.levels), cook.levels)),
        ## mgp=c(.25,.25,0), las=2, tck=0, cex.axis=cex.id)
        mtext(getCaption(6), 3, 0.25, cex = cex.caption)
        if (id.n > 0) {
            show.r <- order(-cook)[iid]
            text.id(g[show.r], cook[show.r], show.r)
        }
    }

    if (!one.fig && par("oma")[3L] >= 1)
        mtext(sub.caption, outer = TRUE, cex = 1.25)
    invisible()
}

Problems

Hard to understand.
Hard to extend.
Locked into set of pre-specified graphics.
Of no use to other graphics packages.



Alternative approach

What does this code actually do?

It 1) extracts various quantities of interest from the model and then 2) plots them.
So why not perform those two tasks separately?
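
As a minimal sketch of that split, assume a small example model fitted to the built-in mtcars data. The model and the helper names extract_diagnostics() and plot_resid_fitted() are hypothetical and not part of the talk. The extraction step returns an ordinary data frame, and the plotting step only ever sees that data frame:

```r
# Hypothetical example model; any lm fit would do.
mod_cars <- lm(mpg ~ wt + hp, data = mtcars)

# Step 1: extract quantities of interest into a plain data frame.
extract_diagnostics <- function(model) {
  data.frame(
    .fitted = unname(fitted(model)),
    .resid  = unname(resid(model))
  )
}

# Step 2: plot from the data frame alone; the model is no longer needed.
plot_resid_fitted <- function(diag) {
  plot(diag$.fitted, diag$.resid,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 3, col = "gray")
  invisible(diag)
}

diag_cars <- extract_diagnostics(mod_cars)
plot_resid_fitted(diag_cars)
```

Because step 2 depends only on column names, the same data frame could equally be handed to lattice or ggplot2.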



Quantities of interest

fortify.lm <- function(model, data = model$model, ...) {
  infl <- influence(model, do.coef = FALSE)
  data$.hat <- infl$hat
  data$.sigma <- infl$sigma
  data$.cooksd <- cooks.distance(model, infl)

  data$.fitted <- predict(model)
  data$.resid <- resid(model)
  data$.stdresid <- rstandard(model, infl)

  data
}

Note the use of the . prefix to avoid name clashes.
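
To see why the prefix matters, here is a small sketch under assumed inputs: a copy of mtcars with a predictor deliberately named resid (a hypothetical example, not data from the talk). The function is repeated from the slide so that the example runs on its own.

```r
# Copy of the slide's function so that this example is self-contained.
fortify.lm <- function(model, data = model$model, ...) {
  infl <- influence(model, do.coef = FALSE)
  data$.hat <- infl$hat
  data$.sigma <- infl$sigma
  data$.cooksd <- cooks.distance(model, infl)
  data$.fitted <- predict(model)
  data$.resid <- resid(model)
  data$.stdresid <- rstandard(model, infl)
  data
}

# Hypothetical example: the model frame already has a column called "resid".
cars <- mtcars
cars$resid <- cars$qsec
mod_cars <- lm(mpg ~ wt + resid, data = cars)

diag_cars <- fortify.lm(mod_cars)

# The original "resid" predictor and the new ".resid" diagnostic coexist.
names(diag_cars)
head(diag_cars[, c("resid", ".resid")])
```

Without the prefix, the residual column would have overwritten the predictor of the same name.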
plot.lm(mod, which = 1)

[Figure: the plot.lm "Residuals vs Fitted" plot for lm(log10(sales) ~ city * ns(date, 3) + factor(month)), shown again next to its ggplot2 equivalent. x axis: Fitted values; y axis: Residuals.]

ggplot(mod, aes(.fitted, .resid)) +
  geom_hline(yintercept = 0) +
  geom_point() +
  geom_smooth(se = FALSE)


[Figure: ggplot2 scatterplot of .resid (about -0.2 to 0.2) against .fitted (about -0.2 to 0.6).]
Diagnostics should reflect data

[Figure: scatterplot of .resid against .fitted, repeated.]

Use informative x variable

[Figure: scatterplot of .resid (about -0.2 to 0.2) against date (2000 to 2008).]
Connect original units

[Figure: .resid (about -0.2 to 0.2) against date (2000 to 2008), drawn as connected lines.]
Colour by possible explanatory variable

[Figure: .resid (about -0.2 to 0.2) against date (2000 to 2008), drawn as coloured lines.]
[Figure: .resid (about -0.2 to 0.2) against date (2000 to 2008) in six panels, one per city: Austin, Bryan-College Station, Dallas, Houston, San Antonio, San Marcos. Annotations on the figure: 48,000 / 86,000 and 29,000 / 50,000.]
ggplot(modf, aes(date, .resid)) +
  geom_line(aes(group = city))

ggplot(modf, aes(date, .resid, colour = college_town)) +
  geom_line(aes(group = city))

ggplot(modf, aes(date, .resid)) +
  geom_line(aes(group = city)) +
  facet_wrap(~ city)
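
The housing data behind modf is not included here, so the following is only a sketch of the same three plots on a stand-in: the built-in ChickWeight data, with Time playing the role of date, Chick the role of city and Diet the role of college_town. All of these substitutions, and the simple model, are assumptions made for illustration. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Stand-in data (an assumption; not the data used in the talk).
chick <- as.data.frame(ChickWeight)
mod_chick <- lm(weight ~ Time, data = chick)
chick$.resid <- resid(mod_chick)

# Connect original units: one line per chick.
p_units <- ggplot(chick, aes(Time, .resid)) +
  geom_line(aes(group = Chick))

# Colour by a possible explanatory variable.
p_colour <- ggplot(chick, aes(Time, .resid, colour = Diet)) +
  geom_line(aes(group = Chick))

# One panel per level of that variable.
p_facet <- p_units + facet_wrap(~ Diet)

print(p_facet)
```

Each plot is a one-line change from the previous one, which is the point of keeping the diagnostics in a data frame.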


fortify.lm <- function(model, data = model$model, ...) {
  infl <- influence(model, do.coef = FALSE)
  data$.hat <- infl$hat
  data$.sigma <- infl$sigma
  data$.cooksd <- cooks.distance(model, infl)

  data$.fitted <- predict(model)
  data$.resid <- resid(model)
  data$.stdresid <- rstandard(model, infl)

  data
}

# Which = 1
ggplot(mod, aes(.fitted, .resid)) +
  geom_hline(yintercept = 0) +
  geom_point() +
  geom_smooth(se = FALSE)

# Which = 2
ggplot(mod, aes(sample = .stdresid)) +
  stat_qq() +
  geom_abline()

# Which = 3
ggplot(mod, aes(.fitted, abs(.stdresid))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_y_sqrt()

# Which = 4
# rownames() is NULL for an lm object, so take row names from the fortified data
modf <- fortify.lm(mod)
modf$row <- rownames(modf)
ggplot(modf, aes(row, .cooksd)) +
  geom_bar(stat = "identity")

# Which = 5
ggplot(mod, aes(.hat, .stdresid)) +
  geom_vline(size = 2, colour = "white", xintercept = 0) +
  geom_hline(size = 2, colour = "white", yintercept = 0) +
  geom_point() +
  geom_smooth(se = FALSE)

# Which = 6
ggplot(mod, aes(.hat, .cooksd)) +
  geom_vline(colour = NA) +
  geom_abline(slope = seq(0, 3, by = 0.5), colour = "white") +
  geom_smooth(se = FALSE) +
  geom_point()
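
The Which = 5 plot in plot.lm also overlays Cook's distance contours, computed as cl.h <- sqrt(crit*p*(1-hh)/hh). The sketch below, on a hypothetical mtcars model, checks the identity behind that line using only the kinds of columns fortify.lm produces, and then builds the contour as a data frame. The variable names and the 0.5 level are chosen for illustration.

```r
mod_cars <- lm(mpg ~ wt + hp, data = mtcars)  # hypothetical example model

# The same quantities fortify.lm stores as .hat, .stdresid and .cooksd.
hat      <- unname(hatvalues(mod_cars))
stdresid <- unname(rstandard(mod_cars))
cooksd   <- unname(cooks.distance(mod_cars))
p        <- length(coef(mod_cars))

# Cook's distance written in terms of leverage and standardised residual.
cooksd_from_parts <- stdresid^2 * hat / (p * (1 - hat))
identity_holds <- isTRUE(all.equal(cooksd, cooksd_from_parts))

# Solving for the residual gives the contour for a chosen
# Cook's distance level (0.5 here), as a plain data frame.
contour_df <- data.frame(.hat = seq(0.05, 0.6, length.out = 50))
contour_df$upper <- sqrt(0.5 * p * (1 - contour_df$.hat) / contour_df$.hat)
contour_df$lower <- -contour_df$upper
```

A data frame like contour_df could then be drawn as extra line layers on top of the Which = 5 plot, rather than being hard-wired into the plotting function.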


Other models

A work in progress: hard work because most of the functions are like plot.lm
Models: lm, tsdiag, survreg
Maps: maps, and sp classes. Much easier to work with data frames.
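
As a rough sketch of what the same idea might look like for a time series model, the code below pulls out the quantities behind two of the panels that tsdiag draws for an arima fit (standardised residuals and Ljung-Box p-values) as data frames. The example series, the model order, the number of lags and all object names are assumptions made for illustration; this is not the talk's implementation.

```r
# Hypothetical example fit on a built-in series.
fit <- arima(lh, order = c(1, 0, 0))
res <- residuals(fit)

# Per-observation quantities: one row per time point.
resid_df <- data.frame(
  time      = as.numeric(time(res)),
  .resid    = as.numeric(res),
  .stdresid = as.numeric(res) / sqrt(fit$sigma2)
)

# Per-lag quantities: Ljung-Box p-values for lags 1 to 10.
lags <- 1:10
ljung_df <- data.frame(
  lag     = lags,
  p.value = vapply(
    lags,
    function(k) Box.test(res, lag = k, type = "Ljung-Box")$p.value,
    numeric(1)
  )
)
```

Note that the two data frames have different units of observation (time points versus lags), so a single fortify-style data frame is not always enough.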



Conclusions

Separating data from visualisation improves clarity and reusability.
A pre-specified set of plots will not uncover many model problems. It should be easy to create custom diagnostics for your needs.



crantastic! http://crantastic.org
A community site for finding, rating, and reviewing R packages.
