Chapter 1 Introduction of Regression
Department of Statistics
The Chinese University of Hong Kong
2021/22 Term 1
Chapter Outline
• Section 1.1: Motivation
• Section 1.2: Five Examples on Regression
• Section 1.3: Installation of R and R libraries
• Section 1.4: Mean and Variance Functions
• Section 1.5: Separated Points
• Section 1.6: Scatterplot
Section 1.1
Motivation
Motivation: Example
• Problem of interest: predict the overall GPA of students at CUHK
• Methodology:
1. Select a random sample of students who have graduated from CUHK
2. Record the following for each student:
• Overall GPA
• Student characteristics, e.g. IQ, A-Level (AL) results, major, gender, etc.
3. Use the above information to predict the overall GPA of current students
Examples on Regression
• Next few pages: Examples with data available in R (alr4 library)
• Messages from the examples:
Given Mheight = x (inches), Dheight is normally distributed with constant variance $(1 - \rho^2)\sigma_y^2$.
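The constant conditional variance can be checked by simulation. The sketch below assumes that the two heights follow a bivariate normal distribution; the means, standard deviations and correlation are made-up illustrative values, not estimates from the heights data.

```r
# Simulation sketch: conditional variance of Y given X under a bivariate normal.
# All parameter values below are hypothetical, chosen only for illustration.
set.seed(2021)
n    <- 10000
mu_x <- 62; sd_x <- 2.5   # "mother" mean and SD in inches (assumed)
mu_y <- 64; sd_y <- 2.5   # "daughter" mean and SD in inches (assumed)
rho  <- 0.5               # correlation between the two heights (assumed)

x <- rnorm(n, mean = mu_x, sd = sd_x)
# Build y so that (x, y) is bivariate normal with correlation rho
y <- mu_y + rho * (sd_y / sd_x) * (x - mu_x) +
  sqrt(1 - rho^2) * sd_y * rnorm(n)

fit <- lm(y ~ x)
resid_var  <- sum(resid(fit)^2) / (n - 2)  # estimate of Var(Y | X = x)
theory_var <- (1 - rho^2) * sd_y^2         # (1 - rho^2) * sigma_y^2

print(c(estimated = resid_var, theoretical = theory_var))
```

The estimated residual variance should be close to the theoretical value, and it does not depend on x, which is what "constant variance" means here.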
Example 2 – Forbes’ Data
• In the 1850s, the barometer was a fragile instrument for measuring atmospheric pressure.
• James D. Forbes (1857): use the boiling point of water, which can be measured more reliably with a thermometer, as a substitute for a direct measurement of atmospheric pressure
• At 17 different locations in the Alps and Scotland, he measured
• the pressure (in inches of mercury) using a barometer, and
• the boiling point of water (in °F)
• Question: Does the boiling point of water vary with atmospheric pressure in a linear way?
Example 2 – Forbes’ Data
• High Altitude: Low Atmospheric Pressure and Low Boiling Point of Water
• Low Altitude: High Atmospheric Pressure and High Boiling Point of Water
Example 2 – Forbes’ Data
• x = boiling point of water (in °F)
• y = atmospheric pressure (in inches of mercury)
• Residual plot (right panel): presence of systematic error (possibly a quadratic relationship) between x and y
• Residual = y − fitted value of y
• One point in the plot is labelled as an outlier
Example 2 – Forbes’ Data
• Data transformation: y = log(atmospheric pressure)
=> Points fall closer to a horizontal line
• One point is labelled as an outlier
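The residual check for Forbes' data can be sketched in R as follows. This is an illustration under an assumption: it uses the copy of the data in `MASS::forbes` (columns `bp` and `pres`), taken here to contain the same 17 readings as the course data, and it uses the natural logarithm for the transformation.

```r
# Sketch only: MASS::forbes (columns bp and pres) is assumed to hold the same
# 17 readings as the data set used in the course.
library(MASS)

fit_raw <- lm(pres ~ bp, data = forbes)       # pressure on boiling point
fit_log <- lm(log(pres) ~ bp, data = forbes)  # log(pressure) on boiling point

# Residual = observed y - fitted value of y
res_raw <- resid(fit_raw)
res_log <- resid(fit_log)

# Residual plots: a curved pattern suggests that a straight line is inadequate
par(mfrow = c(1, 2))
plot(forbes$bp, res_raw, xlab = "Boiling point (F)", ylab = "Residual",
     main = "y = pressure")
abline(h = 0, lty = 2)
plot(forbes$bp, res_log, xlab = "Boiling point (F)", ylab = "Residual",
     main = "y = log(pressure)")
abline(h = 0, lty = 2)

# Index of the observation with the largest absolute residual after transforming
print(which.max(abs(res_log)))
```

Comparing the two panels side by side shows whether the transformation removes the curvature in the residuals.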
Example 4 – Predicting the Weather
Problem of Interest: Can early snowfall (Sep 1st to Dec 31st) predict
late snowfall (Jan 1st to Jun 30th next year) at Fort Collins, Colorado?
• x = early snowfall (in inches), from Sep 1st to Dec 31st
• y = late snowfall (in inches), from Jan 1st to Jun 30th of the next year
• Yearly data from 1900 to 1992 (n = 93)
• Dashed line = fitted regression line
• Solid line = average late-winter snowfall level (slope = 0)
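A plot with both lines can be sketched as below. The data are simulated stand-ins (the variable names `early` and `late` and all numbers are hypothetical), not the Fort Collins record; only the plotting and fitting steps are the point.

```r
# Sketch with simulated stand-in data, NOT the Fort Collins snowfall record.
set.seed(1)
n     <- 93
early <- runif(n, min = 0, max = 50)     # hypothetical early snowfall (inches)
late  <- rnorm(n, mean = 30, sd = 12)    # hypothetical late snowfall, unrelated to early

fit <- lm(late ~ early)

plot(early, late, xlab = "Early snowfall (inches)",
     ylab = "Late snowfall (inches)")
abline(fit, lty = 2)                # dashed line: fitted regression line
abline(h = mean(late), lty = 1)     # solid line: average late snowfall (slope = 0)

# Estimate, standard error and t-test for the slope
print(summary(fit)$coefficients["early", ])
```

If the dashed and solid lines nearly coincide, early snowfall carries little information about late snowfall.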
Example 5 – Turkey Growth
• Record the average weight gain of each turkey pen
• y = average weight gain (in grams) of the turkeys in a pen
• x = dose of amino acid supplement (as a percentage of total diet)
• Circle/triangle/cross: three different types of amino acid supplement in the diet
• Challenges:
• Non-linear relationship between x and y (Ch5 Polynomial Regression)
• Including the type of amino acid (a categorical variable) in the regression model (Ch5 Dummy Variables)
Section 1.3
Installation of R and R Libraries
Installation of R
• Datasets from the textbook (and this course) are available in the R libraries “car” and “alr4”
• These require R version 3.5.0 or higher (current version: 4.1.1)
1. Go to http://cran.r-project.org/bin/windows/base/ and choose “Download R 4.1.1 for Windows”
Mac OS X: download R-4.1.1.pkg from https://cran.r-project.org/bin/macosx/
2. Run the .exe file to install R. The default installation folder is “C:\program files\R\R-4.1.1\”
Installation of R Libraries (car and alr4)
• Most data sets in this course are stored in the “alr4” library
• Installing the “alr4” library is messy because it depends on many other libraries, for example curl, rio and data.table
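One way to handle the installation is to install only what is missing. The sketch below is a suggestion rather than the course's official procedure; `missing_packages()` and `install_course_packages()` are hypothetical helper names introduced here.

```r
# Sketch: install car and alr4 only if they are not installed yet.
# Both helper functions are hypothetical, not part of R or of the course.
missing_packages <- function(pkgs) {
  # Package names in pkgs that are not found among the installed packages
  setdiff(pkgs, rownames(installed.packages()))
}

install_course_packages <- function(pkgs = c("car", "alr4")) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    # By default install.packages() also installs the packages these depend on
    install.packages(to_install)
  }
  invisible(to_install)
}

# Run once, with an internet connection:
# install_course_packages()
# Afterwards, load the libraries in each new R session:
# library(car); library(alr4)
```

The installation call is left commented out so that the block can be sourced without triggering a download.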
Section 1.4
Mean and Variance Functions
Mean and Variance Functions
• Consider data {(xi, yi), i = 1, 2, … ,n}
• x is called the Explanatory Variable (EV) [also called the
Predictor or Independent Variable]
• y is called the Response Variable (RV) [also called the
Dependent Variable]
• Model assumptions when setting up a regression model:
1. Mean function $E(Y \mid X = x)$
2. Variance function $\mathrm{Var}(Y \mid X = x)$
• The mean function is the expected value of the response when the explanatory variable X = x:
$E(Y \mid X = x) = f(x)$
• Linear regression: $f(x) = a + bx$
• Quadratic regression: $f(x) = a + b_1 x + b_2 x^2$
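Both mean functions can be fitted with `lm()`. The sketch below uses simulated data with made-up coefficients, so the numbers carry no meaning beyond illustration.

```r
# Sketch with simulated data; the true coefficients below are made up.
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(length(x), sd = 2)

fit_lin  <- lm(y ~ x)            # mean function f(x) = a + b x
fit_quad <- lm(y ~ x + I(x^2))   # mean function f(x) = a + b1 x + b2 x^2

print(coef(fit_lin))
print(coef(fit_quad))

# F-test comparing the two nested mean functions
print(anova(fit_lin, fit_quad))
```

`I(x^2)` is needed because `^` has a special meaning inside an R model formula.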
Mean Function – Inheritance of Height
Example: Inheritance of Height (Mother’s height vs Daughter’s height)
[Figure: scatterplot of daughter's height against mother's height, with the lines y = x and y = ax + b]
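Example: Turkey Growth (average weight gain vs dose of amino acid supplement)
$E(\text{Growth} \mid \text{Dose} = x) = \beta_0 + \beta_1 [1 - \exp(-\beta_2 x)]$
• Interpretation of parameters
• $\beta_0$: baseline growth (i.e. growth without amino acid supplement)
• $\beta_1$: maximum effect of the amino acid, with $y \to \beta_0 + \beta_1$ as $x \to \infty$
• $\beta_2$: the rate at which growth approaches its maximum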
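The roles of the three parameters can be seen by evaluating the mean function directly. In the sketch below, `growth_mean()` is a hypothetical helper and the parameter values are invented for illustration; they are not estimates from the turkey data.

```r
# Sketch: evaluate the non-linear mean function for hypothetical parameters.
growth_mean <- function(x, b0, b1, b2) {
  b0 + b1 * (1 - exp(-b2 * x))
}

b0 <- 620   # hypothetical baseline growth (grams)
b1 <- 180   # hypothetical maximum extra growth (grams)
b2 <- 8     # hypothetical rate parameter

dose <- c(0, 0.1, 0.2, 0.4, 100)
print(data.frame(dose = dose, mean_growth = growth_mean(dose, b0, b1, b2)))

# At dose 0 the mean equals b0; for a very large dose it approaches b0 + b1
at_zero  <- growth_mean(0, b0, b1, b2)
at_large <- growth_mean(100, b0, b1, b2)
```

A larger `b2` makes the curve reach the plateau `b0 + b1` at smaller doses.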
Variance Function
• The variance function is assumed to be constant (and usually unknown) throughout the course, i.e. constant variability:
$\mathrm{Var}(Y \mid X = x) = \sigma^2$
Section 1.5
Separated Points
Four Hypothetical Data Sets
• [Textbook Table 1.1] Four different data sets {(xi, yi), i = 1, 2, …, 11}
• All four share the same summary statistics: $\{\bar{x}, \bar{y}, s_x^2, s_y^2, s_{xy}\}$
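The shared summary statistics can be verified numerically. The sketch below assumes that the four data sets are Anscombe's quartet, which ships with base R as the data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`).

```r
# Sketch: summary statistics for four data sets, assuming they are
# Anscombe's quartet (data frame `anscombe` in base R).
stats <- sapply(1:4, function(k) {
  x <- anscombe[[paste0("x", k)]]
  y <- anscombe[[paste0("y", k)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y), cov_xy = cov(x, y))
})
colnames(stats) <- paste0("set", 1:4)

print(round(stats, 2))

# Plot all four: near-identical summaries, very different pictures
par(mfrow = c(2, 2))
for (k in 1:4) {
  plot(anscombe[[paste0("x", k)]], anscombe[[paste0("y", k)]],
       xlab = "x", ylab = "y", main = paste("Data set", k))
}
```

The columns of the printed table agree to two decimal places or nearly so, which is why the summary statistics alone cannot distinguish the four data sets.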
Separated Points
• Horizontal: leverage point (a point separated from the others in the x direction)
The location of a leverage point (x, y) has a greater impact on the regression line (i.e. more leverage) than the other points
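Leverage can also be quantified. The sketch below uses hat values from `hatvalues()`, a standard numerical measure of leverage that is not introduced on these slides, and applies it to the fourth data set of Anscombe's quartet (`x4`, `y4` in the base R data frame `anscombe`), chosen here as an assumed illustration.

```r
# Sketch: hat values as a numerical measure of leverage (an addition for
# illustration; the slides describe leverage only graphically).
fit_all <- lm(y4 ~ x4, data = anscombe)
h <- hatvalues(fit_all)
print(round(h, 3))

# The point that is far from the others in the x direction
lev_index <- unname(which.max(h))
print(anscombe[lev_index, c("x4", "y4")])

# Without that point every x4 value is identical, so no slope can be estimated
fit_without <- lm(y4 ~ x4, data = anscombe[-lev_index, ])
print(coef(fit_all))
print(coef(fit_without))   # the slope is reported as NA
```

In this extreme case the single separated point determines the slope of the fitted line entirely.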
Section 1.6
Scatterplot
Why Scatterplot?
• A scatterplot uses Cartesian (x-y) coordinates to display the values of two variables.
• A scatterplot helps to identify the following:
1. the mean function
2. the variance function
3. separated points
Example: Inheritance of Height Data
1. Mean function: linear
2. Variance function: constant
3. No separated points
Null Plot
• A null plot is a scatterplot with
1. constant mean function (slope = 0)
2. constant variance function
3. no separated points
Example: Snowfall data
Generate Scatterplot Matrix in R – the “pairs” Function
Example: Generate the scatterplot matrix for the fuel data
library(car); library(alr4)                 # fuel2001 comes with alr4
Tax <- fuel2001$Tax                         # Gasoline state tax rate
Dlic <- fuel2001$Drivers / fuel2001$Pop     # No. of drivers / population over age 16
Income <- fuel2001$Income                   # Personal income
logMiles <- log(fuel2001$Miles, 2)          # Log (base 2) of total length of highway
Fuel <- fuel2001$FuelC / fuel2001$Pop       # Amount of gasoline sold per population over age 16
# Bind the 5 objects (by columns) into a matrix with 5 columns
Data <- cbind(Tax, Dlic, Income, logMiles, Fuel)
pairs(Data)                                 # Generate the scatterplot matrix of Data
Correlation Heatmap
Correlation heatmap: a graphical illustration of the correlation matrix
• A quick and dirty way to summarize the linear association (i.e. correlation) between each pair of variables
• Works particularly well for data with A LOT of variables
• Cannot show whether a relationship is linear or whether there are separated points
install.packages("corrplot")     # Run once to install the package
library(corrplot)
par(mfrow = c(1, 1))             # Set the graphical screen to 1 row & 1 column
M <- cor(Data)                   # Compute the correlation matrix
corrplot(M, method = "color")    # Heatmap of M