Chapter 1 Introduction of Regression

This document introduces regression analysis and provides examples of its applications. It outlines the chapters of the course, which cover topics like linear regression, model fitting, residual analysis, and including categorical variables in regression models. Examples use real data to demonstrate predicting variables like height, weather patterns, and animal growth.

Chapter 1 Introduction

STAT 3008 Applied Regression Analysis

Department of Statistics
The Chinese University of Hong Kong

2021/22 Term 1

Dr. LEE Pak Kuen, Philip

Chapter Outline
• Section 1.1: Motivation
• Section 1.2: Five Examples on Regression
• Section 1.3: Installation of R and R libraries
• Section 1.4: Mean and Variance Functions
• Section 1.5: Separated Points
• Section 1.6: Scatterplot

Section 1.1
Motivation

Motivation: Example
• Problem of Interest: Predict the overall GPA of students at CUHK
• Methodology:
1. Select a random sample of students who graduated from CUHK
2. Record the following for each student:
• Overall GPA
• Student attributes, e.g. IQ, AL results, major, gender, etc.
3. Use the above information to predict the overall GPA of
current students

Motivation: Example

• Points not exactly on a straight line, why?


• How to use a mathematical model to relate Y (GPA) and X (IQ)?
Linear Regression in a Page
Steps:
1. Select a random sample of students who graduated from CUHK
2. Record y (GPA) and x (IQ)
3. Plot (x, y) on a scatterplot
4. Find the straight-line equation that fits the data points best,
e.g. GPA = 2.00 + 0.01(IQ)
5. Predict the GPA using a new student’s IQ

Regression studies the dependency between the
Explanatory Variables (X) and the Response Variable (Y)
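
The five steps above can be sketched in R as follows. This is a minimal illustration only: the IQ and GPA values are simulated (hypothetical), generated around the line GPA = 2.00 + 0.01(IQ) quoted on the slide, and the variable names `iq`, `gpa` and `new_student` are made up for the example.

```r
# Hypothetical data: simulated for illustration, not a real CUHK sample
set.seed(1)
iq  <- round(rnorm(50, mean = 110, sd = 12))              # Step 2: record x (IQ)
gpa <- 2.00 + 0.01 * iq + rnorm(50, mean = 0, sd = 0.2)   # Step 2: record y (GPA)

fit <- lm(gpa ~ iq)        # Step 4: least-squares straight line
coef(fit)                  # Estimated intercept a and slope b

# Step 5: predict the GPA of a new student with IQ = 120
new_student <- data.frame(iq = 120)
pred <- predict(fit, newdata = new_student)
pred
```

Step 3 (the scatterplot) would be `plot(iq, gpa)` followed by `abline(fit)` to overlay the fitted line.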
Linear Regression Y = a +bX
Regression studies the dependency between the
Explanatory Variables (X) and Response Variable (Y)
• Explanatory Variable (EV) X: Also known as predictor, or
independent variable
• Response Variable (RV) Y: Also known as dependent variable
Linear Regression – Typical Problems of Interest
• Obtain the best estimates of the regression line (i.e. the
intercept a and the slope b).
• Predict the value of the RV, based on a new set of EVs.
• Identify the EVs which are important to explain the RV.
• Is the regression line good enough to explain the data? If not,
how can we extend the regression line to a more complicated
model?
Section 1.2
Five Examples on Regression

Examples on Regression
• Next few pages: Examples with data available in R (alr4 library)
• Messages from the examples:

• Some of these examples will be covered in detail in later chapters
Example 1 – Inheritance of Height
Problem of interest: Want to study how the Daughter’s height is
affected by the Mother’s height

Data: n = 1,375 families


x = Heights (in inches) of
mothers in the UK under age
65 (Mheight)
y = Heights (in inches) of one
of their adult daughters over
age 18 (Dheight)

Question: Can we interchange x and y?


Example 1 – Inheritance of Height
x = Heights (in inches) of mothers in the UK under age 65 (Mheight)
y = Heights (in inches) of one of their adult daughters over age 18 (Dheight)
Findings from the Scatterplot:
• Dheight increases with Mheight
• The two variables are of similar range
(55-70 inches)
• The points appear to form an
elliptical region*
=> Linear regression would make
sense
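* (STAT2001): The joint pdf of a bivariate normal distribution has elliptical contours:
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix} \right) \;\Rightarrow\; Y \mid X = x \sim N\left( \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x),\; (1-\rho^2)\sigma_y^2 \right)$$
Given Mheight = x (inches), Dheight is normally distributed with a mean that is linear in x and a constant
variance $(1-\rho^2)\sigma_y^2$.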
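
The conditional-distribution result can be checked numerically. The sketch below simulates a bivariate normal sample in base R and compares the fitted regression slope and residual variance with the theoretical values ρσ_y/σ_x and (1 − ρ²)σ_y². All parameter values (`mu_x`, `mu_y`, `sigma_x`, `sigma_y`, `rho`) are hypothetical choices for illustration, not estimates from the heights data.

```r
# Hypothetical parameters, chosen only to resemble heights in inches
set.seed(7)
n <- 10000
mu_x <- 62.5; mu_y <- 63.8
sigma_x <- 2.4; sigma_y <- 2.6; rho <- 0.5

# Build correlated normals from two independent standard normals
z1 <- rnorm(n); z2 <- rnorm(n)
x <- mu_x + sigma_x * z1
y <- mu_y + sigma_y * (rho * z1 + sqrt(1 - rho^2) * z2)

fit <- lm(y ~ x)

# Slope of E(Y | X = x) versus its theoretical value rho * sigma_y / sigma_x
c(slope_fitted = unname(coef(fit)[2]), slope_theory = rho * sigma_y / sigma_x)

# Residual variance versus the theoretical conditional variance
c(var_fitted = summary(fit)$sigma^2, var_theory = (1 - rho^2) * sigma_y^2)
```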
Example 2 – Forbes’ Data
• The barometer (氣壓計) was a fragile instrument for measuring
atmospheric pressure in the 1850s.
• James D. Forbes (1857): Use the boiling point of water, which
can be measured more reliably with a thermometer, as a
substitute for measuring atmospheric pressure directly
• At 17 different locations in the Alps and Scotland, he
measured
• the pressure (in inches of mercury) using a barometer,
and
• the boiling point of water (in °F)
• Question: Does the boiling point of water vary with
atmospheric pressure in a linear way?
Example 2 – Forbes’ Data

• High Altitude: Low Atmospheric Pressure and Low Boiling Point of Water
• Low Altitude: High Atmospheric Pressure and High Boiling Point of Water
Example 2 – Forbes’ Data
• x = boiling point of water (in Fahrenheit)
y = atmospheric pressure (in inches of mercury)
• Residual Plot on the Right: Presence of systematic error
(quadratic relationship?) between x and y
Residual = y – “Fitted Value of y”

Outlier

Example 2 – Forbes’ Data
• Data Transformation: y = log(atmospheric pressure)
=> Points fall closer to a horizontal line

Outlier

• General Procedure:
1. Understand the data (via the scatterplot)
2. Fit a linear regression to the data (Ch2-3)
3. Understand the residuals based on Residual Analysis (Ch7) and
make the necessary Data Transformation (Ch8)
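
The fit, inspect residuals, transform cycle can be sketched as below. The data are simulated (hypothetical) so that log(pressure) is linear in temperature; they are not Forbes' measurements. The curvature check used here, the correlation between the residuals and the squared centred temperature, is just one simple way to quantify a systematic quadratic pattern and is an assumption of this sketch rather than a method from the course.

```r
# Hypothetical data in which log(pressure) is linear in temperature
set.seed(2)
temp     <- seq(194, 212, length.out = 17)                        # boiling point (°F)
pressure <- exp(-0.97 + 0.0206 * temp + rnorm(17, sd = 0.001))   # inches of mercury

fit_raw <- lm(pressure ~ temp)        # Regression on the original scale
fit_log <- lm(log(pressure) ~ temp)   # Regression after the log transformation

# A strong correlation with the squared centred x suggests a quadratic pattern
curv <- (temp - mean(temp))^2
cor_raw <- cor(resid(fit_raw), curv)
cor_log <- cor(resid(fit_log), curv)
c(raw = cor_raw, log = cor_log)
```

Plotting `resid(fit_raw)` and `resid(fit_log)` against `temp` gives the residual plots that the procedure asks us to inspect.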
Example 3 – Length at Age for Smallmouth Bass
• Background: Smallmouth Bass (小
嘴鱸魚) is a popular game fish in
North America
• Problem of interest: To avoid excessive fishing, we would like to
impose fishing regulations to protect young smallmouth bass
(based on their length)
• Want to study the growth pattern (age vs length) of the fish:
• y = Length of smallmouth bass at capture (in mm)
• x = Age of smallmouth bass at capture (in years)
Linear relationship between length and age
Example 3 – Length at Age for Smallmouth Bass
• Dashed line: Connects the average length of fish at each age group,
i.e. the sample mean length of fish at age i, for i = 1, 2, …, 8
Need 8 numbers to summarize the locations (i.e. 1st moment) of the
data

• Solid line: Regression line y = a + bx
Only 2 numbers are required to relate the 8 locations of length by age
=> Regression provides a simpler model for the data
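
The contrast between the two summaries can be seen in a few lines of R. The ages and lengths below are simulated (hypothetical), not the smallmouth bass data from the textbook; the point is only that `tapply` returns 8 numbers while `lm` returns 2.

```r
# Hypothetical data: 50 fish at each age from 1 to 8
set.seed(3)
age <- rep(1:8, each = 50)                       # Age at capture (years)
len <- 65 + 30 * age + rnorm(400, sd = 25)       # Length at capture (mm)

group_means <- tapply(len, age, mean)   # Dashed line: one sample mean per age
fit <- lm(len ~ age)                     # Solid line: intercept a and slope b

length(group_means)   # Number of values needed by the group-mean summary
length(coef(fit))     # Number of values needed by the regression line

# Compare the two summaries at each age
fitted_line <- predict(fit, newdata = data.frame(age = 1:8))
cbind(group_mean = group_means, fitted_line = fitted_line)
```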
Example 4 – Predicting the Weather
Problem of Interest: Can early snowfall (Sep 1st to Dec 31st) predict
late snowfall (Jan 1st to Jun 30th next year) at Fort Collins, Colorado?

• Money magazine's Best Place to Live in the U.S. in 2006
• One of the towns that inspired the design of
Main Street, U.S.A. inside the main entrance
of the many 'Disneyland'-style parks
Example 4 – Predicting the Weather
Problem of Interest: Can early snowfall (Sep 1st to Dec 31st) predict
late snowfall (Jan 1st to Jun 30th next year) at Fort Collins, Colorado?
• x = Early Snowfall (in inches)
from Sep 1st to Dec 31st
• y = Late Snowfall (in inches)
from Jan 1st to Jun 30th next year
• Yearly data from 1900 to 1992
(n=93)
• Dashed line = Fitted Regression line
• Solid line = Average Late Winter Snowfall level (with slope = 0)

• “Can Early Snowfall predict Late Snowfall?”
Hypothesis Testing: Is the slope significantly different from 0?
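
A sketch of how this slope test might look in R is given below. The snowfall values are simulated (hypothetical) with no true relationship between early and late snowfall; they are not the Fort Collins records. The t-test on the slope from `summary()` is equivalent to comparing the regression line (dashed) with the constant-mean model (solid) using `anova()`.

```r
# Hypothetical data: 93 years, late snowfall unrelated to early snowfall
set.seed(4)
early <- runif(93, min = 0, max = 50)       # Early snowfall (inches)
late  <- 30 + rnorm(93, sd = 12)            # Late snowfall (inches)

fit  <- lm(late ~ early)    # Regression line
fit0 <- lm(late ~ 1)        # Constant mean (slope = 0)

# Estimate, standard error, t statistic and p-value for the slope
slope_row <- summary(fit)$coefficients["early", ]
slope_row

# F-test comparing the two models; same p-value as the t-test above
cmp <- anova(fit0, fit)
cmp
```

A small p-value would suggest that the slope differs from 0, i.e. that early snowfall helps to predict late snowfall.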
Example 5 – Turkey Growth
• A farmer would like to increase the yield of turkeys (火雞) through
the use of amino acids => How is the weight gain of turkeys affected by
(1) Type of amino acid supplement (A Categorical Variable!!)
(2) Amount of amino acid supplement (% of total diet)
[Table: rows = % of amino acid in diet (4%, 10%); columns = Amino Acid #1, Amino Acid #2, Amino Acid #3]
• Record the average weight gain of each turkey pen (欄)
Example 5 – Turkey Growth
• y = Average weight gain (in grams) of turkeys in a pen
• x = Dose of Amino Acid Supplement (as a percentage of total diet)

• Circle/Triangle/Cross: 3 different types of amino acid
supplement in their diet

• Challenges:
• Non-linear relationship between x and y (Ch5 Polynomial
Regression)
• Inclusion of the type of amino acid (a categorical variable) in
the regression model (Ch5 Dummy Variables)
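
Both challenges can be handled within `lm`. The sketch below uses simulated (hypothetical) doses and weight gains, not the turkey data; the names `dose`, `type` and `gain` are made up. A quadratic term is added with `I(dose^2)`, and wrapping the supplement type in `factor()` makes R create the dummy variables automatically.

```r
# Hypothetical data: 5 doses for each of 3 supplement types
set.seed(5)
dose <- rep(c(0.04, 0.10, 0.16, 0.22, 0.28), times = 3)
type <- factor(rep(c("A1", "A2", "A3"), each = 5))
type_effect <- c(0, 15, 30)[as.integer(type)]       # Shift in gain for each type
gain <- 620 + 800 * dose - 1200 * dose^2 + type_effect + rnorm(15, sd = 5)

# Quadratic in dose plus dummy variables for type (A1 is the baseline)
fit <- lm(gain ~ dose + I(dose^2) + type)
coef(fit)
```

The coefficients named `typeA2` and `typeA3` are the estimated differences from the baseline type A1.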
Section 1.3
Installation of R and R Libraries

Installation of R
• Datasets from the textbook (and this course) are available in the R
libraries “car” and “alr4”
• Requires R version 3.5.0 or higher (current version: 4.1.1)

Installation of R
1. Go to http://cran.r-project.org/bin/windows/base/ and “Download
R 4.1.1 for Windows”
Mac OS X: R-4.1.1.pkg from https://cran.r-project.org/bin/macosx/
2. Run the .exe file to install R. The default folder of the software is
“C:\program files\R\R-4.1.1\”
Installation of R Libraries (car and alr4)
• Most data sets in this course are stored in the “alr4” library
• The installation of the “alr4” library is messy because it depends on a lot of
other libraries, including:
carData, car, curl, rio, data.table, haven, Rcpp, forcats, …

### The following R code is available in RcodeCh1.r on Blackboard ###

### Make sure you are not using super version of R ###

# Install the alr4 library and the libraries it depends on (would take 5-10 mins!)
update.packages(repos="http://cran.rstudio.com/", checkBuilt=TRUE, ask=FALSE)
install.packages(c("carData","car","effects","rio","curl","data.table","haven","Rcpp",
"forcats","magrittr","hms","rlang","vctrs","zeallot","backports","pkgconfig","tibble",
"pillar","crayon","openxlsx","alr4"), repos="http://cran.rstudio.com/", dependencies=TRUE)

# Test if the alr4 library works, using the Forbes data (2nd example earlier)
library(car) # Load the car package
library(alr4) # Load the alr4 package for Forbes data below
Temperature<-Forbes[,1] # Object “Temperature” from the 1st column of “Forbes” data
Pressure<-Forbes[,2] # Object “Pressure” from the 2nd column of “Forbes” data
par(mfrow=c(1,2)) # Set the graphical screen to 1 row & 2 columns
plot(Temperature,Pressure) # Scatterplot of Temperature (x) and Pressure (y)
fit0<-lm(Pressure~Temperature) # Create linear regression (lm) object (fit0)
abline(fit0) # Draw the regression line
Residuals<-fit0$residuals # Object “Residuals” extracted from the object “fit0”
plot(Temperature,Residuals) # Scatterplot of Temperature (x) and Residuals (y)
abline(h=0,lty=2) # Draw a horizontal line at 0 using a dashed line (line type = 2)
Installation of R: Step-by-Step
1) In R x64 (4.1.1) or R i386 (4.1.1), File (from Menu Bar) -> New Window =>
A window called the R Editor is created at the bottom right
2) Copy the R code from the previous page to the R Editor
[Alternative to steps 1) and 2): File (from Menu Bar) -> Open Script and choose
the RcodeCh1.r file you downloaded from the course Blackboard]
3) Select All (Ctrl+A), then Run line or selection (Ctrl+R). The code will be
executed in the R console on the left
Section 1.4
Mean and Variance Functions

Mean and Variance Functions
• Consider data {(xi, yi), i = 1, 2, … ,n}
• x is called the Explanatory Variable (EV) [also called the
Predictor or Independent Variable]
• y is called the Response Variable (RV) [also called the
Dependent Variable]
• Model assumptions when setting up a regression model:
1. Mean Function E(Y | X = x)
2. Variance Function Var(Y | X = x)
• The Mean Function is the expected value of the response when the
explanatory variable X = x:

E(Y | X = x) = f(x)
• Linear Regression: f(x) = a + bx
• Quadratic Regression: f(x) = a + b1 x + b2 x^2
Mean Function – Inheritance of Height
Example: Inheritance of Height (Mother’s height vs Daughter’s height)
[Figure: scatterplot of Dheight against Mheight with the line y = x and the fitted regression line y = a + bx]

• Mean Function E(Dheight | Mheight = x) = a + bx,
where a (intercept) and b (slope) are the parameters of the linear regression
• b < 1, with E(Dheight | Mheight = 70) = 68 (Why?)
Mean Function – Turkey Data
• Possible (non-linear)
mean function:

E(Growth| Dose = x)
= β0 + β1 [1-exp(- β2 x)]

• Interpretation of Parameters
• β0: Baseline growth (i.e. growth without amino acid supplement)
• β1: Max. effect of amino acid, with y → β0 + β1 as x → ∞
• β2: The speed at which the maximum growth is approached
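
The interpretation of the three parameters can be verified by evaluating the mean function directly. In the sketch below, the function name `mean_growth` and the parameter values are hypothetical, chosen only to illustrate the shape of β0 + β1 [1 − exp(−β2 x)].

```r
# Mean function E(Growth | Dose = x) = beta0 + beta1 * (1 - exp(-beta2 * x))
mean_growth <- function(x, beta0, beta1, beta2) {
  beta0 + beta1 * (1 - exp(-beta2 * x))
}

# Hypothetical parameter values for illustration
b0 <- 620; b1 <- 180; b2 <- 8

mean_growth(0, b0, b1, b2)                    # At x = 0 the mean equals beta0
mean_growth(c(0.1, 0.5, 1, 5), b0, b1, b2)    # Increases towards beta0 + beta1

# A larger beta2 reaches the maximum faster at the same dose
mean_growth(0.1, b0, b1, beta2 = 16) > mean_growth(0.1, b0, b1, beta2 = 8)
```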
Variance Function
• Variance Function is assumed to be constant (mostly unknown)
throughout the course:
Var(Y | X = x) = σ² (constant variability)

That is, the variance of the response is THE SAME for all values of x
Why constant variance?
Because it gives the estimators good statistical properties

• Example: Heights Data
• Var(Y | X = x) = σ²
• Scatterplot: The variability of y is approximately the same
across different values of x
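
Besides looking at the scatterplot, one rough numerical check of constant variance is to compare the residual variance across ranges of x. The sketch below does this on simulated (hypothetical) height-like data generated with a constant standard deviation; the three bins and their cut points are arbitrary choices for illustration, not a procedure from the course.

```r
# Hypothetical data with constant variance by construction
set.seed(6)
mheight <- runif(2000, min = 55, max = 70)
dheight <- 30 + 0.54 * mheight + rnorm(2000, sd = 2.3)

fit <- lm(dheight ~ mheight)

# Split x into three ranges and compute the residual variance within each
bins <- cut(mheight, breaks = c(55, 60, 65, 70), include.lowest = TRUE)
bin_var <- tapply(resid(fit), bins, var)
bin_var
```

Similar values across the bins are consistent with Var(Y | X = x) = σ²; values that grow or shrink with x would cast doubt on the assumption.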
Reasonable Assumption on Constant Variance?

Section 1.5
Separated Points

Four Hypothetical Data Sets
• [Textbook Table 1.1] 4 different data sets {(xi, yi), i = 1, 2, …, 11}

Same summary statistics: {x̄, ȳ, s_x², s_y², s_xy}

• (Ch2) Estimates of y = a + bx depend on the 5 summary
statistics only => Same regression line for the 4 data sets
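
This can be checked in R, assuming the four data sets are Anscombe's quartet, which ships with base R as the data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`, 11 rows). That assumption is not confirmed by the slide itself; the sketch simply shows four data sets with near-identical summary statistics and fitted lines.

```r
# The five summary statistics for each of the four data sets
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y), cov_xy = cov(x, y))
})
round(stats, 2)

# Fitted intercept (a) and slope (b) for each data set
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("a", "b")
round(coefs, 2)
```

Running `plot(anscombe$x1, anscombe$y1)` and its counterparts shows how different the four scatterplots look despite the matching numbers.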
Four Hypothetical Data Sets

Conclusions from the above:
1. Dependence is not limited to E(Y|X) = a + bX (e.g. it can be polynomial).
2. Summary statistics (i.e. a, b from the regression line) may not be a good
summary of dependence
=> Should first understand the data graphically (e.g. scatterplot)
before fitting a regression line
Separated Points
• Separated Points: Points are well separated from the other points,
either horizontally or vertically
• Does the presence of separated points affect the regression line?

Separated Points
• Horizontal: Leverage point (i.e. leverage effect on the line)
The location of the leverage point (x, y) has a higher impact on the
regression line (i.e. leverage) than the other points

• Vertical: Outlier (i.e. lies outside the line)
An outlier typically does not affect the regression line much
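
The difference between the two kinds of separated points can be illustrated with a small constructed (hypothetical) data set. In this sketch the outlier is deliberately placed at the mean of x, where it leaves the slope unchanged and only shifts the intercept; an outlier elsewhere would move the slope somewhat. The helper `slope` is made up for the example.

```r
# Constructed data lying close to the line y = 2 + 0.5x
x <- 1:10
y <- 2 + 0.5 * x + c(0.2, -0.1, 0.1, -0.2, 0, 0.1, -0.1, 0.2, -0.2, 0)

# Helper returning the fitted slope of a simple linear regression
slope <- function(x, y) unname(coef(lm(y ~ x))[2])

slope_base <- slope(x, y)

# Outlier: vertically separated point, placed at the mean of x
slope_outlier <- slope(c(x, mean(x)), c(y, 10))

# Leverage point: horizontally separated point that is off the trend
slope_leverage <- slope(c(x, 30), c(y, 2))

c(base = slope_base, outlier = slope_outlier, leverage = slope_leverage)
```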
Section 1.6
Scatterplot

Why Scatterplot?
• A scatterplot uses Cartesian (x-y) coordinates to display values
of two variables.
• A scatterplot is able to identify the following:
1. the mean function
2. the variance function
3. separated points

Inheritance of Height Data:
1. Mean function: Linear
2. Variance function: Constant
3. No separated point
Null Plot
• A null plot is a scatterplot with
1. constant mean function (slope = 0)
2. constant variance function
3. no separated point

Snowfall data

Null plot on the residuals => Linear Regression is a reasonable
model for the data
Scatterplot Matrix
What to do if there are more than 2 variables?
Answer: Draw a scatterplot for EACH PAIR of variables => Scatterplot Matrix
• Only the marginal relationship between two variables is observed.
• Joint relationship (e.g. interaction of 3 or more variables)??
Example: Fuel Consumption Data
Problem: Understand how fuel consumption varies over the 50 states in the US, and
understand the effect of the state gasoline tax on fuel consumption.
• Fuel (y) – Gasoline (in thousands of gallons) sold for road use per
person aged 16+
• Tax (x1) – State gasoline tax rate (cents per gallon)
• Dlic (x2) – 1,000 × (# of licensed drivers / population aged 16+) in that state
• Income (x3) – Personal income (in US$1,000)
• logMiles (x4) – Log (total length of highway [in miles] in that state)
Scatterplot Matrix

Generate Scatterplot Matrix in R – the “pairs” Function
Example: Generate the Scatterplot Matrix for the Fuel data
library(car); library(alr4)
Tax<-fuel2001$Tax # Gasoline state tax rate
Dlic<-fuel2001$Drivers/fuel2001$Pop # No. of drivers / population over age 16
Income<-fuel2001$Income # Personal income
logMiles<-log(fuel2001$Miles,2) # Log (base 2) of total length of highway
Fuel<-fuel2001$FuelC/fuel2001$Pop # Amount of gasoline sold per population over age 16
# Bind the 5 objects (by columns) into a matrix of 5 columns
Data<-cbind(Tax,Dlic,Income,logMiles,Fuel)
pairs(Data) # Generate the Scatterplot Matrix of “Data”

Generate a single scatterplot: Use the “plot” function
plot(logMiles, Fuel)
Correlation Heatmap
Correlation Heatmap: Graphical illustration of the correlation matrix
• Quick and dirty way to summarize the linear association (i.e.
correlation) between each pair of variables
• Works particularly well for data with A LOT of variables
• Unable to visualize the linearity and possible separated points

install.packages("corrplot")
library(corrplot)
par(mfrow=c(1,1)) # Set the graphical screen to 1 row & 1 column
M <- cor(Data) # Compute the correlation matrix of “Data” (from the scatterplot matrix example)
corrplot(M, method = "color") # Heatmap of M