Chapter 1 Introduction of Regression

This document introduces regression analysis and provides examples of its applications. It outlines the chapters of the course, which cover topics like linear regression, model fitting, residual analysis, and including categorical variables in regression models. Examples use real data to demonstrate predicting variables like height, weather patterns, and animal growth.

Chapter 1 Introduction

STAT 3008 Applied Regression Analysis

Department of Statistics
The Chinese University of Hong Kong

2021/22 Term 1

Dr. LEE Pak Kuen, Philip

Chapter Outline
• Section 1.1: Motivation
• Section 1.2: Five Examples on Regression
• Section 1.3: Installation of R and R libraries
• Section 1.4: Mean and Variance Functions
• Section 1.5: Separated Points
• Section 1.6: Scatterplot

Section 1.1
Motivation

Motivation: Example
• Problem of interest: Predict the overall GPA of students at CUHK
• Methodology:
1. Select a random sample of students who graduated from CUHK
2. Record the following for each student:
• Overall GPA
• Student characteristics, e.g. IQ, AL results, major, gender, etc.
3. Use the above information to predict the overall GPA of current students

Motivation: Example

• The points do not lie exactly on a straight line. Why?

• How can a mathematical model relate Y (GPA) to X (IQ)?
Linear Regression in a Page
Steps:
1. Select a random sample of students who graduated from CUHK
2. Record y (GPA) and x (IQ)
3. Plot (x, y) on a scatterplot
4. Find the straight-line equation that best fits the data points, e.g. GPA = 2.00 + 0.01(IQ)
5. Predict the GPA from a new student's IQ

Regression studies the dependency between the Explanatory Variables (X) and the Response Variable (Y)
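The five steps can be sketched in R as follows. This is only a minimal illustration: the data are simulated, and the sample size, the IQ distribution, and the noise level are assumptions chosen so that the points scatter around the line GPA = 2.00 + 0.01(IQ) quoted on the slide.

```r
# Minimal sketch with simulated (hypothetical) data, not real CUHK records
set.seed(1)
n   <- 50                                      # assumed sample size
iq  <- round(rnorm(n, mean = 110, sd = 10))    # hypothetical IQ scores (x)
gpa <- 2.00 + 0.01 * iq + rnorm(n, sd = 0.2)   # hypothetical GPA (y) scattered around the line

plot(iq, gpa, xlab = "IQ", ylab = "GPA")       # Step 3: scatterplot
fit <- lm(gpa ~ iq)                            # Step 4: best-fitting straight line
abline(fit)                                    # add the fitted line to the plot
coef(fit)                                      # estimated intercept and slope

new_student <- data.frame(iq = 120)            # Step 5: a new student's IQ
pred <- predict(fit, newdata = new_student)    # predicted GPA for that student
pred
```

The fitted intercept and slope will not equal 2.00 and 0.01 exactly, because of the random noise; that is also why the points do not fall exactly on a straight line.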
Linear Regression: Y = a + bX
Regression studies the dependency between the Explanatory Variables (X) and the Response Variable (Y)
• Explanatory Variable (EV) X: Also known as the predictor or independent variable
• Response Variable (RV) Y: Also known as the dependent variable
Linear Regression: Typical Problems of Interest
• Obtain the best estimates of the parameters of the regression line (i.e. the intercept a and the slope b).
• Predict the value of the RV, based on a new set of EVs.
• Identify the EVs which are important in explaining the RV.
• Is the regression line good enough to explain the data? If not, how can we extend the regression line to a more complicated model?
Section 1.2
Five Examples on Regression

Examples on Regression
• Next few pages: Examples with data available in R (alr4 library)
• Messages from the examples:

• Some of these examples will be covered in detail in later chapters
Example 1 – Inheritance of Height
Problem of interest: Study how the daughter's height is affected by the mother's height

Data: n = 1,375 families

x = Heights (in inches) of mothers in the UK under age 65 (Mheight)
y = Heights (in inches) of one of their adult daughters over age 18 (Dheight)

Question: Can we interchange x and y?

Example 1 – Inheritance of Height
x = Heights (in inches) of mothers in the UK under age 65 (Mheight)
y = Heights (in inches) of one of their adult daughters over age 18 (Dheight)
Findings from the Scatterplot:
• Dheight increases with Mheight
• The two variables have a similar range (55-70 inches)
• The points appear to form an
elliptical region*
=> Linear regression would make
sense
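* (STAT2001): The joint pdf of a bivariate normal distribution has elliptical contours:
\[
\begin{pmatrix} X \\ Y \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix} \right)
\;\Rightarrow\;
Y \mid X = x \sim N\left( \mu_y + \rho\frac{\sigma_y}{\sigma_x}(x - \mu_x),\; (1-\rho^2)\sigma_y^2 \right)
\]
Given Mheight = x (inches), Dheight is normally distributed with mean linear in x and constant variance $(1-\rho^2)\sigma_y^2$.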
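The conditional-distribution result can be checked numerically. The sketch below simulates a bivariate normal sample and compares the fitted regression slope and residual variance with the theoretical values $\rho\sigma_y/\sigma_x$ and $(1-\rho^2)\sigma_y^2$. All parameter values are hypothetical and are not estimates from the heights data.

```r
# Simulation check of the bivariate normal result (all parameter values are hypothetical)
set.seed(3)
n       <- 10000
mu_x    <- 62.5; mu_y    <- 63.8   # assumed means (inches)
sigma_x <- 2.4;  sigma_y <- 2.6    # assumed standard deviations
rho     <- 0.5                     # assumed correlation

# Build correlated normals from two independent standard normals
z1 <- rnorm(n)
z2 <- rnorm(n)
x  <- mu_x + sigma_x * z1
y  <- mu_y + sigma_y * (rho * z1 + sqrt(1 - rho^2) * z2)

fit <- lm(y ~ x)

slope_theory <- rho * sigma_y / sigma_x        # slope of E(Y | X = x)
var_theory   <- (1 - rho^2) * sigma_y^2        # Var(Y | X = x)

c(fitted_slope = unname(coef(fit)[2]), theory = slope_theory)
c(residual_var = summary(fit)$sigma^2, theory = var_theory)
```

With a large sample the fitted and theoretical values should agree closely, which supports using a straight line with constant variance when the scatterplot looks elliptical.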
Example 2 – Forbes’ Data
• In the 1850s, the barometer (氣壓計) was a fragile instrument for measuring atmospheric pressure.
• James D. Forbes (1857): Use the boiling point of water, which can be measured more reliably with a thermometer, as a substitute for a direct measurement of atmospheric pressure
• At 17 different locations in the Alps and Scotland, he measured
• the pressure (in inches of mercury) using a barometer, and
• the boiling point of water (in °F)
• Question: Does the boiling point of water vary with atmospheric pressure in a linear way?
Example 2 – Forbes’ Data

• High Altitude: Low Atmospheric Pressure and Low Boiling Point of Water
• Low Altitude: High Atmospheric Pressure and High Boiling Point of Water
Example 2 – Forbes’ Data
• x = boiling point of water (in °F)
• y = atmospheric pressure (in inches of mercury)
• Residual plot (right panel): The residuals show a systematic pattern, suggesting a possibly quadratic relationship between x and y
• Residual = y − (fitted value of y)
[Figure: residual plot on the right, with one point marked as an outlier]

Example 2 – Forbes’ Data
• Data transformation: y = log(atmospheric pressure)
=> Residuals fall closer to a horizontal line
[Figure: plots after the transformation, with one point marked as an outlier]

• General procedure:
1. Understand the data (via the scatterplot)
2. Fit a linear regression to the data (Ch2-3)
3. Examine the residuals through Residual Analysis (Ch7) and apply any necessary Data Transformation (Ch8)
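The general procedure can be sketched in R. To keep the example self-contained it uses simulated data rather than the Forbes measurements: the 17 temperatures, the coefficients, and the noise level are hypothetical, chosen only so that pressure is exactly log-linear in temperature.

```r
# Sketch of "fit, inspect residuals, transform" on simulated (hypothetical) data
set.seed(2)
temp <- seq(194, 212, length.out = 17)                       # hypothetical boiling points (°F)
pres <- exp(-0.95 + 0.0205 * temp + rnorm(17, sd = 0.002))   # hypothetical pressures

fit_raw <- lm(pres ~ temp)        # straight line on the original scale
fit_log <- lm(log(pres) ~ temp)   # straight line after the log transformation

par(mfrow = c(1, 2))
plot(temp, resid(fit_raw), main = "Raw response", ylab = "Residuals")
abline(h = 0, lty = 2)            # curved pattern expected here
plot(temp, resid(fit_log), main = "Log response", ylab = "Residuals")
abline(h = 0, lty = 2)            # no systematic pattern expected here

# Numerical check of the curvature: add a (centered) quadratic term to the raw fit
temp_c    <- temp - mean(temp)
fit_quad  <- lm(pres ~ temp_c + I(temp_c^2))
quad_coef <- unname(coef(fit_quad)[3])
quad_coef                         # positive value indicates an upward-curving relationship
```

In this simulated setting the raw-scale residuals bend upward at both ends, while the residuals after the log transformation scatter around zero without a visible trend.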
Example 3 – Length at Age for Smallmouth Bass
• Background: Smallmouth bass (小嘴鱸魚) is a popular game fish in North America
• Problem of interest: Avoid excessive fishing => Would like to impose fishing regulations to protect young smallmouth bass (based on length)
• Want to study the growth pattern (age vs length) of the fish:
• y = Length of smallmouth bass at capture (in mm)
• x = Age of smallmouth bass at capture (in years)
Linear relationship between length and age
Example 3 – Length at Age for Smallmouth Bass
• Dashed line: Connects the average length of fish in each age group, i.e. the sample mean length of fish at age i, for i = 1, 2, …, 8
Need 8 numbers to summarize the locations (i.e. 1st moments) of the data

• Solid line: Regression line y = a + bx
Only 2 numbers are required to relate the 8 mean lengths by age
=> Regression provides a simpler model for the data

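The contrast between 8 group means and 2 regression parameters can be made concrete with a short R sketch. The data are simulated; the growth rate, the noise level, and the 50 fish per age group are assumptions for illustration and are not taken from the smallmouth bass data.

```r
# Group means (8 numbers) versus a regression line (2 numbers), simulated data
set.seed(4)
age <- rep(1:8, each = 50)                            # hypothetical: 50 fish at each age 1..8
len <- 65 + 30 * age + rnorm(length(age), sd = 25)    # hypothetical lengths (mm)

group_means <- tapply(len, age, mean)   # sample mean length at each age
fit <- lm(len ~ age)                    # regression line: len = a + b * age

plot(age, len, xlab = "Age (years)", ylab = "Length (mm)")
lines(1:8, group_means, lty = 2)        # dashed line through the 8 group means
abline(fit)                             # solid regression line

length(group_means)   # number of values in the group-mean summary
length(coef(fit))     # number of values in the regression summary
```

When the two lines nearly coincide, the two-parameter line captures essentially the same information as the eight separate means.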
Example 4 – Predicting the Weather
Problem of Interest: Can early snowfall (Sep 1st to Dec 31st) predict
late snowfall (Jan 1st to Jun 30th next year) at Fort Collins, Colorado?

• Money magazine's Best Place to Live in the U.S. in 2006
• One of the towns that inspired the design of Main Street, U.S.A., inside the main entrance of many Disneyland-style parks
[Figure: Fort Collins]

Example 4 – Predicting the Weather
Problem of Interest: Can early snowfall (Sep 1st to Dec 31st) predict
late snowfall (Jan 1st to Jun 30th next year) at Fort Collins, Colorado?
• x = Early Snowfall (in inches)
from Sep 1st to Dec 31st
• y = Late Snowfall (in inches)
from Jan 1st to Jun 30th next year
• Yearly data from 1900 to 1992
(n=93)
• Dashed line = Fitted regression line
• Solid line = Average late-winter snowfall level (with slope = 0)

• "Can early snowfall predict late snowfall?"
=> Hypothesis testing: Is the slope significantly different from 0?
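A test of whether the slope differs from 0 can be read off the output of lm in R. The sketch below uses simulated snowfall values (n = 93, with early and late snowfall generated independently), so the distributions and their parameters are hypothetical and the numbers do not reproduce the Fort Collins data.

```r
# Testing H0: slope = 0 with simulated (hypothetical) snowfall data
set.seed(5)
n     <- 93
early <- rgamma(n, shape = 4, scale = 4)   # hypothetical early snowfall (inches)
late  <- rgamma(n, shape = 6, scale = 5)   # hypothetical late snowfall, generated independently

fit <- lm(late ~ early)

plot(early, late, xlab = "Early snowfall", ylab = "Late snowfall")
abline(fit, lty = 2)          # dashed line: fitted regression line
abline(h = mean(late))        # solid line: average late snowfall (slope = 0)

coef_table <- summary(fit)$coefficients    # estimates, standard errors, t values, p-values
slope_p    <- coef_table["early", "Pr(>|t|)"]
coef_table
slope_p                                    # p-value for the t-test of slope = 0
```

A large p-value means the data give no evidence that the dashed line differs from the horizontal solid line, i.e. that early snowfall helps predict late snowfall.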
Example 5 – Turkey Growth
• A farmer would like to increase the yield of turkeys (火雞) through the use of amino acids => How is the weight gain of turkeys affected by
(1) Type of amino acid supplement (a categorical variable!)
(2) Amount of amino acid supplement (% of total diet)
[Table: % of amino acid in diet (rows, e.g. 10%) by Amino Acid #1, Amino Acid #2, Amino Acid #3 (columns)]

• Record the average weight gain of each turkey pen (欄)
Example 5 – Turkey Growth
• y = Average weight gain (in grams) of turkeys in a pen
• x = Dose of amino acid supplement (as a percentage of total diet)

• Circle/Triangle/Cross: 3 different types of amino acid supplement in the diet

• Challenges:
• Non-linear relationship between x and y (Ch5 Polynomial Regression)
• Inclusion of the type of amino acid (a categorical variable) in the regression model (Ch5 Dummy Variables)
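As a rough preview of how both challenges can be handled in R, the sketch below fits a quadratic in dose together with a categorical factor for the supplement type. The doses, type labels (A1, A2, A3), and coefficients are all hypothetical, and this is one possible model form rather than the one developed in Ch5.

```r
# Quadratic term plus a categorical variable in lm(), simulated (hypothetical) data
set.seed(6)
dose <- rep(seq(0.05, 0.45, by = 0.10), times = 6)     # 5 hypothetical dose levels, 30 pens
type <- factor(rep(c("A1", "A2", "A3"), each = 10))    # 3 hypothetical supplement types

type_effect <- c(0, 15, 30)[as.integer(type)]          # assumed shift in gain for each type
gain <- 620 + 600 * dose - 700 * dose^2 + type_effect + rnorm(30, sd = 10)

# I(dose^2) adds the quadratic term; the factor is expanded into dummy variables
fit <- lm(gain ~ dose + I(dose^2) + type)

design_cols <- colnames(model.matrix(fit))
design_cols    # intercept, dose, dose^2, and one dummy for each non-baseline type
coef(fit)
```

R creates the dummy variables automatically: with three types, two 0/1 columns are added and the first level (here A1) serves as the baseline.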
Section 1.3
Installation of R and R Libraries

Installation of R
• Datasets from the textbook (and this course) are available in the R libraries “car” and “alr4”
• Requires R version 3.5.0 or higher (current version: 4.1.1)

Installation of R
1. Go to http://cran.r-project.org/bin/windows/base/ and choose “Download R 4.1.1 for Windows”
Mac OS X: R-4.1.1.pkg from https://cran.r-project.org/bin/macosx/
2. Run the .exe file to install R. The default installation folder is “C:\Program Files\R\R-4.1.1\”

Installation of R Libraries (car and alr4)
• Most data sets for this course are stored in the “alr4” library
• Installing the “alr4” library is messy because it depends on many other libraries:
[Diagram: alr4 depends on carData, car, curl, rio, data.table, haven, Rcpp, forcats, …]

### The following R code is available in RcodeCh1.r on Blackboard ###

### Make sure you are not using super version of R ###

# Install the alr4 library and the libraries it depends on (would take 5-10 mins!)
update.packages(repos = "http://cran.rstudio.com/", checkBuilt = TRUE, ask = FALSE)
install.packages(c("carData","car","effects","rio","curl","data.table","haven","Rcpp",
  "forcats","magrittr","hms","rlang","vctrs","zeallot","backports","pkgconfig","tibble",
  "pillar","crayon","openxlsx","alr4"), repos = "http://cran.rstudio.com/", dependencies = TRUE)

# Test if the alr4 library works using the Forbes data (2nd example earlier)
library(car)                       # Load the car package
library(alr4)                      # Load the alr4 package for the Forbes data below
Temperature <- Forbes[, 1]         # Object "Temperature" from the 1st column of the "Forbes" data
Pressure <- Forbes[, 2]            # Object "Pressure" from the 2nd column of the "Forbes" data
par(mfrow = c(1, 2))               # Set the graphical screen to 1 row & 2 columns
plot(Temperature, Pressure)        # Scatterplot of Temperature (x) and Pressure (y)
fit0 <- lm(Pressure ~ Temperature) # Create a linear regression (lm) object (fit0)
abline(fit0)                       # Draw the regression line
Residuals <- fit0$residuals        # Object "Residuals" extracted from the object "fit0"
plot(Temperature, Residuals)       # Scatterplot of Temperature (x) and Residuals (y)
abline(h = 0, lty = 2)             # Draw a horizontal line at 0 using a dashed line (line type = 2)
Installation of R: Step-by-Step
1) In Rx64 (4.1.1) or Rxi386 (4.1.1), choose File (from the Menu Bar) -> New Window => A window called the R Editor is created at the bottom right
2) Copy the R code from the previous page into the R Editor
[Alternative to steps 1) and 2): File (from the Menu Bar) -> Open Script and choose the RcodeCh1.r file you downloaded from the course Blackboard]
3) Select All (Ctrl+A), then Run line or selection (Ctrl+R). The code will be executed in the R console on the left

Section 1.4
Mean and Variance Functions

Mean and Variance Functions
• Consider data {(x_i, y_i), i = 1, 2, …, n}
• x is called the Explanatory Variable (EV) [also called the Predictor or Independent Variable]
• y is called the Response Variable (RV) [also called the Dependent Variable]
• Model assumptions when setting up a regression model:
1. Mean Function $E(Y \mid X = x)$
2. Variance Function $\mathrm{Var}(Y \mid X = x)$
1. The Mean Function is the expected value of the response when the explanatory variable X = x:

$E(Y \mid X = x) = f(x)$
• Linear Regression: $f(x) = a + bx$
• Quadratic Regression: $f(x) = a + b_1 x + b_2 x^2$
Mean Function – Inheritance of Height
Example: Inheritance of Height (Mother’s height vs Daughter’s height)
[Figure: scatterplot of Dheight against Mheight with the line y = x and the fitted regression line y = a + bx]

• Mean Function E(Dheight | Mheight = x) = a + bx,
where a (intercept) and b (slope) are the parameters of the linear regression
• b < 1, with E(Dheight | Mheight = 70) = 68 (Why?)
Mean Function – Turkey Data
• Possible (non-linear) mean function:

E(Growth | Dose = x) = β0 + β1 [1 − exp(−β2 x)]

• Interpretation of Parameters
• β0: Baseline growth (i.e. growth without amino acid supplement)
• β1: Max. effect of amino acid, with y → β0 + β1 as x → ∞
• β2: The speed at which the maximum growth is approached
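The interpretation of the three parameters can be verified by evaluating the mean function directly. In the R sketch below, the function name mean_growth and the parameter values are hypothetical, chosen only to illustrate the limits at x = 0 and for large x.

```r
# Evaluate the non-linear mean function beta0 + beta1 * (1 - exp(-beta2 * x))
mean_growth <- function(x, beta0, beta1, beta2) {
  beta0 + beta1 * (1 - exp(-beta2 * x))
}

b0 <- 620   # hypothetical baseline growth (no supplement)
b1 <- 180   # hypothetical maximum additional growth
b2 <- 8     # hypothetical rate parameter

at_zero  <- mean_growth(0,    b0, b1, b2)   # equals beta0 because exp(0) = 1
at_large <- mean_growth(1000, b0, b1, b2)   # approaches beta0 + beta1 as x grows

at_zero
at_large

# Larger beta2 makes the curve reach its plateau at smaller doses
curve(mean_growth(x, b0, b1, b2), from = 0, to = 0.5,
      xlab = "Dose", ylab = "Mean growth")
curve(mean_growth(x, b0, b1, 2 * b2), from = 0, to = 0.5, add = TRUE, lty = 2)
```

The solid and dashed curves share the same start (β0) and plateau (β0 + β1); only the speed of approach differs.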
Variance Function
• The Variance Function is assumed to be constant (and usually unknown) throughout the course:
$\mathrm{Var}(Y \mid X = x) = \sigma^2$ (constant variability)

That is, the variance of the response is THE SAME for all values of x
Why constant variance? Because it gives the estimators good statistical properties

• Example: Heights Data
• Var(Y | X = x) = σ²
• Scatterplot: The variability of y is approximately the same across different values of x
Reasonable Assumption on Constant Variance?

Section 1.5
Separated Points

Four Hypothetical Data Sets
• [Textbook Table 1.1] 4 different data sets {(x_i, y_i), i = 1, 2, …, 11}

Same summary statistics: $\{\bar{x}, \bar{y}, s_x^2, s_y^2, s_{xy}\}$

• (Ch2) Estimates of y = a + bx depend on the 5 summary statistics only => Same regression line for the 4 data sets
Four Hypothetical Data Sets

Conclusions from the above:

1. Dependence is not limited to E(Y|X) = a + bX (e.g. it may be polynomial)
2. Summary statistics (i.e. a, b from the regression line) may not be a good summary of the dependence
=> Should first understand the data graphically (e.g. with a scatterplot) before fitting a regression line
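The point can be reproduced in R with the built-in anscombe data frame (Anscombe's quartet), which also contains four data sets of 11 points each. That these are the same four data sets as Textbook Table 1.1 is an assumption here, not something stated on the slides.

```r
# Four data sets with (nearly) identical summary statistics and regression lines
# Assumption: the built-in 'anscombe' data stand in for Textbook Table 1.1
data(anscombe)

sets <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})

summaries <- t(sapply(sets, function(d) {
  fit <- lm(y ~ x, data = d)
  c(mean_x = mean(d$x), mean_y = mean(d$y),
    var_x = var(d$x), var_y = var(d$y), cov_xy = cov(d$x, d$y),
    intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]))
}))
round(summaries, 2)   # one row per data set

# The scatterplots tell four very different stories
par(mfrow = c(2, 2))
for (d in sets) {
  plot(d$x, d$y, xlab = "x", ylab = "y")
  abline(lm(y ~ x, data = d))
}
```

The rows of the summary table agree to about two decimal places, yet only one of the four plots looks like a straight-line relationship with constant variance.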
Separated Points
• Separated Points: Points that are well separated from the other points, either horizontally or vertically
• Does the presence of separated points affect the regression line?

Separated Points
• Horizontal: Leverage point (i.e. it exerts a leverage effect on the line)
The location (x, y) of a leverage point has a higher impact on the regression line (i.e. leverage) than the other points

• Vertical: Outlier (i.e. it lies far from the line)
An outlier typically does not affect the regression line much

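The different effects of the two kinds of separated points can be seen in a small constructed example in R. The ten base points lie exactly on y = 2 + 0.5x, and the two added points are hypothetical. The outlier is deliberately placed at the mean of x, which is the cleanest case: there the slope is not affected at all (only the intercept moves).

```r
# Effect of an outlier versus a leverage point on the fitted slope (constructed data)
x <- 1:10
y <- 2 + 0.5 * x                      # ten points exactly on a line with slope 0.5

slope <- function(x, y) unname(coef(lm(y ~ x))[2])

base_slope <- slope(x, y)

# Vertically separated point (outlier) at the mean of x, far above the line
outlier_slope <- slope(c(x, 5.5), c(y, 20))

# Horizontally separated point (leverage point) far to the right and off the line
leverage_slope <- slope(c(x, 30), c(y, 2))

c(base = base_slope, with_outlier = outlier_slope, with_leverage = leverage_slope)
```

A single leverage point is enough to turn the clearly positive slope into a slightly negative one, while the outlier at the centre of the x values leaves the slope unchanged. An outlier away from the mean of x would move the slope somewhat, but usually far less than a leverage point does.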
Section 1.6
Scatterplot

Why Scatterplot?
• A scatterplot uses Cartesian (x-y) coordinates to display the values of two variables.
• A scatterplot can help identify the following:
1. the mean function
2. the variance function
3. separated points

Inheritance of Height Data:
1. Mean function: Linear
2. Variance function: Constant
3. No separated points

Null Plot
• A null plot is a scatterplot with
1. constant mean function (slope = 0)
2. constant variance function
3. no separated points

[Figure: Snowfall data]

A null plot of the residuals => Linear regression is a reasonable model for the data
Scatterplot Matrix
What to do if there are more than 2 variables?
Answer: Draw a scatterplot for EACH PAIR of variables => Scatterplot Matrix
• Only the marginal relationship between two variables is observed.
• Joint relationship (e.g. interaction of 3 or more variables)??
Example: Fuel Consumption Data
Problem: Understand how fuel consumption varies over the 50 states in the US, and understand the effect of the state gasoline tax on fuel consumption.
• Fuel (y): Gasoline (in thousands of gallons) sold for road use per person of age 16+
• Tax (x1): State gasoline tax rate (cents per gallon)
• Dlic (x2): 1,000 × (# of licensed drivers / population of age 16+) in that state
• Income (x3): Personal income (in US$1,000)
• logMiles (x4): Log (total length of highway [in miles] in that state)
Scatterplot Matrix

Generate Scatterplot Matrix in R – the “pairs” Function
Example: Generate the Scatterplot Matrix for the Fuel data
library(car); library(alr4)
Tax <- fuel2001$Tax                             # State gasoline tax rate (cents per gallon)
Dlic <- 1000 * fuel2001$Drivers / fuel2001$Pop  # Licensed drivers per 1,000 population of age 16+
Income <- fuel2001$Income                       # Personal income
logMiles <- log(fuel2001$Miles, 2)              # Log (base 2) of total length of highway
Fuel <- fuel2001$FuelC / fuel2001$Pop           # Amount of gasoline sold per person of age 16+
Data <- cbind(Tax, Dlic, Income, logMiles, Fuel)
# Bind the 5 objects (by columns) into a matrix of 5 columns
pairs(Data)                                     # Generate the Scatterplot Matrix of "Data"

Generate a single scatterplot: Use the “plot” function

plot(logMiles, Fuel)

Correlation Heatmap
Correlation Heatmap: Graphical illustration of the correlation matrix
• Quick and dirty way to summarize the linear association (i.e. correlation) between each pair of variables
• Works particularly well for data with A LOT of variables
• Unable to show linearity or possible separated points

install.packages("corrplot")
library(corrplot)
par(mfrow=c(1,1)) # Set the graphical screen to 1 row & 1 column
M <- cor(Data) # Compute the correlation matrix
corrplot(M, method = "color") # Heatmap of M
