Chapter 1


MH3510/MTH352 REGRESSION ANALYSIS
Pan Guangming
OUTLINE
| Syllabus
| The position of this course

| Example 1

| Example 2

| Summary
SYLLABUS
COURSE WORK
| 10% - 2 or 3 assignments
| 20% - Mid-term test

| 10% - Group projects.


    | There are 5-7 students in a group;
    | One report per group.
WHY STATISTICS?
What do we want to do when a lot of data arrives?
| Explanation;

| Prediction.

| Starts with a problem;


| Proceeds with the collection of data;

| Continues with the data analysis;

| Finishes with a conclusion.


WHAT DOES STATISTICS DO? - EXPLANATION
| Generally, we want to use formulas to approximate
the patterns in the data.

(x,y)
1.2 2.4
3.5 7
2.4 4.8
4.9 9.8
1.8 3.6
3.1 6.2
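
As a minimal sketch of what "using a formula to approximate the pattern" means, the Python snippet below checks that every pair in the table satisfies y = 2x. The formula y = 2x is an assumption read off the data by inspection (it is not stated on the slide), and the names `pairs`, `formula` and `max_gap` are illustrative.

```python
# (x, y) pairs copied from the table above
pairs = [(1.2, 2.4), (3.5, 7.0), (2.4, 4.8), (4.9, 9.8), (1.8, 3.6), (3.1, 6.2)]


def formula(x):
    """Candidate formula (an assumption read off the data): y = 2 * x."""
    return 2 * x


# Largest gap between the formula and the recorded y values
max_gap = max(abs(formula(x) - y) for x, y in pairs)
print("formula reproduces every pair:", max_gap < 1e-9)
```

Here the formula fits exactly. Real data are rarely this clean, which is where regression comes in.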
THE POSITION OF THIS COURSE
| We use different tools to deal with all kinds of data.

Data:
| Regression analysis
| Time series analysis
| Multivariate analysis
| Survival analysis
| Sampling survey

This course introduces some methods to deal with
commonly used types of data.
WHAT IS REGRESSION?
| Regression analysis is a statistical methodology
that utilizes the relation between two or more
quantitative variables so that a response or
outcome variable can be predicted from the other,
or the others.

| Regression summarizes the relationship among
variables.
WHY DO WE NEED REGRESSION ANALYSIS? (I)
| In any system in which variable quantities
change, it is of interest to examine the effects
that some variables exert (or appear to exert) on
others.
| In most physical processes there exists a
functional relationship that is too complicated to
grasp or to describe in simple terms. In this case
we may wish to approximate this functional
relationship by some simple mathematical
function such as a polynomial so that we may be
able to learn more about the underlying true
relationship.
WHY DO WE NEED REGRESSION ANALYSIS? (II)
| Do you know how to predict the height of your
children based on your height?
| Do you know how to predict next year's house
prices in Singapore?
WHY DO WE NEED REGRESSION ANALYSIS? (III)
| In an efficiency study of 67 branch offices of a
consumer finance chain, the response variable
was direct operating cost for the year just ended.
| By developing a usable statistical relation
between cost and the predictor variables, the
management was able to set cost standards for
each branch office in the company chain.
THREE PURPOSES
| Regression analysis serves three main purposes.

| 1) Description

| 2) Control

| 3) Prediction.
HOW DOES REGRESSION ANALYSIS WORK?
EXAMPLE 1
| The following data set records the plasma total
cholesterol levels of 24 patients with
hypercholesterolemia admitted to a hospital:
3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
2.5,4.6,3.2,4.2,2.3,4.0,4.3,3.9,3.3,3.2,2.5,3.3

| Question: When a new patient arrives, do you
have some idea about his/her cholesterol level?
ANSWER
| Use the average of the 24 observations: 3.354.
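
The average quoted above can be reproduced with a short Python sketch. The list is copied from the data set on the previous slide; the variable names are illustrative.

```python
from statistics import mean

# Cholesterol levels of the 24 patients, copied from the data set above
cholesterol = [
    3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
    2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3,
]

# With no other information, the sample mean is a natural guess
# for the cholesterol level of a new patient.
average = mean(cholesterol)
print(round(average, 3))  # 3.354
```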
THE EXAMPLE
| A data set records the cholesterol levels of 24 patients:
3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,
2.5,4.6,3.2,4.2,2.3,4.0,4.3,3.9,3.3,3.2,2.5,3.3
| The ages of 24 patients:

46,20,52,30,57,25,28,36,22,43,57,33,
22,63,40,48,28,49,52,58,29,34,24,50
QUESTIONS
| When a new patient with known age (e.g. 45)
arrives, do you have some idea about his/her
plasma cholesterol level?
SCATTER PLOT
[Figure: scatter plot of cholesterol level against age]
SIMPLE LINEAR REGRESSION
| It seems that the plasma levels depend on the
ages.
| We can use a straight line to express such
dependence.
| For the new patient with age 45, we can use this
line to get some basic idea about his/her plasma
levels.
| The above example highlights the importance in
data analysis of collecting data on some other
variables (e.g. age) relevant to the main variable
of interest (e.g. cholesterol level) .
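
One way to draw such a straight line is sketched below in Python. It assumes the line is fitted by ordinary least squares, which this slide does not state; the names `m`, `b` and `predict` are illustrative. The ages and cholesterol levels are copied from the example above.

```python
# Data copied from the example above (24 patients)
ages = [46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
        22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50]
cholesterol = [3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
               2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(cholesterol) / n

# Least squares slope m and intercept b for the line y = b + m * x
# (assumption: ordinary least squares is the fitting method)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, cholesterol))
sxx = sum((x - mean_x) ** 2 for x in ages)
m = sxy / sxx
b = mean_y - m * mean_x


def predict(age):
    """Point on the fitted line at the given age."""
    return b + m * age


print(f"slope m = {m:.4f}, intercept b = {b:.4f}")
print(f"fitted cholesterol level at age 45: {predict(45):.2f}")
```

Because the slope is positive and 45 is above the mean age, the fitted value at age 45 is higher than the plain average of the 24 observations.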
RELATIONSHIPS BETWEEN VARIABLES
| Functional Relationships – The value of the
dependent variable Y can be computed exactly if
we know the value of the independent variable X.
(e.g., Y=2X)

| Statistical Relationships – Not a perfect or
exact relationship. The expected value of the
response variable Y is a function of the
explanatory or predictor variable X.
RELATIONSHIPS BETWEEN VARIABLES (I)
| Is cholesterol level = b + m × age? (functional
relation)
| No! The relationship is far from perfect/exact (it's
a statistical relation)!
| We can say that E(cholesterol level) = b + m × age.
That is, cholesterol level is a random variable.
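
A quick way to see that the relation cannot be functional is to look for patients of the same age with different cholesterol levels. The Python sketch below does this on the 24 observations from the example; the variable names are illustrative.

```python
from collections import defaultdict

# Data copied from the example above (24 patients)
ages = [46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33,
        22, 63, 40, 48, 28, 49, 52, 58, 29, 34, 24, 50]
cholesterol = [3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0,
               2.5, 4.6, 3.2, 4.2, 2.3, 4.0, 4.3, 3.9, 3.3, 3.2, 2.5, 3.3]

# Group the cholesterol levels by age
levels_by_age = defaultdict(list)
for age, level in zip(ages, cholesterol):
    levels_by_age[age].append(level)

# Ages that occur more than once in the sample
repeated = {age: levels for age, levels in levels_by_age.items() if len(levels) > 1}
for age in sorted(repeated):
    print(age, repeated[age])
```

Each repeated age comes with two different levels, so no formula of age alone can reproduce the data exactly.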
FURTHERMORE
| Suppose we also record the blood pressures of
these 24 patients.
| How can we use both age and blood pressure to
predict the plasma level?

Multiple linear regression
QUESTION
| What happens if we record the genders of these
patients instead of their blood pressures?
WHAT CAN REGRESSION DO?
EXAMPLE 2
| Question 1: Are PM (Pure Math) students
smarter than ST (STatistics) students?

No. Program GPA
1 ST 3.2
2 ST 4.3
3 ST 4.9
4 ST 3.0
5 PM 4.0
6 PM 3.8
7 PM 3.7
8 PM 4.1

Method: T-test
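
The comparison for Question 1 can be sketched in Python as follows. The sketch assumes a two-sample t statistic with pooled variance (the slide only names a t-test), and the second half illustrates one way to link it to regression: coding PM as 1 and ST as 0 and fitting a line gives a slope equal to the difference in group means. All variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# GPAs copied from the table above
st = [3.2, 4.3, 4.9, 3.0]
pm = [4.0, 3.8, 3.7, 4.1]

# Two-sample t statistic with pooled variance (an assumption of this sketch)
n1, n2 = len(st), len(pm)
pooled_var = ((n1 - 1) * variance(st) + (n2 - 1) * variance(pm)) / (n1 + n2 - 2)
t_stat = (mean(pm) - mean(st)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Regression view: indicator x = 1 for PM, 0 for ST, then fit GPA = b + m * x
x = [0] * n1 + [1] * n2
y = st + pm
mean_x, mean_y = mean(x), mean(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
m = sxy / sxx            # equals mean(pm) - mean(st)
b = mean_y - m * mean_x  # equals mean(st)

print(f"t statistic = {t_stat:.3f} on {n1 + n2 - 2} degrees of freedom")
print(f"regression slope = {m:.3f}, intercept = {b:.3f}")
```

The t statistic is small in absolute value, so these eight observations give little evidence of a difference between the two programs.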
EXAMPLE 2
| Question 2: How about the situation in the whole
Math division?
No. Program GPA
1 ST 3.2
2 ST 4.3
3 ST 4.9
4 ST 3.0
5 PM 4.0
6 PM 3.8
7 PM 3.7
8 PM 4.1
9 AM 4.2
10 AM 3.8
11 AM 3.8
12 AM 3.2
13 ME 4.5
14 ME 4.7
15 ME 4.9
16 ME 3.5

Method: One-way ANOVA
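
A minimal Python sketch of the one-way ANOVA computation for the four programs is given below. It uses the standard between-group and within-group sums of squares; the variable names are illustrative, and no p-value is computed because that would need an F distribution table or an extra library.

```python
# GPAs by program, copied from the table above
groups = {
    "ST": [3.2, 4.3, 4.9, 3.0],
    "PM": [4.0, 3.8, 3.7, 4.1],
    "AM": [4.2, 3.8, 3.8, 3.2],
    "ME": [4.5, 4.7, 4.9, 3.5],
}

all_gpas = [g for values in groups.values() for g in values]
grand_mean = sum(all_gpas) / len(all_gpas)

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(
    len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
)
# Within-group sum of squares: spread of observations around their group mean
ss_within = sum(
    (g - sum(v) / len(v)) ** 2 for v in groups.values() for g in v
)

df_between = len(groups) - 1
df_within = len(all_gpas) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.2f}")
print(f"F = {f_stat:.3f} on ({df_between}, {df_within}) degrees of freedom")
```

An F statistic below 1 means the variation between program means is no larger than the variation within programs.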
EXAMPLE 2
| Question 3: How do we draw a conclusion if
gender is also involved?
No. Gender Program GPA
1 F ST 3.2
2 M ST 4.3
3 F ST 4.9
4 M ST 3.0
5 F PM 4.0
6 M PM 3.8
7 F PM 3.7
8 F PM 4.1
9 M AM 4.2
10 F AM 3.8
11 M AM 3.8
12 F AM 3.2
13 M ME 4.5
14 F ME 4.7
15 M ME 4.9
16 M ME 3.5

Method: Two-way ANOVA
EXAMPLE 2
| Question 4: How do we draw a conclusion if the
GPA in HS (High School) is also involved?
No. Gender Program GPA in HS GPA
1 F ST 3.1 3.2
2 M ST 4.7 4.3
3 F ST 4.8 4.9
4 M ST 2.9 3.0
5 F PM 4.2 4.0
6 M PM 3.7 3.8
7 F PM 3.6 3.7
8 F PM 4.0 4.1
9 M AM 4.1 4.2
10 F AM 4.1 3.8
11 M AM 3.7 3.8
12 F AM 3.1 3.2
13 M ME 4.4 4.5
14 F ME 4.8 4.7
15 M ME 4.8 4.9
16 M ME 4.8 3.5
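
One possible way to handle Question 4 with regression is sketched below in Python: the qualitative variables (gender, program) are coded as 0/1 indicator columns and the quantitative one (GPA in HS) enters as is. This is an illustrative sketch, not necessarily the analysis intended in the course. The baseline categories (female, ST), the helper `solve` and all variable names are assumptions of the sketch.

```python
# Rows copied from the table above: (gender, program, GPA in HS, GPA)
rows = [
    ("F", "ST", 3.1, 3.2), ("M", "ST", 4.7, 4.3), ("F", "ST", 4.8, 4.9), ("M", "ST", 2.9, 3.0),
    ("F", "PM", 4.2, 4.0), ("M", "PM", 3.7, 3.8), ("F", "PM", 3.6, 3.7), ("F", "PM", 4.0, 4.1),
    ("M", "AM", 4.1, 4.2), ("F", "AM", 4.1, 3.8), ("M", "AM", 3.7, 3.8), ("F", "AM", 3.1, 3.2),
    ("M", "ME", 4.4, 4.5), ("F", "ME", 4.8, 4.7), ("M", "ME", 4.8, 4.9), ("M", "ME", 4.8, 3.5),
]

# Design matrix columns: intercept, male, PM, AM, ME, HS GPA
# (baseline categories, chosen for this sketch: female and ST)
names = ["intercept", "male", "PM", "AM", "ME", "HS GPA"]
X = [[1.0, float(g == "M"), float(p == "PM"), float(p == "AM"), float(p == "ME"), hs]
     for g, p, hs, _ in rows]
y = [gpa for *_, gpa in rows]


def solve(a, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


# Least squares estimates from the normal equations (X'X) beta = X'y
k = len(names)
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
beta = solve(xtx, xty)

fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for name, b in zip(names, beta):
    print(f"{name:>9}: {b: .3f}")
```

Each indicator coefficient is read as a shift relative to the baseline category, holding the other variables fixed.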
SUMMARY
TERMINOLOGY
| There are two types of variables: quantitative
and qualitative.
| Quantitative variables can be measured in a
numerical form: e.g. age, GPA, income, time,
temperature etc.
| Qualitative variables are not numerical in nature:
e.g. gender, categorized age, education level, type
of crime committed, style of cuisine served in a
restaurant etc.
TERMINOLOGY
| The variable to be predicted, y, is called the
response variable. Equivalently, the variable of
primary interest is called the response variable
(output variable, output, Y-variable or
dependent variable),
| whereas the remaining variables are called
predictor variables (input variables, inputs,
X-variables, regressors or independent variables).
STATISTICAL MODELS
| Response variable =
Model function (Ey) + Random error.
| When the systematic component (model function)
is a linear function of the parameters, the model
is called linear regression.
| The least squares (LS) method is usually used to
estimate the parameters in the systematic
component.
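
The decomposition "response = model function + random error" and the LS method can be illustrated with a small simulation in Python. Everything numeric here is hypothetical and chosen only for illustration: the "true" intercept 1.0, slope 2.0, noise standard deviation 0.5, sample size 200 and the random seed.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" parameters, chosen only for illustration
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 1.0, 2.0, 0.5

# Response = model function (E y) + random error
x = [random.uniform(0, 10) for _ in range(200)]
errors = [random.gauss(0, NOISE_SD) for _ in x]
y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + e for xi, e in zip(x, errors)]


def sse(b, m):
    """Sum of squared errors of the line y = b + m * x on the simulated data."""
    return sum((yi - (b + m * xi)) ** 2 for xi, yi in zip(x, y))


# Least squares estimates: the (b, m) pair that minimizes sse
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
m_hat = sxy / sxx
b_hat = mean_y - m_hat * mean_x

print(f"estimated intercept = {b_hat:.3f}, estimated slope = {m_hat:.3f}")
print(f"SSE at the LS estimates   = {sse(b_hat, m_hat):.3f}")
print(f"SSE at the true parameters = {sse(TRUE_INTERCEPT, TRUE_SLOPE):.3f}")
```

The estimates land close to the parameters used to generate the data, and no other line, not even the true one, gives a smaller sum of squared errors on this sample.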
A SHORT HISTORY OF REGRESSION
| Studying natural inheritance in 1886, the scientist
Francis Galton collected data on the heights of
parents and adult children.
| He noticed the tendency for tall (or short) parents
to have tall (or short) children, but not as tall
(or short) on average as their parents.
| Galton called this phenomenon the “law of
universal regression”, for the average heights of
adult children tended to “regress” to the mean of
the population.
| Galton modeled a son’s adult height (y) as a
function of mid-parent height (x), and the term
regression model was coined.
REGRESSION APPLICATIONS (1)
| Do chief executive officers (CEOs) and their top
managers always agree on the goals of the
company?
| The researchers used regression to model a vice
president’s (VP’s) attitude toward the goal of
improving efficiency (y) as a function of two
independent variables: level of CEO leadership
(x_1) and level of congruence between the CEO
and the VP (x_2). They discovered that the
impact of CEO leadership on a VP’s attitude
toward improving efficiency depended on the level
of congruence.
REGRESSION APPLICATIONS (2)
Education
| The Standardized Admission Test (SAT) scores of
3,492 high school and college students, some of
whom paid a private tutor in an effort to obtain a
higher score, were analyzed.
| Multiple regression was used to successfully
estimate the effect of coaching on the
SAT-Mathematics score, y. The independent variables
included in the models were scores on the PSAT,
whether the student was coached, student
ethnicity, socioeconomic status, overall high
school GPA, and the number of mathematics courses
taken in high school.
