Advanced Statistics II
(PST 22209 / FST 22209 / ESNRM 22209)
R.M. KAPILA RATHNAYAKA
B.Sc. Special (Math. & Stat.) (Ruhuna), M.Sc. (Industrial Mathematics) (USJ),
M.Sc. (Stat.) (WHUT, China),
Ph.D. (Applied Statistics, WHUT)
Why do we need an alternative method to linear regression?
Polynomial Regression
• In situations where the functional relationship between the
response Y and the independent variable x cannot be
adequately approximated by a linear relationship, it is
sometimes possible to obtain a reasonable fit by considering a
polynomial relationship of the form
$$Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_h x^h + \varepsilon$$
• where $\beta_0, \beta_1, \ldots, \beta_h$ are regression coefficients that would have to be estimated.
• h is called the degree of the polynomial.
• The least-squares estimators minimize the sum of squares
$$SS = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i - \cdots - \beta_h x_i^h\right)^2$$
• To determine these estimators, we take partial derivatives with respect to $\beta_0, \beta_1, \ldots, \beta_h$ of the foregoing sum of squares, and then set these equal to 0 so as to determine the minimizing values.
• On doing so, and then rearranging the resulting equations, we obtain that the least-squares estimators satisfy the following set of linear equations, called the normal equations:
$$\sum_i y_i = n\beta_0 + \beta_1 \sum_i x_i + \cdots + \beta_h \sum_i x_i^h$$
$$\sum_i x_i y_i = \beta_0 \sum_i x_i + \beta_1 \sum_i x_i^2 + \cdots + \beta_h \sum_i x_i^{h+1}$$
$$\vdots$$
$$\sum_i x_i^h y_i = \beta_0 \sum_i x_i^h + \beta_1 \sum_i x_i^{h+1} + \cdots + \beta_h \sum_i x_i^{2h}$$
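As a quick illustration (not from the original notes), the sketch below fits a degree-h polynomial by least squares with NumPy; np.polyfit solves exactly the minimization described above. The data are the five (x, y) pairs used in the application example later in this section.

```python
import numpy as np

# Fit a degree-h polynomial by least squares; np.polyfit minimizes
# sum_i (y_i - b0 - b1*x_i - ... - bh*x_i^h)^2, the sum of squares above.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 2.0, 6.0, 12.0])

h = 2                                  # degree of the polynomial
coeffs = np.polyfit(x, y, deg=h)       # coefficients, highest power first
print(coeffs)                          # approx. [ 1. -1.  0.], i.e. y = x^2 - x
```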
Degree of the polynomial
• where h is called the degree of the polynomial. For lower degrees, the relationship has a specific name:
• h = 2 is called quadratic
• h = 3 is called cubic,
• h = 4 is called quartic, and so on.
Second-degree Polynomial – Quadratic Trend
• Practically, many real-world data patterns are best described by curves, not straight lines. In these instances, the linear trend model does not adequately describe the change in the variable as time changes.
• To overcome this problem, we often use a parabolic curve, which is described mathematically by a second-degree equation.
• The general form for an estimated second-degree equation is
$$\hat{y} = a + bx + cx^2$$
• where
– $\hat{y}$ = estimate of the dependent variable
– $a$, $b$, $c$ = numerical constants
• We can determine the values of the numerical constants from the following three equations:
$$\sum y = na + b\sum x + c\sum x^2$$
$$\sum xy = a\sum x + b\sum x^2 + c\sum x^3$$
$$\sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4$$
Second-degree Polynomial – Quadratic Trend: Applications
• Fit a quadratic polynomial to the following data.

X   Y
0   0
1   0
2   2
3   6
4  12
Example
• For these data, the required sums are
$$n = 5,\quad \sum x = 10,\quad \sum x^2 = 30,\quad \sum x^3 = 100,\quad \sum x^4 = 354,$$
$$\sum y = 20,\quad \sum xy = 70,\quad \sum x^2 y = 254.$$
• Substituting into the three equations above gives
$$20 = 5a + 10b + 30c$$
$$70 = 10a + 30b + 100c$$
$$254 = 30a + 100b + 354c$$
• Solving this system yields $a = 0$, $b = -1$ and $c = 1$.
• The estimated quadratic regression equation is
$$\hat{y} = x^2 - x$$
Matrix notation to solve the equation system
• The normal equations can be written in matrix form as
$$(X'X)\,b = X'y,$$
• which has the solution
$$b = (X'X)^{-1}X'y.$$
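A minimal sketch of this matrix solution, applied to the quadratic-trend data from the worked example above:

```python
import numpy as np

# Solve the three normal equations via the matrix form b = (X'X)^(-1) X'y.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 2.0, 6.0, 12.0])

X = np.column_stack([np.ones_like(x), x, x**2])  # columns: 1, x, x^2
b = np.linalg.solve(X.T @ X, X.T @ y)            # [a, b, c]
print(b)                                         # approx. [0., -1., 1.]
```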
Example 2
• You are studying the relationship between a particular
machine setting and the amount of energy consumed.
• A log transformation of the response variable will produce a more symmetric error distribution.
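A hedged sketch of the idea, with hypothetical setting/energy values standing in for the data described above; the response is log-transformed before fitting:

```python
import numpy as np

# Hypothetical machine settings and energy readings (illustrative only).
setting = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
energy = np.array([21.0, 29.0, 42.0, 60.0, 83.0, 118.0])

log_energy = np.log(energy)                        # transform the response
slope, intercept = np.polyfit(setting, log_energy, deg=1)
print(intercept, slope)                            # linear fit on the log scale
```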
Multiple Linear Regression
• Multiple regression is an extension of simple linear
regression.
• It is used when we want to predict the value of a variable
based on the value of two or more other variables.
• Suppose that we have a linear model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
Example
• You could use multiple regression to understand whether
exam performance can be predicted based on
– revision time,
– test anxiety,
– lecture attendance
– gender.
• Alternatively, you could use multiple regression to understand
whether daily cigarette consumption can be predicted based
on
– smoking duration,
– age when started smoking,
– smoker type,
– income
– gender.
Assumption #1:
• The dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable).
• Examples:
– revision time (measured in hours),
– intelligence (measured using IQ score),
– exam performance (measured from 0 to 100),
– weight (measured in kg)
Assumption #2:
• Two or more independent variables, which can be either
continuous (i.e., an interval or ratio variable) or categorical
(i.e., an ordinal or nominal variable).
• Examples of nominal variables include:
– gender (male and female),
– ethnicity (Caucasian, African American and Hispanic),
– physical activity level (sedentary, low, moderate and high),
– profession (surgeon, doctor, nurse, dentist, therapist).
Numerical Data (Data that is Numbers): Continuous Random Variables
• Continuous Variable –
A continuous variable is a variable whose value is obtained by measuring.
• Examples:
– height of students in class
– weight of students in class
– time it takes to get to school
– distance traveled between classes
Numerical Data (Data that is Numbers): Discrete Random Variables
• A discrete variable is a variable whose value is obtained by
counting.
• All continuous variables are numeric, but not all numeric
variables are continuous.
• Examples:
– number of students present
– number of red marbles in a jar
– number of heads when flipping three coins
– students’ grade level
Categorical Data (Data that is not numbers): Nominal Variable
• Sometimes there is no hierarchy in categorical data.
• If eye colour were coded
– 0 = "Blue"
– 1 = "Green"
– 2 = "Brown"
we would have to choose arbitrarily which option gets which number.
• It doesn't matter whether Blue is coded as zero, one, or two, because there is no hierarchy in eye colour.
Categorical Data (Data that is not numbers): Ordinal Variable
• Annoying surveys often ask you to answer with the options
“Strongly Disagree”, “Disagree”, “Neutral”, “Agree” or
“Strongly agree”.
• These data have a special structure, because the categories have a natural order; they can be coded from 0 ("Strongly Disagree") to 4 ("Strongly agree"):
– 0 = Strongly Disagree
– 1 = Disagree
– 2 = Neutral
– 3 = Agree
– 4 = Strongly agree
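A small illustrative sketch (assuming pandas is available) of coding such answers as an ordered categorical variable, so the 0-4 codes keep their natural order:

```python
import pandas as pd

# Ordered categories: the integer codes respect the agreement scale,
# unlike the arbitrary nominal codes used for eye colour above.
levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
answers = pd.Categorical(
    ["Agree", "Neutral", "Strongly agree", "Disagree"],  # hypothetical responses
    categories=levels,
    ordered=True,
)
print(answers.codes)  # [3 2 4 1] -- codes follow the ordering of the levels
```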
Assumption #3:
• Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move
along the line.
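One common way to check this assumption is to plot residuals against fitted values and look for a roughly constant vertical spread. A minimal sketch with simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate data with constant error variance, fit a line, and plot
# residuals against fitted values; an even band around 0 supports
# homoscedasticity, while a funnel shape suggests heteroscedasticity.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)

b1, b0 = np.polyfit(x, y, deg=1)       # slope first, then intercept
fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```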
Assumption #4:
• Data must not show multicollinearity, which occurs when
you have two or more independent variables that are highly
correlated with each other.
What is Multicollinearity?
Consider the following data on 20 individuals with high blood pressure:
1. blood pressure (y = BP, in mm Hg)
2. age (x1 = Age, in years)
3. weight (x2 = Weight, in kg)
4. body surface area (x3 = BSA, in sq m)
5. duration of hypertension (x4 = Dur, in years)
6. basal pulse (x5 = Pulse, in beats per minute)
7. stress index (x6 = Stress)
        BP     Age    Weight  BSA    Dur    Pulse
Age     0.659
Weight  0.950  0.407
BSA     0.866  0.378  0.875
Dur     0.293  0.344  0.201   0.131
Pulse   0.721  0.619  0.659   0.465  0.402
Stress  0.164  0.368  0.034   0.018  0.312  0.506

• Cell contents: Pearson correlation.
• Blood pressure appears to be related fairly strongly to Weight (r = 0.950)
and BSA (r = 0.866), and hardly related at all to Stress level (r = 0.164).
• Weight and BSA appear to be strongly related (r = 0.875)
• The high correlation among some of the predictors suggests that data-
based multicollinearity exists.
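As an illustrative sketch (not from the original notes), pairwise correlations and variance inflation factors, VIF_j = 1/(1 - R_j^2), can be computed as below; the small data frame is a hypothetical stand-in for the 20-observation blood-pressure dataset.

```python
import numpy as np
import pandas as pd

def vif(df, col):
    """VIF of one predictor: regress it on the others and use 1/(1 - R^2)."""
    others = df.drop(columns=[col])
    X = np.column_stack([np.ones(len(df)), others.to_numpy()])
    y = df[col].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

# Hypothetical stand-in for three of the predictors above.
df = pd.DataFrame({
    "Age":    [47, 49, 49, 50, 51, 48, 49, 47],
    "Weight": [85.4, 94.2, 95.3, 94.7, 89.4, 99.5, 99.8, 90.9],
    "BSA":    [1.75, 2.10, 1.98, 2.01, 1.89, 2.25, 2.25, 1.90],
})
print(df.corr())                         # pairwise Pearson correlations
for c in df.columns:
    print(c, round(vif(df, c), 2))       # large VIFs flag multicollinearity
```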
Assumption #5:
• There should be
– no significant outliers,
– no high leverage points, and
– no highly influential points.
• These different classifications of unusual points reflect the different
impact they have on the regression line.
What are outliers in the data?
• An outlier is an observation that lies an abnormal distance from other values
in a random sample from a population.
• The box plot is a useful graphical display for describing the behavior of the
data in the middle as well as at the ends of the distributions.
• The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution, where IQ = Q3 - Q1 is the interquartile range:
– lower inner fence: Q1 - 1.5*IQ
– upper inner fence: Q3 + 1.5*IQ
– lower outer fence: Q1 - 3*IQ
– upper outer fence: Q3 + 3*IQ
• A point beyond an inner fence on either side is considered a mild outlier. A
point beyond an outer fence is considered an extreme outlier.
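A minimal sketch of computing the fences with NumPy; the data vector is hypothetical, and quartile conventions differ slightly across software.

```python
import numpy as np

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 30], dtype=float)  # hypothetical

q1, q3 = np.percentile(data, [25, 75])
iq = q3 - q1                               # interquartile range
lower_inner, upper_inner = q1 - 1.5 * iq, q3 + 1.5 * iq
lower_outer, upper_outer = q1 - 3.0 * iq, q3 + 3.0 * iq

# Mild outliers lie beyond an inner fence but within the outer fences;
# extreme outliers lie beyond an outer fence.
mild = data[((data < lower_inner) & (data >= lower_outer)) |
            ((data > upper_inner) & (data <= upper_outer))]
extreme = data[(data < lower_outer) | (data > upper_outer)]
print(mild, extreme)                        # here 30 is an extreme outlier
```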
Assumption #6:
• You should have independence of observations (i.e., independence of
residuals), which you can easily check using the Durbin-Watson statistic
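The statistic itself is easy to compute directly from the residuals, d = Σ(e_t - e_{t-1})^2 / Σ e_t^2; values near 2 suggest no first-order autocorrelation, while values near 0 or 4 suggest positive or negative autocorrelation. A small sketch with hypothetical residuals:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by their sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(durbin_watson([0.5, -0.3, 0.2, -0.4, 0.1]))  # hypothetical residuals
```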
Assumption #7:
• There needs to be a linear relationship between
– the dependent variable and each of your independent
variables
Assumption #8:
• Finally, you need to check that the residuals (errors) are
approximately normally distributed
• Two common methods to check this assumption include using:
– histogram (with a superimposed normal curve) and a
Normal P-P Plot;
– Normal Q-Q Plot of the studentized residuals.
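A sketch of both checks with SciPy and Matplotlib, using simulated values in place of real regression residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 200)      # stand-in for regression residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with a superimposed normal curve.
ax1.hist(residuals, bins=20, density=True)
grid = np.linspace(residuals.min(), residuals.max(), 200)
ax1.plot(grid, stats.norm.pdf(grid, residuals.mean(), residuals.std()))
ax1.set_title("Histogram with normal curve")

# Normal Q-Q (probability) plot: points near the line support normality.
stats.probplot(residuals, dist="norm", plot=ax2)
ax2.set_title("Normal Q-Q plot")
plt.show()
```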
Multiple Linear Regression
• Suppose that we have a linear model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
and that we make $n$ independent observations $y_1, y_2, \ldots, y_n$ on $Y$.
• We can write the $i$th observation as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i,$$
• where $x_{ij}$ is the setting of the $j$th independent variable for the $i$th observation, $i = 1, 2, \ldots, n$.
• We now define the following matrices, with $x_{i0} = 1$:
$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} x_{10} & x_{11} & \cdots & x_{1k} \\ x_{20} & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ x_{n0} & x_{n1} & \cdots & x_{nk} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
• Thus, the $n$ equations representing $y_i$ as a function of the $x$'s, $\beta$'s and $\varepsilon$'s can be written simultaneously as
$$Y = X\beta + \varepsilon.$$
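A minimal NumPy sketch of this matrix form with k = 2 hypothetical predictors, solving the normal equations $(X'X)\beta = X'Y$:

```python
import numpy as np

# Hypothetical observations on Y and two independent variables.
y = np.array([3.0, 5.0, 6.5, 8.0, 11.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

X = np.column_stack([np.ones_like(x1), x1, x2])  # first column is x_{i0} = 1
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # [b0, b1, b2]
print(beta_hat)
```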
Regression with Two Independent Variables
• For $n$ observations from a simple linear regression model of the form
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n,$$
• the least-squares equations for $\hat\beta_0$ and $\hat\beta_1$ were given in the previous section as
$$\sum_i y_i = n\hat\beta_0 + \hat\beta_1 \sum_i x_i$$
$$\sum_i x_i y_i = \hat\beta_0 \sum_i x_i + \hat\beta_1 \sum_i x_i^2$$
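Equivalently, solving these two equations gives the familiar closed forms $\hat\beta_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$; a small sketch with hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hypothetical data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)                                  # intercept and slope estimates
```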
Regression with Two Independent Variables
• Assume the production-function model below,
$$Y = \beta_0 + \beta_1 L + \beta_2 K + \varepsilon,$$
• where $Y$ is total production, $L$ is labor input and $K$ is total capital. The information about each factor is given below for the 15-year period from 2001 to 2015.

Year   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Y     20  35  30  47  60  68  76  90 100 105 130 140 125 120 135
L     10  15  21  26  40  37  42  33  30  38  60  65  50  35  42
K     12  10   9   8   5   7   4   5   7   5   3   4   3   1   2

• By using the above data, estimate the $\beta_0$, $\beta_1$ and $\beta_2$ parameters of the model by using the ordinary least squares (OLS) method.
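A sketch of the OLS computation for this exercise, using the table's data; the linear form $Y = \beta_0 + \beta_1 L + \beta_2 K + \varepsilon$ is the reconstruction assumed above:

```python
import numpy as np

# Production (Y), labor (L) and capital (K) from the table above.
Y = np.array([20, 35, 30, 47, 60, 68, 76, 90, 100, 105,
              130, 140, 125, 120, 135], dtype=float)
L = np.array([10, 15, 21, 26, 40, 37, 42, 33, 30, 38,
              60, 65, 50, 35, 42], dtype=float)
K = np.array([12, 10, 9, 8, 5, 7, 4, 5, 7, 5,
              3, 4, 3, 1, 2], dtype=float)

X = np.column_stack([np.ones_like(L), L, K])    # intercept, labor, capital
b0, b1, b2 = np.linalg.solve(X.T @ X, X.T @ Y)  # OLS estimates
print(b0, b1, b2)
```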