Linear Regression
Linear regression is a regression model that uses a straight line to describe the relationship
between variables. It finds the line of best fit through your data by searching for the values of
the regression coefficient(s) that minimize the sum of squared errors (residuals) of the model.
Simple linear regression models describe the effect that a particular variable, called the
explanatory variable, might have on the value of a continuous outcome variable, called the
response variable.
General Concepts
The purpose of a linear regression model is to come up with a function that estimates the
mean of one variable given a particular value of another variable. These variables are known
as the response variable (the “outcome” variable whose mean you are attempting to find) and
the explanatory variable (the “predictor” variable whose value you already have).
Residual Assumptions
In the simple linear model y = β0 + β1x + eps, the error term eps is assumed to be normally
distributed: eps ∼ N(0, σ2). In particular:
• eps is centered at zero (that is, it has a mean of zero).
• The variance of eps, σ2, is constant for all values of the predictor.
Parameters
The value denoted by β0 is called the intercept, and that of β1 is called the slope. Together,
they are also referred to as the regression coefficients and are interpreted as follows:
• The intercept, β0, is interpreted as the expected value of the response variable when the
predictor is zero.
• Generally, the slope, β1, is the focus of interest. This is interpreted as the change in the
mean response for each one-unit increase in the predictor.
Example:
The first argument to lm is the now-familiar response ~ predictor formula, which specifies the
desired model. The data argument (data=datasource) tells lm which object contains the
variables named in the formula.
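As a minimal sketch with the blood-pressure data used below (assuming bprep is a data frame
containing the columns sbp, age, and weight):
model_fit <- lm(sbp ~ age + weight, data = bprep)   # fit sbp on age and weight and store the "lm" object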
If you simply enter the name of the "lm" object at the prompt, it will provide the most basic
output: a repeat of your call and the estimates of the intercept (β̂0) and slope(s) (β̂1).
Call:
lm(formula = sbp ~ age + weight, data = bprep)
Coefficients:
(Intercept) age weight
43.8642 0.2734 0.4398
The fitted model is built from these coefficients: predicted sbp = 43.8642 + 0.2734 × age + 0.4398 × weight.
Std. Error is the standard error of the coefficient estimate; it measures how much the estimate
would typically vary from sample to sample.
t value = Estimate / Std. Error.
Pr(>|t|) gives the p-value for the t-test of whether the coefficient differs significantly from zero.
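As a quick check against the coefficient table printed further below, the t value is just the ratio
of the first two columns, and the whole table can be extracted from the summary object:
43.8642 / 40.6649                  # intercept Estimate / Std. Error ≈ 1.079, the printed t value
summary(model_fit)$coefficients    # matrix of Estimate, Std. Error, t value, and Pr(>|t|)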
R-squared measures the percentage of the variance in the response variable that is explained
by the regression.
Multiple R-squared typically increases each time you add a predictor (x) variable.
Adjusted R-squared penalizes each additional predictor (to discourage overfitting), so it may
not increase as you add more variables.
If your Multiple R-squared is much higher than your Adjusted R-squared, your model might
be overfitting.
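Both quantities are stored in the summary of the fitted object; for the model_fit object used in
this section:
summary(model_fit)$r.squared       # Multiple R-squared
summary(model_fit)$adj.r.squared   # Adjusted R-squared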
The fitted regression line can be added to a plot of the data with the function abline, as in:
abline(survfit, lwd=2)
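Note that abline only draws a single straight line, so this is meaningful for a simple
(one-predictor) regression. A minimal sketch, assuming survfit was fitted as lm(y ~ x, data = dat),
where dat, x, and y are placeholder names:
plot(y ~ x, data = dat)    # scatterplot of the raw data
abline(survfit, lwd = 2)   # overlay the fitted regression line at double line width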
confint() function:
You can use the confint() function in R to calculate a confidence interval for one or more
parameters in a fitted regression model.
For example:
confint(model_fit)
2.5 % 97.5 %
(Intercept) -131.1025620 218.830951
age -1.6770934 2.223863
weight -0.2633077 1.142859
coef() function:
To extract the coefficients of an "lm" object, the “direct-access” function to use is coef().
Ex:
Here, the regression coefficients are extracted from the object and then separately assigned to
the objects beta0.hat and beta1.hat.
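A sketch of that extraction, assuming survfit is a fitted simple regression ("lm") object like the
one used with abline above:
beta0.hat <- coef(survfit)[1]   # estimated intercept
beta1.hat <- coef(survfit)[2]   # estimated slope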
predict() function:
Given a fitted model, predict() produces predicted values of the response for a new set of data.
Syntax: predict(object, newdata, interval, level), where object is the fitted "lm" object,
newdata is a data frame containing values of the predictors, interval specifies the type of
interval ("confidence" or "prediction"), and level is the confidence level.
#summary of model_fit
summary(model_fit)
Call:
lm(formula = sbp ~ age + weight, data = bprep)
Residuals:
1 2 3 4 5
-0.6432 -0.4364 8.6594 -3.9865 -3.5933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.8642 40.6649 1.079 0.394
age 0.2734 0.4533 0.603 0.608
weight 0.4398 0.1634 2.691 0.115
confint(model_fit, level=0.95)
#predict sbp for the given age=40 and weight=200 with a confidence interval
newdata <- data.frame(age = 40, weight = 200)
ci <- predict(model_fit, newdata, interval = "confidence", level = 0.95)
ci
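The "confidence" interval above is for the mean sbp at the given age and weight. Setting
interval = "prediction" instead gives the wider interval for a single new individual; a small
sketch under the same assumptions:
pred.int <- predict(model_fit, newdata, interval = "prediction", level = 0.95)
pred.int   # columns fit, lwr, and upr for an individual new observation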
Hypothesis testing is performed with the objective of obtaining a p-value in order to quantify
the evidence against the null hypothesis H0. H0 is rejected in favour of the alternative, Ha, if
the p-value is less than a predefined significance level α, which is conventionally 0.05 or 0.01.
To be able to test the validity of your rejection or retention of the null hypothesis, you must
be able to identify two kinds of errors:
Type I Errors
• A Type I error occurs when you incorrectly reject a true H0. In any given hypothesis test,
the probability of a Type I error is equivalent to the significance level α.
If your p-value is less than α, you reject the null hypothesis. If the null is really true, though,
α directly defines the probability that you incorrectly reject it. This is referred to as a Type
I error.
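A small simulation sketch (not from the text) illustrates this: when H0 is really true, roughly a
proportion α of tests end up rejecting it.
set.seed(1)
# run 10,000 one-sample t-tests of H0: mu = 0 on samples that truly have mean 0
pvals <- replicate(10000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)
mean(pvals < 0.05)   # close to 0.05, i.e. the Type I error rate equals alpha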
Type II Errors
A Type II error refers to incorrect retention of the null hypothesis—in other words, obtaining
a p-value greater than the significance level when it’s the alternative hypothesis that’s
actually true. For the same scenario you’ve been looking at so far (an upper-tailed test for a
single sample mean), the probability of a Type II error is denoted β.
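The probability of a Type II error depends on the true effect size, the sample size, and α. As a
rough sketch (the values of n, delta, and sd below are assumed, not from the text),
power.t.test returns the power of such a one-sample, upper-tailed test, and β is one minus
that power:
pw <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
                   type = "one.sample", alternative = "one.sided")
1 - pw$power   # probability of a Type II error under these assumed settings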
ANOVA (Analysis of Variance):
ANOVA is a statistical test used to determine if there are significant differences between
the means of two or more groups. It analyzes the variance within and between groups to
assess whether the differences observed are due to random chance or actual group differences.
ANOVA is commonly used when you have a continuous dependent variable and one or
more categorical independent variables with multiple levels. The test compares the means
across the groups and calculates an F-statistic and p-value to determine if the differences are
statistically significant.
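In R, a one-way ANOVA of this kind is usually fitted with aov(); a minimal sketch with
placeholder names (a continuous response y and a categorical factor group in a data frame dat,
all assumed):
fit <- aov(y ~ group, data = dat)   # one-way ANOVA: do the group means of y differ?
summary(fit)                        # ANOVA table with the F-statistic and p-value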
Chi-Square Test:
The Chi-Square test is a statistical test used to examine the association or independence
between two categorical variables. It compares the observed frequencies of each category
with the expected frequencies under the assumption of independence. The test determines
whether there is a significant relationship between the variables based on the discrepancies
between the observed and expected frequencies.
Chi-Square tests are often used when you have categorical data and want to determine if
there is a relationship between two variables. They are commonly used in fields such as the
social sciences.
Thus, ANOVA is used to compare means across multiple groups with continuous
dependent variables and categorical independent variables. On the other hand, Chi-
Square tests assess the association or independence between categorical variables. The
choice between ANOVA and Chi-Square depends on the nature of the variables you are
analyzing and the research question you want to answer.
Types of Chi-Square Tests
Single Categorical Variable:
Like the Z-test, the one-dimensional chi-squared test is also concerned with
comparing proportions but in a setting where there are more than two
proportions. A chi-squared test is used when you have k levels (or categories) of
a categorical variable and want to hypothesize about their relative frequencies to
find out what proportion of n observations fall into each defined category.
The test statistic is
χ2 = Σi (Oi − Ei)2 / Ei,
where Oi is the observed count and Ei is the expected count in the ith category, i = 1, ..., k.
The Oi are obtained directly from the raw data, and the expected counts, Ei = nπ0(i), are
merely the product of the overall sample size n with the respective null proportion for each
category.
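A sketch with made-up counts and null proportions (k = 3 categories, all values assumed for
illustration):
observed   <- c(20, 30, 50)                 # Oi, hypothetical observed counts (n = 100)
null.props <- c(0.25, 0.25, 0.5)            # π0(i), hypothesized proportions (must sum to 1)
chisq.test(x = observed, p = null.props)    # one-dimensional (goodness-of-fit) chi-squared test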
Two Categorical Variables:
The chi-squared test can also apply to the situation in which you have two
mutually exclusive and exhaustive categorical variables at hand—call them
variable A and variable B. It is used to detect whether there might be some
influential relationship (in other words, dependence) between A and B by
looking at the way in which the distributions of frequencies change together across their
categories. If there is no relationship, the distribution of frequencies in variable A will have
nothing to do with the distribution of frequencies in variable B. As such, this particular
variant of the chi-squared test is called a test of independence and is always performed with
the following hypotheses:
H0: Variables A and B are independent. (There is no relationship between A and B.)
HA: Variables A and B are not independent. (There is a relationship between A and B.)
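A sketch using a hypothetical 2 × 2 table of counts for variables A and B:
tab <- matrix(c(20, 30, 25, 25), nrow = 2)   # hypothetical cross-tabulation of A (rows) by B (columns)
chisq.test(tab)                              # test of independence; a small p-value favours HA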
Types of ANOVA
1. One-way ANOVA
2. Two-way ANOVA
A one-way ANOVA only involves one factor or independent variable. A two-way ANOVA
involves two independent variables and one dependent variable. The number of
observations (sample size) need not be the same in each group.
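A sketch of both forms with placeholder names (y a continuous response; f1 and f2 categorical
factors in a data frame dat, all assumed):
one.way <- aov(y ~ f1, data = dat)        # one factor
two.way <- aov(y ~ f1 + f2, data = dat)   # two factors (use f1 * f2 to also include their interaction)
summary(one.way)
summary(two.way)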