Linear Regression

Simple Linear Regression

Linear regression is a regression model that uses a straight line to describe the relationship
between variables. It finds the line of best fit through your data by searching for the value of
the regression coefficient(s) that minimizes the total error of the model.

Simple linear regression models describe the effect that a particular variable, called the
explanatory variable, might have on the value of a continuous outcome variable, called the
response variable.

General Concepts
The purpose of a linear regression model is to come up with a function that estimates the
mean of one variable given a particular value of another variable. These variables are known
as the response variable (the “outcome” variable whose mean you are attempting to find) and
the explanatory variable (the “predictor” variable whose value you already have).

Definition of the Model


The simple linear regression model states that the value of the response is expressed as the following equation:

Y|X = β0 + β1X + ε

Y|X reads as "the value of Y conditional upon the value of X."

The ε (epsilon) term represents random error.

Residual Assumptions

The assumptions made about ε are as follows:

 The value of ε is assumed to be normally distributed, ε ∼ N(0, σ).
 ε is centred at (that is, has a mean of) zero.
 The variance of ε, σ2, is constant.

Parameters

The value denoted by β0 is called the intercept, and that of β1 is called the slope. Together,
they are also referred to as the regression coefficients and are interpreted as follows:

• The intercept, β0, is interpreted as the expected value of the response variable when the
predictor is zero.

• Generally, the slope, β1, is the focus of interest. This is interpreted as the change in the
mean response for each one-unit increase in the predictor.

Estimating the Intercept and Slope Parameters


The fitted model of interest concerns the mean response value, denoted ŷ, for a specific value of the predictor, x, and is written as follows:

ŷ = β̂0 + β̂1x

where β̂0 and β̂1 are the estimates of the intercept and slope obtained from the observed data.
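As a minimal sketch (using the age and sbp values that appear in the example data later in these notes), the least-squares estimates can also be computed directly from the data, and the hand calculation should match what lm() reports:

x <- c(52, 59, 67, 73, 64)       # age values from the example data below
y <- c(130, 133, 150, 128, 151)  # corresponding sbp values

b1.hat <- cov(x, y) / var(x)          # slope estimate: Sxy / Sxx
b0.hat <- mean(y) - b1.hat * mean(x)  # intercept estimate
c(b0.hat, b1.hat)

coef(lm(y ~ x))   # the same estimates from lm(), for comparison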

Fitting Linear Models with lm


The lm() function creates a linear regression model in R. This function takes an R formula of the form Y ~ X, where Y is the outcome variable and X is the predictor variable.

In R, the command lm performs the estimation for you.

model <- lm(response ~ predictor, data = datasource)

example:

model_fit <- lm(sbp ~ age + weight, data = bprep)

The first argument is the now-familiar response ~ predictor formula, which specifies the desired model. Through data = datasource we instruct lm to look for the variables named in the formula within the object supplied to the data argument.

If you simply enter the name of the "lm" object at the prompt, it will provide the most basic output: a repeat of your call and the estimates of the intercept (β̂0) and the slope coefficients.

Call:
lm(formula = sbp ~ age + weight, data = bprep)

Coefficients:
(Intercept) age weight
43.8642 0.2734 0.4398
The fitted model is then written using these coefficients as follows:

ŷ = 43.8642 + 0.2734·x1 + 0.4398·x2

Substituting different values of the predictors x1 (age) and x2 (weight) gives the corresponding predicted values of the response y.


Summary:
The summary() function in R is useful for quickly summarizing the values in a vector, data frame, regression model, or ANOVA model. For a regression model, the first part of its output is a summary of the distribution of the residuals from the model.

Residuals : difference between the observed and predicted values.

Median : the median residual should be close to zero

The Min/Max and 1Q/3Q residuals should be roughly similar in magnitude (suggesting the residuals are symmetric)

Estimates are used to predict the value of the response variable

Std error is the average amount that the estimate varies from the actual value.

t-value=estimate/std.error

Pr(>|t|) gives a p value for the t-test to determine if the coefficient is significant.

R-squared gives a measurement of what % of the variance in the response variable can be
explained by the regression
Multiple R-squared typically increases each time you add a predictor (x) variable

Adjusted R-squared controls for each additional predictor added (to help prevent overfitting), so it may not increase as you add more variables

If your Multiple R-squared is much higher than your adjusted R-squared, your model might
be overfitting

The F-statistic indicates whether the model as a whole is statistically significant.
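A small sketch of how the quantities described above can be pulled out of a fitted model (assuming the model_fit object created with lm() earlier in these notes):

s <- summary(model_fit)

residuals(model_fit)   # residuals: observed minus fitted values
s$coefficients         # estimates, std. errors, t values and Pr(>|t|)
s$r.squared            # multiple R-squared
s$adj.r.squared        # adjusted R-squared
s$fstatistic           # F-statistic and its degrees of freedom
s$sigma                # residual standard error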

The fitted regression line can be added to an existing scatterplot with the abline function, as in

abline(model, lwd = 2)

where model is a fitted "lm" object with a single predictor (abline draws one straight line). A minimal sketch follows.
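The sketch below assumes the bprep data frame used later in these notes; simple_fit is a name chosen here purely for illustration:

simple_fit <- lm(sbp ~ weight, data = bprep)   # one-predictor fit
plot(sbp ~ weight, data = bprep)               # scatterplot of the data
abline(simple_fit, lwd = 2)                    # add the fitted line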

confint() function :
You can use the confint() function in R to calculate a confidence interval for one or more
parameters in a fitted regression model.

This function uses the following basic syntax:

confint(object, parm, level=0.95)

where:

 object: name of the fitted regression model
 parm: parameters to calculate the confidence interval for (default is all)
 level: confidence level to use (default is 0.95)

Example: How to Use confint() Function in R

confint(model_fit)
2.5 % 97.5 %
(Intercept) -131.1025620 218.830951
age -1.6770934 2.223863
weight -0.2633077 1.142859
coef() function:
To extract the coefficients of an "lm" object directly, the "direct-access" function to use is coef().

Ex:

# coefficients of the linear regression model
mycoefs <- coef(model_fit)
mycoefs

(Intercept)         age      weight
 43.8641943   0.2733849   0.4397756

Here, the regression coefficients are extracted from the fitted object and stored in mycoefs; the individual estimates can then be assigned to separate objects if needed, as shown below.
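For example (the object names beta0.hat, beta1.hat and beta2.hat are chosen purely for illustration):

beta0.hat <- mycoefs[1]   # intercept estimate
beta1.hat <- mycoefs[2]   # coefficient of age
beta2.hat <- mycoefs[3]   # coefficient of weight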

predict() function:
Given a fitted model, predict the response for a new set of data.

The predict() function in R is used to predict the values based on the input data.

Syntax :

predict(object, newdata, interval)

where

 object: the fitted model object (an object inheriting from the linear model class)
 newdata: a data frame of input values at which to predict
 interval: the type of interval calculation (e.g. "confidence" or "prediction")

# complete multiple linear regression example illustrating all the functions
bprep <- read.csv("F:/bpreport.csv")
bprep

  sbp age weight
1 130  52    165
2 133  59    167
3 150  67    180
4 128  73    155
5 151  64    212

# matrix of scatter plots
pairs(bprep)

# correlation matrix, rounded to 2 decimal places
round(cor(bprep), 2)

        sbp  age weight
sbp    1.00 0.19   0.87
age    0.19 1.00   0.00
weight 0.87 0.00   1.00

# fitting the model with lm
model_fit <- lm(sbp ~ age + weight, data = bprep)
model_fit

Call:
lm(formula = sbp ~ age + weight, data = bprep)

Coefficients:
(Intercept) age weight
43.8642 0.2734 0.4398

#summary of model_fit
summary(model_fit)

Call:
lm(formula = sbp ~ age + weight, data = bprep)

Residuals:
1 2 3 4 5
-0.6432 -0.4364 8.6594 -3.9865 -3.5933
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.8642 40.6649 1.079 0.394
age 0.2734 0.4533 0.603 0.608
weight 0.4398 0.1634 2.691 0.115

Residual standard error: 7.225 on 2 degrees of freedom


Multiple R-squared: 0.7917, Adjusted R-squared: 0.5834
F-statistic: 3.801 on 2 and 2 DF, p-value: 0.2083

# coefficients of the linear regression model
mycoefs <- coef(model_fit)
mycoefs

(Intercept)         age      weight
 43.8641943   0.2733849   0.4397756

confint(model_fit, level = 0.95)

                   2.5 %     97.5 %
(Intercept) -131.1025620 218.830951
age           -1.6770934   2.223863
weight        -0.2633077   1.142859

# predict sbp for a new observation with age = 49 and weight = 210
newdata <- data.frame(age = 49, weight = 210)
predict(model_fit, newdata)

1
149.6129

# predict sbp for age = 49 and weight = 210 with a confidence interval
ci <- predict(model_fit, newdata, interval = "confidence", level = 0.95)
ci

       fit      lwr     upr
1 149.6129 110.6869 188.539
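As a closing sketch (not part of the worked example above), the residual assumptions listed earlier can be checked informally with the standard diagnostic plots produced by calling plot() on an "lm" object:

par(mfrow = c(2, 2))
plot(model_fit)               # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
hist(residuals(model_fit))    # rough check that the residuals centre on zero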

Types of errors in hypothesis testing

Hypothesis testing is performed with the objective of obtaining a p-value in order to quantify
evidence against the null statement H0. This is rejected in favour of the alternative, Ha, if the
p-value is itself less than a predefined significance level α, which is conventionally 0.05 or 0.01.

To be able to test the validity of your rejection or retention of the null hypothesis, you must
be able to identify two kinds of errors:
Type I Errors
• A Type I error occurs when you incorrectly reject a true H0. In any given hypothesis test,
the probability of a Type I error is equivalent to the significance level α.

If your p-value is less than α, you reject the null hypothesis. If the null is really true, though, α directly defines the probability that you incorrectly reject it. This is referred to as a Type I error.

Type II error
A Type II error refers to incorrect retention of the null hypothesis—in other words, obtaining
a p-value greater than the significance level when it’s the alternative hypothesis that’s
actually true. For the same scenario you have been looking at so far (an upper-tailed test for a single sample mean), the probability of a Type II error is denoted β.
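A small simulation sketch (illustrative only, not from these notes) of what the two error rates mean for a two-sided one-sample t-test at α = 0.05:

set.seed(1)
alpha <- 0.05

# Type I error: the null (mu = 0) is true, yet we reject it
p.null <- replicate(10000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)
mean(p.null < alpha)    # proportion of false rejections; close to alpha

# Type II error: the alternative is true (true mean 0.3), yet we retain the null
p.alt <- replicate(10000, t.test(rnorm(30, mean = 0.3), mu = 0)$p.value)
mean(p.alt >= alpha)    # estimate of beta, the probability of incorrectly retaining H0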
ANOVA (Analysis of Variance):

ANOVA is a statistical test used to determine if there are significant differences between the means of two or more groups. It analyzes the variance within and between groups to assess whether the differences observed are due to random chance or actual group differences.

ANOVA is commonly used when you have a continuous dependent variable and one or more categorical independent variables with multiple levels. The test compares the means across the groups and calculates an F-statistic and p-value to determine if the differences are statistically significant.
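A minimal one-way ANOVA sketch in R (the growth data frame below is hypothetical, for illustration only):

growth <- data.frame(
  weight = c(4.8, 5.1, 5.3, 6.0, 6.2, 6.4, 5.5, 5.4, 5.7),
  diet   = factor(rep(c("A", "B", "C"), each = 3))
)
fit <- aov(weight ~ diet, data = growth)
summary(fit)   # F-statistic and p-value comparing the group means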

Chi-Square Test:

The Chi-Square test is a statistical test used to examine the association or independence between two categorical variables. It compares the observed frequencies of each category with the expected frequencies under the assumption of independence. The test determines whether there is a significant relationship between the variables based on the discrepancies between observed and expected frequencies.

Chi-Square tests are often used when you have categorical data and want to determine if there is a relationship between two variables. They are commonly used in fields such as social sciences, biology, market research, and quality control.

Thus, ANOVA is used to compare means across multiple groups with continuous dependent variables and categorical independent variables. On the other hand, Chi-Square tests assess the association or independence between categorical variables. The choice between ANOVA and Chi-Square depends on the nature of the variables you are analyzing and the research question you want to answer.

Types of Chi-Square Tests
Single Categorical Variable :

Like the Z-test, the one-dimensional chi-squared test is also concerned with
comparing proportions but in a setting where there are more than two
proportions. A chi-squared test is used when you have k levels (or categories) of
a categorical variable and want to hypothesize about their relative frequencies to
find out what proportion of n observations fall into each defined category.

Calculation: Chi-Squared Test of Distribution

The quantities of interest are the proportions of n observations in each of k categories, π1, . . ., πk, for a single mutually exclusive and exhaustive categorical variable. The null hypothesis defines hypothesized null values for each proportion; label these respectively as π0(1), . . ., π0(k). The test statistic χ2 is given as

χ2 = Σ (Oi − Ei)² / Ei, summed over i = 1, . . ., k

where Oi is the observed count and Ei is the expected count in the ith category. The Oi are obtained directly from the raw data, and the expected counts, Ei = nπ0(i), are merely the product of the overall sample size n with the respective null proportion for each category.
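A sketch of how this test is run in R with chisq.test(); the observed counts and null proportions below are hypothetical, for illustration only:

observed   <- c(20, 30, 50)          # O_i: observed counts in k = 3 categories (n = 100)
null.props <- c(0.25, 0.25, 0.50)    # pi_0(i): hypothesized proportions (must sum to 1)
chisq.test(x = observed, p = null.props)   # compares O_i with E_i = n * pi_0(i)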

Two Categorical Variables:

The chi-squared test can also apply to the situation in which you have two mutually exclusive and exhaustive categorical variables at hand—call them variable A and variable B. It is used to detect whether there might be some influential relationship (in other words, dependence) between A and B by looking at the way in which the distributions of frequencies change together with respect to their categories. If there is no relationship, the distribution of frequencies in variable A will have nothing to do with the distribution of frequencies in variable B. As such, this particular variant of the chi-squared test is called a test of independence and is always performed with the following hypotheses:

H0: Variables A and B are independent. (There is no relationship between A and B.)
HA: Variables A and B are not independent. (There is a relationship between A and B.)

skin <- matrix(c(20,32,8,52,9,72,8,32,16,64,30,12), 4, 3,
               dimnames = list(c("Injection","Tablet","Laser","Herbal"),
                               c("None","Partial","Full")))

A two-dimensional table presenting frequencies in this fashion is called a contingency table.

Calculation: Chi-Squared Test of Independence:

To compute the test statistic, presume the data are presented as a kr × kc contingency table, in other words, a matrix of counts, based on two categorical variables (both mutually exclusive and exhaustive). The focus of the test is the way in which the frequencies of N observations between the kr levels of the "row" variable and the kc levels of the "column" variable are jointly distributed. The test statistic χ2 is given by

χ2 = Σ Σ (Oij − Eij)² / Eij, summed over i = 1, . . ., kr and j = 1, . . ., kc

where Oij is the observed count in row i and column j, and the expected count under independence is Eij = (row i total × column j total) / N.
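Applied to the skin contingency table defined above, the test of independence is a single call; chisq.test() computes the expected counts Eij internally:

test <- chisq.test(skin)   # H0: row and column variables are independent
test                       # chi-squared statistic, degrees of freedom, p-value
test$expected              # the E_ij = (row total x column total) / N values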

Types of ANOVA

1. One-way ANOVA
2. Two-way ANOVA
A one-way ANOVA only involves one factor or independent variable. A two-way ANOVA
involves two independent variables and one dependent variable. The number of
observations (sample size) need not be the same in each group.
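A brief sketch contrasting the two calls in R (the dat data frame, with response y and factors A and B, is hypothetical and for illustration only):

set.seed(2)
dat <- data.frame(
  y = rnorm(18, mean = 10),
  A = factor(rep(c("a1", "a2", "a3"), times = 6)),   # first factor, 3 levels
  B = factor(rep(c("b1", "b2"), each = 9))           # second factor, 2 levels
)

one.way <- aov(y ~ A, data = dat)       # one-way ANOVA: one factor
two.way <- aov(y ~ A * B, data = dat)   # two-way ANOVA: two factors plus their interaction
summary(one.way)
summary(two.way)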
