Team8 Lab3

The document describes a report on performing multivariable linear regression analysis. It includes tasks assigned to three students: Trần Ngọc Xuân Mai to explain what multivariable linear regression is and use Python to perform the analysis; Trần Thị Mĩ Tiên to explain how it is used and use Excel; and Vũ Thị Thanh Xuân to explain why it is used and use R. The first task provides definitions and an example of multivariable linear regression. The second task involves using Excel, R and Python to perform the analysis on a dataset about colleges and universities.

Uploaded by

thanhxunvu218
Copyright: © All Rights Reserved

VIETNAM NATIONAL UNIVERSITY HO CHI MINH

UNIVERSITY OF INFORMATION TECHNOLOGY


INFORMATION SYSTEM FACULTY

REPORT LAB 3
STATISTICAL ANALYSIS

Lecturer: Assoc. Prof. Dr. Nguyễn Đình Thuân


TA: Nguyễn Minh Nhựt
Class: STAT3013.O12. CTTT
Students: Trần Ngọc Xuân Mai - 21522322
Trần Thị Mĩ Tiên - 21522674
Vũ Thị Thanh Xuân - 21522816

Table of contents
Task 1: Multivariable Linear Regression
What is Multivariable Linear Regression?
How is Multivariable Linear Regression used?
Why do we use Multivariable Linear Regression?
Example of Multivariable Linear Regression
Task 2: Perform Multivariable Linear Regression with data file “Colleges and Universities”
By using Microsoft Excel
By using R language
By using Python language

_______________________________

Student: Task
Trần Ngọc Xuân Mai – 21522322
- Explain what Multivariable Linear Regression is and give an example
- Use the Python language to perform Multivariable Linear Regression with the data file “Colleges and Universities”
Trần Thị Mĩ Tiên – 21522674
- Explain how Multivariable Linear Regression is used
- Use Excel to perform Multivariable Linear Regression with the data file “Colleges and Universities”
Vũ Thị Thanh Xuân – 21522816
- Explain why we use Multivariable Linear Regression and give an example
- Use the R language to perform Multivariable Linear Regression with the data file “Colleges and Universities”

Task 1
Explanation (What, How and Why) and example of Multivariable Linear Regression.
1. What is Multivariable Linear Regression?
Multiple regression, also known as multiple linear regression (MLR), is a
statistical technique that uses two or more explanatory variables to predict the
outcome of a response variable. It models the relationship between several
independent variables and one dependent variable. The independent variables
serve as predictor variables, while the single dependent variable serves as the
criterion variable. You can use this technique in a variety of contexts, studies
and disciplines, including econometrics and financial inference. [1]

2. How is Multivariable Linear Regression used?


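The model expresses each observed response as a linear combination of the explanatory variables plus an error term:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon$$ [2]

where, for each of the n observations (i = 1, ..., n):
$y_i$ = dependent variable
$x_{i1}, \dots, x_{ip}$ = explanatory variables
$\beta_0$ = y-intercept (constant term)
$\beta_1, \dots, \beta_p$ = slope coefficients for each explanatory variable
$\epsilon$ = the model’s error term (estimated by the residuals)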
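As a minimal sketch of how the equation is evaluated once the coefficients are known, the function below computes the predicted value for one observation. The function name `predict` and all the numbers are made up for illustration; they do not come from the report's data.

```python
def predict(intercept, slopes, x):
    """Return b0 + b1*x1 + ... + bp*xp for a single observation."""
    if len(slopes) != len(x):
        raise ValueError("need exactly one slope per explanatory variable")
    return intercept + sum(b * xj for b, xj in zip(slopes, x))


# Hypothetical coefficients and inputs, chosen only to show the arithmetic.
beta0 = 1.0
betas = [2.0, -0.5]
observation = [3.0, 4.0]

y_hat = predict(beta0, betas, observation)  # 1.0 + 2.0*3.0 - 0.5*4.0
print(y_hat)
```

The error term does not appear in the prediction: it is the part of the observed value that the linear combination leaves unexplained.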

3. Why do we use Multivariable Linear Regression?

- Prediction: It can be used to make predictions based on multiple factors. For example, predicting a house's price based on its size, number of bedrooms, and location.

- Causation: It can help identify which independent variables have a statistically significant association with the dependent variable. This supports reasoning about causal relationships, although a regression on observational data does not prove causation by itself.

- Control: It's useful in situations where you want to control or optimize an outcome by manipulating independent variables. For instance, optimizing production processes in manufacturing.

4. Example of Multivariable Linear Regression.
Suppose you want to predict a student's final exam score (Y) based on several factors (X1, X2, X3):

- X1: the number of hours the student studied.
- X2: the number of hours the student slept the night before the exam.
- X3: the student's previous test scores.

You collect data for 100 students, including their exam scores and values for X1, X2, X3. You
can then perform a multivariable linear regression analysis to determine how these factors
influence the final exam score. The output of the regression analysis will provide you with
the coefficients β0, β1, β2, β3, which will help you make predictions and understand the
relative importance of each factor in determining the exam score.
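The example can be sketched in Python with NumPy. The data below are synthetic: the 100 students, the "true" coefficients and the noise level are all assumptions invented for this illustration, so the printed estimates say nothing about real students.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of students, as in the example

# Synthetic predictors (assumed ranges, for illustration only).
hours_studied = rng.uniform(0, 20, n)      # X1
hours_slept = rng.uniform(4, 9, n)         # X2
previous_score = rng.uniform(40, 100, n)   # X3

# Made-up "true" coefficients: beta0, beta1, beta2, beta3.
true_beta = np.array([10.0, 1.5, 2.0, 0.5])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), hours_studied, hours_slept, previous_score])
exam_score = X @ true_beta + rng.normal(0, 3, n)  # add random noise

# Ordinary least squares estimate of the four coefficients.
beta_hat, *_ = np.linalg.lstsq(X, exam_score, rcond=None)
print(np.round(beta_hat, 2))
```

Because the data were generated from known coefficients, the estimated slopes should land close to 1.5, 2.0 and 0.5, which is a quick way to check that the fitting step behaves as expected.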

Task 2
a) Using MS Excel, R language and Python language to perform Multivariable Linear
Regression with data file: Colleges and Universities
Explain the problem:
Show Multivariable Linear Regression by Excel, R, Python with data file: Colleges and
Universities
MS Excel

Multiple R (Multiple Correlation Coefficient): Multiple R measures the strength of the
linear relationship between the observed values of the dependent variable and the values
predicted from the independent variables. It is never negative, so it does not indicate
direction. In this case, it is approximately 0.731, which indicates a moderately strong
linear relationship.

R Square (Coefficient of Determination): R Square represents the proportion of the
variance in the dependent variable that is explained by the independent variables. In this
case, it is approximately 0.534. This means that about 53.4% of the variance in the
dependent variable can be explained by the independent variables in your regression
model. A higher R Square suggests that your model is better at explaining the variability in
the dependent variable.

Adjusted R Square: Adjusted R Square is a modified version of R Square that accounts
for the number of predictors in the model. In this case, it is approximately 0.492. It adjusts
R Square for the degrees of freedom and provides a more accurate assessment of how well
your model fits the data, especially when there are multiple predictors. A higher adjusted
R Square is desirable, indicating a better model fit.

Standard Error (Standard Error of the Estimate): The standard error measures the
standard deviation of the residuals (the differences between observed and predicted
values). In this case, it is approximately 5.308. A lower standard error indicates that the
model's predictions are closer to the actual values.

Observations: This represents the sample size, which is 49 in your dataset.
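The relationship between these summary values can be checked with a few lines of Python. This is a sketch using the standard adjusted R Square formula, with the R Square of 53.44% reported in the R section, 49 observations and 4 predictors; the function name is only illustrative.

```python
import math


def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


r2 = 0.5344          # Multiple R-squared from the regression output
n_obs = 49           # observations
n_predictors = 4     # SAT, Acceptance, Expenditures, Top 10% HS

multiple_r = math.sqrt(r2)
adj_r2 = adjusted_r_squared(r2, n_obs, n_predictors)

print(round(multiple_r, 3))
print(round(adj_r2, 4))
```

The results agree with the reported Multiple R of about 0.731 and Adjusted R Square of about 0.492.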

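We will calculate the regression equation from the coefficients in the Excel output. Substituting them into the general equation of the model, we have:

Y (% Graduation) = 17.92 + 0.072*SAT – 24.85*Acceptance – 0.00013*Expenditures – 0.1627*Top 10% HS

From the equation above we can calculate the predicted Y for each observation, then find the sums of squares SSR, SSE and SST.

Then R Square = SSR / SST = 1 – SSE / SST ≈ 0.534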

Adjusted R Square is a modified form of R Square that takes into account the number of
predictors in the model [1, adapted]. This metric becomes relevant when there are multiple
independent variables in the analysis.

Standard Error represents the degree of variation between the observed and predicted
values of the dependent variable (Y) [2, adapted].

Observations refer to the sample size in the dataset. In Figure 1, there are 49 samples.

Regarding Hypothesis Tests for Regression Coefficients:

- The Null Hypothesis (H0) asserts that βi = 0, meaning that the independent variable Xi is not statistically significant [3, adapted].

- The Alternative Hypothesis (H1) asserts that βi ≠ 0, meaning that the independent variable Xi is statistically significant.

Each coefficient's p-value is compared with a significance level of α = 5% to assess whether the coefficient is significant.
The p-value measures the probability of observing a test statistic as extreme as, or more
extreme than, the one calculated from your sample data under the null hypothesis.

A smaller p-value indicates stronger evidence against the null hypothesis. In other words,
if the p-value is very low (typically below a chosen significance level, such as 0.05), it
suggests that the data provides strong evidence to reject the null hypothesis in favor of the
alternative hypothesis. [ChatGPT]
Conversely, a larger p-value (close to 1) suggests that the data doesn't provide strong
evidence against the null hypothesis, and you may not have reason to reject it. [ChatGPT]

In the provided table, for example:

- X Variable 1 has a very small P-value of 0.000236106, indicating strong evidence against the null hypothesis for this variable.
- X Variable 2 also has a small P-value of 0.004559569, which is likewise well below 0.05 and gives clear evidence against the null hypothesis for this variable.
- X Variable 3 and X Variable 4 have larger P-values, 0.045600178 and 0.046213848, respectively. Both are still below 0.05, so these variables are significant at the 5% level, but the evidence is weaker and they would not be significant at a stricter level such as 0.01.
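The decision rule can be written out as a short Python sketch. The p-values are the ones quoted from the table; the helper name `significant_at` and the extra 1% level are illustrative choices, not part of the original analysis.

```python
def significant_at(p_values, alpha):
    """Return the variable names whose p-value is below alpha, in input order."""
    return [name for name, p in p_values.items() if p < alpha]


# P-values as reported in the Excel coefficient table.
p_values = {
    "X Variable 1": 0.000236106,
    "X Variable 2": 0.004559569,
    "X Variable 3": 0.045600178,
    "X Variable 4": 0.046213848,
}

print("Reject H0 at 5%:", significant_at(p_values, 0.05))
print("Reject H0 at 1%:", significant_at(p_values, 0.01))
```

At α = 5% all four null hypotheses are rejected, while at the stricter 1% level only the first two variables remain significant.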

Link: https://docs.google.com/spreadsheets/d/1tcpSCgSfvbSoS0SftGbSt4D4ELyL294s/edit?usp=sharing&ouid=107710634019894140302&rtpof=true&sd=true

R language

- Residual Standard Error: It represents the standard deviation of the residuals, which measures the typical distance between the observed values and the predicted values.
- Multiple R-squared: It indicates the proportion of the variance in the dependent variable 'Graduation' explained by the independent variables. In this case, it's 53.44%.
- Adjusted R-squared: It's a modified version of R-squared that adjusts for the number of independent variables in the model. It's 49.21% in this case.
- F-statistic: It tests the overall significance of the regression model. A low p-value (6.332e-07) suggests that at least one independent variable has a significant effect on 'Graduation.'
In summary, this linear regression model is used to predict 'Graduation' based
on the provided independent variables. The coefficients and their associated
statistics tell you the strength and direction of the relationships between
'Graduation' and the independent variables, while the R-squared values provide
an overall measure of how well the model fits the data.
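The overall F-statistic is tied directly to R-squared, which the following Python sketch illustrates. It uses the standard formula with the reported R-squared of 53.44%, 49 observations and 4 predictors. The value is derived here from those numbers rather than copied from the R output, so small rounding differences are possible.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """F = (R^2 / p) / ((1 - R^2) / (n - p - 1))."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


n_obs, n_predictors = 49, 4
f_value = f_statistic(0.5344, n_obs, n_predictors)

# Degrees of freedom used to look up the p-value of the F test.
df_model = n_predictors
df_residual = n_obs - n_predictors - 1

print(round(f_value, 2), df_model, df_residual)
```

An F value this far above 1 on 4 and 44 degrees of freedom is what produces the very small p-value reported for the model.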

Link R:
https://drive.google.com/file/d/1AOtt4LQ2plLRcQ82Wp7FJUbJ0AEP1LS_/view?usp=sharing
Python language
1. Import the necessary libraries:

2. Load the "Colleges and Universities" data file into a Pandas DataFrame:

3. Split the data into the independent variables X and the dependent variable y:

- The dependent variable is "Graduation," denoted by y.
- The independent variables are "MedianSAT," "AcceptanceRate," "Expenditures/Student," and "TopHS," denoted by X.

4. Fit the multivariable linear regression model to X and y using the LinearRegression class from the scikit-learn (sklearn) library.
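Steps 1 to 4 might look like the sketch below. The original code is not reproduced in this report, so this is a reconstruction under assumptions: the data file's name is not given, and the DataFrame is therefore filled with synthetic values that only reuse the column names listed above. The fitted numbers will not match the report's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Step 2 stand-in: the real analysis loads the "Colleges and Universities"
# file (for example with pd.read_excel and the file's actual path). Synthetic
# rows are generated here so that the example runs on its own.
rng = np.random.default_rng(42)
n = 49
df = pd.DataFrame({
    "MedianSAT": rng.uniform(1000, 1500, n),
    "AcceptanceRate": rng.uniform(10, 80, n),
    "Expenditures/Student": rng.uniform(15000, 100000, n),
    "TopHS": rng.uniform(30, 100, n),
})
# Made-up relationship plus noise, for illustration only.
df["Graduation"] = (
    20 + 0.05 * df["MedianSAT"] - 0.2 * df["AcceptanceRate"]
    + rng.normal(0, 5, n)
)

# Step 3: independent variables X and dependent variable y.
feature_names = ["MedianSAT", "AcceptanceRate", "Expenditures/Student", "TopHS"]
X = df[feature_names]
y = df["Graduation"]

# Step 4: fit the model.
model = LinearRegression()
model.fit(X, y)

print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(feature_names, model.coef_)))
print("R squared:", model.score(X, y))
```

`model.score(X, y)` returns the R Square of the fitted model on the data it is given, which is the value the report reads off in the next step.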

5. Print the results by retrieving the fitted values (intercept, coefficients and R Square) from the scikit-learn model.

- R Square is 0.53
- Explain:
+ We are given the formula $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$, where y depends on k variables $x_1, x_2, \dots, x_k$, and we take a sample with n observations. Here, $b_0$ represents the intercept term, while $b_1, b_2, \dots, b_k$ are the regression coefficients for the independent variables $x_1, x_2, \dots, x_k$.
+ To determine the regression equation, we arrange the data into two matrices X and Y.
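· Matrix X: n rows and 5 columns (a column of ones for the intercept, followed by one column for each of the four independent variables).

· Transpose of matrix X: $X^T$, with 5 rows and n columns.

· Matrix Y: n rows and 1 column, holding the observed Graduation values.

· Matrix C has 5 rows and 1 column and contains the coefficients, given by the formula:

$$C = (X^T X)^{-1} X^T Y$$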

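· Therefore, we can deduce the following formula:

Graduation = 17.9209 + 0.072 x MedianSAT - 0.2485 x AcceptanceRate - 0.000135 x Expenditures/Student - 0.1627 x TopHS

The Excel section reports –24.85 for the acceptance coefficient. The two values are consistent if the acceptance rate is entered as a fraction there and as a percentage here.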
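The matrix computation of the coefficients can be sketched with NumPy as follows. The data are synthetic (49 rows and 4 predictors, to match the shapes described above), and the "true" coefficient values are invented for this illustration. The sketch solves the normal equations instead of forming the inverse explicitly, which gives the same coefficients and is usually more stable numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 49, 4

# Matrix X: a column of ones followed by k predictor columns (synthetic).
predictors = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), predictors])

# Matrix Y generated from made-up coefficients plus a little noise.
true_C = np.array([2.0, 1.0, -1.0, 0.5, 0.0])
Y = X @ true_C + rng.normal(scale=0.1, size=n)

# C = (X^T X)^{-1} X^T Y, obtained by solving (X^T X) C = X^T Y.
C = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least squares routine.
C_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(C.shape)
print(np.allclose(C, C_lstsq))
```

C has 5 entries: the intercept first, then one coefficient per predictor, in the same order as the columns of X.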

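· The formula for calculating the coefficient of determination $R^2$ is as follows:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

SST: total sum of squares, the squared deviations of the observed values from their mean, $SST = \sum_i (y_i - \bar{y})^2$

SSR: regression sum of squares, the squared deviations of the predicted values from the mean, $SSR = \sum_i (\hat{y}_i - \bar{y})^2$

SSE: error sum of squares, the squared deviations of the observed values from the predicted values, $SSE = \sum_i (y_i - \hat{y}_i)^2$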
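The three sums of squares and their link to R Square can be checked on a tiny example. The five data points below are made up for illustration and use a single predictor so that the numbers are easy to verify by hand.

```python
import numpy as np

# Small made-up data set: one predictor x and response y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least squares fit with an intercept column.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

y_mean = y.mean()
sst = np.sum((y - y_mean) ** 2)      # total variation
ssr = np.sum((y_hat - y_mean) ** 2)  # variation explained by the model
sse = np.sum((y - y_hat) ** 2)       # unexplained (residual) variation

r2 = ssr / sst
print(sst, ssr, sse, r2)
```

For a least squares fit that includes an intercept, SST = SSR + SSE, so SSR / SST and 1 − SSE / SST give the same R Square (0.6 for this data).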


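Conclusion: With an R Square of 0.53, this linear regression model explains about 53% of the variance in Graduation. This is a moderate fit: the model is useful, but roughly 47% of the variation remains unexplained.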

Link Python:
https://drive.google.com/file/d/1DlK2WjYnfzUVJd6ARScn6yDQtPNjDmS2/view?usp=sharing

REFERENCES

[1]. Multiple Linear Regression (MLR) Definition, Formula, and Example

[2]. Formula of MLR
