Team8 Lab3

The document describes a report on performing multivariable linear regression analysis. It includes tasks assigned to three students: Trần Ngọc Xuân Mai to explain what multivariable linear regression is and use Python to perform the analysis; Trần Thị Mĩ Tiên to explain how it is used and use Excel; and Vũ Thị Thanh Xuân to explain why it is used and use R. The first task provides definitions and an example of multivariable linear regression. The second task involves using Excel, R and Python to perform the analysis on a dataset about colleges and universities.

Uploaded by

thanhxunvu218


VIETNAM NATIONAL UNIVERSITY HO CHI MINH

UNIVERSITY OF INFORMATION TECHNOLOGY

INFORMATION SYSTEM FACULTY

REPORT LAB 3
STATISTICAL ANALYSIS

Lecturer: Assoc. Prof. Dr. Nguyễn Đình Thuân

TA: Nguyễn Minh Nhựt
Class: STAT3013.O12. CTTT
Students: Trần Ngọc Xuân Mai - 21522322
Trần Thị Mĩ Tiên - 21522674
Vũ Thị Thanh Xuân - 21522816

Table of contents
Task 1: Multivariable Linear Regression
What is Multivariable Linear Regression?
How is Multivariable Linear Regression used?
Why do we use Multivariable Linear Regression?
Example of Multivariable Linear Regression
Task 2: Perform Multivariable Linear Regression with data file “Colleges and Universities”
By using Microsoft Excel
By using R language
By using Python language

_______________________________

Task assignment
Trần Ngọc Xuân Mai - 21522322
- Explain what Multivariable Linear Regression is and give an example
- Use Python to perform Multivariable Linear Regression with the data file “Colleges and Universities”
Trần Thị Mĩ Tiên - 21522674
- Explain how Multivariable Linear Regression is used
- Use Excel to perform Multivariable Linear Regression with the data file “Colleges and Universities”
Vũ Thị Thanh Xuân - 21522816
- Explain why we use Multivariable Linear Regression and give an example
- Use R to perform Multivariable Linear Regression with the data file “Colleges and Universities”

Task 1
• Explanation (What, How and Why) and example of Multivariable Linear Regression.
1. What is Multivariable Linear Regression?
Multiple regression, also known as multiple linear regression (MLR), is a
statistical technique that uses two or more explanatory variables to predict the
outcome of a response variable. It models the relationship between multiple
independent variables and one dependent variable. The independent
variables serve as predictor variables, while the single dependent variable serves
as the criterion variable. The technique is used in a variety of contexts,
studies and disciplines, including econometrics and financial inference. [1]

2. How is Multivariable Linear Regression used?

y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_p x_{ip} + ϵ_i [2]
where, for observations i = 1, ..., n:
y_i = dependent variable
x_{i1}, ..., x_{ip} = explanatory variables
β_0 = y-intercept (constant term)
β_1, ..., β_p = slope coefficients for each explanatory variable
ϵ_i = the model’s error term (estimated from the data by the residuals)
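
As a minimal sketch of how the equation above is evaluated for a single observation, the snippet below computes the predicted value from an intercept and a vector of slope coefficients. The function name `predict` and all numbers are hypothetical and serve only to illustrate the arithmetic; the error term is omitted because it is unknown for a new observation.

```python
import numpy as np

def predict(beta0, betas, x):
    """Return beta0 + sum_j betas[j] * x[j] for one observation (error term omitted)."""
    return beta0 + float(np.dot(betas, x))

# Hypothetical coefficients and inputs, chosen only to illustrate the arithmetic.
beta0 = 1.0
betas = np.array([2.0, -0.5])
x_i = np.array([3.0, 4.0])

y_hat = predict(beta0, betas, x_i)  # 1.0 + 2.0*3.0 - 0.5*4.0
print(y_hat)
```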

3. Why do we use Multivariable Linear Regression?

• Prediction: It can be used to make predictions based on multiple factors.
For example, predicting a house's price based on its size, number of
bedrooms, and location.

• Causation: It can help identify which independent variables have a
statistically significant association with the dependent variable. This can
support the study of causal relationships, although regression on
observational data does not by itself prove causation.

• Control: It is useful in situations where you want to control or optimize an
outcome by manipulating independent variables. For instance, optimizing
production processes in manufacturing.

4. Example of Multivariable Linear Regression
Suppose you want to predict a student's final exam score (Y) based on several factors
(X1, X2, X3):

• X1: the number of hours the student studied.
• X2: the number of hours the student slept the night before the exam.
• X3: the student's previous test scores.

You collect data for 100 students, including their exam scores and values for X1, X2 and X3. You
can then perform a multivariable linear regression analysis to determine how these factors
influence the final exam score. The output of the regression analysis will provide you with
the coefficients β0, β1, β2, β3, which will help you make predictions and understand the
relative importance of each factor in determining the exam score.
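
The exam-score example can be sketched in Python as follows. No real student data is used here: the 100 observations are simulated, and the "true" coefficients, value ranges and noise level are all assumptions made only so that the fitting step has something to recover. The fit itself uses NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of students, as in the example

# Simulated predictors (illustrative ranges, not real data).
hours_studied = rng.uniform(0, 10, n)     # X1
hours_slept = rng.uniform(4, 9, n)        # X2
previous_score = rng.uniform(40, 100, n)  # X3

# Hypothetical "true" coefficients used only to generate the simulated scores.
true_beta = np.array([10.0, 3.0, 1.5, 0.5])  # beta0, beta1, beta2, beta3
X = np.column_stack([np.ones(n), hours_studied, hours_slept, previous_score])
y = X @ true_beta + rng.normal(0, 2.0, n)

# Ordinary least squares estimate of beta0..beta3.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["beta0", "beta1", "beta2", "beta3"], beta_hat.round(3))))
```

With simulated data of this kind, the estimated slopes should land close to the values used to generate the scores, which is a quick way to check that a fitting procedure behaves as expected before applying it to real data.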

Task 2
a) Using MS Excel, R language and Python language to perform Multivariable Linear
Regression with data file: Colleges and Universities
Explain the problem:
Show Multivariable Linear Regression by Excel, R, Python with data file: Colleges and
Universities
MS Excel

Multiple R (Multiple Correlation Coefficient): Multiple R measures the strength of the
linear relationship between the dependent variable and the independent variables taken
together. It equals the correlation between the observed and predicted values and is never
negative, so it does not indicate direction. In this case, it is approximately 0.731, which
indicates a moderately strong linear relationship.

R Square (Coefficient of Determination): R Square represents the proportion of the
variance in the dependent variable that is explained by the independent variables. In this
case, it is approximately 0.534. This means that about 53.4% of the variance in the
dependent variable can be explained by the independent variables in the regression
model. A higher R Square means the model explains more of the variability in
the dependent variable.

Adjusted R Square: Adjusted R Square is a modified version of R Square that accounts
for the number of predictors in the model. In this case, it is approximately 0.492. It adjusts
R Square for the degrees of freedom and gives a fairer assessment of how well
the model fits the data when there are multiple predictors, because it does not increase
automatically when a predictor is added. A higher adjusted R Square indicates a better model fit.

Standard Error (Standard Error of the Estimate): The standard error measures the
standard deviation of the residuals (the differences between observed and predicted
values). In this case, it is approximately 5.308. A lower standard error indicates that the
model's predictions are closer to the actual values.

Observations: This represents the sample size, which is 49 in this dataset.
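
The relationship between these summary statistics can be checked with a few lines of Python. This sketch assumes the standard definitions, Multiple R = sqrt(R²) and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), and takes R² = 0.5344 (the value reported in the R output of this report), n = 49 observations and p = 4 predictors.

```python
import math

r_squared = 0.5344   # Multiple R-squared reported for this model
n = 49               # observations
p = 4                # predictors: SAT, acceptance rate, expenditures, top 10% HS

multiple_r = math.sqrt(r_squared)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(multiple_r, 3))          # 0.731
print(round(adjusted_r_squared, 3))  # 0.492
```

Both values agree with the Excel summary output described above.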

We will calculate:

From this equation:

We will have:

Y (% Graduation) = 17.92 + 0.072*SAT – 24.85*Acceptance – 0.00013*Expenditures – 0.1627*Top 10% HS

From the equation above we can calculate Y then find the

Then R Square =

Adjusted R Square is a modified form of R-squared that takes into account the number of
predictors in the model [1 with editing]. This metric becomes relevant when there are multiple
independent variables in the analysis.

Standard Error represents the degree of variation between the observed and predicted
values of the dependent variable (Y) [2 with editing].

Observations refer to the sample size in the dataset. In Figure 1, there are 49 samples.

Regarding Hypothesis Tests for Regression Coefficients:

• The Null Hypothesis (H0) asserts that βi = 0, meaning that the independent
variable Xi has no linear effect on Y once the other variables are in the model [3 with editing].

• The Alternative Hypothesis (H1) states that βi ≠ 0, meaning that the
independent variable Xi is statistically significant.

The p-value of each coefficient is compared with a significance level of α = 5% to assess
whether the coefficient is significant.
The p-value measures the probability of observing a test statistic as extreme as, or more
extreme than, the one calculated from your sample data under the null hypothesis.

A smaller p-value indicates stronger evidence against the null hypothesis. In other words,
if the p-value is very low (typically below a chosen significance level, such as 0.05), it
suggests that the data provides strong evidence to reject the null hypothesis in favor of the
alternative hypothesis. [chatgpt]
Conversely, a larger p-value (close to 1) suggests that the data doesn't provide strong
evidence against the null hypothesis, and you may not have reason to reject it. [chatgpt]

In the provided table, for example:

• X Variable 1 has a very small P-value of 0.000236106, indicating strong evidence
against the null hypothesis for this variable.
• X Variable 2 also has a small P-value of 0.004559569, which is below 0.01 and gives
clear evidence against the null hypothesis for this variable.
• X Variable 3 and X Variable 4 have larger P-values, 0.045600178 and
0.046213848, respectively. Both are just below 0.05, so these variables are
significant at the 5% level, but the evidence is weaker and they would not be
significant at a stricter level such as 1%.
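
The decision rule can be written out explicitly. The sketch below compares each p-value reported in the Excel coefficient table with α = 0.05; the dictionary keys follow Excel's generic "X Variable" labels, and the variable names in the code are illustrative.

```python
alpha = 0.05  # significance level used in the report

# p-values as reported in the Excel coefficient table.
p_values = {
    "X Variable 1": 0.000236106,
    "X Variable 2": 0.004559569,
    "X Variable 3": 0.045600178,
    "X Variable 4": 0.046213848,
}

# Reject H0 (beta_i = 0) when the p-value is below alpha.
decisions = {name: p < alpha for name, p in p_values.items()}
for name, reject in decisions.items():
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"{name}: p = {p_values[name]:.6f} -> {verdict}")
```

Changing `alpha` to 0.01 would leave only the first two variables significant, which shows how sensitive the conclusion for X Variable 3 and X Variable 4 is to the chosen level.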

Link: https://docs.google.com/spreadsheets/d/1tcpSCgSfvbSoS0SftGbSt4D4ELyL294s/edit?usp=sharing&ouid=107710634019894140302&rtpof=true&sd=true

R language

• Residual Standard Error:
It represents the standard deviation of the residuals, which measures the
typical distance between the observed values and the predicted values.
• Multiple R-squared:
It indicates the proportion of the variance in the dependent variable
'Graduation' explained by the independent variables. In this case, it's 53.44%.
• Adjusted R-squared:
It's a modified version of R-squared that adjusts for the number of independent
variables in the model. It's 49.21% in this case.
• F-statistic:
It tests the overall significance of the regression model.
A low p-value (6.332e-07) suggests that at least one independent variable has a
significant effect on 'Graduation.'
In summary, this linear regression model is used to predict 'Graduation' based
on the provided independent variables. The coefficients and their associated
statistics tell you the strength and direction of the relationships between
'Graduation' and the independent variables, while the R-squared values provide
an overall measure of how well the model fits the data.
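
The overall F-statistic can be recovered from R-squared alone. The following sketch assumes the standard formula F = (R²/p) / ((1 - R²)/(n - p - 1)) with n = 49 observations and p = 4 predictors; the F value itself is computed here and is not quoted from the R output.

```python
r_squared = 0.5344  # Multiple R-squared from the R output
n, p = 49, 4        # observations and predictors

df_model = p          # numerator degrees of freedom
df_resid = n - p - 1  # denominator degrees of freedom
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)

print(df_model, df_resid, round(f_stat, 2))
```

A large F on (4, 44) degrees of freedom corresponds to a very small p-value, which is consistent with the 6.332e-07 reported above.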

Link R:
https://drive.google.com/file/d/1AOtt4LQ2plLRcQ82Wp7FJUbJ0AEP1LS_/view?usp=sharing
Python language
1. Import the necessary libraries:

2. Load the "Colleges and Universities" data file into a Pandas DataFrame:

3. Split the data into a matrix X holding the independent variables and a
vector y holding the dependent variable:

• The dependent variable is "Graduation", denoted by y.

• The independent variables are "MedianSAT", "AcceptanceRate",
"Expenditures/Student" and "TopHS".

4. Fit the multivariable linear regression model on X and y using the
LinearRegression class from scikit-learn (sklearn.linear_model).

5. To print the results, retrieve the common values for the linear regression
model using the scikit-learn API.
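
Steps 1 to 5 can be sketched as follows. This is not the report's original code, which is only available through the link at the end of this section; the helper name `fit_graduation_model` and the CSV file name in the usage comment are assumptions, while the column names are the ones listed in step 3.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["MedianSAT", "AcceptanceRate", "Expenditures/Student", "TopHS"]
TARGET = "Graduation"

def fit_graduation_model(df: pd.DataFrame):
    """Fit Graduation on the four predictors; return (model, R^2 on the same data)."""
    X = df[FEATURES]   # independent variables
    y = df[TARGET]     # dependent variable
    model = LinearRegression().fit(X, y)
    return model, model.score(X, y)

# Usage, assuming a hypothetical file name for the data set:
# df = pd.read_csv("colleges_and_universities.csv")
# model, r2 = fit_graduation_model(df)
# print(model.intercept_, dict(zip(FEATURES, model.coef_)), r2)
```

`model.intercept_` and `model.coef_` hold the fitted intercept and slopes, and `model.score` returns R-squared for the data it is given.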

• R Square is 0.53
- Explain:
+ We are given the formula y = b0 + b1x1 + b2x2 + ... + bkxk, where y depends on k variables
x1, ..., xk, and we take a sample with n observations. Here, b0 represents the intercept term, while
b1, ..., bk are the regression coefficients for the independent variables.
+ To determine the regression equation, we arrange the data into two matrices X and Y.
· Matrix X:

· Transpose of matrix X:

· Matrix Y:

· Matrix C has 5 rows and 1 column and contains the coefficients (the intercept and the four slopes), computed with the formula:
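
The formula itself did not survive in this copy of the report. Given that the report builds X, its transpose and Y, a reasonable assumption is the standard least-squares solution C = (XᵀX)⁻¹XᵀY. The sketch below implements that assumed formula with NumPy on a tiny made-up data set; the function name `ols_coefficients` and the numbers are illustrative only.

```python
import numpy as np

def ols_coefficients(X, Y):
    """Solve (X^T X) C = X^T Y for C; X must include a leading column of ones."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Tiny made-up example with 2 predictors (the report's X has 49 rows and 5 columns).
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 4.0],
    [1.0, 4.0, 3.0],
    [1.0, 5.0, 5.0],
])
Y = np.array([[6.0], [7.0], [13.0], [14.0], [18.0]])

C = ols_coefficients(X, Y)  # first entry is the intercept, the rest are slopes
print(C.ravel())
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is numerically safer than computing (XᵀX)⁻¹ and multiplying.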

· Therefore, we can deduce the following formula:

Graduation = 17.9209 + 0.072 * MedianSAT - 0.2485 * AcceptanceRate - 0.000135 * Expenditures/Student - 0.1627 * TopHS
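
To show how the fitted equation is used for prediction, the sketch below evaluates it for one hypothetical college. The coefficients are copied from the equation above; the input values are invented. It assumes AcceptanceRate and TopHS are entered in percentage points (for example 30 for 30%), which would be consistent with the AcceptanceRate coefficient here being 100 times smaller than the -24.85 in the Excel equation, but the report does not state the units.

```python
def predict_graduation(median_sat, acceptance_rate, expenditures_per_student, top_hs):
    """Evaluate the fitted equation reported above (coefficients copied from the report)."""
    return (17.9209
            + 0.072 * median_sat
            - 0.2485 * acceptance_rate        # assumed to be in percentage points
            - 0.000135 * expenditures_per_student
            - 0.1627 * top_hs)                # assumed to be in percentage points

# Hypothetical college, for illustration only.
example = predict_graduation(median_sat=1300, acceptance_rate=30,
                             expenditures_per_student=30000, top_hs=80)
print(round(example, 2))
```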

· The formula for calculating the coefficient of determination R² is as follows:

SST (total sum of squares): sum of squared deviations of the observed values from their mean

SSR (regression sum of squares): sum of squared deviations of the predicted values from the mean of the observed values

SSE (error sum of squares): sum of squared differences between the observed and predicted values
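
The formula image is missing from this copy of the report. Assuming the standard definitions, with y_i the observed values, ŷ_i the predicted values and ȳ the mean of the observed values, the three sums of squares and R² can be written as:

```latex
\begin{aligned}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2, \\
SSR &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \\
SSE &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.
\end{aligned}
```

The two expressions for R² agree because SST = SSR + SSE holds for a least-squares fit that includes an intercept.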

Conclusion: With an R square of 0.53, this linear regression model explains about 53% of the
variance in Graduation, so it fits the data moderately well.

Link Python:
https://drive.google.com/file/d/1DlK2WjYnfzUVJd6ARScn6yDQtPNjDmS2/view?usp=sharing

REFERENCES

[1]. Multiple Linear Regression (MLR) Definition, Formula, and Example

[2]. Formula of MLR
