Introduction to Regression Analysis and Measuring with Statsmodels
Regression is a type of machine learning that helps in finding the relationship between
independent and dependent variables. Just as predictive analytics is a tool for machine learning
and big data, regression modeling is a tool for predictive analytics.
In simple words, regression can be defined as a machine learning problem where we have to
predict continuous/real values such as price, rating, fees, etc.
Regression is one of the most common models of machine learning. It differs from
classification models because it estimates a numerical value, whereas classification models
identify which category an observation belongs to.
o Dependent Variable: The main factor in regression analysis that we want to predict
or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors that affect the dependent variable, or that are
used to predict its values, are called independent variables, also known as predictors.
o Outliers: An outlier is an observation with either a very low or a very high value
in comparison to the other observed values. An outlier may distort the result, so it
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other,
the condition is called multicollinearity. It should not be present in the dataset,
because it creates problems when ranking the most influential variables (see the
sketch after this list).
o Underfitting and Overfitting: If our algorithm works well with the training dataset
but not with the test dataset, the problem is called overfitting. If our algorithm does
not perform well even on the training dataset, the problem is called underfitting.
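Multicollinearity is easy to check in practice with a correlation matrix. A minimal sketch,
using a small made-up dataset (the column names and values are illustrative, not from the text):

import pandas as pd

# Two predictors that move together (x2 is roughly 2 * x1) plus an unrelated one
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                   'x2': [2.1, 3.9, 6.0, 8.1, 9.9],
                   'x3': [7, 3, 9, 1, 5]})
print(df.corr())  # a pairwise correlation near +1 or -1 flags multicollinearity

Here the x1/x2 correlation is close to 1, so one of those two columns would typically be
dropped before fitting a regression.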
3.2 MEASURE OF LINEAR RELATIONSHIP
Measures of Relationship
Statistical measures show a relationship between two or more variables or two or more sets of
data. For example, there is generally a high relationship, or correlation, between parents'
education and academic achievement. On the other hand, there is generally no correlation
between a person's height and academic achievement. The major statistical measure of
relationship is the correlation coefficient.
Correlation Coefficients:
1. Pearson's Product Moment Coefficient (r) is the most often used and most precise
coefficient; it is generally used with continuous variables.
2. Spearman's Rank Order Coefficient (ρ) is a form of Pearson's Product Moment
Coefficient that can be used with ordinal or ranked data.
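Both coefficients are available in SciPy. A minimal sketch with illustrative data (the values
below are not from the text):

from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(stats.pearsonr(x, y))    # Pearson's r with its p-value
print(stats.spearmanr(x, y))   # Spearman's rho with its p-value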
Linear (Line) Representations of Correlation Coefficients
Example:
Calculate the linear correlation coefficient for the following data: X = 4, 8, 12, 16 and
Y = 5, 10, 15, 20.
Solution:
To find the linear correlation coefficient for these data, we first construct a table of the
required values.

X     Y     XY    X²    Y²
4     5     20    16    25
8     10    80    64    100
12    15    180   144   225
16    20    320   256   400
ΣX = 40   ΣY = 50   ΣXY = 600   ΣX² = 480   ΣY² = 750

According to the formula for the linear correlation coefficient,

r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}

so with n = 4,

r = (4 × 600 − 40 × 50) / √[(4 × 480 − 40²)(4 × 750 − 50²)] = 400 / √(320 × 500) = 400 / 400 = 1.

Since Y = 1.25X exactly, the data are perfectly positively correlated and r = 1.
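The hand calculation can be cross-checked in a couple of lines of Python:

import numpy as np

x = np.array([4, 8, 12, 16])
y = np.array([5, 10, 15, 20])
print(np.corrcoef(x, y)[0, 1])  # prints 1.0: a perfect positive linear correlation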
3.3 REGRESSION WITH STATSMODELS
Statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models, as well as for conducting statistical tests and statistical data
exploration. An extensive list of result statistics is available for each estimator. The results are
tested against existing statistical packages to ensure that they are correct.
Statsmodels supports specifying models using pandas DataFrames. Here is a simple example
using ordinary least squares.
The goal here is to predict/estimate the stock index price based on two macroeconomic
variables: the interest rate and the unemployment rate.
Here is the complete syntax to perform the linear regression in Python using statsmodels (for
larger datasets, you may consider importing your data from a file rather than typing it in).
import pandas as pd
import statsmodels.api as sm

# Monthly macroeconomic data and the stock index price to be modeled
Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]}

df = pd.DataFrame(Stock_Market, columns=['Year','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

# Independent variables (with an intercept term) and the dependent variable
X = sm.add_constant(df[['Interest_Rate','Unemployment_Rate']])
Y = df['Stock_Index_Price']

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)
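Once fitted, the model can also score new observations. A small sketch (the input values
here are hypothetical):

# Predict the stock index price for a hypothetical month with an
# interest rate of 2.75 and an unemployment rate of 5.3
new_X = pd.DataFrame({'const': [1.0], 'Interest_Rate': [2.75], 'Unemployment_Rate': [5.3]})
print(model.predict(new_X))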
When you run the Python code, model.summary() prints the OLS (Ordinary Least Squares)
regression results table.
Interpreting the Regression Results
1. Adjusted R-squared reflects the fit of the model. R-squared values range from 0 to
1, where a higher value generally indicates a better fit, assuming certain conditions are
met.
2. The const coefficient is the Y-intercept. It means that if both the Interest_Rate and
Unemployment_Rate coefficients are zero, then the expected output (i.e., the Y) would
be equal to the const coefficient.
3. The Interest_Rate coefficient represents the change in the output Y due to a change of
one unit in the interest rate (everything else held constant).
4. The Unemployment_Rate coefficient represents the change in the output Y due to a
change of one unit in the unemployment rate (everything else held constant).
5. std err reflects the level of accuracy of the coefficients: the lower it is, the higher the
level of accuracy.
6. P > |t| is the p-value. A p-value of less than 0.05 is conventionally considered
statistically significant.
7. The confidence interval represents the range in which each coefficient is likely to fall
(with a likelihood of 95%).
Each of these quantities can also be read directly from the fitted results object, as sketched
below.
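A minimal sketch, assuming model is the fitted OLS results object from the code above:

print(model.rsquared_adj)  # adjusted R-squared (item 1)
print(model.params)        # const, Interest_Rate and Unemployment_Rate coefficients (items 2-4)
print(model.bse)           # standard errors of the coefficients (item 5)
print(model.pvalues)       # p-values (item 6)
print(model.conf_int())    # 95% confidence intervals (item 7)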
The most common interpretation of the coefficient of determination is how well the regression
model fits the observed data; generally, a higher coefficient indicates a better fit for the
model. More specifically, R-squared gives the percentage of the variation in y that is explained
by the x-variables.
However, a high R-squared is not always good for the regression model. The quality of the
coefficient depends on several factors, including the units of measure of the variables, the
nature of the variables employed in the model, and the applied data transformation. Thus, a
high coefficient can sometimes indicate issues with the regression model.
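One way to see this: a straight line fitted to clearly curved data can still report a high
R-squared. A minimal sketch with made-up quadratic data:

import numpy as np
from sklearn.linear_model import LinearRegression

# y grows quadratically, yet a straight-line fit still scores about 0.95
x = np.arange(1, 11).reshape(-1, 1)
y = x.ravel() ** 2
print(LinearRegression().fit(x, y).score(x, y))  # high R-squared, wrong functional form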
Calculation of the Coefficient of Determination
Mathematically, the coefficient of determination can be found using the following formula:

R² = SSregression / SStotal

Where:
SSregression (the sum of squares due to regression) = Σ(ŷᵢ − ȳ)² and SStotal (the total sum
of squares) = Σ(yᵢ − ȳ)², with ȳ the mean of the observed values and ŷᵢ the predicted values.
Although the terms “total sum of squares” and “sum of squares due to regression” seem
confusing, the variables’ meanings are straightforward.
The total sum of squares measures the variation in the observed data (data used in regression
modeling). The sum of squares due to regression measures how well the regression model
represents the data that were used for modeling.
Example
Below is a graph showing how the number of lectures per day affects the number of hours spent
at university per day; the regression line and its equation are drawn on the graph.
Solution
To calculate R², you need to find the residual sum of squares and the total sum of squares.
Start off by finding the residuals, which are the vertical distances from the regression line to
each data point. Work out each predicted y value by plugging the corresponding x value into the
regression line equation.
A data point that lies below the regression line gives a negative residual, and one that lies
above the line gives a positive residual.
To find the residual sum of squares, square each of the residuals r₁ to r₄ and sum them; the
total sum of squares is computed in the same way from the distances of the observed y values
to their mean.
Therefore, for the data shown on the graph, R² = 0.895.
This means that the number of lectures per day accounts for 89.5% of the variation in the hours
people spend at university per day.
Example code:
# Build model
import pandas as pd
import statsmodels.api as sm

# inc_exp is assumed to be a pandas DataFrame of household data with the columns
# 'Mthly_HH_Income' and 'Mthly_HH_Expense', e.g. loaded via pd.read_csv
xdat = inc_exp['Mthly_HH_Income']
xdat = sm.add_constant(xdat)
ydat = inc_exp['Mthly_HH_Expense']
model = sm.OLS(ydat, xdat).fit()
model.summary()
R Squared Calculation
y = inc_exp['Mthly_HH_Expense']
x = inc_exp['Mthly_HH_Income']
ybar = inc_exp['Mthly_HH_Expense'].mean()  # mean of the observed y values
yhat = model.predict(xdat)                 # fitted values from the model above
SST = sum((y - ybar)**2)   # total sum of squares
SSE = sum((y - yhat)**2)   # residual (error) sum of squares
SSR = SST - SSE            # sum of squares due to regression
RSquared = SSR / SST
RSquared
0.42148044723596234
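As a quick cross-check (assuming model is the fitted results object above), statsmodels
reports the same quantity directly:

print(model.rsquared)  # should match the manual SSR / SST calculation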
Too few values in a data set (a too-small sample size) can lead to misleading statistics; at the
same time, too many predictor variables can also cause problems. Every time we add a predictor
variable in regression analysis, R² will increase; it never decreases. Therefore, the more
predictors you add, the better the regression will seem to "fit" your data: if the data don't
quite fit a line, you can keep adding predictors until you get a better fit.
Some of the predictors you add will be significant (improve the model) and others will not,
but R² doesn't penalize the insignificant ones: the more you add, the higher the coefficient
of determination.
The adjusted R² accounts for the number of variables in the model: it increases only if a new
term improves the regression more than you would expect by chance. Adjusted R² can even be
negative; this will likely happen when R² is close to zero, since the adjustment dips the value
a little below zero.
Adjusted R² value = 1 − [(1 − R²)(n − 1)] / (n − k − 1)
where n is the number of observations and k is the number of predictor variables.
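For instance (an illustrative calculation, not from the text), with R² = 0.8, n = 30
observations and k = 2 predictors, the adjusted value is 1 − (0.2 × 29) / 27 ≈ 0.785,
slightly below the raw R².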
Calculate Adjusted R-Squared with sklearn
from sklearn.linear_model import LinearRegression
import pandas as pd
# read in data (url and the column selections are placeholders for your own dataset)
data = pd.read_csv(url)
X, y = data[predictor_cols], data[target_col]
# ordinary R-squared, then the adjustment for n observations and k predictors
r2 = LinearRegression().fit(X, y).score(X, y)
n, k = X.shape
print(1 - (1 - r2) * (n - 1) / (n - k - 1))
0.7787005290062521
Calculate Adjusted R-Squared with statsmodels
import statsmodels.api as sm
import pandas as pd
# read in data (url and the column selections are placeholders for your own dataset)
data = pd.read_csv(url)
X = sm.add_constant(data[predictor_cols])
model = sm.OLS(data[target_col], X).fit()
print(model.rsquared_adj)  # statsmodels reports adjusted R-squared directly
0.7787005290062521