
3.1 Introduction to Regression Analysis


Predictive modeling is a statistical technique: the process of developing a model from historical data in order to make predictions on new data. Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation.

Regression is a machine learning technique that helps in finding the relationship between independent and dependent variables. Just as predictive analytics is a tool for machine learning and big data, regression modeling is a tool for predictive analytics.

In simple words, regression can be defined as a machine learning problem where we have to predict continuous (real-valued) quantities such as a price, rating, or fee.
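As a minimal sketch of this idea in code (the area and price figures below are made up for illustration), a least-squares line fitted with NumPy predicts a continuous price from a single input:

import numpy as np

# Made-up toy data: house area (square metres) and price (thousands)
area = np.array([50, 80, 110, 140])
price = np.array([150, 240, 320, 410])

# Fit a degree-1 (straight-line) least-squares model: price = slope*area + intercept
slope, intercept = np.polyfit(area, price, 1)

# Predict a continuous value for a new, unseen input
print(slope * 100 + intercept)  # estimated price of a 100 m^2 home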

Regression is one of the most common types of machine learning model. It differs from classification because it estimates a numerical value, whereas a classification model identifies which category an observation belongs to.

The main uses of regression analysis are:

Financial forecasting (such as house price or stock price estimates)
Sales and promotions forecasting
Automobile testing
Weather analysis and prediction
Time series forecasting

Terminology Related to Regression Analysis:

o Dependent Variable: The main factor that we want to predict or understand in a regression analysis is called the dependent variable. It is also called the target variable.

o Independent Variable: The factors that affect the dependent variable, or that are used to predict its values, are called independent variables, also known as predictors.

o Outliers: An outlier is an observation with a very low or very high value in comparison to the other observed values. An outlier may distort the results, so it should be examined and handled carefully.

o Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking which variable affects the target most (a quick way to check for it is sketched after this list).

o Underfitting and Overfitting: If our algorithm works well on the training dataset but not on the test dataset, the problem is called overfitting. If our algorithm does not perform well even on the training dataset, the problem is called underfitting.
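Multicollinearity can be checked in practice with the variance inflation factor (VIF) from statsmodels. Below is a minimal sketch; the small two-predictor dataset is made up for illustration, and a VIF well above 5 to 10 is a common warning sign:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up data: two predictors that move together
df = pd.DataFrame({
    'Interest_Rate': [2.75, 2.5, 2.5, 2.25, 2.0, 1.75],
    'Unemployment_Rate': [5.3, 5.3, 5.4, 5.5, 5.7, 5.9],
})

X = sm.add_constant(df)
# Compute the VIF of each predictor against the others (skip the constant)
for i in range(1, X.shape[1]):
    print(X.columns[i], variance_inflation_factor(X.values, i))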
3.2 MEASURE OF LINEAR RELATIONSHIP

Measures of Relationship
Statistical measures of relationship show how two or more variables, or two or more sets of data, are related. For example, there is generally a high correlation between parents' education and academic achievement; on the other hand, there is generally no correlation between a person's height and academic achievement. The major statistical measure of relationship is the correlation coefficient.

Correlation Coefficient:

Correlation is the relationship between two or more variables or sets of data. It is expressed in the form of a coefficient, with +1.00 indicating a perfect positive correlation, -1.00 indicating a perfect inverse correlation, and 0.00 indicating a complete lack of relationship.

Two commonly used correlation coefficients are:

1. Pearson's Product Moment Coefficient (r) is the most often used and most precise coefficient; it is generally used with continuous variables.

2. Spearman's Rank Order Coefficient (ρ) is a form of Pearson's Product Moment Coefficient that can be used with ordinal or ranked data.
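As a quick illustration of the difference between the two (on a made-up monotonic but nonlinear dataset), pandas can compute both coefficients; Spearman's ρ sees a perfect rank agreement while Pearson's r stays below 1:

import pandas as pd

# Made-up data: y grows monotonically with x, but not along a straight line
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['y'] = df['x'] ** 3

print(df['x'].corr(df['y']))                     # Pearson's r, about 0.94
print(df['x'].corr(df['y'], method='spearman'))  # Spearman's rho, exactly 1.0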
[Figure: linear (line) representations of correlation coefficients]

Correlation Coefficient Formula

For n pairs of observations (x, y), Pearson's linear correlation coefficient is

$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}} $$

Example:
Calculate the linear correlation coefficient for the following data: X = 4, 8, 12, 16 and Y = 5, 10, 15, 20.

Solution:

Given variables are X = 4, 8, 12, 16 and Y = 5, 10, 15, 20, so n = 4.

For finding the linear correlation coefficient of these data, we first construct a table of the required values:

x        y        xy        x²        y²
4        5        20        16        25
8        10       80        64        100
12       15       180       144       225
16       20       320       256       400
Σx = 40  Σy = 50  Σxy = 600  Σx² = 480  Σy² = 750

According to the formula of linear correlation we have:

r = [4(600) − (40)(50)] / √{[4(480) − (40)²][4(750) − (50)²]}
  = (2400 − 2000) / √[(320)(500)]
  = 400 / √160000
  = 400 / 400
  = 1

So r = 1: since Y = 1.25X exactly, the data show a perfect positive linear correlation.
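The hand calculation can be checked in Python with SciPy's pearsonr (assuming SciPy is installed):

from scipy.stats import pearsonr

x = [4, 8, 12, 16]
y = [5, 10, 15, 20]
r, p_value = pearsonr(x, y)
print(r)  # 1.0, since y = 1.25x exactly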
3.3 REGRESSION WITH STATSMODELS
statsmodels is a Python module that provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and statistical data exploration. An extensive list of result statistics is available for each estimator, and the results are tested against existing statistical packages to ensure that they are correct.

statsmodels supports specifying models using pandas DataFrames. Here is a simple example using ordinary least squares (OLS).

The goal here is to predict/estimate the stock index price based on two macroeconomic variables: the interest rate and the unemployment rate.

We will use a pandas DataFrame to capture the data in Python.

Here is the complete syntax to perform the linear regression in Python using statsmodels (for larger datasets, you may prefer to import your data from a file):

from pandas import DataFrame
import statsmodels.api as sm

Stock_Market = {
    'Year': [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,
             2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
    'Month': [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1,
              12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    'Interest_Rate': [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                      2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
    'Unemployment_Rate': [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                          6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
    'Stock_Index_Price': [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
                          1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719],
}

df = DataFrame(Stock_Market, columns=['Year', 'Month', 'Interest_Rate',
                                      'Unemployment_Rate', 'Stock_Index_Price'])

# Two predictors give a multiple linear regression. For a simple linear
# regression with one predictor, use e.g. X = df['Interest_Rate'] instead.
X = df[['Interest_Rate', 'Unemployment_Rate']]
Y = df['Stock_Index_Price']

X = sm.add_constant(X)  # add an intercept (constant) term

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

Running this Python code prints the OLS (Ordinary Least Squares) regression results summary.
Interpreting the Regression Results

1. Adj. R-squared reflects the fit of the model. R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met.
2. The const coefficient is the Y-intercept. It means that if both the Interest_Rate and Unemployment_Rate values were zero, the expected output (i.e., Y) would equal the const coefficient.
3. The Interest_Rate coefficient represents the change in the output Y due to a change of one unit in the interest rate (everything else held constant).
4. The Unemployment_Rate coefficient represents the change in the output Y due to a change of one unit in the unemployment rate (everything else held constant).
5. std err reflects the level of accuracy of the coefficients: the lower it is, the more accurate the estimate.
6. P>|t| is the p-value. A p-value of less than 0.05 is conventionally considered statistically significant.
7. The confidence interval represents the range in which each coefficient is likely to fall (with a likelihood of 95%). Each of these quantities can also be read directly from the fitted model object, as sketched below.
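As a sketch of how to pull these quantities from the fitted model above (standard statsmodels results attributes), and how to score a new observation; the new input values below are hypothetical:

# Each entry in the summary table is also available programmatically
print(model.params)          # const, Interest_Rate and Unemployment_Rate coefficients
print(model.bse)             # standard errors (the "std err" column)
print(model.pvalues)         # p-values (the "P>|t|" column)
print(model.conf_int(0.05))  # 95% confidence intervals
print(model.rsquared_adj)    # adjusted R-squared

# Predicting for hypothetical new inputs: interest rate 2.75, unemployment rate 5.3
new_X = DataFrame({'const': [1.0],
                   'Interest_Rate': [2.75],
                   'Unemployment_Rate': [5.3]})
print(model.predict(new_X))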

3.4 Coefficient of Determination, Meaning & Significance of Coefficients

The coefficient of determination (R², or r-squared) is a statistical measure in a regression model that gives the proportion of the variance in the dependent variable that can be explained by the independent variable(s). In other words, the coefficient of determination tells us how well the data fit the model. R-squared assesses how strong the linear relationship is between two variables. The coefficient of determination can take any value between 0 and 1.

Interpretation of the Coefficient of Determination (R²)

The most common interpretation of the coefficient of determination is how well the regression model fits the observed data. Generally, a higher coefficient indicates a better fit for the model. More specifically, R-squared gives the percentage of the variation in y that is explained by the x-variables.

However, it is not always the case that a high r-squared is good for the regression model. The
quality of the coefficient depends on several factors, including the units of measure of the
variables, the nature of the variables employed in the model, and the applied data
transformation. Thus, sometimes, a high coefficient can indicate issues with the regression
model.
Calculation of the Coefficient of Determination

Mathematically, the coefficient of determination can be found using the following formula:

$$ R^{2} = \frac{SS_{\text{regression}}}{SS_{\text{total}}} $$

Where:

 SS_regression – the sum of squares due to regression (explained variation)

 SS_total – the total sum of squares

Although the terms “total sum of squares” and “sum of squares due to regression” seem
confusing, the variables’ meanings are straightforward.

The total sum of squares measures the variation in the observed data (data used in regression
modeling). The sum of squares due to regression measures how well the regression model
represents the data that were used for modeling.
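In symbols, writing ŷᵢ for the fitted value at the i-th observation and ȳ for the mean of the observed yᵢ:

$$ SS_{\text{total}} = \sum_i (y_i - \bar{y})^{2}, \qquad SS_{\text{regression}} = \sum_i (\hat{y}_i - \bar{y})^{2}, \qquad SS_{\text{residual}} = \sum_i (y_i - \hat{y}_i)^{2} $$

For least-squares regression with an intercept, SS_total = SS_regression + SS_residual, so R² can equivalently be computed as 1 − SS_residual / SS_total; this is the form used in the worked example below.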

Example

Below is a graph showing how the number of lectures per day affects the number of hours spent at university per day. The regression line drawn on the graph, fitted by least squares to the four data points below, has equation

ŷ = 0.143 + 1.229x

Solution

To calculate R² you need to find the sum of the squared residuals and the total sum of squares.

Start off by finding the residuals: the vertical distance from the regression line to each data point. Work out the predicted y value (ŷ) by plugging the corresponding x value into the regression line equation.

For the point (2, 2):

ŷ = 0.143 + 1.229(2) = 2.601. The actual value of y is 2, so the residual is r₁ = 2 − 2.601 = −0.601.

As you can see from the graph, the actual point is below the regression line, so it makes sense that the residual is negative.

For the point (3, 4):

ŷ = 0.143 + 1.229(3) = 3.830. The actual value of y is 4, so r₂ = 4 − 3.830 = 0.170.

As you can see from the graph, the actual point is above the regression line, so it makes sense that the residual is positive.

For the point (4, 6):

ŷ = 0.143 + 1.229(4) = 5.059. The actual value of y is 6, so r₃ = 6 − 5.059 = 0.941.

For the point (6, 7):

ŷ = 0.143 + 1.229(6) = 7.517. The actual value of y is 7, so r₄ = 7 − 7.517 = −0.517.

To find the sum of the squared residuals we square each of r₁ to r₄ and sum them:

SS_residual = (−0.601)² + (0.170)² + (0.941)² + (−0.517)² ≈ 1.543

The total sum of squares uses the mean ȳ = (2 + 4 + 6 + 7) / 4 = 4.75:

SS_total = (2 − 4.75)² + (4 − 4.75)² + (6 − 4.75)² + (7 − 4.75)² = 14.75

Therefore,

R² = 1 − SS_residual / SS_total = 1 − 1.543 / 14.75 ≈ 0.895

This means that the number of lectures per day accounts for 89.5% of the variation in the hours people spend at university per day.

Example code:

Interpretation of R-squared in linear regression:

# Import data
import pandas as pd
inc_exp = pd.read_csv("Inc_Exp_Data.csv")

# Build model
import statsmodels.api as sm
xdat = inc_exp['Mthly_HH_Income']
xdat = sm.add_constant(xdat)  # add an intercept term
ydat = inc_exp['Mthly_HH_Expense']
model = sm.OLS(ydat, xdat).fit()
model.summary()
R-Squared Calculation

# R-squared calculation code
y = inc_exp['Mthly_HH_Expense']
ybar = y.mean()
yhat = model.predict(xdat)  # fitted values (xdat already includes the constant)
SST = sum((y - ybar) ** 2)  # total sum of squares
SSE = sum((y - yhat) ** 2)  # sum of squared residuals
SSR = SST - SSE             # sum of squares due to regression
RSquared = SSR / SST        # equivalently: model.rsquared
RSquared

0.42148044723596234

Adjusted Coefficient of Determination (Adjusted R-squared)

The Adjusted Coefficient of Determination (Adjusted R-squared) is an adjustment to the coefficient of determination that credits only those independent variables which actually improve the performance of the model.

Too few values in a data set (a too-small sample size) can lead to misleading statistics, but too many predictors can also cause problems. Every time we add an independent variable to a regression, R² increases or stays the same; it never decreases. Therefore, the more variables you add, the better the regression will seem to "fit" your data, even when the new variables carry no real information about the outcome.

Some of the variables you add will be significant (improve the model) and others will not, but R² does not care about the insignificant ones: the more you add, the higher the coefficient of determination.

The adjusted R² corrects for the number of variables in the model: it increases only if a new variable improves the regression more than you would expect by chance, and it can even become slightly negative when R² is close to zero. It is computed as

$$ \text{Adjusted } R^{2} = 1 - \frac{(1 - R^{2})(n - 1)}{n - k - 1} $$

where:

 n – Number of points in your data set.


 k – Number of independent variables in the model, excluding the constant

Adjusted R² value for the above R² problem, where R² = 0.895, n = 4 data points, and k = 1 independent variable (the number of lectures per day):

Adjusted R² = 1 − [(1 − 0.895)(4 − 1)] / (4 − 1 − 1)
            = 1 − (0.105 × 3) / 2
            = 1 − 0.1575
            = 0.8425

Adjusted R² calculation in Python:

Calculate Adjusted R-Squared with sklearn

from sklearn.linear_model import LinearRegression
import pandas as pd

# define URL where dataset is located
url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/mtcars.csv"

# read in data
data = pd.read_csv(url)

# fit regression model
model = LinearRegression()
X, y = data[["mpg", "wt", "drat", "qsec"]], data.hp
model.fit(X, y)

# display adjusted R-squared: 1 - (1 - R^2)(n - 1)/(n - k - 1)
1 - (1 - model.score(X, y)) * (len(y) - 1) / (len(y) - X.shape[1] - 1)

0.7787005290062521
Calculate Adjusted R-Squared with statsmodels

import statsmodels.api as sm
import pandas as pd

# define URL where dataset is located
url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/mtcars.csv"

# read in data
data = pd.read_csv(url)

# fit regression model
X, y = data[["mpg", "wt", "drat", "qsec"]], data.hp
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# display adjusted R-squared
print(model.rsquared_adj)

0.7787005290062521
