Introduction to Regression Analysis and Measuring with Statsmodels
Regression is a type of machine learning that helps in finding the relationship between
independent and dependent variables. Just as predictive analytics is a tool for machine learning
and big data, regression modeling is a tool for predictive analytics.
In simple words, regression can be defined as a machine learning problem where we have to
predict continuous/real values such as price, rating, fees, etc.
Regression is one of the most common models of machine learning. It differs from
classification models because it estimates a numerical value, whereas classification models
identify which category an observation belongs to.
o Dependent Variable: The main factor in regression analysis that we want to predict
or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors that affect the dependent variable, or that are
used to predict its values, are called independent variables, also known as predictors.
o Outliers: An outlier is an observation with either a very low or a very high value
in comparison to the other observed values. An outlier may distort the result, so it
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other,
the condition is called multicollinearity. It should not be present in the dataset,
because it creates problems when ranking the most influential variables (see the
sketch after this list).
o Underfitting and Overfitting: If our algorithm works well with the training dataset
but not with the test dataset, the problem is called overfitting. If our algorithm does
not perform well even on the training dataset, the problem is called underfitting.
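Multicollinearity is easy to check in practice with a correlation matrix. A minimal sketch,
using a small made-up dataset (the column names and values are illustrative, not from the text):

import pandas as pd

# Two predictors that move together (x2 is roughly 2 * x1) plus an unrelated one
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                   'x2': [2.1, 3.9, 6.0, 8.1, 9.9],
                   'x3': [7, 3, 9, 1, 5]})
print(df.corr())  # a pairwise correlation near +1 or -1 flags multicollinearity

Here the x1/x2 correlation is close to 1, so one of those two columns would typically be
dropped before fitting a regression.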
3.2 MEASURE OF LINEAR RELATIONSHIP
Measures of Relationship
Statistical measures show a relationship between two or more variables or two or more sets of
data. For example, there is generally a high relationship, or correlation, between parents'
education and academic achievement. On the other hand, there is generally no correlation
between a person's height and academic achievement. The major statistical measure of
relationship is the correlation coefficient.
Correlation Coefficients:
1. Pearson's Product Moment Coefficient (r) is the most often used and most precise
coefficient; it is generally used with continuous variables.
2. Spearman's Rank Order Coefficient (ρ) is a form of Pearson's Product Moment
Coefficient that can be used with ordinal or ranked data.
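Both coefficients are available in SciPy. A minimal sketch with illustrative data (the values
below are not from the text):

from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(stats.pearsonr(x, y))    # Pearson's r with its p-value
print(stats.spearmanr(x, y))   # Spearman's rho with its p-value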
Linear (Line) Representations of Correlation Coefficients
Example:
Calculate the linear correlation coefficient for the following data: X = 4, 8, 12, 16 and
Y = 5, 10, 15, 20.
Solution:
To find the linear correlation coefficient for these data, we first construct a table of the
required values.

X     Y     XY    X²    Y²
4     5     20    16    25
8     10    80    64    100
12    15    180   144   225
16    20    320   256   400
ΣX = 40   ΣY = 50   ΣXY = 600   ΣX² = 480   ΣY² = 750

According to the formula for the linear correlation coefficient,

r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}

so with n = 4,

r = (4 × 600 − 40 × 50) / √[(4 × 480 − 40²)(4 × 750 − 50²)] = 400 / √(320 × 500) = 400 / 400 = 1.

Since Y = 1.25X exactly, the data are perfectly positively correlated and r = 1.
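The hand calculation can be cross-checked in a couple of lines of Python:

import numpy as np

x = np.array([4, 8, 12, 16])
y = np.array([5, 10, 15, 20])
print(np.corrcoef(x, y)[0, 1])  # prints 1.0: a perfect positive linear correlation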
3.3 REGRESSION WITH STATSMODELS
Statsmodels is a Python module that provides classes and functions for the estimation of many
different statistical models, as well as for conducting statistical tests and statistical data
exploration. An extensive list of result statistics is available for each estimator. The results are
tested against existing statistical packages to ensure that they are correct.
Statsmodels supports specifying models using pandas DataFrames. Here is a simple example
using ordinary least squares.
The goal here is to predict/estimate the stock index price based on two macroeconomic
variables: the interest rate and the unemployment rate.
Here is the complete syntax to perform the linear regression in Python using statsmodels (for
larger datasets, you may consider importing your data from a file rather than typing it in).
import pandas as pd
import statsmodels.api as sm

# Monthly macroeconomic data and the stock index price to be modeled
Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]}

df = pd.DataFrame(Stock_Market, columns=['Year','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

# Independent variables (with an intercept term) and the dependent variable
X = sm.add_constant(df[['Interest_Rate','Unemployment_Rate']])
Y = df['Stock_Index_Price']

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)
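Once fitted, the model can also score new observations. A small sketch (the input values
here are hypothetical):

# Predict the stock index price for a hypothetical month with an
# interest rate of 2.75 and an unemployment rate of 5.3
new_X = pd.DataFrame({'const': [1.0], 'Interest_Rate': [2.75], 'Unemployment_Rate': [5.3]})
print(model.predict(new_X))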
When you run the Python code, model.summary() prints the OLS (Ordinary Least Squares)
regression results table.
Interpreting the Regression Results
1. Adjusted R-squared reflects the fit of the model. R-squared values range from 0 to
1, where a higher value generally indicates a better fit, assuming certain conditions are
met.
2. The const coefficient is the Y-intercept. It means that if both the Interest_Rate and
Unemployment_Rate coefficients are zero, then the expected output (i.e., the Y) would
be equal to the const coefficient.
3. The Interest_Rate coefficient represents the change in the output Y due to a change of
one unit in the interest rate (everything else held constant).
4. The Unemployment_Rate coefficient represents the change in the output Y due to a
change of one unit in the unemployment rate (everything else held constant).
5. std err reflects the level of accuracy of the coefficients: the lower it is, the higher the
level of accuracy.
6. P > |t| is the p-value. A p-value of less than 0.05 is conventionally considered
statistically significant.
7. The confidence interval represents the range in which each coefficient is likely to fall
(with a likelihood of 95%).
Each of these quantities can also be read directly from the fitted results object, as sketched
below.
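A minimal sketch, assuming model is the fitted OLS results object from the code above:

print(model.rsquared_adj)  # adjusted R-squared (item 1)
print(model.params)        # const, Interest_Rate and Unemployment_Rate coefficients (items 2-4)
print(model.bse)           # standard errors of the coefficients (item 5)
print(model.pvalues)       # p-values (item 6)
print(model.conf_int())    # 95% confidence intervals (item 7)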
The most common interpretation of the coefficient of determination is how well the regression
model fits the observed data; generally, a higher coefficient indicates a better fit for the
model. More specifically, R-squared gives the percentage of the variation in y that is explained
by the x-variables.
However, a high R-squared is not always good for the regression model. The quality of the
coefficient depends on several factors, including the units of measure of the variables, the
nature of the variables employed in the model, and the applied data transformation. Thus, a
high coefficient can sometimes indicate issues with the regression model.
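One way to see this: a straight line fitted to clearly curved data can still report a high
R-squared. A minimal sketch with made-up quadratic data:

import numpy as np
from sklearn.linear_model import LinearRegression

# y grows quadratically, yet a straight-line fit still scores about 0.95
x = np.arange(1, 11).reshape(-1, 1)
y = x.ravel() ** 2
print(LinearRegression().fit(x, y).score(x, y))  # high R-squared, wrong functional form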
Calculation of the Coefficient of Determination
Mathematically, the coefficient of determination can be found using the following formula:

R² = SSregression / SStotal

Where:
SSregression (the sum of squares due to regression) = Σ(ŷᵢ − ȳ)² and SStotal (the total sum
of squares) = Σ(yᵢ − ȳ)², with ȳ the mean of the observed values and ŷᵢ the predicted values.
Although the terms “total sum of squares” and “sum of squares due to regression” seem
confusing, the variables’ meanings are straightforward.
The total sum of squares measures the variation in the observed data (data used in regression
modeling). The sum of squares due to regression measures how well the regression model
represents the data that were used for modeling.
Example
Below is a graph showing how the number of lectures per day affects the number of hours spent
at university per day; the regression line and its equation are drawn on the graph.
Solution
To calculate R², you need to find the residual sum of squares and the total sum of squares.
Start off by finding the residuals, which are the vertical distances from the regression line to
each data point. Work out each predicted y value by plugging the corresponding x value into the
regression line equation.
A data point that lies below the regression line gives a negative residual, and one that lies
above the line gives a positive residual.
To find the residual sum of squares, square each of the residuals r₁ to r₄ and sum them; the
total sum of squares is computed in the same way from the distances of the observed y values
to their mean.
Therefore, for the data shown on the graph, R² = 0.895.
This means that the number of lectures per day accounts for 89.5% of the variation in the hours
people spend at university per day.
Example code:
# Build model
import pandas as pd
import statsmodels.api as sm

# inc_exp is assumed to be a pandas DataFrame of household data with the columns
# 'Mthly_HH_Income' and 'Mthly_HH_Expense', e.g. loaded via pd.read_csv
xdat = inc_exp['Mthly_HH_Income']
xdat = sm.add_constant(xdat)
ydat = inc_exp['Mthly_HH_Expense']
model = sm.OLS(ydat, xdat).fit()
model.summary()
R Squared Calculation
y = inc_exp['Mthly_HH_Expense']
x = inc_exp['Mthly_HH_Income']
ybar = inc_exp['Mthly_HH_Expense'].mean()  # mean of the observed y values
yhat = model.predict(xdat)                 # fitted values from the model above
SST = sum((y - ybar)**2)   # total sum of squares
SSE = sum((y - yhat)**2)   # residual (error) sum of squares
SSR = SST - SSE            # sum of squares due to regression
RSquared = SSR / SST
RSquared
0.42148044723596234
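As a quick cross-check (assuming model is the fitted results object above), statsmodels
reports the same quantity directly:

print(model.rsquared)  # should match the manual SSR / SST calculation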
Too few values in a data set (a too-small sample size) can lead to misleading statistics; at the
same time, too many predictor variables can also cause problems. Every time we add a predictor
variable in regression analysis, R² will increase; it never decreases. Therefore, the more
predictors you add, the better the regression will seem to "fit" your data: if the data don't
quite fit a line, you can keep adding predictors until you get a better fit.
Some of the predictors you add will be significant (improve the model) and others will not,
but R² doesn't penalize the insignificant ones: the more you add, the higher the coefficient
of determination.
The adjusted R² accounts for the number of variables in the model: it increases only if a new
term improves the regression more than you would expect by chance. Adjusted R² can even be
negative; this will likely happen when R² is close to zero, since the adjustment dips the value
a little below zero.
Adjusted R² value = 1 − [(1 − R²)(n − 1)] / (n − k − 1)
where n is the number of observations and k is the number of predictor variables.
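For instance (an illustrative calculation, not from the text), with R² = 0.8, n = 30
observations and k = 2 predictors, the adjusted value is 1 − (0.2 × 29) / 27 ≈ 0.785,
slightly below the raw R².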
Calculate Adjusted R-Squared with sklearn
from sklearn.linear_model import LinearRegression
import pandas as pd
# read in data (url and the column selections are placeholders for your own dataset)
data = pd.read_csv(url)
X, y = data[predictor_cols], data[target_col]
# ordinary R-squared, then the adjustment for n observations and k predictors
r2 = LinearRegression().fit(X, y).score(X, y)
n, k = X.shape
print(1 - (1 - r2) * (n - 1) / (n - k - 1))
0.7787005290062521
Calculate Adjusted R-Squared with statsmodels
import statsmodels.api as sm
import pandas as pd
# read in data (url and the column selections are placeholders for your own dataset)
data = pd.read_csv(url)
X = sm.add_constant(data[predictor_cols])
model = sm.OLS(data[target_col], X).fit()
print(model.rsquared_adj)  # statsmodels reports adjusted R-squared directly
0.7787005290062521