Problem Statement: You are part of an investing firm and your work is to do research on these 759 firms. You are provided with a dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset, so as to help your company invest wisely. Also, provide the 5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Getting counts of the object variable: there is only one object variable, and the counts for each of its levels are shown below.
Sales are higher for the companies that are present in the SP500 index than for the firms that are not present in the SP500 index.
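A minimal sketch of these EDA checks, assuming the data sits in a CSV file (the file name Firm_data.csv is a placeholder) and that the columns follow the attribute names used in this report (sales, sp500):

```python
import pandas as pd

# Placeholder file name; replace with the actual dataset path.
df = pd.read_csv("Firm_data.csv")

# Basic structure checks: shape, data types, and null counts
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# Counts of the only object variable (SP500 membership)
print(df["sp500"].value_counts())

# Compare average sales for firms in vs. not in the SP500 index
print(df.groupby("sp500")["sales"].mean())
```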
Correlation Heatmap:
Observations: Sales has a very high degree of correlation with employment (0.91), capital (0.87) and randd (0.85). In the heatmap above, sales has the least correlation with tobinq, at 0.11. It is also worth noting that there is a moderate correlation between patents and capital; randd has a moderate correlation with capital and a high correlation with patents.
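The heatmap can be produced along the following lines (a sketch, assuming df is the DataFrame loaded earlier):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the numeric attributes only
corr = df.select_dtypes(include="number").corr()

# Annotated heatmap to read off pairwise correlations (e.g. sales vs. employment)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
```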
• Outliers can affect the distribution of data, making it skewed. Treating outliers can help in normalizing the distribution (one such treatment is sketched after this list).
• Outliers can distort visualizations such as histograms, box plots, and scatter plots.
Removing or transforming outliers can improve the clarity and interpretability of
these visualizations.
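A minimal sketch of one possible outlier treatment, IQR-based capping of the numeric columns; the helper cap_outliers_iqr is introduced here only for illustration, since the report does not spell out which treatment was applied:

```python
# Cap values beyond the IQR whiskers back to the boundary (one common treatment)
def cap_outliers_iqr(data, column):
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    data[column] = data[column].clip(lower=lower, upper=upper)
    return data

for col in df.select_dtypes(include="number").columns:
    df = cap_outliers_iqr(df, col)
```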
SCALING:
• Scaling can be useful to reduce or check multicollinearity in the data; when scaling is not applied, the VIF (variance inflation factor) values come out very high, which indicates the presence of multicollinearity.
• These VIF values are calculated after building the linear regression model, in order to understand the multicollinearity in the model.
• Scaling had no impact on the model score, the coefficients of the attributes, or the intercept.
• In the given dataset, the attributes are not on comparable scales, so we should scale the data in this case. Accordingly, we scaled the dataset after treating the outliers and converting the categorical data into numeric form.
StandardScaler standardizes the data using the formula (x - mean) / standard deviation.
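A minimal sketch of this step in isolation, assuming df is the working DataFrame (in the report's pipeline, scaling is applied after outlier treatment and encoding):

```python
from sklearn.preprocessing import StandardScaler

# Standardize each numeric column: (x - mean) / standard deviation
num_cols = df.select_dtypes(include="number").columns
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```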
In this context, data encoding refers to converting categorical values into numeric form (zeros and ones) so that they can be used by the model.
There are three common approaches for converting ordinal and categorical variables to numerical values:
• Ordinal Encoding
• One-Hot Encoding
• Dummy Variable Encoding
A linear regression model cannot take categorical values directly, so we have encoded the categorical values as integers for better results.
Here we will use Dummy Variable Encoding to convert each category into a separate column containing only 0 and 1, where 1 indicates presence and 0 indicates absence.
In this case:
Here, we have used drop_first=True so that not all levels of the categorical variable are included as separate columns; keeping every level would introduce multicollinearity and land us in the dummy variable trap.
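A minimal sketch of this encoding step, assuming the categorical column is named sp500 in the DataFrame:

```python
import pandas as pd

# Dummy-variable encoding of the SP500 membership column;
# drop_first=True drops one level to avoid the dummy variable trap.
df = pd.get_dummies(df, columns=["sp500"], drop_first=True)
```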
Since we need to forecast sales, we take sales as the dependent variable.
We will divide the data into training and testing sets in a 70:30 proportion, with a fixed random state of 1 to ensure reproducible results across multiple runs.
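A sketch of the split described above, assuming the encoded and scaled DataFrame df with sales as the target:

```python
from sklearn.model_selection import train_test_split

# Sales is the dependent variable; all remaining attributes are predictors
X = df.drop(columns=["sales"])
y = df["sales"]

# 70:30 split with a fixed random state, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
```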
Linear regression performs the task to predict a dependent variable value (y) based on a given
independent variable (x). So, this regression technique finds out a linear relationship between x
(input) and y(output).
The coefficient value represents the mean change in the dependent variable for a one-unit shift in an independent variable. On scaled data, the absolute sizes of the coefficients can be used to identify the most important variables.
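A minimal sketch of fitting the model and ranking attributes by the absolute size of their coefficients; the top-5 selection shown here is one way the report's "most important attributes" could be read off under this approach:

```python
from sklearn.linear_model import LinearRegression
import pandas as pd

# Fit the linear regression model on the training data
lr = LinearRegression()
lr.fit(X_train, y_train)

# Pair each coefficient with its attribute; on scaled data the absolute
# size is a rough measure of importance, so show the five largest
coef = pd.Series(lr.coef_, index=X_train.columns)
top5 = coef.reindex(coef.abs().sort_values(ascending=False).index).head(5)
print(top5)
print("Intercept:", lr.intercept_)
```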
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.
The definition of R-squared is fairly straightforward; it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
In this case, both the train and test data show an R-squared of about 94%, which indicates a well-fitting model.
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a measure of
how spread out these residuals are. In other words, it tells you how concentrated the data is around
the line of best fit.
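Both metrics can be computed for the train and test sets along these lines (a sketch, assuming the fitted model lr from above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# R-squared on train and test data (around 0.94 in this report)
print("Train R2:", r2_score(y_train, lr.predict(X_train)))
print("Test  R2:", r2_score(y_test, lr.predict(X_test)))

# RMSE: the standard deviation of the residuals (prediction errors)
rmse_train = np.sqrt(mean_squared_error(y_train, lr.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))
print("Train RMSE:", rmse_train, "Test RMSE:", rmse_test)
```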
Multicollinearity exists whenever an independent variable is highly correlated with one or more of the
other independent variables in a multiple regression equation. Multicollinearity is a problem
because it undermines the statistical significance of an independent variable.
Variance inflation factors range from 1 upwards. The numerical value of the VIF tells you (in decimal form) by what percentage the variance (i.e. the standard error squared) is inflated for each coefficient. For example, a VIF of 1.9 tells you that the variance of a particular coefficient is 90% bigger than what you would expect if there were no multicollinearity, that is, no correlation with the other predictors.
Capital, patents, randd, employment and value show high VIF values here.
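One way to obtain these VIF values, sketched with statsmodels' variance_inflation_factor:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF is computed per column of the design matrix (with an added constant);
# cast to float in case dummy columns are stored as booleans
X_const = sm.add_constant(X_train).astype(float)
vif = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])],
})
print(vif)
```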
Ordinary Least Squares (OLS) regression is a common technique for estimating the coefficients of linear regression equations, which describe the relationship between one or more independent quantitative variables and a dependent variable (simple or multiple linear regression). Least squares refers to minimizing the sum of squared errors (SSE).
The general form of the equation is:
y = β0 + β1·x1 + β2·x2 + … + βk·xk + ε
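A sketch of fitting OLS with statsmodels to obtain the summary (coefficients, R-squared, adjusted R-squared, p-values) interpreted below:

```python
import statsmodels.api as sm

# Fit OLS on the training data and print the full regression summary
X_train_const = sm.add_constant(X_train.astype(float))
ols_model = sm.OLS(y_train, X_train_const).fit()
print(ols_model.summary())
```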
• The R-squared value (0.934) represents the proportion of the variance in the dependent variable
(sales) that is explained by the independent variables in the model. In this case, approximately
93.4% of the variability in sales is explained by the model. The adjusted R-squared (0.933) takes
into account the number of predictors and provides a more accurate measure in the presence of
multiple variables.
• We would advise the firm to invest in companies where employment is very high. We can also do further classification and encourage firms with lower employment to hire more qualified candidates, thus increasing their turnover.
• Conversely, since tobinq, which is the ratio between a physical asset's market value and its replacement value, has a negative impact on sales, we would advise the investment firm not to look into firms that have a high tobinq ratio.