
PROJECT: Predictive Modelling.

Firm Level Data.


Linear Regression.
Name: Basit Ali.
Data Science Program.
Problem 1: Linear Regression

Problem Statement: You are part of an investing firm, and your job is to do research
on these 759 firms. You are provided with a dataset containing the sales and other
attributes of these 759 firms. Predict the sales of these firms on the basis of the details
given in the dataset, so as to help your company invest consciously. Also, provide
the 5 attributes that are most important.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.

Importing the libraries
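Below is a minimal sketch of the imports typically used for this kind of analysis (the report does not list its exact imports):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns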

Data Description and Exploratory Data Analysis: Top 5 entries in the dataset.

• The first column ("Unnamed: 0") is an index; as it contains only serial numbers, we can remove it.
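A sketch of loading the data and dropping the serial-number column; the file name "Firm_data.csv" is an assumption, not confirmed by the report:

    # Load the dataset and inspect the top 5 entries
    df = pd.read_csv("Firm_data.csv")     # assumed file name
    df = df.drop(columns=["Unnamed: 0"])  # drop the serial-number index
    print(df.head())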

Insights

• The data consists of both categorical and numerical values.
• Null values exist in the data.
• There are 759 rows in total, each representing a firm, and 9 columns with 9 variables. Of the 9 columns, one is of object type and eight are of integer type.
• There are no duplicated records in the dataset.
• There are 21 missing data points in tobinq.
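These insights can be reproduced with the usual pandas checks, sketched below:

    print(df.shape)               # expect 759 rows
    df.info()                     # one object column, eight numeric columns
    print(df.isnull().sum())      # 21 missing values in tobinq
    print(df.duplicated().sum())  # 0 duplicated records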

Counts of the object variable: there is only one object variable, and below are the counts for each of its values.
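A sketch of obtaining the counts; the column name "sp500" is an assumption based on later references in the report:

    print(df["sp500"].value_counts())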

UNIVARIATE ANALYSIS PLOT:


• Univariate analysis involves the examination of a single variable in isolation. It
aims to understand the distribution, central tendency, and dispersion of that
variable.
• The distributions of the variables in the dataset are mostly positively skewed,
apart from the institutions variable, which is negatively skewed.
• Outliers are present in all the variables except the institutions variable.
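A sketch of the univariate plots described above, assuming a histogram and boxplot for each numeric column:

    # Histogram (with KDE) and boxplot for each numeric variable
    for col in df.select_dtypes(include="number").columns:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
        sns.histplot(df[col], kde=True, ax=ax1)
        sns.boxplot(x=df[col], ax=ax2)
        fig.suptitle(col)
        plt.show()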

Sales are higher for the companies that are present in the S&P 500 index than for the firms that are not.



Observations: A KDE (Kernel Density Estimate) plot is used for visualizing the probability density of a
continuous variable; it depicts the probability density at different values of that variable. The dependent
variable sales is heavily right skewed. Sales is highly correlated with capital, randd and employment, and
only weakly correlated with tobinq, a variable which probably has a negative impact on sales. This gives
us initial insight into which variables will contribute the most, and which the least, to sales.
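A one-line sketch of the KDE plot of the dependent variable:

    sns.kdeplot(df["sales"])  # heavily right-skewed
    plt.show()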



The inference from the above bivariate analysis is that sales is highly correlated with
capital, employment and value. The scatter plots also show a high correlation between
randd and patents.

Correlation Heatmap:

Observations: There is a very high degree of correlation between sales and employment,
capital and randd, at 0.91, 0.87 and 0.85 respectively. The heatmap shows that sales has
the least correlation with tobinq, at 0.11. It is also worth noting that there is moderate
correlation between patents and capital, and that randd has moderate correlation with
capital and high correlation with patents.
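A sketch of the heatmap, assuming seaborn's annotated heatmap over the numeric correlations:

    plt.figure(figsize=(8, 6))
    sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
    plt.show()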



TREATMENT OF OUTLIERS:

• Outliers can affect the distribution of data, making it skewed. Treating outliers can
help in normalizing the distribution.
• Outliers can distort visualizations such as histograms, box plots, and scatter plots.
Removing or transforming outliers can improve the clarity and interpretability of
these visualizations.
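The report does not name its exact outlier treatment; one common approach, sketched here as an assumption, is to cap values at the 1.5 * IQR whiskers:

    # Cap each numeric column at the boxplot whiskers (1.5 * IQR rule)
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)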

NULL VALUE IMPUTATION:

Missing values are present in the data.



Imputing the missing values with the median.

After imputation, no missing values remain in the data.
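A sketch of the median imputation:

    # Fill the 21 missing tobinq values with the column median
    df["tobinq"] = df["tobinq"].fillna(df["tobinq"].median())
    print(df.isnull().sum().sum())  # 0 missing values remain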

SCALING:

• Scaling can be useful to reduce or check multicollinearity in the data; if scaling is not
applied, the VIF (variance inflation factor) values are very high, which indicates the presence of
multicollinearity.
• These values are calculated after building the linear regression model, to understand the
multicollinearity in the model.
• Scaling had no impact on the model score, the coefficients of the attributes, or the intercept.
• Because the given dataset has attributes that do not have well-defined meanings, we should
scale the data in this case. Accordingly, we scaled the dataset after treating the outliers and
encoding the categorical data. StandardScaler standardizes the data using the formula
z = (x - mean) / standard deviation.
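A sketch of the scaling step with scikit-learn's StandardScaler:

    from sklearn.preprocessing import StandardScaler

    # Standardize the numeric columns: z = (x - mean) / std
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])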



1.2 Encode the data (having string values) for Modelling. Data Split:
Split the data into test and train (70:30). Apply Linear regression.
Performance Metrics: Check the performance of Predictions on
Train and Test sets using R-square, RMSE.

In this context, data encoding is the conversion of categorical (string) values into numerical values.
There are three common approaches for converting ordinal and categorical variables to numerical
values:
• Ordinal Encoding
• One-Hot Encoding
• Dummy Variable Encoding

A linear regression model does not take categorical values, so we have encoded the categorical values
as integers for better results.

Here we will use dummy variable encoding to convert each category into a separate column
containing only 0 and 1, where 1 indicates presence and 0 indicates absence.
In this case:

Here, we have used drop_first=True to ensure that all levels of the categorical variable are not included
as separate columns, which would introduce multicollinearity and land us in the dummy variable trap.
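A sketch of the encoding step:

    # Dummy-encode the categorical column; drop_first=True avoids the
    # dummy variable trap
    df = pd.get_dummies(df, drop_first=True)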

Since we need the forecast of sales, we are taking sales as the dependent variable.

We will divide the data into training and testing sets in a 70:30 proportion, with a fixed
random state of 1 to ensure reproducibility across multiple runs.
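A sketch of the split:

    from sklearn.model_selection import train_test_split

    # sales is the dependent variable; 70:30 split, random_state=1
    X = df.drop(columns=["sales"])
    y = df["sales"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=1
    )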



Linear Regression

Linear Regression is a machine learning algorithm based on supervised learning. It performs a
regression task: regression models a target prediction value based on independent variables. It is
mostly used for finding out the relationship between variables and for forecasting.

Linear regression predicts a dependent variable value (y) based on given independent
variables (x). So, this regression technique finds a linear relationship between x
(input) and y (output).

Fitting the Linear regression model to the data

Each coefficient value represents the mean change in the dependent variable given a one-unit shift
in an independent variable. The absolute sizes of the coefficients can be used to identify the most
important variables.
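A sketch of fitting the model and inspecting its coefficients and intercept:

    from sklearn.linear_model import LinearRegression

    lr = LinearRegression()
    lr.fit(X_train, y_train)
    for name, coef in zip(X.columns, lr.coef_):
        print(f"{name}: {coef:.2f}")
    print("intercept:", lr.intercept_)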



The intercept (often labeled the constant) is the point where the regression function crosses the
y-axis: it is the expected mean value of Y when all X = 0.

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.

The definition of R-squared is fairly straightforward: it is the percentage of the response variable
variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:


• 0% indicates that the model explains none of the variability of the response data around its
mean.
• 100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data.

In the above case, both the train and test data show an R-squared of about 94%, which indicates a well-fitting model.

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).
Residuals are a measure of how far from the regression line data points are; RMSE is a measure of
how spread out these residuals are. In other words, it tells you how concentrated the data is around
the line of best fit.



The lower the RMSE, the better a given model is able to “fit” a dataset.
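A sketch of computing both metrics on the train and test sets:

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = lr.predict(X_)
        rmse = np.sqrt(mean_squared_error(y_, pred))
        print(f"{name}: R-squared={r2_score(y_, pred):.3f}  RMSE={rmse:.3f}")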

Multicollinearity exists whenever an independent variable is highly correlated with one or more of the
other independent variables in a multiple regression equation. Multicollinearity is a problem
because it undermines the statistical significance of an independent variable.

A variance inflation factor (VIF) detects multicollinearity in regression analysis. Multicollinearity
is when there is correlation between predictors (i.e. independent variables) in a model; its presence
can adversely affect your regression results.

Variance inflation factors range from 1 upwards. The numerical value of the VIF tells you (in
decimal form) by what percentage the variance (i.e. the standard error squared) is inflated for each
coefficient. For example, a VIF of 1.9 tells you that the variance of a particular coefficient is 90%
bigger than what you would expect if there were no multicollinearity, that is, no correlation with
other predictors.



A rule of thumb for interpreting the variance inflation factor:
• 1 = not correlated.
• Between 1 and 5 = moderately correlated.
• Greater than 5 = highly correlated.

Capital, patents, randd, employment and value show high VIF values here.
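A sketch of computing the VIF values with statsmodels (a constant column is added so that the VIFs are computed against an intercept):

    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    Xc = sm.add_constant(X_train)
    for i, name in enumerate(Xc.columns):
        if name != "const":
            print(f"{name}: {variance_inflation_factor(Xc.values, i):.2f}")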

Ordinary Least Squares (OLS) regression is a common technique for estimating the coefficients of
linear regression equations, which describe the relationship between one or more independent
quantitative variables and a dependent variable (simple or multiple linear regression). Least
squares stands for minimizing the sum of squared errors (SSE).
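A sketch of the OLS fit with statsmodels, which produces the coefficient summary used for the equation in section 1.4:

    import statsmodels.api as sm

    ols_model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
    print(ols_model.summary())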



If we plot the dependent variable against the predicted values of the dependent variable, and
the resultant scatter plot does not have a cone-shaped distribution (which would result from the
noise described above), then the model is good; the figure depicts such a distribution.
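A sketch of that diagnostic plot:

    # Actual vs. predicted sales; the absence of a cone shape suggests
    # well-behaved residuals
    plt.scatter(y_test, lr.predict(X_test), alpha=0.5)
    plt.xlabel("Actual sales")
    plt.ylabel("Predicted sales")
    plt.show()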



1.4 Inference: Based on these predictions, what are the business insights and
recommendations.

Below is the equation which we obtain from the statsmodels output:

Sales = (0.0) * Intercept + (0.29) * capital + (-0.03) * patents + (0.08) * randd
      + (0.41) * employment + (-0.02) * tobinq + (0.21) * value
      + (0.0) * institutions + (0.01) * sp500_yes

• The R-squared value (0.934) represents the proportion of the variance in the dependent variable
(sales) that is explained by the independent variables in the model. In this case, approximately
93.4% of the variability in sales is explained by the model. The adjusted R-squared (0.933) takes
into account the number of predictors and provides a more accurate measure in the presence of
multiple variables.



• The coefficients represent the estimated change in the dependent variable for a one-unit change
in the corresponding independent variable, assuming all other variables are held constant.

Business Insights and Recommendations:


• Variables with significant positive coefficients (e.g., 'capital,' 'employment,' 'value') suggest
positive associations with 'sales.' Companies may consider investing in these areas to potentially
increase sales.
• Variables with significant negative coefficients (e.g., 'tobinq') suggest negative associations with
'sales.' Companies may want to investigate these factors to understand and address potential
issues.
• 'institutions' and 'sp500_yes' do not appear to have a statistically significant impact on 'sales' in
this model. Businesses may choose to reassess the inclusion of these variables in future analyses
or seek additional data.
• Results show that the relationship between sales and R&D, and between sales and marketing, has a
significant, positive effect in generating revenue for the firm.

• We would advise the firm to invest in companies where employment is very high. We can also do
further classification and encourage firms with lower employment to hire more qualified candidates,
thus increasing their turnover.

• Conversely, since tobinq, which is the ratio between a physical asset's market value and its
replacement value, negatively impacts sales, we would advise the investing firm not to look
into firms that have a high tobinq ratio.
