Swapnilreport
Internship report on
Internship
By
Under Guidance of
Abstract
The objective of this project is to build a machine learning model that can predict
the profit of a company based on its R&D Spend, Administration Cost, and
Marketing Spend. The dataset consists of 50 companies with their
corresponding profit and expenses. In this project, we implemented
different regression algorithms, divided the dataset into training and testing
sets, and calculated various regression metrics to choose the best model. We
first plotted a graph of each independent variable against the dependent
variable to analyse how the dependent variable changes with respect to each
independent variable. This gives a close view of the data that helps in
training the model. Once the data was ready, we split the dataset into a
training set, used to train the models, and a test set, used to test their
predictions. On this basis we checked the accuracy of all the machine learning
regression models to find which model has the highest accuracy, and plotted the
results on a bar graph. The project was implemented using the Python programming
language and the dataset was obtained from the given link.
TABLE OF CONTENTS
Abstract
1. Introduction
1.1 Background
1.2 Goal
1.3 Setup
2. Existing Methods
4. Methodology
5. Implementation
6. Conclusion
7. References
1.INTRODUCTION
In this modern era, businesses are constantly trying to maximize their profits by
increasing their revenue and minimizing their expenses. One way to achieve this
is to use machine learning techniques to analyze and predict profits based on
various factors. In this project, we aim to build a machine learning model that
can predict the profit of a company based on its expenses.
1.1 Background
In the 2000s, the growth of big data and advances in computing technology,
including the development of graphics processing units (GPUs) and cloud
computing, led to a revolution in machine learning. Researchers were able to
develop more sophisticated algorithms and train larger models using vast
amounts of data, leading to breakthroughs in areas such as computer vision,
natural language processing, and speech recognition.
1.2 Goal
The goal of this project is to create machine learning models using different
regression algorithms, namely linear regression, decision tree regression,
random forest regression, and support vector regression. Using all these
models, I set up a machine learning environment that predicts the
future values of the given data. To reach the goals of the project, it is required
to address the following questions:
1.3 Setup
In this setup I used four regression models. Each model is trained on the
training dataset and then used to estimate the output.
Figure: Input dataset -> ML model -> Predicted data
2.EXISTING METHODS
In the past, various regression algorithms have been used to predict the profit
of a company. Some of the most commonly used algorithms are linear
regression, decision tree regression, and support vector regression. However,
the performance of these algorithms may vary depending on the dataset and
the problem at hand.
Additionally, there are some hybrid methods that combine aspects of these
main types, such as semi-supervised learning, where the model is trained on
both labeled and unlabeled data, and transfer learning, where knowledge from
a pre-trained model is used to improve performance on a related task.
There are several types of machine learning algorithm models. Here are
some of the most common ones:
Decision trees: Decision trees are a type of model that uses a tree-
like structure to make decisions. The tree is constructed by splitting
the data into smaller and smaller subsets based on the input features
until a decision is reached. Decision trees are often used for
classification tasks.
These are just a few examples of the many types of machine learning algorithm
models that exist. The choice of model depends on the specific problem at hand
and the type of data being used.
Linear regression can be classified into two types: simple linear regression
and multiple linear regression. Simple linear regression involves only one
independent variable, whereas multiple linear regression involves more
than one independent variable.
The line predicted from the given data is our best-fit line, and it is called
the regression line.
Our next task is to reduce the error between the actual values and the
predicted values. The line with the minimum error between the actual
points and the predicted points is called the line of linear regression, or the
best-fit line.
Let us now look at how the regression line is found, that is, at the linear
regression algorithm.
You can clearly see how the slope and the equation of the regression line
are obtained.
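As a rough illustration of how the slope and intercept of a regression line can be obtained, the sketch below applies the standard least-squares formulas for simple linear regression. The function name fit_line and the sample points are assumptions made for this example and are not part of the project code.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the sample points lie exactly on a line, the fitted slope and intercept recover that line; with noisy data the same formulas give the line with the smallest total squared error.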
The comparison is between the regression line and the actual points, or in
other words, the distance between the actual points and the predicted values.
Hence the computer uses this technique to find the best-fit line: it iterates the
value of m from 0 to 1 and compares the distance between the actual values and
the predicted values. The value of m for which the distance between the actual
values and the predicted values is minimum is selected for the best-fit line.
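The search over m described above can be sketched as follows. This is only a minimal illustration: it assumes a line through the origin (intercept fixed at 0), a step of 0.01 between candidate slopes, and the sum of squared differences as the distance measure. The function names and the sample points are hypothetical.

```python
def sum_squared_error(m, xs, ys):
    """Total squared distance between the actual y and the prediction m * x."""
    return sum((y - m * x) ** 2 for x, y in zip(xs, ys))


def best_slope(xs, ys, steps=100):
    """Try slopes 0.00, 0.01, ..., 1.00 and keep the one with the least error."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda m: sum_squared_error(m, xs, ys))


# Hypothetical points scattered around the line y = 0.5 * x
xs = [1, 2, 3, 4]
ys = [0.6, 0.9, 1.6, 1.9]
m = best_slope(xs, ys)
print(m)
```

Library implementations such as scikit-learn's LinearRegression do not search a grid like this; they solve the least-squares problem directly, which also finds the intercept.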
The R-squared value is a statistical measure of how close the data are to the
fitted regression line.
It is also known as the coefficient of determination, or the coefficient of
multiple determination.
Here the R-squared value is 0.3, which is not very good. As the R-squared
value increases from 0.0 towards 0.9, the actual values come closer to
the regression line, and when it equals 1 the actual values lie on the
regression line. Thus the R-squared value indicates the accuracy of our model.
If the R-squared value is very low, the actual values are very far away from
the regression line.
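To make the meaning of R-squared concrete, here is a small sketch that computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the sample values are assumptions for illustration; in the project itself the score comes from scikit-learn's r2_score.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Squared distance between actual values and the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared distance between actual values and their own mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual values and two sets of predictions
actual = [3, 5, 7, 9]
perfect = [3, 5, 7, 9]   # every point lies on the regression line
rough = [4, 4, 8, 8]     # every prediction is off by 1

print(r_squared(actual, perfect))
print(r_squared(actual, rough))
```

The closer the predictions are to the actual values, the smaller the residual term becomes and the closer the score gets to 1.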
In general, both MAE and MSE are useful metrics to evaluate the
performance of linear regression models and can be used to
compare the performance of different models or to fine-tune
model parameters to improve accuracy.
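A short sketch of both metrics, assuming their usual definitions: MAE is the mean of the absolute differences between actual and predicted values, and MSE is the mean of the squared differences. The helper names mae and mse and the sample values are hypothetical; the project code uses the scikit-learn functions mean_absolute_error and mean_squared_error instead.

```python
def mae(actual, predicted):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error: average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted values
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 40]

print(mae(actual, predicted))
print(mse(actual, predicted))
```

Because MSE squares each error, one large error raises it much more than it raises MAE, which is why the two metrics can rank models differently.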
Suppose we have two features, x[0] and x[1], and y is the target variable.
Here I make a plot where the horizontal axis represents x[0] and the
vertical axis represents x[1].
Darker points represent higher values of y and lighter points
represent lower values of y.
In the decision tree regression algorithm we take one feature
(independent variable) as the root node. The question is then
which split gives the best child nodes. To find this, we calculate which split
decreases the impurity of the child nodes the most. For this we use
the variance reduction technique, which plays the role that entropy or the Gini
index plays in classification problems.
Note that all the yellow and red bubbles are y values.
This tells us that the first split decreases the impurity much more than the
second one, hence we choose the first split rather than the second.
For any given values of x[0] and x[1], the prediction is the mean of the y values
in the leaf node reached after passing all the conditions.
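The comparison of two candidate splits can be sketched as below. This is an illustrative calculation only: it assumes impurity is measured by the variance of y and that a split's quality is the parent variance minus the size-weighted variance of the two children. The function names and the y values are hypothetical.

```python
def variance(values):
    """Population variance of a list of target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted


# Hypothetical target values reaching a node, and two candidate splits
y = [10, 12, 11, 30, 32, 31]
first_split = ([10, 12, 11], [30, 32, 31])
second_split = ([10, 12, 11, 30], [32, 31])

gain_first = variance_reduction(y, *first_split)
gain_second = variance_reduction(y, *second_split)
print(gain_first, gain_second)

# A leaf predicts the mean of the y values that reach it
leaf_prediction = sum(first_split[0]) / len(first_split[0])
print(leaf_prediction)
```

The first split separates the low and high values cleanly, so it removes more variance and would be chosen over the second.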
One drawback of the decision tree is overfitting, that is, high training
accuracy and low testing accuracy. To address this problem we
use the random forest algorithm.
Random forest builds a number of decision trees, each trained on a
random subset of the training data. The algorithm then combines the
predictions of all the trees to obtain a final prediction.
3.3.2 WORKING
Each decision tree is built by sampling rows and features of
the data with replacement.
Lastly, the random forest regressor takes the mean of the
outputs of all the decision trees and returns this value as the prediction.
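The working described above can be sketched with a deliberately simplified forest. Each "tree" here is a single-split stump whose threshold is the mean of one randomly chosen feature, which is an assumption made to keep the example short; real decision trees grow deeper and pick thresholds by variance reduction. All function names and the data are hypothetical.

```python
import random


def fit_stump(rows, targets, feature):
    """One-split tree on a single feature, split at the feature's mean (simplified)."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
    overall = sum(targets) / len(targets)
    # Each leaf predicts the mean target of the rows that reach it
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return feature, threshold, left_mean, right_mean


def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean


def fit_forest(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        sample_rows = [rows[i] for i in idx]
        sample_targets = [targets[i] for i in idx]
        # Pick one random feature for this tree
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump(sample_rows, sample_targets, feature))
    return forest


def predict_forest(forest, row):
    # Final prediction is the mean of the individual tree outputs
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Hypothetical data: two features and a target that grows with both
rows = [[1, 10], [2, 20], [3, 30], [8, 80], [9, 90], [10, 100]]
targets = [5, 6, 7, 50, 52, 54]
forest = fit_forest(rows, targets)
low = predict_forest(forest, [2, 20])
high = predict_forest(forest, [9, 90])
print(low, high)
```

Averaging many trees trained on different samples smooths out the quirks of any single tree, which is how the random forest reduces the overfitting seen with one decision tree.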
SVR can be used for both linear and nonlinear regression problems. It is
particularly useful when dealing with high-dimensional data with a small
number of samples. SVR has been successfully applied in various fields,
including finance, engineering, and bioinformatics.
ALGORITHM:-
Unlike other regression models, which try to minimize the error between the real
and predicted values, SVR tries to fit the best line within a threshold value a
(the distance between the hyperplane and the boundary line). Thus, we can say that
the SVR model tries to satisfy the condition -a < y - (wx + b) < a. It uses the
points within this boundary to predict the value.
3. Boundary Lines: These are the two lines that are drawn around the
hyperplane at a distance of ε (epsilon). They are used to create a margin between
the data points.
The objective of SVR is to fit as many data points as possible without violating
the margin. Note that in SVM classification the support vectors are used
to define the hyperplane, whereas in SVR they are used to define the regression line.
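The idea of tolerating errors inside the margin can be illustrated with the epsilon-insensitive loss, which is the usual way this tolerance is expressed for SVR: a point inside the band of half-width epsilon around the line costs nothing, and a point outside it costs only the amount by which it exceeds the band. The line parameters, the value of epsilon, and the points below are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the band of half-width epsilon, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def predict(w, b, x):
    """Value of the line y = w * x + b at x."""
    return w * x + b


# Hypothetical line y = 2x + 1 with band half-width epsilon = 0.5
w, b, epsilon = 2.0, 1.0, 0.5
points = [(1.0, 3.2), (2.0, 5.0), (3.0, 8.5)]

losses = [epsilon_insensitive_loss(y, predict(w, b, x), epsilon) for x, y in points]
print(losses)
```

The first two points lie within the boundary lines and contribute no loss; only the third point, which falls outside the band, is penalised.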
2. Selection of Kernel
You can choose any kernel, such as the Sigmoid kernel, Polynomial kernel, or
Gaussian kernel, based upon the problem. All of these kernels have
hyperparameters that need to be tuned. In this article, I will be taking the
Gaussian kernel.
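The Gaussian kernel is defined as:
$$k(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$
where x_i and x_j are two input points and σ is the width hyperparameter of the kernel.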
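A minimal sketch of the Gaussian kernel in its common form, exp(-||x1 - x2||^2 / (2 * sigma^2)). The parameter name sigma, its default value, and the sample points are assumptions for illustration; some libraries parameterise the same kernel with gamma = 1 / (2 * sigma^2) instead.

```python
import math


def gaussian_kernel(x1, x2, sigma=1.0):
    """Similarity between two points: 1 when identical, approaching 0 when far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_distance / (2 * sigma ** 2))


# Hypothetical points at increasing distance from [1.0, 2.0]
same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])
near = gaussian_kernel([1.0, 2.0], [1.5, 2.0])
far = gaussian_kernel([1.0, 2.0], [4.0, 6.0])
print(same, near, far)
```

A larger sigma makes the kernel fall off more slowly with distance, so each training point influences a wider region of the input space.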
In SVR this training phase is the most expensive part, and a lot of research is
going on to develop better ways to do it. We can train it with a gradient-based
optimization method such as conjugate gradient (CG), minimizing the cost function.
4.METHODOLOGY:
Firstly, I load the data from the computer with the function
pd.read_csv() (pandas imported as pd), and then print the data.
4.2 Analysing the data: In this step we analyse the data to identify which
features act as independent variables and which feature acts as the dependent
variable, and examine the relationship between the dependent and independent
variables.
To plot the graph, we first sort the data by the column that is plotted
on the x axis. For this we use df_sorted = data.sort_values('R&D Spend'),
which takes the column name as an argument.
2.ADMINISTRATION VS PROFIT
4.3 Data Preprocessing: We removed any missing or irrelevant data from the
dataset. Here we use data.isnull() to find whether there are any null values
in the data. We also plot a heat map using
sns.heatmap(data.isnull()).
If there are any null values in the data, they show up as white marks
in the heat map.
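Besides the heat map, the null check can also be done numerically. The sketch below counts missing values per column and drops incomplete rows. The small DataFrame is made up for illustration and only mirrors the column names of the project dataset; it is not the real data.

```python
import pandas as pd

# Hypothetical miniature dataset with one missing Profit value
data = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 3000.0],
    "Administration": [500.0, 600.0, 700.0],
    "Marketing Spend": [4000.0, 5000.0, 6000.0],
    "Profit": [9000.0, None, 12000.0],
})

# Count missing values in each column
missing_per_column = data.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
clean = data.dropna()
print(len(clean))
```

A per-column count is easier to read than a heat map when the dataset has only a handful of missing entries.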
The type of lr is a LinearRegression object. When the code lr =
LinearRegression() is executed, an instance of the LinearRegression class
is created and assigned to the variable lr. The LinearRegression() function
in scikit-learn is used to create a linear regression model object. The
object lr can then be used to train the model on the training data and
make predictions on the test data.
We do the same with all the other algorithms.
In the random forest you can clearly see the many decision trees.
First we predict the values y_pred_lr from the X_test
values.
Then the model is evaluated on the basis of the comparison between y_pred_lr and y_test.
In the plot, the x axis represents the models and the y axis represents the R-squared scores.
According to our plot, the linear regression model has the highest accuracy.
5.IMPLEMENTATION:
SOURCE CODE:-
# Importing Libraries
import pandas as pd #Version: 1.5.3
import numpy as np #Version: 1.24.2
import matplotlib.pyplot as plt #Version: 3.7.1
import seaborn as sns # Version: 0.12.2
from sklearn.model_selection import train_test_split #Version: 1.2.2
from sklearn.linear_model import LinearRegression #Version: 1.2.2
from sklearn.tree import DecisionTreeRegressor #Version: 1.2.2
from sklearn.ensemble import RandomForestRegressor #Version: 1.2.2
from sklearn.svm import SVR #Version: 1.2.2
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score #Version: 1.2.2
# Loading Dataset
data = pd.read_csv(r'C:\Users\Swapnil Jyot\Downloads\50_Startups.csv')
data
df_sorted = data.sort_values('R&D Spend')
df_sorted
len(data.index)
plt.subplot(3, 2, 1)
data["Profit"].plot.hist()
plt.title('Histogram of Profit')
# Hence the most frequent profit is near 100000, occurring about 10 times
# according to the histogram
# Let's see how all three factors (R&D Spend, Administration, Marketing Spend)
# affect the profit of the companies
df_sorted = data.sort_values('Administration')
X = df_sorted["Administration"].values
Y= df_sorted["Profit"].values
plt.subplot(3, 2, 5)
plt.plot(X,Y,color='#58b970', label='Regression Line')
plt.title('Line Graph of Administration vs Profit')
# ### CONCLUSION
# # DATA WRANGLING
# #finding the null values in the data before training and testing the data
data.isnull()
plt.subplot(3, 2, 2)
sns.heatmap(data.isnull())
plt.title('Heatmap of Missing Values')
# ### Almost zero null values in the data; the data is now ready for training and testing
x=data.drop("Profit",axis=1)
y=data['Profit']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=1)
print(X_train)
print(y_train)
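# Create the four regression models. Default hyperparameters and the variable
# names dt, rf and svr are assumed; only lr is named in the report text.
lr = LinearRegression()
dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
svr = SVR()
# Train each model on the training set
lr.fit(X_train, y_train)
dt.fit(X_train, y_train)
rf.fit(X_train, y_train)
svr.fit(X_train, y_train)
# Predict the profit for the test set
y_pred_lr = lr.predict(X_test)
y_pred_dt = dt.predict(X_test)
y_pred_rf = rf.predict(X_test)
y_pred_svr = svr.predict(X_test)
# Regression metrics for each model, comparing predictions with y_test
mae_lr, mse_lr, r2_lr = mean_absolute_error(y_test, y_pred_lr), mean_squared_error(y_test, y_pred_lr), r2_score(y_test, y_pred_lr)
mae_dt, mse_dt, r2_dt = mean_absolute_error(y_test, y_pred_dt), mean_squared_error(y_test, y_pred_dt), r2_score(y_test, y_pred_dt)
mae_rf, mse_rf, r2_rf = mean_absolute_error(y_test, y_pred_rf), mean_squared_error(y_test, y_pred_rf), r2_score(y_test, y_pred_rf)
mae_svr, mse_svr, r2_svr = mean_absolute_error(y_test, y_pred_svr), mean_squared_error(y_test, y_pred_svr), r2_score(y_test, y_pred_svr)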
print(mae_lr,mse_lr,r2_lr)
print(mae_dt,mse_dt,r2_dt )
print(mae_rf,mse_rf,r2_rf)
print(mae_svr,mse_svr,r2_svr)
# Rows are MAE, MSE and R-squared, in that order
reg_metrices = pd.DataFrame({
    'LRM': [mae_lr, mse_lr, r2_lr],
    'DTR': [mae_dt, mse_dt, r2_dt],
    'RFR': [mae_rf, mse_rf, r2_rf],
    'SVR': [mae_svr, mse_svr, r2_svr],
})
reg_metrices
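# Collect the R-squared scores for plotting. Expressing them as percentages is
# an assumption, made to match the 0-100 y-axis limit set below.
r2_scores = pd.DataFrame({
    'Model': ['LRM', 'DTR', 'RFR', 'SVR'],
    'R-squared Score': [r2_lr * 100, r2_dt * 100, r2_rf * 100, r2_svr * 100],
})
plt.subplot(3, 2, 3)
colors = ['red', 'green', 'blue', 'orange']
plt.bar(r2_scores['Model'], r2_scores['R-squared Score'], color=colors)
# Write each score above its bar
for i in range(len(r2_scores['Model'])):
    score = r2_scores['R-squared Score'][i]
    plt.text(i, score, f'{score:.2f}', ha='center', va='bottom')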
plt.title('Comparison of R-squared Scores for Different Regression Models')
plt.xlabel('Regression Model')
plt.ylabel('R-squared Score')
plt.ylim([0.0, 100.0])
plt.tight_layout()
plt.show()
# ### Hence you can clearly see which of the different regression models has the highest accuracy
6.CONCLUSION:
References:
4. https://www.analyticsvidhya.com/blog/
5. https://www.kaggle.com/