
Simple Linear Regression

This document discusses linear regression for salary prediction. It describes the assumptions and process for building a simple linear regression model, including splitting data, fitting a model, and making predictions. The model is used to predict salary based on years of experience.



Linear Regression

Supervised machine learning algorithms: a type of machine learning where the algorithm learns from labeled data.

Labeled data means a dataset whose target values are already known.

Supervised learning has two types:

Classification: predicts the class of an observation from the independent input variables. A class is a categorical or discrete value, e.g. whether an image of an animal shows a cat or a dog.

Regression: predicts a continuous output variable from the independent input variables, e.g. predicting house prices from parameters such as house age, distance from the main road, location, and area.

Simple linear regression uses a single feature: the change in the value of the target 'Y' is proportional to the change in the value of the feature 'X'.

Y : Dependent or Target Variable.

X : Independent Variable.


Regression Line: the best-fit line of the model, which lets us predict the value of 'Y' for new values of 'X'.
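
Concretely, the fitted line has the form Y = b0 + b1·X, where b0 is the intercept and b1 is the slope (here, the expected change in salary per additional year of experience). Training the model means estimating b0 and b1 from the data, typically by ordinary least squares.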

Assumptions of Linear Regression:

Linear regression makes several key assumptions about the data and the relationships it
models. Violations of these assumptions can affect the validity and reliability of the
regression results. Here are the main assumptions of linear regression:

Linearity: The relationship between the independent variable(s) and the dependent
variable is linear. This means that the change in the dependent variable for a unit
change in the independent variable is constant.

Independence of Errors: The errors (residuals) of the model are assumed to be independent of each other. In other words, the error of one observation should not be influenced by the errors of other observations.

Homoscedasticity: The variance of the residuals is constant across all levels of the independent variables. This means that the spread of residuals should be roughly the same throughout the range of the predictor variables.

Normality of Errors: The errors (residuals) should be normally distributed. This assumption is important for hypothesis testing and constructing confidence intervals.

No or Little Multicollinearity: Multicollinearity occurs when two or more independent variables in the model are highly correlated. This can make it difficult to interpret the individual effect of each variable on the dependent variable.

No Endogeneity: Endogeneity refers to the situation where an independent variable is correlated with the error term. This can arise from omitted-variable bias or simultaneous causation, and can lead to biased and inconsistent coefficient estimates.

No Autocorrelation: Autocorrelation occurs when the residuals of the model are correlated with each other. This assumption is especially important for time-series data, where observations depend on previous observations.


No Perfect Collinearity: Perfect collinearity exists when one independent variable can
be perfectly predicted by a linear combination of other independent variables. This
situation leads to a rank-deficient matrix, making it impossible to estimate unique
regression coefficients.
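
Several of these assumptions can be eyeballed from the residuals of a fitted model. Below is a minimal diagnostic sketch, assuming a fitted scikit-learn model named `regressor` and training data `x_train`/`y_train` like the ones built later in this notebook: residuals vs. fitted values for linearity and homoscedasticity, and a histogram for normality.

# Residual diagnostics (assumes regressor, x_train, y_train exist as built below)
import matplotlib.pyplot as plt

fitted = regressor.predict(x_train)
residuals = y_train.to_numpy().ravel() - fitted.ravel()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted: a patternless, evenly spread band around zero
# supports linearity and homoscedasticity.
ax1.scatter(fitted, residuals)
ax1.axhline(0, linestyle='--')
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs. fitted')

# Histogram of residuals: a roughly bell-shaped distribution
# supports the normality assumption.
ax2.hist(residuals, bins=10)
ax2.set_xlabel('Residual')
ax2.set_title('Residual distribution')

plt.tight_layout()
plt.show()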

Salary Prediction using Simple Linear Regression

# Step 1: Import important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Import the dataset
# (the filename below is assumed; the original path was cut off in the printout)
data = pd.read_csv('/kaggle/input/salary-dataset-simple-linear-regression/Salary_dataset.csv')
print(data.head())

   Unnamed: 0  YearsExperience   Salary
0           0              1.2  39344.0
1           1              1.4  46206.0
2           2              1.6  37732.0
3           3              2.1  43526.0
4           4              2.3  39892.0

data.shape

(30, 3)

# Get information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 30 non-null int64
1 YearsExperience 30 non-null float64
2 Salary 30 non-null float64
dtypes: float64(2), int64(1)
memory usage: 848.0 bytes

Exploratory Data Analysis (EDA):

# 1. NULL Value Treatment


data.isna().sum()
# So, no null values present

Unnamed: 0 0
YearsExperience 0
Salary 0
dtype: int64

# 2. Check for duplicate rows
data.duplicated().sum()
# No duplicates present

0

# 3. Calculate summary statistics

data.describe()


       Unnamed: 0  YearsExperience         Salary
count   30.000000        30.000000      30.000000
mean    14.500000         5.413333   76004.000000
std      8.803408         2.837888   27414.429785
min      0.000000         1.200000   37732.000000
25%      7.250000         3.300000   56721.750000
50%     14.500000         4.800000   65238.000000
75%     21.750000         7.800000  100545.750000
max     29.000000        10.600000  122392.000000

# 4. No categorical variables present, so no encoding is needed
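
Before fitting, it is also worth confirming the linearity assumption numerically. A small sketch using pandas' built-in Pearson correlation; a value near +1 supports a linear relationship between experience and salary:

# Pearson correlation between the feature and the target
data['YearsExperience'].corr(data['Salary'])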


Split Dataset:

# Extract the dependent (Y - target) and independent (X) features from the dataset
X = data['YearsExperience']
Y = data['Salary']

Splitting Training and Testing Dataset:

from sklearn.model_selection import train_test_split

# The random_state value was cut off in the original printout;
# any fixed seed (e.g. 42) makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Convert Series to DataFrame: scikit-learn expects the feature matrix
# to be 2-D, with shape (n_samples, n_features)
x_train = pd.DataFrame(x_train)
x_test = pd.DataFrame(x_test)
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)

Model Fitting:

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression()
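
Once fitted, the learned line can be inspected directly. `intercept_` and `coef_` are standard `LinearRegression` attributes; the exact numbers depend on the train/test split, so none are shown here.

# Intercept b0 (base salary) and slope b1 (salary increase per year of experience)
print('intercept b0:', regressor.intercept_)
print('slope b1:', regressor.coef_)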


# Predict output for the x_test dataset

y_pred = regressor.predict(x_test)
y_pred

array([[39297.22202233],
[75603.43359409],
[37386.36878171],
[60316.60766914],
[63182.88753007],
[52673.19470666]])
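
A quick plot makes the fit easy to judge. This is a minimal sketch using the variables defined above: the training points as a scatter, with the fitted line drawn over them (all fitted points lie on one straight line, so the plot traces the regression line).

plt.scatter(x_train, y_train, label='Training data')
plt.plot(x_train, regressor.predict(x_train), color='red', label='Fitted line')
plt.xlabel('Years of experience')
plt.ylabel('Salary')
plt.legend()
plt.show()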

Evaluating Model Performance:

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
mse

36064238.493955195
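
MSE is in squared salary units, which is hard to read directly. Taking the square root gives the RMSE in the same units as salary; from the MSE above this comes out to roughly 6005.

# Root Mean Squared Error: error in the same units as Salary
rmse = np.sqrt(mse)
rmse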

# R2 score
r2 = r2_score(y_test, y_pred)
r2
# R2 is about 0.81, which is close to 1: the model explains roughly
# 81% of the variance in salary, so the regression line fits well.

0.8143022783109011

# Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
mae

5392.453356511894
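
Finally, the fitted model can be used for a new candidate. A small usage sketch; the 5.0 years of experience below is just an illustrative input, and the column name must match the training data:

# Predict the salary of a hypothetical candidate with 5 years of experience
new_candidate = pd.DataFrame({'YearsExperience': [5.0]})
predicted_salary = regressor.predict(new_candidate)
print(predicted_salary)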
