Experiment No. 8

The document outlines an experiment to implement a simple Linear Regression algorithm using a salary dataset from Kaggle. It includes steps for loading data, training a model, making predictions, evaluating the model's performance, and visualizing results. Key metrics such as Mean Squared Error and R-squared are calculated, and the model parameters are printed, along with a prediction for a specific input.

Uploaded by aryanbajpai916

Experiment No. 8
Aim: Implement and demonstrate the simple Linear Regression algorithm on a given set
of training data samples. Read the training data from a .CSV file. Use the salary dataset
from Kaggle.
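
For reference, the model that the script fits can be written out as follows. This is the standard least-squares formulation for one feature. The symbols (x_i for years of experience, y_i for salary, n for the number of samples, \beta_0 for the intercept, \beta_1 for the slope) are notation introduced here for illustration and do not appear in the original handout.

\[
\hat{y}_i = \beta_0 + \beta_1 x_i, \qquad
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\beta_0 = \bar{y} - \beta_1 \bar{x}
\]

\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]

In the script, \beta_0 and \beta_1 are estimated from the training split, while MSE and R^2 are computed over the samples of the test split.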

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv('Salary_Data.csv')

# Display the first few rows of the dataframe to inspect the data
print("First 5 rows of the dataset:")
print(df.head())

# Get information about the dataset (columns, data types, non-null values)
print("\nDataset information:")
# df.info() prints a summary of the DataFrame to the console: the data type of
# each column, the number of non-null values, and the memory usage.
df.info()

# Check for missing values
print("\nMissing values:")
# Print the number of missing values in each column of df. This is important
# for data cleaning, as linear regression models require complete data.
print(df.isnull().sum())

# Extract the independent variable (Years of Experience) and the dependent
# variable (Salary) by position, so the code does not depend on the exact
# column headers in the CSV.
# X: all rows and every column except the last, as a 2D NumPy array.
X = df.iloc[:, :-1].values
# y: all rows and only the last column, as a 1D NumPy array.
y = df.iloc[:, -1].values
# Selecting by name also works if the headers match your file, for example
# X = df[['Years of Experience']] and y = df['Salary'].

# Split the dataset into training and testing sets
# We'll use 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# X_train: Independent variables for the training set.
# X_test: Independent variables for the testing set.
# y_train: Dependent variable for the training set.
# y_test: Dependent variable for the testing set.
# test_size=0.2: Specifies that 20% of the data should be used for the test set.
# random_state=42: Sets the random seed to 42, so the data is split in the same
# way each time the code is run, which is important for reproducibility.

# Create a Linear Regression model (an instance of the LinearRegression class)
model = LinearRegression()

# Train the model using the training data. The fit() method learns the
# relationship between the independent variables (X_train) and the dependent
# variable (y_train).
model.fit(X_train, y_train)

# Make predictions on the test set; the predicted values are stored in y_pred
y_pred = model.predict(X_test)

# Evaluate the model
# Mean Squared Error (MSE): the average squared difference between the actual
# test values (y_test) and the predicted values (y_pred). Lower is better.
mse = mean_squared_error(y_test, y_pred)
# R-squared (coefficient of determination): the proportion of the variance in
# the dependent variable that is explained by the independent variable(s).
r2 = r2_score(y_test, y_pred)

print("\nModel Evaluation:")
print(f"Mean Squared Error: {mse:.2f}")  # Formatted to two decimal places.
print(f"R-squared: {r2:.2f}")  # Formatted to two decimal places.

# Visualize the training data and the regression line
plt.figure(figsize=(10, 6))  # Create a new figure with a size of 10x6 inches.
# Scatter plot of the training data.
# X_train: The independent variable (Years of Experience) for the training data.
# y_train: The dependent variable (Salary) for the training data.
# color='blue': Sets the color of the data points to blue.
# label='Training Data': Sets the legend label for the data points.
plt.scatter(X_train, y_train, color='blue', label='Training Data')
# Plot the regression line.
# model.predict(X_train): The Salary values the model predicts for the
# training inputs.
# color='red': Sets the color of the line to red.
# label='Regression Line': Sets the legend label for the line.
plt.plot(X_train, model.predict(X_train), color='red', label='Regression Line')
plt.title('Salary vs. Years of Experience (Training Set)')  # Set the title of the plot.
plt.xlabel('Years of Experience')  # Set the label for the x-axis.
plt.ylabel('Salary')  # Set the label for the y-axis.
plt.legend()  # Display the legend for the data points and the regression line.
plt.show()  # Display the plot.

# Visualize the test data and the predictions
plt.figure(figsize=(10, 6))  # Create a new figure with a size of 10x6 inches.
# Scatter plot of the test data.
# X_test: The independent variable (Years of Experience) for the test data.
# y_test: The dependent variable (Salary) for the test data.
# color='green': Sets the color of the data points to green.
# label='Test Data': Sets the legend label for the data points.
plt.scatter(X_test, y_test, color='green', label='Test Data')
# Plot the predicted values.
# y_pred: The predicted values of Salary based on Years of Experience for the
# test data.
# color='red': Sets the color of the line to red.
# label='Predictions': Sets the legend label for the line.
plt.plot(X_test, y_pred, color='red', label='Predictions')
plt.title('Salary vs. Years of Experience (Test Set)')  # Set the title of the plot.
plt.xlabel('Years of Experience')  # Set the label for the x-axis.
plt.ylabel('Salary')  # Set the label for the y-axis.
plt.legend()  # Display the legend.
plt.show()  # Display the plot.

# Print the model parameters (intercept and coefficient)
print("\nModel Parameters:")
# The intercept is the predicted value of Salary when Years of Experience is 0.
print(f"Intercept: {model.intercept_:.2f}")
# The coefficient is the change in Salary for each one-unit increase in
# Years of Experience.
print(f"Coefficient: {model.coef_[0]:.2f}")

# Example prediction: Predict salary for 10 years of experience
# The input to model.predict() must be a 2D array, hence the double brackets.
years_of_experience = np.array([[10]])
predicted_salary = model.predict(years_of_experience)
# Print the predicted salary, formatted to two decimal places.
print(f"\nPredicted salary for 10 years of experience: ${predicted_salary[0]:.2f}")
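
The script above leaves the fitting and scoring to scikit-learn. To make those steps less of a black box, here is a minimal pure-Python sketch of the same calculations for a single feature, using the closed-form least-squares solution. The function names (fit_line, predict_line, mse_sketch, r2_sketch) and the five data points are hypothetical, chosen only for illustration; they are not taken from Salary_Data.csv or from scikit-learn.

```python
# Sketch of simple linear regression without external libraries.
# All names and data here are illustrative assumptions, not part of the
# original experiment or of scikit-learn's API.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def predict_line(intercept, slope, xs):
    """Evaluate the fitted line at each x."""
    return [intercept + slope * x for x in xs]

def mse_sketch(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2_sketch(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up toy data lying exactly on a line (NOT the Kaggle dataset).
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [30000.0, 35000.0, 40000.0, 45000.0, 50000.0]

intercept, slope = fit_line(years, salary)
fitted = predict_line(intercept, slope, years)

print(f"Intercept: {intercept:.2f}")
print(f"Coefficient: {slope:.2f}")
print(f"Mean Squared Error: {mse_sketch(salary, fitted):.2f}")
print(f"R-squared: {r2_sketch(salary, fitted):.2f}")
print(f"Predicted salary for 10 years: {predict_line(intercept, slope, [10.0])[0]:.2f}")
```

Because the toy points are perfectly linear, the fitted line passes through all of them. With real salary data the residuals are nonzero, and evaluating on a held-out test split, as the script does, gives a more honest estimate of the error.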
