E AI Lab EX 2 and 3

The document provides a comprehensive guide on performing exploratory data analysis (EDA) for a two-variable linear regression model, including steps such as loading data, calculating summary statistics, creating visualizations, and checking assumptions. It also includes Python code examples for generating histograms, scatter plots, and correlation analysis using a diabetes dataset. Additionally, the document demonstrates how to implement linear regression models with and without bias using custom classes.

2. Exploratory data analysis on a two-variable linear regression model


Exploratory data analysis (EDA) is a crucial step in understanding the relationship between
variables before fitting a regression model. In the case of a two-variable linear regression model,
the goal is to explore the relationship between the predictor variable (independent variable) and
the response variable (dependent variable). Here's a step-by-step guide on how to perform EDA
for a two-variable linear regression model:

Load Data: Begin by loading your dataset containing the two variables of interest: the predictor
variable (X) and the response variable (Y).

Summary Statistics: Compute summary statistics for both variables. This includes measures such
as mean, median, standard deviation, minimum, maximum, and quartiles. This will give you an
initial understanding of the distribution and range of values for each variable.
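
A minimal sketch of these first two steps, assuming the same diabetes.csv file used in the exercises below (any dataset with a predictor and a response column works the same way):

import pandas as pd

# Load the dataset
diab = pd.read_csv("diabetes.csv")
# describe() reports count, mean, std, min, quartiles (the 50% row is the
# median) and max for each selected variable
print(diab[["BloodPressure", "BMI"]].describe())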

Scatter Plot: Create a scatter plot with the predictor variable on the x-axis and the response variable on the y-axis. This visualization helps you observe the overall pattern and any potential relationship between the variables. A worked seaborn example appears in section (c) below.

Correlation Analysis: Calculate the correlation coefficient between the predictor and response variables. This quantifies the strength and direction of the linear relationship between the two variables. For linear relationships, the Pearson correlation coefficient is typically used; section (b) below computes it with scipy.stats.pearsonr.

Residual Analysis: If you already have a fitted regression model, you can perform residual
analysis. Residuals are the differences between the observed and predicted values. Plot the
residuals against the predictor variable to check for patterns or heteroscedasticity, which could
indicate violations of assumptions.
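
A minimal sketch of a residual plot, assuming an illustrative straight-line fit of BMI on BloodPressure obtained with np.polyfit (this variable pairing is an assumption, not part of the lab code below):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

diab = pd.read_csv("diabetes.csv")
# Fit y = slope * x + intercept; polyfit returns the highest power first
slope, intercept = np.polyfit(diab["BloodPressure"], diab["BMI"], deg=1)
# Residual = observed - predicted
residuals = diab["BMI"] - (slope * diab["BloodPressure"] + intercept)
plt.scatter(diab["BloodPressure"], residuals)
plt.axhline(0, color="red")
plt.xlabel("BloodPressure")
plt.ylabel("Residual")
plt.title("Residuals vs. BloodPressure")
plt.show()

A random, evenly spread band around zero supports the model assumptions; a funnel shape suggests heteroscedasticity.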

Additional Visualizations: Depending on the nature of your data, you may want to create
additional visualizations such as box plots, histograms, or density plots for each variable to
further understand their distributions and identify any potential outliers.
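
Box plots and histograms are demonstrated in section (a) below; as a minimal sketch of the remaining option, a density plot (the column choice here is an assumption for illustration):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

diab = pd.read_csv("diabetes.csv")
# A kernel density estimate smooths the histogram into a continuous curve
sns.kdeplot(diab["BMI"])
plt.title("Density Plot for BMI")
plt.show()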

Assumption Checking: Finally, it's essential to assess whether the assumptions of linear
regression are met. These assumptions include linearity, independence, homoscedasticity, and
normality of residuals. EDA can help identify potential violations of these assumptions.
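
A minimal sketch of the normality check, reusing the illustrative BMI-on-BloodPressure fit from the residual analysis step; scipy's probplot draws a normal Q-Q plot of the residuals:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

diab = pd.read_csv("diabetes.csv")
slope, intercept = np.polyfit(diab["BloodPressure"], diab["BMI"], deg=1)
residuals = diab["BMI"] - (slope * diab["BloodPressure"] + intercept)
# Points lying close to the reference line indicate approximately normal residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()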

By following these steps, you'll gain valuable insights into the relationship between the predictor
and response variables, which will inform the subsequent steps of fitting and interpreting your
linear regression model.
a) HISTOGRAMS

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline

diab = pd.read_csv("diabetes.csv")
print("Diabetes DataFile headers Details")
print(diab.head())

dia1 = diab[diab.Outcome == 1]
dia0 = diab[diab.Outcome == 0]

sns.countplot(x=diab.Outcome)
plt.title("Count Plot for Outcome")

# Creating 3 subplots - 1st for a histogram, 2nd for the histogram segmented
# by Outcome and 3rd for the same segmentation using a boxplot
plt.figure(figsize=(20, 6))
plt.subplot(1, 3, 1)
sns.set_style("dark")
plt.title("Histogram for BloodPressure")
sns.histplot(diab.BloodPressure)
plt.subplot(1, 3, 2)
sns.histplot(dia0.BloodPressure, color="Blue", label="BloodPressure for Outcome=0")
sns.histplot(dia1.BloodPressure, color="Gold", label="BloodPressure for Outcome=1")
plt.title("Histograms for BloodPressure by Outcome")
plt.legend()
plt.subplot(1, 3, 3)
sns.boxplot(x=diab.Outcome, y=diab.BloodPressure)
plt.title("Boxplot for BloodPressure by Outcome")
OUTPUT

b) CORRELATION

import pandas as pd
import seaborn as sns
from scipy import stats

diab = pd.read_csv("diabetes.csv")
print("Diabetes DataFile headers Details")
print(diab.head())

ax = sns.scatterplot(x="BloodPressure", y="BMI", data=diab)
ax.set_title("BloodPressure vs. BMI")
ax.set_xlabel("BloodPressure")
sns.lmplot(x="BloodPressure", y="BMI", data=diab)
sns.lmplot(x="BloodPressure", y="BMI", hue="Outcome", data=diab)

print("Correlation coefficient between BloodPressure and BMI")
print(stats.pearsonr(diab['BloodPressure'], diab['BMI']))

cormat = diab.corr()
print("correlation MATRIX")
print(round(cormat, 2))
sns.heatmap(cormat)

OUTPUT

Diabetes DataFile headers Details


   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6
1            1       85             66             29        0  26.6
2            8      183             64              0        0  23.3
3            1       89             66             23       94  28.1
4            0      137             40             35      168  43.1

   DiabetesPedigreeFunction  Age  Outcome
0                     0.627   50        1
1                     0.351   31        0
2                     0.672   32        1
3                     0.167   21        0
4                     2.288   33        1
Correlation coefficient between BloodPressure and BMI
(0.28180528884991063, 1.7378883832365869e-15)
correlation MATRIX
                          Pregnancies  Glucose  BloodPressure  SkinThickness  \
Pregnancies                      1.00     0.13           0.14          -0.08
Glucose                          0.13     1.00           0.15           0.06
BloodPressure                    0.14     0.15           1.00           0.21
SkinThickness                   -0.08     0.06           0.21           1.00
Insulin                         -0.07     0.33           0.09           0.44
BMI                              0.02     0.22           0.28           0.39
DiabetesPedigreeFunction        -0.03     0.14           0.04           0.18
Age                              0.54     0.26           0.24          -0.11
Outcome                          0.22     0.47           0.07           0.07

                          Insulin   BMI  DiabetesPedigreeFunction   Age  \
Pregnancies                 -0.07  0.02                     -0.03  0.54
Glucose                      0.33  0.22                      0.14  0.26
BloodPressure                0.09  0.28                      0.04  0.24
SkinThickness                0.44  0.39                      0.18 -0.11
Insulin                      1.00  0.20                      0.19 -0.04
BMI                          0.20  1.00                      0.14  0.04
DiabetesPedigreeFunction     0.19  0.14                      1.00  0.03
Age                         -0.04  0.04                      0.03  1.00
Outcome                      0.13  0.29                      0.17  0.24

                          Outcome
Pregnancies                  0.22
Glucose                      0.47
BloodPressure                0.07
SkinThickness                0.07
Insulin                      0.13
BMI                          0.29
DiabetesPedigreeFunction     0.17
Age                          0.24
Outcome                      1.00

CORRELATION HEAT MAP
c) SCATTER PLOT

# use regplot
import pandas as pd
import seaborn as sns

diab = pd.read_csv("diabetes.csv")
print("Diabetes DataFile headers Details")
print(diab.head())
sns.regplot(x="Age", y="BloodPressure", ci=None, data=diab)

OUTPUT

3. Experiment with the regression model without bias and with bias

3.a) Experiment with the regression model with bias
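
The implementation below appends a column of ones to X so that the first learned weight acts as the bias (intercept). All weights are then obtained in closed form from the normal equation, w = (X^T X)^(-1) X^T y.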

import numpy as np

class LinearRegression:
    def __init__(self):
        self.weights = None

    def fit(self, X, y):
        # Add bias term to input features
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]
        # Calculate weights using the normal equation
        self.weights = np.linalg.inv(X_bias.T.dot(X_bias)).dot(X_bias.T).dot(y)

    def predict(self, X):
        # Add bias term to input features
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]
        # Predict using the learned weights
        y_pred = X_bias.dot(self.weights)
        return y_pred

# Example usage:
# Generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Make predictions
X_new = np.array([[0], [2]])
y_pred = model.predict(X_new)

# Print the learned weights
print("Learned weights (bias, slope):", model.weights)

# Plot the data and the regression line
import matplotlib.pyplot as plt
plt.scatter(X, y)
plt.plot(X_new, y_pred, 'r-', label='Predictions')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression with Bias')
plt.legend()
plt.show()
OUTPUT

3.b) Experiment with the regression model without bias
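
Without a bias column, the fitted line is forced through the origin, y = Xw. The least-squares weights are w = pinv(X) y, where pinv is the Moore-Penrose pseudo-inverse; this equals (X^T X)^(-1) X^T y when X has full column rank, and np.linalg.pinv computes it stably.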

import numpy as np

class LinearRegressionNoBias:
    def __init__(self):
        self.weights = None

    def fit(self, X, y):
        # Calculate weights using the pseudo-inverse
        self.weights = np.linalg.pinv(X).dot(y)

    def predict(self, X):
        # Predict using the learned weights
        y_pred = X.dot(self.weights)
        return y_pred

# Example usage:
# Generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 3 * X + np.random.randn(100, 1)  # No bias term in the data

# Fit the linear regression model
model = LinearRegressionNoBias()
model.fit(X, y)

# Make predictions
X_new = np.array([[0], [2]])
y_pred = model.predict(X_new)

# Print the learned weights
print("Learned weights (slope):", model.weights)

# Plot the data and the regression line
import matplotlib.pyplot as plt
plt.scatter(X, y)
plt.plot(X_new, y_pred, 'r-', label='Predictions')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression without Bias')
plt.legend()
plt.show()

OUTPUT
