E AI Lab EX 2 and 3
Load Data: Begin by loading your dataset containing the two variables of interest: the predictor
variable (X) and the response variable (Y).
Summary Statistics: Compute summary statistics for both variables. This includes measures such
as the mean, median, standard deviation, minimum, maximum, and quartiles, and gives you an
initial understanding of the distribution and range of values for each variable (a short sketch of
this step appears after this overview).
Scatter Plot: Create a scatter plot with the predictor variable on the x-axis and the response
variable on the y-axis. This visualization helps you observe the overall pattern and any potential
relationship between the variables.
Correlation Analysis: Calculate the correlation coefficient between the predictor and response
variables. This quantifies the strength and direction of the linear relationship between the two
variables; typically, the Pearson correlation coefficient is used for linear relationships.
Residual Analysis: If you already have a fitted regression model, you can perform residual
analysis. Residuals are the differences between the observed and predicted values. Plot the
residuals against the predictor variable to check for patterns or heteroscedasticity, which could
indicate violations of the model assumptions (see the sketch after this overview).
Additional Visualizations: Depending on the nature of your data, you may want to create
additional visualizations such as box plots, histograms, or density plots for each variable to
further understand their distributions and identify any potential outliers.
Assumption Checking: Finally, it's essential to assess whether the assumptions of linear
regression are met. These assumptions include linearity, independence, homoscedasticity, and
normality of residuals. EDA can help identify potential violations of these assumptions.
By following these steps, you'll gain valuable insights into the relationship between the predictor
and response variables, which will inform the subsequent steps of fitting and interpreting your
linear regression model.
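The summary-statistics and residual-analysis steps have no listing in the parts below, so here is a minimal sketch of both. It reuses the diabetes.csv file and the BloodPressure/BMI pair from part b); the np.polyfit line fit is only an illustrative assumption, not part of the lab listings.

import pandas as pd
import numpy as np
import seaborn as sns

diab = pd.read_csv("diabetes.csv")

# Summary statistics: count, mean, std, min, quartiles, and max for every column
print(diab.describe())

# Residual analysis: fit BMI on BloodPressure, then plot residuals against the predictor
slope, intercept = np.polyfit(diab["BloodPressure"], diab["BMI"], deg=1)
residuals = diab["BMI"] - (slope * diab["BloodPressure"] + intercept)
ax = sns.scatterplot(x=diab["BloodPressure"], y=residuals)
ax.axhline(0, color="red")
ax.set_title("Residuals vs. BloodPressure")
ax.set_ylabel("Residual")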
a) HISTOGRAMS
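A minimal sketch for this part, assuming the same diabetes.csv file, two representative columns, and seaborn 0.11+ (which provides histplot):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

diab = pd.read_csv("diabetes.csv")

# Histogram (with a density curve) for each variable of interest
for col in ["BloodPressure", "BMI"]:
    plt.figure()
    sns.histplot(data=diab, x=col, kde=True)
    plt.title(f"Distribution of {col}")

# Box plots of the same columns to spot potential outliers
plt.figure()
sns.boxplot(data=diab[["BloodPressure", "BMI"]])
plt.title("Box plots of BloodPressure and BMI")
plt.show()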
b) CORRELATION
import pandas as pd

diab = pd.read_csv("diabetes.csv")
print("Diabetes data file - first five rows")
print(diab.head())
import seaborn as sns

# Scatter plot of BloodPressure against BMI, with a title and axis label
ax = sns.scatterplot(x="BloodPressure", y="BMI", data=diab)
ax.set_title("BloodPressure vs. BMI")
ax.set_xlabel("BloodPressure");

# lmplot overlays a fitted regression line on the scatter plot
sns.lmplot(x="BloodPressure", y="BMI", data=diab);
# hue="BloodPressure" simply shades the points by their BloodPressure value
sns.lmplot(x="BloodPressure", y="BMI", hue="BloodPressure", data=diab);
from scipy import stats
print("Correaltion coefficient between BloodPressure and BMI")
print(stats.pearsonr(diab['BloodPressure'], diab['BMI']))
# Pairwise correlation matrix of all numeric columns
cormat = diab.corr()
print("Correlation matrix")
print(round(cormat, 2))
# Heat map of the pairwise correlations
sns.heatmap(cormat);
OUTPUT: correlation coefficient, correlation matrix, and correlation heat map.
c) SCATTER PLOT
# use regplot
import pandas as pd
import seaborn as sns

diab = pd.read_csv("diabetes.csv")
print("Diabetes data file - first five rows")
print(diab.head())

# regplot: scatter plot of Age vs. BloodPressure with a fitted regression line (no confidence band)
sns.regplot(x="Age", y="BloodPressure", ci=None, data=diab)
OUTPUT: regression plot of Age vs. BloodPressure.
3. Experiment with the regression model without a bias term and with a bias term
import numpy as np

class LinearRegression:
    # Ordinary least squares WITH a bias (intercept) term, fitted via the normal equation
    def __init__(self):
        self.weights = None  # [bias, slope]
    def fit(self, X, y):
        X_b = np.c_[np.ones((X.shape[0], 1)), X]  # prepend a column of ones for the bias
        self.weights = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
    def predict(self, X):
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b @ self.weights

# Example usage: generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
# Fit the model
model = LinearRegression()
model.fit(X, y)
# Make predictions
X_new = np.array([[0], [2]])
y_pred = model.predict(X_new)
print("Weights (bias, slope):", model.weights.ravel(), "Predictions:", y_pred.ravel())
import numpy as np

class LinearRegressionNoBias:
    # Least squares WITHOUT a bias term: the fitted line is forced through the origin
    def __init__(self):
        self.weights = None  # slope only
    def fit(self, X, y):
        self.weights = np.linalg.pinv(X.T @ X) @ X.T @ y  # normal equation, no intercept column
    def predict(self, X):
        return X @ self.weights

# Example usage: generate some sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 3 * X + np.random.randn(100, 1)  # No bias term in the data
# Fit the model
model = LinearRegressionNoBias()
model.fit(X, y)
# Make predictions
X_new = np.array([[0], [2]])
y_pred = model.predict(X_new)
print("Weight (slope):", model.weights.ravel(), "Predictions:", y_pred.ravel())
OUTPUT: learned weights and predictions for the with-bias and no-bias models.
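As an optional cross-check (not part of the original listing), the same with/without-bias comparison can be reproduced with scikit-learn by toggling fit_intercept. A minimal sketch, assuming scikit-learn is installed; the import is aliased so it does not shadow the custom class above.

import numpy as np
from sklearn.linear_model import LinearRegression as SkLinearRegression

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# fit_intercept=True keeps the bias term; fit_intercept=False forces the line through the origin
with_bias = SkLinearRegression(fit_intercept=True).fit(X, y)
no_bias = SkLinearRegression(fit_intercept=False).fit(X, y)

print("With bias:    intercept =", with_bias.intercept_, "slope =", with_bias.coef_.ravel())
print("Without bias: intercept =", no_bias.intercept_, "slope =", no_bias.coef_.ravel())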