
Assignment 2 B


Computer Laboratory-I Class: BE (AI &DS)

Assignment No:2 B

Title: Implement Multiple Linear Regression

Dataset Link: https://www.kaggle.com/datasets/uciml/iris

Problem Statement:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets

Objectives: To apply different regression techniques for making predictions in different
applications.

Theory:
Multiple Linear Regression models the relationship between two or more features and a
response by fitting a linear equation to observed data. The steps are almost the same as
for simple linear regression; the difference lies in the evaluation. The fitted model can be
used to find out which factor has the highest impact on the predicted output and how the
different variables relate to each other.
Here: Y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + ... + bn * xn
Y = dependent variable; x1, x2, x3, ..., xn = independent variables; b0 = intercept;
b1, b2, ..., bn = regression coefficients
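As a minimal sketch of how the coefficients b0, b1, ..., bn can be estimated, the snippet below fits the equation above by ordinary least squares with NumPy. The data and the "true" coefficients are made up for illustration and are not taken from the assignment data set.

```python
import numpy as np

# Synthetic, noise-free data (assumption: coefficients chosen for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))            # two independent variables x1, x2
true_b = np.array([1.0, 2.0, -3.0])     # b0, b1, b2
y = true_b[0] + X @ true_b[1:]          # Y = b0 + b1*x1 + b2*x2

# Prepend a column of ones so the intercept b0 is estimated with the slopes.
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of A @ b = y.
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(b_hat)
```

Because the synthetic data contain no noise, the estimated coefficients should match the chosen ones up to floating-point error; with real data they would only approximate the underlying relationship.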
Assumptions of the Regression Model:
Linearity: The relationship between dependent and independent variables should be linear.
Homoscedasticity: Constant variance of the errors should be maintained.
Multivariate normality: Multiple Regression assumes that the residuals are normally
distributed.
Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the data.
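One common way to check the multicollinearity assumption is the variance inflation factor (VIF), which regresses each feature on the remaining ones. The helper below is a hypothetical, NumPy-only sketch on synthetic data, not part of the original notebook; a VIF above roughly 5 to 10 is often read as a warning sign.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2) from
    regressing that column on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.05 * rng.normal(size=200)   # nearly a linear combination

vif_independent = vif(np.column_stack([x1, x2]))
vif_collinear = vif(np.column_stack([x1, x2, x3]))
print(vif_independent, vif_collinear)
```

In this sketch the two independent columns give factors close to 1, while adding the nearly redundant third column inflates all three.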
Dummy Variable:
Multiple regression models often include categorical data. Categorical data refers to values
that represent categories, that is, values drawn from a fixed, unordered set, for instance
gender (male/female). Encoding such data is a good way to include non-numeric information
in the regression model.

In the regression model, these values can be represented by dummy variables. A dummy
variable takes the value 1 or 0 to mark the presence or absence of a category.

Dummy Variable Trap:


The Dummy Variable Trap is a condition in which two or more dummy variables are highly
correlated. In simple terms, one variable can be predicted from the others. The solution is
to drop one of the dummy variables, so if there are m dummy variables, only m-1 of them are
used in the model. For a category with two levels:
D2 = 1 - D1
Here D2, D1 = Dummy Variables
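The relation between the two dummy variables can be seen with `pandas.get_dummies`. This is a small illustrative sketch; the `gender` column and its values are invented for the example.

```python
import pandas as pd

# Hypothetical categorical column with two levels.
df_demo = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Full encoding: one dummy per level, so the columns are perfectly related.
full = pd.get_dummies(df_demo["gender"], dtype=int)

# Dropping the first level keeps m-1 dummies and avoids the trap.
reduced = pd.get_dummies(df_demo["gender"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the full encoding, the `female` column equals 1 minus the `male` column in every row, which is exactly the redundancy that `drop_first=True` removes.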
Steps Involved in any Multiple Linear Regression Model
Step #1: Data Pre Processing
● Importing The Libraries.
● Importing the Data Set.
● Encoding the Categorical Data.
● Avoiding the Dummy Variable Trap.
● Splitting the Data set into Training Set and Test Set.
Step #2: Fitting Multiple Linear Regression to the Training set
Step #3: Predict the Test set results.
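The three steps above can be sketched end to end with scikit-learn. The data frame, its column names (`x1`, `region`, `y`) and the coefficients used to generate it are assumptions made for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): one numeric and one categorical feature.
rng = np.random.default_rng(42)
n = 100
data = pd.DataFrame({
    "x1": rng.normal(size=n),
    "region": rng.choice(["A", "B", "C"], size=n),
})
data["y"] = (3.0 + 2.0 * data["x1"] + 5.0 * (data["region"] == "B")
             + rng.normal(scale=0.1, size=n))

# Step 1: encode the categorical column, dropping one level to avoid
# the dummy variable trap, then split into training and test sets.
X = pd.get_dummies(data[["x1", "region"]], columns=["region"],
                   drop_first=True, dtype=float)
y = data["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: fit multiple linear regression to the training set.
model = LinearRegression().fit(X_train, y_train)

# Step 3: predict the test set results and score them.
y_pred = model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
print(test_r2)
```

Scoring on the held-out test set, rather than on the rows used for fitting, gives a more honest picture of how the model predicts unseen data.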
Univariate Analysis:
Univariate analysis focuses on analyzing a single variable or attribute in isolation. The main
goal of univariate analysis is to describe and summarize the characteristics of a single
variable. It helps in understanding the distribution and patterns within that variable. Common
techniques and tools used in univariate analysis include:

a. Descriptive Statistics: This includes measures like mean, median, mode, range, variance,
and standard deviation, which provide a summary of the central tendency and variability of
the data.
b. Histograms: A histogram is a graphical representation of the frequency distribution of a
continuous variable. It displays data as bars or bins to visualize the shape of the distribution.
c. Bar Charts: Bar charts are used to visualize the frequency distribution of a categorical
variable. They show the frequency of each category or class.
d. Box Plots: A box plot, also known as a box-and-whisker plot, displays the summary of a
continuous variable's distribution, including the median, quartiles, and potential outliers.
e. Frequency Tables: Frequency tables provide a tabular summary of the counts or
percentages of different categories or values within a variable.
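The descriptive statistics and frequency table listed above map directly onto pandas methods. The values below are a small made-up sample chosen so the results are easy to verify by hand.

```python
import pandas as pd

# Hypothetical sample with one large value on the right.
values = pd.Series([1, 2, 2, 3, 4, 4, 4, 10])

frequency = values.value_counts()    # frequency table
mean = values.mean()
median = values.median()
mode = values.mode()                 # a Series, since ties are possible
variance = values.var()              # sample variance (ddof=1 by default)
std_dev = values.std()               # sample standard deviation
skewness = values.skew()             # positive here: long right tail
kurtosis = values.kurt()             # excess kurtosis (normal distribution = 0)

print(mean, median, mode.tolist(), variance, std_dev, skewness, kurtosis)
```

Note that pandas reports the sample variance and the excess kurtosis, so the numbers can differ from tools that use the population formulas.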
Bivariate Analysis:
Bivariate analysis, on the other hand, involves analyzing the relationships and interactions
between two variables. It is used to explore how changes in one variable affect another and to
identify patterns, associations, or correlations. Common techniques and tools used in bivariate
analysis include:
a. Scatter Plots: Scatter plots are used to visualize the relationship between two continuous
variables. Each data point is represented as a point on the graph, allowing you to observe
patterns and trends.
b. Correlation Analysis: Correlation measures the strength and direction of the relationship
between two continuous variables. Common correlation coefficients include Pearson's
correlation coefficient (for linear relationships) and Spearman's rank correlation (for monotonic
relationships).
c. Contingency Tables: Contingency tables are used to analyze the relationships between two
categorical variables. They show how the variables are distributed with respect to each other.
d. Regression Analysis: Regression analysis is used to model and quantify the relationship
between a dependent variable and one or more independent variables. Simple linear regression
is the bivariate case; multiple linear regression extends it to several independent variables.
e. Chi-Square Test: The chi-square test is a statistical test used to determine if there is an
association between two categorical variables. It helps assess the independence of variables.
Univariate and bivariate analysis are crucial for understanding data, identifying outliers, trends,
patterns, and making initial observations before more advanced analyses are conducted. They
provide the foundation for more complex multivariate analysis and hypothesis testing in
statistics and data science.
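As a short illustration of the correlation and contingency-table techniques described above, the sketch below uses invented data. Spearman's coefficient is computed as the Pearson correlation of the ranks, which is its definition; the `survey` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# A monotonic but non-linear relationship: y = x^3.
x = pd.Series(np.arange(1, 11), dtype=float)
y = x ** 3

pearson = x.corr(y)                   # measures linear association
spearman = x.rank().corr(y.rank())    # Pearson correlation of the ranks

# Contingency table for two hypothetical categorical variables.
survey = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "active": ["low", "high", "high", "low", "low", "high"],
})
table = pd.crosstab(survey["smoker"], survey["active"])

print(pearson, spearman)
print(table)
```

Because the relationship is perfectly monotonic, the rank correlation reaches 1 while Pearson's coefficient stays below 1, which shows why the two measures answer slightly different questions.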

Conclusion:
Students will be able to apply linear regression and design ML models that make predictions
using the linear regression technique.
18/10/2023, 19:59 Assignment 2-B - Jupyter Notebook

In [18]: !pip install Numpy==1.23.5

Requirement already satisfied: Numpy==1.23.5 in c:\users\chetan\anaconda3\lib\site-packages (1.23.5)

In [19]: !pip install --upgrade --no-deps statsmodels

Requirement already satisfied: statsmodels in c:\users\chetan\anaconda3\lib\site-packages (0.14.0)

In [20]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [21]: df = pd.read_csv("diabetes.csv")

In [22]: df.shape

Out[22]: (768, 9)

In [23]: df.head()

Out[23]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outco
0            6      148             72             35        0  33.6                     0.627   50
1            1       85             66             29        0  26.6                     0.351   31
2            8      183             64              0        0  23.3                     0.672   32
3            1       89             66             23       94  28.1                     0.167   21
4            0      137             40             35      168  43.1                     2.288   33

B. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and
Kurtosis
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets

Dataset link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

localhost:8888/notebooks/Assignment 2-B.ipynb 1/6



In [24]: df.describe()

Out[24]:
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunc
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.00
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.47
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.33
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.07
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.24
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.37
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.62
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.42

Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis

In [25]: for column in df.columns:
    print(f"Column: {column}")
    print(f"Frequency:\n{df[column].value_counts()}\n")
    print(f"Mean: {df[column].mean()}")
    print(f"Median: {df[column].median()}")
    print(f"Mode:\n{df[column].mode()}")
    print(f"Variance: {df[column].var()}")
    print(f"Standard Deviation: {df[column].std()}")
    print(f"Skewness: {df[column].skew()}")
    print(f"Kurtosis: {df[column].kurt()}")
    print("----------\n")

Column: Pregnancies
Frequency:
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64


Bivariate analysis: Linear and logistic regression modeling

In [26]: from sklearn.linear_model import LinearRegression, LogisticRegression

# Prepare the data
X_linear = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesP
y_linear = df['Outcome']

# Fit the linear regression model
model_linear = LinearRegression()
model_linear.fit(X_linear, y_linear)

# Print the coefficients
print('Linear Regression Coefficients:')
for feature, coef in zip(X_linear.columns, model_linear.coef_):
    print(f'{feature}: {coef}')

# Make predictions
predictions_linear = model_linear.predict(X_linear)

Linear Regression Coefficients:
Glucose: 0.005932504680360901
BloodPressure: -0.002278837125420902
SkinThickness: 0.0001669788998679231
Insulin: -0.0002096169514137949
BMI: 0.013310837289280049
DiabetesPedigreeFunction: 0.1376781570786882
Age: 0.005800684345071768

In [27]: # Prepare the data
X_logistic = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Diabete
y_logistic = df['Outcome']

# Fit the logistic regression model
model_logistic = LogisticRegression()
model_logistic.fit(X_logistic, y_logistic)

# Print the coefficients
print('Logistic Regression Coefficients:')
for feature, coef in zip(X_logistic.columns, model_logistic.coef_[0]):
    print(f'{feature}: {coef}')

# Make predictions
predictions_logistic = model_logistic.predict(X_logistic)

Logistic Regression Coefficients:
Glucose: 0.034545440399432255
BloodPressure: -0.01220460771782309
SkinThickness: 0.0010063035920884846
Insulin: -0.0013497641265598785
BMI: 0.0878044448605336
DiabetesPedigreeFunction: 0.8192507685294956
Age: 0.03269965785366651
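The two cells above fit and predict on the same rows. As a hedged sketch of how the two models could instead be compared on held-out data, the example below uses synthetic data rather than `diabetes.csv`; the features, coefficients and split sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary outcome driven by two made-up features.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 2))
p = 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1])))
y = (rng.random(n) < p).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
logistic = LogisticRegression().fit(X_train, y_train)

# Linear regression returns unbounded scores for a 0/1 outcome, while
# logistic regression returns class probabilities between 0 and 1.
linear_scores = linear.predict(X_test)
logistic_probs = logistic.predict_proba(X_test)[:, 1]
accuracy = logistic.score(X_test, y_test)

print(linear_scores.min(), linear_scores.max())
print(logistic_probs.min(), logistic_probs.max(), accuracy)
```

For a binary target such as `Outcome`, the linear model's predictions can fall outside the 0 to 1 range, which is one reason logistic regression is usually preferred for classification.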


Multiple Regression analysis

In [35]: import statsmodels.api as sm

# Split the dataset into the independent variables (X) and the dependent variable (y)
X = df.drop('Outcome', axis=1) # Independent variables
y = df['Outcome'] # Dependent variable

# Add a constant column to the independent variables
X = sm.add_constant(X)

# Fit the multiple regression model
model = sm.OLS(y, X)
results = model.fit()

# Print the regression results
print(results.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                Outcome   R-squared:                       0.303
Model:                            OLS   Adj. R-squared:                  0.296
Method:                 Least Squares   F-statistic:                     41.29
Date:                Wed, 18 Oct 2023   Prob (F-statistic):           7.36e-55
Time:                        19:14:48   Log-Likelihood:                -381.91
No. Observations:                 768   AIC:                             781.8
Df Residuals:                     759   BIC:                             823.6
Df Model:                           8
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                       -0.8539      0.085     -9.989      0.000      -1.022      -0.686
Pregnancies                  0.0206      0.005      4.014      0.000       0.011       0.031
Glucose                      0.0059      0.001     11.493      0.000       0.005       0.007
BloodPressure               -0.0023      0.001     -2.873      0.004      -0.004      -0.001
SkinThickness                0.0002      0.001      0.139      0.890      -0.002       0.002
Insulin                     -0.0002      0.000     -1.205      0.229      -0.000       0.000
BMI                          0.0132      0.002      6.344      0.000       0.009       0.017
DiabetesPedigreeFunction     0.1472      0.045      3.268      0.001       0.059       0.236
Age                          0.0026      0.002      1.693      0.091      -0.000       0.006
==============================================================================
Omnibus:                       41.539   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.183
Skew:                           0.395   Prob(JB):                     1.69e-07
Kurtosis:                       2.408   Cond. No.                     1.10e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.


In [39]: df.corr()

Out[39]:
                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  Diabete
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695


In [42]: # Import required package
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
plt.rcParams['figure.figsize'] = [20, 20]
# Plotting Scatterplot Matrix
scatter_matrix(df)
plt.show()
