Assignment 2 B

Uploaded by

sahilmukund.awasarkar

Computer Laboratory-I Class: BE (AI &DS)

Assignment No:2 B

Title: Implement Multiple Linear Regression

Dataset Link: https://www.kaggle.com/datasets/uciml/iris

Problem Statement:
Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets

Objectives: To apply different regression techniques for making predictions in different applications.

Theory:
Multiple Linear Regression models the relationship between two or more
features and a response by fitting a linear equation to observed data. The steps to perform
multiple linear regression are almost the same as those of simple linear regression; the
difference lies in the evaluation. The fitted coefficients can be used to find out which factor
has the highest impact on the predicted output (when the features are on comparable scales)
and how the different variables relate to each other.
Here: Y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + … + bn * xn
Y = dependent variable; x1, x2, x3, …, xn = independent variables; b0 = intercept; b1, …, bn = regression coefficients
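As a minimal sketch of how the coefficients b0, b1, …, bn in the equation above can be estimated, the following fits an ordinary least-squares model with NumPy. The data are synthetic and the "true" coefficients are made up for illustration; nothing here comes from the diabetes data set.

```python
import numpy as np

# Synthetic data generated from known coefficients (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # columns play the role of x1, x2, x3
true_b = np.array([1.5, 2.0, -1.0, 0.5])   # b0, b1, b2, b3 (assumed values)
y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so that b0 is estimated as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||y - X_design @ b||^2.
b_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(b_hat, 2))
```

The estimates should land close to the assumed coefficients because the noise is small relative to the signal.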
Assumptions of the Regression Model:
Linearity: The relationship between the dependent and independent variables should be linear.
Homoscedasticity: The errors should have constant variance across all values of the independent variables.
Multivariate normality: Multiple Regression assumes that the residuals are normally
distributed.
Lack of Multicollinearity: It is assumed that there is little or no multicollinearity in the data, that is, the independent variables are not strongly correlated with one another.
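One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each feature on the remaining features and compute 1 / (1 - R^2). The sketch below is an assumption-laden illustration on synthetic data; the helper name `vif` is hypothetical and written out by hand here rather than taken from a library. A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a warning sign.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of a, so a and c are collinear

v = vif(np.column_stack([a, b, c]))
print(np.round(v, 1))
```

In this example the first and third columns receive large VIF values, while the independent second column stays near 1.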
Dummy Variable:
A Multiple Regression Model often has to use categorical data.
Categorical data refers to data values that represent categories, that is, a fixed set of
unordered values, for instance gender (male/female). Encoding such data is how non-numeric
information is included in the regression model.
In the regression model, these values can be represented by Dummy Variables.

These variables take the value 0 or 1, representing the absence or presence of a
categorical value.

Dummy Variable Trap:

The Dummy Variable Trap is a condition in which two or more dummy variables are highly correlated.
In simple terms, one dummy variable can be predicted from the others. The solution to the
Dummy Variable Trap is to drop one of the dummy variables: if a categorical variable has m
categories, only m-1 dummy variables are used in the model.
D2 = 1 - D1
Here D1, D2 = dummy variables for a two-category variable
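The trap can be made concrete with a small sketch. It assumes a hypothetical two-category `gender` column and uses `pandas.get_dummies`; the design matrix that keeps every dummy alongside an intercept column loses one independent column, while the version built with `drop_first=True` does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one two-category variable.
df_demo = pd.DataFrame({"gender": ["male", "female", "female", "male", "female"]})

# One dummy column per category: the two columns always sum to 1.
all_dummies = pd.get_dummies(df_demo["gender"], dtype=int)

# drop_first=True keeps m - 1 columns (here 1 of 2).
safe_dummies = pd.get_dummies(df_demo["gender"], drop_first=True, dtype=int)

ones = np.ones((len(df_demo), 1))            # intercept column
X_trap = np.hstack([ones, all_dummies.to_numpy()])
X_safe = np.hstack([ones, safe_dummies.to_numpy()])

print(np.linalg.matrix_rank(X_trap), "of", X_trap.shape[1], "columns independent")
print(np.linalg.matrix_rank(X_safe), "of", X_safe.shape[1], "columns independent")
```

With the full set of dummies the intercept column equals the sum of the dummy columns, which is exactly the perfect collinearity that dropping one dummy removes.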
Steps Involved in any Multiple Linear Regression Model
Step #1: Data Preprocessing
● Importing the libraries.
● Importing the data set.
● Encoding the categorical data.
● Avoiding the Dummy Variable Trap.
● Splitting the data set into a training set and a test set.
Step #2: Fit Multiple Linear Regression to the training set.
Step #3: Predict the test set results.
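The three steps can be sketched end to end with scikit-learn. This is only an illustration under stated assumptions: the data are synthetic, and the column names (`experience`, `hours`, `city`, `salary`) are hypothetical stand-ins, not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; all column names are hypothetical.
rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "experience": rng.uniform(0, 20, n),
    "hours": rng.uniform(20, 60, n),
    "city": rng.choice(["Pune", "Mumbai", "Nagpur"], n),
})
data["salary"] = (30 + 2.0 * data["experience"] + 0.5 * data["hours"]
                  + 5.0 * (data["city"] == "Mumbai") + rng.normal(0, 1, n))

# Step 1: encode the categorical column, dropping one dummy to avoid the trap,
# then split into training and test sets.
X = pd.get_dummies(data.drop(columns="salary"), columns=["city"],
                   drop_first=True, dtype=float)
y = data["salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: fit multiple linear regression on the training set.
model = LinearRegression().fit(X_train, y_train)

# Step 3: predict the test set and score the predictions.
y_pred = model.predict(X_test)
test_r2 = model.score(X_test, y_test)
print("Test R^2:", round(test_r2, 3))
```

Scoring on the held-out test set, rather than on the data used for fitting, is what gives an honest estimate of predictive performance.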
Univariate Analysis:
Univariate analysis focuses on analyzing a single variable or attribute in isolation. The main
goal of univariate analysis is to describe and summarize the characteristics of a single
variable. It helps in understanding the distribution and patterns within that variable. Common
techniques and tools used in univariate analysis include:
a. Descriptive Statistics: This includes measures like mean, median, mode, range, variance,
and standard deviation, which provide a summary of the central tendency and variability of
the data.
b. Histograms: A histogram is a graphical representation of the frequency distribution of a
continuous variable. It displays data as bars or bins to visualize the shape of the distribution.
c. Bar Charts: Bar charts are used to visualize the frequency distribution of a categorical
variable. They show the frequency of each category or class.
d. Box Plots: A box plot, also known as a box-and-whisker plot, displays the summary of a
continuous variable's distribution, including the median, quartiles, and potential outliers.
e. Frequency Tables: Frequency tables provide a tabular summary of the counts or
percentages of different categories or values within a variable.
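As a small worked example of the descriptive statistics listed above, the sketch below uses only the Python standard library on a made-up list of values. Skewness and excess kurtosis are computed from central moments (the population definitions); pandas' `.skew()` and `.kurt()` apply bias-corrected sample formulas, so their results differ slightly on small samples.

```python
import statistics as st

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample
n = len(values)
mean = st.mean(values)

def central_moment(k):
    """k-th central moment: the average of (x - mean)**k."""
    return sum((x - mean) ** k for x in values) / n

# Moment-based (population) skewness and excess kurtosis.
skewness = central_moment(3) / central_moment(2) ** 1.5
excess_kurtosis = central_moment(4) / central_moment(2) ** 2 - 3

print("Mean:", mean)
print("Median:", st.median(values))
print("Mode:", st.mode(values))
print("Sample variance:", st.variance(values))
print("Sample std dev:", st.stdev(values))
print("Skewness:", skewness)
print("Excess kurtosis:", excess_kurtosis)
```

A positive skewness means the right tail is longer, and a negative excess kurtosis means the tails are lighter than those of a normal distribution.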
Bivariate Analysis:
Bivariate analysis, on the other hand, involves analyzing the relationships and interactions
between two variables. It is used to explore how changes in one variable affect another and to
identify patterns, associations, or correlations. Common techniques and tools used in bivariate
analysis include:
a. Scatter Plots: Scatter plots are used to visualize the relationship between two continuous
variables. Each data point is represented as a point on the graph, allowing you to observe
patterns and trends.
b. Correlation Analysis: Correlation measures the strength and direction of the relationship
between two continuous variables. Common correlation coefficients include Pearson's
correlation coefficient (for linear relationships) and Spearman's rank correlation (for monotonic
relationships).
c. Contingency Tables: Contingency tables are used to analyze the relationships between two
categorical variables. They show how the variables are distributed with respect to each other.
d. Regression Analysis: Regression analysis is used to model and quantify the relationship
between a dependent variable and one or more independent variables. Simple linear regression,
with a single independent variable, is the bivariate case; multiple linear regression extends it
to several independent variables and is therefore a multivariate technique.
e. Chi-Square Test: The chi-square test is a statistical test used to determine if there is an
association between two categorical variables. It helps assess the independence of variables.
Univariate and bivariate analysis are crucial for understanding data, identifying outliers, trends,
patterns, and making initial observations before more advanced analyses are conducted. They
provide the foundation for more complex multivariate analysis and hypothesis testing in
statistics and data science.
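The difference between Pearson's and Spearman's coefficients can be seen in a short sketch. The data are made up, and the `ranks` helper is a hypothetical, simplified ranking function that assumes no tied values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3          # increases with x, but not along a straight line

def ranks(a):
    """Rank of each element (1 = smallest); assumes no tied values."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

# Pearson measures linear association on the raw values.
pearson = np.corrcoef(x, y)[0, 1]
# Spearman is Pearson's coefficient applied to the ranks.
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print("Pearson:", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Because y always increases with x, the rank correlation is perfect, while the Pearson coefficient stays below 1 since the relationship is curved.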
Conclusion:
Students will be able to apply linear regression and design ML models that make predictions
using the linear regression technique.
18/10/2023, 19:59 Assignment 2-B - Jupyter Notebook

In [18]: !pip install Numpy==1.23.5

Requirement already satisfied: Numpy==1.23.5 in c:\users\chetan\anaconda3\lib\site-packages (1.23.5)

In [19]: !pip install --upgrade --no-deps statsmodels

Requirement already satisfied: statsmodels in c:\users\chetan\anaconda3\lib\site-packages (0.14.0)

In [20]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [21]: df = pd.read_csv("diabetes.csv")

In [22]: df.shape

Out[22]: (768, 9)

In [23]: df.head()

Out[23]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outco
0            6      148             72             35        0  33.6                     0.627   50
1            1       85             66             29        0  26.6                     0.351   31
2            8      183             64              0        0  23.3                     0.672   32
3            1       89             66             23       94  28.1                     0.167   21
4            0      137             40             35      168  43.1                     2.288   33

B. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis
b. Bivariate analysis: Linear and logistic regression modeling
c. Multiple Regression analysis
d. Also compare the results of the above analysis for the two data sets

Dataset link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database


In [24]: df.describe()

Out[24]:
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunc
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000  768.00
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578    0.47
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160    0.33
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000    0.07
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000    0.24
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000    0.37
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000    0.62
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000    2.42

Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis

In [25]: for column in df.columns:
    print(f"Column: {column}")
    print(f"Frequency:\n{df[column].value_counts()}\n")
    print(f"Mean: {df[column].mean()}")
    print(f"Median: {df[column].median()}")
    print(f"Mode:\n{df[column].mode()}")
    print(f"Variance: {df[column].var()}")
    print(f"Standard Deviation: {df[column].std()}")
    print(f"Skewness: {df[column].skew()}")
    print(f"Kurtosis: {df[column].kurt()}")
    print("----------\n")

Column: Pregnancies
Frequency:
1 135
0 111
2 103
3 75
4 68
5 57
6 50
7 45
8 38
9 28
10 24
11 11
13 10
12 9
14 2
15 1
17 1
Name: Pregnancies, dtype: int64


Bivariate analysis: Linear and logistic regression modeling

In [26]: from sklearn.linear_model import LinearRegression, LogisticRegression

# Prepare the data
X_linear = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesP
y_linear = df['Outcome']

# Fit the linear regression model
model_linear = LinearRegression()
model_linear.fit(X_linear, y_linear)

# Print the coefficients
print('Linear Regression Coefficients:')
for feature, coef in zip(X_linear.columns, model_linear.coef_):
    print(f'{feature}: {coef}')

# Make predictions
predictions_linear = model_linear.predict(X_linear)

Linear Regression Coefficients:

Glucose: 0.005932504680360901
BloodPressure: -0.002278837125420902
SkinThickness: 0.0001669788998679231
Insulin: -0.0002096169514137949
BMI: 0.013310837289280049
DiabetesPedigreeFunction: 0.1376781570786882
Age: 0.005800684345071768
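Because Outcome takes only the values 0 and 1, a linear regression fitted to it behaves as a linear probability model, and its predictions are not confined to the range 0 to 1. The sketch below shows this on a tiny made-up data set (not the diabetes data), using NumPy least squares in place of scikit-learn to stay self-contained, and then thresholds the predictions at 0.5 to obtain class labels.

```python
import numpy as np

x = np.arange(11, dtype=float)      # 0, 1, ..., 10
y = (x > 5).astype(float)           # binary outcome: 1 when x exceeds 5

# Least-squares line y ≈ b0 + b1 * x.
design = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(design, y, rcond=None)
pred = b0 + b1 * x

print("Lowest prediction:", round(pred.min(), 3))
print("Highest prediction:", round(pred.max(), 3))

# Thresholding at 0.5 turns the continuous predictions into class labels.
labels = (pred >= 0.5).astype(int)
print("Accuracy after thresholding:", (labels == y).mean())
```

The fitted line dips below 0 at the low end and rises above 1 at the high end, which is one reason logistic regression is usually preferred for a binary target.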

In [27]: # Prepare the data

X_logistic = df[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Diabete
y_logistic = df['Outcome']

# Fit the logistic regression model
model_logistic = LogisticRegression()
model_logistic.fit(X_logistic, y_logistic)

# Print the coefficients
print('Logistic Regression Coefficients:')
for feature, coef in zip(X_logistic.columns, model_logistic.coef_[0]):
    print(f'{feature}: {coef}')

# Make predictions
predictions_logistic = model_logistic.predict(X_logistic)

Logistic Regression Coefficients:

Glucose: 0.034545440399432255
BloodPressure: -0.01220460771782309
SkinThickness: 0.0010063035920884846
Insulin: -0.0013497641265598785
BMI: 0.0878044448605336
DiabetesPedigreeFunction: 0.8192507685294956
Age: 0.03269965785366651
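Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds of Outcome = 1 for a one-unit increase in that feature, holding the others fixed. The sketch below applies this to the coefficients printed above. Note that scikit-learn's `LogisticRegression` applies L2 regularisation by default, so these are penalised estimates.

```python
import math

# Coefficients copied from the logistic regression output above.
coefficients = {
    "Glucose": 0.034545440399432255,
    "BloodPressure": -0.01220460771782309,
    "SkinThickness": 0.0010063035920884846,
    "Insulin": -0.0013497641265598785,
    "BMI": 0.0878044448605336,
    "DiabetesPedigreeFunction": 0.8192507685294956,
    "Age": 0.03269965785366651,
}

# exp(coefficient) = odds ratio for a one-unit increase in the feature.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

An odds ratio above 1 (for example Glucose or BMI) indicates that higher values are associated with higher odds of a positive outcome, while a ratio below 1 (for example BloodPressure) indicates the opposite.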


Multiple Regression analysis

In [35]: import statsmodels.api as sm

# Split the dataset into the independent variables (X) and the dependent variable (y)
X = df.drop('Outcome', axis=1) # Independent variables
y = df['Outcome'] # Dependent variable

# Add a constant column to the independent variables
X = sm.add_constant(X)

# Fit the multiple regression model
model = sm.OLS(y, X)
results = model.fit()

# Print the regression results
print(results.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                Outcome   R-squared:                       0.303
Model:                            OLS   Adj. R-squared:                  0.296
Method:                 Least Squares   F-statistic:                     41.29
Date:                Wed, 18 Oct 2023   Prob (F-statistic):           7.36e-55
Time:                        19:14:48   Log-Likelihood:                -381.91
No. Observations:                 768   AIC:                             781.8
Df Residuals:                     759   BIC:                             823.6
Df Model:                           8
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                       -0.8539      0.085     -9.989      0.000      -1.022      -0.686
Pregnancies                  0.0206      0.005      4.014      0.000       0.011       0.031
Glucose                      0.0059      0.001     11.493      0.000       0.005       0.007
BloodPressure               -0.0023      0.001     -2.873      0.004      -0.004      -0.001
SkinThickness                0.0002      0.001      0.139      0.890      -0.002       0.002
Insulin                     -0.0002      0.000     -1.205      0.229      -0.000       0.000
BMI                          0.0132      0.002      6.344      0.000       0.009       0.017
DiabetesPedigreeFunction     0.1472      0.045      3.268      0.001       0.059       0.236
Age                          0.0026      0.002      1.693      0.091      -0.000       0.006
==============================================================================
Omnibus:                       41.539   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.183
Skew:                           0.395   Prob(JB):                     1.69e-07
Kurtosis:                       2.408   Cond. No.                     1.10e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
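Several figures in the summary above are tied together by simple formulas, which gives a quick consistency check. The sketch below recomputes the residual degrees of freedom, the adjusted R-squared, and the F-statistic from the reported R-squared, the number of observations, and the number of predictors. Because R-squared is only reported to three decimals, the recomputed F-statistic matches the printed value only approximately.

```python
# Values read from the OLS summary above.
r_squared = 0.303
n_obs = 768        # No. Observations
k = 8              # Df Model (predictors, excluding the constant)

# Residual degrees of freedom: observations minus estimated parameters.
df_resid = n_obs - k - 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic: explained variance per predictor over unexplained variance
# per residual degree of freedom.
f_statistic = (r_squared / k) / ((1 - r_squared) / df_resid)

print("Df Residuals:", df_resid)
print("Adj. R-squared:", round(adj_r_squared, 3))
print("F-statistic:", round(f_statistic, 2))
```

In the coefficient table, predictors whose 95% confidence interval excludes zero (Pregnancies, Glucose, BloodPressure, BMI, DiabetesPedigreeFunction) are the ones with P>|t| below 0.05.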


In [39]: df.corr()

Out[39]:
                          Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  Diabete
Pregnancies                  1.000000  0.129459       0.141282      -0.081672 -0.073535  0.017683
Glucose                      0.129459  1.000000       0.152590       0.057328  0.331357  0.221071
BloodPressure                0.141282  0.152590       1.000000       0.207371  0.088933  0.281805
SkinThickness               -0.081672  0.057328       0.207371       1.000000  0.436783  0.392573
Insulin                     -0.073535  0.331357       0.088933       0.436783  1.000000  0.197859
BMI                          0.017683  0.221071       0.281805       0.392573  0.197859  1.000000
DiabetesPedigreeFunction    -0.033523  0.137337       0.041265       0.183928  0.185071  0.140647
Age                          0.544341  0.263514       0.239528      -0.113970 -0.042163  0.036242
Outcome                      0.221898  0.466581       0.065068       0.074752  0.130548  0.292695
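To read the Outcome row more easily, the sketch below ranks the features by the absolute value of their correlation with Outcome. It is limited to the six columns visible in the output above, since the remaining columns are cut off in the export. The squared correlation is the share of Outcome's variance that a simple linear regression on that single feature would explain.

```python
# Outcome row of the correlation matrix, limited to the visible columns.
outcome_corr = {
    "Pregnancies": 0.221898,
    "Glucose": 0.466581,
    "BloodPressure": 0.065068,
    "SkinThickness": 0.074752,
    "Insulin": 0.130548,
    "BMI": 0.292695,
}

# Sort from strongest to weakest linear association with Outcome.
ranked = sorted(outcome_corr.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranked:
    print(f"{name:15s} r = {r:.3f}   r^2 = {r * r:.3f}")
```

Among these columns Glucose shows the strongest linear association with Outcome, followed by BMI and Pregnancies.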


In [42]: # Import required package

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
plt.rcParams['figure.figsize'] = [20, 20]
# Plotting Scatterplot Matrix
scatter_matrix(df)
plt.show()
