Question 7 - Jupyter Notebook

The document outlines the creation of a machine learning model using Sklearn's Diabetes dataset, detailing the first five steps of the data science life cycle. It includes importing libraries, loading and exploring the dataset, preprocessing the data, building and training a logistic regression model, and evaluating its performance. Key metrics such as accuracy, classification report, confusion matrix, and ROC curve are also generated to assess the model's effectiveness.

Uploaded by

paulbogale

question 7 - Jupyter Notebook http://localhost:8888/notebooks/OneDrive/Desktop/Folders/MU/Sem-7...

Prince Lunia, Enroll: 92000103170, class: 7TC1-C

Create a full ML model for Sklearn’s Diabetes dataset. Load the dataset from sklearn itself.
a. Perform the first 5 data science life cycle steps for this model.
b. Write down information surmised from each code snippet in the markdown cell below each code cell.

Step 1: Import Libraries

In [1]: import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

Step 2: Load and Explore the Dataset

In [2]: #Data Collection

In [3]:

Out[3]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction
0            6      148             72             35        0  33.6                     0.627
1            1       85             66             29        0  26.6                     0.351
2            8      183             64              0        0  23.3                     0.672
3            1       89             66             23       94  28.1                     0.167
4            0      137             40             35      168  43.1                     2.288

1 of 4 10/23/2023, 1:21 AM

In [4]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
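
The loading cell itself is missing from this copy. The summary above (768 rows, a binary-looking `Outcome` column) appears to match the widely used Pima Indians Diabetes data rather than the dataset bundled with scikit-learn, so the frame was probably read from a separate file. For comparison, a minimal sketch of what `sklearn.datasets.load_diabetes` returns; the variable name `sk_df` is an assumption and not part of the original notebook:

```python
from sklearn.datasets import load_diabetes

# Load scikit-learn's bundled diabetes data as a pandas DataFrame.
# `sk_df` is a hypothetical name, not taken from the original notebook.
sk_df = load_diabetes(as_frame=True).frame

print(sk_df.shape)             # rows and columns, including the target
print(sk_df.columns.tolist())  # ten feature columns followed by 'target'
print(sk_df["target"].dtype)   # continuous target, so this is a regression dataset
```

Because that target is continuous, a classifier such as `LogisticRegression` would need a binarised label first; the `Outcome` column used in this notebook needs no such step.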

In [5]:

Step 3: Preprocess the Data

In [6]: # Split the data into features (X) and target (y)
X = df.drop(columns='Outcome')
y = df['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
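
The scaler above is fitted on the training split only and then reused on the test split, which keeps test-set statistics out of preprocessing. A small sketch of that behaviour on made-up numbers (the arrays `toy_train` and `toy_test` are illustrative, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only; not taken from the notebook's dataset.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[5.0, 50.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train mean and std

print(toy_train_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_train_scaled.std(axis=0))   # approximately 1 for every column
print(toy_test_scaled)                # scaled with train statistics, so not centred
```

Calling `fit_transform` on the test split instead would rescale it with its own mean and standard deviation, so the model would see test features on a different scale from the one it was trained on.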

Step 4: Build and Train the Model

In [7]: # Create a Logistic Regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)
Out[7]: LogisticRegression()
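
For a binary target, a fitted `LogisticRegression` turns a linear score into a class-1 probability through the sigmoid function. A minimal sketch on synthetic data, assuming default settings; `make_classification` and all `demo_` names are stand-ins, not part of the original notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with 8 features (a stand-in for the notebook's data).
demo_X, demo_y = make_classification(n_samples=200, n_features=8, random_state=0)

demo_model = LogisticRegression()
demo_model.fit(demo_X, demo_y)

# Linear score w.x + b for each row, then the sigmoid of that score.
demo_scores = demo_model.decision_function(demo_X)
demo_sigmoid = 1.0 / (1.0 + np.exp(-demo_scores))

# The second column of predict_proba is the class-1 probability.
demo_probs = demo_model.predict_proba(demo_X)[:, 1]
print(np.allclose(demo_sigmoid, demo_probs))
```

`model.predict` then labels a row as class 1 when that probability exceeds 0.5, which is the threshold behind the predictions evaluated in the next step.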

Step 5: Evaluate the Model


In [8]: # Make predictions on the test data
y_pred = model.predict(X_test)

In [9]: # Calculate accuracy

In [10]:

0.7532467532467533

In [11]: # Generate a classification report

In [12]:

              precision    recall  f1-score   support

           0       0.81      0.80      0.81        99
           1       0.65      0.67      0.66        55

    accuracy                           0.75       154
   macro avg       0.73      0.74      0.73       154
weighted avg       0.76      0.75      0.75       154
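
The figures in the report can be re-derived from four counts. Taking the counts from the confusion matrix this notebook prints in the next step (79 and 37 correct predictions, 20 and 18 errors), a short arithmetic check; the variable names are illustrative:

```python
# Counts read from the notebook's confusion matrix, treating class 1 as positive.
tn, fp = 79, 20   # true class 0: predicted 0, predicted 1
fn, tp = 18, 37   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 1 metrics.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 metrics (the same formulas with the roles of the classes swapped).
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy        {accuracy:.4f}")
print(f"class 0 P/R/F1  {precision_0:.2f} {recall_0:.2f} {f1_0:.2f}")
print(f"class 1 P/R/F1  {precision_1:.2f} {recall_1:.2f} {f1_1:.2f}")
```

A recall of 0.67 for class 1 means roughly a third of the positive cases in the test split were missed, which the overall accuracy alone does not reveal.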

In [13]: # Generate a confusion matrix

In [14]:

[[79 20]
 [18 37]]
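
The code for the accuracy, report and confusion-matrix cells did not survive in this copy. A sketch of the scikit-learn calls that typically produce such outputs, run on label vectors rebuilt from the counts above rather than on the original predictions (`toy_true` and `toy_pred` are assumed names):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Rebuild labels that reproduce the printed counts:
# 99 rows of true class 0 (79 predicted 0, 20 predicted 1) and
# 55 rows of true class 1 (18 predicted 0, 37 predicted 1).
toy_true = [0] * 99 + [1] * 55
toy_pred = [0] * 79 + [1] * 20 + [0] * 18 + [1] * 37

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_matrix = confusion_matrix(toy_true, toy_pred)  # rows: true class, columns: predicted class

print(toy_accuracy)
print(toy_matrix)
print(classification_report(toy_true, toy_pred))
```

In the notebook itself the same functions would presumably receive `y_test` and `y_pred` from the earlier cells.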


In [15]: # Calculate ROC AUC
y_pred_prob = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_pred_prob)

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
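
The cell above computes `roc_auc` but never displays it, and the plot has no reference line. A possible variant that adds both, shown on synthetic data so it runs on its own; the `demo_` names, the synthetic dataset and the `Agg` backend choice are assumptions, not part of the original notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data.
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=0)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

demo_model = LogisticRegression().fit(demo_X_train, demo_y_train)
demo_prob = demo_model.predict_proba(demo_X_test)[:, 1]

demo_auc = roc_auc_score(demo_y_test, demo_prob)
demo_fpr, demo_tpr, _ = roc_curve(demo_y_test, demo_prob)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(demo_fpr, demo_tpr, label=f"ROC curve (AUC = {demo_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")  # diagonal = random guessing
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve")
ax.legend(loc="lower right")
```

Inside a notebook the `matplotlib.use("Agg")` line can be dropped so the figure renders inline. A curve that bows toward the top-left corner, well above the dashed diagonal, indicates better separation between the two classes.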

In [ ]:
