Project 10 Movie Recommendation - Ipynb - Colaboratory

The document discusses analyzing diabetes data from a CSV file using Pandas and machine learning models in Python. It loads and inspects the data, cleans it, analyzes distributions, splits it into training and test sets, and trains a logistic regression model to predict diabetes diagnoses.

import pandas as pd

import numpy as np

df = pd.read_csv(r"https://github.com/YBI-Foundation/Dataset/raw/main/Diabetes.csv")

df.head()

pregnancies glucose diastolic triceps insulin bmi dpf age diabetes

0 6 148 72 35 0 33.6 0.627 50 1

1 1 85 66 29 0 26.6 0.351 31 0

2 8 183 64 0 0 23.3 0.672 32 1

3 1 89 66 23 94 28.1 0.167 21 0

4 0 137 40 35 168 43.1 2.288 33 1

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pregnancies 768 non-null int64
1 glucose 768 non-null int64
2 diastolic 768 non-null int64
3 triceps 768 non-null int64
4 insulin 768 non-null int64
5 bmi 768 non-null float64
6 dpf 768 non-null float64
7 age 768 non-null int64
8 diabetes 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

df = df.dropna()

df.describe()
pregnancies glucose diastolic triceps insulin bmi

count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.0

mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.4

std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.3

min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0

25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.2

50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.3

75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.6

max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.4
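
The summary above reports a minimum of 0 for glucose, diastolic, triceps, insulin and bmi. Such values are not physiologically plausible and most likely stand for missing measurements; `df.dropna()` leaves them alone because they are stored as zeros rather than NaN. A minimal sketch of how they could be counted and marked, using only the five rows printed by `df.head()` as stand-in data. The names `sample`, `zero_as_missing` and `cleaned` are introduced here for illustration and do not appear in the notebook:

```python
import numpy as np
import pandas as pd

# Stand-in data: the five rows shown by df.head() in this notebook.
sample = pd.DataFrame({
    "pregnancies": [6, 1, 8, 1, 0],
    "glucose": [148, 85, 183, 89, 137],
    "diastolic": [72, 66, 64, 66, 40],
    "triceps": [35, 29, 0, 23, 35],
    "insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
    "dpf": [0.627, 0.351, 0.672, 0.167, 2.288],
    "age": [50, 31, 32, 21, 33],
    "diabetes": [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns means "not measured".
zero_as_missing = ["glucose", "diastolic", "triceps", "insulin", "bmi"]

# Count the zeros per column.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Mark them as NaN so that dropna() or an imputer can see them.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
print(cleaned.isna().sum())
```

Whether to drop or impute the marked rows is a modelling decision that the notebook does not make.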


df[['diabetes']].value_counts()

diabetes
0 500
1 268
dtype: int64

df.groupby('diabetes').mean()

pregnancies glucose diastolic triceps insulin bmi

diabetes

0 3.298000 109.980000 68.184000 19.664000 68.792000 30.304200 0.4

1 4.865672 141.257463 70.824627 22.164179 100.335821 35.142537 0.5

df.columns

Index(['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
       'dpf', 'age', 'diabetes'],
      dtype='object')

df.shape

(768, 9)

y = df['diabetes']

y.shape

(768,)
y

0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: diabetes, Length: 768, dtype: int64

X = df[['pregnancies', 'glucose', 'diastolic', 'triceps', 'insulin', 'bmi',
        'dpf', 'age']]

X = df.drop(['diabetes'],axis=1)

X.shape

(768, 8)

X
pregnancies glucose diastolic triceps insulin bmi dpf age

from sklearn.preprocessing import MinMaxScaler

mm= MinMaxScaler()

X = mm.fit_transform(X)

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503,
        0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704,
        0.16666667],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938,
        0.18333333],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.07130658,
        0.15      ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.11571307,
        0.43333333],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556,
        0.03333333]])
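
With default settings, MinMaxScaler rescales each column to the range 0 to 1 as (x - min) / (max - min). As a check, the first two entries of the first row above can be reproduced by hand from the minima and maxima reported by `df.describe()` (pregnancies: 0 to 17, glucose: 0 to 199). The helper name `min_max` is introduced for illustration only:

```python
def min_max(x, col_min, col_max):
    """Rescale x to [0, 1] given the column's minimum and maximum."""
    return (x - col_min) / (col_max - col_min)

# First row of the data: pregnancies = 6, glucose = 148.
pregnancies_scaled = min_max(6, 0, 17)
glucose_scaled = min_max(148, 0, 199)

print(round(pregnancies_scaled, 8))  # 0.35294118
print(round(glucose_scaled, 8))      # 0.74371859
```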

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=2529)

X_train.shape,X_test.shape,y_train.shape,y_test.shape

((537, 8), (231, 8), (537,), (231,))
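
In the cells above the scaler is fitted on all 768 rows before the split, so the minima and maxima used for scaling also depend on the test rows. A common alternative is to split first and fit the scaler on the training portion only. The sketch below illustrates that ordering on random stand-in data, not on the diabetes CSV; the variable names and the use of `stratify` are assumptions added for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): 768 rows, 8 features, random values.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 100, size=(768, 8))
y_demo = rng.integers(0, 2, size=768)

# Split the unscaled data first; stratify keeps the class ratio
# similar in the training and test parts.
X_train_raw, X_test_raw, y_train_demo, y_test_demo = train_test_split(
    X_raw, y_demo, test_size=0.3, random_state=2529, stratify=y_demo
)

# Learn min and max from the training rows only, then apply to both parts.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)
X_test_scaled = scaler.transform(X_test_raw)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside 0 to 1 when a test row exceeds the range seen in training.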

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train,y_train)

LogisticRegression()

lr.predict_proba

<bound method LogisticRegression.predict_proba of LogisticRegression()>
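
Written without parentheses, `lr.predict_proba` only displays the bound method; calling it on a feature matrix returns one row of class probabilities per sample. The notebook also never scores the model on the held-out rows. The following is a sketch of how both could look, using synthetic stand-in data from `make_classification` rather than the diabetes CSV. The choice of accuracy and a confusion matrix as metrics is an assumption, as are all variable names:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes table (assumption): 768 rows, 8 features.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2529
)

model = LogisticRegression()
model.fit(X_tr, y_tr)

# Calling the method returns an array of shape (n_samples, 2):
# column 0 is P(class 0), column 1 is P(class 1).
proba = model.predict_proba(X_te)
y_pred = model.predict(X_te)

print(proba[:3])
print("accuracy:", accuracy_score(y_te, y_pred))
print(confusion_matrix(y_te, y_pred))
```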

X_new =df.sample(1)
X_new

pregnancies glucose diastolic triceps insulin bmi dpf age diabet

633 1 128 82 17 183 27.5 0.115 22

X_new.shape

(1, 9)

X_new = X_new.drop('diabetes',axis=1)

X_new

pregnancies glucose diastolic triceps insulin bmi dpf age

633 1 128 82 17 183 27.5 0.115 22

X_new.shape

(1, 8)

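# Caution: fit_transform refits the scaler on this single row, so min == max
# for every column and each feature becomes 0; mm.transform(X_new) would
# apply the scaling learned from the full data instead.
X_new = mm.fit_transform(X_new)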

y_pred_new = lr.predict(X_new)

y_pred_new

array([0])

lr.predict_proba(X_new)

array([[0.9928188, 0.0071812]])
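
Because the scaler was refitted on the single sampled row, the model received an all-zero input, and the probability above describes that input rather than row 633. A sketch of the difference between reusing a fitted scaler and refitting it, with the five rows from `df.head()` as stand-in fitting data (in the notebook itself the already fitted `mm` would be reused; `features`, `new_row`, `reused` and `refit` are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in fitting data: feature columns of the five rows shown by df.head().
features = pd.DataFrame({
    "pregnancies": [6, 1, 8, 1, 0],
    "glucose": [148, 85, 183, 89, 137],
    "diastolic": [72, 66, 64, 66, 40],
    "triceps": [35, 29, 0, 23, 35],
    "insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
    "dpf": [0.627, 0.351, 0.672, 0.167, 2.288],
    "age": [50, 31, 32, 21, 33],
})

# The sampled row 633 from the notebook, without the target column.
new_row = pd.DataFrame({
    "pregnancies": [1], "glucose": [128], "diastolic": [82], "triceps": [17],
    "insulin": [183], "bmi": [27.5], "dpf": [0.115], "age": [22],
})

scaler = MinMaxScaler().fit(features)

# Reuses the min and max learned from `features`; values can fall outside
# [0, 1] when the new row lies beyond the fitted range.
reused = scaler.transform(new_row)

# Refits on the single row: max == min for every column, so all outputs are 0.
refit = MinMaxScaler().fit_transform(new_row)

print(reused)
print(refit)
```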
