Loading The Dataset: 'Diabetes - CSV'
In [1]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn import preprocessing
In [2]: df = pd.read_csv('diabetes.csv')
In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Pregnancies    768 non-null    int64
 1   Glucose        768 non-null    int64
 2   BloodPressure  768 non-null    int64
 3   SkinThickness  768 non-null    int64
 4   Insulin        768 non-null    int64
 5   BMI            768 non-null    float64
 6   Pedigree       768 non-null    float64
 7   Age            768 non-null    int64
 8   Outcome        768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [4]: df.head()
Out[4]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  Pedigree  Age  Outcome
1            1       85             66             29        0  26.6     0.351   31        0
3            1       89             66             23       94  28.1     0.167   21        0
Cleaning
In [11]: df.corr().style.background_gradient(cmap='BuGn')
In [14]: df.isna().sum()
Out[14]:
Pregnancies      0
Glucose          0
BloodPressure    0
SkinThickness    0
Insulin          0
BMI              0
Pedigree         0
Age              0
Outcome          0
dtype: int64
In [15]: df.describe()
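Although `df.isna().sum()` reports no missing values, the `df.head()` output above shows an `Insulin` value of 0, which suggests that zeros may stand in for missing measurements. The following is a minimal sketch under that assumption, which this notebook does not confirm. The helper name `zeros_to_nan` and the column list `ZERO_AS_MISSING` are hypothetical, and a two-row stand-in frame is used so the block runs without `diabetes.csv`.

```python
import numpy as np
import pandas as pd

# Assumption: a zero in these columns marks a missing measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zeros_to_nan(frame, columns=ZERO_AS_MISSING):
    """Return a copy of `frame` with zeros in `columns` replaced by NaN."""
    cleaned = frame.copy()
    cleaned[list(columns)] = cleaned[list(columns)].replace(0, np.nan)
    return cleaned


# Stand-in frame built from the two rows visible in the df.head() output.
sample = pd.DataFrame({
    "Glucose": [85, 89],
    "BloodPressure": [66, 66],
    "SkinThickness": [29, 23],
    "Insulin": [0, 94],
    "BMI": [26.6, 28.1],
})

cleaned_sample = zeros_to_nan(sample)
# After the replacement, isna() can count the placeholder zeros.
missing_counts = cleaned_sample.isna().sum()
print(missing_counts)
```

Applied to the full frame, `zeros_to_nan(df).isna().sum()` would show how many entries per column are affected before deciding whether to impute or drop them.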
Visualization
In [16]: hist = df.hist(figsize=(20,16))
Hyperparameter tuning
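In [28]: # Search over the neighbor count and the Minkowski power parameter p
# (p=1 is Manhattan distance, p=2 is Euclidean distance).
param_grid = {
    'n_neighbors': range(1, 51),
    'p': range(1, 4)
}
# 5-fold cross-validated grid search on the training split
grid = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
grid.best_estimator_, grid.best_params_, grid.best_score_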
Out[28]: (KNeighborsClassifier(n_neighbors=27),
{'n_neighbors': 27, 'p': 2},
0.7719845395175262)
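The grid search above relies on `X_train` and `y_train`, which are not defined in the cells shown here. The following is a minimal sketch of how such a split might be prepared, assuming a stratified 80/20 split followed by `StandardScaler` scaling. The test size, the random seed, and the helper name `split_and_scale` are hypothetical, and synthetic data stands in for `diabetes.csv` so the block runs by itself.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def split_and_scale(X, y, test_size=0.2, seed=42):
    """Stratified split, then scale using statistics from the training rows only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Synthetic stand-in: 200 rows, 8 features (same feature count as the dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = split_and_scale(X_demo, y_demo)

# A smaller grid than the notebook's, to keep the demo fast.
demo_grid = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 11), "p": [1, 2]},
    cv=5,
)
demo_grid.fit(X_train, y_train)

# Accuracy of the refitted best estimator on the held-out rows.
test_accuracy = demo_grid.score(X_test, y_test)
print(demo_grid.best_params_, round(test_accuracy, 3))
```

Scaling before cross-validation lets each fold see statistics computed from the whole training split. Wrapping the scaler and classifier in a scikit-learn `Pipeline` and passing that to `GridSearchCV` would avoid this, at the cost of prefixed parameter names in the grid.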
Classification report :
precision recall f1-score support