ML Minor May
Contents:
1. Problem
2. Imported packages
3. Procedure of solving
4. Code (Screenshot)
5. Conclusion
Project question:
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney
Diseases. The objective of the dataset is to diagnostically predict whether or not a patient
has diabetes, based on certain diagnostic measurements included in the dataset. Several
constraints were placed on the selection of these instances from a larger database. In
particular, all patients here are females at least 21 years old of Pima Indian heritage.
1. Procedure of solving:
2. Loaded the data from my drive.
3. Imported pandas, numpy and matplotlib.pyplot.
4. Read the data.
5. Described the data.
Here we can see that the data has 9 columns. All columns are numeric, which is convenient
for modelling. Had there been character string columns, we could have encoded them as
dummy numeric variables for modelling. The columns here are:
1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).
9. Outcome: Class variable (0 or 1).
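The loading and describing steps could look roughly like the sketch below. The original code was only provided as screenshots, so everything here is an assumption: the column names follow the commonly distributed CSV of this dataset, and the inline rows are made up purely so the example runs without the real file.

```python
import io

import pandas as pd

# Assumed column names (taken from the common CSV release of this dataset).
COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# Stand-in for the real file: a few MADE-UP rows in the same layout.
# With the real data this would be pd.read_csv(<path to the CSV on the drive>).
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,30,0,31.5,0.450,29,0
5,160,80,0,0,35.2,0.610,45,1
1,95,64,22,85,24.8,0.210,23,0
0,140,0,35,160,40.1,1.100,33,1
3,110,72,0,0,28.0,0.300,38,0
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())      # first rows of the data
summary = df.describe()
print(summary)        # count, mean, std, min, quartiles and max per column
```

On the real file, the same two calls give the 9-column overview and the summary statistics discussed in this report.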
6. Getting the info of the data
All values are in integer or float format, which is suitable for modelling. Hence
no type conversions are required on the dataset. The row count of the data is 768. Hence
the shape of our data is 768 × 9.
Here we can see that there are no missing or null values. But in the data head we had
spotted a 0 value. Next we should check the data range and basic summary statistics of
our data.
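A minimal sketch of these checks is given below. It uses a small made-up dataframe with only three of the nine columns (an assumption for illustration); on the real data the shape would be (768, 9).

```python
import pandas as pd

# Toy stand-in for the dataset: MADE-UP values, three of the nine columns.
df = pd.DataFrame({
    "Glucose": [120, 160, 95, 140],
    "BMI": [31.5, 35.2, 24.8, 40.1],
    "Outcome": [0, 1, 0, 1],
})

df.info()                          # dtype and non-null count per column
print(df.shape)                    # (rows, columns)

null_counts = df.isnull().sum()    # number of NaN/None entries per column
print(null_counts)

# True when every column is integer or float, so no type conversion is needed.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)
print(all_numeric)
```

Note that `isnull()` only detects NaN/None entries. A value recorded as 0 is not counted, which is why the summary statistics still need to be inspected.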
Missing Values
Here we can see that certain columns have a minimum value of 0, which is not plausible for
these measurements and most likely marks a missing reading. The columns are:
1. Glucose
2. Blood Pressure
3. Skin Thickness
4. Insulin
5. BMI
Next we need to check the amount of missing information in these columns. We can do
this by counting the rows that have a 0 value in each of them.
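One possible way to count the 0 values per column is sketched here. The dataframe is a made-up stand-in, and the column names (for example `BloodPressure` without a space) are assumed from the common CSV release of the dataset.

```python
import pandas as pd

# Columns in which a 0 is treated as a missing measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# MADE-UP rows, only for illustration.
df = pd.DataFrame({
    "Glucose": [120, 160, 0, 140, 110],
    "BloodPressure": [70, 80, 64, 0, 72],
    "SkinThickness": [30, 0, 22, 35, 0],
    "Insulin": [0, 0, 85, 160, 0],
    "BMI": [31.5, 35.2, 24.8, 40.1, 0.0],
})

# Comparing to 0 gives a boolean frame; summing counts the True cells per column.
zero_counts = (df[ZERO_AS_MISSING] == 0).sum()
zero_share = (zero_counts / len(df) * 100).round(1)

print(pd.DataFrame({"zeros": zero_counts, "percent": zero_share}))
```

Columns with a large share of zeros need a decision on whether to fill them in or to drop the affected rows.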
9. Defining x
10. Defining df
11. Creating a dataframe df1 with the median value filled in for the missing Skin Thickness
and the missing Insulin values
12. Creating a dataframe df2 with the median value filled in for the missing Skin Thickness
values, but with the rows that are missing Insulin removed
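The two dataframes could be built along the following lines. This is a sketch under assumptions: the helper `fill_zeros_with_median` is hypothetical, the rows are made up, and the median is taken over the non-zero values only, which the report does not state explicitly.

```python
import pandas as pd

# MADE-UP rows; 0 marks a missing measurement.
df = pd.DataFrame({
    "SkinThickness": [30, 0, 22, 35, 0, 28],
    "Insulin": [0, 0, 85, 160, 0, 100],
    "Outcome": [0, 1, 0, 1, 0, 1],
})


def fill_zeros_with_median(frame, column):
    """Return a copy of `frame` with zeros in `column` replaced by the
    median of that column's non-zero values (hypothetical helper)."""
    out = frame.copy()
    # mask() turns the zero entries into NaN so they are ignored by median().
    values = out[column].astype(float).mask(out[column] == 0)
    out[column] = values.fillna(values.median())
    return out


# df1: median imputation for both Skin Thickness and Insulin.
df1 = fill_zeros_with_median(df, "SkinThickness")
df1 = fill_zeros_with_median(df1, "Insulin")

# df2: median imputation for Skin Thickness, rows with missing Insulin dropped.
df2 = fill_zeros_with_median(df, "SkinThickness")
df2 = df2[df2["Insulin"] != 0].reset_index(drop=True)

print(len(df1), len(df2))
```

df1 keeps every row, while df2 shrinks by the number of rows that have no Insulin reading.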
14. Logistic Regression
Based on the model accuracy scores on the two datasets, df1 and df2, we can see
that the accuracy on df1 is higher. This suggests that, even though we made certain
assumptions about the missing values in our dataset, the predictions were better than in
the case where we removed the rows with missing data. This is a positive outcome, as we
can do our modelling on the entire dataset using df1.
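The comparison between df1 and df2 might be set up as in the sketch below, using scikit-learn. The data is synthetic, and the helper `fit_and_score`, the 25% test split, the random seed and `max_iter` are all assumptions, since the report does not give these settings. The scores it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(frame, target="Outcome", seed=0):
    """Fit a logistic regression on `frame` and return the accuracy on a
    held-out test split (hypothetical helper, assumed settings)."""
    X = frame.drop(columns=[target])
    y = frame[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# SYNTHETIC stand-in data: two features, with about 40% of Insulin set to 0.
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(120, 30, n)
insulin = rng.normal(100, 40, n).clip(min=15)
outcome = (glucose + rng.normal(0, 25, n) > 130).astype(int)
insulin[rng.random(n) < 0.4] = 0
df = pd.DataFrame({"Glucose": glucose, "Insulin": insulin, "Outcome": outcome})

# df1: zeros replaced by the median of the non-zero values; df2: rows dropped.
median_insulin = df.loc[df["Insulin"] != 0, "Insulin"].median()
df1 = df.assign(Insulin=df["Insulin"].replace(0, median_insulin))
df2 = df[df["Insulin"] != 0]

acc_df1 = fit_and_score(df1)
acc_df2 = fit_and_score(df2)
print(f"df1 accuracy: {acc_df1:.3f} on {len(df1)} rows")
print(f"df2 accuracy: {acc_df2:.3f} on {len(df2)} rows")
```

Because df2 has fewer rows, its test split is also smaller, so a single accuracy comparison like this is sensitive to the particular split that is drawn.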
16. The result:
Here, we can see that the final model has an accuracy of 0.7447916666666666 (about 74.5%).
It has an Area Under the Curve (AUC) of 0.78.
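For reference, accuracy and AUC can be computed with scikit-learn as sketched below. The data is synthetic and the split settings are assumptions, so the numbers printed will differ from the ones reported above. The sketch also assumes the AUC refers to the ROC curve.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# SYNTHETIC data: three features, binary label driven mostly by the first two.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy uses the hard 0/1 predictions.
accuracy = accuracy_score(y_test, model.predict(X_test))

# ROC AUC uses the predicted probability of the positive class (column 1).
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"accuracy = {accuracy:.3f}, AUC = {auc:.3f}")
```

Accuracy depends on the 0.5 decision threshold, whereas the AUC summarises how well the model ranks positive cases above negative ones across all thresholds.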