ML Minor May

The document summarizes a machine learning project to predict diabetes using a Pima Indian diabetes dataset. It introduces the dataset and objective. It then lists the packages used in the code and provides a step-by-step description of the data preprocessing, modeling, and evaluation process. This includes handling missing values, splitting data, performing logistic regression, and selecting the best model. The results show the final model can predict diabetes with 74% accuracy.

Uploaded by

govind kumar
ML-MINOR-MAY

Submitted by: Chodapaneedi Govind kumar


Submitted to : [email protected]

Contents:
1.Problem
2.Imported packages
3.Procedure of solving
4.Code(Screenshot)
5.Conclusion

Project question:
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney
Diseases. The objective of the dataset is to diagnostically predict whether or not a patient
has diabetes, based on certain diagnostic measurements included in the dataset. Several
constraints were placed on the selection of these instances from a larger database. In
particular, all patients here are females at least 21 years old of Pima Indian heritage.

Imported packages in the code:


Pandas
Numpy
Matplotlib.pyplot
From sklearn.model_selection imported train_test_split
From sklearn.linear_model imported LogisticRegression
From sklearn.neighbors imported KNeighborsClassifier

1.Procedure of solving:
2.Loaded my data from my drive.
3.Imported pandas, NumPy and matplotlib.pyplot.
4.Read the data.
5.Describe the data
Here we can see that there are 9 columns in the data. All columns are numeric, which is
convenient for modelling. Had there been character string columns, we could have encoded
them as dummy numeric variables before modelling. The columns here are:
1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).
9. Outcome: Class variable (0 or 1).
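The reading and describing steps above can be sketched as follows. The file location is not given in the report, so this sketch reads from an in-memory string instead; the column names are an assumption based on a commonly distributed CSV version of this dataset, and the four rows are invented purely for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file. These rows are invented for
# illustration and are NOT taken from the real dataset.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,30,100,28.5,0.45,33,0
5,160,80,0,0,35.1,0.60,50,1
1,95,64,22,0,24.0,0.20,25,0
7,0,72,35,150,31.2,0.35,41,1
"""

# In practice this would be pd.read_csv(<path to the file on the drive>).
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())      # first rows, to eyeball the values
print(data.describe())  # count, mean, std, min, quartiles, max per column
```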
6.Taking the info of the data

All values are in integer or float format, which is suitable for modelling. Hence
no type conversions are required on the dataset. The row count of the data is 768. Hence
the shape of our data is 768 rows × 9 columns.
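A minimal sketch of this check, assuming the data is held in a pandas DataFrame. The small frame below is a made-up stand-in with assumed column names, so its shape differs from the real 768 × 9.

```python
import pandas as pd

# Invented stand-in rows; the real dataset has 768 rows and 9 columns.
data = pd.DataFrame({
    "Glucose": [120, 160, 95],
    "BMI": [28.5, 35.1, 24.0],
    "Outcome": [0, 1, 0],
})

data.info()        # prints dtype and non-null count for every column
print(data.shape)  # (rows, columns)

# True when every column is integer or float, i.e. no conversion is needed.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in data.dtypes)
print(all_numeric)
```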

7.Checking missing Data

Here we can see that there are no missing or null values. However, in the first rows of the
data we had spotted a 0 value, so we should next check the data range and basic summary
statistics of our data.
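The point here is that a null check can come back clean while zeros still hide missing readings. A small sketch with invented values and assumed column names:

```python
import pandas as pd

# Invented stand-in rows with a 0 used in place of a missing reading.
data = pd.DataFrame({
    "Glucose": [120, 160, 95, 110],
    "Insulin": [100, 0, 0, 150],
})

null_counts = data.isnull().sum()  # number of NaN/None values per column
print(null_counts)

# No nulls are reported, yet the Insulin column still contains zeros.
has_zero_insulin = (data["Insulin"] == 0).any()
print(has_zero_insulin)
```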

8.Describing the data

Missing Values
Here we can see that certain columns have a minimum value of 0, which is not plausible for
these measurements, so the zeros most likely stand for missing readings. The columns are:
1. Glucose
2. Blood Pressure
3. Skin Thickness
4. Insulin
5. BMI

Next we need to check the amount of missing information in these columns. We can check
this by counting the rows that have a 0 value in each column.
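Counting the zero-valued rows per column could look like the sketch below. The column names are assumed and the values are invented for illustration.

```python
import pandas as pd

# Invented stand-in rows; zeros mark missing readings.
data = pd.DataFrame({
    "Glucose": [120, 0, 95, 110],
    "BloodPressure": [70, 80, 0, 72],
    "SkinThickness": [30, 0, 0, 35],
    "Insulin": [0, 0, 0, 150],
    "BMI": [28.5, 35.1, 24.0, 31.2],
})

zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

zero_counts = (data[zero_cols] == 0).sum()  # rows with a 0, per column
zero_share = zero_counts / len(data)        # same, as a fraction of all rows

print(zero_counts)
print(zero_share)
```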
9.Defining x

10.Defining df

11.Creating a dataframe df1 in which the missing (0) values of Skin Thickness and Insulin
are replaced with the column medians

12.Creating a dataframe df2 in which the missing (0) values of Skin Thickness are replaced
with the median, but the rows with missing Insulin data are removed
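One way the two dataframes could be built is sketched below. The report does not say whether the median was taken over all values or only the non-zero ones; this sketch assumes the non-zero values. The helper name `fill_zeros_with_median`, the column names and the sample rows are all hypothetical.

```python
import numpy as np
import pandas as pd

# Invented stand-in rows; zeros mark missing readings.
df = pd.DataFrame({
    "SkinThickness": [30, 0, 22, 35, 0, 28],
    "Insulin": [100, 0, 0, 150, 80, 0],
    "Outcome": [0, 1, 0, 1, 0, 1],
})


def fill_zeros_with_median(frame, column):
    """Return a copy with zeros in `column` replaced by the median of its non-zero values."""
    out = frame.copy()
    nonzero = out[column].replace(0, np.nan)  # treat 0 as missing
    out[column] = nonzero.fillna(nonzero.median())
    return out


# df1: median-fill both columns, keeping every row.
df1 = fill_zeros_with_median(fill_zeros_with_median(df, "SkinThickness"), "Insulin")

# df2: median-fill Skin Thickness only, then drop rows with missing Insulin.
df2 = fill_zeros_with_median(df, "SkinThickness")
df2 = df2[df2["Insulin"] != 0]

print(len(df1), len(df2))
```

The trade-off is that df1 keeps all rows at the cost of assumed values, while df2 keeps only measured Insulin values at the cost of fewer rows.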

13.Training and testing values

14.Logistic Regression
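The split and fitting steps can be sketched as follows. The report does not state the split ratio or any model settings, so `test_size=0.25` (scikit-learn's default), `random_state=0`, `stratify=y` and `max_iter=1000` are assumptions, and the features are synthetic random numbers rather than the diabetes data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 features, binary target.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Hold out 25% of the rows for testing, keeping the class balance similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit logistic regression on the training part and predict the held-out part.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(X_train.shape, X_test.shape)
```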

15.Selecting the appropriate data for Evaluation

Comparing the model accuracy scores on the two datasets, df1 and df2, the accuracy on df1
is higher. This suggests that, even though we made assumptions about the missing values
when filling in medians, the predictions were better than in the case where we removed
the rows with missing data. This is a favourable outcome, as it lets us model on the
entire dataset using df1.
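A comparison of this kind could be set up as in the sketch below. The data is synthetic, so the scores it prints say nothing about the real result; `holdout_accuracy` is a hypothetical helper, and the split and model settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def holdout_accuracy(frame, target="Outcome", seed=0):
    """Fit logistic regression on 75% of `frame` and return accuracy on the other 25%."""
    X = frame.drop(columns=[target])
    y = frame[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in data with about 40% of Insulin readings set to 0 (missing).
rng = np.random.default_rng(1)
n = 300
glucose = rng.normal(120, 30, n)
insulin = rng.normal(100, 40, n).clip(min=1)
outcome = (glucose + rng.normal(0, 25, n) > 125).astype(int)
insulin[rng.random(n) < 0.4] = 0
frame = pd.DataFrame({"Glucose": glucose, "Insulin": insulin, "Outcome": outcome})

# df1: fill missing Insulin with the median; df2: drop those rows instead.
median_insulin = frame.loc[frame["Insulin"] != 0, "Insulin"].median()
df1 = frame.assign(Insulin=frame["Insulin"].replace(0, median_insulin))
df2 = frame[frame["Insulin"] != 0]

scores = {"df1": holdout_accuracy(df1), "df2": holdout_accuracy(df2)}
print(scores)
```

Note that df1 and df2 are scored on different held-out rows, so a difference between the two scores is only indicative.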
16.The result:
Here, we can see that the final model has an accuracy of 0.7447916666666666 (about 74%).
It has an Area Under the Curve of 0.78.
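For reference, the two metrics could be computed as in this sketch, which uses scikit-learn's `accuracy_score` and `roc_auc_score` on a handful of invented labels and probabilities. The 0.5 threshold is an assumption. As an aside, 0.7447916666666666 equals 143/192, which would be consistent with 143 correct predictions on a 25% hold-out of the 768 rows, although the report does not state the split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

# Accuracy needs hard class labels, so threshold the probabilities at 0.5.
y_pred = (y_prob >= 0.5).astype(int)
accuracy = accuracy_score(y_true, y_pred)

# AUC is computed from the probabilities themselves, not the thresholded labels.
auc = roc_auc_score(y_true, y_prob)

print(accuracy)  # 0.75
print(auc)       # 0.875
```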

4.Code(Screenshot):


5.Conclusion:

Diabetes is a serious disease in our society. It is very common in developing nations. In
India, it is said that nearly 7% of the adult population has diabetes, and it is commonly found
in my family as well. A machine learning model, if used in the right manner, could help in
detecting symptoms that lead up to diabetes. This could have tremendous health and cost
benefits for the users.
With our model we are able to predict diabetes with about 74% accuracy. It is also important
to note that the 2 most important factors in detecting diabetes are:
1. Glucose
2. Body Mass Index
