Ads Exp 10
“Diabetes Prediction”
by
Tools:
1. NumPy: For numerical computations and array manipulations.
2. Pandas: For data manipulation and analysis, including reading and
loading datasets into DataFrames.
3. Matplotlib: For creating static, interactive, and animated visualizations in
Python.
4. Seaborn: For statistical data visualization based on Matplotlib, providing
a high-level interface for drawing attractive and informative statistical
graphics.
5. Scikit-Learn: For machine learning tasks, including the Support Vector
Machine classifier, which is imported from sklearn.svm.
PROBLEM STATEMENT:
The problem is to predict whether a person has diabetes, based on patient
information such as blood pressure, body mass index (BMI), and age. Using
machine learning techniques, specifically a Support Vector Machine, the aim is
to let users predict diabetes through a prediction engine. The project pursues
this aim by studying statistical models in machine learning and understanding
how the algorithms work. This case study walks through the various stages of
the data science workflow.
LIFE CYCLE:
I. Data Collection
1. Data Collection: The dataset used for this model is the Pima Indians
Diabetes dataset which consists of several medical predictor variables
and one target variable, Outcome. Predictor variables include the
number of pregnancies the patient has had, their Body Mass Index,
insulin level, glucose level, diabetes pedigree function, blood pressure,
skin thickness and age.
2. Data Cleaning: Clean the data by handling missing values, outliers, and
ensuring consistency. This step is crucial for accurate predictions.
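The collection and cleaning steps above can be sketched as follows. This is a
minimal sketch, not the report's actual code. The rows in SAMPLE_CSV are
made-up placeholder values that only reuse the dataset's column names, so the
example runs without the real file. Treating zeros in the listed columns as
missing and filling them with the column median are assumptions; other
imputation strategies are equally possible.

```python
import io

import numpy as np
import pandas as pd

# Placeholder rows (NOT real patient data) using the Pima column names.
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,120,70,30,0,31.5,0.40,45,1
1,90,0,25,80,24.0,0.30,28,0
4,0,68,0,0,29.0,0.55,37,1
0,105,64,20,100,0.0,0.20,22,0
3,150,80,35,130,35.0,0.70,52,1
"""

# Assumption: a zero in these columns is not physiologically plausible
# and is therefore treated as a missing measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with implausible zeros replaced by the column median."""
    cleaned = df.copy()
    # Mark the implausible zeros as missing values.
    cleaned[ZERO_AS_MISSING] = cleaned[ZERO_AS_MISSING].replace(0, np.nan)
    # Impute each column with the median of its observed values.
    medians = cleaned[ZERO_AS_MISSING].median()
    cleaned[ZERO_AS_MISSING] = cleaned[ZERO_AS_MISSING].fillna(medians)
    return cleaned


# With the real dataset this would be pd.read_csv(<path to the CSV file>).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
cleaned = clean(df)
print(cleaned[ZERO_AS_MISSING].isna().sum())
```

The median is used here because it is less sensitive to outliers than the
mean, which matters for skewed measurements such as insulin.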
5. Model Selection:
Choose appropriate machine learning models for binary classification
(diabetes vs. non-diabetes). Some common models include:
i. Logistic Regression: A simple yet effective model.
ii. Random Forest: An ensemble of decision trees.
iii. Support Vector Machine (SVM): Handles non-linear decision boundaries
when used with a kernel such as RBF.
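The three candidate models listed above could be set up in Scikit-Learn roughly
as shown below. This is an illustrative sketch: the helper name
build_candidate_models, the dictionary keys, and all hyperparameter values are
assumptions, not settings taken from this report. Logistic regression and the
SVM are wrapped in a pipeline with StandardScaler because both are sensitive
to feature scale.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_candidate_models(random_state=42):
    """Return the three candidate classifiers keyed by a short name.

    Hypothetical helper; hyperparameters are illustrative defaults.
    """
    return {
        # Linear baseline, scaled so the solver converges reliably.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        # Ensemble of decision trees; tree models do not need scaling.
        "random_forest": RandomForestClassifier(
            n_estimators=200, random_state=random_state
        ),
        # SVM with an RBF kernel for non-linear decision boundaries.
        "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }


models = build_candidate_models()
for name, model in models.items():
    print(name, "->", type(model).__name__)
```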
6. Model Training and Evaluation:
i. Data Splitting: Divide the dataset into training and testing subsets.
ii. Model Training: Train each selected model on the training data.
iii. Model Evaluation: Assess model performance using metrics such as
accuracy, sensitivity, specificity, precision, F1 score, and the Receiver
Operating Characteristic (ROC) curve.
iv. Cross-Validation: Use k-fold cross-validation to estimate how well the
model generalizes to unseen data.
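The splitting, training, evaluation, and cross-validation steps above can be
sketched for the SVM as follows. To keep the example self-contained, it uses
synthetic data from make_classification with eight features as a stand-in for
the diabetes dataset, so the printed scores say nothing about real diabetes
prediction. The 80/20 split, the RBF kernel, and k = 5 are assumptions chosen
for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 8 predictor variables and the binary Outcome.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)

# i. Data splitting: hold out 20% for testing, preserving class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# ii. Model training: scale the features, then fit an RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# iii. Model evaluation on the held-out test set.
y_pred = model.predict(X_test)
y_score = model.decision_function(X_test)  # continuous scores for ROC AUC
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "sensitivity": recall_score(y_test, y_pred),  # tp / (tp + fn)
    "specificity": tn / (tn + fp),
    "precision": precision_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# iv. k-fold cross-validation (k = 5) on the training data only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv,
                            scoring="accuracy")
print(f"cv accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Keeping the scaler inside the pipeline means it is refit on each training fold
during cross-validation, so no information from the validation fold leaks into
preprocessing.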
7. Model Deployment:
8. Prediction of Diabetes: