This repository contains a collection of machine learning notebooks covering data preprocessing, visualization, and feature selection techniques.
- preprocessing.ipynb: Contains data preprocessing techniques including:
  - Data loading and exploration
  - Min-Max scaling with sklearn
  - Standardization (0 mean, 1 standard deviation)
- plots.ipynb: Contains data visualization techniques including:
  - Basic data shape and statistics
  - Skewness analysis
  - Histograms for feature distributions
- feature_selection.ipynb: Contains feature selection and extraction techniques including:
  - Univariate Feature Selection
  - Recursive Feature Elimination (RFE)
  - Principal Component Analysis (PCA)
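
The scaling and feature selection techniques listed above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the data is a synthetic stand-in for the eight feature columns, and the scoring function (`f_classif`), the estimator (`LogisticRegression`), and the values `k=4` and `n_features_to_select=3` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the eight feature columns (NOT the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 8))
# Hypothetical binary label driven by the first two columns.
y = (X[:, 0] + X[:, 1] > 100.0).astype(int)

# Min-Max scaling: each column is mapped onto the range [0, 1].
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardization: each column gets mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Univariate selection: keep the 4 columns with the highest ANOVA F-score.
# (f_classif is an assumed choice; other score functions exist.)
kbest = SelectKBest(score_func=f_classif, k=4).fit(X_std, y)
X_kbest = kbest.transform(X_std)

# RFE: repeatedly fit a model and drop the weakest feature until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_std, y)

# PCA: project onto the 3 orthogonal directions of greatest variance.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)

print("k-best mask:", kbest.get_support())
print("RFE mask:   ", rfe.support_)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)
```

Note that univariate selection and RFE keep a subset of the original columns, while PCA produces new columns that are linear combinations of all of them.
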
The project uses the Pima Indians Diabetes Dataset, which includes several health metrics and a binary class label for diabetes. The dataset has the following columns:
- 'preg': Number of pregnancies
- 'plas': Plasma glucose concentration
- 'pres': Blood pressure
- 'skin': Skin thickness
- 'test': Insulin level
- 'mass': BMI
- 'pedi': Diabetes pedigree function
- 'age': Age
- 'class': Binary outcome (diabetes or not)
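
As a rough sketch of how these column names might be applied when loading the data with pandas, assuming the CSV has no header row: the rows below are illustrative placeholders in the same layout, not real records, and in practice the `io.StringIO` object would be replaced by the path to your copy of the dataset.

```python
import io

import pandas as pd

# Column names as listed above.
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Illustrative placeholder rows only (NOT real records from the dataset).
sample_csv = io.StringIO(
    "2,120,70,30,80,28.5,0.45,35,0\n"
    "5,160,80,25,0,33.0,0.60,52,1\n"
    "1,95,64,20,60,24.1,0.25,23,0\n"
    "7,140,76,32,120,36.2,0.80,45,1\n"
)

# names= assigns the headers because the file itself is assumed to have none.
df = pd.read_csv(sample_csv, names=names)

print(df.shape)                    # (rows, columns)
print(df.describe())               # per-column summary statistics
print(df.groupby("class").size())  # number of rows per outcome
print(df.skew())                   # per-column skewness
```

With matplotlib installed, `df.hist()` would draw one histogram per column from the same DataFrame.
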
To run these notebooks, you'll need:
- Python 3.x
- Jupyter Notebook
- Required packages: pandas, numpy, scikit-learn (imported as sklearn), matplotlib

To get started:
- Clone this repository
- Install the required packages:
    pip install pandas numpy scikit-learn matplotlib jupyter
- Start Jupyter Notebook:
    jupyter notebook
- Open and run the notebooks in your browser

This project is open source and available under the MIT License.
Harrison Miller - GitHub Profile