B58 - Handling Missing Values, Feature Selection

The document outlines a process for handling missing values in a diabetes dataset using sklearn's SimpleImputer, verifying that no features contain missing values after imputation. It then walks through feature selection on a placement dataset: numerical and categorical features are imputed and one-hot encoded, a RandomForestClassifier is trained to predict placement status, and retraining on the most important features yields an accuracy of approximately 95.35%. Code snippets for data imputation, feature-importance evaluation, and model training are included.


1.1 Handling Missing Values

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.read_csv('diabetes.csv')

print(df.isnull().sum())

Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
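
The printout shows no NaN values, so the mean imputation below changes nothing in this file. If this is the widely used Pima Indians Diabetes dataset, missing measurements are conventionally encoded as zeros in several columns; a minimal sketch of surfacing them before imputing (an assumption about the data, not part of the original notebook):

# Assumption: zeros in these columns denote missing measurements
# (a common convention in the Pima Indians Diabetes dataset).
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
print(df.isnull().sum())  # now reports the true number of missing entries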

# Example using SimpleImputer for numerical features
numerical_cols = df.select_dtypes(include=np.number).columns
imputer = SimpleImputer(strategy='mean')  # You can change the strategy ('median', 'most_frequent', 'constant')
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])
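
As the comment notes, the strategy argument is configurable; a few alternative imputers, shown only for illustration (the notebook itself uses the mean):

median_imputer = SimpleImputer(strategy='median')                    # more robust to outliers than the mean
mode_imputer = SimpleImputer(strategy='most_frequent')               # also works for categorical columns
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)  # fills with a fixed value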

# Verify if missing values are handled


print("\nMissing values after imputation:")
print(df.isnull().sum())

Missing values after imputation:


Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

# Further analysis or model training can be done here...
print("\nFirst few rows of the DataFrame after handling missing values:")
df.head()

First few rows of the DataFrame after handling missing values:

{"summary":"{\n \"name\": \"df\",\n \"rows\": 768,\n \"fields\": [\


n {\n \"column\": \"Pregnancies\",\n \"properties\": {\n
\"dtype\": \"number\",\n \"std\": 3.3695780626988623,\n
\"min\": 0.0,\n \"max\": 17.0,\n \"num_unique_values\":
17,\n \"samples\": [\n 6.0,\n 1.0,\n
3.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"Glucose\",\n \"properties\": {\n \"dtype\": \"number\",\
n \"std\": 31.97261819513622,\n \"min\": 0.0,\n
\"max\": 199.0,\n \"num_unique_values\": 136,\n
\"samples\": [\n 151.0,\n 101.0,\n 112.0\n
],\n \"semantic_type\": \"\",\n \"description\": \"\"\n
}\n },\n {\n \"column\": \"BloodPressure\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
19.355807170644777,\n \"min\": 0.0,\n \"max\": 122.0,\n
\"num_unique_values\": 47,\n \"samples\": [\n 86.0,\n
46.0,\n 85.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"SkinThickness\",\n \"properties\": {\n \"dtype\":
\"number\",\n \"std\": 15.952217567727677,\n \"min\":
0.0,\n \"max\": 99.0,\n \"num_unique_values\": 51,\n
\"samples\": [\n 7.0,\n 12.0,\n 48.0\n
],\n \"semantic_type\": \"\",\n \"description\": \"\"\n
}\n },\n {\n \"column\": \"Insulin\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
115.24400235133837,\n \"min\": 0.0,\n \"max\": 846.0,\n
\"num_unique_values\": 186,\n \"samples\": [\n 52.0,\n
41.0,\n 183.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"BMI\",\n \"properties\": {\n \"dtype\": \"number\",\n
\"std\": 7.8841603203754405,\n \"min\": 0.0,\n \"max\":
67.1,\n \"num_unique_values\": 248,\n \"samples\": [\n
19.9,\n 31.0,\n 38.1\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n },\n {\n \"column\": \"DiabetesPedigreeFunction\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
0.33132859501277484,\n \"min\": 0.078,\n \"max\": 2.42,\
n \"num_unique_values\": 517,\n \"samples\": [\n
1.731,\n 0.426,\n 0.138\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n },\n {\n \"column\": \"Age\",\n \"properties\": {\n
\"dtype\": \"number\",\n \"std\": 11.76023154067868,\n
\"min\": 21.0,\n \"max\": 81.0,\n \"num_unique_values\":
52,\n \"samples\": [\n 60.0,\n 47.0,\n
72.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"Outcome\",\n \"properties\": {\n \"dtype\": \"number\",\
n \"std\": 0.4769513772427971,\n \"min\": 0.0,\n
\"max\": 1.0,\n \"num_unique_values\": 2,\n \"samples\":
[\n 0.0,\n 1.0\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n }\n ]\n}","type":"dataframe","variable_name":"df"}

1.2 Feature Selection

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('Placement_Dataset.csv')

# Identify numerical and categorical columns


numerical_cols = df.select_dtypes(include=np.number).columns
categorical_cols = df.select_dtypes(exclude=np.number).columns

# Handle missing values in numerical features
numerical_imputer = SimpleImputer(strategy='mean')
df[numerical_cols] = numerical_imputer.fit_transform(df[numerical_cols])

# Handle missing values in categorical features (using most frequent)
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])

# Convert categorical features to numerical using one-hot encoding


df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
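
With drop_first=True, a categorical column with k levels becomes k-1 indicator columns (the dropped level is the implicit baseline), which is why names such as gender_M and status_Placed appear below. A tiny illustration on a made-up frame (hypothetical data, not from the placement file):

demo = pd.DataFrame({'gender': ['M', 'F'], 'status': ['Placed', 'Not Placed']})
print(pd.get_dummies(demo, drop_first=True))
# -> columns gender_M and status_Placed; gender_F and status_Not Placed are dropped as baselines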

X = df.drop('status_Placed', axis=1)
y = df['status_Placed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

importances = model.feature_importances_
feature_importances = pd.Series(importances, index=X.columns)
print("Feature Importances:")
print(feature_importances.sort_values(ascending=False))
Feature Importances:
salary 0.323788
ssc_p 0.234238
degree_p 0.133835
hsc_p 0.119721
mba_p 0.052254
etest_p 0.036298
sl_no 0.033773
workex_Yes 0.019531
specialisation_Mkt&HR 0.012677
gender_M 0.010178
ssc_b_Others 0.006278
degree_t_Sci&Tech 0.004213
hsc_b_Others 0.003628
degree_t_Others 0.003604
hsc_s_Science 0.003567
hsc_s_Commerce 0.002416
dtype: float64
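
The next snippet keeps the five most important features by hand; an equivalent, slightly more general route is sklearn's SelectFromModel, sketched here as an aside (not part of the original notebook):

from sklearn.feature_selection import SelectFromModel

# Pick the top 5 features from the already-fitted forest (prefit=True)
selector = SelectFromModel(model, prefit=True, max_features=5, threshold=-np.inf)
print(X.columns[selector.get_support()])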

top_features = feature_importances.nlargest(5).index
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]

model_selected = RandomForestClassifier(random_state=42)
model_selected.fit(X_train_selected, y_train)

RandomForestClassifier(random_state=42)

y_pred_selected = model_selected.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, y_pred_selected)

print("\nAccuracy with selected features:", accuracy_selected)

Accuracy with selected features: 0.9534883720930233
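
For comparison, the model fitted earlier on all features can be scored the same way (a sketch; this number is not reported in the original):

y_pred_all = model.predict(X_test)
print("Accuracy with all features:", accuracy_score(y_test, y_pred_all))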
