B58 - Handling Missing Values, Feature Selection

The document outlines a process for handling missing values in a diabetes dataset using sklearn's SimpleImputer, verifying that no features contain missing values after imputation. It then walks through feature selection on a placement dataset: numerical and categorical features are imputed and one-hot encoded, a RandomForestClassifier is trained to predict placement status, and retraining on the most important features yields an accuracy of approximately 95.35%. Code snippets for data imputation, feature-importance evaluation, and model training are included.


1.1 Handling Missing Values

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.read_csv('diabetes.csv')

print(df.isnull().sum())

Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
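
The printout shows no NaN values, so the mean imputation below changes nothing in this file. If this is the widely used Pima Indians Diabetes dataset, missing measurements are conventionally encoded as zeros in several columns; a minimal sketch of surfacing them before imputing (an assumption about the data, not part of the original notebook):

# Assumption: zeros in these columns denote missing measurements
# (a common convention in the Pima Indians Diabetes dataset).
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
print(df.isnull().sum())  # now reports the true number of missing entries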

# Example using SimpleImputer for numerical features
numerical_cols = df.select_dtypes(include=np.number).columns
imputer = SimpleImputer(strategy='mean')  # You can change the strategy ('median', 'most_frequent', 'constant')
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])
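
As the comment notes, the strategy argument is configurable; a few alternative imputers, shown only for illustration (the notebook itself uses the mean):

median_imputer = SimpleImputer(strategy='median')                    # more robust to outliers than the mean
mode_imputer = SimpleImputer(strategy='most_frequent')               # also works for categorical columns
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)  # fills with a fixed value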

# Verify if missing values are handled


print("\nMissing values after imputation:")
print(df.isnull().sum())

Missing values after imputation:


Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

# Further analysis or model training can be done here...
print("\nFirst few rows of the DataFrame after handling missing values:")
df.head()

First few rows of the DataFrame after handling missing values:

{"summary":"{\n \"name\": \"df\",\n \"rows\": 768,\n \"fields\": [\


n {\n \"column\": \"Pregnancies\",\n \"properties\": {\n
\"dtype\": \"number\",\n \"std\": 3.3695780626988623,\n
\"min\": 0.0,\n \"max\": 17.0,\n \"num_unique_values\":
17,\n \"samples\": [\n 6.0,\n 1.0,\n
3.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"Glucose\",\n \"properties\": {\n \"dtype\": \"number\",\
n \"std\": 31.97261819513622,\n \"min\": 0.0,\n
\"max\": 199.0,\n \"num_unique_values\": 136,\n
\"samples\": [\n 151.0,\n 101.0,\n 112.0\n
],\n \"semantic_type\": \"\",\n \"description\": \"\"\n
}\n },\n {\n \"column\": \"BloodPressure\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
19.355807170644777,\n \"min\": 0.0,\n \"max\": 122.0,\n
\"num_unique_values\": 47,\n \"samples\": [\n 86.0,\n
46.0,\n 85.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"SkinThickness\",\n \"properties\": {\n \"dtype\":
\"number\",\n \"std\": 15.952217567727677,\n \"min\":
0.0,\n \"max\": 99.0,\n \"num_unique_values\": 51,\n
\"samples\": [\n 7.0,\n 12.0,\n 48.0\n
],\n \"semantic_type\": \"\",\n \"description\": \"\"\n
}\n },\n {\n \"column\": \"Insulin\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
115.24400235133837,\n \"min\": 0.0,\n \"max\": 846.0,\n
\"num_unique_values\": 186,\n \"samples\": [\n 52.0,\n
41.0,\n 183.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"BMI\",\n \"properties\": {\n \"dtype\": \"number\",\n
\"std\": 7.8841603203754405,\n \"min\": 0.0,\n \"max\":
67.1,\n \"num_unique_values\": 248,\n \"samples\": [\n
19.9,\n 31.0,\n 38.1\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n },\n {\n \"column\": \"DiabetesPedigreeFunction\",\n
\"properties\": {\n \"dtype\": \"number\",\n \"std\":
0.33132859501277484,\n \"min\": 0.078,\n \"max\": 2.42,\
n \"num_unique_values\": 517,\n \"samples\": [\n
1.731,\n 0.426,\n 0.138\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n },\n {\n \"column\": \"Age\",\n \"properties\": {\n
\"dtype\": \"number\",\n \"std\": 11.76023154067868,\n
\"min\": 21.0,\n \"max\": 81.0,\n \"num_unique_values\":
52,\n \"samples\": [\n 60.0,\n 47.0,\n
72.0\n ],\n \"semantic_type\": \"\",\n
\"description\": \"\"\n }\n },\n {\n \"column\":
\"Outcome\",\n \"properties\": {\n \"dtype\": \"number\",\
n \"std\": 0.4769513772427971,\n \"min\": 0.0,\n
\"max\": 1.0,\n \"num_unique_values\": 2,\n \"samples\":
[\n 0.0,\n 1.0\n ],\n
\"semantic_type\": \"\",\n \"description\": \"\"\n }\
n }\n ]\n}","type":"dataframe","variable_name":"df"}

1.2 Feature Selection

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv('Placement_Dataset.csv')

# Identify numerical and categorical columns


numerical_cols = df.select_dtypes(include=np.number).columns
categorical_cols = df.select_dtypes(exclude=np.number).columns

# Handle missing values in numerical features
numerical_imputer = SimpleImputer(strategy='mean')
df[numerical_cols] = numerical_imputer.fit_transform(df[numerical_cols])

# Handle missing values in categorical features (using most frequent)
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])

# Convert categorical features to numerical using one-hot encoding


df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
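
With drop_first=True, a categorical column with k levels becomes k-1 indicator columns (the dropped level is the implicit baseline), which is why names such as gender_M and status_Placed appear below. A tiny illustration on a made-up frame (hypothetical data, not from the placement file):

demo = pd.DataFrame({'gender': ['M', 'F'], 'status': ['Placed', 'Not Placed']})
print(pd.get_dummies(demo, drop_first=True))
# -> columns gender_M and status_Placed; gender_F and status_Not Placed are dropped as baselines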

X = df.drop('status_Placed', axis=1)
y = df['status_Placed']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

importances = model.feature_importances_
feature_importances = pd.Series(importances, index=X.columns)
print("Feature Importances:")
print(feature_importances.sort_values(ascending=False))
Feature Importances:
salary 0.323788
ssc_p 0.234238
degree_p 0.133835
hsc_p 0.119721
mba_p 0.052254
etest_p 0.036298
sl_no 0.033773
workex_Yes 0.019531
specialisation_Mkt&HR 0.012677
gender_M 0.010178
ssc_b_Others 0.006278
degree_t_Sci&Tech 0.004213
hsc_b_Others 0.003628
degree_t_Others 0.003604
hsc_s_Science 0.003567
hsc_s_Commerce 0.002416
dtype: float64
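
The next snippet keeps the five most important features by hand; an equivalent, slightly more general route is sklearn's SelectFromModel, sketched here as an aside (not part of the original notebook):

from sklearn.feature_selection import SelectFromModel

# Pick the top 5 features from the already-fitted forest (prefit=True)
selector = SelectFromModel(model, prefit=True, max_features=5, threshold=-np.inf)
print(X.columns[selector.get_support()])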

top_features = feature_importances.nlargest(5).index
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]

model_selected = RandomForestClassifier(random_state=42)
model_selected.fit(X_train_selected, y_train)

RandomForestClassifier(random_state=42)

y_pred_selected = model_selected.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, y_pred_selected)

print("\nAccuracy with selected features:", accuracy_selected)

Accuracy with selected features: 0.9534883720930233
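
For comparison, the model fitted earlier on all features can be scored the same way (a sketch; this number is not reported in the original):

y_pred_all = model.predict(X_test)
print("Accuracy with all features:", accuracy_score(y_test, y_pred_all))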
