Data Cleaning Approaches in Machine Learning Algorithms
1. Handling Missing Values
Fill missing values using statistical imputation or neighboring observations.
Python Code:
import pandas as pd
from sklearn.impute import SimpleImputer

# Mean/Median/Mode Imputation
imputer = SimpleImputer(strategy='mean')  # Change to 'median' or 'most_frequent'
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Forward/Backward Fill (fillna(method=...) is deprecated in pandas 2.x)
data_filled = data.ffill()  # Use data.bfill() for backward fill
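As a quick, self-contained illustration of both strategies (using a toy DataFrame rather than the document's `data`):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing values
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# Mean imputation fills each column with its own mean
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Forward fill propagates the last valid observation downwards
df_ffilled = df.ffill()
```

Here the missing `a` becomes the column mean 2.0, while forward fill copies the previous row's value instead.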
2. Handling Outliers
Detect outliers using statistical methods or visual tools.
Handle outliers by capping, transforming, or removing them.
Python Code:
import numpy as np
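The section's code stops at the import; a minimal sketch of IQR-based detection and capping (the column name is illustrative) could look like:

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 100]})  # 100 is an outlier

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
q1, q3 = df['value'].quantile(0.25), df['value'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) rather than drop, preserving the row count
df['value_capped'] = df['value'].clip(lower, upper)
```

Capping keeps every row while pulling the extreme value back to the upper bound; removing the rows instead is a one-line filter on the same bounds.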
3. Removing Duplicates
Identify duplicate records in the dataset.
Remove duplicates while retaining necessary unique entries.
Python Code:
# Identify and remove duplicates
data_no_duplicates = data.drop_duplicates()
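`drop_duplicates` also accepts `subset` and `keep` arguments for the "retain necessary unique entries" part; a brief illustration on toy data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'score': [0.5, 0.5, 0.9]})

# Exact duplicates: rows equal across all columns
df_unique = df.drop_duplicates()

# Duplicates judged on selected columns only, keeping the last occurrence
df_by_id = df.drop_duplicates(subset=['id'], keep='last')
```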
4. Feature Scaling
Rescale numerical features so they share a comparable range.
Python Code:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(data),
                                   columns=data.columns)
5. Encoding Categorical Variables
Convert categorical variables into numerical values using encoding techniques
like One-Hot Encoding or Label Encoding.
Python Code:
from sklearn.preprocessing import LabelEncoder
# Label Encoding
label_encoder = LabelEncoder()
data['encoded_column'] = label_encoder.fit_transform(data['categorical_column'])
# One-Hot Encoding (via pd.get_dummies; sklearn's OneHotEncoder also works)
data_one_hot = pd.get_dummies(data, columns=['categorical_column'],
                              drop_first=True)
6. Handling Imbalanced Data
Rebalance class distributions with resampling techniques such as SMOTE.
Python Code:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split dataset (resample only the training split, never the test split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# SMOTE Oversampling
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
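SMOTE lives in the separate imbalanced-learn package. If that dependency is unavailable, plain random oversampling of the minority class can be sketched with pandas alone (toy data below):

```python
import pandas as pd

train = pd.DataFrame({'feature': range(10),
                      'label':   [0] * 8 + [1] * 2})  # 8 vs 2: imbalanced

counts = train['label'].value_counts()
minority_label = counts.idxmin()
deficit = counts.max() - counts.min()

# Resample minority rows with replacement until classes are balanced
extra = train[train['label'] == minority_label].sample(n=deficit, replace=True,
                                                       random_state=42)
train_balanced = pd.concat([train, extra], ignore_index=True)
```

Unlike SMOTE, this duplicates existing minority rows rather than synthesizing new ones, but it is often a reasonable baseline.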
7. Handling Inconsistent Data
Standardize formats, correct typos, and handle inconsistencies in data types or
units.
Python Code:
# Correct inconsistent data
data['date_column'] = pd.to_datetime(data['date_column'])
data['text_column'] = data['text_column'].str.lower() # Lowercase text
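Beyond dates and casing, inconsistent category spellings can be normalized and mapped to canonical values; a sketch with a hypothetical `country` column:

```python
import pandas as pd

df = pd.DataFrame({'country': ['USA', 'U.S.A.', 'usa', 'Germany']})

# Normalize case and punctuation first, then map known variants to one spelling
canonical = {'usa': 'United States', 'germany': 'Germany'}
cleaned = df['country'].str.lower().str.replace('.', '', regex=False)
df['country_clean'] = cleaned.map(canonical)
```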
8. Feature Engineering
Create new features based on existing ones, or use interaction and polynomial
features.
Python Code:
from sklearn.preprocessing import PolynomialFeatures
# Polynomial features
poly = PolynomialFeatures(degree=2)
data_poly = poly.fit_transform(data[['feature1', 'feature2']])
9. Removing Irrelevant Features
Remove features that provide little or no information.
Python Code:
# Variance Threshold
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
data_reduced = selector.fit_transform(data)
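Note that `fit_transform` returns a bare array and discards column names; `get_support()` recovers which columns survived, as this toy example shows:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({'constant': [1, 1, 1, 1],   # zero variance: dropped
                   'varying':  [1, 2, 3, 4]})  # variance 1.25: kept

selector = VarianceThreshold(threshold=0.1)
reduced = selector.fit_transform(df)

# get_support() marks surviving columns, letting us rebuild a labeled DataFrame
kept = df.columns[selector.get_support()]
df_reduced = pd.DataFrame(reduced, columns=kept)
```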
10. Adjusting the Classification Threshold
Tune the decision threshold from the precision-recall trade-off instead of using
the default 0.5.
Python Code:
import numpy as np
from sklearn.metrics import precision_recall_curve

y_pred_prob = classifier.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
# precision/recall have one more entry than thresholds, so drop the last point;
# maximizing precision alone would pick a degenerate threshold, so use F1
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1_scores)]
Note: The data cleaning techniques above, together with their Python code, can
help you build a robust preprocessing pipeline that improves dataset quality
before the data is fed into machine learning models.