
Data Cleaning Approaches in Machine Learning Algorithms

1. Handling Missing Data

 Identify missing values.
 Impute or remove missing data using appropriate techniques (a removal sketch follows the code below).
 Python Code:
import pandas as pd
from sklearn.impute import SimpleImputer

# Identify missing data
missing_values = data.isnull().sum()

# Mean/Median/Mode Imputation
imputer = SimpleImputer(strategy='mean')  # Can change to 'median' or 'most_frequent'
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Forward/Backward Fill (fillna(method=...) is deprecated in recent pandas)
data_filled = data.ffill()  # Use data.bfill() for backward fill
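When a row or column is mostly empty, removal can beat imputation. A minimal sketch; the 50% cutoff is an arbitrary illustration, not a rule:

# Drop rows that contain any missing value
data_dropped_rows = data.dropna()

# Drop columns with fewer than 50% non-missing values
data_dropped_cols = data.dropna(axis=1, thresh=int(0.5 * len(data)))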

2. Handling Outliers
 Detect outliers using statistical methods or visual tools.
 Handle outliers by capping, transforming, or removing them (a capping sketch follows the code below).
 Python Code:
import numpy as np

# Z-score Method for Outlier Detection
z_scores = (data - data.mean()) / data.std()
data_no_outliers = data[(np.abs(z_scores) < 3).all(axis=1)]  # Keep only rows with |z| < 3 in every column

# IQR Method for Outlier Detection
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data_no_outliers = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
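Capping (winsorizing) keeps every row but clamps extreme values to the IQR fences; a minimal sketch reusing Q1, Q3, and IQR from above:

# Clamp values to the IQR fences instead of dropping rows
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
data_capped = data.clip(lower=lower, upper=upper, axis=1)  # axis=1 aligns the bounds with columns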

3. Removing Duplicates
 Identify duplicate records in the dataset.
 Remove duplicates while retaining necessary unique entries.
 Python Code:
# Identify and remove duplicates
data_no_duplicates = data.drop_duplicates()
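When only certain columns define a record's identity, deduplicate on that subset; 'id_column' below is a hypothetical key column:

# Keep the first occurrence per key; other columns are treated as payload
data_unique = data.drop_duplicates(subset=['id_column'], keep='first')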

4. Normalizing and Scaling

 Normalize or scale features for algorithms sensitive to different feature scales.
 Python Code:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(data), columns=data.columns)

5. Encoding Categorical Variables
 Convert categorical variables into numerical values using encoding techniques like One-Hot Encoding or Label Encoding.
 Python Code:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label Encoding (best suited to ordinal categories or target labels)
label_encoder = LabelEncoder()
data['encoded_column'] = label_encoder.fit_transform(data['categorical_column'])

# One-Hot Encoding via pandas
data_one_hot = pd.get_dummies(data, columns=['categorical_column'], drop_first=True)
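The OneHotEncoder imported above does the same job and fits into sklearn pipelines; a minimal sketch (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False):

# One-Hot Encoding via scikit-learn
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(data[['categorical_column']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['categorical_column']))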

6. Dealing with Imbalanced Data

 Apply oversampling or undersampling techniques to balance class distributions (an undersampling sketch follows the SMOTE code).
 Python Code:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split the dataset first (stratify preserves the class ratio in both splits)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE Oversampling
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
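Undersampling is the mirror-image alternative: instead of synthesizing minority samples, it drops majority samples. A minimal sketch with imbalanced-learn:

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until classes are balanced
undersampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)
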
7. Handling Inconsistent Data
 Standardize formats, correct typos, and handle inconsistencies in data types or units (a unit-conversion sketch follows the code).
 Python Code:
# Correct inconsistent data
data['date_column'] = pd.to_datetime(data['date_column'])
data['text_column'] = data['text_column'].str.lower()  # Lowercase text

# Handling typos using fuzzy matching
from fuzzywuzzy import process  # the maintained fork is "thefuzz"; same API

correct_spellings = ["category1", "category2"]
data['corrected_column'] = data['categorical_column'].apply(
    lambda x: process.extractOne(x, correct_spellings)[0])
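Mixed units are another frequent inconsistency. A hedged sketch assuming hypothetical 'weight' and 'unit' columns, where some weights were recorded in grams:

# Convert gram entries to kilograms so the whole column shares one unit
mask = data['unit'] == 'g'
data.loc[mask, 'weight'] = data.loc[mask, 'weight'] / 1000
data.loc[mask, 'unit'] = 'kg'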

8. Feature Engineering
 Create new features based on existing ones, or use interaction and polynomial features.
 Python Code:
from sklearn.preprocessing import PolynomialFeatures

# Creating new features (e.g., interaction terms)
data['new_feature'] = data['feature1'] * data['feature2']

# Polynomial features (degree=2 adds squares and pairwise products)
poly = PolynomialFeatures(degree=2)
data_poly = poly.fit_transform(data[['feature1', 'feature2']])

9. Removing Irrelevant Features
 Remove features that provide little or no information.
 Python Code:
# Variance Threshold
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)
data_reduced = selector.fit_transform(data)

10. Handling Multicollinearity

 Detect multicollinearity using correlation matrices or VIF and remove highly correlated features (a correlation-matrix sketch follows the code).
 Python Code:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Remove features with high VIF scores (values above ~10 are a common red flag)
X_reduced = X.drop(columns=['high_vif_feature'])  # 'high_vif_feature' is a placeholder
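The correlation-matrix route mentioned above can automate the choice; a minimal sketch dropping one feature from every pair whose absolute correlation exceeds 0.9 (an arbitrary cutoff):

# Keep the upper triangle so each pair is inspected once
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_uncorrelated = X.drop(columns=to_drop)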

11. Text Data Cleaning

 Clean and preprocess text data by tokenizing, removing stopwords, and normalizing case.
 Python Code:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Requires one-time downloads: nltk.download('stopwords') and nltk.download('punkt')

# Remove punctuation and stopwords, and lowercase the text
stop_words = set(stopwords.words('english'))
data['cleaned_text'] = data['text_column'].apply(
    lambda x: ' '.join(word for word in word_tokenize(x.lower())
                       if word not in stop_words and word not in string.punctuation))

12. Date/Time Data Handling

 Extract features from date columns or normalize to a common time zone (a time-zone sketch follows the code).
 Python Code:
# Extracting year, month, day from datetime
data['year'] = data['date_column'].dt.year
data['month'] = data['date_column'].dt.month
data['day'] = data['date_column'].dt.day
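For the time-zone normalization mentioned above, a minimal sketch assuming the raw timestamps are naive and recorded in UTC (adjust both zone names to your data):

# Attach a zone to naive timestamps, then convert to a common zone
data['date_column'] = (data['date_column']
                       .dt.tz_localize('UTC')
                       .dt.tz_convert('US/Eastern'))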

13. Handling Data Leakage

 Prevent target leakage by splitting into training and test sets early and ensuring no future information reaches the model during training.
 Python Code:
# Separate the data before any fitting, imputation, or scaling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
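A frequent leakage source is fitting preprocessing on the full dataset. Fit transformers on the training split only, then apply them unchanged to the test split:

from sklearn.preprocessing import StandardScaler

# Fit on training data only; the test set stays unseen
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform, never fit_transform, on test data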

14. Handling Zero-Variance Features

 Identify features with no variance and remove them from the dataset.
 Python Code:
# Variance Threshold to remove zero-variance features
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0)
data_cleaned = selector.fit_transform(data)

15. Addressing Imbalanced Targets in Regression
 Use techniques like stratified sampling or sample weighting to give rare target values more influence in regression problems (a fitting sketch follows the code).
 Python Code:
from sklearn.utils import class_weight

# Balanced sample weights (treats each distinct target value as a group to reweight)
class_weights = class_weight.compute_sample_weight(class_weight='balanced', y=y_train)
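The weights can then be passed to any estimator whose fit method accepts sample_weight; a minimal sketch with a random forest regressor:

from sklearn.ensemble import RandomForestRegressor

# Rare target values carry larger weights and so more influence on the fit
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train, sample_weight=class_weights)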

16. Addressing Imbalanced Data in Classification

 Use oversampling, undersampling, or adjusting the decision threshold to handle imbalanced classes.
 Python Code:
# Adjust the decision threshold for an already fitted classifier
from sklearn.metrics import precision_recall_curve

y_pred_prob = classifier.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# Pick the threshold maximizing F1; maximizing precision alone tends to pick a
# degenerate threshold that labels almost nothing positive
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1_scores)]
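Cost-sensitive training is a further alternative to resampling; many sklearn classifiers accept class_weight='balanced', which reweights classes inversely to their frequency:

from sklearn.linear_model import LogisticRegression

# Misclassifying the minority class costs more, without changing the data
classifier = LogisticRegression(class_weight='balanced', max_iter=1000)
classifier.fit(X_train, y_train)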

17. Handling Missing Categorical Values

 Impute missing categorical values using the mode, or create a separate category (a sketch follows the code).
 Python Code:
# Impute missing categorical values with the most frequent category (mode)
imputer = SimpleImputer(strategy='most_frequent')
data['categorical_column'] = imputer.fit_transform(data[['categorical_column']]).ravel()
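The separate-category alternative keeps the missingness visible to the model instead of guessing a value:

# Treat missingness itself as a category
data['categorical_column'] = data['categorical_column'].fillna('Missing')
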
18. Log Transformation
 Apply a logarithmic transformation to reduce skewness in data.
 Python Code:
# Log transformation for skewed, non-negative features
data['log_transformed_feature'] = np.log1p(data['skewed_feature'])  # log(1 + x) avoids log(0)

19. Binning Continuous Variables

 Convert continuous features into discrete intervals or bins for simplification.
 Python Code:
# Binning a continuous feature into discrete categories
data['binned_feature'] = pd.cut(data['continuous_feature'], bins=5,
                                labels=['very low', 'low', 'medium', 'high', 'very high'])

20. Converting Numerical to Categorical

 Convert numerical variables into categorical ones based on specific ranges or thresholds.
 Python Code:
# Convert numerical age into categories (five bin edges define four bins, so four labels)
data['age_group'] = pd.cut(data['age'], bins=[0, 18, 35, 60, 100],
                           labels=['child', 'young adult', 'adult', 'senior'])

Note: The data cleaning techniques above and their corresponding Python code can help you build a robust preprocessing pipeline, improving dataset quality before the data is fed into machine learning models.
