Data Cleaning Approaches in Machine Learning Algorithms

1. Handling Missing Data


 Identify missing values.
 Impute or remove missing data using appropriate techniques (a removal sketch follows the imputation code below).
 Python Code:
import pandas as pd
from sklearn.impute import SimpleImputer

# Identify missing data (assumes a pandas DataFrame named data has already been loaded)
missing_values = data.isnull().sum()

# Mean/Median/Mode Imputation
imputer = SimpleImputer(strategy='mean') # Can change to 'median' or 'most_frequent'
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Forward/Backward Fill
data_filled = data.ffill() # Use data.bfill() for backward fill
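
When imputation is not appropriate, rows or columns with missing values can simply be dropped; a minimal sketch using the same DataFrame:
# Alternatively, drop rows or columns that contain missing values
data_dropped_rows = data.dropna()       # drop rows with any NaN
data_dropped_cols = data.dropna(axis=1) # drop columns with any NaN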

2. Handling Outliers
 Detect outliers using statistical methods or visual tools.
 Handle outliers by capping, transforming, or removing them (a capping sketch follows the code below).
 Python Code:
import numpy as np

# Z-score Method for Outlier Detection
z_scores = (data - data.mean()) / data.std()
data_no_outliers = data[(np.abs(z_scores) < 3).all(axis=1)] # Keep rows where every feature has |z| < 3

# IQR Method for Outlier Detection
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data_no_outliers = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
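
The bullet above also mentions capping; a minimal sketch that reuses the Q1, Q3, and IQR values computed above and assumes every column is numeric:
# Capping (winsorizing): clip values to the IQR bounds instead of dropping rows
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
data_capped = data.clip(lower=lower_bound, upper=upper_bound, axis=1)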

3. Removing Duplicates
 Identify duplicate records in the dataset.
 Remove duplicates while retaining necessary unique entries.
 Python Code:
# Identify and remove duplicates
data_no_duplicates = data.drop_duplicates()

4. Normalizing and Scaling


 Normalize or scale features for algorithms sensitive to different feature scales.
 Python Code:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(data), columns=data.columns)

5. Encoding Categorical Variables
 Convert categorical variables into numerical values using encoding techniques
like One-Hot Encoding or Label Encoding.
 Python Code:
from sklearn.preprocessing import LabelEncoder

# Label Encoding
label_encoder = LabelEncoder()
data['encoded_column'] = label_encoder.fit_transform(data['categorical_column'])

# One-Hot Encoding (pd.get_dummies returns a new DataFrame; drop_first=True avoids the dummy-variable trap)
data_one_hot = pd.get_dummies(data, columns=['categorical_column'], drop_first=True)

6. Dealing with Imbalanced Data


 Apply oversampling or undersampling techniques to balance class distributions (an undersampling sketch follows the SMOTE example below).
 Python Code:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# SMOTE Oversampling
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
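
Undersampling is the other option mentioned above; a sketch using imbalanced-learn's RandomUnderSampler on the same training split:
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until the classes are balanced
undersampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)
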
7. Handling Inconsistent Data
 Standardize formats, correct typos, and handle inconsistencies in data types or units (a data-type coercion sketch follows the code below).
 Python Code:
# Correct inconsistent data
data['date_column'] = pd.to_datetime(data['date_column'])
data['text_column'] = data['text_column'].str.lower() # Lowercase text

# Handling typos using fuzzy matching (requires the fuzzywuzzy package)
from fuzzywuzzy import process
correct_spellings = ["category1", "category2"]
data['corrected_column'] = data['categorical_column'].apply(lambda x: process.extractOne(x, correct_spellings)[0])
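
For inconsistencies in data types (for example, numbers stored as text), pd.to_numeric with errors='coerce' is a common fix; note that numeric_column below is a hypothetical column name:
# Coerce a text column to numeric; entries that cannot be parsed become NaN
data['numeric_column'] = pd.to_numeric(data['numeric_column'], errors='coerce')

# Strip stray whitespace that often creates near-duplicate categories
data['categorical_column'] = data['categorical_column'].str.strip()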

8. Feature Engineering
 Create new features based on existing ones, or use interaction and polynomial
features.
 Python Code:
from sklearn.preprocessing import PolynomialFeatures

# Creating new features (e.g., interaction terms)
data['new_feature'] = data['feature1'] * data['feature2']

# Polynomial features
poly = PolynomialFeatures(degree=2)
data_poly = poly.fit_transform(data[['feature1', 'feature2']])
9. Removing Irrelevant Features
 Remove features that provide little or no information.
 Python Code:
# Variance Threshold
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)
data_reduced = selector.fit_transform(data)

10. Handling Multicollinearity


 Detect multicollinearity using correlation matrices or VIF and remove highly correlated features (a correlation-matrix sketch follows the VIF code below).
 Python Code:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature (assumes a feature DataFrame named X)
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['Feature'] = X.columns


# Remove highly collinear features based on VIF score ('high_vif_feature' is a placeholder name)
X_reduced = X.drop(columns=['high_vif_feature'])
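
The correlation-matrix approach mentioned above can be sketched as follows; the 0.9 absolute-correlation cut-off is an arbitrary choice:
# Drop one feature from every pair whose absolute correlation exceeds 0.9
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if (upper[column] > 0.9).any()]
X_reduced_corr = X.drop(columns=to_drop)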

11. Text Data Cleaning


 Clean and preprocess text data by tokenizing, removing stopwords, and
normalizing case.
 Python Code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# One-time downloads needed: nltk.download('stopwords') and nltk.download('punkt')
# Remove punctuation and stopwords, and lowercase the text
stop_words = set(stopwords.words('english'))
data['cleaned_text'] = data['text_column'].apply(lambda x: ' '.join([word for word in word_tokenize(x.lower()) if word not in stop_words and word not in string.punctuation]))

12. Date/Time Data Handling


 Extract features from date columns or normalize timestamps to a common time zone (a time-zone sketch follows the code below).
 Python Code:
# Extracting year, month, day from datetime
data['year'] = data['date_column'].dt.year
data['month'] = data['date_column'].dt.month
data['day'] = data['date_column'].dt.day
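
The bullet above also mentions normalizing to a common time zone; a sketch that assumes the timestamps are timezone-naive and were recorded in UTC (check both assumptions against the actual data):
# Localize naive timestamps to UTC, then convert them to a single target time zone
data['date_column'] = data['date_column'].dt.tz_localize('UTC').dt.tz_convert('US/Eastern')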

13. Handling Data Leakage


 Prevent target leakage by splitting the data into training and test sets before any preprocessing is fitted, and by ensuring no future information leaks into the features (a Pipeline sketch follows the code below).
 Python Code:
# Separate data before processing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
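
A common safeguard is to wrap preprocessing and the model in a scikit-learn Pipeline, so that scalers and imputers are fitted on the training data only; a minimal sketch assuming a classification target:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The scaler is fitted on X_train only; the same fitted transform is then applied to X_test
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)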

14. Handling Zero-Variance Features


 Identify features with no variance and remove them from the dataset.
 Python Code:
# Variance Threshold to remove zero variance features
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0)
data_cleaned = selector.fit_transform(data)
15. Addressing Class Imbalance in Regression
 Use techniques such as stratified sampling or weighted loss functions to handle imbalanced data in regression problems (a sample-weight sketch follows the code below).
 Python Code:
from sklearn.utils import class_weight

# Compute balanced per-sample weights from the target values (for a continuous target, bin y first)
class_weights = class_weight.compute_sample_weight(class_weight='balanced', y=y_train)
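
The computed weights can then be passed to a model's fit method; a sketch using a random-forest regressor (the model choice is illustrative):
from sklearn.ensemble import RandomForestRegressor

# The per-sample weights scale each sample's contribution to the training loss
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train, sample_weight=class_weights)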

16. Addressing Imbalanced Data in Classification


 Use oversampling, undersampling, or adjusting decision thresholds to handle
imbalanced classes.
 Python Code:
# Adjust the decision threshold for a fitted classifier (assumes a trained model named classifier)
from sklearn.metrics import precision_recall_curve

y_pred_prob = classifier.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
# precision and recall have one more element than thresholds, so drop the last value before indexing
best_threshold = thresholds[np.argmax(precision[:-1])]
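
The tuned threshold can then replace the default 0.5 cut-off when converting probabilities into class labels:
# Classify as positive only when the predicted probability clears the tuned threshold
y_pred = (y_pred_prob >= best_threshold).astype(int)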

17. Handling Missing Categorical Values


 Impute missing categorical values using the mode, or create a separate category for them (a separate-category sketch follows the code below).
 Python Code:
# Impute missing categorical values with the most frequent category (mode)
imputer = SimpleImputer(strategy='most_frequent')
data[['categorical_column']] = imputer.fit_transform(data[['categorical_column']])
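
The alternative mentioned above is to treat missingness as its own category; the label 'missing' is an arbitrary placeholder (for a Categorical dtype, add it to the categories first):
# Create a separate category for missing values instead of imputing them
data['categorical_column'] = data['categorical_column'].fillna('missing')
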
18. Log Transformation
 Apply logarithmic transformation to reduce skewness in data.
 Python Code:
# Log transformation for skewed features (adding 1 avoids log(0))
data['log_transformed_feature'] = np.log(data['skewed_feature'] + 1)

19. Binning Continuous Variables


 Convert continuous features into discrete intervals or bins for simplification (an equal-frequency variant follows the code below).
 Python Code:
# Binning a continuous feature into five equal-width categories
data['binned_feature'] = pd.cut(data['continuous_feature'], bins=5, labels=['very low', 'low', 'medium', 'high', 'very high'])
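
An equal-frequency variant uses pd.qcut, which puts roughly the same number of observations in each bin (assuming the feature has enough distinct values to form five bins):
# Quantile-based (equal-frequency) binning into five categories
data['quantile_binned_feature'] = pd.qcut(data['continuous_feature'], q=5, labels=['very low', 'low', 'medium', 'high', 'very high'])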

20. Converting Numerical to Categorical


 Convert numerical variables into categorical ones based on specific ranges or
thresholds.
 Python Code:
# Convert numerical age into categories (four bins require four labels)
data['age_group'] = pd.cut(data['age'], bins=[0, 18, 35, 60, 100], labels=['child', 'young adult', 'adult', 'senior'])

Note: The above data cleaning techniques and their corresponding Python code can
help you create a robust preprocessing pipeline, improving the quality of the datasets
before feeding them into machine learning models.
