
Data Cleaning Approaches in Machine Learning Algorithms

1. Handling Missing Data

 Identify missing values.
 Impute or remove missing data using appropriate techniques (a removal sketch follows the code below).
 Python Code:
import pandas as pd
from sklearn.impute import SimpleImputer

# Identify missing data
missing_values = data.isnull().sum()

# Mean/Median/Mode Imputation
imputer = SimpleImputer(strategy='mean')  # Can change to 'median' or 'most_frequent'
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Forward/Backward Fill (fillna(method=...) is deprecated in recent pandas)
data_filled = data.ffill()  # Use data.bfill() for backward fill
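When a row or column is mostly empty, removal can beat imputation. A minimal sketch; the 50% cutoff is an arbitrary illustration, not a rule:

# Drop rows that contain any missing value
data_dropped_rows = data.dropna()

# Drop columns with fewer than 50% non-missing values
data_dropped_cols = data.dropna(axis=1, thresh=int(0.5 * len(data)))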

2. Handling Outliers
 Detect outliers using statistical methods or visual tools.
 Handle outliers by capping, transforming, or removing them (a capping sketch follows the code below).
 Python Code:
import numpy as np

# Z-score Method for Outlier Detection
z_scores = (data - data.mean()) / data.std()
data_no_outliers = data[(np.abs(z_scores) < 3).all(axis=1)]  # Keep only rows with |z| < 3 in every column

# IQR Method for Outlier Detection
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data_no_outliers = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
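Capping (winsorizing) keeps every row but clamps extreme values to the IQR fences; a minimal sketch reusing Q1, Q3, and IQR from above:

# Clamp values to the IQR fences instead of dropping rows
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
data_capped = data.clip(lower=lower, upper=upper, axis=1)  # axis=1 aligns the bounds with columns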

3. Removing Duplicates
 Identify duplicate records in the dataset.
 Remove duplicates while retaining necessary unique entries.
 Python Code:
# Identify and remove duplicates
data_no_duplicates = data.drop_duplicates()
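When only certain columns define a record's identity, deduplicate on that subset; 'id_column' below is a hypothetical key column:

# Keep the first occurrence per key; other columns are treated as payload
data_unique = data.drop_duplicates(subset=['id_column'], keep='first')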

4. Normalizing and Scaling

 Normalize or scale features for algorithms sensitive to different feature scales.
 Python Code:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(data), columns=data.columns)

5. Encoding Categorical Variables
 Convert categorical variables into numerical values using encoding techniques like One-Hot Encoding or Label Encoding.
 Python Code:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label Encoding (best suited to ordinal categories or target labels)
label_encoder = LabelEncoder()
data['encoded_column'] = label_encoder.fit_transform(data['categorical_column'])

# One-Hot Encoding via pandas
data_one_hot = pd.get_dummies(data, columns=['categorical_column'], drop_first=True)
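The OneHotEncoder imported above does the same job and fits into sklearn pipelines; a minimal sketch (sparse_output requires scikit-learn >= 1.2; older versions use sparse=False):

# One-Hot Encoding via scikit-learn
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(data[['categorical_column']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['categorical_column']))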

6. Dealing with Imbalanced Data

 Apply oversampling or undersampling techniques to balance class distributions (an undersampling sketch follows the SMOTE code).
 Python Code:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split the dataset first (stratify preserves the class ratio in both splits)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE Oversampling
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
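Undersampling is the mirror-image alternative: instead of synthesizing minority samples, it drops majority samples. A minimal sketch with imbalanced-learn:

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until classes are balanced
undersampler = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)
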
7. Handling Inconsistent Data
 Standardize formats, correct typos, and handle inconsistencies in data types or units (a unit-conversion sketch follows the code).
 Python Code:
# Correct inconsistent data
data['date_column'] = pd.to_datetime(data['date_column'])
data['text_column'] = data['text_column'].str.lower()  # Lowercase text

# Handling typos using fuzzy matching
from fuzzywuzzy import process  # the maintained fork is "thefuzz"; same API

correct_spellings = ["category1", "category2"]
data['corrected_column'] = data['categorical_column'].apply(
    lambda x: process.extractOne(x, correct_spellings)[0])
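Mixed units are another frequent inconsistency. A hedged sketch assuming hypothetical 'weight' and 'unit' columns, where some weights were recorded in grams:

# Convert gram entries to kilograms so the whole column shares one unit
mask = data['unit'] == 'g'
data.loc[mask, 'weight'] = data.loc[mask, 'weight'] / 1000
data.loc[mask, 'unit'] = 'kg'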

8. Feature Engineering
 Create new features based on existing ones, or use interaction and polynomial features.
 Python Code:
from sklearn.preprocessing import PolynomialFeatures

# Creating new features (e.g., interaction terms)
data['new_feature'] = data['feature1'] * data['feature2']

# Polynomial features (degree=2 adds squares and pairwise products)
poly = PolynomialFeatures(degree=2)
data_poly = poly.fit_transform(data[['feature1', 'feature2']])

9. Removing Irrelevant Features
 Remove features that provide little or no information.
 Python Code:
# Variance Threshold
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)
data_reduced = selector.fit_transform(data)

10. Handling Multicollinearity

 Detect multicollinearity using correlation matrices or VIF and remove highly correlated features (a correlation-matrix sketch follows the code).
 Python Code:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
vif = pd.DataFrame()
vif['Feature'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Remove features with high VIF scores (values above ~10 are a common red flag)
X_reduced = X.drop(columns=['high_vif_feature'])  # 'high_vif_feature' is a placeholder
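The correlation-matrix route mentioned above can automate the choice; a minimal sketch dropping one feature from every pair whose absolute correlation exceeds 0.9 (an arbitrary cutoff):

# Keep the upper triangle so each pair is inspected once
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_uncorrelated = X.drop(columns=to_drop)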

11. Text Data Cleaning

 Clean and preprocess text data by tokenizing, removing stopwords, and normalizing case.
 Python Code:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Requires one-time downloads: nltk.download('stopwords') and nltk.download('punkt')

# Remove punctuation and stopwords, and lowercase the text
stop_words = set(stopwords.words('english'))
data['cleaned_text'] = data['text_column'].apply(
    lambda x: ' '.join(word for word in word_tokenize(x.lower())
                       if word not in stop_words and word not in string.punctuation))

12. Date/Time Data Handling

 Extract features from date columns or normalize to a common time zone (a time-zone sketch follows the code).
 Python Code:
# Extracting year, month, day from datetime
data['year'] = data['date_column'].dt.year
data['month'] = data['date_column'].dt.month
data['day'] = data['date_column'].dt.day
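For the time-zone normalization mentioned above, a minimal sketch assuming the raw timestamps are naive and recorded in UTC (adjust both zone names to your data):

# Attach a zone to naive timestamps, then convert to a common zone
data['date_column'] = (data['date_column']
                       .dt.tz_localize('UTC')
                       .dt.tz_convert('US/Eastern'))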

13. Handling Data Leakage

 Prevent target leakage by splitting into training and test sets early and ensuring no future information reaches the model during training.
 Python Code:
# Separate the data before any fitting, imputation, or scaling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
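A frequent leakage source is fitting preprocessing on the full dataset. Fit transformers on the training split only, then apply them unchanged to the test split:

from sklearn.preprocessing import StandardScaler

# Fit on training data only; the test set stays unseen
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform, never fit_transform, on test data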

14. Handling Zero-Variance Features

 Identify features with no variance and remove them from the dataset.
 Python Code:
# Variance Threshold to remove zero-variance features
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0)
data_cleaned = selector.fit_transform(data)

15. Addressing Imbalanced Targets in Regression
 Use techniques like stratified sampling or sample weighting to give rare target values more influence in regression problems (a fitting sketch follows the code).
 Python Code:
from sklearn.utils import class_weight

# Balanced sample weights (treats each distinct target value as a group to reweight)
class_weights = class_weight.compute_sample_weight(class_weight='balanced', y=y_train)
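The weights can then be passed to any estimator whose fit method accepts sample_weight; a minimal sketch with a random forest regressor:

from sklearn.ensemble import RandomForestRegressor

# Rare target values carry larger weights and so more influence on the fit
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train, sample_weight=class_weights)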

16. Addressing Imbalanced Data in Classification

 Use oversampling, undersampling, or adjusting the decision threshold to handle imbalanced classes.
 Python Code:
# Adjust the decision threshold for an already fitted classifier
from sklearn.metrics import precision_recall_curve

y_pred_prob = classifier.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# Pick the threshold maximizing F1; maximizing precision alone tends to pick a
# degenerate threshold that labels almost nothing positive
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1_scores)]
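Cost-sensitive training is a further alternative to resampling; many sklearn classifiers accept class_weight='balanced', which reweights classes inversely to their frequency:

from sklearn.linear_model import LogisticRegression

# Misclassifying the minority class costs more, without changing the data
classifier = LogisticRegression(class_weight='balanced', max_iter=1000)
classifier.fit(X_train, y_train)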

17. Handling Missing Categorical Values

 Impute missing categorical values using the mode, or create a separate category (a sketch follows the code).
 Python Code:
# Impute missing categorical values with the most frequent category (mode)
imputer = SimpleImputer(strategy='most_frequent')
data['categorical_column'] = imputer.fit_transform(data[['categorical_column']]).ravel()
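The separate-category alternative keeps the missingness visible to the model instead of guessing a value:

# Treat missingness itself as a category
data['categorical_column'] = data['categorical_column'].fillna('Missing')
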
18. Log Transformation
 Apply a logarithmic transformation to reduce skewness in data.
 Python Code:
# Log transformation for skewed, non-negative features
data['log_transformed_feature'] = np.log1p(data['skewed_feature'])  # log(1 + x) avoids log(0)

19. Binning Continuous Variables

 Convert continuous features into discrete intervals or bins for simplification.
 Python Code:
# Binning a continuous feature into discrete categories
data['binned_feature'] = pd.cut(data['continuous_feature'], bins=5,
                                labels=['very low', 'low', 'medium', 'high', 'very high'])

20. Converting Numerical to Categorical

 Convert numerical variables into categorical ones based on specific ranges or thresholds.
 Python Code:
# Convert numerical age into categories (five bin edges define four bins, so four labels)
data['age_group'] = pd.cut(data['age'], bins=[0, 18, 35, 60, 100],
                           labels=['child', 'young adult', 'adult', 'senior'])

Note: The data cleaning techniques above and their corresponding Python code can help you build a robust preprocessing pipeline, improving dataset quality before the data is fed into machine learning models.
