Data Preprocessing: Preparing Data for Success
Descriptive Statistics
Mean, median, standard deviation, and quartiles summarize data characteristics.
Data Visualization
Histograms, box plots, and scatter plots reveal data distribution and patterns.
Data Issues
Detect missing values, outliers, and inconsistent entries early.
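The short pandas/matplotlib sketch below touches all three steps on a hypothetical dataset; the file name data.csv and the columns age and income are assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")          # hypothetical input file

# Descriptive statistics: mean, std, quartiles per numeric column
print(df.describe())

# Histogram and box plot reveal distribution shape and outliers
df["age"].plot.hist(bins=30)
plt.show()
df.boxplot(column="income")
plt.show()

# Early checks for data issues: missing values and duplicate rows
print(df.isna().sum())
print(df.duplicated().sum())
```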
Handling Missing Data
Common Causes
Data entry errors and system failures create gaps.
Simple Techniques
Imputation using mean, median, or mode; deletion methods.
Advanced Methods
k-NN and regression imputation handle complex missingness patterns.
Example
Replacing missing ages with the median via sklearn's SimpleImputer is less sensitive to outliers than mean imputation.
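A minimal sketch of median imputation with scikit-learn's SimpleImputer; the tiny DataFrame and its age and income columns are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one gap in each column (placeholder names)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 48_000],
})

# Replace missing entries with each column's median
imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
print(df)
```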
Data Cleaning: Removing Noise and Inconsistencies
Error Correction
Fix typos and wrong formats to enhance data reliability.
Outlier Treatments
Use Z-score, IQR, or winsorizing to handle outliers.
Deduplication
Remove duplicate records to prevent skewed analysis.
Resolving Conflicts
Standardize units and fix inconsistent entries across data.
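One possible way to apply the IQR rule and deduplicate with pandas; the value column and the conventional 1.5 × IQR cut-off are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 300, 12, 12]})

# IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

# Deduplication: keep the first occurrence of each duplicate row
df_clean = df_clean.drop_duplicates()
print(df_clean)
```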
Data Transformation: Scaling and Normalization
Why Transform?
Features on very different scales can dominate distance-based models and destabilize gradient-based training.
Min-Max Scaling
Normalizes data to a range between 0 and 1.
Z-score Standardization
Centers data with mean zero and unit variance.
Tools
Use StandardScaler in scikit-learn for standardization.
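A brief comparison of the two scalers in scikit-learn, applied to a toy array chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling: each feature rescaled to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# Z-score standardization: each feature gets mean 0 and unit variance
print(StandardScaler().fit_transform(X))
```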
Data Encoding: Converting Categorical Data
Need for Encoding
Models require numerical data inputs to process categories.
Encoding Techniques
• One-hot encoding creates binary columns per category
• Label encoding assigns unique integers
• Ordinal encoding suits ordered categories like ratings
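A sketch of all three techniques with pandas and scikit-learn; the color and rating columns and the low/medium/high ordering are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "rating": ["low", "high", "medium"],
})

# One-hot encoding: one binary column per category
print(pd.get_dummies(df["color"], prefix="color"))

# Label encoding: an arbitrary unique integer per category
print(LabelEncoder().fit_transform(df["color"]))

# Ordinal encoding: integers that respect a declared order
ord_enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(ord_enc.fit_transform(df[["rating"]]))
```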
Feature Selection and Engineering
Feature Selection
Choose relevant features to reduce model complexity.
Selection Methods
Filter, wrapper, and embedded approaches improve accuracy, as sketched below.
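As one concrete example of a filter method, the sketch below scores features with a univariate ANOVA F-test via scikit-learn's SelectKBest; the synthetic dataset and the choice of k are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Keep the 4 features with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the kept features
```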
Best Practices and Tools
Documentation
Record all preprocessing steps for transparency and reproducibility.
Version Control
Track changes using version control systems like Git.
Validation
Rigorously test preprocessing pipelines to ensure data quality.
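A minimal validation sketch, assuming a two-step scikit-learn pipeline (median imputation followed by standardization); the specific assertions are illustrative checks, not a prescribed test suite.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [3.0, 5.0]])
X_out = pipe.fit_transform(X)

# Validate: no missing values remain and each column is centered at zero
assert not np.isnan(X_out).any()
assert np.allclose(X_out.mean(axis=0), 0.0, atol=1e-8)
print("pipeline checks passed")
```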