3.Data Preprocessing

Uploaded by

aqsa.afzal
Copyright
© All Rights Reserved

Data Preprocessing

By Aqsa Afzal
Data Pre-Processing
• Data preprocessing involves cleaning and transforming raw data to
prepare it for machine learning algorithms.
• Essential for ensuring data quality and improving model performance.
Steps of Data Preprocessing
Data Collection
• Gathering data from various sources such as databases, files, APIs, or web
scraping.
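As a minimal sketch of collecting data from several sources, the example below loads a CSV file, a database table, and an API-style JSON payload into pandas DataFrames. All table names, column names, and values are made up for illustration; the CSV is an in-memory string, the database is in-memory SQLite, and the JSON is hard-coded rather than fetched over a network.

```python
import io
import json
import sqlite3

import pandas as pd

# Source 1: a CSV file (an in-memory string stands in for a file on disk).
csv_text = "id,age,city\n1,34,Lahore\n2,29,Karachi\n"
df_file = pd.read_csv(io.StringIO(csv_text))

# Source 2: a database (in-memory SQLite, so no server is needed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", [(1, 120.5), (2, 80.0)])
df_db = pd.read_sql_query("SELECT * FROM purchases", conn)
conn.close()

# Source 3: JSON as an API might return it (hard-coded here, no network call).
api_payload = json.loads('[{"id": 1, "segment": "A"}, {"id": 2, "segment": "B"}]')
df_api = pd.DataFrame(api_payload)

print(df_file.shape, df_db.shape, df_api.shape)
```

In a real project, `pd.read_csv` would take a file path and the JSON would come from an HTTP client; the DataFrame handling stays the same.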
Data Wrangling
• Cleaning: Handling missing values, removing duplicates, correcting
inconsistencies.
• Transformation: Restructuring, converting data types, standardizing units.
• Enrichment: Adding derived features, merging datasets.
• Validation: Verifying data quality, checking for outliers.
• Aggregation: Grouping records and summarizing them (e.g., counts, sums, averages).
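The wrangling steps above can be sketched with pandas as follows. The data, column names, the median fill, and the 50 to 250 cm plausibility range are all assumptions chosen for illustration, not a prescribed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing height,
# mixed height units, and weights stored as strings.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "height": [170.0, 1.65, 1.65, np.nan, 180.0],
    "unit": ["cm", "m", "m", "cm", "cm"],
    "weight_kg": ["70", "60", "60", "82", "75"],
})

# Cleaning: remove exact duplicate rows.
df = raw.drop_duplicates().reset_index(drop=True)

# Transformation: convert data types and standardize units to centimetres.
df["weight_kg"] = df["weight_kg"].astype(float)
df["height_cm"] = np.where(df["unit"] == "m", df["height"] * 100, df["height"])
df = df.drop(columns=["height", "unit"])

# Cleaning: fill the missing height with the column median.
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())

# Enrichment: add a derived feature and merge in a second (hypothetical) table.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
groups = pd.DataFrame({"id": [1, 2, 3, 4], "group": ["a", "a", "b", "b"]})
df = df.merge(groups, on="id", how="left")

# Validation: no missing values remain, and heights fall in a plausible range.
assert df.isna().sum().sum() == 0
df["height_ok"] = df["height_cm"].between(50, 250)

# Aggregation: summarize weight per group.
summary = df.groupby("group")["weight_kg"].mean()
print(summary)
```

Whether to fill, drop, or flag missing values depends on the dataset; the median is used here only because it is robust to extreme values.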
Data Visualization
• Histograms: Distribution of numerical data.
• Scatter plots: Relationship between two numerical variables.
• Box plots: Median, quartiles, and outliers of numerical data.
• Bar charts: Comparison of values across categories.
• Heatmaps: Correlation matrix visualization.
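A rough sketch of the five chart types, drawn with Matplotlib on synthetic data. The columns `x`, `y`, and `group` are invented for the example; Seaborn offers higher-level equivalents of these plots, but plain Matplotlib is used here to keep dependencies small.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: two correlated numerical columns and one categorical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(50, 10, 200),
    "group": rng.choice(["A", "B", "C"], 200),
})
df["y"] = 2 * df["x"] + rng.normal(0, 15, 200)

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

# Histogram: distribution of one numerical column.
axes[0, 0].hist(df["x"], bins=20)
axes[0, 0].set_title("Histogram of x")

# Scatter plot: relationship between two numerical columns.
axes[0, 1].scatter(df["x"], df["y"], s=10)
axes[0, 1].set_title("Scatter: x vs y")

# Box plot: spread of y within each group.
labels = ["A", "B", "C"]
axes[0, 2].boxplot([df.loc[df["group"] == g, "y"].to_numpy() for g in labels])
axes[0, 2].set_xticklabels(labels)
axes[0, 2].set_title("Box plot of y by group")

# Bar chart: number of rows per category.
counts = df["group"].value_counts().sort_index()
axes[1, 0].bar(counts.index, counts.values)
axes[1, 0].set_title("Rows per group")

# Heatmap: correlation matrix of the numerical columns.
corr = df[["x", "y"]].corr()
im = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_xticklabels(corr.columns)
axes[1, 1].set_yticks([0, 1])
axes[1, 1].set_yticklabels(corr.index)
axes[1, 1].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[1, 1])

axes[1, 2].axis("off")  # unused panel
fig.tight_layout()

# Write the figure to an in-memory buffer; pass a filename instead to save to disk.
buf = io.BytesIO()
fig.savefig(buf, format="png")
```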
Data Reduction
• Dimensionality reduction techniques (e.g., PCA) using libraries like Scikit-learn
• Feature selection with Scikit-learn or other relevant libraries
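Both approaches can be sketched with Scikit-learn. PCA and `SelectKBest` with an ANOVA F-test are picked here as common examples, and the Iris dataset and the choice of two output features are assumptions made only for the demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # 150 samples, 4 numerical features

# Dimensionality reduction: project the scaled features onto 2 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: keep the 2 original features with the highest F-scores.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("PCA output shape:", X_pca.shape)
print("Variance explained by 2 components:", pca.explained_variance_ratio_.sum())
print("Indices of selected features:", selector.get_support(indices=True))
```

The difference is that PCA builds new features as combinations of the originals, while feature selection keeps a subset of the original columns unchanged.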
Data Augmentation
• Artificially increasing data diversity by applying transformations such
as rotation, translation, flipping, etc.
• Improves model generalization and robustness, especially in
computer vision tasks.
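The transformations named above can be illustrated with NumPy on a tiny array standing in for an image. The `augment` helper is hypothetical and written only for this sketch; deep learning pipelines would more typically apply random versions of these operations, for example with Keras preprocessing layers such as `RandomFlip` and `RandomRotation`.

```python
import numpy as np

# A tiny 4x4 grayscale "image" standing in for real image data.
image = np.arange(16, dtype=np.float32).reshape(4, 4)


def augment(img):
    """Return simple augmented variants of a 2-D image array (illustrative helper)."""
    return {
        "flip_horizontal": np.fliplr(img),
        "flip_vertical": np.flipud(img),
        "rotate_90": np.rot90(img),  # counter-clockwise rotation
        # Shift one pixel to the right; np.roll wraps pixels around the edge,
        # whereas real pipelines usually pad instead.
        "translate_right_1": np.roll(img, shift=1, axis=1),
    }


variants = augment(image)
for name, arr in variants.items():
    print(name, arr.shape)
```

Each variant keeps the original label, so one labelled image yields several training examples.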
Libraries for Data Preprocessing, Visualization, and Augmentation
• Pandas: Data cleaning, manipulation, and basic feature engineering.
• Matplotlib and Seaborn: Python libraries for data visualization.
• Scikit-learn: Provides tools for data preprocessing including feature
scaling, encoding, and dimensionality reduction.
• TensorFlow or Keras: Deep learning libraries that include utilities for
data augmentation.
• NumPy: Used for numerical operations and handling missing values.
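To show how these libraries fit together, here is a small sketch that uses NumPy to represent missing values, Pandas to hold the table, and Scikit-learn for imputation, feature scaling, and encoding. The column names and values are invented, and mean imputation is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical table with missing numerical values (np.nan) and one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [30000.0, 52000.0, np.nan, 41000.0],
    "city": ["X", "Y", "X", "Z"],
})

# Numerical columns: fill missing values with the mean, then scale to zero mean, unit variance.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical column: one-hot encode. sparse_threshold=0 forces a dense array as output.
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,
)

X = preprocess.fit_transform(df)
print(X.shape)  # 2 scaled numerical columns + 3 one-hot columns
```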
Task 2

DATA PRE-PROCESSING
Due by next
