
Module 3 Notes

Module 3 focuses on data preparation and analysis, emphasizing the importance of cleaning and transforming raw data for accurate results in data science and machine learning. Key techniques include handling missing values, detecting and managing outliers, and applying data transformation methods such as normalization and encoding. The module highlights that proper data preprocessing is essential for building reliable models and gaining insights from data.

Uploaded by nilsa.vp

Module 3: Data Preparation and Analysis

1. Introduction to Data Preparation and Analysis


Data preparation is the process of cleaning and transforming raw data into a
format that is suitable for analysis. This is one of the most crucial steps in any
data science or machine learning project because the quality of data significantly
impacts the quality of the results. The main goal of this module is to teach you
techniques for preprocessing data, handling missing values, addressing outliers,
and transforming data to make it suitable for analysis.

2. Data Preprocessing Techniques


Data preprocessing converts raw data into a format that is suitable for
analysis. It involves several tasks, such as:
• Cleaning the data (removing noise, handling missing values, etc.)
• Transforming the data (normalization, encoding categorical variables, etc.)
• Splitting the dataset into training and testing sets for machine learning.
Steps in Data Preprocessing:
1. Data Cleaning:
o Remove irrelevant or duplicate data.

o Handle missing or incomplete data.

o Detect and remove outliers.

2. Data Transformation:
o Normalize/Standardize data.

o Convert data types (e.g., categorical to numerical).

o Create new features (e.g., feature engineering).

3. Data Reduction:
o Dimensionality reduction (e.g., PCA).

o Feature selection.

4. Splitting the Data:


o Split data into training and testing datasets.
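The four steps above can be sketched end-to-end with pandas and NumPy. The dataset, column names, and thresholds here are hypothetical, chosen only to illustrate the flow; the split is done with a shuffled index rather than a library helper:

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset: one missing value, one duplicate row, one outlier
df = pd.DataFrame({
    "age":    [25.0, 32.0, np.nan, 41.0, 32.0, 500.0],
    "city":   ["NY", "LA", "NY", "SF", "LA", "NY"],
    "target": [0, 1, 0, 1, 1, 0],
})

# 1. Data cleaning: drop duplicates, impute missing values, remove outliers
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())
df = df[df["age"] < 200]                     # crude outlier cutoff for illustration

# 2. Data transformation: min-max scale the numeric column, one-hot encode the categorical one
df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
df = pd.get_dummies(df, columns=["city"])

# 4. Splitting: 80/20 train/test split over a shuffled index
shuffled = df.sample(frac=1, random_state=42)
cut = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]
```

Step 3 (data reduction) is omitted here because this toy dataset has too few features for PCA or feature selection to be meaningful.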

3. Handling Missing Data


Missing data is a common problem in real-world datasets. If not handled
correctly, missing values can lead to inaccurate models or biased results.
Techniques to Handle Missing Data:
1. Removing Missing Data:
o Remove rows with missing values (only if a small number of rows
are affected).
o Drop columns with too many missing values (if they are not crucial).

2. Imputation:
o Mean/Median/Mode Imputation: Replace missing values with the
column's mean or median (for numerical data) or mode (for
categorical data).
o Predictive Imputation: Use other features to predict missing
values. Techniques include regression or K-nearest neighbors (KNN).
o Forward/Backward Fill: In time series data, missing values can be
filled using previous or next values.
3. Using Algorithms that Handle Missing Data: Some machine learning
algorithms, such as decision trees, can handle missing data directly.
4. Multiple Imputation: Multiple imputation creates multiple datasets with
different imputed values and averages the results to deal with the
uncertainty in missing values.
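The removal, simple-imputation, and forward-fill techniques above can be sketched in pandas (the series values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Removal: drop the rows that contain missing values
dropped = s.dropna()

# Mean imputation: replace NaN with the mean of the observed values
mean_filled = s.fillna(s.mean())

# Forward fill (time series): carry the previous observation forward
ffilled = s.ffill()
```

Predictive and multiple imputation need a model over the other features, so they are not shown in this one-column sketch.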

4. Handling Outliers
Outliers are extreme values that deviate significantly from the rest of the data.
They can distort statistical analyses and machine learning models.
Detecting Outliers:
1. Visual Methods:
o Boxplots: Outliers are often shown as points outside the whiskers
of a boxplot.
o Scatter Plots: For multivariate data, scatter plots help to identify
outliers.
2. Statistical Methods:
o Z-score: Outliers can be identified by calculating the Z-score (how
many standard deviations a point lies from the mean). A point with
an absolute Z-score greater than 3 is often considered an outlier.
o Interquartile Range (IQR): Any data point beyond 1.5 times the
IQR above the third quartile or below the first quartile is considered
an outlier.
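Both statistical detection rules can be sketched in NumPy (the data is illustrative, with one planted outlier):

```python
import numpy as np

data = np.array([9.0, 10, 11, 10, 9, 11, 10, 10, 9, 11,
                 10, 9, 11, 10, 10, 9, 11, 10, 10, 110.0])

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR method: flag points beyond 1.5 * IQR outside the first/third quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
```

Note that the Z-score rule needs a reasonably large sample: with only a handful of points, a single extreme value inflates the standard deviation enough that no point can reach |Z| > 3.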
Handling Outliers:
1. Removing Outliers:
o If the outliers are errors or not important for the analysis, they can
be removed.
2. Transforming Data:
o Log Transformation: Apply log transformations to reduce the
impact of outliers.
o Winsorizing: Replacing extreme values with the nearest data point
within a specified range.
3. Capping or Truncation:
o Set a maximum or minimum value to cap outliers, bringing them
closer to the rest of the data.
4. Using Algorithms Robust to Outliers:
o Some algorithms, like decision trees, are more robust to outliers and
can handle them better without the need for removal or
transformation.
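Capping and log transformation can be sketched in NumPy (the percentile bounds are an illustrative choice, not a fixed rule):

```python
import numpy as np

data = np.array([1.0, 2, 3, 4, 5, 100.0])

# Capping / winsorizing: clip values to the 5th-95th percentile range
low, high = np.percentile(data, [5, 95])
capped = np.clip(data, low, high)

# Log transformation: compress large values (log1p handles zeros safely)
logged = np.log1p(data)
```

After clipping, the extreme value 100 is pulled down to the 95th-percentile bound instead of being removed, so no rows are lost.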

5. Data Transformation
Data transformation is necessary to convert data into a suitable format and scale
for analysis. Common transformations include scaling, encoding, and
normalization.
Techniques for Data Transformation:
1. Normalization: Normalization rescales the data into a specific range
(usually between 0 and 1). This is especially useful when features have
different units or scales.

• Min-Max Normalization: X_norm = (X − min(X)) / (max(X) − min(X))

2. Standardization: Standardization converts data into a distribution with a


mean of 0 and a standard deviation of 1.

• Z-score Standardization: Z = (X − μ) / σ

Where:

o μ is the mean
o σ is the standard deviation
3. Log Transformation:
o Apply the natural logarithm to reduce the impact of large values
and make the data more normally distributed.
4. Binning:
o Divide continuous data into bins or intervals. This smooths out
minor observation errors and can make models less sensitive to noise.
5. Encoding Categorical Variables:
o One-Hot Encoding: Create binary columns for each category in
the categorical variable.
o Label Encoding: Assign a unique integer to each category in the
categorical variable.
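The scaling and encoding techniques above can be sketched in pandas (column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "color":  ["red", "blue", "red", "green"],
})

# Min-max normalization: rescale into the [0, 1] range
h = df["height"]
df["height_norm"] = (h - h.min()) / (h.max() - h.min())

# Z-score standardization: mean 0, standard deviation 1
df["height_std"] = (h - h.mean()) / h.std(ddof=0)

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["color"])

# Label encoding: one integer per category (codes follow alphabetical category order)
df["color_label"] = df["color"].astype("category").cat.codes
```

One-hot encoding avoids implying an order between categories, whereas label encoding is compact but can mislead algorithms that treat the integers as ordinal.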

6. Cleaning Data
Data cleaning involves detecting and correcting errors in the dataset. It’s a vital
part of data preprocessing to improve the quality and reliability of the analysis.
Common Cleaning Steps:
1. Removing Duplicates:
o Identify and remove duplicate rows that don’t add new information.

2. Handling Inconsistent Data:


o Standardize data values (e.g., converting all text to lowercase,
correcting typos in categorical values).
3. Addressing Irrelevant Data:
o Remove unnecessary features or columns that don't contribute to
the analysis.
4. Fixing Structural Errors:
o Ensure data is in the correct format (e.g., converting dates to a
standard date format, fixing inconsistent measurement units).
5. Dealing with Noise:
o Noise refers to random errors or fluctuations in the data. Techniques
like smoothing or aggregation can help reduce noise.
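Several of these cleaning steps can be sketched in pandas (the inconsistent values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "new york", "Boston", "Boston"],
    "date": ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-07"],
})

# Handle inconsistent data: standardize text to lowercase, trim whitespace
df["city"] = df["city"].str.lower().str.strip()

# Fix structural errors: parse date strings into a proper datetime type
df["date"] = pd.to_datetime(df["date"])

# Remove duplicates: exact duplicate rows add no new information
df = df.drop_duplicates()
```

Standardizing text before deduplicating matters: "New York" and "new york" only compare equal once both are lowercased.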

7. Summary of Key Points


• Data Preprocessing is essential for transforming raw data into a usable
format for analysis.
• Handling Missing Data involves techniques such as imputation or removal
of rows/columns with missing values.
• Outliers can be detected using statistical methods such as Z-scores and
the IQR, and they can be removed or transformed.
• Data Transformation methods such as normalization, standardization, and
encoding are used to prepare data for machine learning algorithms.
• Data Cleaning is about fixing errors, removing duplicates, and ensuring
consistency in the dataset.
Proper data preprocessing is crucial for building accurate models and extracting
meaningful insights from data.
