Lesson 3. Data Preparation and Structuring 1: Data Cleaning

TOPICS OF STUDY

Data Collection
● Reliable sites for data collection
● Merging data sources
● Data crawling
● Web Scraping
● Creating a dataset

Data discovery
● Understanding problem domain (domain knowledge)
● Understanding data patterns
● Identifying and handling the missing values
● Feature engineering

Data preparation and structuring
● Identifying and handling the missing values
● Feature engineering
● Encoding data
● Feature scaling
● Data cleaning
● Data enriching or augmentation
● Data validation
● Splitting dataset

Exploratory Data Analysis
● Types of EDA
  o Univariate non-graphical
  o Univariate graphical
  o Multivariate non-graphical
  o Multivariate graphical
● Packages for EDA and Munging
  o (NumPy, SciPy, Pandas, Matplotlib, Google Data Prep, Tabula, Data Wrangler)
● Variable relationship analysis
  o Box plots, Histograms, Scatter plots, Bar Charts, Pie Charts, Line Charts

Choosing right data model
● Types of data models

Data Publishing
● Report writing
● Data Visualization
Overview

Data preparation and structuring 1: Data cleaning
Data Preprocessing
● Data Preprocessing refers to the steps taken to clean and prepare raw data before it is used in a machine learning model or for any kind of analysis.
● The goal is to improve the quality of the data, eliminate inconsistencies, and make it more suitable for building accurate and reliable models.
● Data cleaning is a very important stage in data science.
Data cleaning
● Data cleaning is the process of identifying and correcting (or removing) errors and inconsistencies in data to improve its quality and reliability for analysis.
● It is an essential step in preparing raw data for further analysis or machine learning tasks.
● The steps below aim to ensure that data is clean, consistent, and suitable for analysis.
● Each step is integral to preparing a dataset for machine learning, statistical analysis, or business reporting, ensuring that the insights drawn are based on high-quality, reliable data.
Data Cleaning
● Steps aim to ensure that data is clean, consistent, and suitable for analysis.
  − Handling missing values: Filling in missing data points (imputation), or removing rows/columns with missing data.
  − Removing duplicates: Identifying and eliminating duplicate records to avoid redundancy and bias in the model.
  − Outlier detection and handling: Identifying and handling extreme values that may skew the analysis.
  − Standardizing data formats: Ensuring consistency in formats, such as dates or categorical variables.
  − Error correction: Fixing incorrect data entries, such as typos or inconsistencies in labelling.
Data cleaning - Handling Missing Data
● Problem: Missing data is a common issue in real-world datasets, and how we handle it can significantly impact the quality of analysis.
● Steps Involved:
  • Identify Missing Data: Use functions like isnull() (Pandas) or is.na() (R) to identify missing values in a dataset.
  • Check both individual columns and rows for any gaps.
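A minimal pandas sketch of this step; the small survey-style DataFrame and its column names ("age", "income") are invented for illustration:

```python
import pandas as pd

# Hypothetical survey data with a few gaps
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "income": [52000, 61000, None, 45000],
})

print(df.isnull())              # element-wise True/False mask of missing values
print(df.isnull().sum())        # number of missing values per column
print(df.isnull().any(axis=1))  # rows that contain at least one gap
```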
Methods to Handle Missing Values
● Remove Missing Data: Remove rows or columns with missing data (useful when the missing proportion is small).
  – If the missing data is minimal and not crucial, you can delete the rows (dropna(axis=0) in pandas) or columns (dropna(axis=1)) containing missing values.
● What it means: This method involves deleting entire rows or columns that have missing data.
● When to use: It's useful when the missing data is very small or insignificant, so it won't affect the overall results much.
Remove Rows or Columns with Missing Data
● Example: If only a few rows have missing values in a large dataset (e.g., 2 or 3 out of 1000 rows), you can safely remove those rows without losing important information.
● Advantages: Quick and simple to implement.
● Disadvantages: If too much data is missing, removing rows or columns could cause you to lose valuable information.
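A short sketch of row and column removal with dropna(), reusing the same invented survey-style data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "income": [52000, 61000, None, 45000],
})

rows_kept = df.dropna(axis=0)   # drop any row that has a missing value
cols_kept = df.dropna(axis=1)   # drop any column that has a missing value
print(rows_kept)
print(cols_kept)
```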
Imputation (Filling Missing Values)
● Imputation: Instead of removing, fill in missing values with a substitute value.
● There are several ways to do this:
  – Fill in missing values using statistical methods like mean, median, or mode.
  – Forward/Backward Filling (for time series, e.g. weekly maize prices).
  – Using flags (True/False to indicate whether a value was missing).
  – More sophisticated imputation techniques (e.g., using regression or machine learning algorithms).
Mean, Median, or Mode Imputation
● What it means: You replace the missing values with the average (mean), the middle value (median), or the most frequent value (mode) of that column.
● Mode Imputation: For categorical data, replace missing values with the most frequent value (mode).
● When to use: This is useful when the missing data is random (MCAR or MAR), and the data you're filling in doesn't have extreme values or outliers.
Mean, Median, or Mode Imputation
• Example: In a customer survey, if the "age" column has missing values, you could fill those with the average age of the respondents.
• Advantages: Simple, quick, and easy to implement.
• Disadvantages: It can distort the data distribution if the missing values are not randomly distributed (e.g., if there are outliers).
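A hedged pandas sketch of mean and mode imputation; the toy "age" and "gender" columns are assumptions made only for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, None, 35, 29, None],          # numeric, as in the survey example
    "gender": ["F", "M", None, "M", "M"],     # categorical
})

df["age"] = df["age"].fillna(df["age"].mean())               # mean imputation (use .median() for skewed data)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])   # mode imputation for categorical data
print(df)
```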
Forward/Backward Filling
● In time-series data, missing values can be replaced by the preceding or succeeding values.
● Example: Consider the weekly maize prices in the table below.
Forward/Backward Filling
● Forward Fill: Missing values (NaN) are filled with the last known value (e.g., Week 3 and 4 take the value from Week 2, which is 210).
● Backward Fill: Missing values are filled with the next known value (e.g., Week 3 and 4 take the value from Week 5, which is 220).

Week | Original Price | Forward Fill | Backward Fill
  1  |      200       |     200      |      200
  2  |      210       |     210      |      210
  3  |      NaN       |     210      |      220
  4  |      NaN       |     210      |      220
  5  |      220       |     220      |      220
  6  |      NaN       |     220      |      230
  7  |      230       |     230      |      230
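The same table expressed as a small pandas sketch using ffill() and bfill(); the weekly index is hypothetical:

```python
import pandas as pd

# Weekly maize prices from the table above (NaN marks missing weeks)
prices = pd.Series([200, 210, None, None, 220, None, 230],
                   index=range(1, 8), name="price")

forward = prices.ffill()    # Weeks 3 and 4 take 210; Week 6 takes 220
backward = prices.bfill()   # Weeks 3 and 4 take 220; Week 6 takes 230
print(pd.DataFrame({"original": prices,
                    "forward_fill": forward,
                    "backward_fill": backward}))
```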
Flags
● For categorical variables, create flags to indicate if the value was missing.
● Add a separate binary (True/False) column that marks whether a value was missing, so you retain the information about its absence.
● Flagging: Creating a new variable to indicate missingness for further analysis.
Predictive Models (Regression or k-Nearest Neighbours)
● What it means: You use machine learning models to predict what the missing values should be based on other data.
● Regression: You use relationships between variables (e.g., predicting income based on age and education level).
● k-Nearest Neighbours (k-NN): You fill in missing values by looking at the closest "neighbours" (similar data points) to make a prediction.
Predictive Models (Regression or k-Nearest Neighbours)
● When to use: This is helpful when missing values are more complicated and might depend on other variables in the dataset.
● Example: If the "income" field is missing in a dataset, you might predict the missing income value based on other information like "age" and "education level" using a regression model.
● Advantages: More accurate, especially when the missing data is not randomly missing.
● Disadvantages: More complex and requires more data or computational resources.
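A sketch of model-based imputation using scikit-learn's KNNImputer, assuming scikit-learn is installed; the columns and values are invented for illustration, and a regression-based imputer could be used the same way:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data where missing "income" values relate to age and education
df = pd.DataFrame({
    "age": [23, 35, 45, 52, 29],
    "education_years": [12, 16, 18, 16, 14],
    "income": [28000, None, 72000, 80000, None],
})

imputer = KNNImputer(n_neighbors=2)        # fill each gap from the 2 most similar rows
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```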
Flagging (Creating a Missingness Indicator)
● What it means: Instead of filling in the missing data, you create a new variable (flag) that indicates whether the data is missing or not.
● When to use: This is helpful when you want to keep track of missing values and use them for further analysis (e.g., analysing whether missing data correlates with certain patterns or outcomes).
● Example: In a customer survey dataset, if some "age" entries are missing, you could add a new column called "Age_Missing" that has a 1 for missing data and a 0 for available data.
● Advantages: Helps preserve the original data and adds valuable information about the missingness itself.
● Disadvantages: It doesn't solve the problem of missing data, but it can help in analysing how missingness affects your model.
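A minimal sketch of adding such a flag in pandas, using the "Age_Missing" naming from the example above:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, None, 27, None]})

# 1 where "age" is missing, 0 where it is present; the original column is kept
df["Age_Missing"] = df["age"].isnull().astype(int)
print(df)
```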
Data cleaning - Removing Duplicates
● Problem: Duplicates can distort analysis and introduce bias or redundancy into datasets.
● Steps Involved:
  • Identify/Detect Duplicates: Identify any rows in the dataset that are identical. Duplicates occur when identical rows or entries appear more than once in the dataset.
  • Use .duplicated() in pandas to identify duplicate rows.
  • Visualize data to detect repeated entries or use summary statistics to spot anomalies.
Removing Duplicates
• Remove Duplicates: Eliminate the duplicate entries using .drop_duplicates() in pandas or similar functionality like DISTINCT in SQL to remove duplicate rows.
• Consider whether duplicates are legitimate and should be kept (e.g., repeated transactions).
• Decide whether you want to remove rows with identical values across all columns or only across specific columns.
• Check if keeping duplicates makes sense for the analysis, for example, in the case of repeated transactions.
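A short pandas sketch of detecting and removing duplicates; the customer/transaction columns are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "city": ["Blantyre", "Blantyre", "Lilongwe", "Zomba", "Zomba"],
    "amount": [100, 100, 250, 80, 95],
})

print(df.duplicated())                # True for rows that exactly repeat an earlier row
deduped_all = df.drop_duplicates()    # remove rows identical across all columns
deduped_id = df.drop_duplicates(subset=["customer_id"], keep="first")  # dedupe on chosen columns only
print(deduped_all)
print(deduped_id)
```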
Data cleaning - Correcting Data Types
• Check Data Types: Ensure that each column has the correct data type (e.g., numerical, categorical, date, string).
• Convert Data Types: If necessary, convert columns to appropriate types using functions like .astype() or pandas' pd.to_datetime() for dates.
• Handling Misclassified Data: For example, converting numerical values stored as strings into actual numeric types.
Correcting Data Types
● Problem: Incorrect data types can lead to errors during analysis, especially when mathematical operations or string manipulations are involved.
● Steps Involved:
  • Identify Incorrect Data Types: Use dtypes in pandas to check the data type of each column.
  • Numerical columns (e.g., age, salary) may be incorrectly stored as strings, and categorical data (e.g., country, gender) might be misclassified as numerical.
Correcting Data Types
• Convert Data Types:
  • Numeric Conversion: Convert string representations of numbers into actual numeric types using astype().
  • Date/Time Conversion: Use pd.to_datetime() to convert columns containing dates stored as strings into proper datetime objects.
  • Categorical Data Conversion: Convert categorical variables to type category for memory optimization and speed improvements.
  • Boolean Conversion: If a column contains binary values like "yes"/"no" or 1/0, ensure it is properly converted to bool.
Data cleaning - Handling Outliers
● Outliers are data points that significantly differ from the rest of the data, and they can distort statistical analyses and model performance. Handling outliers is crucial in feature engineering because they may lead to biased results or reduced model accuracy.
● Techniques for handling outliers include:
  • Identify Outliers: Use statistical methods (e.g., IQR, Z-scores) or visualization techniques (boxplots, histograms) to identify extreme values.
Data Cleaning - Handling Outliers
• Handle Outliers: Depending on the context:
  • Removing outliers: If the outliers are errors, represent noise, or are irrelevant to the analysis, remove them from the dataset entirely; extreme outliers that do not fit the distribution can also be dropped.
  • Transformation: Apply mathematical transformations like logarithms or square roots to reduce the impact of extreme values, or set a threshold value to cap the outliers.
Data Cleaning - Handling Outliers
• Capping or winsorization: Cap outliers by setting them to a maximum or minimum threshold; replace outlier values with a predefined threshold value (e.g., the 95th percentile) to minimize their effect.
• Imputation: In cases where outliers represent missing or erroneous data, imputing these values based on the median or other relevant statistics might be appropriate.
• Robust Models: Use models that are less sensitive to outliers, such as tree-based methods (e.g., decision trees, random forests), or algorithms that incorporate robust scaling methods.
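A minimal sketch of IQR-based detection, capping (winsorization), and a log transform in pandas/NumPy; the sales figures are made up, and the 1.5 × IQR fence is one common convention rather than a fixed rule:

```python
import numpy as np
import pandas as pd

sales = pd.Series([120, 135, 128, 140, 980, 132, 125])   # 980 is an extreme value

q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sales[(sales < lower) | (sales > upper)]   # identify extreme values via the IQR rule
capped = sales.clip(lower=lower, upper=upper)         # winsorize: cap values at the IQR fences
logged = np.log1p(sales)                              # log transform to shrink the impact of extremes
print(outliers)
print(capped.tolist())
```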
Correcting Inconsistent Data
• Inconsistent Categories: Standardise categorical columns that have inconsistent naming conventions or values. For example, "Yes" and "yes" should be the same.
• Spelling and Formatting Errors: Check for typographical errors in categorical data and correct them (e.g., "Blantyre" vs. "Blantyer").
• Inconsistent Date Formats: Ensure all dates are in the same format.
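A short pandas sketch of standardising an inconsistent categorical column, using the "Blantyre"/"Blantyer" example; the exact cleaning chain is just one reasonable approach:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Blantyre", " blantyre", "Blantyer", "LILONGWE"]})

df["city"] = (
    df["city"]
    .str.strip()                           # remove stray whitespace
    .str.lower()                           # unify case so "Yes"/"yes"-style variants collapse
    .replace({"blantyer": "blantyre"})     # correct known misspellings
    .str.title()                           # present categories in one consistent form
)
print(df["city"].unique())
```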
Handling Incorrect Data
• Validation: Identify and fix incorrect data entries that don't make sense or violate domain knowledge (e.g., negative ages, unrealistic sales figures).
• Cross-checking: Use domain-specific rules to flag potential errors (e.g., a person's birthdate being in the future).
• Automated Checks: Implement automated rule checks to catch anomalies.
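A minimal sketch of such automated rule checks in pandas; the rules and sample rows are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 27],
    "birthdate": pd.to_datetime(["1990-01-01", "2030-05-09", "1997-06-12"]),
})

bad_age = df["age"] < 0                                  # negative ages violate domain knowledge
future_birth = df["birthdate"] > pd.Timestamp.today()    # a birthdate cannot be in the future
flagged = df[bad_age | future_birth]                     # rows to review or correct
print(flagged)
```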
Standardizing Data
● Problem: Inconsistent formats, units, or naming conventions in the dataset can make it difficult to compare and analyze the data.
● Steps Involved:
  • Unit Standardization (Normalize Units): Ensure that all measurements use consistent units (e.g., converting height from feet to meters, all currency values to the same currency, or weight units like pounds to kilograms).
Standardizing Data
● Consistent Categorical Values: Ensure that categorical values are consistent (e.g., "Male" and "M" should be treated as the same category).
● Text Standardization: Convert text data to a consistent case (e.g., lower case or upper case) using str.lower() or str.upper().
● Standardize abbreviations, capitalization, and formatting in text data, and correct misspellings (e.g., converting "N.Y." to "New York"); also normalize date formats (e.g., YYYY-MM-DD).
Standardizing Data
• Date Standardization: Convert all dates to the same format, e.g., YYYY-MM-DD or DD/MM/YYYY, depending on the region or system used.
• Categorical Standardization: Standardize categories (e.g., "Male" and "M" should be the same category for gender).
Data Transformation
● Problem: Raw data may need to be transformed to ensure consistency or improve model performance.
● Steps Involved:
  • Scaling and Normalization: Standardise or normalise numerical data (e.g., using Min-Max scaling or Z-score normalization/standardization) to bring different variables to the same scale.
  • Log Transformation: Apply a logarithmic transformation to skewed data distributions to normalize the distribution.
Data Transformation
● Encoding Categorical Variables: Convert categorical data/variables to numerical values for modelling using techniques like:
  • One-Hot Encoding (binary columns for each category),
  • Label Encoding (assigning an integer value to each category), or
  • Ordinal Encoding.
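A hedged sketch of one-hot encoding and Min-Max scaling in pandas; the "region"/"sales" columns are invented, and scikit-learn's OneHotEncoder or MinMaxScaler could be used instead:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "Central"],
    "sales": [120.0, 340.0, 95.0, 210.0],
})

# One-Hot Encoding: one binary column per region category
encoded = pd.get_dummies(df, columns=["region"])

# Min-Max scaling: rescale the numeric column to the [0, 1] range
encoded["sales_scaled"] = (encoded["sales"] - encoded["sales"].min()) / (
    encoded["sales"].max() - encoded["sales"].min()
)
print(encoded)
```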
Validating Data Integrity
● Problem: Ensuring that the data maintains its consistency and correctness after cleaning and transformation.
● Steps Involved:
  • Consistency Check: Verify that relationships between columns/variables are correct and logically consistent (e.g., a person's birthdate should always be before their death date; check if a child's age is less than the parent's age).
Validating Data Integrity
• Cross-validation: Check if data across multiple columns or datasets aligns (e.g., postal codes match the corresponding cities). Ensure that data in related columns matches logically, e.g., "start_date" should always precede "end_date".
• Duplication of Data Check: Ensure that there is no accidental repetition of information, especially after transformations and cleaning steps.
• Data Consistency Across Sources: If data is integrated from multiple sources, ensure there are no discrepancies between them.
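A minimal consistency-check sketch for the start_date/end_date rule mentioned above; the two sample rows are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2023-01-01", "2023-06-10"]),
    "end_date":   pd.to_datetime(["2023-02-01", "2023-05-01"]),
})

# Rows where "start_date" does not precede "end_date" break the integrity rule
violations = df[df["start_date"] >= df["end_date"]]
print(f"{len(violations)} row(s) fail the start/end date check")
print(violations)
```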
Data Aggregation
● Problem: Aggregating data can help to simplify and summarize large datasets, making it easier to analyze trends or patterns.
● Steps Involved:
  • Summarize Data: Apply aggregation functions such as sum(), mean(), count(), median() to summarize the data at different levels.
  • Aggregate data to create meaningful summaries or reports, such as computing averages, sums, counts, or other statistical summaries at different group levels.
Data Aggregation
• Group Data: Use grouping techniques to segment the data by categories and then apply aggregation functions (e.g., total sales by region).
• Pivoting or Reshaping: Reshape data using pivot tables or pivot() functions to get a more compact and readable format, especially when analyzing multiple dimensions. Transform the data structure using pivot tables or similar methods to ensure it's in a format that is suitable for analysis.
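A short pandas sketch of grouping, aggregating, and pivoting; the region/month sales data is invented for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "Central"],
    "month":  ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "amount": [120, 150, 300, 280, 90],
})

totals = sales.groupby("region")["amount"].sum()                          # total sales by region
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])  # several summaries at once

# Pivot into a region-by-month table for a more compact, readable view
wide = sales.pivot_table(index="region", columns="month", values="amount", aggfunc="sum")
print(totals, summary, wide, sep="\n\n")
```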
