Lesson 3. Data Preparation and Structuring 1: Data Cleaning

TOPICS OF STUDY

Data Collection
● Reliable sites for data collection
● Merging data sources
● Data crawling
● Web Scraping
● Creating a dataset

Data discovery
● Understanding problem domain (domain knowledge)
● Understanding data patterns
● Identifying and handling the missing values
● Feature engineering

Data preparation and structuring
● Identifying and handling the missing values
● Feature engineering
● Encoding data
● Feature scaling
● Data cleaning
● Data enriching or augmentation
● Data validation
● Splitting dataset

Exploratory Data Analysis
● Types of EDA
  o Univariate non-graphical
  o Univariate graphical
  o Multivariate non-graphical
  o Multivariate graphical
● Packages for EDA and Munging
  o (NumPy, SciPy, Pandas, Matplotlib, Google Data Prep, Tabula, Data Wrangler)
● Variable relationship analysis
  o Box plots, Histograms, Scatter plots, Bar Charts, Pie Charts, Line Charts

Choosing right data model
● Types of data models

Data Publishing
● Report writing
● Data Visualization
Overview

Data preparation and structuring 1: Data cleaning
Data Preprocessing
● Data Preprocessing refers to the steps taken to clean and prepare raw data before it is used in a machine learning model or for any kind of analysis.
● The goal is to improve the quality of the data, eliminate inconsistencies, and make it more suitable for building accurate and reliable models.
● Data cleaning is a very important stage in data science.
Data cleaning
● Data cleaning is the process of identifying and correcting (or removing) errors and inconsistencies in data to improve its quality and reliability for analysis.
● It is an essential step in preparing raw data for further analysis or machine learning tasks.
● The steps below aim to ensure that data is clean, consistent, and suitable for analysis.
● Each step is integral to preparing a dataset for machine learning, statistical analysis, or business reporting, ensuring that the insights drawn are based on high-quality, reliable data.
Data Cleaning
● Steps aim to ensure that data is clean, consistent, and suitable for analysis.
  − Handling missing values: Filling in missing data points (imputation), or removing rows/columns with missing data.
  − Removing duplicates: Identifying and eliminating duplicate records to avoid redundancy and bias in the model.
  − Outlier detection and handling: Identifying and handling extreme values that may skew the analysis.
  − Standardizing data formats: Ensuring consistency in formats, such as dates or categorical variables.
  − Error correction: Fixing incorrect data entries, such as typos or inconsistencies in labelling.
Data cleaning - Handling Missing Data
● Problem: Missing data is a common issue in real-world datasets, and how we handle it can significantly impact the quality of analysis.
● Steps Involved:
  • Identify Missing Data: Use functions like isnull() (Pandas) or is.na() (R) to identify missing values in a dataset.
  • Check both individual columns and rows for any gaps.
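A minimal pandas sketch of this step; the small survey-style DataFrame and its column names ("age", "income") are invented for illustration:

```python
import pandas as pd

# Hypothetical survey data with a few gaps
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "income": [52000, 61000, None, 45000],
})

print(df.isnull())              # element-wise True/False mask of missing values
print(df.isnull().sum())        # number of missing values per column
print(df.isnull().any(axis=1))  # rows that contain at least one gap
```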
Methods to Handle Missing Values
● Remove Missing Data: Remove rows or columns with missing data (useful when the missing proportion is small).
  – If the missing data is minimal and not crucial, you can delete the rows (dropna(axis=0) in pandas) or columns (dropna(axis=1)) containing missing values.
● What it means: This method involves deleting entire rows or columns that have missing data.
● When to use: It's useful when the missing data is very small or insignificant, so it won't affect the overall results much.
Remove Rows or Columns with Missing Data
● Example: If only a few rows have missing values in a large dataset (e.g., 2 or 3 out of 1000 rows), you can safely remove those rows without losing important information.
● Advantages: Quick and simple to implement.
● Disadvantages: If too much data is missing, removing rows or columns could cause you to lose valuable information.
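A short sketch of row and column removal with dropna(), reusing the same invented survey-style data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "income": [52000, 61000, None, 45000],
})

rows_kept = df.dropna(axis=0)   # drop any row that has a missing value
cols_kept = df.dropna(axis=1)   # drop any column that has a missing value
print(rows_kept)
print(cols_kept)
```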
Imputation (Filling Missing Values)
● Imputation: Instead of removing, fill in missing values with a substitute value.
● There are several ways to do this:
  – Fill in missing values using statistical methods like mean, median, or mode.
  – Forward/Backward Filling (for time series, e.g. weekly maize prices).
  – Using flags (True/False to indicate whether a value was missing).
  – More sophisticated imputation techniques (e.g., using regression or machine learning algorithms).
Mean, Median, or Mode Imputation
● What it means: You replace the missing values with the average (mean), the middle value (median), or the most frequent value (mode) of that column.
● Mode Imputation: For categorical data, replace missing values with the most frequent value (mode).
● When to use: This is useful when the missing data is random (MCAR or MAR), and the data you're filling in doesn't have extreme values or outliers.
Mean, Median, or Mode Imputation
• Example: In a customer survey, if the "age" column has missing values, you could fill those with the average age of the respondents.
• Advantages: Simple, quick, and easy to implement.
• Disadvantages: It can distort the data distribution if the missing values are not randomly distributed (e.g., if there are outliers).
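A hedged pandas sketch of mean and mode imputation; the toy "age" and "gender" columns are assumptions made only for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, None, 35, 29, None],          # numeric, as in the survey example
    "gender": ["F", "M", None, "M", "M"],     # categorical
})

df["age"] = df["age"].fillna(df["age"].mean())               # mean imputation (use .median() for skewed data)
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])   # mode imputation for categorical data
print(df)
```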
Forward/Backward Filling
● In time-series data, missing values can be replaced by the preceding or succeeding values.
● Example: Consider the weekly maize prices in the table below.
Forward/Backward Filling
● Forward Fill: Missing values (NaN) are filled with the last known value (e.g., Week 3 and 4 take the value from Week 2, which is 210).
● Backward Fill: Missing values are filled with the next known value (e.g., Week 3 and 4 take the value from Week 5, which is 220).

Week | Original Price | Forward Fill | Backward Fill
  1  |      200       |     200      |      200
  2  |      210       |     210      |      210
  3  |      NaN       |     210      |      220
  4  |      NaN       |     210      |      220
  5  |      220       |     220      |      220
  6  |      NaN       |     220      |      230
  7  |      230       |     230      |      230
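The same table expressed as a small pandas sketch using ffill() and bfill(); the weekly index is hypothetical:

```python
import pandas as pd

# Weekly maize prices from the table above (NaN marks missing weeks)
prices = pd.Series([200, 210, None, None, 220, None, 230],
                   index=range(1, 8), name="price")

forward = prices.ffill()    # Weeks 3 and 4 take 210; Week 6 takes 220
backward = prices.bfill()   # Weeks 3 and 4 take 220; Week 6 takes 230
print(pd.DataFrame({"original": prices,
                    "forward_fill": forward,
                    "backward_fill": backward}))
```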
Flags
● For categorical variables, create flags to indicate if the value was missing.
● Add a separate binary (True/False) column that marks whether a value was missing, so you retain the information about its absence.
● Flagging: Creating a new variable to indicate missingness for further analysis.
Predictive Models (Regression or k-Nearest Neighbours)
● What it means: You use machine learning models to predict what the missing values should be based on other data.
● Regression: You use relationships between variables (e.g., predicting income based on age and education level).
● k-Nearest Neighbours (k-NN): You fill in missing values by looking at the closest "neighbours" (similar data points) to make a prediction.
Predictive Models (Regression or k-Nearest Neighbours)
● When to use: This is helpful when missing values are more complicated and might depend on other variables in the dataset.
● Example: If the "income" field is missing in a dataset, you might predict the missing income value based on other information like "age" and "education level" using a regression model.
● Advantages: More accurate, especially when the missing data is not randomly missing.
● Disadvantages: More complex and requires more data or computational resources.
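A sketch of model-based imputation using scikit-learn's KNNImputer, assuming scikit-learn is installed; the columns and values are invented for illustration, and a regression-based imputer could be used the same way:

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data where missing "income" values relate to age and education
df = pd.DataFrame({
    "age": [23, 35, 45, 52, 29],
    "education_years": [12, 16, 18, 16, 14],
    "income": [28000, None, 72000, 80000, None],
})

imputer = KNNImputer(n_neighbors=2)        # fill each gap from the 2 most similar rows
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```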
Flagging (Creating a Missingness Indicator)
● What it means: Instead of filling in the missing data, you create a new variable (flag) that indicates whether the data is missing or not.
● When to use: This is helpful when you want to keep track of missing values and use them for further analysis (e.g., analysing whether missing data correlates with certain patterns or outcomes).
● Example: In a customer survey dataset, if some "age" entries are missing, you could add a new column called "Age_Missing" that has a 1 for missing data and a 0 for available data.
● Advantages: Helps preserve the original data and adds valuable information about the missingness itself.
● Disadvantages: It doesn't solve the problem of missing data, but it can help in analysing how missingness affects your model.
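A minimal sketch of adding such a flag in pandas, using the "Age_Missing" naming from the example above:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, None, 27, None]})

# 1 where "age" is missing, 0 where it is present; the original column is kept
df["Age_Missing"] = df["age"].isnull().astype(int)
print(df)
```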
Data cleaning - Removing Duplicates
● Problem: Duplicates can distort analysis and introduce bias or redundancy into datasets.
● Steps Involved:
  • Identify/Detect Duplicates: Identify any rows in the dataset that are identical. Duplicates occur when identical rows or entries appear more than once in the dataset.
  • Use .duplicated() in pandas to identify duplicate rows.
  • Visualize data to detect repeated entries or use summary statistics to spot anomalies.
Removing Duplicates
• Remove Duplicates: Eliminate the duplicate entries using .drop_duplicates() in pandas or similar functionality like DISTINCT in SQL to remove duplicate rows.
• Consider whether duplicates are legitimate and should be kept (e.g., repeated transactions).
• Decide whether you want to remove rows with identical values across all columns or only across specific columns.
• Check if keeping duplicates makes sense for the analysis, for example, in the case of repeated transactions.
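A short pandas sketch of detecting and removing duplicates; the customer/transaction columns are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "city": ["Blantyre", "Blantyre", "Lilongwe", "Zomba", "Zomba"],
    "amount": [100, 100, 250, 80, 95],
})

print(df.duplicated())                # True for rows that exactly repeat an earlier row
deduped_all = df.drop_duplicates()    # remove rows identical across all columns
deduped_id = df.drop_duplicates(subset=["customer_id"], keep="first")  # dedupe on chosen columns only
print(deduped_all)
print(deduped_id)
```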
Data cleaning - Correcting Data Types
• Check Data Types: Ensure that each column has the correct data type (e.g., numerical, categorical, date, string).
• Convert Data Types: If necessary, convert columns to appropriate types using functions like .astype() or pandas' pd.to_datetime() for dates.
• Handling Misclassified Data: For example, converting numerical values stored as strings into actual numeric types.
Correcting Data Types
● Problem: Incorrect data types can lead to errors during analysis, especially when mathematical operations or string manipulations are involved.
● Steps Involved:
  • Identify Incorrect Data Types: Use dtypes in pandas to check the data type of each column.
  • Numerical columns (e.g., age, salary) may be incorrectly stored as strings, and categorical data (e.g., country, gender) might be misclassified as numerical.
Correcting Data Types
• Convert Data Types:
  • Numeric Conversion: Convert string representations of numbers into actual numeric types using astype().
  • Date/Time Conversion: Use pd.to_datetime() to convert columns containing dates stored as strings into proper datetime objects.
  • Categorical Data Conversion: Convert categorical variables to type category for memory optimization and speed improvements.
  • Boolean Conversion: If a column contains binary values like "yes"/"no" or 1/0, ensure it is properly converted to bool.
Data cleaning - Handling Outliers
● Outliers are data points that significantly differ from the rest of the data, and they can distort statistical analyses and model performance. Handling outliers is crucial in feature engineering because they may lead to biased results or reduced model accuracy.
● Techniques for handling outliers include:
  • Identify Outliers: Use statistical methods (e.g., IQR, Z-scores) or visualization techniques (boxplots, histograms) to identify extreme values.
Data Cleaning - Handling Outliers
• Handle Outliers: Depending on the context:
  • Removing outliers: If the outliers are errors, represent noise, or are irrelevant to the analysis, remove them from the dataset entirely; extreme outliers that do not fit the distribution can also be dropped.
  • Transformation: Apply mathematical transformations like logarithms or square roots to reduce the impact of extreme values, or set a threshold value to cap the outliers.
Data Cleaning - Handling Outliers
• Capping or winsorization: Cap outliers by setting them to a maximum or minimum threshold; replace outlier values with a predefined threshold value (e.g., the 95th percentile) to minimize their effect.
• Imputation: In cases where outliers represent missing or erroneous data, imputing these values based on the median or other relevant statistics might be appropriate.
• Robust Models: Use models that are less sensitive to outliers, such as tree-based methods (e.g., decision trees, random forests), or algorithms that incorporate robust scaling methods.
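A minimal sketch of IQR-based detection, capping (winsorization), and a log transform in pandas/NumPy; the sales figures are made up, and the 1.5 × IQR fence is one common convention rather than a fixed rule:

```python
import numpy as np
import pandas as pd

sales = pd.Series([120, 135, 128, 140, 980, 132, 125])   # 980 is an extreme value

q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = sales[(sales < lower) | (sales > upper)]   # identify extreme values via the IQR rule
capped = sales.clip(lower=lower, upper=upper)         # winsorize: cap values at the IQR fences
logged = np.log1p(sales)                              # log transform to shrink the impact of extremes
print(outliers)
print(capped.tolist())
```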
Correcting Inconsistent Data
• Inconsistent Categories: Standardise categorical columns that have inconsistent naming conventions or values. For example, "Yes" and "yes" should be the same.
• Spelling and Formatting Errors: Check for typographical errors in categorical data and correct them (e.g., "Blantyre" vs. "Blantyer").
• Inconsistent Date Formats: Ensure all dates are in the same format.
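A short pandas sketch of standardising an inconsistent categorical column, using the "Blantyre"/"Blantyer" example; the exact cleaning chain is just one reasonable approach:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Blantyre", " blantyre", "Blantyer", "LILONGWE"]})

df["city"] = (
    df["city"]
    .str.strip()                           # remove stray whitespace
    .str.lower()                           # unify case so "Yes"/"yes"-style variants collapse
    .replace({"blantyer": "blantyre"})     # correct known misspellings
    .str.title()                           # present categories in one consistent form
)
print(df["city"].unique())
```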
Handling Incorrect Data
• Validation: Identify and fix incorrect data entries that don't make sense or violate domain knowledge (e.g., negative ages, unrealistic sales figures).
• Cross-checking: Use domain-specific rules to flag potential errors (e.g., a person's birthdate being in the future).
• Automated Checks: Implement automated rule checks to catch anomalies.
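A minimal sketch of such automated rule checks in pandas; the rules and sample rows are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 27],
    "birthdate": pd.to_datetime(["1990-01-01", "2030-05-09", "1997-06-12"]),
})

bad_age = df["age"] < 0                                  # negative ages violate domain knowledge
future_birth = df["birthdate"] > pd.Timestamp.today()    # a birthdate cannot be in the future
flagged = df[bad_age | future_birth]                     # rows to review or correct
print(flagged)
```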
Standardizing Data
● Problem: Inconsistent formats, units, or naming conventions in the dataset can make it difficult to compare and analyze the data.
● Steps Involved:
  • Unit Standardization (Normalize Units): Ensure that all measurements use consistent units (e.g., converting height from feet to meters, all currency values to the same currency, or weight units like pounds to kilograms).
Standardizing Data
● Consistent Categorical Values: Ensure that categorical values are consistent (e.g., "Male" and "M" should be treated as the same category).
● Text Standardization: Convert text data to a consistent case (e.g., lower case or upper case) using str.lower() or str.upper().
● Standardize abbreviations, capitalization, and formatting in text data, and correct misspellings (e.g., converting "N.Y." to "New York"); also normalize date formats (e.g., YYYY-MM-DD).
Standardizing Data
• Date Standardization: Convert all dates to the same format, e.g., YYYY-MM-DD or DD/MM/YYYY, depending on the region or system used.
• Categorical Standardization: Standardize categories (e.g., "Male" and "M" should be the same category for gender).
Data Transformation
● Problem: Raw data may need to be transformed to ensure consistency or improve model performance.
● Steps Involved:
  • Scaling and Normalization: Standardise or normalise numerical data (e.g., using Min-Max scaling or Z-score normalization/standardization) to bring different variables to the same scale.
  • Log Transformation: Apply a logarithmic transformation to skewed data distributions to normalize the distribution.
Data Transformation
● Encoding Categorical Variables: Convert categorical data/variables to numerical values for modelling using techniques like:
  • One-Hot Encoding (binary columns for each category),
  • Label Encoding (assigning an integer value to each category), or
  • Ordinal Encoding.
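A hedged sketch of one-hot encoding and Min-Max scaling in pandas; the "region"/"sales" columns are invented, and scikit-learn's OneHotEncoder or MinMaxScaler could be used instead:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "Central"],
    "sales": [120.0, 340.0, 95.0, 210.0],
})

# One-Hot Encoding: one binary column per region category
encoded = pd.get_dummies(df, columns=["region"])

# Min-Max scaling: rescale the numeric column to the [0, 1] range
encoded["sales_scaled"] = (encoded["sales"] - encoded["sales"].min()) / (
    encoded["sales"].max() - encoded["sales"].min()
)
print(encoded)
```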
Validating Data Integrity
● Problem: Ensuring that the data maintains its consistency and correctness after cleaning and transformation.
● Steps Involved:
  • Consistency Check: Verify that relationships between columns/variables are correct and logically consistent (e.g., a person's birthdate should always be before their death date; check if a child's age is less than the parent's age).
Validating Data Integrity
• Cross-validation: Check if data across multiple columns or datasets aligns (e.g., postal codes match the corresponding cities). Ensure that data in related columns matches logically, e.g., "start_date" should always precede "end_date".
• Duplication of Data Check: Ensure that there is no accidental repetition of information, especially after transformations and cleaning steps.
• Data Consistency Across Sources: If data is integrated from multiple sources, ensure there are no discrepancies between them.
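A minimal consistency-check sketch for the start_date/end_date rule mentioned above; the two sample rows are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2023-01-01", "2023-06-10"]),
    "end_date":   pd.to_datetime(["2023-02-01", "2023-05-01"]),
})

# Rows where "start_date" does not precede "end_date" break the integrity rule
violations = df[df["start_date"] >= df["end_date"]]
print(f"{len(violations)} row(s) fail the start/end date check")
print(violations)
```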
Data Aggregation
● Problem: Aggregating data can help to simplify and summarize large datasets, making it easier to analyze trends or patterns.
● Steps Involved:
  • Summarize Data: Apply aggregation functions such as sum(), mean(), count(), median() to summarize the data at different levels.
  • Aggregate data to create meaningful summaries or reports, such as computing averages, sums, counts, or other statistical summaries at different group levels.
Data Aggregation
• Group Data: Use grouping techniques to segment the data by categories and then apply aggregation functions (e.g., total sales by region).
• Pivoting or Reshaping: Reshape data using pivot tables or pivot() functions to get a more compact and readable format, especially when analyzing multiple dimensions. Transform the data structure using pivot tables or similar methods to ensure it's in a format that is suitable for analysis.
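A short pandas sketch of grouping, aggregating, and pivoting; the region/month sales data is invented for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "Central"],
    "month":  ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "amount": [120, 150, 300, 280, 90],
})

totals = sales.groupby("region")["amount"].sum()                          # total sales by region
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])  # several summaries at once

# Pivot into a region-by-month table for a more compact, readable view
wide = sales.pivot_table(index="region", columns="month", values="amount", aggfunc="sum")
print(totals, summary, wide, sep="\n\n")
```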
