
Lec4 SWN MC

The document discusses various techniques for identifying and handling different types of dirty data in a dataset. It begins by describing techniques to detect missing data, such as a missing data heatmap, percentage list, and histogram. It then discusses ways to handle missing data, such as dropping observations or features, imputing values, or replacing with a missing data code. The document also addresses detecting and handling outliers using histograms, box plots, and descriptive statistics. It identifies unnecessary data such as repetitive, irrelevant, or duplicate data values. Various techniques are presented for identifying and addressing each of these dirty data types to clean the dataset.


Data description

 From these results, we learn that the dataset has 30,471 rows and 292 columns. We also identify whether the features are numeric or categorical variables. This is all useful information (a sketch of how to obtain it is shown below).
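A minimal sketch of how these results could be produced with pandas (the file name train.csv is an assumption about how the downloaded data was saved):

import pandas as pd

# load the Russian housing dataset (file name is an assumption)
df = pd.read_csv('train.csv')

# dimensions: 30,471 rows and 292 columns
print(df.shape)

# which features are numeric and which are categorical (object) variables
print(df.dtypes.value_counts())
print(df.select_dtypes(include='object').columns)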
 Now we can run through the checklist of “dirty” data types and fix them one by
one.
 Now:
 Missing Data
 Irregular Data (Outliers)
 Unnecessary Data — Repetitive Data, Duplicates and more
 Inconsistent Data — Capitalization, Addresses and more
Missing data

Dealing with missing data/values is one of the trickiest but most common parts of data cleaning. While many models can live with other problems in the data, most models don’t accept missing data.
Technique #1: Missing Data Heatmap

When there is a small number of features/attributes (around 30 or fewer), we can visualize the missing data via a heatmap, as in the sketch below.
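A minimal sketch using seaborn (the column slice and the two colors are assumptions):

import seaborn as sns
import matplotlib.pyplot as plt

cols = df.columns[:30]                    # first 30 features
colours = ['#000099', '#ffff00']          # blue = present, yellow = missing
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colours))
plt.show()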
Technique #1: Missing Data Heatmap

 The resulting chart demonstrates the missing-data patterns of the first 30 features. The horizontal axis shows the feature name; the vertical axis shows the observations/rows; yellow represents missing data, while blue represents data that is present.
 For example, we see that the life_sq feature has missing values throughout many rows, while the floor feature has only a few missing values, around the 7,000th row.
Technique #2: Missing Data Percentage List

When there are many features in the dataset, we can make a list of
missing data % for each feature.
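A minimal sketch (printing only the features that have at least one missing value is an assumption):

# percentage of missing values for each feature
for col in df.columns:
    pct_missing = df[col].isnull().mean()
    if pct_missing > 0:
        print(f'{col} - {round(pct_missing * 100)}%')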
Technique #2: Missing Data Percentage List

This produces a list showing the percentage of missing values for each of the features. Specifically, we see that the life_sq feature has 21% missing, while floor has only 1% missing. This list is a useful summary that can complement the heatmap visualization.
Technique #3: Missing Data Histogram

 To learn more about the missing value patterns among observations, we can
visualize it by a histogram.
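A minimal sketch that counts missing values per row and plots the distribution (the helper column name num_missing is an assumption):

import matplotlib.pyplot as plt

# number of missing values in each observation/row
df['num_missing'] = df.isnull().sum(axis=1)

# histogram of how many rows have 0, 1, 2, ... missing values
df['num_missing'].value_counts().sort_index().plot.bar()
plt.show()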
Technique #3: Missing Data Histogram

 This histogram helps to identify the missing-value situation among the 30,471 observations.
 For example, there are over 6,000 observations with no missing values and close to 4,000 observations with one missing value.
What do we do with the missing data ?

There are NO agreed-upon solutions for dealing with missing data. We have to study the specific feature and dataset to decide the best way of handling it.
The four most common methods of handling missing data will be covered.
But if the situation is more complicated than usual, we need to be creative and use more sophisticated methods such as missing data modeling.
Solution #1: Drop the Observation

In statistics, this method is called the listwise deletion technique.
In this solution, we drop the entire observation as long as it contains a missing value.
We perform this only if we are sure that the missing data is not informative. Otherwise, we should consider other solutions. (There could be other criteria for dropping observations.)
Solution #1: Drop the Observation

 For example, from the missing data histogram, we notice that only a minimal number of observations have over 35 features missing altogether.
 We may create a new dataset df_less_missing_rows by deleting observations with over 35 missing features, as in the sketch below.
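A minimal sketch, reusing the num_missing column computed for the histogram sketch above:

# drop observations with more than 35 missing features
ind_missing = df[df['num_missing'] > 35].index
df_less_missing_rows = df.drop(ind_missing, axis=0)
print(df_less_missing_rows.shape)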
Solution #2: Drop the Feature

Similar to Solution #1, we only do this when we are confident that this
feature doesn’t provide useful information.
For example, from the missing data % list, we notice
that hospital_beds_raion has a high missing value percentage of 47%.
We may drop the entire feature.
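A minimal sketch (the name of the new DataFrame is an assumption):

# drop a feature with a very high percentage of missing values
df_less_hos_beds_raion = df.drop(columns=['hospital_beds_raion'])
print(df_less_hos_beds_raion.shape)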
Solution #3: Impute the Missing

In statistics, imputation is the process of replacing missing data with substituted values.
When the feature is a numeric variable, we can conduct missing data imputation. We replace the missing values with the mean or median computed from the non-missing data of the same feature.
When the feature is a categorical variable, we may impute the missing data with the mode (the most frequent value).
Solution #3: Impute the Missing (numeric)
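A minimal sketch for a numeric feature, using the median (choosing life_sq as the example is an assumption):

# impute missing values of a numeric feature with its median
med = df['life_sq'].median()
df['life_sq'] = df['life_sq'].fillna(med)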
Solution #3: Impute the Missing (non-numeric)

We can apply the mode imputation strategy for all the categorical
features at once.
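A minimal sketch that applies mode imputation to every categorical (object) feature at once:

# impute each categorical feature with its most frequent value
for col in df.select_dtypes(include='object').columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])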
Solution #4: Replace the Missing

For categorical features, we can add a new category with a value such as “_MISSING_”. For numerical features, we can replace the missing values with a particular value such as -999.
This way, we are still keeping the missing values as valuable information.
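A minimal sketch (the choice of sub_area and build_year as example columns is an assumption):

# categorical feature: add an explicit missing-data category
df['sub_area'] = df['sub_area'].fillna('_MISSING_')

# numeric feature: replace missing values with a sentinel code
df['build_year'] = df['build_year'].fillna(-999)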
Irregular data (Outliers)

Outliers are data points that are distinctly different from the other observations. They could be real outliers or mistakes.
Depending on whether the feature is numeric or categorical, we can use different techniques to study its distribution and detect outliers.
Technique #1: Histogram/Box Plot

When the feature is numeric, we can use a histogram and box plot to
detect outliers.
Technique #1: Histogram/Box Plot

 To study the feature closer, let’s make a box plot.
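A minimal sketch of both plots for a numeric feature (choosing life_sq as the example is an assumption):

import matplotlib.pyplot as plt

# histogram of the feature
df['life_sq'].hist(bins=100)
plt.show()

# box plot of the same feature
df.boxplot(column=['life_sq'])
plt.show()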


Technique #2: Descriptive Statistics

 Also, for numeric features, the outliers could be so distinct/so far out of the expected range that the box plot can’t visualize them well. Instead, we can look at their descriptive statistics.
 For example, for the feature life_sq again, we can see that the maximum value is 7478, while the 75% quartile is only 43. The 7478 value is an outlier.
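A minimal sketch:

# count, mean, std, min, quartiles and max of the feature
print(df['life_sq'].describe())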
Technique #3: Bar Chart

 When the feature is categorical, we can use a bar chart to learn about its categories and distribution.
 For example, the feature ecology has a reasonable distribution. But if there is a category, such as “other”, with only a single occurrence, that would be an outlier.
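A minimal sketch:

import matplotlib.pyplot as plt

# bar chart of the categorical feature's value counts
df['ecology'].value_counts().plot.bar()
plt.show()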
What to do with outliers?

While outliers are not hard to detect, we have to determine the right solutions to handle them.
This highly depends on the dataset and the goal of the project.
The methods of handling outliers are somewhat similar to those for missing data.
We either drop, adjust, or keep them.
We can refer back to the missing data section for possible solutions.
Unnecessary data

• All the data feeding into the model should serve the purpose of the project.
• Data is unnecessary when it doesn’t add value.
• We cover three main types of unnecessary data, which arise for different reasons.
Unnecessary type #1: Uninformative / Repetitive

Sometimes a feature is uninformative because too many of its rows hold the same value.
How to find out? We can create a list of features with a high percentage of the same value, as in the sketch below.
The value_counts() function is used to get a Series containing counts of unique values. The resulting object will be in descending order, so the first element is the most frequently occurring value. It excludes NA values by default.
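A minimal sketch (the 95% threshold is an assumption):

num_rows = len(df)
low_information_cols = []

for col in df.columns:
    counts = df[col].value_counts(dropna=False)   # count NaN as a value too
    top_pct = counts.iloc[0] / num_rows           # share of rows holding the most common value
    if top_pct > 0.95:
        low_information_cols.append(col)
        print(f'{col}: {round(top_pct * 100, 1)}% identical values')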
Unnecessary type #1: Uninformative / Repetitive

• len() is a built-in function in Python. You can use len() to get the length of a given string, array, list, tuple, dictionary, etc.
• append() adds a passed object to the end of an existing list.

We need to understand the reasons behind the repetitive features. When they are genuinely uninformative, we can toss them out.
Unnecessary type #2: Irrelevant

 Again, the data needs to provide valuable information for the project. If the
features are not related to the question we are trying to solve in the project, they
are irrelevant.
 How to find out? We need to skim through the features to identify irrelevant
ones.
 Example: a feature recording the temperature in Toronto doesn’t provide any
useful insights to predict Russian housing prices.
 What to do? When the features are not serving the project’s goal, we can
remove them.
Unnecessary type #3: Duplicates

Duplicate data is when copies of the same observation exist.
There are two main types of duplicate data.
Duplicates type #1: All Features based
Duplicates type #2: Key Features based
Duplicates type #1: All Features based

 This duplicate happens when all the features’ values within the observations/records are the same. (It is easy to find.)
 We first remove the unique identifier id from the dataset.
 Then we create a dataset called df_dedupped by dropping the duplicates. We compare the shapes of the two datasets (df and df_dedupped) to find out the number of duplicated rows.

The drop() function is used to drop specified labels from rows or columns. Remove rows or columns by specifying label names and the corresponding axis, or by specifying index or column names directly. When using a multi-index, labels on different levels can be removed by specifying the level.
The drop_duplicates() function is used in analyzing duplicate data and removing them. The function basically helps in removing duplicates from the DataFrame.
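A minimal sketch:

# drop the unique identifier, then drop rows that are exact duplicates
df_dedupped = df.drop('id', axis=1).drop_duplicates()

# compare shapes to find the number of duplicated rows
print(df.shape)
print(df_dedupped.shape)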
Duplicates type #1: All Features based

10 rows are complete duplicate observations.
What to do? We should remove these duplicates, which we already did above.
Duplicates type #2: Key Features based

How to find out? Sometimes it is better to remove duplicate data based on a set of unique identifiers.
For example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year are close to zero/unlikely. (This uses the data scientist’s logical knowledge of the data.)
We can set up a group of critical features as unique identifiers for transactions. We include timestamp, full_sq, life_sq, floor, build_year, num_room, and price_doc. We check if there are duplicates based on them.
Duplicates type #2: Key Features based

fillna() manages and lets the user replace NaN/Null values with a value of their own.
The idea of groupby() is pretty simple: create groups of categories and apply a function to them.
The head() function is used to get the first n rows. This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.
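A minimal sketch of the check (fillna(-999) is a placeholder so rows with missing key values still group together):

key = ['timestamp', 'full_sq', 'life_sq', 'floor', 'build_year', 'num_room', 'price_doc']

# count how many rows share each key combination; counts above 1 are duplicates
grouped = df.fillna(-999).groupby(key)['id'].count().sort_values(ascending=False)
print(grouped.head(20))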

There are 16 duplicates based on this set of key features.
Duplicates type #2: Key Features based

What to do? We can drop these duplicates based on the key features.
We dropped the 16 duplicates and saved the result in a new dataset named df_dedupped2, as in the sketch below.
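A minimal sketch:

# drop duplicates based only on the key features
df_dedupped2 = df.drop_duplicates(subset=key)

# the difference in row counts should be the 16 duplicates
print(df.shape)
print(df_dedupped2.shape)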


Inconsistent data
• It is also crucial to have the dataset follow specific standards
to fit a model.
• We need to explore the data in different ways to find out the
inconsistent data.
• Much of the time, it depends on observations and
experience.
• There is no set code to run and fix them all.
• We will cover four inconsistent data types.
Inconsistent type #1: Capitalization

 Inconsistent usage of upper and lower case in categorical values is a common mistake. It could cause issues, since analysis in Python is case sensitive.
 How to find out? Let’s look at the sub_area feature as an example.
 Sometimes there is inconsistent capitalization within the same feature: the values “Poselenie Sosenskoe” and “pOseleNie sosenskeo” could refer to the same area.
Inconsistent type #1: Capitalization

What to do? To avoid this, we can convert all letters to lowercase (or uppercase), as in the sketch below.
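A minimal sketch (the new column name sub_area_lower is an assumption):

# standardize capitalization by lowercasing the whole feature
df['sub_area_lower'] = df['sub_area'].str.lower()
print(df['sub_area_lower'].value_counts(dropna=False).head())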
Inconsistent type #2: Formats

Another standardization we need to perform is of data formats.
One example is converting a feature from string to DateTime format.
DateTime formatting specifies how elements like day, month, year, hour, minutes, and seconds are displayed in a string. For example, displaying the date as DD-MM-YYYY is one format, and displaying the date as MM-DD-YYYY is another.
Inconsistent type #2: Formats

How to find out? The feature timestamp is in string format even though it represents dates.
What to do? We can convert it and extract the date or time values using the code below. After this, it’s easier to analyze the transaction volume grouped by either year or month.
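A minimal sketch (the '%Y-%m-%d' format string is an assumption about how the timestamps are stored):

# convert the timestamp string into a DateTime feature
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d')
df['year'] = df['timestamp_dt'].dt.year
df['month'] = df['timestamp_dt'].dt.month
df['weekday'] = df['timestamp_dt'].dt.weekday

# e.g. transaction volume per year
print(df['year'].value_counts().sort_index())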
Inconsistent type #3: Categorical Values

Inconsistent categorical values are another inconsistent type we cover.
A categorical feature has a limited number of valid values. Sometimes there may be other values due to reasons such as typos.
How to find out? We need to observe the feature to spot this inconsistency. Let’s show this with an example.
The example uses a different dataset than the Russian housing data.
Inconsistent type #3: Categorical Values

We use a new dataset below, since we don’t have such a problem in the real estate dataset.
For instance, the value of city was typed by mistake as “torontoo” and “tronto”, but both refer to the correct value “toronto”.
A simple way to identify them is fuzzy matching (or edit distance). It measures how many letters (the distance) we need to change in the spelling of one value to match another value.
Inconsistent type #3: Categorical Values

 We know that the categories should only have four values: “toronto”, “vancouver”, “montreal”, and “calgary”.
 We calculate the distance between all the values and the word “toronto” (and “vancouver”).
 We can see that the ones likely to be typos have a smaller distance from the correct word, since they only differ by a couple of letters.
Inconsistent type #3: Categorical Values

What to do? We can set criteria to convert these typos to the correct values.
For example, the sketch below sets all the values within a distance of 2 letters from “toronto” to be “toronto”.
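A minimal sketch with a hypothetical messy city column (the df_city_ex data and the simple Levenshtein helper are assumptions; a library such as nltk or fuzzywuzzy could be used instead):

import pandas as pd

def edit_distance(a, b):
    # classic Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# hypothetical data with typos
df_city_ex = pd.DataFrame({'city': ['torontoo', 'toronto', 'tronto', 'vancouver',
                                    'vancover', 'vancouvr', 'montreal', 'calgary']})

# set every value within 2 letters of "toronto" to "toronto" (same idea for "vancouver")
msk = df_city_ex['city'].apply(lambda x: edit_distance(x, 'toronto')) <= 2
df_city_ex.loc[msk, 'city'] = 'toronto'
print(df_city_ex)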
Inconsistent type #4: Addresses
 The address feature can be a headache for many of us, because people entering the data into the database often don’t follow a standard format.
 How to find out? We can find messy address data by looking at it. Even if we can’t spot any issues, we can still run code to standardize it.
 The example uses the Russian housing data with an address column added that is not in the original data.
What to do? We run the code below to lowercase the letters, remove whitespace, delete periods, and standardize wordings.

df_add_ex['address_std'] = df_add_ex['address'].str.lower()
df_add_ex['address_std'] = df_add_ex['address_std'].str.strip()  # remove leading and trailing whitespace.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace('\\.', '', regex=True)  # remove periods.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace('\\bstreet\\b', 'st', regex=True)  # replace street with st.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace('\\bapartment\\b', 'apt', regex=True)  # replace apartment with apt.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace('\\bav\\b', 'ave', regex=True)  # replace av with ave.

df_add_ex
Assignment 1

Download the Russian housing data.
Implement slides 3 and 4, but name your dataset as your name.
Then implement one of the following:
 Missing Data (one identification technique with one solution)
 Irregular Data (Outliers) (one identification technique with one solution)
 Unnecessary Data — Repetitive Data, Duplicates and more (one identification technique with one solution)
 Inconsistent Data — Capitalization, Addresses and more (one identification technique with one solution)
Turn in your code, a Word document with a summary/description of what you did, and a video with your camera on of you implementing and explaining the code (10-15 minutes, YouTube link).
