
Lec4 SWN MC

The document discusses various techniques for identifying and handling different types of dirty data in a dataset. It begins by describing techniques to detect missing data, such as a missing data heatmap, percentage list, and histogram. It then discusses ways to handle missing data, such as dropping observations or features, imputing values, or replacing with a missing data code. The document also addresses detecting and handling outliers using histograms, box plots, and descriptive statistics. It identifies unnecessary data such as repetitive, irrelevant, or duplicate data values. Various techniques are presented for identifying and addressing each of these dirty data types to clean the dataset.


Data description

 From these results, we learn that the dataset has 30,471 rows and 292 columns. We also identify whether the features are numeric or categorical variables. This is all useful information (a sketch of how to obtain it is shown below).
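A minimal sketch of how these results could be produced with pandas (the file name train.csv is an assumption about how the downloaded data was saved):

import pandas as pd

# load the Russian housing dataset (file name is an assumption)
df = pd.read_csv('train.csv')

# dimensions: 30,471 rows and 292 columns
print(df.shape)

# which features are numeric and which are categorical (object) variables
print(df.dtypes.value_counts())
print(df.select_dtypes(include='object').columns)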
 Now we can run through the checklist of “dirty” data types and fix them one by
one.
 Now:
 Missing Data
 Irregular Data (Outliers)
 Unnecessary Data — Repetitive Data, Duplicates and more
 Inconsistent Data — Capitalization, Addresses and more
Missing data

Dealing with missing data/values is one of the trickiest but most common parts of data cleaning. While many models can live with other problems in the data, most models don’t accept missing data.
Technique #1: Missing Data Heatmap

When there is a small number of features/attributes (around 30 or fewer), we can visualize the missing data via a heatmap, as in the sketch below.
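A minimal sketch using seaborn (the column slice and the two colors are assumptions):

import seaborn as sns
import matplotlib.pyplot as plt

cols = df.columns[:30]                    # first 30 features
colours = ['#000099', '#ffff00']          # blue = present, yellow = missing
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colours))
plt.show()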
Technique #1: Missing Data Heatmap

 The resulting chart demonstrates the missing-data patterns of the first 30 features. The horizontal axis shows the feature name; the vertical axis shows the observations/rows; yellow represents missing data, while blue represents data that is present.
 For example, we see that the life_sq feature has missing values throughout many rows, while the floor feature has only a few missing values, around the 7,000th row.
Technique #2: Missing Data Percentage List

When there are many features in the dataset, we can make a list of
missing data % for each feature.
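A minimal sketch (printing only the features that have at least one missing value is an assumption):

# percentage of missing values for each feature
for col in df.columns:
    pct_missing = df[col].isnull().mean()
    if pct_missing > 0:
        print(f'{col} - {round(pct_missing * 100)}%')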
Technique #2: Missing Data Percentage List

This produces a list showing the percentage of missing values for each of the features. Specifically, we see that the life_sq feature has 21% missing, while floor has only 1% missing. This list is a useful summary that can complement the heatmap visualization.
Technique #3: Missing Data Histogram

 To learn more about the missing value patterns among observations, we can
visualize it by a histogram.
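A minimal sketch that counts missing values per row and plots the distribution (the helper column name num_missing is an assumption):

import matplotlib.pyplot as plt

# number of missing values in each observation/row
df['num_missing'] = df.isnull().sum(axis=1)

# histogram of how many rows have 0, 1, 2, ... missing values
df['num_missing'].value_counts().sort_index().plot.bar()
plt.show()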
Technique #3: Missing Data Histogram

 This histogram helps to identify the missing-value situation among the 30,471 observations.
 For example, there are over 6,000 observations with no missing values and close to 4,000 observations with one missing value.
What do we do with the missing data ?

There are NO agreed-upon solutions for dealing with missing data. We have to study the specific feature and dataset to decide the best way of handling it.
The four most common methods of handling missing data will be covered.
But if the situation is more complicated than usual, we need to be creative and use more sophisticated methods such as missing data modeling.
Solution #1: Drop the Observation

In statistics, this method is called the listwise deletion technique.
In this solution, we drop the entire observation as long as it contains a missing value.
We perform this only if we are sure that the missing data is not informative. Otherwise, we should consider other solutions. (There could be other criteria for dropping observations.)
Solution #1: Drop the Observation

 For example, from the missing data histogram, we notice that only a minimal number of observations have over 35 features missing altogether.
 We may create a new dataset df_less_missing_rows by deleting observations with over 35 missing features, as in the sketch below.
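A minimal sketch, reusing the num_missing column computed for the histogram sketch above:

# drop observations with more than 35 missing features
ind_missing = df[df['num_missing'] > 35].index
df_less_missing_rows = df.drop(ind_missing, axis=0)
print(df_less_missing_rows.shape)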
Solution #2: Drop the Feature

Similar to Solution #1, we only do this when we are confident that this
feature doesn’t provide useful information.
For example, from the missing data % list, we notice
that hospital_beds_raion has a high missing value percentage of 47%.
We may drop the entire feature.
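A minimal sketch (the name of the new DataFrame is an assumption):

# drop a feature with a very high percentage of missing values
df_less_hos_beds_raion = df.drop(columns=['hospital_beds_raion'])
print(df_less_hos_beds_raion.shape)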
Solution #3: Impute the Missing

In statistics, imputation is the process of replacing missing data with substituted values.
When the feature is a numeric variable, we can conduct missing data imputation. We replace the missing values with the mean or median computed from the non-missing data of the same feature.
When the feature is a categorical variable, we may impute the missing data with the mode (the most frequent value).
Solution #3: Impute the Missing (numeric)
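A minimal sketch for a numeric feature, using the median (choosing life_sq as the example is an assumption):

# impute missing values of a numeric feature with its median
med = df['life_sq'].median()
df['life_sq'] = df['life_sq'].fillna(med)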
Solution #3: Impute the Missing (non-numeric)

We can apply the mode imputation strategy for all the categorical
features at once.
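A minimal sketch that applies mode imputation to every categorical (object) feature at once:

# impute each categorical feature with its most frequent value
for col in df.select_dtypes(include='object').columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])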
Solution #4: Replace the Missing

For categorical features, we can add a new category with a value such as “_MISSING_”. For numerical features, we can replace the missing values with a particular value such as -999.
This way, we are still keeping the missing values as valuable information.
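A minimal sketch (the choice of sub_area and build_year as example columns is an assumption):

# categorical feature: add an explicit missing-data category
df['sub_area'] = df['sub_area'].fillna('_MISSING_')

# numeric feature: replace missing values with a sentinel code
df['build_year'] = df['build_year'].fillna(-999)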
Irregular data (Outliers)

Outliers are data points that are distinctly different from the other observations. They could be real outliers or mistakes.
Depending on whether the feature is numeric or categorical, we can use different techniques to study its distribution and detect outliers.
Technique #1: Histogram/Box Plot

When the feature is numeric, we can use a histogram and box plot to
detect outliers.
Technique #1: Histogram/Box Plot

 To study the feature closer, let’s make a box plot.
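A minimal sketch of both plots for a numeric feature (choosing life_sq as the example is an assumption):

import matplotlib.pyplot as plt

# histogram of the feature
df['life_sq'].hist(bins=100)
plt.show()

# box plot of the same feature
df.boxplot(column=['life_sq'])
plt.show()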


Technique #2: Descriptive Statistics

 Also, for numeric features, the outliers could be so distinct/so far out of the expected range that the box plot can’t visualize them well. Instead, we can look at their descriptive statistics.
 For example, for the feature life_sq again, we can see that the maximum value is 7478, while the 75% quartile is only 43. The 7478 value is an outlier.
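A minimal sketch:

# count, mean, std, min, quartiles and max of the feature
print(df['life_sq'].describe())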
Technique #3: Bar Chart

 When the feature is categorical, we can use a bar chart to learn about its categories and distribution.
 For example, the feature ecology has a reasonable distribution. But if there is a category, such as “other”, with only a single occurrence, that would be an outlier.
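A minimal sketch:

import matplotlib.pyplot as plt

# bar chart of the categorical feature's value counts
df['ecology'].value_counts().plot.bar()
plt.show()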
What to do with outliers?

While outliers are not hard to detect, we have to determine the right solutions to handle them.
This highly depends on the dataset and the goal of the project.
The methods of handling outliers are somewhat similar to those for missing data.
We either drop, adjust, or keep them.
We can refer back to the missing data section for possible solutions.
Unnecessary data

• All the data feeding into the model should serve the purpose of the project.
• Data is unnecessary when it doesn’t add value.
• We cover three main types of unnecessary data, which arise for different reasons.
Unnecessary type #1: Uninformative / Repetitive

Sometimes a feature is uninformative because too many of its rows hold the same value.
How to find out? We can create a list of features with a high percentage of the same value, as in the sketch below.
The value_counts() function is used to get a Series containing counts of unique values. The resulting object will be in descending order, so the first element is the most frequently occurring value. It excludes NA values by default.
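A minimal sketch (the 95% threshold is an assumption):

num_rows = len(df)
low_information_cols = []

for col in df.columns:
    counts = df[col].value_counts(dropna=False)   # count NaN as a value too
    top_pct = counts.iloc[0] / num_rows           # share of rows holding the most common value
    if top_pct > 0.95:
        low_information_cols.append(col)
        print(f'{col}: {round(top_pct * 100, 1)}% identical values')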
Unnecessary type #1: Uninformative / Repetitive

• len() is a built-in function in Python. You can use len() to get the length of a given string, array, list, tuple, dictionary, etc.
• append() adds a passed object to the end of an existing list.

We need to understand the reasons behind the repetitive features. When they are genuinely uninformative, we can toss them out.
Unnecessary type #2: Irrelevant

 Again, the data needs to provide valuable information for the project. If the
features are not related to the question we are trying to solve in the project, they
are irrelevant.
 How to find out? We need to skim through the features to identify irrelevant
ones.
 Example: a feature recording the temperature in Toronto doesn’t provide any
useful insights to predict Russian housing prices.
 What to do? When the features are not serving the project’s goal, we can
remove them.
Unnecessary type #3: Duplicates

Duplicate data is when copies of the same observation exist.
There are two main types of duplicate data.
Duplicates type #1: All Features based
Duplicates type #2: Key Features based
Duplicates type #1: All Features based

 This duplicate happens when all the features’ values within the observations/records are the same. (It is easy to find.)
 We first remove the unique identifier id from the dataset.
 Then we create a dataset called df_dedupped by dropping the duplicates. We compare the shapes of the two datasets (df and df_dedupped) to find out the number of duplicated rows.

The drop() function is used to drop specified labels from rows or columns. Remove rows or columns by specifying label names and the corresponding axis, or by specifying index or column names directly. When using a multi-index, labels on different levels can be removed by specifying the level.
The drop_duplicates() function is used in analyzing duplicate data and removing them. The function basically helps in removing duplicates from the DataFrame.
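A minimal sketch:

# drop the unique identifier, then drop rows that are exact duplicates
df_dedupped = df.drop('id', axis=1).drop_duplicates()

# compare shapes to find the number of duplicated rows
print(df.shape)
print(df_dedupped.shape)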
Duplicates type #1: All Features based

10 rows are complete duplicate observations.
What to do? We should remove these duplicates, which we already did above.
Duplicates type #2: Key Features based

How to find out? Sometimes it is better to remove duplicate data based on a set of unique identifiers.
For example, the chances of two transactions happening at the same time, with the same square footage, the same price, and the same build year are close to zero/unlikely. (This uses the data scientist’s logical knowledge of the data.)
We can set up a group of critical features as unique identifiers for transactions. We include timestamp, full_sq, life_sq, floor, build_year, num_room, and price_doc. We check if there are duplicates based on them.
Duplicates type #2: Key Features based

fillna() manages and lets the user replace NaN/Null values with a value of their own.
The idea of groupby() is pretty simple: create groups of categories and apply a function to them.
The head() function is used to get the first n rows. This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.
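A minimal sketch of the check (fillna(-999) is a placeholder so rows with missing key values still group together):

key = ['timestamp', 'full_sq', 'life_sq', 'floor', 'build_year', 'num_room', 'price_doc']

# count how many rows share each key combination; counts above 1 are duplicates
grouped = df.fillna(-999).groupby(key)['id'].count().sort_values(ascending=False)
print(grouped.head(20))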

There are 16 duplicates based on this set of key features.
Duplicates type #2: Key Features based

What to do? We can drop these duplicates based on the key features.
We dropped the 16 duplicates and saved the result in a new dataset named df_dedupped2, as in the sketch below.
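A minimal sketch:

# drop duplicates based only on the key features
df_dedupped2 = df.drop_duplicates(subset=key)

# the difference in row counts should be the 16 duplicates
print(df.shape)
print(df_dedupped2.shape)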


Inconsistent data
• It is also crucial to have the dataset follow specific standards
to fit a model.
• We need to explore the data in different ways to find out the
inconsistent data.
• Much of the time, it depends on observations and
experience.
• There is no set code to run and fix them all.
• We will cover four inconsistent data types.
Inconsistent type #1: Capitalization

 Inconsistent usage of upper and lower case in categorical values is a common mistake. It could cause issues, since analysis in Python is case sensitive.
 How to find out? Let’s look at the sub_area feature as an example.
 Sometimes there is inconsistent capitalization within the same feature: the values “Poselenie Sosenskoe” and “pOseleNie sosenskeo” could refer to the same area.
Inconsistent type #1: Capitalization

What to do? To avoid this, we can convert all letters to lowercase (or uppercase), as in the sketch below.
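A minimal sketch (the new column name sub_area_lower is an assumption):

# standardize capitalization by lowercasing the whole feature
df['sub_area_lower'] = df['sub_area'].str.lower()
print(df['sub_area_lower'].value_counts(dropna=False).head())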
Inconsistent type #2: Formats

Another standardization we need to perform is of data formats.
One example is converting a feature from string to DateTime format.
DateTime formatting specifies how elements like day, month, year, hour, minutes, and seconds are displayed in a string. For example, displaying the date as DD-MM-YYYY is one format, and displaying the date as MM-DD-YYYY is another.
Inconsistent type #2: Formats

How to find out? The feature timestamp is in string format even though it represents dates.
What to do? We can convert it and extract the date or time values using the code below. After this, it’s easier to analyze the transaction volume grouped by either year or month.
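A minimal sketch (the '%Y-%m-%d' format string is an assumption about how the timestamps are stored):

# convert the timestamp string into a DateTime feature
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d')
df['year'] = df['timestamp_dt'].dt.year
df['month'] = df['timestamp_dt'].dt.month
df['weekday'] = df['timestamp_dt'].dt.weekday

# e.g. transaction volume per year
print(df['year'].value_counts().sort_index())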
Inconsistent type #3: Categorical Values

Inconsistent categorical values are another inconsistent type we cover.
A categorical feature has a limited number of valid values. Sometimes there may be other values due to reasons such as typos.
How to find out? We need to observe the feature to spot this inconsistency. Let’s show this with an example.
The example uses a different dataset than the Russian housing data.
Inconsistent type #3: Categorical Values

We use a new dataset below, since we don’t have such a problem in the real estate dataset.
For instance, the value of city was typed by mistake as “torontoo” and “tronto”, but both refer to the correct value “toronto”.
A simple way to identify them is fuzzy matching (or edit distance). It measures how many letters (the distance) we need to change in the spelling of one value to match another value.
Inconsistent type #3: Categorical Values

 We know that the categories should only have four values: “toronto”, “vancouver”, “montreal”, and “calgary”.
 We calculate the distance between all the values and the word “toronto” (and “vancouver”).
 We can see that the ones likely to be typos have a smaller distance from the correct word, since they only differ by a couple of letters.
Inconsistent type #3: Categorical Values

What to do? We can set criteria to convert these typos to the correct values.
For example, the sketch below sets all the values within a distance of 2 letters from “toronto” to be “toronto”.
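A minimal sketch with a hypothetical messy city column (the df_city_ex data and the simple Levenshtein helper are assumptions; a library such as nltk or fuzzywuzzy could be used instead):

import pandas as pd

def edit_distance(a, b):
    # classic Levenshtein distance via dynamic programming
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# hypothetical data with typos
df_city_ex = pd.DataFrame({'city': ['torontoo', 'toronto', 'tronto', 'vancouver',
                                    'vancover', 'vancouvr', 'montreal', 'calgary']})

# set every value within 2 letters of "toronto" to "toronto" (same idea for "vancouver")
msk = df_city_ex['city'].apply(lambda x: edit_distance(x, 'toronto')) <= 2
df_city_ex.loc[msk, 'city'] = 'toronto'
print(df_city_ex)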
Inconsistent type #4: Addresses
 The address feature can be a headache for many of us, because people entering the data into the database often don’t follow a standard format.
 How to find out? We can find messy address data by looking at it. Even if we can’t spot any issues, we can still run code to standardize it.
 The example uses the Russian housing data with an address column added that is not in the original data.
What to do? We run the code below to lowercase the letters, remove whitespace, delete periods, and standardize wordings.

df_add_ex['address_std'] = df_add_ex['address'].str.lower()
df_add_ex['address_std'] = df_add_ex['address_std'].str.strip()  # remove leading and trailing whitespace.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace('\\.', '', regex=True)  # remove periods.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace('\\bstreet\\b', 'st', regex=True)  # replace street with st.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace('\\bapartment\\b', 'apt', regex=True)  # replace apartment with apt.
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace('\\bav\\b', 'ave', regex=True)  # replace av with ave.

df_add_ex
Assignment 1

Download the Russian housing data.
Implement slides 3 and 4, but name your dataset as your name.
Then implement one of the following:
 Missing Data (one identification technique with one solution)
 Irregular Data (Outliers) (one identification technique with one solution)
 Unnecessary Data — Repetitive Data, Duplicates and more (one identification technique with one solution)
 Inconsistent Data — Capitalization, Addresses and more (one identification technique with one solution)
Turn in your code, a Word document with a summary/description of what you did, and a video with your camera on of you implementing and explaining the code (10-15 minutes, YouTube link).
