Lec4 SWN MC
From these results, we learn that the dataset has 30,471 rows and 292 columns.
We also identify whether the features are numeric or categorical variables. All of
this is useful information.
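For reference, a minimal sketch of how these checks can be reproduced with pandas (the file name here is a placeholder for the Russian housing data):

import pandas as pd

df = pd.read_csv('sberbank_housing.csv')  # placeholder path for the housing dataset

print(df.shape)                  # (30471, 292): number of rows and columns
print(df.dtypes.value_counts())  # how many numeric vs. object (categorical) features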
Now we can run through the checklist of "dirty" data types and fix them one by
one:
Missing Data
Irregular Data (Outliers)
Unnecessary Data — Repetitive Data, Duplicates and more
Inconsistent Data — Capitalization, Addresses and more
Missing data
Technique #2: Missing Data Percentage List
When there are many features in the dataset, we can make a list of the missing
data percentage for each feature.
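A minimal sketch of this list, assuming df is the DataFrame loaded above:

# percentage of missing values for each feature, from most to least missing
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.head(20))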
Technique #3: Missing Data Histogram
To learn more about the missing value patterns among observations, we can
visualize them with a histogram.
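A sketch of this histogram, again assuming df is the DataFrame loaded above:

import matplotlib.pyplot as plt

# number of missing features in each observation
missing_per_row = df.isnull().sum(axis=1)
missing_per_row.hist(bins=50)
plt.xlabel('number of missing features in an observation')
plt.ylabel('number of observations')
plt.show()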
Solution #1: Drop the Observation
For example, from the missing data histogram, we notice that only a minimal
number of observations have over 35 features missing.
We may create a new dataset df_less_missing_rows by deleting the observations
with over 35 missing features.
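A sketch of this step, reusing the per-observation missing counts from above:

# keep only the observations with 35 or fewer missing features
df_less_missing_rows = df[df.isnull().sum(axis=1) <= 35].copy()
print(df_less_missing_rows.shape)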
Solution #2: Drop the Feature
Similar to Solution #1, we only do this when we are confident that this
feature doesn’t provide useful information.
For example, from the missing data % list, we notice
that hospital_beds_raion has a high missing value percentage of 47%.
We may drop the entire feature.
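A one-line sketch of dropping the feature:

# hospital_beds_raion is missing for roughly 47% of observations, so we drop it
df_less_hospital_beds = df.drop(columns=['hospital_beds_raion'])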
Solution #3: Impute the Missing
We can apply the mode imputation strategy for all the categorical
features at once.
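A sketch of mode imputation for the categorical (object) features, assuming df from above:

# impute each categorical feature with its most frequent value (the mode)
cat_cols = df.select_dtypes(include='object').columns
for col in cat_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])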
Solution #4: Replace the Missing
For categorical features, we can add a new category with a value such as
“_MISSING_”. For numerical features, we can replace it with a
particular value such as -999.
This way, we are still keeping the missing values as valuable
information.
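A sketch of this replacement strategy, assuming df from above:

# flag missing values instead of guessing them
num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(include='object').columns

df[num_cols] = df[num_cols].fillna(-999)         # numeric features: a sentinel value
df[cat_cols] = df[cat_cols].fillna('_MISSING_')  # categorical features: a new category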
Irregular data (Outliers)
Technique #1: Histogram/Box Plot
When the feature is numeric, we can use a histogram and box plot to detect
outliers.
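A sketch of both plots for the life_sq feature:

import matplotlib.pyplot as plt

# histogram of a numeric feature
df['life_sq'].hist(bins=100)
plt.show()

# box plot of the same feature
df.boxplot(column=['life_sq'])
plt.show()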
Technique #2: Descriptive Statistics
Also, for numeric features, the outliers can be so far outside the expected range that the box
plot can't visualize them. Instead, we can look at their descriptive statistics.
For example, for the feature life_sq again, we can see that the maximum value is 7478, while the
75% quartile is only 43. The 7478 value is an outlier.
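A sketch of how to get these statistics:

# count, mean, std, min, quartiles and max of life_sq
print(df['life_sq'].describe())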
Technique #3: Bar Chart
When the feature is categorical, we can use a bar chart to learn about its categories and
distribution.
For example, the feature ecology has a reasonable distribution. But if there were a category that
appears only once, such as a category called "other" with a single value, that would be an outlier.
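A sketch of the bar chart for the ecology feature:

import matplotlib.pyplot as plt

# frequency of each category of a categorical feature
df['ecology'].value_counts().plot.bar()
plt.show()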
What to do with outliers?
While outliers are not hard to detect, we have to determine the right
solutions to handle them.
It highly depends on the dataset and the goal of the project.
The methods for handling outliers are somewhat similar to those for missing data:
we either drop, adjust, or keep them.
We can refer back to the missing data section for possible solutions.
Unnecessary data
• All the data feeding into the model should serve the purpose
of the project.
• Data is unnecessary when it doesn't add value to the project.
• We cover three main types of unnecessary data, each arising for a
different reason.
Unnecessary type #1: Uninformative / Repetitive
Again, the data needs to provide valuable information for the project. If a feature is
not related to the question we are trying to answer, it is irrelevant.
How to find out? We need to skim through the features to identify irrelevant
ones.
Example: a feature recording the temperature in Toronto doesn’t provide any
useful insights to predict Russian housing prices.
What to do? When the features are not serving the project’s goal, we can
remove them.
Unnecessary type #3: Duplicates
Duplicates type #1: All Features based
This type of duplicate happens when all the features' values within an
observation/record are the same. (It is easy to find.)
We first remove the unique identifier id in the dataset.
Then we create a dataset called df_dedupped by dropping the duplicates. We
compare the shapes of the two datasets (df and df_dedupped) to find out the
number of duplicated rows.
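A sketch of these steps, assuming df is the DataFrame loaded earlier:

# drop the unique identifier, then remove fully duplicated rows
df_dedupped = df.drop(columns=['id']).drop_duplicates()

# the difference in row counts is the number of duplicated rows
print(df.shape)
print(df_dedupped.shape)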
fillna() lets the user replace NaN/Null values with some value of their own.
The idea of groupby() is pretty simple: create groups of categories and apply a
function to them.
The head() function is used to get the first n rows. This function returns the first n
rows of the object based on position. It is useful for quickly testing whether your object
has the right type of data in it.
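A small illustration of these three functions on a toy DataFrame (the data here is made up):

import pandas as pd

df_demo = pd.DataFrame({'city': ['toronto', 'toronto', 'calgary', None],
                        'price': [100, None, 80, 90]})

# fillna(): replace NaN/Null values with a value of our own
df_demo['price'] = df_demo['price'].fillna(df_demo['price'].median())
df_demo['city'] = df_demo['city'].fillna('_MISSING_')

# groupby(): create groups of categories and apply a function to them
print(df_demo.groupby('city')['price'].mean())

# head(): get the first n rows to quickly inspect the data
print(df_demo.head(2))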
Duplicates type #2: Key Features based
There are 16 duplicates based on this set of key features.
What to do? We can drop these duplicates based on the key features.
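A sketch of this step; the exact key features used in the lecture are an assumption here:

# hypothetical set of key features; adjust to the columns used in the lecture
key = ['timestamp', 'full_sq', 'life_sq', 'floor', 'build_year', 'num_room', 'price_doc']

df_dedupped2 = df.drop_duplicates(subset=key)

print(df.shape)
print(df_dedupped2.shape)  # fewer rows: the key-feature duplicates were dropped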
Inconsistent type #1: Capitalization
What to do? To avoid inconsistent capitalization in categorical values, we can put all
letters in lower case (or upper case).
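For example, assuming a categorical column such as sub_area (the column choice is an assumption), we can normalize the case:

# put all letters in lower case so 'Toronto' and 'toronto' become one category
df['sub_area_lower'] = df['sub_area'].str.lower()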
Inconsistent type #2: Formats
Inconsistent type #3: Categorical Values
We use a new example dataset below, since we don't have this problem in the real estate
dataset.
For instance, the value of city was typed by mistake as "torontoo" and
"tronto". Both refer to the correct value "toronto".
A simple way to identify them is fuzzy matching (also known as edit distance). It
measures how many letters (the distance) we need to change in the spelling of
one value to match another value.
We know that the categories should only have four values: "toronto", "vancouver", "montreal",
and "calgary".
We calculate the distance between all the values and the words "toronto" (and "vancouver").
We can see that the values likely to be typos have a smaller distance to the correct word, since
they only differ by a couple of letters.
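A sketch of this calculation; the example values and the plain-Python edit distance function below are illustrative assumptions:

import pandas as pd

def edit_distance(a, b):
    # Levenshtein distance: number of single-letter edits to turn a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# hypothetical example dataset with misspelled city names
df_city_ex = pd.DataFrame({'city': ['torontoo', 'toronto', 'tronto',
                                    'vancouver', 'vancover', 'montreal', 'calgary']})

df_city_ex['distance_toronto'] = df_city_ex['city'].apply(lambda c: edit_distance(c, 'toronto'))
df_city_ex['distance_vancouver'] = df_city_ex['city'].apply(lambda c: edit_distance(c, 'vancouver'))
print(df_city_ex)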
What to do? We can set criteria to convert these typos to the correct
values.
For example, the code below sets all the values within a distance of 2 letters
from "toronto" to "toronto".
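A sketch of such a rule, reusing the distances computed above:

# set any value within a distance of 2 letters from 'toronto' to 'toronto'
msk = df_city_ex['distance_toronto'] <= 2
df_city_ex.loc[msk, 'city'] = 'toronto'

# the same rule can be applied for 'vancouver'
msk = df_city_ex['distance_vancouver'] <= 2
df_city_ex.loc[msk, 'city'] = 'vancouver'
print(df_city_ex)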
Inconsistent type #4: Addresses
The address feature can be a headache for many of us, because people entering
the data into the database often don't follow a standard format.
How to find out? We can find messy address data by
looking at it. Even though sometimes we can’t spot any issues,
we can still run code to standardize them.
Example: the Russian housing data with an address column added; this column
is not in the original data.
What to do? We run the below code to lowercase the letters, remove
white space, delete periods and standardize wordings.
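Since the address column was added for teaching purposes, a hypothetical construction of df_add_ex (the addresses are made up for illustration) could look like this:

import pandas as pd

df_add_ex = pd.DataFrame({'address': ['123 MAIN Street Apartment 5 ',
                                      '123 main st. apt 5',
                                      '543 FirSt Av.',
                                      ' 543 first ave']})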
df_add_ex['address_std'] = df_add_ex['address'].str.lower()
df_add_ex['address_std'] = df_add_ex['address_std'].str.strip()  # remove leading and trailing whitespace
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace(r'\.', '', regex=True)  # remove periods
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace(r'\bstreet\b', 'st', regex=True)  # replace street with st
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace(r'\bapartment\b', 'apt', regex=True)  # replace apartment with apt
df_add_ex['address_std'] = df_add_ex['address_std'].str.replace(r'\bav\b', 'ave', regex=True)  # replace av with ave
df_add_ex
Assignment 1