Mirpur University of Science and Technology (MUST), Mirpur, Department of Computer Systems Engineering (CSE)
Lec [5]
Artificial Intelligence 3
Feature Engineering
Imputation
Outlier management
One-hot encoding
Log transform
Scaling
Date manipulation
Imputation
Removing rows and columns with missing values -> code
threshold = 0.6
# Drop columns with a missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold]]
# Drop rows with a missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
Numerical Imputation
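# Option 1: fill every missing value with 0
data = data.fillna(0)
# Option 2 (use instead of option 1, not after it): fill each numeric
# column's missing values with that column's median
data = data.fillna(data.median(numeric_only=True))
print(data)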
Categorical Imputation
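One common way to impute a categorical column is to fill the gaps with its most frequent value. The sketch below is only an illustration under assumptions: the DataFrame and the 'color' column are made up, and the most frequent category may not be the right choice for every dataset.

```python
import pandas as pd

# Hypothetical data: 'color' is an illustrative column with two missing values.
data = pd.DataFrame({"color": ["red", "blue", None, "red", None]})

# value_counts() ignores missing values, so idxmax() returns the
# most frequent observed category.
most_frequent = data["color"].value_counts().idxmax()

# Fill the missing entries with that category.
data["color"] = data["color"].fillna(most_frequent)
print(data)
```

When no category clearly dominates, filling with a separate label such as "Other" may be a safer alternative.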
Outlier management
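As a rough illustration of outlier management, the sketch below flags outliers with the interquartile range (IQR) rule and then either drops or caps them. The data, the 'value' column name, and the 1.5 multiplier are assumptions chosen for the example, not something prescribed by the lecture.

```python
import pandas as pd

# Hypothetical data: six typical values and one obvious outlier.
data = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 100]})

# Compute the IQR fences.
q1 = data["value"].quantile(0.25)
q3 = data["value"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: drop the rows that fall outside the fences.
filtered = data[data["value"].between(lower, upper)]

# Option 2: keep every row but cap the values at the fences.
capped = data["value"].clip(lower=lower, upper=upper)

print(filtered)
print(capped)
```

Dropping shrinks the dataset, while capping keeps its size but changes the extreme values; which is preferable depends on the problem.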
One-hot encoding
import pandas as pd
data = pd.read_csv("dataset.csv")
encoded_columns = pd.get_dummies(data['color'])
data = data.join(encoded_columns).drop('color', axis=1)
print(data)
Log transform
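A log transform compresses large values and reduces the skew of a long-tailed feature. The following is a minimal sketch on made-up data; the 'value' column and the choice of `np.log1p` (which computes log(1 + x) so that zeros are handled) are assumptions, and the approach only applies to non-negative values.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed, non-negative data.
data = pd.DataFrame({"value": [0, 9, 99, 999]})

# log1p(x) = log(1 + x): defined at zero and order-preserving.
data["log_value"] = np.log1p(data["value"])
print(data)
```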
Scaling
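# Min-max normalization: rescale 'value' to the range [0, 1]
data['normalized'] = (data['value'] - data['value'].min()) / (data['value'].max() - data['value'].min())
print(data)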
Standardization
data['standardized'] = (data['value'] - data['value'].mean()) / data['value'].std()
print(data)
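To see how the two kinds of scaling differ, the sketch below applies min-max normalization and standardization (z-score) to the same small column. The data and column names are made up for illustration; note that pandas `std()` uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical data.
data = pd.DataFrame({"value": [2, 4, 6, 8, 10]})
col = data["value"]

# Min-max normalization: results lie in [0, 1].
data["normalized"] = (col - col.min()) / (col.max() - col.min())

# Standardization: results have mean 0 and standard deviation 1.
data["standardized"] = (col - col.mean()) / col.std()
print(data)
```

Normalization is bounded but sensitive to extreme values, since a single outlier stretches the range; standardization is unbounded and is affected by outliers less directly.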
Date manipulation
Time features can be critically important for some data science problems.
Raw dates rarely carry much signal for most models, because the values are
too unique to have any predictive power. Why is 10/21/2019 different from
10/19/2019? Applying some domain knowledge can greatly increase the
information value of the feature.
For example, converting the date to a categorical variable might help. If you
are trying to predict when rent is going to be paid, convert the date to a
binary value where the possible values are:
Before the 5th of the month = 1
After the 5th of the month = 0
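The rent example above can be sketched in pandas as follows. The sample dates and the column names 'date', 'before_5th', and 'day_of_week' are assumptions for illustration; the 5th itself is counted as 0 here, which is a choice the rule above leaves open.

```python
import pandas as pd

# Hypothetical dates in month/day/year format.
data = pd.DataFrame({"date": ["10/03/2019", "10/19/2019", "10/21/2019"]})

# Parse the strings into real datetime values.
data["date"] = pd.to_datetime(data["date"], format="%m/%d/%Y")

# Binary feature: 1 if the day falls before the 5th of the month, else 0.
data["before_5th"] = (data["date"].dt.day < 5).astype(int)

# Other parts of the date can be extracted the same way.
data["month"] = data["date"].dt.month
data["day_of_week"] = data["date"].dt.day_name()
print(data)
```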
THANKS