
MIRPUR UNIVERSITY OF SCIENCE AND TECHNOLOGY (MUST), MIRPUR

DEPARTMENT OF COMPUTER SYSTEMS ENGINEERING (CSE)


Artificial Intelligence
(CSE-471)

Lec [5]

Dr. Ashfaq Ahmed


(Assistant Professor)
Feature Engineering

Example of an engineered feature: the Debt to Income (DTI) ratio, derived by dividing a borrower's total debt by their income.
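A minimal sketch of how such a ratio feature can be derived from two raw columns (the column names monthly_debt and monthly_income are hypothetical):

import pandas as pd

# Hypothetical raw columns: monthly debt payments and gross monthly income
data = pd.DataFrame({'monthly_debt': [850, 1200, 400],
                     'monthly_income': [4000, 3000, 5200]})

# Engineered feature: debt-to-income ratio
data['dti'] = data['monthly_debt'] / data['monthly_income']
print(data)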

[Figure: Breakdown of time spent by data scientists (source: Forbes)]

Feature Engineering

1. Brainstorm which features are relevant
2. Decide which features might improve model performance
3. Create new features
4. Determine whether the new features improve model performance; if not, drop them (see the sketch below)
5. Go back to Step 1 until the model's performance meets expectations
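A minimal sketch of Steps 3 and 4, assuming scikit-learn is available; the data and column names here are hypothetical:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical data: two raw columns and a numeric target
data = pd.DataFrame({'price': rng.uniform(1, 100, 50),
                     'weight': rng.uniform(1, 10, 50)})
data['target'] = data['price'] / data['weight'] + rng.normal(0, 0.1, 50)

# Step 3: create a candidate feature from the raw columns
data['price_per_gram'] = data['price'] / data['weight']

# Step 4: compare cross-validated scores with and without the new feature
base_cols = ['price', 'weight']
new_cols = base_cols + ['price_per_gram']
base_score = cross_val_score(LinearRegression(), data[base_cols], data['target']).mean()
new_score = cross_val_score(LinearRegression(), data[new_cols], data['target']).mean()

# Keep the feature only if it improves performance; otherwise drop it (Step 4)
if new_score <= base_score:
    data = data.drop(columns=['price_per_gram'])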

Feature Engineering

Imputation
Outlier management
One-hot encoding
Log transform
Scaling
Date manipulation
Imputation

Removing rows with missing values

Imputation
Removing rows with missing values: code

import pandas as pd

data = pd.read_csv("dataset.csv")
threshold = 0.6

# Drop columns with a missing value rate higher than the threshold
data = data[data.columns[data.isnull().mean() < threshold]]

# Drop rows with a missing value rate higher than the threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
print(data)

Numerical Imputation

# Two alternative strategies are shown; in practice, pick one.

# Filling all missing values with 0
data = data.fillna(0)

# Filling missing values with the median of each numeric column
data = data.fillna(data.median(numeric_only=True))

print(data)

Categorical Imputation

# Fill missing categorical values with the most frequent category (the mode)

import pandas as pd

data = pd.read_csv("dataset.csv")
data['color'] = data['color'].fillna(data['color'].value_counts().idxmax())
print(data)

Outlier management

Outlier management

# Dropping outlier rows using the standard deviation

import pandas as pd

data = pd.read_csv("train1.csv")

factor = 2
upper_lim = data['battery_power'].mean() + data['battery_power'].std() * factor
lower_lim = data['battery_power'].mean() - data['battery_power'].std() * factor

# Keep only rows within `factor` standard deviations of the mean
data = data[(data['battery_power'] < upper_lim) & (data['battery_power'] > lower_lim)]
print(data)

Outlier management

# Dropping outlier rows using percentiles

upper_lim = data['battery_power'].quantile(.99)
lower_lim = data['battery_power'].quantile(.01)
data = data[(data['battery_power'] < upper_lim) & (data['battery_power'] > lower_lim)]
print(data)

Outlier management

# Capping the outlier values with percentiles

upper_lim = data['battery_power'].quantile(.99)
lower_lim = data['battery_power'].quantile(.01)
data.loc[data['battery_power'] > upper_lim, 'battery_power'] = upper_lim
data.loc[data['battery_power'] < lower_lim, 'battery_power'] = lower_lim
print(data)

One-hot encoding

One-hot encoding

import pandas as pd

data = pd.read_csv("dataset.csv")

# Create one binary column per category, then drop the original column
encoded_columns = pd.get_dummies(data['color'])
data = data.join(encoded_columns).drop('color', axis=1)
print(data)

Log transform

# Log transform example

import numpy as np
import pandas as pd

data = pd.DataFrame({'value': [3, 67, -17, 44, 37, 3, 31, -38]})

# Naive log(x + 1): produces NaN for values <= -1
data['log+1'] = (data['value'] + 1).transform(np.log)

# Handling negative values: shift so the minimum becomes 1 before taking the log
# Note that the resulting values differ from log(x + 1)
data['log'] = (data['value'] - data['value'].min() + 1).transform(np.log)
print(data)

Log transformation

Scaling

# Min-max normalization: rescale values into the [0, 1] range
data['normalized'] = (data['value'] - data['value'].min()) / (data['value'].max() - data['value'].min())

print(data)

Standardization

import pandas as pd

data = pd.DataFrame({'value': [7, 25, -47, 73, 8, 22, 53, -25]})

# Z-score standardization: subtract the mean, divide by the standard deviation
data['standardized'] = (data['value'] - data['value'].mean()) / data['value'].std()

print(data)

Date manipulation

Time features can be of critical importance for some data science problems.

Dates without any processing provide little value to most models: the raw values are too unique to carry any predictive power. Why is 10/21/2019 different from 10/19/2019? By applying some domain knowledge, we can greatly increase the information value of the feature.

For example, converting the date to a categorical variable might help. If the target is to determine when rent will be paid, convert the date to a binary value (see the sketch below):
Before the 5th of the month = 1
After the 5th of the month = 0
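A minimal sketch of this conversion, assuming pandas and a hypothetical payment_date column:

import pandas as pd

# Hypothetical rent-payment dates
data = pd.DataFrame({'payment_date': pd.to_datetime(
    ['2019-10-03', '2019-10-21', '2019-11-05'])})

# Binary feature: 1 if the date falls before the 5th of the month, else 0
data['before_5th'] = (data['payment_date'].dt.day < 5).astype(int)
print(data)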
THANKS
