
MIRPUR UNIVERSITY OF SCIENCE AND TECHNOLOGY (MUST), MIRPUR

DEPARTMENT OF COMPUTER SYSTEMS ENGINEERING (CSE)


Artificial Intelligence
(CSE-471)

Lec [5]

Dr. Ashfaq Ahmed


(Assistant Professor)
Feature Engineering

Example of an engineered feature: the Debt to Income (DTI) ratio, derived by dividing a borrower's total debt by their income.
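A minimal sketch of how such a ratio feature can be derived from two raw columns (the column names monthly_debt and monthly_income are hypothetical):

import pandas as pd

# Hypothetical raw columns: monthly debt payments and gross monthly income
data = pd.DataFrame({'monthly_debt': [850, 1200, 400],
                     'monthly_income': [4000, 3000, 5200]})

# Engineered feature: debt-to-income ratio
data['dti'] = data['monthly_debt'] / data['monthly_income']
print(data)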

[Figure: Breakdown of time spent by data scientists (source: Forbes)]

Feature Engineering

1. Brainstorm which features are relevant
2. Decide which features might improve model performance
3. Create new features
4. Determine whether the new features improve model performance; if not, drop them (see the sketch below)
5. Go back to Step 1 until the model's performance meets expectations
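A minimal sketch of Steps 3 and 4, assuming scikit-learn is available; the data and column names here are hypothetical:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical data: two raw columns and a numeric target
data = pd.DataFrame({'price': rng.uniform(1, 100, 50),
                     'weight': rng.uniform(1, 10, 50)})
data['target'] = data['price'] / data['weight'] + rng.normal(0, 0.1, 50)

# Step 3: create a candidate feature from the raw columns
data['price_per_gram'] = data['price'] / data['weight']

# Step 4: compare cross-validated scores with and without the new feature
base_cols = ['price', 'weight']
new_cols = base_cols + ['price_per_gram']
base_score = cross_val_score(LinearRegression(), data[base_cols], data['target']).mean()
new_score = cross_val_score(LinearRegression(), data[new_cols], data['target']).mean()

# Keep the feature only if it improves performance; otherwise drop it (Step 4)
if new_score <= base_score:
    data = data.drop(columns=['price_per_gram'])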

Feature Engineering

Imputation
Outlier management
One-hot encoding
Log transform
Scaling
Date manipulation
Imputation

Removing rows with missing values

Imputation
Removing rows with missing values: code

import pandas as pd

data = pd.read_csv("dataset.csv")
threshold = 0.6

# Drop columns with a missing value rate higher than the threshold
data = data[data.columns[data.isnull().mean() < threshold]]

# Drop rows with a missing value rate higher than the threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
print(data)

Numerical Imputation

# Two alternative strategies are shown; in practice, pick one.

# Filling all missing values with 0
data = data.fillna(0)

# Filling missing values with the median of each numeric column
data = data.fillna(data.median(numeric_only=True))

print(data)

Categorical Imputation

# Fill missing categorical values with the most frequent category (the mode)

import pandas as pd

data = pd.read_csv("dataset.csv")
data['color'] = data['color'].fillna(data['color'].value_counts().idxmax())
print(data)

Outlier management

Outlier management

# Dropping outlier rows using the standard deviation

import pandas as pd

data = pd.read_csv("train1.csv")

factor = 2
upper_lim = data['battery_power'].mean() + data['battery_power'].std() * factor
lower_lim = data['battery_power'].mean() - data['battery_power'].std() * factor

# Keep only rows within `factor` standard deviations of the mean
data = data[(data['battery_power'] < upper_lim) & (data['battery_power'] > lower_lim)]
print(data)

Outlier management

# Dropping outlier rows using percentiles

upper_lim = data['battery_power'].quantile(.99)
lower_lim = data['battery_power'].quantile(.01)
data = data[(data['battery_power'] < upper_lim) & (data['battery_power'] > lower_lim)]
print(data)

Outlier management

# Capping the outlier values with percentiles

upper_lim = data['battery_power'].quantile(.99)
lower_lim = data['battery_power'].quantile(.01)
data.loc[data['battery_power'] > upper_lim, 'battery_power'] = upper_lim
data.loc[data['battery_power'] < lower_lim, 'battery_power'] = lower_lim
print(data)

One-hot encoding

One-hot encoding

import pandas as pd

data = pd.read_csv("dataset.csv")

# Create one binary column per category, then drop the original column
encoded_columns = pd.get_dummies(data['color'])
data = data.join(encoded_columns).drop('color', axis=1)
print(data)

Log transform

# Log transform example

import numpy as np
import pandas as pd

data = pd.DataFrame({'value': [3, 67, -17, 44, 37, 3, 31, -38]})

# Naive log(x + 1): produces NaN for values <= -1
data['log+1'] = (data['value'] + 1).transform(np.log)

# Handling negative values: shift so the minimum becomes 1 before taking the log
# Note that the resulting values differ from log(x + 1)
data['log'] = (data['value'] - data['value'].min() + 1).transform(np.log)
print(data)

Log transformation

Scaling

# Min-max normalization: rescale values into the [0, 1] range
data['normalized'] = (data['value'] - data['value'].min()) / (data['value'].max() - data['value'].min())

print(data)

Standardization

import pandas as pd

data = pd.DataFrame({'value': [7, 25, -47, 73, 8, 22, 53, -25]})

# Z-score standardization: subtract the mean, divide by the standard deviation
data['standardized'] = (data['value'] - data['value'].mean()) / data['value'].std()

print(data)

Date manipulation

Time features can be of critical importance for some data science problems.

Dates without any processing provide little value to most models: the raw values are too unique to carry any predictive power. Why is 10/21/2019 different from 10/19/2019? By applying some domain knowledge, we can greatly increase the information value of the feature.

For example, converting the date to a categorical variable might help. If the target is to determine when rent will be paid, convert the date to a binary value (see the sketch below):
Before the 5th of the month = 1
After the 5th of the month = 0
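A minimal sketch of this conversion, assuming pandas and a hypothetical payment_date column:

import pandas as pd

# Hypothetical rent-payment dates
data = pd.DataFrame({'payment_date': pd.to_datetime(
    ['2019-10-03', '2019-10-21', '2019-11-05'])})

# Binary feature: 1 if the date falls before the 5th of the month, else 0
data['before_5th'] = (data['payment_date'].dt.day < 5).astype(int)
print(data)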
THANKS
