Programming for Data
Analysis
Week 8
Dr. Ferdin Joe John Joseph
Faculty of Information Technology
Thai – Nichi Institute of Technology, Bangkok
Today’s lesson
Faculty of Information Technology, Thai - Nichi Institute of
Technology, Bangkok
• Feature Engineering
• Feature Selection
• Feature Construction
• Laboratory
Importance
Features
Real Data features
Feature Engineering
• Feature engineering is the process of using domain knowledge to
extract features from raw data via data mining techniques.
• These features can be used to improve the performance of machine
learning algorithms.
Features
• A feature is an attribute or property shared by all of the independent
units on which analysis or prediction is to be done. Any attribute
could be a feature, as long as it is useful to the model.
• The purpose of a feature, other than being an attribute, would be
much easier to understand in the context of a problem. A feature is a
characteristic that might help when solving the problem.
Process of Feature Engineering
• Brainstorming or testing features
• Deciding what features to create
• Creating features
• Checking how the features work with your model
• Improving your features if needed
• Going back to brainstorming/creating more features until the work is done
Techniques in Feature Engineering
• Imputation
• Handling Outliers
• Binning
• Log Transform
• One-Hot Encoding
• Grouping Operations
• Feature Split
• Scaling
• Extracting Date
Imputation
• Missing values are one of the most common problems you can
encounter when you try to prepare your data for machine learning.
• The reason for the missing values might be human errors,
interruptions in the data flow, privacy concerns, and so on.
• This affects the performance of machine learning models
Imputation
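• Dropping rows or columns with missing values is the simplest fix, but
it discards data and can reduce model performance
• A common rule of thumb is a completeness threshold of 70%
• That is, remove columns in which more than 30% of the values are missing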
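The threshold rule above can be sketched in plain Python (standard library only). The table, its column names, and the use of None to mark a missing value are all hypothetical, chosen only for illustration.

```python
# Hypothetical toy table: column name -> list of values; None marks a missing value.
data = {
    "age": [25, None, 31, 40, 22, 35, None, 28, 30, 27],          # 80% present
    "income": [None, None, None, 52000, None,
               61000, None, 48000, 50000, 45000],                 # 50% present
    "city": ["Bangkok", "Osaka", "Bangkok", None, "Tokyo",
             "Osaka", "Tokyo", "Bangkok", "Osaka", "Tokyo"],      # 90% present
}

THRESHOLD = 0.7  # keep a column only if at least 70% of its values are present


def present_ratio(values):
    """Return the fraction of entries that are not missing."""
    return sum(v is not None for v in values) / len(values)


# Keep only the columns that meet the completeness threshold.
kept = {name: values for name, values in data.items()
        if present_ratio(values) >= THRESHOLD}

print(sorted(kept))  # ['age', 'city']
```

The same ratio can be computed per row instead of per column if rows, rather than columns, are to be dropped.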
Numerical Imputation
• Fill missing values with a constant (for example 0)
• Fill missing values with a statistic of the column, such as the mean or median
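Both options might look as follows in plain Python. The column, the constant 0, and the choice of the median as the statistic are assumptions made for this sketch.

```python
from statistics import median

ages = [25, None, 31, 40, None, 22]  # hypothetical column; None marks a missing value

# Option 1: fill with a constant.
filled_constant = [0 if v is None else v for v in ages]

# Option 2: fill with a statistic computed from the observed values only.
observed = [v for v in ages if v is not None]
fill_value = median(observed)  # the median is less sensitive to outliers than the mean
filled_median = [fill_value if v is None else v for v in ages]

print(filled_constant)  # [25, 0, 31, 40, 0, 22]
print(filled_median)    # [25, 28.0, 31, 40, 28.0, 22]
```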
Categorical imputation
• Replace missing values with the most frequent value (the mode) in that column
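A minimal sketch of mode imputation, assuming a hypothetical list of city names with None as the missing-value marker:

```python
from collections import Counter

cities = ["Bangkok", None, "Osaka", "Bangkok", None, "Tokyo"]  # hypothetical column

# Count only the observed values, then take the most frequent one.
observed = [c for c in cities if c is not None]
most_common_value = Counter(observed).most_common(1)[0][0]

filled = [most_common_value if c is None else c for c in cities]
print(filled)  # ['Bangkok', 'Bangkok', 'Osaka', 'Bangkok', 'Bangkok', 'Tokyo']
```

When no single value clearly dominates, filling with a separate category such as "Other" may be a safer choice than the mode.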
Handling Outliers
• The best way to detect outliers is to visualize the data
Statistical ways to handle outliers
• Standard Deviation
• Percentiles
Handling outliers – Standard Deviation
• If a value lies farther from the mean than x * standard deviation,
it can be treated as an outlier.
• A factor x between 2 and 4 is practical. The z-score, which expresses this
distance in units of standard deviation, can also be used.
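The rule can be sketched as follows. The data and the factor of 2 are assumptions for this example; with very small samples a single extreme value inflates the standard deviation, so a larger factor may fail to flag it.

```python
from statistics import mean, stdev

values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]  # hypothetical measurements
factor = 2  # the "x" in the rule; an assumed value within the practical range

m = mean(values)
s = stdev(values)  # sample standard deviation

# A value is flagged when its distance from the mean exceeds factor * s.
# Equivalently, its absolute z-score abs(v - m) / s exceeds factor.
outliers = [v for v in values if abs(v - m) > factor * s]
cleaned = [v for v in values if abs(v - m) <= factor * s]

print(outliers)  # [95]
```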
Handling Outliers - Percentile
• If your data ranges from 0 to 100, your top 5% is not necessarily the
values between 96 and 100.
• Here, the top 5% means the values that lie above the 95th percentile of
the data.
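To illustrate the difference, here is a small sketch using a nearest-rank percentile written by hand. The helper name `percentile` and the skewed data are hypothetical; library implementations usually interpolate and may return slightly different cut-offs.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, written without floating point.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical skewed data: the range is 0 to 100, but most values are small.
values = [0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 12, 15, 20, 30, 60, 100]

upper = percentile(values, 95)
top = [v for v in values if v > upper]         # values above the 95th percentile
capped = [min(v, upper) for v in values]       # alternative: cap instead of dropping

print(upper)  # 60
print(top)    # [100]
```

The cut-off here is 60, not 95, because the percentile depends on how the values are distributed rather than on the range of the data.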
Binning
• Binning can be applied to both numerical and categorical data
• Numerical values are grouped into ranges; categorical values are grouped
into broader categories
Binning - Example
#Numerical Binning Example
Value Bin
0-30 -> Low
31-70 -> Mid
71-100 -> High
#Categorical Binning Example
Value Bin
Spain -> Europe
Italy -> Europe
Chile -> South America
Brazil -> South America
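The two mappings above could be written in plain Python as below. The function name `score_bin`, the sample inputs, and the "Other" fallback for unknown countries are assumptions added for this sketch.

```python
def score_bin(value):
    """Map a number in 0..100 to the Low / Mid / High bins from the example."""
    if value <= 30:
        return "Low"
    if value <= 70:
        return "Mid"
    return "High"


# Categorical binning is a lookup from a fine category to a coarser one.
country_to_continent = {
    "Spain": "Europe",
    "Italy": "Europe",
    "Chile": "South America",
    "Brazil": "South America",
}

scores = [12, 30, 31, 70, 71, 100]     # hypothetical numerical column
countries = ["Spain", "Chile", "Peru"]  # hypothetical categorical column

score_bins = [score_bin(v) for v in scores]
continents = [country_to_continent.get(c, "Other") for c in countries]

print(score_bins)  # ['Low', 'Low', 'Mid', 'Mid', 'High', 'High']
print(continents)  # ['Europe', 'South America', 'Other']
```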
Motivation of binning
• Make the model robust
• Prevent overfitting
Log Transform
• Logarithmic Transformation
• The data to which you apply a log transform must contain only positive
values; otherwise you receive an error.
• You can also add 1 to the data before transforming it, that is, use log(x + 1).
• For non-negative data, this ensures that the output of the transformation
is defined and non-negative.
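A short sketch of the log(x + 1) variant, assuming a hypothetical column of non-negative, right-skewed values:

```python
import math

incomes = [0, 9, 99, 999, 9999]  # hypothetical non-negative, right-skewed values

# math.log(0) raises ValueError, so the plain log cannot be applied to this column.
# math.log1p(x) computes log(1 + x), which maps 0 to 0.0 and keeps the order of values.
log_incomes = [math.log1p(v) for v in incomes]

for raw, transformed in zip(incomes, log_incomes):
    print(raw, round(transformed, 3))
```

Values that differ by factors of ten in the raw column end up roughly evenly spaced after the transform, which is the compression effect the technique is used for.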
One hot encoding
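As a minimal sketch of the idea: one-hot encoding replaces a categorical column with one 0/1 column per distinct category. The city names are hypothetical, and the alphabetical column order is a choice made only for this example.

```python
cities = ["Bangkok", "Tokyo", "Osaka", "Tokyo"]  # hypothetical categorical column

# One output column per distinct category, in a fixed (sorted) order.
categories = sorted(set(cities))

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if city == category else 0 for category in categories]
           for city in cities]

print(categories)  # ['Bangkok', 'Osaka', 'Tokyo']
for row in one_hot:
    print(row)
```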
Grouping Operations
• Categorical Column Grouping
• Numerical Column Grouping
Categorical Column Grouping
Numerical Column Grouping
• Numerical columns are grouped using sum and mean functions in
most of the cases.
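A sketch of grouping a numerical column by a key and aggregating it with sum and mean. The (city, amount) rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (city, purchase amount)
rows = [
    ("Bangkok", 120),
    ("Tokyo", 80),
    ("Bangkok", 60),
    ("Tokyo", 110),
    ("Osaka", 50),
]

# Collect the amounts that belong to each city.
groups = defaultdict(list)
for city, amount in rows:
    groups[city].append(amount)

# Aggregate each group into one row of summary features.
summary = {city: {"sum": sum(amounts), "mean": mean(amounts)}
           for city, amounts in groups.items()}

print(summary["Bangkok"])  # {'sum': 180, 'mean': 90}
```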
Feature Split
• Splitting features is a good way to make them useful for machine
learning.
• By extracting the usable parts of a column into new features, we:
• Enable machine learning algorithms to comprehend them.
• Make it possible to bin and group them.
• Improve model performance by uncovering potential information.
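Two typical splits, sketched with hypothetical data: a full name into first and last name, and a "Title (Year)" string into a title and a numeric year. The sketch assumes every title ends with a year in parentheses.

```python
names = ["Luther N. Gonzalez", "Charles M. Young", "Terry Lawson"]  # hypothetical
titles = ["Toy Story (1995)", "Heat (1995)", "Alien (1979)"]        # hypothetical

# Split on spaces: the first token is the first name, the last token the last name.
first_names = [name.split(" ")[0] for name in names]
last_names = [name.split(" ")[-1] for name in names]

# Split once from the right at " (" and strip the closing parenthesis from the year.
movie_titles = []
movie_years = []
for text in titles:
    title, year = text.rsplit(" (", 1)
    movie_titles.append(title)
    movie_years.append(int(year.rstrip(")")))

print(first_names, last_names)
print(movie_titles, movie_years)
```

The extracted year is now a number, so it can be binned or grouped like any other numerical column.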
Scaling
• In real data, numerical columns such as age and income rarely share
the same range.
• Scaling brings them onto a comparable range.
• Algorithms based on distance calculations, such as k-NN or k-Means,
in particular need scaled continuous features as model input.
Scaling Methods
• Normalization
• Standardization
Normalization
• Normalization (or min-max normalization) scales all values into a fixed
range between 0 and 1.
• This transformation does not change the shape of the feature's
distribution, but because the standard deviation shrinks, the effect of
outliers increases.
• Therefore, it is recommended to handle outliers before normalizing.
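Min-max normalization computes (x - min) / (max - min) for every value. The sketch below uses hypothetical data with one large value to show how an outlier squeezes the remaining values toward 0.

```python
values = [20, 30, 40, 50, 120]  # hypothetical column; 120 acts as an outlier

lowest, highest = min(values), max(values)

# Map the smallest value to 0.0 and the largest to 1.0.
normalized = [(v - lowest) / (highest - lowest) for v in values]

print(normalized)  # [0.0, 0.1, 0.2, 0.3, 1.0]
```

Four of the five values end up in the lowest third of the range, which is why outliers are usually handled first. The formula divides by zero when all values are equal, so a constant column needs separate handling.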
Standardization
• Also known as z-score normalization.
• Scales the values while taking the standard deviation into account.
• If the standard deviations of the features differ, their ranges also
differ from each other.
• This reduces the effect of outliers in the features.
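Standardization computes z = (x - mean) / standard deviation. The sketch below assumes hypothetical data and uses the population standard deviation; some tools use the sample standard deviation instead, which gives slightly different numbers.

```python
from statistics import mean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical column: mean 5, population std 2

mu = mean(values)
sigma = pstdev(values)  # population standard deviation

# After the transform the column has mean 0 and standard deviation 1.
standardized = [(v - mu) / sigma for v in values]

print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Unlike min-max normalization, the result is not confined to a fixed range.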
Extracting Date
• Extracting the parts of the date into different columns: year, month,
day, etc.
• Extracting the time period between the current date and the date column in
terms of years, months, days, etc.
• Extracting some specific features from the date: name of the
weekday, weekend or not, holiday or not, etc.
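The three kinds of extraction can be sketched with the standard `datetime` module. The date strings are hypothetical, and a fixed reference date stands in for the "current date" so that the output is reproducible; holiday detection is left out because it needs a calendar of holidays.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

raw_dates = ["2017-01-01", "2008-12-04"]  # hypothetical date column (ISO format)
reference = date(2020, 1, 1)              # assumed stand-in for the current date

features = []
for text in raw_dates:
    d = date.fromisoformat(text)
    features.append({
        # Parts of the date
        "year": d.year,
        "month": d.month,
        "day": d.day,
        # Time passed between the date and the reference date
        "days_passed": (reference - d).days,
        # Specific features derived from the date
        "weekday": WEEKDAYS[d.weekday()],   # weekday() returns 0 for Monday
        "is_weekend": d.weekday() >= 5,
    })

print(features[0])
```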
DSA 207 – Feature Engineering
