Chapter 4: Feature Selection

This chapter covers feature selection for machine learning models: choosing a subset of relevant features to improve model performance without creating new features. It walks through scenarios that call for manual feature selection, such as removing redundant, correlated, or duplicated features, then looks at selecting features from text vectors and reducing dimensionality with principal component analysis.


Feature selection

Preprocessing for Machine Learning in Python

Sarah Guido
Senior Data Scientist
What is feature selection?
Selecting features to be used for modeling

Doesn't create new features

Improves the model's performance

When to select features
city           state  lat        long
hico           tx     31.982778  -98.033333
mackinaw city  mi     45.783889  -84.727778
winchester     ky     37.990000  -84.179722

The lat and long columns already pinpoint each location, so the text city and state columns carry redundant information (a removal sketch follows).
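A minimal sketch of that removal with pandas; the DataFrame below is rebuilt from the rows above purely for illustration:

import pandas as pd

# Illustrative DataFrame built from the rows shown above
df = pd.DataFrame({
    "city": ["hico", "mackinaw city", "winchester"],
    "state": ["tx", "mi", "ky"],
    "lat": [31.982778, 45.783889, 37.990000],
    "long": [-98.033333, -84.727778, -84.179722],
})

# lat/long already encode the location, so drop the redundant text columns
df_subset = df.drop(["city", "state"], axis=1)
print(df_subset.columns.tolist())  # ['lat', 'long']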

Let's practice!

Removing redundant features

Redundant features
Remove noisy features

Remove correlated features

Remove duplicated features (see the sketch below)
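
A minimal pandas sketch of spotting exact duplicate columns; the df here is a made-up example, not course data:

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [1, 2, 3],   # exact duplicate of A
    "C": [4, 5, 6],
})

# Transpose so columns become rows, then flag rows that repeat an earlier one
dupe_mask = df.T.duplicated()
duplicated_cols = df.columns[dupe_mask]
print(duplicated_cols.tolist())  # ['B']

df = df.drop(columns=duplicated_cols)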

Scenarios for manual removal
city           state  lat        long
hico           tx     31.982778  -98.033333
mackinaw city  mi     45.783889  -84.727778
winchester     ky     37.990000  -84.179722

Correlated features
Statistically correlated: features move together directionally

Linear models assume feature independence

Pearson correlation coefficient
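
For reference, the Pearson coefficient for two features $x$ and $y$ (the standard definition, not shown on the slide) is:

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

It ranges from -1 to 1, with values near either extreme indicating that the two features move together directionally.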

Correlated features
print(df)

      A     B     C
0  3.06  3.92  1.04
1  2.76  3.40  1.05
2  3.24  3.17  1.03
...

print(df.corr())

          A         B         C
A  1.000000  0.787194  0.543479
B  0.787194  1.000000  0.565468
C  0.543479  0.565468  1.000000
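
Here A and B are strongly correlated (0.79), so one of them could be dropped. A common recipe, sketched below assuming df is the DataFrame above; the 0.75 cutoff is illustrative:

import numpy as np

# Absolute pairwise Pearson correlations
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column whose correlation with an earlier column exceeds 0.75
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]
df_reduced = df.drop(columns=to_drop)  # here to_drop == ['B']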

Let's practice!

Selecting features using text vectors

Looking at word weights
print(tfidf_vec.vocabulary_)

{'200': 0,
 '204th': 1,
 '33rd': 2,
 'ahead': 3,
 'alley': 4,
 ...}

print(text_tfidf[3].data)

[0.19392702 0.20261085 0.249... 0.31957651 0.18599931 ...]

print(text_tfidf[3].indices)

[ 31 102  20  70   5 ...]
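
For context, tfidf_vec and text_tfidf here would come from a fitted TfidfVectorizer. A minimal setup sketch; the two-document corpus is a placeholder, and only the names tfidf_vec and text_tfidf come from the slides:

from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; the course uses a text column from its dataset
documents = ["volunteers needed ahead of the event on 33rd",
             "meet at the alley near 204th street"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)  # sparse docs-by-terms matrix

print(text_tfidf.shape)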

Looking at word weights
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}

print(vocab)

{0: '200',
 1: '204th',
 2: '33rd',
 3: 'ahead',
 4: 'alley',
 ...}

zipped_row = dict(zip(text_tfidf[3].indices,
                      text_tfidf[3].data))

print(zipped_row)

{5: 0.1597882543332701,
 7: 0.26576432098763175,
 8: 0.18599931331925676,
 9: 0.26576432098763175,
 10: 0.13077355258450366,
 ...}

Looking at word weights
def return_weights(vocab, vector, vector_index):
    zipped = dict(zip(vector[vector_index].indices,
                      vector[vector_index].data))
    return {vocab[i]: zipped[i] for i in vector[vector_index].indices}

print(return_weights(vocab, text_tfidf, 3))

{'and': 0.1597882543332701,
 'are': 0.26576432098763175,
 'at': 0.18599931331925676,
 ...}
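
To turn these weights into an actual feature selection, one option (a sketch building on return_weights above, not the course's own helper) is to keep only a row's top-weighted terms:

def top_weighted_terms(vocab, vector, vector_index, top_n=3):
    # Score each word in the row, then sort by tf-idf weight, descending
    weights = return_weights(vocab, vector, vector_index)
    return sorted(weights, key=weights.get, reverse=True)[:top_n]

print(top_weighted_terms(vocab, text_tfidf, 3))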

Let's practice!

Dimensionality reduction

Dimensionality reduction and PCA
Dimensionality reduction:
Unsupervised learning method
Combines/decomposes a feature space
Feature extraction - here we'll use it to reduce our feature space

Principal component analysis:
Linear transformation to an uncorrelated space
Captures as much variance as possible in each component

PCA in scikit-learn
from sklearn.decomposition import PCA
pca = PCA()
df_pca = pca.fit_transform(df)

print(df_pca)

[[  88.4583,   18.7764,  -2.2379, ...,   0.0954,   0.0361,  -0.0034],
 [  93.4564,   18.6709,  -1.7887, ...,  -0.0509,   0.1331,   0.0119],
 ...,
 [-186.9433,   -0.2133,  -5.6307, ...,   0.0332,   0.0271,   0.0055]]

print(pca.explained_variance_ratio_)

[0.9981, 0.0017, 0.0001, 0.0001, ...]
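
With 99.8% of the variance captured by the first component, this space can be cut down sharply. A sketch of the usual train/test pattern; the random data and n_components=3 are illustrative, not from the course:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 10)  # placeholder feature matrix
X_train, X_test = train_test_split(X, random_state=42)

pca = PCA(n_components=3)  # illustrative; choose via explained_variance_ratio_
X_train_pca = pca.fit_transform(X_train)  # fit on training data only
X_test_pca = pca.transform(X_test)        # reuse the fitted components

print(X_train_pca.shape, X_test_pca.shape)  # (75, 3) (25, 3)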

PCA caveats
Difficult to interpret components

Best applied at the end of the preprocessing journey

Let's practice!