Chapter 4: Feature Selection

This chapter covers feature selection for machine learning models: choosing a subset of relevant features to improve model performance without creating new features. It walks through scenarios that call for manual feature selection, such as removing redundant, correlated, or duplicated features, then looks at selecting features from text vectors and reducing dimensionality with principal component analysis.


Feature selection

Preprocessing for Machine Learning in Python

Sarah Guido
Senior Data Scientist
What is feature selection?
Selecting features to be used for modeling

Doesn't create new features

Improves the model's performance

When to select features
city           state  lat        long
hico           tx     31.982778  -98.033333
mackinaw city  mi     45.783889  -84.727778
winchester     ky     37.990000  -84.179722

The lat and long columns already pinpoint each location, so the text city and state columns carry redundant information (a removal sketch follows).
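A minimal sketch of that removal with pandas; the DataFrame below is rebuilt from the rows above purely for illustration:

import pandas as pd

# Illustrative DataFrame built from the rows shown above
df = pd.DataFrame({
    "city": ["hico", "mackinaw city", "winchester"],
    "state": ["tx", "mi", "ky"],
    "lat": [31.982778, 45.783889, 37.990000],
    "long": [-98.033333, -84.727778, -84.179722],
})

# lat/long already encode the location, so drop the redundant text columns
df_subset = df.drop(["city", "state"], axis=1)
print(df_subset.columns.tolist())  # ['lat', 'long']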

Let's practice!

Removing redundant features

Redundant features
Remove noisy features

Remove correlated features

Remove duplicated features (see the sketch below)
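
A minimal pandas sketch of spotting exact duplicate columns; the df here is a made-up example, not course data:

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [1, 2, 3],   # exact duplicate of A
    "C": [4, 5, 6],
})

# Transpose so columns become rows, then flag rows that repeat an earlier one
dupe_mask = df.T.duplicated()
duplicated_cols = df.columns[dupe_mask]
print(duplicated_cols.tolist())  # ['B']

df = df.drop(columns=duplicated_cols)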

Scenarios for manual removal
city           state  lat        long
hico           tx     31.982778  -98.033333
mackinaw city  mi     45.783889  -84.727778
winchester     ky     37.990000  -84.179722

Correlated features
Statistically correlated: features move together directionally

Linear models assume feature independence

Pearson correlation coefficient
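
For reference, the Pearson coefficient for two features $x$ and $y$ (the standard definition, not shown on the slide) is:

$$r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

It ranges from -1 to 1, with values near either extreme indicating that the two features move together directionally.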

Correlated features
print(df)

      A     B     C
0  3.06  3.92  1.04
1  2.76  3.40  1.05
2  3.24  3.17  1.03
...

print(df.corr())

          A         B         C
A  1.000000  0.787194  0.543479
B  0.787194  1.000000  0.565468
C  0.543479  0.565468  1.000000
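
Here A and B are strongly correlated (0.79), so one of them could be dropped. A common recipe, sketched below assuming df is the DataFrame above; the 0.75 cutoff is illustrative:

import numpy as np

# Absolute pairwise Pearson correlations
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column whose correlation with an earlier column exceeds 0.75
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]
df_reduced = df.drop(columns=to_drop)  # here to_drop == ['B']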

Let's practice!

Selecting features using text vectors

Looking at word weights
print(tfidf_vec.vocabulary_)

{'200': 0,
 '204th': 1,
 '33rd': 2,
 'ahead': 3,
 'alley': 4,
 ...}

print(text_tfidf[3].data)

[0.19392702 0.20261085 0.249... 0.31957651 0.18599931 ...]

print(text_tfidf[3].indices)

[ 31 102  20  70   5 ...]
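
For context, tfidf_vec and text_tfidf here would come from a fitted TfidfVectorizer. A minimal setup sketch; the two-document corpus is a placeholder, and only the names tfidf_vec and text_tfidf come from the slides:

from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus; the course uses a text column from its dataset
documents = ["volunteers needed ahead of the event on 33rd",
             "meet at the alley near 204th street"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)  # sparse docs-by-terms matrix

print(text_tfidf.shape)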

Looking at word weights
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}

print(vocab)

{0: '200',
 1: '204th',
 2: '33rd',
 3: 'ahead',
 4: 'alley',
 ...}

zipped_row = dict(zip(text_tfidf[3].indices,
                      text_tfidf[3].data))

print(zipped_row)

{5: 0.1597882543332701,
 7: 0.26576432098763175,
 8: 0.18599931331925676,
 9: 0.26576432098763175,
 10: 0.13077355258450366,
 ...}

Looking at word weights
def return_weights(vocab, vector, vector_index):
    zipped = dict(zip(vector[vector_index].indices,
                      vector[vector_index].data))
    return {vocab[i]: zipped[i] for i in vector[vector_index].indices}

print(return_weights(vocab, text_tfidf, 3))

{'and': 0.1597882543332701,
 'are': 0.26576432098763175,
 'at': 0.18599931331925676,
 ...}
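
To turn these weights into an actual feature selection, one option (a sketch building on return_weights above, not the course's own helper) is to keep only a row's top-weighted terms:

def top_weighted_terms(vocab, vector, vector_index, top_n=3):
    # Score each word in the row, then sort by tf-idf weight, descending
    weights = return_weights(vocab, vector, vector_index)
    return sorted(weights, key=weights.get, reverse=True)[:top_n]

print(top_weighted_terms(vocab, text_tfidf, 3))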

Let's practice!

Dimensionality reduction

Dimensionality reduction and PCA
Dimensionality reduction:
Unsupervised learning method
Combines/decomposes a feature space
Feature extraction - here we'll use it to reduce our feature space

Principal component analysis:
Linear transformation to an uncorrelated space
Captures as much variance as possible in each component

PCA in scikit-learn
from sklearn.decomposition import PCA
pca = PCA()
df_pca = pca.fit_transform(df)

print(df_pca)

[[  88.4583,   18.7764,  -2.2379, ...,   0.0954,   0.0361,  -0.0034],
 [  93.4564,   18.6709,  -1.7887, ...,  -0.0509,   0.1331,   0.0119],
 ...,
 [-186.9433,   -0.2133,  -5.6307, ...,   0.0332,   0.0271,   0.0055]]

print(pca.explained_variance_ratio_)

[0.9981, 0.0017, 0.0001, 0.0001, ...]
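
With 99.8% of the variance captured by the first component, this space can be cut down sharply. A sketch of the usual train/test pattern; the random data and n_components=3 are illustrative, not from the course:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 10)  # placeholder feature matrix
X_train, X_test = train_test_split(X, random_state=42)

pca = PCA(n_components=3)  # illustrative; choose via explained_variance_ratio_
X_train_pca = pca.fit_transform(X_train)  # fit on training data only
X_test_pca = pca.transform(X_test)        # reuse the fitted components

print(X_train_pca.shape, X_test_pca.shape)  # (75, 3) (25, 3)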

PCA caveats
Difficult to interpret components

Best applied at the end of the preprocessing journey

Let's practice!