Chapter4 PDF
Sarah Guido
Senior Data Scientist
What is feature selection?
Selecting features to be used for modeling
Redundant features
Remove noisy features
      A     B     C
0  3.06  3.92  1.04
1  2.76  3.40  1.05
2  3.24  3.17  1.03
...
print(df.corr())
          A         B         C
A  1.000000  0.787194  0.543479
B  0.787194  1.000000  0.565468
C  0.543479  0.565468  1.000000
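One way to act on a correlation matrix like the one above is to drop one column from each highly correlated pair. The sketch below is a minimal illustration, not the method from the slides: the helper name drop_correlated, the 0.75 threshold, and the toy data are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.75):
    """Drop the later column of each pair whose absolute Pearson
    correlation exceeds `threshold` (hypothetical helper)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once
    # and a column is never compared with itself.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Illustrative toy data, not the dataset shown in the slides.
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.1, 3.9, 6.2, 8.1, 9.8],  # roughly 2 * A, so redundant with A
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related to A and B
})

reduced, dropped = drop_correlated(toy, threshold=0.75)
print(dropped)
print(reduced.columns.tolist())
```

The threshold is a judgment call; a stricter or looser cutoff may suit a given dataset better.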
Looking at word weights
print(tfidf_vec.vocabulary_)
print(text_tfidf[3].data)
print(vocab)
{0: '200',
 1: '204th',
 2: '33rd',
 3: 'ahead',
 4: 'alley',
 ...
print(zipped_row)
{5: 0.1597882543332701,
 7: 0.26576432098763175,
 8: 0.18599931331925676,
 9: 0.26576432098763175,
 10: 0.13077355258450366,
 ...
zipped = dict(zip(vector[vector_index].indices,
                  vector[vector_index].data))
{'and': 0.1597882543332701,
'are': 0.26576432098763175,
'at': 0.18599931331925676,
...
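The lookup above can be wrapped in a small function that maps each word in one document to its tf-idf weight. This is a sketch under assumptions: the function name return_weights and the three toy documents are illustrative, while vocab, tfidf_vec, and text_tfidf follow the names used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def return_weights(vocab, vector, vector_index):
    """Return {word: tf-idf weight} for one row of a sparse tf-idf matrix.

    `vocab` maps column index -> word (the inverse of vocabulary_).
    """
    row = vector[vector_index]
    # Pair each non-zero column index with its stored weight.
    zipped = dict(zip(row.indices, row.data))
    return {vocab[i]: zipped[i] for i in row.indices}


# Illustrative toy documents, not the text used in the slides.
docs = ["the cat sat", "the dog sat down", "a cat and a dog"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)

# vocabulary_ maps word -> index; invert it to look words up by index.
vocab = {index: word for word, index in tfidf_vec.vocabulary_.items()}

weights = return_weights(vocab, text_tfidf, 0)
print(weights)
```

Sorting the returned dictionary by weight would be one way to pick the top few words per document as selected features.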
Dimensionality reduction and PCA
Unsupervised learning method
Combines/decomposes a feature space
Feature extraction: here we'll use it to reduce our feature space
Principal component analysis
Linear transformation to uncorrelated space
Captures as much variance as possible in each component
print(df_pca)
print(pca.explained_variance_ratio_)
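The two print calls above assume a fitted PCA object and a transformed array. A minimal sketch of how they might be produced with scikit-learn follows; the synthetic data and the random seed are assumptions for illustration, and only the names pca and df_pca come from the lines above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Illustrative synthetic data: B is almost a multiple of A, C is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
df = pd.DataFrame({
    "A": base,
    "B": 2 * base + rng.normal(scale=0.1, size=100),
    "C": rng.normal(size=100),
})

# With no n_components given, PCA keeps every component.
pca = PCA()
df_pca = pca.fit_transform(df)

print(df_pca.shape)
# Share of total variance captured by each component, largest first.
print(pca.explained_variance_ratio_)
```

If the last components explain very little variance, passing n_components to PCA is one way to keep only the leading ones.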