7-8 Feature Engineering 101-Normalization

Feature engineering is the process of transforming raw data into features that are useful for machine learning algorithms. Data binning reduces the effects of minor errors by grouping values into bins and labeling each bin. One hot encoding transforms categorical features into a format suitable for algorithms by creating a new binary column for each category. Transformers like log and Box-Cox map data distributions to a normal distribution to improve algorithm performance.


Feature Engineering

Feature engineering is the process of transforming raw data into features that are useful for machine learning algorithms.

In [1]: import pandas as pd
import numpy as np

In [2]: titanic = pd.read_csv('data/train.csv')
titanic.head()

Out[2]: PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C

2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Data Binning
Data binning is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall into a given small
interval, a bin, are replaced by a value representative of that interval.

In [3]: titanic.nunique()

Out[3]: PassengerId 891


Survived 2
Pclass 3
Name 891
Sex 2
Age 88
SibSp 7
Parch 7
Ticket 681
Fare 248
Cabin 147
Embarked 3
dtype: int64
In [4]: age = titanic['Age']
df = pd.DataFrame(age)

cut_labels = ['child', 'teenage', 'young adult', 'mid-age adult', 'old']
cut_bins = [0, 12, 18, 35, 55, float("inf") ]

df['Age binning'] = pd.cut(df['Age'], bins=cut_bins, labels=cut_labels)
df

Out[4]: Age Age binning

0 22.0 young adult

1 38.0 mid-age adult

2 26.0 young adult

3 35.0 young adult

4 35.0 young adult

... ... ...

886 27.0 young adult

887 19.0 young adult

888 NaN NaN

889 26.0 young adult

890 32.0 young adult

891 rows × 2 columns
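
As a complementary sketch (not part of the original notebook), equal-frequency binning with pd.qcut places roughly the same number of passengers in each bin; the quantile count q=4 and the label names are illustrative choices.

# Sketch: quantile (equal-frequency) binning; q=4 and the labels are illustrative
quantile_labels = ['Q1', 'Q2', 'Q3', 'Q4']
df['Age quantile bin'] = pd.qcut(df['Age'], q=4, labels=quantile_labels)
df['Age quantile bin'].value_counts()

Unlike pd.cut with fixed edges, pd.qcut chooses the bin edges from the observed age quantiles, so each bin covers about a quarter of the non-missing ages.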

One Hot Encoding


One hot encoding transforms categorical features to a format that works better with classification and regression algorithms.

This works well with most machine learning algorithms. Some algorithms, such as certain tree-based implementations like random forests, can handle categorical values natively; in that case one hot encoding is not necessary. The process of one hot encoding may seem tedious, but fortunately most modern machine learning libraries can take care of it.

In [5]: titanic.nunique()

Out[5]: PassengerId 891


Survived 2
Pclass 3
Name 891
Sex 2
Age 88
SibSp 7
Parch 7
Ticket 681
Fare 248
Cabin 147
Embarked 3
dtype: int64
In [6]: cls = titanic['Pclass']
df = pd.DataFrame(cls)

one_hot = pd.get_dummies(titanic['Pclass'], prefix='Pclass')
df = df.join(one_hot)
df

Out[6]: Pclass Pclass_1 Pclass_2 Pclass_3

0 3 0 0 1

1 1 1 0 0

2 3 0 0 1

3 1 1 0 0

4 3 0 0 1

... ... ... ... ...

886 2 0 1 0

887 1 1 0 0

888 3 0 0 1

889 1 1 0 0

890 3 0 0 1

891 rows × 4 columns
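
As a hedged alternative sketch (not in the original notebook), scikit-learn's OneHotEncoder does the same job as pd.get_dummies but can be fit once and reused on new data; the sparse_output argument assumes scikit-learn ≥ 1.2 (older versions use sparse=False), and handle_unknown='ignore' is an illustrative choice.

from sklearn.preprocessing import OneHotEncoder

# Sketch: fit an encoder on Pclass and inspect the resulting binary columns
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
pclass_encoded = encoder.fit_transform(titanic[['Pclass']])
print(encoder.get_feature_names_out(['Pclass']))  # expected: ['Pclass_1' 'Pclass_2' 'Pclass_3']
print(pclass_encoded[:5])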

Transformer
Transformers map data from various distributions to an approximately normal distribution, which can improve the performance of algorithms that assume normally distributed inputs.

1. Log transformer
2. Box-Cox transformer
3. Yeo-Johnson transformer

In [7]: news = pd.read_csv('data/OnlineNewsPopularity.csv')


news

Out[7]:        url                                                 timedelta  n_tokens_title  n_tokens_content  n_unique_tokens  n_non_stop_words  n_non_stop_unique_tokens  ...

0      http://mashable.com/2013/01/07/amazon-instant-...     731.0    12.0    219.0   0.663594  1.0  0.815385  ...
1      http://mashable.com/2013/01/07/ap-samsung-spon...     731.0     9.0    255.0   0.604743  1.0  0.791946  ...
2      http://mashable.com/2013/01/07/apple-40-billio...     731.0     9.0    211.0   0.575130  1.0  0.663866  ...
3      http://mashable.com/2013/01/07/astronaut-notre...     731.0     9.0    531.0   0.503788  1.0  0.665635  ...
4      http://mashable.com/2013/01/07/att-u-verse-apps/      731.0    13.0   1072.0   0.415646  1.0  0.540890  ...
...    ...                                                      ...     ...      ...        ...  ...       ...  ...
39639  http://mashable.com/2014/12/27/samsung-app-aut...       8.0    11.0    346.0   0.529052  1.0  0.684783  ...
39640  http://mashable.com/2014/12/27/seth-rogen-jame...       8.0    12.0    328.0   0.696296  1.0  0.885057  ...
39641  http://mashable.com/2014/12/27/son-pays-off-mo...       8.0    10.0    442.0   0.516355  1.0  0.644128  ...
39642  http://mashable.com/2014/12/27/ukraine-blasts/           8.0     6.0    682.0   0.539493  1.0  0.692661  ...
39643  http://mashable.com/2014/12/27/youtube-channel...       8.0    10.0    157.0   0.701987  1.0  0.846154  ...

39644 rows × 61 columns


In [8]: news[' n_tokens_content'].describe(), news[' n_tokens_content'].median()

Out[8]: (count 39644.000000


mean 546.514731
std 471.107508
min 0.000000
25% 246.000000
50% 409.000000
75% 716.000000
max 8474.000000
Name: n_tokens_content, dtype: float64,
409.0)

In [9]: import matplotlib.pyplot as plt

# add 1 before taking the log so that zero-word articles map to log10(1) = 0
news['log_n_tokens_content'] = np.log10(news[' n_tokens_content'] + 1)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 10))
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax2.set_xlabel('Log of Number of Words', fontsize=14)
news[' n_tokens_content'].hist(ax=ax1, bins=20)
news['log_n_tokens_content'].hist(ax=ax2, bins=20)
plt.show()

In [10]: news['log_n_tokens_content'].median(), news['log_n_tokens_content'].mean(), news['log_n_tokens_content'].mode()

Out[10]: (2.6127838567197355,
2.557981958990265,
0 0.0
Name: log_n_tokens_content, dtype: float64)

Box-Cox transformer
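
For reference, the standard Box-Cox definition (restated here; it is not written out in the original notebook) is:

$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln x, & \lambda = 0 \end{cases} \qquad \text{for } x > 0$$

Because Box-Cox is defined only for strictly positive values, the cell below first shifts the word counts by 1.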
In [11]: from scipy.stats import boxcox
from scipy.stats import yeojohnson

# Box-Cox is defined only for strictly positive values, so shift the counts by 1
y = news[' n_tokens_content'] + 1
y, fitted_lambda = boxcox(y, lmbda=None)
print("lambda :", fitted_lambda)
news['boxcox_n_tokens_content'] = y

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 10))
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax2.set_xlabel('Box-Cox of Number of Words', fontsize=14)
news[' n_tokens_content'].hist(ax=ax1, bins=20)
news['boxcox_n_tokens_content'].hist(ax=ax2, bins=20)
plt.show()

lambda : 0.38045297261832045

In [12]: news['boxcox_n_tokens_content'].mean(), news['boxcox_n_tokens_content'].median(), news['boxcox_n_tokens_content'].mode()

Out[12]: (24.13409546936699,
23.297845966465566,
0 0.0
Name: boxcox_n_tokens_content, dtype: float64)

Yeo-Johnson transformer
source: https://www.stat.umn.edu/arc/yjpower.pdf
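
For reference, the Yeo-Johnson transform from the cited paper extends Box-Cox to zero and negative values:

$$\psi(\lambda, y) = \begin{cases} \left((y+1)^{\lambda} - 1\right)/\lambda, & \lambda \neq 0,\ y \geq 0 \\ \ln(y+1), & \lambda = 0,\ y \geq 0 \\ -\left((-y+1)^{2-\lambda} - 1\right)/(2-\lambda), & \lambda \neq 2,\ y < 0 \\ -\ln(-y+1), & \lambda = 2,\ y < 0 \end{cases}$$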

In [13]: y = news[' n_tokens_content']

# Yeo-Johnson handles zero and negative values, so no shift by 1 is needed here
y, lmbda = yeojohnson(y)
news['yeojohnson'] = y

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 10))
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax2.set_xlabel('Yeo-Johnson of Number of Words', fontsize=14)
news[' n_tokens_content'].hist(ax=ax1, bins=20)
news['yeojohnson'].hist(ax=ax2, bins=20)
plt.show()

In [14]: lmbda

Out[14]: 0.38045297261832045

In [15]: news['yeojohnson'].mean(), news['yeojohnson'].median(), news['yeojohnson'].mode()

Out[15]: (24.13409546936699,
23.297845966465562,
0 0.0
Name: yeojohnson, dtype: float64)
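
As a hedged sketch (not in the original notebook), scikit-learn's PowerTransformer wraps both Box-Cox and Yeo-Johnson; the new column name below is an illustrative choice, and standardize=True (the default) additionally rescales the result to zero mean and unit variance.

from sklearn.preprocessing import PowerTransformer

# Sketch: Yeo-Johnson via scikit-learn; the fitted lambda should be close to the
# value returned by scipy.stats.yeojohnson above
pt = PowerTransformer(method='yeo-johnson', standardize=True)
news['yeojohnson_sklearn'] = pt.fit_transform(news[[' n_tokens_content']]).ravel()
print("fitted lambda:", pt.lambdas_[0])
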
Scaling & Normalization
Numeric features, such as counts, may increase without bound. Models that are smooth functions of the input, such as linear regression, logistic regression, or anything that involves a matrix operation on the features, are affected by the scale of the input. Tree-based models, on the other hand, are largely insensitive to feature scale. If your model is sensitive to the scale of input features, feature scaling can help.

1. Min-max
2. Standardization
3. l2 Norm.

Min-max

Min-max scaling squeezes all feature values into the range [0, 1].

Illustration of min-max scaling

Standardization

It subtracts off the mean of the feature (over all data points) and divides by the standard deviation, so the scaled feature has zero mean and unit variance. Hence, it can also be called variance scaling.
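
For reference, the three scalers used in the cell below can be written as (standard definitions, restated here):

$$\text{min-max: } \tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad \text{standardization: } \tilde{x} = \frac{x - \bar{x}}{\sigma} \qquad \ell_2 \text{ norm: } \tilde{x} = \frac{x}{\lVert x \rVert_2}$$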
In [ ]: import pandas as pd
import sklearn.preprocessing as preproc

# Look at the original data - the number of words in an article
print('values: ', news[' n_tokens_content'].values)

# Min-max scaling
news['minmax'] = preproc.minmax_scale(news[[' n_tokens_content']])
print("\nmin-max : ", news['minmax'].values)

# Standardization - note that by definition, some outputs will be negative
news['standardized'] = preproc.StandardScaler().fit_transform(news[[' n_tokens_content']])
print('\nstandardized : ',news['standardized'].values)

# L2-normalization
news['l2_normalized'] = preproc.normalize(news[[' n_tokens_content']], axis=0)
print('\nl2 norm : ',news['l2_normalized'].values)

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6,12))
fig.tight_layout()

news[' n_tokens_content'].hist(ax=ax1, bins=100)
ax1.tick_params(labelsize=14)
ax1.set_xlabel('Article word count', fontsize=14)
ax1.set_ylabel('Number of articles', fontsize=14)

news['minmax'].hist(ax=ax2, bins=100)
ax2.tick_params(labelsize=14)
ax2.set_xlabel('Min-max scaled word count')
ax2.set_ylabel('Number of articles', fontsize=14)

news['standardized'].hist(ax=ax3, bins=100)
ax3.tick_params(labelsize=14)
ax3.set_xlabel('Standardized word count')
ax3.set_ylabel('Number of articles', fontsize=14)

plt.show()

values: [219. 255. 211. ... 442. 682. 157.]

min-max : [0.02584376 0.03009205 0.02489969 ... 0.05215955 0.08048147 0.01852726]

standardized : [-0.69521045 -0.61879381 -0.71219192 ... -0.2218518  0.28759248 -0.82681689]

l2 norm : [0.00152439 0.00177498 0.00146871 ... 0.00307663 0.0047472 0.00109283]
