7-8 Feature Engineering 101-Normalization
Out[2]:    PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket     Fare Cabin Embarked
        0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171   7.2500   NaN        S
        1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599  71.2833   C85        C
        3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0     113803  53.1000  C123        S
Data Binning
Data binning is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall into a given small
interval, a bin, are replaced by a value representative of that interval.
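As a concrete illustration (not taken from the original notebook), the Titanic Age column could be binned into a handful of intervals with pandas; the bin edges and labels below are arbitrary assumptions:

    import pandas as pd

    # replace each raw age by the label of the interval (bin) it falls into
    titanic['age_bin'] = pd.cut(titanic['Age'],
                                bins=[0, 12, 18, 35, 60, 80],
                                labels=['child', 'teen', 'young adult', 'adult', 'senior'])
    print(titanic[['Age', 'age_bin']].head())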
In [3]: titanic.nunique()
One-hot encoding works very well with most machine learning algorithms. Some algorithms, like random forests, handle categorical values natively; for those, one-hot encoding is not necessary. The process of one-hot encoding may seem tedious, but fortunately most modern machine learning libraries can take care of it.
In [5]: titanic.nunique()
0      3  0  0  1
1      1  1  0  0
2      3  0  0  1
3      1  1  0  0
4      3  0  0  1
..    .. .. .. ..
886    2  0  1  0
887    1  1  0  0
888    3  0  0  1
889    1  1  0  0
890    3  0  0  1
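The code cell behind the table above is not visible in this export; the columns look like Pclass followed by its three one-hot indicator columns. A minimal sketch with pandas (the get_dummies call and the Pclass_* column names are assumptions, not the original notebook's code) might be:

    import pandas as pd

    # one-hot encode Pclass and show the indicators next to the original column
    pclass_dummies = pd.get_dummies(titanic['Pclass'], prefix='Pclass')
    print(pd.concat([titanic['Pclass'], pclass_dummies], axis=1))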
Transformer
These transformers map data from various distributions toward a normal distribution (a sketch of the log transform appears after the data preview below):
1. Log transformer
2. Box-Cox transformer
3. Yeo-Johnson transformer
                                                     url  timedelta  n_tokens_title  n_tokens_content  n_unique_tokens  n_non_stop_words  n_non_stop_unique_tokens
0      http://mashable.com/2013/01/07/amazon-instant-...      731.0            12.0             219.0         0.663594               1.0                  0.815385
1      http://mashable.com/2013/01/07/ap-samsung-spon...      731.0             9.0             255.0         0.604743               1.0                  0.791946
2      http://mashable.com/2013/01/07/apple-40-billio...      731.0             9.0             211.0         0.575130               1.0                  0.663866
3      http://mashable.com/2013/01/07/astronaut-notre...      731.0             9.0             531.0         0.503788               1.0                  0.665635
4       http://mashable.com/2013/01/07/att-u-verse-apps/      731.0            13.0            1072.0         0.415646               1.0                  0.540890
...                                                   ...        ...             ...               ...              ...               ...                       ...
39639  http://mashable.com/2014/12/27/samsung-app-aut...        8.0            11.0             346.0         0.529052               1.0                  0.684783
39640  http://mashable.com/2014/12/27/seth-rogen-jame...        8.0            12.0             328.0         0.696296               1.0                  0.885057
39641  http://mashable.com/2014/12/27/son-pays-off-mo...        8.0            10.0             442.0         0.516355               1.0                  0.644128
39642     http://mashable.com/2014/12/27/ukraine-blasts/        8.0             6.0             682.0         0.539493               1.0                  0.692661
39643  http://mashable.com/2014/12/27/youtube-channel...        8.0            10.0             157.0         0.701987               1.0                  0.846154
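Log transformer
The notebook cell that created log_n_tokens_content (referenced in the output below) is not shown; a minimal sketch, assuming a base-10 log of the shifted word counts:

    import numpy as np

    # +1 keeps zero-length articles defined, since log(0) is undefined
    news['log_n_tokens_content'] = np.log10(news[' n_tokens_content'] + 1)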
Out[10]: (2.6127838567197355,
2.557981958990265,
0 0.0
Name: log_n_tokens_content, dtype: float64)
Box-Cox transformer
In [11]: from scipy.stats import boxcox
from scipy.stats import yeojohnson
import matplotlib.pyplot as plt

# Box-Cox requires strictly positive input, so shift the word counts by 1
y = news[' n_tokens_content'] + 1
# lmbda=None lets boxcox estimate the optimal lambda by maximum likelihood
y, fitted_lambda = boxcox(y, lmbda=None)
print("lambda :", fitted_lambda)
news['boxcox_n_tokens_content'] = y

# compare the raw and Box-Cox-transformed distributions
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 10))
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax2.set_xlabel('Box-Cox of Number of Words', fontsize=14)
news[' n_tokens_content'].hist(ax=ax1, bins=20)
news['boxcox_n_tokens_content'].hist(ax=ax2, bins=20)
plt.show()
lambda : 0.38045297261832045
Out[12]: (24.13409546936699,
23.297845966465566,
0 0.0
Name: boxcox_n_tokens_content, dtype: float64)
Yeo-Johnson transformer
source: https://www.stat.umn.edu/arc/yjpower.pdf
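The cell that computed the transform is not shown here; a minimal sketch using scipy.stats.yeojohnson (reusing the news DataFrame and the yeojohnson import from the Box-Cox cell; the 'yeojohnson' column name follows the output below):

    # Yeo-Johnson accepts zero and negative values, so no +1 shift is required
    y, lmbda = yeojohnson(news[' n_tokens_content'])
    news['yeojohnson'] = y
    print('lambda :', lmbda)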
In [14]: lmbda
Out[14]: 0.38045297261832045
Out[15]: (24.13409546936699,
23.297845966465562,
0 0.0
Name: yeojohnson, dtype: float64)
Scaling & Normalization
Numeric features, such as counts, may increase without bound. Models that are smooth functions of the input, such as linear regression, logistic regression, or
anything that involves a matrix, are affected by the scale of the input. Tree-based models, on the other hand, couldn’t care less. If your model is sensitive to
the scale of input features, feature scaling could help.
1. Min-max
2. Standardization
3. l2 Norm.
Min-max
Min-max scaling squeezes all feature values into the range [0, 1]: x_scaled = (x - min(x)) / (max(x) - min(x)).
Standardization
It subtracts the mean of the feature (over all data points) and divides by the standard deviation, so the scaled feature has zero mean and unit variance: x_std = (x - mean(x)) / std(x). Hence, it is also called variance scaling.
In [ ]: import pandas as pd
import sklearn.preprocessing as preproc

# Look at the original data - the number of words in an article
print('values: ', news[' n_tokens_content'].values)

# Min-max scaling (the original call is missing from this export; this completion is assumed)
news['minmax'] = preproc.minmax_scale(news[[' n_tokens_content']])
print('min-max scaled: ', news['minmax'].values)
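Standardization and l2 normalization from the list above have no cells in this export; a minimal sketch in the same style (the 'standardized' and 'l2_normalized' column names are assumptions):

    # Standardization (variance scaling): zero mean, unit variance
    news['standardized'] = preproc.StandardScaler().fit_transform(news[[' n_tokens_content']])

    # l2 normalization: divide the feature column by its Euclidean norm
    news['l2_normalized'] = preproc.normalize(news[[' n_tokens_content']], axis=0)

    print('standardized: ', news['standardized'].values)
    print('l2 normalized: ', news['l2_normalized'].values)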