Mini 4
In this case, X_train and X_test are guaranteed to have the same number of
features. Missing values can be replaced with a statistic computed from the
training data using the SimpleImputer class. For example, to impute np.nan
entries with the column mean:
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
Output:
[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
The SimpleImputer class also supports sparse matrices:
import scipy.sparse as sp

X = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
imp = SimpleImputer(missing_values=-1, strategy='mean')
imp.fit(X)

X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
print(imp.transform(X_test).toarray())
Output:
[[3. 2.]
 [6. 3.]
 [7. 6.]]
Note that this format is not meant to be used to implicitly store missing values in
the matrix because it would densify it at transform time. Missing values encoded by
0 must be used with dense input.
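As a small sketch of that note, a dense array in which 0 encodes a missing value can be imputed directly (the data here is an illustrative toy example):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Dense input where 0 marks a missing entry
X = np.array([[1, 2], [0, 3], [7, 6]], dtype=float)

# Impute zeros with the column mean computed over non-missing values
imp = SimpleImputer(missing_values=0, strategy='mean')
result = imp.fit_transform(X)
# Column 0 mean over [1, 7] is 4, so the 0 becomes 4
```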
The SimpleImputer class also supports categorical data represented as string values
or pandas categoricals when using the 'most_frequent' or 'constant' strategy:
import pandas as pd

df = pd.DataFrame([["a", "x"],
                   [np.nan, "y"],
                   ["a", np.nan],
                   ["b", "y"]], dtype="category")

imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))
Output:
[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]
For another example on usage, see Imputing missing values before building an
estimator.
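The idea behind that example, imputing before fitting an estimator, can be sketched with a toy dataset and a Pipeline (the data and step names here are illustrative assumptions, not taken from the linked example):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Toy feature matrix with one missing entry
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# The pipeline imputes missing values, then fits the estimator,
# so imputation statistics come only from the data passed to fit()
model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("regress", LinearRegression()),
])
model.fit(X, y)

preds = model.predict([[3.0]])
```

Wrapping both steps in one Pipeline also means the same imputation is applied automatically to any data passed to predict().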
In addition to the library classes above, the following methods are used to
achieve the scaling functionality:
The fit(data) method computes the mean and standard deviation of each feature
so that they can be used later for scaling.
The transform(data) method performs scaling using the mean and standard
deviation computed by the fit() method.
The fit_transform(data) method does both fit and transform in a single step.
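The relationship between these methods can be checked directly: fitting and transforming in two steps gives the same result as fit_transform() in one (the data here is a small made-up example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Two steps: fit() learns the statistics, transform() applies them
scaler = StandardScaler()
scaler.fit(data)
two_step = scaler.transform(data)

# One step: fit_transform() does both at once
one_step = StandardScaler().fit_transform(data)

# The learned per-feature means are exposed as scaler.mean_
assert np.allclose(scaler.mean_, [2.0, 20.0])
assert np.allclose(two_step, one_step)
```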
Standard Scaler
Standard Scaler produces a standardized distribution with zero mean and a
standard deviation of one (unit variance). It standardizes each feature by
subtracting the feature's mean and then dividing the result by the feature's
standard deviation.
z = (x - u) / s
Where,
z is the scaled data,
x is the data to be scaled,
u is the mean of the training samples, and
s is the standard deviation of the training samples.
Sklearn preprocessing provides the StandardScaler class to achieve this
directly in merely 2-3 steps.
Parameters:
copy: If True (default), a copy of the data is made; if False, scaling is
done in place where possible.
with_mean: If True (default), center the data by subtracting the mean before
scaling.
with_std: If True (default), scale the data to unit variance.
Approach:
Import module
Create data
Compute required values
Print processed data
Example:
# import module
from sklearn.preprocessing import StandardScaler

# create data
data = [[11, 2], [3, 7], [0, 10], [11, 8]]

# compute required values and scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# print processed data
print(scaled_data)
Output:
[[ 0.97596444 -1.61155897]
 [-0.66776515  0.08481889]
 [-1.28416374  1.10264561]
 [ 0.97596444  0.42409446]]
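The output above can be reproduced by hand from the formula z = (x - u) / s; note that StandardScaler uses the population standard deviation (ddof=0), which matches NumPy's default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[11, 2], [3, 7], [0, 10], [11, 8]], dtype=float)

scaled = StandardScaler().fit_transform(data)

# Recompute z = (x - u) / s manually for each feature
u = data.mean(axis=0)
s = data.std(axis=0)  # population std (ddof=0), same as StandardScaler
manual = (data - u) / s

assert np.allclose(scaled, manual)
```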
MinMax Scaler
Another way of scaling data is to make the minimum of a feature equal to
zero and the maximum equal to one. MinMax Scaler shrinks the data to a given
range, usually 0 to 1, by scaling each feature into that range. It scales the
values to a specific range without changing the shape of the original
distribution.
x_scaled = (x - x_min) / (x_max - x_min)
Where,
x_scaled is the scaled data,
x is the data to be scaled,
x_min is the minimum value of the feature in the training data, and
x_max is the maximum value of the feature in the training data.
For a custom feature_range (min, max), the result is further rescaled as
x_scaled * (max - min) + min.
Parameters:
feature_range: Desired range of scaled data. The default range for the feature
returned by MinMaxScaler is 0 to 1. The range is provided in tuple form as
(min,max).
copy: If False, in-place scaling is done. If True, a copy is created instead
of in-place scaling.
clip: If True, scaled data is clipped to provided feature range.
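The feature_range and clip parameters can be sketched as follows (this assumes scikit-learn 0.24 or later, where the clip parameter is available; the data reuses the toy example above):

```python
from sklearn.preprocessing import MinMaxScaler

data = [[11, 2], [3, 7], [0, 10], [11, 8]]

# Scale into the range (0, 5) instead of the default (0, 1)
scaler = MinMaxScaler(feature_range=(0, 5))
scaled = scaler.fit_transform(data)
# Each column's maximum now maps to 5.0

# With clip=True, values outside the training range seen at transform
# time are clipped to the feature range instead of falling outside it
clipper = MinMaxScaler(feature_range=(0, 1), clip=True).fit(data)
out = clipper.transform([[20, -3]])  # 20 > 11 and -3 < 2 in training data
# Without clip, these would map to about 1.82 and -0.625
```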
Approach:
Import module
Create data
Scale data
print scaled data
Example:
# import module
from sklearn.preprocessing import MinMaxScaler

# create data
data = [[11, 2], [3, 7], [0, 10], [11, 8]]

# scale features
scaler = MinMaxScaler()
model = scaler.fit(data)
scaled_data = model.transform(data)

# print scaled data
print(scaled_data)
Output:
[[1.         0.        ]
 [0.27272727 0.625     ]
 [0.         1.        ]
 [1.         0.75      ]]
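As with the imputer earlier, the scaler should be fit on training data only and then reused on test data, so both use the same statistics; a minimal sketch with made-up train/test arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on training data only, then apply the same statistics to test data
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler().fit(X_train)
train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(X_test)
# Test values outside the training range can fall outside [0, 1]
# unless clip=True is set
```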