Data Preprocessing
Data preprocessing is a crucial step in machine learning because it directly impacts the quality and performance of your
models.
Handling Missing Values: Missing data can lead to biased or inaccurate models. Imputation techniques fill in missing values with
reasonable estimates, improving data completeness and model accuracy.
Outlier Detection and Treatment: Outliers can skew the model's learning process. Identifying and handling them appropriately
prevents the model from being overly influenced by extreme values.
Feature Scaling: Features with different scales can affect the model's learning process. Scaling techniques like normalization or
standardization ensure that all features contribute equally to the model.
Data Cleaning: Removing inconsistencies, errors, and noise in the data ensures that the model learns from reliable information.
Feature Engineering: Creating new features or transforming existing ones can make the data more meaningful and easier to
interpret.
Dimensionality Reduction: Techniques like PCA or t-SNE reduce the number of features, making the model simpler and easier to
understand.
3. Faster Model Training:
Data Cleaning and Preprocessing: Clean and well-prepared data reduces the computational cost of training, leading to faster
model development.
Feature Selection: Selecting the most relevant features reduces the model's complexity and speeds up training.
4. Better Generalization:
Data Cleaning and Preprocessing: High-quality data helps the model learn general patterns and avoid overfitting, improving its
ability to generalize to new, unseen data.
Data Preprocessing Pipeline: Creating a robust preprocessing pipeline ensures that new data can be efficiently prepared for the
model, facilitating deployment and real-time predictions.
In essence, data preprocessing is the foundation of successful machine learning. By investing time and effort in this step, you
can significantly improve the accuracy, interpretability, and efficiency of your models.
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
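The later examples reuse a few objects that are defined once up front. The original setup cell is not shown here, so the following is a minimal sketch: the file name titanic.csv is a placeholder for your copy of the Kaggle Titanic training data, and dfx and numerical_features are inferred from the outputs that appear further below.
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, Normalizer, Binarizer,
    LabelEncoder, OneHotEncoder, PolynomialFeatures, PowerTransformer,
    QuantileTransformer, MaxAbsScaler, OrdinalEncoder, KBinsDiscretizer,
    FunctionTransformer, MultiLabelBinarizer)
from sklearn.impute import SimpleImputer, KNNImputer

# Load the Titanic training data (adjust the path to your copy of the dataset)
df = pd.read_csv('titanic.csv')

# Working subset used in the examples below; Age and Embarked contain missing values
dfx = df[['Survived', 'Sex', 'Age', 'Fare', 'Embarked']]

# Numerical columns that the scalers and transformers operate on
numerical_features = ['Age', 'Fare']

df.head()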
   PassengerId  Survived  Pclass                     Name     Sex   Age  SibSp  Parch            Ticket    Fare Cabin Embarked
0            1         0       3  Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171  7.2500   NaN        S
2            3         1       3   Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282  7.9250   NaN        S
1. StandardScaler
StandardScaler is a common technique used in data preprocessing to standardize numerical features. It transforms features by
removing the mean and scaling them to unit variance. This ensures that all features contribute equally to the model, regardless
of their original scale.
Key points:
titanic = dfx.copy(deep=True)
# Apply StandardScaler
scaler = StandardScaler()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()
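As a quick sanity check (an addition of mine, not part of the original notebook), the standardized columns should now have a mean close to 0 and a standard deviation close to 1:
# Means should be ~0 and standard deviations ~1 after standardization
titanic[numerical_features].agg(['mean', 'std'])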
2. MinMaxScaler
MinMaxScaler is a data preprocessing technique that scales numerical features to a specific range, typically between 0 and 1.
It's useful when you want to transform features to a bounded range or when you want to avoid negative values.
Key points:
titanic = dfx.copy(deep=True)
# Apply MinMaxScaler
scaler = MinMaxScaler()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()
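A similar check (my addition): after min-max scaling every column should span exactly 0 to 1.
# Minimum should be 0 and maximum 1 for each scaled column
titanic[numerical_features].agg(['min', 'max'])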
3. RobustScaler
RobustScaler is a data preprocessing technique that scales features using the interquartile range (IQR). It's robust to outliers,
making it suitable for data with skewed distributions or outliers.
Key points:
IQR: Measures the range between the 25th and 75th percentiles.
Scaling: Scales features using the IQR, reducing the influence of outliers.
Normalization: Results in a distribution with a smaller range and less sensitivity to extreme values.
Benefits: Improved model performance, especially for algorithms sensitive to outliers.
Use Cases: Linear Regression, Logistic Regression, and other algorithms that can be affected by outliers.
titanic = dfx.copy(deep=True)
# Apply RobustScaler
scaler = RobustScaler()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()
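As a quick check (my addition): RobustScaler centers on the median by default, so the scaled columns should now have a median of roughly 0 and an interquartile range of 1.
# Medians should be ~0 after robust scaling
titanic[numerical_features].median()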
4. Normalizer
Normalizer is a data preprocessing technique that scales individual samples to have a unit norm. This means each sample
(row) in the dataset is rescaled so that its length (Euclidean norm) becomes 1.
Key points:
titanic = dfx.copy(deep=True)
# Apply Normalizer
scaler = Normalizer()
# Normalizer does not accept missing values, so fill any missing Age first (median used here)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()
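To confirm the effect (my addition), every row of the scaled columns should now have a Euclidean length of 1:
# Each row vector of the scaled columns should have norm 1
np.linalg.norm(titanic[numerical_features].to_numpy(), axis=1)[:5]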
5. Binarizer
Binarizer is a data preprocessing technique that transforms numerical values into binary values (0 or 1) based on a specified
threshold. It's useful for feature discretization or when you want to treat numerical features as categorical.
Key points:
Thresholding: Values greater than the threshold are mapped to 1, and values less than or equal to the threshold are mapped to 0.
Feature Discretization: Converts continuous features into discrete binary features.
Use Cases: Text analysis, where presence or absence of a word is important, and other applications where binary representation is
beneficial.
Benefits: Simplifies data, reduces dimensionality, and can improve model performance in certain scenarios.
titanic = dfx.copy(deep=True)
# Apply Binarizer
binarizer = Binarizer(threshold=0.5)
titanic['Age'] = binarizer.fit_transform(titanic[['Age']])
titanic.head()
6. LabelEncoder
LabelEncoder is a technique used to convert categorical data into numerical data. It assigns a unique integer to each category.
This is useful for machine learning algorithms that only work with numerical data.
Key points:
titanic = dfx.copy(deep=True)
# Apply LabelEncoder
encoder = LabelEncoder()
titanic['Sex'] = encoder.fit_transform(titanic['Sex'])
titanic.head()
   Survived  Sex   Age     Fare Embarked
0         0    1  22.0   7.2500        S
1         1    0  38.0  71.2833        C
2         1    0  26.0   7.9250        S
3         1    0  35.0  53.1000        S
4         0    1  35.0   8.0500        S
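To see which integer was assigned to which category (my addition), inspect the fitted encoder. Classes are sorted alphabetically, so female is encoded as 0 and male as 1:
# classes_[i] is the category that was encoded as the integer i
encoder.classes_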
7. OneHotEncoder
OneHotEncoder is a technique used to convert categorical data into numerical data by creating binary columns for each
category. It's useful for machine learning algorithms that only work with numerical data.
Key points:
titanic = dfx.copy(deep=True)
# Apply OneHotEncoder
encoder = OneHotEncoder()
encoded_features = encoder.fit_transform(titanic[['Embarked']]).toarray()
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['Embarked']))
titanic = pd.concat([titanic, encoded_df], axis=1).drop('Embarked', axis=1)
titanic.head()
8. PolynomialFeatures
PolynomialFeatures is a technique used to create polynomial features from existing features. This is useful for capturing non-
linear relationships between features.
Key points:
titanic = dfx.copy(deep=True)
# Apply PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(titanic[numerical_features])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(numerical_features))
titanic = pd.concat([titanic, poly_df], axis=1)
titanic.head()
titanic.head() now shows the generated polynomial columns 1, Age, Fare, Age^2, Age Fare and Fare^2 alongside Survived, Sex, Age, Fare and Embarked.
9. SimpleImputer
SimpleImputer is a technique used to handle missing values in a dataset. It replaces missing values with a specified value,
such as the mean, median, mode, or a constant.
Key points:
titanic = dfx.copy(deep=True)
# Apply SimpleImputer
imputer = SimpleImputer(strategy='mean')
titanic['Age'] = imputer.fit_transform(titanic[['Age']])
titanic.head()
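A quick check (my addition) that the imputation worked:
# No missing ages should remain after mean imputation
titanic['Age'].isna().sum()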
10. KNNImputer
KNNImputer is a technique used to handle missing values in a dataset by imputing them with values from similar data points. It
leverages the k-Nearest Neighbors algorithm to find the closest neighbors to a data point with missing values and uses their
values to fill in the missing information.
Key points:
Neighborhood-Based Imputation: Fills missing values based on the values of nearby data points.
k-Nearest Neighbors: Identifies the k most similar data points to the one with missing values.
Weighted Average: The missing value is imputed using a weighted average of the values from the k nearest neighbors.
Benefits: Can handle complex patterns in data and often provides more accurate imputations than simple methods like mean or
median imputation.
Considerations: Requires careful selection of the number of neighbors (k) and distance metric.
titanic = dfx.copy(deep=True)
# Apply KNNImputer
imputer = KNNImputer(n_neighbors=5)
titanic['Age'] = imputer.fit_transform(titanic[['Age']])
titanic.head()
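Note that when only the Age column is passed, KNNImputer has no other features to measure similarity with, so it can do little better than a simple average. A small sketch of the more typical multivariate use (my addition, pairing Age with Fare, which is already in dfx):
titanic = dfx.copy(deep=True)
# Find neighbours using both Age and Fare, then fill the missing ages from them
imputer = KNNImputer(n_neighbors=5)
titanic[['Age', 'Fare']] = imputer.fit_transform(titanic[['Age', 'Fare']])
titanic.head()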
11. PowerTransformer
PowerTransformer is a data preprocessing technique that transforms data to a normal distribution using a power
transformation. This is useful when your data is not normally distributed, as many statistical techniques assume normality.
Key points:
titanic = dfx.copy(deep=True)
# Apply PowerTransformer
scaler = PowerTransformer()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()
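One way to see the effect (my addition) is to compare skewness before and after the transform; Fare in particular is strongly right-skewed in the raw data and should end up much closer to 0:
# Skewness near 0 indicates a roughly symmetric, more Gaussian-like distribution
print(dfx[numerical_features].skew())      # before the transform
print(titanic[numerical_features].skew())  # after the transform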
12. QuantileTransformer
QuantileTransformer is a data preprocessing technique that maps data to a uniform or normal distribution using quantile-based
transformations. This is useful when your data is not normally distributed and you want to improve the performance of
algorithms that assume normality.
Key points:
titanic = dfx.copy(deep=True)
# Apply QuantileTransformer
scaler = QuantileTransformer()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()
13. MaxAbsScaler
MaxAbsScaler is a data preprocessing technique that scales each feature by its maximum absolute value, mapping values into the range [-1, 1]. It does not shift or center the data, so it preserves sparsity.
Key points:
titanic = dfx.copy(deep=True)
# Apply MaxAbsScaler
scaler = MaxAbsScaler()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()
14. OrdinalEncoder
OrdinalEncoder is a technique used to encode categorical features with integer values based on their order. This is useful for
categorical features that have a natural order, such as "low," "medium," and "high."
Key points:
titanic = dfx.copy(deep=True)
# Apply OrdinalEncoder
encoder = OrdinalEncoder()
titanic[['Embarked']] = encoder.fit_transform(titanic[['Embarked']])
titanic.head()
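To see the learned mapping (my addition), inspect categories_; categories are sorted, so C, Q and S are encoded as 0, 1 and 2 respectively:
# categories_[0][i] is the Embarked value that was encoded as the integer i
encoder.categories_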
15. KBinsDiscretizer
KBinsDiscretizer is a technique used to discretize continuous features into intervals (bins). This is useful for converting
continuous features into categorical features, which can be beneficial for certain machine learning algorithms.
Key points:
titanic = dfx.copy(deep=True)
# Apply KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
titanic['Fare_binned'] = discretizer.fit_transform(titanic[['Fare']])
titanic.head()
titanic['Fare_binned'].value_counts()
Fare_binned
0.0 838
1.0 33
2.0 17
4.0 3
Name: count, dtype: int64
16. FunctionTransformer
FunctionTransformer is a versatile preprocessing technique that allows you to apply custom functions to your data. It provides
a flexible way to incorporate custom transformations into your machine learning pipeline.
Key points:
Custom Transformations: Enables you to define and apply your own functions to data.
Flexibility: Can be used for a wide range of transformations, from simple arithmetic operations to complex statistical calculations.
Integration with Pipelines: Easily integrates into scikit-learn pipelines for seamless data processing.
Use Cases: Feature engineering, data cleaning, and other custom data transformations.
Considerations: Ensure that your custom function is efficient and compatible with the scikit-learn framework.
titanic = dfx.copy(deep=True)
# Apply FunctionTransformer
transformer = FunctionTransformer(lambda x: x + 1)
titanic[numerical_features] = transformer.fit_transform(titanic[numerical_features])
titanic.head()
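A more typical use than adding a constant is a log transform to compress a heavily skewed feature such as Fare. A small sketch (my own example, not from the original notebook):
titanic = dfx.copy(deep=True)
# log1p compresses the long right tail of Fare while keeping zero fares valid
log_fare = FunctionTransformer(np.log1p)
titanic['Fare'] = log_fare.fit_transform(titanic['Fare'])
titanic.head()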
17. MultiLabelBinarizer
MultiLabelBinarizer is a technique used to encode multi-label categorical features into binary matrices. It's useful for tasks
where a sample can belong to multiple categories simultaneously.
Key points:
titanic = df[['Survived','Sex','Age','Fare','Embarked','Cabin']]
# Apply MultiLabelBinarizer
mlb = MultiLabelBinarizer()
titanic['Cabin'] = titanic['Cabin'].fillna('').apply(lambda x: x.split())
titanic = titanic.join(pd.DataFrame(mlb.fit_transform(titanic.pop('Cabin')), columns=mlb.classes_, index=titanic.index))
titanic.head()
titanic.head() now contains one 0/1 indicator column per cabin label (A10, A14, A16, A19, A20, ..., E8, F, F2, F33, F38, F4, G6, G63, G73, T) alongside Survived, Sex, Age, Fare and Embarked.
19. ColumnTransformer
ColumnTransformer is a powerful tool in scikit-learn that allows you to apply different transformations to different columns of
your dataset. This is especially useful when dealing with datasets containing both numerical and categorical features that
require different preprocessing steps.
Key points:
titanic = dfx.copy(deep=True)
# Apply ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
transformer = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_features),
('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['Embarked']),
],
remainder='passthrough' # Keep other columns
)
titanic_transformed = transformer.fit_transform(titanic)
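The result of fit_transform is a plain array; to get labelled columns back, the transformer's get_feature_names_out can be used to rebuild a DataFrame (a small sketch of mine, assuming a reasonably recent scikit-learn version):
# Rebuild a labelled DataFrame from the transformed array
titanic_transformed = pd.DataFrame(
    titanic_transformed,
    columns=transformer.get_feature_names_out(),
    index=titanic.index)
titanic_transformed.head()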