
Data preprocessing Part 01

Data preprocessing is a crucial step in machine learning because it directly impacts the quality and performance of your
models.

1. Improved Model Accuracy:

Handling Missing Values: Missing data can lead to biased or inaccurate models. Imputation techniques fill in missing values with
reasonable estimates, improving data completeness and model accuracy.

Outlier Detection and Treatment: Outliers can skew the model's learning process. Identifying and handling them appropriately
prevents the model from being overly influenced by extreme values.

Feature Scaling: Features with different scales can affect the model's learning process. Scaling techniques like normalization or
standardization ensure that all features contribute equally to the model.

Data Cleaning: Removing inconsistencies, errors, and noise in the data ensures that the model learns from reliable information.

2. Enhanced Model Interpretability:

Feature Engineering: Creating new features or transforming existing ones can make the data more meaningful and easier to
interpret.

Dimensionality Reduction: Techniques like PCA or t-SNE reduce the number of features, making the model simpler and easier to understand.

3. Faster Model Training:

Data Cleaning and Preprocessing: Clean and well-prepared data reduces the computational cost of training, leading to faster
model development.

Feature Selection: Selecting the most relevant features reduces the model's complexity and speeds up training.

4. Better Generalization:

Data Cleaning and Preprocessing: High-quality data helps the model learn general patterns and avoid overfitting, improving its
ability to generalize to new, unseen data.

5. Efficient Model Deployment:

Data Preprocessing Pipeline: Creating a robust preprocessing pipeline ensures that new data can be efficiently prepared for the
model, facilitating deployment and real-time predictions.

In essence, data preprocessing is the foundation of successful machine learning. By investing time and effort in this step, you
can significantly improve the accuracy, interpretability, and efficiency of your models.

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

# Load Titanic dataset


df = pd.read_csv('titanic.csv')
df.head()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S

1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C

2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S

3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S

4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Download Code: https://t.me/AIMLDeepThaught/295


dfx = df[['Survived','Sex','Age','Fare','Embarked']]
dfx.head()
Survived Sex Age Fare Embarked

0 0 male 22.0 7.2500 S

1 1 female 38.0 71.2833 C

2 1 female 26.0 7.9250 S

3 1 female 35.0 53.1000 S

4 0 male 35.0 8.0500 S

1. StandardScaler
StandardScaler is a common technique used in data preprocessing to standardize numerical features. It transforms features by
removing the mean and scaling them to unit variance. This ensures that all features contribute equally to the model, regardless
of their original scale.

Key points:

Centering: Subtracts the mean from each feature value.


Scaling: Divides each centered value by the standard deviation.
Normalization: Results in a distribution with a mean of 0 and a standard deviation of 1.
Benefits: Improved model performance, especially for algorithms sensitive to feature scales.
Use Cases: Linear Regression, Logistic Regression, Support Vector Machines, and Neural Networks.

from sklearn.preprocessing import StandardScaler

titanic = dfx.copy(deep=True)

# Select numerical features


numerical_features = ['Age', 'Fare']

# Apply StandardScaler
scaler = StandardScaler()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male -0.530377 -0.502445 S

1 1 female 0.571831 0.786845 C

2 1 female -0.254825 -0.488854 S

3 1 female 0.365167 0.420730 S

4 0 male 0.365167 -0.486337 S
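
As a quick sanity check (assuming the cells above have been run), the first passenger's standardized age can be reproduced by hand. StandardScaler uses the population standard deviation (ddof=0) and ignores NaN values when fitting.

# Reproduce the first row's Age z-score manually: z = (x - mean) / std
age = dfx['Age']
z_manual = (age.iloc[0] - age.mean()) / age.std(ddof=0)
print(z_manual)                      # ~ -0.53, matching the scaled table above
print(scaler.mean_, scaler.scale_)   # fitted per-feature means and standard deviations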

2. MinMaxScaler
MinMaxScaler is a data preprocessing technique that scales numerical features to a specific range, typically between 0 and 1.
It's useful when you want to transform features to a bounded range or when you want to avoid negative values.

Key points:

Rescaling: Maps the minimum value to 0 and the maximum value to 1.


Normalization: Scales all values proportionally to the new range.
Preserves: Original data distribution.
Benefits: Improved model performance, especially for algorithms that assume a specific input range.
Use Cases: Neural Networks, K-Means Clustering, and other algorithms that require feature values within a specific range.

from sklearn.preprocessing import MinMaxScaler

titanic = dfx.copy(deep=True)

# Apply MinMaxScaler
scaler = MinMaxScaler()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()
Survived Sex Age Fare Embarked

0 0 male 0.271174 0.014151 S

1 1 female 0.472229 0.139136 C

2 1 female 0.321438 0.015469 S

3 1 female 0.434531 0.103644 S

4 0 male 0.434531 0.015713 S
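
The mapping is (x - min) / (max - min). A small hand check against the raw data, assuming the fitted scaler from the cell above:

# First passenger's Age rescaled by hand
age = dfx['Age']
print((age.iloc[0] - age.min()) / (age.max() - age.min()))   # ~ 0.2712, as in the table above
print(scaler.data_min_, scaler.data_max_)                    # fitted per-feature minima and maxima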

3. RobustScaler
RobustScaler is a data preprocessing technique that scales features using the interquartile range (IQR). It's robust to outliers,
making it suitable for data with skewed distributions or outliers.

Key points:

IQR: Measures the range between the 25th and 75th percentiles.
Scaling: Scales features using the IQR, reducing the influence of outliers.
Normalization: Results in a distribution with a smaller range and less sensitivity to extreme values.
Benefits: Improved model performance, especially for algorithms sensitive to outliers.
Use Cases: Linear Regression, Logistic Regression, and other algorithms that can be affected by outliers.

from sklearn.preprocessing import RobustScaler

titanic = dfx.copy(deep=True)

# Apply RobustScaler
scaler = RobustScaler()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male -0.335664 -0.312011 S

1 1 female 0.559441 2.461242 C

2 1 female -0.111888 -0.282777 S

3 1 female 0.391608 1.673732 S

4 0 male 0.391608 -0.277363 S
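
RobustScaler computes (x - median) / IQR for each feature. A hand check on Age, assuming the cells above have been run:

# First passenger's Age scaled with the median and interquartile range
age = dfx['Age']
q1, med, q3 = age.quantile([0.25, 0.50, 0.75])
print((age.iloc[0] - med) / (q3 - q1))   # ~ -0.336, matching the table above
print(scaler.center_, scaler.scale_)     # fitted medians and IQRs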

4. Normalizer
Normalizer is a data preprocessing technique that scales individual samples to have a unit norm. This means each sample
(row) in the dataset is rescaled so that its length (Euclidean norm) becomes 1.

Key points:

Unit Norm: Each sample is scaled to have a magnitude of 1.


Feature Proportions: The relative proportions of the feature values within each sample are preserved; only the overall magnitude changes.
Use Cases: Text classification, clustering, and other techniques that rely on cosine similarity or other distance metrics.
Benefits: Improved performance for algorithms that are sensitive to the magnitude of feature values.

from sklearn.preprocessing import Normalizer


from sklearn.impute import SimpleImputer

titanic = dfx.copy(deep=True)

# Apply Normalizer
scaler = Normalizer()

# Create an imputer to fill NaN values with the mean


imputer = SimpleImputer(strategy='mean')

# Fit the imputer on your numerical features and transform the data
titanic[numerical_features] = imputer.fit_transform(titanic[numerical_features])

# Now apply the Normalizer on the imputed data


titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])

titanic.head()
Survived Sex Age Fare Embarked

0 0 male 0.949757 0.312988 S

1 1 female 0.470417 0.882444 C

2 1 female 0.956551 0.291564 S

3 1 female 0.550338 0.834942 S

4 0 male 0.974555 0.224148 S
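
A quick check that each scaled (Age, Fare) row now has unit Euclidean length; unlike the column-wise scalers above, this is a per-row operation.

import numpy as np

# Every row vector (Age, Fare) should now have L2 norm 1
row_norms = np.linalg.norm(titanic[numerical_features].values, axis=1)
print(row_norms[:5])   # values ~ 1.0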

5. Binarizer
Binarizer is a data preprocessing technique that transforms numerical values into binary values (0 or 1) based on a specified
threshold. It's useful for feature discretization or when you want to treat numerical features as categorical.

Key points:

Thresholding: Values greater than the threshold are mapped to 1, and values less than or equal to the threshold are mapped to 0.
Feature Discretization: Converts continuous features into discrete binary features.
Use Cases: Text analysis, where presence or absence of a word is important, and other applications where binary representation is
beneficial.
Benefits: Simplifies data, reduces dimensionality, and can improve model performance in certain scenarios.

from sklearn.preprocessing import Binarizer


from sklearn.impute import SimpleImputer

titanic = dfx.copy(deep=True)

# Impute missing values in 'Age' with the mean


imputer = SimpleImputer(strategy='mean')
titanic['Age'] = imputer.fit_transform(titanic[['Age']])

# Apply Binarizer
binarizer = Binarizer(threshold=0.5)
titanic['Age'] = binarizer.fit_transform(titanic[['Age']])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male 1.0 7.2500 S

1 1 female 1.0 71.2833 C

2 1 female 1.0 7.9250 S

3 1 female 1.0 53.1000 S

4 0 male 1.0 8.0500 S
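
Because ages are measured in years, threshold=0.5 puts every imputed age above the cutoff, which is why the Age column above is all 1.0. A threshold on the original scale is usually more meaningful; the sketch below uses 18 as an illustrative adult/child cutoff (an assumed value, not from the original text).

from sklearn.preprocessing import Binarizer
from sklearn.impute import SimpleImputer

titanic = dfx.copy(deep=True)
titanic['Age'] = SimpleImputer(strategy='mean').fit_transform(titanic[['Age']])

# 1 for ages strictly greater than 18, 0 otherwise (illustrative threshold)
titanic['Is_Adult'] = Binarizer(threshold=18).fit_transform(titanic[['Age']])
titanic[['Age', 'Is_Adult']].head()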

6. LabelEncoder
LabelEncoder is a technique used to convert categorical data into numerical data. It assigns a unique integer to each category.
This is useful for machine learning algorithms that only work with numerical data.

Key points:

Categorical to Numerical: Transforms categorical labels into numerical labels.


Unique Integer Assignment: Assigns a unique integer to each category, in alphabetical order of the labels.
Implied Order: The resulting integers suggest an order between categories, which may not always be appropriate.
Use Cases: Encoding target labels or binary features; for input features with a natural order, OrdinalEncoder with an explicit category order is usually the better fit.
Caution: Be careful when using LabelEncoder for nominal categorical data, as it might introduce a false sense of ordinality.

from sklearn.preprocessing import LabelEncoder

titanic = dfx.copy(deep=True)

# Apply LabelEncoder
encoder = LabelEncoder()
titanic['Sex'] = encoder.fit_transform(titanic['Sex'])
titanic.head()
Survived Sex Age Fare Embarked

0 0 1 22.0 7.2500 S

1 1 0 38.0 71.2833 C

2 1 0 26.0 7.9250 S

3 1 0 35.0 53.1000 S

4 0 1 35.0 8.0500 S
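
LabelEncoder assigns the integers in alphabetical order of the labels (here female → 0, male → 1), and the mapping can be inspected and reversed, assuming the fitted encoder from the cell above:

print(encoder.classes_)                    # ['female' 'male'] -> encoded as 0 and 1
print(encoder.inverse_transform([0, 1]))   # map integers back to the original labels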

7. OneHotEncoder
OneHotEncoder is a technique used to convert categorical data into numerical data by creating binary columns for each
category. It's useful for machine learning algorithms that only work with numerical data.

Key points:

Categorical to Numerical: Transforms categorical labels into numerical representations.


Binary Columns: Creates a new binary column for each category.
One-Hot Encoding: Sets the value to 1 for the corresponding category and 0 for others.
Use Cases: When categorical features have no inherent order (e.g., color, country).
Benefits: Avoids introducing a false sense of ordinality, as each category is treated independently.

from sklearn.preprocessing import OneHotEncoder

titanic = dfx.copy(deep=True)

# Apply OneHotEncoder
encoder = OneHotEncoder()
encoded_features = encoder.fit_transform(titanic[['Embarked']]).toarray()
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['Embarked']))
titanic = pd.concat([titanic, encoded_df], axis=1).drop('Embarked', axis=1)
titanic.head()

Survived Sex Age Fare Embarked_C Embarked_Q Embarked_S Embarked_nan

0 0 male 22.0 7.2500 0.0 0.0 1.0 0.0

1 1 female 38.0 71.2833 1.0 0.0 0.0 0.0

2 1 female 26.0 7.9250 0.0 0.0 1.0 0.0

3 1 female 35.0 53.1000 0.0 0.0 1.0 0.0

4 0 male 35.0 8.0500 0.0 0.0 1.0 0.0
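
The Embarked_nan column appears because some passengers have a missing port of embarkation. One common (though not the only) option is to impute the most frequent port before encoding; a minimal sketch:

from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

titanic = dfx.copy(deep=True)

# Fill missing ports with the mode so no Embarked_nan column is produced
titanic[['Embarked']] = SimpleImputer(strategy='most_frequent').fit_transform(titanic[['Embarked']])

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(titanic[['Embarked']])
pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Embarked'])).head()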

8. PolynomialFeatures
PolynomialFeatures is a technique used to create polynomial features from existing features. This is useful for capturing non-
linear relationships between features.

Key points:

Feature Engineering: Generates new features by raising existing features to powers.


Non-Linear Relationships: Captures complex patterns that linear models might miss.
Polynomial Degree: Determines the maximum degree of the polynomial features.
Use Cases: Polynomial Regression, Support Vector Machines with polynomial kernels.
Caution: Higher-degree polynomials can lead to overfitting, so careful tuning is required.

from sklearn.preprocessing import PolynomialFeatures

titanic = dfx.copy(deep=True)

# Impute missing values in 'Age' with the mean


imputer = SimpleImputer(strategy='mean')
titanic['Age'] = imputer.fit_transform(titanic[['Age']])

# Apply PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(titanic[numerical_features])
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(numerical_features))
titanic = pd.concat([titanic, poly_df], axis=1)
titanic.head()
Survived Sex Age Fare Embarked 1 Age Fare Age^2 Age Fare Fare^2

0 0 male 22.0 7.2500 S 1.0 22.0 7.2500 484.0 159.5000 52.562500

1 1 female 38.0 71.2833 C 1.0 38.0 71.2833 1444.0 2708.7654 5081.308859

2 1 female 26.0 7.9250 S 1.0 26.0 7.9250 676.0 206.0500 62.805625

3 1 female 35.0 53.1000 S 1.0 35.0 53.1000 1225.0 1858.5000 2819.610000

4 0 male 35.0 8.0500 S 1.0 35.0 8.0500 1225.0 281.7500 64.802500
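
If only the cross terms are needed (and not the squares or the constant column), interaction_only=True and include_bias=False restrict the output. A minimal sketch, independent of the cell above:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer

titanic = dfx.copy(deep=True)
titanic['Age'] = SimpleImputer(strategy='mean').fit_transform(titanic[['Age']])

# Only Age, Fare and the Age*Fare interaction term are generated
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = poly.fit_transform(titanic[numerical_features])
pd.DataFrame(interactions, columns=poly.get_feature_names_out(numerical_features)).head()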

9. SimpleImputer
SimpleImputer is a technique used to handle missing values in a dataset. It replaces missing values with a specified value,
such as the mean, median, mode, or a constant.

Key points:

Missing Value Imputation: Fills in missing values with a suitable replacement.


Imputation Strategies: Mean, median, mode, or a constant value.
Benefits: Prevents data loss and ensures that machine learning algorithms can process the data.
Use Cases: Handling missing values in various datasets, especially when the missing values are missing at random.
Considerations: The choice of imputation strategy depends on the nature of the missing data and the specific machine learning
algorithm.

from sklearn.impute import SimpleImputer

titanic = dfx.copy(deep=True)

# Apply SimpleImputer
imputer = SimpleImputer(strategy='mean')
titanic['Age'] = imputer.fit_transform(titanic[['Age']])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male 22.0 7.2500 S

1 1 female 38.0 71.2833 C

2 1 female 26.0 7.9250 S

3 1 female 35.0 53.1000 S

4 0 male 35.0 8.0500 S
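
Different strategies suit different columns. The sketch below uses the median for the skewed Age column and the most frequent value for the categorical Embarked column; these strategy choices are illustrative, not prescribed by the original text.

from sklearn.impute import SimpleImputer

titanic = dfx.copy(deep=True)

# Median is less sensitive to outliers than the mean for numeric columns
titanic['Age'] = SimpleImputer(strategy='median').fit_transform(titanic[['Age']])

# most_frequent (the mode) works for categorical columns such as Embarked
titanic[['Embarked']] = SimpleImputer(strategy='most_frequent').fit_transform(titanic[['Embarked']])

titanic.isnull().sum()   # no missing values should remain in Age or Embarked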

10. KNNImputer
KNNImputer is a technique used to handle missing values in a dataset by imputing them with values from similar data points. It
leverages the k-Nearest Neighbors algorithm to find the closest neighbors to a data point with missing values and uses their
values to fill in the missing information.

Key points:

Neighborhood-Based Imputation: Fills missing values based on the values of nearby data points.
k-Nearest Neighbors: Identifies the k most similar data points to the one with missing values.
Weighted Average: The missing value is imputed using a weighted average of the values from the k nearest neighbors.
Benefits: Can handle complex patterns in data and often provides more accurate imputations than simple methods like mean or
median imputation.
Considerations: Requires careful selection of the number of neighbors (k) and distance metric.

from sklearn.impute import KNNImputer

titanic = dfx.copy(deep=True)

# Apply KNNImputer
imputer = KNNImputer(n_neighbors=5)
titanic['Age'] = imputer.fit_transform(titanic[['Age']])
titanic.head()
Survived Sex Age Fare Embarked

0 0 male 22.0 7.2500 S

1 1 female 38.0 71.2833 C

2 1 female 26.0 7.9250 S

3 1 female 35.0 53.1000 S

4 0 male 35.0 8.0500 S
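
With only the Age column supplied, there is no other feature on which to measure similarity, so the neighbour search adds little over mean imputation. Passing several numeric columns (ideally scaled first) lets the imputer find genuinely similar passengers; a sketch:

from sklearn.impute import KNNImputer

titanic = dfx.copy(deep=True)

# Distances are computed over all supplied columns, so a passenger's missing Age
# is filled from the passengers with the most similar Fare (and Age, where known)
imputer = KNNImputer(n_neighbors=5)
titanic[['Age', 'Fare']] = imputer.fit_transform(titanic[['Age', 'Fare']])
titanic['Age'].isnull().sum()   # 0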

11. PowerTransformer
PowerTransformer is a data preprocessing technique that transforms data to a normal distribution using a power
transformation. This is useful when your data is not normally distributed, as many statistical techniques assume normality.

Key points:

Normalization: Maps data to a normal distribution.


Transformation: Applies a power transformation (e.g., Box-Cox or Yeo-Johnson).
Benefits: Improved performance for algorithms that assume normally distributed data.
Use Cases: Linear Regression, Logistic Regression, and other parametric models.
Considerations: Careful selection of the appropriate power transformation is crucial.

from sklearn.preprocessing import PowerTransformer

titanic = dfx.copy(deep=True)

# Apply PowerTransformer
scaler = PowerTransformer()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male -0.472172 -0.878820 S

1 1 female 0.605017 1.336651 C

2 1 female -0.189376 -0.790065 S

3 1 female 0.412768 1.067352 S

4 0 male 0.412768 -0.774439 S
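
PowerTransformer defaults to the Yeo-Johnson transform, which accepts zero and negative values; Box-Cox requires strictly positive data, and the Titanic Fare column contains zeros, so Yeo-Johnson is the safer choice here. A sketch making the method explicit:

from sklearn.preprocessing import PowerTransformer

titanic = dfx.copy(deep=True)

# Yeo-Johnson handles zeros/negatives; method='box-cox' would require strictly positive values
pt = PowerTransformer(method='yeo-johnson', standardize=True)
titanic[numerical_features] = pt.fit_transform(titanic[numerical_features])
print(pt.lambdas_)   # fitted power parameter for each column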

12. QuantileTransformer
QuantileTransformer is a data preprocessing technique that maps data to a uniform or normal distribution using quantile-based
transformations. This is useful when your data is not normally distributed and you want to improve the performance of
algorithms that assume normality.

Key points:

Distribution Mapping: Transforms data to a uniform or normal distribution.


Quantile-Based Transformation: Maps data points to their corresponding quantiles in the target distribution.
Benefits: Improved performance for algorithms that assume normal distribution.
Use Cases: Linear Regression, Logistic Regression, and other parametric models.
Considerations: The number of quantiles can influence the transformation's effectiveness.

from sklearn.preprocessing import QuantileTransformer

titanic = dfx.copy(deep=True)

# Apply QuantileTransformer
scaler = QuantileTransformer()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male 0.304494 0.084831 S

1 1 female 0.744944 0.885393 C

2 1 female 0.433708 0.259551 S

3 1 female 0.683708 0.838764 S

4 0 male 0.683708 0.295506 S
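
By default the output distribution is uniform (values between 0 and 1, as in the table above). Setting output_distribution='normal' maps the data to a standard normal instead; n_quantiles should not exceed the number of rows. A sketch:

from sklearn.preprocessing import QuantileTransformer

titanic = dfx.copy(deep=True)

# Map Age and Fare to an approximately standard normal distribution
scaler = QuantileTransformer(output_distribution='normal', n_quantiles=min(1000, len(titanic)))
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic[numerical_features].head()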


13. MaxAbsScaler
MaxAbsScaler is a data preprocessing technique that scales each feature by its maximum absolute value. This ensures that all
features are scaled to a range between -1 and 1.

Key points:

Scaling: Divides each feature value by its maximum absolute value.


Range: Scales features to a range between -1 and 1.
Preserves Sparsity: Doesn't shift or center the data, preserving zero values.
Use Cases: When you want to scale features without affecting sparsity and when the sign of the feature is important.
Benefits: Simple and effective for scaling features with varying magnitudes.

from sklearn.preprocessing import MaxAbsScaler

titanic = dfx.copy(deep=True)

# Apply MaxAbsScaler
scaler = MaxAbsScaler()
titanic[numerical_features] = scaler.fit_transform(titanic[numerical_features])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male 0.2750 0.014151 S

1 1 female 0.4750 0.139136 C

2 1 female 0.3250 0.015469 S

3 1 female 0.4375 0.103644 S

4 0 male 0.4375 0.015713 S
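
A small hand check, assuming the fitted scaler above: each value is simply divided by the column's maximum absolute value.

# First passenger's Age rescaled by hand: Age / max(|Age|)
age = dfx['Age']
print(age.iloc[0] / age.abs().max())   # ~ 0.275, matching the table above
print(scaler.max_abs_)                 # fitted maximum absolute value per column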

14. OrdinalEncoder
OrdinalEncoder is a technique used to encode categorical features with integer values based on their order. This is useful for
categorical features that have a natural order, such as "low," "medium," and "high."

Key points:

Categorical to Numerical: Transforms categorical labels into numerical labels.


Order Preservation: Encodes categories according to their order.
Use Cases: When categorical features have a clear ordinal relationship.
Caution: Avoid using OrdinalEncoder for nominal categorical features, as it might introduce a false sense of order.
Benefits: Simple and effective for encoding ordinal categorical features.

from sklearn.preprocessing import OrdinalEncoder

titanic = dfx.copy(deep=True)

# Apply OrdinalEncoder
encoder = OrdinalEncoder()
titanic[['Embarked']] = encoder.fit_transform(titanic[['Embarked']])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male 22.0 7.2500 2.0

1 1 female 38.0 71.2833 0.0

2 1 female 26.0 7.9250 2.0

3 1 female 35.0 53.1000 2.0

4 0 male 35.0 8.0500 2.0
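
By default the integers follow the alphabetical order of the categories (C → 0, Q → 1, S → 2), and missing values are left as NaN. For a genuinely ordinal column you would normally fix the order explicitly via the categories parameter; the sketch below does this for Embarked purely as an illustration (Embarked is nominal, and imputing the mode first is an assumed choice).

from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

titanic = dfx.copy(deep=True)

# Fill the missing ports, then pin the category-to-integer mapping explicitly
titanic[['Embarked']] = SimpleImputer(strategy='most_frequent').fit_transform(titanic[['Embarked']])
encoder = OrdinalEncoder(categories=[['C', 'Q', 'S']])
titanic[['Embarked']] = encoder.fit_transform(titanic[['Embarked']])
titanic['Embarked'].value_counts()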

15. KBinsDiscretizer
KBinsDiscretizer is a technique used to discretize continuous features into intervals (bins). This is useful for converting
continuous features into categorical features, which can be beneficial for certain machine learning algorithms.

Key points:

Discretization: Divides continuous features into discrete intervals.


Binning Strategies: Equal-width binning, equal-frequency binning, or quantile-based binning.
Encoding: Encodes the bins using integer values or one-hot encoding.
Benefits: Reduces the impact of noise and outliers, improves model interpretability, and can enhance the performance of certain
algorithms.
Considerations: The number of bins and the binning strategy can significantly impact the results.

from sklearn.preprocessing import KBinsDiscretizer

titanic = dfx.copy(deep=True)

# Apply KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
titanic['Fare_binned'] = discretizer.fit_transform(titanic[['Fare']])
titanic.head()

Survived Sex Age Fare Embarked Fare_binned

0 0 male 22.0 7.2500 S 0.0

1 1 female 38.0 71.2833 C 0.0

2 1 female 26.0 7.9250 S 0.0

3 1 female 35.0 53.1000 S 0.0

4 0 male 35.0 8.0500 S 0.0

titanic['Fare_binned'].value_counts()

Fare_binned
0.0 838
1.0 33
2.0 17
4.0 3
Name: count, dtype: int64
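
Because Fare is heavily right-skewed, equal-width bins (strategy='uniform') put almost every passenger in the first bin, as the counts above show. strategy='quantile' produces bins with roughly equal numbers of passengers instead; a sketch:

# Equal-frequency binning of Fare into 5 bins
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
titanic['Fare_binned_q'] = discretizer.fit_transform(titanic[['Fare']])
titanic['Fare_binned_q'].value_counts()   # roughly equal counts per bin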

16. FunctionTransformer
FunctionTransformer is a versatile preprocessing technique that allows you to apply custom functions to your data. It provides
a flexible way to incorporate custom transformations into your machine learning pipeline.

Key points:

Custom Transformations: Enables you to define and apply your own functions to data.
Flexibility: Can be used for a wide range of transformations, from simple arithmetic operations to complex statistical calculations.
Integration with Pipelines: Easily integrates into scikit-learn pipelines for seamless data processing.
Use Cases: Feature engineering, data cleaning, and other custom data transformations.
Considerations: Ensure that your custom function is efficient and compatible with the scikit-learn framework.

from sklearn.preprocessing import FunctionTransformer

titanic = dfx.copy(deep=True)

# Apply FunctionTransformer
transformer = FunctionTransformer(lambda x: x + 1)
titanic[numerical_features] = transformer.fit_transform(titanic[numerical_features])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male 23.0 8.2500 S

1 1 female 39.0 72.2833 C

2 1 female 27.0 8.9250 S

3 1 female 36.0 54.1000 S

4 0 male 36.0 9.0500 S
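
Adding 1 to every value mainly demonstrates the mechanics. A more typical use is a log transform to compress a long-tailed column such as Fare; np.log1p keeps zero fares valid. A sketch (the Fare_log column name is illustrative):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

titanic = dfx.copy(deep=True)

# log(1 + x) compresses the long right tail of Fare while handling zero fares
log_transformer = FunctionTransformer(np.log1p)
titanic['Fare_log'] = log_transformer.fit_transform(titanic[['Fare']])
titanic[['Fare', 'Fare_log']].head()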

17. MultiLabelBinarizer
MultiLabelBinarizer is a technique used to encode multi-label categorical features into binary matrices. It's useful for tasks
where a sample can belong to multiple categories simultaneously.

Key points:

Multi-Label Encoding: Converts multi-label categorical data into a binary matrix.


Binary Matrix: Each row represents a sample, and each column represents a class.
One-Hot Encoding: A 1 indicates the presence of a class, and a 0 indicates its absence.
Use Cases: Text classification with multiple labels, image classification with multiple objects.
Benefits: Allows for flexible representation of multi-label data.

from sklearn.preprocessing import MultiLabelBinarizer

titanic = df[['Survived','Sex','Age','Fare','Embarked','Cabin']]

# Apply MultiLabelBinarizer
mlb = MultiLabelBinarizer()
titanic['Cabin'] = titanic['Cabin'].fillna('').apply(lambda x: x.split())
titanic = titanic.join(pd.DataFrame(mlb.fit_transform(titanic.pop('Cabin')), columns=mlb.classes_, index=titanic.index))
titanic.head()

Survived Sex Age Fare Embarked A10 A14 A16 A19 A20 ... E8 F F2 F33 F38 F4 G6 G63 G73 T

0 0 male 22.0 7.2500 S 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

1 1 female 38.0 71.2833 C 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 1 female 26.0 7.9250 S 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

3 1 female 35.0 53.1000 S 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4 0 male 35.0 8.0500 S 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 166 columns

18. StandardScaler with Pipeline


StandardScaler with Pipeline is a powerful combination used to streamline data preprocessing and model building. It involves
creating a sequence of steps, including scaling and modeling, within a single pipeline.

Key points:

Sequential Workflow: Combines multiple steps into a single pipeline.


StandardScaler Integration: Includes the StandardScaler step to standardize features.
Model Integration: Incorporates a machine learning model (e.g., linear regression, logistic regression, etc.) into the pipeline.
Benefits: Simplifies the workflow, reduces the risk of data leakage, and improves model performance.
Use Cases: A wide range of machine learning tasks, especially those involving multiple preprocessing steps and model training.

from sklearn.pipeline import Pipeline

titanic = dfx.copy(deep=True)

# Apply StandardScaler with Pipeline


pipeline = Pipeline([
('scaler', StandardScaler())
])
titanic[numerical_features] = pipeline.fit_transform(titanic[numerical_features])
titanic.head()

Survived Sex Age Fare Embarked

0 0 male -0.530377 -0.502445 S

1 1 female 0.571831 0.786845 C

2 1 female -0.254825 -0.488854 S

3 1 female 0.365167 0.420730 S

4 0 male 0.365167 -0.486337 S
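
The pipeline above contains only the scaler. To illustrate the "Model Integration" point, the minimal sketch below chains imputation, scaling and a classifier; LogisticRegression and the mean-imputation step are assumed choices, not part of the original notebook.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

titanic = dfx.copy(deep=True)
X = titanic[numerical_features]   # Age and Fare only, to keep the example small
y = titanic['Survived']

# All preprocessing is fitted inside the pipeline, which helps avoid data leakage
model = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
model.fit(X, y)
print(model.score(X, y))   # training accuracy, just to confirm the pipeline runs end to end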

19. ColumnTransformer
ColumnTransformer is a powerful tool in scikit-learn that allows you to apply different transformations to different columns of
your dataset. This is especially useful when dealing with datasets containing both numerical and categorical features that
require different preprocessing steps.

Key points:

Selective Transformation: Applies specific transformations to subsets of columns.


Pipeline Integration: Easily integrates with scikit-learn's Pipeline for streamlined workflows.
Code Organization: Encapsulates preprocessing logic in a single, maintainable object.
Benefits: Improved data quality, efficient preprocessing, and better model performance.
Use Cases: Handling mixed data types, feature engineering, and complex data preprocessing pipelines.

titanic = dfx.copy(deep=True)

# Select numerical features


numerical_features = ['Age', 'Fare']

# Apply ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
transformer = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_features),
('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), ['Embarked']),
],
remainder='passthrough' # Keep other columns
)
titanic_transformed = transformer.fit_transform(titanic)

# Convert the NumPy array back into a Pandas DataFrame if needed.


# Get feature names after transformation
num_features_transformed = transformer.named_transformers_['num'].get_feature_names_out(numerical_features)
cat_features_transformed = transformer.named_transformers_['cat'].get_feature_names_out(['Embarked'])
all_features = list(num_features_transformed) + list(cat_features_transformed) + list(titanic.drop(numerical_features + ['Embarked'], axis=1).columns)

titanic = pd.DataFrame(titanic_transformed, columns=all_features)


titanic.head()

Age Fare Embarked_C Embarked_Q Embarked_S Embarked_nan Survived Sex

0 -0.530377 -0.502445 0.0 0.0 1.0 0.0 0 male

1 0.571831 0.786845 1.0 0.0 0.0 0.0 1 female

2 -0.254825 -0.488854 0.0 0.0 1.0 0.0 1 female

3 0.365167 0.42073 0.0 0.0 1.0 0.0 1 female

4 0.365167 -0.486337 0.0 0.0 1.0 0.0 0 male
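
To tie the pieces together, a ColumnTransformer is usually wrapped in a Pipeline with a model and evaluated with cross-validation. The sketch below adds imputation inside each branch and uses LogisticRegression with 5-fold cross-validation; these choices are illustrative assumptions, not taken from the original notebook.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = dfx[['Sex', 'Age', 'Fare', 'Embarked']]
y = dfx['Survived']

# Numeric columns: impute then scale; categorical columns: impute then one-hot encode
preprocess = ColumnTransformer(transformers=[
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), ['Age', 'Fare']),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['Sex', 'Embarked']),
])

clf = Pipeline([('preprocess', preprocess), ('model', LogisticRegression(max_iter=1000))])
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy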

Download Code: https://t.me/AIMLDeepThaught/295
