0% found this document useful (0 votes)
4 views5 pages

Data Preprocessing 2

The document outlines a data preprocessing workflow using Python, focusing on handling missing values, encoding categorical data, feature scaling, and splitting data into training and testing sets. It also discusses feature engineering techniques such as creating new features, polynomial transformations, binning, and log transformations. The code examples demonstrate these processes using a sample dataset with attributes like Age, Salary, Country, and Purchased status.

Uploaded by

Harshitha Reddy
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views5 pages

Data Preprocessing 2

The document outlines a data preprocessing workflow using Python, focusing on handling missing values, encoding categorical data, feature scaling, and splitting data into training and testing sets. It also discusses feature engineering techniques such as creating new features, polynomial transformations, binning, and log transformations. The code examples demonstrate these processes using a sample dataset with attributes like Age, Salary, Country, and Purchased status.

Uploaded by

Harshitha Reddy
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

8/23/24, 5:23 PM ML_Lab_3_Pre_Processing.

ipynb - Colab

#Loading the Data


import pandas as pd

# Sample data creation for demonstration


data = {
'Age': [25, 30, 45, None, 22],
'Salary': [50000, 60000, 80000, 90000, None],
'Country': ['USA', 'France', 'Germany', 'USA', 'France'],
'Purchased': ['No', 'Yes', 'No', 'Yes', 'No']
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Display the data


print(df)

Age Salary Country Purchased


0 25.0 50000.0 USA No
1 30.0 60000.0 France Yes
2 45.0 80000.0 Germany No
3 NaN 90000.0 USA Yes
4 22.0 NaN France No

Handling Missing Values

from sklearn.impute import SimpleImputer

# Handling missing values for numerical columns


imputer = SimpleImputer(strategy='mean')
df['Age'] = imputer.fit_transform(df[['Age']])
df['Salary'] = imputer.fit_transform(df[['Salary']])

print("After handling missing values:")


print(df)

After handling missing values:


Age Salary Country Purchased
0 25.0 50000.0 USA No
1 30.0 60000.0 France Yes
2 45.0 80000.0 Germany No
3 30.5 90000.0 USA Yes
4 22.0 70000.0 France No

?SimpleImputer

3. Encoding Categorical Data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder


from sklearn.compose import ColumnTransformer

# Encoding categorical data for 'Country' using OneHotEncoder


# The 'Country' column will be replaced with three new columns (one for each country)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), ['Country'])], remainder='passthrough')
df_encoded = ct.fit_transform(df)

# Convert the result back to a DataFrame for easier viewing


df_encoded = pd.DataFrame(df_encoded, columns=['France', 'Germany', 'USA', 'Age', 'Salary', 'Purchased'])

print("After encoding categorical data:")


print(df_encoded)

# Encoding 'Purchased' column using LabelEncoder


label_encoder = LabelEncoder()
df_encoded['Purchased'] = label_encoder.fit_transform(df_encoded['Purchased'])

print("After encoding 'Purchased' column:")


print(df_encoded)

After encoding categorical data:


France Germany USA Age Salary Purchased
0 0.0 0.0 1.0 25.0 50000.0 No
1 1.0 0.0 0.0 30.0 60000.0 Yes
2 0.0 1.0 0.0 45.0 80000.0 No
3 0.0 0.0 1.0 30.5 90000.0 Yes
4 1.0 0.0 0.0 22.0 70000.0 No
After encoding 'Purchased' column:
France Germany USA Age Salary Purchased

https://colab.research.google.com/drive/1dK3nPRGh9f4IzJLne3IY7v2SWOHGI5ti#printMode=true 1/5
8/23/24, 5:23 PM ML_Lab_3_Pre_Processing.ipynb - Colab
0 0.0 0.0 1.0 25.0 50000.0 0
1 1.0 0.0 0.0 30.0 60000.0 1
2 0.0 1.0 0.0 45.0 80000.0 0
3 0.0 0.0 1.0 30.5 90000.0 1
4 1.0 0.0 0.0 22.0 70000.0 0

4. Feature Scaling

from sklearn.preprocessing import StandardScaler

# Feature scaling for 'Age' and 'Salary'


scaler = StandardScaler()
df_encoded[['Age', 'Salary']] = scaler.fit_transform(df_encoded[['Age', 'Salary']])

print("After feature scaling:")


print(df_encoded)
# z = (x - u) / s

After feature scaling:


France Germany USA Age Salary Purchased
0 0.0 0.0 1.0 -0.695145 -1.414214 0
1 1.0 0.0 0.0 -0.063195 -0.707107 1
2 0.0 1.0 0.0 1.832656 0.707107 0
3 0.0 0.0 1.0 0.000000 1.414214 1
4 1.0 0.0 0.0 -1.074315 0.000000 0

5. Splitting the Data into Training and Testing Sets

from sklearn.model_selection import train_test_split

# Splitting the data into features (X) and target (y)


X = df_encoded.drop('Purchased', axis=1)
y = df_encoded['Purchased']

# Splitting the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set:")
print(X_train)
print(y_train)

print("Testing set:")
print(X_test)
print(y_test)

Training set:
France Germany USA Age Salary
4 1.0 0.0 0.0 -1.074315 0.000000
2 0.0 1.0 0.0 1.832656 0.707107
0 0.0 0.0 1.0 -0.695145 -1.414214
3 0.0 0.0 1.0 0.000000 1.414214
4 0
2 0
0 0
3 1
Name: Purchased, dtype: int64
Testing set:
France Germany USA Age Salary
1 1.0 0.0 0.0 -0.063195 -0.707107
1 1
Name: Purchased, dtype: int64

6. Feature engineering involves creating new features or transforming existing ones to improve model performance. Here are some common
techniques with Python code examples:

7. Creating New Features Date-Time Features: Extracting components like year, month, day, or hour from a datetime column. Interaction
Features: Combining two or more features to create interaction terms.

8. Polynomial Features Polynomial Transformations: Generating polynomial and interaction features.

9. Binning Binning Continuous Variables: Converting a continuous variable into categorical by binning.

10. Log Transformation Log Transformation: Applying a logarithmic transformation to reduce skewness.

11. Feature Selection Removing Low Variance Features: Removing features with low variance. Let's demonstrate these techniques with code.

https://colab.research.google.com/drive/1dK3nPRGh9f4IzJLne3IY7v2SWOHGI5ti#printMode=true 2/5
8/23/24, 5:23 PM ML_Lab_3_Pre_Processing.ipynb - Colab
import pandas as pd
import numpy as np

# Sample data for feature engineering


data = {
'Age': [25, 30, 45, 35, 22],
'Salary': [50000, 60000, 80000, 90000, 75000],
'Country': ['USA', 'France', 'Germany', 'USA', 'France'],
'Purchased': ['No', 'Yes', 'No', 'Yes', 'No'],
'JoinDate': pd.to_datetime(['2015-03-01', '2017-07-12', '2018-01-01', '2020-02-20', '2019-05-15'])
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:
Age Salary Country Purchased JoinDate
0 25 50000 USA No 2015-03-01
1 30 60000 France Yes 2017-07-12
2 45 80000 Germany No 2018-01-01
3 35 90000 USA Yes 2020-02-20
4 22 75000 France No 2019-05-15

#a) Date-Time Features


# Extracting year, month, and day from 'JoinDate'
df['Year'] = df['JoinDate'].dt.year
df['Month'] = df['JoinDate'].dt.month
df['Day'] = df['JoinDate'].dt.day

print("After extracting date-time features:")


print(df)

After extracting date-time features:


Age Salary Country Purchased JoinDate Year Month Day
0 25 50000 USA No 2015-03-01 2015 3 1
1 30 60000 France Yes 2017-07-12 2017 7 12
2 45 80000 Germany No 2018-01-01 2018 1 1
3 35 90000 USA Yes 2020-02-20 2020 2 20
4 22 75000 France No 2019-05-15 2019 5 15

#b) Interaction Features


# Creating interaction between 'Age' and 'Salary'
df['Age_Salary_Interaction'] = df['Age'] * df['Salary']

print("After creating interaction feature:")


print(df)

After creating interaction feature:


Age Salary Country Purchased JoinDate Year Month Day \
0 25 50000 USA No 2015-03-01 2015 3 1
1 30 60000 France Yes 2017-07-12 2017 7 12
2 45 80000 Germany No 2018-01-01 2018 1 1
3 35 90000 USA Yes 2020-02-20 2020 2 20
4 22 75000 France No 2019-05-15 2019 5 15

Age_Salary_Interaction
0 1250000
1 1800000
2 3600000
3 3150000
4 1650000

#2. Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Creating polynomial features for 'Age' and 'Salary'


poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['Age', 'Salary']])

# Convert the result back to a DataFrame


df_poly = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(['Age', 'Salary']))

# Concatenate with original DataFrame


df = pd.concat([df, df_poly], axis=1)

print("After polynomial transformation:")


print(df)

After polynomial transformation:


Age Salary Country Purchased JoinDate Year Month Day \
0 25 50000 USA No 2015-03-01 2015 3 1

https://colab.research.google.com/drive/1dK3nPRGh9f4IzJLne3IY7v2SWOHGI5ti#printMode=true 3/5
8/23/24, 5:23 PM ML_Lab_3_Pre_Processing.ipynb - Colab
1 30 60000 France Yes 2017-07-12 2017 7 12
2 45 80000 Germany No 2018-01-01 2018 1 1
3 35 90000 USA Yes 2020-02-20 2020 2 20
4 22 75000 France No 2019-05-15 2019 5 15

Age_Salary_Interaction Age Salary Age^2 Age Salary Salary^2


0 1250000 25.0 50000.0 625.0 1250000.0 2.500000e+09
1 1800000 30.0 60000.0 900.0 1800000.0 3.600000e+09
2 3600000 45.0 80000.0 2025.0 3600000.0 6.400000e+09
3 3150000 35.0 90000.0 1225.0 3150000.0 8.100000e+09
4 1650000 22.0 75000.0 484.0 1650000.0 5.625000e+09

#3. Binning Continuous Variables

# Binning 'Age' into categories: 'Young', 'Middle-aged', 'Senior'


bins = [0, 25, 40, 100]
labels = ['Young', 'Middle-aged', 'Senior']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)

print("After binning 'Age':")


print(df)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-434f33adc81e> in <cell line: 6>()
4 bins = [0, 25, 40, 100]
5 labels = ['Young', 'Middle-aged', 'Senior']
----> 6 df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
7
8 print("After binning 'Age':")

1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/tile.py in _preprocess_for_cut(x)
610 x = np.asarray(x)
611 if x.ndim != 1:
--> 612 raise ValueError("Input array must be 1 dimensional")
613
614 return x

ValueError: Input array must be 1 dimensional

#Log Transformation
# Log transformation of 'Salary' to reduce skewness
df['Log_Salary'] = np.log(df['Salary'])

print("After log transformation of 'Salary':")


print(df)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-570904002dd3> in <cell line: 3>()
1 #Log Transformation
2 # Log transformation of 'Salary' to reduce skewness
----> 3 df['Log_Salary'] = np.log(df['Salary'])
4
5 print("After log transformation of 'Salary':")

1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in _set_item_frame_value(self, key, value)
4237
4238 if len(value.columns) != 1:
-> 4239 raise ValueError(
4240 "Cannot set a DataFrame with multiple columns to the single "
4241 f"column {key}"

ValueError: Cannot set a DataFrame with multiple columns to the single column Log_Salary

#5. Feature Selection


#a) Removing Low Variance Features

from sklearn.feature_selection import VarianceThreshold

# Removing features with low variance (variance threshold of 0.01)


selector = VarianceThreshold(threshold=0.01)
df_high_var = selector.fit_transform(df[['Age', 'Salary', 'Age_Salary_Interaction']])

# Convert back to DataFrame


df_high_var = pd.DataFrame(df_high_var, columns=['Age', 'Salary', 'Age_Salary_Interaction'])

print("After removing low variance features:")


print(df_high_var)

https://colab.research.google.com/drive/1dK3nPRGh9f4IzJLne3IY7v2SWOHGI5ti#printMode=true 4/5
8/23/24, 5:23 PM ML_Lab_3_Pre_Processing.ipynb - Colab

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-7dcdb6f5e111> in <cell line: 11>()
9
10 # Convert back to DataFrame
---> 11 df_high_var = pd.DataFrame(df_high_var, columns=['Age', 'Salary', 'Age_Salary_Interaction'])
12
13 print("After removing low variance features:")

2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index,
columns)
418 passed = values.shape
419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
421
422

ValueError: Shape of passed values is (5, 5), indices imply (5, 3)

from sklearn.feature_selection import VarianceThreshold

# Removing features with low variance (variance threshold of 0.01)


selector = VarianceThreshold(threshold=0.01)
df_high_var = selector.fit_transform(df[['Age', 'Salary', 'Age_Salary_Interaction']])

# Convert back to DataFrame


df_high_var = pd.DataFrame(df_high_var, columns=['Age', 'Salary', 'Age_Salary_Interaction'])

print("After removing low variance features:")


print(df_high_var)

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-10614f4c18b7> in <cell line: 8>()
6
7 # Convert back to DataFrame
----> 8 df_high_var = pd.DataFrame(df_high_var, columns=['Age', 'Salary', 'Age_Salary_Interaction'])
9
10 print("After removing low variance features:")

2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index,
columns)
418 passed = values.shape
419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
421
422

ValueError: Shape of passed values is (5, 5), indices imply (5, 3)

https://colab.research.google.com/drive/1dK3nPRGh9f4IzJLne3IY7v2SWOHGI5ti#printMode=true 5/5

You might also like