Data Preprocessing 2
# Creating a DataFrame
df = pd.DataFrame(data)
?SimpleImputer
https://colab.research.google.com/drive/1dK3nPRGh9f4IzJLne3IY7v2SWOHGI5ti#printMode=true 1/5
8/23/24, 5:23 PM ML_Lab_3_Pre_Processing.ipynb - Colab
   France  Germany  USA   Age   Salary  Purchased
0     0.0      0.0  1.0  25.0  50000.0          0
1     1.0      0.0  0.0  30.0  60000.0          1
2     0.0      1.0  0.0  45.0  80000.0          0
3     0.0      0.0  1.0  30.5  90000.0          1
4     1.0      0.0  0.0  22.0  70000.0          0
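The table above is consistent with mean imputation of the numeric columns followed by one-hot encoding of the country. The cells that produced it are not in the printout, so the following is only a sketch: the raw values (including which cells were missing) are a hypothetical reconstruction, and `pd.get_dummies` is one of several ways to encode the category.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw data: the missing Age (row 3) and Salary (row 4) are
# assumptions chosen so that mean imputation gives 30.5 and 70000.0.
raw = pd.DataFrame({
    "Country": ["USA", "France", "Germany", "USA", "France"],
    "Age": [25, 30, 45, np.nan, 22],
    "Salary": [50000, 60000, 80000, 90000, np.nan],
    "Purchased": [0, 1, 0, 1, 0],
})

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
raw[["Age", "Salary"]] = imputer.fit_transform(raw[["Age", "Salary"]])

# One-hot encode Country; columns come out in alphabetical order.
dummies = pd.get_dummies(raw["Country"], dtype=float)
encoded = pd.concat([dummies, raw[["Age", "Salary", "Purchased"]]], axis=1)
print(encoded)
```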
4. Feature Scaling
print("Training set:")
print(X_train)
print(y_train)
print("Testing set:")
print(X_test)
print(y_test)
Training set:
France Germany USA Age Salary
4 1.0 0.0 0.0 -1.074315 0.000000
2 0.0 1.0 0.0 1.832656 0.707107
0 0.0 0.0 1.0 -0.695145 -1.414214
3 0.0 0.0 1.0 0.000000 1.414214
4 0
2 0
0 0
3 1
Name: Purchased, dtype: int64
Testing set:
France Germany USA Age Salary
1 1.0 0.0 0.0 -0.063195 -0.707107
1 1
Name: Purchased, dtype: int64
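The scaled values printed above match z-score standardization of Age and Salary fitted on all five rows, followed by an 80/20 split. The sketch below starts from the encoded table shown earlier; `test_size=0.2` and `random_state=42` are assumptions, since the original call is not in the printout.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Encoded data copied from the table printed before section 4.
encoded = pd.DataFrame({
    "France": [0.0, 1.0, 0.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 1.0, 0.0, 0.0],
    "USA": [1.0, 0.0, 0.0, 1.0, 0.0],
    "Age": [25.0, 30.0, 45.0, 30.5, 22.0],
    "Salary": [50000.0, 60000.0, 80000.0, 90000.0, 70000.0],
    "Purchased": [0, 1, 0, 1, 0],
})

X = encoded.drop(columns="Purchased")
y = encoded["Purchased"]

# Standardize only the numeric columns: (value - mean) / standard deviation.
scaler = StandardScaler()
X[["Age", "Salary"]] = scaler.fit_transform(X[["Age", "Salary"]])

# Assumed split parameters; the seed used in the lab is not shown.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train)
print(X_test)
```

Fitting the scaler on the training rows only (and reusing it on the test rows) is the usual way to avoid leaking test information; fitting on all rows is used here because it reproduces the printed numbers.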
6. Feature engineering involves creating new features or transforming existing ones to improve model performance. Here are some common techniques with Python code examples:
7. Creating New Features
Date-Time Features: Extracting components such as year, month, day, or hour from a datetime column.
Interaction Features: Combining two or more features to create interaction terms.
9. Binning
Binning Continuous Variables: Converting a continuous variable into a categorical one by grouping its values into intervals.
10. Log Transformation
Log Transformation: Applying a logarithmic transformation to reduce skewness.
11. Feature Selection
Removing Low Variance Features: Dropping features whose values barely vary and therefore carry little information.
Let's demonstrate these techniques with code.
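import pandas as pd
import numpy as np
# Sample data, reconstructed from the "Original DataFrame" output printed below
data = {'Age': [25, 30, 45, 35, 22], 'Salary': [50000, 60000, 80000, 90000, 75000],
        'Country': ['USA', 'France', 'Germany', 'USA', 'France'],
        'Purchased': ['No', 'Yes', 'No', 'Yes', 'No'],
        'JoinDate': ['2015-03-01', '2017-07-12', '2018-01-01', '2020-02-20', '2019-05-15']}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)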
Original DataFrame:
Age Salary Country Purchased JoinDate
0 25 50000 USA No 2015-03-01
1 30 60000 France Yes 2017-07-12
2 45 80000 Germany No 2018-01-01
3 35 90000 USA Yes 2020-02-20
4 22 75000 France No 2019-05-15
Age_Salary_Interaction
0 1250000
1 1800000
2 3600000
3 3150000
4 1650000
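The Age_Salary_Interaction values above equal Age multiplied by Salary for each row. A minimal sketch of that step, using the Age and Salary values from the printed DataFrame (the original cell is not in the printout, so the exact code is an assumption):

```python
import pandas as pd

# Age and Salary copied from the "Original DataFrame" output.
df = pd.DataFrame({
    "Age": [25, 30, 45, 35, 22],
    "Salary": [50000, 60000, 80000, 90000, 75000],
})

# Interaction term: element-wise product of the two features.
df["Age_Salary_Interaction"] = df["Age"] * df["Salary"]
print(df)
```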
1 30 60000 France Yes 2017-07-12 2017 7 12
2 45 80000 Germany No 2018-01-01 2018 1 1
3 35 90000 USA Yes 2020-02-20 2020 2 20
4 22 75000 France No 2019-05-15 2019 5 15
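The last three values in each row above are the year, month, and day of JoinDate. One way to derive them is through the pandas `.dt` accessor; the column names `Year`, `Month`, and `Day` below are hypothetical, because the header row is missing from the printout.

```python
import pandas as pd

# JoinDate values copied from the "Original DataFrame" output.
df = pd.DataFrame({
    "JoinDate": ["2015-03-01", "2017-07-12", "2018-01-01",
                 "2020-02-20", "2019-05-15"],
})

# Parse the strings into datetimes so the .dt accessor is available.
df["JoinDate"] = pd.to_datetime(df["JoinDate"])

# Hypothetical column names for the extracted components.
df["Year"] = df["JoinDate"].dt.year
df["Month"] = df["JoinDate"].dt.month
df["Day"] = df["JoinDate"].dt.day
print(df)
```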
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-434f33adc81e> in <cell line: 6>()
4 bins = [0, 25, 40, 100]
5 labels = ['Young', 'Middle-aged', 'Senior']
----> 6 df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
7
8 print("After binning 'Age':")
1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/reshape/tile.py in _preprocess_for_cut(x)
610 x = np.asarray(x)
611 if x.ndim != 1:
--> 612 raise ValueError("Input array must be 1 dimensional")
613
614 return x
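`pd.cut` raises this error when its input is not one-dimensional, which means `df['Age']` returned a DataFrame rather than a Series. A likely cause (an assumption, since the failing frame is not shown) is that `df` contains more than one column labelled 'Age', for example after concatenating transformed columns back onto the original frame. The sketch below reproduces that situation and bins after dropping the duplicate labels.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 45, 35, 22]})

# Simulate the suspected problem: two columns share the label 'Age',
# so selecting 'Age' yields a 2-D DataFrame instead of a 1-D Series.
dup = pd.concat([df, df], axis=1)
print(dup["Age"].shape)

# Keep only the first occurrence of each column label.
clean = dup.loc[:, ~dup.columns.duplicated()].copy()

# Same bins and labels as in the failing cell; intervals are right-inclusive.
bins = [0, 25, 40, 100]
labels = ["Young", "Middle-aged", "Senior"]
clean["Age_Group"] = pd.cut(clean["Age"], bins=bins, labels=labels)
print(clean)
```

Checking `df.columns.duplicated().any()` before binning is a quick way to confirm whether this is what happened in a given notebook session.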
#Log Transformation
# Log transformation of 'Salary' to reduce skewness
df['Log_Salary'] = np.log(df['Salary'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-570904002dd3> in <cell line: 3>()
1 #Log Transformation
2 # Log transformation of 'Salary' to reduce skewness
----> 3 df['Log_Salary'] = np.log(df['Salary'])
4
5 print("After log transformation of 'Salary':")
1 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in _set_item_frame_value(self, key, value)
4237
4238 if len(value.columns) != 1:
-> 4239 raise ValueError(
4240 "Cannot set a DataFrame with multiple columns to the single "
4241 f"column {key}"
ValueError: Cannot set a DataFrame with multiple columns to the single column Log_Salary
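This error message says that `np.log(df['Salary'])` produced a DataFrame with several columns, which again points to more than one column labelled 'Salary' in `df`. Assuming that is the cause, a defensive sketch selects a single Series before taking the logarithm:

```python
import numpy as np
import pandas as pd

# Salary values copied from the "Original DataFrame" output.
df = pd.DataFrame({"Salary": [50000, 60000, 80000, 90000, 75000]})

salary = df["Salary"]
# If duplicate labels made this a DataFrame, fall back to its first column.
if isinstance(salary, pd.DataFrame):
    salary = salary.iloc[:, 0]

# Natural log compresses large values; all salaries here are positive.
df["Log_Salary"] = np.log(salary)
print(df)
```

For columns that can contain zeros, `np.log1p` is a common alternative because `np.log(0)` is negative infinity.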
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-7dcdb6f5e111> in <cell line: 11>()
9
10 # Convert back to DataFrame
---> 11 df_high_var = pd.DataFrame(df_high_var, columns=['Age', 'Salary', 'Age_Salary_Interaction'])
12
13 print("After removing low variance features:")
2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index,
columns)
418 passed = values.shape
419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
421
422
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-10614f4c18b7> in <cell line: 8>()
6
7 # Convert back to DataFrame
----> 8 df_high_var = pd.DataFrame(df_high_var, columns=['Age', 'Salary', 'Age_Salary_Interaction'])
9
10 print("After removing low variance features:")
2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index,
columns)
418 passed = values.shape
419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
421
422
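Both tracebacks above report a shape mismatch: the array passed to `pd.DataFrame` does not have as many columns as the three hard-coded names, which is what happens when the selector keeps a different number of features than expected. A sketch that avoids hard-coding the names by asking the selector which columns it kept follows; the `Constant` column and `threshold=0.0` are hypothetical, since the original selector settings are not in the printout.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "Age": [25, 30, 45, 35, 22],
    "Salary": [50000, 60000, 80000, 90000, 75000],
    "Constant": [1, 1, 1, 1, 1],  # hypothetical zero-variance feature
})
df["Age_Salary_Interaction"] = df["Age"] * df["Salary"]

# Drop every feature whose variance does not exceed the threshold.
selector = VarianceThreshold(threshold=0.0)
values = selector.fit_transform(df)

# get_support() returns a boolean mask over the input columns, so the
# names always line up with the array the selector returned.
kept = df.columns[selector.get_support()]
df_high_var = pd.DataFrame(values, columns=kept, index=df.index)
print(df_high_var)
```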