0% found this document useful (0 votes)
14 views1 page

PracticalWeek02

The document outlines a practical exercise focused on data pre-processing and cleansing, including handling missing data, data integration, transformation, and discretization using various datasets. Key tasks include removing and imputing missing values, handling outliers, and transforming categorical data into numerical formats. The practical is designed to be used alongside a tutorial for a comprehensive understanding of data pre-processing techniques.

Uploaded by

Chloe Tee
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views1 page

PracticalWeek02

The document outlines a practical exercise focused on data pre-processing and cleansing, including handling missing data, data integration, transformation, and discretization using various datasets. Key tasks include removing and imputing missing values, handling outliers, and transforming categorical data into numerical formats. The practical is designed to be used alongside a tutorial for a comprehensive understanding of data pre-processing techniques.

Uploaded by

Chloe Tee
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 1

PracticalWeek02 - Data Pre-processing & Cleansing

Use this practical together with TutorialWeek02

Exercise:

1. Handle Missing Data


Removing of data (Banking_Marketing.csv)
Imputation (Banking_Marketing.csv)
Removing Outliers (german_credit_data.csv)
2. Data Integration (student.csv & marks.csv)
3. Data Transformation
Replacement of Categorical Data with Numbers (student.csv)
Label encoding (Banking_Marketing.csv)
Transforming Data of Different Scale (Wholesale customers data.csv)
4. Data Discretization (Student_bucketing.csv)

In [1]:
# Line Wrapping in Collaboratory Google results
# put this in the first cell of your notebook

from IPython.display import HTML, display

def set_css():
display(HTML('''
<style>
pre {
white-space: pre-wrap;
}
</style>
'''))
get_ipython().events.register('pre_run_cell', set_css)

Mount Google Drive


Important: Remember to re-mount for each time a new dataset is added to Google Drive

In [2]:
import io
import requests
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive

1. Handle Missing Data


- Removing of data (Banking_Marketing.csv)
- Imputation (Banking_Marketing.csv)
- Removing Outliers (german_credit_data.csv)

Dataset import as:

Banking_Marketing_df
german_credit_df

In [3]:
# Import dataset
import pandas as pd
DATA_DIR_1 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/Banking_Marketing.csv"
Banking_Marketing_df = pd.read_csv (DATA_DIR_1, header=0)

1.1 - Removing of Data


In [ ]:
# Determine the datatype of Each Column by using dtypes
print (Banking_Marketing_df.dtypes)

age float64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration float64
campaign int64
pdays int64
previous int64
poutcome object
emp_var_rate float64
cons_price_idx float64
cons_conf_idx float64
euribor3m float64
nr_employed float64
y int64
dtype: object

In [ ]:
print("Find missing value of each column using isna()")
print (Banking_Marketing_df.isna().sum())

Find missing value of each column using isna()


age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

In [ ]:
print("\nRemove all rows with missing data by using dropna()")
data = Banking_Marketing_df.dropna ()
print(data.isna().sum())

Remove all rows with missing data by using dropna()


age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

In [ ]:
print(Banking_Marketing_df.isna().sum())

age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

1.2 - Imputation
Dataset: Banking_Marketing.csv

In [ ]:
# Computation of the Mean value by using mean ()
mean_age = Banking_Marketing_df.age.mean ()
print()
print ("Mean age: %.2f" % mean_age)

# Impute the missing data with its mean by using fillna ()


Banking_Marketing_df.age.fillna(mean_age, inplace=True)
print("\nImpute missing data with mean value:")
print (Banking_Marketing_df.isna().sum())

Mean age: 40.02

Impute missing data with mean value:


age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

In [ ]:
# Computation of Median value by using median ()
# used median because the 'duration' variable is too diverse
median_duration = Banking_Marketing_df.duration.median()
print ("\nMedian duration: %.2f" % median_duration)

# Impute the missing data with its median by using fillna ()


Banking_Marketing_df.duration.fillna(median_duration, inplace=True)
print("\nImpute missing data with median value:")
print (Banking_Marketing_df.isna().sum())

Median duration: 180.00

Impute missing data with median value:


age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

In [ ]:
# Computation of the Mean value by using mean ()
mean_age = Banking_Marketing_df.age.mean ()
print()
print ("Mean age: %.2f" % mean_age)

# Impute the missing data with its mean by using fillna ()


Banking_Marketing_df.age.fillna(mean_age, inplace=True)
print("\nImpute missing data with mean value:")
print (Banking_Marketing_df.isna().sum())

# Computation of Median value by using median ()


# used median because the 'duration' variable is too diverse
median_duration = Banking_Marketing_df.duration.median()
print ("\nMedian duration: %.2f" % median_duration)

# Impute the missing data with its median by using fillna ()


Banking_Marketing_df.duration.fillna(median_duration, inplace=True)
print("\nImpute missing data with median value:")
print (Banking_Marketing_df.isna().sum())

# Impute Categorical Data with its mode by using mode ()


# find out the mode
mode_contact = Banking_Marketing_df.contact.mode()[0]
print("\nImpute categorical data with its mode:")
print (mode_contact)

# impute using fillna. Used mode to find the most popular contact
Banking_Marketing_df.contact.fillna (mode_contact, inplace = True)
print("\nImpute missing data with mode (most popular contact):")
print (Banking_Marketing_df.isna().sum())

Mean age: 40.02

Impute missing data with mean value:


age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

Median duration: 180.00

Impute missing data with median value:


age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

Impute categorical data with its mode:


cellular

Impute missing data with mode (most popular contact):


age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

1.3 - Removing Outliers


Dataset: german_credit_data.csv

In [4]:
DATA_DIR_2 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/german_credit_data.csv"
german_credit_df = pd.read_csv (DATA_DIR_2, header=0)

In [6]:
german_credit_df.shape

Out[6]: (1000, 10)

In [5]:
# Display a BoxPlot
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sbn
sbn.boxplot(german_credit_df['Age'])

# Compute the Interquartile Range (IQR)


Q1 = german_credit_df['Age'].quantile(0.25)
Q3 = german_credit_df['Age'].quantile(0.75)
IQR = Q3 - Q1
print ("IQR: %.2f" %IQR)

# Calculate the Lower and Upper Fence


Lower_Fence = Q1 - (1.5 * IQR)
print ("Lower_Fence: %.2f" %Lower_Fence)
Upper_Fence = Q3 + (1.5 * IQR)
print ("Upper_Fence: %.2f" %Upper_Fence)

# Display Outliers and Filtering Out the Outliers


print("\nDisplay Outliers")
print (german_credit_df[((german_credit_df["Age"] < Lower_Fence) | (german_credit_df["Age"] > Upper_Fence))])

# display data with outliers filtered out, use ~ to filter


print("\nDisplay data without outliers")
print (german_credit_df[~((german_credit_df["Age"] < Lower_Fence) | (german_credit_df["Age"] > Upper_Fence))])

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argum
ent will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning
IQR: 15.00
Lower_Fence: 4.50
Upper_Fence: 64.50

Display Outliers
Unnamed: 0 Age Sex ... Credit amount Duration Purpose
0 0 67 male ... 1169 6 radio/TV
75 75 66 male ... 1526 12 car
137 137 66 male ... 766 12 radio/TV
163 163 70 male ... 7308 10 car
179 179 65 male ... 571 21 car
186 186 74 female ... 5129 9 car
187 187 68 male ... 1175 16 car
213 213 66 male ... 1908 30 business
330 330 75 male ... 6615 24 car
430 430 74 male ... 3448 5 business
438 438 65 male ... 3394 42 repairs
536 536 75 female ... 1374 6 car
554 554 67 female ... 1199 9 education
606 606 74 male ... 4526 24 business
624 624 65 male ... 2600 18 radio/TV
723 723 66 female ... 790 9 radio/TV
756 756 74 male ... 1299 6 car
774 774 66 male ... 1480 12 car
779 779 67 female ... 3872 18 repairs
807 807 65 male ... 930 12 radio/TV
846 846 68 male ... 6761 18 car
883 883 65 female ... 1098 18 radio/TV
917 917 68 male ... 14896 6 car

[23 rows x 10 columns]

Display data without outliers


Unnamed: 0 Age Sex ... Credit amount Duration Purpose
1 1 22 female ... 5951 48 radio/TV
2 2 49 male ... 2096 12 education
3 3 45 male ... 7882 42 furniture/equipment
4 4 53 male ... 4870 24 car
5 5 35 male ... 9055 36 education
.. ... ... ... ... ... ... ...
995 995 31 female ... 1736 12 furniture/equipment
996 996 40 male ... 3857 30 car
997 997 38 male ... 804 12 radio/TV
998 998 23 male ... 1845 45 radio/TV
999 999 27 male ... 4576 45 car

[977 rows x 10 columns]

Why Seaborn Boxplot still showing outliers, after removing the outliers?
Seaborn uses inter-quartile range to detect the outliers. When you remove outliers, the number of data changes thus its quantile changes

means lower range and upper range changes


thus it is again showing outliers

Let's investigate by computing a new quantile range after remove the outliers.

Before remove outliers:

IQR: 15.00
Lower_Fence: 4.50
Upper_Fence: 64.50

After remove outliers:

IQRb: 14.00
Lower_Fence_b: 6.00
Upper_Fence_b: 63.00

The new upper fence now is at 63, if you check the condition based on the new upper and lower fence, you will see there a 5 rows with outliers (german_credit_remOutliers["Age"] < Lower_Fence_b) |
(german_credit_remOutliers["Age"] > Upper_Fence_b)

But if you check the condition against the firstly calculated upper and lower fence, you will get an empty array print (german_credit_remOutliers[((german_credit_remOutliers["Age"] < Lower_Fence) |
(german_credit_remOutliers["Age"] > Upper_Fence))])

What does it mean?


The outliers are actually removed (for the attribute age of the dataframe), but Seaborn boxplot shows the outliers based on the newly calculated inter-quartile range.

In [26]:
german_credit_remOutliers = (german_credit_df[~((german_credit_df["Age"] < Lower_Fence) | (german_credit_df["Age"] > Upper_Fence))])
german_credit_remOutliers.shape

Out[26]: (977, 10)

In [28]:
# Compute new quantile range after remove outliers
# Compute the Interquartile Range (IQR)
Q1b = german_credit_remOutliers['Age'].quantile(0.25)
Q3b = german_credit_remOutliers['Age'].quantile(0.75)
IQRb = Q3b - Q1b
print ("IQRb: %.2f" %IQRb)

# Calculate the Lower and Upper Fence


Lower_Fence_b = Q1b - (1.5 * IQRb)
print ("Lower_Fence: %.2f" %Lower_Fence_b)
Upper_Fence_b = Q3 + (1.5 * IQRb)
print ("Upper_Fence: %.2f" %Upper_Fence_b)

IQRb: 14.00
Lower_Fence: 6.00
Upper_Fence: 63.00

In [31]:
sbn.boxplot(german_credit_remOutliers['Age'])
# Use showfliers=False if you want to disable outliers from boxplot

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argum
ent will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning
Out[31]: <matplotlib.axes._subplots.AxesSubplot at 0x7fd10cc710d0>

In [33]:
# Check condition based on the firstly calculated IQR → results return empty df
# Display Outliers and Filtering Out the Outliers
print("\nDisplay Outliers")
print (german_credit_remOutliers[((german_credit_remOutliers["Age"] < Lower_Fence) | (german_credit_remOutliers["Age"] > Upper_Fence))])

Display Outliers
Empty DataFrame
Columns: [Unnamed: 0, Age, Sex, Job, Housing, Saving accounts, Checking account, Credit amount, Duration, Purpose]
Index: []

In [35]:
# Check condition based on the newly calculated IQR → results return 5 rows with outliers
# Note that the age 64 > new upper fence 63
# Display Outliers and Filtering Out the Outliers
print("\nDisplay Outliers")
print (german_credit_remOutliers[((german_credit_remOutliers["Age"] < Lower_Fence_b) | (german_credit_remOutliers["Age"] > Upper_Fence_b))])

Display Outliers
Unnamed: 0 Age Sex ... Credit amount Duration Purpose
219 219 64 female ... 1364 10 car
629 629 64 male ... 3832 9 education
678 678 64 male ... 2384 24 radio/TV
976 976 64 female ... 753 6 radio/TV
987 987 64 female ... 1409 13 radio/TV

[5 rows x 10 columns]

2. Data Integration
Dataset:

1. student.csv
2. marks.csv

In [ ]:
# Import dataset
import pandas as pd
DATA_DIR_3 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/student.csv"
DATA_DIR_4 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/marks.csv"
student_df = pd.read_csv (DATA_DIR_3, header=0)
marks_df = pd.read_csv (DATA_DIR_4, header=0)

In [ ]:
#Checking of Data
print (student_df.head())
print (marks_df.head())

# Merging of DataFrame using the pd.merge ()


df = pd.merge(student_df, marks_df, on = "Student_id")
print (df.head (10))

Student_id Age Gender Grade Employed


0 1 19 Male 1st Class yes
1 2 20 Female 2nd Class no
2 3 18 Male 1st Class no
3 4 21 Female 2nd Class no
4 5 19 Male 1st Class no
Student_id Mark City
0 1 95 Chennai
1 2 70 Delhi
2 3 98 Mumbai
3 4 75 Pune
4 5 89 Kochi
Student_id Age Gender Grade Employed Mark City
0 1 19 Male 1st Class yes 95 Chennai
1 2 20 Female 2nd Class no 70 Delhi
2 3 18 Male 1st Class no 98 Mumbai
3 4 21 Female 2nd Class no 75 Pune
4 5 19 Male 1st Class no 89 Kochi
5 6 20 Male 2nd Class yes 69 Gwalior
6 7 19 Female 3rd Class yes 52 Bhopal
7 8 21 Male 3rd Class yes 54 Chennai
8 9 22 Female 3rd Class yes 55 Delhi
9 10 21 Male 1st Class no 94 Mumbai

3. Data Transformation
- Replacement of Categorical Data with Numbers (student.csv)
- Label encoding (Banking_Marketing.csv)
- Transforming Data of Different Scale (Wholesale customers data.csv)

Numerical Data

Discrete: Numerical data that is countable


Continuous: Numerical data that is measurable

Categorical Data

Ordered: Categorical data that is orderly or structured


Nominal: Categorical data that has no order or structure

Dataset:

1. student.csv
2. Banking_Marketing.csv
3. Wholesale customers data.csv

In [ ]:
import numpy as np

# Separating Categorical Columns from Dataframe using select_dtypes()


df_categorical = student_df.select_dtypes(exclude=[np.number]) # exclude numerical using numpy
print(df_categorical)

Gender Grade Employed


0 Male 1st Class yes
1 Female 2nd Class no
2 Male 1st Class no
3 Female 2nd Class no
4 Male 1st Class no
.. ... ... ...
227 Female 1st Class no
228 Male 2nd Class no
229 Male 3rd Class yes
230 Female 1st Class yes
231 Male 3rd Class yes

[232 rows x 3 columns]


Finding the Frequency of Distribution to Each Categorical Column

In [ ]:
print(df_categorical['Grade'].unique())

['1st Class' '2nd Class' '3rd Class']

In [ ]:
print(df_categorical.Grade.value_counts())

2nd Class 80
3rd Class 80
1st Class 72
Name: Grade, dtype: int64

In [ ]:
print(df_categorical.Gender.value_counts())

Male 136
Female 96
Name: Gender, dtype: int64

In [ ]:
print(df_categorical.Employed.value_counts())

no 133
yes 99
Name: Employed, dtype: int64

3.1 - Replacing Categorical Data with Numbers


The following code may produce warning, but it's okay, still able to replace categorical data with numbers.

Warning:

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https: ...

In [ ]:
df_categorical.Grade.replace({"1st Class": 1, "2nd Class": 2, "3rd Class": 3 }, inplace=True)

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:4582: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


method=method,

In [ ]:
df_categorical.Gender.replace({"Male": 0, "Female": 1}, inplace=True)

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:4582: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


method=method,

In [ ]:
df_categorical.Employed.replace({"yes": 1, "no": 2}, inplace=True)
print (df_categorical.head())

Gender Grade Employed


0 0 1 1
1 1 2 2
2 0 1 2
3 1 2 2
4 0 1 2
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:4582: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


method=method,

3.2 - Label Encoding


This technique is used to replace each value in a categorical column with numbers from 0 to N-1.

Dataset:

Banking_Marketing.csv
(this dataset already imported previously and used as 'Banking_Marketing_df')

In [ ]:
# Read Dataset and import LabelEncoder from sklearn.preprocessing package
from sklearn.preprocessing import LabelEncoder

print (Banking_Marketing_df.head())

age job marital ... euribor3m nr_employed y


0 44.0 blue-collar married ... 4.963 5228.1 0
1 53.0 technician married ... 4.021 5195.8 0
2 28.0 management single ... 0.729 4991.6 1
3 39.0 services married ... 1.405 5099.1 0
4 55.0 retired married ... 0.869 5076.2 1

[5 rows x 21 columns]

In [ ]:
# Remove Missing Data
Banking_Marketing_df = Banking_Marketing_df.dropna()

In [ ]:
# Select Non-Numerical Columns
data_column_category = Banking_Marketing_df.select_dtypes (exclude=[np.number]).columns
print (data_column_category)
print (Banking_Marketing_df[data_column_category].head())

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',


'month', 'day_of_week', 'poutcome'],
dtype='object')
job marital education ... month day_of_week poutcome
0 blue-collar married basic.4y ... aug thu nonexistent
1 technician married unknown ... nov fri nonexistent
2 management single university.degree ... jun thu success
3 services married high.school ... apr fri nonexistent
4 retired married basic.4y ... aug fri success

[5 rows x 10 columns]

In [ ]:
# Iterate through column to convert to numeric data using LabelEncoder ()
label_encoder = LabelEncoder()
for i in data_column_category:
Banking_Marketing_df[i] = label_encoder.fit_transform (Banking_Marketing_df[i])

In [ ]:
print("Label Encoder Data:")
print(Banking_Marketing_df.head())

Label Encoder Data:


age job marital education ... cons_conf_idx euribor3m nr_employed y
0 44.0 1 1 0 ... -36.1 4.963 5228.1 0
1 53.0 9 1 7 ... -42.0 4.021 5195.8 0
2 28.0 4 2 6 ... -39.8 0.729 4991.6 1
3 39.0 7 1 3 ... -47.1 1.405 5099.1 0
4 55.0 5 1 0 ... -31.4 0.869 5076.2 1

[5 rows x 21 columns]

3.3 - Transforming Data of Different Scale


Dataset:

Wholesale customers data.csv

In [ ]:
DATA_DIR_5 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/Wholesale customers data.csv"

In [ ]:
# Read Dataset
from sklearn import preprocessing
WholesaleData_df = pd.read_csv (DATA_DIR_5, header=0)
print (WholesaleData_df.head())

Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen


0 2 3 12669 9656 7561 214 2674 1338
1 2 3 7057 9810 9568 1762 3293 1776
2 2 3 6353 8808 7684 2405 3516 7844
3 1 3 13265 1196 4221 6404 507 1788
4 2 3 22615 5410 7198 3915 1777 5185

In [ ]:
null_ = WholesaleData_df.isna().any()

In [ ]:
dtypes = WholesaleData_df.dtypes

In [ ]:
# Check for Missing Data
null_ = WholesaleData_df.isna().any()
dtypes = WholesaleData_df.dtypes
info = pd.concat ([null_,dtypes], axis = 1, keys = ['Null', 'type'])
print(info) # This is different way of viewing data

Null type
Channel False int64
Region False int64
Fresh False int64
Milk False int64
Grocery False int64
Frozen False int64
Detergents_Paper False int64
Delicassen False int64

In [ ]:
# Perform Standard Scaling and Implement fit_transform () method
std_scale = preprocessing.StandardScaler().fit_transform (WholesaleData_df)
scaled_frame = pd.DataFrame (std_scale, columns = WholesaleData_df.columns)
print (scaled_frame.head(25))

Channel Region Fresh ... Frozen Detergents_Paper Delicassen


0 1.448652 0.590668 0.052933 ... -0.589367 -0.043569 -0.066339
1 1.448652 0.590668 -0.391302 ... -0.270136 0.086407 0.089151
2 1.448652 0.590668 -0.447029 ... -0.137536 0.133232 2.243293
3 -0.690297 0.590668 0.100111 ... 0.687144 -0.498588 0.093411
4 1.448652 0.590668 0.840239 ... 0.173859 -0.231918 1.299347
5 1.448652 0.590668 -0.204806 ... -0.496155 -0.228138 -0.026224
6 1.448652 0.590668 0.009950 ... -0.534512 0.054280 -0.347854
7 1.448652 0.590668 -0.349981 ... -0.289315 0.092286 0.369601
8 -0.690297 0.590668 -0.477901 ... -0.545854 -0.244726 -0.275079
9 1.448652 0.590668 -0.474497 ... -0.394488 0.954031 0.203461
10 1.448652 0.590668 -0.683474 ... 0.273876 0.649984 0.077791
11 1.448652 0.590668 0.090692 ... -0.340664 -0.489769 -0.364894
12 1.448652 0.590668 1.560499 ... -0.574313 0.209873 0.499176
13 1.448652 0.590668 0.729576 ... 0.004757 0.803267 -0.327619
14 1.448652 0.590668 1.001564 ... -0.572869 0.457016 0.228311
15 -0.690297 0.590668 -0.138313 ... -0.551629 -0.402629 -0.395069
16 1.448652 0.590668 -0.869179 ... -0.605865 0.341529 -0.157929
17 -0.690297 0.590668 -0.484788 ... -0.460479 -0.527355 1.048362
18 1.448652 0.590668 0.522499 ... -0.178780 -0.024041 0.587926
19 -0.690297 0.590668 -0.334071 ... -0.495536 -0.076325 -0.363474
20 1.448652 0.590668 0.438987 ... -0.413666 -0.130709 0.212691
21 -0.690297 0.590668 -0.509248 ... 0.064149 -0.526305 -0.339334
22 -0.690297 0.590668 1.525828 ... 1.306634 -0.105092 0.997242
23 1.448652 0.590668 1.137716 ... 0.429367 0.305623 5.324340
24 1.448652 0.590668 0.842773 ... -0.032363 0.336069 1.509862

[25 rows x 8 columns]

In [ ]:
# Using MinMax Scaler Method
minmax_scale = preprocessing.MinMaxScaler().fit_transform (WholesaleData_df)
scaled_frame = pd.DataFrame (minmax_scale, columns = WholesaleData_df.columns)
print (scaled_frame.head())

Channel Region Fresh ... Frozen Detergents_Paper Delicassen


0 1.0 1.0 0.112940 ... 0.003106 0.065427 0.027847
1 1.0 1.0 0.062899 ... 0.028548 0.080590 0.036984
2 1.0 1.0 0.056622 ... 0.039116 0.086052 0.163559
3 0.0 1.0 0.118254 ... 0.104842 0.012346 0.037234
4 1.0 1.0 0.201626 ... 0.063934 0.043455 0.108093

[5 rows x 8 columns]

4. Data Discretization
A process of converting continuous data into discrete buckets by grouping it.

Benefits of Data Discretization:

Easy maintainability of data


Training of Machine Learning models will be faster and more effective

Dataset: Student_bucketing.csv

In [ ]:
DATA_DIR_6 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/Student_bucketing.csv"

In [ ]:
StudentBucketing_df = pd.read_csv (DATA_DIR_6, header=0)
print (StudentBucketing_df.head())

Student_id Age Grade Employed marks


0 1 19 1st Class yes 29
1 2 20 2nd Class no 41
2 3 18 1st Class no 57
3 4 21 2nd Class no 29
4 5 19 1st Class no 57

In [ ]:
# Perform Bucketing using pd.cut ()
StudentBucketing_df['bucket']=pd.cut(StudentBucketing_df['marks'], 5, labels = ['Poor', 'Below_average', 'Average', 'Above_Average','Excellent'])

In [ ]:
print (StudentBucketing_df.head(10))

Student_id Age Grade Employed marks bucket


0 1 19 1st Class yes 29 Poor
1 2 20 2nd Class no 41 Below_average
2 3 18 1st Class no 57 Average
3 4 21 2nd Class no 29 Poor
4 5 19 1st Class no 57 Average
5 6 20 2nd Class yes 53 Average
6 7 19 3rd Class yes 78 Above_Average
7 8 21 3rd Class yes 70 Above_Average
8 9 22 3rd Class yes 97 Excellent
9 10 21 1st Class no 58 Average

In [ ]:
# Perform Bucketing using pd.cut ()
StudentBucketing_df['bucket']=pd.cut(StudentBucketing_df['marks'], 3, labels = ['Poor', 'Average', 'Excellent'])
print (StudentBucketing_df.head(10))

Student_id Age Grade Employed marks bucket


0 1 19 1st Class yes 29 Poor
1 2 20 2nd Class no 41 Poor
2 3 18 1st Class no 57 Average
3 4 21 2nd Class no 29 Poor
4 5 19 1st Class no 57 Average
5 6 20 2nd Class yes 53 Average
6 7 19 3rd Class yes 78 Excellent
7 8 21 3rd Class yes 70 Average
8 9 22 3rd Class yes 97 Excellent
9 10 21 1st Class no 58 Average

End of TutorialWeek02, PracticalWeek02

You might also like