0% found this document useful (0 votes)

14 views1 page

PracticalWeek02

The document outlines a practical exercise focused on data pre-processing and cleansing, including handling missing data, data integration, transformation, and discretization using various datasets. Key tasks include removing and imputing missing values, handling outliers, and transforming categorical data into numerical formats. The practical is designed to be used alongside a tutorial for a comprehensive understanding of data pre-processing techniques.

Uploaded by

Chloe Tee

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views1 page

PracticalWeek02

Uploaded by

Chloe Tee

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 1

PracticalWeek02 - Data Pre-processing & Cleansing

Use this practical together with TutorialWeek02

Exercise:

1. Handle Missing Data

Removing of data (Banking_Marketing.csv)
Imputation (Banking_Marketing.csv)
Removing Outliers (german_credit_data.csv)
2. Data Integration (student.csv & marks.csv)
3. Data Transformation
Replacement of Categorical Data with Numbers (student.csv)
Label encoding (Banking_Marketing.csv)
Transforming Data of Different Scale (Wholesale customers data.csv)
4. Data Discretization (Student_bucketing.csv)

In [1]:
# Line Wrapping in Collaboratory Google results
# put this in the first cell of your notebook

from IPython.display import HTML, display

def set_css():
display(HTML('''
<style>
pre {
white-space: pre-wrap;
}
</style>
'''))
get_ipython().events.register('pre_run_cell', set_css)

Mount Google Drive

Important: Remember to re-mount for each time a new dataset is added to Google Drive

In [2]:
import io
import requests
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive

1. Handle Missing Data

- Removing of data (Banking_Marketing.csv)
- Imputation (Banking_Marketing.csv)
- Removing Outliers (german_credit_data.csv)

Dataset import as:

Banking_Marketing_df
german_credit_df

In [3]:
# Import dataset
import pandas as pd
DATA_DIR_1 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/Banking_Marketing.csv"
Banking_Marketing_df = pd.read_csv (DATA_DIR_1, header=0)

1.1 - Removing of Data

In [ ]:
# Determine the datatype of Each Column by using dtypes
print (Banking_Marketing_df.dtypes)

age float64
job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration float64
campaign int64
pdays int64
previous int64
poutcome object
emp_var_rate float64
cons_price_idx float64
cons_conf_idx float64
euribor3m float64
nr_employed float64
y int64
dtype: object

In [ ]:
print("Find missing value of each column using isna()")
print (Banking_Marketing_df.isna().sum())

Find missing value of each column using isna()

age 2
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

In [ ]:
print("\nRemove all rows with missing data by using dropna()")
data = Banking_Marketing_df.dropna ()
print(data.isna().sum())

Remove all rows with missing data by using dropna()

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 0
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

In [ ]:
print(Banking_Marketing_df.isna().sum())

1.2 - Imputation
Dataset: Banking_Marketing.csv

In [ ]:
# Computation of the Mean value by using mean ()
mean_age = Banking_Marketing_df.age.mean ()
print()
print ("Mean age: %.2f" % mean_age)

# Impute the missing data with its mean by using fillna ()

Banking_Marketing_df.age.fillna(mean_age, inplace=True)
print("\nImpute missing data with mean value:")
print (Banking_Marketing_df.isna().sum())

Mean age: 40.02

Impute missing data with mean value:

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 7
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

In [ ]:
# Computation of Median value by using median ()
# used median because the 'duration' variable is too diverse
median_duration = Banking_Marketing_df.duration.median()
print ("\nMedian duration: %.2f" % median_duration)

# Impute the missing data with its median by using fillna ()

Banking_Marketing_df.duration.fillna(median_duration, inplace=True)
print("\nImpute missing data with median value:")
print (Banking_Marketing_df.isna().sum())

Median duration: 180.00

Impute missing data with median value:

age 0
job 0
marital 0
education 0
default 0
housing 0
loan 0
contact 6
month 0
day_of_week 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
emp_var_rate 0
cons_price_idx 0
cons_conf_idx 0
euribor3m 0
nr_employed 0
y 0
dtype: int64

In [ ]:
# Computation of the Mean value by using mean ()
mean_age = Banking_Marketing_df.age.mean ()
print()
print ("Mean age: %.2f" % mean_age)

# Impute the missing data with its mean by using fillna ()

Banking_Marketing_df.age.fillna(mean_age, inplace=True)
print("\nImpute missing data with mean value:")
print (Banking_Marketing_df.isna().sum())

# Computation of Median value by using median ()

# used median because the 'duration' variable is too diverse
median_duration = Banking_Marketing_df.duration.median()
print ("\nMedian duration: %.2f" % median_duration)

# Impute the missing data with its median by using fillna ()

Banking_Marketing_df.duration.fillna(median_duration, inplace=True)
print("\nImpute missing data with median value:")
print (Banking_Marketing_df.isna().sum())

# Impute Categorical Data with its mode by using mode ()

# find out the mode
mode_contact = Banking_Marketing_df.contact.mode()[0]
print("\nImpute categorical data with its mode:")
print (mode_contact)

# impute using fillna. Used mode to find the most popular contact
Banking_Marketing_df.contact.fillna (mode_contact, inplace = True)
print("\nImpute missing data with mode (most popular contact):")
print (Banking_Marketing_df.isna().sum())

Mean age: 40.02

Impute missing data with mean value:

Median duration: 180.00

Impute missing data with median value:

Impute categorical data with its mode:

cellular

Impute missing data with mode (most popular contact):

1.3 - Removing Outliers

Dataset: german_credit_data.csv

In [4]:
DATA_DIR_2 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/german_credit_data.csv"
german_credit_df = pd.read_csv (DATA_DIR_2, header=0)

In [6]:
german_credit_df.shape

Out[6]: (1000, 10)

In [5]:
# Display a BoxPlot
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sbn
sbn.boxplot(german_credit_df['Age'])

# Compute the Interquartile Range (IQR)

Q1 = german_credit_df['Age'].quantile(0.25)
Q3 = german_credit_df['Age'].quantile(0.75)
IQR = Q3 - Q1
print ("IQR: %.2f" %IQR)

# Calculate the Lower and Upper Fence

Lower_Fence = Q1 - (1.5 * IQR)
print ("Lower_Fence: %.2f" %Lower_Fence)
Upper_Fence = Q3 + (1.5 * IQR)
print ("Upper_Fence: %.2f" %Upper_Fence)

# Display Outliers and Filtering Out the Outliers

print("\nDisplay Outliers")
print (german_credit_df[((german_credit_df["Age"] < Lower_Fence) | (german_credit_df["Age"] > Upper_Fence))])

# display data with outliers filtered out, use ~ to filter

print("\nDisplay data without outliers")
print (german_credit_df[~((german_credit_df["Age"] < Lower_Fence) | (german_credit_df["Age"] > Upper_Fence))])

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argum
ent will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
FutureWarning
IQR: 15.00
Lower_Fence: 4.50
Upper_Fence: 64.50

Display Outliers
Unnamed: 0 Age Sex ... Credit amount Duration Purpose
0 0 67 male ... 1169 6 radio/TV
75 75 66 male ... 1526 12 car
137 137 66 male ... 766 12 radio/TV
163 163 70 male ... 7308 10 car
179 179 65 male ... 571 21 car
186 186 74 female ... 5129 9 car
187 187 68 male ... 1175 16 car
213 213 66 male ... 1908 30 business
330 330 75 male ... 6615 24 car
430 430 74 male ... 3448 5 business
438 438 65 male ... 3394 42 repairs
536 536 75 female ... 1374 6 car
554 554 67 female ... 1199 9 education
606 606 74 male ... 4526 24 business
624 624 65 male ... 2600 18 radio/TV
723 723 66 female ... 790 9 radio/TV
756 756 74 male ... 1299 6 car
774 774 66 male ... 1480 12 car
779 779 67 female ... 3872 18 repairs
807 807 65 male ... 930 12 radio/TV
846 846 68 male ... 6761 18 car
883 883 65 female ... 1098 18 radio/TV
917 917 68 male ... 14896 6 car

[23 rows x 10 columns]

Display data without outliers

Unnamed: 0 Age Sex ... Credit amount Duration Purpose
1 1 22 female ... 5951 48 radio/TV
2 2 49 male ... 2096 12 education
3 3 45 male ... 7882 42 furniture/equipment
4 4 53 male ... 4870 24 car
5 5 35 male ... 9055 36 education
.. ... ... ... ... ... ... ...
995 995 31 female ... 1736 12 furniture/equipment
996 996 40 male ... 3857 30 car
997 997 38 male ... 804 12 radio/TV
998 998 23 male ... 1845 45 radio/TV
999 999 27 male ... 4576 45 car

[977 rows x 10 columns]

Why Seaborn Boxplot still showing outliers, after removing the outliers?
Seaborn uses inter-quartile range to detect the outliers. When you remove outliers, the number of data changes thus its quantile changes

means lower range and upper range changes

thus it is again showing outliers

Let's investigate by computing a new quantile range after remove the outliers.

Before remove outliers:

IQR: 15.00
Lower_Fence: 4.50
Upper_Fence: 64.50

After remove outliers:

IQRb: 14.00
Lower_Fence_b: 6.00
Upper_Fence_b: 63.00

The new upper fence now is at 63, if you check the condition based on the new upper and lower fence, you will see there a 5 rows with outliers (german_credit_remOutliers["Age"] < Lower_Fence_b) |
(german_credit_remOutliers["Age"] > Upper_Fence_b)

But if you check the condition against the firstly calculated upper and lower fence, you will get an empty array print (german_credit_remOutliers[((german_credit_remOutliers["Age"] < Lower_Fence) |
(german_credit_remOutliers["Age"] > Upper_Fence))])

What does it mean?

The outliers are actually removed (for the attribute age of the dataframe), but Seaborn boxplot shows the outliers based on the newly calculated inter-quartile range.

In [26]:
german_credit_remOutliers = (german_credit_df[~((german_credit_df["Age"] < Lower_Fence) | (german_credit_df["Age"] > Upper_Fence))])
german_credit_remOutliers.shape

Out[26]: (977, 10)

In [28]:
# Compute new quantile range after remove outliers
# Compute the Interquartile Range (IQR)
Q1b = german_credit_remOutliers['Age'].quantile(0.25)
Q3b = german_credit_remOutliers['Age'].quantile(0.75)
IQRb = Q3b - Q1b
print ("IQRb: %.2f" %IQRb)

# Calculate the Lower and Upper Fence

Lower_Fence_b = Q1b - (1.5 * IQRb)
print ("Lower_Fence: %.2f" %Lower_Fence_b)
Upper_Fence_b = Q3 + (1.5 * IQRb)
print ("Upper_Fence: %.2f" %Upper_Fence_b)

IQRb: 14.00
Lower_Fence: 6.00
Upper_Fence: 63.00

In [31]:
sbn.boxplot(german_credit_remOutliers['Age'])
# Use showfliers=False if you want to disable outliers from boxplot

In [33]:
# Check condition based on the firstly calculated IQR → results return empty df
# Display Outliers and Filtering Out the Outliers
print("\nDisplay Outliers")
print (german_credit_remOutliers[((german_credit_remOutliers["Age"] < Lower_Fence) | (german_credit_remOutliers["Age"] > Upper_Fence))])

Display Outliers
Empty DataFrame
Columns: [Unnamed: 0, Age, Sex, Job, Housing, Saving accounts, Checking account, Credit amount, Duration, Purpose]
Index: []

In [35]:
# Check condition based on the newly calculated IQR → results return 5 rows with outliers
# Note that the age 64 > new upper fence 63
# Display Outliers and Filtering Out the Outliers
print("\nDisplay Outliers")
print (german_credit_remOutliers[((german_credit_remOutliers["Age"] < Lower_Fence_b) | (german_credit_remOutliers["Age"] > Upper_Fence_b))])

Display Outliers
Unnamed: 0 Age Sex ... Credit amount Duration Purpose
219 219 64 female ... 1364 10 car
629 629 64 male ... 3832 9 education
678 678 64 male ... 2384 24 radio/TV
976 976 64 female ... 753 6 radio/TV
987 987 64 female ... 1409 13 radio/TV

[5 rows x 10 columns]

2. Data Integration
Dataset:

1. student.csv
2. marks.csv

In [ ]:
# Import dataset
import pandas as pd
DATA_DIR_3 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/student.csv"
DATA_DIR_4 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/marks.csv"
student_df = pd.read_csv (DATA_DIR_3, header=0)
marks_df = pd.read_csv (DATA_DIR_4, header=0)

In [ ]:
#Checking of Data
print (student_df.head())
print (marks_df.head())

# Merging of DataFrame using the pd.merge ()

df = pd.merge(student_df, marks_df, on = "Student_id")
print (df.head (10))

Student_id Age Gender Grade Employed

0 1 19 Male 1st Class yes
1 2 20 Female 2nd Class no
2 3 18 Male 1st Class no
3 4 21 Female 2nd Class no
4 5 19 Male 1st Class no
Student_id Mark City
0 1 95 Chennai
1 2 70 Delhi
2 3 98 Mumbai
3 4 75 Pune
4 5 89 Kochi
Student_id Age Gender Grade Employed Mark City
0 1 19 Male 1st Class yes 95 Chennai
1 2 20 Female 2nd Class no 70 Delhi
2 3 18 Male 1st Class no 98 Mumbai
3 4 21 Female 2nd Class no 75 Pune
4 5 19 Male 1st Class no 89 Kochi
5 6 20 Male 2nd Class yes 69 Gwalior
6 7 19 Female 3rd Class yes 52 Bhopal
7 8 21 Male 3rd Class yes 54 Chennai
8 9 22 Female 3rd Class yes 55 Delhi
9 10 21 Male 1st Class no 94 Mumbai

3. Data Transformation
- Replacement of Categorical Data with Numbers (student.csv)
- Label encoding (Banking_Marketing.csv)
- Transforming Data of Different Scale (Wholesale customers data.csv)

Numerical Data

Discrete: Numerical data that is countable

Continuous: Numerical data that is measurable

Categorical Data

Ordered: Categorical data that is orderly or structured

Nominal: Categorical data that has no order or structure

Dataset:

1. student.csv
2. Banking_Marketing.csv
3. Wholesale customers data.csv

In [ ]:
import numpy as np

# Separating Categorical Columns from Dataframe using select_dtypes()

df_categorical = student_df.select_dtypes(exclude=[np.number]) # exclude numerical using numpy
print(df_categorical)

Gender Grade Employed

0 Male 1st Class yes
1 Female 2nd Class no
2 Male 1st Class no
3 Female 2nd Class no
4 Male 1st Class no
.. ... ... ...
227 Female 1st Class no
228 Male 2nd Class no
229 Male 3rd Class yes
230 Female 1st Class yes
231 Male 3rd Class yes

[232 rows x 3 columns]

Finding the Frequency of Distribution to Each Categorical Column

In [ ]:
print(df_categorical['Grade'].unique())

['1st Class' '2nd Class' '3rd Class']

In [ ]:
print(df_categorical.Grade.value_counts())

2nd Class 80
3rd Class 80
1st Class 72
Name: Grade, dtype: int64

In [ ]:
print(df_categorical.Gender.value_counts())

Male 136
Female 96
Name: Gender, dtype: int64

In [ ]:
print(df_categorical.Employed.value_counts())

no 133
yes 99
Name: Employed, dtype: int64

3.1 - Replacing Categorical Data with Numbers

The following code may produce warning, but it's okay, still able to replace categorical data with numbers.

Warning:

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https: ...

In [ ]:
df_categorical.Grade.replace({"1st Class": 1, "2nd Class": 2, "3rd Class": 3 }, inplace=True)

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:4582: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

method=method,

In [ ]:
df_categorical.Gender.replace({"Male": 0, "Female": 1}, inplace=True)

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:4582: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

method=method,

In [ ]:
df_categorical.Employed.replace({"yes": 1, "no": 2}, inplace=True)
print (df_categorical.head())

Gender Grade Employed

0 0 1 1
1 1 2 2
2 0 1 2
3 1 2 2
4 0 1 2
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:4582: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

method=method,

3.2 - Label Encoding

This technique is used to replace each value in a categorical column with numbers from 0 to N-1.

Dataset:

Banking_Marketing.csv
(this dataset already imported previously and used as 'Banking_Marketing_df')

In [ ]:
# Read Dataset and import LabelEncoder from sklearn.preprocessing package
from sklearn.preprocessing import LabelEncoder

print (Banking_Marketing_df.head())

age job marital ... euribor3m nr_employed y

0 44.0 blue-collar married ... 4.963 5228.1 0
1 53.0 technician married ... 4.021 5195.8 0
2 28.0 management single ... 0.729 4991.6 1
3 39.0 services married ... 1.405 5099.1 0
4 55.0 retired married ... 0.869 5076.2 1

[5 rows x 21 columns]

In [ ]:
# Remove Missing Data
Banking_Marketing_df = Banking_Marketing_df.dropna()

In [ ]:
# Select Non-Numerical Columns
data_column_category = Banking_Marketing_df.select_dtypes (exclude=[np.number]).columns
print (data_column_category)
print (Banking_Marketing_df[data_column_category].head())

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',

'month', 'day_of_week', 'poutcome'],
dtype='object')
job marital education ... month day_of_week poutcome
0 blue-collar married basic.4y ... aug thu nonexistent
1 technician married unknown ... nov fri nonexistent
2 management single university.degree ... jun thu success
3 services married high.school ... apr fri nonexistent
4 retired married basic.4y ... aug fri success

[5 rows x 10 columns]

In [ ]:
# Iterate through column to convert to numeric data using LabelEncoder ()
label_encoder = LabelEncoder()
for i in data_column_category:
Banking_Marketing_df[i] = label_encoder.fit_transform (Banking_Marketing_df[i])

In [ ]:
print("Label Encoder Data:")
print(Banking_Marketing_df.head())

Label Encoder Data:

age job marital education ... cons_conf_idx euribor3m nr_employed y
0 44.0 1 1 0 ... -36.1 4.963 5228.1 0
1 53.0 9 1 7 ... -42.0 4.021 5195.8 0
2 28.0 4 2 6 ... -39.8 0.729 4991.6 1
3 39.0 7 1 3 ... -47.1 1.405 5099.1 0
4 55.0 5 1 0 ... -31.4 0.869 5076.2 1

[5 rows x 21 columns]

3.3 - Transforming Data of Different Scale

Dataset:

Wholesale customers data.csv

In [ ]:
DATA_DIR_5 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/Wholesale customers data.csv"

In [ ]:
# Read Dataset
from sklearn import preprocessing
WholesaleData_df = pd.read_csv (DATA_DIR_5, header=0)
print (WholesaleData_df.head())

Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen

0 2 3 12669 9656 7561 214 2674 1338
1 2 3 7057 9810 9568 1762 3293 1776
2 2 3 6353 8808 7684 2405 3516 7844
3 1 3 13265 1196 4221 6404 507 1788
4 2 3 22615 5410 7198 3915 1777 5185

In [ ]:
null_ = WholesaleData_df.isna().any()

In [ ]:
dtypes = WholesaleData_df.dtypes

In [ ]:
# Check for Missing Data
null_ = WholesaleData_df.isna().any()
dtypes = WholesaleData_df.dtypes
info = pd.concat ([null_,dtypes], axis = 1, keys = ['Null', 'type'])
print(info) # This is different way of viewing data

Null type
Channel False int64
Region False int64
Fresh False int64
Milk False int64
Grocery False int64
Frozen False int64
Detergents_Paper False int64
Delicassen False int64

In [ ]:
# Perform Standard Scaling and Implement fit_transform () method
std_scale = preprocessing.StandardScaler().fit_transform (WholesaleData_df)
scaled_frame = pd.DataFrame (std_scale, columns = WholesaleData_df.columns)
print (scaled_frame.head(25))

Channel Region Fresh ... Frozen Detergents_Paper Delicassen

0 1.448652 0.590668 0.052933 ... -0.589367 -0.043569 -0.066339
1 1.448652 0.590668 -0.391302 ... -0.270136 0.086407 0.089151
2 1.448652 0.590668 -0.447029 ... -0.137536 0.133232 2.243293
3 -0.690297 0.590668 0.100111 ... 0.687144 -0.498588 0.093411
4 1.448652 0.590668 0.840239 ... 0.173859 -0.231918 1.299347
5 1.448652 0.590668 -0.204806 ... -0.496155 -0.228138 -0.026224
6 1.448652 0.590668 0.009950 ... -0.534512 0.054280 -0.347854
7 1.448652 0.590668 -0.349981 ... -0.289315 0.092286 0.369601
8 -0.690297 0.590668 -0.477901 ... -0.545854 -0.244726 -0.275079
9 1.448652 0.590668 -0.474497 ... -0.394488 0.954031 0.203461
10 1.448652 0.590668 -0.683474 ... 0.273876 0.649984 0.077791
11 1.448652 0.590668 0.090692 ... -0.340664 -0.489769 -0.364894
12 1.448652 0.590668 1.560499 ... -0.574313 0.209873 0.499176
13 1.448652 0.590668 0.729576 ... 0.004757 0.803267 -0.327619
14 1.448652 0.590668 1.001564 ... -0.572869 0.457016 0.228311
15 -0.690297 0.590668 -0.138313 ... -0.551629 -0.402629 -0.395069
16 1.448652 0.590668 -0.869179 ... -0.605865 0.341529 -0.157929
17 -0.690297 0.590668 -0.484788 ... -0.460479 -0.527355 1.048362
18 1.448652 0.590668 0.522499 ... -0.178780 -0.024041 0.587926
19 -0.690297 0.590668 -0.334071 ... -0.495536 -0.076325 -0.363474
20 1.448652 0.590668 0.438987 ... -0.413666 -0.130709 0.212691
21 -0.690297 0.590668 -0.509248 ... 0.064149 -0.526305 -0.339334
22 -0.690297 0.590668 1.525828 ... 1.306634 -0.105092 0.997242
23 1.448652 0.590668 1.137716 ... 0.429367 0.305623 5.324340
24 1.448652 0.590668 0.842773 ... -0.032363 0.336069 1.509862

[25 rows x 8 columns]

In [ ]:
# Using MinMax Scaler Method
minmax_scale = preprocessing.MinMaxScaler().fit_transform (WholesaleData_df)
scaled_frame = pd.DataFrame (minmax_scale, columns = WholesaleData_df.columns)
print (scaled_frame.head())

Channel Region Fresh ... Frozen Detergents_Paper Delicassen

0 1.0 1.0 0.112940 ... 0.003106 0.065427 0.027847
1 1.0 1.0 0.062899 ... 0.028548 0.080590 0.036984
2 1.0 1.0 0.056622 ... 0.039116 0.086052 0.163559
3 0.0 1.0 0.118254 ... 0.104842 0.012346 0.037234
4 1.0 1.0 0.201626 ... 0.063934 0.043455 0.108093

[5 rows x 8 columns]

4. Data Discretization
A process of converting continuous data into discrete buckets by grouping it.

Benefits of Data Discretization:

Easy maintainability of data

Training of Machine Learning models will be faster and more effective

Dataset: Student_bucketing.csv

In [ ]:
DATA_DIR_6 = "/content/gdrive/MyDrive/Colab Notebooks/210412-ITS70304/Student_bucketing.csv"

In [ ]:
StudentBucketing_df = pd.read_csv (DATA_DIR_6, header=0)
print (StudentBucketing_df.head())

Student_id Age Grade Employed marks

0 1 19 1st Class yes 29
1 2 20 2nd Class no 41
2 3 18 1st Class no 57
3 4 21 2nd Class no 29
4 5 19 1st Class no 57

In [ ]:
# Perform Bucketing using pd.cut ()
StudentBucketing_df['bucket']=pd.cut(StudentBucketing_df['marks'], 5, labels = ['Poor', 'Below_average', 'Average', 'Above_Average','Excellent'])

In [ ]:
print (StudentBucketing_df.head(10))

Student_id Age Grade Employed marks bucket

0 1 19 1st Class yes 29 Poor
1 2 20 2nd Class no 41 Below_average
2 3 18 1st Class no 57 Average
3 4 21 2nd Class no 29 Poor
4 5 19 1st Class no 57 Average
5 6 20 2nd Class yes 53 Average
6 7 19 3rd Class yes 78 Above_Average
7 8 21 3rd Class yes 70 Above_Average
8 9 22 3rd Class yes 97 Excellent
9 10 21 1st Class no 58 Average

In [ ]:
# Perform Bucketing using pd.cut ()
StudentBucketing_df['bucket']=pd.cut(StudentBucketing_df['marks'], 3, labels = ['Poor', 'Average', 'Excellent'])
print (StudentBucketing_df.head(10))

Student_id Age Grade Employed marks bucket

0 1 19 1st Class yes 29 Poor
1 2 20 2nd Class no 41 Poor
2 3 18 1st Class no 57 Average
3 4 21 2nd Class no 29 Poor
4 5 19 1st Class no 57 Average
5 6 20 2nd Class yes 53 Average
6 7 19 3rd Class yes 78 Excellent
7 8 21 3rd Class yes 70 Average
8 9 22 3rd Class yes 97 Excellent
9 10 21 1st Class no 58 Average

End of TutorialWeek02, PracticalWeek02

Thera Bank-Project
100% (12)
Thera Bank-Project
26 pages
Telecom Customer Churn
0% (1)
Telecom Customer Churn
39 pages
EDA Credit Assignment Shakti - PDF
No ratings yet
EDA Credit Assignment Shakti - PDF
51 pages
Trainity Data Analytics Trainee Task 6
No ratings yet
Trainity Data Analytics Trainee Task 6
52 pages
Germany Credit Analysis
No ratings yet
Germany Credit Analysis
41 pages
Auto Insurance Output
No ratings yet
Auto Insurance Output
22 pages
FRA Business Report
100% (1)
FRA Business Report
21 pages
Project 5
No ratings yet
Project 5
29 pages
Customer Segmentation Clustering
No ratings yet
Customer Segmentation Clustering
35 pages
Building Logistic Regression Model in Python
No ratings yet
Building Logistic Regression Model in Python
24 pages
Bank Loan Case Study
No ratings yet
Bank Loan Case Study
71 pages
Project 5 PDF
100% (1)
Project 5 PDF
48 pages
Ensemble Techniques Project
100% (2)
Ensemble Techniques Project
28 pages
DM Project
No ratings yet
DM Project
36 pages
ML Cops
No ratings yet
ML Cops
17 pages
Capstone Removed
No ratings yet
Capstone Removed
17 pages
Ensemmmmm
No ratings yet
Ensemmmmm
10 pages
Unit7 Working With Pandas - Solved
No ratings yet
Unit7 Working With Pandas - Solved
12 pages
Predictive+Modelling+-+Logistic+Regression+-+Student+Version-New2.3.ipynb - Colaboratory
No ratings yet
Predictive+Modelling+-+Logistic+Regression+-+Student+Version-New2.3.ipynb - Colaboratory
12 pages
Capstone Project - Employee Attrition Rate
No ratings yet
Capstone Project - Employee Attrition Rate
66 pages
AML Project LearnerNotebook LowCode
No ratings yet
AML Project LearnerNotebook LowCode
74 pages
Examen Final Stat Numerique
No ratings yet
Examen Final Stat Numerique
31 pages
Data Pre Processing and Cleaning
No ratings yet
Data Pre Processing and Cleaning
56 pages
AIML Lab Ex 3-5 - 1
No ratings yet
AIML Lab Ex 3-5 - 1
31 pages
Predictive Modeling
No ratings yet
Predictive Modeling
42 pages
Bank Rpubs
No ratings yet
Bank Rpubs
24 pages
LoanTap Case Study
No ratings yet
LoanTap Case Study
37 pages
5185 Yuwen 300342996
No ratings yet
5185 Yuwen 300342996
4 pages
Data Analyst Interview Assignment
No ratings yet
Data Analyst Interview Assignment
26 pages
Module 9 Seaborn - Loans MSIS2407 20241113 Filled
No ratings yet
Module 9 Seaborn - Loans MSIS2407 20241113 Filled
38 pages
Test 4 - Up-New
No ratings yet
Test 4 - Up-New
10 pages
Observation: Import As Import As Import As Import As
No ratings yet
Observation: Import As Import As Import As Import As
31 pages
Tugas 6 - Ali Al Faruq Rahmatillah - 9882405221121004
No ratings yet
Tugas 6 - Ali Al Faruq Rahmatillah - 9882405221121004
3 pages
Bank Marketing Ingles
No ratings yet
Bank Marketing Ingles
37 pages
Danmairo - Analysis - Ipynb - Colaboratory
No ratings yet
Danmairo - Analysis - Ipynb - Colaboratory
18 pages
Customer Marketing Analysis 1738244935
No ratings yet
Customer Marketing Analysis 1738244935
42 pages
Customer Segmentation 1683225943
No ratings yet
Customer Segmentation 1683225943
34 pages
Bank Marketing Data Set Analysis
No ratings yet
Bank Marketing Data Set Analysis
33 pages
Credit Card Default
No ratings yet
Credit Card Default
5 pages
Project3: Loading Library
No ratings yet
Project3: Loading Library
17 pages
ML Project 2
No ratings yet
ML Project 2
19 pages
Naive Bayes Vs Logistic Regression
No ratings yet
Naive Bayes Vs Logistic Regression
16 pages
Task 2 Exploratory Data Analysis
No ratings yet
Task 2 Exploratory Data Analysis
5 pages
Satya772244@gmail Compdf
No ratings yet
Satya772244@gmail Compdf
7 pages
Classification - Bank - Marketing - Dataset - Jupyter Notebook
No ratings yet
Classification - Bank - Marketing - Dataset - Jupyter Notebook
23 pages
ECN190 Term Project: Predicting Credit Card Default Risk: Introduction and Literature
No ratings yet
ECN190 Term Project: Predicting Credit Card Default Risk: Introduction and Literature
18 pages
Data Analysis in The Banking Sector: Pandas Fundamentals
No ratings yet
Data Analysis in The Banking Sector: Pandas Fundamentals
16 pages
Etl Testing Material
100% (2)
Etl Testing Material
17 pages
Project On Data Mining-Raveendra Babu Gaddam
No ratings yet
Project On Data Mining-Raveendra Babu Gaddam
29 pages
DM Assignment - Thena Bank
No ratings yet
DM Assignment - Thena Bank
39 pages
Machine Learning
No ratings yet
Machine Learning
3 pages
Exp 8 - LM
No ratings yet
Exp 8 - LM
10 pages
Student Notebook HR Analysis
No ratings yet
Student Notebook HR Analysis
11 pages
TITLE: Bank Marketing Classification: Submitted To: Dr. Supriya Kumar de Professor XLRI, Jamshedpur
No ratings yet
TITLE: Bank Marketing Classification: Submitted To: Dr. Supriya Kumar de Professor XLRI, Jamshedpur
18 pages
Advanced Modelling Techniques Anurag Payel
No ratings yet
Advanced Modelling Techniques Anurag Payel
41 pages
Banking Analysis
No ratings yet
Banking Analysis
2 pages
PA v0.21
No ratings yet
PA v0.21
17 pages
Animesh Jain
No ratings yet
Animesh Jain
13 pages
Credit Risk Modelling (EDA & Classification) - Kaggle
No ratings yet
Credit Risk Modelling (EDA & Classification) - Kaggle
21 pages
Data Cleaning
No ratings yet
Data Cleaning
1 page
Mir Moscow
100% (1)
Mir Moscow
69 pages
Apache Spark-Real Time Project-Marketing Analysis
No ratings yet
Apache Spark-Real Time Project-Marketing Analysis
4 pages
Roe All 5 Units PDF
100% (1)
Roe All 5 Units PDF
60 pages
Hindu College, Moradabad
No ratings yet
Hindu College, Moradabad
2 pages
Social Studies Clinical Lesson Plan
100% (1)
Social Studies Clinical Lesson Plan
3 pages
ChE 101 PPT 1 - Definition of ChE
No ratings yet
ChE 101 PPT 1 - Definition of ChE
11 pages
Topic 4
100% (1)
Topic 4
11 pages
Image Segmentation DeepLearning
No ratings yet
Image Segmentation DeepLearning
18 pages
Adder (Electronics)
No ratings yet
Adder (Electronics)
4 pages
Template For Egra Grade 2
No ratings yet
Template For Egra Grade 2
9 pages
Image Segmentation in Python - Practical Hands-On
No ratings yet
Image Segmentation in Python - Practical Hands-On
24 pages
Texto Agroforestería MARTIN CRAWFORD
No ratings yet
Texto Agroforestería MARTIN CRAWFORD
20 pages
Self-Concept Questionnaire (SCQ)
No ratings yet
Self-Concept Questionnaire (SCQ)
12 pages
W11 Lecture ITS69204 Image Recognition
No ratings yet
W11 Lecture ITS69204 Image Recognition
44 pages
3BSE020923R5001 CIO S800 Install
No ratings yet
3BSE020923R5001 CIO S800 Install
284 pages
Watershed Segmentation
No ratings yet
Watershed Segmentation
22 pages
Cell Theory Timeline
No ratings yet
Cell Theory Timeline
14 pages
Lecture 08 Image Segmentation
No ratings yet
Lecture 08 Image Segmentation
31 pages
ITS66034 Group 24 Assignment
No ratings yet
ITS66034 Group 24 Assignment
13 pages
AI-ML Using Py
No ratings yet
AI-ML Using Py
10 pages
PC4020 v3.3 - Manual de Instrucción: Advertencia
No ratings yet
PC4020 v3.3 - Manual de Instrucción: Advertencia
44 pages
PracticalWeek03a
No ratings yet
PracticalWeek03a
1 page
Seat 220524 e
No ratings yet
Seat 220524 e
36 pages
Students - Students
No ratings yet
Students - Students
1 page
Golden Benthic User Guide
No ratings yet
Golden Benthic User Guide
33 pages
Meetings 1 - Getting Down To Business - Lesson Plan PDF
No ratings yet
Meetings 1 - Getting Down To Business - Lesson Plan PDF
4 pages
Metoda Dezambiguizării Sensurilor Cuvintelor Bazată Pe Restricţii Semantice (Semantic Restricţions)
No ratings yet
Metoda Dezambiguizării Sensurilor Cuvintelor Bazată Pe Restricţii Semantice (Semantic Restricţions)
36 pages
Additional Maths SBA 2 PDF
No ratings yet
Additional Maths SBA 2 PDF
15 pages
Tutorial 6
No ratings yet
Tutorial 6
6 pages
Agricultural Innovation Agricultural Development
No ratings yet
Agricultural Innovation Agricultural Development
17 pages
Perancangan Sistem Pengukuran Kinerja Pada Pdam Lumajang Dengan Balanced Scorecard Manik Ayu Titisari Teknik Industri UNKAR
No ratings yet
Perancangan Sistem Pengukuran Kinerja Pada Pdam Lumajang Dengan Balanced Scorecard Manik Ayu Titisari Teknik Industri UNKAR
14 pages
Budget of Minority
No ratings yet
Budget of Minority
18 pages
G9 Science Activity Sheet
No ratings yet
G9 Science Activity Sheet
1 page
Exercises - False Friend
No ratings yet
Exercises - False Friend
2 pages
Formulas Prelim Midterm 1
No ratings yet
Formulas Prelim Midterm 1
2 pages
Dr. Akram Spring 2015: Signals and Systems
No ratings yet
Dr. Akram Spring 2015: Signals and Systems
4 pages
Resume
No ratings yet
Resume
2 pages
Articulo 12 PDF
No ratings yet
Articulo 12 PDF
1 page
TouchCode Class 7
From Everand
TouchCode Class 7
Team Orange
No ratings yet

PracticalWeek02

Uploaded by

PracticalWeek02

Uploaded by

PracticalWeek02 - Data Pre-processing & Cleansing

Use this practical together with TutorialWeek02

1. Handle Missing Data

from IPython.display import HTML, display

Mount Google Drive

1. Handle Missing Data

Dataset import as:

1.1 - Removing of Data

Find missing value of each column using isna()

Remove all rows with missing data by using dropna()

# Impute the missing data with its mean by using fillna ()

Mean age: 40.02

Impute missing data with mean value:

# Impute the missing data with its median by using fillna ()

Median duration: 180.00

Impute missing data with median value:

# Impute the missing data with its mean by using fillna ()

# Computation of Median value by using median ()

# Impute the missing data with its median by using fillna ()

# Impute Categorical Data with its mode by using mode ()

Mean age: 40.02

Impute missing data with mean value:

Median duration: 180.00

Impute missing data with median value:

Impute categorical data with its mode:

Impute missing data with mode (most popular contact):

1.3 - Removing Outliers

Out[6]: (1000, 10)

# Compute the Interquartile Range (IQR)

# Calculate the Lower and Upper Fence

# Display Outliers and Filtering Out the Outliers

# display data with outliers filtered out, use ~ to filter

[23 rows x 10 columns]

Display data without outliers

[977 rows x 10 columns]

means lower range and upper range changes

Before remove outliers:

After remove outliers:

What does it mean?

Out[26]: (977, 10)

# Calculate the Lower and Upper Fence

# Merging of DataFrame using the pd.merge ()

Student_id Age Gender Grade Employed

Discrete: Numerical data that is countable

Ordered: Categorical data that is orderly or structured

# Separating Categorical Columns from Dataframe using select_dtypes()

Gender Grade Employed

[232 rows x 3 columns]

['1st Class' '2nd Class' '3rd Class']

3.1 - Replacing Categorical Data with Numbers

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https: ...

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Gender Grade Employed

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

3.2 - Label Encoding

age job marital ... euribor3m nr_employed y

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',

Label Encoder Data:

3.3 - Transforming Data of Different Scale

Wholesale customers data.csv

Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen

Channel Region Fresh ... Frozen Detergents_Paper Delicassen

[25 rows x 8 columns]

Channel Region Fresh ... Frozen Detergents_Paper Delicassen

Benefits of Data Discretization:

Easy maintainability of data

Student_id Age Grade Employed marks

Student_id Age Grade Employed marks bucket

Student_id Age Grade Employed marks bucket

End of TutorialWeek02, PracticalWeek02

You might also like