Ass 3 - Best
Semester: 7th
Batch: 2021F
Section: F
ASSIGNMENT # 3
Submitted by:
Asif Hussain(2021F-BSE-233)
Muhammad Qasim(2021F-BSE-249)
Muhammad Tahir(2021F-BSE-262)
Muhammad Daniyal(2021F-BSE-268)
Subject Name: Business Re-engineering
Teacher Name: Miss Nida Khalil
Roll # ___________________ Section ________ Name: ________
Assignment 3
SWE-417T: Software Re-engineering
Objective
This assignment aims to provide hands-on experience in data cleaning and preprocessing. You will
work with a real-world dataset to identify, clean, and prepare data for analysis.
b) Write code for the cleaning process (in Python, Java, etc.) that displays the cleaned data
in its output. Ensure your code does the following:
• Implements all necessary cleaning steps.
• Displays the original dataset before cleaning and the cleaned dataset afterward.
• Outputs a summary of the changes made (e.g., number of missing values filled, rows
removed).
c) Generate the output of the cleaning process using a tool such as OpenRefine, Trifacta Wrangler,
or Winpure Clean & Match, or any online tool.
ANSWER 1 PART(a):
I have chosen a subset (100 instances) of the churn rate dataset that is available on Kaggle. This
dataset contains missing values, outliers, and unnecessary columns. I will clean it using data
cleaning techniques so that it can be used to train a machine learning model with high accuracy.
I will use the pandas library for Python to do the cleaning.
My Approach:
• First, I will remove duplicate rows using the CustomerId column.
• Then I will remove unnecessary columns such as CustomerId and Surname, because they are not
needed by a machine learning model.
• Then I will identify outliers in the numeric columns: CreditScore, Age, Tenure,
Balance, NumOfProducts, and EstimatedSalary. I will skip the columns HasCrCard,
IsActiveMember, and Exited because they only hold 0 or 1 (false or true), so they are
flags rather than true numeric columns.
• After identifying the outliers, I will replace them with null values and then fill all null
values with the mean of their column. Because the dataset is small, dropping those rows would
lose too much data, so it is better to set outliers to null and then fill them with the column mean.
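The approach above can be sketched as one small function. This is only a minimal illustration, not the submitted source code: the function name clean_churn, the toy data, and the 1.5 x IQR cutoff are assumptions made for the example (the text only says that outliers will be identified).

```python
import numpy as np
import pandas as pd


def clean_churn(df, numeric_cols, id_col="CustomerId",
                drop_cols=("CustomerId", "Surname")):
    """Hypothetical helper: dedupe, drop columns, null out outliers, fill with mean."""
    summary = {"outliers_replaced": {}, "values_filled": {}}

    # Step 1: remove duplicate rows based on the ID column.
    rows_before = len(df)
    df = df.drop_duplicates(subset=id_col).copy()
    summary["duplicates_removed"] = rows_before - len(df)

    # Step 2: drop columns that carry no signal for a model.
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])

    for col in numeric_cols:
        df[col] = df[col].astype(float)

        # Step 3: flag outliers with the IQR rule (1.5 * IQR is an assumed cutoff).
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        summary["outliers_replaced"][col] = int(mask.sum())

        # Step 4: set outliers to null, then fill every null with the column mean.
        # The mean is computed after the outliers are nulled, so they do not skew it.
        df.loc[mask, col] = np.nan
        summary["values_filled"][col] = int(df[col].isna().sum())
        df[col] = df[col].fillna(df[col].mean())

    return df, summary


# Toy data (made up for illustration, not taken from the Kaggle file).
toy = pd.DataFrame({
    "CustomerId": [1, 2, 2, 3, 4, 5, 6],
    "Surname": ["A", "B", "B", "C", "D", "E", "F"],
    "Age": [30, 32, 32, 31, 29, 33, 120],
    "Balance": [100.0, np.nan, np.nan, 110.0, 90.0, 105.0, 95.0],
})
cleaned, summary = clean_churn(toy, ["Age", "Balance"])
print(cleaned)
print(summary)
```

Note that values_filled counts both the originally missing values and the outliers that were set to null, so it can be larger than outliers_replaced for the same column.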
ANSWER 1 PART(b):
CLEANING STEPS:
Source Code:
import pandas as pd
import numpy as np
df = pd.read_csv("Churn_Modelling.csv")
# Identify outliers using the IQR and replace them with the column mean
columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']