Ass 3 - Best

This document outlines Assignment #3 for the Software Engineering Department at Sir Syed University, focusing on data cleaning and preprocessing using a real-world dataset. Students are required to select a dataset, implement a cleaning process using programming languages like Python, and document their findings. The assignment emphasizes group collaboration, proper referencing, and submission guidelines, with a due date of January 14, 2025.

Uploaded by

Bushra Shahzad

Assignment # 1 SED Batch 2021F

Sir Syed University of Engineering & Technology (SSUET)


Software Engineering Department

Semester: 7th

Batch: 2021F

Section: F

ASSIGNMENT # 3

Submitted by:

Asif Hussain(2021F-BSE-233)
Muhammad Qasim(2021F-BSE-249)
Muhammad Tahir(2021F-BSE-262)
Muhammad Daniyal(2021F-BSE-268)

Subject Name: Business Re-engineering
Teacher Name: Miss Nida Khalil
Roll # ___________________ Section ________ Name: ________

Department: Software Engineering Program: BS (SE)

Assignment 3
SWE-417T: Software Re-engineering

Date: 07-01-2025 Total Marks = 10.53 (04)

Teacher Name: Ms. Nida & Dr. Iqra Marks Obtained=

Sr. No: CLO_3
Course Learning Outcome: Set to perform complex design re-engineering and reverse engineering problems
PLO: PLO_4 (Design/Development of solution)
Blooms Taxonomy: C6 (Create)

Assignment Guidelines
• This is a group-based assignment with a maximum of 4 members.
• You are required to answer all questions in detail with references. Consider books and the Internet as reference material.
• Submission will be on VLE / hard copy.
• Any answers that are copied from another group will automatically receive a zero mark.

Submission date 14-01-2025

Objective
This assignment aims to provide hands-on experience in data cleaning and preprocessing. You will
work with a real-world dataset to identify, clean, and prepare data for analysis.

Question# 1: Prepare practical implementation of Data cleaning & Preprocessing:


a) Select a dataset from a reliable source (e.g., Kaggle, GitHub) and extract a subset of 50–
100 instances to work with. Explain your choice of dataset and the problems you expect to
solve in the data.

b) Write code for the cleaning process that displays the cleaned data, using a programming
language (Python, Java, etc.). Ensure your code performs the following:
• Implements all necessary cleaning steps.
• Displays the original dataset before cleaning and the cleaned dataset afterward.
• Outputs a summary of the changes made (e.g., number of missing values filled, rows
removed).

c) Generate the output of the cleaning process using any tool (OpenRefine, Trifacta Wrangler,
WinPure Clean & Match, etc.) or any online tool.

ANSWER 1 PART(a):
I have chosen a subset (100 instances) of a customer churn dataset that is available on Kaggle.
This dataset contains some missing values, outliers and unnecessary columns. I will clean it
using data cleaning techniques so that it can be used to train a machine learning model with
higher accuracy. I will use the pandas library in Python for data cleaning.
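
Drawing the 100-instance subset could look like the sketch below. It is only an illustration: the frame full_df is synthetic stand-in data (not the Kaggle file), and random sampling with a fixed seed is an assumption, since the report does not say how the subset was taken.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full Kaggle file; in practice this would be
# loaded with something like pd.read_csv("Churn_Modelling.csv").
rng = np.random.default_rng(0)
full_df = pd.DataFrame({
    "CustomerId": np.arange(1, 1001),
    "CreditScore": rng.integers(350, 851, size=1000),
    "Age": rng.integers(18, 93, size=1000),
    "Exited": rng.integers(0, 2, size=1000),
})

# Draw 100 rows at random; random_state makes the draw repeatable.
# Assumption: random sampling. Taking the first 100 rows with
# full_df.head(100) would be an equally valid way to get a subset.
subset = full_df.sample(n=100, random_state=42).reset_index(drop=True)

print(subset.shape)
```

A fixed random_state means every group member gets the same 100 rows when rerunning the script.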

My Approach:
• First, I will remove duplicate rows using the CustomerId column.
• Then I will remove unnecessary columns such as CustomerId and Surname, since these are
not needed by the machine learning model.
• Then I will identify outliers in the numeric columns CreditScore, Age, Tenure,
Balance, NumOfProducts and EstimatedSalary. I will skip the columns HasCrCard,
IsActiveMember and Exited because they only hold 0 or 1 (false or true), so they are
flags rather than true numeric columns.
• After identifying the outliers, I will replace them with null values and then fill those null
values with the mean of that column. Because the dataset is small, dropping those rows is
not suitable, so it is better to set the outliers to null and then fill the nulls with the column mean.
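
The outlier step of this approach can be illustrated on a handful of values. The sketch below uses made-up ages (not rows from the dataset) and the usual 1.5 × IQR fences, which is the rule the source code in Part (b) applies.

```python
import pandas as pd

# Made-up ages for illustration only; 95 is the planted outlier.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 95], dtype=float)

# Interquartile range and the 1.5 * IQR fences
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are treated as outliers
is_outlier = (ages < lower) | (ages > upper)

# Replace outliers with NaN, then fill NaN with the mean of the remaining values
cleaned = ages.mask(is_outlier)
cleaned = cleaned.fillna(cleaned.mean())

print(f"fences: [{lower}, {upper}], outliers found: {int(is_outlier.sum())}")
print(cleaned.tolist())
```

Because the mean is computed after the outliers are set to NaN, the extreme value does not pull the replacement value upward.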

ANSWER 1 PART(b):
CLEANING STEPS:
Source Code:
import pandas as pd
import numpy as np

df = pd.read_csv("Churn_Modelling.csv")

# Getting some info about the dataset
print(df.describe())

# Getting the number of null values in each column
print(df.isnull().sum())

# Removing duplicates using the CustomerId column
df = df.drop_duplicates(subset='CustomerId', keep='first')

# Removing columns that are not important
df = df.drop(columns=['CustomerId', 'Surname'])

# Identifying outliers using IQR and replacing them with the column mean
columns = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

# Casting to float so that integer columns can hold NaN
df = df.astype({col: float for col in columns})

# Loop through each column
for col in columns:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1

    # Identifying outliers and saving the boolean mask to the outliers variable
    outliers = (
        (df[col] < (q1 - 1.5 * iqr)) |
        (df[col] > (q3 + 1.5 * iqr))
    )

    # Setting outlier values to NaN
    df.loc[outliers, col] = np.nan

# Filling all null values with their column mean
for col in columns:
    df[col] = df[col].fillna(df[col].mean())

# Verify the changes
print(df.isnull().sum())
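
The question also asks for a summary of the changes made (rows removed, values filled). One possible way to collect those counts is sketched below. The helper name clean_with_summary, the dictionary keys and the small raw frame are all hypothetical and exist only for illustration; the cleaning steps themselves mirror the source code above.

```python
import numpy as np
import pandas as pd

def clean_with_summary(df, id_col, drop_cols, numeric_cols):
    """Apply the cleaning steps and count what each step changed."""
    summary = {"rows_before": len(df)}

    # Step 1: remove duplicate rows by the id column
    df = df.drop_duplicates(subset=id_col, keep="first")
    summary["duplicate_rows_removed"] = summary["rows_before"] - len(df)

    # Step 2: drop columns that are not needed
    df = df.drop(columns=drop_cols)
    summary["columns_dropped"] = len(drop_cols)

    # Cast to float so integer columns can hold NaN
    df = df.astype({col: float for col in numeric_cols})
    summary["missing_values_before"] = int(df[numeric_cols].isnull().sum().sum())

    # Step 3: replace IQR outliers with NaN, then fill NaN with the column mean
    outliers_replaced = 0
    for col in numeric_cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers_replaced += int(mask.sum())
        df.loc[mask, col] = np.nan
        df[col] = df[col].fillna(df[col].mean())

    summary["outliers_replaced"] = outliers_replaced
    summary["values_filled_with_mean"] = (
        summary["missing_values_before"] + outliers_replaced
    )
    summary["rows_after"] = len(df)
    return df, summary

# Made-up rows: one duplicate id, one missing Balance, one extreme Age
raw = pd.DataFrame({
    "CustomerId": [1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Surname": list("ABBCDEFGHIJ"),
    "Age": [22, 25, 25, 27, 30, 31, 33, 35, 38, 40, 95],
    "Balance": [100.0, 120.0, 120.0, np.nan, 110.0, 105.0,
                115.0, 125.0, 130.0, 95.0, 108.0],
})

cleaned, summary = clean_with_summary(
    raw, id_col="CustomerId",
    drop_cols=["CustomerId", "Surname"],
    numeric_cols=["Age", "Balance"],
)

for key, value in summary.items():
    print(f"{key}: {value}")
```

Returning the counts in a dictionary keeps the summary separate from the cleaned data, so it can be printed or written into the report as needed.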
ORIGINAL DATASET AND CLEANED DATASET:
Dataset Before Cleaning:

Dataset After Cleaning:


OUTPUT SUMMARY OF ORIGINAL DATASET AND CLEANED DATASET:

Summary of Original Dataset: Summary of Cleaned Dataset:


ANSWER 1 PART(c): Using OpenRefine Tool

Remove Duplicate By CustomerId Column:


Drop Unnecessary Columns:

Fill Null Values of All Columns:
