DMML Lab Report 03

This lab report focuses on data pre-processing techniques including handling null values, duplicates, and outliers in a healthcare dataset. Key methods employed include filling missing values with mean/median, using SimpleImputer for imputation, and visualizing data distributions through histograms and box plots. The report concludes with the identification and handling of outliers using the IQR method.

Uploaded by

Atick Arman

Lab report

Course code: CSE326


Course Title: Data Mining and Machine Learning Lab
Lab report: 03
Topic: Data Pre-processing (null value, duplicate, and outlier handling).

Submitted To:
Name: Sadman Sadik Khan
Designation: Lecturer
Department: CSE
Daffodil International University

Submitted By:
Name: Fardus Alam
ID: 222-15-6167
Section: 62-G
Department: CSE
Daffodil International University

Submission Date: 15-03-2025


Code: Import libraries and load the dataset
import pandas as pd

# Load the stroke dataset from Google Drive into a DataFrame
dataset = pd.read_csv('/content/drive/MyDrive/lab dataset data mining/healthcare-dataset-stroke-data2.csv')
dataset

Output:

Explanation:
Imports the pandas library and loads the CSV file from Google Drive into a DataFrame named dataset.

Code:
dataset.info()

Output:
Explanation:
Shows basic information about each column of the dataset: the non-null count and the data type.

Code: Missing values percentage
dataset.isnull().mean() * 100

Output:

Explanation:
The command dataset.isnull().mean() * 100 calculates the percentage of missing values in each column
of the DataFrame.
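
As a small illustration of why this works: isnull() returns a table of True/False values, and the mean of booleans is the fraction that are True. The sketch below uses a made-up toy DataFrame (the name toy and its values are assumptions for illustration, not the stroke dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the stroke dataset
toy = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 61.0],
    'bmi': [22.5, np.nan, np.nan, 30.1],
    'gender': ['Male', 'Female', 'Female', 'Male'],
})

# isnull() marks each cell True/False; mean() of booleans is the share of True
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Here age has 1 of 4 values missing (25%), bmi has 2 of 4 (50%), and gender has none (0%).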

Code: Missing values indexes
# Row indexes where age / bmi are missing in the loaded dataset
age_missing_index = dataset[dataset['age'].isnull()].index.tolist()
bmi_missing_index = dataset[dataset['bmi'].isnull()].index.tolist()

Explanation:
Finds the row indexes where the age and bmi columns are missing and stores them as lists.
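
One possible use of the stored indexes, sketched on made-up data: after filling, the saved index list lets you look at exactly the rows that were imputed. The toy DataFrame and the name filled_rows are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the stroke dataset
toy = pd.DataFrame({'age': [25.0, np.nan, 40.0, np.nan]})

# Remember which rows were missing before any filling happens
age_missing_index = toy[toy['age'].isnull()].index.tolist()

# Fill with the column mean, then look only at the rows that were filled
toy['age'] = toy['age'].fillna(toy['age'].mean())
filled_rows = toy.loc[age_missing_index, 'age']
print(age_missing_index)
print(filled_rows)
```

The indexes must be collected before filling, because afterwards isnull() no longer finds those rows.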

Code: Histogram
df1 = dataset.copy()

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data=df1, x='age', hue='gender')
plt.show()

Output:

Explanation:
This code creates a histogram using Seaborn to visualize the distribution of age while differentiating by
gender.
Breakdown:
1. import seaborn as sns → Imports Seaborn for visualization.
2. import matplotlib.pyplot as plt → Imports Matplotlib, used here to display the figure.
3. sns.histplot(data=df1, x='age', hue='gender') →
• Plots the age distribution.
• hue='gender' colors the bars based on gender categories.
The resulting plot shows the shape of the age distribution.

Code: Histogram
sns.histplot(data=df1, x='bmi', hue='work_type')
plt.show()

Output:

Explanation:
This helps analyze how bmi is distributed across different work types in the dataset.

Code: Handling Missing values
# Fill each column from its own statistic; a scalar df1.fillna(...) would fill every column
df1['age'] = df1['age'].fillna(df1['age'].mean())
df1['bmi'] = df1['bmi'].fillna(df1['bmi'].median())
df1.isnull().sum()

Output:

Explanation:
This code fills missing values in df1:
• age - Replaced with its mean.
• bmi - Replaced with its median.
• isnull().sum() - Checks for remaining missing values.
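
One pitfall worth illustrating: calling fillna on the whole DataFrame with a single number fills the missing cells of every column with that number, so a bmi gap can end up holding the age mean. The sketch below contrasts the two approaches on made-up data (toy, whole_frame and per_column are hypothetical names, not from the lab).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the stroke dataset
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0],
    'bmi': [18.0, 24.0, np.nan],
})

# Scalar fillna on the whole DataFrame: every NaN, in every column, gets the age mean
whole_frame = toy.fillna(toy['age'].mean())

# Column-wise fillna: each column is filled from its own statistic
per_column = toy.copy()
per_column['age'] = per_column['age'].fillna(toy['age'].mean())
per_column['bmi'] = per_column['bmi'].fillna(toy['bmi'].median())

print(whole_frame)
print(per_column)
```

In whole_frame the missing bmi becomes 30.0 (the age mean), while in per_column it becomes 21.0 (the bmi median).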

Code: Missing value filling using SimpleImputer
from sklearn.impute import SimpleImputer

df2 = dataset.copy()

# Mean imputation; the imputer is re-fitted for each column
imputer = SimpleImputer(strategy='mean')

df2['age'] = imputer.fit_transform(df2[['age']])
df2['bmi'] = imputer.fit_transform(df2[['bmi']])
print(df2.isnull().sum())

Output:

Explanation:
This code uses SimpleImputer from sklearn to handle missing values in df2.
Key Steps:
1. Copy Dataset - df2 = dataset.copy() (to avoid modifying the original data).
2. Initialize Imputer - SimpleImputer(strategy='mean') (fills missing values with the column mean).
3. Apply Imputation -
• df2['age'] = imputer.fit_transform(df2[['age']]) (fills missing age values).
• df2['bmi'] = imputer.fit_transform(df2[['bmi']]) (fills missing bmi values).
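
As a variation, SimpleImputer can also be fitted on several columns in one call; it then learns one mean per column. This is a minimal sketch on made-up data, assuming scikit-learn is installed; toy and cols are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the stroke dataset
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0],
    'bmi': [18.0, 24.0, np.nan],
})

imputer = SimpleImputer(strategy='mean')
cols = ['age', 'bmi']

# One fit learns a separate mean for each listed column
toy[cols] = imputer.fit_transform(toy[cols])

print(imputer.statistics_)  # the per-column means learned during fit
print(toy)
```

The learned means are available in imputer.statistics_, which is handy for checking what value was actually filled in.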

Code: Duplicates
df2 = df2.drop_duplicates()
print(df2.duplicated().sum())

Output:
0

Explanation:
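Drops duplicate rows and then counts the duplicates that remain, which is 0. Because the count is taken after drop_duplicates(), it confirms that df2 no longer contains duplicate rows; counting before the drop would show whether the original data had any.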
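
To see how many duplicates a dataset has in the first place, the count can be taken both before and after dropping. A minimal sketch on made-up data (toy, duplicates_before and duplicates_after are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data with repeated rows, not the stroke dataset
toy = pd.DataFrame({
    'age': [30, 30, 45, 30],
    'gender': ['Male', 'Male', 'Female', 'Male'],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is kept)
duplicates_before = int(toy.duplicated().sum())

deduped = toy.drop_duplicates()
duplicates_after = int(deduped.duplicated().sum())

print(duplicates_before, duplicates_after, len(deduped))
```

Here two rows repeat the first one, so the count is 2 before the drop and 0 after it, leaving 2 rows.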

Code: Outliers
# Every column whose dtype is not object is treated as numerical
numerical_columns = [feature for feature in df2.columns if df2[feature].dtype != 'O']

plt.figure(figsize=(12, 12))
i = 1
for feature in numerical_columns:
    if feature == 'stroke':
        continue
    plt.subplot(2, 3, i)
    sns.boxplot(data=df2, x=feature, hue='gender')
    i = i + 1
plt.show()

Output:
Explanation:
This code draws box plots for all numerical columns (except stroke), split by gender, to help detect outliers and compare distributions across genders.
• Loops through the numerical columns, creating one subplot per column.
• sns.boxplot() visualizes outliers and spread.

Code: Outliers Handling
# IQR bounds computed from avg_glucose_level
Q1 = df2['avg_glucose_level'].quantile(0.25)
Q3 = df2['avg_glucose_level'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outlier = df2[(df2['avg_glucose_level'] < lower_bound) | (df2['avg_glucose_level'] > upper_bound)]
print('avg_glucose_level Outliers data point: ', len(outlier))

# Note: this filter reuses the avg_glucose_level bounds on bmi
df = df2[(df2['bmi'] >= lower_bound) & (df2['bmi'] <= upper_bound)]
print('bmi in range data points: ', len(df))

Output:
avg_glucose_level Outliers data point: 627
bmi in range data points: 4236

Explanation:
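Uses the IQR method on avg_glucose_level: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are counted as outliers. The second filter keeps the rows whose bmi lies within those same bounds, which were computed from avg_glucose_level; bounds specific to bmi would have to be computed from the bmi quartiles.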
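
The IQR rule can be wrapped in a small helper so that each column gets its own bounds. The following is a minimal sketch on made-up data; iqr_bounds, toy and outlier_counts are hypothetical names introduced for illustration and are not part of the lab code.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) IQR bounds for a numeric Series (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data, not the stroke dataset
toy = pd.DataFrame({
    'avg_glucose_level': [80.0, 85.0, 90.0, 95.0, 100.0, 250.0],
    'bmi': [20.0, 22.0, 24.0, 26.0, 28.0, 60.0],
})

outlier_counts = {}
for column in ['avg_glucose_level', 'bmi']:
    # Bounds are recomputed from the quartiles of the column being checked
    lower, upper = iqr_bounds(toy[column])
    is_outlier = (toy[column] < lower) | (toy[column] > upper)
    outlier_counts[column] = int(is_outlier.sum())

print(outlier_counts)
```

With this toy data each column has exactly one value outside its own bounds (250.0 for avg_glucose_level and 60.0 for bmi).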
