DMML Lab Report 03
Submitted To:
Name: Sadman Sadik Khan
Designation: Lecturer
Department: CSE
Daffodil International University
Submitted By:
Name: Fardus Alam
ID: 222-15-6167
Section: 62-G
Department: CSE
Daffodil International University
Output:
Explanation:
Imports the pandas library and loads the CSV file from Google Drive into the dataset DataFrame.
Code:
dataset.info()
Output:
Explanation:
Shows basic information about each column of the dataset: the non-null count and the data type.
Output:
Explanation:
The command dataset.isnull().mean() * 100 calculates the percentage of missing values in each column
of the DataFrame.
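As a minimal sketch of why this expression yields a percentage, the example below uses a small made-up DataFrame (the name toy and its values are assumptions for illustration, not the lab dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the lab dataset.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 61.0],
    "bmi": [22.5, 30.1, np.nan, np.nan],
    "gender": ["Male", "Female", "Female", "Male"],
})

# isnull() returns a True/False mask. The mean of a boolean column is the
# fraction of True values, so multiplying by 100 gives a percentage.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Here age has 1 missing value out of 4 rows (25%), bmi has 2 out of 4 (50%), and gender has none (0%).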
Explanation:
Finds the row indices where the age and bmi columns have missing values and stores them.
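One possible way to collect such indices is sketched below. The code from the report is not available here, so the variable names (toy, age_missing_idx, bmi_missing_idx) and the data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 61.0],
    "bmi": [22.5, 30.1, np.nan, np.nan],
})

# Filter the rows with a boolean mask, then keep only their index labels.
age_missing_idx = toy[toy["age"].isnull()].index
bmi_missing_idx = toy[toy["bmi"].isnull()].index

print(list(age_missing_idx))
print(list(bmi_missing_idx))
```

Storing the indices before imputation makes it possible to look up the filled rows afterwards and inspect which values were inserted.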
Code: Histogram
df1 = dataset.copy()
Output:
Explanation:
This code creates a histogram using Seaborn to visualize the distribution of age while differentiating by
gender.
Breakdown:
1. import seaborn as sns → Imports Seaborn for visualization.
2. import matplotlib.pyplot as plt → Imports Matplotlib for additional plotting functions.
3. sns.histplot(data=df1, x='age', hue='gender') → Plots the age distribution; hue='gender' colors the bars by gender category.
The plot shows the shape of the age distribution.
Code: Histogram
sns.histplot(data=df1, x='bmi', hue='work_type')
plt.show()
Output:
Explanation:
This helps analyze how bmi is distributed across different work types in the dataset.
Explanation:
This code fills missing values in df1:
age - Replaced with its mean.
bmi - Replaced with its median.
isnull().sum() - Checks for remaining missing values.
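The steps above can be sketched as follows. This is an illustration on made-up data, not the code from the report; the names toy and filled are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "bmi": [20.0, 22.0, np.nan, 36.0],
})

filled = toy.copy()  # work on a copy so the original stays unchanged

# mean() and median() skip NaN by default, so they use only observed values.
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["bmi"] = filled["bmi"].fillna(filled["bmi"].median())

# Count the remaining missing values per column.
print(filled.isnull().sum())
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for a column with outliers such as bmi.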
Explanation:
This code uses SimpleImputer from sklearn to handle missing values in df2.
Key Steps:
1. Copy Dataset - df2 = dataset.copy() (to avoid modifying the original data).
2. Initialize Imputer - SimpleImputer(strategy='mean') (fills missing values with the column
mean).
3. Apply Imputation -
   - df2['age'] = imputer.fit_transform(df2[['age']]) (fills missing age values).
   - df2['bmi'] = imputer.fit_transform(df2[['bmi']]) (fills missing bmi values).
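A minimal sketch of the same idea on made-up data is shown below. It imputes both columns in a single call rather than one at a time, which is a variation on the steps above; the names toy and imputed are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap in each column.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "bmi": [20.0, 22.0, np.nan, 36.0],
})

imputed = toy.copy()
imputer = SimpleImputer(strategy="mean")

# fit_transform expects 2-D input and returns a 2-D NumPy array.
# Each column is filled with its own mean.
imputed[["age", "bmi"]] = imputer.fit_transform(imputed[["age", "bmi"]])

print(imputed)
```

Passing both columns together fits the imputer once; each column still receives its own mean (40 for age and 26 for bmi in this toy data).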
Code: Duplicates
df2 = df2.drop_duplicates()
print(df2.duplicated().sum())
Output:
0
Explanation:
Removes duplicate rows and then counts the duplicates that remain. The result of 0 confirms that the dataset contains no duplicate rows after this step.
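To see how many duplicates a dataset contained in the first place, the count can be taken before dropping as well. The sketch below uses made-up data; the names toy, n_before, deduped, and n_after are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first.
toy = pd.DataFrame({
    "age": [25, 25, 40],
    "bmi": [22.5, 22.5, 30.1],
})

# duplicated() marks every row that is identical to an earlier row.
n_before = int(toy.duplicated().sum())

deduped = toy.drop_duplicates()
n_after = int(deduped.duplicated().sum())

print(n_before, n_after, len(deduped))
```

Counting after drop_duplicates() always gives 0, so the count taken beforehand is the one that describes the raw data.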
Code: Outliers
# Select the numerical columns (dtype is not object)
numerical_columns = [feature for feature in df2.columns if df2[feature].dtype != 'O']

plt.figure(figsize=(12, 12))
i = 1
for feature in numerical_columns:
    # Skip the stroke column
    if feature == 'stroke':
        continue
    # Place each box plot in its own cell of a 2x3 grid
    plt.subplot(2, 3, i)
    sns.boxplot(data=df2, x=feature, hue='gender')
    i = i + 1
plt.show()
Output:
Explanation:
This code draws box plots for all numerical columns except stroke, split by gender.
It loops through the numerical columns and places each plot in its own subplot.
sns.boxplot() shows the spread of each feature and makes outliers visible, which helps compare distribution patterns across genders.
Output:
avg_glucose_level Outliers data point: 627
bmi in range data points: 4236
Explanation:
Uses the IQR method to count the outliers in the avg_glucose_level column and the in-range data points in the bmi column.
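The code for this step is not shown, so the following is only a sketch of the IQR method under the common convention of 1.5 × IQR fences. The helper iqr_bounds, the multiplier k, and the sample values are assumptions for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one obviously extreme value.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

lower, upper = iqr_bounds(values)

# Points outside the fences are outliers; the rest are in range.
outliers = values[(values < lower) | (values > upper)]
in_range = values[(values >= lower) & (values <= upper)]

print("Outliers:", len(outliers))
print("In range:", len(in_range))
```

For this sample, Q1 = 12.75 and Q3 = 16.5, so the fences are 7.125 and 22.125; only the value 100 falls outside them.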