Mod 4

Uploaded by

mranasmalik65
Copyright
© © All Rights Reserved

In [10]: import pandas as pd

import numpy as np
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

4.1 Data cleaning using Python and scikit-learn

1. Read from a CSV file with a comma delimiter

In [11]: df=pd.read_csv('heart failure.csv')

Create a categorical column from age (below 60 is 'Younger', otherwise 'Older')

In [12]: df['age_category'] = df['age'].apply(lambda x: 'Younger' if x < 60 else 'Older')
df.head()

Out[12]: age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT age_category
0 75.0 0 582 0 20 1 265000.00 1.9 130 1 0 4 1 Older
1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1 Younger
2 65.0 0 146 0 20 0 162000.00 1.3 129 1 1 7 1 Older
3 50.0 1 111 0 20 0 210000.00 1.9 137 1 0 7 1 Younger
4 65.0 1 160 1 20 0 327000.00 2.7 116 0 0 8 1 Older
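
The same binning can be written without a Python-level lambda. The sketch below is only an illustration: `ages` is a small made-up frame (not the heart failure data), and it assumes the same cut-off of 60 used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the notebook itself reads these values from the CSV.
ages = pd.DataFrame({"age": [75.0, 55.0, 65.0, 50.0, 60.0]})

# Vectorized form of the rule: strictly below 60 is 'Younger', otherwise 'Older'.
ages["age_category"] = np.where(ages["age"] < 60, "Younger", "Older")

# The lambda-based version from the notebook, kept for comparison.
via_apply = ages["age"].apply(lambda x: "Younger" if x < 60 else "Older")

print(ages)
print((ages["age_category"] == via_apply).all())
```

Note that an age of exactly 60 falls into 'Older' under either form, because the comparison is strict.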

2. Create dummy variables for the age_category column

In [13]: dummy=pd.get_dummies(df['age_category'])
print(dummy)
df.head()

Older Younger
0 True False
1 False True
2 True False
3 False True
4 True False
.. ... ...
294 True False
295 False True
296 False True
297 False True
298 False True

[299 rows x 2 columns]


Out[13]: age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT age_category
0 75.0 0 582 0 20 1 265000.00 1.9 130 1 0 4 1 Older
1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1 Younger
2 65.0 0 146 0 20 0 162000.00 1.3 129 1 1 7 1 Older
3 50.0 1 111 0 20 0 210000.00 1.9 137 1 0 7 1 Younger
4 65.0 1 160 1 20 0 327000.00 2.7 116 0 0 8 1 Older
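
`pd.get_dummies` returns a separate frame, which is why `df.head()` above still shows only the original columns. A minimal sketch of attaching the indicator columns follows. It uses a made-up `sample` frame, and the `age` prefix and 0/1 integer dtype are illustrative choices, not something the notebook does.

```python
import pandas as pd

# Hypothetical sample standing in for the heart failure data.
sample = pd.DataFrame({
    "age": [75.0, 55.0, 65.0],
    "age_category": ["Older", "Younger", "Older"],
})

# dtype=int gives 0/1 columns instead of booleans; prefix names the new columns.
dummies = pd.get_dummies(sample["age_category"], prefix="age", dtype=int)

# Join the indicator columns back onto the original rows by index.
sample_encoded = pd.concat([sample, dummies], axis=1)
print(sample_encoded)
```

For a linear model, one of the two indicator columns is redundant; `pd.get_dummies(..., drop_first=True)` keeps only one of them.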

3. Outliers for the column ejection_fraction (IQR rule)

In [14]: Q1 = df['ejection_fraction'].quantile(0.25)
Q3 = df['ejection_fraction'].quantile(0.75)
IQR = Q3 - Q1

# Values more than 1.5 * IQR outside the quartiles are treated as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['ejection_fraction'] < lower_bound) | (df['ejection_fraction'] > upper_bound)]

print("Outliers in ejection_fraction column:")
print(outliers)

Outliers in ejection_fraction column:
      age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
64   45.0        0                       582         0                 80
217  54.0        1                       427         0                 70

     high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
64                     0  263358.03              1.18           137    0
217                    1  151000.00              9.00           137    0

     smoking  time  DEATH_EVENT  age_category
64         0    63            0       Younger
217        0   196            1       Younger
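
The IQR rule used above can be wrapped in a small helper so it is easy to reuse on other columns. This is a sketch only: `iqr_bounds` is a hypothetical helper name, and `values` is made-up data rather than the `ejection_fraction` column.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values, not taken from the dataset.
values = pd.Series([20, 30, 35, 38, 40, 45, 80])

lower, upper = iqr_bounds(values)
flagged = values[(values < lower) | (values > upper)]

print(lower, upper)
print(flagged.tolist())
```

For these values Q1 is 32.5 and Q3 is 42.5, so the IQR is 10 and the fences are 17.5 and 57.5; only 80 lies outside them.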

4. Compare the distribution of creatinine_phosphokinase with and without missing values using box plots

In [15]: df_original = df.copy()

np.random.seed(0) # For reproducibility
null_indices = np.random.choice(df.index, size=int(len(df) * 0.1), replace=False) # Pick 10% of the rows
df.loc[null_indices, 'creatinine_phosphokinase'] = np.nan # Set them to NaN

# Create a new copy with missing values for comparison
df_with_nulls = df.copy()

plt.figure(figsize=(8, 6))
# NaN values are dropped from the masked column before plotting
plt.boxplot([df_original['creatinine_phosphokinase'], df_with_nulls['creatinine_phosphokinase'].dropna()],
            patch_artist=True)
# Label the boxes with xticks(): the 'labels' parameter of boxplot() was renamed
# 'tick_labels' in Matplotlib 3.9 and the old name raises a MatplotlibDeprecationWarning
plt.xticks([1, 2], ['Without Missing', 'With Missing'])
plt.title("Comparison of 'creatinine_phosphokinase' Distribution With and Without Missing Values")
plt.ylabel("Creatinine Phosphokinase")
plt.show()

4.2 Missing value imputation using Python and scikit-learn

1. Check for missing values in the dataset

In [16]: print("Missing values in each column:")
print(df.isnull().sum())

Missing values in each column:
age                          0
anaemia                      0
creatinine_phosphokinase    29
diabetes                     0
ejection_fraction            0
high_blood_pressure          0
platelets                    0
serum_creatinine             0
serum_sodium                 0
sex                          0
smoking                      0
time                         0
DEATH_EVENT                  0
age_category                 0
dtype: int64
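
The count of 29 follows from the masking step: the frame has 299 rows and `int(len(df) * 0.1)` truncates 29.9 to 29. The sketch below repeats that arithmetic and shows one way to report the share of missing values per column; `toy` is a made-up frame used only for illustration.

```python
import numpy as np
import pandas as pd

n_rows = 299                    # row count reported in the dummy-variable output
n_missing = int(n_rows * 0.1)   # same expression as the masking step
print(n_missing)

# Made-up frame: column 'a' has one missing value out of four.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [1, 2, 3, 4]})

# isnull().mean() is the fraction of missing entries per column.
missing_share = toy.isnull().mean() * 100
print(missing_share)
```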

2. Treat missing values with deletion or imputation

In [17]: # Mean imputation: replace each NaN with the mean of the observed values
imputer = SimpleImputer(strategy='mean')
df['creatinine_phosphokinase'] = imputer.fit_transform(df[['creatinine_phosphokinase']])
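
The cell above uses mean imputation only. For comparison, here is a minimal sketch of the deletion route and of median imputation, which is less sensitive to extreme values than the mean. The `toy` frame and its `cpk` column are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column with one gap and one extreme value.
toy = pd.DataFrame({"cpk": [100.0, 120.0, np.nan, 140.0, 8000.0]})

# Option 1: deletion. Drop the rows where the column is missing.
deleted = toy.dropna(subset=["cpk"])

# Option 2: median imputation. fit_transform returns a 2D array, so flatten it.
median_imputer = SimpleImputer(strategy="median")
imputed = toy.copy()
imputed["cpk"] = median_imputer.fit_transform(toy[["cpk"]]).ravel()

print(len(deleted))
print(imputed["cpk"].tolist())
```

Here the median of the observed values is 130, while the mean would be 2090 because of the single extreme value.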

3. Visualize the distribution of the column that had missing values (histogram)

In [18]: # Histogram of the 'creatinine_phosphokinase' column after imputation
plt.hist(df['creatinine_phosphokinase'], bins=20, color='skyblue', edgecolor='black')
plt.title("Distribution of 'creatinine_phosphokinase' After Imputation")
plt.xlabel("Creatinine Phosphokinase")
plt.ylabel("Frequency")
plt.show()

4. Visualize the relationship between the variable with missing values, "creatinine_phosphokinase", and another variable, "age", using a scatter plot

In [19]: # Scatter plot to see relationship between 'creatinine_phosphokinase' and 'age'
plt.scatter(df['age'], df['creatinine_phosphokinase'], alpha=0.5, color='purple')
plt.title("Creatinine Phosphokinase vs Age")
plt.xlabel("Age")
plt.ylabel("Creatinine Phosphokinase")
plt.show()

4.3 Data normalization using scikit-learn

1. Various normalization techniques: Min-Max scaling, standardization, robust scaling, encoding, and normalization

In [20]: from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Example using Min-Max Scaling
min_max_scaler = MinMaxScaler()
df['normalized_age'] = min_max_scaler.fit_transform(df[['age']])

# Example using Standardization
standard_scaler = StandardScaler()
df['standardized_ejection_fraction'] = standard_scaler.fit_transform(df[['ejection_fraction']])

# Example using Robust Scaling (useful if there are outliers)
robust_scaler = RobustScaler()
df['robust_platelets'] = robust_scaler.fit_transform(df[['platelets']])

df.head()

Out[20]: age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT age_category normalized_age standardized_eje
0 75.0 0 582.0 0 20 1 265000.00 1.9 130 1 0 4 1 Older 0.636364
1 55.0 0 7861.0 0 38 0 263358.03 1.1 136 1 0 6 1 Younger 0.272727
2 65.0 0 146.0 0 20 0 162000.00 1.3 129 1 1 7 1 Older 0.454545
3 50.0 1 111.0 0 20 0 210000.00 1.9 137 1 0 7 1 Younger 0.181818
4 65.0 1 160.0 1 20 0 327000.00 2.7 116 0 0 8 1 Older 0.454545
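
To make the three scalers less of a black box, the sketch below recomputes each one by hand with NumPy and compares the result with scikit-learn's default settings. The array `x` is made up; its minimum of 40 and maximum of 95 were chosen to be consistent with the `normalized_age` values in Out[20] (for example, 75 maps to 0.636364).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Made-up single feature as a column vector (scikit-learn expects 2D input).
x = np.array([[40.0], [50.0], [60.0], [75.0], [95.0]])

# Min-Max scaling: (x - min) / (max - min), mapping the data onto [0, 1].
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population std (ddof=0).
manual_standard = (x - x.mean()) / x.std()

# Robust scaling: (x - median) / IQR, with the IQR taken between the 25th and 75th percentiles.
q1, q3 = np.percentile(x, [25, 75])
manual_robust = (x - np.median(x)) / (q3 - q1)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(x)))
print(np.allclose(manual_robust, RobustScaler().fit_transform(x)))
print(manual_minmax.ravel())
```

Because robust scaling centers on the median and divides by the IQR, a single extreme value shifts it far less than it shifts the Min-Max or standardized result.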

2. Plot histograms before and after normalization

In [21]: # Before normalization
plt.hist(df['age'], bins=20, alpha=0.5, label='Original Age')
# After normalization (values lie in [0, 1], so this histogram sits at the far left of the shared axis)
plt.hist(df['normalized_age'], bins=20, alpha=0.5, label='Normalized Age')
plt.legend(loc='upper right')
plt.title("Age Before and After Min-Max Normalization")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

3. Box plots to compare the median, quartiles, and outliers in your data

In [23]: # Box plot before and after normalization
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].boxplot(df['age'], patch_artist=True)
axes[0].set_title("Original Age Data")

axes[1].boxplot(df['normalized_age'], patch_artist=True)
axes[1].set_title("Normalized Age Data (Min-Max)")

plt.show()
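
Min-Max scaling is a linear rescaling, so the two box plots should have the same shape and differ only in the axis values. A small numeric check of that claim, on made-up values and with a hypothetical helper `iqr_outlier_mask`, is sketched below: the points flagged by the 1.5 * IQR rule are the same before and after scaling.

```python
import numpy as np

# Made-up values with one clear outlier.
values = np.array([40.0, 50.0, 55.0, 60.0, 65.0, 150.0])

def iqr_outlier_mask(a):
    """Boolean mask of points outside Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(a, [25, 75])
    iqr = q3 - q1
    return (a < q1 - 1.5 * iqr) | (a > q3 + 1.5 * iqr)

# Min-Max scaling by hand: (x - min) / (max - min).
scaled = (values - values.min()) / (values.max() - values.min())

mask_original = iqr_outlier_mask(values)
mask_scaled = iqr_outlier_mask(scaled)

print(mask_original)
print(mask_scaled)
```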
