
Mod 4

Uploaded by

mranasmalik65

In [10]: import pandas as pd

import numpy as np
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

4.1 Data cleaning using Python and scikit-learn

1. Read from a CSV file with a comma delimiter

In [11]: df=pd.read_csv('heart failure.csv')
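
The call above relies on the comma being the default separator of `read_csv`. A minimal sketch that states the delimiter explicitly and sanity-checks the result is given below; the inline sample text and the name `sample_df` are made up for illustration and stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for 'heart failure.csv'
sample_csv = io.StringIO(
    "age,ejection_fraction,DEATH_EVENT\n"
    "75,20,1\n"
    "55,38,1\n"
    "65,20,1\n"
)

# sep=',' is already the default; writing it out documents the expected delimiter
sample_df = pd.read_csv(sample_csv, sep=',')

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # check that numeric columns were parsed as numbers
```

Checking `shape` and `dtypes` right after loading is a cheap way to catch a wrong delimiter, which typically shows up as a single wide text column.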

Creating a categorical column from age

In [12]: df['age_category']=df['age'].apply(lambda x: 'Younger' if x < 60 else 'Older')


df.head()

Out[12]: age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT age_category

0 75.0 0 582 0 20 1 265000.00 1.9 130 1 0 4 1 Older

1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1 Younger

2 65.0 0 146 0 20 0 162000.00 1.3 129 1 1 7 1 Older

3 50.0 1 111 0 20 0 210000.00 1.9 137 1 0 7 1 Younger

4 65.0 1 160 1 20 0 327000.00 2.7 116 0 0 8 1 Older
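
As an alternative to the `apply`/`lambda` approach, the same split can be sketched with `pd.cut`, which is vectorized and returns a categorical column. The toy frame `ages` is an assumption for illustration; the threshold of 60 is taken from the cell above.

```python
import pandas as pd

# Toy ages for illustration, including the boundary value 60
ages = pd.DataFrame({'age': [75.0, 55.0, 65.0, 50.0, 60.0]})

# right=False makes the bins [0, 60) and [60, inf), so 60 counts as 'Older',
# matching the `x < 60` rule used in the lambda
ages['age_category'] = pd.cut(
    ages['age'],
    bins=[0, 60, float('inf')],
    labels=['Younger', 'Older'],
    right=False,
)

print(ages)
```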

2. Create dummy variables for the age_category column

In [13]: dummy=pd.get_dummies(df['age_category'])
print(dummy)
df.head()

Older Younger
0 True False
1 False True
2 True False
3 False True
4 True False
.. ... ...
294 True False
295 False True
296 False True
297 False True
298 False True

[299 rows x 2 columns]


Out[13]: age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT age_category

0 75.0 0 582 0 20 1 265000.00 1.9 130 1 0 4 1 Older

1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1 Younger

2 65.0 0 146 0 20 0 162000.00 1.3 129 1 1 7 1 Older

3 50.0 1 111 0 20 0 210000.00 1.9 137 1 0 7 1 Younger

4 65.0 1 160 1 20 0 327000.00 2.7 116 0 0 8 1 Older
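
Out[13] shows `df` unchanged because the dummy frame is only printed and never attached to it. A minimal sketch of attaching the indicator columns is given below, using a toy frame named `demo` (an assumption); `prefix` and `dtype` are optional `get_dummies` arguments chosen here for readability.

```python
import pandas as pd

# Toy frame for illustration
demo = pd.DataFrame({'age_category': ['Older', 'Younger', 'Older']})

# dtype=int gives 0/1 instead of True/False; prefix makes the column names unambiguous
dummies = pd.get_dummies(demo['age_category'], prefix='age', dtype=int)

# Attach the indicator columns alongside the original data
demo = pd.concat([demo, dummies], axis=1)

print(demo)
```

For linear models it is common to pass `drop_first=True` as well, since one of the two indicator columns is redundant.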

3. Outliers for the column ejection_fraction

In [14]: Q1 = df['ejection_fraction'].quantile(0.25)
Q3 = df['ejection_fraction'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR


upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['ejection_fraction'] < lower_bound) | (df['ejection_fraction'] > upper_bound)]

print("Outliers in ejection_fraction column:")


print(outliers)

Outliers in ejection_fraction column:


age anaemia creatinine_phosphokinase diabetes ejection_fraction \
64 45.0 0 582 0 80
217 54.0 1 427 0 70

high_blood_pressure platelets serum_creatinine serum_sodium sex \


64 0 263358.03 1.18 137 0
217 1 151000.00 9.00 137 0

smoking time DEATH_EVENT age_category


64 0 63 0 Younger
217 0 196 1 Younger
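
The IQR rule used above can be wrapped in a small reusable helper so that it can be applied to any numeric column. This is a sketch; the function name `iqr_outliers` and the toy values are assumptions, not part of the notebook.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy values for illustration; only 80 lies beyond the upper fence
values = pd.Series([20, 25, 30, 35, 38, 40, 45, 80])
print(iqr_outliers(values))
```

With a DataFrame, `iqr_outliers(df['ejection_fraction'])` would return the flagged values together with their row indices.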

4. Compare the distribution of creatinine_phosphokinase with and without missing values using box plots

In [15]: df_original = df.copy()

np.random.seed(0) # For reproducibility


null_indices = np.random.choice(df.index,size=int(len(df) * 0.1), replace=False) # Set 10% as NaN
df.loc[null_indices, 'creatinine_phosphokinase'] = np.nan

# Create a new copy with missing values for comparison


df_with_nulls = df.copy()

plt.figure(figsize=(8, 6))
plt.boxplot([df_original['creatinine_phosphokinase'], df_with_nulls['creatinine_phosphokinase'].dropna()],
            patch_artist=True,
            tick_labels=['Without Missing', 'With Missing'])  # 'labels' was renamed 'tick_labels' in Matplotlib 3.9
plt.title("Comparison of 'creatinine_phosphokinase' Distribution With and Without Missing Values")
plt.ylabel("Creatinine Phosphokinase")
plt.show()
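
A box plot comparison can be backed up with numbers by placing the summary statistics side by side. The sketch below uses synthetic right-skewed data (an assumption, not the real column) and removes 10% of the entries at random, as in the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values, not the real creatinine_phosphokinase data
original = pd.Series(rng.lognormal(mean=5.5, sigma=1.0, size=300))

# Blank out 10% of the entries at random
with_nulls = original.copy()
missing_idx = rng.choice(original.index, size=int(len(original) * 0.1), replace=False)
with_nulls.loc[missing_idx] = np.nan

# Count, mean, std, and quartiles before and after dropping the missing rows
summary = pd.DataFrame({
    'original': original.describe(),
    'after_dropna': with_nulls.dropna().describe(),
})
print(summary)
```

When values are removed completely at random, the quartiles usually shift only slightly, which is what two similar-looking boxes would indicate.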

4.2 Missing value imputation using Python and scikit-learn

1. Checking for missing values in the dataset

In [16]: print("Missing values in each column:")


print(df.isnull().sum())

Missing values in each column:


age 0
anaemia 0
creatinine_phosphokinase 29
diabetes 0
ejection_fraction 0
high_blood_pressure 0
platelets 0
serum_creatinine 0
serum_sodium 0
sex 0
smoking 0
time 0
DEATH_EVENT 0
age_category 0
dtype: int64
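
The count of 29 follows from `int(299 * 0.1)` in the earlier cell, which is roughly 9.7% of the 299 rows. Reporting the share of missing values per column is often easier to read than raw counts; the sketch below shows one way to do that on a toy frame (the frame `demo` and its columns are assumptions).

```python
import numpy as np
import pandas as pd

# Toy frame for illustration: column 'a' has one missing entry out of four
demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [1, 2, 3, 4],
})

# isnull().mean() is the fraction of missing entries in each column
missing_pct = demo.isnull().mean() * 100
print(missing_pct.round(1))
```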

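2. Treating Missing Values with Deletion or Imputation

In [17]: imputer = SimpleImputer(strategy='mean')  # replace each NaN with the column mean

# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it for the column assignment
df['creatinine_phosphokinase'] = imputer.fit_transform(df[['creatinine_phosphokinase']]).ravel()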
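
The heading mentions deletion as well as imputation, while the cell shows only mean imputation. A minimal sketch of the two other common options, row deletion and median imputation, is given below on a toy column named `cpk` (an assumption). The median is often preferred for skewed columns because a single extreme value pulls the mean upward.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing entry and one extreme value
demo = pd.DataFrame({'cpk': [100.0, 120.0, np.nan, 140.0, 8000.0]})

# Option 1: deletion, dropping rows where the column is missing
deleted = demo.dropna(subset=['cpk'])

# Option 2: median imputation, less sensitive to the extreme value than the mean
median_imputer = SimpleImputer(strategy='median')
imputed = demo.copy()
imputed['cpk'] = median_imputer.fit_transform(imputed[['cpk']]).ravel()

print(len(deleted))             # number of rows left after deletion
print(imputed['cpk'].tolist())  # missing entry replaced by the median of the observed values
```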

3. Visualizing Missing Value Distribution (Histogram)

In [18]: # Histogram of the 'creatinine_phosphokinase' column after imputation


plt.hist(df['creatinine_phosphokinase'], bins=20, color='skyblue', edgecolor='black')
plt.title("Distribution of 'creatinine_phosphokinase' After Imputation")
plt.xlabel("Creatinine Phosphokinase")
plt.ylabel("Frequency")
plt.show()

4. Visualize the relationship between the variable with missing values, "creatinine_phosphokinase", and another variable, "age", using a scatter plot

In [19]: # Scatter plot to see relationship between 'creatinine_phosphokinase' and 'age'


plt.scatter(df['age'], df['creatinine_phosphokinase'], alpha=0.5, color='purple')
plt.title("Creatinine Phosphokinase vs Age")
plt.xlabel("Age")
plt.ylabel("Creatinine Phosphokinase")
plt.show()
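
A scatter plot can be complemented with a correlation coefficient. The sketch below computes Pearson's coefficient and a rank-based (Spearman) coefficient on toy values; the frame `demo` and its numbers are assumptions for illustration only.

```python
import pandas as pd

# Toy values for illustration, with one extreme cpk reading
demo = pd.DataFrame({
    'age': [50, 55, 60, 65, 75],
    'cpk': [111, 7861, 146, 160, 582],
})

# Pearson measures linear association and is sensitive to extreme values
pearson = demo['age'].corr(demo['cpk'])

# Spearman is Pearson applied to ranks, so extreme values carry less weight
spearman = demo['age'].rank().corr(demo['cpk'].rank())

print(round(pearson, 3), round(spearman, 3))
```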

4.3 Data normalization using scikit-learn

1. Various normalization techniques: Min-Max scaling, standardization, robust scaling, encoding, and normalization

In [20]: from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Example using Min-Max Scaling


min_max_scaler = MinMaxScaler()
df['normalized_age'] = min_max_scaler.fit_transform(df[['age']])

# Example using Standardization


standard_scaler = StandardScaler()
df['standardized_ejection_fraction'] = standard_scaler.fit_transform(df[['ejection_fraction']])

# Example using Robust Scaling (useful if there are outliers)


robust_scaler = RobustScaler()
df['robust_platelets'] = robust_scaler.fit_transform(df[['platelets']])

df.head()

Out[20]: age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT age_category normalized_age standardized_eje

0 75.0 0 582.0 0 20 1 265000.00 1.9 130 1 0 4 1 Older 0.636364

1 55.0 0 7861.0 0 38 0 263358.03 1.1 136 1 0 6 1 Younger 0.272727

2 65.0 0 146.0 0 20 0 162000.00 1.3 129 1 1 7 1 Older 0.454545

3 50.0 1 111.0 0 20 0 210000.00 1.9 137 1 0 7 1 Younger 0.181818

4 65.0 1 160.0 1 20 0 327000.00 2.7 116 0 0 8 1 Older 0.454545
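
The three scalers used above correspond to simple formulas, which can be checked by hand against scikit-learn. The sketch below does this on a toy column; the extremes 40 and 95 are an assumption, chosen because they reproduce the `normalized_age` values shown in Out[20] (for example, (75 - 40) / (95 - 40) ≈ 0.636364).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Toy column; 40 and 95 are assumed extremes for illustration
x = np.array([[40.0], [50.0], [55.0], [65.0], [75.0], [95.0]])

# Min-Max scaling: (x - min) / (max - min), mapped to [0, 1]
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
manual_standard = (x - x.mean()) / x.std()

# Robust scaling: (x - median) / IQR, with IQR = Q3 - Q1
q1, q3 = np.percentile(x, [25, 75])
manual_robust = (x - np.median(x)) / (q3 - q1)

# Each manual result should agree with the corresponding scikit-learn scaler
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(x)))
print(np.allclose(manual_robust, RobustScaler().fit_transform(x)))
```

Because robust scaling uses the median and the interquartile range, a single extreme value changes the result far less than it does for Min-Max scaling, where it sets the range.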

2. Plot Histograms Before and After Normalization

In [21]: # Before normalization


plt.hist(df['age'], bins=20, alpha=0.5, label='Original Age')
# After normalization
plt.hist(df['normalized_age'], bins=20, alpha=0.5, label='Normalized Age')
plt.legend(loc='upper right')
plt.title("Age Before and After Min-Max Normalization")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
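
Overlaying both histograms on one axis squeezes the normalized values into a narrow sliver near zero, because the original ages span tens of years while the normalized ones span 0 to 1. A sketch of a side-by-side layout is given below, using synthetic ages (an assumption) so that it runs without the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic ages between 40 and 95, not the real column
age = rng.integers(40, 96, size=300).astype(float)
normalized_age = (age - age.min()) / (age.max() - age.min())

# Separate axes so that each histogram gets its own x-scale
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(age, bins=20, color='skyblue', edgecolor='black')
axes[0].set_title("Original Age")
axes[0].set_xlabel("Age (years)")
axes[0].set_ylabel("Frequency")

axes[1].hist(normalized_age, bins=20, color='salmon', edgecolor='black')
axes[1].set_title("Min-Max Normalized Age")
axes[1].set_xlabel("Normalized age (0 to 1)")

plt.tight_layout()
plt.show()
```

The two panels should have the same shape, since Min-Max scaling only shifts and rescales the x-axis.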

3. Box plots to compare the median, quartiles, and outliers in your data

In [23]: # Box plot before and after normalization


fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].boxplot(df['age'], patch_artist=True)
axes[0].set_title("Original Age Data")

axes[1].boxplot(df['normalized_age'], patch_artist=True)
axes[1].set_title("Normalized Age Data (Min-Max)")

plt.show()
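
The two box plots are expected to look alike apart from the axis scale, because Min-Max scaling is a linear rescaling: quartiles and whiskers move together, so the same points are flagged as outliers. The sketch below checks this numerically; the helper `iqr_flags` and the toy values are assumptions for illustration.

```python
import numpy as np

def iqr_flags(values, k=1.5):
    """Boolean mask of points outside the k*IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy values for illustration; 130 is a deliberate extreme
raw = np.array([42.0, 50.0, 55.0, 60.0, 60.0, 65.0, 70.0, 130.0])
scaled = (raw - raw.min()) / (raw.max() - raw.min())  # Min-Max scaling

# The same points should be flagged before and after scaling
print(np.array_equal(iqr_flags(raw), iqr_flags(scaled)))
```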