Experiment 5
Aim:
To understand and apply data transformation techniques in order to make a particular
healthcare dataset ready for analysis.
Objective:
To experiment with various data transformation techniques.
Outcome:
A processed, clean dataset: free of outliers, properly formatted, scaled, normalized,
and ready for analysis.
Theory:
Data transformation is a critical step in the machine learning (ML) process, especially in the
context of healthcare data. Raw healthcare data often comes from various sources, such as
electronic health records, lab results, and patient surveys. This data can be noisy,
incomplete, or unstructured, necessitating a transformation process to convert it into a
suitable format for analysis. Techniques such as normalization, standardization, and
encoding categorical variables help to ensure that the data is consistent and can be
effectively used in ML models. By transforming data, we make it more interpretable and
reliable, which is crucial for drawing valid conclusions in a healthcare setting.
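As a small illustration of the encoding step mentioned above, a categorical column such as a diagnosis field can be one-hot encoded with pandas. This is a minimal sketch on a toy frame; the values shown are illustrative, not taken from the experiment's dataset.

```python
import pandas as pd

# A toy frame with one categorical column (values are illustrative)
toy = pd.DataFrame({'Diagnosis': ['Leukemia', 'Lymphoma', 'Myeloma', 'Lymphoma']})

# One-hot encoding turns each category into its own 0/1 indicator column,
# which ML models can consume directly
encoded = pd.get_dummies(toy, columns=['Diagnosis'])
print(encoded.columns.tolist())
# ['Diagnosis_Leukemia', 'Diagnosis_Lymphoma', 'Diagnosis_Myeloma']
```

Each resulting indicator column marks whether the row belongs to that category, which is why a column like 'Diagnosis_Lymphoma' can later serve as a binary target.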
Dataset Description:
To allow experimentation with different data transformation techniques, the dataset
for this experiment was synthesized, with outliers, missing values, and similar
irregularities introduced manually so as to observe how the devised code handles them.
The dataset was generated under the scenario of a hospital keeping records of its patients
suffering from either Leukemia, Lymphoma or Myeloma. It consists of 9 features – Age,
Gender, Weight, Height, Haemoglobin level, WBC count, Platelet count, Diagnosis and
Treatment received.
Implementation:
Following is a step-by-step implementation of the various tasks performed under
data transformation.
Link to the notebook -> DataPreprocessing
import numpy as np
import pandas as pd

np.random.seed(42)  # fix the seed before generation so the data is reproducible

num_rows = 1000
data = {
    'Age': np.random.randint(18, 80, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Weight': np.random.uniform(50, 120, size=num_rows),
    'Height': np.random.uniform(150, 200, size=num_rows),
    'Hemoglobin_Level': np.random.uniform(10, 18, size=num_rows),
    'White_Blood_Cell_Count': np.random.randint(3000, 20000, size=num_rows),
    'Platelet_Count': np.random.randint(150000, 450000, size=num_rows),
    'Diagnosis': np.random.choice(['Leukemia', 'Lymphoma', 'Myeloma'],
                                  size=num_rows),
    'Treatment_Received': np.random.choice(['Chemotherapy', 'Radiation', 'Surgery',
                                            'None'], size=num_rows)
}
df = pd.DataFrame(data)

# Manually accommodate outliers: the last 50 WBC counts are drawn from an
# abnormally high range so that outlier removal can be demonstrated later
df['White_Blood_Cell_Count'] = np.append(
    np.random.randint(3000, 20000, size=num_rows - 50),
    np.random.randint(20000, 50000, size=50)
)

# Manually accommodate missing values in 'Hemoglobin_Level' (about 5% of rows)
# so that imputation can be demonstrated later
df.loc[df.sample(frac=0.05, random_state=42).index, 'Hemoglobin_Level'] = np.nan
df
Handling the missing values in the column ‘Hemoglobin_Level’ by imputing the mean
of the known values (the mode may be used instead for categorical columns, and
dropping a column in which most values are missing may also be considered)
df['Hemoglobin_Level'] = df['Hemoglobin_Level'].fillna(df['Hemoglobin_Level'].mean())
df
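For completeness, the mode-based imputation mentioned above for categorical columns could look like the following sketch. The toy series is illustrative, since the synthesized ‘Gender’ column in this experiment was generated without missing values.

```python
import numpy as np
import pandas as pd

# A toy categorical series with one missing entry (values are illustrative)
s = pd.Series(['Male', 'Female', np.nan, 'Female'])

# mode() returns a Series of the most frequent value(s); take the first one
s = s.fillna(s.mode()[0])
print(s.tolist())
# ['Male', 'Female', 'Female', 'Female']
```

The missing entry is replaced by 'Female', the most frequent category, which is the categorical analogue of mean imputation for numerical columns.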
Detecting and removing outliers in ‘White_Blood_Cell_Count’ using the interquartile
range (IQR) rule, under which values beyond 1.5 × IQR from the quartiles are dropped
import matplotlib.pyplot as plt
import seaborn as sns
Q1 = df['White_Blood_Cell_Count'].quantile(0.25)
Q3 = df['White_Blood_Cell_Count'].quantile(0.75)
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
# Keep only the rows whose WBC count lies within the IQR fences
df_cleaned = df[(df['White_Blood_Cell_Count'] >= lower_bound) &
                (df['White_Blood_Cell_Count'] <= upper_bound)]
plt.figure(figsize=(10, 6))
sns.boxplot(y=df_cleaned['White_Blood_Cell_Count'])
plt.title('Boxplot of White Blood Cell Count (After Removing Outliers)')
plt.ylabel('White Blood Cell Count (cells/uL)')
plt.show()
Standardizing the updated dataset (usually needed to make sure one feature does not
overpower another purely because of its larger scale)
from sklearn.preprocessing import StandardScaler
numerical_cols = ['Age', 'Weight', 'Height', 'Hemoglobin_Level',
                  'White_Blood_Cell_Count', 'Platelet_Count']
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_cleaned[numerical_cols])
# Wrap the scaled array back into a DataFrame so it can be inspected and reused
scaled_df = pd.DataFrame(scaled_features, columns=numerical_cols,
                         index=df_cleaned.index)
scaled_df
Alternatively, the dataset may be normalized with respect to the numerical features
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
numerical_cols = ['Age', 'Weight', 'Height', 'Hemoglobin_Level',
                  'White_Blood_Cell_Count', 'Platelet_Count']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df
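Min-max normalization maps each feature to [0, 1] via (x − min) / (max − min), in contrast to standardization, which centers on the mean. A quick check on a toy column (the ages are illustrative) makes the formula concrete:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single toy feature column; 49 is the midpoint of [18, 80]
ages = np.array([[18.0], [49.0], [80.0]])

# fit_transform rescales so the minimum maps to 0 and the maximum to 1
scaled = MinMaxScaler().fit_transform(ages)
print(scaled.ravel())
# [0.  0.5 1. ]
```

Because the output range is fixed, min-max scaling is a common choice when features must stay non-negative or feed into models that expect bounded inputs.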
Creating two new features (‘Age_Group’ and ‘Retention Index’) using the data available
bins = [0, 18, 35, 50, 65, 80]
labels = ['0-18', '19-35', '36-50', '51-65', '66-80']
df_cleaned['Age_Group'] = pd.cut(df_cleaned['Age'], bins=bins, labels=labels,
right=False)
df_cleaned
# A BMI-style index; it is computed from the original (unscaled) columns, since
# standardized weight and height can be zero or negative, making the ratio meaningless
scaled_df['Retention Index'] = df_cleaned['Weight'] / ((df_cleaned['Height'] / 100) ** 2)
scaled_df
Splitting the dataset for training and testing of an ML model (say, for identifying
whether a patient suffers from Lymphoma)
from sklearn.model_selection import train_test_split
# One-hot encode the diagnosis so that 'Diagnosis_Lymphoma' is a binary target
df_cleaned = pd.get_dummies(df_cleaned, columns=['Diagnosis'])
X = scaled_df
y = df_cleaned['Diagnosis_Lymphoma']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
X_test
y_test
Conclusion:
By performing this experiment, I gained a brief overview of the various data
transformation techniques that may be employed to clean data and make it ready for
analysis and, further, for the training and testing of machine learning models.
Working with the synthesized healthcare dataset, I was able to carry out the
transformations successfully while also understanding their respective purposes.