
Experiment - 5

Name: Ansari Mohammed Shanouf Valijan


Class: B.E. Computer Engineering, Semester - VII
UID: 2021300004
Batch: M

Aim:
To understand and apply data transformation techniques in order to make a particular
healthcare dataset ready for analysis.

Objective:
• To experiment with various data transformation techniques.

Outcome:
• A processed, clean dataset: free of outliers, properly formatted, scaled, normalized, and ready for analysis.

Theory:
Data transformation is a critical step in the machine learning (ML) process, especially in the
context of healthcare data. Raw healthcare data often comes from various sources, such as
electronic health records, lab results, and patient surveys. This data can be noisy,
incomplete, or unstructured, necessitating a transformation process to convert it into a
suitable format for analysis. Techniques such as normalization, standardization, and
encoding categorical variables help to ensure that the data is consistent and can be
effectively used in ML models. By transforming data, we make it more interpretable and
reliable, which is crucial for drawing valid conclusions in a healthcare setting.
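
As a concrete illustration (not part of the original experiment), the two rescaling techniques mentioned above can be sketched on a small hypothetical array of hemoglobin readings:

```python
import numpy as np

# Hypothetical hemoglobin readings (g/dL), used purely for illustration.
x = np.array([10.0, 12.0, 14.0, 18.0])

# Standardization: shift to zero mean, scale to unit variance.
z = (x - x.mean()) / x.std()

# Min-max normalization: rescale linearly into [0, 1].
m = (x - x.min()) / (x.max() - x.min())

print(z)  # mean ~0, std ~1
print(m)  # smallest value maps to 0.0, largest to 1.0
```

Standardization preserves the shape of the distribution while making features comparable in scale; min-max normalization bounds every value in a fixed range, which some models (e.g. distance-based ones) prefer.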

The importance of data transformation in healthcare ML cannot be overstated, as the stakes are incredibly high. Accurate predictions and insights can significantly impact patient
outcomes, treatment effectiveness, and healthcare resource allocation. For instance,
transformed data can improve the accuracy of predictive models that forecast disease
progression or identify high-risk patients. Additionally, clear and consistent data helps
mitigate biases that could arise from disparate data sources, ensuring that the resulting
models are fair and applicable to diverse populations. By addressing these challenges,
healthcare providers can make more informed decisions, ultimately enhancing patient care.

Moreover, data transformation facilitates compliance with regulatory requirements, such as HIPAA in the United States, by ensuring that sensitive health information is properly handled. This process includes anonymization or aggregation of data to protect patient identities while still enabling meaningful analysis. As healthcare increasingly relies on data-driven approaches, the ability to transform and process data effectively becomes a competitive advantage. It allows healthcare organizations to harness the full potential of their data, paving the way for innovations in patient care, operational efficiency, and predictive analytics. In summary, data transformation is not merely a technical necessity; it is a fundamental pillar that supports the ethical and effective application of ML in healthcare.

Dataset Description:
With the objective of working with different data transformation techniques, the dataset for this experiment was synthesized, with outliers, missing values, etc. seeded manually in order to observe how the devised code handles them.

The dataset was generated under the scenario of a hospital keeping records of its patients suffering from Leukemia, Lymphoma, or Myeloma. It consists of nine features: Age, Gender, Weight, Height, Hemoglobin level, WBC count, Platelet count, Diagnosis, and Treatment received.

Implementation:
Following is a step-by-step implementation of the tasks performed under data transformation.
Link to the notebook -> DataPreprocessing

Importing the necessary libraries


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
Generating and viewing the dataset (seeding null values, outliers, etc. for proper study)

np.random.seed(13)

num_rows = 1000
data = {
    'Age': np.random.randint(18, 80, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Weight': np.random.uniform(50, 120, size=num_rows),
    'Height': np.random.uniform(150, 200, size=num_rows),
    'Hemoglobin_Level': np.random.uniform(10, 18, size=num_rows),
    'White_Blood_Cell_Count': np.random.randint(3000, 20000, size=num_rows),
    'Platelet_Count': np.random.randint(150000, 450000, size=num_rows),
    'Diagnosis': np.random.choice(['Leukemia', 'Lymphoma', 'Myeloma'], size=num_rows),
    'Treatment_Received': np.random.choice(['Chemotherapy', 'Radiation', 'Surgery', 'None'], size=num_rows)
}

df = pd.DataFrame(data)

# Seed ~10% missing values in Hemoglobin_Level
missing_indices = np.random.choice(df.index, size=int(num_rows * 0.1), replace=False)
df.loc[missing_indices, 'Hemoglobin_Level'] = np.nan

# Overwrite the WBC column so that the last 50 rows carry high outliers
np.random.seed(42)
df['White_Blood_Cell_Count'] = np.append(
    np.random.randint(3000, 20000, size=num_rows - 50),
    np.random.randint(20000, 50000, size=50)
)

df
Handling the missing values in the column 'Hemoglobin_Level' using the mean of the known values for imputation (the mode may be used for categorical columns; dropping a column where most values are missing may also be considered)

df['Hemoglobin_Level'] = df['Hemoglobin_Level'].fillna(df['Hemoglobin_Level'].mean())
df
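
The alternatives mentioned above can be sketched on a small hypothetical frame (the column names here are illustrative only, not from the experiment's dataset): mode imputation for a categorical column, and dropping a column where most values are missing.

```python
import pandas as pd
import numpy as np

# Toy frame for illustration: one categorical column with a gap,
# one column that is mostly missing.
toy = pd.DataFrame({
    'Gender': ['Male', np.nan, 'Male', 'Female'],
    'Mostly_Missing': [np.nan, np.nan, np.nan, 1.0],
})

# Mode imputation: fill the categorical gap with the most frequent value.
toy['Gender'] = toy['Gender'].fillna(toy['Gender'].mode()[0])

# Drop any column where more than half of the values are missing
# (thresh = minimum number of non-NA values required to keep a column).
toy = toy.dropna(axis=1, thresh=int(len(toy) * 0.5) + 1)

print(toy.columns.tolist())  # ['Gender']
```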

Using one-hot encoding (binary) for the categorical features


df_encoded = pd.get_dummies(df, columns=['Gender', 'Diagnosis', 'Treatment_Received'], drop_first=True)
df_encoded
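
As a quick illustration of what drop_first=True does, a toy frame with the three diagnosis labels encodes to two indicator columns; the alphabetically first category ('Leukemia') becomes the implicit baseline, which avoids redundant, perfectly-correlated dummies.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({'Diagnosis': ['Leukemia', 'Lymphoma', 'Myeloma']})

# drop_first=True drops the first category's column; a row with both
# remaining indicators at 0 is implicitly 'Leukemia'.
enc = pd.get_dummies(toy, columns=['Diagnosis'], drop_first=True)

print(enc.columns.tolist())  # ['Diagnosis_Lymphoma', 'Diagnosis_Myeloma']
```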
Viewing the box plot of the column ‘White_Blood_Cell_Count’ to check for outliers. Further,
using IQR (Interquartile range) technique to get rid of them.
plt.figure(figsize=(10, 6))
sns.boxplot(y=df['White_Blood_Cell_Count'])
plt.title('Boxplot of White Blood Cell Count (Original Data)')
plt.ylabel('White Blood Cell Count (cells/uL)')
plt.show()

Q1 = df['White_Blood_Cell_Count'].quantile(0.25)
Q3 = df['White_Blood_Cell_Count'].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
df_cleaned = df[(df['White_Blood_Cell_Count'] >= (Q1 - 1.5 * IQR)) &
                (df['White_Blood_Cell_Count'] <= (Q3 + 1.5 * IQR))]

plt.figure(figsize=(10, 6))
sns.boxplot(y=df_cleaned['White_Blood_Cell_Count'])
plt.title('Boxplot of White Blood Cell Count (After Removing Outliers)')
plt.ylabel('White Blood Cell Count (cells/uL)')
plt.show()
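
IQR filtering is not the only option: a z-score rule is a common alternative. The sketch below, on an illustrative series rather than the experiment's data, keeps only values within three standard deviations of the mean.

```python
import numpy as np
import pandas as pd

# Illustrative series: twenty typical readings and one extreme value.
s = pd.Series([5000.0] * 20 + [45000.0])

# z-score rule: drop values more than 3 standard deviations from the mean.
z = (s - s.mean()) / s.std()
kept = s[np.abs(z) < 3]

print(len(kept))  # 20 — the extreme value is removed
```

A caveat worth noting: the z-score rule assumes roughly normal data and is itself sensitive to the outliers it tries to detect, which is one reason the IQR rule used above is often preferred for skewed clinical measurements.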

Boxplot of the original column as mentioned above (Showing a lot of outliers)


Boxplot of the updated column after implementing the IQR outlier reduction technique

Standardizing the updated dataset (usually needed to make sure one feature does not
overpower another purely because of its scalar value)

scaler = StandardScaler()
numeric_cols = ['Age', 'Weight', 'Height', 'Hemoglobin_Level',
                'White_Blood_Cell_Count', 'Platelet_Count']
scaled_features = scaler.fit_transform(df_cleaned[numeric_cols])

scaled_df = pd.DataFrame(scaled_features, columns=numeric_cols)

scaled_df
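
A quick sanity check (illustrative, on a small stand-in matrix) confirms what standardization should produce: every column ends up with mean approximately 0 and standard deviation approximately 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small stand-in matrix for the numeric feature columns.
X = np.array([[25.0, 60.0],
              [40.0, 75.0],
              [65.0, 90.0]])

scaled = StandardScaler().fit_transform(X)

# After standardization each column has mean ~0 and std ~1.
print(np.allclose(scaled.mean(axis=0), 0.0))  # True
print(np.allclose(scaled.std(axis=0), 1.0))   # True
```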
Alternatively, the dataset may also be normalized with respect to the numerical features

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
numerical_cols = ['Age', 'Weight', 'Height', 'Hemoglobin_Level',
                  'White_Blood_Cell_Count', 'Platelet_Count']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df

Creating two new features ('Age_Group' and 'Retention Index') using the data available

bins = [0, 18, 35, 50, 65, 80]
labels = ['0-18', '19-35', '36-50', '51-65', '66-80']
df_cleaned = df_cleaned.copy()  # work on a copy to avoid SettingWithCopyWarning on the filtered view
df_cleaned['Age_Group'] = pd.cut(df_cleaned['Age'], bins=bins, labels=labels, right=False)
df_cleaned

scaled_df['Retention Index'] = scaled_df['Weight'] / ((scaled_df['Height'] / 100) ** 2)
scaled_df

Note that since scaled_df holds standardized (possibly negative) values, this BMI-style index is illustrative of feature engineering rather than clinically meaningful; in practice it would be computed from the raw weight and height.

Splitting the dataset for training and testing of an ML model (say, for identifying whether a patient suffers from Lymphoma)

X = scaled_df
# The Lymphoma indicator lives in df_encoded, not df_cleaned; align it with
# the rows kept after outlier removal and reset the index to match scaled_df
y = df_encoded.loc[df_cleaned.index, 'Diagnosis_Lymphoma'].reset_index(drop=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
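
For an imbalanced target such as a single-diagnosis flag, a stratified split is often preferable. The sketch below uses purely illustrative data to show that the stratify parameter preserves the class ratio in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced binary target: 80 negatives, 20 positives.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 80/20 class ratio in both train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

print(y_te.mean())  # 0.2 — same positive rate as the full data
```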

X_test as obtained
y_test as obtained

Conclusion:
By performing this experiment, I gained a brief overview of the various data transformation techniques that may be employed to clean a dataset and make it ready for analysis and, further, for training and testing machine learning models. Working with the synthesized healthcare dataset, I was able to successfully carry out the transformations while also understanding their respective purposes.
