Experiment 5
Aim:
To understand and apply data transformation techniques in order to make a particular
healthcare dataset ready for analysis.
Objective:
To experiment with various data transformation techniques.
Outcome:
A processed, clean dataset: free of outliers, properly formatted, scaled, normalized,
and ready for analysis.
Theory:
Data transformation is a critical step in the machine learning (ML) process, especially in the
context of healthcare data. Raw healthcare data often comes from various sources, such as
electronic health records, lab results, and patient surveys. This data can be noisy,
incomplete, or unstructured, necessitating a transformation process to convert it into a
suitable format for analysis. Techniques such as normalization, standardization, and
encoding categorical variables help to ensure that the data is consistent and can be
effectively used in ML models. By transforming data, we make it more interpretable and
reliable, which is crucial for drawing valid conclusions in a healthcare setting.
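As a small illustration of the encoding step mentioned above, a categorical column such as a diagnosis field can be one-hot encoded with pandas. This is a minimal sketch on a toy frame; the values shown are illustrative, not taken from the experiment's dataset.

```python
import pandas as pd

# A toy frame with one categorical column (values are illustrative)
toy = pd.DataFrame({'Diagnosis': ['Leukemia', 'Lymphoma', 'Myeloma', 'Lymphoma']})

# One-hot encoding turns each category into its own 0/1 indicator column,
# which ML models can consume directly
encoded = pd.get_dummies(toy, columns=['Diagnosis'])
print(encoded.columns.tolist())
# ['Diagnosis_Leukemia', 'Diagnosis_Lymphoma', 'Diagnosis_Myeloma']
```

Each resulting indicator column marks whether the row belongs to that category, which is why a column like 'Diagnosis_Lymphoma' can later serve as a binary target.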
Dataset Description:
To allow experimentation with different data transformation techniques, the dataset
for this experiment was synthesized, with outliers, missing values, and similar
irregularities introduced manually so as to observe how the devised code handles them.
The dataset was generated under the scenario of a hospital keeping records of its patients
suffering from either Leukemia, Lymphoma or Myeloma. It consists of 9 features – Age,
Gender, Weight, Height, Haemoglobin level, WBC count, Platelet count, Diagnosis and
Treatment received.
Implementation:
Following is a step-by-step implementation of the various tasks performed under
data transformation.
Link to the notebook -> DataPreprocessing
import numpy as np
import pandas as pd

np.random.seed(42)  # fix the seed before generation so the data is reproducible

num_rows = 1000
data = {
    'Age': np.random.randint(18, 80, size=num_rows),
    'Gender': np.random.choice(['Male', 'Female'], size=num_rows),
    'Weight': np.random.uniform(50, 120, size=num_rows),
    'Height': np.random.uniform(150, 200, size=num_rows),
    'Hemoglobin_Level': np.random.uniform(10, 18, size=num_rows),
    'White_Blood_Cell_Count': np.random.randint(3000, 20000, size=num_rows),
    'Platelet_Count': np.random.randint(150000, 450000, size=num_rows),
    'Diagnosis': np.random.choice(['Leukemia', 'Lymphoma', 'Myeloma'],
                                  size=num_rows),
    'Treatment_Received': np.random.choice(['Chemotherapy', 'Radiation', 'Surgery',
                                            'None'], size=num_rows)
}
df = pd.DataFrame(data)

# Manually accommodate outliers: the last 50 WBC counts are drawn from an
# abnormally high range so that outlier removal can be demonstrated later
df['White_Blood_Cell_Count'] = np.append(
    np.random.randint(3000, 20000, size=num_rows - 50),
    np.random.randint(20000, 50000, size=50)
)

# Manually accommodate missing values in 'Hemoglobin_Level' (about 5% of rows)
# so that imputation can be demonstrated later
df.loc[df.sample(frac=0.05, random_state=42).index, 'Hemoglobin_Level'] = np.nan
df
Handling the missing values in the column ‘Hemoglobin_Level’ by imputing the mean
of the known values (the mode may be used instead for categorical columns, and
dropping a column in which most values are missing may also be considered)
df['Hemoglobin_Level'] = df['Hemoglobin_Level'].fillna(df['Hemoglobin_Level'].mean())
df
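For completeness, the mode-based imputation mentioned above for categorical columns could look like the following sketch. The toy series is illustrative, since the synthesized ‘Gender’ column in this experiment was generated without missing values.

```python
import numpy as np
import pandas as pd

# A toy categorical series with one missing entry (values are illustrative)
s = pd.Series(['Male', 'Female', np.nan, 'Female'])

# mode() returns a Series of the most frequent value(s); take the first one
s = s.fillna(s.mode()[0])
print(s.tolist())
# ['Male', 'Female', 'Female', 'Female']
```

The missing entry is replaced by 'Female', the most frequent category, which is the categorical analogue of mean imputation for numerical columns.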
Detecting and removing outliers in ‘White_Blood_Cell_Count’ using the interquartile
range (IQR) rule, under which values beyond 1.5 × IQR from the quartiles are dropped
import matplotlib.pyplot as plt
import seaborn as sns
Q1 = df['White_Blood_Cell_Count'].quantile(0.25)
Q3 = df['White_Blood_Cell_Count'].quantile(0.75)
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
# Keep only the rows whose WBC count lies within the IQR fences
df_cleaned = df[(df['White_Blood_Cell_Count'] >= lower_bound) &
                (df['White_Blood_Cell_Count'] <= upper_bound)]
plt.figure(figsize=(10, 6))
sns.boxplot(y=df_cleaned['White_Blood_Cell_Count'])
plt.title('Boxplot of White Blood Cell Count (After Removing Outliers)')
plt.ylabel('White Blood Cell Count (cells/uL)')
plt.show()
Standardizing the updated dataset (usually needed to make sure one feature does not
overpower another purely because of its larger scale)
from sklearn.preprocessing import StandardScaler
numerical_cols = ['Age', 'Weight', 'Height', 'Hemoglobin_Level',
                  'White_Blood_Cell_Count', 'Platelet_Count']
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_cleaned[numerical_cols])
# Wrap the scaled array back into a DataFrame so it can be inspected and reused
scaled_df = pd.DataFrame(scaled_features, columns=numerical_cols,
                         index=df_cleaned.index)
scaled_df
Alternatively, the dataset may be normalized with respect to the numerical features
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
numerical_cols = ['Age', 'Weight', 'Height', 'Hemoglobin_Level',
                  'White_Blood_Cell_Count', 'Platelet_Count']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
df
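Min-max normalization maps each feature to [0, 1] via (x − min) / (max − min), in contrast to standardization, which centers on the mean. A quick check on a toy column (the ages are illustrative) makes the formula concrete:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single toy feature column; 49 is the midpoint of [18, 80]
ages = np.array([[18.0], [49.0], [80.0]])

# fit_transform rescales so the minimum maps to 0 and the maximum to 1
scaled = MinMaxScaler().fit_transform(ages)
print(scaled.ravel())
# [0.  0.5 1. ]
```

Because the output range is fixed, min-max scaling is a common choice when features must stay non-negative or feed into models that expect bounded inputs.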
Creating two new features (‘Age_Group’ and ‘Retention Index’) using the data available
bins = [0, 18, 35, 50, 65, 80]
labels = ['0-18', '19-35', '36-50', '51-65', '66-80']
df_cleaned['Age_Group'] = pd.cut(df_cleaned['Age'], bins=bins, labels=labels,
right=False)
df_cleaned
# A BMI-style index; it is computed from the original (unscaled) columns, since
# standardized weight and height can be zero or negative, making the ratio meaningless
scaled_df['Retention Index'] = df_cleaned['Weight'] / ((df_cleaned['Height'] / 100) ** 2)
scaled_df
Splitting the dataset for training and testing of an ML model (say, for identifying
whether a patient suffers from Lymphoma)
from sklearn.model_selection import train_test_split
# One-hot encode the diagnosis so that 'Diagnosis_Lymphoma' is a binary target
df_cleaned = pd.get_dummies(df_cleaned, columns=['Diagnosis'])
X = scaled_df
y = df_cleaned['Diagnosis_Lymphoma']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
X_test
y_test
Conclusion:
By performing this experiment, I gained a brief overview of the various data
transformation techniques that may be employed to clean data and make it ready for
analysis and, further, for the training and testing of machine learning models.
Working with the synthesized healthcare dataset, I was able to carry out the
transformations successfully while also understanding their respective purposes.