Mod 4

Uploaded by

mranasmalik65


In [10]: import pandas as pd

import numpy as np
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

4.1 Data cleaning using Python and scikit-learn

1. Read from a comma-delimited CSV file

In [11]: df=pd.read_csv('heart failure.csv')
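A minimal sketch of the same read with the delimiter spelled out, followed by a quick shape and dtype check. The inline rows below are a made-up stand-in so the snippet runs without the file; with the real data one would presumably pass 'heart failure.csv' in place of the StringIO buffer.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for 'heart failure.csv'
sample_csv = io.StringIO(
    "age,ejection_fraction,DEATH_EVENT\n"
    "75,20,1\n"
    "55,38,1\n"
    "65,20,0\n"
)

# sep=',' is the pandas default; it is written out here only for clarity
sample_df = pd.read_csv(sample_csv, sep=',')

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types inferred by pandas
```

Checking the shape and dtypes right after loading is a cheap way to catch a wrong delimiter, which typically shows up as a single wide column.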

Create a categorical column from age (under 60 is 'Younger', otherwise 'Older')

In [12]: df['age_category']=df['age'].apply(lambda x: 'Younger' if x < 60 else 'Older')

df.head()

Out[12]:
    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  smoking  time  DEATH_EVENT  age_category
0  75.0        0                       582         0                 20                    1  265000.00               1.9           130    1        0     4            1         Older
1  55.0        0                      7861         0                 38                    0  263358.03               1.1           136    1        0     6            1       Younger
2  65.0        0                       146         0                 20                    0  162000.00               1.3           129    1        1     7            1         Older
3  50.0        1                       111         0                 20                    0  210000.00               1.9           137    1        0     7            1       Younger
4  65.0        1                       160         1                 20                    0  327000.00               2.7           116    0        0     8            1         Older
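The same two-way split can also be written with `pd.cut`, which extends more easily to additional age bands. This is only a sketch on illustrative ages, not the notebook's own method; the bin edges are chosen to reproduce the rule "under 60 is 'Younger'".

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({'age': [75.0, 55.0, 65.0, 50.0, 60.0]})  # illustrative values

# right=False makes the bins [0, 60) and [60, inf), so 60 itself is 'Older',
# matching the lambda rule: x < 60 -> 'Younger', else 'Older'
ages['age_category'] = pd.cut(
    ages['age'],
    bins=[0, 60, np.inf],
    labels=['Younger', 'Older'],
    right=False,
)
print(ages)
```

Unlike the `apply` version, `pd.cut` returns a categorical column, which keeps the category order explicit.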

2. Create dummy variables for the age_category column

In [13]: dummy=pd.get_dummies(df['age_category'])
print(dummy)
df.head()

Older Younger
0 True False
1 False True
2 True False
3 False True
4 True False
.. ... ...
294 True False
295 False True
296 False True
297 False True
298 False True

[299 rows x 2 columns]

Out[13]:
    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  smoking  time  DEATH_EVENT  age_category
0  75.0        0                       582         0                 20                    1  265000.00               1.9           130    1        0     4            1         Older
1  55.0        0                      7861         0                 38                    0  263358.03               1.1           136    1        0     6            1       Younger
2  65.0        0                       146         0                 20                    0  162000.00               1.3           129    1        1     7            1         Older
3  50.0        1                       111         0                 20                    0  210000.00               1.9           137    1        0     7            1       Younger
4  65.0        1                       160         1                 20                    0  327000.00               2.7           116    0        0     8            1         Older
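Note that `df.head()` above is unchanged, because the dummy frame is created but never attached to `df`. A hedged sketch of one way to attach it follows, on a small illustrative frame; the `prefix`, `drop_first`, and `dtype` choices are assumptions, not part of the original cell.

```python
import pandas as pd

demo = pd.DataFrame({'age_category': ['Older', 'Younger', 'Older']})  # illustrative

# drop_first=True keeps a single indicator column, since with two categories
# the second indicator is fully determined by the first
dummies = pd.get_dummies(demo['age_category'], prefix='age', drop_first=True, dtype=int)

# Attach the indicator column to the original frame
demo = pd.concat([demo, dummies], axis=1)
print(demo)
```

Dropping one level avoids perfectly collinear columns, which matters for linear models but is optional for tree-based ones.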

3. Outliers for the column ejection_fraction (IQR method)

In [14]: Q1 = df['ejection_fraction'].quantile(0.25)
Q3 = df['ejection_fraction'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['ejection_fraction'] < lower_bound) | (df['ejection_fraction'] > upper_bound)]

print("Outliers in ejection_fraction column:")

print(outliers)

Outliers in ejection_fraction column:

      age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
64   45.0        0                       582         0                 80
217  54.0        1                       427         0                 70

     high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
64                     0  263358.03              1.18           137    0
217                    1  151000.00              9.00           137    0

     smoking  time  DEATH_EVENT  age_category
64         0    63            0       Younger
217        0   196            1       Younger
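The IQR rule used above can be wrapped in a small helper so it can be reused on other columns. This is a sketch under assumptions: `iqr_outliers` is a hypothetical name, and the sample values are illustrative rather than taken from the dataset.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Illustrative values, not taken from the dataset
values = pd.Series([20, 25, 30, 35, 38, 40, 45, 80])
print(iqr_outliers(values))
```

With the notebook's frame, the equivalent call would be `iqr_outliers(df['ejection_fraction'])`; the multiplier `k=1.5` is the conventional choice and can be raised to flag only more extreme points.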

4. Compare the distribution of creatinine_phosphokinase with and without missing values using box plots

In [15]: df_original = df.copy()

np.random.seed(0)  # For reproducibility

# Set 10% of the rows to NaN in 'creatinine_phosphokinase'
null_indices = np.random.choice(df.index, size=int(len(df) * 0.1), replace=False)
df.loc[null_indices, 'creatinine_phosphokinase'] = np.nan

# Keep a copy with missing values for comparison
df_with_nulls = df.copy()

plt.figure(figsize=(8, 6))
# NaN values are dropped from the second series before plotting
plt.boxplot([df_original['creatinine_phosphokinase'],
             df_with_nulls['creatinine_phosphokinase'].dropna()],
            patch_artist=True)
# Label the boxes via xticks: the 'labels' parameter of boxplot() was renamed
# 'tick_labels' in Matplotlib 3.9, and the old name is deprecated
plt.xticks([1, 2], ['Without Missing', 'With Missing'])
plt.title("Comparison of 'creatinine_phosphokinase' Distribution With and Without Missing Values")
plt.ylabel("Creatinine Phosphokinase")
plt.show()

4.2 Missing value imputation using Python and scikit-learn

1. Checking for missing values in the dataset

In [16]: print("Missing values in each column:")

print(df.isnull().sum())

Missing values in each column:

age 0
anaemia 0
creatinine_phosphokinase 29
diabetes 0
ejection_fraction 0
high_blood_pressure 0
platelets 0
serum_creatinine 0
serum_sodium 0
sex 0
smoking 0
time 0
DEATH_EVENT 0
age_category 0
dtype: int64
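The count of 29 is consistent with `int(299 * 0.1)` rows having been set to NaN earlier. Alongside raw counts, the share of missing values per column is often easier to judge; the sketch below shows one way to compute both on a small illustrative frame (the values are made up).

```python
import numpy as np
import pandas as pd

# Illustrative frame: 10 rows, 2 missing values in one column
demo = pd.DataFrame({
    'creatinine_phosphokinase': [582, np.nan, 146, 111, 160, np.nan, 246, 315, 157, 123],
    'age': [75, 55, 65, 50, 65, 90, 75, 60, 65, 80],
})

missing_count = demo.isnull().sum()
# The mean of a boolean mask is the fraction of True values
missing_pct = demo.isnull().mean() * 100

print(pd.DataFrame({'missing': missing_count, 'percent': missing_pct}))
```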

2. Treating Missing Values with Deletion or Imputation (mean imputation is used here)

In [17]: imputer = SimpleImputer(strategy='mean')

df['creatinine_phosphokinase'] = imputer.fit_transform(df[['creatinine_phosphokinase']])
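For comparison, here is a small sketch of the three usual options on a toy column: deletion, mean imputation, and median imputation. The column name `cpk` and its values are illustrative assumptions; the point is that one extreme value pulls the mean far more than the median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative column with one missing value and one extreme value
demo = pd.DataFrame({'cpk': [100.0, 200.0, np.nan, 300.0, 8000.0]})

# Option 1: deletion, drop the rows where the column is missing
deleted = demo.dropna(subset=['cpk'])

# Option 2: mean imputation (pulled upward by the extreme value)
mean_filled = SimpleImputer(strategy='mean').fit_transform(demo[['cpk']]).ravel()

# Option 3: median imputation (less sensitive to extreme values)
median_filled = SimpleImputer(strategy='median').fit_transform(demo[['cpk']]).ravel()

print(len(deleted), mean_filled[2], median_filled[2])
```

For a right-skewed column such as creatinine_phosphokinase (the head of the frame already shows a value of 7861 next to values in the low hundreds), the median may be the safer default, though that is a judgment call rather than a rule.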

3. Visualizing the Distribution of the Imputed Column (Histogram)

In [18]: # Histogram of the 'creatinine_phosphokinase' column after imputation

plt.hist(df['creatinine_phosphokinase'], bins=20, color='skyblue', edgecolor='black')
plt.title("Distribution of 'creatinine_phosphokinase' After Imputation")
plt.xlabel("Creatinine Phosphokinase")
plt.ylabel("Frequency")
plt.show()

4. Visualize the relationship between the variable with missing values, "creatinine_phosphokinase", and another variable, "age", using a scatter plot

In [19]: # Scatter plot to see relationship between 'creatinine_phosphokinase' and 'age'

plt.scatter(df['age'], df['creatinine_phosphokinase'], alpha=0.5, color='purple')
plt.title("Creatinine Phosphokinase vs Age")
plt.xlabel("Age")
plt.ylabel("Creatinine Phosphokinase")
plt.show()

4.3 Data normalization using scikit-learn

1. Various normalization techniques such as Min-Max scaling, standardization, robust scaling, encoding, and normalization (the cell below demonstrates the first three)

In [20]: from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Example using Min-Max Scaling

min_max_scaler = MinMaxScaler()
df['normalized_age'] = min_max_scaler.fit_transform(df[['age']])

# Example using Standardization

standard_scaler = StandardScaler()
df['standardized_ejection_fraction'] = standard_scaler.fit_transform(df[['ejection_fraction']])

# Example using Robust Scaling (useful if there are outliers)

robust_scaler = RobustScaler()
df['robust_platelets'] = robust_scaler.fit_transform(df[['platelets']])

df.head()

Out[20]:
    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  smoking  time  DEATH_EVENT  age_category  normalized_age  standardized_eje
0  75.0        0                     582.0         0                 20                    1  265000.00               1.9           130    1        0     4            1         Older        0.636364
1  55.0        0                    7861.0         0                 38                    0  263358.03               1.1           136    1        0     6            1       Younger        0.272727
2  65.0        0                     146.0         0                 20                    0  162000.00               1.3           129    1        1     7            1         Older        0.454545
3  50.0        1                     111.0         0                 20                    0  210000.00               1.9           137    1        0     7            1       Younger        0.181818
4  65.0        1                     160.0         1                 20                    0  327000.00               2.7           116    0        0     8            1         Older        0.454545
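To make the three scalers less of a black box, the sketch below recomputes each one by hand with NumPy and checks the result against scikit-learn's default settings. The ages are illustrative. As a sanity check on the output above, 0.636364 for age 75 equals (75 - 40) / (95 - 40), which would hold if the column's minimum and maximum are 40 and 95 (an assumption, since the full column is not shown).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

x = np.array([[40.0], [50.0], [65.0], [75.0], [95.0]])  # illustrative ages

# Min-Max scaling: (x - min) / (max - min), result lies in [0, 1]
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
manual_standard = (x - x.mean()) / x.std()

# Robust scaling: (x - median) / IQR, with IQR = Q3 - Q1
q1, q3 = np.percentile(x, [25, 75])
manual_robust = (x - np.median(x)) / (q3 - q1)

print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(x)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(x)))
print(np.allclose(manual_robust, RobustScaler().fit_transform(x)))
```

Because robust scaling centers on the median and divides by the IQR, a few extreme values barely change the scale, which is why it is the one suggested for columns with outliers.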

2. Plot Histograms Before and After Normalization

In [21]: # Before normalization

plt.hist(df['age'], bins=20, alpha=0.5, label='Original Age')
# After normalization
plt.hist(df['normalized_age'], bins=20, alpha=0.5, label='Normalized Age')
plt.legend(loc='upper right')
plt.title("Age Before and After Min-Max Normalization")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

3. Box plots to compare the median, quartiles, and outliers in your data

In [23]: # Box plot before and after normalization

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].boxplot(df['age'], patch_artist=True)
axes[0].set_title("Original Age Data")

axes[1].boxplot(df['normalized_age'], patch_artist=True)
axes[1].set_title("Normalized Age Data (Min-Max)")

plt.show()
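The two box plots should have the same shape, because Min-Max scaling is a linear rescaling: the median and quartiles move to their rescaled positions and the same points remain outliers. A short numeric sketch of that claim, on illustrative ages:

```python
import numpy as np

age = np.array([40.0, 50.0, 55.0, 65.0, 75.0, 95.0])  # illustrative values
scaled = (age - age.min()) / (age.max() - age.min())

def five_number_summary(a):
    """Minimum, Q1, median, Q3, maximum."""
    return np.percentile(a, [0, 25, 50, 75, 100])

original_summary = five_number_summary(age)
scaled_summary = five_number_summary(scaled)

# Rescaling the original summary reproduces the summary of the scaled data
rescaled = (original_summary - age.min()) / (age.max() - age.min())

print(original_summary)
print(scaled_summary)
print(np.allclose(rescaled, scaled_summary))
```

Only the axis units differ between the two plots, so the comparison mainly confirms that scaling did not distort the distribution.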
