
Dovdush KN-305 Lab2


Practical Work No. 2

for the course "Information Technologies of Smart Systems"

on the topic "Cardiology Clinic"


Completed by:

student of group KN-305

Dovbush Pavlo
In [1]: !pip install numpy
!pip install matplotlib
!pip install pandas
!pip install seaborn

Defaulting to user installation because normal site-packages is not writeable


Requirement already satisfied: numpy in c:\users\олеся\appdata\roaming\python\python310\site-packages (1.24.2)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: matplotlib in c:\users\олеся\appdata\roaming\python\python310\site-packages (3.7.1)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (1.0.7)
Requirement already satisfied: cycler>=0.10 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (4.39.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: numpy>=1.20 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (1.24.2)
Requirement already satisfied: packaging>=20.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (23.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas in c:\users\олеся\appdata\roaming\python\python310\site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from pandas) (2022.7.1)
Requirement already satisfied: numpy>=1.21.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from pandas) (1.24.2)
Requirement already satisfied: six>=1.5 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: seaborn in c:\users\олеся\appdata\roaming\python\python310\site-packages (0.12.2)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from seaborn) (1.24.2)
Requirement already satisfied: pandas>=0.25 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from seaborn) (1.5.3)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from seaborn) (3.7.1)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.7)
Requirement already satisfied: cycler>=0.10 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.39.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (23.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from pandas>=0.25->seaborn) (2022.7.1)
Requirement already satisfied: six>=1.5 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)

In [2]: import os.path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as stats

In [3]: import warnings
warnings.simplefilter('ignore')

In [4]: pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

Read the dataset


In [5]: print(os.path.exists("dataset_3.csv"))

True

In [6]: ds = pd.read_csv("dataset_3.csv")
ds.head()

Out[6]:
   Unnamed: 0   Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0           0  40.0    M            ATA      140.0        289.0        0.0      Normal  172.0               N      0.0        Up           0.0
1           1  49.0    F            NAP        NaN        180.0        NaN      Normal  156.0               N      1.0      Flat           1.0
2           2  37.0    M            ATA      130.0        283.0        0.0          ST    NaN               N      0.0        Up           0.0
3           3  48.0    F            ASY      138.0        214.0        0.0      Normal  108.0               Y      1.5      Flat           1.0
4           4  54.0    M            NAP      150.0        195.0        0.0      Normal  122.0               N      0.0        Up           0.0

In [7]: print('columns count - ',len(ds.columns), '\n')
print('columns: ',list(ds.columns))

columns count - 13

columns: ['Unnamed: 0', 'Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease']
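
The `Unnamed: 0` column looks like a row index that was written out when the CSV was saved. A minimal sketch, assuming that is the case here, of reading such a file with `index_col=0` so that the saved index does not show up as a feature. The inline CSV text is made-up sample data, not the contents of `dataset_3.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the layout of the CSV (not the real data)
csv_text = ",Age,Sex\n0,40.0,M\n1,49.0,F\n"

# Default read: the header-less first column becomes a feature named 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

An alternative with the same effect is to drop the column after loading with `ds.drop(columns=['Unnamed: 0'])`.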

Missing data imputation


In [8]: ds.shape

Out[8]: (918, 13)

In [9]: ds.dtypes

Out[9]:
Unnamed: 0          int64
Age               float64
Sex                object
ChestPainType      object
RestingBP         float64
Cholesterol       float64
FastingBS         float64
RestingECG         object
MaxHR             float64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease      float64
dtype: object

In [10]: for col in ds.columns:
    if ds[col].isnull().values.any():
        print("Missing data in ", col, ds[col].isnull().sum())

Missing data in Age 45
Missing data in Sex 18
Missing data in ChestPainType 18
Missing data in RestingBP 36
Missing data in Cholesterol 82
Missing data in FastingBS 45
Missing data in RestingECG 27
Missing data in MaxHR 91
Missing data in ExerciseAngina 9
Missing data in Oldpeak 73
Missing data in ST_Slope 91
Missing data in HeartDisease 64
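
The same per-column count can be obtained without an explicit loop. A minimal sketch on a made-up toy frame (the values below are illustrative, not taken from the dataset), which also reports the share of affected rows:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in two of its three columns
df = pd.DataFrame({
    "Age": [40.0, np.nan, 37.0, 48.0],
    "Sex": ["M", "F", None, "F"],
    "MaxHR": [172.0, 156.0, 138.0, 108.0],
})

# Count missing values per column, then keep only the columns that have any
missing = df.isnull().sum()
missing = missing[missing > 0]

# Share of rows affected, as a percentage
report = pd.DataFrame({"count": missing, "percent": 100 * missing / len(df)})
print(report)
```

Seeing the percentage next to the count makes it easier to judge whether simple median or mode imputation is reasonable for a given column.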

In [11]: def impute_na(df, variable, value):
    # Return the column with its missing entries replaced by `value`
    return df[variable].fillna(value)

In [12]: Age_median = ds['Age'].median()
RestingBP_median = ds['RestingBP'].median()
Cholesterol_median = ds['Cholesterol'].median()
FastingBS_median = ds['FastingBS'].median()
MaxHR_median = ds['MaxHR'].median()
Oldpeak_median = ds['Oldpeak'].median()
# mode() returns a Series (ties are possible), so take its first element
Sex_mode = ds['Sex'].mode()[0]
ChestPainType_mode = ds['ChestPainType'].mode()[0]
RestingECG_mode = ds['RestingECG'].mode()[0]
ExerciseAngina_mode = ds['ExerciseAngina'].mode()[0]
ST_Slope_mode = ds['ST_Slope'].mode()[0]
HeartDisease_median = ds['HeartDisease'].median()

In [13]: # Numeric values: replace missing entries with the median
ds['Age'] = impute_na(ds, 'Age', Age_median)
ds['RestingBP'] = impute_na(ds, 'RestingBP', RestingBP_median)
ds['Cholesterol'] = impute_na(ds, 'Cholesterol', Cholesterol_median)
ds['FastingBS'] = impute_na(ds, 'FastingBS', FastingBS_median)
ds['MaxHR'] = impute_na(ds, 'MaxHR', MaxHR_median)
ds['Oldpeak'] = impute_na(ds, 'Oldpeak', Oldpeak_median)
ds['HeartDisease'] = impute_na(ds, 'HeartDisease', HeartDisease_median)

# Categorical values: replace missing entries with the most frequent category
ds['Sex'] = impute_na(ds, 'Sex', Sex_mode)
ds['ChestPainType'] = impute_na(ds, 'ChestPainType', ChestPainType_mode)
ds['RestingECG'] = impute_na(ds, 'RestingECG', RestingECG_mode)
ds['ExerciseAngina'] = impute_na(ds, 'ExerciseAngina', ExerciseAngina_mode)
ds['ST_Slope'] = impute_na(ds, 'ST_Slope', ST_Slope_mode)

In [14]: for col in ds.columns:
    if ds[col].isnull().values.any():
        print("Missing data in ", col, ds[col].isnull().sum())
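
One detail worth keeping in mind with mode-based imputation: `Series.median()` returns a scalar, while `Series.mode()` returns a Series, and `fillna` aligns a Series argument on index labels. A small sketch on a made-up column showing the difference:

```python
import pandas as pd

# Made-up categorical column with gaps at positions 1 and 3
s = pd.Series(["M", None, "F", None, "M"])

mode_series = s.mode()  # a Series holding the most frequent value(s), indexed from 0

# Passing the Series: only rows whose label exists in mode_series (label 0) can be filled
filled_by_series = s.fillna(mode_series)

# Passing the scalar: every gap is filled with the most frequent category
filled_by_scalar = s.fillna(mode_series[0])

print(filled_by_series.isnull().sum())
print(filled_by_scalar.tolist())
```

Taking the first element of `mode()` is one simple convention; when several categories tie, it picks the first in sorted order.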

Categorical encoding
In [15]: ds.nunique()

Out[15]:
Unnamed: 0        918
Age                50
Sex                 2
ChestPainType       4
RestingBP          66
Cholesterol       217
FastingBS           2
RestingECG          3
MaxHR             118
ExerciseAngina      2
Oldpeak            51
ST_Slope            3
HeartDisease        2
dtype: int64

In [16]: ds['Sex'].unique()

Out[16]: array(['M', 'F'], dtype=object)

In [17]: ds['ChestPainType'].unique()

Out[17]: array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object)

In [18]: ds['RestingECG'].unique()

Out[18]: array(['Normal', 'ST', 'LVH'], dtype=object)

In [19]: ds['ExerciseAngina'].unique()

Out[19]: array(['N', 'Y'], dtype=object)

In [20]: ds['ST_Slope'].unique()

Out[20]: array(['Up', 'Flat', 'Down'], dtype=object)

In [21]: from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [22]: ds['Sex'] = le.fit_transform(ds['Sex'])
ds['ChestPainType'] = le.fit_transform(ds['ChestPainType'])
ds['RestingECG'] = le.fit_transform(ds['RestingECG'])
ds['ExerciseAngina'] = le.fit_transform(ds['ExerciseAngina'])
ds['ST_Slope'] = le.fit_transform(ds['ST_Slope'])
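
Because the single `le` object is refit on every column, it only remembers the classes of the last one (`ST_Slope`). `LabelEncoder` assigns integer codes in the sorted order of the labels, so the mappings can be reproduced and kept per column. A minimal sketch with plain pandas on a made-up sample; the `mappings` dictionary is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

# Made-up sample using category labels seen in the unique() outputs above
df = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
})

mappings = {}  # hypothetical helper: column name -> {label: code}
for col in ["Sex", "ST_Slope"]:
    classes = sorted(df[col].unique())  # sorted labels, the order LabelEncoder uses
    mappings[col] = {label: code for code, label in enumerate(classes)}
    df[col] = df[col].map(mappings[col])

print(mappings)
print(df)
```

Keeping the mappings makes it possible to decode the integers later. Note that integer codes impose an artificial order on nominal features such as `ChestPainType`, which matters for some models.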

In [23]: ds.head(10)

Out[23]:
   Unnamed: 0   Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0           0  40.0    1              1      140.0        289.0        0.0           1  172.0               0      0.0         2           0.0
1           1  49.0    0              2      130.0        180.0        0.0           1  156.0               0      1.0         1           1.0
2           2  37.0    1              1      130.0        283.0        0.0           2  138.0               0      0.0         2           0.0
3           3  48.0    0              0      138.0        214.0        0.0           1  108.0               1      1.5         1           1.0
4           4  54.0    1              2      150.0        195.0        0.0           1  122.0               0      0.0         2           0.0
5           5  39.0    1              2      120.0        339.0        0.0           1  138.0               0      0.0         2           0.0
6           6  45.0    0              1      130.0        237.0        0.0           1  170.0               0      0.0         2           0.0
7           7  54.0    1              1      110.0        208.0        0.0           1  142.0               0      0.0         2           0.0
8           8  37.0    1              0      140.0        207.0        0.0           1  130.0               1      1.5         1           1.0
9           9  48.0    0              1      120.0        284.0        0.0           1  120.0               1      0.0         2           1.0

In [24]: def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins=30)
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()

In [25]: diagnostic_plots(ds, 'Age')

In [26]: diagnostic_plots(ds, 'RestingBP')

In [27]: diagnostic_plots(ds, 'Cholesterol')

In [28]: diagnostic_plots(ds, 'MaxHR')

In [29]: diagnostic_plots(ds, 'FastingBS')

In [30]: diagnostic_plots(ds, 'Oldpeak')
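
The boxplot panel flags points that lie beyond the whiskers, which by default extend to the furthest observations within 1.5 × IQR of the quartiles. A minimal sketch of computing those bounds numerically, on made-up values rather than the dataset columns:

```python
import numpy as np

# Made-up values with one obvious outlier at the upper end
values = np.array([110.0, 120.0, 125.0, 130.0, 135.0, 140.0, 150.0, 200.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside [lower, upper] are what a default boxplot draws as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(lower, upper, outliers)
```

Applied to a column such as `ds['Cholesterol']`, the same calculation gives a count of the outliers that the plot only shows visually.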

Data Scaling
In [31]: from sklearn.preprocessing import MinMaxScaler,StandardScaler
mms = MinMaxScaler() # Normalization
ss = StandardScaler() # Standardization

ds['Oldpeak'] = mms.fit_transform(ds[['Oldpeak']])
ds['Age'] = ss.fit_transform(ds[['Age']])
ds['RestingBP'] = ss.fit_transform(ds[['RestingBP']])
ds['Cholesterol'] = ss.fit_transform(ds[['Cholesterol']])
ds['MaxHR'] = ss.fit_transform(ds[['MaxHR']])
ds.head()

Out[31]:
   Unnamed: 0        Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG      MaxHR  ExerciseAngina   Oldpeak  ST_Slope  HeartDisease
0           0  -1.473387    1              1   0.427330     0.846142        0.0           1   1.443735               0  0.295455         2           0.0
1           1  -0.496724    0              2  -0.127534    -0.202998        0.0           1   0.780688               0  0.409091         1           1.0
2           2  -1.798941    1              1  -0.127534     0.788391        0.0           2   0.034759               0  0.295455         2           0.0
3           3  -0.605242    0              0   0.316357     0.124257        0.0           1  -1.208455               1  0.465909         1           1.0
4           4   0.045866    1              2   0.982193    -0.058621        0.0           1  -0.628288               0  0.295455         2           0.0

A machine learning model does not understand the units of feature values. It treats each input as a plain number with no notion of what that number measures, so features on very different scales need to be rescaled.

There are two options for scaling the data: 1) normalization and 2) standardization. Since many algorithms assume that the data follow a normal (Gaussian) distribution, normalization is applied to features whose data do not show a normal distribution, and standardization is applied to features that are normally distributed but whose values are very large or very small compared with the other features.

Normalization: the Oldpeak feature was normalized because it showed a right-skewed distribution. Standardization: Age, RestingBP, Cholesterol and MaxHR were standardized because these features are approximately normally distributed.
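
The two transformations reduce to simple formulas: min-max normalization computes `(x - min) / (max - min)`, and standardization computes `(x - mean) / std`. A minimal NumPy sketch on made-up sample values, assuming the default settings of `MinMaxScaler` and `StandardScaler`:

```python
import numpy as np

# Made-up sample values (not taken from the dataset)
x = np.array([0.0, 1.0, 1.5, 0.0, 2.5])

# Normalization (min-max): rescales the values to the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean and unit variance.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
x_std = (x - x.mean()) / x.std()

print(x_norm)
print(x_std.mean(), x_std.std())
```

Normalization keeps the shape of the distribution and only changes its range, which is why it suits a skewed feature; standardization centers the feature and expresses it in units of its own spread.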
In [ ]:
