
K-Nearest Neighbors (KNN) Imputation

Overview:

KNN Imputation is a technique used to handle missing values in a dataset. It replaces each missing value with the average of the corresponding feature values of its 'k' nearest neighbors, optionally weighted by distance.

How it works:

1. Identify Missing Values:

- The algorithm identifies which values are missing in the dataset.

2. Find Neighbors:

- For each missing value, the algorithm finds 'k' nearest neighbors based on other feature values.

3. Compute Imputed Value:

- The missing value is replaced with a weighted average of the corresponding values from its 'k' nearest neighbors (a minimal sketch follows below).
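
As an illustration (not from the notebook being described), the short sketch below applies sklearn's KNNImputer to a small invented array so the mechanics are visible end to end:

import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the second row is missing its first feature
toy = np.array([[1.0, 2.0],
                [np.nan, 3.0],
                [7.0, 6.0]])

imputer = KNNImputer(n_neighbors=2, weights='distance')
toy_imputed = imputer.fit_transform(toy)
# The NaN is filled with a distance-weighted average of the neighbors' first-column values
print(toy_imputed)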

KNN Imputation in the provided code:

1. Handling Missing Values:

df['resting bp s'].replace(0, np.nan, inplace=True)

df['cholesterol'].replace(0, np.nan, inplace=True)

- Zero values in the 'resting bp s' and 'cholesterol' columns are replaced with NaN to indicate missing values.

2. Checking Missing Values:

df.isnull().sum()

- The number of missing values in each column is checked.

3. Importing KNNImputer:

from sklearn.impute import KNNImputer

knn = KNNImputer(n_neighbors=3, weights='distance')

df = knn.fit_transform(df)

- The KNNImputer is imported and initialized with n_neighbors=3, meaning it will consider the 3 nearest neighbors for imputation.

- weights='distance' means that closer neighbors will have a greater influence on the imputed value.

- The fit_transform method is used to impute missing values in the dataframe (df).

4. Converting Back to DataFrame:

df = pd.DataFrame(df, columns=['age', 'sex', 'chest pain type', 'resting bp s', 'cholesterol', 'fasting blood sugar', 'resting ecg', 'max heart rate', 'exercise angina', 'oldpeak', 'ST slope', 'target'])

df.head()

- The imputed data is converted back into a pandas DataFrame with the appropriate column names.

5. Checking for Missing Values Again:

df.isnull().sum()

- Ensuring that all missing values have been imputed correctly.


K-Nearest Neighbors (KNN) Classifier

Overview:

K-Nearest Neighbors (KNN) is a simple, supervised machine learning algorithm used for classification and regression tasks. It classifies a data point based on how its neighbors are classified.

How it works:

1. Training Phase:

- KNN stores all the available data points.

- There is no actual training process in KNN; it is a lazy learner, meaning it doesn't learn an explicit model but memorizes the training instances.

2. Prediction Phase:

- When a new data point needs to be classified, KNN looks at the 'k' nearest data points (neighbors) in the training dataset.

- 'k' is a user-defined constant. For example, if k=3, the three closest data points to the new point are considered.

- The class with the majority vote among these 'k' neighbors is assigned to the new data point (a minimal sketch follows below).
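
A minimal sketch of the majority-vote idea with k=3 (the toy points and labels below are invented for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters with binary labels
X_toy = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

# The point (5, 4) has its three nearest neighbors in the class-1 cluster,
# so the majority vote assigns it class 1
print(knn.predict([[5, 4]]))  # -> [1]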

KNN Classifier in the provided code:

1. Importing KNeighborsClassifier:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

2. Training the Model:

model.fit(X_train, Y_train)

- The model is fitted using the training data (X_train, Y_train).

- Although KNN doesn't explicitly train a model, the fit method stores the training data within the classifier.

3. Cross-Validation:

from sklearn.model_selection import cross_val_score

accuracy_score = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=10)

precision = cross_val_score(model, X_train, Y_train, scoring='precision', cv=10)

recall = cross_val_score(model, X_train, Y_train, scoring='recall', cv=10)

f1_score = cross_val_score(model, X_train, Y_train, scoring='f1', cv=10)

- Cross-validation is used to evaluate the model's performance. Here, 10-fold cross-validation (cv=10) is performed.

- Accuracy, precision, recall, and F1-score are calculated and printed to assess the classifier's performance.

4. Making Predictions:

y_pred = model.predict(X_test)

- Predictions are made on the test data (X_test).

5. Evaluating Predictions:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(Y_test, y_pred))

- A confusion matrix is generated to evaluate the predictions. It provides insight into the true positives, true negatives, false positives, and false negatives (a small sketch of reading these counts follows below).
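
As a small follow-up sketch (not part of the original notebook; it reuses Y_test and y_pred from the code above), the four counts can be unpacked from a binary confusion matrix, which scikit-learn orders as [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, y_pred)
tn, fp, fn, tp = cm.ravel()  # row order: true class 0, then true class 1
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("Accuracy:", (tp + tn) / (tp + tn + fp + fn))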
PCA Analysis, Selection, and Model Training

Principal Component Analysis (PCA)

PCA Overview:

- PCA is a technique used to reduce the dimensionality of a dataset while retaining most of the variance in the data. It transforms the data into a new set of variables called principal components.

- Each principal component is a linear combination of the original variables and is orthogonal to the others, ensuring that there is no redundancy (a small sketch illustrating this follows below).
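
A small sketch (on random demo data, not the heart dataset) showing the orthogonality claim: the rows of components_ are orthonormal, so multiplying the component matrix by its transpose gives approximately the identity matrix.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # random demo data

pca_demo = PCA(n_components=3).fit(X_demo)
# Each row of components_ is a principal axis; orthonormal rows give ~identity
print(np.round(pca_demo.components_ @ pca_demo.components_.T, 6))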

Steps in the Code

1. Importing Libraries:

from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

2. Loop to Reduce Dimensionality:

accuracies = []

for n_components in range(1, 12):

# Reduce dimensionality

pca = PCA(n_components = n_components)



X_reduced = pca.fit_transform(X)

- A loop is created to iterate over different numbers of principal components (from 1 to 11).

- For each iteration, the PCA model is initialized with a specific number of components.

- The `fit_transform` method is applied to the dataset `X` to reduce its dimensionality.

Data Splitting and Model Training

Data Splitting:

# Split data into training and test sets

X_train, X_test, Y_train, Y_test = train_test_split(X_reduced, Y, test_size=0.2, random_state=32)

- The reduced dataset is split into training and testing sets: 80% of the data is used for training and 20% for testing.

- The `random_state` parameter ensures that the data is split in the same way each time the code is run.

Model Training:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

model.fit(X_train, Y_train)

- A k-nearest neighbors (KNN) classifier is initialized and trained using the training data.

Model Evaluation:

# Make predictions on the test set

y_pred = model.predict(X_test)

# Calculate accuracy

from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=10)

accuracies.append(np.mean(accuracy))

print(f"Accuracy with {n_components} components: {accuracy}")

- Predictions are made on the test set.

- The accuracy of the model is estimated using 10-fold cross-validation on the training set.

- The mean accuracy for each number of components is appended to `accuracies`, and the per-fold scores are printed (a plotting sketch follows below).
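
Once the loop finishes, the collected accuracies can be visualized. This plotting snippet is a sketch and not part of the code being described; it assumes numpy and matplotlib are already imported as np and plt, and that the accuracies list was filled by the loop above:

plt.figure(figsize=(8, 5))
plt.plot(range(1, 12), accuracies, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Mean 10-fold CV Accuracy')
plt.title('KNN Accuracy vs. Number of Components')
plt.grid()
plt.show()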

Explained Variance and Selection of Principal Components

Explained Variance:

explained_variance_ratio = pca.explained_variance_ratio_

total_explained_variance_ratio = explained_variance_ratio.sum()

print("Explained Variance Ratio", explained_variance_ratio)

print(f"Total Explained Variance Ratio: {total_explained_variance_ratio:.4f}")



- The explained variance ratio for each principal component is read from the fitted PCA object (the explained_variance_ratio_ attribute belongs to the PCA transformer, not the KNN classifier).

- The total explained variance ratio is the sum of the explained variance ratios of all principal components, representing the amount of variance retained by the PCA transformation.

Cumulative Explained Variance:

cumulative_explained_variance = np.cumsum(explained_variance_ratio)

- The cumulative explained variance is calculated to understand how many components are needed to retain a certain amount of variance.

Plotting Explained Variance:

plt.figure(figsize=(10, 6))

plt.plot(range(1, len(explained_variance_ratio) + 1), cumulative_explained_variance, marker='o', linestyle='--')

plt.title('Explained Variance by Different Principal Components')

plt.xlabel('Number of Principal Components')

plt.ylabel('Cumulative Explained Variance')

plt.grid()

plt.show()

- The cumulative explained variance is plotted to visualize the relationship between the number of principal components and the amount of variance retained.



Selecting Number of Components:

threshold = 0.90

num_components_to_keep = np.where(cumulative_explained_variance >= threshold)[0][0] + 1

print(f'Number of components to keep: {num_components_to_keep}')

- A threshold (e.g., 90%) is set to decide how much variance should be retained.

- The number of components needed to retain at least the threshold amount of variance is determined and printed (a re-fitting sketch follows below).
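
As a follow-up sketch (the names pca_final and X_reduced_final are hypothetical, not from the notebook), the selected number of components can then be used to build the reduced dataset that retains at least the threshold amount of variance:

pca_final = PCA(n_components=num_components_to_keep)
X_reduced_final = pca_final.fit_transform(X)
print(X_reduced_final.shape)
print(f"Variance retained: {pca_final.explained_variance_ratio_.sum():.4f}")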

Final Model Training and Evaluation

Dropping Unnecessary Components:

X_pca = X_pca.drop(columns = 'PC3', axis=1)

# ... (similar lines for other PCs)

X_pca.head()

- Principal components that are not needed based on the explained variance threshold are dropped from the dataset.

Splitting Data and Training Final Model:

from sklearn.model_selection import train_test_split



X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)

print(X.shape, X_train.shape, X_test.shape)

- The original data is split into training and testing sets.

- A KNN model is initialized and trained using the training data.

Evaluating the Final Model:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()

model.fit(X_train, Y_train)

Cross-validation:

from sklearn.model_selection import cross_val_score

accuracy_score = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=10)

print("accuracy_score=", np.mean(accuracy_score))

precision = cross_val_score(model, X_train, Y_train, scoring='precision', cv=10)

print("precision=", np.mean(precision))

recall = cross_val_score(model, X_train, Y_train, scoring='recall', cv=10)

print("recall=", np.mean(recall))

- Cross-validation is performed to calculate the accuracy, precision, and recall of the final model on the training data.



Testing the Final Model:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(Y_test, y_pred)

precision = precision_score(Y_test, y_pred)

recall = recall_score(Y_test, y_pred)

cm = confusion_matrix(Y_test, y_pred)

print("Confusion Matrix:

", cm)

print(f"Accuracy: {accuracy:.4f}")

print(f"Precision: {precision:.4f}")

print(f"Recall: {recall:.4f}")

- Predictions are made on the test set.

- Accuracy, precision, and recall scores are calculated and printed.

- The confusion matrix is printed to visualize the performance of the model.


Data Splitting Based on Target Values
Split the dataset based on the target values:

df0 = df[df['target'] == 0]
df1 = df[df['target'] == 1]
df0.head()

• Splits the DataFrame df into two DataFrames, df0 and df1, based on
the target values (0 and 1). This allows separate analysis for each target
group.

Separating Dependent and Independent Variables

df_dependent = df['target']
df_independent = df.drop(columns='target', axis=1)

• Separates the target column (dependent variable) from the rest of the
dataset (independent variables).

Applying PCA

model = PCA(n_components=11)
# Fit transform
model.fit(df_independent)

• Initializes a PCA model to reduce the dimensionality of the data to 11 components and fits it to the independent variables.

Summary of Steps
1. Splitting Data Based on Target Values:
o Data is split into two groups based on the target variable.
2. Removing the Target Column:
o The target column is removed from the data to focus on the
features.
3. Converting to Numpy Arrays:
o The data is converted to numpy arrays for mathematical operations.
4. Calculating Frobenius Norm:
o The Frobenius norm is calculated to measure the difference between the original and target matrices.
5. Computing Neutrality Target:
o The difference between the Frobenius norms is squared to get the neutrality target.

This approach helps in understanding the differences in the data before and after applying transformations, such as SVD, by measuring the Frobenius norm, which provides an overall error measure (a small numpy sketch of the Frobenius norm follows below).
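
For reference, the Frobenius norm of a matrix is the square root of the sum of the squares of all its entries. The small sketch below (toy matrices invented for illustration) checks np.linalg.norm against that definition:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.5, 2.0], [2.0, 4.5]])

diff = A - B
manual = np.sqrt((diff ** 2).sum())  # definition: sqrt of the sum of squared entries
builtin = np.linalg.norm(diff)       # np.linalg.norm defaults to the Frobenius norm for 2-D arrays
print(manual, builtin)               # both print the same value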

Centering Data and SVD

Loading and Preparing Data:

df0 = df[df['target'] == 0]
df1 = df[df['target'] == 1]
df0.head()

• df0 and df1 are created by splitting the dataset df based on the target
values (0 and 1).
• The head() method is used to display the first few rows of df0.

Removing the Target Column:

df0 = df0.drop(columns='target', axis=1)


df1 = df1.drop(columns='target', axis=1)
X_target0 = X_target0.drop(columns='target', axis=1)
X_target1 = X_target1.drop(columns='target', axis=1)

• The target column is dropped from df0, df1, X_target0, and X_target1 to prepare the data for further analysis.

Converting DataFrames to Numpy Arrays:

df0 = df0.to_numpy()
X_target0 = X_target0.to_numpy()
df1 = df1.to_numpy()
X_target1 = X_target1.to_numpy()

• The dataframes are converted to numpy arrays for numerical operations.


Calculating Frobenius Norm:

frobenius_norm0_for_ed = np.linalg.norm(df0 - X_target0)


frobenius_norm0_for_ed
frobenius_norm1_for_ed = np.linalg.norm(df1 - X_target1)
frobenius_norm1_for_ed

• The Frobenius norm is calculated for df0 - X_target0 and for df1 - X_target1. The Frobenius norm is a measure of the difference between two matrices.
• The results are stored in frobenius_norm0_for_ed and frobenius_norm1_for_ed.

Calculating Neutrality Target:

neutrality_target_for_ed = (frobenius_norm1_for_ed -
frobenius_norm0_for_ed) ** 2
neutrality_target_for_ed

• The neutrality target is calculated as the square of the difference between frobenius_norm1_for_ed and frobenius_norm0_for_ed.

Reconstructing the original data from the principal components:

X_reconstructed = model.inverse_transform(X_d)
X_reconstructed

o X_reconstructed reconstructs the original data from the principal components using the inverse transform method of the PCA model.
o X_reconstructed displays the reconstructed data.

Updating the dataframe with reconstructed data and target values:

df_independent = pd.DataFrame(X_reconstructed, columns=['sex', 'chest pain type', 'resting bp s', 'cholesterol', 'fasting blood sugar', 'resting ecg', 'max heart rate', 'exercise angina', 'oldpeak', 'ST slope', 'age_group'])
df_independent['target'] = df_dependent
df_independent.head()

o df_independent is rebuilt from the reconstructed data with the appropriate column names.
o The target column is re-added to df_independent.
o df_independent.head() displays the first few rows of the updated dataframe.

These explanations cover the main steps in the code, focusing on data preprocessing, applying PCA, and reconstructing the data.
1. Explained Variance Ratio

Explained Variance Ratio (EVR) indicates the proportion of the dataset's variance that is captured by each principal component in PCA. It is a measure of how much information (variance) can be attributed to each of the principal components. The calculation below works at the feature level (using raw feature variances) as a simple proxy for this idea:

Variance Calculation: The variance of each feature is calculated to understand its contribution to the overall data spread.

Total Variance: The sum of the variances of all features.

EVR Calculation: Each feature's variance is divided by the total variance to get the proportion of variance explained by that feature.

Sorting: Sorting the ratios in descending order helps identify which features contribute most to the variance.

Example Calculation:

EV = df.var()

s=sum(EV)

EVR = EV / s

EVR.sort_values(ascending=False)

This gives a sorted list of features based on their contribution to the total variance.
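
For comparison (a sketch, not from the notebook, and assuming df is fully numeric at this point), scikit-learn's PCA exposes the component-level explained variance ratio directly, which is the quantity usually meant by EVR:

from sklearn.decomposition import PCA

pca_evr = PCA().fit(df)
print(pca_evr.explained_variance_ratio_)         # variance captured by each principal component
print(pca_evr.explained_variance_ratio_.sum())   # sums to ~1 when all components are kept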

2. Binning Cholesterol Data

Binning is a process of converting continuous data into categorical data by dividing it into
intervals (bins).
Define Bin Edges: Set boundaries for the bins.

Labels for Bins: Assign labels to these bins for easy identification.

pd.cut(): Use pd.cut() to segment and sort data values into bins.

Value Counts: Count the number of occurrences in each bin to understand the distribution.

Example Calculation:

bin_edges = [100, 200, 240, 371]

bin_labels = ['100-200', '200-240', '240-400']

df['Cholesterol_bin'] = pd.cut(df['cholesterol'], bins=bin_edges, labels=bin_labels, right=False)

df['Cholesterol_bin'].value_counts()

This categorizes cholesterol levels and counts how many values fall into each category.

3. Label Encoding

Label Encoding converts categorical data into numerical format, which is necessary for many machine learning algorithms.

Import LabelEncoder: Import from sklearn.preprocessing.

Fit and Transform: Encode the binned cholesterol data into numerical labels.

Add Encoded Labels: Create a new column with these labels in the DataFrame.

Example Calculation:
from sklearn.preprocessing import LabelEncoder

label_encode = LabelEncoder()

labels = label_encode.fit_transform(df.Cholesterol_bin)

df['Cholesterol_group'] = labels
new_df = df.drop(columns='Cholesterol_bin', axis=0)

new_df.info()

This converts the binned cholesterol data into numerical values and adds it to the DataFrame.

4. SVD and Reconstruction

Singular Value Decomposition (SVD) decomposes a matrix into three other matrices and is used for dimensionality reduction and data compression.

Centering Data: Subtract the mean from each feature to center the data.

SVD Decomposition: Decompose the centered data into U, S, Vt matrices.

Projection: Project the centered data onto the singular vectors.

Reconstruction: Use the inverse transform to approximate the original data from the reduced dimensions.

Example Calculation:

X_centered = new_df - new_df.mean()

U, S, Vt = np.linalg.svd(X_centered)

X_d = X_centered.dot(Vt.T[:, :])

X_d = X_d.to_numpy()

X_reconstructed = model1.inverse_transform(X_d)

X_reconstructed

This process reduces the dimensionality and then reconstructs the data to approximate the original data.
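
A plain-numpy sketch of the same idea (the truncation level k and the variable names below are illustrative; the notebook instead routes the projection through a fitted PCA model's inverse_transform): keep only the top k right singular vectors, map the projection back, and undo the centering.

import numpy as np

k = 5  # illustrative number of singular vectors to keep
X_centered_np = X_centered.to_numpy()

U, S, Vt = np.linalg.svd(X_centered_np, full_matrices=False)
X_proj = X_centered_np @ Vt[:k].T                       # project onto the top-k directions
X_approx = X_proj @ Vt[:k] + new_df.mean().to_numpy()   # map back and add the mean again
print(np.linalg.norm(new_df.to_numpy() - X_approx))     # reconstruction error (Frobenius norm)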
5. Creating DataFrame from Reconstructed Data

After reconstruction, the data needs to be converted back into a DataFrame for further analysis.

DataFrame Creation: Convert the numpy array of reconstructed data into a DataFrame.

Add Categorical Labels: Re-add the cholesterol group labels to the DataFrame.

Example Calculation:

X = pd.DataFrame(X_reconstructed, columns=['sex', 'chest pain type', 'resting bp s', 'cholesterol', 'fasting blood sugar', 'resting ecg', 'max heart rate', 'exercise angina', 'oldpeak', 'ST slope', 'age_group'])

X['Cholesterol_group'] = new_Y

X.head()

This organizes the reconstructed data back into a structured format.

6. Splitting Reconstructed Data

Splitting data based on the categorical labels helps in comparing different groups.

Filter Data: Create separate DataFrames for each cholesterol group.

Example Calculation:

X_cholesterol0 = X[X['Cholesterol_group'] == 0]

X_cholesterol1 = X[X['Cholesterol_group'] == 1]

X_cholesterol2 = X[X['Cholesterol_group'] == 2]

X_cholesterol0.head()
This segregates the data based on cholesterol groups for further analysis.

7. Frobenius Norm Calculation

The Frobenius Norm measures the difference between two matrices and is often used to quantify reconstruction error.

Convert to Numpy: Convert DataFrames to numpy arrays for computation.

Compute Norm: Calculate the Frobenius norm between the original and reconstructed data for each group.

Example Calculation:

df_Cholesterol0 = df_Cholesterol0.to_numpy()

X_cholesterol0 = X_cholesterol0.to_numpy()

frobenius_norm0 = np.linalg.norm(df_Cholesterol0 - X_cholesterol0)

frobenius_norm0

This quantifies how well the reconstructed data matches the original data.

8. Neutrality Calculation

Neutrality measures the consistency of reconstruction errors across different groups.

Squared Differences: Compute the squared differences of the Frobenius norms between groups.

Average Squared Differences: Average these squared differences to get the neutrality measure.

Example Calculation:

neutrality_cholesterol = ((frobenius_norm1 - frobenius_norm0) ** 2 + (frobenius_norm2 - frobenius_norm1) ** 2 + (frobenius_norm2 - frobenius_norm0) ** 2) / 3

neutrality_cholesterol

This provides a measure of how uniformly the reconstruction error is distributed across different groups.

9. Plotting Frobenius Norms

Visualizing the Frobenius norms helps in comparing the reconstruction errors across groups.

Set Labels and Values: Define labels and norms for plotting.

Plot Setup: Configure the plot dimensions and bar width.

Plot: Use matplotlib to create the bar chart.

Example Calculation:

labels = ['cholesterol range >240', 'cholesterol range 200-240', 'cholesterol range <200']

svd_norms = [frobenius_norm0, frobenius_norm1, frobenius_norm2]

x = np.arange(len(labels))  # label locations

width = 0.35  # width of the bars

# Plotting

fig, ax = plt.subplots()

rects1 = ax.bar(x - width/2, svd_norms, width, label='SVD Norms')

# Add some text for labels, title and custom x-axis tick labels, etc.

ax.set_ylabel('Frobenius Norms')

ax.set_title('Frobenius Norms by Cholesterol Groups')


ax.set_xticks(x)

ax.set_xticklabels(labels)

ax.legend()

fig.tight_layout()

plt.show()

This would create a bar chart comparing the Frobenius norms for different cholesterol groups.

Each of these steps plays a crucial role in the overall analysis and understanding of the data, ensuring that the results are meaningful and interpretable.


Step-by-Step Explanation

# Drop the target column from the dataframe

new_df = df.drop(columns='target', axis=1)

This line removes the 'target' column from the df DataFrame, creating a new
DataFrame called new_df.

# Split the dataset based on the chest pain type values

df_chest_pain_type1 = new_df[new_df['chest pain type'] == 1]


df_chest_pain_type2 = new_df[new_df['chest pain type'] == 2]
df_chest_pain_type3 = new_df[new_df['chest pain type'] == 3]
df_chest_pain_type4 = new_df[new_df['chest pain type'] == 4]

These lines filter new_df to create four new DataFrames, each containing rows
where the 'chest pain type' column equals 1, 2, 3, and 4, respectively.

new_Y = new_df['chest pain type']

This line extracts the 'chest pain type' column from new_df and stores it in
new_Y.

model2 = PCA(n_components=11)
# Fit transform
model2.fit(new_df)
X_centered = new_df - new_df.mean()
U, S, Vt = np.linalg.svd(X_centered)
X_d = X_centered.dot(Vt.T[:, :])
X_d = X_d.to_numpy()
X_reconstructed = model.inverse_transform(X_d)
X_reconstructed

1. model2 = PCA(n_components=11): Initializes a PCA model to reduce the dataset to 11 principal components.
2. model2.fit(new_df): Fits the PCA model to the new_df
DataFrame.
3. X_centered = new_df - new_df.mean(): Centers the data by
subtracting the mean of each column.
4. U, S, Vt = np.linalg.svd(X_centered): Performs Singular
Value Decomposition (SVD) on the centered data.
5. X_d = X_centered.dot(Vt.T[:, :]): Projects the centered
data onto the principal components.
6. X_d = X_d.to_numpy(): Converts the projected data into a numpy
array.
7. X_reconstructed = model.inverse_transform(X_d):
Reconstructs the data from the reduced dimensions using the inverse
transform.
8. X_reconstructed: Displays the reconstructed data.

X = pd.DataFrame(X_reconstructed, columns=['sex', 'chest pain type', 'resting bp s', 'cholesterol', 'fasting blood sugar', 'resting ecg', 'max heart rate', 'exercise angina', 'oldpeak', 'ST slope', 'age_group'])

X['chest_pain_type_group'] = new_Y

X.head()

1. X = pd.DataFrame(X_reconstructed, columns=['sex', ...]): Converts the reconstructed numpy array back into a DataFrame with the specified column names.
2. X['chest_pain_type_group'] = new_Y: Adds the 'chest pain type' column
back to the DataFrame.
3. X.head(): Displays the first few rows of the DataFrame.

# Split the dataset based on the chest pain type values


X_chest_pain_type1 = X[X['chest_pain_type_group'] == 1]
X_chest_pain_type2 = X[X['chest_pain_type_group'] == 2]
X_chest_pain_type3 = X[X['chest_pain_type_group'] == 3]
X_chest_pain_type4 = X[X['chest_pain_type_group'] == 4]

These lines filter X to create four new DataFrames, each containing rows where
the 'chest_pain_type_group' column equals 1, 2, 3, and 4, respectively.

X_chest_pain_type1 = X_chest_pain_type1.drop(columns='chest_pain_type_group', axis=1)


X_chest_pain_type2 = X_chest_pain_type2.drop(columns='chest_pain_type_group', axis=1)
X_chest_pain_type3 = X_chest_pain_type3.drop(columns='chest_pain_type_group', axis=1)
X_chest_pain_type4 = X_chest_pain_type4.drop(columns='chest_pain_type_group', axis=1)
These lines remove the 'chest_pain_type_group' column from each of the four
DataFrames.

df_chest_pain_type1 = df_chest_pain_type1.to_numpy()
X_chest_pain_type1 = X_chest_pain_type1.to_numpy()
frobenius_norm1 = np.linalg.norm(df_chest_pain_type1 - X_chest_pain_type1)
frobenius_norm1

1. df_chest_pain_type1 = df_chest_pain_type1.to_numpy(): Converts df_chest_pain_type1 to a numpy array.
2. X_chest_pain_type1 = X_chest_pain_type1.to_numpy(): Converts X_chest_pain_type1 to a numpy array.
3. frobenius_norm1 = np.linalg.norm(df_chest_pain_type1 - X_chest_pain_type1): Calculates the Frobenius norm between the original and reconstructed data for chest pain type 1.
4. frobenius_norm1: Displays the calculated Frobenius norm.

df_chest_pain_type2 = df_chest_pain_type2.to_numpy()
X_chest_pain_type2 = X_chest_pain_type2.to_numpy()
frobenius_norm2 = np.linalg.norm(df_chest_pain_type2 - X_chest_pain_type2)
frobenius_norm2

Same as above but for chest pain type 2.

df_chest_pain_type3 = df_chest_pain_type3.to_numpy()
X_chest_pain_type3 = X_chest_pain_type3.to_numpy()
frobenius_norm3 = np.linalg.norm(df_chest_pain_type3 - X_chest_pain_type3)
frobenius_norm3

Same as above but for chest pain type 3.

df_chest_pain_type4 = df_chest_pain_type4.to_numpy()
X_chest_pain_type4 = X_chest_pain_type4.to_numpy()
frobenius_norm4 = np.linalg.norm(df_chest_pain_type4 - X_chest_pain_type4)
frobenius_norm4

Same as above but for chest pain type 4.
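
The four computations above can equivalently be written as one loop. This compact form is a sketch (assuming the per-type DataFrames created earlier are still in scope), not how the notebook writes it:

frobenius_norms = []
for df_part, X_part in [(df_chest_pain_type1, X_chest_pain_type1),
                        (df_chest_pain_type2, X_chest_pain_type2),
                        (df_chest_pain_type3, X_chest_pain_type3),
                        (df_chest_pain_type4, X_chest_pain_type4)]:
    # Frobenius norm between original and reconstructed rows for this chest pain type
    frobenius_norms.append(np.linalg.norm(np.asarray(df_part) - np.asarray(X_part)))
print(frobenius_norms)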

labels = ['chest pain scenario 1', 'chest pain scenario 2', 'chest pain scenario 3',
'chest pain scenario 4']
svd_norms = [frobenius_norm1, frobenius_norm2, frobenius_norm3,
frobenius_norm4]
x = np.arange(len(labels)) # label locations
width = 0.35 # width of the bars
1. labels = [...]: Defines labels for the bar plot.
2. svd_norms = [...]: Creates a list of Frobenius norms for each chest pain
type.
3. x = np.arange(len(labels)): Creates an array of label locations.
4. width = 0.35: Sets the width of the bars.

# Plotting
fig, ax = plt.subplots()
rects = ax.bar(x, svd_norms, width, label='SVD')

1. fig, ax = plt.subplots(): Creates a figure and a set of subplots.


2. rects = ax.bar(x, svd_norms, width, label='SVD'): Creates a bar plot of
the Frobenius norms.

# Add some text for labels, title, and custom x-axis tick labels, etc.
ax.set_xlabel('Different chest pain scenario')
ax.set_ylabel('Frobenius Norm')
ax.set_title('Comparison of Frobenius Norms for SVD on chest pain type attribute')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

1. ax.set_xlabel(...): Sets the x-axis label.


2. ax.set_ylabel(...): Sets the y-axis label.
3. ax.set_title(...): Sets the plot title.
4. ax.set_xticks(x): Sets the x-axis ticks.
5. ax.set_xticklabels(labels): Sets the x-axis tick labels.
6. ax.legend(): Adds a legend to the plot.

# Add a function to label bars with their heights


def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate(f'{height:.2f}',
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

Defines a function to annotate the bars with their heights.

autolabel(rects)
fig.tight_layout()
plt.show()

1. autolabel(rects): Calls the annotation function to label the bars.


2. fig.tight_layout(): Adjusts the layout for better fit.
3. plt.show(): Displays the plot.
