KNN Imputation
Overview:
KNN Imputation is a technique used to handle missing values in a dataset. It replaces missing
values with the mean value of the 'k' nearest neighbors' corresponding feature values.
How it works:
1. Find Neighbors:
- For each missing value, the algorithm finds the 'k' nearest neighbors based on the other feature values.
2. Impute:
- The missing value is replaced with a weighted average of the corresponding values from its 'k' nearest neighbors.
1. Replacing Zero Values:
- Zero values in the 'resting bp s' and 'cholesterol' columns are replaced with NaN to indicate missing values.
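The replacement itself is not shown in the extract; a minimal sketch, assuming NumPy's NaN is used as the missing-value marker:
import numpy as np
# Treat physiologically impossible zeros as missing values
df['resting bp s'] = df['resting bp s'].replace(0, np.nan)
df['cholesterol'] = df['cholesterol'].replace(0, np.nan)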
2. Checking Missing Values:
df.isnull().sum()
3. Importing KNNImputer:
from sklearn.impute import KNNImputer
knn = KNNImputer(n_neighbors=3, weights='distance')
df = knn.fit_transform(df)
- The KNNImputer is imported and initialized with n_neighbors=3, meaning it will consider the 3 nearest neighbors when imputing each missing value.
- weights='distance' means that closer neighbors have a greater influence on the imputed value.
- The fit_transform method is used to impute missing values in the dataframe (df).
df = pd.DataFrame(df, columns=['age', 'sex', 'chest pain type', 'resting bp s', 'cholesterol',
                               'fasting blood sugar', 'resting ecg', 'max heart rate', 'exercise angina',
                               'oldpeak', 'ST slope', 'target'])
df.head()
- The imputed data is converted back into a pandas DataFrame with the appropriate column
names.
df.isnull().sum()
K-Nearest Neighbors (KNN) Classification
Overview:
K-Nearest Neighbors (KNN) is a simple, supervised machine learning algorithm used for classification
and regression tasks. It classifies a data point based on how its neighbors are classified.
How it works:
1. Training Phase:
- There is no actual training process in KNN; it is a lazy learner, meaning it doesn't learn an explicit model during training but simply stores the training data.
2. Prediction Phase:
- When a new data point needs to be classified, KNN looks at the 'k' nearest data points in the training set.
- 'k' is a user-defined constant. For example, if k=3, the three closest data points to the new point
are considered.
- The class with the majority vote among these 'k' neighbors is assigned to the new data point.
1. Importing KNeighborsClassifier:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
2. Training the Model:
model.fit(X_train, Y_train)
- Although KNN doesn't explicitly train a model, the fit method stores the training data within the
classifier.
3. Cross-Validation:
- 10-fold cross-validation (cv=10) is performed on the training data.
- Accuracy, precision, recall, and F1-score are calculated and printed to assess the classifier's
performance.
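The cross-validation calls themselves are not included in the extract; a minimal sketch using sklearn's cross_val_score (the scoring names are assumptions):
from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(model, X_train, Y_train, cv=10, scoring='accuracy')
precision = cross_val_score(model, X_train, Y_train, cv=10, scoring='precision')
recall = cross_val_score(model, X_train, Y_train, cv=10, scoring='recall')
f1 = cross_val_score(model, X_train, Y_train, cv=10, scoring='f1')
print("accuracy=", accuracy.mean(), "precision=", precision.mean(), "recall=", recall.mean(), "f1=", f1.mean())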
4. Making Predictions:
y_pred = model.predict(X_test)
5. Evaluating Predictions:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(Y_test, y_pred))
- A confusion matrix is generated to evaluate the predictions. It provides insight into the true
positives, true negatives, false positives, and false negatives.
PCA Analysis, Selection, and Model Training
PCA Overview:
- PCA is a technique used to reduce the dimensionality of a dataset while retaining most of the
variance in the data. It transforms the data into a new set of variables called principal components.
- Each principal component is a linear combination of the original variables and is orthogonal to the other components.
1. Importing Libraries:
from sklearn.decomposition import PCA
accuracies = []
# For each number of components from 1 to 11 (the steps below run inside this loop)
for n_components in range(1, 12):
    pca = PCA(n_components=n_components)
    # Reduce dimensionality
    X_reduced = pca.fit_transform(X)
- A loop is created to iterate over different numbers of principal components (from 1 to 11).
- For each iteration, the PCA model is initialized with a specific number of components.
- The `fit_transform` method is applied to the dataset `X` to reduce its dimensionality.
Data Splitting:
- The reduced dataset is split into training and testing sets. 80% of the data is used for training, and 20% for testing.
- The `random_state` parameter ensures that the data is split in the same way each time the code is
run.
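The split itself is not shown; a minimal sketch using sklearn's train_test_split (the target vector name Y and the random_state value are assumptions):
from sklearn.model_selection import train_test_split
# 80/20 split with a fixed random_state for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(X_reduced, Y, test_size=0.2, random_state=0)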
Model Training:
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
- A k-nearest neighbors (KNN) classifier is initialized and trained using the training data.
Model Evaluation:
y_pred = model.predict(X_test)
# Calculate accuracy via 10-fold cross-validation on the training set
accuracy = cross_val_score(model, X_train, Y_train, cv=10, scoring='accuracy')
accuracies.append(np.mean(accuracy))
- The accuracy of the model is calculated using 10-fold cross-validation on the training set.
- The mean accuracy for each number of components is stored and printed.
Explained Variance:
explained_variance_ratio = pca.explained_variance_ratio_  # PCA from the last iteration (all 11 components)
total_explained_variance_ratio = explained_variance_ratio.sum()
- The total explained variance ratio is the sum of the explained variance ratios of all principal components.
cumulative_explained_variance = np.cumsum(explained_variance_ratio)
- The cumulative explained variance is calculated to understand how many components are needed to retain a given share of the variance.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o',
         linestyle='--')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.grid()
plt.show()
- The cumulative explained variance is plotted to visualize the relationship between the number of components and the variance retained.
threshold = 0.90
- A threshold (e.g., 90%) is set to decide how much variance should be retained.
- The number of components needed to retain at least the threshold amount of variance is determined.
X_pca.head()
- Principal components that are not needed based on the explained variance threshold are dropped from the data.
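Those selection lines are not included in the extract; a minimal sketch, assuming the component scores are stored in a DataFrame X_pca:
# Smallest number of components whose cumulative explained variance reaches the threshold
n_components_needed = np.argmax(cumulative_explained_variance >= threshold) + 1
# Keep only the first n_components_needed principal components
X_pca = X_pca.iloc[:, :n_components_needed]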
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
Cross-validation:
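The cross_val_score calls feeding the print statements below are not shown; a plausible sketch (variable names are taken from the prints, and note that accuracy_score here shadows sklearn's function of the same name):
from sklearn.model_selection import cross_val_score
accuracy_score = cross_val_score(model, X_train, Y_train, cv=10, scoring='accuracy')
precision = cross_val_score(model, X_train, Y_train, cv=10, scoring='precision')
recall = cross_val_score(model, X_train, Y_train, cv=10, scoring='recall')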
print("accuracy_score=", np.mean(accuracy_score))
print("precision=", np.mean(precision))
print("recall=", np.mean(recall))
- Cross-validation is performed to calculate the accuracy, precision, and recall of the final model on the reduced training data.
y_pred = model.predict(X_test)
cm = confusion_matrix(Y_test, y_pred)
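# The scalar accuracy, precision, and recall printed below are not computed in the shown lines;
# a minimal sketch (assumption), using the sklearn.metrics module to avoid the accuracy_score name clash above:
from sklearn import metrics
accuracy = metrics.accuracy_score(Y_test, y_pred)
precision = metrics.precision_score(Y_test, y_pred)
recall = metrics.recall_score(Y_test, y_pred)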
print("Confusion Matrix:\n", cm)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
df0 = df[df['target'] == 0]
df1 = df[df['target'] == 1]
df0.head()
• Splits the DataFrame df into two DataFrames, df0 and df1, based on
the target values (0 and 1). This allows separate analysis for each target
group.
df_dependent = df['target']
df_independent = df.drop(columns='target', axis=1)
• Separates the target column (dependent variable) from the rest of the
dataset (independent variables).
Applying PCA
model = PCA(n_components=11)
# Fit the PCA model on the independent variables
model.fit(df_independent)
Summary of Steps
1. Splitting Data Based on Target Values:
o Data is split into two groups based on the target variable.
2. Removing the Target Column:
o The target column is removed from the data to focus on the
features.
3. Converting to Numpy Arrays:
o The data is converted to numpy arrays for mathematical operations.
4. Calculating Frobenius Norm:
o The Frobenius norm is calculated to measure the difference between the original and target matrices.
5. Computing Neutrality Target:
o The difference between the Frobenius norms is squared to get the
neutrality target.
This approach helps in understanding the differences in the data before and after
applying transformations, such as SVD, by measuring the Frobenius norm,
which provides an overall error measure.
df0 = df[df['target'] == 0]
df1 = df[df['target'] == 1]
df0.head()
• df0 and df1 are created by splitting the dataset df based on the target
values (0 and 1).
• The head() method is used to display the first few rows of df0.
df0 = df0.to_numpy()
X_target0 = X_target0.to_numpy()
df1 = df1.to_numpy()
X_target1 = X_target1.to_numpy()
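# The per-group Frobenius norms used below are not shown in the extract;
# a minimal sketch, mirroring the per-group norm pattern used later in these notes:
frobenius_norm0_for_ed = np.linalg.norm(df0 - X_target0)
frobenius_norm1_for_ed = np.linalg.norm(df1 - X_target1)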
neutrality_target_for_ed = (frobenius_norm1_for_ed - frobenius_norm0_for_ed) ** 2
neutrality_target_for_ed
X_reconstructed = model.inverse_transform(X_d)
X_reconstructed
These explanations cover the main steps in the code, focusing on data preprocessing, applying PCA, and reconstructing the data.
1. Explained Variance Ratio
Explained Variance Ratio (EVR) indicates the proportion of the dataset's variance that is captured by each principal component in PCA. It is a measure of how much information each component carries.
Variance Calculation: Variance is calculated for each feature to understand its contribution to the total variance.
EVR Calculation: Each feature's variance is divided by the total variance to get the proportion of variance it explains.
Sorting: Sorting EVR in descending order helps identify which features contribute most to the
variance.
Example Calculation:
EV = df.var()
s=sum(EV)
EVR = EV / s
EVR.sort_values(ascending=False)
This gives a sorted list of features based on their contribution to the total variance.
2. Binning
Binning is a process of converting continuous data into categorical data by dividing it into intervals (bins).
Define Bin Edges: Set boundaries for the bins.
Value Counts: Count the number of occurrences in each bin to understand the distribution.
Example Calculation:
df['Cholesterol_bin'].value_counts()
This categorizes cholesterol levels and counts how many values fall into each category.
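The line creating the 'Cholesterol_bin' column is not shown; a minimal sketch using pandas' cut, with bin edges assumed from the cholesterol ranges (<200, 200-240, >240) used later in these notes:
bins = [0, 200, 240, float('inf')]
bin_labels = ['<200', '200-240', '>240']
df['Cholesterol_bin'] = pd.cut(df['cholesterol'], bins=bins, labels=bin_labels)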
3. Label Encoding
Label Encoding converts categorical data into numerical format, which is necessary for most machine learning algorithms.
Fit and Transform: Encode the binned cholesterol data into numerical labels.
Add Encoded Labels: Create a new column with these labels in the DataFrame.
Example Calculation:
from sklearn.preprocessing import LabelEncoder
label_encode = LabelEncoder()
labels = label_encode.fit_transform(df.Cholesterol_bin)
df['Cholesterol_group'] = labels
new_df = df.drop(columns='Cholesterol_bin')
new_df.info()
This converts the binned cholesterol data into numerical values and adds it to the DataFrame.
4. Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) decomposes a matrix into three other matrices and is used for dimensionality reduction.
Centering Data: Subtract the mean from each feature to center the data.
SVD Reconstruction: Use the inverse transform to approximate the original data from the reduced dimensions.
Example Calculation:
# X_centered is the centered feature matrix (each column minus its mean, as described above)
U, S, Vt = np.linalg.svd(X_centered)
X_d = X_centered.dot(Vt.T)   # project the centered data onto the right singular vectors
X_d = X_d.to_numpy()
X_reconstructed = model1.inverse_transform(X_d)
X_reconstructed
This process reduces the dimensionality and then reconstructs the data to approximate the original data.
5. Creating DataFrame from Reconstructed Data
After reconstruction, the data needs to be converted back into a DataFrame for further analysis.
DataFrame Creation: Convert the numpy array of reconstructed data into a DataFrame.
Add Labels: Add the encoded cholesterol group labels back as a new column.
Example Calculation:
X = pd.DataFrame(X_reconstructed, columns=['sex', 'chest pain type', 'resting bp s', 'cholesterol',
                                           'fasting blood sugar', 'resting ecg', 'max heart rate',
                                           'exercise angina', 'oldpeak', 'ST slope', 'age_group'])
X['Cholesterol_group'] = new_Y
X.head()
6. Splitting Data by Group
Splitting data based on the categorical labels helps in comparing different groups.
Example Calculation:
X_cholesterol0 = X[X['Cholesterol_group'] == 0]
X_cholesterol1 = X[X['Cholesterol_group'] == 1]
X_cholesterol2 = X[X['Cholesterol_group'] == 2]
X_cholesterol0.head()
This segregates the data based on cholesterol groups for further analysis.
7. Frobenius Norm
The Frobenius Norm measures the difference between two matrices (the square root of the sum of squared entries of the difference matrix); it is often used to quantify reconstruction error.
Compute Norm: Calculate the Frobenius norm between the original and reconstructed data for each group.
Example Calculation:
df_Cholesterol0 = df_Cholesterol0.to_numpy()
X_cholesterol0 = X_cholesterol0.to_numpy()
frobenius_norm0 = np.linalg.norm(df_Cholesterol0 - X_cholesterol0)
frobenius_norm0
This quantifies how well the reconstructed data matches the original data.
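The same pattern would be repeated for the other two cholesterol groups (a sketch; the df_Cholesterol1 and df_Cholesterol2 names are assumed by analogy with group 0):
df_Cholesterol1 = df_Cholesterol1.to_numpy()
X_cholesterol1 = X_cholesterol1.to_numpy()
frobenius_norm1 = np.linalg.norm(df_Cholesterol1 - X_cholesterol1)
df_Cholesterol2 = df_Cholesterol2.to_numpy()
X_cholesterol2 = X_cholesterol2.to_numpy()
frobenius_norm2 = np.linalg.norm(df_Cholesterol2 - X_cholesterol2)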
8. Neutrality Calculation
Squared Differences: Compute the squared differences of the Frobenius norms between groups.
Average Squared Differences: Average these squared differences to obtain the neutrality value.
Example Calculation:
neutrality_cholesterol
This provides a measure of how uniformly the reconstruction error is distributed across different groups.
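The line computing neutrality_cholesterol is not shown; a minimal sketch, assuming neutrality is the average of the pairwise squared differences between the three group norms:
neutrality_cholesterol = ((frobenius_norm0 - frobenius_norm1) ** 2 +
                          (frobenius_norm0 - frobenius_norm2) ** 2 +
                          (frobenius_norm1 - frobenius_norm2) ** 2) / 3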
9. Plotting the Frobenius Norms
Visualizing the Frobenius norms helps in comparing the reconstruction errors across groups.
Set Labels and Values: Define labels and norms for plotting.
Plot: Use matplotlib to create the plot (incomplete in the given code).
Example Calculation:
labels = ['cholesterol range >240', 'cholesterol range 200-240', 'cholesterol range <200']
frobenius_norms = [frobenius_norm0, frobenius_norm1, frobenius_norm2]  # group norms (order assumed to match the labels)
x = np.arange(len(labels))  # label locations
width = 0.35  # width of the bars
# Plotting
fig, ax = plt.subplots()
rects = ax.bar(x, frobenius_norms, width, label='SVD')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Frobenius Norms')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
fig.tight_layout()
plt.show()
This would create a bar chart comparing the Frobenius norms for different cholesterol groups.
Each of these steps plays a crucial role in the overall analysis and understanding of the data, ensuring the results are reliable and interpretable.
This line removes the 'target' column from the df DataFrame, creating a new
DataFrame called new_df.
These lines filter new_df to create four new DataFrames, each containing rows
where the 'chest pain type' column equals 1, 2, 3, and 4, respectively.
This line extracts the 'chest pain type' column from new_df and stores it in
new_Y.
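The code these three notes describe is not included in the extract; a plausible reconstruction (variable names taken from the later cells):
new_df = df.drop(columns='target')
df_chest_pain_type1 = new_df[new_df['chest pain type'] == 1]
df_chest_pain_type2 = new_df[new_df['chest pain type'] == 2]
df_chest_pain_type3 = new_df[new_df['chest pain type'] == 3]
df_chest_pain_type4 = new_df[new_df['chest pain type'] == 4]
new_Y = new_df['chest pain type']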
model2 = PCA(n_components=11)
# Fit transform
model2.fit(new_df)
X_centered = new_df - new_df.mean()
U, S, Vt = np.linalg.svd(X_centered)
X_d = X_centered.dot(Vt.T[:, :])
X_d = X_d.to_numpy()
X_reconstructed = model2.inverse_transform(X_d)  # inverse-transform with the PCA model fitted above
X_reconstructed
X['chest_pain_type_group'] = new_Y
X.head()
These lines filter X to create four new DataFrames, each containing rows where
the 'chest_pain_type_group' column equals 1, 2, 3, and 4, respectively.
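Those filtering lines are not shown; a plausible reconstruction (variable names taken from the norm calculations below):
X_chest_pain_type1 = X[X['chest_pain_type_group'] == 1]
X_chest_pain_type2 = X[X['chest_pain_type_group'] == 2]
X_chest_pain_type3 = X[X['chest_pain_type_group'] == 3]
X_chest_pain_type4 = X[X['chest_pain_type_group'] == 4]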
df_chest_pain_type1 = df_chest_pain_type1.to_numpy()
X_chest_pain_type1 = X_chest_pain_type1.to_numpy()
frobenius_norm1 = np.linalg.norm(df_chest_pain_type1 - X_chest_pain_type1)
frobenius_norm1
df_chest_pain_type2 = df_chest_pain_type2.to_numpy()
X_chest_pain_type2 = X_chest_pain_type2.to_numpy()
frobenius_norm2 = np.linalg.norm(df_chest_pain_type2 - X_chest_pain_type2)
frobenius_norm2
df_chest_pain_type3 = df_chest_pain_type3.to_numpy()
X_chest_pain_type3 = X_chest_pain_type3.to_numpy()
frobenius_norm3 = np.linalg.norm(df_chest_pain_type3 - X_chest_pain_type3)
frobenius_norm3
df_chest_pain_type4 = df_chest_pain_type4.to_numpy()
X_chest_pain_type4 = X_chest_pain_type4.to_numpy()
frobenius_norm4 = np.linalg.norm(df_chest_pain_type4 - X_chest_pain_type4)
frobenius_norm4
labels = ['chest pain scenario 1', 'chest pain scenario 2', 'chest pain scenario 3', 'chest pain scenario 4']
svd_norms = [frobenius_norm1, frobenius_norm2, frobenius_norm3, frobenius_norm4]
x = np.arange(len(labels)) # label locations
width = 0.35 # width of the bars
1. labels = [...]: Defines labels for the bar plot.
2. svd_norms = [...]: Creates a list of Frobenius norms for each chest pain
type.
3. x = np.arange(len(labels)): Creates an array of label locations.
4. width = 0.35: Sets the width of the bars.
# Plotting
fig, ax = plt.subplots()
rects = ax.bar(x, svd_norms, width, label='SVD')
# Add some text for labels, title, and custom x-axis tick labels, etc.
ax.set_xlabel('Different chest pain scenario')
ax.set_ylabel('Frobenius Norm')
ax.set_title('Comparison of Frobenius Norms for SVD on chest pain type attribute')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
autolabel(rects)
fig.tight_layout()
plt.show()
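autolabel is called above but never defined in the extract; a minimal sketch of the standard matplotlib helper it presumably refers to (it would need to be defined before it is called):
def autolabel(rects):
    """Attach a text label above each bar showing its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate(f'{height:.2f}',
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3-point vertical offset
                    textcoords='offset points',
                    ha='center', va='bottom')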