Classification and Dimension Reduction: Load Dataset
# For evaluation
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# For visualizations
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
Load dataset
This is a classification dataset. For more details on the features and labels, please check this
documentation.
Run the following cell to load features (X) and labels (y).
# read data
from sklearn.datasets import load_breast_cancer  # loader for the breast cancer dataset

data = load_breast_cancer()
X = data.data
y = data.target
Question: Why do you need to do a train-test split before you run dimension reduction algorithms? (3 pts)
Answer: Performing the train-test split before dimension reduction prevents data leakage by ensuring that the test data remains unseen during training. If you apply dimension reduction to the entire dataset first, information from the test set can influence the transformation, leading to biased results. By fitting the dimension reduction only on the training set, you ensure a fair evaluation: the transformation is learned solely from the training data and then applied to the test data for a consistent comparison.
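A minimal sketch of the split step is shown below, assuming an 80/20 split with a fixed random seed and stratification by label (these choices are placeholders, not requirements):

from sklearn.model_selection import train_test_split

# Split before any dimension reduction so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)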
As we know, the best dimension reduction technique depends on the task and your data. Therefore, we will try several methods and select the best one based on the visualization. Feel free to use any commands from sklearn.
Sample plots:
The sample plots are given here:
On the left is standard PCA. In the middle is kernel PCA; you should use the RBF kernel for this assignment and select a good hyperparameter on your own. On the right, you will implement LLE; similarly, you should select the number of neighbors.
To simplify your code, you do not need to show how you found the hyperparameters.
However, you should include your choices in your visualization (see my sample plots). Moreover, you should include all visualizations in one figure using subplots. You should add informative labels, legends, and titles to make your plots clear.
Your plots will differ from my sample plots due to different hyperparameters and the random train-test split, but the layout (plot labels, legends, etc.) should be similar.
Grading policy:
1. You should implement each algorithm correctly. (5 pts each)
2. You do not need to write a function for this part, but you should add inline comments to
explain each step. (5 pts)
3. Visualization is clear and meets the requirements. (5 pts)
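A minimal sketch of fitting the three reducers on the training samples is shown below; gamma=0.01 for the RBF kernel is an assumed placeholder, and n_neighbors=20 for LLE matches the choice shown in the plot title below. Tune both hyperparameters yourself.

from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

# Standard PCA: project the training samples onto the top 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)

# Kernel PCA with an RBF kernel (gamma=0.01 is an assumed placeholder)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.01)
X_kpca = kpca.fit_transform(X_train)

# Locally Linear Embedding (n_neighbors=20, matching the sample plot title)
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=20)
X_lle = lle.fit_transform(X_train)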
# Visualization: the three dimension reduction methods side by side
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
# Plot standard PCA
axs[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train, cmap='coolwarm', s=30, alpha=0.7)
axs[0].set_title("PCA")
# Plot kernel PCA
axs[1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y_train, cmap='coolwarm', s=30, alpha=0.7)
axs[1].set_title("Kernel PCA (RBF, gamma=0.01)")
# Plot LLE
sc = axs[2].scatter(X_lle[:, 0], X_lle[:, 1], c=y_train, cmap='coolwarm', s=30, alpha=0.7)
axs[2].set_title("LLE (n_neighbors=20)")
axs[2].legend(handles=sc.legend_elements()[0], labels=['0', '1'], title="Class")
plt.tight_layout()
plt.show()
First, based on your visualization results in part 2, determine which dimension reduction technique you want to use for part 3 and state the reason.
Second, perform dimension reduction on the training samples using the technique you select. (This step is the same as part 2, so you do not need to repeat the code; you can reuse what you obtained from part 2.)
Third, train k-nearest-neighbors, logistic regression, decision tree, random forest, and voting classifier models (the voting classifier should use all models mentioned before) on the reduced training samples, then report the test accuracy.
Last, show the decision region for each model. Please look at this reference code and visualize the decision regions. You should write a function that draws the decision region for any classification model and any data samples. A function docstring is required. (Hint: the reference I provide is in good shape, but you cannot use the code directly; slight modification is required.)
Please follow these instructions to finish part 3. Inline comments are required for your code.
3(a) Determine the dimension reduction technique you will use and state the
reason. (5 pts)
I chose LLE because it provides a clear separation between the classes in the 2D visualization,
which can help classification models distinguish between classes more effectively.
# Reuse the LLE estimator fitted in part 2 to reduce the training and test samples
X_train_lle = lle.fit_transform(X_train)
X_test_lle = lle.transform(X_test)  # transform the test set with the trained LLE
Grading policy:
def plot_decision_regions(model, X, y, feature_names=('Dim 1', 'Dim 2'), class_labels=None, ax=None):
    """
    Draw the decision region of a trained classification model on 2D data.

    Parameters:
    - model: Trained model with a `.predict` method.
    - X: 2D DataFrame or array-like of features.
    - y: 1D array of labels for the dataset.
    - feature_names: Tuple, names of the features in the plot (default: ('Dim 1', 'Dim 2')).
    - class_labels: List of class labels for the legend (default: None).
    - ax: Optional, Matplotlib axis to plot on.
    """
    # Define min/max values for the grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
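    # --- Sketch of the remaining steps; grid resolution and colors are assumed choices,
    # --- and `import numpy as np` is assumed earlier in the notebook ---
    # Evaluate the model on a dense grid covering the 2D feature space
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    if ax is None:
        _, ax = plt.subplots()
    # Shade the predicted regions, then overlay the actual samples
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=30, edgecolor='k')
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    if class_labels is not None:
        ax.legend(handles=scatter.legend_elements()[0], labels=class_labels, title="Class")
    return ax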
Print the test accuracy for each value of k, and draw the decision region of the model with the best k.
Grading policy: Do the following correctly and you will receive credit.
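A minimal sketch of the k selection step is shown below; the candidate values of k are assumptions of this sketch:

from sklearn.neighbors import KNeighborsClassifier

# Try several values of k and keep the model with the best test accuracy
best_k, best_acc, best_knn = None, 0.0, None
for k in [1, 3, 5, 7, 9, 11]:  # candidate k values (assumed)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_lle, y_train)
    acc = accuracy_score(y_test, knn.predict(X_test_lle))
    print(f"k={k}: test accuracy = {acc:.3f}")
    if acc > best_acc:
        best_k, best_acc, best_knn = k, acc, knn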
# Plotting
fig, ax = plt.subplots()
plot_decision_regions(best_knn, X_train_lle, y_train, feature_names=('Dim 1', 'Dim 2'), ax=ax)
ax.set_title(f"K-Nearest Neighbors (k={best_k}) Decision Region")
plt.show()
Grading policy: Do the following correctly and you will receive credit.
1. Train a model, report test accuracy, and visualize the decision region (5 pts)
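A minimal sketch of the training and accuracy step, assuming default hyperparameters for the logistic regression model:

from sklearn.linear_model import LogisticRegression

# Train logistic regression on the reduced training samples and report test accuracy
log_reg = LogisticRegression()
log_reg.fit(X_train_lle, y_train)
print("Logistic regression test accuracy:", accuracy_score(y_test, log_reg.predict(X_test_lle)))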
# Plotting
fig, ax = plt.subplots()
plot_decision_regions(log_reg, X_train_lle, y_train, feature_names=('Dim 1', 'Dim 2'), ax=ax)
ax.set_title("Logistic Regression Decision Region")
plt.show()
Grading policy: Do the following correctly and you will receive credit.
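A minimal sketch of the depth selection step is shown below; the candidate depths are assumptions of this sketch:

from sklearn.tree import DecisionTreeClassifier

# Try several depths and keep the one with the best test accuracy
best_depth, best_acc = None, 0.0
for depth in [2, 3, 5, 8, None]:  # candidate depths (assumed)
    dt = DecisionTreeClassifier(max_depth=depth)
    dt.fit(X_train_lle, y_train)
    acc = accuracy_score(y_test, dt.predict(X_test_lle))
    print(f"max_depth={depth}: test accuracy = {acc:.3f}")
    if acc > best_acc:
        best_depth, best_acc = depth, acc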
# Train Decision Tree with best depth and plot decision region
best_dt = DecisionTreeClassifier(max_depth=best_depth)
best_dt.fit(X_train_lle, y_train)
# Plotting
fig, ax = plt.subplots()
plot_decision_regions(best_dt, X_train_lle, y_train, feature_names=('Dim 1', 'Dim 2'), ax=ax)
ax.set_title(f"Decision Tree (max_depth={best_depth}) Decision Region")
plt.show()
Grading policy: Do the following correctly and you will receive credit.
1. Try different max_depth and n_estimators, then report all test accuracies (5 pts)
2. Select the best max_depth and n_estimators, then visualize the decision region (5 pts)
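A minimal sketch of the parameter sweep is shown below; the candidate values are assumptions of this sketch:

from sklearn.ensemble import RandomForestClassifier

# Try combinations of max_depth and n_estimators, keep the best on test accuracy
best_depth, best_n_estimators, best_acc = None, None, 0.0
for depth in [3, 5, 8]:  # candidate depths (assumed)
    for n_est in [50, 100, 200]:  # candidate ensemble sizes (assumed)
        rf = RandomForestClassifier(max_depth=depth, n_estimators=n_est)
        rf.fit(X_train_lle, y_train)
        acc = accuracy_score(y_test, rf.predict(X_test_lle))
        print(f"max_depth={depth}, n_estimators={n_est}: test accuracy = {acc:.3f}")
        if acc > best_acc:
            best_depth, best_n_estimators, best_acc = depth, n_est, acc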
# Train Random Forest with best parameters and plot decision region
best_rf = RandomForestClassifier(max_depth=best_depth, n_estimators=best_n_estimators)
best_rf.fit(X_train_lle, y_train)
# Plotting
fig, ax = plt.subplots()
plot_decision_regions(best_rf, X_train_lle, y_train, feature_names=('Dim 1', 'Dim 2'), ax=ax)
ax.set_title(f"Random Forest (max_depth={best_depth}, n_estimators={best_n_estimators}) Decision Region")
plt.show()
Grading policy: Do the following correctly and you will receive credit.
1. Train a model, report test accuracy, and visualize the decision region (5 pts)
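A minimal sketch of the voting classifier, assuming hard voting over the four models trained above:

from sklearn.ensemble import VotingClassifier

# Combine the previously trained models into a hard-voting ensemble
voting_clf = VotingClassifier(
    estimators=[('knn', best_knn), ('lr', log_reg), ('dt', best_dt), ('rf', best_rf)],
    voting='hard'
)
voting_clf.fit(X_train_lle, y_train)
print("Voting classifier test accuracy:", accuracy_score(y_test, voting_clf.predict(X_test_lle)))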
# Plotting
fig, ax = plt.subplots()
plot_decision_regions(voting_clf, X_train_lle, y_train, feature_names=('Dim 1', 'Dim 2'), ax=ax)
ax.set_title("Voting Classifier Decision Region")
plt.show()