CS60050_Machine Learning_Programming Assignment_3
Problem Statement:
You are tasked with building a Support Vector Machine (SVM) classifier to predict whether a
particle-collision event is a signal (Higgs boson) or background. The dataset is
large-scale and high-dimensional, requiring efficient data handling, advanced feature
selection, and model tuning.
Dataset: HIGGS (particle-collision events labelled as signal or background).
Tasks:
● Implement an SVM with a linear kernel and evaluate the model using
cross-validation.
● Report key classification metrics: accuracy, precision, recall, F1-score, and AUC
(Area Under the ROC Curve).
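A minimal sketch of the linear-kernel baseline with 5-fold cross-validation and the required metrics, using scikit-learn. The synthetic data stands in for the HIGGS features (28 columns); swap in the real dataset when loaded.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data with the HIGGS shape in miniature (28 features);
# replace with the real dataset once it is loaded.
X, y = make_classification(n_samples=2000, n_features=28, random_state=0)

# LinearSVC is the fast linear-kernel SVM; scaling matters for SVMs.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
scores = cross_validate(
    clf, X, y, cv=5,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"],
)
for metric in ["accuracy", "precision", "recall", "f1", "roc_auc"]:
    print(metric, scores[f"test_{metric}"].mean())
```

The ROC-AUC scorer uses `LinearSVC.decision_function`, so no probability calibration is needed for this metric.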
● Scalability and Efficiency (3 Marks)
○ Discuss and implement strategies to handle the large-scale dataset efficiently
(e.g., using Stochastic Gradient Descent or mini-batch learning for SVM).
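One way to realise the mini-batch strategy above is `SGDClassifier` with hinge loss, which trains a linear SVM by stochastic gradient descent and accepts data in chunks via `partial_fit` (the batch size and epoch count here are illustrative, not tuned):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=10000, n_features=28, random_state=0)
X = StandardScaler().fit_transform(X)

# hinge loss => a linear SVM objective, trained incrementally
clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)
classes = np.unique(y)  # partial_fit needs the label set up front
batch = 1000
for epoch in range(5):
    idx = np.random.RandomState(epoch).permutation(len(X))
    for start in range(0, len(X), batch):
        sl = idx[start:start + batch]
        clf.partial_fit(X[sl], y[sl], classes=classes)
print("train accuracy:", clf.score(X, y))
```

Because each `partial_fit` call sees only one batch, the full dataset never has to fit in memory at once, which is the point for HIGGS-scale data.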
● Evaluate and report the computational cost of each kernel during training and
prediction (measured wall-clock time, related back to the kernels’ time complexity).
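The per-kernel cost comparison can be sketched by timing `fit` and `predict` for each kernel on the same split (synthetic stand-in data again; sizes are kept small for illustration):

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=28, random_state=0)
X_tr, y_tr, X_te = X[:1000], y[:1000], X[1000:]

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel)
    t0 = time.perf_counter(); clf.fit(X_tr, y_tr); fit_t = time.perf_counter() - t0
    t0 = time.perf_counter(); clf.predict(X_te); pred_t = time.perf_counter() - t0
    print(f"{kernel}: fit {fit_t:.3f}s, predict {pred_t:.3f}s")
```

Kernel SVM training scales roughly between O(n^2) and O(n^3) in the number of samples, which is why the measured gap between kernels widens sharply as the training set grows.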
● Summarize the results from all kernel methods and hyperparameter variations.
● Compare the performance of each kernel and provide insights on which one is most
suitable for the HIGGS dataset based on classification metrics and computational
efficiency.
● Explainability and Interpretability (3 Marks)
○ Use tools like SHAP (SHapley Additive exPlanations) or LIME (Local
Interpretable Model-Agnostic Explanations) to explain the model’s predictions
and assess the importance of the most influential features.
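SHAP's `KernelExplainer` (or LIME's tabular explainer) is the intended tool here; as a lightweight, dependency-free stand-in for ranking influential features, scikit-learn's permutation importance measures the accuracy drop when each feature is shuffled. This is a substitute technique, not SHAP itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=4, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)

# Model-agnostic importance: shuffle one feature at a time and
# record how much the score degrades.
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]
for i in ranked[:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

SHAP additionally gives per-prediction (local) attributions, which permutation importance cannot, so it remains the better fit for the "explain the model's predictions" part of the task.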
Problem Statement:
You are provided with a dataset of frog species based on their sound frequencies (MFCCs).
Your task is to apply advanced clustering techniques, starting with K-Means, to group the
frogs into clusters based on their acoustic features and explore clustering performance using
additional evaluation methods.
Dataset:
Tasks:
● Exploratory Data Analysis (EDA): Analyze the dataset by checking for missing
values, feature distributions, and outliers.
● Data Scaling: Apply feature scaling using normalization or standardization.
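The EDA and scaling steps above can be sketched as follows; the random frame stands in for the MFCC columns of the frog-call dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in frame; in practice, load the MFCC columns of the real CSV.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=[f"MFCC_{i}" for i in range(4)])

print(df.isna().sum())        # missing values per column
print(df.describe())          # summary of feature distributions
z = (df - df.mean()) / df.std()
print((z.abs() > 3).sum())    # crude z-score outlier count per column

# Standardization: zero mean, unit variance per feature
X_scaled = StandardScaler().fit_transform(df)
```

Standardization (rather than min-max normalization) is usually the safer default for K-Means, since the algorithm's Euclidean distances are dominated by whichever features have the largest raw variance.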
● Feature Engineering: Try to derive new features from the existing MFCCs (e.g.,
polynomial features or interaction terms) to potentially improve clustering
performance.
● Elbow Method: Implement the Elbow Method to determine the optimal number of
clusters.
● Silhouette Score Evaluation: After finding the optimal number of clusters, evaluate
the clustering quality using the silhouette score.
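The Elbow Method and silhouette evaluation can be sketched together: sweep k, record the inertia (within-cluster sum of squares), pick the k where the drop flattens, then score that clustering (blob data stands in for the scaled MFCCs):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
print(inertias)  # plot these; the "elbow" is where the decrease flattens

best_k = 4  # chosen by inspecting the elbow (known here by construction)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
sil = silhouette_score(X, labels)
print("silhouette:", sil)
```

Inertia always decreases as k grows, which is why it cannot be maximized directly; the silhouette score, by contrast, peaks at a well-separated k and so serves as an independent check on the elbow choice.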
● Cluster Implementation: Implement K-Means clustering based on the optimal
number of clusters.
● Compare different initialization methods for K-Means (e.g., random initialization vs.
k-means++).
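The initialization comparison can be run by fixing everything except `init` and using `n_init=1` so a single initialization's quality is visible (with the default `n_init=10`, the best of several restarts would mask the difference):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

results = {}
for init in ["random", "k-means++"]:
    km = KMeans(n_clusters=4, init=init, n_init=1, random_state=0).fit(X)
    results[init] = km.inertia_
    print(f"{init}: inertia={km.inertia_:.1f}, iterations={km.n_iter_}")
```

Repeating this over several random seeds gives a fairer picture: k-means++ typically reaches a lower inertia in fewer iterations, but on easy data both can converge to the same solution.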
● Analyze which features (MFCCs) contribute the most to cluster separation and
visualize these contributions.
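One simple, hedged way to rank features by their contribution to cluster separation is to compare the spread of the cluster centers along each feature to that feature's overall spread (a heuristic, not the only valid analysis; PCA loadings or per-cluster boxplots are alternatives):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, n_features=5, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Spread of cluster centers along each feature, relative to the feature's
# overall spread: a high ratio means the feature separates clusters well.
between = km.cluster_centers_.std(axis=0)
overall = X.std(axis=0)
ratio = between / overall
for i in np.argsort(ratio)[::-1]:
    print(f"feature {i}: separation ratio {ratio[i]:.2f}")
```

For visualization, a bar chart of these ratios, or a scatter plot of the two highest-ratio MFCCs colored by cluster label, makes the contributions easy to read.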
Algorithm Comparison
Submission Requirements: