2022CS665
Report
Submitted By:
Syed Ali Irtaza Hassan (2022-CS-665)
Submitted To:
Ma’am Alina Munir
1. Introduction
The goal is to analyze the datasets, apply suitable models, and compare
performance with varying input features.
2. Dataset Descriptions
S1: Banknote Authentication Dataset
• Instances: 1,372
S2: Blood Transfusion Dataset
• Instances: 748
S3: User Knowledge Modeling Dataset
• Instances: 403
• Target: User knowledge level (UN, VN, OP, EX) [used for clustering]
3. Bivariate Analysis
S1: Banknote Authentication
• Visualization
S2: Blood Transfusion
• Visualization
S3: User Knowledge Modeling
• Visualization
4. Model Implementations
S1 and S2: Classification with KNN and Naïve Bayes
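A minimal sketch of the classifier comparison described here, using scikit-learn's KNeighborsClassifier and GaussianNB. The synthetic data, the choice of k=5 neighbours, and the 75/25 split are illustrative assumptions standing in for the report's actual banknote and transfusion data and settings.

```python
# Sketch: compare KNN and Gaussian Naive Bayes accuracy as the number of
# input features grows (synthetic data stands in for the report's datasets).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data with 4 informative features
# (assumption: mirrors the 4-feature banknote dataset in shape only).
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=42)

for n_feat in (2, 3, 4):
    X_sub = X[:, :n_feat]  # use only the first n_feat features
    X_tr, X_te, y_tr, y_te = train_test_split(X_sub, y, test_size=0.25,
                                              random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    nb = GaussianNB().fit(X_tr, y_tr)
    print(n_feat,
          round(accuracy_score(y_te, knn.predict(X_te)), 3),
          round(accuracy_score(y_te, nb.predict(X_te)), 3))
```

Running the same loop on the real datasets would reproduce the feature-count comparison reported in Section 7.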
Elbow Method
Figure 3: The elbow method was used to determine the optimal number of clusters by plotting the within-cluster sum of
squares (WCSS) against different values of K. The 'elbow point' indicates the ideal number of clusters for this dataset.
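The elbow procedure in Figure 3 can be sketched as follows; scikit-learn exposes WCSS as the fitted model's inertia_ attribute. The synthetic blobs and the K range of 1-8 are assumptions for illustration, not the report's actual data.

```python
# Sketch of the elbow method: fit K-Means for a range of K and record the
# within-cluster sum of squares (WCSS), then look for the K where the
# decrease flattens out (the "elbow").
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 5-feature data (assumption: stands in for STG, SCG, STR, LPR, PEG).
X, _ = make_blobs(n_samples=400, centers=4, n_features=5, random_state=42)

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # WCSS for this value of K

# Plotting wcss against K would reproduce a figure like Figure 3.
```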
Figure 4: The K-Means clustering algorithm grouped the user knowledge levels into K clusters based on feature
similarity. The results can be visualized and interpreted by comparing cluster labels with the actual user knowledge
levels.
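One way to make the comparison between cluster labels and actual knowledge levels quantitative is the adjusted Rand index (ARI). The sketch below uses synthetic labelled blobs as a stand-in for the UN/VN/OP/EX classes; the choice of ARI and K=4 are illustrative assumptions, not necessarily what the report used.

```python
# Sketch: fit K-Means and score agreement between cluster assignments and
# known labels with the adjusted Rand index (1.0 = perfect, ~0 = random).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=400, centers=4, n_features=5, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(y, labels)
print(round(ari, 3))
```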
7. Performance Comparison
Classification Performance: KNN vs Naïve Bayes
• With two features, Naïve Bayes slightly outperformed KNN (76% vs
74.67%), but KNN matched or exceeded Naïve Bayes as features increased.
• S1 Dataset:
o KNN performed exceptionally well, achieving up to 100% accuracy with
all four features.
o Naïve Bayes performed reasonably but showed a plateau and slight
decline with additional features.
o The dataset's linearly separable features favoured KNN, which relies on
distance-based classification.
• S2 Dataset:
o Both classifiers performed similarly, with marginal differences in
accuracy.
o Naïve Bayes slightly outperformed KNN at lower feature counts, but KNN
caught up or surpassed it as more features were included.
o The closeness in results suggests the classes are not strongly
separable, so neither algorithm holds a clear advantage without more
complex modelling.
Clustering (S3):
• Elbow Method: Suggested the optimal number of clusters (the point where
WCSS stops decreasing sharply) for the user knowledge dataset.
• K-Means Clustering: Successfully grouped the dataset into meaningful clusters
that visually aligned with actual knowledge levels.
• The clustering results validated that the five features (STG, SCG, STR, LPR, PEG)
provide reasonable separation and patterning within the dataset.
General Insights:
• KNN is highly effective for datasets with well-separated classes and low
dimensionality.
• Naïve Bayes provides a decent baseline, especially when the features
roughly match its probabilistic assumptions (conditional independence
given the class).
• Feature selection and engineering significantly influence performance;
thoughtful selection improves results even with simpler models.
• Visualization techniques such as bivariate plots and elbow plots are
crucial for understanding data structure and model behaviour.
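The feature-selection point above can be sketched with scikit-learn's SelectKBest, which ranks features by a univariate score (here the ANOVA F-statistic). The synthetic data and the choice of k=3 are illustrative assumptions only.

```python
# Sketch of simple feature selection: score each feature with the ANOVA
# F-statistic and keep the top k, illustrating how thoughtful selection
# can help even simple models like KNN and Naive Bayes.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=7)

selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
X_top = selector.transform(X)  # keep only the 3 highest-scoring features
print(X_top.shape)             # (300, 3)
```

The reduced matrix X_top would then feed into the same train/test pipeline used for the full feature set.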