This report explores the application of data science techniques including bivariate analysis, classification (KNN and Naïve Bayes), and clustering (K-Means) on three datasets from the UCI Machine Learning Repository. Key findings indicate that KNN outperformed Naïve Bayes in the Banknote Authentication dataset, while both classifiers showed similar performance in the Blood Transfusion dataset. The clustering analysis using K-Means effectively grouped user knowledge levels, demonstrating the importance of feature selection and visualization in data science.


Introduction to Data Science

Report

Submitted By:
Syed Ali Irtaza Hassan (2022-CS-665)

Submitted To:
Ma’am Alina Munir

Department of Computer Science

University of Engineering and Technology, Lahore


1. Introduction
This report presents the application of key data science techniques such as
bivariate analysis, classification (KNN and Naïve Bayes), and clustering (K-
Means) on three UCI Machine Learning Repository datasets:

• S1: Banknote Authentication

• S2: Blood Transfusion Service Center

• S3: User Knowledge Modeling

The goal is to analyze the datasets, apply suitable models, and compare
performance with varying input features.

2. Dataset Descriptions
S1: Banknote Authentication Dataset

• Source: UCI Link

• Instances: 1,372

• Features: 4 (Variance, Skewness, Kurtosis, Entropy)

• Target: Authentic (0) or Fake (1)

S2: Blood Transfusion Service Center Dataset

• Source: UCI Link

• Instances: 748

• Features: 4 (Recency, Frequency, Monetary, Time)

• Target: Whether donor donated blood in March 2007 (0 = No, 1 = Yes)

S3: User Knowledge Modeling Dataset

• Source: UCI Link

• Instances: 403

• Features: 5 (STG, SCG, STR, LPR, PEG)

• Target: User knowledge level (UN, VN, OP, EX) [used for clustering]
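The datasets above ship as header-less text files on UCI; a minimal pandas sketch of how S1 can be loaded (the inline rows and column names below are illustrative stand-ins, not the full file):

```python
import io
import pandas as pd

# A few illustrative rows in the Banknote Authentication format
# (variance, skewness, kurtosis, entropy, class). The real data comes
# from the UCI file, which has no header row, hence names= below.
sample = io.StringIO(
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "4.5459,8.1674,-2.4586,-1.4621,0\n"
    "-3.5637,-8.3827,12.393,-1.2823,1\n"
)
cols = ["variance", "skewness", "kurtosis", "entropy", "class"]
df = pd.read_csv(sample, names=cols)

X = df[cols[:-1]]   # the four numeric features
y = df["class"]     # 0 = authentic, 1 = fake
print(df.shape)     # (3, 5)
```
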
3. Bivariate Analysis
S1: Banknote Authentication
• Visualizations
S2: Blood Transfusion

• Visualization
S3: User Knowledge Modeling

• Visualization
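The bivariate plots themselves are not reproduced here; the sketch below shows the kind of pairwise analysis involved, on synthetic stand-in data (all names and values are illustrative, not taken from S1–S3):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: two related numeric features plus a binary label
n = 200
f1 = rng.normal(size=n)
f2 = 0.8 * f1 + 0.2 * rng.normal(size=n)   # strongly tied to f1
label = (f1 + f2 > 0).astype(int)
df = pd.DataFrame({"feature_1": f1, "feature_2": f2, "label": label})

# Pairwise (bivariate) correlation between the numeric features
print(df[["feature_1", "feature_2"]].corr().round(2))

# The visual counterpart is a scatter plot of the pair coloured by class:
# df.plot.scatter(x="feature_1", y="feature_2", c="label", colormap="viridis")
```
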
4. Model Implementations
S1 and S2: Classification with KNN and Naïve Bayes

• Python Code Snippets:

Feature subsets grown from two features up to all available features

Training and testing of KNN and Naïve Bayes classifiers

Figure 1: Code for S1 dataset


Figure 2: Code for S2
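The code in Figures 1 and 2 is not reproduced in this extract; below is a minimal sketch of the same procedure, with synthetic data standing in for S1/S2 and scikit-learn classifiers (k = 5 for KNN is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for S1/S2: 4 numeric features, binary target
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# Grow the feature set from the first 2 columns to all 4, as in the report
for k in range(2, X.shape[1] + 1):
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, :k], y_tr)
    nb = GaussianNB().fit(X_tr[:, :k], y_tr)
    acc_knn = accuracy_score(y_te, knn.predict(X_te[:, :k]))
    acc_nb = accuracy_score(y_te, nb.predict(X_te[:, :k]))
    print(f"{k} features  KNN={acc_knn:.4f}  NB={acc_nb:.4f}")
```
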
5. Confusion Matrices and Accuracy Results
S1: Banknote Authentication
S2: Blood Transfusion
Accuracy results for S1, using from 2 features up to all features (top to bottom):

Accuracy results for S2, using from 2 features up to all features (top to bottom):
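The confusion matrices referred to above can be produced with scikit-learn; a small self-contained sketch (toy labels below, not the actual S1/S2 predictions):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows = actual class, columns = predicted class:
# cm[0][0] true negatives, cm[0][1] false positives,
# cm[1][0] false negatives, cm[1][1] true positives
cm = confusion_matrix(y_true, y_pred)
print(cm)                                  # [[3 1]
                                           #  [1 3]]
print("accuracy:", accuracy_score(y_true, y_pred))   # 0.75
```
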


6. Clustering Analysis (S3)

Elbow Method

Figure 3: The elbow method was used to determine the optimal number of clusters by plotting the within-cluster sum of
squares (WCSS) against different values of K. The 'elbow point' indicates the ideal number of clusters for this dataset.
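A sketch of how such an elbow curve can be computed with scikit-learn (synthetic 5-feature data stands in for S3, and the tested range of K is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for S3: 5 features, 4 underlying groups
X, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=0)

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares for this K
    print(f"k={k}  WCSS={km.inertia_:.1f}")

# Plotting K against WCSS shows the elbow, e.g.:
# plt.plot(range(1, 9), wcss, marker="o")
```
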

K-Means Clustering Results

Figure 4: The K-Means clustering algorithm grouped the user knowledge levels into K clusters based on feature
similarity. The results can be visualized and interpreted by comparing cluster labels with the actual user knowledge
levels.
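A sketch of the comparison step on synthetic stand-in data, where the true blob labels play the role of the actual knowledge levels; the adjusted Rand index is one label-invariant way to quantify the visual comparison described above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for S3: 5 features, 4 true groups
X, true_labels = make_blobs(n_samples=300, n_features=5, centers=4,
                            random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Cluster numbering is arbitrary, so compare cluster assignments with the
# actual labels using a score that ignores label permutations
ari = adjusted_rand_score(true_labels, km.labels_)
print(f"adjusted Rand index: {ari:.2f}")
```
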
7. Performance Comparison
Classification Performance: KNN vs Naïve Bayes

S1: Banknote Authentication


• KNN started at 93.09% accuracy with two features and reached 100% as more
features were added.
• Naïve Bayes started at 82.91% and slipped to around 81.82% as further
features were added.

• Overall, KNN outperformed Naïve Bayes significantly on this dataset.

S2: Blood Transfusion


• Both models performed similarly, with accuracies clustered in the mid-70s
(roughly 74.67% to 76%).

• Naïve Bayes slightly outperformed KNN at the 2-feature level (76% vs 74.67%),
but KNN matched or exceeded it as features increased.

• Neither model showed a clear advantage on this dataset.


Figure 5: Dataset S2 and Dataset S3


8. Observations and Conclusions
This project demonstrated practical applications of data science techniques across
three datasets. The key observations and takeaways are summarized below:
Classification (S1 and S2):
• Impact of Feature Count: As the number of features increased, model accuracy
generally improved. This trend was more prominent in the S1 dataset.

• S1 Dataset:
o KNN performed exceptionally well, achieving up to 100% accuracy with
all four features.
o Naïve Bayes performed reasonably but showed a plateau and slight
decline with additional features.
o The dataset's linearly separable features favoured KNN, which relies on
distance-based classification.
• S2 Dataset:
o Both classifiers performed similarly, with marginal differences in
accuracy.
o Naïve Bayes slightly outperformed KNN at lower feature counts, but KNN
caught up or surpassed it as more features were included.
o The closeness in results suggests the data may not be strongly separable
by any particular algorithm without more complex modelling.
Clustering (S3):
• Elbow Method: Suggested the optimal number of clusters for the user
knowledge dataset.
• K-Means Clustering: Successfully grouped the dataset into meaningful clusters
that visually aligned with actual knowledge levels.
• The clustering results validated that the five features (STG, SCG, STR, LPR, PEG)
provide reasonable separation and patterning within the dataset.
General Insights:
• KNN is highly effective for datasets with well-separated classes and low
dimensionality.
• Naïve Bayes provides decent baseline performance, especially when the
features roughly satisfy its conditional-independence assumption.
• Feature selection and engineering significantly influence performance;
thoughtful selection improves results even with simpler models.
• Visualization techniques such as bivariate plots and elbow method plots are
crucial in understanding data structure and model behaviour.
