This report explores the application of data science techniques including bivariate analysis, classification (KNN and Naïve Bayes), and clustering (K-Means) on three datasets from the UCI Machine Learning Repository. Key findings indicate that KNN outperformed Naïve Bayes in the Banknote Authentication dataset, while both classifiers showed similar performance in the Blood Transfusion dataset. The clustering analysis using K-Means effectively grouped user knowledge levels, demonstrating the importance of feature selection and visualization in data science.


Introduction to Data Science

Report

Submitted By:
Syed Ali Irtaza Hassan (2022-CS-665)

Submitted To:
Ma’am Alina Munir

Department of Computer Science

University of Engineering and Technology, Lahore


1. Introduction
This report presents the application of key data science techniques such as
bivariate analysis, classification (KNN and Naïve Bayes), and clustering (K-
Means) on three UCI Machine Learning Repository datasets:

• S1: Banknote Authentication

• S2: Blood Transfusion Service Center

• S3: User Knowledge Modeling

The goal is to analyze the datasets, apply suitable models, and compare
performance with varying input features.

2. Dataset Descriptions
S1: Banknote Authentication Dataset

• Source: UCI Link

• Instances: 1,372

• Features: 4 (Variance, Skewness, Kurtosis, Entropy)

• Target: Authentic (0) or Fake (1)

S2: Blood Transfusion Service Center Dataset

• Source: UCI Link

• Instances: 748

• Features: 4 (Recency, Frequency, Monetary, Time)

• Target: Whether donor donated blood in March 2007 (0 = No, 1 = Yes)

S3: User Knowledge Modeling Dataset

• Source: UCI Link

• Instances: 403

• Features: 5 (STG, SCG, STR, LPR, PEG)

• Target: User knowledge level (UN, VN, OP, EX) [used for clustering]
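The datasets above ship as header-less text files on UCI; a minimal pandas sketch of how S1 can be loaded (the inline rows and column names below are illustrative stand-ins, not the full file):

```python
import io
import pandas as pd

# A few illustrative rows in the Banknote Authentication format
# (variance, skewness, kurtosis, entropy, class). The real data comes
# from the UCI file, which has no header row, hence names= below.
sample = io.StringIO(
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "4.5459,8.1674,-2.4586,-1.4621,0\n"
    "-3.5637,-8.3827,12.393,-1.2823,1\n"
)
cols = ["variance", "skewness", "kurtosis", "entropy", "class"]
df = pd.read_csv(sample, names=cols)

X = df[cols[:-1]]   # the four numeric features
y = df["class"]     # 0 = authentic, 1 = fake
print(df.shape)     # (3, 5)
```
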
3. Bivariate Analysis
S1: Banknote Authentication
• Visualizations
S2: Blood Transfusion

• Visualization
S3: User Knowledge Modeling

• Visualization
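The bivariate plots themselves are not reproduced here; the sketch below shows the kind of pairwise analysis involved, on synthetic stand-in data (all names and values are illustrative, not taken from S1–S3):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: two related numeric features plus a binary label
n = 200
f1 = rng.normal(size=n)
f2 = 0.8 * f1 + 0.2 * rng.normal(size=n)   # strongly tied to f1
label = (f1 + f2 > 0).astype(int)
df = pd.DataFrame({"feature_1": f1, "feature_2": f2, "label": label})

# Pairwise (bivariate) correlation between the numeric features
print(df[["feature_1", "feature_2"]].corr().round(2))

# The visual counterpart is a scatter plot of the pair coloured by class:
# df.plot.scatter(x="feature_1", y="feature_2", c="label", colormap="viridis")
```
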
4. Model Implementations
S1 and S2: Classification with KNN and Naïve Bayes

• Python Code Snippets:

Feature subsets grown from two features up to all available features

Training and testing of KNN and Naïve Bayes classifiers

Figure 1: Code for S1 dataset


Figure 2: Code for S2
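The code in Figures 1 and 2 is not reproduced in this extract; below is a minimal sketch of the same procedure, with synthetic data standing in for S1/S2 and scikit-learn classifiers (k = 5 for KNN is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for S1/S2: 4 numeric features, binary target
X, y = make_classification(n_samples=500, n_features=4, n_informative=4,
                           n_redundant=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# Grow the feature set from the first 2 columns to all 4, as in the report
for k in range(2, X.shape[1] + 1):
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, :k], y_tr)
    nb = GaussianNB().fit(X_tr[:, :k], y_tr)
    acc_knn = accuracy_score(y_te, knn.predict(X_te[:, :k]))
    acc_nb = accuracy_score(y_te, nb.predict(X_te[:, :k]))
    print(f"{k} features  KNN={acc_knn:.4f}  NB={acc_nb:.4f}")
```
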
5. Confusion Matrices and Accuracy Results
S1: Banknote Authentication
S2: Blood Transfusion
Accuracy results for S1, using from 2 features up to all features (top to bottom):

Accuracy results for S2, using from 2 features up to all features (top to bottom):
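The confusion matrices referred to above can be produced with scikit-learn; a small self-contained sketch (toy labels below, not the actual S1/S2 predictions):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows = actual class, columns = predicted class:
# cm[0][0] true negatives, cm[0][1] false positives,
# cm[1][0] false negatives, cm[1][1] true positives
cm = confusion_matrix(y_true, y_pred)
print(cm)                                  # [[3 1]
                                           #  [1 3]]
print("accuracy:", accuracy_score(y_true, y_pred))   # 0.75
```
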


6. Clustering Analysis (S3)

Elbow Method

Figure 3: The elbow method was used to determine the optimal number of clusters by plotting the within-cluster sum of
squares (WCSS) against different values of K. The 'elbow point' indicates the ideal number of clusters for this dataset.
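A sketch of how such an elbow curve can be computed with scikit-learn (synthetic 5-feature data stands in for S3, and the tested range of K is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for S3: 5 features, 4 underlying groups
X, _ = make_blobs(n_samples=300, n_features=5, centers=4, random_state=0)

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares for this K
    print(f"k={k}  WCSS={km.inertia_:.1f}")

# Plotting K against WCSS shows the elbow, e.g.:
# plt.plot(range(1, 9), wcss, marker="o")
```
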

K-Means Clustering Results

Figure 4: The K-Means clustering algorithm grouped the user knowledge levels into K clusters based on feature
similarity. The results can be visualized and interpreted by comparing cluster labels with the actual user knowledge
levels.
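A sketch of the comparison step on synthetic stand-in data, where the true blob labels play the role of the actual knowledge levels; the adjusted Rand index is one label-invariant way to quantify the visual comparison described above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for S3: 5 features, 4 true groups
X, true_labels = make_blobs(n_samples=300, n_features=5, centers=4,
                            random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Cluster numbering is arbitrary, so compare cluster assignments with the
# actual labels using a score that ignores label permutations
ari = adjusted_rand_score(true_labels, km.labels_)
print(f"adjusted Rand index: {ari:.2f}")
```
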
7. Performance Comparison
Classification Performance: KNN vs Naïve Bayes

S1: Banknote Authentication


• KNN started at 93.09% accuracy with two features and reached 100% as more
features were added.
• Naïve Bayes started at 82.91% and slipped to around 81.82% as further
features were added.

• Overall, KNN outperformed Naïve Bayes significantly on this dataset.

S2: Blood Transfusion


• Both models performed similarly, with accuracies clustered in the mid-70s
(roughly 74.67% to 76%).

• Naïve Bayes slightly outperformed KNN at the 2-feature level (76% vs 74.67%),
but KNN matched or exceeded it as features increased.

• Neither model showed a clear advantage on this dataset.


Figure 5: Dataset S2 and Dataset S3


8. Observations and Conclusions
This project demonstrated practical applications of data science techniques across
three datasets. The key observations and takeaways are summarized below:
Classification (S1 and S2):
• Impact of Feature Count: As the number of features increased, model accuracy
generally improved. This trend was more prominent in the S1 dataset.

• S1 Dataset:
o KNN performed exceptionally well, achieving up to 100% accuracy with
all four features.
o Naïve Bayes performed reasonably but showed a plateau and slight
decline with additional features.
o The dataset's linearly separable features favoured KNN, which relies on
distance-based classification.
• S2 Dataset:
o Both classifiers performed similarly, with marginal differences in
accuracy.
o Naïve Bayes slightly outperformed KNN at lower feature counts, but KNN
caught up or surpassed it as more features were included.
o The closeness in results suggests the data may not be strongly separable
by any particular algorithm without more complex modelling.
Clustering (S3):
• Elbow Method: Suggested the optimal number of clusters for the user
knowledge dataset.
• K-Means Clustering: Successfully grouped the dataset into meaningful clusters
that visually aligned with actual knowledge levels.
• The clustering results validated that the five features (STG, SCG, STR, LPR, PEG)
provide reasonable separation and patterning within the dataset.
General Insights:
• KNN is highly effective for datasets with well-separated classes and low
dimensionality.
• Naïve Bayes provides decent baseline performance, especially when the
features roughly satisfy its conditional-independence assumption.
• Feature selection and engineering significantly influence performance;
thoughtful selection improves results even with simpler models.
• Visualization techniques such as bivariate plots and elbow method plots are
crucial in understanding data structure and model behaviour.
