
Assignment No.

Introduction to Data Science

Deadline: 27 Mar 25

Total Course Weightage: 5%

NAME: FAIZAN MAJEED ENROLLMENT NO: 03-134221-011

Text Mining & Text Analytics

1. Preprocessing:

The preprocessing steps applied are (a short sketch follows the list):

• Tokenization: Split text into individual words/tokens using nltk.word_tokenize(), or a simple split() as a fallback. This breaks text into manageable units.
• Lowercasing: Convert all tokens to lowercase to ensure uniformity (e.g., "Data" and "data" are treated the same).
• Stopword Removal: Filter out common English stopwords (e.g., "the", "and") using NLTK's predefined list. This reduces noise and focuses on meaningful words.
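A minimal sketch of these three steps, assuming NLTK is installed and its "punkt" and "stopwords" resources can be downloaded (the isalpha() filter, which also drops punctuation tokens, is a small extra not listed above):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # English stopword list

STOPWORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    """Tokenize, lowercase, and remove English stopwords."""
    try:
        tokens = word_tokenize(text)     # preferred tokenizer
    except LookupError:
        tokens = text.split()            # simple fallback, as described above
    tokens = [t.lower() for t in tokens]                    # lowercasing
    return [t for t in tokens if t not in STOPWORDS and t.isalpha()]

print(preprocess("Data Science and the Game of Thrones"))
# ['data', 'science', 'game', 'thrones']
```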

2. Model Building: Python Code (Naive Bayes Classifier):

Naive Bayes is chosen for its efficiency with high-dimensional text data and its robustness to irrelevant features. It assumes feature independence (each word's presence is independent of the others given the class), which simplifies computation while often performing well in text classification despite the assumption's naivety.
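A minimal scikit-learn sketch of such a classifier, where posts (a list of post texts) and labels (their subreddit names) are assumed to be loaded already:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hold out 20% of the posts for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.2, random_state=42
)

# Bag-of-words counts feed a multinomial Naive Bayes model, which treats
# each word's count as conditionally independent given the class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```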

3. Model Evaluation:

a. Accuracy: 82% of posts are correctly classified.
b. Precision/Recall: High scores indicate the model reliably distinguishes the two topics.
c. Confusion Matrix: Shows 4 Data Science posts misclassified as GameOfThrones and 3 GameOfThrones posts misclassified as Data Science.
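These metrics could be computed as follows, reusing y_test and y_pred from the sketch above:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))        # rows = true class, cols = predicted
```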
4. Result Analysis:

The model performs well (82% accuracy) because the two subreddits have largely distinct vocabularies (e.g., "data" vs. "king"). However, words that occur in both topics (e.g., "plot", which can mean a data plot or a story plot) can cause misclassifications and reduce performance.

Improvements (a TF-IDF sketch follows the list):

• Use TF-IDF for feature weighting.
• Add lemmatization/stemming.
• Include n-grams for context.
• Experiment with SVM or Neural Networks.
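As one example, the first and third improvements can be sketched by swapping the count vectorizer for a TF-IDF vectorizer with unigrams and bigrams (variable names reuse the earlier sketch):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# TF-IDF weighting plus bigrams in place of raw unigram counts.
tfidf_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    MultinomialNB(),
)
tfidf_model.fit(X_train, y_train)
print("TF-IDF accuracy:", tfidf_model.score(X_test, y_test))
```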

Questions for Reflection and Answers

1. What were the key preprocessing steps you took to prepare the text for classification? Why are these steps important?

Ans: Tokenization, lowercasing, and stopword removal standardize the text and reduce noise, which is crucial for meaningful feature extraction.

2. What is the significance of using TF-IDF as a vectorization method? How does it differ from using a simple word count vectorizer?

Ans: TF-IDF weights words by their importance across documents, unlike raw word counts, which may overemphasize frequent but irrelevant terms.
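A toy comparison on an assumed three-document corpus makes the difference concrete: "data" appears in every document, so TF-IDF downweights it relative to rarer, more discriminative words, while raw counts treat all words alike:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["data science jobs", "data plots", "data king throne"]
for Vec in (CountVectorizer, TfidfVectorizer):
    vec = Vec()
    X = vec.fit_transform(docs)
    # Weights of the first document's words under each scheme.
    print(Vec.__name__, dict(zip(vec.get_feature_names_out(), X.toarray()[0].round(2))))
# CountVectorizer gives "data", "jobs", and "science" equal weight (1 each);
# TfidfVectorizer gives "data" ~0.39 vs. ~0.65 for "jobs" and "science".
```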

3. Why do you think Naive Bayes works well for text classification, and what are its limitations?

Ans: It is efficient and handles high-dimensional data well, but it assumes independent features, which limits its ability to understand context.
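For reference, the independence assumption can be written as

P(class | w1, ..., wn) ∝ P(class) × P(w1 | class) × ... × P(wn | class),

i.e., each word contributes to the class score independently of the others, so word order and context are ignored.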

4. Based on the results, what additional preprocessing or model tuning could help improve classification accuracy?

Ans: TF-IDF, lemmatization, hyperparameter tuning (a grid-search sketch follows), and advanced models (e.g., BERT).
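One way the tuning could look, searched over the TF-IDF pipeline from the earlier sketch (step names follow make_pipeline's auto-generated lowercase class names):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 0.5, 1.0],   # Laplace smoothing strength
}
search = GridSearchCV(tfidf_model, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, "CV accuracy:", search.best_score_)
```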

5. In a real-world scenario, how could this classification model be used in an application such as customer feedback analysis or social media monitoring?

Ans: Classify customer feedback into categories (e.g., "billing", "support") or monitor social media trends for brand mentions.

Submission Requirements:

Submit a Jupyter Notebook or Python script with your code, explanations, and results.
Include a short report (1-2 pages) summarizing your findings, model evaluation, and
suggestions for improvement.
