
Assignment No.

Introduction to Data Science

Deadline: 27 Mar 25

Total Course Weightage: 5%

NAME: FAIZAN MAJEED ENROLLMENT NO: 03-134221-011

Text Mining & Text Analytics

1. Preprocessing:

The preprocessing steps applied are (a short sketch follows the list):

• Tokenization: Split text into individual words/tokens using nltk.word_tokenize(), or a simple split() as a fallback. This breaks text into manageable units.
• Lowercasing: Convert all tokens to lowercase to ensure uniformity (e.g., "Data" and "data" are treated the same).
• Stopword Removal: Filter out common English stopwords (e.g., "the", "and") using NLTK's predefined list. This reduces noise and focuses on meaningful words.
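A minimal sketch of these three steps, assuming NLTK is installed and its "punkt" and "stopwords" resources can be downloaded (the isalpha() filter, which also drops punctuation tokens, is a small extra not listed above):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # English stopword list

STOPWORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    """Tokenize, lowercase, and remove English stopwords."""
    try:
        tokens = word_tokenize(text)     # preferred tokenizer
    except LookupError:
        tokens = text.split()            # simple fallback, as described above
    tokens = [t.lower() for t in tokens]                    # lowercasing
    return [t for t in tokens if t not in STOPWORDS and t.isalpha()]

print(preprocess("Data Science and the Game of Thrones"))
# ['data', 'science', 'game', 'thrones']
```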

2. Model Building: Python Code (Naive Bayes Classifier):

Naive Bayes is chosen for its efficiency with high-dimensional text data and its robustness to irrelevant features. It assumes feature independence (each word's presence is independent of the others given the class), which simplifies computation while often performing well in text classification despite the assumption's naivety.
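A minimal scikit-learn sketch of such a classifier, where posts (a list of post texts) and labels (their subreddit names) are assumed to be loaded already:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hold out 20% of the posts for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.2, random_state=42
)

# Bag-of-words counts feed a multinomial Naive Bayes model, which treats
# each word's count as conditionally independent given the class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```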

3. Model Evaluation:

a. Accuracy: 82% of posts are correctly classified.
b. Precision/Recall: High scores indicate the model reliably distinguishes the two topics.
c. Confusion Matrix: Shows 4 Data Science posts misclassified as GameOfThrones and 3 GameOfThrones posts misclassified as Data Science.
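These metrics could be computed as follows, reusing y_test and y_pred from the sketch above:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))        # rows = true class, cols = predicted
```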
4. Result Analysis:

The model performs well (82% accuracy) because the two subreddits have largely distinct vocabularies (e.g., "data" vs. "king"). However, words that occur in both topics (e.g., "plot", which can mean a data plot or a story plot) can cause misclassifications and reduce performance.

Improvements (a TF-IDF sketch follows the list):

• Use TF-IDF for feature weighting.
• Add lemmatization/stemming.
• Include n-grams for context.
• Experiment with SVM or Neural Networks.
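As one example, the first and third improvements can be sketched by swapping the count vectorizer for a TF-IDF vectorizer with unigrams and bigrams (variable names reuse the earlier sketch):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# TF-IDF weighting plus bigrams in place of raw unigram counts.
tfidf_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
    MultinomialNB(),
)
tfidf_model.fit(X_train, y_train)
print("TF-IDF accuracy:", tfidf_model.score(X_test, y_test))
```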

Questions for Reflection and Answers

1. What were the key preprocessing steps you took to prepare the text for classification? Why are these steps important?

Ans: Tokenization, lowercasing, and stopword removal standardize the text and reduce noise, which is crucial for meaningful feature extraction.

2. What is the significance of using TF-IDF as a vectorization method? How does it differ from using a simple word count vectorizer?

Ans: TF-IDF weights words by their importance across documents, unlike raw word counts, which may overemphasize frequent but irrelevant terms.
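A toy comparison on an assumed three-document corpus makes the difference concrete: "data" appears in every document, so TF-IDF downweights it relative to rarer, more discriminative words, while raw counts treat all words alike:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["data science jobs", "data plots", "data king throne"]
for Vec in (CountVectorizer, TfidfVectorizer):
    vec = Vec()
    X = vec.fit_transform(docs)
    # Weights of the first document's words under each scheme.
    print(Vec.__name__, dict(zip(vec.get_feature_names_out(), X.toarray()[0].round(2))))
# CountVectorizer gives "data", "jobs", and "science" equal weight (1 each);
# TfidfVectorizer gives "data" ~0.39 vs. ~0.65 for "jobs" and "science".
```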

3. Why do you think Naive Bayes works well for text classification, and what are its limitations?

Ans: It is efficient and handles high-dimensional data well, but it assumes independent features, which limits its ability to understand context.
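For reference, the independence assumption can be written as

P(class | w1, ..., wn) ∝ P(class) × P(w1 | class) × ... × P(wn | class),

i.e., each word contributes to the class score independently of the others, so word order and context are ignored.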

4. Based on the results, what additional preprocessing or model tuning could help improve classification accuracy?

Ans: TF-IDF, lemmatization, hyperparameter tuning (a grid-search sketch follows), and advanced models (e.g., BERT).
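One way the tuning could look, searched over the TF-IDF pipeline from the earlier sketch (step names follow make_pipeline's auto-generated lowercase class names):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.1, 0.5, 1.0],   # Laplace smoothing strength
}
search = GridSearchCV(tfidf_model, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, "CV accuracy:", search.best_score_)
```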

5. In a real-world scenario, how could this classification model be used in an application such as customer feedback analysis or social media monitoring?

Ans: Classify customer feedback into categories (e.g., "billing", "support") or monitor social media trends for brand mentions.

Submission Requirements:

Submit a Jupyter Notebook or Python script with your code, explanations, and results.
Include a short report (1-2 pages) summarizing your findings, model evaluation, and
suggestions for improvement.
