Deliverables and Question Answers
Deadline: 27 Mar 25
1. Preprocessing:
3. Model Evaluation:
The model performs well (82% accuracy) because the two subreddits use largely distinct vocabularies
(e.g., "data" vs. "king"). However, words that appear in both communities (e.g., "plot") create
ambiguity and can lead to misclassified posts, reducing performance.
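A minimal sketch of how this evaluation could be produced (model, X_test, and y_test are assumed names from the training step in the notebook):

    # Assumed names: model, X_test, y_test come from the training step in the notebook.
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    y_pred = model.predict(X_test)                       # predict a subreddit for each held-out post
    print("Accuracy:", accuracy_score(y_test, y_pred))   # overall share of correctly classified posts
    print(classification_report(y_test, y_pred))         # per-subreddit precision, recall, and F1
    print(confusion_matrix(y_test, y_pred))              # shows which subreddits get confused with each other

The confusion matrix makes it easy to check whether the errors come from posts that contain the overlapping vocabulary.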
Improvements:
Add lemmatization/stemming.
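A sketch of that improvement with NLTK (tokens is assumed to be the list of lowercased tokens produced during preprocessing):

    # Assumes `tokens` is a list of lowercased word tokens from the preprocessing step.
    import nltk
    nltk.download("wordnet", quiet=True)
    from nltk.stem import WordNetLemmatizer, PorterStemmer

    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]   # "plots" -> "plot", maps to dictionary form
    stems = [stemmer.stem(t) for t in tokens]            # "plotting" -> "plot", cruder suffix stripping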
1. What were the key preprocessing steps you took to prepare the text for
classification? Why are these steps important?
Ans: Tokenization, lowercasing, and stopword removal standardize the text and reduce noise, which is
essential for extracting meaningful features.
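A minimal sketch of these steps with NLTK (raw_text is a placeholder for the body of one Reddit post):

    # raw_text is a placeholder for the body of one Reddit post.
    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(raw_text.lower())                              # tokenize and lowercase
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # drop punctuation/numbers and stopwords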
2. How does TF-IDF improve on simple word counts for feature extraction?
Ans: It weights words by their importance across documents, unlike raw word counts, which can
overemphasize frequent but uninformative terms.
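A short sketch contrasting the two representations with scikit-learn (docs is a hypothetical list of preprocessed post strings):

    # docs is a hypothetical list of preprocessed post strings.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    counts = CountVectorizer().fit_transform(docs)   # raw term frequencies per post
    tfidf = TfidfVectorizer().fit_transform(docs)    # frequencies down-weighted for words common across all posts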
3. Why do you think Naive Bayes works well for text classification, and what are its
limitations?
Ans: It is fast and handles high-dimensional, sparse text features well, but its assumption that
features (words) are independent limits its ability to capture context and word order.
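A minimal sketch of the classifier with scikit-learn, assuming a TF-IDF feature matrix X and subreddit labels y from the earlier steps:

    # X (TF-IDF matrix) and y (subreddit labels) are assumed from the earlier steps.
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = MultinomialNB().fit(X_train, y_train)   # treats each word as an independent feature given the class
    print(model.score(X_test, y_test))              # held-out accuracy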
4. Based on the results, what additional preprocessing or model tuning could help
improve classification accuracy?
Ans: TF-IDF weighting, lemmatization, hyperparameter tuning, and more advanced models (e.g., BERT).
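One possible tuning step is a grid search over the Naive Bayes smoothing parameter and the TF-IDF n-gram range; a sketch follows (texts and labels are placeholders for the preprocessed posts and their subreddits):

    # texts and labels are placeholders for the preprocessed posts and their subreddit labels.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
    params = {"nb__alpha": [0.1, 0.5, 1.0], "tfidf__ngram_range": [(1, 1), (1, 2)]}
    grid = GridSearchCV(pipe, params, cv=5)          # 5-fold cross-validation over the parameter grid
    grid.fit(texts, labels)
    print(grid.best_params_, grid.best_score_)       # best settings and their cross-validated accuracy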
5. How could this classification approach be applied in a real-world scenario?
Ans: Classify customer feedback into categories (e.g., "billing", "support") or monitor social media
for brand mentions and emerging trends.
Submission Requirements:
Submit a Jupyter Notebook or Python script with your code, explanations, and results.
Include a short report (1-2 pages) summarizing your findings, model evaluation, and
suggestions for improvement.