Spam Detection Viva Questions Full
**1. What is the main objective of your project, and why is spam email detection important?**
- **Answer:** The main objective is to detect spam emails efficiently using machine learning models
to enhance email security, protect user privacy, and reduce exposure to phishing and malware.
**2. Can you explain the workflow of your project from data collection to model evaluation?**
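- **Answer:**
1. **Data Collection:** Used a labeled spam/ham email dataset from Kaggle.
2. **Data Cleaning:** Removed irrelevant columns and duplicates, renamed columns, and checked
for missing values.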
3. **EDA:** Explored the dataset and identified imbalances between spam and ham emails.
4. **Text Preprocessing:** Applied label encoding, text cleaning, stemming, and vectorization.
**3. Why did you choose the specific dataset, and what are its key characteristics?**
- **Answer:** The Kaggle dataset is well-labeled and widely used for spam detection, containing
examples of both spam and ham emails. Its diversity helps train robust models.
**4. How do you define spam and ham emails in the context of this project?**
- **Answer:** Spam emails are unwanted, potentially harmful messages, while ham emails are
legitimate messages that the recipient wants to receive.
---
**5. What was the purpose of cleaning the dataset, and what techniques did you use?**
- **Answer:** Cleaning ensures data quality and consistency. Techniques included removing
irrelevant columns and duplicates and renaming columns.
**6. Why did you perform label encoding, and what does it achieve?**
- **Answer:** Label encoding converts categorical labels (ham/spam) into numerical format (0/1),
which machine learning models require for training.
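As a minimal sketch of the idea, the ham/spam to 0/1 mapping can be written as a plain dictionary lookup. The names `LABEL_TO_ID` and `encode_labels` are illustrative and not taken from the project. scikit-learn's `LabelEncoder` assigns ids in sorted order of the class names, which gives the same ham = 0, spam = 1 result.

```python
# Map the string labels to integers: ham -> 0, spam -> 1.
LABEL_TO_ID = {"ham": 0, "spam": 1}


def encode_labels(labels):
    """Return a list of integer class ids for a list of 'ham'/'spam' strings."""
    return [LABEL_TO_ID[label] for label in labels]


encoded = encode_labels(["ham", "spam", "ham"])
print(encoded)  # [0, 1, 0]
```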
**7. What preprocessing steps did you apply to the email text data, and why are they necessary?**
- **Answer:** Steps include lowercasing, tokenization, removing special characters and stopwords,
and stemming. These steps normalize the data and reduce noise for better feature extraction.
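The preprocessing steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's actual code: the stopword set is a tiny hand-picked sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as the Porter stemmer.

```python
import re

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "for", "of", "and", "you", "your"}


def naive_stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop special characters and stopwords, then stem."""
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)  # keeps alphanumeric runs only
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("You are WINNING a free prize!!! Claim rewards now"))
# ['winn', 'free', 'prize', 'claim', 'reward', 'now']
```

The output "winn" shows the limit of plain suffix stripping: a proper stemmer also undoubles the final consonant and returns "win".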
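**8. What is stemming, and why did you use it?**
- **Answer:** Stemming reduces words to their root form (e.g., "running" to "run"), minimizing
the number of distinct word forms the model has to learn.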
**9. How did you handle imbalanced data in the project, and why is it important?**
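- **Answer:** Imbalance was observed but not directly addressed. Handling imbalance (e.g., via
resampling or class weighting) is important because a model trained on skewed data can favor the
majority class.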
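One way imbalance could be handled, although the project did not do so, is class weighting. The sketch below computes weights as `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`. The function name and the 80/20 split are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


labels = [0] * 8 + [1] * 2  # hypothetical 80/20 ham/spam split
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

The rarer spam class receives the larger weight, so its misclassifications count more during training.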
**10. Why did you choose TfidfVectorizer over CountVectorizer for feature extraction?**
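- **Answer:** TfidfVectorizer weights each term by its frequency within a document and its rarity
across documents, reducing the impact of common but less informative words compared to
CountVectorizer.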
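To illustrate the difference, here is a small hand-rolled TF-IDF sketch. It uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents as the default for `TfidfVectorizer`, but omits the L2 normalization step for brevity. The toy documents are made up for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (no normalization)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]


docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "now"],
    ["meeting", "call", "tomorrow"],
]
weights = tfidf(docs)
# "call" occurs in every document, "prize" in only one.
print(weights[0]["call"] < weights[0]["prize"])  # True
```

With raw counts, "call" and "prize" would both score 1 in the first document. TF-IDF keeps "call" at 1.0 and raises "prize" because it is rare across the corpus.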
**11. What does the `max_features` parameter in TfidfVectorizer do, and how does it improve
performance?**
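- **Answer:** The `max_features` parameter limits the number of features, focusing on the most
frequent terms in the corpus. A smaller vocabulary reduces dimensionality, training time, and noise
from rare words.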
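The effect of a feature cap can be sketched as keeping only the `k` terms with the highest total count across the corpus. The function `top_k_vocabulary` and the alphabetical tie-break are assumptions made for this illustration, not scikit-learn's internals.

```python
from collections import Counter


def top_k_vocabulary(docs, k):
    """Keep the k terms with the highest total count across all documents."""
    totals = Counter(term for doc in docs for term in doc)
    # Sort by descending count, then alphabetically so ties break deterministically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "now"],
    ["free", "call"],
]
vocab = top_k_vocabulary(docs, k=3)
print(vocab)  # ['call', 'free', 'now']
```

Rare terms such as "prize" and "me" fall outside the vocabulary, so they never become feature columns.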
---
**12. Why did you test multiple machine learning models, and how did you select the best one?**
- **Answer:** Testing multiple models helps identify the one best suited for the data. The best model
was selected by comparing accuracy and precision.
**13. What are the advantages of using Multinomial Naive Bayes for this task?**
- **Answer:** Multinomial Naive Bayes is efficient, works well with textual data, and performs
well on high-dimensional, sparse features such as word counts.
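To show why the model is cheap to train, here is a minimal multinomial Naive Bayes sketch over token lists with Laplace smoothing. `TinyMultinomialNB` and the four training messages are hypothetical and far simpler than scikit-learn's `MultinomialNB`; training amounts to counting tokens per class.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over token lists, with Laplace smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = {t for doc in docs for t in doc}
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(t for d in class_docs for t in d)
            total = sum(counts.values()) + alpha * len(self.vocab)
            self.log_likelihood[c] = {
                t: math.log((counts[t] + alpha) / total) for t in self.vocab
            }
        return self

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Tokens never seen in training are skipped.
            scores[c] = self.log_prior[c] + sum(
                self.log_likelihood[c][t] for t in doc if t in self.vocab
            )
        return max(scores, key=scores.get)


train_docs = [
    ["free", "prize", "now"],
    ["win", "free", "cash"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
train_labels = ["spam", "spam", "ham", "ham"]
model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["free", "cash", "prize"]))   # spam
print(model.predict(["meeting", "at", "lunch"]))  # ham
```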
**14. Why did you use accuracy and precision as evaluation metrics? Are there any other metrics
you considered?**
- **Answer:** Accuracy measures overall correctness, while precision measures the proportion of
emails predicted as spam that are actually spam. Recall and F1-score could also be used for a
balanced assessment.
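The four metrics named above follow directly from the confusion-matrix counts. This is a small sketch with made-up labels, treating spam (1) as the positive class. The function name is illustrative.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 spam and 7 ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.8 while precision and recall are both 2/3, which shows how accuracy alone can look better than the spam-specific metrics on imbalanced data.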
**15. What were the challenges in training the models, and how did you address them?**
---
**16. How did you optimize the model's performance, and what results did you achieve?**
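- **Answer:** Performance was tuned by limiting TfidfVectorizer's `max_features` so that the model
would focus on significant terms. This combination with MultinomialNB yielded the best accuracy and
precision.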
**17. What are the potential limitations of your current approach, and how could they be
addressed?**
- **Answer:** Limitations include the inability to handle real-time adaptation and reliance on static
data. Incorporating user feedback and advanced models like RNNs could address these.
---
**18. How can deep learning models like RNNs or LSTMs improve spam detection performance?**
- **Answer:** RNNs and LSTMs capture sequential patterns in text, making them better suited for
messages whose meaning depends on word order and context.
**19. What additional features or techniques could you incorporate to make the detection system
more robust?**
- **Answer:** Advanced feature engineering like semantic analysis, word embeddings, or ensemble
methods could make the system more robust.
---
**20. Can you explain the difference between precision and recall, and why precision is more critical
in spam detection?**
- **Answer:** Precision measures the proportion of correctly identified spam among all predicted
spam, while recall measures the proportion of correctly identified spam among all actual spam.
Precision is more critical in spam detection to minimize false positives and avoid filtering legitimate
emails.
**21. What are the potential real-world applications of your spam detection system?**
- **Answer:** Applications include email security solutions, anti-phishing systems, and tools to