IR - Group 1
EFFICIENT ENGLISH TEXT CLASSIFICATION
USING SELECTED MACHINE LEARNING TECHNIQUES
Group - 1
SUPPORT VECTOR MACHINE (SVM)
Overview: SVM is a classification algorithm that finds the optimal hyperplane to separate data into classes.
Key Concepts:
• Hyperplane: Divides data into different classes.
• Support Vectors: Key data points that influence the hyperplane’s position.
• Maximal Margin: SVM maximizes the margin between classes to improve
classification accuracy.
Advantages:
• Effective in high-dimensional data, ideal for text classification.
• Reduces overfitting with a clear margin of separation.
Applications: Used in spam detection, sentiment analysis, and document
categorization.
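These concepts can be sketched with scikit-learn's SVC on a tiny invented 2-D dataset (toy points, not the paper's text data); after fitting, only the boundary points are kept as support vectors:

```python
from sklearn.svm import SVC

# Two linearly separable classes in 2-D (invented toy points).
X = [[0, 0], [1, 1], [1, 0], [3, 3], [4, 4], [3, 4]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel finds the maximal-margin hyperplane between the classes.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only the points closest to the boundary become support vectors;
# removing any other point would leave the hyperplane unchanged.
print("support vectors:", clf.support_vectors_.tolist())
print("predictions:", clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```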
NAÏVE BAYES (NB)
Overview: NB is a probabilistic classifier that applies Bayes' theorem with the assumption of
independence between features.
Key Concepts:
• Conditional Probability: Calculates the probability of a class given
the input features.
• Feature Independence: Assumes each feature contributes
independently to the probability.
Advantages:
• Simple, efficient, and works well with large datasets.
• Ideal for high-speed text classification tasks.
Applications: Used in spam filtering, document categorization,
and sentiment analysis.
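A hand-computed sketch of the same idea, with invented word probabilities for a spam/ham example: under the independence assumption, the per-word likelihoods simply multiply.

```python
# Invented toy probabilities, for illustration only.
priors = {"spam": 0.4, "ham": 0.6}
# P(word | class); naive Bayes assumes these are independent given the class.
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.05, "meeting": 0.20},
}

def posterior(words):
    """Bayes' theorem: P(class | words) is proportional to P(class) * product of P(word | class)."""
    scores = {}
    for c in priors:
        p = priors[c]
        for w in words:
            p *= likelihood[c][w]  # independence: likelihoods multiply
        scores[c] = p
    total = sum(scores.values())  # normalize so the posteriors sum to 1
    return {c: s / total for c, s in scores.items()}

print(posterior(["free"]))     # "free" pushes the posterior toward spam
print(posterior(["meeting"]))  # "meeting" pushes it toward ham
```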
LOGISTIC REGRESSION (LR)
Overview: LR is a linear classifier that models the probability of a class using the logistic (sigmoid) function.
Advantages:
• Interpretable model with efficient performance in binary tasks.
• Suitable for text data with linear separability.
Applications: Used in sentiment analysis, binary text classification, and medical diagnosis.
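A minimal sketch of the LR decision rule, with hypothetical hand-set weights (a real model would learn them from data):

```python
import math

def sigmoid(z):
    # Maps any real score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for two features, e.g. counts of "great" and "awful".
w = [1.2, -2.0]
b = -0.1

def predict_proba(x):
    # A linear score passed through the sigmoid gives P(positive class).
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

print(predict_proba([2, 0]))  # mostly "great": high positive probability
print(predict_proba([0, 2]))  # mostly "awful": low positive probability
```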
METHODOLOGY AND
IMPLEMENTATION SETUP
IMPORTING NECESSARY LIBRARIES
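The code on this slide did not survive conversion; a plausible set of imports for the pipeline the later slides describe (nltk for preprocessing, scikit-learn for the three models, matplotlib for the comparison plot) would be:

```python
# Preprocessing (slide: "Library: nltk")
from nltk.stem import PorterStemmer

# Feature extraction, splitting, and hyperparameter search
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV

# The three classifiers compared in the deck
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Evaluation and plotting
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import matplotlib.pyplot as plt
```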
Data Preprocessing
• Steps taken: Tokenization, stop-word removal, and stemming (PorterStemmer)
• Benefits of preprocessing: Improves feature relevance, reduces dimensionality, and optimizes performance.
• Library: nltk
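A sketch of these steps. PorterStemmer comes from nltk as the slide states; to keep the example self-contained it uses a regex tokenizer and a small inline stop-word list instead of nltk's `word_tokenize` and `stopwords`, which need a one-time `nltk.download`:

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stop-word list; the project presumably uses nltk's full list.
STOP_WORDS = {"the", "is", "a", "an", "are", "of", "and", "to", "in"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]             # stemming

print(preprocess("The players are running in the stadium"))
# fewer, normalized tokens -> lower dimensionality, more relevant features
```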
PREPARING DATA
Dataset Description
• Data sources: UCI library and other English news websites.
• Overview of dataset categories: topics such as sports, literature, campus news, etc.
• Prepared three categories of data, as described in the paper.
MODEL TRAINING
Steps:
- Vectorize the text data
- Split the train and test data
- Define the parameters for hyperparameter tuning
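These three steps might look like the following (toy texts and illustrative parameter values, not the deck's actual dataset or grids):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented toy corpus standing in for the news dataset.
texts = ["great match tonight", "stock prices fell", "the team won again",
         "markets rally on news", "coach praises players", "investors buy shares"]
labels = ["sports", "finance", "sports", "finance", "sports", "finance"]

# Step 1: vectorize the text data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Step 2: split into train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42, stratify=labels)

# Step 3: define parameter grids for hyperparameter tuning
param_grids = {
    "svm": {"C": [0.1, 1, 10]},
    "nb":  {"alpha": [0.5, 1.0]},
    "lr":  {"C": [0.1, 1, 10]},
}
```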
MODEL TRAINING (cont.)
Steps:
- Initialize each model with GridSearchCV
- Fit the models
- Get the best estimators
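A self-contained sketch of this slide's steps, reusing the same invented toy corpus (a small `cv` because of the tiny data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

texts = ["great match tonight", "stock prices fell", "the team won again",
         "markets rally on news", "coach praises players", "investors buy shares"]
labels = ["sports", "finance", "sports", "finance", "sports", "finance"]
X = TfidfVectorizer().fit_transform(texts)

# Model initialization with GridSearchCV over illustrative grids.
models = {
    "svm": (LinearSVC(), {"C": [0.1, 1, 10]}),
    "nb":  (MultinomialNB(), {"alpha": [0.5, 1.0]}),
    "lr":  (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}

best = {}
for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, cv=2)  # fit the models...
    search.fit(X, labels)
    best[name] = search.best_estimator_       # ...and keep the best estimator
```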
MODEL TRAINING (cont.)
Steps:
- Evaluate each model on the test data
- Store the result metrics for later comparison
- Return the results
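One way to sketch the evaluation step, fitting a single NB model on toy texts and storing its metrics in a dict for later comparison:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy training data (invented); the real pipeline uses the prepared datasets.
train_texts = ["great match tonight", "stock prices fell",
               "the team won again", "markets rally on news"]
train_labels = ["sports", "finance", "sports", "finance"]

vec = TfidfVectorizer()
model = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

# Evaluate on held-out texts...
y_test = ["sports", "finance"]
y_pred = model.predict(vec.transform(["the team played a great match",
                                      "stock markets fell again"]))

# ...and store the metrics for further comparison.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0)
results = {"nb": {"accuracy": accuracy_score(y_test, y_pred),
                  "precision": precision, "recall": recall, "f1": f1}}
print(results)  # returned to the caller in the real pipeline
```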
GETTING THE RESULTS
Steps:
- Compute the results for each model on each dataset
- Combine and plot a graph of the results
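The combine-and-plot step might be sketched as a grouped bar chart; the scores below are invented placeholders, not the paper's or our actual numbers:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

datasets = ["dataset 1", "dataset 2", "dataset 3"]
# Placeholder accuracies for illustration only.
scores = {"SVM": [0.91, 0.88, 0.90],
          "NB":  [0.86, 0.84, 0.87],
          "LR":  [0.89, 0.85, 0.88]}

width = 0.25
fig, ax = plt.subplots()
for i, (name, vals) in enumerate(scores.items()):
    # Offset each model's bars so the three groups sit side by side.
    ax.bar([x + i * width for x in range(len(datasets))], vals, width, label=name)
ax.set_xticks([x + width for x in range(len(datasets))])
ax.set_xticklabels(datasets)
ax.set_ylabel("Accuracy")
ax.legend()
fig.savefig("model_comparison.png")
```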
EVALUATION METRICS
• Metrics used: Precision, recall, F1-score, accuracy.
• Importance: These metrics provide a well-rounded view of model performance, particularly in multi-class settings.
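For a single class, these metrics reduce to simple ratios over the confusion counts; a worked toy example with invented counts:

```python
# Invented confusion counts for one class: true/false positives/negatives.
tp, fp, fn, tn = 8, 2, 1, 9

precision = tp / (tp + fp)                   # of predicted positives, how many are right
recall = tp / (tp + fn)                      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)   # overall fraction correct

print(precision, recall, f1, accuracy)
# In multi-class settings these are computed per class and then averaged
# (macro/micro), which is why together they give a well-rounded view.
```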
RESULTS AND COMPARATIVE ANALYSIS
• SVM achieves higher precision than NB and LR on the datasets we selected.
OUR RESULTS
Code Link: https://colab.research.google.com/drive/1op8272F2ciF2NE7BZkHaKgTEcOzfvIeq?usp=sharing#scrollTo=aAOyQmdMRiX9
Resources
• Luo, X. (2021). Efficient English text classification using selected machine
learning techniques. Alexandria Engineering Journal, 60(4), 3401-3409.
https://doi.org/10.1016/j.aej.2021.02.009
• Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the
12th International Conference on Machine Learning (ICML), 331–339. Dataset
available via fetch_20newsgroups in scikit-learn (Pedregosa et al., 2011).
THANK YOU
Code Implementation of the Paper
by Group 1