
EFFICIENT ENGLISH TEXT CLASSIFICATION

USING SELECTED
MACHINE LEARNING TECHNIQUES

Group - 1

Raman Sharma | 22114076
Vikas Kumar | 22114106
Vineet Kumar | 22114107
Harish Nenavath | 22114061
Adarsh Dehariya | 22114002
Priyanshu Nareda | 22114072
INTRODUCTION
TITLE AND OBJECTIVE

• Title: Efficient English Text Classification using Selected Machine Learning Techniques
• Author: Xiaoyu Luo (Hunan University of Technology and Business, China)
• Journal & Publication Date: Alexandria Engineering Journal, February 2021
• Objective of the Study:
• To identify efficient machine learning techniques for English text classification by comparing the performance of various classifiers.
• A specific focus on Support Vector Machines (SVM) to evaluate its effectiveness relative to other models such as Logistic Regression and Naïve Bayes.
INTRODUCTION - SIGNIFICANCE OF TEXT CLASSIFICATION

• Text Classification (TC):


o Technique for categorizing documents into predefined classes using machine learning.
o Useful for organizing vast amounts of unstructured data for better retrieval and
management.
• Growing Demand:
o Significant growth in unstructured text data from corporate records, government, and
personal communications.
o Increase in blogging, social media, and online interactions has amplified the need for
efficient classification methods.
MACHINE LEARNING FOR TEXT CLASSIFICATION

• Role of ML in Text Classification:


• Supervised Learning: Uses labeled data for training, ensuring accurate categorization.
• Preprocessing: Involves stop-word removal, stemming, and feature selection for
optimized performance.
• Popular Applications:
• Spam Filtering: Recognizes and filters unwanted emails.
• Sentiment Analysis: Detects opinions on products and services.
• Fake News Detection: Identifies fraudulent or counterfeit content online.
KEY CHALLENGES

Challenges in Efficient Text Classification:


• Misclassification:
• Handling incorrectly labeled data remains challenging, especially in domains where
language or context may lead to ambiguities.
• Feature Selection and Dimensionality Reduction:
• With thousands of potential features (words or terms), feature selection is critical to
manage computational load and enhance classifier performance.
• Selecting the right features without compromising classification accuracy is essential.
MACHINE LEARNING
ALGORITHMS
SVM (SUPPORT VECTOR MACHINE)

Overview: SVM is a classification algorithm that finds the optimal hyperplane to separate data
into classes.
Key Concepts:
• Hyperplane: Divides data into different classes.
• Support Vectors: Key data points that influence the hyperplane’s position.
• Maximal Margin: SVM maximizes the margin between classes to improve
classification accuracy.
Advantages:
• Effective in high-dimensional data, ideal for text classification.
• Reduces overfitting with a clear margin of separation.
Applications: Used in spam detection, sentiment analysis, and document
categorization.
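As a minimal sketch of these concepts (not the paper's implementation), the snippet below fits a linear-kernel SVM on TF-IDF features; the toy corpus and labels are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Illustrative toy corpus and labels (assumed, not the paper's dataset)
texts = ["great match and final score", "stock prices fell sharply",
         "the team won the tournament", "markets rallied after the report"]
labels = ["sports", "finance", "sports", "finance"]

# Convert raw text to TF-IDF features, then fit a linear-kernel SVM
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = SVC(kernel="linear").fit(X, labels)

print(clf.predict(vectorizer.transform(["who won the final?"])))
```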
NAÏVE BAYES (NB)

Overview: NB is a probabilistic classifier that applies Bayes' theorem with the assumption of
independence between features.
Key Concepts:
• Conditional Probability: Calculates the probability of a class given
the input features.
• Feature Independence: Assumes each feature contributes
independently to the probability.
Advantages:
• Simple, efficient, and works well with large datasets.
• Ideal for high-speed text classification tasks.
Applications: Used in spam filtering, document categorization,
and sentiment analysis.
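A minimal sketch of Multinomial Naïve Bayes on word-count features; the example documents and labels are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative documents and labels (assumed, for demonstration)
docs = ["win a free prize now", "meeting rescheduled to monday",
        "claim your free reward", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

# MultinomialNB estimates P(class | features) from word counts,
# assuming each word contributes independently given the class
vec = CountVectorizer()
X = vec.fit_transform(docs)
nb = MultinomialNB().fit(X, labels)

print(nb.predict(vec.transform(["free prize waiting"])))
```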
LOGISTIC REGRESSION (LR)

Overview: LR is a linear model for binary classification, predicting probabilities and classifying based on a threshold.
Key Concepts:
• Sigmoid Function: Maps inputs to a probability between 0 and 1.
• Decision Boundary: Separates data into two classes based on
a threshold.

Advantages:
• Interpretable model with efficient performance in binary tasks.
• Suitable for text data with linear separability.

Applications: Used in sentiment analysis, binary text classification, and medical diagnosis.
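A minimal sketch of logistic regression for binary text classification: the sigmoid maps the linear score to a probability, and a 0.5 threshold gives the class. The tiny sentiment dataset is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative binary sentiment data (assumed)
texts = ["loved the product", "terrible experience",
         "works great", "waste of money"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
lr = LogisticRegression().fit(X, labels)

# predict_proba applies the sigmoid; predict thresholds at 0.5
print(lr.predict_proba(vec.transform(["really great value"])))
print(lr.predict(vec.transform(["really great value"])))
```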
METHODOLOGY AND
IMPLEMENTATION SETUP
IMPORTING NECESSARY LIBRARIES

• Basic: NumPy, Pandas
• Dataset: fetch_20newsgroups from sklearn
• Models: SVC, MultinomialNB, LogisticRegression from sklearn
• Metrics: sklearn libraries
• Plot: matplotlib
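A sketch of the imports this setup suggests; the specific metric and model-selection imports beyond what the slide names are assumptions.

```python
# Basic numerical and data-handling libraries
import numpy as np
import pandas as pd

# Dataset: the 20 Newsgroups corpus bundled with scikit-learn
from sklearn.datasets import fetch_20newsgroups

# Models compared in the study
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Metrics and model-selection utilities (assumed specific imports)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Plotting
import matplotlib.pyplot as plt
```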
PREPROCESSING
ON TEXT

Data Preprocessing
• Steps taken: Tokenization,
stop-word removal, and
stemming (PorterStemmer)
• Benefits of preprocessing:
Improves feature relevance,
reduces dimensionality, and
optimizes performance.
• Library: nltk
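A sketch of the preprocessing steps named on the slide (tokenization, stop-word removal, Porter stemming) using nltk; the helper function name is ours.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the required nltk resources
# (newer nltk versions may also need "punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Tokenize, drop non-alphabetic tokens and stop words, then stem
    tokens = word_tokenize(text.lower())
    kept = [t for t in tokens if t.isalpha() and t not in stop_words]
    return " ".join(stemmer.stem(t) for t in kept)

print(preprocess("The teams are playing exceptionally well this season!"))
```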
PREPARING DATA

Dataset Description
•Data sources: UCI library and other English news websites.
•Overview of dataset categories: Topics like sports, literature, campus news, etc.
Prepared three categories of data as described in the paper.
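A sketch of loading the three data sets with fetch_20newsgroups; the exact category groupings below are assumptions, since the slide only names the topics loosely.

```python
from sklearn.datasets import fetch_20newsgroups

# Hypothetical category groups standing in for the three data sets
# described in the paper (the exact groupings are assumptions)
category_sets = {
    "data1": ["rec.sport.baseball", "rec.sport.hockey", "sci.space"],
    "data2": ["comp.graphics", "sci.med"],
    "data3": ["talk.politics.misc", "rec.autos"],
}

datasets = {
    name: fetch_20newsgroups(subset="all", categories=cats,
                             remove=("headers", "footers", "quotes"))
    for name, cats in category_sets.items()
}

for name, data in datasets.items():
    print(name, len(data.data), "documents,", len(data.target_names), "classes")
```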
MODEL
TRAINING

Steps:
- Vectorize the text data
- Split the train and the test
data
- Define the parameters for
hyperparameter tuning
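A sketch of this step, assuming TF-IDF vectorization and an 80/20 split; the two categories loaded here and the grid values are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load one illustrative subset (categories are assumptions)
data = fetch_20newsgroups(subset="all",
                          categories=["rec.sport.hockey", "sci.space"],
                          remove=("headers", "footers", "quotes"))

# Vectorize the text data
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(data.data)
y = data.target

# Split into train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Parameters for hyperparameter tuning (values are assumptions)
param_grids = {
    "SVM": {"C": [0.1, 1, 10], "kernel": ["linear"]},
    "NB": {"alpha": [0.1, 0.5, 1.0]},
    "LR": {"C": [0.1, 1, 10]},
}
```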
MODEL
TRAINING

Steps:
- Model initialization using
GridSearchCV
- Fit the models
- Get the best estimators
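A sketch of this step, assuming the X_train/y_train split and param_grids from the previous sketch are in scope; 5-fold cross-validation and accuracy scoring are assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Pair each model with its hyperparameter grid from the previous sketch
models = {
    "SVM": SVC(),
    "NB": MultinomialNB(),
    "LR": LogisticRegression(max_iter=1000),
}

# Initialize each model inside GridSearchCV, fit, and keep the best estimator
best_estimators = {}
for name, model in models.items():
    search = GridSearchCV(model, param_grids[name], cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    best_estimators[name] = search.best_estimator_
    print(name, "best params:", search.best_params_)
```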
MODEL
TRAINING

Steps:
- Each model evaluation on
test data
- Store the result metrics for
further comparison
- Return the result
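A sketch of evaluating each tuned model on the held-out test data and storing the metrics; it assumes X_test, y_test, and best_estimators from the previous sketches, and macro averaging for the multi-class metrics.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Evaluate each tuned model on the test data and store the result metrics
results = {}
for name, estimator in best_estimators.items():
    y_pred = estimator.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro")
    results[name] = {"accuracy": accuracy_score(y_test, y_pred),
                     "precision": precision, "recall": recall, "f1": f1}

# Return the result metrics for further comparison
print(results)
```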
GETTING THE RESULT

Steps:
- Find the result for each
model for each data
- Combine and plot the graph
of the result.
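A sketch of combining the per-dataset results and plotting a grouped bar chart with matplotlib; it assumes a results_by_dataset mapping built by running the evaluation sketch once per data set (no actual numbers are shown here).

```python
import numpy as np
import matplotlib.pyplot as plt

# results_by_dataset is assumed to map each data set name ("data1", "data2",
# "data3") to the "results" dict produced by the evaluation sketch above
models = ["SVM", "NB", "LR"]
dataset_names = list(results_by_dataset)
x = np.arange(len(dataset_names))
width = 0.25

# One group of bars per data set, one bar per model
for i, model in enumerate(models):
    scores = [results_by_dataset[d][model]["precision"] for d in dataset_names]
    plt.bar(x + i * width, scores, width, label=model)

plt.xticks(x + width, dataset_names)
plt.ylabel("Precision")
plt.title("Precision of SVM, NB, and LR across the three data sets")
plt.legend()
plt.show()
```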
EVALUATION METRICS

• Metrics used: Precision, recall, F1-score, accuracy.
• Importance: These metrics provide a well-rounded view of model performance, particularly in multi-class settings.
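For a quick per-class breakdown of these metrics, scikit-learn's classification_report can be used; the tiny label arrays and class names below are assumptions purely for illustration.

```python
from sklearn.metrics import classification_report

# Toy true/predicted labels (assumed) to illustrate the report
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

# Precision, recall, and F1-score per class, plus overall accuracy
print(classification_report(y_true, y_pred,
                            target_names=["sports", "tech", "politics"]))
```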
RESULTS AND COMPARATIVE ANALYSIS

• SVM achieves higher precision than NB and LR on the datasets we selected.
OUR RESULTS

Insights from Comparative Analysis

- SVM shows superior precision and recall on datasets such as data1.
- NB is effective on smaller datasets such as data3.
DISCUSSION

• Challenges in Text Classification:


o Data Collection Impact: The quality of data collection affects preprocessing and text mining accuracy.
o Preprocessing Essentials: Tokenization, stop-word removal, stemming, and a vector space model are required for organizing the data.
o Literature Insights: Increasing the dataset size during training enhances evaluation accuracy.
CONCLUSION & FUTURE WORK

Conclusion:

• Rapid growth of text classification applications in IT.
• SVM outperforms NB and LR on the selected datasets with higher precision, recall, and F1-score.
• Each algorithm has strengths and weaknesses depending on dataset size and characteristics.
Future Work:

• Expand classification to larger datasets, like the BBC dataset.
• Explore implementations in advanced tools: TensorFlow, Python, R, or MATLAB.
BIBLIOGRAPHY

Code Link
https://colab.research.google.com/drive/1op8272F2ciF2NE7BZkHaKgTEcOzfvIeq?usp=sharing#scrollTo=aAOyQmdMRiX9
Resources
• Luo, X. (2021). Efficient English text classification using selected machine
learning techniques. Alexandria Engineering Journal, 60(4), 3401-3409.
https://doi.org/10.1016/j.aej.2021.02.009
• Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the
12th International Conference on Machine Learning (ICML), 331–339. Dataset
available via fetch_20newsgroups in scikit-learn (Pedregosa et al., 2011).
THANK YOU
Code Implementation of Paper
by Group 1
