
EFFICIENT ENGLISH TEXT CLASSIFICATION

USING SELECTED
MACHINE LEARNING TECHNIQUES

Group - 1

Raman Sharma | 22114076
Vikas Kumar | 22114106
Vineet Kumar | 22114107
Harish Nenavath | 22114061
Adarsh Dehariya | 22114002
Priyanshu Nareda | 22114072
INTRODUCTION
TITLE AND OBJECTIVE

• Title: Efficient English Text Classification using Selected Machine Learning Techniques
• Author: Xiaoyu Luo (Hunan University of Technology and Business, China)
• Journal & Publication Date: Alexandria Engineering Journal, February 2021
• Objective of the Study:
• To identify efficient machine learning techniques for English text classification by comparing the performance of various classifiers.
• A specific focus on Support Vector Machines (SVM) to evaluate its effectiveness relative to other models such as Logistic Regression and Naïve Bayes.
INTRODUCTION - SIGNIFICANCE OF TEXT CLASSIFICATION

• Text Classification (TC):


o Technique for categorizing documents into predefined classes using machine learning.
o Useful for organizing vast amounts of unstructured data for better retrieval and
management.
• Growing Demand:
o Significant growth in unstructured text data from corporate records, government, and
personal communications.
o Increase in blogging, social media, and online interactions has amplified the need for
efficient classification methods.
MACHINE LEARNING FOR TEXT CLASSIFICATION

• Role of ML in Text Classification:


• Supervised Learning: Uses labeled data for training, ensuring accurate categorization.
• Preprocessing: Involves stop-word removal, stemming, and feature selection for
optimized performance.
• Popular Applications:
• Spam Filtering: Recognizes and filters unwanted emails.
• Sentiment Analysis: Detects opinions on products and services.
• Fake News Detection: Identifies fraudulent or counterfeit content online.
KEY CHALLENGES

Challenges in Efficient Text Classification:


• Misclassification:
• Handling incorrectly labeled data remains challenging, especially in domains where
language or context may lead to ambiguities.
• Feature Selection and Dimensionality Reduction:
• With thousands of potential features (words or terms), feature selection is critical to
manage computational load and enhance classifier performance.
• Selecting the right features without compromising classification accuracy is essential.
MACHINE LEARNING
ALGORITHMS
SVM (SUPPORT VECTOR MACHINE)

Overview: SVM is a classification algorithm that finds the optimal hyperplane to separate data
into classes.
Key Concepts:
• Hyperplane: Divides data into different classes.
• Support Vectors: Key data points that influence the hyperplane’s position.
• Maximal Margin: SVM maximizes the margin between classes to improve
classification accuracy.
Advantages:
• Effective in high-dimensional data, ideal for text classification.
• Reduces overfitting with a clear margin of separation.
Applications: Used in spam detection, sentiment analysis, and document
categorization.
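As a minimal sketch of these concepts (not the paper's implementation), the snippet below fits a linear-kernel SVM on TF-IDF features; the toy corpus and labels are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Illustrative toy corpus and labels (assumed, not the paper's dataset)
texts = ["great match and final score", "stock prices fell sharply",
         "the team won the tournament", "markets rallied after the report"]
labels = ["sports", "finance", "sports", "finance"]

# Convert raw text to TF-IDF features, then fit a linear-kernel SVM
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = SVC(kernel="linear").fit(X, labels)

print(clf.predict(vectorizer.transform(["who won the final?"])))
```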
NAÏVE BAYES (NB)

Overview: NB is a probabilistic classifier that applies Bayes' theorem with the assumption of
independence between features.
Key Concepts:
• Conditional Probability: Calculates the probability of a class given
the input features.
• Feature Independence: Assumes each feature contributes
independently to the probability.
Advantages:
• Simple, efficient, and works well with large datasets.
• Ideal for high-speed text classification tasks.
Applications: Used in spam filtering, document categorization,
and sentiment analysis.
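A minimal sketch of Multinomial Naïve Bayes on word-count features; the example documents and labels are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative documents and labels (assumed, for demonstration)
docs = ["win a free prize now", "meeting rescheduled to monday",
        "claim your free reward", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

# MultinomialNB estimates P(class | features) from word counts,
# assuming each word contributes independently given the class
vec = CountVectorizer()
X = vec.fit_transform(docs)
nb = MultinomialNB().fit(X, labels)

print(nb.predict(vec.transform(["free prize waiting"])))
```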
LOGISTIC REGRESSION (LR)

Overview: LR is a linear model for binary classification, predicting probabilities and classifying based on a threshold.
Key Concepts:
• Sigmoid Function: Maps inputs to a probability between 0 and 1.
• Decision Boundary: Separates data into two classes based on
a threshold.

Advantages:
• Interpretable model with efficient performance in binary tasks.
• Suitable for text data with linear separability.

Applications: Used in sentiment analysis, binary text classification, and medical diagnosis.
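A minimal sketch of logistic regression for binary text classification: the sigmoid maps the linear score to a probability, and a 0.5 threshold gives the class. The tiny sentiment dataset is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative binary sentiment data (assumed)
texts = ["loved the product", "terrible experience",
         "works great", "waste of money"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
lr = LogisticRegression().fit(X, labels)

# predict_proba applies the sigmoid; predict thresholds at 0.5
print(lr.predict_proba(vec.transform(["really great value"])))
print(lr.predict(vec.transform(["really great value"])))
```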
METHODOLOGY AND
IMPLEMENTATION SETUP
IMPORTING NECESSARY LIBRARIES

• Basic: NumPy, Pandas
• Dataset: fetch_20newsgroups from sklearn
• Models: SVC, MultinomialNB, LogisticRegression from sklearn
• Metrics: sklearn libraries
• Plot: matplotlib
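A sketch of the imports this setup suggests; the specific metric and model-selection imports beyond what the slide names are assumptions.

```python
# Basic numerical and data-handling libraries
import numpy as np
import pandas as pd

# Dataset: the 20 Newsgroups corpus bundled with scikit-learn
from sklearn.datasets import fetch_20newsgroups

# Models compared in the study
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Metrics and model-selection utilities (assumed specific imports)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Plotting
import matplotlib.pyplot as plt
```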
PREPROCESSING
ON TEXT

Data Preprocessing
• Steps taken: Tokenization,
stop-word removal, and
stemming (PorterStemmer)
• Benefits of preprocessing:
Improves feature relevance,
reduces dimensionality, and
optimizes performance.
• Library: nltk
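A sketch of the preprocessing steps named on the slide (tokenization, stop-word removal, Porter stemming) using nltk; the helper function name is ours.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the required nltk resources
# (newer nltk versions may also need "punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    # Tokenize, drop non-alphabetic tokens and stop words, then stem
    tokens = word_tokenize(text.lower())
    kept = [t for t in tokens if t.isalpha() and t not in stop_words]
    return " ".join(stemmer.stem(t) for t in kept)

print(preprocess("The teams are playing exceptionally well this season!"))
```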
PREPARING DATA

Dataset Description
•Data sources: UCI library and other English news websites.
•Overview of dataset categories: Topics like sports, literature, campus news, etc.
Prepared three categories of data as described in the paper.
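A sketch of loading the three data sets with fetch_20newsgroups; the exact category groupings below are assumptions, since the slide only names the topics loosely.

```python
from sklearn.datasets import fetch_20newsgroups

# Hypothetical category groups standing in for the three data sets
# described in the paper (the exact groupings are assumptions)
category_sets = {
    "data1": ["rec.sport.baseball", "rec.sport.hockey", "sci.space"],
    "data2": ["comp.graphics", "sci.med"],
    "data3": ["talk.politics.misc", "rec.autos"],
}

datasets = {
    name: fetch_20newsgroups(subset="all", categories=cats,
                             remove=("headers", "footers", "quotes"))
    for name, cats in category_sets.items()
}

for name, data in datasets.items():
    print(name, len(data.data), "documents,", len(data.target_names), "classes")
```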
MODEL
TRAINING

Steps:
- Vectorize the text data
- Split the train and the test
data
- Define the parameters for
hyperparameter tuning
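A sketch of this step, assuming TF-IDF vectorization and an 80/20 split; the two categories loaded here and the grid values are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load one illustrative subset (categories are assumptions)
data = fetch_20newsgroups(subset="all",
                          categories=["rec.sport.hockey", "sci.space"],
                          remove=("headers", "footers", "quotes"))

# Vectorize the text data
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(data.data)
y = data.target

# Split into train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Parameters for hyperparameter tuning (values are assumptions)
param_grids = {
    "SVM": {"C": [0.1, 1, 10], "kernel": ["linear"]},
    "NB": {"alpha": [0.1, 0.5, 1.0]},
    "LR": {"C": [0.1, 1, 10]},
}
```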
MODEL
TRAINING

Steps:
- Model initialization using
GridSearchCV
- Fit the models
- Get the best estimators
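A sketch of this step, assuming the X_train/y_train split and param_grids from the previous sketch are in scope; 5-fold cross-validation and accuracy scoring are assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Pair each model with its hyperparameter grid from the previous sketch
models = {
    "SVM": SVC(),
    "NB": MultinomialNB(),
    "LR": LogisticRegression(max_iter=1000),
}

# Initialize each model inside GridSearchCV, fit, and keep the best estimator
best_estimators = {}
for name, model in models.items():
    search = GridSearchCV(model, param_grids[name], cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    best_estimators[name] = search.best_estimator_
    print(name, "best params:", search.best_params_)
```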
MODEL
TRAINING

Steps:
- Each model evaluation on
test data
- Store the result metrics for
further comparison
- Return the result
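A sketch of evaluating each tuned model on the held-out test data and storing the metrics; it assumes X_test, y_test, and best_estimators from the previous sketches, and macro averaging for the multi-class metrics.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Evaluate each tuned model on the test data and store the result metrics
results = {}
for name, estimator in best_estimators.items():
    y_pred = estimator.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro")
    results[name] = {"accuracy": accuracy_score(y_test, y_pred),
                     "precision": precision, "recall": recall, "f1": f1}

# Return the result metrics for further comparison
print(results)
```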
GETTING THE RESULT

Steps:
- Find the result for each
model for each data
- Combine and plot the graph
of the result.
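A sketch of combining the per-dataset results and plotting a grouped bar chart with matplotlib; it assumes a results_by_dataset mapping built by running the evaluation sketch once per data set (no actual numbers are shown here).

```python
import numpy as np
import matplotlib.pyplot as plt

# results_by_dataset is assumed to map each data set name ("data1", "data2",
# "data3") to the "results" dict produced by the evaluation sketch above
models = ["SVM", "NB", "LR"]
dataset_names = list(results_by_dataset)
x = np.arange(len(dataset_names))
width = 0.25

# One group of bars per data set, one bar per model
for i, model in enumerate(models):
    scores = [results_by_dataset[d][model]["precision"] for d in dataset_names]
    plt.bar(x + i * width, scores, width, label=model)

plt.xticks(x + width, dataset_names)
plt.ylabel("Precision")
plt.title("Precision of SVM, NB, and LR across the three data sets")
plt.legend()
plt.show()
```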
EVALUATION METRICS

• Metrics used: Precision, recall, F1-score, accuracy.
• Importance: These metrics provide a well-rounded view of model performance, particularly in multi-class settings.
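For a quick per-class breakdown of these metrics, scikit-learn's classification_report can be used; the tiny label arrays and class names below are assumptions purely for illustration.

```python
from sklearn.metrics import classification_report

# Toy true/predicted labels (assumed) to illustrate the report
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

# Precision, recall, and F1-score per class, plus overall accuracy
print(classification_report(y_true, y_pred,
                            target_names=["sports", "tech", "politics"]))
```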
RESULTS AND COMPARATIVE ANALYSIS

• SVM achieves higher precision than NB and LR on the datasets we selected.
OUR RESULTS

Insights from Comparative Analysis

- SVM shows superior precision and recall on datasets such as data1.
- NB is effective on smaller datasets such as data3.
DISCUSSION

• Challenges in Text Classification:


o Data Collection Impact: The quality of data collection affects preprocessing and text mining accuracy.
o Preprocessing Essentials: Tokenization, stop-word removal, stemming, and a vector space model are required for organizing the data.
o Literature Insights: Increasing the dataset size during training enhances evaluation accuracy.
CONCLUSION & FUTURE WORK

Conclusion:

• Rapid growth of text classification applications in IT.
• SVM outperforms NB and LR on the selected datasets with higher precision, recall, and F1-score.
• Each algorithm has strengths and weaknesses depending on dataset size and characteristics.
Future Work:

• Expand classification to larger datasets, like the BBC dataset.
• Explore implementations in advanced tools: TensorFlow, Python, R, or MATLAB.
BIBLIOGRAPHY

Code Link
https://colab.research.google.com/drive/1op8272F2ciF2NE7BZkHaKgTEcOzfvIeq?usp=sharing#scrollTo=aAOyQmdMRiX9
Resources
• Luo, X. (2021). Efficient English text classification using selected machine
learning techniques. Alexandria Engineering Journal, 60(4), 3401-3409.
https://doi.org/10.1016/j.aej.2021.02.009
• Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the
12th International Conference on Machine Learning (ICML), 331–339. Dataset
available via fetch_20newsgroups in scikit-learn (Pedregosa et al., 2011).
THANK YOU
Code Implementation of Paper
by Group 1
