IEEE-paper On NLP

This paper presents a comparative analysis of traditional machine learning models for text classification using a curated dataset of BBC news articles, focusing on preprocessing techniques and model performance. The study found that Support Vector Machines (SVM) achieved the highest accuracy of 96.94%, while emphasizing the importance of preprocessing in enhancing model performance. The research highlights the effectiveness of traditional methods for text classification tasks and suggests future exploration of larger datasets and deep learning approaches.

Text Processing and Classification using NLP

Given Name Surname
dept. name of organization (of Affiliation)
name of organization (of Affiliation)
Nagpur, India
email address or ORCID

Abstract—Text classification is a critical task in natural language processing (NLP) with extensive applications in areas such as spam detection, sentiment analysis, and content categorization. This paper presents a comparative analysis of traditional machine learning models applied to a curated dataset of BBC news articles. Preprocessing techniques, including tokenization, lemmatization, and TF-IDF transformation, were employed to optimize feature representation. Four classifiers (Logistic Regression, Support Vector Machines (SVM), Multinomial Naïve Bayes, and Random Forest) were trained and evaluated based on accuracy, precision, recall, and F1-score. Among the models tested, SVM achieved the highest accuracy of 96.94%. This paper discusses the implications of preprocessing and model selection on classification performance.

Keywords—Text Classification, Natural Language Processing, Logistic Regression, Support Vector Machines, Multinomial Naïve Bayes, Random Forest, Feature Extraction, TfidfTransformer, WordCloud, Model Comparison.

I. INTRODUCTION

The exponential growth of textual data from sources such as social media, news platforms, and online forums necessitates efficient text classification systems. Text classification involves assigning a predefined label to a given piece of text based on its content. Traditional machine learning methods have proven effective in tackling this problem. These methods often rely on robust preprocessing techniques to transform raw text into feature-rich representations that machine learning models can interpret.

This study uses the BBC text dataset, which contains news articles categorized into five distinct topics: business, entertainment, politics, sports, and technology. By comparing multiple classification models, this paper aims to identify the most suitable approach for accurate text categorization.

II. RELATED WORK

Text classification has been extensively studied in both traditional and deep learning paradigms. Early approaches utilized Naïve Bayes for its computational efficiency and probabilistic foundation. Logistic Regression and SVM have gained popularity due to their ability to handle high-dimensional data effectively. Ensemble methods like Random Forest provide additional robustness by combining multiple decision trees. Recent advancements in deep learning, such as recurrent neural networks (RNNs) and transformers, have revolutionized text classification but require significant computational resources. This paper focuses on traditional machine learning techniques, which remain relevant for resource-constrained environments.

III. METHODOLOGY

A. Dataset
The BBC dataset consists of 2,225 news articles categorized into five classes:

1. Business
2. Entertainment
3. Politics
4. Sports
5. Technology

The dataset is approximately balanced across the five classes, which limits class bias during model training.

B. Preprocessing
To transform raw text into usable features, the following preprocessing steps were applied:

1. Lowercasing: All text was converted to lowercase to maintain consistency.
2. Tokenization: The text was split into individual words using NLTK's RegexpTokenizer, which also discards punctuation.
3. Stop Word Removal: Common English stop words were removed, retaining only meaningful terms.
4. Lemmatization: Words were reduced to their base (dictionary) forms using WordNetLemmatizer.
5. TF-IDF Transformation: CountVectorizer was used to extract unigram and bigram features, and TF-IDF scores were computed using TfidfTransformer.

C. Classification Models
The following machine learning models were implemented using the Scikit-learn library:

1. Logistic Regression (LR): A linear model that predicts class probabilities for multi-class classification.
2. Support Vector Machines (SVM): A maximum-margin classifier, used here with a linear kernel.
3. Multinomial Naïve Bayes (MNB): A probabilistic model that assumes conditional independence of features.
4. Random Forest (RF): An ensemble of decision trees providing robust performance through majority voting.

D. Evaluation Metrics
The models were evaluated using:

1. Accuracy: Proportion of correctly classified instances.
2. Precision: Proportion of true positive predictions among all positive predictions.
3. Recall: Proportion of true positive predictions among all actual positives.
4. F1-Score: Harmonic mean of precision and recall.

IV. EXPERIMENTS AND RESULTS

A. Experiment Setup
The dataset was divided into training (75%) and testing (25%) sets. Hyperparameters were optimized using GridSearchCV.

B. Results
The models' performances are summarized in Table I.

TABLE I. MODEL PERFORMANCE COMPARISON

Model                      Accuracy (%)   Precision   Recall   F1-Score
Logistic Regression        96.58          0.97        0.96     0.97
Support Vector Machines    96.94          0.97        0.97     0.97
Multinomial Naïve Bayes    94.97          0.95        0.95     0.95
Random Forest              94.79          0.95        0.95     0.95

C. Visualizations

1. Word Clouds: Generated for each category to identify frequent terms.
2. Feature Importance: Bar graphs illustrating the significance of features in classification tasks.

V. DISCUSSION

The results highlight the importance of preprocessing in enhancing model performance. SVM outperformed the other models, which is consistent with its robustness in high-dimensional feature spaces such as TF-IDF vectors. Logistic Regression provided comparable performance (96.58% accuracy against 96.94% for SVM), while Multinomial Naïve Bayes was limited by its strong conditional-independence assumption. Random Forest exhibited stable results but did not surpass SVM. These findings demonstrate the suitability of traditional methods for small-scale text classification tasks.

VI. CONCLUSION

This paper explored traditional machine learning models for text classification on the BBC dataset. Among the models evaluated, SVM achieved the highest accuracy of 96.94%, showcasing its effectiveness for such tasks. The importance of preprocessing and feature extraction was emphasized, demonstrating their impact on overall performance. Future research will extend this work to larger datasets and explore deep learning approaches for enhanced accuracy.

ACKNOWLEDGMENTS

We would like to express our sincere gratitude to Dr. Deepali Kotambkar, from the Electronics Department at Shri Ramdeobaba College of Engineering and Management, for her invaluable guidance, encouragement, and support throughout this research. Her expertise and insights played a pivotal role in shaping the direction and outcomes of this work.

We also extend our thanks to the Electronics Department of Shri Ramdeobaba College of Engineering and Management for providing access to the necessary resources and tools required for conducting this study. Additionally, we are grateful to the creators and maintainers of the open-source libraries Scikit-learn and NLTK, which were integral to the implementation and experimentation of this research. Finally, we acknowledge the unwavering support of our peers and family, whose motivation and constructive feedback were invaluable during the course of this project.

REFERENCES

[1] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proceedings of the 10th European Conference on Machine Learning, 1998.
[2] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning. Amazon, 2020.
[3] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, 2009.
[4] Scikit-learn Documentation: https://scikit-learn.org/
[5] NLTK Documentation: https://www.nltk.org/
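
As an illustration of the preprocessing steps listed in the Methodology (lowercasing, tokenization, stop word removal, lemmatization, and unigram/bigram TF-IDF weighting), the following is a minimal, dependency-free Python sketch and not the authors' implementation. Several parts are assumptions: the \w+ token pattern, the small STOP_WORDS set and the LEMMAS table (hypothetical stand-ins for NLTK's English stop word list and WordNetLemmatizer), and the three toy sentences. The weighting uses the smoothed-IDF, L2-normalized form that scikit-learn's TfidfTransformer applies by default; the paper does not state whether those defaults were kept.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins so the sketch runs without NLTK downloads:
# a tiny stop list and a lookup table in place of WordNetLemmatizer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "for"}
LEMMAS = {
    "shares": "share", "rose": "rise", "markets": "market",
    "films": "film", "won": "win", "awards": "award",
}


def preprocess(text):
    """Lowercase, tokenize on word characters, drop stop words, lemmatize."""
    tokens = re.findall(r"\w+", text.lower())  # assumed token pattern
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


def ngrams(tokens):
    """Return the unigram and bigram features of one tokenized document."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(docs):
    """Return one L2-normalized TF-IDF vector (a dict) per document."""
    counts = [Counter(ngrams(preprocess(doc))) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [
    "Shares rose in the markets.",
    "The films won awards.",
    "Markets and films are in the news.",
]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    top = sorted(vec, key=vec.get, reverse=True)[:3]
    print(doc, "->", top)
```

In this toy corpus a term that occurs in only one document, such as "share", receives a higher weight than "market", which occurs in two. This is the property that lets TF-IDF emphasize category-specific vocabulary.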

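
The four evaluation metrics defined in the Methodology (accuracy, precision, recall, and F1-score) can be made concrete with a short worked example. The sketch below computes them from scratch in Python for a multi-class problem. The label lists are invented toy data, not results from the paper, and the macro (unweighted) average is an assumption, since the paper does not state how per-class scores were aggregated into the single values reported in Table I.

```python
def accuracy(y_true, y_pred):
    """Proportion of correctly classified instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred):
    """Precision, recall, and F1 for each class, treated one-vs-rest."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def macro_average(report, metric):
    """Unweighted mean of one metric over all classes (assumed averaging)."""
    return sum(scores[metric] for scores in report.values()) / len(report)


# Invented toy labels for illustration only.
y_true = ["business", "business", "sports", "sports", "technology", "technology"]
y_pred = ["business", "sports", "sports", "sports", "technology", "business"]

report = per_class_metrics(y_true, y_pred)
print("accuracy:", round(accuracy(y_true, y_pred), 4))
for label, scores in report.items():
    print(label, {name: round(value, 4) for name, value in scores.items()})
print("macro F1:", round(macro_average(report, "f1"), 4))
```

Here the sports class has perfect recall but lower precision, because one business article was predicted as sports. A single accuracy figure hides this kind of per-class asymmetry, which is why precision, recall, and F1 are reported alongside it.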