IEEE-paper (1) Original

Uploaded by

mitalimeshram4
Text Processing and Classification using NLP

Given Name Surname (of Affiliation)


dept. name of organization Nagpur, India
(of Affiliation) email address or ORCID
name of organization

Abstract—Text classification is a critical task in natural language processing (NLP) with extensive applications in areas such as spam detection, sentiment analysis, and content categorization. This paper presents a comparative analysis of traditional machine learning models applied to a curated dataset of BBC news articles. Preprocessing techniques, including tokenization, lemmatization, and TF-IDF transformation, were employed to optimize feature representation. Four classifiers were trained and evaluated on accuracy, precision, recall, and F1-score: Logistic Regression, Support Vector Machines (SVM), Multinomial Naïve Bayes, and Random Forest. Among the models tested, SVM achieved the highest accuracy of 96.94%. This paper discusses the implications of preprocessing and model selection on classification performance.

Keywords—Text Classification, Natural Language Processing, Logistic Regression, Support Vector Machines, Multinomial Naïve Bayes, Random Forest, Feature Extraction, TfidfTransformer, WordCloud, Model Comparison.

I. INTRODUCTION

The exponential growth of textual data from sources such as social media, news platforms, and online forums necessitates efficient text classification systems. Text classification involves assigning a predefined label to a given piece of text based on its content. Traditional machine learning methods have proven effective in tackling this problem. These methods often rely on robust preprocessing techniques to transform raw text into feature-rich representations that machine learning models can interpret.

This study uses the BBC text dataset, which contains news articles categorized into five distinct topics: business, entertainment, politics, sports, and technology. By comparing multiple classification models, this paper aims to identify the most suitable approach for accurate text categorization.

II. RELATED WORK

Text classification has been extensively studied in both traditional and deep learning paradigms. Early approaches utilized Naïve Bayes for its computational efficiency and probabilistic foundation. Logistic Regression and SVM have gained popularity due to their ability to handle high-dimensional data effectively. Ensemble methods like Random Forest provide additional robustness by combining multiple decision trees. Recent advancements in deep learning, such as recurrent neural networks (RNNs) and transformers, have revolutionized text classification but require significant computational resources. This paper focuses on traditional machine learning techniques, which remain relevant for resource-constrained environments.

III. METHODOLOGY

A. Dataset
The BBC dataset consists of 2,225 news articles categorized into five classes:

1. Business
2. Entertainment
3. Politics
4. Sports
5. Technology

The dataset is roughly balanced across the five classes, which reduces the risk of bias toward any single class during model training.

B. Preprocessing
To transform raw text into usable features, the following preprocessing steps were applied:

1. Lowercasing: All text was converted to lowercase to maintain consistency.
2. Tokenization: The text was split into individual words using RegexpTokenizer to handle punctuation.
3. Stop Word Removal: Common English stop words were removed, retaining only meaningful terms.
4. Lemmatization: Words were reduced to their dictionary (lemma) forms using WordNetLemmatizer.
5. TF-IDF Transformation: CountVectorizer was used to extract unigram and bigram features, and TF-IDF scores were computed using TfidfTransformer.

C. Classification Models
The following machine learning models were implemented using the Scikit-learn library:

1. Logistic Regression (LR): A linear model that predicts probabilities for multi-class classification.
2. Support Vector Machines (SVM): Utilizes a linear kernel for separating classes with maximum margin.
3. Multinomial Naïve Bayes (MNB): A probabilistic model that assumes conditional independence of features.

XXX-X-XXXX-XXXX-X/XX/$XX.00 ©2025 IEEE


4. Random Forest (RF): An ensemble of decision trees providing robust performance through majority voting.

D. Evaluation Metrics
The models were evaluated using:

1. Accuracy: Proportion of correctly classified instances.
2. Precision: Proportion of true positive predictions among all positive predictions.
3. Recall: Proportion of true positive predictions among all actual positives.
4. F1-Score: Harmonic mean of precision and recall.

IV. EXPERIMENTS AND RESULTS

A. Experiment Setup
The dataset was divided into training (80%) and testing (20%) sets. Hyperparameters were optimized using GridSearchCV.

B. Results
The models' performances are summarized in Table I.

TABLE I. MODEL PERFORMANCE COMPARISON

Model                      Accuracy (%)   Precision   Recall   F1-Score
Logistic Regression        96.58          0.97        0.96     0.97
Support Vector Machines    96.94          0.97        0.97     0.97
Multinomial Naïve Bayes    94.97          0.95        0.95     0.95
Random Forest              94.79          0.95        0.95     0.95

C. Visualizations

1. Word Clouds: Generated for each category to identify frequent terms.
2. Feature Importance: Bar graphs illustrating the significance of features in classification tasks.

V. DISCUSSION

The results highlight the importance of preprocessing in enhancing model performance. SVM outperformed the other models, which is consistent with its robustness in high-dimensional feature spaces. Logistic Regression provided comparable performance (96.58% versus 96.94% accuracy), while Naïve Bayes was likely limited by its conditional-independence assumption. Random Forest exhibited stable results but did not surpass SVM. These findings demonstrate the suitability of traditional methods for small-scale text classification tasks.

VI. CONCLUSION

This paper explored traditional machine learning models for text classification on the BBC dataset. Among the models evaluated, SVM achieved the highest accuracy of 96.94%, showcasing its effectiveness for such tasks. The importance of preprocessing and feature extraction was emphasized, demonstrating their impact on overall performance. Future research will extend this work to larger datasets and explore deep learning approaches for enhanced accuracy.

ACKNOWLEDGMENTS

We would like to express our sincere gratitude to Dr. Deepali Kotambkar, from the Electronics Department at Shri Ramdeobaba College of Engineering and Management, for her invaluable guidance, encouragement, and support throughout this research. Her expertise and insights played a pivotal role in shaping the direction and outcomes of this work.

We also extend our thanks to the Electronics Department of Shri Ramdeobaba College of Engineering and Management for providing access to the necessary resources and tools required for conducting this study. Additionally, we are grateful to the creators and maintainers of the open-source libraries Scikit-learn and NLTK, which were integral to the implementation and experimentation of this research. Finally, we acknowledge the unwavering support of our peers and family, whose motivation and constructive feedback were invaluable during the course of this project.

REFERENCES

[1] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proceedings of the 10th European Conference on Machine Learning, 1998.
[2] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning. Amazon, 2020.
[3] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. O'Reilly Media, 2009.
[4] Scikit-learn Documentation: https://scikit-learn.org/
[5] NLTK Documentation: https://www.nltk.org/
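As an illustration of the four evaluation metrics defined in the methodology (accuracy, precision, recall, and F1-score), the following is a minimal sketch in plain Python rather than the Scikit-learn routines the study relied on. The function names and the toy labels are hypothetical and are not taken from the paper's data. The paper does not state how per-class scores were aggregated into the single values reported in Table I, so the unweighted macro average used here is an assumption.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Proportion of correctly classified instances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # p was predicted but is not the true class
            fn[t] += 1   # the true class t was missed

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]   # all predictions of this class
        actual = tp[label] + fn[label]      # all true members of this class
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of each metric over classes (assumed aggregation)."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy labels for illustration only (hypothetical, not the BBC test split).
y_true = ["business", "business", "sports", "sports", "politics", "politics"]
y_pred = ["business", "sports", "sports", "sports", "politics", "business"]

acc = accuracy(y_true, y_pred)
scores = per_class_scores(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(scores)

print(f"accuracy={acc:.4f}")
for label, (p, r, f) in scores.items():
    print(f"{label:10s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro      precision={macro_p:.2f} recall={macro_r:.2f} f1={macro_f1:.2f}")
```

With five classes, as in the BBC dataset, the same functions apply unchanged. A support-weighted average would differ from the macro average only to the extent that class sizes differ.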
