Text Classification Research Paper

Proceedings of the 3d International Conference Computational Linguistics And Intelligent Systems

- extending the structure (addition of elements) and vice versa;
- splitting long sentences into several smaller ones and vice versa;
- replacing synonymous words or phrases (collocations) [2].
The paper proposes a method for paraphrase extraction from a news text
corpus based on the developed syntactic rules [3] to define phrases (collocations) and
the use of WordNet [4] to identify synonymous words in the text corpus.
The developed corpus consists of BBC news articles from the sport section [5].
For preprocessing (POS tagging), we suggest using the tools of NLTK, a Python
library.
Figure 1 shows the synonymous pairs obtained in WordNet.

Fig. 1. Synonymous Pairs Extracted from WordNet

For extracting paraphrases, we check the correspondence of the grammatical
characteristics of collocates (synonymous words of phrases identified at the previous
stage) with the syntactic rules.
Thus, phrases whose grammatical characteristics correspond to the rules are
considered to be synonymous. As a result, the proposed method for paraphrase
extraction from the news text corpus allows identifying a common information space
for topical news.
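
The check described above can be illustrated with a minimal sketch. It is not the authors' implementation: the synonym pairs, the POS tags, and the two rules below are hypothetical toy data, whereas a real system would take synonyms from WordNet and tags from a POS tagger. The requirement that both phrases follow the same rule is also an assumption made for this example.

```python
# Hypothetical toy data: a real system would take synonyms from WordNet
# and POS tags from a tagger; both are hard-coded here for illustration.
SYNONYMS = {
    frozenset({"win", "victory"}),
    frozenset({"big", "large"}),
    frozenset({"match", "game"}),
}

# Assumed syntactic rules: allowed POS-tag patterns for a two-word collocation.
RULES = {("ADJ", "NOUN"), ("NOUN", "NOUN")}


def are_synonyms(a, b):
    """Two words count as synonymous if equal or listed as a synonym pair."""
    return a == b or frozenset({a, b}) in SYNONYMS


def is_paraphrase(phrase_a, phrase_b):
    """Each phrase is a list of (word, pos_tag) tuples."""
    tags_a = tuple(tag for _, tag in phrase_a)
    tags_b = tuple(tag for _, tag in phrase_b)
    # Both phrases must match a syntactic rule (assumed here: the same rule).
    if tags_a not in RULES or tags_a != tags_b:
        return False
    # Collocates in the same position must be synonymous.
    return all(are_synonyms(wa, wb) for (wa, _), (wb, _) in zip(phrase_a, phrase_b))
```

For instance, `is_paraphrase([("big", "ADJ"), ("match", "NOUN")], [("large", "ADJ"), ("game", "NOUN")])` holds under these toy rules, while a phrase pair whose tag pattern is not in `RULES` is rejected.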

References
1. Koloiev, A.S.: Rewrite as a new phenomenon in modern journalism. In: SPU Bulletin.
Philology, vol. 1, 221-226 (2012)
2. Bolshakov, I.A.: Two methods of synonymous paraphrasing in linguistic steganography.
In: Proceedings of the International Conference Dialogue-2004,
http://www.dialog-21.ru/media/2496/bolshakov.pdf, last accessed 2019/02/10.
3. Petrasova, S., Khairova, N., Lewoniewski, W.: Building the semantic similarity model for
social network data streams. In: Data Stream Mining & Processing, Proceedings of the
2018 IEEE Second International Conference (DSMP), 21-24 (2018)
4. WordNet: https://wordnet.princeton.edu, last accessed 2019/02/10.
5. BBC, https://www.bbc.com/news, last accessed 2019/02/10.

COLINS’2019, Volume II: Workshop. Kharkiv, Ukraine, April 18-19,
2019, ISSN 2523-4013 http://colins.in.ua, online

Machine Learning Text Classification Model with NLP Approach

Maria Razno[0000-0003-3356-5027]
National Technical University "Kharkiv Polytechnic Institute",
Pushkinska str., 79/2, Kharkiv, Ukraine


Abstract. This article discusses the relevance of processing text written in
human language with Machine Learning and NLP methods that are available in the
Python programming language. It also outlines
the concept of Machine Learning, its main varieties, and the most popular
Python packages and libraries for working with text data using Machine
Learning methods. The concept of NLP and the most popular Python packages for it
are also presented. A machine learning classification model
algorithm based on text processing is introduced, showing how
to use classification machine learning and NLP methods in practice.

Keywords: Machine learning, Python, Pandas, Text classification, NLP,
NLTK, Scikit-learn, Artificial Intelligence, Python Library, Deep Learning
Texts

Over the last few years machine learning and artificial intelligence have become
very hot topics. Their methods and approaches are now part of a huge number of
products and are a necessity in many applications and appliances. An
example of using ML (Machine Learning) is the automatic detection of
important emails and the quick responses in Gmail. We can say with some confidence
that artificial intelligence and machine learning can take over from people in many
technological processes.
Machine learning is the scientific study of algorithms and statistical methods that
computer systems use to effectively perform a specific task without using explicit
instructions, relying on patterns and inference instead. It is seen as a subset of
artificial intelligence. Machine learning algorithms build a mathematical model of
sample data, known as "training data", in order to make predictions or decisions
without being explicitly programmed to perform the task. There are five types of
machine learning algorithms: supervised, semi-supervised, active learning,
reinforcement and unsupervised learning [1].
Natural language processing is a subfield of computer science, information
engineering, and artificial intelligence concerned with the interactions between
computers and human (natural) languages, in particular, how to program computers in
order to process and analyze large amounts of natural language data. Tasks in natural
language processing frequently involve speech recognition, natural language
understanding, and natural language generation.
Text classification is one of the most important and typical tasks in supervised
machine learning. Assigning categories to documents, which can be web pages,
library books, media articles, galleries, etc., has many applications, such as spam
filtering, email routing, and sentiment analysis. We would like to demonstrate how to do
text classification using the most common Python machine learning and natural
language processing packages: Pandas, Scikit-learn, NumPy, and a little bit of
NLTK.
In our study, we create a model that can classify a user's
comment and give it a star rating from 1 to 5. Supervised machine learning requires
prepared labeled data, so we use Yelp_academic_dataset_review in JSON format.
We downloaded the dataset via the link:
https://www.kaggle.com/yelp-dataset/yelp-dataset#yelp_academic_dataset_review.json.
The Pandas library gave us many of the necessary tools and helped us store the data in
a convenient table, whose columns are the classification parameters and whose rows
hold the information for each classified object.
This form of data storage is very effective in our study, especially for
accessing a particular column of data during text processing [2].
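
As an illustration of this tabular layout, the sketch below reads review records into a column-oriented table using only the standard library. It is a simplified stand-in for what Pandas provides (for a real file, `pandas.read_json(path, lines=True)` is the usual route). The two sample records are made up, and the field names `stars` and `text` are assumed to match the review file.

```python
import io
import json

# Two made-up records in the JSON Lines layout (one JSON object per line).
# The field names "stars" and "text" are assumed to match the review file.
SAMPLE = io.StringIO(
    '{"review_id": "r1", "stars": 5, "text": "Great food and friendly staff!"}\n'
    '{"review_id": "r2", "stars": 1, "text": "Cold meal, slow service."}\n'
)


def load_reviews(handle, columns=("stars", "text")):
    """Read JSON Lines into a column-oriented table: {column: [values...]}."""
    table = {name: [] for name in columns}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for name in columns:
            table[name].append(record.get(name))
    return table


reviews = load_reviews(SAMPLE)
print(reviews["stars"])  # the label column
print(reviews["text"])   # the column that goes on to text processing
```

Keeping each field as its own column makes it straightforward to pass only the review texts to the normalization step and only the star ratings to the classifier.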
The next step is to use natural language processing methods to normalize the text
data. During our work we found that the NLTK (Natural
Language Toolkit) package suits our purpose well. Using its methods, we removed
from the text data all the stop words that were not needed for further analysis. We also
applied stemming to remove morphological affixes from the words.
All of these steps helped the model analyze the text data accurately and
obtain the clearest features for the subsequent classification [3].
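
The normalization step can be sketched as follows. This is not the NLTK-based code used in the study: the stop-word list is a tiny illustrative one, and the suffix stripper is a naive stand-in for a real stemmer such as NLTK's `PorterStemmer`, so the stems it produces differ from NLTK's.

```python
import re

# Tiny illustrative stop-word list; NLTK ships a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "were", "it", "to", "of", "very"}

# Naive suffix stripper used as a stand-in for a real stemmer;
# suffixes are tried longest first.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print(normalize("The waiters were friendly and the pasta tasted amazing"))
# ['waiter', 'friend', 'pasta', 'tast', 'amaz']
```

Stems such as `tast` and `amaz` are not dictionary words; that is acceptable here because they only serve as feature keys that group inflected forms together.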
The next step of our study was building the machine learning model. We used
Scikit-learn because it is a rich library that offers
various types of analysis. Moreover, it is the most convenient
way of forming a model, because it provides a single interface for all conversion steps
and the final result. Instead of using the “Bag of words” approach and counting each
word in our text data, we use the tf-idf method for each pair of words in our reviews.
Tf-idf normalizes the count: the number of times a certain pair
of words occurs is weighted down according to the number of reviews in which this
pair appears. In this way, we get a model that finds the most common words for each
star rating; in other words,
it extracts the appropriate features and selects the best ones. As a result of our study,
the model is able to analyze a user's comment according to the features found.
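
To make the weighting concrete, here is a small standard-library sketch of tf-idf over adjacent word pairs. It uses one common textbook variant, tf = count / number of pairs in the review and idf = log(N / df); Scikit-learn's own vectorizer applies additional smoothing and normalization, so its numbers would differ. The three sample reviews are made up.

```python
import math
from collections import Counter


def word_pairs(tokens):
    """Adjacent word pairs (bigrams) of a token list."""
    return list(zip(tokens, tokens[1:]))


def tfidf_bigrams(documents):
    """Return one {pair: weight} dict per tokenized document.

    Uses tf = count / pairs in the document and idf = log(N / df),
    one common textbook variant; libraries often add smoothing.
    """
    pair_lists = [word_pairs(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each pair occurs.
    df = Counter(pair for pairs in pair_lists for pair in set(pairs))
    weights = []
    for pairs in pair_lists:
        counts = Counter(pairs)
        total = len(pairs) or 1  # avoid division by zero for short documents
        weights.append(
            {p: (c / total) * math.log(n_docs / df[p]) for p, c in counts.items()}
        )
    return weights


# Made-up, already tokenized reviews for illustration only.
docs = [
    "great food great service".split(),
    "great food slow service".split(),
    "slow service cold food".split(),
]
scores = tfidf_bigrams(docs)
```

In the first review, the pair `("food", "great")` occurs in no other review and therefore receives a higher weight than `("great", "food")`, which also appears in the second review.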
To summarize, our research shows that Python is a
convenient programming language that provides many libraries for creating
powerful machine learning models and for natural language processing. For the
task of building a machine learning text classification model with an NLP approach, we
reviewed the most popular machine learning libraries, namely Pandas, Scikit-learn,
NumPy, and NLTK, and built the text classification model. At the end
of our study we obtain a trained model that assigns the appropriate
star rating to a user's comment and provides the list of the most common words for
each star rating.
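
The list of the most common words per star rating can be approximated with plain counting, as in the sketch below. This is a simplification that assumes raw frequencies rather than the tf-idf features of the model, and the sample reviews are made up.

```python
from collections import Counter, defaultdict


def top_words_per_rating(reviews, n=3):
    """reviews: iterable of (stars, tokens); returns {stars: [top n words]}."""
    counters = defaultdict(Counter)
    for stars, tokens in reviews:
        counters[stars].update(tokens)
    return {
        stars: [word for word, _ in counter.most_common(n)]
        for stars, counter in sorted(counters.items())
    }


# Made-up, already normalized reviews for illustration only.
sample = [
    (5, ["great", "food", "great", "staff"]),
    (5, ["great", "place"]),
    (1, ["slow", "service", "cold", "food"]),
    (1, ["slow", "rude"]),
]
top = top_words_per_rating(sample, n=1)
print(top)  # {1: ['slow'], 5: ['great']}
```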


References
1. Langley, P.: Human and machine learning. Machine Learning, 1, pp. 243–248 (1986)
2. Masch, C.: Text classification with Convolution Neural Networks on Yelp, IMDB &
sentence polarity dataset, https://github.com/cmasch/cnn-text-classification, 24/02/2019.
3. Moschitti, A., Basili, R.: Complex Linguistic Features for Text Classification: A
Comprehensive Study. In: Lecture Notes in Computer Science vol. 2997, pp. 181-196,
Springer Science + Business Media (2004)
