Review 3 - Journal Submission Format: Team Number Title (New)
Team number: 09
Is your submission for a special issue? No
Indexed in SCOPUS? Yes
Indexed in Emerging Sources Citation Index (Thomson Reuters)? No
Number of issues per year published by the journal: 12
Text Categorization Techniques: Literature Review and
Current Trends
Saravanakumar Kandasamy
(Vellore Institute of Technology, Vellore, Tamil Nadu
[email protected])
Abhisu Jain
(Vellore Institute of Technology, Vellore, Tamil Nadu
[email protected])
Aditya Goyal
(Vellore Institute of Technology, Vellore, Tamil Nadu
aditya.goyal2017@vitstudent.ac.in)
Vikrant Singh
(Vellore Institute of Technology, Vellore, Tamil Nadu
vikrant.singh2017@vitstudent.ac.in)
Anshul Tripathi
(Vellore Institute of Technology, Vellore, Tamil Nadu
anshul.tripathi2018@vitstudent.ac.in)
Abstract: Text categorization is a core text mining task and is important for the effective
analysis of textual data. Documents can be classified in three different ways: with
unsupervised, supervised and semi-supervised techniques. Text categorization refers to the
process of automatically assigning a category, or several categories among predefined ones,
to each document. For given text data, words that convey the distinct meaning of a document
across different documents are usually considered good features. We review different papers
organized by different text categorization topics, and a comparative and conclusive analysis
is presented in this paper. This paper presents a classification of various approaches to text
categorization and compares them.
2.1.1 Model
In this model the problem of tagging a sentence based on its context is solved using
a CNN built on trained word vectors together with task-specific vectors. Initially the
word vectors are kept static and the other parameters are learned; after that the word
vectors themselves are learned as well [Kim 2011].
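As a concrete illustration, here is a minimal sketch of this kind of model in PyTorch. It is
our own illustrative code, not the implementation from [Kim 2011]; the class name `TextCNN`,
the filter sizes and the filter count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Kim-style CNN for sentence classification (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim, num_classes,
                 filter_sizes=(3, 4, 5), num_filters=100, static=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # "Static" channel: keep the pre-trained word vectors frozen
        # and learn only the convolutional and output layers.
        self.embedding.weight.requires_grad = not static
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in filter_sizes)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed, seq)
        # One feature map per filter size, max-pooled over time.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

# After training with static=True, the word vectors can be unfrozen
# (requires_grad = True) and fine-tuned, as described above.
```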
Problem: high-performance word-based text classification. Here the
classification is done at the word level and is solved using a CNN with low
computational complexity per layer; because the computational cost is kept low,
more layers can be stacked and more accurate results obtained. The resulting model,
with low computational cost and many layers, is called the deep pyramid CNN
[Johnson and Zhang 2017].
Problem: character-based text categorization. Here text classification is
done at the character level using very deep convolutional networks that operate
directly on characters, using small convolutions and pooling, with up to 29
convolutional layers and max pooling of size 3 [Conneau, Schwenk, Barrault and
Lecun 2016].
Problem: neural networks need a lot of training data and handle shifts in the
data distribution very poorly, which can make text categorization difficult. A
generative model is therefore built that adapts to shifting data distributions, using
an RNN, whereas earlier a bag-of-words model was used that only captured conditional
relations among the words [Yogatama, Dyer, Ling and Blunsom 2017].
Problem: training an RNN takes a lot of time, so Hierarchical Convolutional
Attention Networks are used to increase training speed without compromising
accuracy. They combine a CNN with self-attention-based text categorization, in
which the target receives more attention and hence more weight [Gao,
Ramanathan and Tourassi 2018].
2.2.1 Model
Selecting relevant features from such a large body of text is a very difficult
task. Initially, text classification was done with the bag-of-words model, in which
dependencies between the words were found; it produced many noisy features and the
dimensionality was very large. As feature selection algorithms progressed,
dimensionality reduction and term weighting became very important for removing
unwanted terms, which made the feature selection process central. Its aim is to
select the smallest set of features that differ the most in their properties and can
classify easily.

Multivariate feature selection algorithms are not very scalable and can be
computationally very expensive, so univariate feature selection methods are used. In
univariate selection methods the features are generally scored with different feature
evaluation functions (FEFs), and because each feature is scored in isolation,
redundant features sometimes surface, which does not really help the classification
of text. The author proposes a model in which the features are ranked with respect
to the various known classes. Features are therefore not ranked individually;
instead they are ranked in groups, where a group consists of K top-ranked features,
one for each class, K being the number of classes present. In the traditional
models, by contrast, simply the features with the maximum overall ranks are taken.
The top P groups are selected for processing, and a relevancy matrix is generated
in which every column gives the scores related to one class.

The datasets used are benchmarking text datasets on which the conventional models
are also tested, so that the performance of the conventional models and the proposed
one can be compared; these datasets contain many features, which allows the testing
and selection of various features. On the basis of metrics such as precision, recall
and the F-measure, the performance of the proposed model was compared with the
conventional models. Comparing the proposed method under different FEFs, DFS selects
the fewest features per group and MI selects the most; the proposed framework beat
the conventional framework in almost every case. The best FEF of all is chi-square,
whereas performance with MI is the lowest. As the number of feature groups
increases, performance increases; after 7 groups it reaches a saturation point and
simply stabilizes.
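To make the group-wise ranking concrete, the following is a minimal sketch under our
reading of the scheme, assuming scikit-learn and chi-square as the FEF; the function
name `groupwise_select` and the one-vs-rest scoring are illustrative choices, not the
paper's code.

```python
import numpy as np
from sklearn.feature_selection import chi2

def groupwise_select(X, y, num_groups):
    """Pick features in groups of K (K = number of classes): each group
    takes the next best-scoring unused feature for every class.
    X: non-negative term-count matrix; a sketch, not the paper's code."""
    classes = np.unique(y)
    # Score every feature against each class (one-vs-rest chi-square).
    scores = np.vstack([chi2(X, (y == c).astype(int))[0] for c in classes])
    order = np.argsort(-scores, axis=1)        # per-class feature ranking
    selected, used = [], set()
    for _ in range(num_groups):
        for ci in range(len(classes)):
            for f in order[ci]:                # next unused feature for class ci
                if f not in used:
                    used.add(f)
                    selected.append(f)
                    break
    return np.array(selected)
```

With K = 4 classes, for example, the seven groups at which performance saturates
correspond to 28 selected features.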
2.3.1 Model
Deep neural networks are widely used these days, but some attacks have been
reported on them that raise questions about their reliability; such attacks can
cause images or text to be misclassified. To make people aware of such attacks, and
since earlier research examined these attacks on CNNs while this work focuses on
RNNs, which play a crucial role in text categorization and many other applications,
it was important to describe them. A backdoor attack model is therefore built to
attack the RNN: a sentence is selected as the backdoor trigger (the model is trained
on a poisoned dataset containing a backdoor, and the attacker's goal is that the
model handles inputs containing certain features, the trigger sentence, incorrectly)
and poisoning samples are created by randomly inserting these triggers. The system
is manipulated in such a way that it misclassifies only inputs that contain the
trigger sentence, while other inputs are classified properly. The adversary
determines the trigger sentence and the target class, selects poisoning samples that
do not belong to the target class (e.g., malicious mails), adds these samples to the
training set, and can then use backdoor instances containing the trigger sentence to
attack the system. Various metrics are used to assess the success of the attack: the
trigger length and the poisoning rate are varied on two kinds of sets, a positive
review set and a negative review set, and the test accuracy and the attack success
rate are measured. It was seen that the poisoning rate is directly proportional to
the attack success rate, with a highest success rate of 96%, and that increasing the
trigger length also increases the attack success rate. Finally, this work can spread
awareness about the attack.
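The trigger-insertion step can be sketched as follows. This is an illustrative
reconstruction of the described procedure, not the authors' code; the example texts,
the name `poison_dataset` and the 10% poisoning rate are made up.

```python
import random

def poison_dataset(samples, trigger, target_label, poisoning_rate, seed=0):
    """Create a backdoored training set: insert the trigger sentence into a
    fraction of non-target samples and relabel them as the target class."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in samples:
        if label != target_label and rng.random() < poisoning_rate:
            words = text.split()
            pos = rng.randint(0, len(words))     # random insertion point
            words[pos:pos] = trigger.split()
            poisoned.append((" ".join(words), target_label))
        else:
            poisoned.append((text, label))
    return poisoned

# Example: flip poisoned negative reviews to the positive class.
train = [("the plot was dull and predictable", 0),
         ("a delightful film from start to finish", 1)]
backdoored = poison_dataset(train, trigger="I watched this 3D movie",
                            target_label=1, poisoning_rate=0.1)
```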
In this work an attack against support vector machines is investigated: the attacker
injects modified data that increases the SVM's test-set error, exploiting the common
assumption that the data come from a trusted source and a well-behaved distribution
[Biggio, Nelson and Laskov 2012].
Deep learning algorithms produce very good results in the presence of a
large dataset and perform better than most algorithms, but some imperfections
in the training phase can make them vulnerable to adversarial samples that cause
the learning algorithms to misjudge. A new algorithm is therefore proposed to reduce
this vulnerability, and some defences are also described based on measuring the
distance between an input and the target classification [Papernot et al. 2015].
ML is used in various spheres of life, such as driverless cars and aviation,
where adversaries can cause serious harm to life and property, so a method is
developed that uses gradient descent to find adversarial examples, and a metric is
also defined to measure the quality of adversarial samples [Jang, Wu and Jha 2017].
In deep classifiers, small changes in image data can cause harm and lead to
misclassification, so the DeepFool algorithm is proposed to find the perturbations
that fool a deep network [Moosavi-Dezfooli, Fawzi and Frossard n.d.].
When malicious inputs are given to a model, it can yield wrong outputs whose
cause cannot be seen by human observers. An attack is therefore planned that does
not need any knowledge of the model: inputs are synthetically generated and tagged
with the target class [Papernot et al. 2017].
2.4 Feature Selection
2.4.1 Model
While studying data analytics, we come across huge amounts of data that require
more computational time, and memory constraints apply too. Feature selection is the
solution proposed in this paper; it does not change the physical nature of the
original features, so compared with feature extraction, feature selection has better
readability and interpretability. We focus on a special type of feature selection,
mutual information (MI) based feature selection, which uses higher-dimensional joint
mutual information and the 'maximum of the minimum' method to improve the feature
selection problem. We introduce the FJMI (five-way joint mutual interaction) feature
selection algorithm and discuss its performance metrics.
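For orientation, a minimal univariate MI baseline using scikit-learn is sketched
below; FJMI itself computes higher-order joint MI terms that this sketch
deliberately omits, so this is a reference point rather than the paper's algorithm.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_select(X, y, k):
    """Rank features by mutual information with the class labels and keep
    the top k (simple univariate MI baseline for illustration)."""
    mi = mutual_info_classif(X, y, random_state=0)
    return np.argsort(-mi)[:k]
```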
MI-based feature selection uses MI terms of low dimensionality, covering relevancy
and conditional redundancy, to capture the information shared between selected
features and class labels; the joint information between whole feature subsets and
the class labels cannot be calculated directly [Hu, Gao, Zhao, Zhang and Wang 2018].
The Interaction Weight based Feature Selection (IWFS) method is introduced,
which considers three-way interactions. To assess interaction between features and
measure redundancy, the method uses an interaction weight factor mechanism [Zeng,
Zhang, Zhang and Yin 2015].
2.5.1 Model
2.6.1 Model
K-Nearest Neighbour (kNN), decision trees, Naive Bayes (NB), etc. are
traditional text classification methods, and all have sparsity problems. The
elimination of sparsity problems through distributed word representations was
triggered by the advancement of deep learning [Bengio et al. 2003].
Tree-structured LSTM networks are proposed due to the problems faced by
recurrent neural networks (RNNs) [Tai, Socher and Manning n.d.].
To improve the generalization of the network, recurrent neural networks are
applied to classification [P. Liu, Qiu and Huang 2016].
For sentiment analysis, a CNN is used to obtain sentence vectors, and
following that, for classification, a bidirectional LSTM discovers the document
vectors [D. Tang, Qin and Liu 2015].
2.7.1 Model
The paper describes the problem with the example of how Twitter works: people post
their different opinions there, and these can be divided into different categories
with the help of the improved techniques. The above-mentioned techniques help
describe the posts in a better way. The mentioned algorithms give a better F1-score
on the Twitter dataset, and the modified modOR has been found to be a consistent
performer, giving the best results. IDF alone is not enough to reflect the important
terms across different categories, and the modification is important because TF
decreases the performance of question categorization; these algorithms and
techniques are therefore modified to make them more effective for text
classification.

This problem is chosen in order to divide text into different categories, whether
Yahoo questions or the various kinds of opinions on Twitter. As mentioned, TF
decreases the efficiency of question categorization, so modifying it gives a better
way to understand and handle this. The paper deals with the problem of searching and
categorizing items posted on social media. We now have different social media
platforms that help us connect and answer one another, and because of social media
people can put their views in a small blog post. Short texts and messages have
become an important form of communication, and reviews of online products and
criticism on social media are now a big part of it. People tweet about real-world
problems, with some viewpoints negative and some positive, which reveals opinion and
has become influential. Tweets are now used by researchers for different predictions
based on datasets derived from Twitter, and a news service based solely on Twitter
data has been proposed in the research.

The proposed solution is the modification of these techniques with the help of
different approaches, using k-NN classifiers, singular value decomposition, and
statistical measures calculated with the mentioned formulas. First, a supervised
alternative of TF-IDF is presented. There are three short-text datasets with a
common attribute: short texts written by users, covering event discussions, product
reviews and questions. The first is the Twitter event dataset, which has more than
5000 groups. The second is the Opinosis dataset, which has short product reviews.
The third is Yahoo's question-answer dataset. The TF component weights words in a
document by taking into account the word's local occurrence in the document, and the
IDF component balances it by taking into account the word's global number of
occurrences. IFN-TP-ICF is the second technique proposed in the paper for
categorization; here the value of TP matters, as the ratio of TP to TN is
considered, and TP is high for the larger categories.
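As a reference point for these modifications, here is a minimal sketch of plain,
unsupervised TF-IDF weighting with scikit-learn; the documents are made-up examples,
and the supervised variants such as IFN-TP-ICF are not implemented here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great phone battery lasts long",
        "battery died after a week terrible phone",
        "which phone has the best battery"]

# TF weighs a word by its local frequency in each document; IDF
# down-weights words that occur in many documents globally.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # (3 docs x vocabulary) sparse matrix
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(2))))
```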
As the existing algorithms are being modified, and TF-IDF is one of the major
equations used in the base paper, this cited paper is helpful for handling the
modifications made for term weighting [Chen, Zhang, Long and Zhang 2016].
The graphs given in the base paper are produced with the help of this paper, and
the analysis follows its approach to abstractive summarization of highly redundant
opinions; it supplied the Opinosis data for the product reviews [Ganesan, Zhai and
Han n.d.].
This cited paper is considered in order to obtain the data for event detection
on Twitter; as the base paper has a Twitter-related dataset, it is important to use
a large-scale corpus for the dataset [McMinn, Moshfeghi and Jose 2013].
As everything is analysed on the basis of the vector space model, this paper
resolves the problem of how to analyse, in a vector space model, the result obtained
from TF-IDF weighting [Soucy and Mineau n.d.].
2.8.1 Model
There are many secure design patterns, and it is not easy to choose the
appropriate one; selecting among the given patterns requires security knowledge. The
goal is therefore to select a proper secure design pattern on the basis of the
software requirements specification (SRS). As mentioned above, there are many
different types of secure patterns, and determining among them needs security
knowledge, while software developers are generally not trained to handle these
problems in the security domain. This paper gives a suggestion and insight into the
recommendation of secure patterns using text categorization. An archive of secure
design patterns is used as a dataset, together with a store of requirements
artifacts in the form of software requirements specifications. A text classification
scheme, which starts with pre-processing and indexing of the secure patterns, ends
by querying SRS features to retrieve a secure design pattern using the document
retrieval model. For the evaluation of the proposed model, SRS documents from three
different domains are used.

This problem is chosen in this paper to resolve the developers' problem of finding a
proper secure design pattern based on their particular SRS; since developers are not
specialized in deciding this, the solution addresses a bigger problem, and the paper
describes why a given secure design pattern should be used. As security has always
been the biggest concern in every field, this provides a way to develop software
while maintaining security and confidentiality. Security is regarded as a
non-functional requirement in the software development life cycle, and in the
present day security requirements are considered in each phase of it; the progress
of technology has also increased security issues. To cope with security concerns,
engineers need to learn the security requirements of a product and must have
security domain knowledge to prescribe a secure development arrangement. Security
concerns and threats are commonly organized along five parameters: identification
and authentication of users, access control mechanisms and authorization rules,
cryptography, intrusion detection, and logging. Security requirement properties are
characterized into four classes, namely confidentiality, integrity, availability and
accountability, and this classification of security issues indicates the various
measures to be taken to meet security requirements. For secure development, security
concerns must be described in a concrete manner at every development stage.

The proposed solution is effective in helping software developers in a sector in
which they are not specialized. If developers lack specialization in the security
sector, then when the algorithm recommends a pattern, they need to understand why
the machine has chosen it.
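The retrieval scheme can be sketched roughly as follows, assuming scikit-learn; the
pattern names, descriptions and the SRS sentence are illustrative placeholders, not
data from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative archive of secure design pattern descriptions.
patterns = {
    "Authenticator": "verify user identity credentials login authentication",
    "Authorization": "grant deny access rights roles permissions resources",
    "Secure Logger": "record security events tamper resistant audit log",
}

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(patterns.values())  # index the pattern archive

srs = "the system shall log all failed login attempts for audit"
scores = cosine_similarity(vectorizer.transform([srs]), index)[0]
best = max(zip(patterns, scores), key=lambda p: p[1])
print(best)   # pattern whose description best matches the SRS features
```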
2.9.1 Model
This paper proposes a dictionary-based approach that uses the content's information
to generate possible variants and pairs for normalization. These are ranked on the
basis of string similarity, and the approach is designed to extend the dictionary's
content [Sarker et al. 2015].
The problem of multi-task, character-level attentional networks for the
normalization of medical concepts is considered [F. Liu, Weng and Jiang 2012].
These advances have prompted the development of new research on medical
social-media text mining, including pharmacovigilance [Goodfellow et al. n.d.].
In fact, there is a lack of relevant training corpora; the paper therefore
proposes using an adversarial network to help generate training sets [Belazzoug,
Touahria, Nouioua and Brahimi 2019b].
2.10.1 Model
One of the most common models for text categorization is bag of words (BOW). This
model faces limitations because the number of features involved is large, which
influences text categorization performance. ISCA is the result of improvements added
to the powerful Sine Cosine Algorithm (SCA); it helps discover new regions of the
search space compared with the original SCA. The algorithm evaluates two positions
to find the best solution: the position of the best solution found so far, and a
given random position from the search space, whereas the original SCA focuses only
on the best solution to generate a new one. This combination avoids premature
convergence and improves performance.
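For reference, here is a sketch of the standard SCA position update as commonly
formulated; ISCA's extra random-position term is only noted in a comment, since the
paper's exact formulation is not reproduced here.

```python
import numpy as np

def sca_step(X, best, t, T, a=2.0, rng=np.random):
    """One iteration of the standard Sine Cosine Algorithm update.
    X: (pop, dim) candidate positions; best: current best solution.
    ISCA additionally mixes in a random search-space position (not shown)."""
    r1 = a - t * (a / T)                      # shrinks exploration over time
    r2 = rng.uniform(0, 2 * np.pi, X.shape)
    r3 = rng.uniform(0, 2, X.shape)
    r4 = rng.uniform(0, 1, X.shape)
    move = np.where(r4 < 0.5,
                    np.sin(r2), np.cos(r2)) * np.abs(r3 * best - X)
    return X + r1 * move
```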
2.11.1 Model
2.12.1 Model
Many deep learning architectures have been implemented for modelling text sequences,
but the main problem they face is that they require a great amount of annotated data
for training their parameters, which makes them infeasible when a large number of
annotated samples do not exist or cannot be accessed. SWEMs (simple word-embedding
models) are able to extract representations for text classification with the help of
only a few support examples. A modified approach applying a hierarchical pooling
method is proposed for few-shot text classification and shows high performance on
long text datasets [Pan et al. 2019].
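A minimal sketch of hierarchical pooling over word embeddings follows (NumPy); the
window size and the function name are illustrative assumptions, not taken from the
cited work.

```python
import numpy as np

def hierarchical_pool(embeddings, window=3):
    """SWEM-style hierarchical pooling: average over local windows to keep
    some word-order information, then max-pool over the windows.
    embeddings: (seq_len, dim) word vectors."""
    n, _ = embeddings.shape
    windows = [embeddings[i:i + window].mean(axis=0)
               for i in range(max(1, n - window + 1))]
    return np.max(windows, axis=0)            # (dim,) document representation
```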
2.12.2 Problem Answered
The Model-Agnostic Meta-Learner's main purpose is to meta-learn initial
conditions for the subsequent fine-tuning on few-shot problems [Finn,
Abbeel and Levine 2017].
The paper proposes an LSTM-based meta-learner model whose main focus is to
learn the exact optimization algorithm used to train another learner neural-network
classifier in the same few-shot regime [Ravi and Larochelle n.d.].
In this paper, ideas from metric learning and from recent advances combining
neural networks with external memories are employed [Vinyals et al. n.d.].
This framework maps a small labelled support set and an unlabelled example
to the example's label, and defines one-shot learning problems on vision and
language tasks [Snell, Swersky and Zemel n.d.].
This paper presents a short-text classification framework using Siamese
CNNs. The Siamese CNNs learn a discriminative text encoding that helps classifiers
distinguish informal sentences. To improve the classifier's generalization with few
shots, different sentence structures and different descriptions of a topic are taken
as prototypes [Yan, Zheng and Cao 2018].
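A rough sketch of the Siamese encoding plus nearest-prototype classification idea in
PyTorch; the architecture, dimensions and names are illustrative assumptions, not
the model from [Yan, Zheng and Cao 2018].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared CNN encoder applied to both inputs of a Siamese pair."""
    def __init__(self, vocab_size, embed_dim=64, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, out_dim, kernel_size=3, padding=1)

    def forward(self, ids):                        # (batch, seq_len)
        x = self.embed(ids).transpose(1, 2)        # (batch, embed, seq)
        return F.relu(self.conv(x)).max(dim=2).values  # (batch, out_dim)

def nearest_prototype(encoder, query_ids, prototype_ids, prototype_labels):
    """Classify a query by cosine similarity to encoded class prototypes."""
    q = F.normalize(encoder(query_ids), dim=1)
    p = F.normalize(encoder(prototype_ids), dim=1)
    return prototype_labels[(q @ p.t()).argmax(dim=1)]
```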
3 Evaluation
All the algorithms are evaluated using various methods, and the experimental results
are compared on different datasets, for example 20Newsgroups, Reuters and Yahoo
Answers. These are among the benchmarking text datasets on which the conventional
models are tested, so that their performance can be compared with that of the models
proposed here; these datasets contain many features, which allows the testing and
selection of various features. Precision and recall are used for evaluating
categorization algorithms. Precision is the ratio of the number of documents
correctly assigned to category C to the total number of documents classified as
belonging to category C. Recall is the ratio of the number of documents correctly
assigned to category C to the total number of documents actually belonging to
category C. A third basic measure is the F-measure (FM), the harmonic mean of
precision and recall. These three measures are described by the following equations,
where TP denotes the number of true positives, FP the number of false positives and
FN the number of false negatives.
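Written out (the equations themselves were lost in extraction; these are the
standard definitions the text describes):

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{FM} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]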
In multiclass categorization, macro-averaging and micro-averaging of precision and
recall are used. In the macro-averaging technique all classes are weighted equally,
regardless of the number of documents belonging to each, while micro-averaging
weights all documents equally, thus favouring performance on the large classes.
Micro F1 therefore depends mainly on the common categories, while macro F1 is
affected by each and every category. Micro-averaging and macro-averaging are applied
to precision, recall and F-measure over |C| independent binary classification
problems.
Broad tests are conducted to give a fair comparison with the unsupervised deep
feature representation methods. For this purpose, datasets with a large number of
text documents have been used: CNAE is a collection of 1080 text documents,
20Newsgroups comprises 18821 text documents, and Reuters is a subset of the
Reuters21578 database of text documents.
4 Comparison
Table 1 shows a comparison of all the techniques used in our base papers. All the
algorithms are compared on the basis of their advantages and limitations, along
with their datasets and performance metrics.
5 Conclusions
On studying and closely analysing the various text classification techniques, we
identified various methods and highlighted their strengths and weaknesses in
extracting useful information from data. It is also important to recognize the
problems present in text classification techniques in order to make a comparative
study of the various classifiers and their performance, and it is interesting to
infer that no single classifier can be attached to a specific problem.
Semi-supervised text classification reduces temporal costs and is important in the
field of text mining. We addressed some of the other crucial issues in the paper,
including performance enhancement, feature selection and document zones.

In this paper we surveyed various approaches to text categorization and feature
selection based on the well-known 20Newsgroups dataset, and the results were
striking: on precision, the traditional Naive Bayes, the Tsetlin Machine and BRCAN
were among the best methods, whereas on recall there was a huge difference between
the Tsetlin Machine and the other algorithms, with the Tsetlin Machine outperforming
all the others by a great margin. Because this made it difficult to compare the
whole scenario, we used the F-measure, an evaluation method that combines both
precision and recall. From the results of all the evaluation methods we conclude
that the Tsetlin Machine is the best among all the methods compared.
6 Future Work
In future work, we plan to reduce the complexity of calculating the MI terms, where
the main challenge is estimating the joint probability of the MI terms. We also plan
to combine the ISCA algorithm with other search algorithms to study some aspects of
feature selection problems. For further study on pattern selection, we will consider
using unsupervised learning techniques, and we will increase the feature vector size
and vary the settings of the different techniques to reduce sparsity. We plan to
study the Tsetlin Machine and its usage for unsupervised learning of word
embeddings. We can build text categorization with a more efficient selection method
and increase performance using MI, change the way the word vector is created, and,
by adding some features to it, also perform sentiment analysis. Finally, we plan to
develop a defence mechanism against the backdoor attack and to study the influence
of trigger sentence content on the attack's success.
Acknowledgements
References