Chi-Square Feature Selection Effect On Naive Bayes Classifier Algorithm Performance For Sentiment Analysis Document
Abstract— The main problem in using the Naïve Bayes algorithm for sentiment analysis is its sensitivity to the selection of features. Chi-Square feature selection exists to eliminate features that have little influence. This study aimed to determine the effect of Chi-Square feature selection on the performance of the Naïve Bayes algorithm in analyzing document sentiment. Data were taken from the Indonesian Movie Review Corpus v1.0: 700 training documents and 30 test documents. Testing was done by analyzing document sentiment with and without Chi-Square feature selection, and the results were then evaluated by accuracy, precision, and recall. Sentiment analysis without feature selection obtained 73.33% accuracy, 100.00% precision, and 65.21% recall, while with Chi-Square feature selection (significance level α 0.1) it obtained 93.33% accuracy, 93.33% precision, and 93.33% recall. From these results it can be seen that Chi-Square feature selection affects the performance of the Naïve Bayes algorithm in analyzing document sentiment.

Keywords— Sentiment analysis, Feature Selection, Naïve Bayes, Chi-Square, Text Mining

I. INTRODUCTION

Sentiment analysis, or opinion mining, is the process of understanding, extracting, and processing textual data automatically to obtain the sentiment information contained in an opinion sentence [1][2].

Based on research conducted in [3][4], sentiment analysis is done to see the tendency of opinions toward an issue or topic in the news, whether a person tends to be negative, positive, or neutral, so that the collected opinions can become useful information.

One example of the use of sentiment analysis in the real world is the identification of market trends and market opinion toward a product. The magnitude of the effect and benefits of sentiment analysis has led research and sentiment analysis-based applications to grow rapidly; in America alone there are about 20-30 companies focusing on sentiment analysis services [5]. Research in the field of opinion mining began to flourish in 2002. Turney in 2002 conducted research on opinion mining using data containing consumer reviews of products. The method used was Semantic Orientation using Pointwise Mutual Information (SO-PMI). The best results achieved were 84% accuracy for reviews of motor vehicles and 66% for movie reviews. Pang et al. in 2002 classified movie reviews at the document level as carrying a positive or negative opinion by using supervised learning techniques. A set of movie reviews that had previously been determined to be either positive or negative was used as training data for an existing machine learning algorithm. The accuracy obtained ranged from 72% to 83%.

The information contained in review documents is unstructured text data, and Text Mining is needed to handle such unpatterned text. Text Mining activities are important in classification or categorization. [6] explained that Text Mining is one of the techniques that can be used to classify documents; it is a variation of data mining that tries to find interesting patterns in large collections of textual data. Text classification has previously been done by various methods, such as Naïve Bayes [5] and K-Nearest Neighbor; [1] conducted a comparison of text classification methods in which the Naïve Bayes method reached an accuracy of 97%, Support Vector Machines 96.9%, Neural Networks 93%, and Decision Trees 91.1%.

II. RELATED WORK

Research on sentiment analysis using the Naïve Bayes method has previously been done by [7], entitled "Sentiment Analysis Method Using Naïve Bayes Classifier With Chi-Square Feature Selection", in which the Naïve Bayes method obtained an accuracy of 83%. Meanwhile, according to a study [3] entitled "Text Mining In Insurance Sentiment Analysis Method Using Naïve Bayes Classifier", the NBC method can obtain up to 95% accuracy.

The advantage of the Naïve Bayes method is that it is simple yet highly accurate: it requires only a small amount of training data to estimate the parameters (means and variances of the variables) required for classification [3].

As described above, among the methods in previous research that analyze documents to obtain the sentiment value of a document, the Naïve Bayes method reached a 95% accuracy rate. Therefore, in this study the authors chose the Naïve Bayes classifier algorithm as the classification algorithm.

The Naïve Bayes method is very simple and efficient, but on the other hand it is very sensitive to the selection of features. According to Chen et al. [8], a major problem in text classification is the high dimensionality of the feature space; it is common for a text collection to have tens of thousands of features. Most of these features are irrelevant and not useful for text classification and can even reduce accuracy, and therefore the selection of appropriate features is indispensable.
Research conducted in [8] indicates that feature selection is an important step in text classification and directly influences classification performance. It is known that text classification has a problem related to the number of data dimensions. This is because, in text classification, a document is represented as a collection of words (a bag of words), in which each word in the document does not depend on the others.

Several techniques have previously been used to perform document feature selection, including the Document Frequency (DF) threshold, Information Gain, Mutual Information (MI), Term Strength (TS), and the Chi-Square test (X²).

Sun et al. [9] stated that, in document classification, Chi-Square is one of the supervised feature selection methods that can eliminate many features without compromising accuracy.

In previous research, the study [9] entitled "Sentiment Analysis Method Using Naïve Bayes Classifier With Chi-Square Feature Selection" reported that the NBC method can obtain an accuracy of 83%, while the study [9] entitled "Feature Selection By Chi-Square In Naïve Bayes Algorithm For News Classification" obtained 98% accuracy. Therefore, in this study the authors chose Chi-Square as the feature selection method.

This study was conducted to prove the influence of Chi-Square feature selection on the performance of the Naïve Bayes algorithm in analyzing the sentiment of a document, and to measure how much Chi-Square feature selection influences the values of accuracy, precision, and recall, which serve as measures of how accurately the system generates the appropriate sentiment.

Based on the above background, the authors decided to conduct research under the title "Chi-Square Feature Selection Effect On Naïve Bayes Classifier Algorithm Performance For Sentiment Analysis Document".

III. METHOD

There are several methods used in this research.

A. Data Collection Method

1. Literature study: the required data were gathered from a variety of books, journals, and other sources.

2. Collection of documents: we collected documents originating from the Corpus IndonesianMovieReview v1.0 [10]. These are Indonesian-language movie review documents, 765 in total, consisting of 381 positively labeled and 384 negatively labeled documents. The authors use the documents from [10] because this research requires 700 training documents and 30 testing documents, and the corpus is sufficient for that purpose. The data were divided into two parts, training data and testing data: of the total number of documents, 700 were used as training data, consisting of 350 positively labeled and 350 negatively labeled documents, and 30 documents were used as testing data.

B. Sentiment Analysis

At this stage, the sentiment analysis process uses Text Mining, consisting of text preprocessing, text transformation, feature selection, and pattern discovery.

C. Performance Evaluation System

The sentiment analysis system performance evaluation is done by calculating accuracy, precision, and recall to measure the system's performance.
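For reference, the three measures used throughout the evaluation follow the standard confusion-matrix definitions; the TP, FP, FN, and TN symbols below are the usual true/false positive/negative counts and are not notation taken from the paper itself:

\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
\]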
IV. DISCUSSION

A. Data Preparation

1. Training data documents: the training data in this research were 700 documents that have been labeled with sentiments and are used for learning by the algorithm.

2. Testing data documents: the testing data in this research were 30 documents whose sentiment is analyzed automatically by the system.

B. Sentiment Analysis with the Naïve Bayes Algorithm

According to [7], there are four basic process steps in Text Mining: text preprocessing, text transformation, feature selection, and pattern discovery.

This research uses 700 training documents, consisting of 350 positively labeled and 350 negatively labeled documents, and 30 testing documents.

1. Data Training

The training process for sentiment analysis uses the Naïve Bayes algorithm.

a) Text Preprocessing

In this process, case folding is performed, changing uppercase characters to lowercase. In addition, the articles are cleaned of punctuation and numbers, which are considered to carry no sentiment value. Next, tokenizing is performed: the rows of words in sentences, paragraphs, or pages are separated into single tokens or word fragments. Table 3 shows an example of the text preprocessing result.

Table 3. Sample text preprocessing result

Original document: "From Helfi Kardit, a director who is known for works such as Ghost Empty Bench and Claiming Apostles, comes "dLove", which is admirably able to summarize the whole cliché of Indonesian film in a duration of only 90 minutes, starting from a love triangle."

Text preprocessing result: "from helfi kardit a director who is known for works such as ghost empty bench and claiming apostles comes dlove which is admirably able to summarize the whole cliché of indonesian film in a duration of only minutes starting from a love triangle"
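As a concrete illustration of the steps above, the following is a minimal Python sketch of case folding, cleaning, and tokenizing; the function name and the regular expression are illustrative choices, not the authors' implementation.

```python
import re

def preprocess(text):
    """Case folding, cleaning of punctuation and numbers, and tokenizing."""
    text = text.lower()                    # case folding: uppercase -> lowercase
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and numbers
    return text.split()                    # tokenizing: split into word tokens

print(preprocess('From Helfi Kardit, a director known for "dLove" (90 minutes).'))
# -> ['from', 'helfi', 'kardit', 'a', 'director', 'known', 'for', 'dlove', 'minutes']
```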
b) Text Transformation

At this stage filtering is performed: each word in the document is examined to see whether it is included in the stopword list, that is, the list of words considered to carry no sentiment value. If the word is included in the stopword list, it is deleted. Furthermore, each remaining word is reduced to its base form (stemming) using the Nazief-Andriani algorithm. Text transformation in the training process produced a vocabulary of 5,324 words, with 83,037 positive word occurrences and 72,226 negative word occurrences.
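The sketch below illustrates this stage under stated assumptions: the stopword list is a tiny illustrative stand-in, and the stemmer is passed in as a function, since the paper stems with the Nazief-Andriani algorithm (for which the third-party Sastrawi library is one possible implementation) rather than the identity function used in the example call.

```python
# Illustrative subset of Indonesian stopwords, not the list used by the authors.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "adalah"}

def transform(tokens, stem):
    """Drop stopwords, then map each remaining token to its base (stemmed) form."""
    return [stem(t) for t in tokens if t not in STOPWORDS]

# Example with an identity "stemmer" standing in for Nazief-Andriani stemming:
print(transform(["film", "yang", "bagus", "dan", "menarik"], stem=lambda t: t))
# -> ['film', 'bagus', 'menarik']
```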
c) Feature Selection

Although the previous stage already deletes non-descriptive words (stopwords), not all of the remaining words in the documents are significant. The basic idea of feature selection is to remove words whose occurrence in the documents is too rare or too frequent; this is done to reduce the words that are considered to have little influence on the sentiment.

Examples of the feature selection results against the critical value 2.706 at significance level α 0.1 can be seen in Table 1, which shows some of the Chi-Square values that were obtained.

Table 1. Sample feature selection results

No.  Word        Chi-Square value
1    need        17.4538
2    director    14.9225
3    production  2.75146
4    success     6.41822
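A minimal sketch of how the Chi-Square score of a single term can be computed from a 2x2 term/class contingency table and compared against the 2.706 critical value (α = 0.1, one degree of freedom); the helper name and the example counts are illustrative assumptions, not values from the paper.

```python
def chi_square(n11, n10, n01, n00):
    """2x2 Chi-Square statistic for one term.

    n11: positive documents containing the term   n10: negative documents containing it
    n01: positive documents without the term      n00: negative documents without it
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

CRITICAL_VALUE = 2.706  # Chi-Square critical value at alpha = 0.1, 1 degree of freedom

# Illustrative counts: the term is kept only if its statistic exceeds the critical value.
score = chi_square(n11=40, n10=10, n01=310, n00=340)
print(score, score > CRITICAL_VALUE)
```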
Classification with the Naïve Bayes algorithm chooses the class with the maximum a posteriori probability:

\[
C_{MAP} = \operatorname*{arg\,max}_{C_j} \; P(C_j) \prod_{i} P(F_i \mid C_j)
\]

where $C_j$ is a sentiment class and the $F_i$ are the word features of the document.

Here is an example of analyzing a document using the Naïve Bayes algorithm that has previously been trained on the 700 training documents. The documents have gone through the training stages, which generated 51,664 positive word occurrences and 42,332 negative word occurrences, as well as a vocabulary of 1,011 words. The calculation of the values for each testing document is shown in Table 5.
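The following is a minimal Python sketch of multinomial Naïve Bayes scoring consistent with the C_MAP rule above, using log probabilities and Laplace smoothing. The class totals (51,664 and 42,332 word occurrences, vocabulary of 1,011) and the 350/350 document priors come from the training run described in the text; the per-word counts in the example call are illustrative assumptions, not values from the paper's tables.

```python
import math

# Totals from the training run after Chi-Square selection at alpha = 0.1 (see text).
POS_TOTAL, NEG_TOTAL, VOCAB_SIZE = 51_664, 42_332, 1_011
PRIOR = {"positive": 350 / 700, "negative": 350 / 700}  # 350 training documents per class

def log_score(tokens, word_counts, class_total, prior):
    """log P(C) + sum_i log P(F_i | C) with Laplace (add-one) smoothing."""
    score = math.log(prior)
    for t in tokens:
        score += math.log((word_counts.get(t, 0) + 1) / (class_total + VOCAB_SIZE))
    return score

def classify(tokens, pos_counts, neg_counts):
    """Return the class with the maximum a posteriori (C_MAP) score."""
    pos = log_score(tokens, pos_counts, POS_TOTAL, PRIOR["positive"])
    neg = log_score(tokens, neg_counts, NEG_TOTAL, PRIOR["negative"])
    return "positive" if pos >= neg else "negative"

# Illustrative per-word counts (not taken from the paper):
pos_counts = {"bagus": 120, "menarik": 80, "buruk": 5}
neg_counts = {"bagus": 10, "menarik": 15, "buruk": 90}
print(classify(["film", "bagus", "menarik"], pos_counts, neg_counts))  # -> "positive"
```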
The testing data consist of 15 positive and 15 negative documents, 30 documents in total.

The experiments on the testing data with Chi-Square feature selection (significance level α 0.1) produced the results shown in Table 12.

Table 12. Results of the experiments on the testing data with Chi-Square feature selection

                Relevant   Not relevant
Retrieved          14           1
Not retrieved       1          14

Referring to the data in Table 12, the system performance evaluation computed according to the equations for precision, recall, and accuracy gives:

accuracy   precision   recall
93.33%     93.33%      93.33%

The experiments on the testing data without Chi-Square feature selection resulted in 15 True Positive, 0 False Positive, 8 False Negative, and 7 True Negative documents, with details as shown in Table 7.

Table 7. Results of the experiments on the testing data without Chi-Square feature selection

                Relevant   Not relevant
Retrieved          15           0
Not retrieved       8           7

Referring to the data in Table 7, the system performance evaluation according to the equations for precision, recall, and accuracy gives 73.33% accuracy, 100.00% precision, and 65.21% recall.

A further system performance evaluation obtained an accuracy of 90%, a precision of 80%, and a recall of 100%, as shown in Table 17.

The results of Chi-Square feature selection on the training data at several significance levels are summarized below; the row for α 0.1 corresponds to the word occurrence and vocabulary counts used in the classification example above.

No.  Significance level α   Positive word occurrences   Negative word occurrences   Vocabulary
2    0.25                   58,837                      48,734                      1,824
3    0.1                    51,664                      42,332                      1,011
4    0.05                   46,091                      37,149                      694
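To make the comparison reproducible, here is a small sketch that recomputes the three measures from the confusion matrices above; the function name is an illustrative choice.

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, and recall from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Without Chi-Square feature selection (Table 7):
print(["%.2f%%" % (v * 100) for v in evaluate(tp=15, fp=0, fn=8, tn=7)])
# -> ['73.33%', '100.00%', '65.22%']  (recall is truncated to 65.21% in the abstract)

# With Chi-Square feature selection at alpha = 0.1 (Table 12):
print(["%.2f%%" % (v * 100) for v in evaluate(tp=14, fp=1, fn=1, tn=14)])
# -> ['93.33%', '93.33%', '93.33%']
```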