
Conference Paper · November 2019


DOI: 10.1109/CITSM47753.2019.8965332



Chi-Square Feature Selection Effect On Naive Bayes Classifier Algorithm Performance For Sentiment Analysis Document

(1) Nurhayati, (2) Armanda Eka Putra, (3) Luh Kesuma Wardhani
(1),(2),(3) Department of Informatics, Syarif Hidayatullah State Islamic University, Jakarta, Indonesia
(4) Busman
(4) Department of Management, School of Economics Gotong Royong, Jakarta, Indonesia
(nurhayati, luhkesuma)@uinjkt.ac.id, [email protected], [email protected]

Abstract— The main problem in using the Naïve Bayes algorithm for sentiment analysis is its sensitivity to the selection of features. Chi-Square feature selection exists to eliminate features that have little influence. This study aimed to determine the effect of Chi-Square feature selection on the performance of the Naïve Bayes algorithm in document sentiment analysis. Data were taken from the Indonesian Movie Review Corpus v1.0: 700 training documents and 30 test documents. Testing was done by analyzing sentiment documents with and without Chi-Square feature selection, and the results were evaluated by accuracy, precision, and recall. Sentiment analysis without feature selection obtained 73.33% accuracy, 100.00% precision, and 65.21% recall, while with Chi-Square feature selection (significance level α = 0.1) it obtained 93.33% accuracy, 93.33% precision, and 93.33% recall. From these results, it can be seen that Chi-Square feature selection affects the performance of the Naïve Bayes algorithm in document sentiment analysis.

Keywords— Sentiment analysis, Feature Selection, Naïve Bayes, Chi-Square, Text Mining

I. INTRODUCTION

Sentiment analysis, or opinion mining, is the process of understanding, extracting, and processing textual data automatically to obtain the sentiment information contained in an opinion sentence. [1][2]

Based on research conducted in [3][4], sentiment analysis is done to see whether a person's opinions on an issue or topic in the news tend to be negative, positive, or neutral, so that the opinions collected can become useful information.

One example of the use of sentiment analysis in the real world is the identification of market trends and market opinion toward a product. The magnitude of the effect and benefits of sentiment analysis has led research and sentiment-analysis-based applications to grow rapidly; in America alone there are about 20-30 companies focusing on sentiment analysis services [5]. Research in the field of opinion mining began to flourish in 2002. Turney in 2002 did research on opinion mining using data containing consumer reviews of products. The method used was Semantic Orientation using Pointwise Mutual Information (SO-PMI). The best results achieved were 84% accuracy for reviews of motor vehicles and 66% for movie reviews. Pang et al. in 2002 classified movie reviews at the document level as positive or negative using supervised learning techniques. A set of movie reviews previously labeled as either positive or negative was used as training data for a machine learning algorithm. The accuracy obtained ranged from 72% to 83%.

The information contained in a review document is unstructured text data, and Text Mining is needed to handle unstructured text. Text Mining activities are important in classification or categorization. [6] explained that Text Mining is one of the techniques that can be used to classify documents; it is a variation of data mining that tries to find interesting patterns in large collections of textual data. Text classification has previously been done by various methods, such as Naïve Bayes [5] and K-Nearest Neighbor [1], and in a comparison of text classification methods the Naïve Bayes method reached 97% accuracy, Support Vector Machines 96.9%, Neural Networks 93%, and Decision Trees 91.1%.

II. RELATED WORK

Research on sentiment analysis using the Naïve Bayes method has previously been done by [7], entitled "Sentiment Analysis Method Using Naïve Bayes Classifier With Chi-Square Feature Selection"; the Naïve Bayes method obtained 83% accuracy. Meanwhile, a study [3] entitled "Text Mining in Insurance Sentiment Analysis Using the Naïve Bayes Classifier Method" mentions that the NBC method can obtain up to 95% accuracy.

The Naïve Bayes method is simple but has high accuracy. Its advantage is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) required for classification. [3]

As described above, among the research that has been done on obtaining sentiment values from documents, the Naïve Bayes analysis method achieved a 95% accuracy rate. Therefore, in this study the authors chose the Naïve Bayes classifier as the classification algorithm.

The Naïve Bayes method is very simple and efficient, but on the other hand it is very sensitive to the selection of features. According to Chen et al. [8], a major problem in text classification is the high dimensionality of the feature space; it is often the case that a text has tens of thousands of features. Most of these features are irrelevant and not useful for text classification and can even reduce accuracy; therefore, appropriate feature selection is indispensable.
The 7th International Conference on Cyber and IT Service Management (CITSM 2019)
Jakarta Convention Center – Jakarta, November 4-6, 2019

Research conducted in [8] indicates that feature selection is an important step in text classification and directly influences classification performance. Text classification has a problem related to the dimensionality of the data: a document is represented as a collection of words (a bag of words), in which the words in the document are assumed not to depend on each other.

Several techniques have been used to perform document feature selection, including Document Frequency (DF) thresholding, Information Gain, Mutual Information (MI), Term Strength (TS), and Chi-Square testing (X²).

Sun et al. [9] stated that, in document classification, Chi-Square is a supervised feature selection method that can eliminate many features without compromising accuracy.

In previous research, the study [9] entitled "Sentiment Analysis Method Using Naïve Bayes Classifier With Chi-Square Feature Selection" reported that the NBC method can reach 83% accuracy, while the study [9] entitled "Feature Selection By Chi-Square In Naïve Bayes Algorithm for News Classification" obtained 98% accuracy. Therefore, in this study the authors chose Chi-Square as the feature selection method.

This study was conducted to prove the influence of Chi-Square feature selection on the performance of the Naïve Bayes algorithm in document sentiment analysis: how big the influence of Chi-Square feature selection is on the values of accuracy, precision, and recall, which serve as measures of how accurately the system generates the appropriate sentiment.

Based on the above background, the authors decided to conduct research under the title "Chi-Square Feature Selection Effect On Naïve Bayes Classifier Algorithm Performance For Sentiment Analysis Document".

III. METHOD

Several methods were used in this research.

A. Data Collection Method

1. Literature study: data were gathered from a variety of books, journals, and other sources.

2. Collection of documents: we collected documents originating from the Corpus IndonesianMovieReview v1.0 [10].

These are Indonesian-language movie review documents. There are 765 documents in total, consisting of 381 documents labeled positive and 384 labeled negative. The authors use documents from the research in [10] because 700 training documents and 30 testing documents were needed, and this corpus is sufficient for the research. The data were divided into two parts, training data and testing data. Of the total number of documents, 700 were used as training data, consisting of 350 positive-labeled and 350 negative-labeled documents, and 30 were used as testing data.

B. Sentiment Analysis

At this stage, the sentiment analysis process uses Text Mining, consisting of text preprocessing, text transformation, feature selection, and pattern discovery.

C. Performance Evaluation System

The performance of the sentiment analysis system is evaluated by calculating accuracy, precision, and recall.

IV. DISCUSSION

A. Data Preparation

1. Training documents. The training data in this research were 700 documents that had been labeled with sentiments and were used for learning by the algorithm.

2. Testing documents. The testing data in this research were 30 documents whose sentiment was analyzed automatically by the system.

B. Sentiment Analysis with the Naïve Bayes Algorithm

According to [7], there are four basic process steps in Text Mining: text preprocessing, text transformation, feature selection, and pattern discovery. This research used 700 training documents, consisting of 350 positive-labeled and 350 negative-labeled documents, and 30 testing documents.

1. Data Training

The training process of sentiment analysis uses the Naïve Bayes algorithm.

a) Text Preprocessing

In this process, case folding is performed, changing uppercase characters to lowercase. In addition, the articles are cleaned of punctuation and numbers, which are considered to have no sentiment value. Next, tokenizing is performed: the process of separating rows of words in sentences, paragraphs, or pages into single tokens or word fragments.

b) Text Transformation

At this stage filtering is performed: each word in the document is examined to see whether it is included in the stopword list, a list of words considered to have no sentiment value. If the word is in the stopword list, it is deleted. Furthermore, each word is reduced to its base form (stemming) using the Nazief-Adriani algorithm. Text transformation in the training process produced 5,324 vocabulary words, 83,037 positive words, and 72,226 negative words.

c) Feature Selection

Although the previous stage has deleted non-descriptive words (stop words), not all the

remaining words in the document have significance. The basic idea of feature selection is to remove words whose occurrence in a document is too rare or too frequent. This is done to reduce the words that are considered to have little influence on sentiment.

Examples of feature selection results with the critical value at significance level α = 0.1, namely 2.706, can be seen in Table 1, which shows some of the Chi-Square values that were obtained.

Table 1. Sample Feature Selection Results
No.  Word        Chi-Square value
1    need        17.4538
2    director    14.9225
3    production  2.75146
4    success     6.41822

The selection process reduced the word features to 1,011, from the previous 5,324.

d) Pattern Discovery

At this stage the probability value of each sentiment is sought: P(Cj), with C the sentiment and j the sentiment index. Next, the probability value of each word in each sentiment is calculated using equation (2.8):

P(Fi|Cj) = (ni + 1) / (n + |V|)    (2.8)

where P(Fi|Cj) is the probability value of a word or feature in a sentiment, ni is the frequency of the word in the sentiment, n is the sum of all words in the category, and |V| is the number of unique words in all the training data. The calculated values for the training data are shown in Table 2.

Table 2. Sample word probabilities on the training documents
Word        POSITIVE                                       NEGATIVE
need        (136 + 1) / (51664 + 1011) = 0.0026008543      (65 + 1) / (42332 + 1011) = 0.0015227372
director    (609 + 1) / (51664 + 1011) = 0.0115804461      (428 + 1) / (42332 + 1011) = 0.0098977920
production  (224 + 1) / (51664 + 1011) = 0.0042714760      (156 + 1) / (42332 + 1011) = 0.0036222689
success     (184 + 1) / (51664 + 1011) = 0.0035121025      (139 + 1) / (42332 + 1011) = 0.0032300487
plot        (95 + 1) / (51664 + 1011) = 0.0018224964       (60 + 1) / (42332 + 1011) = 0.0014073784

2. Data Testing

The testing process follows the same sentiment analysis flow using the Naïve Bayes algorithm.

a) Text Preprocessing

In this process, case folding is performed, changing uppercase characters to lowercase; the articles are also cleaned of punctuation and numbers, which are considered to have no sentiment value. Tokenizing is also performed: the process of separating rows of words in sentences, paragraphs, or pages into single tokens or word fragments. Table 3 shows an example of text preprocessing.

Table 3. Sample Text Preprocessing Results
Authentic document: From Helfi Kardit, a director who is known for his works such as Ghost Empty Bench, The Claiming Apostles, and Arisan Brondong, comes "dLove", which is admirably able to summarize the whole cliche that exists in Indonesian film in a duration of only 90 minutes, starting from a love triangle.
Text preprocessing result: of helfi kardit a director who is famous for his work as the ghost of bench the rasul and then dlove coming which are admirably able to summarize the whole thing that exist in indonesian film the cliche in duration on all minutes from start in and minute for tell love triangle

b) Text Transformation

At this stage filtering is performed. Each word in the document is examined to see whether it is in the stopword list, a list of words considered to have no sentiment value; if so, it is deleted. Furthermore, each word is reduced to its base form (stemming) using the Nazief-Adriani algorithm. Table 4 shows the result of the document after the text transformation process.

Table 4. Sample Text Transformation Results
Text preprocessing result: of helfi kardit a director who is known for his work as the ghost of the empty bench the apostles and the gathering foregoing dlove which are admirably able to summarize the whole thing cliches that exist in indonesian film in the duration that just all the minutes it started from a love triangle
Text transformation result: famous director, ghosts of benches, empty, apostles, gathering, present, amazed, summarize, cliche, duration, minute, love triangle

c) Pattern Discovery

From the training data, the probability value of each sentiment is obtained: P(Cj), with C the sentiment and j the sentiment index. At this stage 2 categories are used, namely POSITIVE and NEGATIVE sentiment, with the prior probability of each sentiment calculated using equation (2.3). Since the training data contain 350 positive and 350 negative documents, both priors are 350/700.

Furthermore, the document is classified using the Bayes rule for document classification set forth in equation (2.4), where CMAP is the sentiment category into which the document is classified and P(Cj) is the probability of sentiment category j. During the classification process, the Bayes approach chooses the category that has the highest probability value, calculated using equation (2.9); the P(Fi|Cj) values are calculated by equation (2.8).
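A minimal sketch of equations (2.8) and (2.9), assuming the counts shown in Table 2; the classify helper and its argument names are illustrative, not the authors' implementation:

```python
# Sketch of equations (2.8) and (2.9): Laplace-smoothed word likelihoods
# and the maximum a posteriori class choice. Illustrative only.

def likelihood(count_in_class, total_words_in_class, vocab_size):
    """Equation (2.8): P(Fi|Cj) = (ni + 1) / (n + |V|)."""
    return (count_in_class + 1) / (total_words_in_class + vocab_size)

# The 'need' entry of Table 2 (51,664 positive words, |V| = 1,011)
p_need_pos = likelihood(136, 51664, 1011)
print(round(p_need_pos, 10))  # 0.0026008543, as in Table 2

def classify(tokens, counts, totals, vocab_size, priors):
    """Equation (2.9): choose the class maximizing P(Cj) * prod P(Fi|Cj)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for w in tokens:
            score *= likelihood(counts[c].get(w, 0), totals[c], vocab_size)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

In practice the product in equation (2.9) is usually computed as a sum of logarithms to avoid numeric underflow on long documents.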

Equation (2.9), the maximum a posteriori decision rule, is:

CMAP = argmax over Cj of P(Cj) · Πi P(Fi|Cj)    (2.9)

Here is an example of analyzing a test document using the Naïve Bayes algorithm with the 700 training documents described previously. The training stage generated 51,664 positive words and 42,332 negative words, as well as 1,011 vocabulary words. The calculation of the P(Fi|Cj) values for each word of the test document is shown in Table 5.

Table 5. Value calculation
Word      POSITIVE                           NEGATIVE
director  ((609 + 1) / (51664 + 1011))^1     ((428 + 1) / (42332 + 1011))^1
know      ((406 + 1) / (51664 + 1011))^1     ((205 + 1) / (42332 + 1011))^1
ghost     ((0 + 1) / (51664 + 1011))^1       ((0 + 1) / (42332 + 1011))^1
bench     ((0 + 1) / (51664 + 1011))^1       ((0 + 1) / (42332 + 1011))^1

The value obtained is greater for the NEGATIVE sentiment, so the document is classified as a document with NEGATIVE sentiment.

V. RESULTS

A. Experimental Results of Sentiment Analysis

1. Without Chi-Square

Training on the 700 documents, comprising 350 positive and 350 negative documents, without Chi-Square generated 83,037 positive words and 72,266 negative words, and produced 5,324 vocabulary words, as shown in Table 6.

Table 6. Experiment results without Chi-Square
No.  Positive words  Negative words  Vocabulary
1    83,037          72,266          5,324

Testing on the 30 documents, consisting of 15 positive and 15 negative documents, produced 15 True Positive, 0 False Positive, 8 False Negative, and 7 True Negative documents, with details as shown in Table 7.

Table 7. Testing results without Chi-Square
                Relevant  Not relevant
Retrieved       15        0
Not retrieved   8         7

Referring to the data in Table 7, the system performance evaluation was computed according to the equations for precision, recall, and accuracy. The evaluation obtained 73.33% accuracy, 100% precision, and 65.21% recall, as shown in Table 8.

Table 8. Performance evaluation of the experiment without Chi-Square
POSITIVE  NEGATIVE  TOTAL  accuracy  precision  recall
15        15        30     73.33%    100%       65.21%

2. Chi-Square with significance level α = 0.25 (critical value = 1.323)

Training on the 700 documents, comprising 350 positive and 350 negative documents, using Chi-Square with the critical value at significance level α = 0.25, namely 1.323, generated 58,837 positive words and 48,734 negative words, and produced 1,824 vocabulary words, as shown in Table 9.

Table 9. Experiment results at significance level α = 0.25
No.  Positive words  Negative words  Vocabulary
1    58,837          48,734          1,824

Testing on the 30 documents, consisting of 15 positive and 15 negative documents, produced 15 True Positive, 0 False Positive, 4 False Negative, and 11 True Negative documents, with details as shown in Table 10.

Table 10. Testing results at significance level α = 0.25
                Relevant  Not relevant
Retrieved       15        0
Not retrieved   4         11

Referring to the data in Table 10, the system performance evaluation was recomputed according to the equations for precision, recall, and accuracy.
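A minimal sketch of these evaluation equations, using the confusion-matrix counts of Table 7; note that the paper reports recall as 65.21%, truncating 65.217...% rather than rounding:

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Counts from Table 7 (without Chi-Square): TP=15, FP=0, FN=8, TN=7
acc, prec, rec = evaluate(tp=15, fp=0, fn=8, tn=7)
print(f"{acc:.2%} {prec:.2%} {rec:.2%}")  # 73.33% 100.00% 65.22%
```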

The evaluation obtained 86.66% accuracy, 100% precision, and 78.94% recall, as shown in Table 11.

Table 11. Performance evaluation of the experiment at significance level α = 0.25
POSITIVE  NEGATIVE  TOTAL  accuracy  precision  recall
15        15        30     86.66%    100%       78.94%

3. Chi-Square with significance level α = 0.1 (critical value = 2.706)

Training on the 700 documents, comprising 350 positive and 350 negative documents, using Chi-Square with the critical value at significance level α = 0.1, namely 2.706, generated 51,664 positive words and 42,332 negative words, and produced 1,011 vocabulary words, as shown in Table 12.

Table 12. Experiment results at significance level α = 0.1
No.  Positive words  Negative words  Vocabulary
1    51,664          42,332          1,011

Testing on the 30 documents, consisting of 15 positive and 15 negative documents, produced 14 True Positive, 1 False Positive, 1 False Negative, and 14 True Negative documents, with details as shown in Table 13.

Table 13. Testing results at significance level α = 0.1
                Relevant  Not relevant
Retrieved       14        1
Not retrieved   1         14

Referring to the data in Table 13, the system performance evaluation was recomputed according to the equations for precision, recall, and accuracy. The evaluation obtained 93.33% accuracy, 93.33% precision, and 93.33% recall, as shown in Table 14.

Table 14. Performance evaluation of the experiment at significance level α = 0.1
POSITIVE  NEGATIVE  TOTAL  accuracy  precision  recall
15        15        30     93.33%    93.33%     93.33%

4. Chi-Square with significance level α = 0.05 (critical value = 3.841)

Training on the 700 documents, comprising 350 positive and 350 negative documents, using Chi-Square with the critical value at significance level α = 0.05, namely 3.841, generated 46,091 positive words and 37,149 negative words, and produced 694 vocabulary words, as seen in Table 15.

Table 15. Experiment results at significance level α = 0.05
No.  Positive words  Negative words  Vocabulary
1    46,091          37,149          694

Testing on the 30 documents, consisting of 15 positive and 15 negative documents, produced 12 True Positive, 3 False Positive, 0 False Negative, and 15 True Negative documents, with details as shown in Table 16.

Table 16. Testing results at significance level α = 0.05
                Relevant  Not relevant
Retrieved       12        3
Not retrieved   0         15

Referring to the data in Table 16, the system performance evaluation was recomputed according to the equations for precision, recall, and accuracy. The evaluation obtained 90% accuracy, 80% precision, and 100% recall, as shown in Table 17.

Table 17. Performance evaluation of the experiment at significance level α = 0.05
POSITIVE  NEGATIVE  TOTAL  accuracy  precision  recall
15        15        30     90%       80%        100%

B. Comparison of Feature Selection and Evaluation System

Table 18. Comparison of feature selection
No.  Significance level α  Positive words  Negative words  Vocabulary
1    without Chi           83,037          72,266          5,324
2    0.25                  58,837          48,734          1,824
3    0.1                   51,664          42,332          1,011
4    0.05                  46,091          37,149          694
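The vocabulary sizes in Table 18 come from keeping only terms whose Chi-Square score exceeds the chosen critical value. A sketch of that score for a single term, computed from a 2x2 term/class contingency table with hypothetical counts (not taken from the corpus):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a term against a class from a 2x2
    contingency table: n11 = docs in the class containing the term,
    n10 = docs in the class without it, n01 = docs outside the class
    containing it, n00 = docs outside the class without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den

# Hypothetical counts for one term over 700 training documents
score = chi_square(n11=40, n10=310, n01=8, n00=342)
print(score > 2.706)  # True: the term survives selection at alpha = 0.1
```

A term whose occurrence is independent of the class scores near zero and is discarded; tightening α (raising the critical value) discards more terms, which is why the vocabulary shrinks from 1,824 to 694 as α goes from 0.25 to 0.05.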

The use of Chi-Square strongly influences the number of resulting features. Without Chi-Square, 5,324 features were produced; with Chi-Square the number of features is smaller: 1,824 at significance level α = 0.25, 1,011 at α = 0.1, and 694 at α = 0.05.

1. Comparison of Performance Evaluation Results

Table 19. Comparison of performance evaluation results
No.  Significance level α  accuracy  precision  recall
1    without Chi           73.33%    100%       65.21%
2    0.25                  86.66%    100%       78.94%
3    0.1                   93.33%    93.33%     93.33%
4    0.05                  90%       80%        100%

The comparison of accuracy, precision, and recall shown in Table 19 is described in the following sections.

a) Comparison of Accuracy Results

Based on the research that has been done, the use of the chi-square method for feature selection improves accuracy because it reduces the words that are considered to have little influence on the Naïve Bayes analysis. This is shown in Figure 1.

Figure 1. Accuracy graph of sentiment analysis

From Figure 1 it can be seen that accuracy increases after feature selection using Chi-Square. Accuracy without Chi-Square reached 73.33%, while with Chi-Square it reached 86.66% at significance level α = 0.25, 93.33% at α = 0.1, and 90% at α = 0.05.

b) Comparison of Precision Results

The use of Chi-Square also affects the resulting precision value, as can be seen in Figure 2.

Figure 2. Precision graph of sentiment analysis

Figure 2 shows a noticeable decrease in precision after using Chi-Square feature selection. Precision without Chi-Square reached 100%, while with Chi-Square it reached 100% at significance level α = 0.25, 93.33% at α = 0.1, and 80% at α = 0.05.

c) Comparison of Recall Results

The use of Chi-Square also affects the resulting recall value, which can be seen in Figure 3.

Figure 3. Recall graph of sentiment analysis

Figure 3 shows an overall increase in recall after feature selection using Chi-Square. Recall without Chi-Square reached 65.21%, while with Chi-Square it reached 78.94% at significance level α = 0.25, 93.33% at α = 0.1, and 100% at α = 0.05.

From the above experiments, the critical value at significance level α = 0.1, namely 2.706, was finally chosen because it gives the best average results: 93.33% for accuracy, precision, and recall.

VI. CONCLUSION

The experiments prove that the use of chi-square feature selection is influential in increasing the accuracy, precision, and recall of document sentiment analysis. This is in line with the research in [11]: using Chi-Square in feature selection can improve classification results. The classification accuracy obtained from the document sentiment analysis research using the Naïve Bayes algorithm without feature selection is 73.33%, while with Chi-Square it reaches 86.66% at significance level α = 0.25, 93.33% at α = 0.1,

and 90% at significance level α = 0.05. The results obtained from the use of the significance level α contrast with the results of [9], whose research on classifying news automatically using the Naïve Bayes algorithm obtained 96.67% accuracy without feature selection and 98% with feature selection at each of the significance levels α = 0.05, 0.01, 0.005, and 0.001.

Feature selection can also be used for checking documents and analyzing the features in a document. We can find the precision of features across two or more documents, which can be implemented in a library for checking the similarity of document content.
REFERENCES

[1] S. Informasi, M. Lalu, K. Tegal, A. Amartiwi, and T. Andrasto, "Pemilihan Feature Dengan Chi Square Dalam Algoritma Naïve Bayes Untuk Klasifikasi Berita," vol. 5, no. 1, pp. 33–43, 2018.
[2] N. Buslim, B. Busman, N. Sigit Sinatrya, and T. Kania, Analisa Sentimen Menggunakan Data Twitter, Flume, Hive Pada Hadoop dan Java Untuk Deteksi Kemacetan di Jakarta, vol. 3, 2018.
[3] L. Oktasari, Y. H. Chrisnanto, and R. Y., "Text Mining Dalam Analisis Sentimen Asuransi Menggunakan Metode Naïve Bayes Classifier."
[4] Nurhayati, T. S. Kania, L. K. Wardhani, N. Hakiem, Busman, and H. Maarif, "Big Data Technology for Comparative Study of K-Means and Fuzzy C-Means Algorithms Performance," 2018 7th Int. Conf. Comput. Commun. Eng., no. 1, pp. 202–207, 2018.
[5] B. Liu, Sentiment Analysis and Opinion Mining, May 2012.
[6] J. Ling, I. P. E. N. Kencana, and T. B. Oka, "Analisis Sentimen Menggunakan Metode Naïve Bayes Classifier Dengan Seleksi Fitur Chi Square," E-Jurnal Mat., vol. 3, no. 3, p. 92, 2014.
[7] S. Sanjaya, S. Sanjaya, and E. A. Absar, "Pengelompokan Dokumen Menggunakan Winnowing Fingerprint dengan Metode K-Nearest Neighbour," J. CoreIT, vol. 1, no. 2, pp. 50–56, Nov. 2015.
[8] S. Ernawati, "Penerapan Particle Swarm Optimization Untuk Seleksi Fitur Pada Analisis Sentimen Review Perusahaan Penjualan Online Menggunakan Naïve Bayes," J. Evolusi, vol. 4, no. 1, pp. 45–54, 2016.
[9] S. Informasi, M. Lalu, K. Tegal, A. Amartiwi, and T. Andrasto, "Pemilihan Feature Dengan Chi Square Dalam Algoritma Naïve Bayes Untuk Klasifikasi Berita," vol. 5, pp. 33–43, 2018.
[10] A. F. Wicaksono, Korea Advanced Institute of Science and Technology, "Corpus IndonesianMovieReview v1.0," 2013.
[11] B. Kurniawan, S. Effendi, and O. S. Sitompul, "Klasifikasi Konten Berita Dengan Metode Text Mining," J. Dunia Teknol. Inf., vol. 1, no. 1, pp. 14–19, 2012.
