Abstract: With the increasing use of Web 2.0 platforms such as Web Blogs, discussion forums, Wikis, and various other types of social media, people began to share their experiences and opinions about products or services on the World Wide Web. Web Blogs have thus become an important source of information. In turn, great interest in blog mining has arisen, specifically due to its potential applications, such as opinion or review search engines that collect and analyze data. In this study, we introduce an architecture, implementation, and evaluation of a Web blog mining application, called the BlogMiner, which extracts and classifies people’s opinions and emotions (or sentiment) from the contents of weblogs about movie reviews.

Keywords: blog mining, opinion mining, blog crawler, web blog mining

I. INTRODUCTION

The world’s biggest library, the World Wide Web, is fed with data by Internet users around the world. People share their ideas, interests, emotions, experiences, and knowledge with others via the Internet every day. Thus, mining opinions of people on the Web is an important area of research investigation [1].

Sociologists have used many different ways to recognize people’s natural interests, their aims, and their preferences. To capture the ideas people share over the Web, the most efficient way is to mine their diaries or books (in other words, their blogs). This study introduces a system that is designed to mine ideas to understand the points of view of a Web community.

With increasing usage of the Internet, blogging and blog pages are growing rapidly. Blog pages have become the most popular means to express one’s personal opinions. By the end of 2008, Technorati had indexed 133 million blogs on the global Internet [2].

Mining opinions from Web pages involves several challenges. For example, these opinions, or review data, have to be crawled from Web sites and then separated from non-review data [8].

This study proposes a system that extracts movie review data from blogs. It introduces an architecture and implementation of the system in detail. It also explains a classification of the review data.

The organization of this paper is as follows. Section 2 discusses the literature survey. Sections 3-4 outline the proposed approach and the system architecture. Section 5 addresses the evaluation study. Section 6 concludes the paper with a summary and analysis of results.

II. LITERATURE SURVEY

In recent years, there has been a huge burst of research activity in the areas of sentiment analysis and opinion mining. Earlier studies focused mostly on the interpretation of narrative and point of view in text [5-10]. The widespread awareness of the research problems in sentiment analysis and opinion mining has increased with the rise of machine learning methods in natural language processing and information retrieval; the availability of datasets for machine learning algorithms to be trained on (due to the blossoming of the World Wide Web); and, specifically, the development of review-aggregation Web sites.

Reference [3] describes a sentiment classification application that uses phrase patterns to classify opinions. In this study, at the document classification phase, the authors add tags to certain words in the text, and then match the tags within a sentence with predefined phrase patterns to get the sentiment orientation of the sentence under consideration. Next, they take into account the sentiment orientation of each sentence and classify the text according to the most repeated sentiment.

Reference [4] describes a sentiment miner that extracts sentiment (or opinions) that people express about a subject, such as a company, brand, or product name. In this study, the authors design the sentiment miner with the following challenge in mind: not only the overall opinion about a topic, but also the sentiment about individual aspects of the topic, is essential information of interest. The reason for this is that document-level sentiment classification fails to detect sentiment about individual aspects of the topic. Thus, in the authors’ study, the sentiment miner analyzes grammatical sentence structures and phrases based on natural language processing (NLP) techniques and detects, for each occurrence of a known topic spot, the sentiment specifically about the topic. With these characteristics, the proposed NLP-based sentiment mining system, described in [4], achieved high quality results (approximately 90% accuracy) on various datasets including online review articles, general Web pages, and news articles. The feature extraction algorithm, proposed by [4], successfully identified topic related feature terms
Authorized licensed use limited to: muthu m. Downloaded on June 15,2010 at 06:11:02 UTC from IEEE Xplore. Restrictions apply.
from online review articles, enabling sentiment analysis at finer granularity.

Reference [5] describes an application on sentiment classification with review extraction. This approach extracts the review expressions on specific subjects and attaches a sentiment tag and weight to each expression. Then, it calculates the sentiment indicator of each tag by accumulating the weights of all the expressions corresponding to a tag. Next, it uses a classifier to predict the sentiment label of the text. In this study, the authors use online documents to test the performance of the proposed application. The experimental documents cover two domains: politics and religion. The experiments within those domains achieve accuracy between 85% and 95%.

Reference [6] describes a method of opinion mining to help e-learning systems know the users’ opinions on the courseware and teachers of the e-learning system and to help improve the services. In this study, the authors develop an opinion mining system for e-learning reviews. The goal of this system is to extract and summarize the opinions and reviews, and determine whether these reviews and opinions are positive or negative. This study divides the whole task into four subtasks: expression identification, opinion determination, content-value pair identification, and sentiment analysis. The authors achieve the following precisions for these subtasks, respectively: 94%, 84.2%, 80.9%, and 92.6%.

Reference [7] describes a sentiment mining and retrieval system called Amazing. The authors introduce a ranking mechanism, which is different from a general Web search engine since it utilizes the quality of each review rather than the link structures for generating review authorities. In this system, the most important aspect is that the authors incorporate the temporal dimension information into the ranking mechanism, and make use of temporal opinion quality and relevance in ranking review sentences. This study monitors the changing trends of customer reviews in time and visualizes the changing trends of positive and negative opinion respectively. It then generates a visual comparison between positive and negative evaluation of a particular feature in which potential customers are interested. The authors conduct experiments on the sentiment mining and retrieval system using the customer reviews of four kinds of electronic products including digital cameras, cell phones, laptops, and MP3 players. The evaluation results indicate that the proposed approach achieves a precision of approximately 85%.

Reference [9] describes a multi-knowledge based approach that utilizes WordNet for statistical analysis and movie knowledge. WordNet is a large lexical database of English, developed under the direction of George A. Miller [10]. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms, each expressing a distinct concept. The proposed approach, described in [9], decomposes the problem of review mining and summarization into the following subtasks: identifying feature words and opinion words in a sentence; determining the class of feature word and the polarity of opinion word; identifying the relevant opinion word(s) and then obtaining some valid feature-opinion pairs; producing a summary using the discovered information. The authors use WordNet to generate a keyword list for finding features and opinions. Grammatical rules between feature words and opinion words are then applied to identify the valid feature-opinion pairs. Finally, the authors re-organize the sentences according to the extracted feature-opinion pairs to generate the summary. The objective of this study is to automatically generate a feature class-based summary for arbitrary online movie reviews. Experimental results show that this method has an average precision of approximately 65%. In addition, with this approach, it is easy to generate a summary with movie-related people names as the sub-headlines.

The work proposed in this study is most similar to the work described in [9]. Our approach differs from [9] in the way we calculate the sentiment orientation of the movie reviews from the blogs. The previous work focused on a constant dataset, while the proposed approach crawls the dataset from the blogs. In turn, this is used to calculate movie scores. We discuss our approach in detail in the next section.

III. APPROACH

A. Overview

In this section, we briefly describe the techniques and goals of this study and what we aim to achieve as a result. This study is organized into three phases. The first phase is the crawling phase, in which data is gathered from Web blogs. The second phase is the analyzing phase, in which the data is parsed, processed, and analyzed to extract useful information. The third phase is the visualization phase, in which the information is visualized to better understand the results. More details of the system architecture are explained in the system architecture section (IV).

B. Problem Definition

Web blogs are full of un-indexed and unprocessed text that reflects the opinions of people. Many people make choices by taking the suggestions of other people into account. For example, one likes to buy a product that is most recommended by people who use that product. Thus, there is a need to crawl and process people’s opinions, so that they can be used in decision making processes of potential Web review applications.

In this study, we propose a blog mining system that will extract movie comments from Web blogs and that will show Web blog users what other people think about a particular movie. Figure 1 shows the overall process model of the proposed system. The blog mining process consists of the following three main steps: Web crawling, sentiment analysis, and visualization.

Web crawling: A Web crawler (also known as a Web spider, or Web robot) is a program or automated script that
browses the World Wide Web in a methodical, automated manner. A Web crawler is a type of software agent that takes a list of URLs, called seeds, to visit as input. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to a list of URLs, called the crawl frontier, to visit. URLs from the frontier are recursively visited according to a set of policies. The process of Web crawling is also known as spidering. Many sites, search engines in particular, use spidering as a means of providing up-to-date data. Web crawlers (or spiders) are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam) or gathering text content. In this study, we utilized two open source projects, OpenWebSpider [15] and Arachnode [20], for crawling the Web blogs and collecting data for sentiment analysis.

Figure 1 Blog Miner Overall Process Model

Sentiment analysis: Sentiment analysis has three main tasks: determining subjectivity, determining sentiment orientation, and determining the strength of the sentiment orientation. In this study, we use an unsupervised approach in sentiment analysis. We use OPEN-NLP to find the types (parts of speech) of words. In our approach, we use a keyword database, which contains specific words about the movie domain. We use a keyword algorithm, which searches for the keywords in the text. If a keyword under consideration is found in the database, then the algorithm identifies whether the keyword is an adjective or an adverb and calculates the score. An alternative algorithm, the all-word algorithm, is to look at all words in related sentences and then calculate the general score for a movie.

Visualization: We utilize Zed Graph [14] for visualization to present our findings. Zed Graph provides an ASP web-accessible control for creating 2D line, bar, and pie graphs of arbitrary datasets. It is maintained as an open-source development project. We present the results on the project Web site over a shared database.

IV. SYSTEM ARCHITECTURE

The proposed system architecture consists of several components: Blog Crawler, Sentiment Analyzer, and Web Usage Interfaces.

A. Blog Crawler

One of the most important parts of the proposed system is the Blog Crawler. The crawler needs to analyze as much data as possible to provide accurate results. If the analysis has not been conducted with enough data, the results will only indicate the opinions of a restricted group of people. Although one needs to crawl as many blogs as possible to reach good results, the blogosphere contains huge amounts of data. The storage capacity is limited, and computation and memory limitations also prevent crawling all of the blogosphere. Thus, in this study, to calculate the general opinions of people about a movie, we were only able to crawl some part of the blogosphere. We can assume that when the computation capabilities are improved and the crawled part of the blogosphere grows, the proposed application will produce better results.

We use Arachnode.Net to crawl the Web blogs. Arachnode.Net is an open source Web crawler for downloading, indexing, and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages. Arachnode.Net is written in C# and uses SQL Server 2005. Arachnode.Net uses the Lucene.Net library for indexing and searching. Arachnode.Net is selected because it is very customizable and well written. We start crawling with seeds like www.blogpulse.com and www.technorati.com, because these Web sites contain many links to Web blogs. In turn, this improves the crawling performance. Figure 3 shows the main working process of the crawler.
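The crawl loop described under Web crawling (seed URLs, a crawl frontier, and a visiting policy) can be illustrated with a minimal sketch. This is not the Arachnode.Net or OpenWebSpider implementation; it is a generic breadth-first crawler written in Python with the standard library only. The function name `crawl`, the `fetch` callback, the `max_pages` limit (standing in for a crawl policy), and the example URLs are all assumptions made for illustration. The frontier is processed iteratively with a queue rather than by recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) must return the page HTML as a string, or None on failure.
    Returns a dict mapping each visited URL to its HTML.
    """
    frontier = deque(seeds)  # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)        # URLs already queued, so nothing is visited twice
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


# Offline demonstration with an in-memory "web" (hypothetical URLs).
FAKE_WEB = {
    "http://seed.example/": '<a href="/blog1">b1</a> <a href="http://other.example/post">p</a>',
    "http://seed.example/blog1": '<a href="/">home</a>',
    "http://other.example/post": "no links here",
}
visited = crawl(["http://seed.example/"], FAKE_WEB.get)
print(sorted(visited))
```

Passing `fetch` as a parameter keeps the sketch testable without network access; a real crawler would supply a function that downloads the page and would add policies such as politeness delays and domain filters.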
Figure 2 Web Crawler Architecture
Figure 3 SentiWord Data Table
Figure 5 shows the Entity Relationship Diagram of the
proposed application. Note that this diagram does not
include the Arachnode.Net database, which is used to store
blog pages. The database diagram of Arachnode.Net is
available at [12]. In Figure 5, the “Movies” table is used to
store the score results of each movie under investigation.
The “People” table is used to store all related information
about people who play a role in movies, such as actors,
actresses, and directors. The “SentiWord” table stores the
sentiment dictionary, which was obtained from
SentiWordNet [11]. The “Movie Elements” table stores the
nine keyword categories from the movie domain, and the
“Element Alias” table stores the keywords associated with
these categories.
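The five tables described above can be sketched as a small relational schema. The paper's system uses SQL Server 2005 and the actual columns appear only in Figure 5, so everything below is an assumption made for illustration: SQLite (through Python's standard `sqlite3` module) stands in for the database, and all column names, the foreign keys, and the sample category "acting" with its keywords are hypothetical.

```python
import sqlite3

# Hypothetical columns; only the table names come from the text.
SCHEMA = """
CREATE TABLE Movies (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    score REAL                      -- sentiment score computed for the movie
);
CREATE TABLE People (
    id       INTEGER PRIMARY KEY,
    movie_id INTEGER REFERENCES Movies(id),
    name     TEXT NOT NULL,
    role     TEXT                   -- e.g. actor, actress, director
);
CREATE TABLE SentiWord (
    word      TEXT NOT NULL,
    pos       TEXT NOT NULL,        -- part of speech
    pos_score REAL,
    neg_score REAL
);
CREATE TABLE MovieElements (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL              -- one row per keyword category
);
CREATE TABLE ElementAlias (
    id         INTEGER PRIMARY KEY,
    element_id INTEGER NOT NULL REFERENCES MovieElements(id),
    keyword    TEXT NOT NULL        -- a keyword belonging to the category
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
conn.execute("INSERT INTO MovieElements (id, name) VALUES (1, 'acting')")
conn.executemany(
    "INSERT INTO ElementAlias (element_id, keyword) VALUES (1, ?)",
    [("actor",), ("performance",)],
)
# Look up every keyword that belongs to one category.
keywords = [
    row[0]
    for row in conn.execute(
        "SELECT a.keyword FROM ElementAlias a "
        "JOIN MovieElements e ON e.id = a.element_id "
        "WHERE e.name = 'acting' ORDER BY a.keyword"
    )
]
print(keywords)
```

Keeping categories and their keywords in separate tables lets the keyword list grow without changing the fixed set of categories.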
Figure 5 Blog Miner ER Diagram
Figure 6 The graphs page of the Application
The second category is the graphs. The system utilizes dynamic charts that are created each time users specify a selection as illustrated in Figure 7. Here, we utilize Zed Graph [14], which is an open-source library, written in C#, for creating 2D line and bar graphs of arbitrary datasets. This library provides a high degree of flexibility, i.e., almost every aspect of the graph can be user-modified. Zed Graph has two different libraries that can be used for creating Windows-based applications and Web-based applications. In this study, we use only some parts of the Zed Graph libraries to create a Web-based BlogMiner application.

We present our experimental study by showing the steps of the BlogMiner application for processing the raw data and calculating sentiment scores. Thus, the following sample user-review is chosen from IMDB to illustrate the steps.

Sample Review: “I thought it wouldn't be as good as it was, because thousands of people and reviews said it would suck! It was great, but what it missed was that it needed to be at-least an hour longer, because it missed a-little bit, but it still rocked! I loved it! I thought it was funny, and as did the person next to me, when John says: "I'll be back!"”.
“I/PRP thought/VBD it/PRP would/MD not/RB<-1> be/VB as/RB
good/JJ<0.844> as/IN it/PRP was/VBD ,/, because/IN
thousands/NNS of/IN people/NNS and/CC reviews/NNS said/VBD
it/PRP would/MD suck/VB !/.
(sentence score = -0.844)
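The tagged sentence above can be scored with a short sketch. The exact scoring rule is not spelled out in this part of the paper, so the rule used here is an assumption: each `word/TAG<score>` annotation contributes its score, and a preceding "not" (annotated `<-1>` in the example) flips the sign of the next scored word. This is one reading that reproduces the reported sentence score of -0.844. The function names `parse_tagged` and `sentence_score` are hypothetical.

```python
import re

# Matches tokens such as "good/JJ<0.844>", "not/RB<-1>" or "it/PRP".
TOKEN = re.compile(
    r"^(?P<word>.+)/(?P<tag>[^/<]+)(?:<(?P<score>-?\d+(?:\.\d+)?)>)?$"
)


def parse_tagged(tagged):
    """Split a tagged sentence into (word, tag, score-or-None) triples."""
    tokens = []
    for raw in tagged.split():
        match = TOKEN.match(raw)
        if match is None:
            continue  # skip anything that is not in word/TAG form
        score = match.group("score")
        tokens.append(
            (match.group("word"), match.group("tag"),
             float(score) if score is not None else None)
        )
    return tokens


def sentence_score(tagged):
    """Sum the word scores; a preceding 'not' negates the next scored word.

    Assumed rule for illustration, not necessarily the paper's algorithm.
    """
    total = 0.0
    sign = 1.0
    for word, _tag, score in parse_tagged(tagged):
        if score is None:
            continue
        if word.lower() == "not":
            sign = -sign          # negation marker
        else:
            total += sign * score
            sign = 1.0            # negation applies to one scored word only
    return total


SAMPLE = (
    "I/PRP thought/VBD it/PRP would/MD not/RB<-1> be/VB as/RB "
    "good/JJ<0.844> as/IN it/PRP was/VBD ,/, because/IN thousands/NNS "
    "of/IN people/NNS and/CC reviews/NNS said/VBD it/PRP would/MD "
    "suck/VB !/."
)
print(sentence_score(SAMPLE))
```

In this sample only "not" and "good" carry scores, so the result depends on nothing but the negation rule and the adjective score.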
Acknowledgement: We thank Kadir Ardic and Onur Enez for their contribution to the research presented in this paper. We also thank the Department of Computer Engineering in Marmara University for giving us permission to commence this study and to do the necessary research work by utilizing departmental computer facilities.

REFERENCES

[1] Bing Liu, Web Data Mining - Exploring Hyperlinks, Contents and Usage Data, Text Book, Springer, December 2006.
[2] Technorati Web site is available at http://technorati.com, Access date: October 2009.
[3] Zhongchao Fei, et al., Sentiment Classification Using Phrase Patterns, Proceedings of the Fourth International Conference on Computer and Information Technology (CIT’04), 2004.
[4] Jeonghee Yi, et al., Sentiment Mining in WebFountain, Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), 2005.
[5] Jian Liu, et al., Super Parsing: Sentiment Classification with Review Extraction, Proceedings of the Fifth International Conference on Computer and Information Technology (CIT’05), 2005.
[6] Yun-Qing Xia, et al., The Unified collocation Framework for Opinion Mining, Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.
[7] Qingliang Miao, et al., AMAZING: A sentiment mining and retrieval system, Expert Systems with Applications (2008) doi:10.1016/j.eswa.2008.09.035.
[8] Qiang Ye, et al., Sentiment classification of online reviews to travel destinations by supervised machine learning approaches, Expert Systems with Applications (2008) doi:10.1016/j.eswa.2008.07.035.
[9] Li Zhuang, et al., Movie review mining and summarization, Proceedings of the 15th ACM international conference on Information and knowledge management, 2006.
[10] WordNet Web site is available at http://wordnet.princeton.edu, Access date: October 2009.
[11] Andrea Esuli, et al., SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining, The fifth international conference on Language Resources and Evaluation, LREC 2006.
[12] Web site for Arachnode.Net Database Diagrams is available at http://arachnode.net/media/g/database_diagrams/default.aspx, Access date: October 2009.
[13] Web site for Porter Stemmer is available at http://tartarus.org/~martin/PorterStemmer, Access date: October 2009.
[14] Web site for Zed Graphs is available at http://zedgraph.org/wiki/index.php?title=Main_Page, Access date: October 2009.
[15] Web site for NetSpell is available at http://sourceforge.net/projects/netspell, Access date: October 2009.
[16] Web site for OpenWebSpider is available at http://www.openwebspider.org, Access date: October 2009.
[17] Web site for The Internet Movie Database (IMDB) is available at http://www.imdb.com, Access date: October 2009.
[18] Kadir Ardic, Onur Enez, Undergraduate graduation thesis, available at http://www.scribd.com/doc/16191423/Web-Blog-Miner-Licence-Thesis, Access date: October 2009.