Literature Review Report
1.1 Introduction
Knowledge management, data mining, and text mining techniques have been widely
used in many important applications in both scientific and business domains in recent years.
Knowledge management is the system and managerial approach to the gathering, management,
use, analysis, sharing, and discovery of knowledge in an organization or a community in order to
maximize performance [1]. Although there is no universal definition of what constitutes
knowledge, it is generally agreed there is a continuum of data, information, and knowledge. Data
are mostly structured, factual, and oftentimes numeric, and reside in database management
systems. Information is factual, but unstructured, and in many cases textual. Knowledge is
inferential, abstract, and is needed to support decision making or hypothesis generation. The
concept of knowledge has become prevalent in many disciplines and business practices [21]. For
example, information scientists consider taxonomies, subject headings, and classification
schemes as representations of knowledge. Consulting firms also have been actively promoting
practices and methodologies to capture corporate knowledge assets and organizational memory.
Data mining is often used during the knowledge discovery process and is one of the most
important subfields in knowledge management. Data mining aims to analyze a set of given data
or information in order to identify novel and potentially useful patterns. These techniques, such as
Bayesian models, decision trees, artificial neural networks, association rule mining, and genetic
algorithms, are often used to discover patterns or knowledge that are previously unknown to the
system and the users [1]. Data mining has been used in many applications such as marketing,
customer relationship management, engineering, medicine, crime analysis, expert prediction,
Web mining, and mobile computing, among others.
Text mining aims to extract useful knowledge from textual data or documents. Although text
mining is often considered a subfield of data mining, some text mining techniques have
originated from other disciplines, such as information retrieval, information visualization,
computational linguistics, and information science. Examples of text mining applications include
document classification, document clustering, entity extraction, information extraction, and
summarization. Most knowledge management, data mining, and text mining techniques involve
learning patterns from existing data or information, and are therefore built upon the foundation
of machine learning and artificial intelligence.
Taking advantage of these properties, text mining applications are typically used to [2]:
Extract relevant information from a document (summarization, feature extraction, ...).
Gain insights about trends, relations between people/places/organizations, etc. by
automatically aggregating and comparing information extracted from documents of a
certain type (e.g. incoming mail, customer letters, news-wires, ...).
Classify and organize documents according to their content; i.e. automatically pre-select
groups of documents with a specific topic and assign them to the appropriate person.
Organize repositories of document-related meta-information for search and retrieval.
Retrieve documents based on various sorts of information about the document content.
This list of activities shows that the main application areas of text mining technology
cover two aspects:
(1) knowledge discovery (mining proper) and
(2) information ‘distillation’ (mining on the basis of some pre-established structure).
An important part of our information-gathering behaviour has always been to find out what other
people think. With the growing availability and popularity of opinion-rich
resources such as online review sites and personal blogs, new opportunities and challenges arise
as people now can, and do, actively use information technologies to seek out and understand the
opinions of others. “What other people think” has always been an important piece of information
for most of us during the decision-making process. The Internet and the Web have now made it
possible to find out about the opinions and experiences of those in the vast pool of people that
are neither our personal acquaintances nor well-known professional critics, that is, people we
have never heard of. This is why opinion mining is called the voice of the customer.
Conversely, more and more people are making their opinions available to strangers via the
Internet. The interest that individual users show in online opinions about products and services,
and the potential influence such opinions wield, is something that vendors of these items are
paying more and more attention to.
In today’s world, a great deal of data is available on the Internet, including customer
reviews of different products [9]. Before purchasing a product, people generally read the
reviews written about it on the product’s website and base their decision on them. Sometimes
there are so many reviews that a customer cannot read them all; opinion mining is used to help
in that case. Reviews help other customers and also serve as suggestions or feedback for the
developer of the product. From these reviews, the company can learn what its product lacks. For
example, if reviews of a mobile phone say that the battery life is very short or that the voice
clarity is not good, the company can improve the battery life and voice clarity in the next
model of that product. From the comments or reviews, the company can learn the reasons why
customers like or dislike the product.
Indeed, according to two surveys of more than 2000 American adults:
Among readers of online reviews of restaurants, hotels, and various services, between 73%
and 87% report that reviews had a significant influence on their purchase.
32% have provided a rating on a product, service, or person via an online ratings system,
and 30% have posted an online comment or review regarding a product or service.
28% said that a major reason for these online activities was to get perspectives from
within their community, and 34% said that a major reason was to get perspectives from
outside their community.
27% had looked online for the endorsements or ratings of external organizations.
In opinion mining, there are three categories of techniques: term counting based, machine
learning based, and semantic pattern analysis based.
In the term counting based approach, a sentence is considered positive if it contains more
positive words than negative words, and negative if it contains more negative words.
This is mainly used for document level opinion mining. The machine learning based approach is
supervised: a classifier is trained on a labeled training data set. This is
mainly used for sentence level opinion mining. In the semantic pattern analysis based approach, the
relations between the words are found using natural language processing technology.
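The term counting idea can be illustrated with a minimal sketch. The two word lists and the function name term_count_polarity are hypothetical and chosen only for illustration; a real system would rely on a much larger sentiment lexicon.

```python
# Toy lexicons: hypothetical word lists chosen only for illustration.
POSITIVE_WORDS = {"good", "great", "clear", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "poor", "short", "terrible", "hate"}


def term_count_polarity(text):
    """Label a text by comparing its counts of positive and negative words."""
    # Lowercase and strip simple punctuation so that "good," matches "good".
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    positive = sum(1 for w in words if w in POSITIVE_WORDS)
    negative = sum(1 for w in words if w in NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


print(term_count_polarity("The screen is great and the sound is clear."))
print(term_count_polarity("Battery life is poor and the camera is bad."))
```

Because every word counts equally, this sketch ignores negation ("not good") and word order, which is one reason the other two categories of techniques exist.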
In this dissertation, I focus on feature level opinion mining using supervised
machine learning and part-of-speech tagging.
In opinion mining, the main goal is to find the feelings or views of the customer about the product,
that is, whether the customer is satisfied with the product or not.
1.3 Motivation
When many reviews have been written online for a product, it is often not possible to read
all of them, even when time is available. In that case, knowing the number of negative and
positive reviews is helpful.
As shown in the introduction, document level analysis gives the opinion of only one customer,
and feature level analysis is too complicated. Counting the negative and positive
reviews falls under sentence level opinion mining.
The introduction also presented three techniques. In term counting, if
weights are assigned wrongly, the result will be wrong. In the semantic based technique, the
interdependencies between words must be generated, which is very complicated. Machine learning
works well when a good quality training data set is available. That is the reason for choosing
the machine learning technique.
Within machine learning, opinion mining has been implemented using the naïve Bayesian algorithm
and the k nearest neighbor algorithm, but these implementations are too complex. I am therefore
focusing on developing an easier approach for the same task.
Chapter 2: Theoretical Background and Literature Survey
The capacity for storing data has become enormous as computer hardware technology
has developed. The amount of data is therefore increasing exponentially, and the information
required by users varies. In practice, users deal with textual data more than with numerical data. It
is very difficult to apply data mining techniques designed for numerical data to textual
data. Therefore it becomes necessary to develop techniques for textual data that are
different from those for numerical data. The mining of textual data is
called text mining. Text mining [1] is the procedure of synthesizing information by
analyzing relations, patterns, and rules in textual data. A key element is the
linking together of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation. Text
mining is different from what we are familiar with in web search. In search, the user is
typically looking for something that is already known and has been written by someone
else. The problem is pushing aside all the material that currently is not relevant to the user's
needs in order to find the relevant information. In text mining, the goal is to discover
unknown information, something that no one yet knows and so could not have yet
written down. The functions of text mining are text summarization, text categorization, and
text clustering [1].
Text classification is commonly used to handle spam emails, classify large text
collections into topical categories, manage knowledge, and help Internet
search engines. A major characteristic of text categorization is the high dimensionality of the
feature space. The native feature space consists of hundreds of thousands of terms for even a
moderate sized text collection. Various feature selection methods are discussed in this
paper to overcome the problem of high dimensionality. This survey also focuses on
the various approaches to and applications of text categorization.
The goal of text categorization is to classify a set of documents into a fixed number of
predefined categories. Each document may belong to more than one class.
The enormous amount of information stored in unstructured texts cannot simply be used for
further processing by computers, which typically handle text as simple sequences of character
strings. Therefore, specific (pre-)processing methods and algorithms are required in order to
extract useful patterns. Text mining refers generally to the process of extracting interesting
information and knowledge from unstructured text [20].
In this thesis we discuss text mining as a young and interdisciplinary field at the intersection
of the related areas of information retrieval, machine learning, statistics, computational linguistics,
and especially data mining. We describe the main analysis tasks: preprocessing, classification,
clustering, information extraction, and visualization.
Text mining or knowledge discovery from text (KDT) deals with the machine supported analysis
of text. It uses techniques from information retrieval, information extraction as well as natural
language processing (NLP) and connects them with the algorithms and methods of KDD, data
mining, machine learning and statistics. Thus, one selects a similar procedure as with the KDD
process, whereby not data in general, but text documents are in focus of the analysis. One
problem is that we now have to deal with data sets that are unstructured from the data modeling
perspective. If we try to define text mining, we can refer to related research areas. For
each of them, we can give a different definition of text mining, which is motivated by the specific
perspective of the area [20]:
Text Mining = Information Extraction. The first approach assumes that text mining essentially
corresponds to information extraction, that is, the extraction of facts from texts.
Text Mining = Text Data Mining. Text mining can also be defined, similarly to data mining, as the
application of algorithms and methods from the fields of machine learning and statistics to texts
with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts
accordingly. Many authors use information extraction methods, natural language processing, or
some simple pre-processing steps in order to extract data from texts. Data mining algorithms
can then be applied to the extracted data.
Text Mining = KDD Process. Following the knowledge discovery process model, we
frequently find text mining described in the literature as a process with a series of partial steps,
which include, among other things, information extraction as well as the use of data mining or
statistical procedures.
Hearst summarizes this in a general manner as the extraction of not yet discovered information in
large collections of texts. Other authors also consider text mining a process-oriented approach to texts. In
this thesis, we consider text mining mainly as text data mining. Thus, our focus is on methods
that extract useful patterns from texts in order to, e.g., categorize or structure text collections or
to extract useful information.
Currently, text categorization research is investigating the scalability properties of text classification
systems, i.e. understanding whether the systems that have proven the best in terms of effectiveness alone
stand up to the challenge of dealing with very large numbers of categories. Several algorithms, and
combinations of algorithms as hybrid approaches, have been proposed for the automatic classification of
documents. Among these, SVM, NB, kNN, and their hybrid combinations with
other algorithms and feature selection techniques are shown to be the most appropriate in the existing
literature. Future work is required to improve the performance and accuracy of the text
classification process. After reviewing the different types of approaches and comparing
existing methods on various parameters, it can be concluded that the SVM classifier has been
recognized as one of the most effective text classification methods in comparisons of supervised
machine learning algorithms.
DIFFERENT TYPES OF APPROACHES
[1] Naïve Bayes Algorithm
[2] K-Nearest Neighbor
[3] Rocchio’s Algorithm
[4] Decision Trees
[5] Backpropagation Network
[6] Support Vector Machines (SVM)
This paper summarizes several automatic text categorization algorithms in common use
recently, analyzes and compares their advantages and disadvantages. It provides clues for
making use of appropriate automatic classifying algorithms in different fields. Finally some
evaluations and summaries of these algorithms are discussed, and directions to further
research have been pointed out.
Automatic text categorization methods fall into two groups:
[1] Statistical based, e.g. Naïve Bayes, the maximum Shannon entropy model, KNN,
Support Vector Machine and so on;
[2] Knowledge based classification methods, e.g. production rules, neural networks, etc.
The statistical based classification method, owing to its simple mathematical computation and
because it does not demand complex semantic knowledge or domain knowledge, works well in
practical applications and has recently become a popular text categorization method.
A knowledge based text categorization system, by contrast, can be applied only to a specific area and
needs the knowledge base of that area as support. Because of its problems in the extraction,
modification, and maintenance of knowledge and in self-learning, its applicable area is
restricted. In addition, there are some other classification methods, such as the Boosting
algorithm, a kind of voting based classification method whose idea is aimed at tasks
requiring expert knowledge.
Text categorization is the task of assigning predefined categories to natural language text. With
the widely used ‘bag of words’ representation, previous research usually assigns a word
values that express whether the word appears in the document concerned or how frequently the
word appears. Although these values are useful for text categorization, they do not fully
express the abundant information contained in the document. This paper explores the effect of
other types of values, which express the distribution of a word in the document. These novel
values assigned to a word are called distributional features, which include the compactness of the
appearances of the word and the position of the first appearance of the word. The proposed
distributional features are exploited by a tf-idf style equation, and different features are combined using
ensemble learning techniques. Experiments show that the distributional features are useful for
text categorization. In contrast to using the traditional term frequency values solely, including the
distributional features requires only a little additional cost, while the categorization performance
can be significantly improved. Further analysis shows that the distributional features are
especially useful when documents are long and the writing style is casual.
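As a rough illustration of what such distributional values might look like, the sketch below computes a word's frequency, the relative position of its first appearance, and one simple compactness measure. The function name distributional_features and the particular definition of spread (distance between first and last appearance, divided by document length) are assumptions made for this example, not necessarily the definitions used in the surveyed paper.

```python
def distributional_features(tokens, word):
    """Return simple distributional features of `word` in a token list.

    count:          how often the word appears (the traditional value).
    first_position: index of the first appearance divided by document length.
    spread:         distance between first and last appearance divided by
                    document length; a small value means compact appearances.
    Returns None if the word does not occur.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    if not positions:
        return None
    n = len(tokens)
    return {
        "count": len(positions),
        "first_position": positions[0] / n,
        "spread": (positions[-1] - positions[0]) / n,
    }


doc = "the bank raised rates and the bank cut fees while markets fell".split()
features = distributional_features(doc, "bank")
print(features)
```

Two documents can share the same count for a word while differing in first_position and spread, which is the extra information the bag of words values leave out.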
The Naïve Bayes text classifier has been widely used because of its simplicity in both the training and
classifying stages. Although it is less accurate than discriminative methods (such as SVM),
numerous researchers have shown that it is effective enough to classify text in many domains. Naïve
Bayes models allow each attribute to contribute towards the final decision equally and independently
of the other attributes, which makes them more computationally efficient than other text
classifiers. Thus, the present study focuses on employing the Naïve Bayes approach as the text classifier for
document classification and evaluates its classification performance against other classifiers.
Figure 1 depicts the methodology for generating a document classifier model, from
data preprocessing to model evaluation. Some data are useless (i.e. removing them does not affect the
classification result, as with stop words) and some terms carry similar meanings
(e.g. the terms 'bank' and 'banks'); therefore a preprocessing phase has to be conducted first. In this way,
the dataset becomes more precise.
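A minimal sketch of such a preprocessing phase follows. The stop-word list and the crude suffix-stripping function simple_stem are hypothetical simplifications introduced here for illustration; a real system would typically use a full stop-word list and an established stemmer such as the Porter stemmer.

```python
# Toy stop-word list, an assumption for this example only.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "are", "in", "to"}


def simple_stem(word):
    """Strip a few common suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and stem the rest."""
    tokens = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    return [simple_stem(t) for t in tokens if t and t not in STOP_WORDS]


print(preprocess("The banks are lending to the bank of England."))
```

After this step 'bank' and 'banks' map to the same token, so their counts are merged before attribute selection.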
After the data preprocessing phase, critical attributes have to be selected. In this study, critical
refers to the importance of an attribute towards the solution class. For example, if the term 'bank'
has the highest term frequency in the 'business' class, then
'bank' is taken to be one of the critical attributes representing the documents that fall in the
'business' class. Thus,
less important features can be removed and the computational time can be improved significantly.
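Selecting attributes by term frequency per class can be sketched as follows. The function name top_terms_per_class, the cut-off k, and the tiny training set are all hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def top_terms_per_class(labeled_docs, k):
    """Return the k most frequent terms for each class.

    labeled_docs: iterable of (class_label, list_of_tokens) pairs.
    """
    counts = {}
    for label, tokens in labeled_docs:
        counts.setdefault(label, Counter()).update(tokens)
    return {
        label: [term for term, _ in counter.most_common(k)]
        for label, counter in counts.items()
    }


# Hypothetical training data, already preprocessed into tokens.
training = [
    ("business", ["bank", "loan", "bank", "rate"]),
    ("business", ["bank", "market"]),
    ("sport", ["match", "goal", "match"]),
]
selected = top_terms_per_class(training, k=1)
print(selected)
```

Terms outside the selected sets would then be dropped from every document vector, shrinking the feature space the classifier has to handle.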
As for the classification phase, different classifiers (such as SVM, NN, and DT) can be employed to
generate the model. However, this study focuses only on using Naïve Bayes to classify the documents.
Given the probabilistic characteristic of Naïve Bayes, each training document is vectorized by the
trained Naïve Bayes classifier through the calculation of the posterior probability value for each
existing..
Finally, the model is evaluated on a set of testing data. In order to test the classification ability of the
model, several evaluation measures (such as precision, recall, and F-measure) are adopted.
Furthermore, to determine whether Naïve Bayes is the best classifier to use, its testing results are
compared with other classifiers' results as well.
[Figure 1: Methodology for generating the document classifier model. Recoverable labels: Processing Phase (Stemming); Classifier Generation (DT, Naïve Bayes, KNN, SVM).]
Phase 4 - Model Evaluation
To test and evaluate the model, 70% of the dataset is used. Instances are extracted and then serve as
a benchmarking dataset for machine learning problems. By comparing the actual class of each
instance with the predicted one (i.e. generated by the classification model), system performance can be
measured in terms of recall, precision, and F-measure. These can be mathematically defined as below.
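\[ \text{Recall} = \frac{TP}{TP + FN} \tag{1} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]
\[ \text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]
Here TP, FP, and FN denote the numbers of true positives, false positives, and false negatives for a class.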
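The three measures can be computed directly from the actual and predicted class labels. The sketch below is a minimal illustration; the function name evaluate and the example labels are hypothetical, and the measures are computed for one class treated as the positive class.

```python
def evaluate(actual, predicted, positive_class):
    """Return (precision, recall, f_measure) for one class."""
    pairs = list(zip(actual, predicted))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for a, p in pairs if a == positive_class and p == positive_class)
    # False positives: predicted positive but actually another class.
    fp = sum(1 for a, p in pairs if a != positive_class and p == positive_class)
    # False negatives: actually positive but predicted as another class.
    fn = sum(1 for a, p in pairs if a == positive_class and p != positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


actual = ["business", "business", "sport", "sport", "business"]
predicted = ["business", "sport", "sport", "business", "business"]
precision, recall, f_measure = evaluate(actual, predicted, "business")
print(precision, recall, f_measure)
```

In this example two 'business' documents are found, one is missed, and one 'sport' document is wrongly labeled 'business', so precision, recall, and F-measure all come out at two thirds.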
Sentiment analysis seeks to characterize opinionated or evaluative aspects of natural language text, thus
helping people to discover valuable information from large amounts of unstructured data. In the paper
reviewed here, a new methodology for sentiment analysis, called proximity-based sentiment analysis, is explored.
The authors take a different approach by considering a new set of features based on word proximities in
a written text.
They propose three proximity-based features, namely proximity distribution, mutual information
between proximity types, and proximity patterns, using an unsupervised approach, a mean and median approach,
and a machine learning approach. The method extracts information from a specific domain.
The amount of textual data accumulated each day by various business, scientific, and governmental
organizations around the world is daunting. Success in the development of statistical natural language
processing (NLP) has led to improvements in fundamental text analysis such as part-of-speech (POS)
tagging, phrase chunking, dependency analysis and parsing. Using these components as fundamental
building blocks, many NLP researchers have become interested in analyzing text "semantically" or
"contextually". For example, named entity tagging, semantic role tagging and discourse parsing are being
investigated in the NLP field. This move towards taking contextual or semantic information into account
has occurred in application areas of NLP such as text classification, text summarization, information
retrieval and question answering. Even in text classification, one of the traditional NLP tasks, the
target classes have recently diversified from topics such as 'sports' and 'economics' to properties of
texts such as 'polarity' or 'subjectivity'. This calls for methods for opinion mining.
Example
Consider the text "The funniest horror movie ever made and while evil dead dawn is the result of an
unstable fusion". The
word "funniest" is a positive word and "horror" is a negative word. There is no word between "funniest" and
"horror", so this is a positive-negative distance measurement, and in this case the
distance is 1. In a similar manner, the distance between the negative-negative words "horror" and "dead"
is 7, while that between "dead" and "unstable" is 3 (ignoring words such as "of", "and", "the", etc.).
Distances are computed for the following types of pairs:
[1] POSITIVE-POSITIVE (++) words.
[2] NEGATIVE-NEGATIVE (--) words.
[3] POSITIVE-NEGATIVE (+-) words.
[4] NEGATIVE-POSITIVE (-+) words.
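The four pair types can be collected with a short sketch such as the one below. It is only an approximation of the idea: the word lists, the ignored-word list, the example sentence, and the function name proximity_distances are hypothetical, and the exact counting convention of the reviewed paper may differ, so the sketch uses its own toy sentence rather than the one above.

```python
# Toy lexicons and ignored words, assumptions for this example only.
POSITIVE = {"funny", "brilliant"}
NEGATIVE = {"boring", "weak"}
IGNORED = {"of", "and", "the", "a", "an"}


def proximity_distances(text):
    """Group distances between consecutive sentiment words by pair type."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    tokens = [t for t in tokens if t and t not in IGNORED]
    # Record (position, sign) for every sentiment word that remains.
    marked = [
        (i, "+" if t in POSITIVE else "-")
        for i, t in enumerate(tokens)
        if t in POSITIVE or t in NEGATIVE
    ]
    distances = {"++": [], "--": [], "+-": [], "-+": []}
    # Adjacent words get distance 1, one word in between gives 2, and so on.
    for (i, first), (j, second) in zip(marked, marked[1:]):
        distances[first + second].append(j - i)
    return distances


result = proximity_distances(
    "A funny and brilliant script, but the boring middle act feels weak."
)
print(result)
```

The resulting lists of distances per pair type are the raw material from which features such as a proximity distribution could be built.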
the classification efficiency of Naïve Bayes using its complemented version with less sensitivity. The
results show that the feature selection procedure from our previous work combined with these algorithms
results in significant improvement of classification efficiency and reduced over-fitting compared to the
previous work.
In the Naïve Bayes model, the conditional probability of each word given a category is assumed to be
independent of the conditional probabilities of the other words given that category. Imagine
that each document belongs to one of a set of different classes and that each document D can be modeled as a
set of words Wi. Under this assumption,
\[ P(D \mid C) = \prod_{i} P(W_i \mid C) \]
where P(D|C) is the conditional probability of a document D given a particular class C and P(Wi|C) is the
conditional probability of the word Wi given the class C.
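The independence assumption described above leads to a very small classifier. The sketch below is a minimal illustration, not the implementation used in any of the reviewed papers: the class name NaiveBayesText, the add-one (Laplace) smoothing, and the four training documents are assumptions introduced for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_score(self, tokens, label):
        """Unnormalized log posterior: log P(C) + sum of log P(Wi|C)."""
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for w in tokens:
            # Add-one smoothing avoids zero probability for unseen words.
            p = (self.word_counts[label][w] + 1) / (total_words + len(self.vocab))
            score += math.log(p)
        return score

    def predict(self, tokens):
        return max(self.class_counts, key=lambda c: self.log_score(tokens, c))


docs = [
    ["good", "battery"],
    ["great", "screen", "good"],
    ["poor", "battery"],
    ["bad", "screen", "poor"],
]
labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayesText().fit(docs, labels)
print(model.predict(["good", "screen"]))
print(model.predict(["poor"]))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long documents, and it does not change which class has the highest score.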
Text mining applies the same analytical functions of data mining to the domain of textual
information, relying on sophisticated text analysis techniques that distill information from
free-text documents. IBM’s Intelligent Miner for Text provides the necessary tools to
unlock the business information that is “trapped” in email, insurance claims, news feeds,
or other document repositories. It has been successfully applied in analyzing patent
portfolios, customer complaint letters, and even competitors’ Web pages. After defining our
notion of “text mining”, we focus on the differences between text and data mining and
describe in some more detail the unique technologies that are key to successful text mining [10].
As is the case with data mining technology, one of the primary application areas of text
mining is collecting and condensing facts as a basis for decision support. The main advantages
of mining technology over a traditional ‘information broker’ business are:
The ability to quickly process large amounts of textual data, which could not be performed
effectively by human readers.
‘Objectivity’ and customizability of the process, i.e. the results solely
depend on the outcome of the linguistic processing algorithms and statistical calculations
provided by the text mining technology.
The possibility to automate labor-intensive routine
tasks and leave the more demanding tasks to human readers.
Taking advantage of these properties, text mining applications are typically used to:
Extract relevant information from a document (summarization, feature extraction, ...).
Gain insights about trends, relations between people/places/organizations, etc. by automatically
aggregating and comparing information extracted from documents of a certain type (e.g.
incoming mail, customer letters, news-wires, ...).
Classify and organize documents according to their content; i.e. automatically pre-select
groups of documents with a specific topic and assign them to the appropriate person.
Organize repositories of document-related meta-information for search and retrieval.
Retrieve documents based on various sorts of information about the document content.
This list of activities shows that the main application areas of text mining technology
cover two aspects: knowledge discovery and information distillation.
Due to the lack of space we will concentrate on a single application that uses IBM’s
Intelligent Miner for Text to support both aspects, discovery and distillation: customer
relationship management.
In this paper we have described our notion of “text mining” and relevant core technology
components. IBM’s product for text mining applications, the Intelligent Miner for Text,
enables customers to use these new technologies in practical text mining applications. We
have shown this by describing a customer relationship management application that is
based on IBM’s text mining components. As this application shows, text mining to date
can be used as an effective business tool that supports the creation of knowledge by
preparing and organizing unstructured textual data (discovery) and by supporting the
extraction of relevant information from large amounts of unstructured textual data through
automatic pre-selection based on user-defined criteria (distillation). Using automatic
mining processes to organize and scan huge repositories of textual data can significantly
enhance both the efficiency and quality of a routine task while still leaving the more
challenging and critical part of it to the one who can do it best, the human reader.
With the growing popularity of Web 2.0, we are more and more provided with documents
expressing opinions on different topics. Recently, new research approaches have been defined in
order to automatically extract such opinions from the Internet. Usually they consider that opinions
are expressed through adjectives, and they extensively use either general dictionaries or experts
in order to provide the relevant adjectives. Unfortunately, these approaches suffer from the following
drawback: for a specific domain, either the adjective does not exist or its meaning could be
different from that in another domain.
In this paper, we propose a new approach consisting of two steps. First, we automatically extract
from the Internet a learning dataset for a specific domain. Second, we extract from this learning
set the set of positive and negative adjectives relevant for the domain. Experiments
conducted on real data show the usefulness of our approach, which automatically extracts
positive and negative adjectives in the context of
opinion mining. Experiments conducted on training sets (blogs vs. cinema reviews) show that
with our approach we are able to extract relevant adjectives for a specific domain. Future work
may be manifold. First, our method depends on the good quality of the documents extracted from blogs.
We want to extend our training corpora method by applying text mining approaches to the
collected documents in order to reduce noisy texts. Second, in this work we focused on
adjectives; we plan to extend the extraction task to other categories.
The Naïve Bayes classifier, also called the simple Bayesian classifier, is essentially a simple
Bayesian network (BN).
Since no structure learning is required, it is very easy to construct and implement a Naïve Bayes
classifier. Despite its simplicity, the Naïve Bayes classifier is competitive with other more
advanced and sophisticated classifiers such as decision trees (Friedman, Geiger & Goldszmidt,
1997). Owing to these advantages, the Naïve Bayes classifier has gained great popularity in
solving different classification problems [6].
This section introduces the two groups of approaches that have been used to improve the Naïve
Bayes classifier. In the first group, the strong independence assumption is relaxed by restricted
structure learning. The second group helps to select some major (and approximately
independent) attributes from the original attributes or transform them into some new attributes,
which can then be used by the Naïve Bayes classifier [10].
Conclusion And Future Enhancement
Here I have proposed an opinion mining approach using supervised machine learning
and part-of-speech tagging. It presents a user friendly and easy approach for finding whether the
views of the customer about the product are negative, positive, or neutral. There are
algorithms using supervised learning, such as naïve Bayesian and k nearest neighbor, that give good
results, but they are complex.
With the SVM algorithm, when the training set is small, the result of the SVM model is much poorer
than that of the others.
In association rule mining, word sets of two or more items are generated.
So there is no option for considering a single word using the association
concept. Association mining largely reduces the number of words to be considered for
classifying texts, keeping only words having an association between them.
I have found that naïve Bayes gives good performance and accurate results when the training
data set is small. So it is best suited to my proposed work.
References
Papers
[1] Hsinchun Chen , Sherrilynne S. Fuller , Carol Friedman and William Hersh - “Knowledge
Management , Data Mining , And Text Mining in Medical Informatics” - Management
Information Systems Department.
[2] Jochen Dörre, Peter Gerstl, Roland Seiffert - “Text Mining: Finding Nuggets in
Mountains of Textual Data” - IBM Germany, Copyright ACM 1999
[4] Un-Nung Chen, Che-An Lu, Chao-Yu Huang - “Anti-Spam Filter Based on Naïve Bayes,
SVM, and KNN model” - © 2009 AI TERMPROJECT
[6] Jose Alavedra, Laura Stroh, Alper Caglayan Milcord Waltham, MA, USA – “ Bayesian
Analysis of Sentiment Surveys”, 2011 IEEE Paper
[7] Gao Hua ,”Customer Relationship Management Based on Data Mining Technique --Naive
Bayesian classifier” , China – 2011 IEEE
[8] Sun Yueheng, Wang Linmei, Deng Zheng - School of Computer Science and Technology
“Automatic Sentiment Analysis for Web User Reviews” - The 1st International Conference on
Information Science and Engineering [ ICISE – 2009 ]
[9] S.L. Ting, W.H. Ip, Albert H.C. Tsang - “Is Naïve Bayes a Good Classifier for Document
Classification ? ”- International Journal of Software Engineering and Its Applications ,Vol. 5,
No. 3, July, 2011 ,Hongkong
[11] Ahmed Abbasi, Hsinchun Chen, And Arab Salem - The University of Arizona, “ Sentiment
Analysis in Multiple Languages : Feature Selection for Opinion Classification in Web Forums ”
Arizona-2007
[12] Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng - “Some
Effective Techniques for Naive Bayes Text Classification” - IEEE Transactions On Knowledge
And Data Engineering , Vol. 18, No. 11, November -2006.
[13] Bo Pang and Lillian Lee - “Opinion Mining and Sentiment Analysis”, Foundations and
Trends in Information Retrieval, Vol. 2, Nos. 1–2 (2008) 1–135, USA – 2011
[15] Pratiksha Y. Pawar and S. H. Gawande, Member –“A Comparative Study on Different
Types of Approaches to Text Categorization”- International Journal of Machine Learning and
Computing, Vol. 2, No. 4, August 2012
Web Sources
[17] http://blog.echen.me
[18] http://www.drdobbs.com
[19] http://my.safaribooksonline.com
Thesis
[20] Andreas Nurnberger Information Retrieval Group School of Computer Science , “A Brief
Survey of Text Mining" ,KDE Group, University of Kassel – 2005
Book
[21] Paulraj Ponnian “Data Warehousing Fundamentals”, John Wiley.