Literature Review Report

Uploaded by

Vaibhav C Gandhi

Chapter 1: Introduction and Motivation

1.1 Introduction
Knowledge management, data mining, and text mining techniques have been widely
used in recent years in many important applications in both scientific and business domains.
Knowledge management is the system and managerial approach to the gathering, management,
use, analysis, sharing, and discovery of knowledge in an organization or a community in order to
maximize performance [1]. Although there is no universal definition of what constitutes
knowledge, it is generally agreed there is a continuum of data, information, and knowledge. Data
are mostly structured, factual, and oftentimes numeric, and reside in database management
systems. Information is factual, but unstructured, and in many cases textual. Knowledge is
inferential, abstract, and is needed to support decision making or hypothesis generation. The
concept of knowledge has become prevalent in many disciplines and business practices [21]. For
example, information scientists consider taxonomies, subject headings, and classification
schemes as representations of knowledge. Consulting firms also have been actively promoting
practices and methodologies to capture corporate knowledge assets and organizational memory.

Data mining is often used during the knowledge discovery process and is one of the most
important subfields in knowledge management. Data mining aims to analyze a set of given data
or information in order to identify novel and potentially useful patterns. Techniques such as
Bayesian models, decision trees, artificial neural networks, association rule mining, and genetic
algorithms are often used to discover patterns or knowledge that were previously unknown to the
system and the users [1]. Data mining has been used in many applications such as marketing,
customer relationship management, engineering, medicine, crime analysis, expert prediction,
Web mining, and mobile computing, among others.

Text mining aims to extract useful knowledge from textual data or documents. Although text
mining is often considered a subfield of data mining, some text mining techniques have
originated from other disciplines, such as information retrieval, information visualization,
computational linguistics, and information science. Examples of text mining applications include
document classification, document clustering, entity extraction, information extraction, and
summarization. Most knowledge management, data mining, and text mining techniques involve
learning patterns from existing data or information, and are therefore built upon the foundation
of machine learning and artificial intelligence.

Usage of Text Mining


As is the case with data mining technology, one of the primary usage areas of text mining is
collecting and condensing facts as a basis for decision support [2]. The main advantages of
text mining technology over a traditional ‘information broker’ business are:
• The ability to quickly process large amounts of textual data, which could not be
processed effectively by human readers.
• ‘Objectivity’ and customizability of the process, i.e. the results depend solely on
the outcome of the linguistic processing algorithms and statistical calculations
provided by the text mining technology.
• The possibility to automate labor-intensive routine tasks and leave the more demanding
tasks to human readers.

Taking advantage of these properties, text mining applications are typically used to [2]:
• Extract relevant information from a document (summarization, feature extraction, ...).
• Gain insights about trends, relations between people/places/organizations, etc. by
automatically aggregating and comparing information extracted from documents of a
certain type (e.g. incoming mail, customer letters, news-wires, ...).
• Classify and organize documents according to their content, i.e. automatically
pre-select groups of documents with a specific topic and assign them to the appropriate
person.
• Organize repositories of document-related meta-information for search and retrieval.
• Retrieve documents based on various sorts of information about the document
content.
This list of activities shows that the main application areas of text mining technology
cover two aspects:
(1) knowledge discovery (mining proper) and
(2) information ‘distillation’ (mining on the basis of some pre-established structure).

An important part of our information-gathering behaviour has always been to find out what other
people think. With the growing availability and popularity of opinion-rich
resources such as online review sites and personal blogs, new opportunities and challenges arise
as people now can, and do, actively use information technologies to seek out and understand the
opinions of others. “What other people think” has always been an important piece of information
for most of us during the decision-making process. The Internet and the Web have now made it
possible to find out about the opinions and experiences of those in the vast pool of people that
are neither our personal acquaintances nor well-known professional critics, that is, people we
have never heard of. That is why opinion mining is called the voice of the customer. And
conversely, more and more people are making their opinions available to strangers via the
Internet. The interest that individual users show in online opinions about products and services,
and the potential influence such opinions wield, is something that vendors of these items are
paying more and more attention to.

Today a large amount of data is available on the Internet, including customer
reviews of different products [9]. Before purchasing a product, people generally
read the reviews written about it on the product's website and base their decision on them.
Sometimes there are so many reviews that a customer is not able to
read them all; opinion mining is used to help the customer in this situation. Customer reviews also
help other customers and provide suggestions or feedback to the developer of the product.
From these reviews, the company can learn what is lacking in its product. For
example, if reviews of a mobile phone say that the battery life is too short or that the voice
clarity is not good, the company can improve the battery life and voice clarity in the next
model of that product. From the comments or reviews, the company can learn
the reasons why customers like or dislike the
product.
Indeed, according to two surveys of more than 2000 American adults:
• 81% of Internet users (or 60% of Americans) have done online research on a product at least once;
• 20% (15% of all Americans) do so on a typical day;
• among readers of online reviews of restaurants, hotels, and various services, between 73% and 87% report that reviews had a significant influence on their purchase;
• consumers report being willing to pay from 20% to 99% more for a 5-star-rated item than a 4-star-rated item;
• 32% have provided a rating on a product, service, or person via an online ratings system, and 30% have posted an online comment or review regarding a product or service;
• 28% said that a major reason for these online activities was to get perspectives from within their community, and 34% said that a major reason was to get perspectives from outside their community;
• 27% had looked online for the endorsements or ratings of external organizations.

There are three types of opinion mining approach.

[1] Feature level or phrase level:

At this level, the particular features of the product are identified, and the comments
or reviews are considered separately for each feature.

[2] Sentence level:

At this level, each sentence of the comments or reviews is examined for opinion. The benefit of this approach is that the
customer can learn about many different customers' reviews. This
approach mainly differentiates between subjective and objective information. Subjective
information is opinion, which can be negative or positive, and objective information is
fact.

[3] Document level:

At this level, the whole document is written about the product by only one person. It is therefore
less useful, because the customer learns the review of only one customer.

In opinion mining, there are three categories of technique:

[1] Term counting based
[2] Machine learning based
[3] Semantic pattern analysis based

In the term counting based technique, a sentence is considered positive if it contains more positive words
than negative words, and negative if it contains more negative words than positive words.
This is mainly used for document level opinion mining. In the machine learning based technique, a
supervised approach is used: a supervised method is trained on a training data set. This is
mainly used for sentence level opinion mining. In the semantic pattern analysis based technique, the
relations between words are found using natural language processing technology.
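
The term counting idea can be sketched in a few lines of Python. This is only an illustration: the word lists are hypothetical stand-ins for a real sentiment lexicon, and returning "neutral" on a tie is an assumption, since the text does not say how ties are handled.

```python
# Hypothetical word lists; a real system would use a full sentiment lexicon.
POSITIVE_WORDS = {"good", "great", "excellent", "clear", "love"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "weak", "hate"}


def term_count_polarity(text):
    """Label text by comparing the counts of positive and negative words."""
    # Lowercase and strip simple punctuation so "clear," matches "clear".
    words = [w.strip(".,!?;:\"'()") for w in text.lower().split()]
    positive = sum(1 for w in words if w in POSITIVE_WORDS)
    negative = sum(1 for w in words if w in NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # assumption: equal counts are treated as neutral


review = "The camera is great and the screen is clear, but the battery is poor."
print(term_count_polarity(review))
```

Here two positive words outweigh one negative word, so the review is labeled positive even though it contains a complaint, which shows why term counting is coarse.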

In this dissertation, I focus on feature level opinion mining using supervised learning,
machine learning and part-of-speech tagging.
In opinion mining, the main goal is to find the feelings or views of the customer about the product,
that is, whether the customer is satisfied with the product or not.

1.2 Problem Statement

A large amount of information is available online today, so it is difficult to find which information is
useful to us. For online product reviews, I will implement an algorithm for finding the
positive, negative and neutral reviews among the many reviews written online, using supervised
learning, machine learning and part-of-speech tagging.

1.3 Motivation
When many reviews of a product are written online, we may not
have enough time to read them, and even if we have time, it may not be possible to read all of them. In that case,
knowing the number of negative and positive reviews is helpful.

As shown in the introduction, document level opinion mining gives the opinion of only one customer,
and feature level opinion mining is too complicated. Finding the number of negative and
positive reviews falls under sentence level opinion mining.

The introduction also presented three techniques. In term weighting, if the
weights are assigned wrongly, we get wrong results. In the semantic based technique, we need to establish the
interdependency between words, which is very complicated. Machine learning
works well if a good-quality training data set is available. That is the reason for choosing
the machine learning technique.

The machine learning technique has already been implemented using the naïve Bayesian algorithm
and the k-nearest neighbor algorithm, but these are too complex. So, I am focusing on developing an easier
approach for the same task.

Chapter 2: Theoretical Background and Literature Survey

2.1 A Survey on Text Categorization


Nowadays, managing a vast number of documents in digital form is very important in
text mining applications. Text categorization is the task of automatically sorting a set of
documents into categories from a predefined set. A major characteristic, or difficulty, of
text categorization is the high dimensionality of the feature space. Reducing the dimensionality
by selecting new attributes that are a subset of the old attributes is known as feature selection.

The capacity for storing data becomes enormous as computer hardware technology
develops. The amount of data is therefore increasing exponentially, and the information required by
users varies. In practice, users deal with textual data more than with numerical data. It
is very difficult to apply data mining techniques to textual data rather than numerical
data, so it becomes necessary to develop techniques for textual data that are
different from those for numerical data. The mining of textual data is
called text mining. Text mining [1] is the procedure of synthesizing information by
analyzing relations, patterns and rules in textual data. A key element is the
linking together of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation. Text
mining is different from what we are familiar with in web search. In search, the user is
typically looking for something that is already known and has been written by someone
else. The problem is pushing aside all the material that is currently not relevant to the user's
needs in order to find the relevant information. In text mining, the goal is to discover
unknown information, something that no one yet knows and so could not have yet
written down. The functions of text mining are text summarization, text categorization and
text clustering [1].

Text classification is commonly used to handle spam emails, to classify large text
collections into topical categories, to manage knowledge, and to help Internet
search engines. A major characteristic of text categorization is the high dimensionality of the
feature space. The native feature space consists of hundreds of thousands of terms for even a
moderate-sized text collection. Various feature selection methods are discussed in this
paper to overcome the problem of high dimensionality. This survey also focuses on
the various approaches to and applications of text categorization.

Categorization involves identifying the main themes of a document by placing the
document into a pre-defined set of topics. When categorizing a document, a computer
program will often treat the document as a “bag of words.” It does not attempt to process the
actual information as information extraction does. Rather, categorization only counts words
that appear and, from the counts, identifies the main topics that the document covers.
Categorization often relies on a thesaurus for which topics are predefined, and
relationships are identified by looking for broad terms, narrower terms, synonyms, and related
terms.
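
As a rough illustration of counting-based categorization, the sketch below treats a document as a bag of words and picks the predefined topic whose terms occur most often. The topic names and term sets are hypothetical and stand in for a real thesaurus.

```python
from collections import Counter

# Hypothetical topic thesaurus: each predefined topic maps to related terms.
TOPIC_TERMS = {
    "sports": {"match", "team", "score", "player"},
    "business": {"bank", "market", "profit", "shares"},
}


def categorize(document):
    """Return the predefined topic whose terms occur most often, or None."""
    # Bag of words: only word counts are kept, word order is ignored.
    bag = Counter(w.strip(".,!?;:") for w in document.lower().split())
    scores = {topic: sum(bag[term] for term in terms)
              for topic, terms in TOPIC_TERMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(categorize("The bank reported a profit as the market rose."))
```

A document with no thesaurus terms gets no topic at all, which is one reason practical systems rely on much larger term lists or learned models.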

Figure 1 Process of Text Categorization

The goal of text categorization is to classify a set of documents into a fixed number of
predefined categories. Each document may belong to more than one class.

2.2 A Brief Survey of Text Mining

The enormous amount of information stored in unstructured texts cannot simply be used for
further processing by computers, which typically handle text as simple sequences of character
strings. Therefore, specific (pre-)processing methods and algorithms are required in order to
extract useful patterns. Text mining refers generally to the process of extracting interesting
information and knowledge from unstructured text [20].

In this thesis we discuss text mining as a young and interdisciplinary field at the intersection
of the related areas of information retrieval, machine learning, statistics, computational linguistics
and especially data mining. We describe the main analysis tasks: preprocessing, classification,
clustering, information extraction and visualization.

Text mining or knowledge discovery from text (KDT) deals with the machine supported analysis
of text. It uses techniques from information retrieval, information extraction as well as natural
language processing (NLP) and connects them with the algorithms and methods of KDD, data
mining, machine learning and statistics. Thus, one selects a similar procedure as with the KDD
process, whereby not data in general, but text documents are in focus of the analysis. One
problem is that we now have to deal with data sets that are, from the data modeling perspective,
unstructured. If we try to define text mining, we can refer to related research areas. For
each of them, we can give a different definition of text mining, which is motivated by the specific
perspective of the area[20]:

Text Mining = Information Extraction. The first approach assumes that text mining essentially
corresponds to information extraction, that is, the extraction of facts from texts.

Text Mining = Text Data Mining. Text mining can also be defined, similarly to data mining, as the
application of algorithms and methods from the fields of machine learning and statistics to texts
with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts
accordingly. Many authors use information extraction methods, natural language processing or
some simple pre-processing steps in order to extract data from texts. Data mining algorithms
can then be applied to the extracted data.

Text Mining = KDD Process. Following the knowledge discovery process model, text mining is
frequently described in the literature as a process with a series of partial steps, among them
information extraction as well as the use of data mining or statistical procedures.
Hearst summarizes this in a general manner as the extraction of not yet discovered information in
large collections of texts. Other authors also consider text mining as a process-oriented approach to texts. In
this thesis, we consider text mining mainly as text data mining. Thus, our focus is on methods
that extract useful patterns from texts in order to, e.g., categorize or structure text collections or
to extract useful information.

2.3 A Comparative Study on Different Types of Approaches to Text Categorization
Text categorization is a pattern classification task for text mining and is necessary for the efficient
management of textual information systems. Documents can be classified in three ways: by
unsupervised, supervised and semi-supervised methods. Text categorization refers to the process of
automatically assigning one or more of a set of predefined categories to each document. This
paper presents a comparative study of different types of approaches to text categorization.

Current text categorization research is investigating the scalability properties of text classification
systems, i.e. understanding whether the systems that have proven the best in terms of effectiveness alone
stand up to the challenge of dealing with very large numbers of categories. Several algorithms, and
combinations of algorithms as hybrid approaches, have been proposed for the automatic classification of
documents. Among these, SVM, NB, kNN and hybrid systems that combine them with
other algorithms and feature selection techniques are shown to be the most appropriate in the existing
literature. Future work is required to improve the performance and accuracy of the text
classification process. After reviewing different types of approaches and comparing
existing methods on various parameters, it can be concluded that the SVM classifier has been
recognized as one of the most effective text classification methods in comparisons of supervised
machine learning algorithms.

Different types of approaches:
[1] Naïve Bayes Algorithm
[2] K-Nearest Neighbor
[3] Rocchio’s Algorithm
[4] Decision Trees
[5] Backpropagation Network
[6] Support Vector Machines (SVM)

2.4 Comparison of Text Categorization Algorithms

This paper summarizes several automatic text categorization algorithms in recent common
use and analyzes and compares their advantages and disadvantages. It provides clues for
choosing appropriate automatic classification algorithms in different fields. Finally, some
evaluations and summaries of these algorithms are discussed, and directions for further
research are pointed out.

Text categorization methods can be roughly divided into two classes:

[1] Statistics based, e.g. Naïve Bayes, the maximum Shannon entropy model, KNN,
Support Vector Machine and so on;

[2] Knowledge based, e.g. production rules, neural networks, etc.

Because of its simple mathematical computation, and because it does not
demand complex semantic knowledge or domain knowledge, the statistics based classification method works well in
practical applications and has recently become a popular text categorization method.

A knowledge based text categorization system, in contrast, can be applied to a specific area and
needs the knowledge base of that area as support. Because of its problems in the extraction,
modification and maintenance of the knowledge and in self-learning, its applicable area is
restricted. In addition, there are some other classification methods, such as the Boosting
algorithm, which is a kind of voting based classification method for tasks
requiring expert knowledge.

2.5 Distributional Features for Text Categorization

Text categorization is the task of assigning predefined categories to natural language text. With
the widely used ‘bag of words’ representation, previous research usually assigns a word
values that express whether the word appears in the document concerned or how frequently the
word appears. Although these values are useful for text categorization, they do not fully
express the abundant information contained in the document. This paper explores the effect of
other types of values, which express the distribution of a word in the document. These novel
values assigned to a word are called distributional features, which include the compactness of the
appearances of the word and the position of the first appearance of the word. The proposed
distributional features are exploited by a tf-idf style equation, and different features are combined using
ensemble learning techniques. Experiments show that the distributional features are useful for
text categorization. In contrast to using the traditional term frequency values alone, including the
distributional features requires only a little additional cost, while the categorization performance
can be significantly improved. Further analysis shows that the distributional features are
especially useful when documents are long and the writing style is casual.
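
To make the two distributional features more concrete, the following sketch computes, for one word, the position of its first appearance and one possible compactness measure. The measure used here (distance between the first and last appearance, normalized by document length) is an assumption chosen for illustration; the summary above does not say which formulas the paper uses.

```python
def distributional_features(tokens, word):
    """Return (first_position, spread) for word, or None if it never appears.

    Both values are normalized by document length so they lie in [0, 1).
    """
    positions = [i for i, token in enumerate(tokens) if token == word]
    if not positions:
        return None
    n = len(tokens)
    first_position = positions[0] / n
    # Assumed compactness measure: distance between first and last appearance.
    spread = (positions[-1] - positions[0]) / n
    return first_position, spread


tokens = "the bank opened and the bank closed early today near the river".split()
print(distributional_features(tokens, "bank"))
print(distributional_features(tokens, "river"))
```

A word that appears once has a spread of zero, while a word scattered across the document has a large spread, so the value separates concentrated from dispersed usage.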

2.6 Naïve Bayes is a Good Classifier for Document Classification


Document classification is of growing interest in text mining research.
Correctly assigning documents to a particular category still presents a challenge because
of the large number of features in the dataset. Among the existing classification
approaches, Naïve Bayes is potentially good at serving as a document classification model due to
its simplicity. The aim of this paper is to highlight the performance of employing Naïve Bayes in
document classification. Results show that Naïve Bayes is the best classifier against several common
classifiers (such as decision tree, neural network, and support vector machines) in terms of accuracy
and computational efficiency.

The Naïve Bayes text classifier has been widely used because of its simplicity in both the training and
classifying stages. Although it is less accurate than other discriminative methods (such as SVM),
numerous researchers have shown that it is effective enough to classify text in many domains. Naïve
Bayes models allow each attribute to contribute towards the final decision equally and independently
of the other attributes, which makes it more computationally efficient than other text
classifiers. Thus, the present study focuses on employing the Naïve Bayes approach as the text classifier for
document classification and evaluates its classification performance against other classifiers.

To generate a document classifier model, Figure 2 depicts the methodology, from
data preprocessing to model evaluation. Some data are useless (i.e. removing them does not affect the
classification result, such as stop words) and some carry similar meanings
(e.g. the terms 'bank' and 'banks'); therefore a preprocessing phase has to be conducted first. In this way,
the dataset can be made more precise.

After the data preprocessing phase, critical attributes have to be selected. In this study, critical
refers to the importance of an attribute towards the solution class. For example, the term 'bank'
in the 'business' class has the highest score in terms of term frequency, so
'bank' is taken to be one of the critical attributes representing the documents that fall in the 'business' class. Thus,
less important features can be removed and the computational time can be improved significantly.
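
The 'bank' example suggests ranking terms by their frequency within a class. The sketch below does exactly that and keeps the top k terms per class. It is a simplified stand-in, not the selection method of the study itself, and the function name, the sample documents and the value of k are all hypothetical.

```python
from collections import Counter


def top_terms_per_class(labeled_docs, k):
    """Keep the k most frequent terms of each class as its critical attributes."""
    counts = {}
    for label, tokens in labeled_docs:
        counts.setdefault(label, Counter()).update(tokens)
    return {label: [term for term, _ in counter.most_common(k)]
            for label, counter in counts.items()}


docs = [
    ("business", ["bank", "loan", "bank"]),
    ("business", ["bank", "market"]),
    ("sports", ["team", "goal", "team"]),
]
print(top_terms_per_class(docs, 1))
```

Raw term frequency favors words that are common everywhere, so in practice the ranking is usually applied after stopword removal.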

As for the classification phase, different classifiers (such as SVM, NN, and DT) are employed to
generate the model. However, this study only focused on using Naïve Bayes to classify the documents.
Given the probabilistic characteristic of Naïve Bayes, each training document is vectorized by the
trained Naïve Bayes classifier through the calculation of the posterior probability value for each
existing..

Finally, the model is evaluated on a set of testing data. To test the classification ability of the
model, several evaluation measures (such as precision, recall, and F-measure) are adopted.
Furthermore, to determine whether Naïve Bayes is the best classifier to use, its testing result is
compared with the results of other classifiers as well.

[Figure: flowchart with four phases. Processing phase: data set (4000 instances and 1312 attributes), stop word removing, missing value and interpretation, stemming. Feature selection phase: CFS subset evaluator, rank search. Classifier selection phase: classifier selection among DT, Naïve Bayes, KNN and SVM. Model evaluation phase: training set and testing set, classifier generation, result, evaluation measurement (satisfied: yes/no).]

Figure 2 – Research Methodology for Data Pre-Processing Phase

Phase 1- Preprocessing
It is common to find that several attributes are useless (such as the words 'a', 'the', etc.). Thus, a stopword
removing algorithm has been applied. To initialize the algorithm, a set of stopwords (such as a, a's,
able, about, above, according, accordingly, and across) is set by a human beforehand and
stored in a text file. Then, the model can simply match the attributes against those preset stopwords.
After the stopword algorithm, a missing data checking algorithm is adopted. This algorithm is
used to identify any missing data and assign a value to it, since data mining cannot
be performed when data are missing. The third algorithm applied in the preprocessing phase is
stemming. Since some words carry similar meanings but in different grammatical forms (such as
“bank” and “banks”), they need to be combined into one attribute. In this way, the documents
can show a better representation (with stronger correlations) of these terms, and the dataset
can be reduced to achieve faster processing time.
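
A minimal sketch of the stopword removal and stemming steps is given below. The stopword set contains only the examples named in the text, and the stemmer is a deliberately crude assumption (it only strips a plural "s"); a real system would use a full stopword file and a proper stemming algorithm. The missing data step is omitted.

```python
# Only the stopwords mentioned in the text; a real list would be much longer.
STOPWORDS = {"a", "a's", "able", "about", "above", "according",
             "accordingly", "across", "the"}


def naive_stem(word):
    """Very rough stemmer (an assumption): strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase the text, drop stopwords, then stem the remaining words."""
    words = [w.strip(".,!?;:\"()") for w in text.lower().split()]
    return [naive_stem(w) for w in words if w and w not in STOPWORDS]


print(preprocess("The banks across the river."))
```

After this step "bank" and "banks" map to the same attribute, which is the merging effect described above.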

Phase 2 - Feature Selection


Feature selection is one of the most important preprocessing steps in data mining. It is an effective
dimensionality reduction technique for removing noise features. In general, the basic idea of a feature
selection algorithm is to search through all possible combinations of attributes in the data to find which
subset of features works best for prediction. Thus, the attribute vectors can be reduced in number:
the most meaningful ones are kept and the irrelevant or redundant ones are
removed.
Phase 3 - Adoption of Document Classifier – Naïve Bayes
Naïve Bayes is used as the classifier because of its simplicity and good performance in document
and text classification. The Naïve Bayes classifier is the simplest instance of a probabilistic classifier.
The output Pr(C|d) of a probabilistic classifier is the probability that a document d belongs to a class C.
Each document contains terms, which are given probabilities based on their number of occurrences
within that particular document. With supervised training, Naïve Bayes can learn the pattern
by examining a set of training documents that have been well categorized and comparing the
contents in all categories by building a list of words and their occurrences. This list of
word occurrences can then be used to classify new documents into their right categories, according to the
highest posterior probability.
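
The following is a small sketch of a word-count based Naïve Bayes classifier of the kind described here. The class name and training data are hypothetical, and two details are assumptions not stated in the text: add-one (Laplace) smoothing for unseen words, and scoring in log space to avoid numeric underflow.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (smoothing is an assumption)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            # Log space: log Pr(C) plus the sum of log Pr(w | C).
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in tokens:
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [["bank", "loan", "market"], ["bank", "profit"],
              ["team", "goal", "match"], ["team", "player"]]
train_labels = ["business", "business", "sports", "sports"]
model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(model.predict(["bank", "market"]))
```

The class with the highest score is the one with the highest posterior probability, since the logarithm preserves the ordering.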

Phase 4 - Model Evaluation
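To test and evaluate the model, 70% of the dataset is used. Instances are extracted and then serve as
a benchmarking dataset for machine learning problems. By comparing the actual class of each
instance with the predicted one (i.e. generated by the classification model), system performance can be
measured in terms of recall, precision, and F-measure. With TP, FP and FN denoting the numbers of true positives, false positives and false negatives for a class, these can be mathematically defined as below.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad [1]$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad [2]$$

$$F\text{-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad [3]$$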
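
Recall, precision, and F-measure can be computed from the actual and predicted classes as in the sketch below, using the standard definitions based on true positives, false positives and false negatives for one class. The function name and the sample labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def evaluate(actual, predicted, positive_class):
    """Return (precision, recall, f_measure) for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive_class and p == positive_class)
    fp = sum(1 for a, p in pairs if a != positive_class and p == positive_class)
    fn = sum(1 for a, p in pairs if a == positive_class and p != positive_class)
    # Convention (an assumption): a measure with a zero denominator is 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


actual = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
print(evaluate(actual, predicted, "pos"))
```

In this sample every predicted positive is correct but one actual positive is missed, so precision is perfect while recall is one half.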

2.7 Proximity-Based Sentiment Analysis

Sentiment analysis seeks to characterize opinionated or evaluative aspects of natural language text, thus
helping people to discover valuable information from large amounts of unstructured data. In this paper a
new methodology for sentiment analysis, called proximity-based sentiment analysis, is explored.
The authors take a different approach by considering a new set of features based on word proximities in
a written text.

They propose three proximity-based features, namely proximity distribution, mutual information
between proximity types, and proximity patterns, using an unsupervised approach, a mean and median approach,
and a machine learning approach. The method extracts information from a specific domain.

The amount of textual data accumulated each day by various businesses, scientific, and governmental
organizations around the world is daunting. Success in the development of statistical natural language
processing (NLP) has led to improvements in fundamental text analysis such as part-of-speech (POS)
tagging, phrase chunking, dependency analysis and parsing. Using these components as fundamental
building blocks, many NLP researchers have become interested in analyzing text "semantically" or
"contextually". For example, named entity tagging, semantic role tagging and discourse parsing are being
investigated in the NLP fields. This move towards taking contextual or semantic information into account
has occurred in application areas of NLP such as text classification, text summarization, information
retrieval and question answering. Even with text classification tasks, one of the traditional NLP tasks, the
target classes have recently been diversified from topics such as 'sports' and 'economics' to the contents of
texts such as 'polarity' or 'subjectivity'. This thus calls for methods for opinion mining.

Example
Consider the sentence "The funniest horror movie ever made and while evil dead dawn is the result of an
unstable fusion." The word "funniest" is a positive word and "horror" is a negative word. There is no word
between "funniest" and "horror", so the pair is treated as a positive-negative pair with a distance of 1.
In a similar manner, the distance between the negative-negative words "horror" and "dead" is 7, while that
between "dead" and "unstable" is 3 (ignoring words such as "of", "and", "the", etc.).
Distances are measured for the following types of pairs:
[1] POSITIVE-POSITIVE (++) words
[2] NEGATIVE-NEGATIVE (--) words
[3] POSITIVE-NEGATIVE (+-) words
[4] NEGATIVE-POSITIVE (-+) words
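
The distance measurement in the example can be sketched in a few lines of Python. This is only an
illustration, not the method of the reviewed paper: the lexicon, the stop-word list and the function
names are assumptions made for this sketch, and the counts it produces depend on which words are
treated as stop words.

```python
# Hypothetical mini-lexicon and stop-word list, chosen only for this illustration.
POSITIVE = {"funniest"}
NEGATIVE = {"horror", "dead", "unstable"}
STOP_WORDS = {"of", "and", "the", "is", "an"}

SENTENCE = ("The funniest horror movie ever made and while evil dead dawn "
            "is the result of an unstable fusion.")


def polarity(word):
    """Return '+' or '-' for a lexicon word, or None for a neutral word."""
    if word in POSITIVE:
        return "+"
    if word in NEGATIVE:
        return "-"
    return None


def proximity_pairs(sentence):
    """Return (word1, word2, pair_type, distance) for consecutive sentiment words."""
    # Lowercase, strip surrounding punctuation, then drop stop words.
    tokens = [w.strip('.,!?"\'').lower() for w in sentence.split()]
    tokens = [w for w in tokens if w and w not in STOP_WORDS]
    # Remember the position of every word that carries a polarity.
    hits = [(i, w, polarity(w)) for i, w in enumerate(tokens) if polarity(w)]
    pairs = []
    for (i, w1, p1), (j, w2, p2) in zip(hits, hits[1:]):
        # The distance is the difference of positions after stop-word removal.
        pairs.append((w1, w2, p1 + p2, j - i))
    return pairs


for pair in proximity_pairs(SENTENCE):
    print(pair)
```

With this stop-word list the sketch gives a distance of 1 for "funniest"/"horror" and 3 for
"dead"/"unstable"; the count for "horror"/"dead" changes with the choice of ignored words.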

2.8 Automatic Sentiment Analysis for Web User Reviews


The web contains a wealth of user reviews that express a positive, negative or mixed mood about a topic,
but it is a daunting task to discriminate them manually. This paper presents an approach for automatic
sentiment analysis by 1) generating positive and negative sentiment words from Tongyici Cilin with
paradigm words, 2) determining the sentiment orientation of ambiguous words according to their
contexts, 3) setting up proper weight factors for different parts of speech, and 4) expanding the initial
sentiment words by an iterative process. The approach is verified on web user reviews and achieves an
average precision of 83.52%.

2.9 Classification of Movie Reviews Using Complemented Naive Bayesian Classifier

Text classification is an important research area, as it enables computers to process unstructured data
intelligently. This unstructured data is a rich source of information for industry. Most such
opinion-rich data (more than 85%) is in text format. In this work we have observed the effect of different
machine learning algorithms, including Naïve Bayes, on the text data. Our main focus is on improving
the classification efficiency of Naïve Bayes using its complemented version with less sensitivity. The
results show that the feature selection procedure from our previous work, combined with these algorithms,
significantly improves classification efficiency and reduces over-fitting compared to the previous work.

In the Naïve Bayes model, the conditional probability of each word given a category is assumed to be
independent of the conditional probabilities of the other words given that category. Imagine that each
document belongs to one of a set of classes and that each document can be modeled as a set of words Wi.

$$P(C \mid D) \propto P(C) \prod_{i} P(W_i \mid C)$$ [4]

Here P(C|D) is the posterior probability of class C given a document D, P(C) is the prior probability of
class C, and P(Wi|C) is the conditional probability of the word Wi given the class C.
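
The scoring rule above (a class prior multiplied by per-word conditional probabilities) can be
illustrated with a minimal Python sketch. It is a plain Naïve Bayes classifier, not the complemented
variant studied in the reviewed paper; the training sentences, the function names and the use of
add-one (Laplace) smoothing and log-probabilities are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (words, class label).
TRAIN = [
    ("great funny movie".split(), "pos"),
    ("wonderful acting great plot".split(), "pos"),
    ("boring dull movie".split(), "neg"),
    ("terrible plot dull acting".split(), "neg"),
]


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def log_score(words, label, model):
    """Return log P(C) + sum of log P(Wi|C), using add-one smoothing (an assumption)."""
    class_counts, word_counts, vocabulary = model
    score = math.log(class_counts[label] / sum(class_counts.values()))
    total_words = sum(word_counts[label].values())
    for w in words:
        if w not in vocabulary:
            continue  # words never seen in training are skipped
        score += math.log((word_counts[label][w] + 1) / (total_words + len(vocabulary)))
    return score


def classify(words, model):
    """Pick the class with the highest score."""
    return max(model[0], key=lambda label: log_score(words, label, model))


model = train_naive_bayes(TRAIN)
print(classify("great funny plot".split(), model))
print(classify("dull terrible movie".split(), model))
```

Logarithms are used so that the product of many small probabilities becomes a sum, which avoids
numerical underflow on long documents.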

2.10 Text Mining: Finding Nuggets in Mountains of Textual Data

Text mining applies the same analytical functions of data mining to the domain of textual
information, relying on sophisticated text analysis techniques that distill information from
free-text documents. IBM’s Intelligent Miner for Text provides the necessary tools to
unlock the business information that is “trapped” in email, insurance claims, news feeds,
or other document repositories. It has been successfully applied in analyzing patent
portfolios, customer complaint letters, and even competitors’ Web pages. After defining our
notion of “text mining”, we focus on the differences between text and data mining and
describe in some more detail the unique technologies that are key to successful text mining[10].

Usage Of Text Mining

As is the case with data mining technology, one of the primary application areas of text
mining is collecting and condensing facts as a basis for decision support. The main advantages
of mining technology over a traditional ‘information broker’ business are:

• The ability to quickly process large amounts of textual data, which could not be performed
effectively by human readers.
• ‘Objectivity’ and customizability of the process, i.e. the results solely depend on the
outcome of the linguistic processing algorithms and statistical calculations provided by the
text mining technology.
• The possibility to automate labor-intensive routine tasks and leave the more demanding tasks
to the human reader.

Taking advantage of these properties, text mining applications are typically used to:

• Extract relevant information from a document (summarization, feature extraction, ...).
• Gain insights about trends, relations between people/places/organizations, etc. by automatically
aggregating and comparing information extracted from documents of a certain type (e.g.
incoming mail, customer letters, news-wires, ...).
• Classify and organize documents according to their content, i.e. automatically pre-select
groups of documents with a specific topic and assign them to the appropriate person.
• Organize repositories of document-related meta-information for search and retrieval.
• Retrieve documents based on various sorts of information about the document content.

This list of activities shows that the main application areas of text mining technology cover
two aspects:

(1) knowledge discovery (mining proper) and

(2) information ‘distillation’ (mining on the basis of some pre-established structure).

Due to the lack of space we will concentrate on a single application that uses IBM’s
Intelligent Miner for Text to support both aspects, discovery and distillation: customer
relationship management.

In this paper we have described our notion of “text mining” and relevant core technology
components. IBM’s product for text mining applications, the Intelligent Miner for Text,
enables customers to use these new technologies in practical text mining applications. We
have shown this by describing a customer relationship management application that is
based on IBM’s text mining components. As this application shows, text mining to date
can be used as an effective business tool that supports the creation of knowledge by
preparing and organizing unstructured textual data (discovery) and by supporting the
extraction of relevant information from large amounts of unstructured textual data through
automatic pre-selection based on user-defined criteria (distillation). Using automatic
mining processes to organize and scan huge repositories of textual data can significantly
enhance both the efficiency and quality of a routine task while still leaving the more
challenging and critical part of it to the one who can do it best, the human reader.

2.11 Opinion Mining From Blogs

With the growing popularity of Web 2.0, we are provided with more and more documents
expressing opinions on different topics. Recently, new research approaches were defined in
order to automatically extract such opinions on the Internet. Usually they consider that opinions
are expressed through adjectives, and they extensively use either general dictionaries or experts
in order to provide the relevant adjectives. Unfortunately these approaches suffer from the following
drawback: for a specific domain, either the adjective does not exist or its meaning could be
different from another domain.

In this paper, we propose a new approach focusing on two steps. First, we automatically extract
from the Internet a learning dataset for a specific domain. Second, we extract from this learning
set the set of positive and negative adjectives relevant for the domain. Experiments conducted
on real data show the usefulness of our approach. In this paper, we proposed a new
approach for automatically extracting positive and negative adjectives in the context of
opinion mining. Experiments conducted on training sets (blogs vs. cinema reviews) show that
with our approach we are able to extract relevant adjectives for a specific domain. Future work
may be manifold. First, our method depends on the good quality of the documents extracted from blogs.
We want to extend our training corpora method by applying text mining approaches to the
collected documents in order to minimize noisy texts. Second, in this work we focused on
adjectives; we plan to extend the extraction task to other categories.

2.12 Improving the Naïve Bayes Classifier

The Naïve Bayes classifier, also called the simple Bayesian classifier, is essentially a simple
Bayesian network (BN). Since no structure learning is required, it is very easy to construct and
implement a Naïve Bayes classifier. Despite its simplicity, the Naïve Bayes classifier is competitive
with other more advanced and sophisticated classifiers such as decision trees (Friedman, Geiger &
Goldszmidt, 1997). Owing to these advantages, the Naïve Bayes classifier has gained great popularity in
solving different classification problems[6].

This section introduces the two groups of approaches that have been used to improve the Naïve
Bayes classifier. In the first group, the strong independence assumption is relaxed by restricted
structure learning. The second group helps to select some major (and approximately
independent) attributes from the original attributes or transform them into some new attributes,
which can then be used by the Naïve Bayes classifier[10].

Conclusion And Future Enhancement

Here I have proposed an opinion mining approach that uses machine learning, supervised
learning, and part-of-speech information. It presents a user-friendly and easy approach for finding
whether the view of a customer about a product is negative, positive, or neutral. Supervised
learning algorithms such as naïve Bayesian and k-nearest neighbor give good
results, but they are complex.

With the SVM algorithm, when the training set is small the result of the SVM model is much poorer
than that of the others. With the association rule approach, association mining generates word sets
of two or more items, so there is no option for considering a single word using the association
concept. Association mining largely reduces the number of words to be considered for
classifying texts, keeping only words that have an association between them.
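
The effect of keeping only associated words can be illustrated with a small Python sketch. It simply
counts word pairs that co-occur in documents and keeps those that reach a minimum support; it is not a
full association rule miner such as Apriori, and the documents, the threshold and the function names
are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each document is a list of words.
DOCS = [
    ["battery", "life", "good"],
    ["battery", "life", "poor"],
    ["screen", "good"],
    ["battery", "life", "screen"],
]


def frequent_word_pairs(documents, min_support):
    """Return word pairs that co-occur in at least min_support documents."""
    pair_counts = Counter()
    for words in documents:
        # Sort the distinct words so that each pair has one canonical order.
        unique_words = sorted(set(words))
        pair_counts.update(combinations(unique_words, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def associated_vocabulary(pairs):
    """Collect the words that take part in at least one frequent pair."""
    return {word for pair in pairs for word in pair}


pairs = frequent_word_pairs(DOCS, min_support=2)
print(pairs)
print(associated_vocabulary(pairs))
```

On this toy corpus only "battery" and "life" survive, which shows both the reduction in vocabulary
and why a single word that never pairs with another cannot be considered.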

Here I have found that naïve Bayes gives good performance and accurate results when the training
data set is small, so it is best suited to my proposed work.

References
Papers

[1] Hsinchun Chen , Sherrilynne S. Fuller , Carol Friedman and William Hersh - “Knowledge
Management , Data Mining , And Text Mining in Medical Informatics” - Management
Information Systems Department.

[2] Jochen Dörre, Peter Gerstl, Roland Seiffert - “Text Mining: Finding Nuggets in
Mountains of Textual Data” - IBM Germany, Copyright ACM 1999

[3] S.Niharika, V.Sneha Latha, D.R.Lavanya - “A Survey On Text Categorization” -
International Journal of Computer Trends and Technology - Volume 3, Issue 1 - 2012

[4] Un-Nung Chen, Che-An Lu, Chao-Yu Huang - “Anti-Spam Filter Based on Naïve Bayes,
SVM, and KNN model” - © 2009 AI TERMPROJECT

[5] SHI Yong-feng, ZHAO Yan-ping – “Comparison of Text Categorization Algorithms” –
WUJI Wuhan University Journal of Natural Sciences - Vol. 9 No. 5, 2004.

[6] Jose Alavedra, Laura Stroh, Alper Caglayan Milcord Waltham, MA, USA – “ Bayesian
Analysis of Sentiment Surveys”, 2011 IEEE Paper

[7] Gao Hua ,”Customer Relationship Management Based on Data Mining Technique --Naive
Bayesian classifier” , China – 2011 IEEE

[8] Sun Yueheng, Wang Linmei, Deng Zheng - School of Computer Science and Technology
“Automatic Sentiment Analysis for Web User Reviews” - The 1st International Conference on
Information Science and Engineering [ ICISE – 2009 ]

[9] S.L. Ting, W.H. Ip, Albert H.C. Tsang - “Is Naïve Bayes a Good Classifier for Document
Classification ? ”- International Journal of Software Engineering and Its Applications ,Vol. 5,
No. 3, July, 2011 ,Hongkong

[10] Siva RamaKrishna Reddy V., D V L N. Somayajulu, Ajay R. Dani – “Classification of
Movie Reviews Using Complemented Naive Bayesian Classifier”, International Journal of
Intelligent Computing Research (IJICR), Volume 1, Issue 4, December, India – 2010

[11] Ahmed Abbasi, Hsinchun Chen, And Arab Salem - The University of Arizona, “ Sentiment
Analysis in Multiple Languages : Feature Selection for Opinion Classification in Web Forums ”
Arizona-2007

[12] Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng - “Some
Effective Techniques for Naive Bayes Text Classification” - IEEE Transactions On Knowledge
And Data Engineering , Vol. 18, No. 11, November -2006.

[13] Bo Pang1 and Lillian Lee2 - “Opinion Mining and Sentiment Analysis” , Foundations and
Trends in Information Retrieval Vol. 2, Nos. 1–2 (2008) 1–135 , USA – 2011

[14] S M Kamruzzaman and Chowdhury Mofizur Rahman - “Text Categorization using
Association Rule and Naïve Bayes Classifier” - International Islamic University,
Chittagong-4203, Bangladesh

[15] Pratiksha Y. Pawar and S. H. Gawande, Member –“A Comparative Study on Different
Types of Approaches to Text Categorization”- International Journal of Machine Learning and
Computing, Vol. 2, No. 4, August 2012

Web Sources
[17] http://blog.echen.me

[18] http://www.drdobbs.com

[19] http://my.safaribooksonline.com

Thesis
[20] Andreas Nurnberger Information Retrieval Group School of Computer Science , “A Brief
Survey of Text Mining" ,KDE Group, University of Kassel – 2005
Book
[21] Paulraj Ponniah, “Data Warehousing Fundamentals”, John Wiley.
