TEXT MINING

A Seminar Report
(Batch 2021-2024)

Submitted by
Rohit Jangid

Under the guidance of
Mr. Lalit Kumar Gurnani (Asst. Professor)

Jodhpur
CERTIFICATE
This is to certify that the Seminar entitled “Text Mining” has been prepared by Rohit Jangid in partial fulfillment of the requirements for the degree of BCA, under my supervision and guidance.
Asst. Professor
Date:
Acknowledgment
The success and final outcome of this seminar report required a great deal of guidance and assistance from many people, and I am extremely privileged to have received it throughout the completion of the report. All that I have done is due to such supervision and assistance, and I would not forget to thank them.
I am grateful to my mentor, Mr. Lalit Kumar Gurnani (Asst. Prof.), for the guidelines that made this report successful, and for the interest and attention he has so graciously lavished upon this work.
I extend my thanks to Dr. Saurabh Khatri (HOD, IT), whose cooperation, guidance, encouragement, inspiration, support and attention led to the completion of this report.
I would like to give sincere thanks to Dr. Manish Kachhawaha (Director) and Mr. Arjun Singh Sankhala (Principal) for providing a cordial environment in which to exhibit my abilities to the fullest.
Yours Sincerely,
Rohit Jangid
Declaration
I hereby declare that this Seminar is a record of original work done by me under the supervision and guidance of Mr. Lalit Kumar Gurnani. I further certify that this report has not formed the basis for the award of any Degree/Diploma or similar title to any candidate of any university, and that no part of this report is reproduced verbatim from any source without permission.
Date:
ABSTRACT
The volume of information circulating in a typical enterprise continues to increase. The knowledge hidden in this information, however, is not fully utilized, because most of it is described in textual form (as sentences). Text Mining allows large amounts of text to be analyzed objectively and efficiently. The field of text mining has received a great deal of attention due to the ever-increasing need to manage the information that resides in the vast number of available text documents. Text documents are characterized by their unstructured nature, and ever-increasing sources of such unstructured information include the World Wide Web, biological databases, news articles, emails, etc.
Text mining is defined as the discovery by computer of new, previously unknown information, by automatically extracting it from different written resources. A key element is linking the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation. As the amount of unstructured data increases, text-mining tools will become increasingly valuable. A future trend is the integration of data mining and text mining into a single system, a combination known as duo-mining.
CONTENTS
1. INTRODUCTION
2. TECHNOLOGY FOUNDATIONS
3. INFORMATION EXTRACTION
4. TOPIC TRACKING
5. TEXT SUMMARIZATION
6. TEXT CATEGORIZATION
7. CLUSTERING
8. CONCEPT LINKAGE
9. INFORMATION VISUALIZATION
10. QUESTION ANSWERING
11. TEXT MINING APPLICATIONS
12. CONCLUSION
13. REFERENCES
1. INTRODUCTION
In the future, books and magazines may be used only for special purposes, as electronic documents become the primary means of storing, accessing and sorting written communication. As people become overwhelmed with information, it will become physically impossible for any individual to process all the information available on a particular topic. Massive amounts of data will reside in cyberspace, generating demand for text mining technology and solutions.
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting it from different written resources. A key element is linking the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation. Text mining is different from what we are familiar with in web search. In search, the user is typically looking for something that is already known and has been written by someone else; the problem is pushing aside all the material that is not currently relevant in order to find the relevant information. In text mining, the goal is to discover unknown information, something that no one yet knows and so could not yet have written down.
Machine intelligence is a key challenge for text mining. Natural language developed to help humans communicate with one another and record information, and computers are still a long way from comprehending natural language. Humans have the ability to distinguish and apply linguistic patterns to text, and can easily overcome obstacles that computers cannot easily handle, such as slang, spelling variations and contextual meaning. However, although our language capabilities allow us to comprehend unstructured data, we lack the computer’s ability to process text in large volumes or at high speeds. The figure depicts a generic process model for a text mining application.
Starting with a collection of documents, a text mining tool would retrieve a particular document and preprocess
it by checking format and character sets. Then it would go through a text analysis phase, sometimes repeating
techniques until information is extracted. Three text analysis techniques are shown in the example, but many other
combinations of techniques could be used depending on the goals of the organization. The resulting information
can be placed in a management information system, yielding an abundant amount of knowledge for the user of
that system.
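As a rough illustration of this generic flow, the sketch below strings together retrieval, preprocessing and one text-analysis step (term counting), and stores the result in a dictionary standing in for a management information system. The toy documents and the preprocess/analyze helpers are hypothetical placeholders, not part of any particular tool.

```python
from collections import Counter
import re

# A toy document collection standing in for a real corpus.
documents = {
    "doc1": "Text mining discovers new, previously unknown information from text.",
    "doc2": "Unstructured text such as emails and news articles keeps growing.",
}

def preprocess(text):
    """Check/normalize format: lower-case and keep only word characters."""
    return re.findall(r"[a-z]+", text.lower())

def analyze(tokens):
    """A single text-analysis technique: simple term counting."""
    return Counter(tokens)

# The 'management information system' is just a dictionary here.
mis = {}
for name, raw in documents.items():
    tokens = preprocess(raw)          # preprocessing phase
    mis[name] = analyze(tokens)       # text analysis phase

print(mis["doc1"].most_common(3))
```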
2. TECHNOLOGY FOUNDATIONS
Although the differences in human and computer languages are expansive, there have been technological
advances which have begun to close the gap. The field of natural language processing has produced technologies
that teach computers natural language so that they may analyze, understand, and even generate text. Some of the
technologies that have been developed and can be used in the text mining process are information extraction, topic
tracking, summarization, categorization, clustering, concept linkage, information visualization, and question
answering. In the following sections we will discuss each of these technologies and the role that they play in text
mining. We will also illustrate the type of situations where each technology may be useful in order to help readers
identify tools of interest to themselves or their organizations.
3. INFORMATION EXTRACTION
A starting point for computers to analyze unstructured text is to use information extraction. Information
extraction software identifies key phrases and relationships within text. It does this by looking for predefined
sequences in text, a process called pattern matching. The software infers the relationships between all the identified people, places, and times to provide the user with meaningful information. This technology can be very
useful when dealing with large volumes of text. Traditional data mining assumes that the information to be
“mined” is already in the form of a relational database. Unfortunately, for many applications, electronic
information is only available in the form of free natural language documents rather than structured databases.
Since IE addresses the problem of transforming a corpus of textual documents into a more structured database,
the database constructed by an IE module can be provided to the KDD module for further mining of knowledge
as illustrated in Figure.
After mining knowledge from extracted data, DISCOTEX can predict information missed by the previous
extraction using discovered rules.
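Pattern matching of the kind described here can be approximated with regular expressions. The sketch below is a minimal, hypothetical example: the pattern, the sample sentences and the resulting "database" rows are invented for illustration and are far simpler than what a real IE module feeding a KDD system such as DISCOTEX would produce.

```python
import re

text = (
    "Alice Smith joined Acme Corp in Jodhpur on 12 March 2021. "
    "Bob Kumar visited Acme Corp in Mumbai on 5 June 2022."
)

# A predefined sequence: <Person> ... <Organization> in <Place> on <Date>
pattern = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+).*?"
    r"(?P<org>[A-Z][a-z]+ Corp) in (?P<place>[A-Z][a-z]+) on "
    r"(?P<date>\d{1,2} [A-Z][a-z]+ \d{4})"
)

# Each match becomes one structured record, ready for a KDD/data-mining step.
records = [m.groupdict() for m in pattern.finditer(text)]
for row in records:
    print(row)
```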
4. TOPIC TRACKING
A topic tracking system works by keeping user profiles and, based on the documents the user views,
predicts other documents of interest to the user. Yahoo offers a free topic tracking tool (www.alerts.yahoo.com)
that allows users to choose keywords and notifies them when news relating to those topics becomes available.
Topic tracking technology does have limitations, however. For example, if a user sets up an alert for “text mining”,
s/he will receive several news stories on mining for minerals, and very few that are actually on text mining. Some
of the better text mining tools let users select particular categories of interest or the software automatically can
even infer the user’s interests based on his/her reading history and click-through information. There are many
areas where topic tracking can be applied in industry. It can be used to alert companies anytime a competitor is in
the news. This allows them to keep up with competitive products or changes in the market. Similarly, businesses
might want to track news on their own company and products. It could also be used in the medical industry by
doctors and other people looking for new treatments for illnesses and who wish to keep up on the latest
advancements. Individuals in the field of education could also use topic tracking to be sure they have the latest
references for research in their area of interest.
Keywords are a set of significant words in an article that give readers a high-level description of its contents. Identifying keywords from a large amount of on-line news data is very useful, as it can produce a short summary of news articles. As on-line text documents rapidly grow with the expansion of the WWW, keyword extraction has become a basis of several text mining applications such as search engines, text categorization, summarization, and topic detection. Manual keyword extraction is an extremely difficult and time-consuming task; in fact, it is almost impossible to extract keywords manually from the volume of news articles published in a single day. For rapid use of keywords, an automated process that extracts them from news articles is needed. The architecture of the keyword extraction system is presented in the figure. HTML news pages are gathered from an Internet portal site, candidate keywords are extracted through the keyword extraction module, and finally keywords are selected by the cross-domain comparison module. The keyword extraction module works as follows: tables for ‘Document’, ‘Dictionary’, ‘Term occur fact’ and ‘TF-IDF weight’ are created in a relational database. First, the downloaded news documents are stored in the ‘Document’ table and nouns are extracted from the documents in that table.
Figure 4-1 The architecture of keyword extraction system
Then the facts about which words appear in which documents are written to the ‘Term occur fact’ table. Next, TF-IDF weights for each word are calculated using the ‘Term occur fact’ table and the results are written to the ‘TF-IDF weight’ table. Finally, using the ‘TF-IDF weight’ table, a ‘candidate keyword list’ is produced for each news domain, with the highest-weighted words ranked first. The keyword extraction module is shown in the figure.
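The TF-IDF ranking step can be sketched with the standard library alone. The tiny "news" corpus and the in-memory dictionaries standing in for the database tables below are placeholders; a real system would read from the relational tables described above.

```python
import math
import re
from collections import Counter

# Stand-in for the 'Document' table: a few toy news snippets.
docs = {
    "economy1": "stock market rises as bank rates fall",
    "economy2": "bank profits rise while market volatility falls",
    "sports1":  "the cricket team wins the final match",
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Stand-in for the 'Term occur fact' table: term frequencies per document.
tf = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Document frequency of each term across the collection.
df = Counter()
for counts in tf.values():
    df.update(counts.keys())

n_docs = len(docs)

def tfidf(term, doc):
    """TF-IDF weight of a term in one document."""
    return tf[doc][term] * math.log(n_docs / df[term])

# 'Candidate keyword list': top-weighted terms for one document.
doc = "economy1"
ranked = sorted(tf[doc], key=lambda t: tfidf(t, doc), reverse=True)
print(ranked[:5])
```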
Lexical chaining is a method of grouping lexically related terms into so-called lexical chains. Topic tracking involves tracking a given news event in a stream of news stories, i.e. finding all the subsequent stories about that event in the news stream. In a multi-vector topic tracking system, proper names, locations and normal terms are extracted into distinct sub-vectors of the document representation, and the similarity of two documents is measured by comparing two sub-vectors at a time. A number of features that affect the performance of a topic tracking system have been analyzed; the first choice is which characteristic to use as a feature, for example single words or word strings such as phrases, taking as features the terms that discuss the given event.
5. TEXT SUMMARIZATION
Text summarization is immensely helpful for trying to figure out whether or not a lengthy document meets
the user’s needs and is worth reading for further information. With large texts, text summarization software
processes and summarizes the document in the time it would take the user to read the first paragraph. The key to
summarization is to reduce the length and detail of a document while retaining its main points and overall
meaning. The challenge is that, although computers are able to identify people, places, and time, it is still difficult
to teach software to analyze semantics and to interpret meaning.
Generally, when humans summarize text, we read the entire selection to develop a full understanding, and then
write a summary highlighting its main points. Since computers do not yet have the language capabilities of
humans, alternative methods must be considered. One of the strategies most widely used by text summarization
tools, sentence extraction, extracts important sentences from an article by statistically weighting the sentences.
Further heuristics such as position information are also used for summarization.
For example, summarization tools may extract the sentences which follow the key phrase “in conclusion”, after
which typically lie the main points of the document. Summarization tools may also search for headings and other
markers of subtopics in order to identify the key points of a document. Microsoft Word’s AutoSummarize function
is a simple example of text summarization. Many text summarization tools allow the user to choose the percentage
of the total text they want extracted as a summary. Summarization can work with topic tracking tools or
categorization tools in order to summarize the documents that are retrieved on a particular topic. If organizations,
medical personnel, or other researchers were given hundreds of documents that addressed their topic of interest,
then summarization tools could be used to reduce the time spent sorting through the material. Individuals would
be able to more quickly assess the relevance of the information to the topic they are interested in.
✓ In the generation step, the final summary is obtained from the summary structure.
The methods of summarization can be classified, in terms of the level in the linguistic space, into two broad groups:
✓ shallow approaches, which are restricted to the syntactic level of representation and try to extract salient parts of the text in a convenient way;
✓ deeper approaches, which assume a semantic level of representation of the original text and involve linguistic processing at some level.
In the first approach the aim of the preprocessing step is to reduce the dimensionality of the representation space,
and it normally includes:
1) stop-word elimination: common words with no semantics, which do not add relevant information to the task (e.g., “the”, “a”), are eliminated.
2) case folding: converting all the characters to the same letter case, either upper case or lower case.
3) stemming: syntactically similar words, such as plurals and verbal variations, are considered similar.
The purpose of this procedure is to obtain the stem or radix of each word, which emphasizes its semantics.
A frequently employed text model is the vector model. After the preprocessing step, each text element (a sentence, in the case of text summarization) is considered as an N-dimensional vector, so some metric in this space can be used to measure similarity between text elements. The most commonly employed metric is the cosine measure, defined as cos θ = (x · y) / (|x| |y|) for vectors x and y, where x · y denotes the scalar product and |x| the norm (length) of x. Maximum similarity therefore corresponds to cos θ = 1, whereas cos θ = 0 indicates total discrepancy between the text elements.
To implement text summarization based on fuzzy logic, MATLAB is often used, since fuzzy logic can be simulated in this software. Characteristics of the text, such as sentence length, similarity to the title, similarity to keywords, etc., are selected as the inputs of the fuzzy system. Then all the rules needed for summarization are entered in the knowledge base of this system. Afterwards, a value from zero to one is obtained for each sentence, based on the sentence characteristics and the rules available in the knowledge base. The obtained value determines the degree of importance of the sentence in the final summary.
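A minimal sentence-extraction summarizer along these lines can be written without any special toolkit: sentences become term-frequency vectors, each is weighted by cosine similarity to the whole-document vector, and the top-scoring sentences are kept. The example text, and the choice of scoring against the whole document rather than a fuzzy rule base, are simplifications for illustration.

```python
import math
import re
from collections import Counter

TEXT = (
    "Text mining extracts new information from documents. "
    "It links extracted facts to form new hypotheses. "
    "The weather was pleasant yesterday. "
    "Summarization keeps the main points while reducing length."
)

def vectorize(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sentences = re.split(r"(?<=\.)\s+", TEXT.strip())
doc_vec = vectorize(TEXT)

# Weight each sentence by its similarity to the whole document.
scored = sorted(sentences, key=lambda s: cosine(vectorize(s), doc_vec),
                reverse=True)

summary_len = 2  # keep the two most representative sentences
print(" ".join(s for s in sentences if s in scored[:summary_len]))
```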
The kernel of generating a text summary using the sentence-selection-based text summarization approach is shown in the figure.
6. TEXT CATEGORIZATION
Categorization involves identifying the main themes of a document by placing the document into a pre-
defined set of topics. When categorizing a document, a computer program will often treat the document as a “bag
of words.” It does not attempt to process the actual information as information extraction does. Rather,
categorization only counts words that appear and, from the counts, identifies the main topics that the document
covers. Categorization often relies on a thesaurus for which topics are predefined, and relationships are identified
by looking for broad terms, narrower terms, synonyms, and related terms. Categorization tools normally have a
method for ranking the documents in order of which documents have the most content on a particular topic.
As with summarization, categorization can be used with topic tracking to further specify the relevance of a
document to a person seeking information on a topic. The documents returned from topic tracking could be ranked
by content weights so that individuals could give priority to the most relevant documents first. Categorization can
be used in a number of application domains. Many businesses and industries provide customer support or have to
answer questions on a variety of topics from their customers. If they can use categorization schemes to classify
the documents by topic, then customers or end users will be able to access the information they seek much more
readily. The goal of text categorization is to classify a set of documents into a fixed number of predefined
categories. Each document may belong to more than one class.
Using supervised learning algorithms, the objective is to learn classifiers from known examples (labeled documents) and perform the classification automatically on unknown examples (unlabeled documents). Figure 6-1 shows the overall flow diagram of the text categorization task. Consider a set of labeled documents from a source D = {d1, d2, …, dn} belonging to a set of classes C = {c1, c2, …, cp}. The text categorization task is to train the classifier using these documents and assign categories to new documents. In the training phase, the n documents are arranged in p separate folders, where each folder corresponds to one class. In the next step, the training data set is prepared via a feature selection process.
Figure 6-1. Flow Diagram of Text Categorization
Text data typically consists of strings of characters, which are transformed into a representation suitable for learning. Previous research has shown that words work well as features for many text categorization tasks. In the feature space representation, the sequences of characters of text documents are represented as sequences of words. Feature selection involves tokenizing the text, indexing and feature space reduction. Text can be represented using term frequency (TF), inverse document frequency (IDF), term frequency-inverse document frequency (TF-IDF) or a binary representation. Using these representations, the global feature space is determined from the entire training document collection.
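The training and classification phases can be illustrated with a toy nearest-centroid classifier over term-frequency features. The two classes, the handful of labeled examples and the centroid-plus-cosine classifier are simplifications chosen for this sketch; real systems typically use TF-IDF features and stronger learners.

```python
import math
import re
from collections import Counter

# Labeled training documents arranged by class (one 'folder' per class).
training = {
    "finance": ["stock market falls on rate fears",
                "bank raises interest rates again"],
    "sports":  ["the team wins the cricket final",
                "a thrilling football match ends in a draw"],
}

def vectorize(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Training phase: build one term-frequency centroid per class.
centroids = {}
for label, docs in training.items():
    centroid = Counter()
    for doc in docs:
        centroid.update(vectorize(doc))
    centroids[label] = centroid

def classify(text):
    """Assign the class whose centroid is most similar to the new document."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

print(classify("interest rates and the stock market"))   # expected: finance
```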
7. CLUSTERING
Clustering is a technique used to group similar documents, but it differs from categorization in that
documents are clustered on the fly instead of through the use of predefined topics. Another benefit of clustering
is that documents can appear in multiple subtopics, thus ensuring that a useful document will not be omitted from
search results. A basic clustering algorithm creates a vector of topics for each document and measures the weights
of how well the document fits into each cluster. Clustering technology can be useful in the organization of
management information systems, which may contain thousands of documents.
In the K-means clustering algorithm, when calculating the similarity between text documents, the method considers not only the eigenvector based on term-frequency statistics but also the degree of association between words, so that the relationship between keywords is taken into account. This lessens the sensitivity to input order and frequency and, to a certain extent, incorporates semantic understanding, effectively raising the similarity accuracy for short texts and simple sentences as well as the precision and recall of the clustering result. The algorithm model based on the idea of co-mining is shown in the figure.
Figure 7-1 The flow diagram of K-means
clustering based on co-mining
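A bare-bones K-means over term-frequency vectors is sketched below. It uses plain Euclidean distance on word counts and omits the co-occurrence (co-mining) weighting described above; the toy documents and the choice of k = 2 are arbitrary and for illustration only.

```python
import random
import re
from collections import Counter

docs = [
    "stock market and bank rates",
    "bank profits and market rates",
    "cricket team wins the match",
    "football match ends in a draw",
]

def vectorize(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

vectors = [vectorize(d) for d in docs]
vocab = sorted({t for v in vectors for t in v})
points = [[v[t] for t in vocab] for v in vectors]  # dense term-count vectors

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def kmeans(points, k, iters=10):
    random.seed(0)
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: distance(p, centers[i]))
            clusters[idx].append(p)
        # Recompute centers (keep the old one if a cluster is empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [min(range(k), key=lambda i: distance(p, centers[i])) for p in points]

labels = kmeans(points, k=2)
for doc, label in zip(docs, labels):
    print(label, doc)
```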
In the word relativity-based clustering (WRBC) method, the text clustering process contains four main parts: text preprocessing, word relativity computation, word clustering and text classification.
The first step in text clustering is to transform the documents, which typically are strings of characters, into a representation suitable for the clustering task.
✓ Remove stop-words: stop-words are highly frequent words that carry no information (i.e. pronouns, prepositions, conjunctions, etc.). Removing stop-words can improve clustering results.
✓ Stemming: word stemming is the process of suffix removal to generate word stems. This is done to group words that have the same conceptual meaning, such as work, worker, worked and working.
✓ Filtering: the domain vocabulary V in the ontology is used for filtering. By filtering, a document is represented only by related domain words (terms), which reduces the document dimensions. A central problem in statistical text clustering is the high dimensionality of the feature space.
Standard clustering techniques cannot deal with such a large feature set, since processing is extremely costly in computational terms. Documents can instead be represented with a domain vocabulary in order to solve the high-dimensionality problem. At the beginning of word clustering, one word is chosen at random to form the initial cluster. The other words are then added to this cluster or to new clusters, until all words belong to m clusters. This method allows one word to belong to many clusters, which accords with reality. The method implements word clustering by calculating word relativity and then performs text classification.
8. CONCEPT LINKAGE
Concept linkage tools connect related documents by identifying their commonly-shared concepts and help
users find information that they perhaps wouldn’t have found using traditional searching methods. It promotes
browsing for information rather than searching for it. Concept linkage is a valuable concept in text mining,
especially in the biomedical fields where so much research has been done that it is impossible for researchers to
read all the material and make associations to other research. Ideally, concept linking software can identify links
between diseases and treatments when humans cannot. For example, a text mining software solution may easily
identify a link between topics X and Y, and Y and Z, which are well-known relations. But the text mining tool
could also detect a potential link between X and Z, something that a human researcher has not come across yet
because of the large volume of information s/he would have to sort through to make the connection.
A well-known nontechnological example is from Don Swanson, a professor at the University of Chicago, whose research in the 1980s identified magnesium deficiency as a contributing factor in migraine headaches. Swanson looked at articles with titles containing the keyword “migraine”, then collected the keywords that appeared with significant frequency within those documents. One such keyword term was “spreading depression”. He then looked for titles containing “spreading depression” and repeated the process with the text of those documents. He identified “magnesium deficiency” as a keyword term, hypothesizing that magnesium deficiency was a factor contributing to migraine headaches. No direct link between magnesium deficiency and migraines could be found in the previous literature, and no previous research had suggested that the two were related. The hypothesis was made only by linking related documents, from those on migraines to those covering spreading depression to those covering magnesium deficiency. The direct link between magnesium deficiency and migraine headaches was later proved valid through scientific experiments, showing that Swanson’s linkage methods could be a valuable process in other medical research.
9. INFORMATION VISUALIZATION
Visual text mining, or information visualization, puts large textual sources in a visual hierarchy or map
and provides browsing capabilities, in addition to simple searching. DocMiner is a tool that shows mappings of
large amounts of text, allowing the user to visually analyze the content. The user can interact with the document
map by zooming, scaling, and creating sub-maps. Information visualization is useful when a user needs to narrow
down a broad range of documents and explore related topics. The government can use information visualization
to identify terrorist networks or to find information about crimes that may have been previously thought
unconnected. It could provide them with a map of possible relationships between suspicious activities so that
they can investigate connections that they would not have come up with on their own.
To meet the goal of information visualization, the construction may be carried out in three steps:
✓ Data preparation: determine and acquire the original data for visualization and form the original data space.
✓ Data analysis and extraction: analyze and extract the visualization data needed from the original data and form the visualization data space.
✓ Visualization mapping: employ a mapping algorithm to map the visualization data space to the visualization target. InfoVisModel divides the construction into five steps:
b) Information indexing: index the collected information resources to form original data sources.
c) Information retrieval: query information lists matching the retrieval need from the original data sources.
d) Generation of visualization data: transform the data in the retrieved results into visualization data.
e) Display of visualization interface: map the visualization data to the visualization target and display it on the visualization interface. The InfoVisModel visualization model is shown in the figure.
Figure 9-1 Info VisModel visualization model
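As a toy illustration of the three-step construction, the sketch below prepares a few documents, extracts term frequencies along two hand-picked topic axes, and maps each document to a point in a two-dimensional "visualization space". The axis terms and documents are invented; a real tool such as DocMiner uses far richer mappings.

```python
import re
from collections import Counter

# Step 1 - data preparation: the original data space (raw documents).
docs = {
    "d1": "stock market rates and bank profits",
    "d2": "cricket match and football team news",
    "d3": "bank rates rise while the team wins the match",
}

FINANCE_TERMS = {"stock", "market", "bank", "rates", "profits"}
SPORTS_TERMS = {"cricket", "football", "team", "match", "wins"}

def vectorize(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Step 2 - data analysis and extraction: the visualization data space.
def axis_score(counts, terms):
    total = sum(counts.values())
    return sum(counts[t] for t in terms) / total if total else 0.0

# Step 3 - visualization mapping: each document becomes an (x, y) point.
for name, text in docs.items():
    counts = vectorize(text)
    x = axis_score(counts, FINANCE_TERMS)
    y = axis_score(counts, SPORTS_TERMS)
    print(f"{name}: ({x:.2f}, {y:.2f})")
```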
10. QUESTION ANSWERING
Another application area of natural language processing is natural language queries, or question answering
(Q&A), which deals with how to find the best answer to a given question. Many websites that are equipped with question answering technology allow end users to “ask” the computer a question and be given an answer. Q&A
can utilize multiple text mining techniques. For example, it can use information extraction to extract entities such
as people, places, events; or question categorization to assign questions into known types (who, where, when,
how, etc.). In addition to web applications, companies can use Q&A techniques internally for employees who are
searching for answers to common questions. The education and medical areas may also find uses for Q&A in
areas where there are frequently asked questions that people wish to search.
Figure shows the architecture of question answering system. The system takes in a natural language (NL) question
in English from the user. This question is then passed to a Part-of-Speech (POS) tagger which parses the question
and identifies POS of every word involved in the question. This tagged question is then used by the query
generators which generate different types of queries, which can be passed to a search engine. These queries are
then executed by a search engine in parallel. The search engine provides the documents which are likely to contain the answers we are looking for, and these documents are checked by the answer extractor. The snippet extractor extracts snippets containing the query phrases/words from the documents. These snippets are passed to the ranker, which sorts them according to the ranking algorithm.
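The pipeline can be caricatured with the standard library: question keywords stand in for POS tagging, simple keyword queries stand in for the query generators, and a scan over an in-memory document list stands in for the search engine. The corpus, the stop-word list and all function names are hypothetical.

```python
import re

documents = [
    "Text mining is the discovery of new information from written resources.",
    "K-means is a clustering algorithm used to group similar documents.",
    "ETL stands for extraction, transformation and loading of data.",
]

STOPWORDS = {"what", "is", "the", "of", "a", "an", "and", "does", "mean"}

def keywords(question):
    """Stand-in for POS tagging: keep non-stopword terms of the question."""
    return [t for t in re.findall(r"[a-z]+", question.lower())
            if t not in STOPWORDS]

def search(query_terms):
    """Stand-in for the search engine: score documents by term overlap."""
    scored = []
    for doc in documents:
        tokens = set(re.findall(r"[a-z]+", doc.lower()))
        scored.append((sum(t in tokens for t in query_terms), doc))
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

def snippets(docs, query_terms, width=60):
    """Stand-in for the snippet extractor: a window around the first hit."""
    out = []
    for doc in docs:
        low = doc.lower()
        pos = min((low.find(t) for t in query_terms if t in low), default=0)
        out.append(doc[max(0, pos - 20):pos + width])
    return out

question = "What does ETL mean?"
terms = keywords(question)
ranked_snippets = snippets(search(terms), terms)
print(ranked_snippets[0] if ranked_snippets else "No answer found.")
```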
11. TEXT MINING APPLICATIONS
The main Text Mining applications are most often used in a handful of sectors. The sectors analyzed are characterized by a fair variety in the applications being experimented with. It is possible to identify some sector-specific patterns in the use of TM, linked to the type of production and to the knowledge management objectives that lead organizations to adopt TM. The publishing sector, for example, is marked by a prevalence of Extraction Transformation Loading applications for cataloguing, production and the optimization of information retrieval.
In the banking and insurance sectors, on the other hand, CRM applications are prevalent and are aimed at improving the management of customer communication, through automatic systems of message re-routing and applications supporting search engines that accept questions in natural language. In the medical and pharmaceutical sectors, applications of Competitive Intelligence and Technology Watch are widespread for the analysis, classification and extraction of information from articles, scientific abstracts and patents. A sector in which several types of applications are widely used is that of telecommunications and service companies: in these industries almost every kind of application finds a use, from market analysis to human resources management, from spelling correction to customer opinion surveys.
Text Mining is also widely used in the fields of Knowledge and Human Resource Management.
1) Competitive Intelligence: The need to organize and modify their strategies according to the demands and opportunities the market presents requires companies to collect information about themselves, the market and their competitors, to manage enormous amounts of data, and to analyze them in order to make plans. The aim of Competitive Intelligence is to select only relevant information through automatic reading of this data. Once the material has been collected, it is classified into categories to build a database, and the database is analyzed to obtain answers to questions that are specific and crucial for company strategies.
The typical queries concern the products, the sectors of investment of the competitors, the partnerships existing
in markets, the relevant financial indicators, and the names of the employees of a company with a certain profile
of competences. Before the introduction of TM, there was a division that was entirely dedicated to the continuous
monitoring of information (financial, geopolitical, technical and economic) and answering the queries coming
from other sectors of the company. In these cases the return on investment by the use of TM technologies was
self-evident when compared to results previously achieved by manual operators.
In some cases, if a scheme of categories is not defined a priori, clustering procedures are used to group the set of documents considered relevant to a certain topic into clusters of documents with similar contents. The analysis of the key concepts present in the individual clusters gives an overall view of the subjects dealt with in the individual texts.
More and more company and news information is available on the web, which has thus become a gold mine of online information crucial for competitive intelligence (CI). To harness this information, various search engines and text mining techniques have been developed to gather and organize it. However, the user has no control over how the information is organized by these tools, and the information clusters generated may not match their needs. The process of manually compiling documents according to a user's needs and preferences into actionable reports is very labor intensive, and the effort is greatly amplified when the reports need to be updated frequently. Updates to what has been collected often require a repeated search, filtering of previously retrieved documents and re-organizing.
FOCI (Flexible Organizer for Competitive Intelligence) can help the knowledge worker in the gathering, organizing, tracking, and dissemination of competitive intelligence or knowledge bases on the web. FOCI allows a user to define and personalize the organization of the information clusters into portfolios according to their needs and preferences. The figure shows the architecture of FOCI. It comprises an Information Gathering module for retrieving relevant information from web sources; a Content Management module for organizing information into portfolios and personalizing the portfolios; a Content Mining module for discovering new information; a Content Publishing module for publishing and sharing information; and a user interface front end for graphical visualization and user interaction. The portfolios created are stored in CI knowledge bases which can be shared by the users within an organization.
Text mining represents a flexible approach to information management, research and analysis, and it extends the reach of data mining to textual materials. The following figure addresses the process of using text mining and related methods and techniques to extract business intelligence from multiple sources of raw text information. Although the process looks similar to that of data mining, text mining gains the extra power of extracting business intelligence from text.
Figure 11-2. Text Mining in Business Intelligence
2) Extraction Transformation Loading: Extraction Transformation Loading (ETL) applications are aimed at filing unstructured textual material into categories and structured fields. Search engines are usually associated with ETL systems that guarantee the retrieval of information, generally through conceptual browsing and questioning in natural language. Applications are found in the editorial sector, in the juridical and political document field, and in medical health care. In the legal document sector, the filing and information management operations must deal with the particular features of the language, in which the identification and tagging of elements relevant for juridical purposes is necessary.
The data can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, an Excel spreadsheet, even a message queue. All these types of data must be transformed into a single suitable format and stored in a large repository called a data warehouse. To build a data warehouse we follow a process known as extraction, transformation, and loading (ETL), which involves extracting data from various outside sources, transforming it to fit business needs, and ultimately loading it into the data warehouse.
The first part of an ETL process is to extract the data from the various source systems. A data warehouse consolidates data from different source systems, and these sources may hold data in different formats: relational databases and flat files, non-relational database structures such as IMS, or other data structures such as VSAM or ISAM. Extracting data held in different formats, each with its own internal representation, is therefore a difficult process, and the extraction tool must understand all the different data storage formats.
The transformation phase applies a number of rules to the extracted data so as to convert the different data formats into a single format. These transformation rules are applied by the transformation tool as per the requirements; several types of transformation may be needed.
The loading phase loads the transformed data into the data warehouse so that it can be used for various analytical purposes, and various reporting and analytical tools can then be applied to the warehouse. Once data is loaded into the data warehouse it cannot be updated; loading is a time-consuming process, so it is performed only a few times.
A good ETL tool should be able to communicate with many different relational databases and read the various
file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application
Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction,
transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata
capabilities. ETL Data flow diagram is shown in figure.
Figure 11-3 ETL Data flow diagram
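A miniature ETL run is sketched below: records are extracted from two differently formatted in-memory sources, transformed into a single schema, and loaded into an SQLite table standing in for the data warehouse. The sources, field names and schema are made up for illustration.

```python
import csv
import io
import sqlite3

# Extract: two sources with different formats (CSV text and a list of dicts).
csv_source = io.StringIO("id,amount,date\n1,100.50,2021-03-12\n2,75.00,2021-03-13\n")
erp_source = [{"ref": "3", "value": "200", "when": "13/03/2021"}]

def extract():
    rows = list(csv.DictReader(csv_source))
    rows += [{"id": r["ref"], "amount": r["value"], "date": r["when"]}
             for r in erp_source]
    return rows

# Transform: a single format (int id, float amount, ISO date).
def transform(row):
    date = row["date"]
    if "/" in date:                      # normalize dd/mm/yyyy to yyyy-mm-dd
        d, m, y = date.split("/")
        date = f"{y}-{m}-{d}"
    return int(row["id"]), float(row["amount"]), date

# Load: write the unified records into the 'warehouse'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, sale_date TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [transform(r) for r in extract()])

print(conn.execute("SELECT * FROM sales ORDER BY sale_date").fetchall())
```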
3) Human resource management: TM techniques are also used to manage human resources strategically, mainly with applications aimed at analyzing staff opinions, monitoring the level of employee satisfaction, and reading and storing CVs for the selection of new personnel. In the context of human resources management, TM techniques are often used to monitor the state of health of a company through the systematic analysis of informal documents.
Text Mining is also widely used in the fields of Customer Relationship Management and Market Analysis. A few of its applications in these areas follow.
4) Customer Relationship Management (CRM): In the CRM domain [4], the most widespread applications are related to managing the content of clients’ messages. This kind of analysis often aims at automatically re-routing specific requests to the appropriate service or at supplying immediate answers to the most frequently asked questions. Services research has emerged as a green-field area for the application of advances in computer science and IT. CRM practices, particularly contact centers (call centers) in our context, have emerged as hotbeds for the application of innovations in the areas of knowledge management, analytics, and data mining. Unstructured text documents produced from a variety of sources in today's contact centers have exploded in terms of the sheer volume generated. Companies are increasingly looking to understand and analyze this content to derive operational and business insights. The customer, the end consumer of products and services, is receiving increased attention.
5) Market Analysis (MA): Market Analysis uses TM mainly to analyze competitors and/or monitor customers' opinions, to identify new potential customers, and to determine a company's image through the analysis of press reviews and other relevant sources. For many companies, tele-marketing and e-mail activity represent one of the main sources for acquiring new customers, and TM tools also make it possible to analyze more complex market scenarios.
Technology has had a positive impact on traditional marketing over the past few decades. Database technologies transformed the storing of information such as customers, partners, demographics, and preferences for making marketing decisions. In the 90s, the whole world saw an economic boom due to improvements and innovation in various IT-related fields. The number of web pages grew rapidly during the dot-com era, and search engines were built to crawl web pages and pull useful information out of the heap. Marketing professionals used search engines and databases as part of competitive analyses.
Data mining technology helped extract useful information and find nuggets from various databases. Data
warehouses turned out to be successful for numerical information, but failed when it came to textual information.
The 21st century has taken us far beyond the once limited amount of information on the web. In one way this is good, since more information provides greater awareness and better knowledge; in reality it is not always so, because too much information leads to redundancy. Knowledge about markets is available on the web in industry white papers, academic publications relating to markets, trade journals, market news articles, reviews, and even public opinion when it comes down to customer requirements. Text mining technology can help marketing professionals use this information to find nuggets. Market analysis includes the following elements.
Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
✓ Target marketing: find clusters of “model” customers who share the same characteristics (interests, income level, spending habits, etc.).
✓ Determine customer purchasing patterns over time (e.g., conversion of a single to a joint bank account: marriage, etc.).
✓ Cross-market analysis: prediction based on association information.
✓ Finance planning and asset evaluation: cash flow analysis and prediction.
It is instructive to divvy up the text-mining market by the type of customer. Applying a bit of Bayesian reasoning,
early buyers (as text mining is a relatively new technology) will prove a good indication of market directions.
Nonetheless, text mining is testing new directions as the technology has made possible several new applications.
The text-mining market is relatively new and not yet rigidly defined. Growth potential is huge given the ever-
increasing volumes of textual information being produced and consumed by industry and government. While the
market is being targeted by established software vendors, which are extending existing product lines and existing
functional and industry-specific offerings, several pure-play vendors have done quite well and entry barriers for
innovative start-ups are still not high.
Technological monitoring, which analyses the characteristics of existing technologies as well as identifying emerging ones, is characterized by two capacities: the capacity to look in a non-ordinary way at what already exists and is consolidated, and the capacity to identify what is newly available, recognizing its potential, its fields of application and its relationships with existing technology. Powerful Text Mining techniques now exist to identify the relevant Science & Technology literature and to extract the required information from it efficiently. These techniques have been developed to:
✓ Substantially enhance the retrieval of useful information from global Science & Technology databases;
✓ Identify experts for innovation-enhancing technical workshops and review panels;
✓ Generate technical taxonomies (classification schemes) with human-based and computer-based clustering methods;
✓ Provide roadmaps for tracking myriad research impacts across time and application areas.
Text mining has also been used or proposed for discovery and innovation from disjoint and disparate literatures. This application has the potential to serve as a cornerstone for credible technology forecasting, and to help predict the technology directions of global military and commercial adversaries.
Text Mining is also widely used in the field of Natural Language Processing and in multilingual applications. A few of its applications in these areas follow.
6) Questioning in Natural Language: The most important case of application of the linguistic competences
developed in the TM context is the construction of websites that support systems of questioning in natural
language. The need to make sites cater as much as possible for the needs of customers who are not necessarily
expert in computers or web search is common also to those companies that have an important part of their
business on the web.
The system architecture of an automatic question answering system is shown in Figure 11-4. In the user interface, users pose questions in natural language, which are then automatically segmented into words. In this system the user's question concerns a specific course, since the proposed question answering system is built around a specific course; it is therefore easy to extract keywords from the results of automatic word segmentation. The ontology-based knowledge base defines the concepts, and the relationships between the concepts, in the field of the curriculum; the concepts have been clearly defined and given computer-understandable semantics. With this domain ontology, the system can expand the keywords, increasing the search area for the question and improving the system's recall rate. The system then uses the expanded keywords to query the FAQ base and returns the matching answers to users.
Figure 11-4 Question answering system architecture
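A toy version of the keyword-expansion-plus-FAQ-lookup idea is sketched below. The tiny "ontology" (a related-term dictionary), the FAQ base and the overlap scoring are invented stand-ins for the course ontology and FAQ base described above.

```python
import re

# Stand-in for the ontology: each concept maps to related terms.
ONTOLOGY = {
    "clustering": {"cluster", "kmeans", "grouping"},
    "summarization": {"summary", "abstract", "condense"},
}

# Stand-in for the FAQ base: question keywords -> stored answer.
FAQ = {
    frozenset({"kmeans", "cluster"}): "K-means groups documents into k clusters.",
    frozenset({"summary", "sentence"}): "Summaries are built by extracting sentences.",
}

def keywords(question):
    return set(re.findall(r"[a-z]+", question.lower()))

def expand(terms):
    """Expand query terms with ontology-related terms to improve recall."""
    expanded = set(terms)
    for concept, related in ONTOLOGY.items():
        if concept in terms or terms & related:
            expanded |= related | {concept}
    return expanded

def answer(question):
    terms = expand(keywords(question))
    best = max(FAQ, key=lambda ks: len(ks & terms))
    return FAQ[best] if best & terms else "No answer found."

print(answer("How does clustering work?"))
```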
7) Multilingual Applications of Natural Language Processing: In NLP, Text Mining applications are also quite frequent, and they are characterized by multilingualism. The use of Text Mining techniques to identify and analyze web pages published in different languages is one example.
When working on a multilingual speech recognition system, a good deal of attention must be paid to the languages
to be recognized by the system. In this application, a recognition system for Italian and German is built, thus the
properties of both languages are of importance. From the acoustic point of view, the Italian language presents a
significantly smaller number of phones with respect to German - e.g. 5 vowels in Italian versus 25 in German.
Moreover, recognition experiments on Italian large vocabulary dictation conducted at IRST showed that only
minor improvements are achieved with context-dependent (CD) phone models with respect to context-
independent (CI) ones.
German is characterized by a greater variety of inflections and cases, a large use of compound words, and the use of capital and small letters to specify the role of words. All these features heavily affect the vocabulary size and the out-of-vocabulary rate, which are in general higher for German. For the German language it can also be said that pronunciation and lexicon strongly depend on the region: South Tyrolean German uses different words and pronunciation rules than standard German. Moreover, the land register experts have either Italian or German as their mother tongue and may thus have an accent whenever they enter data in a non-native language. Therefore, the recognition system must cope not only with dialectal variations, but also with a certain amount of accent on the part of the speaker. The architecture of the SpeeData data entry module is shown in Figure 11-5. It comprises four modules, namely the central manager (CM), the user interface (UI), the database interface (DBI) and the speech recognizer (SR).
Figure 11-5 System architecture for the SpeeData data entry module
12. CONCLUSION
As the amount of unstructured data increases, text-mining tools will become increasingly valuable. Text-mining methods are useful to government intelligence and security agencies. In the education area, students and educators are better able to find information relating to their topics. In business, text-mining tools can help companies analyze their competition, customer base, and marketing strategies. A future trend is the integration of data mining and text mining into a single system, a combination known as duo-mining.
13. REFERENCES
http://ojs.academypublisher.com/index.php/jetwi/article/viewPDFInterstitial/01016076/8
http://portal.acm.org/citation.cfm?id=1324185.1324190&coll=ACM&dl=ACM&CFID=97756486&CFTOKEN=26855260
http://portal.acm.org/citation.cfm?id=1151030.1151032&coll=ACM&dl=ACM&CFID=97756486&CFTOKEN=26855260
http://portal.acm.org/citation.cfm?id=954014.954024