
Dept. of ISE, Acit 1

Text mining can analyze texts from any source or format to extract useful information. It has applications across many domains and industries for tasks like decision support, sentiment analysis, and fraud detection. The text mining process involves gathering text data, structuring it, and then mining it through techniques like creating a term-document matrix. This matrix maps important terms to source documents based on term frequencies. It can then be analyzed to discover patterns and extract knowledge from the text corpus.

Uploaded by

Dipa Shuvo Roy
Copyright
© All Rights Reserved

Module 5

Text mining works on text from practically any source, in any business or non-business
domain, and in any format, including Word documents, PDF files, XML files, text
messages, etc.
1. In the legal profession, text sources would include law, court deliberations, court orders,
etc.
2. In academic research, it would include texts of interviews, published research articles, etc.
3. The world of finance will include statutory reports, internal reports, CFO statements, and
more.
4. In medicine, it would include medical journals, patient histories, discharge summaries, etc.
5. In marketing, it would include advertisements, customer comments, etc.
6. In the world of technology and search, it would include patent applications, the whole of
information on the world-wide web, and more.
Text Mining Applications:
Text mining is a useful tool in the hands of chief knowledge officers to extract knowledge
relevant to an organization.
Text mining can be used across industry sectors and application areas, including decision
support, sentiment analysis, fraud detection, survey analysis, and many more.
1. Marketing: The voice of the customer can be captured in its native and raw format
and then analyzed for customer preferences and complaints.
• Social personas are a clustering technique to develop customer segments of interest.
Consumer input from social media sources, such as reviews, blogs, and tweets,
contains numerous leading indicators that can be used to anticipate and
predict consumer behavior.
• A ‘listening platform’ is a text mining application that, in real time, gathers social
media, blogs, and other textual feedback, and filters out the chatter to extract true
consumer sentiment. The insights can lead to more effective product marketing and
better customer service.
• Customer call center conversations and records can be analyzed for patterns of
customer complaints. Decision trees can organize this data to create decision choices
that could help with product management activities and with proactively
avoiding those complaints.
2. Business operations: Many aspects of business functioning can be accurately gauged from
analyzing text.
• Social network analysis and text mining can be applied to emails, blogs, social media,
and other data to measure the emotional states and the mood of employee populations.
Sentiment analysis can reveal early signs of employee dissatisfaction, which can then
be proactively managed.
• Studying people as emotional investors and using text analysis of the social Internet
to measure mass psychology can help in obtaining superior investment returns.
3. Legal: In legal applications, lawyers and paralegals can more easily search case histories
and laws for relevant documents in a particular case to improve their chances of winning.
• Text mining is also embedded in e-discovery platforms that help in minimizing risk in
the process of sharing legally mandated documents.
• Case histories, testimonies, and client meeting notes can reveal additional
information, such as morbidities in a healthcare situation, that can help better predict
high-cost injuries and prevent costs.
4. Governance and Politics: Social media can help bring down governments, as when news of
a fruit vendor's self-immolation in Tunisia spread online and fed the protests of 2010-2011.
• Social network analysis and text mining of large-scale social media data can be used
for measuring the emotional states and the mood of constituent populations.
Micro-targeting constituents with specific messages gleaned from social media analysis can
be a more efficient use of resources when contesting democratic elections.
• In geopolitical security, internet chatter can be processed for real-time information
and to connect the dots on any emerging threats.
• In academia, research streams could be meta-analyzed for underlying research trends.
Text Mining Process
• Text mining is a rapidly evolving area of research. As the amount of social media and
other text data grows, there is a need for efficient abstraction and categorization of
meaningful information from the text.
• The first level of analysis is identifying frequent words. This creates a bag of
important words. Texts – documents or smaller messages – can then be ranked on
how well they match a particular bag-of-words. However, there are challenges with this
approach. For example, the words may be spelled a little differently, or there may be
different words with similar meanings.
• The next level is identifying meaningful phrases from words. Thus ‘ice’
and ‘cream’ will be two different key words that often come together. However,
combining the two words into ‘ice cream’ gives a more meaningful phrase. There
might be similarly meaningful phrases like ‘apple pie’.
• The next higher level is that of topics. Multiple phrases could be combined into a
topic area. Thus the two phrases above could be put into a common bucket, and this
bucket could be called ‘Desserts’.
• Text mining is a semi-automated process. Text data
needs to be gathered, structured, and then mined, in a 3-step process:
1. The text and documents are first gathered into a corpus, and organized.

2. The corpus is then analyzed for structure. The result is a matrix mapping important
terms to source documents.
3. The structured data is then analyzed for word structures, sequences, and frequency.
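The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the three-sentence corpus, the `tokenize` helper, and the use of adjacent word pairs (bigrams) as candidate phrases are assumptions made for the example, not part of the notes.

```python
import re
from collections import Counter

# Step 1 (gather): a hypothetical three-document corpus.
corpus = [
    "I love ice cream and apple pie",
    "Ice cream is sweet",
    "Apple pie with ice cream is tasty",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Step 2 (structure): represent every document as a list of tokens.
tokens_per_doc = [tokenize(doc) for doc in corpus]

# Step 3 (mine): count single words and adjacent word pairs (bigrams).
word_counts = Counter(t for tokens in tokens_per_doc for t in tokens)
bigram_counts = Counter(
    " ".join(pair)
    for tokens in tokens_per_doc
    for pair in zip(tokens, tokens[1:])
)

print(word_counts.most_common(3))
print(bigram_counts.most_common(3))
```

In this toy corpus the bigram 'ice cream' occurs as often as the words 'ice' and 'cream' do individually, which is the kind of signal that suggests treating the pair as a single phrase.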

Term Document Matrix


1. There are several efficient techniques for identifying key terms from a text. The
techniques available for creating topics out of them are less efficient.
• For the purpose of this discussion, key words, phrases, or topics can each be called a
term of interest.
• This approach measures the frequencies of select important terms occurring in each
document.
• This creates a t x d Term-by-Document Matrix (TDM), where t is the number of
terms and d is the number of documents.
2. Creating a TDM requires making choices of which terms to include.
• The terms chosen should reflect the stated purpose of the text mining exercise.
• The list of terms should be as extensive as needed, but should not include unnecessary
terms that would confuse the analysis or slow the computation.
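As a rough illustration of a t x d matrix, the sketch below counts a hand-picked list of terms in three made-up documents. The document texts, the term list, and the helper name `term_counts` are all hypothetical; a real exercise would choose the terms to suit its stated purpose.

```python
import re

# Hypothetical documents (d = 3) and a hand-picked list of terms (t = 5).
documents = {
    "d1": "Investment grew and profit rose",
    "d2": "Customer orders fell while marketing spend rose",
    "d3": "Profit depends on customer orders",
}
terms = ["investment", "profit", "customer", "orders", "marketing"]

def term_counts(text, terms):
    """Return how often each chosen term occurs in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(term) for term in terms]

doc_ids = list(documents)
# One column of counts per document ...
columns = [term_counts(documents[d], terms) for d in doc_ids]
# ... transposed so that rows are terms and columns are documents (t x d).
tdm = [[column[i] for column in columns] for i in range(len(terms))]

print("term".ljust(12), *doc_ids)
for term, row in zip(terms, tdm):
    print(term.ljust(12), *row)
```

Each row answers "in which documents does this term occur, and how often?", so a row of mostly zeros is an early hint that a term may be too rare to keep.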

Here are some considerations in creating a TDM:
1. A large collection of documents mapped to a large bag of words will likely lead to a very
sparse matrix if they have few common words.
• Reducing the dimensionality of the data will help improve the speed of analysis and
the meaningfulness of the results.
• Synonyms, or terms with similar meaning, should be combined and counted
together as a common term.
• This would help reduce the number of distinct terms, or ‘tokens’.
2. Data should be cleaned for spelling errors.
• Common misspellings should be corrected and combined with the correctly spelled term.
Uppercase and lowercase variants of a term should also be combined.
3. When many variants of the same term are used, just the stem of the word should be used to
reduce the number of terms.
• For instance, terms like customer order, ordering, and order data should be combined into
a single token, ‘Order’.
4. On the other side, homonyms (terms with the same spelling but different meanings) should
be counted separately.
• This would enhance the quality of analysis.
• For example, the term order can mean a customer order, or the ranking of certain
choices. These two should be treated separately.

• “The boss ordered that the customer orders data analysis be presented in
chronological order.”
• This statement shows three different meanings for the word ‘order’.
• Thus, there will be a need for a manual review of the TDM.
5. Terms with very few occurrences in very few documents should be eliminated from the
matrix.
• This would help increase the density of the matrix and the quality of analysis.
6. The measures in each cell of the matrix could be one of several possibilities.
• It could be a simple count of the number of occurrences of each term in a document.
It could also be the log of that number.
• It could be the fraction computed by dividing the frequency count by the total
number of words in the document.
• Or there may be binary values in the matrix to represent whether a term is mentioned
or not. The choice of value in the cells will depend upon the purpose of the text
analysis.
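The clean-up steps and the cell measures listed above could look roughly like this in Python. The lookup table `NORMALIZE` is a hypothetical stand-in for real spelling correction, synonym merging, and stemming, and using log(1 + count) rather than log(count) is an assumption made here so that a count of zero stays defined.

```python
import math

# Hypothetical lookup table standing in for spelling correction,
# synonym merging, and stemming; a real project would use proper tools.
NORMALIZE = {
    "orders": "order",
    "ordering": "order",
    "ordered": "order",
    "purchase": "order",   # synonym folded into the same token
    "ordr": "order",       # common misspelling
}

def normalize(token):
    """Lowercase a token and map known variants to one common term."""
    token = token.lower()
    return NORMALIZE.get(token, token)

def cell_measures(count, doc_length):
    """Return the alternative values a TDM cell could hold for one term."""
    return {
        "count": count,
        # Assumption: log(1 + count) so that a zero count maps to 0.0.
        "log": math.log(1 + count),
        "fraction": count / doc_length if doc_length else 0.0,
        "binary": 1 if count > 0 else 0,
    }

tokens = [normalize(t) for t in "Ordering two orders by purchase ORDR".split()]
print(tokens)
print(cell_measures(tokens.count("order"), len(tokens)))
```

Note that a plain lookup table cannot tell homonyms apart: it would fold every sense of 'order' into one token, which is exactly why the manual review mentioned above remains necessary.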
Comparing Text Mining and Data Mining:

Mining the TDM
• The TDM can be mined to extract patterns/knowledge.
• Predictors of desirable terms could be discovered through predictive techniques, such
as regression analysis.
• Suppose the word profit is a desirable word in a document.
• The number of occurrences of the word profit in a document could be regressed
against many other terms in the TDM.
• The relative strengths of the coefficients of various predictor variables would show
the relative impact of those terms on creating a profit discussion.
• Predicting the chances of a document being liked is another form of analysis.
• For example, important speeches made by the CEO or the CFO to investors could be
evaluated for quality. If the classification of those documents (such as good or poor
speeches) was available, then the terms of TDM could be used to predict the speech
class.

• A simple decision tree with a few decision points could be constructed that
predicts the success of a speech correctly, say, 80 percent of the time.
• This tree could be trained with more data to become better over time.
• Clustering techniques can help categorize documents by common profile.
• For example, documents in which the words investment and profit occur more often could
be bundled together.
• Similarly, documents in which the words customer orders and marketing occur more
often could be bundled together.
• Thus, a few strongly demarcated bundles could capture the essence of the entire
TDM.
• These bundles could thus help with further processing, such as handing over select
documents to others for legal discovery.
• Association rule analysis could show relationships of coexistence.
• Thus, one could say that the words tasty and sweet occur together often (say, in 5
percent of the documents); and further, when these two words are present, the word
happy is also present in the document 70 percent of the time.
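The association rule in the last bullet is usually described with two numbers, support and confidence. The sketch below computes both for a binary view of a TDM in which each document is reduced to the set of terms it mentions. The five documents are invented, so the resulting figures differ from the 5 percent and 70 percent used in the text.

```python
# Hypothetical binary view of a TDM: the set of terms present in each document.
docs = [
    {"tasty", "sweet", "happy"},
    {"tasty", "sweet"},
    {"sweet", "happy"},
    {"tasty", "sweet", "happy"},
    {"sour"},
]

def support(itemset, docs):
    """Fraction of documents that contain every term in the itemset."""
    return sum(itemset <= doc for doc in docs) / len(docs)

def confidence(antecedent, consequent, docs):
    """Of the documents containing the antecedent, the share that also
    contain the consequent."""
    base = support(antecedent, docs)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, docs) / base

rule_support = support({"tasty", "sweet"}, docs)
rule_confidence = confidence({"tasty", "sweet"}, {"happy"}, docs)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Support tells how common the word pair is in the corpus; confidence tells how reliably the third word accompanies it.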
Text Mining Best Practices
1. Right question:
A good question is one which gives an answer and would lead to large payoffs for the
organization. The purpose and the key question will define how and at what levels of
granularity the TDM would be made. For example, TDM defined for simpler searches
would be different from those used for complex semantic analysis or network
analysis.
2. Be creative and open in proposing imaginative hypotheses for the solution:
Thinking outside the box is important, both in the quality of the proposed solution as
well as in finding the high quality data sets required to test the hypothesized solution.
For example, a TDM of consumer sentiment data should be combined with customer
order data in order to develop a comprehensive view of customer behavior. It’s
important to assemble a team that has a healthy mix of technical and business skills.
3. Pursue the problem iteratively:
Too much data can overwhelm the infrastructure and also befuddle the mind. It is
better to divide and conquer the problem with a simpler TDM, with fewer terms and
fewer documents and data sources. Expand as needed, in an iterative sequence of
steps. In the future, add new terms to help improve predictive accuracy.
4. Use a variety of data mining tools to test the relationships in the TDM:
Different decision tree algorithms could be run alongside cluster analysis and other
techniques. Triangulating the findings with multiple techniques, and many what-if
scenarios, helps build confidence in the solution. Test the solution in many ways
before committing to deploy it.

Web Mining
• Web mining is the art and science of discovering patterns and insights from the World
Wide Web so as to improve it.
• The web is used for electronic commerce, business communication, and many other
applications.
• Web mining analyzes data from the web and helps find insights that could optimize the
web content and improve the user experience.
• Data for web mining is collected via Web crawlers, web logs, and other means.
• Here are some characteristics of optimized websites:
1. Appearance: Aesthetic design. Well-formatted content, easy to scan and navigate.
Good color contrasts.
2. Content: Well planned information architecture with useful content. Fresh content.
Search-engine optimized. Links to other good sites.
3. Functionality: Accessible to all authorized users. Fast loading times. Usable forms.
Mobile enabled.
• The web could be analyzed for its structure as well as content.
• The usage pattern of web pages could also be analyzed.
• Depending upon objectives, web mining can be divided into three different types: Web
usage mining, Web content mining, and Web structure (URL link) mining.

Web content mining:

• A website is designed as a set of pages, each with a distinct URL (uniform resource
locator).
• A large website may contain thousands of pages.
• These pages and their content are managed using specialized software systems called
Content Management Systems.
• Every page can have text, graphics, audio, video, forms, applications, and more kinds
of content, including user-generated content.
• A website keeps a record of all requests received for its pages/URLs, including
requester information gathered using ‘cookies’.
• The log of these requests could be analyzed to gauge the popularity of those pages
among different segments of the population.
• The text and application content on the pages could be analyzed for its usage by visit
counts.
• The pages on a website themselves could be analyzed for the quality of content that attracts
the most users.
• Thus the unwanted or unpopular pages could be weeded out, or they could be transformed
with different content and style.
• Similarly, more resources could be assigned to keep the more popular pages fresh
and inviting.
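A request log of the kind described above can be summarized with simple counting. In this sketch the log records, the two visitor segments, and the popularity threshold are all made up for illustration; real server logs carry many more fields.

```python
from collections import Counter

# Hypothetical request log: (URL requested, visitor segment from a cookie).
request_log = [
    ("/home", "new"),
    ("/home", "returning"),
    ("/pricing", "new"),
    ("/home", "new"),
    ("/blog/old-post", "returning"),
    ("/pricing", "returning"),
    ("/home", "returning"),
]

# Popularity of each page overall, and per (page, segment) combination.
visits_per_page = Counter(url for url, _ in request_log)
visits_per_segment = Counter(request_log)

def unpopular_pages(visits, threshold):
    """Pages with fewer visits than the threshold: candidates to weed
    out or to rework with different content and style."""
    return sorted(url for url, n in visits.items() if n < threshold)

print(visits_per_page.most_common())
print(unpopular_pages(visits_per_page, threshold=2))
```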
Web structure mining:
• The Web works through a system of hyperlinks using the Hypertext Transfer Protocol (HTTP). Any
page can create a hyperlink to any other page, and it can be linked to by any other page.
• The intertwined or self-referential nature of the web lends itself to some unique network
analytical algorithms.
• The structure of Web pages could also be analyzed to examine the pattern of hyperlinks
among pages.
• There are two basic strategic models for successful websites: Hubs and Authorities.
1. Hubs: These are pages with a large number of interesting links. They serve as a hub,
or a gathering point, where people visit to access a variety of information. Media
sites like Yahoo.com, or government sites would serve that purpose. More focused
sites like Traveladvisor.com and yelp.com could aspire to becoming hubs for new
emerging areas.
2. Authorities: Ultimately, people would gravitate towards pages that provide the
most complete and authoritative information on a particular subject. This could be
factual information, news, advice, user reviews, etc. These websites would have the
largest number of inbound links from other websites. Thus Mayoclinic.com would
serve as an authoritative page for expert medical opinion. NYtimes.com would
serve as an authoritative page for daily news.
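Hub and authority scores of this kind are what the HITS link analysis algorithm computes. The following is a minimal sketch of its basic iteration on a tiny invented link graph; the page names are hypothetical, and a fixed number of iterations is assumed in place of a proper convergence check.

```python
import math

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "portal": ["clinic", "news"],
    "directory": ["clinic", "news", "blog"],
    "blog": ["clinic"],
}

def normalized(scores):
    """Scale scores so that their squares sum to one."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=50):
    """Return (hub, authority) scores after a fixed number of iterations."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = normalized({
            page: sum(hub[src] for src, targets in links.items() if page in targets)
            for page in pages
        })
        # A page's hub score is the sum of the authority scores it links to.
        hub = normalized({
            page: sum(auth[t] for t in links.get(page, ()))
            for page in pages
        })
    return hub, auth

hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this graph the page with the most outbound links to well-regarded pages emerges as the strongest hub, and the page with the most inbound links emerges as the strongest authority.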
Web usage mining
• As a user clicks anywhere on a webpage or application, the action is recorded by many
entities in many locations.
• The browser at the client machine will record the click, and the web server providing
the content would also make a record of the pages served and the user activity on those
pages.
• The entities between the client and the server, such as the router, proxy server, or ad
server, too would record that click.
• The goal of web usage mining is to extract useful information and patterns from data
generated through Web page visits and transactions.
• The activity data comes from data stored in server access logs, referrer logs, agent logs,
and client-side cookies.
• The user characteristics and usage profiles are also gathered directly, or indirectly,
through syndicated data.
• Further, metadata, such as page attributes, content attributes, and usage data are also
gathered.
• The web content could be analyzed at multiple levels (Figure 12.2).
1. The server side analysis would show the relative popularity of the web pages accessed.
Those websites could be hubs and authorities.
2. The client side analysis could focus on the usage pattern or the actual content consumed
and created by users.
   1. Usage patterns could be analyzed using ‘clickstream’ analysis, i.e. analyzing web
   activity for patterns in the sequence of clicks, and the location and duration of visits on websites.
   Clickstream analysis can be useful for web activity analysis, software testing, market
   research, and analyzing employee productivity.
   2. Textual information accessed on the pages retrieved by users could be analyzed
   using text mining techniques.
   The text would be gathered and structured using the bag-of-words technique to build
   a term-document matrix.
   This matrix could then be mined using cluster analysis and association rules for
   patterns such as popular topics, user segmentation, and sentiment analysis.
• Web usage mining has many business applications.
• It can help predict user behavior based on previously learned rules and users' profiles,
and can help determine lifetime value of clients.

• It can also help design cross-marketing strategies across products, by observing
association rules among the pages on the website.
• Web usage mining can help evaluate promotional campaigns and see if the users were
attracted to the website and used the pages relevant to the campaign.
• Web usage mining could be used to present dynamic information to users based on
their interests and profiles.
• This includes online ads and coupons targeted at user groups based on user access
patterns.
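Clickstream analysis, as described above, looks for patterns in the sequence of clicks. One simple way to sketch it is to count page-to-page transitions across sessions and use them to guess the most likely next page. The session data and the helper `most_likely_next` are hypothetical.

```python
from collections import Counter

# Hypothetical clickstreams: the ordered pages visited in each session.
sessions = {
    "u1": ["/home", "/products", "/cart", "/checkout"],
    "u2": ["/home", "/products", "/reviews"],
    "u3": ["/home", "/offers", "/products", "/cart"],
}

# Count how often one page is followed directly by another.
transitions = Counter()
for pages in sessions.values():
    transitions.update(zip(pages, pages[1:]))

def most_likely_next(page):
    """Return the page that most often follows the given page, or None
    if the page was never followed by another page."""
    candidates = {dst: n for (src, dst), n in transitions.items() if src == page}
    return max(candidates, key=candidates.get) if candidates else None

print(transitions.most_common(3))
print(most_likely_next("/products"))
```

The same transition counts could feed the cross-marketing and dynamic-content uses listed above, for example by showing an offer on the page a user is most likely to open next.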

Web Mining Algorithms:


• Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates web
pages as being hubs or authorities.
• Many other HITS-based algorithms have also been published.
• The most famous link analysis algorithm is the PageRank algorithm.
• Invented by Google's founders, this algorithm is used by Google to organize the results of its
search function.
• This algorithm helps determine the relative importance of any particular web page by
counting the number and quality of links to a page.
• The websites with a larger number of inbound links, and/or more links from higher-quality
websites, will be ranked higher.
• It works in a similar way to determining the status of a person in a society.
• Those with relations to more people and/or relations to people of higher status will be
accorded a higher status.
• PageRank is the algorithm that helps determine the order of pages listed in response to a
Google Search query.
• The original PageRank algorithm formulation has been updated in many ways, and the
latest algorithm is kept secret so that other websites cannot take advantage of the
algorithm and manipulate their website according to it.

• However, there are many standard elements that remain unchanged.
• These elements lead to the principles for a good website.
• The process of applying these principles to improve a website's ranking is called
Search Engine Optimization (SEO).
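The idea of ranking a page by the number and quality of its inbound links can be illustrated with the simplified textbook form of PageRank. This sketch is not Google's current algorithm, which the notes say is kept secret. The four-page graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example.

```python
# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank over links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page c ranks highest because three pages link to it, and page a outranks b and d because its single inbound link comes from the highly ranked c, mirroring the "relations to people of higher status" analogy above.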
